What’s a Psychometrician?
When I first learned that we had a psychometrician on our team, I pictured someone who spent their day at a workbench full of phrenology skulls, calipers, and bubbling beakers—maybe even with an adorable Muppet assistant named Beaker.
Meep! I was intrigued.
I caught up with Barbara Rowan, SweetRush’s resident psychometrician, to learn what a day in her life is really like. (Spoiler: It’s absolutely riveting—and caliper-free!)
TV: Help me out—what does a psychometrician actually do? Do you have an elevator speech you give at parties?
BR: Sure! A psychometrician is someone who’s an expert in assessment and measurement. We write tests, but we do so much more! We can also look at existing tests to make sure they’re reliable and valid, and that each item measures what we say it’s measuring.
It’s all about analyzing data that help us pinpoint exactly how well a test is functioning. For example, if your learners say your test questions are too hard, I can find which, if any, are too hard. Ideally, a test should have a balance of hard, easy, and average items. I can also tell you if your learners aren’t studying or, on the other hand, if your items are too easy.
TV: I think I recognize the terms reliable and valid from my Statistics 101 days—do I have that right?
BR: Reliability and validity are some of the first concepts we learn in statistics, but psychometrics takes them to a whole new depth.
Let’s start with validity. Put simply, validity means that your assessment measures what it’s supposed to measure and you’re not accidentally measuring other skills and knowledge instead.
TV: Can you share a bit about why that matters?
BR: In short, moral and legal accountability. Assessments decide where—and whether—we’re admitted to schools or offered a job.
For example, a class-action lawsuit was brought against the Educational Testing Service (ETS) by test takers who had received incorrect scores on an assessment some states used in teacher licensing decisions. ETS ended up paying $11.1 million to the plaintiffs.
There have also been cases against high schools by learners who failed the exit exams required to graduate. These learners claimed the test was not reliable or valid (we’ll talk more about reliability shortly!), which isn’t acceptable for such a high-stakes test.
Several cases were brought against other high schools by students who are differently abled or non-native speakers of English. They felt that the schools’ assessments were biased against them.
As you can see, test questions have a real impact on people’s lives and futures. It’s important for organizations to know that their assessments are measuring what they purport to measure. Those that build their assessments responsibly have performed all the right psychometric tests and documented the results in a technical manual.
TV: This is so timely, with all of the discussion and reconsideration around standardized testing. I want to get to how to do things right—but I’m also morbidly curious about what it means for an assessment to measure something other than what it claims to measure. Can we delve into the dark side just for a moment?
BR: Imagine that you are taking a literature test on the computer. You don’t do very well.
So, why was your score lower than you expected? Perhaps it’s difficult for you to read a computer screen. Perhaps you have to scroll down the page to completely read the passage, so you can’t see the entire passage while answering the questions. Perhaps you don’t feel comfortable with using technology. Maybe English isn’t your first language. So, this literature test is not accurately measuring your ability to read a passage and answer questions. Instead, the test is highlighting the difficulties you have taking tests on computers or in the English language.
Another example of this is when we are trying to measure one construct, but inadvertently measure another one. Imagine a math test with story problems. Not only are we measuring one’s math skills, but we could also inadvertently be measuring one’s reading skills. A poor test score could mean that the student doesn’t know how to perform the math calculations necessary OR it could mean that the reading level is too high for this particular learner.
So organizations that use tests to make any decision, especially high-stakes decisions, have a moral and legal obligation to ensure that their tests are fair and equitable for all test takers. At a minimum, organizations must perform the Big Three of psychometrics.
TV: I have a feeling you’re not referring to the auto industry when you say “the Big Three.” What does the Big Three mean to a psychometrician?
BR: The Big Three are the top—you guessed it!—three indicators of whether an assessment is performing the way it should. They need to be measured with every assessment, every time. The good news is, they are easy to calculate with the right software.
The Big Three are:
- Reliability
- Item difficulty
- Item discrimination
Reliability means that learners get essentially the same score if they take the assessment more than once. Reliability also measures an assessment’s internal consistency, or how well each single item relates to a learner’s total score. There are several measures of reliability used in psychometrics, but Cronbach’s Alpha is the most widely used. Cronbach’s Alpha is a test-level statistic, but I also care about every individual item on the test.
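For readers who want to see the arithmetic, here is a minimal sketch of Cronbach’s Alpha in Python. It is an illustration only, not Barbara’s own tooling: the function name `cronbach_alpha` and the response data are made up for this example, and items are assumed to be scored 1 (correct) or 0 (incorrect).

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a learners-by-items score matrix.

    `responses` is a list of rows, one per learner; each row holds that
    learner's score on every item (here 1 = correct, 0 = incorrect).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two learners")

    # Variance of each item across learners (one column at a time).
    item_variances = [variance(column) for column in zip(*responses)]
    # Variance of the learners' total scores.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up responses for illustration: 5 learners x 4 items.
sample_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(sample_responses)
print(round(alpha, 2))  # 0.8
```

Values closer to 1 indicate that the items hang together more consistently. What counts as "high enough" depends on the stakes of the test, so no cut-off is hard-coded here.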
Item difficulty is just as it sounds. This calculation gives us an indication of how difficult or how easy a question is. This is a statistic that we calculate for each question. We ultimately want the majority of our questions falling in the moderate level of difficulty.
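As a rough sketch of that calculation, item difficulty is often expressed as the proportion of learners who answer an item correctly, so a higher value means an easier item. The function names, the sample data, and the `too_hard` and `too_easy` cut-offs below are assumptions made for this illustration, not values from the interview.

```python
def item_difficulty(responses):
    """Proportion of learners who answered each item correctly."""
    n_learners = len(responses)
    if n_learners == 0:
        raise ValueError("no responses supplied")
    return [sum(column) / n_learners for column in zip(*responses)]


def flag_items(difficulties, too_hard=0.3, too_easy=0.9):
    """Label items outside a moderate band (cut-offs are hypothetical)."""
    flags = {}
    for number, p in enumerate(difficulties, start=1):
        if p < too_hard:
            flags[number] = "too hard"
        elif p > too_easy:
            flags[number] = "too easy"
    return flags


# Made-up responses: 5 learners x 4 items, 1 = correct, 0 = incorrect.
sample_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
difficulties = item_difficulty(sample_responses)
flags = flag_items(difficulties)
print(difficulties)  # [0.8, 0.6, 0.4, 0.2]
print(flags)         # {4: 'too hard'}
```

With these made-up numbers, only the fourth item falls outside the moderate band; the others would be kept as a mix of easier and harder questions.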
Item discrimination indicates how well a question discriminates between learners who understand the content and learners who don’t. Ideally, we want questions that highly discriminate between those who do well on the test and those who do not. We definitely don’t want a question that low scorers are getting correct and high scorers are getting incorrect. That is a question that does not discriminate well.
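One common way to put a number on this is a corrected item-total correlation: correlate each item with the learner’s total on the remaining items. The sketch below assumes that approach, which the interview does not name, and uses made-up data in which the fourth item is deliberately "backwards" (low scorers get it right, high scorers get it wrong). All function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either series has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        return None
    return cov / sqrt(ss_x * ss_y)


def item_discrimination(responses):
    """Correlate each item with the total score on all other items."""
    results = []
    for index, column in enumerate(zip(*responses)):
        # Exclude the item itself so it cannot inflate its own correlation.
        rest_scores = [sum(row) - row[index] for row in responses]
        results.append(pearson(column, rest_scores))
    return results


# Made-up responses: 6 learners x 4 items, 1 = correct, 0 = incorrect.
# Item 4 is answered correctly only by the lowest scorers.
sample_responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
discrimination = item_discrimination(sample_responses)
for number, value in enumerate(discrimination, start=1):
    print(number, round(value, 2))
```

In this example the fourth item comes out with a negative value, which is the warning sign described above: it would be a candidate for review or removal.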
In addition to the Big Three, we need to conduct validity studies.
There are so many types of validity! And most validity studies take several months or more to conduct. However, one of the quickest and easiest types of validity to establish is content validity. To establish content validity, I work with subject matter experts (SMEs) to review an assessment before administering it to the learner. Through this process, the SMEs review the questions to ensure that the content is correct and that all of the questions measure the construct, or subject, that we intended. To calculate other types of validity, learner sample size is critical. Ideally, we’d include between 300 and 500 learners, but we can work with a minimum of 200. Larger numbers of learners reduce error and give us more faith in the results.
TV: Wow, that kind of deep study makes a lot of sense—especially for high-stakes assessments that affect people’s lives and futures. Is that kind of assessment situation the best case for a psychometrician?
BR: Anytime an assessment is being written and anytime you need to vet an assessment you’ve already developed, you’ve got a case for a psychometrician.
Do you think your questions are too hard? Too easy? Get hold of the data, and I can tell you.
As you’re building a course and deciding what your content needs to be, please bring in a psychometrician.
I need to partner with instructional designers (IDs) from the beginning, as they’re considering the learning objectives (LOs) for a solution. We need to ensure that their LOs can be measured—if they can’t, then our assessment results are meaningless.
For example, some LOs can’t be measured by the kinds of autograded assessments we see in many eLearning modules; they can only be measured by having learners create, write, or build something. If live assessment graders aren’t part of the project scope, we need to rethink the assessment and the LOs.
Once we land on measurable LOs, the ID creates the learning journey and the content. I come back in when the assessment items need to be written.
I think of my relationship with my ID friends as a system of checks and balances. I can’t do what they do, and they can’t do what I do—but we make one heck of a partnership!
TV: As a former ID, I appreciate that! And I hear you about the importance of measurable LOs. Can you share more about the risks of not involving a psychometrician in a learning solution design?
BR: A big part of these risks goes back to moral and legal accountability. Obviously, we want to build a sound assessment tool because it’s the right thing to do. But we also need to be sure we are protected in case a learner questions the results.
Bringing in a psychometrician early in the development process can get you answers to these key questions:
- How do you know your assessment is measuring what you say it is? If you’re using it to make decisions, you need to know that it’s performing well.
- How do you know that the decisions you make using your assessment data are the right decisions? You want to do your best work, and you want a testing instrument that has been properly vetted.
- How sure are you that your assessment is free of bias? You want a fair playing field for everyone taking your test—and you want to be able to show the work you’ve done to provide an equal opportunity for everyone.
These questions aren’t a one-and-done, either: You should be reviewing your assessment every few years. A psychometrician can put your assessment questions to the test—and help you respond in case your assessment is questioned.
Suppose I’m applying for a job, and an organization’s HR department administers a test. I feel the questions are biased, and I say that one group of people is probably performing much better than everyone else. If the organization hasn’t done its homework and studied the Big Three, I could very well be right. And if bias or a lack of reliability is discovered after the fact (or worse, if it was discovered but not addressed), the organization is liable.
Even something as simple as test format can have an impact! My dissertation pitted paper and computer versions of the same STEM literacy exam against each other. I wanted to see if either format conveyed an advantage. Controlling for gender, age, ethnicity, and race, I found that the mean scores weren’t significantly different. (For those who speak stats: The t-test showed no significant difference between mean scores on the two delivery methods.) Even though the t-test was not significant, the two versions of the test were found to be tau-equivalent. This means that the two versions of the test were measuring the same construct, but on a different scale. To use these two test forms interchangeably, the scores would have to be rescaled to the same scale. Most people wouldn’t even think about the fact that the paper and computer versions could measure on different scales. I mean, each and every question is exactly the same across both versions.
The lesson? Even two versions of the identical test don’t necessarily perform the same way or on the same scale when delivery methods differ.
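For those who speak stats and want to see what a comparison of mean scores looks like, here is a minimal sketch. The interview does not say which t-test the dissertation used, so this assumes an independent-samples Welch’s t-test; the function name and both sets of scores are made up for illustration and are not the dissertation’s data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical scores on the two delivery formats.
paper_scores = [70, 75, 80, 85, 90]
computer_scores = [71, 76, 81, 86, 91]

t_stat, dof = welch_t(paper_scores, computer_scores)
print(round(t_stat, 2), round(dof, 1))  # -0.2 8.0
```

With 8 degrees of freedom, the two-sided critical value at the 0.05 level is roughly 2.31, so a t statistic of -0.2 would not be significant. As the example above points out, that only speaks to the mean scores; it does not show that the two forms measure on the same scale.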
TV: Wow. I am thoroughly cured of the illusion that I can write a solid assessment.
Let’s close on a lighter note. Can you tell me about an assessment you’re really proud of?
BR: Absolutely! I was working with a client-partner at a global organization focused on improving community health. They were struggling with evaluating their new hires. These new hires were the people who went out to conduct workshops in local communities. But not all of the new hires who passed the evaluation actually did well in those communities. People who lacked the skills to do the job effectively were passing.
This was a case for a psychometrician! I partnered with the organization to standardize the new hire assessment and ensure that everyone who passed was actually ready to go forth and serve the communities.
I took a look at their old assessment, which consisted of a list of checkboxes. There was a lot of room for individual interpretation on these! I worked with the evaluation team to develop new rubrics using definable, observable criteria in three key areas of evaluation.
When they took the first rubric into the field for testing, the team found that people actually failed. And in this case, that was a good thing! It meant that the rubric finally had a high level of discrimination—in other words, the people who didn’t have the skills to do the work didn’t pass. The evaluation team could follow up with people who didn’t pass to offer additional training—or initiate job fit conversations.
TV: That’s a great example! I love that it serves a cause—and helps good people do better work. Thanks so much for sharing more about what you do.
Join us for Part II of Ask a Psychometrician, where Barbara will show us how to write great test questions—and make our assessments better, smarter, and fairer.
Got a Case for a Psychometrician? Here’s How to Tell
Not sure if you’ve got a case for a psychometrician? You’re not alone!
Rodrigo Salazar-Kawer, our Director of Talent Solutions, likens Barbara’s value-add to the invention of the automobile. Before the Model T, people looking for speed were in the market for faster horses. They couldn’t even conceive of something as fast as a car.
That’s the kind of power a psychometrician brings to your assessment!
And if any of the following challenges sound familiar, you just might have a case for involving one:
Like any test drive, there’s no obligation, just an opportunity for Barbara to ask a lot of questions. It’s a needs assessment...for your assessment.
There’s a range of models to choose from, too! Some client-partners are ready to bring in Barbara for their entire project from the outset, while others may prefer to work in phases. Phased work is a great option for clients who need to demonstrate results or secure budget incrementally.
A Tale of Two Phases
A leading technology company believed its assessment questions were too easy and asked Barbara to review them. This client-partner noticed that too many learners were passing its exam. That was an immediate red flag!
Barbara’s Phase I project was to examine the learning outcomes and content of the course. She discovered that most of the questions were well mapped to the learning outcomes—but the learning outcomes were too low-level. To make the questions more complex, her client-partner would need higher-level learning outcomes.
That meant a lot of changes ahead. For Phase II, Barbara created new assessment questions based on the client-partner’s learning outcomes and ensured that these outcomes were measurable by the autograded assessments they needed to use.
Backtracking is never fun! That’s why it’s best to bring a psychometrician in as you develop your learning outcomes and content blueprint. They’ll tell you what’s possible and how it can be measured fairly and accurately.
Want to chat about your assessment challenge? Get in touch.