Can Weighting Reduce the Bias of Norm Scores in Non-Representative Samples?
Friday, August 11, 2023
The following is a summary of the paper “Reducing the Bias of Norm Scores in Non-Representative Samples: Weighting as an Adjunct to Continuous Norming Methods” co-authored by WPS’s Chief Operating Officer David Herzberg. To read the original article, please visit this website.
The debate around what’s “normal” is never-ending, and it’s often harmful to marginalized groups and persons. But in academic studies, and especially in education, the practice of age and grade norming is essential to help students and parents measure ability and progress in testing. Norming typically involves collecting data from a large and diverse sample of individuals who are similar to the population for which the test is intended, and then using statistical methods to establish “norms” based on this sample. Like the discussion around the term “normal,” norming can also be subject to bias when the testing sample isn’t diverse enough to represent that population.
Standardized tests, from IQ tests to the SAT, are fixed by nature. To produce meaningful results from them, researchers, assessors, and graders use norming in K-12 schools, colleges, and even workplaces. If you have ever taken an entrance exam or personality test, your results have most likely been compared to a normative sample. For example, standardized tests like the SAT or ACT use grade norming to compare the performance of test-takers to other students in the same grade level. Similarly, classroom assessments and assignments may be graded using age- or grade-normed rubrics or criteria that are specific to the expectations for a particular grade level.
In a 2023 article titled “Reducing the Bias of Norm Scores in Non-Representative Samples: Weighting as an Adjunct to Continuous Norming Methods,” published in the journal Assessment (ASM), WPS Chief Operating Officer David Herzberg joined authors Sebastian Gary, Alexandra Lenhard, and Wolfgang Lenhard to explore statistically reliable ways to reduce bias in norming. The authors used a method called raking, which is commonly used in the social sciences but rarely applied to the results of psychometric tests.
Raking applies “weights,” or multipliers, to the individuals in a sample group, so that each person’s test score counts for more or less depending on how well that person’s demographic group is represented. This process adjusts the distribution of scores in the sample so that it approximates the distribution of scores from a demographically representative sample. What demographic variables do researchers consider in this process? The demographics include gender, race, age, education level, and geographic region.
Take the example of gender – in the general U.S. population, males and females are represented in approximately equal proportions. If a test sample had 40 boys and 60 girls, and test performance differed between the two genders, the distribution of scores could be skewed. With raking, sample differences in gender are adjusted to approximate the general population. (Interestingly enough, the statistical term ‘raking’ was inspired by the tool used to smooth over the surface of a garden.)
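The gender example above can be worked through numerically. The following is a minimal Python sketch, not the authors’ implementation: the group mean scores (48 and 52) are invented purely for illustration, and the 50/50 population split is the approximation stated in the text. With only one demographic variable, the weighting reduces to dividing each group’s population share by its sample share.

```python
# Sample composition from the example: 40 boys and 60 girls.
sample_counts = {"boys": 40, "girls": 60}
# Approximate population shares stated in the text (roughly equal).
population_share = {"boys": 0.5, "girls": 0.5}
# Hypothetical mean raw scores per group, invented for illustration only.
group_means = {"boys": 48.0, "girls": 52.0}

n = sum(sample_counts.values())

# Weight for each person in a group = population share / sample share.
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

# Unweighted mean: every test-taker counts once.
unweighted_mean = sum(sample_counts[g] * group_means[g] for g in sample_counts) / n

# Weighted mean: each test-taker counts in proportion to their group's weight.
weighted_total = sum(
    sample_counts[g] * weights[g] * group_means[g] for g in sample_counts
)
weight_sum = sum(sample_counts[g] * weights[g] for g in sample_counts)
weighted_mean = weighted_total / weight_sum

print(weights)
print(unweighted_mean, weighted_mean)
```

Under these assumed numbers, each boy receives a weight of 1.25 and each girl a weight of about 0.83, which moves the sample mean from 50.4 to 50.0, the value an evenly split sample would have produced.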
In this paper, raking is used in combination with continuous norming methods to improve the accuracy of normed test scores derived from samples that are not demographically representative. The authors created a simulation using sample data, then used raking to adjust for education, ethnicity, and region. The authors found they could successfully reduce bias in norm scores and improve their accuracy by using a combination of methods.
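When several demographic variables are adjusted at once, raking is usually carried out as iterative proportional fitting: the weights are rescaled to match one variable’s population totals, then the next, and the cycle repeats until all of them match. The sketch below illustrates this for two variables in plain Python. It is an assumption-laden illustration rather than the procedure used in the paper: the function name `rake`, the 2x2 table of counts (education by region), and the population targets are all hypothetical, and the code assumes no cell in the sample is empty.

```python
def rake(counts, row_targets, col_targets, max_iter=200, tol=1e-10):
    """Iterative proportional fitting on a two-way table of sample counts.

    Returns one weight per cell (applied to every person in that cell) so
    that the weighted row and column totals match the population targets.
    Assumes every cell count is positive.
    """
    fitted = [[float(c) for c in row] for row in counts]
    for _ in range(max_iter):
        # Step 1: rescale each row to its target total.
        for i, row in enumerate(fitted):
            row_sum = sum(row)
            fitted[i] = [c * row_targets[i] / row_sum for c in row]
        # Step 2: rescale each column to its target total.
        for j in range(len(col_targets)):
            col_sum = sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= col_targets[j] / col_sum
        # Stop once the rows still match after the column step.
        if all(abs(sum(row) - t) < tol for row, t in zip(fitted, row_targets)):
            break
    return [[f / c for f, c in zip(f_row, c_row)]
            for f_row, c_row in zip(fitted, counts)]


# Hypothetical sample of 100 people: rows = education (no degree, degree),
# columns = region (north, south).
counts = [[10, 20],
          [30, 40]]
# Hypothetical population totals, scaled to the sample size of 100.
education_targets = [50, 50]
region_targets = [50, 50]

weights = rake(counts, education_targets, region_targets)
weighted = [[w * c for w, c in zip(w_row, c_row)]
            for w_row, c_row in zip(weights, counts)]

print([sum(row) for row in weighted])                         # education totals
print([sum(row[j] for row in weighted) for j in range(2)])    # region totals
```

Note that this procedure matches only the marginal totals of each variable. The association between the variables within the table is carried over from the sample, so if the sample’s joint distribution differs from the population’s, that difference remains after weighting.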
Although this paper had very promising findings, there is no substitute for real diversity in a test sample. The authors noted that further research is needed to fully evaluate the effectiveness of raking and other weighting techniques in the norming of psychometric tests.
Raking can be a useful tool for adjusting for a lack of representativeness in normative samples, but it may also introduce error into the raw-to-norm-score relationships if not used carefully. The authors also noted that it is important to consider joint distributions of demographic variables in developing norms, as these variables may interact with one another in their effects on test scores. Future research should explore alternative methods for developing norms that consider complex interactions among demographic variables.
Research and Resources:
Gary, S., Lenhard, A., Lenhard, W., & Herzberg, D. S. (2023). Reducing the Bias of Norm Scores in Non-Representative Samples: Weighting as an Adjunct to Continuous Norming Methods. Assessment, 0(0). https://doi.org/10.1177/10731911231153832