Expensive assessment

One metric for evaluating automated scoring is to compare it against human scoring. For some domains and test formats (e.g., multiple-choice items on factual knowledge), automation has an accepted advantage in objectivity and reliability, although whether such questions assess meaningful understanding is often debated. For more open-ended domains and designs, human reading is typically considered superior, leaving room for individual nuance to shine through and be recognized.

Yet this exposé of some professional scorers’ experience reveals how even that cherished human judgment can be distorted and devalued. Here, narrow rubrics, mandated consistency, and expectations of bell curves rewarded sameness over subtlety and efficiency over reflection. In essence, scorers following such simplistic algorithms ended up reverse-engineering cookie-cutter essays, each forced to fit one of six categories, differing details be damned.

Individual algorithms and procedures for assessing tests need to be improved so that they can draw on a broader base of information. So does a system that relies so heavily on particular assessments that their weaknesses are greatly magnified. Teachers and schools collect a wealth of assessment data all the time; better mechanisms for aggregating and analyzing these data could extract more informational value from them and reduce the disproportionate weight placed on testing factories. When designed well, algorithms and automated tools for assessment can enhance human judgment rather than reduce it to an arbitrary bin-sorting exercise.

What should we assess?

Some thoughts on what tests should measure, from Justin Minkel:

Harvard education scholar Tony Wagner was quoted in a recent op-ed piece by Thomas Friedman on what we should be measuring instead: “Because knowledge is available on every Internet-connected device, what you know matters far less than what you can do with what you know. The capacity to innovate—the ability to solve problems creatively or bring new possibilities to life—and skills like critical thinking, communication and collaboration are far more important than academic knowledge.”

Can we measure these things that matter? I think we can. It’s harder to measure critical thinking and innovation than it is to measure basic skills. Harder but not impossible.

His suggestions:

For starters, we need to make sure that tests students take meets [sic] three basic criteria:

1. They must measure individual student growth.

2. Questions must be differentiated, so the test captures what students below and above grade-level know and still need to learn.

3. The tests must measures [sic] what matters: critical thinking, ingenuity, collaboration, and real-world problem-solving.

Measuring individual growth and providing differentiated questions are obvious design goals for personalized assessment. The third criterion remains a challenge for assessment design across the board.