MOOC measurement problems reveal systemic evaluation challenges

Sadly, it’s not particularly surprising that it took a proclamation by researchers from prominent institutions (Harvard and MIT) to get the media’s attention to what should have been obvious all along. That even these researchers don’t have alternative metrics handy highlights the difficulties of assessment in the absence of high-quality data both inside and outside the system. Inside the system, designers of online courses are still figuring out how to assess knowledge and learning quickly and effectively. Outside the system, would-be analysts lack information on how students (graduates and drop-outs alike) make use of what they learned, or not. Measuring long-term retention and far transfer will continue to pose a problem for evaluating educational experiences as they become more modularized and unbundled, unless systems emerge for integrating outcome data across experiences and over time. In economic terms, this exemplifies the need to internalize the externalities to the system.


Not all uses of data are equal

Gil Press worries that “big data enthusiasts may encourage (probably unintentionally) a new misguided belief, that ‘putting data in front of the teacher’ is in and of itself a solution [to what ails education today].”

As an advocate for the better use of educational data and learning analytics to serve teachers, I worry about careless endorsements and applications of “big data” that overlook these concerns:

1. Available data are not always the most important data.
2. Data should motivate providing support, not merely accountability.
3. Teachers are neither scientists nor laypeople in their use of data. They rely on data constantly, but need representations that they can interpret and turn into action readily.

Assessment specialists have long noted the many uses of assessment data; all educational data should be weighed as carefully, especially when implemented at a large scale, which magnifies the influence of errors.

Expensive assessment

One metric for evaluating automated scoring is to compare it against human scoring. For some domains and test formats (e.g., multiple-choice items on factual knowledge), automation has an accepted advantage in objectivity and reliability, although whether such questions assess meaningful understanding is often debated. With more open-ended domains and designs, human reading is typically considered superior, allowing room for individual nuance to shine through and get recognized.

Yet this exposé of some professional scorers’ experience reveals how even that cherished human judgment can get distorted and devalued. Here, narrow rubrics, mandated consistency, and expectations of bell curves valued sameness over subtlety and efficiency over reflection. In essence, such simplistic algorithms resulted in reverse-engineering cookie-cutter essays that all had to fit one of their six categories, differing details be damned.

Individual algorithms and procedures for assessing tests need to be improved so that they can make better use of a broader base of information. So does a system which relies so heavily on particular assessments that the impact of their weaknesses can get magnified so greatly. Teachers and schools collect a wealth of assessment data all the time; better mechanisms for aggregating and analyzing these data can extract more informational value from them and decrease the disproportionate weight on testing factories. When designed well, algorithms and automated tools for assessment can enhance human judgment rather than reducing it to an arbitrary bin-sorting exercise.

Standardized tests as market distortions

Some historical context on how standardized tests have affected the elite points out how gatekeepers can magnify the influence of certain factors over others, whether through chance or through bias:

In 1947, the three significant testing organizations, the College Entrance Examination Board, the Carnegie Foundation for the Advancement of Teaching and the American Council on Education, merged their testing divisions into the Educational Testing Service, which was headed by former Harvard Dean Henry Chauncey.

Chauncey was greatly affected by a 1948 Scientific Monthly article, “The Measurement of Mental Systems (Can Intelligence Be Measured?)” by W. Allison Davis and Robert J. Havighurst, which called intelligence tests nothing more than a scientific way to give preference to children from middle- and upper-middle-class families. The article challenged Chauncey’s belief that by expanding standardized tests of mental ability and knowledge America’s colleges would become the vanguard of a new meritocracy of intellect, ability and ambition, and not finishing schools for the privileged.

The authors, and others, challenged that the tests were biased. Challenges aside, the proponents of widespread standardized testing were instrumental in the process of who crossed the American economic divide, as college graduates became the country’s economic winners in the postwar era.

As Nicholas Lemann wrote in his book “The Big Test,” “The machinery that (Harvard President James) Conant and Chauncey and their allies created is today so familiar and all-encompassing that it seems almost like a natural phenomenon, or at least an organism that evolved spontaneously in response to conditions. … It’s not.”

As a New Mexico elementary teacher and blogger explains:

My point is that test scores have a lot of IMPACT because of the graduation requirements, even if they don’t always have a lot of VALUE as a measure of growth.

Instead of grade inflation, we have testing-influence inflation, where the impact of certain tests is magnified beyond that of other assessment metrics. It becomes a kind of market distortion in the economics of test scores, where some measurements are more visible and assume more value than others, inviting cheating and “gaming the system”.

We can restore openness and transparency to the system by collecting continuous assessment data that assign more equal weight across a wider range of testing experiences, removing incentives to cheat or “teach to the test”. Adaptive and personalized assessment go further in alleviating pressures to cheat, by reducing the inflated number of competitors against whom one may be compared. Assessment can then return to fulfilling its intended role of providing useful information on what a student has learned, thereby yielding better measures of growth and becoming more honestly meritocratic.

Beating cheating

Between cheating to learn and learning to cheat, current discourse on academic dishonesty upends the “if you can’t beat ’em, join ’em” approach.

From Peter Nonacs, UCLA professor teaching Behavioral Ecology:

Tests are really just measures of how the Education Game is proceeding. Professors test to measure their success at teaching, and students take tests in order to get a good grade.  Might these goals be maximized simultaneously? What if I let the students write their own rules for the test-taking game?  Allow them to do everything we would normally call cheating?

And in a new MOOC titled “Understanding Cheating in Online Courses,” taught by Bernard Bull at Concordia University Wisconsin:

The start of the course will cover the basic vocabulary and different types of cheating. The course will then move into discussing the differences between online and face-to-face learning, and the philosophy and psychology behind academic integrity. One unit will examine the best practices to minimize cheating.

Cheating crops up whenever there is a mismatch between effort and reward, something which happens often in our current educational system. Assigning unequal rewards to equal efforts biases attention toward the inflated reward, motivating cheating. Assigning equal rewards to unequal efforts favors the lesser effort, enabling cheating. The greater the disparities, the greater the likelihood of cheating.

Thus, one potential avenue for reducing cheating would be to better align the reward to the effort, to link the evaluation of outputs more closely to the actual inputs. High-stakes tests separate them by exaggerating the influence of a single, limited snapshot. In contrast, continuous, passive assessment brings them closer by examining a much broader range of work over time, collected in authentic learning contexts rather than artificial testing situations. Education then becomes a series of honest learning experiences, rather than an arbitrary system to game.

In an era where students learn what gets assessed, the answer may be to assess everything.

Unpacking degrees

Chris Dillow questions the purpose and value of a university degree (linked from Observational Epidemiology):

What is university for? I ask this old question because the utilitarian answer which was especially popular in the New Labour years – that the economy needs more graduates – might be becoming less plausible. A new paper by Paul Beaudry and colleagues says (pdf) there has been a “great reversal” in the demand for high cognitive skills in the US since around 2000, and the BLS forecasts that the fastest-growing occupations between now and 2020 will be mostly traditionally non-graduate ones, such as care assistants, fast food workers and truck drivers; Allister Heath thinks a similar thing might be true for the UK.

Nevertheless, we should ask: what function would universities serve in an economy where demand for higher cognitive skills is declining? There are many possibilities:

– A signaling device. A degree tells prospective employers that its holder is intelligent, hard-working and moderately conventional – all attractive qualities.

– Network effects. University teaches you to associate with the sort of people who might have good jobs in future, and might give you the contacts to get such jobs later.

– A lottery ticket. A degree doesn’t guarantee getting a good job. But without one, you have no chance.

– Flexibility. A graduate can stack shelves, and might be more attractive as a shelf-stacker than a non-graduate. Beaudry and colleagues describe how the falling demand for graduates has caused graduates to displace non-graduates in less skilled jobs.

– Maturation & hidden unemployment. 21-year-olds are more employable than 18-year-olds, simply because they are three years less foolish. In this sense, university lets people pass time without showing up in the unemployment data.

– Consumption benefits. University is a less unpleasant way of spending three years than work. And it can provide a stock of consumption capital which improves the quality of our future leisure. By far the most important thing I learnt at Oxford was a love of Hank Williams and Leonard Cohen.

As the signaling function of the degree falls, we should consider how the signaling power of certificates, competencies, and other innovations may rise to overtake it. With specific knowledge and skills unbundled from each other, these markers may be more responsive to actual demand. More specific assessment metrics can help stakeholders better evaluate different programs of study, while more flexible learning paths can help students more efficiently pursue the knowledge and skills that will be most valuable to them.

Using personalized assessment to change the high-stakes testing culture

Criticisms of high-stakes tests abound as we usher in the start of K-12 testing season. Students worry about being judged on a bad day and note that tests measure only one kind of success, while teachers lament the narrowing of the curriculum. Others object to the lack of transparency in a system entrusted with such great influence.

Yet the problem isn’t tests themselves, but relying on only a few tests. What we actually need is more information, not less. Ongoing assessment collected from multiple opportunities, in varied contexts, and across time can help shield any one datapoint from receiving undue weight.

Personalized assessment goes further in acknowledging the difference between standardization in measurement (valuable) and uniformity in testing (unhelpful). Students with different goals deserve to be assessed by different standards and methods, and not arbitrarily pitted against each other in universal comparisons. Gathering more data from richer contexts that are better matched to students’ learning needs is a fundamental tenet of personalization.

If assessments are diagnoses, what are the prescriptions?

I happen to like statistics. I appreciate qualitative observations, too; data of all sorts can be deeply illuminating. But I also believe that the most important part of interpreting them is understanding what they do and don’t measure. And in terms of policy, it’s important to consider what one will do with the data once collected, organized, analyzed, and interpreted. What do the data tell us that we didn’t know before? Now that we have this knowledge, how will we apply it to achieve the desired change?

In an eloquent, impassioned open letter to President Obama, Education Secretary Arne Duncan, Bill Gates and other billionaires pouring investments into business-driven education reforms (revised version at Washington Post), elementary teacher and literacy coach Peggy Robertson argues that all these standardized tests don’t give her more information than what she already knew from observing her students directly. She also argues that the money that would go toward administering all these tests would be better spent on basic resources such as stocking school libraries with books for the students and reducing poverty.

She doesn’t go so far as to question the current most-talked-about proposals for using those test data: performance-based pay, tenure, and firing decisions. But I will. I can think of a much more immediate and important use for the streams of data many are proposing on educational outcomes and processes: Use them to improve teachers’ professional development, not just to evaluate, reward and punish them.

Simply put, teachers deserve formative assessment too.

Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors in that discussion, since I think these are important and understandable concepts for the general public to consider.

  • Margin of error: Ms. Isaacson’s 7th percentile score actually ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions. (A brief simulation sketch after this list makes this concrete, along with the binning and ceiling issues below.)
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When changes that produce grade transitions are more likely than changes that don’t produce grade transitions, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible (as noted in a comment by journalist Steve Sailer):

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top of the range is more difficult and less likely than improving in the middle of the range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., decreasing changes in weight loss, long jump, reaction time, etc. as the values become more extreme) do translate into nonlinearity, but due to physiological limitations rather than measurement ceilings. That said, the above quote still holds true because of the measurement ceiling, which masks the upper-bound variability among students who could have scored higher but inflates the relative lower-bound variability due to missing a question (whether from carelessness, a bad day, or bad luck in the question selection for the test). These students have more opportunities to be hurt by bad luck than helped by good luck because the test imposes a ceiling (doesn’t ask all the harder questions which they perhaps could have answered).

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform the students in advance of the grading expectations. That means that there will always be nonlinearity, since now the students (and teachers) are “boundary-conscious” and behaving in ways to deliberately try to cross (or not cross) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.
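To make the reversion-to-the-mean, coarse-binning, and ceiling points concrete, here is a minimal simulation sketch in Python. All of the numbers (cohort size, ability distribution, noise level) are hypothetical and chosen purely for illustration; they are not drawn from the actual value-added model discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_students = 200
true_ability = rng.normal(3.0, 0.5, n_students)     # hypothetical latent ability on the 1-4 scale

def observed_score(ability):
    """One noisy test administration, rounded into the four grade categories."""
    luck = rng.normal(0, 0.4, ability.shape)         # bad days, careless errors, item selection
    return np.clip(np.round(ability + luck), 1, 4)   # coarse bins plus a hard ceiling at 4

year1 = observed_score(true_ability)
year2 = observed_score(true_ability)                 # same students, same ability, fresh luck

# Reversion to the mean: students who hit the top bin in year 1 were partly lucky,
# so as a group they score lower the next year even though nothing real changed.
top = year1 == 4
print("Year-1 mean for students scoring 4 in year 1:", year1[top].mean())  # 4.0 by construction
print("Year-2 mean for those same students:         ", year2[top].mean())  # noticeably below 4

# Ceiling effect: any further growth among the year-1 4s is invisible,
# because 4 is the highest score the test can report.
```

Under these assumptions, a teacher whose students happened to be lucky one year will appear to subtract value the next year, purely through chance.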

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (as in the multiyear smoothing another commenter suggested) or values from multiple different assessments; a short sketch after this list shows how pooling measures shrinks the margin of error.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.
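As a rough sketch of recommendation 2, averaging several independent noisy measures of the same underlying teacher effect shrinks the margin of error roughly in proportion to the square root of the number of measures. The numbers below are hypothetical, not calibrated to any real evaluation system.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.0    # hypothetical true teacher effect, in percentile points
noise_sd = 25.0      # single-measure noise, roughly comparable to a 0-to-52 percentile band

def pooled_estimate(n_measures):
    """Average n independent noisy measures and report an approximate 95% margin of error."""
    measures = true_effect + rng.normal(0, noise_sd, n_measures)
    return measures.mean(), 1.96 * noise_sd / np.sqrt(n_measures)

for n in (1, 3, 5):
    estimate, margin = pooled_estimate(n)
    print(f"{n} measure(s): estimate {estimate:6.1f}, margin of error about ±{margin:.1f}")
```

Whether the extra measures come from multiyear smoothing or from different assessments, the point is the same: more independent information per teacher means less weight on any single noisy snapshot.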

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Some limitations of value-added modeling

Following this discussion on teacher evaluation led me to a fascinating analysis by Jim Manzi.

We’ve already discussed some concerns with using standardized test scores as the outcome measures in value-added modeling; Manzi points out other problems with the model and the inputs to the model.

  1. Teaching is complex.
  2. It’s difficult to make good predictions about achievement across different domains.
  3. It’s unrealistic to attribute success or failure only to a single teacher.
  4. The effects of teaching extend beyond one school year, and therefore measurements capture influences that go back beyond one year and one teacher.

I’m not particularly fond of the above list: while I agree with all the claims, they’re not explained very clearly, and they don’t capture the key issues below, which he discusses in more depth.

  1. Inferences about the aggregate are not inferences about an individual. More deeply, the model is valid at the aggregate level, “but any one data point cannot be validated.” This is a fundamental problem, true of stereotypes, of generalizations, and of averages. While they may enable you to make broad claims about a population of people, you can’t apply those claims to policies about a particular individual with enough confidence to justify high-stakes outcomes such as firing decisions. As Manzi summarizes it, an evaluation system works to help an organization achieve an outcome, not to be fair to the individuals within that organization.

    This is also related to problems with data mining: by throwing a bunch of data into a model and turning the crank, you can end up with all kinds of difficult-to-interpret correlations which are excellent predictors but which don’t make a whole lot of sense from a theoretical standpoint.

  2. Basing decisions on a single measure instead of multiple measures is flawed. From a statistical modeling perspective, it’s easier to work with a single precise, quantitative measure than with multiple measures. But this inflates the influence of that one measure, which is often limited in time and scale. Figuring out how to combine multiple measures into a single metric requires subjective judgment (and thus organizational agreement), and, in Manzi’s words, “is very unlikely to work” with value-added modeling. (I do wish he’d expanded on this point further, though; a toy illustration of the weighting problem follows this list.)

  3. All assessments are proxies. If the proxy is given more value than the underlying phenomenon it’s supposed to measure, this can incentivize “teaching to the test”. With much at stake, some people will try to game the system. This may motivate those who construct and rely on the model to periodically change the metrics, but that introduces more instability in interpreting and calibrating the results across implementations.
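To see where the subjective judgment enters, here is a toy composite; every component name, score, and weight below is invented for illustration. The statistics can produce each component score, but nothing in the data dictates the weights.

```python
# Hypothetical component scores for one teacher, each already rescaled to a 0-1 range.
components = {
    "value_added_estimate": 0.45,
    "principal_observation": 0.70,
    "student_survey": 0.62,
    "peer_review": 0.58,
}

# The weights are the subjective, negotiated part: no statistical procedure
# determines them, so the organization has to agree on them in advance.
weights = {
    "value_added_estimate": 0.35,
    "principal_observation": 0.30,
    "student_survey": 0.20,
    "peer_review": 0.15,
}

composite = sum(score * weights[name] for name, score in components.items())
print(f"Composite rating: {composite:.2f}")  # changing the weights changes who looks effective
```

Shifting a few points of weight from one component to another can reorder teachers, which is why the choice of weights is an organizational decision rather than a statistical one.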

In highlighting these weaknesses of value-added modeling, Manzi concludes by arguing that improving teacher evaluation requires a lot more careful interpretation of its results, within the context of better teacher management. I would very much welcome hearing more dialogue about what that management and leadership should look like, instead of so much hype about impressive but complex statistical tools expected to solve the whole problem on their own.