MOOC measurement problems reveal systemic evaluation challenges

Sadly, it’s not particularly surprising that it took a proclamation by researchers from prominent institutions (Harvard and MIT) to draw the media’s attention to what should have been obvious all along. That they don’t have alternative metrics handy highlights the difficulty of assessment in the absence of high-quality data both inside and outside the system. Inside the system, designers of online courses are still figuring out how to assess knowledge and learning quickly and effectively. Outside the system, would-be analysts lack information on how students (graduates and drop-outs alike) make use of what they learned, or don’t. Measuring long-term retention and far transfer will continue to pose a problem for evaluating educational experiences as they become more modularized and unbundled, unless systems emerge for integrating outcome data across experiences and over time. In economic terms, this exemplifies the need to internalize the externalities of the system.

Not all uses of data are equal

Gil Press worries that “big data enthusiasts may encourage (probably unintentionally) a new misguided belief, that ‘putting data in front of the teacher’ is in and of itself a solution [to what ails education today].”

As an advocate for the better use of educational data and learning analytics to serve teachers, I worry about careless endorsements and applications of “big data” that overlook these concerns:

1. Available data are not always the most important data.
2. Data should motivate providing support, not merely accountability.
3. Teachers are neither scientists nor laypeople in their use of data. They rely on data constantly, but need representations that they can interpret and turn into action readily.

Assessment specialists have long noted the many uses of assessment data; all educational data should be weighed just as carefully, all the more so when used at large scale, where the influence of errors is magnified.

On the realistic use of teaching machines

From the perspective that all publicity is good publicity, the continued hype-and-backlash cycle in media representations of educational technology is helping to fuel interest in its potential use. However, misleading representations, even artistic or satirical ones, can skew the discourse away from realistic discussion of the technology’s true capacities, its constraints, and its appropriate use. We need honest appraisals of strengths and weaknesses to inform our judgment of what to do, and what not to do, when incorporating teaching machines into learning environments.

Adam Bessie and Arthur King’s cartoon depiction of the Automated Teaching Machine conveys dire warnings about the evils of technology, based on several common misconceptions regarding its use. One is the false dichotomy between machine and teacher, which portrays the goal of technology as replacing teachers through automation. While certain low-level tasks like marking multiple-choice questions can be automated, other aspects of teaching cannot. Even while advocating for greater use of automated assessment, I note that it is best used in conjunction with human judgment and interaction. Technology should augment what teachers can do, not replace it.

A second misconception is that educational programs are just Skinner machines that reinforce stimulus-response links. The very premise of cognitive science, and thus the foundation of modern cognitive tutors, is the need to go beyond observable behaviors to draw inferences about internal mental representations and processes. Adaptations to student performance are based on judgments about internal states, including not just knowledge but also motivation and affect.

A third misconception is that human presence corresponds to the quality of teaching and learning taking place. What matters is the quality of the interaction, between student and teacher, between student and peer, and between student and content. Human presence is a necessary precondition for human interaction, but it is neither a guarantee nor a perfect correlate of productive human interaction for learning.

Educational technology definitely needs critique, especially in the face of its possible widespread adoption. But those critiques should be based on the realities of its actual use and potential. How should the boundaries between human-human and human-computer interaction be navigated so that the activities mutually support each other? What kinds of representations and recommendations help teachers make effective use of assessment data? These are the kinds of questions we need to tackle in service of improving education.

MOOCs plus big data

In The Coming Big Data Education Revolution, Doug Guthrie argues that “big data”, rather than MOOCs, represents the true revolution in education:

MOOCs are not a transformative innovation that will forever remake academia. That honor belongs to a more disruptive and far-reaching innovation – “big data.” A catchall phrase that refers to the vast numbers of data sets that are collected daily, big data promises to revolutionize online learning and, in doing so, higher education.

I agree that there are exciting discoveries and innovations still to be made through the advent of big data in education, and I also agree that MOOCs’ current reliance on scaling up the delivery of existing content isn’t particularly revolutionary. Yet I see the two movements as overlapping and complementary, rather than as competing forces.

While MOOCs may not (yet) have revolutionized instruction, they have revolutionized access for many learners. Part of their appeal, for those interested in their growth, is the potential they offer for large-scale analysis, thanks to high enrollments and the ready availability of online data. The opportunity to study such large numbers of students across such disparate contexts is rare in traditional academic settings, and it permits discoveries of learning trajectories and error patterns that might otherwise be dismissed as noise in smaller samples.
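To make that sample-size point concrete, here is a minimal sketch (my own illustration, not drawn from any actual MOOC dataset) of why a rare error pattern disappears into noise in a single classroom but becomes measurable at MOOC scale. The 2% error rate and the cohort sizes below are hypothetical.

    # Margin of error around an estimated error rate shrinks with sample size,
    # so a ~2% misconception-driven error is invisible in one classroom but
    # clearly measurable across tens of thousands of learners.
    import math

    def rate_with_margin(cases, n):
        """Estimated rate and approximate 95% margin of error (normal approximation)."""
        p = cases / n
        margin = 1.96 * math.sqrt(p * (1 - p) / n)
        return p, margin

    for n in (30, 300, 50_000):       # classroom, department, MOOC-scale cohort
        cases = round(0.02 * n)       # assume ~2% of students make the error
        rate, margin = rate_with_margin(cases, n)
        print(f"n={n:>6}: estimated error rate {rate:.3f} +/- {margin:.3f}")

At n=30 the estimate is statistically indistinguishable from zero; at n=50,000 the same pattern stands out clearly, which is exactly the kind of signal that large-scale enrollment data make recoverable.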

Another potential innovation that traditional MOOCs (xMOOCs) have not yet explored is new models of building cohorts and communities from within a large pool of learners, a goal at the heart of the “connectivist MOOCs” (cMOOCs) that highlight peer-learning pedagogy. Combine xMOOCs and cMOOCs, and you can improve educational access even further by enabling courses to spring up whenever and wherever enough people, interest, and resources converge. Add in the analytical power of big data, and you have the capacity to truly personalize learning, providing both the experiences that best support students’ learning and the human interactions that enrich those experiences.

Expensive assessment

One way to evaluate automated scoring is to compare it against human scoring. For some domains and test formats (e.g., multiple-choice items on factual knowledge), automation has an accepted advantage in objectivity and reliability, although whether such questions assess meaningful understanding is often debated. In more open-ended domains and designs, human reading is typically considered superior, allowing room for individual nuance to shine through and be recognized.
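As a concrete illustration of what comparing automated against human scoring can look like, here is a minimal sketch of two common agreement measures: exact agreement, and quadratic-weighted kappa, a chance-corrected statistic often used for ordinal essay scores. The 0-5 rubric and the scores below are made up for illustration; this is not any particular scoring engine’s method.

    from collections import Counter

    def exact_agreement(human, machine):
        """Fraction of essays on which the two scores match exactly."""
        return sum(1 for h, m in zip(human, machine) if h == m) / len(human)

    def quadratic_weighted_kappa(human, machine, num_levels):
        """Chance-corrected agreement: 1.0 is perfect, 0.0 is chance-level."""
        n = len(human)
        observed = [[0] * num_levels for _ in range(num_levels)]
        for h, m in zip(human, machine):
            observed[h][m] += 1
        h_hist, m_hist = Counter(human), Counter(machine)
        numerator = denominator = 0.0
        for i in range(num_levels):
            for j in range(num_levels):
                weight = (i - j) ** 2 / (num_levels - 1) ** 2
                expected = h_hist[i] * m_hist[j] / n   # counts if scorers were independent
                numerator += weight * observed[i][j]
                denominator += weight * expected
        return 1.0 - numerator / denominator

    # Hypothetical scores on a 0-5 rubric for eight essays.
    human_scores   = [3, 4, 2, 5, 3, 1, 4, 2]
    machine_scores = [3, 4, 3, 4, 3, 1, 5, 2]
    print(exact_agreement(human_scores, machine_scores))              # 0.625
    print(quadratic_weighted_kappa(human_scores, machine_scores, 6))

High agreement on such metrics only tells us that the machine reproduces the human scores, not that either set of scores captures meaningful understanding, which is precisely the caveat above.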

Yet this exposé of some professional scorers’ experience reveals how even that cherished human judgment can get distorted and devalued. Here, narrow rubrics, mandated consistency, and expectations of bell curves valued sameness over subtlety and efficiency over reflection. In essence, such simplistic scoring procedures forced scorers to reverse-engineer cookie-cutter essays that all had to fit one of six categories, differing details be damned.

Individual algorithms and procedures for assessing tests need to be improved so that they can make better use of a broader base of information. So does a system that relies so heavily on particular assessments that the impact of their weaknesses gets greatly magnified. Teachers and schools collect a wealth of assessment data all the time; better mechanisms for aggregating and analyzing these data can extract more informational value from them and decrease the disproportionate weight placed on testing factories. When designed well, algorithms and automated tools for assessment can enhance human judgment rather than reducing it to an arbitrary bin-sorting exercise.

Standardized tests as market distortions

Some historical context on how standardized tests have affected the elite points out how gatekeepers can magnify the influence of certain factors over others, whether through chance or through bias:

In 1947, the three significant testing organizations, the College Entrance Examination Board, the Carnegie Foundation for the Advancement of Teaching and the American Council on Education, merged their testing divisions into the Educational Testing Service, which was headed by former Harvard Dean Henry Chauncey.

Chauncey was greatly affected by a 1948 Scientific Monthly article, “The Measurement of Mental Systems (Can Intelligence Be Measured?)” by W. Allison Davis and Robert J. Havighurst, which called intelligence tests nothing more than a scientific way to give preference to children from middle- and upper-middle-class families. The article challenged Chauncey’s belief that by expanding standardized tests of mental ability and knowledge America’s colleges would become the vanguard of a new meritocracy of intellect, ability and ambition, and not finishing schools for the privileged.

The authors, and others, challenged that the tests were biased. Challenges aside, the proponents of widespread standardized testing were instrumental in the process of who crossed the American economic divide, as college graduates became the country’s economic winners in the postwar era.

As Nicholas Lemann wrote in his book “The Big Test,” “The machinery that (Harvard President James) Conant and Chauncey and their allies created is today so familiar and all-encompassing that it seems almost like a natural phenomenon, or at least an organism that evolved spontaneously in response to conditions. … It’s not.”

As a New Mexico elementary teacher and blogger explains:

My point is that test scores have a lot of IMPACT because of the graduation requirements, even if they don’t always have a lot of VALUE as a measure of growth.

Instead of grade inflation, we have testing-influence inflation, where the impact of certain tests is magnified beyond that of other assessment metrics. It becomes a kind of market distortion in the economics of test scores, where some measurements are more visible and assume more value than others, inviting cheating and “gaming the system”.

We can restore openness and transparency to the system by collecting continuous assessment data that assign more equal weight across a wider range of testing experiences, removing incentives to cheat or “teach to the test”. Adaptive and personalized assessment go further in alleviating pressures to cheat, by reducing the inflated number of competitors against whom one may be compared. Assessment can then return to fulfilling its intended role of providing useful information on what a student has learned, thereby yielding better measures of growth and becoming more honestly meritocratic.
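To see the “market distortion” in miniature, here is a minimal sketch (my own hypothetical grading schemes, not any real district’s) of how much any single assessment can move the final outcome under a high-stakes weighting versus a continuous one.

    def score_influence(weights):
        """Each assessment's share of the composite score."""
        total = sum(weights.values())
        return {name: w / total for name, w in weights.items()}

    # Hypothetical weighting schemes.
    high_stakes = {"state_test": 60, "homework": 20, "projects": 20}
    continuous = {f"weekly_quiz_{i}": 3 for i in range(1, 31)}   # 30 small checks
    continuous.update({"homework": 20, "projects": 20})

    print(score_influence(high_stakes))                # state_test controls 60% of the outcome
    print(max(score_influence(continuous).values()))   # no single item exceeds about 15%

When one score controls most of the outcome, gaming that score pays off handsomely; when weight is spread across many smaller assessments, no single measurement is worth distorting.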

Learner, Know Thyself

As “Big Data” looms larger and larger, the value of owning your own data likewise increases. Learners need access to all of their prior educational data, just as patients need access to all of their prior medical records, especially as they move between multiple providers and change over time. Instead of locking valuable information up inside individual organizations with their own proprietary or idiosyncratic institutional habits, giving learners their data lets them share it with new educational providers to analyze.

Putting data back in learners’ hands also empowers them to act as their own student-advocates, not just recognizing patterns in when they are learning more (or less) effectively, but having the evidence to support their position. With accurate self-assessment and self-regulated learning becoming increasingly important goals in education, having students take literal ownership of their own learning and assessment data can help them make progress toward those goals.

Beating cheating

Between cheating to learn and learning to cheat, current discourse on academic dishonesty upends the “if you can’t beat ’em, join ’em” approach.

From Peter Nonacs, UCLA professor teaching Behavioral Ecology:

Tests are really just measures of how the Education Game is proceeding. Professors test to measure their success at teaching, and students take tests in order to get a good grade. Might these goals be maximized simultaneously? What if I let the students write their own rules for the test-taking game? Allow them to do everything we would normally call cheating?

And in a new MOOC titled “Understanding Cheating in Online Courses,” taught by Bernard Bull at Concordia University Wisconsin:

The start of the course will cover the basic vocabulary and different types of cheating. The course will then move into discussing the differences between online and face-to-face learning, and the philosophy and psychology behind academic integrity. One unit will examine the best practices to minimize cheating.

Cheating crops up whenever there is a mismatch between effort and reward, something that happens often in our current educational system. Assigning unequal rewards to equal efforts biases attention toward the inflated reward, motivating cheating. Assigning equal rewards to unequal efforts favors the lesser effort, enabling cheating. The greater the disparities, the greater the likelihood of cheating.

Thus, one potential avenue for reducing cheating would be to better align the reward with the effort, linking the evaluation of outputs more closely to the actual inputs. High-stakes tests separate them by exaggerating the influence of a single, limited snapshot. In contrast, continuous, passive assessment brings them closer by examining a much broader range of work over time, collected in authentic learning contexts rather than artificial testing situations. Education then becomes a series of honest learning experiences, rather than an arbitrary system to game.

In an era where students learn what gets assessed, the answer may be to assess everything.

Using personalized assessment to change the high-stakes testing culture

Criticisms of high-stakes tests abound as we usher in the start of K-12 testing season. Students worry about being judged on a bad day and note that tests measure only one kind of success, while teachers lament the narrowing of the curriculum. Others object to the lack of transparency in a system entrusted with such great influence.

Yet the problem isn’t tests themselves, but relying on only a few tests. What we actually need is more information, not less. Ongoing assessment collected from multiple opportunities, in varied contexts, and across time can help shield any one datapoint from receiving undue weight.

Personalized assessment goes further in acknowledging the difference between standardization in measurement (valuable) and uniformity in testing (unhelpful). Students with different goals deserve to be assessed by different standards and methods, and not arbitrarily pitted against each other in universal comparisons. Gathering more data from richer contexts that are better matched to students’ learning needs is a fundamental tenet of personalization.