Problems with pay-for-performance

If pay-for-performance doesn’t work in medicine, what should our expectations be for its success in education?

“No matter how we looked at the numbers, the evidence was unmistakable: by no measure did pay-for-performance benefit patients with hypertension,” says lead author Brian Serumaga.

Interestingly, hypertension is “a condition where other interventions such as patient education have been shown to be very effective.”

According to Anthony Avery… “Doctor performance is based on many factors besides money that were not addressed in this program: patient behavior, continuing MD training, shared responsibility and teamwork with pharmacists, nurses and other health professionals. These are factors that reach far beyond simple monetary incentives.”

It’s not hard to complete the analogy: doctor = teacher; patient = student; MD training = pre-service and in-service professional development; pharmacists, nurses and other health professionals = lots of other education professionals.

One may question whether the problem is that money is an insufficient motivator, that pay-for-performance amounts to ambiguous global rather than specific local feedback, or that too many other factors outside the doctor’s control obscure any effect. Still, this should give pause to efforts to incentivize teachers by paying them for their students’ good test scores.
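
The study itself used an interrupted time series design. For readers unfamiliar with the method, here is a minimal segmented-regression sketch; the monthly series, intervention point, and effect sizes below are invented for illustration and are not the study’s data:

```python
import numpy as np

# Toy interrupted time series: 48 monthly outcome values, with a hypothetical
# intervention starting at month 24. The data are simulated with NO effect.
rng = np.random.default_rng(0)
months = np.arange(48)
post = (months >= 24).astype(float)   # 1 once the intervention is in place
time_after = post * (months - 24)     # months elapsed since the intervention
y = 70 + 0.1 * months + rng.normal(scale=1.0, size=48)

# Segmented regression: baseline level and trend, plus a level change and a
# slope change at the intervention point.
X = np.column_stack([np.ones_like(months, dtype=float), months, post, time_after])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
level_change, slope_change = beta[2], beta[3]
print(level_change, slope_change)  # both should be near zero (no simulated effect)
```

A null finding like the study’s corresponds to level-change and slope-change estimates indistinguishable from zero after the incentives began.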

Serumaga, B., Ross-Degnan, D., Avery, A.J., Elliott, R.A., Majumdar, S.R., Zhang, F., & Soumerai, S.B. (2011). Effect of pay for performance on the management and outcomes of hypertension in the United Kingdom: Interrupted time series study. BMJ, 342, d108. DOI: 10.1136/bmj.d108


Some limitations of value-added modeling

Following this discussion on teacher evaluation led me to a fascinating analysis by Jim Manzi.

We’ve already discussed some concerns with using standardized test scores as the outcome measures in value-added modeling; Manzi points out other problems with the model and the inputs to the model.

  1. Teaching is complex.
  2. It’s difficult to make good predictions about achievement across different domains.
  3. It’s unrealistic to attribute success or failure only to a single teacher.
  4. The effects of teaching extend beyond one school year, and therefore measurements capture influences that go back beyond one year and one teacher.

I’m not particularly fond of the above list. While I agree with all the claims, they’re not explained very clearly, and they don’t capture the key issues below, which he discusses in more depth.

  1. Inferences about the aggregate are not inferences about an individual. The model is valid at the aggregate level, “but any one data point cannot be validated.” This is a fundamental problem, true of stereotypes, of generalizations, and of averages: while they may enable you to make broad claims about a population of people, you can’t apply those claims to a particular individual with enough confidence to justify high-stakes outcomes such as firing decisions. As Manzi summarizes it, an evaluation system works to help an organization achieve an outcome, not to be fair to the individuals within that organization.

     This is also related to a familiar problem with data mining: by throwing a bunch of data into a model and turning the crank, you can end up with all kinds of difficult-to-interpret correlations which are excellent predictors but which don’t make much sense from a theoretical standpoint.

  2. Basing decisions on a single measure instead of multiple measures is flawed. From a statistical modeling perspective, it’s easier to work with a single precise, quantitative measure than with multiple measures, but doing so inflates the influence of that one measure, which is often limited in time and scale. Figuring out how to combine multiple measures into a single metric requires subjective judgment (and thus organizational agreement), and, in Manzi’s words, “is very unlikely to work” with value-added modeling. (I do wish he’d expanded on this point further.)

  3. All assessments are proxies. If the proxy is given more value than the underlying phenomenon it’s supposed to measure, it can incentivize “teaching to the test”. With much at stake, some people will try to game the system. This may motivate those who construct and rely on the model to change the metrics periodically, but that introduces more instability in interpreting and calibrating the results across implementations.
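
The aggregate-versus-individual problem can be made concrete with a small simulation. Everything below (500 teachers, 25 students each, the effect-size and noise assumptions) is invented for illustration, not estimated from any real system:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_students = 500, 25
# Hypothetical true teacher effects (small, in student-SD units) and noisy scores
true_effect = rng.normal(scale=0.1, size=n_teachers)
scores = true_effect[:, None] + rng.normal(size=(n_teachers, n_students))
estimate = scores.mean(axis=1)  # each teacher's estimated "value added"

# In aggregate, the estimates clearly track the truth...
corr = np.corrcoef(true_effect, estimate)[0, 1]
print(corr)  # moderate positive correlation

# ...but individual high-stakes decisions are unreliable: many teachers
# flagged in the bottom decile of estimates are not truly bottom-decile.
flagged = estimate <= np.quantile(estimate, 0.1)
truly_low = true_effect <= np.quantile(true_effect, 0.1)
false_flag_rate = (flagged & ~truly_low).sum() / flagged.sum()
print(false_flag_rate)
```

The estimates track the truth well enough to support claims about the population, yet a large share of the teachers flagged as weakest are misclassified, which is exactly the gap between organizational usefulness and individual fairness that Manzi describes.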

In highlighting these weaknesses of value-added modeling, Manzi concludes that improving teacher evaluation requires much more careful interpretation of its results, set within the context of better teacher management. I would welcome more dialogue about what that management and leadership should look like, instead of so much hype about impressive but complex statistical tools expected to solve the whole problem on their own.

Supporting imaginative play

I came across this article when looking for a suitable reference in my last post, and I thought it deserved its own summary for all the information it contains. Educators and parents are constantly seeking concrete descriptions and recommendations for how to help their students and children develop, and this article is helpfully specific in describing the characteristics of mature imaginative play and techniques for supporting it. Below I summarize the main points in each category.

Characteristics of mature play (what to look for and encourage):

  • Imaginary situations
    • Assigning new meanings to people and objects
    • Focusing on abstract rather than concrete properties
    • Inventing new uses for familiar objects
    • Describing missing props with words and gestures
  • Multiple roles
    • Assuming multiple roles, including supporting characters
    • Practicing the actions and emotions of the role rather than their own
  • Clearly defined rules
    • Delaying immediate fulfillment of their desires (and thereby developing better self-regulation)
  • Flexible themes
    • Incorporating new roles and ideas from other themes
    • Negotiating plans across themes
  • Language development
    • Us[ing] language to plan the play scenario, negotiate and act out roles, explain “pretend” behaviors to others, and regulate rule compliance
    • Modifying speech intonation, register, and vocabulary to code-switch between real and pretend speech
  • Length of play
    • Staying with same play theme across multiple sessions over days
    • Creating, reviewing, revising plans
    • Elaborating on imaginary situation, integrating new roles, discovering new uses for props

How to support imaginative play

  • Intervene sometimes:
    • Beware of intervening so much that the play loses its spontaneous, child-initiated character and changes into another adult-directed activity.
    • Do intervene when children’s play remains stereotypical and unexciting day after day to help kids expand the scope of their play.
  • Create imaginary situations:
    • Provide multipurpose props that can stand for many objects (which also promotes cognitive flexibility).
    • Combine multipurpose props with realistic ones to keep play going and then gradually provide more unstructured materials.
    • Show the children different common objects and brainstorm how they can use them in different ways in play.
    • Encourage children to use both gestures and words to describe how they are using the object in a pretend way.
  • Integrate different play themes and roles:
    • Use field trips, literature, and videos to expand children’s repertoire of play themes and roles.
    • Point out the ‘people’ part of each new setting—the many different roles that people have and how the roles relate to one another.
  • Sustain play:
    • Help children plan play in advance by asking them to record their plans by drawing or writing them. This may help stimulate them to create new developments in their play scenario.

(Italicized text represents direct or near-direct quotations; parenthetical comments represent my own added interpretation.)


Bodrova, E.B., & Leong, D.J. (2003). The importance of being playful. Educational Leadership, 60, 50-53.

Diligence vs. competitiveness

I continue to be disappointed by the mainstream media’s coverage of the “tiger parenting” phenomenon. Although there’s now somewhat more discussion of the specific parenting practices that are good or bad for the child, distinct practices are still getting conflated. One of the latest that I’ve seen claims that “the intense emphasis on hard work comes with a deep, obsessive competitiveness.”

It ain’t necessarily so.

Setting aside my frustration with the media’s reliance on first-person anecdotes rather than research on aggregate populations, I’ll first acknowledge that emphasizing the value of hard work and discipline is indeed very productive. As I wrote in my previous post, it takes about ten years of deliberate practice to attain expertise[1], and those who believe in the value of effort are more likely to invest further effort[2][3]. A recently reported twin study similarly notes that children with greater self-control do better in school and have better health, financial, and social outcomes as adults[4].

But deliberate practice isn’t just rote practice, and worthwhile effort isn’t merely hours of exhaustion. Both require thoughtfulness in figuring out what needs more work and how to tackle it. Likewise, developing self-control requires more than simply being placed in a constraining environment. As I’ve previously noted, the complex and ill-structured world of imaginative play can improve children’s impulse control and self-regulation[5]. Much like the riddle about the town with two barbers, “just because it looks like what you want doesn’t mean it will produce what you want.” Here, it’s not enough just to put children and students through their paces in rigid settings that prevent them from going astray. The theme underlying all of these phenomena is that people need to learn how to decide for themselves how to manage their efforts and how to improve.

So how does this relate to competitiveness? Competition is based on social comparison, or norm-referenced assessment. While it may be inspiring to see the accomplishments of peers as a possibility for oneself, it can also be limiting to endeavor only to best them and not more generally to excel. Even more damaging, people have no control over the performance of their rivals, and chasing this uncontrollable target can foster that worrisome learned helplessness which can dampen enthusiasm, lower self-esteem, and inhibit future effort[6][7]. Instead, criterion-referenced assessment measures performance against fixed goals which can be set up as benchmarks and eventually internalized by the learner. Not only does this provide a clearer way of measuring success than normative comparisons, but it also helps the budding athlete, musician, or student succeed better in the long run.

I fully recognize that Chinese parents can be obsessively competitive about their children’s achievements, even through their self-effacing denials that their children are anything special. And it’s already been documented that the Chinese culture places a high premium on hard work. But these aren’t uniquely Chinese values, and they can be decoupled. The result isn’t some kind of compromise or blend between Chinese and American philosophies, but a selection of those beliefs and practices that have been shown to be productive across cultures.


[1] Ericsson, K.A. (1996). The acquisition of expert performance: An introduction to some of the issues. In K. A. Ericsson (Ed.), The Road to Excellence: The Acquisition of Expert Performance in the Arts and Sciences, Sports, and Games (pp. 1-50). Mahwah, NJ: Lawrence Erlbaum Associates.
[2] Mueller, C.M., & Dweck, C.S. (1998). Praise for intelligence can undermine children’s motivation and performance. Journal of Personality and Social Psychology, 75, 33-52.
[3] Dweck, C.S. (2006). Mindset: The New Psychology of Success. New York, NY: Random House.
[4] Moffitt, T.E., Arseneault, L., Belsky, D., Dickson, N., Hancox, R.J., Harrington, H.L., Houts, R., Poulton, R., Roberts, B.W., Ross, S., Sears, M.R., Thomson, W.M., & Caspi, A. (in press). A gradient of childhood self-control predicts health, wealth, and public safety. Proceedings of the National Academy of Sciences.
[5] Bodrova, E.B., & Leong, D.J. (2003). The importance of being playful. Educational Leadership, 60, 50-53.
[6] Diener, C.I., & Dweck, C.S. (1978). An analysis of learned helplessness: Continuous changes in performance, strategy, and achievement cognitions following failure. Journal of Personality and Social Psychology, 36, 451-462.
[7] Diener, C.I., & Dweck, C.S. (1980). An analysis of learned helplessness: II. The processing of success. Journal of Personality and Social Psychology, 39, 940-952.

Using student evaluations to measure teaching effectiveness

I came across a fascinating discussion on the use of student evaluations to measure teaching effectiveness via this Observational Epidemiology blog post by Mark, a statistical consultant. The original paper by Scott Carrell and James West uses value-added modeling to estimate professors’ contributions to students’ grades in introductory courses and in subsequent courses, then analyzes the relationship between those contributions and student evaluations. (An ungated version of the paper is also available.) Key conclusions are:

Student evaluations are positively correlated with contemporaneous professor value‐added and negatively correlated with follow‐on student achievement. That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value‐added in follow‐on courses).

We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow‐on related curriculum.

Not having closely followed the research on this, I’ll simply note some key comments from other blogs.

Direct examination:

Several have posted links suggesting endorsement of this paper’s conclusion, among them George Mason University economics professor Tyler Cowen, Harvard economics professor Greg Mankiw, and Northwestern managerial economics professor Sandeep Baliga. Michael Bishop, a contributor to Permutations (“official blog of the Mathematical Sociology Section of the American Sociological Association”), provides more detail in his analysis:

In my post on Babcock’s and Marks’ research, I touched on the possible unintended consequences of student evaluations of professors.  This paper gives new reasons for concern (not to mention much additional evidence, e.g. that physical attractiveness strongly boosts student evaluations).

That said, the scary thing is that even with random assignment, rich data, and careful analysis there are multiple, quite different, explanations.

The obvious first possibility is that inexperienced professors (perhaps under pressure to get good teaching evaluations) focus strictly on teaching students what they need to know for good grades. More experienced professors teach a broader curriculum, the benefits of which you might take on faith but needn’t, because their students do better in the follow-up course!

After citing this alternative explanation from the authors:

Students of low value added professors in the introductory course may increase effort in follow-on courses to help “erase” their lower than expected grade in the introductory course.

Bishop also notes that motivating students to invest more effort in future courses would be a desirable effect of good professors as well. (But how to distinguish between “good” and “bad” methods for producing this motivation isn’t obvious.)

Cross-examination:

Others critique the article and defend the usefulness of student evaluations with observations that provoke further fascinating discussions.

Andrew Gelman, Columbia professor of statistics and political science, expresses skepticism about the claims:

Carrell and West estimate that the effects of instructors on performance in the follow-on class is as large as the effects on the class they’re teaching. This seems hard to believe, and it seems central enough to their story that I don’t know what to think about everything else in the paper.

At Education Sector, Forrest Hinton expresses strong reservations about the conclusions and the methods:

If you’re like me, you are utterly perplexed by a system that would mostly determine the quality of a Calculus I instructor by students’ performance in a Calculus II or aeronautical engineering course taught by a different instructor, while discounting students’ mastery of Calculus I concepts.

The trouble with complex value-added models, like the one used in this report, is that the number of people who have the technical skills necessary to participate in the debate and critique process is very limited—mostly to academics themselves, who have their own special interests.

Jeff Ely, Northwestern professor of economics, objects to the authors’ interpretation of their results:

I don’t see any way the authors have ruled out the following equally plausible explanation for the statistical findings.  First, students are targeting a GPA.  If I am an outstanding teacher and they do unusually well in my class they don’t need to spend as much effort in their next class as those who had lousy teachers, did poorly this time around, and have some catching up to do next time.  Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.

In agreement, Ed Dolan, an economist who was also for ten years “a teacher and administrator in a graduate business program that did not have tenure,” comments on Jeff Ely’s blog:

I reject the hypothesis that students give high evaluations to instructors who dumb down their courses, teach to the test, grade high, and joke a lot in class. On the contrary, they resent such teachers because they are not getting their money’s worth. I observed a positive correlation between overall evaluation scores and a key evaluation-form item that indicated that the course required more work than average. Informal conversations with students known to be serious tended to confirm the formal evaluation scores.

Re-direct:

Dean Eckles, a PhD candidate at Stanford’s CHIMe lab, offers this response to Andrew Gelman’s blog post (linked above):

Students like doing well on tests etc. This happens when the teacher is either easier (either through making evaluations easier or teaching more directly to the test) or more effective.

Conditioning on this outcome is conditioning on a collider that introduces a negative dependence between teacher quality and other factors affecting student satisfaction (e.g., how easy they are).
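
Eckles’ collider argument can be checked with a quick simulation; the variable names and effect sizes below are my own illustrative assumptions, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
quality = rng.normal(size=n)    # teacher effectiveness
easiness = rng.normal(size=n)   # leniency / teaching to the test (independent of quality)
grades = quality + easiness + rng.normal(scale=0.5, size=n)  # the collider

# Unconditionally, quality and easiness are independent:
corr_all = np.corrcoef(quality, easiness)[0, 1]
print(corr_all)  # approximately 0

# Conditioning on the outcome (students who did well) induces a negative
# dependence between quality and easiness, exactly as described:
well = grades > 1.0
corr_cond = np.corrcoef(quality[well], easiness[well])[0, 1]
print(corr_cond)  # clearly negative
```

Among students who did well, teachers who are genuinely effective tend to look less “easy,” and vice versa, even though the two traits are unrelated in the full population.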

From Jeff Ely’s blog, a comment by Brian Moore raises this critical question:

“Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.”

Do we know this for sure? Perhaps they know when they have an outstanding teacher, but by definition, those are relatively few.

Closing thoughts:

These discussions raise many key questions, including:

  • how to measure good teaching;
  • tensions between short-term and long-term assessment and evaluation[1];
  • how well students’ grades measure learning, and how grades impact their perception of learning;
  • the relationship between learning, motivation, and affect (satisfaction);
  • but perhaps most deeply, the question of student metacognition.

The anecdotal comments others have offered about how students respond on evaluations would be more fairly couched in terms of “some students.” Given the considerable variability among students, interpreting student evaluations needs to account for those individual differences in teasing out the actual teaching and learning that underlie self-reported perceptions. Buried within those evaluations may be a valuable signal masked by a lot of noise, or, more problematically, multiple signals that cancel and drown each other out.

[1] For example, see this review of research demonstrating that training which produces better short-term performance can produce worse long-term learning:
Schmidt, R.A., & Bjork, R.A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science, 3, 207-217.

Retrieval is only part of the picture

The latest educational research to make the rounds has been reported variously as “Test-Taking Cements Knowledge Better Than Studying,” “Simple Recall Exercises Make Science Learning Easier,” “Practising Retrieval is Best Tool for Learning,” and “Learning Science: Actively Recalling Information from Memory Beats Elaborate Study Methods.” Before anyone gets carried away seeking to apply these findings to practice, let’s correct the headlines and clarify what the researchers actually studied.

First, the “test-taking” vs. “studying” dichotomy presented by the NYT is too broad. The winning condition was “retrieval practice”, described fairly as “actively recalling information from memory” or even “simple recall exercises.” The multiple-choice questions popular on so many standardized tests don’t qualify because they assess recognition of information, not recall. In this study, participants had to report as much information as they could remember from the text, a more generative task than picking the best among the possible answers presented to them.

Nor were the comparison conditions merely “studying.” The worst-performing conditions asked students to read (and perhaps reread) the text; these were dropped from the second experiment, which contrasted retrieval practice against “elaborative concept-mapping.” Thus the “elaborate” (better read as “elaborative”) study methods reported in the ScienceDaily headline are overly broad, since concept-mapping is only one of many kinds of elaborative study methods. That the researchers found no benefit for students with previous concept-mapping experience may simply mean that the technique requires more than one or two exposures to become useful.

The premise underlying concept-mapping as a learning tool is that re-representing knowledge in another format helps students identify and understand relationships between the concepts. But producing a new representation on paper (or some other external medium) doesn’t require constructing a new internal mental representation. In focusing on producing a concept map, students may simply have copied the information from the text to their diagram without deeply processing what they were writing or drawing. Because the study scored concept maps on completeness (number of ideas) rather than quality (appropriateness of node placement and links), it did not fully safeguard against this possibility.

To a certain extent that may be the exact point the researchers wanted to make: That concept-mapping can be executed in an “active” yet non-generative fashion. Even reviewing a concept map (as the participants were encouraged to do with any remaining time) can be done very superficially, simply checking to make sure that all the information is present, rather than reflecting on the relationships represented—similar to making a “cheat sheet” for a test and trusting that all the formulas and definitions are there, instead of evaluating the conditions and rationale for applying them.

One may construe this as an argument against concept-mapping as a study technique, if it is so difficult to use effectively. But just because a given tool can be used poorly does not mean it should be avoided entirely; that could be said of any teaching or learning approach. Nor does this necessarily constitute an argument against other elaborative study methods. Explaining a text or diagram, whether to oneself or to others, is another form of elaboration that is well documented as supporting learning[1]. Explanation is an interesting hybrid between elaboration and retrieval: it adds information beyond the source, but it may also demand partial recall of the source’s contents even when the source is present. If the value of explanation lies solely in the retrieval involved, then it should fare worse than pure retrieval and better than pure elaboration.

All of this raises the question, “Better for what?” The tests in this study primarily measured retrieval, with 84% of the points counting the presence of ideas and the rest (from only two questions) assessing inference. Yet even those inference questions depended partially on retrieval, making it ambiguous whether wrong answers reflected a failure to retrieve, comprehend, or apply knowledge. What this study showed most clearly was that retrieval practice is valuable for improving retrieval. Elaboration and other activities may still be valuable for promoting transfer and inference. There could also be an interaction whereby elaboration and retrieval mutually enhance each other, since remembering and drawing inferences are both easier with robust knowledge structures. The lesson may not be that elaborative activities are a poor use of time, but that they need to incorporate retrieval practice to be most effective.

I don’t at all doubt the validity of the finding, or the importance of retrieval in promoting learning. I share the authors’ frustration with the often-empty trumpeting of “active learning,” which can assume ineffective and meaningless forms [2][3]. I also recognize the value of knowing certain information in order to utilize it efficiently and flexibly. My concerns are in interpreting and applying this finding sensibly to real-life teaching and learning.

  • Retrieval is only part of the picture. Educators need to assess and support multiple skills, including and beyond retrieval. There’s a great danger of forgetting other learning goals (such as understanding, applying, creating, evaluating, etc.) when pressured to document success in retrieval.
  • Is it retrieving knowledge or generating knowledge? I also wonder whether “retrieval” may be too narrow a label for the broader phenomenon of generating knowledge. This may be a specific instance of the well-documented generation effect [4], and it may not always be most beneficial to focus only on retrieving the particular facts. There could be a similar advantage to other generative tasks, such as inventing a new application of a given phenomenon, writing a story incorporating new vocabulary words, or creating a problem that could almost be solved by a particular strategy. None of these require retrieving the phenomenon, the definitions, or the solution method to be learned, but they all require elaborating upon the knowledge-to-be-learned by generating new information and deeper understanding of it. Knowledge is more than a list of disconnected facts [5]; it needs a structure to be meaningful [6]. Focusing too heavily on retrieving the list downplays the importance of developing the supporting structure.
  • Retrieval isn’t recognition, and not all retrieval is worthwhile. Most important, I’m especially concerned that the mainstream media’s reporting of this finding may make it too easily misinterpreted. It would be a shame if this were used to justify more multiple-choice testing, or if a well-meaning student thought that accurately reproducing a graph from a textbook by memory constituted better studying than explaining the relationships embedded within that graph.

For the sake of a healthy relationship between research and practice, I hope the general public and policymakers will take this finding in context and not champion it into the latest silver bullet that will save education. Careless conversion of research into practice undermines the scientific process, effective policymaking, and teachers’ professional judgment, all of which need to collaborate instead of collide.

Karpicke, J.D., & Blunt, J.R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science. DOI: 10.1126/science.1199327


[1] Chi, M.T.H., de Leeuw, N., Chiu, M.H., & LaVancher, C. (1994). Eliciting self-explanations improves understanding. Cognitive Science, 18, 439-477.
[2] For example, see the “Teacher A” model described in:
Scardamalia, M., & Bereiter, C. (1991). Higher levels of agency for children in knowledge building: A challenge for the design of new knowledge media. Journal of the Learning Sciences, 1, 37-68.
(There’s also a “Johnny Appleseed” project description I once read that’s a bit of a caricature of poorly-designed project-based learning, but I can’t seem to find it now. If anyone knows of this example, please share it with me!)
[3] This is one reason why some educators now advocate “minds-on” rather than simply “hands-on” learning. Of course, what those minds are focused on still deserves better clarification.
[4] e.g., Slamecka, N.J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4, 592-604.
[5] In the following study, some gifted students outscored historians in their fact recall, but could not evaluate and interpret claims as effectively:
Wineburg, S.S. (1991). Historical problem solving: A study of the cognitive processes used in the evaluation of documentary and pictorial evidence. Journal of Educational Psychology, 83, 73-87.
[6] For a fuller description of the importance of structured knowledge representations, see:
Bransford, J.D., Brown, A.L., & Cocking, R.R. (2000). How people learn: Brain, mind, experience, and school (Expanded edition). Washington DC: National Academy Press, pp. 31-50 (Ch. 2: How Experts Differ from Novices).