Using student evaluations to measure teaching effectiveness

I came across a fascinating discussion on the use of student evaluations to measure teaching effectiveness upon following this Observational Epidemiology blog post by Mark, a statistical consultant. The original paper by Scott Carrell and James West uses value-added modeling to estimate teachers’ contributions to students’ grades in introductory courses and in subsequent courses, then analyzes the relationship between those contributions and student evaluations. (An ungated version of the paper is also available.) Key conclusions are:

Student evaluations are positively correlated with contemporaneous professor value‐added and negatively correlated with follow‐on student achievement. That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value‐added in follow‐on courses).

We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow‐on related curriculum.

Not having closely followed the research on this, I’ll simply note some key comments from other blogs.

Direct examination:

Several bloggers have posted links suggesting endorsement of the paper’s conclusions, among them George Mason University professor of economics Tyler Cowen, Harvard professor of economics Greg Mankiw, and Northwestern professor of managerial economics Sandeep Baliga. Michael Bishop, a contributor to Permutations (“official blog of the Mathematical Sociology Section of the American Sociological Association”), provides more detail in his analysis:

In my post on Babcock’s and Marks’ research, I touched on the possible unintended consequences of student evaluations of professors.  This paper gives new reasons for concern (not to mention much additional evidence, e.g. that physical attractiveness strongly boosts student evaluations).

That said, the scary thing is that even with random assignment, rich data, and careful analysis there are multiple, quite different, explanations.

The obvious first possibility is that inexperienced professors (perhaps under pressure to get good teaching evaluations) focus strictly on teaching students what they need to know for good grades. More experienced professors teach a broader curriculum, the benefits of which you might take on faith but needn’t because their students do better in the follow-up course!

After citing this alternative explanation from the authors:

Students of low value added professors in the introductory course may increase effort in follow-on courses to help “erase” their lower than expected grade in the introductory course.

Bishop also notes that motivating students to invest more effort in future courses would be a desirable effect of good professors as well. (But how to distinguish between “good” and “bad” methods for producing this motivation isn’t obvious.)


Others critique the article and defend the usefulness of student evaluations with observations that provoke further fascinating discussions.

Andrew Gelman, Columbia professor of statistics and political science, expresses skepticism about the claims:

Carrell and West estimate that the effects of instructors on performance in the follow-on class is as large as the effects on the class they’re teaching. This seems hard to believe, and it seems central enough to their story that I don’t know what to think about everything else in the paper.

At Education Sector, Forrest Hinton expresses strong reservations about the conclusions and the methods:

If you’re like me, you are utterly perplexed by a system that would mostly determine the quality of a Calculus I instructor by students’ performance in a Calculus II or aeronautical engineering course taught by a different instructor, while discounting students’ mastery of Calculus I concepts.

The trouble with complex value-added models, like the one used in this report, is that the number of people who have the technical skills necessary to participate in the debate and critique process is very limited—mostly to academics themselves, who have their own special interests.

Jeff Ely, Northwestern professor of economics, objects to the authors’ interpretation of their results:

I don’t see any way the authors have ruled out the following equally plausible explanation for the statistical findings.  First, students are targeting a GPA.  If I am an outstanding teacher and they do unusually well in my class they don’t need to spend as much effort in their next class as those who had lousy teachers, did poorly this time around, and have some catching up to do next time.  Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.

In agreement, Ed Dolan, an economist who was also for ten years “a teacher and administrator in a graduate business program that did not have tenure,” comments on Jeff Ely’s blog:

I reject the hypothesis that students give high evaluations to instructors who dumb down their courses, teach to the test, grade high, and joke a lot in class. On the contrary, they resent such teachers because they are not getting their money’s worth. I observed a positive correlation between overall evaluation scores and a key evaluation-form item that indicated that the course required more work than average. Informal conversations with students known to be serious tended to confirm the formal evaluation scores.


Dean Eckles, a PhD candidate at Stanford’s CHIMe lab, offers this response to Andrew Gelman’s blog post (linked above):

Students like doing well on tests etc. This happens when the teacher is either easier (either through making evaluations easier or teaching more directly to the test) or more effective.

Conditioning on this outcome is conditioning on a collider that introduces a negative dependence between teacher quality and other factors affecting student satisfaction (e.g., how easy they are).
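Eckles’s collider point is easy to demonstrate with a toy simulation (the variable names, effect sizes, and selection threshold below are my own inventions for illustration, not taken from the paper): if student performance depends on both teacher quality and course easiness, then looking only at high-performing students induces a negative correlation between quality and easiness, even though the two are independent overall.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

quality = rng.normal(size=n)   # teacher effectiveness
easiness = rng.normal(size=n)  # independent of quality by construction

# Students do well when the teacher is effective OR the course is easy.
performance = quality + easiness + rng.normal(size=n)

# Unconditionally, quality and easiness are uncorrelated.
r_all = np.corrcoef(quality, easiness)[0, 1]

# Condition on the collider: keep only high-performance observations.
high = performance > 1.0
r_cond = np.corrcoef(quality[high], easiness[high])[0, 1]

print(f"corr(quality, easiness), overall:            {r_all:+.3f}")  # close to zero
print(f"corr(quality, easiness) | high performance:  {r_cond:+.3f}")  # clearly negative
```

Among the selected high performers, knowing the teacher was not very effective makes it more likely the course was easy (and vice versa), which is exactly the spurious negative dependence Eckles describes.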

From Jeff Ely’s blog, a comment by Brian Moore raises this critical question:

“Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.”

Do we know this for sure? Perhaps they know when they have an outstanding teacher, but by definition, those are relatively few.

Closing thoughts:

These discussions raise several key questions:

  • how to measure good teaching;
  • tensions between short-term and long-term assessment and evaluation[1];
  • how well students’ grades measure learning, and how grades impact their perception of learning;
  • the relationship between learning, motivation, and affect (satisfaction);
  • but perhaps most deeply, the question of student metacognition.

The anecdotal comments others have provided about how students respond on evaluations would be more fairly couched in terms of “some students.” Given the considerable variability among students, interpreting student evaluations requires accounting for those individual differences in order to tease out the actual teaching and learning that underlie self-reported perceptions. Buried within those evaluations may be a valuable signal masked by a lot of noise, or, more problematically, multiple signals that cancel and drown each other out.

[1] For example, see this review of research demonstrating that training which produces better short-term performance can produce worse long-term learning:
Schmidt, R.A., & Bjork, R.A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science, 3, 207-217.

5 thoughts on “Using student evaluations to measure teaching effectiveness”

  1. Though I didn’t take a clear position in my blog post on the paper, I think Jeff Ely’s alternative explanation for the findings deserves more attention than was given in the paper. I’m not ready to do away with student course evaluations, but I think they have enough demonstrated problems that, at the very least, they need to be supplemented by other forms of evaluation. Third-party evaluation of students’ work and random observation of classroom teaching are options to consider.

  2. Thank you both for your thoughtful comments! I agree about the importance of both improving and augmenting the evaluations. While it’s difficult to track and impossible to control the variability among students (e.g., how metacognitively aware they are, what their learning goals are, the teaching methods under which they’re most successful), we can analyze and control the variability among the questions in the evaluations. I would be very interested in the results of interdisciplinary collaborations with psychometricians to examine whether this effect still holds across different question types (such as low-inference vs. high-inference, reflections on learning vs. perceptions of value, etc.).

    Michele, when you have the chance, could you share some references for the other studies that you mention?

    And triangulating with other forms of data such as additional external assessments and observations would most certainly help elucidate the outcomes and processes in more detail.

  3. I’ve enjoyed all the comments, especially Gelman’s and Hinton’s. What nobody has commented on is the quality of the form itself and its psychometric properties. We know a lot about the design of good forms. Of course, I would have to get the whole article, but the overall questions on which most studies are based are the ones that foster halo effects and biases of all kinds, including grade inflation. Low-inference questions are much more reliable, and from those you can get a composite score. And when you actually make students think about their own learning and rate how much they learned in the course (also broken down into various categories), that seems to absorb other biases. (The attractiveness study never looked at that, and I do have to wonder how much of its effect would still be present after learning, or self-perceptions of learning, is accounted for.)

  4. Good points, Michele! If only colleges would incorporate your suggestion by downplaying, if not eliminating, the “overall” rating and emphasizing important “low-inference” questions. The low-inference question I’m particularly eager to have access to, in the context of this study, is “On average, how many hours a week did you study or work on homework?”

  5. What a great compilation of various points of view on this topic!

    Anecdotally, and to add my own two cents to the discussion, I do know one person who has managed to receive student evaluations well above the institutional average despite giving out grades that are below the institutional average. As best I can tell, she manages to do this at least in part by setting low grade expectations at the beginning of the course and creating a sort of “tough love” relationship with students throughout the semester (e.g., “This is really difficult, but you’re going to be so proud of yourselves at the end!”). It’s an approach that I have not yet succeeded in emulating.

    On a more cynical note, I think all of us instructors would be wise to apply the philosophy of “teaching to the test” to our end-of-semester evaluations. For instance, if we know there is going to be an item like “displays enthusiasm for material” it would probably be smart to explicitly state things like “I’m so enthusiastic about this!” repeatedly throughout the term, rather than just believing that students will implicitly pick up on a positive attitude.
