From David Rockwell’s “Unpacking Imagination,” here’s a proposal for an inexpensive, portable playground kit that lets kids build their own playland:

This design comes from Rockwell’s architecture firm, as he describes:

Although traditional playgrounds can easily cost in the millions to build, boxed imagination playgrounds can be put together for under $10,000. (Land costs not included!) The design below is one that my architecture firm has done in collaboration with the New York City Parks Department and KaBoom, a nonprofit organization. But it needn’t be the only one out there. There are a lot of ways to build a playground— and a lot of communities in need of one. Let a thousand portable playgrounds bloom.

Just another example of how a little bit of structure (not too much) can enable lots of imaginative play.

Direct instruction of discovery learning

From Lisa Guernsey’s A False Debate about Preschool (and K-12) Learning:

When a child sees an intriguing model of how to ask questions, explore and test hypotheses, that child will want to do the same.

What children need are more learning environments – not just in preschool, but throughout their early, middle and later years of school – that give them day-to-day experience with adults who offer them effective and engaging models of what it looks like to learn.

Maybe we could think of this as direct instruction of discovery learning, especially if we note that modeling and imitation learning can be much more powerful and direct than declarative description / prescription.

Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors identified in that discussion, since I think these are important and understandable concepts for the general public to consider.

  • Margin of error: Ms. Isaacson’s 7th-percentile score came with a margin of error that ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions.
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When changes that produce grade transitions are more likely than changes that don’t produce grade transitions, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible (as noted in a comment by journalist Steve Sailer):

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top range is more difficult and unlikely than improving in the middle range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., diminishing gains in weight loss, long jump, or reaction time as the values become more extreme) do translate into nonlinearity, but because of physiological limitations rather than limits of the measuring instrument. That said, the above quote still holds true because of the measurement ceiling, which masks variability at the top among students who could have scored higher, while leaving them fully exposed to losses from missing a question (whether from carelessness, a bad day, or bad luck in the question selection for the test). These students have more opportunities to be hurt by bad luck than helped by good luck because the test imposes a ceiling: it doesn’t ask all the harder questions that they perhaps could have answered.

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform the students in advance of the grading expectations. That means that there will always be nonlinearity, since now the students (and teachers) are “boundary-conscious” and behaving in ways to deliberately try to cross (or not cross) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.
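
To make the few-categories and ceiling problems above concrete, here is a minimal Python sketch. The cut points (40, 60, 80), the 100-point ceiling, and the function names are all hypothetical, chosen only for illustration; they are not the actual scoring rules behind the tests discussed in the article.

```python
from bisect import bisect_right

# Hypothetical cut points separating levels 1-4 on a 0-100 raw scale.
CUTOFFS = [40, 60, 80]
# Hypothetical test ceiling: the test cannot report anything above this.
MAX_RAW = 100


def reported_raw(latent_score):
    """Raw score the test can report for a given latent (uncapped) score."""
    return min(latent_score, MAX_RAW)


def level(latent_score):
    """Map a latent score to one of four levels (1-4) via the cut points."""
    return 1 + bisect_right(CUTOFFS, reported_raw(latent_score))


def level_gain(before, after):
    """Change in level between two latent scores."""
    return level(after) - level(before)


if __name__ == "__main__":
    print("low 3 to high 3 (61 -> 79):", level_gain(61, 79))
    print("high 2 to low 3 (59 -> 60):", level_gain(59, 60))
    print("beyond the ceiling (100 -> 130):", level_gain(100, 130))
```

Under these assumed cut points, an 18-point gain inside level 3 registers as no change, a 1-point gain across a boundary registers as a full level, and any growth beyond the ceiling is invisible even in the raw score.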

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (as in multiyear smoothing, as another commenter suggested), or values from multiple different assessments.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.
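
Recommendation 2 can be illustrated with a small simulation. This is only a sketch under simple assumptions of my own (normally distributed true effects, independent and normally distributed yearly noise, and a hypothetical function name and parameters), not the model any state or district actually uses. It selects the top-rated group based on an estimate averaged over one or more years, then measures how far that group falls back on a fresh single-year measurement.

```python
import random
import statistics


def reversion_gap(years_averaged=1, n_teachers=10_000, noise_sd=1.0,
                  top_fraction=0.1, seed=0):
    """Average drop for the top-rated group between the estimate used to
    select them and a fresh single-year measurement.

    All distributions and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Assumed true effects: standard normal.
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    # Estimate = mean of `years_averaged` noisy yearly measurements.
    estimate = [
        statistics.fmean(t + rng.gauss(0.0, noise_sd)
                         for _ in range(years_averaged))
        for t in true_effect
    ]
    # A later, independent single-year measurement of the same teachers.
    followup = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    k = max(1, int(n_teachers * top_fraction))
    top = sorted(range(n_teachers), key=lambda i: estimate[i], reverse=True)[:k]
    selected_mean = statistics.fmean(estimate[i] for i in top)
    followup_mean = statistics.fmean(followup[i] for i in top)
    return selected_mean - followup_mean


if __name__ == "__main__":
    for years in (1, 2, 4):
        print(years, "year(s) averaged -> gap", round(reversion_gap(years), 2))
```

In this setup the gap shrinks as more years are averaged, because chance events play a smaller role in deciding who lands at the extreme.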

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Some history and context on VAM in teacher evaluation

In the Columbia Journalism Review’s Tested: Covering schools in the age of micro-measurement, LynNell Hancock provides a rich survey of the history and context of the current debate over value-added modeling in teacher evaluation, with a particular focus on LA and NY.

Here are some key points from the critique:

1. In spite of their complexity, value-added models are based on very limited sources of data: who taught the students, without regard to how or under what conditions, and standardized tests, which are a very narrow and imperfect measure of learning.

No allowance is made for many “inside school” factors… Since the number is based on manipulating one-day snapshot tests—the value of which is a matter of debate—what does it really measure?

2. Value-added modeling is an imprecise method whose parameters and outcomes are highly dependent on the assumptions built into the model.

In February, two University of Colorado, Boulder researchers caused a dustup when they called the Times’s data “demonstrably inadequate.” After running the same data through their own methodology, controlling for added factors such as school demographics, the researchers found about half the reading teachers’ scores changed. On the extreme ends, about 8 percent were bumped from ineffective to effective, and 12 percent bumped the other way. To the researchers, the added factors were reasonable, and the fact that they changed the results so dramatically demonstrated the fragility of the value-added method.

3. Value-added modeling is inappropriate to use as grounds for firing teachers or calculating merit pay.

Nearly every economist who weighed in agreed that districts should not use these indicators to make high-stakes decisions, like whether to fire teachers or add bonuses to paychecks.

Further, it’s questionable how effective it is as a policy to focus simply on individual teacher quality, when poverty has a greater impact on a child’s learning:

The federal Coleman Report issued [in 1966] found that a child’s family economic status was the most telling predictor of school achievement. That stubborn fact remains discomfiting—but undisputed—among education researchers today.

These should all be familiar concerns by now. What this article adds is a much richer picture of the historical and political context for the many players in the debate. I’m deeply disturbed that NYS Supreme Court Judge Cynthia Kern ruled that “there is no requirement that data be reliable for it to be disclosed.” At least Trontz at the NY Times acknowledges the importance of publishing reliable information as opposed to spurious claims, except he seems to overlook all the arguments against the merits of the data:

If we find the data is so completely botched, or riddled with errors that it would be unfair to release it, then we would have to think very long and hard about releasing it.

That’s the whole point: applying value-added modeling to standardized test scores to fire or reward teachers is unreliable to the point of being unfair. Adding noise and confusion to the conversation isn’t “a net positive,” as Arthur Browne from The Daily News seems to believe; it degrades the discussion, at great harm to the individual teachers, their students, the institutions that house them, and the society that purports to sustain them and benefit from them.

Look for the story behind the numbers, not the numbers alone

This time I’ll let the journalists get away with their fondness for reporting the compelling individual story, since the single counterexample is the whole point here.

High-stakes testing was bad enough. But high-stakes evaluating and hiring? This is a great example of the dangers of applying quantitative metrics inappropriately. While value-added modeling may be able to capture properties of the aggregate, it makes occasional errors at the level of the individual. Just one error (whether it’s a factual or exaggerated case, it still illustrates the point) demonstrates the ethical and managerial problems in firing the wrong person based on aggregated data.
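
The gap between aggregate accuracy and individual accuracy can be sketched with a toy simulation. Everything here is an assumption for illustration (normally distributed true effectiveness, measurement noise of equal size, a bottom-10% cutoff, and hypothetical names); it does not reproduce any real value-added model.

```python
import random
import statistics


def bottom_group_summary(n_teachers=5_000, noise_sd=1.0,
                         bottom_fraction=0.1, seed=1):
    """Flag the lowest-measured teachers and summarize their true effects.

    Returns (mean true effect of the flagged group,
             share of the flagged group whose true effect is above the median).
    """
    rng = random.Random(seed)
    # Assumed true effectiveness: standard normal.
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    # Measured score = true effect plus assumed independent noise.
    measured = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    k = max(1, int(n_teachers * bottom_fraction))
    flagged = sorted(range(n_teachers), key=lambda i: measured[i])[:k]
    median_true = statistics.median(true_effect)
    mean_true_flagged = statistics.fmean(true_effect[i] for i in flagged)
    share_above_median = sum(true_effect[i] > median_true for i in flagged) / k
    return mean_true_flagged, share_above_median


if __name__ == "__main__":
    mean_true, share = bottom_group_summary()
    print("mean true effect of flagged group:", round(mean_true, 2))
    print("share of flagged group truly above median:", round(share, 3))
```

In this toy setup the flagged group really is below average as a whole, yet some of its members are above-median teachers who happened to draw bad luck, which is exactly the kind of individual error at issue.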

Nor do I understand the political eagerness to fire teachers so readily. I’m not convinced that teachers are such an abundant resource that we can afford to burn through them so callously. With teacher shortages in multiple areas and a national teacher attrition rate of 15-20%, we would do better to keep, train, and support the teachers we already have, rather than toss them out and discourage new recruits from joining an increasingly unfriendly profession.

While I agree that it’s important to judge teaching by its merits rather than just the years spent, we need to formulate those measurements carefully. Test scores alone give a misleading illusion of greater precision than they actually have and

From positive self-esteem to positive other-esteem and learning

Dealing with differences needs to be encouraged gently, whether with ideas or with people.

As described in “People with Low Self-Esteem Show More Signs of Prejudice”[1]:

When people are feeling bad about themselves, they’re more likely to show bias against people who are different. …People who feel bad about themselves show enhanced prejudice because negative associations are activated to a greater degree, but not because they are less likely to suppress those feelings.

The connection between low self-esteem and negative expectations reminds me of related research on the impact of a value-affirming writing exercise in improving the academic performance of minority students:

From “Simple writing exercise helps break vicious cycle that holds back black students”[2]:

In 2007, [Geoffrey Cohen from the University of Colorado] showed that a simple 15-minute writing exercise at the start of a school year could boost the grades of black students by the end of the semester. The assignment was designed to boost the student’s sense of self-worth, and in doing so, it helped to narrow the typical performance gap that would normally separate them from white students.

After two years, the black students earned higher GPAs if they wrote self-affirming pieces on themselves rather than irrelevant essays about other people or their daily routines. On average, the exercises raised their GPA by a quarter of a point.

And from “15-minute writing exercise closes the gender gap in university-level physics”[3]:

Think about the things that are important to you. Perhaps you care about creativity, family relationships, your career, or having a sense of humour. Pick two or three of these values and write a few sentences about why they are important to you. You have fifteen minutes. …

In a university physics class, Akira Miyake from the University of Colorado used [this writing exercise] to close the gap between male and female performance. … With nothing but his fifteen-minute exercise, performed twice at the beginning of the year, he virtually abolished the gender divide and allowed the female physicists to challenge their male peers.

Helping people feel better about themselves seems like an obvious, “everybody-wins” approach to improving education, social relations, and accepting different ideas.

[1] Allen, T.J., & Sherman, J.W. (2011). Ego Threat and Intergroup Bias: A Test of Motivated-Activation Versus Self-Regulatory Accounts. Psychological Science.

[2] Cohen, G.L., Garcia, J., Purdie-Vaughns, V., Apfel, N., & Brzustoski, P. (2009). Recursive Processes in Self-Affirmation: Intervening to Close the Minority Achievement Gap. Science, 324(5925), 400-403.

[3] Miyake, A., Kost-Smith, L.E., Finkelstein, N.D., Pollock, S.J., Cohen, G.L., & Ito, T.A. (2010). Reducing the Gender Achievement Gap in College Science: A Classroom Study of Values Affirmation. Science, 330(6008), 1234-1237.

The problem isn’t pretty pink princesses, but what becomes of them

There’s nothing wrong with pink. It’s a perfectly fine color. The problem is its arbitrary association with gender[1], to the point where it becomes a code for what girls and boys are “supposed to” like and dislike, and prevents them from judging for themselves what they like based on any dimension other than color.

Nor would I necessarily take issue with princess fantasies, on the grounds that fantasizing oneself as royalty, a dragon-slayer, or a time-traveler can all be healthy exercises in one’s imagination. The deeper problem is that wanting to be a princess is too often about wanting to be pretty, pampered, and protected. To the extent that it’s about something in one’s control, becoming a princess is about being able to marry a prince. I’m not too keen on encouraging young girls to define themselves or build dreams around their marriage prospects. I think the key question to ask girls playing at being princesses is, “What will you do when you’re a princess?”

What does a girl do when she’s a pretty pink princess? What does she do to become one?

Peggy Orenstein and others have dissected the dangers of wanting to be pretty along the lines of promoting consumerism, narcissism, eating disorders, and premature sexualization of girls. At its simplest, I see the ideal of “being valued for what you do and not how you look” as just another expression of the importance of believing that effort and controllable behaviors matter more than intelligence, talent, or looks. I won’t dispute the value of attractiveness or positive self-presentation in influencing success or self-esteem; the issue is that improving one’s appearance is more limited in capacity than improving one’s skills[2] and ultimately more limited in impact. One can only be so average (noting that more average faces are more beautiful), but one can always be more capable.

I also worry about encouraging children to seek some status or reward simply for its own sake. The pleasures of being pretty go beyond ornamenting someone else’s world and receiving an extra boost in attention. Attractiveness shouldn’t be an end in itself, but a stepping-stone toward further positive outcomes—whether building confidence to pursue ambitious goals, landing a CEO or political position where leadership can make a difference, or developing interpersonal skills that help bring others together. Otherwise it’s little more than an uncashed lottery ticket, devoid of real appreciation.

For me, the bottom line is about helping all children to pursue goals which they can control and which will help them develop. I want them to choose games and activities based on how interesting or worthwhile they seem, not some marketing message that arbitrarily dictates preferences around colors and images[3]. I want them to actively create their own questions and ambitions, explore the world, and forge paths toward fulfilling those desires. I want them to focus on boosting what they know and can do, not what they have and how they look, to better support them in tackling future challenges.

So the next time I see your daughter, please understand if I don’t immediately comment on her adorable outfit. I’m probably debating whether to reinforce her perspective-taking, self-regulation, or cognitive flexibility.

[1] LoBue, V., & DeLoache, J.S. (2011). Pretty in pink: The early development of gender-stereotyped colour preferences. To appear in British Journal of Developmental Psychology.

See also:
Chiu, S.W., Gervan, S., Fairbrother, C., Johnson, L.L, Owen-Anderson, A.F.H., Bradley, S.J., & Zucker, K.J. (2006). Sex-dimorphic color preference in children with gender identity disorder: A comparison in clinical and community controls. Sex Roles, 55, 385-395.

[2] There’s a phrase (I thought) I once heard, about the unrealistic yet persistent modern belief in “the infinite perfectibility of the human body” and its application to girls’ striving to be thin and beautiful. Unfortunately I haven’t been able to track it down; if you know its source, please send it along!

[3] I’m not interested in pandering to stereotype with boy-targeted car imagery or girl-targeted pink frills, whether for “good causes” or for some toy company’s profit. Even if topics like online shopping and cosmetic surgery interest more girls in statistics, I would still advocate finding more neutral problem contexts and framing for both boys and girls.

From: Sylvie Kerger, Romain Martin, Martin Brunner. How can we enhance girls’ interest in scientific topics? British Journal of Educational Psychology, 2011; DOI: 10.1111/j.2044-8279.2011.02019.x

How science supplements cognition

Chris Mooney provides some choice excerpts from his interview of astrophysicist Neil deGrasse Tyson on this week’s Point of Inquiry:

Science exists… because the data-taking faculties of the human body are faulty. And what science does as an enterprise is provide ways to get data, acquire data from the natural world that don’t have to filter through your senses. And this ensures, or at least minimizes as far as possible, the capacity of your brain to fool itself.

If it were natural to think scientifically, science as we currently practice it would have been going on for thousands of years. But it hasn’t…. Science as we now practice it [has] been going on for no more than 400 years.

The operations of the universe can be understood through your fluency in math and science, and it’s math and science that give people the greatest challenges in the school system.

It is precisely because they are not “natural” to our thinking that math and science are such powerful tools: They enable us to overcome our natural cognitive biases.

Math and science are perhaps the greatest cultural artifacts that we have, because our appreciation of them is not innate (as opposed to language, music, and visual perception). Rather, our understanding of them derives from the wisdom discovered, constructed, and passed down from others.

Other research-based commentary on “tiger parenting”

For what I hope will be my last post on the subject, I wanted to share some gems I’ve found from my online meanderings following link after link on Amy Chua’s views on parenting. These all draw from relevant research to critique specific practices rather than an imprecise “parenting style.”

1. In this edited interview transcript with Scientific American, Temple University developmental psychologist Laurence Steinberg reviews the literature on many of the specific practices Chua describes, pointing out both the good and the bad. On his “good” list are high expectations, parental involvement, and positive feedback for genuine accomplishment (but not cultivating false self-esteem). On his “bad” list are excessive punishment, being overly restrictive, and squelching autonomy (characteristics of authoritarian rather than authoritative parenting). He further questions Chua’s views on desirable goals for her children and highlights the value of unstructured play for children’s development. Although he mentions cultural differences in parenting and acknowledges that Americans might misperceive Chinese parenting as being more authoritarian than it really is, he doesn’t analyze cultural influences in much depth here.

2. On Parenting Science, Gwen Dewar (an interdisciplinary social scientist whose background also includes psychology) provides a fuller analysis of both the authoritarian / authoritative parenting style dimension and the cultural differences between Chinese and American parenting. Like Steinberg and others, she too affirms the importance of believing in effort over innate ability, noting that this belief is more characteristic of Chinese than of American values. Ironically, she includes more detail than Steinberg on his own research, describing the potential for positive peer pressure among Chinese-American youth, whose peers encourage them to achieve rather than rejecting them for geekiness. Best of all, she highlights that the positive aspects of traditional Chinese parenting can be separated from undesirable authoritarian practices.

3. Finally, on the NY Times Freakonomics blog, Yale professor of law and economics Ian Ayres (who acknowledges being a friend and colleague of Amy Chua’s) delves into the cognitive benefits of some of these parenting practices, rather than their developmental or cultural consequences. While I’m disappointed that he discusses the benefits of “tiger parenting” without strong caveats against its harms (or an acknowledgment that they can be separated), I particularly appreciated his economic analysis of the attitudes and behaviors that may result.

One virtue Ayres extols is delayed gratification, which he quantifies as “the intertemporal marginal rate of substitution, the willingness to forego current consumption in order to consume more in the future.” It’s another lens on the importance of grit, perseverance, and conscientiousness in enduring challenges while pursuing distant goals. He points out research indicating that these skills may be a stronger predictor of future success than intelligence (as measured by IQ).

(Although Ayres didn’t mention it, this further validates Dweck’s research on the importance of believing that effort matters more than innate ability in determining success.)

He also cites Ericsson’s research on the amount of effort necessary to develop expertise. Despite believing that such discipline is likely to transfer over to other pursuits, he admits that he would probably choose skills with more immediate benefits:

My personal bias is in guiding my children toward endeavors (like learning statistics or US History or corporate finance or Python — all subjects of daddy school) that I think are likely to pay higher direct adult dividends than music or sport skills that atrophy in adulthood.

Inclined though I am to agree with him, I wonder how effective such pursuits are as targets for kids to develop discipline and expertise. Aside from the value of music and sports in themselves, they also carry salient milestones—some culturally derived (such as soccer tournaments and numbered Suzuki-method books), but others perceptually evident—that help children self-assess progress. Computer programming has concrete markers of success in getting a program to produce the desired output, but growth in using analytical tools such as statistics is a bit harder for a youngster to perceive and appreciate.

Oddly, Ayres portrays Chua’s methods as an effective “taking choice off the table” technique for building discipline, despite this crucial difference between his parenting approach and hers: he explicitly involved his children in the initial choice process and explained the pros and cons of their choices, whereas Chua imposed her choices on her children. That initial commitment by the child is key. Without it, the child is simply following rules divorced from meaning. With it, the child learns to connect desire with dedication and goal with process.

Collectively, these articles echo the fundamental values of attributing success to effort, nurturing intrinsic motivation, and setting high expectations that I summarized earlier, while also adding a richer perspective on cultural differences, peer pressure, and delayed gratification in promoting perseverance. While these articles haven’t and won’t receive the audience that brazen storytelling attracts, they give voice to relevant research that too often whispers quietly from the archives.