A Discussion of “Measuring Success in Education: The Role of Effort on the Test Itself”
US students tend to score lower than we Americans would like on international tests.
For example, on the 2012 Programme for International Student Assessment (PISA), taken by high school students in 65 countries around the world, US students ranked 36th in math. (p. 292)
For a country that considers itself at the top in many measures, that is a bit painful. It sets off a round of soul searching to figure out what is going on and how to fix it.
The possible reasons examined typically include the school system, socioeconomic factors, and our culture. (p. 292)
But what if it is not the system that is the problem?
What if our students are just not trying that hard on the test?
That would mean the tests are not showing that our students are less capable but instead that they are less motivated.
That is what Gneezy, List et al. (2019) set out to determine.
Often these internationally administered tests have no impact on our students’ lives. By contrast, it is understandable that many students taking the SAT are highly motivated because college admittance and scholarships are on the line.
But a few days of taking a standardized test that does not affect your grade, your future college, or anything — how motivating is that?
Some of you may be thinking, “Well, I always do my best. Why wouldn’t I try my hardest?”
Others may remember it feeling like a couple of days off from the normal drudgery of school, a chance to take it easy.
Essentially, the question does come down to motivation. Intrinsically motivated students will try their hardest on the exam regardless of whether the results have any impact on their lives. Doing their best is its own reward.
But extrinsically motivated students do not see the point in great exertion for little to no reward.
Measuring differences in test effort across countries is exactly what the authors set out to do.
The Experiment
They chose to conduct their experiment using 4 high schools in Shanghai, China, and 2 in the United States.
They chose these 2 countries on purpose. For one, they say Shanghai was ranked first in mathematics on the 2012 PISA test (p. 294). For another, they cite evidence
… from descriptive studies showing that compared to the United States, East Asian parents, teachers, and students put more emphasis on diligence and effort…Traditional East Asian values also emphasize the importance of fulfilling obligations and duties…These include high academic achievement, which is regarded as an obligation to oneself as well as to the family and society…Hence, East Asian students may put forth higher effort on standardized tests if doing well on those tests is considered an obligation. (p. 294)
The authors acknowledge that the sample they tested is not representative of all students who take the test around the world, so the results should not be extrapolated beyond this experiment. But they do try to have a variety of students in the experiment. The schools included are (p. 295)
- a high-performing US private boarding school,
- a large US public high school with both low- and average-performing students,
- a low-performing Shanghai school,
- an average-performing Shanghai school, and
- two above average-performing Shanghai schools.
On test day, they launch their experiment: in each country, they divide the participating students into 2 groups, a control group and an experimental group.
The control group simply takes the test.
Those in the experimental group are offered a financial incentive just before the test begins. They are given $25 (or the equivalent value in China) but are then told they will lose $1 for each incorrect answer. Skipped questions are considered incorrect. (p. 295)
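To make the payoff rule concrete, here is a minimal sketch of the incentive, assuming only what is described above: a $25 endowment minus $1 per question missed, with skipped questions counted as missed (the function name is mine, not the authors’):

```python
def incentive_payout(num_questions: int, num_correct: int) -> int:
    """Payout under the experimental incentive: start with $25,
    lose $1 for each question answered incorrectly or skipped."""
    num_missed = num_questions - num_correct  # skips count as incorrect
    return 25 - num_missed

# A student who answers 20 of the 25 questions correctly keeps $20;
# answering everything correctly keeps the full $25.
print(incentive_payout(25, 20))  # 20
print(incentive_payout(25, 25))  # 25
```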
All groups then take the 25-minute, 25-question experimental math section the authors composed from questions on prior PISA exams. (p. 294)
I assume this is like the experimental sections the SAT and other standardized tests would include to try out new questions. I think this process has changed some for the SAT over the years, but essentially it is for their use and does not impact your final score. This way, the authors could have complete control of the experiment without impacting the PISA results.
They then compare the results of each experimental group of students within each country to the control group in that country to determine if effort level changed.
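The paper’s actual estimation is more involved than this, but as a minimal sketch of that comparison, a simple difference in mean scores between the incentive and control groups within a country captures the idea (the function name and the scores below are hypothetical, made up purely for illustration):

```python
from statistics import mean

def effort_change(control_scores: list[float], incentive_scores: list[float]) -> float:
    """Naive difference in mean scores between the incentive (experimental)
    group and the control group within one country; a stand-in illustration,
    not the authors' actual estimation strategy."""
    return mean(incentive_scores) - mean(control_scores)

# Hypothetical scores out of 25 questions:
us_control = [12, 14, 10, 15, 13]
us_incentive = [14, 16, 13, 17, 15]
print(effort_change(us_control, us_incentive))  # 2.2
```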
Results
US students are extrinsically motivated.
The impact on the Shanghai students is negligible, indicating they were already performing at their ability level without a financial incentive.
However, the US students with money on the line saw a significant increase in scores across all ability levels. In fact, the authors found that those in the middle of the distribution had the largest incentive effects. (p. 298)
They use 3 proxies to measure effort (p. 300), with a small code sketch following the list:
- the number of questions attempted,
- the proportion of attempted questions answered correctly, and
- the proportion of questions correct, that is, overall test scores.
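As a rough illustration, here is a minimal sketch of how these three proxies might be computed from a student’s per-question outcomes (the "correct"/"incorrect"/"skipped" encoding is a hypothetical stand-in, not the authors’ actual data format):

```python
def effort_proxies(answers: list[str], num_questions: int = 25) -> dict:
    """Compute the three effort proxies from per-question outcomes.

    Each entry of `answers` is 'correct', 'incorrect', or 'skipped'
    (a hypothetical encoding, not the authors' data format).
    """
    attempted = sum(a != "skipped" for a in answers)
    correct = sum(a == "correct" for a in answers)
    return {
        "questions_attempted": attempted,
        "accuracy_on_attempted": correct / attempted if attempted else 0.0,
        "overall_score": correct / num_questions,
    }

# Example: a student attempts 22 of 25 questions and gets 18 right.
student = ["correct"] * 18 + ["incorrect"] * 4 + ["skipped"] * 3
print(effort_proxies(student))
# {'questions_attempted': 22, 'accuracy_on_attempted': 0.8181..., 'overall_score': 0.72}
```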
US students attempted more questions with the incentive. In particular, the control group attempted far fewer questions in the second half of the test than the experimental group did. (p. 302)
However, it was not just guessing: they also increased the percentage of attempted questions answered correctly, indicating more thinking and more effort by the students. (p. 302)
Finally, the overall number of questions answered correctly increased by 5%. (p. 302)
Why are motivations different?
That is a question beyond the authors’ study.
Their goal was to inform our policy reactions. When these low international scores are reported, there are often calls to overhaul our education system.
One reform I would not object to is starting school later to better align with teenagers’ circadian rhythms.
Start School Later, See Scores Rise?
That study used a basic skills test that was likely low stakes for the students, so they may not have been trying their best. Even so, the change in school start time did show some modest improvements. Perhaps the test results would be more informative if the students were more incentivized?
However, I think the lesson we are to learn from Gneezy, List et al. (2019) is to be cautious of reforms that could prove to be unnecessary and possibly harmful.
What should we do then? Is knowing “half the battle”?
Or, do we start paying our students to take tests?
I am sure the students would not say no, but I think the rest of the world would object.
I think this study shows us that further research into why we are more extrinsically motivated could be interesting.
One fact that stood out to me is that the authors say Germany also underperformed on these tests (p. 292). Historically, our educational system comes from Germany. Is there something about this system that is more extrinsically focused?
In contrast, Finland scored higher than expected (p. 292). Their educational system has been in the news lately for producing such good results. Is there then something about their system that is more intrinsically focused?
Or is it simply cultural factors, not easily changed?
Finally, is this an issue to be fixed or just information that allows us to understand our low performance on international tests?
As a teacher, I have to say I wish my students were more intrinsically motivated. Learning for the sake of learning, ah, the dream!
The reality is that if an assignment or lesson is not graded, the vast majority of students will not do it.
Is it part of the capitalist system that any cost incurred should be offset by a benefit?
And even if that were the source of this behavior, is that a flaw that should be fixed or just an interesting cultural fact to be aware of?
I have more questions than answers, but at least now we have a good excuse to assuage our bruised egos the next time international testing shows US students lagging behind other countries.
“We could do better if we tried…you just need to make it worth our while!”
References:
Gneezy, Uri, John A. List, Jeffrey A. Livingston, Xiandong Qin, Sally Sadoff, and Yang Xu (2019). “Measuring Success in Education: The Role of Effort on the Test Itself.” AER: Insights, 1(3): 291–308.
By Ellen Clardy, PhD