Thursday, April 26, 2012

Computer Scoring Open Ended History Questions

Justin Reich:

Last week, Barbara Chow, the director of the education program at the Hewlett Foundation, explained to a meeting of grantees why the foundation was investing in research concerning Automated Essay Score Predictors as part of their strategy of expanding opportunities for Deeper Learning in schools. (Disclosure: I run a Hewlett-funded research project, and Hewlett has indirectly paid me a salary for four years, though Harvard is my direct employer. That said, when I had a chance to speak for 15 minutes at the grantee meeting, I devoted the entire time to explaining how their Open Educational Resources grantmaking program could potentially be expanding educational inequalities. So there is some evidence that I try to call it as I see it.) Again, it's the kind of argument that raises eyebrows. "If we replace human essay raters with machines, students will have a richer learning experience." Oh, really?

First point: there are two consortia (PARCC and SBAC) developing new tests for the Common Core Standards. In 2014 or 2015, we're going to have some brand-new tests in states all across the country. We have an opportunity to make them better. Here's how Barbara makes the case that Automated Essay Score Predictors can do that:

Here is an example of a test question from the AP US History test (2006 Released Exam):

Which of the following colonies required each community of 50 or more families to provide a teacher of reading and writing?

A. Pennsylvania
B. Massachusetts
C. Virginia
D. Maryland
E. Rhode Island

Now, this is the kind of question that makes most educators go berserk. A student can have a deep, rich understanding of early American history and not know that factoid. So what if we could replace questions like that with questions like this (thanks to the College Board for sharing):

By the early twentieth century, the United States had emerged as a world power. Historians have proposed various dates for the beginning of this process, including the three listed below. Choose one of the three dates below or choose one of your own, and write a paragraph explaining why this date best marks the beginning of the United States' emergence as a world power. Write a second paragraph explaining why you did not choose the other dates. Support your argument with appropriate evidence.

  • 1898 (Spanish-American War)
  • 1917 (Entry into the First World War)
  • 1941 (Entry into the Second World War)

I have some quibbles, but this is a much, much better question. It calls upon several skills broadly identified with deeper learning: solving an ill-structured problem—one without a single correct answer and requiring tacit knowledge—and communicating that answer in a persuasive, evidence-based argument.

The problem is that the computer can score this for structure -- for having the form of an academic argument based on evidence -- but it does not know the whole scope of the problem domain. It can't know all the evidence and facts relating to the issue, so a savvy student can just make stuff up. For example:

Among naval historians, the standard definition of a "world power" is one that can maintain two high seas fleets in two separate oceans indefinitely, while retaining a significant reserve and shipbuilding capacity. In Michael Doyle's authoritative history of the US Navy, Quahogs and Other Submersibles, a Rumination, until 1923 and the launch of the US 5th Fleet in the Pacific the US could only maintain one high-seas fleet. Therefore, in 1923, the US became a naval "world power," a status it maintains to this day.

I chose this date because I am a naval historian, and I am applying the standard definition of "world power" as used by naval historians.

If you're my teacher in an actual class -- and in particular if you know what a clever smartass I can be -- you'll see through this in a second. If you're a computer, what are you going to do, check my facts? All the computer can say is that this more or less has the form of an academic response that includes evidence.
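To make that concrete, here's a toy sketch of what "scoring the form" can look like. This is entirely my own illustration, not how e-rater or any production scoring engine actually works (real systems use many more features and a trained statistical model), but notice that every feature is about the surface of an argument -- connectives, dates, quoted terms, length -- and none of them can check whether the 5th Fleet or the quahog book exists.

```python
import re

# Toy surface-feature "score predictor." Illustrative only: every
# feature rewards the *form* of an evidence-based argument, not the
# truth of any claim it makes.

ARGUMENT_MARKERS = ["therefore", "because", "however", "while", "until"]

def toy_score(essay: str) -> float:
    words = essay.split()
    sentences = [s for s in re.split(r"[.!?]", essay) if s.strip()]

    features = {
        # Length and sentence development (form, not content).
        "word_count": min(len(words) / 150, 1.0),
        "avg_sentence_len": min((len(words) / max(len(sentences), 1)) / 25, 1.0),
        # Discourse connectives that look like reasoning.
        "connectives": min(
            sum(essay.lower().count(m) for m in ARGUMENT_MARKERS) / 3, 1.0
        ),
        # Things that look like evidence: dates and quoted key terms.
        "dates": min(len(re.findall(r"\b(?:1[89]|20)\d{2}\b", essay)) / 2, 1.0),
        "quoted_terms": min(len(re.findall(r'"[^"]+"', essay)) / 2, 1.0),
    }
    # Equal weights for the sketch; a real predictor would fit weights
    # to a training sample of human-scored essays.
    return round(sum(features.values()) / len(features) * 6, 1)  # 0-6 scale

if __name__ == "__main__":
    fabricated = (
        'Among naval historians, the standard definition of a "world power" '
        "is one that can maintain two high seas fleets in two separate oceans "
        "indefinitely. Until 1923 and the launch of the US 5th Fleet in the "
        "Pacific, the US could only maintain one high-seas fleet. Therefore, "
        'in 1923, the US became a naval "world power."'
    )
    # The invented naval-history answer scores respectably, because the
    # surface form of evidence is all there.
    print(toy_score(fabricated))
```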

You can also note that temps working through stacks of these things in data centers can't really score this kind of response accurately either, but that's an argument against high-stakes standardized tests in general, not in favor of computer scoring.

This is why the Common Core standards are so strict about keeping everything limited to textual analysis and evidence from texts. It is partly a revival of New Criticism, but mostly it is to make sure that the valid answers to a question are constrained to bits of text that can be identified by a computer.

3 comments:

Dina said...

Michael Doyle's authoritative history. Snort.

Justin Reich said...

Hi Tom,

We'll trade comments today :)

If you think that source-based essays could potentially be part of the solution, then you should read the original study. It was one of the first to suggest that automated essay score predictors performed as reliably as humans on those content-based questions.

Also, what do you think of this point: at present, 100% of these essays are read in about 90 seconds by rushed humans. What if 10% of essays (the training sample) were read for 5 or 7 minutes by humans with much greater historical training, time to Google facts, etc.? If students knew that there was a chance their essay would be read by someone looking really closely, would that disincentivize gaming behaviors? In my classroom, I graded 1 out of every 10 journal or homework assignments at random, and that was often enough to get good-quality work on the whole...

In this whole discussion, I think the key point is this: most people's objections to computer score predictors are actually objections to standardized tests.

(Also, I should say in all public forums, that an argument for more writing on standardized tests is not an argument for more standardized tests. We should have less and better testing, and use it more wisely).

Tom Hoffman said...

Hi Justin,

The problem is that the standards, curriculum and tests are already being bent to conform to the assessment technologies. The CCSS standards *never* require a student to generate an independent interpretation of a text, and never require the student to make a connection outside the text (except with a specifically defined text).

I don't actually believe this is because the employees of the testing companies who wrote these standards are such strict adherents to mid-20th century approaches to literary criticism. I think they did it to make the tests computer scorable.

No other country limits the scope of their standards in this way.