The e-rater® engine scores essays by extracting a set of features representing important aspects of writing quality from each essay. These features must not only be predictive of readers' scores, but must also have some logical correspondence to the features that readers are instructed to consider when they award scores. These scoring features are then combined in a statistical model to produce a final score estimate, with the weight of each feature determined by a statistical process designed to maximize the agreement with human scoring.
The e-rater scoring engine currently uses the following features:
- content analysis based on vocabulary measures
- lexical complexity/diction
- proportion of grammar errors
- proportion of usage errors
- proportion of mechanics errors
- proportion of style comments
- organization and development scores
- features rewarding idiomatic phraseology
The weighting of features to assign a total score to an essay can be done in a way tailored to a particular prompt, or in a "generic" fashion, so that the same e-rater model can be used to score responses to a variety of prompts. Work has also been done to establish a vertically linked scale of K–12 writing scores across grades based on the e-rater engine, known as the Developmental Writing Scale.
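To make the idea of weighting concrete, here is a minimal sketch of a linear model that combines feature values into a score. Everything in it is an assumption for illustration: the feature names, the values, the weights, and the two model dictionaries are invented, and the real e-rater features and weights are not published in this text. It only shows how the same essay can be scored with a "generic" weight set or a prompt-specific one.

```python
# Hypothetical, pre-scaled feature values for one essay.
# Names and numbers are illustrative assumptions, not e-rater's real features.
essay = {
    "grammar_error_rate": 0.02,
    "mechanics_error_rate": 0.01,
    "organization": 0.8,
    "lexical_complexity": 0.6,
}

# Two hypothetical weight sets: one shared across prompts, one tuned to a prompt.
# Error rates get negative weights because more errors should lower the score.
GENERIC_MODEL = {
    "intercept": 1.0,
    "weights": {
        "grammar_error_rate": -10.0,
        "mechanics_error_rate": -10.0,
        "organization": 3.0,
        "lexical_complexity": 2.0,
    },
}
PROMPT_MODEL = {
    "intercept": 1.0,
    "weights": {
        "grammar_error_rate": -8.0,
        "mechanics_error_rate": -12.0,
        "organization": 3.5,
        "lexical_complexity": 1.5,
    },
}


def predict_score(features, model):
    """Return the intercept plus the weighted sum of the feature values."""
    return model["intercept"] + sum(
        weight * features[name] for name, weight in model["weights"].items()
    )


generic_score = predict_score(essay, GENERIC_MODEL)
prompt_score = predict_score(essay, PROMPT_MODEL)
print(f"generic: {generic_score:.2f}, prompt-specific: {prompt_score:.2f}")
```

The scoring function is identical in both cases; only the weights change, which is the difference between a generic and a prompt-specific model as described above.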
The features used for e-rater scoring are the result of nearly two decades of natural language processing research at ETS, and each feature may itself be composed of a multiplicity of independent sub-features. For instance, the grammatical error feature includes modules for detecting errors in preposition usage, run-on sentences and errors in subject-verb agreement. The e-rater engine is continually updated to reflect advances in natural language processing that can be applied to student texts.
Every year more than a million GRE essays cross the desks of ETS essay raters. These same submissions slide through the subroutines of e-rater, an automated scoring program developed by ETS. With a scoring speed of 800 essays per second, e-rater could evaluate every GRE essay from 2013–2014 (about 1.1 million submissions) in under 25 minutes. In that same time, a human rater will usually score around 10 essays.
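The arithmetic behind that comparison is easy to check. The sketch below uses only the figures quoted above (800 essays per second, about 1.1 million essays, roughly 10 essays per human rater in 25 minutes); the single-rater estimate at the end is my own extrapolation from those figures, not a number from ETS.

```python
ESSAYS = 1_100_000                 # approximate GRE essays in 2013-2014
ERATER_ESSAYS_PER_SECOND = 800     # quoted e-rater scoring speed

erater_seconds = ESSAYS / ERATER_ESSAYS_PER_SECOND
erater_minutes = erater_seconds / 60
print(f"e-rater: {erater_seconds:.0f} seconds, about {erater_minutes:.1f} minutes")

# Extrapolation (assumption): one human scoring about 10 essays every 25 minutes.
HUMAN_ESSAYS_PER_BATCH = 10
BATCH_MINUTES = 25
human_hours = ESSAYS / HUMAN_ESSAYS_PER_BATCH * BATCH_MINUTES / 60
print(f"one human rater: about {human_hours:,.0f} hours")
```

At 1,375 seconds, e-rater finishes in roughly 23 minutes, which is where "under 25 minutes" comes from.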
This feat of education-automation raises certain questions. Do humans still read and score essays? When e-rater and a human score the same essay, do they give the same score? Do they even look at all the same features—from the grammar, vocabulary, and organization to the logic, evidence, and creativity? The answers may surprise you.
E-rater Monitors Rather Than Replaces Human Essay Scores
In 2008, six years after Analytical Writing became part of the GRE, ETS began to use e-rater as a quality control on human scores. Each essay receives a human score and an e-rater score on a scale of 0 to 6, low to high. If these scores differ by less than 0.5, the human score stands. Otherwise, a second human scores the essay, and the two human scores are averaged to get a final score. The e-rater score won't supplant the human score(s).
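The rule in that paragraph can be written as a small function. This is a sketch of the procedure as described here, not ETS's actual code; the function name, the parameter names, and the choice to raise an error when a second human score is needed but missing are all my own assumptions.

```python
def final_essay_score(human_score, erater_score, second_human_score=None,
                      threshold=0.5):
    """Apply the "check score" rule described above.

    If the human and e-rater scores differ by less than `threshold`,
    the human score stands. Otherwise a second human score is required,
    and the final score is the average of the two human scores.
    The e-rater score never enters the final score directly.
    """
    if abs(human_score - erater_score) < threshold:
        return human_score
    if second_human_score is None:
        raise ValueError("scores disagree: a second human score is required")
    return (human_score + second_human_score) / 2


print(final_essay_score(4, 4.3))        # scores agree, human score stands
print(final_essay_score(4, 5.0, 5))     # scores disagree, two humans averaged
```

Note that a difference of exactly 0.5 is not "less than 0.5," so under this reading it triggers a second human score.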
E-rater simply supplies a "check score"—that is, a score the human rater is checked against. Humans can be biased and inconsistent, resulting in scores that are inaccurate and unfair. E-rater escapes some of the foibles of human faculties. For instance, whereas humans tend to deviate from standard scoring rules over time, e-rater always applies the same scoring algorithm (annual software upgrades aside).
E-rater and Human Raters Give (Nearly) Equal Scores for 98% of GRE Essays
If e-rater sets the bar for consistency, then human raters seem to do a decent job of clearing it. The human score and the e-rater score fall within 1 point of each other for about 98% of essays, according to a 2014 report from ETS.
Still, that 1 point can be quite significant: a "Strong" essay gets a score just 1 point higher than an "Adequate" one.
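For a sense of how an agreement figure like this is computed, here is a minimal sketch. The score lists are made up for illustration and have nothing to do with the ETS data, and I am assuming "within 1 point" means an absolute difference of at most 1.

```python
def within_one_point_rate(human_scores, erater_scores, tolerance=1.0):
    """Fraction of essays whose two scores differ by at most `tolerance`."""
    pairs = list(zip(human_scores, erater_scores))
    agreeing = sum(1 for human, machine in pairs if abs(human - machine) <= tolerance)
    return agreeing / len(pairs)


# Made-up scores for five essays (not real GRE data).
human = [4, 3, 5, 2, 6]
erater = [4.2, 3.9, 3.5, 2.0, 5.1]

rate = within_one_point_rate(human, erater)
print(f"{rate:.0%} of essays agree within 1 point")
```

In this toy sample only the third essay (5 vs. 3.5) falls outside the 1-point band.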
What about the overall Analytical Writing score? No report (that I've read) compares an e-rater-only Analytical Writing score to an all-human one. (For a hypothetical comparison, see this post on why you don't learn your e-rater scores.) The 2014 report does look at how the score you'd get from one or more humans with e-rater—the "check score" approach—stacks up to the score you'd get from two or more humans without e-rater. The Analytical Writing score comes out the same roughly 98% of the time.
E-rater Evaluates How You Write, Not What You Think
Automated essay scoring sounds like a helpful supplement to human essay scoring—even a reasonable substitute for it. So what, if anything, is e-rater missing?
Here are two things e-rater is unable to do:
- Understand the meaning of the text being scored.
- Make a reasonable judgment about the essay's overall quality.
E-rater's deficits run deep. The software doesn't grasp the content or quality of your thinking. It also doesn't evaluate your essay as a whole. Yet ETS says that GRE essays are scored on a "holistic scale," with an emphasis on how well you think, not just how well you write. From the FAQ on Scoring and Reporting:
For the computer-delivered test, each essay receives a score…using a six-point holistic scale. In holistic scoring, readers are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task…
The primary emphasis in scoring the Analytical Writing section is on your critical thinking and analytical writing skills rather than on grammar and mechanics.
But then how do e-rater and human raters end up giving similar scores? The answer is that the quality of your thinking correlates strongly with the quality of your writing. E-rater looks at the latter. In total, the software evaluates nine features of your essay. From the 2014 e-rater report:
- Grammar (e.g., subject–verb agreement)
- Usage (e.g., then vs. than)
- Mechanics (spelling and capitalization)
- Style (e.g., repetitive phrases and passive voice)
- Organization (e.g., thesis statement, main points, supporting details, conclusions)
- Development (e.g., main points precede details)
- Positive features:
  - Correct preposition usage (the probability of using the correct preposition in a phrase)
  - Good collocation use (i.e., collocations occur when two contiguous words appear together more often in language use than other pairs of words, such as the pairing of tall trees and high mountains as opposed to high trees and tall mountains)
  - Sentence variety (i.e., the ability to use correct phrasing and a variety of grammatical structures)
- Lexical complexity with average word length (i.e., the use of vocabulary with different counts of letters)
- Lexical complexity with sophistication of word choice (i.e., the use of sophisticated vocabulary)
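Two of these features are simple enough to sketch. The code below computes average word length and a basic collocation measure, pointwise mutual information (PMI), over a tiny made-up corpus. PMI is a common textbook way to quantify "appear together more often than expected"; I am using it as a stand-in and do not know which measure e-rater actually uses. The function names and the corpus are my own.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def average_word_length(text):
    """Mean number of letters per word (0.0 for empty text)."""
    words = tokenize(text)
    return sum(len(word) for word in words) / len(words) if words else 0.0


def pmi(first, second, tokens):
    """Pointwise mutual information of the adjacent pair (first, second).

    Returns negative infinity when the pair never occurs in `tokens`.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return float("-inf")
    p_pair = pair_count / (len(tokens) - 1)
    p_first = unigrams[first] / len(tokens)
    p_second = unigrams[second] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


# Toy corpus (an assumption); a real system would use a very large one.
corpus = "Tall trees and high mountains. Tall trees grow below high mountains."
tokens = tokenize(corpus)

print(average_word_length("tall trees"))
print(pmi("tall", "trees", tokens), pmi("high", "trees", tokens))
```

In this corpus "tall trees" gets a positive PMI while "high trees" never occurs, mirroring the tall trees vs. high trees example in the list above.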
During "training," e-rater analyzes thousands of already-scored GRE essays. For each essay, the software quantifies the use (or misuse) of the above features and measures their correlation to the human-assigned scores. Even though human raters (presumably) look at more than just writing quality, the scores they give can be simulated based on writing quality alone.
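Here is a minimal sketch of the "measure their correlation" step, using a hand-written Pearson correlation. The five essays, their feature values, and their scores are invented for illustration; real training uses thousands of essays, and the details of how e-rater turns these relationships into weights are not given here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up training data: human scores and two feature values per essay.
human_scores = [2, 3, 4, 5, 6]
features = {
    "grammar_error_rate": [0.08, 0.06, 0.05, 0.03, 0.01],
    "average_word_length": [4.0, 4.2, 4.1, 4.6, 4.9],
}

correlations = {
    name: pearson(values, human_scores) for name, values in features.items()
}
for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

In this toy data the error rate correlates negatively with the human scores and word length correlates positively, which is the kind of signal that lets writing-quality features stand in for the human judgment.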
You might wonder why e-rater's inner workings matter for GRE prep. After all, e-rater doesn't assign your scores. But it does approximate them, so having the software score a practice essay could be worthwhile. You can have it evaluate several using ScoreItNow, the practice writing service from ETS.
To see what happened when I submitted an essay to ScoreItNow, check out this post: Official GRE Essay Practice: Score 6 vs. Score 5 on the Argument Task.