Part I: Review of Handbook of Automated Essay Evaluation: Current Applications and New Directions. Eds. Mark D. Shermis and Jill Burstein
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of Automated Essay Evaluation: Current Applications and New Directions. New York, NY: Routledge.
By Lori Beth De Hertogh, Washington State University
The Handbook of Automated Essay Evaluation: Current Applications and New Directions, edited by Mark D. Shermis, University of Akron, and Jill Burstein, Educational Testing Service, features twenty chapters, each of which deals with a different aspect of automated essay evaluation (AEE). The overall purpose of the collection is to help professionals (e.g., educators, program administrators, researchers, testing specialists) working in a range of assessment contexts in K-12 and higher education better understand the capabilities of AEE. It also strives to demystify machine scoring and to highlight advances in several scoring platforms.
The collection is loosely organized into three parts. Authors of the first three chapters discuss automated essay evaluation in classroom contexts. The next section examines the workflow of various scoring engines. In the final section, authors highlight advances in automated essay evaluation. My two-part review generally follows this organizational scheme, except that I begin by examining the workflow of several scoring systems as well as platform options. I then review how several chapters describe potential uses of AEE in classroom contexts and recent developments in machine scoring.
The Handbook of Automated Essay Evaluation devotes considerable energy to explaining how scoring engines work. Matthew Schultz, director of psychometric services for Vantage Learning, describes in Chapter Six how the IntelliMetric™ engine analyzes and scores a text:
The IntelliMetric system must be ‘trained’ with a set of previously scored responses drawn from expert raters or scorers. These papers are used as a basis for the system to ‘learn’ the rubric and infer the pooled judgments of the human scorers. The IntelliMetric system internalizes the characteristics or features of the responses associated with each score point and applies this intelligence to score essays with unknown scores. (p. 89)
While platforms like IntelliMetric differ slightly in how they determine a score, they all employ a multistage process that consists of four basic steps (sketched in code after this list):
- receiving the text,
- using natural language processing to parse text components such as structure, content, and style,
- analyzing the text against a database of previously human- and machine-scored texts,
- producing a score based on how similar or dissimilar the text is to previously rated texts.
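To make this four-step workflow more concrete, the following is a minimal, hypothetical sketch in Python; it is not the code of IntelliMetric, e-rater, LightSIDE, or any other engine discussed in the collection, and the features and similarity measure are illustrative assumptions only.

```python
# Hypothetical sketch of the generic four-step AEE workflow described above.
# Illustrative only; it does not reproduce any engine reviewed in the collection.

import re
from statistics import mean


def extract_features(text: str) -> dict:
    """Step 2: parse the text into crude structure/content/style features."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
    }


def similarity(a: dict, b: dict) -> float:
    """Inverse-distance similarity between two feature dictionaries."""
    distance = sum(abs(a[k] - b[k]) for k in a)
    return 1.0 / (1.0 + distance)


def score_essay(text: str, scored_corpus: list[tuple[str, int]]) -> float:
    """Receive a text (step 1), featurize it (step 2), compare it with
    previously scored essays (step 3), and return a weighted score (step 4)."""
    features = extract_features(text)
    weights, weighted_scores = 0.0, 0.0
    for prior_text, prior_score in scored_corpus:
        w = similarity(features, extract_features(prior_text))
        weights += w
        weighted_scores += w * prior_score
    return weighted_scores / weights


if __name__ == "__main__":
    corpus = [
        ("Short essay with few ideas.", 2),
        ("A longer essay that develops a claim, offers evidence, and "
         "closes with a considered conclusion about the topic.", 5),
    ]
    new_essay = ("A new essay that elaborates its argument with "
                 "several supporting examples.")
    print(round(score_essay(new_essay, corpus), 2))
```

Production engines, of course, use far richer natural language processing and statistical models than this toy similarity measure, but the overall shape of the pipeline is the same.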
In Chapter Eight, Elijah Mayfield and Carolyn Penstein Rosé, language and technology specialists at Carnegie Mellon University, demonstrate how this four-step process works by describing the workflow of LightSIDE, an open-source machine scoring engine and learning tool. In doing so, they illustrate how the program is able to match or exceed “human performance nearly universally” because it can track and aggregate large-scale data drawn from student texts. Mayfield and Rosé argue that this feature allows LightSIDE to tackle “the technical challenges of data collection” in diverse assessment contexts (p. 130). They also emphasize that this capability can help users curate large-scale data based on error analysis. Writing specialists can then use this information to identify areas (e.g., grammar, sentence structure, organization) where students need instructional and institutional support.
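The curation of error-analysis data that Mayfield and Rosé describe can be illustrated with a small, hypothetical sketch; the data structure and error categories below are assumptions for illustration, not LightSIDE's actual output format or interface.

```python
# Hypothetical sketch: aggregating per-essay error annotations across a batch
# so that writing specialists can see which areas need the most support.
# Categories and structure are illustrative assumptions, not LightSIDE output.

from collections import Counter


def aggregate_errors(annotated_essays: list[dict]) -> Counter:
    """Sum per-category error counts across a batch of scored essays."""
    totals = Counter()
    for essay in annotated_essays:
        totals.update(essay["errors"])
    return totals


if __name__ == "__main__":
    batch = [
        {"id": "essay-001", "errors": {"grammar": 4, "organization": 1}},
        {"id": "essay-002", "errors": {"grammar": 2, "sentence_structure": 3}},
    ]
    for category, count in aggregate_errors(batch).most_common():
        print(f"{category}: {count}")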
Chapter Four, “The e-rater® Automated Essay Scoring System,” provides a “description of e-rater’s features and their relevance to the writing construct” (p. 55). Authors Jill Burstein, Joel Tetreault, and Nitin Madnani, research scientists at Educational Testing Service, stress that the workflow capabilities of scoring systems like e-rater or Criterion (a platform developed by ETS) make them useful tools for providing students with immediate, relevant feedback on the grammatical and structural aspects of their writing. They note that these systems are also useful in administrative settings where access to aggregate data is critical (pp. 64-65). The authors argue that e-rater’s ability to generate a range of data makes it an asset in responding to both local and national assessment requirements (p. 65).
In Chapter Nineteen, “Contrasting State-of-the-Art Automated Scoring of Essays,” authors Mark D. Shermis and Ben Hamner (Kaggle) compare how nine scoring engines, including Intelligent Essay Assessor, LightSIDE, e-rater, and Project Essay Grade, respond to a variety of prompts in an effort to assess the workflow and performance of each system. This chapter may be particularly useful to individuals tasked with determining which type of automated evaluation system to adopt or replace. In addition, it provides a brief guide to understanding how a variety of systems operate and an overview of “vendor variability in performance” (p. 337).
The Handbook of Automated Essay Evaluation: Current Applications and New Directions provides assessment scholars, practitioners, and writing teachers with relevant information about the workflow of various scoring engines and how these systems can be applied across a range of educational settings. By understanding how these systems work and where they might be applied, individuals tasked with writing assessment can make more informed choices about the potential benefits and consequences of adopting automated essay evaluation.