Lately, I have been developing automatic essay evaluations for IELTS and now for college-level academic writing. I sent an email to my department to share my progress, and a colleague wrote back with his concerns about my new automatically self-scoring film analysis essay writing assignment. I had shared a sample film analysis essay and a link to the academic essay writing test. He had tried it, and though impressed, he was concerned about automating essay evaluations.
There are lots of variables in such texts that, I believe, cannot be analyzed in a systematic way. And those variables are often what separates a great essay from a passable one. A grabber is not just a question. It has to be stimulating. A thesis has to be precise and thought provoking. The topic sentences have to be directly linked to the thesis as well as provide further insight. Can a computer truly give relevant feedback on whether or not something is ‘stimulating’, ‘precise’, ‘thought-provoking’ and ‘providing further insight on a previous idea’?
An ESL teacher on the VWT’s new automatic essay evaluator
Can a computer give relevant feedback?
I share all these doubts about brute force calculations of good writing. Surely, narrow artificial intelligence that calculates scores based on lists of structural features, lexical items, and grammatical error patterns will miss the value of a meaningful expression of nuanced broad human intelligence.
Nevertheless, these doubts remind me of the observation the ancient Taoist philosopher Chuang-Tzu makes.
“All things have different uses. A horse can travel a hundred miles a day, but it cannot catch mice.”
Chuang-Tzu, translated by Thomas Merton
More recently and just as wisely, Oisin Woods, a German teacher and colleague in our Modern Languages Department at Ahuntsic College, told me, “Let’s not make perfection the enemy of the good.” Rather than focusing on the limitations of computers, it is worth noting what machines can do in the service of good pedagogy. Machines can count, match patterns, and respond in milliseconds. Humans can understand and reflect. We can be allies in the provision of feedback.
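To make "count and match patterns" concrete, here is a minimal Python sketch of the kind of surface counting a machine can do almost instantly. It is not how the Virtual Writing Tutor works; the function name `count_features` and the tiny `TRANSITIONS` list are assumptions made up for illustration.

```python
import re

# Hypothetical, deliberately tiny list of transition words (an assumption,
# not the VWT's actual list).
TRANSITIONS = ("however", "moreover", "therefore", "in conclusion")


def count_features(essay: str) -> dict:
    """Count a few surface features of an essay."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", essay)
    lowered = essay.lower()
    # Whole-word matches of each transition expression.
    transitions = sum(
        len(re.findall(r"\b" + re.escape(t) + r"\b", lowered))
        for t in TRANSITIONS
    )
    return {
        "paragraphs": len(paragraphs),
        "words": len(words),
        "transitions": transitions,
        "questions": essay.count("?"),
    }


if __name__ == "__main__":
    sample = "Why do we watch films? However, films teach us.\n\nTherefore, we watch."
    print(count_features(sample))
```

Counts like these say nothing about whether a grabber is stimulating, which is exactly the division of labour described above: the machine counts, the human reader judges.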
A research question and a hypothesis
The practical pedagogical question I ask myself these days is this, “Can narrow artificial intelligence provide useful formative feedback to learners and help teachers score essays more reliably?” The answer seems to be, on balance, “yes.”
Automatic scoring and feedback will help students become better writers and help teachers evaluate essays more reliably.
My current hypothesis
A null hypothesis
However, let’s see if there is any evidence in the research literature to support a null hypothesis.
- Computers can be fooled by clever nonsense (Monaghan & Bridgeman, 2005).
- Brilliant non-conformist writing will score lower because it is eccentric (Monaghan & Bridgeman, 2005).
- Automatic scoring of more complex argument essays is less reliable than of inherently less complex opinion essays (0.76 vs. 0.81) (Bridgeman, 2004). Here 1.0 is complete reliability. The difference in reliability of 0.05 is significant but not huge.
These seem to fit with concerns that some elements of meaningful essays cannot be analyzed programmatically in an effective way.
My colleague asked, “Can a computer truly give relevant feedback on whether or not something is ‘stimulating’, ‘precise’, ‘thought-provoking’ and ‘providing further insight on a previous idea’?” I don’t know. I have my doubts, but I also have my doubts that a teacher can explain the mechanics of why a sentence stimulates or provokes thought. All I have ever been able to do in these areas of writing is dramatize the presence of a reader by indicating when a sentence stimulates or provokes me, often with cryptic or terse comments in the margin (Wow! Nice! Interesting! Provocative!).
A reasonable hypothesis
Turning to possibilities and evidence for my hypothesis that automatic scoring and feedback could help students become better writers and help teachers evaluate essays, here is what I have observed and what I have read.
- Teachers take two weeks to provide feedback on a first draft. The VWT takes two seconds.
- Teachers limit the number of essays students write because of the impact of corrections on the teacher’s workload. Necessity has made us put time and resource limitations ahead of the pedagogical goal of maximizing meaningful practice with a focus on form. In other words, there are not enough teachers to provide all of the feedback students require (Monaghan & Bridgeman, 2005).
- Automatically comparing multiple features of a student’s essay to an ideal essay was found to provide useful formative feedback to students (Foltz et al., 1999).
- Automatic scores using grammar, topic, discourse features, and sentiment analysis are very highly correlated to expert human scores (Farra et al., 2015).
- The reliability of a teacher’s ratings of student essays declines with fatigue. Machines don’t tire and can evaluate essays consistently.
- Researchers found that using one essay task and one human evaluator to measure achievement produced unreliable scores (Brendland et al., 2004), and yet that is what we do at finals every semester.
- A computer rating combined with 1 human rating was found to be more reliable than the combination of scores by 2 human raters (Bridgeman, 2004). Computer-assisted scoring is more reliable than both exclusive computer scoring and exclusive human scoring. Humans diverge too much in their judgements.
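For readers who want to see what agreement between raters looks like numerically, here is a small Python sketch. It assumes agreement is measured with a Pearson correlation and that a machine score and a human score are combined by simple averaging; neither assumption is confirmed by the studies cited above, and the scores in the usage example are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def combined_score(human, machine):
    """Average one human rating with one machine rating (an assumed rule)."""
    return (human + machine) / 2


if __name__ == "__main__":
    # Invented scores for five essays, for illustration only.
    human_a = [60, 72, 85, 90, 55]
    machine = [62, 70, 80, 92, 58]
    print(round(pearson(human_a, machine), 3))
    print([combined_score(h, m) for h, m in zip(human_a, machine)])
```

A value near 1.0 means the two raters rank and space the essays almost identically; lower values mean they diverge.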
All that said, I think that there is evidence that a free source of computerized formative feedback available 24 hours a day online is likely to help students improve their writing and their self-assessments of their writing.
John Hattie (2009) found in his synthesis of 800+ meta-analyses that the most effective thing students can do to improve their performance is to openly declare to the class what score they expect to achieve on an upcoming evaluation because of the goal-setting involved in such predictions. Minorities tend to inflate their predictions. Girls tend to minimize their predictions. Practice tests make all students more reliable at predicting their own performance.
The next most effective intervention is for teachers to give formative evaluations/practice tests (Hattie, 2009). A robot that can score and give detailed feedback on first drafts of essays is a type of formative evaluation/practice test. Not to labour the point, but Hattie also found that computer-assisted learning (computerized feedback + teacher feedback) produced double the performance gains when compared to smaller class sizes.
I remain reasonably optimistic that the Virtual Writing Tutor can, in time, score essays and provide helpful formative feedback. One reason for my optimism is a text message that Frank Bonkowski sent me recently to tell me about his experience using automated feedback on essays with a group of students earlier in the day. Here is what he wrote.
A super-motivated girl visited the VWT 3x today for feedback on her film analysis essay. She went from 40% to 56% to 88%. She was super happy. Me, too.
Text message from Frank
Bridgeman, B. (2004, December). E-rater as a quality control on human scorers. Presentation in the ETS Research Colloquium Series, Princeton, NJ.
Farra, N., Somasundaran, S., & Burstein, J. (2015). Scoring persuasive essays using opinions and their targets. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 64–74).
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2).
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.
Monaghan, W., & Bridgeman, B. (2005). E-Rater as a Quality Control on Human Scores. ETS R&D Connections: Princeton, NJ: ETS.