Combined Human and Automated Scoring of Writing

Measured Progress ©2012

Stuart Kahl, Measured Progress


The Challenges of Performance Tasks


Contention

The number of independent score points should reflect the amount of evidence.


Uses of Automated Essay Scoring

1. Second reads (double scoring)

2. Additional independent scores

3. Both


Questions

1. Are a couple of human scores plus computer-generated trait scores “better than” many human analytic scores?

2. When humans focus on fewer traits, are their agreement rates higher?


The Student Work We Used

From Grade 8 NECAP

* 1 long and 1 short essay (based on 2 common prompts) and 10 multiple-choice (MC) responses from each of 1694 students.

From Grade 11 NECAP

* 2 long essays (based on 2 common prompts) from each of 590 students.


The Essay Score Data

MP/Gates human – 1 holistic, 5 traits (organization, support, focus, language, conventions), double-scored

MP/Gates human – 1 holistic, 1 trait (support), double-scored

Computer-generated trait scores (word choice, mechanics, style, organization, development)

NECAP human – 1 holistic, double-scored


Statistics

Scorer agreement – rate of discrepancies (scores differing by more than 1 point)

Decision accuracy – estimate of the proportion of categorization decisions that would match the decisions that would result if scores contained no measurement error

Decision consistency – estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel form

Standard error at cut points
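The definitions of decision accuracy and decision consistency can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the estimation procedure used in the study (the slides do not say which one was used): true scores and measurement errors are assumed normal, the cut scores, mean, and standard deviations are made-up values, and the names `classify` and `simulate_decisions` are hypothetical.

```python
import random


def classify(score, cuts):
    """Return the performance category: the number of cut scores at or below the score."""
    return sum(score >= c for c in cuts)


def simulate_decisions(n_students, cuts, true_mean, true_sd, sem, seed=0):
    """Estimate decision accuracy and consistency by simulation.

    Assumption (not from the slides): observed score = true score + normal error
    with standard deviation `sem`, the same at every score level.
    """
    rng = random.Random(seed)
    accurate = 0
    consistent = 0
    for _ in range(n_students):
        true_score = rng.gauss(true_mean, true_sd)
        form_a = true_score + rng.gauss(0, sem)  # score on the administered form
        form_b = true_score + rng.gauss(0, sem)  # score on a parallel form
        # Accuracy: observed category matches the error-free category.
        accurate += classify(form_a, cuts) == classify(true_score, cuts)
        # Consistency: categories from two parallel forms match each other.
        consistent += classify(form_a, cuts) == classify(form_b, cuts)
    return accurate / n_students, consistent / n_students


if __name__ == "__main__":
    # Hypothetical cuts and score distribution, chosen only for illustration.
    accuracy, consistency = simulate_decisions(
        20000, cuts=[10, 20, 30], true_mean=20, true_sd=8, sem=1.7
    )
    print(f"decision accuracy    ~ {accuracy:.2f}")
    print(f"decision consistency ~ {consistency:.2f}")
```

In this setup consistency comes out lower than accuracy because two error-prone scores are compared with each other rather than one error-prone score with an error-free one, which matches the pattern in the tables (the value in parentheses is always the smaller one).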


MP/Gates Scorer Agreement – Number of Discrepancies (>1)
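As a minimal sketch of how a discrepancy count of this kind could be computed from double-scored essays: two reads are treated as discrepant when they differ by more than 1 point. The function name `agreement_summary` and the scores below are hypothetical and are not taken from the study data.

```python
def agreement_summary(first_reads, second_reads, tolerance=1):
    """Count exact, adjacent, and discrepant score pairs for double-scored essays.

    A pair is discrepant when the two reads differ by more than `tolerance` points.
    """
    if len(first_reads) != len(second_reads):
        raise ValueError("each essay needs exactly one first and one second read")
    diffs = [abs(a - b) for a, b in zip(first_reads, second_reads)]
    exact = sum(d == 0 for d in diffs)
    discrepant = sum(d > tolerance for d in diffs)
    return {
        "exact": exact,
        "adjacent": len(diffs) - exact - discrepant,
        "discrepant": discrepant,
        "discrepancy_rate": discrepant / len(diffs) if diffs else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical scores on a 1-6 scale for six essays.
    first = [4, 3, 5, 2, 6, 1]
    second = [4, 4, 3, 2, 3, 1]
    print(agreement_summary(first, second))
```

Running the same summary separately for the holistic score and for each trait score would allow discrepancy counts to be compared between the 5-trait and 1-trait scoring conditions.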


Decision Accuracy (and Consistency) – Grade 8

Score Combination (1 long + 1 short essay)                Overall      Near Proficient vs Proficient

MP/Gates human holistic + 5 traits                        .86 (.80)    .93 (.91)

MP/Gates human holistic + 1 trait + automated 5 traits    .84 (.77)    .93 (.91)

NECAP holistic + MC + automated 5 traits                  .82 (.74)    .93 (.90)


Decision Accuracy (and Consistency) – Grade 8, continued

Score Combination                                              Overall      Near Proficient vs Proficient

NECAP holistic + auto. traits (1 long essay w/ MC)             .76 (.67)    .91 (.88)

NECAP holistic + auto. traits (1 long essay w/o MC)            .75 (.66)    .91 (.88)

NECAP holistic + auto. traits (1 long + 1 short essay w/ MC)   .82 (.74)    .93 (.90)

NECAP holistic + auto. traits (1 long + 1 short essay w/o MC)  .81 (.73)    .93 (.90)


Decision Accuracy (and Consistency) – Grade 11

Score Combination (2 long essays)                         Overall      Near Proficient vs Proficient

MP/Gates human holistic + 5 traits                        .73 (.65)    .92 (.89)

MP/Gates human holistic + 1 trait + automated 5 traits    .69 (.60)    .90 (.86)

NECAP holistic + MC + automated 5 traits                  .63 (.54)    .88 (.84)


Standard Errors at Cuts – Grade 8

Score Combination (1 long + 1 short essay)                C1      C2      C3

MP/Gates human holistic + 5 traits                        .87     1.01    1.40

MP/Gates human holistic + 1 trait + automated 5 traits    1.36    1.44    1.57

NECAP holistic + MC + automated 5 traits                  1.68    1.70    2.04


Standard Errors at Cuts – Grade 8, continued

Score Combination                                              C1      C2      C3

NECAP holistic + auto. traits (1 long essay w/ MC)             1.84    1.85    2.28

NECAP holistic + auto. traits (1 long essay w/o MC)            2.13    2.04    2.45

NECAP holistic + auto. traits (1 long + 1 short essay w/ MC)   1.68    1.70    2.04

NECAP holistic + auto. traits (1 long + 1 short essay w/o MC)  1.89    1.80    2.13


Standard Errors at Cuts – Grade 11

Score Combination (2 long essays)                     C1      C2      C3

MP/Gates human holistic + 5 traits                    1.09    1.10    1.24

MP/Gates human holistic + 1 trait + auto. 5 traits    1.39    1.39    1.54

NECAP holistic + MC + auto. 5 traits                  1.73    1.73    1.83


Preliminary Findings

Primary

The approach (human holistic + 5 traits vs. human holistic + 1 trait + automated 5 traits) did not make a difference with respect to decision accuracy/consistency, but it did with respect to standard error: the first approach was associated with lower standard errors.

Scorer discrepancy rates were lower when scorers evaluated fewer traits.

Secondary

The inclusion of MC items with student essays did not make a difference with respect to decision accuracy/consistency, but it did reduce standard errors at the cuts.

The addition of a second essay both improved decision accuracy/consistency and reduced standard errors at the cuts.


Still need to:

* investigate other score combinations relative to the ones we looked at, especially “holistic alone.”

* understand why the scoring approaches investigated and the MC items made no difference with respect to decision accuracy/consistency, but did with respect to standard errors at the cuts.

* test significance.


What Might Be

Human holistic and limited analytic scores

+ “trained” automated holistic scores as a second read and as a check of human scores to determine the need for arbitration

+ “untrained” automated analytic trait scores
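The idea of using an automated holistic score as a check on the human score can be sketched as a simple flagging rule. The rule below is an assumption for illustration, not a procedure described in the slides: it reuses the "more than 1 point apart" discrepancy criterion, and the function name `flag_for_arbitration` and all scores are hypothetical.

```python
def flag_for_arbitration(human_scores, automated_scores, tolerance=1):
    """Return the indices of essays whose human and automated holistic scores
    differ by more than `tolerance` points (assumed arbitration trigger)."""
    if len(human_scores) != len(automated_scores):
        raise ValueError("each essay needs one human and one automated score")
    return [
        i
        for i, (human, automated) in enumerate(zip(human_scores, automated_scores))
        if abs(human - automated) > tolerance
    ]


if __name__ == "__main__":
    # Hypothetical holistic scores for four essays.
    human = [4, 2, 5, 3]
    automated = [4.2, 3.8, 4.6, 1.5]
    # Essays at the printed indices would go to a human arbiter.
    print(flag_for_arbitration(human, automated))
```

Under a rule like this, only the flagged essays would need an additional human read; the tolerance is a policy choice that trades arbitration cost against the risk of leaving a discrepant score unreviewed.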


P.O. Box 1217, Dover, NH 03821-1217 | Web: measuredprogress.org | Office: 603.749.9102 It’s all about student learning. Period.
