Measured Progress ©2012
Combined Human and Automated Scoring of Writing
Stuart Kahl
Measured Progress
The Challenges of Performance Tasks
Contention
The number of independent score points should reflect the amount of evidence.
Uses of Automated Essay Scoring
1. Second reads (double scoring)
2. Additional independent scores
3. Both
Questions
1. Are a couple of human scores plus computer-generated trait scores “better than” many human analytic scores?
2. When humans focus on fewer traits, are their agreement rates higher?
The Student Work We Used
From Grade 8 NECAP
* 1 long and 1 short essay (based on 2 common prompts) and 10 MC responses from each of 1694 students.
From Grade 11 NECAP
* 2 long essays (based on 2 common prompts) from each of 590 students.
The Essay Score Data
* MP/Gates human – 1 holistic, 5 traits (organization, support, focus, language, conventions), double-scored
* MP/Gates human – 1 holistic, 1 trait (support), double-scored
* Computer-generated trait scores (word choice, mechanics, style, organization, development)
* NECAP human – 1 holistic, double-scored
Statistics
* Scorer agreement – discrepancy (>1) rates
* Decision accuracy – estimate of the proportion of categorization decisions that would match the decisions that would result if scores contained no measurement error
* Decision consistency – estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel form
* Standard error at cut points
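Two of the statistics on this slide can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names, the example cut scores, and the idea of having both sets of scores in hand are assumptions made for illustration. In practice, decision consistency is usually estimated from a single administration with a psychometric model rather than from an actual parallel form.

```python
from bisect import bisect_right


def discrepancy_rate(first_reads, second_reads, tolerance=1):
    """Share of double-scored responses whose two scores differ by more than `tolerance`."""
    if len(first_reads) != len(second_reads):
        raise ValueError("score lists must have the same length")
    if not first_reads:
        raise ValueError("no scores given")
    discrepant = sum(
        1 for a, b in zip(first_reads, second_reads) if abs(a - b) > tolerance
    )
    return discrepant / len(first_reads)


def categorize(score, cuts):
    """Category index = number of cut scores at or below `score` (cuts must be sorted)."""
    return bisect_right(cuts, score)


def classification_agreement(scores_a, scores_b, cuts):
    """Proportion of students placed in the same category by two sets of total scores.

    With scores from two parallel forms this is a direct analogue of decision
    consistency (hypothetical setup; the study estimates it statistically).
    """
    pairs = list(zip(scores_a, scores_b))
    if not pairs:
        raise ValueError("no scores given")
    same = sum(1 for a, b in pairs if categorize(a, cuts) == categorize(b, cuts))
    return same / len(pairs)
```

For example, `discrepancy_rate([4, 3, 2, 5], [4, 5, 3, 2])` counts the second and fourth responses as discrepant, since only those pairs differ by more than one point.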
MP/Gates Scorer Agreement – # Discrepancies (>1)
Decision Accuracy (and Consistency) – Grade 8
Score Combination (1 long + 1 short essay)              Overall     Near Proficient vs Proficient
MP/Gates human holistic + 5 traits                      .86 (.80)   .93 (.91)
MP/Gates human holistic + 1 trait + automated 5 traits  .84 (.77)   .93 (.91)
NECAP holistic + MC + automated 5 traits                .82 (.74)   .93 (.90)
Decision Accuracy (and Consistency) – Grade 8, continued
Score Combination                                              Overall     Near Proficient vs Proficient
NECAP holistic + auto. traits (1 long essay w/MC)              .76 (.67)   .91 (.88)
NECAP holistic + auto. traits (1 long essay w/o MC)            .75 (.66)   .91 (.88)
NECAP holistic + auto. traits (1 long + 1 short essay w/MC)    .82 (.74)   .93 (.90)
NECAP holistic + auto. traits (1 long + 1 short essay w/o MC)  .81 (.73)   .93 (.90)
Decision Accuracy (and Consistency) – Grade 11
Score Combination (2 long essays)                       Overall     Near Proficient vs Proficient
MP/Gates human holistic + 5 traits                      .73 (.65)   .92 (.89)
MP/Gates human holistic + 1 trait + automated 5 traits  .69 (.60)   .90 (.86)
NECAP holistic + MC + automated 5 traits                .63 (.54)   .88 (.84)
Standard Errors at Cuts – Grade 8
Score Combination (1 long + 1 short essay)              C1     C2     C3
MP/Gates human holistic + 5 traits                      0.87   1.01   1.40
MP/Gates human holistic + 1 trait + automated 5 traits  1.36   1.44   1.57
NECAP holistic + MC + automated 5 traits                1.68   1.70   2.04
Standard Errors at Cuts – Grade 8, continued
Score Combination                                              C1     C2     C3
NECAP holistic + auto. traits (1 long essay w/MC)              1.84   1.85   2.28
NECAP holistic + auto. traits (1 long essay w/o MC)            2.13   2.04   2.45
NECAP holistic + auto. traits (1 long + 1 short essay w/MC)    1.68   1.70   2.04
NECAP holistic + auto. traits (1 long + 1 short essay w/o MC)  1.89   1.80   2.13
Standard Errors at Cuts – Grade 11
Score Combination (2 long essays)                   C1     C2     C3
MP/Gates human holistic + 5 traits                  1.09   1.10   1.24
MP/Gates human holistic + 1 trait + auto. 5 traits  1.39   1.39   1.54
NECAP holistic + MC + auto. 5 traits                1.73   1.73   1.83
Preliminary Findings
Primary
* The approach (human holistic + 5 traits vs. human holistic + 1 trait + automated 5 traits) did not make a difference with respect to decision accuracy/consistency, but it did with respect to standard error: the first approach was associated with lower standard errors.
* Scorer discrepancy rates were lower when scorers evaluated fewer traits.
Secondary
* The inclusion of MC items with student essays did not make a difference with respect to decision accuracy/consistency, but it did reduce standard errors at the cuts.
* The addition of a second essay both improved decision accuracy/consistency and reduced standard errors at the cuts.
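The size of the primary finding can be read off the Grade 8 figures reported on the earlier slides. The short Python sketch below only subtracts those reported values; the dictionary layout and the `compare` helper are illustrative assumptions, and no significance test is implied.

```python
# Values transcribed from the Grade 8 slides (1 long + 1 short essay).
# "accuracy" is overall decision accuracy; "se" holds standard errors at cuts C1, C2, C3.
GRADE8 = {
    "human holistic + 5 traits": {
        "accuracy": 0.86,
        "se": (0.87, 1.01, 1.40),
    },
    "human holistic + 1 trait + automated 5 traits": {
        "accuracy": 0.84,
        "se": (1.36, 1.44, 1.57),
    },
}


def compare(baseline, alternative):
    """Return (accuracy difference, per-cut standard error differences), alternative minus baseline."""
    acc_diff = round(alternative["accuracy"] - baseline["accuracy"], 2)
    se_diffs = tuple(
        round(alt - base, 2) for alt, base in zip(alternative["se"], baseline["se"])
    )
    return acc_diff, se_diffs


acc_diff, se_diffs = compare(
    GRADE8["human holistic + 5 traits"],
    GRADE8["human holistic + 1 trait + automated 5 traits"],
)
print(acc_diff)  # change in overall decision accuracy
print(se_diffs)  # change in standard error at C1, C2, C3
```

On these figures the accuracy gap is two hundredths, while the standard error at the first cut rises by roughly half a score point.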
Still need to:
* investigate other score combinations relative to the ones we looked at, especially “holistic alone.”
* understand why the choice of approach (among those investigated) and the inclusion of MC items made no difference with respect to decision accuracy/consistency, but did with respect to standard errors at the cuts.
* test significance.
What Might Be
Human holistic and limited analytic scores
+ “trained” automated holistic scores as a second read and as a check of human scores to determine the need for arbitration
+ “untrained” automated analytic trait scores
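Using an automated holistic score as a second read could work along the following lines. This Python sketch is illustrative only: the function names and the record layout are hypothetical, and the tolerance of one point is an assumption borrowed from the discrepancy (>1) definition used earlier in the deck.

```python
def needs_arbitration(human_score, automated_score, tolerance=1):
    """True when the human and automated holistic scores differ by more than `tolerance`."""
    return abs(human_score - automated_score) > tolerance


def flag_for_arbitration(essays, tolerance=1):
    """Return the IDs of essays whose two holistic scores are discrepant.

    `essays` is an iterable of (essay_id, human_score, automated_score) tuples
    (a hypothetical layout chosen for this example).
    """
    return [
        essay_id
        for essay_id, human, automated in essays
        if needs_arbitration(human, automated, tolerance)
    ]


# Hypothetical scores: only the second essay exceeds the one-point tolerance.
sample = [("e1", 4, 4), ("e2", 2, 4), ("e3", 5, 4)]
flagged = flag_for_arbitration(sample)
```

Essays that are not flagged would keep the human score as reported, and flagged essays would go to an additional human reader.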
P.O. Box 1217, Dover, NH 03821-1217 | Web: measuredprogress.org | Office: 603.749.9102
It’s all about student learning. Period.