Combined Human and Automated Scoring of Writing
Stuart Kahl, Measured Progress
Measured Progress ©2012


Page 1: Combined Human and Automated Scoring of Writing

Measured Progress ©2012

Combined Human and Automated Scoring of Writing

Stuart Kahl, Measured Progress

Page 2: Combined Human and Automated Scoring of Writing

The Challenges of Performance Tasks

Page 3: Combined Human and Automated Scoring of Writing

Contention

The number of independent score points should reflect the amount of evidence.

Page 4: Combined Human and Automated Scoring of Writing

Uses of Automated Essay Scoring

1. Second reads (double scoring)

2. Additional independent scores

3. Both

Page 5: Combined Human and Automated Scoring of Writing

Questions

1. Are a couple of human scores plus computer-generated trait scores “better than” many human analytic scores?

2. When humans focus on fewer traits, are their agreement rates higher?

Page 6: Combined Human and Automated Scoring of Writing

The Student Work We Used

From Grade 8 NECAP

* 1 long and 1 short essay (based on 2 common prompts) and 10 multiple-choice (MC) responses from each of 1,694 students.

From Grade 11 NECAP

* 2 long essays (based on 2 common prompts) from each of 590 students.

Page 7: Combined Human and Automated Scoring of Writing

The Essay Score Data

MP/Gates human – 1 holistic, 5 traits (organization, support, focus, language, conventions), double-scored

MP/Gates human – 1 holistic, 1 trait (support), double-scored

Computer-generated trait scores (word choice, mechanics, style, organization, development)

NECAP human – 1 holistic, double-scored

Page 8: Combined Human and Automated Scoring of Writing

Statistics

Scorer agreement – rates of discrepancies (two scores differing by more than 1 point)

Decision accuracy – estimate of the proportion of categorization decisions that would match the decisions that would result if scores contained no measurement error

Decision consistency – estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel form

Standard error at cut points
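As an illustration of the decision accuracy and decision consistency definitions, both can be approximated by simulation. This is a minimal sketch, not the estimation procedure used in the study (the slides do not specify one): the true scores, cut points, and error standard deviation are hypothetical, and measurement error is assumed to be normal with the same spread across the scale.

```python
import bisect
import random


def classify(score, cuts):
    """Return the performance-level index for a score, given ascending cut points."""
    return bisect.bisect_right(cuts, score)


def simulate_decisions(true_scores, cuts, sem, seed=0):
    """Estimate (decision accuracy, decision consistency) by simulation.

    Accuracy: share of students whose observed level matches the level
    implied by their error-free (true) score.
    Consistency: share of students placed at the same level by two
    parallel forms, simulated here as two independent error draws.
    """
    rng = random.Random(seed)
    accurate = 0
    consistent = 0
    for true_score in true_scores:
        form_a = true_score + rng.gauss(0, sem)
        form_b = true_score + rng.gauss(0, sem)
        accurate += classify(form_a, cuts) == classify(true_score, cuts)
        consistent += classify(form_a, cuts) == classify(form_b, cuts)
    n = len(true_scores)
    return accurate / n, consistent / n


# Hypothetical inputs: 2,000 students on a 0-30 scale with three cut points.
population = random.Random(1)
true_scores = [population.uniform(0, 30) for _ in range(2000)]
cuts = [10, 17, 24]

accuracy, consistency = simulate_decisions(true_scores, cuts, sem=1.5)
print(f"accuracy={accuracy:.2f} consistency={consistency:.2f}")
```

Because both simulated forms carry error, consistency typically comes out below accuracy in a sketch like this.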

Page 9: Combined Human and Automated Scoring of Writing

MP/Gates Scorer Agreement – # Discrepancies (>1)
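For readers unfamiliar with the statistic, a discrepancy count of this kind could be computed as follows. This is a minimal sketch with made-up scores: the function name and the sample data are hypothetical, and only the "more than 1 point apart" rule comes from the slides.

```python
from typing import Sequence


def discrepancy_rate(first_reads: Sequence[int],
                     second_reads: Sequence[int],
                     threshold: int = 1) -> float:
    """Share of double-scored responses whose two scores differ by more than `threshold`."""
    if len(first_reads) != len(second_reads):
        raise ValueError("each response needs exactly two scores")
    if not first_reads:
        raise ValueError("no scores supplied")
    discrepant = sum(
        1 for a, b in zip(first_reads, second_reads) if abs(a - b) > threshold
    )
    return discrepant / len(first_reads)


# Hypothetical first and second reads for six responses.
first = [4, 3, 5, 2, 6, 3]
second = [4, 4, 3, 2, 3, 3]

rate = discrepancy_rate(first, second)
print(f"{rate:.3f}")  # 2 of the 6 pairs differ by more than 1 point
```

Adjacent scores (a gap of exactly 1) are not counted as discrepant under this rule.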

Page 10: Combined Human and Automated Scoring of Writing

Decision Accuracy (and Consistency) – Grade 8

Score Combination (1 long + 1 short essay)              Overall     Near Proficient vs Proficient
MP/Gates human holistic + 5 traits                      .86 (.80)   .93 (.91)
MP/Gates human holistic + 1 trait + automated 5 traits  .84 (.77)   .93 (.91)
NECAP holistic + MC + automated 5 traits                .82 (.74)   .93 (.90)

Page 11: Combined Human and Automated Scoring of Writing

Decision Accuracy (and Consistency) – Grade 8, continued

Score Combination                                              Overall     Near Proficient vs Proficient
NECAP holistic + auto. traits (1 long essay w/MC)              .76 (.67)   .91 (.88)
NECAP holistic + auto. traits (1 long essay w/o MC)            .75 (.66)   .91 (.88)
NECAP holistic + auto. traits (1 long + 1 short essay w/MC)    .82 (.74)   .93 (.90)
NECAP holistic + auto. traits (1 long + 1 short essay w/o MC)  .81 (.73)   .93 (.90)

Page 12: Combined Human and Automated Scoring of Writing

Decision Accuracy (and Consistency) – Grade 11

Score Combination (2 long essays)                       Overall     Near Proficient vs Proficient
MP/Gates human holistic + 5 traits                      .73 (.65)   .92 (.89)
MP/Gates human holistic + 1 trait + automated 5 traits  .69 (.60)   .90 (.86)
NECAP holistic + MC + automated 5 traits                .63 (.54)   .88 (.84)

Page 13: Combined Human and Automated Scoring of Writing

Standard Errors at Cuts – Grade 8

Score Combination (1 long + 1 short essay)              C1    C2    C3
MP/Gates human holistic + 5 traits                      .87   1.01  1.40
MP/Gates human holistic + 1 trait + automated 5 traits  1.36  1.44  1.57
NECAP holistic + MC + automated 5 traits                1.68  1.70  2.04

Page 14: Combined Human and Automated Scoring of Writing

Standard Errors at Cuts – Grade 8, continued

Score Combination                                              C1    C2    C3
NECAP holistic + auto. traits (1 long essay w/MC)              1.84  1.85  2.28
NECAP holistic + auto. traits (1 long essay w/o MC)            2.13  2.04  2.45
NECAP holistic + auto. traits (1 long + 1 short essay w/MC)    1.68  1.70  2.04
NECAP holistic + auto. traits (1 long + 1 short essay w/o MC)  1.89  1.80  2.13

Page 15: Combined Human and Automated Scoring of Writing

Standard Errors at Cuts – Grade 11

Score Combination (2 long essays)                       C1    C2    C3
MP/Gates human holistic + 5 traits                      1.09  1.10  1.24
MP/Gates human holistic + 1 trait + auto. 5 traits      1.39  1.39  1.54
NECAP holistic + MC + auto. 5 traits                    1.73  1.73  1.83
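To give a feel for what a difference in standard error at a cut means for an individual student, the sketch below compares the Grade 11 C1 values of 1.09 and 1.39. It is illustrative only: the cut location and the student's score are hypothetical, the score scale is assumed to be the one the standard errors are expressed in, and the true score is treated as normally distributed around the observed score, which is a simplification.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_true_score_below_cut(observed, cut, standard_error):
    """Approximate chance that the true score lies below the cut,
    treating the true score as normal around the observed score
    (a simplifying assumption)."""
    return normal_cdf((cut - observed) / standard_error)


# Standard errors at C1 for Grade 11 (2 long essays), taken from the slides.
se_human_only = 1.09  # MP/Gates human holistic + 5 traits
se_combined = 1.39    # MP/Gates human holistic + 1 trait + auto. 5 traits

# Hypothetical student scoring one point above a hypothetical cut.
cut, observed = 20.0, 21.0
p_human_only = prob_true_score_below_cut(observed, cut, se_human_only)
p_combined = prob_true_score_below_cut(observed, cut, se_combined)
print(f"human only: {p_human_only:.3f}  combined: {p_combined:.3f}")
```

Under these assumptions, the larger standard error leaves a higher chance that a student just above the cut truly belongs below it.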

Page 16: Combined Human and Automated Scoring of Writing

Preliminary Findings

Primary

The approach (human holistic + 5 traits vs. human holistic + 1 trait + automated 5 traits) did not make a difference with respect to decision accuracy/consistency, but it did with respect to standard error: the first approach was associated with lower standard errors.

Scorer discrepancy rates were lower when scorers evaluated fewer traits.

Secondary

The inclusion of MC items with student essays did not make a difference with respect to decision accuracy/consistency, but it did reduce standard errors at the cuts.

The addition of a second essay both improved decision accuracy/consistency and reduced standard errors at the cuts.
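The primary comparison can be checked against the reported figures with a few lines of arithmetic. The sketch below transcribes the overall decision accuracy, overall consistency, and standard errors at the cuts for the two MP/Gates approaches from the Grade 8 and Grade 11 slides; the dictionary layout and function name are only illustrative, and no significance test is implied.

```python
# Each entry: (overall accuracy, overall consistency, [SE at C1, C2, C3]),
# transcribed from the decision accuracy and standard error slides.
FIRST = "human holistic + 5 traits"
SECOND = "human holistic + 1 trait + automated 5 traits"

results = {
    "Grade 8": {
        FIRST: (0.86, 0.80, [0.87, 1.01, 1.40]),
        SECOND: (0.84, 0.77, [1.36, 1.44, 1.57]),
    },
    "Grade 11": {
        FIRST: (0.73, 0.65, [1.09, 1.10, 1.24]),
        SECOND: (0.69, 0.60, [1.39, 1.39, 1.54]),
    },
}


def compare(grade):
    """Change from the first approach to the second for one grade."""
    acc1, con1, se1 = results[grade][FIRST]
    acc2, con2, se2 = results[grade][SECOND]
    return {
        "accuracy_change": round(acc2 - acc1, 2),
        "consistency_change": round(con2 - con1, 2),
        # Ratio above 1 means the second approach has the larger standard error.
        "se_ratio": [round(b / a, 2) for a, b in zip(se1, se2)],
    }


for grade in results:
    print(grade, compare(grade))
```

In this tabulation the accuracy and consistency changes are a few hundredths, while the standard errors for the second approach are roughly 12% to 56% larger, depending on grade and cut.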

Page 17: Combined Human and Automated Scoring of Writing

Still need to:

investigate other score combinations relative to the ones we looked at, especially “holistic alone.”

understand why the approach (among those investigated) and MC items made no difference with respect to decision accuracy/consistency, but did with respect to standard errors at the cuts.

test significance.

Page 18: Combined Human and Automated Scoring of Writing

What Might Be

Human holistic and limited analytic scores

+ “trained” automated holistic scores as a second read and as a check of human scores to determine the need for arbitration

+ “untrained” automated analytic trait scores
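The idea of using an automated holistic score as a check on the human score could look something like the sketch below. It is a hypothetical illustration: the function name, the response IDs, the scores, and the choice of a gap of more than 1 point as the trigger (borrowed from the discrepancy definition used for human scorers) are all assumptions, not a described procedure.

```python
def needs_arbitration(human_score, automated_score, max_gap=1):
    """Flag a response when the human and automated holistic scores
    differ by more than `max_gap` points (assumed rule)."""
    return abs(human_score - automated_score) > max_gap


# Hypothetical (response id, human holistic, automated holistic) triples.
scored = [("r1", 4, 4), ("r2", 3, 5), ("r3", 5, 4), ("r4", 2, 6)]

flagged = [rid for rid, human, auto in scored if needs_arbitration(human, auto)]
print(flagged)  # ['r2', 'r4']
```

Only the flagged responses would go to an additional human read, so the automated score acts as the second read for the rest.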

Page 19: Combined Human and Automated Scoring of Writing

P.O. Box 1217, Dover, NH 03821-1217 | Web: measuredprogress.org | Office: 603.749.9102

It’s all about student learning. Period.