Evaluating Rater Performance in the Scoring of Career

1/23

Evaluating Rater Performance in the Scoring of Career Commitment Essays

Susan Gracia

Ph.D. in Education Faculty Research Seminar

December 4, 2008


2/23

Career Commitment Essay

One of several admissions requirements to FSEHD

2-3 page essay describing:

Why a candidate wants to be a teacher

Personal skills and characteristics s/he brings

Area for improvement


3/23

CCE Rubric

Evaluation based on 4 dimensions or traits:

Content

Expression/voice

Organization

Conventions

Plus an overall, holistic score

The 2 types of scoring provide a useful distinction between overall performance and performance using particular skills


4/23

Scoring

Scorers: higher ed. faculty & K-12 practitioners

Scorer training: 1.5-2 hours

Each essay gets 2 blind reads

Holistic scores are averaged to determine pass/fail

If 2 essay scores are exactly the same or within 1 point of each other, student receives average of the 2 scores

.5 averages are bumped up to next highest score

Average holistic scores of 3 or 4: pass

Average holistic scores of 1 or 2: revise and resubmit

If essay scores deviate by more than 1 point, essay is read by a 3rd scorer

Student receives average of 2 highest scores
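
The decision rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_holistic_score` is hypothetical, scores are assumed to be whole numbers from 1 to 4, and the sketch assumes the ".5 is bumped up" rule also applies after a third read, which the slide does not state explicitly.

```python
import math


def resolve_holistic_score(first, second, third=None):
    """Return (final_score, decision) for one essay.

    first, second: holistic scores (1-4) from the two blind reads.
    third: score from a third reader, needed only when the first
           two reads differ by more than 1 point.
    """
    if abs(first - second) <= 1:
        # Same score or adjacent scores: average the two reads.
        average = (first + second) / 2
    else:
        if third is None:
            raise ValueError("reads differ by more than 1 point; a third read is required")
        # Discrepant reads: average the two highest of the three scores.
        top_two = sorted([first, second, third])[-2:]
        average = sum(top_two) / 2

    # .5 averages are bumped up to the next highest score
    # (assumed here to apply to third-read averages as well).
    final_score = math.floor(average + 0.5)
    decision = "pass" if final_score >= 3 else "revise and resubmit"
    return final_score, decision


print(resolve_holistic_score(3, 2))     # adjacent reads, 2.5 is bumped up to 3
print(resolve_holistic_score(1, 3, 3))  # discrepant reads, third read required
```

Because .5 averages round up, an observed average of 2.5 is enough to pass under this rule.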


5/23

Accreditation and other requirements

The unit conducts thorough studies to establish validity and reliability of its performance assessment procedures.


6/23

Research Questions

Do the raters differ in the levels of severity they exercise?

Do faculty and practitioners rate essays in the same manner?

Are there any inconsistent raters whose patterns of ratings show little systematic relationship to the ratings that other raters give?


7/23

Research Questions

Are there any raters who cannot effectively differentiate between rubric dimensions, giving each candidate very similar ratings across a number of conceptually distinct dimensions?

Do some candidates exhibit unusual profiles of ratings, receiving unexpectedly high (or low) ratings on certain dimensions, given the ratings the candidate received on other dimensions?


8/23

Sample

40 scorers (17 higher ed.; 23 practitioners) over 4 scoring sessions from Oct. 2006-March 2007

476 teacher candidates

Essays were randomly assigned to scorers


9/23

Data Analysis

Many-Facet Rasch Analysis (many-facet Rasch measurement, MFRM)

A statistical approach to the analysis of rating data

Facilitates the study of facets of interest in assessments that typically involve human judgment

Facet: A definable aspect of an assessment setting that may exert influence on the measurement process

Raters, tasks, students, rater or student background variables, situational variables, etc.

Each facet in the analysis is composed of various individual elements

Facets examined in this study: Candidates, raters, tasks (i.e., rubric dimensions)

FACETS software was used
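
The slides do not give the exact model specification used in FACETS. A common formulation for three facets (candidate, rater, rubric dimension) is the rating scale form, in which the log-odds of category k over category k-1 equal candidate ability minus rater severity minus dimension difficulty minus a category threshold. The sketch below assumes that form; the function names and the threshold values are hypothetical and chosen only for illustration.

```python
import math


def category_probabilities(ability, severity, difficulty, thresholds):
    """Probability of each rating category, lowest category first.

    Assumes the rating scale form of the many-facet Rasch model:
        log(P_k / P_(k-1)) = ability - severity - difficulty - threshold_k
    All arguments are in logits; `thresholds` has one entry per step
    between adjacent categories (3 entries for a 4-point scale).
    """
    location = ability - severity - difficulty
    # Cumulative log-numerators: 0 for the lowest category, then add
    # (location - threshold) for each step up the scale.
    log_numerators = [0.0]
    for threshold in thresholds:
        log_numerators.append(log_numerators[-1] + location - threshold)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_rating(probabilities, lowest_category=1):
    """Model-expected rating given the category probabilities."""
    return sum((lowest_category + k) * p for k, p in enumerate(probabilities))


# Hypothetical thresholds for a 1-4 scale; not taken from the study.
steps = [-2.0, 0.0, 2.0]
average_rater = category_probabilities(0.0, 0.0, 0.0, steps)
severe_rater = category_probabilities(0.0, 1.0, 0.0, steps)
print(expected_rating(average_rater), expected_rating(severe_rater))
```

Holding the candidate and dimension fixed, a more severe rater lowers the expected rating, which is the effect the rater findings describe.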


10/23

How is MFRM useful?

Makes possible the analysis of assessments that have multiple potential sources of measurement error, such as:

Tasks

Raters

Rating scales

Allows us to quantify typical and expected sources of variability within the assessment system

Enables us to identify observations that lie outside the usual ranges of variability (fit analysis)

Helps answer the critical question: Are any of these sources introducing unwanted construct-irrelevant variation into the ratings?


11/23

What is MFRM designed to do?

Helps establish quality control over an assessment system by:

Providing useful information about how individual elements within a facet are performing

Determining which elements of the system are/are not working as intended

Identifying specific aspects of the system that may need to be tweaked to remediate system deficiencies


12/23

Findings: Raters

Raters do not all score with similar levels of severity

10.4 statistically distinct levels of rater severity

Significant fixed chi-square statistic, rejecting null hypothesis that all raters are equally lenient

Exact inter-rater agreement = 34.4% (1654 out of 4802 ratings)
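
As a quick check of the agreement figure, exact agreement is the share of rating comparisons in which two raters gave the identical score. The helper below is a minimal sketch; `exact_agreement_rate` and the example pairs are hypothetical, and only the counts 1654 and 4802 come from the slide.

```python
def exact_agreement_rate(rating_pairs):
    """Share of (rater A, rater B) score pairs that match exactly."""
    pairs = list(rating_pairs)
    if not pairs:
        raise ValueError("at least one pair of ratings is required")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


# Hypothetical pairs, only to show the input shape.
example_pairs = [(3, 3), (2, 3), (4, 4), (1, 2)]
print(exact_agreement_rate(example_pairs))

# Counts reported on the slide.
reported_rate = 1654 / 4802
print(round(100 * reported_rate, 1))
```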


13/23

----------------------------------------

|Measr|-Judges |Scale|

----------------------------------------

+ 4 + + (4) +

| | | |

| | 1067 | |

| | | |

| | | |

+ 3 + + +

| | | |

| | | |

| | 1015 1022 | |

| | 1003 | |

+ 2 + + 3 +

| | | |

| | | |

| | 1069 | |

| | | |

+ 1 + 1001 1006 1010 1064 + +

| | 1018 | |

| | 1005 1009 | |

| | | |

| | 1066 | |

* 0 * 1012 1021 * --- *

| | 1011 1062 1063 | |

| | 1002 1019 1024 1061 | |

| | 1008 1013 1023 1060 1065 | |

| | 1072 | |

+ -1 + 1004 + +

| | | |

| | 1017 | |

| | | |

| | 1025 | |

+ -2 + 1016 1050 + 2 +

| | 1007 1020 | |

| | 1070 | |

| | 1071 | |

| | | |

+ -3 + + +

| | | |

| | | |

| | | |

| | | |

+ -4 + 1068 + --- +

| | | |

| | 1014 | |

| | | |

| | 1073 | |

+ -5 + + (1) +

----------------------------------------

|Measr|-Judges |Scale|

----------------------------------------


14/23

Findings: Raters

21 of 4802 ratings (0.4%) were highly unexpected

Taking into account the rater's overall severity and the ratings the candidate received from other judges

Two raters gave 8 of 21 unexpected ratings

Most misfitting ratings were awarded on Conventions, followed by Content


15/23

Unexpected Ratings

-------------------------------------------------------------------
| Cat Step Exp. Resd StRes | Num Stu |  Num Judg | N Item         |
-------------------------------------------------------------------
|   3    3  3.9  -.9  -4.1 | 325 325 | 1073 1073 | 2 Expression   |
|   3    3  3.9  -.9  -3.7 | 416 416 | 1073 1073 | 1 Content      |
|   3    3  3.9  -.9  -3.5 | 325 325 | 1073 1073 | 4 Conventions  |
|   3    3  3.9  -.9  -3.5 | 416 416 | 1073 1073 | 5 Overall      |
|   1    1  3.0 -2.0  -3.5 | 927 927 | 1065 1065 | 4 Conventions  |
|   1    1  2.8 -1.8  -3.0 | 906 906 | 1065 1065 | 2 Expression   |
|   1    1  2.9 -1.9  -3.4 |  42  42 | 1060 1060 | 1 Content      |
|   1    1  2.9 -1.9  -3.4 | 915 915 | 1060 1060 | 1 Content      |
|   1    1  2.9 -1.9  -3.3 |  42  42 | 1060 1060 | 5 Overall      |
|   1    1  2.9 -1.9  -3.3 | 915 915 | 1060 1060 | 5 Overall      |
|   3    3  3.9  -.9  -3.7 | 280 280 | 1025 1025 | 4 Conventions  |
|   4    4  2.0  2.0   3.3 | 987 987 | 1022 1022 | 3 Organization |
|   1    1  2.9 -1.9  -3.4 | 387 387 | 1012 1012 | 1 Content      |
|   4    4  2.1  1.9   3.1 | 991 991 | 1012 1012 | 4 Conventions  |
|   4    4  1.8  2.2   3.7 | 381 381 | 1012 1012 | 4 Conventions  |
|   1    1  2.8 -1.8  -3.0 | 924 924 | 1006 1006 | 1 Content      |
|   3    3  3.9  -.9  -3.6 | 186 186 | 1002 1002 | 4 Conventions  |
|   1    1  3.1 -2.1  -3.7 | 914 914 | 1002 1002 | 4 Conventions  |
|   3    3  1.3  1.7   3.7 | 328 328 | 1001 1001 | 4 Conventions  |
|   1    1  3.0 -2.0  -3.6 |  81  81 | 1001 1001 | 1 Content      |
-------------------------------------------------------------------
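
In this output, Resd is the observed category minus the model-expected rating (for example, 3 - 3.9 = -.9), and StRes is that residual divided by the model standard deviation of the rating. The sketch below shows the arithmetic under that reading. The function name and the category probabilities are hypothetical: the table does not report the probabilities behind each row, so the example does not reproduce any particular row's StRes.

```python
import math


def standardized_residual(observed, probabilities, lowest_category=1):
    """Return (expected, residual, standardized residual) for one rating.

    probabilities: model probability of each category, lowest first.
    """
    categories = [lowest_category + k for k in range(len(probabilities))]
    expected = sum(c * p for c, p in zip(categories, probabilities))
    variance = sum((c - expected) ** 2 * p for c, p in zip(categories, probabilities))
    if variance == 0:
        raise ValueError("model variance is zero; residual cannot be standardized")
    residual = observed - expected
    return expected, residual, residual / math.sqrt(variance)


# Hypothetical case: the model is 90% sure of a 4, but the rater gave a 3.
expected, residual, z = standardized_residual(3, [0.0, 0.0, 0.1, 0.9])
print(round(expected, 2), round(residual, 2), round(z, 2))
```

Every row in the table has an absolute StRes of at least 3, so a flagging threshold of 3 would be consistent with this output, although the slides do not state the threshold used.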


16/23

Findings: Raters

6 of 9 scorers with unexpected ratings were practitioners

15 of 21 unexpected ratings were from practitioners

Overall, practitioner raters were more lenient than faculty raters

21 raters did not use rating scales consistently across all candidates and all rubric dimensions (mean square infit and outfit statistics >1.2 or
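
The mean-square infit and outfit statistics mentioned above are conventionally computed from residuals: outfit is the average squared standardized residual, and infit is the sum of squared residuals divided by the sum of model variances. The sketch below follows those conventional definitions. The function name and input values are hypothetical, and only the upper bound of 1.2 is taken from the slide, because the lower bound is cut off in the source.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit, outfit) mean squares for one rater's ratings.

    observed:  ratings the rater actually gave
    expected:  model-expected ratings for the same essays
    variances: model variances of those ratings (all > 0)
    """
    if not (len(observed) == len(expected) == len(variances)) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(observed)
    # Infit: information-weighted, less sensitive to isolated outliers.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


UPPER_BOUND = 1.2  # from the slide; the lower bound is truncated in the source

# Hypothetical ratings for one rater.
infit, outfit = fit_mean_squares([4, 1], [2.5, 2.5], [0.5, 0.5])
print(infit, outfit, infit > UPPER_BOUND or outfit > UPPER_BOUND)
```

Values near 1 indicate ratings about as variable as the model expects. Values well above 1 indicate more variation than expected.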


17/23

Findings: Candidates

Candidates are well differentiated in terms of essay writing skills

4.1 statistically distinct levels of candidate proficiency

26% had ratings that were more variable than expected (unusually severe or lenient raters)

51% had ratings with less variance than expected (little variation in ratings across rubric dimensions)


18/23

Findings: Candidates

Observed vs. Fair Average scores

Fair Average score adjusts Observed Average based on differences in severity/leniency of raters

Overall Score:

Observed Average = 2.62; Fair Average = 2.64

Mean difference is not statistically significant, but it may be significant for some students!


19/23

Observed vs. Fair Average Scores

Use of Fair Average would affect individual pass rates

If used, cut off score must be carefully selected

Observed  Fair  Difference  Student
       2  1.11        0.89      363
     3.5  2.79        0.71      382
     3.5  2.79        0.71      416
       3  2.31        0.69      336
     3.5  2.82        0.68       72
     2.5  1.82        0.68       91
       3  2.33        0.67      254
     3.5  2.87        0.63      341
       2  1.41        0.59      298
       2  1.44        0.56      343
     2.5  1.94        0.56      337
       3  2.48        0.52      310
     3.5  2.98        0.52      304
       3   2.5         0.5       79
       4   3.5         0.5       96
     3.5  3.01        0.49       66
       2  1.52        0.48      961
       3  2.52        0.48      331
       3  2.52        0.48      389
     3.5  3.02        0.48      350
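
To illustrate why the cut-off matters, the sketch below applies a pass rule to the 20 students listed in this table. The observed cut-off of 2.5 follows from the scoring rule (.5 averages are bumped up, and 3 or 4 passes). The Fair Average cut-off of 2.5 is purely hypothetical, since the slides do not propose one, and the names used are illustrative.

```python
# (observed average, fair average, student ID) rows from the table on this slide.
ROWS = [
    (2.0, 1.11, 363), (3.5, 2.79, 382), (3.5, 2.79, 416), (3.0, 2.31, 336),
    (3.5, 2.82, 72), (2.5, 1.82, 91), (3.0, 2.33, 254), (3.5, 2.87, 341),
    (2.0, 1.41, 298), (2.0, 1.44, 343), (2.5, 1.94, 337), (3.0, 2.48, 310),
    (3.5, 2.98, 304), (3.0, 2.5, 79), (4.0, 3.5, 96), (3.5, 3.01, 66),
    (2.0, 1.52, 961), (3.0, 2.52, 331), (3.0, 2.52, 389), (3.5, 3.02, 350),
]

OBSERVED_CUTOFF = 2.5  # implied by the scoring rule: 2.5 is bumped up to 3, which passes
FAIR_CUTOFF = 2.5      # hypothetical; the slides leave the Fair Average cut-off open


def decision_changes(rows, observed_cutoff, fair_cutoff):
    """Return IDs of students whose pass/fail decision differs between the two scores."""
    changed = []
    for observed, fair, student in rows:
        passes_observed = observed >= observed_cutoff
        passes_fair = fair >= fair_cutoff
        if passes_observed != passes_fair:
            changed.append(student)
    return changed


print(decision_changes(ROWS, OBSERVED_CUTOFF, FAIR_CUTOFF))
```

Under this assumed cut-off, five of the 20 listed students would move from pass to revise and resubmit. A different cut-off would change that count, which is why the slide calls for careful selection.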


20/23

Findings: Rubrics

Rubric dimensions differ significantly in terms of difficulty

Difficulty measures (in logits)

Overall Score (.33)

Content (.19)

Conventions (-.06)

Organization (-.08)

Expression (-.38)


21/23

----------------------------------------------------------------------------------------

|Measr|+Students |-Judges |-Item | S.1 | S.2 | S.3 | S.4 | S.5 |

----------------------------------------------------------------------------------------

+ 6 + *. + + + (4) + (4) + (4) + (4) + (4) +

| | | | | | | | | |

| | | | | | | | | |

| | . | | | | | | | |

+ 5 + . + + + + + + + +

| | . | | | | | | | |

| | . | | | | | | | |

| | | | | | | | | |

+ 4 + . + + + + + + + +

| | . | | | | | | | |

| | *. | | | | | | | |

| | . | | | | --- | | | |

+ 3 + . + + + + + + --- + --- +

| | . | | | --- | | --- | | |

| | *** | | | | | | | |

| | *. | | | | | | | |

+ 2 + **. + + + + + + + +

| | ****. | | | | | | | |

| | ****. | * | | | 3 | | | 3 |

| | *****. | ** | | 3 | | 3 | 3 | |

+ 1 + ****** + *** + + + + + + +

| | ****. | | | | | | | |

| | *******. | ** | | | | | | |

| | ********. | ** | Content Overall | | | | | |

* 0 * *******. * ****** * Conventions Organization * * * --- * * --- *

| | *******. | ** | | --- | --- | | --- | |

| | *********. | *** | Expression | | | | | |

| | ******. | ***** | | | | | | |

+ -1 + ******* + *** + + + + + + +

| | *** | *** | | | | 2 | | |

| | ****. | * | | 2 | 2 | | 2 | 2 |

| | ***** | | | | | | | |

+ -2 + ***. + *** + + + + + + +

| | ** | * | | | | | | |

| | *. | | | | | --- | | |

| | **. | * | | --- | | | --- | |

+ -3 + . + + + + --- + + + --- +

| | *. | * | | | | | | |

| | . | | | | | | | |

| | . | | | | | | | |

+ -4 + + + + + + + + +

| | . | | | | | | | |

| | | | | | | | | |

| | . | * | | | | | | |

+ -5 + + + + (1) + (1) + (1) + (1) + (1) +

----------------------------------------------------------------------------------------

Figure 1: Career Commitment Essay Variable Map


22/23

Conclusions

Inter-rater reliability could be improved

Some ratings are overly consistent

Some raters are inconsistent

Current practice of bumping scores up compensates for overly severe raters


23/23

Recommendations

Increase rater training overall, with targeted training for specific raters. Focus on:

Meaning and distinctions among rubric dimensions

Using entire rating scale

Use skilled, experienced scorers rather than continuously recruiting new scorers

Share results such as these in rater training

Consider using Fair Average score

Implement standard setting procedure for determining cut off score

Evaluate predictive validity of Career Commitment Essay
