7/29/2019 Evaluating Rater Performance in the Scoring of Career
1/23
Evaluating Rater Performance in the Scoring of Career Commitment Essays
Susan Gracia
Ph.D. in Education Faculty Research Seminar
December 4, 2008
Career Commitment Essay
One of several admissions
requirements to FSEHD
2-3 page essay describing:
Why a candidate wants to be a teacher
Personal skills and characteristics s/he brings
Area for improvement
CCE Rubric
Evaluation based on 4 dimensions or traits:
Content
Expression/voice
Organization
Conventions
Plus an overall, holistic score
The 2 types of scoring provide a useful distinction between overall performance and performance using particular skills
Scoring
Scorers: higher ed. faculty & K-12 practitioners
Scorer training: 1.5-2 hours
Each essay gets 2 blind reads
Holistic scores are averaged to determine pass/fail
If 2 essay scores are exactly the same or within 1 point of each other, student receives average of the 2 scores
.5 averages are bumped up to next highest score
Average holistic scores of 3 or 4: pass
Average holistic scores of 1 or 2: revise and resubmit
If essay scores deviate by more than 1 point, essay is read by a 3rd scorer
Student receives average of 2 highest scores
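The scoring rules on this slide can be sketched in code. This is a minimal illustration, not the unit's actual procedure: the function name `holistic_result` is hypothetical, reads are assumed to be whole-number scores from 1 to 4, and the bump-up of .5 averages is assumed to apply after a 3rd read as well.

```python
import math


def holistic_result(read1, read2, read3=None):
    """Return (final_score, decision) for one essay (hypothetical helper)."""
    if abs(read1 - read2) <= 1:
        # Same score or within 1 point: average the 2 reads.
        average = (read1 + read2) / 2
    else:
        # More than 1 point apart: a 3rd read is required.
        if read3 is None:
            raise ValueError("reads differ by more than 1 point: a 3rd read is needed")
        # Average the 2 highest of the 3 scores.
        highest_two = sorted([read1, read2, read3])[-2:]
        average = sum(highest_two) / 2
    # .5 averages are bumped up to the next highest score.
    final_score = math.ceil(average)
    decision = "pass" if final_score >= 3 else "revise and resubmit"
    return final_score, decision
```

For example, reads of 2 and 3 average to 2.5, which is bumped up to 3 and passes, while reads of 2 and 2 lead to revise and resubmit.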
Accreditation and other
requirements
The unit conducts thorough studies to establish validity and reliability of its performance assessment procedures.
Research Questions
Do the raters differ in the levels of severity they exercise?
Do faculty and practitioners rate essays in the same manner?
Are there any inconsistent raters whose patterns of ratings show little systematic relationship to the ratings that other raters give?
Research Questions
Are there any raters who cannot effectively differentiate between rubric dimensions, giving each candidate very similar ratings across a number of conceptually distinct dimensions?
Do some candidates exhibit unusual profiles of ratings, receiving unexpectedly high (or low) ratings on certain dimensions, given the ratings the candidate received on other dimensions?
Sample
40 scorers (17 higher ed.; 23 practitioners) over 4 scoring sessions from Oct. 2006-March 2007
476 teacher candidates
Essays were randomly assigned to
scorers
Data Analysis
Many Facet Rasch Analysis
A statistical approach to the analysis of rating data
Facilitates the study of facets of interest in assessments that typically involve human judgment
Facet: A definable aspect of an assessment setting that may exert influence on the measurement process
Raters, tasks, students, rater or student background variables, situational variables, etc.
Each facet in the analysis is composed of various individual elements
Facets examined in this study: Candidates, raters, tasks (i.e., rubric dimensions)
FACETS software was used
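To make the three facets concrete, here is a minimal sketch of how a many-facet Rasch model turns candidate ability, rater severity, and dimension difficulty into rating probabilities. The slides do not state the model equation or the FACETS specification used, so this assumes the common rating-scale form, where the log-odds of category k over k-1 is ability minus severity minus difficulty minus a threshold. All function names and parameter values are illustrative.

```python
import math


def category_probabilities(ability, severity, difficulty, thresholds):
    """Probabilities of each rating category under a rating-scale MFRM.

    thresholds[k] is the (assumed) step difficulty between category k+1
    and k+2, so 3 thresholds give the 4 categories of a 1-4 rubric.
    All quantities are in logits.
    """
    # Cumulative sums of (ability - severity - difficulty - threshold);
    # the lowest category has an empty sum, i.e. 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + ability - severity - difficulty - tau)
    weights = [math.exp(x) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(probs, lowest=1):
    """Model-expected rating given category probabilities."""
    return sum((lowest + k) * p for k, p in enumerate(probs))


# Same candidate and dimension, scored by a lenient and a severe rater.
lenient = category_probabilities(0.5, -1.0, 0.0, [-2.0, 0.0, 2.0])
severe = category_probabilities(0.5, 1.0, 0.0, [-2.0, 0.0, 2.0])
print(round(expected_rating(lenient), 2), round(expected_rating(severe), 2))
```

Under this sketch a more severe rater lowers the expected rating for the same candidate, which is the kind of effect the analysis tries to separate from candidate proficiency.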
How is MFRM useful?
Makes possible the analysis of assessments that have multiple potential sources of measurement error, such as:
Tasks
Raters
Rating scales
Allows us to quantify typical and expected sources of variability within the assessment system
Enables us to identify observations that lie outside the usual ranges of variability (fit analysis)
Helps answer the critical question: Are any of these sources introducing unwanted construct-irrelevant variation into the ratings?
What is MFRM designed to
do?
Helps establish quality control over an
assessment system by:
Providing useful information about how
individual elements within a facet are
performing
Determining which elements of the system
are/are not working as intended
Identifying specific aspects of the system
that may need to be tweaked to remediate
system deficiencies
Findings: Raters
Not all raters score with similar levels of severity
10.4 statistically distinct levels of rater severity
Significant fixed chi-square statistic, rejecting null hypothesis that all raters are equally lenient
Exact inter-rater agreement = 34.4% (1654 out of 4802 ratings)
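The exact agreement figure is a simple proportion. The sketch below shows one way such a percentage could be computed from paired ratings. The helper name `exact_agreement` and the sample pairs are hypothetical, and the slides do not say how FACETS paired the ratings.

```python
def exact_agreement(rating_pairs):
    """Percentage of rating pairs in which both raters gave the same score."""
    pairs = list(rating_pairs)
    matches = sum(1 for first, second in pairs if first == second)
    return 100.0 * matches / len(pairs)


# Hypothetical pairs of (rater A score, rater B score) on the same essay.
sample = [(3, 3), (2, 3), (4, 4), (1, 2)]
print(exact_agreement(sample))

# The counts reported on the slide reproduce the stated percentage.
print(round(100 * 1654 / 4802, 1))
```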
----------------------------------------
|Measr|-Judges |Scale|
----------------------------------------
+ 4 + + (4) +
| | | |
| | 1067 | |
| | | |
| | | |
+ 3 + + +
| | | |
| | | |
| | 1015 1022 | |
| | 1003 | |
+ 2 + + 3 +
| | | |
| | | |
| | 1069 | |
| | | |
+ 1 + 1001 1006 1010 1064 + +
| | 1018 | |
| | 1005 1009 | |
| | | |
| | 1066 | |
* 0 * 1012 1021 * --- *
| | 1011 1062 1063 | |
| | 1002 1019 1024 1061 | |
| | 1008 1013 1023 1060 1065 | |
| | 1072 | |
+ -1 + 1004 + +
| | | |
| | 1017 | |
| | | |
| | 1025 | |
+ -2 + 1016 1050 + 2 +
| | 1007 1020 | |
| | 1070 | |
| | 1071 | |
| | | |
+ -3 + + +
| | | |
| | | |
| | | |
| | | |
+ -4 + 1068 + --- +
| | | |
| | 1014 | |
| | | |
| | 1073 | |
+ -5 + + (1) +
----------------------------------------
|Measr|-Judges |Scale|
----------------------------------------
Findings: Raters
21 of 4802 ratings (0.4%) were highly unexpected
Taking into account the rater's overall severity and the other ratings the candidate received from other judges
Two raters gave 8 of 21 unexpected ratings
Most misfitting ratings were awarded on Conventions, followed by Content
Unexpected Ratings
------------------------------------------------------------------
|Cat Step Exp. Resd StRes| Num Stu Num Judg N Item |
------------------------------------------------------------------
| 3 3 3.9 -.9 -4.1 | 325 325 1073 1073 2 Expression |
| 3 3 3.9 -.9 -3.7 | 416 416 1073 1073 1 Content |
| 3 3 3.9 -.9 -3.5 | 325 325 1073 1073 4 Conventions |
| 3 3 3.9 -.9 -3.5 | 416 416 1073 1073 5 Overall |
| 1 1 3.0 -2.0 -3.5 | 927 927 1065 1065 4 Conventions |
| 1 1 2.8 -1.8 -3.0 | 906 906 1065 1065 2 Expression |
| 1 1 2.9 -1.9 -3.4 | 42 42 1060 1060 1 Content |
| 1 1 2.9 -1.9 -3.4 | 915 915 1060 1060 1 Content |
| 1 1 2.9 -1.9 -3.3 | 42 42 1060 1060 5 Overall |
| 1 1 2.9 -1.9 -3.3 | 915 915 1060 1060 5 Overall |
| 3 3 3.9 -.9 -3.7 | 280 280 1025 1025 4 Conventions |
| 4 4 2.0 2.0 3.3 | 987 987 1022 1022 3 Organization |
| 1 1 2.9 -1.9 -3.4 | 387 387 1012 1012 1 Content |
| 4 4 2.1 1.9 3.1 | 991 991 1012 1012 4 Conventions |
| 4 4 1.8 2.2 3.7 | 381 381 1012 1012 4 Conventions |
| 1 1 2.8 -1.8 -3.0 | 924 924 1006 1006 1 Content |
| 3 3 3.9 -.9 -3.6 | 186 186 1002 1002 4 Conventions |
| 1 1 3.1 -2.1 -3.7 | 914 914 1002 1002 4 Conventions |
| 3 3 1.3 1.7 3.7 | 328 328 1001 1001 4 Conventions |
| 1 1 3.0 -2.0 -3.6 | 81 81 1001 1001 1 Content |
------------------------------------------------------------------
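In the table above, Cat is the observed rating, Exp. the model-expected rating, Resd the residual (Cat minus Exp., e.g. 3 - 3.9 = -.9), and StRes the standardized residual. A minimal sketch of that calculation follows. The category probabilities are hypothetical, chosen only so the expected rating comes out at 3.9. They are not taken from the study, so the standardized value does not match any row of the table.

```python
import math


def standardized_residual(observed, category_probs, lowest=1):
    """Return (expected, residual, standardized residual) for one rating.

    category_probs are the model probabilities of categories
    lowest, lowest + 1, ... for this candidate/rater/dimension combination.
    """
    categories = range(lowest, lowest + len(category_probs))
    expected = sum(k * p for k, p in zip(categories, category_probs))
    variance = sum((k - expected) ** 2 * p for k, p in zip(categories, category_probs))
    residual = observed - expected
    return expected, residual, residual / math.sqrt(variance)


# Hypothetical probabilities for ratings 1-4: the model strongly expects a 4.
expected, residual, st_res = standardized_residual(3, [0.0, 0.0, 0.1, 0.9])
print(round(expected, 1), round(residual, 1), round(st_res, 1))
```

A rating of 3 looks unremarkable on its own, but when the model puts most of the probability on 4 the variance is small, so even a residual under 1 point produces a large standardized residual.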
Findings: Raters
6 of 9 scorers with unexpected ratings were practitioners
15 of 21 unexpected ratings were from practitioners
Overall, practitioner raters were more lenient than faculty raters
21 raters did not use rating scales consistently across all candidates and all rubric dimensions (mean square infit and outfit statistics >1.2 or
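Mean square infit and outfit statistics summarize how far a rater's ratings depart from model expectations. The sketch below uses the standard Rasch definitions: outfit is the unweighted mean of squared standardized residuals, and infit weights the squared residuals by the model variance. The function name and sample numbers are hypothetical.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one rater's ratings.

    observed, expected and variance are parallel lists: the rating given,
    the model-expected rating, and the model variance of that rating.
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals,
    # so it is sensitive to a few very unexpected ratings.
    outfit = sum(r / v for r, v in zip(squared_residuals, variance)) / len(squared_residuals)
    # Infit: squared residuals divided by summed variances (information-weighted).
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit


# Hypothetical ratings from one rater.
infit, outfit = fit_mean_squares([3, 2, 4], [2.5, 2.5, 3.0], [0.5, 0.5, 0.5])
print(infit, outfit)
```

Values near 1 indicate ratings about as variable as the model predicts. Larger values indicate more noise than expected, and smaller values indicate less variation than expected.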
Findings: Candidates
Candidates are well differentiated in terms of essay writing skills
4.1 statistically distinct levels of candidate proficiency
26% had ratings that were more variable than expected (unusually severe or lenient raters)
51% had ratings with less variance than expected (little variation in ratings across rubric dimensions)
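The slides do not say how the "statistically distinct levels" were computed. Assuming they are the strata index that Rasch software commonly reports, H = (4G + 1) / 3 with G the separation ratio, the relationship can be sketched as follows. The function names are hypothetical, and any separation or reliability derived this way is a back-calculation under that assumption, not a figure from the study.

```python
def strata_from_separation(separation):
    """Strata index H = (4G + 1) / 3 for a separation ratio G."""
    return (4 * separation + 1) / 3


def separation_from_strata(strata):
    """Invert the strata formula to recover the separation ratio G."""
    return (3 * strata - 1) / 4


def reliability_from_separation(separation):
    """Separation reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


# Back-calculation for the 4.1 candidate levels (assumption, see lead-in).
g_candidates = separation_from_strata(4.1)
print(round(g_candidates, 3), round(reliability_from_separation(g_candidates), 2))
```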
Findings: Candidates
Observed vs. Fair Average scores
Fair Average score adjusts Observed
Average based on differences in
severity/leniency of raters
Overall Score:
Observed Average=2.62; Fair Average=2.64
Mean difference is not statistically significant, but it may be significant for some students!
Observed vs. Fair Average
Scores
Use of Fair Average would affect individual pass rates
If used, cut off score must be carefully selected
Observed Fair Difference Student
2 1.11 0.89 363
3.5 2.79 0.71 382
3.5 2.79 0.71 416
3 2.31 0.69 336
3.5 2.82 0.68 72
2.5 1.82 0.68 91
3 2.33 0.67 254
3.5 2.87 0.63 341
2 1.41 0.59 298
2 1.44 0.56 343
2.5 1.94 0.56 337
3 2.48 0.52 310
3.5 2.98 0.52 304
3 2.5 0.5 79
4 3.5 0.5 96
3.5 3.01 0.49 66
2 1.52 0.48 961
3 2.52 0.48 331
3 2.52 0.48 389
3.5 3.02 0.48 350
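The effect of the cut off score can be illustrated with the rows in this table. The sketch below assumes an observed average passes at 2.5 or above (since .5 averages are bumped up to 3) and applies a purely hypothetical cut off of 2.5 to the Fair Average. The study does not propose this cut off, and the function name is illustrative.

```python
# (observed average, fair average, student number) as listed in the table.
ROWS = [
    (2, 1.11, 363), (3.5, 2.79, 382), (3.5, 2.79, 416), (3, 2.31, 336),
    (3.5, 2.82, 72), (2.5, 1.82, 91), (3, 2.33, 254), (3.5, 2.87, 341),
    (2, 1.41, 298), (2, 1.44, 343), (2.5, 1.94, 337), (3, 2.48, 310),
    (3.5, 2.98, 304), (3, 2.5, 79), (4, 3.5, 96), (3.5, 3.01, 66),
    (2, 1.52, 961), (3, 2.52, 331), (3, 2.52, 389), (3.5, 3.02, 350),
]


def changed_decisions(rows, fair_cutoff, observed_cutoff=2.5):
    """Students whose pass/fail decision differs between the two averages."""
    changed = []
    for observed, fair, student in rows:
        passes_observed = observed >= observed_cutoff
        passes_fair = fair >= fair_cutoff  # hypothetical rule for the Fair Average
        if passes_observed != passes_fair:
            changed.append(student)
    return changed


print(changed_decisions(ROWS, fair_cutoff=2.5))
```

Moving `fair_cutoff` up or down changes which of these students pass, which is why the cut off needs to be chosen carefully.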
Findings: Rubrics
Rubric dimensions differ significantly in terms of difficulty
Difficulty measures (in logits)
Overall Score (.33)
Content (.19)
Conventions (-.06)
Organization (-.08)
Expression (-.38)
----------------------------------------------------------------------------------------
|Measr|+Students |-Judges |-Item | S.1 | S.2 | S.3 | S.4 | S.5 |
----------------------------------------------------------------------------------------
+ 6 + *. + + + (4) + (4) + (4) + (4) + (4) +
| | | | | | | | | |
| | | | | | | | | |
| | . | | | | | | | |
+ 5 + . + + + + + + + +
| | . | | | | | | | |
| | . | | | | | | | |
| | | | | | | | | |
+ 4 + . + + + + + + + +
| | . | | | | | | | |
| | *. | | | | | | | |
| | . | | | | --- | | | |
+ 3 + . + + + + + + --- + --- +
| | . | | | --- | | --- | | |
| | *** | | | | | | | |
| | *. | | | | | | | |
+ 2 + **. + + + + + + + +
| | ****. | | | | | | | |
| | ****. | * | | | 3 | | | 3 |
| | *****. | ** | | 3 | | 3 | 3 | |
+ 1 + ****** + *** + + + + + + +
| | ****. | | | | | | | |
| | *******. | ** | | | | | | |
| | ********. | ** | Content Overall | | | | | |
* 0 * *******. * ****** * Conventions Organization * * * --- * * --- *
| | *******. | ** | | --- | --- | | --- | |
| | *********. | *** | Expression | | | | | |
| | ******. | ***** | | | | | | |
+ -1 + ******* + *** + + + + + + +
| | *** | *** | | | | 2 | | |
| | ****. | * | | 2 | 2 | | 2 | 2 |
| | ***** | | | | | | | |
+ -2 + ***. + *** + + + + + + +
| | ** | * | | | | | | |
| | *. | | | | | --- | | |
| | **. | * | | --- | | | --- | |
+ -3 + . + + + + --- + + + --- +
| | *. | * | | | | | | |
| | . | | | | | | | |
| | . | | | | | | | |
+ -4 + + + + + + + + +
| | . | | | | | | | |
| | | | | | | | | |
| | . | * | | | | | | |
+ -5 + + + + (1) + (1) + (1) + (1) + (1) +
----------------------------------------------------------------------------------------
Figure 1: Career Commitment Essay Variable Map
Conclusions
Inter-rater reliability could be improved
Some ratings are overly consistent
Some raters are inconsistent
Current practice of bumping scores up compensates for overly severe raters
Recommendations
Increase rater training overall, with targeted training for specific raters. Focus on:
Meaning and distinctions among rubric dimensions
Using entire rating scale
Use skilled, experienced scorers rather than continuously recruiting new scorers
Share results such as these in rater training
Consider using Fair Average score
Implement standard setting procedure for determining cut off score
Evaluate predictive validity of Career Commitment Essay