Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time
Yeow Meng Thum, Hye Sook Shin
UCLA Graduate School of Education & Information Studies
National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
CRESST Conference 2004
Los Angeles
Rationale
• Research shows that cut-scores vary as a function of many factors: raters, procedures, and time.
• How does one defend a particular cut-score? Averaging several values and using collateral information are the current options.
• High-stakes accountability hinges on the comparability of performance standards over time.
• Some method is required to monitor cut-scores for consistency across groups and over time (Green et al.).
Purpose of Study
• Develop an approach for estimating the impact of procedural factors, rater characteristics, and time on cut-scores.
• Monitor the consistency of cut-scores across several groups.
Transforming Judgments into Scale Scores

Figure 1: Working with the Grade 3 SAT-9 mathematics scale. [Plot of the pass probability curve against twin logit and scale-score axes; the cut-score of 0.633 logits corresponds to 619 scale-score points.]
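Figure 1's correspondence between the two metrics can be sketched if a linear logit-to-scale transformation is assumed. The slope A below is an illustrative value, not the SAT-9's actual scaling constant; the intercept is then pinned down by the anchor pair (0.633 logits, 619 points) from the figure.

```python
# Hypothetical linear transform: scale = A * logit + B.
# A (scale points per logit) is assumed; B is chosen so that 0.633 logits
# maps to 619 scale-score points, matching Figure 1.
A = 30.0
B = 619.0 - A * 0.633

def logit_to_scale(logit):
    """Convert a cut-score on the logit metric to scale-score points."""
    return A * logit + B

def scale_to_logit(scale):
    """Convert a scale-score cut back to the logit metric."""
    return (scale - B) / A

print(round(logit_to_scale(0.633)))  # 619 by construction
```

The same two functions would let a monitoring program report rater cut-scores on whichever metric its audience expects.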
Performance Distribution for Four Urban Schools

Figure 2: Grade 3 SAT-9 mathematics scale score distribution for four schools. [Four density panels over the 400-800 scale-score range, each marking the cut-score of 619: School A, 32% proficient; School B, 70%; School C, 19%; School D, 32%.]
Potential Impact of Revising a Cut-score

Revised cut-score (as fraction of SEM)

School    -1      -0.5    0       +0.5    +1
A         41%     37%     32%     29%     26%
B         78%     75%     70%     67%     63%
C         25%     23%     19%     15%     13%
D         40%     36%     32%     28%     25%

Table 1: Potential impact on school performance when the cut-score changes
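The mechanics behind Table 1 can be reproduced for an assumed normal score distribution. The school means, standard deviations, and SEM below are hypothetical, chosen only so the 0-SEM column lands near Table 1's baseline values; they are not the actual Figure 2 distributions.

```python
import math

def pct_above(cut, mean, sd):
    """Fraction of a Normal(mean, sd) score distribution at or above `cut`."""
    z = (cut - mean) / sd
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical school distributions (baseline ~32% and ~70% proficient at 619).
schools = {"A": (600, 40), "B": (640, 40)}
SEM = 20  # hypothetical standard error of measurement, in scale-score points

for name, (mean, sd) in schools.items():
    row = [round(100 * pct_above(619 + f * SEM, mean, sd))
           for f in (-1, -0.5, 0, 0.5, 1)]
    print(name, row)
```

Because the cut falls on different parts of each school's density, the same half-SEM shift moves schools by different amounts, which is exactly the sensitivity Table 1 illustrates.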
Data & Model
• Simulate data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995).
• Factors of the standard-setting study:
  a. Rater dimensions (teacher, non-teacher, etc.)
  b. Procedural factors/treatments:
     1. Type of feedback (outcome or impact feedback, "yes" or "no", etc.)
     2. Item sampling in booklet (number of items, etc.)
     3. Type of task (a modified Angoff, a contrasting-groups approach, the Bookmark method, etc.)
Treating Binary Outcomes

Binary outcome:
$$y_{ijt} = \begin{cases} 1 & \text{for ``pass''} \\ 0 & \text{for ``fail''} \end{cases} \quad (1)$$

Logit link function:
$$\eta_{ijt} = \ln\!\left(\frac{p_{ijt}}{1 - p_{ijt}}\right) \quad (2)$$

(pass if rater j thinks the passing candidate has a good chance of getting the i-th item right in session t)
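Equation (2) and its inverse in executable form, as a minimal sketch; the 0.633/0.653 pair below simply echoes Figure 1's cut-score, and is not a value estimated in the paper.

```python
import math

def logit(p):
    """Log-odds of a pass probability p (eq. 2)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Pass probability implied by a value on the logit scale."""
    return 1.0 / (1.0 + math.exp(-eta))

# A cut-score of 0.633 logits corresponds to about a 0.653 pass probability.
print(round(inv_logit(0.633), 3))
```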
IRT Model for Cut-score - I

Procedural factors impacting a rater's cut-scores:
$$\kappa_{jt} = \sum_{s} \gamma_{s} S_{sjt} + \delta_{jt} \quad (3)$$

where $\gamma_s$ is the fixed effect due to session characteristic $S_{sjt}$, and $\delta_{jt}$ is a random effect that evolves over time ($ROUND_{jt}$) as a function of rater characteristics $X_{pj}$.

Item response model (IRT):
$$\eta_{ijt} = \kappa_{jt} - d_{i} \quad (4)$$
Estimating factors impacting a rater's cut-scores:
$$\begin{aligned} \delta_{jt} &= \beta_{0j} + \beta_{1j}\,ROUND_{jt} \\ \beta_{0j} &= \beta_{00} + \sum_{p} \beta_{0p} X_{pj} + u_{0j} \\ \beta_{1j} &= \beta_{10} + \sum_{p} \beta_{1p} X_{pj} + u_{1j} \end{aligned} \quad (5)$$

$(u_{0j}, u_{1j})$ are distributed bivariate normal with means $(0, 0)$ and variance-covariance matrix
$$T = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{pmatrix}$$
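Equations (2)-(5) can be sketched as a data-generating model for one rater. Every numeric value below (session effects, betas, random-effect SDs, item difficulty) is hypothetical, chosen only to illustrate how the pieces combine.

```python
import math
import random

random.seed(7)

def p_pass(kappa_jt, d_i):
    """Eqs. (2) and (4): probability rater j passes item i in session t."""
    return 1.0 / (1.0 + math.exp(-(kappa_jt - d_i)))

# Hypothetical effects; none of these values come from the paper.
gamma = {"feedback": 0.10, "targeting": -0.05, "task": 0.08}  # session effects
beta00, beta10 = 0.60, 0.02      # intercept and round slope (eq. 5)
beta0_teacher = 0.05             # rater-characteristic effect on the intercept
u0 = random.gauss(0, 0.1)        # rater random intercept
u1 = random.gauss(0, 0.02)       # rater random round slope

def kappa(session, round_t, teacher):
    """Eqs. (3) and (5): fixed session effects plus an evolving rater effect."""
    fixed = sum(gamma[s] for s in session)
    b0 = beta00 + beta0_teacher * teacher + u0
    b1 = beta10 + u1
    return fixed + b0 + b1 * round_t

k = kappa(session=["feedback"], round_t=3, teacher=1)
print(round(p_pass(k, d_i=0.4), 3))
```

Raising a session effect or the round slope raises kappa and hence the rater's implied cut-score, which is the substantive quantity being monitored.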
IRT Model for Cut-score - II
Likelihood

Conditional on $\delta_j$, $y_j$ has probability
$$f(y_j; \delta_j) = \prod_{t} \prod_{i} p_{ijt}^{\,y_{ijt}} \left[1 - p_{ijt}\right]^{(1 - y_{ijt})} \quad (6)$$

Prior distribution of $\delta_j$ is $g(\delta_j; T)$. \quad (7)

The conditional posterior of the rater random effects $\delta_j$ is
$$\frac{f(y_j; \delta_j)\, g(\delta_j; T)}{h(y_j, \beta, T)}, \quad \text{where } h(y_j, \beta, T) = \int_{\delta_j} f(y_j; \delta_j)\, g(\delta_j; T)\, d\delta_j$$

Joint marginal likelihood:
$$\prod_{j} h(y_j, \beta, T) \quad (8)$$
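The integral defining one rater's marginal contribution h can be sketched by simple Monte Carlo: average the conditional likelihood (6) over draws of the random effects. For simplicity this sketch assumes a diagonal T (independent u0, u1) and folds all fixed session effects into b0; the parameter values are hypothetical. (PROC NLMIXED instead uses adaptive Gaussian quadrature, but the quantity approximated is the same.)

```python
import math
import random

random.seed(1)

def f_cond(y, etas):
    """Eq. (6): Bernoulli likelihood of one rater's responses given linear predictors."""
    out = 1.0
    for y_it, eta in zip(y, etas):
        p = 1.0 / (1.0 + math.exp(-eta))
        out *= p ** y_it * (1.0 - p) ** (1 - y_it)
    return out

def h_marginal(y, d, rounds, b0, b1, tau0, tau1, draws=4000):
    """Monte Carlo estimate of one rater's h: average f over (u0, u1) draws.
    Diagonal T assumed; fixed session effects are absorbed into b0."""
    total = 0.0
    for _ in range(draws):
        u0, u1 = random.gauss(0, tau0), random.gauss(0, tau1)
        etas = [(b0 + u0) + (b1 + u1) * r - d_i for d_i, r in zip(d, rounds)]
        total += f_cond(y, etas)
    return total / draws

y = [1, 1, 0, 1]                 # one rater's pass/fail judgments
d = [0.2, 0.5, 1.1, 0.3]         # hypothetical item difficulties
rounds = [1, 2, 3, 4]
lik = h_marginal(y, d, rounds, b0=0.6, b1=0.02, tau0=0.3, tau1=0.05)
print(round(lik, 4))
```

Multiplying these per-rater averages across raters gives a Monte Carlo approximation of the joint marginal likelihood (8).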
Multiple Studies: Consistency & Stability

Procedural factors impacting a rater's cut-scores for separate study g (g = 1, 2, ..., G):
$$\kappa_{jt} = \sum_{s} \gamma_{sg} S_{sjt} + \delta_{jt} \quad (9)$$

where $\gamma_{sg}$ is the fixed effect due to session characteristic $S_{sjt}$ in study g, and $\delta_{jt}$ is a random effect that evolves over time ($ROUND_{jt}$) as a function of rater characteristics $X_{pj}$.
Group factors impacting a rater's severity:
$$\begin{aligned} \delta_{jt} &= \beta_{0j} + \beta_{1j}\,ROUND_{jt} \\ \beta_{0j} &= \sum_{g=1}^{G} \beta_{00g}\,GROUP_{gj} + \sum_{g=1}^{G}\sum_{p} \beta_{0pg}\,GROUP_{gj} X_{pj} + u_{0j} \\ \beta_{1j} &= \sum_{g=1}^{G} \beta_{10g}\,GROUP_{gj} + \sum_{g=1}^{G}\sum_{p} \beta_{1pg}\,GROUP_{gj} X_{pj} + u_{1j} \end{aligned} \quad (10)$$
Simulation (SAS PROC NLMIXED)
• 150 raters, each randomly exposed over 4 rounds to a standard-setting exercise varying on 3 session factors.
• Session factor 1: feedback type
• Session factor 2: item targeting in booklet
• Session factor 3: type of standard-setting task
• Rater characteristic: teacher vs. non-teacher
• Change over rounds (time)
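The simulation design above can be sketched as a data generator: 150 raters, 4 rounds, 3 binary session factors, and a teacher/non-teacher split. The generating values (effects, variance components, item count and difficulties) are hypothetical stand-ins, since the paper's true simulation values are not listed here; the fitting step in PROC NLMIXED is omitted.

```python
import math
import random

random.seed(2004)

N_RATERS, N_ROUNDS, N_ITEMS = 150, 4, 30

# Hypothetical generating values, not the paper's.
gamma = [0.10, -0.05, 0.08]              # 3 session-factor effects (eq. 3)
beta00, beta10, beta0_teacher = 0.50, 0.04, 0.05
tau0, tau1 = 0.20, 0.02                  # SDs of rater random effects

data = []
for j in range(N_RATERS):
    teacher = 1 if j < 75 else 0
    u0, u1 = random.gauss(0, tau0), random.gauss(0, tau1)
    for t in range(1, N_ROUNDS + 1):
        session = [random.randint(0, 1) for _ in gamma]  # factor on/off this round
        kappa = sum(g * s for g, s in zip(gamma, session))
        kappa += beta00 + beta0_teacher * teacher + u0 + (beta10 + u1) * t
        for i in range(N_ITEMS):
            d_i = random.gauss(0.5, 0.8)                 # item difficulty
            p = 1.0 / (1.0 + math.exp(-(kappa - d_i)))
            data.append((j, t, i, 1 if random.random() < p else 0))

print(len(data))  # 150 raters x 4 rounds x 30 items = 18000 records
```

A model-recovery check would then fit the mixed-effects logistic model to `data` and compare the estimates against the generating values, which is the parameter-recovery exercise the results below summarize.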
Selected Results
• The model (reasonably) recovers parameters within sampling uncertainty across 3 studies.
• The average cut-score (all teachers) for each rater group at the last round is not significantly different from 619, while the first-round results were significantly different.
• Results from the model for multiple studies are similarly encouraging.
Suggestions
• Large-scale testing programs should monitor their cut-score estimates for consistency and stability.
• For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time.
• The model in this paper can be adapted to actual data so that variation due to the relevant factors of a study can be verified and balanced out.