Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time
Yeow Meng Thum, Hye Sook Shin
UCLA Graduate School of Education & Information Studies
National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
CRESST Conference 2004
Los Angeles
Rationale
• Research shows that cut-scores vary as a function of many factors: raters, procedures, and time.
• How does one defend a particular cut-score? Averaging several values and using collateral information are the current options.
• High-stakes accountability hinges on the comparability of performance standards over time.
• Some method is required to monitor cut-scores for consistency across groups and over time (Green et al.).
Purpose of Study
• Develop an approach for estimating the impact of procedural factors, rater characteristics, and time on cut-scores.
• Monitor the consistency of cut-scores across several groups.
Transforming Judgments into Scale Scores

Figure 1: Working with the Grade 3 SAT-9 mathematics scale. [Plot of the pass probability curve against twin logit and scale-score axes; the cut-score of 0.633 logits corresponds to 619 scale-score points.]
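Figure 1's correspondence between the two metrics can be sketched if a linear logit-to-scale transformation is assumed. The slope A below is an illustrative value, not the SAT-9's actual scaling constant; the intercept is then pinned down by the anchor pair (0.633 logits, 619 points) from the figure.

```python
# Hypothetical linear transform: scale = A * logit + B.
# A (scale points per logit) is assumed; B is chosen so that 0.633 logits
# maps to 619 scale-score points, matching Figure 1.
A = 30.0
B = 619.0 - A * 0.633

def logit_to_scale(logit):
    """Convert a cut-score on the logit metric to scale-score points."""
    return A * logit + B

def scale_to_logit(scale):
    """Convert a scale-score cut back to the logit metric."""
    return (scale - B) / A

print(round(logit_to_scale(0.633)))  # 619 by construction
```

The same two functions would let a monitoring program report rater cut-scores on whichever metric its audience expects.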
Performance Distribution for Four Urban Schools

Figure 2: Grade 3 SAT-9 mathematics scale score distribution for four schools. [Four density panels over the 400-800 scale-score range, each marking the cut-score of 619: School A, 32% proficient; School B, 70%; School C, 19%; School D, 32%.]
Potential Impact of Revising a Cut-score

Revised cut-score (as fraction of SEM)

School    -1      -0.5    0       +0.5    +1
A         41%     37%     32%     29%     26%
B         78%     75%     70%     67%     63%
C         25%     23%     19%     15%     13%
D         40%     36%     32%     28%     25%

Table 1: Potential impact on school performance when the cut-score changes
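The mechanics behind Table 1 can be reproduced for an assumed normal score distribution. The school means, standard deviations, and SEM below are hypothetical, chosen only so the 0-SEM column lands near Table 1's baseline values; they are not the actual Figure 2 distributions.

```python
import math

def pct_above(cut, mean, sd):
    """Fraction of a Normal(mean, sd) score distribution at or above `cut`."""
    z = (cut - mean) / sd
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical school distributions (baseline ~32% and ~70% proficient at 619).
schools = {"A": (600, 40), "B": (640, 40)}
SEM = 20  # hypothetical standard error of measurement, in scale-score points

for name, (mean, sd) in schools.items():
    row = [round(100 * pct_above(619 + f * SEM, mean, sd))
           for f in (-1, -0.5, 0, 0.5, 1)]
    print(name, row)
```

Because the cut falls on different parts of each school's density, the same half-SEM shift moves schools by different amounts, which is exactly the sensitivity Table 1 illustrates.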
Data & Model
• Simulate data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995).
• Factors of the standard-setting study:
  a. Rater dimensions (teacher, non-teacher, etc.)
  b. Procedural factors/treatments:
     1. Type of feedback (outcome or impact feedback, "yes" or "no", etc.)
     2. Item sampling in booklet (number of items, etc.)
     3. Type of task (a modified Angoff, a contrasting-groups approach, the Bookmark method, etc.)
Treating Binary Outcomes

Binary outcome:
$$y_{ijt} = \begin{cases} 1 & \text{for ``pass''} \\ 0 & \text{for ``fail''} \end{cases} \quad (1)$$

Logit link function:
$$\eta_{ijt} = \ln\!\left(\frac{p_{ijt}}{1 - p_{ijt}}\right) \quad (2)$$

(pass if rater j thinks the passing candidate has a good chance of getting the i-th item right in session t)
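Equation (2) and its inverse in executable form, as a minimal sketch; the 0.633/0.653 pair below simply echoes Figure 1's cut-score, and is not a value estimated in the paper.

```python
import math

def logit(p):
    """Log-odds of a pass probability p (eq. 2)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Pass probability implied by a value on the logit scale."""
    return 1.0 / (1.0 + math.exp(-eta))

# A cut-score of 0.633 logits corresponds to about a 0.653 pass probability.
print(round(inv_logit(0.633), 3))
```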
IRT Model for Cut-score - I

Procedural factors impacting a rater's cut-scores:
$$\kappa_{jt} = \sum_{s} \gamma_{s} S_{sjt} + \delta_{jt} \quad (3)$$

where $\gamma_s$ is the fixed effect due to session characteristic $S_{sjt}$, and $\delta_{jt}$ is a random effect that evolves over time ($ROUND_{jt}$) as a function of rater characteristics $X_{pj}$.

Item response model (IRT):
$$\eta_{ijt} = \kappa_{jt} - d_{i} \quad (4)$$
Estimating factors impacting a rater's cut-scores:
$$\begin{aligned} \delta_{jt} &= \beta_{0j} + \beta_{1j}\,ROUND_{jt} \\ \beta_{0j} &= \beta_{00} + \sum_{p} \beta_{0p} X_{pj} + u_{0j} \\ \beta_{1j} &= \beta_{10} + \sum_{p} \beta_{1p} X_{pj} + u_{1j} \end{aligned} \quad (5)$$

$(u_{0j}, u_{1j})$ are distributed bivariate normal with means $(0, 0)$ and variance-covariance matrix
$$T = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{10} & \tau_{11} \end{pmatrix}$$
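Equations (2)-(5) can be sketched as a data-generating model for one rater. Every numeric value below (session effects, betas, random-effect SDs, item difficulty) is hypothetical, chosen only to illustrate how the pieces combine.

```python
import math
import random

random.seed(7)

def p_pass(kappa_jt, d_i):
    """Eqs. (2) and (4): probability rater j passes item i in session t."""
    return 1.0 / (1.0 + math.exp(-(kappa_jt - d_i)))

# Hypothetical effects; none of these values come from the paper.
gamma = {"feedback": 0.10, "targeting": -0.05, "task": 0.08}  # session effects
beta00, beta10 = 0.60, 0.02      # intercept and round slope (eq. 5)
beta0_teacher = 0.05             # rater-characteristic effect on the intercept
u0 = random.gauss(0, 0.1)        # rater random intercept
u1 = random.gauss(0, 0.02)       # rater random round slope

def kappa(session, round_t, teacher):
    """Eqs. (3) and (5): fixed session effects plus an evolving rater effect."""
    fixed = sum(gamma[s] for s in session)
    b0 = beta00 + beta0_teacher * teacher + u0
    b1 = beta10 + u1
    return fixed + b0 + b1 * round_t

k = kappa(session=["feedback"], round_t=3, teacher=1)
print(round(p_pass(k, d_i=0.4), 3))
```

Raising a session effect or the round slope raises kappa and hence the rater's implied cut-score, which is the substantive quantity being monitored.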
IRT Model for Cut-score - II
Likelihood

Conditional on $\delta_j$, $y_j$ has probability
$$f(y_j; \delta_j) = \prod_{t} \prod_{i} p_{ijt}^{\,y_{ijt}} \left[1 - p_{ijt}\right]^{(1 - y_{ijt})} \quad (6)$$

Prior distribution of $\delta_j$ is $g(\delta_j; T)$. \quad (7)

The conditional posterior of the rater random effects $\delta_j$ is
$$\frac{f(y_j; \delta_j)\, g(\delta_j; T)}{h(y_j, \beta, T)}, \quad \text{where } h(y_j, \beta, T) = \int_{\delta_j} f(y_j; \delta_j)\, g(\delta_j; T)\, d\delta_j$$

Joint marginal likelihood:
$$\prod_{j} h(y_j, \beta, T) \quad (8)$$
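The integral defining one rater's marginal contribution h can be sketched by simple Monte Carlo: average the conditional likelihood (6) over draws of the random effects. For simplicity this sketch assumes a diagonal T (independent u0, u1) and folds all fixed session effects into b0; the parameter values are hypothetical. (PROC NLMIXED instead uses adaptive Gaussian quadrature, but the quantity approximated is the same.)

```python
import math
import random

random.seed(1)

def f_cond(y, etas):
    """Eq. (6): Bernoulli likelihood of one rater's responses given linear predictors."""
    out = 1.0
    for y_it, eta in zip(y, etas):
        p = 1.0 / (1.0 + math.exp(-eta))
        out *= p ** y_it * (1.0 - p) ** (1 - y_it)
    return out

def h_marginal(y, d, rounds, b0, b1, tau0, tau1, draws=4000):
    """Monte Carlo estimate of one rater's h: average f over (u0, u1) draws.
    Diagonal T assumed; fixed session effects are absorbed into b0."""
    total = 0.0
    for _ in range(draws):
        u0, u1 = random.gauss(0, tau0), random.gauss(0, tau1)
        etas = [(b0 + u0) + (b1 + u1) * r - d_i for d_i, r in zip(d, rounds)]
        total += f_cond(y, etas)
    return total / draws

y = [1, 1, 0, 1]                 # one rater's pass/fail judgments
d = [0.2, 0.5, 1.1, 0.3]         # hypothetical item difficulties
rounds = [1, 2, 3, 4]
lik = h_marginal(y, d, rounds, b0=0.6, b1=0.02, tau0=0.3, tau1=0.05)
print(round(lik, 4))
```

Multiplying these per-rater averages across raters gives a Monte Carlo approximation of the joint marginal likelihood (8).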
Multiple Studies: Consistency & Stability

Procedural factors impacting a rater's cut-scores for separate study g (g = 1, 2, ..., G):
$$\kappa_{jt} = \sum_{s} \gamma_{sg} S_{sjt} + \delta_{jt} \quad (9)$$

where $\gamma_{sg}$ is the fixed effect due to session characteristic $S_{sjt}$ in study g, and $\delta_{jt}$ is a random effect that evolves over time ($ROUND_{jt}$) as a function of rater characteristics $X_{pj}$.
Group factors impacting a rater's severity:
$$\begin{aligned} \delta_{jt} &= \beta_{0j} + \beta_{1j}\,ROUND_{jt} \\ \beta_{0j} &= \sum_{g=1}^{G} \beta_{00g}\,GROUP_{gj} + \sum_{g=1}^{G}\sum_{p} \beta_{0pg}\,GROUP_{gj} X_{pj} + u_{0j} \\ \beta_{1j} &= \sum_{g=1}^{G} \beta_{10g}\,GROUP_{gj} + \sum_{g=1}^{G}\sum_{p} \beta_{1pg}\,GROUP_{gj} X_{pj} + u_{1j} \end{aligned} \quad (10)$$
Simulation (SAS PROC NLMIXED)
• 150 raters, each randomly exposed over 4 rounds to a standard-setting exercise varying on 3 session factors.
• Session factor 1: feedback type
• Session factor 2: item targeting in booklet
• Session factor 3: type of standard-setting task
• Rater characteristic: teacher vs. non-teacher
• Change over rounds (time)
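The simulation design above can be sketched as a data generator: 150 raters, 4 rounds, 3 binary session factors, and a teacher/non-teacher split. The generating values (effects, variance components, item count and difficulties) are hypothetical stand-ins, since the paper's true simulation values are not listed here; the fitting step in PROC NLMIXED is omitted.

```python
import math
import random

random.seed(2004)

N_RATERS, N_ROUNDS, N_ITEMS = 150, 4, 30

# Hypothetical generating values, not the paper's.
gamma = [0.10, -0.05, 0.08]              # 3 session-factor effects (eq. 3)
beta00, beta10, beta0_teacher = 0.50, 0.04, 0.05
tau0, tau1 = 0.20, 0.02                  # SDs of rater random effects

data = []
for j in range(N_RATERS):
    teacher = 1 if j < 75 else 0
    u0, u1 = random.gauss(0, tau0), random.gauss(0, tau1)
    for t in range(1, N_ROUNDS + 1):
        session = [random.randint(0, 1) for _ in gamma]  # factor on/off this round
        kappa = sum(g * s for g, s in zip(gamma, session))
        kappa += beta00 + beta0_teacher * teacher + u0 + (beta10 + u1) * t
        for i in range(N_ITEMS):
            d_i = random.gauss(0.5, 0.8)                 # item difficulty
            p = 1.0 / (1.0 + math.exp(-(kappa - d_i)))
            data.append((j, t, i, 1 if random.random() < p else 0))

print(len(data))  # 150 raters x 4 rounds x 30 items = 18000 records
```

A model-recovery check would then fit the mixed-effects logistic model to `data` and compare the estimates against the generating values, which is the parameter-recovery exercise the results below summarize.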
Selected Results
• The model (reasonably) recovers parameters within sampling uncertainty across 3 studies.
• The average cut-score (all teachers) for each rater group at the last round is not significantly different from 619, while the first-round results were significantly different.
• Results from the model for multiple studies are similarly encouraging.
Suggestions
• Large-scale testing programs should monitor their cut-score estimates for consistency and stability.
• For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time.
• The model in this paper can be adapted to actual data so that variation due to the relevant factors of a study can be verified and balanced out.