An Introduction to CAT
Outline
What and why?
Intro to IRT
The 5 components
Modifying CAT for special situations
Building a CAT
Background?
How much do you know about CAT?
Heard of it
I am familiar with IRT
Have taken a course/workshop
Have built a CAT (why are you here?!?!?)
The ultimate goal: CAT in a Box
Part 1: Benefits of CAT
What is CAT?
A Computerized Adaptive Test (CAT)is a test administered by computer that dynamically adjusts itself to the trait level of each examinee as the test is being administered
CAT is an algorithm
Why? Benefits of CAT
Efficiency: reduce test length by 50% or more; can be gigantic financial/time benefits (the main driver of CAT)
Control of measurement precision: measure or classify all examinees with the same degree of precision
Increased fairness: what is more fair, seeing the same items but some students having unreliable scores, or reliable scores but different items?
Benefits of CAT
Added security: if everyone receives the same 100 items, the items will become well known
More frequent retesting: if you go study for a month and then retest, you will probably get a different test
Benefits of CAT
Better examinee motivation:
Top students do not waste time on easy questions
Low-ability students are not discouraged by tough questions
More precise scores, especially at the extremes
Benefits of Computerized Testing (in general)
Immediate score reporting: P&P testing requires the question papers to come back and be scored
More item formats: audio, video, hotspots, drag and drop, etc.
Easier results collection, management, and reporting
Disadvantages of CAT
Public relations: need to explain to examinees/parents why certain things can happen, like failing after only 10 questions, or passing with a 50% correct score
Heavy requirements to do it right: software, expertise, effort
Disadvantages of CAT
Item exposure (but better than with fixed forms!)
Not feasible/applicable for every situation: small samples, subjective items, items not scorable in real time
Part 2: Intro to IRT
Intro to IRT
While it is possible to design CATs with classical test theory (Frick, 1992), IRT is more appropriate because it puts items and examinees on the same scale
We assume use of IRT
Classical item statistics
CTT: option proportions are often translated to a quantile plot
The item response function: the basic building block of IRT
The item response function
The IRF
Parameters: a, b, c
Item information function
Used for scoring examinees
The IRF
a: the discrimination parameter; represents how well the item differentiates examinees; the slope of the curve at its center
b: the difficulty parameter; represents how easy or hard the item is with respect to examinees; the location of the curve (left to right)
c: the pseudoguessing parameter; represents the 'base probability' of answering the question correctly; the lower asymptote
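In code, the IRF built from these three parameters is just a logistic curve. A minimal Python sketch of the three-parameter logistic (3PL) model; the D = 1.7 scaling constant and the example values are illustrative:

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    """3PL item response function: probability of a correct response.

    a: discrimination (slope), b: difficulty (location),
    c: pseudoguessing (lower asymptote), D: logistic scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta = b, the probability is halfway between c and 1
print(irf_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))
```

Raising theta moves the probability toward 1; lowering it moves the probability toward the lower asymptote c.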
Item information functions
Example: 5 items (figure)
TIF and SEM
IIFs can be summed into a Test Information Function (TIF)
The TIF can be inverted into a SEM function (more info = less error)
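The IIF-to-TIF-to-SEM chain can be sketched in Python using the standard Fisher information formula for the 3PL; the three-item bank here is made up for illustration:

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def iif_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at theta."""
    p = irf_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def sem(theta, bank, D=1.7):
    """Sum IIFs into the TIF, then invert: SEM = 1 / sqrt(TIF)."""
    tif = sum(iif_3pl(theta, a, b, c, D) for (a, b, c) in bank)
    return 1.0 / math.sqrt(tif)

# Illustrative bank of (a, b, c) triples centered near theta = 0
bank = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
print(sem(0.0, bank))  # more information near theta = 0 means lower SEM there
```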
IRT Scoring
Scoring happens by multiplying IRFs (P for a correct response, 1 - P for an incorrect one) to get a likelihood function
The theta at the maximum of the likelihood is the examinee's score
It is on the b/θ scale, so we can then use it to adapt
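The scoring step can be illustrated with a brute-force grid search: multiply P for each correct response and 1 - P for each incorrect one, then take the theta with the largest product. The item parameters below are invented for illustration:

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def mle_theta(items, responses, grid=None):
    """Grid-search MLE: return the theta on the grid with the highest likelihood."""
    if grid is None:
        grid = [g / 100.0 for g in range(-400, 401)]  # -4.00 to +4.00 by 0.01
    def likelihood(theta):
        L = 1.0
        for (a, b, c), x in zip(items, responses):
            p = irf_3pl(theta, a, b, c)
            L *= p if x == 1 else (1.0 - p)
        return L
    return max(grid, key=likelihood)

# Three illustrative items (a, b, c); a mixed response vector is required for MLE
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_hat = mle_theta(items, [1, 1, 0])
print(theta_hat)
```

Operational software uses faster numerical methods (e.g., Newton-Raphson), but the logic is the same.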
Part 3: Basic principles of CAT (The Five Components)
CAT Components
1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule
Given 1 and 2, we loop 3-4-5 until 5 is satisfied
All CAT follows this basic format; we just modify the details for whatever testing situation we have
Algorithms inside your testing engine
1. Calibrated item bank
Goal: develop an item bank that meets the needs of your CAT
Q: What are the needs? A: Monte Carlo simulation
More broadly: think about TIF and CSEM, and work backwards
1. Calibrated item bank
SEM: How accurate do you want to be?
How many items would it take?
Given your items, what is the best hope?
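Because SEM = 1/sqrt(TIF), you can work backwards from a target SEM to a rough item count. A back-of-envelope sketch (the average-information value is purely illustrative):

```python
import math

def items_needed(target_sem, avg_item_info):
    """Rough item count from SEM = 1 / sqrt(n * avg_item_info)."""
    return math.ceil(1.0 / (avg_item_info * target_sem ** 2))

# Illustrative: if a well-targeted item contributes ~0.5 information on average,
# a target SEM of 0.25 implies roughly 32 items
print(items_needed(0.25, 0.5))  # 32
```

Real banks do not deliver average information uniformly across theta, which is why simulation (below) replaces this arithmetic in practice.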
1. Calibrated item bank
Do you need a lot of info near a cutscore?
2. Starting rule
1. Can start everyone with the same theta estimate (e.g., theta = 0.0)
Everyone gets the same first item
Could be an exposure problem in a high-stakes test
2. Assign a random theta estimate within an interval
E.g., between theta = -0.5 and +0.5
Improves exposure levels and has little effect on a properly implemented CAT
2. Starting rule
3. Use prior information available for a given examinee
Subjective evaluations, e.g., below average, above average
Theta estimates from previously administered tests (same or different)
Projected theta based on biodata (age, GPA, etc.) – IACAT 2010
3. Item selection rule
Items are selected to maximize information
There are different ways to quantify this
Fisher (single-point) information at the current score is most common
3. Item selection
Also, there are usually practical constraints in item selection: item exposure, content area (domain), cognitive level, etc.
3. Item selection
Example: 5 items (figure)
4. Scoring rule
Typically, MLE is used to score examinees after each item (sometimes Bayesian methods instead)
However, MLE assumes a mixed response vector (not all correct or all incorrect)
This is rarely the case in the first few items, and it is never the case after only the first item
4. Scoring rule
So, we need to adapt the scoring rule while the response vector is nonmixed
We can always start with Bayesian scoring
Can also use fixed step sizes: add or subtract 1 to the score after each item until mixed
5. Stopping rule
Depends primarily on the purpose of the test: point estimation or classification?
Point estimation: we want an accurate score for each student
Classification: we do NOT need an accurate score, just a classification into pass/fail, etc.
Also, change: to see if the score has gone up/down a certain amount
5. Stopping rule
Point estimation methods involve actual scores, and stop when we have zeroed in enough
Classification methods check after every item to see if we can make a classification within a certain degree of accuracy
Specifics will be covered later
5. Stopping rule
Either type of CAT can be designed with a fixed number of items
This is a bad idea from a psychometric perspective, but it can greatly enhance perceived fairness
Variable-length testing is much more efficient
The big picture
(Figure: the five components, 1-5, in the CAT loop)
Point estimation CAT
Let’s look at the most common scenario:
Need an accurate score for each examinee
Adapt the quantity and difficulty of items
1. Bank: given
2. Starting point: 0.0
3. Item selection: info at current theta
4. Scoring: MLE
5. Termination: target SEM
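These five choices can be wired together into a tiny end-to-end simulation. Everything below (bank size, parameter ranges, step size, targets) is illustrative, a sketch rather than a production implementation:

```python
import math, random

D = 1.7  # logistic scaling constant

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def mle(administered, responses):
    """Grid-search MLE over theta in [-4, 4]."""
    grid = [g / 50.0 for g in range(-200, 201)]
    def L(t):
        out = 1.0
        for (a, b, c), x in zip(administered, responses):
            p = p3pl(t, a, b, c)
            out *= p if x else 1 - p
        return out
    return max(grid, key=L)

def simulate_cat(bank, true_theta, target_sem=0.3, max_items=40, rng=random):
    theta, administered, responses = 0.0, [], []          # 2. starting rule
    remaining = list(bank)
    while remaining and len(administered) < max_items:
        item = max(remaining, key=lambda it: info(theta, *it))  # 3. max info at current theta
        remaining.remove(item)
        x = 1 if rng.random() < p3pl(true_theta, *item) else 0  # simulated response
        administered.append(item); responses.append(x)
        if 0 < sum(responses) < len(responses):           # 4. MLE once the vector is mixed
            theta = mle(administered, responses)
        else:
            theta += 0.5 if x else -0.5                   # fixed step until mixed
        tif = sum(info(theta, *it) for it in administered)
        if 1 / math.sqrt(tif) <= target_sem:              # 5. stop at target SEM
            break
    return theta, len(administered)

rng = random.Random(42)
bank = [(rng.uniform(0.7, 1.5), rng.uniform(-3, 3), 0.2) for _ in range(200)]  # 1. bank
theta_hat, n_items = simulate_cat(bank, true_theta=1.0, rng=rng)
print(theta_hat, n_items)
```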
Example CAT: Item-by-Item Report of a Maximum Information CAT
This test will terminate when the SEM is equal to or less than 0.200
Minimum number of items = 5; maximum number of items = 40
The standard error band plotted as ---- is plus or minus 2.00 standard errors.
X = initial theta value; C = correct answer; I = incorrect answer
Item Theta  SE    -3.......-2........-1.........0........+1........+2........+3
 0   -0.24* 1.00*  --------------------X--------------------
 1    4.00* 1.00*  .                    -------------------->
 2    4.00* 1.00*  .                    -------------------->
 3    2.52  0.84   .                 -----------------I-----
 4    2.77  0.68   .                     -------------C---
 5    2.38  0.61   .                ------------I-------
 6    2.09  0.61   .             ------------I----------
 7    1.49  0.89       -----------------I----------------
 8    0.36  1.00   --------------------I--------------------
 9    0.88  0.63      ------------C-------------
10    1.13  0.56        -----------C-----------
11    1.34  0.49   .      ----------C----------
12    1.44  0.46   .        ---------C---------
13    1.55  0.43   .         ---------C---------
14    1.67  0.41   .           --------C--------
15    1.54  0.38   .          --------I--------
16    1.60  0.36   .           --------C-------
17    1.70  0.35   .            -------C-------
18    1.76  0.34   .             -------C-------
19    1.65  0.32   .             ------I------
20    1.52  0.31   .           -------I------
21    1.40  0.30   .          -------I------
22    1.27  0.30   .         ------I------
23    1.30  0.28   .          ------C-----
24    1.32  0.28   .          ------C-----
25    1.36  0.27   .           -----C------
26    1.40  0.27   .           -----C------
27    1.31  0.26   .          ------I-----
28    1.34  0.25   .           -----C-----
29    1.37  0.25   .           -----C-----
30    1.40  0.24   .            ----C-----
31    1.43  0.24   .           -----C-----
32    1.46  0.24   .            -----C-----
33    1.50  0.24   .             ----C-----
34    1.53  0.23   .            -----C----
35    1.55  0.23   .            -----C-----
36    1.59  0.23   .             ----C-----
37    1.62  0.23   .            -----C----
38    1.58  0.22   .             ----I-----
39    1.53  0.22   .            -----I----
40    1.55  0.22   .             ----C----
_______________________________________________________________________________
* Arbitrarily assigned value.
This test was terminated when the maximum number of items was reached.
Part 4: Modifying CAT for special situations
Practical constraints
Test length:
Minimum: reduce complaints from failures
Maximum: stop never-ending tests
Fixed: appears to be fair but is not
Content: educational standards, etc.
Item exposure: reduce overuse of your best items! Methods: randomesque, Sympson-Hetter
Classification testing
What about pass/fail or multicategory decisions?
The conventional approach is to administer a long test to obtain an accurate score, then compare it to a cutscore
But we don't need an accurate score: students far above or below the cutscore can be stopped early
Sometimes, EXTREMELY early
Classification testing
This is referred to as computerized classification testing (CCT: Lin & Spray, 2000)
Very similar to CAT, but with a few notable differences
CCT
Methods are primarily delineated based on the stopping rule:
Ability confidence intervals (ACI: originally called "adaptive mastery testing"; see Kingsbury & Weiss, 1983)
Likelihood ratio (SPRT or GLR; Wald, 1947; Ferguson, 1967; Reckase, 1983; Thompson et al., 2008)
ACI
Specify the confidence interval: e.g., 95% interval is plus or minus 1.96 SEM
Keep administering items until interval is above or below cutscore (or you hit your maximum test length)
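The ACI check after each item is a one-line comparison of the confidence interval against the cutscore. A sketch with illustrative values:

```python
def aci_decision(theta_hat, sem, cutscore, z=1.96):
    """Ability confidence interval: classify once the ~95% CI clears the cutscore."""
    lower, upper = theta_hat - z * sem, theta_hat + z * sem
    if lower > cutscore:
        return "pass"
    if upper < cutscore:
        return "fail"
    return "continue testing"

# CI = 0.8 +/- 1.96 * 0.25 = (0.31, 1.29), entirely above a cutscore of 0.0
print(aci_decision(theta_hat=0.8, sem=0.25, cutscore=0.0))  # pass
```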
ACI
Example graph (figure)
Likelihood ratio
The LR approach completely abandons the use of the theta estimate
Tests only whether a student is above or below a cutscore
Uses an indifference region around the cutscore
More efficient than ACI (Spray & Reckase, 1996; Eggen, 1999; Thompson et al., 2007)
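A sketch of the SPRT stopping rule: compare the likelihood of the response string at the two edges of the indifference region against Wald's decision bounds. The item parameters and region width below are illustrative:

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def sprt_decision(items, responses, theta_lo, theta_hi, alpha=0.05, beta=0.05):
    """Wald's SPRT: log-likelihood ratio at theta_hi vs. theta_lo."""
    llr = 0.0
    for (a, b, c), x in zip(items, responses):
        p_hi, p_lo = p3pl(theta_hi, a, b, c), p3pl(theta_lo, a, b, c)
        llr += math.log(p_hi / p_lo) if x else math.log((1 - p_hi) / (1 - p_lo))
    upper = math.log((1 - beta) / alpha)  # exceed -> classify above cutscore
    lower = math.log(beta / (1 - alpha))  # fall below -> classify below cutscore
    if llr >= upper:
        return "pass"
    if llr <= lower:
        return "fail"
    return "continue testing"

# 12 correct answers on items at the cutscore, indifference region of +/- 0.25
print(sprt_decision([(1.0, 0.0, 0.2)] * 12, [1] * 12, -0.25, 0.25))
```

Note that no theta estimate appears anywhere: only the likelihoods at the two fixed points matter.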
GLR (Advanced SPRT)
Example (figure)
Part 5: Building a CAT
Background
Here is a framework:
Identify issues and the best way to investigate them
Leads to better quality in the end
Also the foundation for validity arguments: why did you choose certain things?
Where to start
Go back to the 5 components:
1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule
Where to start
How do we ensure that we build a good bank and choose good algorithms?
Simulation drives much of this
The model
In practice, you have these steps:
Seq.  Stage                                               Primary work
1     Feasibility, applicability, and planning studies    Monte Carlo simulation; business case evaluation
2     Develop item bank content or utilize existing bank  Item writing and review
3     Pretest and calibrate item bank                     Pretesting; item analysis
4     Determine specifications for final CAT              Post-hoc or hybrid simulations
5     Publish live CAT                                    Publishing and distribution; software development
1. Feasibility, applicability, planning
Three types of simulations:
Monte Carlo: generate fake item responses (but based on reality); draw a random U(0,1) and compare to P(X) from IRT
Post hoc (real data)
Hybrid
1. Feasibility, applicability, planning
Simulations act by “administering” a CAT:
Select first item
Generate response
Estimate θ
Select next item…
Then analyze the results:
Average test length
Accuracy: CAT θ vs. true θ (or full-bank θ)
1. Feasibility, applicability, planning
At this point, real data are not likely available, so use Monte Carlo
Generate plausible situations:
Item bank size: 100, 200, 300…
Item quality: a = 0.7, 0.8…; spread of b
Desired precision: SEM = 0.2, 0.3, 0.4…
Compare results to each other and to fixed forms
Base values on reality (e.g., the mean a in your bank)
1. Feasibility, applicability, planning
How do we “generate” responses?
Calculate the probability P(X) of a correct response to the item, given θ (which is known, because the examinees are imaginary); done using the IRT equation
Generate a random number between 0.0 and 1.0
Compare: if random > P(X), then incorrect, and vice versa
1. Feasibility, applicability, planning
Example with θ = 1.0:
Item 1 has a = 1, b = 0, c = 0.2
P(X) = 0.877
Random number = 0.426
Random < P(X), so CORRECT
Item 2 has a = 1, b = 1, c = 0.2
P(X) = 0.6
Random number = 0.813
Random > P(X), so INCORRECT
And so on…
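The worked example above can be reproduced directly (using the common D = 1.7 scaling constant, which matches the slide's probabilities to within rounding):

```python
import math, random

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_response(theta, a, b, c, rng):
    """Simulated 0/1 response: correct if the random draw falls below P(X)."""
    p = irf_3pl(theta, a, b, c)
    return 1 if rng.random() < p else 0

rng = random.Random(0)
p1 = irf_3pl(1.0, a=1.0, b=0.0, c=0.2)  # approx 0.877, as in the example
p2 = irf_3pl(1.0, a=1.0, b=1.0, c=0.2)  # exactly 0.6, as in the example
print(p1, p2, simulate_response(1.0, 1.0, 0.0, 0.2, rng))
```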
1. Feasibility, applicability, planning
Software will do this for you, allowing you to simulate CATs for thousands of examinees in seconds
You can then easily set up an experiment with a wide range of conditions, and run a simulation for each
1. Feasibility, applicability, planning
Example result:
CAT with a bank of 300 items and SEM = 0.25 has an average of 53 items
Current fixed test has 100 items, SEM=0.23 in middle and 0.35+ beyond θ of ±1.5
CAT will make the test more accurate for extreme examinees and about as accurate in the middle, with a roughly 50% reduction in test length
1. Feasibility, applicability, planning
More on this topic in my session
Epilogue: Maintaining CAT
Like fixed-form testing, maintenance is usually necessary
Check that the CAT is performing as expected:
Is the termination criterion being satisfied?
Are examinees hitting the maximum test length or other constraints?
Is the average test length what you expected?
Epilogue: Maintaining CAT
Exposure: are certain items overused?
CAT is greedy, always selecting items with best discriminations
Refresh the pool by replacing exposed items
Test security: are any items out on the internet? (Related to exposure, but not the same)
Parameter drift?
P&P comparability study? Equity?
Wrap-up
Questions?
More resources?
• Is CAT a good fit for my organization?
• Many organizations hear these benefits of CAT
• However, CAT is not relevant or possible in most situations
• How can I evaluate if it is applicable for my organization?
• Consider the following requirements…
Computerized Adaptive Testing
Requirement #1: Items scorable in real time
CAT scores items immediately so the next item can be determined instantaneously
Paper testing is obviously irrelevant
Essays and other constructed-response items are not possible unless you package an automated scoring system
Requirement #2: Large item banks
A general rule of thumb is that you need 3 times as many items in the bank as intended for the test
100-item test? Need 300 items in the bank
You then need resources to write, review, and pilot all of these items
Piloting is an issue unto itself
CAT Requirements
Requirement #3: Large pilot samples
CAT uses item response theory (IRT) as the underlying psychometric model
This requires that all items have at least 100 examinee responses just for a cursory analysis (preferably at least 1,000)
Then if you have 300 items and each examinee sees 100 during a pilot, you need 3,000 examinees just for the pilot study!
Requirement #4: PhD psychometricians
Psychometricians are necessary to perform complex IRT analysis
Also needed to perform CAT simulation and validity studies
The test is otherwise not defensible
Requirement #5: Sophisticated software
Software designed specifically for IRT analysis is necessary to calibrate the pilot data
Separate software is necessary for the CAT simulation studies
General statistical software is not usually acceptable
Requirement #6: IRT item banker
Items must be packaged with their IRT parameters
Your item banking system must utilize IRT parameters appropriately for assembling banks/forms
Requirement #7: CAT delivery system
You must have a computerized test delivery system capable of performing all the complex CAT calculations
Crude, homegrown approximations are not defensible
Must be reliable, scalable, and secure
Requirement #8: Money and resources
Developing a defensible CAT is extremely expensive
Small research project: $20,000
Large operational CAT: hundreds of thousands of dollars
However, the benefits stated earlier can outweigh these costs, especially with large-volume tests
CAT can then be a positive investment