An Introduction to CAT

Introduction to Computerized Adaptive Testing (CAT)


Page 1: Introduction to Computerized Adaptive Testing (CAT)

An Introduction to CAT

Page 2: Introduction to Computerized Adaptive Testing (CAT)

Outline

What and why?
Intro to IRT
The 5 components
Modifying CAT for special situations
Building a CAT

Page 3: Introduction to Computerized Adaptive Testing (CAT)

Background?

How much do you know about CAT?
Heard of it
I am familiar with IRT
Have taken a course/workshop
Have built a CAT (why are you here?!?!?)

Page 4: Introduction to Computerized Adaptive Testing (CAT)

The ultimate goal: CAT in a Box


Page 5: Introduction to Computerized Adaptive Testing (CAT)

Part 1: Benefits of CAT

Page 6: Introduction to Computerized Adaptive Testing (CAT)

What is CAT?

A Computerized Adaptive Test (CAT) is a test administered by computer that dynamically adjusts itself to the trait level of each examinee as the test is being administered

CAT is an algorithm

Page 7: Introduction to Computerized Adaptive Testing (CAT)

Why? Benefits of CAT

Efficiency
Reduce test length by 50% or more
Can be gigantic financial/time benefits (the main driver of CAT)

Control of measurement precision
Measure or classify all examinees with the same degree of precision

Increased fairness
What is more fair: seeing the same items but some students having unreliable scores, or reliable scores but different items?

Page 8: Introduction to Computerized Adaptive Testing (CAT)

Benefits of CAT

Added security
If everyone receives the same 100 items, the items will become well known

More frequent retesting
If you study for a month and then retest, you will probably get a different test

Page 9: Introduction to Computerized Adaptive Testing (CAT)

Benefits of CAT

Better examinee motivation
Top students do not waste time on easy questions

Low-ability students are not discouraged by tough questions

More precise scores
Especially at the extremes

Page 10: Introduction to Computerized Adaptive Testing (CAT)

Benefits of Computerized Testing (in general)

Immediate score reporting
P&P testing requires the question papers to come back and be scored

More item formats
Audio, video, hotspots, drag and drop, etc.

Easier results collection
And management/reporting

Page 11: Introduction to Computerized Adaptive Testing (CAT)

Disadvantages of CAT

Public relations
Need to explain to examinees/parents why certain things can happen, like failing after only 10 questions, or passing with a 50% correct score

Heavy requirements to do it right
Software
Expertise
Effort

Page 12: Introduction to Computerized Adaptive Testing (CAT)

Disadvantages of CAT

Item exposure
But better than fixed forms!

Not feasible/applicable for every situation
Small samples
Subjective items
Items not scorable in real time

Page 13: Introduction to Computerized Adaptive Testing (CAT)

Part 2: Intro to IRT

Page 14: Introduction to Computerized Adaptive Testing (CAT)

Intro to IRT

While it is possible to design CATs with classical test theory (Frick, 1992), IRT is more appropriate because it puts items and examinees on the same scale

We assume use of IRT

Page 15: Introduction to Computerized Adaptive Testing (CAT)

Classical item statistics

CTT: option proportions are often translated to a quantile plot

Page 16: Introduction to Computerized Adaptive Testing (CAT)

The item response function
Basic building block of IRT

Page 17: Introduction to Computerized Adaptive Testing (CAT)

The item response function

The IRF
Parameters: a, b, c
Item information function
Used for scoring examinees

Page 18: Introduction to Computerized Adaptive Testing (CAT)

The IRF

a: The discrimination parameter
Represents how well the item differentiates examinees
Slope of the curve at its center

b: The difficulty parameter
Represents how easy or hard the item is with respect to examinees
Location of the curve (left to right)

c: The pseudo-guessing parameter
Represents the ‘base probability’ of answering the question
Lower asymptote
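As a concrete sketch, the 3PL IRF with these three parameters can be written in a few lines of Python (the function name and the D = 1.7 scaling constant are illustrative assumptions, not from the slides):

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    """3PL item response function: probability of a correct response.

    a = discrimination (slope), b = difficulty (location),
    c = pseudo-guessing (lower asymptote). D is a scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# A harder item (higher b) gives a lower probability at the same theta:
print(round(irf_3pl(0.0, a=1.0, b=0.0, c=0.2), 3))  # prints 0.6
print(round(irf_3pl(0.0, a=1.0, b=1.0, c=0.2), 3))  # prints 0.324
```

Note how c keeps the probability from ever dropping below 0.2, the "base probability" of guessing correctly.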

Page 19: Introduction to Computerized Adaptive Testing (CAT)

Item information functions

Example 5 items

Page 20: Introduction to Computerized Adaptive Testing (CAT)

TIF and SEM

IIFs can be summed into a Test Information Function (TIF)

The TIF can be inverted into a SEM function (more info = less error)
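The relationship between IIFs, the TIF, and the SEM function can be sketched as follows (a hypothetical Python illustration; the 3PL information formula and the D = 1.7 scaling are standard IRT results, but all names here are made up for this sketch):

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def iif_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a single 3PL item at theta."""
    p = irf_3pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def tif(theta, items):
    """Test information function = sum of the item information functions."""
    return sum(iif_3pl(theta, *item) for item in items)

def sem(theta, items):
    """Standard error of measurement = 1 / sqrt(TIF): more info = less error."""
    return 1.0 / math.sqrt(tif(theta, items))

bank = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]  # (a, b, c) triples
print(round(tif(0.0, bank), 3), round(sem(0.0, bank), 3))
```

Adding items can only increase the TIF, so the SEM can only shrink as the test gets longer.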

Page 21: Introduction to Computerized Adaptive Testing (CAT)

IRT Scoring

Scoring happens by multiplying IRFs (or their complements, for incorrect responses) to get a likelihood function

The theta at the maximum of the likelihood is the examinee score

It is on the b/θ scale, so we can then use it to adapt
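A minimal sketch of this scoring idea, assuming a simple grid search rather than the Newton-type optimizers used in practice (all names are illustrative):

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Sum log IRF for correct responses and log(1 - IRF) for incorrect ones."""
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def mle_theta(items, responses, grid=None):
    """Grid-search MLE: the theta value that maximizes the likelihood."""
    grid = grid if grid is not None else [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
print(mle_theta(items, [1, 1, 0]))  # mixed vector, so a finite estimate exists
```

Because the estimate lands on the same θ scale as the item b parameters, it can feed directly into item selection for the next item.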

Page 22: Introduction to Computerized Adaptive Testing (CAT)

Part 3: Basic Principles of CAT (The Five Components)


Page 24: Introduction to Computerized Adaptive Testing (CAT)

CAT Components

1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule

Given 1 and 2, we repeat 3 and 4 until 5 is satisfied
All CAT follows this basic format – we just modify the details for whatever testing situation we have

Algorithms inside your testing engine
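The five components and the repeat-3-and-4-until-5 loop might be sketched generically like this (a toy illustration with made-up rule functions; a real engine would plug in IRT-based selection, scoring, and stopping rules):

```python
def run_cat(bank, select_item, respond, score, should_stop, theta0=0.0):
    """Generic CAT loop: given a calibrated bank (1) and a starting theta (2),
    repeat item selection (3) and scoring (4) until the stopping rule (5) fires."""
    theta, administered, responses = theta0, [], []
    while not should_stop(theta, administered):
        item = select_item(theta, bank, administered)  # 3. item selection
        u = respond(item)                              # administer, score in real time
        administered.append(item)
        responses.append(u)
        theta = score(theta, item, u)                  # 4. provisional score
    return theta, len(administered)

# Toy demo: fixed-step scoring (+/- 0.5) and a 10-item length stopping rule.
bank = list(range(100))
theta, n = run_cat(
    bank,
    select_item=lambda t, b, seen: next(i for i in b if i not in seen),
    respond=lambda item: 1,  # imaginary examinee who answers everything correctly
    score=lambda t, item, u: t + (0.5 if u else -0.5),
    should_stop=lambda t, seen: len(seen) >= 10,
)
print(theta, n)  # 5.0 10
```

Swapping in different rule functions is exactly how the same loop is adapted to the special situations covered later.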

Page 25: Introduction to Computerized Adaptive Testing (CAT)

1. Calibrated item bank

Goal: develop an item bank that meets the needs of your CAT

Q: What are the needs? A: Monte Carlo simulation

More broadly: think about TIF and CSEM, and work backwards

Page 26: Introduction to Computerized Adaptive Testing (CAT)

1. Calibrated item bank

SEM: How accurate do you want to be?
How many items would it take?
Given your items, what is the best hope?

Page 27: Introduction to Computerized Adaptive Testing (CAT)

1. Calibrated item bank

Do you need a lot of info near a cutscore?

Page 28: Introduction to Computerized Adaptive Testing (CAT)

2. Starting rule

1. Can start everyone with the same theta estimate (e.g., theta = 0.0)

Everyone gets the same first item
Could be an exposure problem in a high-stakes test

2. Assign a random theta estimate within an interval

E.g., between theta = -0.5 and +0.5
Improves exposure levels and has little effect on a properly implemented CAT

Page 29: Introduction to Computerized Adaptive Testing (CAT)

2. Starting rule

3. Use prior information available for a given examinee

Subjective evaluations, e.g., below average, above average

Theta estimates from previously administered tests (same or different)

Projected theta based on biodata (age, GPA, etc.) – IACAT 2010

Page 30: Introduction to Computerized Adaptive Testing (CAT)

3. Item selection rule

Items are selected to maximize information
There are different ways to quantify this

Fisher (single-point) information at the current score is most common
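A sketch of maximum-information selection under the 3PL model (hypothetical names; the Fisher information formula is the standard 3PL result, and D = 1.7 is an assumed scaling):

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def fisher_info(theta, a, b, c, D=1.7):
    """Single-point Fisher information of a 3PL item at the current theta."""
    p = irf_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_item(theta, bank, used):
    """Select the unused item with maximum Fisher information at theta."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: fisher_info(theta, *bank[i]))

bank = [(1.0, -2.0, 0.2), (1.0, 0.1, 0.2), (1.5, 1.8, 0.2)]  # (a, b, c)
print(pick_item(0.0, bank, used=set()))  # item 1 (b = 0.1) wins at theta = 0
```

In practice the `candidates` list would also be filtered by the exposure and content constraints on the next slide.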

Page 31: Introduction to Computerized Adaptive Testing (CAT)

3. Item selection

Also, there are usually practical constraints in item selection:
Item exposure
Content area (domain)
Cognitive level
Etc.

Page 32: Introduction to Computerized Adaptive Testing (CAT)

3. Item selection

Example: 5 items

Page 33: Introduction to Computerized Adaptive Testing (CAT)

4. Scoring rule

Typically, MLE is used to score examinees after each item (or sometimes Bayesian)

However, this assumes that there is a mixed response vector (not all correct or all incorrect)

But this is rarely the case in the first few items

And it is never the case after just the first item

Page 34: Introduction to Computerized Adaptive Testing (CAT)

4. Scoring rule

So, we need to adapt the scoring rule when the response vector is non-mixed

We can always start with Bayesian scoring
Can also use fixed step sizes

Add or subtract 1 from the score after each item until the vector is mixed
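A sketch of the fixed-step idea (the function name and hand-off convention are illustrative; the step size of 1 follows the slide):

```python
def provisional_score(theta, responses, step=1.0):
    """Fixed-step scoring while the response vector is non-mixed:
    move theta up after a correct answer, down after an incorrect one.
    Returns None once the vector is mixed, signaling a hand-off to MLE."""
    if 0 < sum(responses) < len(responses):
        return None  # mixed vector: MLE is now defined, use it instead
    return theta + step if responses[-1] == 1 else theta - step

print(provisional_score(0.0, [1]))        # 1.0
print(provisional_score(1.0, [1, 1]))     # 2.0
print(provisional_score(2.0, [1, 1, 0]))  # None -> switch to MLE
```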

Page 35: Introduction to Computerized Adaptive Testing (CAT)

5. Stopping rule

Depends primarily on the purpose of the test: point estimation or classification?

Point estimation: we want an accurate score for each student

Classification: we do NOT need an accurate score, just a classification into pass/fail etc.

A third purpose is change: testing to see if a score has gone up or down by a certain amount

Page 36: Introduction to Computerized Adaptive Testing (CAT)

5. Stopping rule

Point estimation methods involve actual scores, and stop when we have zeroed in enough

Classification methods check after every item to see if we can make a classification within a certain degree of accuracy

Specifics will come later

Page 37: Introduction to Computerized Adaptive Testing (CAT)

5. Stopping rule

Either type of CAT can be designed with a fixed number of items

This is a bad idea from a psychometric perspective, but it can greatly enhance perceived fairness

Variable-length testing is much more efficient

Page 38: Introduction to Computerized Adaptive Testing (CAT)

The big picture

(Diagram: the five components, numbered 1 through 5, arranged in a loop.)

Page 39: Introduction to Computerized Adaptive Testing (CAT)

Point estimation CAT

Let’s look at the most common scenario:
Need an accurate score for each examinee
Adapt the quantity and difficulty of items

1. Bank: given
2. Starting point: 0.0
3. Item selection: info at current theta
4. Scoring: MLE
5. Termination: target SEM

Page 40: Introduction to Computerized Adaptive Testing (CAT)

Example CAT: item-by-item report of a maximum-information CAT
This test will terminate when the SEM is equal to or less than 0.200
Minimum number of items = 5; maximum number of items = 40
The standard error band plotted as ---- is plus or minus 2.00 standard errors
X = initial theta value; C = correct answer; I = incorrect answer

Item  Theta   SE     Response
0     -0.24*  1.00*  X (start)
1      4.00*  1.00*  (off scale)
2      4.00*  1.00*  (off scale)
3      2.52   0.84   I
4      2.77   0.68   C
5      2.38   0.61   I
6      2.09   0.61   I
7      1.49   0.89   I
8      0.36   1.00   I
9      0.88   0.63   C
10     1.13   0.56   C
11     1.34   0.49   C
12     1.44   0.46   C
13     1.55   0.43   C
14     1.67   0.41   C
15     1.54   0.38   I
16     1.60   0.36   C
(Error bands from the original plot omitted.)

Page 41: Introduction to Computerized Adaptive Testing (CAT)

Example CAT (continued)
Item  Theta  SE     Response
17    1.70   0.35   C
18    1.76   0.34   C
19    1.65   0.32   I
20    1.52   0.31   I
21    1.40   0.30   I
22    1.27   0.30   I
23    1.30   0.28   C
24    1.32   0.28   C
25    1.36   0.27   C
26    1.40   0.27   C
27    1.31   0.26   I
28    1.34   0.25   C
29    1.37   0.25   C
30    1.40   0.24   C
31    1.43   0.24   C
32    1.46   0.24   C
33    1.50   0.24   C
34    1.53   0.23   C
35    1.55   0.23   C
36    1.59   0.23   C
37    1.62   0.23   C
38    1.58   0.22   I
39    1.53   0.22   I
40    1.55   0.22   C
*Arbitrarily assigned value.
This test was terminated when the maximum number of items was reached.

Page 42: Introduction to Computerized Adaptive Testing (CAT)

Part 4: Modifying CAT for special situations

Page 43: Introduction to Computerized Adaptive Testing (CAT)

Practical constraints

Test length
Minimum: reduce complaints from failures
Maximum: stop never-ending tests
Fixed: appears to be fair but is not

Content
Educational standards, etc.

Item exposure
Reduce overuse of your best items!
Randomesque, Sympson-Hetter

Page 44: Introduction to Computerized Adaptive Testing (CAT)

Classification testing

What about pass/fail or multicategory decisions?
The conventional approach is to administer a long test to obtain an accurate score, then compare it to a cutscore

But we don’t need an accurate score
Students far above or below the cutscore can be stopped early

Sometimes, EXTREMELY early

Page 45: Introduction to Computerized Adaptive Testing (CAT)

Classification testing

This is referred to as computerized classification testing (CCT: Lin & Spray, 2000)

Very similar to CAT, but with a few notable differences

Page 46: Introduction to Computerized Adaptive Testing (CAT)

CCT

Methods are primarily delineated based on the stopping rule:
Ability confidence intervals (ACI: originally called “adaptive mastery testing” – see Kingsbury & Weiss, 1983)

Likelihood ratio (SPRT or GLR; Wald, 1947; Ferguson, 1967; Reckase, 1983; Thompson et al., 2008)

Page 47: Introduction to Computerized Adaptive Testing (CAT)

ACI

Specify the confidence interval: e.g., 95% interval is plus or minus 1.96 SEM

Keep administering items until interval is above or below cutscore (or you hit your maximum test length)
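The ACI rule can be sketched in a few lines (a hypothetical function; the 1.96 multiplier corresponds to the 95% interval mentioned above):

```python
def aci_decision(theta_hat, sem, cutscore, z=1.96):
    """Ability-confidence-interval stopping rule: classify as soon as the
    interval theta_hat +/- z*SEM falls entirely above or below the cutscore;
    otherwise keep administering items."""
    lo, hi = theta_hat - z * sem, theta_hat + z * sem
    if lo > cutscore:
        return "pass"
    if hi < cutscore:
        return "fail"
    return "continue"

print(aci_decision(1.2, 0.3, cutscore=0.0))  # pass: 1.2 - 1.96*0.3 > 0
print(aci_decision(0.3, 0.5, cutscore=0.0))  # continue: interval straddles 0
```

A maximum-test-length check would wrap this in practice, forcing a best-guess classification when the item limit is reached.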

Page 48: Introduction to Computerized Adaptive Testing (CAT)

ACI

Example graph:

Page 49: Introduction to Computerized Adaptive Testing (CAT)

Likelihood ratio

The LR approach completely abandons the use of the theta estimate

It tests only whether a student is above or below a cutscore

Uses an indifference region
More efficient than ACI (Spray & Reckase, 1996; Eggen, 1999; Thompson et al., 2007)
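A sketch of the SPRT stopping rule under the 3PL model (illustrative names; the indifference-region half-width delta and the alpha/beta error rates are assumed values, and the decision bounds are Wald's standard log((1-beta)/alpha) and log(beta/(1-alpha))):

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def sprt_decision(items, responses, cut, delta=0.5, alpha=0.05, beta=0.05):
    """SPRT classification: compare the likelihood of the responses at
    theta = cut + delta versus theta = cut - delta (the indifference region),
    then test the accumulated log-ratio against Wald's bounds."""
    log_ratio = 0.0
    for (a, b, c), u in zip(items, responses):
        p_hi = irf_3pl(cut + delta, a, b, c)
        p_lo = irf_3pl(cut - delta, a, b, c)
        log_ratio += math.log(p_hi / p_lo) if u == 1 else math.log((1.0 - p_hi) / (1.0 - p_lo))
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    if log_ratio >= upper:
        return "pass"
    if log_ratio <= lower:
        return "fail"
    return "continue"

items = [(1.0, 0.0, 0.2)] * 12
print(sprt_decision(items, [1] * 12, cut=0.0))  # enough evidence to pass
```

Note that no theta estimate appears anywhere: only the two fixed points around the cutscore are evaluated.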

Page 50: Introduction to Computerized Adaptive Testing (CAT)

GLR (Advanced SPRT)

Example

Page 51: Introduction to Computerized Adaptive Testing (CAT)

Part 5: Building a CAT

Page 52: Introduction to Computerized Adaptive Testing (CAT)

Background

Here is a framework:
Identify issues and the best way to investigate them
Leads to better quality in the end
Also the foundation for validity arguments

Why did you choose certain things?

Page 53: Introduction to Computerized Adaptive Testing (CAT)

Where to start

Go back to the 5 components:
1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule

Page 54: Introduction to Computerized Adaptive Testing (CAT)

Where to start

How do we ensure that we build a good bank and choose good algorithms?

Simulation drives much of this

Page 55: Introduction to Computerized Adaptive Testing (CAT)

The model

In practice, you have these steps:

Seq.  Stage                                               Primary work
1     Feasibility, applicability, and planning studies    Monte Carlo simulation; business case evaluation
2     Develop item bank content or utilize existing bank  Item writing and review
3     Pretest and calibrate item bank                     Pretesting; item analysis
4     Determine specifications for final CAT              Post-hoc or hybrid simulations
5     Publish live CAT                                    Publishing and distribution; software development

Page 56: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

Three types of simulations:
Monte Carlo
Generate fake item responses (but based on reality)
Random U(0,1), compared to P(X) from IRT
Post hoc (real data)
Hybrid

Page 57: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

Simulations act by “administering” a CAT:
Select first item
Generate response
Estimate θ
Select next item
…

Then analyze results:
Average test length
Accuracy: CAT θ vs. true θ (or full bank)

Page 58: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

At this point, real data are not likely available, so use Monte Carlo

Generate plausible situations:
Item bank size: 100, 200, 300…
Item quality: a = 0.7, 0.8…; spread of b
Desired precision: SEM = 0.2, 0.3, 0.4…

Compare results to each other and to fixed forms
Base values on reality (e.g., mean a)

Page 59: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

How do we “generate” responses?
Calculate the probability P(X) of a correct response to the item, given θ (which is known, because the examinees are imaginary)
Done using the IRT equation

Generate a random number between 0.0 and 1.0
Compare: if random > P(X), the response is incorrect; otherwise it is correct

Page 60: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

Example with θ = 1.0:
Item 1 has a = 1, b = 0, c = 0.2
P(X) = 0.877
Random number = 0.426
Random < P(X), so CORRECT

Item 2 has a = 1, b = 1, c = 0.2
P(X) = 0.6
Random number = 0.813
Random > P(X), so INCORRECT

And so on…
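The worked example above can be reproduced in a few lines (illustrative names; the slide's P(X) values imply the common D = 1.7 scaling, which is assumed here):

```python
import math
import random

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_response(theta, a, b, c, rng=random):
    """Generate one imaginary response: correct iff a U(0,1) draw
    falls at or below P(X) for the known (simulated) theta."""
    return 1 if rng.random() <= irf_3pl(theta, a, b, c) else 0

# The slide's worked example at theta = 1.0:
p1 = irf_3pl(1.0, a=1.0, b=0.0, c=0.2)  # ~0.877; draw of 0.426 < P -> correct
p2 = irf_3pl(1.0, a=1.0, b=1.0, c=0.2)  # 0.600;  draw of 0.813 > P -> incorrect
print(round(p1, 3), round(p2, 3))
```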

Page 61: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

Software will do this for you, allowing you to simulate CATs for thousands of examinees in seconds

You can then easily set up an experiment with a wide range of conditions, and run a simulation for each

Page 62: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

Example result:
A CAT with a bank of 300 items and SEM = 0.25 averages 53 items

The current fixed test has 100 items, SEM = 0.23 in the middle and 0.35+ beyond θ of ±1.5

CAT will make the test more accurate for extreme examinees, with about the same accuracy in the middle, but with a 50% reduction in length

Page 63: Introduction to Computerized Adaptive Testing (CAT)

1. Feasibility, applicability, planning

More on this topic in my session

Page 64: Introduction to Computerized Adaptive Testing (CAT)

Epilogue: Maintaining CAT

Like fixed-form testing, maintenance is usually necessary

Check that the CAT is performing as expected:
Is the termination criterion being satisfied?
Are examinees hitting the maximum test length or other constraints?
Is the average test length what you expected?

Page 65: Introduction to Computerized Adaptive Testing (CAT)

Epilogue: Maintaining CAT

Exposure: are certain items overused?
CAT is greedy, always selecting the items with the best discriminations
Refresh the pool by replacing exposed items

Test security: are any items out on the internet?
Related to exposure, but not the same

Parameter drift?
P&P comparability study? Equity?

Page 66: Introduction to Computerized Adaptive Testing (CAT)

Wrap-up

Questions?
More resources?

Page 67: Introduction to Computerized Adaptive Testing (CAT)


• Is CAT a good fit for my organization?
• Many organizations hear these benefits of CAT
• However, CAT is not relevant or possible in most situations
• How can I evaluate whether it is applicable for my organization?
• Consider the following requirements…

Computerized Adaptive Testing

Page 68: Introduction to Computerized Adaptive Testing (CAT)


Items scoreable in real time
CAT scores items immediately so the next item can be determined instantaneously
Paper testing is obviously irrelevant
Essays and other constructed-response items are not possible unless you package an automated scoring system

Requirement #1

Page 69: Introduction to Computerized Adaptive Testing (CAT)


Large item banks
A general rule of thumb is that you need 3 times as many items in the bank as intended for the test
100-item test? Need 300 items in the bank
You then need resources to write, review, and pilot all of these items
Piloting is an issue unto itself

Requirement #2

Page 70: Introduction to Computerized Adaptive Testing (CAT)


CAT Requirements

Large pilot samples
CAT uses item response theory (IRT) as the underlying psychometric model
This requires that all items have at least 100 examinee responses just for a cursory analysis
Preferably at least 1,000
Then if you have 300 items and each examinee sees 100 during a pilot, you need 3,000 examinees just for the pilot study!

Requirement #3

Page 71: Introduction to Computerized Adaptive Testing (CAT)


PhD psychometricians
Psychometricians are necessary to perform the complex IRT analysis
Also needed to perform the CAT simulation validity studies
The test is otherwise not defensible

Requirement #4

Page 72: Introduction to Computerized Adaptive Testing (CAT)


Sophisticated software
Software designed specifically for IRT analysis is necessary to calibrate the pilot data
Separate software is necessary for the CAT simulation studies
General statistical software is not usually acceptable

Requirement #5

Page 73: Introduction to Computerized Adaptive Testing (CAT)


IRT item banker
Items must be packaged with their IRT parameters
Your item banking system must utilize IRT parameters appropriately for assembling banks/forms

Requirement #6

Page 74: Introduction to Computerized Adaptive Testing (CAT)


CAT delivery system
You must have a computerized test delivery system capable of performing all the complex CAT calculations
Crude, homegrown approximations are not defensible
Must be reliable, scalable, and secure

Requirement #7

Page 75: Introduction to Computerized Adaptive Testing (CAT)


Money and resources
Developing a defensible CAT is extremely expensive
Small research project: $20,000
Large operational CAT: hundreds of thousands of dollars

However, the benefits stated earlier can outweigh these costs, especially with large-volume tests
CAT can then be a positive investment

Requirement #8