An Introduction to CAT
Outline
What and why?
Intro to IRT
The 5 components
Modifying CAT for special situations
Building a CAT
Background?
How much do you know about CAT?
Heard of it
I am familiar with IRT
Have taken a course/workshop
Have built a CAT (why are you here?!?!?)
The ultimate goal: CAT in a Box
Part 1: Benefits of CAT
What is CAT?
A Computerized Adaptive Test (CAT)is a test administered by computer that dynamically adjusts itself to the trait level of each examinee as the test is being administered
CAT is an algorithm
Why? Benefits of CAT
Efficiency: reduce test length by 50% or more; can be gigantic financial/time benefits (the main driver of CAT)
Control of measurement precision: measure or classify all examinees with the same degree of precision
Increased fairness: what is more fair, seeing the same items but some students having unreliable scores, or reliable scores but different items?
Benefits of CAT
Added security: if everyone receives the same 100 items, the items will become well known
More frequent retesting: if you go study for a month and then retest, you will probably get a different test
Benefits of CAT
Better examinee motivation:
Top students do not waste time on easy questions
Low-ability students are not discouraged by tough questions
More precise scores, especially at the extremes
Benefits of Computerized Testing (in general)
Immediate score reporting: P&P testing requires the question papers to come back and be scored
More item formats: audio, video, hotspots, drag and drop, etc.
Easier results collection, management, and reporting
Disadvantages of CAT
Public relations: need to explain to examinees/parents why certain things can happen, like failing after only 10 questions, or passing with a 50% correct score
Heavy requirements to do it right: software, expertise, effort
Disadvantages of CAT
Item exposure (but better than with fixed forms!)
Not feasible/applicable for every situation: small samples, subjective items, items not scorable in real time
Part 2: Intro to IRT
Intro to IRT
While it is possible to design CATs with classical test theory (Frick, 1992), IRT is more appropriate because it puts items and examinees on the same scale
We assume use of IRT
Classical item statistics
CTT: option proportions are often translated to a quantile plot
The item response function: the basic building block of IRT
The item response function
The IRF
Parameters: a, b, c
Item information function
Used for scoring examinees
The IRF
a: the discrimination parameter; represents how well the item differentiates examinees; the slope of the curve at its center
b: the difficulty parameter; represents how easy or hard the item is with respect to examinees; the location of the curve (left to right)
c: the pseudoguessing parameter; represents the 'base probability' of answering the question correctly; the lower asymptote
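In code, the IRF built from these three parameters is just a logistic curve. A minimal Python sketch of the three-parameter logistic (3PL) model; the D = 1.7 scaling constant and the example values are illustrative:

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    """3PL item response function: probability of a correct response.

    a: discrimination (slope), b: difficulty (location),
    c: pseudoguessing (lower asymptote), D: logistic scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta = b, the probability is halfway between c and 1
print(irf_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))
```

Raising theta moves the probability toward 1; lowering it moves the probability toward the lower asymptote c.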
Item information functions
Example: 5 items (figure)
TIF and SEM
IIFs can be summed into a Test Information Function (TIF)
The TIF can be inverted into a SEM function (more info = less error)
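The IIF-to-TIF-to-SEM chain can be sketched in Python using the standard Fisher information formula for the 3PL; the three-item bank here is made up for illustration:

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def iif_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at theta."""
    p = irf_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def sem(theta, bank, D=1.7):
    """Sum IIFs into the TIF, then invert: SEM = 1 / sqrt(TIF)."""
    tif = sum(iif_3pl(theta, a, b, c, D) for (a, b, c) in bank)
    return 1.0 / math.sqrt(tif)

# Illustrative bank of (a, b, c) triples centered near theta = 0
bank = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
print(sem(0.0, bank))  # more information near theta = 0 means lower SEM there
```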
IRT Scoring
Scoring happens by multiplying IRFs (P for a correct response, 1 - P for an incorrect one) to get a likelihood function
The theta at the maximum of the likelihood is the examinee's score
It is on the b/θ scale, so we can then use it to adapt
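The scoring step can be illustrated with a brute-force grid search: multiply P for each correct response and 1 - P for each incorrect one, then take the theta with the largest product. The item parameters below are invented for illustration:

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def mle_theta(items, responses, grid=None):
    """Grid-search MLE: return the theta on the grid with the highest likelihood."""
    if grid is None:
        grid = [g / 100.0 for g in range(-400, 401)]  # -4.00 to +4.00 by 0.01
    def likelihood(theta):
        L = 1.0
        for (a, b, c), x in zip(items, responses):
            p = irf_3pl(theta, a, b, c)
            L *= p if x == 1 else (1.0 - p)
        return L
    return max(grid, key=likelihood)

# Three illustrative items (a, b, c); a mixed response vector is required for MLE
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_hat = mle_theta(items, [1, 1, 0])
print(theta_hat)
```

Operational software uses faster numerical methods (e.g., Newton-Raphson), but the logic is the same.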
Part 3: Basic principles of CAT (The Five Components)
CAT Components
1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule
Given 1 and 2, we loop 3-4-5 until 5 is satisfied
All CAT follows this basic format; we just modify the details for whatever testing situation we have
Algorithms inside your testing engine
1. Calibrated item bank
Goal: develop an item bank that meets the needs of your CAT
Q: What are the needs? A: Monte Carlo simulation
More broadly: think about TIF and CSEM, and work backwards
1. Calibrated item bank
SEM: How accurate do you want to be?
How many items would it take?
Given your items, what is the best hope?
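Because SEM = 1/sqrt(TIF), you can work backwards from a target SEM to a rough item count. A back-of-envelope sketch (the average-information value is purely illustrative):

```python
import math

def items_needed(target_sem, avg_item_info):
    """Rough item count from SEM = 1 / sqrt(n * avg_item_info)."""
    return math.ceil(1.0 / (avg_item_info * target_sem ** 2))

# Illustrative: if a well-targeted item contributes ~0.5 information on average,
# a target SEM of 0.25 implies roughly 32 items
print(items_needed(0.25, 0.5))  # 32
```

Real banks do not deliver average information uniformly across theta, which is why simulation (below) replaces this arithmetic in practice.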
1. Calibrated item bank
Do you need a lot of info near a cutscore?
2. Starting rule
1. Can start everyone with the same theta estimate (e.g., theta = 0.0)
Everyone gets the same first item
Could be an exposure problem in a high-stakes test
2. Assign a random theta estimate within an interval
E.g., between theta = -0.5 and +0.5
Improves exposure levels and has little effect on a properly implemented CAT
2. Starting rule
3. Use prior information available for a given examinee
Subjective evaluations, e.g., below average, above average
Theta estimates from previously administered tests (same or different)
Projected theta based on biodata (age, GPA, etc.) – IACAT 2010
3. Item selection rule
Items are selected to maximize information
There are different ways to quantify this
Fisher (single-point) information at the current score is most common
3. Item selection
Also, there are usually practical constraints in item selection: item exposure, content area (domain), cognitive level, etc.
3. Item selection
Example: 5 items (figure)
4. Scoring rule
Typically, MLE is used to score examinees after each item (sometimes Bayesian methods instead)
However, MLE assumes a mixed response vector (not all correct or all incorrect)
This is rarely the case in the first few items, and it is never the case after only the first item
4. Scoring rule
So, we need to adapt the scoring rule while the response vector is nonmixed
We can always start with Bayesian scoring
Can also use fixed step sizes: add or subtract 1 to the score after each item until mixed
5. Stopping rule
Depends primarily on the purpose of the test: point estimation or classification?
Point estimation: we want an accurate score for each student
Classification: we do NOT need an accurate score, just a classification into pass/fail, etc.
Also, change: to see if the score has gone up/down a certain amount
5. Stopping rule
Point estimation methods involve actual scores, and stop when we have zeroed in enough
Classification methods check after every item to see if we can make a classification within a certain degree of accuracy
Specifics will be covered later
5. Stopping rule
Either type of CAT can be designed with a fixed number of items
This is a bad idea from a psychometric perspective, but it can greatly enhance perceived fairness
Variable-length testing is much more efficient
The big picture
(Figure: the five components, 1-5, in the CAT loop)
Point estimation CAT
Let’s look at the most common scenario:
Need an accurate score for each examinee
Adapt the quantity and difficulty of items
1. Bank: given
2. Starting point: 0.0
3. Item selection: info at current theta
4. Scoring: MLE
5. Termination: target SEM
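These five choices can be wired together into a tiny end-to-end simulation. Everything below (bank size, parameter ranges, step size, targets) is illustrative, a sketch rather than a production implementation:

```python
import math, random

D = 1.7  # logistic scaling constant

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def mle(administered, responses):
    """Grid-search MLE over theta in [-4, 4]."""
    grid = [g / 50.0 for g in range(-200, 201)]
    def L(t):
        out = 1.0
        for (a, b, c), x in zip(administered, responses):
            p = p3pl(t, a, b, c)
            out *= p if x else 1 - p
        return out
    return max(grid, key=L)

def simulate_cat(bank, true_theta, target_sem=0.3, max_items=40, rng=random):
    theta, administered, responses = 0.0, [], []          # 2. starting rule
    remaining = list(bank)
    while remaining and len(administered) < max_items:
        item = max(remaining, key=lambda it: info(theta, *it))  # 3. max info at current theta
        remaining.remove(item)
        x = 1 if rng.random() < p3pl(true_theta, *item) else 0  # simulated response
        administered.append(item); responses.append(x)
        if 0 < sum(responses) < len(responses):           # 4. MLE once the vector is mixed
            theta = mle(administered, responses)
        else:
            theta += 0.5 if x else -0.5                   # fixed step until mixed
        tif = sum(info(theta, *it) for it in administered)
        if 1 / math.sqrt(tif) <= target_sem:              # 5. stop at target SEM
            break
    return theta, len(administered)

rng = random.Random(42)
bank = [(rng.uniform(0.7, 1.5), rng.uniform(-3, 3), 0.2) for _ in range(200)]  # 1. bank
theta_hat, n_items = simulate_cat(bank, true_theta=1.0, rng=rng)
print(theta_hat, n_items)
```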
Example CAT: Item-by-Item Report of a Maximum Information CAT
This test will terminate when the SEM is equal to or less than 0.200
Minimum number of items = 5; maximum number of items = 40
The standard error band plotted as ---- is plus or minus 2.00 standard errors.
X = initial theta value; C = correct answer; I = incorrect answer
Item Theta  SE    -3.......-2........-1.........0........+1........+2........+3
 0   -0.24* 1.00*  --------------------X--------------------
 1    4.00* 1.00*  .                    -------------------->
 2    4.00* 1.00*  .                    -------------------->
 3    2.52  0.84   .                 -----------------I-----
 4    2.77  0.68   .                     -------------C---
 5    2.38  0.61   .                ------------I-------
 6    2.09  0.61   .             ------------I----------
 7    1.49  0.89       -----------------I----------------
 8    0.36  1.00   --------------------I--------------------
 9    0.88  0.63      ------------C-------------
10    1.13  0.56        -----------C-----------
11    1.34  0.49   .      ----------C----------
12    1.44  0.46   .        ---------C---------
13    1.55  0.43   .         ---------C---------
14    1.67  0.41   .           --------C--------
15    1.54  0.38   .          --------I--------
16    1.60  0.36   .           --------C-------
17    1.70  0.35   .            -------C-------
18    1.76  0.34   .             -------C-------
19    1.65  0.32   .             ------I------
20    1.52  0.31   .           -------I------
21    1.40  0.30   .          -------I------
22    1.27  0.30   .         ------I------
23    1.30  0.28   .          ------C-----
24    1.32  0.28   .          ------C-----
25    1.36  0.27   .           -----C------
26    1.40  0.27   .           -----C------
27    1.31  0.26   .          ------I-----
28    1.34  0.25   .           -----C-----
29    1.37  0.25   .           -----C-----
30    1.40  0.24   .            ----C-----
31    1.43  0.24   .           -----C-----
32    1.46  0.24   .            -----C-----
33    1.50  0.24   .             ----C-----
34    1.53  0.23   .            -----C----
35    1.55  0.23   .            -----C-----
36    1.59  0.23   .             ----C-----
37    1.62  0.23   .            -----C----
38    1.58  0.22   .             ----I-----
39    1.53  0.22   .            -----I----
40    1.55  0.22   .             ----C----
_______________________________________________________________________________
* Arbitrarily assigned value.
This test was terminated when the maximum number of items was reached.
Part 4: Modifying CAT for special situations
Practical constraints
Test length:
Minimum: reduce complaints from failures
Maximum: stop never-ending tests
Fixed: appears to be fair but is not
Content: educational standards, etc.
Item exposure: reduce overuse of your best items! Methods: randomesque, Sympson-Hetter
Classification testing
What about pass/fail or multicategory decisions?
The conventional approach is to administer a long test to obtain an accurate score, then compare it to a cutscore
But we don't need an accurate score: students far above or below the cutscore can be stopped early
Sometimes, EXTREMELY early
Classification testing
This is referred to as computerized classification testing (CCT: Lin & Spray, 2000)
Very similar to CAT, but with a few notable differences
CCT
Methods are primarily delineated based on the stopping rule:
Ability confidence intervals (ACI: originally called "adaptive mastery testing"; see Kingsbury & Weiss, 1983)
Likelihood ratio (SPRT or GLR; Wald, 1947; Ferguson, 1967; Reckase, 1983; Thompson et al., 2008)
ACI
Specify the confidence interval: e.g., 95% interval is plus or minus 1.96 SEM
Keep administering items until interval is above or below cutscore (or you hit your maximum test length)
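The ACI check after each item is a one-line comparison of the confidence interval against the cutscore. A sketch with illustrative values:

```python
def aci_decision(theta_hat, sem, cutscore, z=1.96):
    """Ability confidence interval: classify once the ~95% CI clears the cutscore."""
    lower, upper = theta_hat - z * sem, theta_hat + z * sem
    if lower > cutscore:
        return "pass"
    if upper < cutscore:
        return "fail"
    return "continue testing"

# CI = 0.8 +/- 1.96 * 0.25 = (0.31, 1.29), entirely above a cutscore of 0.0
print(aci_decision(theta_hat=0.8, sem=0.25, cutscore=0.0))  # pass
```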
ACI
Example graph (figure)
Likelihood ratio
The LR approach completely abandons the use of the theta estimate
Tests only whether a student is above or below a cutscore
Uses an indifference region around the cutscore
More efficient than ACI (Spray & Reckase, 1996; Eggen, 1999; Thompson et al., 2007)
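A sketch of the SPRT stopping rule: compare the likelihood of the response string at the two edges of the indifference region against Wald's decision bounds. The item parameters and region width below are illustrative:

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def sprt_decision(items, responses, theta_lo, theta_hi, alpha=0.05, beta=0.05):
    """Wald's SPRT: log-likelihood ratio at theta_hi vs. theta_lo."""
    llr = 0.0
    for (a, b, c), x in zip(items, responses):
        p_hi, p_lo = p3pl(theta_hi, a, b, c), p3pl(theta_lo, a, b, c)
        llr += math.log(p_hi / p_lo) if x else math.log((1 - p_hi) / (1 - p_lo))
    upper = math.log((1 - beta) / alpha)  # exceed -> classify above cutscore
    lower = math.log(beta / (1 - alpha))  # fall below -> classify below cutscore
    if llr >= upper:
        return "pass"
    if llr <= lower:
        return "fail"
    return "continue testing"

# 12 correct answers on items at the cutscore, indifference region of +/- 0.25
print(sprt_decision([(1.0, 0.0, 0.2)] * 12, [1] * 12, -0.25, 0.25))
```

Note that no theta estimate appears anywhere: only the likelihoods at the two fixed points matter.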
GLR (Advanced SPRT)
Example (figure)
Part 5: Building a CAT
Background
Here is a framework:
Identify issues and the best way to investigate them
Leads to better quality in the end
Also the foundation for validity arguments: why did you choose certain things?
Where to start
Go back to the 5 components:
1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule
Where to start
How do we ensure that we build a good bank and choose good algorithms?
Simulation drives much of this
The model
In practice, you have these steps:
Seq.  Stage                                               Primary work
1     Feasibility, applicability, and planning studies    Monte Carlo simulation; business case evaluation
2     Develop item bank content or utilize existing bank  Item writing and review
3     Pretest and calibrate item bank                     Pretesting; item analysis
4     Determine specifications for final CAT              Post-hoc or hybrid simulations
5     Publish live CAT                                    Publishing and distribution; software development
1. Feasibility, applicability, planning
Three types of simulations:
Monte Carlo: generate fake item responses (but based on reality); draw a random U(0,1) and compare to P(X) from IRT
Post hoc (real data)
Hybrid
1. Feasibility, applicability, planning
Simulations act by “administering” a CAT:
Select first item
Generate response
Estimate θ
Select next item…
Then analyze the results:
Average test length
Accuracy: CAT θ vs. true θ (or full-bank θ)
1. Feasibility, applicability, planning
At this point, real data are not likely available, so use Monte Carlo
Generate plausible situations:
Item bank size: 100, 200, 300…
Item quality: a = 0.7, 0.8…; spread of b
Desired precision: SEM = 0.2, 0.3, 0.4…
Compare results to each other and to fixed forms
Base values on reality (e.g., the mean a in your bank)
1. Feasibility, applicability, planning
How do we “generate” responses?
Calculate the probability P(X) of a correct response to the item, given θ (which is known, because the examinees are imaginary); done using the IRT equation
Generate a random number between 0.0 and 1.0
Compare: if random > P(X), then incorrect, and vice versa
1. Feasibility, applicability, planning
Example with θ = 1.0:
Item 1 has a = 1, b = 0, c = 0.2
P(X) = 0.877
Random number = 0.426
Random < P(X), so CORRECT
Item 2 has a = 1, b = 1, c = 0.2
P(X) = 0.6
Random number = 0.813
Random > P(X), so INCORRECT
And so on…
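The worked example above can be reproduced directly (using the common D = 1.7 scaling constant, which matches the slide's probabilities to within rounding):

```python
import math, random

def irf_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_response(theta, a, b, c, rng):
    """Simulated 0/1 response: correct if the random draw falls below P(X)."""
    p = irf_3pl(theta, a, b, c)
    return 1 if rng.random() < p else 0

rng = random.Random(0)
p1 = irf_3pl(1.0, a=1.0, b=0.0, c=0.2)  # approx 0.877, as in the example
p2 = irf_3pl(1.0, a=1.0, b=1.0, c=0.2)  # exactly 0.6, as in the example
print(p1, p2, simulate_response(1.0, 1.0, 0.0, 0.2, rng))
```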
1. Feasibility, applicability, planning
Software will do this for you, allowing you to simulate CATs for thousands of examinees in seconds
You can then easily set up an experiment with a wide range of conditions, and run a simulation for each
1. Feasibility, applicability, planning
Example result:
CAT with a bank of 300 items and SEM = 0.25 has an average of 53 items
Current fixed test has 100 items, SEM=0.23 in middle and 0.35+ beyond θ of ±1.5
CAT will make the test more accurate for extreme examinees and about as accurate in the middle, with a roughly 50% reduction in test length
1. Feasibility, applicability, planning
More on this topic in my session
Epilogue: Maintaining CAT
Like fixed-form testing, maintenance is usually necessary
Check that the CAT is performing as expected:
Is the termination criterion being satisfied?
Are examinees hitting the maximum test length or other constraints?
Is the average test length what you expected?
Epilogue: Maintaining CAT
Exposure: are certain items overused?
CAT is greedy, always selecting items with best discriminations
Refresh the pool by replacing exposed items
Test security: are any items out on the internet? (Related to exposure, but not the same)
Parameter drift?
P&P comparability study? Equity?
Wrap-up
Questions?
More resources?
• Is CAT a good fit for my organization?
• Many organizations hear these benefits of CAT
• However, CAT is not relevant or possible in most situations
• How can I evaluate if it is applicable for my organization?
• Consider the following requirements…
Computerized Adaptive Testing
Requirement #1: Items scorable in real time
CAT scores items immediately so the next item can be determined instantaneously
Paper testing is obviously irrelevant
Essays and other constructed-response items are not possible unless you package an automated scoring system
Requirement #2: Large item banks
A general rule of thumb is that you need 3 times as many items in the bank as intended for the test
100-item test? Need 300 items in the bank
You then need resources to write, review, and pilot all of these items
Piloting is an issue unto itself
CAT Requirements
Requirement #3: Large pilot samples
CAT uses item response theory (IRT) as the underlying psychometric model
This requires that all items have at least 100 examinee responses just for a cursory analysis (preferably at least 1,000)
Then if you have 300 items and each examinee sees 100 during a pilot, you need 3,000 examinees just for the pilot study!
Requirement #4: PhD psychometricians
Psychometricians are necessary to perform complex IRT analysis
Also needed to perform CAT simulation and validity studies
The test is otherwise not defensible
Requirement #5: Sophisticated software
Software designed specifically for IRT analysis is necessary to calibrate the pilot data
Separate software is necessary for the CAT simulation studies
General statistical software is not usually acceptable
Requirement #6: IRT item banker
Items must be packaged with their IRT parameters
Your item banking system must utilize IRT parameters appropriately for assembling banks/forms
Requirement #7: CAT delivery system
You must have a computerized test delivery system capable of performing all the complex CAT calculations
Crude, homegrown approximations are not defensible
Must be reliable, scalable, and secure
Requirement #8: Money and resources
Developing a defensible CAT is extremely expensive
Small research project: $20,000
Large operational CAT: hundreds of thousands of dollars
However, the benefits stated earlier can outweigh these costs, especially with large-volume tests
CAT can then be a positive investment