Using Item Response Theory to Improve Assessment

Page 1: Using Item Response Theory to Improve Assessment

Day 1 PM: Using IRT

Item and test information

Comparison of IRT to Classical Test Theory

How to do IRT analysis

Page 2: Using Item Response Theory to Improve Assessment

Part 1

Item and test information

Page 3: Using Item Response Theory to Improve Assessment

Information

Information is the tool that IRT uses to build tests

It is a statistical term that quantifies how much something “adds” to a procedure

Or, alternatively, how much uncertainty (error) it decreases

A good test has a lot of information!

Page 4: Using Item Response Theory to Improve Assessment

Item information

IRT calculates information for each item and test at each level of θ

It is therefore not a single number – it is a function across ability

Each item has an item information function

Each test has a test information function

Page 5: Using Item Response Theory to Improve Assessment

Item information

Some items provide information for high-ability students, some for low

The same is true for tests: a test can be more accurate for certain score ranges – and IRT will tell you which

Page 6: Using Item Response Theory to Improve Assessment

Information

Item information is additive; that is, item information functions can be summed to obtain the test information function (TIF)

Then we know where to add or subtract items

Bonus: the TIF can also be inverted to obtain a predicted SEM curve

Page 7: Using Item Response Theory to Improve Assessment

Item information

With CTT, “information” can be conceptualized by jointly considering the P and rpbis
◦ Obviously, a higher rpbis is better
  Definitely don’t want negative!
◦ P represents which examinees it is most appropriate for
  P = 0.95 is easy, good for low examinees
  P = 0.50 is hard, good for high examinees

Page 8: Using Item Response Theory to Improve Assessment

Item information

But since items and examinees are not on the same scale, there is no direct connection

With IRT, there is: an item with b = 0.7 is good for a person with θ = 0.7
◦ This is the basis of adaptive testing – doing this continually

Page 9: Using Item Response Theory to Improve Assessment

Item information

Item information takes this idea and quantifies it across the spectrum

It is therefore a function of θ as well as the item parameters:

$$I_i(\theta) = D^2 a_i^2 \, \frac{Q_i(\theta)}{P_i(\theta)} \left[ \frac{P_i(\theta) - c_i}{1 - c_i} \right]^2$$

where $P_i(\theta)$ is the probability of a correct answer for a given θ and $Q_i(\theta) = 1 - P_i(\theta)$

Page 10: Using Item Response Theory to Improve Assessment

Item information

That is the computational equation

The conceptual version that is seen in the literature is

$$I_i(\theta) = \frac{\left[ P_i'(\theta) \right]^2}{P_i(\theta)\left( 1 - P_i(\theta) \right)}$$

or the slope squared over the conditional variance
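To make this concrete, here is a minimal Python sketch of the computational equation (the helper names p_3pl and item_info_3pl are my own, not from any IRT package; it assumes the logistic 3PL with the common D = 1.7 scaling):

```python
import numpy as np

D = 1.7  # common scaling constant for the logistic 3PL

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """Item information: D^2 a^2 (Q/P) ((P - c)/(1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return D**2 * a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
```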

Page 11: Using Item Response Theory to Improve Assessment

Graphing info

So what does this mean?

We calculate with that equation, and it will be higher wherever the slope of the IRF is higher (for a given value of θ)

This is the item information function (IIF)

Page 12: Using Item Response Theory to Improve Assessment

Graphing info

So the location of the item determines the location of the IIF

The discrimination of the item determines the spread/peakedness of the IIF

Information decreases as the guessing parameter increases

Page 13: Using Item Response Theory to Improve Assessment

Some example items

Seq   a     b      c
1     1.00  -2.00  0.26
2     0.70  -1.00  0.21
3     0.40  -0.50  0.30
4     0.50   1.00  0.00
5     0.80   0.00  0.22
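As an illustration, these five example items can be pushed through the hypothetical item_info_3pl helper sketched earlier to trace each IIF over a grid of θ values:

```python
import numpy as np

# (a, b, c) for the five example items in the table above
items = [
    (1.00, -2.00, 0.26),
    (0.70, -1.00, 0.21),
    (0.40, -0.50, 0.30),
    (0.50,  1.00, 0.00),
    (0.80,  0.00, 0.22),
]
thetas = np.linspace(-3, 3, 61)  # ability grid from -3 to +3
iifs = np.array([item_info_3pl(thetas, a, b, c) for a, b, c in items])
# iifs[i] is item i's information function evaluated across the grid
```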

Page 14: Using Item Response Theory to Improve Assessment

Example item IRFs

Page 15: Using Item Response Theory to Improve Assessment

IIFs – example items

Page 16: Using Item Response Theory to Improve Assessment

Graphing info functions

Note that a lower slope is not ALL bad

Even though Item 3’s peak is lower, it provides some info at a much wider range

So items like that are quite useful when info is needed across a wide range

Page 17: Using Item Response Theory to Improve Assessment

Using item info

Item information is inversely related to error in measurement

If the item provides more info, it reduces error

The equation:

$$SEM(\theta) = \frac{1}{\sqrt{I(\theta)}}$$
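In code, this inversion is a one-liner; a minimal sketch continuing the hypothetical helpers above:

```python
import numpy as np

def sem_from_info(info):
    """Conditional standard error: inverse square root of information."""
    return 1.0 / np.sqrt(info)
```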

Page 18: Using Item Response Theory to Improve Assessment

Using item info

Key point: an item has less error where it has more information
→ where it has more slope

A test has less error where it has more information (items)

Page 19: Using Item Response Theory to Improve Assessment

Using item info

IIFs are another way to examine items individually

They are also what adaptive testing utilizes for item selection

But the best use of item info: test information and test assembly…

Page 20: Using Item Response Theory to Improve Assessment

Test information

As a result of the assumption of local independence, IIFs can be summed to obtain a test information function (TIF)

The same is true for IRFs – they can be summed into a test response function (TRF)
◦ This converts thetas to estimated raw scores
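A minimal sketch of both sums, reusing the hypothetical items, thetas, and iifs arrays from the earlier examples:

```python
# Test information function: sum the IIFs across items
tif = iifs.sum(axis=0)

# Test response function: sum the IRFs, giving the expected
# raw score at each theta
trf = np.array([p_3pl(thetas, a, b, c) for a, b, c in items]).sum(axis=0)

# Conditional SEM curve: invert the TIF
csem = 1.0 / np.sqrt(tif)
```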

Page 21: Using Item Response Theory to Improve Assessment

Test information

Test information, like item information, shows how well a test measures at each value of θ

It also inverts to a conditional SEM (CSEM) curve

This is extremely useful for test assembly (aka construction, design, or building)

Page 22: Using Item Response Theory to Improve Assessment

Test information

Consider the 5 IRFs…

Page 23: Using Item Response Theory to Improve Assessment

Test information

The TRF is…

Page 24: Using Item Response Theory to Improve Assessment

Test information

Consider the 5 IIFs…

Page 25: Using Item Response Theory to Improve Assessment

Test information

The TIF is…

Page 26: Using Item Response Theory to Improve Assessment

Test information

The CSEM curve is…

Page 27: Using Item Response Theory to Improve Assessment

Test assembly

Form building is more efficient and better directed with IRT

Reason: we can predict measurement error (SEM) at each level of θ, not just overall reliability

Page 28: Using Item Response Theory to Improve Assessment

Test assembly

This then allows you to build test forms with specific TIFs or CSEMs in mind

Or multiple forms with the same TIF

The following figures have the same average a (0.9) but differ in where they provide information

Page 29: Using Item Response Theory to Improve Assessment

TRFs

Page 30: Using Item Response Theory to Improve Assessment

TIFs

Page 31: Using Item Response Theory to Improve Assessment

CSEMs

Page 32: Using Item Response Theory to Improve Assessment

Test development

You can build your test with a specific TRF/TIF/SEM graph in mind

Peak at the cutscore?

This can be done inside item bankers (FastTEST & FT Web) or in separate spreadsheets (my Form Building Tool)
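As a rough illustration of the idea (a greedy sketch of my own, not how FastTEST or the Form Building Tool work internally; operational assembly often uses linear-programming methods instead), one can repeatedly pick the banked item that adds the most information where the form is furthest below a target TIF:

```python
import numpy as np

def greedy_assemble(bank_iifs, target_tif, form_length):
    """Greedily pick items whose IIFs best close the gap to a target TIF.

    bank_iifs:  (n_items, n_thetas) array of item information functions
    target_tif: (n_thetas,) desired test information function
    """
    chosen = []
    form_tif = np.zeros_like(target_tif)
    available = set(range(len(bank_iifs)))
    for _ in range(form_length):
        gap = np.clip(target_tif - form_tif, 0.0, None)  # unmet info only
        # Score each remaining item by the info it adds where it is needed
        best = max(available, key=lambda i: float(bank_iifs[i] @ gap))
        chosen.append(best)
        available.remove(best)
        form_tif += bank_iifs[best]
    return chosen, form_tif
```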

Page 33: Using Item Response Theory to Improve Assessment

Bank development

You can also build the bank for a testing program with the desired TIF in mind

If you know you want it to be peaked, write items at the desired level of difficulty to build an adequate bank

Page 34: Using Item Response Theory to Improve Assessment

Bank development

Otherwise you risk overexposure

Don’t use all your best items at once to make a peaked TIF – or any TIF for that matter

In the theoretical IRT world, we don’t have to worry about that, but exposure is a real issue

Page 35: Using Item Response Theory to Improve Assessment

Bank development

That is the reason linear-on-the-fly testing (LOFT) was developed – to massively reduce exposure and increase security
◦ Every person gets a very similar TIF, but a completely different test
◦ These tests are parallel, from an IRT point of view
◦ Tests are conventional fixed-form

Page 36: Using Item Response Theory to Improve Assessment

Part 2

A brief comparison of CTT and IRT

Page 37: Using Item Response Theory to Improve Assessment

CTT and IRT Assumptions

IRT:
◦ Unidimensionality and local independence
◦ Responses modeled by IRF
◦ Parameters, not statistics (sample independence)

CTT:
◦ X = T + E
◦ (1) True scores and error scores are uncorrelated; (2) the average error score in the sample is zero
◦ Statistics (not parameters) are sample-based

Page 38: Using Item Response Theory to Improve Assessment

Comparing CTT and IRT

CTT is said to have weaker assumptions
◦ Does not explicitly assume unidimensionality
  But if not there, statistics will be iffy, and rpbis and reliability suffer
  Sum scoring implicitly assumes items are equivalent, which means unidimensional (all items count equally on one total score)

Page 39: Using Item Response Theory to Improve Assessment

Comparing CTT and IRT

CTT is said to have weaker assumptions
◦ Does not explicitly assume an IRF
  But if the idea of an IRF is not working, then the item isn’t either
  And if you use rpbis, you assume a linear IRF, which is actually impossible – a straight line would eventually predict probabilities below 0 or above 1!

Page 40: Using Item Response Theory to Improve Assessment

Comparing CTT and IRT

CTT item statistics are at odds with each other
◦ P says that there is one common probability of a correct response (binomial)
◦ But rpbis says that P increases with total score (~ability)

Page 41: Using Item Response Theory to Improve Assessment

Comparing CTT and IRT

Classical SEM: same for everyone

IRT SEM: different for everyone – depends on the items you see and your ability

Which is more realistic?

Page 42: Using Item Response Theory to Improve Assessment

Comparing CTT and IRT

Direct comparison of item statistics
◦ We still use “difficulty” and “discrimination”
◦ How different are they from CTT?
◦ Difficulty correlates highly (>0.90)
◦ Discrimination does not – because rpbis is linear and IRT is not

Page 43: Using Item Response Theory to Improve Assessment

Comparing CTT and IRT

IRT and CTT scores also correlate >0.95

So why use IRT?

There are distinct advantages…

Page 44: Using Item Response Theory to Improve Assessment

Advantages of IRT

IRT has parameters, not statistics

Sample-independent… within a linear transformation

Huh? This means that if you have two calibration groups of different levels, we can convert parameters/scores with a simple y = mx + b (linking)
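A hedged sketch of applying such a transformation (the slope m and intercept k would come from a separate linking study, e.g., mean/sigma or Stocking–Lord; the values below are made up):

```python
# Hypothetical linking coefficients from a separate linking study
m, k = 1.08, -0.15  # slope and intercept of y = m*x + k

def rescale(a, b, theta):
    """Put new-group 3PL parameters and thetas on the base metric.

    theta* = m*theta + k, b* = m*b + k, a* = a/m; c is scale-free.
    """
    return a / m, m * b + k, m * theta + k
```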

Page 45: Using Item Response Theory to Improve Assessment

Advantages of IRT

Items and people are on the same scale

Easier to interpret, and allows adaptive testing

Page 46: Using Item Response Theory to Improve Assessment

Advantages of IRT

Information provides an important tool for test building and bank development

Better match the purposes of a test

IRT CSEM allows a far better description of precision

Page 47: Using Item Response Theory to Improve Assessment

Advantages of IRT

More precise scores

CTT number-correct scoring is limited to k + 1 scores

3PL has 2^k possible scores

Compare with 10 items:
◦ 11 vs 1024 possible scores

Page 48: Using Item Response Theory to Improve Assessment

Advantages of IRT

Scores take item difficulty into account

Allows direct comparison of examinees that saw different sets of items

Scores also account for guessing

Page 49: Using Item Response Theory to Improve Assessment

Advantages of IRT

Nonlinear IRF – the linear IRF assumed by CTT is impossible

Allows for a different SEM for every examinee

Not realistic to assume they are all the same

Page 50: Using Item Response Theory to Improve Assessment

Disadvantages of IRT

Sample size

CTT: 50 is OK, 100 is great
◦ It is much easier to fit a straight-line “model” than an IRF because it is an oversimplification

IRT: 100 is the bare minimum for the 1PL
◦ 3PL? ~500
◦ Puts it out of reach of small testing programs

Page 51: Using Item Response Theory to Improve Assessment

Disadvantages of IRT

No “native” distractor analysis unless using polytomous models

Can adapt the CTT idea of the quantile/distractor plot with IRT
◦ IRT programs will also give you option P and rpbis

Page 52: Using Item Response Theory to Improve Assessment

Disadvantages of IRT

Complexity
◦ Not only do you have to understand it yourself, but…
◦ You also have to explain it to stakeholders!

Page 53: Using Item Response Theory to Improve Assessment

Disadvantages of IRT

However, note that these are not big problems
◦ Many places have plenty of sample size
◦ You can still use CTT for distractor analysis (always use both!)
◦ The complexity is not too bad unless using complex models
◦ Often, the biggest issue is the stakeholders!

Page 54: Using Item Response Theory to Improve Assessment

IRT Analysis

How do I go about doing this?

Page 55: Using Item Response Theory to Improve Assessment

IRT Analysis

Xcalibre 4 for IRT

CTT analysis with Iteman 4 (not necessary, but sometimes helps)

Also:
◦ Scoring and graphing tool
◦ Form building tool
◦ Empirical IRFs in Excel
◦ Have we covered these sufficiently?

Page 56: Using Item Response Theory to Improve Assessment

IRT Analysis

I’m assuming here we are analyzing just one sample of one test

What would I look for? Basic…
◦ Items with good parameters (keep/clone)
◦ Items with bad parameters (retire)
  Evaluate their CTT option statistics
◦ TIF/CSEM – meet our needs? (not good/bad in absolute sense)

Page 57: Using Item Response Theory to Improve Assessment

IRT Analysis

What would I look for? Advanced…

◦ Dimensionality assessment (reliability, any items/sections “off on their own”)
◦ Item fit (also dimensionality, and possible item issues)
◦ Test sections – any stand out for being hard, easy, low discriminations, poor precision, etc.?
◦ CSEM/TIF for sections: anything under-measured?

Page 58: Using Item Response Theory to Improve Assessment

IRT Analysis

What would I look for? Advanced…

◦ Finally: what do you want to see in the data, and how will the test be used?

Later, we’ll talk about more advanced uses like:

◦ Linking and equating multiple forms

◦ Test assembly

◦ Adaptive testing

◦ Dimensionality evaluation

Page 59: Using Item Response Theory to Improve Assessment

Iteman 4.1

Performs comprehensive classical analysis

Quantile plots allow broad evaluation of IRF shape

Advantages:
◦ Easily understandable – can use with SMEs
◦ Includes distractors

Page 60: Using Item Response Theory to Improve Assessment

Xcalibre 4.1

Provides a comprehensive and user-friendly IRT analysis

Allows evaluation of individual items and the test as a whole

All major graphs

Many summary graphs (freqs, etc.)

Classical analysis too

Page 61: Using Item Response Theory to Improve Assessment

Reasons for Xcalibre 4.1

Currently available software (Parscale, Bilog, Multilog, ConQuest, WinSteps, ICL) still requires programming skills

Some still run on DOS!

If IRT is to be more widely used, it needs a user-friendly system
◦ Input and output

Page 62: Using Item Response Theory to Improve Assessment

Reasons for Xcalibre 4.1

Better input
◦ Yes: point-and-click buttons
◦ No: DOS programming quasi-language

Better output
◦ Yes: Word docs (RTF), spreadsheets (CSV)
◦ No: DOS txt files with ugly tables

Page 63: Using Item Response Theory to Improve Assessment

Reasons for Xcalibre 4.1

Advanced users with programming skills and a need for customized analysis can still utilize previous software

Xcalibre 4.1 is designed for a wider range of users

The following description is of Xcalibre 4, but also applies to Iteman 4

Page 64: Using Item Response Theory to Improve Assessment

Xcalibre 4.1 Interface

Divided into tabs

Move left to right…

Page 65: Using Item Response Theory to Improve Assessment

Xcalibre 4.1 Interface

All options are specified with buttons or simple entry boxes

No code based on keywords
◦ Best example: IRT models (you’ll see)

Also: usable error messages

Page 66: Using Item Response Theory to Improve Assessment

Specify files/input; choose options

I’ll now show how to use X4, and do some analysis of real data…