THE DEVELOPMENT OF COMPUTER BASED TESTING AND COMPUTER ADAPTIVE TESTING IN THE US: HISTORY, CHALLENGES, AND SOLUTIONS

HARIHARAN SWAMINATHAN, UNIVERSITY OF CONNECTICUT

TESTING: A Brief History

1. Testing and humans have had a love-hate relationship since the dawn of history. The first “mastery test” is mentioned in the Old Testament, where it served to “classify” people into two categories: the Ephraimites and the Gileadites.

2. Civil service exams were used in China more than 3000 years ago.

3. In the West, testing has had a mixed history – its use waxed and waned.

4. In the US, Horace Mann argued for written (objective-type) exams, and the first such test was introduced in Boston in 1845.

5. It was used for grade-to-grade promotion.

TESTING: A Brief History

6. This testing practice fell into disrepute because of teaching to the test.

7. Grade promotion based on testing was banned in Chicago in 1881.

8. Binet introduced mental testing in 1901 (his test later became the Stanford-Binet test).

9. The issue of fairness that “everyone should get the same test” was not relevant to him.

10. Binet rank-ordered the items by difficulty and targeted items to the child’s ability.

INDIVIDUALIZED TESTING

• Adaptive testing was born.

• It was the primary mode of testing before the notion of group testing was introduced.

• With the advent of group testing, individualized (adaptive) testing went to the back burner.

• It was impossible to administer to large groups.

• Group-based adaptive testing was not feasible, until...

TAILORED TESTING

• Fred Lord introduced Item Response Theory in the early 1950s and, with it, the notion of Tailored Testing.

• Without computers, tailored testing was not feasible.

• To overcome this problem, Lord developed “FLEXILEVEL TESTING”.

• A flexilevel test follows Binet’s idea: only the difficulty level of the item is used in routing.
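
The routing idea above can be sketched in a few lines of Python. This is a minimal illustration, not Lord's published procedure in full: it assumes an odd number of items, a test length of (n + 1) / 2, and a hypothetical `answer` callback standing in for the examinee. The function name and the plain number-correct count are illustrative choices.

```python
def flexilevel(difficulties, answer):
    """Administer a flexilevel-style test over an odd number of items.

    difficulties: list of item difficulties (any numeric scale).
    answer: callable taking an item index and returning True if the
            examinee answers that item correctly (hypothetical interface).
    Returns (administered, n_correct), where administered is a list of
    (item_index, correct) pairs in order of administration.
    """
    n = len(difficulties)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of items")
    # Rank items from easiest to hardest; routing uses only this ranking.
    order = sorted(range(n), key=lambda i: difficulties[i])
    middle = n // 2
    next_harder = middle + 1   # rank of the next unused harder item
    next_easier = middle - 1   # rank of the next unused easier item
    rank = middle              # start at the item of median difficulty
    administered = []
    n_correct = 0
    for _ in range((n + 1) // 2):   # every examinee takes (n + 1) / 2 items
        item = order[rank]
        correct = bool(answer(item))
        administered.append((item, correct))
        n_correct += correct
        if correct:
            # Correct answer: move to the next harder unused item.
            rank, next_harder = next_harder, next_harder + 1
        else:
            # Incorrect answer: move to the next easier unused item.
            rank, next_easier = next_easier, next_easier - 1
    return administered, n_correct


# Usage: a simulated examinee who answers correctly when difficulty <= 0.5.
difficulties = [-3, -2, -1, 0, 1, 2, 3]
path, score = flexilevel(difficulties, lambda i: difficulties[i] <= 0.5)
print(path, score)
```

Because the next item depends only on the last response and the difficulty ranking, the same routing could be followed by hand on a printed form.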

FLEXILEVEL TESTING

• Flexilevel testing may even be administered as a paper-and-pencil test, as it was initially intended.

• The scoring algorithm is simple enough that a flexilevel test is self-scoring.

• Its simplicity was equated with a lack of glamor, and it was regarded as merely an approximation to CAT.

FLEXILEVEL TESTING

• Flexilevel testing languished, unwanted and ignored by the methodology-addicted psychometric researchers.

• It is making a comeback in non-high-stakes evaluation, medicine, and allied health, where a full-blown CAT is not required or not feasible.

• It has the potential for being used innovatively for classroom assessment and diagnostic purposes.

CAT

• Meanwhile, important technical advances were being made in CAT research.

• The Office of Naval Research, the Army, and the Air Force funded research for advancing CAT during the 70s.

• The name “Computerized Adaptive Test” was coined by David Weiss

CAT

• David Weiss and his team at the University of Minnesota were funded to develop operational procedures for implementing CAT.

• I was funded to develop Bayesian estimation procedures so that item parameters could be estimated more accurately.

• All these activities were motivated by the large volume of test takers in the armed forces.

CAT on a Hot Tin Roof: Operationalizing CAT

• I was on the Board of Directors of GRE in the mid 80s.

• The Board authorized and funded research for implementing GRE-CAT

• GRE-CAT became operational in the early 1990s. GMAT followed suit soon after.

Theory vs. Practice

• The first clash between theory and practice occurred in the implementation of CAT.

• As PCs were not common, GRE had to contract with a delivery system provider

• “Seat-time” was the major obstacle.

• In theory, testing should continue until the stopping criterion, a prescribed standard error, is reached. Instead, the testing time and the test length were fixed.

• Examinees had to complete 80% of items
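
The two stopping rules contrasted above can be sketched as follows. This is an illustration only: it assumes a two-parameter logistic model, approximates the standard error by the inverse square root of the test information at the current ability estimate, and uses hypothetical names (`se_target`, `max_items`) that do not come from any operational program.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def standard_error(theta, administered):
    """Approximate SE of the ability estimate: 1 / sqrt(test information)."""
    info = sum(item_information(theta, a, b) for a, b in administered)
    return float("inf") if info == 0 else 1.0 / math.sqrt(info)


def should_stop(theta, administered, se_target=None, max_items=None):
    """Stop when the SE target is met (variable length) or the cap is hit (fixed length)."""
    if se_target is not None and standard_error(theta, administered) <= se_target:
        return True
    return max_items is not None and len(administered) >= max_items


# Usage: 16 items with a = 1 and b equal to the current estimate give SE = 0.5.
items = [(1.0, 0.0)] * 16
print(standard_error(0.0, items), should_stop(0.0, items, se_target=0.6))
```

Passing only `max_items` reproduces the fixed-length compromise described above; passing only `se_target` gives the variable-length rule that theory prefers.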

Flexilevel Test and CAT

• A flexilevel test is a CAT, albeit with one foot (the one-parameter model).

• It DOES need a good item bank

• It is an approximation to a full blown CAT.

• It requires many more items than a full-fledged CAT to obtain the same level of precision.

• Nevertheless, with care, a flexilevel test can be made to function effectively.

Issues

• Item Bank: A large item bank is needed and must be maintained well. In developing item banks, items from paper-and-pencil administrations should not be used without careful investigation.

• Exposure Control: In high stakes testing, exposure control is critical

• Content Specification and Balancing: This is a critical issue and must be addressed early on in the development of item bank and item selection criteria

Issues (cont’d)

• CAT Algorithm: Unchecked, a CAT algorithm greedily chooses the items that provide the most information. Algorithms for selecting items with content balancing must be in place.

• Item Parameter Shift: Over time, item parameter values will change because of instruction, targeted instruction, and exposure of items. Item parameters must be re-estimated, and items that show large drifts must be eliminated. Procedures for detecting cheating in CAT have been developed and are useful here.
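
One simple way to temper the greedy selection described above is to pick the content area that is furthest below its target share first, and only then take the most informative unused item within that area. The sketch below assumes a two-parameter logistic model and a hypothetical bank layout (`item_id -> (a, b, area)`); it is one possible scheme, not the algorithm of any particular testing program.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta (assumed model)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_item(theta, bank, used, targets):
    """Pick the next item with content balancing.

    bank: dict item_id -> (a, b, area)    (hypothetical layout)
    used: list of item_ids already administered
    targets: dict area -> target proportion of the test
    Returns an item_id, or None if no eligible item remains.
    """
    counts = {area: 0 for area in targets}
    for item_id in used:
        area = bank[item_id][2]
        counts[area] = counts.get(area, 0) + 1
    n = max(len(used), 1)
    # Content areas that still have unused items.
    open_areas = {area for iid, (_, _, area) in bank.items() if iid not in used}
    candidates = [area for area in targets if area in open_areas]
    if not candidates:
        return None
    # Area with the largest shortfall relative to its target proportion.
    area = max(candidates, key=lambda c: targets[c] - counts[c] / n)
    pool = [iid for iid, (_, _, ar) in bank.items() if ar == area and iid not in used]
    # Within that area, fall back to the greedy maximum-information rule.
    return max(pool, key=lambda iid: item_information(theta, bank[iid][0], bank[iid][1]))


# Usage: after one algebra item, geometry is chosen even though "alg2" is more informative.
bank = {"alg1": (2.0, 0.0, "algebra"), "alg2": (1.0, 0.0, "algebra"), "geo1": (0.5, 0.0, "geometry")}
targets = {"algebra": 0.5, "geometry": 0.5}
print(select_item(0.0, bank, ["alg1"], targets))
```

Exposure control is not handled here; it would typically add a further filter or randomization step before the final choice.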

Issues (cont’d)

• Item Bias (Differential Item Functioning): Performance of subgroups on items must be examined to determine whether subgroups are performing differentially on items. This is part of validity analysis and must be carried out not only in the development of the item bank but also during operational administrations.
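
One widely used screening statistic for differential item functioning is the Mantel-Haenszel common odds ratio, computed across strata of examinees matched on total score. The slides do not say which procedure was used, so the sketch below is only an illustration of the general idea; the function names and the input layout are assumptions.

```python
import math


def mantel_haenszel_alpha(strata):
    """Mantel-Haenszel common odds ratio for one item.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
            counts, one tuple per matched score level.
    A value near 1 suggests no DIF; this sketch does not guard against a zero
    denominator, which occurs if no stratum has both a reference-group error
    and a focal-group success.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def mh_d_dif(strata):
    """ETS delta-scale transformation: -2.35 * ln(alpha). Negative values favor the reference group."""
    return -2.35 * math.log(mantel_haenszel_alpha(strata))


# Usage: within each score level the two groups have identical odds, so alpha = 1.
no_dif = [(30, 10, 15, 5), (20, 20, 10, 10)]
print(mantel_haenszel_alpha(no_dif), mh_d_dif(no_dif))
```

In an operational CAT the matching variable is usually the ability estimate rather than a raw total score, since examinees see different items.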

MULTISTAGE TESTING

• Although CAT is efficient, constraints on content balancing in item selection may pose insurmountable problems.

• In these cases, multistage testing is a viable option, and it is in use in some large-scale testing programs.

• Instead of one item at a time, mini-tests (testlets) at varying levels of difficulty are administered in stages.
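
The staged routing described above can be sketched as follows. The design (one routing testlet followed by easy, medium, and hard testlets at each later stage), the cumulative number-correct routing rule, and the cut scores are all hypothetical choices made for illustration.

```python
def route(n_correct, cuts, levels=("easy", "medium", "hard")):
    """Map a cumulative number-correct score to a difficulty level.

    cuts: (low, high); scores below low go to the easy testlet,
          scores below high go to the medium testlet, the rest to hard.
    """
    low, high = cuts
    if n_correct < low:
        return levels[0]
    if n_correct < high:
        return levels[1]
    return levels[2]


def administer_mst(stage_cuts, score_testlet):
    """Run a multistage test.

    stage_cuts: list of (low, high) cut pairs, one per routing decision.
    score_testlet: callable taking a testlet name and returning the number
                   correct on it (hypothetical stand-in for administration).
    Returns (path, total_correct).
    """
    path = ["routing"]
    total = score_testlet("routing")
    for stage, cuts in enumerate(stage_cuts, start=2):
        testlet = f"stage{stage}-{route(total, cuts)}"
        path.append(testlet)
        total += score_testlet(testlet)
    return path, total


# Usage: a three-stage test with made-up cut scores and made-up testlet scores.
scores = {"routing": 7, "stage2-hard": 3, "stage3-medium": 5}
print(administer_mst([(4, 7), (8, 12)], scores.__getitem__))
```

Because each testlet is assembled in advance, its content coverage can be checked before administration, which is harder to guarantee when items are selected one at a time.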

MULTISTAGE TESTING

• Content balancing is achieved elegantly.

• Each testlet has a sufficient number of items for estimation of proficiency.

• It performs almost as well as CAT.

• We evaluated several designs for the administration of a Russian language test in the US and recommended a three-stage testing scheme.

• Multistage testing has the potential for national assessments.

Growth Assessment: Vertical Scale

• Growth assessment of individuals has been mandated by states as well as the federal government.

• To develop a vertical scale, items have to be administered according to the following scheme (as implemented in Connecticut).

• Through this design all items across grades are linked

Test Administration Design

THETA DISTRIBUTION FOR MATHEMATICS

Growth Assessment

• In growth assessment, we need the growth rates of individuals as well as subgroups.

• Scores over time are nested within individuals who are in turn nested within classrooms, schools, and districts.

• The statistical models must take this nesting into account. The process is complex but can be done.

• Use of growth for teacher evaluation
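
The nesting described above is commonly expressed as a multilevel linear growth model. The three-level form below is a generic textbook sketch, not necessarily the model used in the work described here; the symbols are introduced only for illustration, with t indexing occasions, i students, and j classrooms (further levels for schools and districts would be added in the same way).

```latex
\begin{align*}
\text{Level 1 (occasions):}\quad  & Y_{tij} = \pi_{0ij} + \pi_{1ij}\,\mathrm{time}_{t} + e_{tij} \\
\text{Level 2 (students):}\quad   & \pi_{0ij} = \beta_{00j} + r_{0ij}, \qquad \pi_{1ij} = \beta_{10j} + r_{1ij} \\
\text{Level 3 (classrooms):}\quad & \beta_{00j} = \gamma_{000} + u_{00j}, \qquad \beta_{10j} = \gamma_{100} + u_{10j}
\end{align*}
```

Here \pi_{1ij} is the growth rate of student i in classroom j, \beta_{10j} is the average growth rate in classroom j, and \gamma_{100} is the overall average growth rate; the residual terms e, r, and u capture variation at each level of the nesting.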

National Assessments

• Growth assessment of individuals is not important; we need the characteristics of subpopulations.

• Proper coverage of the content domain is critical. Matrix sampling of items is necessary.

• CAT is being considered by NAEP; a multistage approach may be better for ensuring content coverage.

Computer Based Testing

• CBT was developed as part of the Computer Assisted Instruction movement in the mid-1960s by Patrick Suppes.

• It is a linear test, like the paper-and-pencil (P&P) test.

• P&P test items and CBT items are not equivalent. An easy P&P item may become difficult in CBT, and vice versa.

• Our study in Connecticut showed the items behaved differently in the two modes

Computer Based Testing

• CBT has the advantage of allowing innovative item types.

• The science test in Connecticut is being developed as a CBT.

• CBT has been used innovatively by the NBME in testing.

• ePIRLS is using a CBT approach; PISA may become a CBT soon.

• Standard procedures for scoring (automated) and item analysis are usable

New Research on CAT and CBT

• Use of polytomous items

• Use of free response items – automated scoring of items

• Multidimensional item response models for vertical scaling

• Item generation: Item cloning

• Classification rather than estimation. Item selection is based on measures of information (Shannon, Kullback).
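
For classification testing, one information measure of the Kullback type compares an item's response distribution at two ability values placed on either side of the cut score. The sketch below assumes a two-parameter logistic model and a hypothetical half-width `delta` around the cut; the function names and bank layout are illustrative.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def kl_information(theta_hi, theta_lo, a, b):
    """Kullback-Leibler divergence between the item's response
    distributions at theta_hi and theta_lo."""
    p1 = p_correct(theta_hi, a, b)
    p0 = p_correct(theta_lo, a, b)
    return p1 * math.log(p1 / p0) + (1.0 - p1) * math.log((1.0 - p1) / (1.0 - p0))


def select_for_classification(cut, delta, bank, used):
    """Choose the unused item that best separates cut + delta from cut - delta.

    bank: dict item_id -> (a, b)   (hypothetical layout)
    """
    pool = [iid for iid in bank if iid not in used]
    if not pool:
        return None
    return max(pool, key=lambda iid: kl_information(cut + delta, cut - delta, *bank[iid]))


# Usage: an item located at the cut score separates the two groups better than a distant one.
bank = {"near": (1.0, 0.0), "far": (1.0, 3.0)}
print(select_for_classification(0.0, 0.5, bank, used=[]))
```

Unlike maximum-information selection at the current ability estimate, this rule concentrates items around the cut score, which is where a pass/fail decision is made.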

The Politics of testing

• Education and testing are closely related, and both have occupied center stage in politics.

• Politicians in the US have smelled CAT in the water and are circling to take a bite

• There have been debates about CAT item administration and special interest groups have weighed in for and against CAT item selection algorithms.

• The issue of item release is a problem for CAT banks.

Politics of testing

• Transparency and honesty are critical for convincing the public, who may not understand the mathematics involved.

• Testing must be above reproach.

• As Caesar said in divorcing Pompeia, it is not enough to be beyond reproach; you must also GIVE THE APPEARANCE OF BEING BEYOND REPROACH.

Conclusion

• CAT, Multistage Testing, and Computer Based Testing are playing major roles in statewide and national assessments.

• These assessments are designed for assessing student growth at the individual as well as the group level.

• We have solutions or near solutions for most of the issues that face us in the implementation of these testing designs, and the research continues.