
DOCUMENT RESUME

ED 353 302                                   TM 019 366

AUTHOR        Mislevy, Robert J.
TITLE         Linking Educational Assessments: Concepts, Issues, Methods, and Prospects.
INSTITUTION   Educational Testing Service, Princeton, NJ. Policy Information Center.
SPONS AGENCY  Office of Educational Research and Improvement (ED), Washington, DC.
PUB DATE      Dec 92
CONTRACT      R117G10027
NOTE          97p.; Foreword by Robert L. Linn.
PUB TYPE      Reports - Evaluative/Feasibility (142)
EDRS PRICE    MF01/PC04 Plus Postage.
DESCRIPTORS   Academic Achievement; Competence; *Educational Assessment; Elementary Secondary Education; Equated Scores; Evaluation Utilization; *Measurement Techniques; Projective Measures; Statistical Analysis; Student Evaluation
IDENTIFIERS   Calibration; *Linking Educational Assessments; Statistical Moderation

ABSTRACT
This paper describes the basic concepts of linking educational assessments. Although it discusses statistical machinery for interpreting evidence about students' achievements, this paper's principal message is that linking tests is not just a technical problem. Technical questions cannot be asked or answered until questions about the nature of learning and achievement and about the purposes and consequences of assessment are addressed. Some fundamental ideas about educational assessment and test theory (chains of inferences and roles of judgment) are considered. The following approaches to linking assessments are described: (1) equating in physical measurement and educational assessment; (2) calibration in physical measurement and educational assessment; (3) projection in physical measurement and educational assessment; and (4) statistical moderation and social moderation. Implications for a system of monitoring progress toward educational standards are discussed. The following obtainable goals are highlighted: (1) comparing levels of performance across clusters directly in terms of common indicators of performance on a selected sample of consensually defined tasks administered under standard conditions; (2) estimating levels of performance of groups or individuals within clusters at the levels of accuracy demanded by purposes within clusters; (3) comparing levels of performance across clusters in terms of performance ratings on customized assessments in terms of a consensually defined, more abstract description of developing competence; and (4) making projections about how students from one cluster might have performed on the assessment of another cluster. Seven tables, 14 figures, and 41 references are included. (RLC)

Reproductions supplied by EDRS are the best that can be made from the original document.


POLICY INFORMATION CENTER

Educational Testing Service


Linking Educational Assessments: Concepts, Issues, Methods, and Prospects


Linking Educational Assessments: Concepts, Issues, Methods, and Prospects

By

Robert J. Mislevy
Educational Testing Service

With a Foreword by

Robert L. Linn
Center for Research on Evaluation, Standards, and Student Testing
University of Colorado at Boulder

December 1992

ETS Policy Information Center

Princeton, NJ 08541


Copyright © 1992 by Educational Testing Service. All rights reserved.


Table of Contents

                                                                 Page

Preface                                                             i
Acknowledgments                                                    ii
Foreword                                                          iii
Introduction                                                        1
Educational Assessments                                             2
    Purposes of Assessment                                          4
    Conceptions of Competence                                       6
    Assessments as Operational Definitions                          7
    Standardization                                                12
Test Theory                                                        13
    Chains of Inferences                                           16
    The Roles of Judgment                                          19
Linking Tests: An Overview                                         21
    A Note on Linking Studies                                      26
Equating                                                           26
    Equating in Physical Measurement                               28
    Equating in Educational Assessment                             32
    Comments on Equating                                           37
Calibration                                                        37
    Calibration in Physical Measurement                            38
    Calibration in Educational Assessment, Case 1:
        Long and Short Tests Built to the Same Specifications      42
    Calibration in Educational Assessment, Case 2:
        Item Response Theory for Patterns of Behavior              44
    Calibration in Educational Assessment, Case 3:
        Item Response Theory for Judgments on an
        Abstractly Defined Proficiency                             50
    Comments on Calibration                                        53
Projection                                                         54
    Projection in Physical Measurement                             56
    Projection in Educational Assessment                           59
    Another Type of Data for Projection Linking                    61
    Comments on Projection                                         62
Moderation                                                         64
    Statistical Moderation                                         64
    Social Moderation                                              68
    Comments on Moderation                                         72
Conclusion                                                         72
References                                                         77


List of Tables and Figures

Tables                                                           Page

1. Description of Assessment Purposes                              5
2. ACTFL Proficiency Guidelines for Reading                        8
3. Methods of Linking Educational Assessments                     22
4. Temperature Measures for 60 Cities                             30
5. Correspondence Table for Thermometer X and Thermometer Y       34
6. Cross-Tabulation of Oregon Ideas Scores and Arizona Content
   Scores for the Oregon Middle-School Papers                     63
7. Decathlon Results from the 1932 Olympics                       71

Figures

1. Hierarchy of Educational Outcomes                              10
2. Chains of Inference from Test X and Test Y to Two Sets of
   Specific Curricular Objectives                                 18
3. Equating Linkages                                              27
4. Distributions of True Temperatures and Temperature Readings    29
5. Sixty Cities                                                   33
6. Unadjusted Thermometer Y Readings Plotted Against
   Thermometer X Readings                                         33
7. Cricket Temperature Readings Plotted Against True Temperatures 39
8. Distributions of True Temperatures and Selected Estimates      40
9. Calibration in Terms of Behavior                               45
10. Calibration in Terms of Judgment at a Higher Level of
    Abstraction                                                   51
11. Empirical Linkages for Projection                             55
12. True Temperatures Plotted Against Latitude                    58
13. True Temperatures Plotted Against Latitude,
    with Regions Distinguished                                    58
14. Moderated Distributions of Goals Scored and Batting Averages  67


Preface

As this is written, a fairly wide consensus has developed around the proposition that we need to establish national education standards and a system of assessments to measure whether the standards are being achieved. Rejecting a single national examination, the call is for a voluntary system of assessments, all geared to national standards, in which states, groups of states, or other groups design their own assessments. According to the report of the National Council on Education Standards and Testing, the key features of assessments "would be alignment with high national standards and the capacity to produce useful, comparable results."

We focus here on the word "comparable." Irrespective of which assessment is given, there is a desire to be able to translate the results so that students across the country can be compared on their achievement of the national standards. Indeed, this has become a key to developing consensus. To those who fear that national standards and an assessment system will lead to a national curriculum, the response is that the standards are just "frameworks" within which curriculum can be determined locally, and assessments can reflect this curriculum. Techniques would be used, such as "calibration," to make the results comparable. At a meeting of the National Education Goals Panel, a governor said something to the effect that we can have as many tests as we want; all we have to do is calibrate them. How far can local curriculum stray from the national standards and differ from one locality to another, and how different can assessments be in how and what they test and still enable comparable scores to be constructed?

The ETS Policy Information Center commissioned Robert J. Mislevy to give guidance to the policy and education community on this question. While his report establishes the range of freedom, which may not be as wide as many have assumed or hoped, it also tries to give guidance on how comparability can be achieved. It is critical, I believe, to know this in advance and design the system in a way that comparability is possible. There are no neat technical tricks by which the results of just any old assessments can be compared.

We asked Mislevy to write in as nontechnical a manner as possible so that a person with a need to know could understand. However, it is not a simple matter. The conclusion, on page 72, will give the less motivated reader the bottom line on what Mislevy thinks possible, along with a summary of approaches to linking tests.

We are grateful to the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) for joining in the funding of this work, and we are grateful for the willingness of Robert Linn to write the foreword; his own work has already contributed much to the understanding of both limitations and possibilities.

Paul E. Barton, Director
Policy Information Center


Acknowledgments

This monograph evolved from an internal memorandum on linking assessments by Martha Stocking and me. The work received support from the Policy Information Center and the Statistical and Psychometric Research Division of Educational Testing Service and the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Educational Research and Development Program, cooperative agreement number R117G10027 and CFDA catalog number 84.117G, as administered by the Office of Educational Research and Improvement, U.S. Department of Education.

I am grateful to many colleagues for discussions on the topics addressed in this paper and comments on earlier drafts, among them Nancy Allen, Paul Barton, Al Beaton, Charles Davis, John Frederiksen, Gene Johnson, Charlie Lewis, Bob Linn, Skip Livingston, John Mazzeo, Sam Messick, Carol Myford, Mimi Reed, Karen Sheingold, Martha Stocking, Kentaro Yamamoto, Howard Wainer, Michael Zieky, and Rebecca Zwick. The views I express do not necessarily represent theirs. Joanne Pfleiderer and Richard J. Coley provided editorial assistance, Jim Wert designed the cover, and Carla Cooper was responsible for desktop publishing services.

Robert J. Mislevy


Foreword

As is attested to by the abundance of concordance tables that have been developed to provide conversions of composite scores of the American College Testing (ACT) Program Assessment to the scale of the Scholastic Aptitude Test (SAT) or vice versa, the interest in linking results from different assessments is not new. Some of the advocated changes in the nature and uses of educational assessments during the past few years, however, have greatly expanded interest in issues of linking results of different assessments. The January 1992 report of the National Council on Education Standards and Testing* (NCEST) clearly illustrates the need for the type of careful attention to issues of linking that is found in Bob Mislevy's report.

The NCEST report recommended the development of national "content standards" and "student performance standards" that would define the skills and understandings that need to be taught and the levels of competence that students need to achieve. The Council also concluded that a system of assessments was needed to make the standards meaningful. Two features of the system of assessments envisioned by the Council are particularly relevant to the issues addressed by Mislevy. First, the Council recommended that the national system of assessments involve multiple assessments rather than a single test. As we stated in the report, states or groups of states would be expected "to adopt assessments linked to national standards. States can design the assessments or they may acquire them" (NCEST, 1992, p. 30). California might develop its own assessments, Arizona might contract with a test publisher to develop its own assessments, and a group of New England states might join forces to develop a common set of assessments for use in those states. But it is hoped that they would all be linked to the national standards.

Second, as is implicit in the desired linking to national standards, the Council concluded that "it is essential that different assessments produce comparable results in the attainment of the standards" (NCEST, 1992, p. 30). That is, it is expected that the performance of students who respond to one set of assessment tasks in Florida and that of students who respond to a completely different set of tasks in Illinois can nonetheless be compared in terms of common national standards. It is expected that a "pass" or "high pass" will have common meaning and value in terms of the national standards despite the use of different assessments in different locales.

Another example of the expanded desire for linking comes from states that want to express state assessment results in terms of the scales used by the National Assessment of Educational Progress (NAEP). The state assessment may differ from NAEP in format and in its content specifications, but there is still a desire to estimate the percentage of students in the state who would score above a given level on NAEP based on their performance on the state assessment.

* The National Council on Education Standards and Testing. (1992). Raising standards for American education. Washington, DC: Author.


The basis for linking assessments that might be developed according to the NCEST plan or for a state to link its results to NAEP is quite different from that undergirding the equating of alternate forms of a test developed using common test specifications by a single test publisher. Yet, some of the discussion of the desired linkings blurs the distinctions with the use of a wide variety of terminology, some of which is undefined in a technical sense and some of which is used in ways quite inconsistent with well-established technical meanings. Confusion has resulted from the use of terms such as equating, calibration, benchmarking, anchoring, moderation, verification, and prediction with little regard for the implied technical requirements or for specific types of comparisons that are justified.

Bob Mislevy's paper provides much needed clarifications of the terminology. More importantly, it provides a lucid explication of the concepts of equating, calibration, projection, and statistical moderation along with concrete illustrations of the important distinguishing characteristics of these concepts and the associated statistical methods. By using physical examples such as various measures of temperature, Mislevy clearly illustrates that the issues involved in justifying various types of comparisons are not unique to the peculiarities of assessments of student knowledge, skills, and understandings. Rather, the requirements for the simpler and more rigorous forms of linking result "not from the statistical procedures used to map the correspondence, but from the way the assessments are constructed" (page 75).

Mislevy's paper should prove to be of great value in clarifying the discussion of linking issues. Although not everyone will like the fact that some desired types of correspondence for substantially different assessments are simply impossible, the paper points the direction to "attaining less ambitious, but more realistic, goals" (page 73) for linking different assessments. The report provides a much needed foundation that should help the field achieve greater specificity about attainable linking goals and lead to the use of techniques suitable to support more realistic interpretations of results from different assessments.

Robert Linn
Center for Research on Evaluation, Standards, and Student Testing
University of Colorado at Boulder


Introduction

Can We Measure Progress Toward National Goals if Different Students Take Different Tests?

The problem of how to link the results of different tests is a century old, and as urgent as today's headlines. Charles Spearman first addressed aspects of this problem in his 1904 paper, "The proof and measurement of association between two things."

This monograph was inspired by the President's and governors' recent joint statement on National Goals for Education:

The time has come for the first time in United States history to establish clear national performance goals, goals that will make us internationally competitive.

National educational goals will be meaningless unless progress toward meeting them is measured accurately and adequately, and reported to the American people.

National goals for education, U.S. Department of Education, 1990

The academic performance of elementary and secondary students will increase significantly in every quartile, and the distribution of minority students will more closely reflect the student population as a whole.

The percentage of students who demonstrate the ability to reason, solve problems, apply knowledge, and write and communicate effectively will increase substantially.

Two objectives for Goal #3, student achievement and citizenship, in National goals for education, p. 5

The specific language of these objectives seems to demand a common yardstick for measuring student achievement across the nation. But recognizing that different schools and communities may well want to assess different competencies in different ways,

...the National Educational Goals Panel, the National Council on Education Standards and Testing, and the New Standards Project ... all discourage the use of a single national exam... They advocate that ... districts or states wishing to participate in the new system should form "clusters" agreeing to develop common exams linked to the national standards. Performances on the assessments would then be "calibrated" against the national standards to yield comparable data for different exams used by different clusters. ... Proponents of a new national system of exams ... say that the "cluster" model of calibrating the results of different tests to common standards will avoid the problem of dictating to states and local districts what should be taught.

ASCD Update, 1991, 33(8), pp. 1-6

Can technical "calibration procedures" use results from different assessments to gauge common standards? Professor Andrew Porter is skeptical:

If this practice of separate assessments continues, can the results be somehow equated so that results on one can also be stated in terms of results on the other? There are those who place great faith in the ability of statisticians to equate tests, but that faith is largely unjustified. Equating can be done only when tests measure the same thing.

Andrew Porter, 1991, p. 35

This paper describes the basic concepts of linking educational assessments. Although it discusses statistical machinery for interpreting evidence about students' achievements, the principal message is that linking tests is not just a technical problem. Technical questions cannot be asked, much less answered, until questions about the nature of learning and achievement and about the purposes and consequences of assessment are dealt with. We begin by discussing some fundamental ideas about educational assessment and test theory. We then describe and illustrate approaches to linking assessments. Finally, we consider implications for a system of monitoring progress toward educational standards.

Educational Assessments

I use the term "educational assessment" to mean systematic ways of gathering and summarizing evidence about student competencies. It includes, but is not limited to, familiar standardized tests in which large numbers of students answer the same kinds of prespecified questions under the same conditions, such as the Scholastic Aptitude Test (SAT), the National Assessment of Educational Progress (NAEP), and the written part of driver's license exams. I also include such activities as doctoral dissertations and the portfolios students create for the College Board's Advanced Placement (AP) Studio Art Examination. These latter kinds of assessments require concentrated effort over an extended period of time on challenges determined to a large degree by the student. Summarizing evidence on these assessments requires judgment.

Assessments provide data such as written essays, correct and incorrect marks on an answer sheet, and students' explanations of the rationales for their problem solutions. This data becomes evidence, however, only with respect to conjectures about students' competence,1 perhaps concerning individual students, the group as a whole or particular subgroups, or even predictions about future performance or the outcomes of instruction. Our purpose for an assessment and our conception of the nature of competence drive the form an assessment takes. An assessment that produces solid evidence at reasonable costs for one mission can prove unreliable, exorbitant, or simply irrelevant for another.

Suppose that Assessment X provides evidence about instructional or policy questions involving student competencies. In particular, we might want to know whether individuals can perform at particular levels of competence, i.e., whether they are achieving specified educational standards. Suppose that with appropriate statistical tools, we could provide answers, each properly qualified by an indication of the strength of evidence.

Let's also suppose that someone wants to link Assessment Y to Assessment X. This notion stems from a desire to address the same questions posed in terms of Assessment X when we observe students' performances on Assessment Y rather than on X. The degree to which linking can succeed, and the nature of the machinery required to carry it out, depend on the matchups between the purposes for which the assessments were constructed and the aspects of competence they were designed to reveal.

1 Schum (1987, p. 16) emphasizes the distinction between data and evidence in the context of intelligence analysis. His book provides a wide variety of examples that offer insights into inferential tasks in educational assessment.


Purposes of Assessment

A wide variety of purposes can motivate educational assessment, and we make no effort to survey them here. Table 1 gives the interested reader a feel for some of the ways assessment purposes might be classified. What's important is that different kinds and amounts of evidence must be gathered to suit different purposes. Certain distinctions among purposes prove especially pertinent to our discussion about linking:

Will important decisions be based on the results; that is, is it a "high-stakes" assessment? A quiz to help students decide what to work on during a particular day is a low-stakes assessment; a poor choice is easily remedied. An assessment to determine whether students should graduate from eighth grade is high stakes at the individual level. One designed to decide how to distribute state funds among schools is high stakes at the school level. Assessments that support important decisions must provide commensurately dependable evidence, just as a criminal conviction demands proof "beyond a reasonable doubt."

Do inferences concern a student's comparative standing in a group of students, that is, are they "norm-referenced"? Or do they gauge the student's competencies in terms of particular levels of skills or performance, that is, are they "criterion-referenced"? A norm-referenced test assembles tasks to focus evidence for questions such as "Is Eiji more or less skilled than Sung-Ho?" A criterion-referenced test in the same subject area might include similar tasks, but they would be selected to focus evidence for questions such as "What levels of skills has Eiji attained?"

Do inferences concern the competencies of individual students, as with medical certification examinations, or the distributions of competencies in groups of students, as with NAEP? When the focus is on the individual, enough evidence must be gathered on each student to support inferences about him or her specifically. On the other hand, a bit of information about each student in a sample, too little to say much about any of them as an individual, can suffice in the aggregate to monitor the level of performance in a school or a state.


Table 1
Description of Assessment Purposes*

Type of inference desired: Description of individual examinees' attainments
    Curricular domain, before instruction: Placement
    Curricular domain, during instruction: Diagnosis
    Curricular domain, after instruction: Grading
    Cognitive domain: Reporting to students and/or parents
    Future performance in criterion setting: Guidance and counseling

Type of inference desired: Mastery decision (individual examinee above or below cutoff)
    Curricular domain, before instruction: Selection
    Curricular domain, during instruction: Instructional guidance
    Curricular domain, after instruction: Promotion
    Cognitive domain: Certification; Licensing
    Future performance in criterion setting: Selection; Admission; Licensing

Type of inference desired: Description of performance for a group or system
    Curricular domain, before instruction: Preinstruction status for research or evaluation
    Curricular domain, during instruction: Process and curriculum evaluation
    Curricular domain, after instruction: Postinstruction status for research or evaluation; Accountability reporting at the level of, e.g., schools or instructional programs
    Cognitive domain: Construct measurement for research or evaluation; Accountability reporting at state or national levels
    Future performance in criterion setting: Research and planning, e.g., determining cutoff levels for mastery decisions, evaluating the effectiveness of instructional programs

* Adapted from Millman & Greene's Table 8.1 (1989)


Conceptions of Competence

The role of psychological perspectives on competence in educational assessment can be illustrated by two contrasting quotations. The first reflects the behaviorist tradition in psychology:

The educational process consists of providing a series of environments that permit the student to learn new behaviors or modify or eliminate existing behaviors and to practice these behaviors to the point that he displays them at some reasonably satisfactory level of competence and regularity under appropriate circumstances. The statement of objectives becomes the description of behaviors that the student is expected to display with some regularity. The evaluation of the success of instruction and of the student's learning becomes a matter of placing the student in a sample of situations in which the different learned behaviors may appropriately occur and noting the frequency and accuracy with which they do occur.

D.R. Krathwohl & D.A. Payne, 1971, pp. 17-18

From the behaviorist perspective, the specifications for an assessment describe task contexts from the assessor's point of view and provide a system for classifying student responses. Responses in some contexts are unambiguously right or wrong; in other contexts, the occurrence of certain types of behaviors, the classification of which may require expert judgment, is recorded.

The second quotation reflects what has come to be called a cognitive perspective:

Essential characteristics of proficient performance have been described in various domains and provide useful indices for assessment. We know that, at specific stages of learning, there exist different integrations of knowledge, different forms of skill, differences in access to knowledge, and differences in the efficiency of performance. These stages can define criteria for test design. We can now propose a set of candidate dimensions along which subject-matter competence can be assessed. As competence in a subject matter grows, evidence of a knowledge base that is increasingly coherent, principled, useful, and goal-oriented is displayed, and test items can be designed to capture such evidence. [emphasis original]

R. Glaser, 1991, p. 26


From the cognitive perspective, the specifications for an assessment describe contexts that can evoke evidence about students' competence as conceived at a higher level of abstraction, and provide judgmental guidelines for mapping from observed behavior to this inferred proficiency. Table 2, the American Council on the Teaching of Foreign Languages (ACTFL) guidelines for assessing reading proficiency, illustrates the point. The ACTFL guidelines contrast Intermediate readers' performance using texts "about which the reader has personal interest or knowledge" with Advanced readers' comprehension of "texts which treat unfamiliar topics and situations." This distinction is fundamental to the underlying conception of developing language proficiency, but a situation that is familiar to one student can obviously be unfamiliar to others. The evidence conveyed by the same behavior in the same situation can differ radically for different students and alter what we infer about their capabilities from their behavior.

Assessments as Operational Definitions

Educators can agree unanimously that we need to help students "improve their math skills" but disagree vehemently about how to appraise these skills. Their conceptions of mathematical skills often diverge as they move from generalities to the classroom because they employ the language and concepts of differing perspectives on how mathematics is taught and learned as well as what topics and skills are important. Disparate assessments all provide evidence about students' competence, but each takes a particular point of view about what competence is and how it develops.

Figure 1 is a hypothetical illustration of the levels that could lie between common, broad perceptions of educational goals and a variety of assessments.2 Higher in the scheme are generally-stated objectives, such as:

The percentage of students who demonstrate the ability to reason, solve problems, apply knowledge, and write and communicate effectively will increase substantially.

National goals for education, U.S. Department of Education, 1990

2 Suggested by Payne's discussion of levels of specificity in educational outcomes (1992, p. 64).


Table 2
ACTFL Proficiency Guidelines for Reading*

Novice-Low: Able occasionally to identify isolated words and/or major phrases when strongly supported by context.

Novice-Mid: Able to recognize the symbols of an alphabetic and/or syllabic writing system and/or a limited number of characters in a system that uses characters. The reader can identify an increasing number of highly contextualized words and/or phrases including cognates and borrowed words, where appropriate. Material understood rarely exceeds a single phrase at a time, and rereading may be required.

Novice-High: Has sufficient control of the writing system to interpret written language in areas of practical need. Where vocabulary has been learned, can read for instructional and directional purposes standardized messages, phrases, or expressions, such as some items on menus, schedules, timetables, maps, and signs. At times, but not on a consistent basis, the novice-high reader may be able to derive meaning from material at a slightly higher level where context and/or extralinguistic background knowledge are supportive.

Intermediate-Low: Able to understand main ideas and/or some facts from the simplest connected texts dealing with basic personal and social needs. Such texts are linguistically noncomplex and have a clear underlying internal structure, for example, chronological sequencing. They impart basic information about which the reader has to make only minimal suppositions or to which the reader brings personal interest and/or knowledge. Examples include messages with social purposes or information for the widest possible audience, such as public announcements and short, straightforward instructions for dealing with public life. Some misunderstandings will occur.

Intermediate-Mid: Able to read consistently with increased understanding simple connected texts dealing with a variety of basic and social needs. Such texts are still linguistically noncomplex and have a clear underlying internal structure. They impart basic information about which the reader has to make minimal suppositions and to which the reader brings personal information and/or knowledge. Examples may include short, straightforward descriptions of persons, places, and things written for a wide audience.

Intermediate-High: Able to read consistently with full understanding simple connected texts dealing with basic personal and social needs about which the reader has personal interest and/or knowledge. Can get some main ideas and details from texts at the next higher level featuring description and narration. Structural complexity may interfere with comprehension; for example, basic grammatical relations may be misinterpreted and temporal references may rely primarily on lexical items. Has some difficulty with cohesive factors in discourse, such as matching pronouns with referents. While texts do not differ significantly from those at the Advanced level, comprehension is less consistent. May have to read several times for understanding.

Advanced: Able to read somewhat longer prose of several paragraphs in length, particularly if presented with a clear underlying structure. The prose is predominantly in familiar sentence patterns. Reader gets the main ideas and facts and misses some details. Comprehension derives not only from situational and subject matter knowledge but from increasing control of the language. Texts at this level include descriptions and narrations such as simple short stories, news items, bibliographical information, social notices, personal correspondence, routinized business letters, and simple technical material written for the general reader.

Advanced-Plus: Able to follow essential points of written discourse at the Superior level in areas of special interest or knowledge. Able to understand parts of texts which are conceptually abstract and linguistically complex, and/or texts which treat unfamiliar topics and situations, as well as some texts which involve aspects of target-language culture. Able to comprehend the facts to make appropriate inferences. An emerging awareness of the aesthetic properties of language and of its literary styles permits comprehension of a wider variety of texts, including literary. Misunderstandings may occur.

Superior: Able to read with almost complete comprehension and at normal speed expository prose on unfamiliar subjects and a variety of literary texts. Reading ability is not dependent on subject matter knowledge, although the reader is not expected to comprehend thoroughly texts which are highly dependent on the knowledge of the target culture. Reads easily for pleasure. Superior-level texts feature hypotheses, argumentation, and supported opinions and include grammatical patterns and vocabulary ordinarily encountered in academic/professional reading. At this level, due to the control of general vocabulary and structure, the reader is almost always able to match the meanings derived from extralinguistic knowledge with meanings derived from knowledge of the language, allowing for smooth and efficient reading of diverse texts. Occasional misunderstandings may still occur; for example, the reader may experience some difficulty with unusually complex structures and low-frequency idioms. At the superior level the reader can match strategies, top-down or bottom-up, which are most appropriate to the text. (Top-down strategies rely on real-world knowledge and prediction based on genre and organizational scheme of the text. Bottom-up strategies rely on actual linguistic knowledge.) Material at this level will include a variety of literary texts, editorials, correspondence, general reports, and technical material in professional fields. Rereading is rarely necessary, and misreading is rare.

Distinguished: Able to read fluently and accurately most styles and forms of the language pertinent to academic and professional needs. Able to relate inferences in the text to real-world knowledge and understand almost all sociolinguistic and cultural references by processing language from within the cultural framework. Able to understand the writer's use of nuance and subtlety. Can readily follow unpredictable turns of thought and author intent in such materials as sophisticated editorials, specialized journal articles, and literary texts such as novels, plays, poems, as well as in any subject matter area directed to the general reader.

* Based on the ACTFL proficiency guidelines, American Council on the Teaching of Foreign Languages (1989).


Figure 1
Hierarchy of Educational Outcomes

[Figure: a branching hierarchy. National Education Goals, together with an alternative broad statement of goals (Goals for Educated Persons), lead to Broadly Stated Curriculum Objectives; these branch into Specific Curriculum Objectives, then into Test Blueprints, and finally into individual Tests.]


The National Council of Teachers of Mathematics' (NCTM) Curriculum and Evaluation Standards for School Mathematics offers steps along one possible path toward making such goals meaningful. The excerpt below and the examples in the NCTM Standards exemplify less abstract levels in Figure 1:

Standard 1: Mathematics as Problem Solving

In grades K-4, the study of mathematics should emphasize problem solving so that students can ...

use problem-solving approaches to investigate and understand mathematical content;

formulate problems from everyday and mathematical situations;

develop and apply strategies to solve a wide variety of problems;

verify and interpret results with respect to the original problem;

acquire confidence in using mathematics meaningfully.

Curriculum and Evaluation Standards for School Mathematics, NCTM, p. 23

The level of Figure 1 labeled "test blueprint" represents what a particular assessment should comprise: the kinds and numbers of tasks, the way it will be implemented, and the processes by which observations will be summarized and reported. This level of specificity constitutes an operational definition of competence. Quality-control statistician W. Edwards Deming describes how similar processes are routinely required in industry, law, and medicine:

Does pollution mean, for example, carbon monoxide in sufficient concentration to cause sickness in 3 breaths, or does one mean carbon monoxide in sufficient concentration to cause sickness when breathed continuously over a period of 5 days? In either case, how is the effect going to be recognized? By what procedure is the presence of carbon monoxide to be detected? What is the diagnosis or criterion for poisoning? Men? Animals? If men, how will they be selected? How many? How many in the sample must satisfy the criteria for poisoning from carbon monoxide in order that we may declare the air to be unsafe for a few breaths, or for a steady diet?

Operational definitions are necessary for economy and reliability. Without an operational definition, unemployment, pollution, safety of goods and of apparatus, effectiveness (as of a drug), side-effects, duration of dosage before side-effects become apparent (as examples), have no meaning unless defined in statistical terms. Without an operational definition, investigations on a problem will be costly and ineffective, almost certain to lead to endless bickering and controversy.

An operational definition of pollution in terms of offensiveness to the nose would be an example. It is not an impossible definition (being close kin to statistical methods for maintaining constant quality and taste in foods and beverages), but unless it be statistically defined, it would be meaningless.

W. Edwards Deming, 1980, p. 259

A study of pollution in cities might include several of these operational definitions, to provide a fuller picture of their environments. In the same way, an educational assessment often comprises multiple scores or ratings for each student, to provide a fuller picture of individual competencies. Because some linking methods are designed to align only one score with another, our discussion will focus on assessments that provide only a single score. In practice, this type of linking can proceed simultaneously among matching sets of scores from multifaceted assessments (for example, fractions scores with fractions scores, vocabulary scores with vocabulary scores, and so on).

Standardization

From any given set of specifications, an assessment can be implemented in countless ways. Differences, small or large, might exist in tasks, administration conditions, typography, identity and number of judges, and so on. Standardizing an aspect of an assessment means limiting the variations students encounter in that aspect, in an effort to eliminate classes of hypotheses about students' assessment results. Did Duanli score higher than Mark because she had more time, easier questions, or a more lenient grader? Standardizing timing, task specifications, and rating criteria reduces the chance of this occurrence. Other potential explanations for the higher score exist, of course. The plausibility of each explanation can be strengthened or diminished with additional evidence.3

Standardizing progressively more aspects of an assessment increases the precision of inferences about behavior in the assessment setting, but it can also reduce opportunities to observe students performing tasks more personally relevant to their own varieties of competence. Assessing developing competence when there is neither a single path toward "better" nor a fixed and final definition of "best" may require different kinds of evidence from different students (Lesh, Lamon, Behr, & Lester, 1992, p. 407).

Note that "standardization" is not synonymous with "multiple-choice," although multiple-choice tests do standardize the range of responses students can make. The AP Studio Art portfolio, the antithesis of a multiple-choice test, is standardized in other respects. Among the requirements for each portfolio in the general section are four original works that meet size specifications; four slides focusing on color and design; and up to 20 slides, a film, or a videotape illustrating a concentration on a student-selected theme. These requirements ensure that evidence about certain aspects of artistic development will be evoked, although the wide latitude of student choice virtually guarantees that different students will provide different forms of evidence.

Questions about which aspects of an assessment to standardize and to what degree arise under all purposes and modes of testing and all views of competence. The answers depend in part on the evidential value of the observations in view of the purposes of the assessment, the conception of competence, and the resource demands.

3 A branch of test theory called generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) examines the impact of variations in aspects of assessment settings on test scores.

Test Theory

Test theory is the statistical machinery for reasoning from students' behavior to conjectures about their competence, as framed by a particular conception of competence. In an application, a conception of competence is operationalized as a set of ways students might differ from one another. These are the variables in a student model, a simplified description of selected aspects of the infinite varieties of skills and knowledge that characterize real students. Depending on our purposes, we might isolate one or hundreds of facets. They might be expressed in terms of numbers, categories, or some mixture; they might be conceived of as persisting over long periods of time or apt to change at the next moment. They might concern tendencies in behavior, conceptions of phenomena, available strategies, or levels of development. We don't observe these variables directly. We observe only a sample of specific instances of students' behavior in limited circumstances: indirect evidence about their competence as conceived in the abstract.

Suppose we want to make a statement about Jasmine's proficiency, based on a variable in a model built around some key areas of competence. We can't observe her value on this variable directly, but perhaps we can make an observation that bears information about it: her answer to a multiple-choice question, say, or two sets of judges' ratings of her violin solo, or an essay outlining how to determine which paper towel is most absorbent. The observation can't tell us her value with certainty, because the same behavior could be produced by students with different underlying values. It is more likely to be produced by students at some levels than others, however. Nonsensically answering "¿Cómo está usted?" with "Me llamo Carlos," for example, is much more likely from a student classified as a Low-Novice under the ACTFL guidelines than from an Advanced student.

The key question from a statistician's point of view is "How probable is this particular observation, from each of the possible values in the competence model?" The answer, the so-called "likelihood function" induced by the response, embodies the information that the observation conveys about competence, in the way competence is being conceived. If the observation is equally likely from students at all values of the variables in the competence model, it carries no information. If it is likely at some values but not others, it sways our belief in those directions, with strength in proportion to how much more likely the observation is at those values.
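In symbols (a generic sketch; the notation is supplied here for illustration and is not the report's own), write theta for a student's standing on the competence-model variable and x for the fixed, observed response. The likelihood function is simply the probability of that observation viewed as a function of theta:

% Likelihood induced by a fixed observation x (theta and x are illustrative symbols)
\[
  L(\theta \mid x) = P(x \mid \theta)
\]

A likelihood that is flat across all values of theta carries no information; one that is much higher at some values than at others shifts belief toward those values, in proportion to the ratio of the heights.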

For example, compare the results of observing two heads out of four coin tosses with observing 1,000 heads out of 2,000 tosses. Both experiments suggest that the most likely value of the probability of heads on any given toss is 1/2; that is, the coin is fair. The evidence from 2,000 tosses is much stronger, however. It would not be unusual to obtain two heads out of four tosses with a trick coin biased toward, say, a 3/4 probability of heads, while such a coin would hardly ever produce as few as 1,000 heads from 2,000 tosses.
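The coin comparison can be made concrete with the binomial likelihood; the numbers below are a worked illustration added here, not figures from the report. For k heads in n tosses, L(p) is proportional to p^k (1-p)^(n-k), so the evidence favoring the fair coin over the trick coin is the ratio of the two likelihoods:

% Two heads in four tosses: the fair coin is favored only weakly.
\[
  \frac{L(1/2)}{L(3/4)}
    = \frac{\binom{4}{2}(1/2)^2(1/2)^2}{\binom{4}{2}(3/4)^2(1/4)^2}
    = \frac{0.375}{0.211} \approx 1.8
\]
% One thousand heads in two thousand tosses: the fair coin is favored overwhelmingly.
\[
  \frac{L(1/2)}{L(3/4)}
    = \frac{\binom{2000}{1000}(1/2)^{2000}}{\binom{2000}{1000}(3/4)^{1000}(1/4)^{1000}}
    = \left(\tfrac{4}{3}\right)^{1000} \approx 10^{125}
\]

Both data sets point to the same most likely value, but the second ratio is astronomically larger, which is exactly the sense in which the evidence from 2,000 tosses is stronger.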

The most familiar tools we have for linking assessments evolved under the paradigm of "mental measurement," introduced a century ago in an attempt to "measure intelligence." The measurement paradigm posits that important aspects of students' knowledge or skills can sometimes be represented by numbers that locate them along continua, much as their heights and weights measure some of their physical characteristics. From the behavioral point of view, the variable of interest might be the proportion of correct answers a student would give on every item in a large domain. From the cognitive point of view, the variable might be a level on a developmental scale such as the ACTFL reading guidelines. Standard test theory views observations as noisy manifestations of these inherently unobservable variables and attacks the problem of inference in the face of measurement error. We discuss classical test theory in the section about equating and discuss item response theory in the section about calibration. The statistical roots of those theories are grounded in physical measurement, so examples about temperature and weight illustrate the potential and the limitations of test theory to link assessments.

It is important to remember that the traits achievement tests purportedly measure, such as mathematical ability, reading level, or physics achievement, are not features of objective reality, but constructs of human design: invented to organize experience and solve problems, but shaped by the science and the society in which they evolved. Contemporary conceptions of learning do not describe developing competence in terms of increasing trait values, but in terms of alternative constructs: constructing and reconstructing mental structures that organize facts and skills ("schemas"); learning how to plan, monitor, and, when necessary, switch problem-solving strategies ("metacognitive skills"); and practicing procedures to the point that they no longer demand high levels of attention ("automaticity"). Test scores tell us something about what students know and can do, but any assessment setting stimulates a unique constellation of knowledge, skill, strategies, and motivation within each examinee, and different settings stimulate different unique constellations. The mental measurement paradigm is useful only to the extent that the patterns of behavior it captures guide instruction and policy among options framed in our current conceptions of how competence develops.4

We may therefore build assessments around tasks suggested by an appropriate psychology but determine whether measurement models at appropriate levels of detail provide adequate summaries of evidence. In some applications, we may wish to model competencies with detailed structures suggested by the psychology of learning in the domain. An example is tutoring an individual student's foreign language proficiencies. The ACTFL scale doesn't capture the distinctions we'd need to help a mid-novice become a high-novice. In other applications, such as documenting a student's progress, the ACTFL scale may suffice.

Because measurement model results are, at best, gross summaries of aspects of students' thinking and problem solving abilities, we are obliged to identify contexts that circumscribe their usefulness. Two similar scores convey similar meanings to the extent that they summarize performances on suitably similar tasks, in suitably similar ways, for suitably similar students. We must be alert to patterns in individual students' data that cast doubt on using their test scores to compare them to other students, and we must be reluctant to infer educational implications without examining qualitatively different kinds of evidence.

Chains of Inferences

A criminal prosecutor must establish the link between observation and conjecture. Seeing Jennifer's car at the Fotomat Thursday night is direct evidence for the claim that her car was there but indirect evidence that Jennifer herself was. It is indirect and less compelling evidence for the conjecture that she was the burglar. Each additional event or inference between an observation and a conjecture adds

4 An engineering perspective is apropos: "The engineer combines mathematical and physical insight to generate a representation of an unknown process suitable for research, design, and control. A model does not need to mimic the system; it is sufficient to represent the relevant characteristics necessary for the task at hand." (Linse, 1992, p. 1)


uncertainty. The greater the number of possibilities to account for an observation, the higher the potential that new information would alter our beliefs. For example, Sam's claim that Jennifer's car had been stolen Thursday morning does not, in and of itself, contain much information about whether the car was at the Fotomat, but it does temper our belief that Jennifer herself was.

Student behavior in an assessment is closely related to the variables in the conception of competence that generated the assessment. It is less directly related to more abstract levels of competence and even further removed from alternative conceptions of competence that might have led to different assessments. In Figure 2, Test X and Test Y can both be traced back to the same high-level conception of competence, but through different curricular objectives and test specifications. Step 1X represents inference from performance specific to Test X, to general statements at the level of the family of tests sharing the same test specifications. Step 2X represents inference related to the curricular objectives from which the Test X specifications were derived (and from which different test specifications may have also been derived). The chain of inference from Test X to the curricular objectives that led to Test Y involves comparable steps, with an additional "linking" step (Step XY). When we follow this chain, inferences from Test X data to Test Y curricular objectives cannot be more precise than those that would be obtained more directly from Test Y data.

There are two inferential steps between Test X and its corresponding curricular objectives. For assessment from the behavioral perspective, the first step (Step 1) is direct and can be quite definitive. We observe the frequency of behavior X in a sample of contexts, and draw inferences about the tendency toward behavior X in these kinds of contexts. Subsequent inferences related to a more abstractly defined notion of competence (Step 2) add another layer of uncertainty, however. The relationship between behavioral tendencies in the domain of tasks and the more abstractly defined competence that inspired the task domain can be much less direct. We may be able to estimate behavioral tendencies accurately but face a difficult link in a chain of inference when we want to interpret behavior in terms of competence. Step 1 is comparatively easy; Step 2 is the challenge. In an assessment built from a cognitive perspective, the level of test specifications is more directly related to curricular objectives, but more judgment is required to reason from behavior to the level of test specifications.


[Figure 2: Chains of Inference from Test X and Test Y to Two Sets of Specific Curricular Objectives. Nodes include "Broadly Stated Curriculum Objectives" branching to two sets of "Specific Curriculum Objectives."]


The Roles of Judgment

We now consider some test theory issues concerning judgments. When the potential range of students' performances on an assessment task is constrained to a readily distinguishable set (e.g., multiple-choice options), summarizing performance is straightforward. The tough value judgments have already been made before Kikumi ever sees her test form. These judgments are implicit in the test specifications, which lay out the content and the nature of items, and in the value assignments for each possible response (often simply right or wrong, but possibly previously ordered as to quality or other characteristics). For tasks that allow students to respond with fewer constraints, like writing a letter, performing an experiment, or developing a project over time, judgment is also required after the response has been made.

This judgment must characterize students' possibly highly individualistic performances within a common frame of reference. The judgment may produce one or more summary statements (e.g., scores, ratings, checklist tallies, evaluative comments); these may be qualitative or quantitative and need not apply to all performances. A judge's rating, like the testimony of a witness, is the first link in a chain of inference. The role of test theory is to characterize the uncertainty associated with ratings and guide the construction of a common framework of judgmental standards in order to reduce uncertainty. Examples and discussions are the best ways to build such a framework: examples that contrast levels of standards that are stated in general terms, feedback on examples analyzed by the group, discussions of actual performances that provoked differences of opinion. This process is analogous to the judgmental ratings of food and pollution Deming mentioned, and the statistical procedures that support them are similar. It can come into play at one or more levels in a given assessment system:

1. Same conception of competence, same task. At the most constrained level, judges must be able to provide sufficiently similar ratings of performance for a particular task. For example, students might be asked to write a letter to order a T-shirt from a magazine ad. Judges should be able to agree, to the extent demanded by the purpose, in their ratings of a sample of students' letters. Discrepancies among judges' ratings of the same performance reflect a source of uncertainty that must be taken into account in inferences about students' competencies.

2. Same conception of competence, different tasks. At this level, judges must relate performances on different tasks, or more varied performances within a less standardized setting, to values of a more generally stated competency variable. Students may be asked to write about different topics, for example, or they may be allowed to choose the topic about which they write. Judges should be able to agree, again to an extent demanded by the purpose, about how the different essays relate to the competence variable. This kind of judgment is required for any assessment that personalizes tasks for different students while gathering information about the same generally stated competency. Nested levels can be conceived, as in the ACTFL language assessment guidelines in Table 2. We can study, through statistics and discourse, ratings of diverse performances first within the same language, then from different languages. We will return to this topic in the section on calibration.

3. Different conceptions of competence, different tasks. Judges might attempt to relate information from different tasks to different generally described competencies. We might require judgments to determine the weight of evidence that performances designed to inform conjectures in one frame of reference convey for conjectures conceived under a different frame of reference. We shall discuss this further in the sections on projection and moderation.

In all these cases, the raters' judgments play a crucial role in assessing students' competence. Assuring fairness, equity, and consistency in applying standards becomes important in high-stakes applications. Developing a statistical framework for such a system serves two functions: (1) as a quality assurance mechanism, to flag unusual or atypical ratings as a safeguard against biases and inconsistent applications of criteria, and (2) as a means of quantifying the typical uncertainties associated with the scoring of students' performances even in the absence of anomalies. Recognizing that any judgment is a unique interaction between the qualities of individual performance and the personal perspective of an individual judge, such a framework enables us to tackle the practical problems that inevitably arise: How do we help judges learn to make these judgments? How do we ascertain the degree of consistency among judges' perceptions of the rating dimensions, and the extent of agreement among their judgments? How do we arrange the numbers and designs of judgments to assure quality for our purposes?
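As a minimal sketch of how such a framework might flag atypical ratings and quantify typical spread, assume a small set of hypothetical ratings and an arbitrary flagging threshold (both are illustrative choices, not prescriptions from the text):

```python
from statistics import mean, stdev

# Hypothetical ratings on a 1-5 scale; each performance is scored by three judges.
ratings = {
    "letter_01": [4, 4, 5],
    "letter_02": [3, 3, 3],
    "letter_03": [2, 5, 2],   # the second judge departs sharply from the other two
    "letter_04": [5, 4, 4],
}
FLAG_THRESHOLD = 1.5  # flag a rating this far from the average of the other judges

for performance, scores in ratings.items():
    for j, score in enumerate(scores):
        others = scores[:j] + scores[j + 1:]
        if abs(score - mean(others)) > FLAG_THRESHOLD:
            print(f"{performance}: judge {j + 1} gave {score}; the others averaged {mean(others):.1f}")

# Typical uncertainty even without anomalies: the spread of ratings within a performance.
spreads = [stdev(scores) for scores in ratings.values()]
print(f"average within-performance spread: {mean(spreads):.2f}")
```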

Linking Tests: An Overview

The central problems related to linking two or more assessments are (1) discerning the relationships among the evidence the assessments provide about conjectures of interest, and (2) figuring out how to interpret this evidence correctly. Deming continues:

The number of samples for testing, how to select them, how to calculate estimates, how to calculate and interpret their margin of uncertainty, tests of variance between instruments, between operators, between days, between laboratories, the detection and evaluation of the effect of non-sampling errors, are statistical problems of high order. The difference between two methods of investigation (questionnaire, test) can be measured reliably and economically only by statistical design and calculation.

W. Edwards Deming, 1980, p. 259

Summaries of methods that tackle this problem in educational assessment appear below. Table 3 summarizes their key features. Subsequent sections will describe and illustrate each method in greater detail.

Equating. Linking is strongest and simplest if Assessment Y has been constructed from the same blueprint as Assessment X. Under these carefully controlled circumstances, the weight and nature of evidence the two assessments provide about a broad array of conjectures is practically identical. By matching up score distributions from the same or similar students, we can construct a one-to-one table of correspondence between scores on X and scores on Y to approximate the following property: Any question that could be addressed using X scores can be addressed in exactly the same way with corresponding Y scores, and vice versa.

Calibration. A different kind of linking is possible if Assessment Y has been constructed to provide evidence about the same conception of competence as


Table 3
Methods of Linking Educational Assessments

Equating
  Description: Equated scores from tests taken to provide equivalent evidence for all conjectures. Score levels and weights of evidence match up between scores on tests.
  Procedure: 1. Construct tests from same blueprint. 2. Estimate distribution of tests in given population. 3. Make correspondence table that matches distributions.
  Example: Two forms of a driver's license test, written to the same content and format specifications.
  Comments: Foundation is not statistical procedure but the way tests are constructed.

Calibration
  Description: Tests "measure the same thing," but perhaps with different accuracy or in different ways. Results from each test are mapped to a common variable, matching up the most likely score of a given student on all tests.
  Procedure: Case 1: Use same content, format, and difficulty blueprint to construct tests, but with more or fewer items on different tests. Expected percents correct are calibrated. Case 2: Construct tests from a collection of items that fits an IRT model satisfactorily. Carry out inferences in terms of IRT proficiency variable. Case 3: Obtain judgments of performances on a common, more abstractly defined variable. Verify consistency of judgments (varieties of statistical moderation).
  Example: Case 1: A long form and a short form of an interest inventory questionnaire. Case 2: NAEP geometry subscale for grades 4 and 8, connected by IRT scale with common items. Case 3: Judges' ratings of AP Studio Art portfolios including student-selected art projects.
  Comments: Correspondence table matches up "best estimates," but because weights of evidence may differ, the distribution of "best estimates" can differ over tests. Same expected point estimates for individual students, but with differing accuracy. Different estimates of many group characteristics, e.g., variance and population proportion above cut point.

Projection
  Description: Tests don't "measure the same thing," but can estimate the empirical relationships among their scores. After observing score on Y, you can calculate what you'd be likely to observe if X were administered.
  Procedure: Administer tests to the same students and estimate joint distribution. Can derive predictive distribution for Test X performance, given Test Y observation. Can be conditional on additional information about student.
  Example: Determine joint distribution among students' multiple-choice science scores, lab notebook ratings, and judgments of observed experimental procedures.
  Comments: What Test Y tells you about what Test X performance might have been. Can change with additional information about a student. Estimated relationships can vary with the group of students in the linking study and over time in ways that distort trends and group comparisons.


Table 3 (continued)
Methods of Linking Educational Assessments

Statistical moderation
  Description: Tests don't "measure the same thing," but can match up distributions of their scores in real or hypothetical groups of students to obtain correspondence table of "comparable" scores.
  Procedure: Case 1: If you can administer both X and Y to same students, estimate X and Y distributions. Align X and Y with equating formulas. Case 2: If not, administer X and "moderator" Assessment Z to one group, and Z and Y to another. Impute X and Y distributions for hypothetical common group. Use formulas of equating to align X and Y.
  Example: Case 1: Correspondence table between SAT and ACT college entrance exams, based on students who took both. Case 2: Achievement results from History, Spanish, and Chemistry put on "comparable" scales, using common SAT-V and SAT-M tests as moderators.
  Comments: Comments for projection also apply to statistical moderation. "Comparable" scores need not offer comparable evidence about nature of students' competence. Rather, they are perceived to be of comparable value in a given context, for a given purpose.

Social moderation
  Description: Tests don't "measure the same thing," but can match up distributions by direct judgment to obtain correspondence table of "comparable" scores.
  Procedure: Obtain samples of performances from two assessments. Have judges determine which levels of performance on the two are to be treated as comparable. Can be aided by performance on a common assessment.
  Example: Obtain samples of Oregon and Arizona essays, each rated through their own rubrics. Determine, through comparisons of scores given to examples, "comparable" levels of score scales.
  Comments: "Comparable" scores need not offer comparable evidence about nature of students' competence. They are perceived to be of comparable value in a given context, for a given purpose.


Assessment X, but the kinds or amounts of evidence differ. The psychometric meaning of the term "calibration" is analogous to its physical measurement meaning: Scales are adjusted so that the expected score of a student is the same on all tests that are appropriate to administer to him or her.5 Unlike equating, which matches tests to one another directly, calibration relates the results of different assessments to a common frame of reference, and thus to one another only indirectly. Some properties of calibration are disconcerting to those familiar only with equating: As a consequence of different weights of evidence in X and Y data, the procedures needed to give the right answers to some X questions from Y data give the wrong answers to others. It is possible to answer X questions with Y data if a calibration model is suitable, but generally not by means of a single correspondence table. We discuss three settings in which calibration applies: (1) constructing tests of differing lengths from essentially the same blueprint, (2) using item response theory to link responses to a collection of items built to measure the same construct, and (3) soliciting judgments in terms of abstractly defined criteria.

Projection. If assessments are constructed around different types of tasks, administered under different conditions, or used for purposes that bear different implications for students' affect and motivation, then mechanically applying equating or calibration formulas can prove seriously misleading: X and Y do not "measure the same thing." This is not merely a matter of stronger or weaker information but of qualitatively different information. If it is sensible to administer both X and Y to any student, statistical machinery exists to (1) estimate, in a linking study, relationships among scores from X and Y, and other variables in a population of interest and then (2) derive projections from Y data about what the answers to the X questions might have been, in terms of a probability distribution for our expectations about the possible outcomes. As X and Y become increasingly discrepant (such as when they are meant to provide evidence for increasingly different conceptions of competence), the evidential value of Y data for X questions drops, and projections

5 Some writers use the terms "equating" and "calibrating" interchangeably to describe what I call calibrating. The differences in procedures and properties I describe here merit maintaining the distinction.


become increasingly sensitive to other sources of information. The relationship between X and Y can differ among groups of students and can change over time in response to policy and instruction.

Moderation. Certain assessment systems obtain Test X scores from some students and Test Y scores from others, under circumstances in which it isn't sensible to administer both tests to any student. Literature students take a literature test, for example, and history students take a history test. There is no pretense that the two tests measure the same thing, but scores that are in some sense comparable are desired nevertheless. Whereas projection evaluates the evidence that results on one assessment provide about likely outcomes on another, moderation simply aligns scores from the two as to some measure of comparable worth. The way that comparable worth is determined distinguishes two varieties of moderation:

1) Statistical moderation aligns X and Y score distributions, sometimes as a function of joint score distributions with a third "moderator test" that all students take. That is, a score on X and a score on Y are deemed comparable if the same proportion of students in a designated reference population (real or hypothetical) attains scores at or above those two levels on their respective tests. The score levels that end up matched can depend materially on the choice of the reference population and, if there is one, the moderator test. Historically, these procedures have been discussed as a form of "scaling" (Angoff, 1984).

2) Social moderation uses judgment to match levels of performance on different assessments directly to one another (Wilson, 1992). This contrasts with the judgmental linking discussed under calibration, which relates performances from different assessments to a common, more abstractly defined variable, and under projection, which evaluates the evidence that judgmental ratings obtained in one context hold for another context.


Equating, calibration, and moderation address the correspondence between single scores from two assessments. Two assessments can each have multiple scores, however, and these approaches could apply to matched sets of scores. Projection can address joint relationships among all scores from multiple assessments simultaneously.

A Note on Linking Studies

All of the linking methods described here require data of one kind or another, such as student responses or judges' ratings. If the data are collected in a special linking study outside the normal data-gathering context of any of the assessments, then the possible consequences of differences in student or rater behavior between the linking study and the usual context constitute an additional link of inference from one assessment to the other. One must take care to minimize the uncertainty this step engenders, and, whenever possible, quantify its effect on inferences (e.g., Mislevy, 1990). This caveat extends to psychological as well as physical conditions. Differences in levels of individual performance can easily arise between high-stakes and low-stakes administrations of the same tasks.

Equating

Selection and placement testing programs update their tests periodically, as the content of specific items becomes either obsolete or familiar to prospective examinees. New test forms are constructed according to the same blueprint as previous forms, with the same number of items asking similar questions about similar topics in the same ways. SAT Mathematics test developers, for example, maintain a balance among items with no diagrams, with diagrams drawn to scale, and diagrams not drawn to scale, along with a hundred other formal and informal constraints.6 Figure 3 relates an equating link to the hierarchy of competence definitions introduced earlier. Pretest samples of examinee responses to the new items are gathered along with responses to items from previous forms, and items' statistical properties from pretest samples may also be used to select items for new forms. The objective is to create "parallel" test forms, which provide approximately equivalent evidence for a broad range

6 Stocking, Swanson, & Pearlman (1991) describe how these constraints were specified and employed in a demonstration of an automated test assembly algorithm.


[Figure 3: Equating Linkages]


of potential conjectures. Equating makes slight adjustments in the results of such test forms (which were expressly constructed to make such adjustments negligible!) by aligning the distributions of scores from the same or similar students on the two forms.

Equating in Physical Measurement

Two hundred years ago, Karl Friedrich Gauss studied how to estimate the "true position" of a star from multiple observations, all similar but not identical because of the inherent imperfections of the telescope-and-observer system. If each observation differs from the true value by a "measurement error" unrelated to either the true value or the errors of other observations, then (1) the average of the measurements is a good estimate of the true value, and (2) the estimate can be made more precise by averaging over more observations. The nature of the measuring instrument determines the typical size and distribution of the measurement errors. If two instruments react to exactly the same physical property with exactly the same sensitivity, their readings can be equated, in the following sense: A table of correspondence can be constructed so that a value from one instrument has exactly the same interpretation and measurement error distribution as the corresponding value would from the other instrument.
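A minimal simulation, assuming independent normal measurement errors and arbitrary values for the true position and error size, shows how averaging more observations tightens the estimate:

```python
import random
random.seed(1)

TRUE_POSITION = 10.0   # arbitrary "true" value being measured
ERROR_SD = 2.0         # typical size of a single observation's measurement error

for n in (1, 4, 16, 64):
    # Repeat the experiment "take n observations and average them" many times
    # and see how far the average typically lands from the true value.
    errors = []
    for _ in range(2000):
        observations = [random.gauss(TRUE_POSITION, ERROR_SD) for _ in range(n)]
        errors.append(abs(sum(observations) / n - TRUE_POSITION))
    print(f"n = {n:2d}: typical error of the average = {sum(errors) / len(errors):.2f}")
```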

Equating Temperature Readings from Two Similar Thermometers

Table 4 gives July temperatures for the 60 U.S. cities.7 We will use these as "true scores," from which to construct "observed scores" from some hypothetical measuring instruments. The top panel in Figure 4 shows the distribution of true temperatures, highlighting the nine cities from the southeast and six from the west coast; Figure 5 shows the locations of the cities. In analogy to achievement testing, however, we never observe these true measures directly. Instead, we observe readings from two "noisy" measures, Thermometer X and Thermometer Y; that is, each reading may be a bit higher or lower than the true temperature.

7 These data are taken from the SMSAs demonstration data set from the Data Desk Professional 2.0 statistical package (Velleman, 1988).


[Figure 4: Distributions of True Temperatures and Temperature Readings. Panels: true temperatures; Thermometer X; Thermometer Y; Thermometer Y after equating by matching its average to the Thermometer X average. West coast, southeast, and other cities are distinguished.]


Table 4
Temperature Measures for 60 Cities

City | Latitude | "True" Temp | Thermometer X | Thermometer Y | Equated Y | Cricket | Equated Cricket

NORTH/CENTRAL

Akron, OH 41.1 71 70.3 68.3 71.1 61.3 64.1

Albany, NY 42.4 72 73.5 69.6 72.4 68.2 70.8

Allentown, PA 40.4 74 71.8 71.3 74.1 74.4 73.3

Baltimore, MD 39.2 77 75.2 73.1 75.9 76.2 75.2

Boston, MA 42.2 74 73.6 72.1 74.9 67.9 70.3

Bridgeport, CT 41.1 73 73.3 70.4 73.2 82.2 78.6

Buffalo, NY 42.5 70 70.2 67.6 70.4 72.8 73.0

Canton, OH 40.5 72 71.3 69.7 72.5 72.7 72.7

Chicago, IL 41.5 76 75.8 72.2 75.0 79.4 77.2

Cincinnati, OH 39.1 77 78.0 74.5 77.3 86.3 84.2

Cleveland, OH 41.3 71 68.9 67.8 70.6 71.5 71.7

Columbus, OH 40.0 75 75.5 71.6 74.4 84.2 79.6

Dayton, OH 39.5 75 75.3 71.8 74.6 75.5 74.5

Denver, CO 39.4 73 73.7 70.9 73.7 80.4 77.4

Detroit, MI 42.1 74 73.3 71.5 74.3 66.4 70.0

Flint, MI 43.0 72 71.9 67.8 70.6 76.5 75.3

Fort Worth, TX 32.5 85 85.2 81.8 84.6 83.8 79.4

Grand Rapids, MI 43.0 72 71.6 69.1 71.9 71.9 72.0

Greensboro, NC 36.0 77 76.5 75.1 77.9 77.3 75.9

Hartford, CT 41.5 72 72.3 68.0 70.8 73.2 73.3

Indianapolis, IN 39.5 75 74.7 70.8 73.6 84.6 80.2

Kansas City, MO 39.1 81 81.0 77.3 80.1 75.5 74.7

Lancaster, PA 40.1 74 74.5 71.6 74.4 75.2 73.7

Louisville, KY 38.2 71 70.9 67.1 69.9 71.0 71.6

Milwaukee, WI 43.0 69 70.4 66.2 69.0 71.9 71.9

Minneapolis, MN 44.6 73 70.8 69.5 72.3 70.8 71.2

New Haven, CT 41.2 72 72.0 70.8 73.6 71.9 72.3

New York, NY 40.4 77 77.0 74.0 76.8 77.3 76.5

Philadelphia, PA 40.0 76 74.9 72.7 75.5 81.3 78.0

Pittsburgh, PA 40.3 72 72.7 69.4 72.2 71.6 71.8

(continued)


Table 4 (continued)
Temperature Measures for 60 Cities

City | Latitude | "True" Temp | Thermometer X | Thermometer Y | Equated Y | Cricket | Equated Cricket

Providence, RI 41.5 72 71.9 69.0 71.8 63.7 67.8

Reading, PA 40.2 77 77.2 73.1 75.9 71.8 71.9

Richmond, VA 37.4 78 77.4 75.6 78.4 76.0 74.9

Rochester, NY 43.2 72 71.2 68.6 71.4 85.0 80.6

St. Louis, MO 38.4 79 79.0 76.2 79.0 85.4 82.0

Springfield, MA 42.1 74 73.0 70.2 73.0 77.3 75.8

Syracuse, NY 43.1 72 71.7 68.0 70.8 64.5 68.2

Toledo, OH 41.4 73 73.3 71.7 74.5 79.2 77.0

Utica, NY 43.1 71 70.1 69.5 72.3 65.3 68.9

Washington, DC 38.5 78 78.3 74.7 77.5 68.1 70.4

Wichita, KS 37.4 81 82.0 79.2 82.0 74.9 73.6

Wilmington, DE 39.5 76 75.9 72.9 75.7 72.1 72.7

Worcester, MA 42.2 70 69.7 66.3 69.1 74.6 73.3

York, PA 40.0 76 76.7 73.3 76.1 67.9 70.2

Youngstown, OH 41.1 72 72.7 67.7 70.5 70.9 71.3

SOUTHEAST

Atlanta, GA 33.5 79 77.0 77.3 80.1 86.2 82.2

Birmingham, AL 33.3 80 78.6 77.2 80.0 86.9 84.7

Chattanooga, TN 35.0 79 80.2 75.9 78.7 82.2 78.3

Dallas, TX 32.5 85 84.7 82.0 84.8 77.1 75.5

Houston, TX 29.5 84 84.2 80.4 83.2 79.0 77.0

Memphis, TN 35.1 82 82.2 78.0 80.8 85.1 81.0

Miami, FL 25.5 82 80.6 80.4 83.2 82.7 79.0

Nashville, TN 36.1 80 79.6 76.0 78.8 88.3 85.2

New Orleans, LA 30.0 81 79.4 76.8 79.6 78.9 76.7

WEST COAST

Los Angeles, CA 34.0 68 65.9 66.0 68.8 61.6 65.6

Portland, OR 45.3 67 67.8 62.5 65.3 65.6 69.7

San Diego, CA 32.4 70 70.0 65.4 68.2 74.6 73.5

San Francisco, CA 37.5 63 63.6 59.6 62.6 55.8 63.6

San Jose, CA 37.2 68 68.2 65.8 68.6 70.5 70.9

Seattle, WA 47.4 64 64.1 61.7 64.5 67.5 70.1


We've had Thermometer X for some time and determined that it is about right on the average.8 The second panel in Figure 4 shows the distribution of its measures. The average over cities is about the same as the true average, and the average for the regions is about right. About the same number of cities are too high and too low, compared to their true values. On the whole, the Thermometer X readings are spread out a little more than the true values, because the measurement errors add variance to the collection of measures. (We shall see how this point becomes important in calibration.)

We now acquire Thermometer Y, the same kind of thermometer as X, with the same accuracy. It comes from the factory with an adjustment knob to raise or lower all its readings, which, unfortunately, doesn't work. We need to make a table to translate Thermometer Y readings so they match Thermometer X readings. The unadjusted readings are shown in the third panel of Figure 4 and plotted against Thermometer X readings in Figure 6.9 The spread of readings is about the same for both, but the values from Y are systematically lower. We find the average difference over all 60 cities: 2.8. Adding 2.8 to Thermometer Y readings gives them the same average as the Thermometer X readings. This "mean equating" matches up the two thermometers pretty well on the average, yielding the distribution in the last panel of Figure 4. The equating is operationalized in terms of the correspondence table shown as Table 5.
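A minimal sketch of the same mean-equating step, using simulated readings generated under the error assumptions of footnotes 8 and 9 rather than the actual Table 4 values:

```python
import random
from statistics import mean
random.seed(42)

# Stand-in "true" July temperatures for 60 cities (not the Table 4 values).
true_temps = [random.uniform(63, 85) for _ in range(60)]

# Thermometer X: unbiased, error SD of 1 (footnote 8).
# Thermometer Y: reads about 3 degrees low, same error SD (footnote 9).
x_readings = [t + random.gauss(0, 1) for t in true_temps]
y_readings = [t - 3 + random.gauss(0, 1) for t in true_temps]

# Mean equating: shift every Y reading so the Y average matches the X average.
shift = mean(x_readings) - mean(y_readings)
y_equated = [y + shift for y in y_readings]

print(f"shift added to Y readings: {shift:.1f}")   # close to 3, like the 2.8 found above
print(f"X mean {mean(x_readings):.1f}, equated Y mean {mean(y_equated):.1f}")
```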

Equating in Educational Assessment

Edgeworth (1888, 1892) and Spearman (1904, 1907) launched classical test theory (CTT) around the turn of the century by applying the ideas of true-score measurement to tests. CTT views the average (or, equivalently, the total) of 1-for-right/0-for-wrong results from numerous test items as an imperfect measure of an examinee's "true score." While each

8 To produce Thermometer X readings, I added to each "true" temperature a number drawn at random from a normal distribution, with mean zero and standard deviation one.

9 Thermometer Y readings are "true" temperature, minus 3, plus a random number from a normal distribution, with mean zero and standard deviation one.


[Figure 5: Sixty Cities (locations plotted by latitude and longitude)]

[Figure 6: Unadjusted Thermometer Y Readings Plotted Against Thermometer X Readings]


Table 5
Correspondence Table for Thermometer X and Thermometer Y

Thermometer X Reading Thermometer Y Reading

85.8 83.0

85.7 82.9

85.6 82.8

85.5 82.7

85.4 82.6

85.3 82.5

85.2 82.4

85.1 82.3

85.0 82.2

84.9 82.1

84.8 82.0

84.7 81.9

84.6 81.8

84.5 81.7

84.4 81.6

84.3 81.5

84.2 81.4

84.1 81.3

84.0 81.2

83.9 81.1

83.8 81.0

83.7 80.9

83.6 80.8

83.5 80.7

83.4 80.6

83.3 80.5

83.2 80.4

83.1 80.3

83.0 80.2

82.9 80.1

82.8 80.0

82.7 79.9

82.6 79.8

82.5 79.7

82.4 79.6

82.3 79.5

82.2 79.4


of the separate items taps specific skills and knowledge, a score from such a test captures a broad general tendency to get items correct and provides information for conjectures about competencies also defined broadly (Green, 1978). Different tests drawn from the same domain correspond to repeated measures of the same true score.

Equating applies to tests that embody not merely the same general statement of competence but structurally equivalent operational definitions. But even with the care taken to create parallel test forms, scores may tend to be a bit higher or lower on the average with a new form than with the previous one. The new form's scores may spread examinees out a little more or less. An equating procedure adjusts for these overall differences, so that knowing which particular form of a test an examinee takes no longer conveys information about the score we'd expect. There are a variety of equating schemes (see Petersen, Kolen, & Hoover, 1989), but the "equivalent groups" design captures the essence:

1. Administer Test X and Test Y to randomly selected students from the same specified group, typically a group representative of students with whom the tests will be used.

2. Make a correspondence table matching the score on Test X that 99% of the sample was above to the score on Test Y that 99% of that sample was above. Do the same with the 98% score level, the 97% level, and so on. If the X and Y distributions have similar shapes, we may be able to align the distributions sufficiently well by just matching up the means of the X and Y samples, as we did in the temperature example, or the means and standard deviations, as in "linear equating."

3. Once a correspondence table has been drawn up, look up any score on Test Y, transform it to the Test X score on the table, and use the result exactly as if it were an observed Test X score with that value. The same table can be used in the same way to go from X scores to equated Y scores (see the sketch below).
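A minimal sketch of the equivalent-groups idea, with hypothetical score distributions and arbitrary sample sizes standing in for real test data:

```python
import random
random.seed(7)

# Hypothetical scores from two randomly equivalent groups of examinees;
# Test Y is simulated as running a couple of points lower than Test X.
x_scores = sorted(random.gauss(30, 6) for _ in range(1000))
y_scores = sorted(random.gauss(28, 5) for _ in range(1000))

def score_at_percentile(sorted_scores, p):
    """Score at the p-th percentile (0-100) of a sorted sample."""
    index = min(len(sorted_scores) - 1, int(p / 100 * len(sorted_scores)))
    return sorted_scores[index]

# Equipercentile correspondence: the Y score at each percentile is treated as
# equivalent to the X score at the same percentile.
for p in (10, 25, 50, 75, 90):
    y = score_at_percentile(y_scores, p)
    x = score_at_percentile(x_scores, p)
    print(f"{p:2d}th percentile: Y {y:5.1f}  corresponds to  X {x:5.1f}")
```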

Long tests constructed from a given domain of tasks correspond to more accurate measures than short tests drawn from the same domain. Equating, strictly defined, applies to two similarly constructed long tests, or two similarly constructed short ones. The calibration section illustrates the problems that arise if we use equating procedures to link a long test and a short one.

The indicator of a test's accuracy under CTT is reliability, a number between 0 and 1 gauging the extent of agreement between different forms of the test. This number depends not only on the precision with which the test measures students' true scores, but also on the amount of variation among true scores in the student population. This definition reflects the classic norm-referenced usage of tests: lining up examinees along a single dimension for selection and placement. A test that measures examinees' true scores perfectly accurately is useless for this purpose if all their true scores are the same, so this test has a reliability of 0 for this population, although it might separate examinees in a different population quite well and have a reliability closer to one for that group. The plot of Thermometer X and Thermometer Y readings shown in Figure 6 corresponds to a reliability of .96 for the sample as a whole, higher than the reliability of the two-hour long SAT. The reliability of the thermometer readings is .85 for the cities in the southeast only.
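A small simulation illustrates why the same pair of instruments can show different reliability coefficients in different populations; the group means and spreads below are arbitrary stand-ins for the whole-sample and southeast-only comparison:

```python
import random
from math import sqrt
random.seed(3)

def pearson(xs, ys):
    """Pearson correlation between two lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def parallel_readings(true_values, error_sd=1.0):
    """Two independent noisy readings of the same true values, like Thermometers X and Y."""
    x = [t + random.gauss(0, error_sd) for t in true_values]
    y = [t + random.gauss(0, error_sd) for t in true_values]
    return x, y

wide_group = [random.gauss(75, 6) for _ in range(2000)]    # much true-score variation
narrow_group = [random.gauss(81, 2) for _ in range(2000)]  # little true-score variation

for label, group in (("wide group", wide_group), ("narrow group", narrow_group)):
    x, y = parallel_readings(group)
    print(f"{label}: correlation between parallel readings = {pearson(x, y):.2f}")
```

The instruments are equally precise in both groups; the coefficient drops for the narrow group only because there is less true-score variation to detect.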

A high reliability coefficient is important when scores are used for high-stakes, norm-referenced uses with individual students, because high reliability means that a different sample of items of the same kind would lead to the same decision about most students. This same index can be less important, even inappropriate, for other purposes. A shorter and less reliable test can suffice for less consequential norm-referenced decisions: a quiz to determine whether to move on to the next chapter, for example. When the purposes of assessment are criterion-referenced, evidence about the accuracy with which an individual's competencies are assessed is more important than evidence about how he or she compares with other students. And accurate estimates of aspects of the distribution of a group of students' true scores can be obtained from short tests, even if measurement for individuals is quite inaccurate (Lord, 1962). NAEP, which was designed to provide information at the level of groups rather than individuals, thus administers a small sample of items from a large pool to each student, thereby amassing substantial evidence about groups on a broader range of tasks than could be administered to any single student.


Reliability is a sensible summary of the evidence a test provides in a specific context (a particular group of students), for a specific purpose (determining how those students line up in relation to one another). It does not indicate the strength of evidence for statements about the competencies of individual examinees. In the section on calibration, we discuss item response theory indices of evidence that are defined more along these lines.

Comments on Equating

Test construction and equating are inseparable. When they are applied in concert, equated scores from parallel test forms provide virtually exchangeable evidence about students' behavior on the same general domain of tasks, under the same specified standard conditions. When equating does work, it works because of the way the tests are constructed, not simply because of the way the linking data are collected or correspondence tables built. Tests constructed in this way can be equated whether or not they are actually measuring something in the deeper senses discussed in the next section, although of course determining what they measure, if indeed anything, is crucial for justifying their use.

Calibration

Calibration relates observed performance on different assessments to a common frame of reference. Properly calibrated instruments have the same expected value for measuring an object with a given true value. These instruments provide evidence about the same underlying variable, but, in contrast to equating, possibly in different amounts or by different means or in different ranges. A yardstick is less accurate than vernier calipers but is cheaper and easier to use for measuring floor tiles. Nevertheless, the yardstick and calipers should give us about the same measurement for a two-inch bolt. The focus is on inference from the measuring instrument to the underlying variable, so calibrated instruments are related to one another only indirectly. This too contrasts with equating, where the focus is on matching up observed performance levels on tests to one another directly. After illustrating properties of calibration in the context of temperature, we discuss three cases of calibration in educational assessment.


Calibration in Physical Measurement

Calibrating a "Noisy" Measure of Temperature

The warmer it gets, the faster the snowy tree cricket (Oecanthus fultoni) chirps. The relationship between temperature and cricket chirps isn't as strong as the relationship between temperature and the density of mercury, and it isn't very useful below freezing or above about 100°. Calibrating a cricket (let's call him Jiminy) involves transforming the counts of his chirps so that for any true temperature, the resulting estimate of temperature is as likely to be above the true value as below it. The Encyclopaedia Britannica gives "the number of chirps in 15 seconds + 40" as an approximation for Fahrenheit temperature.

If the true temperature is 65°, five calibrated temperature readings from Jiminy's chirps could be 57°, 61°, 66°, 68°, and 71°, averaging around the true value but quite spread out. In contrast, five readings from Thermometer X could give readings of 63°, 64°, 65°, 65°, and 67°, also averaging around the true value, but with much less spread. If we have a reading of 80° from Thermometer X or Jiminy, our best estimate of the true temperature is 80° either way, but the evidence that the true temperature is around 80° rather than 75° or 85° is much stronger with Thermometer X.
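A minimal sketch of both instruments under the error assumptions of footnotes 8 and 10; the simulated readings will not reproduce the illustrative values above exactly:

```python
import random
random.seed(0)

def cricket_temperature_f(chirps_in_15_seconds):
    """The rule of thumb quoted above: chirps in 15 seconds + 40 approximates Fahrenheit."""
    return chirps_in_15_seconds + 40

TRUE_TEMP = 65
# Five cricket readings (chirp count behaves like the true temperature minus 40,
# plus error with SD about 5, as in footnote 10) versus five Thermometer X readings
# (true temperature plus error with SD 1, as in footnote 8).
cricket = [cricket_temperature_f(round(TRUE_TEMP - 40 + random.gauss(0, 5))) for _ in range(5)]
thermo = [round(TRUE_TEMP + random.gauss(0, 1)) for _ in range(5)]
print("cricket readings:    ", cricket, "average:", sum(cricket) / 5)
print("thermometer readings:", thermo, "average:", sum(thermo) / 5)
```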

Table 4 contains a column with a hypothetical set of readings from Jiminy's calibrated measures, and Figure 7 plots them against true temperature around the country.10 These calibrated cricket readings correspond to the true temperatures most likely to have produced them.11 They give unbiased answers, each one possibly different from the true value but tending to produce the right answer on the average, to questions about the temperature of an individual city or about the average of a group of cities:

10 Jiminy's readings are "true" temperatures plus a random number from a normal distribution, with mean zero and standard deviation five.

11 In statistical terms, they are, under plausible assumptions, maximum likelihood estimates. The likelihood function induced by a chirp count is more spread out than that of a Thermometer X or Y reading.


[Figure 7: Cricket Temperature Readings Plotted Against True Temperatures]

Is the true temperature of Fort Worth above or below 80°? See whether its calibrated-cricket measure is above or below. It's above. The true value is above, too. On the other hand, New Orleans has a true temperature just above 80° but a cricket measure just below 80°. The noisier a measure is, the more errors of this type occur. If they are properly calibrated, though, similar numbers of readings will be too high and too low.

What are the averages for the entire sample, for the southeast, and the west coast? They turn out pretty close in this example: for the entire sample, 74.9° from the cricket vs. 74.6° actual; 65.9° vs. 66.7° for the west coast; 82.9° vs. 81.3° for the southeast.

Figure 8 compares the distribution of Jiminy's measures to the true temperatures. Even though every cricket measure has the true measure as its expected value, the added variation from the measurement error spreads the set of measures out considerably. This uncertainty is reflected in the


[Figure 8: Distributions of True Temperatures and Selected Estimates. Panels: true temperatures; Thermometer X; calibrated cricket readings; cricket readings after "equating" to Thermometer X. West coast, southeast, and other cities are distinguished.]


spread of cricket measures' likelihood functions, but it is ignored when we use only the best point estimate of the temperature. As a consequence, properly calibrated cricket measures give biased answers (tending to have a wrong answer as their average) to questions about features of the distribution other than individual and group averages:

What is the range of temperatures? The true range runs from a low of 63° (San Francisco) to a high of 85° (Dallas and Fort Worth). The calibrated cricket measures run from 55.8° (again San Francisco, but with a low error term added to an already low true value), up to 88.3° (Nashville, with a true temperature of 80° but a high measurement error term).

What proportion of cities is below 70°, and what proportion is above 80°? The answers for true temperatures are 10% and 13%; the calibrated-cricket answers are 22% and 25%.

Could we solve this problem by "equating" the cricket values to match the spread of the true measures? We can, in fact, match cricket measures to the Thermometer X distribution exactly, by mapping the highest cricket value to the highest thermometer value, the next highest to the next highest, and so on, disregarding which cities these values are associated with. The numbers appear in Table 4, and the distribution is shown in the last graph in Figure 8. Because of the way we constructed it, this graph has exactly the same silhouette as that of Thermometer X readings: the same mean and spread, the same proportions of cities below 70° and above 80°, and so on. It would seem that we have made cricket measures equivalent to Thermometer X measures.

But wait! See how the Pacific coast and southeastern cities have shifted toward the center of the distribution. Forcing the distribution to be right for the population as a whole has compromised the cardinal property of calibration, matching up most likely best estimates for individuals. A city truly above 80° is now more than likely to have an "equated" temperature closer to the overall average, or below 80°. About the right proportion of equated measures are indeed up there, but now some of the cities that are above 80° happen to have high measurement error terms. The southeastern cities shrink toward the center of the distribution, their average dropping from the nearly correct 82.9° down to 79°. The Pacific coast cities similarly rise spuriously. Inappropriate equating has thus distorted criterion-referenced inferences about individual cities.
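A minimal simulation of this rank matching, with stand-in cities rather than the Table 4 values, shows the same shrinkage of the truly hottest cities toward the middle:

```python
import random
random.seed(11)

true_temps = [random.uniform(63, 85) for _ in range(60)]            # stand-in cities
thermometer = [t + random.gauss(0, 1) for t in true_temps]          # accurate instrument
cricket = [t + random.gauss(0, 5) for t in true_temps]              # noisy instrument

# Rank matching: the highest cricket reading gets the highest thermometer reading,
# the next highest gets the next highest, and so on, regardless of city.
thermo_sorted = sorted(thermometer)
order = sorted(range(60), key=lambda i: cricket[i])
equated_cricket = [0.0] * 60
for rank, city in enumerate(order):
    equated_cricket[city] = thermo_sorted[rank]

# The overall distribution now matches the thermometer's, but the truly hottest
# cities are pulled toward the middle, just as the southeastern cities were.
hottest = sorted(range(60), key=lambda i: true_temps[i])[-10:]
print("true mean of the 10 hottest cities:     ", round(sum(true_temps[i] for i in hottest) / 10, 1))
print("'equated' cricket mean for those cities:", round(sum(equated_cricket[i] for i in hottest) / 10, 1))
```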

It is possible to estimate, from properly calibrated measures, characteristics of a distribution such as its spread and proportions above selected points. More complex statistical methods are required, however, because one must account for the spread of evidence about each of the observations, not just their best single-point estimates. To estimate the proportion of true values above 70°, for example, requires calculating and adding the probabilities that each of the cities is above 70°: a negligible probability for a city with an accurate Thermometer X reading of 65°, but about a one-out-of-five chance for a city with a noisy cricket measure of 65°. A paradox arises that worsens as the accuracy of measurement declines: The proportion of cities whose best estimate is over 70° is not generally equal to the best estimate of the proportion of cities with true temperatures over 70°. No single correspondence table can resolve the paradox. From these observations, we must use different statistical techniques to extract evidence about different kinds of conjectures.
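A minimal sketch of that calculation, treating each reading as the center of a normal distribution with the instrument's error spread (a flat-prior simplification); the batch of cricket readings at the end is hypothetical:

```python
from math import erf, sqrt

def prob_above(cutoff, reading, error_sd):
    """P(true value > cutoff), treating the reading as the center of a normal
    distribution with the instrument's error SD (a flat-prior simplification)."""
    z = (cutoff - reading) / error_sd
    return 0.5 * (1 - erf(z / sqrt(2)))

print(round(prob_above(70, 65, error_sd=1), 3))   # accurate thermometer: essentially zero
print(round(prob_above(70, 65, error_sd=5), 3))   # noisy cricket: a noticeable chance

# Best estimate of how many cities are above 70 degrees: add these probabilities over
# all readings instead of simply counting the readings that exceed 70.
hypothetical_cricket_readings = [58, 64, 66, 69, 71, 74, 78, 83, 86]
expected_count = sum(prob_above(70, r, error_sd=5) for r in hypothetical_cricket_readings)
print("expected number above 70:", round(expected_count, 1),
      "out of", len(hypothetical_cricket_readings))
```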

Calibration in Educational Assessment, Case 1: Long and Short Tests Built to the Same Specifications

The simplest calibration situation in educational assessment differs from the equating situation in just one respect: Two tests can be built to the same specifications and use similar items and standard administration conditions, but contain more or fewer items. For example, Test X could consist of three items from each category in a detailed table of specifications, while Test Y had one from each category. Test X would be used to gather information for inferences about individual students, while Test Y could be used for inferences about group averages.


Bob Linn (in press) uses the example of a basketball free-throw competition to illustrate calibration of varying-length tests. The coach wants to select players who shoot with at least 75% accuracy. The long test form is 20 tries; the short form is four tries. If a player's true accuracy is, say, 50%, her most likely outcome in either setting is 50%: 10 of 20, or two of four. Accordingly, the properly calibrated estimate of the accuracy of a player who makes 50% of her shots is 50%. But because of the greater uncertainty associated with observing only four tries, a true 50% shooter has a probability of .31 of making at least 75% of her shots with the short form, but a probability of less than .01 with the long form.
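The probabilities in this example follow directly from the binomial distribution; a minimal sketch of the computation (not from the report itself) is below.

from math import ceil, comb

def tail_prob(n_tries, p, cutoff=0.75):
    # P(observed proportion >= cutoff) when the number of makes is
    # Binomial(n_tries, p).
    k_min = ceil(cutoff * n_tries)
    return sum(comb(n_tries, k) * p**k * (1 - p)**(n_tries - k)
               for k in range(k_min, n_tries + 1))

print(round(tail_prob(4, 0.5), 2))    # short form: about .31
print(round(tail_prob(20, 0.5), 3))   # long form: far smaller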

In the same way, the observed percent-correct scores from two educational tests constructed in this manner are approximately calibrated in terms of expected percent correct on a hypothetical, infinitely long test from the same specifications. (The item response theory methods discussed later could be used to fine-tune the calibration.) Unbiased estimates are obtained for individuals, but the short test would show a wider spread of scores than the long test. The short form would yield more errors about who was or was not above a given true-score level, but this phenomenon is inherent in its sparser evidence and cannot be eliminated through calibration or equating. As with the cricket measures and the Thermometer X readings, building an equating correspondence table for Test X and Test Y would solve this problem, but at the cost of introducing biases in scores for individuals. Inappropriately "equated" scores for the short test would tend to give people scores too close to the average, compared to their true scores.

Properly calibrated scores, then, are appropriate for obtaining an unbiased estimate for each student. They would be the choice for criterion-referenced inferences for individuals. We could construct one-to-one correspondence tables for this purpose for different length tests from the same specifications. But, as with cricket temperatures, the same tables would misestimate the shape of the distribution for a group of students, to a degree and in a way that depends on the accuracy of those estimates. For example, shorter tests generally cause additional overestimation of a distribution's spread and more greatly exaggerated estimates of the proportions of students in the extremes. As we mentioned, correctly estimating the shape of a distribution requires accounting for the uncertainty associated with each observation. Methods for accomplishing this within classical test theory have been available for some time (e.g., Gulliksen, 1950/1987). Analogous methods have appeared more recently for problems in which different observations have different amounts and shapes of uncertainty, as in the item response theory context discussed below (e.g., Mislevy, 1984, 1991).

Calibration in Educational Assessment, Case 2: Item Response Theory for Patterns of Behavior

Standard equating and the simple calibration situation described above depend on the construction of highly constrained observational settings. Inferences about behavior in two settings can be related because the settings are so similar. This approach offers little guidance for tests built to different blueprints but intended to measure the same construct, such as harder or easier collections of similar items. Item response theory (IRT) lays out a framework that the interactions of students and items must exhibit to satisfy the axioms of "measurement" as it developed in the physical sciences (see especially Rasch, 1960/1980). We illustrate these ideas in the context of right/wrong items, but extensions to ordered categories, counts, ratings, and other forms of data are available. If an IRT model is an adequate description of the patterns that occur in item responses, statistical machinery in the IRT framework enables one to calibrate tests in essentially the same sense as we calibrate physical instruments. In this case, the items and tests constructed from them are calibrated to the unobservable variable "tendency to get items of this kind right."

Figure 9 illustrates the linkages that can be forged by this kind of IRT calibration, that is, defining a variable in terms of regularities of behavior patterns across a collection of tasks. The chances that these measurement requirements will be adequately satisfied are better for tests built from the same framework, although it need not happen even here. The required regularities may also be found across sets of tasks written to different specifications, or from different conceptions of competence, but this is even less likely. Whether it happens in a given application is an empirical question, to be answered with the data from a linking study in which students are administered tasks from both assessments.

Page 56: DOCUMENT RESUME ED 353 302 TM 019 366 AUTHOR · DOCUMENT RESUME. TM 019 366. Mislevy, Robert J. ... Mimi Reed, Karen. Sheingold, Martha Stocking, Kentaro Yamamoto, Howard Wainer,

[Figure 9. Calibration in Terms of Behavior. If the requisite homogeneous patterns are observed among tasks from a set of tests being calibrated together, all the tasks are implicitly "measuring the same thing," presumably the objectives in the highest common element connecting the tests. Darker boxes indicate better chances of approximating the requirements of measuring a common variable, among patterns of response to all tasks within a box. Each box represents tasks presented to a common group of students.]


Stone Lifting as a Measure of Strength

Suppose we have a room full of stones, with weights from five to 500 pounds. To learn about people's strength in a way that corresponds to classical test theory, we would select a random sample of, say, 50 stones, and record how many a person could lift. This would give us (1) an estimate of the proportion of stones in the room each person could lift, if he or she tried them all, and (2) a way to rank people in terms of their strength (a norm-referenced inference). It wouldn't tell us whether a particular person would be likely to lift a 100-pound stone (an example of a criterion-referenced inference). A different sample of 50 stones would give the same kind of information. Because one sample of stones might have more heavy ones, we could equate the number of stones lifted from the two samples, just as we equate scores from alternate forms of the SAT.

Comparing people's strength in this manner would require strong people to lift very light stones, and the less strong to try heavy ones, both expending their time and energy without telling us much about their strength. A better system would note exactly which stones a person attempts. People can generally lift stones up through a certain weight, have mixed success in a narrow range, then rarely lift much heavier ones. We might characterize their strength by the point at which they have 50-50 chances of lifting a stone. Our accuracy for a particular person would depend on how many stones are in the right range for him or her. Knowing Alice succeeds with 70 and 90, but not 110 and 130, would yield an estimate for her of 100, plus or minus 10. Having 95, 100, and 105 pound stones would make her "test" more precise. We could then give an estimate within five pounds. These same stones would tell us virtually nothing for measuring the strength of Jonathan, who succeeded with 10 and 15 pounds but not 20 or 25.

If we didn't have a chance to weigh the stones ahead of time, we could observe, for a number of people, which stones they could lift and which they couldn't. Believing the stones are lined up in an order that applies in the same way to every person, we could use the relative frequencies with which stones are lifted to discover this order, then use it to measure more people for whom we believe the same ordering also applies.

In analogy to stone lifting, Rasch's IRT model for right/wrong test items proposes that the probability that a person will respond correctly to an item is a function of (1) a parameter characterizing the person's proficiency and (2) a parameter characterizing the item's difficulty. If the model holds, long and short tests, hard and easy ones, constructed from the pool of items might be "calibrated" like the stones.12 The estimates for people's locations would be scattered around their "true" locations, with the degree of accuracy depending mainly on the number of items administered in the neighborhoods of their true abilities. Comparing people's measures with one another or with group distributions supports norm-referenced inferences about individuals. Comparing their scores with item locations supports criterion-referenced inferences. People's IRT measures are approximately unbiased with long tests.13
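In its usual logistic form, the Rasch model makes the probability of a correct response depend only on the difference between the person parameter and the item parameter. A minimal sketch, with invented parameter values:

from math import exp

def rasch_prob(theta, b):
    # P(correct) = exp(theta - b) / (1 + exp(theta - b)), where theta is
    # the person's proficiency and b is the item's difficulty.
    return 1.0 / (1.0 + exp(-(theta - b)))

# A person at theta = 1.0 facing three items of increasing difficulty.
for b in (-1.0, 0.0, 2.0):
    print(b, round(rasch_prob(1.0, b), 2))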

Linking tests with Rasch's IRT model amounts to estimating the locations of tasks from both tests, based on the responses of a sample of people who have taken at least some of each. If the IRT model fits the patterns in data well enough, then estimates of students' proficiency parameters are properly calibrated measures on the same scale. A correspondence table could be drawn up that matches the expected scores on the two tests for various values of "true" proficiency.

12 The likelihood functions for right and wrong responses to each of the items would be estimated from the data, characterizing the evidence each provided about proficiency on the domain of items. The likelihood function induced by a person's set of responses to several items would be the product of the appropriate responses. The highest or most probable point is his or her "maximum likelihood estimate" test score.

13 They are "consistent" estimates, meaning that they approach being unbiased as the number of observations increases. In the interest of simplicity, I will use the more familiar term "unbiased" loosely, to encompass "consistent."

E "747

Page 59: DOCUMENT RESUME ED 353 302 TM 019 366 AUTHOR · DOCUMENT RESUME. TM 019 366. Mislevy, Robert J. ... Mimi Reed, Karen. Sheingold, Martha Stocking, Kentaro Yamamoto, Howard Wainer,

...the distribution ofunbiased scores forindividuals is gener-ally not an unbiasedestimate of thedistribution of theirtrue scores.

...using a demonstra-bly good estimate forevery student in agroup can give ademonstrably badestimate about thecharacteristics of thegroup as a whole.

48

As we saw in the cricket example, the estimate of an individual student would have about the same expectation from either test, although measurement error might be small with one test and horrendously large with the other.

As we also saw in the cricket example, however, the distribution of unbiased scores for individuals is generally not an unbiased estimate of the distribution of their true scores. The correspondence table described in the preceding paragraph gives the wrong answers about population characteristics such as spread and proportion of people above selected points on the scale. Statistical machinery is available (again, if the IRT model holds) to estimate population characteristics whether the test forms are long or short, hard or easy, but questions about group distributions must be addressed with data from the group as an entity. Mislevy, Beaton, Sheehan, and Kaplan (1992), for example, describe how such procedures are used in NAEP. They show how using a demonstrably good estimate for every student in a group can give a demonstrably bad estimate about the characteristics of the group as a whole.
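A minimal simulation of that effect, under an assumed normal error model (the numbers are invented, not NAEP's): every student's estimate is unbiased, yet the collection of estimates overstates the spread of true proficiency.

import random
from statistics import pstdev

random.seed(1)
true_scores = [random.gauss(500, 100) for _ in range(2000)]
# Each observed score is the true score plus unbiased measurement error.
observed = [t + random.gauss(0, 60) for t in true_scores]

print(round(pstdev(true_scores)))  # spread of true scores: about 100
print(round(pstdev(observed)))     # spread of estimates: about 117, too wide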

I keep saying "if the IRT model holds." Beyond simplyasserting that measurability holds, IRT models give a way togauge their plausibility in terms of the patterns observed indata. Sometimes students who do well on most of the itemsmiss easy ones, and students who don't do well on mostanswer some hard ones correctly. Suppose Charlie, Gwyn,and Peter all answer three of five items correctly. Peter andGwyn miss items #4 and #5, which were in fact the hardestones. Unexpectedly, Charlie misses the easiest ones, items #1and #2. This is like lifting stones most people find heavy butfailing to lift ones most people find light. We distrust theassumption that the items line up for Charlie in the sameway they do for Gwyn and Peter and question using his scoreto compare him to his peers. Does Charlie exhibit an atypicalresponse pattern because he uses a different method to solvethe problems than most of the other students? The patternsin the observed behavior of these students with the samescore are so discrepant as to defy the claim that they aremeasuring the same thing for everyone. Some variation isexpected under the model, and statistical tests help deter-mine whether an individual's response paaern is surprisingin this sense or if a particular item tends to be unexpectedlyeasy or difficult for particular groups of people.

Since an IRT model is never "the truth" anyway, these concerns must be addressed as matters of degree and practical consequence. If one acted as if the measurement model were true, what errors would result, and how serious would they be, for which kinds of inferences?

We have learned that IRT models tend to fit better if the items are more homogeneous and the people are more homogeneous with respect to what the items require. For items, this homogeneity includes content, the way in which they are presented (item formats, timing conditions, order of the items, and so on), and the uses to which they will be put, which affect student motivation. Estimates of group averages can change more from these causes than from a year's worth of schooling (Beaton & Zwick, 1990)! For students, differences in cultural backgrounds and educational experiences that interact with the item content can make it impossible to define a single common measure from a collection of items; the items tend to line up differently for members of different groups. Merely standardizing the conditions of observation is not sufficient for test scores to support measures. Exactly the same items, settings, and timings can lead to different characteristic patterns among students with different backgrounds, muddling the meaning of test scores and subverting comparisons among students.

Suppose, for example, the tasks in Test X are built around the skills Mathematics Program X emphasizes, while the tasks in Test Y are built around the skills Mathematics Program Y emphasizes. A combined test with both kinds of items might approximately define a variable among Program X students only or among Program Y students only. The combined test used with all students together would produce scores, and we could even construct parallel test forms and successfully equate them. But because of the qualitative difference between students' profiles of performance for X and Y items, the scores could not be thought of as measures on a single well-defined variable. We can attack this problem in two ways:

Perhaps subsets of the tasks from the two tests cohere, so that one or more portions of different assessments can be calibrated to variables that extend over groups.

If an IRT model holds across all tasks within each student group separately, but not across groups, we might think of the tests as measures of two distinct variables rather than two alternative measures of the same variable. Rather than calibrating two tests to the same single variable, we would study the relationships among two (or more) distinct variables. This is the topic of the section on "projection."

Calibration in Educational Assessment, Case 3: Item Response Theory for Judgments on an Abstractly Defined Proficiency

In recent years, IRT modeling has been extended from right/wrong items to ratings and partial-credit data (see, e.g., Thissen & Steinberg, 1986, for a taxonomy of models). An opportunity exists to map judges' ratings of performance on complex tasks onto a common scale, from which properly calibrated measures of student performance might be derived. As with IRT for right/wrong items, an opportunity also exists to discover that performances over a given collection of tasks cannot be coherently summarized as values on a single variable! This process can be attempted at different levels on the hierarchy of competence definitions, as suggested by Figure 10. For reasons discussed below, linking in this manner is more likely to succeed for tasks sharing a more focused conception of competence, that is, more likely to succeed with the grouping of tests at the right of Figure 10 than with those at the left.

The ACTFL language proficiency guidelines can be used to illustrate those points. The calibration process might be carried out within a single language or for performances across several languages, the latter representing a greater challenge. One or more judges, observing one or more interviews with a student, would rate students' accomplishments in terms of ACTFL levels. Students would be characterized in terms of their tendencies to perform at levels of the guidelines, a variable at a higher level of abstraction than the sample of their actual performances. Raters would be characterized in terms of their harshness or leniency. Interview topics or settings, possibly tailored and certainly interpreted individually for individual students, would be characterized in terms of their difficulty.

Compared to multiple-choice or true/false items, a judge rating performances in terms of an abstract definition of competence faces a range of student responses. Mapping what he or she sees to one or more summary ratings is no easy task. Consider putting the ACTFL language proficiency guidelines into operation. The generic descriptions are meant to signify mileposts along the path to competence in learning a foreign language.


[Figure 10. Calibration in Terms of Judgment at a Higher Level of Abstraction. Darker arrows indicate greater expectation of consistency; labels range down to "specific curriculum objectives."]


As they stand, however, the guidelines are just so many words, open to interpretation when applied to any particular performance. They acquire meaning only through examples and discussion: examples for the generic guidelines from different languages, to facilitate meaningful discussion across languages; examples and more specific guidelines within each language, to promote common interpretations for performance among students who will be compared most directly; and examples to work through, to help raters "make the guidelines their own" in ACTFL's intensive four-day training sessions.

Even after successful training, judges inevitably exhibit some variations in characterizing any given performance. In a specific setting, however, hard work and attention to unusual ratings can bring a judging system to what Deming calls "statistical control": The extent of variation among and within judges lies within steady, predictable ranges. The more narrowly defined a performance task is, the easier it will usually be for judges to come to agreement about the meanings of ratings, through words or examples, and the more closely their judgments will agree. The Motion Picture Academy might have less trouble selecting the best actress of the year, for example, if all actresses had to play Lady Macbeth in a given year. Of course, making judging easier in this way destroys the intention of recognizing unique, individualistic, and often unanticipated kinds of excellence. So instead, the Academy appropriately tackles the tougher task of judging very different actresses, who take on different kinds of challenges over a broad range of roles.14

We can expect the typical amount of interjudge variation to increase as student competence is construed more abstractly or as the range of ways it might be manifest broadens. This variation leads to larger measures of uncertainty for students' scores (i.e., ratings induce more dispersed likelihood functions). A model such as Linacre's (1989) extension of IRT to individual raters and other facets of the judging situation helps identify unusual ratings: an interaction between a judge and a performance more out of synch with other ratings of that performance and other ratings by that judge than would be expected, given the usual distribution of variation among raters and performances. Relaying this information back to judges helps them come to agreement about criteria and helps ensure quality control for high-stakes applications.

14 It would be fascinating to analyze Academy members' ratings and explore their rationales in detail. To what degree do we find systematic differences related to "schools of acting"? Do voters' approaches and actors' approaches interact? Does the plot or message of a movie influence voters?

Experience suggests that, if all other things are equal, the greater the degree of judgment demanded of raters, the more uncertainty is associated with students' scores. The contrast with multiple-choice items can be dramatic. This latitude may be exactly what we want for instructing individual students, because it expands the richness, the variety, the individualization, the usefulness, of the assessment experience for the teacher and the student. High-stakes applications in which a common framework of meaning is demanded over many students and across time may prompt us to develop more constrained guidelines; to break scoring into multiple, more narrowly defined rating variables; to train raters more uniformly; or to increase the number of raters per performance, the number of performances per student, or both.

Comments on Calibration

Educational assessments can be linked through calibration if the evidence each conveys can be expressed in terms of a likelihood function on a common underlying variable. The expected score of a student will be the same on any of the assessments he or she can appropriately be administered, although there may be more or less evidence from different assessments or for different students on the same assessment. One route to producing assessments that can be calibrated is to write them to blueprints that are the same except for having more or fewer tasks. Alternatively, responses to a collection of multiple-choice or judgmentally scored tasks may satisfactorily approximate the regularities of an item response theory model. Arrangements of these tasks in different tests for different students or different purposes can then be calibrated in terms of a common underlying variable.

The proviso for the IRT route is that the patterns in data (interactions between students, tasks, and, if they are involved, judges) must exhibit the regularities of the measurement model. Irregularities, such as when, compared to other tasks, some tasks are hard for some kinds of students but relatively easy for other kinds, are more likely to arise when we analyze more heterogeneous collections of tasks and students. IRT statistical machinery can be used to discover unwanted task-by-background interactions, explore the degree to which they affect inferences, and help determine whether breaking assessments into narrower sets of tasks fits better with a measurement paradigm. If the full range of assessment tasks don't, as a whole, conform to a measurement model, perhaps smaller, more homogeneous groupings of tasks will.

Even when assessments do "measure the same thing" closely enough for our purposes, measuring it with different accuracy introduces complications for inferences at the group level. Although we get unbiased answers to questions about individual students and group means with properly calibrated scores, these same scores give biased estimates of population characteristics, such as the spread and the proportion of students above a particular point on a scale. Statistical methods can provide the correct answers to the latter questions, but they are unfamiliar, complex, and address the configuration of the group's data as a whole rather than as a collection of individual scores for each student (Mislevy, Beaton, Kaplan, & Sheehan, 1992).

Projection

Suppose we have determined that two tests "measure different things." We can neither equate them nor calibrate them to a common frame of reference, but we may be able to gather data about the joint distribution of scores among relevant groups of students. The linking data might be either the performances of a common group of students that has been administered both assessments, or ratings from a common group of judges on performances from both assessments. The links that might be forged are symbolized in Figure 11. We could then make projections about how a student with results from Assessment Y might perform on Assessment X, in terms of probabilities of the various possibilities.15 Projection uses data from a linking study to model the relationship among scores on the two assessments and other characteristics of students that will be involved in inferences. I highlight this phrase because "measuring the same thing" means we don't have to worry about these other characteristics, but when "measuring different things," we do. The relationships among scores may change dramatically for, say, students in different instructional programs. As the following examples will show, ignoring these differences can distort inferences and corrupt decisions.

15 A best single guess is a prediction. Predictions are sometimes used to link assessments, but predictions are a special result of the analyses required for projection. We therefore discuss the more general case.


[Figure 11. Empirical Linkages for Projection. Boxes representing different tests are connected by arrows; darker arrows indicate stronger empirical relationships are expected.]


If Ming Mei takes Test Y, what does this tell us about what her score might have been on Test X? While the statistical machinery required to carry it out and gauge its accuracy can be complex, the basic idea behind projection is straightforward: Administer both tests to a large sample of students "suitably similar" to Ming Mei. The distribution of X scores of the people with the same Y score as Ming Mei is a reasonable representation of how we think she might have done. The same idea applies when both X and Y yield multiple scores or ratings. The phrase "suitably similar" is the joker in this deck.
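A minimal sketch of that idea, with an invented linking sample (the scores are illustrative only): the projection for a given Test Y score is the whole spread of Test X scores among comparable students, not a single number.

# Invented linking data: (test_y_score, test_x_score) for each student.
linking_sample = [
    (18, 40), (18, 47), (18, 52), (19, 44), (19, 55),
    (20, 50), (20, 58), (20, 61), (21, 57), (21, 66),
]

def projected_distribution(y_score):
    # All Test X scores observed among linking-sample students who earned
    # the same Test Y score; this spread is the projection.
    return sorted(x for (y, x) in linking_sample if y == y_score)

print(projected_distribution(20))   # e.g. [50, 58, 61] for a Y score of 20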

Projection in Physical Measurement

When two instruments measure different qualities, we can study the relationships between their values among a collection of objects and see how these relationships can depend on other properties of the objects. This provides a framework for evaluating information from one variable about conjectures phrased in terms of a different variable. Given a new group of objects with measures on only one variable, we can use the results of the investigation to express our expectations about their values on the second variable.

Linking Temperature and Latitude

Cities closer to the equator tend to be warmer than cities farther away from it. This pattern shows up in our sample of cities in the U.S., as seen in Figure 12. The vertical axis is true temperature. The horizontal axis is negative latitude; left means more northern and right means more southern. The correlation between temperature and southernness is .62, not impressive for indicators of temperature but a fairly high figure for educational test scores: a bit higher, for example, than the correlation between multiple-choice and free-response sections of typical Advanced Placement examinations.

What would we anticipate about the temperature of a city if we knew its latitude? We'd look straight up the graph from that point. For cities around 40° latitude, for example, temperatures cluster between 72° and 77°, averaging about 75°. There are also four cities at about 37° latitude. Two have temperatures around 80° while two have temperatures around 65°, for an average of 72.5°. A huge spread, in a strange pattern, but it does represent what we know so far.

How would we use this information linking a sample of cities by temperature and latitude to make inferences about the temperatures of a new sample of cities, for which we know only latitude? For a city at any given latitude, the spread of temperatures for cities at this latitude in the linking sample represents our knowledge. This is the conditional distribution of temperature given latitude. It is used differently for inferences about individual cities and about the new sample as a whole:

For an individual city, we might use the center of the conditional distribution as a single-point prediction, if we have to give one. The curved line in Figure 12 gives a smoothed version of the centers of the conditional distributions. This line would give a point prediction of 74° for a city at 40° latitude, and a prediction of 76.5° for a city at 37° latitude. Because all the predictions would fall along this line, we'd understate the spread we'd expect in a new sample of cities.

For the sample as a whole, we use the entire predictive distribution of each city in the new sample. That is, for each city with 40° latitude, anticipate a potential spread between 72° and 77°, centered around 75°. Doing this for every city in the new sample builds up our best guess as to the distribution of temperatures in the sample, taking into account both the features and the limitations of our knowledge.

[Figure 12. True Temperatures Plotted Against Latitude. Vertical axis: true temperature, roughly 65° to 85°; horizontal axis: negative latitude, from north (left) to south (right).]

[Figure 13. True Temperatures Plotted Against Latitude, with Regions Distinguished. The same plot with the west coast cities (e.g., Seattle, Portland, Los Angeles), the southeastern cities, and the remaining cities (e.g., Utica, NY) marked separately.]

Let's look more closely at the relationship between temperature and latitude by seeing just which cities are where in the plot. Figure 13 is just like Figure 12, except that three regions of the country have been distinguished: the west coast, the southeast, and the rest (northeast and central). We see that there is very little relationship between latitude and temperature within the west coast cities (Pacific breezes keep them all cool!) or within the southeastern cities (they are all warm!). The strength of the relationship came mainly from the central and northeast regions. The relationship is strong within this region; the correlation is up to .70, even though there is less variation in latitude. Our projections are much different if we take region into account. Consider the following:

What temperature would we expect for a city at 40° latitude? If it is in the northeast or the central states, our previous best prediction of 75° is still pretty good, mainly because, we now see, the only cities in the original sample at 40° latitude were in this region. But if the new city is on the west coast, 75° is not a good estimate. A better estimate is 67°, the average of all the west coast cities, no matter how far north or south they are. And asking what its temperature would be if it were a southeastern city with a latitude of 40° doesn't make sense; by definition, a city that far north can't even be in the southeast!

What would we expect about a city at 37° latitude? If it's in a northeast or central state, probably about 80°, like Wichita and Richmond; if it's on the west coast, again it's probably about 67°, like all the other west coast cities. The huge spread for cities at this latitude in our original sample was the result mainly of whether or not they were on the west coast.

Projection in Educational Assessment

Two assessments can differ in many ways, including the content of questions, the mode of testing, the conditions of administration, and factors affecting students' motivation. Because each of these changes affects the constellations of skills, strategies, and attitudes each student brings to bear in the assessment context, some students will do better in one than another. In the projection approach to linking assessments, a sample of students is administered all of the assessments of interest (or interlocking portions of them) so we can estimate their joint distribution in that sample. From the resulting estimated distributions, we can answer questions such as, "If I know Ming Mei's results on Assessment Y, what probability distribution represents my expectations for her on Assessment X?" The answer can be applied to an inference we'd have made using Assessment X results, had they been available. The spread of this projected distribution represents the added uncertainty that must be added to the usual measurement uncertainty associated with performance within a given method; using just the best guess would overstate our confidence. We can expect that the more differences there are among assessments' contents, methods, and contexts, the weaker the association among them will be.

The relationship between assessments can differ systematically for students with different backgrounds, different styles, or different ways of solving problems. Recall the math tests, tailored to different math programs, containing items that could not be calibrated to a single underlying variable. The students instructed from one perspective will tend to do better with tasks written from the same perspective. When we speculate on how well Ming Mei would fare on Assessment X, we look at the X score distribution of "suitably similar" students with the same Assessment Y score(s) as hers.

But suppose the distributions differ for students who have studied in Program X and those who have studied in Program Y. The answer to "What are my expectations for Ming Mei on Assessment X if I know her results on Assessment Y and that she has studied in Program Y?" will then differ considerably from the answer to "What are my expectations for Ming Mei on Assessment X if I know her results on Assessment Y and that she has studied in Program X?" An answer that doesn't take program of study into account averages over the answers that would apply in the two, in proportion to the number of students in the linking sample who have taken them. This gives a rather unsatisfactory statement about the competence of Ming Mei as an individual, for either a high-stakes decision about her accomplishments to date or a low-stakes decision to guide her subsequent instruction. We thus arrive at the first desideratum for linking assessments through projection:


1. A linking study intended to support inferences based on projection should include relationships among not only assessments, but also other student variables that will be involved in the inference.
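A minimal sketch of why this matters, using invented linking data: conditioning on the program a student studied can change the projected Test X results dramatically, so a pooled projection table misrepresents both groups.

from statistics import mean

# Invented linking sample: (program, test_y_score, test_x_score).
linking_sample = [
    ("X", 30, 34), ("X", 30, 36), ("X", 31, 35),
    ("Y", 30, 22), ("Y", 30, 24), ("Y", 31, 23),
]

def projected_x(y_score, program=None):
    # Test X scores among "suitably similar" students: same Y score and,
    # optionally, the same instructional program.
    matches = [x for (prog, y, x) in linking_sample
               if y == y_score and (program is None or prog == program)]
    return round(mean(matches), 1), (min(matches), max(matches))

print(projected_x(30))               # pooled over programs
print(projected_x(30, program="X"))  # Program X students only
print(projected_x(30, program="Y"))  # Program Y students only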

A similar caveat in projection linking is that relationships among assessments can change over time. This factor arises when test results are used for accountability purposes, to guide educational policy decisions. The actions that tend to maximally increase scores on one test may not be the same as those that increase scores on a different test, even though students' scores on the two tests at any single time point may be highly related.

As an example, students who do well on multiple-choice questions about writing problems also tend to write well, compared with other students. At a given point in time, results from the two tests would rank students similarly and provide similar evidence for norm-referenced inferences. Actually writing essays in class tends to raise the essay-writing skills of all students, even though the ordering of students according to multiple-choice and essay performances remains similar. Suppose we estimate the joint relationship between essay and multiple-choice measures at a single point in time, but use only multiple-choice scores to track results over time. Our one-shot linking study is structurally unable to predict an interaction between assessments over time. No matter whether essay scores would have gone up or down relative to multiple-choice scores, this analysis projects a given trend in multiple-choice scores to the same trend in essay scores. We thus arrive at the second desideratum for linking assessments through projection:

2. If projection is intended to support inferences about change, and changes may differ for the different assessments involved, then linking studies need to be repeated over time to capture the changing nature of the relationships.

Another Type of Data for Projection Linking

So far we have discussed projection with assessments that can all be reasonably administered to any of the students of interest. Their results can be objectively scored, judgmental, or a mixture. All students in a sample take all assessments under the simplest design for a linking study, and relationships are estimated directly. Under more efficient designs, students take only selected assessments, or even just portions of them, in interlocking patterns that allow the relationships to be estimated.

A variation for judgmental ratings skips a link in the usual chain of inference. Suppose two frames of reference exist for rating performances, perhaps from different conceptions of competence or different operationalizations of what to look for. If evidence for either would be gleaned from similar assessment contexts, it may be reasonable to solicit judgments from both perspectives on the same set of performances. Given results under one approach to scoring, we might then be able to project the distribution of scores under the other.

As an example, Table 6 (based on Linn, n.d.) shows the joint distribution of scores assigned to essays Oregon students wrote for their assessment, as evaluated by the standards of the Oregon "Ideas Score" rubric and Arizona's "Content Score" rubric. Each row in this table roughly indicates the evidence for competence on an aspect of the Arizona framework conveyed by a score on an aspect of the Oregon framework. A more thorough study would be characterized by a larger sample of papers; examination of relationships for interesting subgroups of students, topics, and raters; and, to enable these more ambitious analyses, modeling dominant trends and quantifying sources of variance. Note that the six papers receiving the same Oregon score of 5 received Arizona scores ranging from the lowest to the next to highest: uncertainty that must be taken into account when projecting distributions of Arizona scores from Oregon papers. This amount of uncertainty would give one pause before using Oregon performance for a high-stakes decision for individual students under Arizona standards.

Table 6
Cross-Tabulation of Oregon Ideas Scores and Arizona Content Scores for the Oregon Middle-School Papers*

Oregon Ideas Score     Counts of papers across Arizona Content Scores (0 to 4)
10                     3  2
 9                     3  1
 8                     2  2
 7                     2  4
 6                     2  3
 5                     1  1  3  1
 4                     2  1
 3                     3
 2                     1

*Based on Table 17 of Linn (n.d.)

Comments on Projection

Projection addresses assessments constructed around different conceptions of students' competence, or around the same conceptions but with tasks that differ in format or content. These assessments provide qualitatively different evidence for various conjectures about the competence of groups or individuals. With considerable care, it is possible to estimate the joint relationships among the assessment scores and other variables of interest in a linking study. The results can be used to make projections about performance on one assessment from observed performance on the other, in terms of a distribution of possible outcomes. Although it is possible to get a point prediction, no simple one-to-one correspondence table captures the full import of the link for two reasons: (1) using only the best estimate neglects the uncertainty associated with the projection, and therefore with inferences about individual students, and (2) the relationships, and inferences they imply, can vary substantially with students' education and background and can change with the passage of time and instructional interventions.

Projection sounds rather precarious, and it is. The more assessments arouse different aspects of students' knowledge, skills, and attitudes, the wider the door opens for students to perform differently in different settings. Moderate associations among assessments can support inferences about how groups of students might fare under different alternatives. But projections for individual students in high-stakes applications would demand not only strong empirical relationships but vigorous efforts to identify groups (through background investigations) and individuals (through additional sources of information) for whom the usual relationships fail to hold.

Moderation"Moderation" is a relatively new term in educational

testing, although the problem it addresses is not. The goal isto match up scores from different tests that admittedly do notmeasure the same thing. It differs from projection in thefollowing sense: While projection attempts to characterize theevidence a performance on one assessment conveys aboutlikely performance on another, moderation simply asks for aone-to-one matchup among assessments as to their worth.Moderation is thus an evaluation of value, as opposed to anexamination of evidence. We now consider two classes oflinking procedures that have evolved to meet this desire. Themethods of statistical moderation have been developing formore than 50 years as a subclass of scaling procedures (see,for example, Angoff, 1984, and Keeves, 1988). They are sta-tistical in appearance, using estimated score distributions todetermine score levels that are deemed comparable. Socialmoderation, a more recent development, relies upon directjudgments.

Statistical Moderation

Statistical moderation aligns score distributions in essentially the same manner as equating, but with tests that admittedly do not measure the same thing. If two tests can sensibly be administered to the same students, statistical moderation simply applies the formulas of equating, without claiming a measurement-theory justification. The arrows for projection linkages in Figure 11 illustrate places where statistical moderation might be carried out, except for the two tests on the left that were in fact built to the same blueprint. Applying equating formulas really does equate these tests!

The scales for the SAT Verbal and Mathematics scores (SAT-V and SAT-M) illustrate a simple example of statistical moderation. In April 1941, 10,654 students took the SAT. Their SAT-V and SAT-M formula scores (number correct, minus a fraction of the number wrong) were both transformed to an average of 500 and a standard deviation of 100. Tables of correspondence thus mapped both SAT-M and SAT-V formula scores into the same 200-800 range. An SAT-V scale score and an SAT-M score with the same numerical value are comparable in this norm-referenced sense: In the 1941 sample, the proportion of examinees with SAT-V scores at or above this level, and the proportion with SAT-M scores at or above this level, were the same. (Whether similar proportions of this year's sample have these scores is quite a different question, and in general they don't.)
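A minimal sketch of that kind of rescaling, with invented formula scores (the 1941 data are not reproduced here): each section is linearly transformed to a mean of 500 and a standard deviation of 100.

from statistics import mean, pstdev

def scale_to_500_100(raw_scores):
    # Linear transformation to mean 500, standard deviation 100.
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [round(500 + 100 * (x - m) / s) for x in raw_scores]

verbal_raw = [28, 41, 55, 63, 72]   # invented SAT-V formula scores
math_raw = [19, 33, 47, 58, 70]     # invented SAT-M formula scores
print(scale_to_500_100(verbal_raw))
print(scale_to_500_100(math_raw))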

A more complex example involves "moderator tests." Moderator tests are a device for linking disparate "special" assessments taken by students in different programs or jurisdictions, or for different reasons: for example, German tests for students who study German, and physics tests for students who study physics. Two scores are obtained from each student in a linking study: one for the appropriate special assessment, and one on a "moderator test" that all students take. Scores on the moderator test are used to match up performance on the special tests. The rationale is articulated in The College Board Technical Handbook for the Scholastic Aptitude Test and Achievement Tests (Donlon & Livingston, 1984, p. 21):

The College Board offers 14 different achievement tests.... If the scores are to be used for selection purposes, comparing students who take tests in different subjects, the score scales should be as comparable as possible. For example, the level of achievement in American history indicated by a score of 560 should be as similar as possible to the level of achievement in biology indicated by a score of 560 on the biology test. But what does it mean to say that one student's achievement in American history is comparable to another student's achievement in biology? The Admission Testing Program's answer to this question, which forms the basis for scaling the achievement tests, is as follows. Suppose student A's relative standing in a group of American history students is the same as student B's relative standing in a group of biology students. Now suppose the group of American history students is equal to the group of biology students in general academic ability. Then it is meaningful to say that student A's achievement in American history is comparable to student B's achievement in biology.

The groups of students who choose to take the different Achievement Tests, however, cannot be assumed to be equal in general academic ability. Their SAT scores often provide evidence that they are not.... Obviously, the differences are quite large in some cases and cannot be disregarded.


Donlon and Livingston go on to describe the procedures used to accomplish the scaling.16 An extreme sports example illustrates the basic idea, using the simpler scheme Keeves (1988) describes for linking grades from different subjects:

16 First, relationships among the SAT-V, SAT-M, and an Achievement Test are estimated from an actual baseline sample of students. Then, projection procedures are used to predict the distribution of a hypothetical "reference population" of students who are all "prepared" to take the special area test (i.e., have studied biology, if we are working with the biology test) and have a mean of 500, a standard deviation of 100, and a correlation of .60 on the regular SAT sections. That is, the same relationship among the SAT tests and the Achievement Test observed in the real sample is assumed for the hypothetical sample, which could have a mean higher or lower than 500 and a standard deviation higher or lower than 100. The projected special-test raw-score distribution of the hypothetical group is transformed to have a mean of 500 and standard deviation of 100.

Using Mile-Run Times to Moderate Batting Averages and Goals Scored

To moderate baseball batting averages and hockey goals scored using mile-run times, we would first obtain batting averages and run times from a group of baseball players, and goals scored and run times from a group of hockey players. Within the baseball players, the best batting average is matched to the best run time, the average to the average, and so on; a similar procedure is used for goals scored and run times within hockey players. To find out what number of goals scored is comparable to a batting average of .250, we (1) find out what proportion of the baseball players in the baseline group hit below this average (say it's 60%); (2) look up the mile-run time that 60% of the baseball players were slower than (say it's 5:10); (3) look up how many of the hockey players were slower than 5:10 in the mile (say it's 30%); and (4) look up how many goals scored that 30% of the hockey players were below (say it's 22).

Suppose hockey players tend to be faster runners than baseball players. Figure 14 shows the linkages that result from goals scored to running times, and running times to batting averages. This procedure would be consistent with the arguable premise that running times measure athletic ability, and hockey players, because they run fast, would have high batting averages if they were baseball players. The judgment of the relative value of batting and goal-scoring skills is implicit in the choice of the moderator test.

[Figure 14. Moderated Distributions of Goals Scored and Batting Averages. The distribution of "moderated" goals scored, the distribution of times in the mile run, and the distribution of "moderated" batting averages are lined up against one another.]
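A minimal sketch of this chained matchup, with invented data (running performance is expressed as speed so that larger is better in every column; none of these figures come from the report):

def rank_fraction(values, x):
    # Proportion of the group at or below x (steps 1 and 3 above).
    return sum(v <= x for v in values) / len(values)

def value_at_fraction(values, frac):
    # The group value sitting at that same proportion (steps 2 and 4).
    ordered = sorted(values)
    idx = max(0, min(int(round(frac * len(ordered))) - 1, len(ordered) - 1))
    return ordered[idx]

# Invented linking data for the two groups of athletes.
batting_avgs   = [.210, .230, .250, .270, .290, .320]   # baseball players
baseball_speed = [14.0, 14.5, 15.2, 15.8, 16.4, 17.0]   # their mile-run speeds (mph)
goals_scored   = [8, 14, 19, 25, 33, 42]                # hockey players
hockey_speed   = [15.5, 16.0, 16.8, 17.3, 18.0, 18.6]   # their mile-run speeds (mph)

def moderated_goals(batting_avg):
    frac = rank_fraction(batting_avgs, batting_avg)      # standing among batters
    speed = value_at_fraction(baseball_speed, frac)      # matching run speed
    frac2 = rank_fraction(hockey_speed, speed)           # that speed among hockey players
    return value_at_fraction(goals_scored, frac2)        # comparable goals scored

# Because the (invented) hockey players all run faster, a respectable
# batting average maps to a low goals-scored figure.
print(moderated_goals(.250))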

Variations in linking samples and moderating tests have no effect on norm-referenced comparisons of performances within special area tests, but they can affect norm-referenced comparisons from one special area to another markedly. Carrying out the procedures of statistical moderation with different samples of students, at different points in time, or with different moderator tests, can produce markedly different numerical links among tests. Particular choices for these variables can be specified as an operational definition of comparability, but moderated scores in and of themselves offer no clue as to how much the results would differ if the choices were altered. The more the tests vary in content, format, or context, the more the results will vary under alternative moderation schemes. A sensitivity study compares results obtained under different alternatives, revealing which inferences are sensitive to the choices. We would have little confidence in a comparison of, say, subgroup means across "moderated" test scores unless it held up under a broad range of choices for linking samples and moderator tests.
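Such a sensitivity study can be sketched directly: repeat the link under freshly drawn linking samples and under an alternative moderator, and see how far the linked value moves. The data-generating assumptions below (two skills, with the second skill weighted differently by different moderator tests) are purely illustrative.

```python
import numpy as np

def chain(x, x_mod, y_mod, y, value):
    """Equipercentile chain: Test X -> moderator (X group) -> moderator (Y group) -> Test Y."""
    p = np.mean(np.asarray(x) < value)
    m = np.quantile(x_mod, p)
    q = np.mean(np.asarray(y_mod) < m)
    return np.quantile(y, q)

def one_link(seed, w=0.0):
    """Link a Test X score of 70 onto the Test Y scale for one randomly drawn
    linking sample. 'w' is the weight the moderator gives to a second skill on
    which the two groups differ; changing w mimics choosing a different
    moderator test. All data here are synthetic."""
    r = np.random.default_rng(seed)
    skill1_x, skill2_x = r.normal(0, 1, 300), r.normal(0.0, 1, 300)
    skill1_y, skill2_y = r.normal(0, 1, 300), r.normal(0.8, 1, 300)  # Y group stronger on skill 2
    x     = 70 + 8 * skill1_x + r.normal(0, 4, 300)
    y     = 60 + 10 * skill1_y + r.normal(0, 5, 300)
    x_mod = 50 + 6 * (skill1_x + w * skill2_x) + r.normal(0, 5, 300)
    y_mod = 50 + 6 * (skill1_y + w * skill2_y) + r.normal(0, 5, 300)
    return chain(x, x_mod, y_mod, y, 70.0)

links_a = [one_link(s, w=0.0) for s in range(20)]   # vary only the linking sample
links_b = [one_link(s, w=1.0) for s in range(20)]   # also switch to a different moderator

print("spread over linking samples:", round(float(np.std(links_a)), 2))
print("shift under the alternative moderator:", round(float(np.mean(links_b) - np.mean(links_a)), 2))
```

A shift under the alternative moderator that is large relative to the spread over linking samples is exactly the warning sign described above.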

In an application that uses a moderating test, the test's specifications determine the locus of value for "comparable worth." In our sports example, baseball players' performances were assigned lower values than hockey players' goals scored, simply because hockey players ran faster. In a serious educational context, consider the use of statistical moderation to link disparate educational assessments from clusters of schools through NAEP. To the degree that focusing on skills not emphasized in NAEP trades off against skills that are, this arrangement would work in favor of clusters whose tests were most closely aligned with NAEP, and against clusters whose content and methodology departed from it.

A more subtle variation of statistical moderation is applying an IRT model across collections of tasks and students in the face of the unwanted interactions we discussed in the section on calibration. If a composite test consisting of tasks keyed to Program X and Program Y is calibrated with responses from students from both programs, task type may interact with student program. Consequential interactions launch us into the realm of statistical moderation. Comparisons among students and between groups could turn out differently with a different balance of task types or a different balance of students in the calibration group. If the interaction is inconsequential, however, inferences could proceed under the conceit of a common measurement model; few inferences would differ if the balance of items or the composition of the calibration sample were to shift. Determining the sensitivity of a link to these factors is a hallmark of a responsible IRT linking study.
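One rough way to screen for such interactions, sketched below with synthetic data, is to estimate item difficulties separately from Program X students and from Program Y students and flag the items whose relative difficulty shifts. The centered log-odds index used here is a crude stand-in for a full IRT calibration, adequate only as a first pass, and every number in the simulation is an assumption made for illustration.

```python
import numpy as np

def crude_difficulties(responses):
    """Centered log-odds of proportion-incorrect per item: a rough stand-in
    for a full IRT calibration, usable for a first-pass comparison."""
    p = responses.mean(axis=0).clip(0.05, 0.95)   # proportion correct per item
    d = np.log((1 - p) / p)                       # higher value = harder item
    return d - d.mean()                           # center so the two scales line up

rng = np.random.default_rng(2)
n_students, n_items = 400, 20
base_diff = rng.normal(0, 1, n_items)
# Items 0-9 are keyed to Program X, items 10-19 to Program Y: students find their
# own program's items somewhat easier (a task-type-by-program interaction).
bonus_for_x = np.r_[np.full(10, 0.5), np.full(10, -0.5)]

def simulate(mean_ability, bonus):
    ability = rng.normal(mean_ability, 1, n_students)
    logits = ability[:, None] - (base_diff - bonus)[None, :]
    return (rng.random((n_students, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)

resp_x = simulate(0.0, bonus_for_x)    # Program X students
resp_y = simulate(0.2, -bonus_for_x)   # Program Y students

shift = crude_difficulties(resp_x) - crude_difficulties(resp_y)
print("items with the largest difficulty shift across programs:",
      np.argsort(-np.abs(shift))[:5])
```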

Social Moderation

Social moderation calls for direct judgments about the comparability of performance levels on different assessments. This process could apply to performances in assessment contexts that already require judgment, or to score levels on objectively scored tests. As an example, Wilson (1992) describes the "verification" process in Victoria, Australia. Samples of students' performances on different assessments in different localities were brought together at a single site to adjudicate "comparable" levels of performance.

Auxiliary information on common assessment tasks can be solicited to supplement direct judgment. For example, assessments of two school districts might contain tasks unique to each, but a common core may be present in both. If so, it would then be possible to compare the score distributions on unique tasks of students from both districts who had the same levels of performance on the common tasks. If nothing more were done but to simply align these distributions, we would have an instance of statistical moderation. If judgments were used to further adjust the resulting matchup on the basis of factors left out of statistical moderation, such as the relevance of the common tasks to the unique tasks, we would have social moderation.
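The statistical half of that arrangement can be sketched as follows, with synthetic data standing in for the two districts' results: students are banded on the common-core score, and within each band the unique-task distributions are laid side by side. A moderation panel would then accept, or judgmentally adjust, the alignment these comparisons suggest.

```python
import numpy as np

rng = np.random.default_rng(3)

def district(n, shift):
    """Synthetic district: a common-core score and a unique-task score per student."""
    ability = rng.normal(shift, 1, n)
    common = 50 + 10 * ability + rng.normal(0, 4, n)
    unique = 50 + 9 * ability + rng.normal(0, 6, n)
    return common, unique

common_a, unique_a = district(500, 0.0)
common_b, unique_b = district(500, 0.4)

# Band students by common-core score, then compare unique-task medians for
# matched bands: the raw material a moderation panel would review.
bands = np.quantile(np.r_[common_a, common_b], [0, 0.25, 0.5, 0.75, 1.0])
for lo, hi in zip(bands[:-1], bands[1:]):
    in_a = (common_a >= lo) & (common_a <= hi)
    in_b = (common_b >= lo) & (common_b <= hi)
    print(f"common-core band {lo:5.1f}-{hi:5.1f}: "
          f"district A unique median {np.median(unique_a[in_a]):5.1f}, "
          f"district B unique median {np.median(unique_b[in_b]):5.1f}")
```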

Scoring the Decathlon

The decathlon is a medley of 10 track and field events: 100-meter dash, long jump, shot put, high jump, 400-meter run, 110-meter hurdles, discus throw, pole vault, javelin throw, and 1,500-meter run. Conditions are standardized within events, and it is easy to rank competitors' performances within each. To obtain an overall score, however, requires a common scale of value. This is accomplished by mapping each event's performance (a height, a time, or a distance) onto a 0-1,000 point scale, where sums are accumulated and overall performance is determined. A table was established in 1912 for the decathlon's first appearance in the Olympics by the International Amateur Athletic Federation (IAAF) by a consensus among experts. Performances corresponding to then-current world records were aligned across events, and lesser performances were awarded lower scores in a manner the committee members judged to be comparable.
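A toy version of such a scoring table shows the mechanics. The anchor performances and point values below are invented for illustration and are not the IAAF's published tables; the real tables are far finer-grained, but the logic of mapping each event onto a common 0-1,000 scale and summing is the same.

```python
import numpy as np

# Hypothetical anchor tables: (performance, points) pairs fixed by judgment.
# These numbers are illustrative only, not the IAAF's published values.
ANCHORS = {
    "shot_put":  ([5.0, 12.0, 16.0, 23.0],  [0, 500, 750, 1000]),   # meters, farther is better
    "100m":      ([14.5, 11.5, 10.8, 9.9],  [0, 500, 750, 1000]),   # seconds, faster is better
    "high_jump": ([1.00, 1.70, 1.95, 2.45], [0, 500, 750, 1000]),   # meters, higher is better
}

def event_points(event, performance):
    """Interpolate a raw performance onto the 0-1,000 point scale."""
    marks, points = ANCHORS[event]
    if marks[0] > marks[-1]:                    # lower-is-better events (times)
        marks, points = marks[::-1], points[::-1]
    return float(np.interp(performance, marks, points))

def total_score(performances):
    """Sum the event scores to get the overall decathlon-style total."""
    return sum(event_points(e, p) for e, p in performances.items())

print(total_score({"shot_put": 15.32, "100m": 11.7, "high_jump": 1.70}))
```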


The IAAF revised the decathlon scoring table in 1936 and 1950 to reflect improvements in world-class performance and to reflect different philosophies of valuation. All of these earlier tables emphasized excellent performance in individual events, so that a superior performance in one event could more than offset relatively poor showings in several others. By scaling down the increases at the highest levels of performance, revisions in 1964 and 1985 favored the athlete who could perform well in many events.

Table 7 shows the performance and total scores of the top eight contenders from the 1932 Olympics in Los Angeles. Their scores under both the 1912 tables, which were in effect at the time, and the 1985 tables are included. The 1985 totals are all lower, reflecting the fact that the top performances in 1932 were not as impressive as those in 1985. James Bausch's performances won the gold medal, but he would have finished behind Akilles Jarvinen had the 1985 tables been used. With the 1912 tables, Bausch's outstanding shot put distance more than compensated for his relatively slower running times.

Like statistical moderation, social moderation is founded on a particular definition of comparability, defined in terms of a given process, at a given point in time, with a given set of people. The set of people now, however, is the sample of "social moderators", corresponding to the IAAF committees that draw up decathlon tables, rather than a sample of students who take multiple assessments. No built-in mechanism indicates the uncertainties associated with these choices. Like statistical moderation, social moderation can also be supported by sensitivity studies. How much do the alignments change if we carry them out in different parts of the country? How about with teachers versus content-area experts, or students, or community members carrying out the judgments? With different supplemental information to aid the process? Again, we would have little confidence in inferences based on moderated test scores unless they held up under a broad range of reasonable alternatives for carrying out the moderation process.


Table 7
Decathlon Results from the 1932 Olympics*

Name                       Country  Discus  Shot put  Javelin  High jump  Long jump  Pole vault  110m hurdles  100m  400m  1,500m   1912 table  1985 table
1. James Bausch            U.S.     44.58   15.32     61.91    1.70       6.95       4.00        16.2          11.7  54.2  5:17.0   8462        6735
2. Akilles Jarvinen        Fin.     36.80   13.11     61.00    1.75       7.00       3.60        15.7          11.1  50.6  4:47.0   8292        6879
3. Wolrad Eberle           Ger.     41.34   13.22     57.49    1.65       6.77       3.50        16.7          11.4  50.8  4:34.4   8031        6661
4. Wilson Charles          U.S.     38.71   12.56     47.72    1.85       7.24       3.40        16.2          11.2  51.2  4:39.8   7985        6716
5. Hans-Heinrich Sievert   Ger.     44.54   14.50     53.91    1.78       6.97       3.20        16.1          11.4  53.6  5:18.0   7941        6515
6. Paavo Yrjölä            Fin.     40.77   13.68     56.12    1.75       6.59       3.10        17.0          11.8  52.6  4:37.4   7688        6385
7. Clyde Clifford Coffman  U.S.     34.40   11.86     48.88    1.70       6.77       4.00        17.8          11.3  51.8  4:48.0   7534        6265
8. Robert Tisdall          Ire.     33.31   12.58     45.26    1.65       6.60       3.20        15.5          11.3  49.0  4:34.4   7327        6398

Throwing and jumping events are in meters; the hurdles and running events are in seconds, with the 1,500 meters in minutes:seconds. Total points are shown under both the 1912 and 1985 scoring tables.

* From Wallechinsky, 1988, The complete book of the Olympics.


Comments on Moderation

Moderation should not be viewed as an application of the principles of statistical inference, but as a way to specify "the rules of the game." It can yield an agreed-upon way of comparing students who differ qualitatively, but it doesn't make information from tests that aren't built to measure the same thing function as if they did. An arbitrarily determined operational definition of comparability must be defended. In statistical moderation, this means defending the reasonableness of the linking sample and, if there is one, the moderating test. This definition is bolstered by a sensitivity study showing how inferences differ if these specifications change to other reasonable choices. In social moderation, consensual processes for selecting judges and carrying out the alignment constitute a first, necessary line of defense, which can similarly be bolstered by sensitivity studies.

Conclusion
Can We Measure Progress Toward National Goals if Different Students Take Different Tests?

An educational assessment consists of opportunities for students to use their skills and apply their knowledge in content domains. A particular collection of tasks evokes in each student a unique mix of knowledge, skills, and strategies. Results summarize performances in terms of a simplified reporting scheme, in a framework determined by a conception of important features of students' competence. If we construct assessments carefully and interpret them in light of other evidence, they can provide useful evidence about aspects of students' competencies to guide decisions about instruction and policy.

A given assessment is designed to provide evidence about students for a particular purpose, and in terms of a particular conception of competence. The notion of "linking" Assessment Y to Assessment X stems from the desire to answer questions about students' competence that are cast in the frame of reference that led to Assessment X, when we observe only results from Assessment Y. The nature of the relationships among the methodology, content, and intended purpose of Assessment X and those of Assessment Y determines the statistical machinery necessary to address those questions.


The notion of monitoring national standards implies gathering evidence about aspects of students' competence within a framework that is meaningful for all students, say in a given grade, in all schools throughout the nation. In any content area, the development of competence has many aspects. No single score can give a full picture of the range of competencies of different students in different instructional programs. Accordingly, multiple sources of evidence, spanning different question types, formats, and contexts, must be considered. Some of these will be broadly meaningful and useful; others will be more idiosyncratic at the level of the state, the school, the classroom, or even the individual student.

No simple statistical machinery can transform the results of two arbitrarily selected assessments so that they provide interchangeable information about all questions about students' competencies. Such a strong link is in fact approximated routinely when alternative forms of the SAT or of the Armed Services Vocational Aptitude Battery (ASVAB) are equated, but only because these forms are written to the same tight form and content constraints and administered under standard conditions. The simple and powerful linking achieved in these cases results not from the statistical procedures used to map the correspondence, but from the way the assessments are constructed.
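For forms built to the same tight specifications and given under the same conditions, very simple machinery suffices. The sketch below shows classical linear equating, matching means and standard deviations in a common group, with synthetic scores standing in for two parallel forms; it is not the operational SAT or ASVAB procedure, only an illustration of how little statistical work is left once the forms themselves are parallel.

```python
import numpy as np

def linear_equate(scores_x, scores_y):
    """Return a function mapping Form X scores onto the Form Y scale by
    matching means and standard deviations (the classical linear method)."""
    mx, sx = np.mean(scores_x), np.std(scores_x)
    my, sy = np.mean(scores_y), np.std(scores_y)
    return lambda x: my + sy * (np.asarray(x) - mx) / sx

rng = np.random.default_rng(4)
ability = rng.normal(0, 1, 1000)                        # one group takes both forms
form_x = 500 + 100 * ability + rng.normal(0, 30, 1000)
form_y = 490 + 95 * ability + rng.normal(0, 30, 1000)   # Form Y slightly harder/compressed

to_y_scale = linear_equate(form_x, form_y)
print("Form X score of 600 reported on the Form Y scale:", round(float(to_y_scale(600)), 1))
```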

What, then, can we say about prospects for linking disparate assessments in a national system of assessments? First, it isn't possible to construct once-and-for-all correspondence tables to "calibrate" whatever assessments might be built in different clusters of schools, districts, or states to provide different kinds of information about students. What is possible, with carefully planned continual linking studies and sophisticated statistical methodology, is attaining the less ambitious, but more realistic, goals below:

Comparing levels of performance across clusters directly in terms of common indicators of performance on a selected sample of consensually defined tasks administered under standard conditions: a "market basket" of valued tasks. Some aspects of competence, and assessment contexts for gathering evidence about them, will be considered useful by a wide range of educators, and elements of an assessment system can solicit information in much the same way for all (see sections on equating and calibration, Case 2). A role like this might be appropriate for the National Assessment. Gathering information in this standardized manner, however, can fail to provide sufficient information about aspects of competence holding different salience for different clusters. Common components should be seen as one of many sources of evidence about students' competencies: a limited source for providing evidence about the variety of individual competencies that characterize a nation of students.

Estimating levels of performance of groups or individuals within clusters, possibly in quite different ways in different clusters, at the levels of accuracy demanded by purposes within clusters. Methodologies for accomplishing this under standardized conditions with objectively scored tasks like multiple-choice items have been available for some time. Analogous methodologies for assessment approaches based on judgment are now emerging. Components of cluster assessments could gather evidence for different purposes or from different perspectives of competence, to complement information gathered by common components.

Comparing levels of performance across clusters in terms of performance ratings on customized assessments, in terms of a consensually defined, more abstract description of developing competence. When cluster-specific assessment components are designed to provide evidence about the same set of abstractly defined standards (e.g., those of the National Council of Teachers of Mathematics), though possibly from different perspectives or by different methods, it is feasible to map performance in terms of a common construct (see section on calibration, Case 3). Comparisons across clusters are not as accurate as comparisons within clusters based on the same methodology and perspective, nor as accurate as comparisons across clusters on common assessment tasks. The tradeoff, though, is the opportunity to align assessment with instruction without requiring standardized instruction. For the reasons discussed in the section on statistical moderation, it is inadvisable to link unique cluster components solely through their empirical relationships with the common cluster components.


Making projections about how students from one cluster might have performed on the assessment of another. When students can be administered portions of different clusters' assessments under conditions similar to those in which they are used in practice, we can estimate the joint distribution of results on those assessments. We can then address "what if" questions about what the performances of groups or individuals who took one of the assessments might have been on the other (see section on projection). The more the assessments differ in form, content, and context, though, the more uncertainty is associated with these projections, the more they can be expected to vary with students' background and educational characteristics, and the more they can shift over time. Unless very strong relationships are observed and found to hold up over time and across different types of students, high-stakes uses for individuals through this route are perilous.
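Under the strong, and checkable, assumption that the two assessments are roughly linearly related with normal residuals, such a projection can be sketched as follows. The cluster names, scales, and data are hypothetical; the residual spread carried into the projection is a reminder of how much uncertainty these "what if" statements bear.

```python
import numpy as np

rng = np.random.default_rng(5)

# Linking sample: students who took both clusters' assessments.
ability = rng.normal(0, 1, 400)
y_link = 50 + 10 * ability + rng.normal(0, 5, 400)   # Cluster Y assessment
x_link = 48 + 9 * ability + rng.normal(0, 6, 400)    # Cluster X assessment

# Fit the projection: a simple linear regression of X on Y.
slope, intercept = np.polyfit(y_link, x_link, 1)
resid_sd = np.std(x_link - (intercept + slope * y_link))

# A group observed only on Cluster Y's assessment.
y_only = 50 + 10 * rng.normal(0.3, 1, 200) + rng.normal(0, 5, 200)

# Projected distribution of their Cluster X results: predicted values plus
# residual noise, so group-level summaries reflect the projection's uncertainty.
projected_x = intercept + slope * y_only + rng.normal(0, resid_sd, 200)
print("projected Cluster X mean:", round(float(projected_x.mean()), 1),
      "+/-", round(float(projected_x.std(ddof=1) / np.sqrt(200)), 1))
```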


References

American Council on the Teaching of Foreign Languages. (1989). ACTFL proficiency guidelines. Yonkers, NY: Author.

Angoff, W.H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service, Policy Information Center.

Barton, P.E. (1992). National standards for education: What they might look like. Princeton, NJ: Educational Testing Service, Policy Information Center.

Beaton, A.E., & Zwick, R. (1990). The effect of changes in the National Assessment: Disentangling the NAEP 1985-86 reading anomaly (No. 17-TR-21). Princeton, NJ: National Assessment of Educational Progress/Educational Testing Service.

Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Deming, W.E. (1980). Scientific methods in administration and management. Course No. 617. Washington, DC: George Washington University.

Donlon, T.F. (Ed.) (1984). The College Board technical handbook for the Scholastic Aptitude Test and Achievement Tests. New York: College Entrance Examination Board.

Donlon, T.F., & Livingston, S.A. (1984). Psychometric methods used in the Admissions Testing Program. In T.F. Donlon (Ed.), The College Board technical handbook for the Scholastic Aptitude Test and Achievement Tests. New York: College Entrance Examination Board.

Edgeworth, F.Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51, 599-635.

Edgeworth, F.Y. (1892). Correlated averages. Philosophical Magazine, 5th Series, 34, 190-204.

Glaser, R. (1991). Expertise and assessment. In M.C. Wittrock & E.L. Baker (Eds.), Testing and cognition. Englewood Cliffs, NJ: Prentice Hall.

Green, B. (1978). In defense of measurement. American Psychologist, 33, 664-670.

Gulliksen, H. (1950/1987). Theory of mental tests. New York: Wiley. Reprint, Hillsdale, NJ: Erlbaum.


Joreskog, K.G., & Sorbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.

Keeves, J. (1988). Scaling achievement test scores. In T. Husen & T.N. Postlethwaite (Eds.), International encyclopedia of education. Oxford: Pergamon Press.

Krathwohl, D.R., & Payne, D.A. (1971). Defining and assessing educational objectives. In R.L. Thorndike (Ed.), Educational measurement (2nd Ed.). Washington, DC: American Council on Education.

Lesh, R., Lamon, S.J., Behr, M., & Lester, F. (1992). Future directions for mathematics assessment. In R. Lesh & S.J. Lamon (Eds.), Assessment of authentic performance in school mathematics. Washington, DC: American Association for the Advancement of Science.

Linacre, J.M. (1989). Multi-faceted Rasch measurement. Chicago, IL: MESA Press.

Linn, R.L. (n.d.). Cross-state comparability of judgments of student writing: Results from the New Standards Project. CSE Technical Report 335. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing (CRESST).

Linn, R.L. (in press). Linking results of distinct assessments. Applied Measurement in Education.

Linse, D.J. (1992). System identification for adaptive nonlinear flight control using neural networks. Doctoral dissertation. Princeton University, Department of Mechanical and Aerospace Engineering.

Lord, F.M. (1962). Estimating norms by item sampling. Educational and Psychological Measurement, 22, 259-267.

Lunz, M.E., & Stahl, J.A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425-444.

Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R.L. Linn (Ed.), Educational measurement (3rd Ed.). New York: American Council on Education/Macmillan.

Mislevy, R.J. (1984). Estimating latent distributions. Psychometrika, 49, 359-381.

Mislevy, R.J. (1990). Item-by-form variation in the 1984 and 1986 NAEP reading surveys. In A.E. Beaton & R. Zwick, The effect of changes in the National Assessment: Disentangling the NAEP 1985-86 reading anomaly. Report No. 17-TR-21. Princeton, NJ: Educational Testing Service.


Mislevy, R.J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177-196.

Mislevy, R.J., Beaton, A.E., Kaplan, B., & Sheehan, K.M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.

Payne, D.A. (1992). Measuring and evaluating educational outcomes. New York: Macmillan.

Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd Ed.) (pp. 221-262). New York: American Council on Education/Macmillan.

Porter, A.C. (1991). Assessing national goals: Some measurement dilemmas. In T. Wardell (Ed.), The assessment of national goals. Proceedings of the 1990 ETS Invitational Conference. Princeton, NJ: Educational Testing Service.

Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research/Chicago: University of Chicago Press (reprint).

Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.

Spearman, C. (1904). "General intelligence" objectively determined and measured. American Journal of Psychology, 15, 201-292.

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 161-169.

Stocking, M.L., Swanson, L., & Pearlman, M. (1991). Automated item selection (AIS) methods in the ETS testing environment. Research Report 91-5. Princeton, NJ: Educational Testing Service.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.

U.S. Department of Education. (1990). National goals for education. Washington, DC: Author.

Velleman, P.F. (1988). Data Desk Professional 2.0 [computer program]. Northbrook, IL: Odesta.

Wallechinsky, D. (1988). The complete book of the Olympics. New York: Viking Penguin.


Wilson, M.R. (1992). The integration of school-based assessments into a state-wide assessment system: Historical perspectives and contemporary issues. Unpublished manuscript prepared for the California Assessment Program. Berkeley, CA: University of California, Berkeley.
