
DOCUMENT RESUME

ED 458 220    TM 033 373

AUTHOR          Bourque, Mary Lyn, Ed.
TITLE           Proceedings of Achievement Levels Workshop (Boulder, Colorado, August 20-22, 1997).
INSTITUTION     National Assessment Governing Board, Washington, DC.
PUB DATE        1998-00-00
NOTE            118p.
AVAILABLE FROM  National Assessment Governing Board, 800 North Capitol Street, NW, Suite 825, Washington, DC 20002-4233. For full text: http://www.nagb.org.
PUB TYPE        Collected Works Proceedings (021)
EDRS PRICE      MF01/PC05 Plus Postage.
DESCRIPTORS     *Academic Achievement; Elementary Secondary Education; Item Response Theory; National Competency Tests; *Standards; *Test Construction; *Validity
IDENTIFIERS     National Assessment Governing Board; *National Assessment of Educational Progress; Standard Setting

ABSTRACT
The papers presented at this symposium explore each component of the National Assessment of Educational Progress (NAEP), from the development of the assessment frameworks through the reporting of assessment results. The papers are: (1) "NAEP Frameworks and Achievement Levels" (Robert A. Forsyth); (2) "Assembly of Test Forms for Use in Large-Scale Educational Assessments" (Wim J. van der Linden); (3) "A Brief Introduction to Item Response Theory for Items Scored in More Than Two Categories" (David Thissen, Kathleen Billeaud, Lori McLeod, and Lauren Nelson); (4) "Some Ideas about Item Response Theory Applied to Combinations of Multiple-Choice and Open-Ended Items: Scale Scores for Patterns of Summed Scores" (Kathleen Billeaud, Kimberly Swygert, Lauren Nelson, and David Thissen); (5) "Enhancing the Validity of NAEP Assessment Level Score Reporting" (Ronald K. Hambleton); (6) "1998 Civics and Writing Level-Setting Methodologies" (ACT, Inc.); (7) "The Criticality of Consequences in Standard Setting: Six Lessons Learned the Hard Way by a Standard-Setting Abettor" (W. James Popham); and (8) "Acknowledgments and Appendices." Two appendixes list participants and present the conference agenda. (SLD)

Reproductions supplied by EDRS are the best that can be made from the original document.


Proceedings of Achievement Levels Workshop

Boulder, Colorado
August 20-22, 1997



What Is The Nation's Report Card?

The Nation's Report Card, the National Assessment of Educational Progress (NAEP), is the only nationally representative and continuing assessment of what America's students know and can do in various subjects. Since 1969, assessments have been conducted periodically in reading, mathematics, science, writing, history/geography, and other fields. By making objective information on student performance available to policymakers at the national, state, and local levels, NAEP is an integral part of our nation's evaluation of the condition and progress of education. Only information related to academic achievement is collected under this program. NAEP guarantees the privacy of individual students and their families.

NAEP is a congressionally mandated project of the National Center for Education Statistics, the U.S. Department of Education. The Commissioner of Education Statistics is responsible, by law, for carrying out the NAEP project through competitive awards to qualified organizations. NAEP reports directly to the Commissioner, who is also responsible for providing continuing reviews, including validation studies and solicitation of public comment, on NAEP's conduct and usefulness.

The National Assessment Governing Board (NAGB) was established under section 412 of the National Education Statistics Act of 1994 (Title IV of the Improving America's Schools Act of 1994, P.L. 103-382). The Board was established to formulate policy guidelines for NAEP. The Board is responsible for selecting subject areas to be assessed, developing assessment objectives, identifying appropriate achievement goals for each grade level and subject tested, and establishing standards and procedures for interstate and national comparisons.

The National Assessment Governing Board

Mark D. Musick, Chair (1999), President, Southern Regional Education Board, Atlanta, Georgia

Mary R. Blanton, Vice Chair (1998), Attorney, Salisbury, North Carolina

Patsy Cavazos (1998), Principal, W.G. Love Accelerated School, Houston, Texas

Catherine A. Davidson (1998), Secondary Education Director, Central Kitsap School District, Silverdale, Washington

Edward Donley (1997), Former Chairman, Air Products & Chemicals, Inc., Allentown, Pennsylvania

James E. Ellingson (1998), Fourth-Grade Classroom Teacher (Retired), Probstfield Elementary School, Moorhead, Minnesota

Honorable John M. Engler (Member Designate), Governor of Michigan, Lansing, Michigan

Thomas H. Fisher (1999), Director, Student Assessment Services, Florida Department of Education, Tallahassee, Florida

Michael J. Guerra (2000), Executive Director, National Catholic Education Association, Secondary School Department, Washington, D.C.

Edward H. Haertel (1999), Professor, School of Education, Stanford University, Stanford, California

Lynn Mariner, President, Cincinnati Board of Education, Cincinnati, Ohio

William J. Moloney (1999), Commissioner of Education, State of Colorado, Denver, Colorado

Honorable Annette Morgan (1998), Former Member, Missouri House of Representatives, Jefferson City, Missouri

Mitsugi Nakashima (2000), First Vice Chair, Hawaii State Board of Education, Honolulu, Hawaii

Michael T. Nettles (1999), Professor of Education and Public Policy, University of Michigan, Ann Arbor, Michigan, and Director, Frederick D. Patterson Research Institute, United Negro College Fund

Honorable Norma Paulus (1999), Superintendent of Public Instruction, Oregon State Department of Education, Salem, Oregon

Jo Ann Pottorff (2000), Member, Kansas House of Representatives, Wichita, Kansas

Honorable William T. Randall, Chair (1998), Former Commissioner of Education, State of Colorado, Denver, Colorado

Diane Ravitch (2000), Senior Research Scholar, New York University, New York, New York

Honorable Roy Romer (1998), Governor of Colorado, Denver, Colorado

Fannie L. Simmons (1998), Math Coordinator, District 5, Lexington/Richland County, Ballentine, South Carolina

Adam Urbanski (1998), President, Rochester Teachers Association, Rochester, New York

Deborah Voltz (1999), Assistant Professor, Department of Special Education, University of Louisville, Louisville, Kentucky

Marilyn A. Whirry (1999), Twelfth-Grade English Teacher, Mira Costa High School, Manhattan Beach, California

Dennie Palmer Wolf (2000), Senior Research Associate, Harvard Graduate School of Education, Cambridge, Massachusetts

C. Kent McGuire (Ex-Officio), Acting Assistant Secretary of Education, Office of Educational Research and Improvement, U.S. Department of Education, Washington, D.C.


The Nation's Report Card

Proceedings of Achievement Levels Workshop
Boulder, Colorado
August 20-22, 1997

Edited by
Mary Lyn Bourque
National Assessment Governing Board

National Assessment Governing Board
U.S. Department of Education


National Assessment Governing Board

Mark Musick, Chair
Mary R. Blanton, Vice Chair
Roy Truby, Executive Director
Mary Lyn Bourque, Assistant Director for Psychometrics and Committee Staff

September 1998

Suggested Citation

Bourque, M.L. (Ed.). Proceedings of Achievement Levels Workshop. Washington, DC: National Assessment Governing Board, 1998.

For more information, contact:

National Assessment Governing Board
202-357-6938

For ordering information on this report, write:

National Assessment Governing Board
800 North Capitol Street, NW
Suite 825
Washington, DC 20002-4233

This report is also available on the World Wide Web: http://www.nagb.org


Table of Contents

Introduction

Section 1
NAEP Frameworks and Achievement Levels
Robert A. Forsyth, University of Iowa

Section 2
Assembly of Test Forms for Use in Large-Scale Educational Assessments
Wim J. van der Linden, University of Twente, The Netherlands

Section 3
A Brief Introduction to Item Response Theory for Items Scored in More Than Two Categories
David Thissen, Kathleen Billeaud, Lori McLeod, Lauren Nelson, University of North Carolina at Chapel Hill

Section 4
Some Ideas about Item Response Theory Applied to Combinations of Multiple-Choice and Open-Ended Items: Scale Scores for Patterns of Summed Scores
Kathleen Billeaud, Kimberly Swygert, Lauren Nelson, David Thissen, University of North Carolina at Chapel Hill

Section 5
Enhancing the Validity of NAEP Achievement Level Score Reporting
Ronald K. Hambleton, University of Massachusetts, Amherst

Section 6
1998 Civics and Writing Level-Setting Methodologies
ACT, Inc., Iowa City, IA

Section 7
The Criticality of Consequences in Standard Setting: Six Lessons Learned the Hard Way by a Standard-Setting Abettor
W. James Popham, University of California, Los Angeles

Section 8
Acknowledgements and Appendices


Introduction

The National Assessment Governing Board's policy on student performance standards states that the achievement levels should influence all aspects of the National Assessment of Educational Progress (NAEP) assessments, from the development of the assessment frameworks through the reporting of assessment results. The purpose of these proceedings is to explore each component of the assessment, from drawing board to Boardroom, in order to understand more fully the relationships between the performance standards and these assessment components, and the mutual impacts they have on each other.

In Section 1, Robert Forsyth from the University of Iowa examines the relationship between the assessment frameworks and the levels. He discusses the general purposes of the frameworks and the preliminary achievement level descriptions, those statements of content that students should know and be able to do at each level. His paper also examines selected characteristics of the frameworks, e.g., item formats, breadth of coverage, and the complexity of the cognitive dimensions, and how such characteristics influence both the preliminary and final achievement level descriptions.

In Section 2, Wim van der Linden from the University of Twente in the Netherlands describes two test assembly procedures used in large-scale assessments. One procedure assigns items to forms in units called blocks, as in NAEP, while the second method assigns unique items from the item pool to test forms. The author discusses the relationship between the characteristics of the assessment, such as the size of the item pool and the number of desired forms, and the recommended methodology for test assembly.

In Sections 3 and 4, David Thissen and his colleagues at the University of North Carolina at Chapel Hill provide an introduction to item response theory for scoring assessments, such as NAEP, that include both multiple-choice items and items scored in more than two categories. The relationship between the score scales and the standard-setting methodologies is important since setting performance standards is affected by data coming from different item types. Most NAEP assessments use mixed item formats, except the writing assessment, which employs only extended constructed responses to prompts. How these assessments are scored and scaled can have significant consequences for the standard-setting results.

In Section 5, Ronald Hambleton from the University of Massachusetts at Amherst explores the issue of score reporting in NAEP and presents the results of a small-scale study of the understandability of NAEP score reports. Hambleton also provides some guidance on how to improve NAEP score reports and comments on the usefulness of market-basket reporting for NAEP, an index similar to the Consumer Price Index.

Section 6 provides the reader with a look at the proposed methodologies for developing the student performance standards on the 1998 NAEP civics and writing assessments. This section outlines the key features of the 1998 proposal and the overarching principles for developing the levels.


Finally, in Section 7, W. James Popham, professor emeritus from the University of California at Los Angeles, provides a commentary on the criticality of consequences in standard setting. In his inimitable style, Popham offers six lessons learned the hard way by a standard-setting abettor. Popham calls on his many years of experience in implementing standard-setting initiatives for more than three dozen high-stakes tests for students, teachers, and administrators, and shares his wisdom on the subject.

The National Assessment Governing Board hopes that the issues raised by the authors, and the solutions proposed, will extend the national conversation on standard setting, and will benefit not only the National Assessment, but all those individuals and agencies whose responsibilities include setting performance standards in various academic subjects.


SECTION 1

NAEP Frameworks and Achievement Levels

Robert A. Forsyth, University of Iowa

August 1997


NAEP Frameworks and Achievement Levels

The National Assessment of Educational Progress (NAEP) is often identified as the nation's most comprehensive and reliable indicator of student achievement. In a 1993 report, the National Academy of Education (NAE) noted that NAEP is "an unparalleled source of information about U.S. students' academic achievement in many important subject areas" (National Academy of Education, 1993b, p. xix).

Because NAEP uses a careful sampling design, employs stringent security measures, has a high participation rate (for grades 4 and 8), collects considerable collateral information, assesses important content domains, and provides trend data, it has become one of the key sources of information not only about student achievement but also about other aspects of education in the United States. During the past 30 years, NAEP has earned its reputation as "The Nation's Report Card." The development of assessments that have gained such acceptance by both educators and the public has not been a casual undertaking. Listed below is a simplified outline of the complex assessment process that NAEP follows to provide achievement information:

1. Develop content framework and assessment/exercise specifications.

2. Construct an item pool to fit the specifications.

3. Gather data related to item functioning.

4. Select items for the assessment.

5. Administer the assessment.

6. Analyze the assessment data.

7. Report assessment information.

Each component represents an important aspect of the overall process.1 The components are interrelated, but represent markedly different types of activities. If any component is not well designed and implemented, the usefulness of the information provided will be questioned.

In 1989, the National Assessment Governing Board (NAGB) decided that the primary reporting mechanism for subsequent assessments should be a standards-based system. Specifically, three levels of achievement (Basic, Proficient, and Advanced) were to be used as the basis for reporting NAEP results. Inherent in such a reporting system is the need for descriptions of what students performing at these levels should be able to accomplish and a process to translate these descriptions into performance levels on the NAEP scale. Recent assessments, therefore, have incorporated procedures for reporting results by achievement levels.


1. This general process is similar to that used in the development of any large-scale achievement test.

This paper considers the relationship between the achievement levels and the content frameworks and assessment/exercise specifications of the NAEP process. The paper is divided into three major sections. The first section provides an overview of the purposes of the assessment frameworks and specifications and notes some general evaluations of both. The recent introduction of preliminary achievement levels into the overall assessment process is then considered. The final section identifies several specific characteristics of the framework specifications and provides a discussion of the impact these characteristics might have on the achievement levels. A concluding statement ends the paper.

General Purposes of Frameworks

As noted above, the first activity in the NAEP assessment process is the development of the framework and specifications. Actually, two publications usually result from this activity. One of these is the framework for the assessment [e.g., Geography Framework for the 1994 National Assessment of Educational Progress (National Assessment Governing Board, 1994b)] and the other is the specifications for the assessment [e.g., Geography Assessment and Exercise Specifications for the 1994 National Assessment of Educational Progress (National Assessment Governing Board, 1994a)].2 Typically, the specifications document represents an elaboration of the information in the framework document. For most assessment areas, this elaboration provides a more comprehensive definition of the content domain than is given in the framework document.3

This paper uses the term "framework" to refer to the information in both documents unless otherwise indicated.

The following passage, taken from the 1994 Geography Framework, describes the general purposes of frameworks:

The framework represents a comprehensive overview of the most essential outcomes of students' geography education at the prescribed grade levels as determined by the consensus committees and by the testimony of numerous witnesses at three public hearings. Designed to guide the development of assessment instruments, the framework cannot encompass everything that is taught in geography in all of the nation's classrooms, much less everything that should be taught. Nevertheless, this broad and innovative framework attempts to capture the range of geography content and thinking skills that students should possess as they progress through school. The framework's content embraces the complex problems of modern life that students will inevitably encounter both inside and outside their classrooms. It should be viewed, therefore, both as a guide for assessment and a potential tool for crafting a relevant and contemporary geography curriculum as it reflects the discipline's involvement in the complexities of contemporary issues. (National Assessment Governing Board, 1994b, pp. 2-3)

As indicated in this statement, frameworks provide detailed guidance for the construction of the exercise (item) pool. The types of stimulus material to be used, the percentage of items in each unique combination of categories from the content and cognitive dimensions, the scoring criteria for constructed-response exercises, and the percentage of testing time to be devoted to specific item formats are among the types of information provided in such frameworks. The statement does not, however, recognize another NAEP activity that depends greatly on the frameworks: the development of achievement levels. As the NAE notes:

The framework must serve as the pivotal link between the assessment and the achievement levels, both as they are described narratively and as they are operationalized into item-level judgments and ultimately cut-scores. (National Academy of Education, 1993a, p. 47)

Figure 1.1 shows the relationships among the frameworks, the assessment, and the achievement levels.4

Figure 1.1. A simple model of the relationships among framework, achievement levels, and assessment

2. A single publication was developed for the 1998 Writing Assessment (National Assessment Governing Board, 1998b).

3. For example, 58 (66%) of the 88 pages that make up the Geography Assessment and Exercise Specifications (National Assessment Governing Board, 1994a) are dedicated to detailed content specifications.

In general, NAEP frameworks have been highly regarded by educators. In an evaluation of the 1992 Trial State Assessment, NAE concluded that the mathematics and reading frameworks were acceptable as a starting point for the assessment process. With respect to the mathematics framework, NAE states:

The 1990 NAEP Mathematics Framework and the assessment design principles on which it was built were essentially sound. Furthermore, the framework represented a reasonable compromise between current instructional practices and the standards being put forth by the professional mathematics community at that time. (National Academy of Education, 1993b, p. 51)5

4. Figure 1.1 is similar to a figure published in National Academy of Education (1993a, p. 48). However, the NAE figure shows a reciprocal relationship between cut-scores and the assessment. This relationship did not seem reasonable, at least for initial assessments in an area. The NAE figure also does not indicate that the assessment has an impact on the achievement level descriptions.

5. The 1992 Mathematics Assessment was also based on the 1990 framework.

Similarly, with respect to the 1992 NAEP Reading Framework, NAE concludes:


The reading consensus project produced a framework that represents a substantial advance over previous NAEP Reading Frameworks and is reasonably responsive to most of the current theories and practices in reading. (National Academy of Education, 1993b, p. 54)6

More recently, Mullis (1995, p. 3) observed that "the NAEP assessment frameworks are extremely well done and widely recognized for their breadth and depth of coverage." Likewise, Sireci (undated, p. 8) considered the consensus-building process used to develop the frameworks one of NAEP's great strengths.7 Sireci also contended that "the content and cognitive domains are articulated clearly and are widely accepted by teachers, curriculum specialists, policymakers and other educational practitioners."

If, in fact, the frameworks are as comprehensive and useful as indicated by the above statements, it seems reasonable to conclude that a solid foundation is in place to develop both the exercises and the achievement level descriptions. Of course, even the most critically acclaimed frameworks cannot guarantee that either the initial pool or the final set of exercises will adequately reflect the demands of the specifications. As Sireci notes: "The specifications of impressive content and cognitive frameworks is moot if the assessments do not adequately measure these frameworks" (p. 8).8 Likewise, a similar statement could be made about the development of achievement level descriptions: The availability of an excellent framework does not guarantee that useful achievement level descriptions will be developed.9

In recent NAEP assessments, preliminary achievement level descriptions have been included as part of the frameworks.10 The characteristics and purposes of these preliminary descriptions are discussed in the next section.

6. The 1994 and 1998 Reading Frameworks are identical to the 1992 framework. A subsequent evaluation of the 1994 framework by NAE yielded a similar conclusion: "The expert advisors reaffirmed that the framework's general model of reading for meaning was consistent with current research practice and worked well as the basis for assessment" (National Academy of Education, 1996, p. 16).

7. The Sireci paper was commissioned by the National Academy of Sciences. It seems clear that the paper was submitted in either 1996 or 1997.

8. Later in his paper (pp. 57-59), Sireci suggests several content validity studies that should be done to investigate the congruence between the framework and the actual assessment. Linn and Dunbar (1992) have also suggested that more work needs to be done on content validity issues. When contrasting the efforts put forth to develop the frameworks to the efforts put forth to evaluate the items, they conclude: "We might be well served by focusing more of the attention of subject matter experts and educators on the individual exercises that make up an assessment" (p. 182).

9. An example of this problem is identified in the NAE evaluation of the 1992 Reading Assessment (National Academy of Education, 1993a). As indicated previously, NAE considered the frameworks for this assessment to be adequate. However, the achievement levels were considered inadequate. NAE noted that "participants' lack of familiarity with the Reading Framework affected what they were able to hold in their mind when making item judgments and most certainly explains why the achievement level descriptions developed at the initial meeting had to be revised subsequently to be brought in line with the framework" (pp. 49-50).

10. It should be noted that each framework since 1989 has also included what are usually referred to as "policy" or "generic" achievement level definitions that serve as the starting point for the subsequent achievement level definitions.

Preliminary Achievement Level Descriptions

The 1994 U.S. History and Geography Assessments were the first to include Preliminary Achievement Level Descriptions (PALDs) as part of their frameworks (National Assessment Governing Board, 1994b, 1994e). Subsequently, PALDs also were incorporated into the frameworks for the 1996 Science Assessment (National Assessment Governing Board, 1996b)11 and the 1998 Civics and Writing Assessments (National Assessment Governing Board, 1998a, 1998b).

The lack of congruence between the achievement level descriptions and the exercise pool was a major criticism of earlier assessments and was probably a factor related to the introduction of PALDs as part of the frameworks. (See, for example, National Academy of Education, 1993b; Shepard, 1995.) In addition, given the pivotal role of the framework in developing achievement levels, it seems reasonable for content experts who develop the frameworks to be involved in the achievement level-setting process to some extent.

The importance placed on these PALDs with respect to the exercise-development component of the assessment process is illustrated by the statements below from the U.S. History and Geography Frameworks:12

U.S. History: Exercises must be developed in such a way as to ensure that the item pool is congruent with the framework and corresponds to the achievement level descriptions. (National Assessment Governing Board, 1992, p. 12)

Geography: The item pool should be developed in such a way as to ensure that the content described in the achievement level definitions ... is reflected at each grade level. (National Assessment Governing Board, 1994a, p. 14)

These statements quite explicitly define the purpose of PALDs: to ensure an adequate exercise pool. However, it should be noted that these PALDs implicitly serve a second purpose: to guide the panels of educators and noneducators who will set the Final Achievement Level Descriptions (FALDs). These panels are convened after the assessment has been administered, and members of the original framework committees who established PALDs do not serve on them.13 Under these circumstances, PALDs and FALDs could be markedly different.14

Figure 1.2 shows a modification of the model in figure 1.1 to incorporate PALDs in the assessment process. As illustrated in figure 1.2, FALDs are influenced by three different factors: the framework, PALDs, and the assessment (both the exercises and the examinee's responses to these exercises).

11. The 1996 Science Assessment was originally scheduled for 1994, but did not occur until 1996. The PALDs for this assessment were added after the frameworks and specifications were completed.

12. The frameworks for both the 1996 and 1998 assessments contain similar statements.

13. The use of noneducators as part of the achievement levels panels has been questioned by some measurement experts. For example, Mehrens (1995, pp. 246-247) writes: "I do not see how the general public could make decisions about what fourth or eighth graders should know before being promoted or even classified as advanced, proficient, or basic. I see some logic in the general public being represented on a panel setting a standard on what a high school graduate should know or be able to do [either to graduate or simply to be classified]. However, from a purely methodological point of view [ignoring politics], I would always prefer the panel to be experts on the domain being assessed and, if children are involved, on the developmental level of the children being assessed."

14. For the U.S. History and Geography Assessments, the two sets of descriptions seem fairly similar.

Figure 1.2. Modification of figure 1.1 to include preliminary achievement level descriptions

The possibility that the two sets of achievement level descriptions may differ limits the usefulness of PALDs as a guide for exercise development. In fact, Lazer, Campbell, and Donahue (1996), in an extensive discussion of the development of NAEP objectives, items, and background questions for the 1994 Reading, U.S. History, and Geography Assessments, do not specifically mention the use of PALDs as part of the exercise-development process. They identify a 15-step procedure used to develop the items, and none of these steps indicate that the exercises were evaluated for their congruence with PALDs. However, in each assessment, the development, review, and selection of exercises were guided by an Instrument Development Committee (National Center for Education Statistics, 1996d). Because several members of the Framework Planning Committee were also members of this Instrument Development Committee, PALDs for U.S. History and Geography could have been systematically considered throughout the item-development process.15 Of course, PALDs might serve their most important role as initial statements of what content experts consider to be reasonable achievement levels. Nonetheless, additional information on the role of PALDs in the development of the exercise pool would seem an important part of an overall evaluation of the usefulness of PALDs.

15. The National Academy of Education (1993, p. 123) recommends the establishment of "standing subject-matter panels" and described the functions of such panels as follows: "The panels should provide continuity to the assessment by being involved in all aspects of the process, including formulating the framework and objectives; reviewing items, item-scoring rubrics, and reporting formats; and helping to achieve agreement on narrative descriptions of performance standards and representative illustrative tasks." The members of the Instrument Development Committee who were also members of the Framework Planning Committee seem to be performing some of these functions. It is not clear, however, whether the members also provide input to the subsequent components of the assessment process.

Selected Framework Characteristics and Achievement Levels

This section discusses three characteristics of the frameworks (exercise format, breadth of the content dimension, and clarity and complexity of the cognitive dimension). Their impact on the achievement level descriptions or the performance levels (cut-scores) is also considered.

Table 1.1. Achievement level cut-scores for U.S. History and World Geography, separately for dichotomous and partial-credit items

                                       Grade 4   Grade 8   Grade 12
World Geography
  Basic       Dichotomous Items          182       230       243
              Partial-Credit Items       188       247       272
  Proficient  Dichotomous Items          236       275       295
              Partial-Credit Items       244       291       313
  Advanced    Dichotomous Items          271       306       329
              Partial-Credit Items       286       330       350
U.S. History
  Basic       Dichotomous Items          171       226       264
              Partial-Credit Items       200       261       303
  Proficient  Dichotomous Items          239       282       315
              Partial-Credit Items       246       302       334
  Advanced    Dichotomous Items          272       321       346
              Partial-Credit Items       283       334       365

Source: National Academy of Education, 1996, p. 104

Exercise Format

As noted above, the frameworks provide specific guidelines both for the exercise formats that are to be used (e.g., multiple-choice items and extended responses) and for the distribution of testing time allocated to these formats. For many assessments, detailed item writing guidelines are supplied. For example, the U.S. History specifications list six requirements that alternatives for the multiple-choice items should meet (National Assessment Governing Board, 1992, pp. 16-17).

One of the most pervasive findings from analyses of recent NAEP data is the consistently lower achievement level cut-scores set for items scored dichotomously relative to the cut-scores based on items using a multiple-score scale and permitting partial credit. Examples of the differences in the cut-scores associated with the two item types are shown in table 1.1 for the 1994 U.S. History and Geography Assessments. Results similar to those shown in table 1.1 have also been observed in other assessments.16 Various hypotheses have been proposed to explain the differences. For example, Shepard (1995) finds that flaws in the standard-setting methodology are the primary cause. Kane (1995) raises other possibilities related to scaling and dimensionality problems.

16. Such cut-score differences would be particularly critical if the item specifications for a given assessment area were to change and if comparisons of the results of the earlier assessment with the results of the new assessment were to be made.


Results such as those shown in table 1.1 are disconcerting because, as the National Academy of Education (1996, p. 93) observes, item features (e.g., format, difficulty, and number of points used in the scoring rubric) should be "irrelevant for cut-score determinations." Because these features should be irrelevant to both the setting of the cut-scores and the writing of the achievement level descriptions, the implications of these findings for developing the frameworks and for the role the frameworks play in the achievement level-setting process would seem to be minimal. Even if the causes of the differences in cut-scores are known, should they have an impact on the frameworks? For example, should the testing time allocated to multiple-choice items change because of these cut-score differences? The primary purpose of the frameworks is to enable the development of an assessment that provides the best representation of the achievement of important educational outcomes in a domain. Accomplishing this purpose should not be influenced by data obtained later in the assessment process.17

17. This statement does not mean that such differences should be ignored at all steps in this process. Such results may have implications for the frameworks of future assessments in an area.

Breadth of the Content Dimension

As indicated in the first part of this paper, the frameworks have received considerable praise for their definition of the content domain. However, their role as the "pivotal link" between the assessment and the achievement level descriptions has not been extensively evaluated. In this role, the frameworks guide the writing of PALDs and ultimately, along with the assessment results, the writing of FALDs and the setting of the performance levels. As noted previously, achievement level descriptions indicate what students at these levels should be able to accomplish. As such, they represent what are labeled "criterion-referenced (CR) interpretations" of performance. Millman has observed that well-defined domains alone are "insufficient" to guarantee reasonable CR interpretations. In an article reviewing the history of CR testing, he writes:

Clear and well-explicated domains are insufficient to assure interpretability. If the domain defines a broad construct, such as knowledge of the American Civil War, no matter how well spelled out it is, with a limited number of test items, we still won't know what tasks within that domain the student can and cannot do. We can construct reading proficiency and mathematical reasoning scales. We can place students on such scales, a highly important measurement function. However, we would probably still not know what tasks the student can and cannot do. Low task intercorrelations, that is, task specificity, work against such CR interpretations. Reporting by a narrower domain, such as the Battle of Gettysburg, helps only if enough items are sampled from that domain. (Millman, 1994, pp. 19-20)

Millman also considers the nature of NAEP domains and the possibility of CR interpretations with such domains:


NAEP tests are just not designed to provide, nor do they claim to provide, the promised CR interpretation. Their constructs are too broad. ... Their role as "The Nation's Report Card" requires, for all practical purposes, that progress be reported in broadly defined domains.

Two recent changes in NAEP's operation have been the return of attention to performance assessment and the use of the categories Basic, Proficient, and Advanced to report levels of achievement. Will each of these two shifts add to the CR interpretability of NAEP results? The answers are no and no. (Millman, 1994, pp. 20 and 39)

To illustrate Millman's concern about "broadly defined domains," the specifications of the 1994 Geography and Reading Assessments are considered below.

Detailed content specifications in the 1994 Geography Framework (National Assessment Governing Board, 1994a) provide lists of statements identified as the "Content Outline."18 The number of such statements across the three main content categories (space/place, environment/society, and spatial dynamics/connections) and the number of exercises in the final assessment (National Center for Education Statistics, 1996d) are shown in Table 1.2 below.

Table 1.2. 1994 Geography Assessment

Grade    Number of Statements        Number of Exercises
         in the Content Outline      in the Assessment
4        191                         90
8        195                         125
12       164                         123
Total    550                         338*

*Some exercises were administered at two grade levels. A total of 273 unique exercises were administered.

Given this information, the coverage of the domain by the assessment would still be somewhat limited at each grade level, even if each statement could be measured directly by a single item. However, the number of potential exercises associated with a given statement varies considerably. Some statements seem to require only a single item to measure adequately the implied learning target [e.g., knowing the difference between fertile and infertile soils] (National Center for Education Statistics, 1996a, p. 35).19 For other statements, a large number of exercises could be written, and the number of exercises required to measure the learning target adequately might be difficult to determine. Consider, for example, the following two statements:

1. Use great circle routes to measure distances on a globe. (National Center for Education Statistics, 1996a, p. 35)

2. Understand how patterns and processes in human geography are interrelated in the world, such as how the growth in the number of immigrants often leads to an increasing number of minority groups in a country. (National Center for Education Statistics, 1996a, p. 38)

18. In the context of classroom instruction, these statements would probably be considered "learning targets" or "learning objectives" (Nitko, 1996).

19. All geography statements are taken from the grade 4 content outline.


Clearly, a very large number of exercises could be written to measure the content of the first statement. However, the responses to these exercises would probably be so highly correlated that it might be possible to say that a student could or could not perform this learning target on the basis of just one or two exercises.20,21 Highly correlated means that if a student answers one exercise correctly, then he/she would "very likely" answer the other exercises correctly.
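Stated in probabilistic terms (an illustrative paraphrase rather than the author's own formalism), for two dichotomously scored exercises $X_1$ and $X_2$ written to the same narrow learning target, this sense of "highly correlated" means roughly

$$P(X_2 = 1 \mid X_1 = 1) \approx 1 \quad\text{and}\quad P(X_2 = 0 \mid X_1 = 0) \approx 1,$$

so a single response carries nearly all of the information needed to decide whether the student has mastered the target. The second statement, discussed next, is the contrasting case in which these conditional probabilities fall well below 1.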

Again, for the second statement, a very large number of exercises could be written. One of these could be related to the example given as the last part of the statement. The total number of other possible relationships in human geography that could be included as part of this content is difficult to imagine. Furthermore, a student might "understand" some relationships but not others.22 In other words, if a student answers one exercise correctly, it is not necessarily "highly likely" that he/she would answer the other exercises correctly. (To illustrate this type of situation, specific data are reported below as part of the material related to the breadth of the NAEP Reading Assessment domain.)

The NAEP Reading Assessment provides another illustration of Millman's concern about the breadth of NAEP domains. The assessment/exercise specifications for the 1994 Reading Assessment (National Assessment Governing Board, 1990) differ markedly from the 1994 Geography Assessment Specifications, most importantly because no detailed content outline for the Reading Assessment is included.23 The content dimension in the Reading Framework is divided into three categories: reading for literary experience, reading to be informed, and reading to perform a task. Although no content outline is given, the number of potential stimuli for each domain is enormous. Consider, for example, the reading for literary experience domain. How many "fantasies, fables, fairy tales, myths, mysteries, realistic fiction, adventure stories" could have been identified as possible reading passages for the 1994 Reading Assessment (National Assessment Governing Board, 1990, p. 3)? Obviously, the supply of such material is virtually unlimited, and, therefore, the number of possible exercises is also unlimited.

However, as noted in the discussion of the geography domain, a large number of exercises do not necessarily imply that CR interpretations would be difficult to make. If the exercises based on different passages exhibit little task specificity (i.e., if the exercises are highly correlated), then CR interpretations might still be reasonable.

20. If multiple-choice items were used and, therefore, the possibility of guessing the correct answer was a factor, additional exercises may be desirable. Actually, even if multiple-choice items were not used, measurement errors would still have to be considered.

21. If generalizations across different sizes of globes with different scales is a concern, then other exercises might be needed. However, if the concern is strictly with the procedure for using the great circle routes and not with the arithmetic, then a judgment about the mastery of the procedure might be based on one or two exercises.

22. In this discussion of the second statement, the problems faced by item writers when they attempt to operationalize "understand how" are not considered. These problems are considered in the next section.

23. Detailed content outlines exist for the U.S. History, Science, and Civics Assessments. However, the Writing Assessment, like the Reading Assessment, does not have such an outline. Of course, given the "process nature" of these two domains, the lack of a detailed content outline would probably not be considered a problem.

For the Reading Assessment, "initial understanding" and "developing an interpretation" represent two of the four reading behaviors that are to be assessed regardless of the content category (National Assessment Governing Board, 1990). These two behaviors are defined as follows:

Initial understanding requires a broad, preliminary construction of an understanding of the text. Questions testing this aspect should ask the reader to provide an initial impression or unreflected understanding of what was read. The first question following any passage should be one testing initial understanding.

Developing an interpretation requires the reader to go beyond the initial impression to develop a more complete understanding of what was read. Questions testing this aspect should require a more specific understanding of the text and involve linking information across parts of the text as well as focusing on specific information. (National Assessment Governing Board, 1990, p. 10)

Would questions measuring each behavior be highly correlated over passages? To provide an answer to this question, data from two sources are presented. One source deals with exercises related to "developing an interpretation" behavior and the other with exercises focused on "initial understanding" behavior.

In an earlier paper, I observed the following concerning exercises that would be classified in the "developing an interpretation" category:

As an illustration of this problem [low correlation between exercises], consider the following questions taken from the Iowa Tests of Educational Development:

1. From his manner and formal training, what opinion might people have formed of John Marshall? (28%)

2. What do the last two sentences suggest about Patasonians' acceptance of U.S. aid? (44%)

3. Suppose an uninsured and unemployed motorist damaged someone's car. Which speaker offers a plan that would allow the injured party to collect benefits? (64%)

Each of these items is associated with a particular reading passage, and all three items require the student to reach a conclusion on the basis of the information in the passage. The percentage in parentheses after each question shows the percentage of 10th grade students from a statewide sample in Iowa who answered the item correctly. The varying percentages associated with the three questions above provide evidence that getting one item correct does not guarantee that a second item measuring the same objective (where the objective is defined in fairly broad terms) will be answered correctly. Furthermore, though this is not discernible from the data given above, not all of the 28% who got the first item correct also got the second or third items correct (Forsyth, 1976, pp. 12-13).

The second source of information related to task specificity is a study by Allen and Isham (1996).24 Allen and Isham present data for grade 8 examinees that were collected as part of a special study (based on the NAEP Reader) in the 1994 NAEP Reading Assessment.

24. This study represents a unique investigation of the task specificity issue.


Figure 1.3. Mean scores for item 1

[Graph of mean scores (percentage correct) on generic item 1 for each of the seven stories, plotted separately for the free-choice and forced-assignment groups.]

Source: Allen & Isham, 1996, p. 13

The purpose of their study was:

to verify the assumptions that are made by those who are proponents of the use of choice in the NAEP Reader. These assumptions are that the generic questions have the same meaning no matter what story was read, and that students do best when they are able to read a story that they are interested in, one that they select. (National Center for Education Statistics, 1996d, p. 3)

Only information related to the assumption that the "generic questions have the same meaning" is of interest at this time. The basic data related to this issue were gathered by having each of seven randomly equivalent groups of eighth-grade examinees respond to a different reading passage. After reading these different passages, all examinees answered the same set of generic questions. Although Allen and Isham do not provide the specific set of generic questions, they provide the following definition: "An example of a generic question is one that asks about the appropriateness of the title of the story" (Allen & Isham, 1996, p. 3).25 All the generic questions were the constructed-response type.

To illustrate the outcomes of this study, the data related to the first item (a dichotomously scored item) in the set of 11 generic items are discussed.26 Figure 1.3 shows the mean scores (percentage correct) of the seven representative groups of eighth-grade examinees for the first item.27 The figure shows that these percentage correct values range from approximately 40 percent for story 4 to approximately 80 percent for story 3. Allen and Isham note that the differences among the groups are "quite striking" (Allen & Isham, 1996, p. 8).

25. How the question was stated is not known. One possibility: Is the title of this story appropriate? Briefly explain your answer. A second possibility: What is an appropriate title for this story? Briefly explain your answer.

26. If the specifications for the NAEP Reader passages were the same as those for the Reading Assessment, this first item tested "initial understanding." Perhaps this item asked about the appropriateness of the title.

27. The figure also shows the mean scores for seven groups of students who were allowed to select their reading passages. These means do not pertain to this discussion.

Figure 1.4. Item response functions for item 1

[Plot of the estimated item response functions for generic item 1, one curve for each of the seven stories, with the probability of a correct response (0 to 1.0) on the vertical axis and proficiency (theta) on the horizontal axis.]

Source: Allen & Isham, 1996, p. 13

To provide additional information about the comparability of generic questions across reading passages, each generic question was scaled using item response theory "as if [the question] had the same meaning no matter which story was used" (Allen & Isham, 1996, p. 8). Figure 1.4 shows the item response functions (IRF) for item 1. As Allen and Isham observe, these IRFs differ across the range of the proficiency scale.
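For readers less familiar with the notation, the kind of curve plotted in figure 1.4 can be sketched with a generic two-parameter logistic item response function; this is an illustrative form only, not necessarily the model Allen and Isham estimated:

$$P_j(\theta) = \frac{1}{1 + \exp\!\left[-a_j(\theta - b_j)\right]}$$

Here $\theta$ is the proficiency plotted on the horizontal axis of figure 1.4, and $a_j$ and $b_j$ are the discrimination and difficulty of generic item 1 when it follows story $j$. Saying the item has "the same meaning no matter which story was used" amounts to saying that a single pair $(a, b)$ fits all seven groups; the visibly different IRFs in figure 1.4 indicate that it does not.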

One way to examine the possible impli-cations of the results shown in figure1.4 is to speculate about how these"identical items" might be used in thereporting of NAEP results. The 1994NAEP Report Cards (National Center for

2.0 3.0 4 0 ,

Education Statistics, 1996a, 1996b,1996c) report NAEP results using threedifferent procedures: scale anchoring,item mapping, and achievement levels.28Figure 1.5 is adapted from the NAEP 1994Reading Report Card (National Center forEducation Statistics, 1996b, p. 93) andshows the mapping of selected items onto the reading for literary experiencesubscale. The question is: Where wouldgeneric item 1 map on to this scale?Given the results in figure 1.4, theanswer depends on which story (stories)had been included in the assessment.To illustrate this point, assume that themean and standard deviation of the..lit-erary experience subscale are 259 and37, respectively.29 Given these valuesand using the same item-mapping pro-cedure employed with the NAEP data,

28 The scale-anchoring and item-mapping procedures are described in Appendix B of the NAEP 1994 Reading Report Card for the Nation and the States (National Center for Education Statistics, 1996b).

29 The mean of this subscale is reported as 259 in the NAEP 1994 Reading Report Card for the Nation and the States (National Center for Education Statistics, 1996b, p. 87). However, a standard deviation value for this subscale could not be located. Therefore, the standard deviation for the composite reading scale was selected (ibid., 1996b, p. 307).


Figure 1.5. Mapping of reading items


Figure 6.4. Map of Selected Items on the Reading for Literary Experience subscale for Grades 4, 8, and 12

Each reading question was mapped onto the NAEP literary subscale based on students' performance. The point on the subscale at which a question is positioned on the map represents the subscale score attained by students who had a 65 percent probability of successfully answering the question. Thus, it can be said for each question and its corresponding subscale score: students with proficiency scores above that point on the subscale have a greater than 65 percent chance of successfully answering the question, while those below that point on the subscale have a less than 65 percent chance. (The probability was set at 74 percent for multiple-choice questions.) In interpreting the item map information it should be kept in mind that students at different grades demonstrated these reading abilities with grade-appropriate reading materials.

GRADE 4 GRADE 8 GRADE 12

(329) Explain thematic difference between poems
(314) Explain symbolism of story element
(309) Make intertextual connection to interpret character
(306) Recognize implicit aspect of character
(304) Explain character's perspective
(288) Use metaphor to interpret character
(270) Identify/infer character trait from story event
(260) Identify application of story theme
(258) Connect text ideas to describe character motivation
(255) Identify character's main dilemma
(250) Infer reason for character's perspective
(245) Recognize cause of character's feelings
(243) Use narrative context to define a specific phrase
(240) Identify character's perspective on story event
(235) Recognize reason for character's feelings
(226) Identify explicitly stated cause of action

Source: Adapted from National Center for Education Statistics, 1996b, p. 94


the following estimated scale values would be associated with stories 1, 3, and 4:30

Story     Scale Value
3         222
1         259
4         296

Assume that the descriptor for this first item is: Recognize main topic.31 In figure 1.5, the "Recognize main topic" descriptor would be near the top of the map, in the middle of the map, and at the bottom of the map. Given the purpose of the item-mapping procedure, such an outcome would be somewhat confusing.32,33
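The arithmetic behind these three map locations can be made concrete with a short sketch. It only illustrates the item-mapping computation described in figure 1.5 and footnotes 29-30: the subscale mean (259), the substituted standard deviation (37), and the approximate theta values for stories 3, 1, and 4 are taken from the text, while the function name and the exact theta values are illustrative assumptions rather than quantities reported by Allen and Isham.

    # A minimal sketch of the item-mapping arithmetic described above. The
    # subscale mean (259), the substituted standard deviation (37), and the
    # approximate theta values for stories 3, 1, and 4 (about -1, 0, and +1)
    # come from the text and footnotes 29-30; everything else is illustrative.

    def map_to_subscale(theta, mean=259.0, sd=37.0):
        """Convert a theta value on the proficiency scale to the reporting subscale."""
        return mean + sd * theta

    # Approximate theta values at which examinees reach the 65 percent
    # response-probability criterion for generic item 1, by story.
    theta_at_p65 = {"story 3": -1.0, "story 1": 0.0, "story 4": 1.0}

    for story, theta in theta_at_p65.items():
        print(f"{story}: maps near subscale value {map_to_subscale(theta):.0f}")
    # Prints roughly 222, 259, and 296: one item descriptor, three map locations.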

Not all of the generic questions exhibited the pattern of results shown in figures 1.3 and 1.4. However, based on their evaluations of the 11 generic questions, Allen and Isham conclude:

In summary, some of the generic questions seem to have similar characteristics no matter which story they refer to, but others have very different characteristics depending upon the story. These similar and differing characteristics are reflected in mean item scores and in empirical item response functions. Given that the generic questions do not seem to have the same characteristics across the seven stories, treating the questions as being the same no matter which story was read is inappropriate [italics added]. (Allen & Isham, 1996, p. 8)34

This extensive discussion of the Geography and Reading Frameworks was intended to reinforce Millman's observation that NAEP domains are "broadly defined" and to illustrate the problems such domains create when CR interpretations are attempted. It is important to note that these problems have an impact on any reporting system that attempts to provide CR interpretations for NAEP results. (This was Millman's point also.) Thus, scale-anchoring, item-mapping,


and achievement level procedures must all deal with these problems in some way.

Most of the interpretation problems that reporting procedures encounter are created when a single item is used to represent a construct for which it is

30 These values are relatively crude estimates. However, using a probability value of 0.65 (vertical axis), the θ value associated with this probability is approximately -1 (one standard deviation below the mean) for story 3 and approximately +1 (one standard deviation above the mean) for story 4.

31 Items with such a descriptor would seem to be in the "initial understanding" category.

32 This type of situation already exists to some extent in the item map shown in figure 1.5. Consider the following two descriptors:
(245) Recognize cause of character's feelings
(235) Recognize reason for character's feelings
Likewise, understanding the difference between the behaviors represented by the following two descriptors might be difficult for most people:
(306) Recognize implicit aspect of character
(235) Recognize reason for character's feelings

33 In this example, the three items had scale values from low to high. It could be argued that in this situation only the item with the lowest value should have been included in the map. However, including only this item would be misleading since examinees with proficiency levels at the lower part of the scale could recognize the main topic in only one out of seven reading passages. Other interpretation issues would be raised in different situations. For example, assume that only story 4 was part of the assessment. The only place "Recognize main topic" would appear would be at the top part of the map. In such a situation, it would be assumed that examinees with proficiency levels in the middle of the scale could not recognize main topics. Such an outcome would seem inconsistent with other descriptors in the middle of the scale (e.g., Identify application of story theme).

34 The importance of this conclusion cannot be overstated, even though it is derived from the results of a single study.


an inadequate representation.35 Of the three reporting procedures previously noted, the interpretation burden placed on a single item seems greatest for the item-mapping procedure, as illustrated by the above discussion. However, similar concerns occur with the scale-anchoring procedure because, once again, the descriptions of what examinees can accomplish are based on the results from specific items. To illustrate this concern, consider the results reported in the NAEP 1994 Reading Report Card for the Nation and the States (National Center for Education Statistics, 1996b).

Scale anchoring is used to anchor the reading composite proficiency scale at the 25th, 50th, and 90th percentiles. The scale value for the 25th percentile is 236 (National Center for Education Statistics, 1996b, p. 23). One descriptor for this percentile is "recognize main topics" (p. 85). Assume that generic item 1 in the Allen and Isham study was related to this outcome. What if only stories 1 and 4 had been included in the Reading Assessment? Given the criteria used to identify possible anchoring items, it is highly unlikely that these two items would be available to help describe what examinees near the 25th percentile


can accomplish. Under these conditions, the anchor descriptions probably would have changed.36

The achievement levels procedure would seem to be less susceptible to this particular interpretation problem because the descriptions should not be as dependent on individual items as are the other two procedures.37 However, it would seem reasonable for these descriptions to recognize specifically the limitations placed on the ability to make unqualified statements about what students can accomplish. As the above examples illustrate, large numbers of exercises could be written, even for what might be considered very narrow subdomains such as "recognize main topics" or "understand how patterns and processes in human geography are interrelated in the world," and the interrelations among the exercises within the subdomain may be low. Of course, for NAEP reports, many subdomains are combined before general achievement level statements are made. Given such conditions, perhaps the achievement level descriptions (both PALDs and FALDs) should recognize that as the achievement level increases, the frequency with which examinees either can perform certain behaviors or know the information in certain domains also increases. One illustration of statements describing such a relationship, taken from fourth-grade PALDs of the U.S. History Framework (National

35 In a slightly different context, Stone (1995) discusses a "single item syndrome." Stone writes: "A single item may not be an adequate representation of a body of knowledge and by placing such emphasis on the sanctity of the single item, disturbing results may evidence themselves" (p. 12). Lissitz and Bourque (1995) make a similar point when they wrote that "describing what [individual items] mean on an ability continuum (the NAEP scale) is a high-inference task" (p. 17).

36 As explained in appendix B of the NAEP 1994 Reading Report Card for the Nation and the States (National Center for Education Statistics, 1996b), the anchoring process is considerably more complicated than this simple example indicates. However, the example illustrates an important issue for scale anchoring.

37 Reckase (1993) and Lissitz and Bourque (1995) make similar observations about achievement level procedures relative to scale-anchoring procedures.


Assessment Governing Board, 1994e, p. 49), is shown below:

Basic ... Should be able to identify and describe a few of the most familiar people, events, and documents in American history.

Proficient ... Should demonstrate familiarity with a number of historical people, places, and events.

Advanced ... Should demonstrate a considerable familiarity with historical people, places, and events.38

Of course, standard-setting panel members would face the difficult task of using both these frequency indicators and their knowledge and understanding of the content domain to arrive at the specific performance levels on the NAEP scale.

Clarity and Complexity of the Cognitive Dimension

The cognitive dimension forms an important part of most frameworks.39 The purpose of this dimension is to ensure that the entire spectrum of student outcomes is represented in the exercise pool. For example, the geography framework uses three cognitive categories: knowing, understanding, and applying.40 Each item in the exercise pool is classified into one of these categories. However, Lazer, Campbell, and Donahue observe that such classifications might not be very accurate.41 Concerning the geography classifications, they state:

It should be noted that the classification of items into different cognitive categories -- conducted by both Educational Testing Service staff and members of the assessment development committee -- is likely an imprecise process. (Lazer, Campbell, and Donahue, 1996, p. 62)

Similar statements accompany discussion of the development of the U.S. history (Lazer, Campbell, & Donahue, 1996, p. 53) and reading (p. 43) exercises. Given that The NAEP 1994 Technical Report (National Center for Education Statistics, 1996d) provides no data related to the magnitude of the imprecision in these classifications, formal procedures for investigating this concern were evidently not undertaken.42,43

However, the National Academy of Education investigated the consistency of the cognitive classifications between NAEP item developers and outside experts for the 1992 Mathematics Assessment (grade 4) and the 1992 and 1994 Reading Assessments (grades 4, 8, and 12). Tables 1.3 and 1.4 present some of the results from these investigations.44 For the 157 items in the 1992 fourth-grade Mathematics Assessment, the two groups of raters agreed on the cognitive classifications for 109 (69.4%) items

38 These statements are from the PALDs. The Proficient and Advanced statements were changed in the FALDs. For Proficient, "many" was used in place of "a number of," and for Advanced this statement was omitted.

39 The 1998 Writing Assessment does not have a formal cognitive dimension.

40 Applying is a very broad category. "[It] involves the higher-order thinking processes of classifying, hypothesizing, using inductive and deductive reasoning, and forming problem-solving models" (National Assessment Governing Board, 1994a, p. 8).

41 The accuracy of the classifications in the content categories is usually not questioned.

42 Sireci (undated, p. 57) notes that "an entire literature exists documenting procedures available for determining how well the items comprising a test matches [its] content and cognitive specifications.... However, none of these procedures [has] been applied to NAEP tests."

43 Possible explanations for this imprecision were not provided.

44 The results in table 1.4 are for the 1994 Reading Assessment. Similar results were observed for the 1992 Reading Assessment (National Academy of Education, 1993b, p. 67).


Table 1.3. Process classification of 1992 fourth-grade Trial Assessment Mathematics items: Cross-classified by NAEP and by expert raters (percentages are of the NAEP row totals)

NAEP \ Expert raters        Conceptual Understanding   Procedural Knowledge   Problem Solving   Total (NAEP)
Conceptual Understanding    38 (58%)                   15 (23%)               13 (20%)          66
Procedural Knowledge        1                          25 (78%)               6                 32
Problem Solving             6                          7                      46 (78%)          59
Total (expert raters)       45                         47                     65                157

Source: National Academy of Education, 1993b, p. 61

Table 1.4. Classification of 1994 Reading Assessment items: Expert advisor classifications compared with official NAEP classifications (grades 4, 8, and 12 combined); cell entries are N (% of column total)

Expert advisor \ Official NAEP   Initial Understanding   Developing Interpretation   Personal Response   Critical Stance   Total
Initial Understanding            20 (71%)                24 (16%)                    1 (2%)              4 (4%)            49 (14%)
Developing Interpretation        7 (25%)                 120 (79%)                   9 (16%)             48 (46%)          184 (54%)
Personal Response                NA                      2 (1%)                      39 (71%)            4 (4%)            45 (13%)
Critical Stance                  1 (4%)                  5 (3%)                      6 (11%)             48 (46%)          60 (18%)
TOTAL                            28                      151                         55                  104               338

Source: National Academy of Education, 1996, p. 24

(table 1.3). Of the 66 items classified as conceptual understanding by NAEP, 57.6% were so classified by external raters. Fifteen (22.7%) of these 66 items were classified as procedural knowledge, and 13 (19.7%) were classified as problem solving. For the 338 reading items (all grades) in the 1994 Reading Assessment, the two groups of raters agreed on the cognitive classifications for 227 (67.2%) items (table 1.4). Less than half (46.2%) of the items that NAEP classified as critical stance items received a similar classification from the external raters. The external raters also classified 46.2% of the items that NAEP classified in the critical stance category as developing interpretation items.
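The agreement rates quoted in this paragraph are simply the diagonal counts of tables 1.3 and 1.4 divided by the total number of items. The short sketch below reproduces them; the data are taken directly from the two tables (with the single empty cell entered as zero), and the function name is only illustrative.

    # Sketch: exact agreement between the two sets of raters is the sum of the
    # diagonal cells of the cross-classification divided by the total item count.

    def percent_agreement(table):
        """table: a square cross-classification (rows and columns in the same category order)."""
        diagonal = sum(table[i][i] for i in range(len(table)))
        total = sum(sum(row) for row in table)
        return diagonal / total

    # Table 1.3 (rows = NAEP, columns = expert raters), counts only.
    math_1992 = [
        [38, 15, 13],
        [1, 25, 6],
        [6, 7, 46],
    ]

    # Table 1.4 (rows = expert advisors, columns = NAEP), counts only.
    reading_1994 = [
        [20, 24, 1, 4],
        [7, 120, 9, 48],
        [0, 2, 39, 4],
        [1, 5, 6, 48],
    ]

    print(f"1992 mathematics agreement: {percent_agreement(math_1992):.1%}")    # 69.4%
    print(f"1994 reading agreement:     {percent_agreement(reading_1994):.1%}")  # 67.2%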

Data similar to those in tables 1.3 and 1.4 are not available for recent assessments (e.g., 1994 U.S. History and Geography Assessments). However, given the mathematics and reading results and the observations by Lazer, Campbell, and Donahue (1996), considerable rater disagreement with respect to the cognitive classifications seems likely. In content areas such as U.S. history and geography, one possible reason for the imprecision in the classification process is related to the interaction of examinees' past experiences and the content of the item. Consider, for example, a content statement from the geography content outline noted previously: "Know the difference between fertile and infertile soils." Assume that the item writer decides to measure this learning target using the question: What is the difference between fertile and infertile soils?45 For those fourth-grade students who have encountered discussions of the difference between the two types of soil and who remember that discussion, this item would belong in the "knowing" category. However, for students who have not encountered such a discussion, this item might require the use of some of the higher-order skills in the "applying" category so they could answer the item correctly. The same issue surfaces when trying to distinguish between the "knowing" and "understanding" categories. Consider, for example, the question given in the 1994 Geography Framework (National Assessment Governing Board, 1994b, p. 12) to illustrate the "understanding" category: "Why are tropical rain forests located near the equator?" If students have been taught this generalization, is it a "knowing" question or an "understanding" question?46

Underlying this issue is the premise that some new elements must be included in the item before it can be labeled as measuring one of the "higher" cognitive categories. More than 50 years ago, the National Society for the Study of Education (1946), in a book devoted to the measurement of understanding, addressed this issue in the following way:

The Need for Novelty. Much that passes for understanding in the school is primarily memorization. On the other hand, understanding is attested when the pupil dips into his knowledge and fits it into new patterns of thought or action which could not have been directly learned. For example, being able to recite a number of reasons why something happened is no real assurance that the person comprehends why it happened. He may not understand (sense the significance of) the reasons that he has learned. It is no more difficult for the pupil to learn reasons as facts than to learn names and dates as facts. Any genuine test of understanding will, therefore, require that the pupil show his ability to utilize knowledge (perhaps of relationships) to explain or interpret events in new situations or contexts. Accordingly, understanding should not emphasize

45 Although this question may lack creativity, it matches the content statement.

46 In some content areas, this problem is exacerbated because most schools do not have specific classes for these subjects -- consider, for example, fourth-grade geography and U.S. history.



reasons or supposed insights which have been taught and learned as facts, but should call for the use of abilities, both detailed and general, to cope with situations containing at least some novel elements. (National Society for the Study of Education, 1946, p. 40)

Clearly, this "need for novelty" requires persons who classify items in cognitive categories to make assumptions about both the background knowledge needed to answer an item and the experiences of the typical examinee regarding that knowledge. Thus, different classifications of the items will occur, because not all people will make the same assumptions.


Given these concerns about novelty, it would be helpful both to the item developers and to the standard-setting panels if increased numbers of concrete examples were used in the frameworks to illustrate what the content experts view as representing the "higher" cognitive categories. Most of the examples provided as part of the statements in the content outlines are not adequate for defining the cognitive categories clearly. Consider, for example, the following statement from the fourth-grade geography outline:

Analyze the processes that shape cultural patterns, cause trends in population growth, and/or influence travel destinations; for example, the discovery of coal and oil in western Pennsylvania led to the development of the industrial area around Pittsburgh. (National Assessment Governing Board, 1994a, p. 56)

Presumably, the content experts want items related to this statement to be classified in the "applying" category.47 However, the example put forth provides little if any guidance concerning what "analyze" means and how these analysis behaviors should be measured. Including a significant number of specific examples in the higher-level cognitive categories in the frameworks should help item development as well as increase rater agreement with respect to the item classifications in these categories. In addition, standard-setting panels would benefit from the increased clarity such examples would provide for the cognitive dimensions.

The cognitive categories generally represent increasing levels of complexity in thinking processes.48 Furthermore, these processes and the general content categories are usually the same for all grade levels. Thus, if achievement level descriptions reflect what students should be able to do with grade-appropriate material,49 these descriptions should probably exhibit some similarity across grade levels. For example, the achievement level descriptions for Basic should be similar regardless of grade level. In fact, current achievement level descriptions for the Reading Assessment exhibit some similarities, as the statements below illustrate:

47 See footnote 41.

48 For the 1992 and 1994 Reading Assessments, these categories "do not form a sequential hierarchy" (National Assessment Governing Board, 1994c, p. 13).

49 Most Report Cards remind readers that NAEP results are based on "grade-appropriate" materials (see figure 1.5). Of course, the definition of grade-appropriate material presents its own set of problems. These problems will probably be particularly complex when specific grade-level courses typically do not exist (e.g., grade 4 geography).


Grade 4 Basic. Fourth-grade students performing at the Basic level should:

Demonstrate an understanding of the overall meaning of what they read.

Be able to make relatively obvious connections between the text and their own experiences.

Extend the ideas in the text by making simple inferences.

Grade 8 Basic. Eighth-grade students performing at the Basic level should:

Demonstrate a literal understanding of what they read and be able to make some interpretations.

Be able to identify specific aspects of the text that reflect the overall meaning and extend the ideas in the text by making simple inferences.

Recognize and relate interpretations and connections among ideas in the text to personal experience.

Draw conclusions based on text.

Grade 12 Basic. Twelfth-grade students performing at the Basic level should:

Be able to demonstrate an overall understanding and make some interpretations of the text.

Be able to identify and relate aspects of the text to its overall meaning, extend the ideas in the text by making simple inferences, and recognize interpretations.

Make connections among and relate ideas in the text to their personal experiences.

Draw conclusions.

Be able to identify elements of an author's style. (National Center for Education Statistics, 1996b, p. 42)

Two observations about these reading descriptions seem relevant. First, the descriptions are linked more to the four cognitive categories (initial understanding, developing an interpretation, personal reflection and response, and critical stance) than to the three content categories (literary experience, gain information, and perform a task). Second, an implied frequency/complexity dimension is present in these statements. Consider fourth-grade Basic. In the domain of inferences, these students should make "simple inferences." In the domain of personal reflection and response, these students should make "relatively obvious connections." In these statements, the implied frequency dimension is linked to the complexity of the questions the student was asked to answer (simple inferences, relatively obvious connections).50

The frequency/complexity link is not part of the Proficient descriptions for grades 4, 8, and 12. Instead, these descriptions (at all grade levels) include the phrase "extend the ideas in text by making inferences." If "simple" were

50 In the previous example from U.S. history, the frequency dimension was linked to the number of facts (people, places, and events) the examinee should know.


used as a qualifier in the description of Basic (i.e., indicating a subset of all items), then the absence of any qualifier in the Proficient statement would seem to indicate that Proficient students should be expected to make "all" inferences.51

The above example raises the question: How applicable to other content areas is the model of achievement level descriptions represented by the reading assessment descriptions? For example, would this model be useful in developing PALDs in civics or writing?52 The fundamental premise of such a model seems to be that, whereas the content categories (or themes) remain constant across grade levels, the learning targets (i.e., grade-appropriate content) become more comprehensive and more complex.

Conclusions

The NAEP frameworks have received considerable praise both for the process used to develop them and for their comprehensive coverage. This paper has examined the potential influence of a few select framework characteristics on achievement level descriptions or performance levels (cut-scores). The major conclusions of this examination are:

1. Given the breadth of the NAEP domains, the achievement level descriptions should include an explicit "frequency dimension."

2. Given the lack of clarity of the cognitive dimension of the frameworks, both item developers and standard-setting panels would benefit from increasing the number of concrete examples in the frameworks to illustrate how content experts think that higher-level cognitive categories should be measured.

3. Given the nature of the content and cognitive dimensions and the fact that the interpretation of NAEP results assumes grade-appropriate materials, the possible use of achievement level descriptions that are relatively similar across grade levels should be examined.

References

Allen, N. L., & Isham, S. P. (1996). The 1994 NAEP reader and information about the impact of choice on item response functions. Paper presented at the Annual Meeting of the American Educational Research Association, New York.

Forsyth, R. A. (1976). Describing what Johnny can do. Iowa City, IA: Iowa Testing Programs.

Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 10(3), 3-9, 16.

Kane, M. (1993). Comments on the NAE evaluation of National Assessment Governing Board achievement levels. Washington, DC: National Assessment Governing Board.

Kane, M. (1995). Examinee-centered vs. task-centered standard setting. Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments, Volume II (pp. 119-141). Washington, DC: National Assessment Governing Board, National Center for Education Statistics.

51 Given, of course, grade-appropriate materials. (See footnote 49.)

52 Actually, PALDs for writing follow this model (National Assessment Governing Board, 1998b, pp. 52-56). However, PALDs for civics are lists of specific content (from five categories) that students should be able to identify, describe, explain, or defend (National Assessment Governing Board, 1998a, pp. 34-39).


Lazer, S., Campbell, J. R., & Donahue, P. (1996). Developing the NAEP objectives, items, and background questions for the 1994 assessments of reading, U.S. history, and geography. The NAEP 1994 Technical Report (pp. 29-67). Washington, DC: National Center for Education Statistics.

Linn, R. L., & Dunbar, S. B. (1992). Issues in the design and reporting of the National Assessment of Educational Progress. Journal of Educational Measurement, 29, 177-194.

Lissitz, R. W., & Bourque, M. L. (1995). Reporting NAEP results using standards. Educational Measurement: Issues and Practices, 14(2), 14-23, 31.

Mehrens, W. A. (1995). Methodology issues in standard setting for educational exams. Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments, Volume II (pp. 221-263). Washington, DC: National Assessment Governing Board, National Center for Education Statistics.

Millman, J. (1994). Criterion-referenced testing 30 years later: Promise broken, promise kept. Educational Measurement: Issues and Practice, 13(4), 19-20, 39.

Mullis, I. V. S. (1995). NAEP: A look at future needs and means. Washington, DC: National Assessment Governing Board.

National Academy of Education. (1993a). Setting performance standards for student achievement. Stanford, CA: Author.

National Academy of Education. (1993b). The trial state assessment: Prospects and realities. Stanford, CA: Author.

National Academy of Education. (1996). Quality and utility: The 1994 trial state assessment in reading. Stanford, CA: Author.

National Assessment Governing Board. (1990). Assessment and exercise specifications for the National Assessment of Educational Progress in Reading: 1992-1998. Washington, DC: Author.

National Assessment Governing Board. (1992). Assessment and exercise specifications: 1994 National Assessment of Educational Progress in U.S. History. Washington, DC: Author.

National Assessment Governing Board. (1994a). Geography assessment and exercise specifications for the 1994 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board. (1994b). Geography framework for the 1994 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board. (1994c). Reading framework for the 1992 and 1994 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board. (1994d). Science assessment and exercise specifications for the 1994 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board. (1994e). U.S. history framework for the 1994 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board. (1996a). 1998 NAEP civics assessment planning project: Recommended assessment framework and test specifications. Washington, DC: Author.

National Assessment Governing Board. (1996b). Science framework for the 1996 National Assessment of Educational Progress. Washington, DC: Author.


National Assessment Governing Board. (1998a). Civics assessment framework for the 1998 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board. (1998b). Writing framework and specifications for the 1998 National Assessment of Educational Progress. Washington, DC: Author.

National Assessment Governing Board and National Center for Education Statistics. (1995). Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments, Volume II. Washington, DC: National Assessment Governing Board, National Center for Education Statistics.

National Center for Education Statistics. (1996a). NAEP 1994 geography report card. Washington, DC: Author.

National Center for Education Statistics. (1996b). NAEP 1994 reading report card for the nation and the states. Washington, DC: Author.

National Center for Education Statistics. (1996c). NAEP 1994 U.S. history report card. Washington, DC: Author.

National Center for Education Statistics. (1996d). The NAEP 1994 technical report. Washington, DC: Author.

National Center for Education Statistics. (1997). NAEP 1994 mathematics report card for the nation and the states. Washington, DC: Author.


National Society for the Study of Education. (1946). The measurement of understanding: Forty-fifth Yearbook, Part 1. N. B. Henry (Ed.). Chicago: University of Chicago Press.

Nitko, A. J. (1996). Educational assessment of students (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Reckase, M. D. (1993). The defensibility of domain descriptions developed to support the interpretation of achievement levels and anchor points reported on the NAEP scale. Iowa City, IA: American College Testing.

Shepard, L. A. (1995). Implications for standard setting of the National Academy of Education evaluation of the National Assessment of Educational Progress achievement levels. Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments, Volume II (pp. 143-160). Washington, DC: National Assessment Governing Board, National Center for Education Statistics.

Sireci, S. G. (undated). Dimensionality issues related to the National Assessment of Educational Progress. Washington, DC: National Academy of Sciences.

Stone, G. E. (1995). Objective standard setting. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco.


SECTION 2

Assembly of Test Forms

for Use in Large-Scale

Educational Assessments

Wim J. van der Linden, University of Twente, The Netherlands

August 1997


Page 35: Reproductions supplied by EDRS are the best that …DOCUMENT RESUME ED 458 220 TM 033 373 AUTHOR Bourque, Mary Lyn, Ed. TITLE Proceedings of Achievement Levels Workshop (Boulder, Colorado,

Assembly of Test Forms for Use in Large-Scale Educational Assessments

The problem of assembling test forms for use in large-scale assessments involves treating different statistical features and content constraints for each individual form. This paper outlines two possible methods of test assembly for use in such assessments. In one method, the items are assigned directly from the pool to the test forms. The other method follows the balanced incomplete block design currently in use by the National Assessment of Educational Progress (NAEP).

The practice of educational and psychological testing provides many cases in which the focal point is the assembly of sets of multiple test forms rather than a single form. An obvious example is a testing organization assembling multiple parallel forms of a test for administration at different locations or time slots. Another example is the assembly of test forms for use in a large-scale assessment. The latter example generalizes from the former in that the contents of the individual test forms typically differ. The reason for this is twofold: First, the item pools needed to cover the subject areas assessed are generally too large to administer all items to each student. Second, the populations of students addressed in educational assessments have a structure of several levels of nesting (e.g., classes, schools, districts). Hence, cluster sampling is mostly applied, with higher-level clusters being sampled first, followed by units within these clusters. The negative effects on estimation efficiency inherent in cluster sampling can be reduced if test forms are randomized over the units sampled from the same cluster (Johnson, 1992). A second difference encountered in assembling multiple parallel test forms is that in educational assessments, the use of test forms with different statistical features for different groups of students in the sample may be desirable. This option allows for optimal use of the tests as assessment instruments. Later in this paper, examples are given illustrating the use of this option.

This paper presents two different methods for the assembly of test forms for use in large-scale educational assessments. In the first method, the items are assigned directly from the pool to the individual test forms. The second method uses the current practice of the National Assessment of Educational Progress (NAEP), in which items are first organized as blocks, which are then assigned to the individual test forms following a balanced incomplete block design. The two methods are identical in that both use the technique of 0-1 linear programming (LP). The same


technique was used earlier to solve such problems as matching a single test to a target information function, test assembly based on classical parameters, item matching, observed-score pre-equating, and item selection in constrained computerized adaptive testing. Some relevant references include Adema (1990, 1992), Adema, Boekkooi-Timminga, and van der Linden (1991),


Adema and van der Linden (1989), Armstrong and Jones (1992), Armstrong, Jones, and Wu (1992), Boekkooi-Timminga (1987, 1990), Theunissen (1985, 1986), van der Linden (1994, 1997, in press), van der Linden and Boekkooi-Timminga (1988, 1989), van der Linden and Luecht (1996, in press), and van der Linden and Reese (1998).

This paper assumes that information on background variables explaining the achievements of students in the assessment is known prior to the assembly of the tests. These variables can be used to define the various strata and clusters involved in the sampling design or special variables measured in the assessment. It is also assumed that the strata and clusters can be grouped according to their positions on these variables and that information from previous assessments can be used to derive prior ability distributions of these groups. Finally, it is likewise assumed that several forms are assembled for administration with groups with the same prior distribution. This assumption permits spiraling of test forms within groups to reduce the cluster effects. These assumptions underlie the sampling scheme presented in table 2.1. The term "cluster" in the table is used in the remainder of this paper as a generic term to denote a set of clusters of strata grouped on background variables.

Table 2.1. Population structure assumed

Cluster 1   Cluster 2   Cluster 3   Et cetera
Form A      Form D      Form F
Form B      Form E      Form G
Form C                  Form H
                        Form I

The models also assume that the test forms are assembled from a pool of pretested items and that the pretest samples have been large enough to yield accurate estimates of such quantities as the item response theory (IRT) parameters of the items, their optimal response times, or other item attributes of interest. However, the methods will also work, albeit with less accuracy, for tentative estimates of these quantities.

The LP technique is used to assemble the test forms so that the background properties of the clusters of students are matched optimally with the properties of the pretested items. These items are subject to the various constraints that have to be imposed on the contents of the tests. Basically, the technique consists of the following steps (a generic form of the resulting model is sketched after this list):

1. Defining the decision variables that indicate whether an item should be assigned to a test form.

2. Using the variables to formulate a set of linear (in)equalities representing the constraints on the values of the variables that must be imposed.

3. Using the variables to formulate an objective function to optimize the tests.

4. Applying an algorithm or heuristic to calculate a feasible (i.e., admissible) set of values for the variables with an optimal value for the objective function.
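In generic terms, and leaving the test-assembly specifics to the models presented later in this paper, the four steps above amount to setting up a problem of the following form; the symbols c_i, a_{ji}, and b_j are generic placeholders introduced here for illustration, not notation used elsewhere in the paper:

    \text{minimize} \quad \sum_{i=1}^{I} c_i x_i                       \text{(objective function over the decision variables)}

    \text{subject to} \quad \sum_{i=1}^{I} a_{ji} x_i \le b_j, \quad j = 1, \ldots, J,   \text{(linear constraints on the selection)}

    x_i \in \{0, 1\}, \quad i = 1, \ldots, I.                          \text{(0-1 decision variables: item } i \text{ in or out)}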

It is not unusual for test assembly problems to have hundreds of constraints. The combinatorial complexity involved in assembling multiple test forms

with such large numbers of constraints is already enough to motivate the application of 0-1 LP. Assembling several forms by hand, particularly if some of the constraints are tight, may take several days, whereas an appropriately implemented LP problem can be solved to a high degree of precision in minutes. If this complexity were the only motivation to use LP and the interest was exclusively in finding a feasible solution to the problem, the choice of objective function would be arbitrary and any convenient function would do. However, a large set of meaningful objective functions suggests itself for application in test assembly for educational assessments, including the following options:

1. If the interest is not only in estimating properties of the distributions of certain populations but also in reporting individual scores to schools, it may be helpful to maximize the efficiency of the individual ability estimators. The standard IRT approach in this case is to minimize the distances between the information functions of the test forms and targets for each cluster of students in the sample. The background information available on the clusters can be used to derive meaningful targets. This choice of objective function does not necessarily lead to better estimators of the parameters describing the ability distribution in a population of students, but gives an additional opportunity for optimization to improve the statistical features of the estimators of the individual θs. However, improved estimation of the θs makes marginal analysis of group differences more robust against model misspecifications (Mislevy, Beaton, Kaplan, & Sheehan, 1992).

2. The validity of educational assessments is low if the test forms do not motivate students to produce their best answers. One way to increase students' motivation is to give them items with probabilities of success that are neither too low nor too high. An objective function incorporating this idea is one that minimizes the distance between the probabilities of success on the items for typical ability values of the clusters and target values for them.

3. Students vary in the time they need to produce a correct response to an item. At the same time, items differ in the time they require from a student. If individual differences in response time are considered a nuisance variable in the assessment, it may make sense to use an objective function that maximizes the match between the items and the students in the various clusters.

The translation of each objective into a linear objective function is demonstrated in this paper.

If one or more of the clusters in table 2.1 (e.g., certain geographic areas) contained


ability distributions to be estimated, a legitimate goal for each cluster would be minimization of a suitable function on the covariance matrix of the Marginal Maximum Likelihood (MML) estimators of the parameters characterizing its distribution. However, for the mainstream IRT models, such functions appear to be nonlinear in the items. Although in another multiparameter IRT problem an appropriate linearization of the objective


function appears to be possible (van der Linden, 1996), no attempts have been made to deal with the current case.

The rest of this paper is organized as follows. First, a standard problem of optimal test assembly in IRT is formalized as an instance of 0-1 LP. This case is used to show how the first objective above can be given the shape of a linear objective function. In addition, it explains the different types of constraints that can be met in test assembly. The first model for the assembly of multiple forms for educational assessments is then given. In this model, the second objective is illustrated. The same technique of 0-1 LP is finally applied to optimize the assignment of blocks of items in a balanced incomplete block design. In this application, the last objective is shown.

0-1 Linear Programming Models for Test Assembly

It is assumed that the items in the pool are represented by decision variables x_i, i = 1, ..., I, with x_i = 1 denoting that item i is included in the test and x_i = 0 that it is not. These variables model an objective function that minimizes the distances between the test information function and a target function over a series of values θ_k, k = 1, ..., K, in which T(θ_k) is used to denote the target values at these points. The values of the information function of item i at these points are represented by I_i(θ_k). In addition, examples from the following four categories of possible constraints are taken:

1. Constraints needed to fix the length of the test or the length of possible sections at prespecified numbers of items.

2. Constraints needed to model test specifications that deal with categorical item attributes. Examples of categorical item attributes are item content and format, the presence or absence of graphics, and the cognitive level of the item. The distinctive feature of categorical attributes is that each introduces a partition of the item pool with different classes of items associated with different levels of the attribute. The constraints in this category typically specify required distributions of items over the partitions.

3. Constraints needed to model test specifications that deal with quantitative item attributes. These attributes are parameters or coefficients with numerical values, such as item p-values, word counts, and (expected) response times. The constraints in this category usually require sums or averages of the values of these attributes to be in certain intervals.

4. Constraints needed to deal with possible dependencies of the test items in the pool. For example, certain items may have to be administered as a set related to the same text passage, whereas others are not allowed to figure in the same form because they have clues to one another.

For convenience, only one categorical item attribute (e.g., cognitive level) is used, with levels h = 1, ..., H, each corresponding to a different subset of items in the pool, C_h. For each subset, the number of items in the test has to be between n_h^(l) and n_h^(u). Likewise, one quantitative attribute is used, which is chosen to be the length of the items in the pool measured in numbers of lines, l_i.


The total number of lines of text in the test must not exceed the amount of space available, l^(u). Finally, as an example of a dependency of the test items in the pool, it is assumed that items 83, 84, and 85 are not allowed in the same test form.

The model runs as follows:

minimize \sum_{k=1}^{K} \Big[ \sum_{i=1}^{I} I_i(\theta_k) x_i - T(\theta_k) \Big]          (target information function)   (1)

subject to

\sum_{i=1}^{I} I_i(\theta_k) x_i - T(\theta_k) \ge 0,  k = 1, ..., K,                        (positive differences)          (2)

\sum_{i=1}^{I} x_i = n,                                                                      (test length)                   (3)

\sum_{i \in C_h} x_i \le n_h^{(u)},  h = 1, ..., H,                                          (cognitive levels)              (4)

\sum_{i \in C_h} x_i \ge n_h^{(l)},  h = 1, ..., H,                                          (cognitive levels)              (5)

\sum_{i=1}^{I} l_i x_i \le l^{(u)},                                                          (number of lines)               (6)

x_{83} + x_{84} + x_{85} \le 1,                                                              (mutually exclusive items)      (7)

x_i \in \{0, 1\},  i = 1, ..., I.                                                            (definition of x_i)             (8)

As can be seen, all expressions in the model are linear in the variables. Hence, models of this type can be solved for optimal values of their variables using a standard software package for LP. A choice of algorithms and heuristics is also offered in the test assembly package ConTEST (Timminga, van der Linden, & Schweizer, 1996). If the model has a special structure, efficient implementation of some algorithms may be possible (for an example, see Armstrong & Jones, 1992).

The model illustrates the use of the first objective for test assembly in assessments discussed above -- namely, minimization of the distances between the information function of the test and a target for it. Although the distances at points θ_k are minimized from above, an approach in which the distances are minimized from below or from both sides is also possible.
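To make the structure of the model concrete, the sketch below sets up equations (1)-(8) for a small synthetic item pool and solves it with the open-source PuLP package, used here simply as one example of "a standard software package for LP." The pool, the information values, the target, and all bounds are invented for illustration (the enemy set uses items 2-4 of this small pool in place of items 83-85); nothing in the sketch comes from ConTEST or from an operational NAEP pool.

    import random

    import pulp

    random.seed(1)

    I, K, n = 30, 3, 10                                  # pool size, theta points, test length
    info = [[random.uniform(0.1, 0.6) for _ in range(K)] for _ in range(I)]   # I_i(theta_k)
    target = [2.5, 3.0, 2.5]                             # T(theta_k), assumed target values
    level = [i % 2 for i in range(I)]                    # one categorical attribute, H = 2
    lines = [random.randint(3, 12) for _ in range(I)]    # quantitative attribute l_i
    max_lines = 90                                       # l^(u)
    n_low, n_up = {0: 4, 1: 4}, {0: 6, 1: 6}             # n_h^(l), n_h^(u)

    prob = pulp.LpProblem("single_form_assembly", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", range(I), cat="Binary")                    # eq. (8)

    # Objective (1): sum of the (positive) distances to the target information function.
    prob += pulp.lpSum(
        pulp.lpSum(info[i][k] * x[i] for i in range(I)) - target[k] for k in range(K)
    )
    for k in range(K):                                                        # eq. (2)
        prob += pulp.lpSum(info[i][k] * x[i] for i in range(I)) >= target[k]
    prob += pulp.lpSum(x[i] for i in range(I)) == n                           # eq. (3)
    for h in (0, 1):                                                          # eqs. (4)-(5)
        prob += pulp.lpSum(x[i] for i in range(I) if level[i] == h) <= n_up[h]
        prob += pulp.lpSum(x[i] for i in range(I) if level[i] == h) >= n_low[h]
    prob += pulp.lpSum(lines[i] * x[i] for i in range(I)) <= max_lines        # eq. (6)
    prob += x[2] + x[3] + x[4] <= 1                                           # eq. (7)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    selected = sorted(i for i in range(I) if x[i].varValue > 0.5)
    print(pulp.LpStatus[prob.status], "selected items:", selected)

The same pattern scales to the larger models below: only the index sets, the objective, and the right-hand sides change.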


Earlier Approaches to Multiple-Form Assembly

The first approach to the problem of assembling multiple test forms is assembling the forms in a sequential fashion, each time removing those items already selected from the pool and updating the model to fit the next problem. Versions of this approach are followed in many testing programs. However, the method has two serious disadvantages. Suppose, for example, that the problem is one of assembling a set of parallel test forms. If these forms are assembled one after the other, the value of the objective function for the solution to the model is likely to deteriorate with each succeeding form because the items with the best values for their attributes tend to be selected first. As a consequence, the forms would not be parallel. The second disadvantage is the possibility of unnecessary infeasibility of the problem at a later stage in the assembly process. This phenomenon is illustrated in table 2.2, which shows the levels of only a few of the items in a larger pool on two of the attributes.

For simplicity, these attributes are assumed to represent features that the items either have or do not have (e.g., use of graphics in the stem). Suppose that two test forms have to be assembled so that each form must have at least two items with attribute 1 and one item with attribute 2. In a sequential procedure, the selection algorithm might pick both item 2 and item 3 for the first test because they have large contributions to the target. However, as a consequence of this choice, a second form satisfying the same set of constraints is no longer possible. In a simultaneous approach, a sound algorithm would always assign item 2 to one test form and item 3 to the other, thus preventing this infeasibility. (In fact, it is the presence of such attribute structures, which often are not immediately obvious, that makes manual assembly of multiple test forms a notorious process in which, once a feasible solution is found, test assemblers may feel inclined to stop out of relief rather than the certainty that an optimal feasible solution has been found.)

Both disadvantages were already notedin Boekkooi-Timminga (1990). Hersolution to the problem of assemblingmultiple parallel forms was to remodelthe problem using different decisionvariables. If the individual forms aredenoted by f=1,..., F, the new decisionvariables, )(if, are defined such that xif=-1indicates that item i is assigned to testform f and xj1=0 otherwise. Hence, eachitem is assigned directly to a test formand all assignments take place simulta-neously. For the model in equations(1)(7) the result would be as follows:

Table 2.2. Example of unnecessary infeasibility in sequential test assembly

                          Item 1   Item 2   Item 3   Item 4   Item 5
Attribute 1                 x        x                 x        x
Attribute 2                          x        x
Contribution to target     0.35     0.71     0.84     0.29     0.45

x indicates that the item has the attribute.


minimize \sum_{f=1}^{F} \sum_{k=1}^{K} \Big[ \sum_{i=1}^{I} I_i(\theta_k) x_{if} - T(\theta_k) \Big]        (target information function)   (9)

subject to

\sum_{i=1}^{I} I_i(\theta_k) x_{if} - T(\theta_k) \ge 0,  k = 1, ..., K,  f = 1, ..., F,                    (positive differences)          (10)

\sum_{i=1}^{I} x_{if} = n,  f = 1, ..., F,                                                                   (test length)                   (11)

\sum_{i \in C_h} x_{if} \le n_h^{(u)},  h = 1, ..., H,  f = 1, ..., F,                                       (cognitive levels)              (12)

\sum_{i \in C_h} x_{if} \ge n_h^{(l)},  h = 1, ..., H,  f = 1, ..., F,                                       (cognitive levels)              (13)

\sum_{i=1}^{I} l_i x_{if} \le l^{(u)},  f = 1, ..., F,                                                       (number of lines)               (14)

\sum_{f=1}^{F} x_{if} \le 1,  i = 1, ..., I,                                                                 (no overlap)                    (15)

x_{83,f} + x_{84,f} + x_{85,f} \le 1,  f = 1, ..., F,                                                        (mutually exclusive items)      (16)

x_{if} \in \{0, 1\},  i = 1, ..., I,  f = 1, ..., F.                                                         (definition of x_{if})          (17)

Observe that equation (15) has been added to prevent each item from being assigned to more than one form. The total number of constraints has also increased because all constraints in equations (9)-(14) are in force F times and equation (15) entails I new constraints. What is more important, however, is the fact that the number of variables has increased by a factor F. For this reason, models of this type, although powerful for smaller problems, quickly result in memory management problems or prohibitively large computation times for more complicated problems. Therefore, the method presented in the next section is helpful.
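Relative to the single-form sketch given earlier, the simultaneous model only changes the indexing of the decision variables and adds the no-overlap constraint. The fragment below shows just those two new pieces in the same PuLP notation; the objective and the remaining constraints would be the single-form expressions repeated for every form, and the numbers are again purely illustrative.

    import pulp

    I, F = 30, 3                                   # pool size and number of forms
    prob = pulp.LpProblem("simultaneous_assembly", pulp.LpMinimize)

    # Decision variables x_{if}: item i assigned to form f, as in eq. (17).
    x = pulp.LpVariable.dicts(
        "x", [(i, f) for i in range(I) for f in range(F)], cat="Binary"
    )

    # Eq. (15): each item may be assigned to at most one of the F forms.
    for i in range(I):
        prob += pulp.lpSum(x[(i, f)] for f in range(F)) <= 1

    # Eqs. (9)-(14) and (16) would be added here, once per form f, exactly as in
    # the single-form sketch; the model therefore grows by a factor F.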

Large-Scale Educational Assessments: Method 1

As already outlined, a basic problem with a sequential approach to assembling multiple forms is an unbalanced assignment of items to forms. Nevertheless, the approach does have the advantage of producing the smallest number of decision variables and constraints. The simultaneous approach discussed in the previous section deftly solves the problem of imbalance, but its price is a larger number of variables and constraints. The method in this paper, a


version of which was already proposed by Adema (1990) for the assembly of a weakly parallel test, provides for the balancing of test content, while considerably minimizing the increase in the number of variables and constraints.

Basically, the method reduces any multiple-form assembly problem to a series of computationally less intensive, two-form problems. At each step, one form is assembled according to the true specifications. The other form is a dummy assembled according to adapted specifications; its only task is to balance the contents of the current form with later forms. As soon as both forms have been assembled, the items selected for the dummy are returned to the pool, and the process is repeated.

To present the model, two different sets of decision variables are used: one set of variables x_i, i = 1, ..., I, to denote whether (x_i = 1) or not (x_i = 0) item i will be assigned to the form assembled, and another set of variables z_i, i = 1, ..., I, for the same decision with respect to a dummy form. The objective is now the minimization of the distances between target values for the probabilities of success on the items and their likely values for the various clusters. To implement the objective, θ_f* is a value typical of the abilities of the students in the cluster for which test form f is assembled. This value is assumed to be derived from background information on the students. In addition, the target value for these students is denoted as τ_f. The model for assembling form f plus its associated dummy is:

minimize y                                                                                              (objective function)            (18)

subject to

[P_i(\theta_f^*) - \tau_f] x_i \le y,  i = 1, ..., I,                                                    (target for form f)             (19)

[P_i(\theta_f^*) - \tau_f] x_i \ge -y,  i = 1, ..., I,                                                   (target for form f)             (20)

\sum_{g=f+1}^{F} [P_i(\theta_g^*) - \tau_g] z_i \le \Big[ \sum_{g=f+1}^{F} n_g / n_f \Big] y,  i = 1, ..., I,    (target for dummy)      (21)

\sum_{g=f+1}^{F} [P_i(\theta_g^*) - \tau_g] z_i \ge -\Big[ \sum_{g=f+1}^{F} n_g / n_f \Big] y,  i = 1, ..., I,   (target for dummy)      (22)

\sum_{i=1}^{I} x_i = n_f,                                                                                (length of form f)              (23)

\sum_{i=1}^{I} z_i = \sum_{g=f+1}^{F} n_g,                                                               (length of dummy)               (24)

\sum_{i \in C_h} x_i \le n_{hf}^{(u)},  h = 1, ..., H,                                                   (cognitive levels)              (25)

\sum_{i \in C_h} x_i \ge n_{hf}^{(l)},  h = 1, ..., H,                                                   (cognitive levels)              (26)

\sum_{i \in C_h} z_i \le \sum_{g=f+1}^{F} n_{hg}^{(u)},  h = 1, ..., H,                                  (cognitive levels)              (27)

\sum_{i \in C_h} z_i \ge \sum_{g=f+1}^{F} n_{hg}^{(l)},  h = 1, ..., H,                                  (cognitive levels)              (28)

\sum_{i=1}^{I} l_i x_i \le l_f^{(u)},                                                                    (number of lines)               (29)

\sum_{i=1}^{I} l_i z_i \le \sum_{g=f+1}^{F} l_g^{(u)},                                                   (number of lines)               (30)

x_i + z_i \le 1,  i = 1, ..., I,                                                                         (no overlap)                    (31)

x_{83} + x_{84} + x_{85} \le 1,                                                                          (mutually exclusive items)      (32)

z_{83} + z_{84} + z_{85} \le 1,                                                                          (mutually exclusive items)      (33)

x_i \in \{0, 1\},  i = 1, ..., I,                                                                        (definition of x_i)             (34)

z_i \in \{0, 1\},  i = 1, ..., I.                                                                        (definition of z_i)             (35)

The constraints in equations (19) and (20) require the distances between the probabilities of success on the items and their target values to be in the interval (-y, y). The same is done for the dummy test in equations (21) and (22), adapting for differences in test length. The size of the interval is minimized in equation (18). The general shape of this objective function is explained below. In equation (23) the length of form f is set equal to n_f items, whereas in equation (24) the length of the dummy is set equal to the sum of the lengths of all remaining forms. The constraints related to the various cognitive levels in equations (25)-(28), as well as to the total number of lines available for printing the test in equations (29) and (30), are adapted accordingly. The constraint needed to prevent the overlap of items between the test forms now has to be formulated as in equation (31). Finally, the constraints necessary to deal with dependencies of the items must be repeated for the dummy test in equation (33).

Note that the coefficients in the constraints have been made form-dependent to allow for differences in specifications between the forms. Apart from the change of variables, the main modifications in the constraints for the dummy test are on the right-hand side coefficients; these have been made larger to enforce adequate balancing of test contents between forms.
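The outer loop of the method can be sketched as follows (Python pseudocode of the iteration only; assemble_form_and_dummy and aggregate_specs are hypothetical helper names standing in for a solver call on equations (18)-(35) and for pooling the specifications of the remaining forms):

```python
# A minimal sketch of the iterative form-plus-dummy procedure described above.
# assemble_form_and_dummy() and aggregate_specs() are hypothetical helpers, not
# part of any library; the first stands in for solving equations (18)-(35).

def assemble_forms(pool, form_specs):
    """pool: set of item ids; form_specs: one specification per form."""
    forms = []
    remaining = set(pool)
    for f, spec in enumerate(form_specs):
        # The dummy specification aggregates the specs of all later forms.
        dummy_spec = aggregate_specs(form_specs[f + 1:])            # hypothetical
        selected, _dummy = assemble_form_and_dummy(remaining, spec, dummy_spec)
        forms.append(selected)
        # Items in form f leave the pool; the dummy's items are returned to it
        # simply by never being removed.
        remaining -= set(selected)
    return forms
```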

Objective Function

The objective function in the above problem, along with its definition in equations (19)-(22), is of the minimax type; that is, for all items, a common upper bound to the distances between the probabilities of success and the target values is defined. This bound is next minimized. Application of the minimax principle is a convenient way to unify different objectives into a single objective function. Multiple-form test assembly problems always involve a distinct objective for each form. In this example, a different objective is involved for each individual item. Although attractive by itself, the use of objectives at the item level is the only exception in which the suggestion contained in the following section is not expected to work satisfactorily.
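Restated compactly (a standard linearization, added here for clarity rather than quoted from the source): minimizing the largest absolute deviation, $\min_x \max_i \left|\,[P_i(+\,|\,\theta_f^*) - \tau_f]\,x_i\,\right|$, is equivalent to the linear problem $\min_{x,\,y}\ y$ subject to $-y \le [P_i(+\,|\,\theta_f^*) - \tau_f]\,x_i \le y$, $i = 1, \ldots, I$, which is exactly the role played by y in equations (18)-(20).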


Relaxed Decision Variables

Generally, if the problem is still too large to be solved in realistic time, an effective reduction of the computational complexity involved in 0-1 LP can be realized by relaxing the decision variables for the dummy test, that is, by replacing equation (35) by:

$z_i \in [0,1]$,  i=1,..., I.   (36)

This measure may result in a slightly less effective form of content balancing among the various test forms, but since the number of 0-1 variables is halved, the effect on the branch-and-bound step generally used in the algorithms and heuristics in this domain should be dramatic. Since some of the variables are now reals, the problem becomes an instance of mixed integer linear programming (MILP). However, as noted, this approach does not work satisfactorily if the objective addresses individual item attributes. In this example, an attempt is made to match the success probabilities and the targets by having the algorithm assign "partial items" to the dummy form if their actual probabilities of success at θ_f* are too large.
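In a modeling language the relaxation amounts to changing only the variable declaration for the dummy; a minimal sketch, assuming the PuLP package and an illustrative pool size:

```python
# Sketch of the relaxation in equation (36), assuming the PuLP package.
from pulp import LpVariable

items = range(100)  # illustrative pool size

# Equation (34): form variables stay 0-1.
x = LpVariable.dicts("x", items, cat="Binary")

# Equation (36): dummy variables are relaxed to the interval [0, 1],
# turning the 0-1 LP into a mixed integer linear program (MILP).
z = LpVariable.dicts("z", items, lowBound=0, upBound=1, cat="Continuous")
```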

Large-Scale Educational Assessments: Method 2

Another possible method of test assembly in large-scale assessments is derived from the two-stage method currently used by NAEP. In this method, the items in the pool are first assigned to a set of blocks, and then the blocks are assigned to test forms (called "booklets" in NAEP). The assignment in the second stage follows a pattern known as a balanced incomplete block (BIB) design (Johnson, 1992). In this section, a 0-1 LP model for the assignment of blocks to booklets is formulated. The potential contribution of this model is not so much the possibility of automation but that it allows the design to be optimized with respect to an objective function and permits constraints other than those related to the parameters of the BIB design to be involved in the assignment of blocks to booklets.

Page 45: Reproductions supplied by EDRS are the best that …DOCUMENT RESUME ED 458 220 TM 033 373 AUTHOR Bourque, Mary Lyn, Ed. TITLE Proceedings of Achievement Levels Workshop (Boulder, Colorado,

The following notation is needed to present the model. The blocks in the pool are represented by indices i=1,..., N. To represent pairs of blocks, a second index j with the same range of possible values is used. Booklets are denoted by b=1,..., B. Decision variables x_ib are used to determine whether (x_ib=1) or not (x_ib=0) block i is assigned to booklet b. Likewise, variables z_ijb are used to assign pair (i,j) to booklet b. Special constraints will be formulated to keep the values of these two categories of variables consistent. The distribution of blocks across booklets is described by the following parameters:

c1 number of blocks per booklet;

c2 number of booklets per block;

c3 minimum number of booklets per pair of blocks.

To illustrate the possibility of controlling the contents of the booklets beyond the values of these parameters, three different kinds of additional constraints are introduced. First, it is assumed that the blocks are classified by content. Content is represented by a categorical attribute c=1,..., C, where V_c is now defined as the subset of blocks in the pool belonging to content category c and n_c is the number of blocks to be selected from the pool. Second, it is assumed that the length of the booklets must be controlled. The number of lines of text in block i is denoted by a quantitative attribute l_i, whereas the total number of lines available for booklet b is l_b(u). Third, it is assumed that some blocks are "enemies" in the sense that they cannot be assigned to the same booklet. These sets of enemies are denoted by V_e, e=1,..., E.

Finally, the model illustrates the use of an objective function based on response times needed for the items in the test. The variable r_ib can be the response time needed for block i by the students in the cluster for which booklet b is assembled. An ideal definition of this parameter would be a certain percentile of the distribution of the actual response times in the cluster (e.g., the 90th percentile). However, in practice it may be hard to estimate this parameter satisfactorily. A more practical choice, therefore, is to make an educated guess regarding the typical response time needed, based on earlier experiences with the same type of students responding to the type of questions dominant in the booklet. The goal for the total response time needed for booklet b is T_b.


The model is as follows:

minimize $y$   (objective function)  (37)

subject to

$\sum_{i=1}^{N} r_{ib}\,x_{ib} - T_b \le y$,  b=1,..., B,   (ideal response time)  (38)

$\sum_{i=1}^{N} r_{ib}\,x_{ib} - T_b \ge -y$,  b=1,..., B,   (ideal response time)  (39)

$\sum_{i=1}^{N} x_{ib} = c_1$,  b=1,..., B,   (number of blocks per booklet)  (40)

$\sum_{b=1}^{B} x_{ib} = c_2$,  i=1,..., N,   (number of booklets per block)  (41)

$\sum_{b=1}^{B} z_{ijb} \ge c_3$,  i<j=1,..., N,   (number of booklets per pair)  (42)

$x_{ib} + x_{jb} \ge 2 z_{ijb}$,  i<j=1,..., N, b=1,..., B,   (consistent assignment)  (43)

$\sum_{b=1}^{B}\sum_{i \in V_c} x_{ib} \ge n_c$,  c=1,..., C,   (content)  (44)

$\sum_{i=1}^{N} l_i\,x_{ib} \le l_b^{(u)}$,  b=1,..., B,   (length of booklet)  (45)

$x_{ib} + x_{jb} \le 1$,  (i<j) ∈ V_e, e=1,..., E, b=1,..., B,   (enemies)  (46)

$x_{ib} \in \{0,1\}$,  i=1,..., N, b=1,..., B,   (definition of x_ib)  (47)

$z_{ijb} \in \{0,1\}$,  i<j=1,..., N, b=1,..., B.   (definition of z_ijb)  (48)

In equations (37)-(39) the minimax principle is applied again, this time to optimize the total response time needed for the booklets. The constraints in equations (40) and (41) define the size of the booklets by the number of blocks and the number of times a block is assigned to a booklet, respectively, whereas equation (42) sets the minimum number of booklets to which each possible pair is assigned as c_3. The constraints in equation (43) stipulate that each time a pair of blocks is assigned (z_ijb=1), the individual blocks in this pair are also assigned (x_ib=1 and x_jb=1). Observe that the reverse implication is not needed. Due to the constraints in equation (44), at least n_c blocks from content category c have been assigned to a booklet, while the constraints in equation (45) guarantee that the length of booklet b is not longer than l_b(u) lines. Finally, the constraints in equation (46) prevent assigning more than one block from each set of enemies to the same booklet.
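For concreteness, the following sketch sets up the core of the model in equations (37)-(48), assuming the open-source PuLP package; the design parameters and response-time data are illustrative placeholders (a small classical BIB layout with seven blocks and seven booklets is used so that the sketch is feasible), and the content, length, and enemy constraints (44)-(46) are indicated only in a comment:

```python
# Sketch of the block-to-booklet model in equations (37)-(48), assuming PuLP.
# All data are illustrative placeholders, not NAEP design values.
from itertools import combinations
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

N, B = 7, 7                        # blocks and booklets (small feasible illustration)
c1, c2, c3 = 3, 3, 1               # blocks/booklet, booklets/block, booklets/pair
r = {(i, b): 3.0 for i in range(N) for b in range(B)}   # response-time guesses
T = {b: 9.0 for b in range(B)}     # response-time goals per booklet
pairs = list(combinations(range(N), 2))

prob = LpProblem("block_to_booklet", LpMinimize)
y = LpVariable("y", lowBound=0)
x = LpVariable.dicts("x", [(i, b) for i in range(N) for b in range(B)], cat="Binary")
z = LpVariable.dicts("z", [(i, j, b) for (i, j) in pairs for b in range(B)], cat="Binary")

prob += y                                                    # (37) objective
for b in range(B):
    dev = lpSum(r[i, b] * x[i, b] for i in range(N)) - T[b]
    prob += dev <= y                                         # (38)
    prob += dev >= -y                                        # (39)
    prob += lpSum(x[i, b] for i in range(N)) == c1           # (40)
for i in range(N):
    prob += lpSum(x[i, b] for b in range(B)) == c2           # (41)
for (i, j) in pairs:
    prob += lpSum(z[i, j, b] for b in range(B)) >= c3        # (42)
    for b in range(B):
        prob += x[i, b] + x[j, b] >= 2 * z[i, j, b]          # (43)
# Content, booklet-length, and enemy constraints (44)-(46) would be added the
# same way once V_c, l_i, l_b(u), and V_e are available for a real pool.
prob.solve()
```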

Discussion

The decision regarding which of the two methods should be recommended for use in large-scale assessments depends on such parameters as the size of the item pool, the number of test forms, and the number of blocks. Generally, the first method has only one parameter determining the number of variables in the model, namely, the size of the item pool. The number of test forms determines the number of iterative applications of the method. In the first application, the number of variables is equal to 2I. In the next application, n items have been removed from the pool, and the number of variables is 2(I - n). The number of variables in the second method depends on both the number of blocks and the number of booklets. More precisely, the method involves NB variables x_ib and [N(N-1)/2]B variables z_ijb.

For either method, the number of constraints depends on the number of attributes the test assembler wants to control. In addition, both methods involve technical constraints to keep the values of the different kinds of variables consistent. Unfortunately, the number of such constraints depends directly on the size of the item pool or the numbers of blocks and booklets. If the number of constraints becomes too large, overflow of computer memory may occur.

The algorithms and heuristics in ConTEST (Timminga, van der Linden, & Schweizer, 1996) have been able to solve problems with 2,000 to 3,000 0-1 variables and several hundreds of constraints. For method one, these numbers imply that if relaxation of the z_i variables is possible, one can deal with item pools of this size. For method two, if N=20 and B=6, the numbers of variables and equations are typically between 1,000 and 1,500, and problems of this size also seem manageable. If the problem gets too large for either method, it may be split into subproblems dealing with disjoint parts of the item pool and solved separately. In fact, this measure has already been practiced by NAEP as "focused BIB design."
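As a back-of-the-envelope check on these counts (an illustration based on the numbers just cited, not a figure reported in the paper): with N = 20 and B = 6, the second method involves NB = 20 × 6 = 120 variables x_ib and [N(N-1)/2]B = 190 × 6 = 1,140 variables z_ijb, or about 1,260 variables in total, which indeed falls in the 1,000 to 1,500 range.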

Finally, it is observed that, in principle, the problem of assigning items to test forms can be further refined by assigning items directly to positions in test forms. This approach would enable the pretest positions of the items to be taken into account if they could have a possible effect on the values of the item parameters. If no effects of pretest positions are present, the approach can be followed to neutralize possible effects of the position of the items in the assessment test on the results, such as the likelihood of the item not being reached or the student becoming tired, by systematically varying their positions across test forms. However, since the decision variables have to be indexed with respect to three different factors (items, forms, and positions), the increase in the number of variables needed can be kept within reasonable limits only if the possible positions of the items are categorized in a few larger classes (e.g., beginning, middle, and end of the test forms).


Author Note

Correspondence concerning this paper should be addressed to W. J. van der Linden, Department of Educational Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. Electronic mail may be sent to [email protected].

References

Adema, J. J. (1990). The construction of customized two-staged tests. Journal of Educational Measurement, 27, 241-253.

Adema, J. J. (1992). Methods and models for the construction of weakly parallel tests. Applied Psychological Measurement, 16, 53-63.

Adema, J. J., & van der Linden, W. J. (1989). Algorithms for computerized test construction using classical item parameters. Journal of Educational Statistics, 14, 279-290.

Adema, J. J., Boekkooi-Timminga, E., & van der Linden, W. J. (1991). Achievement test construction using 0-1 linear programming. European Journal of Operational Research, 55, 103-111.

Armstrong, R. D., & Jones, D. H. (1992). Polynomial algorithms for item matching. Applied Psychological Measurement, 16, 365-373.

Armstrong, R. D., Jones, D. H., & Wu, I. L. (1992). An automated test development of parallel tests. Psychometrika, 57, 271-288.

Boekkooi-Timminga, E. (1987). Simultaneous test construction by zero-one programming. Methodika, 1, 101-112.

Boekkooi-Timminga, E. (1990). The construction of parallel tests from IRT-based item banks. Journal of Educational Statistics, 15, 129-145.

Johnson, E. G. (1992). The design of the National Assessment of Educational Progress. Journal of Educational Measurement, 29, 95-110.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 131-161.

Theunissen, T. J. J. M. (1985). Binary programming and test design. Psychometrika, 50, 411-420.

Theunissen, T. J. J. M. (1986). Optimization algorithms in test design. Applied Psychological Measurement, 10, 381-389.

Timminga, E., van der Linden, W. J., & Schweizer, D. A. (1996). ConTEST [Computer program and manual]. Groningen, The Netherlands: iec ProGAMMA.

van der Linden, W. J. (1994). Optimum design in item response theory: Applications to test assembly and item calibration. In G. H. Fischer & D. Laming (Eds.), Contributions to mathematical psychology, psychometrics, and methodology (pp. 308-318). New York: Springer-Verlag.

van der Linden, W. J. (1996). Assembling tests for the measurement of multiple traits. Applied Psychological Measurement, 20, 373-388.

van der Linden, W. J. (Ed.) (in press). Optimal test assembly. Special issue of Applied Psychological Measurement.

van der Linden, W. J., & Boekkooi-Timminga, E. (1988). A zero-one programming approach to Gulliksen's matched random subsets method. Applied Psychological Measurement, 12, 201-209.

van der Linden, W. J., & Boekkooi-Timminga, E. (1989). A maximin model for test design with practical constraints. Psychometrika, 54, 237-247.

van der Linden, W. J., & Luecht, R. M. (1996). An optimization model for test assembly to match observed-score distributions. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 405-418). Norwood, NJ: Ablex Publishing Company.

van der Linden, W. J., & Luecht, R. M. (in press). Observed-score equating as a test assembly problem. Psychometrika.

van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22.


A Brief Introduction to Item Response Theory for Items Scored in More Than Two Categories

Many contemporary tests include constructed-response (CR) items, for which the item scores are ordered categorical ratings provided by judges. When the judges' ratings use only two categories, widely known item response theory (IRT) models may be used. However, in most cases, responses to extended CR items or performance exercises are relatively long, and their scoring rubrics specify several graded categories of performance. The use of IRT with data from these kinds of items requires generalized models to accommodate the larger number of responses.

Alternatively, the responses to individual items on modern tests may not be locally independent, as required by the computations that produce IRT scale scores. Many reasons exist for local dependence. Several items may be based on a common stimulus: for example, the questions following a passage on a reading comprehension test, logical reasoning questions following a vignette, and mathematics questions based on some common graphic or illustration. CR items may be divided into parts that appear to be items. For example, a mathematics problem may be followed by a second item that asks for an explanation of the answer, or an examinee may be asked to make a drawing and then provide some written commentary on his or her art. While the parts that comprise these items may be scored separately, the item scores are likely to be correlated due to immediate associations related to the common stimulus or common aspects of the responses. Combining the parts of these locally dependent items into a larger unit, called a testlet (Wainer & Kiely, 1987), permits the use of the valuable machinery of IRT for item analysis and test scoring. Testlets, by nature, are large items that produce scores in more than two categories. They may use the same extensions of IRT as those developed for large items rated by judges in several scoring categories.

In some cases, CR items or testlets may make up the entire test; in other cases, multiple-choice items are also used. In either case, a total score is often required, combining the judged ratings or testlet category scores and the binary item scores on the multiple-choice items if any are present. Simple summed scores may not be very useful in the latter context because of the problems associated with the selection of relative weights for the different items and item types and, in any event, because CR items are often on forms of widely varying difficulty. If the collection of items is sufficiently well represented by a unidimensional IRT model, scale scores may be a viable scoring alternative.

* Excerpts from a draft to appear in D. Thissen & H. Wainer (Eds.), Test Scoring, Chapter 4. A close relationship exists among various contemporary methods for standard setting and item response theory (IRT) scale scores for tests such as the National Assessment of Educational Progress. These brief excerpts from the volume Test Scoring (in preparation) on topics that arise in scoring tests that combine multiple-choice and constructed-response items may be useful in the explication of both scale scores and standard-setting methods that are based on data from different item types.

One of the great advances of IRT over traditional approaches to educational and psychological measurement is the facility with which IRT handles items that are scored in more than two categories. Indeed, in the transition from dichotomously scored items to polytomously scored items, the only changes in IRT are the trace line models themselves. In this paper, the application of item response models to data in which the items have multiple (that is, more than two) possible scores will be considered.

Scale Scores for Items with More Than Two Response Categories

Estimates of Proficiency Based on Response Patterns

This section is very brief because the principles underlying the computation of scale scores using any of the polytomous models are identical to those widely used with IRT models for binary responses. The joint likelihood for any response pattern, $u = \{u_1, u_2, \ldots\}$, is

$L = \prod_{i=1}^{n_{\mathrm{items}}} T_i(u_i\,|\,\theta)\,\phi(\theta)$

regardless of whether each u_i represents a dichotomous or polytomous response category. The latent variable (proficiency) is denoted θ, and φ(θ) is the population distribution [assumed to be N(0,1) in all the examples]. The only new point to be raised here is that the trace lines, T_i(u_i|θ), describing the probability of a response in category u_i for item i as a function of θ, may take different functional forms for the different item types and response formats: It does not matter if the trace lines arise from the one-, two-, or three-parameter logistic or from the graded model or from the nominal model or any of its specializations. MAP[θ] and EAP[θ] may still be computed exactly as they are for models for binary data.
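A minimal numerical sketch of this computation (not code from the chapter; it assumes trace-line functions are available for each item, from whatever model, and uses simple quadrature on a grid):

```python
# Sketch of EAP[theta], s.d.[theta], and MAP[theta] for one response pattern,
# given trace-line functions T_i(u_i | theta) evaluated on a grid.
import numpy as np

def score_pattern(trace_lines, responses, grid=np.linspace(-4, 4, 161)):
    """trace_lines: list of callables T_i(u, theta); responses: list of u_i."""
    posterior = np.exp(-0.5 * grid**2)          # N(0,1) population distribution
    for T_i, u_i in zip(trace_lines, responses):
        posterior = posterior * T_i(u_i, grid)  # multiply the trace lines
    posterior = posterior / np.trapz(posterior, grid)
    eap = np.trapz(grid * posterior, grid)                       # EAP[theta]
    sd = np.sqrt(np.trapz((grid - eap)**2 * posterior, grid))    # s.d.[theta]
    mode = grid[np.argmax(posterior)]                            # MAP[theta]
    return eap, sd, mode
```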

IRT Analysis: The North Carolina Test of Computer Skills, Keyboarding

To illustrate the ideas involved in using IRT models for graded responses, we consider data from a 10-item North Carolina Test of Computer Skills, keyboarding (KB) section. Using the item response data from the 3,104 students who completed the item tryout forms, we fitted the graded (GR) model, the Generalized Partial Credit (GPC) model, and the Partial Credit (PC) model, with the computer program Multilog (Thissen, 1991), all with a Gaussian population distribution for θ. The item parameter estimates for the GR model are listed in table 3.1, and those for the GPC model are listed in table 3.2. For these data, the GR model fits better than the GPC model. Both models use the same number of parameters, and -2log-likelihood is 105 for the GR model and 119 for the GPC model. (In this case, smaller is better.) Because these models are not hierarchically nested, no straightforward way exists to associate a probability statement with the fact that the GR model provides a better fit. Nevertheless, that is the fact.1 The PC model (constraining all the item discrimination parameters to be equal) is testable because it is hierarchically nested within the GPC model. The test of the null hypothesis that the slopes are equal is rejected: G2(2)=7, p = 0.03.

The trace lines for the GR and GPC models are shown in figure 3.1. Note that the trace lines for the two models are very similar. The small differences in the shapes of the curves account for the difference between the goodness of fit of the two models but have little consequence for scoring. Tables 3.3 and 3.4 show the values of EAP[θ] and standard deviation (s.d.)[θ] for the 27 response patterns, using the GR and GPC models, respectively. For the most part, scale scores for the same response pattern from either of the two models differ by less than 0.1 standard units. Exceptions are response patterns that involve some points for items 1 and 2, but 0 for item 3 (010, 110, and 220). Those relatively rare patterns have scale scores that are 0.1 to 0.2 standard units different for the GPC model than for the GR model.

Table 3.1. Graded (GR) model item parameters for the North Carolina Test of Computer Skills, keyboarding (KB) section

Item          a       b1      b2
1. Typing     1.05    -1.40   -0.97
2. Spacing    1.59    -0.52    0.92
3. Length     0.93    -2.97    0.55

Table 3.2. Generalized Partial Credit (GPC) model item parameters for the North Carolina Test of Computer Skills, KB section

Item          a1      a2     a3      c1      c2      c3
1. Typing     -0.84   0.0    0.84    -0.47   0.57    -0.10
2. Spacing    -1.19   0.0    1.19     0.04   0.42    -0.46
3. Length     -0.81   0.0    0.81    -1.40   0.90     0.50

1 In our experience, fitting hundreds of data sets over two decades, it has almost always been the case that the GR model fits rating data better than the GPC model. The difference is usually small, as it is in this case.
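For readers who want to reproduce curves like those in figure 3.1, the following sketch evaluates GR model trace lines from the Table 3.1 parameters; it assumes the usual logistic form of Samejima's graded model (whether a scaling constant D = 1.7 was used in the original calibration is not stated, so none is used here):

```python
# Sketch of graded (GR) model trace lines using the Table 3.1 parameters.
# Assumed form: P*(u >= k) = 1 / (1 + exp(-a (theta - b_k))), and
# T(u = k) = P*(u >= k) - P*(u >= k + 1).
import numpy as np

def gr_trace_lines(a, b, theta):
    """a: slope; b: list of thresholds (b1, b2, ...); returns array of T(u=k | theta)."""
    pstar = [np.ones_like(theta)]                       # P*(u >= 0) = 1
    pstar += [1.0 / (1.0 + np.exp(-a * (theta - bk))) for bk in b]
    pstar += [np.zeros_like(theta)]                     # P*(u >= max + 1) = 0
    return np.array([pstar[k] - pstar[k + 1] for k in range(len(b) + 1)])

theta = np.linspace(-3, 3, 121)
typing = gr_trace_lines(1.05, [-1.40, -0.97], theta)    # item 1, Table 3.1
spacing = gr_trace_lines(1.59, [-0.52, 0.92], theta)    # item 2
length = gr_trace_lines(0.93, [-2.97, 0.55], theta)     # item 3
```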


Figure 3.1. The trace lines for the GR (black) and GPC (gray) models for the North Carolina Test of Computer Skills, keyboarding items (panels labeled "Spacing" and "Length," among others)

Page 55: Reproductions supplied by EDRS are the best that …DOCUMENT RESUME ED 458 220 TM 033 373 AUTHOR Bourque, Mary Lyn, Ed. TITLE Proceedings of Achievement Levels Workshop (Boulder, Colorado,

Table 3.3. EAP[θ] response-pattern scale scores and their corresponding s.d. and frequency, using the GR model for the North Carolina Test of Computer Skills, KB section

Response Pattern   Summed Score   EAP[θ]   s.d.[θ]   Frequency
000                0              -1.55    0.76        64
001                1              -1.12    0.72       236
100                1              -1.03    0.72        25
010                1              -0.74    0.70        31
002                2              -0.71    0.73        96
101                2              -0.68    0.69       337
200                2              -0.67    0.76        17
011                2              -0.43    0.67       142
110                2              -0.34    0.67        58
020                2              -0.30    0.80         8
201                3              -0.31    0.72       137
102                3              -0.28    0.70       138
111                3              -0.08    0.64       340
012                3              -0.04    0.67        57
210                3               0.03    0.69        27
021                3               0.05    0.75        37
120                3               0.15    0.74        11
202                4               0.15    0.74        76
211                4               0.27    0.66       167
112                4               0.27    0.64       212
121                4               0.42    0.70       138
022                4               0.54    0.75        36
220                4               0.65    0.76        14
212                5               0.66    0.67       126
122                5               0.85    0.70       196
221                5               0.87    0.71        97
222                6               1.34    0.73       281

Table 3.4. EAP[θ] response-pattern scale scores and their corresponding s.d. and frequency, using the GPC model for the North Carolina Test of Computer Skills, KB section

Response Pattern   Summed Score   EAP[θ]   s.d.[θ]   Frequency
000                0              -1.53    0.75        64
001                1              -1.09    0.72       236
100                1              -1.07    0.72        25
010                1              -0.89    0.71        31
002                2              -0.67    0.70        96
101                2              -0.66    0.70       337
200                2              -0.65    0.70        17
011                2              -0.49    0.69       142
110                2              -0.48    0.69        58
020                2              -0.31    0.69         8
102                3              -0.27    0.69       138
201                3              -0.26    0.69       137
012                3              -0.10    0.69        57
111                3              -0.09    0.68       340
210                3              -0.08    0.68        27
021                3               0.07    0.68        37
120                3               0.08    0.68        11
202                4               0.13    0.68        76
112                4               0.29    0.69       212
211                4               0.30    0.69       167
022                4               0.45    0.69        36
121                4               0.47    0.69       138
220                4               0.48    0.69        14
212                5               0.69    0.70       126
122                5               0.86    0.71       196
221                5               0.88    0.71        97
222                6               1.30    0.74       281


Figure 3.2. Test information curves for the North Carolina Test of Computer Skills, KB section, as computed using both the GR and the GPC models

Test information curves for this three-item test, as computed using both the GR and GPC models, are shown in figure 3.2. As one would expect given the similarity of the trace lines, the two information curves in figure 3.2 are very similar. The striking feature of the two curves is that they are both very flat across a wide range of θ. Thus, IRT displays the primary advantage of multiple-category scoring: Each item provides information at two (in this case) or more (in general) levels of proficiency. By adjusting the definitions of the scoring categories, it can be reasonably easy to construct a test that measures proficiency almost equally accurately over a wide range of proficiency. For most values of θ near the middle of the scale, the value of test information is about 2.0. Therefore, the standard errors of the scale scores are expected to be about $1/\sqrt{2} \approx 0.7$, as they are in tables 3.3 and 3.4.


Using IRT Scale Scores for Response Patterns to Score Tests Combining Multiple-Choice and Constructed-Response Sections: Wisconsin Third-Grade Reading Field Test

To illustrate the use of IRT scale scores associated with response patterns to score tests comprising both multiple-choice and CR sections, we use data from a field test of a form of Wisconsin's third-grade reading test. The data were obtained from 522 examinees. Here we use the responses to 16 multiple-choice items, as well as the responses to 4 CR items. All 20 items followed a single reading passage. The multiple-choice items were in a conventional four-alternative format; the CR items were open-ended (OE) questions that required a response on a few lines. The OE items were rated on a four-point scale (0-3). We simultaneously fitted the 3PL model to the multiple-choice items and the GR model to the OE items using Multilog (Thissen, 1991) and using a mild Bayesian prior distribution2 for the guessing parameter of the 3PL model (g).

Figure 3.3 illustrates the computation of the IRT scale score for an individual examinee for this 20-item test. The examinee responded correctly to 12 of the 16 multiple-choice items, obtained scores of 2 on 2 of the OE items, and 3 on the other 2, for a total score (if one sums the "points") of 22. IRT response-pattern scoring ignores the summed score and instead multiplies the trace lines shown in figure 3.3. The top panel shows the trace lines for the examinee's responses to the multiple-choice items (12 increasing curves and 4 decreasing curves, for the 12 correct and 4 incorrect item responses); the middle panel shows the trace lines associated with the OE responses. The 2 nonmonotonic curves are the GR trace lines for the 2s, and the 2 increasing trace lines are those for the 3s. The bottom panel of figure 3.3 is the product of the 20 curves in the other 2 panels [and the N(0,1) population distribution curve]. The mode of the curve in the lower panel is MAP[θ], which takes a value of 0.54 (with s.e. = 0.27); that is, it is the IRT scale score in z-score units for this examinee.
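The same computation can be sketched numerically (an illustration only; it reuses the gr_trace_lines helper from the keyboarding sketch above, and the item parameters and responses are placeholders rather than the Wisconsin estimates):

```python
# Sketch of the Figure 3.3 computation: multiply 3PL trace lines for the
# multiple-choice responses, GR trace lines for the open-ended ratings, and the
# N(0,1) density, then locate the mode. Parameters below are placeholders.
import numpy as np

def p3pl(a, b, g, theta):
    """3PL trace line for a correct response."""
    return g + (1.0 - g) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 321)
posterior = np.exp(-0.5 * theta**2)                          # population distribution
mc_items = [(1.2, -0.5, 0.20, 1), (0.9, 0.3, 0.25, 0)]       # (a, b, g, response)
for a, b, g, u in mc_items:
    p = p3pl(a, b, g, theta)
    posterior *= p if u == 1 else 1.0 - p
oe_rating = 2                                                # one OE item rated 2 on 0-3
posterior *= gr_trace_lines(1.1, [-0.8, 0.1, 0.9], theta)[oe_rating]
map_theta = theta[np.argmax(posterior)]                      # MAP[theta], as in Figure 3.3
```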

Figure 3.3. The computation of the scale score for an individual examinee on the Wisconsin third-grade reading test. The top panel shows the 3PL trace lines for this examinee's responses to the multiple-choice items; the middle panel shows the GR model trace lines associated with the open-ended (OE) responses; the bottom panel shows the product of the 20 curves in the other two panels and the N(0,1) population distribution curve.

IRT scale scores computed as in figure 3.3 for each examinee provide a solution to the "weighting problem" for tests such as this one that combine multiple-choice and OE items. Many would question the use of summed "points" to score a test such as this one, asking why one of the rated "points" for the OE responses should equal the value of a correct multiple-choice response. However, the IRT scale scoring process neatly finesses the issue: All of the item responses (open-ended and multiple-choice) are implicitly "weighted"; indeed, the effect of each item response on the examinee's score depends on the other item responses. Each response pattern is scored in a way that best uses the information about proficiency that the entire response pattern provides, assuming that the model summarizes the data accurately.

IRT scale scores computed in this way may vary a good deal for examinees with the same summed score. Figure 3.4 shows a scatter diagram of the scale scores for the combined 3PL and GR model, plotted against the summed score computed taking the number of OE rated "points" literally.3 For some summed scores, the range of IRT scale scores is as much as a standard unit. A good deal of this range is attributable, in this case, to the differential treatment by the IRT model of the OE responses. As shown in figure 3.3, the slopes of the trace lines4 for the OE responses are substantially less than the slopes of the 3PL trace lines in the vicinity of this examinee's score. As a result, the 3PL responses "count" more in the score, and the OE responses "count" relatively less. For examinees other than the one shown in figure 3.3 who also obtained a summed score of 22, the IRT scale score is higher because they responded correctly to more of the highly discriminating multiple-choice items, even though they obtained fewer points for their OE responses.

2 The prior distribution used for the lower asymptote parameter for the four-alternative multiple-choice items was N(-1.1, 0.5) for logit(g). This distribution has a mode of 0.25 for g.

3 Examinees with missing responses are omitted from figure 3.4 because it is not clear how to compute their summed scores in a way that is comparable to those of other examinees.

4 Here, "slope" is taken generally to mean the rate of change of the response probability as a function of θ.

Page 59: Reproductions supplied by EDRS are the best that …DOCUMENT RESUME ED 458 220 TM 033 373 AUTHOR Bourque, Mary Lyn, Ed. TITLE Proceedings of Achievement Levels Workshop (Boulder, Colorado,

Figure 3.4. Scatter diagram of the scale scores for the combined 3PL and GR model, plotted against the summed scores for the Wisconsin third-grade reading test (horizontal axis: summed scores, 0-30; vertical axis: scale scores, roughly -3 to +2)

Assuming that the combined 3PL and GR model accurately represents the data, the IRT scale scores simultaneously provide more accurate estimates of each examinee's proficiency and avoid any need for explicit consideration of the relative "weights" of the different kinds of "points."

The Testlet Concept

The concept of the testlet was introduced in the literature by Wainer and Kiely (1987, p. 190): "A testlet is a group of items related to a single content area that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow." Wainer and Kiely proposed the use of testlets as the units of construction and analysis for computerized adaptive tests (CATs). However, the testlet concept is now viewed as a general-purpose solution to the problem of local dependence (LD) (Yen, 1993). If a pair or cluster of items exhibits LD with respect to the construct being measured by the test as a whole, that pair or cluster of items may be aggregated into a single unit, a testlet. The testlet then yields locally independent responses in the context of the other items in the measure. Testlets and individual items can then be included in an IRT model for item analysis and test scoring.

By definition, a testlet is a kind of (super) test item that yields more than two responses; furthermore, the relative ordering of those responses with respect to the construct being measured may or may not be known a priori. Although traditional approaches to item analysis and test assembly may be stymied by the presence of items with multiple, purely nominal responses, Bock's (1972) IRT model for responses in several nominal categories may be used to provide straightforward item analysis and test scoring. Analysis using the NO model can also be used to determine if the responses are, as a matter of empirical fact, ordered. If they are, then a constrained version of the NO model, like the GPC or PC model or Samejima's (1969) GR model, may be effectively used.

Our first illustration of the testlet idea (Thissen & Steinberg, 1988; Thissen, 1993) used data reported by Bergan and Stone (1985) involving the responses to four items measuring the numerical knowledge of a sample of preschool children and the nominal IRT model. This example represents what we now call response-pattern testlets, in which every pattern of responses to the items in the testlet becomes a response category for the testlet as a whole. An extensive theoretical treatment of processes that may be represented by fitting the NO model to response-pattern testlets has recently been provided by Hoskens and De Boeck (1997), who reanalyze the Bergan and Stone example and also contribute others.

No loss of information is involved in the construction of response-pattern testlets; the data are merely redefined. However, given current technology, response-pattern testlets may include only a few items (two, three, or perhaps four) and only a few responses for each item because the number of response categories for the testlet is the product of the numbers of response categories for the items. That number cannot be larger than, say, 16 and still permit any reasonable amount of data to be used to calibrate the NO model.

When several items follow a common stimulus, it may be better to view the summed score on the test as the sum of the number-correct scores for the subsets of items associated with each of the common stimuli (often passages in reading comprehension tests). Both the nominal and GR models have been used for the analysis of passage-based tests (Thissen, Steinberg, & Mooney, 1989; Wainer, Sireci, & Thissen, 1991). To implement this idea, a test with 10 questions following each of four reading passages is treated as a four-testlet test, and the GR model or some version of the NO model is fitted to the 11 response categories that represent each possible number-correct score. In this case, some loss of information occurs relative to full response-pattern analysis of the test data. However, the loss of information is usually small and is more than compensated for since the testlet analysis is a proper analysis of locally independent responses, while the response-pattern analysis may be distorted by the LD induced by the passages. For this reason, testlet-based IRT analysis yields a more accurate description of the reliability and scale score standard errors for such tests (Sireci, Thissen, & Wainer, 1991).

The reason that pairs or clusters of items should be treated as testlets is sometimes obvious, as in the case of clustered tasks: for example, the pairs of questions on the preschool numerical knowledge test or the questions following a passage on a reading comprehension test. On the other hand, local dependence may also be an unexpected or even surprising empirical phenomenon. Yen (1993) and her colleagues at CTB/McGraw-Hill have successfully used empirical procedures to detect LD on recently constructed performance-based educational assessments. They used a statistic, called Q3, proposed by Yen (1984), to identify LD. However, Q3 may exhibit unpredictable behavior under some circumstances (Chen & Thissen, 1997; Reese, 1995). For binary test items, we use the LD index described in detail by Chen and Thissen (1997). The LD index provides a straightforward analysis of the residuals from the IRT model for each pair of items. If the items are locally independent, then the residuals from the fitted model for each pair of items are statistically independent, and the χ²-distributed statistic is expected to be approximately one. If the items are locally dependent, the LD index is large. When substantial LD is detected (for example, by the LD index) or expected because the test was deliberately constructed with related items, we combine those items into testlets and use the same IRT machinery for test scoring that we use with any test that has items with several response categories.

Figure 3.5. Two items (14 and 15) from the 1991 North Carolina End-of-Course Geometry Test that exhibit substantial local dependence

14. Which term describes ∠1 and ∠2?
A vertical
B complementary
C adjacent
D congruent

15. If m∠A = 60, what is the measurement of the supplement of ∠A?
A 120
B 90
C 40
D 30

Using the Nominal Model for Items Combined into Response-Pattern Testlets: An Example from the North Carolina End-of-Course Geometry Test

A pair of items that exhibit relatively extreme local dependence is shown in figure 3.5. These two items were numbers 14 and 15 on a 1991 test form of the North Carolina End-of-Course Geometry Test and physically adjacent, as shown in the figure. Such physical proximity often exacerbates LD when it exists. Using data from 2,739 examinees, we calibrated the 60-item test and computed the LD index for each pair of items. The value of the LD index for this pair of items was 180. For a statistic that is distributed approximately as a χ² with 1 d.f. in the null case, that is remarkable.

Many more examinees than predicted by the IRT model respond correctly to both items or incorrectly to both items. Examining the items, it is easy to see why. Indeed, several ways can be used to describe the probable reasons for the LD. A succinct description would be that the two items are both "vocabulary" items on a geometry test, and students whose teachers emphasized the memorization of vocabulary would do better on both. A somewhat more elaborate chain of reasoning would explain that item 14 could serve to "give away" the answer to item 15, even for students with limited knowledge: If we assume that among the three terms adjacent, complementary, and supplementary, the first is the easiest to remember, then we may assume that item 14 is easy to answer correctly (by selecting adjacent). However, the figure clearly shows that the angles in item 14 sum to 180°. That implies that complementary, as an incorrect distractor for item 14, cannot be the word that means "sums to 180°." When we turn to item 15, if we remember that supplementary is "one of those words" and means either "sums to 180°" or "sums to 90°," we eliminate alternatives B and C, and then (correctly) choose alternative A because supplementary has to be "sums to 180°" if (from item 14) complementary is not.

In any event, the empirical fact is that the responses to the two items are not locally independent. The solution proposed by Yen (1993) for this kind of LD is to combine the two items into a single testlet and conduct the IRT analysis again. This serves to eliminate the LD and keep the IRT model and its usefulness for scale scoring. (The alternatives are to keep the LD and eliminate the unidimensional IRT model or grossly complicate the model; neither of these ideas is attractive.) In this case, we rescored these two items as a single testlet with four response categories: 0 for response pattern {00}, 1 for response pattern {01}, 2 for response pattern {10}, and 3 for response pattern {11}. We then recalibrated the test using the 3PL model for the remaining 58 items and the NO model for the testlet.
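The recoding itself is trivial; a sketch of the mapping just described:

```python
# Sketch of the testlet recoding described above: responses to items 14 and 15
# (0/1 each) are combined into one four-category testlet response.
def testlet_category(u14, u15):
    """Map the pattern {u14, u15} to 0 ({00}), 1 ({01}), 2 ({10}), or 3 ({11})."""
    return 2 * u14 + u15

assert testlet_category(0, 0) == 0
assert testlet_category(0, 1) == 1
assert testlet_category(1, 0) == 2
assert testlet_category(1, 1) == 3
```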

Figures 3.6-3.8 illustrate the IRT analysis of the response patterns for items 14 and 15. The NO model trace lines for the four response patterns to items 14 and 15 are shown in figure 3.8. The trace lines for {00}, {01}, and {10} are all monotonically decreasing (and nearly proportional to one another) over the useful range of θ. Of course, the trace line for {11} is monotonically increasing. This differs from any pattern that can be obtained by combining two 3PL trace lines, for which the likelihoods of the four response patterns must be ordered, with {11} associated with higher values of θ, {01} and {10} associated with some intermediate values of θ, and {00} associated with the lowest. The 3PL trace lines for items 14 and 15 are shown in figure 3.6, and the products of those two curves (and their complements) are shown in figure 3.7. In an attempt to fit the data, as accurately described by the NO model in figure 3.8, the 3PL model has extremely high values of g (the lower asymptote) for both items. (They are 0.43 and 0.48, respectively.) Nevertheless, the 3PL model must imply that the likelihoods for the response patterns {01} and {10} are associated with relatively higher values of θ than {00}. This causes the misfit with the data that is detected by the LD index.

Figure 3.6. 3PL trace lines for responses to items from the 1991 North Carolina End-of-Course Test in Geometry that exhibit substantial local dependence (curves labeled by item, e.g., "Item 14")

Figure 3.7. The likelihoods for the four possible response patterns for items 14 and 15, computed as products of the trace lines in Figure 3.6 and their complements (panel title: "As 2 3PL-model Items")

Figure 3.8. The nominal model trace lines for the four response patterns to items 14 and 15 of the 1991 North Carolina End-of-Course Test in Geometry (panel title: "As 1 Nominal-model Testlet")

If the 3PL trace lines were used to compute IRT scale scores, the effect would be that examinees who responded correctly to either one of the two items would receive higher scale scores than those who responded correctly to neither. However, trace lines from the NO model for the testlet have different consequences for scale scoring: Effectively, the examinee "gets credit" (i.e., the scale score increases) for response pattern {11}, but the scale score tends to decrease slightly for any of the patterns that include an incorrect response. The testlet combination and subsequent NO model analysis have created a scoring rule for this pair of items that, to anthropomorphize, basically says, "This pair is a single item; you get one point if you respond correctly to both questions and zero points if you miss either." The IRT analysis indicates that using this scoring rule makes this pair of items a better indicator of proficiency in geometry than any that would be obtained treating the two items separately.

We have described response-pattern testlet modeling and scoring as an a posteriori "fix" for observed LD, and the procedure is used this way. However, as Hoskens and De Boeck (1997) and others have pointed out, the availability of this kind of analysis makes it possible for the test constructor to plan or intend to construct item combinations as testlets. An example could be the mathematics item format that asks for the solution to a problem, and then follows up with what appears to be a second question, "Explain your answer." After the solution is scored (possibly as correct or incorrect) and the explanation is rated by judges following some rubric (perhaps on a three-point scale), all of the patterns of {solution score, explanation score} can be treated as the response alternatives for a single testlet, fitted with the NO model.5

Conclusion

IRT models for items with responses scored in more than two categories provide a useful way to compute scale scores for the otherwise difficult data that arise in the context of performance assessments. While the rating categories used by judges to provide the item-level scores for CR items are often arbitrary, the IRT models provide a mechanism to combine those ratings into scores on a useful scale. Weighting problems, once the province of guesswork or committees, are naturally handled in the process of the computation of IRT scale scores; each item is implicitly weighted according to its relation with the aspect of proficiency that is common to all of the items (θ).

The computation of IRT scale scores requires that the "item" responses be locally independent because that justifies the multiplication of the trace lines for those responses to compute the likelihood that is the basis of the scale scores. While this requirement may seem at first look restrictive in the context of performance assessment, we have seen that the testlet concept may be used to combine "items" (from the point of view of the examinee) that may be locally dependent into testlets in such a way that the testlet responses are locally independent. Then IRT models for responses in more than two categories, such as those we have discussed in this chapter, may be used to gain all the advantages of IRT: a well-defined scale with well-defined standard errors, form linking, and even adaptive testing.

5 Some testing programs use ordered versions of the NO model, such as the GPC model, for this purpose. That may be effective, but we would recommend that the unconstrained NO model be fitted, or some other analysis performed, to check that the empirical order of the testlet categories corresponds to the order assumed in fitting an ordered item response model.

IRT scale scores may always be computed using the full response pattern; if the model is well chosen for the data, that is the approach that yields the most statistically efficient scores. In some contexts, IRT scale scores may usefully be computed for each summed score. This is often more practical in large-scale testing programs but not as useful when the items being considered are of very different kinds.

References

Bergan, J. R., & Stone, C. A. (1985). Latent class models for knowledge domains. Psychological Bulletin, 98, 166-184.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Chen, W. H., & Thissen, D. (1997). Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265-289.

Hoskens, M., & De Boeck, P. (1997). A parametric model for local dependence among test items. Psychological Methods, 2, 261-277.

Reese, L. M. (1995). The impact of local dependencies on some LSAT outcomes. LSAC Research Report Series. Newtown, PA: Law School Admission Council.

Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.

Thissen, D. (1991). Multilog user's guide, Version 6 [Computer program]. Chicago: Scientific Software, Inc.

Thissen, D. (1993). Repealing rules that no longer apply to psychological measurement. In N. Frederiksen, R. J. Mislevy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 79-97). Hillsdale, NJ: Lawrence Erlbaum Associates.

Thissen, D., & Steinberg, L. (1988). Data analysis using item response theory. Psychological Bulletin, 104, 385-395.

Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247-260.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28, 197-219.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-214.


Some Ideas about Item Response Theory Applied to Combinations of Multiple-Choice and Open-Ended Items: Scale Scores for Patterns of Summed Scores*

Many contemporary tests include open-ended (OE) items, for which the item scores are ordered categorical ratings provided by judges, as well as multiple-choice items. If the collection of items is sufficiently well represented by a unidimensional item response theory (IRT) model, scale scores may be a viable plan for scoring such a test. Either EAP or MAP estimates of θ based on response patterns are traditional IRT answers to the scoring of such tests. However, in many contexts, response-pattern scoring carries nonpsychometric penalties, and an alternative solution is required. Weighted combinations of the summed scores are widely used, but no clearly superior solution exists to the problem of selecting the weights.

Scale Scores Based on Patterns of Summed Scores

A better solution to the problem of combining binary-scored, multiple-choice (MC) sections with items scored in multiple categories may involve a hybridization of summed-score and response-pattern computation of scaled scores. To do this, one must first jointly calibrate all the items (that is, one estimates the item parameters), using suitable IRT models for each item. Then one computes $L_x^{MC}(\theta)$, the likelihood for summed score x for the MC section, and $L_{x'}^{OE}(\theta)$, the likelihood for summed score x' for the OE section. For each combination of a given summed score x on the MC section with any summed score x' on the OE section, compute the product

$L_{x,x'}(\theta) = L_x^{MC}(\theta)\, L_{x'}^{OE}(\theta)\, \phi(\theta)$   (1)

The product in equation (1) is the likelihood for the response pattern defined as {score x on the MC section and score x' on the OE section}. Then we can compute the modeled probability of the response pattern of summed scores {x, x'},

$P_{x,x'} = \int L_{x,x'}(\theta)\, d\theta$   (2)

We may also compute the expected value of θ, given the response pattern of summed scores {x, x'},

$\mathrm{EAP}[\theta\,|\,x, x'] = \dfrac{\int \theta\, L_{x,x'}(\theta)\, d\theta}{P_{x,x'}}$   (3)

and the corresponding standard deviation,

$\mathrm{s.d.}[\theta\,|\,x, x'] = \left[\dfrac{\int \{\theta - \mathrm{EAP}[\theta\,|\,x, x']\}^2\, L_{x,x'}(\theta)\, d\theta}{P_{x,x'}}\right]^{1/2}$   (4)

* Excerpts from a draft to appear in D. Thissen and H. Wainer (Eds.), Test Scoring.

Equations (3) and (4) define two-way score translation tables that provide scaled scores and their standard errors for each such response pattern, in which the "pattern" refers to the ordered pair {score x on the MC section, score x' on the OE section}. This procedure offers many of the practical advantages of summed scores, while preserving the differences in scale scores that may be associated with very different values of "points" on the MC and OE sections.
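A sketch of the computations behind such a table (not code from the chapter): the section likelihoods $L_x^{MC}(\theta)$ and $L_{x'}^{OE}(\theta)$ can be accumulated with a recursive algorithm of the Lord-Wingersky type, and equations (1)-(3) then reduce to quadrature on a grid:

```python
# Sketch of equations (1)-(4): summed-score likelihoods for each section,
# combined into a two-way table of EAP[theta | x, x'].
# trace_lines_mc / trace_lines_oe are lists of arrays T_i[u, q] giving the
# probability of response u to item i at grid point q (same grid throughout).
import numpy as np

def summed_score_likelihoods(trace_lines):
    """Return an array L[x, q] = likelihood of summed score x at grid point q."""
    L = trace_lines[0].copy()                       # scores 0..max for item 1
    for T in trace_lines[1:]:
        new = np.zeros((L.shape[0] + T.shape[0] - 1, L.shape[1]))
        for x in range(L.shape[0]):
            for u in range(T.shape[0]):
                new[x + u] += L[x] * T[u]           # recursive accumulation
        L = new
    return L

def eap_table(trace_lines_mc, trace_lines_oe, grid=np.linspace(-4, 4, 161)):
    """Two-way table of EAP[theta | x, x'], as in equation (3)."""
    phi = np.exp(-0.5 * grid**2)                    # N(0,1) population distribution
    Lmc = summed_score_likelihoods(trace_lines_mc)
    Loe = summed_score_likelihoods(trace_lines_oe)
    table = np.zeros((Lmc.shape[0], Loe.shape[0]))
    for x in range(Lmc.shape[0]):
        for xp in range(Loe.shape[0]):
            joint = Lmc[x] * Loe[xp] * phi          # equation (1)
            p = np.trapz(joint, grid)               # equation (2)
            table[x, xp] = np.trapz(grid * joint, grid) / p   # equation (3)
    return table
```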

Using IRT Scale Scores for Patterns of Summed Scores to Score Tests Combining Multiple-Choice and Constructed-Response Sections: Wisconsin Third-Grade Reading Field Test

To illustrate the construction of scale scoring tables using equations (3) and (4), we use data from the Wisconsin third-grade reading test. We fitted the 16-item MC section and 4 OE items, each rated 0-3, with the 3PL and GR models (Birnbaum, 1968; Samejima, 1969), respectively, using the computer program Multilog (Thissen, 1991), and computed scale scores for the response patterns to these 20 items. Here, we use the item response models and item parameter estimates to compute the values of EAP[θ|x, x'] and s.d.[θ|x, x'], using equations (3) and (4).

Table 4.1 shows the values of EAP[θ|x, x'] for the 221 combinations of x, the summed score on the MC section, and x', the summed score on the OE section. Tabulations such as those shown in table 4.1 can be used in score-translation systems; one enters the table with the summed scores on the two parts of the test and locates the scale score for that combination in the body of the table. A similar array of the values of s.d.[θ|x, x'] is shown in table 4.2; these may be reported as the standard errors of the scores.

Table 4.1 shows some interesting features of IRT-based score combination, as opposed to the more commonly used simple weighted combinations of summed scores. Reading across each row or down each column of table 4.1, it should be noted that the "effect" of a "point" on either the OE section or the MC section depends on its context. In a simple weighted combination of the scores, obtaining 12 points instead of 11 on the OE portion would have the same "effect" on the score as obtaining 1 point instead of 0. This is not true for the IRT system: The likelihood-based system "considers" (to anthropomorphize) the two scores and their consistency as pieces of evidence. Where the two pieces of evidence essentially agree


Table 4.1. The values of EAP[θ|x, x'] for combinations of multiple-choice (MC) and open-ended (OE) summed scores on a Wisconsin reading tryout form. θ is standardized; EAP estimates of θ associated with the MC and OE summed scores are shown in the margins. The unshaded area in the table represents the central 99% HDR for the response patterns

Open-Ended (Summed) Rated Score

MC Sum Score  EAP   OE score:  0    1    2    3    4    5    6    7    8    9   10   11   12
              OE EAPs:       -2.8 -2.5 -2.2 -1.9 -1.6 -1.3 -1.0 -0.6 -0.2  0.1  0.5  1.0  1.4

0 -2.3 -3.2 -3.0 -2.8 -2.6 -2.5 -2.4 -2.2 -2.1 -2.0 -1.9 -1.8 -1.8 -1.7

1 -2.2 -3.1 -2.9 -2.8 -2.6 -2.4 -2.3 -2.1 -2.0 -1.9 -1.8 -1.7 -1.7 -1.6

2 -2.1 -3.0 -2.9 -2.7 -2.5 -2.3 -2.2 -2.0 -1.9 -1.8 -1.7 -1.6 -1.6 -1.5

3 -2.0 -3.0 -2.8 -2.6 -2.4 -2.2 -2.1 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.4

4 -1.8 -2.7 -2.5 -2.3 -2.1 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2

5 -1.6 -2.8 -2.6 -2.3 -2.1 -1.9 -1.8 -1.6 -1.5 -1.4 -1.3 -1.2 -1.2 -1.1

6 -1.4 -2.7 -2.4 -2.2 -2.0 -1.8 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -1.0

7 -1.2 -2.5 -2.2 -2.0 -1.8 -1.6 -1.4 -1.3 -1.2 -1.1 -1.0 -1.0 -0.9 -0.9

8 -1.0 -2.2 -1.9 -1.7 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.8 -0.7

9 -0.9 -1.9 -1.6 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.8 -0.7 -0.6 -0.6

10 -0.7 -1.5 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.8 -0.7 -0.6 -0.5 -0.5 -0.4

11 -0.5 -1.2 -1.0 -0.9 -0.8 -0.8 -0.7 -0.6 -0.6 -0.5 -0.4 -0.4 -0.3 -0.2

12 -0.3 -0.8 -0.8 -0.7 -0.6 -0.6 -0.5 -0.5 -0.4 -0.3 -0.2 -0.2 -0.1 -0.0

13 -0.1 -0.6 -0.5 -0.5 -0.4 -0.4 -0.3 -0.3 -0.2 -0.1 -0.0 0.0 0.2 0.3

14 0.5 -0.3 -0.3 -0.3 -0.2 -0.2 -0.1 -0.1 0.0 0.1 0.2 0.3 0.5 0.6

15 0.5 -0.1 -0.1 -0.0 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.7 0.9 1.1

16 1.1 0.2 0.2 0.2 0.3 0.3 0.4 0.5 0.6 0.7 0.9 1.1 1.4 1.7

(roughly, near the main diagonal of the table), the summed score on each part has a larger "effect" on the scaled score than it may when the scores "disagree." In the latter case, when the scores are inconsistent, the OE score is effectively given less weight because the OE section is less reliable (or discriminating).

Figure 4.1 illustrates the system graphically. The contents of the cell in table 4.1 for x_MC = 15 and x'_OE = 9 are shown expanded at the lower right of the figure: the population distribution [φ(θ)], the likelihood for OE score 9 [L_9^OE(θ)], the likelihood for MC score 15 [L_15^MC(θ)], the product of those three densities, and the likelihood for θ given OE score 9 and MC score 15 [L_{15&9}(θ)]. (To place all the likelihoods on approximately the same scale for the graphic, the population distribution [φ(θ)], the likelihood for the OE score [L_x'^OE(θ)], and the likelihood for the MC score [L_x^MC(θ)] have all been normalized to have a maximum ordinate of 1.0 in figures 4.1-4.4.)

The plot of the likelihoods in the lower right-hand corner of figure 4.1 is expanded in figure 4.2. Referring to table 4.1, we find that the value of EAP[θ|x, x'] for the combination is 0.5, while the MC EAP[θ|x] (in the row margin of table 4.1) is also 0.5, and the OE EAP[θ|x'] (in the column margin of table 4.1) is 0.1. Thus, although the likelihood for the combination appears to be between those for the MC score and the OE score, the "average" computed, using the combination likelihood, is


Table 4.2. The values of s.d.[θ|x, x'] for combinations of MC and OE summed scores on a Wisconsin reading tryout form. θ is standardized; EAP estimates of θ associated with the MC and OE summed scores are shown in the margins. The unshaded area in the table represents the central 99% HDR for the response patterns

Open-Ended (Summed) Rated Score

MC Sum Score  EAP   OE score:  0     1     2     3     4     5     6     7     8     9    10    11    12
              OE EAPs:       -2.8  -2.5  -2.2  -1.9  -1.6  -1.3  -1.0  -0.6  -0.2   0.1   0.5   1.0   1.4

0 -2.3 0.51 0.52 0.51 0.49 0.47 0.44 0.42 0.40 0.38 0.37 0.36 0.35 0.35

1 -2.2 0.53 0.54 0.53 0.50 0.48 0.45 0.42 0.40 0.39 0.37 0.36 0.35 0.35

2 -2.1 0.55 0.56 0.54 0.52 0.49 0.46 0.43 0.41 0.39 0.37 0.36 0.35 0.34

3 -2.0 0.58 0.58 0.56 0.53 0.50 0.46 0.43 0.41 0.39 0.37 0.36 0.35 0.34

4 -1.8 0.61 0.61 0.59 0.55 0.51 0.47 0.43 0.41 0.38 0.37 0.35 0.34 0.33

5 -1.6 0.65 0.64 0.61 0.56 0.51 0.47 0.43 0.40 0.38 0.36 0.34 0.33 0.32

6 -1.4 0.70 0.68 0.63 0.57 0.51 0.46 0.42 0.39 0.36 0.35 0.33 0.32 0.31

7 -1.2 0.75 0.70 0.63 0.56 0.50 0.44 0.40 0.37 0.35 0.33 0.32 0.31 0.30

8 -1.0 0.80 0.71 0.62 0.54 0.47 0.42 0.38 0.36 0.34 0.32 0.31 0.30 0.30

9 -0.9 0.81 0.68 0.57 0.49 0.44 0.39 0.36 0.34 0.32 0.31 0.31 0.30 0.30

10 -0.7 0.75 0.61 0.51 0.44 0.40 0.37 0.34 0.33 0.32 0.31 0.31 0.30 0.31

11 -0.5 0.63 0.51 0.44 0.40 0.37 0.35 0.33 0.32 0.31 0.31 0.31 0.32 0.33

12 -0.3 0.49 0.43 0.39 0.36 0.35 0.34 0.33 0.32 0.32 0.32 0.33 0.34 0.36

13 -0.1 0,40 0.37 0.36 0.35 0.34 0.34 0.33 0.33 0.34 0.35 0.37 0.39 0.42

14 0.5 0.36 0.36 0.35 0.35 0.35 0.35 0.35 0.36 0.37 0.39 0.42 0.45 0.49

15 0.5 0.37 0.37 0.37 0.37 0.37 0.38 0.39 0.41 0.43 0.45 0.49 0.53 0.59

16 1.1 0.40 0.40 0.40 0.41 0.42 0.43 0.45 0.47 0.50 0.53 0.57 0.62 0.67

Figure 4.1. Graphical illustration of the IRT pattern-of-summed-scores combination system using data from the Wisconsin third-grade reading test. The table in the upper left is a schematic representation of table 4.1. For the cell in table 4.1 for x_MC = 15 and x'_OE = 9, the expanded cell at the lower right shows the population distribution [φ(θ)], the likelihood for OE score 9 [L_9^OE(θ)], the likelihood for MC score 15 [L_15^MC(θ)], the product of those three densities, and the likelihood for θ given OE score 9 and MC score 15 [L_{15&9}(θ)]


Figure 4.2. The population distribution [φ(θ)], the likelihood for OE score 9 [L_9^OE(θ)], the likelihood for MC score 15 [L_15^MC(θ)], the product of those three densities, and the likelihood for θ given OE score 9 and MC score 15 [L_{15&9}(θ)], using data from the Wisconsin third-grade reading test


approximately in the same location as the MC EAP scale score. Table 4.2 gives the value of s.d.[θ|x, x'] = 0.45 for this cell, and the likelihood plotted in figure 4.2 shows this to be a rather accurate description: The inflection points are a little higher than 0 and a little lower than 1.

Figure 4.3 shows a similarly constructed graphic for the combination with MC score 16 and OE score 12, the maximum summed score for each component. In this case, the combination likelihood (the product of both component likelihoods and the population distribution) has EAP[θ|x, x'] = 1.7 (from table 4.1), while the EAPs for the two components are 1.1 (for the MC section) and 1.4 (for the OE section), both from the margins of table 4.1. Again, this kind of likelihood-based, score-combination system is better taken as a system for combining evidence (about θ) than as an averaging system, with or without weights. For example, when the examinee obtains the maximum score on both of the two components, the evidence is compounded that the examinee's proficiency is very high, far from the mean of the population distribution.

Figure 4.4 shows the likelihoods for an extremely unlikely combination: all items correct on the MC section (a summed score of 16) and no points on the OE section (summed score 0). This situation presents difficulty for any score combination system. The value of EAP[θ|x] for the MC section alone is 1.1, indicating relatively high proficiency, while that of EAP[θ|x'] for the OE section alone is -2.8, indicating very low proficiency. The value of EAP[θ|x, x'] for the combination likelihood is 0.2, which is a kind of compromise, but this is a compromise that comes with a warning.

The Probabilities of Score Combinations.

The IRT model also gives the probability for each combination of scores x and x', equation (2). For the score combinations in table 4.1, the values of P_xx' range from approximately 0.07 to less than 0.000005; a truncated representation of

Figure 4.3. The population distribution [φ(θ)], the likelihood for OE score 12 [L_12^OE(θ)], the likelihood for MC score 16 [L_16^MC(θ)], the product of those three densities, and the likelihood for θ given OE score 12 and MC score 16 [L_{16&12}(θ)], using data from the Wisconsin third-grade reading test


Figure 4.4. The population distribution [φ(θ)], the likelihood for OE score 0 [L_0^OE(θ)], the likelihood for MC score 16 [L_16^MC(θ)], the product of those three densities, and the likelihood for θ given OE score 0 and MC score 16 [L_{16&0}(θ)], using data from the Wisconsin third-grade reading test


those values of P_xx' is shown in table 4.3. Considered individually, the values of P_xx' are not readily interpretable as reflecting likely or unlikely events in any absolute sense because the magnitude of the individual P_xx' depends on the number of row and column score-points. However, the values of P_xx' may be used to construct a (1 − α)100% "highest density region" (HDR; Novick & Jackson, 1974) for the response combinations.

To construct the HDR, first sort the cells in order of P_xx' from largest to smallest, and construct the cumulative distribution of P_xx' using that sorted list. As an example, locate the 99% HDR by including the region of all those cells that contribute to the first 99% of the cumulative total of P_xx'. This region has the property that it includes 99% of the modeled response probability (by construction). The probability of any response combination within the region is higher than the probability of any response combination excluded from the region. According to the model, 99% of the examinees should obtain score combinations in that list of cells.
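The sort-and-accumulate construction just described translates directly into code. The following is a minimal sketch (names are illustrative) that returns a mask marking the cells of a P_xx' table that fall inside the (1 − α)100% HDR.

```python
import numpy as np

def highest_density_region(p, coverage=0.99):
    """Boolean mask over the (x, x') table marking the central
    `coverage` HDR: sort cells by P_xx' from largest to smallest and
    keep cells until the cumulative probability reaches `coverage`."""
    flat = p.ravel()
    order = np.argsort(flat)[::-1]                 # most probable cells first
    cum = np.cumsum(flat[order])
    target = coverage * flat.sum()
    keep = np.zeros(flat.size, dtype=bool)
    keep[order[cum <= target]] = True
    crossing = np.searchsorted(cum, target)        # cell that crosses the target
    if crossing < flat.size:
        keep[order[crossing]] = True
    return keep.reshape(p.shape)
```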

To illustrate, the 99% HDR was located for the Wisconsin reading data, as well as the 99.9% HDR; they are shown with shading in table 4.4. (The same cells are shaded in tables 4.1-4.3.) In table 4.4 (as well as tables 4.1-4.3), the 99% HDR is shown with no shading, and the region excluded from the 99.9% HDR is shaded darkly. The light gray shading and the unshaded area together represent the 99.9% HDR. Any response combination in the darkly shaded area in tables 4.1-4.4 is unusual, in that, according to the model, fewer than 1 in 1,000 examinees should produce responses in that region.

Returning our attention to the unlikely score combination illustrated in figure 4.4, we now note that the likelihood for the combination has been multiplied by 10^6 to make it visible on the graphic. The shadings in tables 4.1-4.4 indicate that, according to the IRT model, this particular score combination is one that the model says should occur rarely. Rather than accept any score for this combination (Which one should we accept? The high MC score? The low OE score? An average that represents neither?), this information could be used in some testing systems to indicate that either the test or the model is somehow inappropriate for examinees with this response pattern and that further testing might be useful.

An Aside: Use of P_xx' as the Basis of an "Appropriateness Index." For any test that is constructed of blocks of items [MC-OE combinations are just one example; a computerized adaptive test (CAT) that administers testlets sequentially is another], tables of the same general form as table 4.3 may also be constructed of

P_{xx'} = \int L_{xx'}(\theta)\, d\theta,

for each pair of blocks, or its generalization for higher-way tables. These values may be used to construct a (1 − α)100% HDR for scores for the combination of blocks. These represent the model's predictions of score combinations that are likely and unlikely. Any score combination that lies outside the (1 − α)100% HDR could be flagged in much the same way that "appropriateness indices" such as l_z (Drasgow, Levine, & Williams, 1985) are used to flag response patterns that are relatively unlikely, according to an IRT model.
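Under that reading, the HDR mask sketched earlier doubles as a simple flagging rule for block-score combinations; again, the names here are illustrative.

```python
def flag_unusual_combinations(p_blocks, coverage=0.999):
    """Mark block-score combinations falling outside the
    (coverage x 100)% HDR, in the spirit of an appropriateness index.
    Reuses the highest_density_region() sketch above."""
    return ~highest_density_region(p_blocks, coverage)
```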


Table 4.3. The values of P_xx' for combinations of MC and OE summed scores on a Wisconsin reading tryout form. Entries in the table are 10,000 × P_xx', truncated, with a leading 0.0 suppressed (that is, 001 is 0.0001). The unshaded area in the table represents the central 99% HDR for the response patterns

Open-Ended (Summed) Rated Score

MC Sum Score  EAP   OE score:  0    1    2    3    4    5    6    7    8    9   10   11   12
              OE EAPs:       -2.8 -2.5 -2.2 -1.9 -1.6 -1.3 -1.0 -0.6 -0.2  0.1  0.5  1.0  1.4

0 -2.3 000 000 000 000 000 000 000 000 000 000 000 000 000

1 -2.2 000 000 000 000 001 002 002 001 000 000 000 000 000

2 -2.1 000 000 000 002 005 008 009 007 003 001 000 000 000

3 -2.0 000 000 001 004 011 018 021 017 010 004 000 000 000

4 -1.8 000 000 001 006 016 030 038 034 022 009 002 000 000

5 -1.6 000 000 001 007 019 039 054 054 038 018 004 000 000

6 -1.4 000 000 001 006 019 043 067 074 058 030 009 001 000

7 -1.2 000 000 001 005 017 042 074 092 080 046 015 002 000

8 -1.0 000 000 000 003 014 038 075 104 102 064 023 004 000

9 -0.9 000 000 000 002 010 033 072 112 122 086 034 006 000

10 -0.7 000 000 000 001 008 027 066 116 141 112 049 011 000

11 -0.5 000 000 000 001 005 022 060 117 161 144 072 018 001

12 -0.3 000 000 000 000 004 018 054 118 184 188 108 031 003

13 -0.1 000 000 000 000 003 014 048 119 212 253 170 058 007

14 0.5 000 000 000 000 002 011 042 118 245 348 284 120 020

15 0.5 000 000 000 000 001 008 033 108 266 463 482 270 064

16 1.1 000 000 000 000 000 004 019 072 218 482 675 542 196

However, unlike l_z, which relies on a Gaussian approximation for the distribution of the log-likelihood for its questionable p-values (Reise & Flannery, 1996), the probability statements associated with use of a (1 − α)100% HDR for two- (or three- or four-) way classifications of the item responses blockwise may be computed directly from the IRT model.

What should be done with examinees whose responses are flagged as unlikely according to the model? This question raises difficult policy questions, and its answer certainly depends on the purpose of the test. The mismatch between the examinee's performance on one block and another may mean the examinee cheated on one block, in which case the better measure may be their lower performance, or it may be that something else went wrong with their performance on the block on which they scored lower (distraction? computer difficulties?), in which case the higher of the two scores may be more valid. The context of high-stakes testing, which is expensive in both time and money for the examinee, might add further considerations. Davis and Lewis (1996) suggest several possible courses of action that could be followed if the test was computerized: One set of possible actions includes an on-line extension of the test, either switching from a CAT system to a long linear form or using a special block of "silver bullet items" to estimate more accurately the


Table 4.4. The 99% and 99.9% HDRs for score combinations on a Wisconsin reading tryout form. EAPs associated with the MC and OE summed scores are shown in the margins

(Table 4.4 is a shaded grid: columns are the OE summed rated scores 0-12, with OE EAPs from -2.8 to 1.4 in the column margin; rows are the MC summed scores 0-16, with MC EAPs from -2.3 to 1.1 in the row margin. Unshaded cells form the central 99% HDR, and P(any unshaded combination) > P(any shaded combination). Darkly shaded cells are those excluded from the 99.9% HDR, with total probability < 0.1%.)

Figure 4.5. The values of EAP[θ|x, x'], computed using the 3PL/GR model combination, plotted as a surface of points over a grid representing the OE and MC summed scores, using data from the Wisconsin third-grade reading test (Only the points for the central 95% HDR of the score combinations are plotted)


Figure 4.6. The values of MAP[θ|x, x'], computed using the Rasch model, plotted as a surface of points over a grid representing the OE and MC summed scores, using data from the Wisconsin third-grade reading test (Only the points for the central 95% HDR of the score combinations are plotted)

examinee's proficiency. Other possible actions include score cancellation and retesting.

The Relation of Likelihood-Based Score Combination with Weighted Linear Combinations and with Rasch Model Scores. Figure 4.5 shows the score combination values of EAP[θ|x, x'] for the Wisconsin third-grade reading test plotted as a point-surface over a grid representing the MC and OE summed scores. (Only the points for the central 95% HDR of the score combinations are plotted.) The curvature of that surface represents the difference between this score-combination system and a linear-weighted average of the two scores. The corresponding points for any linear-weighted average system would lie on a plane. IRT "adjustments" for the relative difficulty and discrimination of the MC and OE items, done on a point-by-point basis as the likelihoods are used to combine the evidence about


proficiency each represents, cannot be exactly matched by any linear-weighting scheme. However, the fact that the rows of points rise more quickly in the direction of increase of the MC score than they do in the direction of increase for the OE score means that the IRT system is effectively "weighting" the MC "points" more than the OE "points."

Analysis of data like these with a Rasch-family model produces scale scores that are the same for any score combination that yields the same total summed score (Masters & Wright, 1984). In the case of Rasch-family models, the two-way array of values of EAP[θ|x, x'], such as is shown in table 4.1, is superfluous; any combinations of x and x' that have the same total score have the same scale score. When applied to data that the 3PL/GR model combination fits with different discrimination values for the MC items (as a set) and the OE items


Figure 4.7. The values of EAP[θ|x, x'], computed using the 3PL/GR model combination, plotted against the values of MAP[θ|x, x'], computed using the Rasch model, using data from the Wisconsin third-grade reading test. (Only the points for the central 95% HDR of the score combinations are plotted)

(Panel title: Wisconsin Reading Data: Different Scores for the MC & OE Combinations (Central 95%). Horizontal axis: (Standardized) Rasch Estimates.)

(as a set), Rasch-family model scale scores and the 3PL/GR model combination scale scores differ somewhat. Figure 4.6 shows the values of Rasch-family MAP[θ|x, x'] as computed with the computer program Bigsteps (Wright & Linacre, 1992) for the Wisconsin third-grade reading test plotted as a point-surface over a grid representing the MC and OE summed scores. (Again, only the points for the same central 95% HDR of the score combinations are plotted as in figure 4.5; the computation of the HDR is based on the 3PL/GR model.) Unlike the pattern shown in figure 4.5, the Rasch-family scale scores increase at exactly the same rate for an increase in the MC score as they do for an increase in the OE score.

Figure 4.7 shows the 3PL/GR score-combination values of EAP[θ|x, x'] plotted against the Rasch-family MAP[θ|x, x'] estimates. (Because the Rasch-family estimates are originally computed on a different scale, their values in figure 4.7 have been standardized to have the same mean and variance as the values of EAP[θ|x, x'].) We see in figure 4.7 some score combinations for which the Rasch-family values of MAP[θ|x, x'] and the 3PL/GR values of EAP[θ|x, x'] differ fairly substantially. Because the 3PL/GR scale scores are computed to account for guessing on MC items (which almost certainly happens) as well as different relative values of the "points" for the number-correct MC score as opposed to the rated OE score (which also almost certainly contains some truth), we tend to believe that when the two scores differ, the 3PL/GR scores may be more valid.


References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395-479). Reading, MA: Addison-Wesley.

Davis, L. A., & Lewis, C. (1996). Person-fit indices and their role in the CAT environment. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, April 9-11.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polytomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.

Novick, M. R., & Jackson, P. H. (1974). Statistical methods for educational and psychological research. New York: McGraw-Hill Book Company.

Reise, S. P., & Flannery, W. P. (1996). Assessing person-fit on measures of typical performance. Applied Psychological Measurement, 9, 9-26.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, 17.

Thissen, D. (1991). Multilog user's guide, Version 6 [Computer program]. Chicago: Scientific Software, Inc.

Wright, B. D., & Linacre, J. M. (1992). Bigsteps Rasch analysis [Computer program]. Chicago: MESA Press.


Enhancing the Validity of NAEP Achievement Level Score Reporting

The National Assessment of Educational Progress (NAEP) provides policymakers, educators, and the public with information about the reading, mathematics, science, geography, history, and writing knowledge and skills of elementary, middle, and high school students. NAEP also monitors changes in student achievement over time. NAEP uses considerable statistical and psychometric sophistication in its test design, data collection, test data analysis, and scaling (Beaton & Johnson, 1992; Johnson, 1992; Mislevy, Johnson, & Muraki, 1992). In fact, NAEP may be the most technically sophisticated assessment system in the world.

Less attention, however, appears to be given to the ways in which the complex NAEP data are organized and reported. In fact, the contrast between the efforts and success in producing sound technical NAEP instruments, drawing samples, administering the assessments, and analyzing the assessment data and the efforts and success in disseminating NAEP results is striking.

Concerns about NAEP data reporting have become an issue in recent years. These concerns have been documented by Hambleton and Slater (1995, in press), Jaeger (1992), Koretz and Deibert (1993), Linn and Dunbar (1992), and Wainer (1996, 1997a, 1997b). The objectives of this paper are as follows:

(a) provide background on NAEP score reporting with achievement levels (since 1990);

(b) describe the results of a small-scale study of the understandability of NAEP score reports among policymakers; and

(c) review several promising new directions in score reporting along with their implications for NAEP: redesign of NAEP displays (Wainer, Hambleton, & Meara, in progress), guidelines for preparing displays (Hambleton, Slater, & Allalouf, in progress), and market-basket reporting (the idea was suggested by Mislevy, Bock, & Thissen).

Background on NAEP Score Reporting

"What is the meaning of a NAEP mathematics score of 220?" "Is a national average of 245 in mathematics good or bad?" These two questions


were posed by policymakers and educators in a study conducted in 1994 by Hambleton and Slater (1995, in press) following the release of the Executive Summary of the 1992 NAEP national and state mathematics results. Questions about the meaning of scores are also frequently asked by those attempting to make sense of IQ, SAT, ACT, and other standardized achievement test scores. People are more

*See also Laboratory of Psychometric and Evaluative Research Report No. 317. Amherst, MA: University of Massachusetts, School of Education.



familiar with popular ratio scales, such as those used in measuring distance, time, and weight, than with educational and psychological test score scales.

Test scores are elusive. Even the popular percentage score scale, which many think they understand, cannot be understood unless the domain of content to which percentage scores are referenced is clear and the method used for selecting assessment items is known. Few seem to realize the importance of these two critical pieces of information in interpreting percentage scores. This problem is also present in state legislation being written that establishes a passing score on an important statewide test, without detailed knowledge of the test's content or difficulty.


One solution to the score interpretation problem is simply to interpret the scores in a normative way (i.e., scores obtain meaning or interpretability by being referenced to a well-defined norm group). All popular norm-referenced achievement tests use norms to assist in test score interpretations. However, normative statements are not always valued. Sometimes the important question policymakers have concerns level of accomplishment; for example, what percentage of students have reached a level of 250 on the assessment? Many policymakers would like to choose points such as 250 to represent well-defined levels of accomplishment (which might be called Basic, Proficient, and Advanced) and then determine the numbers of students in interest groups (e.g., regions of the country) who achieve these accomplishment levels. This is known as criterion-referenced (CR) assessment. Most national and state assessments are examples of CR assessments, and with these assessments, scores need to be interpreted in relation to content domains, anchor points, and/or performance standards (Hambleton, 1994).

NAEP constructed an arbitrary scale with scores ranging from 0 to 500 for each subject area. The scale was obtained as follows: The distributions of scores from nationally representative samples of 4th-, 8th-, and 12th-grade students were combined and scaled to a mean of 250 and a standard deviation of approximately 50 (Beaton & Johnson, 1992). The task was then to facilitate CR score interpretations on this scale (Phillips et al., 1993). Placing benchmarks such as grade-level means, state means, and performance of various subgroups of students (e.g., males, females, Black, Hispanic) is helpful in giving meaning to the scale, but these benchmarks provide only a norm-referenced basis for score interpretations.
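As a rough sketch of the scaling step just described (a pooled, standardized distribution of proficiency estimates mapped to a reporting scale with mean 250 and standard deviation of about 50), one might write something like the following; the standardization shown is an assumption for illustration, not NAEP's exact operational procedure.

```python
import numpy as np

def to_reporting_scale(theta, mean=250.0, sd=50.0):
    """Illustrative linear rescaling of a pooled set of standardized
    proficiency estimates onto a 0-500-style reporting scale."""
    theta = np.asarray(theta, dtype=float)
    z = (theta - theta.mean()) / theta.std()
    return mean + sd * z
```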

One way to make statistical results more meaningful for intended audiences is to connect them to numbers that may be better understood than test scores and test score scales. For example, when many persons were concerned recently about flying after the TWA (Flight 800) crash, the airlines reported that only one plane crash occurs for every 2 million flights. In case the safety of air travel still was not clear, the airlines reported that a person could expect to fly every day for the next 700 years without an accident. Most likely, some people felt more confident after hearing these statistics reported in an understandable fashion. Nevertheless, knowing that the

Figure 5.1. Graphical display of two item characteristic curves, item 1 being easier for the examinees than item 2. This display highlights the potential use of item mapping to enhance the meaning of the NAEP scale

(Horizontal axis: NAEP Achievement Scale, 100 to 500; vertical axis: probability of a correct response, 0.0 to 1.0.) Source: Hambleton & Slater, 1995

probability of being in a plane crash is less than 0.0000005 is not very meaningful to most people.

Concerning the reporting of NAEP results, what, for example, does a single point mean? It was noted that the typical student (one at the 50th percentile) gained approximately 48 points between fourth and eighth grades in mathematics (Hambleton & Slater, 1995), which converts to approximately 1.2 points per month of instruction (a gain of 48 points over 40 months of instruction). Recognizing that the growth over 4 years is not necessarily linear (see, for example, grade-equivalent scores on standardized achievement tests), it could be said that 1 point is at least roughly equivalent to 1 month of regular classroom instruction. Secretary of Education Richard Riley used this approach recently to communicate findings in the 1996 NAEP Science Assessment, and the connection between NAEP score points and instructional time appeared to be a valuable way to relay the meaning of points on the NAEP scale.

Other possibilities with considerable promise for CR interpretations of scores include item mapping, anchor points, performance standards (called "achievement levels" in the NAEP context), and benchmarking (Phillips et al., 1993). These approaches capitalize on the fact that scales based on item response theory (IRT) locate both the assessment material and the examinees on the same reporting scale. Thus, at any particular point (i.e., ability level) of interest, the types of tasks that examinees at that ability level can successfully complete can be determined. Tasks that these examinees cannot complete with some stated degree of accuracy (e.g., 50% probability of successful completion) can also be identified. Descriptions at these points of interest can be developed and exemplary items selected; that is, items to highlight what examinees at these points of interest might be expected to accomplish (Bourque, Champagne, & Crissman, 1997; Mullis, 1991).

Figure 5.1 shows the "item characteristic curves" for two NAEP dichotomously scored items (Hambleton, Swaminathan, & Rogers, 1991). At any point on the NAEP achievement (i.e., proficiency) scale, the probability of a correct response (i.e., answer) can be determined. Item 2 is the more difficult item since, regardless of ability, the probability of a correct response to item 2 is lower than that for item 1. The location on the ability scale at which an examinee has an 80% probability of success for an item is called the "RP80" for the item. In figure 5.1, the RP80 for item 1 is estimated at 210 and the RP80 for item 2 is approximately 306. This is known as "item mapping." Each item in a NAEP assessment can be located on the NAEP achievement scale according to RP80 values. If 80% probability is defined as the probability at which an examinee can reasonably be expected to know or accomplish something (other probabilities, such as 65%, have often been used), then an examinee with an ability score of approximately 210 could be expected to answer items such as item 1 and other items with RP80 values of approximately 210 on a fairly consistent basis (i.e., approximately 80% of the time). In this way, a limited type of CR interpretation can be made even though examinees with scores of approximately 210 may never have actually been administered item 1 or other items similar to it as part of their assessment.
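For a dichotomous item described by the three-parameter logistic (3PL) model, the RP80 location can be obtained in closed form by solving the item characteristic curve for the ability at which the response probability equals .80. The sketch below does this; the linear mapping to a mean-250, SD-50 reporting scale is only an illustrative assumption tied to the scale description given earlier, not the operational NAEP transformation, and the names are hypothetical.

```python
import math

def rp_value(a, b, c, rp=0.80, D=1.7, scale_mean=250.0, scale_sd=50.0):
    """Ability at which a 3PL item, P(theta) = c + (1-c)/(1+exp(-D*a*(theta-b))),
    has response probability `rp`, plus its value on an illustrative
    mean-250, SD-50 reporting scale."""
    if not c < rp < 1.0:
        raise ValueError("rp must lie strictly between c and 1")
    theta = b + math.log((rp - c) / (1.0 - rp)) / (D * a)
    return theta, scale_mean + scale_sd * theta
```

Setting rp=0.65 in the same sketch would give RP65-style locations of the kind mentioned below.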

The validity of CR interpretations depends on the extent to which a unidimensional reporting scale fits the data to which it is applied. If a group of examinees scores 270, a score of 270 is then made more meaningful by describing the contents of items such as those with RP80 values of approximately 270. The item-mapping method is one way to facilitate CR interpretations of points on the NAEP scale or any other scale to which items have been referenced. Cautions regarding this approach have been clearly outlined by Forsyth (1991). A major concern is the nature of the inferences that can legitimately be made from predicted examinee performance on a few test items.

A variation on the item-mapping method is to select arbitrary points on a scale and then to describe these points thoroughly through the knowledge and skills measured by items with RP80 values in the neighborhood of the selected points. In the case of NAEP reporting, arbitrarily selected points have been 150, 200, 250, 300, and 350. The item-mapping method can then be used to select items that can be answered correctly by examinees at those points. For example, using the item characteristic curves reported in figure 5.2, at 200, items such as 1 and 2 could be selected. At 250, item 3 would be selected. At 300, items 4 and 5 would be selected. At 350, item 6 would be selected. Of course, in practice, many items might be available for making selections to describe the knowledge and skills associated with performance at particular points along the ability scale.
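Once items have been mapped, describing an anchor point amounts to collecting the items whose RP values fall near it. A minimal sketch follows; the tolerance is an arbitrary illustrative choice, not a value taken from NAEP practice.

```python
def items_near_anchor(rp_values, anchor, tolerance=5.0):
    """Indices of items whose mapped RP values lie within `tolerance`
    reporting-scale points of a selected anchor point (e.g., 200, 250, 300)."""
    return [i for i, rp in enumerate(rp_values) if abs(rp - anchor) <= tolerance]
```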

Currently, RP65 values, rather than RP80 values, are used by NAEP, and items that clearly distinguish between anchor points are preferred when describing those points. For more details on current practices, see Beaton and Allen (1992), Mullis (1991), and Phillips et al. (1993). Note, too, that a similar process can be implemented for the individual score points on polytomously scored tasks. The method is not limited to dichotomously scored response data.

Figure 5.2. Graphical display highlighting the use of anchor points and item characteristic curves to enhance the meaning of the NAEP scale

(Horizontal axis: NAEP Achievement scale, 150 to 350; vertical axis: probability of a correct response, 0 to 1.00.) Source: Hambleton & Slater, 1995

The National Assessment Governing Board (NAGB) was not completely satisfied with the use of arbitrary points (i.e., anchor points) for reporting NAEP results. One reason was that the points 200, 250, and 300 became incorrectly associated by the media and policymakers with the standards of performance expected of 4th-, 8th-, and 12th-grade students, respectively. To eliminate the confusion, as well as to respond to the demand from some policymakers and educators for real performance standards on the NAEP scale, NAGB initiated a project to establish performance standards on the 1990 NAEP Mathematics assessment (Hambleton & Bourque, 1991) and has conducted similar projects to set performance standards on NAEP assessments in 1992, 1994, and 1996.

Figure 5.3 depicts the way in which performance standards (set on the test-score metric, a scale more familiar to standard-setting panelists than the NAEP achievement scale) can be mapped or placed on the NAEP achievement scale using the test characteristic curve (TCC). (In general terms, the TCC is a weighted average item characteristic curve for items that make up the assessment.) Of course, almost nothing is simple with NAEP, so figure 5.3 is an oversimplification of how the mapping is actually done. However, figure 5.3 depicts how mapping is performed in many state assessments.
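In the simplified, single-form situation sketched in figure 5.3, a cut score set on the summed-score metric can be carried to the ability scale by inverting the TCC numerically. The following is a minimal sketch under that simplification, with 3PL items and illustrative parameter names; it is not the operational NAEP mapping.

```python
import numpy as np

def tcc(theta, a, b, c, D=1.7):
    """Test characteristic curve: expected summed score at each theta
    for a set of 3PL items with parameter arrays a, b, c."""
    theta = np.atleast_1d(theta)[:, None]
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return p.sum(axis=1)

def map_cut_to_theta(cut_summed_score, a, b, c, grid=None):
    """Map a cut score on the summed-score metric to the ability scale
    by finding the grid point whose expected score is closest to it."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 8001)
    expected = tcc(grid, a, b, c)
    return grid[np.argmin(np.abs(expected - cut_summed_score))]
```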

The performance standards for a particular grade on the NAEP achievement scale can be used to report and interpret the actual performance of the national sample or any subgroup of interest. With these standards in place, the percentage of students in each performance

Figure 5.3. Graphical display highlighting the connection between performance standards on the test score scale and the NAEP score scale

(Vertical axis: test score scale, 0% to 100%, marked with the Basic, Proficient, and Advanced cutpoints; horizontal axis: NAEP Achievement Scale, marked at 211, 248, and 280.) Source: Hambleton & Slater, 1995

category in score distributions of interest can be reported, and the changes in these percentages can be monitored over time.

Anchor points and performance standards are placed on an achievement scale to enhance the content meaning of scores and to facilitate meaningful CR interpretations of the results. (For example, What percentage of fourth-grade students in the 1996 NAEP Science Assessment are able to perform at the Proficient level or above?) In recent years, both anchor points (e.g., 150, 200, 250, 300, and 350) and performance standards (e.g., borderline scores for Basic, Proficient, and Advanced students in grades 4, 8, and 12) have been placed on NAEP scales. Many states have adopted similar techniques for score reporting.

Performance standards are more problematic than anchor points because they require a fairly elaborate process to establish (e.g., the current design calls for 30 panelists at a grade level working for 5 days) and validate. At the same time, performance standards appear to be greatly valued by many policymakers and educators. For example, many state departments of education use performance standards in reporting, and many states involved in the NAEP trial state assessment have indicated a preference for standards-based reporting over anchor points-based reporting.

Figure 5.4 provides a final example of score reporting. (For additional references, see the report by Phillips et al., 1993.) This example is taken from the National Adult Literacy Survey (Kirsch et al., 1993). Each monotonically decreasing curve represents the performance of adults located at proficiency levels 400, 350, 300, 250, and 200, respectively, on assessment material organized into levels of difficulty (level 1 to level 5). With information about the types of items placed at each level, differences in performance among adults at

Figure 5.4. Average probabilities of correct responses to items along the document scale by adults with different proficiency levels

(Horizontal axis: items grouped by difficulty, level 1 to level 5; vertical axis: average probability of a correct response, 0.2 to 1.0.) Source: Kirsch, Jungeblut, Jenkins, & Kolstad, 1993

ability levels 200, 250, 300, 350, and 400 can be better understood. If, instead of choosing adults at particular points (anchor points methodology), adults could be sorted into Below Basic, Basic, Proficient, and Advanced performance categories, then figure 5.4 could be used in standards-based reporting to provide a better understanding of the nature of the performance differences among these four groups of adults.

Small-Scale Study of the Understandability of NAEP Score Reports

The design of tables, figures, and charts to transmit statistical data to enhance their meaningfulness and understandability is a new area of concern in education and psychology (Wainer, 1992; Wainer & Thissen, 1981). However, an extensive literature exists that appears relevant to the topic of data reporting in the fields of statistics and graphic design (Cleveland, 1985; Henry, 1995; Wainer, 1997c).

How bad, or good, is the current situation? Do policymakers and educators understand what they are reading about student achievements and changes over time? Do they make reasonable inferences and avoid inappropriate ones? What is their opinion about the information they are given? Is it important to them? What do they understand and what deficiencies and strengths exist relative to NAEP reports? In view of the shortage of available evidence about the extent to which intended NAEP audiences understand and can use NAEP reports, a small research study was performed by Hambleton and Slater (1995, in press) to investigate the extent to which NAEP Executive Summary Reports were understood by policymakers


and educators, to determine the extent to which problems were identified, and to offer a set of recommendations for improving reporting practices. The Technical Review Panel initiated the project, which was funded by the National Center for Education Statistics (NCES).

The 59 participants in the interviews made up a broad audience, similar to the intended audience of NAEP Executive Summary Reports. Persons at state departments of education, attorneys, directors of companies, state politicians and legislative assistants, school superintendents, education reporters, and directors of public relations were among those interviewed.

The interviews were based on the Executive Summary of the NAEP 1992 Mathematics Report Card for the Nation and the States (Mullis, Dossey, Owen, & Phillips, 1993). This particular report was chosen because it was relatively brief and was intended to stand on its own merits for policymakers and educators. NAEP Executive Summary Reports are also well known and widely distributed to people working in education or interested in education. (More than 100,000 copies of each Executive Summary are produced.) Furthermore, NAEP Executive Summary Reports, which include both national and state results, are thought to be of interest to the interviewees, who were from different areas of the country.

The goal of the interviews was to determine how much of the information in the Executive Summary Reports was understood. An attempt was made to pinpoint the aspects of reporting readers found confusing and to identify changes that interviewees found would improve their understanding of the results.

The 1992 NAEP Mathematics Executive Summary Report consists of six sections that highlight findings from different aspects of the assessment. Interview questions were designed for each section to ascertain the kind of information interviewees were obtaining from the report. Interviewees were asked to read a brief section of the report and then were questioned on the general meaning of the text or on the specific meaning of certain phrases. Interviewees also examined tables and charts and were asked to interpret some of the numbers and symbols. Interviewees were encouraged to volunteer their opinions and suggestions.

The sample of interviewees was mainly white and included more females (64%) than males (36%). Interviewees were from various areas of the education field, and two education reporters took part in the study. All interviewees indicated that they had medium to high interest in national student achievement results. Most interviewees (90%) were familiar with NAEP in at least a general way, and 64% had read NAEP publications prior to the interview. Almost half the sample had taken more than one course in testing or statistics (46%); one fourth had taken only one course, and another one fourth had taken none.

Nearly all the interviewees (92%) demonstrated a general understanding of the main points of the text summarizing the major findings of the report, though several interviewees commented that they would have liked more descriptive information (e.g., concrete examples). One problem in understanding the text was due to the use of statistical jargon (e.g., statistical significance, variance). This language choice confused and intimidated a number of the interviewees. Several interviewees suggested that a glossary of basic terms would be very helpful. Terms such as Basic, Proficient, Advanced, standard errors, and the NAEP scale could be included in such a glossary.

One example indicates that the phrase "statistically significant" was unclear to many interviewees (42%). Interviewees were expected to know that "statistically significant increases" are not increases resulting from chance events. Fifty-eight percent said that they thought they knew the meaning, but many of the interviewees in this group could not explain what the term meant or why it was used. This was surprising since more than half the interviewees had taken statistics courses. Typical responses to the question "What does statistically significant mean?" were:

"More than a couple of percentage points."

"10 percentage points."

"At least a 5-point increase."

"More than a handful; you have enough numbers."

"Statisticians decide it is significant due to certain criteria."

"The results are important."

"I wish you hadn't asked me that. I used to know."

The common mistake was to assume "statistically significant differences" were "big and important differences."

Several interviewees mentioned that although they realized that certain terms (e.g., standard error, estimation, confidence level) were important to statisticians, these terms were meaningless to them. They said that their eyes tended to glaze over when technical terms were used in reports or they formed their own working definitions such as those offered above.

Table 1 of the Executive Summary Report is one of the most important and contains a wealth of information. Results in the table are reported for the following categories: grades 4, 8, and 12; 1990 and 1992; average proficiency; and each performance category. Standard errors are given for all statistics in the table. However, many problems with this table were identified in the research study. For example, confusion about the meaning of "At or Above" was seen in table 1. When asked what the 18% in line 1 of table 1 meant (18% of grade 4 students in 1992 were in the Proficient or Advanced category in mathematics), more than half (53%) of the interviewees responded incorrectly. Several interviewees did not look at the table closely enough to see the "Percentage of Students At or Above" heading above the levels. The fact that categories were arranged from Advanced to Basic complicated the use of the table and the concept of "At or Above." In this case, "At


or Above" meant summingfrom right to left, which seemed back-ward to interviewees when the correctinterpretation was given to them.

A summary of six problems that arosewhen interviewees read table 1 follows:

1. Interviewees were confused by thereporting of average proficiencyscores. (Few understood the 500-point NAEP scale.) Proficiency asmeasured by NAEP and reported onits scale was confused with the cate-gory of "proficient students."


2. Interviewees were also baffled by the standard error beside each percentage. The reporting of the standard error interfered with reading the percentages, and the footnotes did not clearly explain what a standard error is and how it could be used.

3. The < and > signs were misunderstood or ignored by most interviewees. Even after reading the footnotes, many interviewees indicated that they were still uncertain about the meaning of the signs.

4. The most confusing point was the reporting of students "At or Above" each proficiency category. Interviewees interpreted these cumulative percentages as the percentage of students in each proficiency category. They were surprised and confused when the sum of percentages across any row far exceeded 100%. Contributing to the confusion in table 1 was the presentation of the categories in reverse order of what was expected (i.e., Below Basic, Basic, Proficient, and Advanced). This information as presented required reading from right to left instead of the more common left to right. Only approximately 10% of the


interviewees were able to make the correct interpretations of the percentages in table 1.

5. Footnotes were not always read by interviewees and were often misunderstood when they were read.

6. Interviewees expressed confusion because of variations between NAEP reports and their own state reports.

Given the major difficulties they had in understanding the information contained in table 1, it is not surprising that nearly 80% of the interviewees reported that this table "needs work." This was of concern because table 1 is the penultimate table in the Executive Summary of NAEP results. Several interviewees expressed that bar graphs would have improved the document. More than 90% of those interviewed indicated that they did not have a lot of time to interpret these complex tables, and they believed that a simple graph could be understood relatively quickly.

A second example from Hambleton and Slater may also be informative. Table 2 of the NAEP Executive Summary Report was unclear to approximately 30% of the interviewees. "Cutpoint" and "scale score," examples of NAEP jargon, were the source of the confusion. The interviewees had no idea of the meaning of the numbers in the table, and this information was not contained in the report.

The interviewees made fundamental mistakes in interpreting the figures and tables in the Executive Summary. Nearly all were able to understand the text in general terms, though many would have liked more descriptive information (e.g., definitions of measurement and statistical jargon as well as concrete examples). The problems in understanding the text involved the use of statistical jargon. This confused and even intimidated some interviewees. Some mentioned that, although these terms are important to statisticians, such terms are meaningless to them. After years of seeing these terms in reports, they simply passed over the words in their reading.

Many interviewees offered helpful and insightful opinions about the report. One frequently offered suggestion is the recommendation to make the reports accessible to nonstatisticians. Another comment made by several interviewees was that the report appeared to be "written by statisticians, for statisticians." To remedy this, many suggested removing the statistical jargon. Phrases such as "statistically significant" did not hold much meaning for the policymakers and educators interviewed.

The conclusions and recommendations from a study such as this one must be limited because of its modest nature (only 59 interviews were conducted), the nonrepresentativeness of the persons interviewed (though it was an interesting and important group of policymakers and educators), and the use of only one NAEP report in the study. Note, too, that the research was conducted on a 1992 NAEP report. Reports from 1994 and 1996 appear to be designed better and more responsive to the needs of the intended audiences.

Despite the limitations, several conclusions and recommendations were offered by Hambleton and Slater:

A considerable amount of misunderstanding was evident concerning the results reported in the 1992 NAEP Mathematics Assessment Executive Summary Report.

Improvements should include the preparation of a substantially more user-friendly report with simplified figures and tables.

Reports should be straightforward, short, and clear because of the time constraints experienced by those likely to read these executive summaries.

On the basis of the findings from their study, Hambleton and Slater offered several reporting guidelines for NAEP and state assessments:

1. Make charts, figures, and tables understandable without reference to the text. (Readers did not seem willing to search through the text for interpretations.)

2. Field-test graphs, figures, and tables on focus groups representing the intended audiences. (Many important problems can be identified from field-testing report forms. The situation is analogous to field-testing assessment materials prior to their use.)

3. Ensure that charts, figures, and tables can be reproduced and reduced without loss of quality. (Because interesting and important results will be copied and distributed, copies must be legible.)

4. Keep graphs, figures, and tables relatively simple and straightforward to minimize confusion and shorten the time required by readers to identify the main trends in the data.

5. NAEP Executive Summaries should include an introduction to NAEP and NAEP scales. A glossary should also be provided. Statistical jargon should be deemphasized; tables, charts, and graphs should be simplified; and more boxes and graphics should be used to highlight the main findings.

6. Specially designed reports may be needed for each intended audience. For example, policymakers might find short reports with bulleted text that highlights main points such as conclusions helpful.


Figure 5.5. Average reading proficiency by grade and by region, NAEP 1992 and 1994. [Figure: original display plotting average reading proficiency (scale of roughly 210 to 300) in 1992 and 1994. *Significant decrease between 1992 and 1994.] Source: Williams, Reese, Campbell, Mazzeo, & Phillips, 1995.

The National Adult Literacy Survey (Kirsch et al., 1993), conducted by NCES, Westat, and the Educational Testing Service, and the recently published standards-based report on the 1996 NAEP Science Assessment (Bourque, Champagne, and Crissman, 1997) appear to have benefited from some of the earlier evaluations of NAEP reporting. They provide excellent examples of data reporting, with figure 5.4 being one such example. A broad program of research, involving measurement specialists, graphic design specialists (Cleveland, 1985), and focus groups representing intended audiences for reports, is needed to build on the successes represented in the reports by Bourque, Champagne, and Crissman (1997), Jaeger (1992), Kirsch et al. (1993), Koretz and Deibert (1993), and Wainer (1996, 1997a, 1997b). Ways must be found to balance statistical rigor with the informational needs, time constraints, and quantitative literacy of intended audiences. Examples from this emerging research are described in the following sections of the paper.


Three Current Advances in Score Reporting

Wainer, Hambleton, and Meara Study

NCES recently funded a small project (Wainer, Hambleton, & Meara, in progress) to extend the earlier work of Hambleton and Slater (1995, in press) and Wainer (1996, 1997a, 1997b). This new study seeks to take a diverse mix of current NAEP data displays that seem problematic, revise them along the lines of emerging data-reporting principles, and field-test both the original and revised displays with educators and policymakers. The data displays for the study were selected from the 1994 NAEP Reading Study (Williams et al., 1995). An example of an original display (without color) appears in figure 5.5, and a revised display designed by Howard Wainer appears in figure 5.6. (Four additional original displays and their revisions are also used in the study.)


Figure 5.6. Average reading proficiency by grade and by region, NAEP 1992 and 1994. [Figure: revised display plotting average reading proficiency (scale of roughly 200 to 300) in 1992 and 1994 for the 4th, 8th, and 12th grades, with separate lines for all of the U.S. and for the Northeast, Southeast, Central, and West regions.] Source: Wainer, Hambleton, & Meara, in progress.

The original and revised displays shown in figures 5.5 and 5.6 will be distributed to policymakers and educators in an interview format, along with the following types of questions:

1. What was the general direction of results between 1992 and 1994?

2. Which region showed the greatest decline in performance for 12th graders from 1992 to 1994?


3. In 1994, which region of the country had the lowest average reading proficiency at all three grade levels?

4. What is the ranking of the regions from best to worst in terms of average reading proficiency for grade 12 in 1994?

5. Which of the four regions is most typical of the U.S. results?


6. Which regions for 8th graders in 1994 performed better than the average for all of the United States?

7. In everyday language, what do you think is meant by the phrase "significant decrease"?

The answers given by educators and policymakers to these seven questions, using the two versions of the data displays, will be central to the study. In addition, the time needed to respond will be a dependent variable for some questions.

In the Wainer, Hambleton, and Meara study, five displays from the 1994 NAEP Reading Assessment Study were revised. Participants in the study will be randomly assigned to answer a set of questions about each display using one of the two versions for each display. The findings from the study should determine whether the presentations of NAEP data were improved. Results from the study will be available in summer 1998.

Hambleton, Slater, and Allalouf Instructional Module on Score Reporting

Recognizing the need for steps and guidelines in the preparation of data displays and as a follow-up to the work of Hambleton and Slater (1995, in press), Hambleton, Slater, and Allalouf (in progress) are producing an instructional module with a five-step model that follows specific guidelines for preparing tables and figures. An outline of their guidelines for preparing data displays follows:


1. Keep presentation clear, simple, and uncluttered.

Frame the graph on all four sides.

Use no more tick marks than necessary and place ticks on all four sides of the graph frame.

Do not automatically place tick marks or numbers at the corners of the graph frame.

Clearly label the left and bottom axes.

Avoid the use of scale breaks.

Use visually distinguishable symbols. Avoid placing isolated data points too close to the frame where they may be hidden. Identify overlapping data points.

Consider using visual summaries, such as regression lines or smoothing, if the amount of data is large or if important trends are clouded by clutter or unduly influenced by outliers.

Use reference marks to highlight an important value.

Label the elements of bars and graphs (do not use keys or grids). Horizontal bars give room for labels and figures on or near the elements, which is why they are preferred to vertical bars.

Use a table with round numbers in numerical form (5,000, not 5,422 or 5 thousand) if memory of specific amounts is required.

2. Ensure that the graph can stand alone (i.e., a graph should be able to be interpreted in isolation from the main text).

Highlight data, not extraneous material.

Minimize noninformative material in the data region.


Ensure that the entire graph can be reduced and reproduced without loss of clarity.

3. Ensure that text complements and supports the graph.

Never present numerical data in text form if more than one or two items are to be presented.

Use questions following the table or chart to emphasize its chief features.

Include text to support and improve interpretation of charts and tables.

4. Plan the graphical presentation.

Field-test a sample graph with the intended audience before producing the completed version.

Consider using more than one graph to communicate an idea or concept.

5. No form of graph is more effective in all respects than all other forms. However, the following suggestions have been found in the literature:

Comparisons based on bar charts are more accurate than comparisons based on circles or squares.

Comparisons based on circles or squares are more accurate than comparisons based on cubes.

Bar charts prove easier to read than line graphs.

Grouped line graphs (each element originating with baseline) are easier to read than segmented line graphs, which prove very difficult.

Line codes for graphs should be chosen to minimize confusion.

For reading points, multiple lines and multiple graphs are equally good. For comparison, the multiple-line display is always superior.

In general, color coding improves performance over the black-and-white code, especially for multiple-line graphs.

This instructional module, which is scheduled to appear in Educational Measurement: Issues and Practice in 1998, and which will include the above guidelines with explanations and examples, is expected to prove valuable to district, state, and national agencies that design data displays and communicate test results.

Market-Basket Reporting

Mislevy, Bock, and Thissen have created the concept of a "market basket" for score reporting. Mislevy et al. (1996) provide an excellent explanation of market-basket reporting and some of its variations (see also Mislevy, 1998). Their idea apparently originated in market-basket reporting to explain economic changes over time as reflected by the consumer price index. The price of a market basket of food (with known and fixed grocery items) is reported each month to provide the public with a single, easy-to-understand measure of economic change. The extension to education is as follows: Imagine a collection of test items and performance tasks that measure important educational outcomes. The collection of assessment material would reflect diverse item formats, difficulty levels, cognitive levels within a subject area, and any other dimensions of interest. The quality of education might be monitored by reporting the performance of a national sample of students on the "market basket" of items each year. Many policymakers seem to want a single, clear index regarding the quality of education, much like the consumer price index.

Figure 5.7. Market-basket assessment. [Figure: percentage score scale from 0% to 100% with the NAEP achievement levels mapped onto it.]

The market-basket items would be clearly explained to the public to enhance the meaning of statements such as:

In 1996, the average American fourth-grade student obtained 37 out of 50 points on the assessment. This is three points higher than the results reported in 1995.

Alternatively or in addition, standards could be used in reporting:

An advanced student would need to score 45 points on the assessment. Approximately 5% of the students in the national sample were judged as advanced. In 1995, only 3% of the students met the standard for being advanced. Thus, evidence of improvement is seen.


Figure 5.7 shows the way NAEP achievement levels or performance standards (B-Basic, P-Proficient, and A-Advanced) might be mapped onto the percentage score scale (or test score scale) associated with the market-basket assessment. The monotonically increasing curve involved in the mapping is the TCC for the assessment material in the market basket (Hambleton, Swaminathan, & Rogers, 1991). The TCC links the NAEP proficiency scale and the achievement levels to a percentage score scale that is more meaningful to many NAEP audiences because it refers to the particular items and tasks (i.e., the assessment material) in the market-basket assessment. The NAEP score distribution for the student population of interest, along with the achievement levels, can then be mapped onto this percentage score scale and used in score reporting. Although many problems must be overcome, the basic concept of market-basket reporting appears to be attractive to NAEP audiences.
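To make the mapping in figure 5.7 concrete, the short sketch below is offered as an illustration only; it is not part of Hambleton's paper or of NAEP's operational procedures. It shows how a TCC built from three-parameter logistic (3PL) item parameters could translate cutpoints on a proficiency (theta) scale into expected percent-correct scores on a market-basket form. All item parameters and cutpoints are invented, and a real market basket would also contain polytomous tasks, which this dichotomous-item sketch ignores.

import math

def p_3pl(theta, a, b, c, D=1.7):
    # Probability of a correct response under the three-parameter logistic model.
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def expected_percent_correct(theta, items):
    # The test characteristic curve (the sum of item probabilities) expressed as a
    # percentage of the maximum possible score on this hypothetical form.
    return 100.0 * sum(p_3pl(theta, a, b, c) for a, b, c in items) / len(items)

# Invented market-basket items: (discrimination a, difficulty b, guessing c).
items = [(1.0, -1.0, 0.20), (0.9, -0.5, 0.20), (1.2, 0.0, 0.25),
         (1.1, 0.5, 0.20), (0.8, 1.0, 0.15), (1.3, 1.5, 0.20)]

# Invented cutpoints on the theta scale standing in for Basic, Proficient, and Advanced.
for label, cut in [("Basic", -0.5), ("Proficient", 0.5), ("Advanced", 1.5)]:
    print(f"{label}: about {expected_percent_correct(cut, items):.0f}% of the available points")

The same function, applied to an estimated theta distribution for the student population, would produce the mapped score distribution described above.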


A problem may occur if the market-basket items and tasks are administered or released to the public in reports (as seems desirable to communicate fully the meaning of the results). These items and tasks would be compromised and could not be used in future assessments. Students might perform better on them in the future, not because the quality of education improved, but because the assessment material had become known and was taught to them. For the market-basket concept to work, then, an equivalent set of items and tasks must be found for each administration. However, the construction of strictly equivalent forms of a test is a very difficult task, and even minor errors would distort the interpretation of the results. Perhaps only part of the market basket of assessment material could be released after each administration. Which parts to release, and the size of the market basket, would need to be determined so that only a suitable amount of material would be released to the public.

Another problem is that released items and tasks might have unintended effects on curricula and assessments, such as a narrowing of the curricula to match the released assessment material and elimination of assessment formats that were not used in the market-basket items and tasks. These and other problems have reasonable solutions, but research will be needed to address them before this concept is ready for implementation.

Conclusions

Considerable evidence is found in the measurement literature to suggest that NAEP scales and score reports are not fully understood by intended audiences. At the same time, many signs indicate that the problems associated with these misunderstandings are being addressed. A review of NAEP reports from 1990 to 1996 shows significant improvements in the clarity of displays. The use of anchor points, achievement levels, benchmarking, and market-basket displays, which were addressed in this paper, appears to be valuable for improving NAEP displays. Ongoing research of these and other innovations will further enhance NAEP reports. The emerging guidelines for data displays from Hambleton, Slater, and Allalouf (in progress) address a pressing need and should improve data displays. At the same time, a considerable amount of research needs to be conducted if NAEP displays are to achieve their purposes.

NAEP reports, in principle, provide policymakers, educators, education writers, and the public with valuable information. However, the reporting agency needs to ensure that reporting scales are meaningful to the intended audiences and that displays are clear and understandable. These efforts will almost certainly require the adoption and implementation of a set of guidelines for reporting, which would include the field-testing of all reports to ensure that they can be interpreted fully and correctly. Special attention, too, must be given to the use of figures and tables, which can convey substantial amounts of data clearly when they are properly designed.

References

Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17(2), 191-204.

Beaton, A. E., & Johnson, E. G. (1992). Overview of the scaling methodology used in the National Assessment. Journal of Educational Measurement, 29(2), 163-176.

Bourque, M. L., Champagne, A. B., & Crissman, S. (1997). 1996 science performance standards: Achievement results for the nation and the states. Washington, DC: National Assessment Governing Board.

Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.

Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 40(3), 3-9, 16.

Hambleton, R. K. (1994). Scales, scores, and reporting forms to enhance the utility of educational testing. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans.

Hambleton, R. K., & Bourque, M. L. (1991). The levels of mathematics achievement: Initial performance standards for the 1990 NAEP Mathematics Assessment. Washington, DC: National Assessment Governing Board.

Hambleton, R. K., & Slater, S. (1995). Using performance standards to report national and state assessment data: Are the reports understandable and how can they be improved? Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments, Volume II (pp. 325-343). Washington, DC: National Assessment Governing Board, National Center for Education Statistics.

Hambleton, R. K., & Slater, S. (in press). Are NAEP executive summary reports understandable to policymakers and educators? Educational Assessment.

Hambleton, R. K., Slater, S., & Allalouf, A. (in progress). Steps for preparing displays of educational data. Educational Measurement: Issues and Practice.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.

Henry, G. T. (1995). Graphing data: Techniques for display and analysis. Thousand Oaks, CA: Sage Publications.

Jaeger, R. (1992). General issues in reporting of the NAEP trial state assessment results. In R. Glaser & R. Linn (Eds.), Assessing student achievement in the states (pp. 107-109). Stanford, CA: National Academy of Education.

Johnson, E. G. (1992). The design of the National Assessment of Educational Progress. Journal of Educational Measurement, 29, 95-110.

Kirsch, I. S., Jungeblut, A., Jenkins, L., & Kolstad, A. (1993). Adult literacy in America: A first look at the results of the National Adult Literacy Survey. Washington, DC: U.S. Government Printing Office.

Koretz, D., & Deibert, E. (1993). Interpretations of National Assessment of Educational Progress (NAEP) anchor points and achievement levels by the print media in 1991. Santa Monica, CA: RAND.

Linn, R. L., & Dunbar, S. B. (1992). Issues in the design and reporting of the National Assessment of Educational Progress. Journal of Educational Measurement, 29(2), 177-194.

Mislevy, R. J. (1998). Implications of market-basket reporting for achievement-level setting. Applied Measurement in Education, 11(1), 49-63.

Mislevy, R. J., Forsyth, R., Hambleton, R. K., Linn, R., & Yen, W. (1996). Design/feasibility team report to the National Assessment Governing Board. Washington, DC: National Assessment Governing Board.

Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131-154.

Mullis, I. V. S. (1991). The NAEP scale anchoring process for the 1990 mathematics assessment. Paper presented at the meeting of the American Educational Research Association, Chicago.

Mullis, I. V. S., Dossey, J. A., Owen, E. H., & Phillips, G. W. (1993). Executive summary of the NAEP 1992 mathematics report card for the nation and the states. Washington, DC: U.S. Department of Education.

Phillips, G. W., Mullis, I. V. S., Bourque, M. L., Williams, P. L., Hambleton, R. K., Owen, E. H., & Barton, P. E. (1993). Interpreting NAEP scales. Washington, DC: U.S. Department of Education.

Wainer, H. (1992). Understanding graphs and tables. Educational Researcher, 21(1), 14-23.

Wainer, H. (1996). Using trilinear plots for NAEP data. Journal of Educational Measurement, 33, 41-55.

Wainer, H. (1997a). Improving tabular displays: With NAEP tables as examples and inspirations. Journal of Educational and Behavioral Statistics, 21, 1-32.

Wainer, H. (1997b). Some multivariate displays for NAEP results. Psychological Methods, 2(1), 34-63.

Wainer, H. (1997c). Visual revelations. New York: Springer-Verlag.

Wainer, H., Hambleton, R. K., & Meara, K. (in progress). A comparative study of original and revised displays of NAEP data (tentative title). A project funded by the National Center for Education Statistics.

Wainer, H., & Thissen, D. (1981). Graphical data analysis. Annual Review of Psychology, 32, 191-241.

Williams, P. L., Reese, C. M., Campbell, J. R., Mazzeo, J., & Phillips, G. W. (1995). NAEP 1994 reading: A first look. Washington, DC: U.S. Department of Education.

SECTION 6

1998 Civics and Writing

Level-Setting Methodologies

ACT, Inc. Iowa City, IA

August 1997


1998 Civics and Writing Level-Setting Methodologies

Setting appropriate achievement levels on the National Assessment of Educational Progress (NAEP) helps define some of the important outcomes of education, stating clearly what students should know and be able to do at key grade levels in school. This will make the assessment far more useful to parents and policymakers as a measure of performance in American schools and perhaps as an inducement to higher achievement. The achievement levels will be used for reporting NAEP results in a way that greatly increases their value to the American public.

NAGB, 1990

Achievement levels are an important and increasingly integral part of the National Assessment of Educational Progress (NAEP). Achievement levels directly address the way NAEP communicates information about student performance in selected learning areas to a variety of constituencies to improve education in the United States and to meet goals that signal educational parity, at a minimum, in the international arena. In particular, achievement levels give a readily understood means of describing what students should know and be able to do.

The current debate regarding national tests increasingly draws attention to the National Assessment Governing Board (NAGB) and achievement levels in general. Although neither writing nor civics has been proposed for the national tests, an increased interest in the achievement levels set for all NAEP subjects seems inevitable. Thus, the design of this achievement levels-setting (ALS) process is particularly important.

In 1990, NAGB unanimously adopted three achievement levels to serve as the primary means of reporting results for NAEP. The three levels are:

Proficient: This level represents solid academic performance for each grade assessed. Students reaching this level have demonstrated competency over challenging subject matter, including subject matter knowledge, applications of such knowledge to real-world situations, and analytical skills appropriate to the subject matter.

Advanced: This level signifies superior performance beyond Proficient.

Basic: This level denotes partial mastery of prerequisite knowledge and skills fundamental for Proficient work at each grade level.

The plan described here is an extension of earlier work by NAGB and ACT, Inc., to set achievement levels in mathematics, reading, writing, geography, U.S. history, and science. ACT's experiences in carrying out the responsibilities of the contract for setting achievement levels on NAEP in these subjects have guided the development of the design features for the 1998 achievement levels-setting process for civics and writing described in this document.



Questions such as what constitutes "civics," "writing," or any other subject assessed by NAEP; how the subject shall be assessed; what information should be collected; and how item development and test administration should be handled are important, but they are "givens" to this project. Inputs to this process are the following:

Frameworks for the civics and writing NAEP were developed under contract to NAGB, through a consensus process involving panels of experts and public comment forums held throughout the United States.


Policy definitions of the achievement levels (given above) were set by NAGB. Preliminary achievement level definitions were developed by framework panelists for each grade level within each subject area, e.g., fourth-grade Basic civics, fourth-grade Proficient civics, and fourth-grade Advanced civics.

Item development, development of test forms (booklets), and field-testing for each subject area were carried out by the Educational Testing Service (ETS) under contract to the National Center for Education Statistics (NCES). Members of NAGB and the framework consensus panels, among others, participated in the selection of items to be included on the assessment of each subject and at each grade level.

Assessments of students throughout the nation were selected through samples of schools. The assessments in civics and writing will be administered during the first 3 months of 1998. NCES has authority to administer NAEP. ETS develops the assessment, and Westat administers the assessment under contract to NCES.

Assessments are scored by National Computer Systems (NCS) under contract to ETS. Scores are reported to ETS where they are converted to the NAEP scale using a three-parameter item response theory (IRT) model. The item parameters and other data, such as scale transformations and student performance data, are provided to ACT for use in setting achievement levels in each assessment subject area.
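As background for readers unfamiliar with the model named in the previous item, the three-parameter logistic (3PL) function gives the probability that an examinee with proficiency theta answers item i correctly in terms of the item's discrimination, difficulty, and lower-asymptote (guessing) parameters (see, e.g., Hambleton, Swaminathan, & Rogers, 1991). The formula below is included only as standard background and is not a detail taken from ACT's document:

P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D\,a_i(\theta - b_i)\right]}

where a_i, b_i, and c_i are the item parameters and D is a scaling constant commonly set to 1.7.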

An ALS process will be conducted to set achievement levels in both civics and writing to be assessed in 1998. This process is designed to produce three products: descriptions of the knowledge and skills that students in each grade level assessed in NAEP (4th, 8th, and 12th) should have to be classified as performing at each level of achievement; the numerical score (or "cutpoint") associated with that level of performance on the particular assessment administered; and test items with (written) responses illustrative of the kinds of knowledge and skills required of students performing at each level of achievement. Outcomes of this process will be:

Content-based descriptions of each level of achievement for each of the three grade levels assessed by NAEP.

Numerical cutscores (the lower bound at each achievement level) that tie these descriptions to performance on the assessments.

Items available from the 1998 pool of items slated for public release, illustrative of the skills and knowledge characterizing student performance at each achievement level at each of the three grade levels assessed by NAEP in these subjects.

The NAEP scores associated with the cutpoints and performance at or above each level tend to be the focal point of NAEP reporting. In addition to deriving recommended cutpoints for the Basic, Proficient, and Advanced achievement levels, important outcomes of this project are the descriptions of the proposed achievement levels and the sample items and responses selected to represent student performance at each achievement level.

The frameworks for the 1998 NAEP assessments include preliminary definitions of the three achievement levels for each grade. As a part of the framework, preliminary definitions were used to guide the development of the assessment items and tasks. ACT has designed a process by which the achievement levels descriptions will be reviewed and finalized prior to convening the ALS panels.

A series of focus groups, one in each of the four NAEP regions, has been planned. These focus groups are being held to collect recommendations for modifications in the preliminary achievement levels descriptions that would increase the reasonableness of the descriptions with respect to the policy definitions. The plan is to convene a panel of experts to review the recommendations and make changes deemed appropriate, with respect to the framework. These finalized descriptions will then be presented to NAGB for approval before the ALS pilot studies are conducted.

Because relatively few blocks of items in each assessment will be released for public review, the maximum feasible number of items will be selected for consideration as items to represent student performance at each achievement level. These achievement level descriptions and the illustrative items for each will play prominent roles in communicating the achievement of students on NAEP.

ACT intends to elicit and engage participation by numerous experts, interested organizations, and individuals. The final product will benefit from the input of these individuals and interests. ACT will provide the impetus for the accumulation of information and expertise to be focused on and channeled into the development of the achievement levels and the validation of their interpretations of student performance.

The achievement levels will be developed by a group of individuals representative of both the educational community and the general public. A broadly representative set of panelists will be identified for each content area. The involvement and participation of stakeholder groups and other interested constituencies will be elicited in all phases of this project. Further, the recommended achievement levels (descriptions, numerical values, and illustrative items) will be made available for public review, and ACT will attempt to engage the public in this review. The primary mode of collecting opinions from these persons and organizations will be through the Internet and Web sites for both ACT and NAGB. All comments offered during this public review phase will be shared with the Technical Advisory Committee on Standards Setting (TACSS) and NAGB, and these comments will become a part of the information compiled to develop the recommendations for NAGB regarding achievement levels in each content area.

Key Points in ACT Design


ACT has extensive experience in assisting national organizations in determining criterion scores or standards for their programs. ACT has knowledge of current advances in ALS processes and the creativity and expertise to modify and expand upon these as appropriate for specific implementations. In addition, ACT designed and implemented the achievement levels-setting process for the 1992 NAEP in mathematics, writing, and reading, the 1994 NAEP in geography and U.S. history, and the 1996 NAEP in science. We believe that this experience provides the insights and understanding, merged with the technical and methodological expertise and experience, for designing a process to fully address the many challenges and requirements of the NAEP assessments.


In designing these processes, ACT has carefully reviewed the procedures previously implemented in setting achievement levels for NAEP and those described in recent literature on standard setting. We have attempted to evaluate objectively the primary features of the processes implemented for NAEP and to identify ways to improve upon them. We are aware of the features that are unique to NAEP. Many NAEP features require special consideration in the ALS design, and some standard-setting procedures simply will not work for NAEP.

Many features of earlier processes have been retained; many others have been improved upon and enhanced. A sampling plan for identifying and recruiting panelists was successfully designed and implemented for the 1992, 1994, and 1996 ALS processes. The sampling plan has yielded broadly representative panels of well-qualified individuals. The approval of a diverse set of interested individuals, organizations, and groups was sought and achieved for both the sampling plan and the overall research design.

The ALS process, first implemented in 1992, incorporated state-of-the-art methodologies in standard setting and fully accomplished the goals required of a successful standard-setting process. The process designed by ACT for the 1994 and 1996 NAEP ALS process incorporates all features recognized as "desirable" during the conference on standard setting sponsored by NAGB and NCES in 1994 except sharing consequences data with panelists during the ALS process. The current design calls for a change in NAGB policy to allow this addition.


Questions and concerns have emerged regarding psychometric and standard-setting issues related to NAEP achievement levels that have never been addressed. ACT has raised several of these, and we have attempted to openly address all that were brought to our attention. The design presented in this document incorporates improvements and enhancements generated through experiences gained during previous achievement levels-setting efforts for NAEP.

Key features of the 1998 proposal include the following:

1. A sampling plan for recruiting panelists for each ALS meeting that will result in the involvement of a well-qualified, representative panel of judges, while introducing efficiencies resulting from the experiences of identifying and recruiting panelists for many previous NAEP panels.

2. A series of focus groups to make recommendations that will be incorporated into the preliminary achievement levels descriptions. These modified descriptions will be presented to NAGB for final approval prior to convening ALS panelists.

3. A research agenda incorporated into the general approach of the ALS process and validation process that will contribute significantly to the product of the 1998 achievement levels-setting process and to the body of knowledge in standard setting, item response theory, and other technical and methodological areas of educational assessment and measurement.

4. Field trials (two separate studies each) for both civics and writing to test rating methodologies and other features of the proposed design and to collect research data regarding the design before the pilot studies are implemented.

5. Pilot studies for both civics and writing that will provide the opportunity for testing the process and making needed changes and adjustments prior to implementing the ALS process.

6. Ample time in the agenda for the ALS pilot studies and meetings to successfully address key elements of the ALS process.

7. Extensive training for panelists, not only in the methods of evaluating items and rating them to set achievement levels but also in the consequences of those standards, in the objectives and purposes of NAEP and NAGB, and in educational assessment issues and policies.

8. A more deliberative process engaging panelists in two different methods for arriving at the numerical cutscores.

9. Consequences data provided to panelists during the process so that cutscores may be adjusted after panelists have been informed about the consequences of setting achievement levels at specific score points (a simple illustration of such a computation appears after this list).

10. Customized computer software that uses the IRT calibrations of the NAEP items and scale to produce on-site feedback to panelists on the consistency and convergence of ratings and on the consequences of their ratings.


11. Scannable rating forms to reduce the time required to enter and analyze data and produce feedback for panelists to use during the process.

12. The same, well-trained process facilitation staff for each pilot study and each ALS meeting to ensure consistency in implementation.

13. Content facilitation staff well versed in the NAEP framework or item pool for the subject (or both) to work with each grade-level panel on the pilot studies and the ALS in each subject.

14. On-site logistic planning and support services using full-time, ALS-experienced project staff.

15. A team of veterans represented on the project staff, the internal advisory team, and the external committee of technical advisors, including highly experienced experts in standard-setting methodology, psychometrics, sampling statistics, writing and other educational assessment, collective decision making, and meeting management, joined by new members with fresh insights and ideas to enrich the outcomes of the project.
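To illustrate the kind of consequences feedback referred to in item 9 above, the brief sketch below shows one simple way such data could be computed: the percentage of a score distribution falling at or above each proposed cutscore. It is an illustration only, with simulated scores and invented cutpoints; it is not ACT's software and uses no actual NAEP data.

import random

random.seed(7)

# Simulated, NAEP-like scale scores for a group of examinees (invented numbers).
scores = [random.gauss(250, 35) for _ in range(10000)]

# Hypothetical cutscores a panel might propose for Basic, Proficient, and Advanced.
cutscores = {"Basic": 215, "Proficient": 260, "Advanced": 300}

for level, cut in cutscores.items():
    at_or_above = sum(score >= cut for score in scores)
    pct = 100.0 * at_or_above / len(scores)
    # The "consequence" a panelist would see: the share of examinees who would be
    # classified at or above this level if the proposed cutscore were adopted.
    print(f"{level}: {pct:.1f}% of simulated examinees at or above a cutscore of {cut}")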

ACT's general approach to deriving recommended achievement levels for the NAEP is guided by five overarching principles:

1. There must be broad, thorough, and open participation from all relevant populations in the ALS process.

2. Highly sensitive and confidential materials, reports, and information must be handled in an appropriate manner.

3. The levels-setting process must be carefully designed, technically sound, rigorously implemented, and appropriately validated.

4. The levels-setting process must be comprehensible to interested parties and easily implemented by process participants.

5. NAGB must exercise informed direction over all major project activities and be kept fully apprised of all relevant project information.


SECTION 7

The Criticality of Consequences

in Standard Setting: Six Lessons Learned the Hard Way

by a Standard-Setting Abettor

W. James Popham University of California, Los Angeles

August 1997


The Criticality of Consequences in Standard Setting: Six Lessons Learned the Hard Way by a Standard-Setting Abettor

When I was initially invited to present some ideas to members of the National Assessment Governing Board (NAGB) in August 1997, my topic was to be "consequential validity." I had written about consequential validity a few months earlier, contending that I did not regard it as a particularly commendable concept. However, as I later learned that the August NAGB session was to focus directly on that group's standard-setting activities, I decided to emphasize the role of consequences in the setting of performance standards. Although I regard consequential validity as a psychometrically sordid idea, the impact of consequences on standard setting is, and should be, enormous.

I decided to draw on my experiences in the setting of standards for more than three dozen high-stakes tests for students, teachers, and administrators. I would like to describe some lessons that I have learned.

What to Call Oneself?

On careful consideration, I realized that I had never set one of these performance standards. Most typically, I served as moderator for a statewide standard-setting panel. In other settings, I functioned as an advisor to the ultimate standard setters. Clearly, I had to find a suitable descriptor for my role in the aforementioned standard-setting endeavors. I toyed with such possibilities as consultant, advisor, and coconspirator, but none seemed on the mark. But then I looked up the meaning of abettor in Webster's Collegiate Dictionary; an abettor's role is "to encourage, support, or countenance by aid or approval, usually in wrongdoing" [italics added]. Obviously, I had found an appropriate label.

Six Lessons

Six consequence-relevant lessons that this standard-setting abettor has learned while wrestling with the establishment of performance standards are found below. Each lesson is followed by a brief comment.

Lesson 1. The chief determiner of performance standards is not truth; it is consequences.

Abettor's Comment: If an examination is sufficiently important to warrant a formal standard-setting effort, it is invariably true that meaningful consequences will be linked to examinees' performances. Accordingly, when standard-setting panelists determine performance levels for such examinations, those standard setters are typically influenced more by the consequences of the standards they set (e.g., the number of students who will not receive a high school diploma) than by any notions about true ("correct") performance levels.


Lesson 2. Any quest for "accurate" performance standards is silly.

Abettor's Comment: There are always multiple consequences linked to performances on significant tests. To illustrate, for a high school graduation test, such consequences would include diploma denials, citizens' estimates of educators' effectiveness, and the business community's satisfaction with diploma recipients. Because different standard setters' perceptions of the significance of those consequences will vary, the performance levels ultimately selected typically reflect a judgmental amalgam rather than a correct performance standard.


Unfortunately, some psychometricians have spent so much time searching for true scores and accounting for error variance that they sometimes attempt to impose a truth-and-error paradigm on the standard-setting process. A preferable approach to standard setting would be to recognize that it is fundamentally a consider-the-consequences enterprise.

Lesson 3. Early in the standard-setting process, all likely consequences should be explicated for standard setters.

Abettor's Comment: Standard setters frequently become so preoccupied with the most obvious and advertised consequence of using a test (e.g., for professional licensure) that they fail to recognize the certainty or possibility of other potentially significant consequences. If standard setters are alerted to the full range of likely consequences of a test's use, they will be more apt to function thoughtfully by considering all consequences, or at least those consequences they consider important.

Lesson 4. If certain standard setters (because of their positions or affiliations) are apt to be biased in their judgments, such potential biases should be identified early in the standard-setting process.

Abettor's Comment: In many attempts to establish performance standards, particular standard setters frequently enter the process with powerful biases in favor of higher or lower performance standards. Such biases frequently flow from the orientation of the entity (e.g., teachers union) represented by a standard setter. Often, these blatantly biased individuals will profess nonpartisanship during standard-setting deliberations.

Those directing the standard-setting enterprise should isolate such biases at the outset of the deliberations. This could be done by identifying the quite normal proclivity of certain categories of standard setters (no names) to favor performance standards compatible with preferences of the groups those standard setters represent. Visible biases are more readily countered than are camouflaged biases.

Lesson 5. When it is hoped that a testing program will stimulate subsequent increased proficiency among those to be tested, incremental elevations of performance levels, over a period of time, can avoid undesirable consequences.


Abettor's Comment: Frequently, the skills or knowledge assessed by a new test will reflect higher-level proficiencies that it is hoped will be possessed by future examinees. The examination system is being used to spur the acquisition of these more demanding capabilities. Yet, because few examinees will, in the new testing program's early days, possess the ultimately desired proficiency levels, standard setters are sometimes tempted to opt for lower performance standards to avoid penalizing these early test takers. If such low performance standards are allowed to persist throughout the duration of the testing program, however, its lofty improvement aspirations will not be realized. This classic approach-avoidance conflict can often be circumvented through the use of preannounced, incremental increases in required performance levels over time.

Lesson 6. Performance-level descriptors must accurately communicate, in an intuitively comprehensible fashion, to all concerned constituencies.

Abettor's Comment: Those individuals most actively involved in the development and operation of testing programs, or in the determination of performance standards, often become so familiar with the nuances of what is being assessed that they devise sophisticated performance-level descriptive schemes not readily understandable to the uninitiated. Sometimes, for example, exotic scale-score reporting systems are created with the thinly veiled purpose of obfuscating what would be regarded by the public as unacceptably low performance standards. Performance standards must be readily understandable to those who are concerned about examinees' performances.

A Final Admonition

Many important educational decisions are made without a careful decision-making process. Standard setting for high-stakes tests, however, should never be carried out in the absence of thoughtful, systemized judgment. My understanding of the standard-setting procedures previously employed by NAGB is that there has been far too much deference given to the quantitative "truth-seekers." Even though there will never be a standard-setting machine that pumps out unflawed performance levels, all we can ask is that NAGB and other standard setters circumspectly consider the consequences of the performance levels they set.


SECTION 8

Acknowledgements

and Appendices


Acknowledgements

This Achievement Levels Workshop involved the support and assistance of numerous staff from the National Assessment Governing Board (NAGB), Educational Testing Service (ETS), ACT, Inc., and Aspen Systems Corporation. The committee would like to acknowledge the assistance of each staff member who helped to arrange and plan for this inaugural event.

In addition, the committee wishes to thank the authors whose work appears in this volume of the proceedings. Their informative presentations and learned papers provided a collection of some of the best current thinking about achievement levels. We would especially like to thank Robert A. Forsyth, University of Iowa; Wim J. van der Linden, University of Twente, The Netherlands; Ronald K. Hambleton, University of Massachusetts; David Thissen and his colleagues, University of North Carolina; W. James Popham, University of California, Los Angeles/IOX; and Susan Loomis, ACT NAEP Project Director.

We would also like to thank the attendees who offered special comments and insights on the various topics on the agenda, including Lauress Wise, member of the National Academy of Sciences NAEP Evaluation Panel; Jon Cohen from the American Institutes for Research (AIR), a staff member of the Advisory Committee for Evaluation Statistics (ACES); and William Brown, member of the Technical Advisory Committee on Standards Setting (TACSS).

Special thanks go to Munira Mwalimu and the staff at Aspen for their logistical support and to Jewel Bell of the NAGB staff for her assistance in preparing briefing materials, attending to participant onsite needs, and helping the editor in preparing these proceedings.


Appendix A: Participants

Luz Bay, ACT, Inc., 2201 North Dodge Street, Iowa City, IA 52243. Telephone: (319) 337-1639; Fax: (319) 337-3020

Jewel Bell, National Assessment Governing Board, 800 North Capitol Street, NW, Suite 825, Washington, DC 20002-4233. Telephone: (202) 357-6938; Fax: (202) 357-6945

Mary Blanton, Attorney, Blanton & Blanton, 228 West Council Street, Salisbury, NC 28145-2327. Telephone: (704) 637-1100; Fax: (704) 637-1500

Mary Lyn Bourque, National Assessment Governing Board, 800 North Capitol Street, NW, Suite 825, Washington, DC 20002-4233. Telephone: (202) 357-6940; Fax: (202) 357-6945

William Brown, University of North Carolina, Chapel Hill, 121 Dunedin Court, Cary, NC 27511. Telephone: (919) 467-2404; Fax: (919) 966-6761

James Carlson, Educational Testing Service, Rosedale Road, Box 6710, Princeton, NJ 08541-6170. Telephone: (609) 734-1427; Fax: (609) 734-5410

Peggy Carr, National Center for Education Statistics, 555 New Jersey Avenue, NW, Washington, DC 20208. Telephone: (202) 219-1761; Fax: (202) 219-1801

Lee Chin, ACT, Inc., 2201 North Dodge Street, Iowa City, IA 52243. Telephone: (319) 337-1639; Fax: (319) 339-3020

Jon Cohen, Chief Statistician, American Institutes for Research, 1000 Thomas Jefferson Street, NW, Washington, DC 20007. Telephone: (202) 944-5300; Fax: (202) 344-5408

Mary Crovo, National Assessment Governing Board, 800 North Capitol Street, NW, Suite 825, Washington, DC 20002-4233. Telephone: (202) 357-6938; Fax: (202) 357-6945

Barbara Dodd, University of Texas, Measurement and Evaluation Center, 2616 Wichita Street, Austin, TX 78705. Telephone: (512) 232-2654; Fax: (512) 471-3509

James Ellingson, Fourth-Grade Teacher, Probstfield Elementary School, 2410 14th Street South, Moorhead, MN 56560. Telephone: (218) 299-6252; Fax: (701) 298-1509

Thomas Fisher, Administrator, Department of Education, 325 West Gaines Street, FEC 701, Tallahassee, FL 32399-0400. Telephone: (904) 488-8198; Fax: (904) 487-1889

Robert A. Forsyth, Professor, College of Education, University of Iowa, 320 Lindquist Center, Iowa City, IA 52242. Telephone: (319) 335-5412; Fax: (319) 335-6038

Michael Guerra, Executive Director, National Catholic Education Association, Secondary School Department, 1077 30th Street, NW, Suite 100, Washington, DC 20007. Telephone: (202) 337-6232; Fax: (202) 333-6706

Ronald K. Hambleton, Professor of Education, University of Massachusetts, Amherst, Hills House South, Room 152, Amherst, MA 01003. Telephone: (413) 545-0262; Fax: (413) 545-4181

Patricia Hanick, ACT, Inc., 2201 North Dodge Street, Iowa City, IA 52243. Telephone: (319) 337-1639; Fax: (319) 337-3020

Eugene Johnson, Educational Testing Service, Rosedale Road, Princeton, NJ 08541-6170. Telephone: (609) 734-5598; Fax: (609) 734-5420

Susan Loomis, ACT, Inc., 2201 North Dodge Street, Iowa City, IA 52243. Telephone: (319) 337-1048; Fax: (319) 337-3021

John Mazzeo, Educational Testing Service, Rosedale Road, Princeton, NJ 08541-6170. Telephone: (609) 734-5298; Fax: (609) 734-5410

Karen Mitchell, National Academy of Sciences, 2101 Constitution Avenue, NW, Room HA178, Washington, DC 20418. Telephone: (202) 334-3407; Fax: (202) 334-3584

Mark Musick, President, Southern Regional Education Board, 592 10th Street, NW, Atlanta, GA 30318-5790. Telephone: (404) 875-9211; Fax: (404) 872-1477

Michael Nettles, Professor of Education and Public Policy, University of Michigan, 2035 School of Education Building, 610 East University, Ann Arbor, MI 48109-1259. Telephone: (313) 767-1982; Fax: (313) 764-2510

Christine O'Sullivan, Educational Testing Service, Rosedale Road, Princeton, NJ 08541-6170. Telephone: (609) 734-5598; Fax: (609) 734-5420

Norma Paulus, Superintendent of Public Instruction, State Department of Education, 255 Capitol Street, NE, Salem, OR 97310-0203. Telephone: (503) 378-3573, ext. 526; Fax: (503) 378-4772

W. James Popham, Director of IOX Assessment, Associate Professor Emeritus, University of California, Los Angeles, 5301 Beethoven Street, Suite 109, Los Angeles, CA 90066. Telephone: (310) 545-8761; Fax: (310) 822-0269

William Randall, Connect, 1580 Logan Street, Suite 740, Denver, CO 80203. Telephone: (303) 894-2146; Fax: (303) 894-2141

Douglas Rindone, Connecticut State Department of Education, P.O. Box 2219, 165 Capitol Avenue, Hartford, CT 06145. Telephone: (203) 566-1684; Fax: (203) 566-1625

Andrea Schneider (representative), Office of Governor Roy Romer, 136 State Capitol Building, 200 East Colfax Avenue, Denver, CO 80203. Telephone: (303) 866-2471; Fax: (303) 866-2003

Sharif Shakrani, National Assessment Governing Board, 800 North Capitol Street, NW, Suite 825, Washington, DC 20002-4233. Telephone: (202) 357-6938; Fax: (202) 357-6945

David Thissen, Professor of Psychology, University of North Carolina, CB# 3270 Davie Hall, Chapel Hill, NC 27599. Telephone: (919) 962-5036; Fax: (919) 962-2537

Roy Truby, National Assessment Governing Board, 800 North Capitol Street, NW, Suite 825, Washington, DC 20002-4233. Telephone: (202) 357-6938; Fax: (202) 357-6945

Wim J. van der Linden, Professor of Educational Measurement and Data Analysis, Faculty of Educational Science and Technology, University of Twente, 7500 AE Enschede, The Netherlands. Telephone: 011-31-53-893-581; Fax: 011-31-53-48-928895

Deborah Voltz, Assistant Professor, Department of Special Education, University of Louisville, 154 Educational Building, Louisville, KY 40292. Telephone: (502) 852-0561; Fax: (502) 852-1419

Lisa Weil (representative), Office of Governor Roy Romer, 136 State Capitol Building, 200 East Colfax Avenue, Denver, CO 80203. Telephone: (303) 866-2471; Fax: (303) 866-2003

Paul Williams, Educational Testing Service, Rosedale Road, Princeton, NJ 08541-6170. Telephone: (609) 734-1427; Fax: (609) 734-1878

Lauress Wise, President, HumRRO, 66 Canal Center Plaza, Suite 400, Alexandria, VA 22314. Telephone: (703) 549-3611; Fax: (703) 549-9025

Rebecca Zwick, University of California, Santa Barbara, Department of Education, Santa Barbara, CA 93106-9490. Telephone: (805) 893-7762; Fax: (805) 893-7264


Appendix B: Conference Agenda

Wednesday, August 20

1:00 p.m. Welcome and Introductions
1:15 p.m. Working Lunch
2:15 p.m. Frameworks and Specifications: Impact on Achievement Levels (Robert A. Forsyth, University of Iowa)
3:45 p.m. Break
4:00 p.m. Discussion
5:00 p.m. Adjourn
6:00 p.m. Working Dinner, Hotel Gardens

Thursday, August 21

8:00 a.m. Continental Breakfast
8:30 a.m. Test Construction/Preliminary Descriptions: Impact on Achievement Levels (Wim J. van der Linden, Law School Admissions Services and University of Twente, The Netherlands)
10:00 a.m. Break
10:15 a.m. Judgement Scoring: Impact on Achievement Levels (Ronald K. Hambleton, University of Massachusetts, Amherst)
Noon Lunch, Millennium Room
1:00 p.m. Discussion
2:00 p.m. Mixed Item Formats: Impact on Setting Standards (David Thissen, University of North Carolina, Chapel Hill)
3:30 p.m. Break
3:45 p.m. Validity/Reliability Issues for NAEP Achievement Levels (Jon Cohen, American Institutes for Research)
4:45 p.m. Discussion
5:30 p.m. Adjourn (dinner on your own)

Friday, August 22

7:30 a.m. Continental Breakfast
8:00 a.m. Consequential Validity for Setting Standards (W. James Popham, University of California, Los Angeles/IOX)
9:30 a.m. Discussion
10:00 a.m. Break
10:15 a.m. Civics and Writing Level-Setting Methodologies (NAEP Staff, ACT)
Noon Lunch, Millennium Room
1:00 p.m. Wrap-up Panel Discussion (William Brown, Brownstar, Inc., Cary, NC; W. James Popham, University of California, Los Angeles/IOX; Lauress Wise, National Academy of Sciences)
2:00 p.m. Adjourn

Page 118: Reproductions supplied by EDRS are the best that …DOCUMENT RESUME ED 458 220 TM 033 373 AUTHOR Bourque, Mary Lyn, Ed. TITLE Proceedings of Achievement Levels Workshop (Boulder, Colorado,

N,404:

.

-N.,

;\'

`;-

lq" (

,L1 J.. . ; t,

,,,. .: IV. N '.. ' . .' .. \ 1," .1'

"411-1i11. illi?°, ., -'; # -'.,\ ... 0 .., . '-",...,.. , '1;

.1 . :'l ,.. 7...-141/k 1/..,

.1 1 ., .. ..,'''? '''. 1

1.;;":-:'' '. '. ` !,.' i .. ,', I cLF.' I la, ;. ,t 7 .

.14' . 0 e 1. ,-, -- : ' 4 t -v-1-01 -,. ;, , , - : \ . ,';. .t.:Y " y4, ., II, f, ..-. `,.....±..l , ' ...` ',.',.' 1, .;,.. '.: .., ',,,,` 14,.. ;. < y, . ' , , '

. ...- .. .

.1 .t,It.., ,1

-,... ,L,,,,...4.,..W1.1 -;,- 4. -*. '' . )2i,. ...,"

`... ' .

. _.-. , _ , . : . '. _. .,.....- '.- - (....,.. , _. .--/ 2-, A ',7' '' 7 : .. . 7- * 74'''

sC`

7*F.71

^

Page 119: Reproductions supplied by EDRS are the best that …DOCUMENT RESUME ED 458 220 TM 033 373 AUTHOR Bourque, Mary Lyn, Ed. TITLE Proceedings of Achievement Levels Workshop (Boulder, Colorado,

U.S. Department of EducationOffice of Educational Research and Improvement (OERI)

National Libraiy of Education (NLE)

Educational Resources Information Center (ERIC)

NOTICE

Reproduction Basis

TM033373

ERIC

This document is covered by a signed "Reproduction Release(Blanket)" form (on file within the ERIC system), encompassing allor classes of documents from its source organization and, therefore,does not require a "Specific Document" Release form.

1:1This document is Federally-funded, or carries its own permission toreproduce, or is otherwise in the public domain and, therefore, maybe reproduced by ERIC without a signed Reproduction Release form(either "Specific Document" or "Blanket").

EFF-089 (3/2000)