APPLIED MEASUREMENT IN EDUCATION, 23: 286–306, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957347.2010.486289

Using Confirmatory Factor Analysis and the Rasch Model to Assess Measurement Invariance in a High Stakes Reading Assessment

Jennifer Randall
Research and Evaluation Methods Program

    University of Massachusetts, Amherst

George Engelhard, Jr.
Education Studies
Emory University

The psychometric properties and multigroup measurement invariance of scores across subgroups, items, and persons on the Reading for Meaning items from the Georgia Criterion Referenced Competency Test (CRCT) were assessed in a sample of 788 seventh-grade students. Specifically, we sought to determine the extent to which score-based inferences on a high stakes state assessment hold across several subgroups within the population of students. To that end, both confirmatory factor analysis (CFA) and Rasch (1980) models were used to assess measurement invariance. Results revealed a unidimensional construct with factorial-level measurement invariance across disability status (students with and without specific learning disabilities), but not across test accommodations (resource guide, read-aloud, and standard administrations). Item-level analysis using the Rasch Model also revealed minimal differential item functioning across disability status, but not accommodation status.

Correspondence should be addressed to Professor Jennifer Randall, Ph.D., University of Massachusetts, Hills House South, Room 171, Amherst, MA 01003. E-mail: [email protected]

The federal government, with the Individuals with Disabilities Education Act of 2004 (IDEA), defines the term child with a disability to mean a child "with mental retardation, hearing impairments (including deafness), speech or language impairments, visual impairments (including blindness), serious emotional disturbance, orthopedic impairments, autism, traumatic brain injury, other health impairments, or specific learning disabilities and who, by reason thereof, needs special education and related services" (Public Law 108-446, 108th Congress). Over 6.5 million infants, children, and youth are currently served under IDEA legislation (U.S. Department of Education, 2007b), which requires that all public school systems provide students with disabilities a free and appropriate education in the least restrictive environment. This least restrictive environment mandate often requires schools and school systems to place students with disabilities in regular, nonspecial education classrooms.

In addition to IDEA, No Child Left Behind (NCLB, 2002) seeks to address, and to prevent, the practice of excluding disabled students from quality instruction and, consequently, assessment. Although the U.S. Department of Education (DOE) does not require students with significant cognitive disabilities to achieve at the same levels as non-disabled students under NCLB, the DOE does demand that all other students with less severe disabilities make progress similar to that of their non-disabled peers. Because many students with disabilities must be assessed using the same tests as students without disabilities, the need for testing accommodations that compensate for their unique needs and disabilities becomes apparent. Yet some may argue that the non-standard accommodations required by special needs students could undermine the meaningfulness of scores obtained on a standardized test.

The inclusion of students with disabilities (SWDs) certainly presents some measurement challenges. Federal law requires that the mandatory assessments of SWDs meet current psychometric and test standards related to validity, reliability, and fairness of the scores. States must "(i) identify those accommodations for each assessment that do not invalidate the score; and (ii) instruct IEP Teams to select, for each assessment, only those accommodations that do not invalidate the score" (Department of Education, 2007b, p. 177781). The Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999) dictate:

Standard 10.1
In testing individuals with disabilities, test developers, test administrators, and test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics extraneous to the intent of the measurement. (p. 106)

Standard 10.7
When sample sizes permit, the validity of inferences made from test scores and the reliability of scores on tests administered to individuals with various disabilities should be investigated and reported by the agency or publisher that makes the modification. Such investigations should examine the effects of modifications made for people with various disabilities on resulting scores, as well as the effects of administering standard unmodified tests to them. (p. 107)

This study seeks to address these standards by examining evidence of measurement invariance for a set of reading items used on the Georgia Criterion Referenced Competency Test (CRCT). The basic measurement problem addressed is whether or not the probability of an observed score on these reading items depends on an individual's group membership. In other words, measurement invariance requires that students from different groups (students with disabilities and students without disabilities, as well as students who receive resource guide, read-aloud, or standard administrations), but with the same true score, have the same observed score (Wu, Li, & Zumbo, 2007). Meredith (1993) provides a statistical definition of measurement invariance:

The random variable X is measurement invariant with respect to selection on V if F(x|w, v) = F(x|w) for all (x, w, v) in the sample space, where X denotes an observed random variable with realization x; W denotes the latent variable, with realization w, that underlies, or measures, X; and V denotes a random variable, with realization v, that functions as a selection of a subpopulation from the parent population by the function s(V), 0 ≤ s(v) ≤ 1. (see Meredith, 1993, p. 528)

Wu et al. (2007) assert that such a general definition is useful in that it can be applied to any observed variables at the item or test level, consequently providing a statistical basis for psychometric techniques such as factor analytic invariance, as well as differential item functioning or item response theory methods (p. 3). At the test level, factor analysis provides an excellent psychometric framework in that the factor score acts as a surrogate for an individual's true score, and the observed random variables are represented by the items. When assessing data with dichotomous outcomes, factorial invariance is established if the factor loadings and thresholds are equivalent across multiple sub-populations. At the item level, item response models provide an appropriate psychometric framework in that a person's expected score on any one item acts as a proxy for the true score and the observed score on that same item represents the observed random variable. Item-level invariance is established if the item parameters are equivalent across multiple populations. In other words, for all values of θ (the underlying, or latent, construct), the item true scores are identical across groups. Both factorial and item-level equivalence are necessary when one seeks to provide evidence of measurement equivalence. As pointed out by Bock and Jones (1968), "in a well developed science, measurement can be made to yield invariant results over a variety of measurement methods and over a range of experimental conditions for any one method" (p. 9).
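Stated compactly, in notation added here for illustration rather than taken from the original, the two levels of invariance just described can be written as follows.

```latex
% Test (factorial) level: loadings and thresholds equal across groups g = 1, ..., G
\Lambda^{(1)} = \Lambda^{(2)} = \cdots = \Lambda^{(G)},
\qquad \tau^{(1)} = \tau^{(2)} = \cdots = \tau^{(G)} .
% Item level: for every value of the latent variable \theta, the item true score
% (expected item score) does not depend on group membership g
E\!\left[ X_i \mid \theta,\ g \right] = E\!\left[ X_i \mid \theta \right]
\quad \text{for all items } i, \text{ groups } g, \text{ and values of } \theta .
```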


Previously, several methods have been employed to establish the measurement invariance of assessment results for SWDs receiving test accommodations. Analysis of variance and analysis of covariance procedures have been used to measure the effects of extended time (Munger & Loyd, 1991; Runyan, 1991) and read-aloud, or oral (Meloy, Deville, & Frisbie, 2000; Bolt & Thurlow, 2006; Elbaum, 2007; Elbaum, Arguelles, Campbell, & Saleh, 2004), accommodations. Factor-analytic methods have been used to examine factorial invariance of assessments for SWDs receiving various accommodations such as oral administration of items (Huynh & Barton, 2006; Huynh, Meyer, & Gallant, 2004), extended time (Huesman & Frisbie, 2000; Rock et al., 1987), and large type (Rock, Bennett, & Kaplan, 1985). Similarly, methods that examine item-level equivalence have been used to examine invariance across Braille (Bennett, Rock, & Novatkoski, 1989), calculator use (Fuchs, 2000a), read-aloud (Bielinski, Thurlow, Ysseldyke, Friedebach, & Friedebach, 2001; Bolt & Ysseldyke, 2006; Fuchs, 2000b), and extended time (Cohen, Gregg, & Deng, 2005; Fuchs, 2000b) accommodations.

The purpose of the present study is to describe a coherent framework that can be used to explore systematically whether or not specific accommodations meet psychometric criteria of measurement invariance for students with specific learning disabilities (SLD) on items designed to assess reading for meaning. The framework has two stages. The first stage utilizes the confirmatory factor analysis (CFA) model to establish unidimensionality and to assess measurement invariance across several subgroups at the test level, specifically factorial invariance. In the second stage, we present a different approach to assessing measurement invariance using the Rasch Model (1980), an item response theory model, to investigate item-level equivalence. First, we assessed the factor structure of the reading for meaning items by examining whether a single factor underlay the items. Next, we sought to determine whether a one-factor measurement model for reading for meaning was invariant across disability status and type of test administration (i.e., assessing factorial invariance). In the second stage of the analysis, we examined the data to insure the overall fit to the Rasch model. Finally, we sought to test item invariance over disability status and test administration using the Rasch Model. This conceptualization of measurement invariance includes a consideration of test-level invariance as defined within the framework of confirmatory factor analysis, as well as item-level and person-level invariance as conceived within Rasch measurement theory.

    METHOD

Participants

The students included in this study were drawn from a larger study in Georgia that examined the effects of two test modifications (resource guide and read-aloud) on the performance of students with and without identified disabilities (Engelhard, Fincher, & Domaleski, 2006). The original study included students from 76 schools with a wide range of physical, mental, and intellectual disabilities. Because the value and impact of a test accommodation can vary in relation to the specific disability, we chose to focus only on students identified as having a specific learning disability (SLD) within the broader category of students with disabilities. Table 1 provides a description of the demographic characteristics of the students by disability status (N = 788). Table 2 provides the demographic characteristics by accommodation category (resource guide, read-aloud, and standard administration). Consistent with previous research indicating that male students are disproportionately identified as having learning disabilities (DOE, 2007a; Wagner, Cameto, & Guzman, 2003; Wagner, Marder, Blackorby, & Cardoso, 2002), 70% of the 219 students with specific learning disabilities were male. According to the Georgia Department of Education website, over 700,000 full academic year students participated in the statewide Criterion Referenced Competency Test in reading. Due to NCLB mandates, student ethnicity must also be tracked and reported. This information can be used to infer the overall demographic make-up of all test-takers (as all K–8 students in Georgia are required to complete the CRCT) in order to assess the representativeness of our sample. Across all ethnic groups the sample and population proportions were nearly identical. For example, 47.3% of public school students in Georgia are White, and 48.0% of our sample was composed of White students. Similarly, 38.1% of Georgia's population of students are Black, and 40.3% of our sample was composed of Black students.

TABLE 1
Demographic Characteristics of Seventh-Grade Students by Disability Status

                                         SWOD        SLD       Total
                                       n = 569    n = 219    n = 788
                                        72.2%      27.8%

Gender (percentages)
  1. Male (n = 410)                      32.4       19.7       52.0
  2. Female (n = 376)                    39.6        8.1       47.7

Race/Ethnicity (percentages)
  1. Asian, Pacific Islander              2.3        0.8        3.1
  2. Black, Non-Hispanic                 30.0       10.3       40.3
  3. Hispanic                             3.9        2.0        6.0
  4. American Indian, Alaskan Native      0.0         .1         .1
  5. White, Non-Hispanic                 34.2       13.7       48.0
  6. Multiracial                          1.7        0.9        2.5

Note. SWOD = Students Without Disabilities; SLD = Specific Learning Disability.

Hispanic students compose 8.99% of the student population, and 6.0% of our sample. In an effort to achieve equal group sizes, students with disabilities were oversampled in the original study. We would like to note, however, that 13.39% of Georgia's tested public school students have identified disabilities. We feel confident that our sample adequately represents the population of Georgia students.

    Instrument

Developed in 2000, the Georgia Criterion Referenced Competency Test (CRCT) is a large state standardized assessment designed to measure how well public K–8 students in Georgia have mastered the Quality Core Curriculum (QCC). The QCCs are the curriculum content standards developed by the Georgia Department of Education for its public schools. Georgia law requires that all students in grades 1–8 be assessed in the content areas of reading, English/language arts, and mathematics. In addition, students in grades 3–8 are assessed in social studies and science as well. The CRCT yields information on academic achievement that can be used to diagnose individual student strengths and weaknesses as related to instruction of the QCC, and to gauge the quality of education throughout Georgia. The reading CRCT for sixth-grade students consists of 40 operational selected-response items and 10 embedded field test (FT) items (FT items do not contribute to the student's score) within four content domains: reading for vocabulary improvement, reading for meaning, reading for critical analysis, and reading for locating and recalling. Twelve items from the reading for meaning domain were selected and analyzed here because this domain most closely represents what is commonly referred to as reading comprehension. Reading for Meaning is defined as the recognition of underlying and overall themes and concepts in fiction and nonfiction literature as well as the main idea and details of the text. It also includes the recognition of the structure of information in fiction and nonfiction. Items in the reading for meaning domain include identifying literary forms; identifying purpose of text; identifying characters and their traits; recognizing sequence of events; recognizing text organization/structure; recognizing explicit main idea; and retelling or summarizing.

Data Collection

All state schools were stratified into one of three categories based on the proportion of students receiving free and reduced lunch in each school. Within those categories, schools were then randomly selected and assigned to one of three conditions (the resource guide test modification, the read-aloud, or oral, test modification, or the standard test condition), and all students (both students with disabilities and without disabilities) within the school were tested under the same condition. Two of the three conditions involved the use of a test modification, and the third condition involved the standard administration of the test. It should be noted that, for the purposes of the larger original study, all students were tested under standard operational conditions at the end of the sixth grade during the regular operational administration of the reading exam. The assignment to one of three conditions involved a second administration of the same test, which was given the following spring when students were in the seventh grade. In summary, every student completed the reading exam under standard, operational conditions and then a second time under one of three conditions. Data from the second experimental administration were analyzed for the purposes of this study.

Description of Test Modifications

The resource guide consisted of a single page (front and back) that provided students with key definitions and examples that were hypothesized to be helpful. The resource guides were designed to provide students with scaffolded support, much like the support they would receive in the classroom and much as English language learners receive from a translation dictionary. The guides included commonly used definitions of academic terms and vocabulary words (provided in alphabetical order as in a dictionary) that could be applied to the test. These terms were not assessed by the exam, but rather provided explanations of construct-irrelevant words, expressions, or phrases that might be found in the passages or within the item stems. For example, a question may ask the student to identify the central idea of the passage. The resource guide indicated that the central idea meant the main point. Similarly, vocabulary within a passage that a student may not be familiar with, but that was not directly assessed, was defined in hopes that providing such support would increase the student's comprehension of the overall passage. Vocabulary that was assessed was not defined. The guides were developed by a committee of Georgia Department of Education specialists from assessment, curriculum, and special education offices. Careful attention was given to the constructs measured by the test items. The intent of the guides was to provide students with information they could use to answer questions on the test, but not to provide the students with the answers themselves. It was hypothesized that the removal of construct-irrelevant vocabulary, or expressions, would improve student performance on the exam, as students would be able to focus on the intended construct without confusion or frustration. One could imagine the resource guide as a glossary of important terms used throughout the exam. Because the use of resource guides was new for most students, students were given the opportunity to work through a sample test using the resource guide. Teachers were allowed to review the sample test with students and provide pointers, if necessary, on how the sample test related to the resource guide. Because the test material is secure, it is not possible to reveal the actual content of the resource guides here.


The read-aloud administration involved the teacher reading the entire test to students, including reading passages and questions. Teachers were instructed to read the test to students at a natural pace. Students were encouraged to read along silently with teachers. The third type of administration was simply the standard administration, in which the test was administered in the standard format as if it were an operational administration. Engelhard et al. (2006) should be consulted for additional details regarding the full study.

Procedures

Data analyses were conducted in two stages. In the first stage, analyses were conducted with Mplus computer software (Muthen & Muthen, 1998–2007) using the confirmatory factor-analytic model for dichotomous data as defined by Muthen and Christofferson (1981):

x_g = τ_g + Λ_g ξ_g + δ_g,   (1)

where

x_g is a vector of observed scores for each group,
τ_g is a vector of item intercepts (or thresholds),
Λ_g is a matrix of the factor loadings,
ξ_g is a vector of factor scores (latent variables), and
δ_g is a vector of errors.

With the CFA model, the relationship between observed variables (in this case, 12 reading items) and the underlying construct they are intended to measure (in this case, reading for meaning) is modeled with the observed response to an item represented as a linear function of the latent construct (ξ), an intercept/threshold (τ), and an error term (δ). The factor loading (λ) describes the amount of change in x due to a unit of change in the latent construct (ξ). The thresholds (τ^(g)) and the factor loadings (Λ^(g)) describe the measurement properties of the dichotomous variables. If these measurement properties are invariant across groups, then

τ^(1) = τ^(2) = ... = τ^(G),

Λ^(1) = Λ^(2) = ... = Λ^(G),

where G represents each group (see Muthen & Christofferson, 1981, p. 408).
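To make Equation (1) concrete for dichotomous items, the following minimal sketch (ours, not the authors' Mplus syntax) computes model-implied item probabilities under a probit link, consistent with the WLSMV estimation described below; the parameter values are invented for illustration only.

```python
# Illustrative sketch (not from the article): a one-factor probit measurement
# model for dichotomous items, relating factor loadings (lambda), thresholds
# (tau), and the latent construct (xi). Parameter values are invented.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

lam = np.array([1.00, 0.92, 0.84])   # hypothetical factor loadings
tau = np.array([-0.64, 0.24, 0.52])  # hypothetical item thresholds

def item_probabilities(xi):
    """P(item correct | latent reading ability xi) under a probit link."""
    return norm.cdf(lam * xi - tau)

# Expected item probabilities for a lower- and a higher-ability examinee
for xi in (-1.0, 1.0):
    print(xi, item_probabilities(xi).round(3))

# Simulated dichotomous responses for 5 examinees drawn from N(0, 1)
xi_sample = rng.standard_normal(5)
responses = (rng.random((5, lam.size)) < item_probabilities(xi_sample[:, None])).astype(int)
print(responses)
```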


Confirmatory factor analysis (CFA) was used to test the one-factor measurement model of the 12 reading for meaning items with six groups: one for the full sample and one for each of the five subgroups of interest (students with specific learning disabilities, students without disabilities, students who received the resource guides, students who received the read-aloud administration, and students who received the standard administration). All items were hypothesized to be a function of a single latent factor, and error terms were hypothesized to be uncorrelated. In each model, the factor loading from the latent factor to the first item was constrained to 1.0 to set the scale of measurement.

All parameters were estimated using robust weighted least squares (WLSMV) with delta parameterization. With WLSMV estimation, probit regressions for the factor indicators regressed on the factors are estimated. We used the chi-square statistic to assess how well the model reproduced the covariance matrix. Because this statistic is sensitive to sample size and may not be a practical test of model fit (Cheung & Rensvold, 2002), we used two additional goodness-of-fit indexes less vulnerable to sample size: the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). CFI values near 1.0 are optimal, with values greater than .90 indicating acceptable model fit (Byrne, 2006). With RMSEA, values of 0.0 indicate the best fit between the population covariance matrix and the covariance matrix implied by the model and estimated with sample data. Generally, values less than .08 are considered reasonable, with values less than .05 indicating a closer approximate fit (Kline, 2005).
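As a small sketch of what the RMSEA index summarizes, the function below applies the conventional point-estimate formula based on the model chi-square, its degrees of freedom, and the sample size; this is not output reproduced from Mplus, and the WLSMV-adjusted version used by the software may differ slightly.

```python
# Sketch (not the authors' code): the conventional RMSEA point estimate
# computed from a model chi-square, its degrees of freedom, and sample size.
import math

def rmsea(chi_square: float, df: int, n: int) -> float:
    """Root mean square error of approximation, conventional point estimate."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

# Roughly reproduces the RMSEA of about .04 reported below for the SLD subgroup
print(round(rmsea(chi_square=53.22, df=40, n=219), 3))
```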

Because identical model specification for each subgroup does not guarantee that item measurement is equivalent across groups (Byrne & Campbell, 1999), we conducted a series of tests for multigroup invariance by examining two increasingly restrictive hierarchical CFA models. Models were run separately by disability status as well as accommodation category, and the fit statistics were used to verify adequate model fit before proceeding to subsequent steps (Byrne, 2006). Muthen and Muthen (1998–2007) recommend a set of models to be considered for measurement invariance of categorical variables, noting that because the item probability curve is influenced by both parameters, factor loadings and thresholds must be constrained in tandem. In the baseline model, thresholds and factor loadings are free across groups; scale factors are fixed at one in all groups; and the factor means are fixed at zero in all groups (to insure model identification). This baseline model provides a model against which the subsequent invariance model can be compared. In the second model, factor loadings and thresholds were constrained to be invariant across the groups; scale factors were fixed at one in one group and free in the others; and factor means were fixed at zero in one group and free in the others (the Mplus default). This is because the variances of the latent response variables are not required to be equal across subgroups (Muthen & Muthen, 1998–2007). Because the chi-square values for WLSMV estimations cannot be used for ordinary chi-square difference tests, we compared the fit of the two models using the DIFFTEST option to determine if an argument for factorial invariance could be supported. The DIFFTEST (available in Mplus) null hypothesis asserts that the restricted model does not worsen model fit relative to the unconstrained model. Non-significant p-values therefore indicate equivalent model-data fit, consistent with factorial invariance. In the absence of full factorial invariance, data were also examined to determine if partial measurement invariance was present. Partial measurement invariance applies when factors are configurally invariant (as in the baseline model) but do not demonstrate full metric (factor loading) invariance (Byrne, Shavelson, & Muthen, 1989). Byrne et al. (1989) assert that further tests of invariance and analysis can continue as long as configural invariance has been established and at least one item is metrically invariant. In such cases, Benedict, Steenkamp, and Baumgartner (1998) recommend that invariance constraints be relaxed for highly significant modification indices in order to minimize chance model improvement and maximize cross-validity of the model.

In the second stage, data analyses were conducted with the FACETS computer program for multi-faceted Rasch measurement (MRM; Linacre, 2007). Three models were fit to these data. Model I can be written as follows:

ln(P_nijk1 / P_nijk0) = θ_n - δ_i - α_j - λ_k,   (2)

where

P_nijk1 = probability of person n succeeding on item i for group j and administration k,
P_nijk0 = probability of person n failing on item i for group j and administration k,
θ_n = location of person n on the latent variable,
δ_i = difficulty of item i,
α_j = location of group j, and
λ_k = location of administration k.

This model dictates that student achievement in reading for meaning is the latent variable that is made observable through a set of 12 reading items, and that the items vary in their locations on this latent variable. Unlike the CFA model, the Rasch Model (a) allows for person and item parameters to be estimated independently of each other and (b) includes no item discrimination parameter (or item loadings), as discrimination is assumed to be equal across all items. The observed responses are dichotomous (correct or incorrect), and they are a function of both person achievement and the difficulty of the item. Group membership (dichotomously scored as student with a specific learning disability or student without disability) and type of administration (standard, resource guide, read-aloud) may influence person achievement levels. Once estimates of the main-effect parameters were obtained, the residuals were defined. The unstandardized residual reflects the difference between the observed and expected responses:

R_nijk = x_nijk - P_nijk.   (3)

A standardized residual can also be defined as follows:

Z_nijk = (x_nijk - P_nijk) / [(1 - P_nijk)P_nijk]^(1/2).   (4)

These residuals can be summarized to create mean square error (MSE) statistics (labeled Infit and Outfit statistics in the FACETS computer program) for each item and person. These MSE statistics can also be summarized over items and persons, as well as over subsets of items and subgroups of persons. See Wright and Masters (1982) for a description of the Rasch-based fit statistics.

In addition to establishing item parameters and model fit statistics, the FACETS program was used to examine uniform differential item functioning (DIF). DIF is present when the locations of the items vary, beyond sampling error, across group characteristics, such as gender, race, or disability status. If a researcher suspects that certain characteristics may interact or behave differently than others, one can simply add an interaction term for those two characteristics. Model II focuses on examining the interaction effects between items i and groups j (δ_i α_j). This can be written as follows:

ln(P_nijk1 / P_nijk0) = θ_n - δ_i - α_j - λ_k - δ_i α_j.   (5)

Student groups are defined as students with and without disabilities. This model explores whether or not the items are functioning invariantly over disability status.

The final model examined, Model III, assesses possible interaction effects between items i and administrations k (δ_i λ_k). It can be written as:

ln(P_nijk1 / P_nijk0) = θ_n - δ_i - α_j - λ_k - δ_i λ_k.   (6)

This model explores whether or not the items functioned invariantly across test administrations.
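To make Equations (2) through (4) concrete, the sketch below (ours, not the authors' FACETS setup, with invented parameter values) computes the model probability for one response, its unstandardized and standardized residuals, and an outfit mean square for a small set of responses.

```python
# Sketch of Equations (2)-(4) and an outfit mean square, with invented
# calibrations; FACETS estimates these parameters from the data.
import numpy as np

def p_correct(theta, delta, alpha, lam):
    """Model probability of a correct response under Model I (Equation 2)."""
    logit = theta - delta - alpha - lam
    return 1.0 / (1.0 + np.exp(-logit))

# Invented calibrations (logits): person, item, group, administration
theta, delta, alpha, lam = 0.94, -0.30, 0.10, 0.05
p = p_correct(theta, delta, alpha, lam)

x = 1                                   # observed response (correct)
residual = x - p                        # Equation (3)
z = (x - p) / np.sqrt(p * (1 - p))      # Equation (4)
print(round(p, 3), round(residual, 3), round(z, 3))

# Outfit mean square for a set of responses: the mean of the squared
# standardized residuals (its expected value is about 1.0).
x_vec = np.array([1, 0, 1, 1, 0])
p_vec = np.array([0.80, 0.55, 0.70, 0.40, 0.25])  # invented model probabilities
z_vec = (x_vec - p_vec) / np.sqrt(p_vec * (1 - p_vec))
print(round(np.mean(z_vec ** 2), 2))
```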

For both Model II and Model III, FACETS provides bias measures in terms of logits. These estimates are reported as t-scores (the bias measure divided by its standard error) with finite degrees of freedom. When dealing with more than 30 observations, t-scores with an absolute value greater than two are considered statistically significant, indicating differential item functioning and a threat to item-level invariance. Because we can expect statistically significant results to appear by chance due to the use of multiple significance tests, we used the Bonferroni multiple comparison correction to guard against spurious significance. To test the hypothesis that there is no DIF in this test at the p < .05 level, the most significant DIF effect must have p < .05 divided by the number of item-DIF contrasts.
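The sketch below illustrates this decision rule (it is not FACETS output). It flags item-by-group contrasts whose two-sided p-values fall below the Bonferroni-adjusted threshold; the t-value for Item 7 echoes the value reported later in the Results, while the other values are invented, and a normal approximation stands in for the finite-df t distribution.

```python
# Sketch of Bonferroni-adjusted flagging of DIF contrasts. t-values other
# than Item 7's are invented; degrees of freedom are assumed large enough
# to justify the normal approximation used here.
from scipy.stats import norm

t_values = {4: 2.10, 7: 3.53, 9: -0.85}   # item: t-score (illustrative)
n_contrasts = 12                           # one item-by-group contrast per item
alpha = 0.05
bonferroni_p = alpha / n_contrasts         # .05 divided by the number of contrasts

for item, t in t_values.items():
    p_two_sided = 2 * (1 - norm.cdf(abs(t)))
    flagged = p_two_sided < bonferroni_p
    print(f"Item {item}: t = {t:+.2f}, p = {p_two_sided:.4f}, DIF after correction: {flagged}")
```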

    RESULTS

Study results are discussed within the framework of the two stages: first, the results of the multigroup confirmatory factor analysis using Mplus software; and second, the results of the Rasch analyses using FACETS software.

Stage 1: Confirmatory Factor Analyses

Results for Stage 1 of the data analysis are divided into three subsections. The first subsection addresses the fit of the measurement model within each subgroup: students with SLD, students without disabilities, and students who received the resource guide, read-aloud, and standard administrations. The second subsection details the examination of factorial invariance across test administration. The final subsection describes the examination of factorial invariance across disability status.

Model fit within each subgroup. Recollect that five separate CFAs were conducted to examine the measurement models of reading for meaning for each subgroup of interest. The reading for meaning measurement model demonstrated excellent model fit for students without a specific learning disability, χ²(45) = 41.20, p = .63, CFI = 1.00, RMSEA = .00; for students with SLD, χ²*(40) = 53.22, p = .08, CFI = .96, RMSEA = .04; for students who received the resource guide test administration, χ²(38) = 40.24, p = .37, CFI = 1.00, RMSEA = .02; for students who received the read-aloud administration, χ²(36) = 44.48, p = .16, CFI = 0.98, RMSEA = .03; and for students who received the standard administration, χ²(38) = 31.97, p = .74, CFI = 1.00, RMSEA = .00. Consequently, when testing groups for factorial invariance, we specified the same model for all subgroups.

*Degrees of freedom for these groups differ due to the way in which they are computed for the WLSMV estimator.


Factorial invariance across administration type. Recall that to assess between-group invariance, we examined change in fit statistics between the baseline model (i.e., factor loadings and thresholds free) and Model 2, in which factor loadings and thresholds were constrained to be equal, or invariant. Our findings, presented in Table 3, suggest excellent overall fit for the baseline model, χ²(112) = 117.76, p = .34, CFI = 1.00, RMSEA = .01. Model 2 also reflects adequate model fit, χ²(122) = 152.80, p = .03, CFI = .98, RMSEA = .03. Using the DIFFTEST option in Mplus, we assessed whether Model 2 (the nested model) was significantly different from the less restrictive model: χ²(18) = 43.27, p < .01. Results suggest that the factor structure of the reading for meaning domain is not invariant across the three test administrations.

TABLE 2
Demographic Characteristics of Seventh-Grade Students by Test Administration

                                    Resource Guides   Read Aloud   Standard     Total
                                          n = 254       n = 257     n = 275   n = 786
                                           32.3%         32.7%       35.0%

Gender (percentages)
  1. Male (n = 410)                          16.8          16.9        18.4      52.2
  2. Female (n = 376)                        15.5          15.8        16.5      47.8

Race/Ethnicity (percentages)
  1. Asian, Pacific Islander                  1.5           0.1         1.4       3.1
  2. Black, Non-Hispanic                      9.2          13.2        17.9      40.3
  3. Hispanic                                 1.7           2.4         1.9       6.0
  4. American Indian, Alaskan Native          0.0           0.0         0.1       0.1
  5. White, Non-Hispanic                     18.8          16.0        13.1      48.0
  6. Multiracial                              1.1           0.9         0.5       2.5

TABLE 3
Tests for Invariance for Reading for Meaning Measurement Model Across Test Administration: Summary of Goodness of Fit Statistics

Equality Test                       χ²      df    CFI    RMSEA     Δχ²    p-value
No Constraints (Configural)       117.76   112   1.00     .01       __       __
Factor Loadings & Thresholds      152.80   122    .98     .03     43.27      .00
Free Items 6 & 12                 133.05   120    .99     .02     22.81      .08

Note. χ² = chi-square statistic based on robust weighted least squares estimation; df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation. Robust statistics are reported. Students who received resource guide (n = 254), students who received read-aloud (n = 257), students who received standard administration (n = 275).

Consequently, the data were investigated to determine if partial measurement invariance (Byrne, Shavelson, & Muthen, 1989) could be established across test administrations. Examination of the modification indices revealed that releasing the equality constraints on both the factor loadings and thresholds of Items 6 and 12 resulted in a better overall model, χ²(120) = 133.05, p = .20, CFI = 0.99, RMSEA = .02, and a non-significant chi-square test for difference, χ²(15) = 22.81, p = .08. Closer examination of the unconstrained parameter estimates, displayed in Table 4, revealed that Item 6 was less discriminating and easier for students in the read-aloud test administration than for students in the resource guide and standard test administrations. Furthermore, Item 12 was more discriminating and easier for students in the resource guide test administration than in the read-aloud or standard administrations. These findings suggest partial measurement invariance, or factorial invariance for a majority of the items.

Factorial invariance across disability type. In the next series of models within Stage 1 we examined factorial invariance across disability status, as presented in Table 5. As in the analyses of test administration, results indicate excellent overall fit across disability status with the baseline model, χ²(84) = 95.40, p = .19, CFI = .99, RMSEA = .01. Model 2 (factor loadings and thresholds constrained) also demonstrated adequate fit, χ²(90) = 98.95, p = .24, CFI = .99, RMSEA = .02.

TABLE 4
Item 6 and Item 12 Unconstrained Parameter Estimates

            Resource Guide              Read Aloud                Standard
Item   Factor Loading  Threshold   Factor Loading  Threshold   Factor Loading  Threshold
  6        1.198          .644          .735          .885         1.030          .523
 12         .923          .239          .838          .042          .835          .080

TABLE 5
Tests for Invariance for Reading for Meaning Measurement Model Across Disability Status: Summary of Goodness of Fit Statistics

Equality Test                       χ²     df    CFI    RMSEA     Δχ²    p-value
No Constraints (Configural)        95.40   84    .99     .01       __       __
Factor Loadings & Thresholds       98.95   90    .99     .02      8.14      .62

Note. χ² = chi-square statistic based on robust weighted least squares estimation; df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation. Robust statistics are reported. Regular education students (n = 569), students with specific learning disabilities (n = 219).

Again, using the DIFFTEST option in Mplus, we assessed whether Model 2 (the nested model) was significantly different from the less restrictive model: χ²(10) = 8.14, p = .62. Results support complete factorial invariance across disability type.

Given evidence that the measurement model representing the latent reading ability for the reading for meaning factor was invariant across disability status and demonstrated partial invariance across test administration, we ran a final CFA for the full sample using the original model (all items loading on the latent factor reading for meaning). This final full model showed excellent fit to the data, χ²(47) = 56.89, p = .15, CFI = 1.00, RMSEA = .02. Stage 1 results provide strong evidence that, at the test level, (a) the reading for meaning domain is a unidimensional construct and (b) the factorial structure is fully invariant across disability status and partially invariant across administration type.

Stage 2: Multifaceted Rasch Measurement

Next, we turned our attention to Stage 2 of our data analysis, based on the Rasch measurement model. The results within this stage are divided into three subsections. The first subsection presents the main effects model (Model I). The second and third subsections explore the interaction between items and disability status (Model II) and between items and test administration (Model III).

Model I: Main effects model. Figure 1 displays a variable map representing the calibrations of the students, items, conditions, and groups. The FACETS computer program (Linacre, 2007) was used to calibrate the four facets. The first column of Figure 1 represents the logit scale. The second column of the variable map displays the student measures of reading (for meaning) achievement. Higher ability students appear at the top of the column, while lower ability students appear at the bottom. Each asterisk represents 8 students. The student achievement measures range from -4.36 logits to 4.49 logits (M = .94, SD = 1.64, N = 786). The third column shows the locations of the administration conditions on the latent variable. Administrations appearing higher in this column yielded higher achievement. In the case of the reading for meaning items, the read-aloud administration yielded slightly higher results than both the standard and resource guide administrations; the resource guide administration yielded the lowest results overall. Group differences are shown in column four of the variable map. As expected, the overall achievement of the students without specific learning disabilities was higher on average as compared to the students with specific learning disabilities. The fifth and final column represents the locations of the reading for meaning items, with item difficulty ranging from -1.02 logits to 1.86 logits (M = .00, SD = .84, N = 12).

Table 6 presents a variety of summary statistics related to the FACETS analyses. The items, administrations, and disability status are anchored at zero by definition.

FIGURE 1 Variable map of reading ability. * = 8 students. Higher values for the student, type of administration, and disability status facets indicate higher scores on the reading ability construct. Higher values on the item facet indicate harder items. (The map displays, from left to right, the logit scale and the calibrations of the students, administration conditions, disability groups, and items.)

In order to define an unambiguous frame of reference for the model, only one facet (student measures) is allowed to vary. The overall model-data fit is quite good. The expected value of the mean square error statistics (infit and outfit) is 1.00 with a standard deviation of .20, and the expected values for these statistics are very close to the observed values. The most prominent exception is the student facet, which has more variation than expected for the outfit statistic (M = 1.00, SD = .59).

As shown in Table 6, all four of the reliability of separation statistics are statistically significant (p < .01): students, disability status, type of administration, and items. The reliability of separation statistic is conceptually equivalent to Cronbach's coefficient alpha, and these statistics are used to test the hypothesis of whether or not there are significant differences between the elements within a facet. The largest reliability of separation index is .99 (items), indicating a good spread of reading for meaning items on the latent variable. The smallest reliability of separation index is .65 (students). Given the small number of items (N = 12), this is comparable to the values obtained for other subtests in similar situations. Both the type of administration (.83) and disability status (.98) were also well differentiated.
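As a rough sketch of what the reliability of separation summarizes, the function below computes one common form of the index, the ratio of estimated "true" variance to observed variance of the facet measures; the measures and standard errors are invented, and FACETS' exact computation may differ in detail.

```python
# Sketch (not FACETS output): reliability of separation as the ratio of
# estimated true variance to observed variance of the measures within a
# facet. The measures and standard errors below are invented.
import numpy as np

def separation_reliability(measures, standard_errors):
    """Estimated true variance / observed variance of the facet measures."""
    observed_var = np.var(measures, ddof=1)
    error_var = np.mean(np.square(standard_errors))   # mean square measurement error
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

item_measures = np.array([-1.02, -0.8, -0.75, -0.7, -0.2, 0.1,
                          0.3, 0.5, 0.8, 0.9, 1.0, 1.86])       # invented item calibrations
item_ses = np.full(12, 0.09)                                    # invented standard errors
print(round(separation_reliability(item_measures, item_ses), 2))  # ~.99, similar in size to Table 6
```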

Model II: Item × disability status interactions. Model II explores the interactions among items and disability status; that is, it explores whether or not the items are functioning invariantly over groups (i.e., differential item functioning). Two items (4 and 7) exhibited statistically significant differential item functioning.

TABLE 6
Summary Statistics for FACETS Analysis (Reading for Meaning Items, Grade 7)

                              Students     Item    Administration   Disability Status
Measures
  Mean                            .94       .00          .00               .00
  SD                             1.64       .84          .12               .30
  n                               786        12            3                 2
Infit
  Mean                           1.00       .99         1.00              1.02
  SD                              .24       .11          .01               .05
Outfit
  Mean                           1.00      1.00         1.00              1.03
  SD                              .59       .23          .02               .07
Reliability of Separation         .65*      .99*         .83*              .98*
Chi-Square Statistic           1795.6     884.8         17.6             107.4
Degrees of Freedom                785        11            2                 1

*p < .01.

Recall that the use of multiple significance tests can result in spurious significance, so Bonferroni multiple comparison tests were used to confirm any apparent DIF. Comparison tests revealed only one statistically significant item: Item 7 was differentially easier for students with specific learning disabilities, with an observed score of 0.32 but an expected score of 0.23, t = 3.53.

Model III: Item × test administration interactions. Model III explores item-level invariance across test administrations. We found no statistically significant interaction bias between test administration and the reading for meaning items, suggesting complete item-level invariance across the type of test administration.

    SUMMARY AND DISCUSSION

The major contribution of this study is to encourage a systematic approach to establishing measurement invariance on large-scale state assessments with dichotomous data. By combining and integrating both confirmatory factor-analytic and Rasch measurement procedures, practitioners are able to develop a more complete picture of the extent to which score-based inferences from these measures hold across several subgroups within a population of students. Although establishing measurement invariance is essential for all tests and measures that seek to make inferences across multiple groups, it is particularly necessary when these inferences have high stakes consequences (i.e., promotion, retention, or graduation). Add to this the legal obligation of a school system to assess accurately protected or vulnerable groups (i.e., students with disabilities), and the significance of this study becomes apparent.

A two-stage approach was utilized. The first stage works within a CFA framework to establish both unidimensionality and test-level measurement invariance, specifically factorial equivalence. Assuming factorial equivalence is established in the first stage, the second stage works within a narrower conceptual framework focusing on invariance at the item level, using a model that allows for the separation of item and person parameters. These complementary methods enable the practitioner to address issues of model-data misfit to insure accurate interpretation of test scores.

The results of this study provide strong evidence that the reading for meaning items of the CRCT exhibit test-level invariance across students with SLD and students without disabilities. The factorial invariance across test administration, however, is less clear. Multigroup confirmatory factor analysis revealed a one-factor model with partial measurement invariance (when Items 6 and 12 are freely estimated). These findings suggest that the use of read-aloud and resource guide accommodations may change the underlying structure of the exam. Further examination into the utility and appropriateness of these test accommodations may be necessary.

Analyses using the Rasch Model also suggest overall good item fit (outfit = 0.99), with only one item exhibiting evidence consistent with differential item functioning across disability status: students with SLD performed differentially better than expected on Item 7. Closer examination of this item also reveals some mild item misfit (outfit = 1.26). The tendency of this item to function differentially across disability status, its lack of fit to the measurement model, as well as its extremely low p-value (0.39) suggest that it is a threat to item-level measurement invariance and should be examined more closely by measurement professionals and practitioners. Indeed, such results suggest a clear need for detailed qualitative interpretations of the quantitative analysis. The two-stage approach to assessing measurement invariance described in this article provides a useful template that can be used, in conjunction with qualitative evaluations, to aid in establishing fairness and equity in high stakes testing.

    ACKNOWLEDGMENTS

We thank Chris Domaleski and Melisa Fincher for providing us with access to the data set. The opinions expressed in this article are those of the authors, and they do not reflect the views of the Georgia Department of Education.

    REFERENCES

Asparouhov, T., & Muthen, B. (2006). Robust chi-square difference testing with mean and variance adjusted test statistics. Mplus Web Notes: No. 10. Retrieved January 13, 1998, from http://www.statmodel.com/download/webnotes/webnote10.pdf

Benedict, J., Steenkamp, E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.

Bennett, R., Rock, D., & Novatkoski, I. (1989). Differential item functioning on the SAT-M Braille edition. Journal of Educational Measurement, 26(1), 67–79.

Bielinski, J., Thurlow, M., Ysseldyke, J., Freidebach, J., & Freidebach, M. (2001). Read aloud accommodations: Effects on multiple choice reading and math items (Technical Report 31). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco: Holden-Day.

Bolt, S., & Thurlow, M. (2006). Item level effects of the read aloud accommodation for students with disabilities (Synthesis Report 65). Minneapolis: University of Minnesota, National Center on Educational Outcomes.

Bolt, S., & Ysseldyke, J. (2006). Comparing DIF across math and reading/language arts tests for students receiving a read-aloud accommodation. Applied Measurement in Education, 19(4), 329–355.

Byrne, B. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Byrne, B., & Campbell, T. L. (1999). Cross-cultural comparisons and the presumption of equivalent measurement and theoretical structure: A look beneath the surface. Journal of Cross-Cultural Psychology, 30(5), 555–574.

Byrne, B., Shavelson, R., & Muthen, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.


Cheung, G., & Rensvold, R. (2002). Evaluating goodness of fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9, 233–245.

Cohen, A., Gregg, N., & Deng, M. (2005). The role of extended time and item content on a high-stakes mathematics test. Learning Disabilities Research and Practice, 20(4), 225–233.

Elbaum, B. (2007). Effects of an oral testing accommodation on the mathematics performance of secondary students with and without learning disabilities. The Journal of Special Education, 40(4), 218–229.

Elbaum, B., Arguelles, M., Campbell, Y., & Saleh, M. (2004). Effects of a student-reads-aloud accommodation on the performance of students with and without learning disabilities on a test of reading comprehension. Exceptionality, 12(2), 71–87.

Engelhard, G., Fincher, M., & Domaleski, C. S. (2006). Examining the reading and mathematics performance of students with disabilities under modified conditions: The Georgia Department of Education modification research study. Atlanta: Georgia Department of Education.

Fuchs, L. (2000a, July). The validity of test accommodations for students with disabilities: Differential item performance on mathematics tests as a function of test accommodations and disability status. Final Report: U.S. Department of Education through the Delaware Department of Education.

Fuchs, L. (2000b, July). The validity of test accommodations for students with disabilities: Differential item performance on reading tests as a function of test accommodations and disability status. Final Report: U.S. Department of Education through the Delaware Department of Education.

Huesman, R., & Frisbie, D. (2000, April). The validity of ITBS reading comprehension test scores for learning disabled and non learning disabled students under extended time conditions. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Huynh, H., & Barton, K. (2006). Performance of students with disabilities under regular and oral administrations of a high-stakes reading examination. Applied Measurement in Education, 19(1), 21–39.

Huynh, H., Meyer, J., & Gallant, D. (2004). Comparability of student performance between regular and oral administrations for a high stakes mathematics test. Applied Measurement in Education, 17(1), 39–57.

Kline, R. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.

Linacre, J. M. (2007). A user's guide to FACETS: Rasch-model computer programs. Chicago: winsteps.com.

Meloy, L., Deville, C., & Frisbie, D. (2000, April). The effect of a reading accommodation on standardized test scores of learning disabled students. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543.

Munger, G., & Loyd, B. (1991). Effect of speededness on test performance of handicapped and nonhandicapped examinees. Journal of Educational Research, 85(1), 53–57.

Muthen, B., & Christofferson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46(4), 407–419.

Muthen, L. K., & Muthen, B. O. (1998–2007). Mplus user's guide (5th ed.). Los Angeles, CA: Muthen & Muthen.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960)

Rock, D., Bennett, R., & Kaplan, B. (1985). Internal construct validity of the SAT across handicapped and nonhandicapped populations (ETS Research Report RR-85-50). Princeton, NJ: Educational Testing Service.

Rock, D., Bennett, R., & Kaplan, B. (1987). Internal construct validity of a college admissions test across handicapped and nonhandicapped groups. Educational and Psychological Measurement, 47(1), 193–205.


Runyan, M. (1991). The effect of extra time on reading comprehension scores for university students with and without learning disabilities. Journal of Learning Disabilities, 24(2), 104–108.

U.S. Department of Education. (2007a). Demographic and school characteristics of students receiving special education in elementary grades (NCES Publication 2007-005). Jessup, MD: National Center for Educational Statistics.

U.S. Department of Education. (2007b). Title I: Improving the academic achievement of the disadvantaged; Individuals with Disabilities Education Act (IDEA); Final rule. Federal Register, Vol. 72, No. 67, Monday, April 9, 2007.

Wagner, M., Marder, C., Blackorby, J., & Cardosa, D. (2002). The children we serve: The demographic characteristics of elementary and middle school students with disabilities and their households. Menlo Park, CA: SRI International.

Wagner, M., Cameto, R., & Guzman, A. (2003). Who are secondary students in special education today? (A report from the National Longitudinal Transition Study). Retrieved September 1, 2008, from http://www.ncset.org/publications

Wright, B. D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

Wu, A., Li, Z., & Zumbo, B. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research & Evaluation, 12(3), 1–23.
