

WHERE ARE THE LIMITS TO THE RASCH ADVANTAGE?

Tim Pelton, [email protected]

Paper presented to the International Objective Measurement Workshop (IOMW), New Orleans, April, 2002

Under review for publication

ABSTRACT

The Rasch model is a useful tool for converting raw observational data into item difficulty and person ability estimates on an approximately linear measurement scale. However, in order for the Rasch model to produce true interval results, it appears that the data set must be infinite and the underlying assumptions must be perfectly met (fit to logistic item characteristic curves with common slopes, unidimensionality, local independence, and no guessing). In this paper a series of simulation studies is used to examine the relative accuracy of the Rasch model compared to other measurement models when response distributions contain reasonable deviations from the Rasch assumptions.

Person ability estimates were found to be comparable in accuracy across all models through each of the simulated error conditions. However, the item difficulty estimates were much less stable and the Rasch model was superior in only a few conditions. The results suggest that the Rasch model is most effective relative to other models when the items used have equal discriminations with no guessing, or when the person sample is small or extremely homogeneous. The two-parameter logistic (2PL) model in item response theory (IRT) appears to produce more accurate item difficulty estimates when even minor variations in discrimination are present (variable uniqueness components) and guessing is limited. The three-parameter logistic (3PL) model exhibits an advantage on item difficulty estimates when the data contain substantial guessing (as modeled with the 3PL model). Finally, the classical test theory (CTT) model results were very similar to the Rasch model results for the simulation data used.

INTRODUCTION

Some Measurement Models and their Assumptions

When a large collection of item responses fits an underlying equal-slope, logistic curve model, it has been demonstrated (Brogden, 1977; Newby, Grant, Conner, & Bunderson, 2002; Perline, Wright, & Wainer, 1979) that the Rasch model (Rasch, 1992) is a form of additive conjoint measurement (ACM) (Luce & Tukey, 1964) and thus is able to transform this observational data into interval scale measures. Wright (Wright, 1988a, 1999) and others (Andersen, 1977; Andrich, 1988; Bond & Fox, 2001) present many intuitively appealing arguments and examples to convince us of the singular advantages of the Rasch model in creating interval scales from observational data.

However, there are others who, upon examining the Rasch model, deny its potential to generate interval measures from real data (Cliff, 1989, 1996; McDonald, 1981, 1999; Michell, 1986, 1990). Indeed, one of the authors of ACM has stated that “…aside from the ordinal case, no one really has a fundamental measurement theory that nicely combines algebraic structure with stochastic features…” (R.D. Luce, personal communication, October 3, 2001). It appears that the truth depends upon one’s tolerance for error.


The Rasch approach to building a measurement instrument is consistent with the careful development of a theory or construct-linked instrument (Bunderson, 2000; Linacre & Wright, 1999; Worthen, White, Fan, & Sudweeks, 1999; Wright & Stone, 1979). With judicious efforts to refine items by maximizing the consistency of the item set with a single construct factor while maintaining the integrity of the construct, a reasonably reliable and valid instrument can be created (Bond & Fox, 2001).

While the creation of a set of construct-linked items that effectively load onto a single dimension seems plausible, the concurrent Rasch requirement that these items should have equal uniqueness components yielding parallel item characteristic curves (ICCs) seems unlikely to be met. If a practitioner were to engage in an overenthusiastic selection and refinement of items with the goal of creating an item set that shares a common ICC form, the resulting instrument might become very reliable and have minimal error estimates, but it would likely also be much less valid with respect to the originally intended construct. If this process were taken to the extreme, the elimination of all misfit would lead to the elimination of all items (Linacre, 1995). When a more moderate approach is taken, and some misfit is accepted, then the accuracy of the estimates is compromised to some degree and thus the interval nature of the scale is questionable.

In response to the empirical reality of non-parallel ICCs in most large-scale tests, the 2PL and 3PL IRT models were developed (Birnbaum, 1968; Lord, 1970, 1980). The 2PL and 3PL models incorporate discrimination parameters that determine the central slope of the ICCs for individual items and ultimately characterize the uniqueness components for each item (McDonald, 1981; Pelton & Bunderson, 2002). The 3PL model also includes a pseudo-guessing parameter that defines a lower asymptote for the ICCs. With these additional parameters, a more accurate fit to empirical observation is obviously possible. However, these models cannot claim to produce perfectly interval scale estimates; rather, they are intended to produce pragmatic approximations to interval scale values when the data set fits the more flexible underlying 2PL or 3PL ICCs (Baker, 1992; Bock, Thissen, & Zimowski, 1997; Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980).
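For reference, the three logistic ICC forms discussed above can be written as follows. These are the common textbook parameterizations, not equations reproduced from this paper; the item subscript i is added here for clarity, and some presentations multiply the exponent by a scaling constant D = 1.7, which is omitted in this sketch.

```latex
\begin{align}
\text{Rasch:}\quad P_i(\theta) &= \frac{1}{1 + e^{-(\theta - b_i)}} \\
\text{2PL:}\quad P_i(\theta) &= \frac{1}{1 + e^{-a_i(\theta - b_i)}} \\
\text{3PL:}\quad P_i(\theta) &= c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
\end{align}
```

Here θ is person ability, b_i is item difficulty, a_i is the discrimination (governing the central slope), and c_i is the pseudo-guessing parameter (the lower asymptote). Setting c_i = 0 reduces the 3PL form to the 2PL form, and additionally fixing a_i at a common value gives parallel ICCs as the Rasch model requires.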

The 2PL and 3PL IRT approaches have detractors too. As with the Rasch model, some deny the potential of all IRT methods because of their imperfection (Cliff, 1996; Michell, 1999). Rasch proponents typically dismiss the other IRT models, suggesting that additional parameters are “blandishments” (p. 91) that destroy additivity, and that the crossing ICCs model “…destroys the variable’s criterion definition” (Wright, 1999 p. 95).

The 2PL and 3PL models also have computational limitations, in that they require prior distributions of parameters and artificial constraints in order to ensure efficient convergence (Mislevy & Bock, 1990). This limitation is directly related to the fact that there are no sufficient statistics that allow for conjoint estimation of item difficulties and person abilities and thus no ACM (although Lord does describe a sufficient statistic for person ability (Lord, 1980)).

Prior to the introduction of the Rasch and IRT models, various approaches were proposed and used that attempted to produce approximately linear scales within classical test theory (CTT) (Burke, 1953; Gulliksen, 1950; Thurstone, 1917, 1928, 1931; Wright, 1988b). The simplest of these approaches assumes that the distribution in question is approximately normal, so that estimating normal deviates transforms the results to an interval scale (Gulliksen, 1950). This approach is similar to the PROX algorithm used in the first iterations of Rasch analysis (Wright & Stone, 1979), and is the fourth model examined in this research.
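The normal-deviate idea can be illustrated with a minimal Python sketch. It is not the procedure used in this study, which is not spelled out at this point; the function name, the sign convention (harder items receive higher values), and the example proportions are assumptions made for illustration only.

```python
from statistics import NormalDist


def normal_deviate_difficulty(p_values):
    """Convert proportion-correct values to normal-deviate difficulties.

    Assumption of this sketch: difficulty is the negated inverse normal CDF
    of the proportion correct, so harder items (lower p) get higher values.
    """
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    return [-std_normal.inv_cdf(p) for p in p_values]


# Hypothetical proportions correct for an easy, a medium, and a hard item.
difficulties = normal_deviate_difficulty([0.84, 0.50, 0.16])
print([round(d, 2) for d in difficulties])
```

An item answered correctly by half of the sample lands at zero on this scale, and items that are equally far from 50% in opposite directions land symmetrically around it.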


Objective

Ultimately, researchers’ best efforts in the human sciences can only produce essentially interval measurement scales with some level of error. While it is apparent that this same limitation due to error occurs in the physical sciences, the magnitudes of error are typically much smaller (Kyburg, 1992), and thus less likely to impede science by obscuring the effects of interventions. In order to maximize the potential of our research efforts we should seek out and use the measurement models that are able to produce the most accurate interval scale position estimates and error estimates.

When noise is inserted into a response set that matches the assumptions of the 2PL or 3PL model, more accurate results might be expected from those models. However, because the Rasch model has an advantage in its unique ability to generate interval scale estimates with perfectly fitting data, it is not obvious under which conditions each model might prove superior.

The following specific questions are addressed in this research:

Research question 1. Which measurement model (CTT, Rasch, 2PL and 3PL) produces the most accurate average estimates of person ability?

Research question 2. Which measurement model (CTT, Rasch, 2PL and 3PL) produces the most accurate average estimates of item difficulty?

Research question 3. Are error estimates accurate? Do the mean standard errors (SEs) of ability and difficulty estimates arising from the calibrations compare well with the observed deviations from generating values (accuracy) across the repetitions?

METHOD

A simulation study seemed to be the only reasonable method to address the research questions, because of the need to compare observed results to latent values, which by definition are unknowable (Harwell, 1997). It was expected that evidence of accuracy, in the form of root mean squared deviations (RMSDs) between estimated and generating parameters, would support the discovery of the gross relative effects of model-data misfit on the accuracy of average item difficulty estimates and person ability estimates across the CTT, Rasch, 2PL and 3PL models. Further evidence in the form of scatterplots was used to examine results for bias and consistency.

Generation of Simulated Data

The 3PL model was used for the generation of data in this study because it was the best option available: it has been found to fit empirical data well (Lord, 1980), and the parameters used in the generation of data allow for effective manipulation of Rasch model misfit (variable discrimination, pseudo-guessing) and the subsequent direct examination of the accuracy of person ability and item difficulty estimates.

To generate dichotomous test data, probability matrices (persons x items) were calculated (using the parameter distributions described in the next section) containing the probabilities of success for each person-item interaction under the 3PL measurement model. Twenty-five dichotomous data set replications were then generated from each of the probability matrices by systematically comparing each of the calculated probabilities to pseudo-randomly generated numbers from a U(0,1) distribution.
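The generation step described above can be sketched in Python as follows. This is an illustrative sketch, not the code used in the study: the function names, the toy parameter values, the seed, and the choice to omit a 1.7 scaling constant from the logistic are all assumptions.

```python
import math
import random


def p_3pl(theta, a, b, c):
    """3PL probability of success (assumes no 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def probability_matrix(thetas, items):
    """Persons x items matrix of success probabilities; items are (a, b, c)."""
    return [[p_3pl(t, a, b, c) for (a, b, c) in items] for t in thetas]


def simulate_responses(prob_matrix, rng):
    """Score 1 when a U(0,1) draw falls below the success probability, else 0."""
    return [[1 if rng.random() < p else 0 for p in row] for row in prob_matrix]


# Hypothetical toy inputs: 5 simulees and 3 items (the study used 999 and 33).
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
items = [(0.8, -1.0, 0.05), (0.8, 0.0, 0.05), (0.8, 1.0, 0.05)]
probs = probability_matrix(thetas, items)

rng = random.Random(2002)  # arbitrary fixed seed so the sketch is reproducible
replications = [simulate_responses(probs, rng) for _ in range(25)]
```

Because every replication is drawn from the same probability matrix, differences between replications reflect sampling noise only, which is what allows estimated parameters to be compared against the known generating values.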

Sample Sizes

This study used person sample sizes of 999 and item sample sizes of 33. These sample sizes were selected to produce evidence as to the most appropriate model(s) to use in a later research study with similar sample sizes. Other reasons supporting the use of these sample sizes include (a) the 2PL and 3PL models require substantial sample sizes in order to produce stable results, and 999 is near the recommended minimum sample size for the 3PL model (Mislevy & Bock, 1990), and (b) 33 items is a reasonable test length for well-refined items such as are simulated in this study (no un-modeled noise), thereby allowing for the production of error estimates that are meaningful and comparable to typical empirical applications.

Generating Parameters

In order to produce results that are somewhat generalizable, a series of generating parameter combinations was designed to come close to addressing many typical real circumstances. These parameter combinations are based upon various empirically relevant parameter distributions.

Difficulty Distributions. In the Rasch and 2PL models the difficulty parameter, b, is the point on the ability/difficulty scale at which an individual with that ability would have a 50% probability of answering the question correctly. With the addition of a pseudo-guessing parameter, c, in the 3PL model that probability is shifted to 0.5+c/2. In CTT this might be seen as the position on the normal deviate scale that would split the sample into the proportion expected to answer the item incorrectly and the proportion expected to answer it correctly. The difficulty distributions used are not random, but rather are comprised of evenly spaced difficulties in order to ensure uniform spread of difficulty within the distribution and thus maximum scale definition in all models.
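The shifted probability follows directly from evaluating the 3PL curve at θ = b. The derivation below assumes the common textbook form of the 3PL ICC, P(θ) = c + (1 - c) / (1 + e^{-a(θ - b)}), which is not written out in this paper.

```latex
P(\theta = b) = c + (1 - c)\cdot\frac{1}{1 + e^{0}} = c + \frac{1 - c}{2} = 0.5 + \frac{c}{2}
```

For example, with c = 0.17 a person whose ability equals the item difficulty would be expected to succeed with probability 0.585 rather than 0.5.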

Two distributions of item difficulty were used in this study:

1. Normal: b=N(0,1). This standard normal distribution of question difficulties might be found on a classroom test with most of the questions designed to be of middle range of difficulty and only a few very easy or very difficult questions (Guilford, 1954). Difficulties were specified as equally distributed by p-value from 0.02 through 0.98 in increments of 0.03 and then converted to normal deviates using the NORMSINV function in Excel (Microsoft, 2000).

2. Uniform: b=U(-1.6,+1.6). A distribution like this might be found in a placement test (or learning progress tests, or criterion referenced tests, or construct linked tests) that is intended to identify where individuals are on a broader continuum.
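The two difficulty sets can be sketched in Python as follows. The inverse standard normal CDF from the standard library stands in for Excel's NORMSINV; the variable names are illustrative, and the even 0.1 spacing of the uniform set is an assumption drawn from the statement that difficulties are evenly spaced rather than random.

```python
from statistics import NormalDist

N_ITEMS = 33

# Evenly spaced p-values 0.02, 0.05, ..., 0.98 (integer arithmetic avoids float drift).
p_values = [(2 + 3 * k) / 100 for k in range(N_ITEMS)]

# Normal condition: inverse standard normal CDF, the quantity NORMSINV returns.
b_normal = [NormalDist().inv_cdf(p) for p in p_values]

# Uniform condition: assumed here to be evenly spaced on [-1.6, 1.6].
b_uniform = [-1.6 + 3.2 * k / (N_ITEMS - 1) for k in range(N_ITEMS)]

print(round(b_normal[0], 2), round(b_normal[-1], 2))  # -2.05 2.05
```

Stepping from 0.02 to 0.98 by 0.03 yields exactly 33 values, which matches the item sample size used in the study.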

Discrimination Distributions. In the Rasch model, discrimination is assumed to be constant across items, meaning that the variances due to item uniqueness and/or sampling error are consistent from item to item. This assumption does not accord with empirical reality, however, which suggests that most item sets would exhibit some variations in discrimination reflecting independent multidimensionality (item uniqueness) and any remaining dependent multidimensionality (secondary common factors). For simplicity we only address independent multidimensionality in this research.

It is expected that the accuracy of the estimates derived using the 2PL and 3PL models will degrade less dramatically than that of the Rasch estimates as the distribution of the discrimination parameters underlying the dataset is widened. Thus the more extreme distributions of discrimination may allow for the 2PL and 3PL models to exhibit more overall accuracy than the Rasch model. It is also expected that equal discriminations should produce the most accurate results in the CTT model. To facilitate the evaluation of accuracy of the various models under both ideal and quasi-realistic conditions, one fixed discrimination distribution and two normal discrimination distributions were used:


1. Unidimensional: a = 0.8. This distribution is consistent with the assumption of equal discriminations in the Rasch model. It is unrealistic in its uniformity, but important to allow us to see how large an error might be introduced into the resulting ability and difficulty scale estimates by the arbitrary constraints used in the 2PL and 3PL models.

2. Almost unidimensional: a = N(0.8, 0.1). This distribution is consistent with the assumptions of essential unidimensionality (and local independence). It might be found on a test that has been carefully refined to the requirements of the Rasch measurement model.

3. Moderately multidimensional: a = N*(0.8, 0.3). This distribution is expected to be consistent with the calibration and analysis of a test that is constructed to meet committee set test specifications with less concern for fit to a model. (This distribution is truncated normal with a lower bound of 0.4 and an upper bound of 1.6.) Although this is the most extreme distribution, it is similar to what may be found in other research (Baker, 2001; Hambleton et al., 1991; Harwell & Gatti, 2001) and is close to a range suggested as acceptable to the Rasch model by Linacre (Linacre, 2000).

Discrimination parameters are randomly assigned to simulated items.

Pseudo-guessing Distributions. In the 3PL model, the c parameter is estimated and used in an effort to account for the effects of guessing among lower ability students. Although linking this parameter only to items is intuitively less appealing, it does allow for some adjustment for the errors introduced by guessing. The three pseudo-guessing parameter distributions used in this study are:

1. No guessing: c = 0. The pseudo-guessing parameter is fixed at zero in this case to maximize the fit with the Rasch model.

2. Mild pseudo-guessing: c = N(0.05, 0.02). Even with non-multiple choice items, some mild guessing or associated errors might be expected at the lower end of the ability range. This is the c parameter distribution for the core set. (This distribution is truncated normal with a minimum of 0.)

3. Moderate pseudo-guessing: c = N(0.17, 0.03). This distribution is consistent with a multiple choice test in which substantial guessing is expected, and is similar to the estimates found in other studies (e.g. Hambleton et al., 1991).
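The truncated normal distributions specified for the discrimination and pseudo-guessing parameters can be sketched with simple rejection sampling. How truncation was actually implemented in the study is not stated, so the redraw approach, the function name, the seed, and the reading of the second distribution parameter as a standard deviation are all assumptions.

```python
import random


def truncated_normal(mean, sd, low, high, n, rng):
    """Draw n values from N(mean, sd), redrawing any value outside [low, high]."""
    draws = []
    while len(draws) < n:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            draws.append(x)
    return draws


rng = random.Random(33)  # arbitrary fixed seed
n_items = 33

# Moderately multidimensional discriminations: N(0.8, 0.3) truncated to [0.4, 1.6].
a_moderate = truncated_normal(0.8, 0.3, 0.4, 1.6, n_items, rng)

# Mild pseudo-guessing: N(0.05, 0.02) truncated below at 0 (no upper bound).
c_mild = truncated_normal(0.05, 0.02, 0.0, float("inf"), n_items, rng)
```

Redrawing out-of-range values keeps the shape of the distribution inside the bounds, whereas clipping values to the bounds would pile extra mass at 0.4, 1.6, or 0.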

Ability Distributions. In the Rasch, 2PL and 3PL models, the ability parameters, θ’s, describe how capable individuals are with respect to the construct being measured. This study used four different ability distributions:

1. Normal or On-Target: θ= N(0,1). This is a standard normal distribution that corresponds to student abilities that are aligned with the mean difficulty of the ideas and tasks associated with the construct. An example situation in which such a distribution could be found might be a difficult classroom test where 50% of the students are expected to pass. Simulee abilities are equally spaced by ability percentile (starting at 0.1% and going up to 99.9% by increments of 0.1%). The values are converted to the standard normal scale using the NORMSINV function in Excel to yield 999 normally distributed person (or simulee) abilities ranging from -3.09 through 3.09.

2. Off-target or High ability: θ= N(1.5,1). This is a translated normal distribution where the mean student abilities are 1.5 standard deviations above the mean difficulty of the items on the construct measured. This ability distribution scenario might correspond to a student ability distribution relative to the difficulties of an ‘easy’ basic competency exam where more than 93% of the students are expected to pass. Given only a moderate sample size (N=999), it is expected that all measurement models will struggle with the estimation of items with difficulties that are 2 or more standard deviations from the mean. This ability distribution is a translation of the N(0,1) distribution noted above.

3. Uniform ability: θ=U(-2.5,2.5). This is consistent with a calibrating sample that has been carefully selected (Stocking, 1990) to maximize the accuracy of the discrimination and pseudo-guessing estimates.

4. Constant: θ=0. It has been claimed that the Rasch model can calibrate items with this type of ability distribution because of the assumption of equal discriminations (Wright, 1967). It is unknown how the accuracy of the IRT models will behave relative to each other, especially when this unusual distribution of ability is combined with some moderate multidimensionality and mild pseudo-guessing.
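The on-target and off-target ability sets can be reproduced with a short Python sketch, using the standard library's inverse normal CDF in place of Excel's NORMSINV. The variable names are illustrative, and the final line rests on the assumption that "passing" means having an ability above the mean item difficulty of 0.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

# Percentiles 0.1%, 0.2%, ..., 99.9% (integer arithmetic avoids float drift).
percentiles = [k / 1000 for k in range(1, 1000)]

theta_on_target = [std_normal.inv_cdf(p) for p in percentiles]  # N(0, 1)
theta_off_target = [t + 1.5 for t in theta_on_target]           # N(1.5, 1)

print(len(theta_on_target),
      round(theta_on_target[0], 2),
      round(theta_on_target[-1], 2))   # 999 -3.09 3.09

# Share of an N(1.5, 1) group lying above the mean item difficulty of 0.
print(round(std_normal.cdf(1.5), 3))   # 0.933
```

The first printed line matches the stated sample size and ability range, and the second is consistent with the statement that more than 93% of the off-target group would be expected to pass.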

Specific Parameter Combinations

Using the above mentioned parameter distributions, the following nine generating parameter distribution combinations were used to challenge the measurement models in a quasi-systematic fashion. The combinations are summarized in Table 1.

1. PC1 incorporates a moderately multidimensional item discrimination distribution, normal item difficulty distribution, and mild item pseudo-guessing parameter distribution along with a normal ability distribution. This is intended to mimic a well targeted and refined exam. It is assumed that the data sets generated by this combination should be similar to a typical data set in a non-multiple choice test. It was expected that either the Rasch or the 2PL model would produce the most accurate item difficulty and person ability estimates.

2. PC2 is similar to PC1 except that it incorporates an almost unidimensional discrimination distribution. This corresponds to a test which is measuring an almost unidimensional construct (e.g. a geography test to identify the countries of the world where all of the item stems are identical). It was expected that the CTT and Rasch models should show the greatest improvement relative to the PC1 results because the generating parameters are much closer to the CTT and Rasch assumptions of equal item discrimination values.

3. PC3 differs from PC1 in that it corresponds to a simple construct exam that is almost unidimensional (same as PC2), which is presented to a high ability group, consistent with a mastery test with very similar items. Accuracy was expected to be reduced in all models as the available information was reduced in this off-target test. The Rasch model was expected to have the greatest difficulties with the estimates of difficulty for the lowest difficulty items because the majority of persons should load high on the logistic curve and the fixed discrimination should precipitate a distortion of the difficulty estimate.

4. PC4 simulates a unidimensional test with no guessing, normal difficulty distribution and normal, on-target person ability distribution. This was an extreme example to allow for an exploration of the limits of the models with almost perfect data. It was expected that the Rasch model should exhibit the lowest errors because of its ability to generate an interval measurement scale. The 2PL and 3PL models were expected to be less accurate than the Rasch or CTT as a result of their greater susceptibility to sampling noise.

5. PC5 uses an item set with a uniform distribution of item difficulties (-1.6 to +1.6) that were moderately multidimensional and had mild pseudo-guessing, and an on-target person ability distribution. This corresponds to an almost ideal diagnostic or placement test. The net effect of such a distribution of difficulties on the overall accuracy was uncertain; primarily local variations in accuracy were expected.

6. PC6 corresponds to PC1 but with the incorporation of a moderate item pseudo-guessing parameter distribution. Because the 3PL model accounts for a pseudo-guessing parameter, it was expected to produce item difficulty and person ability estimates with an improved accuracy relative to the Rasch and 2PL models. However, it was uncertain as to which model would produce the most accurate results given the 3PL’s assumed incapacity to produce a perfectly interval measurement scale and the difficulties associated with estimating the c parameter with limited information.

7. PC7 is like PC1 except a uniform ability distribution was used (corresponding to a carefully selected calibrating sample). The estimated item difficulties were expected to be more accurate for the non-central items given that more information was available from which to make the estimates. A corresponding reduction in accuracy of the central items was also expected. The effects on individual estimates are examined in other research.

8. PC8 is an unlikely case to test the Rasch model’s theoretical potential to calibrate data even when all of the persons in the calibrating sample have the same ability (Rasch, 1992). The Rasch and CTT models were expected to produce the most accurate item difficulty results (because they require the least amount of information).

9. Finally, PC9 is similar to PC1 in that the item set has a normal distribution of difficulties, is moderately multidimensional and includes mild pseudo-guessing; however, it is presented to an ability sample that is off-target. This sample was generated to allow for the investigation of Wright’s (Wright, 1999) claim that accepting data with variations in discrimination and crossing item characteristic curves will destroy construct stability. It was expected that the Rasch model would erroneously compensate for high and low discriminations because the off-target normal distribution of abilities would overweight the ICCs on the top end for all but a few items. Thus it was expected that the Rasch estimates would underestimate difficulty when discriminations are higher than average, and overestimate difficulty when the discriminations are lower than average.

On reviewing these parameter combinations, it appears that the PC1, PC5, PC6 and PC7 parameter combinations are the most reflective of reality. It is speculated that the simulated data sets produced using these parameter combinations would be most consistent with data sets acquired using a typical well-designed and refined exam presented to an appropriately targeted sample of persons.

Each set of parameters in the parameter-ability distribution combinations was generated using the distributions specified in the table. Identical copies of these generated parameter distributions are used in each of the combinations where the same distribution is specified in order to maximize comparability.


Table 1. Summary of parameter combinations used to generate simulated data.

Code  Descriptive identifier        Discrimination (a)  Difficulty (b)  Pseudo-guessing (c)  Ability (θ)
PC1   Multidimensional              N(0.8, 0.3) [a]     N(0,1)          N(0.05, 0.02) [b]    N(0,1)
PC2   Almost Unidimensional         N(0.8, 0.1)         N(0,1)          N(0.05, 0.02) [b]    N(0,1)
PC3   Off-target                    N(0.8, 0.1)         N(0,1)          N(0.05, 0.02) [b]    N(1.5,1)
PC4   Unidimensional                All = 0.8           N(0,1)          All = 0              N(0,1)
PC5   Uniform difficulty            N(0.8, 0.3) [a]     U(-1.6, 1.6)    N(0.05, 0.02) [b]    N(0,1)
PC6   High guessing                 N(0.8, 0.3) [a]     N(0,1)          N(0.17, 0.03)        N(0,1)
PC7   Uniform ability               N(0.8, 0.3) [a]     N(0,1)          N(0.05, 0.02) [b]    U(-2.5,2.5)
PC8   Equal ability                 N(0.8, 0.3) [a]     N(0,1)          N(0.05, 0.02) [b]    All = 0
PC9   Off-target Multidimensional   N(0.8, 0.3) [a]     N(0,1)          N(0.05, 0.02) [b]    N(1.5,1)

[a] truncated normal distribution (0.4 ≤ a ≤ 1.6)
[b] truncated normal distribution (0 ≤ c)

Analysis Procedure

The calibration, or fitting, portion of the analysis of the 225 response matrices (9 parameter combinations x 25 data replications) was conducted with Excel for the CTT analysis, with Winsteps and Bigsteps (Linacre & Wright, 1999) for the Rasch analyses and BILOG (Mislevy & Bock, 1990) for the 2PL and 3PL analyses, creating a total of 900 files (225 response matrices x 4 models) containing person ability estimates and 900 files containing item difficulty estimates. Spreadsheets were created to examine each of the parameter combination by measurement model pairings, producing a total of 36 item difficulty summaries (9x4) and 36 person ability summaries, each containing the results of 25 replications.

To observe trends in accuracy, as well as to illuminate any bias or distortion, scatterplotswere created showing the estimated person abilities or item difficulties (across all 25replications) along with the sample means and standard deviations plotted against the generatingabilities or difficulties.

To estimate accuracy for both the item difficulty estimates or for the person abilityestimates the RMSD was used. By using standardized scales, direct comparisons betweenRMSDs and error estimates across models using data sets generated with the same parametercombinations are simplified. Although comparisons between accuracy estimates acrossparameter combinations are slightly inaccurate because the errors are scale position dependentand PC5, PC7 and PC8 have distinctly different relative scale positions for either itemdifficulties or person abilities. Because distributions were standardized independently abilityscale estimates were not directly comparable to difficulty scale estimates (i.e. ability RMSDs arenot directly comparable to difficulty RMSDs) although deviations in the resulting scales aresmall.

The medians, interquartile ranges and ranges of the RMSDpersonsk were then calculated for each of the parameter combination by measurement model pairing results in order to support comparison of the models.
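A small sketch of these summary statistics, assuming one RMSD value per replication is available for a given pairing (the function name summarize_rmsds is hypothetical, and the quartile convention is NumPy's default linear interpolation, which may differ from the spreadsheet used in the study):

```python
import numpy as np

def summarize_rmsds(rmsds):
    """Median, interquartile range and range of a set of per-replication RMSDs."""
    rmsds = np.asarray(rmsds, dtype=float)
    q1, median, q3 = np.percentile(rmsds, [25, 50, 75])
    return {
        "median": float(median),
        "iqr": float(q3 - q1),
        "range": float(rmsds.max() - rmsds.min()),
    }
```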

Preparations for Answering Research Questions

Research question 1. A comparative box and whisker graph containing the distributions of the RMSDpersons statistics for each of the parameter combination by measurement model pairings is presented in the results. From this graph, observations on the accuracy of ability estimates as derived from the various measurement models are noted.

Research question 2. Two comparative box and whisker graphs were created. The first graph contains the distributions of the RMSDitems statistics from each of the parameter combination by measurement model pairings. The second graph contains the distributions of the RMSDtrimmeditems statistics from each of the parameter combination by measurement model pairings. These graphs are presented in the results section, and observations are noted on the accuracy of difficulty estimates for both complete and trimmed item sets across the various measurement models.
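The trimmed statistic can be illustrated with a short sketch. It assumes that trimming removes the two easiest and the two most difficult items as ranked by generating difficulty, and that the inputs are already on a common standardized scale. Whether the study re-standardized after trimming is not stated here, so this sketch does not. The function name trimmed_rmsd is hypothetical.

```python
import numpy as np

def trimmed_rmsd(est_z, gen_z, n_trim=2):
    """RMSD over items after dropping the n_trim easiest and n_trim hardest items."""
    est_z = np.asarray(est_z, dtype=float)
    gen_z = np.asarray(gen_z, dtype=float)
    order = np.argsort(gen_z)                     # easiest first, hardest last
    keep = order[n_trim:len(order) - n_trim]      # drop both extremes
    return float(np.sqrt(np.mean((est_z[keep] - gen_z[keep]) ** 2)))
```

Setting n_trim=0 gives the complete-item RMSD, so the two statistics can be compared for the same replication.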

Research question 3. Comparative box and whisker graphs containing RMSD distributions along with corresponding SE distributions across repetitions for both ability estimates and item difficulty estimates are presented in the results. From these graphs, observations are noted on the accuracy of error estimates relative to observed errors as derived from the various measurement models.

RESULTS

Person Ability Estimates

Before calculating RMSDs, scatterplots for each of the parameter combination by measurement model pairs were examined. Figure 1 is a scatterplot showing the Rasch ability estimates of the stratified sample of persons across the 25 replications with data generated from the ideal unidimensional dataset (PC4). The distributions of individual estimates across replications appear to be approximately normal and unbiased except for the extreme abilities, which are consistently biased toward the mean by floor and ceiling effects. The mean standard errors (dark gray markers) are relatively consistent with the observed standard deviations (light gray markers) across models, suggesting that the SE may be a reasonable estimate of the stability of ability estimates. Very similar plots were found for each of the other models.

Figures 2 and 3 are scatterplots showing the Rasch and 2PL ability estimates of the stratified sample of persons across the 25 replications with data generated from the off-target multidimensional parameter combination (PC9). The means of the estimates at each generating ability level are still fairly accurate, with the greatest deviations at the higher generating abilities (θ > 3). With the off-target ability sample the floor effect is essentially eliminated, while the ceiling effect is exacerbated. The mean standard errors (dark gray markers) were placed relative to the mean results to allow for visual comparison and are very consistent with the observed standard deviations (light gray markers) across models for most of the ability range.

In Figure 4 a step effect was noticeable in the individual 2PL results from the PC7 data. This was also noticed in the 3PL results and to lesser degrees in other parameter combinations. It is hypothesized that this step effect is an artifact of artificial constraints imposed in the 2PL and 3PL models in order to ensure convergence. While this effect distorts the distributions of individual estimates across replications, it appears that the net effect on the accuracy of averaged ability estimates is minimal. Examination of this result is left for future research.

Figures 5 and 6 present the RMSDpersons along with the mean SEs for each of the parameter combinations for the Rasch and the 3PL model results. The RMSD estimates for the PC8 case (all equal abilities) are omitted because they are distorted by the standardizing procedure. The differences between the RMSD and SE pairs are generally small in each of the models, suggesting that each of these models accurately estimates ability SEs over all of the parameter combinations (the CTT estimated SEs were somewhat inflated in the PC3 and PC9 cases due to the method used to transform raw score error estimates to linear scale estimates).

It is interesting to note that the 2PL model (and the 3PL model to a lesser degree) produced substantially lower SE estimates than the observed RMSDs for the PC7 case. This is likely related to the step distortions noted above.

In Figure 7, the distributions of the RMSDpersons are compared across models and parameter combinations. The levels of the RMSDpersons within each parameter combination condition are highly comparable across the models. Although there may be some marginally significant differences, there are no differences that are substantial enough to direct the use of one model over another. The RMSDpersons estimates for the PC8 case (all equal abilities) are omitted because they were distorted by the standardizing procedure. Although the PC7 case appears to have the most accurate results, these are not directly comparable to the other estimates because of the underlying uniform ability distribution.

[Plot: mean of ability estimates, SD of ability estimates and mean of SE estimates, plotted against generating ability.]

Figure 1. An examination of the Rasch person ability estimates across 25 replications to the true abilities using an on-target ability sample, with constant discriminations and no pseudo-guessing (PC4).

[Plot: mean of ability estimates, SD of ability estimates and mean of SE estimates, plotted against generating ability.]

Figure 2. An examination of the Rasch ability estimates across 25 replications to the true abilities using an off-target ability sample, with moderately varying discriminations and mild pseudo-guessing (PC9).

[Plot: mean of ability estimates, SD of ability estimates and mean of SE estimates, plotted against generating ability.]

Figure 3. An examination of the 2PL ability estimates across 25 replications to the true abilities using an off-target ability sample, with moderately varying discriminations and mild pseudo-guessing (PC9).

[Plot: mean of ability estimates, SD of ability estimates and mean of SE estimates, plotted against generating ability.]

Figure 4. An examination of the 2PL ability estimates across 25 replications to the true abilities using an on-target uniform ability sample, with moderately varying discriminations and mild pseudo-guessing (PC7).

[Plot: RMSD and SE distributions, in z-units, for PC1, PC2, PC3, PC4, PC5, PC6, PC7 and PC9.]

Figure 5. Distributions of RMSDpersons and SEpersons using Rasch estimation of abilities.

[Plot: RMSD and SE distributions, in z-units, for PC1, PC2, PC3, PC4, PC5, PC6, PC7 and PC9.]

Figure 6. Distributions of RMSDpersons and SEpersons using 2PL estimation of abilities.

[Plot: distributions for CTT, 1PL, 2PL and 3PL, in z-units, for PC1, PC2, PC3, PC4, PC5, PC6, PC7 and PC9.]

Figure 7. Distributions of RMSDpersons for person abilities across all models and parameter combinations.

Item Difficulty Estimates

Scatterplots were again constructed and examined for the item difficulty estimates. The deviations appear to be approximately normally distributed and the mean SEs are consistent with the observed SDs. Figure 8 presents the distribution of Rasch difficulty estimates for the PC6 case (CTT estimates were almost identical). In this figure the difficulty estimates are very stable or consistent (small SDs of estimates and small SEs) but exhibit larger mean deviations from generating difficulties and substantial disorder of item difficulties. It is expected that this stability is due to having only a single parameter to estimate, while the deviations and disorder are due to the effects of the varying discrimination parameters and guessing parameters, as predicted.

Figure 9 illustrates that the 3PL model produces the most accurate mean estimates across repetitions in the presence of substantial pseudo-guessing but shows still greater instability or variability for individual item estimates across repetitions. 2PL estimates predictably fell between these two examples in disorder of the mean and stability of estimates.

Figures 10 and 11 present the Rasch and 2PL item results for the off-target, multidimensional case (PC9). (The graphs for the CTT and the 3PL cases are not shown because they were very similar in nature to the Rasch and 2PL figures respectively.) Again we see that the mean SEs are comparable to the observed SDs for the estimates, suggesting that SE is a reasonable estimate of stability.

In Figure 10 the Rasch accuracy bias observed in Figure 8 (associated with variable discriminations and pseudo-guessing) is exacerbated further by the off-target ability distribution (θ=θ+1.5). The stability of estimates also appears to degrade according to the distance from the person ability distribution.

In Figure 11, the accuracy of the mean 2PL estimates is very high, while the stability or distribution of the estimates appears to be directly related to the distance from the mean of the person ability distribution.

In both figures, the two most off-target items on the lower end of the scale are more than 3 SDs from the mean generating ability (+1.5). This mis-targeting results in greater sampling error because only a handful of simulees out of the sample of 999 have a significant probability of answering the item incorrectly.
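The scale of this mis-targeting can be checked with a short calculation. The sketch below assumes abilities distributed N(1.5, 1), an illustrative item difficulty of -2.0 (a hypothetical value chosen to lie more than 3 SDs below the ability mean), and an equal-slope logistic curve, for which the probability of an incorrect response exceeds 0.5 only when ability falls below the item difficulty.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

n_persons = 999
ability_mean, ability_sd = 1.5, 1.0
item_difficulty = -2.0  # hypothetical off-target easy item

# Share of simulees whose ability lies below the item difficulty,
# i.e. who are more likely than not to answer incorrectly.
share_below = normal_cdf(item_difficulty, ability_mean, ability_sd)
expected_below = n_persons * share_below
print(f"expected simulees below the item: {expected_below:.3f}")
```

Under these assumptions, fewer than one simulee in the sample of 999 is expected to sit below such an item, so the item's position is estimated almost entirely from the tail of the response curve.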

[Plot: mean of difficulty estimates, SD of difficulty estimates and mean of SE estimates, plotted against generating difficulty.]

Figure 8. A comparison of Rasch estimates of item difficulty across 25 replications to the generating or true difficulty using an on-target ability sample, with moderately varying discriminations and moderate pseudo-guessing (PC6).

[Plot: mean of difficulty estimates, SD of difficulty estimates and mean of SE estimates, plotted against generating difficulty.]

Figure 9. A comparison of 3PL estimates of item difficulty across 25 replications to the generating or true difficulty using an on-target ability sample, with moderately varying discriminations and moderate pseudo-guessing (PC6).

[Plot: mean of difficulty estimates, SD of difficulty estimates and mean of SE estimates, plotted against generating difficulty.]

Figure 10. A comparison of Rasch estimates of item difficulty across 25 replications to the generating or true difficulty using an off-target ability sample, with moderately varying discriminations and mild pseudo-guessing (PC9).

[Plot: mean of difficulty estimates, SD of difficulty estimates and mean of SE estimates, plotted against generating difficulty.]

Figure 11. A comparison of 2PL estimates of item difficulty (across 25 replications) to the generating or true difficulty using an off-target ability sample, with moderately varying discriminations and mild pseudo-guessing (PC9).

In Figures 12 and 13, the mean RMSDitems and the mean RMSDtrimmeditems are compared across measurement models and across parameter combinations. There is no estimate for the 3PL model in the PC8 case because BILOG was unable to converge on some of the items with the 3PL model given the methods and parameters used in the calibration.

The Rasch and CTT models have a small advantage in the PC4 case, when the data set is consistent with the Rasch model assumptions (equal ICCs and no guessing). These same models also exhibit an advantage when all abilities are homogeneous (PC8), which may be related to the Rasch model’s reduced requirements for information. The similarity between the CTT estimates and the Rasch estimates, evident through each of the conditions, suggests that the primary advantages of Rasch over CTT are the error estimation, the ability to deal with non-normal and non-symmetrical ability and difficulty distributions, and the potential for response vector analysis.

With the complete item set, the 2PL model exhibits a substantially lower error (RMSD) than the CTT or Rasch models in each of the 4 cases which were suggested as the most consistent with reality (PC1, PC5, PC6, and PC7), while the 3PL model appears to produce the most accurate estimates when there is substantial model generated pseudo-guessing. With the trimmed item set, the 2PL model exhibits equal or lower error than the CTT or Rasch models in all but the most unlikely (PC4 and PC8) cases.

The 2PL advantage over the CTT and Rasch estimates appears to be most substantial when the data contains a large amount of independent multidimensionality as defined by variable discrimination parameters, and increases dramatically when this is compounded with an off-target person ability sample with mild pseudo-guessing (PC9). The Rasch model appears to misestimate difficulties of items that are extreme (relative to the ability distribution) and have variable discriminations.

By comparing the results for PC1 and PC6 in Figures 12 and 13, it appears that the guessing parameter further contributes to the errors. Here the 2PL model seems to be affected to a greater degree than the Rasch model by an increase in pseudo-guessing.

In Figures 14 and 15 the mean RMSDitems are compared to the mean SEs generated by the Rasch and 2PL models. It is interesting to note that the distributions of the mean reported SEs are very narrow.

In Figure 14, it is apparent that the Rasch estimates of error are substantially lower than the observed error distributions in all but the ideal parameter combination case (which is the same case in which the ability error estimates were underestimated). In the more plausible parameter combination conditions the observed errors are almost double the estimated errors, suggesting that the Rasch model does not produce reliable estimates of error for item difficulties except when presented with ideal data.
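One way to express this comparison numerically is the ratio of observed error to reported error in each replication. The sketch below is illustrative only: it assumes estimates and reported SEs are arranged as replications by items on the same standardized scale as the generating values, and the function name observed_to_reported_ratio is hypothetical.

```python
import numpy as np

def observed_to_reported_ratio(estimates, generating, reported_se):
    """Per-replication ratio of observed RMSD to mean reported SE.

    estimates, reported_se: arrays of shape (replications, items)
    generating: array of shape (items,)
    """
    estimates = np.asarray(estimates, dtype=float)
    generating = np.asarray(generating, dtype=float)
    reported_se = np.asarray(reported_se, dtype=float)
    rmsd = np.sqrt(np.mean((estimates - generating) ** 2, axis=1))
    mean_se = reported_se.mean(axis=1)
    return rmsd / mean_se
```

A ratio near 1 would indicate that the reported SEs track the observed error, while a ratio near 2 would correspond to observed errors that are about double the estimated errors.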

In Figure 15, the mean reported SEs are also lower than the observed error distributions in all but the ideal parameter combination case. The differences between the mean observed error and the predicted error with data from the most plausible cases are more moderate than in the Rasch case, with only the high guessing case (PC6) being severely underestimated by the 2PL model. In every case, the 2PL estimate of error is greater than the Rasch estimate of error, while often the true error from the 2PL estimates is lower. This suggests that the 2PL model is capable of producing more accurate estimates of error for item difficulty than the Rasch model when there are deviations from the assumed model. The mean reported SEs were reasonable estimates of the observed error in the 3PL model. The CTT model does not have a commonly used estimate of error for item difficulty.

[Plot: distributions for CTT, Rasch, 2PL and 3PL, in z-units, for PC1 through PC9.]

Figure 12. Distributions of RMSDitems across measurement models and parameter combinations.

[Plot: distributions for CTT, Rasch, 2PL and 3PL, in z-units.]

Figure 13. Distributions of RMSDtrimmeditems across measurement models and parameter combinations (after removing the two most difficult items and the two easiest items).

[Plot: RMSD and SE distributions, in z-units, for PC1 through PC9.]

Figure 14. Comparing the Rasch model distributions of RMSDitems to distributions of SEitems for each parameter combination.

[Plot: RMSD and SE distributions, in z-units, for PC1 through PC9.]

Figure 15. Comparing the 2PL model distributions of RMSDitems to distributions of SEitems for each parameter combination.


CONCLUSIONS

Answers to Research Questions

Research question 1. Which measurement model (CTT, 1PL/Rasch, 2PL or 3PL) produces the most accurate average estimates of person ability?

The accuracy of ability estimates is comparable across models and appears to fluctuate appropriately according to the amount of information available in the simulated data set (as affected by multidimensionality, pseudo-guessing and mistargeting of the person abilities relative to the item difficulties). This result is consistent with earlier empirical findings of minimal differences between ability estimates across models (Anderson, 1998; DeMars, 2001; Fan, 1998).

These results are also consistent with Stanley and Wang’s (Stanley & Wang, 1970) review of the effects of weighted test items. Contrary to their expectation, however, it appears that even with the IRT version of item weighting, the improvement on the accuracy of ability estimates over CTT (with its equal item weights) is diminished by the time the number of items reaches 33.

Research question 2. Which measurement model (CTT, 1PL/Rasch, 2PL or 3PL) produces the most accurate average estimates of item difficulty?

For situations with moderate sample sizes (999) and test sizes (33) and moderate amounts of noise (due to moderate independent multidimensionality, mild guessing and mistargeting), it appears that the 2PL model may be expected to estimate the positions of the item difficulty parameters more accurately than CTT, Rasch, or 3PL models in situations where guessing is mild or non-existent. However, caution must be used because the generating 3PL model produced data that may be somewhat more consistent with the 2PL model than real data that might also contain additional systematic or non-normal error.

The 1PL/Rasch model’s theoretical advantage (i.e. potential to generate an interval measurement scale when the data fit the underlying assumptions) appears to require an unrealistically unidimensional item set, or an aberrant equal ability sample, to produce item difficulty estimates that are more accurate than those of the 2PL model.

The 3PL model produces the best item difficulty estimates when the data set is generated using a moderate amount of guessing. However, this is a “best case” scenario for this model and as such a reduced weight must be attached to this result.

Normal deviate CTT results for person ability and item difficulty appear to be very similar to the 1PL/Rasch estimates. However, caution should be used in interpreting these results because the CTT estimates are derived from data sets with abilities and difficulties having normal distributions (for most of the parameter combinations).

Research question 3. Are error estimates accurate? Do the mean standard errors (SEs) of ability and difficulty estimates arising from the calibrations compare well with the observed deviations from generating values (accuracy) across the repetitions?

Ability error estimates from all models were reasonably accurate reflections of the observed errors.

Difficulty error estimates from the 1PL model appear to substantially underestimate the actual error relative to the true difficulties in almost all cases (they are similar to the SDs of the observations). The only good estimates of error occurred with data that were generated using perfectly unidimensional items with no guessing. The 2PL error estimates moderately underestimate the true error in most cases and substantially underestimate the error in the PC6 and PC8 cases. The 3PL model appears to produce mild to moderate overestimates of error.

Limitations of this Study

The averages do not adequately describe individual consistencies and deviations across the models. Further research is underway to expose these differences.

The accuracy of the 3PL estimates may be exaggerated because the generating and calibrating models are the same. If, as is suspected, pseudo-guessing parameters do not satisfactorily account for real guessing behavior, the 3PL error estimates generated here are likely to exhibit unsustainable accuracy and should be used with caution.

The accuracies of all of the estimates are optimized in this study, and thus the errors are minimized, because the noise that was generated was well behaved.

These results were generated using a fixed ability sample size and item sample size. The extrapolations required for other sample sizes are very tentative.

Conjecture

Given the earlier assumptions that independent multidimensionality can be represented by variable discriminations and that guessing is approximated by the 3PL guessing parameter, and given the limited observations found in this study, it is conjectured that each of the Rasch, 2PL and 3PL IRT models may have a region of superiority in the accuracy of their estimation of item difficulty within a Test Nature Space (TNS).

The TNS is defined by three dimensions: (a) the degree of independent multidimensionality (variability of discrimination or item uniqueness), (b) the degree of guessing and (c) the amount of sample based information (sample size, test size and targeting). In Figure 16, hypothesized boundaries of accuracy or efficacy between the measurement models are presented within the TNS (note that the nature of the curved surfaces presented is dependent upon the nature of the scales used for the dimensions).

When the accuracy of item difficulty estimates using each of the measurement models is compared, the superior model coincides with the superior region for the location of the corresponding parameter combination in the TNS.

The Rasch region of superiority (represented by the space behind the back curved surface) may occur when the amount of information is limited and should continue in a diminishing zone along the information axis as multidimensionality and pseudo-guessing diminish. Examples include all small scale tests, and refined large scale tests with very consistent item stems and no guessing.

The 3PL region of superiority may occur when there is sufficient information and substantial guessing (this is represented by the space in front of the forward surface). Examples include large scale summative multiple choice tests (even when they have been refined to load primarily onto a single factor).

Finally, the 2PL model appears to produce more accurate item difficulty estimates when there is substantial independent multidimensionality, mild guessing, and moderate amounts of information (due either to limited sample sizes or off-target samples). The 2PL superiority is represented by the space between the curved surfaces. Examples of such a test might include a medium to large-scale test that has been designed to assess a single unidimensional construct with minimal guessing opportunities. It is further suggested that a majority of tests, even tests created and refined to reflect a single unidimensional construct, are likely to fall in this region.


Figure 16. Conjectured regions of superior accuracy of item difficulty estimates for various IRT models within the proposed Test Nature Space.


Limitations of the Conjecture

Given a complete model of the factors affecting outcomes and distributions (as in a simulation), these curved surfaces might be analytically derived as the intersection of the expected error contours for each measurement model. Empirical reality in the human sciences is not so tidy, and thus such a factor model is unlikely to be discovered in the near future.

The conjecture regarding the accuracy of item difficulties with respect to the effect of information assumes that off-target ability samples reduce the effective sample size, or information, for most items (albeit to a varying degree).

Most of the simulations used normal or uniform distributions for their parameter generation. It is unknown how non-normal distributions would affect each of the models, and thus the conjecture is most applicable to samples with approximately normal or uniform distributions of ability and difficulty.

Practical Implications

When the assumptions of this simulation study reasonably reflect the realities of the testing situation, then the following suggestions should be considered:

1. When the goals of analysis are primarily focused on acquiring estimates of ability, each of the IRT models should produce approximately equally accurate results. Even normalized raw scores from CTT can be appropriate if the ability distribution can be assumed to be approximately normal.

2. When the goals of analysis include the discovery or confirmation of item positions on a measurement scale, then increased attention to item variances is warranted and:
   a. The 1PL model is recommended if there is evidence that the item variances are very consistent and there is no guessing (i.e. having a very good fit with the Rasch model).
   b. The 2PL model is recommended for medium and large scale applications where item variances are heterogeneous and guessing is minimal.
   c. The 3PL model, with either fixed or estimated guessing parameters, is recommended for large samples when guessing effects are present (a fixed guessing parameter might be better than estimation when data is limited).

3. When the primary goal of measurement is to place individuals on an ordinal scale of ability, or in cases where the sample size is small, the CTT model remains most appropriate (although the 1PL/Rasch has an advantage in its ability to provide score specific error estimates).
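The suggestions above can be restated as a small decision helper. This is only a sketch of one possible reading: the function name suggest_model, its argument names, and the precedence given to small samples are assumptions made for illustration, and the distinction between medium and large samples in suggestions 2b and 2c is not modelled.

```python
def suggest_model(goal, small_sample=False, consistent_items=False, guessing=False):
    """Return a model suggestion for a testing situation.

    goal: "ordinal_ranking", "ability" or "item_positions" (hypothetical labels)
    small_sample: True when the person sample is small
    consistent_items: True when item variances are very consistent
    guessing: True when guessing effects are present
    """
    # Suggestion 3: ordinal placement or small samples favour CTT.
    if goal == "ordinal_ranking" or small_sample:
        return "CTT"
    # Suggestion 1: for ability estimates the models are about equally accurate.
    if goal == "ability":
        return "any (CTT, 1PL/Rasch, 2PL or 3PL)"
    # Suggestion 2: item positions depend on item variances and guessing.
    if goal == "item_positions":
        if guessing:
            return "3PL"
        if consistent_items:
            return "1PL/Rasch"
        return "2PL"
    raise ValueError(f"unknown goal: {goal!r}")
```

For example, suggest_model("item_positions") returns "2PL", reflecting the case of heterogeneous item variances with minimal guessing.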

Those readers with an idealistic preference for the Rasch model may not consider this attempt to compare the accuracy of the various models as necessary. The discrimination parameter may be seen as irrelevant to the construct being measured, but it seems to be a reasonable accommodation for uniqueness. The pseudo-guessing parameter is still less obviously construct related, but it too has the potential to improve the accuracy of results when guessing is allowed to occur (Wright’s (1999) suggestion for an a priori constant may be best).

The results of this research indicate that the Rasch advantage is quickly overwhelmed by errors associated with the Rasch model’s inability to deal with the effects of uniqueness and guessing. Thus, in order to maximize the accuracy of the ‘marks on the meter stick’, the researcher must either eliminate uniqueness and guessing or adopt a model that better addresses them. The former is the ultimate goal; the latter is the pragmatic solution when many constructs in the domain theory remain to be discovered.


REFERENCES

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.

Anderson, J. O. (1998, October). Does complex analysis (IRT) pay dividends in achievement testing? Paper presented at the Measurement in Evaluation: Current and Future Directions for the New Millennium, Banff, Canada.

Andrich, D. A. (1988). Rasch models for measurement (Vol. 68). Newbury Park, CA: Sage Publications.

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.

Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse on Assessment and Evaluation. Available: www.ericae.net/irt.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34, 197-211.

Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates.

Brogden, H. E. (1977). The Rasch model, the law of comparative judgement and additive conjoint measurement. Psychometrika, 42, 631-634.

Bunderson, C. V. (2000, April). Design experiments, design science, and the philosophy of measured realism: Philosophical foundations of design experiments. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Burke, C. J. (1953). Additive scales and statistics. Psychological Review, 60, 73-75.

Cliff, N. (1989). Ordinal consistency and ordinal true scores. Psychometrika, 54, 75-91.

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Lawrence Erlbaum Associates.

DeMars, C. (2001). Group differences based on IRT scores: Does the model matter? Educational and Psychological Measurement, 61, 60-70.

Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357-381.

Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.

Gulliksen, H. (1950). Theory of mental tests. New York: John Wiley & Sons.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.

Harwell, M. R. (1997). Analyzing the results of Monte Carlo studies in item response theory. Educational and Psychological Measurement, 57, 266-279.

Harwell, M. R., & Gatti, G. G. (2001). Rescaling ordinal data into interval data in educational research. Review of Educational Research, 71, 105-131.

Kyburg, H. E. (1992). Measuring errors of measurement. In C. W. Savage & P. Ehrlich (Eds.), Philosophical and foundational issues in measurement theory (pp. 75-91). Hillsdale, NJ: Lawrence Erlbaum Associates.

Linacre, J. M. (1995, April). The effect of misfit on measurement. Paper presented at the Eighth International Objective Measurement Workshop, Berkeley, CA.


Linacre, J. M. (2000). Item discrimination and infit mean-squares. Rasch Measurement Transactions, 14, 743.

Linacre, J. M., & Wright, B. D. (1999). A user's guide to WINSTEPS, BIGSTEPS, MINISTEP-Rasch-model computer programs. Chicago: MESA Press.

Lord, F. M. (1970). Item characteristic curves estimated without knowledge of their mathematical form. Psychometrika, 35, 43-50.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement. Journal of Mathematical Psychology, 1, 1-27.

McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.

Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Lawrence Erlbaum Associates.

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.

Microsoft. (2000). Excel (Version 2001). Redmond, WA: Microsoft Corporation.

Mislevy, R. J., & Bock, R. D. (1990). BILOG (Version 3.11). Mooresville, IN: Scientific Software, Inc.

Newby, V. A., Grant, C., Conner, G., & Bunderson, C. V. (2002, April). A formal proof that the Rasch Model is a special case of additive conjoint measurement. Paper presented at the eleventh biennial international objective measurement workshops, New Orleans, LA.

Pelton, T. W., & Bunderson, C. V. (2002, April). The recovery of the density scale using a stochastic realization of quasi-additive conjoint measurement. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Perline, R., Wright, B. D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-255.

Rasch, G. (1992). Probabilistic models for some intelligence and attainment tests. Chicago: MESA Press.

Stanley, J. C., & Wang, M. D. (1970). Weighting test items and test-item options, an overview of the analytical and empirical literature. Educational and Psychological Measurement, 30, 21-35.

Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55(3), 461-475.

Thurstone, L. L. (1917). A law of comparative judgment. Psychological Review, 34, 273-286.

Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.

Thurstone, L. L. (1931). The measurement of social attitudes. Journal of Abnormal and Social Psychology, 26, 249-269.

Worthen, B. R., White, K. R., Fan, X., & Sudweeks, R. R. (1999). Measurement and assessment in schools (2nd ed.). New York.


Wright, B. D. (1967). Sample-free test calibration and person measurement. Paper presented at the ETS invitational conference on testing problems, Princeton, NJ.

Wright, B. D. (1988a). How interaction denies objectivity. Rasch Measurement Transactions, 1, 12.

Wright, B. D. (1988b). Rasch model from Thurstone's scaling requirements. Rasch Measurement Transactions, 2, 13-14.

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement (pp. 65-104). Mahwah, NJ: Lawrence Erlbaum Associates.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.