RESEARCH REPORT January 2003 RR-03-02

Comparing Conditional and Marginal Direct Estimation of Subgroup Distributions

Research & Development Division Princeton, NJ 08541

Matthias von Davier

Educational Testing Service, Princeton, NJ

January 2003

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:

Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541

Abstract

Many large-scale assessment programs in education utilize “conditioning models” that

incorporate both cognitive item responses and additional respondent background variables

relevant for the population of interest. The set of respondent background variables serves as a

predictor for the latent traits (proficiencies/abilities) and is used to obtain a conditional prior

distribution for these traits. This is done by estimating a linear regression, assuming normality of

the conditional trait distributions given the set of background variables. Multiple imputations, or

plausible values, of trait parameter estimates are used on top of the

conditioning model as a computationally convenient approach to generating consistent

estimates of the trait distribution characteristics for subgroups in complex assessments. This

report compares, on the basis of simulated and real data, the conditioning method with a recently

proposed method of estimating subgroup distribution statistics that assumes marginal normality.

Study I presents simulated data examples where the marginal normality assumption leads to a

model that produces appropriate estimates only if subgroup differences are small. In the presence

of larger subgroup differences that cannot be fitted by the marginal normality assumption,

however, the proposed method produces subgroup mean and variance estimates that differ

strongly from the true values. Study II extends the findings on the marginal normality estimates

to real data from large-scale assessment programs such as the National Assessment of

Educational Progress (NAEP) and the National Adult Literacy Survey (NALS). The research

presented in Study II shows differences between the two methods that are similar to the

differences found in Study I. The consequences of relying upon the assumption of marginal

normality in direct estimation are discussed.

Key words: conditioning models, large-scale assessments, NAEP, NALS, direct estimation

Acknowledgements

I would like to thank John Mazzeo for valuable comments on previous versions of this

document, which improved both content and presentation. Any remaining errors are mine.

Introduction

Large-scale assessments such as the National Assessment of Educational Progress

(NAEP) estimate the distribution of academic achievement for policy relevant subgroups.

Examples of estimates provided by large-scale assessment are means and percentages above cut

points for the subgroups of interest. Many large-scale assessments such as NAEP use a sparse

matrix sample design in which the number of cognitive items per respondent is kept relatively

small. Using such designs allows the assessment to provide a broad coverage of the content

domain while keeping the subjects’ testing time brief. This implies that individual ability

estimates based on these kinds of assessments would have a large measurement error component,

which has to be taken into account when reporting aggregate statistics for subgroups. Direct

estimation procedures, by which these estimates are obtained without the generation of

individual scores, have been the approach most commonly taken to address this analysis

challenge. Typically, these procedures have made use of background variables along with the

cognitive item responses to ensure a higher degree of accuracy in estimating subgroup

characteristics compared to only using the cognitive responses. Moreover, matrix sampling

makes it impossible to compare subjects—or groups of subjects—based on their observed item

responses. Therefore, large-scale assessments using matrix sampling rely on item response

theory (IRT) models (Lord & Novick, 1968; Rasch, 1960).

To estimate the subgroup statistics of interest, ETS has employed since 1984 a particular

approach of integrating achievement data (item responses) and background information, such as

subgroup membership and additional student variables, into a hierarchical IRT model. This

approach may be referred to as “direct estimation” because ETS estimates group statistics

without the use of individual test scores. For the purposes of this report, I refer to this approach

as ETS-DE. The core features of the ETS-DE approach include:

1. A population model that assumes proficiencies are normally distributed conditional on a

large number of background variables (grouping variables and other covariates). As a

consequence, the marginal distribution (overall and for major reporting subgroups) is a

mixture of normals.

2. The generation of a posterior latent trait distribution of proficiency for each individual in the

sample, which is based on an estimate of (1); a separately estimated set of IRT parameters

that are treated as fixed and known; the cognitive item responses, the respondents’ group

membership; and other covariates. The mixture of these individual posterior distributions

provides the estimate of the actual subgroup distributions.

3. The integration over posterior distributions of examinees and some of the model parameters

(the parameters of the population model defined later) in (1) to obtain estimates of means,

percentages above achievement levels, etc.

4. The use of normal approximations for the individual posteriors and a multiple-imputation

approach (the so-called plausible values) to approximate the integration in (3). Imputations

are used in conjunction with conditioning models based on both cognitive item responses and

background information. The imputations are used as a mere convenience in order to

simplify the integration in (3) and to provide data that can be used with standard tools by

secondary analysts.

Cohen and Jiang (1999) propose an alternative approach to direct estimation (which I

refer to as CJ-DE in this report) of subpopulation characteristics that does not utilize additional

background variables. Cohen and Jiang assume that CJ-DE provides consistent subgroup

estimates without the use of background variables. The core features of CJ-DE include:

1. A population model that assumes marginal normality, i.e., the ability distributions of all

subgroups align in such a way that the joint distribution is normal.

2. A measurement model for the categorical grouping variables that assumes an underlying

continuous latent variable whose joint distribution with proficiency is normal.

3. Use of a set of fixed/known IRT model parameters.

4. Item responses that are used together with a single grouping variable only—the one used for

reporting—i.e., no additional covariates like other reporting variables or their interactions are

used in the population model.

5. A direct calculational approach that bypasses the generation of individual posterior

distributions and the generation of plausible values.

Both approaches, ETS-DE and CJ-DE, may be referred to as “direct estimation” because they

estimate group statistics without the use of individual test scores. ETS-DE uses a more general

model, which includes grouping variables as well as additional background information and no

specific assumption regarding the marginal proficiency distribution. CJ-DE includes the

assumption of marginal normality and ignores all the additional background information other

than a single grouping variable. This report presents a comparison of ETS-DE and CJ-DE using

simulated and real data.

The ETS-DE Methodology

For obtaining estimates of subpopulation distributions, ETS-DE involves a two-phase

procedure that uses achievement data (item responses) and respondents’ background

information. Key references for a more detailed outline of the conditioning model used by the

ETS-DE method are Mislevy (1991), Mislevy, Beaton, Kaplan, and Sheehan (1992) and Thomas

(1993, 2002). The two phases of the method, which sometimes are confused when discussed in

secondary literature, are:

1. Estimation of parameters for the conditioning, or population, model.

2. Production of plausible values from individual posterior distributions given the model

parameters, item responses, and background data.

The Conditioning Model

The method used for analyzing large-scale assessments at ETS uses both item responses

and background information, sometimes numbering up to one hundred conditioning variables.

Assume that there are k scales in the assessment and that each proficiency scale follows a

unidimensional IRT model¹ with the usual assumption of conditional independence given θ_k, i.e.,

$$P(x_1, \ldots, x_K \mid \theta) = \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k) \qquad (1)$$

The conditioning model combines the k-scale IRT model with a k-dimensional

multivariate latent regression model in order to maximize the likelihood based on the posterior

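distribution of the latent trait θ = (θ_1, ..., θ_K):

$$L(\theta \mid x, y) = f(\theta \mid x, y) \propto \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)\, \phi(\theta \mid y) \qquad (2)$$

where the prior φ(θ | y) is assumed to be normal with θ | y ~ N(Γ'y, Σ). The latent trait θ is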

unobserved and must be inferred from the observed item responses. The predictor y is a vector of

individual values on a set of conditioning variables, Γ is a matrix of regression weights, and Σ is

the residual variance-covariance matrix. Note that at ETS, three software programs are currently

available to carry out the estimation: NGROUP, BGROUP, and CGROUP. All implementations

are based on the EM (expectation-maximization) algorithm. In the E-step, the posterior

distribution of θ given item responses and conditional on the background variables is computed

for each individual. These estimates are then used in the M-step to obtain the regression weights

Γ and the residual covariance matrix Σ. The approaches implemented in NGROUP, BGROUP,

and CGROUP differ with respect to how each carries out the E-step:

1. NGROUP assumes that the item likelihood $\prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)$ can be approximated by

a multivariate normal distribution and has limited use. (It may be used only for generating

starting values for CGROUP or with extremely long scales.)

2. BGROUP does not assume any specific form of the item likelihood and uses a numerical

quadrature in the E-step. To date, BGROUP has been shown to not be computationally

feasible in more than two dimensions.

3. CGROUP is designed to be computationally feasible for more than two dimensions (it uses a

Laplace approximation in the E-step). CGROUP is used most frequently in NAEP since most

subject areas have multiple scales and require reporting on a composite.
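
The E- and M-steps described above can be illustrated for a single scale. The following is a minimal sketch, not the NGROUP, BGROUP, or CGROUP code: it assumes one latent dimension, a 2PL item response function with fixed item parameters `a` and `b`, and a fixed rectangular quadrature grid (closest in spirit to the BGROUP description). The function name, argument names, and grid settings are hypothetical.

```python
import numpy as np

def em_latent_regression(X, Y, a, b, n_iter=50, n_nodes=61):
    """EM sketch for theta | y ~ N(y'gamma, sigma2) with fixed 2PL item parameters.

    X: (N, J) array of 0/1 item responses.
    Y: (N, P) array of background variables (include a column of ones for the intercept).
    a, b: (J,) arrays of fixed discrimination and difficulty parameters (assumed 2PL).
    """
    nodes = np.linspace(-6.0, 6.0, n_nodes)                      # quadrature grid for theta
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (nodes[:, None] - b[None, :])))  # (Q, J)
    # Log item likelihood of each response pattern at each node; item parameters stay fixed.
    loglik = X @ np.log(p).T + (1.0 - X) @ np.log(1.0 - p).T     # (N, Q)

    gamma = np.zeros(Y.shape[1])
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior of theta on the grid, given responses and background data.
        mu = Y @ gamma
        logpost = loglik - 0.5 * (nodes[None, :] - mu[:, None]) ** 2 / sigma2
        logpost -= logpost.max(axis=1, keepdims=True)            # numerical stability
        w = np.exp(logpost)
        w /= w.sum(axis=1, keepdims=True)
        post_mean = w @ nodes
        post_var = w @ nodes**2 - post_mean**2
        # M-step: regress posterior means on Y; residual variance adds posterior variance.
        gamma = np.linalg.lstsq(Y, post_mean, rcond=None)[0]
        sigma2 = np.mean((post_mean - Y @ gamma) ** 2 + post_var)
    return gamma, sigma2, post_mean, post_var
```

The returned posterior means and variances are the per-respondent quantities from which plausible values would be drawn in the second phase.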

In NAEP and other large-scale assessments analyzed at ETS, the estimation of the

conditioning model for multivariate latent traits is carried out with BGROUP and CGROUP.

This report uses CGROUP as the basis for evaluating the differences in direct estimation

between the conditional normality approach (ETS-DE, as implemented in CGROUP) and the

marginal normality approach (CJ-DE, as implemented in the AM software, see below) since

CGROUP has been the program most frequently used for NAEP analysis purposes.

Plausible Values

The second phase of the ETS-DE involves the production of plausible values, which

provide a computationally tractable approach of integrating the posterior distributions of

respondents to estimate the target statistics in subgroups of interest. Using plausible values

provides a means for estimating the error in the estimates due to the proficiencies being latent

(i.e., only indirectly observed) and the uncertainty about the regression parameters in the

population model. In addition, plausible values provide a set of quantities that researchers can

use with commercial statistical software to conduct a wide variety of secondary analyses.

The BGROUP, CGROUP, and NGROUP set of programs generate multiple imputations

for each respondent based on the estimates of Γ and Σ and on the respondents’ background data y

and the item responses x. These plausible values are drawn from the k-dimensional posterior

N(E(θ|y,x), Σ(θ|y,x)). In other words, the approach assumes that θ given y and x is approximately

normally distributed. This conditional normality is a less restrictive assumption

compared to the marginal normality assumption, on which CJ-DE relies (Cohen & Jiang, 1999).

The marginal distribution in the ETS-DE conditioning model is therefore rather flexible and is not

limited to the normal distribution, but it is actually a mixture of the conditional posterior

distributions for the given set of item responses and background variables.
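
Written as a formula, the mixture just described can be sketched as follows. This assumes equal respondent weights for simplicity (operational analyses would apply sampling weights), and the symbols $\hat{f}_G$ and $n_G$ are introduced here only for illustration: $G$ is a reporting subgroup with $n_G$ respondents.

```latex
\hat{f}_G(\theta) \;=\; \frac{1}{n_G} \sum_{i \in G}
  N\bigl(\theta;\; E(\theta \mid y_i, x_i),\; \Sigma(\theta \mid y_i, x_i)\bigr)
```

Because the component means and variances differ across respondents, this mixture is in general not itself a normal density, even though every component is.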

In order to carry the variability due to measurement and parameter estimation errors

through all subsequent analyses, a number of plausible values has to be drawn for each

respondent. As a rule of thumb, five to ten plausible values are drawn in most large-scale

assessment analyses. These plausible values are aggregated to provide consistent estimates of

group means, variances, and percentages above cut points for the subgroups defined by the

reporting variables. Plausible values drawn from a population model that uses item responses and

a large amount of background information are a valuable source for studying relationships

between the proficiency scales and secondary variables.
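
The aggregation of plausible values into a subgroup mean and its standard error can be sketched as follows. This is a simplified illustration, not the operational procedure: it assumes univariate normal posteriors, simple random sampling without weights, and it ignores the uncertainty in the estimated regression parameters. The function names are hypothetical; the combination step follows Rubin's multiple-imputation formula (average sampling variance plus (1 + 1/M) times the between-imputation variance).

```python
import numpy as np

def draw_plausible_values(post_mean, post_sd, n_pv=5, rng=None):
    """Draw n_pv plausible values per respondent from N(post_mean, post_sd^2)."""
    rng = np.random.default_rng() if rng is None else rng
    post_mean = np.asarray(post_mean, dtype=float)
    post_sd = np.asarray(post_sd, dtype=float)
    noise = rng.standard_normal((post_mean.size, n_pv))
    return post_mean[:, None] + post_sd[:, None] * noise      # shape (N, n_pv)

def subgroup_mean_with_se(pv, in_group):
    """Subgroup mean and standard error from an (N, M) array of plausible values."""
    pv_g = pv[in_group]                                        # respondents in the subgroup
    n_g, m = pv_g.shape
    means = pv_g.mean(axis=0)                                  # one estimate per imputation
    sampling_var = pv_g.var(axis=0, ddof=1) / n_g              # SRS variance of each mean
    between_var = means.var(ddof=1)                            # variance across imputations
    total_var = sampling_var.mean() + (1.0 + 1.0 / m) * between_var
    return means.mean(), np.sqrt(total_var)
```

Each column of the plausible-value array is analyzed as if it were observed data, and only the resulting statistics are combined across columns.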

The CJ-DE Methodology

Marginal normality based direct estimation, or CJ-DE (Cohen & Jiang, 1999), is a

recently proposed method of estimating subgroup statistics based on a number of assumptions

regarding a) the marginal distribution of the latent trait and b) its relation to a set of group

indicator variables. The following studies use simulated and real data to compare the results from

the ETS-DE and CJ-DE methods. The study of real data offers a determination as to whether CJ-

DE yields estimates consistent with the results of more general models.

The software package AM (Cohen, 1998) implements the CJ-DE approach and is

available for the Windows operating system. The software provides modules for CJ-DE and

additional modules for univariate and composite regressions of the latent trait on a number of

predictors, which is referred to as marginal maximum likelihood (MML) regression in the AM

package. While the focus of this study is to compare CJ-DE with the ETS-DE conditioning

approach, AM's MML regression was used to make sure that both software programs—AM and

CGROUP—agree on the data structure. AM provides two procedures for CJ-DE that were

developed “…to consistently estimate subpopulation distributions when the groups are defined

by values of a [nominal or ordinal variable]” (Cohen, 1998; Cohen & Jiang, 1999). The AM

modules implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table”

(NT) in the software, depending on the scale level of the grouping variable. Both the OT and the

NT modules assume that the latent trait θ is marginally normally distributed (Cohen, 1998;

Cohen & Jiang, 1999), so that the estimates of a finite mixture of subgroup distributions have to

fit this assumption.

In contrast to this assumption, the conditional normality estimation—ETS-DE, which is

used in NAEP's conditioning model and other large-scale assessment programs—does not rely

on assuming a certain form of the trait parameters’ marginal distribution. The marginal

distribution in the conditioning model is a mixture of normals. In addition, NAEP uses a

multinomial distribution to approximate the marginal distribution of θ for item calibration

(Yamamoto & Mazzeo, 1992), so that the item parameters used in the conditioning model are not

based on a certain form of the marginal trait distribution.

Central Assumptions Driving CJ-DE

Cohen and Jiang (1999) propose to use the following approach in order to estimate

subgroup statistics:

a) Assume a latent trait θ ~ N(μ, σ). θ is usually unobserved and has to be inferred from the

subjects’ responses to a number of items (x1,..,xk).

b) Assume that there are m groups, where the group membership gi indicates the maximum

outcome on a number of m unobserved variables, y1,...,ym. That means the group membership

of individual i equals k (gi = k), if for the unobserved variables yki > yli for all l ≠ k.

c) Assume that for k=1,..,m, a linear relationship exists between θ and yk (i.e., yk = ak + bkθ + ek)

with mutually independent ek. The conditional distribution of yk given θ is assumed to be normal

with unit variance, i.e., the error term ek is N(0,1).

d) Assume that conditional on θ, the yi are mutually independent, i.e.,

$$P(y_i < A,\; y_j < B \mid \theta) = P(y_i < A \mid \theta) \cdot P(y_j < B \mid \theta)$$

Assumption (a) forces the ability distribution to be marginally normal. Assumption (c)

also is very strong and “may not be true but is a common and powerful one” (Cohen & Jiang,

1999). Assumptions (b) and (d) are used for defining the conditional density of

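$$f(g = k \mid \theta) = \int f(y_k \mid \theta)\, P(y_k > y_j,\; j \neq k \mid \theta)\, dy_k = \int f(y_k \mid \theta) \prod_{j \neq k} P(y_k > y_j \mid \theta)\, dy_k \qquad (4)$$

This conditional density, together with assumption (c) and the assumption of marginal normality

(a), yields

$$f(g = k, \theta) = \phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} P(y_k > y_j \mid \theta)\, dy_k \qquad (5)$$

where φ denotes the normal density and z_θ = (θ − μ)/σ. One more replacement uses the second

part of assumption (c), namely that the error term e in the linear relation yj = aj + bjθ + e is assumed

to be N(0,1). This yields

$$P(y_k - y_j > 0 \mid \theta) = P(a_j + b_j\theta + e < y_k) = P(e < y_k - a_j - b_j\theta) = \Phi(y_k - a_j - b_j\theta) \qquad (6)$$

where Φ denotes the normal distribution function. It follows that

$$f(g = k, \theta) = \phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j\theta)\, dy_k \qquad (7)$$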

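Finally, the conditional density of θ given group g=k is obtained by

$$f(\theta \mid g = k) = \frac{\phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j\theta)\, dy_k}{\int \phi(z_\theta) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j\theta)\, dy_k\, d\theta} \qquad (8)$$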

which is used to compute the conditional means and variances given subgroup g=k (see Cohen &

Jiang, 1999). We may now define

$$E(\theta^n \mid g = k) = \int \theta^n f(\theta \mid g = k)\, d\theta \qquad (9)$$

in order to obtain the conditional moments of θ. The parameters a1, b1, ..., am, bm and μ, σ of

f(θ|g=k) are estimated by maximizing the likelihood function based on the individual likelihood

terms

$$L(\mu, \sigma, a_1, b_1, \ldots, a_m, b_m \mid x, g = k) = \int p(x \mid \theta)\, f(g = k, \theta)\, d\theta \qquad (10)$$

for a subject in group g=k with observed responses x=(x1,..,xj), and f(g=k,θ) as defined by

Equation (7). The two approaches taken by ETS-DE and CJ-DE differ strongly with respect to

the information incorporated in estimating subgroup characteristics. ETS-DE uses extensive

background (conditioning) information including grouping variables in addition to the cognitive

item responses. In contrast to that, CJ-DE only includes the grouping variable together with the

item responses but draws on a number of strong assumptions regarding the shape of the marginal

ability distribution and the relation between θ and the grouping variable. The following section

presents examples of the differences found between both approaches with respect to recovering

known subgroup characteristics of simulated data.
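
Equations (7) through (9) can be evaluated numerically for given parameter values. The sketch below is only an illustration of the formulas as reconstructed in this section, not the AM implementation: it uses plain rectangular quadrature over θ and over the error term of y_k, treats σ as a standard deviation, and takes the parameters as known rather than estimating them through Equation (10). All function names and grid settings are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(x):
    return np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

def cjde_group_moments(a, b, mu=0.0, sigma=1.0, n_theta=161, n_e=161):
    """Conditional mean and variance of theta for each group k under Eqs. (7)-(9)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    z = np.linspace(-6.0, 6.0, n_theta)          # standardized theta grid
    theta = mu + sigma * z
    e = np.linspace(-8.0, 8.0, n_e)              # grid for e_k = y_k - a_k - b_k * theta
    d_theta = theta[1] - theta[0]
    d_e = e[1] - e[0]
    moments = []
    for k in range(a.size):
        y_k = a[k] + b[k] * theta[:, None] + e[None, :]          # (n_theta, n_e)
        integrand = norm_pdf(e)[None, :] * np.ones_like(y_k)
        for j in range(a.size):
            if j != k:
                integrand = integrand * norm_cdf(y_k - a[j] - b[j] * theta[:, None])
        joint = norm_pdf(z) * integrand.sum(axis=1) * d_e        # Eq. (7)
        cond = joint / (joint.sum() * d_theta)                   # Eq. (8)
        m1 = (theta * cond).sum() * d_theta                      # Eq. (9), n = 1
        m2 = (theta**2 * cond).sum() * d_theta                   # Eq. (9), n = 2
        moments.append((m1, m2 - m1**2))
    return moments
```

For two mirrored groups, for example `cjde_group_moments(a=[0, 0], b=[1, -1])`, the two conditional means are equal in size and opposite in sign, and both conditional variances fall below the marginal variance.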

Study I: Simulation Results

The examples presented in this section compare ETS-DE and CJ-DE based on simulated

data where each simulee responds to a limited set of test items and is additionally characterized

by a small set of background variables. The simulated data sets resemble some characteristics of

NAEP, such as the number of items per subscale. Short subscales in NAEP typically consist of

an average of 6 items across booklets; long subscales consist of approximately 12 items. The

number of subscales or dimensionality of the latent trait, k=3 in the simulations, also is found in

NAEP. The number of background variables in the simulation is smaller than what is typically

used in NAEP’s conditioning approach. While NAEP’s conditioning model may include up to

hundreds of background variables, the simulated data used in the present study limits the number

of background variables to the three made-up variables, GROUP, SES, and GENDER. Four

distinct data sets were simulated following a 2 x 2 design, varying:

1. The number of items per subscale (6 versus 12 items).

2. The dependency of the latent traits on the background variables: Setup (1) had a strong

dependency leading to multimodal marginal trait distributions, while Setup (2) had a weak

dependency resulting in unimodal, but possibly platykurtic marginals.

Using two different linear models created the two levels of dependency of the latent traits on

the background variables. Two different sets of regression weights were used to generate the

three-dimensional trait parameters (θ1, θ2, θ3). Each latent trait value θi for i in 1, 2, 3 was

generated based on a linear model

$$\theta_i = \gamma_1 y_{\mathrm{GENDER},i} + \gamma_2 y_{\mathrm{SES},i} + \gamma_3 y_{\mathrm{GROUP},i} + e_i \qquad (11)$$

incorporating fictitious GENDER, SES, and GROUP effects together with normally distributed

residuals ei. GENDER, SES, and GROUP accounted for a varying percentage of variance for the

three trait components (see regression results below). The trait variable (θ1, θ2, θ3) and its

component-wise linear relation to GENDER, SES, and GROUP were unaffected by additional

fictitious design variables WEIGHT, STRATA, and CLUSTER. The latter variables have been

included to check whether zero correlations are recovered in the same way by the regression

modules of CJ-DE and ETS-DE.
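
A data-generating scheme of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the generator used for the report: the category codings (GENDER 0/1, SES 0 to 2, GROUP 0 to 4), the 2PL response function, and all names are hypothetical, and the design variables WEIGHT, STRATA, and CLUSTER are omitted.

```python
import numpy as np

def simulate_setup(n, weights, resid_sd, a, b, rng):
    """Generate background data, three latent traits via Eq. (11), and item responses.

    weights: (3, 3) array; row i holds the GENDER, SES, GROUP weights for trait i.
    a, b: (3, J) arrays of assumed 2PL discriminations and difficulties per scale.
    """
    gender = rng.integers(0, 2, n)                 # two categories
    ses = rng.integers(0, 3, n)                    # three categories (assumed)
    group = rng.integers(0, 5, n)                  # five categories
    Y = np.column_stack([gender, ses, group]).astype(float)
    residuals = resid_sd * rng.standard_normal((n, weights.shape[0]))
    theta = Y @ weights.T + residuals              # Eq. (11), one column per trait
    logits = a[None, :, :] * (theta[:, :, None] - b[None, :, :])
    prob = 1.0 / (1.0 + np.exp(-logits))
    X = (rng.random(prob.shape) < prob).astype(int)   # (n, 3 scales, J items)
    return Y, theta, X
```

A large GENDER weight relative to the residual standard deviation yields a bimodal marginal for that trait, while a row of zero weights yields a control dimension with identical subgroup distributions.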

Setup 1, which includes one bimodal and one multimodal marginal, was included to

examine how CJ-DE performs in situations where its marginal distribution assumptions are

clearly violated. Setup 2 represents a more typical situation in which the marginal distributions

are unimodal but more platykurtic than the normal (see Figure 1). Data were generated for the

six-item test for both Setups 1 and 2; the item parameters used to generate the data are given in

Appendix A and B. However, only the six-item test is presented for Setup 2, since the pattern of

results obtained for the two test lengths was similar in Setup 1.

Figure 1. Histograms of marginal distributions for Setups 1 (left) and 2 (right).

Figure 1 shows histograms with integrated density plots for Setup 1 (left column) and

Setup 2 (right column), crossed by the three (from top to bottom row) simulated latent traits.

Setup 1 on the left results in a clearly bimodal marginal for Dimension 1, whereas in Setup 2, the

marginals are platykurtic or skewed, but not obviously multimodal.

In Setup 1, the proportion of variance of θ accounted for by the fictitious GROUP and

GENDER produced bimodal (for gender) or multimodal marginal θ distributions. In Setup 2,

the proportion of variance explained by the fictitious conditioning variables GENDER, SES, and

GROUP was reduced, so that the resulting marginal θ distributions are unimodal but platykurtic.

The marginal distribution of θ1 is a mixture of two subpopulations where the mean difference

between subgroups is due to the fictitious GENDER variable. θ2 is a mixture of five normals

with common variance but slightly different means due to the five-category variable GROUP.

The third variable, θ3, can be viewed as the “control dimension” in both setups (i.e., the subgroup

distributions are all identical as there is no effect of the conditioning variables on latent trait θ3).

Setup 2 can be viewed as a less extreme, non-bimodal version of Setup 1 with higher

intercorrelations between the θ variables. The data generated by both setups were analyzed with

the ETS-DE and CJ-DE approaches to direct estimation. The results of both methods were

compared to the true values obtained from analyzing the actual θ values used for generating the

item responses. Tables 1a and 1b show the marginal correlations obtained from analyzing the

simulees’ generating θ values, both for the 6- and the 12-item data sets.

Table 1a

Marginal θ Distributions in Setup 1, Correlations Between θ Dimensions

            [,1]       [,2]       [,3]
[1,]   1.0000000  0.3985606  0.1620800
[2,]   0.3985606  1.0000000  0.1832677
[3,]  –0.1620800  0.1832677  1.0000000

Table 1b

Marginal θ Distributions in Setup 2, Correlations Between θ Dimensions

            [,1]       [,2]       [,3]
[1,]   1.0000000  0.6499676  0.5401054
[2,]   0.6499676  1.0000000  0.7718106
[3,]   0.5401054  0.7718106  1.0000000

The following sections present results based on the generating true θ values on the one

hand and the two approaches to direct estimation of subgroup statistics on the other. To clarify

that the expected differences between CJ-DE and ETS-DE are the result of differences in model

assumptions, the agreement of both software packages on the correlational structure of the

simulated data was assessed. To check this, the recovery of regression weights and the residual

variance covariance matrix of both AM (the software used for CJ-DE) and CGROUP (the

software for ETS-DE) was analyzed.

Regression Module Comparison

The regression module comparison is a check of agreement between both programs using

the same data. The regression of the three-dimensional latent trait θ on the variables INTER

(explicit intercept), GENDER, SES, GROUP, STRATA, and CLUSTER was compared. The

results in Table 2 are obtained by analyzing the generating θ vectors (the TRUE columns in the

tables below) with standard regression procedures. The entries in the ETS-DE and MML

columns stem from analyzing the item response data with the conditioning model incorporated in

ETS-DE and with AM’s MML regression module. The MML regression module, however, is

different from the direct estimation proposed by Cohen and Jiang (1999). The MML regression

module closely resembles the regression part of the ETS-DE approach in the one-dimensional

case and consequently should yield similar results when used with the same set of background

variables. MML regression does not include the marginal normality assumptions used by CJ-DE.

Table 2 shows the estimates of the linear model for the three-dimensional θ variable. The

estimates show that the GENDER variable has the largest effect on θ1, whereas the GROUP

variable has the highest impact on θ2 and the effects for θ3 are close to zero for all methods, as

expected.

Table 2

Regression Coefficients for the Six-item Simulated Data Set, Setup 1

                  Scale 1                 Scale 2                 Scale 3
Effect       TRUE  ETS-DE     MML    TRUE  ETS-DE     MML    TRUE  ETS-DE     MML
Constant   -3.460  -3.530  -3.480  -2.970  -2.590  -2.570   0.040   0.070   0.070
CLUSTER    -0.030  -0.050  -0.050   0.000  -0.060  -0.060  -0.040  -0.050  -0.050
STRATA     -0.010  -0.010  -0.010   0.000   0.000   0.000   0.010  -0.010  -0.010
GENDER      1.680   1.760   1.740   0.430   0.340   0.340   0.000   0.020   0.030
GROUP       0.190   0.220   0.220   0.600   0.640   0.640   0.060   0.050   0.050
SES         0.280   0.280   0.280   0.270   0.210   0.210  -0.020   0.020   0.020

MML regression and the regression that is part of ETS-DE agree closely on the estimates

for this setup. Both ETS-DE and MML regression produce estimates close to those in the TRUE

columns, even though the number of items per scale (six) is comparatively small (i.e., the inferences

on θ used by ETS-DE and MML regression are subject to a rather large measurement error).

Table 3 shows the respective results based on the 12-item data set.

Table 3

Regression Coefficients for the 12-item Simulated Data Set, Setup 1

                  Scale 1                 Scale 2                 Scale 3
Effect       TRUE  ETS-DE     MML    TRUE  ETS-DE     MML    TRUE  ETS-DE     MML
Constant   -3.560  -3.540  -3.533  -2.940  -2.890  -2.883  -0.070  -0.030  -0.038
CLUSTER     0.000   0.000   0.004  -0.010  -0.020  -0.016   0.040   0.040   0.043
STRATA      0.000  -0.010  -0.008   0.000   0.000  -0.003  -0.010   0.000   0.000
GENDER      1.700   1.720   1.718   0.470   0.510   0.505  -0.060  -0.080  -0.085
GROUP       0.140   0.140   0.136   0.610   0.620   0.617  -0.050  -0.050  -0.047
SES         0.300   0.290   0.290   0.220   0.180   0.183   0.080   0.040   0.045

MML regression and ETS-DE recover the regression weights more closely if the number

of items is doubled. Note that both methods also agree with the TRUE columns for Scale 3,

where there is no impact on the latent variable, and as expected, all three columns show values

close to zero. Table 4 shows the residual correlations and variances as they were obtained using

the true θ values from the simulations as well as the corresponding values produced by the ETS-

DE regression and MML regression algorithms.

Table 4

Residual Correlations With Variances in the Diagonal, Six-item Simulated Data Set

                 TRUE                   ETS-DE              MML regression
Scale       1       2       3       1       2       3       1       2       3
1       0.188  –0.025   0.203   0.199  –0.033   0.246   0.188  –0.025   0.249
2               0.194  –0.214           0.214  –0.289           0.209  –0.293
3                       0.996                   1.127                   1.155

Table 5 shows the results for the 12-item data set. ETS-DE and MML regression

reproduce the residual correlations and variances in a very similar way, both for the 6-item and

the 12-item data set in Setup 1.

Table 5

Residual Correlations With Variances in the Diagonal, 12-item Simulated Data Set

                 TRUE                   ETS-DE              MML regression
Scale       1       2       3       1       2       3       1       2       3
1       0.176  –0.036   0.186   0.183  –0.125   0.227   0.183  –0.133   0.231
2               0.196  –0.218           0.167  –0.143           0.167  –0.144
3                       0.991                   0.960                   0.971

Subgroup Distribution Recovery

ETS-DE and CJ-DE implement two very different approaches to direct estimation. While

ETS-DE assumes that the latent trait θ is conditionally normal given a vector of background

data, CJ-DE assumes that the marginal latent distribution is normal, regardless of potentially

large subgroup differences in complex samples. These two approaches are compared in this

section with respect to the recovery of subgroup distributions. This analysis uses the example

data previously introduced as Setup 1 (6 and 12 items) and Setup 2 (6 items).

As shown in the previous section, the ETS-DE regression and the MML regression as

implemented in the software packages CGROUP and AM agree on these data sets and reproduce

the true regression parameters in a very similar way. In contrast, ETS-DE and CJ-DE incorporate

different assumptions regarding the marginal distribution of the latent traits. Recall that the

marginal distributions for Setup 1 are bimodal for Scale 1 and multimodal for Scale 2, because

the background variable GENDER (two subgroups) explains a major part of the variance for

Scale 1 whereas the background variable GROUP (five subgroups) has a strong impact on Scale

2. It can be expected that the marginal normality assumption of CJ-DE, which is violated for

Scales 1 and 2, will result in differences between subgroup mean estimates of ETS-DE and the

true values on the one hand, and CJ-DE on the other hand.

Table 6a

Subgroup Means and Standard Deviations for the Six-item Data Set, Setup 1²

                             Mean                        Standard deviation
Scale  Group     TRUE   ETS-DE         CJ-DE           TRUE   ETS-DE   CJ-DE
1      ALL     –0.004    0.027(.039)   -/-            1.001    1.040   -/-
       Female  –0.849   –0.852(.030)  –0.466(.074)    0.548    0.565   0.947
       Male     0.840    0.907(.040)   0.343(.116)    0.527    0.545   0.936
2      ALL     –0.002   –0.032(.047)   -/-            0.997    0.960   -/-
       Female  –0.213   –0.205(.057)  –0.193(.088)    0.991    0.968   0.972
       Male     0.208    0.140(.058)   0.135(.113)    0.957    0.919   0.972
3      ALL      0.014   –0.012(.045)   -/-            1.003    1.065   -/-
       Female   0.000   –0.024(.057)  –0.032(.058)    1.029    1.083   1.076
       Male     0.027    0.000(.064)  –0.003(.059)    0.975    1.045   1.08

Note. The results of CJ-DE direct estimation reported here are ones closest to the true values from one out of four trials with AM’s “slog through” option. Rows with large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

Table 6a shows the TRUE values for the six-item data set in Setup 1 (i.e., the values

obtained by analyzing the generating data) as well as the subgroup means and standard

deviations as estimated by ETS-DE and CJ-DE. In addition, the values in parentheses next to the

subgroup mean estimates show the associated standard errors either computed with Rubin’s

imputation formula in the case of ETS-DE or as given by the Taylor series estimates in the case

of CJ-DE. The Taylor series estimates are produced by the CJ-DE direct estimation procedure, and Cohen and Jiang (1999) recommend them as appropriate for complex samples. Here, the Taylor series standard error estimates for Scales 1 and 2 are larger than the imputation-based estimates.
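The imputation-based standard error mentioned above can be sketched as follows. This is a minimal illustration that assumes the standard form of Rubin's combining rules for M plausible values: the average sampling variance plus (1 + 1/M) times the between-imputation variance. The function name and the numbers are hypothetical and are not taken from CGROUP or from the data analyzed in this report.

```python
import math
from statistics import mean, variance


def rubin_standard_error(estimates, sampling_variances):
    """Combine M imputation-specific results into one estimate and standard error.

    estimates: subgroup mean computed from each set of plausible values.
    sampling_variances: sampling variance of the subgroup mean for each set.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two imputations and matching lengths")
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across imputations (divisor M - 1)
    total = within + (1.0 + 1.0 / m) * between
    return mean(estimates), math.sqrt(total)


# Hypothetical results from three sets of plausible values.
point, se = rubin_standard_error([0.10, 0.20, 0.30], [0.01, 0.01, 0.01])
print(point, se)
```

The between-imputation term is what distinguishes this standard error from a purely sampling-based one: it grows when the plausible values disagree about the subgroup mean.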

Table 6b gives a more condensed overview of the same results. Instead of individual subgroup means, the table gives standardized mean differences

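$$Z_{\text{ETS-DE}} = \frac{M_{\text{ETS-DE}} - \text{true}}{se(D_{\text{ETS-DE}})} \qquad (12)$$

$$Z_{\text{CJ-DE}} = \frac{M_{\text{CJ-DE}} - \text{true}}{se(D_{\text{CJ-DE}})} \qquad (13)$$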

as well as the variance ratio, that is, the estimated variance divided by the true variance. Here se(D) stands for the standard error of the difference. When the TRUE values are treated as fixed target statistics, se(D) equals the standard error associated with the respective estimate, as given by either ETS-DE or CJ-DE. If instead the difference between two estimates of the same subgroup mean is standardized, se(D) equals the square root of the sum of the squared standard errors of the two statistics. The standardized mean differences between CJ-DE and TRUE should be approximately N(0,1) distributed if the CJ-DE model holds. The variance ratios given in Table 6b should be close to 1 if the approach recovers the values in the TRUE column.
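As a minimal sketch of how the two statistics follow from subgroup means, standard errors, and standard deviations, the Python functions below implement Equations 12 and 13 and the variance ratio. The function names and the optional `se_target` argument are illustrative assumptions and are not part of CGROUP or AM; the numbers in the usage lines are the Scale 1 Female entries of Table 6a.

```python
import math


def standardized_mean_difference(estimate, target, se_estimate, se_target=0.0):
    """Return (estimate - target) / se(D).

    With a fixed target (se_target = 0), se(D) is the standard error of the
    estimate. When two estimates are compared, se(D) is the square root of
    the sum of their squared standard errors.
    """
    se_d = math.sqrt(se_estimate ** 2 + se_target ** 2)
    return (estimate - target) / se_d


def variance_ratio(sd_estimate, sd_reference):
    """Estimated variance divided by reference variance, computed from SDs."""
    return (sd_estimate / sd_reference) ** 2


# Scale 1, Female subgroup, six-item data set, CJ-DE versus TRUE (Table 6a).
z_cjde = standardized_mean_difference(-0.466, -0.849, 0.074)
ratio_cjde = variance_ratio(0.947, 0.548)
print(round(z_cjde, 3), round(ratio_cjde, 3))
```

For these inputs the rounded results are 5.176 and 2.986, which are the CJ-DE entries reported for this subgroup in Table 6b.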

Table 6b

Subgroup Standardized Mean Differences and Variance Ratios for the Six-item Data Set, Setup 1

                 Standardized mean difference     Variance ratio
Scale  Group     TRUE     ETS-DE    CJ-DE         TRUE    ETS-DE   CJ-DE
1      Female    0.000    –0.099     5.176        1.000   1.063    2.986
1      Male      0.000     1.693    –4.284        1.000   1.069    3.154
2      Female    0.000     0.140     0.227        1.000   0.954    0.962
2      Male      0.000    –1.178    –0.646        1.000   0.922    1.032
3      Female    0.000    –0.419    –0.552        1.000   1.108    1.093
3      Male      0.000    –0.425    –0.508        1.000   1.149    1.218

Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.

The differences from the expected values are as hypothesized; CJ-DE shows large

differences for Scale 1, for which the marginal normality assumption does not hold. The absolute

standardized mean differences are 5.176 for the female subgroup and 4.284 for the male

subgroup. The variance ratios indicate that CJ-DE overestimates the subgroup variances by a

factor of ~3 for Scale 1.

Table 7a gives the mean and standard deviation for the 12-item data set in Setup 1 while

Table 7b gives the standardized mean differences and variance ratio. Table 7b enables a direct

comparison against the values 0 (zero) for the expected mean differences and 1 (one) for the

expected variance ratio if the models behind the approaches fit the data.

Table 7a

Subgroup Mean and Standard Deviation for GENDER for the 12-item Data Set, Setup 1



                 Mean                                   Standard deviation
Scale  Group     TRUE     ETS-DE         CJ-DE          TRUE    ETS-DE   CJ-DE
1      ALL        0.004    .016(.035)    -/-            1.008   0.995    -/-
1      Female    –0.854   –.832(.029)   –.707(.072)     0.529   0.515    0.858
1      Male       0.862    .865(.031)    .704(.063)     0.527   0.524    0.746
2      ALL        0.000    .008(.039)    -/-            1.002   0.991    -/-
2      Female    –0.234   –.252(.049)   –.255(.089)     0.978   0.959    1.003
2      Male       0.234    .269(.050)    .248(.124)     0.971   0.952    1.003
3      ALL       –0.004   –.003(.036)    -/-            0.996   0.993    -/-
3      Female     0.031    .034(.047)    .045(.050)     0.989   0.967    0.988
3      Male      –0.041   –.042(.052)   –.040(.047)     1.001   0.981    0.988


Table 7b

Subgroup Standardized Mean Differences and Variance Ratios for the 12-item Data Set, Setup 1

                 Standardized mean difference     Variance ratio
Scale  Group     TRUE     ETS-DE    CJ-DE         TRUE    ETS-DE   CJ-DE
1      Female    0.000     0.760     2.042        1.000   0.948    2.631
1      Male      0.000     0.097    –2.508        1.000   0.989    2.004
2      Female    0.000    –0.367    –0.236        1.000   0.962    1.052
2      Male      0.000     0.699     0.113        1.000   0.961    1.067
3      Female    0.000     0.063     0.280        1.000   0.956    0.998
3      Male      0.000    –0.020     0.021        1.000   0.960    0.974


In the 12-item case, the current CJ-DE implementation does not converge with the default

settings for Scale 1 but needs to be put into the “slog through” mode, and the number of

iterations needs to be increased from 50 to 500. ETS-DE reproduces the subgroup means and

standard deviations accurately also for the 12-item data set. As in the six-item case, the

differences between subgroup standard deviations are not reproduced by CJ-DE.

The second reporting variable with a strong impact on one of the latent trait components is GROUP, a variable with the categories 1 to 5. Table 8a shows the standardized mean differences and variance ratios for the six-item data from Setup 1. As in the above analysis with the grouping variable GENDER, the algorithm for CJ-DE needs to be put in the “slog through” mode in AM to converge in this example. The marginal normality assumption does not hold for Scales 1 and 2, the first and second components of the three-dimensional latent trait in the example data. It can therefore be expected that CJ-DE, which relies on the marginal normality assumption, will not match the true subgroup means and variances as closely as ETS-DE does in the analysis of the GROUP reporting variable.

The standardized mean differences for CJ-DE on Scale 2 indicate two subgroups for which the CJ-DE estimates deviate significantly from the true values. For Group 1, the absolute standardized mean difference between CJ-DE and the true value is 4.45, and for Group 5 it is 5.62. In contrast, the standardized mean differences between ETS-DE and the true values are all in the expected range. Table 8a also shows that CJ-DE overestimates the subgroup variances for all subgroups on Scale 2 by a factor of between 1.77 and 2.54.

Table 8a

Subgroup Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 1

                    Standardized mean differences     Variance ratio
Scale  Subgroups    TRUE      ETS-DE     CJ-DE        TRUE     ETS-DE   CJ-DE
1      Group 1      0.0000     0.1928     0.7942      1.0000   1.0659   1.2226
1      Group 2      0.0000     0.6372     0.0748      1.0000   0.9503   1.2346
1      Group 3      0.0000     0.5593    –0.0784      1.0000   1.0092   1.3660
1      Group 4      0.0000     0.8930     0.4622      1.0000   1.0661   1.4910
1      Group 5      0.0000     0.6808     0.0856      1.0000   1.0927   1.2583
2      Group 1      0.0000     0.4218     4.4504      1.0000   1.0483   2.1302
2      Group 2      0.0000     0.1192    –1.0500      1.0000   0.8695   2.0350
2      Group 3      0.0000     0.0802    –1.0467      1.0000   0.9320   2.1604
2      Group 4      0.0000    –0.5602    –1.6938      1.0000   1.1380   2.5416
2      Group 5      0.0000    –1.7368    –5.6242      1.0000   0.9781   1.7734
3      Group 1      0.0000     0.5929     0.4584      1.0000   1.0153   0.9944
3      Group 2      0.0000    –0.7004    –0.3818      1.0000   1.0841   1.1513
3      Group 3      0.0000    –0.4016    –0.0861      1.0000   1.0557   1.0988
3      Group 4      0.0000    –1.5818    –0.7226      1.0000   1.0774   1.1999
3      Group 5      0.0000    –1.0241    –0.6721      1.0000   1.1177   1.1141


Simulation Results: Setup 2

Truly multimodal distributions are rarely found in real data, even though results from

large-scale assessments show variables that account for large differences in average achievement

between subgroups. Setup 2 was designed to be a less extreme version of the same model used

for Setup 1 and was made more realistic by allowing larger between-scale correlations as they

can be found in many large-scale assessment programs. Analyses like the ones presented for

Setup 1 were carried out with the six-item data set in Setup 2 in order to obtain additional results

from this less extreme case. Table 8b shows the comparison of CJ-DE MML regression and

ETS-DE regression estimates with the regression coefficients based on the true θ values.

Table 8b

Regression Coefficients for the Six-item Simulated Data Set, Setup 2



           Scale 1                     Scale 2                     Scale 3
INTER      -3.186  -3.162  -3.100     -2.953  -2.996  -2.973      0.123   0.138   0.132
CLUSTER    -0.033  -0.021  -0.018      0.000  -0.013  -0.014     -0.040  -0.080  -0.080
STRATA     -0.016  -0.020  -0.020     -0.005  -0.002  -0.002     -0.010   0.000   0.000
GENDER      1.203   1.199   1.178      0.495   0.484   0.482     -0.014   0.015   0.018
GROUP       0.285   0.267   0.258      0.537   0.558   0.556      0.053   0.094   0.096
SES         0.394   0.382   0.373      0.313   0.330   0.327      0.000  -0.023  -0.024

Both ETS-DE and MML regression reproduce the regression weights based on the true values

closely for this data set. This indicates that AM’s MML regression and ETS-DE agree on the underlying

relationship between the reporting variables and the latent trait variables, so that the basis on which CJ-

DE marginal direct estimation and ETS-DE’s conditioning model are compared is the same.

Table 9 shows the residual correlations and variances for the true θ residuals and for the

estimates as obtained by MML regression and ETS-DE.

Table 9

Residual Correlations With Variances in the Diagonal, Six-Item Data Set, Setup 2

        TRUE                       ETS-DE                     MML
Scale   1       2       3          1       2       3          1       2       3
1       0.410  –0.077   0.200      0.437  –0.102   0.330      0.412  –0.104   0.332
2               0.297  –0.228              0.349  –0.109              0.345  –0.106
3                       0.996                      1.036                      1.069

The two approaches reproduce the residual covariance matrix in a very similar way. The differences between ETS-DE and AM's MML regression are even smaller than the small differences between each of the two approaches and the true values. The results on the regression part of ETS-DE and the MML

regression module of AM give no indication that the basic relationships between the three latent

traits and the subgroup variables are represented differently by the two approaches to direct

estimation.

For the reporting variable GENDER in Setup 2, Table 10 shows the respective standardized mean differences and variance ratios.

Table 10

Standardized Mean Differences and Variance Ratios for GENDER for the Six-item Data Set, Setup 2



                 Standardized mean difference     Variance ratio
Scale  Group     TRUE     ETS-DE    CJ-DE         TRUE    ETS-DE   CJ-DE
1      Female    0.000     0.544     1.088        1.000   1.082    1.218
1      Male      0.000    –0.450    –1.517        1.000   0.969    1.019
2      Female    0.000     0.069     0.057        1.000   1.098    1.126
2      Male      0.000    –0.045    –0.277        1.000   1.084    1.117
3      Female    0.000    –0.641    –0.281        1.000   1.026    1.044
3      Male      0.000    –0.164    –0.015        1.000   0.990    1.086

The results for Setup 2 show smaller but noticeable differences between the estimates of CJ-DE on the one hand and the true values and ETS-DE on the other. The marginal distributions in Setup 2 deviate to a lesser extent from the CJ-DE normality assumption, so the subgroup estimates of CJ-DE appear to be less affected by this moderate model violation than in Setup 1. Table 11 shows the results for the reporting variable GROUP, which is the variable with the strongest effect on Scale 2 and which again could not be estimated by CJ-DE using the default options.

As expected, the standardized mean differences between the true values and CJ-DE for Scale 2 are generally larger in absolute value than the differences between the true values and ETS-DE. In addition, the variance ratios for CJ-DE range from about 1.48 to 1.92 for Scale 2, indicating that CJ-DE overestimates subgroup variances here.

The results for both reporting variables GENDER and GROUP are similar with respect to

where CJ-DE deviates from the TRUE values and the ETS-DE approach: The GENDER effect is

largest for Scale 1, where CJ-DE deviates most when reporting GENDER subgroup means.

Similarly, the GROUP reporting variable has its strongest effect on Scale 2, where CJ-DE deviates most when reporting on the GROUP subgroups.

Table 11

Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 2


                    Standardized mean differences     Variance ratio
Scale  Subgroups    TRUE      ETS-DE     CJ-DE        TRUE     ETS-DE   CJ-DE
1      Group 1      0.0000     0.0349     0.6512      1.0000   0.9608   0.9649
1      Group 2      0.0000    –0.0490    –0.2745      1.0000   1.1590   1.2968
1      Group 3      0.0000     0.3333    –0.0897      1.0000   0.8876   0.8820
1      Group 4      0.0000     0.1477    –0.5000      1.0000   0.9061   0.9119
1      Group 5      0.0000    –0.2414    –0.3534      1.0000   1.0628   1.1324
2      Group 1      0.0000    –0.5567     3.2474      1.0000   1.1540   1.9239
2      Group 2      0.0000     0.0300     0.0300      1.0000   1.0930   1.7859
2      Group 3      0.0000     0.4177    –1.0127      1.0000   1.1113   1.6488
2      Group 4      0.0000     0.6000    –1.2143      1.0000   1.2346   1.7322
2      Group 5      0.0000    –0.3088    –3.4118      1.0000   1.1239   1.4805
3      Group 1      0.0000    –0.3298    –0.6702      1.0000   0.9496   0.9625
3      Group 2      0.0000     0.1038     0.6321      1.0000   1.1183   1.1314
3      Group 3      0.0000     0.1954     0.3448      1.0000   1.0988   1.1816
3      Group 4      0.0000    –0.7143    –0.5048      1.0000   1.0020   1.0563
3      Group 5      0.0000    –0.8095    –0.4762      1.0000   0.8931   1.0000


Conclusions: Study I

In the examples presented above, AM’s MML regression module yields similar results to

what is found when using the regression results of the ETS-DE methodology. Regression

coefficients, residual correlations, and variances are reproduced in much the same way as ETS-

DE recovers these parameters. These results cannot be generalized as they are currently based on

a few simulated data sets only. Nevertheless, all examples presented here indicate that both

software programs agree on the basic correlational relationships in the data as given by the AM

MML regression module and ETS-DE’s regression estimates.

In contrast to the close agreement of ETS-DE regression and AM’s MML regression, the

AM module for CJ-DE—the marginal normality direct estimation approach—diverges from the

ETS-DE results and the true values if the marginal distributions are non-normal. The exemplary

data sets were constructed and simulated in a way to show where discrepancies can be expected,

and the results so far match the expectations. Setup 1 was constructed to study how CJ-DE

performs if marginal distributions are bimodal or multimodal, and CJ-DE did not converge with

the default settings for the scales that violated the assumptions used in the marginal direct

estimation approach. Setup 2 represents a “milder” version of model violation for CJ-DE and shows that even under this setup, where the multimodality of the marginal distribution is less obvious, the CJ-DE estimates differ from the true values and from the values produced by ETS-DE, the conditional normality direct estimation approach.

Assuming that the latent trait is normally distributed across groups may lead to an

inappropriate model because of strong monotonicity assumptions in the IRT model (note that

IRT serves as the basis for both ETS-DE and CJ-DE). For the 1PL and 2PL IRT models as well

as the (generalized) partial credit models, a simple statistic of the observed responses—the

weighted sum of scores—is sufficient for estimating the latent trait. Even for the 3PL, the

monotonicity of the success probability P(X=1|θ) in the latent trait θ and in the item parameters

ensures a relationship between the observed distribution of the raw scores and the unobserved

(but not arbitrary!) distribution of the latent trait. As an example, if a test is administered to two

different samples that differ a lot in their ability distributions (e.g. a reading test taken by both a

group of kindergarten students and a group of third graders), it seems unreasonable to assume a

joint normal distribution. A model assuming marginal normality would force both distributions

under one mode and produce biased estimates of differences between these two groups and other

groups defined by additional reporting variables.
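The point that a mixture of well-separated subgroups cannot be marginally normal can be illustrated with a short Python sketch. It builds the marginal density as a mixture of two normal subgroup densities and counts its modes on a grid. The subgroup means and standard deviations are the TRUE values for GENDER from Table 6a; the equal subgroup weights, the grid, and all function names are assumptions made for this illustration only.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """components: list of (weight, mean, sd) tuples, weights summing to 1."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def count_modes(components, lo=-4.0, hi=4.0, steps=800):
    """Count strict local maxima of the mixture density on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, components) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


def mixture_mean_sd(components):
    """Marginal mean and SD: within-group variance plus between-group variance."""
    mean = sum(w * m for w, m, _ in components)
    var = sum(w * (s * s + (m - mean) ** 2) for w, m, s in components)
    return mean, math.sqrt(var)


# Female and Male subgroups, equal weights assumed (not stated in the tables).
scale1 = [(0.5, -0.849, 0.548), (0.5, 0.840, 0.527)]   # large GENDER effect
scale3 = [(0.5, 0.000, 1.029), (0.5, 0.027, 0.975)]    # negligible GENDER effect

print(count_modes(scale1), count_modes(scale3))
print(mixture_mean_sd(scale1))
```

Under these assumptions the Scale 1 mixture has two modes and the Scale 3 mixture has one, while the Scale 1 mixture still has a marginal mean near 0 and a marginal standard deviation near 1. A single normal distribution with those two moments would therefore match the marginal mean and variance but not the shape of the distribution.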

The simulated data examples revealed effects of CJ-DE in the presence of non-normal

marginal distributions: systematic deviations from the true values in the mean and in the variance

estimates. In contrast, no indication of systematic differences between the true values and the ETS-DE approach was found in the examples analyzed here. From the perspective of data

analysis, the differences in the subgroup mean estimates of CJ-DE are easier to detect, because in

extreme cases CJ-DE reports when it fails to converge. Nevertheless, when using AM’s “slog

through” estimation option and increasing the number of iterations, there may be no indication of

nonconvergence. The effects of CJ-DE when estimating subgroup variances are more difficult to

detect, as this can only be accomplished by additional analysis using other, less restrictive

methods.

Study II: Comparing Marginal Direct Estimation and Conditional Direct Estimation Subgroup Statistics for NAEP and NALS Data

Study I showed that the marginal direct estimation (CJ-DE) method relies strongly on the

assumption that the latent trait is marginally normally distributed. The CJ-DE method as

implemented in the AM software (Cohen, 1998) does not reproduce subgroup means and variances appropriately in cases where the grouping variable of interest explains a substantial part of the differences in the latent trait.

The examples presented here help in studying consequences of this effect of marginal

direct estimation in large-scale assessment data analysis. Assessments across a number of

countries, states, regions, or other grouping variables cannot assume a certain form of marginal

distribution of the trait across the groups (Yamamoto & Mazzeo, 1992). In addition, assuming

that subgroup variances are homogeneous (i.e., that the trait[s] vary to a similar degree within all

groups) might be too restrictive to fit diverse populations. Data from large-scale assessment

programs provide a source to study differences between CJ-DE and ETS-DE in a realistic data

analysis setting. Using real data with operational reporting variables enables one to formulate

expectations about whether certain variances should be equal or for which subgroups differences

may be expected. This adds a different perspective to what was examined in Study I, where

known parameters were compared with CJ-DE and ETS-DE estimates.

NAEP Math Assessment, Grade 4

As the first real data example, results were compared for ETS-DE and CJ-DE on data

from an assessment given to a nationally representative sample of 13,855 students in the fourth

grade for the National Assessment of Educational Progress (NAEP). The assessment,

administered in 2000, used a sparse matrix sample design where examinees were given a 45-

minute test of mathematics items consisting of a mixture of multiple choice and constructed

response items. The 173-item pool was divided into 13 blocks of items (separately-timed

sections). The blocks were assembled into 26 booklets based on a BIB (balanced incomplete

block) design (Braswell et al., 2001). Each booklet contained three blocks of items, which were

classified into five content-area scales—numeracy and operations, measurement, geometry, data

analysis, and algebra. A typical examinee answered from 6 to 12 items per scale. A multiscale

IRT model estimated with PARSCALE was used to calibrate the IRT item parameters for each

of the five scales.

The following exhibits show results based on the ETS-DE methodology using 381

background variables in addition to item responses in order to obtain subgroup estimates. The

381 background variables are factor scores based on a principal component analysis that was

conducted using the variables available from the background questionnaire (see Braswell et al.,

2001, for details on the NAEP 2000 math assessment and the available background data). The

operational NAEP 2000 item parameters were used in a five-dimensional run with CGROUP, the

current software implementation of the multidimensional ETS-DE approach. The ETS-DE

approach was found to work accurately in recovering subgroup means and variances in Study I

and serves as a benchmark for CJ-DE, which has been proposed for use for subgroup reporting

(Cohen & Jiang, 1999). In contrast to CJ-DE, the ETS-DE approach assumes conditional

normality of the latent traits given a large set of background variables. Given that a large number

of background variables are used that explain a significant portion of the latent trait variance, this

approach is capable of modeling complex mixtures of abilities resulting in non-normal

population and subgroup distributions. To compare the results of ETS-DE and CJ-DE, the

operational data and NAEP 2000 math item parameters were imported into the software that

implements CJ-DE.

School Type

The first reporting variable used in this comparison is School Type, which has three

categories in NAEP—Public, Private, and Catholic. The subsequent tables offer a comparison

between CJ-DE and ETS-DE, the benchmark, on the basis of standardized mean differences and

variance ratios similar to the exhibits in the previous part of the report. Table 12a shows the

reference values estimated by ETS-DE in the untransformed latent trait scale, not in the NAEP

reporting scale. The untransformed latent trait scale is implicitly given by the item parameters as

calibrated with the PARSCALE software. PARSCALE defaults to a marginal latent trait mean M(θ) = 0 and standard deviation S(θ) = 1.

Table 12a

ETS-DE Estimates of the Means and Standard Deviations in the Latent Trait (Theta) Scale for School Type Subgroups


              Mean                             Standard deviation
Scale         Public    Private   Catholic     Public    Private   Catholic
NUM&OPER      –0.047    0.430     0.368        1.021     0.913     0.842
MEASURMT      –0.053    0.480     0.402        1.060     0.928     0.897
GEOMETRY      –0.034    0.299     0.267        1.012     0.913     0.821
DATA ANL      –0.045    0.327     0.425        1.103     0.969     0.880
ALGEBRA       –0.047    0.429     0.358        1.081     0.969     0.886

The Private and Catholic school categories have means that are about 0.35 to 0.52 standard deviations higher than the mean for Public schools, whereas the standard deviations for these subgroups are slightly lower than the subgroup standard deviation for the Public school category across all five scales of the NAEP math assessment. Table 12b gives the

corresponding standardized mean differences and variance ratios. The table shows these values

for the School Type subgroups, where the differences are formed by “CJ-DE minus ETS-DE”

and the ratios are “CJ-DE divided by ETS-DE.”

Table 12b

Standardized Mean Differences and Variance Ratios for School Type Subgroups


              Standardized mean differences     Variance ratio
Scale         Public    Private   Catholic      Public    Private   Catholic
NUM&OPER      0.047     –0.245    –0.737        0.931     1.129     1.338
MEASURMT      0.039      0.137    –0.293        0.911     1.143     1.235
GEOMETRY      0.074      0.658    –1.390        0.902     1.087     1.359
DATA ANL      0.071     –0.420    –0.330        0.820     1.044     1.253
ALGEBRA       0.149     –1.226    –0.439        0.832     1.010     1.211


ETS-DE and CJ-DE provide quite similar subgroup mean estimates for most of the five

scales in the three subgroups, but there are differences in the subgroup standard deviations

reported by the two methods. The ETS-DE method reports that the Catholic school subgroup has a smaller standard deviation than the Public school subgroup on all five scales, whereas the CJ-DE method reports comparatively similar standard deviations across the three subgroups. In Study I, using simulated data examples, it was found that CJ-DE does not recover

differences in subgroup standard deviations correctly. The ETS-DE method, however, was found

to recover this type of subgroup heteroscedasticity in the simulated examples, and ETS-DE

reflects differences between subgroup variances in the NAEP example reported here.
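The difference in how the two methods treat subgroup spread can be made concrete with a small Python sketch. It assumes, as stated for Table 12b, that each variance ratio is the CJ-DE variance divided by the ETS-DE variance, so that the CJ-DE standard deviation implied by the tables is the ETS-DE standard deviation times the square root of the ratio. The values are the NUM&OPER rows of Tables 12a and 12b; the function and variable names are illustrative, and because the table entries are rounded, the implied values are approximate.

```python
import math

# NUM&OPER scale: ETS-DE standard deviations (Table 12a) and
# CJ-DE / ETS-DE variance ratios (Table 12b).
ets_sd = {"Public": 1.021, "Private": 0.913, "Catholic": 0.842}
var_ratio = {"Public": 0.931, "Private": 1.129, "Catholic": 1.338}


def implied_sd(reference_sd, variance_ratio):
    """SD implied by a reference SD and a variance ratio relative to it."""
    return reference_sd * math.sqrt(variance_ratio)


cjde_sd = {group: implied_sd(ets_sd[group], var_ratio[group]) for group in ets_sd}

# Spread of subgroup SDs under each method (largest minus smallest).
ets_spread = max(ets_sd.values()) - min(ets_sd.values())
cjde_spread = max(cjde_sd.values()) - min(cjde_sd.values())

for group in ets_sd:
    print(group, ets_sd[group], round(cjde_sd[group], 3))
print(round(ets_spread, 3), round(cjde_spread, 3))
```

For this scale the ETS-DE standard deviations differ by about 0.18 between the Public and Catholic subgroups, whereas the implied CJ-DE standard deviations all fall within roughly 0.02 of each other.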

Race/Ethnicity

The next variable analyzed is Race/Ethnicity, which has four categories—WHI/AI/O

(White, American Indian, Other), AFRAM (African American), HISPANIC (Hispanic

American), and ASIAM (Asian American)—in the NAEP 2000 data. Table 13 below shows the

subgroup mean differences between CJ-DE and ETS-DE and the corresponding variance ratios

for this reporting variable.

Table 13

Race/Ethnicity Subgroup Reports Generated Based on the NAEP 2000 Grade 4 Math Data 1


              Standardized mean differences                Variance ratio
Scale         WHI/AI/O   AFRAM     HISPANIC   ASIAM        WHI/AI/O   AFRAM    HISPANIC   ASIAM
NUM&OPER      -0.259      0.482     0.348     -0.285       0.996      0.952    0.849      0.729
MEASURMT      -0.109     -0.172     0.459      0.091       0.955      0.935    0.830      0.738
GEOMETRY      -0.282      0.012     0.658      0.222       0.976      0.901    0.784      0.715
DATA ANL      -0.094      0.743    -0.254     -0.715       0.889      0.778    0.691      0.730
ALGEBRA       -0.305      0.490     0.477     -0.388       0.878      0.815    0.728      0.717

Note. Large differences from the expected values given the more general model are printed in boldface.

The subgroup mean differences indicate that the estimates of the two methods do not

differ significantly from each other. CJ-DE reproduces the ETS-DE mean estimates satisfactorily for the race/ethnicity subgroup variable.

The variance estimates given by CJ-DE differ from those reported by the ETS-DE method for the subgroups AFRAM, HISPANIC, and ASIAM. For these subgroups, the CJ-DE variance estimates are as low as about 0.7 times the respective ETS-DE estimates. In contrast, CJ-DE yields variance estimates more similar to those of ETS-DE for the WHI/AI/O subgroup.

Individualized Education Plan

Table 14 shows the subgroup mean differences of CJ-DE estimates against the ETS-DE

analysis and the corresponding variance ratios for the dichotomous grouping variable IEP

(Individualized Education Plan). There is a large mean difference between the two subgroups

IEP and non-IEP. The IEP group means are approximately 0.9 standard deviations smaller than

the non-IEP group estimates across all five scales (see Appendix C, where the ETS-DE estimates

for the reporting variable IEP are given).

Based on the findings of Study I, it can be expected that CJ-DE mean estimates will not

reflect the large difference between the IEP and the non-IEP subgroups. The standardized mean

differences and variance ratios for the IEP reporting variable are given in Table 14.

Table 14

IEP Subgroup Reports Based on the NAEP 2000 Grade 4 Math Data


              Standardized mean differences     Variance ratio
Scale         IEP       Non-IEP                 IEP       Non-IEP
NUM&OPER      5.207     –0.556                  0.841     1.025
MEASURMT      4.145     –1.206                  0.835     0.996
GEOMETRY      4.832      0.042                  0.830     1.002
DATA ANL      4.783      0.275                  0.685     0.921
ALGEBRA       5.639     –0.998                  0.621     0.961

Note. Large differences from the expected values given the more general model are printed in boldface.

The CJ-DE estimates show large differences to the IEP group means as provided by ETS-

DE. CJ-DE reports consistently smaller mean differences between IEP and non-IEP subgroups,

so that the corresponding mean difference between CJ-DE and ETS-DE is a large positive

number. The same was found in Study I (see above) using simulated data when the absolute

mean differences between subgroups are large. These results support the conjecture that CJ-DE direct estimation of subgroup mean differences deviates from more general models in the presence of large between-group differences. Compared to ETS-DE, CJ-DE slightly underestimates the IEP subgroup variances for the scales NUM&OPER, MEASURMT, and GEOMETRY. For the IEP subgroup variances of ALGEBRA and DATA ANL, the CJ-DE estimates are only about 0.6 to 0.7 times the size of the corresponding ETS-DE estimates.

National Adult Literacy Survey

The second real data set used in this comparison is taken from the National Adult Literacy Survey (NALS) administered in 1992. This data set consists of 21,363 subjects and contains a

sparse matrix sample of 713 items from three content domains of literacy—quantitative, prose,

and document. NALS

…measured literacy along three dimensions, prose literacy, document literacy, and

quantitative literacy, designed to capture an ordered set of information-processing skills

and strategies that adults use to accomplish a diverse range of literacy tasks. The literacy

scales make it possible to profile the various types and levels of literacy among different

subgroups in our society (“Defining and measuring literacy,” n.d.).

The example comparisons presented here use the NALS main assessment data file and the operational item parameters with the CGROUP program, the current implementation of the ETS-DE approach. The same data and item parameters were imported into the AM software, which implements the CJ-DE approach.

Similar to the preceding analyses, a number of policy-relevant grouping variables from

the NALS data file were chosen to compare the subgroup distribution estimates as given by the

ETS-DE and the CJ-DE approach. Table 15 shows standardized mean differences and variance ratios between the estimates of CJ-DE and ETS-DE for the grouping variable REGION with four subgroups.

Table 15

Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for REGION as Defined in the NALS 1992 Data


            Standardized mean differences        Variance ratio
REGION      Prose     Document   Quantitative    Prose     Document   Quantitative
MIDWEST     –0.560    –1.189     –0.418          1.012     0.981      0.964
N-EAST      –0.154     0.193     –0.277          0.828     0.809      0.831
SOUTH       –0.112    –0.275     –0.298          0.770     0.743      0.748
WEST         0.771     1.029      0.892          0.717     0.708      0.740

Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 are printed in boldface.

The results indicate that all four subgroup mean estimates given by ETS-DE and CJ-DE

agree relatively well. In contrast to the agreement between ETS-DE and CJ-DE for the means of

the region subgroups, the variance estimates for the regions SOUTH and WEST given by CJ-DE

are only about 0.75 times as large as the variance estimates given by ETS-DE.

The next NALS reporting variable used in the comparison is BORN IN having the five

categories—USA, SPAN (Spanish-speaking world), EUROP, ASIA, and OTHER. Table 16

shows the standardized mean differences and variance ratios of CJ-DE compared to the ETS-DE estimates for this reporting variable.

Table 16

Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for the Grouping Variable BORN IN as Defined in the NALS 1992 Data

Standardized mean differences Variance ratio

BORN IN Prose Document Quantitative Prose Document Quantitative

USA –2.687 –2.899 –2.475 0.923 0.890 0.884

SPAN 8.892 7.552 7.142 0.413 0.409 0.455

EUROP 0.616 0.980 0.415 0.559 0.592 0.655

ASIA 2.093 1.930 1.789 0.539 0.505 0.553

OTHER 1.404 1.293 1.182 0.588 0.564 0.627

Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences absolute larger than 2.2 are printed in boldface.

There are discrepancies between the subgroup mean estimates of CJ-DE and ETS-DE for the USA and SPAN subgroups. The CJ-DE estimates for USA are about 2.5 to 2.9 standard units lower than the ETS-DE estimates for the three literacy scales. The standardized differences between the CJ-DE mean estimates and the ETS-DE estimates for SPAN lie between about 7 and 9 across the three scales, indicating that CJ-DE differs significantly from the ETS-DE estimates. The variance ratio for four subgroups (SPAN, EUROP, ASIA, and OTHER) is between 0.4 and 0.66 across all three scales of the NALS data, indicating that the CJ-DE variance estimates are systematically smaller than the ETS-DE estimates in this case.
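The screening rule stated in the note to Table 16 (variance ratios smaller than 0.75 and absolute standardized mean differences larger than 2.2) can be applied mechanically. The Python sketch below does this for the Table 16 values. The data layout, the function name, and the choice to flag a subgroup when any of its three scales crosses a threshold are assumptions made for this illustration.

```python
# Table 16: standardized mean differences (z) and variance ratios,
# in the order Prose, Document, Quantitative.
born_in = {
    "USA":   {"z": [-2.687, -2.899, -2.475], "ratio": [0.923, 0.890, 0.884]},
    "SPAN":  {"z": [8.892, 7.552, 7.142],    "ratio": [0.413, 0.409, 0.455]},
    "EUROP": {"z": [0.616, 0.980, 0.415],    "ratio": [0.559, 0.592, 0.655]},
    "ASIA":  {"z": [2.093, 1.930, 1.789],    "ratio": [0.539, 0.505, 0.553]},
    "OTHER": {"z": [1.404, 1.293, 1.182],    "ratio": [0.588, 0.564, 0.627]},
}


def flag_subgroups(table, z_cut=2.2, ratio_cut=0.75):
    """Return {subgroup: (mean_flag, variance_flag)}.

    mean_flag is True if any |z| exceeds z_cut; variance_flag is True if
    any variance ratio falls below ratio_cut.
    """
    flags = {}
    for group, row in table.items():
        mean_flag = any(abs(z) > z_cut for z in row["z"])
        variance_flag = any(r < ratio_cut for r in row["ratio"])
        flags[group] = (mean_flag, variance_flag)
    return flags


flags = flag_subgroups(born_in)
for group, (mean_flag, variance_flag) in flags.items():
    print(group, mean_flag, variance_flag)
```

With these thresholds, USA is flagged on the means only, SPAN on both means and variances, and EUROP, ASIA, and OTHER on the variances only, which matches the discrepancies described above.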

The final comparison of CJ-DE and ETS-DE on the basis of the NALS data is based on

the reporting variable “Years living in the USA.” This reporting variable has nine categories,

ranging from “1-5 years in the USA” to “Ever live in the USA,” in 5 to 10 year intervals (see

below). Table 17 shows the standardized mean differences between CJ-DE and ETS-DE and the

variance ratios for the three literacy scales across the nine subgroups.

Table 17

Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for “Years Living in the USA” as Defined in NALS 1992 Data


              Standardized mean differences        Variance ratio
Yrs in USA    Prose     Document   Quantitative    Prose     Document   Quantitative
1–5           12.060    11.418      8.791          0.432     0.410      0.432
6–10           2.762     2.725      2.584          0.545     0.527      0.572
11+            4.121     4.581      3.361          0.427     0.434      0.471
16+            2.863     2.333      2.740          0.440     0.436      0.486
21+            1.461     2.035      2.208          0.520     0.518      0.565
31+            0.777     1.151      0.913          0.558     0.602      0.609
41+            0.187     0.370      0.104          0.467     0.473      0.499
51+           –0.441    –0.035     –0.322          0.939     0.943      0.979
Ever          –3.182    –3.471     –2.770          0.937     0.902      0.896

Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences absolute larger than 2.2 are printed in boldface.

The subgroup mean estimates of CJ-DE are between 2.3 and 12 standardized units larger than the corresponding estimates given by ETS-DE for the subgroups “1–5 years in the USA,” “6–10,” “11+,” and “16+.” The mean estimate for the subgroup “Ever live in the USA” is between about 2.8 and 3.5 standard units smaller for CJ-DE than for ETS-DE.

The variance estimates given by CJ-DE for the first seven subgroups, from “1–5” through “41+,” are systematically smaller than those reported by ETS-DE. The variance ratio lies between 0.41 and 0.61 in these subgroups across all three scales. In contrast, the CJ-DE subgroup variance estimates for “Ever” and “51+” are close to those of ETS-DE, with variance ratios close to 1. Note that the subgroups of US residents with a comparatively small number of years residing in the United States are the subgroups with a comparatively larger difference from the total

mean (see Appendix D). For these subgroups, CJ-DE yields estimates that deviate more from

33

what is given by the more general ETS-DE approach, whereas subgroups closer to the total mean

(“Ever” and “51+”) receive estimates that agree more closely with the ETS-DE approach.
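The pattern described here can be checked descriptively by relating the absolute ETS-DE mean differences from Appendix D to the variance ratios from Table 17. The sketch below is an illustration only and is not an analysis from the report; the helper `pearson` and the data layout are assumptions made for this example, and with nine subgroups per scale the coefficient is a rough summary at best.

```python
from math import sqrt

GROUPS = ["1-5", "6-10", "11+", "16+", "21+", "31+", "41+", "51+", "Ever"]

# ETS-DE subgroup means as differences from the total mean (Appendix D).
ETS_MEAN_DIFF = {
    "Prose": [-1.287, -1.181, -1.228, -0.900, -0.714, -0.463, -0.616, 0.068, 0.102],
    "Document": [-1.154, -1.029, -1.094, -0.874, -0.691, -0.513, -0.749, -0.139, 0.094],
    "Quantitative": [-1.043, -0.987, -1.026, -0.777, -0.565, -0.351, -0.585, 0.062, 0.084],
}

# Variance ratios, CJ-DE divided by ETS-DE (Table 17).
VARIANCE_RATIO = {
    "Prose": [0.432, 0.545, 0.427, 0.440, 0.520, 0.558, 0.467, 0.939, 0.937],
    "Document": [0.410, 0.527, 0.434, 0.436, 0.518, 0.602, 0.473, 0.943, 0.902],
    "Quantitative": [0.432, 0.572, 0.471, 0.486, 0.565, 0.609, 0.499, 0.979, 0.896],
}


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


correlations = {
    scale: pearson([abs(d) for d in ETS_MEAN_DIFF[scale]], VARIANCE_RATIO[scale])
    for scale in ETS_MEAN_DIFF
}
for scale, r in correlations.items():
    print(f"{scale}: r = {r:.2f}")
```

A negative coefficient would be in line with the observation that subgroups farther from the total mean show smaller CJ-DE variance estimates relative to ETS-DE.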

Conclusions: Study II

The results reported in Study II show similarities with the results obtained in Study I,

which used simulated data. In the case of simulated data, CJ-DE differs from the values obtained

by ETS-DE and the true values obtained from analyzing the simulated proficiency values used

for generating the response data. The assumption of marginal normality leads to discrepancies

between CJ-DE and the true values in the presence of large subgroup mean differences and in

cases where the subgroup variances are heteroscedastic. Recall Cohen and Jiang's (1999) direct estimation model, where the conditional density of $\theta$ given subgroup membership $g=k$ is derived based on the marginal normality assumption. This density depends on the marginal parameters $\mu$ and $\sigma$ and subgroup parameters $(a_1, b_1, \ldots, a_G, b_G)$. Essentially, the marginal normal density $\frac{1}{\sigma}\varphi\!\left(\frac{\theta-\mu}{\sigma}\right)$ acts as a prior for the conditional density

$$
f(\theta \mid g=k) = \frac{f(g=k \mid \theta)\,\frac{1}{\sigma}\,\varphi\!\left(\frac{\theta-\mu}{\sigma}\right)}{\int f(g=k \mid \theta)\,\frac{1}{\sigma}\,\varphi\!\left(\frac{\theta-\mu}{\sigma}\right)\,d\theta} \qquad (14)
$$

which prevents the conditional densities from fitting larger subgroup mean differences. This

may explain why the CJ-DE standard deviation estimates are less variable across

subgroups, and the restriction of the standard deviation is correlated with the distance of the

corresponding subgroup mean from the total mean. A thorough analysis of the marginal direct

estimation model (Cohen & Jiang, 1999) should reveal that this restriction of the parameter space

is caused by the assumption of marginal normality. This assumption forces the mixture of

subgroup distributions to fit under the unimodal normal distribution.
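The way a marginal normal density acts as a prior on the conditional subgroup densities can be illustrated numerically. The sketch below is an illustration under stated assumptions and not the CJ-DE implementation: the multinomial-logit form of the membership probability, the parameter values for `a` and `b`, and the integration grid are all hypothetical choices made for this example. It applies Bayes' theorem on a grid and shows that the weighted subgroup densities add up to the assumed marginal normal density.

```python
import numpy as np


def normal_pdf(theta, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (theta - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


def membership_prob(theta, a, b):
    """Hypothetical multinomial-logit form for P(g = k | theta); a, b have shape (G,)."""
    logits = a[:, None] + b[:, None] * theta[None, :]
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def conditional_densities(theta, mu, sigma, a, b):
    """Bayes' theorem on a grid: membership probability times normal prior, normalized."""
    prior = normal_pdf(theta, mu, sigma)
    joint = membership_prob(theta, a, b) * prior[None, :]  # numerator
    dtheta = theta[1] - theta[0]
    group_prob = joint.sum(axis=1) * dtheta                # denominator (Riemann sum)
    return joint / group_prob[:, None], group_prob


theta = np.linspace(-8.0, 8.0, 4001)
dtheta = theta[1] - theta[0]
a = np.array([0.0, -0.5, -1.0])   # illustrative values only
b = np.array([0.0, 1.0, -1.5])    # illustrative values only

dens, p = conditional_densities(theta, 0.0, 1.0, a, b)

# Subgroup means and standard deviations implied by the conditional densities.
means = (dens * theta).sum(axis=1) * dtheta
sds = np.sqrt((dens * (theta[None, :] - means[:, None]) ** 2).sum(axis=1) * dtheta)

# The size-weighted sum of the subgroup densities reproduces the marginal normal.
mixture = (p[:, None] * dens).sum(axis=0)
print("group proportions:", np.round(p, 3))
print("subgroup means:   ", np.round(means, 3))
print("subgroup SDs:     ", np.round(sds, 3))
print("mixture equals marginal normal:", np.allclose(mixture, normal_pdf(theta, 0.0, 1.0)))
```

Whatever values are chosen for the membership parameters in this sketch, the subgroup densities are tied together by the requirement that their weighted sum equals the marginal normal density.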

The conclusion in Study II, which uses real data from NAEP and NALS, corresponds

closely to the findings of Study I, which compares CJ-DE and ETS-DE based on simulated data

examples, even though in real data applications, the true values usually are unknown. In the

presence of large subgroup mean differences, CJ-DE yields less extreme subgroup estimates than

ETS-DE, which also was found in the comparison in Study I of both methods to the true values.

Additionally, the variance estimates given by CJ-DE tend to be more similar across subgroups than the ETS-DE estimates, as was also seen when comparing CJ-DE to the true values in Study I. The CJ-DE variance estimates seem to be increasingly restricted as the distance of the subgroup mean from the total mean increases.

As noted in the introduction, CJ-DE uses a number of assumptions to derive a conditional

subgroup density while maintaining the restriction of normality of the marginal density. This

normal marginal assumption of the latent trait is believed to reflect common practice in large-

scale assessment applications of IRT (see Cohen & Jiang, 1999). However, NAEP and other

large-scale assessments do not rely on this assumption. Appendix E gives an example of how to

avoid the assumptions of CJ-DE when using AM in order to estimate a less restrictive model

with this software. The results of using simulated data and the results of using real data both

show that these assumptions used in CJ-DE lead to discrepancies when analyzing complex

samples where the assumptions are not met by the data. The operational ETS-DE approach does

not put the normality assumption in the marginal distribution, but in the conditional distribution

of the latent trait given the item responses and a large number of the background variables. The

conditioning approach utilized by ETS-DE is therefore more general and enables it to fit non-

normal distributions, as the conditional means given the background model are not assumed to

follow a specific distribution. In light of the systematic differences seen in both Study I and Study II, using methods such as CJ-DE that rely on item responses only and replace valuable background information with a number of assumptions does not seem defensible for the analysis

of large-scale assessment data. This also holds for trend studies, where the assessment of change

relies even more on maximizing the comparability of results and the accuracy of the mean and

variance estimates obtained across time points and subgroups.


References

Braswell, J. S., Lutkus, A. D., Grigg, W. S., Santapau, S. L., Tay-Lim, B., & Johnson, M. (2001).

The nation’s report card: Mathematics 2000. Washington, DC: National Center for

Education Statistics.

Cohen, J. D. (1998). AM online help content—Preview. Washington, DC: American Institutes for

Research.

Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across normal

populations. Journal of the American Statistical Association, 94(448), 1035-1044.

Defining and measuring literacy. (n.d.). In National assessments of adult literacy. Retrieved

December 6, 2002, from http://nces.ed.gov/naal/defining/defining.asp

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:

Addison-Wesley.

Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex

samples. Psychometrika, 56(2), 177-196.

Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population

characteristics from sparse matrix samples of item responses. Journal of Educational

Measurement, 29(2), 133-161.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago:

University of Chicago Press.

Thomas, N. (1993). Asymptotic corrections for multivariate posterior moments with factored

likelihood functions. Journal of Computational and Graphical Statistics, 2, 309-322.

Thomas, N. (2002). The role of secondary covariates when estimating latent trait population

distributions. Psychometrika, 67(1), 33-48.

Yamamoto, K., & Mazzeo, J. (1992). Item response theory scale linkage in NAEP. Journal of

Educational Statistics, 17(2), 155-173.


Notes

1 The item parameters in the k-scale IRT model are assumed to be known constants.

2 The overall mean and standard deviation reported here are estimates by ETS-DE and the TRUE data; CJ-DE does not provide overall means and standard deviations.

3 This indicates that the Catholic school category is more homogeneous as compared to the two other categories. The Public school category consistently has the largest standard deviations across all five scales.


Appendix A

Item Parameters of the Simulated Three-scale Six-item Data Set

Item      Slope        Difficulty

Scale 1
[1,]      1.0707435    –0.423607249
[2,]      1.1946191    0.369087609
[3,]      1.1356097    –0.008368651
[4,]      1.1029780    –0.434542858
[5,]      0.6926124    –0.320136837
[6,]      0.8034373    0.817567985

Scale 2
[7,]      0.9617609    0.003169065
[8,]      1.1004634    1.327405006
[9,]      0.9115646    0.451618136
[10,]     1.0574126    –2.053570652
[11,]     1.1098851    0.006184470
[12,]     0.8589135    0.265193973

Scale 3
[13,]     1.2621460    1.339141978
[14,]     0.8917393    –0.220816527
[15,]     0.9161605    0.758596816
[16,]     0.9253288    –0.066838528
[17,]     0.7505099    –0.099260362
[18,]     1.2541155    –1.710823377

Note. The guessing parameter was 0.1 for all items.
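To show how parameters of this kind translate into simulated item responses, the sketch below assumes a three-parameter logistic (3PL) response function with slope, difficulty, and a guessing parameter of 0.1. The functional form, the scaling constant `scale` (set to 1.0 here, although 1.7 is also common), the function names, and the seed are assumptions made for illustration; the appendix itself lists only the parameter values.

```python
import math
import random

# Slope and difficulty of the six Scale 1 items, transcribed from the table above.
SCALE_1_ITEMS = [
    (1.0707435, -0.423607249),
    (1.1946191, 0.369087609),
    (1.1356097, -0.008368651),
    (1.1029780, -0.434542858),
    (0.6926124, -0.320136837),
    (0.8034373, 0.817567985),
]
GUESSING = 0.1  # from the table note


def p_correct(theta, slope, difficulty, guessing=GUESSING, scale=1.0):
    """Assumed 3PL probability of a correct response at proficiency theta."""
    logistic = 1.0 / (1.0 + math.exp(-scale * slope * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


def simulate_responses(theta, items, rng):
    """Draw one 0/1 response per item for a single respondent."""
    return [int(rng.random() < p_correct(theta, a, b)) for a, b in items]


rng = random.Random(2003)  # arbitrary seed for reproducibility
responses = simulate_responses(0.0, SCALE_1_ITEMS, rng)
probabilities = [p_correct(0.0, a, b) for a, b in SCALE_1_ITEMS]
print("P(correct | theta = 0):", [round(p, 3) for p in probabilities])
print("simulated responses:   ", responses)
```

Under this assumed form, a respondent whose proficiency equals an item's difficulty answers correctly with probability 0.1 + 0.9 / 2 = 0.55, and the probability never drops below the guessing parameter.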


Appendix B

Item Parameters of the Simulated 3-scale 12-item Data Set


Item      Slope        Difficulty

Scale 1
[1,]      1.0301048    –0.40334405
[2,]      1.0807597    –0.12162779
[3,]      1.0250148    –0.29599706
[4,]      0.8097633    0.13585131
[5,]      1.0834746    –0.10137978
[6,]      1.0881449    1.18682432
[7,]      0.8241556    0.58488677
[8,]      1.0754401    0.95989977
[9,]      0.8284506    –1.57049425
[10,]     1.0272207    –0.24556290
[11,]     1.1410092    –0.53788298
[12,]     0.9864615    0.40882663

Scale 2
[13,]     1.0131312    0.56203695
[14,]     1.0604981    0.63205024
[15,]     1.2831725    –0.41368560
[16,]     1.1636971    –0.90477486
[17,]     0.9043142    0.01714852
[18,]     0.9837799    –0.84975192
[19,]     1.0296239    0.63169027
[20,]     1.2039188    0.04996556
[21,]     0.6799550    0.77051519
[22,]     0.9778539    –0.91851904
[23,]     1.0707815    0.19213650
[24,]     0.6292741    0.23118819

Scale 3
[25,]     1.1981095    0.24101286
[26,]     1.0874208    –0.18829633
[27,]     0.9684248    –0.58984308
[28,]     0.8853709    –0.95740524
[29,]     0.9017118    –0.19778461
[30,]     1.0488593    –1.42372395
[31,]     1.0086545    0.17463042
[32,]     0.8052735    1.48726305
[33,]     1.2051341    1.30940643
[34,]     1.0667933    0.28232721
[35,]     0.8616304    –0.32302987
[36,]     0.9626173    0.18544311

Note. The guessing parameter was 0.1 for all items.


Appendix C

ETS-DE Estimates for IEP Subgroup Means and Standard Deviations


            Mean                    Standard deviation
Scale       IEP       Non-IEP       IEP       Non-IEP
NUM&OPER    –0.910    0.091         1.050     0.966
MEASURMT    –0.841    0.054         1.100     1.016
GEOMETRY    –0.844    0.076         1.043     0.958
DATA ANL    –0.852    0.103         1.199     1.043
ALGEBRA     –0.927    0.099         1.244     1.009


Appendix D

Means and Standard Deviations for ETS-DE Estimates for

“Years Living in the USA” as Defined in NALS 1992 Data


                Mean                                  Standard deviation
Yrs in USA      Prose     Document  Quantitative      Prose     Document  Quantitative
1-5             –1.287    –1.154    –1.043            1.549     1.584     1.548
6-10            –1.181    –1.029    –0.987            1.325     1.364     1.305
11+             –1.228    –1.094    –1.026            1.506     1.513     1.448
16+             –0.900    –0.874    –0.777            1.507     1.514     1.447
21+             –0.714    –0.691    –0.565            1.389     1.396     1.346
31+             –0.463    –0.513    –0.351            1.349     1.298     1.298
41+             –0.616    –0.749    –0.585            1.459     1.448     1.417
51+             0.068     –0.139    0.062             1.033     1.032     1.018
Ever            0.102     0.094     0.084             1.065     1.076     1.085

Note. The subgroup means are reported as differences from the total mean.


Appendix E

Using AIR’s AM Software for Secondary Analyses

Studies I and II have shown that AM’s procedure for CJ-DE, a direct estimation approach

relying on a marginal normality assumption, does not seem suitable for data where the normality

of the latent trait across subgroups cannot be warranted. The CJ-DE approach has been

developed “to consistently estimate subpopulation distributions when the groups are defined by

values of a [nominal or ordinal variable]” (Cohen & Jiang, 1999). The two procedures

implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table” (NT) in the

AM software, depending on the grouping variable’s scale level. In contrast to the findings

concerning CJ-DE, AM’s MML regression procedure reproduced the results of analyzing the

true values—which served as the basis for the simulated data—quite well, in much the same way

the ETS-DE approach does. In the simulated data examples with known true regression

coefficients, ETS’s method and AM’s MML regression agreed closely when estimating

regression parameters for the full conditioning model.

AM’s MML regression module cannot be used “as is” for reporting purposes, because

additional steps are necessary in order to produce subgroup statistics based on the regression

results. The goal of this appendix is to explore ways to use MML regression and other modules

of AM and to provide a guideline on how to put together analysis steps that can be used to get

results with the AM software that resemble more closely the true values and the ETS-DE

conditioning model estimates.

AM was used in examples presented below in a multistep procedure for producing

subgroup statistics without using AM’s CJ-DE modules. This step-by-step procedure lacks the

convenience of the operational ETS-DE approach in that it requires manual concatenation of

separate intermediate results produced by AM’s procedures. Therefore, the goal of the study

presented here is not to provide an alternative to ETS-DE, but to test whether AM can be used

for secondary analyses.

The approaches taken by the ETS-DE conditioning model on the one hand and AM’s

direct estimation as well as its CJ-DE module on the other hand differ strongly with

respect to the information incorporated in estimating subgroup characteristics. ETS-DE uses

extensive background (conditioning) information, including grouping variables in addition to the

observed item responses. CJ-DE, in contrast, only includes one grouping variable at a time together with the item responses but draws on a number of strong assumptions regarding the shape of the marginal ability distribution and the relation between $\theta$ and the group indicator variable.

Issues in Model Selection

Assumptions about the population structure are central in the process of building a model

for complex survey data. The question is which kinds of assumptions are appropriate for

the comparison of multiple subgroups with respect to their means and variances.

Figure E1. Subgroup distributions with normality assumption on the marginal level.

In the case depicted in Figure E1, the overall distribution is assumed to be normal, and

the sum of all subgroup distributions has to accommodate this shape. It follows that the shapes of

the subgroup distributions are no longer free; they have to fit under the overall normal shape and

their sum has to be equal to that shape. This assumption is central to CJ-DE and makes it


inappropriate for more complex real data. A less restrictive assumption is that all subgroups are

normally distributed and share the same variance but may vary with respect to their means and

size. This assumption can be modeled by a regression with contrast coded subgroup indicators.

This can be done in many software packages as well as in ETS-DE and AM. This drops the

assumption of marginal normality and with it the main feature of CJ-DE as proposed by Cohen

and Jiang (1999). The effect of relaxing the marginal normality assumption is illustrated in

Figure E2.
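A regression with contrast-coded subgroup indicators can be sketched as follows. This is a simplified illustration under stated assumptions: it uses ordinary least squares on directly observed, simulated scores as a stand-in for the latent regression estimated by MML, and the group sizes, means, seed, and function name `effect_code` are hypothetical choices made for this example.

```python
import numpy as np


def effect_code(groups, n_groups):
    """Design matrix with intercept and effect (deviation) coded group indicators.

    The last group is coded -1 on every indicator column.
    """
    indicators = np.zeros((len(groups), n_groups - 1))
    for k in range(n_groups - 1):
        indicators[groups == k, k] = 1.0
    indicators[groups == n_groups - 1, :] = -1.0
    return np.column_stack([np.ones(len(groups)), indicators])


rng = np.random.default_rng(0)
true_means = np.array([-1.0, 0.0, 0.8])       # illustrative subgroup means
sizes = [300, 500, 200]                       # illustrative subgroup sizes
groups = np.repeat(np.arange(3), sizes)
scores = rng.normal(true_means[groups], 1.0)  # common within-group SD of 1

X = effect_code(groups, 3)
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)

# Fitted subgroup means: intercept plus the group's contrast. The contrast of
# the last group is minus the sum of the other contrasts.
contrasts = np.append(beta[1:], -beta[1:].sum())
fitted_means = beta[0] + contrasts

# One residual variance shared by all subgroups (homoscedasticity assumption).
residual_var = np.sum((scores - X @ beta) ** 2) / (len(scores) - X.shape[1])

print("fitted subgroup means:", np.round(fitted_means, 3))
print("common residual variance:", round(float(residual_var), 3))
```

The model places no constraint on the shape of the overall distribution: the subgroup means and sizes are free, and only a single within-group variance is shared.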

Figure E2. Subgroup distributions with normality assumption in all subgroup levels.

This less restrictive assumption obviously allows a larger range of cases to be fitted as

compared to CJ-DE. This approach can be taken by using AM’s MML means procedure, even

though that procedure will not yield subgroup variance estimates. If only a few subgroups are

used, the homoscedasticity assumption within subgroups limits the ability to fit more general

marginal distributions. A useful extension would be to assume a separate variance for each

subgroup. In AM, this assumption can be accommodated by using MML regression together

with filtering the data as many times as there are subgroups. But even this limits the subgroup

distributions to be normal, which still seems too restrictive if, for example, there is a

strong indication that some subgroups are composites.
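The idea of filtering the data once per subgroup to obtain a separate variance for each subgroup can be sketched as below. This is an illustration on directly observed, simulated scores and not AM's MML regression; the function name `subgroup_moments`, the group sizes, and the generating means and standard deviations are hypothetical.

```python
import numpy as np


def subgroup_moments(scores, groups):
    """Filter the data once per subgroup and estimate a separate mean and variance."""
    moments = {}
    for k in np.unique(groups):
        subset = scores[groups == k]  # the filtering step
        moments[int(k)] = (float(subset.mean()), float(subset.var(ddof=1)))
    return moments


rng = np.random.default_rng(1)
groups = np.repeat(np.arange(3), [400, 400, 400])
true_mean = np.array([-1.0, 0.0, 1.0])  # illustrative values
true_sd = np.array([0.6, 1.0, 1.5])     # heteroscedastic subgroups
scores = rng.normal(true_mean[groups], true_sd[groups])

moments = subgroup_moments(scores, groups)
for k, (mean, var) in moments.items():
    print(f"subgroup {k}: mean = {mean:.3f}, variance = {var:.3f}")
```

Each subgroup receives its own mean and variance, but every subgroup is still summarized by two moments, which matches the limitation noted above for composite subgroups.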

This is one of the reasons why
