RESEARCH REPORT January 2003 RR-03-02
Comparing Conditional and Marginal Direct Estimation of Subgroup Distributions
Research & Development Division Princeton, NJ 08541
Matthias von Davier
Educational Testing Service, Princeton, NJ
January 2003
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:
Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541
Abstract
Many large-scale assessment programs in education utilize “conditioning models” that
incorporate both cognitive item responses and additional respondent background variables
relevant for the population of interest. The set of respondent background variables serves as a
predictor for the latent traits (proficiencies/abilities) and is used to obtain a conditional prior
distribution for these traits. This is done by estimating a linear regression, assuming normality of
the conditional trait distributions given the set of background variables. Multiple imputations, or
plausible values, of the trait parameters are then drawn on top of the conditioning model as a
computationally convenient approach to generating consistent
estimates of the trait distribution characteristics for subgroups in complex assessments. This
report compares, on the basis of simulated and real data, the conditioning method with a recently
proposed method of estimating subgroup distribution statistics that assumes marginal normality.
Study I presents simulated data examples where the marginal normality assumption leads to a
model that produces appropriate estimates only if subgroup differences are small. In the presence
of larger subgroup differences that cannot be fitted by the marginal normality assumption,
however, the proposed method produces subgroup mean and variance estimates that differ
strongly from the true values. Study II extends the findings on the marginal normality estimates
to real data from large-scale assessment programs such as the National Assessment of
Educational Progress (NAEP) and the National Adult Literacy Survey (NALS). The research
presented in Study II shows differences between the two methods that are similar to the
differences found in Study I. The consequences of relying upon the assumption of marginal
normality in direct estimation are discussed.
Key words: conditioning models, large-scale assessments, NAEP, NALS, direct estimation
Acknowledgements
I would like to thank John Mazzeo for valuable comments on previous versions of this
document, which improved both content and presentation. Any remaining errors are mine.
Introduction
Large-scale assessments such as the National Assessment of Educational Progress
(NAEP) estimate the distribution of academic achievement for policy relevant subgroups.
Examples of estimates provided by large-scale assessment are means and percentages above cut
points for the subgroups of interest. Many large-scale assessments such as NAEP use a sparse
matrix sample design in which the number of cognitive items per respondent is kept relatively
small. Using such designs allows the assessment to provide a broad coverage of the content
domain while keeping the subjects’ testing time brief. This implies that individual ability
estimates based on these kinds of assessments would have a large measurement error component,
which has to be taken into account when reporting aggregate statistics for subgroups. Direct
estimation procedures, by which these estimates are obtained without the generation of
individual scores, have been the approach most commonly taken to address this analysis
challenge. Typically, these procedures have made use of background variables along with the
cognitive item responses to ensure a higher degree of accuracy in estimating subgroup
characteristics compared to only using the cognitive responses. Moreover, matrix sampling
makes it impossible to compare subjects—or groups of subjects—based on their observed item
responses. Therefore, large-scale assessments using matrix sampling rely on item response
theory (IRT) models (Lord & Novick, 1968; Rasch, 1960).
To estimate the subgroup statistics of interest, ETS has employed since 1984 a particular
approach of integrating achievement data (item responses) and background information, such as
subgroup membership and additional student variables, into a hierarchical IRT model. This
approach may be referred to as “direct estimation” because ETS estimates group statistics
without the use of individual test scores. For the purposes of this report, I refer to this approach
as ETS-DE. The core features of the ETS-DE approach include:
1. A population model that assumes proficiencies are normally distributed conditional on a
large number of background variables (grouping variables and other covariates). As a
consequence, the marginal distribution (overall and for major reporting subgroups) is a
mixture of normals.
2. The generation of a posterior latent trait distribution of proficiency for each individual in the
sample, which is based on an estimate of (1); a separately estimated set of IRT parameters
that are treated as fixed and known; the cognitive item responses, the respondents’ group
membership; and other covariates. The mixture of these individual posterior distributions
provides the estimate of the actual subgroup distributions.
3. The integration over posterior distributions of examinees and some of the model parameters
(the parameters of the population model defined later) in (1) to obtain estimates of means,
percentages above achievement levels, etc.
4. The use of normal approximations for the individual posteriors and a multiple-imputation
approach (the so-called plausible values) to approximate the integration in (3). Imputations
are used in conjunction with conditioning models based on both cognitive item responses and
background information. The imputations are used as a mere convenience in order to
simplify the integration in (3) and to provide data that can be used with standard tools by
secondary analysts.
Cohen and Jiang (1999) propose an alternative approach to direct estimation (which I
refer to as CJ-DE in this report) of subpopulation characteristics that does not utilize additional
background variables. Cohen and Jiang assume that CJ-DE provides consistent subgroup
estimates without the use of background variables. The core features of CJ-DE include:
1. A population model that assumes marginal normality, i.e., the ability distributions of all
subgroups align in such a way that the joint distribution is normal.
2. A measurement model for the categorical grouping variables that assumes an underlying
continuous latent variable whose joint distribution with proficiency is normal.
3. Use of a set of fixed/known IRT model parameters.
4. Item responses that are used together with a single grouping variable only—the one used for
reporting—i.e., no additional covariates like other reporting variables or their interactions are
used in the population model.
5. A direct calculational approach that bypasses the generation of individual posterior
distributions and the generation of plausible values.
Both approaches, ETS-DE and CJ-DE, may be referred to as “direct estimation” because they
estimate group statistics without the use of individual test scores. ETS-DE uses a more general
model, which includes grouping variables as well as additional background information and no
specific assumption regarding the marginal proficiency distribution. CJ-DE includes the
assumption of marginal normality and ignores all the additional background information other
than a single grouping variable. This report presents a comparison of ETS-DE and CJ-DE using
simulated and real data.
The ETS-DE Methodology
For obtaining estimates of subpopulation distributions, ETS-DE involves a two-phase
procedure that uses achievement data (item responses) and respondents’ background
information. Key references for a more detailed outline of the conditioning model used by the
ETS-DE method are Mislevy (1991), Mislevy, Beaton, Kaplan, and Sheehan (1992) and Thomas
(1993, 2002). The two phases of the method, which sometimes are confused when discussed in
secondary literature, are:
1. Estimation of parameters for the conditioning, or population, model.
2. Production of plausible values from individual posterior distributions given the model
parameters, item responses, and background data.
The Conditioning Model
The method used for analyzing large-scale assessments at ETS uses both item responses
and background information, sometimes numbering up to one hundred conditioning variables.
Assume that there are k scales in the assessment and that each proficiency scale follows a
unidimensional IRT model¹ with the usual assumption of conditional independence given θ_k, i.e.,
$$P(x_{11}, \ldots, x_{J(K)K} \mid \theta) = \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k) \tag{1}$$
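As a minimal illustration of Equation (1), the following sketch evaluates the probability of a response pattern as a product over scales and items. It assumes a two-parameter logistic (2PL) item response function and made-up item parameters; the function names and all numeric values are hypothetical and are not taken from the report.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function (an assumed model, chosen for illustration)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pattern_likelihood(responses, thetas, items):
    """P(x | theta) as a product over scales k and items j, as in Equation (1).

    responses[k][j] is 0 or 1, thetas[k] is the proficiency on scale k,
    and items[k][j] is a hypothetical (a, b) pair of item parameters.
    """
    like = 1.0
    for k, theta_k in enumerate(thetas):
        for x, (a, b) in zip(responses[k], items[k]):
            p = p_correct(theta_k, a, b)
            like *= p if x == 1 else 1.0 - p
    return like

# Two scales with two items each (illustrative values only)
items = [[(1.0, -0.5), (1.2, 0.0)], [(0.8, 0.3), (1.5, 1.0)]]
responses = [[1, 0], [1, 1]]
print(pattern_likelihood(responses, [0.2, -0.1], items))
```

Because the items are conditionally independent given the proficiencies, the probabilities of all possible response patterns sum to one for any fixed proficiency vector.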
The conditioning model combines the k-scale IRT model with a k-dimensional
multivariate latent regression model in order to maximize the likelihood based on the posterior
distribution of the latent trait θ = (θ_1, ..., θ_k):
$$L(\theta \mid x, y) \sim f(\theta \mid x, y) \propto \prod_{k=1}^{K} \prod_{j=1}^{J(k)} P(x_{jk} \mid \theta_k)\, \phi(\theta \mid y) \tag{2}$$
where the prior φ(θ | y) is assumed to be normal with θ | y ~ N(Γ'y, Σ). The latent trait θ is
unobserved and must be inferred from the observed item responses. The predictor y is a vector of
individual values on a set of conditioning variables, Γ is a matrix of regression weights, and Σ is
the residual variance-covariance matrix. Note that at ETS, three software programs are currently
available to carry out the estimation: NGROUP, BGROUP, and CGROUP. All implementations
are based on the EM (expectation-maximization) algorithm. In the E-step, the posterior
distribution of θ given item responses and conditional on the background variables is computed
for each individual. These estimates are then used in the M-step to obtain the regression weights
Γ and the residual covariance matrix Σ. The approaches implemented in NGROUP, BGROUP,
and CGROUP differ with respect to how each carries out the E-step:
1. NGROUP assumes that the item likelihood ∏_k ∏_{j=1..J(k)} P(x_jk | θ_k) can be approximated by
a multivariate normal distribution and has limited use. (It may be used only for generating
starting values for CGROUP or with extremely long scales.)
2. BGROUP does not assume any specific form of the item likelihood and uses a numerical
quadrature in the E-step. To date, BGROUP has been shown to not be computationally
feasible in more than two dimensions.
3. CGROUP is designed to be computationally feasible for more than two dimensions (it uses a
Laplace approximation in the E-step). CGROUP is used most frequently in NAEP since most
subject areas have multiple scales and require reporting on a composite.
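The E-step/M-step cycle described above can be sketched for the simplest case of one scale and one background variable. The sketch assumes Rasch items with known difficulties and a fixed quadrature grid, in the spirit of the numerical quadrature attributed to BGROUP; it is not the NGROUP, BGROUP, or CGROUP implementation, and all function names and numeric values are hypothetical.

```python
import math
import random

def rasch_p(theta, b):
    """Rasch item response function (assumed here for simplicity)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_likelihoods(x, betas, grid):
    """Likelihood of each response pattern at each quadrature point.

    Item parameters are treated as fixed and known, so this is computed once.
    """
    table = []
    for pattern in x:
        row = []
        for t in grid:
            like = 1.0
            for xij, b in zip(pattern, betas):
                p = rasch_p(t, b)
                like *= p if xij else 1.0 - p
            row.append(like)
        table.append(row)
    return table

def e_step(like, y, gamma, sigma2, grid):
    """Posterior mean and variance of theta for every respondent."""
    means, variances = [], []
    for row, yi in zip(like, y):
        mu = gamma[0] + gamma[1] * yi  # conditional prior mean given y
        w = [lk * math.exp(-0.5 * (t - mu) ** 2 / sigma2) for lk, t in zip(row, grid)]
        s = sum(w)
        m = sum(wq * t for wq, t in zip(w, grid)) / s
        v = sum(wq * (t - m) ** 2 for wq, t in zip(w, grid)) / s
        means.append(m)
        variances.append(v)
    return means, variances

def m_step(y, means, variances):
    """Update regression weights and residual variance from posterior moments."""
    n = len(y)
    ybar, mbar = sum(y) / n, sum(means) / n
    sxx = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((yi - ybar) * (mi - mbar) for yi, mi in zip(y, means))
    g1 = sxy / sxx
    g0 = mbar - g1 * ybar
    sigma2 = sum((mi - g0 - g1 * yi) ** 2 + vi
                 for yi, mi, vi in zip(y, means, variances)) / n
    return (g0, g1), sigma2

def fit(x, y, betas, n_iter=40):
    """Alternate E- and M-steps for a fixed number of iterations."""
    grid = [-5.0 + 0.25 * q for q in range(41)]
    like = item_likelihoods(x, betas, grid)
    gamma, sigma2 = (0.0, 0.0), 1.0
    for _ in range(n_iter):
        means, variances = e_step(like, y, gamma, sigma2, grid)
        gamma, sigma2 = m_step(y, means, variances)
    return gamma, sigma2

# Simulated check with made-up values: theta = -0.5 + 1.0 * y + N(0, 0.6^2)
random.seed(1)
betas = [-1.0, -0.5, 0.0, 0.0, 0.5, 1.0]
y = [i % 2 for i in range(500)]
theta = [-0.5 + 1.0 * yi + random.gauss(0.0, 0.6) for yi in y]
x = [[1 if random.random() < rasch_p(t, b) else 0 for b in betas] for t in theta]
gamma, sigma2 = fit(x, y, betas)
print(gamma, sigma2)
```

The residual variance update adds each respondent's posterior variance to the squared residual of the posterior mean, which is what keeps the estimate from shrinking toward zero when the test is short.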
In NAEP and other large-scale assessments analyzed at ETS, the estimation of the
conditioning model for multivariate latent traits is carried out with BGROUP and CGROUP.
This report uses CGROUP as the basis for evaluating the differences in direct estimation
between the conditional normality approach (ETS-DE, as implemented in CGROUP) and the
marginal normality approach (CJ-DE, as implemented in the AM software, see below) since
CGROUP has been the program most frequently used for NAEP analysis purposes.
Plausible Values
The second phase of the ETS-DE involves the production of plausible values, which
provide a computationally tractable approach of integrating the posterior distributions of
respondents to estimate the target statistics in subgroups of interest. Using plausible values
provides a means for estimating the error in the estimates due to the proficiencies being latent
(i.e., only indirectly observed) and the uncertainty about the regression parameters in the
population model. In addition, plausible values provide a set of quantities that researchers can
use with commercial statistical software to conduct a wide variety of secondary analyses.
The BGROUP, CGROUP, and NGROUP set of programs generate multiple imputations
for each respondent based on the estimates of Γ and Σ and on the respondents’ background data y
and the item responses x. These plausible values are drawn from the k-dimensional posterior
N(E(θ|y,x), Σ(θ|y,x)). In other words, the approach assumes that θ given y and x is
approximately normally distributed. This conditional normality is a less restrictive assumption
compared to the marginal normality assumption, on which CJ-DE relies (Cohen & Jiang, 1999).
The marginal distribution in the ETS-DE conditioning model is therefore rather flexible and is not
limited to the normal distribution; it is a mixture of the conditional posterior
distributions for the given set of item responses and background variables.
In order to carry the variability due to measurement and parameter estimation errors
through all subsequent analyses, a number of plausible values has to be drawn for each
respondent. As a rule of thumb, five to ten plausible values are drawn in most large-scale
assessment analyses. These plausible values are aggregated to provide consistent estimates of
group means, variances, and percentages above cut points for the subgroups defined by the
reporting variables. Plausible values drawn from a population model that uses item responses and
a large amount of background information are a valuable source for studying relationships
between the proficiency scales and secondary variables.
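The aggregation of plausible values into a point estimate and a standard error can be sketched with the usual multiple-imputation combining rules (Rubin): the estimate is the average over the M sets of plausible values, and the total variance is the average sampling variance plus (1 + 1/M) times the between-imputation variance. The input numbers below are hypothetical and serve only to show the arithmetic.

```python
import statistics

def combine_plausible_values(estimates, sampling_variances):
    """Combine M per-imputation estimates into one estimate and standard error."""
    m = len(estimates)
    point = statistics.fmean(estimates)
    within = statistics.fmean(sampling_variances)
    between = statistics.variance(estimates)  # sample variance, divisor M - 1
    total = within + (1.0 + 1.0 / m) * between
    return point, total ** 0.5

# Hypothetical subgroup means from five sets of plausible values,
# each with its own sampling variance
means = [0.90, 0.93, 0.88, 0.91, 0.92]
sampling_vars = [0.0012, 0.0011, 0.0013, 0.0012, 0.0012]
est, se = combine_plausible_values(means, sampling_vars)
print(round(est, 3), round(se, 4))
```

The between-imputation term carries the uncertainty that arises because the proficiencies are latent, which a single imputation per respondent would leave out of the standard error.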
The CJ-DE Methodology
Marginal normality based direct estimation, or CJ-DE (Cohen & Jiang, 1999), is a
recently proposed method of estimating subgroup statistics based on a number of assumptions
regarding a) the marginal distribution of the latent trait and b) its relation to a set of group
indicator variables. The following studies use simulated and real data to compare the results from
the ETS-DE and CJ-DE methods. The study of real data offers a determination as to whether
CJ-DE yields estimates consistent with the results of more general models.
The software package AM (Cohen, 1998) implements the CJ-DE approach and is
available for the Windows operating system. The software provides modules for CJ-DE and
additional modules for univariate and composite regressions of the latent trait on a number of
predictors, which is referred to as marginal maximum likelihood (MML) regression in the AM
package. While the focus of this study is to compare CJ-DE with the ETS-DE conditioning
approach, AM's MML regression was used to make sure that both software programs—AM and
CGROUP—agree on the data structure. AM provides two procedures for CJ-DE that were
developed “…to consistently estimate subpopulation distributions when the groups are defined
by values of a [nominal or ordinal variable]” (Cohen, 1998; Cohen & Jiang, 1999). The AM
modules implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table”
(NT) in the software, depending on the scale level of the grouping variable. Both the OT and the
NT modules assume that the latent trait θ is marginally normally distributed (Cohen, 1998;
Cohen & Jiang, 1999), so that the estimates of a finite mixture of subgroup distributions have to
fit this assumption.
In contrast to this assumption, the conditional normality estimation—ETS-DE, which is
used in NAEP's conditioning model and other large-scale assessment programs—does not rely
on assuming a certain form of the trait parameters’ marginal distribution. The marginal
distribution in the conditioning model is a mixture of normals. In addition, NAEP uses a
multinomial distribution to approximate the marginal distribution of θ for item calibration
(Yamamoto & Mazzeo, 1992), so that the item parameters used in the conditioning model are not
based on a certain form of the marginal trait distribution.
Central Assumptions Driving CJ-DE
Cohen and Jiang (1999) propose to use the following approach in order to estimate
subgroup statistics:
a) Assume a latent trait θ ~ N(μ, σ). θ is usually unobserved and has to be inferred from the
subjects’ responses to a number of items (x_1, ..., x_k).
b) Assume that there are m groups, where the group membership g_i indicates the maximum
outcome on a number of m unobserved variables, y_1, ..., y_m. That means the group membership
of individual i equals k (g_i = k) if, for the unobserved variables, y_ki > y_li for all l ≠ k.
c) Assume that for k = 1, ..., m, a linear relationship exists between θ and y_k (i.e., y_k = a_k + b_k θ + e_k)
with mutually independent e_k. The error term e_k, that is, the deviation of y_k from a_k + b_k θ given θ, is assumed to be N(0,1).
d) Assume that conditional on θ, the y_i are mutually independent, i.e.,
$$P(y_i \in A, y_j \in B \mid \theta) = P(y_i \in A \mid \theta) \cdot P(y_j \in B \mid \theta) \tag{3}$$
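Assumptions (a) through (d) describe a data-generating process that can be simulated directly. The sketch below draws θ from a normal distribution, forms the unobserved y_k with independent N(0,1) errors, and assigns each simulee to the group with the largest y_k. The parameter values for a_k and b_k are hypothetical and chosen only so that the two groups separate.

```python
import random

def simulate_groups(n, mu, sigma, a, b, seed=0):
    """Simulate (theta, group) pairs under assumptions (a) to (d)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        theta = rng.gauss(mu, sigma)                       # assumption (a)
        y = [ak + bk * theta + rng.gauss(0.0, 1.0)         # assumptions (c), (d)
             for ak, bk in zip(a, b)]
        g = max(range(len(y)), key=y.__getitem__)          # assumption (b)
        out.append((theta, g))
    return out

# Two groups with opposite slopes (made-up values)
data = simulate_groups(20000, 0.0, 1.0, a=[0.0, 0.0], b=[-0.8, 0.8])
for k in range(2):
    group_thetas = [t for t, g in data if g == k]
    print(k, len(group_thetas), sum(group_thetas) / len(group_thetas))
```

Whatever values a_k and b_k take, the pooled θ values remain normal by construction, which is the sense in which the subgroup distributions have to align with the marginal normality assumption.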
Assumption (a) forces the ability distribution to be marginally normal. Assumption (c)
also is very strong and “may not be true but is a common and powerful one” (Cohen & Jiang,
1999). Assumptions (b) and (d) are used for defining the conditional density of
$$f(g = k \mid \theta) = \int f(y_k \mid \theta)\, P(y_j < y_k\ \forall j \neq k \mid \theta)\, dy_k = \int f(y_k \mid \theta) \prod_{j \neq k} P(y_j < y_k \mid \theta)\, dy_k \tag{4}$$
This conditional density, together with assumption (c) and the assumption of marginal normality
(a), yields
$$f(g = k, \theta) = \phi(z) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} P(y_j < y_k \mid \theta)\, dy_k \tag{5}$$
where φ denotes the normal density and z = (θ − μ)/σ. One more replacement uses the second
part of assumption (c), namely that the error term e in the linear relation y_j = a_j + b_j θ + e is assumed
to be N(0,1). This yields
$$P(y_j < y_k \mid \theta) = P(a_j + b_j\theta + e - y_k < 0 \mid \theta) = P(e < y_k - a_j - b_j\theta) = \Phi(y_k - a_j - b_j\theta) \tag{6}$$
where Φ denotes the normal distribution function. It follows that
$$f(g = k, \theta) = \phi(z) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j\theta)\, dy_k \tag{7}$$
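Finally, the conditional density of θ given group g=k is obtained by
$$f(\theta \mid g = k) = \frac{\phi(z) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j\theta)\, dy_k}{\int \phi(z) \int \phi(y_k - a_k - b_k\theta) \prod_{j \neq k} \Phi(y_k - a_j - b_j\theta)\, dy_k\, d\theta} \tag{8}$$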
which is used to compute the conditional means and variances given subgroup g=k (see Cohen &
Jiang, 1999). We may now define
$$E(\theta^n \mid g = k) = \int \theta^n f(\theta \mid g = k)\, d\theta \tag{9}$$
in order to obtain the conditional moments of θ. The parameters a_1, b_1, ..., a_m, b_m and μ, σ of
f(θ | g=k) are estimated by maximizing the likelihood function based on the individual likelihood
terms
$$L(a_1, b_1, \ldots, a_m, b_m, \mu, \sigma \mid x, g = k) = \int p(x \mid \theta)\, f(g = k, \theta)\, d\theta \tag{10}$$
for a subject in group g=k with observed responses x = (x_1, ..., x_j), and f(g=k, θ) as defined by
Equation (7). The two approaches taken by ETS-DE and CJ-DE differ strongly with respect to
the information incorporated in estimating subgroup characteristics. ETS-DE uses extensive
background (conditioning) information including grouping variables in addition to the cognitive
item responses. In contrast to that, CJ-DE only includes the grouping variable together with the
item responses but draws on a number of strong assumptions regarding the shape of the marginal
ability distribution and the relation between θ and the grouping variable. The following section
presents examples of the differences found between both approaches with respect to recovering
known subgroup characteristics of simulated data.
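The quantities in Equations (7) through (9) can be evaluated numerically. The following sketch uses a simple trapezoidal rule and hypothetical two-group parameters; it is not the algorithm implemented in AM. The factor φ(z) is used exactly as written in Equation (7), since any constant multiplier cancels in the ratio of Equation (8).

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def trapezoid(f, lo, hi, n):
    """Composite trapezoidal rule with n equally spaced points."""
    h = (hi - lo) / (n - 1)
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n - 1):
        total += f(lo + i * h)
    return total * h

def p_group_given_theta(k, theta, a, b):
    """Inner integral of Equation (7), i.e. P(g = k | theta)."""
    center = a[k] + b[k] * theta

    def integrand(yk):
        val = phi(yk - center)
        for j in range(len(a)):
            if j != k:
                val *= Phi(yk - a[j] - b[j] * theta)
        return val

    return trapezoid(integrand, center - 8.0, center + 8.0, 401)

def joint_density(k, theta, a, b, mu, sigma):
    """f(g = k, theta) as in Equation (7), with z = (theta - mu) / sigma."""
    return phi((theta - mu) / sigma) * p_group_given_theta(k, theta, a, b)

def conditional_moment(n, k, a, b, mu, sigma):
    """E(theta^n | g = k) from Equations (8) and (9)."""
    lo, hi = mu - 6.0 * sigma, mu + 6.0 * sigma
    num = trapezoid(lambda t: t ** n * joint_density(k, t, a, b, mu, sigma), lo, hi, 241)
    den = trapezoid(lambda t: joint_density(k, t, a, b, mu, sigma), lo, hi, 241)
    return num / den

a, b = [0.0, 0.0], [-0.8, 0.8]  # hypothetical two-group parameters
total = sum(p_group_given_theta(k, 0.3, a, b) for k in range(2))
mean_1 = conditional_moment(1, 1, a, b, 0.0, 1.0)
print(total, mean_1)
```

For any fixed θ the group probabilities sum to one, so the pooled density is again proportional to φ(z). Subgroup means and variances under this model are therefore constrained by the normal marginal rather than estimated freely.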
Study I: Simulation Results
The examples presented in this section compare ETS-DE and CJ-DE based on simulated
data where each simulee responds to a limited set of test items and is additionally characterized
by a small set of background variables. The simulated data sets resemble some characteristics of
NAEP, such as the number of items per subscale. Short subscales in NAEP typically consist of
an average of 6 items across booklets; long subscales consist of approximately 12 items. The
number of subscales or dimensionality of the latent trait, k=3 in the simulations, also is found in
NAEP. The number of background variables in the simulation is smaller than what is typically
used in NAEP’s conditioning approach. While NAEP’s conditioning model may include up to
hundreds of background variables, the simulated data used in the present study limits the number
of background variables to the three made-up variables, GROUP, SES, and GENDER. Four
distinct data sets were simulated following a 2 x 2 design, varying:
1. The number of items per subscale (6 versus 12 items).
2. The dependency of the latent traits on the background variables: Setup (1) had a strong
dependency leading to multimodal marginal trait distributions, while Setup (2) had a weak
dependency resulting in unimodal, but possibly platykurtic marginals.
Using two different linear models created the two levels of dependency of the latent traits on
the background variables. Two different sets of regression weights were used to generate the
three-dimensional trait parameters (θ_1, θ_2, θ_3). Each latent trait value θ_i for i in 1, 2, 3 was
generated based on a linear model
$$\theta_i = \beta_1 y_{\mathrm{GENDER},i} + \beta_2 y_{\mathrm{SES},i} + \beta_3 y_{\mathrm{GROUP},i} + e_i \tag{11}$$
incorporating fictitious GENDER, SES, and GROUP effects together with normally distributed
residuals ei. GENDER, SES, and GROUP accounted for a varying percentage of variance for the
three trait components (see regression results below). The trait variable (θ_1, θ_2, θ_3) and its
component-wise linear relation to GENDER, SES, and GROUP were unaffected by additional
fictitious design variables WEIGHT, STRATA, and CLUSTER. The latter variables have been
included to check whether zero correlations are recovered in the same way by the regression
modules of CJ-DE and ETS-DE.
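A generating model of the form of Equation (11) can be sketched as follows. The codings of GENDER, SES, and GROUP, the regression weights, and the residual standard deviations are assumptions made for illustration; the report does not give the generating values at this point. The two calls mimic a strong and a weak dependency of the trait on the background variables.

```python
import random
import statistics

def simulate_theta(n, w_gender, w_ses, w_group, resid_sd, seed=0):
    """Draw theta from a linear model in GENDER, SES, and GROUP plus normal residuals."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(n):
        gender = rng.randint(0, 1)   # hypothetical 0/1 coding
        ses = rng.randint(1, 3)      # hypothetical three-level coding
        group = rng.randint(1, 5)    # five categories
        mean = w_gender * gender + w_ses * ses + w_group * group
        thetas.append(mean + rng.gauss(0.0, resid_sd))
    return thetas

def share_explained(thetas, resid_sd):
    """Proportion of theta variance attributable to the background variables."""
    return 1.0 - resid_sd ** 2 / statistics.variance(thetas)

# Strong versus weak dependency (made-up weights)
strong = simulate_theta(5000, 1.7, 0.3, 0.2, 0.45)
weak = simulate_theta(5000, 0.4, 0.1, 0.05, 0.9)
r2_strong = share_explained(strong, 0.45)
r2_weak = share_explained(weak, 0.9)
print(round(r2_strong, 2), round(r2_weak, 2))
```

When most of the trait variance is explained by a two-category variable, the pooled distribution is a mixture of two well-separated normals and cannot be normal itself.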
Setup 1, which includes one bimodal and one multimodal marginal, was included to
examine how CJ-DE performs in situations where its marginal distribution assumptions are
clearly violated. Setup 2 represents a more typical situation in which the marginal distributions
are unimodal but more platykurtic than the normal (see Figure 1). Data were generated for the
six-item test for both Setups 1 and 2; the item parameters used to generate the data are given in
Appendixes A and B. However, only the six-item test is presented for Setup 2, since the pattern of
results obtained for the two test lengths was similar in Setup 1.
Figure 1. Histograms of marginal distributions for Setups 1 (left) and 2 (right).
Figure 1 shows histograms with integrated density plots for Setup 1 (left column) and
Setup 2 (right column), crossed by the three (from top to bottom row) simulated latent traits.
Setup 1 on the left results in a clearly bimodal marginal for Dimension 1, whereas in Setup 2, the
marginals are platykurtic or skewed, but not obviously multimodal.
In Setup 1, the proportion of variance of θ accounted for by the fictitious GROUP and
GENDER produced bimodal (for gender) or multimodal marginal θ distributions. In Setup 2,
the proportion of variance explained by the fictitious conditioning variables GENDER, SES, and
GROUP was reduced, so that the resulting marginal θ distributions are unimodal but platykurtic.
The marginal distribution of θ_1 is a mixture of two subpopulations where the mean difference
between subgroups is due to the fictitious GENDER variable. θ_2 is a mixture of five normals
with common variance but slightly different means due to the five-category variable GROUP.
The third variable, θ_3, can be viewed as the “control dimension” in both setups (i.e., the subgroup
distributions are all identical as there is no effect of the conditioning variables on latent trait θ_3).
Setup 2 can be viewed as a less extreme, non-bimodal version of Setup 1 with higher
intercorrelations between the θ variables. The data generated by both setups were analyzed with
the ETS-DE and CJ-DE approaches to direct estimation. The results of both methods were
compared to the true values obtained from analyzing the actual θ values used for generating the
item responses. Tables 1a and 1b show the marginal correlations obtained from analyzing the
simulees’ generating θ values, both for the 6- and the 12-item data sets.
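The relation between subgroup distributions and the marginal distribution can be made concrete with the moments of a finite normal mixture: the marginal mean is the weighted average of the subgroup means, and the marginal variance is the average within-group variance plus the variance of the group means. The sketch below uses two equally weighted subgroups with illustrative means and standard deviations (not values taken from the report) and checks whether the resulting marginal density has a dip between the subgroup means.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """Density of a finite mixture of normals."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

def mixture_mean_sd(weights, means, sds):
    """Marginal mean and standard deviation of a finite mixture of normals."""
    mean = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (s ** 2 + (m - mean) ** 2) for w, m, s in zip(weights, means, sds))
    return mean, math.sqrt(var)

# Two equally sized subgroups, well separated (illustrative values)
weights, means, sds = [0.5, 0.5], [-0.85, 0.85], [0.55, 0.55]
mean, sd = mixture_mean_sd(weights, means, sds)
dip = mixture_pdf(0.0, weights, means, sds)
peak = mixture_pdf(0.85, weights, means, sds)
print(round(mean, 3), round(sd, 3), dip < peak)
```

With these values the marginal standard deviation is close to 1 while each subgroup has a standard deviation of only 0.55, and the density at the overall mean is lower than at the subgroup means, so the marginal is bimodal.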
Table 1a
Marginal θ Distributions in Setup 1, Correlations Between θ Dimensions
            [,1]        [,2]        [,3]
[1,]   1.0000000   0.3985606   0.1620800
[2,]   0.3985606   1.0000000   0.1832677
[3,]  –0.1620800   0.1832677   1.0000000
Table 1b
Marginal θ Distributions in Setup 2, Correlations Between θ Dimensions
            [,1]        [,2]        [,3]
[1,]   1.0000000   0.6499676   0.5401054
[2,]   0.6499676   1.0000000   0.7718106
[3,]   0.5401054   0.7718106   1.0000000
The following sections present results based on the generating true θ values on the one
hand and the two approaches to direct estimation of subgroup statistics on the other. To clarify
that the expected differences between CJ-DE and ETS-DE are the result of differences in model
assumptions, the agreement of both software packages on the correlational structure of the
simulated data was assessed. To check this, the recovery of regression weights and the residual
variance covariance matrix of both AM (the software used for CJ-DE) and CGROUP (the
software for ETS-DE) was analyzed.
Regression Module Comparison
The regression module comparison is a check of agreement between both programs using
the same data. The regression of the three-dimensional latent trait θ on the variables INTER
(explicit intercept), GENDER, SES, GROUP, STRATA, and CLUSTER was compared. The
results in Table 2 are obtained by analyzing the generating θ vectors (the TRUE columns in the
tables below) with standard regression procedures. The entries in the ETS-DE and MML
columns stem from analyzing the item response data with the conditioning model incorporated in
ETS-DE and with AM’s MML regression module. The MML regression module, however, is
different from the direct estimation proposed by Cohen and Jiang (1999). The MML regression
module closely resembles the regression part of the ETS-DE approach in the one-dimensional
case and consequently should yield similar results when used with the same set of background
variables. MML regression does not include the marginal normality assumptions used by CJ-DE.
Table 2 shows the estimates of the linear model for the three-dimensional θ variable. The
estimates show that the GENDER variable has the largest effect on θ_1, whereas the GROUP
variable has the highest impact on θ_2, and the effects for θ_3 are close to zero for all methods, as
expected.
Table 2
Regression Coefficients for the Six-item Simulated Data Set, Setup 1
              Scale 1                 Scale 2                 Scale 3
Effect        TRUE  ETS-DE     MML    TRUE  ETS-DE     MML    TRUE  ETS-DE     MML
Constant    -3.460  -3.530  -3.480  -2.970  -2.590  -2.570   0.040   0.070   0.070
CLUSTER     -0.030  -0.050  -0.050   0.000  -0.060  -0.060  -0.040  -0.050  -0.050
STRATA      -0.010  -0.010  -0.010   0.000   0.000   0.000   0.010  -0.010  -0.010
GENDER       1.680   1.760   1.740   0.430   0.340   0.340   0.000   0.020   0.030
GROUP        0.190   0.220   0.220   0.600   0.640   0.640   0.060   0.050   0.050
SES          0.280   0.280   0.280   0.270   0.210   0.210  -0.020   0.020   0.020
MML regression and the regression that is part of ETS-DE agree closely on the estimates
for this setup. Both ETS-DE and MML regression produce estimates close to those in the TRUE
columns, even though the number of items per scale (six) is comparatively small (i.e., the inference
on θ used by ETS-DE and MML regression is subject to a rather large measurement error).
Table 3 shows the respective results based on the 12-item data set.
Table 3
Regression Coefficients for the 12-item Simulated Data Set, Setup 1
              Scale 1                 Scale 2                 Scale 3
Effect        TRUE  ETS-DE     MML    TRUE  ETS-DE     MML    TRUE  ETS-DE     MML
Constant    -3.560  -3.540  -3.533  -2.940  -2.890  -2.883  -0.070  -0.030  -0.038
CLUSTER      0.000   0.000   0.004  -0.010  -0.020  -0.016   0.040   0.040   0.043
STRATA       0.000  -0.010  -0.008   0.000   0.000  -0.003  -0.010   0.000   0.000
GENDER       1.700   1.720   1.718   0.470   0.510   0.505  -0.060  -0.080  -0.085
GROUP        0.140   0.140   0.136   0.610   0.620   0.617  -0.050  -0.050  -0.047
SES          0.300   0.290   0.290   0.220   0.180   0.183   0.080   0.040   0.045
MML regression and ETS-DE recover the regression weights more closely if the number
of items is doubled. Note that both methods also agree with the TRUE columns for Scale 3,
where there is no impact on the latent variable, and as expected, all three columns show values
close to zero. Table 4 shows the residual correlations and variances as they were obtained using
the true θ values from the simulations as well as the corresponding values produced by the
ETS-DE regression and MML regression algorithms.
Table 4
Residual Correlations With Variances in the Diagonal, Six-item Simulated Data Set
          TRUE                    ETS-DE                  MML regression
Scale        1       2       3       1       2       3       1       2       3
1        0.188  –0.025   0.203   0.199  –0.033   0.246   0.188  –0.025   0.249
2                0.194  –0.214           0.214  –0.289           0.209  –0.293
3                        0.996                   1.127                   1.155
Table 5 shows the results for the 12-item data set. ETS-DE and MML regression
reproduce the residual correlations and variances in a very similar way, both for the 6-item and
the 12-item data set in Setup 1.
Table 5
Residual Correlations With Variances in the Diagonal, 12-item Simulated Data Set
          TRUE                    ETS-DE                  MML regression
Scale        1       2       3       1       2       3       1       2       3
1        0.176  –0.036   0.186   0.183  –0.125   0.227   0.183  –0.133   0.231
2                0.196  –0.218           0.167  –0.143           0.167  –0.144
3                        0.991                   0.960                   0.971
Subgroup Distribution Recovery
ETS-DE and CJ-DE implement two very different approaches to direct estimation. While
ETS-DE assumes that the latent trait θ is conditionally normal given a vector of background
data, CJ-DE assumes that the marginal latent distribution is normal, regardless of potentially
large subgroup differences in complex samples. These two approaches are compared in this
section with respect to the recovery of subgroup distributions. This analysis uses the example
data previously introduced as Setup 1 (6 and 12 items) and Setup 2 (6 items).
As shown in the previous section, the ETS-DE regression and the MML regression as
implemented in the software packages CGROUP and AM agree on these data sets and reproduce
the true regression parameters in a very similar way. In contrast, ETS-DE and CJ-DE incorporate
different assumptions regarding the marginal distribution of the latent traits. Recall that the
marginal distributions for Setup 1 are bimodal for Scale 1 and multimodal for Scale 2, because
the background variable GENDER (two subgroups) explains a major part of the variance for
Scale 1 whereas the background variable GROUP (five subgroups) has a strong impact on Scale
2. It can be expected that the marginal normality assumption of CJ-DE, which is violated for
Scales 1 and 2, will result in differences between subgroup mean estimates of ETS-DE and the
true values on the one hand, and CJ-DE on the other hand.
Table 6a
Subgroup Means and Standard Deviations for the Six-item Data Set, Setup 1²
                 Mean                                      Standard deviation
Scale  Group     TRUE     ETS-DE          CJ-DE            TRUE    ETS-DE  CJ-DE
1      ALL      –0.004    0.027(.039)     -/-              1.001   1.040   -/-
       Female   –0.849   –0.852(.030)    –0.466(.074)      0.548   0.565   0.947
       Male      0.840    0.907(.040)     0.343(.116)      0.527   0.545   0.936
2      ALL      –0.002   –0.032(.047)     -/-              0.997   0.960   -/-
       Female   –0.213   –0.205(.057)    –0.193(.088)      0.991   0.968   0.972
       Male      0.208    0.140(.058)     0.135(.113)      0.957   0.919   0.972
3      ALL       0.014   –0.012(.045)     -/-              1.003   1.065   -/-
       Female    0.000   –0.024(.057)    –0.032(.058)      1.029   1.083   1.076
       Male      0.027    0.000(.064)    –0.003(.059)      0.975   1.045   1.08
Note. The results of CJ-DE direct estimation reported here are ones closest to the true values from one out of four trials with AM’s “slog through” option. Rows with large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
Table 6a shows the TRUE values for the six-item data set in Setup 1 (i.e., the values
obtained by analyzing the generating data) as well as the subgroup means and standard
deviations as estimated by ETS-DE and CJ-DE. In addition, the values in parentheses next to the
subgroup mean estimates show the associated standard errors either computed with Rubin’s
imputation formula in the case of ETS-DE or as given by the Taylor series estimates in the case
of CJ-DE. The Taylor series estimates are given by the CJ-DE direct estimation procedure and
are recommended by Cohen and Jiang (1999) as yielding appropriate estimates for complex
samples. Here, the Taylor series standard error estimates for Scales 1 and 2 are larger than the
imputation-based estimates.
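Rubin's imputation formula combines the statistics computed from each set of plausible values into one point estimate and one standard error. The sketch below shows the standard multiple-imputation combining rule in Python; the five plausible-value means and sampling variances are made-up numbers for illustration, and the function name is hypothetical. It is not taken from the ETS-DE software.

```python
import math


def rubin_combine(estimates, sampling_variances):
    """Combine per-imputation estimates with Rubin's rules.

    estimates: one subgroup statistic per set of plausible values
    sampling_variances: the sampling variance of each of those statistics
    Returns the combined point estimate and its standard error.
    """
    m = len(estimates)
    point = sum(estimates) / m
    within = sum(sampling_variances) / m  # average sampling variance
    between = sum((e - point) ** 2 for e in estimates) / (m - 1)  # imputation variance
    total = within + (1 + 1 / m) * between
    return point, math.sqrt(total)


# Hypothetical inputs: five plausible-value subgroup means and their sampling variances.
pv_means = [0.90, 0.92, 0.89, 0.91, 0.915]
pv_sampling_vars = [0.0012] * 5

point, se = rubin_combine(pv_means, pv_sampling_vars)
print(round(point, 3), round(se, 3))
```

The between-imputation term makes the combined standard error larger than the sampling standard error of any single set of plausible values, which reflects the uncertainty about the latent trait itself.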
Table 6b gives a more condensed overview of the same results. Instead of individual subgroup means, the table gives standardized mean differences
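\[ Z_{\text{ETS-DE}} = \frac{M_{\text{ETS-DE}} - \text{TRUE}}{se(D_{\text{ETS-DE}})} \qquad (12) \]
\[ Z_{\text{CJ-DE}} = \frac{M_{\text{CJ-DE}} - \text{TRUE}}{se(D_{\text{CJ-DE}})} \qquad (13) \]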
as well as the variance ratio of estimated variance divided by true variance. se(D) stands for the
standard error of the difference. Assuming the TRUE values to be fixed target statistics,
se(D) equals the standard error associated with the respective estimate given either by ETS-DE or
CJ-DE. If the difference between the two estimates of a certain subgroup mean is standardized,
se(D) equals the square root of the sum of the squared standard errors of the two statistics. The
standardized mean differences between CJ-DE and TRUE should be ~N(0,1) if the CJ-DE model
holds. The variance ratios given in Table 6b should be close to 1 if the approach recovers the
values in the TRUE column.
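Equations 12 and 13 and the variance ratio can be sketched in a few lines of Python. The function names are hypothetical; the input values are the CJ-DE and TRUE entries for the Female subgroup on Scale 1 in Table 6a. The optional second standard error covers the case in which two estimates, rather than an estimate and a fixed target, are compared.

```python
import math


def standardized_difference(estimate, target, se_estimate, se_target=0.0):
    """(estimate - target) / se(D).

    With a fixed target (se_target = 0), se(D) is the standard error of the
    estimate. For two estimates, se(D) is the square root of the sum of the
    squared standard errors.
    """
    se_d = math.sqrt(se_estimate ** 2 + se_target ** 2)
    return (estimate - target) / se_d


def variance_ratio(sd_estimate, sd_reference):
    """Estimated variance divided by reference variance, from standard deviations."""
    return (sd_estimate / sd_reference) ** 2


# Table 6a, Scale 1, Female: CJ-DE mean and standard error, TRUE mean, and both SDs.
z_cj_female = standardized_difference(-0.466, -0.849, 0.074)
ratio_cj_female = variance_ratio(0.947, 0.548)
print(round(z_cj_female, 3), round(ratio_cj_female, 3))
```

With these inputs the results agree, up to rounding, with the Female Scale 1 entries under CJ-DE in Table 6b.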
Table 6b
Subgroup Standardized Mean Differences and Variance Ratios for the Six-item Data Set, Setup 1
                 Standardized mean difference   Variance ratio
Scale  Group     TRUE    ETS-DE   CJ-DE         TRUE    ETS-DE  CJ-DE
1      Female    0.000   –0.099   5.176         1.000   1.063   2.986
1      Male      0.000   1.693    –4.284        1.000   1.069   3.154
2      Female    0.000   0.140    0.227         1.000   0.954   0.962
2      Male      0.000   –1.178   –0.646        1.000   0.922   1.032
3      Female    0.000   –0.419   –0.552        1.000   1.108   1.093
3      Male      0.000   –0.425   –0.508        1.000   1.149   1.218
Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
The differences from the expected values are as hypothesized; CJ-DE shows large
differences for Scale 1, for which the marginal normality assumption does not hold. The absolute
standardized mean differences are 5.176 for the female subgroup and 4.284 for the male
subgroup. The variance ratios indicate that CJ-DE overestimates the subgroup variances by a
factor of ~3 for Scale 1.
Table 7a gives the mean and standard deviation for the 12-item data set in Setup 1 while
Table 7b gives the standardized mean differences and variance ratio. Table 7b enables a direct
comparison against the values 0 (zero) for the expected mean differences and 1 (one) for the
expected variance ratio if the models behind the approaches fit the data.
Table 7a
Subgroup Mean and Standard Deviation for GENDER for the 12-item Data Set, Setup 1
                 Mean                                  Standard deviation
Scale  Group     TRUE     ETS-DE         CJ-DE         TRUE    ETS-DE  CJ-DE
1      ALL       0.004    .016(.035)     -/-           1.008   0.995   -/-
1      Female    –0.854   –.832(.029)    –.707(.072)   0.529   0.515   0.858
1      Male      0.862    .865(.031)     .704(.063)    0.527   0.524   0.746
2      ALL       0.000    .008(.039)     -/-           1.002   0.991   -/-
2      Female    –0.234   –.252(.049)    –.255(.089)   0.978   0.959   1.003
2      Male      0.234    .269(.050)     .248(.124)    0.971   0.952   1.003
3      ALL       –0.004   –.003(.036)    -/-           0.996   0.993   -/-
3      Female    0.031    .034(.047)     .045(.050)    0.989   0.967   0.988
3      Male      –0.041   –.042(.052)    –.040(.047)   1.001   0.981   0.988
Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
Table 7b
Subgroup Standardized Mean Differences and Variance Ratios for the 12-item Data Set, Setup 1
                 Standardized mean difference   Variance ratio
Scale  Group     TRUE    ETS-DE   CJ-DE         TRUE    ETS-DE  CJ-DE
1      Female    0.000   0.760    2.042         1.000   0.948   2.631
1      Male      0.000   0.097    –2.508        1.000   0.989   2.004
2      Female    0.000   –0.367   –0.236        1.000   0.962   1.052
2      Male      0.000   0.699    0.113         1.000   0.961   1.067
3      Female    0.000   0.063    0.280         1.000   0.956   0.998
3      Male      0.000   –0.020   0.021         1.000   0.960   0.974
Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
In the 12-item case, the current CJ-DE implementation does not converge with the default
settings for Scale 1 but needs to be put into the “slog through” mode, and the number of
iterations needs to be increased from 50 to 500. ETS-DE also reproduces the subgroup means and
standard deviations accurately for the 12-item data set. As in the six-item case, the
differences between subgroup standard deviations are not reproduced by CJ-DE.
The second reporting variable with a strong impact on one of the latent trait components
is GROUP, a variable with the categories 1 to 5. Table 8a shows the standardized mean differences
for the six-item data from Setup 1. As in the above analysis with the grouping variable
GENDER, the algorithm for CJ-DE needs to be put in the “slog through” mode in AM to
converge in this example. The marginal normality assumption does not hold for Scales 1 and 2,
the first and second components of the three-dimensional latent trait in the example data. It can be
expected that CJ-DE using the marginal normality assumption will not match the true subgroup
means and variances as closely as ETS-DE does in the analysis of the GROUP reporting
variable.
The subgroup mean differences for CJ-DE in Scale 2 indicate two subgroups for which
CJ-DE estimates deviate significantly from the true values. For Group 1, the absolute mean
difference between CJ-DE and the true value is 4.45, and for Group 5, the absolute mean
difference is 5.62. In contrast, the subgroup mean differences for ETS-DE and the true values are
all in the expected range. Table 8a also shows that CJ-DE overestimates the subgroup variances
for all subgroups on Scale 2 by a factor of between 1.77 and 2.54.
Table 8a
Subgroup Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 1
                   Standardized mean differences   Variance ratio
Scale  Subgroups   TRUE     ETS-DE    CJ-DE        TRUE     ETS-DE   CJ-DE
1      Group 1     0.0000   0.1928    0.7942       1.0000   1.0659   1.2226
1      Group 2     0.0000   0.6372    0.0748       1.0000   0.9503   1.2346
1      Group 3     0.0000   0.5593    –0.0784      1.0000   1.0092   1.3660
1      Group 4     0.0000   0.8930    0.4622       1.0000   1.0661   1.4910
1      Group 5     0.0000   0.6808    0.0856       1.0000   1.0927   1.2583
2      Group 1     0.0000   0.4218    4.4504       1.0000   1.0483   2.1302
2      Group 2     0.0000   0.1192    –1.0500      1.0000   0.8695   2.0350
2      Group 3     0.0000   0.0802    –1.0467      1.0000   0.9320   2.1604
2      Group 4     0.0000   –0.5602   –1.6938      1.0000   1.1380   2.5416
2      Group 5     0.0000   –1.7368   –5.6242      1.0000   0.9781   1.7734
3      Group 1     0.0000   0.5929    0.4584       1.0000   1.0153   0.9944
3      Group 2     0.0000   –0.7004   –0.3818      1.0000   1.0841   1.1513
3      Group 3     0.0000   –0.4016   –0.0861      1.0000   1.0557   1.0988
3      Group 4     0.0000   –1.5818   –0.7226      1.0000   1.0774   1.1999
3      Group 5     0.0000   –1.0241   –0.6721      1.0000   1.1177   1.1141
Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
Simulation Results: Setup 2
Truly multimodal distributions are rarely found in real data, even though results from
large-scale assessments show variables that account for large differences in average achievement
between subgroups. Setup 2 was designed to be a less extreme version of the same model used
for Setup 1 and was made more realistic by allowing larger between-scale correlations as they
can be found in many large-scale assessment programs. Analyses like the ones presented for
Setup 1 were carried out with the six-item data set in Setup 2 in order to obtain additional results
from this less extreme case. Table 8b shows the comparison of CJ-DE MML regression and
ETS-DE regression estimates with the regression coefficients based on the true θ values.
Table 8b
Regression Coefficients for the Six-item Simulated Data Set, Setup 2
Scale 1 Scale 2 Scale 3
Effect TRUE ETS-DE MML TRUE ETS-DE MML TRUE ETS-DE MML
INTER -3.186 -3.162 -3.100 -2.953 -2.996 -2.973 0.123 0.138 0.132
CLUSTER -0.033 -0.021 -0.018 0.000 -0.013 -0.014 -0.040 -0.080 -0.080
STRATA -0.016 -0.020 -0.020 -0.005 -0.002 -0.002 -0.010 0.000 0.000
GENDER 1.203 1.199 1.178 0.495 0.484 0.482 -0.014 0.015 0.018
GROUP 0.285 0.267 0.258 0.537 0.558 0.556 0.053 0.094 0.096
SES 0.394 0.382 0.373 0.313 0.330 0.327 0.000 -0.023 -0.024
Both ETS-DE and MML regression reproduce the regression weights based on the true values
closely for this data set. This indicates that AM’s MML regression and ETS-DE agree on the underlying
relationship between the reporting variables and the latent trait variables, so that the basis on which CJ-
DE marginal direct estimation and ETS-DE’s conditioning model are compared is the same.
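How closely the two regression approaches track the true coefficients can be summarized by the largest absolute deviation in Table 8b. The following Python sketch does this arithmetic on the values copied from the table; the variable names are hypothetical and the summary statistic is chosen here for illustration only.

```python
# (TRUE, ETS-DE, MML) triples copied from Table 8b, ordered as
# INTER, CLUSTER, STRATA, GENDER, GROUP, SES within each scale.
table_8b = {
    "Scale 1": [(-3.186, -3.162, -3.100), (-0.033, -0.021, -0.018),
                (-0.016, -0.020, -0.020), (1.203, 1.199, 1.178),
                (0.285, 0.267, 0.258), (0.394, 0.382, 0.373)],
    "Scale 2": [(-2.953, -2.996, -2.973), (0.000, -0.013, -0.014),
                (-0.005, -0.002, -0.002), (0.495, 0.484, 0.482),
                (0.537, 0.558, 0.556), (0.313, 0.330, 0.327)],
    "Scale 3": [(0.123, 0.138, 0.132), (-0.040, -0.080, -0.080),
                (-0.010, 0.000, 0.000), (-0.014, 0.015, 0.018),
                (0.053, 0.094, 0.096), (0.000, -0.023, -0.024)],
}

rows = [row for scale_rows in table_8b.values() for row in scale_rows]

# Largest absolute deviation of each method from the TRUE coefficients,
# and largest absolute difference between the two methods.
max_dev_ets = max(abs(ets - true) for true, ets, mml in rows)
max_dev_mml = max(abs(mml - true) for true, ets, mml in rows)
max_diff_methods = max(abs(ets - mml) for true, ets, mml in rows)
print(round(max_dev_ets, 3), round(max_dev_mml, 3), round(max_diff_methods, 3))
```

For these values, the largest deviation from TRUE is below 0.05 for ETS-DE and below 0.09 for MML regression.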
Table 9 shows the residual correlations and variances for the true θ residuals and for the
estimates as obtained by MML regression and ETS-DE.
Table 9
Residual Correlations With Variances in the Diagonal, Six-Item Data Set, Setup 2
         TRUE                       ETS-DE                     MML
Scale    1       2        3         1       2        3         1       2        3
1        0.410   –0.077   0.200     0.437   –0.102   0.330     0.412   –0.104   0.332
2                0.297    –0.228            0.349    –0.109            0.345    –0.106
3                         0.996                      1.036                      1.069
The two approaches reproduce the residual covariance matrix in a very similar way. The
differences between ETS-DE and CJ-DE are even smaller than the small differences between either
approach and the true values. The results on the regression part of ETS-DE and the MML
regression module of AM give no indication that the basic relationships between the three latent
traits and the subgroup variables are represented differently by the two approaches to direct
estimation.
For the reporting variable GENDER in Setup 2, Table 10 shows the respective
standardized mean differences and variance ratios.
Table 10
Standardized Mean Differences and Variance Ratios for GENDER for the Six-item Data Set, Setup 2
                 Standardized mean difference   Variance ratio
Scale  Group     TRUE    ETS-DE   CJ-DE         TRUE    ETS-DE  CJ-DE
1      Female    0.000   0.544    1.088         1.000   1.082   1.218
1      Male      0.000   –0.450   –1.517        1.000   0.969   1.019
2      Female    0.000   0.069    0.057         1.000   1.098   1.126
2      Male      0.000   –0.045   –0.277        1.000   1.084   1.117
3      Female    0.000   –0.641   –0.281        1.000   1.026   1.044
3      Male      0.000   –0.164   –0.015        1.000   0.990   1.086
The results for Setup 2 show smaller but noticeable differences between the estimates of
CJ-DE on the one hand, and the true values and ETS-DE on the other hand. The marginal
distributions in Setup 2 deviate to a lesser extent from the CJ-DE normality assumption, so that the
subgroup estimates of CJ-DE seem less affected by this moderate model violation than in
Setup 1. Table 11 shows the results for the reporting variable GROUP, which again could not be
estimated by CJ-DE using the default options and which is the one with the strongest effect on
Scale 2.
As expected, the standardized mean differences between the true values and CJ-DE for
Scale 2 are larger than the differences between the true values and ETS-DE. In addition, the
variance ratios for CJ-DE range from 1.48 to 1.92 for Scale 2, indicating that CJ-DE
overestimates subgroup variances here.
The results for both reporting variables GENDER and GROUP are similar with respect to
where CJ-DE deviates from the TRUE values and the ETS-DE approach: The GENDER effect is
largest for Scale 1, where CJ-DE deviates most when reporting GENDER subgroup means.
Similarly, for Scale 2, where the GROUP reporting variable has a strong effect on Latent Trait 2,
CJ-DE deviates most when reporting on the GROUP subgroups.
Table 11
Standardized Mean Differences and Variance Ratios for GROUP for the Six-item Data Set, Setup 2
                   Standardized mean difference    Variance ratio
Scale  Subgroups   TRUE     ETS-DE    CJ-DE        TRUE     ETS-DE   CJ-DE
1      Group 1     0.0000   0.0349    0.6512       1.0000   0.9608   0.9649
1      Group 2     0.0000   –0.0490   –0.2745      1.0000   1.1590   1.2968
1      Group 3     0.0000   0.3333    –0.0897      1.0000   0.8876   0.8820
1      Group 4     0.0000   0.1477    –0.5000      1.0000   0.9061   0.9119
1      Group 5     0.0000   –0.2414   –0.3534      1.0000   1.0628   1.1324
2      Group 1     0.0000   –0.5567   3.2474       1.0000   1.1540   1.9239
2      Group 2     0.0000   0.0300    0.0300       1.0000   1.0930   1.7859
2      Group 3     0.0000   0.4177    –1.0127      1.0000   1.1113   1.6488
2      Group 4     0.0000   0.6000    –1.2143      1.0000   1.2346   1.7322
2      Group 5     0.0000   –0.3088   –3.4118      1.0000   1.1239   1.4805
3      Group 1     0.0000   –0.3298   –0.6702      1.0000   0.9496   0.9625
3      Group 2     0.0000   0.1038    0.6321       1.0000   1.1183   1.1314
3      Group 3     0.0000   0.1954    0.3448       1.0000   1.0988   1.1816
3      Group 4     0.0000   –0.7143   –0.5048      1.0000   1.0020   1.0563
3      Group 5     0.0000   –0.8095   –0.4762      1.0000   0.8931   1.0000
Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
Conclusions: Study I
In the examples presented above, AM’s MML regression module yields similar results to
what is found when using the regression results of the ETS-DE methodology. Regression
coefficients, residual correlations, and variances are reproduced in much the same way as ETS-
DE recovers these parameters. These results cannot be generalized as they are currently based on
a few simulated data sets only. Nevertheless, all examples presented here indicate that both
software programs agree on the basic correlational relationships in the data as given by the AM
MML regression module and ETS-DE’s regression estimates.
In contrast to the close agreement of ETS-DE regression and AM’s MML regression, the
AM module for CJ-DE—the marginal normality direct estimation approach—diverges from the
ETS-DE results and the true values if the marginal distributions are non-normal. The example
data sets were constructed and simulated to show where discrepancies can be expected,
and the results so far match the expectations. Setup 1 was constructed to study how CJ-DE
performs if marginal distributions are bimodal or multimodal, and CJ-DE did not converge with
the default settings for the scales that violated the assumptions used in the marginal direct
estimation approach. Setup 2 represents a “milder” version of model violation for CJ-DE.
Even under this setup, where the multimodality of the marginal distribution is less obvious, the
CJ-DE estimates differ both from the true values and from the values produced by ETS-DE, the
conditional normality direct estimation approach.
Assuming that the latent trait is normally distributed across groups may lead to an
inappropriate model because of strong monotonicity assumptions in the IRT model (note that
IRT serves as the basis for both ETS-DE and CJ-DE). For the 1PL and 2PL IRT models as well
as the (generalized) partial credit models, a simple statistic of the observed responses—the
weighted sum of scores—is sufficient for estimating the latent trait. Even for the 3PL, the
monotonicity of the success probability P(X=1|θ) in the latent trait θ and in the item parameters
ensures a relationship between the observed distribution of the raw scores and the unobserved
(but not arbitrary!) distribution of the latent trait. As an example, if a test is administered to two
different samples that differ greatly in their ability distributions (e.g., a reading test taken by both a
group of kindergarten students and a group of third graders), it seems unreasonable to assume a
joint normal distribution. A model assuming marginal normality would force both distributions
under one mode and produce biased estimates of differences between these two groups and other
groups defined by additional reporting variables.
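The sufficiency statement for the 2PL model can be illustrated numerically: two response patterns with the same weighted sum of scores have likelihoods that differ only by a factor that does not depend on the latent trait. The Python sketch below uses three hypothetical items with made-up discrimination and difficulty parameters; none of these values come from the report.

```python
import math


def p_correct(theta, a, b):
    """2PL success probability for discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def likelihood(pattern, theta, a_params, b_params):
    """Likelihood of a 0/1 response pattern at a given theta (local independence)."""
    value = 1.0
    for x, a, b in zip(pattern, a_params, b_params):
        p = p_correct(theta, a, b)
        value *= p if x == 1 else 1.0 - p
    return value


def weighted_score(pattern, a_params):
    """Weighted sum of item scores, with discriminations as weights."""
    return sum(a * x for a, x in zip(a_params, pattern))


# Hypothetical item parameters (illustration only).
a_params = [1.0, 1.0, 2.0]
b_params = [-0.5, 0.3, 0.8]

# Two different patterns with the same weighted score.
pattern_1 = (1, 1, 0)
pattern_2 = (0, 0, 1)

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
ratios = [
    likelihood(pattern_1, t, a_params, b_params)
    / likelihood(pattern_2, t, a_params, b_params)
    for t in thetas
]
print(weighted_score(pattern_1, a_params), weighted_score(pattern_2, a_params))
print([round(r, 6) for r in ratios])
```

Because the likelihood ratio is the same at every value of the latent trait, the two patterns carry the same information about it. This is the link between the observed score distribution and the latent trait distribution that the argument above relies on.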
The simulated data examples revealed effects of CJ-DE in the presence of non-normal
marginal distributions: systematic deviations from the true values in the mean and in the variance
estimates. In contrast, no indication of systematic differences between the true values and the
ETS-DE approach was found in the examples analyzed here. From the perspective of data
analysis, the differences in the subgroup mean estimates of CJ-DE are easier to detect, because in
extreme cases CJ-DE reports when it fails to converge. Nevertheless, when using AM’s “slog
through” estimation option and increasing the number of iterations, there may be no indication of
nonconvergence. The effects of CJ-DE when estimating subgroup variances are more difficult to
detect, as this can only be accomplished by additional analysis using other, less restrictive
methods.
Study II: Comparing Marginal Direct Estimation and Conditional Direct Estimation
Subgroup Statistics for NAEP and NALS Data
Study I showed that the marginal direct estimation (CJ-DE) method relies strongly on the
assumption that the latent trait is marginally normally distributed. The CJ-DE method as
implemented in the AM software (Cohen, 1998) does not reproduce subgroup means and variances
appropriately in cases where a significant part of subgroup differences is explained by the
grouping variable of interest.
The examples presented here help in studying consequences of this effect of marginal
direct estimation in large-scale assessment data analysis. Assessments across a number of
countries, states, regions, or other grouping variables cannot assume a certain form of marginal
distribution of the trait across the groups (Yamamoto & Mazzeo, 1992). In addition, assuming
that subgroup variances are homogeneous (i.e., that the trait[s] vary to a similar degree within all
groups) might be too restrictive to fit diverse populations. Data from large-scale assessment
programs provide a source to study differences between CJ-DE and ETS-DE in a realistic data
analysis setting. Using real data with operational reporting variables enables one to formulate
expectations about whether certain variances should be equal or for which subgroups differences
may be expected. This adds a different perspective to what was examined in Study I, where
known parameters were compared with CJ-DE and ETS-DE estimates.
NAEP Math Assessment, Grade 4
As the first real data example, results were compared for ETS-DE and CJ-DE on data
from an assessment given to a nationally representative sample of 13,855 students in the fourth
grade for the National Assessment of Educational Progress (NAEP). The assessment,
administered in 2000, used a sparse matrix sample design where examinees were given a 45-
minute test of mathematics items consisting of a mixture of multiple choice and constructed
response items. The 173-item pool was divided into 13 blocks of items (separately-timed
sections). The blocks were assembled into 26 booklets based on a BIB (balanced incomplete
block) design (Braswell et al., 2001). Each booklet contained three blocks of items, which were
classified into five content-area scales—numeracy and operations, measurement, geometry, data
analysis, and algebra. A typical examinee answered from 6 to 12 items per scale. A multiscale
IRT model estimated with PARSCALE was used to calibrate the IRT item parameters for each
of the five scales.
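The numbers given for the booklet design are consistent with the counting identities of a balanced incomplete block design: with v = 13 blocks, b = 26 booklets, and k = 3 blocks per booklet, each block appears r = bk/v = 6 times and each pair of blocks appears together λ = r(k − 1)/(v − 1) = 1 time. The Python sketch below builds one such design from a textbook cyclic construction and verifies these counts. The construction and the names are assumptions for illustration; the actual NAEP booklet map is not given in this section and may differ.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 13

# Assumed base triples for a cyclic construction modulo 13 (illustration only,
# not the NAEP booklet map). Their pairwise differences cover 1..12 exactly once.
BASE_TRIPLES = [(0, 1, 4), (0, 2, 7)]

# Shift each base triple through all 13 residues to obtain 26 booklets.
booklets = [
    tuple(sorted((x + shift) % N_BLOCKS for x in base))
    for base in BASE_TRIPLES
    for shift in range(N_BLOCKS)
]

# How often each block is used, and how often each pair of blocks shares a booklet.
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(
    pair for booklet in booklets for pair in combinations(booklet, 2)
)

print(len(booklets), set(block_counts.values()), set(pair_counts.values()))
```

In a design of this kind every pair of item blocks is administered together to some examinees, which is what allows covariances between all blocks to be estimated even though no examinee sees the full item pool.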
The following exhibits show results based on the ETS-DE methodology using 381
background variables in addition to item responses in order to obtain subgroup estimates. The
381 background variables are factor scores based on a principal component analysis that was
conducted using the variables available from the background questionnaire (see Braswell et al.,
2001, for details on the NAEP 2000 math assessment and the available background data). The
operational NAEP 2000 item parameters were used in a five-dimensional run with CGROUP, the
current software implementation of the multidimensional ETS-DE approach. The ETS-DE
approach was found to work accurately in recovering subgroup means and variances in Study I
and serves as a benchmark for CJ-DE, which has been proposed for use for subgroup reporting
(Cohen & Jiang, 1999). In contrast to CJ-DE, the ETS-DE approach assumes conditional
normality of the latent traits given a large set of background variables. Given that a large number
of background variables are used that explain a significant portion of the latent trait variance, this
approach is capable of modeling complex mixtures of abilities resulting in non-normal
population and subgroup distributions. To compare the results of ETS-DE and CJ-DE, the
operational data and NAEP 2000 math item parameters were imported into the software that
implements CJ-DE.
School Type
The first reporting variable used in this comparison is School Type, which has three
categories in NAEP—Public, Private, and Catholic. The subsequent tables offer a comparison
between CJ-DE and ETS-DE, the benchmark, on the basis of standardized mean differences and
variance ratios similar to the exhibits in the previous part of the report. Table 12a shows the
reference values estimated by ETS-DE in the untransformed latent trait scale, not in the NAEP
reporting scale. The untransformed latent trait scale is implicitly given by the item parameters as
calibrated with the PARSCALE software. PARSCALE defaults to the marginal latent trait
moments M(θ)=0 and a standard deviation S(θ)=1.
Table 12a
ETS-DE Estimates of the Means and Standard Deviations in the Latent Trait (Theta) Scale for School Type Subgroups
Mean Standard deviation
Public Private Catholic Public Private Catholic
NUM&OPER –0.047 0.430 0.368 1.021 0.913 0.842
MEASURMT –0.053 0.480 0.402 1.060 0.928 0.897
GEOMETRY –0.034 0.299 0.267 1.012 0.913 0.821
DATA ANL –0.045 0.327 0.425 1.103 0.969 0.880
ALGEBRA –0.047 0.429 0.358 1.081 0.969 0.886
The Private and Catholic school categories have a mean that is about 0.35 to 0.52
standard deviations higher than the one for Public schools, whereas the respective standard
deviations for these subgroups are slightly lower than the subgroup standard deviation for the Public
school category across all five scales of the NAEP math assessment. Table 12b gives the
corresponding standardized mean differences and variance ratios. The table shows these values
for the School Type subgroups, where the differences are formed by “CJ-DE minus ETS-DE”
and the ratios are “CJ-DE divided by ETS-DE.”
Table 12b
Standardized Mean Differences and Variance Ratios for School Type Subgroups
Standardized mean difference Variance ratio
Public Private Catholic Public Private Catholic
NUM&OPER 0.047 –0.245 –0.737 0.931 1.129 1.338
MEASURMT 0.039 0.137 –0.293 0.911 1.143 1.235
GEOMETRY 0.074 0.658 –1.390 0.902 1.087 1.359
DATA ANL 0.071 –0.420 –0.330 0.820 1.044 1.253
ALGEBRA 0.149 –1.226 –0.439 0.832 1.010 1.211
Note. Large differences between CJ-DE on the one hand and ETS-DE and the true values on the other are printed in boldface.
ETS-DE and CJ-DE provide quite similar subgroup mean estimates for most of the five
scales in the three subgroups, but there are differences in the subgroup standard deviations
reported by the two methods. The ETS-DE method reports that the Catholic school subgroup has
a smaller standard deviation as compared to the Public school type on all five scales³, whereas
the CJ-DE method reports more similar standard deviations across the three
subgroups. In Study I, using simulated data examples, it was found that CJ-DE does not recover
differences in subgroup standard deviations correctly. The ETS-DE method, however, was found
to recover this type of subgroup heteroscedasticity in the simulated examples, and ETS-DE
reflects differences between subgroup variances in the NAEP example reported here.
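The variance ratios in Table 12b can be converted back into the standard deviations implied for CJ-DE, because each ratio is the CJ-DE variance divided by the ETS-DE variance. The Python sketch below does this for the NUM&OPER scale, using the ETS-DE standard deviations from Table 12a; the variable names are hypothetical.

```python
import math

# NUM&OPER row of Table 12a: ETS-DE standard deviations by school type.
ets_sd = {"Public": 1.021, "Private": 0.913, "Catholic": 0.842}

# NUM&OPER row of Table 12b: variance ratios (CJ-DE divided by ETS-DE).
var_ratio = {"Public": 0.931, "Private": 1.129, "Catholic": 1.338}

# Implied CJ-DE standard deviation: ETS-DE SD times the square root of the ratio.
implied_cj_sd = {
    group: ets_sd[group] * math.sqrt(var_ratio[group]) for group in ets_sd
}

# Spread (largest minus smallest SD) across the three school types.
ets_spread = max(ets_sd.values()) - min(ets_sd.values())
cj_spread = max(implied_cj_sd.values()) - min(implied_cj_sd.values())

print({group: round(sd, 3) for group, sd in implied_cj_sd.items()})
print(round(ets_spread, 3), round(cj_spread, 3))
```

For this scale, the implied CJ-DE standard deviations of the three school types lie within about 0.02 of each other, whereas the ETS-DE standard deviations range from 0.842 to 1.021.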
Race/Ethnicity
The next variable analyzed is Race/Ethnicity, which has four categories—WHI/AI/O
(White, American Indian, Other), AFRAM (African American), HISPANIC (Hispanic
American), and ASIAM (Asian American)—in the NAEP 2000 data. Table 13 below shows the
subgroup mean differences between CJ-DE and ETS-DE and the corresponding variance ratios
for this reporting variable.
Table 13
Race/Ethnicity Subgroup Reports Generated Based on the NAEP 2000 Grade 4 Math Data 1
Standardized mean difference Variance ratio
WHI/AI/O AFRAM HISPANIC ASIAM WHI/AI/O AFRAM HISPANIC ASIAM
NUM&OPER -0.259 0.482 0.348 -0.285 0.996 0.952 0.849 0.729
MEASURMT -0.109 -0.172 0.459 0.091 0.955 0.935 0.830 0.738
GEOMETRY -0.282 0.012 0.658 0.222 0.976 0.901 0.784 0.715
DATA ANL -0.094 0.743 -0.254 -0.715 0.889 0.778 0.691 0.730
ALGEBRA -0.305 0.490 0.477 -0.388 0.878 0.815 0.728 0.717
Note. Large differences from the expected values given the more general model are printed in boldface.
The subgroup mean differences indicate that the estimates of the two methods do not
differ significantly from each other. CJ-DE reproduces the ETS-DE mean estimates satisfactorily
for the race subgroup variable.
The standard deviation estimates given by CJ-DE differ from what is reported by the
ETS-DE method for the subgroups AFRAM, HISPANIC, and ASIAM. The standard deviation
estimates provided by CJ-DE are about 0.7 times the size of the respective ETS-DE estimates. In
contrast to that, CJ-DE yields a standard deviation more similar to ETS-DE for the WHI/AI/O
subgroup.
Individualized Education Plan
Table 14 shows the subgroup mean differences of CJ-DE estimates against the ETS-DE
analysis and the corresponding variance ratios for the dichotomous grouping variable IEP
(Individualized Education Plan). There is a large mean difference between the two subgroups
IEP and non-IEP. The IEP group means are approximately 0.9 standard deviations smaller than
the non-IEP group estimates across all five scales (see Appendix C, where the ETS-DE estimates
for the reporting variable IEP are given).
Based on the findings of Study I, it can be expected that CJ-DE mean estimates will not
reflect the large difference between the IEP and the non-IEP subgroups. The standardized mean
differences and variance ratios for the IEP reporting variable are given in Table 14.
Table 14
IEP Subgroup Reports Based on the NAEP 2000 Grade 4 Math Data
Standardized mean difference Variance ratio
IEP Non-IEP IEP Non-IEP
NUM&OPER 5.207 –0.556 0.841 1.025
MEASURMT 4.145 –1.206 0.835 0.996
GEOMETRY 4.832 0.042 0.830 1.002
DATA ANL 4.783 0.275 0.685 0.921
ALGEBRA 5.639 –0.998 0.621 0.961
Note. Large differences from the expected values given the more general model are printed in boldface.
The CJ-DE estimates show large differences from the IEP group means as provided by ETS-
DE. CJ-DE reports consistently smaller mean differences between IEP and non-IEP subgroups,
so that the corresponding mean difference between CJ-DE and ETS-DE is a large positive
number. The same was found in Study I (see above) using simulated data when the absolute
mean differences between subgroups are large. These results support the conjecture that CJ-DE
direct estimation of subgroup mean differences deviates from more general models in the
presence of large between-group differences. Compared to ETS-DE, CJ-DE slightly
underestimates the IEP subgroup variances for the subscale categories NUM&OPER,
MEASURMT, and GEOMETRY. For the subgroup variances of ALGEBRA and DATA ANL,
the CJ-DE estimates are only about 0.6 to 0.7 times the size of the corresponding ETS-DE estimates.
National Adult Literacy Survey
The second real data set used in this comparison is taken from the National Adult Literacy
Survey (NALS) administered in 1992. This data set consists of 21,363 subjects and contains a
sparse matrix sample of 713 items from three content domains of literacy: quantitative, prose,
and document. NALS
…measured literacy along three dimensions, prose literacy, document literacy, and
quantitative literacy, designed to capture an ordered set of information-processing skills
and strategies that adults use to accomplish a diverse range of literacy tasks. The literacy
scales make it possible to profile the various types and levels of literacy among different
subgroups in our society (“Defining and measuring literacy,” n.d.).
The example comparisons presented here utilize the NALS main assessment data file
and the operational item parameters, which were used with the CGROUP program, the
current implementation of the ETS-DE approach. The same data and item parameters were
imported into the implementation of the CJ-DE approach, the AM software.
Similar to the preceding analyses, a number of policy-relevant grouping variables from
the NALS data file were chosen to compare the subgroup distribution estimates as given by the
ETS-DE and the CJ-DE approach. Table 15 shows variance ratios and standardized mean
differences between the estimates of ETS-DE and CJ-DE for the grouping variable REGION
with four subgroups.
Table 15
Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for REGION as Defined in the NALS 1992 Data
Standardized mean difference Variance ratio
REGION Prose Document Quantitative Prose Document Quantitative
MIDWEST –0.560 –1.189 –0.418 1.012 0.981 0.964
N-EAST –0.154 0.193 –0.277 0.828 0.809 0.831
SOUTH –0.112 –0.275 –0.298 0.770 0.743 0.748
WEST 0.771 1.029 0.892 0.717 0.708 0.740
Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 are printed in boldface.
The results indicate that all four subgroup mean estimates given by ETS-DE and CJ-DE
agree relatively well. In contrast to the agreement between ETS-DE and CJ-DE for the means of
the region subgroups, the variance estimates for the regions SOUTH and WEST given by CJ-DE
are only about 0.75 times as large as the variance estimates given by ETS-DE.
The next NALS reporting variable used in the comparison is BORN IN, which has five
categories: USA, SPAN (Spanish-speaking world), EUROP, ASIA, and OTHER. Table 16
shows the standardized mean differences and variance ratios of CJ-DE compared to the ETS-DE
estimates for this reporting variable.
Table 16
Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for the Grouping Variable BORN IN as Defined in the NALS 1992 Data
Standardized mean differences Variance ratio
BORN IN Prose Document Quantitative Prose Document Quantitative
USA –2.687 –2.899 –2.475 0.923 0.890 0.884
SPAN 8.892 7.552 7.142 0.413 0.409 0.455
EUROP 0.616 0.980 0.415 0.559 0.592 0.655
ASIA 2.093 1.930 1.789 0.539 0.505 0.553
OTHER 1.404 1.293 1.182 0.588 0.564 0.627
Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences larger than 2.2 in absolute value are printed in boldface.
There are discrepancies between the subgroup mean estimates of CJ-DE and ETS-DE for
the USA and SPAN subgroups. The CJ-DE estimates for USA are about 2.5 to 2.9 standard units
lower than the ETS-DE estimates for the three literacy scales. The standardized differences
between the CJ-DE mean estimates and the ETS-DE estimates for SPAN lie between about 7 and 9
across the three scales, indicating that CJ-DE differs significantly from the ETS-DE estimates.
The variance ratio for four subgroups—SPAN, EUROP, ASIA, and OTHER—is between 0.4
and 0.65 across all three subscales of the NALS data, indicating that the CJ-DE estimates are
systematically smaller than the ETS-DE estimates in this case.
The final comparison of CJ-DE and ETS-DE on the basis of the NALS data is based on
the reporting variable “Years living in the USA.” This reporting variable has nine categories,
ranging from “1-5 years in the USA” to “Ever live in the USA,” in 5 to 10 year intervals (see
below). Table 17 shows the standardized mean differences between CJ-DE and ETS-DE and the
variance ratios for the three literacy scales across the nine subgroups.
Table 17
Variance Ratio and Standardized Mean Differences Between CJ-DE and ETS-DE Estimates for “Years Living in the USA” as Defined in NALS 1992 Data
Standardized mean difference Variance ratio
Yrs in USA Prose Document Quantitative Prose Document Quantitative
1–5 12.060 11.418 8.791 0.432 0.410 0.432
6–10 2.762 2.725 2.584 0.545 0.527 0.572
11+ 4.121 4.581 3.361 0.427 0.434 0.471
16+ 2.863 2.333 2.740 0.440 0.436 0.486
21+ 1.461 2.035 2.208 0.520 0.518 0.565
31+ 0.777 1.151 0.913 0.558 0.602 0.609
41+ 0.187 0.370 0.104 0.467 0.473 0.499
51+ –0.441 –0.035 –0.322 0.939 0.943 0.979
Ever –3.182 –3.471 –2.770 0.937 0.902 0.896
Note. The direction of the difference is (CJ-DE minus ETS-DE) and the direction of the ratio is (CJ-DE divided by ETS-DE). Variance ratios smaller than 0.75 and standardized mean differences larger than 2.2 in absolute value are printed in boldface.
The subgroup mean estimates of CJ-DE are between 2.3 and 12 standardized units larger
than the corresponding estimates given by ETS-DE for the subgroups “1–5 years in the USA,”
“6–10,” “11+,” and “16+.” The mean estimate for subgroup “Ever live in the USA” is between
2.7 and 3.5 standard units smaller for CJ-DE as compared to ETS-DE.
The variance estimates given by CJ-DE for the first seven subgroups, from “1–5” through
“41+,” are systematically smaller than what ETS-DE reports. The variance ratio lies
between 0.41 and 0.61 in these subgroups across all three scales. In contrast, the CJ-DE subgroup
variance estimates for “51+” and “Ever” are close to what ETS-DE yields, as the variance ratio is
close to 1. Note that the subgroups of US residents with a comparatively small number of years
residing in the United States are the subgroups with a comparatively large difference from the total
mean (see Appendix D). For these subgroups, CJ-DE yields estimates that deviate more from
what is given by the more general ETS-DE approach, whereas subgroups closer to the total mean
(“Ever” and “51+”) receive estimates that agree more closely with the ETS-DE approach.
Conclusions: Study II
The results reported in Study II show similarities with the results obtained in Study I,
which used simulated data. In the case of simulated data, CJ-DE differs from the values obtained
by ETS-DE and the true values obtained from analyzing the simulated proficiency values used
for generating the response data. The assumption of marginal normality leads to discrepancies
between CJ-DE and the true values in the presence of large subgroup mean differences and in
cases where the subgroup variances are heteroscedastic. Recall Cohen and Jiang's (1999) direct
estimation model, where the conditional density of $\theta$ given subgroup membership $g=k$ is derived
based on the marginal normality assumption. This density depends on the marginal parameters
$\mu$ and $\sigma$ and subgroup parameters $(a_1, b_1, \ldots, a_G, b_G)$. Essentially, the marginal normal density
$\phi((\theta-\mu)/\sigma)$ acts as a prior for the conditional density

$$
f(\theta \mid g=k) = \frac{f(g=k \mid \theta)\,\phi\!\left(\frac{\theta-\mu}{\sigma}\right)}{\int f(g=k \mid \theta)\,\phi\!\left(\frac{\theta-\mu}{\sigma}\right)\,d\theta}, \tag{14}
$$

which prevents the conditional densities from fitting larger subgroup mean differences. This
might explain why the CJ-DE standard deviation estimates are less variable across
subgroups and why the restriction of the standard deviation is correlated with the distance of the
corresponding subgroup mean from the total mean. A thorough analysis of the marginal direct
estimation model (Cohen & Jiang, 1999) should reveal that this restriction of the parameter space
is caused by the assumption of marginal normality. This assumption forces the mixture of
subgroup distributions to fit under the unimodal normal distribution.
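The tension between normal subgroups and a normal marginal can be illustrated numerically. The sketch below is not taken from Cohen and Jiang (1999); it only uses the standard moment formulas for a finite mixture of normal distributions, and all names and example values are hypothetical. An excess kurtosis of zero is a necessary (not sufficient) condition for normality, so a nonzero value shows that the mixture cannot be exactly normal.

```python
def mixture_moments(weights, means, sds):
    """Mean, variance, and excess kurtosis of a finite mixture of normals.

    Uses the central moments of each normal component around the mixture mean:
    E[(X - m)^2] = sd^2 + d^2 and E[(X - m)^4] = d^4 + 6 d^2 sd^2 + 3 sd^4,
    where d is the distance of the component mean from the mixture mean m.
    """
    total = sum(weights)
    w = [x / total for x in weights]  # normalize subgroup proportions
    m = sum(wk * mk for wk, mk in zip(w, means))
    var = sum(wk * (sk ** 2 + (mk - m) ** 2) for wk, mk, sk in zip(w, means, sds))
    m4 = sum(
        wk * ((mk - m) ** 4 + 6 * (mk - m) ** 2 * sk ** 2 + 3 * sk ** 4)
        for wk, mk, sk in zip(w, means, sds)
    )
    return m, var, m4 / var ** 2 - 3.0


# Hypothetical two-subgroup examples (equal subgroup sizes).
same = mixture_moments([1, 1], [0.0, 0.0], [1.0, 1.0])        # identical subgroups
separated = mixture_moments([1, 1], [-2.0, 2.0], [1.0, 1.0])  # large mean difference
hetero = mixture_moments([1, 1], [0.0, 0.0], [0.5, 1.5])      # unequal variances

for name, (m, var, kurt) in [("same", same), ("separated", separated), ("hetero", hetero)]:
    print(f"{name:10s} mean={m:.3f} variance={var:.3f} excess kurtosis={kurt:.3f}")
```

In this illustration only the case with identical subgroups has zero excess kurtosis; both large mean differences and heteroscedastic subgroups move the mixture away from the normal shape, which are the two situations the report identifies as problematic for the marginal normality assumption.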
The conclusion in Study II, which uses real data from NAEP and NALS, corresponds
closely to the findings of Study I, which compares CJ-DE and ETS-DE based on simulated data
examples, even though in real data applications, the true values usually are unknown. In the
presence of large subgroup mean differences, CJ-DE yields less extreme subgroup estimates than
ETS-DE, which also was found in the comparison in Study I of both methods to the true values.
Additionally, the variance estimates given by CJ-DE tend to be more similar across subgroups than
the ETS-DE estimates and, in Study I, than the true values. The
CJ-DE variance estimates seem to be increasingly restricted with increasing difference of the
subgroup mean to the total mean.
As noted in the introduction, CJ-DE uses a number of assumptions to derive a conditional
subgroup density while maintaining the restriction of normality of the marginal density. This
normal marginal assumption of the latent trait is believed to reflect common practice in large-
scale assessment applications of IRT (see Cohen & Jiang, 1999). However, NAEP and other
large-scale assessments do not rely on this assumption. Appendix E gives an example of how to
avoid the assumptions of CJ-DE when using AM in order to estimate a less restrictive model
with this software. The results of using simulated data and the results of using real data both
show that these assumptions used in CJ-DE lead to discrepancies when analyzing complex
samples where the assumptions are not met by the data. The operational ETS-DE approach does
not put the normality assumption in the marginal distribution, but in the conditional distribution
of the latent trait given the item responses and a large number of the background variables. The
conditioning approach utilized by ETS-DE is therefore more general and enables it to fit non-
normal distributions, as the conditional means given the background model are not assumed to
follow a specific distribution. In the light of systematic differences seen in both Study I and II,
using methods such as CJ-DE that rely on item responses only and replace valuable
background information with a number of assumptions does not seem defensible for the analysis
of large-scale assessment data. This also holds for trend studies, where the assessment of change
relies even more on maximizing the comparability of results and the accuracy of the mean and
variance estimates obtained across time points and subgroups.
References
Braswell, J. S., Lutkus, A. D., Grigg, W. S., Santapau, S. L., Tay-Lim, B., & Johnson, M. (2001).
The nation’s report card: Mathematics 2000. Washington, DC: National Center for
Education Statistics.
Cohen, J. D. (1998). AM online help content—Preview. Washington, DC: American Institutes for
Research.
Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across normal
populations. Journal of the American Statistical Association, 94(448), 1035-1044.
Defining and measuring literacy. (n.d.) In National assessments of adult literacy. Retrieved
December 6, 2002, from http://nces.ed.gov/naal/defining/defining.asp
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex
samples. Psychometrika, 56(2), 177-196.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population
characteristics from sparse matrix samples of item responses. Journal of Educational
Measurement, 29(2), 133-161.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago:
University of Chicago Press.
Thomas, N. (1993). Asymptotic corrections for multivariate posterior moments with factored
likelihood functions. Journal of Computational and Graphical Statistics, 2, 309-322.
Thomas, N. (2002). The role of secondary covariates when estimating latent trait population
distributions. Psychometrika, 67(1), 33-48.
Yamamoto, K., & Mazzeo, J. (1992). Item response theory scale linkage in NAEP. Journal of
Educational Statistics, 17(2), 155-173.
Notes
1 The item parameters in the k-scale IRT model are assumed to be known constants.
2 The overall mean and standard deviation reported here are estimates by ETS-DE and the TRUE data; CJ-DE does not provide overall means and standard deviations.
3 This indicates that the Catholic school category is more homogeneous as compared to the two other categories. The Public school category consistently has the largest standard deviations across all five scales.
Appendix A
Item Parameters of the Simulated Three-scale Six-item Data Set
Scale Slope Difficulty
Scale 1
[1,] 1.0707435 –0.423607249
[2,] 1.1946191 0.369087609
[3,] 1.1356097 –0.008368651
[4,] 1.1029780 –0.434542858
[5,] 0.6926124 –0.320136837
[6,] 0.8034373 0.817567985
Scale 2
[7,] 0.9617609 0.003169065
[8,] 1.1004634 1.327405006
[9,] 0.9115646 0.451618136
[10,] 1.0574126 –2.053570652
[11,] 1.1098851 0.006184470
[12,] 0.8589135 0.265193973
Scale 3
[13,] 1.2621460 1.339141978
[14,] 0.8917393 –0.220816527
[15,] 0.9161605 0.758596816
[16,] 0.9253288 –0.066838528
[17,] 0.7505099 –0.099260362
[18,] 1.2541155 –1.710823377
Note. The guessing parameter was 0.1 for all items.
Appendix B
Item Parameters of the Simulated 3-scale 12-item Data Set
Scale Slope Difficulty
Scale 1
[1,] 1.0301048 –0.40334405
[2,] 1.0807597 –0.12162779
[3,] 1.0250148 –0.29599706
[4,] 0.8097633 0.13585131
[5,] 1.0834746 –0.10137978
[6,] 1.0881449 1.18682432
[7,] 0.8241556 0.58488677
[8,] 1.0754401 0.95989977
[9,] 0.8284506 –1.57049425
[10,] 1.0272207 –0.24556290
[11,] 1.1410092 –0.53788298
[12,] 0.9864615 0.40882663
Scale 2
[13,] 1.0131312 0.56203695
[14,] 1.0604981 0.63205024
[15,] 1.2831725 –0.41368560
[16,] 1.1636971 –0.90477486
[17,] 0.9043142 0.01714852
[18,] 0.9837799 –0.84975192
[19,] 1.0296239 0.63169027
[20,] 1.2039188 0.04996556
[21,] 0.6799550 0.77051519
[22,] 0.9778539 –0.91851904
[23,] 1.0707815 0.19213650
[24,] 0.6292741 0.23118819
Scale 3
[25,] 1.1981095 0.24101286
[26,] 1.0874208 –0.18829633
[27,] 0.9684248 –0.58984308
[28,] 0.8853709 –0.95740524
[29,] 0.9017118 –0.19778461
[30,] 1.0488593 –1.42372395
[31,] 1.0086545 0.17463042
[32,] 0.8052735 1.48726305
[33,] 1.2051341 1.30940643
[34,] 1.0667933 0.28232721
[35,] 0.8616304 –0.32302987
[36,] 0.9626173 0.18544311
Note. The guessing parameter was 0.1 for all items.
Appendix C
ETS-DE Estimates for IEP Subgroup Means and Standard Deviations
                    Mean          Standard deviation
                IEP   Non-IEP       IEP   Non-IEP
NUM&OPER     –0.910     0.091     1.050     0.966
MEASURMT     –0.841     0.054     1.100     1.016
GEOMETRY     –0.844     0.076     1.043     0.958
DATA ANL     –0.852     0.103     1.199     1.043
ALGEBRA      –0.927     0.099     1.244     1.009
Appendix D
Means and Standard Deviations for ETS-DE Estimates for
“Years Living in the USA” as Defined in NALS 1992 Data
                          Mean                       Standard deviation
Yrs in USA    Prose  Document  Quantitative     Prose  Document  Quantitative
1–5          –1.287    –1.154        –1.043     1.549     1.584         1.548
6–10         –1.181    –1.029        –0.987     1.325     1.364         1.305
11+          –1.228    –1.094        –1.026     1.506     1.513         1.448
16+          –0.900    –0.874        –0.777     1.507     1.514         1.447
21+          –0.714    –0.691        –0.565     1.389     1.396         1.346
31+          –0.463    –0.513        –0.351     1.349     1.298         1.298
41+          –0.616    –0.749        –0.585     1.459     1.448         1.417
51+           0.068    –0.139         0.062     1.033     1.032         1.018
Ever          0.102     0.094         0.084     1.065     1.076         1.085
Note. The subgroup means are reported as differences from the total mean.
Appendix E
Using AIR’s AM Software for Secondary Analyses
Studies I and II have shown that AM’s procedure for CJ-DE, a direct estimation approach
relying on a marginal normality assumption, does not seem suitable for data where the normality
of the latent trait across subgroups cannot be warranted. The CJ-DE approach has been
developed “to consistently estimate subpopulation distributions when the groups are defined by
values of a [nominal or ordinal variable]” (Cohen & Jiang, 1999). The two procedures
implementing CJ-DE are referred to as “Ordinal Table” (OT) and “Nominal Table” (NT) in the
AM software, depending on the grouping variable’s scale level. In contrast to the findings
concerning CJ-DE, AM’s MML regression procedure reproduced the results of analyzing the
true values—which served as the basis for the simulated data—quite well, in much the same way
the ETS-DE approach does. In the simulated data examples with known true regression
coefficients, ETS’s method and AM’s MML regression agreed closely when estimating
regression parameters for the full conditioning model.
AM’s MML regression module cannot be used “as is” for reporting purposes, because
additional steps are necessary in order to produce subgroup statistics based on the regression
results. The goal of this appendix is to explore ways to use MML regression and other modules
of AM and to provide a guideline on how to put together analysis steps that can be used to get
results with the AM software that resemble more closely the true values and the ETS-DE
conditioning model estimates.
AM was used in examples presented below in a multistep procedure for producing
subgroup statistics without using AM’s CJ-DE modules. This step-by-step procedure lacks the
convenience of the operational ETS-DE approach in that it requires manual concatenation of
separate intermediate results produced by AM’s procedures. Therefore, the goal of the study
presented here is not to provide an alternative to ETS-DE, but to test whether AM can be used
for secondary analyses.
The approaches taken by the ETS-DE conditioning model on the one hand and AM’s
direct estimation as well as its CJ-DE module on the other hand differ strongly with
respect to the information incorporated in estimating subgroup characteristics. ETS-DE uses
extensive background (conditioning) information, including grouping variables in addition to the
observed item responses. CJ-DE, in contrast, only includes one grouping variable at a time
together with the item responses but draws on a number of strong assumptions regarding the
shape of the marginal ability distribution and the relation between $\theta$ and the group indicator
variable.
Issues in Model Selection
Assumptions about the population structure are central in the process of building a model
for complex survey data. The question is what kind of assumptions are viewed as appropriate for
the comparison of multiple subgroups with respect to their means and variances.
Figure E1. Subgroup distributions with normality assumption on the marginal level.
In the case depicted in Figure E1, the overall distribution is assumed to be normal, and
the sum of all subgroup distributions has to accommodate this shape. It follows that the shapes of
the subgroup distributions are no longer free; they have to fit under the overall normal shape and
their sum has to be equal to that shape. This assumption is central to CJ-DE and makes it
inappropriate for more complex real data. A less restrictive assumption is that all subgroups are
normally distributed and share the same variance but may vary with respect to their means and
size. This assumption can be modeled by a regression with contrast coded subgroup indicators.
This can be done in many software packages as well as in ETS-DE and AM. Doing so drops the
assumption of marginal normality and with it the main feature of CJ-DE as proposed by Cohen
and Jiang (1999). The effect of relaxing the marginal normality assumption is illustrated in
Figure E2.
Figure E2. Subgroup distributions with normality assumption in all subgroup levels.
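A regression with contrast-coded subgroup indicators can be sketched for directly observed scores as follows. This is an illustration only: it treats the scores as observed rather than latent, so it is not the MML regression performed by AM or ETS-DE, and the data, names, and choice of effect coding (last group coded -1) are assumptions made for the example. For a model that contains only the subgroup indicators, the least-squares solution has a closed form in terms of the subgroup means, and the code checks that solution against the normal equations.

```python
from statistics import fmean


def effect_coded_fit(scores, groups):
    """Closed-form least-squares fit of scores on effect-coded group indicators.

    Returns the sorted group labels, the intercept (unweighted mean of the
    subgroup means), the effects of all but the last group, and the pooled
    within-group variance (the common variance assumed for all subgroups).
    """
    labels = sorted(set(groups))
    group_means = {
        g: fmean([s for s, gg in zip(scores, groups) if gg == g]) for g in labels
    }
    intercept = fmean(group_means.values())
    effects = {g: group_means[g] - intercept for g in labels[:-1]}
    ss_within = sum((s - group_means[g]) ** 2 for s, g in zip(scores, groups))
    pooled_var = ss_within / (len(scores) - len(labels))
    return labels, intercept, effects, pooled_var


def design_row(group, labels):
    """Intercept plus G-1 effect-coded columns; the last group is coded -1."""
    if group == labels[-1]:
        return [1.0] + [-1.0] * (len(labels) - 1)
    return [1.0] + [1.0 if group == g else 0.0 for g in labels[:-1]]


def normal_equation_residuals(scores, groups, labels, intercept, effects):
    """Return X'r, which is zero (up to rounding) at the least-squares solution."""
    beta = [intercept] + [effects[g] for g in labels[:-1]]
    xtr = [0.0] * len(beta)
    for s, g in zip(scores, groups):
        row = design_row(g, labels)
        resid = s - sum(x * b for x, b in zip(row, beta))
        xtr = [acc + x * resid for acc, x in zip(xtr, row)]
    return xtr


# Hypothetical scores for three subgroups A, B, and C.
scores = [-1.2, -0.8, -1.0, 0.1, -0.1, 0.3, -0.3, 0.9, 1.1]
groups = ["A", "A", "A", "B", "B", "B", "B", "C", "C"]

labels, intercept, effects, pooled_var = effect_coded_fit(scores, groups)
xtr = normal_equation_residuals(scores, groups, labels, intercept, effects)
print("intercept:", round(intercept, 6))
print("effects:", {g: round(e, 6) for g, e in effects.items()})
print("pooled variance:", round(pooled_var, 6))
```

The subgroup means are recovered as the intercept plus the corresponding effect, and the mean of the last subgroup as the intercept minus the sum of the effects. A single pooled variance is estimated, which corresponds to the shared-variance assumption described above.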
This less restrictive assumption obviously allows a larger range of cases to be fitted as
compared to CJ-DE. This approach can be taken by using AM’s MML means procedure, even
though that procedure will not yield subgroup variance estimates. If only a few subgroups are
used, the homoscedasticity assumption within subgroups limits the ability to fit more general
marginal distributions. A useful extension would be to assume a separate variance for each
subgroup. In AM, this assumption can be accommodated by using MML regression together
with filtering the data as many times as there are subgroups. But even this limits the subgroup
distributions to be normal, which still seems too restrictive an approach if, for example, there is a
strong indication that some subgroups are composites.
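The idea of filtering the data once per subgroup, so that each subgroup receives its own mean and variance, can be sketched as follows. As before, this is a simplified illustration on directly observed scores with hypothetical data and names; it does not reproduce AM's MML regression, which operates on latent traits.

```python
from statistics import fmean, variance


def subgroup_moments(scores, groups):
    """Estimate a separate mean and variance for every subgroup.

    The data are filtered once per subgroup, and each filtered subset is
    summarized on its own, so no common variance is imposed.
    """
    results = {}
    for label in sorted(set(groups)):
        subset = [s for s, g in zip(scores, groups) if g == label]  # filter step
        results[label] = (len(subset), fmean(subset), variance(subset))
    return results


# Hypothetical scores: subgroup A is far more variable than subgroup B.
scores = [-2.0, 0.0, 2.0, -1.0, 1.0, 0.9, 1.0, 1.1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

moments = subgroup_moments(scores, groups)
for label, (n, mean, var) in moments.items():
    print(f"{label}: n={n} mean={mean:.3f} variance={var:.3f}")
```

Each subgroup is still summarized by a mean and a variance only, so the limitation noted above remains: a subgroup that is itself a composite of distinct populations is not captured by this description.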
This is one of the reasons why