Robust Measurement via A Fused Latent and Graphical Item Response Theory Model
Yunxiao Chen, Xiaoou Li, Jingchen Liu, Zhiliang Ying
November 26, 2016
Abstract
Item response theory (IRT) plays an important role in psychological and educa-
tional measurement. Unlike the classical testing theory, IRT models aggregate the
item level information, yielding more accurate measurements. Most IRT models
rely on the so-called local independence assumption, which may not be satisfied in
practice, especially for a large number of items. Results in the literature and sim-
ulation studies in this paper reveal that misspecifying the local independence as-
sumption may result in inaccurate measurements and differential item functioning.
To provide more robust measurements, we propose a Fused Latent and Graphical
IRT (FLaG-IRT) model that can offset the effect of unknown local dependence.
The new model contains a confirmatory latent variable component, which measures
the targeted latent traits, and a graphical component, which captures the local
dependence. An efficient proximal algorithm is proposed for the parameter estima-
tion and structure learning of the local dependence. The proposed approach can
substantially reduce the local dependence induced measurement bias. The model
can be applied to measure both a unidimensional latent trait and multidimensional
latent traits.
KEY WORDS: item response theory, local dependence, robust measurement, differential
item functioning, graphical model, Ising model, pseudo-likelihood, regularized estimator,
proximal algorithm, Eysenck's Personality Questionnaire-Revised
1 Introduction
Item response theory (IRT; Rasch, 1960; Lord and Novick, 1968) models play an important role in measurement theory. Unlike classical test theory, IRT models integrate item-level information for measurement and are regarded as a superior measurement tool (Embretson and Reise, 2000). IRT models have become the preferred method for developing scales, especially when high-stakes decisions are involved. In particular, IRT models are used in the National Assessment of Educational Progress (NAEP), the Scholastic Aptitude Test (SAT), and the Graduate Record Examination (GRE). Popular IRT models include the single factor models, such as the
Rasch model (Rasch, 1960), the two-parameter logistic model, and the three-parameter
logistic model (Birnbaum, 1968), and multiple factor models, such as the multidimen-
sional two-parameter logistic (M2PL) model (McKinley and Reckase, 1982; Reckase,
2009).
We consider the multidimensional two-parameter logistic model as a building block.
There are N individuals responding to J test items and the responses from an individual
are recorded by a vector $X = (X_1, ..., X_J)^\top$. To simplify the presentation, we only consider binary items, i.e. $X_j \in \{0, 1\}$, but emphasize that the proposed approach is flexible enough to be generalized to analyzing polytomous items (Chen, 2016). Associated with each response vector is an unobserved continuous latent vector $\theta \in \mathbb{R}^K$, representing the latent characteristics that are measured. The conditional distribution of each response given the latent vector follows a logistic model
$$f_j(\theta) \triangleq P(X_j = 1 \mid \theta) = \frac{e^{a_j^\top \theta + b_j}}{1 + e^{a_j^\top \theta + b_j}},$$
where $f_j(\theta)$ is known as the item response function and $a_j = (a_{j1}, ..., a_{jK})^\top$ are known as
the factor loading parameters. When used in a confirmatory manner, the model imposes
constraints on the factor loading parameters, that is, parameter ajk is set to be 0, if item
j is not designed to measure the kth latent trait. Specifically, such design information
is characterized by a J × K item-trait relationship matrix, which we refer to as the
Λ-matrix, $\Lambda = (\lambda_{jk})_{J \times K} = (1_{\{a_{jk} \neq 0\}})_{J \times K}$. The Λ-matrix is usually provided by the item
designers and is often assumed to be known. When information about the Λ-matrix is
vague, data-driven approaches for learning the Λ-matrix are proposed (Liu et al., 2012,
2013; Chen et al., 2015; Sun et al., 2016).
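As a concrete illustration of the item response function and the confirmatory constraint, the following is a minimal NumPy sketch. The function name `m2pl_prob`, the 4-item design matrix, and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def m2pl_prob(theta, A, b):
    """P(X_j = 1 | theta) for every item j under the M2PL model.

    theta : (K,) latent trait vector
    A     : (J, K) loading matrix, row j is a_j
    b     : (J,) intercept parameters b_j
    """
    logits = A @ theta + b
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical confirmatory design: 4 items, 2 latent traits.
Lambda = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])
# Multiplying by Lambda forces a_jk = 0 wherever lambda_jk = 0.
A = np.array([[1.2, 0.7], [0.8, 0.3], [0.5, 1.0], [0.9, 1.1]]) * Lambda
b = np.array([0.0, -0.5, 0.3, 1.0])

# At theta = 0 the response probabilities depend on b only.
probs = m2pl_prob(np.zeros(2), A, b)
```

Masking the loading matrix with Λ is one simple way to encode the design; an estimation routine would keep the masked entries fixed at zero.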
One common assumption of standard IRT models, including the M2PL model, is the
so-called local independence assumption, saying that X1, X2, ..., XJ are conditionally
independent, given the value of θ. That is
$$P(X_1 = x_1, ..., X_J = x_J \mid \theta) = P(X_1 = x_1 \mid \theta) P(X_2 = x_2 \mid \theta) \cdots P(X_J = x_J \mid \theta), \tag{1}$$
for each $x = (x_1, ..., x_J)^\top \in \{0, 1\}^J$. The local independence assumption implies that, although the items may be highly intercorrelated in the test as a whole, this is caused only by the items sharing the common latent traits measured by the test. When the trait levels
are controlled, local independence implies that no relationship remains between the items
(Embretson and Reise, 2000).
In recent years, computer-based and mobile-app-based instruments are becoming
prevalent in educational and psychological studies, where a large number of responses
with complex dependence structure are observed. For these tests, a small number of
latent traits may not adequately capture the dependence structure among the responses.
It is known that there are many possible causes for local dependence, including order
effect where responses to early items affect the responses to subsequent items, and shared
content effect where additional dependence is caused by a common stimulus from shared
content (Hoskens and De Boeck, 1997; Knowles and Condon, 2000; Schwarz, 1999; Yen,
1993). Generally speaking, the item response process could be complicated, and affected
by many external and internal factors. Consequently, a low-dimensional latent factor
model may not be adequate to capture all the dependence structure within a test, which
may explain the frequently observed phenomenon of model lack of fit in empirical studies
(Reise et al., 2011; Yen, 1984, 1993; Ferrara et al., 1999).
Differential item functioning (DIF) refers to a test item functioning differently for
different groups, in the sense that the probability of a correct response is associated
with group membership for examinees of comparable ability (Holland and Thayer, 1988).
Controlling DIF is an important aspect of test development for ensuring test fairness.
DIF may be caused by the presence of additional traits in some items (e.g. Camilli, 1992),
and thus is closely related to local dependence. That is, the nuisance traits that cause
DIF also induce the local dependence structure. Thus, DIF could be reduced if the local
dependence structure can be adjusted in the measurement model.
In this paper, we propose a Fused Latent and Graphical IRT (FLaG-IRT) model to incorporate local dependence as well as to include the test-design information in the Λ-matrix as prior knowledge. The model extends the Fused Latent and Graphical (FLaG) model proposed in Chen et al. (2016) by incorporating the loading structure information.
The proposed model adds a sparse graphical component upon a multidimensional item
response theory (MIRT) model to capture the local dependence. The idea is that for a
well designed test, the common dependence among responses has been well explained
by the latent traits and the remaining dependence can be characterized by a sparse
graphical structure.
In psychometrics, there is an existing literature on modeling the local dependence
structure, including the bi-factor and testlet models (Gibbons and Hedeker, 1992; Gib-
bons et al., 2007; Reise et al., 2007; Bradlow et al., 1999; Wainer et al., 2000; Li et al.,
2006; Cai et al., 2011), copula based approaches (Braeken et al., 2007; Braeken, 2011),
and models with fixed interaction parameters (Hoskens and De Boeck, 1997; Ip, 2002;
Ip et al., 2004; Ip, 2010). Most of these approaches require prior information on the
local dependence structure, such as knowing the item clusters and assuming the local
independence between item clusters.
The rest of the paper is organized as follows. In Section 2, the FLaG-IRT model is
introduced and then in Section 3, the statistical analysis based on the model, including
parameter estimation and model selection, is presented. Results of simulation studies
are reported in Section 4. Section 5 contains an application to a real data example. A
summary, along with discussions, is given in Section 6.
2 FLaG-IRT Model
2.1 Two Basic Models
The fused latent and graphical IRT model is built upon the multidimensional 2-parameter logistic (M2PL) model and the Ising model (Ising, 1925). We begin by describing these two building-block models.
2.1.1 MIRT Model
The M2PL model is one of the most popular multidimensional IRT models for binary
responses. The item response function of the M2PL model is given by
$$P(X_j = 1 \mid \theta) = \frac{e^{a_j^\top \theta + b_j}}{1 + e^{a_j^\top \theta + b_j}}.$$
The item-trait relationship is incorporated by constraints contained in a pre-specified matrix $\Lambda = (\lambda_{jk})_{J \times K}$, $\lambda_{jk} \in \{0, 1\}$. Here $\lambda_{jk} = 0$ means that item j is not associated with latent trait k and the corresponding loading $a_{jk}$ is constrained to be 0. The item response function can be further written as
$$P(X_j = x_j \mid \theta) = \frac{e^{(a_j^\top \theta + b_j) x_j}}{1 + e^{a_j^\top \theta + b_j}} \propto \exp\{(a_j^\top \theta + b_j) x_j\}.$$
The notation "∝" above is typically used to define probability density or mass functions. It means that the left-hand side and the right-hand side differ by a normalizing constant that depends only on the parameters and is free of the value of the
random variable/vector. The constant can be obtained by summing or integrating out
the random variable/vector. Such a constant sometimes can be difficult to obtain.
Under the M2PL model, the joint distribution of the responses $X = (X_1, ..., X_J)^\top$ given θ can, due to the local independence assumption, be further written as
$$P(X = x \mid \theta) = \prod_{j=1}^{J} P(X_j = x_j \mid \theta) \propto \exp\{\theta^\top A^\top x + b^\top x\}, \tag{2}$$
where $A = (a_{jk})_{J \times K}$ is known as the factor loading matrix and $b = (b_1, ..., b_J)^\top$. In
particular, when K = 1, the model is known as the two-parameter logistic model (2PL;
Birnbaum, 1968).
2.1.2 Ising Model
We now present the Ising model that is used to characterize the local dependence struc-
ture on top of the M2PL model. The Ising model is an undirected graphical model
(e.g. Koller and Friedman, 2009). It encodes the conditional independence relationships among the Xj's through the topological structure of a graph, which can greatly facilitate the interpretation and understanding of the dependence structure. The Ising model originated in statistical physics (Ising, 1925).
Specification of the Ising model consists of an undirected graph G = (V,E), where
V and E are the sets of vertices and edges respectively. The vertex set V = {1, 2, ..., J},
corresponds to the random variables, X1, ..., XJ . The graph is said to be undirected in
the sense that (i, j) ∈ E if and only if (j, i) ∈ E. The Ising model associated with an undirected graph G = (V, E) is specified as
$$P(X = x) \propto \exp\left\{ \frac{1}{2} x^\top S x \right\}, \tag{3}$$
where $S = (s_{ij})_{J \times J}$ is a symmetric matrix such that $s_{ij} \neq 0$ if and only if (i, j) ∈ E.
Figure 1: The set C separates A from B. All paths from A to B pass through C.
The conditional independence relationship in the Ising model is encoded by the topo-
logical structure of the graph. More precisely, let A,B and C be nonoverlapping subsets
of V and A ∪ B ∪ C = V . We further let XA, XB, and XC be the random vectors
associated with the sets A, B, and C, respectively, i.e., XA = (Xi : i ∈ A) and so on.
We say A and B are separated by C, if every path from a vertex in A to a vertex in B
includes at least one vertex in C, as illustrated by an example in Figure 1. In Figure 1,
A = {1, 2}, B = {4, 5}, and C = {3}, and all paths from A to B pass through C. For
example, the path (1→ 3→ 4) that connects vertices 1 and 4, passes through vertex 3.
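By the Markov property of the Ising model, if A and B are separated by C, then XA and XB are conditionally independent given XC. In particular, (i, j) ∉ E implies that Xi and Xj are conditionally independent given all the other variables. When C is an empty set, the separation between A and B implies their independence.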
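The link between a missing edge and conditional independence can be checked numerically on a small example. The sketch below enumerates all $2^J$ patterns of a hypothetical three-item chain graph (items indexed 0, 1, 2, with no edge between items 0 and 2); the matrix entries are arbitrary illustrative values.

```python
import itertools
import numpy as np

def ising_pmf(S):
    """Exact pmf of P(x) proportional to exp(0.5 * x' S x) over all 2^J binary patterns."""
    J = S.shape[0]
    patterns = np.array(list(itertools.product([0, 1], repeat=J)))
    weights = np.exp(0.5 * np.einsum("ni,ij,nj->n", patterns, S, patterns))
    return patterns, weights / weights.sum()

# Hypothetical chain graph 0 - 1 - 2: s_02 = 0, so there is no edge (0, 2).
S = np.array([[0.2, 0.8, 0.0],
              [0.8, -0.4, 0.6],
              [0.0, 0.6, 0.1]])
patterns, pmf = ising_pmf(S)

def cond_prob_x0(x1, x2):
    """P(X_0 = 1 | X_1 = x1, X_2 = x2), computed by brute-force enumeration."""
    mask = (patterns[:, 1] == x1) & (patterns[:, 2] == x2)
    return pmf[mask & (patterns[:, 0] == 1)].sum() / pmf[mask].sum()

# Because s_02 = 0, the conditional probability does not change with x2.
gap = abs(cond_prob_x0(1, 0) - cond_prob_x0(1, 1))
```

Enumeration is only practical for very small J, which is the same obstacle that motivates the pseudo-likelihood in Section 3.1.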
2.2 FLaG-IRT Model
The FLaG-IRT model combines the M2PL model (2) and the Ising model (3) to construct
a joint item response function. More precisely, the conditional distribution of X given
θ is
$$P(X = x \mid \theta, A, S) \propto \exp\left\{ \theta^\top A^\top x + \frac{1}{2} x^\top S x \right\}. \tag{4}$$
We make two technical remarks on the conditional model (4). First, we remove the term $b^\top x$, because it is absorbed into the diagonal terms of S. That is, $x_j \in \{0, 1\}$ and thus $x_j = x_j^2$. Consequently, the squared terms become linear: $\sum_{j=1}^{J} s_{jj} x_j^2 = \sum_{j=1}^{J} s_{jj} x_j$. Second, the conditional model (4) is an Ising model with parameter matrix $S(\theta)$, where $s_{ij}(\theta) = s_{ij}$ for $i \neq j$ and $s_{jj}(\theta) = a_j^\top \theta + s_{jj}$. In addition, the graph of model (4) is the same as that encoded by S, that is, $E = \{(i, j) : s_{ij} \neq 0, i \neq j\}$.
To assist understanding, Figure 2 provides graphical representations of the MIRT
model and the FLaG-IRT model. The left panel shows a graphical representation of the
marginal distribution of responses, where there is an edge between each pair of responses.
Under the conditional independence assumption (1) of the MIRT model, there exists a
latent vector θ. If we include θ in the graph, then there is no edge among Xjs as in
the middle panel. The concern is that this conditional independence structure may be
oversimplified and there is additional dependence not attributable to the latent traits.
The FLaG-IRT model (right panel) is a natural extension of the MIRT model (middle
panel), allowing edges among Xjs even if θ is included. The additional edges capture the
dependence among Xjs not explained by θ. Due to the presence of the latent variables,
it is likely that we only need a small number of additional edges to capture the local
dependence. Furthermore, the loading structure in Λ is reflected by the edges between
θks and the responses Xjs in the middle and right panels.
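Figure 2: Graphical illustration of the MIRT model and the FLaG-IRT model.
We consider the following joint distribution of (X, θ),
$$f(x, \theta \mid A, S, \Sigma) = \frac{1}{z_0(A, S, \Sigma)} \exp\left\{ -\frac{1}{2} \theta^\top \Sigma^{-1} \theta + \theta^\top A^\top x + \frac{1}{2} x^\top S x \right\}, \tag{5}$$
where (A, S, Σ) are the model parameters and $z_0(A, S, \Sigma)$ is the normalizing constant,
$$z_0(A, S, \Sigma) = \sum_{x \in \{0,1\}^J} \int \exp\left\{ -\frac{1}{2} \theta^\top \Sigma^{-1} \theta + \theta^\top A^\top x + \frac{1}{2} x^\top S x \right\} d\theta.$$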
Note that under this joint distribution, the joint item response function, i.e., the conditional distribution of X given θ, is consistent with (4). Under this joint distribution, a specific prior distribution of θ is implicitly assumed and the posterior distribution of θ becomes Gaussian, an assumption discussed in Holland (1990). As will be described in the sequel, this prior distribution of θ brings technical convenience in the data analysis. We refer the readers to Holland (1990) for more justifications for this prior. Moreover, the posterior covariance matrix of θ is Σ and the posterior mean of θ is given by
$$E(\theta \mid X = x) = \Sigma A^\top x, \tag{6}$$
a weighted sum of the responses. Once A and Σ are estimated from the data, it is reasonable to score each individual by $\hat{\Sigma} \hat{A}^\top x$.
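The scoring rule in (6) is a single matrix product. A minimal sketch follows, assuming estimates of A and Σ are already available; the function name `flag_irt_scores` and the numbers in the example are hypothetical.

```python
import numpy as np

def flag_irt_scores(X, A, Sigma):
    """Posterior means E(theta | x) = Sigma A' x for each row x of X.

    X     : (N, J) binary response matrix
    A     : (J, K) loading matrix
    Sigma : (K, K) symmetric matrix
    Returns an (N, K) array of scores.
    """
    # (Sigma A' x)' = x' A Sigma because Sigma is symmetric.
    return X @ A @ Sigma

# Hypothetical unidimensional example with three items.
X = np.array([[1, 0, 1],
              [0, 0, 0]])
A = np.array([[0.5], [0.7], [0.6]])
Sigma = np.array([[1.0]])
scores = flag_irt_scores(X, A, Sigma)
```

With K = 1 and Σ = 1 the score reduces to a loading-weighted sum of the endorsed items.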
In the specification (5), A, Σ, S, and the graph E induced by S (equivalently, the
nonzero pattern of matrix S) can be estimated from the data. Similar to the M2PL model, we pre-specify a binary matrix $\Lambda = (\lambda_{jk})_{J \times K}$ for the confirmatory structure and impose the constraint that $a_{jk} = 0$ if $\lambda_{jk} = 0$. The latent vector θ is not directly observable.
The estimation is based on the marginal likelihood,
$$P(X = x \mid A, S, \Sigma) = \int f(x, \theta \mid A, S, \Sigma) \, d\theta,$$
where f(x,θ|A, S,Σ) is given in (5).
Under the above model specification, the marginal distribution of X still follows an
Ising model, that is
$$P(X = x \mid A, S, \Sigma) = \int f(x, \theta \mid A, S, \Sigma) \, d\theta \propto \exp\left\{ \frac{1}{2} x^\top (A \Sigma A^\top + S) x \right\}. \tag{7}$$
This is a second-order generalized log-linear model (Holland, 1990; Laird, 1991).
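The step from (5) to (7) follows from completing the square in θ. The short derivation below spells this out, under the added assumption that Σ is positive definite so that $\Sigma^{-1}$ exists; the first line also recovers the Gaussian posterior with mean $\Sigma A^\top x$ and covariance Σ stated in (6).

```latex
\begin{aligned}
-\tfrac{1}{2}\theta^\top \Sigma^{-1} \theta + \theta^\top A^\top x
  &= -\tfrac{1}{2}\,(\theta - \Sigma A^\top x)^\top \Sigma^{-1} (\theta - \Sigma A^\top x)
     + \tfrac{1}{2}\, x^\top A \Sigma A^\top x, \\
\int \exp\left\{ -\tfrac{1}{2}\theta^\top \Sigma^{-1} \theta + \theta^\top A^\top x \right\} d\theta
  &= (2\pi)^{K/2}\, |\Sigma|^{1/2} \exp\left\{ \tfrac{1}{2}\, x^\top A \Sigma A^\top x \right\}.
\end{aligned}
```

Multiplying the second line by $\exp\{\tfrac{1}{2} x^\top S x\}$ and dropping factors that do not depend on x gives the exponent $\tfrac{1}{2} x^\top (A \Sigma A^\top + S) x$ in (7).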
2.3 Bi-factor Model as a Special Case
The bi-factor model is one of the most popular models that take local dependence into account. This model is a special case of the M2PL model, assuming that there is a unidimensional general factor $\theta_g$ that is associated with all items and is the target of measurement. Besides the general factor, there exist nuisance factors $\theta_1, ..., \theta_M$ associated with M nonoverlapping item clusters $C_1, C_2, ..., C_M$, where each item cluster has at least two items and there may be items not belonging to any of these item clusters. As we will discuss in the sequel, the FLaG-IRT model is able to capture such a structure. One of the advantages of the FLaG-IRT model is that there is no need to specify the item clusters a priori; they are learned from the data.
The bi-factor model based on a logistic link (e.g. Cai et al., 2011) can be viewed as a special M2PL model with
$$P(X = x \mid \theta) \propto \exp\{\theta^\top A^\top x + b^\top x\},$$
where $b = (b_1, ..., b_J)^\top$ and $A = (a_g, a_1, ..., a_M)$. In particular, the jth element of $a_k$ is zero if item j is not in the kth item cluster, i.e., $j \notin C_k$. If we use the specific prior of θ in the FLaG-IRT model and further assume Σ to be an identity matrix, then
$$P(X = x) \propto \exp\left\{ \frac{1}{2} x^\top a_g a_g^\top x + \frac{1}{2} x^\top S x \right\}, \tag{8}$$
where $s_{jj} = 2 b_j$, $s_{ij} = s_{ji} = 0$ when items i and j do not belong to the same item cluster, and $s_{ij} = s_{ji} = a_{ik} a_{jk}$ when both items belong to the kth cluster. This admits the same form as the marginal FLaG-IRT model in (7). In other words, the graphical model component of the FLaG-IRT model can take the place of the specific factors in the bi-factor model. The corresponding graph encoded by the S matrix in (8) is sparse when each item cluster has only a small number of items. For example, if each item cluster has only two items, then the sparsity level of the graph, defined as the ratio of the number of edges in the graph to the total number of item pairs, is 1/(J − 1), which can be as small as 3% with J = 30 items. Figure 3 presents an example of a bi-factor model, the corresponding FLaG-IRT model, and the local dependence graph. In summary, when the specific prior for θ is assumed, the bi-factor model becomes a special case of the FLaG-IRT model with one latent trait and a sparse local dependence graph.
Figure 3: Graphical representation of a bi-factor model, the corresponding FLaG-IRT model, and the local dependence graph.
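The sparsity calculation above can be reproduced with a few lines of NumPy. The sketch below builds S from specific-factor loadings following the description of (8) in the text and counts edges; the helper names and the unit loadings are hypothetical.

```python
import numpy as np

def bifactor_to_graph(a_specific, b):
    """Build S as described for (8).

    a_specific : (J, M) loadings on the nuisance factors (nonoverlapping clusters)
    b          : (J,) intercepts
    """
    # Off-diagonal: s_ij = a_ik * a_jk within cluster k, and 0 across clusters.
    S = a_specific @ a_specific.T
    # Diagonal: s_jj = 2 * b_j, as stated in the text.
    np.fill_diagonal(S, 2.0 * b)
    return S

def sparsity_level(S):
    """Number of edges divided by the total number of item pairs."""
    J = S.shape[0]
    off_diagonal = ~np.eye(J, dtype=bool)
    n_edges = np.count_nonzero(S[off_diagonal]) / 2
    return n_edges / (J * (J - 1) / 2)

# Hypothetical design: J = 30 items in 15 clusters of two items each.
J, M = 30, 15
a_specific = np.zeros((J, M))
for k in range(M):
    a_specific[2 * k: 2 * k + 2, k] = 1.0  # illustrative loadings
S = bifactor_to_graph(a_specific, np.zeros(J))
level = sparsity_level(S)  # equals 1 / (J - 1)
```

Larger clusters produce denser graphs, since a cluster of size m contributes m(m − 1)/2 edges.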
3 FLaG-IRT Analysis
3.1 Regularized Pseudo-likelihood Estimation
In this section, we discuss estimation and dimension reduction of the FLaG-IRT model.
The most natural approach would be to maximize the marginal likelihood of the responses given in (7). Unfortunately, the evaluation of (7) involves computing the normalizing constant,
$$z(A, S, \Sigma) = \sum_{x \in \{0,1\}^J} \exp\left\{ \frac{1}{2} x^\top (A \Sigma A^\top + S) x \right\},$$
which requires a summation over all $2^J$ possible response patterns and thus is computationally infeasible even for a relatively small J. To bypass this, we propose a pseudo-likelihood as a surrogate (Besag, 1974), which is based on the conditional distribution of $X_j$ given the rest $X_{-j} = (X_1, ..., X_{j-1}, X_{j+1}, ..., X_J)$,
$$P(X_j = 1 \mid X_{-j} = x_{-j}, A, S, \Sigma) = \frac{\exp\left\{ \frac{1}{2}(l_{jj} + s_{jj}) + \sum_{i \neq j} (l_{ij} + s_{ij}) x_i \right\}}{1 + \exp\left\{ \frac{1}{2}(l_{jj} + s_{jj}) + \sum_{i \neq j} (l_{ij} + s_{ij}) x_i \right\}},$$
where $L = (l_{ij})_{J \times J} = A \Sigma A^\top$. Note that the above conditional distribution takes a logistic regression form. Following Besag (1974), we let $\mathcal{L}_j(A, S, \Sigma; x) = P(X_j = x_j \mid X_{-j} = x_{-j}, A, S, \Sigma)$ and define the pseudo-likelihood function
$$\mathcal{L}(A, S, \Sigma) = \prod_{i=1}^{N} \prod_{j=1}^{J} \mathcal{L}_j(A, S, \Sigma; x_i), \tag{9}$$
where $x_i$ is the response vector of individual i.
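Because each factor of (9) is a logistic regression likelihood, the negative log pseudo-likelihood can be evaluated with a few matrix operations instead of a sum over $2^J$ patterns. The following is a minimal sketch; the function name `neg_log_pseudolik` and the toy inputs are hypothetical.

```python
import numpy as np

def neg_log_pseudolik(X, A, S, Sigma):
    """-(1/N) * log pseudo-likelihood from (9).

    X : (N, J) binary responses; A : (J, K); S : (J, J) symmetric; Sigma : (K, K).
    """
    M = A @ Sigma @ A.T + S          # L + S
    d = np.diag(M)
    # Logit of P(X_j = 1 | x_{-j}): 0.5 * m_jj + sum_{i != j} m_ij * x_i.
    eta = X @ M - X * d + 0.5 * d
    # log P(X_j = x_j | x_{-j}) = x_j * eta - log(1 + exp(eta)).
    log_cond = X * eta - np.logaddexp(0.0, eta)
    return -log_cond.sum() / X.shape[0]

# Toy check: with all parameters zero, every conditional probability is 1/2.
X = np.array([[1, 0, 1],
              [0, 1, 1]])
value = neg_log_pseudolik(X, np.zeros((3, 1)), np.zeros((3, 3)), np.eye(1))
```

Using `np.logaddexp` avoids overflow when the logits are large in absolute value.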
To incorporate the knowledge of the test items, the factor loading matrix A is con-
strained such that ajk = 0 when λjk = 0. Therefore, the unknown parameters in A are
{ajk : λjk = 1}. Since A and Σ appear in the pseudo-likelihood function in the form of
AΣA>, additional constraints are needed to ensure their identifiability. This is because,
for example, scaling A by a constant ω can be offset by the corresponding scaling of Σ by $\omega^{-2}$. To identify the scale of latent factors, we impose constraints $\sigma_{kk} = 1$, k = 1, ..., K, where $\sigma_{kk}$ denotes the kth diagonal entry of Σ, which means that the posterior variance of each latent factor must be 1. To avoid the rotational indeterminacy, we assume that with appropriate column swapping, the Λ matrix contains a K × K identity submatrix. It means that for each latent factor, there is at least one item that only measures that factor.
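When the graph for local dependence is known, we estimate A, S, and Σ by maximizing the pseudo-likelihood function,
$$\begin{aligned}
(\hat{A}, \hat{S}, \hat{\Sigma}) = \arg\min_{A, S, \Sigma} \ & \left\{ -\frac{1}{N} \log \mathcal{L}(A, S, \Sigma) \right\} \\
\text{s.t. } & a_{jk} = 0 \text{ if } \lambda_{jk} = 0, \ j = 1, ..., J, \ k = 1, ..., K, \\
& S = S^\top, \ s_{ij} = 0 \text{ if } (i, j) \notin E, \\
& \Sigma \text{ is positive semidefinite}, \ \sigma_{kk} = 1, \ k = 1, ..., K,
\end{aligned} \tag{10}$$
where E is the set of edges of the known graph.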
When the graph for local dependence is unknown, which is typically the case in practice, we impose an assumption that the graph is sparse, that is, the number of edges in $E = \{(i, j) : s_{ij} \neq 0\}$ is relatively small. The rationale is that most of the dependence among responses has been captured by the common latent traits, leaving the local dependence structure sparse. This assumption is incorporated in the analysis through selecting a sparse graphical model component based on the data. We point out that even a sparse local dependence structure (i.e. a local dependence graph with a relatively small number of edges), if ignored in the measurement, can result in measurement bias, as illustrated by simulated examples. In addition, the sparse local dependence graph, once learned from the data, facilitates the understanding of the measurement and may be used to improve the test design. For example, patterns (e.g. item clusters) identified from the graph may help the test designers to review the items
and improve the wording.
We propose to use the regularized pseudo-likelihood for simultaneous estimation and model selection,
$$\begin{aligned}
(\hat{A}^\gamma, \hat{S}^\gamma, \hat{\Sigma}^\gamma) = \arg\min_{A, S, \Sigma} \ & \left\{ -\frac{1}{N} \log \mathcal{L}(A, S, \Sigma) + \gamma \sum_{i \neq j} |s_{ij}| \right\} \\
\text{s.t. } & a_{jk} = 0 \text{ if } \lambda_{jk} = 0, \ j = 1, ..., J, \ k = 1, ..., K, \\
& S = S^\top, \text{ and } \Sigma \text{ is positive semidefinite}, \ \sigma_{kk} = 1, \ k = 1, ..., K,
\end{aligned} \tag{11}$$
where γ is the tuning parameter that controls the sparsity level of the estimated graph $\hat{E}^\gamma = \{(i, j) : \hat{s}^\gamma_{ij} \neq 0, i \neq j\}$. At one extreme, when γ is sufficiently large, the estimated graph becomes degenerate, i.e., has no edges, and the responses are conditionally independent given the latent variables that are measured. The graph becomes denser as γ decreases.
The optimization problem (11) is nonconvex and nonsmooth, and thus is computationally nontrivial. An efficient and stable algorithm is developed, which alternates between minimizing over A, S, and Σ. In particular, a proximal gradient based method (Parikh et al., 2014) is used in updating S, which avoids the issues due to the nonsmoothness of the objective function that may occur in standard gradient based optimization approaches. Details of the algorithm are provided in the appendix. We emphasize that this algorithm is scalable to very large data sets with a large number of items (e.g. thousands) and a large number of latent factors (tens or larger), and thus is suitable for large scale data analysis.
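To illustrate how a proximal gradient update handles the nonsmooth penalty in (11), the sketch below shows one generic update of S using soft-thresholding, the proximal operator of the L1 norm. It is not the algorithm from the appendix: the function names, the fixed step size, and the assumption that a symmetric gradient of the smooth part is supplied by the caller are all illustrative.

```python
import numpy as np

def soft_threshold(Z, t):
    """Elementwise proximal operator of t * |z|."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def prox_step_S(S, grad_S, step, gamma):
    """One proximal gradient update of S for the penalty gamma * sum_{i != j} |s_ij|.

    S, grad_S : (J, J) symmetric matrices; grad_S is the gradient of the smooth
                part of the objective (assumed to be computed elsewhere).
    """
    Z = S - step * grad_S                    # plain gradient step
    S_new = soft_threshold(Z, step * gamma)  # shrink entries toward zero
    np.fill_diagonal(S_new, np.diag(Z))      # diagonal entries are not penalized
    return 0.5 * (S_new + S_new.T)           # keep S symmetric

# Toy example: with a zero gradient, small off-diagonal entries are set exactly to 0.
S0 = np.array([[1.0, 0.3],
               [0.3, -1.0]])
S1 = prox_step_S(S0, np.zeros((2, 2)), step=1.0, gamma=0.5)
```

The exact zeros produced by soft-thresholding are what make the estimated graph sparse; a plain gradient step would only make entries small.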
3.2 Choice of Tuning Parameters
In the estimation, we construct a solution path of $(\hat{A}^\gamma, \hat{S}^\gamma, \hat{\Sigma}^\gamma)$ for a sequence of γ values. We then choose γ based on the Bayesian information criterion (BIC; Schwarz, 1978), which takes the general form
$$\mathrm{BIC}(\mathcal{M}) = -2 \log L(\hat{\beta}(\mathcal{M})) + |\mathcal{M}| \log N,$$
where $\mathcal{M}$ is the model under consideration, $L(\hat{\beta}(\mathcal{M}))$ is the maximal likelihood for model $\mathcal{M}$, and $|\mathcal{M}|$ is the number of free parameters. In this study, we replace the likelihood function with the pseudo-likelihood function. Specifically, let
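$$\mathcal{M}^\gamma = \left\{ (A, S, \Sigma) : a_{jk} = 0 \text{ if } \lambda_{jk} = 0, \ S = S^\top, \ s_{ij} = 0 \text{ if } \hat{s}^\gamma_{ij} = 0, \text{ and } \Sigma \text{ is positive semidefinite}, \ \sigma_{kk} = 1, \ k = 1, ..., K \right\}$$
be the model selected by tuning parameter γ, containing all models having the same support as $\hat{S}^\gamma$. We select the tuning parameter γ such that the corresponding model minimizes the pseudo-likelihood-based BIC
$$\mathrm{BIC}(\mathcal{M}^\gamma) = -2 \max_{(A, S, \Sigma) \in \mathcal{M}^\gamma} \{\log \mathcal{L}(A, S, \Sigma)\} + |\mathcal{M}^\gamma| \log N, \tag{12}$$
where the number of parameters in $\mathcal{M}^\gamma$ is
$$|\mathcal{M}^\gamma| = \sum_{j,k} \lambda_{jk} + J + \sum_{i<j} 1_{\{\hat{s}^\gamma_{ij} \neq 0\}} + \frac{(K - 1)K}{2}.$$
Here, $\sum_{j,k} \lambda_{jk}$ counts the number of free parameters in the loading matrix A, J and $\sum_{i<j} 1_{\{\hat{s}^\gamma_{ij} \neq 0\}}$ are the numbers of diagonal and off-diagonal parameters in $\hat{S}^\gamma$, and K(K − 1)/2 is the number of parameters in Σ.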
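The parameter count and the resulting BIC value in (12) are straightforward to compute once the maximized log pseudo-likelihood is available. The sketch below assumes that value is supplied by a separate fitting routine; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def n_free_params(Lambda, S_hat):
    """Number of free parameters in the model with the support of S_hat.

    Lambda : (J, K) binary design matrix
    S_hat  : (J, J) symmetric estimate of S
    """
    J, K = Lambda.shape
    upper = np.triu_indices(J, k=1)
    n_edges = int(np.count_nonzero(S_hat[upper]))   # off-diagonal parameters, i < j
    return int(Lambda.sum()) + J + n_edges + K * (K - 1) // 2

def pseudo_bic(max_log_pl, Lambda, S_hat, N):
    """Pseudo-likelihood-based BIC as in (12); max_log_pl is assumed precomputed."""
    return -2.0 * max_log_pl + n_free_params(Lambda, S_hat) * np.log(N)

# Toy example: 3 items, 1 trait, a single local dependence edge (1, 2).
Lambda = np.ones((3, 1), dtype=int)
S_hat = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
count = n_free_params(Lambda, S_hat)   # 3 loadings + 3 diagonal + 1 edge + 0
```

Each additional edge costs log N in the criterion, so an edge is retained only if it improves the maximized log pseudo-likelihood by more than (log N)/2.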
The tuning parameter is finally selected by
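$$\hat{\gamma} = \arg\min_{\gamma} \mathrm{BIC}(\mathcal{M}^\gamma). \tag{13}$$
In addition, the corresponding maximal pseudo-likelihood estimates of A, S, and Σ are used as the final estimate of A, S, and Σ:
$$(\hat{A}, \hat{S}, \hat{\Sigma}) = \arg\max_{(A, S, \Sigma) \in \mathcal{M}^{\hat{\gamma}}} \{\mathcal{L}(A, S, \Sigma)\}. \tag{14}$$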
3.3 Summary
We summarize the procedure of FLaG-IRT analysis, when the graph for local dependence
is unknown.
1. Select a sequence of γ values, denoted by Γ.
2. Obtain a sequence of models indexed by γ ∈ Γ, based on the regularized estimates $(\hat{A}^\gamma, \hat{S}^\gamma, \hat{\Sigma}^\gamma)$ from (11).
3. Among the sequence of models above, select the best-fitting model in terms of BIC value, using (13).
4. Report $(\hat{A}, \hat{S}, \hat{\Sigma})$ from the selected model given by (14), as well as the local dependence graph given by $\hat{E} = \{(i, j) : \hat{s}_{ij} \neq 0\}$.
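The four steps above can be organized as a simple loop over the γ grid. The sketch below shows only the control flow: the three callables `fit_regularized`, `refit_on_support`, and `bic` are hypothetical placeholders for solvers of (11), (14), and the criterion (12), which are not implemented here.

```python
import numpy as np

def select_by_bic(gammas, fit_regularized, refit_on_support, bic):
    """Return (best BIC, gamma, refitted parameters, support) over a grid of gammas.

    fit_regularized(gamma)    -> (A, S, Sigma), a solution of (11)      [placeholder]
    refit_on_support(support) -> (params, max_log_pl), solving (14)     [placeholder]
    bic(max_log_pl, support)  -> float, the criterion (12)              [placeholder]
    """
    best = None
    for gamma in gammas:                           # step 1: grid of gamma values
        _, S, _ = fit_regularized(gamma)           # step 2: regularized estimate
        support = np.asarray(S) != 0               # nonzero pattern of S
        params, max_log_pl = refit_on_support(support)
        score = bic(max_log_pl, support)           # step 3: BIC of the refitted model
        if best is None or score < best[0]:
            best = (score, gamma, params, support)
    return best                                    # step 4: report the selected model
```

Refitting on the selected support before computing BIC mirrors the maximization inside (12), so the criterion is not distorted by the shrinkage of the penalized estimates.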
4 Simulation Studies
In this section, we report two simulation studies. First, we provide an illustrative example showing that ignoring local dependence results in measurement bias (differential item functioning). Second, we evaluate the FLaG-IRT analysis under various simulation settings.
4.1 Simulation Study 1
Data generation. We generate a data set from a bi-factor model, with N = 3000, J = 15, and only one item cluster $C_1 = \{1, 2, 3, 4, 5\}$. Note that the general factor $\theta_g$ and the nuisance factor $\theta_1$ are assumed to be independent and follow the standard normal distribution. The model parameters are shown in Table 1, where the $b_j$'s are sampled from the uniform distribution over the interval [−2, 2]. This data set is generated to mimic a test that aims at measuring the general factor $\theta_g$, thus every item is designed to be associated with this dimension. In addition, $\theta_1$ is a nuisance dimension that is only associated with five items and is not included in the design (so that test users may not be aware of it), but has an effect on the responses. The results are summarized in Table 2.

Item    1      2      3      4      5      6      7      8
a_g     1.5    1.5    1.5    1.5    1.5    1.5    1.5    1.5
a_1     2      2      2      2      2      0      0      0
b      -1.37  -1.98   1.08  -1.55   0.07  -0.18  -0.72  -1.67

Item    9      10     11     12     13     14     15
a_g     1.5    1.5    1.5    1.5    1.5    1.5    1.5
a_1     0      0      0      0      0      0      0
b      -1.63  -1.92  -1.64  -0.41  -0.4    1.71  -1.63

Table 1: The item parameters in study 1.

Model                                   2PL     Bi-factor      FLaG-IRT        FLaG-IRT
                                                (true model)   (known graph)   (unknown graph)
$\tau(\theta_{i,g}, \hat{\theta}_i)$    0.512   0.539          0.539           0.539
$\tau(\theta_{i1}, \hat{\theta}_i)$     0.183   0.046          0.051           0.059

Table 2: The results of study 1.
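The data-generating mechanism of Study 1 can be sketched in NumPy as follows, using the item parameters of Table 1. The random seed is arbitrary, so the simulated responses will not match the authors' data set.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen only for reproducibility

N, J = 3000, 15
a_g = np.full(J, 1.5)                               # general-factor loadings
a_1 = np.where(np.arange(J) < 5, 2.0, 0.0)          # nuisance loadings on items 1-5
b = np.array([-1.37, -1.98, 1.08, -1.55, 0.07, -0.18, -0.72, -1.67,
              -1.63, -1.92, -1.64, -0.41, -0.40, 1.71, -1.63])

theta_g = rng.standard_normal(N)                    # general factor
theta_1 = rng.standard_normal(N)                    # independent nuisance factor

# Bi-factor logits: a_g[j] * theta_g[i] + a_1[j] * theta_1[i] + b[j].
logits = np.outer(theta_g, a_g) + np.outer(theta_1, a_1) + b
probs = 1.0 / (1.0 + np.exp(-logits))
X = (rng.random((N, J)) < probs).astype(int)        # N x J binary response matrix
```

Keeping `theta_g` and `theta_1` alongside `X` makes it possible to compare estimated scores with the true factor values, as is done in the rest of this study.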
Unidimensional 2PL model. We first analyze the data using the unidimensional 2PL model, where the unidimensional latent trait is assumed to have a Gaussian prior, following the standard IRT setting. We consider measuring individuals using a two-stage procedure. In the first stage, the item parameters are estimated, denoted by $\hat{a}$ and $\hat{b}$. In the second stage, $\hat{a}$ and $\hat{b}$ are plugged into the posterior mean of $\theta_i$ and the resulting estimate is denoted by $\hat{\theta}_i$. The two plots in panel (a) of Figure 4 show $\theta_{i1}$ (x-axis) versus $\hat{\theta}_i$ (y-axis) and $\theta_{i,g}$ (x-axis) versus $\hat{\theta}_i$ (y-axis), respectively. In both plots, the regression lines are plotted to show the trend. In addition, the Kendall's tau correlation coefficient (e.g. Kendall and Gibbons, 1990) between $\theta_{i1}$ and $\hat{\theta}_i$ is 0.183 and that between $\theta_{i,g}$ and $\hat{\theta}_i$ is 0.512. Kendall's tau is a nonparametric measure of correlation based on ordering that does not require any parametric model assumption, such as Gaussianity. For two vectors $(y_1, ..., y_N)^\top$ and $(z_1, ..., z_N)^\top$, Kendall's tau is defined as
$$\tau = \frac{2}{N(N - 1)} \sum_{i<j} \mathrm{sgn}\{(y_i - y_j)(z_i - z_j)\},$$
where sgn(x) is the sign function, taking value 1 if x > 0, 0 if x = 0, and −1 otherwise. If $y_i$ and $z_i$ are independent samples from two populations, τ is asymptotically 0 as the sample size N grows. As $\hat{\theta}_i$ is intended to measure $\theta_{i,g}$, it is expected that if $\theta_{i,g} > \theta_{j,g}$, then $\hat{\theta}_i > \hat{\theta}_j$ and vice versa, for each pair of individuals i and j. Thus, the larger the τ between $\theta_{i,g}$ and $\hat{\theta}_i$, the better the measurement. On the other hand, since $\theta_1$ is a nuisance factor independent of $\theta_g$, it is expected that $\hat{\theta}_i$ is independent of $\theta_{i1}$, so that Kendall's tau is close to 0. Consequently, the measurement validity under this misspecified model is low, due to the high Kendall's tau correlation between the nuisance factor $\theta_{i1}$ and $\hat{\theta}_i$. In other words, the latent trait being measured under the 2PL model deviates from what the test is designed to measure. This could lead to test fairness issues, which are of particular concern in educational testing. That is, for two examinees with the same $\theta_g$ value, the one with a higher nuisance trait level tends to be scored higher. This phenomenon is known as differential item functioning (Holland and Wainer, 2012).
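The definition of Kendall's tau given above translates directly into code. The following is a minimal O(N²) sketch of that formula; the function name `kendall_tau` is hypothetical, and ties contribute 0 to the sum exactly as in the definition.

```python
import numpy as np

def kendall_tau(y, z):
    """Kendall's tau: 2 / (N (N - 1)) * sum_{i<j} sgn((y_i - y_j) (z_i - z_j))."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    N = len(y)
    # sgn((y_i - y_j)(z_i - z_j)) = sgn(y_i - y_j) * sgn(z_i - z_j)
    sign_y = np.sign(y[:, None] - y[None, :])
    sign_z = np.sign(z[:, None] - z[None, :])
    pairs = np.triu_indices(N, k=1)                 # all pairs with i < j
    return 2.0 * np.sum(sign_y[pairs] * sign_z[pairs]) / (N * (N - 1))

# Identical orderings give 1, reversed orderings give -1.
tau_same = kendall_tau([1, 2, 3], [1, 2, 3])
tau_reversed = kendall_tau([1, 2, 3], [3, 2, 1])
```

The pairwise matrices need memory of order N², so for N in the thousands a streaming or sort-based implementation may be preferable.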
Bi-factor model with known nuisance factor. If the presence of the nuisance factor $\theta_1$ is known, as well as its loading structure, we fit the true model. The latent traits $\theta_i$'s are estimated using the two-stage procedure based on the posterior mean as above. Again, we plot $\theta_{i1}$ (x-axis) versus $\hat{\theta}_{i,g}$ (y-axis) and $\theta_{i,g}$ (x-axis) versus $\hat{\theta}_{i,g}$ (y-axis) in panel (b) of Figure 4. In addition, the Kendall's tau correlation coefficient between $\theta_{i1}$ and $\hat{\theta}_{i,g}$ is 0.046 and that between $\theta_{i,g}$ and $\hat{\theta}_{i,g}$ is 0.539. Thus, with the nuisance factor adjusted in the measurement model, the test validity improves; that is, $\hat{\theta}_{i,g}$ tends to be less correlated with $\theta_{i1}$, and $\hat{\theta}_{i,g}$ is more correlated with $\theta_{i,g}$.
Figure 4: The scatter plots of $\theta_{i1}$ versus $\hat{\theta}_i$ (left) and $\theta_{i,g}$ versus $\hat{\theta}_i$ (right). Panel (a): Unidimensional 2PL model; Panel (b): Bi-factor model; Panel (c): FLaG-IRT with known graph; Panel (d): FLaG-IRT with unknown graph.
FLaG-IRT with known conditional graph. According to the discussion in Section 2.3, when knowing the item cluster $C_1$, we can also fit a unidimensional FLaG-IRT model with a known graph in which only the first five items are connected to each other, i.e., $E = \{(i, j) : i, j \leq 5\}$. With $\hat{A}$ and $\hat{S}$ from (10), the latent trait level $\theta_i$ for individual i is scored according to (6), $\hat{\theta}_i = \sum_{j=1}^{J} \hat{a}_j x_{ij}$. We remark that under the FLaG-IRT model, due to the choice of the prior distribution, the $\hat{\theta}_i$'s are no longer in the same range as the $\theta_{i,g}$'s. The results are presented in panel (c) of Figure 4 and the Kendall's tau correlation coefficient between $\theta_{i1}$ and $\hat{\theta}_i$ is 0.051 and that between $\theta_{i,g}$ and $\hat{\theta}_i$ is 0.539, which is very close to the values under the true model. It indicates that even when a misspecified prior is used, the measurement based on the FLaG-IRT model is still valid. Moreover, $\hat{A} = (\hat{a}_1, ..., \hat{a}_J)^\top$ is shown in Table 3. Note that the magnitude of the $\hat{a}_j$'s is not on the same scale as the ones in the generating model, due to the different choice of the marginal distribution of the latent trait in the FLaG-IRT model. From Table 3, we see that the values of the $\hat{a}_j$'s are much smaller for items 1 to 5 than those for items 6 to 15, while all the loadings on the general factor are the same in the true model. In other words, due to the positive local dependence among items 1 to 5, the weights of items 1 to 5 are discounted when computing the score $\hat{\theta}_i = \sum_{j=1}^{J} \hat{a}_j x_{ij}$. This is expected, because in the presence of the nuisance factor, the responses to items 1 to 5 are highly correlated and thus provide less information about the general factor than the rest of the items.

Item          1     2     3     4     5     6     7     8
$\hat{a}_j$   0.20  0.22  0.21  0.25  0.22  0.7   0.68  0.72

Item          9     10    11    12    13    14    15
$\hat{a}_j$   0.70  0.60  0.67  0.62  0.68  0.60  0.64

Table 3: The estimated factor loadings for the unidimensional FLaG-IRT model with known local dependence graph.
FLaG-IRT with unknown conditional graph. Finally, we consider the more realistic situation, in which the presence of the nuisance factor and its relationship with the items are unknown. The FLaG-IRT analysis described in Section 3 is applied, with the local dependence graph left unspecified. The selected local dependence graph is shown in Figure 5 and the scatter plots of $\theta_{i1}$ versus $\theta_i$ and $\theta_{i,g}$ versus $\theta_i$ are shown in panel (d) of Figure 4. In addition, the Kendall's tau correlation coefficient between $\theta_{i1}$ and $\theta_i$ is 0.059 and that between $\theta_{i,g}$ and $\theta_i$ is 0.539. From Figure 5, the first five items are connected to each other and there are five additional edges, (1, 8), (2, 8), (2, 9), (5, 15), and (10, 15). These results show that the measurement is very close to that with the true model or with the FLaG-IRT model with a correctly specified local dependence graph. They also show that the differential item functioning caused by the nuisance factor can be reduced to a negligible level.
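The rank correlations reported above can be computed with a short routine. The following is a minimal sketch rather than the code used in the study: it assumes the tau-b variant of Kendall's tau, which adjusts for ties (the estimated scores may contain ties because the responses are binary), and the function name `kendall_tau_b` is hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    # Pairs not tied in x, times pairs not tied in y.
    denom = math.sqrt((untied + tied_y_only) * (untied + tied_x_only))
    return (concordant - discordant) / denom if denom > 0 else float("nan")
```

Calling `kendall_tau_b(scores, general_factor)` and `kendall_tau_b(scores, nuisance_factor)`, with hypothetical lists holding the estimated scores and the simulated factor values, would give the two quantities compared in this section. The quadratic pair scan is adequate for a few thousand respondents; a sort-based implementation would be preferable for much larger samples.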
Figure 5: The local dependence graph of the selected FLaG-IRT model.
4.2 Simulation Study 2
In this study, we evaluate the performance of the FLaG-IRT analysis presented in Sec-
tion 3 under different settings. For each setting, 100 independent data sets are generated.
In the FLaG-IRT analysis, the local dependence structure is completely unspecified and
learned from data. The settings are listed below.
S1. The same setting as in Study 1, where K is set to be 1 in the FLaG-IRT analysis.
S2. The same as setting S1, except that sample size N = 1500.
S3. Generate data from a bi-factor model, with J = 20, N = 3000, and two item
clusters, C1 = {1, 2, 3, 4, 5}, C2 = {6, 7, 8, 9, 10}. K is set to be 1 in the FLaG-IRT
analysis.
S4. The same as setting S3, except that sample size N = 1500.
S5. Generate data from the FLaG-IRT model, with J = 45, N = 3000, K = 3, and local dependence graph E = {(1, 2), (2, 3), (3, 4), ..., (44, 45)}. For the loading structure, items 1-15 measure the first latent trait, items 16-30 measure the second latent trait, and items 31-45 measure the third latent trait. In particular, we set $a_{jk} = 0.5$ for $q_{jk} \neq 0$, $s_{jj} = -6.5$ for j = 1, ..., J, $s_{ij} = 1$ for (i, j) ∈ E, $\sigma_{kk} = 1$ for k = 1, ..., K, and $\sigma_{kl} = 0.1$ for $k \neq l$.
S6. The same as setting S5, except that sample size N = 1500.
For settings S1 and S2, we evaluate the performance of the FLaG-IRT analysis based on the following criteria and compare the results with those of the 2PL model and the true model.
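R1. The Kendall's tau correlation between the estimated scores $\hat\theta_i$ and the true values of the general factor, $\theta_{i,g}$.
R2. The Kendall's tau correlation between the estimated scores $\hat\theta_i$ and the values of the nuisance factor, $\theta_{i1}$.
R3. The true positive rate of graph estimation, defined as
\[
\mathrm{TPR} = \frac{\sum_{i<j} 1_{\{(i,j)\in E,\ (i,j)\in \hat E\}}}{\sum_{i<j} 1_{\{(i,j)\in E\}}},
\]
where $\hat E$ denotes the estimated edge set and, for bi-factor models, we say (i, j) ∈ E if i and j belong to the same item cluster $C_m$, m = 1, ..., M.
R4. The false positive rate of graph estimation, defined as
\[
\mathrm{FPR} = \frac{\sum_{i<j} 1_{\{(i,j)\in \hat E,\ (i,j)\notin E\}}}{\sum_{i<j} 1_{\{(i,j)\notin E\}}}.
\]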
For settings S3 and S4, we evaluate the FLaG-IRT analysis through R1, R3, and R4,
but not R2 because now there are two nuisance factors. In addition, the results are
compared with those of the 2PL model and the true model. For settings S5 and S6, the
evaluation is based on R3, R4 and
R5. The estimation accuracy of the nonzero loading parameters and the diagonal en-
tries of S.
R6. The estimation accuracy of the off-diagonal entries of Σ.
We include R5 and R6, because data are generated from the FLaG-IRT model and thus
the true parameters are known.
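The graph recovery criteria R3 and R4 amount to counting item pairs. The sketch below is one way to compute them under the definitions above; the helper names `graph_rates` and `cluster_edges` are hypothetical, and items are assumed to be labeled 1, ..., J.

```python
from itertools import combinations


def cluster_edges(clusters):
    """Edges implied by bi-factor item clusters: all within-cluster pairs."""
    return {pair for c in clusters for pair in combinations(sorted(c), 2)}


def graph_rates(true_edges, est_edges, num_items):
    """Return (TPR, FPR) computed over all unordered item pairs i < j."""
    true_set = {tuple(sorted(e)) for e in true_edges}
    est_set = {tuple(sorted(e)) for e in est_edges}
    tp = fp = positives = negatives = 0
    for pair in combinations(range(1, num_items + 1), 2):
        selected = pair in est_set
        if pair in true_set:
            positives += 1
            tp += selected
        else:
            negatives += 1
            fp += selected
    tpr = tp / positives if positives else float("nan")
    fpr = fp / negatives if negatives else float("nan")
    return tpr, fpr


# Graph selected in Study 1 (Figure 5): the cluster {1, ..., 5} is fully
# connected and five additional edges are selected, with J = 15 items.
true_e = cluster_edges([{1, 2, 3, 4, 5}])
est_e = true_e | {(1, 8), (2, 8), (2, 9), (5, 15), (10, 15)}
tpr, fpr = graph_rates(true_e, est_e, 15)  # tpr = 1.0, fpr = 5/95
```

For the single data set of Study 1 this gives a true positive rate of 1 and a false positive rate of 5/95, since 95 of the 105 item pairs are not edges of the true graph.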
The results are summarized in Table 4 and Figures 6 and 7. In particular, under setting S1, the mean Kendall's tau between the estimated scores from the FLaG-IRT analysis and the true general factor scores is 0.535, with a standard error of 0.003. In addition, the mean Kendall's tau between the estimated scores and the nuisance factor scores is 0.069 (SE = 0.001). On the other hand, when the unidimensional 2PL model is used, the corresponding values are 0.508 (SE = 0.003) and 0.199 (SE = 0.002), respectively; when the true model is fitted, the mean Kendall's tau values between the estimated general factor scores and the true general and nuisance factor scores are 0.535 (SE = 0.003) and 0.063 (SE = 0.001), respectively. Thus, the measurement based on the FLaG-IRT analysis is comparable to that based on the true model and performs significantly better than the 2PL model, which ignores the local dependence structure. In addition, it is
observed that the true positive rate of the graph estimation is always 1, meaning that
the edges are always correctly selected by the FLaG-IRT, and that the false positive
rate is small, albeit nonzero. Similar patterns are observed under settings S2, S3, and
S4. For settings S5 and S6, data are generated from a FLaG-IRT model. Under both
settings, the local dependence graph is estimated accurately. When the sample size is 1500, the average true positive rate over 100 replications is 0.925 (SE = 0.004), which is close to 1, and the average false positive rate is 0.030 (SE = 0.001), which is close to 0. The results for N = 3000 improve upon those for N = 1500, with the true positive rate increasing to 1 and the false positive rate decreasing to 0.021 (SE = 0.001). Figures 6 and 7 plot the histograms of $\hat a_{jk} - a_{jk}$ for all nonzero loadings, $\hat s_{jj} - s_{jj}$ over all items, and $\hat\sigma_{kl} - \sigma_{kl}$ for all $k \neq l$, over 100 simulated data sets. As we can see, the estimates are roughly unbiased and, as the sample size increases, they become more concentrated around their true values.
                  Criteria
Setting (model)   R1              R2              R3              R4
S1 (FLaG-IRT)     0.535 (0.003)   0.069 (0.001)   1 (0)           0.041 (0.003)
S1 (2PL)          0.508 (0.003)   0.199 (0.002)   *               *
S1 (Bi-fac)       0.535 (0.003)   0.063 (0.001)   *               *
S2 (FLaG-IRT)     0.525 (0.004)   0.067 (0.002)   1 (0)           0.056 (0.003)
S2 (2PL)          0.502 (0.003)   0.187 (0.003)   *               *
S2 (Bi-fac)       0.525 (0.004)   0.061 (0.002)   *               *
S3 (FLaG-IRT)     0.549 (0.004)   *               1 (0)           0.103 (0.005)
S3 (2PL)          0.523 (0.003)   *               *               *
S3 (Bi-fac)       0.548 (0.004)   *               *               *
S4 (FLaG-IRT)     0.549 (0.004)   *               1 (0)           0.093 (0.003)
S4 (2PL)          0.521 (0.003)   *               *               *
S4 (Bi-fac)       0.549 (0.004)   *               *               *
S5                *               *               0.996 (0.001)   0.030 (0.001)
S6                *               *               0.925 (0.004)   0.039 (0.001)
Table 4: Results of FLaG-IRT analysis in simulation study 2.
[Three histogram panels with frequency on the vertical axis: $\hat a_{jk} - a_{jk}$, $\hat s_{jj} - s_{jj}$, and $\hat\sigma_{kl} - \sigma_{kl}$.]
Figure 6: The difference between the estimated and true nonzero factor loading parameters, $\hat a_{jk} - a_{jk}$ ($a_{jk} = 0.5$), the difference between the estimated and true diagonal entries of the S matrix, $\hat s_{jj} - s_{jj}$ ($s_{jj} = -6.5$), and the difference between the estimated and true off-diagonal entries of the Σ matrix, $\hat\sigma_{kl} - \sigma_{kl}$ ($\sigma_{kl} = 0.1$), under setting S5.
[Three histogram panels with frequency on the vertical axis: $\hat a_{jk} - a_{jk}$, $\hat s_{jj} - s_{jj}$, and $\hat\sigma_{kl} - \sigma_{kl}$.]
Figure 7: The difference between the estimated and true nonzero factor loading parameters, $\hat a_{jk} - a_{jk}$ ($a_{jk} = 0.5$), the difference between the estimated and true diagonal entries of the S matrix, $\hat s_{jj} - s_{jj}$ ($s_{jj} = -6.5$), and the difference between the estimated and true off-diagonal entries of the Σ matrix, $\hat\sigma_{kl} - \sigma_{kl}$ ($\sigma_{kl} = 0.1$), under setting S6.
5 Real Data Analysis
We illustrate the use of the FLaG-IRT analysis by an application to the Extroversion short scale of the Eysenck Personality Questionnaire-Revised (EPQ-R; Eysenck et al., 1985;
Eysenck and Barrett, 2013). The data set contains the responses to 12 items from
824 females in the United Kingdom. All these items are designed to measure a single
personality factor Extroversion, characterized by personality patterns such as sociability,
talkativeness, and assertiveness. The items are shown in Table 5, and the data are
preprocessed so that the responses to the reversely worded items are flipped.
We start with fitting the unidimensional 2PL model whose unidimensional latent trait
follows a standard Gaussian distribution and then check the model fit. The estimated
2PL parameters are shown in Table 6. Under the fitted model, the expected two-by-two
tables for item pairs can be evaluated by
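\[
E_{x_i x_j} = N \times P(X_i = x_i, X_j = x_j) = N \int \frac{\exp\{(a_i\theta + b_i)x_i\}}{1 + \exp(a_i\theta + b_i)} \cdot \frac{\exp\{(a_j\theta + b_j)x_j\}}{1 + \exp(a_j\theta + b_j)} \, \phi(\theta)\, d\theta,
\]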
where φ(θ) is the density function of a standard normal distribution. We first check the fit of item pairs by comparing the expected two-by-two tables with the observed ones, using the $X^2$ local dependence index (Chen and Thissen, 1997) as a descriptive statistic. For each item pair i and j, the $X^2$ statistic is defined as
\[
X^2_{ij} = \sum_{x_i=0}^{1} \sum_{x_j=0}^{1} \frac{(O_{x_i x_j} - E_{x_i x_j})^2}{E_{x_i x_j}},
\]
where $O_{x_i x_j}$ is the observed number of $(x_i, x_j)$ pairs. A large value of $X^2_{ij}$ indicates a lack of fit. In addition, based on simulation studies, Chen and Thissen (1997) suggest that the marginal distribution of each $X^2_{ij}$ is roughly a chi-square distribution with one degree of freedom when data are generated from the 2PL model. We visualize $(X^2_{ij})_{J \times J}$ using a heat map in the left panel of Figure 8. For a better visualization, we plot a monotone transformation of $X^2_{ij}$,
\[
T_{ij} = \frac{X^2_{ij}}{Q^{\mathrm{Chi}}_{1,95\%} + X^2_{ij}},
\]
where $Q^{\mathrm{Chi}}_{1,95\%}$ is the 95% quantile of the chi-square distribution with one degree of freedom. Thus, $T_{ij} > 1/2$, which is equivalent to $X^2_{ij} > Q^{\mathrm{Chi}}_{1,95\%}$, suggests that item pair (i, j) is not fitted well. In the heat map, the value of $T_{ij}$ is presented according to the color key above the heat map. The top five item pairs with the highest levels of $T_{ij}$ are shown in Table 7, where items within a
pair tend to share common content/stimuli. To further assess the overall fit of the 2PL model and to compare it with that of the selected FLaG-IRT model, we consider a parametric bootstrap test, using the total sum of the $X^2$ statistics, $SX_{2PL} = \sum_{i<j} X^2_{ij}$, as the test statistic. That is, we generate 500 bootstrap data sets, each of which has 824 samples drawn from the estimated 2PL model. For each bootstrap data set, we fit the 2PL model again and compute the corresponding total sum of the $X^2$ statistics, denoted by $SX^{(b)}_{2PL}$. The empirical distribution of $SX^{(b)}_{2PL}$ is used as the reference distribution. The histogram of $SX^{(1)}_{2PL}, ..., SX^{(500)}_{2PL}$ is shown in the left panel of Figure 9. The observed value of $SX_{2PL}$ based on the fitted model is 318, much larger than the values from the bootstrap data. Consequently, the p-value of this bootstrap test is 0, indicating a lack of fit of the 2PL model.
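The pairwise fit statistics above are straightforward to evaluate numerically. The following sketch is an illustration under stated assumptions, not the authors' implementation: the integral over θ is approximated by a midpoint rule on [-8, 8], the bootstrap p-value is taken to be the proportion of bootstrap statistics at least as large as the observed one (the text does not spell out this convention), and all function names are hypothetical.

```python
import math

Q_CHI1_95 = 3.841458820694124  # 95% quantile of the chi-square(1) distribution


def normal_pdf(theta):
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def item_prob(a, b, theta, x):
    """2PL probability P(X = x | theta) with logit a * theta + b, x in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-(a * theta + b)))
    return p1 if x == 1 else 1.0 - p1


def expected_table(a_i, b_i, a_j, b_j, n, lo=-8.0, hi=8.0, steps=1600):
    """Expected 2x2 counts E[x_i][x_j] under the 2PL model with an N(0, 1) trait."""
    h = (hi - lo) / steps
    table = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        theta = lo + (k + 0.5) * h  # midpoint of the k-th subinterval
        w = normal_pdf(theta) * h
        for xi in (0, 1):
            for xj in (0, 1):
                table[xi][xj] += (w * item_prob(a_i, b_i, theta, xi)
                                  * item_prob(a_j, b_j, theta, xj))
    return [[n * table[xi][xj] for xj in (0, 1)] for xi in (0, 1)]


def x2_index(observed, expected):
    """X^2 local dependence index for one item pair."""
    return sum((observed[xi][xj] - expected[xi][xj]) ** 2 / expected[xi][xj]
               for xi in (0, 1) for xj in (0, 1))


def t_index(x2):
    """Monotone transformation T = X^2 / (Q + X^2); T > 1/2 iff X^2 > Q."""
    return x2 / (Q_CHI1_95 + x2)


def bootstrap_p_value(observed_stat, bootstrap_stats):
    """Proportion of bootstrap statistics at least as large as the observed one."""
    return sum(s >= observed_stat for s in bootstrap_stats) / len(bootstrap_stats)
```

For example, `expected_table(2.62, -1.48, 2.22, -0.62, 824)` evaluates the expected table for items 6 and 8 using the estimates in Table 6, and `t_index(62)` rounds to 0.94, in agreement with the first row of Table 7. Refitting the model to each bootstrap data set is outside the scope of this sketch.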
1 Are you a talkative person?
2 Are you rather lively?
3 Can you usually let yourself go and enjoy yourself at a lively party?
4 Do you enjoy meeting new people?
5 Do you usually take the initiative in making new friends?
6 Can you easily get some life into a rather dull party?
7 Do you like mixing with people?
8 Can you get a party going?
9 Do you like plenty of bustle and excitement around you?
10 Do other people think of you as being very lively?
11(R) Do you tend to keep in the background on social occasions?
12(R) Are you mostly quiet when you are with other people?
Table 5: The revised Eysenck Personality Questionnaire short form of Extroversion scale.
Item  1     2     3     4     5     6      7     8      9     10    11    12
a_j   1.90  2.52  1.55  1.64  1.41  2.62   1.70  2.22   1.20  2.35  2.52  2.19
b_j   1.12  1.98  1.51  3.04  0.60  -1.48  3.14  -0.62  1.17  0.90  0.62  1.27
Table 6: The estimated 2PL model for the EPQ-R data.
Rank  T_ij  X^2_ij  Item    Content
1     0.94  62      6       Can you easily get some life into a rather dull party?
                    8       Can you get a party going?
2     0.87  25      1       Are you a talkative person?
                    12(R)   Are you mostly quiet when you are with other people?
3     0.86  24      2       Are you rather lively?
                    10      Do other people think of you as being very lively?
4     0.85  23      4       Do you enjoy meeting new people?
                    7       Do you like mixing with people?
5     0.79  15      7       Do you like mixing with people?
                    9       Do you like plenty of bustle and excitement around you?
Table 7: Item pairs with largest values of local dependence indices.
We then apply the FLaG-IRT analysis. The local dependence graph of the selected model has 20 edges, as shown in Figure 10, where the positive and negative edges are in black and red, respectively. In particular, the most locally dependent item pairs also correspond to the most positive edges in Figure 10. Similar to the analysis above, we compute the local dependence indices for all the item pairs and visualize them in the right panel of Figure 8, where no $X^2_{ij}$ is found to exceed $Q^{\mathrm{Chi}}_{1,95\%}$. Moreover, 500 bootstrap data sets are generated from the selected FLaG-IRT model and the bootstrap distribution of $SX_{FLaG-IRT}$ is shown in the right panel of Figure 9. As we can see, the observed value of $SX_{FLaG-IRT}$ for the selected model is in the middle range of the bootstrap distribution and the p-value is 31.6%, indicating that the selected FLaG-IRT model fits well.
Figure 8: The heat maps for visualizing the fit of all item pairs under the 2PL model (left) and the selected FLaG-IRT model (right).
Figure 9: The results of a parametric bootstrap test for the 2PL model (left) and the selected FLaG-IRT model (right).
Figure 10: The local dependence graph of the selected FLaG-IRT model.
Based on the above analysis, we see that even the well-designed 12-item EPQ-R short form displays a significant level of local dependence, which, if not adjusted for, may result in
severe measurement bias. The proposed FLaG-IRT model automatically adjusts for the
local dependence based on the data, while maintaining the unidimensional latent trait
as the key source of dependence among responses. As a result, the FLaG-IRT model
learned from data fits well, at both the item pair level and the test level.
6 Summary
It is well known that mental processes are complex and there are always factors not
perfectly explained by a measurement model. Standard latent factor models may result
in model lack of fit, thereby having a negative effect on the test validity. In this paper,
we propose the FLaG-IRT model for robust measurement. The key idea behind the
proposed model is that, given the loading structure of a well-designed test, the local
dependence structure can be well incorporated through a sparse graphical component.
This is done without requiring any prior information about the structure of local depen-
dence. Our analysis shows that the method greatly reduces the measurement bias and
increases the measurement accuracy. In particular, when significant local dependence is observed and when test fairness is of concern, the FLaG-IRT model can be a good choice for scoring individuals. Finally, we remark that our algorithm for the FLaG-IRT analysis is efficient and stable, although the optimization problem is nonconvex and nonsmooth. The FLaG-IRT analysis is thus scalable to large-scale data.
References
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society. Series B (Methodological), pages 192–236.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s
ability. In Lord, F. M. and Novick, M. R., editors, Statistical Theories of Mental Test
Scores, pages 395–479. Addison-Wesley, Reading, MA.
Bradlow, E. T., Wainer, H., and Wang, X. (1999). A Bayesian random effects model for
testlets. Psychometrika, 64(2):153–168.
Braeken, J. (2011). A boundary mixture approach to violations of conditional indepen-
dence. Psychometrika, 76(1):57–76.
Braeken, J., Tuerlinckx, F., and De Boeck, P. (2007). Copula functions for residual
dependency. Psychometrika, 72(3):393–411.
Cai, L., Yang, J. S., and Hansen, M. (2011). Generalized full-information item bifactor
analysis. Psychological methods, 16(3):221–248.
Camilli, G. (1992). A conceptual analysis of differential item functioning in terms
of a multidimensional item response model. Applied Psychological Measurement,
16(2):129–147.
Chen, W.-H. and Thissen, D. (1997). Local dependence indexes for item pairs using item
response theory. Journal of Educational and Behavioral Statistics, 22(3):265–289.
Chen, Y. (2016). Latent variable modeling and statistical learning. PhD thesis, Columbia University. Available at http://academiccommons.columbia.edu/catalog/ac:198122.
Chen, Y., Li, X., Liu, J., and Ying, Z. (2016). A fused latent and graphical model for
multivariate binary data. Available at https://arxiv.org/pdf/1606.08925v1.pdf.
arXiv preprint.
Chen, Y., Liu, J., Xu, G., and Ying, Z. (2015). Statistical analysis of Q-matrix based
diagnostic classification models. Journal of the American Statistical Association,
110(510):850–866.
Embretson, S. E. and Reise, S. P. (2000). Item response theory for psychologists.
Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
Eysenck, S. and Barrett, P. (2013). Re-introduction to cross-cultural studies of the EPQ.
Personality and Individual Differences, 54:485–489.
Eysenck, S. B., Eysenck, H. J., and Barrett, P. (1985). A revised version of the Psy-
choticism scale. Personality and Individual Differences, 6:21–29.
Ferrara, S., Huynh, H., and Michaels, H. (1999). Contextual explanations of local
dependence in item clusters in a large scale hands-on science performance assessment.
Journal of Educational Measurement, 36:119–140.
Gibbons, R. D., Bock, R. D., Hedeker, D., Weiss, D. J., Segawa, E., Bhaumik, D. K.,
Kupfer, D. J., Frank, E., Grochocinski, V. J., and Stover, A. (2007). Full-information
item bifactor analysis of graded response data. Applied Psychological Measurement,
31(1):4–19.
Gibbons, R. D. and Hedeker, D. R. (1992). Full-information item bi-factor analysis.
Psychometrika, 57(3):423–436.
Holland, P. W. (1990). The Dutch identity: A new tool for the study of item response
models. Psychometrika, 55(1):5–18.
Holland, P. W. and Thayer, D. T. (1988). Differential item performance and the Mantel-
Haenszel procedure. In Wainer, H. and Braun, H. I., editors, Test validity, pages
129–145. Lawrence Erlbaum Associates, Hillsdale, NJ.
Holland, P. W. and Wainer, H. (2012). Differential item functioning. Routledge.
Hoskens, M. and De Boeck, P. (1997). A parametric model for local dependence among
test items. Psychological methods, 2(3):261–277.
Ip, E. H. (2002). Locally dependent latent trait model and the Dutch identity revisited.
Psychometrika, 67(3):367–386.
Ip, E. H. (2010). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63(2):395–416.
Ip, E. H., Wang, Y. J., De Boeck, P., and Meulders, M. (2004). Locally dependent
latent trait model for polytomous responses with application to inventory of hostility.
Psychometrika, 69(2):191–216.
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31:253–258.
Kendall, M. G. and Gibbons, J. D. (1990). Rank correlation methods. Oxford University
Press, London, UK.
Knowles, E. S. and Condon, C. A. (2000). Does the rose still smell as sweet? Item
variability across test forms and revisions. Psychological Assessment, 12:245–252.
Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and tech-
niques. MIT press, Cambridge, MA.
Laird, N. M. (1991). Topics in likelihood-based methods for longitudinal data analysis.
Statistica Sinica, 1(1):33–50.
Li, Y., Bolt, D. M., and Fu, J. (2006). A comparison of alternative models for testlets.
Applied Psychological Measurement, 30(1):3–21.
Liu, J., Xu, G., and Ying, Z. (2012). Data-driven learning of Q-matrix. Applied psycho-
logical measurement, 36(7):548–564.
Liu, J., Xu, G., and Ying, Z. (2013). Theory of the self-learning Q-matrix. Bernoulli:
official journal of the Bernoulli Society for Mathematical Statistics and Probability,
19(5A):1790–1817.
Lord, F. M. and Novick, M. R. (1968). Statistical theories of mental test scores. Addison-
Wesley, Reading, MA.
McKinley, R. L. and Reckase, M. D. (1982). The use of the general Rasch model with
multidimensional item response data. Iowa City, IA: American College Testing.
Parikh, N., Boyd, S. P., et al. (2014). Proximal algorithms. Foundations and Trends in
optimization, 1(3):127–239.
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests.
Copenhagen: Danish Institute for Educational Research.
Reckase, M. (2009). Multidimensional item response theory. Springer, New York, NY.
Reise, S. P., Horan, W. P., and Blanchard, J. J. (2011). The challenges of fitting an
item response theory model to the social anhedonia scale. Journal of personality
assessment, 93(3):213–224.
Reise, S. P., Morizot, J., and Hays, R. D. (2007). The role of the bifactor model in
resolving dimensionality issues in health outcomes measures. Quality of Life Research,
16(1):19–31.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–
464.
Schwarz, N. (1999). Self-reports: How the questions shape the answers. American
Psychologist, 54:93–105.
Sun, J., Chen, Y., Liu, J., Ying, Z., and Xin, T. (2016). Latent variable selection for
multidimensional item response theory models via L1 regularization. Psychometrika.
To appear.
Wainer, H., Bradlow, E. T., and Du, Z. (2000). Testlet response theory: An analog
for the 3PL model useful in testlet-based adaptive testing. In van der Linden, W. J.
and Glas, G. A., editors, Computerized adaptive testing: Theory and practice, pages
245–269. Springer, New York, NY.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance
of the three-parameter logistic model. Applied Psychological Measurement, 8(2):125–
145.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local
item dependence. Journal of Educational Measurement, 30(3):187–213.
Appendix
A Computation via an Alternating Minimization
A.1 Proximal Gradient Descent Update
The proximal gradient descent update is designed for solving nonsmooth convex optimization problems (Parikh et al., 2014). Consider the optimization problem
\[
\min_{x} \; f(x) + g(x), \tag{15}
\]
where x ∈ Rn, f is a smooth convex function, and g is a continuous but nonsmooth
convex function. Due to the nonsmoothness of g, the traditional gradient descent algo-
rithm cannot be directly applied, because the gradient of g does not always exist. The
proximal gradient descent update can be viewed as a variant of the gradient descent
update that accounts for the nonsmoothness.
To describe the proximal gradient descent update, we first introduce the proximal operator $P_{\lambda,g}: \mathbb{R}^n \to \mathbb{R}^n$, defined as
\[
P_{\lambda,g}(v) = \arg\min_{x \in \mathbb{R}^n} \Big\{ g(x) + \frac{1}{2\lambda} \|x - v\|^2 \Big\}.
\]
The proximal gradient descent update is
\[
x_{t+1} = P_{\lambda,g}\big(x_t - \lambda \nabla f(x_t)\big), \tag{16}
\]
where $x_t$ and $x_{t+1}$ are the current and updated values. It can be shown that for a sufficiently small λ,
\[
f(x_{t+1}) + g(x_{t+1}) < f(x_t) + g(x_t),
\]
if $x_t$ is not an optimal solution. Thus, one can always search for a step size λ, such
that the objective function decreases. When function g has a special form, the proximal
gradient descent update (16) may have a closed form solution, which is indeed the case
in our algorithm below.
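As a small illustration of update (16), consider the toy problem with $f(x) = \frac{1}{2}\|x - c\|^2$ and $g(x) = \gamma \|x\|_1$, for which the proximal operator reduces to coordinatewise soft-thresholding at level $\lambda\gamma$. The sketch below uses a fixed step size; the problem and the function names are assumptions chosen for illustration and are not part of the FLaG-IRT estimation itself.

```python
def soft_threshold(v, tau):
    """Scalar soft-thresholding: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_l1(v, lam, gamma):
    """Proximal operator P_{lam,g} for g(x) = gamma * ||x||_1."""
    return [soft_threshold(vi, lam * gamma) for vi in v]


def proximal_gradient(grad_f, x0, lam, gamma, num_iters=200):
    """Iterate x <- P_{lam,g}(x - lam * grad_f(x)) with a fixed step size lam."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad_f(x)
        x = prox_l1([xi - lam * gi for xi, gi in zip(x, g)], lam, gamma)
    return x


# Toy smooth part f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = [3.0, -0.2, -2.0]
x_hat = proximal_gradient(lambda x: [xi - ci for xi, ci in zip(x, c)],
                          x0=[0.0, 0.0, 0.0], lam=0.5, gamma=0.5)
```

For this toy problem the minimizer is known in closed form (each coordinate of c soft-thresholded at γ), so the iterates approach (2.5, 0, -1.5); the small middle coordinate is set exactly to zero, which is the mechanism that produces sparsity.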
A.2 An Alternating Minimization Algorithm
We use an alternating minimization algorithm for optimizing (11). The positive semidefinite constraint on Σ is not easy to handle in the computation. Therefore, we reparameterize $\Sigma = BB^\top$, where $B = (b_{ij})$ is a K × K lower triangular matrix. In addition, instead of constraining $\Sigma_{kk} = 1$, we require $b_{kk}$ to be 1. There is a one-to-one correspondence between the two parametrizations, and the transformation will be discussed in the sequel. We let $l(A, S, B) = \log L(A, S, BB^\top)$ and
\[
H_\gamma(A, S, B) = -\frac{1}{N}\, l(A, S, B) + \gamma \sum_{i \neq j} |s_{ij}|
\]
be the objective function. Then the alternating minimization algorithm alternates between updating A, S, and B, so that the iterates $(A^t, S^t, B^t)$ satisfy
\[
H_\gamma(A^t, S^t, B^t) > H_\gamma(A^{t+1}, S^t, B^t) > H_\gamma(A^{t+1}, S^{t+1}, B^t) > H_\gamma(A^{t+1}, S^{t+1}, B^{t+1}),
\]
for all t. Specifically, A and B are updated using a gradient descent method and S is updated using the proximal gradient descent method. We summarize the algorithm as follows, given the current parameter values $(A^t, S^t, B^t)$.
Algorithm 1 An Alternating Minimization Algorithm.
1. Update
\[
A^{t+1} \leftarrow A^t - \alpha_t\, g_A(A^t, S^t, B^t),
\]
where $g_A(A^t, S^t, B^t)$ is the gradient of $-l(A, S, B)/N$ with respect to A at $(A^t, S^t, B^t)$. The step size $\alpha_t$ is chosen by line search, such that $H_\gamma(A^{t+1}, S^t, B^t) < H_\gamma(A^t, S^t, B^t)$.
2. Update
\[
S^{t+1} \leftarrow P_{\lambda_t, h_\gamma}\big(S^t - \lambda_t\, g_S(A^{t+1}, S^t, B^t)\big),
\]
where $h_\gamma(S) = \gamma \sum_{i \neq j} |s_{ij}|$ is the regularization function and $g_S(A^{t+1}, S^t, B^t)$ is the gradient of $-l(A, S, B)/N$ with respect to S at $(A^{t+1}, S^t, B^t)$. In addition, $\lambda_t$ is the step size of the proximal gradient update, chosen by line search to satisfy
\[
H_\gamma(A^{t+1}, S^{t+1}, B^t) < H_\gamma(A^{t+1}, S^t, B^t). \tag{17}
\]
3. Update
\[
B^{t+1} \leftarrow B^t - \beta_t\, g_B(A^{t+1}, S^{t+1}, B^t),
\]
where $g_B(A^{t+1}, S^{t+1}, B^t)$ is the gradient of $-l(A, S, B)/N$ with respect to B evaluated at $(A^{t+1}, S^{t+1}, B^t)$. The step size $\beta_t$ is chosen by line search, such that $H_\gamma(A^{t+1}, S^{t+1}, B^{t+1}) < H_\gamma(A^{t+1}, S^{t+1}, B^t)$.
4. Iterate between the above three steps until convergence.
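The structure of Algorithm 1 can be illustrated on a toy problem. In the sketch below, a scalar quadratic stands in for $-l(A, S, B)/N$ (this stand-in, the step-halving line search, and all function names are assumptions made for illustration); the three blocks are updated in turn, with a gradient step for a and b and a proximal gradient step for s, and a step is accepted only if it strictly lowers the objective.

```python
def soft_threshold(v, tau):
    """Scalar soft-thresholding: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def smooth_part(a, s, b):
    # Toy smooth convex function standing in for -l(A, S, B)/N (an assumption).
    return 0.5 * (a + s - 2.0) ** 2 + (s - b) ** 2 + 0.5 * (b - 1.0) ** 2


def gradient(a, s, b):
    # Partial derivatives of smooth_part with respect to a, s, and b.
    g_a = a + s - 2.0
    g_s = (a + s - 2.0) + 2.0 * (s - b)
    g_b = -2.0 * (s - b) + (b - 1.0)
    return g_a, g_s, g_b


def objective(a, s, b, gamma):
    """Toy analogue of H_gamma: smooth part plus an L1 penalty on s."""
    return smooth_part(a, s, b) + gamma * abs(s)


def line_search(propose, gamma, h_current, step=1.0, shrink=0.5, max_halvings=50):
    """Return the first proposal along step, step/2, ... that strictly lowers H."""
    for _ in range(max_halvings):
        point = propose(step)
        h_new = objective(point[0], point[1], point[2], gamma)
        if h_new < h_current:
            return point, h_new
        step *= shrink
    return None, h_current  # no decrease found: treat this block as stationary


def alternating_minimization(a, s, b, gamma, num_sweeps=100):
    h = objective(a, s, b, gamma)
    history = [h]
    for _ in range(num_sweeps):
        g_a = gradient(a, s, b)[0]  # Step 1: gradient step in a
        point, h = line_search(lambda t: (a - t * g_a, s, b), gamma, h)
        if point is not None:
            a, s, b = point
        g_s = gradient(a, s, b)[1]  # Step 2: proximal gradient step in s
        point, h = line_search(
            lambda t: (a, soft_threshold(s - t * g_s, t * gamma), b), gamma, h)
        if point is not None:
            a, s, b = point
        g_b = gradient(a, s, b)[2]  # Step 3: gradient step in b
        point, h = line_search(lambda t: (a, s, b - t * g_b), gamma, h)
        if point is not None:
            a, s, b = point
        history.append(h)
    return (a, s, b), history
```

With `gamma = 0.2` and starting values of zero, the iterates approach (a, s, b) = (1.3, 0.7, 0.8), the minimizer of this toy objective, and the recorded objective values are nonincreasing, mirroring the chain of inequalities that the algorithm is designed to satisfy. With `gamma = 1.0` the penalized coordinate s stays exactly at zero.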
We make a few remarks.
Remark 1 −l(A, S,B)/N is a smooth function of (A, S,B) and its gradients gA, gS,
and gB have analytic forms.
Remark 2 In Step 2, $H_\gamma(A^{t+1}, S, B^t)$, when viewed as a function of S (with $A^{t+1}$ and $B^t$ fixed), is the sum of a smooth convex function $-l(A^{t+1}, S, B^t)/N$ and a nonsmooth convex function $h_\gamma(S) = \gamma \sum_{i \neq j} |s_{ij}|$. Therefore, according to the discussion in Appendix A.1, there exists a sufficiently small step size $\lambda_t$, such that (17) is satisfied.
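Remark 3 The proximal operator $P_{\lambda_t, h_\gamma}(\cdot)$ has a closed form solution. Let $\tilde S = (\tilde s_{ij}) = S^t - \lambda_t\, g_S(A^{t+1}, S^t, B^t)$. Then $s^{t+1}_{jj} = \tilde s_{jj}$ and, for $i \neq j$, $s^{t+1}_{ij}$ is obtained by soft-thresholding:
\[
s^{t+1}_{ij} =
\begin{cases}
\frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) - \lambda_t \gamma & \text{if } \frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) > \gamma\lambda_t; \\
0 & \text{if } \big|\frac{1}{2}(\tilde s_{ij} + \tilde s_{ji})\big| \le \gamma\lambda_t; \\
\frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) + \lambda_t \gamma & \text{if } \frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) < -\gamma\lambda_t.
\end{cases}
\]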
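The closed form in Remark 3 can be written out directly. The sketch below assumes that the intermediate matrix is stored as a list of lists; the function name `prox_offdiag_l1` is hypothetical.

```python
def soft_threshold(v, tau):
    """Scalar soft-thresholding: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_offdiag_l1(s_tilde, lam, gamma):
    """Proximal step for h(S) = gamma * sum_{i != j} |s_ij| over symmetric S.

    Diagonal entries are copied unchanged; each off-diagonal pair is replaced
    by the soft-thresholded average of its two entries, at level lam * gamma.
    """
    k = len(s_tilde)
    threshold = lam * gamma
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        out[i][i] = s_tilde[i][i]
        for j in range(i + 1, k):
            avg = 0.5 * (s_tilde[i][j] + s_tilde[j][i])
            out[i][j] = out[j][i] = soft_threshold(avg, threshold)
    return out


# Hypothetical 3 x 3 input with threshold lam * gamma = 0.1.
s_next = prox_offdiag_l1(
    [[-6.5, 0.9, 0.05],
     [1.1, -6.4, -0.3],
     [0.03, -0.5, -6.6]],
    lam=0.5, gamma=0.2)
```

In this example the pair (1, 2) averages to 1.0 and is shrunk to 0.9, the pair (1, 3) averages to 0.04 and is set to zero, and the pair (2, 3) averages to -0.4 and is shrunk to -0.3. Averaging the two entries keeps the updated matrix symmetric even when the input is only approximately symmetric.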
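Remark 4 Given estimates $A_\gamma$, $S_\gamma$, and $B_\gamma$ from optimizing $H_\gamma(A, S, B)$, the estimates under the parametrization in (11) can be obtained by
\[
A_\gamma \leftarrow A_\gamma D, \qquad S_\gamma \leftarrow S_\gamma, \qquad \Sigma_\gamma \leftarrow D^{-1} B_\gamma (B_\gamma)^\top D^{-1},
\]
where $D = \mathrm{diag}(d_{11}, ..., d_{KK})$ is a K × K diagonal matrix with
\[
d_{kk} = \sqrt{\big(B_\gamma (B_\gamma)^\top\big)_{kk}}.
\]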
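The transformation in Remark 4 is a rescaling by the diagonal of $BB^\top$. A minimal sketch with plain nested lists follows; the function name `to_unit_variance` and the numerical inputs are hypothetical.

```python
import math


def to_unit_variance(A, B):
    """Map (A, B) to (A D, D^{-1} B B^T D^{-1}) with d_kk = sqrt((B B^T)_kk)."""
    K = len(B)
    bbt = [[sum(B[k][m] * B[l][m] for m in range(K)) for l in range(K)]
           for k in range(K)]
    d = [math.sqrt(bbt[k][k]) for k in range(K)]
    sigma = [[bbt[k][l] / (d[k] * d[l]) for l in range(K)] for k in range(K)]
    A_scaled = [[row[k] * d[k] for k in range(K)] for row in A]
    return A_scaled, sigma


# Hypothetical K = 2 example: B is lower triangular with unit diagonal.
A_new, Sigma = to_unit_variance(A=[[0.5, 0.0], [0.0, 0.4]],
                                B=[[1.0, 0.0], [0.75, 1.0]])
```

Here $BB^\top$ has diagonal (1, 1.5625), so D = diag(1, 1.25), the rescaled Σ has unit diagonal with off-diagonal entry 0.6, and the second loading becomes 0.5. By construction, the product $A \Sigma A^\top$ after the transformation equals $A BB^\top A^\top$ before it.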