Robust Measurement via A Fused Latent and Graphical Item Response Theory Model

Yunxiao Chen, Xiaoou Li, Jingchen Liu, Zhiliang Ying

November 26, 2016

Abstract

Item response theory (IRT) plays an important role in psychological and educational

measurement. Unlike classical test theory, IRT models aggregate

item-level information, yielding more accurate measurements. Most IRT models

rely on the so-called local independence assumption, which may not be satisfied in

practice, especially for a large number of items. Results in the literature and sim-

ulation studies in this paper reveal that misspecifying the local independence as-

sumption may result in inaccurate measurements and differential item functioning.

To provide more robust measurements, we propose a Fused Latent and Graphical

IRT (FLaG-IRT) model that can offset the effect of unknown local dependence.

The new model contains a confirmatory latent variable component, which measures

the targeted latent traits, and a graphical component, which captures the local

dependence. An efficient proximal algorithm is proposed for the parameter estima-

tion and structure learning of the local dependence. The proposed approach can

substantially reduce the local dependence induced measurement bias. The model

can be applied to measure both a unidimensional latent trait and multidimensional

latent traits.

KEY WORDS: item response theory, local dependence, robust measurement, differential

item functioning, graphical model, Ising model, pseudo-likelihood, regularized estimator,

proximal algorithm, Eysenck's Personality Questionnaire-Revised

1 Introduction

Item response theory (IRT) models (Rasch, 1960; Lord and Novick, 1968) play an important

role in measurement theory. Unlike classical test theory, IRT models integrate

item-level information for measurement and are regarded as a superior

measurement tool (Embretson and Reise, 2000). IRT models

have become the preferred method for developing scales, especially when high-stakes

decisions are involved. In particular, IRT models are used in the National Assessment

of Educational Progress (NAEP), the Scholastic Aptitude Test (SAT), and the Graduate Record

Examination (GRE). Popular IRT models include single-factor models, such as the

Rasch model (Rasch, 1960), the two-parameter logistic model, and the three-parameter

logistic model (Birnbaum, 1968), and multiple factor models, such as the multidimen-

sional two-parameter logistic (M2PL) model (McKinley and Reckase, 1982; Reckase,

2009).

We consider the multidimensional two-parameter logistic model as a building block.

There are N individuals responding to J test items and the responses from an individual

are recorded by a vector X = (X1, ..., XJ)⊤. To simplify the presentation, we only consider

binary items, i.e., Xj ∈ {0, 1}, but emphasize that the proposed approach is flexible

enough to be generalized to the analysis of polytomous items (Chen, 2016). Associated with

each response vector is an unobserved continuous latent vector θ ∈ R^K, representing the

latent characteristics that are measured. The conditional distribution of each response

given the latent vector follows a logistic model

$$
f_j(\theta) \triangleq P(X_j = 1 \mid \theta) = \frac{e^{a_j^\top \theta + b_j}}{1 + e^{a_j^\top \theta + b_j}},
$$

where fj(θ) is known as the item response function and aj = (aj1, ..., ajK)⊤ are known as

the factor loading parameters. When used in a confirmatory manner, the model imposes

constraints on the factor loading parameters, that is, parameter ajk is set to be 0, if item

j is not designed to measure the kth latent trait. Specifically, such design information

is characterized by a J × K item-trait relationship matrix, which we refer to as the

Λ-matrix, Λ = (λjk)J×K = (1{ajk ≠ 0})J×K. The Λ-matrix is usually provided by the item

designers and is often assumed to be known. When information about the Λ-matrix is

vague, data-driven approaches for learning the Λ-matrix have been proposed (Liu et al., 2012,

2013; Chen et al., 2015; Sun et al., 2016).
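The item response function above can be sketched in a few lines of Python. This is only an illustration: the function name `m2pl_prob` and the numerical values are assumptions made here, not taken from the paper. The zero in the second loading mimics a confirmatory constraint λjk = 0.

```python
import math

def m2pl_prob(a_j, b_j, theta):
    """P(X_j = 1 | theta) under the M2PL model: logistic(a_j' theta + b_j)."""
    eta = sum(a * t for a, t in zip(a_j, theta)) + b_j
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical item with K = 2 that measures only the first trait,
# i.e. lambda_j = (1, 0), so the second loading is fixed at 0.
a_j = [1.5, 0.0]
b_j = -0.5

p_low = m2pl_prob(a_j, b_j, [-1.0, 2.0])   # low level on trait 1
p_high = m2pl_prob(a_j, b_j, [1.0, -2.0])  # high level on trait 1
```

Because the second loading is zero, only the first coordinate of θ moves the response probability, which is what the Λ-matrix constraint is meant to encode.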

One common assumption of standard IRT models, including the M2PL model, is the

so-called local independence assumption, saying that X1, X2, ..., XJ are conditionally

independent, given the value of θ. That is

$$
P(X_1 = x_1, \ldots, X_J = x_J \mid \theta) = P(X_1 = x_1 \mid \theta)\, P(X_2 = x_2 \mid \theta) \cdots P(X_J = x_J \mid \theta), \tag{1}
$$

for each x = (x1, ..., xJ)⊤ ∈ {0, 1}^J. The local independence assumption implies that,

although the items may be highly intercorrelated in the test as a whole, this is caused only

by the items’ sharing the common latent traits measured by the test. When the trait levels

are controlled, local independence implies that no relationship remains between the items

(Embretson and Reise, 2000).

In recent years, computer-based and mobile-app-based instruments are becoming

prevalent in educational and psychological studies, where a large number of responses

with complex dependence structure are observed. For these tests, a small number of

latent traits may not adequately capture the dependence structure among the responses.

It is known that there are many possible causes for local dependence, including order

effect where responses to early items affect the responses to subsequent items, and shared

content effect where additional dependence is caused by a common stimulus from shared

content (Hoskens and De Boeck, 1997; Knowles and Condon, 2000; Schwarz, 1999; Yen,

1993). Generally speaking, the item response process could be complicated, and affected

by many external and internal factors. Consequently, a low-dimensional latent factor

model may not be adequate to capture all the dependence structure within a test, which

may explain the frequently observed phenomenon of model lack of fit in empirical studies

(Reise et al., 2011; Yen, 1984, 1993; Ferrara et al., 1999).

Differential item functioning (DIF) refers to a test item functioning differently for

different groups, in the sense that the probability of a correct response is associated

with group membership for examinees of comparable ability (Holland and Thayer, 1988).

Controlling DIF is an important aspect of test development, as it helps ensure test fairness.

DIF may be caused by the presence of additional traits in some items (e.g. Camilli, 1992),

and thus is closely related to local dependence. That is, the nuisance traits that cause

DIF also induce the local dependence structure. Thus, DIF could be reduced if the local

dependence structure can be adjusted in the measurement model.

In this paper, we propose a Fused Latent and Graphical IRT (FLaG-IRT) model

that incorporates local dependence and includes the test-design information in

the Λ-matrix as prior knowledge. The model extends the Fused Latent and Graphical (FLaG)

model proposed in Chen et al. (2016) by incorporating the loading structure information.

The proposed model adds a sparse graphical component upon a multidimensional item

response theory (MIRT) model to capture the local dependence. The idea is that for a

well designed test, the common dependence among responses has been well explained

by the latent traits and the remaining dependence can be characterized by a sparse

graphical structure.

In psychometrics, there is an existing literature on modeling the local dependence

structure, including the bi-factor and testlet models (Gibbons and Hedeker, 1992; Gib-

bons et al., 2007; Reise et al., 2007; Bradlow et al., 1999; Wainer et al., 2000; Li et al.,

2006; Cai et al., 2011), copula based approaches (Braeken et al., 2007; Braeken, 2011),

and models with fixed interaction parameters (Hoskens and De Boeck, 1997; Ip, 2002;

Ip et al., 2004; Ip, 2010). Most of these approaches require prior information on the

local dependence structure, such as knowing the item clusters and assuming the local

independence between item clusters.

The rest of the paper is organized as follows. In Section 2, the FLaG-IRT model is

introduced and then in Section 3, the statistical analysis based on the model, including

parameter estimation and model selection, is presented. Results of simulation studies

are reported in Section 4. Section 5 contains an application to a real data example. A

summary, along with discussions, is given in Section 6.

2 FLaG-IRT Model

2.1 Two Basic Models

The FLaG-IRT model is built upon the

multidimensional two-parameter logistic (M2PL) model and the Ising model (Ising, 1925).

We begin by describing these two building-block models.

2.1.1 MIRT Model

The M2PL model is one of the most popular multidimensional IRT models for binary

responses. The item response function of the M2PL model is given by

$$
P(X_j = 1 \mid \theta) = \frac{e^{a_j^\top \theta + b_j}}{1 + e^{a_j^\top \theta + b_j}}.
$$

The item-trait relationship is incorporated by constraints contained in a pre-specified

matrix Λ = (λjk)J×K , λjk ∈ {0, 1}. λjk = 0 means that item j is not associated with

latent trait k and the corresponding loading ajk is constrained to be 0. The item response

function can be further written as

$$
P(X_j = x_j \mid \theta) = \frac{e^{(a_j^\top \theta + b_j) x_j}}{1 + e^{a_j^\top \theta + b_j}} \propto \exp\{(a_j^\top \theta + b_j) x_j\}.
$$

The notation “∝” above is typically used to define probability density or mass functions;

it means that the left-hand side and the right-hand side differ by a

normalizing constant that depends only on the parameters and is free of the value of the

random variable/vector. The constant can be obtained by summing or integrating out

the random variable/vector. Such a constant sometimes can be difficult to obtain.

Under the M2PL model, the joint distribution of the responses X = (X1, ..., XJ)⊤

given θ can be further written as, due to the local independence assumption,

$$
P(X = x \mid \theta) = \prod_{j=1}^{J} P(X_j = x_j \mid \theta) \propto \exp\{\theta^\top A^\top x + b^\top x\}, \tag{2}
$$

where A = (ajk)J×K is known as the factor loading matrix and b = (b1, ..., bJ)⊤. In

particular, when K = 1, the model is known as the two-parameter logistic model (2PL;

Birnbaum, 1968).
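As a sanity check on (2), the following sketch compares the product of item-level M2PL probabilities with the exponential form normalized by brute force over all 2^J response patterns. The loadings, intercepts, and θ are hypothetical values chosen here for illustration, and J = 3 keeps the enumeration trivial.

```python
import itertools
import math

def linear_predictor(j, A, b, theta):
    """a_j' theta + b_j for item j."""
    return sum(A[j][k] * theta[k] for k in range(len(theta))) + b[j]

def joint_from_items(x, A, b, theta):
    """Product of item-level M2PL probabilities (local independence)."""
    p = 1.0
    for j, xj in enumerate(x):
        pj = 1.0 / (1.0 + math.exp(-linear_predictor(j, A, b, theta)))
        p *= pj if xj == 1 else 1.0 - pj
    return p

def joint_from_exponential_form(x, A, b, theta):
    """exp{theta' A' x + b' x}, normalized by summing over all patterns."""
    def weight(y):
        return math.exp(sum(linear_predictor(j, A, b, theta) * y[j]
                            for j in range(len(y))))
    z = sum(weight(y) for y in itertools.product([0, 1], repeat=len(x)))
    return weight(x) / z

# Hypothetical parameters: J = 3 items, K = 2 traits.
A = [[1.0, 0.0], [0.8, 0.0], [0.0, 1.2]]
b = [-0.3, 0.5, 0.1]
theta = [0.7, -1.1]

max_gap = max(
    abs(joint_from_items(x, A, b, theta)
        - joint_from_exponential_form(x, A, b, theta))
    for x in itertools.product([0, 1], repeat=3)
)
```

The two computations agree up to floating-point error because the normalizing constant of the exponential form factorizes into the product of the item-level denominators.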

2.1.2 Ising Model

We now present the Ising model that is used to characterize the local dependence struc-

ture on top of the M2PL model. The Ising model is an undirected graphical model

(e.g. Koller and Friedman, 2009). It encodes the conditional independence relationships

among the Xj’s through the topological structure of a graph, which can greatly facilitate

the interpretation and understanding of the dependence structure. The Ising model

originated in statistical physics (Ising, 1925).

Specification of the Ising model consists of an undirected graph G = (V,E), where

V and E are the sets of vertices and edges respectively. The vertex set V = {1, 2, ..., J},

corresponds to the random variables, X1, ..., XJ . The graph is said to be undirected in

the sense that (i, j) ∈ E, if and only if (j, i) ∈ E. The Ising model associated with an

Figure 1: The set C separates A from B. All paths from A to B pass through C.

undirected graph G = (V,E) is specified as

$$
P(X = x) \propto \exp\left\{\tfrac{1}{2} x^\top S x\right\}, \tag{3}
$$

where S = (sij)J×J is a symmetric matrix such that sij ≠ 0 if and only if (i, j) ∈ E.

The conditional independence relationship in the Ising model is encoded by the topo-

logical structure of the graph. More precisely, let A,B and C be nonoverlapping subsets

of V and A ∪ B ∪ C = V . We further let XA, XB, and XC be the random vectors

associated with the sets A, B, and C, respectively, i.e., XA = (Xi : i ∈ A) and so on.

We say A and B are separated by C, if every path from a vertex in A to a vertex in B

includes at least one vertex in C, as illustrated by an example in Figure 1. In Figure 1,

A = {1, 2}, B = {4, 5}, and C = {3}, and all paths from A to B pass through C. For

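example, the path (1 → 3 → 4), which connects vertices 1 and 4, passes through vertex 3.

Under the Ising model, if C separates A from B, then XA and XB are conditionally independent given XC.

In particular, (i, j) ∉ E implies that Xi and Xj are conditionally independent given all the other variables. When C is an

empty set, the separation between A and B implies their independence.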
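The separation property can be verified numerically on a small graph. The sketch below assumes a five-vertex graph in the spirit of Figure 1, with A = {1, 2}, B = {4, 5}, and C = {3}. The exact edge set and all parameter values are hypothetical, since the figure itself is not reproduced here. Vertices are indexed from 0 in the code.

```python
import itertools
import math

J = 5
# Hypothetical edges (0-based): no edge joins {0, 1} directly to {3, 4},
# so every path between the two sets passes through vertex 2.
edges = {(0, 1): 0.8, (0, 2): -0.5, (1, 2): 0.6,
         (2, 3): 1.0, (2, 4): -0.7, (3, 4): 0.4}
S = [[0.0] * J for _ in range(J)]
for (i, j), s in edges.items():
    S[i][j] = S[j][i] = s
for j, d in enumerate([-0.2, 0.3, 0.1, -0.4, 0.2]):
    S[j][j] = d

def weight(x):
    """Unnormalized Ising probability exp{x' S x / 2}."""
    return math.exp(0.5 * sum(S[i][j] * x[i] * x[j]
                              for i in range(J) for j in range(J)))

patterns = list(itertools.product([0, 1], repeat=J))
z = sum(weight(x) for x in patterns)
prob = {x: weight(x) / z for x in patterns}

def marginal(fixed):
    """P(X_v = value for every (v, value) in `fixed`)."""
    return sum(p for x, p in prob.items()
               if all(x[v] == val for v, val in fixed.items()))

# Largest violation of P(X_A, X_B | X_C) = P(X_A | X_C) P(X_B | X_C).
max_gap = 0.0
for xa in itertools.product([0, 1], repeat=2):
    for xb in itertools.product([0, 1], repeat=2):
        for xc in (0, 1):
            pc = marginal({2: xc})
            joint = marginal({0: xa[0], 1: xa[1], 2: xc,
                              3: xb[0], 4: xb[1]}) / pc
            pa = marginal({0: xa[0], 1: xa[1], 2: xc}) / pc
            pb = marginal({3: xb[0], 4: xb[1], 2: xc}) / pc
            max_gap = max(max_gap, abs(joint - pa * pb))
```

With these edges the unnormalized probability, for a fixed value of the separating variable, splits into a factor involving only XA and a factor involving only XB, which is why the gap is zero up to rounding.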

2.2 FLaG-IRT Model

The FLaG-IRT model combines the M2PL model (2) and the Ising model (3) to construct

a joint item response function. More precisely, the conditional distribution of X given

θ is

$$
P(X = x \mid \theta, A, S) \propto \exp\left\{\theta^\top A^\top x + \tfrac{1}{2} x^\top S x\right\}. \tag{4}
$$

We make two technical remarks on the conditional model (4). First, we remove the

term b⊤x, because it is absorbed into the diagonal terms of S. That is, xj ∈ {0, 1} and

thus xj = xj^2. Consequently, the squared terms become linear: ∑_{j=1}^J sjj xj^2 = ∑_{j=1}^J sjj xj.

Second, the conditional model (4) is an Ising model with parameter matrix S(θ), where

sij(θ) = sij for i ≠ j and sjj(θ) = 2aj⊤θ + sjj (the factor 2 offsets the 1/2 in front of the quadratic form). In addition, the graph of model (4) is the

same as that encoded by S, that is, E = {(i, j) : sij ≠ 0, i ≠ j}.
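The first remark can be illustrated with a small check: for binary x, a linear term b⊤x equals the quadratic form x⊤Sx/2 when S is diagonal with sjj = 2bj. The intercept values below are hypothetical and serve only as an example.

```python
import itertools

def half_quadratic_form(x, S):
    """x' S x / 2 for a square matrix S given as nested lists."""
    n = len(x)
    return 0.5 * sum(S[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

# Hypothetical intercepts for J = 3 items.
b = [-1.2, 0.4, 0.9]
# Diagonal matrix with s_jj = 2 b_j and no off-diagonal entries.
S_diag = [[2.0 * b[i] if i == j else 0.0 for j in range(3)] for i in range(3)]

max_gap = max(
    abs(sum(bj * xj for bj, xj in zip(b, x)) - half_quadratic_form(x, S_diag))
    for x in itertools.product([0, 1], repeat=3)
)
```

The identity relies only on xj^2 = xj, so it would fail for non-binary responses.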

To assist understanding, Figure 2 provides graphical representations of the MIRT

model and the FLaG-IRT model. The left panel shows a graphical representation of the

marginal distribution of responses, where there is an edge between each pair of responses.

Under the conditional independence assumption (1) of the MIRT model, there exists a

latent vector θ. If we include θ in the graph, then there is no edge among Xjs as in

the middle panel. The concern is that this conditional independence structure may be

oversimplified and there is additional dependence not attributable to the latent traits.

The FLaG-IRT model (right panel) is a natural extension of the MIRT model (middle

panel), allowing edges among Xjs even if θ is included. The additional edges capture the

dependence among Xjs not explained by θ. Due to the presence of the latent variables,

it is likely that we only need a small number of additional edges to capture the local

dependence. Furthermore, the loading structure in Λ is reflected by the edges between

θks and the responses Xjs in the middle and right panels.

We consider the following joint distribution of (X,θ),

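$$
f(x, \theta \mid A, S, \Sigma) = \frac{1}{z_0(A, S, \Sigma)} \exp\left\{-\tfrac{1}{2} \theta^\top \Sigma^{-1} \theta + \theta^\top A^\top x + \tfrac{1}{2} x^\top S x\right\}, \tag{5}
$$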

Figure 2: Graphical illustration of the MIRT model and the FLaG-IRT model.

where (A, S,Σ) are the model parameters and z0(A, S,Σ) is the normalizing constant,

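$$
z_0(A, S, \Sigma) = \sum_{x \in \{0,1\}^J} \int \exp\left\{-\tfrac{1}{2} \theta^\top \Sigma^{-1} \theta + \theta^\top A^\top x + \tfrac{1}{2} x^\top S x\right\} d\theta.
$$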

Note that under this joint distribution, the joint item response function, i.e., the condi-

tional distribution of X given θ, is consistent with (4). Under this joint distribution, a

specific prior distribution of θ is implicitly assumed and the posterior distribution of θ

becomes Gaussian, an assumption discussed in Holland (1990). As will be described in

the sequel, this prior distribution of θ brings technical convenience in the data analysis.

We refer the readers to Holland (1990) for more justifications for this prior. Moreover,

the posterior covariance matrix of θ is Σ and the posterior mean of θ is given by

$$
E(\theta \mid X = x) = \Sigma A^\top x, \tag{6}
$$

a weighted sum of the responses. Once A and Σ are estimated from the data, it is

reasonable to score each individual by ΣA⊤x.
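A minimal sketch of the scoring rule in (6) follows. The function name and the loading and covariance values are hypothetical and would be replaced by estimates in practice.

```python
def posterior_mean_score(x, A, Sigma):
    """Score Sigma A' x, the posterior mean of theta in (6)."""
    K = len(Sigma)
    # A' x: for each trait, the loading-weighted sum of the responses.
    At_x = [sum(A[j][k] * x[j] for j in range(len(x))) for k in range(K)]
    # Sigma (A' x): mix the per-trait sums through the covariance matrix.
    return [sum(Sigma[k][l] * At_x[l] for l in range(K)) for k in range(K)]

# Hypothetical values: J = 4 items, K = 2 traits, unit variances.
A = [[0.7, 0.0], [0.6, 0.0], [0.0, 0.5], [0.0, 0.8]]
Sigma = [[1.0, 0.3], [0.3, 1.0]]
score = posterior_mean_score([1, 0, 1, 1], A, Sigma)
```

Note that S does not enter the score directly; it affects scoring only through its influence on the estimates of A and Σ.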

In the specification (5), A, Σ, S, and the graph E induced by S (equivalently, the

nonzero pattern of matrix S) can be estimated from the data. Similar to the M2PL

model, we pre-specify a binary matrix Λ = (λjk)J×K for the confirmatory structure and

impose constraint that ajk = 0 if λjk = 0. The latent vector θ is not directly observable.

The estimation is based on the marginal likelihood,

$$
P(X = x \mid A, S, \Sigma) = \int f(x, \theta \mid A, S, \Sigma)\, d\theta,
$$

where f(x,θ|A, S,Σ) is given in (5).

Under the above model specification, the marginal distribution of X still follows an

Ising model, that is

$$
P(X = x \mid A, S, \Sigma) = \int f(x, \theta \mid A, S, \Sigma)\, d\theta \propto \exp\left\{\tfrac{1}{2} x^\top (A \Sigma A^\top + S) x\right\}. \tag{7}
$$

This is a second-order generalized log-linear model (Holland, 1990; Laird, 1991).
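The step from (5) to (7) follows from completing the square in θ. The sketch below spells it out, assuming Σ is positive definite so that the Gaussian integral is finite. The symbol μ(x) is shorthand introduced here and is not notation from the paper.

```latex
\begin{aligned}
-\tfrac{1}{2}\theta^\top \Sigma^{-1}\theta + \theta^\top A^\top x
  &= -\tfrac{1}{2}\bigl(\theta - \mu(x)\bigr)^\top \Sigma^{-1} \bigl(\theta - \mu(x)\bigr)
     + \tfrac{1}{2} x^\top A \Sigma A^\top x,
  \qquad \mu(x) = \Sigma A^\top x, \\
\int \exp\Bigl\{-\tfrac{1}{2}\theta^\top \Sigma^{-1}\theta + \theta^\top A^\top x\Bigr\}\, d\theta
  &= (2\pi)^{K/2}\, |\Sigma|^{1/2} \exp\Bigl\{\tfrac{1}{2} x^\top A \Sigma A^\top x\Bigr\}.
\end{aligned}
```

Multiplying by exp{x⊤Sx/2} and dropping factors that do not depend on x gives (7). The first line also shows that θ given X = x is Gaussian with mean ΣA⊤x and covariance Σ, in agreement with (6).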

2.3 Bi-factor Model as a Special Case

The bi-factor model is one of the most popular models that takes local dependence into

account. This model is a special case of the M2PL model, assuming that there is a

unidimensional general factor θg that is associated with all items and is the target of measurement.

Besides the general factor, there exist nuisance factors θ1, ..., θM associated with

M nonoverlapping item clusters C1, C2, ..., CM, where each item cluster has at least

two items and there may be items not belonging to any of these item clusters. As we

will discuss in the sequel, the FLaG-IRT model is able to capture such a structure. One

of the advantages of the FLaG-IRT model is that there is no need to specify the item

clusters a priori; they are learned from the data.

The bi-factor model based on a logistic link (e.g. Cai et al., 2011) can be viewed as

a special M2PL model with

$$
P(X = x \mid \theta) \propto \exp\{\theta^\top A^\top x + b^\top x\},
$$

where b = (b1, ..., bJ)⊤ and A = (ag, a1, ..., aM). In particular, the jth element of ak is

Figure 3: Graphical representation of a bi-factor model, the corresponding FLaG-IRT model, and the local dependence graph.

zero if item j is not in the kth item cluster, i.e., j ∉ Ck. If we use the specific prior of θ

in the FLaG-IRT model and further assume Σ to be an identity matrix, then the marginal distribution of X is

$$
P(X = x) \propto \exp\left\{\tfrac{1}{2} x^\top a_g a_g^\top x + \tfrac{1}{2} x^\top S x\right\}, \tag{8}
$$

where sjj = 2bj, and sij = sji = 0 when items i and j do not belong to the same item

cluster and sij = sji = aikajk when both items belong to the kth cluster, which admits

the same form as the marginal FLaG-IRT model in (7). In other words, the graphical

model component of the FLaG-IRT model can take the place of the specific factors in

the bi-factor model. The corresponding graph encoded by the S matrix in (8) is sparse,

when each item cluster has only a small number of items. For example, if each item

cluster has only two items, then the sparsity level of the graph, defined as the ratio of

the number of edges in the graph and the total number of item pairs, is 1/(J−1), which

can be as small as 3% with J = 30 items. Figure 3 presents an example of a bi-factor

model, the corresponding FLaG-IRT model, and the local dependence graph. In other

words, when the specific prior for θ is assumed, the bi-factor model becomes a special

case of the FLaG-IRT model with one latent trait and a sparse local dependence graph.
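The sparsity level quoted above is easy to reproduce. The sketch assumes, as in the example, that every item belongs to exactly one cluster and that all clusters have the same size. The function name is hypothetical.

```python
def cluster_graph_sparsity(J, cluster_size=2):
    """Edges within equal-sized disjoint clusters, divided by all item pairs.

    Assumes J is a multiple of cluster_size and that every item is in a cluster.
    """
    n_clusters = J // cluster_size
    n_edges = n_clusters * cluster_size * (cluster_size - 1) // 2
    n_pairs = J * (J - 1) // 2
    return n_edges / n_pairs

sparsity_30 = cluster_graph_sparsity(30)  # equals 1 / (J - 1) = 1 / 29
```

With J = 30 and clusters of two items there are 15 edges among 435 item pairs, a ratio of 1/29, or roughly 3%.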

3 FLaG-IRT Analysis

3.1 Regularized Pseudo-likelihood Estimation

In this section, we discuss estimation and dimension reduction of the FLaG-IRT model.

The most natural approach would be to maximize the marginal likelihood of the

responses given in (7). Unfortunately, the evaluation of (7) involves computing the

normalizing constant,

$$
z(A, S, \Sigma) = \sum_{x \in \{0,1\}^J} \exp\left\{\tfrac{1}{2} x^\top (A \Sigma A^\top + S) x\right\},
$$

which requires a summation over all 2^J possible response patterns and thus is computationally

infeasible for even a relatively small J. To bypass this, we propose a pseudo-likelihood

as a surrogate (Besag, 1974), which is based on the conditional distribution

of Xj given the rest X−j = (X1, ..., Xj−1, Xj+1, ..., XJ),

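$$
P(X_j = 1 \mid X_{-j} = x_{-j}, A, S, \Sigma) = \frac{\exp\left\{\tfrac{1}{2}(l_{jj} + s_{jj}) + \sum_{i \neq j} (l_{ij} + s_{ij}) x_i\right\}}{1 + \exp\left\{\tfrac{1}{2}(l_{jj} + s_{jj}) + \sum_{i \neq j} (l_{ij} + s_{ij}) x_i\right\}},
$$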

where L = (lij)J×J = AΣA⊤. Note that the above conditional distribution takes a logistic

regression form. Following Besag (1974), we let Lj(A, S, Σ; x) = P(Xj = xj | X−j =

x−j, A, S, Σ) and define the pseudo-likelihood function

$$
L(A, S, \Sigma) = \prod_{i=1}^{N} \prod_{j=1}^{J} L_j(A, S, \Sigma; x_i), \tag{9}
$$

where xi is the response vector of individual i.
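The conditional probability and the pseudo-likelihood (9) can be sketched as follows. The code works with the combined matrix M = AΣA⊤ + S, so that mij = lij + sij. The function names and the 3 × 3 matrix are hypothetical. As a check, the closed-form conditional probability is compared with a brute-force ratio of unnormalized Ising probabilities.

```python
import math

def conditional_prob_one(j, x, M):
    """P(X_j = 1 | X_{-j} = x_{-j}) for the Ising model exp{x' M x / 2}."""
    eta = 0.5 * M[j][j] + sum(M[i][j] * x[i] for i in range(len(x)) if i != j)
    return 1.0 / (1.0 + math.exp(-eta))

def neg_log_pseudo_likelihood(X, M):
    """-(1/N) log of (9): average over persons of summed log conditionals."""
    total = 0.0
    for x in X:
        for j, xj in enumerate(x):
            p1 = conditional_prob_one(j, x, M)
            total += math.log(p1 if xj == 1 else 1.0 - p1)
    return -total / len(X)

# Hypothetical symmetric matrix M = A Sigma A' + S for J = 3 items.
M = [[-0.4, 0.9, 0.0],
     [0.9, 0.2, -0.6],
     [0.0, -0.6, 0.5]]

def weight(x):
    """Unnormalized joint probability exp{x' M x / 2}."""
    return math.exp(0.5 * sum(M[i][j] * x[i] * x[j]
                              for i in range(3) for j in range(3)))

# Brute force: P(X_2 = 1 | X_1 = 1, X_3 = 1) as a ratio of joint weights.
brute = weight((1, 1, 1)) / (weight((1, 1, 1)) + weight((1, 0, 1)))
formula = conditional_prob_one(1, (1, 0, 1), M)

example_loss = neg_log_pseudo_likelihood([(1, 0, 1), (0, 0, 0), (1, 1, 0)], M)
```

Each conditional probability requires only a sum over J terms, so evaluating (9) costs on the order of N·J^2 operations rather than the 2^J needed for the normalizing constant.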

To incorporate the knowledge of the test items, the factor loading matrix A is con-

strained such that ajk = 0 when λjk = 0. Therefore, the unknown parameters in A are

{ajk : λjk = 1}. Since A and Σ appear in the pseudo-likelihood function in the form of

AΣA⊤, additional constraints are needed to ensure their identifiability. This is because,

for example, scaling A by a constant ω can be offset by the corresponding scaling of Σ by

ω^−2. To identify the scale of the latent factors, we impose the constraints σkk = 1, k = 1, ..., K,

which means that the posterior variance of each latent factor must be 1. To avoid rotational indeterminacy,

we assume that, with appropriate column swapping, the Λ matrix contains a K × K

identity submatrix. It means that for each latent factor, there is at least one item that

only measures that factor.

When the graph for local dependence is known, we estimate A, S, and Σ by

maximizing the pseudo-likelihood function

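$$
\begin{aligned}
(\hat{A}, \hat{S}, \hat{\Sigma}) = \arg\min_{A, S, \Sigma}\ & \left\{-\frac{1}{N} \log L(A, S, \Sigma)\right\} \\
\text{s.t. } & a_{jk} = 0 \text{ if } \lambda_{jk} = 0,\ j = 1, \ldots, J,\ k = 1, \ldots, K, \\
& S = S^\top,\ s_{ij} = 0 \text{ if } (i, j) \notin E, \\
& \Sigma \text{ is positive semidefinite},\ \sigma_{kk} = 1,\ k = 1, \ldots, K,
\end{aligned} \tag{10}
$$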

where E is the set of edges of the known graph.

When the graph for local dependence is unknown, which is typically the case in

practice, we impose an assumption that the graph is sparse, that is, the number of

edges in E = {(i, j) : sij ≠ 0} is relatively small. The rationale is that most of the

dependence among responses has been captured by the common latent traits, leaving

the local dependence structure sparse. This assumption is incorporated in the analysis

through selecting a sparse graphical model component based on the data. We would like to

point out that even a sparse local dependence structure (i.e. a local dependence graph

with a relatively small number of edges), if ignored in the measurement, can result in

measurement bias, as illustrated by simulated examples. In addition, the sparse local

dependence graph, once learned from the data, facilitates the understanding of the

measurement and may be used to improve the test design. For example, patterns (e.g.

item clusters) identified from the graph may help the test designers to review the items

and improve the wording.

We propose to use the regularized pseudo-likelihood for simultaneous estimation and

model selection

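$$
\begin{aligned}
(\hat{A}^{\gamma}, \hat{S}^{\gamma}, \hat{\Sigma}^{\gamma}) = \arg\min_{A, S, \Sigma}\ & \left\{-\frac{1}{N} \log L(A, S, \Sigma) + \gamma \sum_{i \neq j} |s_{ij}|\right\} \\
\text{s.t. } & a_{jk} = 0 \text{ if } \lambda_{jk} = 0,\ j = 1, \ldots, J,\ k = 1, \ldots, K, \\
& S = S^\top, \text{ and } \Sigma \text{ is positive semidefinite},\ \sigma_{kk} = 1,\ k = 1, \ldots, K,
\end{aligned} \tag{11}
$$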

where γ is the tuning parameter that controls the sparsity level of the estimated graph

Êγ = {(i, j) : ŝγij ≠ 0, i ≠ j}. At one extreme, when γ is sufficiently large, the estimated

graph becomes degenerate, i.e., no edge, and the responses are conditionally independent

given the latent variables that are measured. The graph becomes more and more dense

as γ decreases.

The optimization problem (11) is nonconvex and nonsmooth, and thus is compu-

tationally nontrivial. An efficient and stable algorithm is developed, which alternates

between minimizing over A, S, and Σ. In particular, a proximal gradient based method

(Parikh et al., 2014) is used in updating S, which avoids the issues due to the non-

smoothness of the function that may occur in standard gradient based optimization

approaches. Details of the algorithm are provided in the appendix. We emphasize that

this algorithm is scalable to very large data sets with a large number of items (e.g. thousands)

and a large number of latent factors (tens or larger), and thus is suitable for

large scale data analysis.
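To illustrate how a proximal gradient update handles the nonsmooth penalty in (11), the sketch below shows one generic step for S: a gradient step on the smooth part of the objective followed by soft-thresholding of the penalized off-diagonal entries. This is a textbook proximal gradient step under assumed inputs (a precomputed gradient and a fixed step size), not the authors' algorithm, whose details are in the appendix. The function names are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_step_S(S, grad, step, gamma):
    """One proximal gradient update of S.

    S and grad are symmetric J x J nested lists; grad is the gradient of the
    smooth loss with respect to S (assumed to be computed elsewhere).
    Diagonal entries are not penalized in (11), so they are not thresholded.
    """
    J = len(S)
    new_S = [[0.0] * J for _ in range(J)]
    for i in range(J):
        for j in range(J):
            moved = S[i][j] - step * grad[i][j]
            new_S[i][j] = moved if i == j else soft_threshold(moved, step * gamma)
    return new_S

# Toy illustration with hypothetical numbers.
S0 = [[1.0, 0.5], [0.5, -2.0]]
g0 = [[0.0, 1.0], [1.0, 0.0]]
S1 = proximal_step_S(S0, g0, step=0.1, gamma=1.0)
```

Soft-thresholding sets small off-diagonal entries exactly to zero, which is how the estimated graph becomes sparse as γ grows.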

3.2 Choice of Tuning Parameters

In the estimation, we construct a solution path of (Âγ, Ŝγ, Σ̂γ) for a sequence of γ values.

We then choose γ based on the Bayesian information criterion (BIC; Schwarz, 1978), which

takes a general form

$$
\mathrm{BIC}(\mathcal{M}) = -2 \log L(\hat{\beta}(\mathcal{M})) + |\mathcal{M}| \log N,
$$

where M is the model under consideration, L(β̂(M)) is the maximal likelihood for model

M, and |M| is the number of free parameters. In this study, we replace the likelihood

function with the pseudo-likelihood function. Specifically, let

$$
\mathcal{M}_{\gamma} = \left\{(A, S, \Sigma) : a_{jk} = 0 \text{ if } \lambda_{jk} = 0,\ S = S^\top,\ s_{ij} = 0 \text{ if } \hat{s}^{\gamma}_{ij} = 0, \text{ and } \Sigma \text{ is positive semidefinite},\ \sigma_{kk} = 1,\ k = 1, \ldots, K\right\}
$$

be the model selected by tuning parameter γ, containing all models having the same

support as Ŝγ. We select the tuning parameter γ, such that the corresponding model

minimizes the pseudo-likelihood-based BIC

$$
\mathrm{BIC}(\mathcal{M}_{\gamma}) = -2 \max_{(A, S, \Sigma) \in \mathcal{M}_{\gamma}} \{\log L(A, S, \Sigma)\} + |\mathcal{M}_{\gamma}| \log N, \tag{12}
$$

where the number of parameters in Mγ is

$$
|\mathcal{M}_{\gamma}| = \sum_{j,k} \lambda_{jk} + J + \sum_{i<j} 1_{\{\hat{s}^{\gamma}_{ij} \neq 0\}} + \frac{(K-1)K}{2}.
$$

Here, ∑_{j,k} λjk counts the number of free parameters in the loading matrix A, J and

∑_{i<j} 1{ŝγij ≠ 0} are the numbers of diagonal and off-diagonal parameters in Ŝγ, and

K(K − 1)/2 is the number of parameters in Σ.

The tuning parameter is finally selected by

$$
\hat{\gamma} = \arg\min_{\gamma} \mathrm{BIC}(\mathcal{M}_{\gamma}). \tag{13}
$$
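The parameter count and the pseudo-likelihood-based BIC can be computed as in the following sketch. The function names and the toy inputs are hypothetical. The maximized log pseudo-likelihood is taken as given, since obtaining it requires the refitting step.

```python
import math

def n_free_parameters(Lambda, S_hat, K):
    """Free loadings + J diagonal entries of S + nonzero off-diagonal
    pairs (i < j) + K(K - 1)/2 free entries of Sigma."""
    J = len(S_hat)
    n_loadings = sum(sum(row) for row in Lambda)
    n_edges = sum(1 for i in range(J) for j in range(i + 1, J)
                  if S_hat[i][j] != 0)
    return n_loadings + J + n_edges + K * (K - 1) // 2

def pseudo_bic(max_log_pl, n_params, N):
    """-2 * (maximized log pseudo-likelihood) + n_params * log N, as in (12)."""
    return -2.0 * max_log_pl + n_params * math.log(N)

# Toy example: J = 4 items, K = 2 traits, one estimated edge between items 1 and 2.
Lambda = [[1, 0], [1, 0], [0, 1], [0, 1]]
S_hat = [[-1.0, 0.4, 0.0, 0.0],
         [0.4, -0.5, 0.0, 0.0],
         [0.0, 0.0, 0.2, 0.0],
         [0.0, 0.0, 0.0, -0.3]]
n_params = n_free_parameters(Lambda, S_hat, K=2)   # 4 + 4 + 1 + 1
bic = pseudo_bic(max_log_pl=-2500.0, n_params=n_params, N=1000)
```

Only pairs with i < j are counted, so each undirected edge contributes one parameter even though it appears twice in the symmetric matrix.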

In addition, the corresponding maximal pseudo-likelihood estimates of A, S, and Σ are

used as the final estimates of A, S, and Σ:

$$
(\hat{A}, \hat{S}, \hat{\Sigma}) = \arg\max_{(A, S, \Sigma) \in \mathcal{M}_{\hat{\gamma}}} \{L(A, S, \Sigma)\}. \tag{14}
$$

3.3 Summary

We summarize the procedure of FLaG-IRT analysis, when the graph for local dependence

is unknown.

1. Select a sequence of γ values, denoted by Γ.

2. Obtain a sequence of models indexed by γ ∈ Γ, based on the regularized estimates

(Âγ, Ŝγ, Σ̂γ) from (11).

3. Among the sequence of models above, select the best fitted model in terms of BIC

value, using (13).

4. Report (Â, Ŝ, Σ̂) from the selected model given by (14), as well as the local dependence

graph given by Ê = {(i, j) : ŝij ≠ 0}.
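The four steps above can be sketched as a small driver loop. The three callables are hypothetical placeholders for the actual estimation routines (the regularized fit of (11), the unpenalized refit of (14), and the parameter count). The toy functions at the bottom only mimic their outputs so that the selection logic can be run.

```python
import math

def select_by_bic(gammas, fit_regularized, refit_on_support, count_params, N):
    """Return (bic, gamma, support, estimates) with the smallest BIC.

    fit_regularized(gamma)    -> estimated edge set (support of S), step 2
    refit_on_support(support) -> (maximized log pseudo-likelihood, estimates)
    count_params(support)     -> number of free parameters of the model
    """
    best = None
    for gamma in gammas:                                   # step 1: grid of gamma
        support = fit_regularized(gamma)                   # step 2
        log_pl, estimates = refit_on_support(support)      # refit as in (14)
        bic = -2.0 * log_pl + count_params(support) * math.log(N)   # (12)
        if best is None or bic < best[0]:                  # step 3: argmin in (13)
            best = (bic, gamma, support, estimates)
    return best                                            # step 4: report

# Toy stand-ins (hypothetical): a smaller gamma yields more edges and a
# slightly higher maximized log pseudo-likelihood.
toy_supports = {1.0: frozenset(),
                0.1: frozenset({(1, 2)}),
                0.01: frozenset({(1, 2), (1, 3), (4, 5)})}
toy_log_pl = {0: -1000.0, 1: -950.0, 3: -949.0}

best = select_by_bic(
    gammas=[1.0, 0.1, 0.01],
    fit_regularized=lambda g: toy_supports[g],
    refit_on_support=lambda s: (toy_log_pl[len(s)], {"edges": sorted(s)}),
    count_params=lambda s: 5 + len(s),
    N=1000,
)
```

In the toy run the middle value of γ wins: the single edge improves the fit substantially, while the two further edges add parameters without enough gain.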

4 Simulation Studies

In this section, we report two simulation studies. First, we provide an illustrative example

showing that ignoring local dependence results in measurement bias (differential item

functioning). Second, we evaluate the FLaG-IRT analysis under various simulation set-

tings.

4.1 Simulation Study 1

Data generation. We generate a data set from a bi-factor model, with N = 3000,

J = 15, and only one item cluster C1 = {1, 2, 3, 4, 5}. Note that the general factor θg and

Item     1      2      3      4      5      6      7      8
ag       1.5    1.5    1.5    1.5    1.5    1.5    1.5    1.5
a1       2      2      2      2      2      0      0      0
b       -1.37  -1.98   1.08  -1.55   0.07  -0.18  -0.72  -1.67

Item     9      10     11     12     13     14     15
ag       1.5    1.5    1.5    1.5    1.5    1.5    1.5
a1       0      0      0      0      0      0      0
b       -1.63  -1.92  -1.64  -0.41  -0.40   1.71  -1.63

Table 1: The item parameters in study 1.

Model           2PL      Bi-factor       FLaG-IRT         FLaG-IRT
                         (true model)    (known graph)    (unknown graph)
τ(θi,g, θ̂i)    0.512    0.539           0.539            0.539
τ(θi1, θ̂i)     0.183    0.046           0.051            0.059

Table 2: The results of study 1.

the nuisance factor θ1 are assumed to be independent and follow the standard normal

distribution. The model parameters are shown in Table 1, where the bj’s are sampled from

the uniform distribution over the interval [−2, 2]. This data set is generated to mimic a test that

aims at measuring the general factor θg, thus every item is designed to be associated

with this dimension. In addition, θ1 is a nuisance dimension that is only associated with

five items and is not included in the design (so that people may not be aware of it), but

has an effect on the responses. The results are summarized in Table 2.
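The data-generating mechanism of this study can be sketched as follows, using the item parameters from Table 1. The function name and the random seed are assumptions made here for reproducibility of the sketch. The paper does not report a seed, so the simulated data will not match the authors' data set.

```python
import math
import random

def simulate_bifactor(N, a_g, a_1, b, seed=0):
    """Simulate binary responses from a bi-factor model with one general and
    one nuisance factor, both independent standard normal."""
    rng = random.Random(seed)
    responses, traits = [], []
    for _ in range(N):
        theta_g, theta_1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        row = []
        for ag, a1, bj in zip(a_g, a_1, b):
            p = 1.0 / (1.0 + math.exp(-(ag * theta_g + a1 * theta_1 + bj)))
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
        traits.append((theta_g, theta_1))
    return responses, traits

# Item parameters from Table 1 (J = 15; item cluster C1 = {1, ..., 5}).
a_g = [1.5] * 15
a_1 = [2.0] * 5 + [0.0] * 10
b = [-1.37, -1.98, 1.08, -1.55, 0.07, -0.18, -0.72, -1.67,
     -1.63, -1.92, -1.64, -0.41, -0.40, 1.71, -1.63]

X, theta = simulate_bifactor(3000, a_g, a_1, b)
```

The true trait values are returned alongside the responses so that estimated scores can later be compared with θg and θ1.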

Unidimensional 2PL model. We first analyze the data using the unidimensional

2PL model, where the unidimensional latent trait is assumed to have a Gaussian prior,

following the standard IRT setting. We consider measuring individuals using a two-stage

procedure. In the first stage, the item parameters are estimated, denoted by â and b̂.

In the second stage, â and b̂ are plugged into the posterior mean of θi and the resulting

estimate is denoted by θ̂i. The two plots in panel (a) of Figure 4 show θi1 (x-axis) versus

θ̂i (y-axis) and θi,g (x-axis) versus θ̂i (y-axis), respectively. In both plots, the regression

lines are plotted to show the trend. In addition, Kendall’s tau correlation coefficient

(e.g. Kendall and Gibbons, 1990) between θi1 and θ̂i is 0.183 and that between θi,g and

θ̂i is 0.512. Kendall’s tau is a nonparametric measure of correlation based on ordering

that does not require any parametric model assumption, such as Gaussianity. For two

vectors (y1, ..., yN)⊤ and (z1, ..., zN)⊤, Kendall’s tau is defined as

$$
\tau = \frac{2}{N(N-1)} \sum_{i<j} \operatorname{sgn}\{(y_i - y_j)(z_i - z_j)\},
$$

where sgn(x) is the sign function, taking value 1 if x > 0, 0 if x = 0, and −1 otherwise.

If yi and zi are independent samples from two populations, τ is asymptotically 0, as the

sample size N grows. As θ̂i is intended to measure θi,g, it is expected that if θi,g > θj,g,

then θ̂i > θ̂j and vice versa, for each pair of individuals i and j. Thus, the larger

τ is, the better the measurement. On the other hand, since θ1 is a nuisance factor

independent of θg, it is expected that θ̂i is independent of θi1, so that Kendall’s tau

is close to 0. Consequently, the measurement validity under this misspecified model is

low, due to the high Kendall’s tau correlation between the nuisance factor θi1 and θ̂i. In

other words, the latent trait being measured under the 2PL model deviates from what

is designed to measure. This could lead to the issue of test fairness that could especially

be of concern in educational testing. That is, for two examinees with the same θg value,

the one with a higher nuisance trait level tends to be scored higher. This phenomenon

is known as differential item functioning (Holland and Wainer, 2012).
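The Kendall's tau formula used throughout this study can be implemented directly from its definition. The sketch below is a plain O(N^2) version with a hypothetical function name. Ties contribute zero, matching the sign function defined above.

```python
def kendall_tau(y, z):
    """Kendall's tau: 2 / (N (N - 1)) * sum_{i<j} sgn((y_i - y_j)(z_i - z_j))."""
    n = len(y)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (y[i] - y[j]) * (z[i] - z[j])
            total += (prod > 0) - (prod < 0)   # sign of the product
    return 2.0 * total / (n * (n - 1))

tau_same = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])       # identical ordering
tau_reversed = kendall_tau([1, 2, 3, 4], [40, 30, 20, 10])   # reversed ordering
tau_mixed = kendall_tau([1, 2, 3], [1, 3, 2])                # one discordant pair
```

Because only the ordering of the values matters, the statistic is unchanged when estimated scores are on a different scale from the true traits.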

Bi-factor model with known nuisance factor. If the presence of the nuisance

factor θ1 is known, as well as its loading structure, we fit the true model. The latent

traits θi’s are estimated using the two-stage procedure based on the posterior mean as

above. Again, we plot θi1 (x-axis) versus θ̂i,g (y-axis) and θi,g (x-axis) versus θ̂i,g (y-axis)

in panel (b) of Figure 4. In addition, Kendall’s tau correlation coefficient between

θi1 and θ̂i,g is 0.046 and that between θi,g and θ̂i,g is 0.539. Thus, with the nuisance

Figure 4: The scatter plots of θi1 versus θ̂i (left) and θi,g versus θ̂i (right). Panel (a): Unidimensional 2PL model; Panel (b): Bi-factor model; Panel (c): FLaG-IRT with known graph; Panel (d): FLaG-IRT with unknown graph.

factor adjusted in the measurement model, the test validity improves; that is, θ̂i,g tends

to be less correlated with θi1, and θ̂i,g is more correlated with θi,g.

FLaG-IRT with known conditional graph. According to the discussion in Sec-

tion 2.3, when knowing the item cluster C1, we can also fit a unidimensional FLaG-IRT

model with a known graph in which only the first five items are connected to each other,

i.e., E = {(i, j) : i, j ≤ 5}. With Â and Ŝ from (10), the latent trait level θi for individual

i is scored according to (6), θ̂i = ∑_{j=1}^J âj xij. We remark that under the FLaG-IRT

model, due to the choice of the prior distribution, the θ̂i’s are no longer in the same

range as the θi,g’s. The results are presented in panel (c) of Figure 4 and Kendall’s tau

correlation coefficient between θi1 and θ̂i is 0.051 and that between θi,g and θ̂i is 0.539,

which is very close to the values under the true model. It indicates that even when a

misspecified prior is used, the measurement based on the FLaG-IRT model is still valid.

Moreover, Â = (â1, ..., âJ)⊤ is shown in Table 3. Note that the magnitude of the âj’s is not

on the same scale as the ones in the generating model, due to the different choice of

Item    1      2      3      4      5      6      7      8
âj      0.20   0.22   0.21   0.25   0.22   0.70   0.68   0.72

Item    9      10     11     12     13     14     15
âj      0.70   0.60   0.67   0.62   0.68   0.60   0.64

Table 3: The estimated factor loadings for the unidimensional FLaG-IRT model with known local dependence graph.

the marginal distribution of the latent trait in the FLaG-IRT model. From Table 3,

we see that the values of the âj’s are much smaller for items 1 to 5 than those for items 6

to 15, while all the loadings on the general factor are the same in the true model. In

other words, due to the positive local dependence among items 1 to 5, the weights of

items 1 to 5 are discounted when computing the score θ̂i = ∑_{j=1}^J âj xij. This is expected,

because in the presence of the nuisance factor, the responses to items 1 to 5 are highly

correlated and thus provide less information about the general factor than the rest of the

items. Consequently, the FLaG-IRT model discounts the weights of these items when

computing the score.

FLaG-IRT with unknown conditional graph. Finally, we consider the more realistic

situation, in which the presence of the nuisance factor and its relationship with

the items are unknown. The FLaG-IRT analysis described in Section 3 is applied, for

which the local dependence graph is unknown. The selected local dependence

graph is shown in Figure 5 and the scatter plots of θi1 versus θ̂i and θi,g versus θ̂i are

shown in panel (d) of Figure 4. In addition, Kendall’s tau correlation coefficient

between θi1 and θ̂i is 0.059 and that between θi,g and θ̂i is 0.539. From Figure 5, the first

five items are connected to each other and there are five additional edges (1, 8), (2, 8),

(2, 9), (5, 15), and (10, 15). These results show that the measurement is very close to

that with the true model or with the FLaG-IRT model with a correctly specified local

dependence graph. They also show that the differential item functioning caused by the

nuisance factor can be substantially reduced to a negligible level.


Figure 5: The local dependence graph of the selected FLaG-IRT model.

4.2 Simulation Study 2

In this study, we evaluate the performance of the FLaG-IRT analysis presented in Sec-

tion 3 under different settings. For each setting, 100 independent data sets are generated.

In the FLaG-IRT analysis, the local dependence structure is completely unspecified and

learned from data. The settings are listed below.

S1. The same setting as in Study 1, where K is set to be 1 in the FLaG-IRT analysis.

S2. The same as setting S1, except that sample size N = 1500.

S3. Generate data from a bi-factor model, with J = 20, N = 3000, and two item

clusters, C1 = {1, 2, 3, 4, 5}, C2 = {6, 7, 8, 9, 10}. K is set to be 1 in the FLaG-IRT

analysis.


S5. Generate data from the FLaG-IRT model, with J = 45, N = 3000, K = 3 and local dependence graph E = {(1, 2), (2, 3), (3, 4), ..., (44, 45)}. For the loading structure, items 1-15 measure the first latent trait, items 16-30 measure the second latent trait, and items 31-45 measure the third latent trait. In particular, we set $a_{jk} = 0.5$ for $q_{jk} \neq 0$, $s_{jj} = -6.5$, $j = 1, ..., J$, $s_{ij} = 1$ for $(i, j) \in E$, and $\sigma_{kk} = 1$, $k = 1, ..., K$ and $\sigma_{kl} = 0.1$, $k \neq l$.


For settings S1 and S2, we evaluate the performance of the FLaG-IRT based on the following criteria and compare the results with those of the 2PL model and the true model.

R1. The Kendall's tau correlation between the $\hat\theta_i$s and the true values of the general factor, the $\theta_{i,g}$s.

R2. The Kendall's tau correlation between the $\hat\theta_i$s and the values of the nuisance factor, the $\theta_{i1}$s.

R3. The true positive rate of graph estimation, defined as

$$\mathrm{TPR} = \frac{\sum_{i<j} 1_{\{(i,j)\in \hat E,\ (i,j)\in E\}}}{\sum_{i<j} 1_{\{(i,j)\in E\}}},$$

where $\hat E$ denotes the estimated edge set and, for bi-factor models, we say $(i, j) \in E$ if $i$ and $j$ belong to the same item cluster $C_m$, $m = 1, ..., M$.

R4. The false positive rate of graph estimation, defined as

$$\mathrm{FPR} = \frac{\sum_{i<j} 1_{\{(i,j)\in \hat E,\ (i,j)\notin E\}}}{\sum_{i<j} 1_{\{(i,j)\notin E\}}}.$$
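Criteria R3 and R4 can be computed directly from the true and estimated edge sets. The sketch below is one possible implementation; the function name `graph_rates` is hypothetical. As an illustration it uses the Study 1 graph (items 1 to 5 fully connected, J = 15) together with the five additional edges reported for Figure 5.

```python
from itertools import combinations


def graph_rates(true_edges, est_edges, n_items):
    """Return (TPR, FPR) over all unordered item pairs i < j.

    Assumes the true graph has at least one edge and one non-edge.
    """
    true_set = {tuple(sorted(e)) for e in true_edges}
    est_set = {tuple(sorted(e)) for e in est_edges}
    tp = fp = n_pos = n_neg = 0
    for pair in combinations(range(1, n_items + 1), 2):
        if pair in true_set:
            n_pos += 1
            tp += pair in est_set
        else:
            n_neg += 1
            fp += pair in est_set
    return tp / n_pos, fp / n_neg


# True graph of Study 1: items 1-5 connected to each other.
true_edges = list(combinations(range(1, 6), 2))
# Selected graph as described for Figure 5: true edges plus five extra edges.
est_edges = true_edges + [(1, 8), (2, 8), (2, 9), (5, 15), (10, 15)]

tpr, fpr = graph_rates(true_edges, est_edges, n_items=15)
```

For this single selected graph, all 10 true edges are recovered and 5 of the 95 non-edges are selected.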

For settings S3 and S4, we evaluate the FLaG-IRT analysis through R1, R3, and R4,

but not R2 because now there are two nuisance factors. In addition, the results are

compared with those of the 2PL model and the true model. For settings S5 and S6, the

evaluation is based on R3, R4 and

R5. The estimation accuracy of the nonzero loading parameters and the diagonal en-

tries of S.


R6. The estimation accuracy of the off-diagonal entries of Σ.

We include R5 and R6, because data are generated from the FLaG-IRT model and thus

the true parameters are known.

The results are summarized in Table 4 and Figures 6 and 7. In particular, under

Setting 1, the mean of the Kendall’s tau between the estimated scores from FLaG-IRT

analysis and the true general factor scores is 0.535, with a standard error 0.003. In

addition, the mean of the Kendall’s tau between the estimated scores and the nuisance

factor scores is 0.069 (SE = 0.001). On the other hand, when the unidimensional 2PL

model is used, the corresponding values are 0.508 (SE = 0.003) and 0.199 (SE = 0.002),

respectively; when the true model is fitted, the mean of the Kendall’s tau between the

estimated general factor scores and the true general and nuisance scores are 0.535 (SE

= 0.003) and 0.063 (SE = 0.001). Thus, the measurement based on the FLaG-IRT

analysis is comparable to that based on the true model and performs significantly better

than the 2PL model, which ignores the local dependence structure. In addition, it is

observed that the true positive rate of the graph estimation is always 1, meaning that

the edges are always correctly selected by the FLaG-IRT, and that the false positive

rate is small, albeit nonzero. Similar patterns are observed under settings S2, S3, and

S4. For settings S5 and S6, data are generated from a FLaG-IRT model. Under both

settings, the local dependence graph is estimated accurately. When sample size is 1500,

the average true positive rate over 100 replications is 0.925 (SE = 0.004) which is close

to 1 and the average false positive rate is 0.030 (SE = 0.001) which is close to 0. The results for N = 3000 improve upon those for N = 1500, with the true positive rate increasing to 1 and the false positive rate decreasing to 0.021 (SE = 0.001). Figures 6 and 7 plot the histograms of $\hat a_{jk} - a_{jk}$ for all nonzero loadings, $\hat s_{jj} - s_{jj}$ over all items, and $\hat\sigma_{kl} - \sigma_{kl}$ for all $k \neq l$, over 100 simulated data sets. As we can see, the estimates are roughly unbiased and, as the sample size increases, the estimates tend to be more concentrated around their true values.

                                   Criteria
                 R1              R2              R3              R4
S1 (FLaG-IRT)    0.535 (0.003)   0.069 (0.001)   1 (0)           0.041 (0.003)
S1 (2PL)         0.508 (0.003)   0.199 (0.002)   *               *
S1 (Bi-fac)      0.535 (0.003)   0.063 (0.001)   *               *
S2 (FLaG-IRT)    0.525 (0.004)   0.067 (0.002)   1 (0)           0.056 (0.003)
S2 (2PL)         0.502 (0.003)   0.187 (0.003)   *               *
S2 (Bi-fac)      0.525 (0.004)   0.061 (0.002)   *               *
S3 (FLaG-IRT)    0.549 (0.004)   *               1 (0)           0.103 (0.005)
S3 (2PL)         0.523 (0.003)   *               *               *
S3 (Bi-fac)      0.548 (0.004)   *               *               *
S4 (FLaG-IRT)    0.549 (0.004)   *               1 (0)           0.093 (0.003)
S4 (2PL)         0.521 (0.003)   *               *               *
S4 (Bi-fac)      0.549 (0.004)   *               *               *
S5               *               *               0.996 (0.001)   0.030 (0.001)
S6               *               *               0.925 (0.004)   0.039 (0.001)

Table 4: Results of FLaG-IRT analysis in simulation study 2. Entries are means over replications with standard errors in parentheses.

Figure 6: Histograms (vertical axis: frequency) of the difference between estimated and true nonzero factor loading parameters $\hat a_{jk} - a_{jk}$ ($a_{jk} = 0.5$), the difference between the estimated and true diagonal entries of the S matrix, $\hat s_{jj} - s_{jj}$ ($s_{jj} = -6.5$), and the difference between the estimated and true off-diagonal entries of the Σ matrix, $\hat\sigma_{kl} - \sigma_{kl}$ ($\sigma_{kl} = 0.1$), under setting S5.

Figure 7: Histograms (vertical axis: frequency) of the difference between estimated and true nonzero factor loading parameters $\hat a_{jk} - a_{jk}$ ($a_{jk} = 0.5$), the difference between the estimated and true diagonal entries of the S matrix, $\hat s_{jj} - s_{jj}$ ($s_{jj} = -6.5$), and the difference between the estimated and true off-diagonal entries of the Σ matrix, $\hat\sigma_{kl} - \sigma_{kl}$ ($\sigma_{kl} = 0.1$), under setting S6.

5 Real Data Analysis

We illustrate the use of FLaG-IRT analysis by an application to the Extroversion short

scale of the Eysenck Personality Questionnaire-Revised (EPQ-R; Eysenck et al., 1985;

Eysenck and Barrett, 2013). The data set contains the responses to 12 items from

824 females in the United Kingdom. All these items are designed to measure a single

personality factor Extroversion, characterized by personality patterns such as sociability,

talkativeness, and assertiveness. The items are shown in Table 5, and the data are

preprocessed so that the responses to the reversely worded items are flipped.

We start with fitting the unidimensional 2PL model whose unidimensional latent trait

follows a standard Gaussian distribution and then check the model fit. The estimated

2PL parameters are shown in Table 6. Under the fitted model, the expected two-by-two

tables for item pairs can be evaluated by

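$$E_{x_i x_j} = N \times P(X_i = x_i, X_j = x_j) = N \int \frac{\exp\{(a_i\theta + b_i)x_i\}}{1 + \exp(a_i\theta + b_i)} \cdot \frac{\exp\{(a_j\theta + b_j)x_j\}}{1 + \exp(a_j\theta + b_j)} \, \phi(\theta)\, d\theta,$$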

where φ(θ) is the density function of a standard normal distribution. We first check the

fit of item pairs by comparing the expected two-by-two tables with the observed ones,

using the $X^2$ local dependence index (Chen and Thissen, 1997) as a descriptive statistic. For each item pair $i$ and $j$, the $X^2$ statistic is defined as

$$X^2_{ij} = \sum_{x_i=0}^{1} \sum_{x_j=0}^{1} \frac{(O_{x_i x_j} - E_{x_i x_j})^2}{E_{x_i x_j}},$$

where $O_{x_i x_j}$ is the observed number of $(x_i, x_j)$ pairs. A large value of $X^2_{ij}$ indicates a lack of fit. In addition, based on simulation studies, Chen and Thissen (1997) suggest that the marginal distribution of each $X^2_{ij}$ is roughly a chi-square distribution with one degree of freedom when data are generated from the 2PL model. We visualize $(X^2_{ij})_{J \times J}$ using a heat map in the left panel of Figure 8. For a better visualization, we plot a monotone transformation of $X^2_{ij}$,

$$T_{ij} = \frac{X^2_{ij}}{Q^{\mathrm{Chi}}_{1,95\%} + X^2_{ij}},$$

where $Q^{\mathrm{Chi}}_{1,95\%}$ is the 95% quantile of the chi-square distribution with one degree of freedom. Thus, $T_{ij} > 1/2$ suggests that item pair $(i, j)$ is not fitted well. In the heat map, the value of $T_{ij}$ is presented according to the color key above the heat map. The five item pairs with the highest values of $T_{ij}$ are shown in Table 7, where items within a pair tend to share common content/stimuli.

To further assess the overall fit of the 2PL model and to compare it with that of the selected FLaG-IRT model, we consider a parametric bootstrap test, using the total sum of the $X^2$ statistics, $SX_{2PL} = \sum_{i<j} X^2_{ij}$, as the test statistic. That is, we generate 500 bootstrap data sets, each of which has 824 samples drawn from the estimated 2PL model. For each bootstrap data set, we fit the 2PL model again and compute the corresponding total sum of $X^2$s, denoted by $SX^{(b)}_{2PL}$. The empirical distribution of $SX^{(b)}_{2PL}$ is used as the reference distribution. The histogram of $SX^{(1)}_{2PL}, ..., SX^{(500)}_{2PL}$ is shown in the left panel of Figure 9. The observed value of $SX_{2PL}$ based on the fitted model is 318, much larger than the ones from bootstrap data. Consequently, the p-value of this bootstrap test is 0, indicating lack of fit of the 2PL model.
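The pairwise fit check above can be sketched as follows. This is an illustrative implementation, not the authors' code: the integral over θ is approximated by a trapezoidal rule on a fixed grid (an assumption; any quadrature rule could be used), the function names are hypothetical, and the constant 3.841 is the usual rounded value of the 95% quantile of the chi-square distribution with one degree of freedom.

```python
import math

CHI2_1_Q95 = 3.841  # rounded 95% quantile of chi-square with 1 df


def _phi(theta):
    """Standard normal density."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def _p(x, a, b, theta):
    """2PL probability P(X = x | theta) with slope a and intercept b."""
    p1 = 1.0 / (1.0 + math.exp(-(a * theta + b)))
    return p1 if x == 1 else 1.0 - p1


def expected_table(ai, bi, aj, bj, n, grid=2001, lim=8.0):
    """Expected 2x2 counts {(xi, xj): E} via the trapezoidal rule on [-lim, lim]."""
    h = 2.0 * lim / (grid - 1)
    table = {}
    for xi in (0, 1):
        for xj in (0, 1):
            total = 0.0
            for k in range(grid):
                t = -lim + k * h
                w = 0.5 if k in (0, grid - 1) else 1.0
                total += w * _p(xi, ai, bi, t) * _p(xj, aj, bj, t) * _phi(t)
            table[(xi, xj)] = n * total * h
    return table


def x2_statistic(observed, expected):
    """Pearson-type X^2 over the four cells of a 2x2 table."""
    return sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)


def t_transform(x2, q=CHI2_1_Q95):
    """Monotone transformation T = X^2 / (Q + X^2); T > 1/2 iff X^2 > Q."""
    return x2 / (q + x2)


# Expected table for items 6 and 8 under the estimates in Table 6, N = 824.
expected_6_8 = expected_table(2.62, -1.48, 2.22, -0.62, 824)
# Transformation applied to the (rounded) X^2 value reported for this pair.
t_6_8 = t_transform(62.0)
```

The observed table is not reproduced in the paper, so `x2_statistic` is shown without data; it expects a dictionary keyed by `(x_i, x_j)` in the same layout as the output of `expected_table`.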

1 Are you a talkative person?

2 Are you rather lively?

3 Can you usually let yourself go and enjoy yourself at a lively party?

4 Do you enjoy meeting new people?

5 Do you usually take the initiative in making new friends?

6 Can you easily get some life into a rather dull party?

7 Do you like mixing with people?

8 Can you get a party going?

9 Do you like plenty of bustle and excitement around you?

10 Do other people think of you as being very lively?

11(R) Do you tend to keep in the background on social occasions?

12(R) Are you mostly quiet when you are with other people?

Table 5: The revised Eysenck Personality Questionnaire short form of Extroversion scale.

Item j   1      2      3      4      5      6      7      8      9      10     11     12
a_j      1.90   2.52   1.55   1.64   1.41   2.62   1.70   2.22   1.20   2.35   2.52   2.19
b_j      1.12   1.98   1.51   3.04   0.60   -1.48  3.14   -0.62  1.17   0.90   0.62   1.27

Table 6: The estimated 2PL model for the EPQ-R data.

Rank   T_ij   X2_ij   Item    Item content
1      0.94   62      6       Can you easily get some life into a rather dull party?
                      8       Can you get a party going?
2      0.87   25      1       Are you a talkative person?
                      12(R)   Are you mostly quiet when you are with other people?
3      0.86   24      2       Are you rather lively?
                      10      Do other people think of you as being very lively?
4      0.85   23      4       Do you enjoy meeting new people?
                      7       Do you like mixing with people?
5      0.79   15      7       Do you like mixing with people?
                      9       Do you like plenty of bustle and excitement around you?

Table 7: Item pairs with largest values of local dependence indices.

Figure 8: The heat maps for visualizing the fit of all item pairs under the 2PL model (left) and the selected FLaG-IRT model (right).

Figure 9: The results of a parametric bootstrap test for the 2PL model (left) and the selected FLaG-IRT model (right).

Figure 10: The local dependence graph of the selected FLaG-IRT model.

We then apply FLaG-IRT. The local dependence graph of the selected model has 20 edges, as shown in Figure 10, where the positive and negative edges are in black and red, respectively. In particular, the most locally dependent item pairs also correspond to the most positive edges in Figure 10. Similar to the analysis above, we compute the local dependence indices for all the item pairs and visualize them in the right panel of Figure 8, where no $X^2_{ij}$ is found to exceed $Q^{\mathrm{Chi}}_{1,95\%}$. Moreover, 500 bootstrap data sets are generated from the selected FLaG-IRT model and the bootstrap distribution of $SX_{FLaG-IRT}$ is shown in the right panel of Figure 9. As we can see, the observed value of $SX_{FLaG-IRT}$ for the selected model is in the middle range of the bootstrap distribution and the p-value is 31.6%, indicating that the selected FLaG-IRT model fits well.
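The final step of the parametric bootstrap test, turning the reference distribution into a p-value, can be sketched as below. The paper does not state the exact formula, so the upper-tail proportion used here is an assumption, and the reference values in the usage lines are hypothetical placeholders rather than the bootstrap statistics behind Figure 9. Generating the bootstrap data sets and refitting the model are not shown.

```python
def bootstrap_p_value(observed_stat, bootstrap_stats):
    """Proportion of bootstrap statistics at least as large as the observed one."""
    if not bootstrap_stats:
        raise ValueError("need at least one bootstrap statistic")
    n_extreme = sum(1 for s in bootstrap_stats if s >= observed_stat)
    return n_extreme / len(bootstrap_stats)


# Hypothetical reference distribution of 500 values (placeholders only).
reference = list(range(100, 600))

p_far_tail = bootstrap_p_value(1000, reference)          # observed beyond all draws
p_mid_range = bootstrap_p_value(reference[342], reference)  # 158 of 500 draws are >=
```

An observed statistic larger than every bootstrap draw gives a p-value of 0, while one exceeded or matched by 158 of the 500 draws gives 0.316.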

Based on the above analysis, we see that even a well-designed 12-item EPQ-R short form displays a significant level of local dependence, which, if not adjusted for, may result in

severe measurement bias. The proposed FLaG-IRT model automatically adjusts for the

local dependence based on the data, while maintaining the unidimensional latent trait

as the key source of dependence among responses. As a result, the FLaG-IRT model

learned from data fits well, at both the item pair level and the test level.

6 Summary

It is well known that mental processes are complex and there are always factors not

perfectly explained by a measurement model. Standard latent factor models may result

in model lack of fit, thereby having a negative effect on the test validity. In this paper,

we propose the FLaG-IRT model for robust measurement. The key idea behind the


proposed model is that, given the loading structure of a well-designed test, the local

dependence structure can be well incorporated through a sparse graphical component.

This is done without requiring any prior information about the structure of local depen-

dence. Our analysis shows that the method greatly reduces the measurement bias and

increases the measurement accuracy. In particular, when significant local dependence is observed and when test fairness is a concern, the FLaG-IRT model can be a good choice for scoring individuals. Finally, we remark that our algorithm for FLaG-IRT analysis is very efficient and stable, although the optimization problem is nonconvex and nonsmooth. The FLaG-IRT analysis is thus scalable to large-scale data.

References

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.

Journal of the Royal Statistical Society. Series B (Methodological), pages 192–236.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s

ability. In Lord, F. M. and Novick, M. R., editors, Statistical Theories of Mental Test

Scores, pages 395–479. Addison-Wesley, Reading, MA.

Bradlow, E. T., Wainer, H., and Wang, X. (1999). A Bayesian random effects model for

testlets. Psychometrika, 64(2):153–168.

Braeken, J. (2011). A boundary mixture approach to violations of conditional indepen-

dence. Psychometrika, 76(1):57–76.

Braeken, J., Tuerlinckx, F., and De Boeck, P. (2007). Copula functions for residual

dependency. Psychometrika, 72(3):393–411.

Cai, L., Yang, J. S., and Hansen, M. (2011). Generalized full-information item bifactor

analysis. Psychological methods, 16(3):221–248.


Camilli, G. (1992). A conceptual analysis of differential item functioning in terms

of a multidimensional item response model. Applied Psychological Measurement,

16(2):129–147.

Chen, W.-H. and Thissen, D. (1997). Local dependence indexes for item pairs using item

response theory. Journal of Educational and Behavioral Statistics, 22(3):265–289.

Chen, Y. (2016). Latent variable modeling and statistical learning. Available at http://academiccommons.columbia.edu/catalog/ac:198122. PhD thesis, Columbia University.

Chen, Y., Li, X., Liu, J., and Ying, Z. (2016). A fused latent and graphical model for

multivariate binary data. Available at https://arxiv.org/pdf/1606.08925v1.pdf.

arXiv preprint.

Chen, Y., Liu, J., Xu, G., and Ying, Z. (2015). Statistical analysis of Q-matrix based

diagnostic classification models. Journal of the American Statistical Association,

110(510):850–866.

Embretson, S. E. and Reise, S. P. (2000). Item response theory for psychologists.

Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Eysenck, S. and Barrett, P. (2013). Re-introduction to cross-cultural studies of the EPQ.

Personality and Individual Differences, 54:485–489.

Eysenck, S. B., Eysenck, H. J., and Barrett, P. (1985). A revised version of the Psy-

choticism scale. Personality and Individual Differences, 6:21–29.

Ferrara, S., Huynh, H., and Michaels, H. (1999). Contextual explanations of local

dependence in item clusters in a large scale hands-on science performance assessment.

Journal of Educational Measurement, 36:119–140.


Gibbons, R. D., Bock, R. D., Hedeker, D., Weiss, D. J., Segawa, E., Bhaumik, D. K.,

Kupfer, D. J., Frank, E., Grochocinski, V. J., and Stover, A. (2007). Full-information

item bifactor analysis of graded response data. Applied Psychological Measurement,

31(1):4–19.

Gibbons, R. D. and Hedeker, D. R. (1992). Full-information item bi-factor analysis.

Psychometrika, 57(3):423–436.

Holland, P. W. (1990). The Dutch identity: A new tool for the study of item response

models. Psychometrika, 55(1):5–18.

Holland, P. W. and Thayer, D. T. (1988). Differential item performance and the Mantel-

Haenszel procedure. In Wainer, H. and Braun, H. I., editors, Test validity, pages

129–145. Lawrence Erlbaum Associates, Hillsdale, NJ.

Holland, P. W. and Wainer, H. (2012). Differential item functioning. Routledge.

Hoskens, M. and De Boeck, P. (1997). A parametric model for local dependence among

test items. Psychological methods, 2(3):261–277.

Ip, E. H. (2002). Locally dependent latent trait model and the Dutch identity revisited.


Ip, E. H. (2010). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63(2):395–416.

Ip, E. H., Wang, Y. J., De Boeck, P., and Meulders, M. (2004). Locally dependent

latent trait model for polytomous responses with application to inventory of hostility.


Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31:253–258.


Kendall, M. G. and Gibbons, J. D. (1990). Rank correlation methods. Oxford University

Press, London, UK.

Knowles, E. S. and Condon, C. A. (2000). Does the rose still smell as sweet? Item

variability across test forms and revisions. Psychological Assessment, 12:245–252.

Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and tech-

niques. MIT press, Cambridge, MA.

Laird, N. M. (1991). Topics in likelihood-based methods for longitudinal data analysis.

Statistica Sinica, 1(1):33–50.

Li, Y., Bolt, D. M., and Fu, J. (2006). A comparison of alternative models for testlets.

Applied Psychological Measurement, 30(1):3–21.

Liu, J., Xu, G., and Ying, Z. (2012). Data-driven learning of Q-matrix. Applied psycho-

logical measurement, 36(7):548–564.

Liu, J., Xu, G., and Ying, Z. (2013). Theory of the self-learning Q-matrix. Bernoulli:

official journal of the Bernoulli Society for Mathematical Statistics and Probability,

19(5A):1790–1817.

Lord, F. M. and Novick, M. R. (1968). Statistical theories of mental test scores. Addison-

Wesley, Reading, MA.

McKinley, R. L. and Reckase, M. D. (1982). The use of the general Rasch model with

multidimensional item response data. Iowa City, IA: American College Testing.

Parikh, N., Boyd, S. P., et al. (2014). Proximal algorithms. Foundations and Trends in

optimization, 1(3):127–239.

Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests.

Copenhagen: Danish Institute for Educational Research.


Reckase, M. (2009). Multidimensional item response theory. Springer, New York, NY.

Reise, S. P., Horan, W. P., and Blanchard, J. J. (2011). The challenges of fitting an

item response theory model to the social anhedonia scale. Journal of personality

assessment, 93(3):213–224.

Reise, S. P., Morizot, J., and Hays, R. D. (2007). The role of the bifactor model in

resolving dimensionality issues in health outcomes measures. Quality of Life Research,

16(1):19–31.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–

464.

Schwarz, N. (1999). Self-reports: How the questions shape the answers. American

Psychologist, 54:93–105.

Sun, J., Chen, Y., Liu, J., Ying, Z., and Xin, T. (2016). Latent variable selection for

multidimensional item response theory models via L1 regularization. Psychometrika.

To appear.

Wainer, H., Bradlow, E. T., and Du, Z. (2000). Testlet response theory: An analog

for the 3PL model useful in testlet-based adaptive testing. In van der Linden, W. J.

and Glas, G. A., editors, Computerized adaptive testing: Theory and practice, pages

245–269. Springer, New York, NY.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance

of the three-parameter logistic model. Applied Psychological Measurement, 8(2):125–

145.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local

item dependence. Journal of Educational Measurement, 30(3):187–213.


Appendix

A Computation via an Alternating Minimization

A.1 Proximal Gradient Descent Update

The proximal gradient descent update is designed for solving nonsmooth convex optimization problems (Parikh et al., 2014). Consider the optimization problem

$$\min_{x} \; f(x) + g(x), \tag{15}$$

where $x \in \mathbb{R}^n$, $f$ is a smooth convex function, and $g$ is a continuous but nonsmooth

convex function. Due to the nonsmoothness of g, the traditional gradient descent algo-

rithm cannot be directly applied, because the gradient of g does not always exist. The

proximal gradient descent update can be viewed as a variant of the gradient descent

update that accounts for the nonsmoothness.

To describe the proximal gradient descent update, we first introduce the proximal operator $P_{\lambda,g}: \mathbb{R}^n \to \mathbb{R}^n$ as

$$P_{\lambda,g}(v) = \arg\min_{x \in \mathbb{R}^n} \Big\{ g(x) + \frac{1}{2\lambda}\|x - v\|^2 \Big\},$$

and the proximal gradient descent update is

$$x^{t+1} = P_{\lambda,g}\big(x^t - \lambda\nabla f(x^t)\big), \tag{16}$$

where $x^t$ and $x^{t+1}$ are the current and updated values. It can be shown that for a sufficiently small $\lambda$,

$$f(x^{t+1}) + g(x^{t+1}) < f(x^t) + g(x^t),$$

if $x^t$ is not an optimal solution. Thus, one can always search for a step size $\lambda$, such

that the objective function decreases. When function g has a special form, the proximal

gradient descent update (16) may have a closed form solution, which is indeed the case

in our algorithm below.
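As a toy illustration of update (16), the sketch below applies proximal gradient descent to $f(x) = \frac{1}{2}\|x - b\|^2$ and $g(x) = \gamma\|x\|_1$, for which the proximal operator reduces to elementwise soft-thresholding at level $\lambda\gamma$. This objective, the fixed step size, and all function names are assumptions chosen for illustration; they are not the FLaG-IRT objective.

```python
def soft_threshold(v, tau):
    """Scalar soft-thresholding: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_l1(v, lam, gamma):
    """Proximal operator P_{lam,g}(v) for g(x) = gamma * ||x||_1."""
    return [soft_threshold(vi, lam * gamma) for vi in v]


def objective(x, b, gamma):
    """f(x) + g(x) with f(x) = 0.5 * ||x - b||^2 and g(x) = gamma * ||x||_1."""
    smooth = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    return smooth + gamma * sum(abs(xi) for xi in x)


def proximal_gradient(b, gamma, lam=0.5, n_iter=50):
    """Iterate x <- P_{lam,g}(x - lam * grad f(x)) from x = 0."""
    x = [0.0] * len(b)
    history = [objective(x, b, gamma)]
    for _ in range(n_iter):
        grad = [xi - bi for xi, bi in zip(x, b)]  # gradient of f at x
        x = prox_l1([xi - lam * gi for xi, gi in zip(x, grad)], lam, gamma)
        history.append(objective(x, b, gamma))
    return x, history


x_hat, history = proximal_gradient(b=[3.0, -0.5, -2.0], gamma=1.0)
```

For this toy problem the minimizer is known in closed form (soft-thresholding of b at γ), so the iterates approach [2, 0, -1] and the recorded objective values never increase.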

A.2 An Alternating Minimization Algorithm

We use an alternating minimization algorithm for optimizing (11). The positive semidefinite constraint on $\Sigma$ is not easy to handle in the computation. Therefore, we reparameterize $\Sigma = BB^\top$, where $B = (b_{ij})$ is a $K \times K$ lower triangular matrix. In addition, instead of constraining $\Sigma_{kk} = 1$, we require $b_{kk}$ to be 1. There is a one-to-one correspondence between the two parametrizations and the transformation will be discussed in the sequel. We let $l(A, S, B) = \log L(A, S, BB^\top)$ and

$$H_\gamma(A, S, B) = -\frac{1}{N}\, l(A, S, B) + \gamma \sum_{i \neq j} |s_{ij}|$$

be the objective function. Then the alternating minimization algorithm alternates between updating $A$, $S$, and $B$, so that the values of $(A, S, B)$ at iteration $t$, denoted by $(A^t, S^t, B^t)$, satisfy

$$H_\gamma(A^t, S^t, B^t) > H_\gamma(A^{t+1}, S^t, B^t) > H_\gamma(A^{t+1}, S^{t+1}, B^t) > H_\gamma(A^{t+1}, S^{t+1}, B^{t+1}),$$

for all $t$. Specifically, $A$ and $B$ are updated using a gradient descent method and $S$ is updated using the proximal gradient descent method. We summarize the algorithm as follows, given the current parameter values $(A^t, S^t, B^t)$.

Algorithm 1 An Alternating Minimization Algorithm.

1. Update

$$A^{t+1} \leftarrow A^t - \alpha_t\, g_A(A^t, S^t, B^t),$$

where $g_A(A^t, S^t, B^t)$ is the gradient of $-l(A, S, B)/N$ with respect to $A$ at $(A^t, S^t, B^t)$. The step size $\alpha_t$ is chosen by line search, such that $H_\gamma(A^{t+1}, S^t, B^t) < H_\gamma(A^t, S^t, B^t)$.

2. Update

$$S^{t+1} \leftarrow P_{\lambda_t, h_\gamma}\big(S^t - \lambda_t\, g_S(A^{t+1}, S^t, B^t)\big),$$

where $h_\gamma(S) = \gamma \sum_{i \neq j} |s_{ij}|$ is the regularization function and $g_S(A^{t+1}, S^t, B^t)$ is the gradient of $-l(A, S, B)/N$ with respect to $S$ at $(A^{t+1}, S^t, B^t)$. In addition, $\lambda_t$ is the step size for the proximal gradient operator, chosen by line search to satisfy

$$H_\gamma(A^{t+1}, S^{t+1}, B^t) < H_\gamma(A^{t+1}, S^t, B^t). \tag{17}$$

3. Update

$$B^{t+1} \leftarrow B^t - \beta_t\, g_B(A^{t+1}, S^{t+1}, B^t),$$

where $g_B(A^{t+1}, S^{t+1}, B^t)$ is the gradient of $-l(A, S, B)/N$ with respect to $B$ evaluated at $(A^{t+1}, S^{t+1}, B^t)$. The step size $\beta_t$ is chosen by line search, such that $H_\gamma(A^{t+1}, S^{t+1}, B^{t+1}) < H_\gamma(A^{t+1}, S^{t+1}, B^t)$.

4. Iterate between the above three steps until convergence.
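Steps 1 and 3 require a step size that strictly decreases the objective. The paper does not specify the line search rule, so the sketch below assumes a simple backtracking scheme that halves the step until a decrease is found; the function name, the shrink factor, and the quadratic test objective are all hypothetical.

```python
def backtracking_step(x, grad, objective, step=1.0, shrink=0.5, max_halvings=50):
    """Gradient step with backtracking.

    Returns (new_x, step) with objective(new_x) < objective(x), or (x, 0.0)
    if no decrease is found within max_halvings reductions of the step.
    """
    base = objective(x)
    for _ in range(max_halvings):
        candidate = [xi - step * gi for xi, gi in zip(x, grad)]
        if objective(candidate) < base:
            return candidate, step
        step *= shrink
    return x, 0.0


def sum_of_squares(v):
    """Toy smooth objective standing in for the negative log-likelihood term."""
    return sum(vi * vi for vi in v)


x0 = [1.0, -2.0]
g0 = [2.0 * xi for xi in x0]  # gradient of the toy objective at x0
x1, used_step = backtracking_step(x0, g0, sum_of_squares)
```

With the initial step 1.0 the candidate [-1, 2] does not decrease the objective, so the step is halved once and the update lands on [0, 0].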

We make a few remarks.

Remark 1 $-l(A, S, B)/N$ is a smooth function of $(A, S, B)$ and its gradients $g_A$, $g_S$, and $g_B$ have analytic forms.

Remark 2 In Step 2, $H_\gamma(A^{t+1}, S, B^t)$, when viewed as a function of $S$ (with $A^{t+1}$ and $B^t$ fixed), is the sum of a smooth convex function $-l(A^{t+1}, S, B^t)/N$ and a nonsmooth convex function $h_\gamma(S) = \gamma \sum_{i \neq j} |s_{ij}|$. Therefore, according to the discussion in Appendix A.1, there exists a sufficiently small step size $\lambda_t$, such that (17) is satisfied.


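Remark 3 The proximal operator $P_{\lambda_t, h_\gamma}(\cdot)$ has a closed form solution. Let $\tilde S = S^t - \lambda_t\, g_S(A^{t+1}, S^t, B^t)$. Then $s^{t+1}_{jj} = \tilde s_{jj}$ and, for $i \neq j$, $s^{t+1}_{ij}$ is obtained by soft-thresholding

$$s^{t+1}_{ij} = \begin{cases} \frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) - \lambda_t\gamma & \text{if } \frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) > \gamma\lambda_t; \\ 0 & \text{if } |\frac{1}{2}(\tilde s_{ij} + \tilde s_{ji})| \le \gamma\lambda_t; \\ \frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) + \lambda_t\gamma & \text{if } \frac{1}{2}(\tilde s_{ij} + \tilde s_{ji}) < -\gamma\lambda_t. \end{cases}$$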
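The closed form in Remark 3 can be written as a short routine. The sketch below takes the intermediate matrix (the gradient step before thresholding) as a nested list; the function name and the numerical values in the example are hypothetical.

```python
def prox_offdiag_l1(s_tilde, lam, gamma):
    """Symmetrize off-diagonal entries and soft-threshold them at lam * gamma.

    Diagonal entries are copied unchanged, since the penalty applies only
    to the off-diagonal entries.
    """
    k = len(s_tilde)
    thr = lam * gamma
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        out[i][i] = s_tilde[i][i]
        for j in range(i + 1, k):
            avg = 0.5 * (s_tilde[i][j] + s_tilde[j][i])
            if avg > thr:
                val = avg - thr
            elif avg < -thr:
                val = avg + thr
            else:
                val = 0.0
            out[i][j] = out[j][i] = val
    return out


# Hypothetical 3 x 3 intermediate matrix; threshold is 0.5 * 0.4 = 0.2.
s_tilde = [[-6.5, 0.9, 0.05],
           [1.1, -6.4, -0.3],
           [0.03, -0.5, -6.6]]
s_next = prox_offdiag_l1(s_tilde, lam=0.5, gamma=0.4)
```

In this example the (1, 2) average of 1.0 shrinks to 0.8, the (1, 3) average of 0.04 is set exactly to zero (removing that edge), and the (2, 3) average of -0.4 shrinks to -0.2.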

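Remark 4 Given estimates $A^\gamma$, $S^\gamma$, and $B^\gamma$ from optimizing $H_\gamma(A, S, B)$, the estimates under the parametrization in (11) can be obtained by

$$A^\gamma \leftarrow A^\gamma D, \quad S^\gamma \leftarrow S^\gamma, \quad \Sigma^\gamma \leftarrow D^{-1} B^\gamma (B^\gamma)^\top D^{-1},$$

where $D = \mathrm{diag}(d_{11}, ..., d_{KK})$ is a $K \times K$ diagonal matrix with

$$d_{kk} = \sqrt{\big(B^\gamma (B^\gamma)^\top\big)_{kk}}.$$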
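The transformation in Remark 4 rescales the latent traits so that the covariance matrix has unit diagonal, and compensates in the loadings. A minimal sketch with nested lists follows; the function name `rescale` and the 2-dimensional example values are hypothetical.

```python
import math


def rescale(a, b):
    """Map (A, B) to (A D, D^{-1} B B^T D^{-1}) with d_kk = sqrt((B B^T)_kk).

    a: J x K loading matrix; b: K x K lower triangular matrix.
    """
    k = len(b)
    bbt = [[sum(b[r][m] * b[c][m] for m in range(k)) for c in range(k)]
           for r in range(k)]
    d = [math.sqrt(bbt[r][r]) for r in range(k)]
    sigma = [[bbt[r][c] / (d[r] * d[c]) for c in range(k)] for r in range(k)]
    a_scaled = [[row[c] * d[c] for c in range(k)] for row in a]
    return a_scaled, sigma


# Hypothetical example with K = 2 and unit diagonal in B.
b_example = [[1.0, 0.0],
             [0.75, 1.0]]
a_example = [[0.5, 0.0],
             [0.0, 0.4]]
a_scaled, sigma = rescale(a_example, b_example)
```

Here $BB^\top$ has diagonal (1, 1.5625), so D = diag(1, 1.25); the resulting Σ has unit diagonal with off-diagonal entry 0.6, and the second-trait loading 0.4 becomes 0.5.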
