A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR
DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND
IRTLRDIF V.2
by
MEI LING ONG
(Under the Direction of Seock-Ho Kim)
ABSTRACT
This paper addresses statistical issues of differential item functioning (DIF). The first
purpose of this study is to present an empirical data comparison of the IRTPRO, BILOG-MG 3,
and IRTLRDIF programs and to detect DIF across two samples with IRT models, 1PL, 2PL, and
3PL. The second purpose is to examine IRTPRO to determine its effectiveness in detecting DIF,
and, finally, to consider whether DIF exists in the GHSGPT for different ethnicities only in
Social Studies. The GHSGPT predicts 11th grade students’ future performance on the Georgia
High School Graduation Test and consists of 79 dichotomously scored items. The results show
that several DIF items exist in the GHSGPT. For instance, all three programs consistently
indicate that Item 13 is beneficial to Whites. In addition, IRTPRO is effective in detecting DIF
because its results parallel those of IRTLRDIF and BILOG-MG 3.
INDEX WORDS: Differential item functioning (DIF), IRTPRO, BILOG-MG 3, IRTLRDIF,
IRT, 1PL, 2PL, and 3PL.
A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR
DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND
IRTLRDIF V.2
by
MEI LING ONG
B.A., Fu-Jen Catholic University, Taiwan, 1999
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF ARTS
ATHENS, GEORGIA
2012
© 2012
MEI LING ONG
All Rights Reserved
A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR
DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND
IRTLRDIF V.2
by
MEI LING ONG
Major Professor: Seock-Ho Kim
Committee: Allan S. Cohen
Stephen E. Cramer

Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2012
ACKNOWLEDGEMENTS
I sincerely appreciate those who supported and encouraged me throughout this process. I
would like to thank my advisor, Dr. Seock-Ho Kim, for his guidance and technical support
throughout this study, without which I would not have completed this thesis. In addition, I would
like to thank the members of my committee, Dr. Allan S. Cohen and Dr. Stephen E. Cramer, for
their comments and helpful suggestions while completing this thesis. Furthermore, I want to
thank my friends, Yoonsun, Youn-Jeng, Sunbok, Stephanie Short, Mary Edmond, and many
other friends, who offered their opinions on this thesis. Lastly and importantly, I wish
to express my deepest appreciation to my parents, my elder brother, my younger sister, and my
younger auntie for their support and encouragement. To my lovely husband, Man Kit Lei, thanks
for cooking lunch and dinner for me while I was researching, writing and revising this study.
Because of your unending encouragement and full support, I have had an opportunity to obtain
my Master’s Degree. Thank you very much.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS ........................................................................................................... iv
LIST OF TABLES ........................................................................................................................ vii
LIST OF FIGURES ..................................................................................................................... viii
CHAPTER
1 INTRODUCTION .........................................................................................................1
1.1 Overview ............................................................................................................1
1.2 Item Bias, Differential Item Functioning (DIF), and Impact .............................2
1.3 The Purpose of the Study ...................................................................................6
2 LITERATURE REVIEW ..............................................................................................7
2.1 Classical Test Theory .........................................................................................7
2.2 Modern Test Theory ..........................................................................................9
2.3 Estimation of Item Parameters .........................................................................10
2.4 Dichotomously Scored Items ...........................................................................12
2.5 The DIF Detection Method ..............................................................................18
2.6 Current Research ..............................................................................................27
3 METHOD ....................................................................................................................28
3.1 Research Structure ...........................................................................................28
3.2 Instrumentation ................................................................................................29
3.3 Sample..............................................................................................................29
3.4 Computer Programs .........................................................................................30
4 RESULTS ....................................................................................................................33
4.1 Item Analysis ...................................................................................................33
4.2 Racial Differential Item Functioning (DIF) Analysis ......................................41
5 SUMMARY AND DISCUSSION ...............................................................................80
5.1 Summary ..........................................................................................................80
5.2 Discussion ........................................................................................................83
REFERENCES ..............................................................................................................................88
APPENDICES
A IRTPRO Input File for DIF Detection for Two Groups with 3PL ..............................95
B BILOG-MG 3 Input File for DIF Detection for Two Groups with 3PL ....................101
C IRTLRDIF Input File for DIF Detection for Two Groups with 3PL .........................103
LIST OF TABLES
Page
Table 1: The Development of Item Response Models and Computer Programs ..........................14
Table 2: The 2-by-2 Contingency Table ........................................................................................19
Table 3: The DIF Detection for Ethnicity ......................................................................................30
Table 4: Raw Score Summary Statistics for the GHSGPT ............................................................33
Table 5: Item Statistics Based on Classical Test Theory ...............................................................36
Table 6: Item Statistics Based on Item Response Theory..............................................................39
Table 7: The Summary of Goodness of Fit Using BILOG-MG 3 .................................................42
Table 8: The Summary of Goodness of Fit Using IRTPRO ..........................................................42
Table 9: The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups
with 1PL ..............................................................................................................................44
Table 10: The Summary of IRTLRDIF for Three Comparison Groups with 2PL ........................47
Table 11: The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison
Groups with 2PL ................................................................................................................52
Table 12: The Summary of IRTLRDIF for Three Comparison Groups with 3PL ........................56
Table 13: The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison
Groups with 3PL ................................................................................................................61
Table 14: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL ......65
Table 15: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL ......68
Table 16: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL ......71
LIST OF FIGURES
Page
Figure 1: No DIF between two groups ...........................................................................................4
Figure 2: DIF exists in two groups called uniform DIF...................................................................4
Figure 3: Non-uniform DIF ............................................................................................................5
Figure 4: The research structure ...................................................................................................28
Figure 5: Item 13 between Whites and Blacks ..............................................................................73
Figure 6: Item 14 between Whites and Blacks ..............................................................................73
Figure 7: Item 15 between Whites and Blacks ..............................................................................74
Figure 8: Item 32 between Whites and Blacks ..............................................................................74
Figure 9: Item 44 between Whites and Blacks ..............................................................................75
Figure 10: Item 45 between Whites and Blacks ............................................................................75
Figure 11: Item 56 between Whites and Blacks ............................................................................76
Figure 12: Item 57 between Whites and Blacks ............................................................................76
Figure 13: Item 78 between Whites and Blacks ............................................................................77
Figure 14: Item 13 between Whites and Hispanics .......................................................................77
Figure 15: Item 19 between Whites and Hispanics .......................................................................78
Figure 16: Item 51 between Whites and Hispanics .......................................................................78
Figure 17: Item 44 between Whites and the Multi-Racial Group ..................................................79
CHAPTER 1
INTRODUCTION
1.1 Overview
A well-constructed test is the best way to evaluate a student’s mastery in a particular
field. Gronlund (1993) stated that tests not only aid teachers in making various instructional
decisions by having a direct influence on students’ learning, but they also assist in a number of
other ways. For instance, tests can increase students’ motivation. The purposes of tests are to
obtain an accurate and fair assessment of a student’s abilities. Nevertheless, a test cannot
properly evaluate skills or knowledge bases if it is affected by irrelevant factors that could bias
the results. These potentially biasing factors could include gender, ethnic, and cultural
differences. Without properly accounting for these confounding factors, the results of the test
will be an unfair representation of students’ abilities (Gronlund, 1993). In other words, if a test is
unfair for examinees because of gender, ethnic origin, or cultural bias, then its results are
essentially meaningless. For instance, Freedle and Kostin (1988) investigated whether the GRE
verbal item types functioned differently across races. They found that
most of the GRE verbal items advantaged Whites. Thus, test fairness is an important issue with
which researchers must be concerned.
There are several ways to measure students’ cognitive abilities in standardized testing.
Currently, multiple-choice tests are commonly used for measuring students’ cognitive abilities
(Ling & Lau, 2005). Most schools use standardized scores to evaluate educational quality and
student performance (Brescia & Fortune, 1988). If test scores are an important factor in
evaluating students’ performance, test developers should make tests as fair as possible for
examinees of different races, genders, or handicapping conditions (APA, 1988). In order to
ensure that all items are as free as possible from irrelevant sources of variance, all items should
be reviewed because the presence of bias may unfairly affect examinees’ scores (Hambleton &
Swaminathan, 1985). Hence, detecting differential item functioning (DIF) can be seen as a
critical step in detecting biased items.
1.2 Item Bias, Differential Item Functioning (DIF), and Impact
Research on item bias first appeared in the literature in the 1960s. Angoff (1993)
characterized bias as follows: “An item is biased if equally able (or proficient) individuals, from different
groups, do not have equal probabilities of answering the item correctly” (p. 4). Lord (1980) also
noted that a test would be unbiased if each item has exactly the same item response function in
each group, and examinees have exactly the same opportunity of obtaining the correct item at
any given level of ability, θ. However, if each item has a different item response function
between a reference group and a focal group, the item, obviously, is biased. Furthermore, Shealy
and Stout (1993) indicated that “if the matching criterion is judged to be construct-valid in the
sense that it is matching examinees on the basis of the latent trait (target ability) the test is
designed to measure without contamination from other unintended to be measured abilities then
the DIF item is said to be biased” (p. 197). For example, the word commodious, a verbal aptitude
item, advantages Hispanic examinees. The word commodious was considered biased because it
has a similar form and meaning in Spanish (Zieky, 1993). While researchers have determined the
need to identify such bias in testing, the very word “bias” is sometimes confusing and evokes
negative emotional reactions similar to the words “discrimination” and “racism” (Berk, 1982).
Eventually, researchers proposed DIF to replace the term bias (Angoff, 1993).
DIF involves testing examinees from different populations that share the same abilities
but differ in their probabilities of giving correct responses on test items (Crocker & Algina,
2008). For example, a mathematics test requires skills in computation and reading and assumes
that all examinees have the same computational ability. Nonetheless, if one group is proficient in
reading English, but another group is made up of English as a second language (ESL)
individuals, these groups would not have equal English proficiency. In this situation, even
though all examinees are matched in their computational abilities, they will provide different
answers on the mathematics items because they differ in English proficiency. DIF therefore exists
on the mathematics test. On the other hand, if two groups exhibit different performances on a
mathematics test because they do not share the same ability, then this situation displays impact
rather than DIF.
Impact refers to a difference in performance on an item between two groups and is what
Holland and Thayer (1988) called “differential item performance.” If DIF exists for a focal group
relative to some reference group, then the item characteristic curves (ICCs) differ for the two
groups (Cai et al., 2011). In other words, there is no DIF if the ICCs are equal as shown in Figure
1. On the other hand, DIF exists when the ICCs differ as shown in Figure 2. Thus, Lord (1980)
argued that DIF detection questions could be approached by comparing estimates of the item
parameters between groups, as the ICCs for an item are determined by the item parameters. When DIF
exists in a test, the affected items reflect construct-irrelevant factors, and this
compromises the validity of the items utilized. If a test is found to contain a biased
item, this item should be omitted in order to achieve a fairer test. Thus, determining DIF is an
important step in maintaining items’ effectiveness and fairness as well as in enhancing the
validity of a test.
Figure 1. No DIF between two groups.
Figure 2. DIF exists in two groups called uniform DIF.
Two types of DIF, uniform and non-uniform, have been defined by Mellenbergh (1982).
Uniform DIF means that the difference between two groups’ probabilities of obtaining a
correct response to an item is the same across all ability levels; uniform DIF thus involves no
interaction between group membership and ability, as shown in Figure 2. Non-uniform
DIF, also called crossing DIF (CDIF), refers to an item that discriminates across ability levels
differently for separate groups, which means that the probability of giving correct responses on
test items for different groups is not the same at all ability levels, as shown in Figure 3. Thus, there
is an interaction between ability levels and separate groups when non-uniform DIF exists
(Swaminathan & Rogers, 1990).
Overall, bias is not a simple synonym for DIF. The differentiation between bias and DIF
depends on “the extent to which a convincing construct validity argument has been given for the
matching criterion” (Shealy & Stout, 1993, p. 197). Therefore, most analyses of test data
examine DIF rather than item bias.
Figure 3. Non-uniform DIF.
1.3 The Purpose of the Study
In order to provide a fair and equitable test, the detection of DIF is necessary.
Traditionally, classical test theory was widely used because of its computational simplicity.
However, several computer programs, such as BILOG-MG 3 (Zimowski et al., 2003), flexMIRT
(Cai, 2012), and IRTPRO (Cai et al., 2011), have recently been developed which can address
complex mathematical computations. As a result, item response theory has grown in popularity.
This current study analyzes the data of the Georgia High School Graduation Predictor Test
(GHSGPT) to investigate DIF across multiple groups using several computer programs with
three popular IRT models for dichotomously scored items. This study has three main objectives.
The first objective is to present an empirical data comparison of three programs, IRTPRO,
BILOG-MG 3, and IRTLRDIF, in order to detect DIF across majority and minority groups with
the one-parameter logistic (1PL), two-parameter logistic (2PL), and three-parameter logistic (3PL)
models. The second purpose is to examine IRTPRO to determine its effectiveness in detecting
DIF. Finally, this study considers whether the GHSGPT exhibits DIF for different ethnicities.
CHAPTER 2
LITERATURE REVIEW
Currently, classical test theory (CTT) and item response theory (IRT) are popular
statistical structures for addressing measurement problems such as test development, test-score
equating, and the identification of biased test items. Forty years ago, Frederic Lord indicated that
examinees’ observed scores and true scores were not the same as their ability scores because
ability scores are test independent (Hambleton & Jones, 1993). On the other hand, examinees’
observed- and true- scores are test-dependent (Lord, 1953). Thus, the CTT and the IRT are
widely perceived as representing two measurement frameworks.
2.1 Classical Test Theory
Classical test theory (CTT) or traditional measurement theory, which is referred to as the
“classical test model,” is regarded as the “true score theory” and includes three concepts: 1) the
observed score (test score); 2) the true score; and 3) the error score. Each observed score is made
up of two components, which are the “true score (T)” and the “error score (E)” (Hambleton &
Jones, 1993). The model of CTT is defined as:
X = T + E, (1)
where X is the test score (observed score), T is the true score, and E is the error score.
Observed scores are simply the scores individuals obtain on the measuring instrument.
The true score is the one that each observer desires to obtain. However, the true score, in fact, is
an unknown value and cannot be directly observed. It is inferred from the observed scores, and it
can merely be estimated. For individuals, the theoretical value of the true scores represents a real
psychological operation or academic performance. The true score for examinee j is given as:
T_j = E(X_j) = \mu_{X_j}. (2)
Errors include systematic errors, random errors, and measurement errors (Spector, 1992).
CTT assumes that each examinee would obtain his or her true score if there were no errors of
measurement, that is, X = T. Because the expected value of X is T, the expectation of E is zero (Lord, 1980):
\mu_{E|T} \equiv \mu_{(X-T)|T} \equiv \mu_{X|T} - \mu_{T|T} = T - T = 0, (3)
where μ is the mean, and the subscripts state that T is fixed. Equation 3 indicates that the error of
measurement is unbiased. If T and E are independent, the observed-score variance is defined as:
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, (4)

where \sigma_X^2 is the variance of the observed score (total score), \sigma_T^2 is the variance of the true score,
and \sigma_E^2 is the variance of errors. Reliability refers to the stability and consistency of assessment
results. The index of reliability can be stated as the ratio of the standard deviation of true scores
to the standard deviation of the observed scores (Lord & Novick, 1968) and is defined as:
\rho_{XT} = \sigma_T / \sigma_X, (5)

where \rho_{XT} is the correlation between true and observed scores, \sigma_T is the standard deviation of the
true score, and \sigma_X is the standard deviation of the observed score. Nevertheless, the true score is
unknown, so getting the Pearson correlation between the observed scores on parallel tests is a
way to estimate the reliability coefficient. The reliability coefficient is given by:
\rho_{XX'} = \sigma_T^2 / \sigma_X^2, (6)

where \rho_{XX'} is the correlation between observed scores on two parallel tests; X and X' are referred
to as parallel measurements.
The assumptions of CTT are that: (1) true and error scores are independent, (2) the
average error score in the population of test takers is zero, and (3) error scores on parallel tests
are independent. The important advantage of CTT is its weak theoretical assumptions which
make it easy to employ in many testing situations. However, CTT’s major limitations are that:
(1) the person statistics are item dependent and (2) the item statistics, such as item difficulty and
item discrimination, are sample dependent (Hambleton & Jones, 1993). Although CTT is easy to
compute and to understand, its item and person statistics are sample dependent, which makes it
difficult to obtain consistent estimates of difficulty, discrimination,
and reliability across samples taking the same test. In order to overcome the disadvantages of CTT, modern test
theory, which is based on the item response theory framework, was developed.
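The parallel-forms logic behind Equations 5 and 6 can be checked numerically. The sketch below is an illustration with invented variance components, not data from this study: it simulates true scores plus independent errors on two parallel forms, then shows that the Pearson correlation between the forms recovers the theoretical reliability \sigma_T^2 / \sigma_X^2.

```python
import math
import random

random.seed(1)

# Simulate N examinees: a true score T plus an independent error on each parallel form.
# The variance components below are hypothetical, chosen to give reliability 0.8.
N = 20000
sigma_T, sigma_E = 10.0, 5.0
T = [random.gauss(50, sigma_T) for _ in range(N)]
X1 = [t + random.gauss(0, sigma_E) for t in T]   # form X
X2 = [t + random.gauss(0, sigma_E) for t in T]   # parallel form X'

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Equation 6: theoretical reliability is var(T) / var(X).
rho_theory = sigma_T ** 2 / (sigma_T ** 2 + sigma_E ** 2)
# Empirical estimate: correlation between the two parallel forms.
rho_hat = pearson(X1, X2)
print(f"theoretical rho = {rho_theory:.3f}, parallel-forms estimate = {rho_hat:.3f}")
```

With these variance components the theoretical reliability is 100/125 = 0.8, and the parallel-forms correlation lands very close to that value, which is exactly why the correlation between parallel tests serves as the reliability estimate when the true scores themselves are unobservable.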
2.2 Modern Test Theory
The theoretical structure of modern test theory (or modern measurement theory) is item
response theory (IRT). IRT, which is also known as “latent trait theory,” is a general statistical
theory concerning an examinee’s item and test performance and how his or her performance
relates to the abilities that are measured by the items in the test (Hambleton & Jones, 1993). In
other words, IRT mainly focuses on item-level information. The essential elements of an IRT
model are ability or proficiency, which is an unobservable (latent) variable, usually denoted by θ,
that varies within the population of examinees and the item characteristic curve (ICC) (Thissen et
al., 1993). The ICC is the curve that describes the functional relationship between the probability
of a correct response to an item and the ability scale. The ICC is denoted as follows (Baker &
Kim, 2004):

P(\beta_i, \alpha_i, \theta_j) \equiv P_i(\theta_j), (7)
where 𝑃𝑖(𝜃𝑗) is the probability of the correct response at any point θj on the ability scale ( j = 1,
2, 3,…,N), i is an item (i = 1, 2, 3, …,n), βi is the difficulty parameter, and αi is the
discrimination parameter (Baker & Kim, 2004). Item responses can be discrete or continuous and
dichotomously or polytomously scored. Item score categories can be ordered or unordered.
The assumptions of IRT are: (1) dimensionality, which may be uni- or multi-dimensional, and
(2) local independence, also called conditional independence, which means that every person has a
certain probability of giving a predefined response to each item, and this probability is
independent of the answers given to the preceding items (Crocker & Algina, 2008). The
characteristics of IRT are parameter invariance and the information function, neither of which CTT
offers. The major limitation of IRT is that it tends to be complex in
its computations.
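Local independence has a concrete computational consequence: conditional on θ, the probability of a whole response pattern is just the product of the per-item probabilities. The sketch below illustrates this with three hypothetical Rasch items (the difficulties are invented for illustration).

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_probability(theta, pattern, difficulties):
    """Under local independence, the conditional probability of a response
    pattern is the product of the per-item probabilities P or (1 - P)."""
    prob = 1.0
    for u, b in zip(pattern, difficulties):
        p = p_correct(theta, b)
        prob *= p if u == 1 else (1.0 - p)
    return prob

b = [-1.0, 0.0, 1.0]                      # hypothetical item difficulties
print(pattern_probability(0.0, (1, 1, 0), b))
```

Because the per-item probabilities multiply, the probabilities of all 2^n possible response patterns at a given θ sum to one; this factorization is what every IRT likelihood, including the ones in the next section, is built on.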
2.3 Estimation of Item Parameters
This study applies three computer programs, IRTPRO, BILOG-MG 3 and IRTLRDIF, to
analyze DIF. These three programs implement the method of marginal maximum likelihood
estimation (MMLE) and maximum likelihood estimation (MLE) for item parameter estimation.
Hence, this study utilizes only MMLE and MLE.
2.3.1 Marginal Maximum Likelihood Estimation (MMLE)
The method of marginal maximum likelihood estimation (MMLE) was proposed by
Bock and Lieberman (1970). However, their approach was practical only for very short tests; the
computation was complicated, and the estimation was slow. Thus, in order to solve these
problems, Bock and Aitkin (1981) developed the expectation-maximization (EM) algorithm to
improve the effectiveness of the MMLE. Baker and Kim (2004) indicated that the MMLE
assumes that examinees represent a random sample from a population where ability is distributed
based on a density function g(θ|τ), where τ refers to the vector containing the parameters of the
examinee population’s ability distribution. This situation corresponds to a mixed-effects
ANOVA model in which items are considered a fixed effect and abilities a random effect. The
essential feature of the Bock and Lieberman solution is its ability to integrate over the ability
distribution and to remove random nuisance parameters from the likelihood functions (Baker &
Kim, 2004). Therefore, item parameters are estimated in the marginal distribution; the item
parameter estimation is freed from its dependency on the estimation of each examinee's ability,
although not from its dependency upon the ability distribution. The ability is estimated
together with the item parameters if the ability distribution is correctly identified (Baker & Kim,
2004). Because increasing sample size does not require the estimation of additional examinee
parameters, this produces consistent estimates of item parameters for samples of any size
(Harwell et al., 1988). The marginal likelihood function will be maximized in order to obtain
item parameters; the marginal likelihood is given below (Baker & Kim, 2004):

L = \prod_{j=1}^{N} \int \prod_{i=1}^{n} P_i(\theta_j)^{u_{ij}} Q_i(\theta_j)^{1-u_{ij}} g(\theta_j|\tau) \, d\theta_j, (8)

where u_{ij} is the dichotomous response (0 or 1) of examinee j to item i, Q_i(\theta_j) = 1 - P_i(\theta_j), and g(\theta_j|\tau) is the
probability density function of ability in the population of examinees (Baker & Kim, 2004).
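In practice the integral in Equation 8 is approximated by numerical quadrature over a grid of θ values. The sketch below is a deliberately simple illustration, not how BILOG-MG or IRTPRO implement it: it uses a plain rectangle rule over a standard-normal ability prior and three hypothetical 2PL items to compute the marginal probability of one examinee's response pattern.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def marginal_likelihood(response_pattern, items, n_points=81):
    """Approximate the Equation 8 integral for one examinee: integrate
    P^u * Q^(1-u) over a standard-normal ability prior using a simple
    rectangle rule on an equally spaced theta grid from -4 to 4."""
    total = 0.0
    step = 8.0 / (n_points - 1)
    for k in range(n_points):
        theta = -4.0 + step * k
        weight = math.exp(-0.5 * theta ** 2) / math.sqrt(2 * math.pi)
        cond = 1.0
        for u, (a, b) in zip(response_pattern, items):
            p = p_2pl(theta, a, b)
            cond *= p if u == 1 else (1.0 - p)
        total += cond * weight * step
    return total

items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]   # hypothetical (a, b) pairs
print(marginal_likelihood((1, 0, 1), items))
```

Summing this marginal probability over all possible response patterns recovers (approximately) 1.0, because the integrand is a proper conditional probability weighted by a proper density; production programs replace the rectangle rule with Gauss-Hermite quadrature points and wrap the whole computation inside EM iterations.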
2.3.2 Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation (MLE) begins with a mathematical expression
known as the likelihood function: the likelihood of a set of parameter values is the
probability of obtaining the observed data under the chosen probability
distribution model at those parameter values (Czepiel, 2002). The parameter values that
maximize the sample likelihood are known as the maximum likelihood estimates (MLEs).
The MLE procedures will be presented for the two-parameter logistic model (Baker & Kim,
2004) that is given by:
P_j = \Psi(Z_j) = \frac{1}{1 + e^{-(\zeta + \lambda\theta_j)}}, (9)

where Z_j = \zeta + \lambda\theta_j is the logit, \zeta is the intercept, and \lambda is the slope. The likelihood function
is defined by:
\mathrm{Prob}(R) = \prod_{j=1}^{k} \frac{f_j!}{r_j!\,(f_j - r_j)!} \, P_j^{r_j} (1 - P_j)^{f_j - r_j}, (10)

where r_j is the number of correct responses, f_j - r_j is the number of incorrect responses, and P_j is the true
probability of a correct response. There are \binom{f_j}{r_j} different ways to arrange r_j successes from among
f_j trials for each population; the probability of success on any one of the f_j trials is P_j, so the
probability of r_j successes is P_j^{r_j} (Czepiel, 2002). Similarly, the probability of f_j - r_j failures is
(1 - P_j)^{f_j - r_j}. The maximum likelihood estimates are the parameter values that maximize the
likelihood function in Equation 10 (Czepiel, 2002).
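The maximization in Equation 10 can be sketched directly. The grouped data below are invented for illustration (ability level, number of trials, number of successes), and the optimizer is a deliberately crude grid search rather than the Newton-type iterations real programs use; it finds the intercept \zeta and slope \lambda of Equation 9 that maximize the log-likelihood.

```python
import math

def log_likelihood(zeta, lam, data):
    """Log of Equation 10 (constant binomial coefficients dropped, since
    they do not depend on the parameters). data holds (theta_j, f_j, r_j)."""
    ll = 0.0
    for theta, f, r in data:
        p = 1.0 / (1.0 + math.exp(-(zeta + lam * theta)))   # Equation 9
        ll += r * math.log(p) + (f - r) * math.log(1.0 - p)
    return ll

# Hypothetical grouped data: at each ability level, f_j trials and r_j successes.
data = [(-1.0, 100, 20), (0.0, 100, 50), (1.0, 100, 80)]

# Crude grid search over intercept and slope (step 0.1) for the maximizer.
best = max(((z / 10.0, l / 10.0)
            for z in range(-30, 31) for l in range(1, 41)),
           key=lambda p: log_likelihood(p[0], p[1], data))
print("MLE (intercept, slope) ~", best)
```

With these data the observed proportions 0.2, 0.5, and 0.8 fall exactly on a logistic curve with intercept 0 and slope ln(4) ≈ 1.386, so the grid search lands on the nearest grid point to that pair.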
2.4 Dichotomously Scored Items
For psychological and educational testing, dichotomous scoring, polytomous scoring, and
continuous scoring are commonly used in the scoring of item responses. Previously, DIF
research primarily focused on dichotomously scored items (Embretson & Reise, 2000); recently,
however, several studies mention polytomously scored items (Raju et al., 1995). Because this
study is focused on the unidimensional dichotomously scored items, it discusses only the
unidimensional dichotomously scored items.
Dichotomously scored items, marked as either correct or incorrect, make up the majority
of the multiple-choice test items analyzed, even though a multiple-choice test item typically has four
options (Potenza & Dorans, 1995). Van der Linden and Hambleton (1996) noted that if
examinee j’s response to item i is denoted by a random variable Uij, the two scores are coded as
Uij = 1 (correct) and Uij = 0 (incorrect). The examinee’s ability is represented by the
parameter θ ∈ (-∞, ∞). The properties of item i that have an
effect on the probability of success are its difficulty, bi ∈ (-∞, ∞), and discriminating power, ai
∈ (-∞, ∞). The probability of success on item i is usually denoted by Pi(θ), a function of θ
specific to item i, known as the item response function (IRF), item characteristic curve (ICC), or
trace line. The IRF is generally not linear in θ; it is usually assumed to be monotonically increasing
as θ rises, so that it gives a different probability of a correct response across the ability
continuum (Thissen et al., 1993).
2.4.1 Item Response Models
Dimensionality, which is one of the assumptions under IRT, includes unidimensionality
and multidimensionality. Both the unidimensional item response theory (UIRT) model and the
multidimensional item response theory (MIRT) model include dichotomously and polytomously
scored items. Based on the different scoring and dimensionality, researchers developed different
item response models. Table 1 briefly displays the dimensionality, scoring, parameters, model
presented by researchers, and computer programs that are appropriate to use in different models.
Table 1
The Development of Item Response Models and Computer Programs

Unidimensionality
  Dichotomous (programs: Winsteps, BILOG-MG, IRTPRO, flexMIRT, TESTFACT)
    One-Parameter Logistic Model or Rasch Model (1PLM): Rasch (1960)
    Two-Parameter Logistic Model (2PLM): Birnbaum (1968)
    Three-Parameter Logistic Model (3PLM): Birnbaum (1968)
  Polytomous (programs: MULTILOG, PARSCALE, IRTPRO, flexMIRT, ConQuest)
    Nominal Response Model: Bock (1972)
    Rating Scale Model: Andrich (1978)
    Graded Response Model: Samejima (1969)
    Partial Credit Model: Masters (1982)
    Generalized Partial Credit Model: Muraki (1991)
Multidimensionality
  Dichotomous (programs: TESTFACT, NOHARM, ConQuest, BMIRT, IRTPRO, flexMIRT)
    Multidimensional Extension of the Rasch Model (M1PL): Adams, Wilson, & Wang (1997)
    Multidimensional Extension of the Two-Parameter Logistic Model: McKinley & Reckase (1991)
    Multidimensional Extension of the Three-Parameter Logistic Model: Reckase (1985)
  Polytomous (programs: POLYFACT, BMIRT, IRTPRO, flexMIRT)
    Multidimensional Extension of the Graded Response (MGR) Model: Muraki & Carlson (1993)
    Multidimensional Extension of the Partial Credit (MPC) Model: Kelderman & Rijkes (1994)
    Multidimensional Extension of the Generalized Partial Credit (MGPC) Model: Yao & Schwarz (2006)

Note. Adapted from Multidimensional Item Response Theory, by M. D. Reckase, 2009. Copyright 2009 by Springer.
15
The one-parameter logistic (1PL) model, also called the Rasch model, the two-parameter
logistic (2PL) model, and the three-parameter logistic (3PL) model are the three most popular
unidimensional IRT models for dichotomous tests. Because this study focuses on UIRT, only
these three models are discussed.
2.4.1.1 The One-Parameter Logistic (1PL) Model or The Rasch Model
In the 1950s, Georg Rasch (1960) developed his Poisson models for reading tests and a
model for intelligence and achievement tests, which is called the Rasch model. Under the Rasch
model, both guessing and discrimination are negligible or constant. The main motivation of the
Rasch model was to remove references to populations of examinees in analyses of tests. The test
analysis would only be worthwhile if it were individual centered with separate parameters for the
items and the examinees. The Rasch model was derived from the initial Poisson model defined
as (Van der Linden & Hambleton, 1996):

ξ = δθ, (11)
where 𝜉 is a function of parameters describing the ability of an examinee and difficulty of the
test, θ is the ability of the examinee, and δ is the difficulty of the test that is estimated by the
summation of errors in a test.
The model was later extended so that the probability that a student answers a question
correctly is a logistic function of the difference between the student’s ability θ and the
question’s difficulty. Currently, the Rasch model is specified as:
P(θ) = e^(θ−b) / [1 + e^(θ−b)], (12)

where P(θ) depends upon the particular ICC model used, e is the constant 2.718, θ is the ability,
and b is an item difficulty parameter.
The difficulty parameter, b, locates the item on the ability scale. It is defined
as the point on the ability scale at which the probability of a correct response to the item is .5
(Baker & Kim, 2004). Under the Rasch model, the discriminations of all items are assumed to
equal one. The Rasch model is appropriate for dichotomous responses and models the
probability of an individual’s correct response on a dichotomous item.
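As a concrete illustration, Equation 12 can be evaluated in a few lines of Python. This is only a sketch of the model itself, not the estimation machinery of the programs used in this study, and the parameter values are invented:

```python
import math

def rasch_prob(theta, b):
    """Equation 12: P(theta) = e^(theta - b) / (1 + e^(theta - b))."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# At theta = b, the probability of a correct response is exactly .5.
print(rasch_prob(0.0, 0.0))                          # 0.5
# The IRF is monotonically increasing in theta.
print(rasch_prob(1.0, 0.0) > rasch_prob(-1.0, 0.0))  # True
```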
2.4.1.2 The Two-Parameter Logistic (2PL) Model
Unlike Rasch, Birnbaum’s aim was to finish the work begun by Lord (1952) on the
normal-ogive model. The contribution of Birnbaum was to replace the normal-ogive model with
the logistic model. Thus, Birnbaum (1968) proposed the two-parameter logistic (2PL) model,
which extends the 1PL by estimating an item discrimination parameter (a) and an item difficulty
parameter (b). The 2PL model is given as:
P(θ) = e^(a(θ−b)) / [1 + e^(a(θ−b))], (13)
where a is the discrimination parameter without the scaling constant D= 1.702.
The discrimination parameter, a, describes how well an item can differentiate between
examinees with abilities below and above the item location. It also reflects the steepness of the
ICC in its middle section. The steeper the curve, the higher the value of a and the better the item
discriminates; the flatter the curve, the lower the value of a and the less the item differentiates
(Baker, 2001).
2.4.1.3 The Three-Parameter Logistic (3PL) Model
Besides 2PL, Birnbaum (1968) proposed a third parameter for inclusion in the model to
consider the nonzero performance, which is the probability of guessing correct answers, of low-
ability examinees on multiple-choice items. The three-parameter logistic (3PL) model is defined
as:
P(θ) = c + (1 − c) e^(a(θ−b)) / [1 + e^(a(θ−b))], (14)
where c is the lower asymptote of an ICC.
The lower asymptote, c, which is commonly referred to as the “pseudo-chance level”
parameter, represents the probability of examinees with low ability correctly answering an item.
In general, the c parameter takes values smaller than the value that would result if
examinees of low ability guessed randomly on the item. Thus, Lord (1974) has noted that c
is no longer called the “guessing parameter” because this phenomenon can probably be attributed
to item writers developing “attractive” but incorrect choices. A side effect of including the
parameter c is that the definition of the difficulty parameter changes, and the lower limit of the
ICC is the value of c rather than zero. The difficulty parameter b is now the point on the ability
scale at which:

P(θ) = (1 + c)/2, (15)
and the discrimination parameter is proportional to

a(1 − c)/4, (16)

the slope of the item characteristic curve at θ = b (Baker, 2001).
In discussing this model, McDonald (1999) stated that the 3PL was designed specifically for
multiple-choice cognitive items and that it is appropriate to refer to the latent trait as the
ability common to the m items in the test. With the introduction of the pseudo-guessing
parameter, there is no quantity calculated from the response pattern that serves as a sufficient
statistic for ability.
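The three 3PL properties just described (Equations 14–16) can be checked numerically; the parameter values below are invented for illustration:

```python
import math

def p3pl(theta, a, b, c):
    """Equation 14: P(theta) = c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

a, b, c = 1.2, 0.5, 0.2

# Equation 15: at theta = b the probability is (1 + c) / 2, not .5.
print(round(p3pl(b, a, b, c), 6))        # 0.6

# Equation 16: the slope of the ICC at theta = b is a * (1 - c) / 4,
# checked here with a central-difference numerical derivative.
h = 1e-6
slope = (p3pl(b + h, a, b, c) - p3pl(b - h, a, b, c)) / (2 * h)
print(round(slope, 4), round(a * (1 - c) / 4, 4))   # 0.24 0.24

# The lower limit of the ICC approaches c for very low abilities.
print(round(p3pl(-30.0, a, b, c), 6))    # 0.2
```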
2.5 The DIF Detection Method
Two frameworks, CTT and IRT, are most commonly used to detect DIF. The methods fall
into two classes: non-item response theory (non-IRT) based methods (or observed-score
methods), such as the Mantel-Haenszel (MH) procedure, standardization, SIBTEST, and logistic
regression (Dorans & Holland, 1993); and item response theory (IRT) based methods, such as
Lord’s chi-square test, area measures, and the likelihood function (Hambleton et al., 1991).
This study applies the IRT-based approach to detect
DIF and will present a comparison of three programs, IRTLRDIF 2.1, BILOG-MG 3, and
IRTPRO, using the Georgia High School Graduation Predictor Test data with the three IRT
models.
2.5.1 The Non-Item Response Theory (Non-IRT) Based Method
There are several non-IRT methods to detect DIF, including the Mantel-Haenszel (MH)
procedure, standardization, SIBTEST, and logistic regression.
2.5.1.1 Mantel-Haenszel Method
The Mantel-Haenszel method was proposed by Mantel and Haenszel (1959). This method
is attractive because it is easy to implement, has an associated test of significance, and can be
used with small sample sizes. Thus, this method is the most commonly used of the non-IRT
based methods; it is widely implemented through the two-by-two contingency table procedure
shown in Table 2 and has been the object of considerable evaluation since it was first
recommended by Holland and Thayer (Dorans & Holland, 1993).
Table 2
The 2-by-2 Contingency Table

                            Item Score
Group                  Right    Wrong    Total
Reference Group (R)    A_k      B_k      n_Rk
Focal Group (F)        C_k      D_k      n_Fk
Total Group (T)        m_1k     m_0k     T_k

Note: k = 1, 2, ..., j
There is a chi-square test associated with the MH approach, namely a test of the null
hypothesis:
H0: αMH = 1; H1: αMH ≠ 1, (17)
where αMH is the common odds ratio (Dorans & Holland, 1993). An estimate of the common
odds ratio, α̂MH, is given as:

α̂MH = (Σk AkDk/Tk) / (Σk BkCk/Tk). (18)
The MH chi-square statistic is given as:

MHχ² = [|Σk Ak − Σk E(Ak)| − .5]² / Σk Var(Ak), (19)

where E(Ak) = nRk m1k/Tk and Var(Ak) = nRk nFk m1k m0k/[Tk²(Tk − 1)]; the −.5 in the
expression for MHχ² serves as a continuity correction to improve the accuracy of the chi-square
percentage points as approximations to observed significance levels. The MH statistic
approximates the chi-square distribution with one degree of freedom when the null hypothesis is
true.
The estimate α̂MH measures the DIF effect size in a metric that ranges from 0 to ∞, with a
value of 1 indicating null DIF (Clauser & Mazor, 1998). However, this metric is difficult to
interpret, so it is transformed into:

MH D-DIF (ΔMH) = −2.35 ln(αMH). (20)
According to ΔMH, three categories were developed at ETS for use in test
development (Dorans & Holland, 1993):
(1) Negligible DIF (A): items are classified as A if MH D-DIF is not significantly different
from zero or if |ΔMH| < 1.
(2) Intermediate DIF (B): items in level B are those that meet neither of the other two sets of
criteria.
(3) Large DIF (C): items in level C are those for which |ΔMH| exceeds 1.5 and is significantly
greater than 1.
Although the MH procedure is widely used among non-IRT based methods, its limitation is that
it can detect only uniform DIF.
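The computations in Equations 18–20 can be sketched as follows; the counts are hypothetical and serve only to show the arithmetic:

```python
import math

# Hypothetical counts for one item across K = 3 score strata (invented numbers);
# each tuple is (A_k, B_k, C_k, D_k) = (ref right, ref wrong, focal right, focal wrong).
strata = [(40, 10, 30, 20), (60, 20, 45, 35), (80, 40, 50, 50)]

# Equation 18: common odds-ratio estimate.
num = sum(A * D / (A + B + C + D) for A, B, C, D in strata)
den = sum(B * C / (A + B + C + D) for A, B, C, D in strata)
alpha_mh = num / den

# Equation 19: MH chi-square with the .5 continuity correction.
sum_A = sum(A for A, B, C, D in strata)
sum_E = sum((A + B) * (A + C) / (A + B + C + D) for A, B, C, D in strata)
sum_V = sum((A + B) * (C + D) * (A + C) * (B + D)
            / ((A + B + C + D) ** 2 * (A + B + C + D - 1))
            for A, B, C, D in strata)
mh_chi2 = (abs(sum_A - sum_E) - 0.5) ** 2 / sum_V

# Equation 20: the ETS delta metric.
delta_mh = -2.35 * math.log(alpha_mh)
print(round(alpha_mh, 3), round(mh_chi2, 3), round(delta_mh, 3))
```

With these invented counts, α̂MH > 1 (the reference group is favored), so ΔMH is negative; because |ΔMH| exceeds 1.5, the item would fall in the ETS C category if the difference were also statistically significant.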
2.5.1.2. Standardization
The standardization approach was developed by Dorans and Kulick (1986) at the
Educational Testing Service (ETS) for use with the Scholastic Assessment Test (SAT). DIF
exists when the expected performance on an item, which can be operationalized by
nonparametric item-test regressions, differs between examinees of equal ability from different
groups. Dorans and Holland (1993) stated that one of the main purposes of the standardization
approach is to use all available appropriate data to estimate the
conditional item performance of each group at each level of the matching variable. The matching
does not require the use of stratified sampling procedures to produce equal numbers of
examinees at a given score level across group memberships. In addition, the standardization
approach makes it straightforward to obtain standardized response rates for distractors,
omissions, and not-reached items (Schmitt & Dorans, 1990).
of giving correct responses to an item is lower for examinees from one group than for examinees
of equal ability from another group, DIF is exhibited in this item. Therefore, DIF does not exist
in an item when, at every score level,

Pg (X = 1|S) − Pg′ (X = 1|S) = 0, (21)

where S refers to developed ability as measured by the total score on a test, X is an item score (X
= 1 for a correct answer and X = 0 for an incorrect answer), and Pg (X = 1|S) refers to the
probability that a candidate from subpopulation g with a total test score equal to S will
provide the correct answer.
The basic DIF measure in the standardization approach is the difference in observed
proportions correct on an item between the two groups at the kth level of the matching variable.
The measure is given as:

Dk = Pfk − Prk, (22)

where Pfk is the proportion correct on the studied item for the focal group and Prk for the
reference group at the kth level of the matching variable.
The standardized p-difference, DSTD, is one of the important DIF indices used in this
approach, and it can range from −1 to 1 (Dorans & Schmitt, 1991). DSTD is given as:

DSTD = Σs Ks(Pfs − Prs) / Σs Ks, (23)

where Ks/ΣKs represents the weighting factor at score level s supplied by the standardization
group to weight the differences in performance between Pfs and Prs.
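A minimal sketch of Equations 22 and 23; the proportions and weights below are invented:

```python
# Hypothetical proportions correct at four matching-score levels (invented numbers).
p_focal = [0.30, 0.45, 0.60, 0.80]
p_ref   = [0.35, 0.52, 0.65, 0.82]
weights = [50, 120, 100, 30]          # K_s, e.g., focal-group counts at each level

# Equation 22: the per-level difference D_k = P_fk - P_rk.
d_k = [pf - pr for pf, pr in zip(p_focal, p_ref)]

# Equation 23: the standardized p-difference, a weighted average of the D_k.
d_std = sum(w * d for w, d in zip(weights, d_k)) / sum(weights)
print(round(d_std, 4))   # -0.055 -> the item slightly disfavors the focal group
```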
2.5.1.3. SIBTEST
SIBTEST is a nonparametric procedure. It estimates the amount of DIF in an item and
statistically tests whether the amount is different from zero. In addition, it assesses differences in
item performance from two groups through their conditional ability levels. The main
characteristic of SIBTEST is that it employs a regression correction method to match examinees
from reference and focal groups at the same latent ability levels in order to compare their
performances on the studied items. This correction controls the inflation of Type I error that
would otherwise result from measurement error in the test and from differences in the ability
distributions across groups (Bolt, 2000).
SIBTEST requires two non-overlapping subsets of items in the test. One is the valid
subtest, whose items are assumed to measure the target ability. The other is the suspect subtest,
which contains the items to be tested for DIF. Scores on the valid subtest are used to match
examinees having the same ability levels across group memberships so as to test items from the
suspect subtest for DIF (Bolt, 2000).
2.5.1.4. Logistic Regression
The logistic regression procedure was proposed by Swaminathan and Rogers (1990). This
model can be used to detect DIF by specifying separate equations for the two groups of interest.
The equation is given by:
P(Uij = 1|θij) = e^(β0j + β1jθij) / [1 + e^(β0j + β1jθij)], (24)
where Uij is the response of person i in group j of an item, β0j is the intercept of group j, β1j is the
slope of group j, and θij is the ability of an examinee i in group j. If DIF does not exist, the
logistic regression curves for the two groups must be equal, that is, β01 is equal to β02, and β11 is
equal to β12. However, uniform DIF may be inferred if β01 is not equal to β02, and the curves are
parallel but not equivalent. In addition, the presence of non-uniform DIF may be inferred if β01 is
equal to β02, but β11 is not equal to β12, and the curves are not parallel (Swaminathan & Rogers,
1990).
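The distinction between uniform and non-uniform DIF in Equation 24 can be illustrated directly. The coefficients below are invented, and the curves are compared on the logit scale, where Equation 24 is linear in θ:

```python
import math

def p_correct(theta, b0, b1):
    """Equation 24: group-specific logistic regression of the item score on theta."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * theta)))

def logit(p):
    return math.log(p / (1.0 - p))

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: equal slopes but unequal intercepts -> a constant logit difference,
# i.e., parallel but non-equivalent curves.
diffs_uniform = [logit(p_correct(t, 0.5, 1.0)) - logit(p_correct(t, 0.0, 1.0))
                 for t in thetas]
print([round(d, 6) for d in diffs_uniform])      # [0.5, 0.5, 0.5, 0.5, 0.5]

# Non-uniform DIF: unequal slopes -> the logit difference changes with theta,
# so the curves are not parallel.
diffs_nonuniform = [logit(p_correct(t, 0.5, 1.0)) - logit(p_correct(t, 0.5, 1.5))
                    for t in thetas]
print([round(d, 6) for d in diffs_nonuniform])   # [1.0, 0.5, 0.0, -0.5, -1.0]
```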
2.5.2 The Item Response Theory (IRT) Based Method
IRT based methods include a comparison of item parameters, area measures, and
likelihood functions.
2.5.2.1. The Comparison of Item Parameters
This method, proposed by Lord (1980), performs a statistical test of the equality of item
parameters; it can investigate either the differences in the a, b, and c parameters simultaneously
or merely the differences in the a and b parameters (Lord, 1980). Lord proposed two tests for
evaluating the statistical significance of DIF.
2.5.2.1.1. The Test of b Difference
The test compares the difficulty parameters, b, for the focal and reference groups and
is defined as (Thissen et al., 1993):

di = (b̂Fi − b̂Ri) / √[Var(b̂Fi) + Var(b̂Ri)], (25)
where b̂Fi and b̂Ri are the maximum likelihood estimates of the item difficulty parameter for the
focal and reference groups, and Var(b̂Fi) and Var(b̂Ri) are the variances of the b estimates for the
focal and reference groups. The null hypothesis is H0: di = 0, and di follows the standard normal
distribution. If di is greater than 1.96 or smaller than −1.96 (two-tailed p ≤ .05), the null
hypothesis is rejected and DIF exists. Besides this test, Lord proposed a test of the joint
difference between ai and bi for two groups (Thissen et al., 1993), known as Lord’s chi-square.
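Equation 25 amounts to a simple z test; the estimates below are invented for illustration:

```python
import math

def b_difference(b_focal, var_focal, b_ref, var_ref):
    """Equation 25: Lord's d_i statistic for the b parameters of one item."""
    return (b_focal - b_ref) / math.sqrt(var_focal + var_ref)

# Hypothetical estimates: the focal group finds the item harder.
d_i = b_difference(b_focal=0.80, var_focal=0.02, b_ref=0.35, var_ref=0.01)
print(round(d_i, 3), abs(d_i) > 1.96)   # 2.598 True -> the item is flagged for DIF
```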
2.5.2.1.2. The Lord’s Chi-Square
Lord (1980) employed the chi-square method to test whether the item parameters of the
two groups (focal and reference) differ significantly. Lord’s chi-square, which examines the
hypothesis that each of the parameters of the item response function is consistent across groups
(Cohen & Kim, 1993), is the difference between the two vectors of item parameter estimates
weighted by the inverse of the variance-covariance matrix, that is, the Wald statistic. However,
the item parameter estimates should be placed onto the same scale before comparing the item
parameters estimated in two groups of examinees. The equation is defined as:

χ² = (b̂Fi − b̂Ri)′Σ⁻¹(b̂Fi − b̂Ri), (26)

where Σ is the estimate of the sampling variance-covariance matrix of the differences between
the item parameter estimates, and χ² has two degrees of freedom for large samples. Lord’s
chi-square has been shown to be efficient for the detection of DIF under several assumptions,
including asymptotic sample sizes, known θ, and maximum likelihood estimation (Kim et al.,
1995).
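A sketch of the Wald statistic in Equation 26 for a single item, with invented differences in the (a, b) estimates and an invented covariance matrix; the 2-by-2 inverse is written out by hand:

```python
# Hypothetical focal-minus-reference differences in the (a, b) estimates (invented),
# with Sigma, the 2x2 sampling covariance matrix of the differences.
d1 = 1.10 - 0.95        # difference in the a estimates
d2 = 0.60 - 0.30        # difference in the b estimates
s11, s12, s22 = 0.010, 0.002, 0.008

# Equation 26: Wald statistic d' Sigma^{-1} d, using the closed-form 2x2 inverse.
det = s11 * s22 - s12 * s12
chi2 = (d1 * d1 * s22 - 2 * d1 * d2 * s12 + d2 * d2 * s11) / det
print(round(chi2, 2), chi2 > 5.99)   # 11.84 True (chi-square(2) .05 critical value)
```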
25
2.5.2.2. Area Measure
Before computing the area between two item characteristic curves (ICCs), it is necessary
to transform the estimates obtained from the reference and focal groups onto the same scale. If
no DIF is present, the area between the two ICCs of the same item should equal 0; if the area is
not equal to 0, then DIF exists (Rudner et al., 1980). Raju (1988) stated that “the area between two ICCs is
only estimated either by integrating the appropriate function between two finite points or by
adding successive rectangles of width 0.005 between two finite points” (p. 495). In addition, he
proposed the signed and unsigned area formulas for calculating the exact area between two ICCs
for the 1PL, 2PL, and 3PL models. The signed area (SA) refers to the difference between the two
curves, and it is defined as:

Signed Area (SA) = ∫_{−∞}^{∞} (F1 − F2) dθ. (27)

The unsigned area (UA) refers to the distance between them, and it is given as:

Unsigned Area (UA) = ∫_{−∞}^{∞} |F1 − F2| dθ. (28)
For the 3PL, if F1 and F2 stand for two ICCs with the stipulations a1 ≠ a2 and c = c1 = c2, then:

SA = (1 − c)(b2 − b1), (29)

UA = (1 − c)[(2(a2 − a1)/(Da1a2)) ln(1 + e^(Da1a2(b2−b1)/(a2−a1))) − (b2 − b1)]. (30)
The area between two ICCs is finite when the lower asymptotes, c, are equal. On the
other hand, when the c parameters are unequal, the area between two ICCs is infinite, and this
will yield misleading results. In other words, if the area measure needs to be meaningful and
valid, the area between two ICCs must be finite, and its estimate must be fairly accurate (Raju,
1988).
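Raju's closed form for the signed area (Equation 29) can be checked against a brute-force numerical integral of Equation 27. The item parameters below are invented, with unequal a parameters and a common c:

```python
import math

D = 1.702   # scaling constant

def icc(theta, a, b, c):
    """3PL ICC with the scaling constant D."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

a1, b1 = 0.8, -0.2
a2, b2 = 1.3, 0.5
c = 0.2

# Crude Riemann-sum approximation of Equation 27 over a wide theta range;
# the tails beyond +/-40 are negligible for these parameters.
step = 0.01
sa_numeric = sum(icc(-40 + i * step, a1, b1, c) - icc(-40 + i * step, a2, b2, c)
                 for i in range(8000)) * step

sa_closed = (1 - c) * (b2 - b1)   # Equation 29: depends only on c and b2 - b1
print(round(sa_numeric, 3), round(sa_closed, 3))   # 0.56 0.56
```

Note that the signed area does not involve the a parameters at all, which is why unequal discriminations still yield a finite, simple SA as long as the c parameters are equal.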
26
2.5.2.3. The Likelihood Function
The likelihood function approach uses the likelihood ratio (LR) test, which was proposed by
Thissen, Steinberg, and Gerrard (1986) and Thissen, Steinberg, and Wainer (1993), to evaluate
the differences between item responses from two groups (Cohen et al., 1996). In this approach,
the null hypothesis that the item parameters of the two groups are equal is tested.
Moreover, it can test both uniform and non-uniform DIF. The uniform DIF analyzes the
difference in the item difficulty parameters between a reference and focal group. By contrast,
non-uniform DIF examines the difference in the item discrimination parameters (Cohen et al.,
1996).
The LR procedure involves a compact model (C) and an augmented model (A). Thissen
et al. (1993) stated that the compact model is the item response to be tested, and the anchor items
across two groups are constrained to be equal. Cohen et al. (1996) stated that “in the augmented
model, item parameters for all items except the studied item(s), which are referred to as the
common or anchor set, were constrained to be equal in both the reference and focal groups”
(p. 19). Because the augmented model includes all parameters of the compact model and additional
parameters, the compact model is hierarchically nested within the augmented model (Cohen et
al., 1996). The LR is the difference between the values of -2log likelihood for the compact model
(LC) and for the augmented model (LA) (Cohen et al., 1996). LR is defined as:
G²(d.f.) = −2logLC − (−2logLA), (31)

where L is the likelihood of the data given the maximum likelihood estimates of the model
parameters, d.f. is the difference between the numbers of parameters in the augmented and
compact models, and G²(d.f.) is distributed as χ²(d.f.) under the null hypothesis. Therefore, if the
value of G²(d.f.) is large, the null hypothesis will be rejected (Thissen et al., 1993). In other
words, if the test result is statistically significant, DIF exists in the studied item.
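Equation 31 reduces to simple arithmetic once the two models have been fitted; the −2 log-likelihood values below are invented:

```python
# Hypothetical -2 log-likelihood values (invented) for one studied item.
neg2loglik_compact   = 41250.8   # all item parameters constrained equal across groups
neg2loglik_augmented = 41238.3   # studied item's parameters free in each group

g2 = neg2loglik_compact - neg2loglik_augmented   # Equation 31
df = 3                                           # e.g., a, b, and c freed

# The chi-square critical value for df = 3 at alpha = .05 is 7.815.
print(round(g2, 1), g2 > 7.815)   # 12.5 True -> DIF exists in the studied item
```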
2.6 Current Research
The aim of this study is to employ the IRT framework to detect DIF across ethnicity/race
using three computer programs with three popular dichotomous models. Several studies, such as
Kim et al. (1995) and Raju and Drasgow (1993), adopted BILOG-MG 3 to detect DIF. In
addition, many studies, such as Woods (2009), employed IRTLRDIF in detecting DIF. To my
knowledge, few studies employ IRTPRO in detecting DIF because it is a new computer program.
The first hypothesis of this study is to compare the difference of testing results using IRTPRO,
BILOG-MG 3, and IRTLRDIF. This study expects that the three programs will exhibit consistent
results. The second hypothesis is to examine IRTPRO to determine its effectiveness in detecting
DIF. The present study expects that IRTPRO is effective in detecting DIF if it exhibits consistent
results with BILOG-MG 3 and IRTLRDIF. Hypothesis three examines model goodness of fit
in detecting DIF in the Georgia High School Graduation Predictor Test (GHSGPT) with three
models: 1PL, 2PL, and 3PL. The paper argues that the 3PL is the best-fitting model because it
was designed specifically for multiple-choice cognitive items, so in discussing this model it is
appropriate to refer to the latent trait as the ability common to the m items in the test (McDonald,
1999). The fourth hypothesis examines the differences between ethnicity groups taking the
GHSGPT in Social Studies. Because of differences in culture, socioeconomic status (SES),
and neighborhood characteristics, this study argues that Whites will perform better than other
races. Hypothesis five investigates whether DIF exists in the GHSGPT between ethnicity groups’
item responses. The current research anticipates that DIF will exist in several items.
CHAPTER 3
METHOD
3.1 Research Structure
This study utilizes the fall 2010 empirical data of the GHSGPT, which measures high
school achievement in the fields of Social Studies and Science, from the Georgia Center for
Assessment. It detects DIF across races using the three programs, IRTPRO, BILOG-MG 3, and
IRTLRDIF and compares whether these three programs are consistent and, thus, appropriate to
investigate DIF. Figure 4 shows the research structure.
Figure 4. The research structure: empirical data (the GHSGPT, 79 dichotomously scored items)
→ examining DIF across race/ethnicity with IRTPRO, BILOG-MG, and IRTLRDIF under the
1PL, 2PL, and 3PL models → analysis of DIF by race/ethnicity → comparison of DIF results
across ethnicities when using the three programs.
3.2 Instrumentation
An empirical comparison of the three programs is presented using the fall 2010 data of
the GHSGPT. Although GHSGPT measures high school achievement in the fields of Social
Studies and Science, this study detects DIF for different ethnicities only in Social Studies, which
consists of 79 dichotomously scored items. Note that the test originally contained 80 items;
however, Item 26 was considered a problematic item because its biserial correlation was -.052,
so Item 26 was removed, and the remaining subsequent items were renumbered to maintain
consecutive numbering. The GHSGPT contains multiple-choice questions, and each multiple-
choice item has four response options. This test is a standardized test, and it follows the blueprint
of the Georgia High School Graduation Tests (GHSGT), including the same strands and
objectives. There are six strands for Social Studies that include World Studies (18-20%), U.S.
History to 1865 (18-20%), U.S. History since 1865 (18-20%), Citizenship/Government (12-
14%), Map and Globe Skills (15%), and Information Processing Skills (15%). Because both
GHSGPT and GHSGT are built on the same content, the GHSGPT is able to predict 11th grade
students’ future performance on the GHSGT (Georgia Department of Education, 2010).
3.3 Sample
The data for the 11th grade GHSGPT in Social Studies consists of 2,654 respondents after
deleting the non-response data. Respondents were 11th grade students attending 18 different high
schools from 17 different counties in Georgia. Table 3 shows the DIF detection for ethnicity.
Whites are treated as the reference group, and Blacks, Hispanics, and a Multi-Racial group are
treated as the focal groups.
Table 3
The DIF Detection for Ethnicity

Race            Sample Size
Whites          1,536
Blacks          872
Hispanics       114
Multi-Racial    132
Total           2,654
3.4 Computer Programs
Three computer programs, IRTLRDIF, BILOG-MG 3, and IRTPRO, are used in this study.
3.4.1 IRTLRDIF
IRTLRDIF refers to likelihood-ratio testing for differential item functioning, and the
program is based on IRT (Woods, 2009). It was developed to implement a version of IRT-LR
DIF analysis for large-scale testing applications (Thissen, 2001). In previous studies, IRT-LR
DIF detection has been used in disparate research contexts. For example, Wainer et al. (1991)
used this procedure to study the testlets for DIF. In addition, Wang et al. (1995) used it to
investigate the consequences of item choice in an experimental section. Furthermore, Steinberg
(1994) has used this procedure to effectively answer questions about item serial-position and
context effects with experimental data. These studies showed that “IRT-LR DIF analysis tests
precisely specified and straightforwardly interpretable hypotheses about the parameters of item
response models” (Thissen, 2001, p. 3). IRTLRDIF employs the likelihood ratio test and
implements the methods of marginal maximum likelihood (MML) for item parameter estimation.
3.4.2 BILOG-MG 3
BILOG-MG 3 is an extension of the BILOG 3 program. Zimowski et al. (2003) stated
that it is designed for the effective analysis of binary items, it is capable of large-scale
production applications without limited numbers of items or respondents, and it can perform item
analysis and the scoring of any number of subtests or subscales. In addition, it can analyze DIF
and DRIFT (Item Parameter Drift) associated with multiple groups, and it can perform the
equating of test scores. The response models include the one-, two-, and three-parameter models
(Zimowski et al., 2003). BILOG-MG 3 applies likelihood ratio chi-square and executes the
method of marginal maximum likelihood estimation (MMLE) for item parameter estimation.
3.4.3 IRTPRO
IRTPRO (Item Response Theory for Patient-Reported Outcomes) is a new IRT program
for item calibration and test scoring (Cai et al., 2011). Item calibration and scoring are
implemented for unidimensional IRT models, such as models for multiple-choice or
short-answer items scored correctly or incorrectly, and for multidimensional models in
confirmatory (CFA) or exploratory (EFA) factor-analytic form. In addition, it is
capable of calibrating large-scale production applications with unrestricted numbers of items or
respondents. The response functions of IRTPRO include 1PL, 2PL, 3PL, graded, generalized
partial credit, and nominal response models. “These item response models may be mixed in any
combination within a test or scale and may have user-specified equality constraints among
parameters, or fixed values for parameters” (Cai et al., 2011, p. 4). IRTPRO applies the Wald
test, an approach proposed by Lord (1980). It implements the methods of marginal
maximum likelihood (MML) and maximum likelihood estimation (MLE) for item parameter
estimation. However, if prior distributions are specified for the item parameters, IRTPRO
calculates Maximum a posteriori (MAP) estimates (Cai et al., 2011).
CHAPTER 4
RESULTS
4.1 Item Analysis
To analyze items of the GHSGPT and to search for problematic items, the item parameters
are estimated by marginal maximum likelihood using BILOG-MG 3. The original data set
consists of 80 items from the GHSGPT, which was administered to 2,654 11th grade high school
students from different counties and high schools. All Pearson and biserial correlations
were positive except those for Item 26, which were −.40 and −.053, respectively; some items fell
below .30. Hence, Item 26 was considered a problematic item and was omitted from calibration,
and the remaining subsequent items were renumbered to maintain consecutive numbering. Thus,
79 items in total were used in this study. Table 4 presents the summary statistics for each
ethnicity/race.
Table 4
Raw Score Summary Statistics for the GHSGPT

                                      Races
Statistics            Whites    Blacks    Hispanics    Multi-Racial
Number of Items       79        79        79           79
Mean                  43.24     36.63     40.46        42.67
Standard Deviation    12.59     11.46     10.73        11.752
Coefficient Alpha     .902      .877      .859         .884
4.1.1 Classical Test Theory
Table 5 presents the classical item statistics on the 79 items of the GHSGPT for the
multiple groups. It displays the number of correct responses (item right), the discrimination
index, the difficulty (p-value), which is the rate of correct responses, and the Pearson and
biserial correlations.
First, this study analyzes the probability of giving correct answers for multiple groups
using SPSS (Statistical Package for the Social Sciences). When an item is dichotomously scored,
the mean item score corresponds to the proportion of examinees who answer the item correctly.
This proportion for item i is denoted as pi and is called the item difficulty or p-value (Crocker &
Algina, 2008). The equation for the p-value is defined as:
pi = (the number of examinees getting the item right) / (the total number of examinees). (32)
The value of pi may range from .00 to 1.00. The p-value expresses the proportion of examinees
that answered an item correctly. For example, the p-value of Item 1 is .492, which means that
only 49.2% of examinees’ responses to Item 1 are correct as shown in Table 5. Items with
difficulties near zero are difficult; however, items with difficulties near one are easy. In order to
avoid very difficult and very easy items, the ranges of difficulties that are acceptable are .3 to .7
(Allen & Yen, 2008). The observed p-values range from .23 to .92. Items 52, 59, 74, 77, and 79
are considered difficult because their p-values are lower than .3, and Item 52 is the hardest item
(p = .231). In addition, Items 17, 41, 53, 61, 62, 63, 65, 66, 67, 70, and 73 are considered easy
because their p-values are higher than .7, and Item 67 is the easiest item (p = .922). There are 62
items (78%) between .3 and .7, the mean of the correct-response rates is .518, and the degree of
difficulty is moderate to easy.
Second, the item discrimination provides an index of how well an item differentiates
between people who do well on the test and those who do not. The discrimination index can
range between −1.00 and +1.00, and acceptable values range from .30 to .70 (Allen & Yen,
2002). The item-discrimination index for item i, di, is defined as (Allen & Yen, 2002):

di = Ui/niU − Li/niL, (33)
where Ui is the number of examinees in the upper range of total test scores who answer item i
correctly, Li is the number of examinees in the lower range of total test scores who answer item i
correctly, niU is the number of examinees with total test scores in the upper range, and niL is the
number of examinees with total test scores in the lower range. Table 5 shows that 43 items fall
below the criterion of .3, which means that these items tend to have low discrimination, and
Item 11 (.004) has the lowest discrimination. The item discriminations range from .004 to .427,
with an average of .262. The Pearson correlations (i.e., point-biserial) range from .025 to .456,
with an average of .304. The biserial correlations range from .034 to .644, with an average of
.399. The reliability is .899.
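Equations 32 and 33 can be sketched on a tiny invented response matrix (the actual analysis used 2,654 examinees and 79 items); the upper/lower split below is a simple median split rather than the 27% rule sometimes used:

```python
# A tiny invented 0/1 response matrix: 6 examinees by 3 items.
responses = [
    [1, 1, 1],   # total score 3
    [1, 1, 0],   # total score 2
    [1, 0, 1],   # total score 2
    [0, 1, 0],   # total score 1
    [1, 0, 0],   # total score 1
    [0, 0, 0],   # total score 0
]
n_items = 3
n = len(responses)
totals = [sum(row) for row in responses]

# Equation 32: p-value = proportion of examinees answering the item correctly.
p = [sum(row[i] for row in responses) / n for i in range(n_items)]
print([round(x, 3) for x in p])   # [0.667, 0.5, 0.333]

# Equation 33: discrimination index d_i, using upper/lower halves by total score.
order = sorted(range(n), key=lambda e: totals[e], reverse=True)
upper, lower = order[:n // 2], order[n // 2:]
d = [sum(responses[e][i] for e in upper) / len(upper)
     - sum(responses[e][i] for e in lower) / len(lower)
     for i in range(n_items)]
print([round(x, 3) for x in d])   # [0.667, 0.333, 0.667]
```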
36
Table 5
ItemItem Right Discrimination
Difficulty (p -value)
Pearson Correlation
Biserial Correlation
1 1306 .133 .492 .142 .1782 1762 .268 .664 .293 .3793 1264 .390 .476 .408 .5124 1190 .324 .448 .344 .4335 906 .173 .341 .187 .2426 1413 .231 .532 .256 .3227 915 .175 .345 .199 .2578 1408 .187 .531 .180 .2269 1510 .367 .569 .379 .478
Table 5 (continued)
Item Statistics Based on Classical Test Theory

Item  Right  Discrimination  Difficulty (p-value)  Pearson Correlation  Biserial Correlation
10  1070  .315  .403  .344  .436
11  1221  .004  .460  .041  .052
12  1260  .335  .475  .371  .465
13  1279  .377  .482  .393  .493
14  1587  .347  .598  .380  .482
15  1912  .362  .720  .420  .560
16  1289  .213  .486  .232  .290
17  2153  .291  .811  .429  .621
18  1447  .388  .545  .417  .524
19  1538  .311  .580  .343  .432
20  1030  .356  .388  .398  .507
21  1392  .201  .524  .222  .279
22  1099  .134  .414  .160  .202
23  1659  .427  .625  .456  .582
24  1254  .281  .472  .300  .377
25  1738  .389  .655  .437  .563
26  812  .153  .306  .139  .182
27  1387  .191  .523  .219  .274
28  827  .144  .312  .175  .229
29  1061  .096  .400  .109  .139
30  1281  .397  .483  .413  .517
31  1663  .403  .627  .425  .543
32  1215  .284  .458  .297  .373
33  1395  .359  .526  .368  .461
34  1140  .174  .430  .187  .236
35  878  .159  .331  .187  .242
36  1166  .295  .439  .352  .443
37  805  .208  .303  .227  .299
38  1175  .341  .443  .365  .459
39  1159  .323  .437  .359  .452
40  1341  .414  .505  .436  .547
41  1971  .310  .743  .398  .539
42  1110  .148  .418  .167  .211
43  1578  .385  .595  .438  .555
44  1160  .170  .437  .179  .226
45  1639  .323  .618  .364  .464
46  1226  .389  .462  .427  .536
47  1360  .344  .512  .350  .438
48  1150  .348  .433  .392  .494
49  1354  .357  .510  .378  .474
50  1026  .291  .387  .330  .420
51  985  .260  .371  .305  .389
52  613  .065  .231  .064  .088
53  2011  .344  .758  .436  .598
54  1246  .330  .469  .338  .424
55  1586  .379  .598  .389  .493
56  1335  .335  .503  .361  .452
57  1008  .330  .380  .348  .444
58  1491  .308  .562  .329  .414
59  667  .065  .251  .084  .114
60  1608  .352  .606  .398  .506
61  1931  .338  .728  .411  .551
62  2119  .245  .798  .355  .506
63  2357  .190  .888  .379  .628
64  1306  .269  .492  .293  .367
65  2381  .166  .897  .362  .613
66  2264  .212  .853  .380  .586
67  2447  .137  .922  .351  .644
68  975  .241  .367  .251  .322
69  1376  .199  .518  .211  .265
70  2329  .158  .878  .303  .489
71  1756  .292  .662  .350  .453
72  1623  .344  .612  .376  .479
73  2335  .179  .880  .363  .590
74  659  .023  .248  .025  .034
75  907  .080  .342  .091  .117
76  1033  .206  .389  .224  .285
77  705  .109  .266  .139  .188
78  1303  .367  .491  .399  .500
79  766  .217  .289  .252  .335
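The classical statistics reported in Table 5 can be computed directly from the 0/1 response matrix. The sketch below (Python, with a tiny hypothetical data set rather than the GHSGPT responses) shows the item difficulty p-value and the Pearson item-total (point-biserial) correlation; the biserial correlation would further rescale this value using the normal ordinate.

```python
import math

def ctt_item_stats(responses, item):
    """Classical item statistics for one dichotomous item.

    responses: list of lists of 0/1 scores (examinees x items).
    Returns (n_right, p_value, point_biserial).
    """
    n = len(responses)
    scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    n_right = sum(scores)
    p = n_right / n  # item difficulty: proportion answering correctly

    # Pearson correlation between item score and total score
    mx, my = sum(scores) / n, sum(totals) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(scores, totals)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in scores) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in totals) / n)
    r = cov / (sx * sy)
    return n_right, p, r

# Tiny illustrative data set (4 examinees, 3 items), not the GHSGPT data
data = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]]
n_right, p, r = ctt_item_stats(data, 0)
```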
4.1.2 Item Response Theory
This study employs BILOG-MG 3 to compute p-values with the 1PL, 2PL, and 3PL models. The total sample
contains 2,654 examinees (Whites = 1,536, Blacks = 872, Hispanics = 114, and Multi-Racial = 132), and α = .05:
a p-value less than .05 is statistically significant. Table 6 shows that with 1PL the item
difficulty ranges from -3.679 to 1.832, the reliability is .898 (Zimowski et al., 2003), the
root mean square (RMS) is .3261, the mean item difficulty is -.173, and six items' p-values
(8%) are greater than .05. For 2PL, item discrimination ranges from .102 to 1.767 and item
difficulty from -1.819 to 6.458. The means of item discrimination and item difficulty are
.519 and .266, respectively, the reliability is .916, the RMS is .293, and 38 items (53%)
determine the goodness-of-fit index. For 3PL, the item discrimination parameter ranges
from .390 to 2.347, the item difficulty parameter from -1.788 to 3.213, and the pseudo-
guessing parameter from .054 to .435. The means of the item discrimination, item difficulty, and
pseudo-guessing parameters are .968, .650, and .224, respectively. The reliability of 3PL
is .923, the RMS is .2844, and 60 items (76%) determine the goodness-of-fit index.
Table 6
Item b p -value a b p -value a b c p -value1 0.042 .00 * 0.194 0.096 .79 0.651 1.854 0.396 .052 -1.049 .49 0.426 -1.061 .01 ** 0.809 0.332 0.413 .00 *3 0.140 .00 * 0.591 0.094 .00 * 1.308 0.669 0.241 .224 0.313 .00 * 0.477 0.278 .00 * 1.046 0.858 0.239 .065 1.004 .00 * 0.255 1.584 .18 0.753 1.822 0.236 .926 -0.206 .03 ** 0.347 -0.247 .62 0.629 0.837 0.301 .127 0.981 .00 * 0.270 1.464 .68 0.608 1.817 0.208 .828 -0.194 .00 * 0.240 -0.317 .96 0.476 1.268 0.337 .769 -0.433 .00 * 0.540 -0.376 .09 0.757 0.218 0.209 .00 *
10 0.597 .00 * 0.485 0.534 .13 1.114 0.980 0.215 .9111 0.240 .00 * 0.106 0.894 .00 * 1.872 1.902 0.435 .00 *12 0.149 .01 ** 0.523 0.116 .47 0.788 0.584 0.174 .3613 0.105 .00 * 0.571 0.070 .63 1.004 0.608 0.212 .4214 -0.616 .00 * 0.565 -0.514 .76 0.858 0.216 0.264 .8515 -1.451 .00 * 0.778 -0.971 .01 ** 1.149 -0.239 0.326 .03 **16 0.082 .04 ** 0.306 0.110 .60 0.426 0.839 0.185 .8717 -2.214 .00 * 1.012 -1.264 .07 1.098 -0.834 0.285 .0818 -0.285 .00 * 0.625 -0.235 .06 1.283 0.485 0.283 .9119 -0.499 .05 0.490 -0.460 .03 ** 0.878 0.438 0.304 .6420 0.694 .00 * 0.576 0.543 .00 * 1.547 0.891 0.199 .0821 -0.157 .00 * 0.299 -0.212 .01 ** 0.390 0.553 0.181 .2722 0.528 .00 * 0.223 0.944 .17 0.559 1.869 0.284 .1123 -0.791 .00 * 0.753 -0.554 .00 * 1.040 -0.033 0.216 .00 *24 0.163 .70 0.407 0.165 .91 0.682 0.830 0.221 .4925 -0.988 .00 * 0.725 -0.699 .02 ** 0.952 -0.149 0.230 .01 **26 1.251 .00 * 0.197 2.512 .07 0.724 2.253 0.233 .01 **27 -0.146 .00 * 0.293 -0.199 .09 0.986 1.174 0.395 .8328 1.211 .00 * 0.251 1.935 .42 0.991 1.843 0.236 .5229 0.619 .00 * 0.163 1.495 .27 0.769 2.219 0.339 .8930 0.100 .00 * 0.597 0.062 .06 1.074 0.586 0.211 .4231 -0.801 .00 * 0.678 -0.593 .18 1.080 0.116 0.282 .1932 0.254 .01 ** 0.405 0.261 .00 * 1.000 0.977 0.276 .9633 -0.164 .00 * 0.524 -0.153 .02 ** 0.939 0.532 0.250 .0334 0.430 .00 * 0.253 0.685 .34 0.698 1.581 0.300 .4735 1.076 .00 * 0.267 1.626 .00 * 1.223 1.648 0.255 .2236 0.369 .00 * 0.492 0.322 .00 * 1.376 0.935 0.269 .6837 1.270 .03 ** 0.319 1.633 .02 ** 0.820 1.659 0.183 .0138 0.348 .04 ** 0.513 0.292 .96 0.846 0.746 0.180 .4139 0.385 .00 * 0.507 0.328 .02 ** 1.068 0.846 0.222 .1340 -0.039 .00 * 0.653 -0.048 .01 ** 0.754 0.172 0.075 .36
Item Statistics Based on Item Response Theory
1PL  2PL  3PL
Note. * p < .001, ** p < .05
Item b p -value a b p -value a b c p -value41 -1.621 .00 * 0.763 -1.091 .23 0.810 -0.733 0.188 .4842 0.502 .00 * 0.227 0.883 .94 0.452 1.836 0.252 .9943 -0.595 .00 * 0.699 -0.440 .79 0.993 0.107 0.218 .4644 0.383 .00 * 0.244 0.631 .08 0.891 1.578 0.338 .7345 -0.742 .00 * 0.552 -0.625 .02 ** 0.679 -0.109 0.185 .01 **46 0.228 .00 * 0.653 0.148 .00 * 2.033 0.697 0.252 .8147 -0.083 .02 ** 0.492 -0.085 .25 0.807 0.545 0.223 .3048 0.407 .00 * 0.579 0.309 .00 * 1.574 0.820 0.239 .2149 -0.069 .00 * 0.554 -0.071 .18 0.877 0.482 0.205 .2250 0.704 .00 * 0.463 0.655 .00 * 1.678 1.033 0.240 .0951 0.805 .35 0.419 0.816 .60 0.679 1.136 0.147 .7352 1.832 .00 * 0.133 5.383 .75 0.767 3.213 0.208 .6253 -1.742 .00 * 0.913 -1.062 .00 * 0.992 -0.738 0.183 .00 *54 0.182 .20 0.476 0.158 .87 0.826 0.734 0.213 .4255 -0.614 .00 * 0.592 -0.498 .00 * 0.705 -0.119 0.137 .00 *56 -0.025 .00 * 0.502 -0.033 .01 ** 0.791 0.549 0.208 .1657 0.748 .00 * 0.493 0.662 .00 * 1.014 0.996 0.179 .00 *58 -0.388 .20 0.474 -0.369 .88 0.772 0.438 0.268 .6459 1.665 .00 * 0.161 4.046 .08 2.347 2.013 0.232 .2560 -0.667 .00 * 0.637 -0.516 .00 * 0.687 -0.321 0.067 .00 *61 -1.505 .00 * 0.773 -1.009 .02 ** 0.817 -0.712 0.151 .1362 -2.093 .00 * 0.776 -1.377 .01 ** 0.754 -1.227 0.122 .5263 -3.108 .00 * 1.406 -1.497 .00 * 1.259 -1.550 0.085 .0164 0.042 .00 * 0.394 0.041 .00 * 1.142 0.966 0.330 .8765 -3.244 .00 * 1.375 -1.566 .00 * 1.212 -1.643 0.089 .0566 -2.655 .00 * 1.125 -1.422 .00 * 1.033 -1.450 0.054 .00 *67 -3.679 .00 * 1.767 -1.605 .00 * 1.581 -1.739 0.075 .00 *68 0.830 .00 * 0.335 1.021 .00 * 0.656 1.403 0.188 .01 **69 -0.120 .00 * 0.290 -0.165 .06 0.445 0.906 0.254 .3370 -2.961 .00 * 0.842 -1.819 .00 * 0.791 -1.788 0.122 .00 *71 -1.034 .00 * 0.563 -0.852 .00 * 0.596 -0.580 0.106 .1072 -0.703 .00 * 0.580 -0.575 .90 0.833 0.108 0.250 .9673 -2.991 .00 * 1.161 -1.560 .00 * 1.043 -1.630 0.065 .00 *74 1.689 .00 * 0.102 6.458 .00 * 1.954 2.247 0.235 .01 **75 1.001 .00 * 0.155 2.528 .33 0.609 2.664 0.279 .3376 0.687 .01 
** 0.304 0.922 .71 0.728 1.488 0.244 .9777 1.552 .00 * 0.221 2.791 .00 * 1.736 1.745 0.218 .0378 0.049 .00 * 0.584 0.023 .38 0.948 0.532 0.197 .2179 1.377 .02 ** 0.357 1.605 .02 ** 0.916 1.582 0.169 .10
Note. * p < .001, ** p < .05
1PL  2PL  3PL
Item Statistics Based on Item Response Theory
Table 6 (continued)
4.2 Racial Differential Item Functioning (DIF) Analysis
DIF analyses were conducted to determine whether items advantage or disadvantage examinees
across ethnic/racial groups. Whites were identified as the reference group, and Blacks, Hispanics,
and the Multi-Racial group were regarded as the focal groups.
Thissen (2001), in the manual for IRTLRDIF, noted that “IRTLRDIF has implemented
two of the most commonly-used IRT models: the three-parameter logistic (3PL) model and
Samejima’s graded model. Both of those models include the two-parameter logistic (2PL) model
as a special case” (p. 5). Thus, this study adopts BILOG-MG 3 and IRTPRO to examine the 79
items in Social Studies with 1PL and employs IRTPRO, BILOG-MG 3, and IRTLRDIF with
2PL and 3PL. If both BILOG-MG 3 and IRTPRO identity an item as a DIF item, then it was
considered a DIF item with 1PL. In addition, when three programs identically detect DIF
phenomenon for 2PL and 3PL, those items are included as DIF. This study determines whether
any race is favored in each item based on the outcomes from IRTLRDIF, BILOG-MG 3, and
IRTPRO. Because the results of IRTPRO and IRTLRDIF are similar, this study will
simultaneously employ two programs, IRTPRO and BILOG-MG 3, to compare multiple groups,
White vs. Blacks, Hispanics, and the Multi-Racial group, to investigate which items exist in DIF
for specific ethnicities with 1PL, 2PL, and 3PL.
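The agreement rule just described amounts to a set intersection over the items each program flags: with 1PL an item counts as DIF only if BILOG-MG 3 and IRTPRO both flag it, and with 2PL/3PL all three programs must agree. A minimal sketch (Python; the item numbers are placeholders, not results from this study):

```python
def consensus_dif(*flag_sets):
    """Items flagged by every program (set intersection)."""
    result = set(flag_sets[0])
    for s in flag_sets[1:]:
        result &= set(s)
    return sorted(result)

# Placeholder flag sets for the three programs (not actual results)
bilog = {2, 13, 14, 44}
irtpro = {13, 14, 44, 51}
irtlrdif = {13, 44, 78}
flagged_2pl = consensus_dif(bilog, irtpro, irtlrdif)  # items all three agree on
```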
1PL, 2PL, and 3PL are the three main models used to estimate item parameters for the
dichotomous items (Hambleton et al., 1991). The 1PL assumes that all discriminations, a, are
equal, so it only considers item difficulty, b, while calibrating. The 2PL calibrates item difficulty
and discrimination, and the 3PL calibrates item difficulty, discrimination, and the lower
asymptote, c.
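The three models can be viewed as special cases of a single response function. The sketch below (Python, with illustrative parameter values only) uses the standard 3PL form with scaling constant D = 1.7; setting c = 0 reduces it to the 2PL, and additionally holding a common a gives the 1PL.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    With c = 0 this reduces to the 2PL; with c = 0 and a common,
    fixed a it is the 1PL. D = 1.7 is the usual scaling constant.
    """
    D = 1.7
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Illustrative values only (not estimates from the GHSGPT data)
p1 = p_correct(0.0, b=0.5)                # 1PL-style: a = 1, c = 0
p2 = p_correct(0.0, b=0.5, a=1.2)         # 2PL
p3 = p_correct(0.0, b=0.5, a=1.2, c=0.2)  # 3PL with lower asymptote .2
```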
This study adopts -2loglikelihood (-2logL) for each comparison group to determine goodness of fit.
The item fit statistics provided by both IRTPRO and BILOG-MG 3 indicated that the 3PL model
provided a good fit to the data, as shown in Tables 7 and 8.
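Because the three models are nested, the -2logL values in Tables 7 and 8 can be compared directly: the drop in -2logL between a reduced and a fuller model is a likelihood-ratio chi-square statistic. The sketch below (Python) applies this to the BILOG-MG 3 Whites vs. Blacks column of Table 7; the degrees of freedom for a formal test would equal the number of added parameters.

```python
def lr_statistic(neg2ll_reduced, neg2ll_full):
    """Likelihood-ratio chi-square for nested IRT models:
    the drop in -2loglikelihood from the reduced to the full model."""
    return neg2ll_reduced - neg2ll_full

# BILOG-MG 3, Whites vs. Blacks, from Table 7
g2_1pl_vs_2pl = lr_statistic(225310.41, 221722.24)  # 1PL vs. 2PL
g2_2pl_vs_3pl = lr_statistic(221722.24, 220417.99)  # 2PL vs. 3PL
# Both drops are large, consistent with the 3PL fitting best in Table 7.
```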
Table 7
The Summary of Goodness of Fit Using BILOG-MG 3

Model                    Whites vs. Blacks   Whites vs. Hispanics   Whites vs. Multi-Racial   Whites vs. All Races
1PL (-2loglikelihood)    225310.41           152592.65              154221.83                 248561.46
2PL (-2loglikelihood)    221722.24           150104.10              151702.80                 244621.80
3PL (-2loglikelihood)    220417.99           149214.33              149166.39                 243439.83

Table 8
The Summary of Goodness of Fit Using IRTPRO

Model                    Whites vs. Blacks   Whites vs. Hispanics   Whites vs. Multi-Racial   Whites vs. All Races
1PL (-2loglikelihood)    225352.18           152604.19              154235.26                 248614.63
2PL (-2loglikelihood)    221704.61           150057.77              151661.38                 244426.20
3PL (-2loglikelihood)    220702.07           149472.04              151046.77                 243451.64
4.2.1 Three Comparison Groups Using BILOG-MG 3 and IRTPRO with 1PL
To investigate DIF items, this study employs Lord's (1980) technique, which compares item
parameter estimates between two groups, dividing their difference by the standard error of the
difference while the ability parameters are treated as known. This is done using BILOG-MG 3 and IRTPRO. The item
parameters determined the ICC for an item, and “Lord (1980) noted that the question of DIF
detection could be approached by computing estimates of the item parameters within each
group” (Thissen et al., 1993, p. 68). The equation is defined as (Thissen et al., 1993):
Zi = Δb / SE(GF − GR),    (34)

where Δb is bF − bR; bF and bR are the item difficulty parameters for the focal group and the
reference group, respectively; SE(GF − GR) is the standard error of the difference between the
focal-group and reference-group estimates; and Zi approximately follows the standard normal
distribution. If the absolute value of Zi is greater than 1.96 (a two-tailed test, p ≤ .05),
DIF exists. Table 9 presents the
outcomes of the three comparison groups using two computer programs with 1PL. For the
Whites vs. Blacks, both computer programs, BILOG-MG 3 and IRTPRO, indicate that Items 1,
2, 7, 8, 11, 13, 14, 15, 17, 20, 22, 23, 25, 26, 27, 28, 29, 30, 31, 34, 44, 49, 52, 56, 57, 59, 60, 61,
62, 66, 69, 71, 72, 74, and 78 are DIF. Items 1, 7, 8, 11, 22, 26, 27, 28, 29, 34, 44, 52, 59, 69, and
74 advantaged Blacks, and Items 2, 13, 14, 15, 17, 20, 23, 25, 30, 31, 49, 56, 57, 60, 61, 62, 66,
71, 72, and 78 advantaged Whites. In addition, Items 2, 13, 19, 51, and 74 are DIF for the Whites vs.
Hispanics comparison; Items 2, 13, and 19 favor Whites, and Items 51 and 74 favor Hispanics.
Moreover, Items 8, 44, and 56 are DIF in Whites vs. the Multi-Racial group, and all these items
disadvantaged the Multi-Racial group as shown in Table 9. In sum, there are 35 DIF items in
Whites vs. Blacks, and only a few DIF items exist for Whites vs. Hispanics (five items) and
Whites vs. the Multi-Racial group (three items).
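Equation 34 is straightforward to apply once each group's difficulty estimates and standard errors are available. The sketch below (Python) uses hypothetical estimates; the pooled standard error sqrt(SE_F**2 + SE_R**2) is one common form for independent group calibrations and is an assumption of this sketch, not a detail taken from BILOG-MG 3 or IRTPRO output.

```python
import math

def lord_z(b_focal, b_ref, se_focal, se_ref):
    """Lord's (1980) standardized difference in item difficulty.

    The standard error of the difference is computed here as
    sqrt(se_focal**2 + se_ref**2), assuming the two group
    calibrations are independent (an assumption of this sketch).
    """
    se_diff = math.sqrt(se_focal ** 2 + se_ref ** 2)
    return (b_focal - b_ref) / se_diff

def flag_dif(z, critical=1.96):
    """Two-tailed test at p <= .05: |Z| > 1.96 flags DIF."""
    return abs(z) > critical

# Hypothetical estimates, not taken from the GHSGPT calibration
z = lord_z(b_focal=0.80, b_ref=0.30, se_focal=0.12, se_ref=0.10)
```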
Table 9
Item1 -5.216 * 24.1 * 1.057 .8 -.077 .02 2.926 * 10.7 * 2.919 * 8.7 * .137 .13 1.886 5.7 1.877 3.5 -2.385 5.54 .442 1.6 1.097 1.1 -1.979 3.45 -1.792 3.6 -.486 .3 -1.000 .96 -1.252 2.3 -.471 .3 .365 .27 -3.008 * 8.8 * -1.435 2.2 -.191 .08 -5.442 * 27.1 * -1.339 1.9 -2.724 * 6.4 *9 -1.331 2.6 -.865 1.0 -1.542 2.3
10 -.447 1.3 1.004 .9 -.242 .011 -7.861 * 51.3 * -1.931 3.6 -2.266 4.212 1.867 5.5 .737 .5 -1.074 .913 6.350 * 45.4 * 3.412 * 12.0 * 1.093 1.614 4.554 * 24.4 * -.148 .1 .938 1.115 4.106 * 21.0 * 1.723 2.9 .140 .116 .270 1.4 .672 .3 -.135 .017 2.820 * 10.7 * 1.698 3.0 -1.236 1.618 1.132 3.0 .942 .9 .031 .019 1.445 3.9 3.069 * 8.9 * .275 .220 3.445 * 15.0 * .763 .5 -.742 .521 -1.496 2.8 .288 .1 -.021 .022 -4.257 * 16.3 * .988 .8 .000 .023 2.968 * 12.1 * .618 .3 .601 .624 -1.641 3.3 -.583 .5 -1.029 .925 2.661 * 10.0 * 1.010 1.0 -.385 .126 -4.238 * 16.6 * -1.556 2.5 -.221 .027 -3.442 * 11.2 * .556 .3 .377 .328 -3.602 * 12.4 * -.623 .5 -.138 .029 -5.848 * 29.5 * -1.627 2.7 -1.738 2.530 2.256 * 7.4 * 1.737 2.8 1.241 2.031 2.000 * 6.2 * -.256 .1 1.532 3.032 -2.261 5.6 1.307 1.5 -2.194 4.233 .361 1.5 .476 .2 -2.030 4.134 -3.518 * 11.5 * -.469 .3 -.706 .435 -1.017 1.8 -2.348 5.5 .528 .436 -2.025 4.6 -.041 .0 -.929 .837 1.054 2.6 -.760 .8 -.057 .038 -1.893 4.3 .184 .0 -2.099 4.239 .529 1.7 .580 .3 .680 .740 -.623 1.4 .152 .0 -.176 .0
The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups with 1PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
Item41 1.963 5.9 -1.296 2.1 1.461 2.642 -2.307 5.3 -.734 .6 .332 .243 .837 2.3 -1.208 1.8 .340 .244 -8.263 * 62.8 * -1.724 3.2 -2.756 * 6.6 *45 -2.197 5.5 .296 .1 -1.007 .946 .260 1.4 -.086 .0 .321 .247 .559 1.7 .119 .0 -.170 .048 1.828 5.4 1.498 2.1 1.092 1.549 2.132 * 6.8 * .749 .5 .449 .350 -1.492 2.9 -.794 .8 -1.256 1.551 -2.314 5.7 -3.383 * 12.2 * 1.707 3.152 -4.123 * 15.3 * -.997 1.1 -.189 .053 .151 1.3 -.856 1.0 -.133 .054 -1.890 4.1 .040 .0 -1.775 3.155 -.189 1.2 -.774 .8 -.293 .156 4.770 * 26.9 * -.278 .1 2.510 * 6.9 *57 4.016 * 19.1 * 1.340 1.8 1.377 2.358 .778 2.1 .944 .8 .715 .859 -2.706 * 6.8 * .107 .0 1.528 2.560 2.179 * 7.1 * .036 .0 .532 .561 3.504 * 15.7 * 1.026 1.0 2.244 5.962 2.632 * 9.2 * .678 .5 1.811 3.963 .604 1.7 -.499 .4 1.076 1.664 .845 2.2 -1.153 1.6 -1.683 2.465 2.000 5.9 -.317 .1 .888 1.066 2.859 * 10.8 * .234 .0 .893 1.167 1.505 3.9 -1.657 3.1 1.102 1.668 1.270 3.2 .815 .6 .415 .369 -2.752 * 7.4 * 1.163 1.1 -.325 .170 1.814 5.0 .565 .3 1.019 1.371 2.886 * 10.7 * -.756 .7 1.244 2.072 3.446 * 14.7 * -.228 .1 .500 .473 .795 2.0 -1.058 1.3 .540 .474 -4.706 * 19.3 * -3.412 * 10.5 * -.316 .175 -2.259 5.0 1.216 1.2 -.004 .076 -.487 1.3 .698 .4 -.264 .077 -1.921 4.0 .503 .2 .154 .178 3.746 * 17.4 * .341 .1 .374 .379 -1.481 2.9 -1.695 3.3 .518 .4
Table 9 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups with 1PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
4.2.2 Three Comparison Groups Using Three Computer Programs with 2PL
This study employs three computer programs, IRTPRO, BILOG-MG 3, and IRTLRDIF
with 2PL. IRTPRO and BILOG-MG 3 use the same methods as for 1PL to detect DIF under 2PL. For
IRTLRDIF, Thissen (2001) stated that “if the value of G2(d.f.) exceeds 3.84 at α = .05 critical
value of the chi-square distribution for one degree of freedom, df, fit additional models to
compute single d.f., likelihood ratio tests appropriate for the item response model” (p.8).
Table 10 displays the uniform and non-uniform DIF among three comparison groups with 2PL.
There are 39 items identified as statistically significant DIF items for Whites vs. Blacks,
including 15 uniform and 24 non-uniform DIF items. A total of 16 items are DIF items for
Whites vs. Hispanics, including two uniform and 14 non-uniform DIF items. For Whites vs. the
Multi-Racial group, 24 items were identified as statistically significant DIF items, including
five uniform and 19 non-uniform DIF items. Table 11 shows the outcome of the three
computer programs for the three comparison groups with 2PL. First, the three computer programs
show that Items 1, 2, 8, 11, 13, 14, 15, 36, 38, 44, 45, 56, 57, 68, 72, and 78 are DIF items for
Whites vs. Blacks; Items 2, 13, 14, 15, 56, 57, 68, 72, and 78 advantaged Whites, and Items
1, 8, 11, 36, 38, 44, and 45 favored Blacks. Second, three DIF items (Items 2, 13, and 51) exist for
Whites vs. Hispanics; Items 2 and 13 favored Whites, and Item 51 advantaged Hispanics.
Third, Items 3, 4, 8, and 44 are DIF items for Whites vs. the Multi-Racial group, and all of these
items disadvantaged Whites. Overall, several DIF items (16 items) exist for Whites vs. Blacks,
as with 1PL, and only a few DIF items exist for Whites vs. Hispanics (three items) and
Whites vs. the Multi-Racial group (four items).
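The IRTLRDIF decision rule quoted above (G2 compared with 3.84, the α = .05 critical value of chi-square with 1 df) can be combined with the follow-up tests to label items as uniform or non-uniform. The mapping used in the sketch below (Python) — rejection of a-equal implies non-uniform DIF, otherwise rejection of b-equal implies uniform DIF — is our reading of Table 10 stated as an assumption, not the program's documented rule.

```python
CRIT_1DF = 3.84  # chi-square critical value, df = 1, alpha = .05

def classify_dif(g2_all, g2_a, g2_b):
    """Classify an item from IRTLRDIF-style G2 statistics (2PL case).

    g2_all tests H0: all parameters equal; g2_a and g2_b are the
    single-df follow-up tests for a and b. The uniform/non-uniform
    mapping follows our reading of Table 10 (an assumption).
    """
    if g2_all <= CRIT_1DF:
        return None  # no overall DIF signal
    if g2_a > CRIT_1DF:
        return "non-uniform"  # discrimination differs
    if g2_b > CRIT_1DF:
        return "uniform"      # only difficulty differs
    return None

# Item 13, Whites vs. Blacks, from Table 10: G2(all)=32.0, G2(a)=0.4, G2(b)=31.6
label = classify_dif(32.0, 0.4, 31.6)
```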
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
1 6.1 0.0 6.1 Uniform 7.5 5.9 1.7Non-
uniform 0.7
2 12.1 2.7 9.5Non-
uniform 9.6 1.4 8.2Non-
uniform 0.03 0.7 2.6 8.2 0.7 7.4 Uniform
4 2.1 2.2 11.6 7.1 4.5Non-
uniform5 0.2 2.5 3.06 0.4 1.3 0.27 2.7 1.4 0.9
8 11.2 0.1 11.2 Uniform 2.4 7.5 1.2 6.2Non-
uniform
9 8.7 3.3 5.5Non-
uniform 5.0 3.5 1.5Non-
uniform 3.2
10 4.4 3.5 1.0Non-
uniform 1.1 3.111 10.3 0.5 9.8 Uniform 1.6 3.3
12 1.6 3.0 6.1 4.6 1.5Non-
uniform13 32.0 0.4 31.6 Uniform 11.1 0.1 11.1 Uniform 1.014 14.3 0.2 14.0 Uniform 0.5 1.6
15 9.1 3.5 5.6Non-
uniform 3.1 0.516 3.6 2.6 0.4
17 0.3 4.8 2.0 2.8Non-
uniform 5.3 2.4 2.9Non-
uniform
Table 10
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
18 0.6 1.3 0.1
19 1.0 9.1 0.9 8.3Non-
uniform 1.320 6.1 0.3 5.8 Uniform 0.9 1.221 2.4 2.1 0.222 3.3 3.5 0.3
23 3.4 0.2 0.524 2.1 1.9 1.8
25 4.7 3.8 0.9Non-
uniform 1.7 0.526 3.5 1.3 2.3
27 9.3 5.9 3.4Non-
uniform 2.0 0.2
28 4.1 1.4 2.7Non-
uniform 0.8 0.4
29 10.2 4.7 5.5Non-
uniform 1.7 2.1
30 4.0 3.2 0.7Non-
uniform 5.9 4.0 1.8Non-
uniform 1.431 0.2 0.5 2.3
32 13.5 8.6 4.9Non-
uniform 1.8 5.3 0.7 4.6Non-
uniform33 1.0 2.0 5.5 0.2 5.3 Uniform
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
34 2.1 0.4 0.7
35 5.8 5.4 0.4Non-
uniform 5.4 1.4 4.0Non-
uniform 3.4
36 10.3 2.6 7.7Non-
uniform 0.5 1.5
37 4.6 0.0 4.5 Uniform 6.2 5.8 0.4Non-
uniform 4.0 4.0 0.0Non-
uniform
38 9.4 1.6 7.8Non-
uniform 3.1 5.6 0.2 5.4 Uniform
39 0.1 1.6 4.7 4.4 0.4Non-
uniform40 7.2 0.1 7.1 Uniform 0.5 0.341 0.1 3.8 3.442 0.6 0.7 1.043 1.8 3.7 1.9
44 36.6 0.4 36.3 Uniform 2.4 7.4 1.1 6.3Non-
uniform
45 17.0 6.2 10.9Non-
uniform 3.0 2.0
46 2.9 2.2 7.5 7.5 0.0Non-
uniform
47 1.8 0.1 4.3 4.2 0.1Non-
uniform
48 8.7 8.6 0.2Non-
uniform 3.4 6.3 5.4 0.9Non-
uniform
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
49 1.0 1.4 6.1 6.0 0.1Non-
uniform
50 6.6 3.4 3.2Non-
uniform 1.6 3.8
51 12.7 6.6 6.0Non-
uniform 13.9 0.2 13.8 Uniform 7.0 4.4 2.6Non-
uniform52 0.7 0.3 0.0
53 8.6 1.2 7.4Non-
uniform 2.5 1.5
54 5.5 0.1 5.4 Uniform 0.6 4.7 0.6 4.0Non-
uniform
55 7.6 5.2 2.4Non-
uniform 1.3 0.6
56 22.0 3.0 19.0Non-
uniform 0.8 6.4 0.5 5.9 Uniform57 13.6 0.4 13.3 Uniform 2.2 2.158 1.8 0.8 2.1
59 3.2 2.4 4.0 1.2 2.8Non-
uniform
60 4.8 4.2 0.6Non-
uniform 0.1 0.361 3.5 0.9 5.2 0.1 5.1 Uniform62 1.1 2.2 3.5
63 5.1 0.2 5.0 Uniform 9.7 9.4 0.3Non-
uniform 3.6
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
64 2.3 2.4 4.3 1.5 2.9Non-
uniform
65 0.8 1.0 4.1 3.1 1.0Non-
uniform
66 7.9 7.6 0.3Non-
uniform 1.2 0.8
67 1.0 6.3 2.4 3.9Non-
uniform 4.6 2.4 2.2Non-
uniform
68 6.1 0.5 5.6 Uniform 4.7 3.7 0.9Non-
uniform 0.669 1.4 1.7 1.9
70 0.3 7.1 6.8 0.2Non-
uniform 1.0
71 4.8 0.7 4.1Non-
uniform 2.3 2.072 6.9 0.3 6.6 Uniform 2.3 0.573 3.0 1.7 0.5
74 3.4 9.1 4.3 4.8Non-
uniform 0.5
75 2.3 4.2 1.3 2.9Non-
uniform 0.176 1.3 2.2 0.5
77 8.5 8.4 0.1Non-
uniform 0.7 4.1 4.1 0.1Non-
uniform78 8.1 0.2 7.9 Uniform 1.0 1.679 1.0 3.7 0.3
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item1 6.1 * -2.368 * 6.1 * 7.5 1.313 6.7 0.7 0.121 1.32 12.1 * 2.919 * 11.0 * 9.6 * 2.711 * 8.6 * 0.0 0.198 0.03 0.7 1.258 0.5 2.6 1.739 2.4 8.2 * -2.455 * 7.5 *4 2.1 0.464 2.1 2.2 0.995 2.3 11.6 * -1.976 * 12.3 *5 0.2 0.386 0.1 2.5 -0.095 1.9 3.0 -0.682 3.36 0.4 -0.035 0.4 1.3 -0.335 1.2 0.2 0.470 0.27 2.7 -0.721 206.0 1.4 -0.942 1.2 0.9 0.017 1.68 11.2 * -3.135 * 11.0 * 2.4 -0.975 3.0 7.5 * -2.277 * 7.4 *9 8.7 -1.619 7.8 5.0 -0.992 3.9 3.2 -1.471 3.3
10 4.4 -0.322 4.0 1.1 0.970 1.0 3.1 -0.171 2.711 10.3 * -3.083 * 10.5 * 1.6 -1.148 1.8 3.3 -1.718 3.312 1.6 1.752 1.5 3.0 0.599 1.9 6.1 -1.029 7.413 32.0 * 5.888 * 29.7 * 11.1 * 3.382 * 10.6 * 1.0 1.216 0.914 14.3 * 4.198 * 13.4 * 0.5 -0.300 0.4 1.6 1.049 1.715 9.1 * 2.768 * 8.2 * 3.1 1.650 2.9 0.5 0.335 0.416 3.6 1.815 3.4 2.6 0.880 2.3 0.4 0.030 0.117 0.3 1.026 0.2 4.8 1.691 5.1 5.3 -0.989 3.018 0.6 0.412 0.7 1.3 0.733 0.7 0.1 0.102 0.119 1.0 1.545 0.9 9.1 2.992 8.8 1.3 0.362 1.020 6.1 2.730 5.7 0.9 0.553 0.7 1.2 -0.740 1.221 2.4 0.386 2.3 2.1 0.425 1.7 0.2 0.116 0.222 3.3 -1.351 3.3 3.5 1.165 3.4 0.3 0.184 0.423 3.4 1.769 3.2 0.2 0.411 0.1 0.5 0.756 0.324 2.1 -0.888 2.0 1.9 -0.567 1.7 1.8 -0.903 1.925 4.7 1.598 3.9 1.7 0.834 1.2 0.5 -0.284 0.426 3.5 -0.720 3.5 1.3 -0.711 1.4 2.3 0.023 2.3
Table 11
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (G2), BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
Item27 9.3 -1.570 9.5 2.0 -0.565 1.9 0.2 0.494 0.228 4.1 -0.903 4.1 0.8 -0.164 1.0 0.4 0.070 0.129 10.2 -1.732 10.6 1.7 -0.985 2.0 2.1 -1.299 2.130 4.0 1.633 3.9 5.9 1.593 5.4 1.4 1.374 1.231 0.2 1.049 0.2 0.5 -0.495 0.4 2.3 1.734 1.932 13.5 -1.634 12.3 1.8 1.345 1.8 5.3 -1.989 4.833 1.0 0.287 0.9 2.0 0.374 1.4 5.5 -1.956 4.934 2.1 -1.018 2.1 0.4 -0.119 0.4 0.7 -0.472 0.935 5.8 0.603 5.8 5.4 -1.856 5.6 3.4 0.592 3.436 10.3 * -2.047 * 9.7 * 0.5 -0.223 0.4 1.5 -0.925 1.437 4.6 1.643 4.5 6.2 -0.490 4.4 4.0 0.116 4.238 9.4 * -2.068 * 8.5 * 3.1 0.058 2.2 5.6 -2.092 5.539 0.1 0.495 0.1 1.6 0.485 1.0 4.7 0.752 3.640 7.2 -1.800 6.2 0.5 -0.148 0.6 0.3 -0.150 0.241 0.1 0.659 0.1 3.8 -1.443 2.9 3.4 1.676 3.042 0.6 0.238 0.5 0.7 -0.263 0.9 1.0 0.493 1.043 1.8 -0.421 1.7 3.7 -1.586 3.4 1.9 0.465 2.044 36.6 * -5.495 * 35.8 * 2.4 -1.367 2.1 7.4 * -2.368 * 7.7 *45 17.0 * -2.485 * 14.7 * 3.0 0.175 2.5 2.0 -0.917 1.846 2.9 -0.859 2.8 2.2 -0.443 2.1 7.5 0.354 4.547 1.8 0.771 19.0 0.1 0.009 0.1 4.3 -0.099 3.948 8.7 1.094 9.1 3.4 1.264 3.2 6.3 1.204 5.949 1.0 1.720 0.9 1.4 0.595 1.5 6.1 0.545 7.150 6.6 -1.092 6.5 1.6 -0.925 1.2 3.8 -1.218 3.451 12.7 -1.649 12.8 13.9 * -3.547 * 13.5 * 7.0 1.733 7.452 0.7 -0.212 0.6 0.3 -0.177 0.2 0.0 0.057 0.0
Table 11 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (G2), BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
Item53 8.6 -1.616 6.4 2.5 -1.074 1.7 1.5 0.070 1.454 5.5 -1.704 5.2 0.6 -0.054 0.4 4.7 -1.729 4.255 7.6 -0.772 6.8 1.3 -0.916 1.2 0.6 -0.210 0.656 22.0 * 4.687 * 21.0 * 0.8 -0.304 0.7 6.4 2.580 6.057 13.6 * 3.638 * 13.0 * 2.2 1.254 2.1 2.1 1.436 2.058 1.8 1.106 1.8 0.8 0.862 0.6 2.1 0.807 1.459 3.2 0.153 3.2 2.4 0.404 2.5 4.0 1.122 3.660 4.8 1.456 4.1 0.1 -0.130 0.2 0.3 0.629 0.261 3.5 2.136 3.1 0.9 0.866 0.7 5.2 2.451 4.962 1.1 1.279 1.0 2.2 0.528 1.7 3.5 1.888 3.063 5.1 -1.096 3.9 9.7 -0.457 1.8 3.6 1.612 2.764 2.3 1.652 2.3 2.4 -1.129 1.9 4.3 -1.528 3.765 0.8 0.236 0.5 1.0 -0.264 0.4 4.1 1.377 3.766 7.9 1.037 5.5 1.2 0.182 0.4 0.8 1.165 0.867 1.0 -0.088 0.8 6.3 -1.648 8.6 4.6 1.920 2.568 6.1 * 2.155 * 6.1 * 4.7 0.972 4.0 0.6 0.533 0.469 1.4 -0.814 1.4 1.7 1.308 1.6 1.9 -0.173 1.570 0.3 0.573 0.1 7.1 0.453 9.4 1.0 1.218 1.271 4.8 2.387 4.8 2.3 -0.803 2.4 2.0 1.301 2.572 6.9 * 3.040 * 6.7 * 2.3 -0.358 1.7 0.5 0.588 0.773 3.0 -0.735 2.8 1.7 -1.108 1.4 0.5 0.945 0.474 3.4 -0.284 3.8 9.1 -1.456 10.0 0.5 -0.003 0.375 2.3 0.559 2.1 4.2 1.254 4.1 0.1 0.199 0.176 1.3 0.933 1.3 2.2 0.870 1.9 0.5 -0.074 0.477 8.5 0.217 8.6 0.7 0.639 0.6 4.1 0.270 3.878 8.1 * 3.430 * 7.8 * 1.0 0.162 1.1 1.6 0.447 1.679 1.0 -0.187 1.0 3.7 -1.438 3.3 0.3 0.566 0.3
Table 11 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (G2), BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
4.2.3 The Three Comparison Groups Using Three Computer Programs with 3PL
Table 12 displays the uniform and non-uniform DIF among three comparison groups with
3PL. There are 43 items identified as statistically significant DIF items for Whites vs. Blacks,
including 17 uniform and 26 non-uniform DIF items. A total of 24 items are DIF items for
Whites vs. Hispanics, including four uniform and 20 non-uniform DIF items. For Whites vs. the
Multi-Racial group, 25 items were identified as statistically significant DIF items, including
six uniform and 19 non-uniform DIF items. Table 13 presents the outcomes of
the three comparison groups using the three computer programs. For Whites vs. Blacks, the
programs indicated that Items 13, 14, 15, 32, 44, 45, 56, 57, and 78 are DIF items; Items 32, 44,
and 45 advantaged Whites, and Items 13, 14, 15, 56, 57, and 78 favored Blacks. Additionally,
Items 13, 19, and 51 are DIF items for Whites vs. Hispanics; Items 13 and 19 favored Whites,
and Item 51 favored Hispanics. Furthermore, only one DIF item (Item 44) is detected for Whites
vs. the Multi-Racial group, and this item advantaged the Multi-Racial group.
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
1 5.1 0.4 0.0 4.8 Uniform 7.7 0.6 2.5 4.6Non-
Uniform 0.6
2 11.8 2.5 0.7 8.5 Uniform 9.5 6.5 2.9 0.1Non-
Uniform 0.03 3.8 2.3 9.9 0.0 3.8 6.0 Uniform
4 0.5 2.7 12.0 11.9 0.0 0.0Non-
Uniform
5 2.2 4.0 3.3 0.6 0.1Non-
Uniform 4.1 1.9 2.1 0.1Non-
Uniform6 3.1 2.3 0.0
7 5.8 0.0 2.3 3.7Non-
Uniform 4.1 0.7 0.2 3.3Non-
Uniform 1.7
8 11.4 0.5 0.3 10.6 Uniform 2.4 7.2 5.1 1.9 0.2Non-
Uniform
9 10.6 0.2 1.7 8.7Non-
Uniform 6.9 3.8 1.7 1.3Non-
Uniform 3.8
10 7.1 0.5 4.1 2.4Non-
Uniform 4.1 2.4 1.0 0.8Non-
Uniform 3.1
11 11.3 9.6 0.0 1.7Non-
Uniform 4.2 2.2 1.0 1.0Non-
Uniform 3.9 3.8 0.0 0.2Non-
Uniform
12 2.2 6.0 6.0 0.0 0.1Non-
Uniform 7.6 7.5 0.1 0.0Non-
Uniform
13 34.4 0.9 6.4 27.1 Uniform 10.1 0.1 8.5 1.5Non-
Uniform 1.0
Table 12
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
14 17.1 0.0 2.3 15.1 Uniform 0.5 1.3
15 9.8 0.2 8.6 1.0Non-
Uniform 3.4 2.2
16 2.8 2.5 8.8 1.0 0.0 7.9Non-
Uniform
17 0.8 4.7 1.0 4.1 0.0Non-
Uniform 5.3 0.6 5.0 0.0Non-
Uniform18 1.1 1.0 0.0
19 1.7 8.5 0.5 8.7 0.0 Uniform 4.9 4.7 0.0 0.2Non-
Uniform20 7.9 0.3 0.0 7.6 Uniform 0.4 1.1
21 4.5 4.8 0.1 0.0Non-
Uniform 1.5 0.5
22 3.0Non-
Uniform 3.2 0.2
23 6.0 5.8 0.0 0.2Non-
Uniform 0.0 3.424 2.0 1.7 3.1
25 4.6 4.4 0.0 0.2Non-
Uniform 1.2 1.2
26 6.9 1.6 5.3 0.0Non-
Uniform 1.4 1.9
27 6.4 4.6 0.0 1.8Non-
Uniform 1.6 0.4
28 8.0 1.0 6.7 0.3Non-
Uniform 5.7 4.9 0.3 0.5Non-
Uniform 0.6
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Table 12 (Continued)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
29 7.4 4.4 0.0 3.0Non-
Uniform 2.3 2.1
30 8.3 4.3 2.1 1.9Non-
Uniform 6.0 0.1 4.4 1.6Non-
Uniform 1.531 2.9 0.6 2.1
33 3.1 1.4 6.1 0.7 5.3 0.1Non-
Uniform34 2.3 0.6 1.0
35 2.2 5.6 5.6 0.1 0.0Non-
Uniform 2.9
36 6.1 4.7 0.3 1.1Non-
Uniform 0.6 3.3
37 6.9 0.1 0.2 6.7 Uniform 6.8 0.8 3.6 2.4Non-
Uniform 4.7 3.9 0.7 0.0Non-
Uniform38 7.6 0.0 7.8 0.0 Uniform 2.0 6.8 0.0 3.8 3.0 Uniform39 0.1 1.4 3.5
40 5.0 1.0 6.5 0.0Non-
Uniform 2.5 0.041 1.4 3.6 3.342 1.5 0.9 2.2
43 6.3 2.5 1.4 2.3Non-
Uniform 4.9 0.0 4.8 0.1 Uniform 4.7 0.6 0.6 3.4Non-
Uniform
44 34.2 6.2 12.7 15.3 Uniform 2.8 7.7 4.4 3.3 0.0Non-
Uniform45 12.4 1.2 11.4 0.0 Uniform 3.4 2.8
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Table 12 (Continued)
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
46 7.0 0.0 6.4 0.6Non-
Uniform 4.7 0.0 0.0 4.7Non-
Uniform 6.6 1.5 2.2 2.9Non-
Uniform
47 7.7 0.0 0.4 7.3Non-
Uniform 0.0 2.7
48 6.8 3.8 2.2 0.8Non-
Uniform 2.6 7.1 2.5 4.5 0.1Non-
Uniform
49 1.5 1.5 7.2 6.2 0.9 0.1Non-
Uniform50 4.4 0.0 3.8 0.6 Uniform 3.0 5.0 0.3 4.7 0.0 Uniform
51 15.2 16.0 0.2 0.0Non-
Uniform 14.5 1.1 12.1 1.3 Uniform 7.6 3.4 4.9 0.0Non-
Uniform52 1.2 0.6 0.153 8.8 0.0 8.9 0.0 Uniform 2.2 2.254 5.0 0.9 1.0 3.2 Uniform 1.4 4.6 0.0 4.9 0.0 Uniform
55 4.0 1.2 3.1 0.0Non-
Uniform 1.0 0.5
56 32.5 31.5 0.3 0.7Non-
Uniform 1.0 10.3 0.0 7.1 3.2Non-
Uniform
57 17.9 15.5 2.1 0.3Non-
Uniform 1.7 2.558 3.0 0.7 1.5
59 2.5 0.7 4.4 1.2 0.0 3.2Non-
Uniform60 0.0 0.0 0.061 4.0 0.0 3.7 0.3 Uniform 0.9 4.0 0.1 4.8 0.0 Uniform62 0.2 1.5 3.3
Table 12 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
63 4.8 0.0 7.1 0.0 Uniform 9.8 8.5 0.3 1.0Non-
Uniform 1.4
64 2.8 4.5 1.1 3.0 0.4Non-
Uniform 5.0 2.0 0.0 3.1 Uniform65 0.0 0.5 2.666 0.1 1.1 0.0
67 0.0 6.4 2.2 3.9 0.4Non-
Uniform 1.7
68 12.2 11.2 0.2 0.8Non-
Uniform 2.9 2.469 1.7 2.2 1.6
70 0.3 7.4 6.8 1.0 0.0Non-
Uniform 0.871 4.2 0.0 5.3 0.0 Uniform 1.9 1.3
72 11.0 8.2 1.5 1.4Non-
Uniform 1.9 0.673 1.2 1.1 0.0
74 3.7 6.7 9.3 0.0 0.0Non-
Uniform 1.375 2.8 3.1 2.976 2.3 1.5 1.1
77 0.9 0.8 4.8 1.6 3.7 0.0Non-
Uniform78 9.2 0.9 6.9 1.5 Uniform 1.0 0.879 0.3 4.0 0.1 4.4 0.0 Uniform 0.9
Table 12 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item1 5.1 -1.475 4.9 7.7 1.194 1.5 0.6 0.367 0.52 11.8 1.239 7.8 9.5 1.354 8.4 0.0 -0.081 0.63 3.8 -0.100 2.0 2.3 1.013 5.7 9.9 -2.144 7.94 0.5 0.443 1.7 2.7 1.000 9.0 12.0 -0.816 20.95 2.2 -0.063 0.6 4.0 -0.873 1.2 4.1 -1.633 2.46 3.1 -0.328 2.5 2.3 -0.025 3.9 0.0 0.381 2.27 5.8 -1.191 2.7 4.1 -0.528 1.7 1.7 0.487 3.08 11.4 -2.299 6.4 2.4 -0.684 1.7 7.2 -2.100 6.99 10.6 -1.628 5.2 6.9 -1.299 2.8 3.8 -1.288 5.9
10 7.1 -1.738 3.8 4.1 1.264 9.5 3.1 -1.413 1.311 11.3 -1.137 11.0 4.2 -0.663 6.8 3.9 -0.797 10.012 2.2 1.413 4.1 6.0 0.038 0.9 7.6 -0.338 13.413 34.4 * 4.256 * 35.0 * 10.1 * 2.150 * 12.8 * 1.0 0.585 2.814 17.1 * 2.596 * 14.3 * 0.5 -0.764 1.2 1.3 0.962 5.115 9.8 * 2.287 * 10.6 * 3.4 1.503 6.8 2.2 0.439 6.516 2.8 1.023 2.0 2.5 0.799 2.0 8.8 0.240 1.017 0.8 0.319 1.5 4.7 1.678 10.2 5.3 -1.687 6.018 1.1 -0.032 2.2 1.0 -0.260 2.5 0.0 -0.446 5.519 1.7 0.673 2.0 8.5 * 2.026 * 10.9 * 4.9 -0.245 0.320 7.9 1.273 8.9 0.4 -0.295 4.3 1.1 -1.176 7.521 4.5 0.601 4.6 1.5 -0.112 0.4 0.5 0.107 0.822 3.0 -0.844 2.8 3.2 1.039 1.4 0.2 -0.228 0.023 6.0 0.444 3.2 0.0 0.036 1.0 3.4 0.403 0.424 2.0 -0.900 1.7 1.7 -1.077 0.9 3.1 -0.847 5.025 4.6 0.253 1.9 1.2 0.116 0.7 1.2 -0.724 2.726 6.9 -1.898 5.9 1.4 -0.523 1.0 1.9 -0.906 1.1
Note. * DIF Items
Table 13
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (χ2), BILOG-MG 3 (d), IRTPRO (χ2))
Table 13 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL

              Whites vs. Blacks              |        Whites vs. Hispanics           |       Whites vs. Multi-Racial
Item  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)
27  6.4  0.042  9.9  |  1.6  -0.102  1.3  |  0.4  -0.033  0.6
28  8.0  -2.322  4.8  |  5.7  -0.480  0.1  |  0.6  0.274  1.9
29  7.4  -0.425  7.2  |  2.3  -0.857  1.7  |  2.1  -1.313  1.8
30  8.3  1.441  11.4  |  6.0  1.707  9.3  |  1.5  0.778  3.7
31  2.9  0.024  2.4  |  0.6  -0.736  2.7  |  2.1  0.829  3.7
32  23.5*  -4.104*  13.0*  |  2.4  1.055  6.2  |  3.9  -1.619  8.3
33  3.1  -0.692  1.0  |  1.4  -0.415  0.5  |  6.1  -2.262  7.9
34  2.3  -1.095  1.5  |  0.6  -0.591  0.5  |  1.0  -1.028  0.6
35  2.2  1.144  4.4  |  5.6  -1.098  8.3  |  2.9  1.251  7.1
36  6.1  -1.033  9.0*  |  0.6  -0.434  6.4  |  3.3  -1.103  3.7
37  6.9  1.176  5.9  |  6.8  -2.158  3.5  |  4.7  1.032  7.1
38  7.6  -2.289  4.7  |  2.0  -0.770  0.2  |  6.8  -1.960  6.2
39  0.1  0.027  0.4  |  1.4  0.692  5.9  |  3.5  -0.658  1.1
40  5.0  -1.716  6.1  |  2.5  -0.301  2.0  |  0.0  -0.554  3.9
41  1.4  0.183  1.8  |  3.6  -1.566  4.5  |  3.3  1.367  4.6
42  1.5  -0.288  0.5  |  0.9  0.076  0.3  |  2.2  -0.333  0.9
43  6.3  -0.649  5.7  |  4.9  -2.134  5.9  |  4.7  0.658  7.5
44  34.2*  -4.245*  28.8*  |  2.8  -1.758  2.8  |  7.7*  -2.134*  8.2*
45  12.4*  -2.366*  9.5*  |  3.4  -0.296  0.4  |  2.8  -0.978  2.1
46  7.0  -2.119  2.7  |  4.7  -0.277  14.1  |  6.6  -1.900  5.6
47  7.7  0.935  5.0  |  0.0  -0.188  0.9  |  2.7  -0.938  0.9
48  6.8  1.404  13.3  |  2.6  1.192  14.4  |  7.1  1.398  20.1
49  1.5  0.913  2.4  |  1.5  0.567  5.5  |  7.2  1.270  13.3
50  4.4  -1.941  1.5  |  3.0  -1.912  5.6  |  5.0  -2.617  4.9
51  15.2  -0.585  15.6  |  14.5*  -3.438*  15.8*  |  7.6  1.953  10.9
52  1.2  -0.577  1.0  |  0.6  -0.540  0.7  |  0.1  -0.033  0.1
Note. * DIF Items
63
Table 13 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL

              Whites vs. Blacks              |        Whites vs. Hispanics           |       Whites vs. Multi-Racial
Item  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)
53  8.8  -1.820  7.3  |  2.2  -1.210  4.3  |  2.2  0.024  6.8
54  5.0  -1.321  3.7  |  1.4  -0.467  0.2  |  4.6  -2.184  5.4
55  4.0  -1.330  2.6  |  1.0  -1.115  3.4  |  0.5  -0.669  0.8
56  32.5*  3.093*  11.4*  |  1.0  -0.086  1.9  |  10.3  2.049  4.7
57  17.9*  2.856*  12.3*  |  1.7  0.607  1.1  |  2.5  1.108  3.4
58  3.0  1.207  5.5  |  0.7  0.509  1.0  |  1.5  0.011  0.3
59  2.5  0.780  6.1  |  0.7  0.481  5.5  |  4.4  0.138  3.7
60  0.0  0.752  0.9  |  0.0  -0.220  1.4  |  0.0  0.348  1.4
61  4.0  1.481  3.6  |  0.9  0.685  1.1  |  4.0  1.657  5.9
62  0.2  0.790  1.9  |  1.5  0.373  0.3  |  3.3  1.303  5.1
63  4.8  -0.751  7.5  |  9.8  -0.419  2.1  |  1.4  0.647  0.8
64  2.8  0.976  2.4  |  4.5  -1.701  2.4  |  5.0  -0.821  6.6
65  0.0  0.216  2.3  |  0.5  -0.152  8.5  |  2.6  0.620  11.4
66  0.1  0.738  0.8  |  1.1  0.376  1.8  |  0.0  0.759  3.7
67  0.0  -0.053  4.3  |  6.4  -1.221  11.7  |  1.7  0.989  2.7
68  12.2  1.538  4.7  |  2.9  0.252  1.3  |  2.4  0.212  0.5
69  1.7  -0.468  1.6  |  2.2  1.063  1.6  |  1.6  -0.699  1.0
70  0.3  0.394  3.1  |  7.4  0.778  15.6  |  0.8  0.835  2.1
71  4.2  1.585  4.5  |  1.9  -0.505  4.3  |  1.3  0.750  1.0
72  11.0  2.009  6.6  |  1.9  -0.682  0.5  |  0.6  0.206  0.2
73  1.2  -0.617  6.1  |  1.1  -0.799  6.6  |  0.0  0.482  4.4
74  3.7  -0.204  3.1  |  6.7  0.169  7.9  |  1.3  0.365  3.4
75  2.8  0.692  1.6  |  3.1  0.515  2.8  |  2.9  -0.127  0.5
76  2.3  0.184  1.5  |  1.5  0.909  1.5  |  1.1  0.004  1.5
77  0.9  0.196  0.9  |  0.8  0.468  5.6  |  4.8  1.340  9.0
78  9.2*  2.463*  12.0*  |  1.0  -0.630  1.3  |  0.8  -0.438  0.4
79  0.3  -0.171  0.6  |  4.0  -2.288  3.9  |  0.9  0.223  0.9
Note. * DIF Items
64
4.2.4 Multiple Groups Using Two Programs with Three Models
The results of BILOG-MG 3 and IRTPRO for Whites vs. Blacks, Hispanics, and the
Multi-Racial group with 1PL are given in Table 14. BILOG-MG 3 detected DIF in 36 items for
Whites vs. Blacks, in six items for Whites vs. Hispanics, and in 10 items for Whites vs. the
Multi-Racial group. Items 2, 13, and 74 show DIF in both Whites vs. Blacks and Whites vs.
Hispanics. In addition, Items 8, 11, 32, 44, 56, and 61 show DIF in both Whites vs. Blacks and
Whites vs. the Multi-Racial group. On the other hand, IRTPRO detected fewer DIF items among
the three comparison groups: 12 items for Whites vs. Blacks, three items for Whites vs.
Hispanics, and two items for Whites vs. the Multi-Racial group. Based on the results, both
BILOG-MG 3 and IRTPRO consistently detect DIF in Items 2, 8, 11, 13, 29, 30, 44,
56, 61, and 74 for Whites vs. Blacks and in Item 3 for Whites vs. the Multi-Racial group. There is
no consistent DIF detection for Whites vs. Hispanics.
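The consistency check described here amounts to intersecting the programs' flagged-item sets. A minimal sketch using the Whites vs. Blacks 1PL flags reported in this section (the BILOG-MG 3 set is abbreviated to an illustrative subset of its 36 flagged items, so the lists should not be read as the complete Table 14 results):

```python
# Items flagged for Whites vs. Blacks with 1PL (illustrative subsets; the
# full BILOG-MG 3 list contains 36 items and IRTPRO's contains 12).
bilog_flags = {2, 8, 11, 13, 14, 15, 29, 30, 44, 56, 61, 74}
irtpro_flags = {2, 8, 11, 13, 19, 29, 30, 44, 56, 57, 61, 74}

# An item counts as a consistent DIF detection only if both programs flag it.
consistent = sorted(bilog_flags & irtpro_flags)
print(consistent)  # [2, 8, 11, 13, 29, 30, 44, 56, 61, 74]
```

The intersection recovers the ten items reported as consistently detected; items flagged by only one program (e.g., 14 and 15 by BILOG-MG 3, 19 and 57 by IRTPRO) drop out.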
65
Table 14
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
1  -5.216*  1.057  -0.073  |  0.7  12.2*  0.4
2  2.902*  2.919*  0.137  |  7.8*  0.3  3.9
3  1.878  1.877  -2.385*  |  0.3  2.4  9.5*
4  0.450  1.101  -1.979*  |  0.3  0.9  4.3
5  -1.792  0.486  -1.004  |  2.0  0.2  0.0
6  -1.252  -0.471  0.365  |  0.3  0.5  0.5
7  -3.008*  -1.435  -0.191  |  3.3  0.6  1.1
8  -5.442*  -1.333  -2.717*  |  14.5*  0.5  0.6
9  -1.331  -0.865  -1.542  |  3.6  1.1  0.1
10  -0.455  1.004  -0.242  |  0.4  0.7  0.7
11  -7.861*  -1.931  -2.262*  |  19.4*  3.0  0.0
12  1.867  0.734  -1.074  |  0.4  2.1  1.5
13  6.350*  3.412*  1.093  |  23.5*  1.1  3.1
14  4.554*  -0.145  0.942  |  3.5  5.7  0.9
15  4.106*  1.723  0.140  |  5.8  2.0  1.2
16  0.270  0.676  -0.130  |  0.4  0.2  0.2
17  2.820*  1.698  -1.239  |  1.0  2.8  4.7
18  1.140  0.942  -0.031  |  1.3  0.2  0.4
19  1.445  3.069*  0.275  |  6.7*  1.8  3.9
20  3.445*  0.763  -0.742  |  1.5  5.1  1.0
21  -1.487  0.292  -0.021  |  0.2  1.1  0.0
22  -3.817*  0.988  0.000  |  0.4  8.8*  0.3
23  3.197*  0.618  0.601  |  3.2  1.4  0.0
24  -1.512  -0.579  -1.029  |  2.1  0.2  0.0
25  2.770*  1.010  -0.382  |  1.7  1.9  0.9
26  -4.575*  -1.560  -0.221  |  4.9  1.8  1.2
27  -3.163*  0.556  0.377  |  0.3  6.5*  0.0
28  -3.946*  -0.623  -0.137  |  2.1  2.5  0.2
29  -5.413*  -1.627  -1.738  |  12.2*  1.5  0.0
30  2.194*  1.737  1.238  |  6.9*  0.5  0.1
31  2.084*  -0.256  1.532  |  2.2  0.5  2.2
32  -2.360*  1.303  -2.190*  |  1.2  0.9  5.6
33  0.355  0.480  -2.030*  |  0.8  2.0  3.2
34  -3.370*  -0.466  -0.703  |  2.8  1.6  0.0
35  -0.953  -2.344*  0.528  |  1.9  0.6  4.7
36  -1.992*  -0.037  -0.929  |  1.4  0.5  0.3
37  1.124  -0.764  -0.057  |  0.2  1.9  0.5
38  -1.877  0.184  -2.099*  |  2.8  0.2  2.4
39  0.184  0.584  0.680  |  1.2  0.4  0.1
40  -0.302  -0.152  0.161  |  0.2  0.3  0.0
Note. * DIF Items
66
Table 14 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
41  0.756  -1.296  1.461  |  0.4  1.9  4.9
42  -1.040  -0.734  0.332  |  0.8  1.1  0.8
43  0.356  -1.204  0.340  |  0.3  1.8  1.8
44  -3.613*  -1.724  -2.756*  |  23.6*  3.8  0.2
45  -0.954  0.300  -1.004  |  1.2  0.8  0.8
46  0.117  -0.086  0.321  |  0.3  0.2  0.2
47  0.246  0.119  -0.170  |  0.2  0.4  0.0
48  0.823  1.498  -1.092  |  5.0  0.4  0.1
49  0.981  0.749  0.449  |  2.2  0.6  0.0
50  -0.657  -0.794  -1.256  |  3.0  0.5  0.0
51  -1.061  -3.383*  1.707  |  2.9  0.2  14.9*
52  -1.805  -1.000  -0.193  |  3.4  2.4  0.5
53  0.059  -0.856  -0.133  |  0.5  1.0  0.5
54  -0.820  0.040  -1.775  |  2.6  0.2  1.5
55  -0.082  -0.774  -0.293  |  0.7  0.7  0.3
56  2.213*  -0.278  2.506*  |  7.9*  2.9  4.6
57  1.789  1.340  1.377  |  8.9*  0.8  0.0
58  0.778  0.944  0.715  |  2.0  0.5  0.0
59  -2.698*  0.110  1.528  |  0.3  6.0  1.3
60  2.171*  0.036  0.532  |  1.3  1.3  0.3
61  3.511*  1.026  2.244*  |  9.7*  0.2  0.9
62  2.625*  0.678  1.811  |  5.6  0.2  0.8
63  0.604  -0.501  1.076  |  0.4  0.2  1.7
64  0.836  -1.153  -1.683  |  2.2  6.0  0.0
65  2.000*  -0.319  0.888  |  0.9  1.0  0.9
66  2.859*  0.234  0.893  |  2.5  1.2  0.3
67  1.500  -1.656  1.102  |  0.4  2.9  4.9
68  1.270  0.815  0.411  |  1.6  0.2  0.1
69  -2.752*  1.163  -0.325  |  0.2  4.4  0.9
70  1.803  0.596  1.019  |  2.5  0.2  0.2
71  2.886*  -0.756  1.244  |  1.4  2.5  2.6
72  3.446*  -0.228  0.500  |  1.6  4.2  0.5
73  0.795  -1.058  0.540  |  0.2  1.2  1.8
74  -4.706*  -3.412*  -0.320  |  11.1*  0.4  4.9
75  -2.267*  1.216  -0.004  |  0.3  4.0  0.7
76  -0.487  0.698  -0.264  |  0.2  0.5  0.3
77  -1.929  0.503  0.154  |  0.2  2.2  0.0
78  3.746*  0.341  0.374  |  2.8  4.0  0.0
79  -1.481  -1.693  0.514  |  1.5  0.2  3.2
Note. * DIF Items
67
Table 15 shows the results of BILOG-MG 3 and IRTPRO for Whites vs. Blacks,
Hispanics, and the Multi-Racial group with 2PL. BILOG-MG 3 detected 20 DIF items for
Whites vs. Blacks, four items for Whites vs. Hispanics, and 10 items for Whites vs. the
Multi-Racial group. Items 2 and 13 are detected as DIF for both Whites vs. Blacks and Whites
vs. Hispanics. In addition, Items 8, 11, 33, 44, 56, and 61 are identified as DIF for both Whites vs.
Blacks and Whites vs. the Multi-Racial group. On the other hand, IRTPRO detected fewer DIF
items: 15 items for Whites vs. Blacks, four items for Whites vs. Hispanics, and six
items for Whites vs. the Multi-Racial group. Based on the results, Items 2, 8, 11, 13, 44, and 45
are detected by both BILOG-MG 3 and IRTPRO for Whites vs. Blacks, and Items 3 and 67 for
Whites vs. the Multi-Racial group. There is no consistent DIF detection for Whites vs. Hispanics
using BILOG-MG 3 and IRTPRO.
68
Table 15
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
1  -2.217*  1.335  0.124  |  5.0  9.6*  2.6
2  2.949*  2.799*  0.203  |  7.4*  0.1  4.8
3  1.298  1.775  -2.492*  |  0.3  1.6  9.8*
4  0.626  1.066  -1.945  |  8.2*  3.6  5.5
5  0.317  -0.097  -0.707  |  4.0  3.8  0.1
6  0.028  -0.329  0.467  |  0.7  1.3  0.6
7  -0.667  -0.964  0.009  |  1.9  2.0  0.7
8  -3.015*  -0.965  -4.802*  |  11.8*  2.9  1.4
9  -1.684  -1.013  -1.339  |  8.5*  0.2  3.6
10  -0.353  0.950  -0.049  |  1.4  1.5  3.1
11  -3.078*  -1.118  -7.717*  |  7.1*  0.0  0.6
12  1.783  0.654  -1.159  |  0.1  1.7  8.1*
13  6.267*  3.412*  1.215  |  18.0*  0.5  3.2
14  5.220*  -0.260  1.230  |  1.3  4.7  1.7
15  1.289  1.669  0.145  |  4.2  0.5  1.2
16  4.442*  0.870  0.043  |  1.2  3.3  1.2
17  0.964  1.801  -1.033  |  0.3  0.6  4.1
18  0.275  0.777  0.089  |  0.1  1.0  0.9
19  1.487  3.039*  0.349  |  4.2  1.8  5.4
20  1.797  0.603  -0.761  |  0.1  3.7  2.0
21  0.225  0.490  0.118  |  0.1  2.3  1.9
22  -4.377*  1.210  0.191  |  0.8  3.8  3.0
23  1.105  0.417  0.755  |  1.1  0.2  0.0
24  -1.383  -0.543  -0.921  |  2.7  0.2  2.4
25  0.267  0.848  -0.303  |  1.4  0.2  1.3
26  -0.758  -0.814  0.020  |  1.2  0.4  3.1
27  -1.673  0.728  0.491  |  0.3  8.2*  0.7
28  -0.859  -0.192  0.064  |  0.7  3.1  0.3
29  -1.735  -0.910  -1.271  |  5.8  0.5  0.4
30  0.934  1.642  1.351  |  6.5*  0.7  2.1
31  -2.366*  -0.484  1.737  |  0.2  0.1  2.3
32  0.167  1.320  -2.060*  |  1.8  6.1  6.1
33  -2.343*  0.374  -2.005*  |  3.9  1.4  3.5
34  1.005  -0.123  -0.481  |  1.7  0.6  0.0
35  -0.696  -1.749  0.610  |  6.5*  3.4  2.8
36  4.500*  -0.144  -0.907  |  3.6  0.4  0.3
37  -0.762  -0.453  0.107  |  0.4  3.6  8.4*
38  0.490  0.060  -2.141*  |  6.4*  0.2  5.2
39  -1.330  0.491  1.000  |  0.9  1.5  4.7
40  0.763  -0.121  -0.170  |  1.8  1.2  0.6
Note. * DIF Items
69
Table 15 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
41  0.693  -1.461  1.710  |  0.3  0.4  5.7
42  0.254  -0.277  0.493  |  0.0  0.3  2.1
43  -0.395  -1.533  0.446  |  2.5  0.4  3.8
44  -5.503*  -1.329  -2.380*  |  17.6*  0.9  1.5
45  -2.604*  0.154  -0.961  |  7.4  2.5  1.5
46  -0.928  -0.433  0.341  |  0.9  2.1  7.0*
47  0.670  0.051  -0.102  |  1.3  4.7  1.8
48  1.245  1.354  1.182  |  11.5*  0.5  0.2
49  1.887  0.641  0.532  |  5.1  5.4  0.6
50  -1.200  -0.858  -1.200  |  3.8  4.1  0.4
51  -1.543  -3.321*  1.726  |  9.4*  0.2  13.0*
52  -0.209  -0.183  0.061  |  0.2  0.1  0.2
53  -1.603  -1.083  0.061  |  3.5  0.4  1.9
54  -1.783  -0.029  -1.732  |  4.4  1.3  1.4
55  -0.769  -0.971  -0.236  |  3.0  1.0  0.6
56  4.729*  -0.332  2.588*  |  5.3  5.4  3.9
57  3.662*  1.295  1.430  |  6.1  1.0  1.5
58  1.009  0.899  0.796  |  1.4  2.7  0.3
59  0.160  0.445  1.108  |  4.3  1.8  4.2
60  1.467  -0.147  0.628  |  0.9  0.5  0.3
61  2.161*  0.904  2.503*  |  4.4  0.9  1.3
62  1.284  0.561  2.000  |  2.6  0.9  2.3
63  -1.188  -0.466  1.709  |  11.4*  8.9*  5.6
64  1.667  -1.064  -1.524  |  2.3  7.6*  1.5
65  0.288  -0.267  1.511  |  0.6  2.7  0.9
66  1.025  0.223  1.299  |  1.6  1.2  0.5
67  -0.106  -1.705  2.008*  |  0.8  1.4  6.7*
68  2.157*  0.985  0.533  |  5.0  1.6  1.5
69  -0.864  1.320  -0.177  |  0.3  2.3  2.3
70  0.608  0.500  1.216  |  2.3  1.5  4.6
71  2.427*  -0.842  1.315  |  0.4  1.4  5.2
72  3.020*  -0.359  0.585  |  2.4  3.3  0.9
73  -0.735  -1.137  0.981  |  0.9  0.2  1.9
74  -0.251  -1.402  0.007  |  6.8*  2.4  3.8
75  0.559  1.308  0.207  |  3.1  2.6  1.8
76  1.028  0.888  -0.085  |  1.0  2.5  1.0
77  0.255  0.723  0.287  |  3.5  0.5  2.2
78  3.371*  0.193  0.436  |  2.7  4.8  0.1
79  -0.232  -1.417  0.582  |  1.0  1.1  3.0
Note. * DIF Items
70
Table 16 shows the 3PL results from BILOG-MG 3 and IRTPRO for Whites vs. all focal
groups. BILOG-MG 3 detected 12 DIF items for Whites vs. Blacks, six items for
Whites vs. Hispanics, and five items for Whites vs. the Multi-Racial group. Items 13 and 15 are
detected as DIF for both Whites vs. Blacks and Whites vs. Hispanics, Item 51 for both Whites vs.
Hispanics and Whites vs. the Multi-Racial group, and Item 56 for both Whites
vs. Blacks and Whites vs. the Multi-Racial group. On the other hand, with 3PL, IRTPRO detected
more DIF items than BILOG-MG 3 for Whites vs. Blacks: 16 items, along with four items
for Whites vs. Hispanics and two items for Whites vs. the Multi-Racial group. Items 49 and 65
are identified as DIF for both Whites vs. Blacks and Whites vs. Hispanics, and Item 51 for both
Whites vs. Blacks and Whites vs. the Multi-Racial group. Moreover, the results indicate that Items
13, 15, and 44 are consistently detected by BILOG-MG 3 and IRTPRO for Whites vs. Blacks and
for Whites vs. the Multi-Racial group.
Overall, DIF exists in the GHSGPT in Social Studies when employing the three computer
programs for the three comparison groups on the dichotomously scored items using the three
models. Figures 5 to 13 display the DIF items between Whites and Blacks, Figures 14 to
16 demonstrate DIF items between Whites and Hispanics, and Figure 17 shows that DIF exists
between Whites and the Multi-Racial group, all with 3PL because 3PL shows a good fit to the data.
71
Table 16
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
1  -1.384  1.186  0.449  |  1.3  3.4  0.8
2  1.459  1.804  0.160  |  7.9  1.2  4.6
3  0.319  1.554  -1.821  |  1.5  5.7  8.9*
4  0.923  1.583  -0.582  |  12.0*  6.3  4.8
5  0.026  -0.718  -1.461  |  2.4  2.1  0.1
6  -0.104  0.180  0.524  |  4.3  0.8  0.3
7  -0.945  -0.471  0.443  |  2.3  2.7  0.7
8  -1.960  -0.430  -1.932  |  7.5  1.2  1.2
9  -1.316  -0.985  -1.151  |  6.1  1.1  2.4
10  -1.329  1.457  -0.896  |  3.1  2.2  3.9
11  -1.357  -0.993  -1.545  |  8.4*  3.0  0.5
12  1.655  0.542  -0.213  |  1.3  2.3  5.1
13  4.563*  2.852*  1.033  |  22.6*  2.1  3.5
14  2.852*  -0.202  1.265  |  5.5  6.5  1.5
15  -2.482*  2.075*  0.717  |  11.2*  3.6  1.8
16  1.183  1.019  0.305  |  2.2  2.1  1.2
17  -0.500  2.144*  -1.380  |  4.0  2.5  7.5
18  0.316  0.369  0.056  |  5.6  0.8  0.6
19  0.881  2.548*  0.183  |  5.6  1.2  6.1
20  1.776  0.341  -0.697  |  5.9  8.4*  1.2
21  0.510  0.265  0.281  |  0.7  1.9  1.0
22  -0.796  1.190  -0.082  |  1.0  2.8  1.3
23  0.832  0.487  0.778  |  0.6  3.4  0.4
24  -0.637  -0.645  -0.620  |  3.1  0.5  1.5
25  0.633  0.593  -0.480  |  0.3  4.2  1.3
26  -1.664  -0.393  -0.775  |  1.8  2.1  1.2
27  0.118  0.123  0.193  |  1.6  7.3  0.3
28  -2.113*  -0.147  0.493  |  0.3  3.3  0.5
29  -0.461  -0.421  -1.065  |  3.1  1.8  0.3
30  1.802  -2.231*  1.214  |  15.7*  0.9  1.4
31  0.313  -0.186  1.381  |  4.3  0.7  1.7
32  -3.524*  1.386  -1.486  |  3.0  12.0*  5.7
33  -0.305  -0.012  -2.101*  |  2.0  3.9  3.5
34  -0.989  -0.326  -0.788  |  0.9  0.6  0.1
35  -1.333  -0.772  1.403  |  7.7  2.6  2.7
36  -0.649  0.222  -0.642  |  9.1*  0.2  0.7
37  1.420  -1.588  1.024  |  2.6  2.7  5.2
38  -2.011*  -0.339  -1.631  |  4.2  0.4  2.8
39  0.337  1.144  -0.189  |  2.6  0.9  3.6
40  -1.479  0.240  -0.133  |  3.2  1.5  0.7
Note. * DIF Items
72
Table 16 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
41  0.353  -1.243  1.673  |  3.3  1.5  4.9
42  0.000  0.031  -0.036  |  0.6  0.1  0.8
43  -0.323  -1.404  0.919  |  8.2*  2.0  2.6
44  -4.028*  -1.201  -1.748  |  12.4  1.6  1.1
45  -2.162*  0.047  -0.721  |  3.4  1.2  1.1
46  -1.620  0.713  -1.294  |  6.5  2.6  5.4
47  1.117  0.217  -0.690  |  0.7  4.2  1.4
48  1.906  1.768  1.930  |  28.0*  2.6  0.2
49  1.280  1.091  1.450  |  11.8*  8.3*  0.4
50  -1.459  -1.228  -2.071*  |  4.1  3.1  0.4
51  -0.419  -2.699*  2.097*  |  16.1*  0.7  11.0*
52  -0.219  -0.531  -0.234  |  0.7  0.2  0.3
53  -1.513  -0.846  0.313  |  9.9*  2.7  0.9
54  -1.089  0.005  -1.875  |  3.5  0.3  1.9
55  -1.000  -0.813  -0.277  |  2.9  2.6  0.6
56  3.325*  0.170  2.498*  |  6.9  7.7  3.0
57  3.181*  0.989  1.503  |  7.1  2.9  0.7
58  1.302  0.907  -0.356  |  2.2  1.2  0.5
59  0.876  0.491  0.366  |  8.3*  1.2  1.6
60  0.953  0.094  0.634  |  1.2  3.5  0.1
61  1.523  1.027  2.090*  |  5.9  0.9  0.8
62  0.797  0.601  1.682  |  2.9  0.4  1.4
63  -0.900  -0.391  1.187  |  2.7  1.1  1.1
64  1.288  -1.339  -0.633  |  1.1  6.9  1.3
65  0.220  -0.147  1.174  |  11.9*  8.5*  0.4
66  0.845  0.494  1.258  |  1.7  6.0  0.3
67  -0.082  -1.400  1.686  |  8.6*  4.1  4.2
68  1.758  0.462  0.400  |  4.1  1.0  0.6
69  0.438  1.301  -0.384  |  0.3  2.3  2.3
70  0.451  0.895  1.161  |  10.2*  2.8  1.9
71  1.560  -0.333  1.111  |  1.8  2.6  3.0
72  2.124*  -0.443  0.518  |  0.8  5.2  0.5
73  -0.657  -0.876  0.969  |  8.8*  3.3  1.1
74  -0.274  -1.886  -0.802  |  5.7  1.2  2.2
75  0.696  0.595  0.027  |  3.0  1.5  1.0
76  0.471  1.132  0.149  |  1.4  2.6  0.8
77  0.481  0.586  1.656  |  7.2  3.1  0.9
78  2.740*  -0.140  0.027  |  2.3  5.1  0.1
79  0.054  -1.911  0.502  |  1.3  0.5  2.3
Note. * DIF Items
73
Figure 5. Item 13 between Whites and Blacks.
Figure 6. Item 14 between Whites and Blacks.
74
Figure 7. Item 15 between Whites and Blacks.
Figure 8. Item 32 between Whites and Blacks.
75
Figure 9. Item 44 between Whites and Blacks.
Figure 10. Item 45 between Whites and Blacks.
76
Figure 11. Item 56 between Whites and Blacks.
Figure 12. Item 57 between Whites and Blacks.
77
Figure 13. Item 78 between Whites and Blacks.
Figure 14. Item 13 between Whites and Hispanics.
78
Figure 15. Item 19 between Whites and Hispanics.
Figure 16. Item 51 between Whites and Hispanics.
79
Figure 17. Item 44 between Whites and the Multi-Racial Group.
80
CHAPTER 5
SUMMARY AND DISCUSSION
The purpose of this study, which employed the data from the Georgia High School
Graduation Predictor Test (GHSGPT) for Social Studies, was to analyze academic performance
by ethnicity/race. IRTPRO, BILOG-MG 3, and IRTLRDIF were utilized to investigate DIF across
reference and focal groups with 1PL, 2PL, and 3PL. Consequently, the two programs, IRTPRO
and BILOG-MG 3, identically detected 35 DIF items for Whites vs. Blacks, five DIF items for
Whites vs. Hispanics, and three DIF items for Whites vs. the Multi-Racial group with 1PL. For
2PL, the three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, consistently detected DIF:
16 DIF items for Whites vs. Blacks, three for Whites vs. Hispanics, and four for Whites vs.
the Multi-Racial group. Additionally, for 3PL, as for 2PL, the three programs identically
detected DIF. Nine DIF items exist for Whites vs. Blacks, three in Whites vs. Hispanics, and
one in Whites vs. the Multi-Racial group. Based on the results of both BILOG-MG 3 and
IRTPRO, 3PL provided a good fit for the data.
5.1 Summary
This study employed GHSGPT data to consider whether DIF for different
ethnicities/races exists in the GHSGPT for Social Studies. This thesis analyzed only 79 of the 80
GHSGPT Social Studies items because the Pearson and biserial correlations of Item 26 were
negative (-.40 and -.053, respectively). Hence, Item 26 was omitted from the calibration, and the
remaining items were renumbered to maintain consecutive numbering. The summaries of the
results are described below:
81
1. The Results Based on the Classical Test Theory (CTT)
The average p-value (the rate of correct responses) is .518, and 62 items (77%) fall
between .3 and .7; the difficulty is therefore moderate and tends toward easy. The average
discrimination is .304, and 30 items (38%) fall below .3; the discrimination is moderate, so the
items are not highly discriminating. In addition, the Pearson and biserial correlations are
positive.
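The classical statistics summarized above can be computed directly from a 0/1 response matrix. A minimal sketch on simulated data (the matrix below is randomly generated for illustration, not the GHSGPT responses):

```python
import random
from statistics import mean, pstdev

random.seed(0)
# Hypothetical 0/1 response matrix: 500 examinees x 79 items.
X = [[1 if random.random() < 0.52 else 0 for _ in range(79)] for _ in range(500)]

# Item difficulty: proportion of examinees answering each item correctly (p-value).
p_values = [mean(row[j] for row in X) for j in range(79)]

# Item discrimination: Pearson correlation of each item with the total score.
totals = [sum(row) for row in X]

def pearson(x, y):
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

disc = [pearson([row[j] for row in X], totals) for j in range(79)]

# Counts analogous to those reported above: moderate-difficulty items and
# low-discrimination items.
print(sum(.3 < p < .7 for p in p_values), sum(d < .3 for d in disc))
```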
2. The Results Based on the Item Response Theory (IRT)
a. Item Discrimination Parameter
The average item discrimination with 2PL is .519 and with 3PL is .968; thus, the degrees of
discrimination for both 2PL and 3PL are acceptable.
b. Item Difficulty Parameter
The average item difficulty with 1PL is -.173, with 2PL is .266, and with 3PL is .650. The degrees
of difficulty for the three models are moderate; however, 1PL and 2PL tend toward easy,
and 3PL tends toward difficult.
c. The Lower Asymptote (Pseudo-Guessing Parameter)
The mean of the pseudo-guessing parameter for 3PL is .224; therefore, it is not high.
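The three parameter types summarized in (a) through (c) combine in the 3PL item characteristic curve. A brief sketch, using the average estimates reported above purely for illustration:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Average estimates reported above: a = .968, b = .650, c = .224.
# An examinee of average ability (theta = 0) succeeds about 42% of the time,
# and the probability never falls below the lower asymptote c = .224.
print(round(p_3pl(0.0, 0.968, 0.650, 0.224), 3))
```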
3. Detecting DIF Using the Three Computer Programs
IRTPRO, BILOG-MG 3, and IRTLRDIF were used to assess the 79 items to detect
whether DIF for ethnicities/races exists on the GHSGPT for Social Studies with α = .05. Whites
were regarded as the reference group, and Blacks, Hispanics, and the Multi-Racial group were
considered the focal groups. For 1PL, items are considered to be DIF when BILOG-MG 3 and
82
IRTPRO consistently detected DIF. In addition, for 2PL and 3PL, when the three programs
identically detected DIF, those items are included as DIF.
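The flagging decision in each program rests on an IRT likelihood-ratio test: a compact model constrains the studied item's parameters to be equal across groups, an augmented model frees them, and the difference in -2 log likelihood is referred to a chi-square distribution. A hedged sketch (the deviance values are hypothetical; the critical values are the standard α = .05 chi-square cutoffs):

```python
# Chi-square critical values at alpha = .05; df = number of freed parameters
# (1 for 1PL, 2 for 2PL, 3 for 3PL when a, b, and c are all tested).
CHI2_CRIT_05 = {1: 3.84, 2: 5.99, 3: 7.81}

def lr_dif_test(neg2ll_compact, neg2ll_augmented, df):
    """G^2 = (-2LL compact) - (-2LL augmented); flag DIF if G^2 exceeds the cutoff."""
    g2 = neg2ll_compact - neg2ll_augmented
    return g2, g2 > CHI2_CRIT_05[df]

# Hypothetical deviances for one studied item under 3PL.
g2, flagged = lr_dif_test(51234.4, 51211.8, df=3)
print(round(g2, 1), flagged)  # 22.6 True
```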
a. The One-Parameter Logistic Model
There were 35 DIF items for Whites vs. Blacks; 15 items advantaged Blacks, and 20
items advantaged Whites. In addition, five DIF items existed for Whites vs. Hispanics;
three items favored Whites, and two items favored Hispanics. Moreover, three DIF items
existed for Whites vs. the Multi-Racial group, and those items all advantaged Whites.
b. The Two-Parameter Logistic Model
There were 16 DIF items for Whites vs. Blacks; nine items advantaged Whites, and seven
items favored Blacks. Three items showed DIF for Whites vs. Hispanics; two items favored
Whites, and one item advantaged Hispanics. Four DIF items existed for Whites vs. the
Multi-Racial group, and all advantaged the Multi-Racial group.
c. The Three-Parameter Logistic Model
There were nine DIF items found for Whites vs. Blacks; three items advantaged Whites,
and six items favored Blacks. Furthermore, three DIF items were shown for Whites vs.
Hispanics; two items advantaged Whites and one Hispanics. Additionally, only one DIF
item was found for Whites vs. the Multi-Racial group, and it advantaged the Multi-Racial
group.
4. Using IRTPRO and BILOG-MG 3 to Investigate DIF in Multiple Groups
DIF items in the multiple-group analysis were considered in parallel with the three pairwise
comparison groups. If both IRTPRO and BILOG-MG 3 identically detected DIF, then those items
were included as DIF.
83
a. The One-Parameter Logistic Model
There were ten DIF items for Whites vs. Blacks; five items favored Whites, and five
favored Blacks. There was one DIF item for Whites vs. the Multi-Racial group, and this
item advantaged the Multi-Racial group. IRTPRO and BILOG-MG 3 did not identically
detect DIF for Whites vs. Hispanics.
b. The Two-Parameter Logistic Model
BILOG-MG 3 and IRTPRO both determined seven DIF items for Whites vs. Blacks; four
items advantaged Whites and three items Blacks. Two items were detected for Whites vs.
the Multi-Racial group; one item favored Whites, and one item favored the Multi-Racial
group.
c. The Three-Parameter Logistic Model
Three DIF items were consistently detected by the two programs for Whites vs. Blacks;
two items advantaged Whites, and one advantaged Blacks. Only one DIF item was detected
for Whites vs. the Multi-Racial group, and that one favored Whites. There was no
consistent DIF item for Whites vs. Hispanics with 2PL and 3PL.
5.2 Discussion
Currently, DIF detection procedures have been developed exclusively for comparisons
between a reference (majority) group and a focal (minority) group, such as between
Whites and Blacks or between males and females. Some previous social science studies consider all
minorities as a homogeneous group. For instance, several studies mentioned that racial
differences in assessment have primarily been developed in reference to comparisons between
Whites and minority groups, which include Blacks, Asians, Hispanics, and Native Americans.
84
However, there is no evidence that Blacks and Hispanics are similar in this regard (Logan et al.,
2012). Thus, this study shows that DIF detection differs by ethnicity. In addition, previous
studies (Freedle & Kostin, 1988; Coffman & Belue, 2009) investigated the scores for either
Whites and Blacks or Whites and Hispanics or other single comparison groups. However,
numerous focal groups, for example Asians, African Americans, Hispanics, Native Americans,
females, and examinees with disabilities, are available for study (Zieky, 1993). Thus, this thesis
extends the line of prior research by using three comparison groups—1) Whites vs. Blacks; 2)
Whites vs. Hispanics; and 3) Whites vs. a Multi-Racial group—to determine which items contain
bias for a specific race/ethnicity. IRTPRO, BILOG-MG 3, and IRTLRDIF with three popular
IRT models were used to detect DIF.
This study met with some problems when calibrating the 3PL using BILOG-MG 3.
These problems may have resulted from the small sample sizes of the focal groups: the
Hispanic and Multi-Racial groups numbered 114 and 132, respectively. The default prior of
BILOG-MG 3 (GPRIOR) could not be employed because estimation stopped when calibrating
Item 59 for two comparison groups, Whites vs. Hispanics and Whites vs. the Multi-Racial group.
Therefore, this study changed the prior from GPRIOR to TPRIOR in BILOG-MG 3 and
employed a beta (4, 16) prior when using IRTPRO. In addition, when calibrating the two
comparison groups Whites vs. Hispanics and Whites vs. the Multi-Racial group with 3PL using
IRTLRDIF, several discrimination values appeared very large, such as Item 74 (186.82) for
Whites vs. Hispanics and Item 16 (78.68) for Whites vs. the Multi-Racial group. Nevertheless,
these are likely estimation errors, so the present study did not alter them because its purpose is to
detect DIF in the GHSGPT. The discussion below follows the order of the five hypotheses in
presenting the study's findings.
85
Hypotheses one and two: The three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, will
                         exhibit consistent results when testing for DIF, and IRTPRO will
                         prove effective in detecting DIF.
Based on the results for the detection of DIF, the methods using IRTPRO, BILOG-MG 3,
and IRTLRDIF are consistent across the three comparison groups. The rate of consistency
between IRTLRDIF and IRTPRO was the highest; the rates between IRTLRDIF and
BILOG-MG 3 and between BILOG-MG 3 and IRTPRO were high. The rate of consistency of
BILOG-MG 3 and IRTPRO for multiple groups was moderate. Overall, the three computer
programs displayed high consistency in the detection of DIF in this study. Furthermore, because
IRTPRO displayed results identical to those of IRTLRDIF and BILOG-MG 3 for the three
comparison groups, it is effective in detecting DIF.
Hypothesis three: Which model provides the best fit for detecting DIF?
According to Tables 7 and 8, for both BILOG-MG 3 and IRTPRO, the -2 log likelihood of
3PL for each comparison group is smaller than the -2 log likelihoods of 2PL and 1PL. Thus, this
finding indicates that 3PL is the best-fitting model for detecting DIF in the GHSGPT.
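The nested-model comparison described here can be sketched numerically. The -2 log likelihood values below are hypothetical stand-ins for the Table 7 and 8 entries (which are not reproduced in this chapter):

```python
# Hypothetical -2 log likelihood values for one comparison group; smaller
# means better fit, and 1PL, 2PL, and 3PL are nested in that order.
neg2ll = {"1PL": 60450.2, "2PL": 60101.7, "3PL": 59890.3}

# The drop in -2LL from a simpler to a richer nested model is itself a
# chi-square statistic with df equal to the number of added item parameters.
for simpler, richer in [("1PL", "2PL"), ("2PL", "3PL")]:
    drop = neg2ll[simpler] - neg2ll[richer]
    print(f"{simpler} -> {richer}: -2LL drops by {drop:.1f}")
```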
Hypothesis four: Were there differences between the ethnic groups?
The computation of total scores is:

86

Total score = [Total of the items correct / (Total number of each race × Total number of items)] × 100%   (35)
According to the total scores for each race, Whites scored 55%, Blacks 46%,
Hispanics 51%, and the Multi-Racial group 54%. In general, Whites performed
better than the other races. Perhaps because of differences in cultural background and community
region, Blacks performed worse than the other races.
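Equation 35 can be sketched as code. The Hispanic and Multi-Racial sample sizes (114 and 132) come from this chapter; the White and Black sample sizes and all correct-response counts are hypothetical values chosen only to reproduce the reported percentages:

```python
# Hypothetical counts illustrating Equation 35 (only the Hispanic and
# Multi-Racial n's are from the study; the rest are made up for the demo).
groups = {
    "Whites":       {"n": 1000, "correct": 43450},
    "Blacks":       {"n": 800,  "correct": 29067},
    "Hispanics":    {"n": 114,  "correct": 4593},
    "Multi-Racial": {"n": 132,  "correct": 5631},
}
N_ITEMS = 79

# Total score = correct responses / (examinees x items) x 100%.
scores = {race: g["correct"] / (g["n"] * N_ITEMS) * 100 for race, g in groups.items()}
for race, pct in scores.items():
    print(f"{race}: {pct:.0f}%")
```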
Hypothesis five: DIF exists between ethnic groups on the GHSGPT.
The three computer programs consistently showed that DIF exists between ethnic
groups. In addition, these findings indicated that several items advantaged specific races.
Although the results supported all of the hypotheses, there are several limitations. First,
this study does not control for gender, individual social economic status (SES), and school
regions. Second, because the present study was unable to obtain the items themselves, it cannot
analyze the distractors. Thus, it is unable to further investigate items with lower response rates or
why Blacks performed worse than other races. Third, this study does not employ
simulated data; it applies only empirical data to evaluate IRTPRO. To evaluate IRTPRO more
rigorously, researchers should employ both simulated and empirical data when detecting DIF in
future studies. Additionally, researchers may consider that school regions might
affect the probability of answering an item correctly. For example, if a school has enough
funding to hire additional teachers for tutoring, students might perform better because of this
additional help. Thus, researchers can adopt multilevel IRT, such as the HLM program or
flexMIRT, to better understand school level variables that may influence the relationships
observed here.
87
In sum, DIF is an important tool in helping test developers recognize some questions that
may be unfair for test-takers because of their gender, ethnicity/race, or cultural background
(Zieky, 1993). In other words, DIF is a particularly useful instrument for test developers. This
study presents DIF detection results from empirical tests, and, in addition, it provides important
DIF information for the test developers of the Georgia High School Graduation Predictor Test.
They can consider eliminating or revising several items, such as Items 52, 59, 74, 77, and 79,
that are beneficial or adverse for particular races. Furthermore, this study examines the new
program, IRTPRO, and demonstrates its effectiveness for detecting DIF.
88
REFERENCES
Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL:
Waveland Press, Inc.
American Psychological Association, c/o Joint Committee on Testing Practices. (1988). Code of
fair testing practices in education. Washington, DC: Author.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P.W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-24). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Baker, F. B. (2001). The basics of item response theory. New York, NY: ERIC Clearinghouse on
Assessment and Evaluation.
Baker, F. B., & Kim, S-H. (2004). Item response theory: Parameter estimation techniques. Boca
Raton, FL: Taylor & Francis.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins
University Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 392-479).
Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
An application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for dichotomously scored items.
Psychometrika, 35, 179-197.
89
Bolt, D. M. (2000). A SIBTEST approach to testing DIF hypothesis using experimentally
designed test items. Journal of Educational Measurement, 37, 307-327.
Brescia, W., & Fortune, J. C. (1988). Standardized testing of American Indian students. ERIC
Clearinghouse on Rural Education and Small Schools, Las Cruces, N. Mex. Retrieved
January 31, 2012, from http://www.enc.org/topics/equity/articles/document.shtm?=ACQ-
111498-1498.
Cai, L., Thissen, D., & du Toit, S. (2011). IRTPRO 2.1 [Computer software]. Lincolnwood, IL:
Scientific Software International.
Cai, L. (2012). flexMIRTTM version 1.86: A numerical engine for multilevel item factor analysis
and test scoring. [Computer software]. Seattle, WA: Vector Psychometric Group.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially
functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
Coffman, D. L., & Belue, R. (2009). Disparities in sense of community: True race differences or
differential item functioning? Journal of Community Psychology, 37, 547-558.
Cohen, A. S., & Kim, S-H. (1993). A comparison of Lord’s χ2 and Raju’s area measures in
detection of DIF. Applied Psychological Measurement, 17, 39-52.
Cohen, A. S., Kim, S-H., & Wollack, J. A. (1996). An investigation of the likelihood ratio test
for detection of differential item functioning. Applied Psychological Measurement, 20, 15-
26.
Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Mason, OH:
Cengage Learning.
Czepiel, S. A. (2002). Maximum likelihood estimation of logistic regression models: Theory and
implementation. Retrieved from http://czep.net/stat/mlelr.pdf.
90
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to
assessing unexpected differential item performance on the Scholastic Aptitude Test.
Journal of Educational Measurement, 23, 355-368.
Dorans, N. J., & Schmitt, A. P. (1991). Constructed response and differential item functioning: A
pragmatic approach (ETS-RR-91-47). Princeton, NJ: Educational Testing Service.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and
standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-
66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates.
Freedle, R., & Kostin, I. (1988). Relationship between item characteristics and an index of
differential item functioning (DIF) for the four GRE verbal item types. ETSRR-88-29.
Princeton, NJ: Educational Testing Service.
Georgia Department of Education. Test content descriptions based on the Georgia performance
standards social studies (2010). Retrieved from http://archives.gadoe.org/DMGet
Document.aspx/GHSGT%20Social%20Studies%20Content%20Descriptions%20GPS%20
Version%20Update%20Oct%202010.pdf?p=6CC6799F8C1371F6A344D9C15C23A9D85
9A861593B934AB75F446073BD12714C&Type=D.
Gronlund, N. E. (1993). How to make achievement tests and assessments (5th ed.) Boston, MA:
Allyn and Bacon.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principle and applications.
Boston, MA: Kluwer-Nijhoff.
91
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage.
Hambleton, R. K., & Jones, R.W. (1993) Comparison of classical test theory and item response
theory and their application to test development. Educational Measurement: Issues and
Practice, 12, 38-47.
Harwell, M. R., Baker, F. B., & Zwarts, M. (1988). Item parameter estimation via marginal
maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics,
13, 247-271.
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel
procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Kim, S-H., Cohen, A. S., & Park, T. H. (1995). Detection of differential item functioning in
multiple groups. Journal of Educational Measurement, 32, 261-278.
Ling, S. E., & Lau, S. H. (2005). Detecting differential item functioning (DIF) in standardized
multiple-choice test: An application of item response theory (IRT) using three parameter
logistic model. Retrieved January 31, 2012,
from http://www.ipbl.edu.my/inter/penyelidikan/seminarpapers/2005/lingUITM.pdf
Logan, J. R., Minca, E., & Adar, S. (2012, January 10). The geography of inequality: Why
separate means unequal in American public schools. Sociology of Education. Advance
online publication. doi:10.1177/0038040711431588.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.
Lord, F. M. (1953). A relation of test score to the trait underlying the test. Educational and
Psychological Measurement, 13, 517-548.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted
responses. Psychometrika, 39, 247-264.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Lawrence Erlbaum Associates.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective
studies of disease. Journal of the National Cancer Institute, 22, 719-748.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum
Associates.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of
Educational Statistics, 7, 105-118.
Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A
framework for classification and evaluation. Applied Psychological Measurement, 19, 23-
37.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-
502.
Raju, N. S., & Drasgow, F. (1993). An empirical comparison of the area method, Lord’s chi-
square test, and the Mantel-Haenszel technique for assessing differential item functioning.
Educational and Psychological Measurement, 53, 301-314.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of
differential functioning of items and tests. Applied Psychological Measurement, 19, 353-
368.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen:
The Danish Institute for Educational Research.
Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). Biased item detection techniques. Journal
of Educational Statistics, 5, 213-233.
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on
the SAT. Journal of Educational Measurement, 27, 67-81.
Shealy, R. T., & Stout, W. F. (1993). An item response theory model for test bias and differential
test functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp.
197-239). Hillsdale, NJ: Lawrence Erlbaum Associates.
Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA:
Sage.
Steinberg, L. (1994). Context and serial-order effects in personality measurement: Limits on the
generality of measuring changes the measure. Journal of Personality and Social
Psychology, 66, 341-349.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic
regression procedures. Journal of Educational Measurement, 27, 361-370.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of
item bias. Psychological Bulletin, 99, 118-128.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using
the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential
item functioning (pp. 67–114). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of the statistics involved in
item response theory likelihood-ratio tests for differential item functioning [Computer
software documentation]. Chapel Hill: L. L. Thurstone Psychometric Laboratory,
University of North Carolina.
Van der Linden, W. J., & Hambleton, R. K. (1996). Handbook of modern item response theory.
New York, NY: Springer-Verlag.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and
detection. Journal of Educational Measurement, 28, 197-219.
Wang, X.-B., Wainer, H., & Thissen, D. (1995). On the viability of some untestable assumptions
in equating exams that allow examinee choice. Applied Measurement in Education, 8, 211-
225.
Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning.
Applied Psychological Measurement, 33, 42-57.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P.W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–347). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3 [Computer
software]. Lincolnwood, IL: Scientific Software International.
APPENDICES
A. IRTPRO Input File for DIF Detection for Two Groups with 3PL
Project:
Name = WALL;
Data:
File = .\WALL.ssig;
Analysis:
Name = 3PL;
Mode = Calibration;
Title:
Master Thesis 3PL DIF
Comments:
3PL models fitted to each of the 79 items.
Estimation:
Method = BAEM;
E-Step = 500, 1e-005;
SE = S-EM;
M-Step = 50, 1e-006;
Quadrature = 49, 6;
SEM = 0.001;
SS = 1e-005;
Scoring:
Mean = 0;
SD = 1;
Miscellaneous:
Decimal = 2;
Processors = 2;
Print CTLD, P-Nums, Diagnostic;
Min Exp = 1;
Groups:
Variable = group;
Group G1:
Value = (1);
Dimension = 1;
Items = Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15,
Q16, Q17, Q18, Q19, Q20, Q21, Q22, Q23, Q24, Q25, Q26, Q27, Q28, Q29, Q30,
Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43, Q44, Q45,
Q46, Q47, Q48, Q49, Q50, Q51, Q52, Q53, Q54, Q55, Q56, Q57, Q58, Q59, Q60,
Q61, Q62, Q63, Q64, Q65, Q66, Q67, Q68, Q69, Q70, Q71, Q72, Q73, Q74, Q75,
Q76, Q77, Q78, Q79;
Codes(Q1) = 0(0), 1(1);
Codes(Q2) = 0(0), 1(1);
⋮
Codes(Q78) = 0(0), 1(1);
Codes(Q79) = 0(0), 1(1);
Model(Q1) = 3PL;
Model(Q2) = 3PL;
⋮
Model(Q78) = 3PL;
Model(Q79) = 3PL;
Referenced;
Mean = 0.0;
Covariance = 1.0;
Group G2:
Value = (2);
Dimension = 1;
Items = Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15,
Q16, Q17, Q18, Q19, Q20, Q21, Q22, Q23, Q24, Q25, Q26, Q27, Q28, Q29, Q30,
Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43, Q44, Q45,
Q46, Q47, Q48, Q49, Q50, Q51, Q52, Q53, Q54, Q55, Q56, Q57, Q58, Q59, Q60,
Q61, Q62, Q63, Q64, Q65, Q66, Q67, Q68, Q69, Q70, Q71, Q72, Q73, Q74, Q75,
Q76, Q77, Q78, Q79;
Codes(Q1) = 0(0), 1(1);
Codes(Q2) = 0(0), 1(1);
⋮
Codes(Q78) = 0(0), 1(1);
Codes(Q79) = 0(0), 1(1);
Model(Q1) = 3PL;
Model(Q2) = 3PL;
⋮
Model(Q78) = 3PL;
Model(Q79) = 3PL;
Mean = Free;
Covariance = Free;
DIF All:
Constraints:
Equal = (G1, Q1, Slope[0]), (G2, Q1, Slope[0]);
Equal = (G1, Q1, Intercept[0]), (G2, Q1, Intercept[0]);
Equal = (G1, Q1, Guessing[0]), (G2, Q1, Guessing[0]);
Equal = (G1, Q2, Slope[0]), (G2, Q2, Slope[0]);
Equal = (G1, Q2, Intercept[0]), (G2, Q2, Intercept[0]);
Equal = (G1, Q2, Guessing[0]), (G2, Q2, Guessing[0]);
⋮
Equal = (G1, Q78, Slope[0]), (G2, Q78, Slope[0]);
Equal = (G1, Q78, Intercept[0]), (G2, Q78, Intercept[0]);
Equal = (G1, Q78, Guessing[0]), (G2, Q78, Guessing[0]);
Equal = (G1, Q79, Slope[0]), (G2, Q79, Slope[0]);
Equal = (G1, Q79, Intercept[0]), (G2, Q79, Intercept[0]);
Equal = (G1, Q79, Guessing[0]), (G2, Q79, Guessing[0]);
Priors:
(G1, Q1, Slope[0]) = Lognormal, 0, 1;
(G1, Q1, Intercept[0]) = Normal, 0, 3;
(G1, Q1, Guessing[0]) = Beta, 4, 16;
(G1, Q2, Slope[0]) = Lognormal, 0, 1;
(G1, Q2, Intercept[0]) = Normal, 0, 3;
(G1, Q2, Guessing[0]) = Beta, 4, 16;
⋮
(G1, Q78, Slope[0]) = Lognormal, 0, 1;
(G1, Q78, Intercept[0]) = Normal, 0, 3;
(G1, Q78, Guessing[0]) = Beta, 4, 16;
(G1, Q79, Slope[0]) = Lognormal, 0, 1;
(G1, Q79, Intercept[0]) = Normal, 0, 3;
(G1, Q79, Guessing[0]) = Beta, 4, 16;
(G2, Q1, Slope[0]) = Lognormal, 0, 1;
(G2, Q1, Intercept[0]) = Normal, 0, 3;
(G2, Q1, Guessing[0]) = Beta, 4, 16;
(G2, Q2, Slope[0]) = Lognormal, 0, 1;
(G2, Q2, Intercept[0]) = Normal, 0, 3;
(G2, Q2, Guessing[0]) = Beta, 4, 16;
⋮
(G2, Q78, Slope[0]) = Lognormal, 0, 1;
(G2, Q78, Intercept[0]) = Normal, 0, 3;
(G2, Q78, Guessing[0]) = Beta, 4, 16;
(G2, Q79, Slope[0]) = Lognormal, 0, 1;
(G2, Q79, Intercept[0]) = Normal, 0, 3;
(G2, Q79, Guessing[0]) = Beta, 4, 16;
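For reference, the Slope, Intercept, and Guessing parameters constrained above are IRTPRO's slope-intercept parameterization of the 3PL model. The following sketch (not part of the IRTPRO run; the item parameter values are hypothetical) shows how a response probability is computed from these parameters and how the intercept maps to the traditional difficulty parameter b = -c/a:

```python
import math

def p_3pl(theta, slope, intercept, guessing):
    """Probability of a correct response under the 3PL model in
    slope-intercept form: P = g + (1 - g) / (1 + exp(-(a*theta + c)))."""
    return guessing + (1.0 - guessing) / (1.0 + math.exp(-(slope * theta + intercept)))

def to_traditional_b(slope, intercept):
    """Convert the intercept c to the traditional difficulty b = -c/a."""
    return -intercept / slope

# Hypothetical item: a = 1.2, c = -0.6, g = 0.2
print(round(p_3pl(0.0, 1.2, -0.6, 0.2), 3))   # probability at theta = 0
print(to_traditional_b(1.2, -0.6))            # difficulty b = 0.5
```

Under this parameterization, each Equal constraint above forces the two groups to share one of these three quantities for a given item; freeing them allows the item's curve to differ by group.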
B. BILOG-MG 3 Input File for DIF Detection for Two Groups with 3PL
Master Thesis
All Races 3PL DIF
>COMMENT
An empirical comparison of the three programs is presented using the fall 2010 data of the
GHSGPT. This study detects DIF for different ethnicities only in social studies,
which consists of 79 dichotomously scored items.
>GLOBAL DFName = 'D:\Thesis\Result\BL\WALL\WALL.1.dat',
NPArm = 3;
>LENGTH NITems = (79);
>INPUT NTOtal = 79,
NIDchar = 4,
NGRoup = 4,
DIF;
>ITEMS ;
>TEST1 TNAme = 'WALL3PL',
INUmber = (1(1)79);
>GROUP1 GNAme = 'WRFGROUP',
LENgth = 79,
INUmbers = (1(1)79);
>GROUP2 GNAme = 'BFCGROUP',
LENgth = 79,
INUmbers = (1(1)79);
(4A1, 4X, I1, 4X, 79A1)
>CALIB CRIt = 0.0050,
PLOt = 1.0000,
ACCel = 1.0000,
TPRIOR;
>SCORE ;
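The format statement (4A1, 4X, I1, 4X, 79A1) in the command file above describes the layout of each record in the data file: a 4-character examinee ID, four skipped columns, a one-digit group code, four more skipped columns, and 79 one-character item responses. A minimal sketch of reading one such record (the sample record below is hypothetical, not taken from the GHSGPT data):

```python
def parse_record(line):
    """Parse one fixed-width record laid out as (4A1, 4X, I1, 4X, 79A1):
    columns 1-4 = ID, 5-8 skipped, 9 = group code, 10-13 skipped,
    14-92 = 79 dichotomous item responses."""
    examinee_id = line[0:4]
    group = int(line[8])
    responses = [int(ch) for ch in line[13:92]]
    return examinee_id, group, responses

# Hypothetical 92-column record: ID "0001", group 1, then 79 responses.
record = "0001    1    " + "1" * 40 + "0" * 39
eid, group, resp = parse_record(record)
print(eid, group, len(resp), sum(resp))   # 79 items, 40 correct
```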
C. IRTLRDIF Input File for DIF Detection for Two Groups with 3PL
2654
79
111111111111111111111111111111111111111111111111111111111111111111111111111111
1
WBLR.dat
4
1
5-83
WBLR3PL.out
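For context, the likelihood-ratio test that IRTLRDIF performs compares, for each studied item, a model constraining the item's parameters to be equal across the reference and focal groups against an augmented model that frees them; twice the difference in log-likelihoods is referred to a chi-square distribution with degrees of freedom equal to the number of freed parameters (three per item under the 3PL). A minimal sketch of that final computation (the log-likelihood values below are hypothetical, not results from this study):

```python
def lr_dif_statistic(loglik_constrained, loglik_augmented):
    """G^2 = -2 * (logL_constrained - logL_augmented), where the constrained
    model equates the studied item's parameters across the two groups."""
    return -2.0 * (loglik_constrained - loglik_augmented)

# Hypothetical log-likelihoods from two nested calibrations of one item.
g2 = lr_dif_statistic(-41235.6, -41230.1)
critical = 7.815   # chi-square critical value for df = 3, alpha = .05
print(round(g2, 1), g2 > critical)   # G^2 = 11.0, so DIF would be flagged
```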