A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR
DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND
IRTLRDIF V.2
by
MEI LING ONG
(Under the Direction of Seock-Ho Kim)
ABSTRACT
This paper addresses statistical issues of differential item functioning (DIF). The first
purpose of this study is to present an empirical data comparison of the IRTPRO, BILOG-MG 3,
and IRTLRDIF programs and to detect DIF across two samples with IRT models, 1PL, 2PL, and
3PL. The second purpose is to examine IRTPRO to determine its effectiveness in detecting DIF,
and, finally, to consider whether DIF exists in the GHSGPT for different ethnicities only in
Social Studies. The GHSGPT predicts 11th grade students’ future performance on the Georgia
High School Graduation Test and consists of 79 dichotomously scored items. The results show
that several DIF items exist in the GHSGPT. For instance, all three programs consistently
indicate that Item 13 is beneficial to Whites. In addition, IRTPRO is effective in detecting DIF
because its results parallel those of IRTLRDIF and BILOG-MG 3.
INDEX WORDS: Differential item functioning (DIF), IRTPRO, BILOG-MG 3, IRTLRDIF,
IRT, 1PL, 2PL, and 3PL.
A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR
DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND
IRTLRDIF V.2
by
MEI LING ONG
B.A., Fu-Jen Catholic University, Taiwan, 1999
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF ARTS
ATHENS, GEORGIA
2012
© 2012
MEI LING ONG
All Rights Reserved
A COMPARISON OF DIFFERENTIAL ITEM FUNCTIONING (DIF) DETECTION FOR
DICHOTOMOUSLY SCORED ITEMS BY USING IRTPRO 2.1, BILOG-MG 3, AND
IRTLRDIF V.2
by
MEI LING ONG
Major Professor: Seock-Ho Kim
Committee: Allan S. Cohen
Stephen E. Cramer

Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2012
ACKNOWLEDGEMENTS
I sincerely appreciate those who supported and encouraged me throughout this process. I
would like to thank my advisor, Dr. Seock-Ho Kim, for his guidance and technical support
throughout this study, without which I would not have completed this thesis. In addition, I would
like to thank the members of my committee, Dr. Allan S. Cohen and Dr. Stephen E. Cramer, for
their comments and helpful suggestions while completing this thesis. Furthermore, I want to
thank my friends, Yoonsun, Youn-Jeng, Sunbok, Stephanie Short, Mary Edmond, and many
other friends, who offered their opinions on this thesis. Lastly and importantly, I wish
to express my deepest appreciation to my parents, my elder brother, my younger sister, and my
younger auntie for their support and encouragement. To my lovely husband, Man Kit Lei, thanks
for cooking lunch and dinner for me while I was researching, writing and revising this study.
Because of your unending encouragement and full support, I have had an opportunity to obtain
my Master’s Degree. Thank you very much.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS ........................................................................................................... iv
LIST OF TABLES ........................................................................................................................ vii
LIST OF FIGURES ..................................................................................................................... viii
CHAPTER
1 INTRODUCTION .........................................................................................................1
1.1 Overview ............................................................................................................1
1.2 Item Bias, Differential Item Functioning (DIF), and Impact .............................2
1.3 The Purpose of the Study ...................................................................................6
2 LITERATURE REVIEW ..............................................................................................7
2.1 Classical Test Theory .........................................................................................7
2.2 Modern Test Theory ..........................................................................................9
2.3 Estimation of Item Parameters .........................................................................10
2.4 Dichotomously Scored Items ...........................................................................12
2.5 The DIF Detection Method ..............................................................................18
2.6 Current Research ..............................................................................................27
3 METHOD ....................................................................................................................28
3.1 Research Structure ...........................................................................................28
3.2 Instrumentation ................................................................................................29
3.3 Sample..............................................................................................................29
3.4 Computer Programs .........................................................................................30
4 RESULTS ....................................................................................................................33
4.1 Item Analysis ...................................................................................................33
4.2 Racial Differential Item Functioning (DIF) Analysis ......................................41
5 SUMMARY AND DISCUSSION ...............................................................................80
5.1 Summary ..........................................................................................................80
5.2 Discussion ........................................................................................................83
REFERENCES ..............................................................................................................................88
APPENDICES
A IRTPRO Input File for DIF Detection for Two Groups with 3PL ..............................95
B BILOG-MG 3 Input File for DIF Detection for Two Groups with 3PL ....................101
C IRTLRDIF Input File for DIF Detection for Two Groups with 3PL .........................103
LIST OF TABLES
Page
Table 1: The Development of Item Response Models and Computer Programs ..........................14
Table 2: The 2-by-2 Contingency Table ........................................................................................19
Table 3: The DIF Detection for Ethnicity ......................................................................................30
Table 4: Raw Score Summary Statistics for the GHSGPT ............................................................33
Table 5: Item Statistics Based on Classical Test Theory ...............................................................36
Table 6: Item Statistics Based on Item Response Theory..............................................................39
Table 7: The Summary of Goodness of Fit Using BILOG-MG 3 .................................................42
Table 8: The Summary of Goodness of Fit Using IRTPRO ..........................................................42
Table 9: The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups
with 1PL ..............................................................................................................................44
Table 10: The Summary of IRTLRDIF for Three Comparison Groups with 2PL ........................47
Table 11: The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison
Groups with 2PL ................................................................................................................52
Table 12: The Summary of IRTLRDIF for Three Comparison Groups with 3PL ........................56
Table 13: The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison
Groups with 3PL ................................................................................................................61
Table 14: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL ......65
Table 15: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL ......68
Table 16: The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL ......71
LIST OF FIGURES
Page
Figure 1: No DIF between two groups ...........................................................................................4
Figure 2: DIF exists in two groups called uniform DIF...................................................................4
Figure 3: Non-uniform DIF ............................................................................................................5
Figure 4: The research structure ...................................................................................................28
Figure 5: Item 13 between Whites and Blacks ..............................................................................73
Figure 6: Item 14 between Whites and Blacks ..............................................................................73
Figure 7: Item 15 between Whites and Blacks ..............................................................................74
Figure 8: Item 32 between Whites and Blacks ..............................................................................74
Figure 9: Item 44 between Whites and Blacks ..............................................................................75
Figure 10: Item 45 between Whites and Blacks ............................................................................75
Figure 11: Item 56 between Whites and Blacks ............................................................................76
Figure 12: Item 57 between Whites and Blacks ............................................................................76
Figure 13: Item 78 between Whites and Blacks ............................................................................77
Figure 14: Item 13 between Whites and Hispanics .......................................................................77
Figure 15: Item 19 between Whites and Hispanics .......................................................................78
Figure 16: Item 51 between Whites and Hispanics .......................................................................78
Figure 17: Item 44 between Whites and the Multi-Racial Group ..................................................79
CHAPTER 1
INTRODUCTION
1.1 Overview
A well-constructed test is the best way to evaluate a student’s mastery in a particular
field. Gronlund (1993) stated that tests not only aid teachers in making various instructional
decisions by having a direct influence on students’ learning, but they also assist in a number of
other ways. For instance, tests can increase students’ motivation. The purposes of tests are to
obtain an accurate and fair assessment of a student’s abilities. Nevertheless, a test cannot
properly evaluate skills or knowledge bases if it is affected by irrelevant factors that could bias
the results. These potentially biasing factors could include gender, ethnic, and cultural
differences. Without properly accounting for these confounding factors, the results of the test
will be an unfair representation of students’ abilities (Gronlund, 1993). In other words, if a test is
unfair for examinees because of gender, ethnic origin, or cultural bias, then its results are
essentially meaningless. For instance, Freedle and Kostin (1988) investigated whether the GRE
verbal item types functioned differently across races. They found that
most of the GRE verbal items advantaged Whites. Thus, test fairness is an important issue with
which researchers must be concerned.
There are several ways to measure students’ cognitive abilities in standardized testing.
Currently, multiple-choice tests are commonly used for measuring students’ cognitive abilities
(Ling & Lau, 2005). Most schools use standardized scores to evaluate educational quality and
student performance (Brescia & Fortune, 1988). If test scores are an important factor in
evaluating students’ performance, test developers should make tests as fair as possible for
examinees of different races, genders, or handicapping conditions (APA, 1988). In order to
ensure that all items are as free as possible from irrelevant sources of variance, all items should
be reviewed because the presence of bias may unfairly affect examinees’ scores (Hambleton &
Swaminathan, 1985). Hence, detecting differential item functioning (DIF) can be seen as a
critical step in detecting biased items.
1.2 Item Bias, Differential Item Functioning (DIF), and Impact
Research on item bias first appeared in the literature in the 1960s. Angoff (1993)
characterized bias as follows: “An item is biased if equally able (or proficient) individuals, from different
groups, do not have equal probabilities of answering the item correctly” (p. 4). Lord (1980) also
noted that a test would be unbiased if each item has exactly the same item response function in
each group, and examinees have exactly the same opportunity of obtaining the correct item at
any given level of ability, θ. However, if each item has a different item response function
between a reference group and a focal group, the item, obviously, is biased. Furthermore, Shealy
and Stout (1993) indicated that “if the matching criterion is judged to be construct-valid in the
sense that it is matching examinees on the basis of the latent trait (target ability) the test is
designed to measure without contamination from other unintended to be measured abilities then
the DIF item is said to be biased” (p. 197). For example, the word commodious, a verbal aptitude
item, advantages Hispanic examinees. The word commodious was considered biased because it
has a similar form and meaning in Spanish (Zieky, 1993). While researchers have determined the
need to identify such bias in testing, the very word “bias” is sometimes confusing and evokes
negative emotional reactions similar to the words “discrimination” and “racism” (Berk, 1982).
Eventually, researchers proposed DIF to replace the term bias (Angoff, 1993).
DIF involves testing examinees from different populations that share the same abilities
but differ in their probabilities of giving correct responses on test items (Crocker & Algina,
2008). For example, a mathematics test requires skills in computation and reading and assumes
that all examinees have the same computational ability. Nonetheless, if one group is proficient in
reading English, but another group is made up of English as a second language (ESL)
individuals, these groups would not have equal English proficiency. In this situation, even
though all examinees are matched in their computational abilities, they will provide different
answers on the mathematics items because they differ in English proficiency. DIF therefore exists
on the mathematics test. On the other hand, if two groups exhibit different performances on a
mathematics test because they do not share the same ability, then this situation displays impact
rather than DIF.
Impact refers to a difference in performance on an item between two groups and is what
Holland and Thayer (1988) called “differential item performance.” If DIF exists for a focal group
relative to some reference group, then the item characteristic curves (ICCs) differ for the two
groups (Cai et al., 2011). In other words, there is no DIF if the ICCs are equal as shown in Figure
1. On the other hand, DIF exists when the ICCs differ as shown in Figure 2. Thus, Lord (1980)
argued that DIF detection questions could be approached by comparing estimates of the item
parameters between groups, as the ICCs for an item are determined by the item parameters. When DIF
exists in a test, the affected items reflect construct-irrelevant factors, and this
compromises the validity of the items utilized. If a test is found to contain a biased
item, this item should be omitted in order to achieve a fairer test. Thus, determining DIF is an
important step in maintaining items’ effectiveness and fairness as well as in enhancing the
validity of a test.
Figure 1. No DIF between two groups.
Figure 2. DIF exists in two groups called uniform DIF.
Two types of DIF, uniform and non-uniform, have been defined by Mellenbergh (1982).
Uniform DIF means that the difference between two groups’ probabilities of obtaining a
correct response to an item is the same across all ability levels; uniform DIF thus involves no
interaction between group membership and ability, as shown in Figure 2. Non-uniform
DIF, also called crossing DIF (CDIF), refers to an item that discriminates across ability levels
differently for separate groups, which means that the probability of giving correct responses on
test items for different groups is not the same at all ability levels, as shown in Figure 3. Thus, there
is an interaction between ability levels and separate groups when non-uniform DIF exists
(Swaminathan & Rogers, 1990).
Overall, bias is not a simple synonym for DIF. The differentiation between bias and DIF
depends on “the extent to which a convincing construct validity argument has been given for the
matching criterion” (Shealy & Stout, 1993, p. 197). Therefore, most analyses of test data
examine DIF rather than item bias.
Figure 3. Non-uniform DIF.
1.3 The Purpose of the Study
In order to provide a fair and equitable test, the detection of DIF is necessary.
Traditionally, classical test theory was widely used because of its computational simplicity.
However, several computer programs, such as BILOG-MG 3 (Zimowski et al., 2003), flexMIRT
(Cai, 2012), and IRTPRO (Cai et al., 2011), have recently been developed which can address
complex mathematical computations. As a result, item response theory has grown in popularity.
This current study analyzes the data of the Georgia High School Graduation Predictor Test
(GHSGPT) to investigate DIF across multiple groups using several computer programs with
three popular IRT models for dichotomously scored items. This study has three main objectives.
The first objective is to present an empirical data comparison of three programs, IRTPRO,
BILOG-MG 3, and IRTLRDIF, in order to detect DIF across majority and minority groups with
the one-parameter logistic (1PL), two-parameter logistic (2PL), and three-parameter logistic (3PL)
models. The second purpose is to examine IRTPRO to determine its effectiveness in detecting
DIF. Finally, this study considers whether the GHSGPT exhibits DIF for different ethnicities.
CHAPTER 2
LITERATURE REVIEW
Currently, classical test theory (CTT) and item response theory (IRT) are popular
statistical structures for addressing measurement problems such as test development, test-score
equating, and the identification of biased test items. Forty years ago, Frederic Lord indicated that
examinees’ observed scores and true scores were not the same as their ability scores because
ability scores are test independent (Hambleton & Jones, 1993). On the other hand, examinees’
observed- and true- scores are test-dependent (Lord, 1953). Thus, the CTT and the IRT are
widely perceived as representing two measurement frameworks.
2.1 Classical Test Theory
Classical test theory (CTT) or traditional measurement theory, which is referred to as the
“classical test model,” is regarded as the “true score theory” and includes three concepts: 1) the
observed score (test score); 2) the true score; and 3) the error score. Each observed score is made
up of two components, which are the “true score (T)” and the “error score (E)” (Hambleton &
Jones, 1993). The model of CTT is defined as:
X = T + E, (1)
where X is the test score (observed score), T is the true score, and E is the error score.
Observed scores are simply the scores individuals obtain on the measuring instrument.
The true score is the one that each observer desires to obtain. However, the true score, in fact, is
an unknown value and cannot be directly observed. It is inferred from the observed scores, and it
can merely be estimated. For individuals, the theoretical value of the true scores represents a real
psychological operation or academic performance. The true score for examinee j is given as:
T_j = E(X_j) = \mu_{X_j}. (2)
Errors include systematic errors, random errors, and measurement errors (Spector, 1992).
CTT assumes that each examinee would obtain his or her true score if there were no errors of
measurement, that is, X = T. Because the expected value of X is T, the expectation of E is zero (Lord, 1980):
\mu_{E|T} \equiv \mu_{(X-T)|T} \equiv \mu_{X|T} - \mu_{T|T} = T - T = 0, (3)
where μ is the mean, and the subscripts state that T is fixed. Equation 3 indicates that the error of
measurement is unbiased. If T and E are independent, the observed-score variance is defined as:
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, (4)

where \sigma_X^2 is the variance of the observed score (total score), \sigma_T^2 is the variance of the true score,
and \sigma_E^2 is the variance of errors. Reliability refers to the stability and consistency of assessment
results. The index of reliability can be stated as the ratio of the standard deviation of true scores
to the standard deviation of the observed scores (Lord & Novick, 1968) and is defined as:
\rho_{XT} = \sigma_T / \sigma_X, (5)

where \rho_{XT} is the correlation between true and observed scores, \sigma_T is the standard deviation of the
true score, and \sigma_X is the standard deviation of the observed score. Nevertheless, the true score is
unknown, so getting the Pearson correlation between the observed scores on parallel tests is a
way to estimate the reliability coefficient. The reliability coefficient is given by:
\rho_{XX'} = \sigma_T^2 / \sigma_X^2, (6)

where \rho_{XX'} is the correlation between observed scores on two parallel tests; X and X' are referred
to as parallel measurements.
The assumptions of CTT are that: (1) true and error scores are independent, (2) the
average error score in the population of test takers is zero, and (3) error scores on parallel tests
are independent. The important advantage of CTT is its weak theoretical assumptions which
make it easy to employ in many testing situations. However, CTT’s major limitations are that:
(1) the person statistics are item dependent and (2) the item statistics, such as item difficulty and
item discrimination, are sample dependent (Hambleton & Jones, 1993). Although CTT is easy to
compute and to understand, its item and person statistics are sample dependent, which makes it
difficult to obtain consistent estimates of difficulty, discrimination,
and reliability across samples taking the same test. In order to overcome the disadvantages of CTT, modern test
theory, which is based on the item response theory framework, was developed.
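The parallel-forms logic behind Equations 5 and 6 can be checked numerically. The sketch below is an illustration with invented variance components, not data from this study: it simulates true scores plus independent errors on two parallel forms, then shows that the Pearson correlation between the forms recovers the theoretical reliability \sigma_T^2 / \sigma_X^2.

```python
import math
import random

random.seed(1)

# Simulate N examinees: a true score T plus an independent error on each parallel form.
# The variance components below are hypothetical, chosen to give reliability 0.8.
N = 20000
sigma_T, sigma_E = 10.0, 5.0
T = [random.gauss(50, sigma_T) for _ in range(N)]
X1 = [t + random.gauss(0, sigma_E) for t in T]   # form X
X2 = [t + random.gauss(0, sigma_E) for t in T]   # parallel form X'

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Equation 6: theoretical reliability is var(T) / var(X).
rho_theory = sigma_T ** 2 / (sigma_T ** 2 + sigma_E ** 2)
# Empirical estimate: correlation between the two parallel forms.
rho_hat = pearson(X1, X2)
print(f"theoretical rho = {rho_theory:.3f}, parallel-forms estimate = {rho_hat:.3f}")
```

With these variance components the theoretical reliability is 100/125 = 0.8, and the parallel-forms correlation lands very close to that value, which is exactly why the correlation between parallel tests serves as the reliability estimate when the true scores themselves are unobservable.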
2.2 Modern Test Theory
The theoretical structure of modern test theory (or modern measurement theory) is item
response theory (IRT). IRT, which is also known as “latent trait theory,” is a general statistical
theory concerning an examinee’s item and test performance and how his or her performance
relates to the abilities that are measured by the items in the test (Hambleton & Jones, 1993). In
other words, IRT mainly focuses on item-level information. The essential elements of an IRT
model are ability or proficiency, which is an unobservable (latent) variable, usually denoted by θ,
that varies within the population of examinees and the item characteristic curve (ICC) (Thissen et
al., 1993). The ICC is the curve that describes the functional relationship between the probability
of a correct response to an item and the ability scale. The ICC is denoted as follows (Baker &
Kim, 2004):

P(\beta_i, \alpha_i, \theta_j) \equiv P_i(\theta_j), (7)
where 𝑃𝑖(𝜃𝑗) is the probability of the correct response at any point θj on the ability scale ( j = 1,
2, 3,…,N), i is an item (i = 1, 2, 3, …,n), βi is the difficulty parameter, and αi is the
discrimination parameter (Baker & Kim, 2004). Item responses can be discrete or continuous and
dichotomously or polytomously scored. Item score categories can be ordered or unordered.
The assumptions of IRT are: (1) dimensionality, which may be uni- or multi-dimensional, and
(2) local independence, also called conditional independence, which means that every person has a
certain probability of giving a predefined response to each item, and this probability is
independent of the answers given to the preceding items (Crocker & Algina, 2008). The
characteristics of IRT are parameter invariance and the information function, neither of which CTT
offers. The major limitation of IRT is that it tends to be complex in
its computations.
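Local independence has a concrete computational consequence: conditional on θ, the probability of a whole response pattern is just the product of the per-item probabilities. The sketch below illustrates this with three hypothetical Rasch items (the difficulties are invented for illustration).

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_probability(theta, pattern, difficulties):
    """Under local independence, the conditional probability of a response
    pattern is the product of the per-item probabilities P or (1 - P)."""
    prob = 1.0
    for u, b in zip(pattern, difficulties):
        p = p_correct(theta, b)
        prob *= p if u == 1 else (1.0 - p)
    return prob

b = [-1.0, 0.0, 1.0]                      # hypothetical item difficulties
print(pattern_probability(0.0, (1, 1, 0), b))
```

Because the per-item probabilities multiply, the probabilities of all 2^n possible response patterns at a given θ sum to one; this factorization is what every IRT likelihood, including the ones in the next section, is built on.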
2.3 Estimation of Item Parameters
This study applies three computer programs, IRTPRO, BILOG-MG 3 and IRTLRDIF, to
analyze DIF. These three programs implement the method of marginal maximum likelihood
estimation (MMLE) and maximum likelihood estimation (MLE) for item parameter estimation.
Hence, this study utilizes only MMLE and MLE.
2.3.1 Marginal Maximum Likelihood Estimation (MMLE)
The method of marginal maximum likelihood estimation (MMLE) was proposed by
Bock and Lieberman (1970). However, their approach was practical only for very short tests; the
computation was complicated, and the estimation was slow. Thus, in order to solve these
problems, Bock and Aitkin (1981) developed the expectation-maximization (EM) algorithm to
improve the effectiveness of the MMLE. Baker and Kim (2004) indicated that the MMLE
assumes that examinees represent a random sample from a population where ability is distributed
based on a density function g(θ|τ), where τ refers to the vector containing the parameters of the
examinee population’s ability distribution. This situation corresponds to a mixed-effects
ANOVA model in which items are considered a fixed effect and abilities a random effect. The
essential feature of the Bock and Lieberman solution is its ability to integrate over the ability
distribution and to remove random nuisance parameters from the likelihood functions (Baker &
Kim, 2004). Therefore, item parameters are estimated in the marginal distribution; the item
parameter estimation is freed from its dependency on the estimation of each examinee's ability,
although not from its dependency upon the ability distribution. The ability is estimated
together with the item parameters if the ability distribution is correctly identified (Baker & Kim,
2004). Because increasing sample size does not require the estimation of additional examinee
parameters, this produces consistent estimates of item parameters for samples of any size
(Harwell et al., 1988). The marginal likelihood function will be maximized in order to obtain
item parameters; the marginal likelihood is given below (Baker & Kim, 2004):

L = \prod_{j=1}^{N} \int \prod_{i=1}^{n} P_i(\theta_j)^{u_{ij}} Q_i(\theta_j)^{1-u_{ij}} g(\theta_j|\tau) \, d\theta_j, (8)

where u_{ij} is the dichotomous response (0 or 1) of examinee j to item i, Q_i(\theta_j) = 1 - P_i(\theta_j), and g(\theta_j|\tau) is the
probability density function of ability in the population of examinees (Baker & Kim, 2004).
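In practice the integral in Equation 8 is approximated by numerical quadrature over a grid of θ values. The sketch below is a deliberately simple illustration, not how BILOG-MG or IRTPRO implement it: it uses a plain rectangle rule over a standard-normal ability prior and three hypothetical 2PL items to compute the marginal probability of one examinee's response pattern.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def marginal_likelihood(response_pattern, items, n_points=81):
    """Approximate the Equation 8 integral for one examinee: integrate
    P^u * Q^(1-u) over a standard-normal ability prior using a simple
    rectangle rule on an equally spaced theta grid from -4 to 4."""
    total = 0.0
    step = 8.0 / (n_points - 1)
    for k in range(n_points):
        theta = -4.0 + step * k
        weight = math.exp(-0.5 * theta ** 2) / math.sqrt(2 * math.pi)
        cond = 1.0
        for u, (a, b) in zip(response_pattern, items):
            p = p_2pl(theta, a, b)
            cond *= p if u == 1 else (1.0 - p)
        total += cond * weight * step
    return total

items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]   # hypothetical (a, b) pairs
print(marginal_likelihood((1, 0, 1), items))
```

Summing this marginal probability over all possible response patterns recovers (approximately) 1.0, because the integrand is a proper conditional probability weighted by a proper density; production programs replace the rectangle rule with Gauss-Hermite quadrature points and wrap the whole computation inside EM iterations.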
2.3.2 Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation (MLE) begins with a mathematical expression
known as the likelihood function: the likelihood of a set of parameter values is the
probability of obtaining the observed data under the chosen probability
distribution model at those parameter values (Czepiel, 2002). The parameter values that
maximize the sample likelihood are known as the maximum likelihood estimates (MLEs).
The MLE procedures will be presented for the two-parameter logistic model (Baker & Kim,
2004) that is given by:
P_j = \Psi(Z_j) = \frac{1}{1 + e^{-(\zeta + \lambda\theta_j)}}, (9)

where Z_j = \zeta + \lambda\theta_j is the logit, \zeta is the intercept, and \lambda is the slope. The likelihood function
is defined by:
\mathrm{Prob}(R) = \prod_{j=1}^{k} \frac{f_j!}{r_j!\,(f_j - r_j)!} \, P_j^{r_j} (1 - P_j)^{f_j - r_j}, (10)

where r_j is the number of correct responses, f_j - r_j is the number of incorrect responses, and P_j is the true
probability of a correct response. There are \binom{f_j}{r_j} different ways to arrange r_j successes from among
f_j trials for each population; the probability of success on any one of the f_j trials is P_j, so the
probability of r_j successes is P_j^{r_j} (Czepiel, 2002). Similarly, the probability of f_j - r_j failures is
(1 - P_j)^{f_j - r_j}. The maximum likelihood estimates are the parameter values that maximize the
likelihood function in Equation 10 (Czepiel, 2002).
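The maximization in Equation 10 can be sketched directly. The grouped data below are invented for illustration (ability level, number of trials, number of successes), and the optimizer is a deliberately crude grid search rather than the Newton-type iterations real programs use; it finds the intercept \zeta and slope \lambda of Equation 9 that maximize the log-likelihood.

```python
import math

def log_likelihood(zeta, lam, data):
    """Log of Equation 10 (constant binomial coefficients dropped, since
    they do not depend on the parameters). data holds (theta_j, f_j, r_j)."""
    ll = 0.0
    for theta, f, r in data:
        p = 1.0 / (1.0 + math.exp(-(zeta + lam * theta)))   # Equation 9
        ll += r * math.log(p) + (f - r) * math.log(1.0 - p)
    return ll

# Hypothetical grouped data: at each ability level, f_j trials and r_j successes.
data = [(-1.0, 100, 20), (0.0, 100, 50), (1.0, 100, 80)]

# Crude grid search over intercept and slope (step 0.1) for the maximizer.
best = max(((z / 10.0, l / 10.0)
            for z in range(-30, 31) for l in range(1, 41)),
           key=lambda p: log_likelihood(p[0], p[1], data))
print("MLE (intercept, slope) ~", best)
```

With these data the observed proportions 0.2, 0.5, and 0.8 fall exactly on a logistic curve with intercept 0 and slope ln(4) ≈ 1.386, so the grid search lands on the nearest grid point to that pair.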
2.4 Dichotomously Scored Items
For psychological and educational testing, dichotomous scoring, polytomous scoring, and
continuous scoring are commonly used in the scoring of item responses. Previously, DIF
research primarily focused on dichotomously scored items (Embretson & Reise, 2000); recently,
however, several studies mention polytomously scored items (Raju et al., 1995). Because this
study is focused on the unidimensional dichotomously scored items, it discusses only the
unidimensional dichotomously scored items.
Dichotomously scored items, marked as either correct or incorrect, make up the majority
of the multiple-choice test items analyzed, even though a multiple-choice test item typically has four
options (Potenza & Dorans, 1995). Van der Linden and Hambleton (1996) noted that if
examinee j’s response to item i is denoted by a random variable Uij, the two scores are coded as
Uij = 1 (correct) and Uij = 0 (incorrect). The examinee’s ability is represented by the
parameter θ ∈ (-∞, ∞). The properties of item i that have an
effect on the probability of success are its difficulty, bi ∈ (-∞, ∞), and discriminating power, ai
∈ (-∞, ∞). The probability of success on item i is usually denoted by Pi(θ), a function of θ
specific to item i, known as the item response function (IRF), item characteristic curve (ICC), or
trace line. The IRF is generally not linear in θ; it is usually assumed to be monotonically increasing
as θ rises, so that it gives a different probability of a correct response across the ability
continuum (Thissen et al., 1993).
2.4.1 Item Response Models
Dimensionality, which is one of the assumptions under IRT, includes unidimensionality
and multidimensionality. Both the unidimensional item response theory (UIRT) model and the
multidimensional item response theory (MIRT) model include dichotomously and polytomously
scored items. Based on the different scoring and dimensionality, researchers developed different
item response models. Table 1 briefly displays the dimensionality, scoring, parameters, model
presented by researchers, and computer programs that are appropriate to use in different models.
Table 1
The Development of Item Response Models and Computer Programs

Unidimensionality
  Dichotomous (programs: Winsteps, BILOG-MG, IRTPRO, flexMIRT, TESTFACT)
    One-Parameter Logistic Model or Rasch Model (1PLM): Rasch (1960)
    Two-Parameter Logistic Model (2PLM): Birnbaum (1968)
    Three-Parameter Logistic Model (3PLM): Birnbaum (1968)
  Polytomous (programs: MULTILOG, PARSCALE, IRTPRO, flexMIRT, ConQuest)
    Nominal Response Model: Bock (1972)
    Rating Scale Model: Andrich (1978)
    Graded Response Model: Samejima (1969)
    Partial Credit Model: Masters (1982)
    Generalized Partial Credit Model: Muraki (1991)
Multidimensionality
  Dichotomous (programs: TESTFACT, NOHARM, ConQuest, BMIRT, IRTPRO, flexMIRT)
    Multidimensional Extension of the Rasch Model (M1PL): Adams, Wilson, & Wang (1997)
    Multidimensional Extension of the Two-Parameter Logistic Model: McKinley & Reckase (1991)
    Multidimensional Extension of the Three-Parameter Logistic Model: Reckase (1985)
  Polytomous (programs: POLYFACT, BMIRT, IRTPRO, flexMIRT)
    Multidimensional Extension of the Graded Response (MGR) Model: Muraki & Carlson (1993)
    Multidimensional Extension of the Partial Credit (MPC) Model: Kelderman & Rijkes (1994)
    Multidimensional Extension of the Generalized Partial Credit (MGPC) Model: Yao & Schwarz (2006)

Note. Adapted from Multidimensional Item Response Theory, by M. D. Reckase, 2009. Copyright 2009 by Springer.
15
The one-parameter logistic (1PL) model, also called the Rasch model, the two-parameter
logistic (2PL) model, and the three-parameter logistic (3PL) model are the three most popular
unidimensional IRT models for dichotomous tests. Because this study focuses on UIRT, only
these three models are discussed.
2.4.1.1 The One-Parameter Logistic (1PL) Model or The Rasch Model
In the 1950s, Georg Rasch (1960) developed his Poisson models for reading tests and a
model for intelligence and achievement tests, which is called the Rasch model. Under the Rasch
model, both guessing and discrimination are negligible or constant. The main motivation of the
Rasch model was to remove references to populations of examinees in analyses of tests. The test
analysis would only be worthwhile if it were individual centered with separate parameters for the
items and the examinees. The Rasch model was derived from the initial Poisson model defined
as (Van der Linden & Hambleton, 1996):

ξ = δθ, (11)
where 𝜉 is a function of parameters describing the ability of an examinee and difficulty of the
test, θ is the ability of the examinee, and δ is the difficulty of the test that is estimated by the
summation of errors in a test.
The model was later extended so that the probability that a student answers a question
correctly is a logistic function of the difference between the student’s ability θ and the
question’s difficulty. Currently, the Rasch model is specified as:
P(θ) = e^(θ−b) / [1 + e^(θ−b)], (12)

where P(θ) depends upon the particular ICC model used, e is the constant 2.718, θ is the ability,
and b is an item difficulty parameter.
The difficulty parameter, b, locates the item on the ability scale. It is defined
as the point on the ability scale at which the probability of a correct response to the item is .5
(Baker & Kim, 2004). Under the Rasch model, the discriminations of all items are assumed to
equal one. The Rasch model is appropriate for dichotomous responses and models the
probability of an individual’s correct response on a dichotomous item.
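As a concrete illustration, Equation 12 can be evaluated in a few lines of Python. This is only a sketch of the model itself, not the estimation machinery of the programs used in this study, and the parameter values are invented:

```python
import math

def rasch_prob(theta, b):
    """Equation 12: P(theta) = e^(theta - b) / (1 + e^(theta - b))."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# At theta = b, the probability of a correct response is exactly .5.
print(rasch_prob(0.0, 0.0))                          # 0.5
# The IRF is monotonically increasing in theta.
print(rasch_prob(1.0, 0.0) > rasch_prob(-1.0, 0.0))  # True
```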
2.4.1.2 The Two-Parameter Logistic (2PL) Model
Unlike Rasch, Birnbaum’s aim was to finish the work begun by Lord (1952) on the
normal-ogive model. The contribution of Birnbaum was to replace the normal-ogive model with
the logistic model. Thus, Birnbaum (1968) proposed the two-parameter logistic (2PL) model,
which extends the 1PL by estimating an item discrimination parameter (a) and an item difficulty
parameter (b). The 2PL model is given as:
P(θ) = e^(a(θ−b)) / [1 + e^(a(θ−b))], (13)
where a is the discrimination parameter without the scaling constant D= 1.702.
The discrimination parameter, a, describes how well an item can differentiate between
examinees with abilities below and above the item location. It also reflects the steepness of the
ICC in its middle section. The steeper the curve, the higher the value of a and the better the item
discriminates; the flatter the curve, the lower the value of a and the less the item differentiates
(Baker, 2001).
2.4.1.3 The Three-Parameter Logistic (3PL) Model
Besides 2PL, Birnbaum (1968) proposed a third parameter for inclusion in the model to
consider the nonzero performance, which is the probability of guessing correct answers, of low-
ability examinees on multiple-choice items. The three-parameter logistic (3PL) model is defined
as:
P(θ) = c + (1 − c) e^(a(θ−b)) / [1 + e^(a(θ−b))], (14)
where c is the lower asymptote of an ICC.
The lower asymptote, c, which is commonly referred to as the “pseudo-chance level”
parameter, represents the probability of examinees with low ability correctly answering an item.
In general, the c parameter takes values smaller than the value that would result if
examinees of low ability guessed randomly on the item. Thus, Lord (1974) has noted that c
is no longer called the “guessing parameter” because this phenomenon can probably be attributed
to item writers developing “attractive” but incorrect choices. A side effect of including the
parameter c is that the definition of the difficulty parameter changes, and the lower limit of the
ICC is the value of c rather than zero. The difficulty parameter b is now the point on the ability
scale at which:

P(θ) = (1 + c)/2, (15)
and the discrimination parameter is proportional to

a(1 − c)/4, (16)

the slope of the item characteristic curve at θ = b (Baker, 2001).
In discussing this model, McDonald (1999) stated that the 3PL was designed specifically for
multiple-choice cognitive items and that it is appropriate to refer to the latent trait as the
ability common to the m items in the test. With the introduction of the pseudo-guessing
parameter, there is no quantity calculated from the response pattern that serves as a sufficient
statistic for ability.
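The three 3PL properties just described (Equations 14–16) can be checked numerically; the parameter values below are invented for illustration:

```python
import math

def p3pl(theta, a, b, c):
    """Equation 14: P(theta) = c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

a, b, c = 1.2, 0.5, 0.2

# Equation 15: at theta = b the probability is (1 + c) / 2, not .5.
print(round(p3pl(b, a, b, c), 6))        # 0.6

# Equation 16: the slope of the ICC at theta = b is a * (1 - c) / 4,
# checked here with a central-difference numerical derivative.
h = 1e-6
slope = (p3pl(b + h, a, b, c) - p3pl(b - h, a, b, c)) / (2 * h)
print(round(slope, 4), round(a * (1 - c) / 4, 4))   # 0.24 0.24

# The lower limit of the ICC approaches c for very low abilities.
print(round(p3pl(-30.0, a, b, c), 6))    # 0.2
```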
2.5 The DIF Detection Method
Two frameworks, CTT and IRT, are most commonly used to detect DIF. The methods fall
into two classes: non-item response theory (non-IRT) based methods (or observed-score
methods), such as the Mantel-Haenszel (MH) procedure, standardization, SIBTEST, and logistic
regression (Dorans & Holland, 1993); and item response theory (IRT) based methods, such as
Lord’s chi-square test, area measures, and the likelihood function (Hambleton et al., 1991).
This study applies the IRT-based approach to detect
DIF and will present a comparison of three programs, IRTLRDIF 2.1, BILOG-MG 3, and
IRTPRO, using the Georgia High School Graduation Predictor Test data with the three IRT
models.
2.5.1 The Non-Item Response Theory (Non-IRT) Based Method
There are several non-IRT methods to detect DIF, including the Mantel-Haenszel (MH)
procedure, standardization, SIBTEST, and logistic regression.
2.5.1.1 Mantel-Haenszel Method
The Mantel-Haenszel method was proposed by Mantel and Haenszel (1959). This method
is attractive because it is easy to implement, has an associated test of significance, and can be
used with small sample sizes. Thus, this method is the most commonly used of the non-IRT
based methods; it is widely implemented through the two-by-two contingency table procedure
shown in Table 2 and has been the object of considerable evaluation since it was first
recommended by Holland and Thayer (Dorans & Holland, 1993).
Table 2
The 2-by-2 Contingency Table

                            Item Score
Group                  Right    Wrong    Total
Reference Group (R)    A_k      B_k      n_Rk
Focal Group (F)        C_k      D_k      n_Fk
Total Group (T)        m_1k     m_0k     T_k

Note: k = 1, 2, ..., j
There is a chi-square test associated with the MH approach, namely a test of the null
hypothesis:
H0: αMH = 1; H1: αMH ≠ 1, (17)
where αMH is the common odds ratio (Dorans & Holland, 1993). An estimate of the common
odds ratio, α̂MH, is given as:

α̂MH = (Σk AkDk/Tk) / (Σk BkCk/Tk). (18)
The MH chi-square statistic is given as:

MHχ² = [|Σk Ak − Σk E(Ak)| − .5]² / Σk Var(Ak), (19)

where E(Ak) = nRk m1k/Tk and Var(Ak) = nRk nFk m1k m0k/[Tk²(Tk − 1)]; the −.5 in the
expression for MHχ² serves as a continuity correction to improve the accuracy of the chi-square
percentage points as approximations to observed significance levels. The MH statistic
approximates the chi-square distribution with one degree of freedom when the null hypothesis is
true.
The estimate α̂MH measures the DIF effect size in a metric that ranges from 0 to ∞, with a
value of 1 indicating null DIF (Clauser & Mazor, 1998). However, this metric is difficult to
interpret, so it is transformed into:

MH D-DIF (ΔMH) = −2.35 ln(αMH). (20)
According to ΔMH, three categories were developed at ETS for use in test
development (Dorans & Holland, 1993):
(1) Negligible DIF (A): items are classified as A if MH D-DIF is not significantly different
from zero or if |ΔMH| < 1.
(2) Intermediate DIF (B): items in level B are those that meet neither of the other two sets of
criteria.
(3) Large DIF (C): items in level C are those for which |ΔMH| exceeds 1.5 and is significantly
greater than 1.
Although the MH procedure is widely used among non-IRT based methods, its limitation is that
it can detect only uniform DIF.
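The computations in Equations 18–20 can be sketched as follows; the counts are hypothetical and serve only to show the arithmetic:

```python
import math

# Hypothetical counts for one item across K = 3 score strata (invented numbers);
# each tuple is (A_k, B_k, C_k, D_k) = (ref right, ref wrong, focal right, focal wrong).
strata = [(40, 10, 30, 20), (60, 20, 45, 35), (80, 40, 50, 50)]

# Equation 18: common odds-ratio estimate.
num = sum(A * D / (A + B + C + D) for A, B, C, D in strata)
den = sum(B * C / (A + B + C + D) for A, B, C, D in strata)
alpha_mh = num / den

# Equation 19: MH chi-square with the .5 continuity correction.
sum_A = sum(A for A, B, C, D in strata)
sum_E = sum((A + B) * (A + C) / (A + B + C + D) for A, B, C, D in strata)
sum_V = sum((A + B) * (C + D) * (A + C) * (B + D)
            / ((A + B + C + D) ** 2 * (A + B + C + D - 1))
            for A, B, C, D in strata)
mh_chi2 = (abs(sum_A - sum_E) - 0.5) ** 2 / sum_V

# Equation 20: the ETS delta metric.
delta_mh = -2.35 * math.log(alpha_mh)
print(round(alpha_mh, 3), round(mh_chi2, 3), round(delta_mh, 3))
```

With these invented counts, α̂MH > 1 (the reference group is favored), so ΔMH is negative; because |ΔMH| exceeds 1.5, the item would fall in the ETS C category if the difference were also statistically significant.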
2.5.1.2. Standardization
The standardization approach was developed by Dorans and Kulick (1986) at the
Educational Testing Service (ETS) for use with the Scholastic Assessment Test (SAT). DIF
exists when the expected performance on an item, which can be operationalized by
nonparametric item-test regressions, differs between examinees of equal ability from different
groups. Dorans and Holland (1993) stated that one of the main purposes of the standardization
approach is to use all available appropriate data to estimate the
conditional item performance of each group at each level of the matching variable. The matching
does not require the use of stratified sampling procedures to produce equal numbers of
examinees at a given score level across group memberships. In addition, the standardization
approach makes it straightforward to obtain standardized response rates for distractors,
omissions, and not-reached items (Schmitt & Dorans, 1990).
of giving correct responses to an item is lower for examinees from one group than for examinees
of equal ability from another group, DIF is exhibited in this item. Therefore, DIF does not exist
in an item when, at every score level,

Pg (X = 1|S) − Pg′ (X = 1|S) = 0, (21)

where S refers to developed ability as measured by the total score on a test, X is an item score (X
= 1 for a correct answer and X = 0 for an incorrect answer), and Pg (X = 1|S) refers to the
probability that a candidate from subpopulation g with a total test score equal to S will
provide the correct answer.
The basic DIF measure in the standardization approach is the difference in observed
proportions correct on an item between the two groups at the kth level of the matching variable.
The measure is given as:

Dk = Pfk − Prk, (22)

where Pfk is the proportion correct on the studied item for the focal group and Prk for the
reference group at the kth level of the matching variable.
The standardized p-difference, DSTD, is one of the important DIF indices used in this
approach, and it can range from −1 to 1 (Dorans & Schmitt, 1991). DSTD is given as:

DSTD = Σs Ks(Pfs − Prs) / Σs Ks, (23)

where Ks/ΣKs represents the weighting factor at score level s supplied by the standardization
group to weight the differences in performance between Pfs and Prs.
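A minimal sketch of Equations 22 and 23; the proportions and weights below are invented:

```python
# Hypothetical proportions correct at four matching-score levels (invented numbers).
p_focal = [0.30, 0.45, 0.60, 0.80]
p_ref   = [0.35, 0.52, 0.65, 0.82]
weights = [50, 120, 100, 30]          # K_s, e.g., focal-group counts at each level

# Equation 22: the per-level difference D_k = P_fk - P_rk.
d_k = [pf - pr for pf, pr in zip(p_focal, p_ref)]

# Equation 23: the standardized p-difference, a weighted average of the D_k.
d_std = sum(w * d for w, d in zip(weights, d_k)) / sum(weights)
print(round(d_std, 4))   # -0.055 -> the item slightly disfavors the focal group
```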
2.5.1.3. SIBTEST
SIBTEST is a nonparametric procedure. It estimates the amount of DIF in an item and
statistically tests whether the amount is different from zero. In addition, it assesses differences in
item performance from two groups through their conditional ability levels. The main
characteristic of SIBTEST is that it employs a regression correction method to match examinees
from reference and focal groups at the same latent ability levels in order to compare their
performances on the studied items. This correction controls the inflation of Type I error that
would otherwise result from measurement error in the test and from differences in the ability
distributions across groups (Bolt, 2000).
SIBTEST requires two non-overlapping subsets of items in the test. One is the valid
subtest, whose items are assumed to measure the target ability. The other is the suspect subtest,
which contains the items to be tested for DIF. Scores on the valid subtest are used to match
examinees having the same ability levels across group memberships so as to test items from the
suspect subtest for DIF (Bolt, 2000).
2.5.1.4. Logistic Regression
The logistic regression procedure was proposed by Swaminathan and Rogers (1990). This
model can be used to detect DIF by specifying separate equations for the two groups of interest.
The equation is given by:
P(Uij = 1|θij) = e^(β0j + β1jθij) / [1 + e^(β0j + β1jθij)], (24)
where Uij is the response of person i in group j of an item, β0j is the intercept of group j, β1j is the
slope of group j, and θij is the ability of an examinee i in group j. If DIF does not exist, the
logistic regression curves for the two groups must be equal, that is, β01 is equal to β02, and β11 is
equal to β12. However, uniform DIF may be inferred if β01 is not equal to β02, and the curves are
parallel but not equivalent. In addition, the presence of non-uniform DIF may be inferred if β01 is
equal to β02, but β11 is not equal to β12, and the curves are not parallel (Swaminathan & Rogers,
1990).
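The distinction between uniform and non-uniform DIF in Equation 24 can be illustrated directly. The coefficients below are invented, and the curves are compared on the logit scale, where Equation 24 is linear in θ:

```python
import math

def p_correct(theta, b0, b1):
    """Equation 24: group-specific logistic regression of the item score on theta."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * theta)))

def logit(p):
    return math.log(p / (1.0 - p))

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: equal slopes but unequal intercepts -> a constant logit difference,
# i.e., parallel but non-equivalent curves.
diffs_uniform = [logit(p_correct(t, 0.5, 1.0)) - logit(p_correct(t, 0.0, 1.0))
                 for t in thetas]
print([round(d, 6) for d in diffs_uniform])      # [0.5, 0.5, 0.5, 0.5, 0.5]

# Non-uniform DIF: unequal slopes -> the logit difference changes with theta,
# so the curves are not parallel.
diffs_nonuniform = [logit(p_correct(t, 0.5, 1.0)) - logit(p_correct(t, 0.5, 1.5))
                    for t in thetas]
print([round(d, 6) for d in diffs_nonuniform])   # [1.0, 0.5, 0.0, -0.5, -1.0]
```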
2.5.2 The Item Response Theory (IRT) Based Method
IRT based methods include a comparison of item parameters, area measures, and
likelihood functions.
2.5.2.1. The Comparison of Item Parameters
This method, proposed by Lord (1980), performs a statistical test of the equality of item
parameters; it can investigate either the differences in the a, b, and c parameters simultaneously
or merely the differences in the a and b parameters (Lord, 1980). Lord proposed two tests for
evaluating the statistical significance of DIF.
2.5.2.1.1. The Test of b Difference
The test compares the difficulty parameters, b, for the focal and reference groups and
is defined as (Thissen et al., 1993):

di = (b̂Fi − b̂Ri) / √[Var(b̂Fi) + Var(b̂Ri)], (25)
where b̂Fi and b̂Ri are the maximum likelihood estimates of the item difficulty parameter for the
focal and reference groups, and Var(b̂Fi) and Var(b̂Ri) are the variances of the b estimates for the
focal and reference groups. The null hypothesis is H0: di = 0, and di follows the standard normal
distribution. If di is greater than 1.96 or smaller than −1.96 (two-tailed p ≤ .05), the null
hypothesis is rejected and DIF exists. Besides this test, Lord proposed a test of the joint
difference between ai and bi for two groups (Thissen et al., 1993), known as Lord’s chi-square.
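Equation 25 amounts to a simple z test; the estimates below are invented for illustration:

```python
import math

def b_difference(b_focal, var_focal, b_ref, var_ref):
    """Equation 25: Lord's d_i statistic for the b parameters of one item."""
    return (b_focal - b_ref) / math.sqrt(var_focal + var_ref)

# Hypothetical estimates: the focal group finds the item harder.
d_i = b_difference(b_focal=0.80, var_focal=0.02, b_ref=0.35, var_ref=0.01)
print(round(d_i, 3), abs(d_i) > 1.96)   # 2.598 True -> the item is flagged for DIF
```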
2.5.2.1.2. The Lord’s Chi-Square
Lord (1980) employed the chi-square method to test whether the item parameters of the
two groups (focal and reference) differ significantly. Lord’s chi-square, which examines the
hypothesis that each of the parameters of the item response function is consistent across groups
(Cohen & Kim, 1993), is the difference between the two vectors of item parameter estimates
weighted by the inverse of the variance-covariance matrix, that is, the Wald statistic. However,
the item parameter estimates should be placed onto the same scale before comparing the item
parameters estimated in two groups of examinees. The equation is defined as:

χ² = (b̂Fi − b̂Ri)′Σ⁻¹(b̂Fi − b̂Ri), (26)

where Σ is the estimate of the sampling variance-covariance matrix of the differences between
the item parameter estimates, and χ² has two degrees of freedom for large samples. Lord’s
chi-square has been shown to be efficient for the detection of DIF under several assumptions,
including asymptotic sample sizes, known θ, and maximum likelihood estimation (Kim et al.,
1995).
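A sketch of the Wald statistic in Equation 26 for a single item, with invented differences in the (a, b) estimates and an invented covariance matrix; the 2-by-2 inverse is written out by hand:

```python
# Hypothetical focal-minus-reference differences in the (a, b) estimates (invented),
# with Sigma, the 2x2 sampling covariance matrix of the differences.
d1 = 1.10 - 0.95        # difference in the a estimates
d2 = 0.60 - 0.30        # difference in the b estimates
s11, s12, s22 = 0.010, 0.002, 0.008

# Equation 26: Wald statistic d' Sigma^{-1} d, using the closed-form 2x2 inverse.
det = s11 * s22 - s12 * s12
chi2 = (d1 * d1 * s22 - 2 * d1 * d2 * s12 + d2 * d2 * s11) / det
print(round(chi2, 2), chi2 > 5.99)   # 11.84 True (chi-square(2) .05 critical value)
```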
25
2.5.2.2. Area Measure
Before computing the area between two item characteristic curves (ICCs), it is necessary
to transform the estimates obtained from the reference and focal groups onto the same scale. If
no DIF is present, the area between the two ICCs of the same item should equal 0; if the area is
not equal to 0, then DIF exists (Rudner et al., 1980). Raju (1988) stated that “the area between two ICCs is
only estimated either by integrating the appropriate function between two finite points or by
adding successive rectangles of width 0.005 between two finite points” (p. 495). In addition, he
proposed the signed and unsigned area formulas for calculating the exact area between two ICCs
for the 1PL, 2PL, and 3PL models. The signed area (SA) refers to the difference between the two
curves, and it is defined as:

Signed Area (SA) = ∫_{−∞}^{∞} (F1 − F2) dθ. (27)

The unsigned area (UA) refers to the distance between them, and it is given as:

Unsigned Area (UA) = ∫_{−∞}^{∞} |F1 − F2| dθ. (28)
For the 3PL, if F1 and F2 stand for two ICCs with the stipulations a1 ≠ a2 and c = c1 = c2, then:

SA = (1 − c)(b2 − b1), (29)

UA = (1 − c)[(2(a2 − a1)/(Da1a2)) ln(1 + e^(Da1a2(b2−b1)/(a2−a1))) − (b2 − b1)]. (30)
The area between two ICCs is finite when the lower asymptotes, c, are equal. On the
other hand, when the c parameters are unequal, the area between two ICCs is infinite, and this
will yield misleading results. In other words, if the area measure needs to be meaningful and
valid, the area between two ICCs must be finite, and its estimate must be fairly accurate (Raju,
1988).
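Raju's closed form for the signed area (Equation 29) can be checked against a brute-force numerical integral of Equation 27. The item parameters below are invented, with unequal a parameters and a common c:

```python
import math

D = 1.702   # scaling constant

def icc(theta, a, b, c):
    """3PL ICC with the scaling constant D."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

a1, b1 = 0.8, -0.2
a2, b2 = 1.3, 0.5
c = 0.2

# Crude Riemann-sum approximation of Equation 27 over a wide theta range;
# the tails beyond +/-40 are negligible for these parameters.
step = 0.01
sa_numeric = sum(icc(-40 + i * step, a1, b1, c) - icc(-40 + i * step, a2, b2, c)
                 for i in range(8000)) * step

sa_closed = (1 - c) * (b2 - b1)   # Equation 29: depends only on c and b2 - b1
print(round(sa_numeric, 3), round(sa_closed, 3))   # 0.56 0.56
```

Note that the signed area does not involve the a parameters at all, which is why unequal discriminations still yield a finite, simple SA as long as the c parameters are equal.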
26
2.5.2.3. The Likelihood Function
The likelihood function approach uses the likelihood ratio (LR) test, which was proposed by
Thissen, Steinberg, and Gerrard (1986) and Thissen, Steinberg, and Wainer (1993), to evaluate
the differences between item responses from two groups (Cohen et al., 1996). In this approach,
the null hypothesis that the item parameters of the two groups are equal is tested.
Moreover, it can test both uniform and non-uniform DIF. The uniform DIF analyzes the
difference in the item difficulty parameters between a reference and focal group. By contrast,
non-uniform DIF examines the difference in the item discrimination parameters (Cohen et al.,
1996).
The LR procedure involves a compact model (C) and an augmented model (A). Thissen
et al. (1993) stated that the compact model is the item response to be tested, and the anchor items
across two groups are constrained to be equal. Cohen et al. (1996) stated that “in the augmented
model, item parameters for all items except the studied item(s), which are referred to as the
common or anchor set, were constrained to be equal in both the reference and focal groups”
(p. 19). Because the augmented model includes all parameters of the compact model and additional
parameters, the compact model is hierarchically nested within the augmented model (Cohen et
al., 1996). The LR is the difference between the values of -2log likelihood for the compact model
(LC) and for the augmented model (LA) (Cohen et al., 1996). LR is defined as:
G²(d.f.) = −2logLC − (−2logLA), (31)

where L is the likelihood of the data given the maximum likelihood estimates of the model
parameters, d.f. is the difference between the numbers of parameters in the augmented and
compact models, and G²(d.f.) is distributed as χ²(d.f.) under the null hypothesis. Therefore, if the
value of G²(d.f.) is large, the null hypothesis will be rejected (Thissen et al., 1993). In other
words, if the test result is statistically significant, DIF exists in the studied item.
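Equation 31 reduces to simple arithmetic once the two models have been fitted; the −2 log-likelihood values below are invented:

```python
# Hypothetical -2 log-likelihood values (invented) for one studied item.
neg2loglik_compact   = 41250.8   # all item parameters constrained equal across groups
neg2loglik_augmented = 41238.3   # studied item's parameters free in each group

g2 = neg2loglik_compact - neg2loglik_augmented   # Equation 31
df = 3                                           # e.g., a, b, and c freed

# The chi-square critical value for df = 3 at alpha = .05 is 7.815.
print(round(g2, 1), g2 > 7.815)   # 12.5 True -> DIF exists in the studied item
```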
2.6 Current Research
The aim of this study is to employ the IRT framework to detect DIF across ethnicity/race
using three computer programs with three popular dichotomous models. Several studies, such as
Kim et al. (1995) and Raju and Drasgow (1993), adopted BILOG-MG 3 to detect DIF. In
addition, many studies, such as Woods (2009), employed IRTLRDIF in detecting DIF. To my
knowledge, few studies employ IRTPRO in detecting DIF because it is a new computer program.
The first hypothesis of this study is to compare the difference of testing results using IRTPRO,
BILOG-MG 3, and IRTLRDIF. This study expects that the three programs will exhibit consistent
results. The second hypothesis is to examine IRTPRO to determine its effectiveness in detecting
DIF. The present study expects that IRTPRO is effective in detecting DIF if it exhibits consistent
results with BILOG-MG 3 and IRTLRDIF. Hypothesis three examines model goodness of fit
in detecting DIF in the Georgia High School Graduation Predictor Test (GHSGPT) with three
models: 1PL, 2PL, and 3PL. The paper argues that the 3PL is the best-fitting model because it
was designed specifically for multiple-choice cognitive items, so in discussing this model it is
appropriate to refer to the latent trait as the ability common to the m items in the test (McDonald,
1999). The fourth hypothesis examines the differences between ethnicity groups taking the
GHSGPT in Social Studies. Because of differences in culture, socioeconomic status (SES),
and neighborhood characteristics, this study argues that Whites will perform better than other
races. Hypothesis five investigates whether DIF exists in the GHSGPT between ethnicity groups’
item responses. The current research anticipates that DIF will exist in several items.
CHAPTER 3
METHOD
3.1 Research Structure
This study utilizes the fall 2010 empirical data of the GHSGPT, which measures high
school achievement in the fields of Social Studies and Science, from the Georgia Center for
Assessment. It detects DIF across races using the three programs, IRTPRO, BILOG-MG 3, and
IRTLRDIF and compares whether these three programs are consistent and, thus, appropriate to
investigate DIF. Figure 4 shows the research structure.
Figure 4. The research structure: empirical data (the GHSGPT, 79 dichotomously scored items)
→ examining DIF across race/ethnicity with IRTPRO, BILOG-MG, and IRTLRDIF under the
1PL, 2PL, and 3PL models → analysis of DIF by race/ethnicity → comparison of DIF results
across ethnicities when using the three programs.
3.2 Instrumentation
An empirical comparison of the three programs is presented using the fall 2010 data of
the GHSGPT. Although GHSGPT measures high school achievement in the fields of Social
Studies and Science, this study detects DIF for different ethnicities only in Social Studies, which
consists of 79 dichotomously scored items. Note that the test originally contained 80 items;
however, Item 26 was considered a problematic item because its biserial correlation was -.052,
so Item 26 was removed, and the remaining subsequent items were renumbered to maintain
consecutive numbering. The GHSGPT contains multiple-choice questions, and each multiple-
choice item has four response options. This test is a standardized test, and it follows the blueprint
of the Georgia High School Graduation Tests (GHSGT), including the same strands and
objectives. There are six strands for Social Studies that include World Studies (18-20%), U.S.
History to 1865 (18-20%), U.S. History since 1865 (18-20%), Citizenship/Government (12-
14%), Map and Globe Skills (15%), and Information Processing Skills (15%). Because both
GHSGPT and GHSGT are built on the same content, the GHSGPT is able to predict 11th grade
students’ future performance on the GHSGT (Georgia Department of Education, 2010).
3.3 Sample
The data for the 11th grade GHSGPT in Social Studies consists of 2,654 respondents after
deleting the non-response data. Respondents were 11th grade students attending 18 different high
schools from 17 different counties in Georgia. Table 3 shows the DIF detection for ethnicity.
Whites are treated as the reference group, and Blacks, Hispanics, and a Multi-Racial group are
treated as the focal groups.
Table 3
The DIF Detection for Ethnicity

Race            Sample Size
Whites          1,536
Blacks          872
Hispanics       114
Multi-Racial    132
Total           2,654
3.4 Computer Programs
Three computer programs, IRTLRDIF, BILOG-MG 3, and IRTPRO, are used in this study.
3.4.1 IRTLRDIF
IRTLRDIF refers to likelihood-ratio testing for differential item functioning, and the
program is based on IRT (Woods, 2009). It was developed to implement a version of IRT-LR
DIF analysis for large-scale testing applications (Thissen, 2001). In previous studies, IRT-LR
DIF detection has been used in disparate research contexts. For example, Wainer et al. (1991)
used this procedure to study the testlets for DIF. In addition, Wang et al. (1995) used it to
investigate the consequences of item choice in an experimental section. Furthermore, Steinberg
(1994) has used this procedure to effectively answer questions about item serial-position and
context effects with experimental data. These studies showed that “IRT-LR DIF analysis tests
precisely specified and straightforwardly interpretable hypotheses about the parameters of item
response models” (Thissen, 2001, p. 3). IRTLRDIF employs the likelihood ratio test and
implements the methods of marginal maximum likelihood (MML) for item parameter estimation.
3.4.2 BILOG-MG 3
BILOG-MG 3 is an extension of the BILOG 3 program. Zimowski et al. (2003) stated
that it is designed for the effective analysis of binary items, it is capable of large-scale
production applications without limited numbers of items or respondents, and it can perform item
analysis and the scoring of any number of subtests or subscales. In addition, it can analyze DIF
and DRIFT (Item Parameter Drift) associated with multiple groups, and it can perform the
equating of test scores. The response models include the one-, two-, and three-parameter models
(Zimowski et al., 2003). BILOG-MG 3 applies likelihood ratio chi-square and executes the
method of marginal maximum likelihood estimation (MMLE) for item parameter estimation.
3.4.3 IRTPRO
IRTPRO (Item Response Theory for Patient-Reported Outcomes) is a new IRT program
for item calibration and test scoring (Cai et al., 2011). Item calibration and scoring are
implemented for unidimensional IRT models, such as models for multiple-choice or
short-answer items scored correctly or incorrectly, and for multidimensional models in
confirmatory (CFA) or exploratory (EFA) factor-analytic form. In addition, it is
capable of calibrating large-scale production applications with unrestricted numbers of items or
respondents. The response functions of IRTPRO include 1PL, 2PL, 3PL, graded, generalized
partial credit, and nominal response models. “These item response models may be mixed in any
combination within a test or scale and may have user-specified equality constraints among
parameters, or fixed values for parameters” (Cai et al., 2011, p. 4). IRTPRO applies the Wald
test, an approach proposed by Lord (1980). It implements the methods of marginal
maximum likelihood (MML) and maximum likelihood estimation (MLE) for item parameter
estimation. However, if prior distributions are specified for the item parameters, IRTPRO
calculates Maximum a posteriori (MAP) estimates (Cai et al., 2011).
CHAPTER 4
RESULTS
4.1 Item Analysis
To analyze items of the GHSGPT and to search for problematic items, the item parameters
are estimated by marginal maximum likelihood using BILOG-MG 3. The original data set
consists of 80 items from the GHSGPT, which was administered to 2,654 11th grade high school
students from different counties and high schools. All Pearson and biserial correlations
were positive except those for Item 26, which were −.40 and −.053, respectively; some items fell
below .30. Hence, Item 26 was considered a problematic item and was omitted from calibration,
and the remaining subsequent items were renumbered to maintain consecutive numbering. Thus,
79 items in total were used in this study. Table 4 presents the summary statistics for each
ethnicity/race.
Table 4
Raw Score Summary Statistics for the GHSGPT

                                      Races
Statistics            Whites    Blacks    Hispanics    Multi-Racial
Number of Items       79        79        79           79
Mean                  43.24     36.63     40.46        42.67
Standard Deviation    12.59     11.46     10.73        11.752
Coefficient Alpha     .902      .877      .859         .884
4.1.1 Classical Test Theory
Table 5 presents the classical item statistics on the 79 items of the GHSGPT for the
multiple groups. It displays the number of correct responses (item right), the discrimination
index, the difficulty (p-value), which is the rate of correct responses, and the Pearson and
biserial correlations.
First, this study analyzes the probability of giving correct answers for multiple groups
using SPSS (Statistical Package for the Social Sciences). When an item is dichotomously scored,
the mean item score corresponds to the proportion of examinees who answer the item correctly.
This proportion for item i is denoted as pi and is called the item difficulty or p-value (Crocker &
Algina, 2008). The equation for the p-value is defined as:
pi = (the number of examinees getting the item right) / (the total number of examinees). (32)
The value of pi may range from .00 to 1.00. The p-value expresses the proportion of examinees
that answered an item correctly. For example, the p-value of Item 1 is .492, which means that
only 49.2% of examinees’ responses to Item 1 are correct as shown in Table 5. Items with
difficulties near zero are difficult; however, items with difficulties near one are easy. In order to
avoid very difficult and very easy items, the ranges of difficulties that are acceptable are .3 to .7
(Allen & Yen, 2008). The observed p-values range from .23 to .92. Items 52, 59, 74, 77, and 79
are considered difficult because their p-values are lower than .3, and Item 52 is the hardest item
(p = .231). In addition, Items 17, 41, 53, 61, 62, 63, 65, 66, 67, 70, and 73 are considered easy
because their p-values are higher than .7, and Item 67 is the easiest item (p = .922). There are 62
items (78%) between .3 and .7, the mean of the correct-response rates is .518, and the degree of
difficulty is moderate to easy.
Second, the item discrimination provides an index of how well an item differentiates
between people who do well on the test and those who do not. The discrimination index can
range between −1.00 and +1.00, and acceptable values range from .30 to .70 (Allen & Yen,
2002). The item-discrimination index for item i, di, is defined as (Allen & Yen, 2002):

di = Ui/niU − Li/niL, (33)
where Ui is the number of examinees in the upper range of total test scores who answer item i
correctly, Li is the number of examinees in the lower range of total test scores who answer item i
correctly, niU is the number of examinees with total test scores in the upper range, and niL is the
number of examinees with total test scores in the lower range. Table 5 shows that 43 items fall
below the criterion of .3, which means that these items tend to have low discrimination, and
Item 11 (.004) has the lowest discrimination. The item discriminations range from .004 to .427,
with an average of .262. The Pearson correlations (i.e., point-biserial) range from .025 to .456,
with an average of .304. The biserial correlations range from .034 to .644, with an average of
.399. The reliability is .899.
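Equations 32 and 33 can be sketched on a tiny invented response matrix (the actual analysis used 2,654 examinees and 79 items); the upper/lower split below is a simple median split rather than the 27% rule sometimes used:

```python
# A tiny invented 0/1 response matrix: 6 examinees by 3 items.
responses = [
    [1, 1, 1],   # total score 3
    [1, 1, 0],   # total score 2
    [1, 0, 1],   # total score 2
    [0, 1, 0],   # total score 1
    [1, 0, 0],   # total score 1
    [0, 0, 0],   # total score 0
]
n_items = 3
n = len(responses)
totals = [sum(row) for row in responses]

# Equation 32: p-value = proportion of examinees answering the item correctly.
p = [sum(row[i] for row in responses) / n for i in range(n_items)]
print([round(x, 3) for x in p])   # [0.667, 0.5, 0.333]

# Equation 33: discrimination index d_i, using upper/lower halves by total score.
order = sorted(range(n), key=lambda e: totals[e], reverse=True)
upper, lower = order[:n // 2], order[n // 2:]
d = [sum(responses[e][i] for e in upper) / len(upper)
     - sum(responses[e][i] for e in lower) / len(lower)
     for i in range(n_items)]
print([round(x, 3) for x in d])   # [0.667, 0.333, 0.667]
```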
36
Table 5
ItemItem Right Discrimination
Difficulty (p -value)
Pearson Correlation
Biserial Correlation
1 1306 .133 .492 .142 .1782 1762 .268 .664 .293 .3793 1264 .390 .476 .408 .5124 1190 .324 .448 .344 .4335 906 .173 .341 .187 .2426 1413 .231 .532 .256 .3227 915 .175 .345 .199 .2578 1408 .187 .531 .180 .2269 1510 .367 .569 .379 .478
Table 5 (continued)
Item Statistics Based on Classical Test Theory

Item  Right  Discrimination  Difficulty (p-value)  Pearson Correlation  Biserial Correlation
10  1070  .315  .403  .344  .436
11  1221  .004  .460  .041  .052
12  1260  .335  .475  .371  .465
13  1279  .377  .482  .393  .493
14  1587  .347  .598  .380  .482
15  1912  .362  .720  .420  .560
16  1289  .213  .486  .232  .290
17  2153  .291  .811  .429  .621
18  1447  .388  .545  .417  .524
19  1538  .311  .580  .343  .432
20  1030  .356  .388  .398  .507
21  1392  .201  .524  .222  .279
22  1099  .134  .414  .160  .202
23  1659  .427  .625  .456  .582
24  1254  .281  .472  .300  .377
25  1738  .389  .655  .437  .563
26  812  .153  .306  .139  .182
27  1387  .191  .523  .219  .274
28  827  .144  .312  .175  .229
29  1061  .096  .400  .109  .139
30  1281  .397  .483  .413  .517
31  1663  .403  .627  .425  .543
32  1215  .284  .458  .297  .373
33  1395  .359  .526  .368  .461
34  1140  .174  .430  .187  .236
35  878  .159  .331  .187  .242
36  1166  .295  .439  .352  .443
37  805  .208  .303  .227  .299
38  1175  .341  .443  .365  .459
39  1159  .323  .437  .359  .452
40  1341  .414  .505  .436  .547
41  1971  .310  .743  .398  .539
42  1110  .148  .418  .167  .211
43  1578  .385  .595  .438  .555
44  1160  .170  .437  .179  .226
45  1639  .323  .618  .364  .464
46  1226  .389  .462  .427  .536
47  1360  .344  .512  .350  .438
48  1150  .348  .433  .392  .494
49  1354  .357  .510  .378  .474
50  1026  .291  .387  .330  .420
51  985  .260  .371  .305  .389
52  613  .065  .231  .064  .088
53  2011  .344  .758  .436  .598
54  1246  .330  .469  .338  .424
55  1586  .379  .598  .389  .493
56  1335  .335  .503  .361  .452
57  1008  .330  .380  .348  .444
58  1491  .308  .562  .329  .414
59  667  .065  .251  .084  .114
60  1608  .352  .606  .398  .506
61  1931  .338  .728  .411  .551
62  2119  .245  .798  .355  .506
63  2357  .190  .888  .379  .628
64  1306  .269  .492  .293  .367
65  2381  .166  .897  .362  .613
66  2264  .212  .853  .380  .586
67  2447  .137  .922  .351  .644
68  975  .241  .367  .251  .322
69  1376  .199  .518  .211  .265
70  2329  .158  .878  .303  .489
71  1756  .292  .662  .350  .453
72  1623  .344  .612  .376  .479
73  2335  .179  .880  .363  .590
74  659  .023  .248  .025  .034
75  907  .080  .342  .091  .117
76  1033  .206  .389  .224  .285
77  705  .109  .266  .139  .188
78  1303  .367  .491  .399  .500
79  766  .217  .289  .252  .335
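The classical statistics reported in Table 5 can be computed directly from the 0/1 response matrix. The sketch below (Python, with a tiny hypothetical data set rather than the GHSGPT responses) shows the item difficulty p-value and the Pearson item-total (point-biserial) correlation; the biserial correlation would further rescale this value using the normal ordinate.

```python
import math

def ctt_item_stats(responses, item):
    """Classical item statistics for one dichotomous item.

    responses: list of lists of 0/1 scores (examinees x items).
    Returns (n_right, p_value, point_biserial).
    """
    n = len(responses)
    scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    n_right = sum(scores)
    p = n_right / n  # item difficulty: proportion answering correctly

    # Pearson correlation between item score and total score
    mx, my = sum(scores) / n, sum(totals) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(scores, totals)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in scores) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in totals) / n)
    r = cov / (sx * sy)
    return n_right, p, r

# Tiny illustrative data set (4 examinees, 3 items), not the GHSGPT data
data = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]]
n_right, p, r = ctt_item_stats(data, 0)
```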
4.1.2 Item Response Theory
This study employs BILOG-MG 3 to compute p-values with the 1PL, 2PL, and 3PL models. The total sample
contains 2,654 examinees (Whites = 1,536, Blacks = 872, Hispanics = 114, and Multi-Racial = 132), and α = .05:
a p-value less than .05 is statistically significant. Table 6 shows that with 1PL the item
difficulty ranges from -3.679 to 1.832, the reliability is .898 (Zimowski et al., 2003), the
root mean square (RMS) is .3261, the mean item difficulty is -.173, and six items' p-values
(8%) are greater than .05. For 2PL, item discrimination ranges from .102 to 1.767 and item
difficulty from -1.819 to 6.458. The means of item discrimination and item difficulty are
.519 and .266, respectively, the reliability is .916, the RMS is .293, and 38 items (53%)
determine the goodness-of-fit index. For 3PL, the item discrimination parameter ranges
from .390 to 2.347, the item difficulty parameter from -1.788 to 3.213, and the pseudo-
guessing parameter from .054 to .435. The means of the item discrimination, item difficulty, and
pseudo-guessing parameters are .968, .650, and .224, respectively. The reliability of 3PL
is .923, the RMS is .2844, and 60 items (76%) determine the goodness-of-fit index.
Table 6
Item b p -value a b p -value a b c p -value1 0.042 .00 * 0.194 0.096 .79 0.651 1.854 0.396 .052 -1.049 .49 0.426 -1.061 .01 ** 0.809 0.332 0.413 .00 *3 0.140 .00 * 0.591 0.094 .00 * 1.308 0.669 0.241 .224 0.313 .00 * 0.477 0.278 .00 * 1.046 0.858 0.239 .065 1.004 .00 * 0.255 1.584 .18 0.753 1.822 0.236 .926 -0.206 .03 ** 0.347 -0.247 .62 0.629 0.837 0.301 .127 0.981 .00 * 0.270 1.464 .68 0.608 1.817 0.208 .828 -0.194 .00 * 0.240 -0.317 .96 0.476 1.268 0.337 .769 -0.433 .00 * 0.540 -0.376 .09 0.757 0.218 0.209 .00 *
10 0.597 .00 * 0.485 0.534 .13 1.114 0.980 0.215 .9111 0.240 .00 * 0.106 0.894 .00 * 1.872 1.902 0.435 .00 *12 0.149 .01 ** 0.523 0.116 .47 0.788 0.584 0.174 .3613 0.105 .00 * 0.571 0.070 .63 1.004 0.608 0.212 .4214 -0.616 .00 * 0.565 -0.514 .76 0.858 0.216 0.264 .8515 -1.451 .00 * 0.778 -0.971 .01 ** 1.149 -0.239 0.326 .03 **16 0.082 .04 ** 0.306 0.110 .60 0.426 0.839 0.185 .8717 -2.214 .00 * 1.012 -1.264 .07 1.098 -0.834 0.285 .0818 -0.285 .00 * 0.625 -0.235 .06 1.283 0.485 0.283 .9119 -0.499 .05 0.490 -0.460 .03 ** 0.878 0.438 0.304 .6420 0.694 .00 * 0.576 0.543 .00 * 1.547 0.891 0.199 .0821 -0.157 .00 * 0.299 -0.212 .01 ** 0.390 0.553 0.181 .2722 0.528 .00 * 0.223 0.944 .17 0.559 1.869 0.284 .1123 -0.791 .00 * 0.753 -0.554 .00 * 1.040 -0.033 0.216 .00 *24 0.163 .70 0.407 0.165 .91 0.682 0.830 0.221 .4925 -0.988 .00 * 0.725 -0.699 .02 ** 0.952 -0.149 0.230 .01 **26 1.251 .00 * 0.197 2.512 .07 0.724 2.253 0.233 .01 **27 -0.146 .00 * 0.293 -0.199 .09 0.986 1.174 0.395 .8328 1.211 .00 * 0.251 1.935 .42 0.991 1.843 0.236 .5229 0.619 .00 * 0.163 1.495 .27 0.769 2.219 0.339 .8930 0.100 .00 * 0.597 0.062 .06 1.074 0.586 0.211 .4231 -0.801 .00 * 0.678 -0.593 .18 1.080 0.116 0.282 .1932 0.254 .01 ** 0.405 0.261 .00 * 1.000 0.977 0.276 .9633 -0.164 .00 * 0.524 -0.153 .02 ** 0.939 0.532 0.250 .0334 0.430 .00 * 0.253 0.685 .34 0.698 1.581 0.300 .4735 1.076 .00 * 0.267 1.626 .00 * 1.223 1.648 0.255 .2236 0.369 .00 * 0.492 0.322 .00 * 1.376 0.935 0.269 .6837 1.270 .03 ** 0.319 1.633 .02 ** 0.820 1.659 0.183 .0138 0.348 .04 ** 0.513 0.292 .96 0.846 0.746 0.180 .4139 0.385 .00 * 0.507 0.328 .02 ** 1.068 0.846 0.222 .1340 -0.039 .00 * 0.653 -0.048 .01 ** 0.754 0.172 0.075 .36
Item Statistics Based on Item Response Theory
1PL  2PL  3PL
Note. * p < .001, ** p < .05
Item b p -value a b p -value a b c p -value41 -1.621 .00 * 0.763 -1.091 .23 0.810 -0.733 0.188 .4842 0.502 .00 * 0.227 0.883 .94 0.452 1.836 0.252 .9943 -0.595 .00 * 0.699 -0.440 .79 0.993 0.107 0.218 .4644 0.383 .00 * 0.244 0.631 .08 0.891 1.578 0.338 .7345 -0.742 .00 * 0.552 -0.625 .02 ** 0.679 -0.109 0.185 .01 **46 0.228 .00 * 0.653 0.148 .00 * 2.033 0.697 0.252 .8147 -0.083 .02 ** 0.492 -0.085 .25 0.807 0.545 0.223 .3048 0.407 .00 * 0.579 0.309 .00 * 1.574 0.820 0.239 .2149 -0.069 .00 * 0.554 -0.071 .18 0.877 0.482 0.205 .2250 0.704 .00 * 0.463 0.655 .00 * 1.678 1.033 0.240 .0951 0.805 .35 0.419 0.816 .60 0.679 1.136 0.147 .7352 1.832 .00 * 0.133 5.383 .75 0.767 3.213 0.208 .6253 -1.742 .00 * 0.913 -1.062 .00 * 0.992 -0.738 0.183 .00 *54 0.182 .20 0.476 0.158 .87 0.826 0.734 0.213 .4255 -0.614 .00 * 0.592 -0.498 .00 * 0.705 -0.119 0.137 .00 *56 -0.025 .00 * 0.502 -0.033 .01 ** 0.791 0.549 0.208 .1657 0.748 .00 * 0.493 0.662 .00 * 1.014 0.996 0.179 .00 *58 -0.388 .20 0.474 -0.369 .88 0.772 0.438 0.268 .6459 1.665 .00 * 0.161 4.046 .08 2.347 2.013 0.232 .2560 -0.667 .00 * 0.637 -0.516 .00 * 0.687 -0.321 0.067 .00 *61 -1.505 .00 * 0.773 -1.009 .02 ** 0.817 -0.712 0.151 .1362 -2.093 .00 * 0.776 -1.377 .01 ** 0.754 -1.227 0.122 .5263 -3.108 .00 * 1.406 -1.497 .00 * 1.259 -1.550 0.085 .0164 0.042 .00 * 0.394 0.041 .00 * 1.142 0.966 0.330 .8765 -3.244 .00 * 1.375 -1.566 .00 * 1.212 -1.643 0.089 .0566 -2.655 .00 * 1.125 -1.422 .00 * 1.033 -1.450 0.054 .00 *67 -3.679 .00 * 1.767 -1.605 .00 * 1.581 -1.739 0.075 .00 *68 0.830 .00 * 0.335 1.021 .00 * 0.656 1.403 0.188 .01 **69 -0.120 .00 * 0.290 -0.165 .06 0.445 0.906 0.254 .3370 -2.961 .00 * 0.842 -1.819 .00 * 0.791 -1.788 0.122 .00 *71 -1.034 .00 * 0.563 -0.852 .00 * 0.596 -0.580 0.106 .1072 -0.703 .00 * 0.580 -0.575 .90 0.833 0.108 0.250 .9673 -2.991 .00 * 1.161 -1.560 .00 * 1.043 -1.630 0.065 .00 *74 1.689 .00 * 0.102 6.458 .00 * 1.954 2.247 0.235 .01 **75 1.001 .00 * 0.155 2.528 .33 0.609 2.664 0.279 .3376 0.687 .01 
** 0.304 0.922 .71 0.728 1.488 0.244 .9777 1.552 .00 * 0.221 2.791 .00 * 1.736 1.745 0.218 .0378 0.049 .00 * 0.584 0.023 .38 0.948 0.532 0.197 .2179 1.377 .02 ** 0.357 1.605 .02 ** 0.916 1.582 0.169 .10
Note. * p < .001, ** p < .05
1PL  2PL  3PL
Item Statistics Based on Item Response Theory
Table 6 (continued)
4.2 Racial Differential Item Functioning (DIF) Analysis
DIF analyses were conducted to determine whether items advantage or disadvantage examinees
across ethnic/racial groups. Whites were identified as the reference group, and Blacks, Hispanics,
and the Multi-Racial group were regarded as the focal groups.
Thissen (2001), in the manual for IRTLRDIF, noted that “IRTLRDIF has implemented
two of the most commonly-used IRT models: the three-parameter logistic (3PL) model and
Samejima’s graded model. Both of those models include the two-parameter logistic (2PL) model
as a special case” (p. 5). Thus, this study adopts BILOG-MG 3 and IRTPRO to examine the 79
items in Social Studies with 1PL and employs IRTPRO, BILOG-MG 3, and IRTLRDIF with
2PL and 3PL. If both BILOG-MG 3 and IRTPRO identity an item as a DIF item, then it was
considered a DIF item with 1PL. In addition, when three programs identically detect DIF
phenomenon for 2PL and 3PL, those items are included as DIF. This study determines whether
any race is favored in each item based on the outcomes from IRTLRDIF, BILOG-MG 3, and
IRTPRO. Because the results of IRTPRO and IRTLRDIF are similar, this study will
simultaneously employ two programs, IRTPRO and BILOG-MG 3, to compare multiple groups,
White vs. Blacks, Hispanics, and the Multi-Racial group, to investigate which items exist in DIF
for specific ethnicities with 1PL, 2PL, and 3PL.
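The agreement rule just described amounts to a set intersection over the items each program flags: with 1PL an item counts as DIF only if BILOG-MG 3 and IRTPRO both flag it, and with 2PL/3PL all three programs must agree. A minimal sketch (Python; the item numbers are placeholders, not results from this study):

```python
def consensus_dif(*flag_sets):
    """Items flagged by every program (set intersection)."""
    result = set(flag_sets[0])
    for s in flag_sets[1:]:
        result &= set(s)
    return sorted(result)

# Placeholder flag sets for the three programs (not actual results)
bilog = {2, 13, 14, 44}
irtpro = {13, 14, 44, 51}
irtlrdif = {13, 44, 78}
flagged_2pl = consensus_dif(bilog, irtpro, irtlrdif)  # items all three agree on
```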
1PL, 2PL, and 3PL are the three main models used to estimate item parameters for the
dichotomous items (Hambleton et al., 1991). The 1PL assumes that all discriminations, a, are
equal, so it only considers item difficulty, b, while calibrating. The 2PL calibrates item difficulty
and discrimination, and the 3PL calibrates item difficulty, discrimination, and the lower
asymptote, c.
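The three models can be viewed as special cases of a single response function. The sketch below (Python, with illustrative parameter values only) uses the standard 3PL form with scaling constant D = 1.7; setting c = 0 reduces it to the 2PL, and additionally holding a common a gives the 1PL.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    With c = 0 this reduces to the 2PL; with c = 0 and a common,
    fixed a it is the 1PL. D = 1.7 is the usual scaling constant.
    """
    D = 1.7
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Illustrative values only (not estimates from the GHSGPT data)
p1 = p_correct(0.0, b=0.5)                # 1PL-style: a = 1, c = 0
p2 = p_correct(0.0, b=0.5, a=1.2)         # 2PL
p3 = p_correct(0.0, b=0.5, a=1.2, c=0.2)  # 3PL with lower asymptote .2
```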
This study adopts -2loglikelihood (-2logL) for each comparison group to determine goodness of fit.
The item fit statistics provided by both IRTPRO and BILOG-MG 3 indicated that the 3PL model
provided a good fit to the data, as shown in Tables 7 and 8.
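Because the three models are nested, the -2logL values in Tables 7 and 8 can be compared directly: the drop in -2logL between a reduced and a fuller model is a likelihood-ratio chi-square statistic. The sketch below (Python) applies this to the BILOG-MG 3 Whites vs. Blacks column of Table 7; the degrees of freedom for a formal test would equal the number of added parameters.

```python
def lr_statistic(neg2ll_reduced, neg2ll_full):
    """Likelihood-ratio chi-square for nested IRT models:
    the drop in -2loglikelihood from the reduced to the full model."""
    return neg2ll_reduced - neg2ll_full

# BILOG-MG 3, Whites vs. Blacks, from Table 7
g2_1pl_vs_2pl = lr_statistic(225310.41, 221722.24)  # 1PL vs. 2PL
g2_2pl_vs_3pl = lr_statistic(221722.24, 220417.99)  # 2PL vs. 3PL
# Both drops are large, consistent with the 3PL fitting best in Table 7.
```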
Table 7
The Summary of Goodness of Fit Using BILOG-MG 3

Model                    Whites vs. Blacks   Whites vs. Hispanics   Whites vs. Multi-Racial   Whites vs. All Races
1PL (-2loglikelihood)    225310.41           152592.65              154221.83                 248561.46
2PL (-2loglikelihood)    221722.24           150104.10              151702.80                 244621.80
3PL (-2loglikelihood)    220417.99           149214.33              149166.39                 243439.83

Table 8
The Summary of Goodness of Fit Using IRTPRO

Model                    Whites vs. Blacks   Whites vs. Hispanics   Whites vs. Multi-Racial   Whites vs. All Races
1PL (-2loglikelihood)    225352.18           152604.19              154235.26                 248614.63
2PL (-2loglikelihood)    221704.61           150057.77              151661.38                 244426.20
3PL (-2loglikelihood)    220702.07           149472.04              151046.77                 243451.64
4.2.1 Three Comparison Groups Using BILOG-MG 3 and IRTPRO with 1PL
To investigate DIF items, this study employs Lord's (1980) technique, which compares item
parameter estimates between two groups, dividing their difference by the standard error of the
difference while the ability parameters are treated as known. This is done using BILOG-MG 3 and IRTPRO. The item
parameters determined the ICC for an item, and “Lord (1980) noted that the question of DIF
detection could be approached by computing estimates of the item parameters within each
group” (Thissen et al., 1993, p. 68). The equation is defined as (Thissen et al., 1993):
Zi = Δb / SE(GF − GR),    (34)

where Δb is bF − bR; bF and bR are the item difficulty parameters for the focal group and the
reference group, respectively; SE(GF − GR) is the standard error of the difference between the
focal-group and reference-group estimates; and Zi approximately follows the standard normal
distribution. If the absolute value of Zi is greater than 1.96 (a two-tailed test, p ≤ .05),
DIF exists. Table 9 presents the
outcomes of the three comparison groups using two computer programs with 1PL. For the
Whites vs. Blacks, both computer programs, BILOG-MG 3 and IRTPRO, indicate that Items 1,
2, 7, 8, 11, 13, 14, 15, 17, 20, 22, 23, 25, 26, 27, 28, 29, 30, 31, 34, 44, 49, 52, 56, 57, 59, 60, 61,
62, 66, 69, 71, 72, 74, and 78 are DIF. Items 1, 7, 8, 11, 22, 26, 27, 28, 29, 34, 44, 52, 59, 69, and
74 advantaged Blacks, and Items 2, 13, 14, 15, 17, 20, 23, 25, 30, 31, 49, 56, 57, 60, 61, 62, 66,
71, 72, and 78 advantaged Whites. In addition, Items 2, 13, 19, 51, and 74 are DIF for the Whites vs.
Hispanics comparison; Items 2, 13, and 19 favor Whites, and Items 51 and 74 favor Hispanics.
Moreover, Items 8, 44, and 56 are DIF in Whites vs. the Multi-Racial group, and all these items
disadvantaged the Multi-Racial group as shown in Table 9. In sum, there are 35 DIF items in
Whites vs. Blacks, and only a few DIF items exist for Whites vs. Hispanics (five items) and
Whites vs. the Multi-Racial group (three items).
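Equation 34 is straightforward to apply once each group's difficulty estimates and standard errors are available. The sketch below (Python) uses hypothetical estimates; the pooled standard error sqrt(SE_F**2 + SE_R**2) is one common form for independent group calibrations and is an assumption of this sketch, not a detail taken from BILOG-MG 3 or IRTPRO output.

```python
import math

def lord_z(b_focal, b_ref, se_focal, se_ref):
    """Lord's (1980) standardized difference in item difficulty.

    The standard error of the difference is computed here as
    sqrt(se_focal**2 + se_ref**2), assuming the two group
    calibrations are independent (an assumption of this sketch).
    """
    se_diff = math.sqrt(se_focal ** 2 + se_ref ** 2)
    return (b_focal - b_ref) / se_diff

def flag_dif(z, critical=1.96):
    """Two-tailed test at p <= .05: |Z| > 1.96 flags DIF."""
    return abs(z) > critical

# Hypothetical estimates, not taken from the GHSGPT calibration
z = lord_z(b_focal=0.80, b_ref=0.30, se_focal=0.12, se_ref=0.10)
```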
Table 9
Item1 -5.216 * 24.1 * 1.057 .8 -.077 .02 2.926 * 10.7 * 2.919 * 8.7 * .137 .13 1.886 5.7 1.877 3.5 -2.385 5.54 .442 1.6 1.097 1.1 -1.979 3.45 -1.792 3.6 -.486 .3 -1.000 .96 -1.252 2.3 -.471 .3 .365 .27 -3.008 * 8.8 * -1.435 2.2 -.191 .08 -5.442 * 27.1 * -1.339 1.9 -2.724 * 6.4 *9 -1.331 2.6 -.865 1.0 -1.542 2.3
10 -.447 1.3 1.004 .9 -.242 .011 -7.861 * 51.3 * -1.931 3.6 -2.266 4.212 1.867 5.5 .737 .5 -1.074 .913 6.350 * 45.4 * 3.412 * 12.0 * 1.093 1.614 4.554 * 24.4 * -.148 .1 .938 1.115 4.106 * 21.0 * 1.723 2.9 .140 .116 .270 1.4 .672 .3 -.135 .017 2.820 * 10.7 * 1.698 3.0 -1.236 1.618 1.132 3.0 .942 .9 .031 .019 1.445 3.9 3.069 * 8.9 * .275 .220 3.445 * 15.0 * .763 .5 -.742 .521 -1.496 2.8 .288 .1 -.021 .022 -4.257 * 16.3 * .988 .8 .000 .023 2.968 * 12.1 * .618 .3 .601 .624 -1.641 3.3 -.583 .5 -1.029 .925 2.661 * 10.0 * 1.010 1.0 -.385 .126 -4.238 * 16.6 * -1.556 2.5 -.221 .027 -3.442 * 11.2 * .556 .3 .377 .328 -3.602 * 12.4 * -.623 .5 -.138 .029 -5.848 * 29.5 * -1.627 2.7 -1.738 2.530 2.256 * 7.4 * 1.737 2.8 1.241 2.031 2.000 * 6.2 * -.256 .1 1.532 3.032 -2.261 5.6 1.307 1.5 -2.194 4.233 .361 1.5 .476 .2 -2.030 4.134 -3.518 * 11.5 * -.469 .3 -.706 .435 -1.017 1.8 -2.348 5.5 .528 .436 -2.025 4.6 -.041 .0 -.929 .837 1.054 2.6 -.760 .8 -.057 .038 -1.893 4.3 .184 .0 -2.099 4.239 .529 1.7 .580 .3 .680 .740 -.623 1.4 .152 .0 -.176 .0
The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups with 1PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
Item41 1.963 5.9 -1.296 2.1 1.461 2.642 -2.307 5.3 -.734 .6 .332 .243 .837 2.3 -1.208 1.8 .340 .244 -8.263 * 62.8 * -1.724 3.2 -2.756 * 6.6 *45 -2.197 5.5 .296 .1 -1.007 .946 .260 1.4 -.086 .0 .321 .247 .559 1.7 .119 .0 -.170 .048 1.828 5.4 1.498 2.1 1.092 1.549 2.132 * 6.8 * .749 .5 .449 .350 -1.492 2.9 -.794 .8 -1.256 1.551 -2.314 5.7 -3.383 * 12.2 * 1.707 3.152 -4.123 * 15.3 * -.997 1.1 -.189 .053 .151 1.3 -.856 1.0 -.133 .054 -1.890 4.1 .040 .0 -1.775 3.155 -.189 1.2 -.774 .8 -.293 .156 4.770 * 26.9 * -.278 .1 2.510 * 6.9 *57 4.016 * 19.1 * 1.340 1.8 1.377 2.358 .778 2.1 .944 .8 .715 .859 -2.706 * 6.8 * .107 .0 1.528 2.560 2.179 * 7.1 * .036 .0 .532 .561 3.504 * 15.7 * 1.026 1.0 2.244 5.962 2.632 * 9.2 * .678 .5 1.811 3.963 .604 1.7 -.499 .4 1.076 1.664 .845 2.2 -1.153 1.6 -1.683 2.465 2.000 5.9 -.317 .1 .888 1.066 2.859 * 10.8 * .234 .0 .893 1.167 1.505 3.9 -1.657 3.1 1.102 1.668 1.270 3.2 .815 .6 .415 .369 -2.752 * 7.4 * 1.163 1.1 -.325 .170 1.814 5.0 .565 .3 1.019 1.371 2.886 * 10.7 * -.756 .7 1.244 2.072 3.446 * 14.7 * -.228 .1 .500 .473 .795 2.0 -1.058 1.3 .540 .474 -4.706 * 19.3 * -3.412 * 10.5 * -.316 .175 -2.259 5.0 1.216 1.2 -.004 .076 -.487 1.3 .698 .4 -.264 .077 -1.921 4.0 .503 .2 .154 .178 3.746 * 17.4 * .341 .1 .374 .379 -1.481 2.9 -1.695 3.3 .518 .4
Table 9 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for Three Comparison Groups with 1PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
4.2.2 Three Comparison Groups Using Three Computer Programs with 2PL
This study employs three computer programs, IRTPRO, BILOG-MG 3, and IRTLRDIF
with 2PL. IRTPRO and BILOG-MG 3 use the same methods as for 1PL to detect DIF under 2PL. For
IRTLRDIF, Thissen (2001) stated that “if the value of G2(d.f.) exceeds 3.84 at α = .05 critical
value of the chi-square distribution for one degree of freedom, df, fit additional models to
compute single d.f., likelihood ratio tests appropriate for the item response model” (p.8).
Table 10 displays the uniform and non-uniform DIF among three comparison groups with 2PL.
There are 39 items identified as statistically significant DIF items for Whites vs. Blacks,
including 15 uniform and 24 non-uniform DIF items. A total of 16 items are DIF items for
Whites vs. Hispanics, including two uniform and 14 non-uniform DIF items. For Whites vs. the
Multi-Racial group, 24 items were identified as statistically significant DIF items, including
five uniform and 19 non-uniform DIF items. Table 11 shows the outcome of the three
computer programs for the three comparison groups with 2PL. First, the three computer programs
show that Items 1, 2, 8, 11, 13, 14, 15, 36, 38, 44, 45, 56, 57, 68, 72, and 78 are DIF items for
Whites vs. Blacks; Items 2, 13, 14, 15, 56, 57, 68, 72, and 78 advantaged Whites, and Items
1, 8, 11, 36, 38, 44, and 45 favored Blacks. Second, three DIF items (Items 2, 13, and 51) exist for
Whites vs. Hispanics; Items 2 and 13 favored Whites, and Item 51 advantaged Hispanics.
Third, Items 3, 4, 8, and 44 are DIF items for Whites vs. the Multi-Racial group, and all of these
items disadvantaged Whites. Overall, several DIF items (16 items) exist for Whites vs. Blacks,
as with 1PL, and only a few DIF items exist for Whites vs. Hispanics (three items) and
Whites vs. the Multi-Racial group (four items).
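The IRTLRDIF decision rule quoted above (G2 compared with 3.84, the α = .05 critical value of chi-square with 1 df) can be combined with the follow-up tests to label items as uniform or non-uniform. The mapping used in the sketch below (Python) — rejection of a-equal implies non-uniform DIF, otherwise rejection of b-equal implies uniform DIF — is our reading of Table 10 stated as an assumption, not the program's documented rule.

```python
CRIT_1DF = 3.84  # chi-square critical value, df = 1, alpha = .05

def classify_dif(g2_all, g2_a, g2_b):
    """Classify an item from IRTLRDIF-style G2 statistics (2PL case).

    g2_all tests H0: all parameters equal; g2_a and g2_b are the
    single-df follow-up tests for a and b. The uniform/non-uniform
    mapping follows our reading of Table 10 (an assumption).
    """
    if g2_all <= CRIT_1DF:
        return None  # no overall DIF signal
    if g2_a > CRIT_1DF:
        return "non-uniform"  # discrimination differs
    if g2_b > CRIT_1DF:
        return "uniform"      # only difficulty differs
    return None

# Item 13, Whites vs. Blacks, from Table 10: G2(all)=32.0, G2(a)=0.4, G2(b)=31.6
label = classify_dif(32.0, 0.4, 31.6)
```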
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
1 6.1 0.0 6.1 Uniform 7.5 5.9 1.7Non-
uniform 0.7
2 12.1 2.7 9.5Non-
uniform 9.6 1.4 8.2Non-
uniform 0.03 0.7 2.6 8.2 0.7 7.4 Uniform
4 2.1 2.2 11.6 7.1 4.5Non-
uniform5 0.2 2.5 3.06 0.4 1.3 0.27 2.7 1.4 0.9
8 11.2 0.1 11.2 Uniform 2.4 7.5 1.2 6.2Non-
uniform
9 8.7 3.3 5.5Non-
uniform 5.0 3.5 1.5Non-
uniform 3.2
10 4.4 3.5 1.0Non-
uniform 1.1 3.111 10.3 0.5 9.8 Uniform 1.6 3.3
12 1.6 3.0 6.1 4.6 1.5Non-
uniform13 32.0 0.4 31.6 Uniform 11.1 0.1 11.1 Uniform 1.014 14.3 0.2 14.0 Uniform 0.5 1.6
15 9.1 3.5 5.6Non-
uniform 3.1 0.516 3.6 2.6 0.4
17 0.3 4.8 2.0 2.8Non-
uniform 5.3 2.4 2.9Non-
uniform
Table 10
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
18 0.6 1.3 0.1
19 1.0 9.1 0.9 8.3Non-
uniform 1.320 6.1 0.3 5.8 Uniform 0.9 1.221 2.4 2.1 0.222 3.3 3.5 0.3
23 3.4 0.2 0.524 2.1 1.9 1.8
25 4.7 3.8 0.9Non-
uniform 1.7 0.526 3.5 1.3 2.3
27 9.3 5.9 3.4Non-
uniform 2.0 0.2
28 4.1 1.4 2.7Non-
uniform 0.8 0.4
29 10.2 4.7 5.5Non-
uniform 1.7 2.1
30 4.0 3.2 0.7Non-
uniform 5.9 4.0 1.8Non-
uniform 1.431 0.2 0.5 2.3
32 13.5 8.6 4.9Non-
uniform 1.8 5.3 0.7 4.6Non-
uniform33 1.0 2.0 5.5 0.2 5.3 Uniform
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
34 2.1 0.4 0.7
35 5.8 5.4 0.4Non-
uniform 5.4 1.4 4.0Non-
uniform 3.4
36 10.3 2.6 7.7Non-
uniform 0.5 1.5
37 4.6 0.0 4.5 Uniform 6.2 5.8 0.4Non-
uniform 4.0 4.0 0.0Non-
uniform
38 9.4 1.6 7.8Non-
uniform 3.1 5.6 0.2 5.4 Uniform
39 0.1 1.6 4.7 4.4 0.4Non-
uniform40 7.2 0.1 7.1 Uniform 0.5 0.341 0.1 3.8 3.442 0.6 0.7 1.043 1.8 3.7 1.9
44 36.6 0.4 36.3 Uniform 2.4 7.4 1.1 6.3Non-
uniform
45 17.0 6.2 10.9Non-
uniform 3.0 2.0
46 2.9 2.2 7.5 7.5 0.0Non-
uniform
47 1.8 0.1 4.3 4.2 0.1Non-
uniform
48 8.7 8.6 0.2Non-
uniform 3.4 6.3 5.4 0.9Non-
uniform
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
49 1.0 1.4 6.1 6.0 0.1Non-
uniform
50 6.6 3.4 3.2Non-
uniform 1.6 3.8
51 12.7 6.6 6.0Non-
uniform 13.9 0.2 13.8 Uniform 7.0 4.4 2.6Non-
uniform52 0.7 0.3 0.0
53 8.6 1.2 7.4Non-
uniform 2.5 1.5
54 5.5 0.1 5.4 Uniform 0.6 4.7 0.6 4.0Non-
uniform
55 7.6 5.2 2.4Non-
uniform 1.3 0.6
56 22.0 3.0 19.0Non-
uniform 0.8 6.4 0.5 5.9 Uniform57 13.6 0.4 13.3 Uniform 2.2 2.158 1.8 0.8 2.1
59 3.2 2.4 4.0 1.2 2.8Non-
uniform
60 4.8 4.2 0.6Non-
uniform 0.1 0.361 3.5 0.9 5.2 0.1 5.1 Uniform62 1.1 2.2 3.5
63 5.1 0.2 5.0 Uniform 9.7 9.4 0.3Non-
uniform 3.6
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal  (G2 statistics, repeated for each comparison group)
64 2.3 2.4 4.3 1.5 2.9Non-
uniform
65 0.8 1.0 4.1 3.1 1.0Non-
uniform
66 7.9 7.6 0.3Non-
uniform 1.2 0.8
67 1.0 6.3 2.4 3.9Non-
uniform 4.6 2.4 2.2Non-
uniform
68 6.1 0.5 5.6 Uniform 4.7 3.7 0.9Non-
uniform 0.669 1.4 1.7 1.9
70 0.3 7.1 6.8 0.2Non-
uniform 1.0
71 4.8 0.7 4.1Non-
uniform 2.3 2.072 6.9 0.3 6.6 Uniform 2.3 0.573 3.0 1.7 0.5
74 3.4 9.1 4.3 4.8Non-
uniform 0.5
75 2.3 4.2 1.3 2.9Non-
uniform 0.176 1.3 2.2 0.5
77 8.5 8.4 0.1Non-
uniform 0.7 4.1 4.1 0.1Non-
uniform78 8.1 0.2 7.9 Uniform 1.0 1.679 1.0 3.7 0.3
Table 10 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 2PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item1 6.1 * -2.368 * 6.1 * 7.5 1.313 6.7 0.7 0.121 1.32 12.1 * 2.919 * 11.0 * 9.6 * 2.711 * 8.6 * 0.0 0.198 0.03 0.7 1.258 0.5 2.6 1.739 2.4 8.2 * -2.455 * 7.5 *4 2.1 0.464 2.1 2.2 0.995 2.3 11.6 * -1.976 * 12.3 *5 0.2 0.386 0.1 2.5 -0.095 1.9 3.0 -0.682 3.36 0.4 -0.035 0.4 1.3 -0.335 1.2 0.2 0.470 0.27 2.7 -0.721 206.0 1.4 -0.942 1.2 0.9 0.017 1.68 11.2 * -3.135 * 11.0 * 2.4 -0.975 3.0 7.5 * -2.277 * 7.4 *9 8.7 -1.619 7.8 5.0 -0.992 3.9 3.2 -1.471 3.3
10 4.4 -0.322 4.0 1.1 0.970 1.0 3.1 -0.171 2.711 10.3 * -3.083 * 10.5 * 1.6 -1.148 1.8 3.3 -1.718 3.312 1.6 1.752 1.5 3.0 0.599 1.9 6.1 -1.029 7.413 32.0 * 5.888 * 29.7 * 11.1 * 3.382 * 10.6 * 1.0 1.216 0.914 14.3 * 4.198 * 13.4 * 0.5 -0.300 0.4 1.6 1.049 1.715 9.1 * 2.768 * 8.2 * 3.1 1.650 2.9 0.5 0.335 0.416 3.6 1.815 3.4 2.6 0.880 2.3 0.4 0.030 0.117 0.3 1.026 0.2 4.8 1.691 5.1 5.3 -0.989 3.018 0.6 0.412 0.7 1.3 0.733 0.7 0.1 0.102 0.119 1.0 1.545 0.9 9.1 2.992 8.8 1.3 0.362 1.020 6.1 2.730 5.7 0.9 0.553 0.7 1.2 -0.740 1.221 2.4 0.386 2.3 2.1 0.425 1.7 0.2 0.116 0.222 3.3 -1.351 3.3 3.5 1.165 3.4 0.3 0.184 0.423 3.4 1.769 3.2 0.2 0.411 0.1 0.5 0.756 0.324 2.1 -0.888 2.0 1.9 -0.567 1.7 1.8 -0.903 1.925 4.7 1.598 3.9 1.7 0.834 1.2 0.5 -0.284 0.426 3.5 -0.720 3.5 1.3 -0.711 1.4 2.3 0.023 2.3
Table 11
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (G2), BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
Item27 9.3 -1.570 9.5 2.0 -0.565 1.9 0.2 0.494 0.228 4.1 -0.903 4.1 0.8 -0.164 1.0 0.4 0.070 0.129 10.2 -1.732 10.6 1.7 -0.985 2.0 2.1 -1.299 2.130 4.0 1.633 3.9 5.9 1.593 5.4 1.4 1.374 1.231 0.2 1.049 0.2 0.5 -0.495 0.4 2.3 1.734 1.932 13.5 -1.634 12.3 1.8 1.345 1.8 5.3 -1.989 4.833 1.0 0.287 0.9 2.0 0.374 1.4 5.5 -1.956 4.934 2.1 -1.018 2.1 0.4 -0.119 0.4 0.7 -0.472 0.935 5.8 0.603 5.8 5.4 -1.856 5.6 3.4 0.592 3.436 10.3 * -2.047 * 9.7 * 0.5 -0.223 0.4 1.5 -0.925 1.437 4.6 1.643 4.5 6.2 -0.490 4.4 4.0 0.116 4.238 9.4 * -2.068 * 8.5 * 3.1 0.058 2.2 5.6 -2.092 5.539 0.1 0.495 0.1 1.6 0.485 1.0 4.7 0.752 3.640 7.2 -1.800 6.2 0.5 -0.148 0.6 0.3 -0.150 0.241 0.1 0.659 0.1 3.8 -1.443 2.9 3.4 1.676 3.042 0.6 0.238 0.5 0.7 -0.263 0.9 1.0 0.493 1.043 1.8 -0.421 1.7 3.7 -1.586 3.4 1.9 0.465 2.044 36.6 * -5.495 * 35.8 * 2.4 -1.367 2.1 7.4 * -2.368 * 7.7 *45 17.0 * -2.485 * 14.7 * 3.0 0.175 2.5 2.0 -0.917 1.846 2.9 -0.859 2.8 2.2 -0.443 2.1 7.5 0.354 4.547 1.8 0.771 19.0 0.1 0.009 0.1 4.3 -0.099 3.948 8.7 1.094 9.1 3.4 1.264 3.2 6.3 1.204 5.949 1.0 1.720 0.9 1.4 0.595 1.5 6.1 0.545 7.150 6.6 -1.092 6.5 1.6 -0.925 1.2 3.8 -1.218 3.451 12.7 -1.649 12.8 13.9 * -3.547 * 13.5 * 7.0 1.733 7.452 0.7 -0.212 0.6 0.3 -0.177 0.2 0.0 0.057 0.0
Table 11 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (G2), BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
Item53 8.6 -1.616 6.4 2.5 -1.074 1.7 1.5 0.070 1.454 5.5 -1.704 5.2 0.6 -0.054 0.4 4.7 -1.729 4.255 7.6 -0.772 6.8 1.3 -0.916 1.2 0.6 -0.210 0.656 22.0 * 4.687 * 21.0 * 0.8 -0.304 0.7 6.4 2.580 6.057 13.6 * 3.638 * 13.0 * 2.2 1.254 2.1 2.1 1.436 2.058 1.8 1.106 1.8 0.8 0.862 0.6 2.1 0.807 1.459 3.2 0.153 3.2 2.4 0.404 2.5 4.0 1.122 3.660 4.8 1.456 4.1 0.1 -0.130 0.2 0.3 0.629 0.261 3.5 2.136 3.1 0.9 0.866 0.7 5.2 2.451 4.962 1.1 1.279 1.0 2.2 0.528 1.7 3.5 1.888 3.063 5.1 -1.096 3.9 9.7 -0.457 1.8 3.6 1.612 2.764 2.3 1.652 2.3 2.4 -1.129 1.9 4.3 -1.528 3.765 0.8 0.236 0.5 1.0 -0.264 0.4 4.1 1.377 3.766 7.9 1.037 5.5 1.2 0.182 0.4 0.8 1.165 0.867 1.0 -0.088 0.8 6.3 -1.648 8.6 4.6 1.920 2.568 6.1 * 2.155 * 6.1 * 4.7 0.972 4.0 0.6 0.533 0.469 1.4 -0.814 1.4 1.7 1.308 1.6 1.9 -0.173 1.570 0.3 0.573 0.1 7.1 0.453 9.4 1.0 1.218 1.271 4.8 2.387 4.8 2.3 -0.803 2.4 2.0 1.301 2.572 6.9 * 3.040 * 6.7 * 2.3 -0.358 1.7 0.5 0.588 0.773 3.0 -0.735 2.8 1.7 -1.108 1.4 0.5 0.945 0.474 3.4 -0.284 3.8 9.1 -1.456 10.0 0.5 -0.003 0.375 2.3 0.559 2.1 4.2 1.254 4.1 0.1 0.199 0.176 1.3 0.933 1.3 2.2 0.870 1.9 0.5 -0.074 0.477 8.5 0.217 8.6 0.7 0.639 0.6 4.1 0.270 3.878 8.1 * 3.430 * 7.8 * 1.0 0.162 1.1 1.6 0.447 1.679 1.0 -0.187 1.0 3.7 -1.438 3.3 0.3 0.566 0.3
Table 11 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 2PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (G2), BILOG-MG 3 (d), IRTPRO (χ2))
Note. * DIF Items
4.2.3 The Three Comparison Groups Using Three Computer Programs with 3PL
Table 12 displays the uniform and non-uniform DIF among three comparison groups with
3PL. There are 43 items identified as statistically significant DIF items for Whites vs. Blacks,
including 17 uniform and 26 non-uniform DIF items. A total of 24 items are DIF items for
Whites vs. Hispanics, including four uniform and 20 non-uniform DIF items. For Whites vs. the
Multi-Racial group, 25 items were identified as statistically significant DIF items, including
six uniform and 19 non-uniform DIF items. Table 13 presents the outcomes of
the three comparison groups using the three computer programs. For Whites vs. Blacks, the
programs indicated that Items 13, 14, 15, 32, 44, 45, 56, 57, and 78 are DIF items; Items 32, 44,
and 45 advantaged Whites, and Items 13, 14, 15, 56, 57, and 78 favored Blacks. Additionally,
Items 13, 19, and 51 are DIF items for Whites vs. Hispanics; Items 13 and 19 favored Whites,
and Item 51 favored Hispanics. Furthermore, only one DIF item (Item 44) is detected for Whites
vs. the Multi-Racial group, and this item advantaged the Multi-Racial group.
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
1 5.1 0.4 0.0 4.8 Uniform 7.7 0.6 2.5 4.6Non-
Uniform 0.6
2 11.8 2.5 0.7 8.5 Uniform 9.5 6.5 2.9 0.1Non-
Uniform 0.03 3.8 2.3 9.9 0.0 3.8 6.0 Uniform
4 0.5 2.7 12.0 11.9 0.0 0.0Non-
Uniform
5 2.2 4.0 3.3 0.6 0.1Non-
Uniform 4.1 1.9 2.1 0.1Non-
Uniform6 3.1 2.3 0.0
7 5.8 0.0 2.3 3.7Non-
Uniform 4.1 0.7 0.2 3.3Non-
Uniform 1.7
8 11.4 0.5 0.3 10.6 Uniform 2.4 7.2 5.1 1.9 0.2Non-
Uniform
9 10.6 0.2 1.7 8.7Non-
Uniform 6.9 3.8 1.7 1.3Non-
Uniform 3.8
10 7.1 0.5 4.1 2.4Non-
Uniform 4.1 2.4 1.0 0.8Non-
Uniform 3.1
11 11.3 9.6 0.0 1.7Non-
Uniform 4.2 2.2 1.0 1.0Non-
Uniform 3.9 3.8 0.0 0.2Non-
Uniform
12 2.2 6.0 6.0 0.0 0.1Non-
Uniform 7.6 7.5 0.1 0.0Non-
Uniform
13 34.4 0.9 6.4 27.1 Uniform 10.1 0.1 8.5 1.5Non-
Uniform 1.0
Table 12
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
14 17.1 0.0 2.3 15.1 Uniform 0.5 1.3
15 9.8 0.2 8.6 1.0Non-
Uniform 3.4 2.2
16 2.8 2.5 8.8 1.0 0.0 7.9Non-
Uniform
17 0.8 4.7 1.0 4.1 0.0Non-
Uniform 5.3 0.6 5.0 0.0Non-
Uniform18 1.1 1.0 0.0
19 1.7 8.5 0.5 8.7 0.0 Uniform 4.9 4.7 0.0 0.2Non-
Uniform20 7.9 0.3 0.0 7.6 Uniform 0.4 1.1
21 4.5 4.8 0.1 0.0Non-
Uniform 1.5 0.5
22 3.0Non-
Uniform 3.2 0.2
23 6.0 5.8 0.0 0.2Non-
Uniform 0.0 3.424 2.0 1.7 3.1
25 4.6 4.4 0.0 0.2Non-
Uniform 1.2 1.2
26 6.9 1.6 5.3 0.0Non-
Uniform 1.4 1.9
27 6.4 4.6 0.0 1.8Non-
Uniform 1.6 0.4
28 8.0 1.0 6.7 0.3Non-
Uniform 5.7 4.9 0.3 0.5Non-
Uniform 0.6
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Table 12 (Continued)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
29 7.4 4.4 0.0 3.0Non-
Uniform 2.3 2.1
30 8.3 4.3 2.1 1.9Non-
Uniform 6.0 0.1 4.4 1.6Non-
Uniform 1.531 2.9 0.6 2.1
33 3.1 1.4 6.1 0.7 5.3 0.1Non-
Uniform34 2.3 0.6 1.0
35 2.2 5.6 5.6 0.1 0.0Non-
Uniform 2.9
36 6.1 4.7 0.3 1.1Non-
Uniform 0.6 3.3
37 6.9 0.1 0.2 6.7 Uniform 6.8 0.8 3.6 2.4Non-
Uniform 4.7 3.9 0.7 0.0Non-
Uniform38 7.6 0.0 7.8 0.0 Uniform 2.0 6.8 0.0 3.8 3.0 Uniform39 0.1 1.4 3.5
40 5.0 1.0 6.5 0.0Non-
Uniform 2.5 0.041 1.4 3.6 3.342 1.5 0.9 2.2
43 6.3 2.5 1.4 2.3Non-
Uniform 4.9 0.0 4.8 0.1 Uniform 4.7 0.6 0.6 3.4Non-
Uniform
44 34.2 6.2 12.7 15.3 Uniform 2.8 7.7 4.4 3.3 0.0Non-
Uniform45 12.4 1.2 11.4 0.0 Uniform 3.4 2.8
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Table 12 (Continued)
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
46 7.0 0.0 6.4 0.6Non-
Uniform 4.7 0.0 0.0 4.7Non-
Uniform 6.6 1.5 2.2 2.9Non-
Uniform
47 7.7 0.0 0.4 7.3Non-
Uniform 0.0 2.7
48 6.8 3.8 2.2 0.8Non-
Uniform 2.6 7.1 2.5 4.5 0.1Non-
Uniform
49 1.5 1.5 7.2 6.2 0.9 0.1Non-
Uniform50 4.4 0.0 3.8 0.6 Uniform 3.0 5.0 0.3 4.7 0.0 Uniform
51 15.2 16.0 0.2 0.0Non-
Uniform 14.5 1.1 12.1 1.3 Uniform 7.6 3.4 4.9 0.0Non-
Uniform52 1.2 0.6 0.153 8.8 0.0 8.9 0.0 Uniform 2.2 2.254 5.0 0.9 1.0 3.2 Uniform 1.4 4.6 0.0 4.9 0.0 Uniform
55 4.0 1.2 3.1 0.0Non-
Uniform 1.0 0.5
56 32.5 31.5 0.3 0.7Non-
Uniform 1.0 10.3 0.0 7.1 3.2Non-
Uniform
57 17.9 15.5 2.1 0.3Non-
Uniform 1.7 2.558 3.0 0.7 1.5
59 2.5 0.7 4.4 1.2 0.0 3.2Non-
Uniform60 0.0 0.0 0.061 4.0 0.0 3.7 0.3 Uniform 0.9 4.0 0.1 4.8 0.0 Uniform62 0.2 1.5 3.3
Table 12 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item | H0: all equal | H0: a equal | H0: b equal | H0: c equal  (G2 statistics, repeated for each comparison group)
63 4.8 0.0 7.1 0.0 Uniform 9.8 8.5 0.3 1.0Non-
Uniform 1.4
64 2.8 4.5 1.1 3.0 0.4Non-
Uniform 5.0 2.0 0.0 3.1 Uniform65 0.0 0.5 2.666 0.1 1.1 0.0
67 0.0 6.4 2.2 3.9 0.4Non-
Uniform 1.7
68 12.2 11.2 0.2 0.8Non-
Uniform 2.9 2.469 1.7 2.2 1.6
70 0.3 7.4 6.8 1.0 0.0Non-
Uniform 0.871 4.2 0.0 5.3 0.0 Uniform 1.9 1.3
72 11.0 8.2 1.5 1.4Non-
Uniform 1.9 0.673 1.2 1.1 0.0
74 3.7 6.7 9.3 0.0 0.0Non-
Uniform 1.375 2.8 3.1 2.976 2.3 1.5 1.1
77 0.9 0.8 4.8 1.6 3.7 0.0Non-
Uniform78 9.2 0.9 6.9 1.5 Uniform 1.0 0.879 0.3 4.0 0.1 4.4 0.0 Uniform 0.9
Table 12 (Continued)
The Summary of IRTLRDIF for Three Comparison Groups with 3PL
Whites vs. Blacks (G2) | Whites vs. Hispanics (G2) | Whites vs. Multi-Racial (G2)
Item1 5.1 -1.475 4.9 7.7 1.194 1.5 0.6 0.367 0.52 11.8 1.239 7.8 9.5 1.354 8.4 0.0 -0.081 0.63 3.8 -0.100 2.0 2.3 1.013 5.7 9.9 -2.144 7.94 0.5 0.443 1.7 2.7 1.000 9.0 12.0 -0.816 20.95 2.2 -0.063 0.6 4.0 -0.873 1.2 4.1 -1.633 2.46 3.1 -0.328 2.5 2.3 -0.025 3.9 0.0 0.381 2.27 5.8 -1.191 2.7 4.1 -0.528 1.7 1.7 0.487 3.08 11.4 -2.299 6.4 2.4 -0.684 1.7 7.2 -2.100 6.99 10.6 -1.628 5.2 6.9 -1.299 2.8 3.8 -1.288 5.9
10 7.1 -1.738 3.8 4.1 1.264 9.5 3.1 -1.413 1.311 11.3 -1.137 11.0 4.2 -0.663 6.8 3.9 -0.797 10.012 2.2 1.413 4.1 6.0 0.038 0.9 7.6 -0.338 13.413 34.4 * 4.256 * 35.0 * 10.1 * 2.150 * 12.8 * 1.0 0.585 2.814 17.1 * 2.596 * 14.3 * 0.5 -0.764 1.2 1.3 0.962 5.115 9.8 * 2.287 * 10.6 * 3.4 1.503 6.8 2.2 0.439 6.516 2.8 1.023 2.0 2.5 0.799 2.0 8.8 0.240 1.017 0.8 0.319 1.5 4.7 1.678 10.2 5.3 -1.687 6.018 1.1 -0.032 2.2 1.0 -0.260 2.5 0.0 -0.446 5.519 1.7 0.673 2.0 8.5 * 2.026 * 10.9 * 4.9 -0.245 0.320 7.9 1.273 8.9 0.4 -0.295 4.3 1.1 -1.176 7.521 4.5 0.601 4.6 1.5 -0.112 0.4 0.5 0.107 0.822 3.0 -0.844 2.8 3.2 1.039 1.4 0.2 -0.228 0.023 6.0 0.444 3.2 0.0 0.036 1.0 3.4 0.403 0.424 2.0 -0.900 1.7 1.7 -1.077 0.9 3.1 -0.847 5.025 4.6 0.253 1.9 1.2 0.116 0.7 1.2 -0.724 2.726 6.9 -1.898 5.9 1.4 -0.523 1.0 1.9 -0.906 1.1
Note. * DIF Items
Table 13
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL
Whites vs. Blacks | Whites vs. Hispanics | Whites vs. Multi-Racial (per comparison: IRTLRDIF (χ2), BILOG-MG 3 (d), IRTPRO (χ2))
Table 13 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL

              Whites vs. Blacks              |        Whites vs. Hispanics           |       Whites vs. Multi-Racial
Item  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)
27  6.4  0.042  9.9  |  1.6  -0.102  1.3  |  0.4  -0.033  0.6
28  8.0  -2.322  4.8  |  5.7  -0.480  0.1  |  0.6  0.274  1.9
29  7.4  -0.425  7.2  |  2.3  -0.857  1.7  |  2.1  -1.313  1.8
30  8.3  1.441  11.4  |  6.0  1.707  9.3  |  1.5  0.778  3.7
31  2.9  0.024  2.4  |  0.6  -0.736  2.7  |  2.1  0.829  3.7
32  23.5*  -4.104*  13.0*  |  2.4  1.055  6.2  |  3.9  -1.619  8.3
33  3.1  -0.692  1.0  |  1.4  -0.415  0.5  |  6.1  -2.262  7.9
34  2.3  -1.095  1.5  |  0.6  -0.591  0.5  |  1.0  -1.028  0.6
35  2.2  1.144  4.4  |  5.6  -1.098  8.3  |  2.9  1.251  7.1
36  6.1  -1.033  9.0*  |  0.6  -0.434  6.4  |  3.3  -1.103  3.7
37  6.9  1.176  5.9  |  6.8  -2.158  3.5  |  4.7  1.032  7.1
38  7.6  -2.289  4.7  |  2.0  -0.770  0.2  |  6.8  -1.960  6.2
39  0.1  0.027  0.4  |  1.4  0.692  5.9  |  3.5  -0.658  1.1
40  5.0  -1.716  6.1  |  2.5  -0.301  2.0  |  0.0  -0.554  3.9
41  1.4  0.183  1.8  |  3.6  -1.566  4.5  |  3.3  1.367  4.6
42  1.5  -0.288  0.5  |  0.9  0.076  0.3  |  2.2  -0.333  0.9
43  6.3  -0.649  5.7  |  4.9  -2.134  5.9  |  4.7  0.658  7.5
44  34.2*  -4.245*  28.8*  |  2.8  -1.758  2.8  |  7.7*  -2.134*  8.2*
45  12.4*  -2.366*  9.5*  |  3.4  -0.296  0.4  |  2.8  -0.978  2.1
46  7.0  -2.119  2.7  |  4.7  -0.277  14.1  |  6.6  -1.900  5.6
47  7.7  0.935  5.0  |  0.0  -0.188  0.9  |  2.7  -0.938  0.9
48  6.8  1.404  13.3  |  2.6  1.192  14.4  |  7.1  1.398  20.1
49  1.5  0.913  2.4  |  1.5  0.567  5.5  |  7.2  1.270  13.3
50  4.4  -1.941  1.5  |  3.0  -1.912  5.6  |  5.0  -2.617  4.9
51  15.2  -0.585  15.6  |  14.5*  -3.438*  15.8*  |  7.6  1.953  10.9
52  1.2  -0.577  1.0  |  0.6  -0.540  0.7  |  0.1  -0.033  0.1
Note. * DIF Items
63
Table 13 (Continued)
The Summary of IRTLRDIF, BILOG-MG 3, and IRTPRO for Three Comparison Groups with 3PL

              Whites vs. Blacks              |        Whites vs. Hispanics           |       Whites vs. Multi-Racial
Item  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)  |  IRTLRDIF (χ2)  BILOG-MG 3 (d)  IRTPRO (χ2)
53  8.8  -1.820  7.3  |  2.2  -1.210  4.3  |  2.2  0.024  6.8
54  5.0  -1.321  3.7  |  1.4  -0.467  0.2  |  4.6  -2.184  5.4
55  4.0  -1.330  2.6  |  1.0  -1.115  3.4  |  0.5  -0.669  0.8
56  32.5*  3.093*  11.4*  |  1.0  -0.086  1.9  |  10.3  2.049  4.7
57  17.9*  2.856*  12.3*  |  1.7  0.607  1.1  |  2.5  1.108  3.4
58  3.0  1.207  5.5  |  0.7  0.509  1.0  |  1.5  0.011  0.3
59  2.5  0.780  6.1  |  0.7  0.481  5.5  |  4.4  0.138  3.7
60  0.0  0.752  0.9  |  0.0  -0.220  1.4  |  0.0  0.348  1.4
61  4.0  1.481  3.6  |  0.9  0.685  1.1  |  4.0  1.657  5.9
62  0.2  0.790  1.9  |  1.5  0.373  0.3  |  3.3  1.303  5.1
63  4.8  -0.751  7.5  |  9.8  -0.419  2.1  |  1.4  0.647  0.8
64  2.8  0.976  2.4  |  4.5  -1.701  2.4  |  5.0  -0.821  6.6
65  0.0  0.216  2.3  |  0.5  -0.152  8.5  |  2.6  0.620  11.4
66  0.1  0.738  0.8  |  1.1  0.376  1.8  |  0.0  0.759  3.7
67  0.0  -0.053  4.3  |  6.4  -1.221  11.7  |  1.7  0.989  2.7
68  12.2  1.538  4.7  |  2.9  0.252  1.3  |  2.4  0.212  0.5
69  1.7  -0.468  1.6  |  2.2  1.063  1.6  |  1.6  -0.699  1.0
70  0.3  0.394  3.1  |  7.4  0.778  15.6  |  0.8  0.835  2.1
71  4.2  1.585  4.5  |  1.9  -0.505  4.3  |  1.3  0.750  1.0
72  11.0  2.009  6.6  |  1.9  -0.682  0.5  |  0.6  0.206  0.2
73  1.2  -0.617  6.1  |  1.1  -0.799  6.6  |  0.0  0.482  4.4
74  3.7  -0.204  3.1  |  6.7  0.169  7.9  |  1.3  0.365  3.4
75  2.8  0.692  1.6  |  3.1  0.515  2.8  |  2.9  -0.127  0.5
76  2.3  0.184  1.5  |  1.5  0.909  1.5  |  1.1  0.004  1.5
77  0.9  0.196  0.9  |  0.8  0.468  5.6  |  4.8  1.340  9.0
78  9.2*  2.463*  12.0*  |  1.0  -0.630  1.3  |  0.8  -0.438  0.4
79  0.3  -0.171  0.6  |  4.0  -2.288  3.9  |  0.9  0.223  0.9
Note. * DIF Items
64
4.2.4 Multiple Groups Using Two Programs with Three Models
The results of BILOG-MG 3 and IRTPRO for Whites vs. Blacks, Hispanics, and the
Multi-Racial group with 1PL are given in Table 14. BILOG-MG 3 detected DIF in 36 items for
Whites vs. Blacks, in six items for Whites vs. Hispanics, and in 10 items for Whites vs. the
Multi-Racial group. Items 2, 13, and 74 show DIF in both Whites vs. Blacks and Whites vs.
Hispanics. In addition, Items 8, 11, 32, 44, 56, and 61 show DIF in both Whites vs. Blacks and
Whites vs. the Multi-Racial group. On the other hand, IRTPRO detected fewer DIF items among
the three comparison groups: 12 items for Whites vs. Blacks, three items for Whites vs.
Hispanics, and two items for Whites vs. the Multi-Racial group. Based on the results, both
BILOG-MG 3 and IRTPRO consistently detect DIF in Items 2, 8, 11, 13, 29, 30, 44,
56, 61, and 74 for Whites vs. Blacks and in Item 3 for Whites vs. the Multi-Racial group. There is
no consistent DIF detection for Whites vs. Hispanics.
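The consistency check described here amounts to intersecting the programs' flagged-item sets. A minimal sketch using the Whites vs. Blacks 1PL flags reported in this section (the BILOG-MG 3 set is abbreviated to an illustrative subset of its 36 flagged items, so the lists should not be read as the complete Table 14 results):

```python
# Items flagged for Whites vs. Blacks with 1PL (illustrative subsets; the
# full BILOG-MG 3 list contains 36 items and IRTPRO's contains 12).
bilog_flags = {2, 8, 11, 13, 14, 15, 29, 30, 44, 56, 61, 74}
irtpro_flags = {2, 8, 11, 13, 19, 29, 30, 44, 56, 57, 61, 74}

# An item counts as a consistent DIF detection only if both programs flag it.
consistent = sorted(bilog_flags & irtpro_flags)
print(consistent)  # [2, 8, 11, 13, 29, 30, 44, 56, 61, 74]
```

The intersection recovers the ten items reported as consistently detected; items flagged by only one program (e.g., 14 and 15 by BILOG-MG 3, 19 and 57 by IRTPRO) drop out.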
65
Table 14
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
1  -5.216*  1.057  -0.073  |  0.7  12.2*  0.4
2  2.902*  2.919*  0.137  |  7.8*  0.3  3.9
3  1.878  1.877  -2.385*  |  0.3  2.4  9.5*
4  0.450  1.101  -1.979*  |  0.3  0.9  4.3
5  -1.792  0.486  -1.004  |  2.0  0.2  0.0
6  -1.252  -0.471  0.365  |  0.3  0.5  0.5
7  -3.008*  -1.435  -0.191  |  3.3  0.6  1.1
8  -5.442*  -1.333  -2.717*  |  14.5*  0.5  0.6
9  -1.331  -0.865  -1.542  |  3.6  1.1  0.1
10  -0.455  1.004  -0.242  |  0.4  0.7  0.7
11  -7.861*  -1.931  -2.262*  |  19.4*  3.0  0.0
12  1.867  0.734  -1.074  |  0.4  2.1  1.5
13  6.350*  3.412*  1.093  |  23.5*  1.1  3.1
14  4.554*  -0.145  0.942  |  3.5  5.7  0.9
15  4.106*  1.723  0.140  |  5.8  2.0  1.2
16  0.270  0.676  -0.130  |  0.4  0.2  0.2
17  2.820*  1.698  -1.239  |  1.0  2.8  4.7
18  1.140  0.942  -0.031  |  1.3  0.2  0.4
19  1.445  3.069*  0.275  |  6.7*  1.8  3.9
20  3.445*  0.763  -0.742  |  1.5  5.1  1.0
21  -1.487  0.292  -0.021  |  0.2  1.1  0.0
22  -3.817*  0.988  0.000  |  0.4  8.8*  0.3
23  3.197*  0.618  0.601  |  3.2  1.4  0.0
24  -1.512  -0.579  -1.029  |  2.1  0.2  0.0
25  2.770*  1.010  -0.382  |  1.7  1.9  0.9
26  -4.575*  -1.560  -0.221  |  4.9  1.8  1.2
27  -3.163*  0.556  0.377  |  0.3  6.5*  0.0
28  -3.946*  -0.623  -0.137  |  2.1  2.5  0.2
29  -5.413*  -1.627  -1.738  |  12.2*  1.5  0.0
30  2.194*  1.737  1.238  |  6.9*  0.5  0.1
31  2.084*  -0.256  1.532  |  2.2  0.5  2.2
32  -2.360*  1.303  -2.190*  |  1.2  0.9  5.6
33  0.355  0.480  -2.030*  |  0.8  2.0  3.2
34  -3.370*  -0.466  -0.703  |  2.8  1.6  0.0
35  -0.953  -2.344*  0.528  |  1.9  0.6  4.7
36  -1.992*  -0.037  -0.929  |  1.4  0.5  0.3
37  1.124  -0.764  -0.057  |  0.2  1.9  0.5
38  -1.877  0.184  -2.099*  |  2.8  0.2  2.4
39  0.184  0.584  0.680  |  1.2  0.4  0.1
40  -0.302  -0.152  0.161  |  0.2  0.3  0.0
Note. * DIF Items
66
Table 14 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 1PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
41  0.756  -1.296  1.461  |  0.4  1.9  4.9
42  -1.040  -0.734  0.332  |  0.8  1.1  0.8
43  0.356  -1.204  0.340  |  0.3  1.8  1.8
44  -3.613*  -1.724  -2.756*  |  23.6*  3.8  0.2
45  -0.954  0.300  -1.004  |  1.2  0.8  0.8
46  0.117  -0.086  0.321  |  0.3  0.2  0.2
47  0.246  0.119  -0.170  |  0.2  0.4  0.0
48  0.823  1.498  -1.092  |  5.0  0.4  0.1
49  0.981  0.749  0.449  |  2.2  0.6  0.0
50  -0.657  -0.794  -1.256  |  3.0  0.5  0.0
51  -1.061  -3.383*  1.707  |  2.9  0.2  14.9*
52  -1.805  -1.000  -0.193  |  3.4  2.4  0.5
53  0.059  -0.856  -0.133  |  0.5  1.0  0.5
54  -0.820  0.040  -1.775  |  2.6  0.2  1.5
55  -0.082  -0.774  -0.293  |  0.7  0.7  0.3
56  2.213*  -0.278  2.506*  |  7.9*  2.9  4.6
57  1.789  1.340  1.377  |  8.9*  0.8  0.0
58  0.778  0.944  0.715  |  2.0  0.5  0.0
59  -2.698*  0.110  1.528  |  0.3  6.0  1.3
60  2.171*  0.036  0.532  |  1.3  1.3  0.3
61  3.511*  1.026  2.244*  |  9.7*  0.2  0.9
62  2.625*  0.678  1.811  |  5.6  0.2  0.8
63  0.604  -0.501  1.076  |  0.4  0.2  1.7
64  0.836  -1.153  -1.683  |  2.2  6.0  0.0
65  2.000*  -0.319  0.888  |  0.9  1.0  0.9
66  2.859*  0.234  0.893  |  2.5  1.2  0.3
67  1.500  -1.656  1.102  |  0.4  2.9  4.9
68  1.270  0.815  0.411  |  1.6  0.2  0.1
69  -2.752*  1.163  -0.325  |  0.2  4.4  0.9
70  1.803  0.596  1.019  |  2.5  0.2  0.2
71  2.886*  -0.756  1.244  |  1.4  2.5  2.6
72  3.446*  -0.228  0.500  |  1.6  4.2  0.5
73  0.795  -1.058  0.540  |  0.2  1.2  1.8
74  -4.706*  -3.412*  -0.320  |  11.1*  0.4  4.9
75  -2.267*  1.216  -0.004  |  0.3  4.0  0.7
76  -0.487  0.698  -0.264  |  0.2  0.5  0.3
77  -1.929  0.503  0.154  |  0.2  2.2  0.0
78  3.746*  0.341  0.374  |  2.8  4.0  0.0
79  -1.481  -1.693  0.514  |  1.5  0.2  3.2
Note. * DIF Items
67
Table 15 shows the results of BILOG-MG 3 and IRTPRO for Whites vs. Blacks,
Hispanics, and the Multi-Racial group with 2PL. BILOG-MG 3 detected 20 DIF items for
Whites vs. Blacks, four items for Whites vs. Hispanics, and 10 items for Whites vs. the
Multi-Racial group. Items 2 and 13 are detected as DIF for both Whites vs. Blacks and Whites
vs. Hispanics. In addition, Items 8, 11, 33, 44, 56, and 61 are identified as DIF for both Whites vs.
Blacks and Whites vs. the Multi-Racial group. On the other hand, IRTPRO detected fewer DIF
items: 15 items for Whites vs. Blacks, four items for Whites vs. Hispanics, and six
items for Whites vs. the Multi-Racial group. Based on the results, Items 2, 8, 11, 13, 44, and 45
are detected by both BILOG-MG 3 and IRTPRO for Whites vs. Blacks, and Items 3 and 67 for
Whites vs. the Multi-Racial group. There is no consistent DIF detection for Whites vs. Hispanics
using BILOG-MG 3 and IRTPRO.
68
Table 15
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
1  -2.217*  1.335  0.124  |  5.0  9.6*  2.6
2  2.949*  2.799*  0.203  |  7.4*  0.1  4.8
3  1.298  1.775  -2.492*  |  0.3  1.6  9.8*
4  0.626  1.066  -1.945  |  8.2*  3.6  5.5
5  0.317  -0.097  -0.707  |  4.0  3.8  0.1
6  0.028  -0.329  0.467  |  0.7  1.3  0.6
7  -0.667  -0.964  0.009  |  1.9  2.0  0.7
8  -3.015*  -0.965  -4.802*  |  11.8*  2.9  1.4
9  -1.684  -1.013  -1.339  |  8.5*  0.2  3.6
10  -0.353  0.950  -0.049  |  1.4  1.5  3.1
11  -3.078*  -1.118  -7.717*  |  7.1*  0.0  0.6
12  1.783  0.654  -1.159  |  0.1  1.7  8.1*
13  6.267*  3.412*  1.215  |  18.0*  0.5  3.2
14  5.220*  -0.260  1.230  |  1.3  4.7  1.7
15  1.289  1.669  0.145  |  4.2  0.5  1.2
16  4.442*  0.870  0.043  |  1.2  3.3  1.2
17  0.964  1.801  -1.033  |  0.3  0.6  4.1
18  0.275  0.777  0.089  |  0.1  1.0  0.9
19  1.487  3.039*  0.349  |  4.2  1.8  5.4
20  1.797  0.603  -0.761  |  0.1  3.7  2.0
21  0.225  0.490  0.118  |  0.1  2.3  1.9
22  -4.377*  1.210  0.191  |  0.8  3.8  3.0
23  1.105  0.417  0.755  |  1.1  0.2  0.0
24  -1.383  -0.543  -0.921  |  2.7  0.2  2.4
25  0.267  0.848  -0.303  |  1.4  0.2  1.3
26  -0.758  -0.814  0.020  |  1.2  0.4  3.1
27  -1.673  0.728  0.491  |  0.3  8.2*  0.7
28  -0.859  -0.192  0.064  |  0.7  3.1  0.3
29  -1.735  -0.910  -1.271  |  5.8  0.5  0.4
30  0.934  1.642  1.351  |  6.5*  0.7  2.1
31  -2.366*  -0.484  1.737  |  0.2  0.1  2.3
32  0.167  1.320  -2.060*  |  1.8  6.1  6.1
33  -2.343*  0.374  -2.005*  |  3.9  1.4  3.5
34  1.005  -0.123  -0.481  |  1.7  0.6  0.0
35  -0.696  -1.749  0.610  |  6.5*  3.4  2.8
36  4.500*  -0.144  -0.907  |  3.6  0.4  0.3
37  -0.762  -0.453  0.107  |  0.4  3.6  8.4*
38  0.490  0.060  -2.141*  |  6.4*  0.2  5.2
39  -1.330  0.491  1.000  |  0.9  1.5  4.7
40  0.763  -0.121  -0.170  |  1.8  1.2  0.6
Note. * DIF Items
69
Table 15 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 2PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
41  0.693  -1.461  1.710  |  0.3  0.4  5.7
42  0.254  -0.277  0.493  |  0.0  0.3  2.1
43  -0.395  -1.533  0.446  |  2.5  0.4  3.8
44  -5.503*  -1.329  -2.380*  |  17.6*  0.9  1.5
45  -2.604*  0.154  -0.961  |  7.4  2.5  1.5
46  -0.928  -0.433  0.341  |  0.9  2.1  7.0*
47  0.670  0.051  -0.102  |  1.3  4.7  1.8
48  1.245  1.354  1.182  |  11.5*  0.5  0.2
49  1.887  0.641  0.532  |  5.1  5.4  0.6
50  -1.200  -0.858  -1.200  |  3.8  4.1  0.4
51  -1.543  -3.321*  1.726  |  9.4*  0.2  13.0*
52  -0.209  -0.183  0.061  |  0.2  0.1  0.2
53  -1.603  -1.083  0.061  |  3.5  0.4  1.9
54  -1.783  -0.029  -1.732  |  4.4  1.3  1.4
55  -0.769  -0.971  -0.236  |  3.0  1.0  0.6
56  4.729*  -0.332  2.588*  |  5.3  5.4  3.9
57  3.662*  1.295  1.430  |  6.1  1.0  1.5
58  1.009  0.899  0.796  |  1.4  2.7  0.3
59  0.160  0.445  1.108  |  4.3  1.8  4.2
60  1.467  -0.147  0.628  |  0.9  0.5  0.3
61  2.161*  0.904  2.503*  |  4.4  0.9  1.3
62  1.284  0.561  2.000  |  2.6  0.9  2.3
63  -1.188  -0.466  1.709  |  11.4*  8.9*  5.6
64  1.667  -1.064  -1.524  |  2.3  7.6*  1.5
65  0.288  -0.267  1.511  |  0.6  2.7  0.9
66  1.025  0.223  1.299  |  1.6  1.2  0.5
67  -0.106  -1.705  2.008*  |  0.8  1.4  6.7*
68  2.157*  0.985  0.533  |  5.0  1.6  1.5
69  -0.864  1.320  -0.177  |  0.3  2.3  2.3
70  0.608  0.500  1.216  |  2.3  1.5  4.6
71  2.427*  -0.842  1.315  |  0.4  1.4  5.2
72  3.020*  -0.359  0.585  |  2.4  3.3  0.9
73  -0.735  -1.137  0.981  |  0.9  0.2  1.9
74  -0.251  -1.402  0.007  |  6.8*  2.4  3.8
75  0.559  1.308  0.207  |  3.1  2.6  1.8
76  1.028  0.888  -0.085  |  1.0  2.5  1.0
77  0.255  0.723  0.287  |  3.5  0.5  2.2
78  3.371*  0.193  0.436  |  2.7  4.8  0.1
79  -0.232  -1.417  0.582  |  1.0  1.1  3.0
Note. * DIF Items
70
Table 16 shows the 3PL results from BILOG-MG 3 and IRTPRO for Whites vs. all focal
groups. BILOG-MG 3 detected 12 DIF items for Whites vs. Blacks, six items for
Whites vs. Hispanics, and five items for Whites vs. the Multi-Racial group. Items 13 and 15 are
detected as DIF for both Whites vs. Blacks and Whites vs. Hispanics, Item 51 for both Whites vs.
Hispanics and Whites vs. the Multi-Racial group, and Item 56 for both Whites
vs. Blacks and Whites vs. the Multi-Racial group. On the other hand, with 3PL, IRTPRO detected
more DIF items than BILOG-MG 3 for Whites vs. Blacks: 16 items, along with four items
for Whites vs. Hispanics and two items for Whites vs. the Multi-Racial group. Items 49 and 65
are identified as DIF for both Whites vs. Blacks and Whites vs. Hispanics, and Item 51 for both
Whites vs. Blacks and Whites vs. the Multi-Racial group. Moreover, the results indicate that Items
13, 15, and 44 are consistently detected by BILOG-MG 3 and IRTPRO for Whites vs. Blacks and
for Whites vs. the Multi-Racial group.
Overall, DIF exists in the GHSGPT in Social Studies when employing the three computer
programs for the three comparison groups on the dichotomously scored items using the three
models. Figures 5 to 13 display the DIF items between Whites and Blacks, Figures 14 to
16 demonstrate DIF items between Whites and Hispanics, and Figure 17 shows that DIF exists
between Whites and the Multi-Racial group, all with 3PL because 3PL shows a good fit to the data.
71
Table 16
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
1  -1.384  1.186  0.449  |  1.3  3.4  0.8
2  1.459  1.804  0.160  |  7.9  1.2  4.6
3  0.319  1.554  -1.821  |  1.5  5.7  8.9*
4  0.923  1.583  -0.582  |  12.0*  6.3  4.8
5  0.026  -0.718  -1.461  |  2.4  2.1  0.1
6  -0.104  0.180  0.524  |  4.3  0.8  0.3
7  -0.945  -0.471  0.443  |  2.3  2.7  0.7
8  -1.960  -0.430  -1.932  |  7.5  1.2  1.2
9  -1.316  -0.985  -1.151  |  6.1  1.1  2.4
10  -1.329  1.457  -0.896  |  3.1  2.2  3.9
11  -1.357  -0.993  -1.545  |  8.4*  3.0  0.5
12  1.655  0.542  -0.213  |  1.3  2.3  5.1
13  4.563*  2.852*  1.033  |  22.6*  2.1  3.5
14  2.852*  -0.202  1.265  |  5.5  6.5  1.5
15  -2.482*  2.075*  0.717  |  11.2*  3.6  1.8
16  1.183  1.019  0.305  |  2.2  2.1  1.2
17  -0.500  2.144*  -1.380  |  4.0  2.5  7.5
18  0.316  0.369  0.056  |  5.6  0.8  0.6
19  0.881  2.548*  0.183  |  5.6  1.2  6.1
20  1.776  0.341  -0.697  |  5.9  8.4*  1.2
21  0.510  0.265  0.281  |  0.7  1.9  1.0
22  -0.796  1.190  -0.082  |  1.0  2.8  1.3
23  0.832  0.487  0.778  |  0.6  3.4  0.4
24  -0.637  -0.645  -0.620  |  3.1  0.5  1.5
25  0.633  0.593  -0.480  |  0.3  4.2  1.3
26  -1.664  -0.393  -0.775  |  1.8  2.1  1.2
27  0.118  0.123  0.193  |  1.6  7.3  0.3
28  -2.113*  -0.147  0.493  |  0.3  3.3  0.5
29  -0.461  -0.421  -1.065  |  3.1  1.8  0.3
30  1.802  -2.231*  1.214  |  15.7*  0.9  1.4
31  0.313  -0.186  1.381  |  4.3  0.7  1.7
32  -3.524*  1.386  -1.486  |  3.0  12.0*  5.7
33  -0.305  -0.012  -2.101*  |  2.0  3.9  3.5
34  -0.989  -0.326  -0.788  |  0.9  0.6  0.1
35  -1.333  -0.772  1.403  |  7.7  2.6  2.7
36  -0.649  0.222  -0.642  |  9.1*  0.2  0.7
37  1.420  -1.588  1.024  |  2.6  2.7  5.2
38  -2.011*  -0.339  -1.631  |  4.2  0.4  2.8
39  0.337  1.144  -0.189  |  2.6  0.9  3.6
40  -1.479  0.240  -0.133  |  3.2  1.5  0.7
Note. * DIF Items
72
Table 16 (Continued)
The Summary of BILOG-MG 3 and IRTPRO for All Ethnicities/Races with 3PL

            BILOG-MG 3 (d)            |          IRTPRO (χ2)
Item  W vs. B  W vs. H  W vs. MR  |  W vs. B  W vs. H  W vs. MR
41  0.353  -1.243  1.673  |  3.3  1.5  4.9
42  0.000  0.031  -0.036  |  0.6  0.1  0.8
43  -0.323  -1.404  0.919  |  8.2*  2.0  2.6
44  -4.028*  -1.201  -1.748  |  12.4  1.6  1.1
45  -2.162*  0.047  -0.721  |  3.4  1.2  1.1
46  -1.620  0.713  -1.294  |  6.5  2.6  5.4
47  1.117  0.217  -0.690  |  0.7  4.2  1.4
48  1.906  1.768  1.930  |  28.0*  2.6  0.2
49  1.280  1.091  1.450  |  11.8*  8.3*  0.4
50  -1.459  -1.228  -2.071*  |  4.1  3.1  0.4
51  -0.419  -2.699*  2.097*  |  16.1*  0.7  11.0*
52  -0.219  -0.531  -0.234  |  0.7  0.2  0.3
53  -1.513  -0.846  0.313  |  9.9*  2.7  0.9
54  -1.089  0.005  -1.875  |  3.5  0.3  1.9
55  -1.000  -0.813  -0.277  |  2.9  2.6  0.6
56  3.325*  0.170  2.498*  |  6.9  7.7  3.0
57  3.181*  0.989  1.503  |  7.1  2.9  0.7
58  1.302  0.907  -0.356  |  2.2  1.2  0.5
59  0.876  0.491  0.366  |  8.3*  1.2  1.6
60  0.953  0.094  0.634  |  1.2  3.5  0.1
61  1.523  1.027  2.090*  |  5.9  0.9  0.8
62  0.797  0.601  1.682  |  2.9  0.4  1.4
63  -0.900  -0.391  1.187  |  2.7  1.1  1.1
64  1.288  -1.339  -0.633  |  1.1  6.9  1.3
65  0.220  -0.147  1.174  |  11.9*  8.5*  0.4
66  0.845  0.494  1.258  |  1.7  6.0  0.3
67  -0.082  -1.400  1.686  |  8.6*  4.1  4.2
68  1.758  0.462  0.400  |  4.1  1.0  0.6
69  0.438  1.301  -0.384  |  0.3  2.3  2.3
70  0.451  0.895  1.161  |  10.2*  2.8  1.9
71  1.560  -0.333  1.111  |  1.8  2.6  3.0
72  2.124*  -0.443  0.518  |  0.8  5.2  0.5
73  -0.657  -0.876  0.969  |  8.8*  3.3  1.1
74  -0.274  -1.886  -0.802  |  5.7  1.2  2.2
75  0.696  0.595  0.027  |  3.0  1.5  1.0
76  0.471  1.132  0.149  |  1.4  2.6  0.8
77  0.481  0.586  1.656  |  7.2  3.1  0.9
78  2.740*  -0.140  0.027  |  2.3  5.1  0.1
79  0.054  -1.911  0.502  |  1.3  0.5  2.3
Note. * DIF Items
73
Figure 5. Item 13 between Whites and Blacks.
Figure 6. Item 14 between Whites and Blacks.
74
Figure 7. Item 15 between Whites and Blacks.
Figure 8. Item 32 between Whites and Blacks.
75
Figure 9. Item 44 between Whites and Blacks.
Figure 10. Item 45 between Whites and Blacks.
76
Figure 11. Item 56 between Whites and Blacks.
Figure 12. Item 57 between Whites and Blacks.
77
Figure 13. Item 78 between Whites and Blacks.
Figure 14. Item 13 between Whites and Hispanics.
78
Figure 15. Item 19 between Whites and Hispanics.
Figure 16. Item 51 between Whites and Hispanics.
79
Figure 17. Item 44 between Whites and the Multi-Racial Group.
80
CHAPTER 5
SUMMARY AND DISCUSSION
The purpose of this study, which employed the data from the Georgia High School
Graduation Predictor Test (GHSGPT) for Social Studies, was to analyze academic performance
by ethnicity/race. IRTPRO, BILOG-MG 3, and IRTLRDIF were utilized to investigate DIF across
reference and focal groups with 1PL, 2PL, and 3PL. Consequently, the two programs, IRTPRO
and BILOG-MG 3, identically detected 35 DIF items for Whites vs. Blacks, five DIF items for
Whites vs. Hispanics, and three DIF items for Whites vs. the Multi-Racial group with 1PL. For
2PL, the three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, consistently detected DIF:
16 DIF items for Whites vs. Blacks, three for Whites vs. Hispanics, and four for Whites vs.
the Multi-Racial group. Additionally, for 3PL, as for 2PL, the three programs identically
detected DIF. Nine DIF items exist for Whites vs. Blacks, three in Whites vs. Hispanics, and
one in Whites vs. the Multi-Racial group. Based on the results of both BILOG-MG 3 and
IRTPRO, 3PL provided a good fit for the data.
5.1 Summary
This study employed GHSGPT data to consider whether DIF for different
ethnicities/races exists in the GHSGPT for Social Studies. This thesis analyzed only 79 of the 80
GHSGPT Social Studies items because the Pearson and biserial correlations of Item 26 were
negative (-.40 and -.053, respectively). Hence, Item 26 was omitted from the calibration, and the
remaining items were renumbered to maintain consecutive numbering. The summaries of the
results are described below:
81
1. The Results Based on the Classical Test Theory (CTT)
The average p-value (the rate of correct responses) is .518, and 62 items (77%) fall
between .3 and .7; the difficulty is therefore moderate and tends toward easy. The average
discrimination is .304, and 30 items (38%) fall below .3; the discrimination is moderate, so the
items are not highly discriminating. In addition, the Pearson and biserial correlations are
positive.
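The classical statistics summarized above can be computed directly from a 0/1 response matrix. A minimal sketch on simulated data (the matrix below is randomly generated for illustration, not the GHSGPT responses):

```python
import random
from statistics import mean, pstdev

random.seed(0)
# Hypothetical 0/1 response matrix: 500 examinees x 79 items.
X = [[1 if random.random() < 0.52 else 0 for _ in range(79)] for _ in range(500)]

# Item difficulty: proportion of examinees answering each item correctly (p-value).
p_values = [mean(row[j] for row in X) for j in range(79)]

# Item discrimination: Pearson correlation of each item with the total score.
totals = [sum(row) for row in X]

def pearson(x, y):
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

disc = [pearson([row[j] for row in X], totals) for j in range(79)]

# Counts analogous to those reported above: moderate-difficulty items and
# low-discrimination items.
print(sum(.3 < p < .7 for p in p_values), sum(d < .3 for d in disc))
```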
2. The Results Based on the Item Response Theory (IRT)
a. Item Discrimination Parameter
The average item discrimination with 2PL is .519 and with 3PL is .968; thus, the degrees of
discrimination for both 2PL and 3PL are acceptable.
b. Item Difficulty Parameter
The average item difficulty with 1PL is -.173, with 2PL is .266, and with 3PL is .650. The degrees
of difficulty for the three models are moderate; however, 1PL and 2PL tend toward easy,
and 3PL tends toward difficult.
c. The Lower Asymptote (Pseudo-Guessing Parameter)
The mean of the pseudo-guessing parameter for 3PL is .224; therefore, it is not high.
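The three parameter types summarized in (a) through (c) combine in the 3PL item characteristic curve. A brief sketch, using the average estimates reported above purely for illustration:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Average estimates reported above: a = .968, b = .650, c = .224.
# An examinee of average ability (theta = 0) succeeds about 42% of the time,
# and the probability never falls below the lower asymptote c = .224.
print(round(p_3pl(0.0, 0.968, 0.650, 0.224), 3))
```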
3. Detecting DIF Using the Three Computer Programs
IRTPRO, BILOG-MG 3, and IRTLRDIF were used to assess the 79 items to detect
whether DIF for ethnicities/races exists on the GHSGPT for Social Studies with α = .05. Whites
were regarded as the reference group, and Blacks, Hispanics, and the Multi-Racial group were
considered the focal groups. For 1PL, items are considered to be DIF when BILOG-MG 3 and
82
IRTPRO consistently detected DIF. In addition, for 2PL and 3PL, when the three programs
identically detected DIF, those items are included as DIF.
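The flagging decision in each program rests on an IRT likelihood-ratio test: a compact model constrains the studied item's parameters to be equal across groups, an augmented model frees them, and the difference in -2 log likelihood is referred to a chi-square distribution. A hedged sketch (the deviance values are hypothetical; the critical values are the standard α = .05 chi-square cutoffs):

```python
# Chi-square critical values at alpha = .05; df = number of freed parameters
# (1 for 1PL, 2 for 2PL, 3 for 3PL when a, b, and c are all tested).
CHI2_CRIT_05 = {1: 3.84, 2: 5.99, 3: 7.81}

def lr_dif_test(neg2ll_compact, neg2ll_augmented, df):
    """G^2 = (-2LL compact) - (-2LL augmented); flag DIF if G^2 exceeds the cutoff."""
    g2 = neg2ll_compact - neg2ll_augmented
    return g2, g2 > CHI2_CRIT_05[df]

# Hypothetical deviances for one studied item under 3PL.
g2, flagged = lr_dif_test(51234.4, 51211.8, df=3)
print(round(g2, 1), flagged)  # 22.6 True
```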
a. The One-Parameter Logistic Model
There were 35 DIF items for Whites vs. Blacks; 15 items advantaged Blacks, and 20
items advantaged Whites. In addition, five DIF items existed for Whites vs. Hispanics;
three items favored Whites, and two items favored Hispanics. Moreover, three DIF items
existed for Whites vs. the Multi-Racial group, and those items all advantaged Whites.
b. The Two-Parameter Logistic Model
There were 16 DIF items for Whites vs. Blacks; nine items advantaged Whites, and seven
items favored Blacks. Three items showed DIF for Whites vs. Hispanics; two items favored
Whites, and one item advantaged Hispanics. Four DIF items existed for Whites vs. the
Multi-Racial group, and all advantaged the Multi-Racial group.
c. The Three-Parameter Logistic Model
There were nine DIF items found for Whites vs. Blacks; three items advantaged Whites,
and six items favored Blacks. Furthermore, three DIF items were shown for Whites vs.
Hispanics; two items advantaged Whites and one Hispanics. Additionally, only one DIF
item was found for Whites vs. the Multi-Racial group, and it advantaged the Multi-Racial
group.
4. Using IRTPRO and BILOG-MG 3 to Investigate DIF in Multiple Groups
DIF items in the multiple-group analysis were considered in parallel with the three pairwise
comparison groups. If both IRTPRO and BILOG-MG 3 identically detected DIF, then those items
were included as DIF.
83
a. The One-Parameter Logistic Model
There were ten DIF items for Whites vs. Blacks; five items favored Whites, and five
favored Blacks. There was one DIF item for Whites vs. the Multi-Racial group, and this
item advantaged the Multi-Racial group. IRTPRO and BILOG-MG 3 did not identically
detect DIF for Whites vs. Hispanics.
b. The Two-Parameter Logistic Model
BILOG-MG 3 and IRTPRO both determined seven DIF items for Whites vs. Blacks; four
items advantaged Whites and three items Blacks. Two items were detected for Whites vs.
the Multi-Racial group; one item favored Whites, and one item favored the Multi-Racial
group.
c. The Three-Parameter Logistic Model
Three DIF items were consistently detected by the two programs for Whites vs. Blacks;
two items advantaged Whites, and one advantaged Blacks. Only one DIF item was detected
for Whites vs. the Multi-Racial group, and that one favored Whites. There was no
consistent DIF item for Whites vs. Hispanics with 2PL and 3PL.
5.2 Discussion
Currently, DIF detection procedures have been developed exclusively for comparisons
between a reference (majority) group and a focal (minority) group, such as between
Whites and Blacks or between males and females. Some previous social science studies consider all
minorities as a homogeneous group. For instance, several studies mentioned that racial
differences in assessment have primarily been developed in reference to comparisons between
Whites and minority groups, which include Blacks, Asians, Hispanics, and Native Americans.
84
However, there is no evidence that Blacks and Hispanics are similar in this regard (Logan et al.,
2012). Thus, this study shows that DIF detection differs by ethnicity. In addition, previous
studies (Freedle & Kostin, 1988; Coffman & Belue, 2009) investigated the scores for either
Whites and Blacks or Whites and Hispanics or other single comparison groups. However,
numerous focal groups, for example Asians, African Americans, Hispanics, Native Americans,
females, and examinees with disabilities, are available for study (Zieky, 1993). Thus, this thesis
extends the line of prior research by using three comparison groups—1) Whites vs. Blacks; 2)
Whites vs. Hispanics; and 3) Whites vs. a Multi-Racial group—to determine which items contain
bias for a specific race/ethnicity. IRTPRO, BILOG-MG 3, and IRTLRDIF with three popular
IRT models were used to detect DIF.
This study met with some problems when calibrating the 3PL using BILOG-MG 3.
These problems may have resulted from the small sample sizes of the focal groups: the
Hispanic and Multi-Racial groups numbered 114 and 132, respectively. The default prior of
BILOG-MG 3 (GPRIOR) could not be employed because estimation stopped when calibrating
Item 59 for two comparison groups, Whites vs. Hispanics and Whites vs. the Multi-Racial group.
Therefore, this study changed the prior from GPRIOR to TPRIOR in BILOG-MG 3 and
employed a beta (4, 16) prior when using IRTPRO. In addition, when calibrating the two
comparison groups Whites vs. Hispanics and Whites vs. the Multi-Racial group with 3PL using
IRTLRDIF, several discrimination values appeared very large, such as Item 74 (186.82) for
Whites vs. Hispanics and Item 16 (78.68) for Whites vs. the Multi-Racial group. Nevertheless,
these are likely estimation errors, so the present study did not alter them because its purpose is to
detect DIF in the GHSGPT. The discussion below follows the order of the five hypotheses in
presenting the study's findings.
85
Hypotheses one and two: The three programs, IRTPRO, BILOG-MG 3, and IRTLRDIF, will
                         exhibit consistent results when testing for DIF, and IRTPRO will
                         prove effective in detecting DIF.
Based on the results for the detection of DIF, the methods using IRTPRO, BILOG-MG 3,
and IRTLRDIF are consistent across the three comparison groups. The rate of consistency
between IRTLRDIF and IRTPRO was the highest; the rates between IRTLRDIF and
BILOG-MG 3 and between BILOG-MG 3 and IRTPRO were high. The rate of consistency of
BILOG-MG 3 and IRTPRO for multiple groups was moderate. Overall, the three computer
programs displayed high consistency in the detection of DIF in this study. Furthermore, because
IRTPRO displayed results identical to those of IRTLRDIF and BILOG-MG 3 for the three
comparison groups, it is effective in detecting DIF.
Hypothesis three: Which model provides the best fit for detecting DIF?
According to Tables 7 and 8, for both BILOG-MG 3 and IRTPRO, the -2 log likelihood of
3PL for each comparison group is smaller than the -2 log likelihoods of 2PL and 1PL. Thus, this
finding indicates that 3PL is the best-fitting model for detecting DIF in the GHSGPT.
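The nested-model comparison described here can be sketched numerically. The -2 log likelihood values below are hypothetical stand-ins for the Table 7 and 8 entries (which are not reproduced in this chapter):

```python
# Hypothetical -2 log likelihood values for one comparison group; smaller
# means better fit, and 1PL, 2PL, and 3PL are nested in that order.
neg2ll = {"1PL": 60450.2, "2PL": 60101.7, "3PL": 59890.3}

# The drop in -2LL from a simpler to a richer nested model is itself a
# chi-square statistic with df equal to the number of added item parameters.
for simpler, richer in [("1PL", "2PL"), ("2PL", "3PL")]:
    drop = neg2ll[simpler] - neg2ll[richer]
    print(f"{simpler} -> {richer}: -2LL drops by {drop:.1f}")
```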
Hypothesis four: Were there differences between the ethnic groups?
The computation of total scores is:

86

Total score = [Total of the items correct / (Total number of each race × Total number of items)] × 100%   (35)
According to the total scores for each race, Whites scored 55%, Blacks 46%,
Hispanics 51%, and the Multi-Racial group 54%. In general, Whites performed
better than the other races. Perhaps because of differences in cultural background and community
region, Blacks performed worse than the other races.
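Equation 35 can be sketched as code. The Hispanic and Multi-Racial sample sizes (114 and 132) come from this chapter; the White and Black sample sizes and all correct-response counts are hypothetical values chosen only to reproduce the reported percentages:

```python
# Hypothetical counts illustrating Equation 35 (only the Hispanic and
# Multi-Racial n's are from the study; the rest are made up for the demo).
groups = {
    "Whites":       {"n": 1000, "correct": 43450},
    "Blacks":       {"n": 800,  "correct": 29067},
    "Hispanics":    {"n": 114,  "correct": 4593},
    "Multi-Racial": {"n": 132,  "correct": 5631},
}
N_ITEMS = 79

# Total score = correct responses / (examinees x items) x 100%.
scores = {race: g["correct"] / (g["n"] * N_ITEMS) * 100 for race, g in groups.items()}
for race, pct in scores.items():
    print(f"{race}: {pct:.0f}%")
```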
Hypothesis five: DIF exists between ethnic groups on the GHSGPT.
The three computer programs consistently showed that DIF exists between ethnic
groups. In addition, these findings indicated that several items advantaged specific races.
Although the results supported all of the hypotheses, there are several limitations. First,
this study does not control for gender, individual social economic status (SES), and school
regions. Second, because the present study was unable to obtain the items themselves, it cannot
analyze the distractors. Thus, it is unable to further investigate items with lower response rates or
why Blacks performed worse than other races. Third, this study does not employ
simulated data; it applies only empirical data to evaluate IRTPRO. To evaluate IRTPRO more
rigorously, researchers should employ both simulated and empirical data when detecting DIF in
future studies. Additionally, researchers may consider that school regions might
affect the probability of answering an item correctly. For example, if a school has enough
funding to hire additional teachers for tutoring, students might perform better because of this
additional help. Thus, researchers can adopt multilevel IRT, such as the HLM program or
flexMIRT, to better understand school level variables that may influence the relationships
observed here.
87
In sum, DIF is an important tool in helping test developers recognize some questions that
may be unfair for test-takers because of their gender, ethnicity/race, or cultural background
(Zieky, 1993). In other words, DIF is a particularly useful instrument for test developers. This
study presents DIF detection results from empirical tests, and, in addition, it provides important
DIF information for the test developers of the Georgia High School Graduation Predictor Test.
They can consider eliminating or revising several items, such as Items 52, 59, 74, 77, and 79,
that are beneficial or adverse for particular races. Furthermore, this study examines the new
program, IRTPRO, and demonstrates its effectiveness for detecting DIF.
88
REFERENCES
Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL:
Waveland Press, Inc.
American Psychological Association, c/o Joint Committee on Testing Practices. (1988). Code of
fair testing practices in education. Washington, DC: Author.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P.W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-24). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Baker, F. B. (2001). The basics of item response theory. New York, NY: ERIC Clearinghouse on
Assessment and Evaluation.
Baker, F. B., & Kim, S-H. (2004). Item response theory: Parameter estimation techniques. Boca
Raton, FL: Taylor & Francis.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins
University Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 392-479).
Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
An application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for dichotomously scored items.
Psychometrika, 35, 179-197.
89
Bolt, D. M. (2000). A SIBTEST approach to testing DIF hypothesis using experimentally
designed test items. Journal of Educational Measurement, 37, 307-327.
Brescia, W., & Fortune, J. C. (1988). Standardized testing of American Indian students. ERIC
Clearinghouse on Rural Education and Small Schools, Las Cruces, N. Mex. Retrieved
January 31, 2012, from http://www.enc.org/topics/equity/articles/document.shtm?=ACQ-
111498-1498.
Cai, L., Thissen, D., & du Toit, S. (2011). IRTPRO 2.1 [Computer software]. Lincolnwood, IL:
Scientific Software International.
Cai, L. (2012). flexMIRTTM version 1.86: A numerical engine for multilevel item factor analysis
and test scoring. [Computer software]. Seattle, WA: Vector Psychometric Group.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially
functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
Coffman, D. L., & Belue, R. (2009). Disparities in sense of community: True race differences or
differential item functioning? Journal of Community Psychology, 37, 547-558.
Cohen, A. S., & Kim, S-H. (1993). A comparison of Lord’s χ2 and Raju’s area measures in
detection of DIF. Applied Psychological Measurement, 17, 39-52.
Cohen, A. S., Kim, S-H., & Wollack, J. A. (1996). An investigation of the likelihood ratio test
for detection of differential item functioning. Applied Psychological Measurement, 20, 15-
26.
Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Mason, OH:
Cengage Learning.
Czepiel, S. A. (2002). Maximum likelihood estimation of logistic regression models: Theory and
implementation. Retrieved from http://czep.net/stat/mlelr.pdf.
90
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to
assessing unexpected differential item performance on the Scholastic Aptitude Test.
Journal of Educational Measurement, 23, 355-368.
Dorans, N. J., & Schmitt, A. P. (1991). Constructed response and differential item functioning: A
pragmatic approach (ETS-RR-91-47). Princeton, NJ: Educational Testing Service.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and
standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-
66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates.
Freedle, R., & Kostin, I. (1988). Relationship between item characteristics and an index of
differential item functioning (DIF) for the four GRE verbal item types. ETSRR-88-29.
Princeton, NJ: Educational Testing Service.
Georgia Department of Education. Test content descriptions based on the Georgia performance
standards social studies (2010). Retrieved from http://archives.gadoe.org/DMGet
Document.aspx/GHSGT%20Social%20Studies%20Content%20Descriptions%20GPS%20
Version%20Update%20Oct%202010.pdf?p=6CC6799F8C1371F6A344D9C15C23A9D85
9A861593B934AB75F446073BD12714C&Type=D.
Gronlund, N. E. (1993). How to make achievement tests and assessments (5th ed.) Boston, MA:
Allyn and Bacon.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principle and applications.
Boston, MA: Kluwer-Nijhoff.
91
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage.
Hambleton, R. K., & Jones, R.W. (1993) Comparison of classical test theory and item response
theory and their application to test development. Educational Measurement: Issues and
Practice, 12, 38-47.
Harwell, M. R., Baker, F. B., & Zwarts, M. (1988). Item parameter estimation via marginal
maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics,
13, 247-271.
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel
procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Kim, S-H., Cohen, A. S., & Park, T. H. (1995). Detection of differential item functioning in
multiple groups. Journal of Educational Measurement, 32, 261-278.
Ling, S. E., & Lau, S. H. (2005). Detecting differential item functioning (DIF) in standardized
multiple-choice test: An application of item response theory (IRT) using three parameter
logistic model. Retrieved January 31, 2012,
from http://www.ipbl.edu.my/inter/penyelidikan/seminarpapers/2005/lingUITM.pdf
Logan, J. R., Minca, E., & Adar, S. (2012, January 10). The geography of inequality: Why
separate means unequal in American public schools. Sociology of Education. Advance
online publication. doi:10.1177/0038040711431588.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.
Lord, F. M. (1953). A relation of test score to the trait underlying the test. Educational and
Psychological Measurement, 13, 517-548.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted
responses. Psychometrika, 39, 247-264.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Lawrence Erlbaum Associates.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective
studies of disease. Journal of the National Cancer Institute, 22, 719-748.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum
Associates.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of
Educational Statistics, 7, 105-118.
Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A
framework for classification and evaluation. Applied Psychological Measurement, 19, 23-
37.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-
502.
Raju, N. S., & Drasgow, F. (1993). An empirical comparison of the area method, Lord’s chi-
square test, and the Mantel-Haenszel technique for assessing differential item functioning.
Educational and Psychological Measurement, 53, 301-314.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of
differential functioning of items and tests. Applied Psychological Measurement, 19, 353-
368.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen:
The Danish Institute for Educational Research.
Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). Biased item detection techniques. Journal
of Educational Statistics, 5, 213-233.
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on
the SAT. Journal of Educational Measurement, 27, 67-81.
Shealy, R. T., & Stout, W. F. (1993). An item response theory model for test bias and differential
test functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp.
197-239). Hillsdale, NJ: Lawrence Erlbaum Associates.
Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA:
Sage.
Steinberg, L. (1994). Context and serial-order effects in personality measurement: Limits on the
generality of measuring changes the measure. Journal of Personality and Social
Psychology, 66, 341-349.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic
regression procedures. Journal of Educational Measurement, 27, 361-370.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of
item bias. Psychological Bulletin, 99, 118-128.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using
the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential
item functioning (pp. 67–114). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of the statistics involved in
item response theory likelihood-ratio tests for differential item functioning [Computer
software documentation]. Chapel Hill: L. L. Thurstone Psychometric Laboratory,
University of North Carolina.
Van der Linden, W. J., & Hambleton, R. K. (1996). Handbook of modern item response theory.
New York, NY: Springer-Verlag.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and
detection. Journal of Educational Measurement, 28, 197-219.
Wang, X.-B., Wainer, H., & Thissen, D. (1995). On the viability of some untestable assumptions
in equating exams that allow examinee choice. Applied Measurement in Education, 8, 211-
225.
Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning.
Applied Psychological Measurement, 33, 42-57.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P.W.
Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–347). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3 [Computer
software]. Lincolnwood, IL: Scientific Software International.
APPENDICES
A. IRTPRO Input File for DIF Detection for Two Groups with 3PL
Project:
Name = WALL;
Data:
File = .\WALL.ssig;
Analysis:
Name = 3PL;
Mode = Calibration;
Title:
Master Thesis 3PL DIF
Comments:
3PL models fitted to each of the 79 items.
Estimation:
Method = BAEM;
E-Step = 500, 1e-005;
SE = S-EM;
M-Step = 50, 1e-006;
Quadrature = 49, 6;
SEM = 0.001;
SS = 1e-005;
Scoring:
Mean = 0;
SD = 1;
Miscellaneous:
Decimal = 2;
Processors = 2;
Print CTLD, P-Nums, Diagnostic;
Min Exp = 1;
Groups:
Variable = group;
Group G1:
Value = (1);
Dimension = 1;
Items = Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15,
Q16, Q17, Q18, Q19, Q20, Q21, Q22, Q23, Q24, Q25, Q26, Q27, Q28, Q29, Q30,
Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43, Q44, Q45,
Q46, Q47, Q48, Q49, Q50, Q51, Q52, Q53, Q54, Q55, Q56, Q57, Q58, Q59, Q60,
Q61, Q62, Q63, Q64, Q65, Q66, Q67, Q68, Q69, Q70, Q71, Q72, Q73, Q74, Q75,
Q76, Q77, Q78, Q79;
Codes(Q1) = 0(0), 1(1);
Codes(Q2) = 0(0), 1(1);
⋮
Codes(Q78) = 0(0), 1(1);
Codes(Q79) = 0(0), 1(1);
Model(Q1) = 3PL;
Model(Q2) = 3PL;
⋮
Model(Q78) = 3PL;
Model(Q79) = 3PL;
Referenced;
Mean = 0.0;
Covariance = 1.0;
Group G2:
Value = (2);
Dimension = 1;
Items = Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15,
Q16, Q17, Q18, Q19, Q20, Q21, Q22, Q23, Q24, Q25, Q26, Q27, Q28, Q29, Q30,
Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43, Q44, Q45,
Q46, Q47, Q48, Q49, Q50, Q51, Q52, Q53, Q54, Q55, Q56, Q57, Q58, Q59, Q60,
Q61, Q62, Q63, Q64, Q65, Q66, Q67, Q68, Q69, Q70, Q71, Q72, Q73, Q74, Q75,
Q76, Q77, Q78, Q79;
Codes(Q1) = 0(0), 1(1);
Codes(Q2) = 0(0), 1(1);
⋮
Codes(Q78) = 0(0), 1(1);
Codes(Q79) = 0(0), 1(1);
Model(Q1) = 3PL;
Model(Q2) = 3PL;
⋮
Model(Q78) = 3PL;
Model(Q79) = 3PL;
Mean = Free;
Covariance = Free;
DIF All:
Constraints:
Equal = (G1, Q1, Slope[0]), (G2, Q1, Slope[0]);
Equal = (G1, Q1, Intercept[0]), (G2, Q1, Intercept[0]);
Equal = (G1, Q1, Guessing[0]), (G2, Q1, Guessing[0]);
Equal = (G1, Q2, Slope[0]), (G2, Q2, Slope[0]);
Equal = (G1, Q2, Intercept[0]), (G2, Q2, Intercept[0]);
Equal = (G1, Q2, Guessing[0]), (G2, Q2, Guessing[0]);
⋮
Equal = (G1, Q78, Slope[0]), (G2, Q78, Slope[0]);
Equal = (G1, Q78, Intercept[0]), (G2, Q78, Intercept[0]);
Equal = (G1, Q78, Guessing[0]), (G2, Q78, Guessing[0]);
Equal = (G1, Q79, Slope[0]), (G2, Q79, Slope[0]);
Equal = (G1, Q79, Intercept[0]), (G2, Q79, Intercept[0]);
Equal = (G1, Q79, Guessing[0]), (G2, Q79, Guessing[0]);
Priors:
(G1, Q1, Slope[0]) = Lognormal, 0, 1;
(G1, Q1, Intercept[0]) = Normal, 0, 3;
(G1, Q1, Guessing[0]) = Beta, 4, 16;
(G1, Q2, Slope[0]) = Lognormal, 0, 1;
(G1, Q2, Intercept[0]) = Normal, 0, 3;
(G1, Q2, Guessing[0]) = Beta, 4, 16;
⋮
(G1, Q78, Slope[0]) = Lognormal, 0, 1;
(G1, Q78, Intercept[0]) = Normal, 0, 3;
(G1, Q78, Guessing[0]) = Beta, 4, 16;
(G1, Q79, Slope[0]) = Lognormal, 0, 1;
(G1, Q79, Intercept[0]) = Normal, 0, 3;
(G1, Q79, Guessing[0]) = Beta, 4, 16;
(G2, Q1, Slope[0]) = Lognormal, 0, 1;
(G2, Q1, Intercept[0]) = Normal, 0, 3;
(G2, Q1, Guessing[0]) = Beta, 4, 16;
(G2, Q2, Slope[0]) = Lognormal, 0, 1;
(G2, Q2, Intercept[0]) = Normal, 0, 3;
(G2, Q2, Guessing[0]) = Beta, 4, 16;
⋮
(G2, Q78, Slope[0]) = Lognormal, 0, 1;
(G2, Q78, Intercept[0]) = Normal, 0, 3;
(G2, Q78, Guessing[0]) = Beta, 4, 16;
(G2, Q79, Slope[0]) = Lognormal, 0, 1;
(G2, Q79, Intercept[0]) = Normal, 0, 3;
(G2, Q79, Guessing[0]) = Beta, 4, 16;
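For reference, the Slope, Intercept, and Guessing parameters constrained above are IRTPRO's slope-intercept parameterization of the 3PL model. The following sketch (not part of the IRTPRO run; the item parameter values are hypothetical) shows how a response probability is computed from these parameters and how the intercept maps to the traditional difficulty parameter b = -c/a:

```python
import math

def p_3pl(theta, slope, intercept, guessing):
    """Probability of a correct response under the 3PL model in
    slope-intercept form: P = g + (1 - g) / (1 + exp(-(a*theta + c)))."""
    return guessing + (1.0 - guessing) / (1.0 + math.exp(-(slope * theta + intercept)))

def to_traditional_b(slope, intercept):
    """Convert the intercept c to the traditional difficulty b = -c/a."""
    return -intercept / slope

# Hypothetical item: a = 1.2, c = -0.6, g = 0.2
print(round(p_3pl(0.0, 1.2, -0.6, 0.2), 3))   # probability at theta = 0
print(to_traditional_b(1.2, -0.6))            # difficulty b = 0.5
```

Under this parameterization, each Equal constraint above forces the two groups to share one of these three quantities for a given item; freeing them allows the item's curve to differ by group.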
B. BILOG-MG 3 Input File for DIF Detection for Two Groups with 3PL
Master Thesis
All Races 3PL DIF
>COMMENT
An empirical comparison of the three programs is presented using the fall 2010 data of the
GHSGPT. This study detects DIF for different ethnicities only in social studies,
which consists of 79 dichotomously scored items.
>GLOBAL DFName = 'D:\Thesis\Result\BL\WALL\WALL.1.dat',
NPArm = 3;
>LENGTH NITems = (79);
>INPUT NTOtal = 79,
NIDchar = 4,
NGRoup = 4,
DIF;
>ITEMS ;
>TEST1 TNAme = 'WALL3PL',
INUmber = (1(1)79);
>GROUP1 GNAme = 'WRFGROUP',
LENgth = 79,
INUmbers = (1(1)79);
>GROUP2 GNAme = 'BFCGROUP',
LENgth = 79,
INUmbers = (1(1)79);
(4A1, 4X, I1, 4X, 79A1)
>CALIB CRIt = 0.0050,
PLOt = 1.0000,
ACCel = 1.0000,
TPRIOR;
>SCORE ;
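The format statement (4A1, 4X, I1, 4X, 79A1) in the command file above describes the layout of each record in the data file: a 4-character examinee ID, four skipped columns, a one-digit group code, four more skipped columns, and 79 one-character item responses. A minimal sketch of reading one such record (the sample record below is hypothetical, not taken from the GHSGPT data):

```python
def parse_record(line):
    """Parse one fixed-width record laid out as (4A1, 4X, I1, 4X, 79A1):
    columns 1-4 = ID, 5-8 skipped, 9 = group code, 10-13 skipped,
    14-92 = 79 dichotomous item responses."""
    examinee_id = line[0:4]
    group = int(line[8])
    responses = [int(ch) for ch in line[13:92]]
    return examinee_id, group, responses

# Hypothetical 92-column record: ID "0001", group 1, then 79 responses.
record = "0001    1    " + "1" * 40 + "0" * 39
eid, group, resp = parse_record(record)
print(eid, group, len(resp), sum(resp))   # 79 items, 40 correct
```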
C. IRTLRDIF Input File for DIF Detection for Two Groups with 3PL
2654
79
111111111111111111111111111111111111111111111111111111111111111111111111111111
1
WBLR.dat
4
1
5-83
WBLR3PL.out
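For context, the likelihood-ratio test that IRTLRDIF performs compares, for each studied item, a model constraining the item's parameters to be equal across the reference and focal groups against an augmented model that frees them; twice the difference in log-likelihoods is referred to a chi-square distribution with degrees of freedom equal to the number of freed parameters (three per item under the 3PL). A minimal sketch of that final computation (the log-likelihood values below are hypothetical, not results from this study):

```python
def lr_dif_statistic(loglik_constrained, loglik_augmented):
    """G^2 = -2 * (logL_constrained - logL_augmented), where the constrained
    model equates the studied item's parameters across the two groups."""
    return -2.0 * (loglik_constrained - loglik_augmented)

# Hypothetical log-likelihoods from two nested calibrations of one item.
g2 = lr_dif_statistic(-41235.6, -41230.1)
critical = 7.815   # chi-square critical value for df = 3, alpha = .05
print(round(g2, 1), g2 > critical)   # G^2 = 11.0, so DIF would be flagged
```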