
Improving Classroom Multiple-Choice Tests: A Worked Example Using Statistical Criteria

Judith R. Emslie and Gordon R. Emslie

Department of Psychology

Ryerson University, Toronto

Copyright © 2005 Emslie


Abstract

The first draft of a classroom multiple-choice test is an object in need of improvement

(T. M. Haladyna, 1999; S. J. Osterlind, 1998). Teachers who build each test anew fail to

capitalize on their previous efforts. This paper describes an iterative approach and provides

specific numerical criteria for item and test adequacy. Indices are selected for aptness, simplicity,

and mutual compatibility. A worked example, suitable for teachers of all disciplines,

demonstrates the process of raising test quality to a level that meets student needs. Reuse after

refinement does not threaten test security.

Keywords: multiple-choice tests; teacher-made tests; item analysis; psychometrics; test construction; teacher education


Ellsworth, Dunnell, and Duell (1990) present 37 guidelines for writing multiple-choice

items. They find that approximately 60% of the items published in instructor guides for

educational psychology texts violate one or more item-writing guidelines. Hansen and Dexter

(1997) report that, by the same criterion, 75% of auditing (accounting) test bank items are faulty.

And in practice, “item writers can expect that about 50% of their items will fail to perform as

intended” (Haladyna, 1999, p. 214). It is safe to assume that teacher constructed items fare no

better. Therefore, a classroom multiple-choice test is an object in need of improvement.

Following item-writing advice helps but most guidelines are based on expert opinion rather than

empirical evidence or theory (Haladyna, 1999; Osterlind, 1998). Even a structurally sound item

might be functionally flawed. For example, the item might be too difficult or off topic.

Therefore, teachers must assume personal responsibility for test quality. They must assess item

functioning in the context of their test’s intended domain and target population. Teachers must

conduct an iterative functional analysis of the test. They must scrutinize, refine, and reuse items.

They must not assume any collection of likely looking items will suffice. Psychometric

interpretation guidelines are available in classic texts (e.g., Crocker & Algina, 1986; Cronbach,

1990; Magnusson, 1966/1967; Nunnally, 1978). However, the information is not in a brief,

pragmatic form with unequivocal criteria for item acceptance or rejection. Consequently,

teachers neglect statistical information that could improve the quality of their multiple-choice

tests. This paper provides “how to” information and a demonstration test using artificial data. It

is of use to teachers—of all disciplines—particularly those awed by statistical concepts or new to

multiple-choice testing. Experienced teachers might find it useful for distribution to teaching

assistants or as a pedagogic aid. This paper de-emphasizes technical terminology without


misrepresenting “classical” psychometric concepts. (For simplicity, “modern” psychometric

theory relating to criterion-referenced items and tests is not considered.) To allow reproduction

of the entire data set, statistical ideals are relaxed (e.g., the demonstration test length and student

sample size are unrealistically small). Teachers can customize the information by processing the

data through their own test scoring systems.

This paper enumerates criteria for the interpretation of elementary indices of item and test

quality. Where appropriate, letters in parentheses signify alternative or related expressions of the

given criterion. The specific numerical values of the indices are selected for aptness, simplicity,

and compatibility. A tolerable, not ideal, standard of acceptability is assumed. The demonstration

test—based on knowledge not specific to any discipline—provides a medium for (a) describing

the indices, (b) explaining the rationale for these indices, and (c) illustrating the process of test

refinement.

A Demonstration Multiple-Choice Test

A Light-hearted Illustrative Classroom Evaluation (ALICE, see Table 1) about Lewis

Carroll's literary work (1965a, 1965b) is administered to ten hypothetical students. The scored

student responses are given in Table 2.

Item Acceptability

Criterion 1

(a) An item is acceptable if its pass rate, p, is between .20 and .80.

(b) An item is acceptable if its failure rate, q, is between .20 and .80.

(c) An item is acceptable if its variance (pq) is .16 or higher.

A basic assumption in testing is that people differ. Therefore, the first requirement of a test item


Table 1

The Demonstration Test

______________________________________________________________________________

Instructions: Answer the following questions about Lewis Carroll’s literary work.

1. When Alice met Humpty Dumpty, he was sitting on a

*a. wall. b. horse. c. king. d. chair. e. tove.

2. Alice's prize in the caucus-race was a

a. penny. *b. thimble. c. watch. d. beheading. e. glove.

3. Tweedledum and Tweedledee's battle was halted by a

a. sheep. b. lion. *c. crow. d. fawn. e. walrus.

4. The balls used in the Queen's croquet game were

a. serpents. b. teacups. c. cabbages. *d. hedgehogs. e. apples.

5. The Mock Turtle defined uglification as a kind of

a. raspberry. b. calendar. c. envelope. d. whispering. *e. arithmetic.

6. The White Queen was not able to

*a. think. b. knit. c. subtract. d. sew. e. calculate.

7. The Cook's tarts were mostly made of

a. barley. *b. pepper. c. treacle. d. camomile. e. vinegar.

8. In real life, Alice's surname was

a. Carroll. b. Hopeman. *c. Liddell. d. Dodgson. e. Lewis.

______________________________________________________________________________

Note. Asterisks are added to identify targeted alternatives (correct answers).


Table 2

Scored Student Responses and Indices of Item Quality

______________________________________________________________________________

                                     Item number
              ________________________________________________________________

Student           1       2       3       4       5       6       7       8
______________________________________________________________________________

Top half of the class: High scorers on the test

Ann              a ✓     b ✓     c ✓     d ✓     e ✓     a ✓     b ✓     b
Bob              a ✓     b ✓     a       d ✓     NR      a ✓     b ✓     c ✓
Cam              a ✓     b ✓     e       d ✓     e ✓     c       b ✓     a
Don              a ✓     b ✓     c ✓     d ✓     b       e       b ✓     d
Eve              NR      b ✓     d       d ✓     e ✓     R>1     b ✓     c ✓

Bottom half of the class: Low scorers on the test

Fay              a ✓     b ✓     c ✓     a       c       a ✓     c       d
Guy              a ✓     e       c ✓     b       NR      a ✓     b ✓     b
Hal              a ✓     b ✓     b       c       NR      a ✓     b ✓     a
Ian              a ✓     c       c ✓     e       d       a ✓     c       e
Joy              a ✓     a       c ✓     e       a       R>1     c       e

Index             1       2       3       4       5       6       7       8
______________________________________________________________________________

pᵃ               .90¹    .70     .60     .50     .30     .60     .70     .20
pqᵇ              .09¹    .21     .24     .25     .21     .24     .21     .16
ptopᶜ            .80    1.00     .40    1.00     .60     .40    1.00     .40
pbottomᵈ        1.00     .40     .80     .00     .00     .80     .40     .00
rᵉ              -.33²    .49    -.57²    .60     .26²   -.21²    .49     .08²
______________________________________________________________________________

Note. The letters a through e = alternative selected; ✓ = correct response (unmarked responses
are incorrect); NR = no response (omission); R>1 = more than one response (multiple response).
The psychometric shortcomings of the demonstration test are indicated by numerical superscripts
corresponding to criteria discussed in this paper.

ᵃPass rate for the entire class (N = 10), the proportion of students answering the item correctly.
ᵇItem variance. ᶜPass rate for the top half of the class. ᵈPass rate for the bottom half of the
class. ᵉItem-test correlation.


is that not all respondents give the same answer. The item must differentiate. A multiple-choice

item distinguishes two groups, those who pass and those who fail. The pass rate is p, the

proportion of students that selects the target (correct answer). The failure rate is q, or 1 - p, the

proportion that fails to answer, gives multiple answers, or selects a decoy (incorrect answers).

Differentiation is highest when the pass and fail groups are equal and decreases as these groups

diverge in size. For example, if five of ten students pass (p = .5) and five fail (q = .5), then each

passing student is differentiated from each failing student and there are 5 x 5 = 25

discriminations. If everyone passes and no one fails (p = 1, q = 0), there are 10 x 0 = 0

discriminations. Differentiation becomes inadequate if one group is more than four times the size

of the other. Therefore, both p and q should be within the range .2 to .8. It follows that pq should

be no lower than .2 x .8 = .16. The product pq is the item variance (often labeled s² or VAR).

Summing the check marks in the columns of Table 2 gives the number of students who

answer each question correctly. The column sum divided by the total number of students gives

the item pass rate. Item 1, where Humpty Dumpty sat, is too easy, p = .9. Consequently, its

variance is unacceptably low, pq = .9 x .1 = .09. Unless this item can be made more difficult, it

should be dropped.
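
Teachers whose scoring systems do not report these indices can compute them directly. The following sketch, written in Python purely for illustration, derives p, q, and pq for each ALICE item from the scored responses of Tables 1 and 2 (1 = correct, 0 = incorrect) and flags violations of Criterion 1.

    # Criterion 1: pass rate (p), failure rate (q), and item variance (pq).
    # Scored responses keyed from Tables 1 and 2 (1 = correct, 0 = incorrect);
    # rows are the students Ann through Joy, columns are items 1 through 8.
    SCORES = [
        [1, 1, 1, 1, 1, 1, 1, 0],  # Ann
        [1, 1, 0, 1, 0, 1, 1, 1],  # Bob
        [1, 1, 0, 1, 1, 0, 1, 0],  # Cam
        [1, 1, 1, 1, 0, 0, 1, 0],  # Don
        [0, 1, 0, 1, 1, 0, 1, 1],  # Eve
        [1, 1, 1, 0, 0, 1, 0, 0],  # Fay
        [1, 0, 1, 0, 0, 1, 1, 0],  # Guy
        [1, 1, 0, 0, 0, 1, 1, 0],  # Hal
        [1, 0, 1, 0, 0, 1, 0, 0],  # Ian
        [1, 0, 1, 0, 0, 0, 0, 0],  # Joy
    ]

    n_students = len(SCORES)
    n_items = len(SCORES[0])

    for item in range(n_items):
        p = sum(row[item] for row in SCORES) / n_students   # pass rate
        q = 1 - p                                           # failure rate
        acceptable = 0.20 <= p <= 0.80                      # equivalently, pq >= .16
        print(f"Item {item + 1}: p = {p:.2f}, q = {q:.2f}, pq = {p * q:.2f}"
              + ("" if acceptable else "  <- fails Criterion 1"))

Running the sketch reproduces the p and pq rows of Table 2 and flags only item 1.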

Criterion 2

An item is acceptable if its item-test correlation (r) is positive and .30 or higher.

It is not enough that an item differentiates; it must differentiate appropriately. Well-informed

students should pass and uninformed students should fail the item. If an item is so easy that only

one person answers incorrectly, that person should be the one with least knowledge. Item 1 fails

in this respect. Joy has the least knowledge (the lowest ALICE score) but the student who fails


this item is Eve, a high scorer.

In Table 2 the students are ranked in descending order according to their total ALICE

scores. Correct responses should be more frequent in the top half of the table (i.e., in the top half

of the class) than in the bottom half. This is not the case for items 1, 3, and 6. For example, only

two of the top five students pass item 6 but four of the bottom five students get it correct. The

pass rates calculated separately for the top and bottom halves of the class (ptop and pbottom)

confirm that the “wrong” students get items 1, 3, and 6 correct. Students who know nothing

about Lewis Carroll’s work but know their nursery rhymes could answer the Humpty Dumpty

question (item 1). For item 3, what halted Tweedledum and Tweedledee’s battle, perhaps

students selected the target, c. crow, not through knowledge but because they chose the

exception—the only non-mammal—or because they followed the adage, “when in doubt, choose

c”. In item 6, the White Queen’s inability, alternative c. subtract is a more accurate description

than the teacher’s target, a. think. The White Queen sometimes was unable to think but always

was unable to subtract. Item 6 is mis-keyed. Off with the teacher’s head!

The relationship between passing or failing an item and doing well or doing poorly on the

test as a whole is assessed by an item-test correlation coefficient (r). The possible range is from

-1 to +1. A mid-range value (-.30 < r < +.30) indicates that the relationship is weak or absent.

The higher the positive correlation, the stronger the tendency for students who do well on an

item to also do well on the test as a whole (appropriate differentiation). The more negative the

correlation, the stronger the evidence that passing the item is associated with a low score on the

test (inappropriate differentiation). In Table 2 negative correlation coefficients flag items 1, 3,

and 6 for attention as anticipated. More to the point, the correlation coefficient detects


weaknesses overlooked by the pass/fail rate statistics. Items 5 and 8 differentiate appropriately

but insufficiently. The correlation of item 8, Alice’s real life surname, with the test is practically

zero. It should be deleted. The teacher should consider rewording item 5, the definition of

uglification. Perhaps a more homogeneous set of alternatives (e.g., all school subjects) would

increase its item-test correlation.

Bear in mind that an unsatisfactory item-test correlation indicates a problem either in the

item or in the test (or both). This index is a measure of item quality only if the total test score is

meaningful (see General Remarks below).
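
The computation is easy to script. The r values in Table 2 appear to be corrected item-total correlations, that is, each item (scored 0 or 1) correlated with the total of the remaining items; the following sketch, again in Python and purely illustrative, assumes that convention. A scoring system that correlates each item with the full total will give somewhat different values.

    # Criterion 2: item-test correlation (r), computed here as the Pearson
    # correlation between each item and the total of the remaining items.
    from statistics import pstdev

    SCORES = [  # 1 = correct, 0 = incorrect; students Ann through Joy, items 1-8
        [1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 0, 1, 1, 1],
        [1, 1, 0, 1, 1, 0, 1, 0], [1, 1, 1, 1, 0, 0, 1, 0],
        [0, 1, 0, 1, 1, 0, 1, 1], [1, 1, 1, 0, 0, 1, 0, 0],
        [1, 0, 1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0],
    ]

    def item_test_r(scores, item):
        x = [row[item] for row in scores]                   # item scores
        rest = [sum(row) - row[item] for row in scores]     # total minus the item
        n = len(scores)
        mx, mr = sum(x) / n, sum(rest) / n
        cov = sum((a - mx) * (b - mr) for a, b in zip(x, rest)) / n
        return cov / (pstdev(x) * pstdev(rest))

    for item in range(8):
        r = item_test_r(SCORES, item)
        print(f"Item {item + 1}: r = {r:+.2f}"
              + ("" if r >= 0.30 else "  <- fails Criterion 2"))

The output reproduces the r row of Table 2 and flags items 1, 3, 5, 6, and 8.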

Criterion 3

An item with d decoys is acceptable if 20/d% to 80/d% of the students select each decoy.

An efficient item has effective decoys; that is, wrong answers that are sufficiently plausible as to

be selected by uninformed students. Effectiveness is maximal when the incorrect responses are

evenly distributed across the decoys (showing that none is redundant). By Criterion 1, an item

failure rate between 20% and 80% is acceptable. In ALICE, the number of decoys (d) per item is

4. Therefore, each wrong answer should be chosen by 20/4% to 80/4% of the students. With a

sample size of ten, frequencies below 1 or above 2 are outside the desired (5% to 20%) range.

The response frequencies are given in Table 3. Items 1, 2, 6, and 7 fail to meet the criterion. For

item 1, where Humpty Dumpty sat, the frequency of selection of all decoys is zero. But the

teacher should not direct test refinement efforts at the apparent violation before identifying the

root cause. The real problem here is this item’s high p value. An excessively easy target depletes

the response rate for the decoys. For item 2, no one selects the implausible decoy d. beheading as

the prize in the caucus-race. For item 6, the White Queen’s inability, the decoys b. knit and d.


Table 3

Frequency of Student Responses and Indices of Item Quality

______________________________________________________________________________

                                     Item number
              ________________________________________________________________

Responseᵃ         1       2       3       4       5       6       7       8
______________________________________________________________________________

a                *9       1       1       1       1      *6       0³      2
b                 0³     *7       1       1       1       0³     *7       2
c                 0³      1      *6       1       1       1       3³     *2
d                 0³      0³      1      *5       1       0³      0³      2
e                 0³      1       1       2      *3       1       0³      2
NR                1⁴      0       0       0       3⁴      0       0       0
R>1               0       0       0       0       0       2⁴      0       0
______________________________________________________________________________

Note. Ten students wrote the test. For each item, the frequency of selection of the correct
response is marked with an asterisk. The psychometric shortcomings of the demonstration test are
indicated by numerical superscripts corresponding to criteria discussed in this paper.

ᵃThe letters a through e = option chosen; NR = no response (omission); R>1 = more than one
response (multiple response).


sew are never selected. These decoys are highly related and therefore it might seem that either

both are correct or neither is. In item 7, the Cook’s tarts, only one of the four decoys is ever

chosen. Consequently, the pass rate is inflated by guessing. The teacher should generate

plausible new decoys and re-evaluate these items.
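
The decoy tallies of Table 3 and the Criterion 3 screen can likewise be generated from the raw responses. The sketch below (Python, illustrative only) uses the answer key of Table 1 and the responses of Table 2, and flags any decoy chosen by fewer than 20/d% or more than 80/d% of the class.

    # Criterion 3: decoy (distractor) effectiveness.
    from collections import Counter

    KEY = "abcdeabc"                                  # targets for items 1-8 (Table 1)
    RESPONSES = [                                     # students Ann through Joy (Table 2)
        ["a", "b", "c", "d", "e", "a", "b", "b"],     # Ann
        ["a", "b", "a", "d", "NR", "a", "b", "c"],    # Bob
        ["a", "b", "e", "d", "e", "c", "b", "a"],     # Cam
        ["a", "b", "c", "d", "b", "e", "b", "d"],     # Don
        ["NR", "b", "d", "d", "e", "R>1", "b", "c"],  # Eve
        ["a", "b", "c", "a", "c", "a", "c", "d"],     # Fay
        ["a", "e", "c", "b", "NR", "a", "b", "b"],    # Guy
        ["a", "b", "b", "c", "NR", "a", "b", "a"],    # Hal
        ["a", "c", "c", "e", "d", "a", "c", "e"],     # Ian
        ["a", "a", "c", "e", "a", "R>1", "c", "e"],   # Joy
    ]

    n = len(RESPONSES)
    d = 4                                             # decoys per item
    low, high = 0.20 / d, 0.80 / d                    # 5% to 20% of the class per decoy

    for item, target in enumerate(KEY):
        counts = Counter(row[item] for row in RESPONSES)
        for decoy in "abcde".replace(target, ""):
            rate = counts.get(decoy, 0) / n
            if not (low <= rate <= high):
                print(f"Item {item + 1}: decoy '{decoy}' chosen by {rate:.0%} of students"
                      "  <- fails Criterion 3")

The output flags items 1, 2, 6, and 7, as in the discussion above.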

Criterion 4

An item is acceptable if fewer than 5% of the students omit answers or give multiple answers.

An important requirement is that the item presents a well-structured task. Obviously, the clarity

of the items should be assessed during test construction. But unforeseen problems are revealed

when a sizable proportion of the students (5% or more) fails to respond, or gives multiple

answers (or both). With a sample size of 10, frequencies of 1 or above are outside the acceptable

range. Items 1 and 5 show excessive omissions (see Table 3). Maybe the high scorer Eve omitted

item 1, where Humpty Dumpty sat, because she thought the answer, a. wall, so obvious that it

must be a trick question. For item 5, perhaps poorly prepared students simply gave up. For them

the concocted word, uglification, had no associated thoughts. The teacher could try new decoys

based on uglification’s resemblance to English words such as ugliness and nullification. (See

Criterion 2 for an alternative strategy.) The frequency of omissions for items 1 and 5 suggests

that some students feared they would be penalized for an incorrect guess. Item 6 has too many

multiple answers. The stem might be confusing because it contravenes recommended practice:

word the stem positively or at least emphasize negative words by capitalizing or underlining. The

alternatives might be confusing because they overlap in meaning (knit with sew, subtract with

calculate, and calculate with think). Puzzled, students left more than one alternative marked.

Perhaps these students assumed they would receive partial credit. The teacher should rewrite the


item with clear-cut alternatives.
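
This screen is equally mechanical. A brief sketch (Python, illustrative only, working from the omission and multiple-response counts already shown in Table 3) flags any item where 5% or more of the class omits the item or marks more than one alternative.

    # Criterion 4: omissions (NR) and multiple responses (R>1), per item.
    N = 10                                    # class size
    NR = [1, 0, 0, 0, 3, 0, 0, 0]             # omissions, items 1-8 (Table 3)
    MULTI = [0, 0, 0, 0, 0, 2, 0, 0]          # multiple responses, items 1-8 (Table 3)

    for item, (nr, multi) in enumerate(zip(NR, MULTI), start=1):
        if nr / N >= 0.05 or multi / N >= 0.05:
            print(f"Item {item}: {nr} omission(s), {multi} multiple response(s)"
                  "  <- fails Criterion 4")

Items 1, 5, and 6 are flagged.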

Test Acceptability

Criterion 5

(a) A test is acceptable if the internal consistency coefficient (K-R 20) is at least .64.

(b) A test is acceptable if the correlation between observed and true scores is at least .80.

(c) A test is acceptable if the true variance proportion is at least .64 of the total.

(d) A test is acceptable if the error variance proportion is no more than .36 of the total.

Any test of more than one item requires the assumption that item scores can be added to produce

a meaningful single test score. In an internally consistent test all the items work in unison to

produce a stable assessment of student performance. When the test is internally inconsistent,

performance varies markedly according to the particular items considered. Someone who excels

in one part of the test might do very badly, do well, or be average on another part. The test gives

a mixed message. The additivity requirement applies when all the items in the test are intended

to measure the same topic. If the test covers more than one topic, each topic is considered a test

in its own right (see General Remarks below).

In Table 2, Bob (the second highest ALICE scorer) gets 50% of the odd numbered items

correct and 100% of the even numbered items correct. Joy (the lowest scorer) also gets 50% of

the odd numbered items correct but 0% of the even numbered questions. These two students are

equally knowledgeable in terms of the odd items, but at opposite ends of the knowledge

spectrum in terms of the even items. Similar discrepancies are present for other students and

other item groups. For example, Don gets 100% in the first half of the test but only 25% in the

second half.


The Kuder-Richardson formula 20 (K-R 20) is a measure of internal consistency.

Theoretical values range from zero, each item measures something different from every other

item, to +1, all the items measure the same thing. An internal consistency coefficient of .64 or

higher is adequate. The K-R 20 for ALICE is .15 (see Table 4). Therefore, the eight items do not

form a coherent set. However, this does not preclude the possibility of one or more coherent

subsets of items.
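
For teachers who wish to verify the coefficient themselves, K-R 20 equals k/(k - 1) multiplied by (1 - the sum of the item variances divided by the total-score variance), where k is the number of items and the variances are taken over the whole class. The sketch below, in Python and intended only as an illustration, reproduces the .15 reported in Table 4 from the scored responses of Table 2.

    # Kuder-Richardson formula 20: k/(k-1) * (1 - sum of item variances / total variance).
    from statistics import pvariance

    SCORES = [  # 1 = correct, 0 = incorrect; students Ann through Joy, items 1-8
        [1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 0, 1, 1, 1],
        [1, 1, 0, 1, 1, 0, 1, 0], [1, 1, 1, 1, 0, 0, 1, 0],
        [0, 1, 0, 1, 1, 0, 1, 1], [1, 1, 1, 0, 0, 1, 0, 0],
        [1, 0, 1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0],
    ]

    def kr20(scores):
        k = len(scores[0])
        totals = [sum(row) for row in scores]
        sum_pq = sum(pvariance([row[i] for row in scores]) for i in range(k))
        return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

    print(f"K-R 20 = {kr20(SCORES):.2f}")     # .15 for the eight ALICE items

Deleting the third column from every row (item 3) and rerunning the function gives the .52 quoted under General Remarks; the four Wonderland columns alone (items 2, 4, 5, and 7) give the .84 quoted there as well.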

The test score is an estimate of the student’s true knowledge. The student’s true score is

unknowable but, in a statistical sense, it is the mean of the scores that would be obtained if the

individual were tested repeatedly. The positive square root of the test's internal consistency

coefficient (K-R 20) estimates the correlation between students' observed and true scores. A test

is acceptable if this correlation estimate is at least √.64 = .80. For ALICE, the estimated

correlation between observed and true scores is √.15 = .39. The observed scores are

insufficiently related to the students' true scores.

When an achievement test is administered to a group of students, the spread of scores (the

total test variance) is determined in part by genuine differences in the students' knowledge (true

variance) and in part by inadequacies of measurement (error variance). The K-R 20 coefficient

represents the proportion attributable to true variance. Therefore, the error proportion is

1 - (K-R 20). A test is acceptable if the proportion of true variance is at least .64 and,

equivalently, if the proportion of error variance is no more than .36 of the total variance. ALICE

fails to reach the required standard. Its true variance proportion is .15. The error proportion is

.85.

To convert to the measurement units of a particular test, the true and error proportions are


multiplied by the total test variance. ALICE has a total variance of 1.85 (see Table 4). Therefore

in raw score units, the true variance is .15 (1.85) = .28 and the error variance is

(1 - .15) (1.85) = 1.57.

The square root of the error variance—the standard error of measurement (SEM)—

measures the margin of error in assessing an individual student's true score. The probability is

approximately 95% that the student's true score lies within 2 SEM of the obtained score. The

SEM for ALICE is √1.57 = 1.26. For Fay, whose ALICE score is 4, the margin of error is

4 ± 2(1.26). That is, her true ALICE score might be anywhere between 1.48 and 6.52—in effect

anywhere between 1 and 7. Given that there are only eight items, the imprecision is obvious.
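
The variance partition and the margin of error can be scripted in the same way. The sketch below (Python, illustrative only) starts from the summary values of Tables 2 and 4; because it carries K-R 20 at full precision, its second decimals can differ slightly from the rounded values quoted above.

    # True variance, error variance, SEM, and an approximate 95% band for a true score.
    from math import sqrt

    k = 8
    sum_pq, total_var = 1.61, 1.85                       # from Tables 2 and 4
    kr20 = (k / (k - 1)) * (1 - sum_pq / total_var)      # about .15
    true_var = kr20 * total_var                          # about .28 in raw-score units
    error_var = (1 - kr20) * total_var                   # about 1.57
    sem = sqrt(error_var)                                # about 1.26

    obtained = 4                                         # Fay's ALICE score
    low, high = obtained - 2 * sem, obtained + 2 * sem   # the text's 1.48 to 6.52 uses SEM rounded to 1.26
    print(f"SEM = {sem:.2f}; Fay's true score lies between {low:.2f} and {high:.2f} "
          f"with roughly 95% confidence")

The printed band, roughly 1.5 to 6.5, matches the text's 1.48 to 6.52 apart from rounding.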

General Remarks

Test scoring programs generally provide the foregoing information in a composite

printout such as shown for ALICE in Table 4. Only item 4, croquet balls, passes muster. The

other items must be modified or dropped. An indication of the power of test refinement is that

the mere omission of the worst item in terms of item-test correlation (item 3, Tweedledum and

Tweedledee) raises the ALICE K-R 20 coefficient from .15 to .52. But there are additional

problems.

First, the teacher’s directions do not fully disclose the test requirements (see Table 1). As

a result, individual differences in students’ willingness to omit, guess, or give multiple answers

contribute to error variance. Better instructions, such as the following, might prevent items 1, 5, and 6

from running afoul of Criterion 4 (omissions and multiple answers):

Answer all the questions. For each item, circle the single best alternative. There is

no penalty for wrong answers. If you guess correctly you will receive one mark.


Table 4

Sample Computer Printout and Indices of Item and Test Quality

Psychometric Analysis of the Alice Demonstration Test

_____________________________________________________________________________

                                     Item number
              _______________________________________________________________

                  1       2       3       4       5       6       7       8
_____________________________________________________________________________

Responseᵃ

a               *.90¹    .10     .10     .10     .10    *.60     .00³    .20
b                .00³   *.70     .10     .10     .10     .00³   *.70     .20
c                .00³    .10    *.60     .10     .10     .10     .30³   *.20
d                .00³    .00³    .10    *.50     .10     .00³    .00³    .20
e                .00³    .10     .10     .20    *.30     .10     .00³    .20
NR               .10⁴    .00     .00     .00     .30⁴    .00     .00     .00
R>1              .00     .00     .00     .00     .00     .20⁴    .00     .00

Index

pqᵇ              .09¹    .21     .24     .25     .21     .24     .21     .16
rᶜ              -.33²    .49    -.57²    .60     .26²   -.21²    .49     .08²
_____________________________________________________________________________


Test statistics                              Test indices

Mean = 4.50                                  Kuder-Richardson formula 20 = .15⁵
Variance = 1.85                              True variance = 0.28 = 15%⁵
Standard error of measurement = 1.26         Error variance = 1.57 = 85%⁵
Number of students = 10                      Observed, true score correlation = .39⁵
_____________________________________________________________________________

Note. Entries in the top part of the table are the proportional response frequencies for the test
items. For each item, the proportion of students selecting the correct alternative is marked with
an asterisk. The psychometric shortcomings of the test are indicated by numerical superscripts
corresponding to criteria discussed in this paper.

ᵃThe letters a through e = alternative selected; NR = no response (omission); R>1 = more than
one response (multiple response). ᵇVariance. ᶜItem-test correlation.


If you guess incorrectly you will neither gain nor lose marks. There is no credit

for multiple answers even if one of them is correct.

Second, the teacher failed to define the test domain clearly. The instructions state that the

test assesses students’ knowledge of Lewis Carroll’s literary work but do not specify which

works. Items 2, 4, 5, and 7 are from Alice's Adventures in Wonderland. Items 1, 3, and 6 are from

Through the Looking-Glass and What Alice Found There. Item 8, the real Alice’s surname, is

biographical and does not relate to Carroll’s literary work. The assumption of a single domain is

suspect. Perhaps ALICE is three tests. This partitioning is supported by the item-test correlation

pattern observed in Table 4. The four Wonderland items have positive item-test correlations. The

three Looking-Glass items all correlate negatively with ALICE. The biographical item has an

essentially zero correlation with ALICE. Therefore, the teacher should analyze the Wonderland

and Looking-Glass items as independent tests. (This new information means that the item-test

correlations and the K-R 20 for the original ALICE are inappropriate.)

Even without modification, the four Wonderland items make an internally consistent test

(K-R 20 = .84). The three Looking-Glass items show promise (K-R 20 = .54). Students who do

well on Wonderland items tend to do poorly on Looking-Glass items (the correlation between

tests is -.57). It seems that the requirement to read both books was inadequately communicated

or misunderstood. Curiouser and curiouser!

If the overall quality of the test (or tests) is still unacceptable after modification of the

existing items, the next step is to write new items. Additional items generally increase a test’s

internal consistency. Estimate the required test length by multiplying the current test length by

the quotient (D - CD)/(C - CD) where C is the consistency coefficient obtained for the Current


test and D is the consistency coefficient Desired for the new test. Therefore, to upgrade the

Looking-Glass test from a current K-R 20 of .54 to a desired K-R 20 of .64, the new test must be

about (.64 - .54 x .64)/(.54 - .54 x .64) = 1.5 times as long as the existing 3-item test. Five items

should suffice.
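
The lengthening arithmetic can also be scripted. The quotient (D - CD)/(C - CD) is the Spearman-Brown prophecy formula solved for test length; the sketch below (Python, illustrative only) applies it to the Looking-Glass subtest.

    # How much longer must a test be to raise its internal consistency from C to D?
    def lengthening_factor(current, desired):
        c, d = current, desired
        return (d - c * d) / (c - c * d)        # equivalently d*(1 - c) / (c*(1 - d))

    factor = lengthening_factor(0.54, 0.64)     # the Looking-Glass subtest
    new_length = 3 * factor                     # current length is 3 items
    print(f"Lengthening factor = {factor:.2f}; about {new_length:.1f} items needed")

The factor is about 1.5, so the three-item subtest must grow to about five items, as stated above.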

A final consideration in refining a test is to ensure that the item set as a whole has

desirable characteristics. Do the items cover the entire domain? Are there omitted topics or

redundancies? Is there an appropriate mix of factual and application items? Is the difficulty level

appropriate? Is the test structurally sound? ALICE violates several guidelines. For example, the

stems are in sentence completion form instead of question form and the alternatives are placed

horizontally instead of vertically (see Haladyna, 1999).

Refutations and Conclusions

A critic might argue that the reuse-after-refinement approach compromises test security.

“I return the tests to students for study purposes so I need a new test every time.” But assuming

that the teacher wants to encourage students to review conceptually rather than memorize

specific information, there is no requirement that the distributed review items should be the same

as those used in class tests. Besides, most of the items would be modified or replaced before the

test is reused.

Teachers who refuse to modify or reuse items can create their own items. But this is

inefficient because it fails to capitalize on previous work. Moreover, it places a heavy demand on

the teacher’s creativity and item-writing skills—abilities that are never examined.

Alternatively, the teacher can compose successive tests by selecting new items from

published test banks. But these teachers use items that have been and will be used by other


teachers. If their students interact, item security is compromised. Furthermore, a test bank is an

exhaustible supply of items of mediocre quality. If, as is likely, the teacher selects the “best bet”

items first, then subsequent tests will be of deteriorating quality. In sharp contrast, the reuse-

after-refinement approach generates tests of improving quality. Therefore, the major threat is not

reuse after refinement but reuse without refinement.

Teachers who argue that “the psychometric indices are important for commercial tests

but not for classroom tests” miss the point. There is a difference between relaxing the rigor and

abandoning the process. The level of processing might vary but all tests share the need for

refinement. Before testing, the teacher should write or select items according to item-writing

guidelines. After testing, the teacher should evaluate item performance according to the

psychometric indices. The requirement is to establish a classroom test of quality. The level of

quality must meet the needs of students and teacher, not those of commercial test publishers.

Extra vetting takes extra time but the investment is recouped by a fairer and more precise

assessment of student performance—surely the essential purpose of any test procedure.

To conclude, good multiple-choice tests are not likely to occur if teachers select questions

indiscriminately from published test banks or rely on their own first drafts of original items. An

iterative approach to test construction improves test quality.


References

Carroll, L. (1965a). Alice's adventures in wonderland (A Centennial Edition). New York:

Random House.

Carroll, L. (1965b). Through the looking-glass and what Alice found there (A Centennial

Edition). New York: Random House.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York:

Holt, Rinehart, & Winston.

Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: HarperCollins.

Ellsworth, R. A., Dunnell, P., & Duell, O. K. (1990). Multiple-choice test items: What are

textbook authors telling teachers? Journal of Educational Research, 83, 289-293.

Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.).

Mahwah, NJ: Erlbaum.

Hansen, J. D., & Dexter, L. (1997). Quality multiple-choice test questions: Item-writing

guidelines and an analysis of auditing testbanks. Journal of Education for Business, 73,

94-97.

Magnusson, D. (1967). Test theory (H. Mabon, Trans.). Reading, MA: Addison-Wesley. (Original

work published 1966)

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Osterlind, S. J. (1998). Constructing test items: Multiple-choice, constructed-response,

performance, and other formats (2nd ed.). Boston: Kluwer.