

Japan Society of English Language Education

NII-Electronic Library Service


The Effects of the Number of Options on Different Tasks in Multiple-Choice Listening Tests

Hideki IIMURA
Prefectural University of Kumamoto

Abstract

This study investigated the effects of different numbers of options (four-option vs. three-option) on different listening tasks (picture vs. dialogue vs. monologue). The data were collected from 118 Japanese university students and were analyzed from the perspective of item difficulty, discrimination, reliability, and distractor performance. The results indicate that the number of options had no notable impact on item difficulty in picture and dialogue tasks, but had a significant effect in the monologue task. The results also suggest that the number of options had no great impact on item discrimination and reliability in all tasks, and that the number of items containing all discriminating distractors was different between the three tasks. The overall findings provide evidence that task difference is an important element in the study of optimal number of options and the development of multiple-choice listening tests.

1. Introduction

One single-answer multiple-choice item consists of one question (the stem), one correct option (the key), and several incorrect options (distractors). In language assessment, the multiple-choice test format is widely used because of its objectivity, automated scoring, and reliability (Thompson, 1995). In addition, this format allows testing of receptive skills without requiring test takers to produce written or spoken language (Hughes, 2003). Moreover, if items are well-constructed, the multiple-choice format can provide content validity evidence by allowing a thorough and representative sampling of the cognitive domain (Downing, 2006).

On the other hand, the multiple-choice format is often criticized because it induces random guessing, tests only recognition knowledge, and may produce harmful washback (Hughes, 2003). The greatest weakness of this format is that it is difficult to develop proper test items. To be more precise, it is extremely difficult to create a sufficient number of plausible distractors. As Brown's law (Brown, 2005) states, "when writing four-option multiple-choice items, the stem and correct option are easy to write, and the next two distractors are relatively easy to make up, as well, but the last distractor is absolutely impossible" (p. 49). In fact, Haladyna and Downing (1993) show that three options (one key and two distractors) are suitable since even in well-developed tests it is rare that more than three options are functional. Moreover, Rodriguez's (2005) meta-analysis of 80 years of research into the optimal number of options concluded that multiple-choice items should consist of three options because using more does little to improve the test and results in improper distractors.

Despite the above-mentioned research, the four-option test format (one key and three distractors) is the most widely used in domestic and international English proficiency tests such as EIKEN, TOEIC, and TOEFL (iBT and ITP), which raises the question: do all four options function properly in those tests? Although we cannot access the official test data directly, several empirical studies have been conducted into the effects of number of options in EFL testing, and these may give us some insight into the optimal number of options in multiple-choice items. These will be discussed in the next section.

2. Background

Although the optimal number of multiple-choice options has long been discussed throughout educational testing, very few studies have been conducted in the language-testing context. For instance, a meta-analysis by Rodriguez (2005), which reviewed 27 articles written over 80 years, did not include any studies related to the L2 field. Only recently have several studies focused on the relationship between the number of options and EFL test performance.

First, Shizuka, Takeuchi, Yashima, and Yoshizawa (2006) compared the effect of four- and three-option formats on reading performance using a university entrance exam in Japan. They changed an original four-option item into a three-option one by deleting the distractor least often chosen in a previous administration of the entrance exam. The results indicated that the average item facility and average item discrimination between the two formats were not significantly different. In addition, their distractor analyses found that the number of effective distractors was almost equal between the two option formats.

Second, Currie and Chiramanee (2010) investigated how three-, four-, and five-option multiple-choice items differed from open-ended items when testing L2 English grammar knowledge for undergraduates in Thailand. They developed distractors for the multiple-choice items based on participants' incorrect answers to open-ended items. They found that there were no significant differences in item facility between the three-, four-, and five-option items, and that the alpha reliability coefficients were almost the same among the three option formats.

Lastly, Lee and Winke (2012) compared three-, four-, and five-option items using listening test items equivalent to the university entrance examination in Korea. They converted the original five-option items into four- and three-option formats by deleting the least plausible option. The plausibility of distractors was judged by different evaluators amongst graduate students, English teachers, and high school students. They reported that average scores on the three-option tests were significantly higher than those on the four- and five-option tests, but no difference was found in average item discrimination between the three formats. As for reliability (Cronbach's alpha), no consistent patterns were found that depended on the number of options.

In sum, the results of previous studies can be divided into two types: (a) the number of options had no effect on test performance (Shizuka et al., 2006; Currie & Chiramanee, 2010), and (b) the three-option format was easier than the other formats (Lee & Winke, 2012). The clear difference between these studies lies in whether the test was written-mode (reading and grammar) or spoken-mode (listening). It is reasonable to assume that the effects of different numbers of options have more impact on listening test performance than on reading or grammar test performance because of "the real-time nature of spoken language" (Buck, 2001, p. 6). In the listening test, test takers have to read and choose one of three, four, or five options while listening to texts. Thus, it is possible that a smaller number of options could reduce the cognitive demands on test takers.

Having said that, it is not safe to postulate that a smaller number of options can always increase listening test performance. As reviewed below, text types (dialogue or monologue) might be related to listening test performance, and in that case, the effect of different numbers of options may vary according to the type of listening text.

Several studies have investigated which type of input test takers find more difficult in listening comprehension. In an earlier study by Shohamy and Inbar (1991), three types of listening test (a news report, a lecture, and a consultative dialogue) were compared. It was found that the news report (scripted monologue) was the most difficult, followed by the lecture (one speaker interacting with audience), and finally the dialogue (one expert and one addressee). The authors attributed this finding to the fact that the news report consisted of a larger density of propositions and more complex grammatical and syntactic structures than the lecture and dialogue. In contrast, Read (2002) found that a monologue version was less difficult than a discussion version by three speakers of the same text. The author assumed that the discussion version was more demanding because test takers were not accustomed to unscripted/spontaneous dialogues that contained colloquial expressions. Finally, Papageorgiou, Stevens, and Goodwin (2012) compared a relatively longer dialogue and monologue and found that in some cases monologues were more difficult than dialogues. In sum, the relative difficulty of dialogues and monologues has been inconclusive to date, but at least some differences in difficulty have been found between them.

Having examined the research on text types, we will now return to Lee and Winke's (2012) study where the three-option format was found to be easier than the four- and five-option formats in a listening test. In their study, the listening test consisted of several task types: picture, question-response, dialogue, and monologue. Some may be confused by referring to dialogues and monologues as tasks, but this paper identifies these two as such, because passage types can be subcategorized into task characteristics (Buck, 2001). In Lee and Winke's study, none of the tasks were analyzed separately, but instead all of the data were gathered together. Given the difference in difficulty level between dialogues and monologues reviewed above, task difference should be included in research on the effect of number of options on listening test performance.

3. Research Questions

In order to investigate the effect of number of options on different tasks in multiple-choice listening tests, the following research questions were formulated:

RQ 1: Does the number of options affect item difficulty? Are there any differences between tasks in this respect?

RQ 2: Does the number of options affect item discrimination? Are there any differences between tasks in this respect?

RQ 3: Does the number of options affect reliability? Are there any differences between tasks in this respect?

RQ 4: Does the number of options affect distractor performance? Are there any differences between tasks in this respect?

4. Method

4.1 Participants

The participants consisted of 118 undergraduates from four intact classes in two private universities. The first two classes (n = 62) were allocated a four-option test format (Group A) and the second two classes (n = 56) were allocated a three-option test format (Group B). In the view of class teachers, participants' English proficiency levels were low intermediate.

4.2 Materials

The materials used in this study were adapted from three listening sections of a TOEIC preparation book (Educational Testing Service, 2011). The first section, Photographs, was a picture task in which test takers heard four statements about a picture and then selected the statement that best described the picture. These statements were not printed in the test books and were played only once. The second section, Conversations, was a dialogue task in which test takers heard short conversations between two people and answered three questions about what the speakers said. The third section, Short Talks, was a monologue task in which test takers heard talks given by a single speaker and answered three questions about what the speaker said. In both the dialogue and monologue tasks, questions and options were printed in the test book and the conversation and talk were played once.

Although the original TOEIC listening has one additional section, Question-Response, this was omitted from this study because it originally consists of three options and comparing the three- with the two-option item format was outside the scope of this study.


4.3 Three-Option Format

Following Shizuka et al. (2006), the original four-option format used in the TOEIC was transformed into a three-option format by deleting the least popular of the three distractors (Figure 1). The least popular distractor was chosen based on item analyses in a pilot study in which 90 university students took the original test one year prior to the main study (in June 2012). According to class teachers, the students who participated in the pilot study did not differ considerably in English proficiency levels from those in the main study, placed as they were in the same level of English class.

Three-option format
1. Key
2. Distractor 1
3. Distractor 2

Figure 1. Option deletion method.
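The deletion rule described in this section can be sketched in a few lines of Python. This is only a minimal illustration, assuming that pilot response counts are available for each option; the function name `drop_least_popular_distractor` and the counts used below are hypothetical and are not data from the study.

```python
def drop_least_popular_distractor(pilot_counts, correct):
    """Return the option labels of a three-option item derived from a four-option item.

    pilot_counts: dict mapping option label -> number of pilot test takers
                  who chose that option (hypothetical numbers, not study data).
    correct:      label of the key, which is never a candidate for deletion.
    """
    distractors = [opt for opt in pilot_counts if opt != correct]
    # The least popular distractor is the one chosen by the fewest pilot test takers.
    least_popular = min(distractors, key=lambda opt: pilot_counts[opt])
    return [opt for opt in pilot_counts if opt != least_popular]


# Hypothetical pilot counts for one item (90 pilot test takers in total).
counts = {"A": 48, "B": 25, "C": 4, "D": 13}
print(drop_least_popular_distractor(counts, correct="A"))  # ['A', 'B', 'D']
```

Note that the key is excluded from the search, so it is retained even if it happened to be the least chosen option in the pilot data.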

4.4 Procedures

The data were collected in June 2013. This study employed a common item equating design using Rasch modeling with the computer program WINSTEPS (Linacre, 2014a) to compare two similar test formats that differed only in the number of options per item on the three tasks, (a) picture, (b) dialogue, and (c) monologue, in terms of item difficulty, discrimination, reliability, and distractor performance (see section 5 below).

Figure 2 shows the outline of this research. Regarding the picture task, 10 four-option items were used as anchor items and served a linking purpose. Both groups (Group A: n = 62; Group B: n = 56) took the common test first (k = 10). Group A then took the four-option experimental test (k = 15) and Group B took the three-option experimental test (k = 15). A similar procedure was used for the dialogue and monologue tasks. The two groups took three different test sets with a one-week interval between each administration.


                            Picture             Dialogue            Monologue
Common items
  Group A & B (N = 118)     4-option (k = 10)   4-option (k = 18)   4-option (k = 18)
Experimental items
  Group A (n = 62)          4-option (k = 15)   4-option (k = 21)   4-option (k = 21)
  Group B (n = 56)          3-option (k = 15)   3-option (k = 21)   3-option (k = 21)

Figure 2. Task flow of this study.

5. Results and Discussion

5.1 Comparison of Task Difficulty for Each Group

Before investigating the effects of different numbers of options on listening tasks, the overall difficulty of the three tasks was examined. This was necessary to reveal the characteristics of each task. Mean percentage correct scores of the three tasks for each format were compared. The results are reported in Table 1. In the four-option group, students scored highest in the picture task (60.0%), followed by the dialogue task (54.8%), and finally the monologue task (41.6%). Similarly, in the three-option group, students obtained the highest mean scores in the picture task (70.6%), followed by the dialogue task (64.7%), and finally the monologue task (51.5%).

Table 1
Mean Percentage Correct Scores on the Three Tasks for the Two Groups

Task        No. of items   4-option format (n = 62)   3-option format (n = 56)
                           M        SD                M        SD
Picture     15             60.0     14.8              70.6     14.6
Dialogue    21             54.8     18.9              64.7     17.5
Monologue   21             41.6     15.9              51.5     14.9

To examine these differences statistically, a Friedman test was conducted for each group. The test was significant for the four-option group (χ²(2, N = 62) = 49.14, p < .001), and for the three-option group (χ²(2, N = 56) = 44.67, p < .001). Follow-up pairwise comparisons were conducted using Wilcoxon signed-rank tests. The results of these tests are shown in Table 2. A Bonferroni correction was applied, and all effects were evaluated at a .0167 (.05/3) level of significance. Table 2 also reports the effect size for each pairwise comparison for each task. In the four-option group, the mean score in the monologue task was significantly lower than that in the dialogue and picture tasks (ps < .001) with a large effect size (r = .66 and .77, respectively). On the other hand, the difference between the picture and dialogue tasks was marginal (p = .02) with a small effect size (r = .30). Similar to the four-option group, in the three-option group, the mean score in the monologue task was significantly lower than that in the dialogue and picture tasks (ps < .001) with a large effect size (r = .72 and .83, respectively). There was also a significant difference between the picture and dialogue tasks (p < .001), but the effect size was small (r = .35).

Table 2
Follow-Up Tests for Comparing Scores Across Tasks in Two Formats

Format   4-option (n = 62)                           3-option (n = 56)
Task     Pic vs. Dia   Dia vs. Mon   Mon vs. Pic     Pic vs. Dia   Dia vs. Mon   Mon vs. Pic
Z (a)    2.34          5.20          6.07            2.61          5.41          6.21
p (b)    .02           < .001*       < .001*         < .001*       < .001*       < .001*
r (c)    .30           .66           .77             .35           .72           .83

Note. Pic = picture; Dia = dialogue; Mon = monologue.
(a) Wilcoxon signed-rank test.
(b) Asymp. sig. (2-tailed).
(c) Effect size.
*p < .0167 (.05/3)
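The Bonferroni-corrected threshold and the effect sizes in Table 2 can be checked with a short calculation. The sketch below assumes that the effect size was computed as r = Z / √N, which the paper does not state explicitly; under that assumption, the reported Z values and group sizes reproduce the reported r values. The function names are illustrative only.

```python
import math


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after a Bonferroni correction."""
    return alpha / n_comparisons


def effect_size_r(z, n):
    """Effect size r for a Wilcoxon signed-rank test, assuming r = |Z| / sqrt(N)."""
    return abs(z) / math.sqrt(n)


alpha = bonferroni_alpha(0.05, 3)
print(round(alpha, 4))  # 0.0167

# Z values and group sizes as reported in Table 2.
reported = [
    ("4-option, Pic vs. Dia", 2.34, 62),
    ("4-option, Dia vs. Mon", 5.20, 62),
    ("4-option, Mon vs. Pic", 6.07, 62),
    ("3-option, Pic vs. Dia", 2.61, 56),
    ("3-option, Dia vs. Mon", 5.41, 56),
    ("3-option, Mon vs. Pic", 6.21, 56),
]
for label, z, n in reported:
    print(label, round(effect_size_r(z, n), 2))

# The reported p = .02 for the 4-option picture vs. dialogue comparison
# does not fall below the corrected threshold.
print(0.02 < alpha)  # False
```

The last line shows why the picture versus dialogue difference in the four-option group is described as marginal: it would pass an uncorrected .05 level but not the corrected one.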

Judging from the above, it can be stated that the monologue task was the most difficult, followed by the dialogue, and finally the picture task. This finding supports the previous studies reviewed (Shohamy & Inbar, 1991; Papageorgiou, Stevens, & Goodwin, 2012), which showed that monologic texts were more difficult than dialogic ones. The reason for this is that monologues usually contain fewer features associated with interactive discourse (Buck, 2001), features that may help listeners comprehend the text.

This study also found that the picture task was the least difficult of the three. This can be explained by the difference in the test-taking process. In dialogue and monologue tasks, test takers are asked to read and answer three questions (with four options each) while listening to the text. In the picture task, on the other hand, test takers simply listen to four short statements and select the one that matches the picture. Thus, picture tasks impose less cognitive load than dialogue and monologue tasks.

5.2 Comparison of Item Difficulty

5.2.1 Selection of Common Items for Equation

This section reports the quality of common items for equating. First, item difficulty invariance was checked. Figure 3 shows the item difficulty plot between the four- and three-option formats for the picture task. As can be seen, all items were located within the acceptable zone between the two control lines (solid curved lines), except for Item 10, which was an outlier and so was removed.


Figure 3. Item difficulty plot between the four-option format and the three-option format for the picture task.

Next, the items were evaluated from the viewpoint of fit statistics, reported in Table 3. Misfit is divided into two types: (a) underfit and (b) overfit. Underfit (> 1.3) degrades the quality of the ensuing measures, while overfit (< 0.7) has no practical implications (Bond & Fox, 2007). According to this guideline, two out of 10 items were diagnosed as misfit (Item 1 as overfit: 0.18; Item 9 as underfit: 2.49).
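The fit guideline above amounts to a simple threshold rule on the mean-square (MNSQ) statistic. The following is a minimal sketch of that rule, using the cut-offs of 0.7 and 1.3 cited from Bond and Fox (2007); the function name `classify_fit` is hypothetical, and the two values checked are the ones reported for Items 1 and 9.

```python
def classify_fit(mnsq, lower=0.7, upper=1.3):
    """Classify a mean-square fit statistic using the cut-offs cited in the text."""
    if mnsq > upper:
        return "underfit"  # degrades the quality of the ensuing measures
    if mnsq < lower:
        return "overfit"   # no practical implications according to the guideline
    return "acceptable"


print(classify_fit(0.18))  # overfit  (Item 1)
print(classify_fit(2.49))  # underfit (Item 9)
```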

Table 3
Fit Statistics and Point-Biserial Correlations on Common Items in Picture Task

O.08051O.373.241,091.45O,18O,91O.94O.96L12O,921,051,04O.69O.80O.721.131.071.00118.49.51.41.54.33.52,36

Note. ID = item difficulty. MNSQ = mean squares. r_pbi = point-biserial correlation.
(a) Outlier in item difficulty invariance.
Dark shaded items were removed from equating. Light shaded values did not reach expected values for equating.


Finally, point-biserial correlation (r_pbi) was examined. The value of point-biserial correlation for equating purposes should be .20 or more (Hirai, 2010). Again, Items 1 and 9 did not reach the expected values (.19 and .12, respectively). Based on these observations, seven out of 10 common items were used for equating in the picture task.

The same examination of common items was conducted for the dialogue and monologue tasks. As a result, 15 and 16 test items, respectively, were selected for equating (see Appendix A & B).

5.2.2 Item Difficulty

Table 4 reports mean values of difficulty estimates in the four- and three-option formats, the results of paired t-tests, and the effect size for each task.

With regard to the picture task, the mean difficulty of the four-option format (M = 0.25, SD = 1.32) was slightly higher than that of the three-option format (M = 0.11, SD = 1.18), but there was no significant difference between them (p = .60) with a small effect (d = .11). With regard to the dialogue task, the mean difficulty of the four-option format (M = -0.26, SD = 0.82) was slightly higher than that of the three-option format (M = -0.41, SD = 0.87), but no significant difference was found between them (p = .16) with a small effect (d = .18). Finally, in the monologue task, the mean difficulty of the four-option format (M = 0.27, SD = 0.94) was higher than that of the three-option format (M = -0.25, SD = 0.89). Although the statistical difference between the formats was marginal (p = .07), the effect size was medium (d = .57).

Table 4
Results of Paired t-tests for Mean Item Difficulty (Logits)

        Picture                Dialogue               Monologue
        4-option   3-option    4-option   3-option    4-option   3-option
M       0.25       0.11        -0.26      -0.41       0.27       -0.25
SD      1.32       1.18        0.82       0.87        0.94       0.89
p       .60                    .16                    .07
d       .11                    .18                    .57

Note. Item difficulty (logits): The higher the item difficulty is, the more difficult it is.
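The effect sizes in Table 4 can be reproduced from the reported means and standard deviations. The sketch below assumes that d was computed as the mean difference divided by the root mean square of the two standard deviations; the paper does not state the formula, so this is an assumption, although it yields the reported values for all three tasks.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d, assuming the standardizer is the root mean square of the two SDs."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Means and SDs (logits) as reported in Table 4: 4-option first, then 3-option.
print(round(cohens_d(0.25, 1.32, 0.11, 1.18), 2))    # picture:   0.11
print(round(cohens_d(-0.26, 0.82, -0.41, 0.87), 2))  # dialogue:  0.18
print(round(cohens_d(0.27, 0.94, -0.25, 0.89), 2))   # monologue: 0.57
```

A positive d here means that the four-option format was the more difficult of the two, since higher logit values indicate harder items.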

These results suggest that the effect of different numbers of options on item difficulty varied between tasks. That is, the number of options did not influence item difficulty in picture and dialogue tasks. On the other hand, the difficulty of the monologue task was affected to a moderate degree by the different number of options; the three-option format made the test easier than the four-option format. The main reason for these differences can be attributed to the varying difficulty of the tasks. The monologue, as discussed in section 5.1, was the most difficult task, so including fewer options could help test takers to focus on listening and comprehend the text better, which also led to selection of the correct answer. In other words, reducing the number of options (that test takers have to read) may have a beneficial effect for test takers only when the task is difficult.

This finding did not support some prior studies (Shizuka et al., 2006; Currie & Chiramanee, 2010), where no difference in item difficulty was found between formats. One possible interpretation is that even in those studies there would have been a difference in item difficulty if tasks had been analyzed separately; some tasks would have been more difficult than others and vice versa. By extension, this interpretation can be applied to one supportive previous study (Lee & Winke, 2012), where the three-option format was significantly easier than the other formats. In short, it can be assumed that many more difficult tasks/items were included in Lee and Winke's study, so their study found only the difference in item difficulty between the formats.

5.3 Comparison of Item Discrimination

Table 5 presents mean item discrimination (point-biserial correlations: r_pbi) in four- and three-option formats, the results of paired t-tests, and effect size.

In the picture task, the four-option format had slightly lower mean item discrimination (r_pbi = .33) than the three-option format (by 0.01), and no significant difference was found between the two formats with almost no effect (p = .85, d = .05). In the dialogue task, the mean item discrimination in the four-option format (r_pbi = .39) was somewhat higher than that in the three-option format (r_pbi = .33), resulting in no significant difference between them with a small effect (p = .23, d = .39). Similar to the picture task, the mean item discrimination in the monologue task was almost the same between the four- and three-option formats (r_pbi = .30 and .29, respectively), and no significant difference was found with almost no effect size (p = .84, d = .16).

Table 5
Results of Paired t-tests for Mean Item Discrimination

        Picture                Dialogue               Monologue
        4-option   3-option    4-option   3-option    4-option   3-option
M*      .33        .34         .39        .33         .30        .29
SD*     .16        .25         .14        .17         .15        .18
p       .85                    .23                    .84
d       .05                    .39                    .16

Note. *Fisher-z transformed values.
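The note to Table 5 refers to Fisher-z transformed values. As a brief illustration of what that transformation involves, the sketch below converts correlations to z, averages them, and converts the average back to r. The item correlations used here are hypothetical and are not taken from the study, and the helper names are illustrative only.

```python
import math


def fisher_z(r):
    """Fisher z transformation of a correlation coefficient (inverse hyperbolic tangent)."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Convert a Fisher z value back to a correlation coefficient."""
    return math.tanh(z)


def mean_correlation(correlations):
    """Average correlations on the z scale, then convert the mean back to r."""
    z_values = [fisher_z(r) for r in correlations]
    return inverse_fisher_z(sum(z_values) / len(z_values))


# Hypothetical point-biserial correlations for three items (not study data).
item_correlations = [0.20, 0.35, 0.50]
print(mean_correlation(item_correlations))
```

Averaging on the z scale is commonly preferred because the sampling distribution of r is skewed, whereas z is approximately normal.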

These findings indicate that option numbers had no impact on item discrimination in all tasks. This supports the findings of Shizuka et al. (2006) and Lee and Winke (2012), and adds another piece of evidence for the viability of substituting the four-option format with the three-option format. This may be because items of both formats are essentially the same in their questions and correct options (keys). In other words, the core of each item remains the same between the four- and three-option formats. Therefore, removing one distractor may not have changed the essential ability of test items to discriminate between test takers.

5.4 Comparison of Reliability

Table 6 shows item reliability and person reliability in both formats for each task. Item reliability provides an estimate of how much confidence we can have in the performance of the items; how likely they are to perform in the same way on another test population under similar conditions (Green, 2013). As Table 6 shows, item reliability was consistently somewhat higher in the four-option format than the three-option format across the tasks, but the differences are quite small, ranging from .02 (dialogue and monologue) to .04 (picture).

Person reliability provides an estimate of how much confidence we can have in the test takers' results, i.e., the extent to which they are likely to perform in the same way on another set of test items of a similar nature given under similar conditions (Green, 2013). Compared to item reliability, there seems to be a wider difference in person reliability between the four- and three-option formats, ranging from .02 (monologue) to .07 (picture). Moreover, unlike item reliability, person reliability for the three-option format was slightly higher than that for the four-option format in dialogue and monologue tasks.

Table 6
Results of Item Reliability and Person Reliability

                     Picture                Dialogue               Monologue
                     4-option   3-option    4-option   3-option    4-option   3-option
Item reliability     .93        .89         .87        .85         .90        .88
Person reliability   .65        .58         .77        .81         .69        .71

Overall, these results suggest that the number of options had no considerable effect on reliability in either format for any task. However, no consistent patterns were observed either across the four- and three-option formats or across the tasks. This inconsistency in reliability was also found in the prior study by Lee and Winke (2012), where no difference was found in one of the three test sets, but three- and four-option formats had higher reliability in another set, and three- and five-option formats were more reliable than the four-option format in another. This may be because reliability is sensitive to many factors such as sample ability, variance in item difficulty, and the sample size of persons and items (Linacre, 2014b). Therefore, we need to collect and analyze more data from a larger sample and use more test items to draw a conclusion concerning the relationship between reliability and the number of options.


5.5 Comparison of Distractor Performance

This section reports how well distractors functioned in the two formats across tasks. The functionality of distractors has conventionally been evaluated in two ways: (a) frequency and (b) discriminability (Haladyna, 2004; Haladyna & Downing, 1993; Henning, 1987). The former measures the number of test takers that chose each distractor, and a non-functioning distractor is defined as one that was chosen by fewer than 5% of test takers (e.g., Haladyna & Downing, 1993). The latter measures the extent to which a distractor can distinguish between high-ability and low-ability test takers, and a non-functioning distractor is defined as one with a positive point-biserial correlation (e.g., Haladyna & Downing, 1993). Based on these guidelines, we will evaluate distractors in the four- and three-option formats in the three tasks.
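The two criteria can be illustrated with a minimal sketch for a single item. It assumes that the point-biserial correlation is computed between a 0/1 indicator of choosing a distractor and the test takers' total raw scores; the function names, the option labels, and the response data are all hypothetical and do not reproduce the procedure or data of this study.

```python
from math import sqrt


def point_biserial(indicator, scores):
    """Pearson correlation between a 0/1 indicator and scores (None if undefined)."""
    n = len(scores)
    mean_x = sum(indicator) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in indicator)
    syy = sum((y - mean_y) ** 2 for y in scores)
    if sxx == 0 or syy == 0:
        return None  # e.g., a distractor that nobody chose
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(indicator, scores))
    return sxy / sqrt(sxx * syy)


def evaluate_distractors(choices, scores, options, key, min_share=0.05):
    """Flag each distractor as frequently chosen and/or discriminating."""
    report = {}
    for option in options:
        if option == key:
            continue
        indicator = [1 if choice == option else 0 for choice in choices]
        share = sum(indicator) / len(choices)
        r = point_biserial(indicator, scores)
        report[option] = {
            "share": share,
            "frequent": share >= min_share,              # chosen by at least 5%
            "discriminating": r is not None and r < 0,   # negative point-biserial
        }
    return report


# Hypothetical item with key "A": 12 test takers chose A, 4 chose B, 4 chose C.
choices = ["A"] * 12 + ["B"] * 4 + ["C"] * 4
scores = [15, 16, 14, 17, 13, 15, 16, 14, 15, 13, 16, 14,  # chose A
          6, 7, 5, 8,                                      # chose B (low scorers)
          18, 19, 17, 18]                                  # chose C (high scorers)
report = evaluate_distractors(choices, scores, options="ABCD", key="A")
for option, result in report.items():
    print(option, result)
```

In this made-up item, B functions under both criteria, C is frequently chosen but attracts high scorers and so fails the discriminability criterion, and D fails both because nobody selected it.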

5.5.1 Frequency

Table 7 shows the number of items with three, two, one, and no frequently chosen distractors in the four- and three-option formats of each task. The task with the greatest number of question items containing all frequently chosen distractors was the monologue task (95.2% for both the four- and three-option formats), followed by the dialogue task (85.7% for both), and finally the picture task (46.7% each). In contrast to the dialogue and monologue tasks, the picture task contained more items with infrequently chosen distractors in both formats. That is, many distractors in the picture task were not chosen by test takers, while most distractors in both the dialogue and monologue tasks were selected by test takers.

Table 7
The Number of Frequently Chosen Distractors in Two Formats for Each Task

Frequently chosen   Picture (k = 15)        Dialogue (k = 21)       Monologue (k = 21)
distractors         4-option    3-option    4-option    3-option    4-option    3-option
3                   7 (46.7)    n/a         18 (85.7)   n/a         20 (95.2)   n/a
2                   4 (26.7)    7 (46.7)    3 (14.3)    18 (85.7)   1 (4.8)     20 (95.2)
1                   4 (26.7)    7 (46.7)    0 (0.0)     3 (14.3)    0 (0.0)     1 (4.8)
0                   0 (0.0)     1 (6.7)     0 (0.0)     0 (0.0)     0 (0.0)     0 (0.0)

Note. Percentages appear in parentheses.

5.5.2 Discriminability

The frequency criterion above tells us how many test takers chose each distractor. What is important, however, is not only how many distractors were chosen for each multiple-choice item, but also what level of test taker selects which distractor (Shizuka et al., 2006). Table 8 shows the number of items with three, two, one, and no discriminating distractors in both formats of each task. In contrast to the results for the frequency criterion, the greatest number of items with all discriminating distractors was in the


dialogue (57.1% in the four-option format and 90.5% in the three-option format), followed by the monologue (42.9% and 71.4%, respectively), and by the picture (40.0% and 66.7%, respectively).

Table 8
The Number of Discriminating Distractors in Two Formats for Each Task

Discriminating      Picture (k = 15)        Dialogue (k = 21)       Monologue (k = 21)
distractors         4-option    3-option    4-option    3-option    4-option    3-option
3                   6 (40.0)    n/a         12 (57.1)   n/a         9 (42.9)    n/a
2                   5 (33.3)    10 (66.7)   9 (42.9)    19 (90.5)   8 (38.1)    15 (71.4)
1                   3 (20.0)    4 (26.7)    0 (0.0)     2 (9.5)     4 (19.0)    6 (28.6)
0                   0 (0.0)     1 (6.7)     0 (0.0)     0 (0.0)     0 (0.0)     0 (0.0)

Note. Percentages appear in parentheses.

These results indicate that the frequently chosen distractors in the four-option format were not always discriminating distractors. To put it precisely, there were seven items (46.7%) with three frequently chosen distractors in the picture task (see Table 7), but only six items (40.0%) had three discriminating distractors. With regard to the dialogue task, the number of items fell from 18 (85.7%) with three frequently chosen distractors to 12 (57.1%) with three discriminating distractors. More strikingly, in the monologue task, the number fell from 20 (95.2%) to only nine items (42.9%). On the other hand, such a sharp drop was not seen in the three-option format. That is, the two frequently chosen distractors in question items remained the two discriminating distractors in those items. In other words, the three-option format had similar, if not the same, discriminatory power as the four-option format. These findings support Shizuka et al.'s (2006) prior study, where the number of discriminating distractors was almost the same for the four- and three-option formats.
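The size of the drop in the four-option format can be recomputed from the counts in Tables 7 and 8. The following is a minimal sketch; the variable and function names are hypothetical, and only the counts themselves come from the tables.

```python
# Four-option format, per task:
# (number of items k,
#  items with three frequently chosen distractors (Table 7),
#  items with three discriminating distractors (Table 8))
FOUR_OPTION = {
    "picture": (15, 7, 6),
    "dialogue": (21, 18, 12),
    "monologue": (21, 20, 9),
}


def pct(count, k):
    """Percentage of k items, rounded to one decimal place."""
    return round(100 * count / k, 1)


summary = {
    task: {
        "frequent_pct": pct(frequent, k),
        "discriminating_pct": pct(discriminating, k),
        "drop_items": frequent - discriminating,
    }
    for task, (k, frequent, discriminating) in FOUR_OPTION.items()
}

for task, row in summary.items():
    print(task, row)
```

The drop is one item in the picture task, six in the dialogue task, and eleven in the monologue task, which matches the ordering described in the text.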

From the viewpoint of task difference, the number of items that contained all effective distractors differed: the dialogue task had the highest proportion of such items (57.1% for the four-option format and 90.5% for the three-option format), followed by the monologue task (42.9% and 71.4%, respectively), and finally the picture task (40.0% and 66.7%, respectively). This difference may be related to task difficulty. In the picture task some items were too easy to distract some test takers, while in the monologue task some items were too difficult to allow test takers to select the right answer and induced random guessing.

6. Conclusion

This study has investigated the effects of different numbers of options (four-option vs. three-option) on different listening tasks (picture vs. dialogue vs. monologue) from the viewpoint of item difficulty, discrimination, reliability, and distractor performance.


The main finding of this study suggests that the effects of different numbers of options vary according to task. The number of options had no notable impact on item difficulty in the picture and dialogue tasks, but did have an effect in the monologue task. This finding sheds new light on the study of the optimal number of options and test development because task difference is an essential factor that determines item difficulty for multiple-choice listening tests.

Another finding indicates that the number of options had no great impact on item discrimination and reliability in all tasks, which supports many previous studies on the number of options, including L2 studies (e.g., Lee & Winke, 2012) and a meta-analysis that did not include L2 studies (Rodriguez, 2005). The other finding of this study shows that the number of items containing all discriminating distractors differed between the three tasks: the greatest number was in the dialogue task, followed by the monologue task, and finally the picture task. Again, task difference emerges as an important element in the development of multiple-choice listening tests.

Judging from the abovementioned findings, in this study the three-option format may have the same potential to function as the four-option format, if we take into account the effects of task difference.

There were some limitations to this study. It represents a limited sample of test takers, items, and tasks, and its findings can only be applied within the limited context in which it was conducted: that is, three sections of the TOEIC listening exam taken by low intermediate university students. Future studies should include different tasks with learners at different levels.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number (23520692). I would like to thank three anonymous reviewers and Dr. Sachiyo Takanami for valuable comments on an earlier version of this article.

References

Educational Testing Service. (2011). TOEIC test official practice: Listening. Tokyo: Institute for International Business Communication.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment (New ed.). New York, NY: McGraw Hill.

Buck, G. (2001). Assessing listening. Cambridge University Press.

Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27, 471-491. doi: 10.1177/0265532209356790


Downing, S. M. (2006). Selected-response item formats in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 287-301). Mahwah, NJ: Lawrence Erlbaum.

Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.

Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice item? Educational and Psychological Measurement, 53, 999-1010. doi: 10.1177/0013164493053004013

Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.

Hirai, A. (2010). Test mondai and kyozai sai riyou no susume: TEASY riron hen [Test Easy Assembly System]. Tokyo: Maruzen.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge University Press.

Lee, H., & Winke, P. (2012). The differences among three-, four-, and five-option-item formats in the context of a high-stakes English-language listening test. Language Testing, 30, 99-123. doi: 10.1177/0265532212451235

Linacre, M. (2014a). WINSTEPS (Version 3.81) [Computer software]. Retrieved from www.winsteps.com

Linacre, M. (2014b). A user's guide to WINSTEPS: MINISTEP Rasch-model computer program. Retrieved from www.winsteps.com

Papageorgiou, S., Stevens, R., & Goodwin, S. (2012). The relative difficulty of dialogic and monologic input in a second-language listening comprehension test. Language Assessment Quarterly, 9, 375-397. doi: 10.1080/15434303.2012.721425

Read, J. (2002). The use of interactive input in EAP listening assessment. Journal of English for Academic Purposes, 1, 105-119.

Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.

Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23, 35-57. doi: 10.1191/0265532206lt319oa

Shohamy, E., & Inbar, O. (1991). Construct validation of listening comprehension tests: The effect of text and question type. Language Testing, 8, 23-40.

Thompson, I. (1995). Assessment of second/foreign language listening comprehension. In D. J. Mendelsohn & J. Rubin (Eds.), A guide for the teaching of second language listening (pp. 31-58). San Diego, CA: Dominie Press.


Appendices

Appendix A

Selection of Common Items for Equation for the Dialogue and Monologue Tasks

[Figures A1 and A2: item difficulty plots with an axis labeled "Four-option format"; the plot images are not recoverable from the extracted text.]

Figure A1. Item difficulty plot for the dialogue task.

Figure A2. Item difficulty plot for the monologue task.

Appendix B

Fit Statistics and Point-Biserial Values for Conversation and Talk

          Dialogue          Monologue
Item #    D       Inf       D       Inf
1         -0.80   0.92      1.42    1.19
2         1.29    1.12      0.51    0.97
3         1.12    0.86      -0.14   1.01
4         -2.15   0.83      -0.06   0.93
5         -1.71   0.90      -0.41   1.09
6         0.50    0.99      0.55    0.93
7         0.87    0.96      0.09    1.08
8         1.17    1.34      -0.49   1.03
9         -0.36   1.17      1.24    1.01
10        -1.20   0.89      -1.52   0.87
11        0.82    1.05      -1.42   0.89
12        0.12    0.92      -0.93   1.00
13        -0.60   0.95      0.73    0.85
14        -0.80   0.92      0.41    1.07
15        0.68    1.01      0.28    0.98
16        -0.28   1.13      -1.24   0.87
17        0.41    1.09      0.64    1.13
18        0.91    0.94      0.37    1.00

[The Outf and Pbis columns, the outlier markers, and the cell shading of the original table are only partly legible in the extracted text and are omitted here.]

Note. D = item difficulty. Inf = infit mean squares. Outf = outfit mean squares. Pbis = point-biserial correlation. Marked items are outliers in item difficulty invariance. Dark shaded items were removed from equation. Light shaded values did not reach expected values for the equation.
