English Diagnostic Test for Translation & Interpretation Program Enrollees at the
Monterey Institute of International Studies
Original Test: Assignment II Part II
Yinghua Cai
EDUC 8540: Language Assessment Seminar
Dr. Kathleen M. Bailey
5 December 2012
In this part, I report the results of a series of statistical analyses of the scores from the English diagnostic test piloted among Chinese Translation and Interpretation students at MIIS. The analyses include Item Facility, Item Discrimination, Readability, Distractor Analysis, Response Frequency Distribution, Split-Half Reliability, Inter-rater Reliability, and Subtest Relationships. Together, these analyses offer useful information about the strengths and weaknesses of this test, as well as its reliability and validity, thus shedding light on the potential steps I may take to improve it.
Item Facility, Item Discrimination, and Readability
Table 3
Information Retention Subtest Item Facility and Item Discrimination (n = 13)

Item   Students answering correctly   I.F.   High scorers (top four) correct   Low scorers (bottom four) correct   I.D.
1      11                             0.85   4                                 2                                   0.50
2      12                             0.92   4                                 3                                   0.25
3      12                             0.92   4                                 3                                   0.25
4      13                             1.00   4                                 4                                   0.00
5      12                             0.92   4                                 4                                   0.00
6      10                             0.77   4                                 4                                   0.00
7      8                              0.62   3                                 0                                   0.75
8      11                             0.85   4                                 3                                   0.25
9      11                             0.85   4                                 2                                   0.50
10     11                             0.85   4                                 3                                   0.25

Average I.F. = 0.85; Average I.D. = 0.27
Table 4
Joke Reproduction Subtest Item Facility and Item Discrimination (n = 13)

Item   Students answering correctly   I.F.   High scorers (top four) correct   Low scorers (bottom four) correct   I.D.
1      9                              0.69   4                                 0                                   1.00
2      0                              0.00   0                                 0                                   0.00
3      1                              0.08   1                                 0                                   0.25
4      1                              0.08   0                                 0                                   0.00
5      9                              0.69   2                                 4                                   -0.50
6      1                              0.08   1                                 0                                   0.25
7      1                              0.08   0                                 0                                   0.00
8      2                              0.15   1                                 0                                   0.25
9      1                              0.08   1                                 0                                   0.25
10     2                              0.15   2                                 0                                   0.50

Average I.F. = 0.21; Average I.D. = 0.20
Table 5
Text Readability Statistics

Text                                            Flesch Reading Ease   Flesch-Kincaid Grade Level
Glossary (for Information Retention Subtest)    44.4                  12.0
Cloze passage (for Joke Reproduction Subtest)   72.7                  5.1
Item Facility (I.F.) indicates the difficulty level of a test item from the perspective of the learners who take the test in which that item is included (Oller, 1979). For my two objectively scored subtests, which tap into students' information retention ability and joke reproduction ability respectively, I calculated the I.F. of each test item by dividing the number of students who answered the item correctly by the total number of students who took the test (Table 3 and Table 4). The closer the I.F. of a test item is to 1, the easier that item is. A glance at the Average I.F. for the two subtests reveals that the information retention subtest is relatively easy (Average I.F. at .85) while the joke reproduction subtest is relatively hard (Average I.F. at .21). Oller (1979) states, "items falling somewhere between about .15 and .85 are usually preferred" (p. 247). By Oller's criterion, in the information retention subtest, items 1, 6, 7, 8, 9, and 10 are of acceptable difficulty, though items 1, 8, 9, and 10 look rather suspicious with a cut-point I.F. value (.85). As for the joke reproduction subtest, items 1, 5, 8, and 10 are within the acceptable range, with items 8 and 10 at the cut-point I.F. value (.15). The I.F. values of all the other items are almost 0.
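To make the computation concrete, here is a minimal Python sketch of the I.F. calculation, using the correct-answer counts from Table 3:

```python
# Minimal sketch: Item Facility for dichotomously scored items.
# I.F. = number of students answering correctly / number of test takers.

def item_facility(correct_counts, n_students):
    return [c / n_students for c in correct_counts]

# Correct-answer counts for the information retention subtest (Table 3).
correct_counts = [11, 12, 12, 13, 12, 10, 8, 11, 11, 11]

ifs = item_facility(correct_counts, n_students=13)
print([round(v, 2) for v in ifs])       # [0.85, 0.92, 0.92, 1.0, ...]
print(round(sum(ifs) / len(ifs), 2))    # Average I.F. = 0.85
```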
The I.F. values show a striking contrast with the readability statistics (Table 5) of the two texts used for the objectively scored sections, which take into consideration average sentence length and average number of syllables per word. Higher Flesch Reading Ease (RE) scores reflect easier reading, while lower scores reflect more difficult reading (Hedgcock & Ferris, 2009). The Flesch RE score of the glossary used for the information retention subtest is almost 30 points lower than that of the cloze passage for the joke reproduction subtest (12th-grade level versus 5th-grade level in the US educational system), making the glossary by far the harder text. Moreover, the glossary also has about 7% more off-list words than the cloze passage (Appendix J). Why, then, did the students perform far better on the much more difficult text? From my perspective, the inconsistency between the I.F. values and the readability statistics arose because the students were capable of skimming a long text and memorizing key information (a receptive skill) but, lacking schematic knowledge, were deficient at reproducing culturally bound texts (a productive skill).
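For reference, both indices in Table 5 are functions of average sentence length (ASL) and average syllables per word (ASW). A rough Python sketch follows; the vowel-group syllable counter is a crude approximation, so its output will only approximate what a word processor's readability tool reports:

```python
import re

def count_syllables(word):
    # Crude approximation: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / sentences        # average sentence length
    asw = syllables / len(words)        # average syllables per word
    reading_ease = 206.835 - 1.015 * asl - 84.6 * asw
    grade_level = 0.39 * asl + 11.8 * asw - 15.59
    return reading_ease, grade_level

ease, grade = flesch_scores("The quick brown fox jumps over the lazy dog.")
print(round(ease, 1), round(grade, 1))  # roughly 94.3 and 2.3
```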
To gain a deeper understanding of how each item in the objectively scored subtests separates high scorers from low scorers on the whole test, I also calculated Item Discrimination (I.D.; Table 3 and Table 4). After ranking the tests from the highest score (52) to the lowest (21.5), I adapted Flanagan's method (Bailey, 1998) by taking the top four (30.8%) of the 13 tests and the bottom four (30.8%) as the high-scorer and low-scorer groups respectively. The formula used for calculating I.D. is therefore: I.D. = (number of high scorers who got the item right − number of low scorers who got the item right) / 30.8% of the total number of students tested (i.e., 4).
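A minimal Python sketch of this formula, using the per-group counts from Table 3 (group size of four):

```python
# Minimal sketch: Item Discrimination via the adapted Flanagan method.
# Inputs are per-item counts of correct answers among the top four and
# bottom four scorers; the data below come from Table 3.

def item_discrimination(high_correct, low_correct, group_size=4):
    """I.D. = (high correct - low correct) / group size."""
    return [(h - l) / group_size for h, l in zip(high_correct, low_correct)]

high = [4, 4, 4, 4, 4, 4, 3, 4, 4, 4]
low  = [2, 3, 3, 4, 4, 4, 0, 3, 2, 3]

ids = item_discrimination(high, low)
print(ids)                            # [0.5, 0.25, ..., 0.25]
print(f"{sum(ids) / len(ids):.3f}")   # 0.275 (Table 3 reports 0.27)
```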
Oller (1979) says "any valid test must discriminate between degrees of whatever it is supposed to measure" (p. 248). Referring to Oller (1979), Bailey reports that "I.D. values range from +1 to -1, with positive 1 showing a perfect discrimination between high scorers and low scorers, and -1 showing a perfectly wrong discrimination. An I.D. of 0 shows no discrimination, or no variance whatsoever. The lowest acceptable values are usually set at 0.25 or 0.35" (Bailey, 1998, p. 135). The I.D. values of the two subtests seem rather alarming. The I.D. values of items 4, 5, and 6 of the information retention subtest show no discrimination; those of items 2, 3, 8, and 10 equal the lowest acceptable value (0.25). Looking at both the I.F. and I.D. values of the information retention subtest, items 4 and 5 are the most problematic: they are too easy to distinguish high performers from low performers, let alone to identify individual students' strengths and weaknesses or to select the most competent candidates for a program (what the EDT does). There seem to be three ways to make this information retention subtest more challenging, namely reducing the time allowed for glossary reading, revising the glossary to make it more difficult (in other words, going beyond the 12th-grade level), and rewriting the multiple-choice questions. A more detailed analysis of the multiple-choice items will be presented in the Distractor Analysis and Response Frequency Distribution section.
As for the joke reproduction subtest, answers were counted as correct only when students got 4 points out of 4. Thus, even where the I.D. value is 0.00, the high scorers' and low scorers' raw scores can still differ. For example, on item 2, whose I.F. and I.D. values are both 0.00, three high scorers scored 2 and one scored 1, whereas three low scorers scored 0 and only one scored 1. Based on the raw scores, we can conclude that although item 2 is difficult in general, its discriminability is not bad. On another note, item 5 has a negative I.D. value: all the low scorers got it correct, but two high scorers did not. It turned out that the two high scorers mitigated their speech by using "would" instead of "will" in "I will look like this" when filling in the blanks, which changed part of the meaning of the original speech. One possible explanation is that low scorers tended to stick to the original form, while high scorers, with more ease with the language and a fuller understanding of the whole text, tended to be more "adventurous" in their language use, thus producing a meaning shift.
Considering the low I.F. values of the joke reproduction subtest, I will include directions telling the students that the information retention subtest prepares them for the joke reproduction subtest, and specify that they need to produce 10 full sentences in the speech transcript, in order to elicit better performance. I may also give students more time to complete this subtest. As the joke reproduction section is built upon the information retention section, I also need to take a close look at the glossary I wrote and provide more facilitative information. To gain a clearer picture of the I.D. values, I will revise my scoring rubric by awarding two points for meaning accuracy and another two for grammatical correctness. Then I can re-score the subtest and re-compute the I.D. values along two dimensions, counting answers that are correct in meaning and answers that are correct in grammar, which may yield higher I.D. values.
Distractor Analysis and Response Frequency Distribution
Table 6
Information Retention Subtest Distractor Analysis (n = 13)

Item   A     B     C     D
1      1     11*   0     1
2      1     0     0     12*
3      1     0     0     12*
4      0     13*   0     0
5      12*   0     1     0
6      2     10*   0     0
7      0     4     8*    1
8      0     11*   0     2
9      0     0     11*   2
10     2     0     11*   0

Note. An asterisk denotes the key, or correct answer.
Table 7
Response Frequency Distribution on the Information Retention Subtest (n = 13)

Item   Group          A     B     C     D
1      High scorers   0     4*    0     0
       Low scorers    1     2     0     1
2      High scorers   0     0     0     4*
       Low scorers    1     0     0     3
3      High scorers   0     0     0     4*
       Low scorers    1     0     0     3
4      High scorers   0     4*    0     0
       Low scorers    0     4     0     0
5      High scorers   4*    0     0     0
       Low scorers    4     0     0     0
6      High scorers   0     4*    0     0
       Low scorers    0     4     0     0
7      High scorers   0     0     3*    1
       Low scorers    0     4     0     0
8      High scorers   0     4*    0     0
       Low scorers    0     3     0     1
9      High scorers   0     0     4*    0
       Low scorers    0     0     2     2
10     High scorers   0     0     4*    0
       Low scorers    1     0     3     0

Note. An asterisk denotes the key, or correct answer.
Bailey (1998) argues that for the purpose of improving a multiple-choice test, we need to look at "how each individual distractor is functioning" (p. 134). Considering the high I.F. values for the information retention subtest, it is unsurprising that many of the distractors did not function well: a total of 19 distractors (out of 30) did not seem to be distracting at all (Table 6). However, although item 4 did not seem to work well with the group I tested, one of my fellow classmates got it wrong (choosing D) during my pre-pilot phase. Likewise, option C for item 6 was chosen by one of my classmates, though by none of the 13 students. Therefore, I believe that, given a larger number of test takers, some of these distractors might serve their purpose after all.
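A minimal Python sketch of such a per-option tally, reproducing the item 7 responses from Table 6 (key = C):

```python
from collections import Counter

# Minimal sketch: tallying how often each option was chosen for one item.
# The responses below reproduce item 7 from Table 6 (key = C).
responses = ["B", "B", "B", "B", "C", "C", "C", "C",
             "C", "C", "C", "C", "D"]
key = "C"

tally = Counter(responses)
for option in "ABCD":
    marker = "*" if option == key else " "
    print(f"{option}{marker} {tally.get(option, 0)}")  # A 0 / B 4 / C* 8 / D 1
```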
Reliability of the Objectively Scored Subtests

Despite the lower split-half reliability of the information retention subtest, I am pretty happy about the full-test reliability (at .89).
In addition to the split-half reliability, I also calculated the standard error of measurement (SEM). The SEM is a way to "determine a band around a student's score within which that student's score would probably fall if the test were administered repeatedly to the same person…[and] the narrower the SEM is, the narrower the band of possible fluctuations will be, or the more consistently the raw scores represent the students' actual abilities" (Brown, 2005, pp. 188-189). It seems that the joke reproduction subtest is a more consistent measure of the students' abilities than the information retention subtest, given that its SEM is only 2.08 on a 40-point scale, although the information retention subtest's SEM of 2.5 on a 20-point scale is still acceptable.
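As a sketch of the two statistics, assuming the standard formulas in Brown (2005) — the Spearman-Brown adjustment for full-test split-half reliability, and SEM = SD × √(1 − reliability) — with a hypothetical half-test correlation and standard deviation chosen only for illustration:

```python
import math

def spearman_brown(half_correlation):
    # Full-test reliability estimated from the two-half correlation.
    return 2 * half_correlation / (1 + half_correlation)

def sem(sd, reliability):
    # Standard error of measurement: SD * sqrt(1 - reliability).
    return sd * math.sqrt(1 - reliability)

# A hypothetical half-test correlation of .80 yields roughly the .89
# full-test reliability reported above.
print(round(spearman_brown(0.80), 2))   # 0.89

# Hypothetical total-score SD of 6.3 with reliability .89.
print(round(sem(6.3, 0.89), 2))         # 2.09
```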
Reliability of the Subjectively Scored Subtest
Table 9
Inter-rater Reliability for the Subjectively Scored Subtest

Student              Rater 1   Rater 2   Rater 1 + Rater 2
1                    4         5         9
2                    4         5         9
3                    4         5         9
4                    5         5         10
5                    4         4         8
6                    4         6         10
7                    3         4         7
8                    4         4         8
9                    6         6         12
10                   4         5         9
11                   1         3         4
12                   5         5         10
13                   4         6         10
Mean                 4.00      4.85      8.85
Standard Deviation   1.15      0.90      1.91
Variance             1.33      0.81      3.64

Cronbach's alpha = .82
Reliability in a test of writing can be affected by multiple variables related to the writing task itself as well as to the scoring process; regarding the scoring process, we need to consider "the background and experience of the raters, the nature of the rating scale, and the training given to raters" (Weigle, 2002, p. 49). After calculating the inter-rater reliability using Cronbach's alpha, I realized that the rating procedures are not reliable enough, as evidenced by two-point differences between the two raters' scores on a 6-point scale (Table 9). This can be attributed to the fact that the raters were using the TWE, a holistic rubric not targeted at the specific genre of the reflective journal entry. Moreover, the lack of rater training with benchmark papers further complicated the scoring process. Even the differing experience of the raters (e.g., an SAT test preparation teacher versus a first-time writing test scorer) could have contributed to the relatively low inter-rater reliability. It is likely that the raters applied different rating criteria to different samples or applied the rating criteria inconsistently across the samples (Bachman, 1990). Undoubtedly, I need to devise a genre-specific (reflective journal entry) multi-trait or analytic rubric for scoring. Benchmark papers may be selected from the samples and then analyzed (identifying common characteristics) to facilitate the creation of a more reliable scoring rubric. I would also like to ask my fellow classmates to re-score the 13 samples with the new rubric to see whether it works.
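For verification, a short Python sketch that reproduces the alpha in Table 9, treating the two raters as "items" and using sample (n − 1) variances, which match the table's variance figures:

```python
# Minimal sketch: Cronbach's alpha for two raters (scores from Table 9).
rater1 = [4, 4, 4, 5, 4, 4, 3, 4, 6, 4, 1, 5, 4]
rater2 = [5, 5, 5, 5, 4, 6, 4, 4, 6, 5, 3, 5, 6]

def sample_variance(xs):
    # n - 1 in the denominator, matching the Table 9 variances.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

totals = [a + b for a, b in zip(rater1, rater2)]
k = 2  # number of raters treated as items
item_var = sample_variance(rater1) + sample_variance(rater2)
alpha = (k / (k - 1)) * (1 - item_var / sample_variance(totals))
print(round(alpha, 2))  # 0.82
```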
Subtest Relationships
Table 10
Subtest Relationships (df = 11): Correlation Coefficients (Pearson's r)

Subtest                               Information Retention   Joke Reproduction   Reflective Journal Entry Production
Information Retention                 -                       0.50*               0.03
Joke Reproduction                     0.50*                   -                   0.56**
Reflective Journal Entry Production   0.03                    0.56**              -

Note. *p < .05, **p < .025 in a directional (one-tailed) test.
Table 11
r-squared (Overlapping Variance) for Subtest Relationships

Subtest                               Information Retention   Joke Reproduction   Reflective Journal Entry Production
Information Retention                 -                       0.25                0.001
Joke Reproduction                     0.25                    -                   0.31
Reflective Journal Entry Production   0.001                   0.31                -
Pearson's r was calculated to examine the correlation between scores on each pair of the three subtests (Table 10), and r-squared was computed to determine to what extent each pair of subtests measures the same trait (Table 11). According to these statistics, there was clearly no statistically significant relationship between scores on the information retention subtest and the reflective journal entry production subtest. Although these two subtests were intended to measure distinct constructs, I was still surprised by such statistics and questioned the subtests' validity. Yet as Oller (1979) says, "since reliability is a prerequisite to validity, a given statistic cannot be taken as an indication of low reliability and high validity" (p. 187). Their low correlation coefficient and overlapping variance might simply be the product of a lack of reliability in both subtests.

The low overlapping variance between scores on the joke reproduction subtest and the information retention subtest confirms my belief that the information retention subtest did not fully prepare the students for the joke reproduction subtest. In other words, students who did well on the information retention subtest did not perform equally well on the joke reproduction subtest, which shows the need to revise the test directions, the glossary, and the multiple-choice questions.

The overlapping variance between scores on the joke reproduction subtest and those on the reflective journal entry production subtest is also quite low. Although both involved writing, the culturally loaded joke reproduction task was by nature very different from reflecting on one's test-taking experience in writing. The subtests may
measure distinct constructs, but a low correlation alone "cannot be taken as an indication of the validity of the correlated tests" (Oller, 1979, p. 188). This again alerts me to the need to rework the writing rubric to increase that subtest's reliability and validity.
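For completeness, a short sketch of the Pearson's r and r-squared computation; the score lists here are hypothetical stand-ins, since the raw subtest scores are not reproduced in this part:

```python
import math

def pearson_r(xs, ys):
    # Pearson product-moment correlation between two score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for two subtests (13 students), for illustration.
subtest_a = [8, 9, 7, 10, 6, 9, 8, 7, 10, 9, 5, 8, 7]
subtest_b = [3, 4, 2, 5, 1, 3, 4, 2, 5, 3, 0, 3, 2]

r = pearson_r(subtest_a, subtest_b)
print(round(r, 2), round(r ** 2, 2))  # r and overlapping variance
```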
Reliability, Validity, Practicality, and Washback
To conclude this paper, I would like to analyze my test in terms of the four traditional criteria (reliability, validity, practicality, and washback), and reiterate directions or propose new ideas for test revision. In terms of reliability, the information retention and reflective journal entry production subtests need to be improved, as can be seen from the inter-rater reliability and subtest relationship analyses. The glossary, the multiple-choice items, and the writing scoring rubric all require revision. Although the SEM of the joke reproduction subtest seems satisfactory, the scoring criteria need to be adjusted to distinguish the two dimensions (meaning accuracy and grammatical correctness) more clearly.
As for validity, the defining feature of a valid test is that it "does what it is supposed to do" (Oller, 1979, p. 4). A diagnostic test is supposed to identify students' "existing strengths and weaknesses in order to help teachers tailor instruction to fit 2LL's needs" (Bailey, 1998, p. 40). The English diagnostic test I designed revealed students' strengths in information retention and weaknesses in joke reproduction for the group as a whole, but not at the individual level; thus, it would not be an ideal tool for screening purposes (e.g., the EDT at MIIS). Nevertheless, the test still has content validity (Brown, 2005): both the glossary and the political speech are representative of the materials that interpreting trainees may encounter in their classes or in real-life interpreting practice. However, further examination of the content validity will be useful for test revision.
The practicality of a test has to do with "the preparation, administration, scoring, and interpretation of the test" (Oller, 1979, p. 4). Since this diagnostic test was quite innovative, the preparation involved a great amount of time and effort. The administration was complex and tiring, since the test administrators had to pay close attention to the timing of the different tasks within each objectively scored subtest while distributing and collecting the answer sheets multiple times during the test. Therefore, I think it would be more convenient to use an online platform with timed sections. Scoring the multiple-choice questions was fast and easy, since the answer key was well developed and easy to use. However, scoring the joke reproduction and reflective journal entry writing subtests took much longer, owing to the amount of judgment required by the somewhat vague scoring criteria. Finally, as the previous discussion shows, the interpretation of the test results remains at best tentative, given the novelty of the test and the flaws in its items and scoring criteria.
Finally, I am most proud of the washback of this test, which Bailey (1998) defines as "the effect a test has on teaching and learning" (p. 249). I discussed the instructional value of awareness-raising in detail in Part 1 of this project. In addition, as a self-assessment tool, the reflective journal entry may promote self-regulated learning, in which students "make choices, select learning activities, and plan how to use their time and resources" (O'Malley & Valdez Pierce, 1996, p. 5). I was excited to see that in their journal entries, almost all the students set their own goals for using their time and resources, such as doing more extensive reading to build background knowledge and adopting a different note-taking strategy to deal with punch lines. As one student put it in her writing, "generally speaking, it's a very helpful test."
References
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Anderson, R. C., & Pearson, P. D. (1984). A schema-theoretic view of the basic processes in reading. In P. D. Pearson (Ed.), Handbook of reading research (pp. 255-291). New York, NY: Longman.

Attardo, S. (2008). Semantics and pragmatics of humor. Language and Linguistics Compass, 2(6), 1203-1215.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bailey, K. M. (1998). Learning about language assessment: Dilemmas, decisions and directions. Boston: Heinle & Heinle.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York: McGraw-Hill.

Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: Cambridge University Press.

Hamp-Lyons, L. (1989). Newbury House TOEFL preparation kit: Preparing for the Test of Written English. New York: Newbury House.

Hedgcock, J. S., & Ferris, D. R. (2009). Teaching readers of English: Students, texts, and contexts. New York, NY: Routledge.

Merriam-Webster, Incorporated. Joke. In Merriam-Webster.com. Retrieved October 5, 2012, from http://www.merriam-webster.com/dictionary/joke

Oller, J. W. (1979). Language tests at school. London: Longman Group.

O'Malley, J. M., & Valdez Pierce, L. (1996). Authentic assessment for English language learners: Practical approaches for teachers. Boston, MA: Addison-Wesley.

Stevens, D. D., & Cooper, J. E. (2009). Journal keeping: How to use reflective writing for learning, teaching, professional insight, and positive change. Sterling, VA: Stylus Publishing, LLC.

Swain, M. (1984). Large-scale communicative language testing: A case study. In S. J. Savignon & M. Berns (Eds.), Initiatives in communicative language teaching (pp. 185-201). Reading, MA: Addison-Wesley.

Turner, J. (1995). Test preparation. Presentation at the TESOL convention, Long Beach, California.

Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

Wesche, M. B. (1983). Communicative testing in a second language. The Modern Language Journal, 67, 41-55.
Appendix J
Readability
Glossary: [readability statistics presented as an image in the original]
Cloze passage: [readability statistics presented as an image in the original]