English Diagnostic Test for Translation & Interpretation Program Enrollees at the
Monterey Institute of International Studies
Original Test: Assignment II Part II
Yinghua Cai
EDUC 8540: Language Assessment Seminar
Dr. Kathleen M. Bailey
5 December 2012
In this part, I report the results of a series of statistical analyses of the scores from the English diagnostic test piloted among Chinese Translation and Interpretation students at MIIS. The analyses include Item Facility, Item Discrimination, Readability, Distractor Analysis, Response Frequency Distribution, Split-Half Reliability, Inter-rater Reliability, and Subtest Relationships. Together, these analyses offer useful information about the strengths and weaknesses of this test, as well as its reliability and validity, thus shedding light on the potential steps I may take to improve it.
Item Facility, Item Discrimination, and Readability
Table 3
Information Retention Subtest Item Facility and Item Discrimination (n = 13)

Item   Students answering correctly   I.F.   High scorers (top four) correct   Low scorers (bottom four) correct   I.D.
1      11                             0.85   4                                 2                                   0.50
2      12                             0.92   4                                 3                                   0.25
3      12                             0.92   4                                 3                                   0.25
4      13                             1.00   4                                 4                                   0.00
5      12                             0.92   4                                 4                                   0.00
6      10                             0.77   4                                 4                                   0.00
7      8                              0.62   3                                 0                                   0.75
8      11                             0.85   4                                 3                                   0.25
9      11                             0.85   4                                 2                                   0.50
10     11                             0.85   4                                 3                                   0.25

Average I.F. = 0.85; Average I.D. = 0.27
Table 4
Joke Reproduction Subtest Item Facility and Item Discrimination (n = 13)

Item   Students answering correctly   I.F.   High scorers (top four) correct   Low scorers (bottom four) correct   I.D.
1      9                              0.69   4                                 0                                   1.00
2      0                              0.00   0                                 0                                   0.00
3      1                              0.08   1                                 0                                   0.25
4      1                              0.08   0                                 0                                   0.00
5      9                              0.69   2                                 4                                   -0.50
6      1                              0.08   1                                 0                                   0.25
7      1                              0.08   0                                 0                                   0.00
8      2                              0.15   1                                 0                                   0.25
9      1                              0.08   1                                 0                                   0.25
10     2                              0.15   2                                 0                                   0.50

Average I.F. = 0.21; Average I.D. = 0.20
Table 5
Text Readability Statistics

Text                                            Flesch Reading Ease   Flesch-Kincaid Grade Level
Glossary (for Information Retention Subtest)    44.4                  12.0
Cloze passage (for Joke Reproduction Subtest)   72.7                  5.1
Item Facility (I.F.) indicates the difficulty level of a test item from the perspective of the learners who take the test in which that item is included (Oller, 1979). For my two objectively scored subtests, which tap into students' information retention ability and joke reproduction ability respectively, I calculated the I.F. of each test item by dividing the number of students who answered the item correctly by the total number of students who took the test (Table 3 and Table 4). The closer the I.F. of a test item is to 1, the easier that item is. A glance at the Average I.F. for the two subtests reveals that the information retention subtest is relatively easy (Average I.F. at .85) while the joke reproduction subtest is relatively hard (Average I.F. at .21). Oller (1979) states, "items falling somewhere between about .15 and .85 are usually preferred" (p. 247). By Oller's criterion, in the information retention subtest, items 1, 6, 7, 8, 9, and 10 are of acceptable difficulty, though items 1, 8, 9, and 10 look rather suspicious with a cut-point I.F. value (.85). As for the joke reproduction subtest, items 1, 5, 8, and 10 are within the acceptable range, with items 8 and 10 at the cut-point I.F. value (.15). The I.F. values of all the other items are almost 0.
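To make the computation concrete, here is a minimal Python sketch of the I.F. calculation, using the correct-answer counts from Table 3:

```python
# Minimal sketch: Item Facility for dichotomously scored items.
# I.F. = number of students answering correctly / number of test takers.

def item_facility(correct_counts, n_students):
    return [c / n_students for c in correct_counts]

# Correct-answer counts for the information retention subtest (Table 3).
correct_counts = [11, 12, 12, 13, 12, 10, 8, 11, 11, 11]

ifs = item_facility(correct_counts, n_students=13)
print([round(v, 2) for v in ifs])       # [0.85, 0.92, 0.92, 1.0, ...]
print(round(sum(ifs) / len(ifs), 2))    # Average I.F. = 0.85
```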
The I.F. values show a striking contrast with the readability statistics (Table 5) of the two texts used for the objectively scored sections, which take into consideration average sentence length and average number of syllables per word. Higher Flesch Reading Ease (RE) scores reflect easier reading, while lower scores reflect more difficult reading (Hedgcock & Ferris, 2009). The Flesch RE score of the glossary used for the information retention subtest is almost 30 points lower than that of the cloze passage for the joke reproduction subtest (12th-grade level versus 5th-grade level in the US educational system), making the glossary by far the harder text. Moreover, the glossary also has about 7% more off-list words than the cloze passage (Appendix J). Why, then, did the students perform far better on the much more difficult text? From my perspective, the inconsistency between the I.F. values and the readability statistics arose because the students were capable of skimming a long text and memorizing key information (a receptive skill) but, lacking schematic knowledge, were deficient at reproducing culturally bound texts (a productive skill).
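For reference, both indices in Table 5 are functions of average sentence length (ASL) and average syllables per word (ASW). A rough Python sketch follows; the vowel-group syllable counter is a crude approximation, so its output will only approximate what a word processor's readability tool reports:

```python
import re

def count_syllables(word):
    # Crude approximation: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / sentences        # average sentence length
    asw = syllables / len(words)        # average syllables per word
    reading_ease = 206.835 - 1.015 * asl - 84.6 * asw
    grade_level = 0.39 * asl + 11.8 * asw - 15.59
    return reading_ease, grade_level

ease, grade = flesch_scores("The quick brown fox jumps over the lazy dog.")
print(round(ease, 1), round(grade, 1))  # roughly 94.3 and 2.3
```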
To gain a deeper understanding of how each item in the objectively scored subtests separates high scorers from low scorers on the whole test, I also calculated Item Discrimination (I.D.; Table 3 and Table 4). After ranking the tests from the highest score (52) to the lowest (21.5), I adapted Flanagan's method (Bailey, 1998) by taking the top four (30.8%) of the 13 tests and the bottom four (30.8%) as the high-scorer and low-scorer groups respectively. The formula used for calculating I.D. is therefore: I.D. = (number of high scorers who got the item right − number of low scorers who got the item right) / 30.8% of the total number of students tested (i.e., 4).
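A minimal Python sketch of this formula, using the per-group counts from Table 3 (group size of four):

```python
# Minimal sketch: Item Discrimination via the adapted Flanagan method.
# Inputs are per-item counts of correct answers among the top four and
# bottom four scorers; the data below come from Table 3.

def item_discrimination(high_correct, low_correct, group_size=4):
    """I.D. = (high correct - low correct) / group size."""
    return [(h - l) / group_size for h, l in zip(high_correct, low_correct)]

high = [4, 4, 4, 4, 4, 4, 3, 4, 4, 4]
low  = [2, 3, 3, 4, 4, 4, 0, 3, 2, 3]

ids = item_discrimination(high, low)
print(ids)                            # [0.5, 0.25, ..., 0.25]
print(f"{sum(ids) / len(ids):.3f}")   # 0.275 (Table 3 reports 0.27)
```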
Oller (1979) says "any valid test must discriminate between degrees of whatever it is supposed to measure" (p. 248). Referring to Oller (1979), Bailey reports that "I.D. values range from +1 to -1, with positive 1 showing a perfect discrimination between high scorers and low scorers, and -1 showing a perfectly wrong discrimination. An I.D. of 0 shows no discrimination, or no variance whatsoever. The lowest acceptable values are usually set at 0.25 or 0.35" (Bailey, 1998, p. 135). The I.D. values of the two subtests seem rather alarming. The I.D. values of items 4, 5, and 6 of the information retention subtest show no discrimination; those of items 2, 3, 8, and 10 equal the lowest acceptable value (0.25). Looking at both the I.F. and I.D. values of the information retention subtest, items 4 and 5 are the most problematic: they are too easy to distinguish high performers from low performers, let alone to identify individual students' strengths and weaknesses or to select the most competent candidates for a program (what the EDT does). There seem to be three ways to make this information retention subtest more challenging, namely reducing the time allowed for glossary reading, revising the glossary to make it more difficult (in other words, going beyond the 12th-grade level), and rewriting the multiple-choice questions. A more detailed analysis of the multiple-choice items will be presented in the Distractor Analysis and Response Frequency Distribution section.
As for the joke reproduction subtest, answers were counted as correct only when students got 4 points out of 4. Thus, even where the I.D. value is 0.00, the high scorers' and low scorers' raw scores can still differ. For example, on item 2, whose I.F. and I.D. values are both 0.00, three high scorers scored 2 and one scored 1, whereas three low scorers scored 0 and only one scored 1. Based on the raw scores, we can conclude that although item 2 is difficult in general, its discriminability is not bad. On another note, item 5 has a negative I.D. value: all the low scorers got it correct, but two high scorers did not. It turned out that the two high scorers mitigated their speech by using "would" instead of "will" in "I will look like this" when filling in the blanks, which changed part of the meaning of the original speech. One possible explanation is that low scorers tended to stick to the original form, while high scorers, with more ease with the language and a fuller understanding of the whole text, tended to be more "adventurous" in their language use, thus producing a meaning shift.
Considering the low I.F. values of the joke reproduction subtest, I will include directions telling the students that the information retention subtest prepares them for the joke reproduction subtest, and specify that they need to produce 10 full sentences in the speech transcript, in order to elicit better performance. I may also give students more time to complete this subtest. As the joke reproduction section is built upon the information retention section, I also need to take a close look at the glossary I wrote and provide more facilitative information. To gain a clearer picture of the I.D. values, I will revise my scoring rubric by awarding two points for meaning accuracy and another two for grammatical correctness. Then I can re-score the subtest and re-compute the I.D. values along two dimensions, counting answers that are correct in meaning and answers that are correct in grammar, which may yield higher I.D. values.
Distractor Analysis and Response Frequency Distribution
Table 6
Information Retention Subtest Distractor Analysis (n = 13)

Item   A     B     C     D
1      1     11*   0     1
2      1     0     0     12*
3      1     0     0     12*
4      0     13*   0     0
5      12*   0     1     0
6      2     10*   0     0
7      0     4     8*    1
8      0     11*   0     2
9      0     0     11*   2
10     2     0     11*   0

Note. An asterisk denotes the key, or correct answer.
Table 7
Response Frequency Distribution on the Information Retention Subtest (n = 13)

Item   Group          A     B     C     D
1      High scorers   0     4*    0     0
       Low scorers    1     2     0     1
2      High scorers   0     0     0     4*
       Low scorers    1     0     0     3
3      High scorers   0     0     0     4*
       Low scorers    1     0     0     3
4      High scorers   0     4*    0     0
       Low scorers    0     4     0     0
5      High scorers   4*    0     0     0
       Low scorers    4     0     0     0
6      High scorers   0     4*    0     0
       Low scorers    0     4     0     0
7      High scorers   0     0     3*    1
       Low scorers    0     4     0     0
8      High scorers   0     4*    0     0
       Low scorers    0     3     0     1
9      High scorers   0     0     4*    0
       Low scorers    0     0     2     2
10     High scorers   0     0     4*    0
       Low scorers    1     0     3     0

Note. An asterisk denotes the key, or correct answer.
Bailey (1998) argues that for the purpose of improving a multiple-choice test, we need to look at "how each individual distractor is functioning" (p. 134). Considering the high I.F. values for the information retention subtest, it is unsurprising that many of the distractors did not function well: a total of 19 distractors (out of 30) did not seem to be distracting at all (Table 6). However, although item 4 did not seem to work well with the group I tested, one of my fellow classmates got it wrong (choosing D) during my pre-pilot phase. Likewise, option C for item 6 was chosen by one of my classmates, though by none of the 13 students. Therefore, I believe that, given a larger number of test takers, some of these distractors might serve their purpose after all.
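A minimal Python sketch of such a per-option tally, reproducing the item 7 responses from Table 6 (key = C):

```python
from collections import Counter

# Minimal sketch: tallying how often each option was chosen for one item.
# The responses below reproduce item 7 from Table 6 (key = C).
responses = ["B", "B", "B", "B", "C", "C", "C", "C",
             "C", "C", "C", "C", "D"]
key = "C"

tally = Counter(responses)
for option in "ABCD":
    marker = "*" if option == key else " "
    print(f"{option}{marker} {tally.get(option, 0)}")  # A 0 / B 4 / C* 8 / D 1
```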
Reliability of the Objectively Scored Subtests

Despite the lower split-half reliability of the information retention subtest, I am pretty happy about the full-test reliability (at .89).
In addition to the split-half reliability, I also calculated the standard error of measurement (SEM). The SEM is a way to "determine a band around a student's score within which that student's score would probably fall if the test were administered repeatedly to the same person…[and] the narrower the SEM is, the narrower the band of possible fluctuations will be, or the more consistently the raw scores represent the students' actual abilities" (Brown, 2005, pp. 188-189). It seems that the joke reproduction subtest is a more consistent measure of the students' abilities than the information retention subtest, given that its SEM is only 2.08 on a 40-point scale, although the information retention subtest's SEM of 2.5 on a 20-point scale is still acceptable.
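As a sketch of the two statistics, assuming the standard formulas in Brown (2005) — the Spearman-Brown adjustment for full-test split-half reliability, and SEM = SD × √(1 − reliability) — with a hypothetical half-test correlation and standard deviation chosen only for illustration:

```python
import math

def spearman_brown(half_correlation):
    # Full-test reliability estimated from the two-half correlation.
    return 2 * half_correlation / (1 + half_correlation)

def sem(sd, reliability):
    # Standard error of measurement: SD * sqrt(1 - reliability).
    return sd * math.sqrt(1 - reliability)

# A hypothetical half-test correlation of .80 yields roughly the .89
# full-test reliability reported above.
print(round(spearman_brown(0.80), 2))   # 0.89

# Hypothetical total-score SD of 6.3 with reliability .89.
print(round(sem(6.3, 0.89), 2))         # 2.09
```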
Reliability of the Subjectively Scored Subtest
Table 9
Inter-rater Reliability for the Subjectively Scored Subtest

Student              Rater 1   Rater 2   Rater 1 + Rater 2
1                    4         5         9
2                    4         5         9
3                    4         5         9
4                    5         5         10
5                    4         4         8
6                    4         6         10
7                    3         4         7
8                    4         4         8
9                    6         6         12
10                   4         5         9
11                   1         3         4
12                   5         5         10
13                   4         6         10
Mean                 4.00      4.85      8.85
Standard Deviation   1.15      0.90      1.91
Variance             1.33      0.81      3.64

Cronbach's alpha = .82
Reliability in a test of writing can be affected by multiple variables related to the writing task itself as well as to the scoring process; regarding the scoring process, we need to consider "the background and experience of the raters, the nature of the rating scale, and the training given to raters" (Weigle, 2002, p. 49). After calculating the inter-rater reliability using Cronbach's alpha, I realized that the rating procedures are not reliable enough, as evidenced by two-point differences between the two raters' scores on a 6-point scale (Table 9). This can be attributed to the fact that the raters were using the TWE, a holistic rubric not targeted at the specific genre of the reflective journal entry. Moreover, the lack of rater training with benchmark papers further complicated the scoring process. Even the differing experience of the raters (e.g., an SAT test preparation teacher versus a first-time writing test scorer) could have contributed to the relatively low inter-rater reliability. It is likely that the raters applied different rating criteria to different samples or applied the rating criteria inconsistently across the samples (Bachman, 1990). Undoubtedly, I need to devise a genre-specific (reflective journal entry) multi-trait or analytic rubric for scoring. Benchmark papers may be selected from the samples and then analyzed (identifying common characteristics) to facilitate the creation of a more reliable scoring rubric. I would also like to ask my fellow classmates to re-score the 13 samples with the new rubric to see whether it works.
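For verification, a short Python sketch that reproduces the alpha in Table 9, treating the two raters as "items" and using sample (n − 1) variances, which match the table's variance figures:

```python
# Minimal sketch: Cronbach's alpha for two raters (scores from Table 9).
rater1 = [4, 4, 4, 5, 4, 4, 3, 4, 6, 4, 1, 5, 4]
rater2 = [5, 5, 5, 5, 4, 6, 4, 4, 6, 5, 3, 5, 6]

def sample_variance(xs):
    # n - 1 in the denominator, matching the Table 9 variances.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

totals = [a + b for a, b in zip(rater1, rater2)]
k = 2  # number of raters treated as items
item_var = sample_variance(rater1) + sample_variance(rater2)
alpha = (k / (k - 1)) * (1 - item_var / sample_variance(totals))
print(round(alpha, 2))  # 0.82
```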
Subtest Relationships
Table 10
Subtest Relationships (df = 11): Correlation Coefficients (Pearson's r)

Subtest                               Information Retention   Joke Reproduction   Reflective Journal Entry Production
Information Retention                 -                       0.50*               0.03
Joke Reproduction                     0.50*                   -                   0.56**
Reflective Journal Entry Production   0.03                    0.56**              -

Note. *p < .05, **p < .025 in a directional (one-tailed) test.
Table 11
r-squared (Overlapping Variance) for Subtest Relationships

Subtest                               Information Retention   Joke Reproduction   Reflective Journal Entry Production
Information Retention                 -                       0.25                0.001
Joke Reproduction                     0.25                    -                   0.31
Reflective Journal Entry Production   0.001                   0.31                -
Pearson's r was calculated to examine the correlation between scores on each pair of the three subtests (Table 10), and r-squared was computed to determine to what extent each pair of subtests measures the same trait (Table 11). According to these statistics, there was clearly no statistically significant relationship between scores on the information retention subtest and the reflective journal entry production subtest. Although these two subtests were intended to measure distinct constructs, I was still surprised by such statistics and questioned the subtests' validity. Yet as Oller (1979) says, "since reliability is a prerequisite to validity, a given statistic cannot be taken as an indication of low reliability and high validity" (p. 187). Their low correlation coefficient and overlapping variance might simply be the product of a lack of reliability in both subtests.

The low overlapping variance between scores on the joke reproduction subtest and the information retention subtest confirms my belief that the information retention subtest did not fully prepare the students for the joke reproduction subtest. In other words, students who did well on the information retention subtest did not perform equally well on the joke reproduction subtest, which shows the need to revise the test directions, the glossary, and the multiple-choice questions.

The overlapping variance between scores on the joke reproduction subtest and those on the reflective journal entry production subtest is also quite low. Although both involved writing, the culturally loaded joke reproduction task was by nature very different from reflecting on one's test-taking experience in writing. The subtests may
measure distinct constructs, but a low correlation alone "cannot be taken as an indication of the validity of the correlated tests" (Oller, 1979, p. 188). This again alerts me to the need to rework the writing rubric to increase that subtest's reliability and validity.
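For completeness, a short sketch of the Pearson's r and r-squared computation; the score lists here are hypothetical stand-ins, since the raw subtest scores are not reproduced in this part:

```python
import math

def pearson_r(xs, ys):
    # Pearson product-moment correlation between two score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for two subtests (13 students), for illustration.
subtest_a = [8, 9, 7, 10, 6, 9, 8, 7, 10, 9, 5, 8, 7]
subtest_b = [3, 4, 2, 5, 1, 3, 4, 2, 5, 3, 0, 3, 2]

r = pearson_r(subtest_a, subtest_b)
print(round(r, 2), round(r ** 2, 2))  # r and overlapping variance
```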
Reliability, Validity, Practicality, and Washback
To conclude this paper, I would like to analyze my test in terms of the four traditional criteria (reliability, validity, practicality, and washback), and reiterate directions or propose new ideas for test revision. In terms of reliability, the information retention and reflective journal entry production subtests need to be improved, as can be seen from the inter-rater reliability and subtest relationship analyses. The glossary, the multiple-choice items, and the writing scoring rubric all require revision. Although the SEM of the joke reproduction subtest seems satisfactory, the scoring criteria need to be adjusted to distinguish the two dimensions (meaning accuracy and grammatical correctness) more clearly.
As for validity, the defining feature of a valid test is that it "does what it is supposed to do" (Oller, 1979, p. 4). A diagnostic test is supposed to identify students' "existing strengths and weaknesses in order to help teachers tailor instruction to fit 2LL's needs" (Bailey, 1998, p. 40). The English diagnostic test I designed revealed students' strengths in information retention and weaknesses in joke reproduction for the group as a whole, but not at the individual level; thus, it would not be an ideal tool for screening purposes (e.g., the EDT at MIIS). Nevertheless, the test still has content validity (Brown, 2005): both the glossary and the political speech are representative of the materials that interpreting trainees may encounter in their classes or in real-life interpreting practice. However, further examination of the content validity will be useful for test revision.
The practicality of a test has to do with "the preparation, administration, scoring, and interpretation of the test" (Oller, 1979, p. 4). Since this diagnostic test was quite innovative, the preparation involved a great amount of time and effort. The administration was complex and tiring, since the test administrators had to pay close attention to the timing of the different tasks within each objectively scored subtest while distributing and collecting the answer sheets multiple times during the test. Therefore, I think it would be more convenient to use an online platform with timed sections. Scoring the multiple-choice questions was fast and easy, since the answer key was well developed and easy to use. However, scoring the joke reproduction and reflective journal entry writing subtests took much longer, owing to the amount of judgment required by the somewhat vague scoring criteria. Finally, as the previous discussion shows, the interpretation of the test results remains at best tentative, given the novelty of the test and the flaws in its items and scoring criteria.
Finally, I am most proud of the washback of this test, which Bailey (1998) defines as "the effect a test has on teaching and learning" (p. 249). I discussed the instructional value of awareness-raising in detail in Part 1 of this project. In addition, as a self-assessment tool, the reflective journal entry may promote self-regulated learning, in which students "make choices, select learning activities, and plan how to use their time and resources" (O'Malley & Valdez Pierce, 1996, p. 5). I was excited to see that in their journal entries, almost all the students set their own goals for using their time and resources, such as doing more extensive reading to build background knowledge and adopting a different note-taking strategy to deal with punch lines. As one student put it in her writing, "generally speaking, it's a very helpful test."
References
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Anderson, R. C., & Pearson, P. D. (1984). A schema-theoretic view of the basic processes in reading. In P. D. Pearson (Ed.), Handbook of reading research (pp. 255-291). New York, NY: Longman.

Attardo, S. (2008). Semantics and pragmatics of humor. Language and Linguistics Compass, 2(6), 1203-1215.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bailey, K. M. (1998). Learning about language assessment: Dilemmas, decisions and directions. Boston: Heinle & Heinle.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York: McGraw-Hill.

Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: Cambridge University Press.

Hamp-Lyons, L. (1989). Newbury House TOEFL preparation kit: Preparing for the Test of Written English. New York: Newbury House.

Hedgcock, J. S., & Ferris, D. R. (2009). Teaching readers of English: Students, texts, and contexts. New York, NY: Routledge.

Merriam-Webster, Incorporated. Joke. In Merriam-Webster.com. Retrieved October 5, 2012, from http://www.merriam-webster.com/dictionary/joke

Oller, J. W. (1979). Language tests at school. London: Longman Group.

O'Malley, J. M., & Valdez Pierce, L. (1996). Authentic assessment for English language learners: Practical approaches for teachers. Boston, MA: Addison-Wesley.

Stevens, D. D., & Cooper, J. E. (2009). Journal keeping: How to use reflective writing for learning, teaching, professional insight, and positive change. Sterling, VA: Stylus Publishing, LLC.

Swain, M. (1984). Large-scale communicative language testing: A case study. In S. J. Savignon & M. Berns (Eds.), Initiatives in communicative language teaching (pp. 185-201). Reading, MA: Addison-Wesley.

Turner, J. (1995). Test preparation. Presentation at the TESOL convention, Long Beach, California.

Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

Wesche, M. B. (1983). Communicative testing in a second language. The Modern Language Journal, 67, 41-55.
Appendix J
Readability
Glossary: [readability statistics presented as an image in the original]
Cloze passage: [readability statistics presented as an image in the original]