8/13/2019 An Investigation Into Native and Non-native Teachers' Judgments of Oral English Performance- Mixed Method
Language Testing
The online version of this article can be found at: http://ltj.sagepub.com/content/26/2/187
DOI: 10.1177/0265532208101010
Language Testing 2009 26: 187
Youn-Hee Kim
An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach
Published by SAGE: http://www.sagepublications.com
Additional services and information for Language Testing can be found at:
Email Alerts: http://ltj.sagepub.com/cgi/alerts
Subscriptions: http://ltj.sagepub.com/subscriptions
Reprints: http://www.sagepub.com/journalsReprints.nav
Permissions: http://www.sagepub.com/journalsPermissions.nav
Citations: http://ltj.sagepub.com/content/26/2/187.refs.html
Version of Record: Mar 26, 2009
Downloaded from ltj.sagepub.com at UNIV OF NEW MEXICO on October 14, 2013
Address for correspondence: Youn-Hee Kim, Modern Language Center, Ontario Institute for Studies in Education, University of Toronto, 252 Bloor Street West, Toronto, Ont., Canada, M5S 1V6; email: [email protected]
Language Testing 2009 26 (2) 187–217
© The Author(s), 2009. Reprints and Permissions: http://www.sagepub.co.uk/journalsPermissions.nav
DOI:10.1177/0265532208101010
An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach
Youn-Hee Kim, University of Toronto, Canada
This study used a mixed methods research approach to examine how native English-speaking (NS) and non-native English-speaking (NNS) teachers assess students' oral English performance. The evaluation behaviors of two groups of teachers (12 Canadian NS teachers and 12 Korean NNS teachers) were compared with regard to internal consistency, severity, and evaluation criteria. Results of a Many-faceted Rasch Measurement analysis showed that most of the NS and NNS teachers maintained acceptable levels of internal consistency, with only one or two inconsistent raters in each group. The two groups of teachers also exhibited similar severity patterns across different tasks. However, substantial dissimilarities emerged in the evaluation criteria teachers used to assess students' performance. A qualitative analysis demonstrated that the judgments of the NS teachers were more detailed and elaborate than those of the NNS teachers in the areas of pronunciation, specific grammar use, and the accuracy of transferred information. These findings are used as the basis for a discussion of NS versus NNS teachers as language assessors on the one hand and the usefulness of mixed methods inquiries on the other.
Keywords: mixed methods, NS and NNS, oral English performance assessment, many-faceted Rasch Measurement
In the complex world of language assessment, the presence of raters is one of the features that distinguishes performance assessment from traditional assessment. While scores in traditional fixed-response assessments (e.g., multiple-choice tests) are elicited solely from the interaction between test-takers and tasks, it is possible that the final scores awarded by a rater could be affected by variables inherent to that rater (McNamara, 1996). Use of a rater for performance assessment therefore adds a new dimension of interaction to the process of assessment, and makes monitoring of reliability and validity even more crucial.

The increasing interest in rater variability has also given rise to issues of eligibility; in particular, the question of whether native speakers should be the only "norm maker[s]" (Kachru, 1985) in language assessment has inspired heated debate among language professionals. The normative system of native speakers has long been assumed in English proficiency tests (Taylor, 2006), and it is therefore unsurprising that large-scale, high-stakes tests such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS) rendered their assessments using native English-speaking ability as a benchmark (Hill, 1997; Lowenberg, 2000, 2002). However, the current status of English as a language of international communication has caused language professionals to reconsider whether native speakers should be the only acceptable standard (Taylor, 2006). Indeed, non-native English speakers outnumber native English speakers internationally (Crystal, 2003; Graddol, 1997; Jenkins, 2003; Lowenberg, 2000), and localization of the language has occurred in outer circle countries such as China, Korea, Japan, and Russia (Kachru, 1985, 1992; Lowenberg, 2000). These developments suggest that new avenues of opportunity may be opening for non-native English speakers as language assessors.
This study, in line with the global spread of English as a lingua franca, investigates how native English-speaking (NS) and non-native English-speaking (NNS) teachers evaluate students' oral English performance in a classroom setting. A mixed methods approach will be utilized to address the following research questions:

1) Do NS and NNS teachers exhibit similar levels of internal consistency when they assess students' oral English performance?
2) Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students' oral English performance?
3) How do NS and NNS teachers differ in drawing on evaluation criteria when they comment on students' oral English performance?
I Review of the literature
A great deal of research exploring rater variability in second language oral performance assessment has been conducted, with a number of early studies focusing on the impact of raters' different backgrounds (Barnwell, 1989; Brown, 1995; Chalhoub-Deville, 1995; Chalhoub-Deville & Wigglesworth, 2005; Fayer & Krasinski, 1987; Galloway, 1980; Hadden, 1991). In general, teachers and non-native speakers were shown to be more severe in their assessments than non-teachers and native speakers, but the outcomes of some studies contradicted one another. This may be explained by their use of different native languages, small rater samples, and different methodologies (Brown, 1995; Chalhoub-Deville, 1995).

For example, in a study of raters' professional backgrounds, Hadden (1991) investigated how native English-speaking teachers and non-teachers perceive the competence of Chinese students in spoken English. She found that teachers tended to be more severe than non-teachers as far as linguistic ability was concerned, but that there were no significant differences in such areas as comprehensibility, social acceptability, personality, and body language. Chalhoub-Deville (1995), on the other hand, comparing three different rater groups (i.e., native Arabic-speaking teachers living in the USA, non-teaching native Arabic speakers living in the USA, and non-teaching native Arabic speakers living in Lebanon), found that teachers attended more to the creativity and adequacy of information in a narration task than to linguistic features. Chalhoub-Deville suggested that the discrepant findings of the two studies could be due to the fact that her study focused on modern standard Arabic (MSA), whereas Hadden's study focused on English.
Another line of research has focused on raters' different linguistic backgrounds. Fayer and Krasinski (1987) examined how the English-speaking performance of Puerto Rican students was perceived by native English-speaking raters and native Spanish-speaking raters. The results showed that non-native raters tended to be more severe in general and to express more annoyance when rating linguistic forms, and that pronunciation and hesitation were the most distracting factors for both sets of raters. However, this was somewhat at odds with Brown's (1995) study, which found that while native speakers tended to be more severe than non-native speakers, the difference was not significant. Brown concluded that "there is little evidence that native speakers are more suitable than non-native speakers … However, the way in which they perceive the items (assessment criteria) and the way in which they apply the scale do differ" (p. 13).
Studies of raters with diverse linguistic and professional backgrounds have also been conducted. Comparing native and non-native Spanish speakers with or without teaching experience, Galloway (1980) found that non-native teachers tended to focus on grammatical forms and reacted more negatively to non-verbal behavior and slow speech, while non-teaching native speakers seemed to place more emphasis on content and on supporting students' attempts at self-expression. Conversely, Barnwell (1989) reported that untrained native Spanish speakers provided more severe assessments than an ACTFL-trained Spanish rater. This result conflicts with that of Galloway (1980), who found that untrained native speakers were more lenient than teachers. Barnwell suggested that both studies were small in scope, and that it was therefore premature to draw conclusions about native speakers' responses to non-native speaking performance. Hill (1997) further pointed out that the use of two different versions of rating scales in Barnwell's study, one of which was presented in English and the other in Spanish, remains questionable.
One recent study of rater behavior focused on the effect of country of origin and task on evaluations of students' oral English performance. Chalhoub-Deville and Wigglesworth (2005) investigated whether native English-speaking teachers who live in different English-speaking countries (i.e., Australia, Canada, the UK, and the USA) exhibited significantly different rating behaviors in their assessments of students' performance on three Test of Spoken English (TSE) tasks, namely 1) give and support an opinion, 2) picture-based narration, and 3) presentation, which require different linguistic, functional, and cognitive strategies. MANOVA results indicated significant variability among the different groups of native English-speaking teachers across all three tasks, with teachers residing in the UK the most severe and those in the USA the most lenient across the board; however, the very small effect size (η² = 0.01) suggested that little difference exists among different groups of native English-speaking teachers.
Although the above studies provide some evidence that raters' linguistic and professional backgrounds influence their evaluation behavior, further research is needed for two reasons. First, most extant studies are not grounded in finely tuned methodologies. In some early studies (e.g., Fayer & Krasinski, 1987; Galloway, 1980; Hadden, 1991), raters were simply asked to assess speech samples of less than four minutes' length without reference to a carefully designed rating scale. Also, having raters assess only one type of speech sample did not take the potential systematic effect of task type on task performance into consideration. Had the task types varied, raters could have assessed diverse oral language output, which in turn might have elicited unknown or unexpected rating behaviors. Second, to my knowledge, no previous studies have attempted to use both quantitative and qualitative rating protocols to investigate differences between native and non-native English-speaking teachers' judgments of their students' oral English performance. A mixed methods approach, known as "the third methodological movement" (Tashakkori & Teddlie, 2003, p. ix), incorporates quantitative and qualitative research methods and techniques into a single study and has the potential to reduce the biases inherent in one method while enhancing the validity of inquiry (Greene, Caracelli, & Graham, 1989). However, all previous studies that have examined native and non-native English-speaking raters' behavior in oral language performance assessment have been conducted using only a quantitative framework, preventing researchers from probing research phenomena from diverse data sources and perspectives. The mixed methods approach of the present study seeks to enhance understanding of raters' behavior by investigating not only the scores assigned by NS and NNS teachers but also how they assess students' oral English performance.
II Methodology
1 Research design overview
The underlying research framework of this study is based on both expansion and complementarity mixed methods designs, which are most commonly used in empirical mixed methods evaluation studies (see Greene et al., 1989 for a review of mixed methods evaluation designs). The expansion design was considered particularly well suited to this study because it would offer a comprehensive and diverse illustration of rating behavior, examining both the product that the teachers generate (i.e., the numeric scores awarded to students) and the process that they go through (i.e., evaluative comments) in their assessment of students' oral English performance (Greene et al., 1989). The complementarity design was included to provide greater understanding of the NS and NNS teachers' rating behaviors by investigating the overlapping but different aspects of rater behavior that different methods might elicit (Greene et al., 1989). Intramethod mixing, in which a single method concurrently or sequentially incorporates quantitative and qualitative components (Johnson & Turner, 2003), was the selected guiding procedure. The same weight was given to both quantitative and qualitative methods, with neither method dominating the other.
2 Participants
Ten Korean students were selected from a college-level language institute in Montreal, Canada, and were informed about the research project and the test to participate in the study. The students were drawn from class levels ranging from beginner to advanced, so that the student sample would include differing levels of English proficiency. The language institute sorted students into one of five class levels according to their aggregate scores on a placement test measuring four English language skills (listening, reading, speaking, and writing): Level I for students with the lowest English proficiency, up to Level V for students with the highest English proficiency. Table 1 shows the distribution of the student sample across the five class levels.

For the teacher samples, a concurrent mixed methods sampling procedure was used in which a single sample produced data for both the quantitative and qualitative elements of the study (Teddlie & Yu, 2007). Twelve native English-speaking Canadian teachers of English and 12 non-native English-speaking Korean teachers of English constituted the NS and NNS teacher groups, respectively. In order to ensure that the teachers were sufficiently qualified, certain participation criteria were outlined: 1) at least one year of prior experience teaching an English conversation course to non-native English speakers in a college-level language institution; 2) at least one graduate degree in a field related to linguistics or language education; and 3) high proficiency in spoken English for Korean teachers of English. Teachers' background information
Table 1 Distribution of students across class levels
Level I II III IV V
Number of students 1 1 3 3 2
was obtained via a questionnaire after their student evaluations were completed: all of the NNS teachers had lived in English-speaking countries for one to seven years for academic purposes, and their self-assessed English proficiency levels ranged from advanced (six teachers) to near-native (six teachers); none of the NNS teachers indicated their self-assessed English proficiency levels at or below an upper-intermediate level. In addition, nine NS and eight NNS teachers reported having taken graduate-level courses specifically in Second Language Testing and Evaluation, and four NS and one NNS teacher had been trained as raters of spoken English.
3 Instruments
A semi-direct oral English test was developed for the study. The purpose of the test was to assess the overall oral communicative language ability of non-native English speakers within an academic context. Throughout the test, communicative language ability would be evidenced by the effective use of language knowledge and strategic competence (Bachman & Palmer, 1996). Initial test development began with the identification of target language use domain, target language tasks, and task characteristics (Bachman & Palmer, 1996). The test tasks were selected and revised to reflect potential test-takers' language proficiency and topical knowledge, as well as task difficulty and interest. An effort was also made to select test tasks related to hypothetical situations that could occur within an academic context. In developing the test, the guiding principles of the Simulated Oral Proficiency Interview (SOPI) were referenced.

The test consisted of three different task types in order to assess the diverse oral language output of test-takers: picture-based, situation-based, and topic-based. The picture-based task required test-takers to describe or narrate visual information, such as describing the layout of a library (Task 1, [T1]), explaining the library services based on a provided informational note (Task 2, [T2]), narrating a story from six sequential pictures (Task 4, [T4]), and describing a graph of human life expectancy (Task 7, [T7]). The situation-based task required test-takers to perform the appropriate pragmatic function in a hypothetical situation, such as congratulating a friend on being admitted to school (Task 3, [T3]). Finally, the topic-based task required test-takers to offer
their opinions on a given topic, such as explaining their personal preferences for either individual or group work (Task 5, [T5]), discussing the harmful effects of Internet use (Task 6, [T6]), and suggesting reasons for an increase in human life expectancy (Task 8, [T8]).

The test was administered in a computer-mediated indirect interview format. The indirect method was selected because the intervention of interlocutors in a direct speaking test might affect the reliability of test performance (Stansfield & Kenyon, 1992a, 1992b). Although the lexical density produced in direct speaking tests and indirect speaking tests has been found to be different (O'Loughlin, 1995), it has consistently been reported that scores from indirect speaking tests have a high correlation with those from direct speaking tests (Clark & Swinton, 1979, 1980; O'Loughlin, 1995; Stansfield, Kenyon, Paiva, Doyle, Ulsh, & Antonia, 1990). In order to effectively and economically facilitate an understanding of the task without providing test-takers with a lot of vocabulary (Underhill, 1987), each task was accompanied by visual stimuli. The test lasted approximately 25 minutes, 8 of which were allotted for responses.
A four-point rating scale was developed for rating (see Appendix A). It had four levels, labeled 1, 2, 3, and 4. A response of "I don't know" or no response was automatically rated NR (Not Ratable). The rating scale only clarified the degree of communicative success without addressing specific evaluation criteria. Because this study aimed to investigate how the teachers commented on the students' oral communicative ability and defined the evaluation criteria to be measured, the rating scale did not provide teachers with any information about which evaluation features to draw on. To deal with cases in which teachers sit on the fence, an even number of levels was sought in the rating scale. Moreover, in order not to cause a cognitive and psychological overload on the teachers, six levels were set as the upper limit during the initial stage of the rating scale development. Throughout the trials, however, the six levels describing the degree of successfulness of communication proved to be indistinguishable without dependence on the adjacent levels. More importantly, teachers who participated in the trials did not use all six levels of the rating scale in their evaluations. For these reasons, the rating scale was trimmed to four levels, enabling the teachers to consistently distinguish each level from the others.
4 Procedure
The test was administered individually to each of 10 Korean students, and their speech responses were simultaneously recorded as digital sound files. The order of the students' test response sets was randomized to minimize a potential ordering effect, and then 12 of the possible test response sets were distributed to both groups of teachers. A meeting was held with each teacher in order to explain the research project and to go over the scoring procedure, which had two phases: 1) rating the students' test responses according to the four-point rating scale; and 2) justifying those ratings by providing written comments either in English or in Korean. While the NS teachers were asked to write comments in English, the NNS teachers were asked to write comments in Korean (which were later translated into English). The rationale for requiring teachers' comments was not only that they would supply the evaluation criteria that the teachers drew on to infer students' oral proficiency, but also that they would help to identify the construct being measured. The teachers were allowed to control the playing, stopping, and replaying of test responses and to listen to them as many times as they wanted. After rating a single task response by one student according to the rating scale, they justified their ratings by writing down their reasons or comments. They then moved on to the next task response of that student. The teachers rated and commented on 80 test responses (10 students × 8 tasks).

To decrease the subject expectancy effect, the teachers were told that the purpose of the study was to investigate teachers' rating behavior, and the comparison of different teacher groups was not explicitly mentioned. The two groups of teachers were therefore unaware of each other. In addition, a minimum amount of information about the students (i.e., education level, current visa status, etc.) was provided to the teachers. Meetings with the NS teachers were held in Montreal, Canada, and meetings with the NNS teachers followed in Daegu, Korea. Each meeting lasted approximately 30 minutes.
5 Data analyses
Both quantitative and qualitative data were collected. The quantitative data consisted of 1,727 valid ratings, awarded by 24 teachers to 80 sample responses by 10 students on eight tasks. Each teacher rated every student's performance on every task, so that the data matrix was fully crossed. A rating of NR (Not Ratable) was treated as missing data; there were eight such cases among the 80 speech samples. In addition, one teacher failed to make one rating. The qualitative data included 3,295 written comments. Both types of data were analyzed in a concurrent manner: a Many-faceted Rasch Measurement (Linacre, 1989) was used to analyze the quantitative ratings, and typology development and data transformation (Caracelli & Greene, 1993) guided the analysis of the qualitative written comments. The quantitative and qualitative research approaches were integrated at a later stage (rather than at the outset of the research process), when the findings from both methods were interpreted and the study was concluded. Since the nature of the component designs to which this study belongs does not permit enough room to combine the two approaches (Caracelli & Greene, 1997), the different methods tended to remain distinct throughout the study. Figure 1 summarizes the overall data analysis procedures.
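As a quick arithmetic check, the count of 1,727 valid ratings is consistent with the design described above, assuming each of the eight NR responses was excluded from the ratings of all 24 teachers:

```python
# Sketch of the valid-ratings arithmetic; assumes each of the 8 NR
# (Not Ratable) responses was treated as missing for all 24 teachers.
teachers = 24   # 12 NS + 12 NNS
students = 10
tasks = 8

total_cells = teachers * students * tasks  # fully crossed design: 1,920 cells
nr_missing = 8 * teachers                  # 8 NR responses x 24 teachers = 192
failed_rating = 1                          # one teacher missed one rating

valid_ratings = total_cells - nr_missing - failed_rating
print(valid_ratings)  # 1727, matching the number reported above
```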
a Quantitative data analysis: The data were analyzed using the FACETS computer program, Version 3.57.0 (Linacre, 2005). Four facets were specified: student, teacher, teacher group, and task. The teacher group facet was entered as a dummy facet and anchored at zero. A hybrid Many-faceted Rasch Measurement model (Myford & Wolfe, 2004a) was used to differentially apply the Rating Scale Model to teachers and tasks, and the Partial Credit Model to teacher groups.
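For reference, the hybrid model combines threshold structures from two standard forms. In the basic rating-scale form of the many-facet Rasch model (Linacre, 1989), the log-odds of student n receiving category k rather than k−1 from teacher j on task i can be sketched as:

```latex
% B_n: ability of student n; C_j: severity of teacher j;
% D_i: difficulty of task i; F_k: threshold for rating category k.
% In the Partial Credit variant, the thresholds become specific to a
% facet element (e.g., F_{gk} per teacher group g), which is what the
% hybrid specification exploits.
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) \;=\; B_n - C_j - D_i - F_k
```

This is the conventional formulation rather than a reproduction of the study's own model statement; the symbols follow common Rasch-measurement usage.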
Three different types of statistical analysis were carried out to investigate teachers' internal consistency, based on: 1) fit statistics; 2) proportions of large standard residuals between observed and expected scores (Myford & Wolfe, 2000); and 3) a single rater–rest of the raters (SR/ROR) correlation (Myford & Wolfe, 2004a). The multiple analyses were intended to strengthen the validity of inferences drawn about raters' internal consistency through converging evidence, and to minimize any bias that is inherent to a particular analysis. Teachers' severity measures were also examined in three different ways based on: 1) task difficulty measures; 2) a bias analysis between teacher groups and tasks; and 3) a bias analysis between individual teachers and tasks.
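The idea behind the SR/ROR index can be illustrated with a short sketch (a minimal illustration, not the Myford and Wolfe (2004a) implementation; the `ratings` array and its values are hypothetical):

```python
import numpy as np

def sr_ror_correlations(ratings):
    """Single rater-rest of the raters (SR/ROR) correlation: for each
    rater, correlate that rater's ratings with the mean rating awarded
    by all remaining raters to the same responses.

    ratings: 2-D array of shape (n_raters, n_responses).
    """
    corrs = []
    for j in range(ratings.shape[0]):
        single = ratings[j]                              # this rater's ratings
        rest = np.delete(ratings, j, axis=0).mean(axis=0)  # mean of the others
        corrs.append(float(np.corrcoef(single, rest)[0, 1]))
    return corrs

# Hypothetical example: 3 raters scoring 4 responses on a 4-point scale
demo = np.array([[1.0, 2.0, 3.0, 4.0],
                 [1.0, 2.0, 3.0, 4.0],
                 [2.0, 3.0, 4.0, 4.0]])
print(sr_ror_correlations(demo))
```

A rater whose ratings track the rest of the group yields a correlation near 1; markedly lower values flag idiosyncratic rating patterns.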
b Qualitative data analysis: The written comments were analyzed based on evaluation criteria, with each written comment constituting one criterion. Comments that provided only evaluative adjectives without offering evaluative substance (e.g., "accurate," "clear," and so on) were excluded from the analysis so as not to misjudge the evaluative intent. The 3,295 written comments were open-coded so that the evaluation criteria that the teachers drew upon emerged. Nineteen recurring evaluation criteria were identified (see Appendix B for definitions and specific examples). Once I had coded and analyzed the teachers' comments, a second coder conducted an independent examination of the original uncoded comments of 10 teachers (five NS and five NNS teachers); our results reached approximately 95% agreement (for a detailed description of the coding procedures, see Kim, 2005). The 19 evaluation criteria were compared across the two teacher groups through a frequency analysis.

Figure 1 Flowchart of data analysis procedure: the 1,727 ratings fed the quantitative analyses of teachers' internal consistency (fit statistics; proportions of large standard residuals; single rater-rest of the raters (SR/ROR) correlation) and teachers' severity (task difficulty measures; bias analyses at the group and individual levels), while the 3,295 written comments fed the qualitative analyses of teachers' evaluation criteria (typology development; data transformation, i.e., quantification of evaluation features, for cross-comparison)
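The second-coder check described above can be expressed as a simple percent-agreement computation. A minimal sketch (the function and the toy labels are illustrative, not the authors' actual coding tool):

```python
def percent_agreement(codes_a, codes_b):
    """Percent agreement between two coders who labelled the same comments."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both coders must label the same set of comments")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)

# Hypothetical labels for five comments, drawn from the 19 evaluation criteria
coder_1 = ["pronunciation", "vocabulary", "fluency", "coherence", "fluency"]
coder_2 = ["pronunciation", "vocabulary", "fluency", "coherence", "intelligibility"]
rate = percent_agreement(coder_1, coder_2)  # 80.0 for this toy data
```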
III Results and discussion
1 Do NS and NNS teachers exhibit similar levels of internal consistency when they assess students' oral English performance?
To examine fit statistics, the infit indices of each teacher were assessed. Teachers' fit statistics indicate the degree to which each teacher is internally consistent in his or her ratings. Determining an acceptable range of infit mean squares for teachers is not a clear-cut process (Myford & Wolfe, 2004a); indeed, there are no straightforward rules for interpreting fit statistics, or for setting upper and lower limits. As Myford and Wolfe (2004a) noted, such decisions are related to the assessment context and depend on the targeted use of the test results. If the stakes are high, tight quality control limits such as mean squares of 0.8 to 1.2 would be set on multiple-choice tests (Linacre & Williams, 1998); however, in the case of low-stakes tests, looser limits would be allowed. Wright and Linacre (1994) proposed mean square values of 0.6 to 1.4 as reasonable values for data in which a rating scale is involved, with the caveat that the ranges are likely to vary depending on the particulars of the test situation.
In the present study, the lower and upper quality control limits were set at 0.5 and 1.5, respectively (Lunz & Stahl, 1990), given the test's rating scale and the fact that the study investigates teachers' rating behaviors in a classroom setting rather than those of trained raters in a high-stakes test setting. Infit mean square values greater than 1.5 indicate significant misfit, or a high degree of inconsistency in the ratings. On the other hand, infit mean square values less than 0.5 indicate overfit, or a lack of variability in the ratings. The fit statistics in Table 2 show that three teachers, NS10, NNS6, and NNS7, have misfit values. None of the teachers show overfit rating patterns.
Another analysis was carried out based on proportions of large standard residuals between observed and expected scores in order to more precisely identify the teachers whose rating patterns differed greatly from the model expectations. According to Myford and Wolfe (2000), investigating the proportion to which each rater is involved with the large standard residuals between observed scores
and expected scores can provide useful information about rater behavior. If raters are interchangeable, it should be expected that all raters would be assigned the same proportion of large standard residuals, in accordance with the proportion of the total ratings that they make (Myford & Wolfe, 2000). Based on the number of large standard residuals and ratings made by all raters and by each rater, they suggest that the null proportion of large standard residuals for each rater (p) and the observed proportion of large standard residuals for each rater (P_r) can be computed using equations (1) and (2):
Table 2 Teacher measurement report

Teacher  Obsvd    Fair-M   Measure   Model  Infit  Outfit  PtBis
         average  average  (logits)  S.E.   MnSq   MnSq
NS10     2.9      2.78     -0.60     0.20   1.51   1.37    0.56
NNS10    2.9      2.74     -0.52     0.20   1.26   1.21    0.58
NNS11    2.8      2.63     -0.29     0.19   1.09   0.94    0.55
NNS1     2.7      2.52     -0.07     0.19   0.85   0.74    0.57
NS9      2.7      2.43      0.11     0.19   1.34   1.43    0.51
NS5      2.6      2.37      0.23     0.19   1.07   1.28    0.53
NNS9     2.6      2.35      0.26     0.19   1.29   1.46    0.50
NS12     2.6      2.32      0.33     0.19   0.96   1.12    0.54
NNS7     2.6      2.32      0.33     0.19   1.54   1.29    0.49
NNS5     2.5      2.29      0.40     0.19   0.81   0.82    0.57
NS7      2.5      2.27      0.44     0.19   1.11   1.12    0.53
NS11     2.5      2.25      0.47     0.19   1.00   0.94    0.53
NS4      2.5      2.22      0.54     0.19   0.52   0.48    0.60
NNS4     2.5      2.22      0.54     0.19   0.52   0.48    0.60
NNS12    2.4      2.17      0.65     0.19   0.83   0.97    0.56
NNS2     2.4      2.13      0.72     0.19   0.69   0.68    0.57
NS3      2.4      2.08      0.83     0.19   0.77   1.03    0.57
NNS3     2.4      2.08      0.83     0.19   0.85   0.73    0.59
NS2      2.3      2.02      0.97     0.19   0.67   0.69    0.57
NS8      2.3      1.99      1.05     0.19   0.78   0.77    0.59
NS6      2.2      1.91      1.23     0.19   1.30   1.41    0.53
NNS6     2.2      1.84      1.38     0.19   1.61   1.74    0.49
NS1      2.1      1.75      1.60     0.20   0.68   0.60    0.58
NNS8     2.1      1.73      1.64     0.20   0.85   0.72    0.56
Mean     2.5      2.22      0.54     0.19   1.00   1.00    0.55
S.D.     0.2      0.27      0.58     0.00   0.31   0.33    0.03

RMSE (model) = 0.19; Adj. S.D. = 0.55; Separation = 2.87; Reliability (not inter-rater) = 0.89
Fixed (all same) chi-square = 214.7, d.f. = 23; significance (probability) = .00

Note: The SR/ROR correlation is presented as the point-biserial correlation (PtBis) in the FACETS output.
p = N_u / N_t    (1)

where N_u = the total number of large standard residuals and N_t = the total number of ratings.
P_r = N_ur / N_tr    (2)

where N_ur = the number of large standard residuals made by rater r and N_tr = the number of ratings made by rater r.

An inconsistent rating pattern will occur when the observed proportion exceeds the null proportion beyond the acceptable deviation (Myford & Wolfe, 2000). Thus, Myford and Wolfe propose that the frequency of unexpected ratings (Z_p) can be calculated using equation (3). According to them, if a Z_p value for a rater is below +2, it indicates that the unexpected ratings that he or she made are random error; however, if the value is above +2, the rater is considered to be exercising an inconsistent rating pattern.
Z_p = (P_r - p) / √(p(1 - p) / N_tr)    (3)
In this study, an unexpected observation was reported if the standardized residual was greater than +2, which was the case in 89 out of a total of 1,727 responses. When rating consistency was examined, one NS teacher and two NNS teachers were found to exhibit inconsistent rating patterns, a result similar to what was found in the fit analysis. The two NNS teachers whose observed Z_p values were greater than +2 were NNS6 and NNS7, who had been flagged as misfitting teachers by their infit indices. Interestingly, the analysis of the NS teachers showed that it was NS9, not NS10, who had a Z_p value greater than +2. This may be because NS10 produced only a small number of unexpected ratings, which did not produce large residuals. A small Z_p value indicates that while the teacher gave a few ratings that were somewhat unexpectedly higher (or lower) than the model would expect, those ratings were not highly unexpected (C. Myford, personal communication, May 31, 2005).
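The computations in equations (1)-(3) can be sketched as follows. The helper is illustrative; in particular, the exact form of equation (3) is assumed here to be the usual normal approximation for a proportion, which matches the +2 flagging criterion described above:

```python
import math

def zp_statistics(n_large_by_rater, n_ratings_by_rater, z_crit=2.0):
    """Flag raters whose share of large standardized residuals is excessive.

    n_large_by_rater:   {rater: count of ratings with standardized residual > +2}
    n_ratings_by_rater: {rater: total number of ratings made by that rater}

    Illustrative helper; the z formula assumes the standard normal
    approximation for a proportion, since equation (3) in the article
    specifies only the +2 flagging threshold in detail.
    """
    n_u = sum(n_large_by_rater.values())    # total large residuals (N_u)
    n_t = sum(n_ratings_by_rater.values())  # total ratings (N_t)
    p_null = n_u / n_t                      # null proportion, equation (1)

    results = {}
    for rater, n_tr in n_ratings_by_rater.items():
        p_r = n_large_by_rater.get(rater, 0) / n_tr    # observed proportion, eq. (2)
        se = math.sqrt(p_null * (1 - p_null) / n_tr)   # binomial standard error
        z_p = (p_r - p_null) / se                      # assumed form of eq. (3)
        results[rater] = (p_r, z_p, z_p > z_crit)      # flag if Z_p exceeds +2
    return results

# Toy data: rater A makes many unexpected ratings, rater B few
stats = zp_statistics({"A": 15, "B": 2}, {"A": 50, "B": 50})
```

With this toy data the null proportion is 17/100 = 0.17, so rater A (observed proportion 0.30) is flagged while rater B (0.04) is not.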
Myford and Wolfe (2004a, 2004b) introduced a more advanced Many-faceted Rasch Measurement application to detect raters' consistency based on the single rater-rest of the raters (SR/ROR) correlation. When raters exhibit randomness, they are flagged with
significantly large infit and outfit mean square indices; however, significantly large infit and outfit mean square indices may also indicate other rater effects (Myford & Wolfe, 2004a, 2004b). Thus, Myford and Wolfe suggested that it is important to examine significantly low SR/ROR correlations as well. More specifically, they suggested that randomness will be detected when infit and outfit mean square indices are significantly larger than 1 and SR/ROR correlations are significantly lower than those of other raters. Four teachers appeared to be inconsistent: NS9, NNS6, NNS7, and NNS9 showed not only large fit indices but also low SR/ROR correlations. In relative terms, NS9, NNS7, and NNS9 seemed to be borderline in their consistency, whereas NNS6 was clearly signaled as an inconsistent teacher.

In summary, the three different types of statistical approaches showed converging evidence; most of the NS and NNS teachers were consistent in their ratings, with one or two teachers from each group showing inconsistent rating patterns. This result implies that the two groups rarely differed in terms of internal consistency, and that the NNS teachers were as dependable as the NS teachers in assessing students' oral English performance.
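Myford and Wolfe's combined randomness rule (large fit indices together with a comparatively low SR/ROR correlation) can be sketched as a predicate. The numeric cut-offs below are illustrative assumptions; the article judges "significantly low" relative to the other raters rather than against a fixed threshold:

```python
def flag_random_rater(infit, outfit, sr_ror, fit_crit=1.5, corr_crit=0.50):
    """Flag possible randomness: large fit indices AND a low SR/ROR correlation.

    Illustrative sketch of Myford and Wolfe's combined rule; the fixed
    cut-offs (1.5 for fit, 0.50 for the correlation) are assumptions made
    for this example, not values given in the article.
    """
    large_fit = infit > fit_crit or outfit > fit_crit
    low_corr = sr_ror < corr_crit
    return large_fit and low_corr

# Table 2 values: NNS6 (flagged), NS4, and NS10 (misfit but adequate PtBis)
nns6_flagged = flag_random_rater(1.61, 1.74, 0.49)
ns4_flagged = flag_random_rater(0.52, 0.48, 0.60)
ns10_flagged = flag_random_rater(1.51, 1.37, 0.56)
```

Under these illustrative cut-offs only NNS6 is flagged: its fit indices are large and its correlation is low, whereas NS10's large infit alone does not trigger the combined rule.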
2 Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students' oral English performance?
The analysis was carried out in order to identify whether the two groups of teachers showed similar severity measures across different tasks. Given that task difficulty is determined to some extent by raters' severity in a performance assessment setting, comparison of task difficulty measures is considered a legitimate approach. Figure 2 shows the task difficulty measures derived from the NS and the NNS groups of teachers. As can be seen, the ratings of the NS group were slightly more diverse across tasks, with task difficulty measures ranging from -0.53 logits to 0.97 logits, a 1.50 logit spread; in the NNS group's ratings, the range of task difficulty measures was similar to that of the NS group, though slightly narrower: from -0.59 logits to 0.82 logits, a 1.41 logit spread. Figure 2 also shows that both groups exhibited generally similar patterns in task difficulty measures. Task 6 was given the highest difficulty measure by both groups of teachers, and Tasks 3 and 2 were given the lowest difficulty measure by the NS and the NNS teacher groups, respectively.
A bias analysis was carried out to further explore the potential interaction between teacher groups and tasks. In the bias analysis, an estimate of the extent to which a teacher group was biased toward a particular task is standardized to a Z-score. When the Z-score values in a bias analysis fall between -2 and +2, that group of teachers is thought to be scoring a task without significant bias. Where the values fall below -2, that group of teachers is scoring a task leniently compared with the way it has assessed other tasks, suggesting a significant interaction between the group and the task. By the same token, where the values are above +2, that group of teachers is thought to be rating that task more severely than other tasks. As the bias slopes of Figure 3 illustrate, neither of the two groups of teachers was positively or negatively biased toward any particular task; thus, the NS and NNS teacher groups do not appear to have any significant interactions with particular tasks.
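The Z-score decision rule just described can be sketched as a small classifier (an illustrative helper; the +/-2 criterion and the sign convention, negative for lenient and positive for severe, follow the text):

```python
def bias_direction(z, crit=2.0):
    """Interpret a bias Z-score for a (teacher, task) or (group, task) pairing.

    Illustrative helper for the decision rule described in the text:
    values below -2 indicate significantly lenient scoring of the task,
    values above +2 significantly severe scoring.
    """
    if z < -crit:
        return "lenient"               # scored the task significantly more leniently
    if z > crit:
        return "severe"                # scored the task significantly more severely
    return "no significant bias"

# e.g. Z-scores of the kind reported in Table 3
verdicts = [bias_direction(z) for z in (-2.29, 1.84, 2.93)]
```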
A bias analysis between individual teachers and tasks confirmed the result of the previous analysis. While interactions were found between individual teachers and tasks, no bias emerged toward a particular task from a particular group of teachers. Strikingly, certain teachers from each group showed exactly the same bias patterns on particular tasks. As shown in Table 3, one teacher from each group exhibited significantly lenient rating patterns on Tasks 1 and 4, and significantly severe patterns on Task 7. Two NS teachers exhibited conflicting rating patterns on Task 6: NS11 showed a significantly more lenient pattern of ratings, while NS9 showed the exact reverse pattern; that is, NS9 rated Task 6 significantly more severely. It is very interesting that one teacher from each group showed the same bias patterns on Tasks 1, 4, and 7, since it implies that the ratings of these two teachers may be interchangeable in that they display the same bias patterns.

Figure 2 Task difficulty measures by NS and NNS teacher groups (Tasks 1-8, in logits)

In summary, the NS and NNS teachers seem to have behaved similarly in terms of severity, as confirmed by both the task difficulty measures and the two bias analyses. The overall results of the multiple quantitative analyses also show that the NS and NNS
Figure 3 Bias analysis between teacher groups and tasks (Z-values for Tasks 1-8 by NS and NNS teacher groups)
Table 3 Bias analysis: Interactions between teachers and tasks

Teacher  Task  Obs-Exp   Bias measure  Model  Z-score  Infit
               average   (logits)      S.E.            MnSq
NS11     T6     0.54      1.26         0.55   -2.29    0.9
NS9      T4     0.38      1.23         0.58   -2.13    1.5
NNS9     T4     0.43      1.22         0.55   -2.19    1.5
NNS12    T1     0.47      1.18         0.53   -2.24    0.7
NS3      T1     0.44      1.06         0.50   -2.11    0.8
NS5      T6     0.43      1.01         0.55   -1.84    1.3
NNS6     T6    -0.34     -1.06         0.69    1.54    3.0
NS9      T6    -0.49     -1.21         0.58    2.09    2.1
NS3      T6    -0.44     -1.21         0.64    1.90    0.7
NS6      T7    -0.60     -1.90         0.65    2.92    1.1
NNS6     T7    -0.60     -2.02         0.69    2.93    1.1
for accuracy of transferred information; 53 vs. 66 comments for completeness of discourse).

When the evaluation criteria emphasized by the two teacher groups were examined, the NS group was found to draw most frequently on overall language use (13.46% of all comments), pronunciation (11.47%), vocabulary (11.42%), fluency (9.33%), and specific grammar use (6.70%). The NNS group emphasized pronunciation (15.23% of all comments), vocabulary (14.47%), intelligibility (7.69%), overall language use (7.00%), and coherence (5.68%). These trends indicate that the two teacher groups shared common ideas about the ways in which the students' performance should be assessed. Although the NS and NNS groups differed in that the NS group made more comments across most of the evaluation criteria, both groups considered vocabulary, pronunciation, and overall language use to be the primary evaluation criteria.

The NS teachers provided more detailed and elaborate comments, often singling out a word or phrase from students' speech responses and using it as a springboard for justifying their evaluative comments. For example, when evaluating pronunciation, the NS teachers commented that "some small pronunciation issue (can/can't & show/saw) causes confusion," "some words mispronounced (e.g., reverse for reserve, arrive for alive)," "pronunciation difficulty, l/r, d/t, f/p, vowels, i/e," "pronunciation occasionally unclear (e.g., really)," "sometimes pronunciation is not clear, especially at word onsets," etc. The explicit pinpointing of pronunciation errors
Figure 4 Frequency distribution of the comments by NS and NNS teacher groups across the 19 evaluation criteria (understanding the task, overall task accomplishment, strength of argument, accuracy of transferred info, topic relevance, overall language use, vocabulary, pronunciation, fluency, intelligibility, sentence structure, general grammar use, specific grammar use, socio-cultural appropriateness, context appropriateness, coherence, supplement of details, completeness of discourse, elaboration of argument)
might imply that the NS teachers tended to be sensitive or strict in terms of phonological accuracy. It can also be interpreted to suggest that the NS teachers were less tolerant of, or more easily distracted by, phonological errors made by non-native English speakers. These findings are somewhat contradictory to those of previous studies (e.g., Brown, 1995; Fayer & Krasinski, 1987), which indicated that native speakers are less concerned about or annoyed by non-native speech features as long as they are intelligible. This inconsistency might ultimately be due to the different methodological approaches employed in the studies. While this study examined non-native speakers' phonological features through a qualitative lens, the previous studies focused on the quantitative scores awarded on pronunciation as one analytic evaluation criterion.

The comments provided by the NNS teachers on pronunciation were somewhat different. Although pronunciation was one of the most frequently mentioned evaluation criteria and constituted 15.23% of the total comments, the NNS teachers were more general in their evaluative comments. Instead of identifying problems with specific phonological features, they tended to focus on the overall quality of students' pronunciation performance. For example, their comments included "problems with pronunciation," "problems with word stress," "hard to follow due to pronunciation," "good description of library but problems with pronunciation (often only with effort can words be understood)," etc. It appears that the NNS teachers were less influenced by phonological accuracy than by global comprehensibility or intelligibility. In other words, as long as students' oral performance was intelligible or comprehensible, the NNS teachers did not seem to be interested in the micro-level of phonological performance. Intelligibility was the third most frequently mentioned evaluation criterion among the NNS teachers, confirming that their attention was focused more on overall phonological performance or intelligibility than on specific phonological accuracy. Another possible explanation might be that, as one of the reviewers of this article suggested, the NNS teachers were more familiar with the students' English pronunciation than the NS teachers were, because the NNS teachers shared the same first language background with the students.
Similar patterns appeared in the evaluation criteria of specific grammar use and accuracy of transferred information. The NS teachers provided more detailed feedback on specific aspects of grammar use, making more comments than the NNS teachers
(152 vs. 29 comments). For example, when evaluating students' performance on Task 1 (describing the layout of a library), the NS teachers paid more attention to accurate use of prepositions than to other grammatical features. They further pointed out that accurate use of prepositions might facilitate listeners' visualization of given information, for example, by stating "prepositions of place could be more precise (e.g., in front of computers)" and "incorrect or vague use of prepositions of place hinders visualization."

The same observations were also made on Task 4 (narrating a story from six sequential pictures) and Task 7 (describing a graph of human life expectancy). Tasks 4 and 7 were similar in that students had to describe events that had taken place in the past in order to complete them successfully. It was therefore essential for students to be comfortable with a variety of verb tenses (past, past progressive, past perfect, present, and future) so as not to confuse their listeners. As was the case with preposition use, the NS teachers were more aware than the NNS teachers of the precise use of verb tenses, as their comments make manifest: "successfully recounted in the past with complex structure (i.e., past perfect, past progressive)," "changing verb tense caused some confusion," "all recounted in present tense," "tense accuracy is important for listener comprehension in this task," and "minor error in verb tense (didn't use future in reference to 2010 at first)."

By contrast, the NNS teachers neither responsively nor meticulously cited the use of prepositions or verb tenses. Their 29 total comments on specific grammar use were often too short to enable interpretation of their judgments (e.g., "no prepositions," "wrong tense," "problems with prepositions," and "problems with tense"), suggesting that the NNS teachers were less distracted than the NS teachers by the misuse of prepositions and verb tenses, consistent with Galloway's (1980) findings. Speculating as to why native and non-native speakers had different perceptions of the extent to which linguistic errors disrupt communication, Galloway noted that "confusion of tense may not have caused problems for the non-native speaker, but it did seem to impede communication seriously for the native speaker" (p. 432). Although the native language group in the Galloway study was quite different from that of the present study (i.e., native Spanish speakers as opposed to native Korean speakers), her conjectures are noteworthy.
The responses of the two teacher groups to the accuracy of transferred information followed the same pattern. Although the NNS
teachers provided more comments than did the NS teachers (50 vs. 46, respectively), their characteristics were dissimilar. This was especially evident in Task 2 (explaining the library services based on a provided informational note) and Task 7 (describing a graph of human life expectancy), where students were asked to verbalize literal and numeric information. On these two tasks, the NS teachers appeared very attentive to the accuracy of transmitted information, and jotted down content errors whenever they occurred. For example, they pointed out every inconsistency between the provided visual information and the transferred verbalized information, commenting "some key information inaccurate (e.g., confused renewals for grads & undergrads; fines of $50/day 50/day)," "some incorrect info (e.g., closing time of 9:00 pm instead of 6:00 pm)," "gradually accurate at first, then less so when talking about fines (e.g., $5050)," "some incorrect information (the gap between men and women was smallest in 1930, NOT 2000)," etc. By contrast, the NNS teachers were primarily concerned with whether the delivered information was generally correct, for example, commenting "accurate info," "not very inaccurate info," or "provided wrong information." The NNS teachers' global judgments on the accuracy of transmitted information raise the question of whether the NNS teachers were as attentive as the NS teachers to specific aspects of content accuracy, as long as the speech was comprehensible. It may simply be that the NNS teachers considered content errors to be simple mistakes that should not be used to misrepresent students' overall oral English proficiency.

The tendency of the NNS teachers to provide less detailed, less elaborate comments than the NS teachers on certain evaluation criteria requires careful interpretation. NNS teachers who teach daily in an EFL context may be poorly informed about how to evaluate students' language performance without depending on numeric scores and traditional fixed-response assessment. Although there have been recent advances in performance assessment in the EFL context, it has been pointed out that NNS teachers have not been effectively trained to assess students' performance (Lee, 2007). This different evaluation culture might have contributed to the dissimilar evaluation patterns of the NS and NNS teachers. The different evaluation behaviors might also be attributable to a methodological matter. Because this study was intended only to capture teachers' evaluation behavior, those who participated in the study were not told that they should make their comments as specific as
possible, which might have influenced the NNS teachers' lack of evaluative comments. For example, the NNS teachers may simply have noted the major characteristics of students' oral output, focusing on overall quality without considering the granularity of their own comments. As one of the reviewers suggested, it is also possible that the NNS teachers did not orient their comments toward providing feedback for the students. To suggest that the NNS teachers did not identify linguistic errors as accurately as the NS teachers did would therefore be premature, and more evidence needs to be gathered to address the specific ways in which the NS and NNS teachers provided students with feedback related to those linguistic errors.
IV Conclusion and implications
This study has examined how a sample of NS and NNS teachers assessed students' oral English performance from comprehensive perspectives. A variety of test tasks was employed, enabling the teachers to exhibit varied rating behaviors while assessing diverse oral language output. The teachers not only exhibited different severity measures, but they also drew on different evaluation criteria across different tasks. These findings suggest that employing multiple tasks might be useful in capturing diverse rater behaviors.

Three different statistical approaches were used to compare teachers' internal consistency, and they revealed almost identical patterns. Most of the NS and NNS teachers maintained acceptable levels of internal consistency, with only one or two teachers from each group identified as inconsistent raters. Similar results were obtained when the severity of the two groups was compared. Of the eight individual tasks, both teacher groups were most severe on Task 6, and neither was positively or negatively biased toward a particular task. More interestingly, a bias analysis carried out for individual teachers and individual tasks showed that one teacher from each group exhibited exactly the same bias patterns on certain tasks. A striking disparity, however, appeared in the NS and NNS teachers' evaluation criteria for students' performance. The NS teachers provided far more comments than the NNS teachers with regard to students' performance across almost all of the evaluation criteria. A qualitative analysis further showed the NS teachers to be more detailed and elaborate in their comments than were the NNS teachers. This observation arose
from their judgments on pronunciation, specific grammar use, and the accuracy of transferred information.
The comparable internal consistency and severity patterns that the NS and NNS teachers exhibited appear to support the assertion that NNS teachers can function as assessors as reliably as NS teachers can. Although the NS teachers provided more detailed and elaborate comments, the study has not shown how different qualitative evaluation approaches interact with students, or which evaluation method would be more beneficial to them. Therefore, the study's results offer no indication that NNS teachers should be denied positions as assessors simply because they do not own the language "by primogeniture and due of birth" (Widdowson, 1994, p. 379). Considering that assessment practices can be truly valid only when all contextual factors are considered, the involvement of native speakers in an assessment setting should not be interpreted as a panacea. By the same token, an inquiry into validity is a complicated quest, and no validity claims are one-size-fits-all. In a sense, NNS teachers could be more compelling or sensitive assessors than NS teachers in expanding circle countries (Kachru, 1985), since the former might be more familiar with the instructional objectives and curriculum goals of indigenous educational systems. Further research is therefore warranted to investigate the effectiveness of NNS teachers within their local educational systems.
This study has shown that by combining quantitative and qualitative research methods, a comprehensive understanding of research phenomena can be achieved via paradigmatic and methodological pluralism. Diverse paradigms and multiple research methods enabled diverse social phenomena to be explored from different angles; the inclusion of a qualitative analysis provided insight into the different ways in which NS and NNS teachers assessed students' oral language performance, above and beyond the findings from the quantitative analysis alone. Collecting diverse data also helped to overcome the limitations of the aforementioned previous studies, which depended solely on numeric data to investigate raters' behavior in oral language performance assessment.
Several methodological limitations and suggestions should be noted. First, this study's results cannot be generalized to other populations. Only Canadian and Korean English teachers were included in the sample, and most of these were well qualified and experienced, with at least one graduate degree related to linguistics or language education. Limiting the research outcomes to the
specific context in which this study was carried out will make the interpretations of the study more valid. The use of other qualitative approaches is also recommended. The only qualitative data collected were written comments, which failed to offer a full account of the teachers' in-depth rating behavior. Those behaviors could be further investigated using verbal protocols or in-depth interviews for a fuller picture of what the teachers consider effective language performance. As one of the reviewers pointed out, it might also be interesting to investigate whether the comments made by the NS and NNS teachers tap different constructs of underlying oral proficiency and thereby result in different rating scales. Lastly, further research is suggested to examine the extent to which the semi-direct oral test and the rating scale employed in this study represent the construct of underlying oral proficiency.
Acknowledgements
I would like to acknowledge that this research project was funded by the Social Sciences and Humanities Research Council of Canada through McGill University's Institutional Grant. My sincere appreciation goes to Carolyn Turner for her patience, insight, and guidance, which inspired me to complete this research project. I am also very grateful to Eunice Jang, Alister Cumming, and Merrill Swain for their valuable comments and suggestions on an earlier version of this article. Thanks are also due to three anonymous reviewers of Language Testing for their helpful comments.
V References
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Barnwell, D. (1989). Naive native speakers and judgments of oral proficiency in Spanish. Language Testing, 6, 152-163.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1-15.
Caracelli, V. J. & Greene, J. C. (1993). Data analysis strategies for mixed-method evaluation designs. Educational Evaluation and Policy Analysis, 15, 195-207.
Caracelli, V. J. & Greene, J. C. (1997). Crafting mixed-method evaluation designs. In Greene, J. C. & Caracelli, V. J., editors, Advances in mixed-method evaluation: The challenges and benefits of integrating diverse
8/13/2019 An Investigation Into Native and Non-native Teachers' Judgments of Oral English Performance- Mixed Method
27/32
212 An investigation into native and non-native teachers judgments
paradigms. New Directions for Evaluation no. 74 (pp. 1932). SanFrancisco: Jossey-Bass.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across differenttests and rater groups.Language Testing, 12, 1633.
Chalhoub-Deville, M. & Wigglesworth, G. (2005). Rater judgment and Englishlanguage speaking proficiency. World Englishes, 24, 383391.
Clark, J. L. D. & Swinton, S. S. (1979).An exploration of speaking proficiencymeasures in the TOEFL context (TOEFL Research Report No. RR-04).Princeton, NJ: Educational Testing Service.
Clark, J. L. D. & Swinton, S. S. (1980). The test of spoken English as ameasure of communicative ability in English-medium instructional set-tings(TOEFL Research Report No. RR-07). Princeton, NJ: EducationalTesting Service.
Crystal, D. (2003). English as a global language. Cambridge: Cambridge
University Press.Fayer, J. M. & Krasinski, E. (1987). Native and nonnative judgments of intel-
ligibility and irritation.Language Learning, 37, 313326.Galloway, V. B. (1980). Perceptions of the communicative efforts of American
students of Spanish.Modern Language Journal, 64, 428433.Graddol, D. (1997). The future of English?: A guide to forecasting the
popularity of English in the 21st century. London, UK: The BritishCouncil.
Greene, J. C., Caracelli, V. J. & Graham, W. F. (1989). Toward a conceptualframework for mixed-method evaluation design.Educational Evaluation
and Policy Analysis, 11, 255274.Hadden, B. L. (1991). Teacher and nonteacher perceptions of second-language
communication.Language Learning, 41, 124.Hill, K. (1997). Who should be the judge?: The use of non-native speakers
as raters on a test of English as an international language. In Huhta, A.,Kohonen, V., Kurki-Suonio, L., & Luoma, S., editors, Current develop-ments and alternatives in language assessment: Proceedings of LTRC96 (pp. 275290). Jyvskyl: University of Jyvskyl and University ofTampere.
Jenkins, J. (2003). World Englishes: A resource book for students.New York:
Routledge.Johnson, B. & Turner, L. A. (2003). Data collection strategies in mixed
methods research. In Tashakkori, A. & Teddlie, C., editors, Handbookof mixed methods in social and behavioral research (pp. 297319).Thousand Oaks, CA: Sage.
Kachru, B. B. (1985). Standards, codification and sociolinguistic realism: TheEnglish language in the outer circle. In Quirk, R. & Widdowson, H.,editors,English in the world: Teaching and learning the language andliteratures(pp. 1130). Cambridge: Cambridge University Press.
Kachru, B. B. (1992). The other side of English. In Kachru, B. B., editors, The
other tongue: English across cultures(pp. 115). Urbana, IL: Universityof Illinois Press.
8/13/2019 An Investigation Into Native and Non-native Teachers' Judgments of Oral English Performance- Mixed Method
28/32
Youn-Hee Kim 213
Kim, Y-H. (2005).An investigation into variability of tasks and teacher-judgesin second language oral performance assessment.Unpublished mastersthesis, McGill University, Montreal, Quebec, Canada.
Lee, H-K. (2007). A study on the English teacher quality as an English instruc-
tor and as an assessor in the Korean secondary school.English Teaching,62, 309330.
Linacre, J. M. (1989). Many-facet Rasch measurement.Chicago, IL: MESAPress.
Linacre, J. M. (2005). A users guide to facets: Rasch-model computer pro-grams. [Computer software and manual]. Retrieved April 10, 2005, fromwww.winsteps.com.
Linacre, J. M. & Williams, J. (1998). How much is enough?Rasch Measurement:Transactions of the Rasch Measurement SIG, 12, 653.
Lowenberg, P. H. (2000). Assessing English proficiency in the global con-
text: The significance of non-native norms. In Kam, H. W., editor,Language in the global context: Implications for the language classroom(pp. 207228). Singapore: SEAMEO Regional Language Center.
Lowenberg, P. H. (2002). Assessing English proficiency in the ExpandingCircle. World Englishes, 21, 431435.
Lunz, M. E. & Stahl, J. A. (1990). Judge severity and consistency across grad-ing periods.Evaluation and the health professions, 13, 425444.
McNamara, T. F. (1996). Measuring second language performance.London:Longman.
Myford, C. M. & Wolfe, E. W. (2000).Monitoring sources of variability within
the test of spoken English Assessment System (TOEFL Research ReportNo. RR-65). Princeton, NJ: Educational Testing Service.
Myford, C. M. & Wolfe, E. W. (2004a). Detecting and measuring rater effectsusing many-facet Rasch measurement: Part I. In Smith, Jr., E. V. &Smith, R. M., editors,Introduction toRasch measurement(pp. 460517).Maple Grove, MN: JAM Press.
Myford, C. M. & Wolfe, E. W. (2004b). Detecting and measuring rater effectsusing many-facet Rasch measurement: Part II. In Smith, Jr., E. V. &Smith, R. M., editors,Introduction to Rasch measurement. Maple Grove,MN: JAM Press, 518574.
OLoughlin, K. (1995). Lexical density in candidate output on direct andsemi-direct versions of an oral proficiency test. Language Testing, 12,217237.
Stansfield, C. W. & Kenyon, D. M. (1992a). The development and validation ofa simulated oral proficiency interview. The Modern Language Journal,72, 129141.
Stansfield, C. W. & Kenyon, D. M. (1992b). Research on the comparability ofthe oral proficiency interview and the simulated oral proficiency inter-view. System, 20, 347364.
Stansfield, C. W., Kenyon, D. M., Paiva, R., Doyle, F., Ulsh, I., & Antonia, M.
(1990). The development and validation of the Portuguese Speaking Test,Hispania,73, 641651.
8/13/2019 An Investigation Into Native and Non-native Teachers' Judgments of Oral English Performance- Mixed Method
29/32
214 An investigation into native and non-native teachers judgments
Tashakkori, A. & Teddlie, C., editors (2003). Handbook of mixed methods insocial and behavioral research. Thousand Oaks, CA: Sage.
Taylor, L. B. (2006). The changing landscape of English: Implications for lan-guage assessment,ELT Journal,60, 5160.
Teddlie, C. & Yu, F. (2007). Mixed methods sampling: A typology with exam-ples.Journal of Mixed Methods Research, 1, 77100.
Underhill, N. (1987). Testing spoken language: A handbook of oral testingtechniques.Cambridge: Cambridge University Press.
Widdowson, H. G. (1994). The ownership of English. TESOL Quarterly, 28,377388.
Wright, B. D. & Linacre, J. M. (1994). Reasonable mean-square fit values.Rasch Measurement: Transactions of the Rasch Measurement SIG, 8,370.
Youn-Hee Kim 215
Appendix A: Rating scale for the oral English test
4  Overall communication is almost always successful; little or no listener effort is required.

3  Overall communication is generally successful; some listener effort is required.

2  Overall communication is less successful; more listener effort is required.

1  Overall communication is generally unsuccessful; a great deal of listener effort is required.

Notes:

1. Communication is defined as an examinee's ability to both address a given task and get a message across.

2. A score of 4 does not necessarily mean speech is comparable to that of native English speakers.

3. No response, or a response of "I don't know," is automatically rated NR (Not Ratable).
10. Intelligibility: the degree to which the response is intelligible or comprehensible
- Hard to understand language (a great deal of listener work required)
- Almost always understandable language

11. Sentence structure: the degree to which the sentential structure of the response is of good quality and complexity
- Cannot make complex sentences.
- Telegraphic speech
- Took risk with more complex sentence structure

12. General grammar use: the degree to which the general grammatical use is of good quality
- Generally good grammar
- Some problems with grammar
- Few grammatical errors

13. Specific grammar use: the degree to which the micro-level of grammatical use is of good quality
- Omission of articles
- Incorrect or vague use of prepositions of place
- Good use of past progressive

14. Socio-cultural appropriateness: the degree to which the response is appropriate in a social and cultural sense
- Cultural/pragmatic issue (a little formal to congratulate a friend)
- Little congratulations, more advice (culturally not appropriate)

15. Contextual appropriateness: the degree to which the response is appropriate to the intended communicative goals of a given situation
- Appropriate language for a given situation
- Student response would have been appropriate if Monica had expressed worry about going to graduate school.

16. Coherence: the degree to which the response is developed in a coherent manner
- Good use of linking words
- Great time markers
- Organized answer

17. Supplement of details: the degree to which sufficient information or details are provided for effective communication
- Provides enough details for effective explanation about the graph.
- Student only made one general comment about the graph without referring to specifics.
- Lacks enough information with logical explanation.

18. Completeness of discourse: the degree to which the discourse of the response is organized in a complete manner
- Incomplete speech
- No reference to conclusion
- End not finished.

19. Elaboration of argument: the degree to which the argument of the response is elaborated
- Mentioned his arguments but did not explain them.
- Good elaboration of reasons
- Connect ideas smoothly by elaborating his arguments.