
English for Specific Purposes, Vol. 17, No. 4, pp. 347–367, 1998
© 1998 The American University. Published by Elsevier Science Ltd. All rights reserved. Printed in Great Britain

0889-4906/98 $19.00+0.00

PII: S0889-4906(97)00016-1

Perceptions of Language-trained Raters and Occupational Experts in a Test of Occupational English Language Proficiency

Tom Lumley

Abstract—As part of the process of development of spoken language assessment procedures in occupational settings, it is common practice to use occupational experts as informants. The rating process, however, more commonly relies exclusively upon the judgements of language-trained specialists. Research to date has produced conflicting findings concerning the relative harshness and other characteristics of language-trained raters versus 'naïve' native speaker or occupational expert raters.

This issue is considered in the context of a recent standard-setting project carried out for the Occupational English Test [McNamara, T. F. (1990), Assessing the second language proficiency of health professionals. Unpublished doctoral dissertation, University of Melbourne; McNamara, T. F. (1996) Measuring second language performance. London: Longman; Lumley, T., Lynch, B., & McNamara, T. F. (1994), A new approach to standard-setting in language assessment, Melbourne Papers in Language Testing, 3(2), 19–39], an occupation-specific test of English for overseas-trained health professionals administered on behalf of the Australian Government. The study was conducted in response to criticism of the standards applied in the test. Twenty audio recordings of role plays from recent administrations of the speaking sub-test were each rated by both ten trained ESL raters and ten medical practitioners.

The ratings produced were analysed to compare the extent of agreement reached by the two groups of judges concerning candidates' language proficiency, as well as group and individual differences in interpretations of the rating scale used. Broad similarities in judgements found between the two groups indicate that the practice of relying on ESL-trained raters is essentially justified. © 1998 The American University. Published by Elsevier Science Ltd. All rights reserved

Key words: language testing, language for specific purposes, language in occupational settings, performance assessment, occupational English

Address correspondence to: Tom Lumley, Department of English, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: [email protected]

Introduction

In general-purpose language tests such as the ACTFL Oral Proficiency Interview (ACTFL 1986), the Test of Spoken English (TSE) (Clark & Swinton 1979) or the Cambridge EFL examinations (UCLES 1987) it is common practice to use trained language teachers as raters. The behaviour under examination in such tests is language proficiency generalisable to a wide range of unspecified and unspecifiable contexts, and it therefore seems logical to rely on language-trained experts to make appropriate judgements. In occupation-specific language tests, where judgements are made concerning candidates' ability to use language in more specific contexts, the rating process still commonly relies exclusively upon the judgements of language-trained specialists (e.g. McNamara 1990, 1996).

Research to date has produced conflicting findings concerning the relative harshness and other characteristics of language-trained raters versus 'naïve' native speaker or occupational expert raters (Galloway 1980; Barnwell 1989; Brown 1995). Barnwell (1989), for instance, found that a group of native speakers of Spanish, who had received no particular language training, were consistently harsher in their ratings of American students' performances in Spanish-language oral interviews than was an ACTFL-trained rater.

Powers & Stansfield (1985) investigated the applicability of using a test of general English proficiency, the TSE, for making judgements about the English proficiency of test takers wishing to practise in a specific occupation (nursing). They did this by comparing judgements of test takers' proficiency made by nurses and consumers (patients) with those made by the raters normally used in assessing the TSE, who were ESL teachers trained as raters for the test. One of the issues for the researchers, therefore, was whether ESL teachers were able to make judgements that were reasonably consistent with those made by members of the nursing profession and their clients. They found moderate, though not high, levels of agreement between the language and non-language specialists: median correlations between scores produced by the two kinds of occupational judges (nurses and consumers) and TSE scores (produced by pairs of ESL-trained raters) were 0.66–0.68.

In the context of a test of Japanese for Tour Guides, an advanced level test of occupational language proficiency, Brown (1995) found that there were variations in rater behaviour between raters depending on whether or not they had experience in the tourist industry. These variations showed not in the level of 'harshness' displayed by the two groups of raters, but in their sensitivity to particular assessment criteria. The raters with a teaching background but no industry experience also showed themselves reluctant to use the full range of score points on the rating scale used to assess candidates. The point of Brown's study was to establish whether or not the two groups could provide fair (equivalent) ratings for candidates, using the scale provided. Her analysis showed that the two groups demonstrated similar levels of consistency in their ratings (that is, members of both groups generally showed reasonable agreement in their perceptions of candidates' relative proficiency), and of overall severity (that is, neither group was noticeably more severe), but that there was more variability in severity levels amongst the non-teachers.


In addition, the two groups interpreted different criteria in different ways: the language teachers rated more harshly on linguistic categories (grammar and expression, vocabulary and fluency), while non-teachers were harsher on pronunciation, as well as on one of the criteria, 'task fulfilment', on the task 'Dealing with an upset or worried client', which requires particular skills relevant to the task of being a successful tour guide.

The Occupational English Test

This study examines the issue of agreement between the judgements of language experts and occupational experts in another specific-purpose context, that of the assessment of the English language proficiency of overseas-trained immigrant health professionals in Australia. It is a matter of significance for ESP teachers, as we shall see, that their competence in this sort of context is examined.

In order to obtain registration for practice in Australia, overseas-trained health professionals first have to demonstrate their English language proficiency by passing the Occupational English Test (OET) (McNamara 1990, 1996). The OET is a four-skills specific-purpose test (speaking, writing, listening and reading) currently used by eleven health professions. The largest group of candidates is currently medical practitioners (doctors), but there are also significant numbers of nurses, dentists, vets, physiotherapists and others. Doctors were chosen as the subjects of this study, since they form the dominant group in numerical terms, and it is their judgements that will be compared with those of ESL teachers.

The test includes common tasks for all professions for the listening and reading components, while the speaking and writing sub-tests use materials developed for each profession. Tasks used in the OET are designed to reflect the most common types of communication required of health professionals (McNamara 1990, 1996). As a result, the tasks it uses place the test content in the context of professional practice.

Neither the reading (multiple-choice questions) nor the listening sub-test relies upon ratings, therefore these will not be examined in this study. The writing sub-test, for practical reasons (discussed further), is not examined here. The speaking sub-test, which is the focus of this paper, takes the form of a warm-up interview (unassessed), followed by two clinically-based role-plays. The interaction takes place between an interlocutor, in the role of patient/client or the relative of a patient/client, and the candidate, in his/her professional role (see Fig. 1).

Two raters rate all candidates after undergoing a training procedure. This orients them to the purpose and content of the test, as well as the rating criteria used. Following a number of practice ratings, where they compare their judgements with those of other raters in the group, all prospective raters are required to assess (individually) a common selection of audiotapes. The scores they provide are analysed, and any rater who shows an insufficient level of agreement with the others in the group is not permitted to act as a rater. Raters are reaccredited periodically, repeating this procedure.

Figure 1. Demonstration Stimulus Materials.

The first rating is typically conducted live, while the second is made later from an audio tape of the interaction. A six-point rating scale is used for the six categories shown in Fig. 2. More information about how the points on the scale are to be interpreted is included in Appendix A.

1 A CVA is a cerebro-vascular accident, more commonly known in lay terms as a stroke.


The OET raises two concerns which are particularly relevant to ESP teachers. The first relates to the rating of test performance. This is conducted by qualified ESL teachers trained as raters, most of whom have experience as ESP teachers (although not necessarily as teachers of English for health professionals). In making their assessments, raters are asked to judge whether or not the candidate should be able to participate successfully in the next stage of accreditation, which is normally a supervised, clinically-based bridging programme in a teaching hospital. The rating process therefore requires the rater to relate candidate performance to the communicative demands of a situation with which she is not actually familiar, although some of its typical features are described during the rater training process. It is important, therefore, to establish whether or not it is reasonable to expect these raters to be able to make judgements of this kind. In other words, how specific to the professional context is the assessment? Does it require specialist knowledge to make valid judgements about language proficiency in occupation-specific contexts such as the health professions, where candidates may draw on or refer to a large body of professional knowledge and experience? In these circumstances, do doctors and ESL teachers apply similar standards?

The second area of concern relates more closely to teaching. Candidates resident in Australia who take the OET typically enrol in test preparation courses designed and taught specifically for health professionals by ESP teachers. Because of the test's claim to reflect real-life professional communicative demands, teachers of such courses are likely to focus on the kind of language that candidates will need in their professional lives. For the practising ESP teacher who designs and teaches these courses, the issue of whether or not ESL teachers as professionals (ESL teachers employed in OET preparation courses do not necessarily receive special training) are able to make judgements in this sort of context which are substantially in agreement with representatives of the professions concerned is obviously crucial. How well are ESP teachers able to assess the occupational language proficiency of their students? How well able are they, then, to evaluate the extent to which the courses they teach really prepare students for professional communication?

Each category below is rated on a six-point scale (6 5 4 3 2 1), anchored by the descriptors shown:

  OVERALL COMMUNICATIVE EFFECTIVENESS   6 = Near-native flexibility and range, 1 = Limited
  INTELLIGIBILITY                       6 = Intelligible, 1 = Unintelligible
  FLUENCY                               6 = Even, 1 = Uneven
  COMPREHENSION                         6 = Complete, 1 = Incomplete
  APPROPRIATENESS OF LANGUAGE           6 = Appropriate, 1 = Inappropriate
  RESOURCES OF GRAMMAR AND EXPRESSION   6 = Rich, flexible, 1 = Limited

Figure 2. Occupational English Test: Rating — Categories and Scale used.

In summary, it would be helpful to have some sort of evidence that the understanding held by ESL-trained OET raters of the communicative demands of the professional setting does not differ markedly from that of the doctors represented in this study.

Purpose of this Study

This study reports on the extent of agreement found in ratings given by two groups of judges, (1) ESL-trained raters and (2) doctors, on the speaking component of the OET. The issue of agreement between these two groups was examined during a standard-setting exercise, the purpose of which was to redefine acceptable levels of performance in the test. However, first some background is needed on why this exercise took place.

The pass score for the test was originally set at a minimum of 4 (on a scale of 1–6) for the category 'Overall Communicative Effect', plus an average score of 4 for the remaining five assessment categories. However, there had recently been criticism from bodies representing health professionals that the pass standard for the OET was too low, so that (it was claimed) candidates were passing the test with inadequate proficiency in English to cope with the demands of their profession.

This view received anecdotal support from other quarters too, including some of the ESP teachers involved in preparing candidates for the test. Their concern appeared to be motivated by problems candidates might face if they pass the test with levels of proficiency too low for them to gain entry to or be successful in the clinically based bridging courses, or too low for them to gain employment in their field. It might appear more productive, from the point of view of test candidates, for them to spend more time acquiring a sounder grasp of English than in struggling against odds which are already stacked fairly heavily against them, with the added burden of communication difficulties.

The issue of setting standards in language tests is always a political one. In the case of the OET, stated in simplistic terms, a tension exists between the views of advocates of the immigrant professionals (who generally press for a more lenient standard), and those of the representatives of professional registration boards (who typically advocate more stringent criteria). In recent years, the views of the advocates of the immigrants have held greater sway, with the result that the OET has not been a difficult test, often having pass rates of 70–80% or more. The major decisions regarding candidates were thus moved to one of the examination procedures conducted by the councils representing the health professions once candidates have passed the OET.

It was recognised by the test developers that the recent criticisms of low pass standards deserved investigation. For example, it was considered possible that the raters' view of the criterion level required to pass the test had slipped over the years since the introduction of the test in its present format in the late 1980s. It was, therefore, decided to conduct a study to determine a revised pass level for the test.

The study also made it possible to evaluate the competence of the ESL teachers as raters in this context, by comparing their judgements with those made by representatives of the health profession. More specifically, the implied criticism by the health professions, that the ESL teachers were applying standards that were too lenient, could be investigated. It is this part of the standard-setting study which this paper discusses.

Because most of the criticism focused on candidates' oral interaction, the speaking sub-test was selected as the initial area of investigation. The issue of test candidates' writing ability is seen as being of less concern, and therefore the pass standard in the writing sub-test has not been considered. As mentioned, doctors were chosen as the focus of the present study, because of their dominance as candidates in numerical terms.

Methodology

The standard-setting study aimed to establish a new criterion level for performance on the speaking sub-test. As mentioned earlier, Powers & Stansfield (1985), in their study of the TSE, obtained judgements of both nurses (representing their profession) and consumers (representing patients) for comparison with scores given by TSE raters (ESL teachers). Because of the kinds of criticisms received of the OET pass standard, it was decided to adopt a similar methodology, employing the judgements of (1) representatives of the medical profession and (2) trained ESL raters who regularly rate test performance. The two sets of judgements would then be compared. This would allow statements to be made in relation to the concerns raised earlier about the competence of ESL raters. In particular, it would provide information about whether the judgements of the two groups of judges were, in fact, comparable, or whether the two professional groups perceived candidate proficiency in this context in quite different terms.

The following questions will be considered in this study:

Question 1. To what extent did the two groups agree on classification of candidates as pass/fail?


Question 2. Were ESL raters as a group more lenient than doctors?

Question 3. What evidence is there for differences between judgements made by the individual raters?

Ten ESL raters, trained in assessment of the OET, and ten doctors, were initially selected to take part in the study. The doctors were, for the most part, chosen on the basis of their having had extensive experience working with overseas-trained doctors in the clinically based bridging programmes mentioned earlier, giving them familiarity with the issues faced by these doctors in professional settings. Two doctors were included as representatives of the Australian Medical Council (AMC), the professional body responsible for accreditation of medical practitioners in Australia. One of these had similar experience to the other eight, while the other occupied a senior position on the Examining Body of the AMC.

Twenty audio tapes of test candidates were selected from recent test administrations, from a range of the national and language groups most commonly represented in the test population. They covered a range of score points, above a clear fail, but most were clustered in the range of an average score across rating categories of between 4 and 5 (on a scale of 1–6), the range in which it was anticipated the new pass level would fall. All participants rated these 20 tapes. Owing to pressure of work, one doctor was unable to complete the task, and data were only collected from nine of the doctors.

A significant difference between the Powers & Stansfield (1985) study and the present one lies in the samples of language on which judges were asked to make decisions. Powers and Stansfield asked judges (both nurses and consumers) to make judgements about candidates' English proficiency for three different general situations in which nurses might be engaged (hospital nursing, public health nursing and teaching). In making these judgements, they relied on samples of oral language elicited by tasks on the TSE, a test of general proficiency with content not specifically related to any health profession, and a test in which the ESL-trained raters are not asked to consider any specific occupational context when making operational judgements. In the OET, by contrast, judges (both the doctors who participated in this study and the ESL-trained raters who conducted the rating for this study, under operational conditions) are asked to make judgements about candidates' proficiency to function generally in the communicative contexts of a particular medical setting (a clinically-based bridging course in a hospital). In making these judgements, they relied on the candidates' performances in a test with content designed specifically for health professionals, with a task simulating a situation medical practitioners might expect to encounter routinely, i.e. a medical consultation.

Briefing Process

The doctors were all given, either individually or in small groups, a short briefing session (30–45 min), the main purpose of which was to clarify the judging task. During this session, most of them expressed an opinion about the issue of English proficiency of overseas-trained doctors. Views of individual participants varied: generally most, but by no means all, overseas-trained doctors were perceived as having adequate proficiency in English. One or two participants thought the English language proficiency of overseas-trained doctors was a very serious problem; others felt that issues related to communication by these doctors in professional settings are not necessarily best conceptualised as a language problem, but may include a wide range of other factors, many of them cultural, and that there are potent reasons of equity for not demanding standards of English language proficiency from immigrant doctors that are not assessed in doctors trained in Australia (whatever their language background).

The practical point of interest in this study was a simple judgement of whether or not the candidate was considered to have adequate proficiency in English to participate successfully in a supervised clinical bridging programme. It was felt impossible in this context to expect useful judgements from the doctors on the full range of linguistic assessment categories, without extended discussion of how each individual category should be interpreted. This was seen as impractical. They were, therefore, provided with a list of the assessment categories used to guide the ESL teachers' ratings, with a brief gloss for each one, and asked to make only a single holistic judgement, using the category 'Overall Communicative Effect'. Practical reasons also made it possible to obtain judgements from the doctors on only the first role-play for each candidate. A full set of the instructions provided to the doctors is given in Appendix A.

The ESL raters received no particular briefing, since they had all been trained previously as raters for the OET, and had all (except one) taken part in regular rating for the test recently. One rater was somewhat different from the others, in that she was not a regular assessor for the OET, and had received only minimal training, but had long familiarity with the situation in which OET candidates worked following success in the test. In order to be able to make meaningful comparisons between the judgements produced by the two groups of judges, only the judgements on the category 'Overall Communicative Effectiveness' for the first role play were analysed. As reported by McNamara (1990, 1996), this holistic category represents the best summary of the ratings provided for all categories of assessment, although it is likely to be influenced by features of the candidate's performance additional to the linguistic ones specified.

The differences between the two groups of judges are summarised in Fig. 3. As can be seen, the two groups are polarised in a number of ways, and it would, therefore, not be surprising if we were to observe substantial differences between them.

  ESL raters                                     Doctors
  language experts                               occupational experts
  received training as raters for the test       no training as raters
  reliability established after training         reliability not established
  used 5 explicit linguistic categories of       one judgement only, no particular linguistic
  assessment plus 'Overall Communicative         categories: 'Overall Communicative Effect'
  Effect'                                        only

Figure 3. Features of the Two Groups of Judges.

Results

Question 1: To what extent did the two groups agree on classification of candidates as pass/fail?

This question is of course central in the context of standard setting, as well as in considering the comparability of the two groups of judges.

Table 1 shows that, as indeed one might expect with single, holistic ratings on a subjectively marked test, there was considerable variation in levels of agreement within and between the two groups of raters over pass/fail categorisations.

TABLE 1
Number of Raters Classifying Each Candidate as 'Pass' (raw score = 4.0 or more), Overall Communicative Effect Only

  Candidate       ESL raters   Doctors    Total
  ID no. (N=20)   (N=10)       (N=9)      (N=19)
  19              10           9          19   most proficient
  20              10           9          19
  16              10           9          19
  1               10           9          19
  8               10           9          19
  18               9           9          18
  14              10           7          17
  17              10           8          18
  6                9           9          18
  11               8           7          15
  10               9           6          15
  4                7           9          16
  2                5           7          12
  9                6           6          12
  5                4           3           7
  15               3           5           8
  7                2           2           4
  12               3           1 [of 7]    4 [of 17: poorly audible; 2 ratings missing]
  13               0           2           2
  3                0           0           0   least proficient

Complete agreement within each group is marked in bold type.


Only the five most able candidates were universally judged by both groups as passing; at the other end of the scale, there was complete agreement over only the least proficient candidate, no. 3. A further three candidates were passed by all the doctors (candidates 18, 6 and 4), but failed by at least one ESL rater, while two other candidates were passed by all ESL raters (candidates 14 and 17), but failed by one or more doctors. The ESL raters also agreed that candidate no. 13 should fail. This leaves eight candidates over whom there was greater disagreement.
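The tallying behind Table 1 can be illustrated with a minimal Python sketch (not the study's actual code); the data layout, group labels and scores below are invented for the example.

from collections import defaultdict

# Raw-score pass level on Overall Communicative Effect, as used in the study.
PASS_SCORE = 4.0

def pass_counts(ratings_by_group):
    """ratings_by_group: {group: {candidate_id: [scores, one per judge]}}"""
    counts = defaultdict(dict)
    for group, ratings in ratings_by_group.items():
        for cand, scores in ratings.items():
            counts[cand][group] = sum(1 for s in scores if s >= PASS_SCORE)
    return counts

# Two invented candidates: one clearly passing, one clearly failing.
ratings = {
    "ESL":     {19: [6, 6, 5, 5, 6, 5, 6, 6, 6, 5], 3: [2, 3, 2, 2, 3, 2, 3, 2, 3, 2]},
    "doctors": {19: [6, 5, 6, 5, 6, 5, 5, 6, 6],    3: [2, 2, 1, 2, 3, 2, 2, 1, 2]},
}
for cand, by_group in sorted(pass_counts(ratings).items()):
    print(cand, dict(by_group), "total:", sum(by_group.values()))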

Question 2: Were ESL raters as a group more lenient than doctors, as had been predicted would be the case?

Using the scores allocated by both groups of raters on the single category of Overall Communicative Effect, the answer is, counter to expectations, no, as has been reported in Lumley et al. (1994). Table 2 shows that, with a raw score pass level of 4.0, the average scores given by the doctors as a group would allow 13 of the sample to pass, whereas the average scores given by the ESL raters would pass only 11; so, if anything, the ESL raters appear harsher than the doctors. Examination of mean scores produced by the two groups, however, shows no difference between them.

TABLE 2
Mean Scores Produced by ESL Raters and Doctors: Overall Communicative Effect Only

  Candidate   ESL raters   Doctors'     Combined ratings,
  ID no.                   judgements   ESL raters & doctors
  19          6.0          5.6          5.8   most proficient
  20          5.4          5.4          5.4
  16          5.0          5.3          5.2
  1           4.6          5.0          4.8
  8           4.6          4.7          4.6
  18          4.2          4.6          4.4
  14          4.6          4.0          4.3
  17          4.4          4.2          4.3
  6           4.1          4.3          4.2
  11          4.0          4.0          4.0
  10          3.9          4.0          4.0
  4           3.7          4.1          3.9
  2           3.6          4.0          3.8
  9           3.6          3.6          3.6
  5           3.3          3.2          3.3
  15          3.1          3.3          3.2
  7           2.9          3.0          3.0
  12          3.1          2.1          2.6
  13          2.6          2.3          2.5
  3           2.4          1.9          2.2   least proficient
  mean        3.94         3.93         3.93
  s.d.        0.93         1.06

Table 2 also sheds more light on the issue of consistency between the two groups: it is worth noting that all candidates that were passed as a group (i.e. with a mean score of 4.0 or above) by the ESL raters were also passed by the doctors. Meanwhile, no candidate failed by the doctors as a group (i.e. with a mean score below 4.0) was passed by the ESL raters.

Question 3: What evidence is there for differences between the individual raters?

In Table 3, which shows the number of candidates passed by each judge, we can see that there are, in fact, very substantial differences here in the degree of severity shown by individual raters, with the ESL raters each passing between 8 and 18 of the candidates, and the doctors each passing between 8 and 17. ESL rater no. 285 and doctor no. 109 are both considerably harsher than the rest of either group. There also appears to be slightly less variation amongst the doctors concerning the number of passing candidates than among the ESL raters.

TABLE 3
Number of Candidates Passed by Each Judge, Overall Communicative Effect Only

  ESL raters (N=10)                Doctors (N=9)
  Judge    No.      No.            Judge    No.      No.
  ID no.   passed   failed         ID no.   passed   failed
  282      18        2   lenient   101      17        3   lenient
  287      17        3             108      16        3
  283      14        6             102      16        3
  286      14        6             106      15        5
  288      14        6             103      14        6
  289      14        6             105      14        6
  290      13        7             107      13        7
  281      12        8             104      13        7
  284      11        9             109       8       12   harsh
  285       8       12   harsh

The scores given by each judge for all 20 candidates were ranked, and a correlation table produced for each pair of judges, using Spearman's r calculations (Table 4), in order to show trends in consistency of agreement between each judge and every other.
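As an illustration of the procedure, here is a minimal sketch (not the study's code) of computing pairwise Spearman rank correlations between judges. It assumes each judge's Overall Communicative Effect scores for the same 20 candidates are held in the same candidate order; the judge IDs are taken from the study, but the scores shown are invented.

from itertools import combinations
from scipy.stats import spearmanr

# Scores per judge, one per candidate, in a fixed candidate order.
scores = {
    281: [6, 5, 4, 4, 3, 5, 4, 2, 3, 4, 4, 3, 2, 5, 3, 5, 4, 4, 6, 2],
    282: [6, 5, 5, 4, 4, 5, 4, 3, 3, 4, 4, 3, 3, 5, 4, 5, 4, 5, 6, 3],
    109: [5, 4, 4, 3, 3, 4, 3, 2, 2, 4, 3, 2, 2, 4, 3, 4, 3, 4, 5, 2],
}

# One Spearman coefficient per pair of judges, as in Table 4.
for a, b in combinations(scores, 2):
    rho, _ = spearmanr(scores[a], scores[b])
    print(f"judges {a} and {b}: rho = {rho:.2f}")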

The strength of the correlations between pairs of judges is indicated by italic and bold type in Table 4: correlations below 0.70 are in plain type, those between 0.70 and 0.79 are italic, and those of 0.80 and above are bold italic. The correlation between each pair of judges is on average around 0.70, indicating a moderate overall level of agreement.

TABLE 4
Correlations Between Judgements on Overall Communicative Effect by ESL Raters and Doctors

        281  282  283  284  285  286  287  288  290  289  101  102  103  104  105  106  107  108
  282  0.67
  283  0.70 0.70
  284  0.71 0.60 0.78
  285  0.86 0.75 0.89 0.78
  286  0.86 0.80 0.69 0.74 0.81
  287  0.61 0.82 0.82 0.73 0.80 0.70
  288  0.85 0.71 0.74 0.81 0.88 0.82 0.73
  290  0.86 0.72 0.83 0.75 0.93 0.83 0.83 0.84
  289  0.66 0.68 0.85 0.77 0.80 0.67 0.83 0.75 0.78
  101  0.69 0.84 0.72 0.68 0.80 0.74 0.83 0.83 0.75 0.82
  102  0.76 0.66 0.74 0.72 0.72 0.77 0.75 0.68 0.74 0.71 0.76
  103  0.69 0.72 0.99 0.76 0.89 0.70 0.82 0.74 0.83 0.84 0.75 0.76
  104  0.71 0.60 0.81 0.60 0.82 0.66 0.66 0.65 0.74 0.77 0.59 0.59 0.81
  105  0.73 0.71 0.87 0.68 0.85 0.77 0.82 0.70 0.86 0.75 0.73 0.76 0.88 0.84
  106  0.81 0.72 0.83 0.81 0.90 0.77 0.76 0.81 0.83 0.85 0.79 0.75 0.84 0.81 0.80
  107  0.70 0.42 0.68 0.65 0.67 0.64 0.48 0.66 0.69 0.68 0.54 0.67 0.64 0.44 0.52 0.69
  108  0.63 0.44 0.70 0.65 0.60 0.55 0.54 0.55 0.51 0.69 0.44 0.68 0.68 0.75 0.60 0.73 0.57
  109  0.89 0.69 0.86 0.78 0.91 0.84 0.73 0.81 0.85 0.80 0.74 0.84 0.87 0.89 0.88 0.92 0.66 0.81

(Judges 281–290 are ESL raters; 101–109 are doctors. In the original, correlations of 0.80 and higher appear in bold italic, and correlations of 0.70–0.79 in italic.)

It can be seen that doctors 107 and 108 are most out of step with the others, with paired correlations rarely reaching 0.70 or above (and a mean correlation with the other judges of 0.58 and 0.59, respectively). We saw earlier that doctors 107 and 108 show greater agreement with the whole group in terms of pass/fail decisions, but Table 4 shows that they are applying the scale inconsistently in comparison with the other judges, demonstrating much less agreement about the relative proficiency of the candidates. By contrast, ESL rater no. 285 and doctor no. 109, earlier identified as the harshest judges, show the highest average correlation with all the other pairs (average correlation with other judges of 0.77 and 0.78, respectively, compared with an average correlation amongst all pairs of judges of 0.70). This suggests that these two judges are applying the rating scale in a pattern consistent both with each other (a correlation between them of 0.91) and with the other judges, but with the important difference that they are using different absolute standards: they would agree with the other judges about the relative proficiency of the candidates, but they would allow fewer candidates to pass the test.

It is worth noting that these figures are higher, but not by much, than the (median) correlations reported by Powers & Stansfield (1985) between judges (including both nurses and consumers) and TSE scores (produced by pairs of ESL-trained raters) of 0.66 to 0.68. It would perhaps be unreasonable to expect much higher correlations, since each judge has provided only a single score for each candidate, based on a single performance. Another point relevant here is that, as in the Powers & Stansfield (1985) study, the occupational experts (doctors) had received no training in the application of the rating scale.

With regard to Brown's (1995) finding that language specialists may be reluctant to use the full range of score points on a scale when rating occupational language tests, Fig. 4 shows that neither group uses the lowest score category much at all, although there is clearly a greater reluctance on the part of the ESL raters than the doctors to use the two lowest score points. At the other end of the scale, however, the highest score, 6, is used equally by both groups. Generally, the ESL raters do seem to prefer to use the middle points on the scale, with 62% of their ratings falling into the categories of 3 or 4, compared to the doctors, who only used those categories for 52% of their scores. Possibly this is accounted for by the ESL raters reserving the use of the lowest two categories for the weakest performances that they sometimes encounter when rating test administrations, examples of which were not selected for this study, whereas the doctors have no such experience to draw on (test candidates with low levels of English proficiency would not be registered for practice as doctors).

Figure 4. Use of Score Categories, ESL Raters and Doctors.
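As a rough illustration of the kind of tabulation summarised in Fig. 4 (again a sketch, not the study's code; the score lists below are invented):

from collections import Counter

# Flat list of all scores given by each group's judges.
all_ratings = {
    "ESL":     [4, 3, 4, 5, 3, 4, 4, 2, 6, 3, 4, 5, 4, 3, 4],
    "doctors": [5, 2, 4, 5, 3, 4, 6, 1, 6, 2, 5, 5, 3, 2, 4],
}

for group, ratings in all_ratings.items():
    counts = Counter(ratings)
    total = len(ratings)
    share = {pt: round(100 * counts.get(pt, 0) / total) for pt in range(1, 7)}
    print(group, share, f"middle points 3-4: {share[3] + share[4]}%")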

Discussion

The most significant finding to emerge from this study is that at a global level there seems to be reasonable agreement between the two groups (although there is also evidence of significant differences amongst members of each group). In other words, it seems reasonable for ESL raters to make judgements in this sort of occupational setting: it would appear that it does not matter too dramatically which of the two groups conducts the rating. This is a reassuring finding in the context of this test, providing clear evidence for the validity of the ratings made operationally by ESL teachers in an ESP setting. This finding refutes the implicit challenge to their competence by bodies representing the health professions.

However, there is evidence for two kinds of disagreement between the judges involved in this study. Firstly, differences in perceptions of test-taker proficiency are shown in the noticeably higher level of harshness exhibited by ESL rater no. 285 and doctor no. 109. Despite this greater harshness, it should be pointed out that this pair of judges includes one from each professional group, and that they disagree little with the remainder of the judges about the relative proficiency of the candidates. Harshness, it appears, is not necessarily a function of professional background, but a more individual affair. Secondly, while doctors 107 and 108 do show little agreement about the rank order of candidate proficiency, they are generally in line with the majority of the judges, of both professional backgrounds, in terms of their categorisation of candidates as pass or fail.

Considerable variation, then, has been observed between individual members of each group. Training might be expected to reduce this variation (see Weigle 1994, for example). However, this study emphasises once again the need for more than a single rating of performances on subjective tests such as this one, which carry significant consequences for test takers. It also points to the need for some form of mechanism that can compensate for differences in relative harshness or leniency of raters, as shown by those involved in this study, since it is clear that even despite training, differences will remain. It should be noted here that for the purposes of reporting individual OET candidates' scores operationally, there are two mechanisms which improve the reliability of candidates' scores. Firstly, all candidates are rated twice. Secondly, multi-faceted Rasch analysis (Linacre 1989) is used, which takes into account the relative harshness or leniency of the judges rating each candidate, and compensates accordingly, building this into the scores reported for candidates.
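For reference, the many-facet Rasch model underlying this kind of adjustment is commonly written in the following form (a standard formulation of Linacre's model, not reproduced from this paper; the symbols are the conventional ones, not the OET's):

    \log\left(\frac{P_{njk}}{P_{nj(k-1)}}\right) = B_n - C_j - F_k

where P_{njk} is the probability of candidate n receiving score k from judge j, B_n is the candidate's ability, C_j is the judge's severity, and F_k is the difficulty of the step from score k-1 to k. Because a severity term C_j is estimated for every judge, the candidate measures B_n reported from the analysis are adjusted for whether a candidate happened to meet harsh or lenient raters.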

The wider variation observed between the individual ESL raters, compared to the doctors (see Table 3), concerning the number of candidates who should pass the test, may be partially attributable to the different rating processes employed by the two groups: it is conceivable that the more complex rating task conducted by the ESL raters (considering six categories of assessment) leads them to produce more diverse ratings than the doctors. It is quite possible that different linguistic features are dominating the judgements of different raters for each candidate. Alternatively, the content or the interactive style of individual candidates' performances could be exerting a significant influence upon the ratings given by members of the two groups of judges. The issue of how well ESP teachers preparing their students for this kind of test can anticipate the demands of (a) the test itself and (b) professional life, while partly encouraging for the ESL profession, is not resolved by the limited data-set presented in this study. Many factors are likely to play a part in judgements made by occupational experts: it would be interesting to determine what these influences are and consider their relevance to the construct of language proficiency. This is an issue intimately bound up with the nature of communicative competence in specific-purpose contexts, about which we still know too little.

The scores produced in this study by the two groups of judges showed the sample of doctors to be no harsher than the ESL raters; in fact, if anything, the reverse. This point requires discussion, raising as it does questions about the validity of the test, since, in another sense, the principal complaint had been that the OET raters were perceived as too lenient, a suggestion which finds no support in this study.

The OET deliberately includes content with an occupation-specific focus, unlike general-purpose tests such as the TSE. It is possible that this content focus contributed to the slightly improved inter-rater reliability found in this study, compared to that found in the Powers & Stansfield (1985) study. However, it may be that either the tasks presented in the role-plays or the communicative demands of the test situation do not adequately represent the kind of oral communication where test candidates may in real life show themselves to be lacking in proficiency. This should not surprise us, given that the test was designed as an example of a 'weak' performance test, to use McNamara's term (McNamara 1996); that is to say, its primary purpose is the elicitation of a sample of language which can be assessed, and the occupational focus is only used to provide a context that appears generally relevant to the participants in the test. For example, it may be that there is a problem with the interlocutors, who are, for the most part, middle-class, well-educated, articulate native speakers of rather standard Australian English, whose ESL training has alerted them to the potential for miscommunication in spoken interaction, which, one may fairly safely presume, they would take some trouble to avoid in a testing situation. They may not represent sufficiently well the kind of patients or clients with whom health professionals need to interact, or, at least, the range of patients with whom they work. Involved here is very likely the intractable issue of breadth of comprehension, involving perhaps the ability, or lack of it, to process idiomatic language (possibly avoided or simplified by ESL teachers), a skill which is largely untested in the OET. Another feature which may be insufficiently considered is the ability of candidates to clarify, expand or rephrase explanations and courses of action in different ways when interacting with patients of different backgrounds or with different needs. There are possibilities here for further research concerning the authenticity of the task and of the interaction between candidate and interlocutor. There are also implications which extend beyond the testing context, in the field of ESP in general, related to how ESP practitioners ensure that their teaching encompasses an adequate range of kinds of language appropriate for their clients.

The extent to which tasks simulating doctor–patient interaction (as used in this test) are sufficiently representative of the range of types of communicative behaviour required of doctors in hospital settings may also be questioned. It is likely that the criticisms of the English proficiency of some doctors are partly motivated by difficulties experienced during doctor–doctor interaction in busy and stressful hospital settings. Alternatively, the issue may be less involved with language proficiency than with cultural expectations. Again, the ESP practitioner needs to be aware of the range of interactive situations in which students might find themselves in academic or professional life.

It might be suggested that we need a 'stronger' performance test than this, in McNamara's terms, involving some sort of judgements about candidates' professional competence (the ability to use accurate explanations to reassure patients, for example). If so, the problem then arises of who would be competent to make the assessments. The ESL raters already express concern on occasion over the extent to which they should be making judgements about candidates' knowledge or understanding of professional terminology. It would be unfair for test takers as well as raters to ask more of them here: in essence, ESL-trained raters are neither permitted nor competent to pass judgement on such matters. Another solution might be to ensure that only ESP teachers with experience of the health professions should act as raters, but such a move would ultimately face the same problems. However, to rely on doctors to make the kind of complex ratings now made by the ESL raters would create a wide range of additional problems, including issues of training, selection, and expense. In either of these cases, to make the test 'stronger' would take it well beyond consideration of linguistic competence, and firmly into the realm of professional knowledge and clinical competence: the place for such judgements is surely within the procedures which already exist for assessment and certification specifically in these areas.

From the point of view of the practising ESP teacher, what emerges from this study is the continuing need for ESP teachers to work with members of other professions in order to ensure that their expectations of, on the one hand, the content and type of communication encountered in professional contexts, and on the other, the standards applied within the profession, are realistic. It is only by means of such collaboration that the differences observed here amongst the judges can be investigated and understood, and steps can be taken to ensure that students of ESP are given adequate training to equip them for the demands they will face in their professional lives, not only in assessment contexts but also in professional practice. The publication of the findings obtained through this sort of collaboration would contribute to our general understanding of how assessment in this sort of context may best be carried out. Are the task types realistic and useful? What are the sorts of interactions that need to be represented in ESP tests of this kind? What are the crucial aspects of professional communication that learners of English find hard to deal with? How may we improve the match between the expectations of ESP teachers and experts in other occupations? How is this to be reflected in the test preparation courses delivered by ESP teachers?

To conclude, in the light of the general patterns observed in the data collected in this study, there seems to be no convincing argument yet, in the case of the test examined here, the OET, for using other than properly trained ESL teachers as raters, who both have expertise in language (which is what the test aims to assess) and also appear to agree tolerably well with the occupational experts, the doctors. The experience of such raters is likely to have the advantage of enabling them to make judgements which are both similar and generalisable across diverse occupational contexts, rather than being restricted to a single profession. In the interests of fairness to candidates for ESP tests who may come from a diverse group of loosely related professions (as in the case of OET candidates), such generalisability is an important issue. This study has examined the context of a test for health professionals, but there is fertile ground for further investigation of the issues raised here in all areas of ESP, which can only lead to greater understanding of the ways in which assessment of ESP should be conducted, and the relationship between the work of ESP teachers and the task of communicating in professional situations.

Acknowledgements—The author wishes to acknowledge the contribution of the National Languages and Literacy Institute of Australia (NLLIA) in supporting the ongoing research associated with the Occupational English Test. Thanks are due to colleagues at the NLLIA-LTRC and to Liz Hamp-Lyons and Jan Hamilton for feedback on earlier versions of this paper, as well as to two anonymous reviewers for their helpful comments.

(Revised version received January 1997)

REFERENCES

American Council on the Teaching of Foreign Languages (ACTFL) (1986). ACTFL proficiency guidelines. New York: ACTFL.


Barnwell, D. (1989). Naïve native speakers and judgements of oral proficiency in Spanish. Language Testing, 6(2), 152–163.

Brown, A. D. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–16.

Clark, J. L. D., & Swinton, S. S. (1979). An exploration of speaking proficiency measures in the TOEFL context (TOEFL Research Report 4). Princeton, NJ: Educational Testing Service.

Galloway, V. (1980). Perceptions of the communicative efforts of American students of Spanish. Modern Language Journal, 64, 428–433.

Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.

Lumley, T., Lynch, B., & McNamara, T. F. (1994). A new approach to standard-setting in language assessment. Melbourne Papers in Language Testing, 3(2), 19–39.

McNamara, T. F. (1990). Assessing the second language proficiency of health professionals. Unpublished doctoral dissertation, University of Melbourne.

McNamara, T. F. (1996). Measuring second language performance. London: Longman.

Powers, D. E., & Stansfield, C. W. (1985). Testing the oral proficiency of foreign nursing graduates. The English for Specific Purposes Journal, 4, 21–35.

University of Cambridge Local Examinations Syndicate (UCLES) (1987). English as a foreign language: General handbook. Cambridge: UCLES.

Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223.

Appendix A: Occupational English Test (OET): Speaking Standard-setting Project, 1994

Information sheet for supervisors of clinical bridging programmes

The aim is to elicit opinions, from supervisors who have experience of training overseas-trained medical practitioners, of the minimum working knowledge of English required in a supervised clinical setting. You will be asked to listen to a series of 20 audio recordings from recent administrations of the OET, and on the basis of these make a judgement concerning the adequacy of each candidate's English for participation in supervised clinical practice.

Assessment in the speaking sub-test of the OET is carried out on the basis of performance on two tasks, each lasting approximately 4 to 8 minutes. These tasks take the form of role plays: simulated consultations between the candidate (adopting his/her professional role) and a native speaker of English (in the role of patient or client or the relative of a patient/client). For the purposes of the current study, you will listen only to the first of these role plays for each candidate.


Before and during the test the candidate is constantly reassured:

1. that the purpose of the interaction is to elicit a sample of language on the basis of which a judgement may be made about his/her English language proficiency; and

2. that no judgements are made concerning the candidate's medical knowledge.

The medical content of the interaction, and the quality of the advice given, are therefore irrelevant to decisions made during this test of language. You should therefore completely set aside any judgement of the candidate's clinical knowledge or experience.

The following scale is used in rating candidates:

Overall communicative effectiveness

                                       PASS          FAIL
  Near-native flexibility and range    6   5   4  :  3   2   1    Limited

The points on the scale should be interpreted as follows:

6: There is no doubt about the candidate's ability to communicate effectively in English.
5: The candidate would clearly be able to cope successfully with the linguistic demands of a supervised clinical bridging programme.
4: The candidate has the minimum competence necessary to cope with the linguistic demands of a supervised bridging programme in a clinical setting.
3: The candidate does not quite have the minimum competence necessary to cope with the linguistic demands of a supervised bridging programme in a clinical setting.
2: The candidate would clearly fail to cope with the linguistic demands of a supervised clinical bridging programme.
1: The candidate has no more than a fairly elementary level of competence in English, and should probably not even be taking this test.

The scale is thus meant to indicate a range from a very advanced to a fairly elementary competence. Candidates who pass the OET may be eligible to apply for a place in a supervised bridging programme in a teaching hospital, provided they also pass any additional screening tests of medical knowledge/clinical competence that the programme may require as part of its admission procedure. A passing level (nominally mid-way between score points 3 and 4) will therefore represent the minimum competence with which a candidate could cope with a bridging programme in a clinical setting, involving interaction with patients/clients, clinical teachers and colleagues.

In making your decision you should consider the following questions:


—Could this person cope without undue embarrassment to him/herself or to others (supervisors, clinical teachers, patients, relatives of patients, colleagues) with the communicative demands of this supervised setting?

—Do you think this person would find the communicative demands of such a setting unreasonably stressful?

—Could you manage to communicate effectively with this person in a clinical bridging programme you were supervising?

—Could your patients manage necessary communication with this person?

—Could your colleagues manage to communicate effectively with this person in a supervised clinical bridging programme?

Language features which may contribute to your decision include the following (this is not an exhaustive list):

Intelligibility (e.g. How easy is it to understand the candidate's pronunciation? Does it require undue strain to listen to the candidate? Does it become easier to understand him/her as you get used to the accent/style of speech?)

Fluency (e.g. How evenly does the candidate speak? Does speech flow at a rate which enables the listener at least to follow the conversation?)

Comprehension (Does the candidate appear to understand most of what the patient expresses about his/her concerns?)

Appropriateness of language (e.g. Are appropriate expressions used in explaining medical conditions or courses of action to the patient? Is any inappropriate choice a real barrier to communication?)

Resources of grammar and expression (e.g. Does the candidate have adequate vocabulary and control of grammatical expression to express necessary ideas clearly and unambiguously? Are any deficits here so serious as to form a real barrier to communication?)

At the end of the role play, enter an assessment using the six-point scale of overall communicative effectiveness as shown above. Use a cross to mark which of the six points on the scale best locates the candidate's performance in that category. Please DO NOT place a mark between two score points.

Tom Lumley is employed in the English Department of Hong Kong Polytechnic University. He has worked for many years in the area of language assessment research and development, mainly in Australia, in addition to his experience as an ESL/ESP teacher. His PhD studies are concerned with the process of assessing writing performance.
