Testing for Language Teachers

Arthur Hughes

Cambridge University Press
Cambridge New York Port Chester Melbourne Sydney

The right of the University of Cambridge to print and sell all manner of books was granted by Henry VIII in 1534. The University has printed and published continuously since 1584.

HUGHES - Testing for Language Teachers



© Cambridge University Press 1989

First published 1989

Printed in Great Britain by Bell & Bain Ltd, Glasgow

Library of Congress cataloguing in publication data
Hughes, Arthur, 1941-
Testing for language teachers / Arthur Hughes.
p. cm. - (Cambridge handbooks for language teachers)
Bibliography: p. Includes index.
ISBN 0 521 25264 4. - ISBN 0 521 27260 2 (pbk.)
1. Language and languages - Ability testing. I. Title. II. Series.
P53.4.H84 1989 407.5-dc19 88-3850 CIP

British Library cataloguing in publication data
Hughes, Arthur, 1941-
Testing for language teachers. - (Cambridge handbooks for language teachers).
1. Great Britain. Educational institutions. Students. Foreign language skills. Assessment. Tests - For teaching.
I. Title.
428'.0076

ISBN 0 521 25264 4 hard covers
ISBN 0 521 27260 2 paperback

Copyright
The law allows a reader to make a single copy of part of a book for purposes of private study. It does not allow the copying of entire books or the making of multiple copies of extracts. Written permission for any such copying must always be obtained from the publisher in advance.

Testing for Language Teachers

CAMBRIDGE HANDBOOKS FOR LANGUAGE TEACHERS
General Editors: Michael Swan and Roger Bowers

This is a series of practical guides for teachers of English and other languages. Illustrative examples are usually drawn from the field of English as a foreign language, but the ideas and techniques described can be used in the teaching of any language.

In this series:

Drama Techniques in Language Learning - A resource book of communication activities for language teachers by Alan Maley and Alan Duff

Games for Language Learning by Andrew Wright, David Betteridge and Michael Buckby

Discussions that Work - Task-centred fluency practice by Penny Ur

Once Upon a Time - Using stories in the language classroom by John Morgan and Mario Rinvolucri

Teaching Listening Comprehension by Penny Ur

Keep Talking - Communicative fluency activities for language teaching by Friederike Klippel

Working with Words - A guide to teaching and learning vocabulary by Ruth Gairns and Stuart Redman

Learner English - A teacher's guide to interference and other problems edited by Michael Swan and Bernard Smith

Testing Spoken Language - A handbook of oral testing techniques by Nic Underhill

Literature in the Language Classroom - A resource book of ideas and activities by Joanne Collie and Stephen Slater

Dictation - New methods, new possibilities by Paul Davis and Mario Rinvolucri

Grammar Practice Activities - A practical guide for teachers by Penny Ur

Testing for Language Teachers by Arthur Hughes

For Vicky, Meg and Jake

Contents

Acknowledgements viii
Preface ix

1 Teaching and testing 1
2 Testing as problem solving: an overview of the book 6
3 Kinds of test and testing 9
4 Validity 22
5 Reliability 29
6 Achieving beneficial backwash 44
7 Stages of test construction 51
8 Test techniques and testing overall ability 59
9 Testing writing 75
10 Testing oral ability 101
11 Testing reading 116
12 Testing listening 134
13 Testing grammar and vocabulary 141
14 Test administration 152

Appendix 1 Statistical analysis of test results 155
Appendix 2 165
Bibliography 166
Index 170

Preface

The simple objective of this book is to help language teachers write better tests. It takes the view that test construction is essentially a matter of problem solving, with every teaching situation setting a different testing problem. In order to arrive at the best solution for any particular situation - the most appropriate test or testing system - it is not enough to have at one's disposal a collection of test techniques from which to choose. It is also necessary to understand the principles of testing and how they can be applied in practice.

It is relatively straightforward to introduce and explain the desirable qualities of tests: validity, reliability, practicality, and beneficial backwash (this last, which refers to the favourable effects tests can have on teaching and learning, here receiving more attention than is usual in books of this kind). It is much less easy to give realistic advice on how to achieve them in teacher-made tests. One is tempted either to ignore the problem or to present as a model the not always appropriate methods of large-scale testing institutions. In resisting these temptations I have been compelled to make explicit in my own mind much that had previously been vague and intuitive. I have certainly benefited from doing this; I hope that readers will too.

Exemplification throughout the book is from the testing of English as a foreign language. This reflects both my own experience in language testing and the fact that English will be the one language known by all readers. I trust that it will not prove too difficult for teachers of other languages to find or construct parallel examples of their own.

I must acknowledge the contributions of others: MA students at Reading University, too numerous to mention by name, who have taught me much, usually by asking questions that I could not answer; my friends and colleagues, Paul Fletcher, Michael Garman, Don Porter, and Tony Woods, who all read parts of the manuscript and made many helpful suggestions; Barbara Barnes, who typed a first version of the early chapters; Michael Swan, who gave good advice and much encouragement, and who remained remarkably patient as each deadline for completion passed; and finally my family, who accepted the writing of the book as an excuse more often than they should. To all of them I am very grateful.

1 Teaching and testing

Many language teachers harbour a deep mistrust of tests and of testers. The starting point for this book is the admission that this mistrust is frequently well-founded. It cannot be denied that a great deal of language testing is of very poor quality. Too often language tests have a harmful effect on teaching and learning; and too often they fail to measure accurately whatever it is they are intended to measure.

The effect of testing on teaching and learning is known as backwash. If a test is regarded as important, then preparation for it can come to dominate all teaching and learning activities. And if the test content and testing techniques are at variance with the objectives of the course, then there is likely to be harmful backwash. An instance of this would be where students are following an English course which is meant to train them in the language skills (including writing) necessary for university study in an English-speaking country, but where the language test which they have to take in order to be admitted to a university does not test those skills directly. If the skill of writing, for example, is tested only by multiple choice items, then there is great pressure to practise such items rather than practise the skill of writing itself. This is clearly undesirable.

We have just looked at a case of harmful backwash. However, backwash need not always be harmful; it can be positively beneficial. I was once involved in the development of an English language test for an English medium university in a non-English-speaking country. The test was to be administered at the end of an intensive year of English study there and would be used to determine which students would be allowed to go on to their undergraduate courses (taught in English) and which would have to leave the university. A test was devised which was based directly on an analysis of the English language needs of first year undergraduate students, and which included tasks as similar as possible to those which they would have to perform as undergraduates (reading textbook materials, taking notes during lectures, and so on). The introduction of this test, in place of one which had been entirely multiple choice, had an immediate effect on teaching: the syllabus was redesigned, new books were chosen, classes were conducted differently. The result of these changes was that by the end of their year's training, in circumstances made particularly difficult by greatly increased numbers and limited resources, the students reached a much higher standard in English than had ever been achieved in the university's history. This was a case of beneficial backwash.

Davies (1968:5) has said that 'the good test is an obedient servant since it follows and apes the teaching'. I find it difficult to agree. The proper relationship between teaching and testing is surely that of partnership. It is true that there may be occasions when the teaching is good and appropriate and the testing is not; we are then likely to suffer from harmful backwash. This would seem to be the situation that leads Davies to confine testing to the role of servant of teaching. But equally there may be occasions when teaching is poor or inappropriate and when testing is able to exert a beneficial influence. We cannot expect testing only to follow teaching. What we should demand of it, however, is that it should be supportive of good teaching and, where necessary, exert a corrective influence on bad teaching. If testing always had a beneficial backwash on teaching, it would have a much better reputation amongst teachers. Chapter 6 of this book is devoted to a discussion of how beneficial backwash can be achieved.

The second reason for mistrusting tests is that very often they fail to measure accurately whatever it is they are intended to measure. Students' true abilities are not always reflected in the test scores that they obtain. To a certain extent this is inevitable. Language abilities are not easy to measure; we cannot expect a level of accuracy comparable to those of measurements in the physical sciences. But we can expect greater accuracy than is frequently achieved.

Why are tests inaccurate? The causes of inaccuracy (and ways of minimising their effects) are identified and discussed in subsequent chapters, but it is possible to outline two main sources of inaccuracy here. The first of these concerns test content and techniques. If, for example, we want to know how well someone can write, there is no way we can get a truly accurate measure of their ability by means of a multiple choice test. Where testing is carried out on a very large scale, and the scoring of tens of thousands of compositions might not seem to be a practical proposition, it is understandable that potentially greater accuracy is sacrificed for reasons of economy and convenience. But it does not give testing a good name! And it does set a bad example.

While few teachers would wish to follow that particular example in order to test writing ability, the overwhelming practice in large-scale testing of using multiple choice items does lead to imitation in circumstances where such items are not at all appropriate. What is more, the imitation tends to be of a very poor standard. Good multiple choice items are notoriously difficult to write. A great deal of time and effort has to go into their construction. Too many multiple choice tests are written where such care and attention is not given (and indeed may not be possible). The result is a set of poor items that cannot possibly provide accurate measurements. One of the principal aims of this book is to discourage the use of inappropriate techniques and to show that teacher-made tests can be superior in certain respects to their professional counterparts.

The second source of inaccuracy is lack of reliability. Reliability is a technical term which is explained in Chapter 5. For the moment it is enough to say that a test is reliable if it measures consistently. On a reliable test you can be confident that someone will get more or less the same score, whether they happen to take it on one particular day or on the next; whereas on an unreliable test the score is quite likely to be considerably different, depending on the day on which it is taken.

Unreliability has two origins. In the first case, something about the test creates a tendency for individuals to perform significantly differently on different occasions when they might take the test. Their performance might be quite different if they took the test on, say, Wednesday rather than on the following day. As a result, even if the scoring of their performance on the test is perfectly accurate (that is, the scorers do not make any mistakes), they will nevertheless obtain a markedly different score, depending on when they actually sat the test, even though there has been no change in the ability which the test is meant to measure. This is not the place to list all possible features of a test which might make it unreliable, but examples are: unclear instructions, ambiguous questions, items that result in guessing on the part of the test takers. While it is not possible entirely to eliminate such differences in behaviour from one test administration to another (human beings are not machines), there are principles of test construction which can reduce them.

In the second case, identical performances are accorded significantly different scores. For example, the same composition may be given very different scores by different markers (or even by the same marker on different occasions). Fortunately, there are well-understood ways of minimising such differences in scoring.

Most (but not all) large testing organisations, to their credit, take every precaution to make their tests, and the scoring of them, as reliable as possible, and they are generally highly successful in this respect. Small-scale testing, by contrast, tends to be less reliable than it should be. Another aim of this book, then, is to show how to achieve greater reliability in testing. Advice on this is to be found in Chapter 5.

The need for tests

So far this chapter has been concerned to understand why tests are so mistrusted by many language teachers. We have seen that this mistrust is often justified. One conclusion drawn from this might be that we would be better off without language tests. Teaching is, after all, the primary activity; if testing comes into conflict with it, then it is testing which should go, especially when it has been admitted that so much testing provides inaccurate information. This is a plausible argument - but there are other considerations, which might lead to a different conclusion.

Information about people's language ability is often very useful and sometimes necessary. It is difficult to imagine, for example, British and American universities accepting students from overseas without some knowledge of their proficiency in English. The same is true for organisations hiring interpreters or translators. They certainly need dependable measures of language ability.

Within teaching systems, too, as long as it is thought appropriate for individuals to be given a statement of what they have achieved in a second or foreign language, then tests of some kind or other will be needed. They will also be needed in order to provide information about the achievement of groups of learners, without which it is difficult to see how rational educational decisions can be made. While for some purposes teachers' assessments of their own students are both appropriate and sufficient, this is not true for the cases just mentioned. Even without considering the possibility of bias, we have to recognise the need for a common yardstick, which tests provide, in order to make meaningful comparisons.

If it is accepted that tests are necessary, and if we care about testing and its effect on teaching and learning, the other conclusion (in my view, the correct one) to be drawn from a recognition of the poor quality of so much testing is that we should do everything that we can to improve the practice of testing.

1. Here 'testing' is interpreted widely. It is used to refer to any procedure for obtaining information about language ability. No distinction is made between tests and examinations.

What is to be done?

The teaching profession can make two contributions to the improvement of testing: they can write better tests themselves, and they can put pressure on others, including professional testers and examining boards, to improve their tests. This book represents an attempt to help them do both.

For the reader who doubts that teachers can influence the large testing institutions, let this chapter end with a further reference to the testing of writing through multiple choice items. This was the practice followed by those responsible for TOEFL (Test of English as a Foreign Language), the test taken by most non-native speakers of English applying to North American universities. Over a period of many years they maintained that it was simply not possible to test the writing ability of hundreds of thousands of candidates by means of a composition: it was impracticable and the results, anyhow, would be unreliable. Yet in 1985 a writing test, in which candidates actually have to write, was introduced as a supplement to TOEFL, and already a number of universities are requiring applicants to take this test. The principal reason given for this change was pressure from English language teachers who had finally convinced those responsible for the TOEFL of the overriding need for a writing task which would provide beneficial backwash.

READER ACTIVITIES

1 Think of tests with which you are familiar (the tests may be international or local, written by professionals or by teachers). What do you think the backwash effect of each of them is? Harmful or beneficial? What are your reasons for coming to these conclusions?
2 Consider these tests again. Do you think that they give accurate or inaccurate information? What are your reasons for coming to these conclusions?

Further reading

For an account of how the introduction of a new test can have a striking beneficial effect on teaching and learning, see Hughes (1988a). For a review of the new TOEFL writing test which acknowledges its potential beneficial backwash effect but which also points out that the narrow range of writing tasks set (they are of only two types) may result in narrow training in writing, see Greenberg (1986). For a discussion of the ethics of language testing, see Spolsky (1981).

2 Testing as problem solving: an overview of the book

The purpose of this chapter is to introduce readers to the idea of testing as problem solving and to show how the content and structure of the book are designed to help them to become successful solvers of testing problems.

Language testers are sometimes asked to say what is 'the best test' or 'the best testing technique'. Such questions reveal a misunderstanding of what is involved in the practice of language testing. In fact there is no best test or best technique. A test which proves ideal for one purpose may be quite useless for another; a technique which may work very well in one situation can be entirely inappropriate in another. As we saw in the previous chapter, what suits large testing corporations may be quite out of place in the tests of teaching institutions. In the same way, two teaching institutions may require very different tests, depending amongst other things on the objectives of their courses, the purpose and importance of the tests, and the resources that are available. The assumption that has to be made therefore is that each testing situation is unique and so sets a particular testing problem. It is the tester's job to provide the best solution to that problem. The aims of this book are to equip readers with the basic knowledge and techniques first to solve such problems, secondly to evaluate the solutions proposed or already implemented by others, and thirdly to argue persuasively for improvements in testing practice where these seem necessary.

In every situation the first step must be to state the testing problem clearly. Every testing problem can be expressed in the same general terms: we want to create a test or testing system which will:
- consistently provide accurate measures of precisely the abilities¹ in which we are interested;
- have a beneficial effect on teaching (in those cases where the tests are likely to influence teaching);
- be economical in terms of time and money.

1. 'Abilities' is not meant here in any technical sense. It refers simply to what people can do in, or with, a language. This could, for example, include the ability to converse in a language, as well as the ability to recite grammatical rules (if that is something which we are interested in measuring). It does not, however, refer to language aptitude.

Let us describe the general testing problem in a little more detail. The first thing that testers have to be clear about is the purpose of testing in any particular situation. Different purposes will usually require different kinds of test. This may seem obvious but it is something which seems not always to be recognised. The purposes of testing discussed in this book are:
- to measure language proficiency regardless of any language courses that candidates may have followed;
- to discover how far students have achieved the objectives of a course of study;
- to diagnose students' strengths and weaknesses, to identify what they know and what they don't know;
- to assist placement of students by identifying the stage or part of a teaching programme most appropriate to their ability.

All of these purposes are discussed in the next chapter. That chapter also introduces different kinds of testing and test techniques: direct as opposed to indirect testing; discrete-point versus integrative testing; criterion-referenced testing as against norm-referenced testing; objective and subjective testing.

In stating the testing problem in general terms above, we spoke of providing consistent measures of precisely the abilities we are interested in. A test which does this is said to be 'valid'. Chapter 4 addresses itself to various kinds of validity. It provides advice on the achievement of validity in test construction and shows how validity is measured.

The word 'consistently' was used in the statement of the testing problem. The consistency with which accurate measures are provided is in fact an essential ingredient of validity: if a test is to be valid, the scores that people obtain on it must be very similar, regardless of whether they happen to take it on, say, one day rather than another. This consistency is referred to as 'reliability'. As was noted in the previous chapter, it is an absolutely essential quality of tests - what use is a test if it will give widely differing estimates of an individual's (unchanged) ability? - yet it is something which is distinctly lacking in very many teacher-made tests. Chapter 5 gives advice on how to achieve reliability and explains the ways in which it is measured.

The concept of backwash effect was introduced in the previous chapter. Chapter 6 identifies a number of conditions for tests to meet in order to achieve beneficial backwash.

All tests cost time and money - to prepare, administer, score and interpret. Time and money are in limited supply, and so there is often likely to be a conflict between what appears to be a perfect testing solution in a particular situation and considerations of practicality. This issue is also discussed in Chapter 6.

To rephrase the general testing problem identified above: the basic problem is to develop tests which are valid and reliable, which have a beneficial backwash effect on teaching (where this is relevant), and which are practical. The next four chapters of the book are intended to look more closely at the relevant concepts and so help the reader to formulate such problems clearly in particular instances, and to provide advice on how to approach their solution.

The second half of the book is devoted to more detailed advice on the construction and use of tests, the putting into practice of the principles outlined in earlier chapters. Chapter 7 outlines and exemplifies the various stages of test construction. Chapter 8 discusses a number of testing techniques. Chapters 9-13 show how a variety of language abilities can best be tested, particularly within teaching institutions. Chapter 14 gives straightforward advice on the administration of tests.

We have to say something about statistics. Some understanding of statistics is useful, indeed necessary, for a proper appreciation of testing matters and for successful problem solving. At the same time, we have to recognise that there is a limit to what many readers will be prepared to do, especially if they are at all afraid of mathematics. For this reason, statistical matters are kept to a minimum and are presented in terms that everyone should be able to grasp. The emphasis will be on interpretation rather than on calculation. For the more adventurous reader, however, Appendix 1 explains how to carry out a number of statistical operations.

Further reading

    Further reading

    The collection of critiC-al reviews of nearly 50 Engtish language tests(mostly Brifish and American), e Krahnke and Stans-field (1.987), reveals how well p iters are thought tohave solved their problerirs. A full of the reviervs lvilldepend to some degree on an assimilation of the content of Chapters 3, 4,and 5 of this book.

    {t

    {

    \

    t5

3 Kinds of test and testing

This chapter begins by considering the purposes for which language testing is carried out. It goes on to make a number of distinctions: between direct and indirect testing, between discrete point and integrative testing, between norm-referenced and criterion-referenced testing, and between objective and subjective testing. Finally there is a note on communicative language testing.

We use tests to obtain information. The information that we hope to obtain will of course vary from situation to situation. It is possible, nevertheless, to categorise tests according to a small number of kinds of information being sought. This categorisation will prove useful both in deciding whether an existing test is suitable for a particular purpose and in writing appropriate new tests where these are necessary. The four types of test which we will discuss in the following sections are: proficiency tests, achievement tests, diagnostic tests, and placement tests.

Proficiency tests

Proficiency tests are designed to measure people's ability in a language, regardless of any training they may have had in that language. The content of a proficiency test, therefore, is not based on the content or objectives of language courses which people taking the test may have followed. Rather, it is based on a specification of what candidates have to be able to do in the language in order to be considered proficient. This raises the question of what we mean by the word 'proficient'.

In the case of some proficiency tests, 'proficient' means having sufficient command of the language for a particular purpose. An example of this would be a test designed to discover whether someone can function successfully as a United Nations translator. Another example would be a test used to determine whether a student's English is good enough to follow a course of study at a British university. Such a test may even attempt to take into account the level and kind of English needed to follow courses in particular subject areas. It might, for example, have one form of the test for arts subjects, another for sciences, and so on. Whatever the particular purpose to which the language is to be put, this will be reflected in the specification of test content at an early stage of a test's development.

There are other proficiency tests which, by contrast, do not have any occupation or course of study in mind. For them the concept of proficiency is more general. British examples of these would be the Cambridge examinations (First Certificate Examination and Proficiency Examination) and the Oxford EFL examinations (Preliminary and Higher). The function of these tests is to show whether candidates have reached a certain standard with respect to certain specified abilities. Such examining bodies are independent of the teaching institutions and so can be relied on by potential employers etc. to make fair comparisons between candidates from different institutions and different countries. Though there is no particular purpose in mind for the language, these general proficiency tests should have detailed specifications, saying just what it is that successful candidates will have demonstrated that they can do. Each test should be seen to be based directly on these specifications. All users of a test (teachers, students, employers, etc.) can then judge whether the test is suitable for them, and can interpret test results. It is not enough to have some vague notion of proficiency, however prestigious the testing body concerned.

Despite differences between them of content and level of difficulty, all proficiency tests have in common the fact that they are not based on courses that candidates may have previously taken. On the other hand, as we saw in Chapter 1, such tests may themselves exercise considerable influence over the method and content of language courses. Their backwash effect - for this is what it is - may be beneficial or harmful. In my view, the effect of some widely used proficiency tests is more harmful than beneficial. However, the teachers of students who take such tests, and whose work suffers from a harmful backwash effect, may be able to exercise more influence over the testing organisations concerned than they realise. The recent addition to TOEFL, referred to in Chapter 1, is a case in point.

    \t1

    Most reachers are urnlikeiy to he responsibne fon proficiency tests. lt ismuch more probable that they will be involved in the preparation and useof abhievement tests" X.n contrast to proficiency tests,'achievement testsare directly relate

  • Kinds of test and testing

    " ^st!dy, They may be written and administered by ministries of education.official examining boards, or by rnembers of teacLring institutions.Clearly the content of these tests must be related to the courses withu,hich they are concerned, but the nat

    nt amongst language tesew of some testers, theased directly on a detaile

    other materials used. This has been referred to as the 'sylXabus-content9

    F

    I

approach'. It has an obvious appeal, since the test only contains what it is thought that the students will have actually encountered, and thus can be considered, in this respect at least, a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, then the results of a test based on them can be very misleading. Successful performance on the test may not truly indicate successful achievement of course objectives. For example, a course may have as an objective the development of conversational ability, but the course itself and the test may require students only to utter carefully prepared statements about their home town, the weather, or whatever. Another course may aim to develop a reading ability in German, but the test may limit itself to the vocabulary the students are known to have met. Yet another course is intended to prepare students for university study in English, but the syllabus (and so the course and the test) may not include listening (with note taking) to English delivered in lecture style on topics of the kind that the students will have to deal with at university. In each of these examples - all of them based on actual cases - test results will fail to show what students have achieved in terms of course objectives.

The alternative approach is to base the test content directly on the objectives of the course. This has a number of advantages. First, it compels course designers to be explicit about objectives. Secondly, it makes it possible for performance on the test to show just how far students have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on objectives work against the perpetuation of poor teaching practice, something which course-content-based tests, almost as if part of a conspiracy, fail to do. It is my belief that to base test content on course objectives is much to be preferred: it will provide more accurate information about individual and group achievement, and it is likely to promote a more beneficial backwash effect on teaching.1

1. Of course, if objectives are unrealistic, then tests will also reveal a failure to achieve them. This too can only be regarded as salutary. There may be disagreement as to why there has been a failure to achieve the objectives, but at least this provides a starting point for necessary discussion which otherwise might never have taken place.


Now it might be argued that to base test content on objectives rather than on course content is unfair to students. If the course content does not fit well with the objectives, they will be expected to do things for which they have not been prepared. In a sense this is true. But in another sense it is not. If a test is based on the content of a poor or inappropriate course, the students taking it will be misled as to the extent of their achievement and the quality of the course. Whereas if the test is based on objectives, not only will the information it gives be more useful, but there is less chance of the course surviving in its present unsatisfactory form. Initially some students may suffer, but future students will benefit from the pressure for change. The long-term interests of students are best served by final achievement tests whose content is based on course objectives.

The reader may wonder at this stage whether there is any real difference between final achievement tests and proficiency tests. If an achievement test is based on the objectives of a course, and these are equivalent to the needs on which a proficiency test is based, there is no reason to expect a difference between the form and content of the two tests. Two things have to be remembered, however. First, objectives and needs will not typically coincide in this way. Secondly, many achievement tests are not in fact based on course objectives. These facts have implications both for the users of test results and for test writers. Test users have to know on what basis an achievement test has been constructed, and be aware of the possibly limited validity and applicability of test scores. Test writers, on the other hand, must create achievement tests which reflect the objectives of a particular course, and not expect a general proficiency test (or some imitation of it) to provide a satisfactory alternative.


Progress achievement tests, as their name suggests, are intended to measure the progress that students are making. Since progress is towards the achievement of course objectives, one way of measuring it would be repeatedly to administer final achievement tests, the (hopefully) increasing scores indicating the progress made. This is not really feasible, particularly in the early stages of a course. The low scores obtained would be discouraging to students and quite possibly to their teachers. The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives. Then if the syllabus and teaching are appropriate to these objectives, progress tests based on short-term objectives will fit well with what has been taught. If not, there will be pressure to create a better fit. If it is the syllabus that is at fault, it is the syllabus that has to be changed; it is there that change is needed, not in the tests.


Since such tests will not form part of formal assessment procedures, their construction and scoring need not be too rigorous. Nevertheless, they should be seen as measuring progress towards the intermediate objectives on which the more formal progress achievement tests are based. They can, however, reflect the particular 'route' that an individual teacher is taking towards the achievement of objectives.

It has been argued in this section that it is better to base the content of achievement tests on course objectives rather than on the detailed content of a course. However, it may not be at all easy to convince colleagues of this, especially if the latter approach is already being followed. Not only is there likely to be natural resistance to change, but such a change may represent a threat to many people. A great deal of skill, tact and, possibly, political manoeuvring may be called for - topics on which this book cannot pretend to give advice.

Diagnostic tests

Diagnostic tests are used to identify students' strengths and weaknesses. They are intended primarily to ascertain what further teaching is necessary. At the level of broad language skills this is reasonably straightforward. We can be fairly confident of our ability to create tests that will tell us that a student is particularly weak in, say, speaking as opposed to reading in a language. Indeed existing proficiency tests may often prove adequate for this purpose.

We may be able to go further, analysing samples of a student's performance in writing or speaking in order to create profiles of the student's ability with respect to such categories as 'grammatical accuracy' or 'linguistic appropriacy' (see Chapter 9 for a scoring system that provides such profiles). But it is not so easy to obtain

a detailed analysis of a student's command of grammatical structures, something which would tell us, for example, whether she or he had mastered the present perfect/past tense distinction. In order to be sure of this, we would need a number of examples of the choice the student made between the two structures in every context which we thought was significantly different and important enough to obtain information on. A single example of each would not be enough, since a student might give the correct response by chance. As a result, a comprehensive diagnostic test of grammar would be vast (think of what would be involved in testing even a single area thoroughly, such as the modal verbs). The size of such a test would make it impractical to administer in a routine fashion. For this reason, very few tests are constructed for purely diagnostic purposes, and those that there are do not provide very detailed information.

The lack of good diagnostic tests is unfortunate. They could be extremely useful for individualised instruction or self-instruction. Learners would be shown where gaps exist in their command of the language, and could be directed to sources of information, exemplification and practice. The availability of relatively inexpensive computers with large memories may change the situation. Well-written computer programs would ensure that the learner spent no more time than necessary to obtain the desired information, and would do this without the need for a test administrator. Tests of this kind will still need a tremendous amount of work to produce. Whether or not they become generally available will depend on the willingness of individuals to write them and of publishers to distribute them.

Placement tests

Placement tests, as their name suggests, are intended to provide information which will help to place students at the stage (or in the part) of the teaching programme most appropriate to their abilities. Typically they are used to assign students to classes at different levels. Placement tests can be bought, but this is to be recommended only when the institution concerned is sure that the test being considered suits its particular teaching programme. No one placement test will work for every institution, and the initial assumption about any test that is commercially available must be that it will not work well. The placement tests which are most successful are those constructed for particular situations. They are tailor-made rather than bought off the peg. This usually means that they have been produced 'in house'. The work that goes into their construction is rewarded by the saving in time and effort through accurate placement. An example of how a placement test might be designed within an institution is given in Chapter 7.

Direct versus indirect testing

So far in this chapter we have considered a number of uses to which test results are put. We now distinguish between two approaches to test construction.

Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well candidates can write compositions, we get them to write compositions. If we want to know how well they pronounce a language, we get them to speak. The tasks, and the texts which are used, should be as authentic as possible. The fact that candidates are aware that they are in a test situation means that the tasks cannot be really authentic. Nevertheless the effort is made to make them as realistic as possible.

Direct testing is easier to carry out when it is intended to measure the productive skills of speaking and writing. The very acts of speaking and writing provide us with information about the candidate's ability. With listening and reading, however, it is necessary to get candidates not only to listen or read but also to demonstrate that they have done so successfully. The tester has to devise methods of eliciting such evidence accurately and without the method interfering with the performance of the skills in which he or she is interested. Appropriate methods for achieving this are discussed in Chapters 11 and 12. Interestingly enough, in many texts on language testing it is the testing of productive skills that is presented as being most problematic, for reasons usually connected with reliability. In fact the problems are by no means insurmountable, as we shall see in Chapters 9 and 10.

Direct testing has a number of attractions. First, provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements. Secondly, at least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward. Thirdly, since practice for the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.

Indirect testing attempts to measure the abilities which underlie the skills in which we are interested. One section of the TOEFL, for example, was developed as an indirect measure of writing ability. It contains items of the following kind:

At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.

where the candidate has to identify which of the underlined elements is erroneous or inappropriate in formal standard English. While the ability to respond to such items has been shown to be related to the ability to write compositions (although the strength of the relationship was not particularly great), the two are not the same thing. Another example of indirect testing is Lado's (1961) proposed method of testing pronunciation ability by a paper and pencil test in which the candidate has to identify pairs of words which rhyme with each other.


Perhaps the main appeal of indirect testing is that it seems to offer the possibility of testing a representative sample of a finite number of abilities which underlie a potentially indefinitely large number of manifestations of them. If, for example, we take a representative sample of grammatical structures, then, it may be argued, we have taken a sample which is relevant for all the situations in which control of grammar is necessary. By contrast, direct testing is inevitably limited to a rather small sample of tasks, which may call on a restricted and possibly unrepresentative range of grammatical structures. On this argument, indirect testing is superior to direct testing in that its results are more generalisable.

The main problem with indirect tests is that the relationship between performance on them and performance of the skills in which we are usually more interested tends to be rather weak in strength and uncertain in nature. We do not yet know enough about the component parts of, say, composition writing to predict accurately composition writing ability from scores on tests which measure the abilities which we believe underlie it. We may construct tests of grammar, vocabulary, discourse markers, handwriting, punctuation, and what we will. But we still will not be able to predict accurately scores on compositions (even if we make sure of the representativeness of the composition scores by taking many samples).

It seems to me that in our present state of knowledge, at least as far as proficiency and final achievement tests are concerned, it is preferable to concentrate on direct testing. Provided that we sample reasonably widely (for example require at least two compositions, each calling for a different kind of writing and on a different topic), we can expect more accurate estimates of the abilities that really concern us than would be obtained through indirect testing. The fact that direct tests are generally easier to construct simply reinforces this view with respect to institutional tests, as does their greater potential for beneficial backwash. It is only fair to say, however, that many testers are reluctant to commit themselves entirely to direct testing and will always include an indirect element in their tests. Of course, to obtain diagnostic information on underlying abilities, such as control of particular grammatical structures, indirect testing is called for.

Discrete point versus integrative testing

Discrete point testing refers to the testing of one element at a time, item by item. This might, for example, take the form of a series of items, each testing a particular grammatical structure. Integrative testing, by contrast, requires the candidate to combine many language elements in the completion of a task. This might involve writing a composition, making notes while listening to a lecture, taking a dictation, or completing a cloze passage. Clearly this distinction is not unrelated to that between indirect and direct testing. Discrete point tests will almost always be indirect, while integrative tests will tend to be direct. However, some integrative testing methods, such as the cloze procedure, are indirect.

Norm-referenced versus criterion-referenced testing

Imagine that a reading test is administered to an individual student. When we ask how the student performed on the test, we may be given two kinds of answer. An answer of the first kind would be that the student obtained a score that placed her or him in the top ten per cent of candidates who have taken that test, or in the bottom five per cent; or that she or he did better than sixty per cent of those who took it. A test which is designed to give this kind of information is said to be norm-referenced. It relates one candidate's performance to that of other candidates. We are not told directly what the student is capable of doing in the language.
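The first kind of answer is easy to compute. The sketch below is not from the book, and the cohort scores are invented for illustration; it derives a norm-referenced report (a percentile rank) for one candidate from the scores of all candidates:

```python
# Norm-referenced interpretation: a score is reported relative to the
# scores of the other candidates, not in terms of what it means in itself.

def percentile_rank(score, all_scores):
    """Percentage of candidates who scored strictly below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)

# Invented reading-test scores for a cohort of twenty candidates.
cohort = [31, 44, 52, 55, 58, 60, 61, 63, 64, 66,
          67, 69, 70, 72, 74, 75, 78, 81, 85, 92]

print(percentile_rank(72, cohort))  # 65.0: did better than 65 per cent
```

A report such as 'in the top ten per cent' is then simply a statement that the candidate's percentile rank is 90 or above; nothing is said about what the candidate can actually do.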

The other kind of answer we might be given is exemplified by the following, taken from the Interagency Language Roundtable (ILR) language skill level descriptions for reading:

Sufficient comprehension to read simple, authentic written materials in a form equivalent to usual printing or typescript on subjects within a familiar context. Able to read with some misunderstandings straightforward, familiar, factual material, but in general insufficiently experienced with the language to draw inferences directly from the linguistic aspects of the text. Can locate and understand the main ideas and details in materials written for the general reader . . . The individual can read uncomplicated, but authentic prose on familiar subjects that are normally presented in a predictable sequence which aids the reader in understanding. Texts may include descriptions and narrations in contexts such as news items describing frequently-occurring events, simple biographical information, social notices, formulaic business letters, and simple technical information written for the general reader. Generally the prose that can be read by the individual is predominantly in straightforward/high-frequency sentence patterns. The individual does not have a broad active vocabulary . . . but is able to use contextual and real-world clues to understand the text.

Similarly, a candidate who is awarded the Berkshire Certificate of Proficiency in German Level 1 can 'speak and react to others using simple language in the following contexts':

- to greet, interact with and take leave of others;
- to exchange information on personal background, home, school life and interests;


- to discuss and make choices, decisions and plans;
- to express opinions, make requests and suggestions;
- to ask for information and understand instructions.

In these two cases we learn nothing about how the individual's performance compares with that of other candidates. Rather we learn something about what he or she can actually do in the language. Tests which are designed to provide this kind of information directly are said to be criterion-referenced.2

The purpose of criterion-referenced tests is to classify people according to whether or not they are able to perform some task or set of tasks satisfactorily. The tasks are set, and the performances are evaluated. It does not matter in principle whether all the candidates are successful, or none of the candidates is successful. The tasks are set, and those who perform them satisfactorily 'pass'; those who don't, 'fail'. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than most of their fellows, they are destined to fail. In the case of the Berkshire German Certificate, for example, it is hoped that all students who are entered for it will be successful. Criterion-referenced tests therefore have two positive virtues: they set meaningful standards in terms of what people can do, standards which do not change with different groups of candidates, and they motivate students to attain those standards.

The need for direct interpretation of performance means that the construction of a criterion-referenced test may be quite different from that of a norm-referenced test designed to serve the same purpose. Let us imagine that the purpose is to assess the English language ability of students in relation to the demands made by English medium universities. The criterion-referenced test would almost certainly have to be based on an analysis of what students had to be able to do with or through English at university. Tasks would then be set similar to those to be met at university. If this were not done, direct interpretation of performance would be impossible. The norm-referenced test, on the other hand, while its content might be based on a similar analysis, is not so restricted. It might, for example, include vocabulary, grammar and reading comprehension components. A candidate's score on the test does not tell us directly what his or her English ability is in relation to the demands that would be made on it at an English-medium university. To know this, we must consult a table which makes recommendations as to the academic load that a student

2. People differ somewhat in their use of the term 'criterion-referenced'. This is unimportant provided that the sense intended is made clear. The sense in which it is used here is the one which I feel will be most useful to the reader in analysing testing problems.


with that score should be allowed to carry, this being based on experience over the years of students with similar scores, not on any meaning in the score itself. In the same way, university administrators have learned from experience how to interpret TOEFL scores and to set minimum scores for their own institutions.

Books on language testing have tended to give advice which is more appropriate to norm-referenced testing than to criterion-referenced testing. One reason for this may be that procedures for use with norm-referenced tests (particularly with respect to such matters as the analysis of items and the estimation of reliability) are well established, while those for criterion-referenced tests are not. The view taken in this book, and argued for in Chapter 6, is that criterion-referenced tests are often to be preferred, not least for the beneficial backwash effect they are likely to have. The lack of agreed procedures for such tests is not sufficient reason for them to be excluded from consideration.

Objective testing versus subjective testing

The distinction here is between methods of scoring, and nothing else. If no judgement is required on the part of the scorer, then the scoring is objective. A multiple choice test, with the correct responses unambiguously identified, would be a case in point. If judgement is called for, the scoring is said to be subjective. There are different degrees of subjectivity in testing. The impressionistic scoring of a composition may be considered more subjective than the scoring of short answers in response to questions on a reading passage.
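Objective scoring can be made concrete with a small sketch (not from the book; the answer key and responses are invented). Because no judgement is involved, any two scorers, human or machine, must arrive at the same score:

```python
# Objective scoring of a multiple choice test: one mark per response
# that matches the unambiguous answer key.

ANSWER_KEY = ["b", "d", "a", "c", "b", "a", "d", "c"]

def score(responses, key=ANSWER_KEY):
    """Return the number of responses matching the key exactly."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)

candidate = ["b", "d", "c", "c", "b", "a", "a", "c"]
print(score(candidate))  # 6 out of 8
```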

Objectivity in scoring is sought after by many testers, not for itself, but for the greater reliability it brings. In general, the less subjective the scoring, the greater agreement there will be between two different scorers (and between the scores of one person scoring the same test paper on different occasions). However, there are ways of obtaining reliable subjective scoring, even of compositions. These are discussed first in Chapter 5.

Communicative language testing

Much has been written in recent years about 'communicative language testing'. Discussions have centred on the desirability of measuring the ability to take part in acts of communication (including reading and listening) and on the best way to do this. It is assumed in this book that it is usually communicative ability which we want to test. As a result, what I believe to be the most significant points made in discussions of communicative testing are to be found throughout. A recapitulation under a separate heading would therefore be redundant.

READER ACTIVITIES

Consider a number of language tests with which you are familiar. For each of them, answer the following questions:

1. What is the purpose of the test?
2. Does it involve direct or indirect testing (or a mixture of both)?
3. Are the items discrete point or integrative (or a mixture of both)?
4. Which items are scored objectively, and which subjectively? Can you order the subjective items according to degree of subjectivity?
5. Is the test norm-referenced or criterion-referenced?
6. Does the test measure communicative abilities? Would you describe it as a communicative test? Justify your answers.
7. What relationship is there between the answers to question 5 and the answers to the other questions?


Further reading

[...] towards achievement test content [...]. Alderson (1987) reports on research into the application of the computer to language testing, including attempts to make test tasks as authentic as possible. Vol. [...] of Language Testing is devoted to articles on [...], including an account of the development of an [...]. Godshalk et al. (1966) is a classic short study of [...]. [...] norm-referencing (not restricted to language testing) [...]. Descriptions at a number of levels for the four skills in academic contexts have been produced by the American Council on the Teaching of Foreign Languages, and are available from ACTFL at 579 [...], NY 10706, USA. It should be said, however, that the form they take and the way in which they were developed have attracted some controversy. Doubts about the [...] to language testing are expressed by [...]; see Hughes (1986). Carroll (1961)


made the distinction between discrete point and integrative language testing. Oller (1979) discusses integrative testing techniques. Morrow (1979) is a seminal paper on communicative language testing. Further discussion of the topic is to be found in Canale and Swain (1980), Alderson and Hughes (1981, Part 1), Hughes and Porter (1983), and Davies (1988). Weir's (1988a) book has as its title Communicative Language Testing.


4 Validity

We already know from Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention. This chapter will present each aspect in turn, and attempt to show its relevance for the solution of language testing problems.

Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction.


    backwas-h effect. Areas which are not tested are likely to become areasisnored in teachind-cr-ermiqsd

    -byrry-heJ'fu*._d=:=_#

    ntent is a fair reflect

    6'tI L-./YCrlte ri o n -related va lid ity

    Eeiil-*-*-- "': - "''" "" *'-

    (g-.. ntent is tTt Ee on thervriting-f specifrcatiiins E"n-d onT-ne )udgementTf iontent validity is to befound in Chapter l.

    There are essentially twovaliditl, and predictive valithg_test and the..crimriorr are

    -ad-me.xemplify' this kind of validation in ach.icvement.tEsfing, let us consider asituation lvhere course objectives call for an oral component as part ofihe final achievement tes r of

    ' 'functions'which student Il of-u.hich might take 45 m I beiryppc.gigal. Perhaps it is felt that only ten minutes can be devoted to eachsrudent folthe oral component. The question then arises: can such a

    . ten-minute session give a sufficientll' accurate estimate of the student'sat'iliti'rvith respect to the functions specified in the course objectives? Isit, in other words, a valid measure?.

    From the point of view of content validity, thii-:urlklspgqd -gn howt, an d tto_urre.preserlta -

    d_ed in thq pbjec_tiues.Even, e ffort should be made when designing the oral component to give itconrent r,alidity. Once this has been done, however, we can go further.\\ie can attempt to establish the concurrent validity of the component.

To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45 minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring (see next chapter). This would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of


the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are in fact standard procedures for comparing sets of scores in this way, which generate what is called a 'validity coefficient', a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a validity coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of Box 1.

Box 1
To get a feel for what a coefficient means in terms of the level of agreement between two sets of scores, it is best to square that coefficient. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to in the main text. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students. This explanation of how to interpret validity coefficients is very brief and necessarily rather crude. For a better understanding, the reader is referred to Appendix 1.
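The validity coefficient is, in practice, usually a correlation coefficient (Pearson's r) between the two sets of scores; squaring it, as in Box 1, gives the proportion of shared variation. A minimal sketch, with invented scores for ten students on the ten-minute and the 45-minute oral components:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired sets of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented scores: ten students on the short and the full oral test.
short_test = [55, 62, 48, 70, 66, 59, 73, 50, 64, 68]
full_test  = [58, 65, 52, 74, 63, 60, 78, 49, 66, 71]

r = pearson_r(short_test, full_test)
print(round(r, 2))              # the validity coefficient
print(round(100 * r * r))       # per cent 'agreement', as in Box 1
print(round(100 * 0.7 ** 2))    # 49: the Box 1 example, 0.7 squared
```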


Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral


ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed which claimed to be measuring something different from all existing tests, as was said of at least one quite recently developed 'communicative' test.

The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 16 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test's (possible) accuracy in predicting problems for those students goes unrecognised. As a result, a validity coefficient of this order is generally regarded as satisfactory. The further reading section at the end of the chapter gives references to the recent reports on the validation of the British Council's ELTS test, in which these issues are discussed at length.

Predictive validation is also called for in the case of placement tests. Here the obvious criterion measure is teachers' judgements, once teaching is under way, of those students who were thought to be misplaced. It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test which would place students more accurately.


Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word 'construct' refers to any underlying ability (or trait) which is hypothesised in a theory of language ability. One might hypothesise, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. Gross constructs like 'reading ability' and 'writing ability' raise fewer doubts of this kind; the direct measurement of writing ability, for instance, makes little appeal to theory. Indirect measurement is a different matter. Let us imagine that we wish to develop an indirect test of writing ability. Our theory of writing tells us that underlying writing ability are a number of sub-abilities, such as control of punctuation and sensitivity to demands on style. We construct items that are meant to measure these sub-abilities and administer them as a pilot test. How do we know that we are indeed measuring writing ability? One step we would almost certainly take is to obtain extensive samples of the writing of the group to whom the test is first administered, and have these reliably scored. We would then compare the scores on the pilot test with the scores given for the samples of writing. If there is a high level of agreement (the procedure is that described in the previous section on concurrent validity), we have evidence that we are measuring writing ability with the test, and may have developed a satisfactory indirect test of writing. But we have not demonstrated the reality of the underlying


If such sub-tests are administered to a suitable group of subjects, we obtain a set of scores for each hypothesised construct. Correlation coefficients can then be calculated between the sets of scores. If the coefficients between scores on the same construct are higher than those between scores on different constructs, then we have evidence that we are indeed measuring separate and identifiable constructs.

Construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.

Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if the test's construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide indirect measures, have to be introduced slowly, with care, and with convincing explanations.

The use of validity

What use is the reader to make of these various kinds of validity? First, every effort should be made when constructing tests to ensure content validity. Where possible, a test should also be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used (this may often result in disappointment, another reason for favouring direct testing!).

Any published test should supply details of its validation, without which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.

    READER ACTIVITIES

Consider any tests with which you are familiar. Assess each of them in terms of the various kinds of validity that have been presented in this chapter. What empirical evidence is there that the test is valid? If evidence is lacking, how would you set about gathering it?

    Further reading

For general discussion of test validity and ways of measuring it, see Anastasi (1975). For an interesting recent example of test validation (of the British Council ELTS test) in which a number of important issues are raised, see Criper and Davies (1988) and Hughes, Porter and Weir (1988). For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1985). Bachman and Palmer (1981) is a good example of construct validation. For a collection of papers on language testing research, see Oller (1983).

5 Reliability

Imagine that a hundred students take a 100-item test at three o'clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if in fact they had not taken the test on the Thursday but had taken it at three o'clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday? The answer to this question must be no. Even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgement on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval, nevertheless we would not expect every individual to get precisely the same score on the Wednesday as they got on the Thursday. Human beings are not like that; they simply do not behave in exactly the same way on every occasion, even when the circumstances seem identical.

But if this is the case, it would seem to imply that we can never have complete trust in any set of test scores. We know that the scores would have been different if the test had been administered on the previous or the following day. This is inevitable, and we must accept it. What we have to do is construct, administer and score tests in such a way that the scores actually obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students with the same ability, but at a different time. The more similar the scores would have been, the more reliable the test is said to be.

Look at the hypothetical data in Table 1a). They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion, and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.) Note the size of the difference between the two scores for each student.


TABLE 1a) SCORES ON TEST A (INVENTED DATA)

Student     Score obtained   Score which would have been
                             obtained on the following day
Bill             59                      82
Mary             52                      28
Ann              21                      34
Harry            90                      67
Cyril            39                      63
Pauline          59                      59
Don              35                      35
Colin            16                      23
Irene            52                      62
Sue              57                      49

Now look at Table 1b), which displays the same kind of information for a second 100-item test (B). Again note the size of the difference in scores for each student.

TABLE 1b) SCORES ON TEST B (INVENTED DATA)

Student     Score obtained   Score which would have been
                             obtained on the following day
Bill             65                      58
Mary             48                      46
Ann              23                      19
Harry            85                      89
Cyril            44                      43
Pauline          56                      55
Don              38                      43
Colin            19                      27
Irene            67                      76
Sue              52                      62

Which test seems the more reliable? The differences between the two sets of scores are much smaller for Test B than for Test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), Test B appears to be more reliable than Test A.

Look now at Table 1c), which represents scores of the same students on an interview using a five-point scale.

TABLE 1c) SCORES ON INTERVIEW (INVENTED DATA)

Student     Score obtained   Score which would have been
                             obtained on the following day
Bill              5                       3
Mary              4                       5
Ann               2                       4
Harry             5                       2
Cyril             2                       4
Pauline           3                       5
Don               3                       1
Colin             1                       2
Irene             4                       5
Sue               3                       1

In one sense the two sets of interview scores are very similar. The largest difference between a student's actual score and the one which would have been obtained on the following day is 3. But the largest possible difference is only 4! Really the two sets of scores are very different. This becomes apparent once we compare the size of the differences between students with the size of the differences between scores for individual students. They are of about the same order of magnitude. The result of this can be seen if we place the students in order according to their scores: the order based on their actual scores is quite different from the order based on the scores they would have obtained in an interview on the following day. This interview does not seem very reliable at all.

It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients are like validity coefficients (Chapter 4). They allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero (and let us hope that no such test exists!) would give sets of results quite unconnected with each other, in the sense that the score that someone actually got on a Wednesday would be no help at all in attempting to predict the score he or she would get if they took the test a day later. It is between these two extremes that the reliability coefficients of real tests are to be found.

Certain authors have suggested how high a reliability coefficient we should expect for different types of language tests. Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the .90 to .99 range, while auditory comprehension tests are more often in the .80 to .89 range. Oral production tests may be in the .70 to .79 range. On this view a reliability coefficient of .85 might be considered high for an oral production test but low for a reading test, a difference related to what Lado sees as the difficulty in achieving reliability in the testing of different abilities. In fact the reliability coefficient to be sought will depend also on other considerations, most particularly the importance of the decisions that are to be taken on the basis of the test. The more important the decisions, the greater the reliability we must demand: if we are to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later. The next section will explain how the reliability coefficient can be used to arrive at another figure (the standard error of measurement) to estimate likely differences of this kind. Before this is done, however, something more needs to be said about the way in which reliability coefficients are arrived at.

The most obvious way of obtaining the two sets of scores that we need is for a group of subjects to take the same test twice. This is known as the test-retest method. Its problems are not difficult to see. If the second administration of the test is too soon after the first, then subjects are likely to recall items and their responses to them, making the same responses more likely and the coefficient artificially high. If there is too long a gap between administrations, then learning (or forgetting!) will have taken place, and the coefficient will be lower than it should be. However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient. These effects are reduced somewhat by the use of two different forms of the same test (the alternate forms method). However, alternate forms are often simply not available.

It turns out, however, that the necessary two sets of scores can be obtained from a single administration of a test; indeed the most common methods of obtaining reliability coefficients involve only one administration of one test, and so provide what are called coefficients of internal consistency. The best known of these is the split half method. In this the subjects take the test in the usual way, but each subject is given two scores. One score is for one half of the test, the second score is for the other half. The two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken twice.


In order for this method to work, it is necessary for the test to be split into two halves which are really equivalent, through the careful matching of items (in fact, where items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate). It can be seen that this method is rather like the alternate forms method, except that the two 'forms' are only half the length.1

It has been demonstrated empirically that this altogether more economical method will indeed give good estimates of alternate forms coefficients, provided that the alternate forms are closely equivalent to each other. Details of other methods of estimating reliability and of carrying out the necessary statistical calculations are to be found in Appendix 1.

The standard error of measurement and the true score

While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual's actual score is to what it might have been on another occasion. With a little further calculation, however, it is possible to estimate how close a person's actual score is to what is called their 'true score'.

Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing. Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary. The score around which these scores would cluster is what is referred to as the candidate's true score. We can never know a candidate's true score with certainty, but we are able to make statements about the probability that the true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test. In order to do this, we first have to calculate the standard error of measurement of the particular test. The calculation (described in Appendix 1) is based on the reliability coefficient and on the spread of scores on the test; how the standard error of measurement is used can best be

1. Because of the reduced length, which will cause the coefficient to be less than it would be for the whole test, a statistical adjustment has to be made (see Appendix 1 for details).


illustrated by an example.

Suppose that a test has a standard error of measurement of 5, and that an individual scores 56 on that test. We are then in a position to make the following statements:2

We can be about 68 per cent certain that the person's true score lies in the range of 51 to 61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).

We can be about 95 per cent certain that their true score is in the range 46 to 66 (i.e. within two standard errors of measurement of the score actually obtained).

We can be 99.7 per cent certain that their true score is in the range 41 to 71 (i.e. within three standard errors of measurement of the score actually obtained).

These statements are based on what is known about the pattern of

scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement of the true score, and so on. If in fact they only take the test once, we cannot be sure how their score on that occasion relates to their true score, but we are still able to make statements of probability like those given above, and these should temper the decisions that we take on the basis of test scores.3

2. These statistical statements are based on what is known about the way a person's scores would tend to be distributed if they took the same test an indefinitely large number of times (without the experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution (further discussion of which is beyond the scope of the present book). It is the known structure of the normal distribution which allows us to say what proportion of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score.

3. It should be clear that there is no such thing as a 'good' or a 'bad' standard error of measurement. It is the particular use made of particular scores in relation to a particular standard error of measurement which may be considered acceptable or unacceptable.
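The calculation behind these statements can be sketched in Python. The usual formula multiplies the standard deviation of the scores by the square root of one minus the reliability coefficient; the standard deviation and reliability below are invented so that the standard error comes out at about 5, matching the worked example.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: sem = sd * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)

# Invented figures: spread (standard deviation) 12.9, reliability .85,
# chosen so that sem is roughly 5 as in the example above.
sem = standard_error_of_measurement(sd=12.9, reliability=0.85)

score = 56
for n, certainty in [(1, "about 68%"), (2, "about 95%"), (3, "99.7%")]:
    low, high = score - n * sem, score + n * sem
    print(f"{certainty} certain the true score lies between "
          f"{low:.0f} and {high:.0f}")
```

Note how the standard error shrinks as the reliability coefficient approaches 1: a perfectly reliable test would pin the true score down exactly.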


Test users, then, need to be provided with not only a test's reliability coefficient but also its standard error of measurement. If a test is not reliable, we know that the actual scores of many individuals are likely to be quite different from their true scores, and in the case of any one individual there may be a sizeable discrepancy between actual score and true score. This means that we should be cautious about making important decisions on the basis of the test scores of candidates whose actual scores place them close to the cut-off point (the point that divides 'passes' from 'fails'). We should at least consider the possibility of gathering further relevant information on the language ability of such candidates.

Having seen the importance of reliability, we shall consider, later in the chapter, how to make our tests more reliable. Before that, however, we shall look at another aspect of reliability.


Scorer reliability

In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that the scoring of the test would be 'perfect'. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions: any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion. It is possible to quantify the level of agreement given by different scorers on different occasions by means of a scorer reliability coefficient, which can be interpreted in a similar way to the test reliability coefficient. In the case of the multiple choice test just described, we would expect perfect agreement when we calculate the reliability coefficients of the scorers. But we could not assume perfectly consistent scoring in the case of the interview scores discussed earlier in the chapter; it would probably have seemed to the reader an unreasonable assumption. We can accept that scorers should be able to be consistent when there is only one easily


recognised correct response to each item. But when a degree of judgement is called for on the part of the scorer, as in the scoring of performance in an interview, perfect consistency is not to be expected. Such subjective scoring was once thought to yield coefficients (and also test reliability coefficients) too low to justify the use of subjective measures of language ability in serious language testing. This view is less widely held today. While the perfect reliability of objective tests is not obtainable in subjective tests, there are ways of making it sufficiently high for test results to be valuable. Indeed the test reliability coefficient will almost certainly be lower than scorer reliability, since other sources of unreliability will be additional to what enters through imperfect scoring. In a case I know of, the scorer reliability coefficient on a composition writing test was .92, while the reliability coefficient for the test was .84. Variability in the performance of individual candidates accounted for the difference between the two coefficients.

As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring. We will begin by suggesting ways of achieving consistent performances from candidates and then turn our attention to scorer reliability.

How to make tests more reliable

Take enough samples of behaviour

Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right. If we wanted to know how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability we would want to see a large number of shots at the target.

The same is true for language testing. It has been demonstrated empirically that longer tests tend to be more reliable, and there is even a formula (presented in the Appendix) that allows one to estimate the reliability a test would have if items were added to (or removed from) it.
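The formula usually used for this purpose is the Spearman-Brown prophecy formula. A short Python sketch (the starting coefficient of .70 is an invented example):

```python
def predicted_reliability(current_r, length_factor):
    """Spearman-Brown prophecy formula: the reliability a test would have
    if its length were multiplied by length_factor (e.g. 2.0 = doubled),
    assuming the added items are as good as the existing ones."""
    k = length_factor
    return k * current_r / (1 + (k - 1) * current_r)

# An invented oral test with reliability .70, doubled and trebled in length:
for k in (2, 3):
    print(f"x{k} length: predicted r = {predicted_reliability(0.70, k):.2f}")
```

Notice the diminishing returns: each doubling of length buys a smaller gain in reliability, which is one reason test length has to be balanced against practicality.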


Do not allow candidates too much freedom

In some kinds of language test, candidates are given a selection of items or tasks from which to choose, and considerable freedom in how they respond to the ones they have chosen. Such freedom has a depressing effect on the reliability of the test: the greater the freedom, the greater the likely difference between the performance actually elicited and the performance that would have been elicited had the test been taken, say, a day later. In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:

a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
   i) More/better advertising and/or information (where? what form should it take?).
   ii) Improve facilities (hotels, transportation, communication etc.).
   iii) Training of personnel (guides, hotel managers etc.).

The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first.

The general principle of restricting the freedom of candidates will be taken up again in chapters relating to particular skills. It should perhaps be said here, however, that in restricting the students we must be careful not to distort too much the task that we really want to see them perform. The potential tension between reliability and validity is taken up at the end of the chapter.


Write unambiguous items

It is essential that candidates should not be presented with items whose meaning is not clear or to which there is an acceptable answer which the test writer has not anticipated. In a reading test I once set the following open-ended question, based on a lengthy reading passage about English accents and dialects: 'Where does the author direct the reader who is interested in non-standard dialects of English?' The expected answer was the Further reading section of the book, which is where the reader was directed to. A number of candidates answered 'page 3', which was the place in the text where the author actually said that the interested reader should look in the Further reading section. Only careful scoring of the test revealed that there was a second perfectly correct answer to the question. If that had not happened, such responses would have been marked as incorrect. The fact that an individual


how many additional items of the same quality as the ones already in the test will be needed to increase the reliability to the level required.

There is one thing to bear in mind, however, when adding items to make a test more reliable: the additional items should be independent of each other and of the existing items. Imagine a reading test that asks the question: 'Where did the thief hide the jewels?' If an additional item following that took the form 'What was unusual about the hiding place?', it would not make a full contribution to an increase in the reliability of the test. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. Such candidates are effectively prevented from answering the additional question; for them, in reality, there is no additional question. We do not get an additional sample of their behaviour, so the reliability of our estimate of their ability is not increased. Each additional item should as far as possible represent a fresh start for the candidate, giving us a genuinely new sample of their behaviour. In a reading test, each passage together with the items set on it is in this sense to be regarded as a unit: the more independent passages there are, the more reliable will be the test. In the same way, in an interview used to test oral ability, the candidate should be given as many 'fresh starts' as possible. More detailed implications of the need to obtain sufficiently large samples of behaviour will be outlined later in the book, in chapters devoted to the testing of particular abilities.