19000749 Principles of Language Assessment

  • 8/2/2019 19000749 Principles of Language Assessment

    PRINCIPLES OF LANGUAGE ASSESSMENT

    This chapter explores how principles of language assessment can and should be applied to formal tests, but with the ultimate recognition that these principles also apply to assessments of all kinds. In this chapter, these principles will be used to evaluate an existing, previously published, or created test. Chapter 3 will center on how to use those principles to design a good test.

    How do you know if a test is effective? For the most part, that question can be answered by responding to such questions as: Can it be given within appropriate administrative constraints? Is it dependable? Does it accurately measure what you want it to measure? These and other questions help to identify five cardinal criteria for "testing a test": practicality, reliability, validity, authenticity, and washback. We will look at each one, but with no priority order implied in the order of presentation.

    PRACTICALITY

    An effective test is practical. This means that it is not excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and has a scoring/evaluation procedure that is specific and time-efficient.

    A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can be scored only by computer is impractical if the test takes place a thousand miles away from the nearest computer. The value and quality of a test sometimes hinge on such nitty-gritty, practical considerations.

    RELIABILITY

    A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of reliability of a test may best be addressed by considering a number of factors that may contribute to the unreliability of a test. Consider the following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the student, in scoring, in test administration, and in the test itself.

    Student-Related Reliability

    The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an "observed" score deviate from one's "true" score. Also included in this category are such factors as a test-taker's "test-wiseness" or strategies for efficient test taking (Mousavi, 2002, p. 804).

    Rater Reliability

    Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores of the same test, possibly for lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards.
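
    One rough way to spot the kind of scorer inconsistency described above is to compare two raters' scores on the same set of papers. The sketch below is only an illustration: the function name `rater_discrepancies`, the 1-point tolerance, and the sample scores are all hypothetical and do not come from the chapter.

```python
def rater_discrepancies(scores_a, scores_b, tolerance=1):
    """Compare two raters' scores for the same papers.

    Returns the mean absolute difference and the indices of papers
    where the raters differ by more than `tolerance` points.
    The tolerance value is an assumption chosen for illustration.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same non-empty set of papers")
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    mean_abs_diff = sum(diffs) / len(diffs)
    flagged = [i for i, d in enumerate(diffs) if d > tolerance]
    return mean_abs_diff, flagged


# Hypothetical scores on a 5-point scale for five dictations.
rater_a = [4, 3, 5, 2, 4]
rater_b = [4, 5, 5, 1, 2]

mean_diff, flagged = rater_discrepancies(rater_a, rater_b)
print(f"mean absolute difference: {mean_diff}")
print(f"papers to re-score together: {flagged}")
```

    Papers flagged this way could be re-scored jointly so that both raters arrive at a shared reading of the criteria.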

    Rater-reliability issues are not limited to contexts where two or more scorers are involved. Intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness. When I am faced with up to 40 tests to grade in only a week, I know that the standards I apply (however subliminally) to the first few tests will be different from those I apply to the last few. I may be "easier" or "harder" on those first few papers or I may get tired, and the result may be an inconsistent evaluation across all tests. One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. In tests of writing skills, rater reliability is particularly hard to

    Test Reliability

    Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well on a test with a time limit. We all know people (and you may be included in this category!) who "know" the course material perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous or that have more than one correct answer) may be a further source of test unreliability.

    VALIDITY

    By far the most complex criterion of an effective test, and arguably the most important principle, is validity, "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment" (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, nor previous knowledge in a subject, nor some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.

    How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested. In other cases, we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences (beyond measuring the criteria themselves) of a test, or even on the test-taker's perception of validity. We will look at these five types of evidence below.
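
    The statistical correlation mentioned above can be illustrated with a short sketch. The Python below computes a Pearson correlation coefficient between scores on a test and scores on a related but independent measure. The chapter does not name a particular statistic, so the choice of Pearson's r, the function name `pearson_r`, and the score lists are assumptions made purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2 scores)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six students on a new test and on an
# independent measure of the same ability.
new_test = [62, 71, 55, 88, 93, 47]
other_measure = [60, 75, 50, 85, 90, 52]

r = pearson_r(new_test, other_measure)
print(f"r = {r:.2f}")
```

    A coefficient near 1 would be one piece of supporting evidence, not proof of validity; it says only that the two sets of scores rank students similarly.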

    Content-Related Evidence

    If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring. A test of tennis competency that asks someone to run a 100-yard dash obviously lacks content validity. If you are trying to assess a person's ability to speak a second language in a conversational setting, asking the learner to answer paper-and-pencil multiple-choice questions requiring grammatical judgments does not achieve content validity. A test that requires the learner actually to speak within some sort of authentic context does. And if a course has perhaps ten objectives but only two are covered in a test, then content validity suffers.

    The most feasible rule of thumb for achieving content validity in classroom assessment is to test performance directly. Consider, for example, a listening/speaking class that is doing a unit on greetings and exchanges that includes discourse for asking for personal information (name, address, hobbies, etc.) with some form-focus on the verb to be, personal pronouns, and question formation. The test on that unit should include all of the above discourse and grammatical elements and involve students in the actual performance of listening and speaking.

    Construct-Related Evidence

    A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Constructs may or may not be directly or empirically measured; their verification often requires inferential data. "Proficiency" and

    Face Validity

    An important facet of consequential validity is the extent to which "students view the assessment as fair, relevant, and useful for improving learning" (Gronlund, 1998, p. 210), or what is popularly known as face validity. "Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers" (Mousavi, 2002, p. 244).

    Sometimes students don't know what is being tested when they tackle a test. They may feel, for a variety of reasons, that a test isn't testing what it is "supposed" to test. Face validity means that the students perceive the test to be valid. Face validity asks the question "Does the test, on the 'face' of it, appear from the learner's perspective to test what it is designed to test?" Face validity will likely be high if learners encounter a well-constructed, expected format with familiar tasks; a test that is clearly doable within the allotted time limit; items that are clear and uncomplicated; directions that are crystal clear; tasks that relate to their course work (content validity); and a difficulty level that presents a reasonable challenge.

    Remember, face validity is not something that can be empirically tested by a teacher or even by a testing expert. It is purely a factor of the "eye of the beholder": how the test-taker, or possibly the test giver, intuitively perceives the instrument. For this reason some assessment experts (see Stevenson, 1985) view face validity as a superficial factor that is dependent on the whim of the perceiver.

    The other side of this issue reminds us that the psychological state of the learner (confidf

    AUTHENTICITY

    A fourth major principle of language testing is authenticity, a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996, p. 23) define authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language task," and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.

    Essentially, when you make a claim for authenticity in a test task, you are saying that this task is likely to be enacted in the "real world." Many test item types fail to simulate real-world tasks. They may be contrived or artificial in their attempt to target a grammatical form or a lexical item. The sequencing of items that bear no relationship to one another lacks authenticity. One does not have to look very long to find reading comprehension passages in proficiency tests that do not reflect a real-world passage.

    In a test, authenticity may be present in the following ways: the language in the test is as natural as possible; items are contextualized rather than isolated; topics are meaningful (relevant, interesting) for the learner; some thematic organization to items is provided, such as through a story line or episode; and tasks represent, or closely approximate, real-world tasks.

    The authenticity of test tasks in recent years has increased noticeably. Two or three decades ago, unconnected, boring, contrived items were accepted as a necessary component of testing. Things have changed. It was once assumed that large-scale testing could not include performance of the productive skills and stay within budgetary constraints, but now many such tests offer speaking and writing components. Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter. Listening comprehension sections feature natural language with hesitations, white noise, and interruptions. More and more tests offer items that are "episodic" in that they are sequenced to form meaningful units, paragraphs, or stories.

    You are invited to take up the challenge of authenticity in your classroom tests. As we explore many different types of task in this book, especially in Chapters 6 through 9, the principle of authenticity will be very much in the forefront.

    WASHBACK

    A facet of consequential validity, discussed above, is "the effect of testing on teaching and learning" (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects the tests have on instruction in terms of how students prepare for the test.

    The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students' incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student's interlanguage. Teachers can suggest strategies for success as part of their "coaching" role. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others. (See PLLT and TBP for an explanation of these principles.)

    One way to enhance washback is to comment generously and specifically on test performance. Many overworked (and underpaid!) teachers return tests to students with a single letter grade or numerical score and consider their job done. In reality, letter grades and numerical scores give absolutely no information of intrinsic interest to the student. Grades and scores reduce a mountain of linguistic and cognitive performance data to an absurd molehill. At best, they give a relative indication of a formulaic judgment of performance as compared to others in the class, which fosters competitive, not cooperative, learning.

    With this in mind, when you return a written test or a data sheet from an oral production test, consider giving more than a number, grade, or phrase as your feedback. Even if your evaluation is not a neat little paragraph appended to the test, you can respond to as many details throughout the test as time will permit. Give praise for strengths (the "good stuff") as well as constructive criticism of weaknesses. Give strategic hints on how a student might improve certain elements of performance. In other words, take some time to make the test performance an intrinsically motivating experience from which a student will gain a sense of accomplishment and challenge.

    A little bit of washback may also help students through a specification of the numerical scores on the various subsections of the test. A subsection on verb tenses, for example, that yields a relatively low score may serve the diagnostic purpose of showing the student an area of challenge.
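
    The subsection reporting described above could be sketched as follows. This is a minimal illustration only: the function name `subsection_report`, the 60% cut-off for flagging a subsection, and the sample point totals are hypothetical and are not prescribed by the chapter.

```python
def subsection_report(earned, possible, threshold=0.6):
    """Return {section: (proportion correct, needs_attention)}.

    `threshold` is an arbitrary illustrative cut-off: subsections scored
    below this proportion are flagged as areas of challenge.
    """
    report = {}
    for section, max_points in possible.items():
        proportion = earned[section] / max_points
        report[section] = (proportion, proportion < threshold)
    return report


# Hypothetical point totals for one student's test.
possible = {"verb tenses": 20, "vocabulary": 20, "listening": 10}
earned = {"verb tenses": 9, "vocabulary": 17, "listening": 8}

report = subsection_report(earned, possible)
for section, (proportion, needs_attention) in report.items():
    note = "  <- area of challenge" if needs_attention else ""
    print(f"{section}: {earned[section]}/{possible[section]} ({proportion:.0%}){note}")
```

    Returned alongside written comments, a breakdown like this tells the student where to direct further work instead of reducing the whole performance to one number.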

    H. DOUGLAS BROWN