
CHAPTER 3 TESTING READING PERFORMANCE

Since the focus of this study is on the testing of reading comprehension, this chapter now turns to the testing perspective of the present research. In discussing language testing, there are two things that should be considered: ‘what it is that we are trying to measure’ and ‘how we are going to measure it’. The purpose of this chapter is to explore these two questions in depth, which should help us conceptualize the role and nature of reading tests in the measurement of reading ability.

3.1 Test as an Instrument in Measuring Language Ability

A test, according to the second edition of the Dictionary of Language Teaching and Applied Linguistics (Richards, Platt and Platt 1992: 377), is ‘any procedure for measuring ability, knowledge, or performance.’ In other words, a test can be interpreted as an instrument that we use to gauge how much ability (or knowledge, or performance) a learner has, just as we use a ruler to find out how long a piece of cloth is. Furthermore, when we hear the word ‘test’, what comes to mind is a set of test questions, where the number of questions we answer correctly determines our total score on the ‘test’.

I have given the definition in this general sense because it is often difficult to find a simple and explicit definition of what a test is in the language testing literature (e.g. Henning 1987; Bachman 1990; McNamara 1996; Urquhart and Weir 1998; Alderson 2000), as there is much to be considered in describing the nature of a test. However, what many of these works do agree on, in discussing what a test is, is that any measurement obtained by implementing a test involves measurement error, and that what is determined about the ability of test takers through the use of a test is no more than an inference made from their performances on it. Johnston (1983: 53-54) expresses his concern about test methods as a crucial factor for accurate measurement, especially in testing reading:


東京外国語大学博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)

...since reading comprehension is a mental activity, it is only available for indirect, second-hand scrutiny. We can never actually watch the mental operations, but must infer them from other sources of data. In making these inferences, we must be very clear about the grounds we have for doing so. In order to be so informed, we should understand (as clearly as our data and theory will allow) the actual demands and assumptions involved in our assessment techniques.

In other words, in trying to understand the nature of a test, the central concern, whether in developing a test or in using it to measure language ability, is whether, or to what extent, the test is measuring what it purports to measure. In this regard, the challenge for language testing research is how, and to what extent, we can ensure that a particular test elicits a representative sample of the language ability we wish to measure.

3.2 Test Validity

In understanding what makes a ‘good’ test, we have to consider two things. One is test validity: whether and to what extent a test measures what it purports to measure. For example, a paper-and-pencil ‘pronunciation’ test, or a writing test that relies heavily on specialized background knowledge, would be very low in validity, since the sample of ability extracted via the test is very different from what the test is intended to measure. The other criterion that should be considered in making a good test is reliability: the extent to which a test is consistent in its measurement. A common example of when test reliability must be questioned is a writing test in which two raters mark the same essay but give very different marks. Likewise, when the same rater rates the same essay twice and gives a different mark the second time, this is also considered problematic in terms of reliability. The former is a matter of inter-rater reliability, while the latter is a matter of intra-rater reliability.
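The two kinds of rater consistency described above can be quantified in several ways. The following is a minimal sketch using Cohen's kappa, a chance-corrected agreement statistic that is one common choice but is not named in the text. The function name, the band labels and all of the marks are hypothetical and serve only as illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two sets of categorical marks."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("need two non-empty lists of equal length")
    # Proportion of essays on which the two sets of marks agree exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each set's marginal band frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[band] * counts_b[band] for band in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both sets use a single band")
    return (observed - expected) / (1 - expected)


# Hypothetical band marks (A = high, C = low) for the same eight essays.
rater_a = ["B", "C", "A", "C", "B", "A", "B", "C"]
rater_b = ["B", "C", "A", "B", "B", "A", "C", "C"]        # a second rater
rater_a_again = ["B", "C", "A", "C", "B", "A", "B", "B"]  # rater A, second rating

inter_rater_kappa = cohens_kappa(rater_a, rater_b)        # two raters, same essays
intra_rater_kappa = cohens_kappa(rater_a, rater_a_again)  # same rater, two occasions
print(f"inter-rater kappa = {inter_rater_kappa:.2f}")
print(f"intra-rater kappa = {intra_rater_kappa:.2f}")
```

A value near 1 indicates consistent marking, while a value near 0 indicates agreement no better than chance. For marks on a numeric scale, a correlation coefficient would be a more natural choice than kappa.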



Comparing the two concepts, a common view is that reliability is a necessary condition for a test to be valid, because test scores that are not reliable cannot provide a basis for valid interpretation and use, but that reliability alone does not guarantee validity (Kobayashi 1995: 81-82). For example, it is perfectly possible to make a very consistent paper-and-pencil ‘pronunciation’ test, but, despite its high reliability, such a test is very low in validity. Validity is therefore ‘the most important quality in the development, interpretation, and use of language tests’ (Bachman 1990: 289).

There are four types of validity in language testing: construct validity, content validity, face validity, and concurrent or predictive validity (Bachman 1990; Heaton 1988; Hughes 2003; McNamara 2000). Concurrent and predictive validity are established by comparing the results of a test with those of other measurements, usually by calculating correlations: the former with other existing tests, and the latter with the future performance of the testees (Kobayashi 1995: 83). Such comparisons have little significance as evidence of validity unless the measures with which the test is compared are themselves established as valid (i.e. a well-established standardized test). However, if the test against which the new test is validated is indeed considered valid, this is a powerful means of test validation.
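As a rough illustration of the correlational comparison described above, the sketch below computes a Pearson correlation between scores on a new test and scores on an established test taken by the same people. The text says only that correlations are calculated, so the choice of Pearson's coefficient is an assumption, and all scores and names here are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical scores for the same eight test takers.
new_test = [52, 61, 47, 70, 66, 58, 43, 75]
established_test = [55, 63, 50, 72, 60, 59, 45, 78]

# Concurrent validity: correlate the new test with an existing, validated test.
validity_coefficient = pearson_r(new_test, established_test)
print(f"concurrent validity coefficient r = {validity_coefficient:.2f}")
```

For predictive validity the same calculation would be applied, with the second list replaced by a later measure of performance. In either case the coefficient is only as meaningful as the criterion measure is valid.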

Face validity, according to Davies et al. (1999: 59), is the degree to which a test appears to measure the knowledge or abilities it claims to measure, as judged by an untrained observer (such as the candidate taking the test or the institution which plans to administer it). Concerns about face validity are often dismissed as trivial because they have to do with appearances rather than with the underlying construct of ability being measured by the test, but it has also been argued that failure to take issues of face validity into account may jeopardize the public credibility of a test.

Content validity can be defined as a parameter which concerns ‘whether the test content consists of a representative sample of the domain of language ability to be measured’ (Davies et al. 1999: 34). Some testing specialists make no distinction between face validity and content validity, in that both are intuitive and logical but usually lack an empirical basis (Henning 1987: 94). Others do distinguish the two, dismissing face validity as ‘impressionistic’ compared to content validity, which employs more scientific approaches in determining validity (e.g. Oller 1979, cited in Henning 1989: 96).

In the development of a performance test⁴, content validity is normally achieved by means of a thorough needs analysis of the target domain, upon which the test content is based (e.g. McNamara 1996). An achievement test⁵ seeks content validity by drawing a representative sample of items from the syllabus on which it is based. For a general proficiency test, where the whole of the language is the target domain, content then becomes the construct. This means that, in the present research, where the test instrument to be used is a general proficiency test, the content validity of the test is strongly related to its construct validity.

Construct validity, as it is often described in language testing research (e.g. Henning 1987; Bachman 1990), is concerned with the extent to which a test is related to a theoretical construct of language ability. Construct validation involves an investigation of the qualities that a test measures, thus providing a basis for the rationale of a test. Therefore, when the construct validity of a certain test is discussed, the term ‘validity’ on its own is often used to mean construct validity. Furthermore, as may be clear from the previous discussion in this section, construct validity is strongly related to the other three types of validity, if it does not subsume them.

4 Not all language tests are of the same kind. They differ with respect to test method and test purpose. In terms of method, McNamara (2000: 5) distinguishes traditional paper-and-pencil language tests from performance tests. According to his distinction, paper-and-pencil tests take the form of the familiar examination question paper. In performance-based tests, on the other hand, language skills are assessed in an act of actual performance (e.g. interview tests to assess speaking ability or essay tests to assess writing ability). In recent practice, both of these test methods can be realized virtually on computers (usually referred to as Computer-Based Testing or Computer Adaptive Testing). In that respect, it may be better to refer to paper-and-pencil language tests by the term ‘ability’ testing. The distinction between ‘ability testing’ and ‘performance testing’ is discussed in detail in 3.3.2 in relation to construct definition.

5 In terms of test purpose, the most familiar distinction is made between achievement tests and proficiency tests. [...] On the other hand, proficiency tests look to the future situation of language use without necessary reference to the previous process of teaching. The study done in the present research does not involve any analysis of results from instruction. It is solely concerned with proficiency testing, though the findings may be helpful in the construction of achievement tests.

It seems, from the descriptions above, that construct validity, and thus content validity in the present research, is of the utmost importance because it relates directly to the core of what is to be tested and how it is best measured. Therefore, the discussion now turns to how the construct and content can be specified in investigating the nature of reading tests.

3.3 Construct Definition

A construct is the trait or traits that a test is intended to measure. It is defined as ‘an ability or set of abilities that will be reflected in test performance’ (Davies et al. 1999: 31), and from which ‘inferences can be drawn’ on the basis of test scores (Chapelle 1999: 154). In other words, it is a meaningful and useful way of interpreting test performance (Messick 1988). A construct is usually based on a theory, so a test represents an operationalization of the theory on which it is based. A reading test is therefore an operationalization of a reading construct derived from theories of reading ability, which were discussed in Chapter 2 of this thesis.

Thus, construct definition is a very important component of a test: it clarifies what is to be inferred about the ability of test takers from their performance on the test (from the perspective of a test constructor), and makes that inference understandable (from the perspective of a test user). A well-defined construct is therefore essential in keeping test validity high.

3.3.1 Conceptualizing Language Ability

Prior to the discussion of construct definition, it is essential that theories of language ability be reviewed because they are, by definition, what a construct is established upon. Traditionally, language ability was thought to consist of modules of linguistic knowledge. In other words, it was considered to be an accumulation of knowledge of discrete elements of language. However, over the last few decades, considerable progress has been made in establishing models of language ability (e.g. Chomsky 1957, 1980), especially with regard to how the concept of ‘communication’ should be incorporated in defining what language ability is (e.g. Hymes 1972; Canale and Swain 1983; Bachman 1990). This trend has made a great contribution to the field of English Language Teaching, and approaches such as the Communicative Approach in teaching and Communicative Testing have become very influential in both pedagogy and research.

In the discussion of language ability, Chomsky’s distinction between ‘competence’ (the speaker-hearer’s knowledge of his language) and ‘performance’ (the actual use of language in concrete situations) was a significant milestone. In language testing, the distinction between underlying ‘competence’ and actual ‘performance’ is crucial because we need to sample actual language use, that is, what can be directly observed and evaluated as a product. In other words, as many studies suggest (e.g. Canale and Swain 1980), underlying competence can be assessed only through its realization in performance. Thus, a further examination of how competence and performance are related is necessary.

3.3.1.1 Defining competence

Inspired by Hymes’s (1972) notion of ‘communicative competence’, which takes sociolinguistic elements into account, as opposed to Chomsky’s ‘linguistic competence’, various models of language ability have been introduced which present language ability as consisting of grammatical knowledge and knowledge of use. Bachman’s (1990) model, shown in Figure 3-1, seems to be the most comprehensive at present.

In the model, language competence is divided into organizational competence (which includes grammatical and textual competence) and pragmatic competence (which includes illocutionary and sociolinguistic competence).



[Figure 3-1: tree diagram in which Language Competence divides into Organizational Competence (Grammatical Competence, Textual Competence) and Pragmatic Competence (Illocutionary Competence, Sociolinguistic Competence)]

Figure 3-1 Components of language competence (Bachman 1990: 87)

When the model is compared with the theories of reading ability discussed in the previous chapter, the distinction that Bachman’s model makes between organizational competence and pragmatic competence seems to coincide with the distinction that some studies (e.g. Negishi 1996; Grabe 1999) make between ‘a text model of comprehension’ and ‘a situation model of comprehension’ (see 2.2.2). The concepts of organizational competence and text modeling are both grounded in the linguistic dimension of language ability, whereas the concepts of pragmatic competence and situation modeling are both grounded in its world knowledge counterpart.

Furthermore, with regard to empirically derived models of reading ability, the three components of FL reading ability extracted in Negishi (1996) (see 2.3.2.1) fit nicely into Bachman’s (1990) model: the ‘Linguistic Competence’ factor of Negishi (1996: 134) could be explained by grammatical competence in Bachman’s model, the ‘world knowledge’ factor by sociolinguistic competence, and the ‘Reading Skills’ factor by textual and illocutionary competence. The fact that Negishi’s (1996) ‘Reading Skills’ factor is explained by two components of competence that belong to two different dimensions of language competence in Bachman’s (1990) model is further explained by another empirically derived model, the ‘two-dimensional approach’ to the latent structure of reading ability (Negishi 1997; Wada 2003) (see 2.3.2.1). The ‘local/global comprehension’ component, which is attributed to the amount of information integrated (Wada 2003: 58), could be explained by textual competence, and the ‘literal/inferential comprehension’ dimension, which is attributed to the amount of information processing (Bachman and Palmer 1982; Wada 2003), could be explained by illocutionary competence. Thus, the ‘two-dimensional approach’ to the latent structure of reading ability (Negishi 1997; Wada 2003) could be considered a useful model that can serve as a construct to explain the reading skills part of reading ability, as was suggested in 2.4 of this thesis.

3.3.1.2 Defining performance

Various attempts have been made to define a mechanism by which competence and performance could be bridged. The inclusion of strategic⁶ competence in a language ability model by Canale and Swain (1980) is one such attempt. However, they treated this competence mainly as compensatory (i.e. an ability needed when communication breaks down) and did not place much emphasis on it. Later, Canale (1983) modified the earlier joint model, emphasizing the importance of strategic competence as a more independent mechanism essential for successful communication.

Canale’s idea of strategic competence was further developed by Bachman (1990) in his theoretical framework of communicative language ability. Bachman and Palmer (1996) further developed Bachman’s (1990) model and presented it as a visual metaphor, shown in Figure 3-2.

As illustrated below, the framework of Bachman and Palmer (1996) views language use (or performance) as interactions among areas of language ability (composed of language knowledge and strategic competence; described in detail in Figure 3-1), topical knowledge, and affective schemata, on the one hand, and how these interact with the characteristics of the language use situation, or test task, on the other. The figure also illustrates various interactions that are assumed to be involved in language use. Bachman and Palmer (1996: 62) give a detailed explanation of their model:

[Figure 3-2: diagram. An inner circle containing topical knowledge, language knowledge, personal characteristics, strategic competence and affect, linked by double-headed arrows, within an outer circle labelled ‘Characteristics of the language use or test task and setting’]

Figure 3-2 Some components of language use and language test performance (Bachman and Palmer 1996: 63)

The components that are within the smaller, bold circle (‘topical knowledge’, ‘language knowledge’, ‘personal characteristics’, ‘strategic competence’ and ‘affect’) represent characteristics of individual language users, while the outer circle includes characteristics in the task or setting with which the language use interacts. The double-headed arrows indicate interactions. The figure indicates that strategic competence is the component that links other components within the individual, as well as providing the cognitive link with the characteristics of the language use task and setting.

6 In Note 2 in 2.4, it was stated that the present study treats the term ‘strategy’ as equivalent to what is meant by ‘skills’. Although the present author maintains this position and considers that what is meant by strategic competence here is actually the ‘skills’ discussed in Chapter 2, the word ‘strategy’ is used here because the studies cited in the present discussion use that term.

What should be noted here is the interaction illustrated in the model between strategic competence and the characteristics of the language use situation, or test task. Bachman (1990: 84) defines strategic competence as ‘the mental capacity for implementing the components of language competence in contextualized communicative language use,’ and this definition can also be applied to Bachman and Palmer’s (1996) model.

When the distinction between ‘competence’ and ‘performance’ is revisited, it is ‘the speaker-hearer’s knowledge of his language’ versus ‘the actual use of language in concrete situations.’ In other words, performance can be defined as the result of a learner’s language competence (the language knowledge and topical knowledge components in the model) being put forth in a context (the characteristics of the language use or test task) via his or her capacity to actually put it to use (strategic competence). It is essentially what comes out of an interaction of language competence, strategic competence, and the context. Thus, what is elicited by a test item in a reading test is a reading performance (the product) brought about by language competence together with other underlying competences.

Many of the studies that draw upon this model focus heavily on the strategic competence component (Alderson 2000: 332). However, as far as the testing of reading ability is concerned, too much focus on strategies (or skills) is dangerous, since the process by which strategic competence influences language competence cannot be observed directly (Phakiti 2008), and such a focus may lead to the confusion that we have seen in the identifiability studies of reading subskills in 2.3.2.2. The interest in strategies comes in part from an interest in characterizing the process of reading rather than the product of reading (Alderson 2000: 307), and that, as previously stated in the present thesis, is beyond the scope of this study.

Conversely, if we turn our attention to the component of characteristics of the language use or test task (the context aspect), an implication can be found for investigating the nature of a reading test. That is, various features of a test, or ‘facets’ of test methods (Bachman 1990: 115)⁷, could be assumed to make up the context component as factors that affect learners’ performance. Thus, the inference that we make about learners’ ability from their performance on a test is bound up with the nature of that particular test.

To elaborate on this in the context of reading tests, what we observe as test takers’ performance on a reading test is the result (or the ‘product’) of their reading competence implemented in the context (of which the reading test comprises a part) in conjunction with their strategic competence. Therefore, the facets of a reading test (as a part of the context) influence how reading competence is contextualized as reading performance. In this regard, in Negishi (1996) and Wada (2003), as was pointed out in 3.3.1.1, the factors extracted from students’ reading performance on a test were closely related to the components that seem to compose reading competence. This may have been because the validity of the test instruments employed in these studies was high (implying that there was little test method effect in the tests), allowing the latent structure of reading competence to be readily observable among the other factors that constitute performance. Therefore, for reading competence to be properly contextualized in reading performance, it is essential that we ‘delineate’ (Bachman 1990: 115) the nature (or the facets) of a reading test so as not to distort learners’ ‘comprehension’. At the same time, it is also vital that a theoretical view of what ‘comprehension’ is (the construct) be duly operationalized (or defined) in developing a test item. The discussion therefore now turns to the different approaches taken toward defining constructs, and to the ways in which test takers’ performances are interpreted in making inferences about their language ability, in order to determine how a test item should be developed.

7 Bachman’s (1990) framework of test method facets consists of five major categories (see Bachman 1990: 111-159):
1) testing environment
2) test rubric
3) the nature of input the test taker receives
4) the nature of the expected response to that input
5) the relationship between input and response



3.3.2 Approaches toward Construct Definition

Researchers and/or teachers use tests to elicit learners’ performance and make inferences about their language ability on the basis of what they observe in that performance. For example, an inference is made about test takers’ ‘reading comprehension’ on the basis of their responses to questions on a reading comprehension test. The term ‘inference’ is used to indicate that the test result is not itself the object of interest to test users (researchers and teachers). Instead, test users want to know what a test taker (learner) might be expected to be capable of in non-test settings. The kind of language ability that test users want to observe through test takers’ test performance, that is, the construct, is conceived, and thus defined, in different ways depending on whether one takes an ‘ability’ or a ‘performance’ orientation to testing.

3.3.2.1 Two approaches: construct or content?

Messick (1989: 15) defines a construct in a very strict sense as ‘a relatively stable characteristic of a person -- an attribute, enduring process, or disposition -- which is consistently manifested to some degree when relevant, despite considerable variation in the range of settings and circumstances.’ Inspired by this notion, Chapelle (1998: 34), in her discussion of validity studies with regard to performance assessment in second language acquisition research, classifies construct definitions into three types according to how the construct is approached.

‘Trait theorists’ define constructs in terms of the knowledge and fundamental processes of the test taker. Their approach, also called the ‘trait-type’ or ‘trait-oriented’ approach in Chapelle (1999: 156-157), is therefore to interpret test performance as evidence of underlying processes or structures, which are also responsible for performance in non-test settings. Thus, the focal problem in test design is to assess accurately the ability of interest rather than other things. This is the approach taken in ‘ability testing’.

On the other hand, ‘behaviorists’ define constructs with reference to the environmental conditions under which performance is observed. Therefore, in their approach, the performance elicited by a test item should be interpreted as the result of contextual features, and no inference should be made from it as to what underlying ability is tapped by the test item. ‘Performance⁸ testing’, which aims to make inferences more ‘directly’ about performance in non-test settings on the basis of test performance, takes this approach. The test design problem here, therefore, is to construct a test with characteristics as similar as possible to those of the non-test setting.

The last type, the ‘interactionist’ approach, can be placed midway between the two approaches above. It sees performance as the result of traits, contextual features, and their interaction. Construct definitions under this approach include both a cognitive skill or capacity and a domain where the capacity is relevant, such as ‘reading for academic purposes’ (Chapelle 1999: 157). In other words, such a construct definition allows that a learner might be good at using the target language for some purposes without this being guaranteed for other purposes.

In 3.3.1.2, it was repeatedly emphasized that what is observable in a test is learners’ performance, so language ability cannot be observed without the intervention of a test instrument. Chapelle’s (1998) ‘trait-type’ approach toward construct definition therefore seems very weak, if not invalid. Accordingly, the present discussion focuses on the difference between the ‘behaviorist’ approach and the ‘interactionist’ approach to defining a construct.

In line with Chapelle’s (1998) conceptualization of construct definition, Bachman (2002: 456), in discussing validity concerns in task-based language performance assessment (TBLPA), introduces two approaches toward defining a construct: the ‘ability-based’ and ‘task-based’ approaches. Strictly speaking, these approaches are discussed in terms of how a performance assessment is developed; however, what is actually central to his discussion is how the construct is approached, which is fundamentally a matter of construct definition. He distinguishes the two approaches by citing Norris et al. (1998):

a. in developing a performance assessment, focus either on constructs or on tasks:
i. Begin construct-based test development by focusing on the construct of interest and then develop tasks based on the performance attributes of the construct, score uses, scoring constraints, and so forth.
ii. Begin task-centered test development by deciding which performances are the desired ones. Then, score uses, scoring criteria, and so forth become part of the performance test itself. (Norris et al. 1998: 25)

Furthermore, Figure 3-3 (Bachman 2002: 457) illustrates the difference between the two concepts well.

[Figure 3-3: two-panel diagram, panels (a) and (b). Labels in the figure: Interpretation ‘Has language ability’; Interpretation ‘Can do “real-life” tasks’; Domain of TLU tasks; Performance consistency; Language ability; Assessment tasks and context]

Figure 3-3 Different interpretations of response consistencies on language assessment tasks: (a) ‘Ability-based’ inferences about language ability and (b) ‘Task-based’ predictions about future performance on ‘real-world’ tasks

8 The term ‘performance’ in ‘performance testing’ may be confused with the ‘performance’ discussed in 3.3.1. The distinction between the two is that ‘performance’ in performance testing is used in a narrower sense, in that it is assumed to be indivisible and not amenable to any effort to break it down into interpretive components, whereas ‘performance’ defined in relation to ‘competence’ is presumed to be an interaction of language competence, strategic competence, and the context, as was illustrated in 3.3.1.2.



In discussing content validity in 3.2, it was stated that, in a general proficiency test, where the whole of the language is the target domain, ‘content’ becomes the ‘construct’, in that a representative sample of the domain of language ability to be measured is not directly observable. However, when the two approaches described by Chapelle (1998) and Bachman (2002) (which are in essence the same notion) are considered, there is indeed a difference between the two. It appears that, in the competence-based ‘interactionist’ approach, or (a) in Bachman’s (2002) model, one must consider both constructs and test items, while, in the content-based ‘behaviorist’ approach, or (b), one considers only performances on test items. The former approach maintains that the process of designing, developing and using language tests should incorporate both specifying the test items to be included and defining the abilities to be measured (i.e. the construct) (Bachman and Palmer 1996; Brown 1996; Alderson 2000; Douglas 2000). The latter approach goes only as far as defining the tasks embedded in the context (i.e. the content).

This distinction between the two approaches to construct definition is also debated in Hudson (2005) and Norris and Ortega (2003). In reflections on criterion-referenced language assessment, the complexity of language use, the complexity of assessing language ability, and the difficulty of interpreting potential interactions between a task and its difficulty, which are indispensable yet difficult to implement, are discussed. Hudson (2005: 205) describes the views on this issue as follows: “They reflect a current appetite for language assessment anchored in the world of functions and events, but also must address how the worlds of functions and events contain non skill-specific and discretely hierarchical variability.” At the same time, he states that further research is required to investigate the relationship between the “task-dependent” view and the “task-independent” view of the construct. Norris and Ortega (2003: 729) offer the insight that these conflicting views reflect the differing paradigms from which the motivations for research originate, and see them as bearing “witness to the fact that construct definitions are available”, thus encouraging a shift toward further examination of the conceptual bases of the ‘measurement’ aspect.



     The distinction between the two approaches in construct definition is very similar to what can be perceived as the difference between the framework of ‘Common Reference Levels of Language Proficiency’, or more popularly, the ‘Common European Framework of Reference for Languages’ (CEFR) (Council for Cultural Co-operation Education Committee, Modern Language Division 2001:24-29) and the ALTE ‘Can Do’ statements (Association of Language Testers in Europe 2002).

     CEFR was developed with the intention of providing a common basis for the elaboration of language syllabi, curriculum guidelines, examinations, textbooks, etc. across Europe, where many people with different first languages emigrate or immigrate to places where they are required to learn a new language. It is a comprehensive description of what language learners have to learn to do in order to use a language for communication. In particular, the proficiency descriptors define levels of proficiency which allow learners’ progress to be measured at each stage of learning and on a life-long basis. As stipulated by the developers of CEFR, the descriptors are to be used as a ‘grid which users can exploit to describe their system’ (Council of Europe 2001:21), that is, as a scale of reference levels. The developers further note that the descriptors should be ‘context-free’ in order to accommodate generalizable results from different specific contexts and be ‘based on theories’ of language competence (Council of Europe 2001:21). This is exactly the approach taken by the competence-based model of construct definition. In other words, the ‘framework’ approach that these models and descriptors take assumes a situation in which the inference to be made about learners’ future performance will be arrived at analytically, as an interaction of their competence and the context.

     On the other hand, the ALTE ‘Can Do’ statements take a different approach: an approach that focuses on the content of what is to be measured. The ALTE ‘Can Do’ statements are an application of CEFR descriptors that was made with the aim of developing and validating a set of performance-related scales, describing what learners can actually do in the foreign language (see Appendix D of Council of Europe 2001 for details). In their original conception, they were made to be user-orientated, providing interpretations of test results that can be easily understood by non-specialists. As stated in Council of Europe (2001:244-245), they are supposed to be a ‘tool’ for providing easily understandable ‘descriptions of performance’ which can be used in ‘specifying requirements to language trainers, formulating job descriptions, specifying language requirements for new posts’. The whole list consists of 400 statements that are organized into three areas according to their applicable contexts (Social and Tourist, Work, and Study). It is clear from these descriptions that the ALTE ‘Can Do’ statements follow the content-based approach to describing what is to be inferred from the test, so that they can be easily understood by non-specialists. Their main concern is to define the performance holistically ‘in the context’ and to illustrate what is to be measured in the test (or described in the list, for the ALTE ‘Can Do’ statements) in such a way that they describe a representative sample of the future performance expected of learners.

3.3.2.2 Importance of a Framework

     Both of the approaches described above are valid ways of defining the construct for language ability measurement. The difference lies in their purposes of use, as was apparent in the difference between the CEFR descriptors, which were developed for experts to apply to various circumstances, and the ALTE ‘Can Do’ statements, which were contextualized for non-specialists to use as an easy reference in concrete contexts. Hence, as Norris and Ortega (2003:729) conclude, the choice traces back to the differing paradigms from which the motivations of research originate.

     In the present research, the prime interest is in investigating how a reading product, or performance, could be elicited by a test item so that some interpretations and generalizations could be made about a test taker’s ability. The objective in language testing is to make ‘inferences’ or ‘predictions’ about what a learner ‘may’ be able to do in real-life situations. So far as ‘measuring’ ability is concerned, even if the situation calls for content-based evaluation of learners’ performance, test developers need to have a ‘theoretical framework’ that they can work with to conceptualize what it is that they are measuring. In this respect, in the present research, the competence-based approach to defining the construct seems to prevail, and that is what this study is going to inquire into.

3.3.3 Considering “Speed” as a Construct

     A distinction is often made between speed tests and power tests. Speed tests are tests which employ content of a sufficiently low difficulty level that the majority of people for whom the tests are intended would be expected to perform perfectly if they were given a sufficient amount of time, but, since they are not, the rate of response is of primary importance in determining success. On the other hand, power tests allow enough time for responding, so that nearly all people may attempt every item, but, because the items bear such a high difficulty level, their knowledge level, or “power”, becomes the point of success in completing the test (Henning 1987:196).

     Most tests fall somewhere between the two extremes, since knowledge rather than speed is the primary focus, but time limits are enforced since weaker students may take an unreasonable length of time to finish (Henning 1987:8). Most test designers, experimentally or intuitively, time their test to allow roughly 90% of test-takers to complete in time, but do not consider their test to be speeded (Alderson 2000:150). The distinction between speed tests and power tests is often considered just a difference in the focus intended in implementing or designing the test. However, since most power tests, in practice, are timed, with their results influenced by test takers’ speed of processing and production, “speed” could be, and should be, regarded as an important variable that constitutes language ability. In fact, the results from Hirai (1999) suggest that a correlative relationship could be found between the scores of Japanese EFL learners on a cloze test and their reading speeds as well as their listening speeds. Furthermore, Shizuka (2000), in his study on the validity of incorporating reading speed and response confidence in measuring EFL reading proficiency, concluded that reading speed was a valid element in demonstrating test takers’ reading ability. In the same manner, Naganuma and Wada (2002) experimentally demonstrated that test takers’ “reading speed” had a certain relationship with their ability levels. On many occasions, “power” elements such as those introduced in 3.3.2 are investigated as possible factors that constitute the reading performance of test takers. However, in a situation where no “true” power test can exist, the speed at which test takers process and perform test items should be considered as a factor that constitutes their reading ability.
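The rule of thumb attributed to Alderson (2000) above can be made concrete with a small calculation. The following Python sketch is illustrative only: the function names, the sample data, and the reading of ‘complete in time’ as ‘attempted every item’ are assumptions introduced here, not part of the studies cited.

```python
def completion_rate(items_attempted, total_items):
    """Return the share of test takers who attempted every item in time.

    items_attempted: one number per test taker (hypothetical data).
    """
    finished = sum(1 for n in items_attempted if n >= total_items)
    return finished / len(items_attempted)


def meets_completion_target(items_attempted, total_items, target=0.90):
    """Check the timing against the rough 90% completion rule of thumb."""
    return completion_rate(items_attempted, total_items) >= target


# Hypothetical 30-item test taken by ten test takers.
attempted = [30, 30, 28, 30, 30, 30, 25, 30, 30, 30]
print(completion_rate(attempted, total_items=30))
print(meets_completion_target(attempted, total_items=30))
```

In this invented sample, two of the ten test takers did not reach the final item, so the time limit would fall short of the 90% target and speed would be expected to influence the scores.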

3.4 Lining Test Items in Sequence

     Given the interest in inquiring into how a reading performance could be accounted for with respect to a learner’s reading ability, it is crucial that a test item is approached from the perspective of ‘measurement.’ If the different performances of reading which are elicited by different test items were to be termed the ‘qualitative’ perspective of a test item, the ‘quantitative’ perspective would be the difficulty that is assigned to those performances. In this section, the operationalization of test items with regard to item difficulty will be discussed.

3.4.1 Specifying Difficulty

     In an attempt to cast light on the quantitative side of a test item, it is essential that this is done from the perspective of item specifications. In particular, when language use is observed in settings that are more realistic and complex, focusing on its ‘product’ aspect, it is vital that the person who is constructing the item provide an explicit outline of the trait to be measured, how that trait will be realized in the performance elicited, how that performance will be elicited, and how that performance will be quantified to provide an index of the test taker’s ability. Mislevy and Almond (2002:478) stress that “a systematic means for designing performance assessments that will directly and adequately inform the particular kinds and qualities of inferences that need to be made” is vital, before proposing their influential framework, the Evidence Centered Design framework (ECD framework).

     Davies et al. (1999:207) describe test specifications as a “document which sets out what a test is designed to measure and how it will be tested.” It provides “a blueprint” for item writers and is “important in the establishment of the test’s construct validity.” In essence, this explanation could be applied to item specifications also.

     With respect to the quantifying of test items, since a test taker’s performance is elicited by a test item, it could be presumed that each element that constitutes the item specifications reflects the traits that are tapped by the test item. Thus, the item difficulty of the test item should be interpreted as quantifying how difficult it is to succeed in fulfilling, in aggregate, all those elements that constitute the performance (Mislevy and Almond 2002). In other words, what can be inferred from the item difficulty indicated is the difficulty of the task as a whole.

     However, in developing test items with regard to the competence approach, the interest is placed on the difficulty of each element rather than that of the whole test item, since the aim is to account for ‘why’ or ‘how’ the item has come to possess the difficulty indicated. If the difficulty of each of the elements that constitute an item could be specified, then the prediction of difficulty for test items, and thus their quantification, becomes possible. To this end, the search into the difficulty of each ‘element’ becomes vital.
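One simple way to picture this reasoning is an additive model, in which the predicted difficulty of an item is the sum of the difficulties of its elements. The sketch below assumes such a model purely for illustration; the element names and values are hypothetical, and the additive form is an assumption introduced here rather than a claim made by Mislevy and Almond (2002).

```python
# Hypothetical element difficulties on an arbitrary scale (higher = harder).
ELEMENT_DIFFICULTY = {
    "explicit_information": 0.0,
    "implicit_information": 1.0,
    "small_text_unit": 0.0,
    "large_text_unit": 0.5,
}


def predict_item_difficulty(elements, element_difficulty=ELEMENT_DIFFICULTY):
    """Predict item difficulty as the sum of its elements' difficulties."""
    return sum(element_difficulty[name] for name in elements)


# Hypothetical item specifications, each listing the elements an item involves.
items = {
    "item_A": ["small_text_unit", "explicit_information"],
    "item_B": ["small_text_unit", "implicit_information"],
    "item_C": ["large_text_unit", "implicit_information"],
}

predicted = {name: predict_item_difficulty(parts) for name, parts in items.items()}
ranking = sorted(predicted, key=predicted.get)  # easiest item first
print(predicted)
print(ranking)
```

Under these invented values the items are ranked from item_A (easiest) to item_C (hardest). Whether element difficulties actually combine additively is an empirical question that a sketch of this kind cannot settle.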

3.4.2 Seeking a Link between Question Type and Item Difficulty

     Among the studies that look into the measurement of a certain competence in reading ability, a few studies that focus on the element of “type of reading” can be found. Although there is some research that investigates relationships between item difficulty and the type of reading tapped by an item (e.g. Tal et al. 1994, North 2000, Oded and Walters 2001, Trites and McGroarty 2005, Gomez et al. 2007), few such attempts have been made in Japan.

     As was discussed in Chapter 2 of the present thesis, with respect to how a reading product could be illustrated as a construct, the ‘two-dimensional approach’ to reading ability, which was derived from the factor analytic studies in Negishi (1996) and Wada (2003), may hold a potential key. Wada (2003), inspired by Negishi (1996), in a factor analytic study of reading tests given to EFL learners in Japan, observed that reading ability could be broken down into components described by the ‘local/global comprehension’ dimension and the ‘literal/inferential comprehension’ dimension, suggesting the validity of ‘question types’⁹ that elicit ‘local-literal’, ‘local-inferential’ and ‘global-inferential’ types of reading.¹⁰

     Wada (2003), on the basis of her empirical findings, illustrated that the ‘local-literal’ question type is a type of test item that can be answered by returning to a small unit of information in the passage and understanding what is explicitly stated there. The ‘local-inferential’ question type is a type of test item that can be answered by referring to a small unit of information in the passage and understanding what is implicitly stated there by making inferences. The ‘global-inferential’ question type is a type of test item that can be answered by referring to a large unit of information in the passage and making inferences in order to arrive at a macropropositional idea of the text.

     So far as ‘measuring’ ability is concerned, constructing test items demands a ‘theoretical framework’ which embodies a ‘developmental sequence’ and accounts for how and why certain items are perceived to be more difficult than others by the test takers. Such a framework would provide a ‘construct’ that delineates the relationship between the elements of reading performances, such as ‘question types’, and test takers’ reading abilities. As mentioned previously, North (2000) made a substantial effort in developing the CEFR and a scale to describe language proficiency. The same sort of approach was taken by Gomez et al. (2007), who conducted a “scale-anchoring study”, an attempt to create descriptors that account for the reading performances of test takers at different levels of English proficiency based on both empirical data and judgments by test developers. Gomez et al. (2007) succeeded in creating descriptors that encapsulate what test takers at a given proficiency level are able to do in carrying out reading tasks, and illustrated how this latent ability structure alters in accordance with test takers’ proficiency levels. The study noted that there were variances in what test takers could do in arriving at an answer when solving reading test items.

9 The term ‘question type’ in the language testing context may sometimes indicate the format of test items (e.g. multiple-choice question, true-false question). However, here, it indicates the type of a test item with reference to what type of reading (e.g. local-literal) it elicits.

10 Although there were two dimensions (local/global and literal/inferential) originally assumed in the study, Wada (2003) could not extract the fourth type of comprehension, ‘global-literal’ comprehension, in her factor analytic study, concluding that only the ‘global-inferential’, ‘local-inferential’, and ‘local-literal’ types of comprehension are valid to be assumed as ‘question types’.

     For example, in reporting the results of their findings, Gomez et al. (2007) described:

Performance at the Low level  The descriptors that emerged from our analyses state that test takers at the Low performance level “have difficulty identifying the author’s purpose except when that purpose is explicitly stated in the text or easy to infer from the text.” This statement implies that Low-level test takers are able to identify the author’s purpose when it is explicitly stated or easy to infer from the text... (p. 430)

     This could be compared with their descriptions for other ability groups responding to the same test item:

Performance at the Intermediate level  The descriptors that emerged from our analyses state that test takers at the Intermediate performance level “can recognize the expository organization of a text and the role that specific information serves within a larger text but have some difficulty when these are not explicit or easy to infer from the text,” while test takers at the High level can recognize text organization and the role served by specific information “even when the text is conceptually dense.” (p. 431)

Performance at the High level  ... Faced with this degree of conceptual density, most test takers at the Intermediate level were unable to infer the author’s purpose correctly, whereas most of those at the High level were able to do so. (p. 432)



     The fact that this alteration in the latent ability structure was revealed in Gomez et al. (2007) gives the present research a positive prospect for further investigating the relationship between the ‘question types’ suggested by Wada (2003) and their perceived difficulty among different ability groups, in order to construct a sequence which reflects the ELT environment in Japan. Given the intention to investigate how a reading test item could be constructed to elicit the types of reading performances described by Wada’s (2003) ‘question types’ with regard to test takers’ latent ability structure, seeking a link between these ‘question types’ and their item difficulty seems to be a requisite. Although Wada (2003) explored the qualitative aspects of ‘question types’ and their validity as variables in item construction to some extent, how these ‘question types’ could be linked with test takers’ latent ability structure was not considered. Would a test item be perceived to render the same ‘question type’ across test takers of different ability levels, or would it be perceived differently according to their ability levels? Would test items with different question types have a generalizable order in their difficulties, and, if so, how could they be ranked? To investigate how this notion of ‘question types’ could be implemented as a variable in constructing a test item to “measure” a test taker’s reading ability, a way to quantify this aspect of ‘question types’ in conjunction with test takers’ latent ability structure is needed.
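As a rough illustration of what such a quantification might look like, the sketch below computes the proportion of correct responses (item facility) for each combination of ability group and question type. The data, the group labels, and the function name are hypothetical, and classical facility values are only one possible index; the sketch is not presented as the procedure of the present study.

```python
from collections import defaultdict


def facility_by_group_and_type(responses):
    """Return the proportion correct for each (ability group, question type).

    responses: iterable of (ability_group, question_type, is_correct) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [number correct, number attempted]
    for group, question_type, is_correct in responses:
        cell = counts[(group, question_type)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Hypothetical responses: four per ability group and question type.
responses = (
    [("low", "local-literal", c) for c in (True, True, True, False)]
    + [("low", "global-inferential", c) for c in (True, False, False, False)]
    + [("high", "local-literal", c) for c in (True, True, True, True)]
    + [("high", "global-inferential", c) for c in (True, True, False, False)]
)

facility = facility_by_group_and_type(responses)
for key in sorted(facility):
    print(key, facility[key])
```

A table of this kind would show whether the ordering of question types by difficulty is the same in every ability group, which is the question of a generalizable order raised above.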
