
CHAPTER 3 TESTING READING PERFORMANCE

Since the focus of this study is on the testing of reading comprehension, this chapter now turns to the testing perspective of the present research. In discussing language testing, there are two things that should be considered: ‘what it is that we are trying to measure’ and ‘how we are going to measure it’. The purpose of this chapter is to explore these two concepts in depth, which will hopefully lead us to conceptualize the role and nature of reading tests in the measurement of reading ability.

3.1 Test as an Instrument in Measuring Language Ability

A test, according to the second edition of the Dictionary of Language Teaching and Applied Linguistics (Richards, Platt and Platt 1992:377), is ‘any procedure for measuring ability, knowledge, or performance.’ In other words, a test can be interpreted as an instrument that we can use to gauge how much ability (or knowledge, or performance) is present in a learner, just as we use a ruler to find out how long a piece of cloth is. Furthermore, when we hear the word ‘test’, what comes to mind is a set of test questions from which our total scores on the ‘test’ are calculated according to the number of questions we answer correctly.

I have given the definition in this general sense because it is often difficult to find a simple and explicit definition of what a test is in the language testing literature (e.g. Henning 1987; Bachman 1990; McNamara 1996; Urquhart and Weir 1998; Alderson 2000), as there is much to be considered in describing the nature of a test. However, what many of these works do indicate unequivocally is that the measurement acquired through implementing a test assumes measurement errors, and that what is determined regarding the ability of test takers through the use of a test is no more than an inference made from their performances on it. Johnston (1983:53-54) expresses his concern about test methods as a crucial factor for accurate measurement, especially in testing reading:


...since reading comprehension is a mental activity, it is only available for indirect, second-hand scrutiny. We can never actually watch the mental operations, but must infer them from other sources of data. In making these inferences, we must be very clear about the grounds we have for doing so. In order to be so informed, we should understand (as clearly as our data and theory will allow) the actual demands and assumptions involved in our assessment techniques.

In other words, in trying to understand the nature of a test, the central concern, whether in developing a test or making use of it to measure language ability, is whether, or to what extent, the test is measuring what it purports to measure. In this regard, the challenge in language testing research is how, and to what extent, we can ensure that a particular test elicits a representative sample of the language ability we wish to measure.

3.2 Test Validity

In understanding what makes a ‘good’ test, we have to consider two things, one of which is test validity: whether, and to what extent, a test measures what it purports to measure. For example, a paper-and-pencil ‘pronunciation’ test or a writing test heavily based on specialized background knowledge would be very low in validity, since the ability sample extracted via such a test is very different from what it is intended to measure. The other criterion that should be considered in making a good test is reliability: the extent to which a test is consistent in its measurement. A popular example of when test reliability must be questioned is when, in a writing test, two raters mark the same essay but give very different marks. Similarly, when the same rater rates the same essay twice and gives a different mark in the second rating, this is also considered problematic in terms of reliability. The former is an example of a problem of inter-rater reliability, while the latter concerns intra-rater reliability.
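To make these two notions concrete, here is a minimal illustrative sketch, with hypothetical essay marks, that quantifies agreement between two raters; intra-rater reliability could be examined the same way by correlating a single rater's first and second ratings of the same essays.

```python
# Illustrative sketch (hypothetical ratings): two common ways of looking at
# inter-rater reliability for a writing test - exact agreement and the
# correlation between two raters' marks of the same ten essays.

rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]   # marks given by rater A
rater_b = [4, 2, 5, 2, 3, 3, 4, 1, 4, 2]   # marks given by rater B to the same essays

exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Exact agreement: {exact_agreement:.0%}")

def pearson_r(x, y):
    """Pearson correlation between two sets of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"Inter-rater correlation: {pearson_r(rater_a, rater_b):.2f}")
# Intra-rater reliability could be examined in the same way, correlating a
# single rater's first and second marks of the same essays.
```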


Comparing the two concepts, the commonly accepted view is that reliability is a necessary condition for a test to be valid, because test scores that are not reliable cannot provide a basis for valid interpretation and use, but that reliability alone does not guarantee test validity (Kobayashi 1995:81-82). To return to the example of the paper-and-pencil ‘pronunciation’ test: it is perfectly possible to make such a test highly consistent, but despite its high reliability the test remains very low in validity. Therefore, validity is ‘the most important quality in the development, interpretation, and use of language tests’ (Bachman 1990:289).

There are four types of validity in language testing: construct validity, content validity, face validity, and concurrent or predictive validity (Bachman 1990; Heaton 1988; Hughes 2003; McNamara 2000). Concurrent or predictive validity is obtained by comparing the results of a test with those of other measurements: the former with other existing tests and the latter with the future performance of the testees, usually by calculating correlations (Kobayashi 1995:83). These comparisons have little significance as a measure of validity unless the other measures with which the test is compared are themselves established as valid (e.g. a well-established standardized test). However, if the test against which the new test is validated is indeed considered to be valid, this would be a powerful means of test validation.
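As an illustration of the correlational logic behind concurrent validation, the following sketch correlates hypothetical scores on a new reading test with scores on an established standardized test taken by the same learners; all names and figures are invented.

```python
# Hypothetical sketch of concurrent validation: correlate scores on a new
# reading test with scores from an established standardized test taken by
# the same learners. All data here are invented for illustration.

new_test    = [55, 62, 47, 71, 66, 58, 80, 43, 69, 75]
established = [540, 600, 470, 680, 650, 560, 720, 450, 640, 700]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(new_test, established)
print(f"Concurrent validity coefficient: r = {r:.2f}")
# The coefficient is only meaningful if the criterion measure itself is valid;
# a high r against an invalid criterion tells us nothing about the new test.
```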

Face validity, according to Davies et al. (1999:59), is the degree to which a test appears to measure the knowledge or abilities it claims to measure, as judged by an untrained observer (such as the candidate taking the test or the institution which plans to administer it). Concerns for face validity are often dismissed as trivial because they have to do with appearances rather than with the underlying construct of ability being measured by the test, but it has also been argued that failure to take issues of face validity into account may jeopardize the public credibility of a test.

Content validity can be defined as a parameter which concerns ‘whether the test content consists of a representative sample of the domain of language ability to be measured’ (Davies et al. 1999:34). Some testing specialists make no distinction between face validity and content validity, in that both are intuitive and logical but usually lack an empirical basis (Henning 1987:94). Others do make a distinction between the two, dismissing face validity as merely ‘impressionistic’ compared with content validity, which employs more scientific approaches to determining validity (e.g. Oller 1979, cited in Henning 1989:96).

In the development of a performance test4, content validity is normally achieved by means of a thorough needs analysis of the target domain, upon which the test content is based (e.g. McNamara 1996). An achievement test5 seeks content validity by drawing a representative item sample from the syllabus on which it is based. For a general proficiency test, where the whole of the language is the target domain, content then becomes the construct. This means that, in the present research, where the test instrument to be used is a general proficiency test, the content validity of the test is strongly related to its construct validity.

Construct validity, as is often described in publications on language testing research (e.g. Henning 1987; Bachman 1990), is concerned with the extent to which a test is related to a theoretical construct of language ability. Construct validation involves an investigation of the qualities that a test measures, thus providing a basis for the rationale of a test. Therefore, when the construct validity of a certain test is discussed, the term ‘validity’ is often used interchangeably to mean construct validity. Furthermore, as may be clear from the previous discussion in this section, construct validity is strongly related to the other three types of validity, if it does not subsume them.

4 Not all language tests are of the same kind. They differ in respect of test method and test purpose. In terms of method, McNamara (2000:5) distinguishes traditional paper-and-pencil language tests from performance tests. According to his distinction, paper-and-pencil tests take the form of the familiar examination question paper. In performance-based tests, on the other hand, language skills are assessed in a form of actual performance (e.g. interview tests to assess speaking ability or essay tests to assess writing ability). In recent practice, both of these test methods can be realized virtually on computer (usually referred to as Computer-Based Testing or Computer Adaptive Testing). In that respect, it may be better to express paper-and-pencil language tests by the term ‘ability’ testing. The distinction between ‘ability testing’ and ‘performance testing’ is discussed in detail in 3.3.2 in relation to construct definition.

5 In terms of test purpose, the most familiar distinction is made between achievement tests and proficiency tests. Achievement tests relate to a course of instruction and measure how far learners have attained its objectives, whereas proficiency tests look to the future situation of language use without necessarily any reference to the previous process of teaching. The study done in the present research does not involve any analysis of results from instruction; it is solely involved with proficiency testing, though the findings may be helpful in the construction of achievement tests.

It seems, from the descriptions above, that construct validity, and thus content validity in the present research, are of the utmost importance because they relate essentially to the core of what is to be tested and how it is best measured. Therefore, the discussion now turns to how the construct and contents could be constituted in investigating the nature of reading tests.

3.3 Construct Definition

A construct is the trait or traits that a test is intended to measure. It is defined as ‘an ability or set of abilities that will be reflected in test performance’ (Davies et al. 1999:31), and from which ‘inferences can be drawn’ on the basis of test scores (Chapelle 1999:154). In other words, it is a meaningful and useful way of interpreting test performance (Messick 1988). A construct is usually based on a theory, so a test, then, represents an operationalization of the theory on which it is based. Therefore, a reading test is an operationalization of a reading construct derived from theories of reading ability, which have just been discussed in Chapter 2 of this paper.

Thus, construct definition is a very important component of a test, clarifying what is to be inferred about the ability of a test taker from his performance on the test (from the perspective of a test constructor), or helping us to understand it (from the perspective of a test user). With regard to test validity, a well-defined construct is essential in keeping the validity of the test high.

3.3.1 Conceptualizing Language Ability

Prior to the discussion of construct definition, it is essential that theories of language ability be reviewed, because they are, by definition, what a construct is established upon. Traditionally, language ability was thought to consist of modules of linguistic knowledge; in other words, it was considered to be an accumulation of discrete elements of language. However, in the last few decades, considerable progress has been made in establishing models of language ability (e.g. Chomsky 1957, 1980), especially in regard to how the concept of ‘communication’ should be synthesized in defining what language ability is (e.g. Hymes 1972; Canale and Swain 1983; Bachman 1990). This trend has made a great contribution to the field of English Language Teaching, and approaches such as the Communicative Approach in teaching or Communicative Testing have become very influential in both pedagogy and research.

In the discussion of language ability, Chomsky’s distinction between ‘competence’ (the speaker-hearer’s knowledge of his language) and ‘performance’ (the actual use of language in concrete situations) was a significant milestone. In considering language testing, the distinction between underlying ‘competence’ and actual ‘performance’ is crucial because we need to sample actual language use, or what can be directly observed and evaluated as a product. In other words, as many studies suggest (e.g. Canale and Swain 1980), underlying competence can be assessed only through its realization in performance. Thus, a further examination of how competence and performance are related is necessary.

3.3.1.1 Defining competence

Inspired by Hymes’s (1972) notion of ‘communicative competence’, which takes sociolinguistic elements into account as opposed to Chomsky’s ‘linguistic competence’, various models of language ability which present the idea of language ability consisting of grammatical knowledge and knowledge of use have been introduced. Bachman’s (1990) model, shown in Figure 3-1, seems to be the most comprehensive of all at present.

In the model, language competence is divided into organizational competence (which includes grammatical and textual competence) and pragmatic competence (which includes illocutionary and sociolinguistic competence).


[Figure 3-1: Components of language competence (Bachman 1990:87). The figure shows language competence branching into organizational competence (grammatical and textual competence) and pragmatic competence (illocutionary and sociolinguistic competence).]

When the model is consulted with regard to the theories of reading ability discussed in the previous chapter, the way Bachman’s model makes the distinction between organizational competence and pragmatic competence seems to coincide with the way some studies (e.g. Negishi 1996; Grabe 1999) make the distinction between ‘a text model of comprehension’ and ‘a situation model of comprehension’ (see 2.2.2). The concepts of organizational competence and text modeling are both established in the linguistic dimension of language ability, whereas the concepts of pragmatic competence and situation modeling are both enacted in its world knowledge counterpart.

Furthermore, with regard to an empirically derived model of reading ability, the three components of FL reading ability extracted in Negishi (1996) (see 2.3.2.1) fit nicely into Bachman’s (1990) model: the ‘Linguistic Competence’ factor of Negishi (1996:134) could be explained by grammatical competence in Bachman’s model, the ‘world knowledge’ factor by sociolinguistic competence, and the ‘Reading Skills’ factor by textual and illocutionary competence. The fact that Negishi’s (1996) ‘Reading Skills’ factor is explained by two different components of competence, allotted to two different dimensions of language competence in Bachman’s (1990) model, is further explained by another empirically derived model, the ‘two-dimensional approach’ to the latent structure of reading ability (Negishi 1997; Wada 2003) (see 2.3.2.1). The ‘local/global comprehension’ component, which is attributed to the amount of information integrated (Wada 2003:58), could be explained by textual competence, and the ‘literal/inferential comprehension’ dimension, which is attributed to the amount of information processing (Bachman and Palmer 1982; Wada 2003), could be explained by illocutionary competence. Thus, the ‘two-dimensional approach’ to the latent structure of reading ability (Negishi 1997; Wada 2003) can be considered a useful model that can work as a construct to explain the reading skills part of reading ability, as was suggested in 2.4 of this paper.

3.3.1.2 Defining performance

Various attempts have been made to define a mechanism by which competence and performance could be bridged. The inclusion of strategic6 competence in a language ability model by Canale and Swain (1980) is one such attempt. However, they treated this competence mainly as being compensatory (i.e. ability necessary when communication breaks down) and did not place much emphasis on it. Later, Canale (1983) modified the earlier joint model, emphasizing the importance of strategic competence as a more independent mechanism essential for successful communication.

Canale’s idea of strategic competence was further developed by Bachman (1990) in his theoretical framework of communicative language ability. Bachman and Palmer (1996) developed Bachman’s (1990) model further and presented it as the visual metaphor shown in Figure 3-2.

As illustrated below, the framework of Bachman and Palmer (1996) views language use (or performance) as interactions among areas of language ability (composed of language knowledge and strategic competence; described in detail in Figure 3-1), topical knowledge, and affective schemata, on the one hand, and how these interact with the characteristics of the language use situation, or test task, on the other.

other. The figure al so illustrates various interactions that are assumed to be involved

61n Note#2 in 2.4, it was stated that the present study treats the term‘strategy’to be equivocal to

what is meant by‘skills’. Although the present author maintains this notion and considers what is

meant by strategic competence here is actually‘skills’discussed in Chapter 2, the word‘strategy’is

used here because the stUdies cited in the present discussion used the term in their publications.

30

東京外国語大学 博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)

Page 9: CHAPTER 3 TESTING READtNG PERFORMANCErepository.tufs.ac.jp/bitstream/10108/51461/7/dt-ko-0108005.pdf · chapter now turns to the testing perspective of the present research. In

in langUage USe.

their model:

Bachrnan and Palmer(1996:62)give a detailed explanation of

 Topical

knowledge

  knowledge

/≡⊆:“ゼ\

 Strateglc

competenc8

                          Charactenstics of the

                          language use or test

                           task and seding

Figure 3-2 Some componentS of language use and Ianguage test pertormance

                                          {Bachman and Palmer 1996:63)

The components that are within the smaller, bold circle (‘topical knowledge’, ‘language knowledge’, ‘personal characteristics’, ‘strategic competence’ and ‘affect’) represent characteristics of individual language users, while the outer circle includes characteristics in the task or setting with which the language use interacts. The double-headed arrows indicate interactions. The figure indicates that strategic competence is the component that links other components within the individual, as well as providing the cognitive link with the characteristics of the language use task and setting.

What is to be noted here is the interaction illustrated in the model between strategic competence and the characteristics of the language use situation, or test task. Bachman (1990:84) defines strategic competence as ‘the mental capacity for implementing the components of language competence in contextualized communicative language use,’ and this definition can also be applied to Bachman and Palmer’s (1996) model.

When the distinction made between ‘competence’ and ‘performance’ is revisited, it is ‘the speaker-hearer’s knowledge of his language’ versus ‘the actual use of language in concrete situations.’ In other words, performance can be defined as the result of a learner’s language competence (or the language knowledge and topical knowledge components in the model) put forth in a context (the characteristics of the language use or test task) via his capacity to actually put it to use (or strategic competence). It is essentially what comes out of an interaction of language competence, strategic competence, and the context. Thus, what is elicited by a test item in a reading test is a reading performance (the product) embodied by language competence along with other underlying competences.

Many of the studies that draw upon this model put much focus on the strategic competence component (Alderson 2000:332). However, as far as the testing of reading ability is concerned, too much focus on strategies (or skills) is dangerous, since the process of how strategic competence influences language competence cannot be observed directly (Phakiti 2008) and may lead to the confusion that we have seen in the identifiability studies of reading subskills in 2.3.2.2. The interest in strategies comes in part from an interest in characterizing the process of reading rather than the product of reading (Alderson 2000:307), and that, as previously stated in the present thesis, is beyond the scope of this study.

Conversely, if we turn our attention to the component of characteristics of the language use or test task (the context aspect), an implication can be found toward investigating the nature of a reading test. That is, various features of a test, or ‘facets’ of test methods (Bachman 1990:115)7, could be assumed to make up the context component as factors that affect learners’ performance. Thus, the inference that we make from learners’ performance on a test about their ability encompasses the nature of that particular test.

To elaborate on this in the context of a reading test, what we are observing as test takers’ performance on a reading test is the result (or the ‘product’) of their reading competence implemented in the context (of which the reading test comprises a part) in conjunction with their strategic competence. Therefore, the facets of a reading test (as a part of the context) influence how reading competence is contextualized as a reading performance. In this regard, in Negishi (1996) and Wada (2003), as was pointed out in 3.3.1.1, the factors that were extracted from students’ reading performance on a test were in close relation with the components that seem to compose reading competence. This may have been because the validity of the test instruments employed in these studies was high (implying that there was little test method effect in the tests), allowing the latent structure of reading competence to be readily observable among other factors that constitute the performance. Therefore, for reading competence to be properly contextualized in reading performances, it is essential that we ‘delineate’ (Bachman 1990:115) the nature (or the facets) of a reading test so as not to distort learners’ ‘comprehension’. At the same time, it is also vital that a theoretical view of what ‘comprehension’ (or the construct) is be duly operationalized (or defined) in developing a test item. Therefore, the discussion now turns to the different approaches taken toward defining constructs, and to the ways in which test takers’ performances are interpreted in making inferences about their language ability, in determining how a test item should be developed.

7 Bachman’s (1990) framework of test method facets consists of five major categories:

1) testing environment
2) test rubric
3) the nature of the input the test taker receives
4) the nature of the expected response to that input
5) the relationship between input and response  (see Bachman 1990:111-159)


3.3.2 Approaches toward Construct Definition

Researchers and/or teachers use tests to elicit learners’ performance and to make inferences about their language ability on the basis of what they observe from this performance. For example, an inference is made about test takers’ ‘reading comprehension’ on the basis of their responses to questions on a reading comprehension test. The term ‘inference’ is used to indicate that the test result is not itself the object of interest to test users (researchers and teachers). Instead, test users want to know what a test taker (learner) might be expected to be capable of in non-test settings. What kind of language ability test users want to observe from test takers’ test performance, that is, the construct, is conceived, and thus defined, in different ways depending on whether one takes an ‘ability’ or a ‘performance’ orientation to testing.

3.3.2.1 Two approaches: construct or content?

Messick (1989:15) defines a construct in a very strict sense by saying that it is ‘a relatively stable characteristic of a person -- an attribute, enduring process, or disposition -- which is consistently manifested to some degree when relevant, despite considerable variation in the range of settings and circumstances.’ Inspired by this notion, Chapelle (1998:34), in her discussion of validity studies with regard to performance assessment in second language acquisition research, categorizes construct definition into three types in terms of how constructs are defined.

‘Trait theorists’ define constructs in terms of the knowledge and fundamental processes of the test taker. Therefore, their approach, also called a ‘trait-type’ or ‘trait-oriented’ approach in Chapelle (1999:156-157), would be to interpret test performance as evidence of underlying processes or structures, which are also responsible for performance in non-test settings. Thus, the focal problem in test design is to assess accurately the ability of interest rather than other things. This is the approach taken in ‘ability testing’.

On the other hand, ‘behaviorists’ define constructs with reference to the environmental conditions under which performance is observed. Therefore, in their approach, the performance elicited by a test item should be interpreted as the result of contextual features, and no inference should be made from it as to what underlying ability is tapped by the test item. ‘Performance8 testing’, which aims to make inferences more ‘directly’ about performance in non-test settings on the basis of test performance, takes this approach. The test design problem here, therefore, is constructing a test with characteristics as similar as possible to the non-test setting.

The last type, the ‘interactionist’ approach, can be placed midway between the two approaches above. It sees performance as the result of traits, contextual features, and their interaction. Such an approach to construct definition includes both a cognitive skill or capacity and a domain where the capacity is relevant, such as ‘reading for academic purposes’ (Chapelle 1999:157). In other words, this construct definition suggests that a learner might be good at using the target language for some purposes but that this is not guaranteed for other purposes.

In 3.3.1.2, it was repeatedly emphasized that what is observable in a test is learners’ performance, so language ability cannot be observed without the intervention of a test instrument. For that reason, Chapelle’s (1998) categorization of a ‘trait-type’ approach toward construct definition seems very weak, if not invalid. For the present discussion, therefore, focus will be placed on the difference between a ‘behaviorist’ approach and an ‘interactionist’ approach to defining the construct.

In line with Chapelle’s (1998) conceptualization of construct definition, Bachman (2002:456), in discussing validity concerns of task-based language performance assessment (TBLPA), introduces two approaches toward defining the construct: ‘ability-based’ and ‘task-based’ approaches. To be precise, these approaches are discussed in terms of how a performance assessment is developed; however, what is actually central in his discussion is how the construct is approached, which, fundamentally, is a matter of construct definition. He distinguishes the two approaches by citing Norris et al. (1998):

a. in developing a performance assessment, focus either on constructs or on tasks:

i. Begin construct-based test development by focusing on the construct of interest and then develop tasks based on the performance attributes of the construct, score uses, scoring constraints, and so forth.

ii. Begin task-centered test development by deciding which performances are the desired ones. Then, score uses, scoring criteria, and so forth become part of the performance test itself. (Norris et al. 1998:25)

Furthermore, Figure 3-3 (Bachman 2002:457) illustrates well the difference between the two concepts.

8 The term ‘performance’ in ‘performance testing’ may be confused with ‘performance’ as discussed in 3.3.1. The distinction between the two is that the word ‘performance’ in performance testing is used in a narrower sense, in that it is assumed to be something indivisible and insusceptible to any effort to break it down into interpretive components, whereas ‘performance’ defined in relation to ‘competence’ is presumed to be an interaction of language competence, strategic competence, and the context, as was illustrated in 3.3.1.2.

[Figure 3-3: Different interpretations of response consistencies on language assessment tasks: (a) ‘ability-based’ inferences about language ability and (b) ‘task-based’ predictions about future performance on ‘real-world’ tasks (Bachman 2002:457). Recoverable labels: ‘Interpretation: has language ability’, ‘Domain of TLU tasks’, ‘Performance consistency’, ‘Language ability’, ‘Assessment tasks and context’, ‘Interpretation: can do “real-life” tasks’.]


In discussing content validity in 3.2, it was stipulated that, in a general proficiency test, where the whole of the language is the target domain, ‘content’ becomes the ‘construct’, in that a representative sample of the domain of language ability to be measured is not directly observable. However, when the two approaches of Chapelle (1998) and Bachman (2002) (they describe the same notion in essence) are considered, there is indeed a difference between the two. It appears that, in the competence-based ‘interactionist’ approach, or (a) in Bachman’s (2002) model, one must consider both constructs and test items, while in the content-based ‘behaviorist’ approach, or (b), one considers only performances on test items. The former approach maintains that the process of designing, developing and using language tests should incorporate both specifying the test items to be included and defining the abilities to be measured (i.e. the construct) (Bachman and Palmer 1996; Brown 1996; Alderson 2000; Douglas 2000). The latter approach requires no more than defining the tasks embedded in the context (i.e. the content).

This distinction between the two approaches to construct definition is also debated in Hudson (2005) and Norris and Ortega (2003). Reflecting upon criterion-referenced language assessment, these authors discuss the complexity of language use, the complexity of assessing language ability, and the difficulty of interpreting potential interactions between a task and its difficulty, considerations that are indispensable yet difficult to implement. Hudson (2005:205) describes the views on this issue as follows: “They reflect a current appetite for language assessment anchored in the world of functions and events, but also must address how the worlds of functions and events contain non skill-specific and discretely hierarchical variability.” At the same time, he stipulates that further research is required to investigate the relationships between the “task-dependent” view and the “task-independent” view of the construct. Norris and Ortega (2003:729) offer the insight that these conflicting views trace back to the differing paradigms from which the motivations of research originate, and observe them as bearing “witness to the fact that construct definitions are available”, thus encouraging a shift toward a further examination of the conceptual bases of the ‘measurement’ aspect.


The distinction between the two approaches to construct definition is very similar to what can be perceived as the difference between the framework of ‘Common Reference Levels of Language Proficiency’, or, more popularly, the ‘Common European Framework of Reference for Languages’ (CEFR) (Council for Cultural Co-operation Education Committee, Modern Language Division 2001:24-29), and the ALTE ‘Can Do’ statements (Association of Language Testers in Europe 2002).

The CEFR was developed with the intention of providing a common basis for the elaboration of language syllabi, curriculum guidelines, examinations, textbooks, etc. across Europe, where many people with different first languages emigrate or immigrate to places where they are required to learn a new language. It is a comprehensive description of what language learners have to learn to do in order to use a language for communication. In particular, the proficiency descriptors define levels of proficiency which allow learners’ progress to be measured at each stage of learning and on a life-long basis. As stipulated by the developers of the CEFR, the levels are to be used as a ‘grid which users can exploit to describe their system’ (Council of Europe 2001:21), that is, as a scale of reference levels. They further note that the descriptors should be ‘context-free’ in order to accommodate generalizable results from different specific contexts, and be ‘based on theories’ of language competence (Council of Europe 2001:21). This is exactly the approach taken by the competence-based model of construct definition. In other words, the ‘framework’ approach that these models and descriptors take assumes a situation in which the inference to be made about learners’ future performance will be acquired analytically as an interaction of their competence and the context.

On the other hand, the ALTE ‘Can Do’ statements take a different approach: one that focuses on the content of what is to be measured. The ALTE ‘Can Do’ statements are an application of the CEFR descriptors made with the aim of developing and validating a set of performance-related scales, describing what learners can actually do in the foreign language (see Appendix D of Council of Europe 2001 for details). In their original conception, they were made to be user-oriented, to provide interpretations of test results that can be easily understood by non-specialists. As stated in Council of Europe (2001:244-245), they are supposed to be a ‘tool’ for providing easily understandable ‘descriptions of performance’ which can be used in ‘specifying requirements to language trainers, formulating job descriptions, specifying language requirements for new posts’. The whole list consists of 400 statements organized into three areas according to their applicable contexts (e.g. Social and Tourist, Work, and Study). It is clear from these descriptions that the ALTE ‘Can Do’ statements are made with the content-based approach to describing what is to be inferred from the test, in order for them to be easily understood by non-specialists. Their main concern is defining the performance holistically ‘in the context’ and illustrating what is to be measured in the test (or described in the list, in the case of the ALTE ‘Can Do’ statements) in a way that describes a representative sample of the future performance expected of learners.

3.3.2.2 Importance of a framework

Both of the approaches described above are valid ways of defining a construct for language ability measurement. The difference lies in their purposes of use, as was apparent in the difference between the CEFR descriptors, which were developed for experts to apply to various circumstances, and the ALTE ‘Can Do’ statements, which were contextualized for non-specialists to use as an easy reference in concrete contexts. Hence, as Norris and Ortega (2003:729) conclude, the choice traces itself back to the differing paradigms from which the motivations of research originate.

In the present research, the prime interest is in investigating how a reading product, or performance, can be elicited by a test item so that some interpretations and generalizations can be made about a test taker’s ability. The objective in language testing is making ‘inferences’ or ‘predictions’ about what a learner ‘may’ be able to do in real-life situations. So far as ‘measuring’ ability is concerned, even if the situation calls for a content-based evaluation of learners’ performance, test developers need a ‘theoretical framework’ that they can work with to conceptualize what it is that we are measuring. In this respect, the competence-based approach to defining the construct seems the more appropriate one for the present research, and that is what this study is going to inquire into.

3.3.3 Considering “Speed” as a Construct

A distinction is often made between speed tests and power tests. Speed tests employ content of a sufficiently low difficulty level that the majority of the people for whom the tests are intended would be expected to perform perfectly if given a sufficient amount of time; since they are not given that time, the rate of response is of primary importance in determining success. Power tests, on the other hand, allow enough time for responding, so that nearly all test takers may attempt every item; but because the items bear such a high difficulty level, their knowledge level, or “power”, becomes the determinant of success in completing the test (Henning 1987:196).

Most tests fall somewhere between the two extremes: knowledge rather than speed is the primary focus, but time limits are enforced, since weaker students may take an unreasonable length of time to finish (Henning 1987:8). Most test designers, experimentally or intuitively, time their tests to allow roughly 90% of test takers to complete them in time, but do not consider their tests to be speeded (Alderson 2000:150). Although the distinction between speed tests and power tests is often considered just a difference in the focus intended in implementing or designing the test, since most power tests in practice are timed, with their results influenced by test takers’ speed of processing and production, “speed” could, and should, be regarded as an important variable that constitutes language ability. In fact, the results from Hirai (1999) suggest that a correlative relationship can be found between the scores of Japanese EFL learners on a cloze test and their reading speeds as well as their listening speeds. Furthermore, Shizuka (2000), in his study on the validity of incorporating reading speed and response confidence in measuring EFL reading proficiency, concluded that reading speed was a valid element in demonstrating test takers’ reading ability. In the same manner, Naganuma and Wada (2002) experimentally demonstrated that test takers’ “reading speed” had a certain relationship with their ability levels. On many occasions, “power” elements such as those introduced in 3.3.2 are investigated as possible factors that constitute the reading performance of test takers. However, in a situation where no “true” power test can exist, the speed at which test takers process and perform test items should also be considered as a factor that constitutes their reading ability.
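As a minimal illustration of how speed might be handled as a measurable variable alongside comprehension scores, the following sketch (with invented passage length, reading times, and scores) computes reading speed in words per minute and its correlation with test scores; none of the figures come from the studies cited.

```python
# Hypothetical sketch: treating reading speed as a variable alongside scores.
# Passage length, reading times and scores below are invented for illustration.

passage_words = 600
reading_times_sec = [240, 310, 180, 400, 270, 350, 200, 290]   # per test taker
comprehension_scores = [34, 28, 38, 22, 30, 25, 36, 29]        # out of 40

wpm = [passage_words / (t / 60) for t in reading_times_sec]    # words per minute

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print("Reading speeds (wpm):", [round(v) for v in wpm])
print(f"Correlation between reading speed and score: {pearson_r(wpm, comprehension_scores):.2f}")
# A positive correlation of this kind is what motivates regarding "speed"
# as a component of the reading construct rather than a mere testing condition.
```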

3.4 Lining Test Items in Sequence

With the interest in inquiring into how a reading performance can be accounted for with respect to a learner’s reading ability, it is crucial that a test item be approached from the perspective of ‘measurement.’ If the different performances of reading elicited by different test items were to be termed the ‘qualitative’ perspective on a test item, the ‘quantitative’ perspective would be the difficulty assigned to those performances. In this section, the operationalization of test items with regard to item difficulty will be discussed.

3.4.1 Specifying Difficulty

In an attempt to cast light on the quantitative side of a test item, it is essential that this be done from the perspective of item specifications. In particular, when language use is observed in settings that are more realistic and complex, focusing on its ‘product’ aspect, it is vital that the person constructing the item demonstrate an explicit outline of the trait to be measured, how that trait will be realized in the performance elicited, how that performance will be elicited, and how that performance will be quantified to provide an index of the test taker’s ability. Mislevy and Almond (2002:478) stress that “a systematic means for designing performance assessments that will directly and adequately inform the particular kinds and qualities of inferences that need to be made” is vital, before proposing their influential framework, the Evidence Centered Design framework (ECD framework).

Davies et al. (1999:207) describe test specifications as a “document which sets out what a test is designed to measure and how it will be tested.” Such a document provides “a blueprint” for item writers and is “important in the establishment of the test’s construct validity”. In essence, this explanation could be applied to item specifications as well.
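As a minimal sketch of what such an item specification might look like when each element of the intended performance is made explicit, the following uses hypothetical fields and values; it is not a specification used in the present research.

```python
# Illustrative sketch only: one way an item specification ("blueprint") for a
# reading item might be represented so that each element of the intended
# performance is explicit. The fields and values are hypothetical.

from dataclasses import dataclass, field

@dataclass
class ReadingItemSpec:
    construct: str                 # trait the item is meant to tap
    question_type: str             # e.g. "local-literal", "global-inferential"
    text_genre: str                # genre of the input passage
    text_length_words: int         # length of the input passage
    response_format: str           # how the performance is elicited
    scoring: str                   # how the performance is quantified
    notes: list[str] = field(default_factory=list)

spec = ReadingItemSpec(
    construct="reading comprehension (text model)",
    question_type="local-literal",
    text_genre="expository",
    text_length_words=350,
    response_format="four-option multiple choice",
    scoring="dichotomous (1 = correct, 0 = incorrect)",
    notes=["answer recoverable from a single explicitly stated proposition"],
)
print(spec)
```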

With respect to the quantification of test items, since a test taker’s performance is elicited by a test item, it can be presumed that each element that constitutes the item specification reflects a trait tapped by the test item. Thus, the item difficulty of the test item should be interpreted as quantifying how difficult it is to succeed in fulfilling all those elements that constitute the performance in aggregation (Mislevy and Almond 2002). In other words, what can be inferred from the item difficulty indicated is the difficulty of the task as a whole.

However, in developing test items under the competence approach, the interest lies in the difficulty of each element rather than that of the whole test item, since the aim is accountability for ‘why’ or ‘how’ the item has come to possess the difficulty indicated. If the difficulty of each of the elements that constitute an item could be specified, then the prediction of difficulty for test items, and thus their quantification, would become possible. To this end, the search into the difficulty of each ‘element’ becomes vital.
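To illustrate the logic of predicting item difficulty from element difficulties, the following sketch assumes a simple additive model (in the spirit of linear logistic modelling of item difficulty); the element weights and item compositions are entirely hypothetical and are not drawn from the present study.

```python
# Hypothetical sketch of the idea that an item's difficulty can be decomposed
# into the difficulties of the elements it requires. Weights and item
# compositions are invented; a simple additive (LLTM-style) combination is
# assumed purely for illustration.

element_difficulty = {              # higher = harder (arbitrary logit-like units)
    "local-literal": -0.8,
    "local-inferential": 0.2,
    "global-inferential": 1.0,
    "low-frequency vocabulary": 0.5,
}

items = {
    "item_01": ["local-literal"],
    "item_02": ["local-inferential", "low-frequency vocabulary"],
    "item_03": ["global-inferential"],
}

def predicted_difficulty(elements):
    """Predicted item difficulty as the sum of its element difficulties."""
    return sum(element_difficulty[e] for e in elements)

for item, elements in items.items():
    print(item, round(predicted_difficulty(elements), 2))
```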

3.4.2 Seeking a Link between Question Type and Item Difficulty

Among the studies that investigate the measurement of particular competences in reading ability, a few can be found that focus on the element of “type of reading”. Although there is some research investigating relationships between item difficulty and the type of reading tapped by an item (e.g. Tal et al. 1994; North 2000; Oded and Walters 2001; Trites and McGroarty 2005; Gomez et al. 2007), few such attempts have been made in Japan.

As was discussed in Chapter 2 of the present thesis with respect to how the reading product could be illustrated as a construct, the ‘two-dimensional approach’ to reading ability derived from the factor analytic studies in Negishi (1996) and Wada (2003) may hold a potential key. Wada (2003), inspired by Negishi (1996), observed in a factor analytic study of reading tests given to EFL learners in Japan that reading ability could be broken down into components described by the ‘local/global comprehension’ dimension and the ‘literal/inferential comprehension’ dimension, suggesting the validity of ‘question types9’ that elicit ‘local-literal’, ‘local-inferential’ and ‘global-inferential’ types of reading.10

Wada (2003), on the basis of her empirical findings, illustrated that the ‘local-literal’ question type is a type of test item that can be answered by returning to a small unit of information in the passage and understanding what is explicitly stated there. The ‘local-inferential’ question type is a type of test item that can be answered by referring to a small unit of information in the passage and understanding, by making inferences, what is implicitly stated there. The ‘global-inferential’ question type is a type of test item that can be answered by referring to a large unit of information in the passage and making inferences in order to arrive at the macropropositional idea of the text.

So far as ‘measuring’ ability is concerned, constructing test items demands a ‘theoretical framework’ which embodies a ‘developmental sequence’ and accounts for how and why certain items are perceived by test takers to be more difficult than others. Such a framework would provide a ‘construct’ that delineates the relationship between the elements of reading performances, such as ‘question types’, and test takers’ reading abilities.

As previously referred to, North (2000) made a substantial effort in developing the CEFR and a scale to describe language proficiency. The same sort of approach was taken by Gomez et al. (2007), who conducted a “scale-anchoring study”, an attempt to create descriptors that account for the reading performances of test takers at different levels of English proficiency based on both empirical data and judgments by test developers. Gomez et al. (2007) succeeded in creating descriptors that encapsulate what test takers at a given proficiency level are able to do in carrying out reading tasks, and illustrated how this latent ability structure alters in accordance with test takers’ proficiency levels. The study noted that there were variances in what test takers could do in coming to an answer when solving reading test items.

9 The term ‘question type’ in the language testing context may sometimes indicate the format of test items (i.e. multiple-choice question, true-false question). However, here it indicates the type of a test item with reference to what type of reading (i.e. local-literal) it elicits.

10 Although there were two dimensions (local/global and literal/inferential) originally assumed in the study, Wada (2003) could not extract the fourth type of comprehension, ‘global-literal’ comprehension, in her factor analytic study, concluding that only the ‘global-inferential’, ‘local-inferential’, and ‘local-literal’ types of comprehension are valid to assume as ‘question types’.

For example, in reporting the results of their findings, Gomez et al. (2007) described:

Performance at the Low level  The descriptors that emerged from our analyses state that test takers at the Low performance level “have difficulty identifying the author’s purpose except when that purpose is explicitly stated in the text or easy to infer from the text.” This statement implies that Low-level test takers are able to identify the author’s purpose when it is explicitly stated or easy to infer from the text... (p. 430)

This could be compared with their descriptions for other ability groups responding to the same test item:

Performance at the Intermediate level  The descriptors that emerged from our analyses state that test takers at the Intermediate performance level “can recognize the expository organization of a text and the role that specific information serves within a larger text but have some difficulty when these are not explicit or easy to infer from the text,” while test takers at the High level can recognize text organization and the role served by specific information “even when the text is conceptually dense.” (p. 431)

Performance at the High level  ...Faced with this degree of conceptual density, most test takers at the Intermediate level were unable to infer the author’s purpose correctly, whereas most of those at the High level were able to do so. (p. 432)


The fact that this alteration in the latent ability structure was revealed in Gomez et al. (2007) gives the present research a positive prospect for further investigating the relationship between the ‘question types’ suggested by Wada (2003) and their perceived difficulty among different ability groups, in order to construct a sequence which reflects the ELT environment in Japan. With the intention of investigating how a reading test item could be constructed to elicit the types of reading performance described by Wada’s (2003) ‘question types’ with regard to test takers’ latent ability structure, seeking a link between these ‘question types’ and their item difficulty seems to be a requisite. Although Wada (2003) explored the qualitative aspects of ‘question types’ and their validity as variables in item construction to some extent, how these ‘question types’ could be linked with test takers’ latent ability structure was not considered. Would a test item be perceived to render the same ‘question type’ across test takers of different ability levels, or would it be perceived differently according to their ability levels? Would test items with different question types have a generalizable order in their difficulties, and, if so, how could they be ranked? To investigate how this notion of ‘question types’ could be implemented as a variable in constructing a test item to “measure” a test taker’s reading ability, a way to quantify this aspect of ‘question types’ in conjunction with test takers’ latent ability structure is needed.
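A minimal sketch, with invented response data, of one way such a link could begin to be examined is to compare item facility (proportion correct) by question type within each ability group; the grouping and figures below are hypothetical and are not taken from Wada (2003) or Gomez et al. (2007).

```python
# Hypothetical sketch: compare item facility (proportion correct) by question
# type within each ability group. All response data are invented.

from collections import defaultdict

# (ability_group, question_type, correct) for individual item responses
responses = [
    ("low", "local-literal", 1), ("low", "local-literal", 1),
    ("low", "local-inferential", 0), ("low", "global-inferential", 0),
    ("mid", "local-literal", 1), ("mid", "local-inferential", 1),
    ("mid", "global-inferential", 0), ("mid", "global-inferential", 1),
    ("high", "local-literal", 1), ("high", "local-inferential", 1),
    ("high", "global-inferential", 1), ("high", "global-inferential", 1),
]

totals = defaultdict(lambda: [0, 0])     # (group, type) -> [correct, attempted]
for group, qtype, correct in responses:
    totals[(group, qtype)][0] += correct
    totals[(group, qtype)][1] += 1

for (group, qtype), (right, n) in sorted(totals.items()):
    print(f"{group:>4} | {qtype:<18} | facility = {right / n:.2f}")
# If facility ordered itself consistently (local-literal easiest, then
# local-inferential, then global-inferential) within every group, that would
# support a generalizable difficulty ordering of the question types.
```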
