CHAPTER 3 TESTING READING PERFORMANCE
Since the focus of this study is on the testing of reading comprehension, this
chapter turns to the testing perspective of the present research. In discussing
language testing, two things should be considered: ‘what it is that we are trying
to measure’ and ‘how we are going to measure it’. The purpose of this chapter is
to explore these two questions in depth, which will hopefully lead us to
conceptualize the role and nature of reading tests in the measurement of reading
ability.
3.1 Test as an Instrument in Measuring Language Ability
A test, according to the second edition of the Dictionary of Language Teaching and
Applied Linguistics (Richards, Platt and Platt 1992: 377), is ‘any procedure for
measuring ability, knowledge, or performance.’ In other words, a test can be
interpreted as an instrument that we use to gauge how much ability (or
knowledge, or performance) a learner has, just as we use a ruler to find out how
long a piece of cloth is. Furthermore, when we hear the word ‘test’, what comes to
mind is a set of test questions, and the number of questions we answer
correctly determines our total score on the ‘test’.
I have given the definition in a general sense because it is often
difficult to find a simple and explicit definition of what a test is in the language
testing literature (e.g. Henning 1987; Bachman 1990; McNamara 1996; Urquhart and Weir
1998; Alderson 2000), as there is much to be considered in describing the nature of a
test. However, what many of these works do indicate unequivocally in discussing what
a test is, is that the measurement obtained by administering a test involves
measurement error, and that what is determined about the ability of test takers
through the use of a test is no more than an inference made from their performance
on it. Johnston (1983: 53-54) expresses his concern about test methods as a crucial
factor for accurate measurement, especially in testing reading:
東京外国語大学 博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)
...since reading comprehension is a mental activity, it is only available for indirect,
second-hand scrutiny. We can never actually watch the mental operations, but must
infer them from other sources of data. In making these inferences, we must be very
clear about the grounds we have for doing so. In order to be so informed, we should
understand (as clearly as our data and theory will allow) the actual demands and
assumptions involved in our assessment techniques.
In other words, in trying to understand the nature of a test, the central concern,
whether in developing a test or in using it to measure language ability, is
whether, or to what extent, the test measures what it purports to measure.
In this regard, the challenge for language testing research is
how, and to what extent, we can ensure that a particular test elicits
a representative sample of the language ability we wish to measure.
3.2 Test Validity
In understanding what makes a ‘good’ test, we have to consider two things. One
is test validity: whether and to what extent a test measures what it purports
to measure. For example, a paper-and-pencil ‘pronunciation’ test, or a writing test
that relies heavily on specialized background knowledge, would be very low in validity,
since the ability sample extracted via such a test is very different from what the test
is intended to measure. The other criterion to be considered in making a
good test is reliability: the extent to which a test is consistent in its measurement. A
common example of questionable reliability arises in a writing test
when two raters mark the same essay but give very different marks.
Likewise, when the same rater rates the same essay twice and gives a different
mark the second time, this is also problematic in terms of reliability.
The former is a problem of inter-rater reliability, while the
latter is a problem of intra-rater reliability.
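The two kinds of rater consistency can be illustrated with a small numerical sketch. The marks below are invented, and both the function name mean_abs_difference and the choice of the mean absolute difference as an index are assumptions made purely for illustration; they are not data or procedures from the present study. A smaller value indicates more consistent marking.

```python
def mean_abs_difference(marks_a, marks_b):
    """Average absolute gap between two sets of marks given to the same essays."""
    if len(marks_a) != len(marks_b) or not marks_a:
        raise ValueError("both mark lists must be non-empty and equally long")
    return sum(abs(a - b) for a, b in zip(marks_a, marks_b)) / len(marks_a)


# Hypothetical marks (0-10 scale) for the same five essays.
rater_1_first = [7, 5, 8, 6, 9]    # rater 1, first rating
rater_1_second = [7, 6, 8, 6, 8]   # rater 1, same essays rated again
rater_2 = [4, 8, 5, 9, 6]          # a second rater, same essays

# Intra-rater consistency: one rater compared with himself or herself.
intra_rater_gap = mean_abs_difference(rater_1_first, rater_1_second)

# Inter-rater consistency: two different raters compared with each other.
inter_rater_gap = mean_abs_difference(rater_1_first, rater_2)

print(f"intra-rater gap: {intra_rater_gap:.1f}")
print(f"inter-rater gap: {inter_rater_gap:.1f}")
```

In this invented case the second rater diverges from the first far more than the first rater diverges from his or her own earlier marks, so it is the inter-rater reliability that would be called into question.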
Comparing the two concepts, a common view is that, while reliability is a
necessary condition for a test to be valid, because test scores that are not reliable
cannot provide a basis for valid interpretation and use, reliability alone does not
guarantee test validity (Kobayashi 1995: 81-82). For example, in the case of the
paper-and-pencil ‘pronunciation’ test, it is perfectly possible to make a very
consistent test, but, despite its high reliability,
the test is very low in validity. Therefore, validity is ‘the most important
quality in the development, interpretation, and use of language tests’ (Bachman
1990: 289).
There are four types of validity in language testing: construct validity, content
validity, face validity, and concurrent or predictive validity (Bachman 1990; Heaton
1988; Hughes 2003; McNamara 2000). Concurrent or predictive validity is established
by comparing the results of a test with those of other measurements: the former with
other existing tests and the latter with the future performance of the testees, usually by
calculating correlations (Kobayashi 1995: 83). Such comparisons have little significance as
evidence of validity unless the measures with which the test is compared are
themselves established as valid (e.g. a well-established standardized test). However,
if the test against which the new test is validated is indeed considered to be valid, this
is a powerful means of test validation.
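As a minimal sketch of the correlational procedure mentioned above, the following computes a Pearson correlation coefficient between scores on a new test and scores on an established test taken by the same learners. The scores, the variable names and the helper pearson_r are hypothetical and serve only to illustrate the calculation; they do not come from the present study.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical scores of six learners on both instruments.
new_test = [12, 15, 9, 20, 17, 11]
established_test = [48, 55, 40, 72, 63, 45]

concurrent_validity = pearson_r(new_test, established_test)
print(f"correlation with the established test: {concurrent_validity:.2f}")
```

A coefficient close to 1 would count as evidence of concurrent validity only on the condition stated above, namely that the established test is itself accepted as valid.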
Face validity, according to Davies et al. (1999: 59), is the degree to which a test
appears to measure the knowledge or abilities it claims to measure, as judged by an
untrained observer (such as the candidate taking the test or the institution which plans
to administer it). Concerns for face validity are often dismissed as trivial because
they have to do with appearances rather than with the underlying construct of ability
being measured by the test, but it has also been argued that failure to take issues of
face validity into account may jeopardize the public credibility of a test.
Content validity can be defined as a parameter which concerns ‘whether the test
content consists of a representative sample of the domain of language ability to be
measured’ (Davies et al. 1999: 34). Some testing specialists make no distinction
between face validity and content validity in that they are both intuitive and logical
but usually lacking an empirical basis (Henning 1987: 94). Yet others do make a
distinction between the two, disregarding face validity as something that is
‘impressionistic’ compared to content validity, which employs more scientific
approaches in determining validity (e.g. Oller 1979, cited in Henning 1989: 96).
In the development of a performance test⁴, content validity is normally
achieved by means of a thorough needs analysis of the target domain, upon which the
test content is based (e.g. McNamara 1996). An achievement test⁵ seeks content
validity by drawing a representative item sample from the syllabus on which it is
based. For a general proficiency test, where the whole of the language is the target
domain, content then becomes the construct. This means that, in the present
research, where the test instrument to be used is a general proficiency test, the content
validity of the test is strongly related to its construct validity.
Construct validity, as described in many publications in language
testing research (e.g. Henning 1987; Bachman 1990), is concerned with the extent to
which a test is related to a theoretical construct of language ability. Construct
validation involves an investigation of the qualities that a test measures, thus
providing a basis for the rationale of a test. Therefore, when the construct validity
of a certain test is discussed, the term ‘validity’ is often used interchangeably to mean
⁴ Not all language tests are of the same kind. They differ with respect to test method and test purpose.
In terms of method, McNamara (2000: 5) distinguishes traditional paper-and-pencil language tests
from performance tests. According to his distinction, paper-and-pencil tests take the form of the
familiar examination question paper. On the other hand, in performance-based tests, language skills
are assessed in the form of actual performance (e.g. interview tests to assess speaking ability or essay
tests to assess writing ability). In recent practice, both of these test methods can be realized virtually
on computer (usually referred to as Computer-Based Testing or Computer Adaptive Testing). In that
respect, it may be better to refer to paper-and-pencil language tests by the term ‘ability’ testing. The
distinction between ‘ability testing’ and ‘performance testing’ is discussed in detail in 3.3.2 in relation
to construct definition.
⁵ In terms of test purpose, the most familiar distinction is made between achievement tests and
proficiency tests. [...] On the other hand, proficiency tests look to the future situation of language
use without necessarily making reference
to the previous process of teaching. The study done in the present research does not involve any
analysis of results from instruction. It is solely concerned with proficiency testing, though the findings
may be helpful in the construction of achievement tests.
construct validity. Furthermore, as may be clear from the previous discussion in
this section, construct validity is strongly related to the other three types of validity, if
it does not subsume them.
It seems, from the descriptions above, that construct validity, and thus content
validity in the present research, is of the utmost importance because it relates
essentially to the core of what is to be tested and how it is best measured.
Therefore, the discussion now turns to how the construct and content can be
constituted in investigating the nature of reading tests.
3.3 Construct Definition
A construct is the trait or traits that a test is intended to measure. It is defined as
‘an ability or set of abilities that will be reflected in test performance’ (Davies et al.
1999: 31), and from which ‘inferences can be drawn’ on the basis of test scores
(Chapelle 1999: 154). In other words, it is a meaningful and useful way of
interpreting test performance (Messick 1988). A construct is usually based on a
theory, so a test represents an operationalization of the theory on which it is
based. Therefore, a reading test is an operationalization of a reading construct
derived from theories of reading ability, which have just been discussed in Chapter 2
of this paper.
Thus, construct definition is a very important component of a test, serving to clarify
what is to be inferred about the ability of a test taker from his or her performance on the test
(from the perspective of a test constructor), or to understand it (from the perspective
of a test user). A well-defined construct is therefore essential
to keeping test validity high.
3.3.1 Conceptualizing Language Ability
Prior to the discussion of construct definition, it is essential that the theories
of language ability be reviewed because they are, by definition, what a construct is
established upon. Traditionally, language ability was thought to consist of modules
of linguistic knowledge. In other words, it was considered to be an
accumulation of knowledge of discrete elements of language. However, in the last few
decades, considerable progress has been made in establishing models of language
ability (e.g. Chomsky 1957, 1980), especially in regard to how the concept of
‘communication’ should be synthesized in defining what language ability is (e.g.
Hymes 1972; Canale and Swain 1983; Bachman 1990). This trend has made a great
contribution to the field of English Language Teaching, and approaches such as
the Communicative Approach in teaching and Communicative Testing have become very
influential in both pedagogy and research.
In the discussion of language ability, Chomsky’s distinction between
‘competence’ (the speaker-hearer’s knowledge of his language) and ‘performance’
(the actual use of language in concrete situations) was a significant milestone. In
considering language testing, the distinction between underlying ‘competence’ and
actual ‘performance’ is crucial because we need to sample actual language use, or
what can be directly observed and evaluated as a product. In other words, as many
studies suggest (e.g. Canale and Swain 1980), underlying competence can be assessed
only through its realization in performance. Thus, a further examination of how
competence and performance are related is necessary.
3.3.1.1 Defining competence
Inspired by Hymes’s (1972) notion of ‘communicative competence’, which
takes sociolinguistic elements into account as opposed to Chomsky’s ‘linguistic
competence’, various models of language ability which present the idea of language
ability consisting of grammatical knowledge and knowledge of use have been
introduced. Bachman’s (1990) model, shown in Figure 3-1, seems to be the most
comprehensive of all at present.
In the model, language competence is divided into organizational competence
(which includes grammatical and textual competence) and pragmatic competence
(which includes illocutionary and sociolinguistic competence).
[Figure 3-1: tree diagram. Language Competence branches into Organizational Competence (Grammatical Competence, Textual Competence) and Pragmatic Competence (Illocutionary Competence, Sociolinguistic Competence).]
Figure 3-1 Components of language competence (Bachman 1990: 87)
When the model is considered with regard to the theories of reading ability
discussed in the previous chapter, the way Bachman’s model distinguishes
between organizational competence and pragmatic competence seems to coincide
with the way some studies (e.g. Negishi 1996; Grabe 1999) distinguish
between ‘a text model of comprehension’ and ‘a situation model of comprehension’
(see 2.2.2). The concepts of organizational competence and text modeling are both
established in the linguistic dimension of language ability, whereas the concepts of
pragmatic competence and situation modeling are both enacted in its world
knowledge counterpart.
Furthermore, with regard to an empirically derived model of reading ability, the
three components of FL reading ability extracted in Negishi (1996) (see 2.3.2.1) fit
nicely into Bachman’s (1990) model: the ‘Linguistic Competence’ factor of Negishi (1996:
134) could be explained by grammatical competence in Bachman’s model, the ‘world
knowledge’ factor by sociolinguistic competence, and the ‘Reading Skills’ factor by
textual and illocutionary competence. The fact that Negishi’s (1996) ‘Reading
Skills’ factor is explained by two different components of competence that are
located in two different dimensions of language competence in Bachman’s (1990)
model is further explained by another empirically derived model, the
‘two-dimensional approach’ to the latent structure of reading ability (Negishi 1997;
Wada 2003) (see 2.3.2.1). The ‘local/global comprehension’ component, which is
attributed to the amount of information integrated (Wada 2003: 58), could be
explained by textual competence, and the ‘literal/inferential comprehension’
dimension, which is attributed to the amount of information processing (Bachman
and Palmer 1982; Wada 2003), could be explained by illocutionary competence.
Thus, the ‘two-dimensional approach’ to the latent structure of reading ability
(Negishi 1997; Wada 2003) could be considered a useful model that can work as
a construct to explain the reading skills part of reading ability, as was suggested in
2.4 of this paper.
3.3.1.2 Defining performance
Various attempts have been made to define a mechanism by which competence
and performance could be bridged. The inclusion of strategic⁶ competence in a
language ability model by Canale and Swain (1980) is one such attempt. However,
they treated this competence mainly as compensatory (i.e. an ability needed
when communication breaks down) and did not put much emphasis on it. Later, Canale
(1983) modified the earlier joint model, emphasizing the importance of strategic
competence as a more independent mechanism essential for successful
communication.
Canale’s idea of strategic competence was further developed by Bachman
(1990) in his theoretical framework of communicative language ability. Bachman
and Palmer (1996) further developed Bachman’s (1990) model and presented it as a
visual metaphor, shown in Figure 3-2.
As illustrated below, the framework of Bachman and Palmer (1996) views
language use (or performance) as interactions among areas of language ability
(composed of language knowledge and strategic competence; described in detail in
Figure 3-1), topical knowledge, and affective schemata, on the one hand, and how
these interact with the characteristics of the language use situation, or test task, on the
other. The figure also illustrates various interactions that are assumed to be involved
⁶ In Note #2 in 2.4, it was stated that the present study treats the term ‘strategy’ as equivalent to
what is meant by ‘skills’. Although the present author maintains this position and considers that what is
meant by strategic competence here is actually the ‘skills’ discussed in Chapter 2, the word ‘strategy’ is
used here because the studies cited in the present discussion used the term in their publications.
in language use. Bachman and Palmer (1996: 62) give a detailed explanation of
their model:
[Figure 3-2: diagram with the labels Topical knowledge, Language knowledge, Strategic competence, and Characteristics of the language use or test task and setting.]
Figure 3-2 Some components of language use and language test performance
(Bachman and Palmer 1996: 63)
The components that are within the smaller, bold circle (‘topical knowledge’,
‘language knowledge’, ‘personal characteristics’, ‘strategic competence’ and
‘affect’)represent characteristics of individual language users, while the outer
circle includes characteristics in the task or setting with which the language use
interacts. The double-headed arrows indicate interactions. The figure
indicates that strategic competence is the component that links other
components within the individual, as well as providing the cognitive link with
the characteristics of the language use task and setting.
What is to be noted here is the interaction illustrated in the model between strategic
competence and the characteristics of the language use situation, or test task.
Bachman (1990: 84) defines strategic competence as ‘the mental
capacity for implementing the components of language competence in contextualized
communicative language use,’ and this definition can also be applied to Bachman and
Palmer’s (1996) model.
When the distinction that was made between ‘competence’ and ‘performance’
is revisited, it is ‘the speaker-hearer’s knowledge of his language’ versus ‘the actual
use of language in concrete situations.’ In other words, performance can be defined as
the result of a learner’s language competence (or the language knowledge and topical
knowledge components in the model) being put forth in a context (the characteristics of the
language use or test task) via his or her capacity to actually put it to use (or strategic
competence). It is essentially what comes out of an interaction of language
competence, strategic competence, and the context. Thus, what is elicited by a test
item in a reading test is a reading performance (the product) embodied by language
competence along with other underlying competences.
Many of the studies that draw upon this model put much focus on the strategic
competence component (Alderson 2000: 332). However, as far as the testing of reading
ability is concerned, too much focus on strategies (or skills) is dangerous, since the
process by which strategic competence influences language competence cannot be
observed directly (Phakiti 2008) and may lead to the confusion that we have seen in
identifiability studies of reading subskills in 2.3.2.2. The interest in strategies comes
in part from an interest in characterizing the process of reading rather than the product
of reading (Alderson 2000: 307), and that, as previously stated in the present thesis, is
beyond the scope of this study.
Conversely, if we turn our attention to the component of characteristics of the
language use or test task (the context aspect), an implication can be found for
investigating the nature of a reading test. That is, various features of a test, or
‘facets’ of test methods (Bachman 1990: 115)⁷, could be assumed to make up the
context component as factors that affect learners’ performance. Thus, the inference
that we make from learners’ performance on a test about their ability encompasses the
nature of that particular test.
To elaborate on this in the context of a reading test, what we observe as test
takers’ performance on a reading test is the result (or the ‘product’) of their reading
competence implemented in the context (of which the reading test comprises a part)
in conjunction with their strategic competence. Therefore, the facets of a reading
test (as a part of the context) influence how reading competence is contextualized
as a reading performance. In this regard, in Negishi (1996) and Wada (2003), as
was pointed out in 3.3.1.1, the factors that were extracted from students’ reading
performance on a test were closely related to the components that seem to
compose reading competence. This may have been because the validity of the test
instruments employed in these studies was high (implying that there was
little test method effect in the tests), allowing the latent structure of reading
competence to be readily observable among the other factors that constitute the
performance. Therefore, for reading competence to be properly contextualized in
reading performances, it is essential that we ‘delineate’ (Bachman 1990: 115) the
nature (or the facets) of a reading test so as not to distort learners’ ‘comprehension’.
At the same time, it is also vital that a theoretical view of what ‘comprehension’ (or
the construct) is be duly operationalized (or defined) in developing a test item. Therefore,
the discussion now turns to the different approaches taken toward defining constructs and
the ways in which test takers’ performances are interpreted in making inferences about
their language ability, in order to determine how a test item should be developed.
⁷ Bachman’s (1990) framework of test method facets consists of five major categories:
1) testing environment
2) test rubric
3) the nature of input the test taker receives
4) the nature of the expected response to that input
5) the relationship between input and response (see Bachman 1990:111-159)
3.3.2 Approaches toward Construct Definition
Researchers and/or teachers use tests to elicit learners’ performance and make
inferences about their language ability on the basis of what they observe from that
performance. For example, an inference is made about test takers’ ‘reading
comprehension’ on the basis of their responses to questions on a reading
comprehension test. The term ‘inference’ is used to indicate that the test result is not
itself the object of interest to test users (researchers and teachers). Instead, test users
want to know what a test taker (learner) might be expected to be capable of in
non-test settings. The kind of language ability test users want to observe from test
takers’ test performance, that is, the construct, is conceived, and thus defined, in different
ways depending on whether one takes an ‘ability’ or a ‘performance’ orientation to
testing.
3.3.2.1 Two approaches: construct or content?
Messick (1989: 15) defines construct in a very strict sense by saying that it is ‘a
relatively stable characteristic of a person -- an attribute, enduring process, or
disposition -- which is consistently manifested to some degree when relevant, despite
considerable variation in the range of settings and circumstances.’ Inspired by this
notion, Chapelle (1998: 34), in her discussion of validity studies with regard to
performance assessment in second language acquisition research, categorizes
construct definition into three types in terms of how constructs
are defined.
‘Trait theorists’ define constructs in terms of the knowledge and fundamental
processes of the test taker. Therefore, their approach, also called the ‘trait-type’ or
‘trait-oriented’ approach in Chapelle (1999: 156-157), would be to interpret the test
performance as evidence of underlying processes or structures, which are also
responsible for performance in non-test settings. Thus, the focal problem in test
design is to assess accurately the ability of interest rather than other things. This is
the approach taken in ‘ability testing’.
On the other hand, ‘behaviorists’ define constructs with reference to the
environmental conditions under which performance is observed. Therefore, in their
approach, the performance elicited by the test item should be interpreted as the result
of contextual features, and no inference should be made from it as to what underlying ability
is tapped by the test item. ‘Performance⁸ testing’, which aims to make
inferences more ‘directly’ about performance in non-test settings on the basis of test
performance, takes this approach. The test design problem here, therefore, is
constructing a test with characteristics as similar as possible to the non-test setting.
The last type, the ‘interactionist’ approach, can be placed midway between
the two approaches above. It sees performance as the result of traits, contextual
features, and their interaction. Such an approach to construct definition includes
both a cognitive skill or capacity and a domain where the capacity is relevant, such as
‘reading for academic purposes’ (Chapelle 1999: 157). In other words, this
construct definition suggests that a learner might be good at using the target language
for some purposes, but that this is not guaranteed for other purposes.
In 3.3.1.2, it was repeatedly emphasized that what is observable in a test is
learners’ performance, so language ability cannot be observed without the
intervention of a test instrument. Chapelle’s (1998)
‘trait-type’ approach toward construct definition therefore seems very weak, if not invalid.
Accordingly, for the present discussion, the focus will be on the difference between a
‘behaviorist’ approach and an ‘interactionist’ approach to defining construct.
In line with Chapelle’s (1998) conceptualization of construct definition,
Bachman (2002: 456), in discussing validity concerns of task-based language
performance assessment (TBLPA), introduces two approaches toward defining
construct: ‘ability-based’ and ‘task-based’ approaches. To be precise, these
approaches are discussed in terms of how a performance assessment is developed;
however, what seems central in his discussion is how construct is
⁸ The term ‘performance’ in ‘performance testing’ may be confused with the ‘performance’ discussed in
3.3.1. The distinction between the two is that the word ‘performance’ in performance testing is used
in a narrower sense, in that it is assumed to be something indivisible and insusceptible to any effort
to break it down into interpretive components, whereas ‘performance’ defined in relation to
‘competence’ is presumed to be an interaction of language competence, strategic competence, and the
context, as was illustrated in 3.3.1.2.
approached, which, fundamentally, is about construct definition. He distinguishes
the two approaches by citing Norris et al. (1998):
a. in developing a performance assessment, focus either on constructs or on tasks:
i. Begin construct-based test development by focusing on the construct of interest
and then develop tasks based on the performance attributes of the construct, score
uses, scoring constraints, and so forth.
ii. Begin task-centered test development by deciding which performances are the
desired ones. Then, score uses, scoring criteria, and so forth become part of the
performance test itself. (Norris et al. 1998: 25)
Furthermore, Figure 3-3 (Bachman 2002: 457) illustrates well the difference between
the two concepts.
[Figure 3-3: two panels. (a) Interpretation: ‘Has language ability’; (b) Interpretation: ‘Can do “real-life” tasks’. Other labels: Domain of TLU tasks, Performance consistency, Assessment tasks and context, Language ability.]
Figure 3-3 Different interpretations of response consistencies on language
assessment tasks: (a) ‘Ability-based’ inferences about language ability and (b)
‘Task-based’ predictions about future performance as ‘real-world’ tasks
In discussing content validity in 3.2, it was stipulated that, in a general
proficiency test, where the whole of the language is the target domain, ‘content’
becomes the ‘construct’ in that a representative sample of the domain of language
ability to be measured is not directly observable. However, when the two approaches described by
Chapelle (1998) and Bachman (2002) (which describe essentially the same notion) are
considered, there is indeed a difference between the two. It appears that, in the
competence-based ‘interactionist’ approach, or (a) in Bachman’s (2002) model, one
must consider both constructs and test items, while in the content-based ‘behaviorist’
approach, or (b), one considers only performances on test items. The former
approach maintains that the process of designing, developing and using language tests
should incorporate both specifying the test items to be included and defining the
abilities to be measured (i.e. construct) (Bachman and Palmer 1996; Brown 1996;
Alderson 2000; Douglas 2000). The latter approach goes only so far as defining
the tasks embedded in the context (i.e. content).
This distinction between the two approaches in construct definition is also
debated in Hudson (2005) and Norris and Ortega (2003). Reflecting upon
criterion-referenced language assessment, these works discuss the complexity of language use, the
complexity of assessing language ability, and the difficulty of interpreting potential
interactions between a task and its difficulty, which are indispensable yet difficult to
implement. Hudson (2005: 205) describes the views on this issue as follows: “They
reflect a current appetite for language assessment anchored in the world of functions
and events, but also must address how the worlds of functions and events contain non
skill-specific and discretely hierarchical variability.” At the same time, he stipulates
that further research is required to investigate the relationships between the
“task-dependent” view and the “task-independent” view of the construct. Norris and
Ortega (2003: 729) offer the insight that these conflicting views can be traced to the differing
paradigms from which the motivations for research originate, and regard them as
bearing “witness to the fact that construct definitions are available”, thus encouraging
a shift to a further examination of the conceptual bases of the ‘measurement’ aspect.
The distinction between the two approaches in construct definition is very
similar to what can be perceived as the difference between the framework of
‘Common Reference Levels of Language Proficiency’, or more popularly, the ‘Common
European Framework of Reference for Languages’ (CEFR) (Council for Cultural
Co-operation Education Committee, Modern Language Division 2001: 24-29) and the
ALTE ‘Can Do’ statements (Association of Language Testers in Europe 2002).
CEFR was developed with the intention of providing a common basis for the
elaboration of language syllabi, curriculum guidelines, examinations, textbooks, etc.
across Europe, where many people with different first languages emigrate or
immigrate to places where they are required to learn a new language. It is
a comprehensive description of what language learners have to learn to do in order to
use a language for communication. In particular, the proficiency descriptors define
levels of proficiency which allow learners’ progress to be measured at each stage of
learning and on a life-long basis. As stipulated by the developers of CEFR, they
are to be used as a ‘grid which users can exploit to describe their system’ (Council of
Europe 2001: 21) as a scale of reference levels. They further note that they should
be ‘context-free’ in order to accommodate generalizable results from different
specific contexts and be ‘based on theories’ of language competence (Council of
Europe 2001: 21). This is exactly the approach taken by the competence-based
model of construct definition. In other words, the ‘framework’ approach that these
models and descriptors take assumes a situation in which the inference to be made
about learners’ future performance will be arrived at analytically, as an interaction of
their competence and the context.
On the other hand, the ALTE ‘Can Do’ statements take a different approach: an
approach that focuses on the content of what is to be measured. The ALTE ‘Can
Do’ statements are an application of CEFR descriptors made with the aim of
developing and validating a set of performance-related scales describing what learners
can actually do in the foreign language (see Appendix D of Council of Europe 2001
for details). In their original conception, they were made to be user-orientated, to
provide interpretations of test results that can be easily understood by
non-specialists. As stated in Council of Europe (2001:244-245), they are
supposed to be a ‘tool’ for providing easily understandable ‘descriptions of
performance’ which can be used in ‘specifying requirements to language trainers,
formulating job descriptions, specifying language requirements for new posts’.
The whole list consists of 400 statements that are organized into three areas
according to their applicable contexts (Social and Tourist, Work, and Study). It
is clear from these descriptions that the ALTE ‘Can Do’ statements follow
the content-based approach to describing what is to be inferred from the test, so
that they can be easily understood by non-specialists. Their main concern is to define
the performance holistically ‘in the context’ and to illustrate what is to be measured
in the test (or described in the list, in the case of the ALTE ‘Can Do’ statements) in
such a way that they describe a representative sample of the future performance
expected of learners.
3.3.2.2 Importance of a ‘Framework’
Both of the approaches described above are valid ways of defining the
construct for language ability measurement. The difference lies in their purposes of
use, as was apparent in the difference between the CEFR descriptors, which were
developed for experts to apply to various circumstances, and the ALTE ‘Can Do’
statements, which were contextualized for non-specialists to use as an easy
reference in concrete contexts. Hence, as Norris and Ortega (2003:729) conclude,
the choice traces back to the differing paradigms in which the motivations of research
originate.
In the present research, the prime interest is in investigating how a reading
product, or performance, could be elicited by a test item so that some interpretations
and generalizations could be made about a test taker’s ability. The objective in
language testing is to make ‘inferences’ or ‘predictions’ about what a learner ‘may’ be
able to do in real-life situations. So far as ‘measuring’ ability is concerned, even
if the situation calls for content-based evaluation of learners’ performance, test
developers need to have a ‘theoretical framework’ that they can work with to
conceptualize what it is that they are measuring. In this respect, in the present
research, the competence-based approach to defining the construct appears to prevail,
and that is the approach this study inquires into.
3.3.3 Considering “Speed” as a Construct
A distinction is often made between speed tests and power tests. Speed tests
are tests which employ content of a sufficiently low difficulty level that the majority
of people for whom the tests are intended would be expected to perform perfectly
if they were given a sufficient amount of time; since they are not, the rate of
response is of primary importance in determining success. On the other hand, power
tests allow enough time for responding, so that nearly all people may attempt every
item, but, because the items are of such a high difficulty level, test takers’ knowledge
level, or “power”, becomes the determinant of success in completing the test (Henning
1987:196).
Most tests fall somewhere between the two extremes, since knowledge rather
than speed is the primary focus, but time limits are enforced since weaker students
may take an unreasonable length of time to finish (Henning 1987:8). Most test
designers, experimentally or intuitively, time their test to allow roughly 90% of
test-takers to complete in time, but do not consider their test to be speeded (Alderson
2000:150). Although the distinction between speed tests and power tests is often
considered just a difference in the focus intended in implementing or designing the
test, most power tests are, in practice, timed, and their results are influenced by test
takers’ speed of processing and production. “Speed” could therefore be, and should be,
regarded as an important variable that constitutes language ability. In fact, the
results from Hirai (1999) suggest that a correlation could be found between the scores
of Japanese EFL learners on a cloze test and their reading speeds as well as their
listening speeds. Furthermore, Shizuka (2000), in his study on the validity of
incorporating reading speed and response confidence in measuring EFL reading
proficiency, concluded that reading speed was a valid element in demonstrating
test takers’ reading ability. In the same manner, Naganuma and Wada (2002)
experimentally demonstrated that test takers’ “reading speed” had a certain
relationship with their ability levels. On many occasions, “power” elements such as
those introduced in 3.3.2 are investigated as possible factors that constitute the
reading performance of test takers. However, in a situation where no “true” power
test can exist, the speed at which test takers process and perform test items should be
considered as a factor that constitutes their reading ability.
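As a rough illustration of the timing practice mentioned above, the following Python sketch derives the shortest time limit under which a target share of test takers (here 90%) would finish. The completion times, the function names and the pilot sample are hypothetical; they are not taken from Alderson (2000) or from the data of the present study.

```python
import math


def time_limit_for_completion(completion_minutes, target_share=0.9):
    """Return the smallest observed time within which at least
    target_share of the test takers completed the test."""
    if not completion_minutes:
        raise ValueError("at least one completion time is required")
    ordered = sorted(completion_minutes)
    # Number of test takers who must be able to finish within the limit.
    needed = max(1, math.ceil(target_share * len(ordered)))
    return ordered[needed - 1]


def completion_share(completion_minutes, limit):
    """Proportion of test takers whose completion time is within the limit."""
    finished = sum(1 for minutes in completion_minutes if minutes <= limit)
    return finished / len(completion_minutes)


# Hypothetical pilot data: minutes each of ten test takers needed to finish.
pilot = [31, 35, 36, 38, 40, 41, 43, 45, 52, 64]
limit = time_limit_for_completion(pilot)
print(limit, completion_share(pilot, limit))
```

Under such a limit the slowest test takers still run out of time, which is one way of seeing why a nominally “power” test retains a speed component.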
3.4 Lining Test Items in Sequence
With the interest in inquiring into how a reading performance could be
accounted for with respect to a learner’s reading ability, it is crucial that a test item is
approached from the perspective of ‘measurement.’ If the different reading
performances elicited by different test items were to be termed the ‘qualitative’
perspective of a test item, the ‘quantitative’ perspective would be the difficulty that
is assigned to those performances. In this section, the operationalization of test items
with regard to item difficulty will be discussed.
3.4.1 Specifying difficulty
In an attempt to cast light on the quantitative side of a test item, it is essential
that this is done from the perspective of item specifications. In particular, when
language use is observed in settings that are more realistic and complex, focusing on
its ‘product’ aspect, it is vital that the person who is constructing the item provide
an explicit outline of the trait to be measured, how that trait will be realized in the
performance elicited, how that performance will be elicited, and how that
performance will be quantified to provide an index of the test taker’s ability.
Mislevy and Almond (2002:478) stress that “a systematic means for designing
performance assessments that will directly and adequately inform the particular kinds
and qualities of inferences that need to be made” is vital, before proposing their
influential framework, the Evidence Centered Design framework (ECD framework).
Davies et al. (1999:207) describe test specifications as a “document which sets
out what a test is designed to measure and how it will be tested.” It provides “a
blueprint” for item writers and is “important in the establishment of the test’s
construct validity.” In essence, this explanation could also be applied to item
specifications.
With respect to the quantifying of test items, since a test taker’s performance is
elicited by a test item, it could be presumed that each element that constitutes the item
specifications reflects the traits that are tapped by the test item. Thus, the item
difficulty of the test item should be interpreted as quantifying how difficult it is to
succeed in fulfilling, in aggregate, all those elements that constitute the performance
(Mislevy and Almond 2002). In other words, what can be inferred from the item
difficulty indicated is the difficulty of the task as a whole.
However, in developing test items under the competence approach, the
interest lies in the difficulty of each element rather than that of the whole test
item, since the aim is to account for ‘why’ or ‘how’ the item has come to
possess the difficulty indicated. If the difficulty of each of the elements that
constitute an item could be specified, then the prediction of difficulty for test items,
and thus their quantification, becomes possible. To this end, the search into the
difficulty of each ‘element’ becomes vital.
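The idea of predicting item difficulty from its elements can be sketched as follows. This is only one possible formalization and not the method of the present study: it assumes, for illustration, that element difficulties simply add up to the item difficulty and that success follows a Rasch-type logistic function. The element names and logit values are hypothetical.

```python
import math

# Hypothetical element difficulties in logits; the names and values are
# illustrative assumptions, not estimates from any study.
ELEMENT_DIFFICULTY = {
    "locate_information": -0.8,
    "understand_explicit_statement": -0.3,
    "make_inference": 0.9,
}


def predicted_item_difficulty(elements):
    """Predict item difficulty as the sum of its element difficulties
    (an additive assumption made for this sketch)."""
    return sum(ELEMENT_DIFFICULTY[name] for name in elements)


def success_probability(ability, item_difficulty):
    """Rasch-type probability that a test taker of the given ability
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - item_difficulty)))


literal_item = predicted_item_difficulty(
    ["locate_information", "understand_explicit_statement"]
)
inferential_item = predicted_item_difficulty(
    ["locate_information", "make_inference"]
)

# The same (hypothetical) test taker is predicted to find the item that
# requires an inference harder than the item that does not.
print(success_probability(0.0, literal_item))
print(success_probability(0.0, inferential_item))
```

Whether element difficulties actually combine additively is an empirical question; the sketch only shows how, once they are specified, a prediction for the whole item follows.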
3.4.2 Seeking a Link between Question Type and Item Difficulty
Among the studies that examine the measurement of a certain competence in
reading ability, a few studies that focus on the element of “type of reading” can be
found. Although there is some research that investigates relationships
between item difficulty and what type of reading is tapped by an item (e.g. Tal et al.
1994, North 2000, Oded and Walters 2001, Trites and McGroarty 2005, Gomez et al.
2007), few such attempts have been made in Japan.
As discussed in Chapter 2 of the present thesis, with respect to how
reading product could be illustrated as a construct, the ‘two-dimensional approach’ to
reading ability which was derived from factor analytic studies in Negishi (1996) and
Wada (2003) may hold a potential key. Wada (2003), inspired by Negishi (1996), in
a factor analytic study of reading tests given to EFL learners in Japan, observed that
reading ability could be broken down into components described by the
‘local/global comprehension’ dimension and the ‘literal/inferential comprehension’
dimension, suggesting the validity of ‘question types’9 that elicit ‘local-literal’,
‘local-inferential’ and ‘global-inferential’ types of reading.10
Wada (2003), on the basis of her empirical findings, illustrated that the
‘local-literal’ question type is a type of test item that can be answered by returning
to a small unit of information in the passage and understanding what is explicitly
stated there. The ‘local-inferential’ question type is a type of test item that can be
answered by referring to a small unit of information in the passage and understanding
what is implicitly stated there by making inferences. The ‘global-inferential’ question
type is a type of test item that can be answered by referring to a large unit of
information in the passage and making inferences in order to arrive at the
macropropositional idea of the text.
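The two dimensions and the three question types described above could be represented as a small data structure, as in the following Python sketch. The class and function names are hypothetical and serve only to make the classification explicit; the fourth combination (global-literal) is left out because it is not among the three question types listed.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Scope(Enum):
    LOCAL = "local"    # a small unit of information in the passage
    GLOBAL = "global"  # a large unit of information in the passage


class Depth(Enum):
    LITERAL = "literal"          # what is explicitly stated
    INFERENTIAL = "inferential"  # what must be inferred


class QuestionType(NamedTuple):
    scope: Scope
    depth: Depth

    @property
    def label(self) -> str:
        return f"{self.scope.value}-{self.depth.value}"


# The three question types described in the text.
SUPPORTED = {
    QuestionType(Scope.LOCAL, Depth.LITERAL),
    QuestionType(Scope.LOCAL, Depth.INFERENTIAL),
    QuestionType(Scope.GLOBAL, Depth.INFERENTIAL),
}


def classify(scope: Scope, depth: Depth) -> Optional[QuestionType]:
    """Return the question type for a scope/depth pair, or None when the
    combination is not one of the three supported types."""
    candidate = QuestionType(scope, depth)
    return candidate if candidate in SUPPORTED else None


print(classify(Scope.LOCAL, Depth.INFERENTIAL).label)
print(classify(Scope.GLOBAL, Depth.LITERAL))
```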
So far as ‘measuring’ ability is concerned, constructing test items demands a
‘theoretical framework’ which embodies a ‘developmental sequence’ and accounts for
how and why certain items are perceived to be more difficult than others by the test
takers. Such a framework would provide a ‘construct’ that delineates the
relationship between the elements of reading performances, such as ‘question types’,
and test takers’ reading abilities. As previously mentioned, North (2000) made a
substantial effort in developing the CEFR and a scale to describe language proficiency.
The same sort of approach was taken by Gomez et al. (2007), who conducted a
“scale-anchoring study”, an attempt to create descriptors that account for the reading
performances of test takers at different levels of English proficiency based on both
empirical data and judgments by test developers. Gomez et al. (2007)
succeeded in creating descriptors that encapsulate what test takers at a given
proficiency level are able to do in carrying out reading tasks and illustrated how this
latent ability structure alters in accordance with test takers’ proficiency levels. The
study noted that there was variation in what test takers could do in arriving at an
answer when solving reading test items.
9 The term ‘question type’ in the language testing context may sometimes indicate the format of test
items (e.g. multiple-choice question, true-false question). However, here, it indicates the type of a test
item with reference to what type of reading (e.g. local-literal) it elicits.
10 Although two dimensions (local/global and literal/inferential) were originally assumed in the
study, Wada (2003) could not extract the fourth type of comprehension, ‘global-literal’ comprehension,
in her factor analytic study, concluding that only the ‘global-inferential’, ‘local-inferential’, and
‘local-literal’ types of comprehension are valid to be assumed as ‘question types’.
For example, in reporting the results of their findings, Gomez et al. (2007)
described:
Performance at the Low level The descriptors that emerged from our
analyses state that test takers at the Low performance level “have difficulty
identifying the author’s purpose except when that purpose is explicitly stated
in the text or easy to infer from the text.” This statement implies that
Low-level test takers are able to identify the author’s purpose when it is
explicitly stated or easy to infer from the text... (p.430)
This could be compared with their descriptions for other ability groups
responding to the same test item:
Performance at the Intermediate level The descriptors that emerged from
our analyses state that test takers at the Intermediate performance level “can
recognize the expository organization of a text and the role that specific
information serves within a larger text but have some difficulty when these
are not explicit or easy to infer from the text,” while test takers at the High
level can recognize text organization and the role served by specific
information “even when the text is conceptually dense.” (p.431).
Performance at the High level ... Faced with this degree of conceptual
density, most test takers at the Intermediate level were unable to infer the
author’s purpose correctly, whereas most of those at the High level were able
to do so. (p.432)
The fact that this alteration in the latent ability structure was revealed in Gomez
et al. (2007) gives a positive perspective to the present research to further investigate
the relationship between the ‘question types’ suggested by Wada (2003) and their
perceived difficulty among different ability groups, in order to construct a sequence
which reflects the ELT environment in Japan. With the intention of investigating how a
reading test item could be constructed to elicit the types of reading performance
described by Wada’s (2003) ‘question types’ with regard to test takers’ latent ability
structure, seeking a link between these ‘question types’ and their item difficulty seems
to be a requisite. Although Wada (2003) explored the qualitative aspects of ‘question types’
and their validity as variables in item construction to some extent, how these
‘question types’ could be linked with test takers’ latent ability structure was not
considered. Would a test item be perceived to render the same ‘question type’ across
test takers of different ability levels, or would it be perceived differently according to
their ability levels? Would test items with different question types have a
generalizable order in their difficulties, and, if so, how could they be ranked? To
investigate how this notion of ‘question types’ could be implemented as a variable in
constructing a test item to “measure” a test taker’s reading ability, a way to quantify
this aspect of ‘question types’ in conjunction with test takers’ latent ability structure is
needed.
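One simple way of approaching the ranking question raised above is sketched below in Python. It assumes, purely for illustration, that difficulty is summarized as the proportion of correct responses (item facility) per question type within each ability group, and then checks whether the resulting order is the same across groups. The response data, group labels and function names are hypothetical; the present study may quantify difficulty differently.

```python
from collections import defaultdict


def facility_by_group_and_type(responses):
    """responses: iterable of (ability_group, question_type, correct).
    Returns {ability_group: {question_type: proportion correct}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, question_type, correct in responses:
        cell = counts[group][question_type]
        cell[0] += int(bool(correct))  # number of correct responses
        cell[1] += 1                   # number of responses
    return {
        group: {qt: right / total for qt, (right, total) in by_type.items()}
        for group, by_type in counts.items()
    }


def difficulty_order(facility):
    """Question types ordered from easiest (highest facility) to hardest."""
    return sorted(facility, key=facility.get, reverse=True)


def make(group, question_type, right, total):
    """Build hypothetical responses: `right` correct out of `total`."""
    return [(group, question_type, i < right) for i in range(total)]


# Hypothetical responses for two ability groups.
responses = (
    make("low", "local-literal", 3, 4)
    + make("low", "local-inferential", 2, 4)
    + make("low", "global-inferential", 0, 4)
    + make("high", "local-literal", 4, 4)
    + make("high", "local-inferential", 3, 4)
    + make("high", "global-inferential", 2, 4)
)

facility = facility_by_group_and_type(responses)
orders = {group: difficulty_order(f) for group, f in facility.items()}
same_order = len({tuple(order) for order in orders.values()}) == 1
print(orders, same_order)
```

If the order differed between groups, that would suggest that the ranking of question types is not generalizable across ability levels.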