Developing Invariant, Construct-valid Measurement Scales in Spoken English as a Second Language
Paper presented at AERA April 2000
Diane Strong-Krause
Brigham Young University
Introduction
Design Experiments Require Construct Valid, Invariant Measurement
Design experiments are best accomplished when all members of a research team,
whether instructional scientists, measurement scientists, or teachers have a clear
theoretical understanding of the domain of learning expertise. Such an understanding
may be called a domain theory. A domain theory includes an understanding of the
dimensions of learning or growth that run through the domain. A domain theory also
provides an account, for each of those unidimensional scales, of how the constructs of
growing expertise are ordered along each scale and why higher constructs along the scale
are more difficult than lower ones.
This paper is an instance of the construction of such a domain theory: a set of
construct valid, invariant measurement scales that have the four properties discussed by
Bunderson (2000) in the previous paper. These properties are: (1) interpretive invariance
through the link of each construct along each scale to testlets having construct validity for
their associated construct, (2) invariance to the sample of people in a given cycle of the
design experiment, (3) invariance to the particular subset of tasks each person takes, and
(4) equal intervals along the scale. Item response theory has long claimed invariance to
the set of persons in the sample and to the subset of tasks used to estimate a person’s
score. Proponents of the Rasch model have also claimed that, when the trace lines are
parallel because all use the same average ‘a’ parameter, then the units along the scale are
equal interval units (Wright, 1999; Perline, Wright, & Wainer, 1979).
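The dichotomous Rasch model behind these claims can be stated compactly. With a single common discrimination, the log-odds of success reduce to a simple difference between person ability and task difficulty, which is why one unit (a logit) means the same thing anywhere along the scale:

```latex
P(X_{ni}=1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} = \theta_n - b_i ,
```

where \(\theta_n\) is the ability of person \(n\) and \(b_i\) the difficulty of task \(i\).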
In this paper a domain theory with construct-valid testlets and invariant scales is
sought through a design experiment approach. This domain theory embodies a
unidimensional scale through the domain of English speaking expertise, obtained using
the Rasch model on both theoretical and real data. Previous research in language
learning has found that reading, writing, speaking, and listening are learning processes
best measured on separate scales. This domain theory is a work in progress, because
results from only the first two cycles are available. These two include the baseline cycle,
which uses theory to generate a preliminary scaling, and the second cycle, which is the
first data collection cycle using the testlets constructed for each construct along the scale.
This design experiment will continue over subsequent semesters in the classes for foreign
speakers learning English at Brigham Young University. There is a four-semester
sequence of classes: English 101, 102, 201, and 202. Data will be collected from all of
these classes. Although there are at least four unidimensional scales in this domain (for
reading, writing, speaking, and listening), the scale for English speaking was selected for
this stage of a long-term design experiment.
The Problem of Measuring Language Competence
During the last decades of the twentieth century, a call for tests useful to teaching
and learning has been issued. Teachers and administrators tire of scores that don’t really
mean anything in terms of specific feedback to the program. This is true in the field of
language testing. Shohamy (1992) suggests: “the real power [of language tests]… is their
ability to provide meaningful and important information that can be incorporated into the
learning system” (p. 10). She goes on to say that useful information can provide
evidence of student progress, teacher performance, curriculum usefulness, and method
and material effectiveness. In order for language tests to be informative, they need to be
sensitive enough to measure progress within a course, not merely to be a summative
score or label. Measuring progress implies at least three things:
1) It implies at least two measurements—the student’s ability level at the beginning and
the student’s ability level at a later point in time.
2) This, in turn, implies an ordered scale among the tasks students perform. That is, a
score is not simply reported, but rather a measure is indicated along this invariant
scale.
3) In order to provide meaningful feedback, it is not enough to say a student was here
and now she is here. Useful feedback includes information about the language skills,
or combination of skills, needed in order to successfully perform tasks at each level
along the ordered scale.
In other words, a theory-anchored, construct valid scale is needed in order to provide
useful information about examinees at different levels of language expertise. Obtaining
evidence that scores along a scale reflect constructs in a theory of growth is construct
validity.
The purpose of this paper is to explore the initial steps in developing an invariant
scale of ordered tasks so that ultimately language performance can be compared over
time. The cyclical nature of a design experiment approach strengthens research of this
type. Once an initial scale is developed, data from subsequent examinees and test tasks
can be used, through a series of iterations, to refine the scale. This scale, in turn, will
form the foundation of what Bunderson (2000) and Bunderson and Newby (2000) have
termed a domain theory.
According to Bunderson (2000) in the first paper in this symposium, a domain theory
is defined as
an ordered set of constructs describing the constructs, whether cognitive,
linguistic, conative, affective, or other aspects of evolving expertise in the domain.
These range from the level of person alpha on task alpha (the easiest task that the
least minimally qualified person can pass), to person omega and task omega (a
task whose difficulty is just beyond what we will expect in the most advanced class
in the domain)… A domain theory begins with a qualitative model describing
persons and tasks in a defined domain. This, in turn, evolves to a qualitative map
of the domain with rough order relations. A qualitative model of the domain can
lead to the development of testlets for each construct along each unidimensional
scale, and these order relations can be converted to measures by the methods of
additive conjoint measurement (p. 8).
In language testing, such a qualitative model is termed by Bachman and Palmer
(1996) as the “target language use” (TLU) domain. They define it as “a set of specific
language use tasks that the test taker is likely to encounter outside the test itself, and to
which we want our inferences about language to generalize” (p. 44). They go on to say
“that it is neither feasible nor necessary, for the purpose of developing language test tasks
and tests, to provide an exhaustive discourse of language use” but rather to describe
“those features that are critical to the kinds of inferences we want to make and the
specific domains to which we want these inferences to generalize” (p. 46). Therefore, two
important facets need to be described for these domains: first, the critical language
abilities needed, and second, the range of tasks in the domain.
The American Council on the Teaching of Foreign Languages (ACTFL) has
developed a domain model describing four hierarchically-ordered levels of language
expertise: novice, intermediate, advanced and superior. Detailed, qualitative descriptions
of tasks and person language abilities are provided at each level of expertise. For
example, the ACTFL speaking guidelines indicate that speakers at the Intermediate level
“produce relatively short, discrete sentences, ask simple questions, and handle
straightforward survival situations” (American Council on the Teaching of Foreign
Languages, 1999, p. 15).
This domain model is used as the basis for creating tests of language proficiency,
specifically the Oral Proficiency Interview (OPI). Test development using this approach,
which Bachman (1990) terms the “real-life” approach, is widely used because it is very
practical and “provides a relatively easy way to develop tests that ‘look good’ and may
have some predictive validity” (Bachman, 1990, p. 330). However, there are two main
criticisms. The first problem deals with evaluating validity. The claim of validity in this
approach is one of content validity. However, Messick (1989) cautions:
In a fundamental sense so-called content validity does not count as validity at all,
although . . . considerations of content relevance and representativeness clearly
do and should influence the nature of score inferences supported by other
evidence…. Some test specialists contend that what a test is measuring is
operationally defined by specifying the universe of item content and the item-
selection process. But . . . determining what a test is measuring always requires
recourse to other forms of evidence. (p. 17)
Messick (1998) goes on to say that content-related evidence for validity is not enough:
“Validity is a unitary concept, which means that fundamentally there is only one kind of
validity, namely, construct validity” (p. 1).
A proponent of the ACTFL scale might answer that content validity is not the
only claim. The hierarchical, nested nature of the ACTFL domain model is also an
hypothesis that the intermediate level encompasses the novice, the advanced the
intermediate, and so on. This is an hypothesis of a Guttman scale (Guttman, 1945),
which has highly desirable interpretive properties. If you can perform a task at the
intermediate level, you can probably perform a task below it on the ordered scale.
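This interpretive property can be checked mechanically: in a perfect Guttman pattern, a pass on any task implies passes on every easier task. A minimal sketch (the function name and pass/fail vectors are hypothetical illustrations, not part of the study):

```python
def guttman_errors(ordered_results):
    """Count violations of a perfect Guttman pattern.

    ordered_results: list of 0/1 outcomes for one examinee, with
    tasks sorted from easiest to hardest.  A violation is any pass
    that comes after a fail on an easier task.
    """
    errors = 0
    failed_easier = False
    for outcome in ordered_results:
        if outcome == 0:
            failed_easier = True
        elif failed_easier:  # passed a harder task after failing an easier one
            errors += 1
    return errors

# A perfect Guttman pattern: passes all easy tasks, then fails the rest.
assert guttman_errors([1, 1, 1, 0, 0]) == 0
# One violation: a harder task is passed after an easier one was failed.
assert guttman_errors([1, 0, 1, 0, 0]) == 1
```

Counting such violations across examinees is one rough way to test the nesting hypothesis before any formal scaling.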
Another criticism of the ACTFL scales is aimed at the inability to look at
language abilities separately from tasks. That is, the construct of language competence
cannot be defined as a separate facet. Skehan (1998) indicates this real-life approach has
“little underlying theory as to how a structure of abilities might link to different patterns
of language use, and how such underlying abilities might relate to different contexts and
performance conditions” (pp. 156-157). McNamara (1995) concurs and argues “that a model
of underlying capacities in performance … is necessary, if we are to advance our thinking
about performance assessment” (p. 6).
Bachman (1990) proposes an alternative: a unified model of language
performance which distinguishes language abilities from other facets. This model posits
that four categories influence test performance—communicative language ability, test
method facets, personal attributes or test taker characteristics, and random measurement
error. Bachman proposes two principal components of communicative language
competence, organizational competence and pragmatic competence (see Figure 1).
McNamara (1995) suggests that “a model such as Bachman’s helps us to articulate the
‘theoretical rationale’ for such inferences: it permits the necessary clarity, specificity and
explicitness in stating the grounds for inferences about candidates’ abilities made on the
basis of test performance, thereby also facilitating the empirical investigation of such
claims” (p. 19).
Developing a Domain Theory
It comes as no surprise that these two means of getting at language performance
are at odds. Certainly both approaches have their own merits, and before we discard one
or the other, perhaps it would be wise to consider how a combination of the two may lead
to more useful tests. The ACTFL Speaking Guidelines (1999) provide hierarchically-
ordered, qualitative descriptions of language expertise based on years of real world
experience. On the other hand, Bachman (1990) offers an in-depth theory of
communicative competence. What we need is an invariant scale of hierarchically-ordered
tasks that can be linked to a theory of language competence. Indeed, Bachman (1990)
proposes a possible integration of the two where the test task design “would involve the
analysis of the test tasks developed through the real-life approach” (p. 357) and the use of
the framework of language abilities. This, in turn, would be followed by construct
validation research and would lead to the ability to make predictions about how language
abilities affect test performance. This strategy allows us to begin to move from a domain
model to a domain theory.
Invariant Scales
At the core of a domain theory is the development of invariant interval scales.
However, most performance assessments terminate in a rating or a score. Scores are
ordinal scales and are sample dependent, task dependent, and rater dependent. The
performance on the task depends on the ability of the examinee (a task may be easy for
an advanced student while the same task may be difficult for a low proficiency student),
and the score may also vary depending on the severity of the rater.
Interval measures, as opposed to ordinal scores, offer the type of invariance
needed—the unit is the same whether found at the low, middle, or high range of a scale.
The development of invariant scales requires conjoint estimation of scale positions for
both task difficulty and person proficiency. The measure associated with each task should
be independent of the ability level of the student, and the measure of examinee
proficiency should not depend on the particular tasks performed or the judge or judges
used.
However, an interaction between task and examinee always exists. How difficult
a task is depends on the abilities of the examinee. Difficulty, then, is not a property of a
task; rather it is the interaction between the task and the ability of the examinee. What is
needed is a measure that provides a way to determine the probability of success on a
particular task given the ability level of the examinee.
Multifaceted Rasch measurement (Linacre, 1989) does just this. The approach
is an extension of the Rasch model, a probabilistic model in which two facets, examinees
and items, are modeled. The multifaceted model extends to more than two facets. A
three-facet model may include raters along with items and examinees. Using the program
FACETS (Linacre & Wright, 1992), a measure is produced for both tasks and examinees,
taking into account the severity of the rater or raters. Because these measures are on the
same scale, they can be used together to predict the probability of success on a
particular task given the ability measure of the examinee. Therefore, we are able to begin
the development of an invariant scale, which, over time, can be added to and refined.
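As a rough illustration of the prediction step (not the FACETS estimation itself, which fits these parameters to data), a three-facet model with a rating scale assigns a probability to each rating category from person ability, task difficulty, and rater severity, all in logits. Here categories 0 through 3 stand in for the study's ratings of 20, 40, 60, and 80; the numeric values are hypothetical:

```python
import math

def category_probs(ability, difficulty, severity, thresholds):
    """Rating-scale form of a three-facet Rasch model (a sketch).

    Returns the probability of each rating category given person
    ability, task difficulty, and rater severity (all in logits).
    thresholds[k] is the step parameter between category k and k+1.
    """
    # Cumulative sums of (ability - difficulty - severity - threshold)
    # give the log-numerators for categories 1..K; category 0 gets 0.
    log_num = [0.0]
    total = 0.0
    for tau in thresholds:
        total += ability - difficulty - severity - tau
        log_num.append(total)
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]

# Hypothetical values: an able examinee, a middling task, a lenient rater.
probs = category_probs(ability=1.5, difficulty=0.2, severity=-0.3,
                       thresholds=[-1.0, 0.0, 1.0])
assert abs(sum(probs) - 1.0) < 1e-9  # category probabilities sum to one
assert probs[3] == max(probs)        # the top rating is most likely here
```

Because ability, difficulty, and severity enter the model only as a sum of differences, the same task and rater parameters yield predictions for any examinee, which is the invariance property at issue.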
Methodology
This study used speaking data collected from ESL students studying at an intensive
English program. One hundred and sixty-nine students representing a variety of language
backgrounds including Spanish, Portuguese, Korean, and Japanese responded to speaking
tasks. Students were included from xx different classes representing four different
semesters of progress: English 101, 102, 201, and 202.
Instrument
A computer-delivered speaking test was developed by the BYU Humanities
Research Center. The most significant aspect of the development process used was the
use of the ACTFL Speaking Guidelines (American Council on the Teaching of Foreign
Languages, 1999) as a means to generate hypotheses about the ordering of task difficulty
along the speaking scale. Ten groups of task types were developed, and in the process,
further principled design guidance was sought through study of models of communicative
competence, with an emphasis on Bachman’s (1990) model. The goal was to meet
criticisms of ACTFL by linking it to theory, and also to a stable measurement scale. The
result was 40 tasks and associated scoring rubrics, carefully designed and linked by
theory to different hypothesized constructs of growth ordered along the speaking scale.
Ten levels are more than the current ACTFL guidelines discriminate, but experience
and theory indicated that finer gradations might be useful. Four testlets (prompts to
initiate speaking, and scoring rubrics) were developed for each of the ten ordered
construct levels. This instrument of 40 speaking prompts was used to collect student
response data. The ten general task types include: naming common objects, giving
personal information, giving information about others, dealing with typical social
situations, asking questions, narrating a personal story, narrating a story given a visual
prompt, dealing with a complication in a social setting, telling about the future, and
supporting an opinion (see Appendix for a complete listing of the task types and four
versions within each one).
Data Collection
Following a brief introduction to the exam, including written instructions in the
students' first language, students took the computer-delivered speaking exam. The first
three tasks were practice items, which were not rated. Students were randomly assigned to
one of three groups. Each group responded to a total of twenty tasks, two from each task
type. Ten of these tasks were common among the three groups. The other ten tasks were
unique to each group. Therefore, a linking design was used where students were
randomly assigned to one of three groups and the groups were linked by the common set
of ten tasks.
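The assignment logic of this linking design can be sketched as follows; the function and task identifiers are illustrative stand-ins, not the actual test forms:

```python
import random

def assign_forms(student_ids, common_tasks, unique_task_sets, seed=0):
    """Sketch of the linking design: every student receives the common
    tasks plus the unique tasks of one randomly assigned group.

    unique_task_sets holds one list of tasks per group (three groups
    in the study).  Returns {student_id: (group, task_list)}.
    """
    rng = random.Random(seed)
    assignments = {}
    for sid in student_ids:
        group = rng.randrange(len(unique_task_sets))
        assignments[sid] = (group, common_tasks + unique_task_sets[group])
    return assignments

common = [f"common{i}" for i in range(10)]
uniques = [[f"g{g}_task{i}" for i in range(10)] for g in range(3)]
forms = assign_forms(["s1", "s2", "s3"], common, uniques)
# Every form has twenty tasks; the ten common tasks link the groups.
assert all(len(tasks) == 20 for _, tasks in forms.values())
```

The shared block of ten tasks is what allows FACETS to place all three groups' unique tasks on one common scale.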
Responses were rated by eighteen judges with experience teaching a second or
foreign language. All judges were trained using the rubric. All responses were at least
double-rated with more than half rated by up to five judges. Responses to each task were
scored holistically using a rubric consisting of ratings of 20, 40, 60, and 80 (see Table
1). Judges made two decisions in their rating. First, using guidelines provided in the
training material, the judge decided whether or not the task was completed. If the task
was clearly not completed, the judge gave a rating of 20. If a partial response
demonstrated developing, but not sufficient, language abilities to complete the task, the
judge gave a rating of 40. If the judge decided the task was indeed completed, the quality
of the response was rated. If distracting errors were present, the judge gave a rating of
60; otherwise, the judge gave a rating of 80.
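The judges' two-step decision can be summarized as a simple branching rule; this is a sketch of the rubric logic described above, not the training materials themselves:

```python
def holistic_rating(task_completed, developing_abilities, distracting_errors):
    """Sketch of the judges' two-step holistic decision (Table 1 scale).

    Step 1: decide completion.  If not completed, distinguish a bare
    attempt (20) from developing language abilities (40).
    Step 2: if completed, rate quality: distracting errors give 60,
    otherwise 80.
    """
    if not task_completed:
        return 40 if developing_abilities else 20
    return 60 if distracting_errors else 80

assert holistic_rating(False, False, False) == 20
assert holistic_rating(False, True, False) == 40
assert holistic_rating(True, False, True) == 60
assert holistic_rating(True, False, False) == 80
```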
Data Analysis
The first step in data analysis was to determine if a theoretical distribution of tasks
could be predicted. Although the ACTFL guidelines (1999) indicate that they are “not
based on a particular linguistic theory or pedagogical method” (p. i), “four major levels
are delineated according to a hierarchy of global tasks. This hierarchy is summarized in a
rating scale spanning a wide range of performance profiles” (p. 8) from Novice to
Superior. Therefore, a certified ACTFL oral interviewer was able to predict what rating
(20, 40, 60, 80) on each of the forty tasks would be received by a prototypical examinee
at each proficiency level ranging from Novice-Mid to Superior. A distribution of 100
prototypical examinees was then created based on an estimation of a typical distribution
of students attending the intensive English program (see Table 2). From this information,
a 100 x 40 matrix was created using the predicted ratings on each task for each
proficiency level. These predicted ratings were then analyzed using the FACETS
program to obtain theoretical predictions of difficulty values for each task. A theoretical
distribution of tasks along an invariant scale was now available for comparison to the
actual data distributions (see Figure 2).
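Constructing the predicted-ratings matrix amounts to expanding each proficiency level's predicted rating vector into one row per prototypical examinee at that level. A toy sketch, with hypothetical level names, counts, and ratings standing in for the expert predictions and the Table 2 distribution:

```python
def build_theoretical_matrix(level_counts, predicted_ratings):
    """Expand per-level predicted ratings into one row per
    prototypical examinee.

    level_counts: {level: number of prototypical examinees}
    predicted_ratings: {level: list of predicted ratings, one per task}
    """
    matrix = []
    for level, count in level_counts.items():
        # Each examinee at a level gets that level's predicted ratings.
        matrix.extend(list(predicted_ratings[level]) for _ in range(count))
    return matrix

# Toy distribution: 2 Novice-Mid and 3 Intermediate-Low examinees, 3 tasks.
level_counts = {"NoviceMid": 2, "IntermediateLow": 3}
predicted = {"NoviceMid": [20, 20, 40], "IntermediateLow": [40, 60, 60]}
matrix = build_theoretical_matrix(level_counts, predicted)
assert len(matrix) == 5 and matrix[0] == [20, 20, 40]
```

In the study the counts come from the estimated distribution of 100 examinees and the ratings from the certified interviewer's predictions, yielding the 100 x 40 matrix submitted to FACETS.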
The data collected from students at the intensive English program were then
analyzed, again using the FACETS program. After an analysis of fit statistics, two
students, three raters, and one task were dropped and the data re-analyzed.1 The
difficulty ordering of tasks from this analysis is shown in Table 3. Correlation analysis
1 The three raters that were dropped had high outfit statistics, which means they were overusing extreme ratings (not the middle ratings). In addition, these raters were the most lenient of the eighteen judges. Therefore, the analysis suggests that they were giving a large number of ratings of 80. The one task that was dropped was the “Classroom” task where examinees simply named objects in a picture of a classroom. A high outfit statistic suggests that this item was measuring a different ability from the other tasks. The other three “Naming Objects” tasks also tended to have high outfit statistics, which suggests that possibly this task type is on a different dimension from the others.
between the real data values and the values obtained with the theoretical ratings was
carried out.
Results and Discussion
Table 3 shows the results of the theoretical and real scaling of tasks. The
correlation between the real data task measures and the theoretical data task measures
produced a .78 correlation coefficient. A comparison between the order based on the
theoretical data and the order based on the actual data is found in Figure 4. In general,
the order of task values was similar, but the values from the real data showed a more
defined distribution. Based on these results, four areas of interest deserve further
comment.
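The reported coefficient is an ordinary Pearson correlation between the two sets of task measures. A minimal sketch, using toy values of the same general shape as Table 3 rather than the actual forty task measures:

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences
    (here, theoretical vs. actual task measures)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy measures: two scales with similar ordering correlate highly even
# though their units and spacing differ.
r = pearson_r([-18.1, -6.5, 3.8, 9.0, 18.6], [-2.1, -0.9, 0.3, 0.7, 0.8])
assert 0.8 < r < 1.0
```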
First, the task type “Naming Objects” appears to be quite distinct from the other
tasks. All of these tasks are grouped at the very low end of the scale. These tasks seem
to be measuring some other aspect of the construct of speaking meaningful utterances,
perhaps a knowledge of basic vocabulary. Outfit statistics suggest the same, since these
tasks misfit the scale badly. One task was eliminated and the remaining three had the
highest fit statistics of all the tasks. Misfit of this type is a warning that these tasks may
not be on the same unidimensional scale with the other speaking tasks. Perhaps
meaningful speech is on one scale, and word knowledge demonstrated without the need
for constructing meaningful sentences is on another.
Second, while the ACTFL Guidelines do not distinguish between giving personal
information about self versus giving similar information about others, the results indicate
that these task types tend to be at different places on the scale. Giving personal
information about self seems to be an easier task than giving similar information about
others. The same is true with narrations. No distinction is made within the ACTFL
Guidelines, but narrating a personal story is lower on the scale than narrating a story
given some type of visual prompt.
Third, an examination of the data shows that the “Ask Questions” tasks had the
widest variability in values, ranging from 6.5 to 9.7 (on a 10-point scale). These tasks
seemed to cover almost the full range except for the very lowest “naming” tasks. It is not
clear that this cluster forms a unified growth construct, and it may need to be replaced with
one that does.
Finally, the tasks requiring examinees to support an opinion, the most difficult
tasks on the theoretical scale, seemed to flip-flop between the future-oriented and
situation-with-a-complication tasks. Upon further investigation of the tasks, the set of
opinion tasks in this test may be described more accurately as giving an opinion than
supporting one. Examinees were limited in the amount of available response time. To
truly support an opinion may require much more time than allotted in this exam.
Conclusions
The design of this study makes it possible to continue to develop a construct-
valid, invariant scale of learning and growth in the domain of speaking competence for
non-native English speakers. The 10 groups of four testlets each were developed by reference
both to the ACTFL model and to substantive process constructs in emerging theories of
communicative competence. Order relationships were also hypothesized as part of the
testlet-development process. By developing the testlets in this manner, and refining them
based on the scaling results over the continuing cycles of a design experiment, we hope to
achieve the goal stated above: to develop a measurement instrument with the strengths of
the ACTFL scale in linking to ‘real-world’ tasks, but with the theoretical underpinnings
of an evolving theory of competence in speaking.
The results of this study have provided us with the first phase of developing a
useful measurement instrument. Using the qualitative ACTFL domain model of language
proficiency, in combination with previously collected data on the distribution of learners
across the four semesters of English classes, a theoretical scaling of tasks was developed.
Using actual data, the theoretical scale positions of tasks were defined more precisely. An
initial scale of ordered speaking tasks is now available as a foundation for developing a
domain theory. However, this is just the beginning. Future iterations in scale refinement
must look more closely at particular task types, including “Asking Questions” and
“Opinion” tasks by varying topic and response time allowed. Furthermore, additional
task types must also be added and calibrated onto the scale.
Finally, and most importantly, as this ordered scale of speaking tasks continues to
be refined, we can investigate more fully the critical language abilities associated with the
probability of success on particular task types. It is only with invariant scales that this
type of proficiency scaling can link underlying language skills with particular values
along a scale. Perhaps now at the beginning of the 21st century, we are one step closer to
providing meaningful information to programs, teachers, and examinees about progress
in language abilities. And perhaps better science is in the offing. As the domain theory
develops, we can investigate in the design experiment format many research questions
about learning to progress along this increasingly well-defined growth scale. We can
know in conducting this research that the outcome measures taken from semester to
semester will have the same metric; and, as the domain theory develops further beyond
the qualitative ACTFL model, we can come to understand the ordering of tasks within a
coherent and testable interpretative framework. This framework offers connections
between data and theory all along the way.
References
American Council on the Teaching of Foreign Languages. (1999). Oral
proficiency interview tester training manual. Yonkers, NY: American Council on the
Teaching of Foreign Languages.
Bachman, L. F. (1990). Fundamental considerations in language testing. New
York: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. New
York: Oxford University Press.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M., & Wright, B. D. (1992). FACETS: Many-faceted Rasch analysis.
Chicago: MESA Press.
McNamara, T. F. (1995). Modelling performance: Opening Pandora's box.
Applied Linguistics, 16, 159-179.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement
(3rd ed., pp. 13-104). New York: Macmillan.
Messick, S. (1998). Consequences of test interpretation and use: The fusion of
validity and values in psychological assessment (Research Report 98-4). Princeton, NJ:
Educational Testing Service.
Shohamy, E. (1992). New modes of assessment: The connection between testing
and learning. In E. Shohamy & A. R. Walton (Eds.), Language assessment for feedback:
Testing and other strategies (pp. 7-28). Dubuque, IA: Kendall/Hunt.
Skehan, P. (1998). A cognitive approach to language learning. New York:
Oxford University Press.
Table 1
Holistic Rating Scale Rubric
Rating  Description

20      Some attempt at the task may take place, but basically it is impossible for someone who doesn’t know the purpose of the task to understand. Some words or sentences may be spoken, but no real communication takes place.

40      The task is attempted and performance shows developing language abilities to accomplish the task. However, it is clear that the examinee doesn’t have enough language skills to complete the task. Performance shows lack of control or development of essential structures, vocabulary, and/or audience awareness. Little attempt is made to connect ideas to make the response cohesive.

60      Performance shows that the student has sufficient language ability to complete the task, although quality is not always maintained. Errors in syntax are present and are somewhat distracting, but they don’t interfere with communication. Vocabulary is sufficient for the task, but it is inadequate to provide detail for the task.

80      Performance clearly shows sufficient language ability to complete the task. Minor errors are present, but don’t interfere with understanding. Performance shows awareness of language needed in different social settings.
Table 2 Estimated Distribution of Proficiency Levels
Proficiency Level Percent of the Distribution
Novice Mid 7
Novice High 20
Intermediate Low 24
Intermediate Mid 23
Intermediate High 21
Advanced Low 2
Advanced Mid* 1
Advanced High* 1
Superior* 1
*Students at Advanced Mid, Advanced High, and Superior levels rarely attend the program, so 1 percent in each of these levels was used in the estimation.
Table 3 Results of FACETS analysis for theoretical data and actual data
Task (theoretical order)  Theoretical Measure    Actual Measure  Task (actual order)
1a Transport        -18.11    -2.20    1a Transport
1b Frontroom        -18.11    -2.05    1b Frontroom
1c Foods            -18.11    -1.74    1c Foods
2a HobSelf           -6.52    -0.99    2a HobSelf
2b FamSelf           -6.52    -0.98    2b FamSelf
2c SkedSelf          -6.52    -0.92    4a OrderFood
2d HomeSelf          -6.52    -0.74    2c SkedSelf
3a SkedOther         -6.52    -0.36    3a SkedOther
3b FamOther          -6.52    -0.36    3b FamOther
3c HomeOther         -6.52    -0.34    5a AskRest
3d HobOther          -6.52    -0.32    4b Invitpar
4a OrderFood         -6.52    -0.27    4c HotelRes
4b Invitpar          -6.52    -0.26    2d HomeSelf
4c HotelRes          -6.52    -0.24    3c HomeOther
4d GiveDirec         -6.52    -0.11    6a Vacation
5a AskRest           -6.52    -0.02    3d HobOther
5b AskClass          -6.52     0.06    6b ChildExp
5c AskMovie          -6.52     0.06    4d GiveDirec
5d AskBook           -6.52     0.09    6c Movie
6a Vacation           3.80     0.13    5b AskClass
6b ChildExp           3.80     0.16    8a Dinner
6c Movie              3.80     0.31    5c AskMovie
6d DifSit             3.80     0.34    10a Pollution
7a WashMachine        3.80     0.37    7a WashMachine
7b Fall               3.80     0.40    10b Zoo
7c Accident           3.80     0.50    6d DifSit
7d Toaster            3.80     0.55    8b Car
8a Dinner             8.99     0.55    7b Fall
8b Car                8.99     0.69    10c ProtectLand
8c Cleaners           8.99     0.70    10d Jobs
8d BossParty          8.99     0.72    9a FutProblem
9a FutProblem         8.99     0.75    7c Accident
9b FutTravl           8.99     0.75    5d AskBook
9c FutTech            8.99     0.76    8c Cleaners
9d Communicate        8.99     0.77    9b FutTravl
10a Pollution        18.61     0.77    7d Toaster
10b Zoo              18.61     0.80    9c FutTech
10c ProtectLand      18.61     0.83    9d Communicate
10d Jobs             18.61     0.84    8d BossParty
Figure 1. Bachman’s (1990) components of language competence.
Language Competence
    Organizational Competence
        Grammatical Competence
            Vocabulary
            Syntax
            Morphology
            Phonemes
        Textual Competence
            Cohesion
            Rhetorical Organization
    Pragmatic Competence
        Illocutionary Competence
            Ideational Functions
            Manipulative Functions
            Heuristic Functions
            Imaginative Functions
        Sociolinguistic Competence
            Sensitivity to Dialect or Variety
            Sensitivity to Register
            Sensitivity to Naturalness
            Cultural References, or Figures of Speech
Figure 2. Theoretical distribution of tasks (on a 1 to 10 scale). Tasks are numbered 1
through 10.
Figure 3. Distribution of tasks using actual data (on a 1 to 10 scale). Tasks are
numbered 1 through 10.
Figure 4. Comparison of distribution of theoretical values and actual data values (on a 1
to 10 scale). Tasks are numbered 1 through 10.
Appendix
Task Type 1: Name common objects
1a Modes of Transportation
1b Front room
1c Foods
1d Classroom
Task Type 2: Give information about self
Tell about . . .
2a Hobbies
2b Family
2c Daily activities
2d Home or apartment
Task Type 3: Give information about others
Given a drawing, tell about . . .
3a Daily activities
3b Family
3c Home or apartment
3d Hobbies
Task Type 4: Deal with typical social situations
4a Order food at a restaurant (given a menu)
4b Invite someone to a party
4c Get a room at a hotel (over telephone)
4d Give directions to get from one place to another
Task Type 5: Ask questions
5a Find out information about a restaurant.
5b Find out information about a class at a nearby school.
5c Find out information about a movie at a theater.
5d Find out information about a book recommended to you.
Task Type 6: Narrate a personal story
6a Tell about a recent vacation.
6b Tell about an experience with your family when you were a child.
6c Tell about a movie you have seen.
6d Tell about a difficult situation you have been in.
Task Type 7: Narrate a story given a visual prompt
7a Washing machine story (4 picture cues)
7b A passenger takes a fall entering an airplane (video). Describe what happened.
7c You see a minor car accident (video). Call police and tell what happened.
7d Toaster (4 picture cues)
Task Type 8: Deal with a complication
8a You take a business associate to dinner. You go to pay, but you don't have your wallet.
8b You borrow a friend's car and get in a minor accident. Apologize, offer to pay, and beg the owner of the other car not to report the accident to the police.
8c You need to get a jacket cleaned for an important presentation. Take it to the dry cleaners, explain the situation, and convince him to put your jacket as top priority.
8d Your company is having a party at your boss's home. Your co-worker indicated it would be casual dress, but when you arrive, you discover it is a formal dinner. You can't go home to change because it is too far. Without getting your co-worker into trouble, apologize to your boss and explain your casual dress.
Task Type 9: Tell about the future
9a What major problems do you think we will face over the next 50 years?
9b What do you think travel will be like by the middle of the 21st Century?
9c How do you think information will be stored and accessed in the year 2050?
9d How do you think communication may be different 100 years from now?
Task Type 10: Support an opinion
10a Pollution is becoming a major problem today. Some say we should all use public transportation and not allow the use of privately-owned cars. What is your opinion? Explain.
10b There are many zoos around the world. Some say it is cruel to keep animals in a zoo. What is your opinion? Explain.
10c Some people believe we should use the land to benefit us by building factories and cities. Others say we need to protect the land and not build on it. What is your opinion? Explain.
10d Some argue that the government should give money to businesses to help provide jobs. Others suggest the government should give money to programs that help people gain skills so they can get jobs. What is your opinion? Explain.