RESEARCH REPORT RR-03-11
April 2003

Validating LanguEdge™ Courseware Scores Against Faculty Ratings and Student Self-assessments

Donald E. Powers, Carsten Roever, Kristin L. Huff, and Catherine S. Trapani
Educational Testing Service, Research & Development Division, Princeton, NJ

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from: Research Publications Office, Mail Stop 10-R, Educational Testing Service, Princeton, NJ 08541
Abstract
LanguEdge™ Courseware is a software tool that is designed to help teachers of English as a
second language (ESL) build and assess the communicative skills of their students. The purpose
of this study was to generate information to help LanguEdge Courseware users better understand
the meaning (or validity) of the assessment scores based on the LanguEdge Courseware.
Specifically, the objective was to describe, for each of the four sections of the LanguEdge
assessment, relevant characteristics of test takers at various test score levels. To accomplish this
objective, we gathered data that represent two different perspectives—those of instructors and
those of students themselves.
Approximately 3,000 students each took one of two parallel forms of the LanguEdge
assessment at domestic and international testing sites. Participants also completed a number of
self-assessment questions about their English language skills. In addition, for some study
participants, instructors rated selected language skills.
LanguEdge test scores were moderately related (correlations mostly in the .30s and .40s) to
student self-assessments. Of the four LanguEdge tests, Listening exhibited the strongest
relationships to self-assessments; Speaking, the next strongest; Reading, the next; and Writing,
the least.
The correlations of faculty ratings with each of the LanguEdge section test scores were
generally in the .40s, with some reaching the .50s. The correlations between the various student
self-assessment scales and faculty ratings were modest, mostly in the .30s. These correlations
suggest that students and faculty had different perspectives on students’ English language skills.
As isolated entities, summary test scores, even when accompanied by normative data, are
not especially informative about what test takers know and can do. In an effort to make test
scores more useful, some testing programs—for example, the National Assessment of
Educational Progress (NAEP)—have implemented relatively sophisticated reporting procedures
in order to facilitate test score interpretations. One such effort, generally known as proficiency
scaling, is usually but not always based on item response theory (IRT) methods (Beaton & Allen,
1992) and entails procedures such as the following. Several ability levels are selected on an
overall ability/proficiency score scale. For each of these levels, individual items are selected
such that, at a given level of ability, examinees have a specified probability (say, 80%) of
answering each item correctly. At lower levels of ability, however, examinees have a
significantly lower probability of answering each of these items correctly, but a high probability
of answering some other set of items correctly. Experts then judge the items that examinees
correctly answer at each level in order to characterize examinee proficiency at various score
points (see, for example, Mullis & Jenkins, 1988; Beaton & Allen, 1992).
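To make the anchoring logic concrete, a minimal sketch follows, assuming a two-parameter logistic (2PL) IRT model. The 80% criterion echoes the example above, but the 50% lower-level cutoff, the function names, and the item parameters are illustrative assumptions of ours, not the procedure of any particular testing program.

    import math

    def p_correct(theta, a, b):
        # 2PL probability that an examinee of ability theta answers an item
        # with discrimination a and difficulty b correctly
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def anchor_items(items, levels, p_high=0.80, p_low=0.50):
        # Select items that "anchor" each level: a high probability of success
        # at that level but a markedly lower one at the next level down
        anchors = {}
        for i, theta in enumerate(levels):
            chosen = []
            for item_id, (a, b) in items.items():
                high_here = p_correct(theta, a, b) >= p_high
                low_below = i == 0 or p_correct(levels[i - 1], a, b) <= p_low
                if high_here and low_below:
                    chosen.append(item_id)
            anchors[theta] = chosen
        return anchors

    # Hypothetical item parameters (a, b) and three ability levels
    items = {"item1": (1.2, -1.0), "item2": (0.9, 0.2), "item3": (1.5, 1.1)}
    print(anchor_items(items, levels=[-1.0, 0.0, 1.0]))

Experts would then review the items selected at each level in order to characterize what examinees at that level can do.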
The resulting scales have a number of attractive features. They are, however, not entirely
problem-free. For example, the proficiencies that underlie success on test items at various score
levels are not always readily inferred, especially when the domains being tested are either
multidimensional or ill defined. Such attempts can give rise to questionable inferences about
examinee proficiency (Forsyth, 1991), possibly because test users do not adequately understand
the score reports (Hambleton & Slater, 1994).
Another noteworthy aspect of proficiency scaling is that it is internally focused. That is,
proficiency scales are given meaning by referencing performance on the test items that the scales
comprise. Because score levels are interpreted according to the items that determine scores, the
method may appear to be circular. At the least, the method has a bootstrapping nature insofar as
it makes use of existing resources (i.e., test items) to improve an existing state (i.e., test score
interpretations).
In contrast, the effort undertaken here approached test score meaning from an external
perspective. The aim was to relate test score levels to nontest, external indicators of examinees’
language proficiency. The test scores of interest were those based on the LanguEdge
Courseware software.
Overview
LanguEdge Courseware (http://www.toefl.org/languedge.html) is a professional
development tool designed to help teachers of English as a second language (ESL) build and
assess the communicative skills of their students. The courseware package consists of interactive
software (two full-length tests of reading, writing, speaking, and listening) and supporting
materials (a teacher's guide, a scoring handbook, and a score interpretation guide). The package
is based on the likely test format of a future version of the Test of English as a Foreign
Language™ (TOEFL®), which will employ tasks that integrate speaking and writing with
reading and listening.
The purpose of this study was to try out procedures that might, eventually, prove useful
for generating information to help LanguEdge courseware users better understand the meaning
(or validity) of LanguEdge test scores. Specifically, the objective was to describe, for each of
four sections of the test, relevant characteristics of test takers at various test score levels, thereby
helping to establish the validity of test score distinctions among test takers. To accomplish this
objective, we gathered data that represent two different perspectives—those of instructors and
those of students themselves. The collection of multiple sources of information is consistent
with commonly accepted standards for test validation (Messick, 1989; American Educational
Research Association, 1999).
Instructors’ assessments of students’ English language skills were gathered because
teachers seem well-positioned to judge the academic skills of their students. The (less obvious)
rationale for collecting student self-assessments was as follows. Self-assessments of various
sorts—self-reports, checklists, self-testing, mutual peer assessment, diary-keeping, log books,
behaviorally anchored questionnaires, global proficiency scales, and “can-do” statements
(Oscarson, 1997)—have proven to be useful indicators in a variety of evaluation contexts,
especially in the assessment of language skills. Upshur (1975), for instance, noted that language
learners typically have a wider view of their successes and failures than do external evaluators.
More generally, Shrauger and Osberg (1981) concluded that there is substantial evidence, both
empirical and conceptual, that self-assessors frequently have both the information and the
motivation to make effective judgments about themselves.
Methods
Sample Selection
In the spring of 2002, approximately 3,000 candidates were recruited both internationally
and domestically (United States and Canada) to participate in a field study of the LanguEdge
Courseware. Each of these students took one of two parallel forms of the LanguEdge assessment
at one of 18 domestic and 12 international test sites. After deleting records for test takers whose
motivation was questionable, usable test data were available for 2,703 test takers.
The field study sample was generally representative of the TOEFL population in terms of
native language. A majority (60%) of field study participants came from the following native
language groups: Chinese (18%), Spanish (13%), Arabic (7%), Korean (7%), Japanese (5%),
French (4%), Indonesian (3%), and Latvian (3%). These groups constitute approximately 61%
of the TOEFL test-taking population and are represented in the following proportions: 23%, 5%,
5%, 12%, 13%, 2%, 1%, and <1%, respectively.
The field study sample was also generally representative of the TOEFL population in
terms of level of English language proficiency as measured by the paper-and-pencil TOEFL.
Both the domestic and international field study subsamples performed slightly better on each
section of the TOEFL than did their counterparts in the operational TOEFL testing population.
The mean scores on the Listening, Structure, and Reading sections, which range from
20 to 67 (or 68), were, respectively, 53.7, 51.8, and 52.9 for the study sample. The same mean
scores for the TOEFL operational test population were 52.6, 49.3, and 51.6 (domestic test takers)
and 50.5, 50.7, and 52.6 (international test takers). The differences between the study sample
and the operational testing population were relatively small, ranging from approximately .03 to
.34 standard deviation units on each of the three scales (listening, structure, and reading).
Procedure/Instruments
Each study participant took the LanguEdge assessment along with a retired paper-based
TOEFL test (TOEFL PPT). LanguEdge has four sections, corresponding to the four modalities
of communication: Listening, Reading, Speaking, and Writing. The LanguEdge assessment is
composed of several different item types, including (a) conventional four-choice, single-correct-
answer multiple-choice items, (b) multiple-choice items requiring one or more correct responses,
(c) extended written response (essay) items, and (d) spoken response items. Productive response
items (i.e., the Speaking and Writing items) require evaluation by trained human raters and are
worth 1 to 5 points each.
With respect to scoring, raw score totals for Listening and Reading are calculated by
summing the number of points awarded for each item answered correctly. Classical
equipercentile equating methods were used to equate Listening and Reading scores across the
two forms of the assessment. In addition to being equated, Listening and Reading scores were
linearly scaled to have a minimum value of 1 and a maximum value of 25.
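The report does not detail the equating computations; the following sketch merely illustrates the general form of unsmoothed equipercentile equating and the subsequent linear rescaling to the 1-25 reporting range. The function names, the midpoint percentile-rank convention, and the absence of smoothing are simplifying assumptions.

    import numpy as np

    def equipercentile_equate(form_x_scores, form_y_scores, x):
        # Percentile rank of raw score x among form X scores (midpoint
        # convention), then the form Y score at that same percentile
        fx = np.asarray(form_x_scores, dtype=float)
        fy = np.asarray(form_y_scores, dtype=float)
        pr = (np.sum(fx < x) + 0.5 * np.sum(fx == x)) / len(fx)
        return float(np.quantile(fy, pr))

    def linear_scale(raw, raw_min, raw_max, lo=1.0, hi=25.0):
        # Linearly rescale an equated raw score to the reported 1-25 range
        return lo + (hi - lo) * (raw - raw_min) / (raw_max - raw_min)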
There are five Speaking tasks and three Writing tasks in each form of LanguEdge. Several
of these tasks are designed to reflect the integrated nature of communicative language ability. One
of the Speaking tasks is integrated with Listening (Listening/Speaking) and the other with Reading
(Reading/Speaking). These tasks require examinees either to read or to listen to a stimulus and
then to speak about it. Similarly, there are two integrated Writing tasks that are administered as
part of the Listening and Reading sections (i.e., Listening/Writing and Reading/Writing). The
remaining tasks (three Speaking and one Writing) are referred to as independent tasks, as responses
do not require examinees to read or listen to an extended verbal stimulus. Scores on the five
Speaking tasks and scores on the three Writing tasks make up the Speaking and Writing total
scores, respectively. Scores for these sections of the assessment have not been scaled or equated.
Instead, scores are reported as the average of scores on each of the tasks.
Before they were tested, participants were also asked to complete a number of questions
about their English language skills. Several kinds of self-assessment questions were developed.
Two sets of can-do type statements were devised on the basis of reviews of existing statements
(e.g., Tannenbaum, Rosenfeld, Breyer, & Wilson, 2003) and with regard to the claims being
made for LanguEdge. Only statements that concerned academically related language
competencies, not more general language skills, were written. One set (19 items) asked test
takers to rate (on a 5-point scale ranging from “extremely well” to “not at all”) their ability to
perform each of several language tasks. The other set (20 items) asked test takers to indicate the
extent to which they agreed or disagreed (on a 5-point scale ranging from “completely agree” to
“completely disagree”) with each of several other can-do statements. For each set,
approximately equal numbers of questions addressed each of the four language modalities
(Listening, Reading, Speaking, and Writing).
Test takers were also asked to compare (on a 5-point scale from “a lot higher” to “a lot
lower”) their English language ability in each of the four language modalities with that of other
students—both in classes they were taking to learn English and also, if applicable, in subject
classes (biology or business, for example) in which the instruction was in English. Test takers
were also asked to provide a rating (on a 5-point scale from “extremely good” to “poor”) of their
overall English language ability.
Finally, test takers who had taken some or all of their classes in English were asked to
indicate (on a 5-point scale ranging from “not at all difficult” to “extremely difficult”) how
difficult it was for them to learn from courses because of problems with reading English or with
understanding spoken English. They were also asked to indicate how much difficulty they had
encountered when attempting to demonstrate what they had learned because of problems with
speaking English or with writing English.
In addition to completing self-assessment questions about their language skills, study
participants who tested at U.S. sites (but not international sites) were also asked to contact two
people who had taught them during the past year and to give each one the Faculty Assessment
Form. Study participants were asked to contact only people who had had some opportunity to
observe their English language skills. Participants were told that after faculty completed the
forms, faculty would mail the envelopes directly to us.
The instructions that accompanied the Faculty Assessment Form asked faculty to provide
their opinions about the student’s English language skills. Specifically, instructors were told that
Educational Testing Service was developing a new TOEFL to facilitate the admission and
placement of nonnative speakers of English in academic programs in North America and that, in
conjunction with this effort, we were gathering a variety of information about the students who
had taken the first version of the test in order to establish more firmly the meaning of scores on
the new assessment. Instructors were also told that they had been asked to provide information
because they had had relevant contact with the student who had contacted them. Finally, they
were informed that their assessment would be treated confidentially and would not be shared
with anyone, including the student.
The Faculty Assessment Form asked faculty to indicate (on a 5-point scale ranging from
“not successful at all” to “extremely successful”) how successful the student had been at
(1) understanding lectures, discussions, and oral instructions
(2) understanding (a) the main ideas in reading assignments and (b) written instructions for
exams/assignments
(3) making him/herself understood by you and other students during classroom and other
discussions
(4) expressing ideas in writing and responding to assigned topics.
Faculty were also asked to compare (on a 7-point scale ranging from “well below average” to
“well above average”) the student’s overall command of English with that of other nonnative
English students they had taught.
For each question, instructors were allowed to omit their rating, if appropriate, and to
respond instead that they had not had adequate opportunity to observe the student’s language
skills. Instructors were also asked to indicate their current position or title, the approximate
number of nonnative speakers of English they had taught at their current and previous academic
institutions, and just how much opportunity they had had to observe the student’s facility with
the English language (little if any, some, a moderate amount, or a substantial amount).
A final item on the form requested the faculty member’s telephone number and e-mail
address “for verification purposes only.” This item was included only to discourage study
participants from completing the form themselves.
Results
Student Self-assessments
It was important to first establish the extent to which test takers were consistent in
reporting about their own language skills. For this purpose, 4-, 5-, or 6-item scales were formed
by summing responses to individual items having the same response format (e.g., “how well” or
“agree”) for each language modality. Table 1 shows the number of items that comprised each
scale, as well as the internal consistency reliability estimate (coefficient alpha) for each of the
various scales. As is clear, each of the various scales exhibits reasonably high internal
consistency, ranging from a low of .81 (for four items asking students to compare their English
language skills with those of other students in English language classes) to .95 for the five-item
scale asking students to rate how well they could perform various reading tasks.
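Coefficient alpha is the internal-consistency index used throughout this report. For readers who wish to reproduce it, a minimal computation follows; the function name and the respondents-by-items data layout are our own conventions, not part of the study's analysis code.

    import numpy as np

    def cronbach_alpha(scores):
        # scores: respondents in rows, items in columns
        X = np.asarray(scores, dtype=float)
        k = X.shape[1]
        item_variances = X.var(axis=0, ddof=1).sum()
        total_variance = X.sum(axis=1).var(ddof=1)
        return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)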
Table 1
Reliability Estimates for Language Skill Self-assessments
Scale Number of items Coefficient alpha
“How well” scales
Listening 5 .93
Reading 5 .95
Speaking 5 .93
Writing 4 .89
Composite 19 .97
“Agreement” scales
Listening 4 .88
Reading 6 .92
Speaking 5 .89
Writing 5 .91
Comparison scales
Students in ESL classes 4 .81
Students in subject courses 4 .88
Overall English ability 4 .84
Difficulty with English 4 .85
Note. Ns for scales range from 2,235 to 2,629 due to nonresponse to some questions.
The internal consistency reliability estimates for the LanguEdge test sections were .88,
.89, .80, and .76 for the Listening, Reading, Speaking, and Writing sections, respectively. The
intercorrelations among LanguEdge section scores ranged from .57 between Reading and
Speaking to .76 between Listening and Reading. All other intercorrelations were in the mid to
high .60s.
Table 2 shows the correlations of each of the various student self-assessment scales with
performance on each section of LanguEdge. Generally, test scores related least strongly to the
scales on which students were asked to compare their abilities to those of other students. They
related most strongly, generally, to the various can-do scales (both those using a “how well”
response format and those using an “agree” format). Of the four LanguEdge tests, Listening
most often exhibited the strongest relationships to self-assessments; Speaking, the next strongest;
Reading, the next; and Writing, the least.
Table 2
Correlations of Self-assessment Scales With LanguEdge Scores
LanguEdge score
Self-assessment scale M SD Listening Reading Speaking Writing
“How well” Scales
Listening 12.8 4.1 .47(.50) .31(.33) .49(.55) .29(.33)
Reading 12.4 4.0 .46(.49) .41(.43) .42(.47) .31(.36)
Speaking 14.0 4.1 .33(.35) .18(.19) .43(.48) .19(.22)
Writing 11.4 3.1 .36(.38) .26(.28) .41(.46) .26(.30)
Composite 51.0 13.5 .46(.49) .32(.34) .48(.54) .29(.33)
“Agreement” Scales
Listening 8.4 2.9 .48(.51) .34(.36) .46(.51) .28(.32)
Reading 12.9 4.3 .51(.54) .43(.46) .44(.49) .32(.37)
Speaking 11.0 3.7 .41(.44) .28(.30) .44(.49) .26(.30)
Writing 11.4 3.7 .40(.43) .31(.33) .40(.45) .28(.32)
Composite 43.8 13.3 .49(.52) .37(.39) .48(.54) .31(.36)
Comparison Scales
Students in ESL classes 10.5 2.7 .25(.27) .14(.15) .33(.37) .16(.18)
Students in subject courses 11.1 3.1 .16(.17) .07(.07) .21(.23) .04(.05)
Overall English ability 11.2 3.0 .36(.38) .22(.23) .44(.49) .21(.24)
Difficulty with English 8.1 2.9 .40(.43) .29(.31) .40(.45) .24(.28)
Note. Ns range from 2,235 to 2,616 for Reading and Listening, from 818 to 952 for Speaking, and from 1,117 to 1,303 for Writing. The Ns differ mainly because not all responses could be scored in time to meet the data analysis schedule. Entries in parentheses have been corrected for attenuation due to unreliability of LanguEdge scores.
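The parenthesized entries reflect the standard correction for attenuation applied to the test score side only (self-assessment unreliability is left uncorrected). A one-line check, with an illustrative function name of our choosing:

    import math

    def correct_for_attenuation(r_xy, test_reliability):
        # Divide the observed correlation by the square root of the
        # test score's reliability estimate
        return r_xy / math.sqrt(test_reliability)

    # Observed r of .47 with Listening (reliability .88)
    print(round(correct_for_attenuation(0.47, 0.88), 2))  # 0.5, i.e., .47(.50)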
Faculty Ratings
Faculty returned ratings for 819 of the study participants. For 637 participants, two
ratings were available. The sample for whom faculty ratings were returned had slightly lower
LanguEdge scores on average but was reasonably representative of the total study sample in
terms of the range of test performances.
Faculty who returned rating forms described their positions or titles as follows: faculty
member (45%), teaching assistant (11%), ESL instructor (38%), and other (6%). Nearly all
respondents reported having had an opportunity to observe the student’s facility with English—
either some (17%), a moderate amount (40%), or a substantial amount (41%). (About 1% of the
respondents said they had had little if any opportunity to observe the student’s English language
skills, and so they were deleted from the analysis.) Respondents reported having taught various
numbers of nonnative speakers of English at their current and previous academic institutions, with
6% having taught fewer than 10 such students, 25% from 10 to 100, and 70% more than 100.
A scale consisting of all four faculty ratings (one for each language modality) was highly
internally consistent, exhibiting a coefficient alpha of .91. Table 3 shows the agreement statistics
between pairs of faculty raters for each of the four ratings, plus those for a fifth, which is an
overall rating of students’ language skills. As can be seen, the agreement rates are modest,
indicating that instructors did not agree completely about the English language skills of the
students they taught, possibly because they had different perspectives. Rates of exact agreement ranged
from 39% to 50%, and rates of agreement that were exact or within one point ranged from 74%
to 94%. Correlations between pairs of faculty raters ranged from .47 to .52, and Cohen’s kappa
ranged from .21 to .26. Weighted kappas ranged from .33 to .39. (Kappa values of .21 to .40
have been described by Landis and Koch [1977] as “fair.”)
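For readers who wish to compute comparable agreement statistics for their own pairs of raters, a minimal sketch follows. The report does not state which kappa weighting scheme was used, so the linear weights, the scikit-learn dependency, and the function name are our assumptions.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def agreement_stats(rater1, rater2):
        # Two raters' integer ratings (e.g., 1-5) of the same students
        r1, r2 = np.asarray(rater1), np.asarray(rater2)
        return {
            "exact": float(np.mean(r1 == r2)),
            "exact_or_adjacent": float(np.mean(np.abs(r1 - r2) <= 1)),
            "r": float(np.corrcoef(r1, r2)[0, 1]),
            "kappa": cohen_kappa_score(r1, r2),
            "weighted_kappa": cohen_kappa_score(r1, r2, weights="linear"),
        }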
Table 3
Agreement Statistics for Faculty Ratings

Faculty rating Exact agreement (%) Exact or adjacent (%) r Kappa Weighted kappa
In general, how successful has this student been:
in understanding lectures, discussions, and oral instructions 49.8 94.0 .52 .26 .39
at understanding (a) the main ideas in reading assignments and (b) written instructions for exams/assignments 47.3 92.7 .47 n.e. n.e.
at making him/herself understood by you and other students during classroom and other discussions 47.0 90.5 .51 .25 .37
at expressing ideas in writing and responding to assigned topics 44.9 89.2 .47 .21 .33
Compared to other nonnative English students you have taught, how is this student’s overall command of the English language? 38.9 73.7 .49 .21 .36
Note. N = 637 test takers for whom two faculty ratings were available. n.e. = not estimable.
Table 4 shows the correlations of faculty ratings (mean of two ratings when available)
with each of the LanguEdge section test scores. With few exceptions, these correlations are all
in the .40s, with some reaching the .50s. The correlations between the various student self-
assessment scales and faculty ratings were modest, ranging from .09 to .41, with a majority
(65%) falling in the .30s. These correlations suggest that students and faculty had different
perspectives on students’ English language skills.
Table 4
Correlations of Instructor Ratings With LanguEdge Scores

LanguEdge test score
Faculty rating Listening Reading Speaking Writing
In general, how successful has this student been:
in understanding lectures, discussions, and oral instructions .49(.52) .42(.45) .47(.53) .36(.41)
at understanding (a) the main ideas in reading assignments and (b) written instructions for exams/assignments .47(.50) .45(.48) .42(.47) .40(.46)
at making him/herself understood by you and other students during classroom and other discussions .43(.46) .36(.38) .42(.47) .35(.40)
at expressing ideas in writing and responding to assigned topics .45(.48) .43(.46) .42(.47) .42(.48)
Composite rating (sum of the four above) .52(.55) .47(.50) .51(.57) .44(.50)
Compared to other nonnative English students you have taught, how is this student’s overall command of the English language? .51(.54) .45(.48) .53(.59) .41(.47)
Note. Ns range from 400 to 465 for Writing, from 260 to 303 for Speaking, and from 716 to 819 for Listening and Reading. All correlations are significant at the .001 level or beyond. Entries in parentheses have been corrected for attenuation due to unreliability of LanguEdge scores.
Characteristics of Test Takers at LanguEdge Score Levels
Tables 5-8 show for each LanguEdge test section the relationships between score level and
both student self-assessments and instructor ratings. Table entries are percentages of either students
or instructors who gave various responses to each question. For instance, the first line in Table 5
shows, by test takers’ score level, the percentages of instructors who judged that students at the score
level had been more than moderately successful (i.e., very successful or extremely successful) in
understanding lectures, etc. Each table contains only the assessments and ratings for the language
modality matching the test section. For example, Table 5 shows that, for Listening scores, 32% of
faculty participants felt that test takers who scored at the lowest level (1-5) had been more than
moderately successful in understanding lectures, discussions, and oral instructions. On the other
hand, students who scored at the highest level on the Listening test (21-25) were judged much more
often (by 90% of faculty raters) as being more than moderately successful.
The corresponding faculty ratings for (a) Reading (success at understanding main ideas in
reading assignments and written instructions for exams/assignments), (b) Speaking (success at
making him/herself understood by faculty and students during classroom and other discussions),
and (c) Writing (success in expressing ideas in writing and responding to assigned topics) are
shown in Tables 6, 7, and 8, respectively.
Table 5
Key Descriptors of LanguEdge Learners by Listening Score Level

Test score level
Descriptor 1-5 6-10 11-15 16-20 21-25
Faculty (%)
Judging that students had been more than moderately successful at understanding lectures, discussions, and oral instructions 32 42 60 78 90
Who felt that students’ overall command of English was at least somewhat above average when compared with other nonnative students they had taught 13 22 41 68 77
Students (%)
Who agreed that they could:
• remember the most important points in a lecture 34 37 45 63 78
• understand instructors’ directions about assignments and their due dates 43 60 75 89 95
• recognize which points in a lecture are important and which are less so 33 41 57 72 84
• relate information that they hear to what they know 29 42 63 76 88
Who said they did not perform well at:
• understanding the main ideas of lectures and conversations 31 29 14 5 2
• understanding important facts and details of lectures 36 33 19 9 5
• understanding the relationships among ideas in a lecture 36 32 18 11 5
• understanding a speaker’s attitude or opinion 38 28 15 9 5
• recognizing why a speaker is saying something 43 32 16 10 4
Who felt their listening ability was lower than that of other students in ESL classes 29 22 14 10 5
Who felt that problems understanding spoken English made learning difficult 52 41 35 22 10
Table 6
Key Descriptors of LanguEdge Learners by Reading Score Level
Test score level
Descriptor 1-5 6-10 11-15 16-20 21-25
Faculty (%)
Judging that students had been more than moderately successful at understanding the main ideas in reading assignments and written instructions for exams 36 54 73 83 92
Who felt that students’ overall command of English was at least somewhat above average when compared with other nonnative students they had taught 20 33 50 66 83
Students (%)
Who agreed that they could:
• quickly find information in academic texts 42 49 62 75 86
• understand the most important points when reading an academic text 40 55 71 83 91
• figure out the meaning of unknown words by using context and background knowledge 34 43 60 71 83
• remember major ideas when reading an academic text 42 50 66 75 85
• understand charts and graphs in academic texts 42 53 73 84 91
• understand academic texts well enough to answer questions about them 41 45 63 75 85
Who said they did not perform well at:
• understanding vocabulary and grammar 25 26 13 5 2
• understanding major ideas 20 13 6 2 0
• understanding how the ideas in a text relate to each other 26 23 10 6 3
• understanding the relative importance of ideas 28 18 9 4 3
• organizing or outlining the important ideas and concepts in texts 29 26 12 7 4
Who felt their reading ability was lower than that of other students in ESL classes 18 14 5 4 2
Who felt that problems reading English made learning difficult 43 29 16 10 6
Table 7
Key Descriptors of LanguEdge Learners by Speaking Score Level
Test score level
Descriptor 1-2 2-3 3-4 4-5
Faculty (%)
Judging that students had been more than moderately successful at making him/herself understood by faculty and students during classroom and other discussions 44 57 76 86
Who felt that students’ overall command of English was at least somewhat above average when compared with other nonnative students they had taught 22 34 72 83
Students (%)
Who agreed that they could:
• state and support their opinion 31 51 68 85
• make themselves understood when asking a question 56 70 81 93
• talk for a few minutes about a familiar topic 39 66 73 90
• give prepared presentations 38 62 78 90
• talk about facts or theories they know well and explain them in English 28 55 68 82
Who said they did not perform well at:
• speaking for one minute in response to a question 53 36 23 17
• getting other people to understand them 26 16 7 5
• participating in conversations or discussions 36 25 17 6
• orally summarizing information from a lecture listened to in English 47 38 25 8
• orally summarizing information they have read in English 40 23 16 7
Who felt their speaking ability was lower than that of other students in ESL classes 19 17 11 6
Who felt that problems speaking English made it difficult to demonstrate learning 46 41 25 13
Table 8
Key Descriptors of LanguEdge Learners by Writing Score Level
Test score level
Descriptor 1-2 2-3 3-4 4-5
Faculty (%)
Judging that students had been more than moderately successful at expressing ideas in writing and responding to assigned topics 35 68 77 83
Who felt that students’ overall command of English was at least somewhat above average when compared with other nonnative students they had taught 30 61 73 88
Students (%)
Who agreed that they could:
• express ideas & arguments effectively when writing in English 43 62 70 76
• support ideas with examples or data when writing 48 63 77 77
• write texts that are long enough without writing too much 41 58 68 73
• organize text so that the reader understands the main and supporting ideas 51 69 80 85
• write more or less formally depending on the purpose and the reader 42 58 68 75
Who said they did not perform well at:
• writing an essay in class on an assigned topic 30 19 11 10
• summarizing & paraphrasing in writing information read in English 26 15 10 9
• summarizing in writing information that was listened to in English 39 30 20 14
• using correct grammar, vocabulary, spelling and punctuation when writing 39 28 16 12
Who felt their writing ability was lower than that of other students in ESL classes 20 13 8 8
Who felt that problems writing English made it difficult to demonstrate learning 38 29 18 17
Student self-assessments are shown in a similar manner in each table. For example,
Table 5 reveals that 34% of the students who obtained LanguEdge listening scores of 1-5 agreed
that they could remember important points in a lecture, whereas 78% of those at the highest level
(21-25) agreed that they could do this. We note that for all but one of the various ratings
(understanding vocabulary and grammar), percentages increase (or decrease) monotonically as
expected.
Finally, it may be useful to LanguEdge users to know how test takers viewed the various
tasks that make up the assessment, that is, how valid the tasks appeared to them. Table 9 shows the
reactions of field study participants to each of the LanguEdge tasks. As can be seen, students
generally viewed the tasks as being appropriate ones on which to demonstrate their English
language skills. With the exception of two speaking tasks (speaking about a lecture and speaking
about a reading passage), each of the tasks was deemed by nearly 80% (or more) of test takers to
have been a good way in which to demonstrate their skills.
Table 9
Test Taker Agreement With Statements About LanguEdge Tasks
Statement Percent agreeing or strongly agreeing
Writing about a general topic was a good way to demonstrate my ability to write in English.
90
This was a good test of my ability to understand conversations and lectures in English.
82
Answering questions about single points or details in the reading text was a good way for me to demonstrate my reading ability.
82
Answering questions by organizing information from the entire reading passage into a table was a good way for me to demonstrate my reading ability.
82
This was a good test of my ability to read and understand academic texts in English.
80
Writing about a reading passage was a good way to demonstrate my ability to write in English.
79
Speaking about general topics was a good way to demonstrate my ability to speak in English.
78
Writing about a lecture was a good way to demonstrate my ability to write in English.
78
Speaking about a lecture was a good way to demonstrate my ability to speak in English.
65
Speaking about a reading passage was a good way to demonstrate my ability to speak in English.
62
Note. Ns range from 2,685 to 2,694.
Discussion
Although faculty ratings and student self-assessments proved to relate only modestly to
each other, both related significantly to scores on each section of the LanguEdge assessment.
LanguEdge test scores were moderately related (correlations mostly in the .30s and .40s) to student
self-assessments. The correlations of faculty ratings with each of the LanguEdge section test
scores were generally in the .40s, with some reaching the .50s. Moreover, individually, each of
the faculty ratings and student self-assessment questions distinguished among test takers scoring
at different levels on the assessments. This was true for each of the four LanguEdge test
sections. The correlations between the various student self-assessment scales and faculty ratings
were modest, mostly in the .30s, suggesting that students and faculty had different perspectives
on students’ English language skills.
How do the correlations between self-assessments and test scores found in this study
compare with those detected in other efforts? The answer is “generally quite favorably.” For
instance, several reviews or meta-analyses have been conducted in which self-assessments have
been shown to correlate, on average, about .35 with peer and supervisor ratings (Harris &
Schaubroeck, 1988), about .29 with a variety of performance measures (Mabe & West, 1982),
about .39 with teacher evaluations (Falchikov & Boud, 1989), and in the .60s for studies dealing
with self-assessment in second and foreign languages (Ross, 1998).
The correlations computed here also compare favorably with those typically found in test
validity studies. For instance, in the context of graduate admissions, Graduate Record
Examinations® (GRE®) General Test scores generally correlate in the .20–.40 range with
graduate grade averages (Briel, O’Neill, & Scheuneman, 1993; Kuncel, Hezlett, & Ones, 2001)
and in the .30–.50 range with such criteria as faculty ratings and performance on comprehensive
examinations (Kuncel, Hezlett, & Ones, 2001).
We believe, therefore, that the validity criteria employed here (i.e., faculty judgments and
student self-assessments) may prove useful in providing additional meaning to LanguEdge test
scores. An obvious limitation of the study, however, is that we have provided no validation of
students’ self-assessments themselves. That is, we did not attempt to verify that students knew
and could actually do what they said they could do (beyond, of course, obtaining somewhat
similar ratings from faculty). Moreover, pairs of faculty members did not agree very strongly
with regard to their assessments of the students they had taught. Despite this lack of agreement
(which may simply reflect different but legitimate perspectives), LanguEdge scores correlated
significantly with faculty ratings.
The strength of the study, we believe, is that, unlike previous efforts that have relied on
internal anchors (i.e., the items constituting a test), we have enhanced test score meaning by
referencing external “anchors.” A shortcoming of this study is that relatively few anchor items
were administered, therefore precluding a more selective identification of the most
discriminating items for score interpretation. Consequently, no attempt was made to summarize
and interpret performance at the various score levels by generalizing across sets of items, as has
been the practice for internal methods for which much larger numbers of test items have usually
been available. Next steps in developing this methodology might be to take a more model-based
(rather than solely data-driven) approach in order to provide more stable estimates of the
relationships between test scores and validation criteria. In addition, a larger number of external
anchors could be administered in order to select only those that exhibit the greatest ability to
distinguish among score levels.
References
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of
Educational Statistics, 17, 191–204.
Briel, J. B., O’Neill, K. A., & Scheuneman, J. D. (Eds.). (1993). GRE technical manual: Test
development, score interpretation, and research for the Graduate Record Examinations
Program (pp. 67–88). Princeton, NJ: Educational Testing Service.
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis.
Review of Educational Research, 59, 395–430.
Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations?
Educational Measurement: Issues and Practice, 10, 3–9, 16.
Hambleton, R. K., & Slater, S. (1994, October). Using performance standards to report national
and state assessment data: Are the reports understandable and how can they be
improved? Paper presented at the Joint Conference on Standard Setting for Large-Scale
Assessments, Washington, DC.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer-
supervisor ratings. Personnel Psychology, 41, 43–62.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). Comprehensive meta-analysis of the
predictive validity of the Graduate Record Examinations: Implications for graduate
student selection and performance. Psychological Bulletin, 127, 162–181.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical
data. Biometrics, 33, 159–174.
Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and meta-
analysis. Journal of Applied Psychology, 67, 280–296.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
Washington, DC: American Council on Education.
Mullis, I. V. S., & Jenkins, L. B. (1988). The science report card: Elements of risk and recovery.
Princeton, NJ: Educational Testing Service.
Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C.
Clapham & D. Corson (Eds.), The encyclopedia of language and education: Vol. 7.
Language testing and assessment (pp. 175–187). Dordrecht, The Netherlands: Kluwer
Academic Publishers.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of
experiential factors. Language Testing, 15, 1–20.
Shrauger, J. S., & Osberg, T. M. (1981). The relative accuracy of self-predictions and judgments
by others of psychological assessment. Psychological Bulletin, 90, 322–351.
Tannenbaum, R. J., Rosenfeld, M., Breyer, F. J., & Wilson, K. (2003). Linking TOEIC scores to
self-assessments of English-language abilities: A study of score interpretation.
Manuscript submitted for publication.
Upshur, J. (1975). Objective evaluation of oral proficiency in the ESOL classroom. In L. Palmer
& B. Spolsky (Eds.), Papers on language testing 1967-1974 (pp. 53–65). Washington,
DC: TESOL.