Title: The Same Cloze for All Occasions?
Using the Brown (1980) Cloze Test for Measuring Proficiency in SLA
Research
Authors: James D. Brown
Theres Grüter
Affiliation: University of Hawai‘i at Mānoa
Address for correspondence:
James D. Brown
University of Hawai‘i at Mānoa
Department of Second Language Studies
1890 East-West Road, Moore Hall, rm570
Honolulu, HI 96822
U.S.A.
This is an Accepted Manuscript of an article to be published in International Review of Applied
Linguistics in Language Teaching (IRAL). Date of acceptance: September 12, 2019.
Abstract
Target language proficiency assessment has become an integral part of Second Language
Acquisition (SLA) research design, with cloze tests frequently serving this purpose for reasons of
practicality. Assumptions underlying the interpretation of such cloze test scores, however, are
often not examined. With the goal of providing researchers with better means for drawing
inferences from cloze test scores, we present an analysis of a combined dataset comprising
scores from 1,724 test takers on a frequently used English cloze test (Brown 1980). We examine
variation in score distributions and reliability estimates among L2 groups, between L2 and
native-speaker (NS) examinees, and for different scoring methods, and investigate the degree to
which different sets of items were effective for classifying low- vs high-proficiency L2
examinees and L2 vs NS test takers. Standardized scores are provided for each scoring method
so that future researchers can reference their scores to this larger set.
Acknowledgments
This project would not have been possible without the colleagues who generously shared their
data from the Brown Cloze test with us. Mahalo to Michelle Adams, Anna Chrabaszcz,
Takafumi Fukushima, Amanda Huensch, Kitaek Kim, Hye-Young Kwak, Sunyoung Lee, Akira
Omaki, Shoko Sasayama, John Thurman, Annie Tremblay, and Fred Zenker.
Cloze tests have often been used as a part of language placement examinations because of
their relatively well-researched psychometric properties, but also because they are easy to
develop and score, are quick to administer, and are generally free. It is likely for similar reasons
that cloze tests have become a tool of choice for many SLA researchers who want to obtain a
measure of overall language proficiency from participants in their studies, recognizing the need
for the inclusion of such measures in SLA and bilingualism research (Grosjean, 1998; Hulstijn,
2012; Tremblay, 2011). In Tremblay’s (2011) survey of 144 studies published in two SLA and
one French language journal between 2000 and 2008, she found that almost a third of the studies
that included an independent measure of proficiency chose a cloze test or C-test¹ for this
purpose. In this paper, we use aggregated data from one particular English language cloze test
(Brown 1980) to re-examine the assumptions behind the use of this test as an independent
measure of proficiency in SLA research, and to discuss more broadly some of the issues and
questions that SLA researchers should be aware of in the selection and interpretation of cloze
tests as measures of proficiency in L2 research studies.
In the field of second language testing, cloze tests have been studied for many decades. For
reviews at various stages see, among others, Alderson (1978), Brown (1978, 2002), Oller (1979),
and Tremblay (2011). An issue of key concern throughout this literature has been the question of
the construct that is being measured by cloze tests (for a recent review and discussion, see
McCray and Brunfaut 2018). In earlier work, Brown (1982: 28) had referred to the cloze test in
[Footnote 1: C-tests are similar to cloze tests in that they create blanks in a text; however, the blanks are
typically the second half of every other word in the text. See Norris (2018) for a recent collection
of C-tests in various languages.]
that study as a measure of “overall English language proficiency.” Subsequent SLA researchers
using this test have used the terms “global proficiency measure” (Huensch 2014), “general
language proficiency” (Brien 2013), “measure of L2 learners’ vocabulary, morphosyntactic
knowledge, and discourse competence” (Chrabaszcz et al. 2014), “English proficiency” (Kim
and Kim 2013, Lee 2010), or just “proficiency” (Dekydtspotter and Miller 2009). As has become
increasingly clear over the decades, however, ‘language proficiency’ is a difficult and elusive
construct to define, and hence to operationalize and measure (see e.g., Hulstijn 2011, and
Leclercq and Edmonds 2014, for detailed discussion). Thus, to assume that a single cloze test can
provide an unambiguous measure of the construct of ‘language proficiency’ would certainly be
naïve. Yet in the absence of commonly agreed upon practices for assessing language proficiency
in SLA research (Schoonen 2011: 701), the SLA researcher is faced with the practical question
of how to heed Tremblay’s (2011: 364) well-justified advice that “[d]ocumenting and controlling
for L2 learners’ proficiency in the target language should no longer be optional in experimental
SLA research that seeks to explain the linguistic knowledge and behavior of L2 learners.”
Hulstijn (2010) offers a very useful overview and discussion of the pros and cons of four
candidate measures for this purpose, one of them being cloze tests.
An oft-cited reference in the context of using cloze tests as a measure of proficiency is
Brown (1980), a paper that investigated the reliability and validity of four different scoring
methods for a 50-item English cloze test. Although the test itself was not published as part of that
paper, it was made available to other researchers upon request, and used in a number of
published and unpublished SLA studies over the years. The ways in which different researchers
have used and interpreted scores obtained from the Brown cloze test (1980; hereafter referred to
as BCT80), however, have been far from uniform, as we will illustrate in more detail below.
Given that one of the stated objectives of including independent measures of proficiency, such as
cloze tests, into SLA research designs is to increase comparability between studies (Tremblay
2011, Marian et al. 2007, Norris and Ortega 2000), such differences in use and interpretation of
scores from one and the same test are not ideal. Nonetheless, the fact that this test has been used
in many different studies with diverse learner populations struck us as providing a unique
opportunity to re-examine the properties of this test based on a much larger and more diverse
dataset by assembling data from as many of these studies as possible.
Thanks to the generosity of a number of colleagues who made their data from the BCT80
available to us, we were able to compile a dataset of scores from 1,724 test takers. This dataset
allowed us to examine descriptive statistics and reliability estimates for scores from the BCT80
based on a substantially larger sample than previously available. Critically, it also allowed us to
examine the variations among these different studies and learner groups. Researchers who used
the BCT80 often refer to its established reliability and validity as one of the main reasons for
including it in their research design (e.g., Chrabaszcz and Jiang 2014, Tremblay 2008). It is
important to bear in mind, however, that test statistics for reliability and validity are never
characteristics of a test per se but are instead characteristics of the scores based on performances
of a certain group of examinees under a certain set of conditions. Put another way, reliability
estimates and validity arguments have limited meaningfulness until they are interpreted in light
of the distributions of scores from a particular sample of test takers. For example, knowing that
reliability and validity estimates for BCT80 scores from one group of low-intermediate Japanese-
speaking learners of English were high is reassuring for the interpretation of findings from that
specific study, and it is useful information for a researcher who intends to use the same test for a
study with a different group of low-intermediate Japanese-speaking learners. However, this
information about Japanese learners is of more limited use to a researcher who intends to study a
group of highly proficient German-speaking learners of English. In other words, researchers
should be cautious in generalizing interpretations of reliability and validity statistics beyond the
sample from which they were derived and recognize the need to examine (and report) both
descriptive and testing statistics for their own BCT80 data before drawing conclusions about the
reliability and validity of this measure in their study.
While we believe that increased awareness of these often-misunderstood limitations is
important for the SLA field, we also realize that this information alone is not particularly helpful
to researchers faced with decisions on what measure of proficiency to include in their studies.
The primary practical goal of this paper is thus to provide information about the extent to which
statistical estimates derived from BCT80 scores vary between different groups of test takers, and
thereby enable researchers to better judge—based on evidence from a large and diverse sample
of test takers—whether the BCT80 is likely to provide a reliable and efficient measure of
proficiency for the participants they intend to include in their planned research. To this end, we
present descriptive statistics, reliability estimates, as well as item analyses based on our
aggregated dataset. We present separate analyses for scores derived through exact-answer (EX)
and acceptable-answer (AC) scoring, the two most frequently used scoring methods for the
BCT80. In addition, with the goal of providing a better basis for comparison between studies, we
calculated equivalent T scores, z scores, and percentile scores for both EX and AC scoring
methods across the entire dataset. We present these scores as an informal scale of reference for
future researchers who may wish to situate their participants in a specific range of English
language proficiency as measured by the BCT80. Finally, we will continue to make available the
original cloze test together with the answer key for both EX and AC scoring methods so that
researchers will have ready access to the BCT80 in the future. It is our hope that beyond the
insights on the specific test properties of the BCT80, this analysis also serves to provide an
illustration of what researchers need to consider in the selection of a proficiency measure for
research purposes more generally.
1 Assessing Global Proficiency in SLA Research
Researchers in the field of SLA are often faced with the problem of sampling from a
heterogeneous population. L2 learners differ from each other in a myriad of ways, including their
native languages, learning environments, and learning histories. This makes it difficult to
generalize findings from any specific learner group to SLA in general and to compare findings
among various studies. One common solution to this problem is for SLA researchers to
characterize the participants in their studies as precisely as possible so that readers can judge for
themselves the appropriateness of inferences and comparisons based on the findings of any given
study. An indication of learners’ overall proficiency range is often one important characteristic of
an L2 sample.
Another key issue in SLA research is learners’ development over time. Ideally, development
is assessed longitudinally, yet for well-known logistical reasons, longitudinal studies are often
not feasible. Hence, SLA researchers frequently resort to cross-sectional designs, in which they
compare more advanced with less advanced groups. This raises the obvious question of how to
operationalize and measure more vs. less advanced. In response to this problem, Thomas (1994)
surveyed 157 SLA studies published between 1988 and 1992 to find out what methods were
being used in the field to assess “target language proficiency” (p. 307). Based on this survey,
Thomas (1994) established four categories for ways that proficiency is assessed in SLA studies:
(a) impressionistic judgment, (b) institutional status (e.g., class level), (c) in-house assessment,
and (d) standardized tests (including participants’ self-reporting of standardized test scores) (p.
312). In a follow-up paper, Thomas (2006) repeated her survey on a new set of studies published
between 2000 and 2004 and found no major changes in the overall proportions of use of the four
broad types of proficiency assessment across studies but noted a shift towards probing
proficiency in finer detail and integrating measures of overall proficiency more tightly into the
research design, a development further confirmed by Tremblay (2011).
A review of SLA studies that included the BCT80 as an independent measure of proficiency
provides a good illustration of the different ways in which researchers have integrated this
measure into their research designs. One possibility is to simply provide descriptive statistics for
participants’ cloze test scores, along with other demographic information (e.g., age, L1), to
characterize the sample (see e.g., Brown, Cunha et al. 2001). This information is useful to the
extent that other researchers who wish to replicate the study or compare the results to those from
a different group can include the same test in their own design to measure potential differences
in proficiency. Its usefulness is limited, however, when characterizing the sample in descriptive
terms such as intermediate or advanced, since there are currently no conventions for associating
certain score ranges with specific descriptive categories. (Note that, below, we do provide an
informal frame of reference for BCT80 scores.)
A different use of cloze test scores within the broader design of SLA research studies is to
control for proficiency between subgroups when proficiency is not a variable of interest. For
example, Chrabaszcz and Jiang (2014) compared Russian- and Spanish-speaking learners of
English to investigate the role of L1 transfer in L2 learners’ use of articles in English. To this
end, they needed to eliminate differences in overall proficiency between the two groups as a
possible explanation for any observed differences in their use of articles. Thus, they included the
BCT80 in their design, conducted a statistical comparison between scores in the two groups, and
concluded from the absence of a significant between-group difference that “both groups of L2
English speakers were comparable in terms of their L2 proficiency” (p. 361). While this is a
useful first step towards minimizing the influence of proficiency in a context where proficiency
is not the variable of interest, we would like to caution more generally against interpreting
a non-significant effect as evidence for the absence of an effect. More cautious
strategies in this sort of situation, if allowed by the nature of the data and analysis, would be to
do additional power analyses of the group cloze test score comparison or to include cloze test
scores as a continuous co-variate in the statistical model (for this use of scores from the BCT80,
see Qin et al. 2017, Grüter et al. 2017).
Perhaps the most common use of cloze test scores in SLA research is to create subgroups of
learners by proficiency in a cross-sectional design aimed at investigating L2 development over
time. In these cases, proficiency is typically one of the main variables of interest. For example,
Kwak (2010) conducted a study with Korean learners of English to investigate potential changes
over time in learners’ interpretation of numeral quantifiers and negation in English. Kwak
divided her participants into “low” vs “high” proficiency groups based on their scores on the
BCT80, with those scoring above the group mean (Experiment 5) or the median (Experiment 6)
assigned to the “high” and those below the mean/median to the “low” group. Others used scores
from the BCT80 combined with scores from other measures of proficiency to divide learners into
subgroups representing developmental stages. For example, Tremblay (2008) used a combined
score from the BCT80 and a read-aloud task to create “intermediate,” “low advanced,” and “high
advanced” groups. These studies then typically proceed to use proficiency-subgroup as a
categorical predictor in their analysis, where significant effects of this predictor are interpreted as
an indication of developmental change. We do not take issue with this logic, in principle, but
would like to point out that it critically rests on the assumption that scores derived from the
BCT80 (alone, or in combination with another proficiency measure) can provide a good indicator
of L2 developmental stages. It is important to bear in mind that the test was not originally created
for this purpose, and there are currently no agreed-upon correspondences between certain score
ranges and developmental stages.
Moreover, different researchers have used different criteria, diverse cut-off points, and
different labels to refer to their proficiency subgroups created based on scores from the BCT80.
As a result, there are considerable discrepancies among studies in terms of the descriptions of
these groups, created (in part) based on the very same measure of proficiency. For example,
Huensch (2014), Kim and Kim (2013), and Kwak (2010) all used the BCT80 to create subgroups
among Korean learners of English. While all authors used AC scoring, the groups labelled
“high” (proficiency) in the three studies differed substantially in terms of their score ranges: 37-
47 in Huensch (2014), 32-44 in Kim and Kim (2013), and 22-41 and 24-39 in Kwak (2010,
Experiments 5 and 6, respectively). This further contrasts with a score range of 19-30 in a group
of “high” proficiency Japanese learners of English in Sasayama (2015), and a range of 34-46 in a
group of “low advanced” French Canadian learners of English in Tremblay (2008). These
discrepancies considerably complicate comparisons between studies and inferences about L2
development beyond the samples studied.
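The sample dependence of such labels can be illustrated with a short sketch. The scores below are invented for illustration and are not taken from any of the studies cited; the function name median_split and the rule for handling ties are likewise assumptions, not a description of any author's procedure.

```python
from statistics import median

def median_split(scores):
    """Split scores into 'low' and 'high' groups relative to the sample median.

    Scores equal to the median are assigned to 'low' here; this tie rule is an
    assumption made for the sketch.
    """
    cut = median(scores)
    low = [s for s in scores if s <= cut]
    high = [s for s in scores if s > cut]
    return cut, low, high

# Hypothetical AC scores (out of 50) from two invented samples.
sample_a = [12, 15, 18, 21, 22, 25, 27, 30]   # lower-scoring sample
sample_b = [28, 31, 33, 36, 38, 41, 44, 47]   # higher-scoring sample

for name, scores in (("A", sample_a), ("B", sample_b)):
    cut, low, high = median_split(scores)
    print(f"Sample {name}: median = {cut}, "
          f"'low' range = {min(low)}-{max(low)}, "
          f"'high' range = {min(high)}-{max(high)}")
```

In this invented example the "high" group of sample A (22-30) overlaps with the "low" group of sample B (28-36), which is the kind of mismatch that arises when cut-offs are defined relative to each sample rather than to a common scale.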
One way around the assignment of learners to discrete proficiency subgroups is to include
measures derived from the BCT80 as a continuous predictor in the analysis. Brown, Robson et
al. (2001), for example, included cloze test scores as one of several individual-difference
variables in a correlational analysis. In a recent study, Zenker and Schwartz (2017) used Rasch
analysis to convert raw cloze test scores to person ability estimates on a logit scale and used this
variable in a regression analysis aimed at testing for developmental effects on Chinese learners’
sensitivity to syntactic island effects in English. This use of a continuous-scale measurement of
proficiency provides more precise information than nominal-scale subgrouping based on study-
specific criteria, and is therefore preferable, if feasible and all statistical assumptions are met,
within the overall research design (see also Grosjean, 1998, and Hulstijn, 2012, for earlier
recommendations to include general proficiency measures as covariates in the final analyses).
For all of these uses of cloze test scores as an indicator of proficiency, it is essential for
researchers to understand the distribution of the total scores in their own data, as well as the
relation of this distribution to that in the wider population of L2 learners. We thus reiterate our
encouragement for researchers who decide to use the BCT80 in their research to calculate
reliability estimates and examine their validity arguments for the scores from their own samples
and interpret them with reference to the analyses of scores from our large aggregated dataset.
Before we present these re-examinations of the BCT80, we will briefly describe the nature and
development of the BCT80 itself.
2 The BCT80
The cloze test at the center of this study is usually referred to as the cloze in Brown (1980), but it
first appeared in Brown’s MA thesis (1978). According to Brown (1978), after a survey of
various texts of moderate difficulty, two passages from an intermediate ESL reader by Margaret
Kurilecz (1969) were selected and adapted to the needs of that study. An every-7th-word
deletion pattern was then used in both passages, for a total of 50 blanks in each, with two sentences
unmutilated at the beginning (for lead-in) and one at the end. The blanks were numbered, and the
examinees were asked to write their answers in corresponding blanks along the right side of the
test paper. After a first round of piloting, one passage was selected and 77 English native
speakers in freshman composition courses at UCLA took this final cloze test. Answers provided
by at least two participants in this group were included in the glossary of acceptable answers for
the AC scoring method. While in exact-answer (EX) scoring, only the originally deleted words
are counted as correct answers, in AC scoring, all (and only) the items in the AC answer glossary
are counted as correct. That paper reported K-R20 reliability estimates ranging from .89 to .95
depending on the cloze scoring method used. The study also found criterion-related validity
coefficients ranging from .88 to .91, again depending on the scoring method. These validity
coefficients were correlation coefficients between the cloze test scores and total scores on
UCLA’s standardized English as a Second Language Placement Test (the ESLPE included
reading, listening, writing, and grammar subtests).
In the journal-length version of this study, Brown (1980) concluded that the best overall
scoring method for the cloze test was the AC method. Yet as Brown (2002) discussed in more
detail, this conclusion was potentially premature, as it “failed to consider the distributions of
scores and their effects on the relative values of […] descriptive, item analysis, reliability, and
validity statistics” (p. 80). The in-depth analysis of these statistics derived from a much larger
and more diverse dataset that we present here allows us to re-examine Brown’s (1980) original
conclusion and arrive at more nuanced recommendations for the choice of scoring method in
different contexts.
3 This Study
3.1 Collection of Datasets
In summer 2015, we conducted a search on Google Scholar for publications citing Brown (1978
or 1980). We then inspected these publications for reports of original data from the BCT80. In
fall 2015, we contacted fifteen scholars who were thus identified as first or corresponding
authors of one or more publications reporting such data, to ask if they would be willing to share
their datasets with us in deidentified format for the present analyses. Twelve responded. Three of
them reported that they no longer had the data in question; nine generously agreed to share their
data with us. Of these nine, five were able to share participant-level data, i.e., a total score for
each participant; four shared item-level data, consisting of responses to each test item, i.e., 50
data points per participant. (Note that some authors contributed more than one dataset.) In
addition to the datasets obtained through this process, unpublished data from three M.A. student
projects in our department were made available to us in item-level format. Finally, we included
item-level datasets from our own prior research.
We thus collected a total of 19 datasets, consisting of data from a total of 1,724 test takers,
including 1,637 L2 and 87 native speakers (NS) of English. Thirteen of these datasets contained
item-level data; for seven of them, both EX and AC scores were available, four had only EX and
two had only AC scores. An additional six datasets contained participant-level data, all with AC
scores. The original data were collected between 1977 and 2015 in both second and foreign
language contexts, and learners’ native languages included French, Japanese, Korean, Mandarin,
Portuguese, Russian, and Spanish.
3.2 Goals and Research Questions
A key goal of our study is to draw SLA researchers’ attention to the fact that reliability estimates
vary with the distribution of scores in each sample. To this end, we use our aggregated dataset to
address the following research question:
RQ1: To what degree do score distributions and reliability estimates vary (a) among L2
groups, (b) between L2 and NS examinees, and (c) between EX vs AC scoring?
Secondly, we seek to examine the extent to which scores on the BCT80 function well for
assessing learners at different ends of the proficiency spectrum. Thus, to explore if and how the
scores on the BCT80 provide effective measures for classifying learners along the continuum of
L2 ability, we formulate the following research question:
RQ2: To what degree do different sets of items on the BCT80 contribute to the
effectiveness of overall scores for discriminating among test takers, including (a) low-
vs high-proficiency L2 examinees across studies and (b) L2 vs NS test takers?
We will address RQ2 by using item discrimination analysis to examine which items discriminate
well within different subgroups, looking separately at both EX and AC scoring. The results from
these analyses may also help researchers choose the best scoring method for learners at different
points in the proficiency continuum.
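For readers unfamiliar with item discrimination, the following is a minimal sketch of one common index: the proportion of correct answers in an upper-scoring group minus that in a lower-scoring group. The text above does not specify which discrimination statistic is used, so this choice, the use of thirds, the function name, and the toy responses are all assumptions made for illustration only.

```python
def item_discrimination(responses, groups=3):
    """Upper-lower item discrimination index for dichotomous (0/1) items.

    responses: list of per-examinee lists, one 0/1 entry per item.
    groups:    examinees are ranked by total score and the top and bottom
               1/groups of them are compared (thirds are assumed here).
    Returns one index per item: proportion correct in the upper group minus
    proportion correct in the lower group.
    """
    ranked = sorted(responses, key=sum)            # order examinees by total score
    n_group = max(1, len(ranked) // groups)        # size of each comparison group
    lower, upper = ranked[:n_group], ranked[-n_group:]
    indices = []
    for i in range(len(responses[0])):
        p_upper = sum(person[i] for person in upper) / n_group
        p_lower = sum(person[i] for person in lower) / n_group
        indices.append(p_upper - p_lower)
    return indices

# Invented responses: 6 examinees x 4 items (1 = correct, 0 = incorrect).
toy = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
print(item_discrimination(toy))
```

In the toy data, the last item is answered correctly by everyone and therefore has an index of 0: it contributes nothing to separating stronger from weaker examinees in that sample, even though it might discriminate well in a different sample.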
Finally, with the goal of providing a better basis for comparison between and across studies,
we will present equivalent raw scores, T scores, z scores, and percentile scores for both EX and
AC scoring methods for the BCT80, based on the entire dataset, in an informal table of reference.
3.3 Analysis and Results
This section will present the following analyses (more or less in the same order as the research
questions):
1. Total score results for the EX scoring of cloze tests including reliability analyses,
descriptive statistics, and box plots, as well as one-way ANOVA results comparing group
means for all 11 L2 groups and the NS group.
2. Total score results for the AC scoring of cloze tests including the same analyses for all 15
L2 groups and the NS group.
3. Item-level discrimination analyses to determine which items are discriminating for EX
and AC scoring for all L2 speakers and all NSs, as well as for high vs. low proficiency L2
subgroups.
4. Standardized-score conversions, separately for EX and AC scoring, giving raw-score, z-score,
T-score, and percentile-score equivalencies for all L2 learners.
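As a brief illustration of the conversions named in point 4, the sketch below uses the standard definitions of z scores (distance from the mean in SD units) and T scores (z rescaled to a mean of 50 and an SD of 10), together with one common definition of percentile rank. The reference scores are invented, and the choice of the population SD and of counting ties as one half are assumptions; the manuscript does not state which conventions underlie its reference table.

```python
from statistics import mean, pstdev

def z_score(raw, m, sd):
    """Distance of a raw score from the reference mean, in SD units."""
    return (raw - m) / sd

def t_score(raw, m, sd):
    """z score rescaled to a mean of 50 and an SD of 10."""
    return 50 + 10 * z_score(raw, m, sd)

def percentile_rank(raw, reference_scores):
    """Percentage of reference scores below raw, counting ties as one half."""
    below = sum(s < raw for s in reference_scores)
    equal = sum(s == raw for s in reference_scores)
    return 100 * (below + 0.5 * equal) / len(reference_scores)

# Invented reference distribution of raw cloze scores (out of 50).
reference = [5, 8, 10, 12, 14, 14, 17, 20, 25, 35]
m, sd = mean(reference), pstdev(reference)   # population SD is an assumption

for raw in (14, 16, 25):
    print(f"raw = {raw}: z = {z_score(raw, m, sd):.2f}, "
          f"T = {t_score(raw, m, sd):.1f}, "
          f"percentile = {percentile_rank(raw, reference):.0f}")
```

A researcher with a new sample would substitute the mean and SD of the reference distribution of interest for the invented values used here.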
3.3.1 Total Score Results for EX Scoring
Focus on EX reliability. The total score results for EX scoring are for 1257 L2 speakers in
11 studies (labeled here A to K) and 46 NSs. The reliability estimates for each set of data—
including Cronbach alpha and Kuder-Richardson 21 (K-R21)—are shown in Table 1. Notice that
the alpha reliability estimates in the first column of numbers are moderately high throughout and
range from .69 to .84 for the L2 studies. This means that these various sets of scores were
moderately reliable since they ranged from 69% to 84% internally consistent (i.e., doing the
same things in terms of ordering students). As would be expected for the combined samples,
Total L2 and Grand Total had considerably higher alpha reliability estimates of .90 and .91,
respectively.
The K-R21 reliability estimates are systematically lower, as has been found in a number of
cloze studies. Brown (1983, 1994) was the first to explore this phenomenon, which is probably
due to systematic violation of the assumption of equal item variances that underlies K-R21, but
not alpha. The K-R21 estimates (which are based on total scores) are presented here because
later in Table 2, there will be a number of data sets for which we had only total scores, and
therefore could not calculate alpha (because alpha must be based on item-level data). Since we
wanted some basis for comparing the reliability of those groups with each other and with the
ones in Table 1, we used K-R21 for that purpose.
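The difference between the two estimates can be made concrete with a minimal sketch based on the textbook formulas: alpha requires the variance of every item, whereas K-R21 needs only the number of items, the mean, and the variance of the total scores. The toy responses and function names are invented for illustration. The manuscript does not state which variance convention was used, so values computed this way from a table's M and SD need not reproduce the tabled K-R21 figures exactly.

```python
from statistics import mean, pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha from item-level data.

    item_scores: list of per-examinee lists, one score per item.
    Population variances are used throughout (an assumption of this sketch).
    """
    k = len(item_scores[0])
    item_vars = [pvariance([person[i] for person in item_scores]) for i in range(k)]
    total_var = pvariance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr21(k, total_mean, total_var):
    """K-R21 from the number of items and the mean and variance of total scores.

    Assumes dichotomous items of equal difficulty, which is why it tends to
    underestimate alpha when item difficulties differ.
    """
    return (k / (k - 1)) * (1 - total_mean * (k - total_mean) / (k * total_var))

# Invented responses: 6 examinees x 4 dichotomous items of unequal difficulty.
toy = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
totals = [sum(person) for person in toy]
alpha = cronbach_alpha(toy)
kr21_est = kr21(len(toy[0]), mean(totals), pvariance(totals))
print(f"alpha = {alpha:.3f}, K-R21 = {kr21_est:.3f}")
```

For these toy data the sketch gives alpha = 0.606 and K-R21 = 0.364, showing the same direction of difference as reported here: because the items differ in difficulty, K-R21 falls below alpha.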
Table 1. Reliability Estimates for EX Scoring by Study

Study        Test takers’ L1     N      M       SD      Alpha   K-R21
NS           English             46     27.76   5.82    .74     .63
A            Mandarin            82     23.07   5.50    .72     .58
B            Portuguese          88     8.51    4.38    .72     .62
C            Portuguese          270    15.22   6.51    .83     .75
D            Japanese            320    7.21    4.06    .71     .62
E            Spanish             20     23.00   7.00    .81     .74
F            Japanese, Korean    48     15.52   5.14    .69     .59
G            Japanese            112    5.51    3.89    .73     .67
H            Japanese            119    14.23   6.84    .84     .78
I            French              110    25.18   6.78    .80     .72
J            Mandarin            31     17.68   6.91    .83     .76
K            Korean              57     19.28   6.87    .82     .74
Total L2                         1257   13.52   8.42    .90     .86
Grand Total                      1303   13.95   8.73    .91     .87

We also present descriptive statistics in Table 1 because, mathematically, reliability estimates
are somewhat related to the amounts of variance in the distributions upon which they are based.
For example, the fact that the scores for the Total L2 and Grand Total groupings produced higher
reliability estimates than most of the other groupings may be largely due to the fact that those
two Total groups had larger n sizes and had wider distributions (as represented by larger SDs).
Note also that all of the groups that produced alpha reliability estimates of .80 or higher had SDs
higher than 6.50, while the groups with lower estimates had lower SDs.
Focus on EX distributions. The means in Table 1 range from 5.51 (Study G) to 25.18
(Study I) for the L2 groups, and up to 27.76 for the NS group. By any standard, this amount
of variation in means for different groups is large. The SDs also vary considerably. The same
results are displayed graphically in box-and-whiskers plots in Figure 1. This graphical
representation clearly shows (as the descriptive statistics cannot) that the groups vary greatly
from each other in terms of median and general ability levels. Notice how much overlap there is
between some of the L2 group distributions and the distribution for NSs.
Figure 1. Box Plots for EX scoring (for Studies with Item-Level Data) by Studies in Mean Order
A one-way ANOVA was performed to determine the degree to which the fluctuations in
means found here were due to chance alone. The overall ANOVA indicated that significant
differences existed among the 12 groups (F = 169.45 with df = 11 and 1291; p < .0005; η² =
.591; power = 1.000). Scheffé follow-up analyses further indicated that 41 (or about 62%) of the
possible 66 comparisons were significant (i.e., the means were different due to factors other than
chance) at .0005 (i.e., with 99.95% probability). Note also that Groups A, E, and I were not
found to be significantly different in the Scheffé analyses from the NS group, with a substantial
number of learners in those groups scoring well within the range of scores observed among
native speakers, a pattern that we discuss in more detail below (4.1).
3.3.2 Total Score Results for AC Scoring
Focus on AC reliability. The total AC scoring results are from 840 L2 speakers in 15 studies
(labeled here A to S) and 87 NSs. The alpha reliability estimates in Table 2 are moderately high
throughout with alpha coefficients ranging (more widely than those for EX scoring) from .74 to
.92 for the L2 studies. This means that the internal consistency for these various sets of scores
ranged from 74% to 92%. Again, as would be expected for the combined samples,
Total L2 and Grand Total had considerably higher alpha reliability estimates. Also, where both
alpha and K-R21 estimates were available, those for K-R21 are systematically lower.
Recall that some authors cited the original BCT80 K-R20 reliability estimate for AC scoring
of .95 (K-R20 is essentially equivalent to alpha for dichotomously scored items) and the adjusted split-half reliability estimate of .94 as
justification for using the BCT80 in their studies. However, while the scores for the Total L2 and
Grand Total groupings did turn out to be almost as reliable (K-R21 = .93 for both; see Table 2) as the original
scores in Brown (1980), the various studies’ AC scores are considerably lower in reliability than
the EX scores in that original study. Note again in Table 2 that those groupings that had the
highest standard deviations tended to have the highest reliability estimates.
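The tendency just noted can be checked informally against the figures in Table 2. The following is a minimal sketch rather than part of the original analysis: it copies the SD and K-R21 columns for the NS group and the 15 L2 studies from Table 2 and computes a Pearson correlation between them. The helper name `pearson_r` is hypothetical, and because K-R21 is itself calculated from the mean and SD, some positive relationship is to be expected by construction.

```python
from math import sqrt

# (SD, K-R21) pairs copied from Table 2: the NS group, then Studies A through S.
TABLE2_SD_KR21 = [
    (4.36, 0.71), (6.15, 0.74), (8.56, 0.87), (6.34, 0.69),
    (6.11, 0.78), (7.97, 0.87), (10.21, 0.88), (8.97, 0.85),
    (6.87, 0.74), (5.30, 0.68), (6.15, 0.83), (5.68, 0.77),
    (14.27, 0.94), (10.36, 0.88), (8.01, 0.82), (4.25, 0.55),
]


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs (hypothetical helper)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(ss_x * ss_y)


r = pearson_r(TABLE2_SD_KR21)
print(f"r(SD, K-R21) across {len(TABLE2_SD_KR21)} groups = {r:.2f}")
```

A strongly positive coefficient here would simply restate the pattern described in the text: groups with more widely spread scores tend to yield higher reliability estimates.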
Table 2. Reliability Estimates for AC Scoring by Study (for all data with total scores)
Study Test takers’ L1 N M SD Alpha K-R21
NS English 87 43.94 4.36 NA .71
A Mandarin 82 36.61 6.15 .80 .74
E Spanish 20 37.75 8.56 .91 .87
F Japanese, Korean 48 27.73 6.34 .75 .69
G Japanese 112 10.39 6.11 .83 .78
I French 110 39.39 7.97 .90 .87
J Mandarin 31 29.23 10.21 .92 .88
K Korean 57 31.53 8.97 .89 .85
L Japanese 31 19.84 6.87 .80 .74
M Mandarin 12 38.33 5.30 .74 .68
N Russian, Spanish 32 42.84 6.15 NA .83
O Russian, Mandarin 29 41.17 5.68 NA .77
P Korean 75 22.76 14.27 NA .94
Q Korean 141 22.01 10.36 NA .88
R Korean 36 32.36 8.01 NA .82
S Japanese 24 40.08 4.25 NA .55
Total L2 840 28.41 13.18 NA .93
Grand Total 927 29.87 13.41 NA .93
Focus on AC distributions. The L2 means in Table 2 are generally much higher than those
shown in Table 1, and they range from 10.39 (Study G) to 42.84 (Study N), and even higher
(43.94) for the NS group. Many of the L2 SDs in this table are also higher than those shown in
Table 1, and they vary considerably. In the graphical representation of these results in Figure 2,
notice again how much the groups vary from each other in terms of median and general ability
levels, and how much overlap there is between the distributions for many of the L2 groups and
the NS group.
Again, a one-way ANOVA was performed to determine the degree to which the fluctuations in
means found here were due to chance alone. The overall ANOVA indicated that significant
differences existed among the 16 groups (F = 99.215 with df = 15 and 911; p < .0005; eta² =
.620; power = 1.000). Scheffé follow-up analyses further indicated that 46 (or about 38.33%) of
the possible 120 comparisons were significant at the p = .0005 level. Groups A, E, I, M, N, O,
and S were not found to be significantly different in the Scheffé analyses from the NS group.
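The bookkeeping behind these ANOVA results can be verified with a few lines of arithmetic. The sketch below is illustrative only and assumes a standard one-way design, in which eta squared can be recovered from F and the two degrees of freedom. The function name `anova_check` is hypothetical, and the EX sample size of 1,303 is not stated in this section but inferred from the reported degrees of freedom (1291 + 12).

```python
from math import comb


def anova_check(n_groups, n_total, f_value):
    """Recover df, number of pairwise comparisons, and eta squared."""
    df_between = n_groups - 1
    df_within = n_total - n_groups
    ss_ratio = f_value * df_between  # equals SS_between / MS_within
    eta_squared = ss_ratio / (ss_ratio + df_within)
    return df_between, df_within, comb(n_groups, 2), eta_squared


# AC scoring: 16 groups, N = 927 (Grand Total in Table 2), F = 99.215.
ac = anova_check(16, 927, 99.215)
# EX scoring: 12 groups, F = 169.45; N = 1303 is inferred from df = 11 and 1291.
ex = anova_check(12, 1303, 169.45)

for label, (df1, df2, pairs, eta2) in (("EX", ex), ("AC", ac)):
    print(f"{label}: df = {df1} and {df2}; {pairs} comparisons; eta^2 = {eta2:.3f}")
```

Under these assumptions, the computed values agree with the reported degrees of freedom, the 66 and 120 possible pairwise comparisons, and the eta squared values of .591 and .620.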
Figure 2. Box Plots for AC scoring (for all data with total scores) by Studies in Mean Order
3.3.3 Item-Level Discrimination Analyses
Examining our cloze test data from a classical testing perspective, we first analyzed the items for
item discrimination. It is well established that items that discriminate well for a particular sample
of examinees are those that are at approximately the right level of difficulty for the examinees who
took them and that separate the high scoring examinees on the whole test from the low
scoring examinees. Items that discriminate also promote variance in the scores on the test as well
as higher reliability estimates. Conversely, items that are not discriminating well tend to be too
difficult or too easy for the examinees, do not separate the high scoring examinees from the low
scoring ones, and do not contribute substantially to the variance in the total scores on the test or
to the reliability of those scores. In other words, these non-functioning or ‘turned off’ items have
a profound effect on the overall distribution of scores on a test, as well as on the reliability
associated with those scores (see also Brown 2002).
Table 3 shows our analysis of items that discriminated above .20 for different groups within
our data. We chose the .20 level of discrimination because items at .19 or below are typically
“poor items, to be rejected or improved by revision” (Ebel 1979: 267) in the test development
process due to the fact that they are contributing very little to the test score variance and
reliability of the scores. We calculated discrimination estimates separately for four different
groups of examinees: the total group of L2 speakers taken together (TOT L2), the total group of
native-speakers taken together (TOT NS), the high proficiency L2 (HIGH L2, defined, for
present purposes, as the top third of all L2 examinees in terms of total scores), and the low
proficiency L2 (LOW L2, defined as the bottom third of all L2 examinees in terms of total
scores). The results for these four groups using EX scoring are in the first five columns of Table
3 and for the same four groups using AC scoring are in the last five. Notice that the fifty items
are labeled in the first column for each scoring method, and that each asterisk within Table 3
represents an item that discriminated at .20 or above. Cronbach alpha reliability estimates for
each group are shown at the bottom of Table 3.
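To make the .20 criterion concrete, the following sketch computes an upper-minus-lower discrimination index for each item and flags those at or above .20. This is an assumption-laden illustration: this section does not specify which discrimination statistic was calculated, so the upper-third versus lower-third index used here, the function name `discrimination_index`, and the tiny response matrix are all hypothetical.

```python
def discrimination_index(responses, item, fraction=1 / 3):
    """Upper-minus-lower discrimination index for one item.

    responses: one list of 0/1 item scores per examinee.
    """
    # Rank examinees by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(row[item] for row in upper) / n
    p_lower = sum(row[item] for row in lower) / n
    return p_upper - p_lower


# Hypothetical data: 6 examinees x 5 items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

indices = [discrimination_index(scores, i) for i in range(5)]
flags = [d >= 0.20 for d in indices]
for number, (d, keep) in enumerate(zip(indices, flags), start=1):
    print(f"item {number}: index = {d:.2f} {'*' if keep else ''}")
```

In this made-up matrix, the first item is answered correctly by everyone and the last by no one, so both receive an index of zero. They are 'turned off' in the sense described above, one for being too easy and the other for being too difficult.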
L2 Tot discrimination. Very striking at first glance in Table 3 is the fact that 39 out of 50
items for the EX TOT L2 sample are discriminating at .20 or above. Even more strikingly, all but
one (49/50) for the AC TOT L2 sample are discriminating at .20 or above. Also, the alpha
estimates for these groups are high at .90 and .95, respectively.
Table 3. Items That Discriminate Above .20 for Total L2 (for tests with item-level data),
Total NS, the Top Third of L2, & the Bottom Third of L2 Examinees
EXACT-ANSWER SCORING   ACCEPTABLE-ANSWER SCORING
ITEM #   TOT L2   TOT NS   HIGH L2   LOW L2   ITEM #   TOT L2   TOT NS   HIGH L2   LOW L2
1 * * *
1 * * * *
2 *
*
2 * * * *
3 *
*
3 *
*
4 *
*
4 * * *
5 * * *
5 *
*
6 *
*
6 *
7 * *
7 * *
*
8
* *
8 *
*
9 * * *
9 * *
*
10
* *
10 *
*
11 *
*
11 * *
*
12 * *
12 * *
*
13 *
13 *
14 *
*
14 *
*
15
* *
15
* *
16 * *
*
16 *
*
17 * *
*
17 *
*
18 * *
*
18 * *
*
19
* *
19 * * *
20 *
20 *
*
21 * * *
21 *
*
22 * * *
22 *
*
23 *
*
23 *
*
24
*
24 * *
*
25
*
25 * *
*
26
*
26 * *
*
27 *
*
27 *
*
28 * *
28 * *
*
29 * * *
29 *
*
30 * * *
30 *
*
31 * *
*
31 * *
32 *
*
32 * * * *
33 * * *
33 *
* *
34 * * *
34 *
*
35 * *
35 * *
*
36 * * *
36 * * *
37
*
37 * *
*
38 *
38 * *
39 * * *
39 * *
40 * *
40 *
*
41 * * *
41 * *
*
42
*
42 *
*
43 * *
43 * * *
44 * * *
44 *
45
* *
45 * *
*
46 * * *
46 *
*
47
*
47 * * *
48 *
48 *
* *
49 * * *
49 *
* *
50 * * *
50 * * *
Alpha (EX) .90 .74 .56 .23
Alpha (AC) .95 .80 .41 .79
NS Tot discrimination. Notice that 37 out of 50 (74%) EX items discriminate at .20 or above
in the NS data and that the scores are moderately reliable at .74. Fewer of the AC items (26/50,
52%) discriminated at .20 or above for the NS group, yet the scores had somewhat higher
reliability at .80 than the EX items.
High and low proficiency groups discrimination. For the EX scoring in the high proficiency
L2 examinees, 25 out of 50 (50%) items discriminated at .20 or higher, and the Cronbach alpha
reliability was low at .56. For the low proficiency group, a remarkably low 9 out of 50 (18%)
EX items discriminated at .20 or above, and the Cronbach alpha reliability was a very low .23.
For the AC scoring, only 13 out of 50 (26%) AC items discriminated at .20 or above for the high
proficiency L2 group, and alpha turned out to be a low .41. However, 37 out of 50 (74%) AC items
discriminated at .20 or higher for the low proficiency L2 group, and the alpha reliability was a
moderately respectable .79.
3.3.4 Standardized-Score Conversions
Table 4 shows the raw scores, z-scores, T-scores, and percentile equivalents for all L2 learners for EX
and AC scoring. All of these statistics are based on total scores of all L2 examinees with no NSs
included. The conversions for the two scoring methods differ because they are based on scores with very
different distributions, as explained in several ways above. In the first column of each set, we
show the possible raw scores. The EX raw scores range from 0 to 40 because nobody scored 41
or above, whereas the AC scores use the entire range from 0 to 50. In successive columns, we
show the z-score, T-score, and percentile equivalents. As points of reference, the mean and
standard deviation for z-scores are 0.00 and 1.00, while they are 50 and 10 for T-scores,
respectively. In addition, the means and standard deviations (SDs) in the distributions (both
positive and negative) are delineated for the reader’s convenience to the left of the scores for
each scoring method. Percentile scores represent the position in the distribution that a particular
score has in terms of the percentage of people falling below it.
For example, a participant on the EX scale with a raw score of 38 would have a z-score of
2.91 (meaning that they scored 2.91 standard deviations above the mean), a T-score of 79.07
(which means the same thing but has been transformed for a distribution with a standard
deviation of 10 and mean of 50 [i.e., (2.91 x 10) + 50 = 79.1]), and a percentile score of
99.8.
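The same conversion can be sketched in code for the AC scale, using the Total L2 mean (28.41) and SD (13.18) from Table 2. Two points are assumptions rather than statements from the text: that the percentiles are normal-curve equivalents of the z-scores, and that they are derived from z-scores already rounded to two decimals, which appears to be consistent with the tabled AC values. The function name `standard_scores` is hypothetical.

```python
from statistics import NormalDist

AC_MEAN, AC_SD = 28.41, 13.18  # Total L2 row of Table 2 (AC scoring)


def standard_scores(raw, mean, sd):
    """Return (z-score, T-score, percentile) for a raw score."""
    z = (raw - mean) / sd
    t = 10 * z + 50
    # Assumption: percentile is the normal-curve area below the rounded z-score.
    percentile = 100 * NormalDist().cdf(round(z, 2))
    return round(z, 2), round(t, 2), round(percentile, 1)


for raw in (50, 29, 0):
    z, t, pct = standard_scores(raw, AC_MEAN, AC_SD)
    print(f"raw {raw:>2}: z = {z:+.2f}, T = {t:.2f}, percentile = {pct:.1f}")
```

Researchers working with EX scores could apply the same function with the EX mean and SD for the total L2 sample, though for reporting purposes the tabled values should be treated as authoritative.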
Table 4. EX and AC Raw-Score, T-Score, and Percentile Equivalencies for all L2 Learners
Exact-Answer Scoring (EX)                 | Acceptable-Answer Scoring (AC)
Marker  Raw  z-score  T-score  Percentile | Marker  Raw  z-score  T-score  Percentile
         50       NA       NA          NA |          50     1.64    66.38        94.9
         49       NA       NA          NA |          49     1.56    65.62        94.1
         48       NA       NA          NA |          48     1.49    64.86        93.2
         47       NA       NA          NA |          47     1.41    64.10        92.1
         46       NA       NA          NA |          46     1.33    63.35        90.8
         45       NA       NA          NA |          45     1.26    62.59        89.6
         44       NA       NA          NA |          44     1.18    61.83        88.1
         43       NA       NA          NA |          43     1.11    61.07        86.7
         42       NA       NA          NA | +1 SD    42     1.03    60.31        84.8
         41       NA       NA          NA |          41     0.96    59.55        83.1
         40     3.15    81.45        99.9 |          40     0.88    58.79        81.1
+3 SD    39     3.03    80.26        99.9 |          39     0.80    58.03        78.8
         38     2.91    79.07        99.8 |          38     0.73    57.28        76.7
         37     2.79    77.89        99.7 |          37     0.65    56.52        74.2
         36     2.67    76.70        99.6 |          36     0.58    55.76        71.9
         35     2.55    75.51        99.5 |          35     0.50    55.00        69.1
         34     2.43    74.32        99.3 |          34     0.42    54.24        66.3
         33     2.31    73.14        99.0 |          33     0.35    53.48        63.7
         32     2.19    71.95        98.6 |          32     0.27    52.72        60.6
+2 SD    31     2.08    70.76        98.1 |          31     0.20    51.97        57.9
         30     1.96    69.57        97.5 |          30     0.12    51.21        54.8
         29     1.84    68.38        96.7 | Mean     29     0.04    50.45        51.6
         28     1.72    67.20        95.7 |          28    -0.03    49.69        48.8
         27     1.60    66.01        94.5 |          27    -0.11    48.93        45.6
         26     1.48    64.82        93.1 |          26    -0.18    48.17        42.9
         25     1.36    63.63        91.3 |          25    -0.26    47.41        39.7
         24     1.24    62.45        89.3 |          24    -0.33    46.65        37.1
         23     1.13    61.26        87.1 |          23    -0.41    45.90        34.1
+1 SD    22     1.01    60.07        84.4 |          22    -0.49    45.14        31.2
         21     0.89    58.88        81.3 |          21    -0.56    44.38        28.8
         20     0.77    57.70        77.9 |          20    -0.64    43.62        26.1
         19     0.65    56.51        74.2 |          19    -0.71    42.86        23.9
         18     0.53    55.32        70.2 |          18    -0.79    42.10        21.5
         17     0.41    54.13        65.9 |          17    -0.87    41.34        19.2
         16     0.29    52.95        61.4 | -1 SD    16    -0.94    40.58        17.4
         15     0.18    51.76        57.1 |          15    -1.02    39.83        15.4
Mean     14     0.06    50.57        52.4 |          14    -1.09    39.07        13.8
         13    -0.06    49.38        47.6 |          13    -1.17    38.31        12.1
         12    -0.18    48.19        42.9 |          12    -1.25    37.55        10.6
         11    -0.30    47.01        38.2 |          11    -1.32    36.79         9.3
         10    -0.42    45.82        33.7 |          10    -1.40    36.03         8.1
          9    -0.54    44.63        29.5 |           9    -1.47    35.27         7.1
          8    -0.66    43.44        25.5 |           8    -1.55    34.51         6.1
          7    -0.77    42.26        22.1 |           7    -1.62    33.76         5.3
-1 SD     6    -0.89    41.07        18.7 |           6    -1.70    33.00         4.5
          5    -1.01    39.88        15.6 |           5    -1.78    32.24         3.8
          4    -1.13    38.69        12.9 |           4    -1.85    31.48         3.2
          3    -1.25    37.51        10.6 | -2 SD     3    -1.93    30.72         2.7
          2    -1.37    36.32         8.5 |           2    -2.00    29.96         2.3
          1    -1.49    35.13         6.8 |           1    -2.08    29.20         1.9
          0    -1.61    33.94         5.4 |           0    -2.16    28.44         1.5
These standardized scores may prove useful to researchers who use the BCT80 in their studies
because they can now reference their EX or AC scores to this larger sample derived from
multiple sets of data. They can also describe their individual sample in terms other than high and
low proficiency, at least in terms of what high and low proficiency mean relative to the results of
other studies. These standardized scores may also prove useful in interpreting the already
published studies (including those used in this paper) for comparison purposes.
For example, if a team of researchers used the EX scoring method to score this cloze test in
their study and divided their sample in halves with the 50% in a high proficiency group scoring
28 to 39 and the 50% in a low proficiency group scoring 16 to 27, they could further cite Table 4
and describe their high proficiency group as ranging from the 95.7th to the 99.9th percentile,
while their low proficiency group ranged from the 61.4th to the 94.5th percentile in the larger
sample included in this paper. Perhaps another researcher using the AC scoring method in her
study divided her participants into halves with 50% in a high proficiency group for those scoring
15 to 27 and the 50% in a low proficiency group for those with 11 to 14. This researcher could
also cite Table 4 and describe her high proficiency group as ranging from the 15.4th to the 45.6th
percentile, while her low proficiency group ranged from the 9.3rd to the 13.8th percentile in the
larger sample included in this paper. More importantly, the second researcher and her readers
could all interpret her results in light of the fact that the lowest scoring “low proficiency”
participant in the first study was higher than the highest performing participant in this second
study.
Notice also that the AC scoring distribution only goes up to the 94.9th percentile. Thus, the
EX scoring, which continues to differentiate among those examinees between the 95.7th and
99.9th percentiles (see the raw scores ranging from 28 to 40), generally measures
better than the AC scoring at the top end of the distribution.
4 Discussion
We will begin here by briefly summarizing the outcomes of our analyses in light of our two
research questions, followed by discussion of the implications of these findings for making
scoring decisions (4.1), and for dealing with reliability issues (4.2).
RQ1: To what degree do score distributions and reliability estimates vary (a) among L2
groups, (b) between L2 and NS examinees, and (c) between EX vs AC scoring?
Our first goal in this study was to draw SLA researchers’ attention to the ways reliability
estimates vary depending on the distributions of scores in each sample.
Variations among L2 groups. In terms of both reliability and score distributions, we found
considerable variation across studies, both for EX (Table 1) and AC (Table 2) scoring. For both
scoring methods, Total L2 and Grand Total scores were reliable at .90 and above, similar to what
was reported in Brown (1980). Yet reliability estimates for individual samples ranged from .71
to .84 for EX scoring, and from .55 to .94 for AC scoring.
Variations between L2 and NS examinees. For both AC and EX scoring, we noted how
wide the distribution was for the NSs, and how much overlap there was with the distributions for
several L2 groups, particularly for AC scoring. The substantial variability within the NS group is
notable and aligns well with accounts arguing that L1 proficiency is far from homogeneous,
especially with regard to skills that involve ‘higher language cognition’ (HLC, Hulstijn 2011),
such as tasks requiring reading and writing. NSs’ educational attainment is a well-known
predictor for performance on HLC tasks (see Hulstijn 2011 for review) and has also been argued
to explain significant variance in NSs’ performance on tasks more narrowly focused on basic
language skills, such as grammatical processing (Dąbrowska 2012, Pakulak and Neville 2010).
The fact that we found moderate reliability for the BCT80 with an NS group, with a majority of
items discriminating, suggests that the observed variability among NS is systematic. Thus,
whenever comparison between an L2 and an NS control group for language proficiency is part of
the research design of an SLA study, it should be borne in mind that the comparison is not with
an L1 baseline, but with an L1 base-distribution.
Variations in EX vs AC scoring. Means were generally higher and SDs larger for AC scoring
(Table 2) than for EX scoring (Table 1). This of course is unsurprising since correct answers in
EX scoring are a subset of those in AC scoring. The larger SDs in AC scoring indicate that the
AC scores were more varied than those for EX scoring, which in turn may explain in part why
the AC scoring produced higher reliability estimates for all groups scored both ways.
RQ2: To what degree do different sets of items on the BCT80 contribute to the
effectiveness of overall scores for discriminating among test takers, including (a) low- vs
high-proficiency L2 examinees across studies and (b) L2 vs NS test takers?
Our second goal was to examine if and how the scores on the BCT80 provide effective measures
for classifying learners along a continuum of L2 ability. We addressed RQ2 by using item
discrimination analysis to examine which items distinguish well between high and low scorers
within subgroups. For all L2 speakers taken together, Table 3 showed that 39 out of 50 EX-scored
items were discriminating at .20 or above and therefore contributing to the variance and reliability of the
scores, while 49 out of 50 of the AC items were discriminating. For NSs in
total, 37 out of 50 of the EX items discriminated, while 26 out of 50 of the AC items did so.
For the high and low proficiency L2 groups, for EX scoring, 25 out of 50 items discriminated
for high proficiency L2 examinees, while for low proficiency L2 examinees, only nine out of 50
discriminated. For AC scoring, only 13 out of 50 items discriminated for high proficiency L2
speakers. Thus, for groups with restricted proficiency ranges, fewer items tended to discriminate.
Note also in Table 3 that none of the EX items that discriminated for the high proficiency and
low proficiency examinees were the same. Similarly, for the AC scoring, seven of the 13 items
that discriminated for the high proficiency group did not do so for the low proficiency group. All
of this is consistent with Brown’s (2002: 108) observation that “…the items that are working
well for students at different levels of proficiency appear to be quite different”.
4.1 Implications for Making Scoring Decisions
The results for RQs 1 and 2 have useful implications for researchers needing to make choices
of scoring method for learners at specific proficiency levels. As indicated in Table 2, if a
researcher’s group has low proficiency (like, say, Group G in Table 2 with a mean of 10.39, SD of
6.11, and reliability of .83), it appears that AC scoring will work better. If the group has high
proficiency (like Group I in Table 1 with a mean of 25.18, SD of 6.78, and reliability of .84), it
appears that EX will generally work better. And, if the group has a wide range of scores (like the
total L2 sample), the EX scoring will probably work marginally better. What all of this illustrates
is that one strength of cloze is that only those items that are appropriate for the sample at hand
will discriminate, while the others are essentially switched off in the sense that they will not
contribute to the item or test score variance. The BCT80 thus constitutes a fairly flexible
instrument given that researchers can use the EX or AC scoring method, whichever is most likely
to fit the level of the group(s) and range of scores involved in their particular study so that the
items will be at the appropriate level of difficulty and therefore be most likely to discriminate.
We have demonstrated here that, for groups with wide score ranges, larger numbers of items are
more likely to discriminate between high and low performing students and therefore produce
more reliable total scores. For heterogeneous L2 groups that are generally high in proficiency, it
appears that EX scoring is likely to have larger numbers of items that discriminate between high
and low performing students within that group and therefore produce more reliable total scores.
In contrast, for heterogeneous L2 groups that are generally low in proficiency, it appears that AC
scoring is much more likely to have larger numbers of items that discriminate between high and
low performing students within that group and therefore produce more reliable total scores.
An additional goal in this study was to provide a basis for comparing the results between and
among studies that use the BCT80. Thus, we presented the equivalent raw scores, z-scores,
T-scores, and percentile scores in Table 4 for both EX and AC scoring methods, calculated across
the entire L2 dataset, as an informal scale of reference. While our aggregated dataset should not
be considered a norming sample, it is nevertheless based on data from a large number of L2
learners from across the proficiency spectrum. It can thus serve as the basis for a useful common
frame of reference for future researchers reporting EX or AC scores from the BCT80. Instead of
describing their learner groups in terms of arbitrarily defined ‘high’ and ‘low’ proficiency
groups, this scale will provide researchers with the option to define their (sub)groups in terms of
specific percentile bands on this reference scale, thus hopefully making future comparisons
across studies more transparent.
Much of this paper has focused critically on potential problems and limitations of the BCT80
when used in experimental SLA research. Nonetheless, we only know these limitations because
so much research has been done on cloze generally, and with the BCT80 in particular. This
knowledge constitutes a decided advantage in that it allows researchers to gauge with somewhat
more confidence what kind of variance this measure can and cannot capture among the learners
in their study. Given the long track record and flexibility of the scoring methods for the BCT80,
researchers may wish to continue using the BCT80 in their studies largely because of the
flexibility it provides, that is, they can choose the EX and/or AC scoring methods after they see
how well each scoring method works and thereby use whichever provides the more reliable and
perhaps valid decisions about their study participants’ English language proficiency.
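One way to "see how well each scoring method works" is to compute an internal-consistency estimate for the same sample scored both ways. The sketch below is a hedged illustration, not the procedure used in this paper: the function name `cronbach_alpha` and both response matrices are hypothetical, and the AC matrix is constructed so that every answer counted correct under EX is also correct under AC.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix with one row of item scores per examinee."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of the same 6 examinees to 5 items under each method.
ex_scores = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
ac_scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

alpha_ex = cronbach_alpha(ex_scores)
alpha_ac = cronbach_alpha(ac_scores)
print(f"alpha (EX) = {alpha_ex:.2f}; alpha (AC) = {alpha_ac:.2f}")
```

With real data, a researcher would report both estimates alongside the descriptive statistics and explain which scoring method was retained and why.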
4.2 Implications for Dealing with Reliability Issues
In a number of individual studies examined in this paper, reliability estimates were
considerably lower than the values reported in Brown (1980). Generally, what this means to
researchers is that they should never assume that BCT80 scores in their studies will be
distributed as they were in the original study, nor that the scores will be as reliable as they were
in Brown (1980). Indeed, it is reasonable to expect reliability estimates to vary due to differences
in means and distributions, especially with regard to differences in score ranges. In fact, the high
reliability estimates for both EX and AC scoring found in the original Brown (1980) study may
have been due to the way that researcher created a test that produced fairly well-centered and
widely varying score distributions suitable for the students taking the UCLA ESLPE.
So what is a researcher to do? Certainly, SLA researchers who use the BCT80 can refer to its
established reliability and validity (as reported in Brown 1980), but they should also calculate
and discuss the descriptive statistics and reliability estimates that resulted in their specific study
as well because score distributions and reliability are not characteristics of a test, but rather are
characteristics of the scores on a test when it is administered to specific examinees under specific
conditions. Further decisions must then depend on the specific purpose and context of the study
in question:
• If the purpose of the study is to use the cloze scores to create proficiency groups, and
moderate to high reliability is found for the cloze test, researchers can be reasonably
confident in their classification of students into groups.
• If the purpose is to use the cloze scores to simply document the proficiency of the L2 learners
in comparison to L2 learners who were tested the same way in other studies, then low
reliability estimates should certainly be reported, but the reliability issue can be discussed in
reference to Table 4 of this paper, as well as in terms of the problems that small sample sizes
and/or restrictions in range create for the reliability of the BCT80 cloze, as would be true for
any test (see e.g., Brown 1984).
• Regardless of their purpose, researchers might be wise to refer to the informal scale of
reference supplied in Table 4 of this paper to help situate groups of participants relative to
each other and to the overall range of English language proficiency that has been measured
reliably in the aggregated data for this test.
• Researchers who are tempted to add additional acceptable answers to their AC answer
keys because those answers seem correct (though they are not included in the original answer
key) should first consider the benefits of using the BCT80 answer keys exactly as they were
originally used in Brown (1980), in terms of both scoring reliability and the comparability of
scores to those in Table 4 (and to those in any other studies that have followed those same
procedures).
5 Conclusion
The analyses we have presented in this paper, based on an aggregated dataset of scores from
1,724 test takers in 19 data sets, have demonstrated that the BCT80 can be a reliable and useful
option for SLA researchers looking for a one-shot global measure of English proficiency for
purposes of classifying research participants along a continuum of language ability. While in the
context of high-stakes educational assessment, computer-adaptive proficiency testing, in which
test takers’ initial responses determine the difficulty of subsequent items, is increasingly being
used to address the difficulty of creating tests appropriate for all skill levels, such methods are
typically not available, or affordable, to SLA researchers, nor are they easily deliverable within
the time-constraints of a typical research test session. For these reasons, SLA researchers often
turn to more easily deliverable measures, such as the publicly accessible LexTALE
(www.lextale.com), which provides a quick score of ‘proficiency’. Yet, as acknowledged by its
authors (Lemhöfer and Broersma 2012: 340), the validity of short tests such as this may depend
on the skill level of test takers, with the initial validation study indicating that the LexTALE is
appropriate primarily for “medium- to high-proficiency” learners. The BCT80, whose flexibility
and effectiveness with learners across the language ability continuum we have examined and
illustrated in some detail here, can present a reliable and practical alternative for SLA
researchers. It is likely, but remains to be demonstrated, that these benefits also apply to other
carefully constructed cloze tests. We hope that the analyses presented in this paper will
encourage and enable SLA researchers to examine other available measures of proficiency for
their suitability in the contexts in which they are to be employed.
References
Alderson, J. C. 1978. A study of the cloze procedure with native and non-native speakers of
English. Edinburgh, Scotland: Doctoral dissertation.
Brown, J. D. 1978. Correlational study of four methods for scoring cloze tests. Los Angeles, CA:
UCLA MA Thesis.
Brown, J. D. 1980. Relative merits of four methods for scoring cloze tests. Modern Language
Journal 64. 311-317.
Brown, J. D. 1982. Testing EFL reading comprehension in engineering English. Los Angeles,
CA: UCLA dissertation.
Brown, J. D. 1983. A closer look at cloze: Validity and reliability. In J. W. Oller, Jr. (ed.), Issues
in language testing research, 237-250. Rowley, MA: Newbury House.
Brown, J. D. 1984. A cloze is a cloze is a cloze? In J. Handscombe, R. Orem & B. Taylor
(eds.), On TESOL '83: The question of control, 109-119. Selected papers from the 17th
Annual TESOL Convention, Toronto. Washington, DC: TESOL.
Brown, J. D. 1994. A closer look at cloze: Validity and reliability. In J. W. Oller, Jr. & J. Jonz
(eds.), Cloze and coherence, 189-196. Lewisburg, PA: Associated University Presses.
Brown, J. D. 2002. Do cloze tests work? Or, is it just an illusion? Second Language Studies
21(1). 79-125.
Brown, J. D., M. I. A. Cunha & S. de F. N. Frota. 2001. The development and validation of a
Portuguese version of the Motivated Strategies for Learning Questionnaire. In Z. Dörnyei &
R. Schmidt (eds.), Motivation and second language acquisition, 257-280. Honolulu, HI:
Second Language Teaching & Curriculum Center, University of Hawai‘i Press.
Brown, J. D., G. Robson & P. Rosenkjar. 2001. Personality, motivation, anxiety, strategies, and
language proficiency of Japanese students. In Z. Dörnyei & R. Schmidt (eds.), Motivation
and second language acquisition, 361-398. Honolulu, HI: Second Language Teaching &
Curriculum Center, University of Hawai‘i Press.
Brien, C. 2013. Neurophysiological evidence of a second language influencing lexical ambiguity
resolution in the first language. Ottawa, Canada: University of Ottawa dissertation.
https://ruor.uottawa.ca/bitstream/10393/26223/1/Brien_Christie_2013_thesis.pdf (accessed 5
February 2019)
Chrabaszcz, A. & N. Jiang. 2014. The role of the native language in the use of the English
nongeneric definite article by L2 learners: A cross-linguistic comparison. Second Language
Research 30. 351-379.
Chrabaszcz, A., M. Winn, C. Y. Lin & W. J. Idsardi. 2014. Acoustic Cues to Perception of Word
Stress by English, Mandarin, and Russian Speakers. Journal of Speech, Language, and
Hearing Research 57. 1468-1479.
Dąbrowska, E. 2012. Different speakers, different grammars: Individual differences in native
language attainment. Linguistic Approaches to Bilingualism 2. 219-253.
Dekydtspotter, L., & A. K. Miller. 2009. Probing for intermediate traces in the processing of
long-distance wh-dependencies in English as a second language. In M. Bowles, T. Ionin, S.
Montrul & A. Tremblay (eds.), Proceedings of the 10th Generative Approaches to Second
Language Acquisition Conference (GASLA 2009), 113-124. Somerville, MA: Cascadilla
Proceedings Project.
Ebel, R. L. 1979. Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ:
Prentice-Hall.
Grosjean, F. 1998. Studying bilinguals: Methodological and conceptual issues. Bilingualism:
Language and Cognition 1. 131-149.
Grüter, T., H. Rohde & A. J. Schafer. 2017. Coreference and discourse coherence in L2: The
roles of grammatical aspect and referential form. Linguistic Approaches to Bilingualism 7.
199-229.
Huensch, A. 2014. The perception and production of palatal codas by Korean L2 learners of
English. Urbana, IL: University of Illinois at Urbana-Champaign dissertation.
http://hdl.handle.net/2142/46739 (accessed 5 February 2019)
Hulstijn, J. H. 2010. Measuring second language proficiency. In E. Blom & S. Unsworth (eds.),
Experimental methods in language acquisition research (EMLAR), 185-199. Amsterdam:
John Benjamins.
Hulstijn, J. H. 2011. Language proficiency in native and non-native speakers: An agenda for
research and suggestions for second-language assessment. Language Assessment Quarterly
8. 229-249.
Hulstijn, J. H. 2012. The construct of language proficiency in the study of bilingualism from a
cognitive perspective. Bilingualism: Language and Cognition 15. 422-433.
Kim, K. & H. Kim. 2013. L1 Korean transfer in processing L2 English passive sentences. In E.
Voss, S. D. Tai, & Z. Li (eds.), Selected Proceedings of the 2011 Second Language Research
Forum, 118-128. Somerville, MA: Cascadilla Proceedings Project.
Kurilecz, M. 1969. Man and his world: A structured reader. New York: Crowell.
Kwak, H. Y. 2010. Scope interpretation in first and second language acquisition: Numeral
quantifiers and negation. Honolulu, HI: University of Hawai‘i at Mānoa dissertation.
http://www.ling.hawaii.edu/graduate/Dissertations/HyeYoungKwakFinal.pdf (accessed 5
February 2019)
Leclercq, P. & A. Edmonds. 2014. How to assess L2 proficiency? An overview of proficiency
assessment research. In P. Leclercq, A. Edmonds & H. Hilton (eds.), Measuring L2
proficiency: Perspectives from SLA, 3-23. Bristol, UK: Multilingual Matters.
Lee, S. 2010. Interpretation of scope by Korean L2 learners of English: A self-paced reading
study. English Teaching 65. 59-78.
Lemhöfer, K. & M. Broersma. 2012. Introducing LexTALE: A quick and valid Lexical Test for
Advanced Learners of English. Behavior Research Methods 44. 325-343.
Marian, V., H. K. Blumenfeld & M. Kaushanskaya. 2007. The Language Experience and
Proficiency Questionnaire (LEAP-Q): Assessing language profiles in bilinguals and
multilinguals. Journal of Speech, Language, and Hearing Research 50. 940-967.
McCray, G. & T. Brunfaut. 2018. Investigating the construct measured by banked gap-fill items:
Evidence from eye-tracking. Language Testing 35. 51-73.
Norris, J. M. (ed.). 2018. Developing C-tests for estimating proficiency in foreign language
research. New York: Peter Lang.
Norris, J. M. & L. Ortega. 2000. Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning 50. 417-528.
Oller, J. W. Jr. 1979. Language tests at school: A pragmatic approach. London: Longman.
Pakulak, E. & H. J. Neville. 2010. Proficiency differences in syntactic processing of native
speakers indexed by event-related potentials. Journal of Cognitive Neuroscience 22. 2728-
2744.
Qin, Z., Y. F. Chien & A. Tremblay. 2017. Processing of word-level stress by
Mandarin-speaking second language learners of English. Applied Psycholinguistics 38. 541-570.
Sasayama, S. 2015. Validating the assumed relationship between task design, cognitive
complexity, and second language task performance. Washington, DC: Georgetown
University dissertation. https://repository.library.georgetown.edu/handle/10822/1029904
(accessed 5 February 2019)
Schoonen, R. 2011. How language ability is assessed. In E. Hinkel (ed.), Handbook of Research
in Second Language Teaching and Learning (Vol. II), 701-716. London: Routledge.
Thomas, M. 1994. Assessment of L2 proficiency in second language acquisition research.
Language Learning 44. 307-336.
Thomas, M. 2006. Research synthesis and historiography: The case of assessment of second
language proficiency. In J. Norris & L. Ortega (eds.), Synthesizing research on language
learning and teaching, 279-298. Philadelphia, PA: John Benjamins.
Tremblay, A. 2008. Is second language lexical access prosodically constrained? Processing of
word stress by French Canadian second language learners of English. Applied
Psycholinguistics 29. 553-584.
Tremblay, A. 2011. Proficiency assessment standards in second language acquisition research:
“Clozing” the Gap. Studies in Second Language Acquisition 33. 339-372.
Zenker, F. & B. D. Schwartz. 2017. Topicalization from adjuncts in English vs. Chinese vs.
Chinese-English interlanguage. In M. LaMendola & J. Scott (eds.), Proceedings of the 41st
annual Boston University Conference on Language Development, 806-819. Somerville, MA:
Cascadilla Press.