Title: The Same Cloze for All Occasions?

Using the Brown (1980) Cloze Test for Measuring Proficiency in SLA

Research

Authors: James D. Brown

Theres Grüter

Affiliation: University of Hawai‘i at Mānoa

Address for correspondence:

James D. Brown

University of Hawai‘i at Mānoa

Department of Second Language Studies

1890 East-West Road, Moore Hall, rm570

Honolulu, HI 96822

U.S.A.

[email protected]

This is an Accepted Manuscript of an article to be published in International Review of Applied

Linguistics in Language Teaching (IRAL). Date of acceptance: September 12, 2019.

Abstract

Target language proficiency assessment has become an integral part of Second Language

Acquisition (SLA) research design, with cloze tests frequently serving this purpose for reasons of

practicality. Assumptions underlying the interpretation of such cloze test scores, however, are

often not examined. With the goal of providing researchers with better means for drawing

inferences from cloze test scores, we present an analysis of a combined dataset comprised of

scores from 1,724 test takers on a frequently used English cloze test (Brown 1980). We examine

variation in score distributions and reliability estimates among L2 groups, between L2 and

native-speaker (NS) examinees, and for different scoring methods, and investigate the degree to

which different sets of items were effective for classifying low- vs high-proficiency L2

examinees and L2 vs NS test takers. Standardized scores are provided for each scoring method

so that future researchers can reference their scores to this larger set.

Acknowledgments

This project would not have been possible without the colleagues who generously shared their

data from the Brown Cloze test with us. Mahalo to Michelle Adams, Anna Chrabaszcz,

Takafumi Fukushima, Amanda Huensch, Kitaek Kim, Hye-Young Kwak, Sunyoung Lee, Akira

Omaki, Shoko Sasayama, John Thurman, Annie Tremblay, and Fred Zenker.

Cloze tests have often been used as a part of language placement examinations because of

their relatively well-researched psychometric properties, but also because they are easy to

develop and score, are quick to administer, and are generally free. It is likely for similar reasons

that cloze tests have become a tool of choice for many SLA researchers who want to obtain a

measure of overall language proficiency from participants in their studies, recognizing the need

for the inclusion of such measures in SLA and bilingualism research (Grosjean, 1998; Hulstijn,

2012; Tremblay, 2011). In Tremblay’s (2011) survey of 144 studies published in two SLA and

one French language journal between 2000 and 2008, she found that almost a third of the studies

that included an independent measure of proficiency chose a cloze test or C-test¹ for this

purpose. In this paper, we use aggregated data from one particular English language cloze test

(Brown 1980) to re-examine the assumptions behind the use of this test as an independent

measure of proficiency in SLA research, and to discuss more broadly some of the issues and

questions that SLA researchers should be aware of in the selection and interpretation of cloze

tests as measures of proficiency in L2 research studies.

¹ C-tests are similar to cloze tests in that they create blanks in a text; however, the blanks are

typically the second half of every other word in the text. See Norris (2018) for a recent collection

of C-tests in various languages.

In the field of second language testing, cloze tests have been studied for many decades. For

reviews at various stages see, among others, Alderson (1978), Brown (1978, 2002), Oller (1979),

and Tremblay (2011). An issue of key concern throughout this literature has been the question of

the construct that is being measured by cloze tests (for a recent review and discussion, see

McCray and Brunfaut 2018). In earlier work, Brown (1982: 28) had referred to the cloze test in

that study as a measure of “overall English language proficiency.” Subsequent SLA researchers

using this test have used the terms “global proficiency measure” (Huensch 2014), “general

language proficiency” (Brien 2013), “measure of L2 learners’ vocabulary, morphosyntactic

knowledge, and discourse competence” (Chrabaszcz et al. 2014), “English proficiency” (Kim

and Kim 2013, Lee 2010), or just “proficiency” (Dekydtspotter and Miller 2009). As has become

increasingly clear over the decades, however, ‘language proficiency’ is a difficult and elusive

construct to define, and hence to operationalize and measure (see e.g., Hulstijn 2011, and

Leclercq and Edmonds 2014, for detailed discussion). Thus to assume that a single cloze test can

provide an unambiguous measure of the construct of ‘language proficiency’ would certainly be

naïve. Yet in the absence of commonly agreed upon practices for assessing language proficiency

in SLA research (Schoonen 2011: 701), the SLA researcher is faced with the practical question

of how to heed Tremblay’s (2011: 364) well-justified advice that “[d]ocumenting and controlling

for L2 learners’ proficiency in the target language should no longer be optional in experimental

SLA research that seeks to explain the linguistic knowledge and behavior of L2 learners.”

Hulstijn (2010) offers a very useful overview and discussion of the pros and cons of four

candidate measures for this purpose, one of them being cloze tests.

An oft-cited reference in the context of using cloze tests as a measure of proficiency is

Brown (1980), a paper that investigated the reliability and validity of four different scoring

methods for a 50-item English cloze test. Although the test itself was not published as part of that

paper, it was made available to other researchers upon request, and used in a number of

published and unpublished SLA studies over the years. The ways in which different researchers

have used and interpreted scores obtained from the Brown cloze test (1980; hereafter referred to

as BCT80), however, have been far from uniform, as we will illustrate in more detail below.

Given that one of the stated objectives of including independent measures of proficiency, such as

cloze tests, into SLA research designs is to increase comparability between studies (Tremblay

2011, Marian et al. 2007, Norris and Ortega 2000), such differences in use and interpretation of

scores from one and the same test are not ideal. Nonetheless, the fact that this test has been used

in many different studies with diverse learner populations struck us as providing a unique

opportunity to re-examine the properties of this test based on a much larger and more diverse

dataset by assembling data from as many of these studies as possible.

Thanks to the generosity of a number of colleagues who made their data from the BCT80

available to us, we were able to compile a dataset of scores from 1,724 test takers. This dataset

allowed us to examine descriptive statistics and reliability estimates for scores from the BCT80

based on a substantially larger sample than previously available. Critically, it also allowed us to

examine the variations among these different studies and learner groups. Researchers who used

the BCT80 often refer to its established reliability and validity as one of the main reasons for

including it in their research design (e.g., Chrabaszcz and Jiang 2014, Tremblay 2008). It is

important to bear in mind, however, that test statistics for reliability and validity are never

characteristics of a test per se but are instead characteristics of the scores based on performances

of a certain group of examinees under a certain set of conditions. Put another way, reliability

estimates and validity arguments have limited meaningfulness until they are interpreted in light

of the distributions of scores from a particular sample of test takers. For example, knowing that

reliability and validity estimates for BCT80 scores from one group of low-intermediate Japanese-

speaking learners of English were high is reassuring for the interpretation of findings from that

specific study, and it is useful information for a researcher who intends to use the same test for a

study with a different group of low-intermediate Japanese-speaking learners. However, this

information about Japanese learners is of more limited use to a researcher who intends to study a

group of highly proficient German-speaking learners of English. In other words, researchers

should be cautious in generalizing interpretations of reliability and validity statistics beyond the

sample from which they were derived and recognize the need to examine (and report) both

descriptive and testing statistics for their own BCT80 data before drawing conclusions about the

reliability and validity of this measure in their study.

While we believe that increased awareness of these often-misunderstood limitations is

important for the SLA field, we also realize that this information alone is not particularly helpful

to researchers faced with decisions on what measure of proficiency to include in their studies.

The primary practical goal of this paper is thus to provide information about the extent to which

statistical estimates derived from BCT80 scores vary between different groups of test takers, and

thereby enable researchers to better judge—based on evidence from a large and diverse sample

of test takers—whether the BCT80 is likely to provide a reliable and efficient measure of

proficiency for the participants they intend to include in their planned research. To this end, we

present descriptive statistics, reliability estimates, as well as item analyses based on our

aggregated dataset. We present separate analyses for scores derived through exact-answer (EX)

and acceptable-answer (AC) scoring, the two most frequently used scoring methods for the

BCT80. In addition, with the goal of providing a better basis for comparison between studies, we

calculated equivalent T scores, z scores, and percentile scores for both EX and AC scoring

methods across the entire dataset. We present these scores as an informal scale of reference for

future researchers who may wish to situate their participants in a specific range of English

language proficiency as measured by the BCT80. Finally, we will continue to make available the

original cloze test together with the answer key for both EX and AC scoring methods so that

researchers will have ready access to the BCT80 in the future. It is our hope that beyond the

insights on the specific test properties of the BCT80, this analysis also serves to provide an

illustration of what researchers need to consider in the selection of a proficiency measure for

research purposes more generally.

1 Assessing Global Proficiency in SLA Research

Researchers in the field of SLA are often faced with the problem of sampling from a

heterogeneous population. L2 learners differ from each other in a myriad of ways, including their

native languages, learning environments, and learning histories. This makes it difficult to

generalize findings from any specific learner group to SLA in general and to compare findings

among various studies. One common solution to this problem is for SLA researchers to

characterize the participants in their studies as precisely as possible so that readers can judge for

themselves the appropriateness of inferences and comparisons based on the findings of any given

study. An indication of learners’ overall proficiency range is often one important characteristic of

an L2 sample.

Another key issue in SLA research is learners’ development over time. Ideally, development

is assessed longitudinally, yet for well-known logistical reasons, longitudinal studies are often

not feasible. Hence, SLA researchers frequently resort to cross-sectional designs, in which they

compare more advanced with less advanced groups. This raises the obvious question of how to

operationalize and measure more vs. less advanced. In response to this problem, Thomas (1994)

surveyed 157 SLA studies published between 1988 and 1992 to find out what methods were

being used in the field to assess “target language proficiency” (p. 307). Based on this survey,

Thomas (1994) established four categories for ways that proficiency is assessed in SLA studies:

(a) impressionistic judgment, (b) institutional status (e.g., class level), (c) in-house assessment,

and (d) standardized tests (including participants’ self-reporting of standardized test scores) (p.

312). In a follow-up paper, Thomas (2006) repeated her survey on a new set of studies published

between 2000 and 2004 and found no major changes in the overall proportions of use of the four

broad types of proficiency assessment across studies but noted a shift towards probing

proficiency in finer detail and integrating measures from overall proficiency more tightly into the

research design, a development further confirmed by Tremblay (2011).

A review of SLA studies that included the BCT80 as an independent measure of proficiency

provides a good illustration of the different ways in which researchers have integrated this

measure into their research designs. One possibility is to simply provide descriptive statistics for

participants’ cloze test scores, along with other demographic information (e.g., age, L1), to

characterize the sample (see e.g., Brown, Cunha et al. 2001). This information is useful to the

extent that other researchers who wish to replicate the study or compare the results to those from

a different group can include the same test in their own design to measure potential differences

in proficiency. Its usefulness is limited, however, when characterizing the sample in descriptive

terms such as intermediate or advanced, since there are currently no conventions for associating

certain score ranges with specific descriptive categories. (Note that, below, we do provide an

informal frame of reference for BCT80 scores.)

A different use of cloze test scores within the broader design of SLA research studies is to

control for proficiency between subgroups when proficiency is not a variable of interest. For

example, Chrabaszcz and Jiang (2014) compared Russian- and Spanish-speaking learners of

English to investigate the role of L1 transfer in L2 learners’ use of articles in English. To this

end, they needed to eliminate differences in overall proficiency between the two groups as a

possible explanation for any observed differences in their use of articles. Thus, they included the

BCT80 in their design, conducted a statistical comparison between scores in the two groups, and

concluded from the absence of a significant between-group difference that “both groups of L2

English speakers were comparable in terms of their L2 proficiency” (p. 361). While this is a

useful first step towards minimizing the influence of proficiency in a context where proficiency

is not the variable of interest, we would like to caution more generally against the interpretation

of a non-significant effect as evidence for the absence of an effect. More cautious

strategies in this sort of situation, if allowed by the nature of the data and analysis, would be to

conduct additional power analyses of the between-group cloze score comparison or to include cloze test

scores as a continuous covariate in the statistical model (for this use of scores from the BCT80,

see Qin et al. 2017, Grüter et al. 2017).
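
The power-analysis strategy mentioned above can be illustrated with a minimal Python sketch. It relies on a normal approximation to the two-sample t test, and the function name, group size, and effect sizes are invented for illustration; none of them come from the studies cited here.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    Normal approximation to the t test, so the result is slightly
    optimistic for small groups. effect_size_d is Cohen's d.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size_d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return (std_normal.cdf(noncentrality - z_crit)
            + std_normal.cdf(-noncentrality - z_crit))


# Hypothetical scenario: two L1 groups of 30 learners each.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: approximate power = {approx_power_two_groups(d, 30):.2f}")
```

With groups of this size, power to detect even a medium-sized proficiency difference can be low, which is why a non-significant comparison on its own is weak evidence that two groups are matched.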

Perhaps the most common use of cloze test scores in SLA research is to create subgroups of

learners by proficiency in a cross-sectional design aimed at investigating L2 development over

time. In these cases, proficiency is typically one of the main variables of interest. For example,

Kwak (2010) conducted a study with Korean learners of English to investigate potential changes

over time in learners’ interpretation of numeral quantifiers and negation in English. Kwak

divided her participants into “low” vs “high” proficiency groups based on their scores on the

BCT80, with those scoring above the group mean (Experiment 5) or the median (Experiment 6)

assigned to the “high” and those below the mean/median to the “low” group. Others used scores

from the BCT80 combined with scores from other measures of proficiency to divide learners into

subgroups representing developmental stages. For example, Tremblay (2008) used a combined

score from the BCT80 and a read-aloud task to create “intermediate,” “low advanced,” and “high

advanced” groups. These studies then typically proceed to use proficiency-subgroup as a

categorical predictor in their analysis, where significant effects of this predictor are interpreted as

an indication of developmental change. We do not take issue with this logic, in principle, but

would like to point out that it critically rests on the assumption that scores derived from the

BCT80 (alone, or in combination with another proficiency measure) can provide a good indicator

of L2 developmental stages. It is important to bear in mind that the test was not originally created

for this purpose, and there are currently no agreed-upon correspondences between certain score

ranges and developmental stages.

Moreover, different researchers have used different criteria, diverse cut-off points, and

different labels to refer to their proficiency subgroups created based on scores from the BCT80.

As a result, there are considerable discrepancies among studies in terms of the descriptions of

these groups, created (in part) based on the very same measure of proficiency. For example,

Huensch (2014), Kim and Kim (2013), and Kwak (2010) all used the BCT80 to create subgroups

among Korean learners of English. While all authors used AC scoring, the groups labelled

“high” (proficiency) in the three studies differed substantially in terms of their score ranges: 37-

47 in Huensch (2014), 32-44 in Kim and Kim (2013), and 22-41 and 24-39 in Kwak (2010,

Experiments 5 and 6, respectively). This further contrasts with a score range of 19-30 in a group

of “high” proficiency Japanese learners of English in Sasayama (2015), and a range of 34-46 in a

group of “low advanced” French Canadian learners of English in Tremblay (2008). These

discrepancies considerably complicate comparisons between studies and inferences about L2

development beyond the samples studied.
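
Why sample-relative cut-offs yield labels that cannot be compared across studies can be shown with a small Python sketch. The two score lists are invented, and the helper names are hypothetical; the sketch only assumes a median split of the kind described above.

```python
from statistics import median


def median_split(scores):
    """Label each score 'high' if it lies above the sample median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def high_group_range(scores):
    """Return (min, max) of the scores labelled 'high' by the median split."""
    labels = median_split(scores)
    high = [s for s, label in zip(scores, labels) if label == "high"]
    return min(high), max(high)


# Invented AC scores (out of 50) for two hypothetical learner samples.
sample_a = [12, 15, 18, 20, 22, 25, 27, 30]
sample_b = [30, 33, 35, 38, 40, 42, 45, 47]

print(high_group_range(sample_a))
print(high_group_range(sample_b))
```

In this toy case a learner scoring 30 is the top of the "high" group in one sample and falls in the "low" group in the other, although the test and the scoring method are identical.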

One way around the assignment of learners to discrete proficiency subgroups is to include

measures derived from the BCT80 as a continuous predictor in the analysis. Brown, Robson et

al. (2001), for example, included cloze test scores as one of several individual-difference

variables in a correlational analysis. In a recent study, Zenker and Schwartz (2017) used Rasch

analysis to convert raw cloze test scores to person ability estimates on a logit scale and used this

variable in a regression analysis aimed at testing for developmental effects on Chinese learners’

sensitivity to syntactic island effects in English. This use of a continuous-scale measurement of

proficiency provides more precise information than nominal-scale subgrouping based on study-specific

criteria, and is therefore preferable when it is feasible within the overall research design and all

statistical assumptions are met (see also Grosjean, 1998, and Hulstijn, 2012, for earlier

recommendations to include general proficiency measures as covariates in the final analyses).

For all of these uses of cloze test scores as an indicator of proficiency, it is essential for

researchers to understand the distribution of the total scores in their own data, as well as the

relation of this distribution to that in the wider population of L2 learners. We thus reiterate our

encouragement for researchers who decide to use the BCT80 in their research to calculate

reliability estimates and examine their validity arguments for the scores from their own samples

and interpret them with reference to the analyses of scores from our large aggregated dataset.

Before we present these re-examinations of the BCT80, we will briefly describe the nature and

development of the BCT80 itself.

2 The BCT80

The cloze test at the center of this study is usually referred to as the cloze in Brown (1980), but it

first appeared in Brown’s MA thesis (1978). According to Brown (1978), after a survey of

various texts of moderate difficulty, two passages from an intermediate ESL reader by Margaret

Kurilecz (1969) were selected and adapted to the needs of that study. Next, an every-7th-word

deletion pattern was applied to both passages, yielding 50 blanks in each, with two sentences

unmutilated at the beginning (for lead-in) and one at the end. The blanks were numbered, and the

examinees were asked to write their answers in corresponding blanks along the right side of the

test paper. After a first round of piloting, one passage was selected and 77 English native

speakers in freshman composition courses at UCLA took this final cloze test. Answers provided

by at least two participants in this group were included in the glossary of acceptable answers for

the AC scoring method. While in exact-answer (EX) scoring, only the originally deleted words

are counted as correct answers, in AC scoring, all (and only) the items in the AC answer glossary

are counted as correct. That paper reported K-R20 reliability estimates ranging from .89 to .95

depending on the cloze scoring method used. The study also found criterion-related validity

coefficients ranging from .88 to .91, again depending on the scoring method. These validity

coefficients were correlation coefficients between the cloze test scores and total scores on

UCLA’s standardized English as a Second Language Placement Test (the ESLPE included

reading, listening, writing, and grammar subtests).
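
The difference between EX and AC scoring, and the two-respondent rule for building the AC glossary, can be sketched in Python as follows. The three-blank key and the native-speaker answers are invented and are not taken from the BCT80; the function names and the normalisation step (trimming whitespace, lower-casing) are likewise assumptions made for illustration.

```python
from collections import Counter


def normalize(word):
    """Assumed normalisation: trim whitespace and ignore case."""
    return word.strip().lower()


def build_ac_glossary(ns_answers_by_blank, min_count=2):
    """Per blank, keep every answer given by at least min_count native speakers."""
    glossary = []
    for answers in ns_answers_by_blank:
        counts = Counter(normalize(a) for a in answers)
        glossary.append({w for w, n in counts.items() if n >= min_count})
    return glossary


def score_ex(responses, exact_key):
    """EX scoring: only the originally deleted word counts as correct."""
    return sum(normalize(r) == key for r, key in zip(responses, exact_key))


def score_ac(responses, glossary):
    """AC scoring: any word in the blank's glossary entry counts as correct."""
    return sum(normalize(r) in ok for r, ok in zip(responses, glossary))


# Invented data: three blanks, a handful of native-speaker answers per blank.
exact_key = ["house", "quickly", "the"]
ns_answers = [
    ["house", "home", "home", "house", "hut"],
    ["quickly", "fast", "fast", "quickly"],
    ["the", "the", "a"],
]
glossary = build_ac_glossary(ns_answers)

learner = ["home", "Quickly ", "a"]
print(score_ex(learner, exact_key), score_ac(learner, glossary))  # prints: 1 2
```

Because every AC-acceptable answer set here contains the exact word, an AC score can never be lower than the corresponding EX score for the same responses.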

In the journal-length version of this study, Brown (1980) concluded that the best overall

scoring method for the cloze test was the AC method. Yet as Brown (2002) discussed in more

detail, this conclusion was potentially premature, as it “failed to consider the distributions of

scores and their effects on the relative values of […] descriptive, item analysis, reliability, and

validity statistics” (p. 80). The in-depth analysis of these statistics derived from a much larger

and more diverse dataset that we present here allows us to re-examine Brown’s (1980) original

conclusion and arrive at more nuanced recommendations for the choice of scoring method in

different contexts.

3 This Study

3.1 Collection of Datasets

In summer 2015, we conducted a search on Google Scholar for publications citing Brown (1978

or 1980). We then inspected these publications for reports of original data from the BCT80. In

fall 2015, we contacted fifteen scholars who were thus identified as first or corresponding

authors of one or more publications reporting such data, to ask if they would be willing to share

their datasets with us in deidentified format for the present analyses. Twelve responded. Three of

them reported that they no longer had the data in question; nine generously agreed to share their

data with us. Of these nine, five were able to share participant-level data, i.e., a total score for

each participant; four shared item-level data, consisting of responses to each test item, i.e., 50

data points per participant. (Note that some authors contributed more than one dataset.) In

addition to the datasets obtained through this process, unpublished data from three M.A. student

projects in our department were made available to us in item-level format. Finally, we included

item-level datasets from our own prior research.

We thus collected a total of 19 datasets, consisting of data from a total of 1,724 test takers,

including 1,637 L2 and 87 native speakers (NS) of English. Thirteen of these datasets contained

item-level data; for seven of them, both EX and AC scores were available, four had only EX and

two had only AC scores. An additional six datasets contained participant-level data, all with AC

scores. The original data were collected between 1977 and 2015 in both second and foreign

language contexts, and learners’ native languages included French, Japanese, Korean, Mandarin,

Portuguese, Russian, and Spanish.

3.2 Goals and Research Questions

A key goal of our study is to draw SLA researchers’ attention to the fact that reliability estimates

vary with the distribution of scores in each sample. To this end, we use our aggregated dataset to

address the following research question:

RQ1: To what degree do score distributions and reliability estimates vary (a) among L2

groups, (b) between L2 and NS examinees, and (c) between EX vs AC scoring?

Secondly, we seek to examine the extent to which scores on the BCT80 function well for

assessing learners at different ends of the proficiency spectrum. Thus, to explore if and how the

scores on the BCT80 provide effective measures for classifying learners along the continuum of

L2 ability, we formulate the following research question:

RQ2: To what degree do different sets of items on the BCT80 contribute to the

effectiveness of overall scores for discriminating among test takers, including (a) low-

vs high-proficiency L2 examinees across studies and (b) L2 vs NS test takers?

We will address RQ2 by using item discrimination analysis to examine which items discriminate

well within different subgroups, looking separately at both EX and AC scoring. The results from

these analyses may also help researchers choose the best scoring method for learners at different

points in the proficiency continuum.

Finally, with the goal of providing a better basis for comparison between and across studies,

we will present equivalent raw scores, T scores, z scores, and percentile scores for both EX and

AC scoring methods for the BCT80, based on the entire dataset, in an informal table of reference.

3.3 Analysis and Results

This section will present the following analyses (more or less in the same order as the research

questions):

1. Total score results for the EX scoring of cloze tests including reliability analyses,

descriptive statistics, and box plots, as well as one-way ANOVA results comparing group

means for all 11 L2 groups and the NS group.

2. Total score results for the AC scoring of cloze tests including the same analyses for all 15

L2 groups and the NS group.

3. Item-level discrimination analyses to determine which items are discriminating for EX

and AC scoring for all L2 speakers and all NSs, as well as for high vs. low proficiency L2

subgroups.

4. Standardized-score conversions, separately for EX and AC scoring: raw-score, z-score,

T-score, and percentile-score equivalencies for all L2 learners.
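
The standardized-score conversions in step 4 follow the usual definitions, z = (raw - M) / SD and T = 50 + 10z. A minimal Python sketch is given here, using the Total L2 mean and SD for EX scoring reported in Table 1. The percentile in the sketch is a normal-curve approximation, which is an assumption; percentiles read directly from the observed score distribution may differ. The function and constant names are hypothetical.

```python
from statistics import NormalDist

# Total L2 mean and SD for EX scoring, as reported in Table 1.
TOTAL_L2_EX_MEAN = 13.52
TOTAL_L2_EX_SD = 8.42


def standardize(raw, mean, sd):
    """Return (z score, T score, approximate percentile) for a raw score."""
    z = (raw - mean) / sd
    t = 50 + 10 * z
    # Assumption: normal approximation, not an empirical percentile rank.
    percentile = 100 * NormalDist().cdf(z)
    return z, t, percentile


z, t, pct = standardize(22, TOTAL_L2_EX_MEAN, TOTAL_L2_EX_SD)
print(f"z = {z:.2f}, T = {t:.1f}, approximate percentile = {pct:.0f}")
```

A researcher could apply the same conversion to each participant's raw score to locate a new sample relative to the aggregated L2 reference group.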

3.3.1 Total Score Results for EX Scoring

Focus on EX reliability. The total score results for EX scoring are for 1,257 L2 speakers in

11 studies (labeled here A to K) and 46 NSs. The reliability estimates for each set of data—

including Cronbach alpha and Kuder-Richardson 21 (K-R21)—are shown in Table 1. Notice that

the alpha reliability estimates in the first column of numbers are moderately high throughout and

range from .69 to .84 for the L2 studies. This means that these various sets of scores were

moderately reliable since they ranged from 69% to 84% internally consistent (i.e., doing the

same things in terms of ordering students). As would be expected for the combined samples,

Total L2 and Grand Total had considerably higher alpha reliability estimates of .90 and .91,

respectively.

The K-R21 reliability estimates are systematically lower as has been found in a number of

cloze studies. Brown (1983, 1994) was the first to explore this phenomenon, which is probably

due to systematic violation of the assumption of equal item variances that underlies K-R21, but

not alpha. The K-R21 estimates (which are based on total scores) are presented here because

later in Table 2, there will be a number of data sets for which we had only total scores, and

therefore could not calculate alpha (because alpha must be based on item-level data). Since we

wanted some basis for comparing the reliability of those groups with each other and with the

ones in Table 1, we used K-R21 for that purpose.
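
The contrast between the two estimates can be made concrete with a short Python sketch: alpha needs the full item-by-person matrix, whereas K-R21 needs only the number of items and the mean and variance of the total scores. The sketch uses the textbook formulas with population variances, which is an assumption, since the variance convention used for the tables is not stated here. The response matrix is invented.

```python
from statistics import mean, pvariance


def cronbach_alpha(item_matrix):
    """Alpha from item-level data: one row per test taker, one 0/1 entry per item."""
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    item_vars = [pvariance([row[i] for row in item_matrix]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))


def kr21(totals, k):
    """K-R21 from total scores only (k = number of items)."""
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))


# Invented responses of six test takers to five dichotomous items.
data = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
totals = [sum(row) for row in data]

alpha = cronbach_alpha(data)
kr21_estimate = kr21(totals, k=5)
print(f"alpha = {alpha:.3f}, K-R21 = {kr21_estimate:.3f}")
```

Under a shared variance convention, K-R21 cannot exceed alpha for dichotomous items, and the two coincide only when all items are equally difficult; the items in this toy matrix differ in difficulty, so K-R21 comes out lower.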

Table 1. Reliability Estimates for EX Scoring by Study

Study        Test takers’ L1     N      M       SD     Alpha   K-R21
NS           English             46     27.76   5.82   .74     .63
A            Mandarin            82     23.07   5.50   .72     .58
B            Portuguese          88     8.51    4.38   .72     .62
C            Portuguese          270    15.22   6.51   .83     .75
D            Japanese            320    7.21    4.06   .71     .62
E            Spanish             20     23.00   7.00   .81     .74
F            Japanese, Korean    48     15.52   5.14   .69     .59
G            Japanese            112    5.51    3.89   .73     .67
H            Japanese            119    14.23   6.84   .84     .78
I            French              110    25.18   6.78   .80     .72
J            Mandarin            31     17.68   6.91   .83     .76
K            Korean              57     19.28   6.87   .82     .74
Total L2                         1257   13.52   8.42   .90     .86
Grand Total                      1303   13.95   8.73   .91     .87

We also present descriptive statistics in Table 1 because, mathematically, reliability estimates

are somewhat related to the amounts of variance in the distributions upon which they are based.

For example, the fact that the scores for the Total L2 and Grand Total groupings produced higher

reliability estimates than most of the other groupings may be largely due to the fact that those

two Total groups had larger n sizes and had wider distributions (as represented by larger SDs).

Note also that all of those groups that produced alpha reliability estimates over .80 had SDs

higher than 6.50, while those groups with lower estimates had lower SDs.
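
The dependence of reliability estimates on score variance can be seen directly in the K-R21 formula. The following Python sketch holds the number of items and the mean constant and varies only the SD. It uses the textbook form of K-R21, and the mean and SD values are invented; since the exact variant behind the tabled estimates is not stated here, the sketch is not expected to reproduce them to the second decimal.

```python
def kr21_from_summary(k, mean, sd):
    """Textbook K-R21 from the number of items, the mean, and the SD of total scores."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))


# Hypothetical 50-item test with a mean of 15 and increasingly wide distributions.
for sd in (4.0, 6.0, 8.0):
    print(f"SD = {sd}: K-R21 = {kr21_from_summary(50, 15.0, sd):.2f}")
```

With everything else equal, a wider spread of total scores yields a higher estimate, which is the pattern described for the combined groupings.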

Focus on EX distributions. The means in Table 1 range from 5.51 (Study G) to 25.18

(Study I) for the L2 groups, and reach 27.76 for the NS group. By any standard, this amount

of variation in means for different groups is large. The SDs also vary considerably. The same

results are displayed graphically in box-and-whiskers plots in Figure 1. This graphical

representation clearly shows (as the descriptive statistics cannot) that the groups vary greatly

from each other in terms of median and general ability levels. Notice how much overlap there is

between some of the L2 group distributions and the distribution for NSs.

Figure 1. Box Plots for EX scoring (for Studies with Item-Level Data) by Studies in Mean Order

A one-way ANOVA was performed to determine the degree to which the fluctuations in

means found here were due to chance alone. The overall ANOVA indicated that significant

differences existed among the 12 groups (F = 169.45 with df = 11 and 1291; p < .0005; eta² =

.591; power = 1.000). Scheffé follow-up analyses further indicated that 41 (or about 62%) of the

possible 66 comparisons were significant (i.e., the means were different due to factors other than

chance) at .0005 (i.e., with 99.95% probability). Note also that Groups A, E, and I were not

found to be significantly different in the Scheffé analyses from the NS group, with a substantial

number of learners in those groups scoring well within the range of scores observed among

native speakers, a pattern that we discuss in more detail below (4.1).
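
The figures reported for the ANOVA can be cross-checked with a few lines of Python. For a one-way design, eta squared can be recovered from F and its degrees of freedom as (F × df_between) / (F × df_between + df_within), and the number of pairwise comparisons among 12 groups is "12 choose 2". The function name is hypothetical; the input values are those reported above.

```python
from math import comb


def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_value * df_between) / (f_value * df_between + df_within)


eta2 = eta_squared_from_f(169.45, 11, 1291)
n_pairs = comb(12, 2)                 # pairwise comparisons among 12 groups
share_significant = 41 / n_pairs      # 41 significant Scheffé comparisons

print(round(eta2, 3), n_pairs, round(100 * share_significant))  # prints: 0.591 66 62
```

The recovered values agree with the reported eta squared of .591, the 66 possible comparisons, and the figure of about 62%.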

3.3.2 Total Score Results for AC Scoring

Focus on AC reliability. The total AC scoring results are from 840 L2 speakers in 15 studies

(labeled here A to S) and 87 NSs. The alpha reliability estimates in Table 2 are moderately high

throughout with alpha coefficients ranging (more widely than those for EX scoring) from .74 to

.92 for the L2 studies. This means that the internal consistency for these various sets of scores

ranged from 74% to 92% consistent. Again, as would be expected for the combined samples,

Total L2 and Grand Total had considerably higher alpha reliability estimates. Also, where both

alpha and K-R21 estimates were available, those for K-R21 are systematically lower.
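The relationship between the two estimates can be illustrated with a minimal sketch. The response matrix and function names below are hypothetical, and population variances are assumed throughout. For dichotomously scored items, K-R21 needs only the number of items, the mean, and the variance of the total scores; computed with the same variance convention, it cannot exceed alpha and falls below it whenever the items differ in difficulty.

```python
from statistics import mean, pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per examinee)."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr21(responses):
    """K-R21 from the number of items, mean, and variance of total scores."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    m = mean(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * pvariance(totals)))

# Hypothetical 0/1 responses: 6 examinees x 5 items of increasing difficulty.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
kr21_estimate = kr21(responses)
print(f"alpha = {alpha:.3f}, K-R21 = {kr21_estimate:.3f}")
```

Because the five hypothetical items range from very easy to very hard, K-R21 comes out noticeably below alpha here, which mirrors the direction of the differences in Table 2.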

Recall that some authors cited the original BCT80 K-R20 reliability estimate for AC scoring

of .95 (K-R20 more or less = alpha) and split-half adjusted reliability estimate of .94 as

justification for using the BCT80 in their studies. However, while the scores for the Total L2 and

Grand Total groupings did turn out to be almost as reliable (at alpha both = .93) as the original

scores in Brown (1980), the various studies’ AC scores are considerably lower in reliability than

the EX scores in that original study. Note again in Table 2 that those groupings that had the

highest standard deviations tended to have the highest reliability estimates.

Table 2. Reliability Estimates for AC Scoring by Study (for all data with total scores)

Study Test takers’ L1 N M SD Alpha K-R21

NS English 87 43.94 4.36 NA .71

A Mandarin 82 36.61 6.15 .80 .74

E Spanish 20 37.75 8.56 .91 .87

F Japanese, Korean 48 27.73 6.34 .75 .69

G Japanese 112 10.39 6.11 .83 .78

I French 110 39.39 7.97 .90 .87

J Mandarin 31 29.23 10.21 .92 .88

K Korean 57 31.53 8.97 .89 .85

L Japanese 31 19.84 6.87 .80 .74

M Mandarin 12 38.33 5.30 .74 .68

N Russian, Spanish 32 42.84 6.15 NA .83

O Russian, Mandarin 29 41.17 5.68 NA .77

P Korean 75 22.76 14.27 NA .94

Q Korean 141 22.01 10.36 NA .88

R Korean 36 32.36 8.01 NA .82

S Japanese 24 40.08 4.25 NA .55

Total L2 840 28.41 13.18 NA .93

Grand Total 927 29.87 13.41 NA .93

Focus on AC distributions. The L2 means in Table 2 are generally much higher than those

shown in Table 1, and they range from 10.39 (Study G) to 42.84 (Study N), and even higher

(43.94) for the NS group. Many of the L2 SDs in this table are also higher than those shown in

Table 1, and they vary considerably. In the graphical representation of these results in Figure 2,

notice again how much the groups vary from each other in terms of median and general ability

levels, and how much overlap there is between the distributions for many of the L2 groups and

the NS group.

Again, one-way ANOVA was performed to determine the degree to which the fluctuations in

means found here were due to chance alone. The overall ANOVA indicated that significant

differences existed among the 16 groups (F = 99.215 with df = 15 and 911; p < .0005; eta2 =

.620; power = 1.000). Scheffé follow-up analyses further indicated that 46 (or about 38.33%) of

the possible 120 comparisons were significant at the p = .0005 level. Groups A, E, I, M, N, O,

and S were not found to be significantly different in the Scheffé analyses from the NS group.
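As a rough check on these figures, a one-way ANOVA can be approximated from the group sizes, means, and SDs in Table 2 alone. The sketch below is not the authors' procedure; it assumes the reported SDs are sample SDs (n - 1 denominator), and because the inputs are rounded to two decimals it can only approximate the reported F and eta2. It also counts the pairwise comparisons available among the 16 groups.

```python
from math import comb

# (study, N, mean, SD) copied from Table 2 (AC scoring).
groups = [
    ("NS", 87, 43.94, 4.36), ("A", 82, 36.61, 6.15),
    ("E", 20, 37.75, 8.56), ("F", 48, 27.73, 6.34),
    ("G", 112, 10.39, 6.11), ("I", 110, 39.39, 7.97),
    ("J", 31, 29.23, 10.21), ("K", 57, 31.53, 8.97),
    ("L", 31, 19.84, 6.87), ("M", 12, 38.33, 5.30),
    ("N", 32, 42.84, 6.15), ("O", 29, 41.17, 5.68),
    ("P", 75, 22.76, 14.27), ("Q", 141, 22.01, 10.36),
    ("R", 36, 32.36, 8.01), ("S", 24, 40.08, 4.25),
]

def anova_from_summary(groups):
    """One-way ANOVA F and eta squared from per-group N, mean, and SD."""
    n_total = sum(n for _, n, _, _ in groups)
    grand_mean = sum(n * m for _, n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for _, n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for _, n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, eta_squared, df_between, df_within

f_ratio, eta_squared, df_between, df_within = anova_from_summary(groups)
n_pairs = comb(len(groups), 2)  # number of possible pairwise comparisons

print(f"F({df_between}, {df_within}) = {f_ratio:.2f}, eta2 = {eta_squared:.3f}")
print(f"possible pairwise comparisons: {n_pairs}")
```

With the rounded values from Table 2, the result should land close to the reported F and eta2, with the same degrees of freedom (15 and 911) and 120 possible comparisons.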

Figure 2. AC (for all data with total scores) by Studies in Mean Order

3.3.3 Item-Level Discrimination Analyses

Examining our cloze test data from a classical testing perspective, we first analyzed the items for

item discrimination. It is well established that items that discriminate well for a particular sample

of examinees are those that are approximately the right level of difficulty for the examinees who

took it and those that separated the high scoring examinees on the whole test from the low

scoring examinees. Items that discriminate also promote variance in the scores on the test as well

as higher reliability estimates. Conversely, items that are not discriminating well tend to be too

difficult or easy for the students, do not separate the high scoring examinees from the low

scoring ones, and do not contribute substantially to the variance in the total scores on the test or

to the reliability of those scores. In other words, these non-functioning or ‘turned off’ items have

a profound effect on the overall distribution of scores on a test, as well as on the reliability

associated with these tests (see also Brown 2002).

Table 3 shows our analysis of items that discriminated above .20 for different groups within

our data. We chose the .20 level of discrimination because items at .19 or below are typically

“poor items, to be rejected or improved by revision” (Ebel 1979: 267) in the test development

process due to the fact that they are contributing very little to the test score variance and

reliability of the scores. We calculated discrimination estimates separately for four different

groups of examinees: the total group of L2 speakers taken together (TOT L2), the total group of

native-speakers taken together (TOT NS), the high proficiency L2 (HIGH L2, defined, for

present purposes, as the top third of all L2 examinees in terms of total scores), and the low

proficiency L2 (LOW L2, defined as the bottom third of all L2 examinees in terms of total

scores). The results for these four groups using EX scoring are in the first five columns of Table

3 and for the same four groups using AC scoring are in the last five. Notice that the fifty items

are labeled in the first column for each scoring method, and that each asterisk within Table 3

represents an item that discriminated at .20 or above. Cronbach alpha reliability estimates for

each group are shown at the bottom of Table 3.
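To make the .20 criterion concrete, the following sketch computes an upper-lower discrimination index (proportion correct in the top-scoring group minus proportion correct in the bottom-scoring group) for each item. This is an assumption for illustration only: the text above does not state which discrimination statistic was used, and the response matrix, the use of thirds, and the function name are all hypothetical.

```python
def discrimination_index(responses, n_groups=3):
    """Upper-lower discrimination index D for each item.

    D = proportion correct in the highest-scoring group
        minus proportion correct in the lowest-scoring group.
    """
    totals = [sum(row) for row in responses]
    ranked = sorted(range(len(responses)), key=lambda i: totals[i])
    size = max(1, len(responses) // n_groups)
    low, high = ranked[:size], ranked[-size:]
    n_items = len(responses[0])
    return [
        sum(responses[i][j] for i in high) / size
        - sum(responses[i][j] for i in low) / size
        for j in range(n_items)
    ]

# Hypothetical 0/1 responses: 6 examinees x 6 items.
# The last item is answered correctly by everyone, so it cannot discriminate.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]

d_values = discrimination_index(responses)
functioning = [d >= 0.20 for d in d_values]  # would earn an asterisk
print(d_values)
print(functioning)
```

In this toy example the item that every examinee answers correctly has D = 0 and would be treated as ‘turned off’, while the remaining items meet the .20 threshold.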

L2 Tot discrimination. Very striking at first glance in Table 3 is the fact that 39 out of 50

items for the EX TOT L2 sample are discriminating at .20 or above. Even more strikingly, all but

one (49/50) for the AC TOT L2 sample are discriminating at .20 or above. Also, the alpha

estimates for these groups are high at .90 and .95, respectively.

Table 3. Items That Discriminate Above .20 for Total L2 (for tests with item-level data),

Total NS, the Top Third of L2, & the Bottom Third of L2 Examinees

EXACT-ANSWER SCORING ACCEPTABLE-ANSWER SCORING

ITEM # TOT L2 TOT NS HIGH L2 LOW L2 ITEM # TOT L2 TOT NS HIGH L2 LOW L2

1 * * *

1 * * * *

2 *

*

2 * * * *

3 *

*

3 *

*

4 *

*

4 * * *

5 * * *

5 *

*

6 *

*

6 *

7 * *

7 * *

*

8

* *

8 *

*

9 * * *

9 * *

*

10

* *

10 *

*

11 *

*

11 * *

*

12 * *

12 * *

*

13 *

13 *

14 *

*

14 *

*

15

* *

15

* *

16 * *

*

16 *

*

17 * *

*

17 *

*

18 * *

*

18 * *

*

19

* *

19 * * *

20 *

20 *

*

21 * * *

21 *

*

22 * * *

22 *

*

23 *

*

23 *

*

24

*

24 * *

*

25

*

25 * *

*

26

*

26 * *

*

27 *

*

27 *

*

28 * *

28 * *

*

29 * * *

29 *

*

30 * * *

30 *

*

31 * *

*

31 * *

32 *

*

32 * * * *

33 * * *

33 *

* *

34 * * *

34 *

*

35 * *

35 * *

*

36 * * *

36 * * *

37

*

37 * *

*

38 *

38 * *

39 * * *

39 * *

40 * *

40 *

*

41 * * *

41 * *

*

42

*

42 *

*

43 * *

43 * * *

44 * * *

44 *

45

* *

45 * *

*

46 * * *

46 *

*

47

*

47 * * *

48 *

48 *

* *

49 * * *

49 *

* *

50 * * *

50 * * *

Alpha .90 .74 .56 .23 Alpha .95 .80 .41 .79

NS Tot discrimination. Notice that 37 out of 50 (74%) EX items discriminate at .20 or above

in the NS data and that the scores are moderately reliable at .74. Fewer of the AC items (26/50,

52%) discriminated at .20 or above for the NS group, yet the scores had somewhat higher

reliability at .80 than the EX items.

High and low proficiency groups discrimination. For the EX scoring in the high proficiency

L2 examinees, 25 out of 50 (50%) items discriminated at .20 or higher, and the Cronbach alpha

reliability was low at .56. For the low proficiency groups, a remarkably low 9 out of 50 (18%)

EX items discriminated at .20 or above, and the Cronbach alpha reliability was a very low .23.

For the AC scoring, only 13 out of 50 (26%) AC items discriminated at .20 or above for the high

proficiency L2 group, and alpha turned out to be a low .41. However, 37 out of 50 AC items

discriminated at .20 or higher for the low proficiency L2 group, and the alpha reliability was a

moderately respectable .79.

3.3.4 Standardized-Score Conversions

Table 4 shows the raw-scores, T-scores, and percentile equivalencies for all L2 learners for EX

and AC scoring. All of these statistics are based on total scores of all L2 examinees with no NSs

included. The two scoring methods are different because they are based on scores with very

different distributions, as explained in several ways above. In the first column of each set, we

show the possible raw scores. The EX raw scores range from 0 to 40 because nobody scored 41

or above, whereas the AC scores use the entire range from 0 to 50. In successive columns, we

show the z-score, T-score, and percentile equivalents. As points of reference, the mean and

standard deviation for z-scores are 0.00 and 1.00, while they are 50 and 10 for T-scores,

respectively. In addition, the means and standard deviations (SDs) in the distributions (both

positive and negative) are delineated for the reader’s convenience to the left of the scores for

each scoring method. Percentile scores represent the position in the distribution that a particular

score has in terms of the percentage of people falling below it.

For example, a participant on the EX scale with a raw score of 38 would have a z-score of

2.91 (meaning that they scored 2.91 standard deviations above the mean), a T-score of 79.07

(which means the same thing but has been transformed for a distribution with a standard

deviation of 10 and mean of 50 [i.e., (2.91 x 10) + 50 = 79.1]), and a percentile score of

99.8.
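The same conversions can be sketched in a few lines of Python. The sketch uses the AC Total L2 mean and SD from Table 2 (28.41 and 13.18) and assumes that the percentiles in Table 4 are areas under the normal curve for the z-score rounded to two decimals; the source does not state this explicitly, so it should be read as an inference from the tabled values. The function name is hypothetical.

```python
from statistics import NormalDist

# Mean and SD for AC scoring, Total L2 row of Table 2.
AC_MEAN, AC_SD = 28.41, 13.18

def standardize(raw, mean, sd):
    """Return (z-score, T-score, percentile) for a raw score."""
    z = (raw - mean) / sd
    t = 10 * z + 50  # T-scores have a mean of 50 and an SD of 10
    # Assumption: percentile = normal-curve area below the rounded z-score.
    percentile = 100 * NormalDist().cdf(round(z, 2))
    return z, t, percentile

z, t, pct = standardize(50, AC_MEAN, AC_SD)
print(f"z = {z:.2f}, T = {t:.2f}, percentile = {pct:.1f}")
```

For an AC raw score of 50, the result should come out close to the z-score of 1.64, T-score of 66.38, and percentile of 94.9 listed in Table 4.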

Table 4: EX and AC Raw-Score, T-Score, and Percentile Equivalencies for all L2 Learners

Exact-Answer Scoring (EX) Acceptable-Answer Scoring (AC)

Raw z-score T-score Percentile Raw z-score T-score Percentile

50 NA NA NA 50 1.64 66.38 94.9

49 NA NA NA 49 1.56 65.62 94.1

48 NA NA NA 48 1.49 64.86 93.2

47 NA NA NA 47 1.41 64.10 92.1

46 NA NA NA 46 1.33 63.35 90.8

45 NA NA NA 45 1.26 62.59 89.6

44 NA NA NA 44 1.18 61.83 88.1

43 NA NA NA 43 1.11 61.07 86.7

42 NA NA NA +1 SD 42 1.03 60.31 84.8

41 NA NA NA 41 0.96 59.55 83.1

40 3.15 81.45 99.9 40 0.88 58.79 81.1

+3 SD 39 3.03 80.26 99.9 39 0.80 58.03 78.8

38 2.91 79.07 99.8 38 0.73 57.28 76.7

37 2.79 77.89 99.7 37 0.65 56.52 74.2

36 2.67 76.70 99.6 36 0.58 55.76 71.9

35 2.55 75.51 99.5 35 0.50 55.00 69.1

34 2.43 74.32 99.3 34 0.42 54.24 66.3

33 2.31 73.14 99.0 33 0.35 53.48 63.7

32 2.19 71.95 98.6 32 0.27 52.72 60.6

+2 SD 31 2.08 70.76 98.1 31 0.20 51.97 57.9

30 1.96 69.57 97.5 30 0.12 51.21 54.8

29 1.84 68.38 96.7 Mean 29 0.04 50.45 51.6

28 1.72 67.20 95.7 28 -0.03 49.69 48.8

27 1.60 66.01 94.5 27 -0.11 48.93 45.6

26 1.48 64.82 93.1 26 -0.18 48.17 42.9

25 1.36 63.63 91.3 25 -0.26 47.41 39.7

24 1.24 62.45 89.3 24 -0.33 46.65 37.1

23 1.13 61.26 87.1 23 -0.41 45.90 34.1

+1 SD 22 1.01 60.07 84.4 22 -0.49 45.14 31.2

21 0.89 58.88 81.3 21 -0.56 44.38 28.8

20 0.77 57.70 77.9 20 -0.64 43.62 26.1

19 0.65 56.51 74.2 19 -0.71 42.86 23.9

18 0.53 55.32 70.2 18 -0.79 42.10 21.5

17 0.41 54.13 65.9 17 -0.87 41.34 19.2

16 0.29 52.95 61.4 -1 SD 16 -0.94 40.58 17.4

15 0.18 51.76 57.1 15 -1.02 39.83 15.4

Mean 14 0.06 50.57 52.4 14 -1.09 39.07 13.8

13 -0.06 49.38 47.6 13 -1.17 38.31 12.1

12 -0.18 48.19 42.9 12 -1.25 37.55 10.6

11 -0.30 47.01 38.2 11 -1.32 36.79 9.3

10 -0.42 45.82 33.7 10 -1.40 36.03 8.1

9 -0.54 44.63 29.5 9 -1.47 35.27 7.1

8 -0.66 43.44 25.5 8 -1.55 34.51 6.1

7 -0.77 42.26 22.1 7 -1.62 33.76 5.3

-1 SD 6 -0.89 41.07 18.7 6 -1.70 33.00 4.5

5 -1.01 39.88 15.6 5 -1.78 32.24 3.8

4 -1.13 38.69 12.9 4 -1.85 31.48 3.2

3 -1.25 37.51 10.6 -2 SD 3 -1.93 30.72 2.7

2 -1.37 36.32 8.5 2 -2.00 29.96 2.3

1 -1.49 35.13 6.8 1 -2.08 29.20 1.9

0 -1.61 33.94 5.4 0 -2.16 28.44 1.5

These standardized scores may prove useful to researchers who use the BCT80 in their studies

because they can now reference their EX or AC scores to this larger sample derived from

multiple sets of data. They can also describe their individual sample in terms other than high and

low proficiency, at least in terms of what high and low proficiency mean relative to the results of

other studies. These standardized scores may also prove useful in interpreting the already

published studies (including those used in this paper) for comparison purposes.

For example, if a team of researchers used the EX scoring method to score this cloze test in

their study and divided their sample in halves with the 50% in a high proficiency group scoring

28 to 39 and the 50% in a low proficiency group scoring 16 to 27, they could further cite Table 4

and describe their high proficiency group as ranging from the 95.7th to the 99.9th percentile,

while their low proficiency group ranged from the 61.4th to the 94.5th percentile in the larger

sample included in this paper. Perhaps another researcher using the AC scoring method in her

study divided her participants into halves with 50% in a high proficiency group for those scoring

15 to 27 and the 50% in a low proficiency group for those with 11 to 14. This researcher could

also cite Table 4 and describe her high proficiency group as ranging from the 15.4th to the 45.6th

percentile, while her low proficiency group ranged from the 9.3rd to the 13.8th percentile in the

larger sample included in this paper. More importantly, the second researcher and her readers

could all interpret her results in light of the fact that the lowest scoring “low proficiency”

participant in the first study was higher than the highest performing participant in this second

study.
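The two hypothetical studies in this example can be mapped onto the reference scale with a simple lookup. The sketch below copies only the handful of Table 4 rows needed for the example; the dictionary and function names are illustrative, not part of any published tool.

```python
# Percentile equivalents copied from Table 4 for the raw scores in the example.
EX_PERCENTILES = {16: 61.4, 27: 94.5, 28: 95.7, 39: 99.9}
AC_PERCENTILES = {11: 9.3, 14: 13.8, 15: 15.4, 27: 45.6}

def percentile_band(lowest_raw, highest_raw, table):
    """Return the (lower, upper) percentile band for a range of raw scores."""
    return table[lowest_raw], table[highest_raw]

# First study (EX scoring): low group 16-27, high group 28-39.
study1_low = percentile_band(16, 27, EX_PERCENTILES)
study1_high = percentile_band(28, 39, EX_PERCENTILES)

# Second study (AC scoring): low group 11-14, high group 15-27.
study2_low = percentile_band(11, 14, AC_PERCENTILES)
study2_high = percentile_band(15, 27, AC_PERCENTILES)

# The lowest scorer in the first study sits above the highest in the second.
first_study_floor_is_higher = study1_low[0] > study2_high[1]

print(study1_low, study1_high)
print(study2_low, study2_high)
print(first_study_floor_is_higher)
```

Reporting groups as percentile bands in this way makes it immediately visible that the two studies' ‘high’ and ‘low’ labels refer to non-overlapping regions of the reference distribution.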

Notice also that the AC scoring distribution only goes up to the 94.9th percentile. Thus, the EX

scoring, which continues to differentiate among those examinees between the 95.7th and 99.9th

percentiles (see the raw scores ranging from 28 to 40), generally measures better than the AC

scoring at the top end of the distribution.

4 Discussion

We will begin here by briefly summarizing the outcomes of our analyses in light of our two

research questions, followed by discussion of the implications of these findings for making

scoring decisions (4.1), and for dealing with reliability issues (4.2).

RQ1: To what degree do score distributions and reliability estimates vary (a) among L2

groups, (b) between L2 and NS examinees, and (c) between EX vs AC scoring?

Our first goal in this study was to draw SLA researchers’ attention to the ways reliability

estimates vary depending on the distributions of scores in each sample.

Variations among L2 groups. In terms of both reliability and score distributions, we found

considerable variation across studies, both for EX (Table 1) and AC (Table 2) scoring. For both

scoring methods, Total L2 and Grand Total scores were reliable at .90 and above, similar to what

was reported in Brown (1980). Yet reliability estimates for individual samples ranged from 71%

to 84% for EX scoring, and from 55% to 94% for AC scoring.

Variations between L2 and NS examinees. For both AC and EX scoring, we noted how

wide the distribution was for the NSs, and how much overlap there was with the distributions for

several L2 groups, particularly for AC scoring. The substantial variability within the NS group is

notable and aligns well with accounts arguing that L1 proficiency is far from homogeneous,

especially with regard to skills that involve ‘higher language cognition’ (HLC, Hulstijn 2011),

such as tasks requiring reading and writing. NSs’ educational attainment is a well-known

predictor for performance on HLC tasks (see Hulstijn 2011 for review) and has also been argued

to explain significant variance in NSs’ performance on tasks more narrowly focused on basic

language skills, such as grammatical processing (Dąbrowska 2012, Pakulak and Neville 2010).

The fact that we found moderate reliability for the BCT80 with an NS group, with a majority of

items discriminating, suggests that the observed variability among NS is systematic. Thus,

whenever comparison between an L2 and an NS control group for language proficiency is part of

the research design of an SLA study, it should be borne in mind that the comparison is not with

an L1 baseline, but with an L1 base-distribution.

Variations in EX vs AC scoring. Means were generally higher and SDs larger for AC scoring

(Table 2) than for EX scoring (Table 1). This of course is unsurprising since correct answers in

EX scoring are a subset of those in AC scoring. The larger SDs in AC scoring indicate that the

AC scores were more varied than those for EX scoring, which in turn may explain in part why

the AC scoring produced higher reliability estimates for all groups scored both ways.

RQ2: To what degree do different sets of items on the BCT80 contribute to the

effectiveness of overall scores for discriminating among test takers, including (a) low- vs

high-proficiency L2 examinees across studies and (b) L2 vs NS test takers?

Our second goal was to examine if and how the scores on the BCT80 provide effective measures

for classifying learners along a continuum of L2 ability. We addressed RQ2 by using item

discrimination analysis to examine which items distinguish well between high and low scorers

within subgroups. For all L2 speakers taken together, Table 3 showed that 39 out of 50 EX scored

items were discriminating at .20 and therefore contributing to the variance and reliability of the

scores, while, for AC scoring, 49 out of 50 of the AC items were discriminating, and, for NSs

total, 37 out of 50 of the EX items discriminated, while 26 out of 50 of the AC items did so.

For the high and low proficiency L2 groups, for EX scoring, 25 out of 50 items discriminated

for high proficiency L2 examinees, while for low proficiency L2 examinees, only nine out of 50

discriminated, and for AC scoring, only 13 out of 50 items discriminated for high proficiency L2

speakers. Thus, for groups with restricted proficiency ranges, fewer items tended to discriminate.

Note also in Table 3 that none of the EX items that discriminated for the high proficiency and

low proficiency examinees were the same. Similarly, for the AC scoring, seven of the 13 items

that discriminated for the high proficiency group did not do so for the low proficiency group. All

of this is consistent with Brown’s (2002: 108) observation that “…the items that are working

well for students at different levels of proficiency appear to be quite different”.

4.1 Implications for Making Scoring Decisions

The results for RQs 1 and 2 have useful implications for researchers needing to make choices

of scoring method for learners at specific proficiency levels. As indicated in Table 2, if a

researcher’s group has low proficiency (like say, group G in Table 2 with a mean of 10.39, SD of

6.11, and reliability of .83), it appears that AC scoring will work better. If the group has high

proficiency (like group I in Table 1 with a mean of 25.18, SD of 6.78, and reliability of .84), it

appears that EX will generally work better. And, if the group has a wide range of scores (like the

total L2 sample), the EX scoring will probably work marginally better. What all of this illustrates

is that one strength of cloze is that only those items that are appropriate for the sample at hand

will discriminate, while the others are essentially switched off in the sense that they will not

contribute to the item or test score variance. The BCT80 thus constitutes a fairly flexible

instrument given that researchers can use the EX or AC scoring method, whichever is most likely

to fit the level of the group(s) and range of scores involved in their particular study so that the

items will be at the appropriate level of difficulty and therefore be most likely to discriminate.

We have demonstrated here that, for groups with wide score ranges, larger numbers of items are

more likely to discriminate between high and low performing students and therefore produce

more reliable total scores. For heterogeneous L2 groups that are generally high in proficiency, it

appears that EX scoring is likely to have larger numbers of items that discriminate between high

and low performing students within that group and therefore produce more reliable total scores.

In contrast, for heterogeneous L2 groups that are generally low in proficiency, it appears that AC

scoring is much more likely to have larger numbers of items that discriminate between high and

low performing students within that group and therefore produce more reliable total scores.

An additional goal in this study was to provide a basis for comparing the results between and

among studies that use the BCT80. Thus, we presented the equivalent raw scores, T scores, z

scores, and percentile scores in Table 4 for both EX and AC scoring methods calculated across

the entire L2 dataset in an informal scale of reference. While our aggregated dataset should not

be considered a norming sample, it is nevertheless based on data from a large number of L2

learners from across the proficiency spectrum. It can thus serve as the basis for a useful common

frame of reference for future researchers reporting EX or AC scores from the BCT80. Instead of

describing their learner groups in terms of arbitrarily defined ‘high’ and ‘low’ proficiency

groups, this scale will provide researchers with the option to define their (sub)groups in terms of

specific percentile bands on this reference scale, thus hopefully making future comparisons

across studies more transparent.

Much of this paper has focused critically on potential problems and limitations of the BCT80

when used in experimental SLA research. Nonetheless, we only know these limitations because

so much research has been done on cloze generally, and with the BCT80 in particular. This

knowledge constitutes a decided advantage in that it allows researchers to gauge with somewhat

more confidence what kind of variance this measure can and cannot capture among the learners

in their study. Given the long track record and flexibility of the scoring methods for the BCT80,

researchers may wish to continue using the BCT80 in their studies largely because of the

flexibility it provides, that is, they can choose the EX and/or AC scoring methods after they see

how well each scoring method works and thereby use whichever provides the more reliable and

perhaps valid decisions about their study participants’ English language proficiency.

4.2 Implications for Dealing with Reliability Issues

In a number of individual studies examined in this paper, reliability estimates were

considerably lower than the values reported in Brown (1980). Generally, what this means to

researchers is that they should never assume that BCT80 scores in their studies will be

distributed as they were in the original study, nor that the scores will be as reliable as they were

in Brown (1980). Indeed, it is reasonable to expect reliability estimates to vary due to differences

in means and distributions, especially with regard to differences in score ranges. In fact, the high

reliability estimates for both EX and AC scoring found in the original Brown (1980) study may

have been due to the way that researcher created a test that produced fairly well-centered and

widely varying score distributions suitable for the students taking the UCLA ESLPE.

So what is a researcher to do? Certainly, SLA researchers who use the BCT80 can refer to its

established reliability and validity (as reported in Brown 1980), but they should also calculate

and discuss the descriptive statistics and reliability estimates that resulted in their specific study

as well because score distributions and reliability are not characteristics of a test, but rather are

characteristics of the scores on a test when it is administered to specific examinees under specific

conditions. Further decisions must then depend on the specific purpose and context of the study

in question:

• If the purpose of the study is to use the cloze scores to create proficiency groups, and

moderate to high reliability is found for the cloze test, researchers can be reasonably

confident in their classification of students into groups.

• If the purpose is to use the cloze scores to simply document the proficiency of the L2 learners

in comparison to L2 learners who were tested the same way in other studies, then low

reliability estimates should certainly be reported, but the reliability issue can be discussed in

reference to Table 4 of this paper, as well as in terms of the problems that small sample sizes

and/or restrictions in range create for the reliability of the BCT80 cloze, as would be true for

any test (see e.g., Brown 1984).

• Regardless of their purpose, researchers might be wise to refer to the informal scale of

reference supplied in Table 4 of this paper to help situate groups of participants relative to

each other and to the overall range of English language proficiency that has been measured

reliably in the aggregated data for this test.

• Researchers who are tempted to add additional acceptable answers to their AC answer

keys because those answers seem correct (though they are not included in the original answer

key) should first consider the benefits of using the BCT80 answer keys exactly as they were

originally used in Brown (1980) in terms of scoring reliability and the comparability of

scores to those in Table 4 (and to those in any other studies that have followed those same

procedures).

5 Conclusion

The analyses we have presented in this paper, based on an aggregated dataset of scores from

1,724 test takers in 19 data sets, have demonstrated that the BCT80 can be a reliable and useful

option for SLA researchers looking for a one-shot global measure of English proficiency for

purposes of classifying research participants along a continuum of language ability. While in the

context of high-stakes educational assessment, computer-adaptive proficiency testing, in which

test takers’ initial responses determine the difficulty of subsequent items, is increasingly being

used to address the difficulty of creating tests appropriate for all skill levels, such methods are

typically not available, or affordable, to SLA researchers, nor are they easily deliverable within

the time-constraints of a typical research test session. For these reasons, SLA researchers often

turn to more easily deliverable measures, such as the publicly accessible LexTALE

(www.lextale.com), which provides a quick score of ‘proficiency’. Yet, as acknowledged by its

authors (Lemhöfer and Broersma 2012: 340), the validity of short tests such as this may depend

on the skill level of test takers, with the initial validation study indicating that the LexTALE is

appropriate primarily for “medium- to high-proficiency” learners. The BCT80, whose flexibility

and effectiveness with learners across the language ability continuum we have examined and

illustrated in some detail here, can present a reliable and practical alternative for SLA

researchers. It is likely, but remains to be demonstrated, that these benefits also apply to other

carefully constructed cloze tests. We hope that the analyses presented in this paper will

encourage and enable SLA researchers to examine other available measures of proficiency for

their suitability in the contexts in which they are to be employed.

References

Alderson, J. C. 1978. A study of the cloze procedure with native and non-native speakers of

English. Edinburgh, Scotland: Doctoral dissertation.

Brown, J. D. 1978. Correlational study of four methods for scoring cloze tests. Los Angeles, CA:

UCLA MA Thesis.

Brown, J. D. 1980. Relative merits of four methods for scoring cloze tests. Modern Language

Journal 64. 311-317.

Brown, J. D. 1982. Testing EFL reading comprehension in engineering English. Los Angeles,

CA: UCLA dissertation.

Brown, J. D. 1983. A closer look at cloze: Validity and reliability. In J. W. Oller, Jr. (ed.), Issues

in language testing research, 237-250. Rowley, MA: Newbury House.

Brown, J. D. 1984. A cloze is a cloze is a cloze? In J. Handscombe, R. Orem & B. Taylor

(eds.), On TESOL '83: The question of control, 109-119. Selected papers from the 17th

Annual TESOL Convention, Toronto. Washington, DC: TESOL.

Brown, J. D. 1994. A closer look at cloze: Validity and reliability. In J. W. Oller, Jr. & J. Jonz

(eds.), Cloze and coherence, 189-196. Lewisburg, PA: Associated University Presses.

Brown, J. D. 2002. Do cloze tests work? Or, is it just an illusion? Second Language Studies

21(1). 79-125.

Brown, J. D., M. I. A. Cunha & S. de F. N. Frota. 2001. The development and validation of a

Portuguese version of the Motivated Strategies for Learning Questionnaire. In Z. Dörnyei &

R. Schmidt (eds.), Motivation and second language acquisition, 257-280. Honolulu, HI:

Second Language Teaching & Curriculum Center, University of Hawai‘i Press.

Brown, J. D., G. Robson & P. Rosenkjar. 2001. Personality, motivation, anxiety, strategies, and

language proficiency of Japanese students. In Z. Dörnyei & R. Schmidt (eds.), Motivation

and second language acquisition, 361-398. Honolulu, HI: Second Language Teaching &

Curriculum Center, University of Hawai‘i Press.

Brien, C. 2013. Neurophysiological evidence of a second language influencing lexical ambiguity

resolution in the first language. Ottawa, Canada: University of Ottawa dissertation.

https://ruor.uottawa.ca/bitstream/10393/26223/1/Brien_Christie_2013_thesis.pdf (accessed 5

February 2019)

Chrabaszcz, A. & N. Jiang. 2014. The role of the native language in the use of the English

nongeneric definite article by L2 learners: A cross-linguistic comparison. Second Language

Research 30. 351-379.

Chrabaszcz, A., M. Winn, C. Y. Lin & W. J. Idsardi. 2014. Acoustic Cues to Perception of Word

Stress by English, Mandarin, and Russian Speakers. Journal of Speech, Language, and

Hearing Research 57. 1468-1479.

Dąbrowska, E. 2012. Different speakers, different grammars: Individual differences in native

language attainment. Linguistic Approaches to Bilingualism 2. 219-253.

Dekydtspotter, L., & A. K. Miller. 2009. Probing for intermediate traces in the processing of

long-distance wh-dependencies in English as a second language. In M. Bowles, T. Ionin, S.

Montrul & A. Tremblay (eds.), Proceedings of the 10th Generative Approaches to Second

Language Acquisition Conference (GASLA 2009), 113-124. Somerville, MA: Cascadilla

Proceedings Project.

Ebel, R. L. 1979. Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ:

Prentice-Hall.

Grosjean, F. 1998. Studying bilinguals: Methodological and conceptual issues. Bilingualism:

Language and Cognition 1. 131-149.

Grüter, T., H. Rohde & A. J. Schafer. 2017. Coreference and discourse coherence in L2: The

roles of grammatical aspect and referential form. Linguistic Approaches to Bilingualism 7.

199-229.

Huensch, A. 2014. The perception and production of palatal codas by Korean L2 learners of

English. Urbana, IL: University of Illinois at Urbana-Champaign dissertation.

http://hdl.handle.net/2142/46739 (accessed 5 February 2019)

Hulstijn, J. H. 2010. Measuring second language proficiency. In E. Blom & S. Unsworth (eds.),

Experimental methods in language acquisition research (EMLAR), 185-199. Amsterdam:

John Benjamins.

Hulstijn, J. H. 2011. Language proficiency in native and non-native speakers: An agenda for

research and suggestions for second-language assessment. Language Assessment Quarterly

8. 229-249.

Hulstijn, J. H. 2012. The construct of language proficiency in the study of bilingualism from a

cognitive perspective. Bilingualism: Language and Cognition 15. 422-433.

Kim, K. & H. Kim. 2013. L1 Korean transfer in processing L2 English passive sentences. In E.

Voss, S. D. Tai, & Z. Li (eds.), Selected Proceedings of the 2011 Second Language Research

Forum, 118-128. Somerville, MA: Cascadilla Proceedings Project.

Kurilecz, M. 1969. Man and his world: A structured reader. New York: Crowell.

Kwak, H. Y. 2010. Scope interpretation in first and second language acquisition: Numeral

quantifiers and negation. Honolulu, HI: University of Hawai‘i at Mānoa dissertation.

http://www.ling.hawaii.edu/graduate/Dissertations/HyeYoungKwakFinal.pdf (accessed 5

February 2019)

Leclercq, P. & A. Edmonds. 2014. How to assess L2 proficiency? An overview of proficiency

assessment research. In P. Leclercq, A. Edmonds & H. Hilton (eds.), Measuring L2

proficiency: Perspectives from SLA, 3-23. Bristol, UK: Multilingual Matters.

Lee, S. 2010. Interpretation of scope by Korean L2 learners of English: A self-paced reading

study. English Teaching 65. 59-78.

Lemhöfer, K. & M. Broersma. 2012. Introducing LexTALE: A quick and valid Lexical Test for

Advanced Learners of English. Behavior Research Methods 44. 325-343.

Marian, V., H. K. Blumenfeld & M. Kaushanskaya. 2007. The Language Experience and

Proficiency Questionnaire (LEAP-Q): Assessing language profiles in bilinguals and

multilinguals. Journal of Speech, Language, and Hearing Research 50. 940-967.

McCray, G. & T. Brunfaut. 2018. Investigating the construct measured by banked gap-fill items:

Evidence from eye-tracking. Language Testing 35. 51-73.

Norris, J. M. (ed.). 2018. Developing C-tests for estimating proficiency in foreign language

research. New York: Peter Lang.

Norris, J. M. & L. Ortega. 2000. Effectiveness of L2 instruction: A research synthesis and

quantitative meta-analysis. Language Learning 50. 417-528.

Oller, J. W. Jr. 1979. Language tests at school: A pragmatic approach. London: Longman.

Pakulak, E. & H. J. Neville. 2010. Proficiency differences in syntactic processing of native

speakers indexed by event-related potentials. Journal of Cognitive Neuroscience 22. 2728-2744.

Qin, Z., Y. F. Chien & A. Tremblay. 2017. Processing of word-level stress by Mandarin-speaking

second language learners of English. Applied Psycholinguistics 38. 541-570.

Sasayama, S. 2015. Validating the assumed relationship between task design, cognitive

complexity, and second language task performance. Washington, DC: Georgetown

University dissertation. https://repository.library.georgetown.edu/handle/10822/1029904

(accessed 5 February 2019)

Schoonen, R. 2011. How language ability is assessed. In E. Hinkel (ed.), Handbook of Research

in Second Language Teaching and Learning (Vol. II), 701-716. London: Routledge.

Thomas, M. 1994. Assessment of L2 proficiency in second language acquisition research.

Language Learning 44. 307-336.

Thomas, M. 2006. Research synthesis and historiography: The case of assessment of second

language proficiency. In J. Norris & L. Ortega (eds.), Synthesizing research on language

learning and teaching, 279-298. Philadelphia, PA: John Benjamins.

Tremblay, A. 2008. Is second language lexical access prosodically constrained? Processing of

word stress by French Canadian second language learners of English. Applied

Psycholinguistics 29. 553-584.

Tremblay, A. 2011. Proficiency assessment standards in second language acquisition research:

“Clozing” the Gap. Studies in Second Language Acquisition 33. 339-372.

Zenker, F. & B. D. Schwartz. 2017. Topicalization from adjuncts in English vs. Chinese vs.

Chinese-English interlanguage. In M. LaMendola & J. Scott (eds.), Proceedings of the 41st

annual Boston University Conference on Language Development, 806-819. Somerville, MA:

Cascadilla Press.