
Score Comparability of Online and Paper Administrations of the

Texas Assessment of Knowledge and Skills

Walter D. Way

Laurie Laughlin Davis

Steven Fitzpatrick

Pearson Educational Measurement

Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA, April, 2006


Introduction

A rapidly increasing number of state education departments are exploring or

implementing online assessments as part of their statewide assessment programs. The potential

advantages of online testing in K-12 settings are obvious. These include quicker turnaround of

results, cost savings related to printing and shipping paper test materials, improved test security,

more flexible and less burdensome test administrations, and a technological basis for introducing

innovative item formats and test delivery algorithms. In addition, recent surveys indicate that

students testing online enjoy their experiences, feel comfortable with taking tests by computer,

and tend to prefer it to traditional paper testing (Glasnapp, Poggio, Poggio, & Yang, 2005;

O’Malley et al., 2005; Ito & Sykes, 2004).

In states where online testing has been introduced as part of their high-stakes

assessments, not all schools have had the infrastructure and equipment to test online. For this

reason, paper and online versions of the same tests are typically offered side-by-side. Any time

both paper-based and online assessments co-exist, professional testing standards indicate the

need to ensure comparable results across paper and online mediums. The Guidelines for

Computer-Based Tests and Interpretations (APA, 1986) states: “...when interpreting scores from

the computerized versions of conventional tests, the equivalence of scores from computerized

versions should be established and documented before using norms or cut scores obtained from

conventional tests.” (p. 18). The joint Standards for Educational and Psychological Testing also

recommends empirical validation of score interpretations across computer-based and paper-

based tests (AERA, APA, NCME, 1999, Standard 4.10).

The comparability of test scores based on online versus paper testing has been studied for

more than 20 years. Reviews of the comparability literature were reported by Mazzeo and Harvey (1988), who reported mixed results, and by Mead and Drasgow (1993), who concluded that there

were essentially no differences in examinee scores by mode-of-administration for power tests.

Paek (2005) provided a summary of more recent comparability research and concluded that, in

general, computer and paper versions of traditional multiple-choice tests are comparable across

grades and academic subjects. However, when tests are timed, differential speededness can lead


to mode effects. For example, a recent study by Ito and Sykes (2004) reported significantly

lower performance on timed web-based norm-referenced tests at grades 4-12 compared with

paper versions. These differences seemed to occur because students needed more time on the

web-based test than they did on the paper test. Pommerich (2004) reported evidence of mode

differences due to differential speededness in tests given at grades 11 and 12, but in her study

online performance on questions near the end of several tests was higher than paper performance

on these same items. She hypothesized that students who are rushed for time might actually

benefit from testing online because the computer makes it easier to respond and move quickly

from item to item.

A number of studies have suggested that no mode differences can be expected when

individual test items can be presented within a single screen (Poggio, Glasnapp, Yang, &

Poggio, 2005; Hetter, Segall & Bloxom, 1997; Bergstrom, 1992; Spray, Ackerman, Reckase, &

Carlson, 1989). However, when items are associated with text that requires scrolling, such as is

typically the case with reading tests, studies have indicated lower performance for students

testing online (O’Malley et al., 2005; Pommerich, 2004; Bridgeman, Lennon, & Jackenthal, 2001;

Choi & Tinkler, 2002; Bergstrom, 1992).

In general, the results of comparability research are difficult to evaluate for several

reasons. First, there has been a continual evolution in both computer technology and the

computer skills of test-takers. Thus, earlier studies have limited generalizability, and more

recent studies may not generalize well to future settings. Second, most comparability research is

carried out in the context of operational testing programs, where less-than-desirable experimental

control is usually the norm. In such studies, conclusions are often tempered because of design

limitations such as lack of random assignment, insufficient statistical power, order-of-

administration effects, and effects due to differences in test forms given across modes. Finally,

the content areas, test designs, test administration systems, and testing populations can differ

considerably across comparability studies, and differences in any of these factors could lead to

different findings from one study to another.

For a policy maker interested in introducing online assessments for a high-stakes K-12

testing program, the need to assess comparability creates a number of challenges. While some

stakeholders will lobby for immediate and widespread introduction of online testing, researchers


and psychometricians will advise more cautious and controlled experimental studies. Such

studies can be expensive and usually require efforts beyond those needed to meet the usual

challenges associated with the ongoing paper-based program. Furthermore, no matter how well

a comparability study is designed, executing the design depends on the volunteer participation of

individual schools and districts. As such, one can expect that schools will vary in their ability to

execute the procedures called for in the experimental design, and that a nontrivial number of the schools that sign up for the study will invariably drop out.

Poggio et al. (2005) and Poggio, Glasnapp, Yang, Beauchamp, and Dunham (2005)

reported on an approach to comparability research in the live context of Kansas’ assessment

program that balanced an aggressive approach to online implementation with the need to collect

comparability data. In their studies, all schools were invited to administer the Kansas

Computerized Assessment (KCA), and online volunteers were further asked if they would be

willing to “double” test their students by administering a paper form of the test in addition to the

online assessment. Studies were carried out for grade 7 mathematics in spring 2003 and for

mathematics (grades 4, 7, and 10) and reading (grades 5, 8, and 11) in 2004. The studies reported

no evidence of mode effects for any of the tests evaluated. However, some of the findings may

have been confounded by order-of-administration effects and limited samples of students for

whom testing order could be reliably identified. If a mode effect for reading did exist, it is not

clear whether the design carried out could have identified it, and if so, whether a sufficient

statistical adjustment could have been applied. Because only a subset of students taking the

KCA also took the paper test, it would not have been possible to assign each online student the

higher of two scores.

In this paper, we present results from two online comparability studies that were

conducted for the Texas statewide assessment program in spring 2005. The purpose of the

studies was to evaluate the comparability of online and paper versions of the Texas Assessment

of Knowledge and Skills (TAKS) in mathematics, reading/English language arts, science and

social studies at grades 8 and 11 for the purposes of test score reporting, and to appropriately

adjust equated score conversion tables for students testing online as warranted. In the sections

that follow, we will describe the TAKS program and initial efforts to transition the program to

online testing, introduce the design and methodology used for the comparability studies at each

grade level, and present results of the score comparability studies conducted at grades 8 and 11.


In particular, we will introduce an approach and design for studying the comparability of online

and paper tests that we refer to as “matched samples comparability analyses” (MSCA). We

believe this approach is particularly well-suited to monitoring comparability as states transition

their high-stakes testing programs to online testing. In the last section of this paper, we will

report on some additional analyses that evaluate the sensitivity of the MSCA approach for

detecting differences in online and paper group performance when these groups differ in terms of

overall proficiency.

The TAKS Program and Online Testing

TAKS is the primary state-mandated assessment in Texas and represents the latest and most comprehensive implementation of the statewide assessment programs that have been

ongoing for more than 20 years. First administered in spring 2003, TAKS is given to students in

mathematics at grades 3–10 and at the exit level (grade 11); in reading at grades 3–9; in writing

at grades 4 and 7; in English language arts (ELA) at Grade 10 and at the exit level; in science at

grades 5, 8, and 10 and at the exit level; and in social studies at grades 8 and 10 and at the exit

level. Spanish versions of TAKS are available at grades 3–6. Every TAKS test is directly aligned

to the Texas Essential Knowledge and Skills (TEKS) curriculum. On each TAKS test, the critical

knowledge and skills are measured by a series of test objectives. These objectives are not found

verbatim in the TEKS curriculum. Rather, the objectives are umbrella statements that serve as

headings under which student expectations from the TEKS can be meaningfully grouped. TAKS

test results are used to comply with the requirements of the No Child Left Behind (NCLB) act, as

well as for statewide accountability purposes. The exit level TAKS is part of high school

graduation requirements in Texas and is offered multiple times to students who do not pass. Test

results are reported to teachers and parents, and are used for instructional decisions as

appropriate. The TAKS tests are scaled separately at each grade, with a score of 2100

representing “met standard” and 2400 representing “commended performance” at each grade

level. In practice, the highest equated scale score below these thresholds is set to these threshold

values. Additional information on the TAKS can be found on the Texas Education Agency

(TEA) web site at http://www.tea.state.tx.us/student.assessment/taks/booklets/index.html.

The TEA first began testing by computer in fall 2002, when an end-of-course

examination in Algebra I was made available online and districts were given the option of using


this test either in online or paper format. In spring 2004, an online testing pilot was carried out

in three grade 8 TAKS subject areas, reading, mathematics, and social studies. The goals of the

pilot were to determine the administrative procedures necessary to deliver online assessments in

the schools, to assess the readiness of Texas school districts to administer online assessments, to

document administrative challenges, and to the extent possible, to compare performance on

online assessments with paper test performance. The pilot tests were administered in

volunteering campuses during a two-week window prior to the operational grade 8 TAKS

administration. Although data related to online performance were collected, the design of the

pilot did not permit conclusive comparisons of online and paper performance.

In spring 2005, the TEA carried out additional studies of online testing at grades 8 and 11

to compare online and paper test performance in reading, mathematics, social studies, and

science. Score comparability for science was assessed only at grade 11, although a science field-

test at grade 8 included an online component. The grade 8 and 11 studies involved different data

collection designs. At grade 8, schools that volunteered to participate were randomly assigned to

administer one of the three TAKS content areas online. The same test form was administered

both in paper and online. Each student tested only one time in a given content area; thus, the

results for students testing online were to be reported as part of the statewide assessment results.

At grade 11 (exit level) TAKS, a special re-test administration was offered in June. Students in

the participating schools who had not yet passed exit-level TAKS in at least one of the four

subject areas were offered an extra testing opportunity as part of this administration. In addition,

a small number of students that would be entering grade 11 in the fall were allowed to participate

in the administration (these students will be referred to as “rising juniors”). For each exit-level

TAKS subject area, volunteering students in these schools were randomly assigned to take either

an online or a paper version of the same test form.

Research Methodology

The comparability study design required conducting analyses that would support score

adjustments for those students testing online, if such adjustments were warranted. To

accomplish this, we utilized an approach that considered score comparability in the context of

test equating. Specifically, we equated the online version of the tests to the paper version of the


tests under the assumptions of a random groups design. The details of how the equatings were

accomplished differed for grade 8 and grade 11, as described below.

Matched Samples Comparability Analyses for Grade 8

For grade 8, we initially thought that the comparability data could be analyzed based on

random assignments to condition at the school level, as it was expected that approximately 40

schools would administer each of the three content areas online. However, voluntary

participation for the comparability study was much lower than expected, and the number of

schools testing in each subject area was too small to support analyses based on random

assignment at the school level. As a result, we compared test performance for students testing

online with comparison groups from the paper results that were matched to the online students in

terms of spring 2004 test performance. We refer to this approach as matched samples

comparability analyses (MSCA). In this approach, student scale scores for reading and

mathematics obtained in grade 7 were used as matching variables, and sub-samples of students

equal in size to the numbers of students testing online were selected from the paper TAKS test takers. The

paper students were selected so that the grade 7 reading and mathematics scores in the online and

matched paper groups were identical.
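The matching logic can be sketched briefly. The following Python fragment assumes pandas DataFrames named online and paper with hypothetical columns g7_read_ss and g7_math_ss holding the grade 7 scale scores; it illustrates the idea rather than reproducing the code used in the study:

import numpy as np
import pandas as pd

def draw_matched_paper_sample(online, paper, rng):
    # For every online record, draw (with replacement) a paper record with
    # identical grade 7 reading and mathematics scale scores.
    paper_cells = paper.groupby(["g7_read_ss", "g7_math_ss"]).indices
    picks = []
    for key, group in online.groupby(["g7_read_ss", "g7_math_ss"]):
        # Assumes every online score combination occurs in the much larger paper pool.
        pool = paper_cells[key]
        picks.append(rng.choice(pool, size=len(group), replace=True))
    return paper.iloc[np.concatenate(picks)]

rng = np.random.default_rng(2005)
# matched_paper = draw_matched_paper_sample(online, paper, rng)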

In devising this approach, we first regressed 2004 grade 8 TAKS scale scores on 2003

grade 7 TAKS scale scores. We found the following multiple correlations across reading, math,

and social studies (note that there is no grade 7 social studies test):

Dependent Variable   Independent Variable(s)   r
G8ReadingSS          G7ReadingSS               0.74
G8ReadingSS          G7ReadingSS, G7MathSS     0.76
G8MathSS             G7MathSS                  0.82
G8MathSS             G7ReadingSS, G7MathSS     0.83
G8SocSS              G7ReadingSS, G7MathSS     0.72
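As a small illustration of how such multiple correlations can be computed, the sketch below regresses one score on one or more predictors by ordinary least squares and correlates the fitted values with the observed scores; the simulated numbers are arbitrary and are not TAKS data:

import numpy as np

def multiple_correlation(y, X):
    # Correlation between y and its least-squares prediction from the columns of X.
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.corrcoef(y, X1 @ beta)[0, 1]

rng = np.random.default_rng(0)
g7_math = rng.normal(2150, 150, size=5000)
g8_math = 2160 + 0.8 * (g7_math - 2150) + rng.normal(0, 90, size=5000)
print(round(multiple_correlation(g8_math, g7_math[:, None]), 2))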

The MSCA involved a bootstrap method that was designed to establish raw to scale score

conversions by equating the online form to the paper form, and also to estimate bootstrap

standard errors of the equating to assist in interpreting differences between the online and paper

score conversions (cf. Kolen & Brennan, 2004, pp. 232-235). The application of equating


methods was based on an assumption that the online and matched paper sample groups were

randomly equivalent. For each replication, we used IRT true score equating based on Rasch

calibrations of the online and paper samples using the WINSTEPS program (Linacre, 2001). The

MSCA involved sampling with replacement, in which both online and matched paper student

samples were drawn 500 times and analyses were repeated for each replicated sample. The

specific procedures used in the MSCA were as follows (a code sketch of the equating steps appears after the list):

1. Each student testing online with grade 7 TAKS scores in reading and mathematics was matched to a student from the available 2005 paper TAKS data with identical grade 7 reading and mathematics scale scores. Both reading and mathematics were used in the matching for all three grade 8 subject areas.

2. Online versus paper comparability analyses were performed using the matched groups of students by repeating the following steps 500 times:

a. A bootstrap sample of students (i.e., random sampling with replacement) was drawn from the online participants.

b. A matched stratified bootstrap sample (i.e., random sampling with replacement at each combination of mathematics and reading scores observed in the online sample drawn in step 2.a) was drawn from the available 2005 paper TAKS data.

c. A raw score-to-raw score equating was carried out with each of the bootstrap samples as follows:

i. WINSTEPS was used to calibrate the online group data, centering the calibrations so that the mean of the ability estimates was zero. The item difficulty estimates and raw score-to-theta conversions were retained.

ii. WINSTEPS was used to calibrate the paper comparison group data, centering the calibrations so that the mean of the ability estimates was zero. The item difficulty estimates and raw score-to-theta conversions were retained.

iii. IRT true score equating was used to find, for each integer raw score in the online group, the equivalent raw score in the paper comparison group by calculating ΣP(θ), where P(θ) is the Rasch probability of a correct response, the summation is over the paper item difficulty estimates, and θ is taken from the online raw score-to-theta conversions retained in step 2.c.i.

iv. Using linear interpolation and the unrounded operational raw score-to-scale score conversions, the paper raw score equivalents found in step 2.c.iii were converted to scale score equivalents.

d. The raw score equivalents were transformed to scale scores using the operational 2005 score conversion tables and linear interpolation.

3. Online scale score conversions for each raw score were based on the average of the conversions calculated over each of the 500 replications. These average scale score values comprised the alternate online raw score to scale conversion table.

4. The standard deviation of online scaled score conversions at each raw score represented the conditional bootstrap standard errors of the linking.
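To make the equating steps (2.c and 2.d) concrete, a minimal Python sketch is shown below. It assumes that the Rasch item difficulty estimates from the online and paper calibrations are already available (the study obtained them from WINSTEPS); the function names are ours and the fragment is illustrative rather than the code used in the study.

import numpy as np

def rasch_p(theta, b):
    # Rasch probability of a correct response to items with difficulties b.
    return 1.0 / (1.0 + np.exp(-(theta - np.asarray(b, dtype=float))))

def theta_for_raw(raw, b, lo=-6.0, hi=6.0, n_iter=60):
    # Invert the test characteristic curve by bisection: find the theta whose
    # expected raw score equals the observed integer raw score.
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if rasch_p(mid, b).sum() < raw:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def true_score_equate(online_b, paper_b):
    # For each integer raw score on the online form, find the equivalent raw
    # score on the paper form as the sum of P(theta) over the paper items
    # (step 2.c.iii). Zero and perfect scores are carried over unchanged.
    n_online = len(online_b)
    equiv = np.zeros(n_online + 1)
    equiv[n_online] = float(len(paper_b))
    for raw in range(1, n_online):
        theta = theta_for_raw(raw, online_b)
        equiv[raw] = rasch_p(theta, paper_b).sum()
    return equiv

def to_scale_scores(equiv_raw, paper_raw_to_ss):
    # Step 2.d: linear interpolation of the paper raw score equivalents into
    # the operational paper raw-score-to-scale-score conversion table.
    return np.interp(equiv_raw, np.arange(len(paper_raw_to_ss)), paper_raw_to_ss)

# Repeating this for 500 bootstrap samples and averaging the scale score
# equivalents at each raw score gives the alternate online conversion table
# (step 3); the standard deviation at each raw score gives the conditional
# bootstrap standard error of the linking (step 4).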

To assist in comparing the online and paper score conversions, we considered the

following criterion suggested by Dorans and Lawrence (1990): “To assess equivalence, it is


convenient to compute the difference between the equating function and the identity

transformation, and to divide this difference by the standard error of equating. If the resultant

ratio falls within a bandwidth of plus or minus two, then the equating function is deemed to be

within sampling error of the identity function” (p. 247). It should be pointed out that the Dorans

and Lawrence criterion is only one of many justifiable approaches that could be used to interpret

the results. We also paid special attention to differences in the range of scaled scores around the

“met standard” score levels. Differences at extremes of the scale were considered less important,

given the purpose and primary uses of the TAKS tests.
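A minimal sketch of how this criterion can be applied to the tabled conversions (the names are ours):

import numpy as np

def flag_differences(online_ss, paper_ss, boot_se):
    # Dorans and Lawrence (1990) style check: the difference between the
    # equating function and the identity, divided by the bootstrap standard
    # error of equating; ratios outside plus or minus two are flagged
    # (the "**" entries in Tables 2 to 4).
    diff = np.asarray(online_ss, dtype=float) - np.asarray(paper_ss, dtype=float)
    se = np.asarray(boot_se, dtype=float)
    ratio = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)
    return diff, np.abs(ratio) > 2.0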

Grade 11 Comparability Analyses

For the grade 11 comparability analyses, the researchers involved in the study randomly

assigned the participating students from each school to the online or paper testing conditions.

Because testing occurred over a single day for each subject area and many of the participating

schools were limited in how many students they could test in a single day, slightly more students

were assigned to the paper condition than to the online condition.

To evaluate score comparability for the grade 11 study, we employed some of the same

procedures that we used in the MSCA analyses for grade 8. Specifically, we randomly selected

students from the online and paper samples with replacement 500 times and equated the scores

obtained in each sampling replication. These bootstrap analyses resulted in alternate online score

conversion tables for each test and bootstrap standard errors of equating to assist in interpreting

results. One difference between the grade 11 and the grade 8 analyses was the bootstrap

replications involved simple random sampling with replacement, that is, that there was no need

to select a sample from the paper group that was matched to the online sample in terms of

previous test scores. Another difference was that the bootstrap analyses for grade 11 ELA

incorporated polytomously-scored constructed response and extended essay item types.
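Incorporating the polytomous ELA items mainly changes the true score equating step: the sum of correct-response probabilities is replaced by a sum of expected item scores. The sketch below computes an expected item score under a Rasch partial credit parameterization; since the paper does not specify the exact polytomous model used, the fragment is only an assumed illustration:

import numpy as np

def pcm_expected_score(theta, step_difficulties):
    # Expected score on a polytomous item under the partial credit model:
    # P(score = k) is proportional to exp(sum over j <= k of (theta - d_j)),
    # with the zero category having numerator exp(0).
    d = np.asarray(step_difficulties, dtype=float)
    numer = np.concatenate([[0.0], np.cumsum(theta - d)])
    probs = np.exp(numer - numer.max())
    probs /= probs.sum()
    return float(np.dot(np.arange(len(probs)), probs))

# A hypothetical 4-point essay item (weighted x4 in the ELA composite):
print(round(pcm_expected_score(theta=0.5, step_difficulties=[-1.0, 0.0, 0.8, 1.6]), 2))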

Results

Matched Samples Comparability Analyses for Grade 8

Table 1 presents the means and standard deviations of the grade 8 raw scores and grade 7

scale scores for each test evaluated using the MSCA. It can be seen in Table 1 that the mean raw

scores on the grade 8 tests for the online and paper groups are similar (within 0.16) for all three


tests. The grade 7 reading and mathematics scale scores used with the MSCA were very similar

for the mathematics and social studies online and paper samples (within 7 points). However, for

reading the previous scale scores were noticeably higher for the online group compared with the

paper group (e.g., the mean reading scale score was about 18 points higher and the mean

mathematics scale score was about 12 points higher).

Insert Table 1 about here

Tables 2 to 4 summarize the comparability analysis results for mathematics, reading, and

social studies. The columns of the tables are as follows:

RS – Paper test raw score.
CBT_RS – Equivalent raw scores on the online test based on the MSCA equating. Note that a higher equivalent raw score indicates that the online version of the test was more difficult.
RS_SD – Standard deviation of the equivalent raw scores over the 500 replications.
PAP_SS – Paper test scale score conversions, based on the 2005 TAKS equating results.
CBT_SS – Equivalent scale scores on the online test based on the MSCA equating. Again, higher equivalent scale scores indicate that the online version of the test was more difficult.
SS_SD – Standard deviation of the equivalent scale scores over the 500 replications.
RS_DIF – Difference between online raw score equivalent and paper raw score.
SS_DIF – Difference between online scale score equivalent and paper scale score.
SIG? – Scale score differences exceeding two bootstrap standard errors are noted by “**”.

Insert Tables 2 to 4 about here

In these tables, the equating conversions for the online and paper forms are assumed to be

the same for zero and perfect scores, since true score equating conversions cannot be estimated

with the Rasch model at these score points.

For mathematics (Table 2), the online versus paper differences were slight. In terms of

the raw score conversions, the differences were never as much as one-half of a point. In terms of

scaled score conversions, the differences were less than five points over most of the scale.

However, at the upper raw score points (41 and higher), scaled score differences exceeded two

standard errors of the linking.

For reading (Table 3), large differences occurred throughout the scale. Differences in

raw score conversions exceeded one and a half points over much of the score range. Differences


in scale score conversions were over 20 points over most of the score range. All of the

differences in scale score conversions exceeded two standard errors of the linking.

For social studies (Table 4) slight differences in both raw score and scale score

conversions occurred. The raw score differences were never as much as one-half of a point, and

the scale score differences were never as much as six points. None of the scale score differences

exceeded two standard errors of the linking.

Figure 1 presents differences between the online and paper scale score conversions

graphically as a function of raw score, along with upper and lower intervals defined by plus and

minus two bootstrap standard errors of equating. These graphs provide a relatively concise

summary of the patterns found in Tables 2 through 4.

Insert Figure 1 about here

Grade 11 Comparability Results

Table 5 presents univariate statistics for the online and paper groups participating in the

grade 11 comparability study. These data indicate higher raw scores for the paper group

compared to the online group for mathematics, science, and ELA. For social studies, the raw

score mean is slightly higher for the online group than for the paper group.

Insert Table 5 about here

Unlike the other exit level TAKS tests, which consist entirely of objectively scored items

(i.e., multiple-choice and a small number of grid-in response items for mathematics), ELA

contains both short-answer open-ended items and an extended essay item. Table 6 presents

univariate statistics for the online and paper groups on the ELA test broken down by the different

item types. The first entry in the table (ELA) is for the unweighted sum of the items and has a

possible maximum of 61 points (48 points for multiple-choice, 9 points for short-answer, and

four points for the essay). The second entry is weighted (ELA_WT), with the essay counting four

times its maximum point total. The maximum possible weighted score is 73. The scale score for

the ELA test is based on the weighted raw score. For the multiple-choice raw score, the essay


score, and two of the three open-ended scores, the mean for the paper group was slightly higher

than the mean for the online group.

Insert Table 6 about here

For many of the students participating in the grade 11 study, a previous TAKS score was

available. These scores are listed in Table 7 and include entries both for students for whom the

previously available TAKS score was from the grade 10 test (mostly the “rising juniors”) and

students for whom the previously available TAKS score was from the exit level test.

Insert Table 7 about here

Note that previous TAKS score means were substantially higher for students for whom the previous score was from the grade 10 test than for students for whom the previously

available scale score was from the exit level test.1 This was expected, given that the rising

juniors participating in the study were higher achieving students generally, as compared with

grade 11 students who had previously failed the exit level TAKS. For all four tests, the previous

TAKS score means were very similar (within three scale score points) for online and paper

students with a previous score from the grade 11 test. For mathematics, science, and social

studies, students testing online had higher previous TAKS score means from grade 10 than

students testing by paper. Although these scale score differences were comparatively large (16.2,

18.6, and 12.2, respectively), they seem less of a concern considering the small numbers of

students and the much larger scale score standard deviations for the students with a previous

TAKS score from the grade 10 test. Considering all of the information from the analysis of

previous TAKS scores, we concluded that the assumption of randomly equivalent online and

paper groups was reasonable for the grade 11 TAKS comparability analyses.

Figures 2 and 3 present graphs of the differences between the grade 11 online and paper

scale scores as a function of raw score, along with upper and lower intervals defined by ±2

1 Strictly speaking, the mean scale scores on the grade 10 and 11 tests are not directly comparable, since the scales are uniquely defined within each grade. However, since “met standard” and “commended performance” are defined at 2100 and 2400 for both grades, general comparative inferences seem reasonable.


bootstrap standard errors of equating. Figure 2 presents results for mathematics and ELA, and

Figure 3 presents results for science and social studies.

Insert Figures 2 and 3 about here

The results for grade 11 mathematics presented in Figure 2 are very similar to the results

for grade 8, except that both the online minus paper scale score differences and the intervals

defined by ±2 bootstrap standard errors are larger than they were for grade 8 mathematics. For

grade 11 mathematics, the online minus paper raw score differences were as high as 0.77 of a

raw score point, and these differences occurred in the region of the “met standard” cut score.

Thus, even though the scale score differences for grade 11 mathematics were within ±2 bootstrap

standard errors, there was evidence of a greater mode effect for grade 11 mathematics than there

was for grade 8 mathematics.

The results for grade 11 ELA presented in Figure 2 also indicated that the test was more

difficult for the online group than for the paper group, but as with grade 11 mathematics the scale

score differences were within ±2 bootstrap standard errors of equating over most of the score

range. The online minus paper differences were never as large as one weighted raw score point,

although the largest difference (0.95) occurred in the region of the “met standard” cut score. The

differences at extremely high score levels (i.e., above a weighted raw score of 65) reflect the fact

that ELA scale score conversions change by large amounts with changes in the weighted raw

scores in this region of the scale. For example, in the paper conversion table, a weighted raw

score of 70 corresponded to a scale score of 2802, and a weighted raw score of 71 corresponded

to a scale score of 2903. Thus, the online minus paper raw score difference of about -0.50 of a

point in this region of the scale converted to a scale score difference of about 50 points. In

addition, the Rasch true score equating was not accurate in this region of the scale because of the

limited numbers of high performing students participating in the study and the relative difficulty

of the open-ended constructed response items.2

2 In fact, to obtain Rasch scaling tables that extended to the entire range of possible scores, it was necessary to augment both the online and paper ELA samples with an “imputed” item response record that included maximum scores on the two open-ended items for which no student in either the online or paper samples obtained the maximum possible score (see Table 6).


The results for grade 11 science and social studies presented in Figure 3 indicate little or

no evidence of mode effects between the students testing online and the students testing by

paper. The results for science indicated the online version was slightly more difficult than the

paper version, but the raw score differences were never as high as one-half of a raw score point.

The scale score differences ranged between 3 and 7 score points over most of the scale, and these

differences never exceeded ±2 bootstrap standard errors. The social studies results indicated that

the online form was slightly easier than the paper version. Raw score differences were never

more than 0.40 of a raw score point and scale score differences were six points or lower over

most of the scale. The social studies differences never exceeded ±2 bootstrap standard errors,

although the bootstrap standard errors of equating for social studies were much larger than they

were for the other tests because there were only 355 online students and 388 paper students

taking the social studies test.

Summary of the Grade 8 and Grade 11 Comparability Study Results

To summarize the results of the grade 8 and grade 11 comparability studies, there was

evidence across grades and content areas that the online versions of the TAKS tests were more

difficult than the paper versions. In grade 8 reading, the mode differences were quite pronounced

and warranted the use of the alternate score conversion table for reporting online results. In grade

11 mathematics and ELA, the differences were less pronounced and the ELA results were also

complicated by the contributions of constructed response and extended essay items to the total

scores. Nevertheless, the alternate score conversions were used for reporting scores with these

tests, in part because of the magnitudes of raw score differences but also because of the high

stakes associated with these tests. For the social studies tests, there was little evidence of mode

effects across the two grades, since differences slightly favored the paper group at grade 8 and

slightly favored the online group at grade 11. The comparability results for grade 8 mathematics

and grade 11 science also favored the paper groups, although differences were slight and within

the ±2 bootstrap standard errors of equating for nearly all score points.

In general, the results of the comparability analyses for the TAKS tests at grades 8 and 11

were consistent with the existing literature on the comparability of online and paper assessments

in that the tests where the most significant mode differences were detected involved reading


passages that required scrolling. The mode differences in mathematics, although not large, were

less consistent with the comparability literature, which mostly supports the comparability of online and paper mathematics tests. Keng, McClarty, and Davis (2006) further investigate the

mode differences found for these measures through item-level analyses.

Sensitivity of the MSCA Approach

Although the MSCA method appeared to work well in the context of the grade 8 TAKS

online tests, the conditions for the matched sample analyses were favorable in that the ability levels of the paper and online groups (based on previous test scores) were reasonably

similar. In reviewing results of the analyses, technical advisors working with the state of Texas

recommended that the performance of the MSCA method be studied further to see how

sensitive it is under conditions where the online group and the paper group are less similar in

overall ability. Such documentation is important given that the MSCA approach has been used

to determine and potentially apply alternate score conversions for students taking operational

TAKS tests online.

To address this recommendation, additional sensitivity analyses of the MSCA method

were carried out. The purpose of these analyses was to answer two specific questions:

1. How will the matched sample analyses perform when no “mode” differences exist but the online group and paper group differ in ability based on past test performance?

2. Will the matched sample analyses recover simulated mode differences when the online and paper groups differ in ability based on past test performance?

Sensitivity Analysis Procedures

The general approach for the sensitivity analyses was to select samples of students from

the paper data used in the spring 2005 grade 8 comparability study and to carry out matched

sample analyses as if these samples were students testing online. Analyses were conducted for

mathematics and reading. Four sets of analyses were undertaken. The first set utilized six

mathematics data sets and six reading data sets drawn from the overall paper data for each

measure. For each of the six data sets for a given test, a different target frequency distribution

was established for sampling students. The variables used to sample the data were the previous


spring’s scale scores (mathematics scale scores for mathematics, reading scale scores for

reading). The sample sizes were 1,275 for mathematics data sets and 1,850 for reading data sets,

roughly equivalent to the numbers of spring 2005 grade 8 online testers in these subjects. Tables 8

and 9 list the scale score frequencies and score means of the six selected sensitivity samples and

the overall paper group. For both mathematics and reading, performance increased from sample

1 to sample 6, and sample 4 was proportionally equivalent to the overall paper data.

Insert Tables 8 and 9 about here
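A minimal sketch of how such a sample can be drawn to a target frequency distribution of previous scale scores is shown below; the column and variable names are hypothetical, and the fragment is illustrative rather than the code used in the study:

import numpy as np
import pandas as pd

def sample_to_target(paper, score_col, target_counts, rng):
    # Draw, without replacement, the requested number of paper records at each
    # previous-year scale score, producing a simulated "online" group with a
    # prescribed ability distribution. Each target count is assumed not to
    # exceed the number of paper records available at that score.
    pieces = []
    for score, n_wanted in target_counts.items():
        pool = paper.index[paper[score_col] == score].to_numpy()
        pieces.append(rng.choice(pool, size=n_wanted, replace=False))
    return paper.loc[np.concatenate(pieces)]

rng = np.random.default_rng(42)
# target = {2100: 300, 2150: 400, 2200: 375}   # e.g., 1,275 records in all for mathematics
# simulated_online = sample_to_target(paper_math, "g7_math_ss", target, rng)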

The second, third, and fourth set of analyses simulated mode differences between the

online and paper groups. The data for these analyses were created by systematically modifying

the data sets from the first set of analyses to lower performance on the 2005 test. Three

conditions of lower performance (0.25, 0.5, and 1.0 raw score points, respectively) were

simulated for each data set from the first set of samples. To accomplish this, the responses to

randomly selected items were changed from correct to incorrect for approximately one-half of

the students. (Only one-half of the records were altered to ensure that some perfect and near-

perfect scores remained in the data). Because the process of changing responses from correct to

incorrect was random, it was built into the bootstrap replications. The SAS code to accomplish

this change is shown below:

compare=1/rawold*&mode.*2;
rchange=ranuni(-1);
if rchange>0.5 then do;
  do i=1 to &nitem.;
    if items{i}=1 then do;
      if compare>ranuni(-1) then items{i}=0;
    end;
  end;
end;

The variable, rawold, is the student’s original raw score, and &mode is equal to 0.25, 0.5,

or 1.0, depending upon the condition. The SAS function ranuni generates a uniform random

variable between zero and one.
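For readers who do not use SAS, a rough Python translation of the same step is sketched below; it is our paraphrase of the logic rather than the code used in the study:

import numpy as np

def lower_scores(items, mode_effect, rng):
    # items: a students x items matrix of 0/1 scores. For roughly half of the
    # records, each correct response is flipped to incorrect with probability
    # 2 * mode_effect / rawold, lowering the sample mean raw score by about
    # mode_effect points overall.
    items = np.asarray(items).copy()
    raw_old = items.sum(axis=1)
    p_flip = np.divide(2.0 * mode_effect, raw_old,
                       out=np.zeros(len(items), dtype=float), where=raw_old > 0)
    altered = rng.random(len(items)) > 0.5
    flips = (items == 1) & altered[:, None] & (rng.random(items.shape) < p_flip[:, None])
    items[flips] = 0
    return items

rng = np.random.default_rng(1)
# lowered = lower_scores(simulated_online_items, mode_effect=0.5, rng=rng)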

The total number of conditions in the sensitivity analyses was 48 (2 content areas × 6

samples × 4 sets of analyses). For each condition, we ran matched sample comparability

analyses involving 100 bootstrap replications according to the steps outlined above. To


summarize results within a condition, the differences in equating conversions between the paper

and simulated online forms were evaluated.

Results – No Mode Effects Simulated

Figure 4 presents the differences between the online score conversions resulting from the

100 bootstrap replications and the reported paper test scale score conversions. These differences

are graphed as a function of the paper form raw score. The bootstrap standard errors of the

linking for the online conversions are not shown in these graphs. For mathematics, the

differences ranged from about 3.5 to 4.5 scale score points across the six samples at most score

points. For reading, the differences ranged from about 4.5 to 5.5 across the six samples at most

score points. For both mathematics and reading, the bootstrap standard errors were higher at the

extreme score points, with a pattern similar to the bootstrap standard errors from the Spring 2005 mathematics and reading comparability analyses that are presented in Tables 2 and 3 and Figure 1.

Insert Figure 4 about here

In general, the sensitivity analysis results suggested that the MSCA method is unlikely to

indicate either statistically significant or practically significant performance differences between

online and paper groups in situations where no true differences exist. Moreover, the differences

observed between the online and paper groups based on the matched samples analyses were not

related to the overall proficiency differences between the simulated online and paper samples.

Sample M2 from the mathematics simulation resulted in the largest differences between the

simulated online and paper groups. For this sample, the median raw score difference over 100

replications was about 0.35. However, this difference favored the online group over the paper

group. Since the online versus paper scale score differences were within two bootstrap standard

errors and there is currently no reason to hypothesize that the online group would be advantaged

by mode-of-administration effects for the TAKS program, these results would not have led to

any score adjustments for the online group.

Results – Mode Effects Simulated

Figures 5 to 7 present results of the sensitivity analyses when mode effects were

simulated. Figure 5 presents results based on a simulated mode effect of 0.25 raw score points,


Figure 6 presents results based on a simulated mode effect of 0.50 raw score points, and Figure 7

presents results based on a simulated mode effect of one raw score point.

Insert Figures 5 to 7 about here

The results presented in Figure 5 suggest that the “significance” criterion of two

bootstrap standard errors was, for the most part, too conservative to identify a simulated mode

effect of 0.25 of a raw score point. For mathematics, the results varied over the six simulation

samples. For samples M1 and M4, scale score differences exceeded two bootstrap standard

errors over nearly all score points, suggesting a “significant” mode effect that would

disadvantage online students. For samples M3, M5, and M6, differences indicated that the

simulated online test was more difficult, but the differences were within 10 scale score points

and two bootstrap standard errors. For sample M2, no mode effects were indicated.

In the case of reading, the results over the six simulated samples were more consistent. In

all cases, the simulated online test was more difficult. However, scale score differences were

less than 10 points across virtually all score points for all simulated reading data sets, which was

within the “significance” criterion of two bootstrap standard errors.

The results presented in Figure 6 indicated that the matched samples comparability

analyses consistently detected simulated mode effects of 0.5 raw score points. For all

mathematics samples except M2 and all reading samples, the scale score differences exceeded

two bootstrap standard errors of the linkings. The average scale score differences were between

10 and 20 points for most of these samples.

As would be expected, the results shown in Figure 7 indicated that the simulated mode

effect of 1.0 raw score points was detected in all mathematics and reading samples. Evidence of

mode effects for these data sets was unequivocal.

Discussion of the MSCA Sensitivity Analyses

In general, the sensitivity analyses supported the MSCA approach. The method does not

seem to be affected by differences in the ability of the group taking an online test versus the

comparison paper test takers, at least within the range of differences studied here. One reason for


this robustness might be the difference in sample sizes between the online and paper groups. In

the grade 8 mathematics and reading comparability studies, the paper groups were larger than the

online groups by factors of 125 and 85, respectively. It is not clear from these analyses whether

the MCSA approach sensitivity analyses will work as well if the relative sample sizes of the

online and paper groups become more similar. However, this seems unlikely to happen in Texas,

at least in the near future.

Not surprisingly, the method did not appear to be robust in detecting a simulated mode

effect of 0.25 raw score points. In part, this is a function of how conservative or liberal one is in

evaluating the results. The criterion of two bootstrap standard errors of the linkings seemed, in

the context of the data studied here, a somewhat conservative criterion. From looking at Figure

5, one could argue that a more liberal criterion (in the sense of being more willing to apply a separate set of score conversions for the online group) might have led to a decision to adjust the

online scores for three of the six samples for both reading and mathematics. Of course, as with

any statistical analysis, the power was related to sample size. To the extent that future online

comparability analyses in Texas involve increasing online sample sizes, the two bootstrap

standard errors criterion will be less likely to be considered conservative. At some point, other

considerations may carry more weight in evaluations, such as the magnitude of raw score-to-raw

score equating differences.

One finding from the sensitivity analyses that would be worth further study was the range

of differences across the six simulated online data sets, particularly in the sensitivity analyses

done for mathematics. A limitation of the study was that the same six samples were used to

study both the conditions where no mode effects were simulated and the conditions where

various levels of mode effects were simulated (since the “mode effects simulated” data sets were

created by randomly changing item responses from the “no mode effects” data). One of the

mathematics simulation samples (sample M2) was drawn by chance in such a way that the

matched samples comparability analyses suggested higher performance for the online group

when no mode effects were present. Sample M2 was drawn with a targeted distribution of

previous scores to be of lower overall performance than the paper group; however, this does not

seem to explain the anomalous results for sample M2. Rather, it seems that substantial sampling variation occurred in the selection of sample M2 that was related to the relationship between the previous test scores (i.e., the Spring 2004 grade 7 mathematics and reading scale scores) and the criterion score (the Spring 2005 grade 8 mathematics raw score).

Thus, a more extensive set of simulations that incorporated variation in the sampling of simulated

online test takers would be helpful in assessing the extent to which this might be a concern. It

might also help to inform decision rules regarding “significant” mode effects.

One final comment about the sensitivity analyses carried out in this study is that future

online comparability studies in Texas will involve matching on a different set of criteria than

those used for the Spring 2005 grade 8 study. For example, an attractive alternative matching

approach would be to create target frequencies based not only on previous scale scores but also

on other important demographic variables such as gender, ethnicity, and English language

proficiency. Because of the extremely large paper group sample sizes, a fairly refined “sampling

grid” could be defined that incorporates most or all of these variables, although it would be

necessary to group previous scale scores into intervals to prevent empty cells in the sampling

grid. A limitation of the current study is that it did not examine the sensitivity of the MSCA

approach to different ways of matching performance between the online and paper groups

besides using previously obtained scale scores. We are currently undertaking such sensitivity

analyses and will use the results to inform the design of Spring 2006 comparability analyses for

the TAKS tests.
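As one illustration of what such a sampling grid might look like, the sketch below bins previous scale scores into quantile intervals and crosses them with demographic cells before drawing the matched paper sample; all column names are placeholders, and the sketch describes a possible design rather than the planned 2006 analyses:

import numpy as np
import pandas as pd

def matched_sample_by_grid(online, paper, rng, n_bins=20):
    # Bin previous scale scores into quantile intervals (to avoid empty cells),
    # cross the bins with demographic variables, and draw a paper record for
    # every online record from the same cell, with replacement.
    qs = np.quantile(paper["prev_ss"], np.linspace(0, 1, n_bins + 1))
    edges = np.unique(qs)[1:-1]
    online = online.assign(ss_bin=np.digitize(online["prev_ss"], edges))
    paper = paper.assign(ss_bin=np.digitize(paper["prev_ss"], edges))
    keys = ["ss_bin", "gender", "ethnicity", "lep_status"]
    paper_cells = paper.groupby(keys).indices
    picks = []
    for key, group in online.groupby(keys):
        # Assumes every online cell is populated in the much larger paper group.
        picks.append(rng.choice(paper_cells[key], size=len(group), replace=True))
    return paper.iloc[np.concatenate(picks)]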

Conclusions

In K-12 testing, the current advantages and future promises of online testing have reached

a tipping point that is encouraging virtually every state to consider or pursue online testing

initiatives as part of their testing program. It is easy to envision that K-12 assessments will be

administered almost exclusively online within the foreseeable future. In the enthusiasm to

embrace what Bennett (2002) refers to as the “inexorable and inevitable” evolution of technology

and assessment, it is tempting to downplay or dismiss the comparability of online and paper

assessments. Nevertheless, state testing programs and the vendors that serve these programs are

clearly obliged to address the issue of score comparability between online and paper versions of

K-12 assessments, especially given the high stakes that assessment results have taken on

in recent years.


The strategy Texas has adopted for introducing online testing is similar to the strategy

that many states are using, where online testing is made available to those districts and schools

that are willing and able to pursue it. The comparability studies presented in this paper illustrate

how responsible and psychometrically defensible comparability analyses can be incorporated

within the constraints of a high-stakes, operational testing program. In Texas, the MSCA

approach is a central part of the strategy to offer online and paper versions of TAKS tests side-

by-side as the districts and schools in the state transition to online testing. By routinely including

these analyses both when online versions of tests are introduced and as they continue to be

offered, it will be possible to monitor the comparability of online and paper tests over time.

Although this approach will not be without challenges, it seems to be an equitable and viable

approach to a difficult assessment problem.


References

American Psychological Association Committee on Professional Standards and Committee on

Psychological Tests and Assessments (APA) (1986). Guidelines for computer-based

tests and interpretations. Washington, DC: Author.

American Educational Research Association (AERA), American Psychological Association

(APA), and the National Council on Measurement in Education (NCME). (1999).

Standards for educational and psychological testing. Washington, DC: AERA.

Bennett, R.E. (2002). Inexorable and inevitable: The continuing story of technology and

assessment. Journal of Technology, Learning, and Assessment, 1(1). Available from

http://www.jtla.org.

Bergstrom, B. (1992, April). Ability measure equivalence of computer adaptive and pencil and

paper tests: A research synthesis. Paper presented at the annual meeting of the American

Educational Research Association, San Francisco, CA.

Bridgeman, B., Lennon, M.L., & Jackenthal, A. (2001). Effects of screen size, screen resolution,

and display rate on computer-based test performance (ETS RR-01-23). Princeton, NJ:

Educational Testing Service.

Choi, S.W. & Tinkler, T. (2002). Evaluating comparability of paper and computer-based

assessment in a K-12 setting. Paper presented at the Annual Meeting of the National

Council on Measurement in Education, New Orleans, LA.

Dorans, N. J., & Lawrence, I. M. (1990). Checking the statistical equivalence of nearly identical

test forms. Applied Measurement in Education, 3, 245-254.

Glasnapp, D.R., Poggio, J., Poggio, A., & Yang, X. (2005). Student Attitudes and Perceptions

Regarding Computerized Testing and the Relationship to Performance in Large Scale

Assessment Programs. Paper presented at the annual meeting of the National Council on

Measurement in Education, Montreal, Canada.


Ito, K., & Sykes, R. C. (2004). Comparability of Scores from Norm-Referenced Paper-and-

Pencil and Web-Based Linear Tests for Grades 4 – 12. Paper presented at the annual

meeting of the American Educational Research Association, San Diego, CA.

Keng, L., McClarty, K. L., & Davis, L. L. (2006). Item-Level Comparative Analysis of Online

and Paper Administrations of the Texas Assessment of Knowledge and Skills. Paper

presented at the annual meeting of the National Council on Measurement in Education,

San Francisco, CA.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices

(2nd ed.). New York: Springer.

Linacre, J. M. (2001). WINSTEPS Rasch Measurement Program, Version 3.32. Chicago: John

M. Linacre.

Mazzeo, J., & Harvey, A.L. (1988). The equivalence of scores from automated and conventional

educational and psychological tests. A review of the literature (ETS RR-88-21).

Princeton, NJ: Educational Testing Service.

Mead, A.D. & Drasgow, F. (1993). Equivalence of computerized and paper cognitive ability

tests: A meta-analysis. Psychological Bulletin, 114(3), 449-458.

O’Malley, K. J., Kirkpatrick, R., Sherwood, W., Burdick, H. J., Hsieh, M. C., & Sanford, E. E.

(2005, April). Comparability of a Paper Based and Computer Based Reading Test in

Early Elementary Grades. Paper presented at the AERA Division D Graduate Student

Seminar, Montreal, Canada.

Paek, P. (2005). Recent trends in comparability studies (PEM Research Report 05-05). Available

from http://www.pearsonedmeasurement.com/downloads/research/RR_05_05.pdf.

Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative evaluation of score

results from computerized and paper and pencil mathematics testing in a large scale state

assessment program. Journal of Technology, Learning, and Assessment, 3(6). Available

from http://www.jtla.org.


Poggio, J., Glasnapp, D., Yang, X., Beauchamp, A., & Dunham, M. (2005). Moving from Paper

and Pencil to Online Testing: Findings from a State Large Scale Assessment Program.

Paper presented at the annual meeting of the National Council on Measurement in

Education, Montreal, Canada.

Pommerich, M. (2004). Developing computerized versions of paper-and-pencil tests: Mode

effects for passage-based tests. Journal of Technology, Learning, and Assessment, 2(6).

Available from http://www.jtla.org.

Spray, J. A., Ackerman, T. A., Reckase, M. D., & Carlson, J. E. (1989). Effect of the medium of

item presentation on examinee performance and item characteristics. Journal of

Educational Measurement, 26, 261-271.


Table 1. Online and Paper Sample Means and Standard Deviations for Grade 8 Raw Scores and Grade 7 Scale Scores used for the MSCA

Mode    Subject          N        Grade 8 Raw Score     Grade 7 Reading SS     Grade 7 Math SS
                                  Mean     Std          Mean       Std         Mean       Std
Online  Reading          1,840    40.60    7.16         2241.32    183.49      2159.40    141.33
Online  Mathematics      1,273    32.60    9.27         2225.88    171.18      2146.81    131.36
Online  Social Studies   1,449    33.97    7.73         2229.56    178.01      2148.38    134.31
Paper   Reading          158,282  40.73    7.35         2223.36    185.07      2147.76    150.97
Paper   Mathematics      158,809  32.76    9.78         2223.59    185.39      2148.25    150.73
Paper   Social Studies   157,809  33.94    8.35         2223.48    185.40      2148.02    151.07

Note: Number of items per test: reading, 48; mathematics, 50; social studies, 48.


Table 2. Summary of Comparability Analysis Results – Grade 8 Math

RS CBT_RS RS_SD PAP_SS CBT_SS SS_SD RS_DIF SS_DIF SIG? 0 0.0000 N/A 1222.99 1222.99 N/A 0.00 0.00 1 1.0262 0.03489 1376.71 1379.48 4.19048 0.03 2.77 2 2.0469 0.06509 1488.99 1491.83 5.02326 0.05 2.84 3 3.0636 0.09144 1557.09 1560.06 4.91924 0.06 2.97 4 4.0772 0.11465 1607.10 1610.04 4.84358 0.08 2.94 5 5.0887 0.13525 1647.22 1650.11 4.76723 0.09 2.89 6 6.0987 0.15365 1681.09 1683.91 4.69091 0.10 2.82 7 7.1080 0.17022 1710.65 1713.43 4.62034 0.11 2.78 8 8.1169 0.18522 1737.08 1739.83 4.55582 0.12 2.75 9 9.1258 0.19881 1761.13 1763.87 4.49686 0.13 2.74

10 10.1350 0.21118 1783.32 1786.07 4.44182 0.14 2.75 11 11.1447 0.22248 1804.01 1806.79 4.39325 0.14 2.78 12 12.1551 0.23274 1823.48 1826.31 4.35233 0.16 2.83 13 13.1661 0.24203 1841.96 1844.86 4.31354 0.17 2.90 14 14.1779 0.25042 1859.60 1862.59 4.27712 0.18 2.99 15 15.1903 0.25795 1876.53 1879.62 4.24776 0.19 3.09 16 16.2035 0.26464 1892.88 1896.09 4.22228 0.20 3.21 17 17.2172 0.27048 1908.74 1912.08 4.19496 0.22 3.34 18 18.2315 0.27554 1924.17 1927.66 4.18301 0.23 3.49 19 19.2462 0.27979 1939.30 1942.93 4.15349 0.25 3.63 20 20.2611 0.28323 1954.09 1957.89 4.13358 0.26 3.80 21 21.2762 0.28583 1968.65 1972.62 4.11976 0.28 3.97 22 22.2912 0.28767 1983.04 1987.19 4.10697 0.29 4.15 23 23.3061 0.28874 1997.30 2001.64 4.09458 0.31 4.34 24 24.3207 0.28902 2011.47 2016.00 4.08511 0.32 4.53 25 25.3349 0.28851 2025.60 2030.32 4.07175 0.33 4.72 26 26.3483 0.28725 2039.71 2044.64 4.06389 0.35 4.93 27 27.3610 0.28517 2053.86 2059.00 4.05925 0.36 5.14 28 28.3727 0.28238 2068.10 2073.45 4.05312 0.37 5.35 29 29.3833 0.27876 2082.46 2088.03 4.04572 0.38 5.57 30 30.3925 0.27439 2096.98 2102.77 4.04454 0.39 5.79 31 31.4003 0.26926 2111.73 2117.74 4.03923 0.40 6.01 32 32.4064 0.26337 2126.74 2132.98 4.03989 0.41 6.24 33 33.4107 0.25669 2142.09 2148.56 4.04002 0.41 6.47 34 34.4129 0.24925 2157.84 2164.54 4.03999 0.41 6.70 35 35.4128 0.24105 2174.06 2181.00 4.04637 0.41 6.94 36 36.4104 0.23203 2190.86 2198.04 4.05234 0.41 7.18 37 37.4053 0.22224 2208.34 2215.76 4.06272 0.41 7.42 38 38.3974 0.21163 2226.64 2234.30 4.07537 0.40 7.66 39 39.3865 0.20014 2245.92 2253.83 4.09162 0.39 7.91 40 40.3724 0.18780 2266.39 2274.56 4.11313 0.37 8.17 41 41.3547 0.17456 2288.32 2296.76 4.14520 0.35 8.44 **42 42.3334 0.16032 2312.10 2320.82 4.18301 0.33 8.72 **43 43.3082 0.14506 2338.23 2347.24 4.23461 0.31 9.01 **44 44.2788 0.12870 2367.47 2376.82 4.30573 0.28 9.35 **45 45.2450 0.11112 2400.99 2410.73 4.40490 0.24 9.74 **46 46.2065 0.09221 2440.72 2450.97 4.56203 0.21 10.25 **47 47.1630 0.07185 2490.33 2501.37 4.84320 0.16 11.04 **48 48.1143 0.04983 2557.98 2570.78 5.54320 0.11 12.80 **49 49.0601 0.02597 2669.80 2679.01 3.96352 0.06 9.21 **50 50.0000 N/A 2822.97 2822.97 N/A 0.00 0.00


Table 3. Summary of Comparability Analysis – Grade 8 Reading

RS  CBT_RS   RS_SD    PAP_SS   CBT_SS   SS_SD    RS_DIF  SS_DIF  SIG?
0   0.0000   N/A      1174.29  1174.29  N/A      0.00    0.00
1   1.1417   0.03914  1328.54  1344.27  4.34428  0.14    15.73   **
2   2.2772   0.07382  1439.53  1458.05  4.93151  0.28    18.52   **
3   3.4063   0.10482  1506.33  1526.13  5.10897  0.41    19.80   **
4   4.5288   0.13266  1555.07  1575.63  5.15905  0.53    20.56   **
5   5.6445   0.15778  1593.96  1615.01  5.15115  0.64    21.05   **
6   6.7535   0.18048  1626.63  1648.02  5.08850  0.75    21.39   **
7   7.8557   0.20098  1655.04  1676.65  4.98703  0.86    21.61   **
8   8.9511   0.21952  1680.37  1702.11  4.88667  0.95    21.74   **
9   10.0399  0.23621  1703.35  1725.19  4.81142  1.04    21.84   **
10  11.1220  0.25117  1724.51  1746.43  4.75545  1.12    21.92   **
11  12.1975  0.26454  1744.23  1766.21  4.71342  1.20    21.98   **
12  13.2666  0.27638  1762.77  1784.81  4.68332  1.27    22.04   **
13  14.3292  0.28676  1780.35  1802.45  4.65328  1.33    22.10   **
14  15.3855  0.29575  1797.15  1819.28  4.62599  1.39    22.13   **
15  16.4355  0.30347  1813.28  1835.44  4.60295  1.44    22.16   **
16  17.4794  0.30986  1828.86  1851.06  4.57744  1.48    22.20   **
17  18.5172  0.31504  1843.99  1866.21  4.55589  1.52    22.22   **
18  19.5490  0.31898  1858.74  1880.99  4.53393  1.55    22.25   **
19  20.5748  0.32187  1873.19  1895.45  4.51017  1.57    22.26   **
20  21.5948  0.32362  1887.40  1909.66  4.49162  1.59    22.26   **
21  22.6090  0.32426  1901.41  1923.69  4.47189  1.61    22.28   **
22  23.6175  0.32390  1915.29  1937.57  4.45429  1.62    22.28   **
23  24.6203  0.32249  1929.08  1951.36  4.43585  1.62    22.28   **
24  25.6175  0.32010  1942.83  1965.10  4.41928  1.62    22.27   **
25  26.6092  0.31672  1956.58  1978.84  4.40195  1.61    22.26   **
26  27.5954  0.31239  1970.38  1992.63  4.38807  1.60    22.25   **
27  28.5761  0.30716  1984.27  2006.51  4.37174  1.58    22.24   **
28  29.5515  0.30106  1998.31  2020.51  4.34880  1.55    22.20   **
29  30.5215  0.29396  2012.54  2034.71  4.35021  1.52    22.17   **
30  31.4862  0.28607  2026.98  2049.17  4.33762  1.49    22.19   **
31  32.4457  0.27726  2041.79  2063.95  4.32972  1.45    22.16   **
32  33.3999  0.26765  2056.97  2079.10  4.32238  1.40    22.13   **
33  34.3489  0.25718  2072.62  2094.71  4.31895  1.35    22.09   **
34  35.2927  0.24589  2088.82  2110.87  4.31675  1.29    22.05   **
35  36.2315  0.23379  2105.69  2127.69  4.31397  1.23    22.00   **
36  37.1651  0.22088  2123.36  2145.29  4.31091  1.17    21.93   **
37  38.0936  0.20717  2141.99  2163.86  4.30213  1.09    21.87   **
38  39.0172  0.19261  2161.80  2183.59  4.27818  1.02    21.79   **
39  39.9358  0.17731  2183.07  2204.78  4.23488  0.94    21.71   **
40  40.8495  0.16117  2206.16  2227.81  4.17388  0.85    21.65   **
41  41.7584  0.14421  2231.59  2253.23  4.12945  0.76    21.64   **
42  42.6626  0.12644  2260.11  2281.84  4.14760  0.66    21.73   **
43  43.5622  0.10779  2292.90  2314.83  4.20372  0.56    21.93   **
44  44.4574  0.08828  2331.90  2354.34  4.33187  0.46    22.44   **
45  45.3484  0.06782  2380.97  2404.32  4.54411  0.35    23.35   **
46  46.2356  0.04637  2447.97  2474.17  5.15647  0.24    26.20   **
47  47.1193  0.02381  2559.18  2577.78  3.71164  0.12    18.60   **
48  48.0000  N/A      2715.06  2715.06  N/A      0.00    0.00


Table 4. Summary of Comparability Analysis – Grade 8 Social Studies

RS  CBT_RS   RS_SD    PAP_SS   CBT_SS   SS_SD    RS_DIF  SS_DIF  SIG?
0   0.0000   N/A      1361.76  1361.76  N/A      0.00    0.00
1   1.0085   0.03243  1506.93  1507.46  3.94365  0.01    0.53
2   2.0222   0.06223  1612.22  1612.97  4.94385  0.02    0.75
3   3.0399   0.08966  1675.72  1677.24  4.68584  0.04    1.52
4   4.0609   0.11492  1722.17  1724.22  4.59585  0.06    2.05
5   5.0843   0.13822  1759.29  1761.79  4.54612  0.08    2.50
6   6.1093   0.15967  1790.55  1793.43  4.51153  0.11    2.88
7   7.1354   0.17942  1817.79  1821.01  4.48432  0.14    3.22
8   8.1620   0.19754  1842.11  1845.64  4.46028  0.16    3.53
9   9.1885   0.21419  1864.22  1868.02  4.43932  0.19    3.80
10  10.2147  0.22936  1884.61  1888.66  4.42081  0.21    4.05
11  11.2400  0.24318  1903.64  1907.92  4.40239  0.24    4.28
12  12.2643  0.25565  1921.56  1926.04  4.38411  0.26    4.48
13  13.2872  0.26683  1938.57  1943.23  4.36923  0.29    4.66
14  14.3085  0.27682  1954.84  1959.66  4.35471  0.31    4.82
15  15.3281  0.28558  1970.49  1975.44  4.33894  0.33    4.95
16  16.3457  0.29319  1985.62  1990.70  4.32683  0.35    5.08
17  17.3614  0.29970  2000.33  2005.51  4.31226  0.36    5.18
18  18.3749  0.30502  2014.68  2019.95  4.30032  0.37    5.27
19  19.3864  0.30934  2028.75  2034.09  4.28813  0.39    5.34
20  20.3955  0.31254  2042.59  2047.99  4.27732  0.40    5.40
21  21.4025  0.31473  2056.26  2061.71  4.26525  0.40    5.45
22  22.4073  0.31594  2069.80  2075.28  4.25233  0.41    5.48
23  23.4098  0.31611  2083.25  2088.75  4.24609  0.41    5.50
24  24.4102  0.31530  2096.68  2102.19  4.23484  0.41    5.51
25  25.4083  0.31354  2110.11  2115.62  4.22563  0.41    5.51
26  26.4045  0.31085  2123.59  2129.08  4.21861  0.40    5.49
27  27.3986  0.30723  2137.17  2142.62  4.20100  0.40    5.45
28  28.3907  0.30263  2150.85  2156.29  4.20556  0.39    5.44
29  29.3810  0.29713  2164.77  2170.17  4.19772  0.38    5.40
30  30.3695  0.29074  2178.92  2184.27  4.19448  0.37    5.35
31  31.3564  0.28345  2193.38  2198.67  4.18920  0.36    5.29
32  32.3417  0.27523  2208.20  2213.42  4.18539  0.34    5.22
33  33.3255  0.26612  2223.46  2228.61  4.18150  0.33    5.15
34  34.3081  0.25610  2239.24  2244.31  4.18302  0.31    5.07
35  35.2894  0.24513  2255.66  2260.64  4.18271  0.29    4.98
36  36.2697  0.23325  2272.83  2277.72  4.18550  0.27    4.89
37  37.2491  0.22041  2290.91  2295.71  4.19165  0.25    4.80
38  38.2277  0.20659  2310.10  2314.84  4.22410  0.23    4.74
39  39.2057  0.19174  2330.79  2335.41  4.22980  0.21    4.62
40  40.1832  0.17583  2353.13  2357.66  4.24902  0.18    4.53
41  41.1603  0.15880  2377.69  2382.14  4.28007  0.16    4.45
42  42.1372  0.14058  2405.19  2409.57  4.32222  0.14    4.38
43  43.1140  0.12109  2436.72  2441.06  4.38615  0.11    4.34
44  44.0908  0.10025  2474.14  2478.49  4.48869  0.09    4.35
45  45.0677  0.07790  2520.90  2525.38  4.67900  0.07    4.48
46  46.0448  0.05387  2584.74  2589.75  5.17909  0.04    5.01
47  47.0223  0.02799  2690.37  2693.80  3.83598  0.02    3.43
48  48.0000  N/A      2837.51  2837.51  N/A      0.00    0.00
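
In Tables 2 through 4, RS_DIF and SS_DIF are the online-minus-paper differences (CBT_RS minus RS and CBT_SS minus PAP_SS), and the pattern of "**" flags is consistent with marking raw scores at which the scale score difference exceeds twice its bootstrap standard error, the same ±2 SE band plotted in Figures 1-3. The sketch below illustrates that reading of the flag; it is an inference from the tables, not the authors' code.

    def sig_flag(ss_diff, ss_se, k=2.0):
        """Return '**' when |online - paper scale score difference| exceeds k bootstrap SEs."""
        if ss_se is None:            # raw score 0 and the maximum raw score have no SE (N/A)
            return ''
        return '**' if abs(ss_diff) > k * ss_se else ''

    # Grade 8 mathematics examples from Table 2:
    # sig_flag(8.44, 4.14520)  -> '**'  (raw score 41, flagged)
    # sig_flag(8.17, 4.11313)  -> ''    (raw score 40, not flagged)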


Table 5. Univariate Summary Statistics for Online and Paper Exit Level TAKS Raw Scores

                  Online Testing Group               Paper-Pencil Testing Group
Subject           N      Mean   Std    Min  Max      N      Mean   Std    Min  Max
Math              958    26.76  8.92   7    59       1198   27.47  8.81   9    60
Science           1004   23.70  7.56   7    54       1197   24.17  7.92   9    55
Social Studies    355    29.49  11.17  8    54       388    29.19  11.13  9    55
ELA               649    37.52  11.10  9    58       719    38.24  10.76  8    58

Note: Number of items per test: mathematics, 60; science, 54; social studies, 54; ELA, 58.

Table 6. Univariate Summary Statistics for Online and Paper Component Scores – Exit Level ELA

                  Online Testing Group               Paper-Pencil Testing Group
Score             N      Mean   Std    Min  Max      N      Mean   Std    Min  Max
ELA               649    37.52  11.10  9    58       719    38.24  10.76  8    58
ELA_WT            649    42.21  12.60  10   70       719    43.32  12.30  8    70
ELA_MC            649    33.44  9.34   8    48       719    33.88  9.06   7    48
Essay             649    1.56   0.73   0    4        719    1.69   0.71   0    4
OE 1              649    0.87   0.71   0    2        719    0.96   0.71   0    2
OE 2              649    0.96   0.69   0    2        719    1.06   0.70   0    3
OE 3              649    0.69   0.68   0    3        719    0.66   0.69   0    3

Note: The essay score is weighted four times in calculating the overall ELA composite score (ELA_WT).

Table 7. Univariate Statistics for Previous TAKS Scores for Exit Level Online and Paper-Pencil Groups

                         Online Testing Group                      Paper-Pencil Testing Group
Subject   Grade*  N     SS Mean  SS Std  SS Min  SS Max      N     SS Mean  SS Std  SS Min  SS Max
ELA       10      96    2270.6   110.8   1940    2529        104   2275.1   122.7   1964    2665
ELA       11      413   2043.2   48.3    1728    2126        464   2042.9   50.2    1703    2099
Math      10      86    2314.5   200.8   1853    2780        96    2298.3   208.7   1853    2780
Math      11      698   1979.7   50.9    1817    2083        873   1981.9   55.4    1295    2083
Science   10      75    2288.6   169.8   1930    2846        98    2270.0   161.1   1791    2684
Science   11      715   2005.2   41.0    1816    2081        867   2003.3   44.2    1835    2081
Soc St    10      91    2438.5   169.4   1911    2796        89    2426.3   199.2   1969    2796
Soc St    11      212   2014.3   43.5    1857    2067        224   2012.1   55.6    1415    2089

* Refers to the grade associated with the most recent previous test score.
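
Consistent with the Table 6 means (for example, online: 33.44 + 4 × 1.56 + 0.87 + 0.96 + 0.69 ≈ 42.2), the weighted composite appears to count the essay four times. The sketch below is an illustration implied by the reported values, not the operational scoring specification.

    def ela_scores(mc, essay, oe1, oe2, oe3):
        """Unweighted ELA raw score and weighted composite (essay counted four times)."""
        ela = mc + essay + oe1 + oe2 + oe3
        ela_wt = mc + 4 * essay + oe1 + oe2 + oe3
        return ela, ela_wt

    # Online group means from Table 6: ela_scores(33.44, 1.56, 0.87, 0.96, 0.69)
    # -> (37.52, 42.20), in line with the reported ELA and ELA_WT means.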


Table 8. Scale Score Frequency Distributions and Means for the Math Sensitivity Analysis Samples

Math SS  M Sample1  M Sample2  M Sample3  M Sample4  M Sample5  M Sample6  Paper
1335  0  0  0  0  0  0  4
1473  0  0  0  0  0  0  1
1635  0  0  0  0  0  0  1
1680  0  0  0  0  0  0  3
1716  1  1  1  0  0  0  10
1747  3  2  2  0  0  0  17
1773  5  4  3  1  0  0  52
1796  8  6  4  1  0  0  125
1818  9  7  5  2  1  0  266
1837  24  19  7  4  1  0  488
1855  16  13  9  5  2  0  685
1873  19  15  13  9  4  1  1060
1889  22  18  15  11  6  3  1423
1904  25  21  18  14  10  5  1723
1919  27  23  21  17  13  9  2153
1934  27  24  23  20  17  13  2517
1948  31  28  26  24  20  16  2944
1961  35  31  28  25  22  17  3159
1975  36  33  31  29  26  23  3569
1988  36  34  33  32  29  28  3949
2001  39  36  35  34  32  31  4234
2023  41  37  36  35  34  33  4304
2026  43  41  39  37  36  33  4656
2039  44  43  41  39  40  37  4892
2061  44  44  42  42  41  39  5218
2064  45  45  43  43  42  41  5413
2077  46  45  44  44  43  42  5493
2100  46  45  45  45  44  44  5622
2103  46  46  46  45  45  45  5545
2117  46  46  46  46  46  45  5733
2130  47  47  46  46  46  46  5761
2144  47  47  47  47  46  46  5797
2159  46  46  48  47  47  47  5817
2174  45  46  46  47  47  47  5900
2189  44  46  46  48  48  47  5980
2206  43  45  46  47  48  48  5815
2223  42  45  45  46  47  48  5775
2241  41  43  44  46  47  48  5747
2260  39  41  42  45  46  47  5586
2281  35  38  41  44  46  47  5351
2305  31  35  38  42  45  47  5190
2331  25  30  34  40  44  46  5004
2360  16  23  30  35  43  46  4493
2400  7  15  24  31  38  45  3878
2439  2  10  18  25  32  42  3135
2499  1  7  13  18  25  35  2188
2597  0  3  8  12  17  24  1516
2732  0  1  3  5  9  14  617

Mean Math SS     2088.25  2108.25  2128.25  2148.25  2168.25  2188.25  2148.25
Mean Reading SS  2171.32  2190.64  2207.22  2219.41  2245.80  2253.29  2223.59
Mean raw score   29.56    31.15    31.82    32.49    33.73    34.44    32.76
N                1,275    1,275    1,275    1,275    1,275    1,275    158,809


Table 9. Scale Score Frequency Distributions and Means for the Reading Sensitivity Analysis Samples

Reading SS  R Sample1  R Sample2  R Sample3  R Sample4  R Sample5  R Sample6  Paper
1189  0  0  0  0  0  0  13
1498  0  0  0  0  0  0  1
1545  0  0  0  0  0  0  1
1583  0  0  0  0  0  0  4
1615  0  0  0  0  0  0  20
1644  1  1  1  0  0  0  24
1669  2  2  1  1  0  0  46
1693  5  3  3  1  0  0  84
1714  8  4  2  2  1  0  151
1734  10  6  4  2  1  1  213
1753  12  8  5  3  1  1  271
1771  14  10  6  4  2  0  344
1789  15  11  7  5  3  1  423
1805  16  12  8  6  4  2  512
1822  17  13  9  7  5  3  593
1837  18  14  10  8  6  4  666
1853  19  14  11  9  7  5  768
1868  20  15  12  11  8  7  918
1883  22  16  13  11  9  8  967
1897  24  18  15  13  11  9  1085
1912  25  19  16  14  12  10  1230
1927  26  19  17  15  13  10  1309
1941  27  23  19  17  15  11  1439
1955  28  24  21  19  17  12  1589
1970  29  26  23  20  18  13  1720
1985  31  28  25  23  21  15  1927
2009  37  32  31  25  19  16  2145
2014  43  37  35  29  23  19  2490
2030  49  43  38  32  26  22  2714
2053  55  48  42  36  30  24  3074
2061  61  52  46  42  38  29  3592
2077  66  62  55  47  39  35  4033
2100  71  64  58  54  50  41  4578
2112  79  74  67  61  55  47  5243
2130  85  85  78  70  62  56  5998
2150  98  88  82  77  71  63  6583
2170  106  105  99  87  73  72  7458
2192  127  134  124  99  75  78  8434
2216  134  141  132  113  94  87  9616
2241  123  141  140  118  99  94  10139
2270  110  126  145  130  113  106  11082
2303  86  118  130  134  138  140  11536
2342  70  94  116  138  161  161  11767
2400  48  66  98  126  154  176  10812
2455  18  29  61  113  165  221  9754
2561  11  16  23  85  148  164  7259
2705  4  9  22  43  63  87  3657

Mean Reading SS  2123.36  2153.36  2183.36  2223.36  2263.36  2293.36  2223.36
Mean Math SS     2092.24  2106.32  2120.90  2144.44  2171.69  2185.12  2147.76
Mean raw score   38.06    38.83    39.82    40.71    41.66    42.43    40.73
N                1,850    1,850    1,850    1,850    1,850    1,850    158,282
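
Figures 4 through 7 summarize bootstrap mean scale score differences for the sensitivity analysis samples. The following minimal sketch shows one generic way such a bootstrap could be organized: resample each group, rederive the raw-to-scale conversion for each resample, and accumulate the online-minus-paper difference at every raw score. The calibrate_conversion() step stands in for the operational calibration and equating and is purely hypothetical; this is not the authors' code.

    import numpy as np

    def bootstrap_ss_differences(online_resp, paper_resp, calibrate_conversion,
                                 n_boot=200, seed=0):
        """Bootstrap mean and SE of the online-minus-paper scale score conversion
        at each raw score. online_resp and paper_resp are arrays of item-response
        vectors; calibrate_conversion() returns a raw-to-scale conversion array."""
        rng = np.random.default_rng(seed)
        diffs = []
        for _ in range(n_boot):
            online_bs = online_resp[rng.integers(0, len(online_resp), len(online_resp))]
            paper_bs = paper_resp[rng.integers(0, len(paper_resp), len(paper_resp))]
            diffs.append(calibrate_conversion(online_bs) - calibrate_conversion(paper_bs))
        diffs = np.asarray(diffs)
        return diffs.mean(axis=0), diffs.std(axis=0, ddof=1)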


[Figure: three panels – TAKS Grade 8 Mathematics, TAKS Grade 8 Reading, and TAKS Grade 8 Social Studies; x-axis: Raw Score; y-axis: SS Difference.]

Figure 1. Plots of Online Minus Paper Test Scale Score Conversions (Dark Line) and ±2 Bootstrap Standard Errors of the Differences (White Lines) as a function of Raw Score – Grade 8


[Figure: two panels – TAKS Grade 11 Mathematics and TAKS Grade 11 English Language Arts; x-axis: Raw Score; y-axis: SS Difference.]

Figure 2. Plots of Online Minus Paper Test Scale Score Conversions (Dark Line) and ±2 Bootstrap Standard Errors of the Differences (White Lines) as a function of Raw Score – Grade 11 Mathematics and English Language Arts


[Figure: two panels – TAKS Grade 11 Science and TAKS Grade 11 Social Studies; x-axis: Raw Score; y-axis: SS Difference.]

Figure 3. Plots of Online Minus Paper Test Scale Score Conversions (Dark Line) and ±2 Bootstrap Standard Errors of the Differences (White Lines) as a function of Raw Score – Grade 11 Science and Social Studies


[Figure: two panels – Mathematics (upper) and Reading (lower); x-axis: Raw Score; y-axis: Scale Score Difference; series: M_Sample1 through M_Sample6 and R_Sample1 through R_Sample6.]

Figure 4. Bootstrap Mean Scale Score Differences (Online minus Paper) for Mathematics (Upper Graph) and Reading (Lower Graph) – No Mode Effects Simulated


[Figure: two panels – Mathematics (upper) and Reading (lower); x-axis: Raw Score; y-axis: Scale Score Difference; series: M_Sample1 through M_Sample6 and R_Sample1 through R_Sample6.]

Figure 5. Bootstrap Mean Scale Score Differences (Online minus Paper) for Mathematics (Upper Graph) and Reading (Lower Graph) – Simulated Mode Effects = 0.25


[Figure: two panels – Mathematics (upper) and Reading (lower); x-axis: Raw Score; y-axis: Scale Score Difference; series: M_Sample1 through M_Sample6 and R_Sample1 through R_Sample6.]

Figure 6. Bootstrap Mean Scale Score Differences (Online minus Paper) for Mathematics (Upper Graph) and Reading (Lower Graph) – Simulated Mode Effects = 0.50


[Figure: two panels – Mathematics (upper) and Reading (lower); x-axis: Raw Score; y-axis: Scale Score Difference; series: M_Sample1 through M_Sample6 and R_Sample1 through R_Sample6.]

Figure 7. Bootstrap Mean Scale Score Differences (Online minus Paper) for Mathematics (Upper Graph) and Reading (Lower Graph) – Simulated Mode Effects = 1.00