Studies in Educational Evaluation 32 (2006) 369–380
INVESTIGATING FORM COMPARABILITY IN THE IDAHO COMPREHENSIVE LITERACY ASSESSMENT: MATTERS OF FAIRNESS AND TRANSPARENCY
David Squires*, Michael S. Trevisan**, and George F. Canney***
*Idaho State University
**Assessment and Evaluation Center, Washington State University
***Department of Curriculum and Instruction, University of Idaho
Abstract
The Idaho Comprehensive Literacy Assessment (ICLA) is a faculty-developed, state-
wide, high-stakes assessment of pre-service teachers’ knowledge and application of
research based literacy practices. The literacy faculty control all aspects of the test,
including construction, refinement, administration, scoring and reporting. The test
development and validation process is open to public inspection. This article
investigates the comparability of three Forms administered across two years. The
results of statistical analyses show that while there are small differences in pass rates
across forms in some administrative periods, those differences are negligible and due
to unsystematic variability.
Introduction
High-stakes testing requires empirical data, reasoned analysis, and balanced
judgment. High-stakes teacher testing is a case in point. What little evidence there is in the
literature about teacher tests and testing suggests that current tests are often poorly crafted,
unconnected with what teachers need to know and be able to do, and lack evidence that
warrants the use of such tests to ascertain teacher competence (Haney, Fowler, &
Wheelock, 1999; Ludlow, 2001). Nevertheless, the national trend remains strong for using
tests for both the accountability of teacher preparation programs and licensure of pre-
service teachers. To this end, the development of high-quality tests is essential.
In contrast to most teacher tests currently employed in the United States, the Idaho
Comprehensive Literacy Assessment (ICLA) has been developed by the teacher education
0191-491X/04/$ – see front matter © 2006 Published by Elsevier Ltd.
doi:10.1016/j.stueduc.2006.10.004
literacy faculty across Idaho. The ICLA faculty committee crafted this criterion-referenced
measure in light of state and international standards for teacher preparation in reading
(Idaho MOST Standards, 2001; International Reading Association, 1998). They
systematically refined the ICLA formats and content utilizing test data from each
administration. This systematic process has impacted literacy course content at all the
teacher preparation institutions in Idaho. A recent article detailed the history, politics, and
test development for this innovative pre-service teacher certification test (Squires, Canney,
& Trevisan, 2004). Guiding this work has been the application of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing, 1999) and
guidelines for the development and implementation of high-stakes tests and testing policy
developed by the United States Department of Education, Office of Civil Rights (2002).
Passage of the ICLA has been a requirement for elementary and special
education teacher certification since August 2002.
In this article, we describe our ongoing commitment to develop a valid, high-stakes
assessment open to public inspection. The decision to be transparent has been a central
feature of the ICLA development process. Cronbach (1980) noted that the task in the test
validation process is not to usurp the general public’s prerogative to judge the validity of
the test, but to clarify validity issues surrounding the assessment so that stakeholders can
make informed decisions. Despite Cronbach’s exhortations, as well as those of others, we
know of few appraisals of the validity of high-stakes tests for teacher certification open to
public scrutiny. The consequences of not having a public examination of teacher
certification assessments are negative and profound, as exemplified by court cases in
Alabama and Massachusetts (Ludlow, 2001).
Many in the Idaho educational hierarchy questioned faculty investment, motives,
and competence to create a high-stakes, statewide literacy test for pre-service teachers.
Despite such criticism and minimal financial support, the faculty have persisted in this
endeavor. While open to the concerns and insights of literacy colleagues, the literacy
research base, and data from the ICLA, they have forged ahead with this project. The goal
has always been to devise a valid, reliable, and relevant measure of pre-service teachers’
knowledge of core literacy practices as a part of the overall process to prepare teachers
highly qualified to meet each child’s learning needs. This article, in its examination of
Form comparability, reflects the best efforts of the committee in its self-examination of the
test development process and commitment to public transparency in reporting the results
and outcomes of their work.
Early in the project, a validity argument was created to guide the development, use,
and ongoing validation of the ICLA, particularly with an eye toward fairness (Cronbach,
1988; Shepard, 1993). The argument included item and test specifications, procedures for
pilot testing and item evaluation, and the empirical tasks necessary to test the validity
argument (see Squires et al., 2004, for details). An ad hoc committee composed of
members from underrepresented groups in Idaho independently reviewed the item content
and language of the ICLA for systematic biases likely to negatively impact teacher
candidates of those groups (Kmitta & Goellner, 2002). The ad hoc committee accepted the
texts as fair after several minor revisions.
The following guidelines from the United States Department of Education, Office
of Civil Rights (2002) were also implemented:
1. Students received adequate notice of the test and its consequences; test dates, times
and locations were posted well in advance online and at each teacher preparation
institution. Students pre-registered with staff at each location weeks prior to the
administration of the ICLA.
2. Students were given access to the information being tested when faculty aligned
course content to address the content covered in the ICLA.
3. The Angoff Method and the Benchmark Method (Cizek, 1996) were chosen by the
faculty to determine the cut score for passing each Standard, and the results of this
process were documented (a simplified illustration of the Angoff logic follows this list).
4. An electronic study guide sectioned to match the parts of each of the three ICLA
Standards was made available free of charge to all students.
5. Accommodations were developed for special needs students; in addition, any
student could retake a Standard multiple times.
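
To make the logic of the Angoff Method named in item 3 concrete, the following Python sketch shows a simplified Angoff-style cut score calculation. The judges, ratings, and function name are hypothetical; the committee's actual standard-setting data and procedures are not reproduced here.

# A simplified Angoff-style cut score calculation (illustrative only; the
# committee's actual judges, ratings, and procedures are not shown here).
# Each judge estimates, for every item, the probability that a minimally
# competent candidate answers it correctly; the suggested cut score is the
# mean, across judges, of each judge's summed ratings.

def angoff_cut_score(ratings):
    """ratings: one list per judge, each holding a probability per item."""
    judge_totals = [sum(judge) for judge in ratings]   # expected raw score per judge
    return sum(judge_totals) / len(judge_totals)       # average across judges

# Hypothetical ratings from three judges on a five-item section.
ratings = [
    [0.9, 0.8, 0.7, 0.6, 0.8],
    [0.8, 0.9, 0.6, 0.7, 0.7],
    [0.9, 0.7, 0.8, 0.6, 0.9],
]
print(f"Suggested cut score: {angoff_cut_score(ratings):.1f} of 5 points")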
The ICLA addresses three distinct areas, or Standards: Standard I – Language
Learning and Literacy Development; Standard II – Reading Comprehension Research and
Best Practices; and Standard III – Literacy Assessment and Intervention (Figure 1).
Standard I – Language Learning and Literacy Development
    Section 1: Definitions
    Section 2: Phonological awareness; phonics and structural analysis; instructional strategies
    Section 3: Essay-like scenarios

Standard II – Reading Comprehension Research and Best Practices
    Section 1: Definitions
    Section 2: Instructional strategies
    Section 3: Essay-like scenarios

Standard III – Literacy Assessment and Intervention
    Section 1: Definitions
    Section 2: Instructional strategies
    Section 3: Essay-like scenarios

Figure 1: Configuration of the Idaho Comprehensive Literacy Assessment
Each Standard is divided into three Sections. Section 1 requires matching common
literacy terms with definitions; Section 2 matches terms to descriptions of research-based
instructional strategies; Section 3 presents two scenarios with two or three subparts.
Candidates provide essay responses that literacy faculty evaluate using a 4-point holistic
scoring rubric; faculty also have scoring guides to inform the consistent and accurate
application of the scoring rubric as it pertains to that particular essay. The item formats for
Standards I, II, and III are the same, with one exception. Standard I, Section 2, has an
additional set of eleven items that assess candidates’ knowledge of phonemes, graphemes,
syllables, morphemes, and phonic patterns; the Forms of Standard I, however, are
configured alike. Therefore, all elements of the three Standards, across Forms of the ICLA,
were developed around the same content and item specifications (for an in-depth
description, see Squires et al., 2004).
The faculty openly discussed and shared assessment data in various forums in
acknowledgement of the public trust they were given to create a high-stakes test legislated
by the state representatives. In recent conversations, as a result of ongoing test data
analyses, the committee has become concerned that one Form of the ICLA, Form C, may
be easier for pre-service teachers than Forms A and B. Forms A and B were developed by
the same subcommittee and a year and a half earlier than Form C. The difference in
subcommittee membership for the development of Forms A and B versus Form C, and the
difference in time when the Forms were developed, has left a lingering perception that
Form C is not as challenging as A and B. Since the goal remains the development of a
teacher certification test that validly, reliably, and fairly measures individual pre-service
teachers’ knowledge about research-based best practices in literacy instruction, this article
explores the comparability of Forms across two years of testing. At issue has been whether
two candidates taking different Forms of the ICLA are presented with an equivalent
challenge.
Administration and Test Equating
Administration of the three forms of the ICLA is typically done in a serial fashion
within Standard. Serial administration has been shown to produce sample characteristics
similar to those obtained when tests are randomly assigned (Angoff, 1971, p. 569),
providing the basis for inferential statistical comparisons of student performance across
forms. Also, serial administration de facto creates an equivalent-groups data collection
design for test equating investigations, a design recommended as a viable alternative when
equating investigations are constrained by logistics or resources (Petersen, Kolen, &
Hoover, 1989). Test forms within Standard were equated using the equipercentile method
based on data from the December 2004 administration. Smooth curves were drawn by
hand.
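
As an illustration of the unsmoothed core of equipercentile equating under an equivalent-groups design, the Python sketch below maps a score on one form to the score with the same percentile rank on another form. The score distributions, sample sizes, and function names are hypothetical; the ICLA equating itself used hand-smoothed curves rather than this code.

import numpy as np

def percentile_ranks(scores):
    """Percentile rank of each observed score, using the mid-percentile convention."""
    scores = np.asarray(scores)
    points = np.unique(scores)
    below = np.array([(scores < x).mean() for x in points])
    at = np.array([(scores == x).mean() for x in points])
    return points, 100 * (below + at / 2)

def equate(score, from_scores, to_scores):
    """Map a score on the 'from' form to the equivalent score on the 'to' form."""
    f_pts, f_pr = percentile_ranks(from_scores)
    t_pts, t_pr = percentile_ranks(to_scores)
    pr = np.interp(score, f_pts, f_pr)   # percentile rank of the score on its own form
    return np.interp(pr, t_pr, t_pts)    # score on the other form with that same rank

# Hypothetical score distributions on a 40-point section for two equivalent groups.
rng = np.random.default_rng(0)
form_a_scores = rng.binomial(40, 0.70, size=150)
form_c_scores = rng.binomial(40, 0.75, size=130)
print(f"A score of 28 on Form A equates to about {equate(28, form_a_scores, form_c_scores):.1f} on Form C")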
Analysis and Discussion of ICLA Pass Rates by Form
To examine Form comparability on the ICLA, student pass rates were compared
across Forms and sections within Standard and administration period. Data from six
administrations of the ICLA each semester and summer between April 2003 and December
2004 were analyzed to ascertain Form comparability. Candidate performance is
summarized for each Form for spring 2003 through December 2004.1
Table 1: Pass Rate for Forms A, B and C for Standard I Between April 2003 and December 2004

Form   April 2003     July 2003      December 2003   April 2004     July 2004      December 2004
       % Pass   N     % Pass   N     % Pass   N      % Pass   N     % Pass   N     % Pass   N
A      68       150   64       50    80.9     141    75       112   61.4     44    79.5     146
B      62.6     115   74.4     43    78       150    77.3     119   68.2     44    84.4     147
C      72.3     130   68.4     38    78.5     144    79.2     96    78.6     42    89.2     129
(% Pass = percent passing; N = number of students)
Table 2: Pass Rate for Forms A, B and C for Standard II Between April 2003 and December 2004

Form   April 2003     July 2003      December 2003   April 2004     July 2004      December 2004
       % Pass   N     % Pass   N     % Pass   N      % Pass   N     % Pass   N     % Pass   N
A      88.6     88    92.1     38    94.1     68     89.2     102   84.4     45    83.8     68
B      84.9     93    85.3     41    87.5     80     83.1     89    91.6     36    90.9     99
C      88.8     116   93.9     33    95.3     86     89.5     86    96.0     51    91.0     78
(% Pass = percent passing; N = number of students)
Table 3: Pass Rate for Forms A, B and C for Standard III Between April 2003 and December 2004

Form   April 2003     July 2003      December 2003   April 2004     July 2004      December 2004
       % Pass   N     % Pass   N     % Pass   N      % Pass   N     % Pass   N     % Pass   N
A      84.7     59    93.3     31    95.2     63     89.0     64    94.6     37    93.4     76
B      91.8     74    80.0     40    85.9     78     83.1     83    86.5     52    92.4     92
C      93.0     86    93.1     29    97.6     86     89.2     65    91.9     37    92.5     67
(% Pass = percent passing; N = number of students)
Statistical analysis of pass rates was first conducted descriptively by comparing pass
rates across forms within Standard and administration period. Further analysis was
conducted by using analysis of variance (ANOVA) methodology for the same comparisons.
Given the large number of ANOVA comparisons, a major concern for these
analyses was adequate control of the experimentwise Type I error rate. Klockars and Sax
(1986) recommended the Scheffé Test as the most widely used and straightforward
procedure for controlling the experimentwise Type I error rate and thus, it was used in
these analyses. The Scheffé Test accomplishes this protection by requiring a larger value
for the F statistic before the null hypothesis is rejected.
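
One plausible way to set up this kind of comparison is sketched below in Python, assuming candidate-level pass/fail outcomes (1 = pass, 0 = fail) are available for each Form within one Standard and administration period. The omnibus ANOVA is followed by pairwise contrasts evaluated against the Scheffé criterion; the group sizes only approximate the December 2004, Standard I counts in Table 1, and the function name is ours rather than part of the published analysis.

import numpy as np
from scipy import stats

def scheffe_pairwise(groups, alpha=0.05):
    """One-way ANOVA across groups plus Scheffé-protected pairwise contrasts."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    f_stat, p_val = stats.f_oneway(*groups)
    # Pooled within-group mean square.
    ms_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups) / (n_total - k)
    f_crit = stats.f.ppf(1 - alpha, k - 1, n_total - k)
    flagged = {}
    for i in range(k):
        for j in range(i + 1, k):
            diff = np.mean(groups[i]) - np.mean(groups[j])
            f_contrast = diff ** 2 / (ms_within * (1 / len(groups[i]) + 1 / len(groups[j])))
            # Scheffé criterion: the contrast F must exceed (k - 1) * F_crit.
            flagged[(i, j)] = bool(f_contrast > (k - 1) * f_crit)
    return f_stat, p_val, flagged

# Pass/fail vectors approximating the December 2004, Standard I pass rates in Table 1.
form_a = np.r_[np.ones(116), np.zeros(30)]    # 146 candidates, 79.5% passing
form_b = np.r_[np.ones(124), np.zeros(23)]    # 147 candidates, 84.4% passing
form_c = np.r_[np.ones(115), np.zeros(14)]    # 129 candidates, 89.2% passing
print(scheffe_pairwise([form_a, form_b, form_c]))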
From the test results shown in Table 1, the pass rates for Forms A, B and C within
Standard I (Language learning and literacy development) are roughly equivalent, with
small differences among Forms and across administrations. For the administrations in
2004, the percentage of students passing favored Form C, with differences in passing rate
greater in July and December than in previous administrations. However, ANOVA results
show that no statistically significant differences were found across Forms in any of the
administration periods.
From the data for Standard II (Reading comprehension research and best practice)
seen in Table 2, Form C had the highest pass rate on all administrations, but the differences
among Form pass rates were negligible, save July 2004. The same pattern of results can be
seen in Table 3 for Standard III (Literacy assessment and intervention). Results of the
ANOVA procedures show that no statistically significant differences were found across
forms for either Standard.
In sum, we are left with the conclusion that unsystematic or random variability
explains differences in pass rates across the Forms.
A secondary analysis was done to compare the relative difficulty of the knowledge
and performance portions of the test. This analysis was prompted by ICLA committee
members' concern that the part of the test pre-service teachers find most difficult is writing
open-ended essays in response to classroom scenarios (Squires et al., 2004). As indicated in
Table 4 (Standard I), Table 5 (Standard II), and Table 6 (Standard III), the knowledge
portions of the test (Sections 1 and 2) have accounted for more of the total points possible
than has the performance portion (Section 3). This is true for all Forms and administration
periods. Differences between the percentage of points earned on the knowledge sections
versus the performance (scenario) sections ranged from 5.7% of possible points (Standard
I, Form C, July 2004) to 39.5% (Standard III, Form B, April 2003).2
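
A short Python sketch of how this descriptive comparison could be computed from candidate-level records is given below; the column names, point totals, and example values are hypothetical (the 60-40 weighting follows Note 3), and the actual ICLA records are not reproduced.

import pandas as pd

# Hypothetical candidate-level records; know_* covers Sections 1-2, perf_* covers Section 3.
records = pd.DataFrame({
    "form":          ["A", "A", "B", "C"],
    "admin":         ["Dec 2004"] * 4,
    "know_earned":   [50, 42, 47, 52],
    "know_possible": [60, 60, 60, 60],
    "perf_earned":   [28, 22, 30, 31],
    "perf_possible": [40, 40, 40, 40],
})

records["knowledge_pct"] = 100 * records["know_earned"] / records["know_possible"]
records["performance_pct"] = 100 * records["perf_earned"] / records["perf_possible"]

# Mean percentage of possible points earned on each portion, by Form and administration.
summary = (records
           .groupby(["form", "admin"])[["knowledge_pct", "performance_pct"]]
           .mean()
           .round(1))
print(summary)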
Comparisons of the proportions of total points obtained in the knowledge and
performance sections across Forms show few differences. ANOVA results comparing
knowledge and performance points across Forms for all administration periods show that
no statistically significant differences were found.
Again, we are left with the conclusion that nonsystematic or random variation
explains differences in possible points for the knowledge and performance sections across
Forms.
Table 4: Mean Percentage of Knowledge and Performance Items for Standard I Between April
2003 and December 2004
Form  Item          April 2003   July 2003   December 2003   April 2004   July 2004   December 2004
A Knowledge 83.5 82.7 85.9 83.3 80.1 85.1
A Performance 60.0 58.3 70.0 66.7 60.0 66.7
B Knowledge 80.3 79.4 81.8 79.2 77.3 84.4
B Performance 61.7 68.3 71.7 73.3 71.7 76.7
C Knowledge 81.8 82.0 83.9 81.9 80.7 87.0
C Performance 66.7 56.7 71.7 75.0 75.0 75.0
Table 5: Mean Percentage of Knowledge and Performance Items for Standard II Between April
2003 and December 2004
Form  Item          April 2003   July 2003   December 2003   April 2004   July 2004   December 2004
A Knowledge 82.7 89.7 88.7 89.9 87.6 88.2
A Performance 66.7 70.0 73.3 70.0 70.0 63.3
B Knowledge 91.4 94.0 93.0 91.9 91.9 92.4
B Performance 61.7 61.7 65.0 68.3 66.7 61.7
C Knowledge 94.0 95.2 94.1 94.6 90.2 94.6
C Performance 68.3 66.7 71.7 68.3 75.0 61.7
Table 6: Mean Percentage of Knowledge and Performance Items for Standard III Between April
2003 and December 2004
Form  Item          April 2003   July 2003   December 2003   April 2004   July 2004   December 2004
A Knowledge 95.9 95.2 95.7 94.0 95.1 97.1
A Performance 61.7 61.7 71.7 63.3 61.7 68.3
B Knowledge 96.2 92.7 95.2 93.7 94.5 96.6
B Performance 56.7 58.3 56.7 58.3 60.0 66.7
C Knowledge 94.2 94.0 96.6 94.0 93.5 95.4
C Performance 68.3 63.3 71.7 78.3 71.7 73.3
Concluding Thoughts
So, is Form C of the ICLA easier for pre-service teachers than either Form A or B,
based upon current test data across 2003-2004? From the data in Tables 1 and 3, for
Standards I and III, Forms A, B, and C appear comparable in terms of overall student pass
rates. For Standard II (Table 2), Form C registered a slightly higher pass rate than did
Forms A and B. The actual percentage of students passing Form C, Standard II, was
within 2 percentage points of the other Forms in five of the six administration periods. July 2004 appears to be
an anomaly that we are unable to explain at this time. More conclusive, however, is that no
statistically significant differences were found across Forms for any administration period.
Table 5 sheds additional light on the issue of Form comparability. Scores on the
knowledge portion of the ICLA for Form C appeared to have been slightly higher than for
Forms A and B, which accounts for the overall slightly higher scores on Form C, Standard
II (Table 2). It should be noted that one item in the knowledge section is equivalent to
about 3 percentage points.3 Therefore, for five of the six administration periods, the
difference between Form pass rates amounts to less than one item answered correctly.
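
A rough check of this figure, under the hypothetical assumption that a Standard's knowledge sections carry on the order of 20 scored items (the exact item counts are not restated here), follows from the weighting described in Note 3:

\[ \frac{60 \text{ percentage points (knowledge weight)}}{\approx 20 \text{ knowledge items}} \approx 3 \text{ percentage points per item.} \]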
At this writing, the committee is disaggregating the data by Section, within
Standard, in its efforts to further determine if any systematic differences among Forms
exist. The committee also recognizes that test revision will be an ongoing process linked to
evolving issues of validity; as such, validity is not an end in itself (Cronbach, 1980;
Shepard, 1993). The committee continues to discuss the content domains appropriate for a
criterion referenced test of pre-service teachers’ knowledge related to literacy, as well as to
conduct item analyses and scrutinize those items that fall outside the committee's target of
moderate item difficulty. Items identified as producing unacceptably low p-values have been
modified or eliminated, and the revised items piloted.4 A standing subcommittee is using
data and committee input to develop new items and new item formats for future editions of
all Forms of the ICLA. The whole ICLA committee scrutinizes all subcommittee
recommendations for any item modifications before they are even piloted.
In addition to this oversight role, other ways of addressing differences in pass rates
include formalizing and standardizing the instructions given to proctors, for reasons of validity
and fairness. With respect to a uniform literacy curriculum, we continue to honor
differences among faculty in terms of their preferences for meeting teacher preparation
literacy standards. Early on, a significant effort was successful in setting aside contentious
issues emanating out of the “reading wars”, in order to focus on core literacy information
that all pre-service teachers need to know to meet the standards-based climate of today’s
classrooms. It may be, in light of the data being collected, that further discussions need to
occur among faculty about their pedagogical practices that seem to enhance student test
score performance on the ICLA. In fact, the ICLA members have begun such
conversations, using available test data to shed light on test content that students find
particularly difficult.
We have entered uncharted waters in the sense that while measurement specialists
have talked of an examination of test data like these, there are few published accounts of
this validity examination process actually being done, particularly with faculty-controlled
assessments. In a recent article discussing the emerging performance-based indicator
systems for NCATE-approved teacher preparation programs, a persistent issue remains—
whether these systems “…will satisfy policy makers’ demands for results” (Olson, 2005, p.
19). This uncertainty also clouds the future of the ICLA. It is in this climate that the ICLA
committee continues to embrace the position stated by Cronbach more than twenty-five
years ago:
The validity of interpretation cannot be established by a research monograph or
a detailed manual. The aim for the report is to advance sensible discussion.
Why should we wish for more? On matters before the public, the evidence is
usually clouded. The institutions of the polity are geared to weigh up
reasonable, partly persuasive, disputed arguments; and they can be tolerant
when we acknowledge uncertainties. The more we learn, and the franker we
are with ourselves and our clientele, the more valid the use of tests will
become. (1980, p. 107)
Acknowledgement
The authors would like to thank Huihua He for statistical consultation on this paper.
Notes
1. The ICLA has been operational since April 2002 for Forms A and B only; Form C was initiated in
April 2003. Consequently, data for 2003-2004 were reported for purposes of Form comparison.
2. Concern in this part of the secondary analysis was whether the performance portion of the test was
more difficult than the knowledge portion of the test (regardless of form). Thus, descriptive
comparisons were conducted by comparing the mean percentage of points obtained on each portion
of the test. This is a simple and intuitive means of analysis.
3. Test items are weighted to accomplish a 60%-40% split between objective and performance items,
respectively.
4. Operationally, any item with a p-value less than .50 is flagged for revision or elimination.
References
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological
Testing. (1999). Standards for educational and psychological testing. Washington DC: American
Educational Research Association.
Angoff, W.H. (1971). Norms, scales, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.
Cronbach, L.J. (1980). Validity on parole: How can we go straight? New Directions for Testing and Measurement, 5, 99-108.
Cronbach, L.J. (1988). Five perspectives on the validity argument. In H. Wainer & H.I. Braun
(Eds.), Test validity (pp. 3-18). Hillsdale, NJ: Erlbaum.
Cizek, G.J. (1996). Setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.
Haney, W., Fowler, C., & Wheelock, A. (1999). Less truth than error? An independent study of the Massachusetts Teacher Test. Education Policy Analysis Archives, 7(4). Retrieved November 9, 2004, from http://epaa.asu.edu/epaa/v7n4/
Idaho's MOST. (2001). Idaho maximizing opportunities for students and teachers. Retrieved May 21, 2004, from http://www.sde.state.id.us/MOST/Reading.htm
International Reading Association, Professional Standards and Ethics Committee. (1998). Standards for reading professionals (Rev. ed.). Newark, DE: International Reading Association.
Klockars, A.J., & Sax, G. (1986). Multiple comparisons. Beverley Hills, CA: Sage.
Kmitta, D., & Goellner, L. (2002, July). Fairness review report: Idaho comprehensive literacy assessment. Paper presented at the National Evaluation Institute, Boise, ID.
Linn, R.L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83-102.
Ludlow, L.H. (2001). Teacher test accountability: From Alabama to Massachusetts. Education Policy Analysis Archives, 9(6). Retrieved November 9, 2004, from http://epaa.asu.edu/epaa/v9n6/
Olson, L. (2005). Education schools use performance standards to improve graduates. Education Week, 24(36), 1, 18-19.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational Measurement (3rd ed.). Washington, DC: National Council on Measurement in Education and American Council on Education/Macmillan.
Shepard, L.A. (1993). Evaluating test validity. In L. Darling-Hammond & J.A. Banks (Eds.),
Review of research in education (pp. 405-450). Washington, DC: American Educational Research
Association.
Squires, D., Canney, G.F., & Trevisan, M.S. (2004, November 9). There is another way: The faculty-developed Idaho Comprehensive Literacy Assessment for K-8 pre-service teachers. Education Policy Analysis Archives, 12(62). Retrieved November 11, 2004, from http://epaa.asu.edu/epaa/v12n62
U.S. Department of Education, Office for Civil Rights. (2000). The use of tests as part of high-stakes decision-making for students: A resource guide for educators and policymakers. Washington, DC: Author. Retrieved May 21, 2004, from http://www.ed.gov/offices/OCR/archives/testing/index1.html
The Authors
DAVID SQUIRES was an elementary classroom teacher and reading teacher. He is
currently a teacher educator. His research interests include the impact of theory on
classroom instructional decisions and pre-service teacher literacy assessment.
MICHAEL S. TREVISAN is currently director of the Assessment and Evaluation Center at
Washington State University. His professional interests include classroom and large-scale
educational assessment, program evaluation, and evaluation capacity building.
GEORGE F. CANNEY was an elementary classroom teacher and reading teacher. For over
thirty years, his professional interests have included pre-service and in-service teacher
education on beginning reading instruction, vocabulary and comprehension development,
literacy assessment and intervention.
Correspondence: <[email protected]>