Studies in Educational Evaluation 32 (2006) 369–380
INVESTIGATING FORM COMPARABILITY IN THE IDAHO COMPREHENSIVE LITERACY ASSESSMENT: MATTERS OF FAIRNESS AND TRANSPARENCY
David Squires*, Michael S. Trevisan**, and George F. Canney***
*Idaho State University
**Assessment and Evaluation Center, Washington State University
***Department of Curriculum and Instruction, University of Idaho
Abstract
The Idaho Comprehensive Literacy Assessment (ICLA) is a faculty-developed, state-
wide, high-stakes assessment of pre-service teachers’ knowledge and application of
research based literacy practices. The literacy faculty control all aspects of the test,
including construction, refinement, administration, scoring and reporting. The test
development and validation process is open to public inspection. This article
investigates the comparability of three Forms administered across two years. The
results of statistical analyses show that while there are small differences in pass rates
across forms in some administrative periods, those differences are negligible and due
to unsystematic variability.
Introduction
High-stakes testing requires empirical data, reasoned analysis, and balanced
judgment. High-stakes teacher testing is a case in point. What little evidence there is in the
literature about teacher tests and testing suggests that current tests are often poorly crafted,
unconnected with what teachers need to know and be able to do, and lack evidence that
warrants the use of such tests to ascertain teacher competence (Haney, Fowler, &
Wheelock, 1999; Ludlow, 2001). Nevertheless, the national trend remains strong for using
tests for both the accountability of teacher preparation programs and licensure of pre-
service teachers. To this end, the development of high-quality tests is essential.
In contrast to most teacher tests currently employed in the United States, the Idaho
Comprehensive Literacy Assessment (ICLA) has been developed by the teacher education
0191-491X/04/$ – see front matter © 2006 Published by Elsevier Ltd.
doi:10.1016/j.stueduc.2006.10.004
literacy faculty across Idaho. The ICLA faculty committee crafted this criterion-referenced
measure in light of state and international standards for teacher preparation in reading
(Idaho MOST Standards, 2001; International Reading Association, 1998). They
systematically refined the ICLA formats and content utilizing test data from each
administration. This systematic process has impacted literacy course content at all the
teacher preparation institutions in Idaho. A recent article detailed the history, politics, and
test development for this innovative pre-service teacher certification test (Squires, Canney,
& Trevisan, 2004). Guiding this work has been the application of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing, 1999) and
guidelines for the development and implementation of high-stakes tests and testing policy
developed by the United States Department of Education, Office of Civil Rights (2002).
Passage of the ICLA has been a requirement for elementary and special
education teacher certification since August 2002.
In this article, we describe our ongoing commitment to develop a valid, high-stakes
assessment open to public inspection. The decision to be transparent has been a central
feature of the ICLA development process. Cronbach (1980) noted that the task in the test
validation process is not to usurp the general public’s prerogative to judge the validity of
the test, but to clarify validity issues surrounding the assessment so that stakeholders can
make informed decisions. Despite Cronbach’s exhortations, as well as those of others, we
know of few appraisals of the validity of high-stakes tests for teacher certification open to
public scrutiny. The consequences of not having a public examination of teacher
certification assessments are negative and profound, as exemplified by court cases in
Alabama and Massachusetts (Ludlow, 2001).
Many in the Idaho educational hierarchy questioned faculty investment, motives,
and competence to create a high-stakes, statewide literacy test for pre-service teachers.
Despite such criticism and minimal financial support, the faculty have persisted in this
endeavor. While open to the concerns and insights of literacy colleagues, the literacy
research base, and data from the ICLA, they have forged ahead with this project. The goal
has always been to devise a valid, reliable, and relevant measure of pre-service teachers’
knowledge of core literacy practices as a part of the overall process to prepare teachers
highly qualified to meet each child’s learning needs. This article, in its examination of
Form comparability, reflects the best efforts of the committee in its self-examination of the
test development process and commitment to public transparency in reporting the results
and outcomes of their work.
Early in the project, a validity argument was created to guide the development, use,
and ongoing validation of the ICLA, particularly with an eye toward fairness (Cronbach,
1988; Shepard, 1993). The argument included item and test specifications, procedures for
pilot testing and item evaluation, and the empirical tasks necessary to test the validity
argument (see Squires et al., 2004, for details). An ad hoc committee composed of
members from underrepresented groups in Idaho independently reviewed the item content
and language of the ICLA for systematic biases likely to negatively impact teacher
candidates of those groups (Kmitta & Goellner, 2002). The ad hoc committee accepted the
texts as fair after several minor revisions.
The following guidelines from the United States Department of Education, Office
of Civil Rights (2002) were also implemented:
1. Students received adequate notice of the test and its consequences; test dates, times
and locations were posted well in advance online and at each teacher preparation
institution. Students pre-registered with staff at each location weeks prior to the
administration of the ICLA.
2. Students were given access to the information being tested when faculty aligned
course content to address the content covered in the ICLA.
3. The Angoff Method and the Benchmark Method (Cizek, 1996) were chosen by the
faculty to determine the cut score for passing each Standard, and the results of this
process were documented (a simplified illustration of the Angoff logic follows this list).
4. An electronic study guide sectioned to match the parts of each of the three ICLA
Standards was made available free of charge to all students.
5. Accommodations were developed for special needs students; in addition, any
student could retake a Standard multiple times.
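
To make the logic of the Angoff Method named in item 3 concrete, the following Python sketch shows a simplified Angoff-style cut score calculation. The judges, ratings, and function name are hypothetical; the committee's actual standard-setting data and procedures are not reproduced here.

# A simplified Angoff-style cut score calculation (illustrative only; the
# committee's actual judges, ratings, and procedures are not shown here).
# Each judge estimates, for every item, the probability that a minimally
# competent candidate answers it correctly; the suggested cut score is the
# mean, across judges, of each judge's summed ratings.

def angoff_cut_score(ratings):
    """ratings: one list per judge, each holding a probability per item."""
    judge_totals = [sum(judge) for judge in ratings]   # expected raw score per judge
    return sum(judge_totals) / len(judge_totals)       # average across judges

# Hypothetical ratings from three judges on a five-item section.
ratings = [
    [0.9, 0.8, 0.7, 0.6, 0.8],
    [0.8, 0.9, 0.6, 0.7, 0.7],
    [0.9, 0.7, 0.8, 0.6, 0.9],
]
print(f"Suggested cut score: {angoff_cut_score(ratings):.1f} of 5 points")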
The ICLA addresses three distinct areas, or Standards: Standard I – Language
Learning and Literacy Development; Standard II – Reading Comprehension Research and
Best Practices; and Standard III – Literacy Assessment and Intervention (Figure 1).
Standard I – Language Learning and Literacy Development
    Section 1: Definitions
    Section 2: Phonological awareness; phonics and structural analysis; instructional strategies
    Section 3: Essay-like scenarios

Standard II – Reading Comprehension Research and Best Practices
    Section 1: Definitions
    Section 2: Instructional strategies
    Section 3: Essay-like scenarios

Standard III – Literacy Assessment and Intervention
    Section 1: Definitions
    Section 2: Instructional strategies
    Section 3: Essay-like scenarios

Figure 1: Configuration of the Idaho Comprehensive Literacy Assessment
Each Standard is divided into three Sections. Section 1 requires matching common
literacy terms with definitions; Section 2 matches terms to descriptions of research-based
instructional strategies; Section 3 presents two scenarios with two or three subparts.
Candidates provide essay responses that literacy faculty evaluate using a 4-point holistic
scoring rubric; faculty also have scoring guides to inform the consistent and accurate
application of the scoring rubric as it pertains to that particular essay. The item formats for
Standards I, II, and III are the same, with one exception. Standard I, Section 2, has an
additional set of eleven items that assess candidates’ knowledge of phonemes, graphemes,
syllables, morphemes, and phonic patterns; the Forms of Standard I, however, are
configured alike. Therefore, all elements of the three Standards, across Forms of the ICLA,
were developed around the same content and item specifications (for an in-depth
description, see Squires et al., 2004).
The faculty openly discussed and shared assessment data in various forums in
acknowledgement of the public trust they were given to create a high-stakes test legislated
by the state representatives. In recent conversations, as a result of ongoing test data
analyses, the committee has become concerned that one Form of the ICLA, Form C, may
be easier for pre-service teachers than Forms A and B. Forms A and B were developed by
the same subcommittee and a year and a half earlier than Form C. The difference in
subcommittee membership for the development of Forms A and B versus Form C, and the
difference in time when the Forms were developed, has left a lingering perception that
Form C is not as challenging as A and B. Since the goal remains the development of a
teacher certification test that validly, reliably, and fairly measures individual pre-service
teachers’ knowledge about research-based best practices in literacy instruction, this article
explores the comparability of Forms across two years of testing. At issue has been whether
two candidates taking different Forms of the ICLA are presented with an equivalent
challenge.
Administration and Test Equating
Administration of the three forms of the ICLA is typically done in a serial fashion
within Standard. Serial administration has been shown to produce sample characteristics
similar to those obtained when tests are randomly assigned (Angoff, 1971, p. 569),
providing the basis for inferential statistical comparisons of student performance across
forms. Also, serial administration de facto creates an equivalent-groups data collection
design for test equating investigations, a design recommended as a viable alternative when
equating investigations are constrained by logistics or resources (Petersen, Kolen, &
Hoover, 1989). Test forms within Standard were equated using the equipercentile method
based on data from the December 2004 administration. Smooth curves were drawn by
hand.
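
As an illustration of the unsmoothed core of equipercentile equating under an equivalent-groups design, the Python sketch below maps a score on one form to the score with the same percentile rank on another form. The score distributions, sample sizes, and function names are hypothetical; the ICLA equating itself used hand-smoothed curves rather than this code.

import numpy as np

def percentile_ranks(scores):
    """Percentile rank of each observed score, using the mid-percentile convention."""
    scores = np.asarray(scores)
    points = np.unique(scores)
    below = np.array([(scores < x).mean() for x in points])
    at = np.array([(scores == x).mean() for x in points])
    return points, 100 * (below + at / 2)

def equate(score, from_scores, to_scores):
    """Map a score on the 'from' form to the equivalent score on the 'to' form."""
    f_pts, f_pr = percentile_ranks(from_scores)
    t_pts, t_pr = percentile_ranks(to_scores)
    pr = np.interp(score, f_pts, f_pr)   # percentile rank of the score on its own form
    return np.interp(pr, t_pr, t_pts)    # score on the other form with that same rank

# Hypothetical score distributions on a 40-point section for two equivalent groups.
rng = np.random.default_rng(0)
form_a_scores = rng.binomial(40, 0.70, size=150)
form_c_scores = rng.binomial(40, 0.75, size=130)
print(f"A score of 28 on Form A equates to about {equate(28, form_a_scores, form_c_scores):.1f} on Form C")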
Analysis and Discussion of ICLA Pass Rates by Form
To examine Form comparability on the ICLA, student pass rates were compared
across Forms and sections within Standard and administration period. Data from six
administrations of the ICLA each semester and summer between April 2003 and December
2004 were analyzed to ascertain Form comparability. Candidate performance is
summarized for each Form for spring 2003 through December 2004.1
Table 1: Pass Rate for Forms A, B and C for Standard I Between April 2003 and December 2004

Form   April 2003     July 2003      December 2003   April 2004     July 2004      December 2004
       % Pass   N     % Pass   N     % Pass   N      % Pass   N     % Pass   N     % Pass   N
A      68       150   64       50    80.9     141    75       112   61.4     44    79.5     146
B      62.6     115   74.4     43    78       150    77.3     119   68.2     44    84.4     147
C      72.3     130   68.4     38    78.5     144    79.2     96    78.6     42    89.2     129
(% Pass = percent passing; N = number of students)
Table 2: Pass Rate for Forms A, B and C for Standard II Between April 2003 and December 2004

Form   April 2003     July 2003      December 2003   April 2004     July 2004      December 2004
       % Pass   N     % Pass   N     % Pass   N      % Pass   N     % Pass   N     % Pass   N
A      88.6     88    92.1     38    94.1     68     89.2     102   84.4     45    83.8     68
B      84.9     93    85.3     41    87.5     80     83.1     89    91.6     36    90.9     99
C      88.8     116   93.9     33    95.3     86     89.5     86    96.0     51    91.0     78
(% Pass = percent passing; N = number of students)
Table 3: Pass Rate for Forms A, B and C for Standard III Between April 2003 and December 2004

Form   April 2003     July 2003      December 2003   April 2004     July 2004      December 2004
       % Pass   N     % Pass   N     % Pass   N      % Pass   N     % Pass   N     % Pass   N
A      84.7     59    93.3     31    95.2     63     89.0     64    94.6     37    93.4     76
B      91.8     74    80.0     40    85.9     78     83.1     83    86.5     52    92.4     92
C      93.0     86    93.1     29    97.6     86     89.2     65    91.9     37    92.5     67
(% Pass = percent passing; N = number of students)
Statistical analysis of pass rates was first conducted descriptively by comparing pass
rates across forms within Standard and administration period. Further analysis was
conducted by using analysis of variance (ANOVA) methodology for the same comparisons.
Given the large number of ANOVA comparisons, a major concern for these
analyses was adequate control of the experimentwise Type I error rate. Klockars and Sax
(1986) recommended the Scheffé Test as the most widely used and straightforward
procedure for controlling the experimentwise Type I error rate and thus, it was used in
these analyses. The Scheffé Test accomplishes this protection by requiring a larger value
for the F statistic before the null hypothesis is rejected.
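
One plausible way to set up this kind of comparison is sketched below in Python, assuming candidate-level pass/fail outcomes (1 = pass, 0 = fail) are available for each Form within one Standard and administration period. The omnibus ANOVA is followed by pairwise contrasts evaluated against the Scheffé criterion; the group sizes only approximate the December 2004, Standard I counts in Table 1, and the function name is ours rather than part of the published analysis.

import numpy as np
from scipy import stats

def scheffe_pairwise(groups, alpha=0.05):
    """One-way ANOVA across groups plus Scheffé-protected pairwise contrasts."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    f_stat, p_val = stats.f_oneway(*groups)
    # Pooled within-group mean square.
    ms_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups) / (n_total - k)
    f_crit = stats.f.ppf(1 - alpha, k - 1, n_total - k)
    flagged = {}
    for i in range(k):
        for j in range(i + 1, k):
            diff = np.mean(groups[i]) - np.mean(groups[j])
            f_contrast = diff ** 2 / (ms_within * (1 / len(groups[i]) + 1 / len(groups[j])))
            # Scheffé criterion: the contrast F must exceed (k - 1) * F_crit.
            flagged[(i, j)] = bool(f_contrast > (k - 1) * f_crit)
    return f_stat, p_val, flagged

# Pass/fail vectors approximating the December 2004, Standard I pass rates in Table 1.
form_a = np.r_[np.ones(116), np.zeros(30)]    # 146 candidates, 79.5% passing
form_b = np.r_[np.ones(124), np.zeros(23)]    # 147 candidates, 84.4% passing
form_c = np.r_[np.ones(115), np.zeros(14)]    # 129 candidates, 89.2% passing
print(scheffe_pairwise([form_a, form_b, form_c]))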
From the test results shown in Table 1, the pass rates for Forms A, B and C within
Standard I (Language learning and literacy development) are roughly equivalent, with
small differences among Forms and across administrations. For the administrations in
2004, the percentage of students passing favored Form C, with differences in passing rate
greater in July and December than in previous administrations. However, ANOVA results
show that no statistically significant differences were found across Forms in any of the
administration periods.
From the data for Standard II (Reading comprehension research and best practice)
seen in Table 2, Form C had the highest pass rate on all administrations, but the differences
among Form pass rates were negligible, save July 2004. The same pattern of results can be
seen in Table 3 for Standard III (Literacy assessment and intervention). Results of the
ANOVA procedures show that no statistically significant differences were found across
forms for either Standard.
In sum, we are left with the conclusion that unsystematic or random variability
explains differences in pass rates across the Forms.
A secondary analysis was done to compare the relative difficulty of the knowledge
and performance portions of the test. This analysis was prompted by ICLA committee
members' concern that the part of the test pre-service teachers find most difficult is writing
open-ended essays in response to classroom scenarios (Squires et al., 2004). As indicated in
Table 4 (Standard I), Table 5 (Standard II), and Table 6 (Standard III), the knowledge
portions of the test (Sections 1 and 2) have accounted for more of the total points possible
than has the performance portion (Section 3). This is true for all Forms and administration
periods. Differences between the percentage of points earned on the knowledge sections
versus the performance (scenario) sections ranged from 5.7% of possible points (Standard
I, Form C, July 2004) to 39.5% (Standard III, Form B, April 2003).2
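
A short Python sketch of how this descriptive comparison could be computed from candidate-level records is given below; the column names, point totals, and example values are hypothetical (the 60-40 weighting follows Note 3), and the actual ICLA records are not reproduced.

import pandas as pd

# Hypothetical candidate-level records; know_* covers Sections 1-2, perf_* covers Section 3.
records = pd.DataFrame({
    "form":          ["A", "A", "B", "C"],
    "admin":         ["Dec 2004"] * 4,
    "know_earned":   [50, 42, 47, 52],
    "know_possible": [60, 60, 60, 60],
    "perf_earned":   [28, 22, 30, 31],
    "perf_possible": [40, 40, 40, 40],
})

records["knowledge_pct"] = 100 * records["know_earned"] / records["know_possible"]
records["performance_pct"] = 100 * records["perf_earned"] / records["perf_possible"]

# Mean percentage of possible points earned on each portion, by Form and administration.
summary = (records
           .groupby(["form", "admin"])[["knowledge_pct", "performance_pct"]]
           .mean()
           .round(1))
print(summary)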
Comparisons of the proportions of total points obtained in the knowledge and
performance sections across Forms show few differences. ANOVA results comparing
knowledge and performance points across Forms for all administration periods show that
no statistically significant differences were found.
Again, we are left with the conclusion that nonsystematic or random variation
explains differences in possible points for the knowledge and performance sections across
Forms.
Table 4: Mean Percentage of Knowledge and Performance Items for Standard I Between April
2003 and December 2004
Form  Item          April 2003   July 2003   December 2003   April 2004   July 2004   December 2004
A Knowledge 83.5 82.7 85.9 83.3 80.1 85.1
A Performance 60.0 58.3 70.0 66.7 60.0 66.7
B Knowledge 80.3 79.4 81.8 79.2 77.3 84.4
B Performance 61.7 68.3 71.7 73.3 71.7 76.7
C Knowledge 81.8 82.0 83.9 81.9 80.7 87.0
C Performance 66.7 56.7 71.7 75.0 75.0 75.0
Table 5: Mean Percentage of Knowledge and Performance Items for Standard II Between April
2003 and December 2004
Form  Item          April 2003   July 2003   December 2003   April 2004   July 2004   December 2004
A Knowledge 82.7 89.7 88.7 89.9 87.6 88.2
A Performance 66.7 70.0 73.3 70.0 70.0 63.3
B Knowledge 91.4 94.0 93.0 91.9 91.9 92.4
B Performance 61.7 61.7 65.0 68.3 66.7 61.7
C Knowledge 94.0 95.2 94.1 94.6 90.2 94.6
C Performance 68.3 66.7 71.7 68.3 75.0 61.7
Table 6: Mean Percentage of Knowledge and Performance Items for Standard III Between April
2003 and December 2004
Form  Item          April 2003   July 2003   December 2003   April 2004   July 2004   December 2004
A Knowledge 95.9 95.2 95.7 94.0 95.1 97.1
A Performance 61.7 61.7 71.7 63.3 61.7 68.3
B Knowledge 96.2 92.7 95.2 93.7 94.5 96.6
B Performance 56.7 58.3 56.7 58.3 60.0 66.7
C Knowledge 94.2 94.0 96.6 94.0 93.5 95.4
C Performance 68.3 63.3 71.7 78.3 71.7 73.3
Concluding Thoughts
So, is Form C of the ICLA easier for pre-service teachers than either Form A or B,
based upon current test data across 2003-2004? From the data in Tables 1 and 3, for
Standards I and III, Forms A, B, and C appear comparable in terms of overall student pass
rates. For Standard II (Table 2), Form C registered a slightly higher pass rate than did
Forms A and B. The actual percentage of students passing Form C, Standard II, was
within 2 percentage points of the other Forms in five of the six administration periods. July 2004 appears to be
an anomaly that we are unable to explain at this time. More conclusive, however, is that no
statistically significant differences were found across Forms for any administration period.
Table 5 sheds additional light on the issue of Form comparability. Scores on the
knowledge portion of the ICLA for Form C appeared to have been slightly higher than for
Forms A and B, which accounts for the overall slightly higher scores on Form C, Standard
II (Table 2). It should be noted that one item in the knowledge section is equivalent to
about 3 percentage points.3 Therefore, for five of the six administration periods, the
difference between Form pass rates amounts to less than one item answered correctly.
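
A rough check of this figure, under the hypothetical assumption that a Standard's knowledge sections carry on the order of 20 scored items (the exact item counts are not restated here), follows from the weighting described in Note 3:

\[ \frac{60 \text{ percentage points (knowledge weight)}}{\approx 20 \text{ knowledge items}} \approx 3 \text{ percentage points per item.} \]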
At this writing, the committee is disaggregating the data by Section, within
Standard, in its efforts to further determine if any systematic differences among Forms
exist. The committee also recognizes that test revision will be an ongoing process linked to
evolving issues of validity; as such, validity is not an end in itself (Cronbach, 1980;
Shepard, 1993). The committee continues to discuss the content domains appropriate for a
criterion referenced test of pre-service teachers’ knowledge related to literacy, as well as to
conduct item analyses and scrutinize those items that fall outside the committee's target of
moderate item difficulty. Items identified as producing unacceptably low p-values have been
modified or eliminated, and the revised items piloted.4 A standing subcommittee is using
data and committee input to develop new items and new item formats for future editions of
all Forms of the ICLA. The whole ICLA committee scrutinizes all subcommittee
recommendations for any item modifications before they are even piloted.
In addition to this oversight role, other ways of addressing differences in pass rates
include formalizing and standardizing the instructions given to proctors, for reasons of validity
and fairness. With respect to a uniform literacy curriculum, we continue to honor
differences among faculty in terms of their preferences for meeting teacher preparation
literacy standards. Early on, a significant effort was successful in setting aside contentious
issues emanating out of the “reading wars”, in order to focus on core literacy information
that all pre-service teachers need to know to meet the standards-based climate of today’s
classrooms. It may be, in light of the data being collected, that further discussions need to
occur among faculty about their pedagogical practices that seem to enhance student test
score performance on the ICLA. In fact, the ICLA members have begun such
conversations, using available test data to shed light on test content that students find
particularly difficult.
We have entered uncharted waters in the sense that while measurement specialists
have talked of an examination of test data like these, there are few published accounts of
this validity examination process actually being done, particularly with faculty-controlled
assessments. In a recent article discussing the emerging performance-based indicator
systems for NCATE-approved teacher preparation programs, a persistent issue remains—
whether these systems “…will satisfy policy makers’ demands for results” (Olson, 2005, p.
19). This uncertainty also clouds the future of the ICLA. It is in this climate that the ICLA
committee continues to embrace the position stated by Cronbach more than twenty-five
years ago:
The validity of interpretation cannot be established by a research monograph or
a detailed manual. The aim for the report is to advance sensible discussion.
Why should we wish for more? On matters before the public, the evidence is
usually clouded. The institutions of the polity are geared to weigh up
reasonable, partly persuasive, disputed arguments; and they can be tolerant
when we acknowledge uncertainties. The more we learn, and the franker we
are with ourselves and our clientele, the more valid the use of tests will
become. (1980, p. 107)
Acknowledgement
The authors would like to thank Huihua He for statistical consultation on this paper.
Notes
1. The ICLA has been operational since April 2002 for Forms A and B only; Form C was initiated in
April 2003. Consequently, data for 2003-2004 were reported for purposes of Form comparison.
2. Concern in this part of the secondary analysis was whether the performance portion of the test was
more difficult than the knowledge portion of the test (regardless of form). Thus, descriptive
comparisons were conducted by comparing the mean percentage of points obtained on each portion
of the test. This is a simple and intuitive means of analysis.
3. Test items are weighted to accomplish a 60%-40% split between objective and performance items,
respectively.
4. Operationally, any item with a p-value less than .50 is flagged for revision or elimination.
References
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological
Testing. (1999). Standards for educational and psychological testing. Washington DC: American
Educational Research Association.
Angoff, W.H. (1971). Norms, scales, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.
Cronbach, L.J. (1980). Validity on parole: How can we go straight? New Directions for Testing and Measurement, 5, 99-108.
Cronbach, L.J. (1988). Five perspectives on the validity argument. In H. Wainer & H.I. Braun
(Eds.), Test validity (pp. 3-18). Hillsdale, NJ: Erlbaum.
Cizek, G.J. (1996). Setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.
Haney, W., Fowler, C., & Wheelock, A. (1999). Less truth than error? An independent study of the Massachusetts Teacher Test. Education Policy Analysis Archives, 7(4). Retrieved November 9, 2004, from http://epaa.asu.edu/epaa/v7n4/
Idaho's MOST. (2001). Idaho maximizing opportunities for students and teachers. Retrieved May 21, 2004, from http://www.sde.state.id.us/MOST/Reading.htm
International Reading Association, Professional Standards and Ethics Committee. (1998). Standards for reading professionals (Rev. ed.). Newark, DE: International Reading Association.
Klockars, A.J., & Sax, G. (1986). Multiple comparisons. Beverley Hills, CA: Sage.
Kmitta, D., & Goellner, L. (2002, July). Fairness review report: Idaho comprehensive literacy assessment. Paper presented at the National Evaluation Institute, Boise, ID.
Linn, R.L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83-102.
Ludlow, L.H. (2001). Teacher test accountability: From Alabama to Massachusetts. Education Policy Analysis Archives, 9(6). Retrieved November 9, 2004, from http://epaa.asu.edu/epaa/v9n6/
Olson, L. (2005). Education schools use performance standards to improve graduates. Education Week, 24(36), 1, 18-19.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational Measurement (3rd ed.). Washington, DC: National Council on Measurement in Education and American Council on Education/Macmillan.
Shepard, L.A. (1993). Evaluating test validity. In L. Darling-Hammond & J.A. Banks (Eds.),
Review of research in education (pp. 405-450). Washington, DC: American Educational Research
Association.
Squires, D., Canney, G.F., & Trevisan, M.S. (2004, November 9). There is another way: The faculty-developed Idaho Comprehensive Literacy Assessment for K-8 pre-service teachers. Education Policy Analysis Archives, 12(62). Retrieved November 11, 2004, from http://epaa.asu.edu/epaa/v12n62
U.S. Department of Education, Office for Civil Rights. (2000). The use of tests as part of high-stakes decision-making for students: A resource guide for educators and policymakers. Washington, DC: Author. Retrieved May 21, 2004, from http://www.ed.gov/offices/OCR/archives/testing/index1.html
The Authors
DAVID SQUIRES was an elementary classroom teacher and reading teacher. He is
currently a teacher educator. His research interests include the impact of theory on
classroom instructional decisions and pre-service teacher literacy assessment.
MICHAEL S. TREVISAN is currently director of the Assessment and Evaluation Center at
Washington State University. His professional interests include classroom and large-scale
educational assessment, program evaluation, and evaluation capacity building.
GEORGE F. CANNEY was an elementary classroom teacher and reading teacher. For over
thirty years, his professional interests have included pre-service and in-service teacher
education on beginning reading instruction, vocabulary and comprehension development,
literacy assessment and intervention.
Correspondence: <[email protected]>