There are dozens of dimensions on which almost any
commodity can be judged. When people buy a new automobile, for
example, they may attend to such factors as style, safety, gasoline
mileage, and so on. In the same way, when educators attempt to judge
the appropriateness of a testing device, they can focus on all sorts of
considerations. Some of these considerations, naturally, are more
defensible than others. In this chapter we consider six factors that are
pivotal in deciding whether or not an educational test is suitable.
Because the last three of these factors to be described are treated
at greater length in separate chapters, they are only introduced
at this time.
It must be noted at the outset of this discussion that the six factors to be discussed in this chapter do not, obviously, exhaust the range of considerations that prospective test users will wish to apply when evaluating educational tests. We will not, for example, do much more than mention the cost of a test. Yet, quite obviously, if you're a school principal who needs 100 tests and you have only $200 to make a purchase, it becomes significant whether the tests cost more than $2 each.
There are other things to consider in evaluating educational
tests, such as the ease with which they can be administered and
scored. All of
these practical considerations are genuinely important.
However, in attempting to cover all the possible points to review when evaluating a test, the reader would soon be overwhelmed. This is another instance where more would surely turn out to be less.
As a consequence, there will be a deliberate effort to focus your attention on only six significant factors to employ while judging the quality of an educational test. If you become truly conversant with these half dozen dimensions and apply them more or less systematically when scrutinizing educational tests, you should be able to do a decisively better job in distinguishing between winning and losing tests.
Surely, choices can be made among educational tests. In a very common case, for example, teachers at a particular school (or officials for an entire school district) will find themselves obliged to select a test for determining the effectiveness of a language arts program. Obviously, because a number of different commercial test publishers have language arts tests to peddle, a choice must be made among them. Whether that choice will be made wisely or whimsically depends in large measure on the sophistication that the test choosers bring to their task.
Although the act of choosing among commercially published tests presents a fairly common and straightforward instance where one's test evaluation skills should be sharp, there are numerous other occasions when educators will need expertise in deciding whether a test is laudable or laughable. To illustrate, in recent years a good many large and middle-size school districts have, sometimes rather energetically, gone into the test development business. Most often these tests are intended exclusively for use in the local district, although there are some cases in which a school district has created a test for local use but has subsequently turned the test over to a commercial publishing house that distributes the test nationally, with royalties going back to the district. The Los Angeles Unified School District, for example, developed a test of high school students' reading skills in the late seventies, then turned that test over for distribution purposes to a commercial publisher, CTB/McGraw-Hill.
These school district test development efforts often result in the production of first-rate tests. In other cases, because district personnel lack either the test development expertise or the financial resources to do the job properly, the resulting tests make better scratch paper than assessment devices. The trick is to figure out which tests are which. An educational test should not be assumed to be a decent assessment tool merely because it is produced by local test developers. After all, even some of mother's homemade apple pies may have been baked too long.
Teachers and administrators who are supposed to employ
these district-developed assessment tools have a professional
responsibility to judge whether the tests are good ones. To make these
kinds of judgments sensibly requires educators to bring a set of
defensible standards to the test evaluation enterprise. If a test is found
wanting, it is the responsibility of teachers and administrators to call
those deficits to the attention of the district's officials who, often
because of parental pride, fail to see the flaws in their locally sired
assessment devices.
At an even more personal level, many teachers create their own
tests. Some of these tests are, of necessity, churned out in a hurry to
meet the requirements of an end-of-unit exam or a once-a-week quiz
scheme. Such tests, as any harassed teacher will tell you, are not
intended to serve as inspired examples of assessment at its best.
However, there are other tests that classroom teachers develop
which reflect substantially more work. Some of these tests are
employed to make significant decisions about students. For instance,
an end-of-course examination in many cases may help the teacher
decide whether a student gets an A, B, C, D, or F. In some classes a
course grade can be the deciding factor in whether a student moves to
the next class or grade level. Oh yes, the test-based stakes in many
classes are truly high ones.
Now in such settings it should be apparent that considerable
attention
should be given by teachers to the quality of their
examinations. Teachers who possess the skills necessary to detect flaws
in their own tests can then set about rectifying those defects. Teachers
who do not possess the skills needed to evaluate tests will, in all
probability, continue to employ flawed assessment tools. As a consequence, they will often make inappropriate decisions about the students who must suffer through those tests.
In these and other instances, it should be clear that the educator
who knows how to evaluate testing instruments is going to be a more
effective professional than one who is intimidated by such tests. Let's
turn, therefore, to a consideration of six factors which will elevate the
reader from the status of Test Evaluation Tyro to that of Test
Evaluation Tiger.
EVALUATIVE FACTOR ONE: DESCRIPTION OF MEASURED
BEHAVIOR
The initial factor to consider in appraising an educational test stems from the reason that such tests were created in the first place, namely, to allow us to draw inferences about an examinee's status. That examinee's status, of course, is being assessed with respect to a particular focus. We obviously can't measure everything about examinees at one time. No, we attempt to assess a limited sphere of the examinee's skill, aptitude, and the like. It is imperative, therefore, that an educational test be accompanied by a description of just what it is that the test is attempting to measure.
A description of measured behavior is available for every commercially published test and, almost without exception, for district-developed educational tests as well. With most teacher-made tests designed for classroom use, the description of learner behavior usually bounces around in teachers' heads rather than on paper, although some teachers do take the trouble to explicate in writing what it is that their major examinations are supposedly measuring.
A description of measured behavior can range in length and detail from a terse, almost telegramlike behavioral objective (that is supposedly measured by the test) to a detailed, highly elaborate set of test specifications, setting forth the permissible operations that can be used to generate test items. In all cases, although certain of these descriptions are obviously more communicative than others, the intention of the description is to convey to the user of the test a notion of what it is that the test is allegedly measuring.
The attentive reader will have noted the use of the term allegedly in the previous sentence. As we shall see later in this chapter, not every test that is called a cow by its creators can, in fact, give milk. For example, some tests that are described as "measures of higher-order problem solving" turn out to be little more than tests of one's vocabulary and short-term memory skills. The match between what a test says it's measuring and what the test actually measures constitutes a key aspect of the test's validity. Validity of a test will be the fifth factor we consider in this chapter.[1]
[1] The reader will, hopefully, not be too distressed if I occasionally treat tests anthropomorphically. Now everyone knows, well, almost everyone, that tests really don't talk; hence, when I refer to what "a test says it's measuring," this represents literary license. I have, however, known some colleagues who, after working with educational tests for many years, assert with substantial conviction that many educational tests can indeed speak, but only when the moon is full.
Before we can discern whether a test's description of
measured behavior is accurately representing what the test is
measuring, we must have the description itself. Because there are no
hard and fast rules (or even soft and slow ones) regarding how test
developers should describe their wares, there is enormous variability
in the nature of the test descriptions that accompany educational
assessment devices.
Norm-Referenced Measures
As a rule, since norm-referenced tests rely more heavily on
percentile comparisons as the chief mechanism for interpreting one's
test score, the descriptive language accompanying such tests is not
intensely delimiting in its nature. More often than not, a somewhat general statement will be made about what it is that the test measures. In some cases these general statements will be backed up by a list of topics (content) that the test covers or a list of fairly general skills that the test attempts to tap. In some manuals for norm-referenced tests, a more detailed breakdown of the number of items per topic or skill is provided.
Indeed, as one surveys the range of descriptive literature
available with norm-referenced tests, it soon becomes apparent that the
degree of descriptive detail accompanying these tests can range from
an extremely terse description, perhaps little more than a thumbnail
sketch, to a fairly elaborate and detailed effort to let test users know
what it is that the test's developers hoped to measure.
Criterion-Referenced Measures
In view of their relatively recent arrival on the educational scene, it is not surprising that the quality of the descriptions accompanying criterion-referenced tests is even more variable. In a sense this is somewhat discouraging, since the major contribution that a good criterion-referenced test can make is its more sharply delimited description of what it is that examinees can or can't do. Yet, as matters currently stand, many creators of criterion-referenced tests attempt to get by with descriptions of measured behavior no more detailed, that is, no more limiting, than those of their norm-referenced counterparts.
A good description of measured behavior for a criterion-referenced test, as we shall see in Chapter 9, may be several pages in length, rather than the sentence or two sometimes associated with behavioral objectives. The essence of an effective set of descriptive information for a well-constructed criterion-referenced test is that it communicates, with as little ambiguity as possible, what it is that the test is supposedly measuring. Because we must rely, for the most part, on words as we attempt to explain what it is that the test measures, our efforts will sometimes fall short of our expectations. Words, even well-woven words, sometimes fail to capture the complexity of behaviors that educators are attempting to assess. By and large, however, a criterion-referenced test's description of measured behavior will tend to be more detailed than is a norm-referenced test's description.
Now that we've stressed the importance of appraising a test's description of the behavior it's attempting to measure, what does one
do about it? In other words, how does a person evaluating an
educational test decide whether a test's description is sufficiently
explicit to be acceptable?
As always, the answer to this question hinges directly on the purpose to which the educational test is to be put. Tests to be used for examinee selection in fixed-quota settings, situations for which norm-referenced measures are well suited, need not contain inordinately explicit descriptions of measured behavior. Tests to be used for instructional design or program evaluation should contain really tight descriptions of what the tests are measuring.
Whether a test's description of measured behavior is sufficiently precise for the measurement task at hand thus depends on the purpose for which the test is being used. As a test evaluator, your job will be to consult the test's description of the behavior it's ostensibly measuring, then judge whether, for the intended purpose, that description is sufficiently explicit.
Here, for example, is a description of a social studies subtest
in a
widely used, commercially published norm-referenced
achievement test:
Test I contains 60 four-option multiple-choice items at each grade level. The items were selected to reflect the most current content and objectives of the social studies curriculum at each grade. The test items include items on American history, world history, government, geography, economics, sociology, and study skills. Some items test knowledge of fact; many test ability to use factual knowledge of social studies in unfamiliar situations.[2]
[2] Tests of Academic Progress, Form S, Manual for Administrators, Supervisors, and Counselors (Boston: Houghton Mifflin, 1972), p. 11.
Now suppose you were a teacher who was attempting to
design a relevant instructional sequence to improve students' scores
on this test. Would the previous description of what the test measures
give you a decent idea of what you should be striving to accomplish?
Well, no: unless you possessed the gift of prophecy, the test's description of measured behavior offers you scant guidance for instructional design. Different social studies teachers would read the preceding paragraph and come up with substantially different interpretations. For purposes of aiding instructional designers, this test's description of measured behavior falls way short of the mark.
Yet, for other purposes, the test's fairly general description
may be
acceptable. Suppose as a school administrator you were only
trying to get a very general fix on what your students knew about
social studies. Under such circumstances, the fairly ink-blotty
description may prove sufficient.
The best way to determine whether the test's description of
measured behavior is adequate is to have different individuals, those
who may use the test for the purpose at hand, read the description, then independently write out what they think is really being measured, for example, what the test items will be like. If these independently prepared interpretations coincide fairly well, for this purpose the test's description of measured behavior is suitable. If the
interpretations are markedly different, the test's description of
measured behavior is not clear enough.
EVALUATIVE FACTOR TWO: ITEMS PER MEASURED
BEHAVIOR
A second factor to consider when evaluating an educational
test is the number of items being employed to assess the behavior, or
behaviors, being measured by the test. To make this point, perhaps
ludicrously, let's imagine that a sales representative from a commercial
testing firm tries to persuade you to buy the firm's brand new test of
political science knowledge. The sales rep might extol with rapture
the new test's "up-to-date content, its subtle diagnostic potential, the
first-rate consultants who participated in its creation, and its remarkably
low cost." When you ask how many total items are in this new
instrument of assessment wizardry, the sales rep replies, "Only two;
but they're damned good ones!"
It is apparent, of course, that only a handful of test items will not permit educators to get an accurate fix on what it is that students can or can't do. There's just too much chance that results on a test with so few items will be nothing more than a fluke. Students' examination performance is too apt to be due to chance, rather than their true capabilities (attitudes, and so on).
Tests, Subtests, and Sub-Sub-Subtests
Before proceeding to a consideration of the tricky question "How many test items are sufficient?" it's necessary to take a brief side excursion dealing with the number of behaviors that a test sets out to measure. Let's think of a typical norm-referenced achievement test which attempts to assess a primary child's "reading comprehension ability." Although there might be a number of different kinds of items on the test, with those differences reflecting fairly distinctive sorts of comprehension subskills, the test yields a total score that is a composite index of a student's reading comprehension ability. Although the test's developers intend it to assess a medley of varied comprehension subskills, there is no attempt to provide separate indices of those subskills. Thus, if the test consists of 100 items, all of those items can be counted as measuring that test's single measured behavior, namely, reading comprehension ability.
Other tests, however, attempt to measure a series of distinctive subskills in such a way as to yield separate scores for each subskill. For example, suppose a test of basic mathematical skills has 80 items, 20 each on the four fundamental calculation skills (addition, subtraction, multiplication, and division). Each of these 20-item sets yields a separate score for each of the four calculation skills. In other words, an examinee earns a separate score for addition skill, another for subtraction skill, and so on. Whether or not there is a total score reported for the examinee's performance on the 80 items (sometimes there is and sometimes there isn't), the test is surely attempting to measure each of the four calculation skills with 20 items per skill. Hence, for that test there would be 20 items for the smallest behavior the test purports to measure.
So far so good, but let's now consider a test that allegedly measures 55 separate decoding skills in reading with 55 items. For readers with less than fundamental mathematical prowess, it can be noted that the test is supposedly measuring each decoding skill (separately listed in the test's manual) with one item per skill. One item per skill, most would agree, is pretty skimpy. Since the test permits the reporting of results on a skill-by-skill basis, there's no doubt that the test's publishers are really trying to peddle it as 55 separate tests. And one-item tests at that!
You can see that when you evaluate a test to discern whether it provides a sufficient number of items per measured behavior, you must isolate the smallest units of behavior which the test supposedly measures, then see how many items the test provides for measuring those units. In the preceding paragraphs we've seen examples of tests containing 100 items, 20 items, and one item per measured behavior.
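Stated as a procedure, this counting rule is mechanical enough to sketch in a few lines of code. The following Python fragment is purely illustrative; the three hypothetical tests and their score structures simply restate the examples from the preceding paragraphs as data.
```python
# Each test is described by the separately reported scores it yields
# and the number of items feeding each score.
tests = {
    "reading comprehension (composite score only)": {"total": 100},
    "basic math skills (four subscores)": {
        "addition": 20, "subtraction": 20, "multiplication": 20, "division": 20,
    },
    "decoding skills (skill-by-skill reporting)": {
        f"skill {i}": 1 for i in range(1, 56)  # 55 skills, one item apiece
    },
}

for name, scores in tests.items():
    # The evaluator's question: how many items stand behind the
    # smallest separately reported behavior?
    fewest = min(scores.values())
    print(f"{name}: {fewest} item(s) per measured behavior")
```
Running the loop prints 100, 20, and 1 item(s) respectively, which is exactly the spread the three examples were chosen to dramatize.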
What about tests that not only report subscores but also
provide a
composite, total score? For example, if our 55-item test of
decoding skills provides a total "overall decoding skill" score, for that
measured behavior the test would possess 55 items. In such instances,
therefore, one might conclude that there were enough items when the
test is used as a composite instrument but not when the test is viewed
as 55 separate subtests.
As always, the purpose for which the test will be used turns out
to be the chief determiner of how many items are needed. Let's suppose
we want to use our 55-item decoding skills test to gain an overall estimate
of whether a given set of textbooks helps youngsters acquire the
decoding skills of
reading. For such a purpose, with no need to move to a more
precise level
of reporting, the 55-item test would appear to have enough
items.
On the other hand, if we wished to use the test for some
sort of diagnostic purpose, for instance, to get an idea of the particular
decoding skills that pupils did or did not possess, then one item per skill
is too paltry. It is absurd to contend that with a single item (or even a
married one) we could detect whether a child possesses a specific
decoding skill.
For this reason, of course, we see that the purpose for which the
test is to be used must be clearly understood prior to engaging in any
sensible sort of test evaluation. In educational testing, as in other
realms of human endeavor, form should follow function.
The Number of Items
One would think, given so much messing around with educational measurement over the decades, that by now we would have available a delightfully succinct and satisfyingly solid answer to the question: "How many test items are really necessary?" Distressingly, there is no simple answer to that important question beyond "It depends."
In the case of deciding how many test items per measured behavior are needed, it is fairly easy to demonstrate why it is that the requisite number of test items depends on the purpose for which the test's results are to be used. Let's dabble with a couple of common situations to illustrate that point.
Imagine that you were using a written test to certify whether
clinical
psychologists knew their stuff, that is, knew it well enough to
allow them to practice on a neurotic world. Since the stakes are quite
high, you would want plenty of items on the test, maybe 100 or so, to
make sure that the actual clinical skill of the examinees is well sampled.
To give those clinical psychology candidates a four- or five-item exam,
then expect to have any confidence in the exam's results, would be
patently silly (or a bit psychotic).
However, let's say you're a classroom teacher of first-grade
pupils, and you want to find out which of the youngsters need an
extra hour's worth of instruction on how to pronounce consonant-
vowel-consonant words. Your plan is to assemble a subgroup of those children and give them an extra 60-minute practice session. In this instance, when the stakes are decisively lower than in the clinical psychologist example, and where today's errors in measurement can be rectified during tomorrow's reading instruction, you can undoubtedly get by with a test that has only a few items, as few as two or three, to provide you with rough estimates of student skill.
These two examples should, hopefully, highlight one key
question that must be considered when deciding how many test items
per measured behavior are needed, namely, how important is the
decision at hand? High-stakes, difficult-to-reverse decisions require
more items than low-stakes, easy-to-reverse decisions.
There are other considerations involved in deciding upon test
length,
and these include the level of proficiency required of examinees in relationship to their expected level of functioning. That sounds fairly complex, and it is. Let's analyze it. Suppose you were using an exam for a placement decision and had determined that only those examinees who earned a score of 80 percent correct or better would be selected. Now if most of the examinees are actually functioning around a 20 percent correct score, it will be the exceptional examinee who scores at the required (80 percent) level. An exam performance at that level will be a rare event and not apt to be attributable to chance. For example, if you had five fill-in-the-blank, short-answer test items, most students would only be getting one out of five items correct, and those getting four or five (out of five) correct would display an exceptional (and believable) performance. You'd have confidence that such high-scoring examinees really were knowledgeable.
However, let's imagine that with the same required performance level, that is, 80 percent correct, we expected that most students would actually be performing around 75 percent to 85 percent correct.[3] In that setting an examinee's performance might be expected to fluctuate around the 80 percent cut-off level more often, merely as a function of the imprecision of our measurement device. Consequently, we would want more items per measured behavior in order to increase our confidence that those who topped our required 80 percent level were doing so because of true skill, rather than chance.
[3] We'll treat the establishment of such performance levels in Chapter 16, so be patient.
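It may help to put numbers on the two situations just described. Here is a minimal sketch in Python, using only the standard library; the five-item test, the 80 percent (four-of-five) passing level, and the true proficiency values of 20 and 80 percent come from the examples above, while the binomial model itself (independent items, constant per-item success rate) is an assumption of the illustration, not a claim about any particular test.
```python
from math import ceil, comb

def prob_pass(n_items, p_true, cut_fraction=0.80):
    """Chance that an examinee with true per-item success rate p_true
    reaches the cut score, under a simple binomial model."""
    cut = ceil(cut_fraction * n_items)  # items needed to pass
    return sum(comb(n_items, k) * p_true**k * (1 - p_true)**(n_items - k)
               for k in range(cut, n_items + 1))

# Five items, 80 percent cut (i.e., 4 of 5 correct):
print(prob_pass(5, 0.20))  # low-skill group: ~0.0067; a pass is rare, hence believable
print(prob_pass(5, 0.80))  # group near the cut: ~0.74, so roughly one in four true masters fails
```
With only five items, an examinee whose true skill sits right at the 80 percent level fails roughly a quarter of the time purely by chance; lengthening the test shrinks that fluctuation, which is why more items are needed when the passing standard falls near the group's typical performance.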
I am not arguing that the passing standard should be set at a point substantially different from examinees' average performance merely to allow us to use fewer items. However, if a standard is set at a point which turns out to be near the group's typical performance, more items are certainly requisite.
Once more we see that the educational measurement world is far more complex than it ought to be. The decision as to how many items per measured behavior constitutes an acceptable minimum is much more difficult than most people imagine. Although some psychometricians (a fancy term to describe those folks who measure psychologically or educationally relevant variables) have addressed this problem, so far we do not have any tidy answers to the question: "How long should a test be?"
Promising approaches. Two of the more encouraging strategies for dealing with the test-length issue have been proposed by Millman[4] and by Novick and Lewis.[5] Millman has prepared a series of tables which, for specified passing score percentages, permit us to identify how many students would be misclassified if tests of different lengths were used. Millman's approach is based on the binomial probabilities associated with an examinee's likelihood of passing or failing each item in a behavioral domain of test items. It is easy to use, although it fails to take account of certain factors involved in the approach recommended by Novick and his colleagues.
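The flavor of Millman's tables can be suggested with a short computation. The Python sketch below is not a reproduction of his tables; it merely applies the same binomial reasoning to estimate, for several test lengths, the chance that an examinee with a given true domain score is misclassified by an 80 percent passing standard. The particular domain scores shown (.70 and .90) are illustrative choices, not values drawn from Millman's article.
```python
from math import ceil, comb

def misclassification_rate(n_items, p_true, cut_fraction=0.80):
    """Chance of classifying an examinee incorrectly under a binomial
    test model: a true nonmaster (p_true below the cut) who passes,
    or a true master (p_true at or above the cut) who fails."""
    cut = ceil(cut_fraction * n_items)  # items needed to pass
    p_pass = sum(comb(n_items, k) * p_true**k * (1 - p_true)**(n_items - k)
                 for k in range(cut, n_items + 1))
    return p_pass if p_true < cut_fraction else 1.0 - p_pass

for n in (5, 10, 20, 40):
    print(n,
          round(misclassification_rate(n, 0.70), 3),  # nonmaster slipping past
          round(misclassification_rate(n, 0.90), 3))  # master tripping up
```
The printed rates fall as the test lengthens, which is the basic trade-off such tables let a test evaluator inspect directly.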
Novick's solution strategy employs a Bayesian statistical model. Bayesian statistical approaches, to oversimplify outrageously, allow one to pick up greater analytic precision with less actual data by making experience-based assumptions regarding the nature of such variables as examinees' current level of skill. The Bayesian solutions to the test-length problem take into consideration several subtle elements, including how much more serious an error it is to misclassify false-positives (examinees who appear to possess a skill but really don't) than to misclassify false-negatives (examinees who appear to lack a skill they really possess). The estimated relative gravity of these two sorts of misclassification errors is referred to as the loss ratio and is built into a Bayesian formula which helps us decide how many test items are needed.
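For readers curious how a prior belief and a loss ratio can enter such a decision, here is a minimal sketch assuming a beta prior on the examinee's domain score and a simple threshold-loss rule. It illustrates the general Bayesian idea rather than the actual Novick-Lewis procedure, and the prior, the mastery cut-off, and the loss ratio below are invented for the example.
```python
from scipy.stats import beta

def classify_as_master(correct, n_items, prior_a=1.0, prior_b=1.0,
                       mastery_cut=0.80, loss_ratio=2.0):
    """Beta-binomial sketch of a threshold-loss mastery decision.
    loss_ratio = (cost of a false-positive) / (cost of a false-negative)."""
    post_a = prior_a + correct              # posterior after 'correct' of n items
    post_b = prior_b + (n_items - correct)
    p_master = beta.sf(mastery_cut, post_a, post_b)  # P(domain score > cut)
    # Expected loss of declaring mastery: loss_ratio * P(not a master);
    # of declaring nonmastery: P(master). Make the cheaper declaration.
    return loss_ratio * (1.0 - p_master) < p_master

# An examinee answering 9 of 10 items, under a uniform Beta(1, 1) prior:
print(classify_as_master(9, 10))  # -> True, though only narrowly
```
With fewer items the posterior stays more diffuse, so a higher proportion correct is required before the rule will declare mastery; that is, informally, how the Bayesian machinery turns the gravity of the two errors into a required test length.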
In general the binomial model requires somewhat longer tests
than
the Bayesian model. The Bayesian approach, however, is more
complicated
[4] J. Millman, "Passing Scores and Test Lengths for Domain-Referenced Measures," Review of Educational Research, 43 (1973), 205-216.
[5] M. R. Novick and C. Lewis, "Prescribing Test Length for Criterion-Referenced Measurement," in Problems in Criterion-Referenced Measurement, ed. C. W. Harris, M. C. Alkin, and W. J. Popham, CSE Monograph Series in Evaluation, no. 3 (Los Angeles: UCLA Center for the Study of Evaluation, 1974).