There are dozens of dimensions on which almost any
commodity can be judged. When people buy a new automobile, for
example, they may attend to such factors as style, safety, gasoline
mileage, and so on. In the same way, when educators attempt to judge
the appropriateness of a testing device, they can focus on all sorts of
considerations. Some of these considerations, naturally, are more
defensible than others. In this chapter we consider six factors that are
pivotal in deciding whether or not an educational test is suitable.
Because the last three of these factors to be described are treated
at greater length in separate chapters, they are only introduced
at this time.
It must be noted at the outset of this discussion that the six factors to be discussed in this chapter do not, obviously, exhaust the range of considerations that prospective test users will wish to apply when evaluating educational tests. We will not, for example, do much more than mention the cost of a test. Yet, quite obviously, if you're a school principal who needs 100 tests and you have only $200 to make a purchase, it becomes significant whether the tests cost more than $2 each.
There are other things to consider in evaluating educational
tests, such as the ease with which they can be administered and
scored. All of
these practical considerations are genuinely important.
However, in attempting to cover all the possible points to review when evaluating a test, the reader would soon be overwhelmed. This is another instance where more would surely turn out to be less.
As a consequence, there will be a deliberate effort to focus your attention on only six significant factors to employ while judging the quality of an educational test. If you become truly conversant with these half dozen dimensions and apply them more or less systematically when scrutinizing educational tests, you should be able to do a decisively better job in distinguishing between winning and losing tests.
Surely, choices can be made among educational tests. In a very common case, for example, teachers at a particular school (or officials for an entire school district) will find themselves obliged to select a test for determining the effectiveness of a language arts program. Obviously, because a number of different commercial test publishers have language arts tests to peddle, a choice must be made among them. Whether that choice will be made wisely or whimsically depends in large measure on the sophistication that the test choosers bring to their task.
Although the act of choosing among commercially published tests presents a fairly common and straightforward instance where one's test evaluation skills should be sharp, there are numerous other occasions when educators will need expertise in deciding whether a test is laudable or laughable. To illustrate, in recent years a good many large and middle-size school districts have, sometimes rather energetically, gone into the test development business. Most often these tests are intended exclusively for use in the local district, although there are some cases in which a school district has created a test for local use but has subsequently turned the test over to a commercial publishing house that distributes the test nationally, with royalties going back to the district. The Los Angeles Unified School District, for example, developed a test of high school students' reading skills in the late seventies, then turned that test over for distribution purposes to a commercial publisher, CTB/McGraw-Hill.
These school district test development efforts often result in the production of first-rate tests. In other cases, because district personnel lack either the test development expertise or the financial resources to do the job properly, the resulting tests make better scratch paper than assessment devices. The trick is to figure out which tests are which. An educational test should not be assumed to be a decent assessment tool merely because it is produced by local test developers. After all, even some of mother's homemade apple pies may have been baked too long.
Teachers and administrators who are supposed to employ
these district-developed assessment tools have a professional
responsibility to judge whether the tests are good ones. To make these
kinds of judgments sensibly requires educators to bring a set of
defensible standards to the test evaluation enterprise. If a test is found
wanting, it is the responsibility of teachers and administrators to call
those deficits to the attention of the district's officials who, often
because of parental pride, fail to see the flaws in their locally sired
assessment devices.
At an even more personal level, many teachers create their own
tests. Some of these tests are, of necessity, churned out in a hurry to
meet the requirements of an end-of-unit exam or a once-a-week quiz
scheme. Such tests, as any harassed teacher will tell you, are not
intended to serve as inspired examples of assessment at its best.
However, there are other tests that classroom teachers develop
which reflect substantially more work. Some of these tests are
employed to make significant decisions about students. For instance,
an end-of-course examination in many cases may help the teacher
decide whether a student gets an A, B, C, D, or F. In some classes a
course grade can be the deciding factor in whether a student moves to
the next class or grade level. Oh yes, the test-based stakes in many
classes are truly high ones.
Now in such settings it should be apparent that considerable
attention
should be given by teachers to the quality of their
examinations. Teachers who possess the skills necessary to detect flaws
in their own tests can then set about rectifying those defects. Teachers
who do not possess the skills needed to evaluate tests will, in all
probability, continue to employ flawed assessment tools. As a consequence, they will often make inappropriate decisions about the students who must suffer through those tests.
In these and other instances, it should be clear that the educator
who knows how to evaluate testing instruments is going to be a more
effective professional than one who is intimidated by such tests. Let's
turn, therefore, to a consideration of six factors which will elevate the
reader from the status of Test Evaluation Tyro to that of Test
Evaluation Tiger.
EVALUATIVE FACTOR ONE: DESCRIPTION OF MEASURED
BEHAVIOR
The initial factor to consider in appraising an educational test stems from the reason that such tests were created in the first place, namely, to allow us to draw inferences about an examinee's status. That examinee's status, of course, is being assessed with respect to a particular focus. We obviously can't measure everything about examinees at one time. No, we attempt to assess a limited sphere of the examinee's skill, aptitude, and the like. It is imperative, therefore, that an educational test be accompanied by a description of just what it is that the test is attempting to measure.
A description of measured behavior is available for every commercially published test and, almost without exception, for district-developed educational tests as well. With most teacher-made tests designed for classroom use, the description of learner behavior usually bounces around in teachers' heads rather than on paper, although some teachers do take the trouble to explicate in writing what it is that their major examinations are supposedly measuring.
A description of measured behavior can range in length and detail from a terse, almost telegramlike behavioral objective (that is supposedly measured by the test) to a detailed, highly elaborate set of test specifications, setting forth the permissible operations that can be used to generate test items. In all cases, although certain of these descriptions are obviously more communicative than others, the intention of the description is to convey to the user of the test a notion of what it is that the test is allegedly measuring.
The attentive reader will have noted the use of the term allegedly in the previous sentence. As we shall see later in this chapter, not every test that is called a cow by its creators can, in fact, give milk. For example, some tests that are described as "measures of higher-order problem solving" turn out to be little more than tests of one's vocabulary and short-term memory skills. The match between what a test says it's measuring and what the test actually measures constitutes a key aspect of the test's validity. Validity of a test will be the fifth factor we consider in this chapter.[1]
[1] The reader will, hopefully, not be too distressed if I occasionally treat tests anthropomorphically. Now everyone knows, well, almost everyone, that tests really don't talk; hence, when I refer to what "a test says it's measuring," this represents literary license. I have, however, known some colleagues who, after working with educational tests for many years, assert with substantial conviction that many educational tests can indeed speak, but only when the moon is full.
Before we can discern whether a test's description of
measured behavior is accurately representing what the test is
measuring, we must have the description itself. Because there are no
hard and fast rules (or even soft and slow ones) regarding how test
developers should describe their wares, there is enormous variability
in the nature of the test descriptions that accompany educational
assessment devices.
Norm-Referenced Measures
As a rule, since norm-referenced tests rely more heavily on
percentile comparisons as the chief mechanism for interpreting one's
test score, the descriptive language accompanying such tests is not
intensely delimiting in its nature. More often than not, a somewhat general statement will be made about what it is that the test measures. In some cases these general statements will be backed up by a list of topics (content) that the test covers or a list of fairly general skills that the test attempts to tap. In some manuals for norm-referenced tests, a more detailed breakdown of the number of items per topic or skill is provided.
Indeed, as one surveys the range of descriptive literature
available with norm-referenced tests, it soon becomes apparent that the
degree of descriptive detail accompanying these tests can range from
an extremely terse description, perhaps little more than a thumbnail
sketch, to a fairly elaborate and detailed effort to let test users know
what it is that the test's developers hoped to measure.
Criterion-Referenced Measures
In view of their relatively recent arrival on the educational scene, it is not surprising that the quality of the descriptions accompanying criterion-referenced tests is even more variable. In a sense this is somewhat discouraging, since the major contribution that a good criterion-referenced test can make is its more sharply delimited description of what it is that examinees can or can't do. Yet, as matters currently stand, many creators of criterion-referenced tests attempt to get by with descriptions of measured behavior no more detailed, that is, no more limiting, than those of their norm-referenced counterparts.
A good description of measured behavior for a criterion-referenced test, as we shall see in Chapter 9, may be several pages in length, rather than the sentence or two sometimes associated with behavioral objectives. The essence of an effective set of descriptive information for a well-constructed criterion-referenced test is that it communicates, with as little ambiguity as possible, what it is that the test is supposedly measuring. Because we must rely, for the most part, on words as we attempt to explain what it is that the test measures, our efforts will sometimes fall short of our expectations. Words, even well-woven words, sometimes fail to capture the complexity of behaviors that educators are attempting to assess. By and large, however, a criterion-referenced test's description of measured behavior will tend to be more detailed than is a norm-referenced test's description.
Now that we've stressed the importance of appraising a test's description of the behavior it's attempting to measure, what does one
do about it? In other words, how does a person evaluating an
educational test decide whether a test's description is sufficiently
explicit to be acceptable?
As always, the answer to this question hinges directly on the purpose to which the educational test is to be put. Tests to be used for examinee selection in fixed-quota settings, situations for which norm-referenced measures are well suited, need not contain inordinately explicit descriptions of measured behavior. Tests to be used for instructional design or program evaluation should contain really tight descriptions of what the tests are measuring.
Whether a test's description of measured behavior is sufficiently precise for the measurement task at hand thus depends on the purpose for which the test is being used. As a test evaluator, your job will be to consult the test's description of the behavior it's ostensibly measuring, then judge whether, for the intended purpose, that description is sufficiently explicit.
Here, for example, is a description of a social studies subtest
in a
widely used, commercially published norm-referenced
achievement test:
Test I contains 60 four-option multiple-choice items at each grade level. The items were selected to reflect the most current content and objectives of the social studies curriculum at each grade. The test items include items on American history, world history, government, geography, economics, sociology, and study skills. Some items test knowledge of fact; many test ability to use factual knowledge of social studies in unfamiliar situations.[2]
[2] Tests of Academic Progress, Form S, Manual for Administrators, Supervisors, and Counselors (Boston: Houghton Mifflin, 1972), p. 11.
Now suppose you were a teacher who was attempting to
design a relevant instructional sequence to improve students' scores
on this test. Would the previous description of what the test measures
give you a decent idea of what you should be striving to accomplish?
Well, no: unless you possessed the gift of prophecy, the test's description of measured behavior offers you scant guidance for instructional design. Different social studies teachers would read the preceding paragraph and come up with substantially different interpretations. For purposes of aiding instructional designers, this test's description of measured behavior falls way short of the mark.
Yet, for other purposes, the test's fairly general description
may be
acceptable. Suppose as a school administrator you were only
trying to get a very general fix on what your students knew about
social studies. Under such circumstances, the fairly ink-blotty
description may prove sufficient.
The best way to determine whether the test's description of
measured behavior is adequate is to have different individuals, those
who may use the test for the purpose at hand, read the description, then independently write out what they think is really being measured, for example, what the test items will be like. If these independently prepared interpretations coincide fairly well, for this purpose the test's description of measured behavior is suitable. If the
interpretations are markedly different, the test's description of
measured behavior is not clear enough.
EVALUATIVE FACTOR TWO: ITEMS PER MEASURED
BEHAVIOR
A second factor to consider when evaluating an educational
test is the number of items being employed to assess the behavior, or
behaviors, being measured by the test. To make this point, perhaps
ludicrously, let's imagine that a sales representative from a commercial
testing firm tries to persuade you to buy the firm's brand new test of
political science knowledge. The sales rep might extol with rapture
the new test's "up-to-date content, its subtle diagnostic potential, the
first-rate consultants who participated in its creation, and its remarkably
low cost." When you ask how many total items are in this new
instrument of assessment wizardry, the sales rep replies, "Only two;
but they're damned good ones!"
It is apparent, of course, that only a handful of test items will not permit educators to get an accurate fix on what it is that students can or can't do. There's just too much chance that results on a test with so few items will be nothing more than a fluke. Students' examination performance is too apt to be due to chance, rather than their true capabilities (attitudes, and so on).
Tests, Subtests, and Sub-Sub-Subtests
Before proceeding to a consideration of the tricky question "How many test items are sufficient?" it's necessary to take a brief side excursion dealing with the number of behaviors that a test sets out to measure. Let's think of a typical norm-referenced achievement test which attempts to assess a primary child's "reading comprehension ability." Although there might be a number of different kinds of items on the test, with those differences reflecting fairly distinctive sorts of comprehension subskills, the test yields a total score that is a composite index of a student's reading comprehension ability. Although the test's developers intend it to assess a medley of varied comprehension subskills, there is no attempt to provide separate indices of those subskills. Thus, if the test consists of 100 items, all of those items can be counted as measuring that test's single measured behavior, namely, reading comprehension ability.
Other tests, however, attempt to measure a series of distinctive subskills in such a way as to yield separate scores for each subskill. For example, suppose a test of basic mathematical skills has 80 items, 20 each on the four fundamental calculation skills (addition, subtraction, multiplication, and division). Each of these 20-item sets yields a separate score for each of the four calculation skills. In other words, an examinee earns a separate score for addition skill, another for subtraction skill, and so on. Whether or not there is a total score reported for the examinee's performance on the 80 items (sometimes there is and sometimes there isn't), the test is surely attempting to measure each of the four calculation skills with 20 items per skill. Hence, for that test there would be 20 items for the smallest behavior the test purports to measure.
So far so good, but let's now consider a test that allegedly measures 55 separate decoding skills in reading with 55 items. For readers with less than fundamental mathematical prowess, it can be noted that the test is supposedly measuring each decoding skill (separately listed in the test's manual) with one item per skill. One item per skill, most would agree, is pretty skimpy. Since the test permits the reporting of results on a skill-by-skill basis, there's no doubt that the test's publishers are really trying to peddle it as 55 separate tests. And one-item tests at that!
You can see that when you evaluate a test to discern whether it provides a sufficient number of items per measured behavior, you must isolate the smallest units of behavior which the test supposedly measures, then see how many items the test provides for measuring those units. In the preceding paragraphs we've seen examples of tests containing 100 items, 20 items, and one item per measured behavior.
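Stated as a procedure, this counting rule is mechanical enough to sketch in a few lines of code. The following Python fragment is purely illustrative; the three hypothetical tests and their score structures simply restate the examples from the preceding paragraphs as data.
```python
# Each test is described by the separately reported scores it yields
# and the number of items feeding each score.
tests = {
    "reading comprehension (composite score only)": {"total": 100},
    "basic math skills (four subscores)": {
        "addition": 20, "subtraction": 20, "multiplication": 20, "division": 20,
    },
    "decoding skills (skill-by-skill reporting)": {
        f"skill {i}": 1 for i in range(1, 56)  # 55 skills, one item apiece
    },
}

for name, scores in tests.items():
    # The evaluator's question: how many items stand behind the
    # smallest separately reported behavior?
    fewest = min(scores.values())
    print(f"{name}: {fewest} item(s) per measured behavior")
```
Running the loop prints 100, 20, and 1 item(s) respectively, which is exactly the spread the three examples were chosen to dramatize.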
What about tests that not only report subscores but also
provide a
composite, total score? For example, if our 55-item test of
decoding skills provides a total "overall decoding skill" score, for that
measured behavior the test would possess 55 items. In such instances,
therefore, one might conclude that there were enough items when the
test is used as a composite instrument but not when the test is viewed
as 55 separate subtests.
As always, the purpose for which the test will be used turns out
to be the chief determiner of how many items are needed. Let's suppose
we want to use our 55-item decoding skills test to gain an overall estimate
of whether a given set of textbooks helps youngsters acquire the
decoding skills of
reading. For such a purpose, with no need to move to a more
precise level
of reporting, the 55-item test would appear to have enough
items.
On the other hand, if we wished to use the test for some
sort of diagnostic purpose, for instance, to get an idea of the particular
decoding skills that pupils did or did not possess, then one item per skill
is too paltry. It is absurd to contend that with a single item (or even a
married one) we could detect whether a child possesses a specific
decoding skill.
For this reason, of course, we see that the purpose for which the
test is to be used must be clearly understood prior to engaging in any
sensible sort of test evaluation. In educational testing, as in other
realms of human endeavor, form should follow function.
The Number of Items
One would think, given so much messing around with educational measurement over the decades, that by now we would have available a delightfully succinct and satisfyingly solid answer to the question: "How many test items are really necessary?" Distressingly, there is no simple answer to that important question beyond "It depends."
In the case of deciding how many test items per measured behavior are needed, it is fairly easy to demonstrate why it is that the requisite number of test items depends on the purpose for which the test's results are to be used. Let's dabble with a couple of common situations to illustrate that point.
Imagine that you were using a written test to certify whether
clinical
psychologists knew their stuff, that is, knew it well enough to
allow them to practice on a neurotic world. Since the stakes are quite
high, you would want plenty of items on the test, maybe 100 or so, to
make sure that the actual clinical skill of the examinees is well sampled.
To give those clinical psychology candidates a four- or five-item exam,
then expect to have any confidence in the exam's results, would be
patently silly (or a bit psychotic).
However, let's say you're a classroom teacher of first-grade
pupils, and you want to find out which of the youngsters need an
extra hour's worth of instruction on how to pronounce consonant-
vowel-consonant words. Your plan is to assemble a subgroup of those children and give them an extra 60-minute practice session. In this instance, when the stakes are decisively lower than in the clinical psychologist example, and where today's errors in measurement can be rectified during tomorrow's reading instruction, you can undoubtedly get by with a test that has only a few items, as few as two or three, to provide you with rough estimates of student skill.
These two examples should, hopefully, highlight one key
question that must be considered when deciding how many test items
per measured behavior are needed, namely, how important is the
decision at hand? High-stakes, difficult-to-reverse decisions require
more items than low-stakes, easy-to-reverse decisions.
There are other considerations involved in deciding upon test
length,
and these include the level of proficiency required of examinees in relationship to their expected level of functioning. That sounds fairly complex, and it is. Let's analyze it. Suppose you were using an exam for a placement decision and had determined that only those examinees who earned a score of 80 percent correct or better would be selected. Now if most of the examinees are actually functioning around a 20 percent correct score, it will be the exceptional examinee who scores at the required (80 percent) level. An exam performance at that level will be a rare event and not apt to be attributable to chance. For example, if you had five fill-in-the-blank, short-answer test items, most students would only be getting one out of five items correct, and those getting four or five (out of five) correct would display an exceptional (and believable) performance. You'd have confidence that such high-scoring examinees really were knowledgeable.
However, let's imagine that with the same required performance level, that is, 80 percent correct, we expected that most students would actually be performing around 75 percent to 85 percent correct.[3] In that setting an examinee's performance might be expected to fluctuate around the 80 percent cut-off level more often, merely as a function of the imprecision of our measurement device. Consequently, we would want more items per measured behavior in order to increase our confidence that those who topped our required 80 percent level were doing so because of true skill, rather than chance.
[3] We'll treat the establishment of such performance levels in Chapter 16, so be patient.
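It may help to put numbers on the two situations just described. Here is a minimal sketch in Python, using only the standard library; the five-item test, the 80 percent (four-of-five) passing level, and the true proficiency values of 20 and 80 percent come from the examples above, while the binomial model itself (independent items, constant per-item success rate) is an assumption of the illustration, not a claim about any particular test.
```python
from math import ceil, comb

def prob_pass(n_items, p_true, cut_fraction=0.80):
    """Chance that an examinee with true per-item success rate p_true
    reaches the cut score, under a simple binomial model."""
    cut = ceil(cut_fraction * n_items)  # items needed to pass
    return sum(comb(n_items, k) * p_true**k * (1 - p_true)**(n_items - k)
               for k in range(cut, n_items + 1))

# Five items, 80 percent cut (i.e., 4 of 5 correct):
print(prob_pass(5, 0.20))  # low-skill group: ~0.0067; a pass is rare, hence believable
print(prob_pass(5, 0.80))  # group near the cut: ~0.74, so roughly one in four true masters fails
```
With only five items, an examinee whose true skill sits right at the 80 percent level fails roughly a quarter of the time purely by chance; lengthening the test shrinks that fluctuation, which is why more items are needed when the passing standard falls near the group's typical performance.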
I am not arguing that the passing standard should be set at a point substantially different from examinees' average performance merely to allow us to use fewer items. However, if a standard is set at a point which turns out to be near the group's typical performance, more items are certainly requisite.
Once more we see that the educational measurement world is far more complex than it ought to be. The decision as to how many items per measured behavior constitutes an acceptable minimum is much more difficult than most people imagine. Although some psychometricians (a fancy term to describe those folks who measure psychologically or educationally relevant variables) have addressed this problem, so far we do not have any tidy answers to the question: "How long should a test be?"
Promising approaches. Two of the more encouraging strategies for dealing with the test-length issue have been proposed by Millman[4] and by Novick and Lewis.[5] Millman has prepared a series of tables which, for specified passing score percentages, permit us to identify how many students would be misclassified if tests of different lengths were used. Millman's approach is based on the binomial probabilities associated with an examinee's likelihood of passing or failing each item in a behavioral domain of test items. It is easy to use, although it fails to take account of certain factors involved in the approach recommended by Novick and his colleagues.
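The flavor of Millman's tables can be suggested with a short computation. The Python sketch below is not a reproduction of his tables; it merely applies the same binomial reasoning to estimate, for several test lengths, the chance that an examinee with a given true domain score is misclassified by an 80 percent passing standard. The particular domain scores shown (.70 and .90) are illustrative choices, not values drawn from Millman's article.
```python
from math import ceil, comb

def misclassification_rate(n_items, p_true, cut_fraction=0.80):
    """Chance of classifying an examinee incorrectly under a binomial
    test model: a true nonmaster (p_true below the cut) who passes,
    or a true master (p_true at or above the cut) who fails."""
    cut = ceil(cut_fraction * n_items)  # items needed to pass
    p_pass = sum(comb(n_items, k) * p_true**k * (1 - p_true)**(n_items - k)
                 for k in range(cut, n_items + 1))
    return p_pass if p_true < cut_fraction else 1.0 - p_pass

for n in (5, 10, 20, 40):
    print(n,
          round(misclassification_rate(n, 0.70), 3),  # nonmaster slipping past
          round(misclassification_rate(n, 0.90), 3))  # master tripping up
```
The printed rates fall as the test lengthens, which is the basic trade-off such tables let a test evaluator inspect directly.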
Novick's solution strategy employs a Bayesian statistical model. Bayesian statistical approaches, to oversimplify outrageously, allow one to pick up greater analytic precision with less actual data by making experience-based assumptions regarding the nature of such variables as examinees' current level of skill. The Bayesian solutions to the test-length problem take into consideration several subtle elements, including how much more serious an error it is to misclassify false-positives (examinees who appear to possess a skill but really don't) than to misclassify false-negatives (examinees who appear to lack a skill they really possess). The estimated relative gravity of these two sorts of misclassification errors is referred to as the loss ratio and is built into a Bayesian formula which helps us decide how many test items are needed.
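For readers curious how a prior belief and a loss ratio can enter such a decision, here is a minimal sketch assuming a beta prior on the examinee's domain score and a simple threshold-loss rule. It illustrates the general Bayesian idea rather than the actual Novick-Lewis procedure, and the prior, the mastery cut-off, and the loss ratio below are invented for the example.
```python
from scipy.stats import beta

def classify_as_master(correct, n_items, prior_a=1.0, prior_b=1.0,
                       mastery_cut=0.80, loss_ratio=2.0):
    """Beta-binomial sketch of a threshold-loss mastery decision.
    loss_ratio = (cost of a false-positive) / (cost of a false-negative)."""
    post_a = prior_a + correct              # posterior after 'correct' of n items
    post_b = prior_b + (n_items - correct)
    p_master = beta.sf(mastery_cut, post_a, post_b)  # P(domain score > cut)
    # Expected loss of declaring mastery: loss_ratio * P(not a master);
    # of declaring nonmastery: P(master). Make the cheaper declaration.
    return loss_ratio * (1.0 - p_master) < p_master

# An examinee answering 9 of 10 items, under a uniform Beta(1, 1) prior:
print(classify_as_master(9, 10))  # -> True, though only narrowly
```
With fewer items the posterior stays more diffuse, so a higher proportion correct is required before the rule will declare mastery; that is, informally, how the Bayesian machinery turns the gravity of the two errors into a required test length.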
In general the binomial model requires somewhat longer tests
than
the Bayesian model. The Bayesian approach, however, is more
complicated
[4] J. Millman, "Passing Scores and Test Lengths for Domain-Referenced Measures," Review of Educational Research, 43 (1973), 205-216.
[5] M. R. Novick and C. Lewis, "Prescribing Test Length for Criterion-Referenced Measurement," in Problems in Criterion-Referenced Measurement, ed. C. W. Harris, M. C. Alkin, and W. J. Popham, CSE Monograph Series in Evaluation, no. 3 (Los Angeles: UCLA Center for the Study of Evaluation, 1974).