
Journal of Educational Measurement Fall 1996, Vol. 33, No. 3, pp. 271-290

Decision Theory Models for Validating Course Placement Tests

Richard Sawyer
American College Testing

Most American postsecondary institutions have course placement systems for their first-year students. Placement systems typically consist of an assessment component (to estimate students' probability of success in standard first-year courses) and an instructional component (in which academically underprepared students are taught the skills and knowledge they need to succeed in the standard courses). Validity issues related to these functions are discussed in the context of decision theory, and methods are proposed for determining appropriate cutoff scores on placement tests.

A typical and important use of tests in postsecondary education is course placement (i.e., matching students with instruction appropriate to their academic preparation). For example, entering students whose reading or writing skills are insufficient for them to be successful in a standard first-year English literature course might be advised or required to enroll in a college-preparatory reading or writing course. On the other hand, students with an unusually high level of academic preparation might be encouraged to enroll in an accelerated or higher-level course.

Most colleges and universities enroll some students who are not academically prepared to do work at a level traditionally expected of first-year students. The percentage of postsecondary institutions with some form of placement and remedial instruction has steadily increased in the past decade and is now about 90% ("Colleges and Universities," 1994; McNabb, 1990; Woods, 1985; Wright & Cahalan, 1985). One suggested explanation is that American high schools have become less effective in preparing students for college (Carnegie Foundation for the Advancement of Teaching, 1988; National Commission on Excellence in Education, 1983; Singal, 1991). An alternative explanation is that more students from disadvantaged backgrounds are attending college (College Entrance Examination Board, 1977; Munday, 1976).

During the past three decades, several authors have proposed using decision theory to validate educational selection systems. Two different general approaches are those proposed by Cronbach and Gleser (1965) and by Petersen and Novick (1976). Cronbach and Gleser adapted linear regression methodology to estimate the expected costs and benefits of using a test score or other predictor variable for classifying or selecting personnel. Their technique continues to be widely applied in industrial-organizational settings. Petersen and Novick developed a "threshold" model based on Bayesian decision theory. Ben-Shakhar, Kiderman, and Beller (1994) compared these two approaches and illustrated them using data from an admission selection problem.

I thank Mark Houston, Michael Kane, Alan Nicewander, Julie Noble, and two anonymous reviewers for their critiques of earlier drafts.

The principal goal of this article is to apply the threshold model to course placement systems. The key outcomes of placement decisions are described, and models are developed in which utility functions quantify the desirability of the decisions, given the outcomes.

Remedial Instruction

At many postsecondary institutions, there are two levels of first-year courses: a "standard" course in which most first-year students enroll, and a "remedial" course for students who are not academically prepared for the standard course. At some institutions, a lower-level course may be given other names, such as "college-preparatory," "compensatory," "developmental," or "review"; Carriuolo (1994) articulated differences in the meanings of "remedial" and "developmental." At some institutions, there may be courses that require more knowledge and skills than the most basic lower-level courses, but less than the standard courses. In this article, only a single lower-level course is considered, and it is designated "remedial," to be consistent with Willingham's (1974) nomenclature. Note that the designations "standard" and "remedial" are institution specific; that is, one institution's standard course could be another institution's remedial course. Often, remedial courses do not carry credit toward satisfying degree requirements.

Though essential to placement, testing is but one component in a system. To be educationally effective, a remedial placement system should have the following characteristics:

(1) The system accurately identifies students who have a small chance of succeeding in the standard course (underprepared students), as well as those who are likely to succeed.

(2) Appropriate remedial instruction is provided to the underprepared students.

(3) Both the students who originally enrolled in the standard course and the students who were provided remedial instruction eventually do satisfactory work in the standard course.

Note that accurately classifying students (Requirement 1) is necessary, but not sufficient, for a placement system as a whole to be effective. Accurate prediction is not an end but only a mechanism for effectively allocating remedial instruction to students (Requirement 2). On the other hand, providing remedial instruction is intended to achieve the larger goal that students succeed in college (Requirement 3). Even if underprepared students are accurately identified and are provided remedial instruction, if they eventually drop out or fail in the standard course, then little will have been accomplished by the placement system. On the contrary, both the institution's and the students' resources will have been wasted. In a comprehensive survey of the educational applications of decision theory, van der Linden (1991) noted that a defining characteristic of course placement systems is that students take different treatments (courses), but the success of each treatment is measured by the same criterion variable.

One might argue that failure in the standard course could lead to positive results, such as selecting (and ultimately succeeding in) another educational program better matched with a student's talents and interests. While this statement is undoubtedly true for some students, the students would have done better (with respect to their time and money) to have selected their preferred educational programs in the first place, through appropriate counseling. This scenario illustrates that effective counseling is also important for effective placement. This article does not, however, attempt to model the effects of counseling on placement.

The need for an institution to serve students who, by traditional standards, are academically unprepared for college imposes another requirement on placement systems. Even if most students' academic preparation is accurately assessed, even if remedial instruction is provided to the underprepared students, and even if nearly all the underprepared students eventually succeed in the standard course, the result still might be unsatisfactory. This would occur if an institution diverted resources to remedial instruction to such an extent that the achievement of students in standard courses was adversely affected. In other words, institutions need to consider the trade-offs they must make in allocating their finite resources when they provide remedial placement systems; such considerations may relate to institutional mission and policy as much as to costs and to grades. There is controversy about the proper role of remedial placement in postsecondary education. Lively (1993), for example, reported on efforts in different states to reduce or eliminate remedial instruction in their four-year public institutions by designating that role exclusively to two-year colleges.

Validity Arguments

In validating tests for use in educational selection, researchers have traditionally emphasized documenting time-ordered statistical relationships between test scores and relevant criteria. Typically, this documentation has consisted of correlation coefficients. To the extent that the usefulness of a test depends on the existence of statistical relationships, such evidence is clearly essential to validation. There are, however, other, more theoretically oriented strategies that offer significant potential advantages.

In Kane's (1992) paradigm, for example, one first states as clearly as possible the particular use being made of test scores, and the logical chain of assumptions by which the use can be justified. Next, one examines the plausibility of each assumption. One then investigates more thoroughly those assumptions that are least plausible, based on available evidence; usually, this will involve collecting and analyzing relevant data. The final step is to review the overall plausibility of the logical chain of inferences, and to determine how the plausibility can be enhanced, either by modifying the test or the use made of the test. This more theoretical approach to validity offers the possibility of understanding why empirical relationships exist, of making educated guesses about what limitations there may be to extrapolating such relationships to new situations, and even of devising new tools that are more effective in accomplishing goals.


Sawyer (1989) devised a simple validity argument, based on Kane's (1992) paradigm, for using curriculum-based tests and high school course work data to identify underprepared college students (Requirement 1 in the previous section). Essentially, the argument rests on the congruence between the academic skills measured by the placement variables and the academic skills students need to succeed in a particular course. Given a congruence in content, and given that course grades are valid measures of educational achievement, there should be a statistical relationship between placement variables and course grades. (In McDonald's [1985] discussion of why variables are correlated, this validity argument is a "theoretical answer.") By measuring the strength of this statistical relationship, one can gather evidence on the plausibility of the assumptions. By making additional assumptions about the costs and benefits of different actions and outcomes, one can also use observed statistical relationships to make inferences about the practical effectiveness of the placement variables.

The other requirements of remedial placement systems, described previously, offer a structure for a validity argument for the effectiveness of an entire course placement system. Such an argument would be very useful because of the insights it would afford into how placement systems operate, what their consequences are to students and to institutions, and how the systems might be improved. While important, such a task is beyond the scope of the present article; instead, this article focuses on statistical issues related to Requirement 1. For a discussion of some methodological issues involved in the larger goal of validating placement systems, see Sawyer (1996).

Decision Theory Models

Suppose that an argument has been constructed for using a placement test score (or other variable) to identify underprepared students, and that evidence has been collected to support the plausibility of the argument. How does one quantify from this evidence the usefulness of the placement variables? One approach is to augment the validity argument with additional assumptions about the benefits of student achievement, the negative consequences of failure, and the costs of operating the placement system. These assumptions should address the important outcomes of the placement system. The additional assumptions can then be related to statistical relationships estimated from data to produce a summary measure of the usefulness of the placement variables. Conceptually, decision theory seems well suited to this task. Validating placement systems with decision theory in a way that realistically models the choices students and institutions must make, as well as the consequences of their choices, is a complex undertaking, however. The methods proposed here will, it is hoped, result in progress toward this goal.

The decision problem can be formally defined as follows. One must select a particular decision d from a set D of possible decisions. A particular outcome θ occurs, from among a set of possible outcomes Θ. A utility function u(d, θ) assigns a value to the desirability of decision d when the outcome is θ. The exact outcome θ that will occur is unknown to the decision maker, but there is probabilistic information available about the likely values of θ. In a Bayesian decision theory model, this information is described by a subjective probability distribution on Θ; the subjective probability distribution quantifies the decision maker's beliefs about the likely values of θ, given both prior beliefs and any relevant data previously collected. The Bayesian optimal strategy is to choose the decision d that maximizes the expected value of u(d, θ) with respect to the subjective probability distribution on Θ (Lindley, 1972).

To illustrate, let us first consider the requirement that a placement test accurately identify underprepared students. Suppose that a given cutoff score K on a placement test is being considered, and that:

• test scores are obtained for all first-year students at an institution;
• students whose test scores are less than K are classified as needing remedial instruction, and students whose test scores are greater than or equal to K are classified as not needing remedial instruction; and
• the hypothetical performance of students in the standard course, without any prior remedial instruction, can be measured or estimated.

Each student is classified either as adequately prepared for the standard course (if his or her test score equals or exceeds K) or as inadequately prepared (if his or her score is less than K). Because the classification for any student depends on the assumed cutoff score K, the set of "decisions" (D) in this case is the set of possible values of K. The goal is to find the "best" value of K, and to quantify the effectiveness of the associated classifications. (An extension of this notion to placement based on multiple predictors is described later.)

In the identification component of a placement system, the "outcomes" (θ) for a group of students are their test scores and their performance (without prior remedial instruction) in the standard course. In common usage, the word "outcomes" suggests the final events that result from decisions about individuals (e.g., students' grades in the courses they decided to enroll in). This article follows Lindley (1972), in which "outcomes" are anything (e.g., students' test scores and course grades) observed after a decision (cutoff score) is selected. In many discussions of decision theory (e.g., van der Linden, 1991), the term "outcomes" is more consistent with common usage, because the information used to make a decision for an individual (e.g., a student's test scores) is distinguished from the result (e.g., the student's course grades). The approach taken here, on the other hand, allows us to separate the practical issues of the decision problem from the technical details of the statistical inference problem (see discussion below on expected utility functions).

Therefore, for each student, four possible events could occur, as shown in Table 1. Recall that all the events in Table 1 assume that no prior remedial instruction has occurred. In identifying underprepared students, Events A and C are preferred, because they correspond to correct classifications. The effectiveness of the decision (cutoff score) can be summarized by the frequencies of students associated with each event.

TABLE 1
Events Associated With Identifying Academically Underprepared Students

Event   Test score   Inference about student        Performance in standard course
A       ≥ K          Adequately prepared            Successful
B       ≥ K          Adequately prepared            Not successful
C       < K          Needs remedial instruction     Not successful
D       < K          Needs remedial instruction     Successful

Utility Functions

At an institution without a placement system, the events in Table 1 could (in principle) be observed by requiring all students, regardless of their test scores, to enroll in the standard course, and then noting which of them succeed and which do not succeed. For each value of K, there would be a set of proportions associated with Events A, B, C, and D. Let us suppose, temporarily, that this assumption is true; a later section of this article describes the modifications required when there is prior selection resulting from an existing placement system.

Let p_A(K), p_B(K), p_C(K), and p_D(K) denote the observed proportions of students corresponding to Events A, B, C, and D in the entire group of students when the cutoff score is K. Then p_A(K) + p_C(K) is the proportion of students who are correctly classified, and p_B(K) + p_D(K) is the proportion of students who are incorrectly classified. The usefulness of the placement test could then be evaluated in terms of the benefits of correct classifications and the losses resulting from incorrect classifications.

A function that expresses preferences among different pairs of outcomes and decisions is called a utility function. For this simple model, one possible utility function would be the relative frequency of correct classifications (accuracy rate); according to such a utility, every correct classification results in the same positive benefit, and every incorrect classification results in zero benefit. A more complex utility function would assign different values to the different events and weight their sum:

u(K, θ) = w_A p_A(K) + w_B p_B(K) + w_C p_C(K) + w_D p_D(K),     (1)

where w_A, ..., w_D are positive weights. Because utility functions are invariant under linear transformations (DeGroot, 1970), we can specify that 0 ≤ w_A, ..., w_D ≤ 1. Such a function would quantify the different benefits of the two types of correct classifications and the different costs of the two types of incorrect classifications. Even more general utility functions could be constructed by expressing course outcomes other than dichotomously; for example, one could elicit preferences for specific letter grades (Sawyer, 1996).

In principle, utility functions are specific to each decision maker and therefore need to be elicited separately for each student, counselor, teacher, or administrator. In practice, this is not feasible, and we must look for utility functions that approximate the preferences of different groups of people. The accuracy rate, obtained by letting w_A = w_C = 1 and w_B = w_D = 0 in Equation 1, reflects institutions' preferences for accurately identifying underprepared students. Of course, there may be other preferences, particularly among students (Whitney, 1989). To a student who is willing to take risks, enrolling in and failing the standard course (Event B) might be more favorable than taking the remedial course when the student could have been successful in the standard course (Event D). A utility function with weights w_A = 1, w_C = 2/3, w_B = 1/3, and w_D = 0 could correspond to such students' preferences to minimize the time and cost required to pass their courses.
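To make these weight choices concrete, the following sketch (in Python, with entirely hypothetical cell proportions for a single candidate cutoff score) evaluates the utility in Equation 1 under the accuracy-rate weights and under the risk-tolerant weights just described.

```python
# Hypothetical cell proportions for one candidate cutoff score K;
# the numbers are illustrative only and are not taken from the article's data.
pA, pB, pC, pD = 0.50, 0.15, 0.20, 0.15

def utility(pA, pB, pC, pD, wA, wB, wC, wD):
    """Weighted utility of Equation 1 for a single cutoff score."""
    return wA * pA + wB * pB + wC * pC + wD * pD

accuracy_rate = utility(pA, pB, pC, pD, 1.0, 0.0, 1.0, 0.0)   # 0.70
risk_tolerant = utility(pA, pB, pC, pD, 1.0, 1/3, 2/3, 0.0)   # about 0.68
print(accuracy_rate, risk_tolerant)
```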

The utility function u in Equation 1 resembles the expected utility function (defined below) in the decision theory model developed by Petersen and Novick (1976). Petersen and Novick's "threshold" utility function, however, pertained to the possible outcomes for an individual student rather than for a group of students. For simple utility functions like u, which can be expressed as sums of utilities for individual students, the two approaches amount to the same thing. The group model, however, is more general and permits one to consider more complex outcomes and utilities. Consider, for example, the cost of providing remedial instruction: With each additional student placed in the remedial course, one can expect a small increment in cost, until a new instructor must be hired and an additional classroom provided. At this point, the total cost jumps by a large amount, and the per-student cost must be recomputed. Such a cost structure cannot therefore be represented as a simple sum of costs for any number of individual students, because it must consider the total number of students who are provided remedial instruction.

Expected Utility Functions

In practice, a utility function cannot be directly computed for the group of students for whom placement decisions are to be made, because the outcomes (the students' test scores and performance in the standard course) are not yet known. In (1), for example, the actual proportions p_A(K), ..., p_D(K) are not known for a particular group of students before they are tested and complete the standard course. These proportions must instead be estimated in some way from data on past students, under the assumption that future students will be similar to past students.

The expected utility function is a formal mechanism for dealing with the uncertainty of outcomes in a decision theory model. It is from the expected utility function that decisions on the effectiveness of a placement system can be made. In Bayesian decision theory models, an expected utility function is the average (expected) value of a utility function u(d, θ) with respect to the subjective probability distribution on the outcome space Θ. In the example previously given,

U(K) = E_θ[u(K, θ)] = w_A E_θ[p_A(K)] + w_B E_θ[p_B(K)] + w_C E_θ[p_C(K)] + w_D E_θ[p_D(K)],     (2)

where E_θ[p_A(K)], E_θ[p_B(K)], E_θ[p_C(K)], and E_θ[p_D(K)] are the expected values of the cell proportions p_A(K), p_B(K), p_C(K), and p_D(K) with respect to the decision maker's subjective probability distribution for future students' test scores and course grades (θ).

In the terminology of Bayesian statistical inference, the subjective probability distribution for future students' test scores and course grades is specified by a predictive density for their joint distribution. The predictive density is calculated from an assumed likelihood function for the future observations, given certain unknown parameters, and a posterior density of the unknown parameters. The posterior density is based on prior beliefs about the unknown parameters, as well as on data collected in the past.

In principle, this method is the only one consistent with the axioms on which Bayesian decision theory is based (Lindley, 1972). In practice, deriving a predictive density function is often a mathematically intractable problem. When prior beliefs are vague, or as sample sizes become large, however, the Bayesian expected values E_θ[p_A(K)], ..., E_θ[p_D(K)] are, for practical purposes, very similar to classical statistical estimates p̂_A(K), ..., p̂_D(K), based only on data (DeGroot, 1970). In this case, we have

U(K) ≈ w_A p̂_A(K) + w_B p̂_B(K) + w_C p̂_C(K) + w_D p̂_D(K).     (3)

In the following section, a method for estimating cell proportions is described that is based on classical statistical methods.

Estimating Proportions

The proportions p_A(K), ..., p_D(K) in (1) could in principle be estimated from direct observation. Suppose that an institution administered a placement test to every first-year student, but placed each student in the standard course, regardless of the student's score on the test. Some students would be successful, and others, presumably, would be unsuccessful in the standard course. The statistical relationship between the placement test score and performance in the standard course could then be modeled. A convenient way to model such a relationship is with the logistic regression function:

P(x) = P[Y = 1 | X = x] = (1 + e^{-α-βx})^{-1},     (4)

where Y = 1 if a student is successful, and 0 if a student is unsuccessful; and X is the student's test score. The numbers α and β in (4) are unknown parameters. They are estimated from data on the test scores and the success/failure variable Y from a group of students who enrolled in the standard course. The logistic regression model resembles the ordinary linear regression model in that it describes the conditional mean of one variable Y on another variable X in terms of unknown parameters α and β. Because the logistic function is nonlinear in α and β, however, there is not an explicit formula for estimating them as in ordinary linear regression; they must instead be estimated iteratively, using successive approximation techniques. Logistic regression procedures are available in both the SAS (1990) and SPSS (1990) packages.
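As an informal illustration of this iterative estimation step, a minimal Python sketch of a Newton-Raphson fit is given below. The array names (scores and success) are assumptions for illustration only; in practice one would typically rely on the packaged logistic regression routines mentioned above.

```python
import numpy as np

def fit_logistic(scores, success, n_iter=25):
    """Maximum likelihood estimates (alpha, beta) for the model in Equation 4,
    obtained by Newton-Raphson (successive approximation)."""
    X = np.column_stack([np.ones(len(scores)), np.asarray(scores, dtype=float)])
    y = np.asarray(success, dtype=float)
    coef = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ coef))          # current estimate of P(x)
        grad = X.T @ (y - p)                         # gradient of the log likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information matrix
        coef = coef + np.linalg.solve(hess, grad)    # Newton step
    return coef  # (alpha_hat, beta_hat)
```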

Of course, other variables (such as high school grades) could instead be used as predictors. Moreover, logistic regression models with multiple predictors (e.g., test scores, high school grades, and hours working at a job) could also be developed. Using multiple predictors for placement is desirable for several reasons. First, it potentially increases placement accuracy, because it brings more information to bear on making placement decisions. Second, it decreases the anxiety that some students would perhaps feel if all placement decisions were based on a single high-stakes test. Third, it could allow an institution to exempt from local testing students who, given their scores on nationally standardized tests or other variables, are predicted to have a high probability of success in the standard course.

The most straightforward way to extend the notion of a cutoff score on a single test to multiple predictor variables is to base decisions on a minimum required probability of success in the standard course (i.e., on a cutoff probability). In the author's experience, however, institutions more typically develop complicated rules keyed to cutoff scores on each of the individual predictor variables.
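A cutoff-probability rule with several predictors can be sketched in the same way as the single-predictor fit above; the design matrix simply gains one column per predictor. The predictor set and the cutoff probability of .50 below are purely illustrative assumptions.

```python
# Illustrative only: rows of X are [1, test score, high school GPA] for each student,
# and coef holds coefficients estimated by a multivariate analogue of fit_logistic.
def needs_remedial(coef, X, cutoff_prob=0.50):
    """Classify students by comparing the estimated probability of success
    in the standard course with a minimum required (cutoff) probability."""
    prob_success = 1.0 / (1.0 + np.exp(-(X @ coef)))
    return prob_success < cutoff_prob   # True = classified as needing remedial instruction
```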

Once estimates α̂ and β̂ have been obtained for the parameters α and β, the conditional probability of success P(x) can be estimated by substituting α̂ and β̂ in (4). From the estimated conditional probabilities, estimated proportions for the four events described in Table 1 can be easily calculated. For example, the proportion of students associated with Event A in Table 1 can be estimated by

p̂_A(K) = Σ_{x ≥ K} P̂(x) n(x) / N,     (5)

where P̂(x) = estimated P(x); K = the minimum score required for enrollment in the standard course (cutoff score); n(x) = the number of students whose test score is equal to x; and N = Σ n(x), the total number of students. The proportions for Events B, C, and D can be estimated similarly, by adjusting the summand and the range of summation. An example given later in this article illustrates these calculations with data from a mathematics course placement system.

Note that the summations in Equation 5 are based on the x-values (test scores) of all the students in the placement group, that is, all students for whom placement decisions could be made. (In contrast, the logistic regression function in (4) is estimated from the estimation sample, that is, students who enrolled in the standard course.) In practical terms, the placement group will usually consist of all students with test scores, regardless of which course they enroll in. A placement group could also be defined for students in a particular program of study (e.g., business) or with particular background characteristics (e.g., minority students).

One could, in Equation 5, model the marginal distribution of X in the placement group rather than use the empirical frequencies n(x) (Houston, 1990). Modeling the marginal distribution of X offers the potential for increasing estimation accuracy (provided, of course, the model is realistic).
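The calculation in Equation 5 and its analogues for Events B, C, and D can be sketched as follows. The function assumes fitted coefficients from the estimation sample and an array of placement-group test scores; the variable names are again chosen only for illustration.

```python
def cell_proportions(alpha, beta, placement_scores, K):
    """Estimated proportions for Events A, B, C, and D of Table 1 at cutoff K,
    computed as in Equation 5 from the placement group's test scores."""
    x = np.asarray(placement_scores, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))   # estimated P(x) for each student
    above = x >= K
    N = len(x)
    pA = p[above].sum() / N            # score >= K, predicted successful
    pB = (1.0 - p[above]).sum() / N    # score >= K, predicted not successful
    pC = (1.0 - p[~above]).sum() / N   # score < K, predicted not successful
    pD = p[~above].sum() / N           # score < K, predicted successful
    return pA, pB, pC, pD
```

The estimated accuracy rate at cutoff K is then p̂_A(K) + p̂_C(K), and the approximate expected utility in Equation 3 is the corresponding weighted sum.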

Statistical Issues

It is not possible to observe Events C and D directly at an institution where the placement test being studied is also used operationally. The reason is that students for whom X < K do not enroll directly in the standard course and therefore do not have performance data unaffected by remedial instruction. The logistic regression model in (4) can conveniently be used to estimate the probabilities p_C(K) and p_D(K), however, even though Events C and D themselves are not directly observable. Estimating p_C(K) and p_D(K) involves extrapolating the logistic regression function (estimated from the data of students enrolled in the standard course) to the test scores below K. Note that one could also apply this method to data obtained from an existing placement system using a different test Y; in this case, one would administer both tests X and Y to students.

Houston (1993) did computer simulations to examine the effects of extrapolation on the accuracy of estimated conditional probabilities of success. He found increases in standard error of 6%, 30%, and 43% when the placement group was truncated at the 25th, 50th, and 75th percentiles, respectively, as compared to the standard error associated with no truncation. These results suggest that if 25% or fewer of the students are assigned to the remedial course, then the procedure described here will estimate the conditional probability of success with reasonable accuracy.

Houston also found that the sample sizes needed to obtain acceptably accurate estimates of the logistic regression parameters α and β are larger than the sample sizes typically required for linear regression. Sample size is an important practical issue in course placement, in which a particular standard-level course with only 20 students would not be unusual. One way to overcome sample size problems in course placement research is to cumulate data over more than one term; if the course content and grading standards do not change over time, this method is satisfactory. Another way to overcome sample size problems is to develop a central prediction system, in which data from several institutions are used collaterally to estimate parameters for each institution individually. Houston and Woodruff (1994) developed empirical Bayes estimates of α and β in central prediction models, in which there is an exchangeable prior distribution for the parameters across institutions. Central prediction systems may also be effective in mitigating the deleterious effects of truncation resulting from prior selection.

Formulas for the standard errors of estimated accuracy rates do not currently exist. Simulation (e.g., Houston, 1993) and bootstrapping can be used to obtain general guidelines.
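One rough guideline of this kind can be obtained by resampling the estimation sample. The sketch below reuses the fit_logistic and cell_proportions sketches given earlier; the number of bootstrap replications and the variable names are arbitrary, illustrative choices.

```python
def bootstrap_se_accuracy(est_scores, est_success, placement_scores, K,
                          n_boot=500, seed=0):
    """Bootstrap standard error of the estimated accuracy rate p_A(K) + p_C(K)."""
    rng = np.random.default_rng(seed)
    est_scores = np.asarray(est_scores, dtype=float)
    est_success = np.asarray(est_success, dtype=float)
    rates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(est_scores), size=len(est_scores))
        alpha, beta = fit_logistic(est_scores[idx], est_success[idx])
        pA, _, pC, _ = cell_proportions(alpha, beta, placement_scores, K)
        rates.append(pA + pC)
    return float(np.std(rates, ddof=1))
```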

In this article, cell probabilities are estimated by modeling the probability of success function with logistic regression. One could also estimate cell probabilities with traditional methods based on linear regression and normal distributions. Using traditional methods, one would first adjust the observed correlation ρ between course grade and test score for the restriction of range in the test score (Lord & Novick, 1968). Call the adjusted correlation ρ*. Then, assuming a bivariate normal distribution with correlation ρ* for course grade and test score, one could estimate the cell probabilities using a table of the bivariate normal distribution. Besides being more complex and laborious, linear regression methods require more assumptions than the logistic regression method and are therefore likely to be less robust in practice.


Optimal Cutoff Scores

Associated with each potential cutoff score is a value of the expected utility U in Equation 3. One can show (see Appendix) that if the test score is a continuous random variable, if w_A + w_C > w_B + w_D, and if

P(K) = (w_C - w_B)/(w_A - w_B + w_C - w_D),     (6)

then K is the optimal cutoff score, in the sense that it maximizes U. For the accuracy rate (in which w_A = w_C = 1, and w_B = w_D = 0), P(K) = 1/2 implies that K is the optimal cutoff score. For a utility function with w_A = 1, w_B = 1/3, w_C = 2/3, and w_D = 0, the optimal cutoff score is that for which P(K) = 1/4. Of course, in practice, other aspects of the course placement system (such as the proportion of students placed in the remedial course) need to be considered in selecting a cutoff score.
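If P(x) is modeled by the logistic function in (4), Equation 6 can be solved for the cutoff score in closed form (assuming the estimated slope β̂ is positive), which is convenient for locating the estimated optimal cutoff:

\[
c = \frac{w_C - w_B}{w_A - w_B + w_C - w_D}, \qquad
\hat{P}(K) = \frac{1}{1 + e^{-\hat{\alpha} - \hat{\beta} K}} = c
\;\Longrightarrow\;
\hat{K} = \frac{\ln\bigl(c/(1 - c)\bigr) - \hat{\alpha}}{\hat{\beta}}.
\]

For the accuracy rate, c = 1/2 and K̂ = -α̂/β̂; for the weighted utility with c = 1/4, K̂ = (ln(1/3) - α̂)/β̂.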

One interesting aspect of this result is that the estimated optimal cutoff score (provided that it exists) does not explicitly depend on a particular model for the marginal distribution of X, or even on modeling P(x) with the logistic regression function. For example, if two data sets have different marginal distributions on X but the same estimated conditional probability function P̂(x), then the optimal cutoff score will be the same for both groups. Moreover, linear regression under normal theory and logistic regression will typically yield different estimated conditional probability functions (say, P̂_1 and P̂_2); but, if P̂_1(K) = P̂_2(K) = (w_C - w_B)/(w_A - w_B + w_C - w_D) for some test score K, then the estimated optimal cutoff score for both models will be the same (namely, K).

Data sets with no optimal cutoff score. In some data sets, the expected utility function does not have a definite peak, but instead is strictly monotonic. In many courses, when success is defined as earning a C or higher grade, the estimated accuracy rate usually is very high (e.g., greater than .85) for all test scores, but frequently is a strictly decreasing function of test score. In such courses, the optimal cutoff score is the lowest possible test score, which is to say that every student should be placed in the standard course regardless of test score.

What a result like this really means is that even among students with low test scores, more than half earn a C or higher. Now, this phenomenon could occur for several reasons. One possible reason is that despite an apparent congruence between the content of the test and the skills required for the standard course, the test scores are invalid for placement in the course (whether because of content mismatch, test administration error, or whatever). A second possible reason is that there is an existing placement system, and it is succeeding wonderfully, because nearly every student placed in the standard course is successful. A third possibility is that the C-or-higher success criterion is not a valid measure of achievement in the course, because of instructors' reluctance to assign any grade lower than a C. Shea (1994) suggested that grade inflation may be widespread, particularly in humanities courses.

I have analyzed dozens of data sets in which the accuracy rate curve for the C-or-higher success criterion does not have a definite peak. In nearly every such data set, the accuracy rate curve associated with the B-or-higher success criterion does have a definite peak. Therefore, for these data sets at least, invalidity of the test scores is an unlikely explanation.

A similar result (with respect to incremental utility of test scores; see below) was found by Ben-Shakhar et al. (1994), who developed a threshold utility model for using admission test scores to predict first-year college GPA in Israel. When the success criterion was set at a low level, the admission test score showed no incremental utility over other measures already used; when the success criterion was set at higher levels, then using the admission test score was beneficial.

I have occasionally analyzed data for which the B-or-higher accuracy rate curve is strictly increasing, but the C-or-higher accuracy rate curve has a definite peak. Such data typically come from mathematics courses in which few students earn a B or higher grade, but many earn a C or higher grade. The example given later in this article is based on data like these.

Incremental Accuracy Rate

The accuracy rate is an index of the effectiveness of a placement variable (e.g., test) in identifying underprepared students. In deciding whether to use the placement variable, however, one needs to consider whether using the variable significantly increases the accuracy rate over that which would occur if the variable were not used. In other words, we need to consider the incremental accuracy rate of the placement variable. If the choice is between using a placement variable X and not using any placement variable at all, then the incremental accuracy rate of X is

ΔAR(K) = AR_X(K) - AR_X(x_min),     (7)

the difference between the accuracy rate at a cutoff score K and the accuracy rate at the minimum possible score. (The latter quantity is the base success rate that would occur if every student took the standard course.) If the choice is between using placement variables W and X jointly and using only the placement variable W, then the incremental accuracy rate of X is

ΔAR(p) = AR_{W,X}(p) - AR_W(p),     (8)

the difference between the accuracy rate using W and X at a cutoff probability p, and the accuracy rate using W alone at cutoff probability p.
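Continuing the earlier sketches (fit_logistic and cell_proportions, with illustrative variable names), the incremental accuracy rate of Equation 7 can be computed directly; the accuracy rate at the minimum possible score reduces to the base success rate, because every student is then classified as adequately prepared.

```python
def incremental_accuracy(alpha, beta, placement_scores, K):
    """Delta AR of Equation 7: accuracy rate at cutoff K minus the base success
    rate that would occur if every student took the standard course."""
    pA, _, pC, _ = cell_proportions(alpha, beta, placement_scores, K)
    pA0, _, pC0, _ = cell_proportions(alpha, beta, placement_scores,
                                      min(placement_scores))
    return (pA + pC) - (pA0 + pC0)   # pC0 is 0 when no score falls below the cutoff
```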

Incremental accuracy rates depend on the course and institution, as well as on the placement variables and success criterion. In course placement studies based on the ACT Assessment tests (American College Testing [ACT], 1991a) and the B-or-higher success criterion, the median ΔAR for standard first-year English composition courses was about .07 (n = 53 institutions). For standard first-year college algebra courses (n = 40), however, the median ΔAR was about .29.

Remedial Course Placement Rate

The decision theory model just described does not consider the cost of providing remedial instruction, which is surely an important consideration in maintaining any placement system. Traditionally, institutions address the issue of cost by considering the proportion of students placed in the remedial course: If resources are available to provide remedial instruction to no more than X% of students, then the cutoff score on the placement test is taken to be the largest score less than or equal to the scores of X% of the placement group. While this method indirectly addresses the issue of cost, it cannot be recommended as the sole means for selecting cutoff scores, because it does not address the issue of classification accuracy or benefit of remedial instruction. When used with other information, such as the accuracy rate, the placement rate is an appropriate, if informal, piece of evidence in choosing cutoff scores. A decision theory model that formally considers costs and benefits is being studied by the author.

Example

In September, 1989, the ACT Assessment Mathematics test (ACT, 1991a) was administered to a sample of students who were enrolled in the prealgebra, college algebra, or trigonometry courses at a large public university in the Midwest. Subscores (range = 1-18) were reported for prealgebra/elementary algebra, intermediate algebra/coordinate geometry, and plane geometry/trigonometry. Students were placed into courses according to their scores on local placement tests (range = 200-800) in arithmetic, elementary algebra, and intermediate algebra. The students' grades, ACT Assessment scores, and local placement test scores were matched, and the relationships between students' test scores and their success in the courses were analyzed. This example is a summary of the results for the college algebra course, using the ACT Intermediate Algebra/Coordinate Geometry subscore (referred to as the ACT subscore) and the local intermediate algebra score (referred to as the LOC score) as placement variables.

Probability of success relationships were estimated separately from the records of 130 students who had ACT subscores and who enrolled in the course, and from the records of 52 students who had valid ACT subscores and LOC scores, and who enrolled in the course. Two success criteria were studied: C-or-higher course grade, and B-or-higher course grade. Students who did not complete the course were considered unsuccessful. Some analysts prefer to treat withdrawals as missing data, rather than as unsuccessful outcomes, because students may withdraw from a course for nonacademic reasons. Ang and Noble (1993), in contrast, found that interpreting withdrawals as unsuccessful outcomes, rather than as missing, results in larger incremental accuracy rates and higher optimal cutoff scores; this result suggests that students withdrew primarily for academic reasons.

For each criterion, the probability of success was estimated from the logistic regression model in (4). Next, the proportions p_A(K), ..., p_D(K) of students associated with outcomes A, B, C, and D were estimated from records with valid data for the predictor variables, using formulas analogous to (5). The placement group consisted of students who enrolled in any mathematics course; the frequencies n(x) in (5) were calculated from their test scores. Finally, accuracy rates and increases in accuracy rate were estimated for the predictor variables.

Means and standard deviations for the test scores and grades are summarized in Table 2. The mean ACT subscore (10.2) for students who enrolled in the course (the estimation sample) was higher than the mean ACT subscore (10.0) for all ACT-tested students who graduated from high school in June, 1991 (ACT, 1991b). The mean ACT subscore for the placement group (9.8), in contrast, was lower than the national mean. The standard deviation for the ACT subscore for the students in the estimation sample (2.2) was smaller than the comparable national standard deviation (2.9). These results can be attributed to the restriction of range caused by prior selection of students in admission and placement.

TABLE 2
Descriptive Statistics Related to Placement in College Algebra Course

Sample / Placement variables                      N    ACT subscore    LOC score    Course grade
                                                         Mean (SD)     Mean (SD)   Pct. ≥ C  Pct. ≥ B
Reference group (Fall, 1991 entering students):
  ACT subscore only                              309    9.8 (2.4)        n/a          n/a       n/a
  ACT subscore and local score                   155    9.7 (2.3)      417 (87)       n/a       n/a
Estimation sample (students who completed
College Algebra course):
  ACT subscore only                              130   10.2 (2.2)        n/a           45        23
  ACT subscore and local score                    52   10.8 (2.2)      474 (77)        50        19

Note: n/a = not applicable.

Note that there were 130 cases in the estimation sample with valid ACT subscores, but only 52 of these cases also had valid LOC scores. ACT subscores were analyzed separately using records from both samples so that a more direct comparison could be made with the local placement test.

One indicator of the validity of a test for correctly identifying underprepared students is the conditional probability of success for a given test score x. If the test is valid for this purpose, the conditional probability of success should be an increasing function of x near the cutoff score; that is, students with test scores above the cutoff should have a greater chance of succeeding than students with scores below the cutoff.

Figure 1 shows the estimated probability of success in the course, as defined by B-or-higher and by C-or-higher course grade, given scores on the two predictor variables. All four functions in Figure 1 were estimated from the sample with N = 52. For economy of presentation, ACT subscores and LOC scores with similar percentile ranks are plotted at the same position on the horizontal axis; as a result, the shape of the logistic curve for the LOC scores is slightly distorted.

FIGURE 1. Estimated probability of success in college algebra course, given ACT subscore and local test score. [Plot of estimated probability of success (0 to 1) against ACT subscore (7-17) and LOC score (319-649), with separate curves for the B-or-higher and C-or-higher success criteria.]

Note that the estimated probability of B or higher increases sharply with the ACT subscore, but not with the LOC score. This result suggests that the local test is not effective in identifying the students who earn a B or higher grade in the course. On the other hand, the probability of C or higher increases more sharply with the LOC score than with the ACT subscore. This result suggests that the local test is more effective than the ACT subtest for identifying students who earn a C or higher grade. A possible explanation for both results is that the content of the ACT items was more advanced than that of the LOC items.

The probability of success in the course, given ACT subscore, was also estimated from the ACT-only sample of 130 records, and yielded similar results. This function is not shown in Figure 1.

Another aspect of the validity of a test for correctly identifying underprepared students is the accuracy rate (i.e., the proportion of students who are correctly classified). Figure 2 shows the relationship between cutoff scores on the two predictor variables and the estimated accuracy rates. With respect to the B-or-higher success criterion, the accuracy rate was an increasing function of the ACT subscore, although all subscore values above 13 resulted in essentially the same accuracy rate. For the C-or-higher success criterion, the LOC score had higher optimal accuracy rates than did the ACT subscore. The optimal ACT cutoff score was 11, and the optimal LOC cutoff score was 473. For both predictor variables, however, a broad range of score values produced nearly optimum accuracy rates.

FIGURE 2. Estimated accuracy rates associated with placement in college algebra course, given cutoffs on ACT subscore and local test score. [Plot of estimated accuracy rate (0 to 1) against cutoffs on the ACT subscore (7-17) and LOC score (319-649), with separate curves for the B-or-higher and C-or-higher success criteria.]

Another important consideration in evaluating the validity of a placement test is the increase in the accuracy rate that results from using the test, as compared to not using the test. According to the accuracy rate curves in Figure 2, for example, about 85% of students would be correctly placed using an ACT subscore cutoff of 17, with respect to the B-or-higher success criterion. If no test were used, the estimated proportion of correct decisions would be .24; that is, if all students enrolled in the standard course, only about 24% would earn a B or higher. Therefore, using the ACT subscore for placement could increase the accuracy rate by .85 - .24 = .61. (In practice, the increase in accuracy rate would be smaller than .61, because a cutoff score less than 17 would be chosen.) With respect to the C-or-higher success criterion, the estimated increase in accuracy rate for the ACT subscore is about .09; the comparable increase associated with the LOC score is about .23. Both results are consistent with those for the probability of success function; they suggest that the ACT subscore is more effective in identifying underprepared students with respect to the B-or-higher success criterion, and that the LOC score is more effective with respect to the C-or-higher success criterion.

Validity statistics were also computed with respect to a weighted utility function with weights w_A = 1, w_C = 2/3, w_B = 1/3, and w_D = 0 in Equation 1. Not surprisingly, the optimal ACT cutoff scores for the weighted utility function (B-or-higher, 13; C-or-higher, 7) were lower than those for the accuracy rate (B-or-higher, 17; C-or-higher, 11). For the B-or-higher success criterion, the increase in the weighted validity statistic associated with the ACT subscore was higher than that associated with the LOC score; but the opposite was true for the C-or-higher success criterion.

Summary

Remedial course placement systems identify academically underprepared students and provide remedial instruction to them. Decision theory is a useful method for assessing the effectiveness of the components of a course placement system. The effectiveness of placement tests or other variables in identifying underprepared students is described by the conditional probability of success in the standard course, given a test score. The conditional probability of success function can be estimated by logistic regression; the estimated conditional probability of success function can then be combined with other information to develop accuracy rates and other validity indexes.

APPENDIX

Derivation of formula for optimal cutoff score

Proposition. Let a test score X be a continuous random variable with probability density function f; let Y be a dichotomous (success-failure) variable; and let P̂(x) be the estimated conditional probability P[Y = 1 | X = x]. If

(i) f is positive on a nonempty interval (a, b) and zero elsewhere (in other words, the marginal distribution of X has no gaps),
(ii) P̂ is strictly increasing on (a, b),
(iii) P̂ and f are differentiable on (a, b), and
(iv) w_A + w_C > w_B + w_D,

then the expected utility function

U(K) = w_A p̂_A(K) + w_B p̂_B(K) + w_C p̂_C(K) + w_D p̂_D(K)

is maximized if P̂(K) = (w_C - w_B)/(w_A - w_B + w_C - w_D). If there is no score K for which P̂(K) = (w_C - w_B)/(w_A - w_B + w_C - w_D), then the "optimal" cutoff score is either the minimum or the maximum possible score on the test.

Proof

U(K) = E_θ[u(K, θ)] = w_A p̂_A(K) + w_B p̂_B(K) + w_C p̂_C(K) + w_D p̂_D(K)

     = w_A ∫_K^b P̂(x) f(x) dx + w_B ∫_K^b [1 - P̂(x)] f(x) dx

       + w_C ∫_a^K [1 - P̂(x)] f(x) dx + w_D ∫_a^K P̂(x) f(x) dx.

Suppose first of all that there is a score K for which U′(K) = 0. Now,

U′(K) = -w_A P̂(K) f(K) - w_B [1 - P̂(K)] f(K) + w_C [1 - P̂(K)] f(K) + w_D P̂(K) f(K)

      = f(K) (w_C - w_B - P̂(K)[w_A - w_B + w_C - w_D]).

Moreover, because f and P̂ are differentiable, U′ is differentiable, and

U″(K) = f′(K) (w_C - w_B - P̂(K)[w_A - w_B + w_C - w_D]) - f(K) P̂′(K) [w_A - w_B + w_C - w_D].

Let a < K₀ < b. Then, U′(K₀) = 0 if and only if f(K₀) = 0 or P̂(K₀) = (w_C - w_B)/(w_A - w_B + w_C - w_D). Because f(K₀) > 0, the second condition is true. Therefore, K₀ is a critical point of U if and only if P̂(K₀) = (w_C - w_B)/(w_A - w_B + w_C - w_D).

To verify that U has a relative maximum at K₀, note that f(K₀) > 0 and that P̂′(K₀) exists and is positive. Therefore,

U″(K₀) = -f(K₀) P̂′(K₀) [w_A - w_B + w_C - w_D],

and thus,

U″(K₀) < 0 if and only if w_A + w_C > w_B + w_D.

Hence, given the other conditions, U has a relative maximum at K₀ if and only if w_A + w_C > w_B + w_D. But this latter condition is assumed to be true.

Now, consider the case when there is no score K₀ at which U′(K₀) = 0. Because P̂ and f are continuous, U′ is continuous. Thus, either U′ is always negative, or U′ is always positive. In the former case, U is strictly decreasing, and U is maximized at the smallest possible score. In the latter case, U is strictly increasing, and U is maximized at the largest possible score.
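The proposition can also be checked numerically: under an assumed P̂ and an assumed score density f, the expected utility U(K) can be evaluated on a grid of cutoffs and its maximum compared with the score at which P̂(K) equals the ratio in Equation 6. The logistic and normal parameter values in the Python sketch below are arbitrary, illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

alpha, beta = -8.0, 0.5                     # assumed logistic parameters; P(16) = 1/2
wA, wB, wC, wD = 1.0, 0.0, 1.0, 0.0         # accuracy-rate weights
P = lambda x: 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
f = norm(loc=18.0, scale=4.0).pdf           # assumed marginal density of test scores

def U(K, lo=0.0, hi=36.0):
    """Expected utility of the Appendix, evaluated by numerical integration."""
    upper, _ = quad(lambda x: (wA * P(x) + wB * (1.0 - P(x))) * f(x), K, hi)
    lower, _ = quad(lambda x: (wC * (1.0 - P(x)) + wD * P(x)) * f(x), lo, K)
    return upper + lower

grid = np.linspace(1.0, 35.0, 341)
print(grid[np.argmax([U(K) for K in grid])])   # close to -alpha/beta = 16
```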

References

American College Testing. (1991a). The ACT Assessment user handbook. Iowa City, IA: Author.

American College Testing. (1991b). High school profile national norms for June, 1991 graduates. Iowa City, IA: Author.

Ang, C. H., & Noble, J. P. (1993). The effects of alternative interpretations of incomplete and withdrawal grades on course placement validity indices (Research Rep. No. 93-3). Iowa City, IA: American College Testing.

Ben-Shakhar, G., Kiderman, I., & Beller, M. (1994). Comparing the utility of two procedures for admitting students to liberal arts: An application of decision theoretic models. Jerusalem: National Institute for Testing and Evaluation.

Carnegie Foundation for the Advancement of Teaching. (1988). An imperiled generation: Saving urban schools. Princeton, NJ: Princeton University Press.

Carriuolo, N. (1994, April 13). Why developmental education is such a hot potato. The Chronicle of Higher Education, sec. 2, pp. 1-2.

College Entrance Examination Board. (1977). On further examination. New York: Author.

Colleges and universities offering remedial instruction and tutoring. (1994, April 13). Education Week, p. 6.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana: University of Illinois Press.

DeGroot, M. H. (1970). Optimal statistical decisions. New York: McGraw-Hill.


Houston, W. M. (1990, April). Conditional validity and projection distributions. Paper presented at the Annual Meeting of the American Educational Research Association, Boston.

Houston, W. M. (1993, April). Accuracy of validity indices for course placement systems. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.

Houston, W. M., & Woodruff, D. J. (1994, June). Empirical Bayes estimates of parameters from logistic regression models. Paper presented at the meeting of the Psychometric Society, Champaign, IL.

Kane, M. T. (1992). An argument-based approach to validation. Psychological Bulletin, 112(3), 527-535.

Lindley, D. V. (1972). Bayesian statistics: A review. Philadelphia: Society for Industrial and Applied Mathematics.

Lively, K. (1993, February 24). States step up efforts to end remedial courses at 4-year colleges. The Chronicle of Higher Education, p. A28.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.

McNabb, T. (1990). Course placement practices of American postsecondary institutions (Research Rep. No. 90-10). Iowa City, IA: American College Testing.

Munday, L. (1976). Declining admissions test scores (Research Rep. No. 71). Iowa City, IA: American College Testing.

National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform. Washington, DC: U.S. Government Printing Office.

Petersen, N. S., & Novick, M. R. (1976). An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 13, 3-29.

SAS Institute, Inc. (1990). SAS/STAT user's guide, Version 6 (4th ed., Vol. 2). Cary, NC: Author.

Sawyer, R. L. (1989). Validating the use of ACT Assessment scores and high school grades for remedial course placement in college (Research Rep. No. 89-4). Iowa City, IA: American College Testing.

Sawyer, R. L. (1996, April). Assessing the effectiveness of course placement systems in college. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New York.

Shea, C. (1994, January 5). Grade inflation's consequences. The Chronicle of Higher Education, p. A45.

Singal, D. J. (1991, November). The other crisis in American education. The Atlantic, 268, 59-69.

SPSS, Inc. (1990). SPSS-X advanced statistics user's guide. Chicago: Author.

van der Linden, W. (1991). Applications of decision theory to test-based decision making. In R. L. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing (pp. 129-156). Boston: Kluwer.

Whitney, D. R. (1989). Educational admissions and placement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 515-525). New York: American Council on Education and Macmillan.

Willingham, W. (1974). College placement and exemption. New York: College Entrance Examination Board.

Woods, J. E. (1985). Status of testing practices at two-year postsecondary institutions. Iowa City, IA: American College Testing and the American Association of Community and Junior Colleges.


Wright, D. A., & Cahalan, M. W. (1985, April). Remedial/developmental studies in institutions of higher education: Policies and practices, 1984. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago. (ERIC Document Reproduction Service No. ED 263 828)

Author

RICHARD SAWYER is Assistant Vice President, Measurement and Statistical Research, American College Testing, P.O. Box 168, 2201 North Dodge St., Iowa City, IA 52243-0168; [email protected]. Degrees: BA, Rice University; PhD, University of Iowa. Specialization: statistics.
