
Chapter 4 (continued)


norms is as a convenient, readily understandable gauge of how one student's performance compares with that of fellow students in the same grade. One drawback of grade norms is that they are useful only with respect to years and months of schooling completed. They have little or no applicability to children who are not yet in school or to children who are out of school. Further, they are not typically designed for use with adults who have returned to school.

J U S T T H I N K . . .
Some experts in testing have called for a moratorium on the use of grade-equivalent as well as age-equivalent scores because such scores may so easily be misinterpreted. What is your opinion on this issue?

Both grade norms and age norms are referred to more generally as developmental norms, a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life. A Piagetian theorist might, for example, develop a test of the Piagetian concept of accommodation and then norm this test in terms of stage of life as set forth in Jean Piaget's writings. Such developmental norms would be subject to the same sorts of limitations we have previously described for other developmental norms that are not designed for application to physical characteristics. The accommodation test norms would further be subject to additional scrutiny and criticism by potential test users who do not subscribe to Piaget's theory.

National norms As the name implies, national norms are derived from a normative sample that was nationally representative of the population at the time the norming study was conducted. In the fields of psychology and education, for example, national norms may be obtained by testing large numbers of people representative of different variables of interest such as age, gender, racial/ethnic background, socioeconomic strata, geographical location (such as North, East, South, West, Midwest), and different types of communities within the various parts of the country (such as rural, urban, suburban).

If the test were designed for use in the schools, norms might be obtained for students in every grade to which the test aimed to be applicable. Factors related to the representativeness of the school from which members of the norming sample were drawn might also be criteria for inclusion in or exclusion from the sample. For example, is the school the student attends publicly funded, privately funded, religiously oriented, military, or something else? How representative are the pupil/teacher ratios in the school under consideration? Does the school have a library, and if so, how many books are in it? These are only a sample of the types of questions that could be raised in assembling a normative sample to be used in the establishment of national norms. The precise nature of the questions raised when developing national norms will depend on whom the test is designed for and what the test is designed to do.

Norms from many different tests may all claim to have nationally representative samples. Still, close scrutiny of the description of the sample employed may reveal that the sample differs in many important respects from similar tests also claiming to be based on a nationally representative sample. For this reason, it is always a good idea to check the manual of the tests under consideration to see exactly how comparable the tests are. Two important questions that test users must raise as consumers of test-related information are "What are the differences between the tests I am considering for use in terms of their normative samples?" and "How comparable are these normative samples to the sample of testtakers with whom I will be using the test?"

National anchor norms Even the most casual survey of catalogues from various test publishers will reveal that, with respect to almost any human characteristic or ability, there exist many different tests purporting to measure the characteristic or ability.

Dozens of tests, for example, purport to measure reading. Suppose we select a reading test designed for use in grades 3 to 6, which, for the purposes of this hypothetical example, we call the Best Reading Test (BRT). Suppose further that we want to compare findings obtained on another national reading test designed for use with grades 3 to 6, the hypothetical XYZ Reading Test, with the BRT. An equivalency table for scores on the two tests, or national anchor norms, could provide the tool for such a comparison. Just as an anchor provides some stability to a vessel, so national anchor norms provide some stability to test scores by anchoring them to other test scores.

The method by which such equivalency tables or national anchor norms are established typically begins with the computation of percentile norms for each of the tests to be compared. Using the equipercentile method, the equivalency of scores on different tests is calculated with reference to corresponding percentile scores. Thus, if the 96th percentile corresponds to a score of 69 on the BRT and if the 96th percentile corresponds to a score of 14 on the XYZ, then we can say that a BRT score of 69 is equivalent to an XYZ score of 14. We should note that the national anchor norms for our hypothetical BRT and XYZ tests must have been obtained on the same sample: each member of the sample took both tests, and the equivalency tables were then calculated on the basis of these data.1 Although national anchor norms provide an indication of the equivalency of scores on various tests, technical considerations entail that it would be a mistake to treat these equivalencies as precise equalities (Angoff, 1964, 1966, 1971).
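The equipercentile logic described above can be sketched in a few lines of Python. The score values, list contents, and function names below are hypothetical illustrations, and operational anchor studies use smoothed percentile curves rather than this simple nearest-rank shortcut.

```python
def score_at_percentile(scores, p):
    """Score at the p-th percentile using the nearest-rank method."""
    s = sorted(scores)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def anchor_table(scores_a, scores_b, percentiles):
    """Equivalency (anchor) table: pair the score at each percentile on
    test A with the score at the same percentile on test B. Both score
    lists must come from the SAME sample (each person took both tests)."""
    return {p: (score_at_percentile(scores_a, p),
                score_at_percentile(scores_b, p))
            for p in percentiles}

# Hypothetical co-normed sample: each testtaker's BRT and XYZ scores
brt_scores = [42, 51, 55, 58, 60, 63, 65, 67, 68, 69]
xyz_scores = [8, 9, 10, 10, 11, 12, 12, 13, 13, 14]
table = anchor_table(brt_scores, xyz_scores, [50, 96])
print(table[96])  # 96th-percentile BRT score paired with the
                  # 96th-percentile XYZ score: (69, 14)
```

With these made-up data, a BRT score of 69 and an XYZ score of 14 both fall at the 96th percentile, so the table treats them as equivalent, mirroring the example in the text.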

Subgroup norms A normative sample can be segmented by any of the criteria initially used in selecting subjects for the sample. What results from such segmentation are more narrowly defined subgroup norms. Thus, for example, suppose criteria used in selecting children for inclusion in the XYZ Reading Test normative sample were age, educational level, socioeconomic level, geographic region, community type, and handedness (whether the child was right-handed or left-handed). The test manual or a supplement to it might report normative information by each of these subgroups. A community school board member might find the regional norms to be most useful, whereas a psychologist doing exploratory research in the area of brain lateralization and reading scores might find the handedness norms most useful.

Local norms Typically developed by test users themselves, local norms provide normative information with respect to the local population's performance on some test. A local company personnel director might find some nationally standardized test useful in making selection decisions but might deem the norms published in the test manual to be far afield of local job applicants' score distributions. Individual high schools may wish to develop their own school norms (local norms) for student scores on an examination that is administered statewide. A school guidance center may find that locally derived norms for a particular test, say, a survey of personal values, are more useful in counseling students than the national norms printed in the manual. Some test users use abbreviated forms of existing tests, which requires new norms. Some test users substitute one subtest for another within a larger test, thus creating the need for new norms. There are many different scenarios that would lead the prudent test user to develop local norms.

Fixed Reference Group Scoring Systems

Norms provide a context for interpreting the meaning of a test score. Another type of aid in providing a context for interpretation is termed a fixed reference group scoring

1. When two tests are normed from the same sample, the norming process is referred to as co-norming.

system. Here, the distribution of scores obtained on the test from one group of testtakers, referred to as the fixed reference group, is used as the basis for the calculation of test scores for future administrations of the test. Perhaps the test most familiar to college students that exemplifies the use of a fixed reference group scoring system is the SAT. This test was first administered in 1926. Its norms were then based on the mean and standard deviation of the people who took the test at the time. With passing years, more colleges became members of the College Board, the sponsoring organization for the test. It soon became evident that SAT scores tended to vary somewhat as a function of the time of year the test was administered. In an effort to ensure perpetual comparability and continuity of scores, a fixed reference group scoring system was put into place in 1941. The distribution of scores from the 11,000 people who took the SAT in 1941 was immortalized as a standard to be used in the conversion of raw scores on future administrations of the test.2 A new fixed reference group, which consisted of the more than 2 million testtakers who completed the SAT in 1990, began to be used in 1995. A score of 500 on the SAT corresponds to the mean obtained by the 1990 sample, a score of 400 corresponds to a score that is 1 standard deviation below the 1990 mean, and so forth.

As an example, suppose John took the SAT in 1995 and answered 50 items correctly on a particular scale. And let's say Mary took the test in 2008 and, just like John, answered 50 items correctly. Although John and Mary may have achieved the same raw score, they would not necessarily achieve the same scaled score. If, for example, the 2008 version of the test was judged to be somewhat easier than the 1995 version, then scaled scores for the 2008 testtakers would be calibrated downward. This would be done so as to make scores earned in 2008 comparable to scores earned in 1995.

Test items common to each new version of the SAT and each previous version of it are employed in a procedure (termed anchoring) that permits the conversion of raw scores on the new version of the test into fixed reference group scores. Like other fixed reference group scores, including Graduate Record Examination scores, SAT scores are most typically interpreted by local decision-making bodies with respect to local norms. Thus, for example, college admissions officers usually rely on their own independently collected norms to make selection decisions. They will typically compare applicants' SAT scores to the SAT scores of students in their school who completed or failed to complete their program. Of course, admissions decisions are seldom made on the basis of the SAT (or any other single test) alone. Various criteria are typically evaluated in admissions decisions.
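The core idea of anchoring scores to a fixed reference group can be illustrated with a simple linear conversion. The reference-group statistics below (raw mean 48, SD 12) are invented for illustration; the College Board's actual equating procedure is far more elaborate than this sketch.

```python
def to_fixed_reference_scale(raw, ref_mean, ref_sd,
                             scale_mean=500, scale_sd=100):
    """Convert a raw score to a scaled score anchored to the raw-score
    mean and standard deviation of a fixed reference group."""
    z = (raw - ref_mean) / ref_sd        # standing within the reference group
    return round(scale_mean + scale_sd * z)

# Hypothetical reference-group statistics: raw mean 48, raw SD 12
print(to_fixed_reference_scale(48, 48, 12))  # at the reference mean -> 500
print(to_fixed_reference_scale(36, 48, 12))  # 1 SD below the mean -> 400
```

Because the reference group's mean and SD are frozen, a raw score from any later administration lands on the same 500-centered scale, which is what makes scores from different years comparable.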

Norm-Referenced versus Criterion-Referenced Evaluation

One way to derive meaning from a test score is to evaluate the test score in relation to other scores on the same test. As we have pointed out, this approach to evaluation is referred to as norm-referenced. Another way to derive meaning from a test score is to evaluate it on the basis of whether or not some criterion has been met. We may define a criterion as a standard on which a judgment or decision may be based. Criterion-referenced testing and assessment may be defined as a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard. Some examples: To be eligible for a high-school diploma, students must demonstrate at least a sixth-grade reading level.

2. Conceptually, the idea of a fixed reference group is analogous to the idea of a fixed reference foot, the foot of the English king that also became immortalized as a measurement standard (Angoff, 1962).

To earn the privilege of driving an automobile, would-be drivers must take a road test and demonstrate their driving skill to the satisfaction of a state-appointed examiner. To be licensed as a psychologist, the applicant must achieve a score that meets or exceeds the score mandated by the state on the licensing test.

The criterion in criterion-referenced assessments typically derives from the values or standards of an individual or organization. For example, in order to earn a black belt in karate, students must demonstrate a black-belt level of proficiency in karate and meet related criteria such as those related to self-discipline and focus. Each student is evaluated individually to see if all of these criteria are met. Regardless of the level of performance of all the testtakers, only students who meet all the criteria will leave the dojo (training room) with a brand-new black belt.

J U S T T H I N K . . .
List other examples of a criterion that must be met in order to gain privileges or access of some sort.

Criterion-referenced testing and assessment goes by other names. Because the focus in the criterion-referenced approach is on how scores relate to a particular content area or domain, the approach has also been referred to as domain- or content-referenced testing and assessment.3 One way of conceptualizing the difference between norm-referenced and criterion-referenced approaches to assessment has to do with the area of focus regarding test results. In norm-referenced interpretations of test data, a usual area of focus is how an individual performed relative to other people who took the test. In criterion-referenced interpretations of test data, a usual area of focus is the testtaker's performance: what the testtaker can do or not do; what the testtaker has or has not learned; whether the testtaker does or does not meet specified criteria for inclusion in some group, access to certain privileges, and so forth. Because criterion-referenced tests are frequently used to gauge achievement or mastery, they are sometimes referred to as mastery tests. The criterion-referenced approach has enjoyed widespread acceptance in the field of computer-assisted education programs. In such programs, mastery of segments of materials is assessed before the program user can proceed to the next level.

Has this flight trainee mastered the material she needs to be an airline pilot? This is the type of question that an airline personnel office might seek to address with a mastery test on a flight simulator.
If a standard, or criterion, for passing a hypothetical Airline Pilot Test (APT) has been set at 85% correct, then trainees who score 84% correct or less will not pass. It matters not whether they scored 84% or 42%. Conversely, trainees who score 85% or better on the test will pass whether they scored 85% or 100%. All who score 85% or better are said to have mastered the skills and knowledge necessary to be an airline pilot. Taking this example one step further, another airline might find it useful to set up three categories of findings based on criterion-referenced interpretation of test scores:

85% or better correct = pass
75% to 84% correct = retest after a two-month refresher course
74% or less correct = fail
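The three-category scheme above amounts to a pair of cut scores. A minimal sketch in Python, keeping the hypothetical APT example from the text:

```python
def classify_apt(percent_correct):
    """Three-category criterion-referenced interpretation for the
    hypothetical Airline Pilot Test (APT): two cut scores, 85 and 75."""
    if percent_correct >= 85:
        return "pass"
    if percent_correct >= 75:
        return "retest after a two-month refresher course"
    return "fail"

for score in (100, 85, 84, 75, 74, 42):
    print(score, "->", classify_apt(score))
```

Note that the classification depends only on each trainee's own score relative to the fixed cut points, never on how other trainees performed, which is exactly what distinguishes this interpretation from a norm-referenced one.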

3. Although acknowledging that content-referenced interpretations can be referred to as criterion-referenced interpretations, the 1974 edition of the Standards for Educational and Psychological Testing also noted a technical distinction between interpretations so designated: "Content-referenced interpretations are those where the score is directly interpreted in terms of performance at each point on the achievement continuum being measured. Criterion-referenced interpretations are those where the score is directly interpreted in terms of performance at any given point on the continuum of an external variable. An external criterion variable might be grade averages or levels of job performance" (p. 19; footnote in original omitted).

How should cut scores in mastery testing be determined? How many and what kinds of test items are needed to demonstrate mastery in a given field? The answers to these and related questions have been tackled in diverse ways (Cizek & Bunch, 2007; Ferguson & Novick, 1973; Glaser & Nitko, 1971; Panell & Laabs, 1979).

Critics of the criterion-referenced approach argue that if it is strictly followed, potentially important information about an individual's performance relative to other testtakers is lost. Another criticism is that although this approach may have value with respect to the assessment of mastery of basic knowledge, skills, or both, it has little or no meaningful application at the upper end of the knowledge/skill continuum. Thus, the approach is clearly meaningful in evaluating whether pupils have mastered basic reading, writing, and arithmetic. But how useful is it in evaluating doctoral-level writing or math? Identifying stand-alone originality or brilliant analytic ability is not the stuff of which criterion-oriented tests are made. By contrast, brilliance and superior abilities are recognizable in tests that employ norm-referenced interpretations. They are the scores that trail off all the way to the right on the normal curve, past the third standard deviation.

J U S T T H I N K . . .
For licensing of physicians, psychologists, engineers, and other professionals, would you advocate that your state use criterion- or norm-referenced assessment? Why?

Norm-referenced and criterion-referenced are two of many ways that test data may be viewed and interpreted. However, these terms are not mutually exclusive, and the use of one approach with a set of test data does not necessarily preclude the use of the other approach for another application. In a sense, all testing is ultimately normative, even if the scores are as seemingly criterion-referenced as pass/fail. This is so because even in a pass/fail score there is an inherent acknowledgment of a continuum of abilities. At some point in that continuum, a dichotomizing cutoff point has been applied.

We now proceed to a discussion of another word that, along with impeach and percentile, would easily make a national list of Frequently Used but Little Understood Terminology. The word is correlation, a word that enjoys widespread confusion with the concept of causation. Let's state at the outset that correlation is not synonymous with causation. But what does correlation mean? And what is meant by regression? Read on.

Correlation and Inference

Central to psychological testing and assessment are inferences (deduced conclusions) about how some things (such as traits, abilities, or interests) are related to other things (such as behavior). A coefficient of correlation (or correlation coefficient) is a number that provides us with an index of the strength of the relationship between two things. An understanding of the concept of correlation and an ability to compute a coefficient of correlation is therefore central to the study of tests and measurement.

The Concept of Correlation

Simply stated, correlation is an expression of the degree and direction of correspondence between two things. A coefficient of correlation (r) expresses a linear relationship between two (and only two) variables, usually continuous in nature. It reflects the degree of concomitant variation between variable X and variable Y. The coefficient of correlation is the numerical index that expresses this relationship: It tells us the extent to which X and Y are "co-related."

The meaning of a correlation coefficient is interpreted by its sign and magnitude. If a correlation coefficient were a person asked "What's your sign?" it wouldn't answer anything like "Leo" or "Pisces." It would answer "plus" (for a positive correlation), "minus" (for a negative correlation), or "none" (in the rare instance that the correlation coefficient was exactly equal to zero). If asked to supply information about its magnitude, it would respond with a number anywhere at all between −1 and +1. And here is a rather intriguing fact about the magnitude of a correlation coefficient: It is judged by its absolute value. This means that to the extent that we are impressed by correlation coefficients, a correlation of −.99 is every bit as impressive as a correlation of +.99. To understand why, you need to know a bit more about correlation.

Ahh . . . a perfect correlation! Let me count the ways. Well, actually there are only two ways. The two ways to describe a perfect correlation between two variables are as either +1 or −1. If a correlation coefficient has a value of +1 or −1, then the relationship between the two variables being correlated is perfect, without error in the statistical sense. And just as perfection in almost anything is difficult to find, so too are perfect correlations. It's challenging to try to think of any two variables in psychological work that are perfectly correlated. Perhaps that is why, if you look in the margin, you are asked to just think about it.

J U S T T H I N K . . .
What two psychological variables are perfectly correlated? How about two psychological variables that just come close to being perfectly correlated?

If two variables simultaneously increase or simultaneously decrease, then those two variables are said to be positively (or directly) correlated. The height and weight of normal, healthy children ranging in age from birth to 10 years tend to be positively or directly correlated. As children get older, their height and their weight generally increase simultaneously. A positive correlation also exists when two variables simultaneously decrease. For example, the less preparation a student does for an examination, the lower the score on the examination. A negative (or inverse) correlation occurs when one variable increases while the other variable decreases. For example, there tends to be an inverse relationship between the number of miles on your car's odometer (mileage indicator) and the number of dollars a car dealer is willing to give you on a trade-in allowance; all other things being equal, as the mileage increases, the number of dollars offered on trade-in decreases.

If a correlation is zero, then absolutely no relationship exists between the two variables. And some might consider "perfectly no correlation" to be a third variety of perfect correlation; that is, a perfect noncorrelation. After all, just as it is nearly impossible in psychological work to identify two variables that have a perfect correlation, so it is nearly impossible to identify two variables that have a zero correlation. Most of the time, two variables will be fractionally correlated. The fractional correlation may be extremely small but seldom perfectly zero.

J U S T T H I N K . . .
What two psychological variables have a zero correlation? How about two psychological variables that just come close to having a zero correlation?

As we stated in our introduction to this topic, correlation is often confused with causation. It must be emphasized that a correlation coefficient is merely an index of the relationship between two variables, not an index of the causal relationship between two variables. If you were told, for example, that from birth to age 9 there is a high positive correlation between hat size and spelling ability, would it be appropriate to conclude that hat size causes spelling ability? Of course not. The period from birth to age 9 is a time of maturation in all areas, including physical size and cognitive abilities such as spelling. Intellectual development parallels physical development during these years, and a relationship clearly exists between physical and mental growth. Still, this doesn't mean that the relationship between hat size and spelling ability is causal.

Although correlation does not imply causation, there is an implication of prediction. Stated another way, if we know that there is a high correlation between X and Y, then we should be able to predict, with various degrees of accuracy depending on other factors, the value of one of these variables if we know the value of the other.

The Pearson r

Many techniques have been devised to measure correlation. The most widely used of all is the Pearson r, also known as the Pearson correlation coefficient and the Pearson product-moment coefficient of correlation. Devised by Karl Pearson (Figure 4-2), r can be the statistical tool of choice when the relationship between the variables is linear and when the two variables being correlated are continuous (that is, they can theoretically take any value). Other correlational techniques can be employed with data that are discontinuous and where the relationship is nonlinear. The formula for the Pearson r takes into account the relative position of each test score or measurement with respect to the mean of the distribution.

A number of formulas can be used to calculate a Pearson r. One formula requires that we convert each raw score to a standard score and then multiply each pair of standard scores. A mean for the sum of the products is calculated, and that mean is the value of the Pearson r. Even from this simple verbal conceptualization of the Pearson r, it can be seen that the sign of the resulting r would be a function of the sign and the magnitude of the standard scores used. If, for example, negative standard score values for measurements of X always corresponded with negative standard score values for Y scores, the resulting r would be positive (because the product of two negative values is positive). Similarly, if positive standard score values on X always corresponded with positive standard score values on Y, the resulting correlation would also be positive. However, if positive standard score values for X corresponded with negative standard score values for Y and vice versa, then an inverse relationship would exist and so a negative correlation would result. A zero or near-zero correlation could result when some products are positive and some are negative.

The formula used to calculate a Pearson r from raw scores is

r = Σ(X − X̄)(Y − Ȳ) / √{[Σ(X − X̄)²][Σ(Y − Ȳ)²]}

This formula has been simplified for shortcut purposes. One such shortcut is a deviation formula employing "little x," or x, in place of X − X̄, and "little y," or y, in place of Y − Ȳ:

r = Σxy / √[(Σx²)(Σy²)]

Another formula for calculating a Pearson r is

r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}

Although this formula looks more complicated than the previous deviation formula, it is easier to use. Here N represents the number of paired scores; ΣXY is the sum of the products of the paired X and Y scores; ΣX is the sum of the X scores; ΣY is the sum of the Y scores; ΣX² is the sum of the squared X scores; and ΣY² is the sum of the squared Y scores. Similar results are obtained with the use of each formula.
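Both routes to r, the mean-of-products-of-standard-scores conceptualization and the raw-score formula, can be sketched in Python and checked against each other. The paired scores below are made up for illustration.

```python
import math

def pearson_r_standard_scores(xs, ys):
    """Pearson r as the mean of the products of paired standard scores
    (z-scores computed with the population SD, dividing by N)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / n

def pearson_r_raw_score(xs, ys):
    """Pearson r by the raw-score (computational) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2, sy2 = sum(x * x for x in xs), sum(y * y for y in ys)
    return ((n * sxy - sx * sy) /
            math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)))

xs = [2, 4, 6, 8, 10]   # hypothetical paired scores
ys = [3, 5, 4, 9, 10]
print(round(pearson_r_standard_scores(xs, ys), 4))
print(round(pearson_r_raw_score(xs, ys), 4))  # same value either way
```

As the text notes, the two formulas are algebraically equivalent; the raw-score version is simply more convenient for hand or machine computation because it avoids computing a deviation for every score.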

Figure 4-2 Karl Pearson (1857-1936)
Pictured here with his daughter is Karl Pearson, whose name has become synonymous with correlation. History records, however, that it was actually Sir Francis Galton who should be credited with developing the concept of correlation (Magnello & Spies, 1984). Galton experimented with many formulas to measure correlation, including one he labeled r. Pearson, a contemporary of Galton's, modified Galton's r, and the rest, as they say, is history. The Pearson r eventually became the most widely used measure of correlation.

The next logical question concerns what to do with the number obtained for the value of r. The answer is that you ask even more questions, such as "Is this number statistically significant given the size and nature of the sample?" or "Could this result have occurred by chance?" At this point, you will need to consult tables of significance for the Pearson r, tables that are probably in the back of your old statistics textbook. In those tables you will find, for example, that a Pearson r of .899 with an N = 10 is significant at the .01 level (using a two-tailed test). You will recall from your statistics course that significance at the .01 level tells you, with reference to these data, that a correlation such as this could have been expected to occur merely by chance only one time or less in a hundred if X and Y are not correlated in the population. You will also recall that significance at either the .01 level or the (somewhat less rigorous) .05 level provides a basis for concluding that a correlation does indeed exist. Significance at the .05 level means that the result could have been expected to occur by chance alone five times or less in a hundred.

The value obtained for the coefficient of correlation can be further interpreted by deriving from it what is called a coefficient of determination, or r². The coefficient of determination is an indication of how much variance is shared by the X- and the Y-variables. The calculation of r² is quite straightforward. Simply square the correlation coefficient and multiply by 100; the result is equal to the percentage of the variance accounted for. If, for example, you calculated r to be .9, then r² would be equal to .81. The number .81 tells us that 81% of the variance is accounted for by the X- and Y-variables. The remaining variance, equal to 100(1 − r²), or 19%, could presumably be accounted for by chance, error, or otherwise unmeasured or unexplainable factors.4
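The arithmetic for the coefficient of determination is short enough to verify directly; the function name here is just an illustrative label.

```python
def variance_accounted_for(r):
    """Coefficient of determination (r squared), expressed as the
    percentage of variance shared by X and Y."""
    return round(r ** 2 * 100, 2)

r = 0.9
explained = variance_accounted_for(r)        # 81% shared variance
unexplained = round(100 * (1 - r ** 2), 2)   # 19% left to chance or error
print(explained, unexplained)
```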

7. On a technical note, Ozer (1985) cautioned that the actual estimation of a coefficient of determination must be made with scrupulous regard to the assumptions operative in the particular case. Evaluating a coefficient of determination solely in terms of the variance accounted for may lead to interpretations that underestimate the magnitude of a relation.
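For readers who like to see the arithmetic, here is a minimal sketch, in Python with invented paired scores, of computing a Pearson r and deriving the coefficient of determination from it. The function name and the data are ours, for illustration only; they do not come from any test described in this chapter.

```python
# A minimal sketch of Pearson r and the coefficient of determination (r^2),
# in pure Python. The paired scores below are hypothetical.
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired scores for N = 5 testtakers
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 6]

r = pearson_r(x_scores, y_scores)
r_squared = r ** 2  # proportion of variance shared by X and Y
print(f"r = {r:.2f}; variance accounted for = {r_squared * 100:.0f}%")
```

Squaring r and multiplying by 100, as described above, turns the index of correlation into a percentage of shared variance.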

Before moving on to consider another index of correlation, let's address a logical question sometimes raised by students when they hear the Pearson r referred to as the product-moment coefficient of correlation. Why is it called that? The answer is a little complicated, but here goes.

In the language of psychometrics, a moment describes a deviation about a mean of a distribution. Individual deviations about the mean of a distribution are referred to as deviates. Deviates are referred to as the first moments of the distribution. The second moments of the distribution are the moments squared. The third moments of the distribution are the moments cubed, and so forth. The computation of the Pearson r in one of its many formulas entails multiplying corresponding standard scores on two measures. One way of conceptualizing standard scores is as the first moments of a distribution. This is because standard scores are deviates about a mean of zero. A formula that entails the multiplication of two corresponding standard scores can therefore be conceptualized as one that entails the computation of the product of corresponding moments. And there you have the reason r is called product-moment correlation. It's probably all more a matter of psychometric trivia than anything else, but we think it's cool to know. Further, you can now understand the rather high-end humor contained in the cartoon.
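The product-moment idea can be illustrated directly: convert each set of raw scores to standard scores, multiply corresponding pairs, and average the products. A short sketch follows, again with made-up data; using the population standard deviation here makes the average of the products come out exactly equal to r.

```python
# Illustrating the "product-moment" idea: the mean of the products of
# corresponding standard scores equals Pearson r. Data are hypothetical.
import math

def z_scores(values):
    """Convert raw scores to standard scores (deviates about a mean of zero)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD
    return [(v - mean) / sd for v in values]

x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 6]

zx = z_scores(x_scores)
zy = z_scores(y_scores)

# Pearson r as the mean of the products of corresponding "first moments"
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)
print(f"r = {r:.2f}")
```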

The Spearman Rho

The Pearson r enjoys such widespread use and acceptance as an index of correlation that if for some reason it is not used to compute a correlation coefficient, mention is made of the statistic that was used. There are many alternative ways to derive a coefficient of correlation. One commonly used alternative statistic is variously called a rank-order correlation coefficient, a rank-difference correlation coefficient, or simply Spearman's rho. Developed by Charles Spearman, a British psychologist (Figure 4-3), this coefficient of correlation is frequently used when the sample size is small (fewer than 30 pairs

Figure 4-3
Charles Spearman (1863–1945)
Charles Spearman is best known as the developer of the Spearman rho statistic and the Spearman-Brown prophecy formula, which is used to prophesize the accuracy of tests of different sizes. Spearman is also credited with being the father of a statistical method called factor analysis, discussed later in this text.

of measurements) and especially when both sets of measurements are in ordinal (or rank-order) form. Special tables are used to determine if an obtained rho coefficient is or is not significant.
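A sketch of the rank-difference route to rho follows, using the classic formula rho = 1 − 6Σd²/[n(n² − 1)], which assumes no tied ranks. The two judges and their ratings are invented for illustration.

```python
# Spearman's rho via the rank-difference formula (no tied ranks assumed).
# The judges' ratings below are hypothetical ordinal data.

def ranks(values):
    """Rank the values from 1 (lowest) to n (highest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    # d is the difference between the two ranks assigned to each case
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Two judges' ratings of the same five portfolios
judge_1 = [10, 20, 30, 40, 50]
judge_2 = [3, 1, 4, 2, 5]

print(f"rho = {spearman_rho(judge_1, judge_2):.2f}")
```

With only five pairs of ordinal ratings, this is exactly the small-sample, rank-order situation in which rho is preferred over the Pearson r.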

Graphic Representations of Correlation

One type of graphic representation of correlation is referred to by many names, including a bivariate distribution, a scatter diagram, a scattergram, or (our favorite) a scatterplot. A scatterplot is a simple graphing of the coordinate points for values of the X-variable (placed along the graph's horizontal axis) and the Y-variable (placed along the graph's vertical axis). Scatterplots are useful because they provide a quick indication of the direction and magnitude of the relationship, if any, between the two variables. Figures 4-4 and 4-5 offer a quick course in eyeballing the nature and degree of correlation by means of scatterplots. To distinguish positive from negative correlations, note the direction of the curve. And to estimate the strength, or magnitude, of the correlation, note the degree to which the points form a straight line.

Scatterplots are useful in revealing the presence of curvilinearity in a relationship. As you may have guessed, curvilinearity in this context refers to an eyeball gauge of how curved a graph is. Remember that a Pearson r should be used only if the relationship between the variables is linear. If the graph does not appear to take the form of a straight line, the chances are good that the relationship is not linear (Figure 4-6). When the relationship is nonlinear, other statistical tools and techniques may be employed.8

8. The specific statistic to be employed will depend at least in part on the suspected reason for the nonlinearity. For example, if it is believed that the nonlinearity is due to one distribution being highly skewed because of a poor measuring instrument, then the skewed distribution may be statistically normalized and the result may be a correction of the curvilinearity. If, even after graphing the data, a question remains concerning the linearity of the correlation, a statistic called eta squared (η²) can be used to calculate the exact degree of curvilinearity.

Figure 4-4
Scatterplots and Correlations for Positive Values of r
[Six panels, (a) through (f), show scatterplots for correlation coefficients of 0, .40, .60, .80, .90, and .95; the points cluster more tightly around a straight line as r increases.]

Figure 4-5
Scatterplots and Correlations for Negative Values of r
[Six panels, (a) through (f), show scatterplots for correlation coefficients of −.30, −.50, −.70, −.90, −.95, and −.99; the points slope downward and cluster more tightly around a straight line as the magnitude of r increases.]

Figure 4-6
Scatterplot Showing a Nonlinear Correlation

Figure 4-7
Scatterplot Showing an Outlier

A graph also makes the spotting of outliers relatively easy. An outlier is an extremely atypical point located at a relatively long distance (an outlying distance) from the rest of the coordinate points in a scatterplot (Figure 4-7). Outliers stimulate interpreters of test data to speculate about the reason for the atypical score. For example, consider an outlier on a scatterplot that reflects a correlation between hours each member of a fifth-grade class spent studying and their grades on a 20-item spelling test. And let's say one student studied for 10 hours and received a failing grade. This outlier on the scatterplot might raise a red flag and compel the test user to raise some important questions, such as "How effective are this student's study skills and habits?" or "What was this student's state of mind during the test?"

In some cases, outliers are simply the result of administering a test to a very small sample of testtakers. In the example just cited, if the test were given statewide to fifth-graders and the sample size were much larger, perhaps many more low scorers who put in large amounts of study time would be identified.

As is the case with very low raw scores or raw scores of zero, outliers can sometimes help identify a testtaker who did not understand the instructions, was not able to follow the instructions, or was simply oppositional and did not follow the instructions. In other cases, an outlier can provide a hint of some deficiency in the testing or scoring procedures.

People who have occasion to use or make interpretations from graphed data need to know if the range of scores has been restricted in any way. To understand why this is so necessary to know, consider Figure 4-8. Let's say that graph A describes the relationship between Public University entrance test scores for 600 applicants (all of whom were later admitted) and their grade point averages at the end of the first semester.
The scatterplot indicates that the relationship between entrance test scores and grade point average is both linear and positive. But what if the admissions officer had accepted only the applications of the students who scored within the top half or so on the entrance exam? To a trained eye, this scatterplot (graph B) appears to indicate a weaker correlation than that indicated in graph A, an effect attributable exclusively to the restriction of range. Graph B is less a straight line than graph A, and its direction is not as obvious.
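The restriction-of-range effect is easy to demonstrate numerically. The sketch below (with fabricated scores, not Public University data) computes a Pearson r for a full sample and again for only the top half of the X distribution; the restricted correlation comes out smaller, just as graph B suggests.

```python
# A toy demonstration of restriction of range: Pearson r computed on a
# full sample and again on only the high scorers. Data are fabricated.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Entrance-exam scores (x) and GPAs (y) for ten admitted applicants
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_full = pearson_r(x, y)

# Keep only applicants who scored in the top half of the exam
top = [(a, b) for a, b in zip(x, y) if a >= 6]
r_restricted = pearson_r([a for a, _ in top], [b for _, b in top])

print(f"full range: r = {r_full:.2f}; restricted range: r = {r_restricted:.2f}")
```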

Figure 4-8
Two Scatterplots Illustrating Unrestricted and Restricted Ranges
[Graph A plots grade-point average against entrance test scores (0 to 100) for the unrestricted range; graph B plots the same variables for the restricted range.]

Regression

In everyday language, the word regression is synonymous with "reversion to some previous state." In the language of statistics, regression also describes a kind of reversion: a reversion to the mean over time or generations (or at least that is what it meant originally).

Regression may be defined broadly as the analysis of relationships among variables for the purpose of understanding how one variable may predict another. Simple regression involves one independent variable (X), typically referred to as the predictor variable, and one dependent variable (Y), typically referred to as the outcome variable. Simple regression analysis results in an equation for a regression line. The regression line is the line of best fit: the straight line that, in one sense, comes closest to the greatest number of points on the scatterplot of X and Y.

Does the following equation look familiar?

Y = a + bX

In high-school algebra, you were probably taught that this is the equation for a straight line. It's also the equation for a regression line. In the formula, a and b are regression coefficients; b is equal to the slope of the line, and a is the intercept, a constant indicating where the line crosses the Y-axis. The regression line represented by specific values of a and b is fitted precisely to the points on the scatterplot so that the sum of the squared vertical distances from the points to the line will be smaller than for any other line that could be drawn through the same scatterplot. Although finding the equation for the regression line might seem difficult, the values of a and b can be determined through simple algebraic calculations.
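Those "simple algebraic calculations" can be sketched as follows: the least-squares solution sets b equal to the sum of cross-products of deviations divided by the sum of squared deviations on X, and a equal to the mean of Y minus b times the mean of X. The data below are hypothetical.

```python
# Least-squares regression coefficients for Y = a + bX:
# b = S_xy / S_xx, and a = mean(Y) - b * mean(X). Data are hypothetical.

def fit_regression_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx   # slope of the regression line
    a = my - b * mx  # Y-intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
a, b = fit_regression_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f}X")
```

These values of a and b minimize the sum of squared vertical distances from the points to the line, as described above.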

Figure 4-9
Graphic Representation of Regression Line
The correlation between X and Y is 0.76. The equation for this regression line is Y = 0.82 + 0.03(X); for each unit increase on X (the dental school entrance examination score), the predicted value of Y (the first-year grade point average) is expected to increase by .03 unit. The standard error of the estimate for this prediction is 0.49.
[The scatterplot plots first-year GPA (0.0 to 4.0) against scores on the dental school entrance exam (5 to 95), with the regression line drawn through the points.]

The primary use of a regression equation in testing is to predict one score or variable from another. For example, suppose a dean at the De Sade School of Dentistry wishes to predict what grade point average (GPA) an applicant might have after the first year at De Sade. The dean would accumulate data about current students' scores on the dental college entrance examination and end-of-the-first-year GPA. These data would then be used to help predict the GPA (Y) from the score on the dental college admissions test (X). Individual dental students are represented by points in the scatterplot in Figure 4-9. The equation for the regression line is computed from these data, which means that the values of a and b are calculated. In this hypothetical case,

GPA = 0.82 + 0.03 (entrance exam)

This line has been drawn onto the scatterplot in Figure 4-9.

Using the regression line, the likely value of Y (the GPA) can be predicted based on specific values of X (the entrance exam) by plugging the X-value into the equation. A student with an entrance exam score of 50 would be expected to have a GPA of about 2.3. A student with an entrance exam score of 85 would be expected to earn a GPA of about 3.4. This prediction could also be done graphically by tracing a particular value on the

X-axis (the entrance exam score) up to the regression line, and then straight across to the Y-axis, reading off the predicted GPA.

Of course, all students who get an entrance exam score of 50 do not earn the same GPA. This can be seen in Figure 4-9 by tracing from any specific entrance exam score on the X-axis up to the cloud of points surrounding the regression line. This is what is meant by error in prediction: Each of these students would be predicted to get the same GPA based on the entrance exam, but in fact they earned different GPAs. This error in the prediction of Y from X is represented by the standard error of the estimate. As you might expect, the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the standard error of the estimate.
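As a rough illustration of prediction and error in prediction, the sketch below plugs scores into the chapter's regression equation and computes a standard error of the estimate from a handful of invented observed GPAs (taken here, for simplicity, as the root mean squared residual).

```python
# Using the chapter's regression equation, GPA = 0.82 + 0.03X, to generate
# predictions, plus a standard error of the estimate computed from
# hypothetical residuals.
import math

def predict_gpa(entrance_score):
    """Predicted first-year GPA from the (hypothetical) De Sade equation."""
    return 0.82 + 0.03 * entrance_score

def standard_error_of_estimate(actual, predicted):
    """Root mean squared error of prediction across the sample."""
    n = len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(sse / n)

print(f"Score 50 -> predicted GPA {predict_gpa(50):.2f}")

# Hypothetical observed GPAs for five students and their predicted values
scores = [30, 40, 50, 60, 70]
observed = [1.9, 2.2, 2.1, 2.8, 3.1]
predicted = [predict_gpa(s) for s in scores]
see = standard_error_of_estimate(observed, predicted)
print(f"standard error of the estimate = {see:.2f}")
```

Note that every student with a score of 50 gets the same predicted GPA; the spread of their actual GPAs around that prediction is what the standard error of the estimate summarizes.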

Multiple regression Suppose that the dean suspects that the prediction of GPA will be enhanced if another test score, say, a score on a test of fine motor skills, is also used as a predictor. The use of more than one score to predict Y requires the use of a multiple regression equation.

The multiple regression equation takes into account the intercorrelations among all the variables involved. The correlation between each of the predictor scores and what is being predicted is reflected in the weight given to each predictor. In this case, the weights reflect how strongly the entrance exam and the fine motor skills test each correlate with GPA in the first year of dental school. Predictors that correlate highly with the predicted variable are generally given more weight. This means that their regression coefficients (referred to as b-values) are larger. No surprise there. We would expect test users to pay the most attention to predictors that predict Y best.

The multiple regression equation also takes into account the correlations among the predictor scores. In this case, it takes into account the correlation between the dental college admissions test scores and scores on the fine motor skills test. If many predictors are used and if one is not correlated with any of the other predictors but is correlated with the predicted score, then that predictor may be given relatively more weight because it is providing unique information. In contrast, if two predictor scores are highly correlated with each other, then they could be providing redundant information. If both were kept in the regression equation, each might be given less weight so that they would share the prediction of Y.

More predictors are not necessarily better. If two predictors are providing the same information, the person using the regression equation may decide to use only one of them for the sake of efficiency.
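The weighting just described is what a least-squares fit with two predictors produces automatically. A sketch using NumPy's least-squares solver follows; the exam, motor-skills, and GPA values are fabricated for illustration and are not the De Sade data.

```python
# A two-predictor multiple regression fit via NumPy's least-squares solver.
# All data below are fabricated for illustration.
import numpy as np

# Predictor 1: entrance exam; predictor 2: fine motor skills test
exam  = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
motor = np.array([55.0, 52.0, 61.0, 68.0, 75.0])
gpa   = np.array([2.0, 2.4, 2.7, 3.1, 3.5])

# Design matrix with a leading column of ones for the intercept term
X = np.column_stack([np.ones_like(exam), exam, motor])
coefs, *_ = np.linalg.lstsq(X, gpa, rcond=None)
a, b_exam, b_motor = coefs

print(f"GPA = {a:.2f} + {b_exam:.3f}*exam + {b_motor:.3f}*motor")
```

The relative sizes of the two b-values reflect how much unique predictive information each test contributes once their intercorrelation is taken into account.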
If the De Sade dean observed that dental school admission test scores and scores on the test of fine motor skills were highly correlated with each other and that each of these scores correlated about the same with GPA, the dean might decide to use only one predictor because nothing was gained by the addition of the second predictor.

When it comes to matters of correlation, regression, and related statistical tools, some students are known to "tune out" (for lack of a more sophisticated psychological term). They see such topics as little more than obstacles to be overcome to a career (a statistics-free career) in psychology. In truth, these students might be quite surprised to learn where, later in their careers, the need for knowledge of correlation, regression, and related statistical tools may arise. For example, who would believe that knowledge of regression and how to formulate regression equations might be useful to an NBA team? Certainly, the two sports psychologists you are about to meet would believe it (see Meet an Assessment Professional; make that, Meet a Team of Assessment Professionals).

MEET AN ASSESSMENT PROFESSIONAL

Meet Dr. Steve Julius and Dr. Howard W. Atlas

The Chicago Bulls of the 1990s is considered one of the great dynasties in sports, as witnessed by their six world championships in that decade . . . The team benefited from great individual contributors, but like all successful organizations, the Bulls were always on the lookout for ways to maintain a competitive edge. The Bulls . . . were one of the first NBA franchises to apply personality testing and behavioral interviewing to aid in the selection of college players during the annual draft, as well as in the evaluation of goodness-of-fit when considering the addition of free agents. The purpose of this effort was not to rule out psychopathology, but rather to evaluate a range of competencies (e.g., resilience, relationship to authority, team orientation) that were deemed necessary for success in the league, in general, and the Chicago Bulls, in particular.

[The team utilized] commonly used and well-validated personality assessment tools and techniques from the world of business (e.g., 16PF, fifth edition). . . . Eventually, sufficient data was collected to allow for the validation of a regression formula, useful as a prediction tool in its own right. In addition to selection, the information collected on the athletes often is used to assist the coaching staff in their efforts to motivate and instruct players, as well as to create an atmosphere of collaboration.

Steve Julius, Ph.D., Sports Psychologist, Chicago Bulls
Howard W. Atlas, Ed.D., Sports Psychologist, Chicago Bulls

Read more of what Dr. Atlas and Dr. Julius had to say in their complete essay at www.mhhe.com/cohentesting7.

Inference from Measurement

Correlation, regression, and multiple regression are all statistical tools used to help ensure that predictions or inferences drawn from test data are reasonable and, to the extent that is technically possible, accurate. Another statistical tool that can be helpful in achieving such objectives is meta-analysis.

Meta-Analysis

Generally, the best estimate of the correlation between two variables is most likely to come not from a single study alone but from analysis of the data from several studies. However, the data from several studies are likely to contain correlation coefficients as well as other statistics that differ for each study. One option to facilitate understanding of the research across a number of studies is to present the range of statistical values that appear in various studies: "The correlation between variable X and variable Y ranges from .73 to .91." Another option is to combine statistically the information across the various studies. The term meta-analysis refers to a family of techniques used to statistically combine information across studies to produce single estimates of the statistics being studied. See, for example, Kuncel et al.'s (2001) meta-analysis of the Graduate Record Examination (GRE). Using a number of published studies, these researchers explored the predictive value of the GRE and undergraduate GPA for performance in graduate school.

A key advantage of meta-analysis over simply reporting a range of findings is that, in meta-analysis, more weight can be given to studies that have larger numbers of subjects. This weighting process results in more accurate estimates (Hunter & Schmidt, 1990). Despite this fact as well as other advantages (Hall & Rosenthal, 1995), meta-analysis remains, to some degree, art as well as science. The value of any meta-analytic investigation is very much a matter of the skill and ability of the meta-analyst (Kavale, 1995), and use of an incorrect meta-analytic method can lead to misleading conclusions (Kisamore & Brannick, 2008).
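The weighting of studies by sample size can be sketched in a few lines. The study results below are fabricated, and real meta-analyses typically also transform and correct the correlations before combining them; this is only the bare weighting idea.

```python
# A bare-bones sketch of sample-size weighting in meta-analysis: each
# study's correlation coefficient is weighted by its N when computing the
# combined estimate. Study results below are fabricated.

studies = [
    {"n": 50,  "r": 0.73},
    {"n": 200, "r": 0.85},
    {"n": 120, "r": 0.91},
]

total_n = sum(s["n"] for s in studies)
# Larger studies pull the combined estimate toward their own r values
weighted_r = sum(s["n"] * s["r"] for s in studies) / total_n
print(f"N-weighted mean r across {len(studies)} studies = {weighted_r:.2f}")
```

Because the N = 200 study carries the most weight, the combined estimate lands much closer to its r of .85 than to the small study's .73.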

Culture and Inference

Along with statistical tools designed to help ensure that predictions and inferences from measurement are reasonable, there are other considerations. It is incumbent upon responsible test users not to lose sight of culture as a factor in test administration, scoring, and interpretation. So, in selecting a test for use, the responsible test user does some advance research on the test's available norms to check on how appropriate they are for use with the targeted testtaker population. In interpreting data from psychological tests, it is frequently helpful to know about the culture of the testtaker, including something about the era or times that the testtaker experienced. In this regard, think of the words of the famous anthropologist Margaret Mead (1978, p. 71) who, in recalling her youth, wrote: "We grew up under skies which no satellite had flashed." In interpreting assessment data from assessees of different generations, it would seem useful to keep in mind whether satellites had or had not flashed in the sky. In other words, historical context should not be lost sight of in evaluation (Rogler, 2002).

JUST THINK . . .
What event in recent history may have relevance when interpreting data from a psychological assessment?

It seems appropriate to conclude a chapter entitled "Of Tests and Testing" with the introduction of the term culturally informed assessment and with some guidelines for accomplishing it (Table 4-1). Think of these guidelines as a list of themes that may be repeated in different ways as you continue to learn about the assessment enterprise. To supplement this list, the interested reader is referred to guidelines published by the American Psychological Association (2003). For now, let's continue to build a sound foundation in testing and assessment with a discussion of the psychometric concept of reliability in the next chapter.

Table 4-1
Culturally Informed Assessment: Some Dos and Don'ts

Do:
- Be aware of the cultural assumptions on which a test is based
- Consider consulting with members of particular cultural communities regarding the appropriateness of particular assessment techniques, tests, or test items
- Strive to incorporate assessment methods that complement the worldview and lifestyle of assessees who come from a specific cultural and linguistic population
- Be knowledgeable about the many alternative tests or measurement procedures that may be used to fulfill the assessment objectives
- Be aware of equivalence issues across cultures, including equivalence of language used and the constructs measured
- Score, interpret, and analyze assessment data in its cultural context with due consideration of cultural hypotheses as possible explanations for findings

Do Not:
- Take for granted that a test is based on assumptions that impact all groups in much the same way
- Take for granted that members of all cultural communities will automatically deem particular techniques, tests, or test items appropriate for use
- Take a one-size-fits-all view of assessment when it comes to evaluation of persons from various cultural and linguistic populations
- Select tests or other tools of assessment with little or no regard for the extent to which such tools are appropriate for use with the assessees
- Simply assume that a test that has been translated into another language is automatically equivalent in every way to the original
- Score, interpret, and analyze assessment in a cultural vacuum

Self-Assessment

Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations:

age-equivalent scores
age norms
bivariate distribution
classical theory
coefficient of correlation
coefficient of determination
construct
content-referenced testing and assessment
convenience sample
correlation
criterion
criterion-referenced testing and assessment
cumulative scoring
curvilinearity
developmental norms
domain sampling
domain-referenced testing and assessment
equipercentile method
error variance
fixed reference group scoring system
grade norms
incidental sample
intercept
local norms
meta-analysis
multiple regression
national anchor norms
national norms
norm
normative sample
norming
norm-referenced testing and assessment
outlier
overt behavior
Pearson r
percentage correct
percentile
program norms
purposive sampling
race norming
rank-order/rank-difference correlation coefficient
regression
regression coefficient
regression line
sample
sampling
scatter diagram
scattergram
scatterplot
simple regression
Spearman's rho
standard error of the estimate
standardization
standardized test
state
stratified-random sampling
stratified sampling
subgroup norms
test standardization
trait
true score theory
user norms
Y = a + bX