
DOCUMENT RESUME

ED 086 726 TM 003 377

AUTHOR          Jensen, Arthur R.
TITLE           An Examination of Culture Bias in the Wonderlic Personnel Test.
PUB DATE        73
NOTE            25p.

EDRS PRICE      MF-$0.65 HC-$3.29
DESCRIPTORS     Caucasians; *Cross Cultural Studies; *Intelligence Tests; Item Analysis; Negroes; *Racial Differences; *Test Bias; *Test Validity
IDENTIFIERS     *Wonderlic Personnel Test

ABSTRACT
Internal evidence of cultural bias, in terms of various types of item analysis, was sought in the Wonderlic Personnel Test results in large, representative samples of whites and Negroes totalling some 1,500 subjects. Essentially, the lack of any appreciable Race X Items interaction and the high interracial similarity in rank order of item difficulties lead to the conclusion that the Wonderlic shows very little or no evidence of cultural bias with respect to the present samples, which, however, differ appreciably in mean scores. The items which best measure the g factor within each racial group are, by and large, the same items that show the largest interracial discrimination. (Author)


An Examination of Culture Bias in the

Wonderlic Personnel Test

Arthur R. Jensen

University of California, Berkeley

ABSTRACT


Internal evidence of cultural bias, in terms of various types of item analysis, was sought in the Wonderlic Personnel Test results in large, representative samples of whites and Negroes totalling some 1,500 subjects. Essentially, the lack of any appreciable Race X Items interaction and the high interracial similarity in rank order of item difficulties lead to the conclusion that the Wonderlic shows very little or no evidence of cultural bias with respect to the present samples, which, however, differ appreciably in mean scores. The items which best measure the g factor within each racial group are, by and large, the same items that show the largest interracial discrimination.


Psychometricians are generally agreed that a population difference

in average test score is not, by itself, evidence of biased sampling of test

items such as to favor (or disfavor) a particular cultural group. The mean

difference between groups may be explainable in terms of factors other

than culture bias in the item content of the test. Evidence of culture

bias thus depends upon criteria other than a group mean difference.

There are two main classes of criteria for assessing test bias: external and internal. They are complementary. The external criteria are the more important in terms of the practical usefulness of the test and where predictive validity for a specific quantifiable performance criterion is possible. Bias is indicated when two (or more) populations show significantly different regressions of criterion measures on test scores. If the regression lines for the two (or more) groups do not differ significantly in intercept and slope, the test can be said to be "fair" to all groups with respect to the given criterion of external validity. Refinements and variations of this general external criterion for assessing test bias have been discussed extensively in the measurement literature (e.g., Cleary, 1968; Darlington, 1971; Humphreys, 1973; Jensen, 1968; Linn, 1973; Thorndike, 1971).
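The external criterion just described can be illustrated with a minimal sketch. It fits a separate least-squares line of criterion on test score in each group and reports how far the intercepts and slopes differ. The function names and the small data set are hypothetical, chosen only for illustration, and the sketch stops short of the significance test that a real comparison of regressions would require.

```python
from statistics import mean


def fit_line(scores, criterion):
    """Least-squares regression of criterion on test score.

    Returns (intercept, slope).
    """
    mx, my = mean(scores), mean(criterion)
    sxx = sum((x - mx) ** 2 for x in scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, criterion))
    slope = sxy / sxx
    return my - slope * mx, slope


def regression_gap(group_a, group_b):
    """Differences (A minus B) in intercept and slope between two groups.

    Each group is a (scores, criterion) pair. Differences near zero are what
    the regression definition of a "fair" test expects; whether a nonzero
    difference is statistically significant is not assessed here.
    """
    a0, a1 = fit_line(*group_a)
    b0, b1 = fit_line(*group_b)
    return a0 - b0, a1 - b1


# Hypothetical data: both groups share a slope, but group B's line sits 1 unit lower.
group_a = ([10, 20, 30, 40], [7.0, 12.0, 17.0, 22.0])  # criterion = 2 + 0.5 * score
group_b = ([10, 20, 30, 40], [6.0, 11.0, 16.0, 21.0])  # criterion = 1 + 0.5 * score
d_intercept, d_slope = regression_gap(group_a, group_b)
```

In this made-up case the slopes agree and the intercepts do not, so the same test score would predict a different criterion level in the two groups. That is the kind of discrepancy the regression criterion is designed to expose, provided it proves significant.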


Internal criteria of cultural bias become important when discussing the construct validity of the test and in assessing claims of bias even when the external validity criteria give no evidence of bias. Such claims of test bias are sometimes made on the grounds that the external criterion of the test's validity is itself culture-biased and is therefore predictable by a culture-biased test. Internal criteria of bias get around this argument by examining the degree to which different socioeconomic and cultural groups differ in terms of various "internal" features of the test involving item statistics. The main criterion for the detection of bias lies in the magnitude of the Groups X Items interaction relative to other sources of variance in an analysis of variance (ANOVA) design comprised of Groups (G), Items (I), Subjects within Groups (S), and the interactions G X I and S X I. This method was first used by Cleary and Hilton (1968), who examined the G X I interaction on two forms of the Preliminary Scholastic Aptitude Test (PSAT) in white and Negro groups. The Race X Items interaction proved statistically significant but contributed so minimally relative to the main effects that the authors concluded: ". . . given the stated definition of bias, the PSAT for practical purposes is not biased for the groups studied." Stanley (1969) later showed that a considerable amount of the Race X Items interaction was due to just a few items that were too difficult in both racial groups and therefore did not discriminate much between them. Negroes scored rather uniformly lower than whites on most of the items.

The Groups X Items interaction is analyzable into two effects: (a) the similarity in the rank order of the percent passing, p, each item in each of the groups, and (b) the similarity between the groups in the differences between the p values of adjacent items in the test, i.e., p1 - p2, p2 - p3, etc. These are here called p decrements. Group differences in rank order of item difficulties are termed disordinal interactions. Group differences in p decrements, when the rank order of p values is the same in both groups, are termed ordinal interactions. A measure of similarity between groups, such as the Pearson correlation between the groups in p values and p decrements, can serve as a sensitive index of the degree to which the groups behave differently with respect to different items. Presumably all test items in any test are not equally culture biased, and to the degree that items differ in this property, the extent of cultural differences between two groups relevant to performance on the test should be related inversely to the size of the intergroup correlations of p values and of p decrements. Also, if more test items are culturally irrelevant or unreliable in one group than in another, this can be expected to result in different magnitudes of the test's internal consistency reliability in the two groups.
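The two indexes just defined can be sketched in a few lines. The code below computes p values and p decrements from a subjects-by-items matrix of 0/1 responses and correlates each between two groups. The function names and the tiny response matrices are hypothetical and are not data from the study.

```python
from math import sqrt


def p_values(responses):
    """Proportion passing each item; responses is a list of per-subject 0/1 lists."""
    n = len(responses)
    return [sum(subject[i] for subject in responses) / n
            for i in range(len(responses[0]))]


def p_decrements(p):
    """Differences between p values of adjacent items: p1 - p2, p2 - p3, ..."""
    return [p[i] - p[i + 1] for i in range(len(p) - 1)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists.

    Undefined (division by zero) if either list has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical 0/1 responses: 4 subjects per group, 4 items ordered easy to hard.
group_1 = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
group_2 = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0]]

p1, p2 = p_values(group_1), p_values(group_2)
r_p = pearson(p1, p2)                                 # similarity of p values
r_dec = pearson(p_decrements(p1), p_decrements(p2))   # similarity of p decrements
```

In this toy example the p values of the two groups correlate highly because both fall with item position, while the p decrements do not. This illustrates why the decrement correlation can be the more demanding of the two indexes.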

The present study examines the Wonderlic Personnel Test (WPT) for

evidence of culture bias in terms of these internal criteria when applied

to representative white and Negro samples. The WPT is an obviously culture-

loaded test of general intelligence. The fact that it is culture-loaded

only means that most of the items are based on specific information and cog-

nitive skills that are commonly acquired in present-day English-speaking

western culture. This is obvious simply from inspection of the test items.

Whether the obvious culture loading of the items biases the test to the dis-

advantage of any particular population with respect to another population

is a separate question which can be answered only in terms of empirical

investigation of test data from the groups in question.

The cultural-educational loading of the Wonderlic would seem to make it suspect as a possibly culture-biased test in the American Negro population. This should be a point of concern when the WPT is used in business and industry, and especially where precise external criteria of the WPT's validity in the white and Negro groups are not available. More than 6,500 organizations routinely use the WPT as a part of their personnel selection and placement procedures, making it one of the most widely used tests of mental ability.

Detailed descriptions of the WPT and references to previous research can be found in Buros (1972, pp. 724-6). Briefly, the WPT is a group-administered paper-and-pencil test of 50 verbal, numerical, and spatial items arranged in spiral omnibus fashion. It is generally given with a 12-minute time limit. Alternate form reliabilities average .95. Use of the WPT is claimed to have validity where educability or trainability is a job requirement (Wonderlic & Wonderlic, 1972, p. 60). Large representative samples of males and females show no significant difference in total raw score on the WPT.

Negro Norms

Norms based on 38,452 Negro job applicants have been published (Wonderlic & Wonderlic, 1972). The authors state: "The vast amount of data studied in this report confirms that a very stable differential in raw scores achieved by Negro applicant populations exists. Where education, sex, age, region of country and/or position applied for are held constant, Negro-Caucasian WPT score differentials are consistently observed. These mean score differentials are . . . about one standard deviation apart when comparisons of Caucasians and Negroes are studied" (p. 3). As the authors note (p. 68), the Negro (as well as white) norms are based on biased samples of the Negro (and white) populations to the extent that they are based on an applicant population of individuals who are looking for jobs. The age group from 20 to 24 is predominantly represented for both sexes and for both races.

The published norms show the mean and median test score of Negro and white applicants for each of 80 different occupational categories, from the professional-managerial level to unskilled labor. The correlation between the Negro and white medians across the 80 occupational categories is .84 (the correlation between means is .87), indicating a high degree of similarity between the racial groups in their self-selection for various occupations. In other words, the rank order of median and mean test scores of applicants for various jobs is very similar in the Negro and white populations, despite the approximately 1 σ race difference in mean scores for all job categories.

Is there internal evidence in the test data that the 1 σ difference

between whites and Negroes is attributable in whole or in part to culture

bias in the WPT?

Method

Subjects

Parallel analyses were performed on two pairs of white and Negro samples. Thus the findings from the main analyses are replicated in two sets of Negro-white comparisons based on samples selected in different ways.²

Sample 1 consists of 544 white and 544 Negro Ss representing a random sample of the nationwide population of job applicants on which the published white and Negro norms are based for Form IV of the WPT. These large samples thus closely approximate the score distributions of the normative white and Negro populations, which have been given full statistical description in the manual of norms of the WPT (Wonderlic & Wonderlic, 1972). The samples were drawn without selection for characteristics such as age, education, job category, sex, and region. All Ss coded as "other minority" or Ss with Spanish surnames were excluded from the sample. In terms of the white σ (standard deviation), the mean scores of the white and Negro samples differ by 1.05 σ as compared with 1.00 σ in the total normative populations.

Sample 2 consists of randomly selected test protocols of 204 white and 204 Negro Ss who were job applicants for entry level positions in a single company in New York City. No selection was made on age, education, and sex. Ss coded as "other minorities" and Spanish surnames are not included in the white sample. The white and Negro means of Sample 2 are very close to the national norms, but the SDs are almost double. (Sample 2: White mean = 22.07, SD = 14.86; Negro mean = 15.63, SD = 13.89. National Norms: White mean = 23.32, SD = 7.50; Negro mean = 15.80, SD = 7.06.) In terms of the white sample SD, therefore, the Sample 2 white-Negro mean difference is only 0.43 σ, although it is 0.86 σ in terms of the normative white SD.

Results

P Values and P Decrements.

The P value is the proportion of the total sample who answer a given

test item correctly. P values were obtained for items 1 - 50 in the white

and Negro groups.

The P decrement is the difference between the P values of ordinally adjacent test items, e.g., P1 - P2, P2 - P3, etc., where the subscript indicates the item number in the test. P decrements between adjacent items 1-2, 2-3, . . ., 49-50 were obtained in both samples.

Table 1 shows the mean P values within sets of 10 items (and for all items) for each of the racial groups in Samples 1 and 2. The item P values were correlated between racial groups within 10-item sets and over all 50 items. As can be seen in Table 1, these correlations are quite high even within sets of 10 items. This means that the relative difficulty of the items, as indicated by the proportion passing, is highly similar in the white and Negro samples.

Table 1

Summary of Wonderlic Item Statistics on Sample 1 (N = 544 Whites and 544 Negroes)
and Sample 2 (N = 204 Whites and 204 Negroes)

                                                   Correlation Between White P X Negro P
                        Mean P                                        Corrected for
              Sample 1        Sample 2          Uncorrected           Attenuation
Items       White   Negro   White   Negro    Sample 1  Sample 2    Sample 1  Sample 2
1-10         .815    .623    .829    .653      .886      .920        .901      .935
11-20        .662    .409    .682    .485      .802      .702        .837      .733
21-30        .461    .233    .439    .266      .879      .945        .895      .962
31-40        .232    .101    .230    .143      .937      .943        .963     1.032
41-50        .035    .007    .031    .014      .765      .933        .835     1.019
All Items    .442    .275    .442    .312      .932      .956        .939      .962

              Correlation Between W X N P Decrements
                                      Corrected for
                Uncorrected           Attenuation
Items       Sample 1  Sample 2    Sample 1  Sample 2
1-10          .907      .934        .923      .950
11-20         .845      .768        .884      .803
21-30         .855      .928        .876      .952
31-40         .673      .789        .780      .914
41-50         .765      .938        .827     1.014
All Items     .792      .832        .820      .862

The reliability of the P values within each racial group was estimated by obtaining the correlations between the P values of the same racial groups in Samples 1 and 2. These within-race correlations between P values are all over .90 and for all 50 items the correlations (or reliability of the P values) are .995 for whites and .992 for Negroes. Using the reliabilities thus obtained, the interracial correlations between item P values were corrected for attenuation, as shown in Table 1. The fact that the correlations after correction for attenuation are distributed about a mean of less than 1.00, of course, indicates that the interracial correlation of P values is significantly less than the intraracial correlation. Yet the corrected interracial correlations are very high, which means that the relative item difficulties, though not identical, are much alike in the white and Negro groups.
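As a worked illustration of the correction, the sketch below assumes the standard Spearman formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The paper does not spell out its formula, so this choice is an assumption, and the function name is hypothetical. The inputs are the all-items Sample 1 figures given in the text and in Table 1.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Spearman's correction (assumed here): r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / sqrt(r_xx * r_yy)


# All 50 items, Sample 1: interracial r of P values = .932;
# within-race reliabilities of the P values = .995 (white) and .992 (Negro).
corrected = correct_for_attenuation(0.932, 0.995, 0.992)
print(round(corrected, 3))  # 0.938
```

The result is close to the .939 reported in Table 1. The small gap presumably reflects rounding in the published inputs.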

The P decrements were treated in exactly the same way. Since P decrements, unlike P values, are not systematically correlated with the item's ordinal position in the test, the interracial correlation between P decrements is a more sensitive index of group similarity than the correlation of P values. A high interracial correlation between P decrements means that the relative differences in difficulty between adjacent items are much alike in the two racial groups. If some items were more racially-culturally biased than others, resulting in different relative difficulties for whites and Negroes, it would be reflected in a low interracial correlation between item P decrements, both with or without correction for attenuation. As can be seen in Table 1, this is not the case. The interracial correlations of P decrements are remarkably high. They are distributed about a mean of less than 1.00, however, which means that there is a slight but significant difference in the relative P decrements of the white and Negro groups.

P Values and P Decrements for Attempted Items Only

As the WPT is a timed test, very few Ss attempt every item. The

typical pattern of response for most Ss is to answer the first 10 or 15

items and then to begin to skip around looking for items that appear relatively

easy for them in order to obtain the highest score they possibly

can in the time available. Items which were left unanswered by the S are

considered to be not attempted.

Table 2 shows (a) the mean proportion of each group attempting items (in sets of 10 items), (b) the interracial correlation (corrected for attenuation) between these proportions, (c) the mean proportion, P_A, passing the attempted items, (d) the interracial correlations of P_A, and (e) the correlation between proportion attempting and proportion passing the items.


Table 2

Proportion Passing, P_A, of All Ss Who Attempted Items, and the Interracial Correlations

                Mean Proportion Attempting Items      White X Negro Correlation¹
              Sample 1            Sample 2            of Proportion Attempting
Items       White   Negro       White   Negro         Sample 1    Sample 2
1-10        1.000   1.000        .935    .886            ³          .687
11-20        .995    .985        .823    .858           .981        .970
21-30        .906    .805        .789    .679           .989        .980
31-40        .578    .383        .444    .364           .996        .987
41-50        .161    .088        .095    .051           .994        .961
All Items    .728    .653        .617    .568           .901        .824

              Mean Proportion Passing            Correlation Between White P_A X Negro P_A
              Attempted Items, P_A
              Sample 1        Sample 2           Uncorrected²            Corrected¹
Items       White   Negro   White   Negro     Sample 1  Sample 2     Sample 1  Sample 2
1-10         .815    .623    .881    .726       .886      .882        1.074     1.069
11-20        .665    .415    .744    .568       .797      .261        1.100      .360
21-30        .434    .287    .545    .393       .882      .973        1.073     1.183
31-40        .395    .232    .501    .375       .877      .866        1.052     1.039
41-50        .217    .081    .291    .242       .805      .643         .981      .783
All Items    .509    .328    .592    .461       .916      .796        1.162     1.009

              Correlation Between % Attempting and % Passing Item
              Sample 1            Sample 2
Items       White   Negro       White   Negro
1-10          ³       ³          .202    .378
11-20        .077    .047        .513    .143
21-30        .043    .140        .087   -.025
31-40        .402    .615        .255    .244
41-50        .284    .079        .346    .019
All Items    .724    .711        .716    .578

¹ Corrected for attenuation.
² Uncorrected.
³ Since 100% of both racial groups attempted items 1-10, the computation of r is precluded.

Whites and Negroes are highly similar in the proportions attempting each item. The similarity is even greater for the proportion of each group passing the attempted items; the interracial correlations, when corrected for attenuation, generally do not differ significantly from unity. Overall, in both Samples 1 and 2, the interracial correlations of item difficulties of attempted items are so high as to indicate that the items have essentially the same relative difficulties in the white and Negro groups.

White-Negro Differences According to Type of Items

It is often claimed that Negroes perform relatively less well on

verbal items than on other types, since presumably verbal content allows

wider scope for cultural variations and the effects of bias on Negro scores.

To see if this notion holds true for the various kinds of item content in

the WPT, items were classified as shown in Table 3 and the mean White-Negro

difference in these item categories was determined.


Table 3

Classification of Wonderlic Items and Mean Z Scale Difference Between White and Negro Groups¹

                                                       Mean White-Negro Difference
                                                    All Items                 Attempted Items
                                         Number   Sample 1     Sample 2     Sample 1     Sample 2
Item Category                           of Items  Mean   SD    Mean   SD    Mean   SD    Mean   SD
Verbal Ability
  Opposite-Similar (single words)          14     .656  .25    .587  .21    .639  .24    .545  .28
  Oddity Problem (single words)             3     .395  .39    .690  .12    .650  .39    .653  .12
  Proverbs                                  5     .450  .14    .366  .12    .316  .26    .264  .40
  Scrambled Sentences                       2     .735  .21    .365  .19    .655  .29    .335  .13
  Sentence Meaning                          3     .500  .17    .343  .15    .536  .20    .303  .17
  Total Verbal                             27     .606  .25    .514  .21    .561  .27    .463  .29
Numerical Reasoning
  Number Series                             3     .837  .03    .583  .15    .870  .11   1.270  .84
  Arithmetic Reasoning (Money)              5     .694  .39    .350  .34   1.000  .36    .808 1.07
  Arithmetic Reasoning (Quantity & Rate)    5     .658  .33    .402  .26    .576  .43    .020  .93
  Total Numerical                          13     .713  .30    .425  .27    .807  .38    .611 1.03
Logical Reasoning
  Verbal Syllogisms                         6     .452  .27    .308  .14    .367  .19   -.130  .89
  Spatial-Geometric Reasoning               2     .850  .04    .295  .42   1.270  .65    .120  .41
  Total Logical Reasoning                   8     .551  .29    .305  .19    .593  .51   -.069  .78
Factual Information                         1     .000   -     .240   -     .000   -     .000   -
Clerical Accuracy                           1     .520   -     .340   -     .340   -     .210   -
All Items                                  50     .611  .28    .445  .23    .615  .36    .402  .67

¹ Sample 1: N = 544 whites and 544 Negroes. Sample 2: N = 204 whites and 204 Negroes.

Since items in different categories occur unsystematically at different ordinal positions in the test and have different overall levels of difficulty in both racial groups, it was necessary, in order to make the appropriate comparisons, to transform the proportion passing to an index of item difficulty which constitutes an interval scale. As explained by Guilford (1954, pp. 418-419), this is accomplished by expressing the proportion passing in terms of the z score deviations of the normal curve. The group mean difference is thus expressed in σ units of z score deviations. For example, if on a given item Group A has 84% passing and Group B has 60% passing, the corresponding z scores (from the table of areas under the normal curve) are +1.00 and +0.25 and the difference between Group A and Group B is 1.00 - .25 = .75 σ. By thus transforming p values to z scores, items of different difficulty in the two groups can be compared on an interval scale, permitting direct comparisons of the mean White-Negro z score differences for different types of items.
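The transformation in the example above can be reproduced with the inverse of the normal cumulative distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the function names are hypothetical, and the code is only an illustration of the arithmetic, not the procedure used to build Table 3.

```python
from statistics import NormalDist


def z_from_proportion(p):
    """Normal deviate below which a proportion p of the unit normal curve lies.

    Defined only for 0 < p < 1; an item passed by everyone or by no one
    has no finite z value.
    """
    return NormalDist().inv_cdf(p)


def z_difference(p_group_a, p_group_b):
    """Group A minus Group B difference in item difficulty, in sigma units."""
    return z_from_proportion(p_group_a) - z_from_proportion(p_group_b)


# The example from the text: 84% passing in Group A, 60% passing in Group B.
diff = z_difference(0.84, 0.60)
print(round(z_from_proportion(0.84), 2),
      round(z_from_proportion(0.60), 2),
      round(diff, 2))  # 0.99 0.25 0.74
```

The exact deviates are slightly smaller than the rounded table values quoted in the text, which is why the computed difference comes out near .74 rather than .75.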

Table 3 shows the mean z scale difference between the white and Negro group on the various types of items, as well as the SD of the N items of each type. Because of the small numbers of items in the separate categories, the most important comparisons are between the totals for Verbal, Numerical, and Logical Reasoning. Also more weight probably should be given to the results for attempted items. In Sample 1 there were no individual items with negative z values, either for all items or for attempted items, and there were only five such items among those attempted in Sample 2; in all cases these were items attempted by fewer than 8% of either group. That is to say, whites did better on all items attempted by more than 8% of Ss in either group. There is no regular tendency for the White-Negro difference to be greater for the verbal than for numerical or logical reasoning, and the smallest differences are in factual information and the interpretation of proverbs, which, surprisingly, are the types of items that are so often held up as examples of culture-loaded test items. There is no consistent difference between "all items" and "attempted items." Overall the White-Negro difference is about as great for the attempted items as for all the items. The rather low degree of consistency between results for Samples 1 and 2 would seem to make unwarranted any strong conclusions from the analysis in Table 3. What it does illustrate is the lack of any marked or consistent tendency for any one type of item to be more racially discriminating than other types, as the items are here classified.

If specific type of content is not systematically related to the item's racial discriminability, is there any item characteristic that is so related? It was hypothesized that items' g loadings (or loading on the first principal component) when the item intercorrelation matrix is factor analyzed within each racial group separately would be most highly related to the item's discriminability between the racial groups. That is to say, the more highly an item is correlated with the general factor common to all items, within either racial group, the more highly it will discriminate between the racial groups. To test this hypothesis, the items' loadings on the first principal component (the g factor of the item intercorrelation matrix) were obtained from separate principal components analyses of the white and Negro data (Sample 2). The items' factor loadings were correlated with the items' z index of interracial discriminability (Table 3), for all items, not just attempted items. The Pearson correlation is .47 in the White sample and .62 in the Negro sample. For items with g loadings of greater than .40, the mean White-Negro z difference is .64 (for factor loadings in White sample) and .67 (for factor loadings in Negro sample); while for items with g loadings of less than .40, the corresponding z differences are .36 and .37, respectively. A similar relationship holds also for attempted items. The White-Negro z difference for all items with loadings of more than .40 on g is .52 (in White sample) and .66 (in the Negro); the corresponding figures for items loaded less than .40 are .35 and .31. When this was cross-validated in Sample 1, the White-Negro z difference for all items with g loadings greater than .40 is .78 (for White sample) and .79 (for Negro sample); the z differences for all items with g loadings less than .40 are .54 (White sample) and .55 (Negro sample). The cross-validating correlations between the Sample 2 factor loadings and the Sample 1 White-Negro z differences are .44 (White sample) and .33 (Negro sample). What all this means is that there is a substantial relationship between the size of the item loadings on the general factor common to all items in the Wonderlic and the magnitude of the White-Negro difference on the item, and this is true whether the g factor is determined in the White or in the Negro sample. Neither the loadings on any components other than the first principal component (i.e., g) nor type of item content reveals any systematic relationship to the item's interracial discriminability. On the other hand, the items that best measure the general factor within each racial group are the same items, by and large, that discriminate most highly between the racial groups.

Analysis of Variance: Items X Subjects Matrix

The Race X Items interaction in a complete ANOVA of the Items X Subjects matrix provides a sensitive index of item bias relative to other sources of variance. Using the Sample 2 data, three such ANOVAs were performed: (1) on the total white and Negro groups, (2) on white and Negro groups equated on total WPT score, and (3) on "pseudo-racial" groups comprised entirely of two groups of white Ss selected so that their total WPT score distributions closely match the normative white and Negro distributions in means and SDs. The ANOVAs for each of these conditions are summarized in Table 4. So that the three analyses can be directly compared, the sum of squares for each source in the ANOVA is converted to omega squared (ω² X 100), which is the percent of the total variance attributable to the given source.

For the ANOVA of the total white and Negro samples, all of the effects are significant beyond the .001 level, including the Race X Items interaction. But once the statistical significance of this interaction is shown, more important than statistical significance is the magnitude of the interaction relative to other sources of variance. The smaller it is the more "fair" the test as regards culture bias. The appropriate index of "fairness," thus defined, is the A/B ratio, which, in terms of ω², is A = R/S and B = (R X I)/(I X S). In terms of F, A/B = F_R / F_{R X I}. The two formulas for the A/B ratio are algebraically equivalent. If the Race X Items interaction is nonsignificant, it is presumed that no bias has been demonstrated and there is no point in computing the A/B ratio. The lower the value of the A/B ratio, the easier it would be to equalize or reverse the racial group means by item selection. Obviously a small group mean difference along with a large Groups X Items interaction would mean that a somewhat different selection of items from the same item population could equalize or reverse the group means. The higher the value of A/B, the less is the possibility of equalizing the group means through item selection from a similar population of items. This would not rule out the possibility of introducing different kinds of items into the test, but if doing so decreases the A/B ratio (even though it decreases the group mean difference), it can be argued that the minimizing of the group mean difference is simply a result of balancing item biases. Some tests equate male and female scores on this basis, balancing items that favor one sex with the selection of items that favor the other. Such balancing, resulting in little or no mean sex difference but a large Sex X Items interaction, of course precludes the use of the test for studying the question of sex differences in the ability which the test purports to measure. The same thing would be true of any test which was made to equalize racial group differences at the expense of greatly increasing the Race X Items interaction. The desirable condition is to minimize the interaction as much as possible.
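The two forms of the A/B ratio can be checked against each other with a short sketch. The function names are hypothetical, and the inputs are the total-sample entries as read from Table 4 (ω² X 100 and F), so any misreading of that table would carry through to the result.

```python
def ab_ratio_from_f(f_race, f_race_x_items):
    """A/B as the ratio of the Race F to the Race X Items F."""
    return f_race / f_race_x_items


def ab_ratio_from_omega(w_race, w_subjects, w_race_x_items, w_items_x_subjects):
    """A/B with A = R/S and B = (R X I)/(I X S), each term an omega-squared value."""
    a = w_race / w_subjects
    b = w_race_x_items / w_items_x_subjects
    return a / b


# Total-sample entries from Table 4.
from_f = ab_ratio_from_f(84.87, 7.83)
from_omega = ab_ratio_from_omega(1.83, 8.75, 1.04, 54.17)
print(round(from_f, 2), round(from_omega, 2))  # 10.84 10.89
```

The two routes agree to within rounding of the published figures. The F form works because the degrees of freedom cancel: df for Subjects within Race divided by df for Race (406/1) is offset by df for R X I divided by df for Ss X I (48/19,488 = 1/406).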

The A/B ratio for the total samples (Table 4) is 10.84. For com-

parison, a similar study of white and Negro elementary pupils showed an

A/B ratio of 7.10 on the culture-loaded Peabody Picture Vocabulary Test

and of 17.32 on the culture-reduced Raven's Progressive Matrices (Jensen,

in press).

ANOVA on Equated White and Negro Samples. In a previous study, it was found that when groups of white and Negro school children were roughly matched for mental age (rather than chronological age), and an ANOVA of the Peabody Picture Vocabulary Test (PPVT) items was performed, the Race X Items interaction was greatly reduced from its magnitude when the two racial groups were of the same chronological age but different mental ages (Jensen, in press). This finding suggests that a large part of the Race X Items interaction is attributable to a mental maturity X items interaction rather than to a racial-cultural difference per se. And this hypothesis was strengthened by showing that the same magnitude of the actual Race X Items interaction could be achieved entirely with the white sample, simply by dividing it into two "pseudo-racial" groups for the ANOVA. One group of white Ss was selected so that their distribution of total PPVT scores matched the Negro distribution in mean and SD; the other group of white Ss was selected so that its PPVT score distribution matched the total white distribution. When these two culturally homogeneous groups, corresponding to the Negro and white samples, were subjected to the same ANOVA as was applied to the true racial groups, it reproduced the same results almost perfectly, including the Race X Items interaction. In other words, an interaction of this magnitude could be attributed to an average ability difference between the groups rather than to a cultural difference.

The same kind of analysis is here applied to the Wonderlic data. Since mental age is not a meaningful scale in an adult population, Negro and white Ss were simply matched for total score on the WPT. Perfect matching was possible on 127 White-Negro pairs, making the white and Negro total score distributions identical.
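One way to picture exact matching on total score is to count, at each score value, how many pairs the two groups can supply. The sketch below is only an illustration under that assumption: the paper does not say how individual Ss were chosen within a score level, and the function name and the scores are hypothetical.

```python
from collections import Counter


def matched_pair_counts(scores_a, scores_b):
    """Number of exact-score pairs available at each total score.

    At every score value the number of pairs is the smaller of the two group
    counts, so the matched subgroups have identical score distributions.
    """
    count_a, count_b = Counter(scores_a), Counter(scores_b)
    shared = sorted(count_a.keys() & count_b.keys())
    return {s: min(count_a[s], count_b[s]) for s in shared}


# Hypothetical total scores for two small groups.
group_a = [12, 15, 15, 20, 22, 22, 22]
group_b = [10, 15, 20, 20, 22, 22]
pairs = matched_pair_counts(group_a, group_b)
n_pairs = sum(pairs.values())
```

Because each retained pair shares one score, the group main effect on total score is zero by construction, which is the condition the equated-samples ANOVA relies on.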

If the WPT items are culture-biased for Negroes, one might expect

that whites and Negroes with the same total scores would obtain them in

different ways, so that even when the main effect of Race is zero in the

ANOVA, the Race x Items interaction would remain.

Table 4 shows the results of the ANOVA on the equated samples. The

main effect of race was, of course, forced to be zero by equating the groups.

But note that the Race X Items interaction is very small and nonsignificant

(F = 1.25, df = 48/12,096, p > .10). This finding is consistent with the

hypothesis that the R X I interaction in the ANOVA of the total samples is

due to the average difference in ability between the groups rather than to

a cultural difference. It seems less likely that equating the white and

Negro groups for total score should wipe out an R X I interaction if it

truly reflected a cultural difference between the white and Negro groups.
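For readers who wish to see how an R X I term of this kind is obtained, the following is a minimal sketch of the sums-of-squares partition for a groups X items design with subjects nested within groups, in which the interaction is tested against the Ss X I term. It is an illustration under stated assumptions, not the program used for Table 4: item scores are taken to be 0/1, every subject is assumed to have a score on every item, the function and key names are hypothetical, and omega squared is not computed because its formula is not given in this section.

```python
def race_by_items_anova(groups):
    """Partition a groups x items score matrix (subjects nested in groups).

    `groups` is a list of groups; each group is a list of subjects;
    each subject is a list of item scores (0 or 1), all the same length.
    Returns sums of squares, degrees of freedom, and the F ratio for
    the group x items interaction tested against subjects x items.
    """
    n_groups = len(groups)
    n_items = len(groups[0][0])
    rows = [subj for g in groups for subj in g]
    n_subj = len(rows)

    grand = sum(map(sum, rows)) / (n_subj * n_items)
    item_means = [sum(r[j] for r in rows) / n_subj for j in range(n_items)]

    ss_total = ss_group = ss_subj = ss_inter = 0.0
    for g in groups:
        n_g = len(g)
        g_mean = sum(map(sum, g)) / (n_g * n_items)
        g_item = [sum(r[j] for r in g) / n_g for j in range(n_items)]
        # Main effect of group (Race).
        ss_group += n_g * n_items * (g_mean - grand) ** 2
        # Group x items interaction: departure of the group's item
        # profile from the profile expected under additivity.
        ss_inter += n_g * sum(
            (g_item[j] - g_mean - item_means[j] + grand) ** 2
            for j in range(n_items)
        )
        for r in g:
            # Subjects within group.
            ss_subj += n_items * (sum(r) / n_items - g_mean) ** 2
            ss_total += sum((x - grand) ** 2 for x in r)

    ss_items = n_subj * sum((m - grand) ** 2 for m in item_means)
    # Subjects x items within groups, obtained as the remainder.
    ss_resid = ss_total - ss_group - ss_subj - ss_items - ss_inter

    df_inter = (n_groups - 1) * (n_items - 1)
    df_resid = (n_subj - n_groups) * (n_items - 1)
    f_inter = (ss_inter / df_inter) / (ss_resid / df_resid)
    return {
        "ss_inter": ss_inter,
        "ss_resid": ss_resid,
        "df_inter": df_inter,
        "df_resid": df_resid,
        "f_inter": f_inter,
    }


# Hypothetical toy data: two groups with opposite item profiles.
crossed = race_by_items_anova(
    [[[1, 1, 0], [1, 0, 0]], [[0, 1, 1], [0, 0, 1]]]
)
# Hypothetical toy data: two groups with identical item profiles.
parallel = race_by_items_anova(
    [[[1, 1, 0], [1, 0, 0]], [[1, 1, 0], [1, 0, 0]]]
)
```

With 49 items and 254 subjects in two groups, the same degrees-of-freedom formulas give 48 and 12,096, the values reported for the equated samples.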

Table 4

Omega Squared (ω² x 100) and F from ANOVA of Wonderlic Test in Total
and Equated White and Negro Samples, and in "Pseudo-Race" Samples

                                Total Samples               Equated Samples          "Pseudo-Race" Samples
Source of Variance             df   ω²x100        F       df   ω²x100        F       df   ω²x100        F
Race (R)                        1     1.83    84.87        1        0        0        1     2.08    58.87
Items (I)                      48    34.22   256.54       48    37.95   172.86       48    39.36   150.57
Subjects within Race (Ss)     406     8.75     7.75      252     6.45     5.56      194     6.87     6.51
R X I                          48     1.04     7.83       48     0.29    1.26a       48     0.94     3.61
Ss X I                     19,488    54.17            12,096    55.31             9,312    50.73

a Nonsignificant, p > .10.

One might argue that white and Negro Ss who attain the same total

score must be highly similar in cultural background and therefore would

show no significant R X I interaction. But are they culturally more similar

than individuals of the same racial group who differ by 7 points in

total Wonderlic score? (The σ of total scores in the normative white population

is close to 7.) Siblings reared together in the same family differ

by almost as much. Since the white and Negro population means differ by

close to 1σ (or 7 points on the WPT), we can do an ANOVA on a "pseudo-race"

comparison by making up two groups of white Ss selected so that their

score distributions closely approximate those of Negroes and whites. This

was done by ranking all white scores from highest to lowest, and then,

working in from both ends of the distribution, selecting pairs of Ss who

differ by exactly 7 points in total score. The means of the two distributions

differ by 7 points and they have the same SD = 12.78.
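The logic of the pairing can be sketched as follows. This is only one possible reading of the procedure: it works from the top of the ranked scores downward rather than in from both ends, and the function name, the `gap` parameter, and the example scores are hypothetical. What it does show is why pairs separated by a constant number of points yield two groups whose means differ by exactly that amount and whose SDs are identical.

```python
from collections import Counter
from statistics import mean, pstdev


def pseudo_race_groups(scores, gap=7):
    """Split one sample into paired high and low groups.

    Each member of the high group is paired with a different subject
    scoring exactly `gap` points lower. Unpaired subjects are dropped.
    """
    remaining = Counter(scores)
    high, low = [], []
    # Walk down the ranked scores (simplified, top-down only).
    for s in sorted(remaining, reverse=True):
        k = min(remaining[s], remaining[s - gap])
        if k > 0:
            high += [s] * k
            low += [s - gap] * k
            remaining[s] -= k
            remaining[s - gap] -= k
    return high, low


# Hypothetical total scores from a single homogeneous sample.
high, low = pseudo_race_groups([30, 30, 23, 23, 23, 16, 10])
mean_gap = mean(high) - mean(low)
sd_high, sd_low = pstdev(high), pstdev(low)
```

Since the low group is simply the high group shifted down by a constant, the equality of the two SDs follows by construction and does not depend on the particular scores.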

Table 4 shows the ANOVA of these "pseudo-race" groups. It can be

seen that the results resemble the true racial comparison (Table 4--Total

Samples), especially as regards the R X I interaction, which for the Total

Samples constitutes 1.04% of the variance and for the "pseudo-racial"

samples is 0.94%. The F for the R X I interaction is significant beyond

the .001 level for both the Total Sample and Pseudo-Race Sample, and the

ratios of the F for Race to the F for R X I are 10.84 and 16.31, respectively.

The ratio of ω² for the interactions (R X I / Ss X I) is .019 in both the

Total Sample and the "Pseudo-Race" Sample. All this indicates that a large

part of the R X I interaction can be attributed to a level-of-ability X items

interaction, since it is shown to exist in the "pseudo-race" groups which are

both comprised of white Ss differing in average ability. If the significant

R X I interaction were explainable only in terms of cultural differences

between the white and Negro groups, it seems highly improbable that it

could be reduced to nonsignificance simply by equating the racial groups

for overall level of ability, or that the same significant interaction

could be produced within a culturally homogeneous white sample divided

into high and low ability groups with overlapping score distributions

similar to the total white and Negro distributions. In brief, from these

three ANOVAs shown in Table 4, it would be extremely difficult to make a

case that the Race X Items interaction is attributable to cultural bias.

These analyses should have produced markedly different results if the

popular claims of culture bias were in fact valid.
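The two summary ratios quoted above can be checked against the entries of Table 4. The sketch below assumes that the first ratio is the F for Race divided by the F for R X I, a reading inferred from the arithmetic, and that the second is ω² for R X I divided by ω² for Ss X I, as the text states. The dictionary keys and function name are hypothetical labels.

```python
# Values transcribed from Table 4 (F ratios and omega squared x 100).
TABLE_4 = {
    "Total Samples": {
        "F_race": 84.87, "F_rxi": 7.83, "w2_rxi": 1.04, "w2_ssxi": 54.17,
    },
    "Pseudo-Race Samples": {
        "F_race": 58.87, "F_rxi": 3.61, "w2_rxi": 0.94, "w2_ssxi": 50.73,
    },
}


def summary_ratios(row):
    """Return (F_race / F_rxi, omega-squared R X I / Ss X I), rounded."""
    f_ratio = round(row["F_race"] / row["F_rxi"], 2)
    w2_ratio = round(row["w2_rxi"] / row["w2_ssxi"], 3)
    return f_ratio, w2_ratio


total = summary_ratios(TABLE_4["Total Samples"])
pseudo = summary_ratios(TABLE_4["Pseudo-Race Samples"])
```

Under these assumptions the computed values agree with the figures given in the text: 10.84 and 16.31 for the F ratios, and .019 for the ω² ratio in both analyses.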

Discussion and Conclusion

Several different analyses of test item characteristics have failed

to reveal evidence of culture bias for large Negro and white samples on

the Wonderlic Personnel Test. If some items were more culture biased than

others with respect to the cultural backgrounds of Negroes and whites, one

should expect (a) significantly different rank order of p values (percent

passing) for various items in the white and Negro samples, (b) significantly

different intervals (i.e., p decrements) between the p values of adjacent

test items in white and Negro samples, (c) a significant Race X Items

interaction in the analysis of variance of the Race X Items X Subjects score

matrix, even when both racial groups are equated for total score, and (d)

systematic differences in the types of item content that discriminate most

and least between the white and Negro samples. None of these expectations

was borne out by the present data. The small but significant Race X Items

interaction could be reduced to nonsignificance by equating the white and

Negro groups for overall score, which would not be expected if the two

groups differed culturally in reaction to the test items. Moreover, it

was possible to produce a significant "Pseudo-Race" X Items interaction

within the culturally homogeneous white group simply by dividing the total

white sample into two groups, one which duplicates the mean and SD of the

Negro norms and the other which duplicates the mean and SD of the white

norms. This suggests that the Race X Items interaction is really an

ability level X items interaction rather than an interaction due to cul-

tural differences.

The only way one could view these findings as being not incompatible

with the hypothesis that the Wonderlic is a culturally biased test for

Negroes would be to claim that culture bias depresses Negroes' performance

on all the test items to much the same degree, which seems highly unlikely

for cultural effects per se, and especially considering the great variety of

item content in the Wonderlic. Otherwise it should be possible to make up

subscales consisting of items on which the Negro group on the average does

as well or better than the white group. This, however, is not possible with

the present pool of Wonderlic items. The items that best measure the general

factor common to all items within each racial group are also the same items

that discriminate the most between the racial groups.

The present analyses yield no consistent or strong evidence that the

Wonderlic is reacted to in any way differently in the Negro and white samples,

except in overall level of performance, in which the normative populations

differ by about one standard deviation. The present evidence lends

no support to the hypothesis that the cause of this difference in average

score on the Wonderlic is explainable in terms of cultural bias.

References

Buros, O. K. (Ed.) Seventh Mental Measurements Yearbook. Vol. 1. Highland

Park, New Jersey: Gryphon Press, 1972.

Cleary, T. A. Test bias: Prediction of grades of Negro and white students

in integrated colleges. Journal of Educational Measurement, 1968, 5,

115-124.

Cleary, T. A., & Hilton, T. L. An investigation of item bias. Educational

and Psychological Measurement, 1968, 28, 61-75.

Darlington, R. B. Another look at "cultural fairness." Journal of

Educational Measurement, 1971, 8, 71-82.

Guilford, J. P. Psychometric Methods. 2nd ed. New York: McGraw-Hill, 1954.

Humphreys, L. G. Implications of group differences for test interpretation.

Assessment in a Pluralistic Society. Proceedings of the 1972 Invitational

Conference on Testing Problems. Princeton, N. J.: Educational Testing

Service, 1973. Pp. 56-71.

Jensen, A. R. Another look at culture-fair tests. In Western Regional

Conference on Testing Problems, Proceedings for 1968, "Measurement for

Educational Planning." Berkeley, Calif.: Educational Testing Service,

Western Office, 1968. Pp. 50-104.

Jensen, A. R. How biased are culture-loaded tests? Genetic Psychology

Monographs, in press.

Linn, R. L. Fair test use in selection. Review of Educational Research,

1973, 43, 139-161.

Stanley, J. C. Plotting ANOVA interactions for ease of visual interpretation.

Educational and Psychological Measurement, 1969, 29, 793-797.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational

Measurement, 1971, 8, 63-70.

Wonderlic, E. F., & Wonderlic, C. F. Wonderlic Personnel Test: Negro Norms.

Northfield, Illinois: E. F. Wonderlic & Associates, Inc., 1972.
