

Aging and Vocabulary Scores: A Meta-Analysis

Paul Verhaeghen
Syracuse University

Vocabulary scores were examined in a total of 210 articles, containing 324 independent pairings of younger and older adults, from the 1986–2001 issues of Psychology and Aging. The average effect size, favoring the old, was 0.80 SD. Production tests yielded smaller effects (0.68 SD) than multiple-choice tests (0.93 SD). Both age and education were found to be partially independent determinants of performance in production tests; age effects disappeared in multiple-choice tests as soon as education was taken into account. In addition, the Wechsler Adult Intelligence Scale—Revised Vocabulary subtest (D. Wechsler, 1981) was also found to be sensitive to the Flynn effect (J. R. Flynn, 1987; i.e., increasing test scores with advancing birth year). The results question the approach of using age-group equality in vocabulary scores as a check on sample equivalence.

One of the oldest findings in the cognitive aging literature is that there is a sharp distinction in age trajectories between tests tapping experience and knowledge, which show little or no decline over the life span, and tests requiring online processing and mental manipulation, which do show decline. This distinction has been captured under names such as acquired abilities versus basic intelligence (Jones & Conrad, 1933), crystallized versus fluid intelligence (Horn & Cattell, 1967), or the pragmatics and mechanics of intelligence (Baltes, 1987). In aging research, this knowledge or experience component is most often measured by vocabulary tests, in which participants have to define words (i.e., a production task) or choose from among alternatives a close synonym (sometimes an antonym) for a word presented (i.e., a multiple-choice task). Vocabulary test scores are often used (e.g., Lezak, 1995, p. 103) to estimate premorbid ability level. In fact, one of the most commonly used tests in aging research, the Shipley Institute of Living Scale (Shipley Scale; Shipley, 1946), was explicitly designed to test for mental deterioration in the context of organic or mental disorders. Likewise, Wechsler (1958) devised a deterioration quotient, in which performance on tests that are presumably insensitive to changes associated with aging is compared with performance on tests that show age-related decline, and vocabulary is one of the tests considered aging insensitive. If the Lezak–Shipley–Wechsler conjecture is correct, then results concerning the size of age differences in fluid cognition are only valid if the vocabulary scores of the younger and older adults tested in cognitive aging studies are identical.

It is interesting to note that although many narrative and meta-analytic reviews exist on age relations in fluid intelligence, no such review exists, to my knowledge, concerning vocabulary scores. This article aims at overcoming this oversight. I present results from a meta-analysis conducted on articles containing vocabulary scores, as published in the 1986–2001 volumes of one journal, Psychology and Aging. The aim of the analysis is, first, to describe age differences in vocabulary scores and, second, to investigate the influence and confluence of selected independent variables on vocabulary scores. The first variable considered is year of publication, which can be considered an index for historical change, birth cohort membership, or both (in a cross-sectional design, these cannot be distinguished; Schaie, 1996). The second variable considered is level of education, which is expected to correlate positively with the size of an individual’s vocabulary (Lezak, 1995, p. 539 and following). The third variable is the type of vocabulary task, that is, whether the test format is multiple choice or whether the participant is required to provide word definitions (i.e., a production task). Production tasks require word finding, and multiple-choice tasks require correct recognition, and older adults may have trouble with the former and not the latter (e.g., Burke, MacKay, & James, 1999). The fourth variable is chronological age. The descriptive analysis mainly consists of calculating mean standardized differences between young and older adults; for the exploration of the influence of year of publication, task type, education, and age, both mean standardized differences and raw scores on the two most frequently used tests (the Wechsler Adult Intelligence Scale—Revised [WAIS–R] Vocabulary subtest [Wechsler, 1981] and the Shipley scale) were used.

Method

Sample of Studies

Because many studies on cognition and aging report vocabulary measures as part of the descriptive information about their samples, I decided to deviate from the usual sampling approach in meta-analysis (i.e., using search terms in search engines, which would not be a successful strategy) and to examine the full population of articles of the main journal in the field, Psychology and Aging. All studies were included that (a) reported a measure of vocabulary and (b) examined a sample of younger adults (average age older than 18 and younger than 30; we allowed one sample of prospective and first-year Harvard students with a mean age of 17.8 that was part of a multiexperiment study; Schacter, Koutstaal, Johnson, Gross, & Angell, 1997) as well as a sample of older adults (average age of 60 or older). All issues of Volumes 1–16 (1986–2001) were hand searched. A total of 210 articles were included in the database, containing a total of 324 independent pairings of young and older adults.1 Descriptive information for the data set is reported in Table 1.

Paul Verhaeghen, Department of Psychology and Center for Health and Behavior, Syracuse University.

This research was supported in part by a grant from the National Institute on Aging (AG-16201).

Correspondence concerning this article should be addressed to Paul Verhaeghen, Department of Psychology, 430 Huntington Hall, Syracuse University, Syracuse, New York 13244-2340. E-mail: pverhaeg@psych.syr.edu

Psychology and Aging, 2003, Vol. 18, No. 2, 332–339. Copyright 2003 by the American Psychological Association, Inc. 0882-7974/03/$12.00 DOI: 10.1037/0882-7974.18.2.332

Tests Included

Five tests were used in more than 10 studies: the Vocabulary subtest of the WAIS (Wechsler, 1955) or the WAIS–R (Wechsler, 1981), the Mill-Hill Vocabulary Scale (Raven, 1982), the Nelson–Denny Reading Test (Nelson & Denny, 1960), the Shipley scale (Shipley, 1946), and one of the vocabulary tests included in the Educational Testing Services (ETS) Kit of Factor Referenced Tests (Ekstrom, French, Harman, & Derman, 1976). Of these, the WAIS and WAIS–R are production tests (i.e., the participant supplies a dictionary-like description for each of a series of words presented); the other tests are multiple-choice tests (i.e., the participant chooses a synonym or description among a set of alternatives for each of a series of words presented).

Statistical Analyses

Two types of analyses were conducted. First, traditional effect-size analysis (Hedges & Olkin, 1985) was used to determine the size of the age effect on vocabulary scores. Size of the effect was expressed as the mean standardized difference, that is, the mean of older adults’ performance minus younger adults’ performance, divided by the pooled standard deviation. When the mean or standard deviation was not reported, inferential statistics, if available, were used to determine effect sizes. An overall effect size and separate average effect sizes for each test and for each type of task (i.e., production vs. multiple choice) were calculated. Multiple regression analysis was used to investigate the possible influence of historic differences, age differences, differences in educational level, and task type on the effect size.
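The effect-size computation described above can be sketched as follows. This is a minimal illustration, not the code used for the meta-analysis: the function names and the two example studies are hypothetical, the variance of d uses the common large-sample approximation, and no small-sample bias correction is applied.

```python
import math

def standardized_mean_difference(m_old, sd_old, n_old, m_young, sd_young, n_young):
    """Return (d, var_d): older-minus-younger mean difference in pooled-SD units."""
    pooled_var = ((n_old - 1) * sd_old ** 2 + (n_young - 1) * sd_young ** 2) / (
        n_old + n_young - 2
    )
    d = (m_old - m_young) / math.sqrt(pooled_var)
    # Large-sample approximation to the sampling variance of d
    var_d = (n_old + n_young) / (n_old * n_young) + d ** 2 / (2 * (n_old + n_young))
    return d, var_d

def weighted_mean_effect(effects):
    """effects: list of (d, var_d). Return (weighted mean d, lower, upper 95% limits)."""
    weights = [1.0 / var for _, var in effects]  # inverse-variance weights
    d_plus = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return d_plus, d_plus - 1.96 * se, d_plus + 1.96 * se

# Hypothetical studies: (mean_old, sd_old, n_old, mean_young, sd_young, n_young)
studies = [
    (60.0, 6.0, 24, 55.0, 7.0, 24),
    (34.0, 3.5, 40, 31.0, 4.0, 36),
]
effects = [standardized_mean_difference(*s) for s in studies]
d_plus, ll, ul = weighted_mean_effect(effects)
print(f"d+ = {d_plus:.2f}, 95% CI [{ll:.2f}, {ul:.2f}]")
```

Because the weights are inverse variances, larger samples pull the average more strongly, which is why a single small study with an extreme effect size has little influence on the weighted mean.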

The second type of analysis concerned a within-task analysis of the two types of vocabulary measures most frequently reported, namely the WAIS–R Vocabulary subtest (a production task) and the Shipley scale (a multiple-choice task). For both of these tasks, I investigated the influence of historical differences, age, and educational level on the raw scores of the test in a series of weighted least squares regression analyses, weighting for sample size.
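A weighted least squares regression with a single predictor can be sketched as below. The function name and the data are hypothetical, and the sketch assumes that each sample's mean raw score is weighted directly by its number of participants; the article does not spell out the exact weighting scheme beyond "weighting for sample size."

```python
def weighted_simple_regression(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x with weights w."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical sample-level data: mean years of education, mean raw score, sample size
education = [13.5, 14.0, 15.2, 16.1, 16.8]
raw_score = [29.8, 30.5, 32.0, 33.1, 34.6]
n_participants = [20, 36, 24, 48, 18]

intercept, slope = weighted_simple_regression(education, raw_score, n_participants)
print(f"score = {intercept:.2f} + {slope:.2f} * education")
```

With equal weights this reduces to ordinary least squares; unequal weights let larger samples, whose means are estimated more precisely, dominate the fit.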

Results

Effect-Size Analysis

Averaged effect sizes are reported in Table 2. All effect sizes are significant and favor the older adults, as indicated by the finding that all lower limits of the 95% confidence intervals are larger than zero. The overall effect size, highly heterogeneous (Qw = 1563.12), is 0.80. When the sample was split into a group of effect sizes derived from production tasks and a group derived from multiple-choice tasks, it was found that the former grouping yielded a significantly smaller effect size than the latter (QB = 24.24), but each of these remained heterogeneous. Splitting the sample by test, likewise, did not result in homogeneity of effect sizes, with the exception of the mean weighted effect size for the Nelson–Denny test. Figure 1 offers a funnel plot of the data (Light & Pillemer, 1982). In a funnel plot, sample size is plotted against effect size. If the law of large numbers holds, then these plots should be inverse-funnel shaped; that is, with increasing numbers of participants, the effect sizes should become increasingly less variable and closer to the average value. Note, first, that this seems to be the case here, and, second, that the funnel plot is very regular, that is, there are no gaps or asymmetries, with the exception of a clear outlier with an effect size larger than 7. Removing this outlier did not change the results of any of our analyses: The study had a very small sample (18 younger and 18 older adults), and all analyses used the weighting coefficients outlined in Hedges and Olkin (1985), which are a function of sample size. In a first exploration of heterogeneity, we identified all studies with a standardized residual larger than 3 or smaller than −3. Removing those from the data set did not result in homogeneity.
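The homogeneity statistics referred to above can be illustrated with a short sketch. It assumes the usual fixed-effect partition in which total heterogeneity splits into a within-group part (Qw) and a between-group part (QB); the function names and the example effect sizes are hypothetical and are not taken from the data set.

```python
def q_statistic(effects):
    """effects: list of (d, var_d). Return (weighted mean, Q around that mean)."""
    weights = [1.0 / var for _, var in effects]
    mean = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - mean) ** 2 for w, (d, _) in zip(weights, effects))
    return mean, q

def partition_q(groups):
    """groups: dict of name -> list of (d, var_d).

    Return (Q_total, summed Q_within, Q_between), using Q_between = Q_total - Q_within.
    """
    all_effects = [e for group in groups.values() for e in group]
    _, q_total = q_statistic(all_effects)
    q_within = sum(q_statistic(group)[1] for group in groups.values())
    return q_total, q_within, q_total - q_within

# Hypothetical effect sizes (d, variance of d) grouped by response format
groups = {
    "production": [(0.55, 0.04), (0.70, 0.06), (0.62, 0.05)],
    "multiple_choice": [(0.95, 0.05), (1.10, 0.07), (0.88, 0.04)],
}
q_total, q_within, q_between = partition_q(groups)
print(f"Q_total = {q_total:.2f}, Q_within = {q_within:.2f}, Q_between = {q_between:.2f}")
```

Each Q is referred to a chi-square distribution: Q within a group has k − 1 degrees of freedom, and the between-group Q has one degree of freedom fewer than the number of groups, so with two response formats a value above 3.84 is significant at the .05 level.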

The source of the heterogeneity was further explored in a series of multiple regression analyses, using the method advocated by Hedges and Olkin (1985). The results are reported in Table 3. In a first regression model (Model 1), the predictors were (a) year of publication, (b) a dummy variable coding for production versus multiple-choice format, (c) the older–younger difference in age in the study, and (d) the older–younger difference in years of education in the study (k = 222). The results indicate that larger age differences in the sample led to larger effect sizes (i.e., to larger age differences in vocabulary scores favoring older adults), larger education differences in the sample led to larger effect sizes, and multiple-choice tests yielded larger effect sizes than production tests. The coefficient associated with year of publication was not significant. To test whether age differences and education differences have differential influences in production and multiple-choice tests, a second model (Model 2) was run, in which the interaction between the age difference and response format was introduced, as well as the interaction between educational level difference and response format. Of the interaction terms, only the term involving the age difference became significant, indicating that the effects of differences in chronological age varied significantly between response formats.

1 Given the size of this data set, I did not list these references in the Reference section; however, interested readers can obtain the full data set in the form of an Excel spreadsheet from Paul Verhaeghen or from the Internet on the PsycARTICLES database at http://spider.apa.org/ftacomp/index.cfm?CFID=2081538&CFTOKEN=5006231.

Table 1
Descriptive Statistics for the Meta-Analytic Data Set (k = 324)

Variable                                           k      M       SD      Minimum   Maximum
No. of younger participants                        320    35.40   27.39   6         244
No. of older participants                          321    40.01   41.99   6         429
Age (younger)                                      319    21.39   2.54    17.8      34.6
Age (older)                                        320    70.42   2.60    62        79.9
Age difference (older − younger)                   319    49.04   3.69    36.1      61.4
Years of education (younger)                       249    14.13   1.12    11.9      17.5
Years of education (older)                         257    15.04   1.40    9.5       17.8
Years of education difference (older − younger)    249    0.91    1.66    −3.95     4.1
Proportion of women (younger)                      166    58.25   21.69   0         100
Proportion of women (older)                        167    54.92   23.21   0         100

Note. k = number of studies.

Given that response format interacted with age, a follow-up multiple regression analysis was conducted, splitting the sample according to response format (Table 4). In studies using a production format (k = 123), both the age difference and the difference in years of education influenced the effect size, with larger age differences and larger education differences leading to larger differences in vocabulary scores favoring the older adults. In multiple-choice tests (k = 99), the difference in educational level was significantly associated with effect size (larger differences in education leading to larger age differences in vocabulary favoring the old), but the age difference was not.

Table 2
Effect Sizes (Mean Standardized Differences) for Age in Vocabulary Scores

Vocabulary measure        k      d+      LL of 95% CI   UL of 95% CI   Qw
All measures              279    0.80    0.77           0.83           1563.12a
Production                142    0.68    0.63           0.72           873.18a
Multiple choice           136    0.93    0.89           0.97           608.71a
WAIS                      17     0.63    0.52           0.73           65.15a
WAIS–R                    88     0.69    0.63           0.75           586.64a
WAIS–R (second half)      11     0.88    0.74           1.02           24.23a
Mill-Hill                 28     0.85    0.73           0.97           97.73a
Nelson–Denny              10     1.62    1.42           1.81           10.42
Shipley                   44     0.86    0.79           0.93           171.90a
ETS (diverse measures)    31     1.12    1.04           1.21           65.15a

Note. k = number of studies; d+ = average weighted effect size for age (positive values denote that older adults score higher than younger adults); LL = lower limit; CI = confidence interval; UL = upper limit; Qw = within-group homogeneity (chi-square distributed with df = k − 1); WAIS–R = Wechsler Adult Intelligence Scale—Revised; Mill-Hill = Mill-Hill Vocabulary Scale; Nelson–Denny = Nelson–Denny Reading Test; Shipley = Shipley’s Institute of Living Scale; ETS = Educational Testing Services.
a Significant heterogeneity at p < .05.

Figure 1. Funnel plot of effect sizes (effect size as a function of number of participants in each study; k = 279).


Analysis on Raw Scores: WAIS–R Vocabulary Subtest and Shipley Institute of Living Scale

Figure 2 shows a plot of raw scores on the two most often used vocabulary measures, the WAIS–R Vocabulary subtest and the Shipley scale, as a function of age. In both plots, the correlation between test scores and age was significant, r = .30 and .78, respectively. The difference between these two correlations is significant, Z = 5.70. It should be noted that for both vocabulary tests, a significant quadratic trend was present, suggesting peak performance in the 50–60 age range and decline thereafter. Even though interpreting results from a regression analysis when only the extremes of the independent variable are present is precarious, it can be noted that this trend for a peak in late middle age has also been found in the WAIS and WAIS–R norming samples (Wechsler, 1955, 1981).
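The comparison of the two correlations can be approximated with Fisher's r-to-z transformation. The sketch below rests on two assumptions that the text does not state: that this was the test used, and that the number of data points behind each correlation equals the k values given in the caption of Figure 2 (190 for the WAIS–R and 92 for the Shipley scale). The function name is hypothetical.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Z test for the difference between two independent correlations (Fisher r-to-z)."""
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z2 - z1) / se

# Assumed inputs: r = .30 with 190 data points (WAIS-R), r = .78 with 92 (Shipley)
z = compare_independent_correlations(0.30, 190, 0.78, 92)
print(f"Z = {z:.2f}")
```

Under these assumptions the result is about 5.7, in line with the reported Z = 5.70, which suggests (but does not prove) that a test of this kind was applied to the sample-level data points.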

A potential problem with the WAIS–R data should be pointed out: The maximum attainable score on the WAIS–R Vocabulary subtest is 70 and that of the WAIS is 80. Some of the means reported for the WAIS–R in the samples included here (see Figure 2) approach or exceed that maximum value, probably implying that some of the authors claiming to use the WAIS–R in fact used the WAIS. Because we have no way of knowing exactly which studies used the WAIS while claiming to use the WAIS–R, we did not correct the data. Given that the problem is distributed equally across age groups, it should not affect estimates of young–old age slopes drastically, but it does add extraneous noise to the other analyses.

Figure 3 summarizes the results of a series of single regressions conducted on the two vocabulary measures. I examined the effects of age, years of education, and year of publication on test performance. For the WAIS–R, age within the samples of younger adults did not correlate with performance (r = .02); within the samples of older adults, age correlated negatively with performance (r = −.23; slope of the regression line = −0.52). For the Shipley scale, age correlated positively with performance within the samples of younger adults (r = .47; slope of the regression line = 0.22) but not within the samples of older adults (r = −.05). For the WAIS–R, educational level did not correlate significantly with performance within the samples of younger adults (r = .15), but it did within the older sample (r = .25; slope of the regression line = 0.37). For the Shipley scale, educational level correlated significantly and strongly with performance in both younger and older adult samples (r = .63 and .73; slope of the regression line = 0.75 and 0.86, respectively). Year of publication did not correlate with WAIS–R performance within the samples of younger adults (r = .11), but it did within the samples of older adults (r = .38; slope of the regression line = 0.61). Year of publication was not related to performance on the Shipley test (r = −.09 and −.05, respectively).

Table 3
Results From Multiple Regression Analyses Predicting Effect Sizes (All Studies; k = 222)

Variable                                    B         β        t
Model 1 (R2 = .24)
  Age difference (older − younger)          0.02      0.09     3.02*
  Education difference (older − younger)    0.17      0.43     14.51*
  Year of publication                       0.00      0.03     0.94
  Test format (production = 0; MC = 1)      0.44      0.35     11.76*
  Constant at x = 0                         −9.01              −0.97
Model 2 (R2 = .25)
  Age difference (older − younger)          0.03      0.17     4.19*
  Education difference (older − younger)    0.17      0.43     10.36*
  Year of publication                       0.01      0.03     1.08
  Test format (production = 0; MC = 1)      2.05      1.63     3.99*
  Age difference by test format             −0.03     −1.28    −3.08*
  Education difference by test format       0.01      0.01     0.25
  Constant at x = 0                         −11.04             −1.16

Note. MC = multiple choice.
* p < .05.

Table 4
Results From Multiple Regression Analyses Predicting Effect Sizes (Split by Response Format; k = 123, for Production; k = 99, for Multiple Choice)

                                            Production (R2 = .24)        Multiple choice (R2 = .20)
Variable                                    B        β        t          B         β        t
Age difference (older − younger)            0.03     0.18     4.59*      −0.05     −0.03    −0.55
Education difference (older − younger)      0.17     0.37     10.09*     0.17      0.48     9.80*
Year of publication                         −0.00    −0.01    0.24       0.01      0.09     1.80
Constant at x = 0                           2.08              0.16       −22.69             −1.67

* p < .05.

In a final analysis, all three terms (age, years of education, and year of publication) were entered simultaneously in a multiple regression analysis. Results are reported in Table 5. For the WAIS–R, all coefficients were significant, indicating that being older, having completed more years of education, and being included in a more recently published study are each associated with higher test scores. For the Shipley scale, only the effect of education proved significant. In a second step, interaction effects were tested by including interaction terms for the age group by education, the age group by year of publication, and the age group by age; none of those resulted in significant improvement in fit of the regression.

Discussion

The first conclusion from the present meta-analysis is obvious: Vocabulary scores (at least as measured in the samples of younger and older adults contained in mostly experimental research on cognition in Psychology and Aging) are not age insensitive. Rather, they show a substantial and positive age effect. On average and across measures, older adults score 0.80 SD higher on vocabulary tests than younger adults. Depending on the test, this mean standardized difference varies between 0.63 (WAIS) and 1.62 (Nelson–Denny Reading Test). To illustrate the size of the effect, effect sizes found for abilities that do decline with age are as follows: −0.36 for short-term memory (digit span; Bopp & Verhaeghen, 2001), −0.52 for working memory (reading span; Bopp & Verhaeghen, 2001), −0.67 for prose recall (Verhaeghen, Marcoen, & Goossens, 1993), −0.82 for spatial ability (recalculated from the r reported in Verhaeghen & Salthouse, 1997), −0.87 for reasoning ability (recalculated from the r reported in Verhaeghen & Salthouse, 1997), −0.91 for paired-associate recall (Verhaeghen et al., 1993), −0.99 for list recall (Verhaeghen et al., 1993), and −1.21 for perceptual speed (recalculated from the r reported in Verhaeghen & Salthouse, 1997). Clearly, the size of the positive effect of age on vocabulary scores is quite comparable (in absolute terms) to the size of the negative effect on many indices of fluid cognition. Even though the effect sizes for all measures were heterogeneous (with the exception of the Nelson–Denny, but this seems to be due to lower statistical power due to small k rather than to true homogeneity), the lower limit of the confidence intervals was nowhere close to zero.

One reason why the size and direction of the effect are remarkable is that this result runs counter to longitudinal research, which suggests either stability or a modest decline of vocabulary scores over the portion of the adult life span considered here. One may note that although there is a body of literature pointing at declines in very old age (e.g., Schaie, 1996; Singer, Verhaeghen, Ghisletta, Lindenberger, & Baltes, 2003), the older samples included in the meta-analysis were still relatively young; only 14 out of 320 studies had older adult samples with a mean age over 75. At least two possibilities, not mutually exclusive, can be advanced to explain the discrepancy between longitudinal stability in the literature and the increase in scores noted here. The first is that the increase of vocabulary scores over age may be due to a cohort effect favoring the earlier born. This indeed seems to be the case (Glenn, 1994; Huang & Hauser, 1998). This effect is possibly enhanced by the age of the tests themselves: The Shipley, for instance, was designed in the 1940s, and it may contain words that are generation specific, which would put older adults at an advantage compared with individuals born after 1940.

The second possibility is that differential standards may have been used for sampling younger and older adults. More specifically, cognitive aging researchers, sensitized by complaints about potential differential representativeness favoring the young (see Salthouse, 1991, pp. 54–58, for an overview of articles on this topic), might tend to sample older adults who are more positively selected than are younger adults. A finding from the current data set that supports this interpretation is that, on average, the older adults recruited in these studies had received more years of education than the younger adults. This indicates a positive bias in selecting older adults, even more so if one considers cohort differences in access to higher education. The average weighted number of years of education (see Table 1) was 14.1 for younger adults and 15.0 for older adults; the age difference is significant according to a Wilcoxon Signed Ranks Test, Z = −7.67.

It could be argued that sampling older adults with a higher average level of education than that of the younger adults in the study may be good research practice, given that the education of younger adults is being measured as it is ongoing and that most will complete at least 16 years of education, whereas the level recorded for most of the older adults is indeed the ultimate level of education they will reach. Although this is undeniably true, it is also true that the number of years of education completed may be a more important determinant of vocabulary scores than the number of years of education the participant will finally reach. In the current meta-analysis, I found that within the younger sample the number of years of education completed correlated highly with performance for the Shipley test and that there was a trend in the same direction for the WAIS–R (r = .15 for younger adults). In other words, the ongoing process of education seems to matter, at least with regard to verbal ability.

Figure 2. Raw test scores as a function of age (in years) in the Wechsler Adult Intelligence Scale—Revised (WAIS–R) Vocabulary subtest and the Shipley Institute of Living Scale (k = 190 and 92, respectively).

Note that the finding that there is decline (WAIS–R) or stability (Shipley scale) over age within the sample of older adults can be interpreted within the context of both hypotheses: This may be the true age-related effect that is exposed after the confounding influence of the cohort effect, the sampling effect, or both have been removed. Also note that if a positive selection bias is indeed operating in favor of older adults, then one consequence is that the size of age differences in fluid cognition as derived from these and similar studies will be underestimated. It is even possible that preserved functioning in the highly selected older sample precludes us from observing the real changes that occur with normal aging. This may become critical in experimental studies that examine phenomena for which the observed effects are already small in younger adults (e.g., negative priming research; see Verhaeghen & De Meersman, 1998).

Figure 3. Determinants of raw scores in the Wechsler Adult Intelligence Scale—Revised (WAIS–R) Vocabulary subtest and the Shipley Institute of Living Scale: age within the younger samples (k = 95 and 46, respectively), age within the older samples (k = 95 and 46, respectively), number of years of education (k = 186 and 72, respectively), and publication date (k = 190 and 94, respectively). Solid dots indicate younger adults; open dots indicate older adults; solid line indicates regression line for younger adults; dotted line indicates regression line for older adults.

A second conclusion is that different vocabulary tests behave differently in their relation to age. Multiple-choice tests yield a larger advantage for older adults than production tasks. This age by test format interaction might indicate that although older adults have a larger pool of words available in semantic memory than younger adults, clearer cues are needed for full access to this pool. This result is compatible with a growing number of studies (for an overview, see Fleischman & Gabrieli, 1998) showing that compared with younger adults, older adults perform less well on production than identification tasks in priming studies. It is also compatible with research showing word finding failures in old age (e.g., Burke, MacKay, & James, 1999; MacKay & Abrams, 1996): Multiple-choice measures are not sensitive to such age-related effects, because they do not require word finding, whereas production measures do. It should be noted, however, that test format is confounded with the test itself, that is, there are no tests in the database for which the same words have been tested using both formats. A simple test of the task-type hypothesis would be to use the two methods on the same sample of words. A weaker test involves a comparison between the linguistic properties of the words in the existing tests. Using the MRC Psycholinguistic database, I investigated the two most frequently used tests, the WAIS–R (production) and the Shipley scale (multiple choice).2 Characteristics that were available for a majority of the words in both tests were the Thorndike–Lorge written frequency, the Kucera–Francis word counts, the Kucera–Francis number of categories, the Kucera–Francis number of samples, the number of letters, the number of phonemes, and the number of syllables. The only reliable differences between the two tests were the number of letters (7.94 for the WAIS–R, 7.03 for the Shipley scale; Z = −2.68) and phonemes (6.93 for the WAIS–R, 6.05 for the Shipley scale; Z = −2.30; all other comparisons, 0 < Z < 0.6). Thus, linguistic properties of the words are an unlikely source of the difference in age effects between the tests. One multiple-choice test, the Nelson–Denny, seemed to yield a particularly large benefit for older adults; the reason for this is unclear.
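As an illustration only, a word-property comparison of this kind can be sketched in Python. The article does not state which test produced the reported Z values, so the sketch assumes a large-sample Z statistic for the difference between two independent means; a rank-based test would be an equally plausible reading. The function name and the letter counts are hypothetical placeholders, not the actual WAIS–R or Shipley items.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_z(a, b):
    """Large-sample Z statistic for mean(a) - mean(b), independent samples.

    Assumption: unpooled sample variances and a normal approximation;
    the source article does not specify the test it used.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical letter counts for two word lists (NOT the real test items).
shipley_lengths = [6, 7, 8, 7, 6, 8, 7, 7]
wais_lengths = [8, 9, 7, 8, 9, 8, 7, 8]

# Shipley minus WAIS-R: negative when the WAIS-R words are longer.
z = two_sample_z(shipley_lengths, wais_lengths)
print(f"Z = {z:.2f}")
```

With the Shipley list entered first, a negative Z corresponds to longer WAIS–R words, which matches the direction of the reported differences.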

A third conclusion is that in production tests both age and education are independent determinants of vocabulary performance, as shown in both the effect-size analysis for all production tests and the analysis on raw scores of the WAIS–R. Thus, the age difference observed cannot be fully explained by age differences in years of education. For multiple-choice tests, however, only education was an independent influence, that is, the age effect disappears once differences in education are taken into account. This was the case in both the effect-size analysis and the analysis on raw scores of the Shipley scale. The effect of education on the Shipley scale was particularly large: 77% of the variance was associated with education, leaving relatively little unique variance to be associated with other variables. The picture is very different for the WAIS–R: Age and education together explained a mere 16% of the variance. Adding year of publication as a predictor increased the amount of variance explained to 22%, still considerably less than the percentage explained by age and education in the Shipley scale. By implication, this means that the amount of variance shared between the Shipley scale and the WAIS–R vocabulary test must be relatively small. This in itself is a reason for concern and raises the question as to whether either of these tests truly measures verbal ability in old age.

It is intriguing that performance on the WAIS–R Vocabulary test is sensitive to year of publication, a proxy for the time the study was conducted. This may be an indication of cohort effects favoring later born cohorts (the so-called Flynn effect; Flynn, 1987), of increased positive bias in sampling over time, or both. The Shipley test is not sensitive to this variable. It is unlikely that this is due to the nature of the task: At least one other multiple-choice test, the WORDSUM (Huang & Hauser, 1998), shows a clear Flynn effect. Cohort effects on vocabulary tests may be due to decreases in real verbal ability over successive cohorts, or they may simply be an artifact of the decreasing familiarity of successive cohorts with words that are included in a test that is a few decades old.

Another difference between the two tests is that performance on the WAIS–R was negatively related to age within the older sample, suggesting that this test might be sensitive to individual differences due to aging in addition to those due to cohort or sampling differences. This was not the case for the Shipley test, but this test showed positive age effects within the younger sample, perhaps due to the high correlation of age with education in this age range (r = .70).

2 The MRC Psycholinguistic database URL is http://www.dci.clrc.ac.uk/Projects/Psych/index.asi.

Table 5
Results From Multiple Regression Analyses Predicting Raw Scores in the Wechsler Adult Intelligence Scale—Revised (WAIS–R; k = 169) and Shipley Institute of Living Scale (k = 73) Vocabulary Tests

                        WAIS–R (R² = .22)            Shipley (R² = .77)
Variable                B          β       t         B         β       t
Age                     0.08       0.34    4.82*     0.10      1.14    0.93
Education               0.25       0.16    2.22*     0.81      0.47    6.96*
Year of publication     0.40       0.23    3.24*     0.04      0.09    1.35
Constant at x = 0       −756.75            −3.05*    −67.28            −1.06

* p < .05.
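For readers less familiar with the two coefficient columns, the relation between an unstandardized slope (B) and its standardized counterpart (β) can be sketched as follows. This is a minimal single-predictor illustration, not the multiple regression reported in Table 5; in a multiple regression the same rescaling, B times SD(x) divided by SD(y), is applied to each predictor. The helper names and the sample-level values are hypothetical.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Unstandardized least-squares slope (B) of y regressed on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def standardized_beta(b, x, y):
    """Rescale B into standard-deviation units: beta = B * SD(x) / SD(y)."""
    return b * stdev(x) / stdev(y)


# Hypothetical sample-level means (NOT data from the meta-analysis):
# mean years of education and mean raw vocabulary score per sample.
education = [12.0, 13.5, 14.0, 15.0, 16.0, 16.5]
vocabulary = [28.0, 30.5, 30.0, 33.0, 34.5, 35.0]

b = ols_slope(education, vocabulary)
beta = standardized_beta(b, education, vocabulary)
print(f"B = {b:.2f} points per year of education, beta = {beta:.2f}")
```

B is expressed in raw score points per unit of the predictor, whereas β is unit free, which is what makes coefficients comparable across predictors measured on different scales.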

As stated in the introduction, the assumption in the neuropsychological assessment literature is that vocabulary scores are age insensitive, and this may be the reason why cognitive aging researchers record and report them routinely. The present analysis casts doubt on the usefulness of this approach. It appears that the field routinely samples older adults with higher scores than younger adults, and the effect-size analysis, in fact, suggests that it will be difficult to find samples for which the young outperform the old on vocabulary scores, as difficult as finding samples in which there is an age advantage for the old in reasoning, spatial ability, or recall from episodic memory. It seems that multiple-choice tests may be particularly problematic in this regard, given their very large positive correlation with age. Therefore, it might be appropriate for researchers to report average vocabulary scores transformed according to up-to-date age-specific norms, in addition to reporting the average raw scores.
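One way to picture the norm-based transformation suggested above is a simple lookup that converts a raw score into a z score relative to the norm for the participant's age band. The sketch below is only an illustration under stated assumptions: the age bands, means, and standard deviations are invented placeholders, not published norms for any test, and the function name is hypothetical.

```python
# Hypothetical age-band norms: (min_age, max_age) -> (norm mean, norm SD).
# These numbers are placeholders for illustration, NOT real test norms.
HYPOTHETICAL_NORMS = {
    (18, 34): (50.0, 10.0),
    (35, 64): (54.0, 9.0),
    (65, 89): (52.0, 11.0),
}


def norm_referenced_z(raw_score, age, norms=HYPOTHETICAL_NORMS):
    """Express a raw score as a z score relative to the age-band norm."""
    for (min_age, max_age), (norm_mean, norm_sd) in norms.items():
        if min_age <= age <= max_age:
            return (raw_score - norm_mean) / norm_sd
    raise ValueError(f"no norm available for age {age}")


# The same raw score can map to different standings in different age bands.
young_z = norm_referenced_z(60.0, age=20)
older_z = norm_referenced_z(60.0, age=70)
print(f"young: {young_z:.2f}, older: {older_z:.2f}")
```

Reporting such norm-referenced values alongside the raw means would let readers see whether each age group is typical of its own cohort, rather than comparing the groups only with each other.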

A possible additional reason to include vocabulary scores as part of the sample characteristics could be to check on sample equality across studies; that is, researchers could use them to determine whether the younger and/or older participants sampled in their study are comparable to participants in samples included in other relevant studies and hence to trace differences in outcome across studies to possible differences in sample characteristics. In that case, it might be wise for the field to set a standard, that is, to reach a consensus on one and only one vocabulary test that would serve this purpose. In this context, it may be useful to point out that scores on the most often used multiple-choice test, the Shipley, are determined (at least in the present across-sample approach) to a very large part (R² = .77) by age and education alone. Consequently, including this particular test in a demographic battery does not add much information beyond what is recorded in these two variables. The WAIS–R is much less determined by age and education, and it would then be the better choice if one truly wants to examine sample equivalence in verbal ability.

References

Baltes, P. B. (1987). Theoretical propositions of life-span developmental psychology: On the dynamics between growth and decline. Developmental Psychology, 23, 611–626.

Bopp, K. L., & Verhaeghen, P. (2001). Aging, short-term memory span and working memory span: A meta-analysis. Manuscript submitted for publication.

Burke, D. M., MacKay, D. G., & James, L. E. (1999). Theoretical approaches to language and aging. In T. J. Perfect & E. A. Maylor (Eds.), Models of cognitive aging (pp. 204–237). Oxford, England: Oxford University Press.

Ekstrom, R. B., French, J. W., Harman, H. H., & Derman, D. (1976). Manual for kit of factor-referenced cognitive tests. Princeton, NJ: Educational Testing Service.

Fleischman, D. A., & Gabrieli, J. D. E. (1998). Repetition priming in normal aging and Alzheimer’s disease: A review of findings and theories. Psychology and Aging, 13, 88–119.

Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101, 171–191.

Glenn, N. D. (1994). Television watching, newspaper reading, and cohort differences in verbal ability. Sociology of Education, 67, 216–230.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Horn, J. L., & Cattell, R. B. (1967). Age differences in fluid and crystallized intelligence. Acta Psychologica, 26, 107–129.

Huang, M.-H., & Hauser, R. M. (1998). Trends in Black-White test-score differentials: II. The WORDSUM vocabulary test. In U. Neisser (Ed.), The rising curve: Long-term gains in IQ and related measures (pp. 303–332). Washington, DC: American Psychological Association.

Jones, H. E., & Conrad, H. S. (1933). The growth and decline of intelligence. Genetic Psychology Monographs, 13, 229–295.

Lezak, M. D. (1995). Neuropsychological assessment (3rd ed.). New York: Oxford University Press.

Light, R. J., & Pillemer, D. B. (1982). Numbers and narrative: Combining their strengths in research reviews. Harvard Educational Review, 52, 1–26.

MacKay, D. G., & Abrams, L. (1996). Language, memory, and aging: Distributed deficits and the structure of new-versus-old connections. In J. E. Birren & K. W. Schaie (Eds.), Handbook of the psychology of aging (4th ed., pp. 251–265). San Diego, CA: Academic Press.

Nelson, M. J., & Denny, E. C. (1960). The Nelson–Denny Reading Test (revised by James I. Brown). Boston: Houghton Mifflin.

Raven, J. C. (1982). Revised manual for Raven’s Progressive Matrices and Vocabulary Scale. Windsor, United Kingdom: NFER Nelson.

Salthouse, T. A. (1991). Theoretical perspectives on cognitive aging. Hillsdale, NJ: Erlbaum.

Schacter, D. L., Koutstaal, W., Johnson, M. K., Gross, M. S., & Angell, K. E. (1997). False recognition induced by photographs: A comparison of older and younger adults. Psychology and Aging, 12, 203–215.

Schaie, K. W. (1996). Intellectual development in adulthood: The Seattle Longitudinal Study. New York: Cambridge University Press.

Shipley, W. C. (1946). Institute of Living Scale. Los Angeles: Western Psychological Services.

Singer, T., Verhaeghen, P., Ghisletta, P., Lindenberger, U., & Baltes, P. B. (2003). The fate of cognition in very old age: Six-year longitudinal findings in the Berlin Aging Study (BASE). Psychology and Aging, 18, 318–331.

Verhaeghen, P., & De Meersman, L. (1998). Aging and negative priming: A meta-analysis. Psychology and Aging, 13, 435–444.

Verhaeghen, P., Marcoen, A., & Goossens, L. (1993). Improving memory performance in the aged through mnemonic training: A meta-analytic study. Psychology and Aging, 7, 242–251.

Verhaeghen, P., & Salthouse, T. A. (1997). Meta-analyses of age-cognition relations in adulthood: Estimates of linear and non-linear age effects and structural models. Psychological Bulletin, 122, 231–249.

Wechsler, D. (1955). WAIS manual. New York: Psychological Corporation.

Wechsler, D. (1958). The measurement and appraisal of adult intelligence (4th ed.). Baltimore: Williams & Wilkins.

Wechsler, D. (1981). WAIS–R manual. New York: Psychological Corporation.

Received February 26, 2002
Revision received August 9, 2002
Accepted August 22, 2002
