The Pennsylvania State University
The Graduate School
College of Education
FACTOR STRUCTURE OF WECHSLER PRESCHOOL AND PRIMARY SCALE
OF INTELLIGENCE (THIRD EDITION) – SPANISH VERSION SCORES
AMONG CHILDREN IN PERU
A Dissertation in
School Psychology
by
Abigail E. Crimmins
© 2016 Abigail E. Crimmins
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
August, 2016
The dissertation of Abigail E. Crimmins was reviewed and approved* by the following:

Barbara A. Schaefer
Associate Professor of Education
Dissertation Advisor
Chair of Committee
Professor-in-Charge, School Psychology Program

Peter M. Nelson
Assistant Professor of School Psychology

Richard M. Kubina, Jr.
Professor of Education

Laura E. Murray-Kolb
Assistant Professor of Nutritional Sciences

*Signatures are on file in the Graduate School.
Abstract
A critical component in the adaptation of measures across culturally different populations
is the validation of the adapted measure for use in the new population. Validation
requires evidence that the scores from the new tool measure the same qualities or aspects
of the construct in the new population as they purport to measure in the original
population. This study examined the reliability and validity of scores for an adapted
version of the Spanish-language form of the Wechsler Preschool and Primary Scale of
Intelligence-Third Edition (WPPSI-III-SP; Wechsler, 2009) as a measure of cognitive
ability among a cohort of children in rural Peru. Using confirmatory factor analyses
(CFA), a series of models were fit to data from a cohort of children age 36 months (n =
147) and the same cohort at age 48 months (n = 167). These models represented the
theoretical factor structure established by the publisher for the normative data as well as
models derived from other studies of cross-cultural intelligence test adaptation in the
region. It was hypothesized that the models derived from prior South American studies
would yield a better fit for the data as compared to the normative sample model.
Convergent validity was also assessed based on the hypothesis that the scores from the
adapted WPPSI-III-SP would be positively and strongly correlated with cognitive scores
from a similarly adapted Bayley Scales of Infant and Toddler Development-Third Edition
(Bayley-III; Bayley, 2005) administered with these children at age 24 months. CFA
results support a one-factor model for both the 36- and 48-month time points for the
adapted WPPSI-III-SP measure; however, evidence for convergent validity with prior
estimates of cognitive ability using the adapted Bayley-III was minimal. Implications for
cross-cultural test adaptation and the use of the adapted WPPSI-III-SP are discussed.
Table of Contents
List of Tables .......................................................... vi
List of Figures ......................................................... vii
Chapter 1: INTRODUCTION ................................................. 1
    Cultural Context of Peru ............................................ 4
    Purpose and Proposed Models ......................................... 7
Chapter 2: LITERATURE REVIEW ............................................ 11
    Cross-Cultural Test Adaptation ...................................... 11
    Cross-Cultural Intelligence Testing ................................. 18
    Adaptation of Preschool Cognitive Assessment ........................ 22
    Description of the WPPSI-III ........................................ 25
    Present Study ....................................................... 32
Chapter 3: METHOD ....................................................... 42
    Sample .............................................................. 42
    Measures ............................................................ 43
    Procedure ........................................................... 45
    Data Analyses ....................................................... 46
Chapter 4: RESULTS ...................................................... 52
    Younger Cohort ...................................................... 52
        Preliminary Analyses and Descriptive Statistics ................. 52
        Model 1-NormYoung ............................................... 54
        Model 2-OneFactorYoung .......................................... 56
        Model Comparison ................................................ 58
    Older Cohort ........................................................ 59
        Preliminary Analyses and Descriptive Statistics ................. 59
        Model 3-NormOlder ............................................... 61
        Model 4-OneFactorOlder .......................................... 62
        Model 5-AltOlder ................................................ 64
    Convergent Validity Analyses ........................................ 65
Chapter 5: DISCUSSION ................................................... 67
    Fit of Hypothesized Models .......................................... 68
    Convergent Validity Evidence ........................................ 75
    Limitations and Future Research ..................................... 77
    Implications ........................................................ 79
    Conclusions ......................................................... 81
References .............................................................. 82
Appendix A. Overview of WPPSI Content ................................... 93
Appendix B. Parameter Estimates and Standard Errors for Models 3, 4,
    and 5 ............................................................... 95
List of Tables
Table 1. Model Identification Rules for Standard CFA Models................................. 48
Table 2. Free Parameters and Observations for Specified Models............................. 48
Table 3. Intercorrelations, Descriptive Statistics, and Reliability Estimates for Adapted WPPSI-III-SP Subtest Scores (Younger Cohort)............................ 53
Table 4. Parameter and Standard Error Estimates for Model 1-NormYoung........... 55
Table 5. Parameter and Standard Error Estimates for Model 2-OneFactorYoung.. 57
Table 6. Selected Fit Indices for Younger Cohort Models......................................... 58
Table 7. Satorra-Bentler Chi-Square Difference Test................................................. 59
Table 8. Intercorrelations, Descriptive Statistics, and Reliability Estimates for Adapted WPPSI-III-SP Subtest Scores (Older Cohort).................................. 60
Table 9. Parameter and Standard Error Estimates for Model 4-OneFactorOlder with Modifications.......................................................................................... 63
Table 10. Selected Fit Indices for Older Cohort Models............................................ 65
Table 11. Descriptive Statistics and Reliability Estimates for Adapted WPPSI-III-SP Total Scores and Adapted Bayley-III Scores (Younger and Older Cohorts)........................................................................................................... 66
Table A1. Subtests and Composite Scores of the WPPSI, WPPSI-R, and WPPSI-III by Age........................................................................................... 94
Table B1. Parameter and Standard Error Estimates for Model 3-NormOlder............ 95
Table B2. Parameter and Standard Error Estimates for Model 4-OneFactorOlder.... 96
Table B3. Parameter and Standard Error Estimates for Model 5-AltOlder.......... 97
List of Figures
Figure 1. Hypothesized model of the adapted WPPSI-III-SP scores among the Peruvian cohort at age 36 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 1-NormYoung for analyses........................................................................................................... 37
Figure 2. Hypothesized single-factor model of the adapted WPPSI-III-SP for the
Peruvian cohort at age 36 months, referred to as Model 2-OneFactorYoung model for analyses................................................................................... 38
Figure 3. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian
cohort at age 48 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 3-NormOlder for analyses............ 39
Figure 4. Hypothesized single-factor model of the adapted WPPSI-III-SP for the
Peruvian cohort at age 48 months, referred to as Model 4-OneFactorOlder model for analyses.......................................................................................... 40
Figure 5. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian
cohort at age 48 months, referred to as Model 5-AltOlder model for analyses........................................................................................................... 41
Figure 6. Completely standardized factor loadings for Model 1-NormYoung........ 56
Figure 7. Completely standardized factor loadings for Model 2-OneFactorYoung... 58
Figure 8. Completely standardized factor loadings for modified Model
    4-OneFactorOlder........................................................... 64
Chapter 1: Introduction
The construct of intelligence is perhaps one of the most deeply theorized and
researched concepts within the field of psychology (Wasserman, 2012). For centuries,
researchers have attempted to describe and measure this multifaceted compilation of
human thought and behavior (Kamphaus, Winsor, Rowe, & Kim, 2012). A central
question in the study and measurement of intelligence is the extent to which cultures
differentially define intelligent behavior. Georgas (2003) theorizes that all humans
possess common cognitive processes (e.g., memory, spatial reasoning, verbal ability).
However, a culture is demarcated by its people’s unique traditions, language, beliefs, and
social norms (Sattler, 2008). These contextual characteristics affect not only the
manifestation of these universal cognitive abilities but also the importance each process
holds for survival and success within that culture (Georgas, 2003).
Given these universal abilities and culture-specific behaviors, Greenfield (1985)
asked whether or not cognitive ability tests “travel” (p. 1115) between cultures. Can a
cognitive ability measure developed in one culture be appropriately used to measure
intelligence with an individual from another culture? In other words, do enough universal
abilities exist between cultures to justify using cognitive ability tests universally?
Overwhelmingly, research supports that the direct transfer of an intelligence measure
from one culture to another does not constitute appropriate test use (American
Educational Research Association, American Psychological Association, & National
Council on Measurement in Education, 1999; American Psychological Association,
2002; Greenfield, 1985). Often, the transferred assessment does not measure the same
qualities or aspects of intelligence in the new population as it proposes to measure in the
original population. As such, scores from the new measure are invalid and not
interpretable. While common elements of intelligence exist between cultures, the
culture-specific context precludes "travelling" intelligence tests.
Instead, tests from one culture must be carefully translated and adapted for
appropriate use in another culture (Geisinger, 1994; Hambleton, 2001). A critical step in
this adaptation process is the validation of the adapted measure's scores for use in the new
population (Stein, Lee, & Jones, 2006). This step gathers evidence that the
scores from the adapted assessment measure the construct of intelligence in the same
manner across both populations. The purpose of the present study is to complete this
step for a preschool-aged cognitive ability measure normed in Spain and adapted for use
among children in rural Peru.
This purpose encompasses three important considerations: (1) the age of the
population, (2) the test being adapted, and (3) the cultural context of the adaptation.
Between the ages of 2 and 7, a child develops physically, cognitively, emotionally, and
linguistically at a tremendous rate (Berk, 2009; Edwards, 1999). While great growth is
made during this time period, the cognitive abilities of young children are often more
homogeneous than the more differentiated abilities demonstrated in older children
and adults (Baron & Leonberger, 2012). Furthermore, qualitative characteristics of
cognitive assessments (e.g., use of manipulatives, administration time, rapport with
examiner) must take into account the shorter attention span, higher energy level, and
lower impulse control demonstrated by preschool-age children (Alfonso & Flanagan,
1999). Developing a cognitive ability assessment that reliably measures these rapid
changes over time while taking into account the broad abilities and unique test behaviors
of the age group is a challenging task.
One such measure that has attempted this challenge is the Wechsler Preschool and
Primary Scale of Intelligence - Third Edition (WPPSI-III; Wechsler, 2002a). The WPPSI-
III was designed to assess the cognitive functioning of children between the ages of 2
years 6 months and 7 years 3 months. The test separates children into two age bands: (1)
2 years 6 months to 3 years 11 months and (2) 4 years to 7 years 3 months. A Spanish-
language version of the WPPSI-III (WPPSI-III-SP) was published in 2009 and normed in
Spain.
The theoretical orientation of the WPPSI-III is reflected in the subtest and
composite scores included in the assessment battery. For children in the lower age range,
four core subtests combine to provide measures of verbal ability (i.e., verbal knowledge
and the use of language in novel situations) and of performance ability (i.e., problem
solving and interpreting visual stimuli). For the older children, seven core subtests
combine to measure these abilities. For both age bands, these broad composites or
components of intelligence (e.g., verbal ability, performance ability) then load onto a
global general intelligence factor. The idea of intelligence being defined not only by
individual manifestations of intelligent behavior but also by an overall global ability is a
direct reflection of David Wechsler's theory as to the definition of "intelligence"
(Wechsler, 1944). Recent research has expanded this theory to include the importance of
processing speed and working memory as separate measurable components affecting
global intelligence (Fry & Hale, 2000).
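This hierarchical structure has a concrete algebraic consequence that a short sketch can show: in a completely standardized model, the correlation implied between any two subtests is the product of the standardized loadings along the path connecting them. The subtest names below match the WPPSI-III younger age band, but the loading values are hypothetical placeholders, not the published parameters.

```python
# Hypothetical completely standardized loadings -- placeholder values, NOT
# the published WPPSI-III parameters. Each subtest loads on one broad
# factor, and each broad factor loads on g.
subtest_loadings = {  # subtest -> (broad factor, standardized loading)
    "Receptive Vocabulary": ("Verbal", 0.80),
    "Information":          ("Verbal", 0.75),
    "Block Design":         ("Performance", 0.70),
    "Object Assembly":      ("Performance", 0.65),
}
g_loadings = {"Verbal": 0.85, "Performance": 0.80}

def implied_corr(a, b):
    """Model-implied correlation: the product of the standardized paths on
    the route connecting the two subtests (running through g when their
    broad factors differ)."""
    fa, la = subtest_loadings[a]
    fb, lb = subtest_loadings[b]
    if fa == fb:
        return la * lb
    return la * g_loadings[fa] * g_loadings[fb] * lb

print(implied_corr("Receptive Vocabulary", "Information"))  # 0.80 * 0.75
print(implied_corr("Information", "Block Design"))
```

Note that subtests sharing a broad factor are implied to correlate more strongly than subtests connected only through g, which is the pattern a confirmatory factor analysis tests against the observed correlations.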
Cultural Context of Peru
Beyond the considerations of the child's age and the test being used, the most
important consideration in cross-cultural test adaptation is the cultural context in which a
test is being adapted. According to the Central Intelligence Agency (CIA; 2014), the
Republic of Peru is a large country located on the central-western coast of South America
with a population of approximately 30 million people. The country is slightly smaller
than Alaska and contains a wide range of climates. The east is tropical and contains
the Amazon River basin; the west is a dry desert.
The Andes Mountains are a main geological feature of the country. Spanish, Quechua,
and Aymara are the official languages spoken in Peru, with the latter two being languages
spoken by Amerindians. Forty-five percent of the population identifies as Amerindian.
Nationwide, the poverty rate is about 30% with higher poverty rates (around 55%) in
more rural areas. Approximately one-third of children ages 6 to 17 works, mainly in the
mining and construction industries. Approximately 89.6% of the population over the age
of 15 can read and write. The CIA classifies the risk of infectious diseases as very high,
with high risk for bacterial diarrhea, Hepatitis A, typhoid fever, dengue fever, malaria,
and oroya fever.
Of interest to this study are small rural communities in northeastern Peru, located
within the Department of Loreto (Yori et al., 2014). While other regions of Peru have
seen economic growth, the economic growth of this region has progressed more slowly.
Approximately 5,000 people live in these communities located along the Nanay River, a
tributary of the Amazon River. Given the tropical climate, rain occurs throughout the
year and the heaviest rains fall in the month of January, frequently causing flooding.
Compared to the rest of the country, this region has higher rates of malnutrition, child and
infant mortality, malaria, and tuberculosis. Furthermore, these sites lack consistent access
to potable water and sanitation (Yori et al., 2014). The main sources of income for
individuals in these communities are growing vegetables to sell in a nearby city, fishing,
driving a taxi, making bricks (Yori et al., 2014), and harvesting palm and wood products
from the forest (Foundation for the National Institutes of Health, 2016). The majority of
families in this area live in single-family homes with wooden walls, earthen floors, and
thatched roofs. Seventy-four percent of these homes have electricity. Concerning parent
education levels, 58.2% of mothers participating in one study conducted in this region
reported having completed primary school and some secondary school, whereas 21.1% of
mothers reported having between one and five years of primary education (Yori et al.,
2014).
Adapting the WPPSI-III-SP for use among this sliver of the Peruvian population
offers potential lessons for the wider issues surrounding the use of intelligence tests and
test adaptation. First, these children face distinct challenges to their growth and
development (e.g., malnutrition, disease, limited access to health care). Malnutrition, for
example, has been linked with long-lasting negative impacts on both brain structure and
function (Levitsky & Strupp, 1995) in addition to stunting the development of cognitive
processes (Kar, Rao, & Chandramouli, 2008). However, these challenges are not unique
to children growing up in northeastern Peru. Other children living in developing nations
across the world face these difficulties. In following the malnutrition example and
according to the United Nations, poor nutrition contributes to about half (45%) of deaths
in children under the age of five worldwide (World Food Programme, 2016). In 2014,
approximately 1 in 13 children in the world suffered from sudden or acute malnutrition
(UNICEF, World Health Organization, & World Bank Group, 2015).
These environmental impacts highlight the important interplay between biological
and environmental factors affecting child development. Much research is conducted
globally to determine what interventions affecting a child, his or her environment, or both
can be put into place to help foster healthy development (Fernald, Kariger, Engle, &
Raikes, 2009). To effectively carry out this purpose, however, researchers across the
world need reliable and valid measurement tools. The adaptation and validation of scores
from the WPPSI-III-SP in this region can provide insight and lessons for researchers
looking at the potential effects of environmental factors (e.g., hunger, disease) or
intervention programs on the cognitive development of a child.
Second, at first glance, adapting the WPPSI-III-SP for use in Peru would
appear fairly straightforward, as both countries speak Spanish.
Unlike some test adaptation projects, the complicated process of translation is potentially
avoided. However, the dialect of Spanish spoken in Peru is distinct from Castilian
Spanish -- the dialect of Spanish spoken in Spain and utilized for the original test
translation and development. Although Peruvian Spanish is considered one of the most
similar dialects to Castilian Spanish in terms of pronunciation (Benson, Hellander, &
Wlodarski, 2007), grammatical differences distinguish the two dialects.
For example, Louro and Yupanqui (2011) analyzed the use of the preterit (e.g., "I
saw") and present perfect (e.g., "I have seen") tenses in Castilian Spanish and Peruvian
Spanish. By general grammar rules, the preterit is used in situations in which something
has occurred in the past whereas the present perfect is generally used to describe events
that have occurred in the past but also have relevance to the present situation. The
researchers coded samples of recorded spontaneous conversations. When referring to past
events, the Castilian Spanish speakers used the preterit tense 46% of the time whereas the
Peruvian Spanish speakers used this tense 85% of the time. Furthermore, the use of the
present perfect tense appeared to be differentially influenced by the subject of the
conversation itself. For the Castilian Spanish speakers, the increased use of the present
perfect tense indicated less influence of the past events on the present situation. Among
the Peruvian speakers, the increased use of the present perfect indicated an increased
perceived relevancy of the past on the present, whether real or psychological (Louro &
Yupanqui, 2011).
Beyond grammar differences, the native languages of Peru (i.e., Quechua,
Aymara, and various languages spoken in the Amazonian jungle) have had some
influence on Peruvian Spanish. While a full translation of language is not needed (e.g.,
from English to Spanish), the test must be adapted and "translated" from Castilian to
Peruvian Spanish. The adaptation process, therefore, provides a potential guide or
lessons for others who are adapting a test between regions or countries with a similar
overall language but with distinct dialects and cultures.
Purpose and Proposed Models
The current study examines the complex and important task of validating a
cognitive ability test adapted for use in a new population. Specifically, this study
addresses whether or not scores from an adapted version of the WPPSI-III-SP are a valid
measure of cognitive ability among a cohort of children in rural Peru. A crucial
component of cross-cultural test adaptation is the validation of the adapted test's scores
among the new population of interest (Geisinger, 1994). As outlined by Stein and
colleagues (2006), one method for establishing construct validity is a systematic
evaluation as to how the internal factor structure of the adapted test compares to the
factor structure of the original version. As such, the first research question is as follows:
1. Are the factor structures of scores from the original version of the WPPSI-III
replicated with Peruvian children?
If the two tests' scores measure the same construct in the same manner, their internal
factor structures should be invariant (Stein et al., 2006). In other
words, the factor structure of the original population would be a good fit for the Peruvian
cohort.
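A factor-structure check of this kind can be sketched in miniature. The correlation matrix below is fabricated for illustration only (it is not the Peruvian cohort data), and the study itself relied on dedicated SEM software with robust estimation; this sketch simply fits a one-factor model to a correlation matrix by unweighted least squares.

```python
import numpy as np
from scipy.optimize import minimize

# Miniature one-factor CFA by unweighted least squares (ULS). The 4x4
# correlation matrix below is fabricated for illustration only -- it is NOT
# the Peruvian cohort data.
R = np.array([
    [1.00, 0.55, 0.48, 0.42],
    [0.55, 1.00, 0.50, 0.45],
    [0.48, 0.50, 1.00, 0.40],
    [0.42, 0.45, 0.40, 1.00],
])

def uls_discrepancy(lam, R):
    """Sum of squared residuals between observed and model-implied
    correlations for a one-factor model with a unit-variance factor:
    Sigma = lam * lam' with a unit diagonal (uniquenesses absorb the
    remaining variance)."""
    sigma = np.outer(lam, lam)
    np.fill_diagonal(sigma, 1.0)
    return np.sum((R - sigma) ** 2)

res = minimize(uls_discrepancy, x0=np.full(4, 0.5), args=(R,))
print("standardized loadings:", np.round(res.x, 3))
```

A single factor "fits" to the extent that the residuals R − Σ approach zero; formal evaluation in a study like this one relies on fit indices (e.g., CFI, RMSEA) and chi-square statistics rather than raw residuals.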
However, it was also expected that cultural differences between Spain and Peru
would impede components of the normative factor structure from being a good fit for the
data. Therefore, the second research question is as follows:
2. Does another model fit the data better than the model documented with the
normative sample?
Previous research into the adaptation of intelligence tests for children within the
region provides potential alternative factor structures. Contreras and Rodriquez (2013)
adapted the Spanish version of the Wechsler Intelligence Scale for Children - Fourth
Edition (Wechsler, 2005) among a sample of children in Bucaramanga, Colombia. Factor
analyses indicated a single factor that accounted for 70.26% of variance among test
scores, as opposed to multiple factors as outlined in the normative sample. This factor
structure may be particularly applicable to the preschool children in Peru as cognitive
ability is thought to be more homogenous among younger children (Baron & Leonberger,
2012). As such, a single factor of general intellectual ability was proposed as an
alternative factor structure among the children aged 36 and 48 months.
For the children at 48 months, a third potential factor structure was proposed
stemming from the standardization of a Chilean adaptation of the Argentinean version of
the Wechsler Intelligence Scale for Children - Third Edition (Wechsler, 1991/1997)
conducted by Ramirez and Rosas (2007). Overall, the individual subtests of the Chilean
version loaded onto four distinct factors consistent with the Argentinean version.
However, analyzing the factor structures among four age bands derived from the overall
sample provided a different picture. A subtest asking participants to quickly match shapes
and symbols (Coding) did not load on the factor measuring processing speed as it was
theorized to do in the Argentinean sample. Instead, this subtest loaded on a factor
measuring the child's perceptual organization. As such, among this population, this
subtest may require more abstract thought than previously theorized (Ramirez & Rosas,
2007). Based on this finding, a third factor structure was proposed in which Coding was
theorized to load on the performance factor as opposed to the processing speed factor for
the children at 48 months of age.
While the cultures of Colombia and Chile are not the same as those of individuals
living in rural Peru, the cultures may be more similar than a comparison between Spain
and rural Peru. Therefore, it was hypothesized that these proposed models would provide
a better fit to the data than the model based on normative data.
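Comparisons of this kind between nested models are typically carried out with a chi-square difference test. The fit statistics below are hypothetical, not this study's results; note also that with robust (e.g., Satorra-Bentler) chi-squares, a scaled version of the difference is required rather than the simple subtraction shown here.

```python
from scipy.stats import chi2

# Hypothetical fit statistics for two NESTED models (not this study's
# results): a restricted model (more degrees of freedom) versus a less
# restricted comparison model. With ordinary ML chi-squares, the difference
# is itself chi-square distributed under the restricted model.
chisq_restricted, df_restricted = 31.4, 9
chisq_full, df_full = 22.8, 6

delta_chisq = chisq_restricted - chisq_full   # 8.6
delta_df = df_restricted - df_full            # 3
p_value = chi2.sf(delta_chisq, delta_df)      # upper-tail probability

print(f"chi-square difference = {delta_chisq:.1f} on {delta_df} df, "
      f"p = {p_value:.3f}")
```

A significant difference would indicate that the restrictions meaningfully worsen fit, favoring the less constrained model.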
Beyond analyzing the internal structure of an assessment, construct validity may
also be evidenced by the extent to which a test's scores correlate with scores from related
measures (i.e., convergent validity). If a test's scores validly measure an intended
construct, it would be expected that those scores would be highly and positively
correlated with scores from tests measuring the same or a theoretically related construct
(Brown, 2015; Sattler, 2008). The final research question addresses the convergent
validity of the adapted measure:
3. To what extent do the scores from an adapted WPPSI-III-SP correlate with the
scores from another adapted measure of cognitive development?
It was hypothesized that the scores from an adapted measure of intelligence completed by
the children at age 24 months would be positively correlated with the scores on the
adapted WPPSI-III-SP.
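Computationally, this convergent validity check reduces to correlating two score vectors from the same children. The scores below are simulated for illustration only; they are not the study's adapted Bayley-III or WPPSI-III-SP data.

```python
import numpy as np

# Simulated scores for illustration only -- NOT the study's adapted
# Bayley-III or WPPSI-III-SP data. Both vectors belong to the same
# (hypothetical) children, measured at 24 and 36 months.
rng = np.random.default_rng(0)
n = 150
bayley_24mo = rng.normal(100, 15, n)
# Build WPPSI scores that share variance with the earlier Bayley scores
wppsi_36mo = 100 + 0.6 * (bayley_24mo - 100) + rng.normal(0, 12, n)

r = np.corrcoef(bayley_24mo, wppsi_36mo)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A strong positive correlation would support the claim that both adapted instruments tap the same underlying cognitive construct; a weak correlation, as ultimately found in this study, leaves convergent validity in question.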
Chapter 2: Literature Review
Cross-Cultural Test Adaptation
In analyzing the process of cross-cultural test adaptation, an important first step is
to define what constitutes culture. Culture refers to the global aspects of daily life that are
passed down from generation to generation among a segment of the larger population.
These global aspects cover a wide range of individual and collective characteristics. For
example, individuals within a culture may share common behavior patterns, beliefs,
religious views, attitudes, values, social norms, self-definitions, language, norms for the
expression of emotions, and/or political views (Cohen, 2009; Sattler, 2008). In general,
cultures exist within definable geographic regions. Critically, the boundaries of these
cultural areas often do not equate to national borders. Various cultures may exist within
one country (Cohen, 2009). Of importance to psychological assessment, cultures differ
in their expression of cognitive, behavioral, and personality characteristics. For example,
in some cultures speaking in a loud voice in public is an accepted social behavior,
whereas in other cultures this behavior is considered impolite. In another example,
academic success is crucial to success within some cultures but is not considered as
important in others (Sattler, 2008).
Two important concepts for the conceptualization of culture are emic and etic.
According to Georgas (2003), emic refers to the idea that something is studied or
developed within a culture. For test development, this idea translates into individuals
from one culture developing a measure for use within that culture. Etic, on the other
hand, refers to the idea that something is studied or developed by someone outside the
culture. A test developed from this perspective would be one that was developed in one
culture and then is used in another culture with minimal adaptation or change. In this
way, the test does not reflect the behaviors, beliefs, traditions, or language indigenous to
the new culture.
This etic approach to test development introduces the possibility of test bias. A
test is biased against an individual or a culture when the reliability and validity of test
scores systematically vary according to group membership. Bias is often demonstrated
through differences in mean scores as a function of an examinee’s group membership.
These mean differences do not represent an objective difference in ability across groups.
Rather, these differences represent issues with test construction, administration, and
interpretation (Brown, Reynolds, & Whitaker, 1999). Sattler (2008) outlines the
following three types of potential bias: (1) construct, (2) content, and (3) predictive.
Construct bias refers to the construct or theoretical entity measured by the test being
differentially defined or expressed across different cultural or ethnic groups. Content bias
refers to specific items on a measure being differentially easy or difficult for one ethnic or
cultural group over another group. Finally, predictive bias refers to differential
predictions being made from a test score dependent upon group membership. Brown and
colleagues (1999) add that potential bias may also result from inappropriate and
nonrepresentative standardization samples, or from a test administrator’s own bias or
differential treatment of the examinee.
When psychologists interpret the results from potentially biased (and therefore
invalid) cognitive ability measures, the potential for harm is great. For example, the
misinterpretation of biased intelligence test results was historically used to justify
discrimination against African Americans (Jacob, Decker, & Hartshorne, 2011). Many
high-stakes decisions are made in consultation with the results from an intelligence test
(e.g., special education placement, criminal sentencing, diagnosis of psychopathology).
In the area of special education, two landmark court cases demonstrate the harm
done in the misinterpretation of culturally inappropriate cognitive ability scores. In the
case of Diana v. State Board of Education (1970), the parents of a group of Mexican
American students argued that their children’s placement in special education was unfair
due to the school’s sole reliance on the children’s intellectual ability scores when making
that determination. Of importance, the students’ cognitive ability was assessed using an
English-language test. When retested in Spanish and English, many of these students
scored significantly higher than they had previously (Jacob et al., 2011). Testing the
students in their primary language reduced the cultural bias. Second, in the case of
Larry P. v. Riles (1984), the court found intelligence tests to be racially discriminatory
and biased against African American students. As such, the identification of an African
American student as having an intellectual disability solely based on his or her
performance on an intelligence test was banned (Jacob et al., 2011). These two examples
demonstrate the importance of buffering against the effect of bias, especially in situations
in which high-stakes decisions are being made.
An alternative method for test development comes from the concept of derived
etic (Georgas, 2003). This approach suggests that individuals from within a culture either
adapt a test from another culture or develop a similar test. This method has
characteristics of both the etic and emic approaches. While the new test is still partially a
reflection of the outside culture (i.e., etic), it was developed from within the target culture
(i.e., emic). Most cross-cultural cognitive test adaptation utilizes this approach (Georgas,
2003; e.g., Contreras & Rodriquez, 2013; Ramirez & Rosas, 2007). The process of
generating a new and unique cognitive ability measure is an expensive and time-
consuming endeavor. As such, test developers and researchers generally choose the more
efficient derived etic method and adapt a previously established assessment from a
different culture for use within their culture.
Adaptation considerations. In following this derived etic method and adapting a
cognitive ability measure for use in a new culture, all aspects of the test must be analyzed
for cultural appropriateness. These aspects include the test’s content and task demands.
Regarding the test’s content, Malda, van de Vijver, Srinivasan, Transler, Sukumar, and
Rao (2008) argue that a test’s content should undergo language-driven adaptations,
culture-driven adaptations, theory-driven adaptations, and familiarity-driven adaptations.
Language-driven adaptations refer to considerations of translating an item from one
language to another. At times, a direct translation between languages is not possible due
to nonequivalent words in the two languages or significant grammatical differences.
Cultural factors for an item should also be considered. For example, an item
asking about holidays should take into account the traditions and customs of the new
population. At times, items must be changed for theoretical considerations. Malda et al.
(2008) give the example of differing item lengths on a subtest asking children to repeat
back a series of numbers. When reading these numbers out loud, the auditory length of
the series may differ between languages. Theoretically, the span of digits should be the
same length from the original test to the adaptation. As such, adaptations may need to be
made for items in the new language. Finally, familiarity-driven adaptations refer to
adapting item content such that the child is familiar with all materials, images, tasks, and
instructions. For example, a common food in one culture (e.g., hamburger in the United
States) may not be familiar to a child from another culture.
Outside of the test’s content, the behavior required by the test itself, or test session
behavior, must also be considered in cross-cultural adaptation. As defined by Frisby
(1999a), test session behavior includes the observable verbal and motor behaviors that
can be evaluated during the assessment and may influence the examinee’s performance
on the measure. These behaviors may include the examinee’s ability to pay attention to
the task at hand, the examinee’s energy level at the time of the assessment, and the
examinee’s familiarity with the task demands.
This last aspect of test session behavior may be particularly relevant to the
adaptation of a test between cultures (Frisby, 1999b). Individuals from various cultures
will differ in their exposure to prior knowledge, formal education, and the types of
activities being completed (e.g., paper-and-pencil tasks; identifying similarities between
objects). In taking a cognitive ability test, this prior exposure may translate into better
time-management or guessing strategies, and a better ability to use the test and the testing
situation itself to complete an activity. For example, Malda and colleagues adapted the
American Kaufman Assessment Battery for Children – Second Edition (Kaufman &
Kaufman, 2004) for use with a low socio-economic status population of children in India
(Malda, van de Vijver, Srinivasan, Transler, & Sukumar, 2010). The researchers
observed that the children had difficulty completing the puzzle tasks. These
children did not have extensive exposure to play materials such as puzzles. The puzzle
activity on the test, while not novel for the American culture in which the task was
originally developed, was novel to Indian children and, thus, differentially difficult for
the new population in India.
Guidelines for adaptation. The International Test Commission published a series
of guidelines for the translation and adaptation of assessments across cultures
(Hambleton, 2001). These guidelines address issues related to the context in which
assessments are administered and interpreted, test development and adaptation itself, the
administration of an adapted test, and the documentation and score interpretation of an
adapted test. Overall, these guidelines suggest analyzing how and to what extent a
construct (e.g., intelligence, spatial reasoning) overlaps between two cultures and
analyzing not only test content but also testing format.
The process of adapting and translating a test from one culture to another is a
complicated task. Geisinger (1994) proposes 10 steps for appropriate adaptation of
language and cognitive development assessments. These steps include: (1) translate and
adapt the test; (2) review the adaptation; (3) make further changes to the adapted measure
based on the review; (4) pilot the adapted measure; (5) field test the adapted measure
with a larger sample; (6) standardize the scores; (7) perform validation research; (8)
develop a manual and other documentation; (9) train users; and (10) collect reactions
from users. For each of these steps, a team of individuals knowledgeable about the
language and culture of both the new test population and the test's original population is
crucial.
Validation of the adapted measure. Of these proposed steps for proper test
adaptation, perhaps one of the most crucial steps is the validation of the adapted measure
for use in the new population. Validating the test for use in the new population requires
evidence that the new tool measures the same qualities or aspects of the construct in the
new population as it proposes to measure in the original population. In addition, the
scores from the adapted measure should be interpretable in the same manner as the
original test (Geisinger, 1994). Stein and colleagues proposed that the statistical
technique of structural equation modeling (SEM) may be a useful tool for providing
evidence with regard to construct validity (Stein et al., 2006).
Structural equation modeling can test measurement invariance between
samples. In the case of cross-cultural assessment, these samples are typically a new
population and an original normative population. If the assessment were measuring the
same construct in the same manner across the two populations, a consistent and similar
relationship between items and latent factors would be expected across the populations
(Stein et al., 2006). In other words, invariable factor structures would be expected across
groups. When applying confirmatory factor analysis (CFA) procedures, a model (i.e., the
model from the normative sample) is proposed and then tested on the new population in
order to determine whether or not this original model fits the new data equally well. If the
data do not fit the proposed model, the latent factor structure of the new data is dissimilar
to the original normative data. This incongruence supports the idea that the new
assessment is not measuring the construct in the same manner as the original test or may
not be measuring the same construct at all.
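To illustrate the logic underlying CFA (without the full estimation machinery), under a one-factor model the model-implied correlation between two standardized indicators is the product of their factor loadings. A minimal sketch with hypothetical loadings (not values from any dataset in this study):

```python
import numpy as np

# Under a one-factor model, each standardized indicator is
#   x_i = lambda_i * g + sqrt(1 - lambda_i**2) * e_i,
# so the model-implied correlation between indicators i and j
# is lambda_i * lambda_j. Simulated data should reproduce this.
rng = np.random.default_rng(2)
n = 200_000
loadings = np.array([0.8, 0.6])          # hypothetical loading values

g = rng.normal(size=n)                   # latent general factor
errors = rng.normal(size=(2, n))         # unique (error) factors
x = loadings[:, None] * g + np.sqrt(1 - loadings[:, None] ** 2) * errors

observed_r = np.corrcoef(x)[0, 1]
implied_r = loadings[0] * loadings[1]    # 0.8 * 0.6 = 0.48

print(f"implied r = {implied_r:.2f}, observed r = {observed_r:.2f}")
```

If the new population's observed correlations deviate systematically from those implied by the normative model's loadings, the proposed model will fit the new data poorly, which is the incongruence described above.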
Another method for establishing construct validity is by examining how a test's
scores are related to the scores from other assessments. A test's scores evidence
convergent validity if those scores are positively and highly correlated with scores from
an assessment measuring the same or a similar construct; conversely, low or negatively
correlated scores from an assessment measuring an alternate construct would be evidence of
divergent validity (Brown, 2015; Sattler, 2008). For example, if an intelligence test's
scores measure the intended construct, those scores would be expected to be highly
and positively correlated with the scores from another intelligence test. This method of
validating a test's scores has been used in measuring cross-language equivalence of
several common parenting measures (Nair, White, Knight, & Roosa, 2009), in validating
the translation of a language assessment into Galician (Perez-Pereira & Reches, 2011),
and in validating language sample measures taken from structured elicitation procedures
in Czech (Smolik & Malkova, 2011).
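The convergent/divergent pattern described above reduces to a comparison of Pearson correlations. A minimal sketch with simulated scores (not real examinee data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for 50 examinees on an adapted intelligence test.
adapted_iq = rng.normal(100, 15, 50)

# A measure of the same construct: the adapted score plus modest noise,
# so the two should correlate highly (convergent validity).
other_iq = adapted_iq + rng.normal(0, 6, 50)

# A measure of an unrelated construct, generated independently,
# so its correlation should be near zero (divergent validity).
motor_skills = rng.normal(50, 10, 50)

r_convergent = np.corrcoef(adapted_iq, other_iq)[0, 1]
r_divergent = np.corrcoef(adapted_iq, motor_skills)[0, 1]

print(f"convergent r = {r_convergent:.2f}")  # high and positive
print(f"divergent  r = {r_divergent:.2f}")   # near zero
```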
Cross-Cultural Intelligence Testing
The previously described process of adaptation and validation of assessments
becomes more complicated when complex constructs such as cognitive ability are
considered. As Greenfield (1985) argues, cognitive ability tests are a clear reflection of
the culture in which they were developed. The items on these tests represent a form of
“symbolic culture” (Greenfield, 1985, p. 1115). Each item represents the values,
knowledge, and methods of communication deemed important to the specific culture in
which the test was developed.
Universality of intelligence theory. When adapting a cognitive ability test, a first
consideration is the extent to which the construct of intelligence itself can be universally
applied across cultures. Georgas (2003) theorizes that all humans possess common
cognitive processes (e.g., memory, spatial reasoning, verbal ability). In a meta-analysis of
91 studies looking at the definition of intelligence in countries outside of Europe and the
United States (Irvine, 1979), the following cognitive processes were identified as
universal components to how different cultures define intelligence: visual/perceptual
processes, memory function, verbal abilities/skills, numerical operations, and
physical/temperamental qualities. However, a people’s unique culture affects how these
general processes are manifested within that particular environmental context. These
contextual characteristics affect not only the manifestation of these universal cognitive
abilities but also how important each process is to survival and success within that
culture (Georgas, 2003).
Sternberg and Grigorenko (2004) refer to this ability to survive and thrive within
a specific culture or environmental context as 'successful intelligence'. For example, the
authors asked children living in Kenyan villages to identify local herbs and explain their
medicinal uses. The ability to locate, identify, and process these herbs into medicine is
crucial to survival within these Kenyan communities. In considering the universal
abilities identified by Irvine (1979), these children rely on visual/perceptual and memory
skills to complete the task. Children in the United States, on the other hand, may be asked
to demonstrate these processing and memory skills through more academic-based tasks.
While children across cultures, therefore, are required to express the same intellectual
abilities in order to successfully navigate their environment, how these skills are
demonstrated varies widely across cultures.
Georgas, van de Vijver, Weiss, and Saklofske (2003) looked at the factor
structures of the Wechsler Intelligence Scale for Children – Third Edition (WISC-III;
Wechsler, 1991) across 14 countries (U.S., United Kingdom, France and French-speaking
Belgium, The Netherlands and Flemish-speaking Belgium, Germany, Austria and
Switzerland, Sweden, Lithuania, Slovenia, Greece, Japan, South Korea, and Taiwan) in
an effort to analyze the extent to which the factor structure of the test is similar across
nations and cultural groups. As described by Stein et al. (2006), similar factor structures
across cultural groups are indicative of a cross-cultural similarity in the construct of
intelligence. Initially, the findings reported across these sites were either a three- or four-
factor solution (5 and 9 countries, respectively). For those reporting a three-factor
solution, the consistent finding was that the subtest of Arithmetic (in which children
perform mental math in response to word problems) loaded on the Verbal
Comprehension factor as opposed to the Freedom from Distractibility (working memory)
factor.
Georgas et al. (2003) reanalyzed the entire 14-country dataset using exploratory
factor analysis. A four-factor solution and unitary second-order factor loading (general
intelligence) were supported. The researchers then conducted pair-wise comparisons of
the factor structures from each nation. Tucker’s phi coefficients (Burt, 1948) were
calculated for each comparison. Generally, a Tucker’s phi greater than or equal to .90
indicates invariance across structures. For the first three factors, all Tucker phi
coefficients were greater than .90, indicating factorial stability for those three factors
across the datasets. For the fourth factor (Freedom from Distractibility), some Tucker phi
coefficients fell below .90 with the lowest being .79.
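Tucker's phi is the normalized inner product of two factor-loading vectors. A minimal sketch with hypothetical loading values (not the actual Georgas et al. data):

```python
import numpy as np

def tuckers_phi(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two factor-loading vectors."""
    a = np.asarray(loadings_a, dtype=float)
    b = np.asarray(loadings_b, dtype=float)
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

# Hypothetical loadings of five subtests on the same factor in two countries.
country_1 = [0.75, 0.68, 0.71, 0.60, 0.55]
country_2 = [0.72, 0.70, 0.65, 0.58, 0.62]

phi = tuckers_phi(country_1, country_2)
print(f"Tucker's phi = {phi:.3f}")  # values >= .90 suggest factorial invariance
```

Identical loading vectors yield phi = 1.0; the coefficient decreases as the pattern of loadings diverges between groups.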
Overall, Georgas and colleagues (2003) argued that the factor structure of the
WISC-III demonstrated similarity across the various countries, which suggests a similar
construct of intelligence across cultures. However, the study also noted cultural
differences in the manifestations of some abilities thought to be universal. The
differential loading of the Arithmetic subtest on either the Verbal Comprehension or the
Freedom from Distractibility factor was perhaps a reflection of either the subtest itself or
of cultural differences (Georgas et al., 2003). The subtest relies on the child’s verbal
comprehension skills to interpret the word problem. In order to successfully complete the
problem, a child would need to decontextualize the arithmetic problem from the word
problem. Cultures may differ in the extent to which they emphasize or manifest this
ability to decontextualize. Cultures may also differ in the extent to which children are
exposed to word problems or to arithmetic and calculations.
This idea of differing exposure to test content or activities as a contributing factor
to noted differences in cognitive abilities across cultures was explored through a meta-
analysis conducted by Van de Vijver (1997). Van de Vijver identified 197 studies
conducted between 1973 and 1994 that compared cognitive test performance across
cultures. In support of the universality of intelligence, the most frequently reported
finding of these cross-cultural intelligence studies was the absence of any cross-cultural
differences in performance on a measure of cognitive abilities. When studies did note
differential performance across cultures, these differences were often related to the type
of task being completed or global contextual differences between countries (e.g., wealth,
educational opportunity). Greater differences in performance were noted for studies
comparing across nations than for studies comparing cultural groups within the same nation.
Furthermore, a greater disparity in affluence between countries was associated
with greater differences in performance on the measure, with individuals in wealthier
nations scoring higher than individuals in less wealthy nations. Relatedly, more years of
schooling were associated with higher performance on a measure. Van de Vijver (1997)
concluded that wealthier nations often have more educational opportunities and this
education better prepares individuals to perform well on an intelligence test. Education
potentially exposes a person to tests, learning materials, and performance-based
activities. Of note, Van de Vijver (1997) found that greater differences in performance
were noted across cultures when the study used tasks developed within Western cultures,
as opposed to using locally developed non-Western tasks. Overall, the majority of studies
did not note significant differences in cognitive ability across cultures. However, the
measurement of these universal cognitive abilities is systematically impacted by contextual
factors such as the assessment itself and the individual's exposure to tests and educational
opportunities.
Adaptation of Preschool Cognitive Assessments
Beyond cultural considerations, the adaptation of a preschool-age cognitive
assessment presents distinct challenges. One of the main difficulties to test development
for children in this age range is adequately capturing the profound and rapid development
that occurs in young childhood (Berk, 2009). In the cognitive domain, a child’s memory
and attention span grow rapidly during this time (Edwards, 1999). According to Piaget’s
conceptualization of cognitive development (Berk, 2009), preschool-age children fall in
the preoperational stage of development. At this age children begin to engage in symbolic
thinking, or begin to use words and images to represent events or experiences. Their
communication skills allow for the expression of full concepts, but complex, logical, and
abstract reasoning is still difficult for children in this age range (Berk, 2009). This rapid
cognitive development presents a challenge to the reliable measurement of ability, as
abilities do not often remain stable over time. Measures of cognitive development must
be sensitive to changes over time (Alfonso & Flanagan, 1999). As this development
levels off over time, small increases in chronological age increase the ability of tests to
predict future performance (Baron & Leonberger, 2012). Consensus among researchers
suggests that this stability of cognitive ability occurs when a child is approximately six
years old. After the age of 6, a child's score on an intelligence assessment is more
strongly correlated with later measures of intelligence (Sternberg, Grigorenko, & Bundy,
2001).
Furthermore, traditional preschool cognitive ability assessments are often
downward extensions of ability measures developed for older children and adults. As
such, these assessments may not be well equipped for measuring the unique cognitive
abilities of preschoolers (Alfonso & Flanagan, 1999). Baron and Leonberger (2012)
argued that the intellectual functioning of preschool-aged children may be more
homogeneous as compared to the cognitive ability of older children and adolescents.
Simply extending a measure for older youth downward may result in an invalid
measurement of the preschooler’s cognitive ability. The authors cite factor analyses that
support the construct validity of multidimensional ability tests for older children and
adults but do not support the same multi-factor structures in younger children (Baron &
Leonberger, 2012).
Neisworth and Bagnato (1992) go so far as to argue that the inclusion of a
standardized intelligence test in the assessment of an infant or young child is
inappropriate and potentially harmful. The authors argue that intelligence tests for young
children lack a coherent and common definition as to what constitutes "intelligent"
behavior. This lack of a coherent construct produces tests that assess a mixture of skills
that may have little predictive validity and are heavily influenced by experience or other
environmental factors (e.g., intervention services, ability of the test administrator to
engage the child). Furthermore, Neisworth and Bagnato (1992) argue that one of the
cornerstones to standardized testing is standardized administration. Children in this age
range, however, may make adherence to standard protocol difficult due to individual
differences in emotional regulation, affect, attention, socialization, and familiarity with
testing procedures or objects.
Despite these challenges to ability measurement in the preschool period, these
tests are often included in a comprehensive assessment battery for young children.
Judicious use of intelligence tests can provide meaningful information in the assessment
of young children (Bracken, 1994). Furthermore, for some diagnoses (e.g., developmental
delay) under the Individuals with Disabilities Education Act (2004), a measurement of the
child's cognitive ability may be required. However, careful examination and
consideration must be employed in developing these assessments in order to address
some of the potential drawbacks to engaging in standardized testing at this age
(Neisworth & Bagnato, 1992).
Alfonso and Flanagan (1999) outlined quantitative and qualitative characteristics
that must be examined when evaluating cognitive assessments for preschoolers. The
quantitative characteristics, or psychometric properties of the test and its scores, are much
the same as those that are examined for any age group. These characteristics include
making sure a representative standardization sample with adequate age divisions was
utilized during test and norm development. For preschool children, adequate age
stratification becomes even more important given the rapid changes that occur within this
age group. Other factors to consider are adequate score reliability, adequate floor items
such that the ability of lower performing students is sufficiently measured, sufficient
change in difficulty between items such that subtle changes in ability are captured (i.e.,
adequate item gradients), and adequate validity (Alfonso & Flanagan, 1999).
Alfonso and Flanagan (1999) also identified numerous qualitative factors of
preschool assessments that may influence the child’s performance on the exam. These
test characteristics include the test materials themselves and the administration of the test.
These assessments must take into account the age, developmental characteristics, and
interests of a typical preschool child. Assessment materials should include manipulatives
for the child to use that are colorful, simple, large, and universally appealing. The
administration of the test itself should not be long and should be engaging enough to hold
the attention of the child. As such, the administrator of the test can have a significant
impact on the performance of the child. When the administrator is skilled in building
rapport, holding a young child’s interest, and managing the behavior of a young child, the
child’s score is likely a more valid measurement of ability (Baron & Leonberger, 2012).
Finally, the expressive language requirements for test completion must be considered. As
language skills are still developing in this age range (Berk, 2009), the test should only
require short verbal answers and should allow for the use of gestures (Alfonso &
Flanagan, 1999). Therefore, when considering cross-cultural test adaptation for
preschool-age children, these qualitative test characteristics must also be examined for
their appropriateness for young children in the new population of interest.
Description of the WPPSI-III
One such assessment for measuring the cognitive ability of preschoolers is the
Wechsler Preschool and Primary Scale of Intelligence – Third Edition (WPPSI-III;
Wechsler, 2002a). The WPPSI-III is a clinical instrument designed to assess the cognitive
functioning of children between the ages of 2 years 6 months and 7 years 3 months. The
test separates children into two age bands: (1) 2 years 6 months to 3 years 11 months and
(2) 4 years to 7 years 3 months.
Development and theoretical background. The WPPSI was initially published
in the United States in 1967 (Sattler, 2008), as a downward extension of the Wechsler
Intelligence Scale for Children (Lichtenberger & Kaufman, 2004). The content and
construction of a cognitive ability assessment is often a reflection of the test creators'
theory as to what constitutes intelligent behavior, and the original WPPSI is no exception.
While Wechsler himself did not develop many of the activities included in the first
version of the WPPSI (Boake, 2002), his compilation of subtests and overall construction
of the test was driven by a concrete view as to the nature of "intelligence". Wechsler
(1944) provided the following definition of intelligence:
Intelligence is the aggregate or global capacity of the individual to act
purposefully, to think rationally, and to deal effectively with his environment. It is
global because it characterizes the individual's behavior as a whole; it is aggregate
because it is composed of elements or abilities which, though not entirely
independent, are qualitatively differentiable. (p. 3)
The influence of earlier theoreticians on this definition is evident. The idea of intelligence
being a global entity reflects the work of psychologist Charles Spearman, while the
theories of Edward Thorndike are evident in the idea of a composition of qualitatively
different behaviors (Wasserman, 2012).
Wechsler (1944) also acknowledged that one test could not adequately measure
all possible factors that allow an individual to function (e.g., drive and incentive). Instead,
Wechsler divided the products of intelligence into two general categories, verbal abilities
and performance abilities. Wechsler set out to not only measure these components of
intelligence but also to provide an overall estimate of general intellectual ability. In
including activities that measure verbal and performance abilities, an assessment may
then extrapolate an estimate of the latent general intelligence ability.
The first and second editions of the WPPSI (published in 1967 and 1989)
adhered closely to Wechsler's original division of general intellectual ability into general
verbal and performance abilities. The WPPSI-III, however, contains an updated
theoretical structure reflecting current research into the nature of intelligence (see the
Appendix for an overview of revisions of the WPPSI). More specifically, the WPPSI-III
measures the potential influence of a child's ability to process information quickly
(processing speed) and to form categories or to problem solve using novel stimuli (fluid
reasoning; Lichtenberger & Kaufman, 2004). This theoretical change may be particularly
important for preschool-aged children. According to Fry and Hale (1996), a child's ability
to process information quickly and to hold information in their memory while
manipulating it in some way are prerequisite skills to fluid reasoning or problem solving.
The addition of these skills into the theoretical construct of intelligence forms a more
accurate picture of the specific cognitive skills that allow a child to function well in his or
her environment. These updates to the theoretical underpinnings of the test are reflected
in the assessment's test content and structure.
Test structure and content. The WPPSI-III is comprised of a series of subtests
that provide a variety of composite scores dependent upon the child’s age. Table A1 in
the Appendix provides a summary of core composites and subtests for each age range.
For children in the lower age range, four core subtests combine to provide two composite
scores: (1) Verbal Intelligence Quotient (VIQ) and (2) Performance Intelligence Quotient
(PIQ). The VIQ is a measure of the child’s verbal knowledge and his or her ability to
apply those skills to a novel situation. This measure is also a reflection of the child’s
verbal understanding gained through informal education. The PIQ is generally considered
the nonverbal ability measure, assessing a child’s ability to solve problems and his or her
ability to organize and interpret visual stimuli.
The VIQ is comprised of two subtests (Receptive Vocabulary and Information).
During the Receptive Vocabulary subtest, the child looks at a group of four pictures and
points to the one that the examiner names aloud. For the Information subtest, the child
answers a series of questions that address a broad range of general knowledge topics. The
PIQ is also comprised of two subtests (Block Design and Object Assembly). For the
Block Design subtest children use bicolored blocks to recreate a design modeled or
pictured for them. Object Assembly asks a child to put together increasingly
difficult puzzles within a certain time limit. The scores from these subtests are combined
to provide an overall measure of cognitive functioning (Full Scale Intelligence Quotient
[FSIQ]; Wechsler, 2002a).
For children in the older age range, seven core subtests also combine into two
core composite scores: (1) Verbal Intelligence Quotient and (2) Performance Intelligence
Quotient. The VIQ and PIQ aim to measure the same abilities as described for the
younger cohort. The subtests of Information and Block Design carry over from the
younger age range to the older range. Information, Vocabulary, and Word Reasoning
subtests combine to provide the VIQ. The Vocabulary subtest asks children to verbally
define a series of words, whereas the Word Reasoning subtest asks the children to
identify the common concept being described in a series of increasingly specific clues.
The Block Design, Matrix Reasoning, and Picture Concepts subtests combine to provide
the PIQ composite. During the Matrix Reasoning subtest, the child looks at an incomplete
matrix and selects the missing portion from response options. During the Picture
Concepts subtest, the child chooses pictures from a series of rows to form a group with a
common characteristic. The scores from these core subtests, in addition to the Coding subtest,
are then combined to provide an overall measure of general intellectual functioning
(FSIQ; Wechsler, 2002a). During the Coding subtest, the child is presented with a series
of geometric shapes (e.g., star, circle, square). The child uses a key to copy symbols (e.g.,
line, cross) into each shape within a certain time limit.
Reliability and validity. The WPPSI-III was first developed and standardized in
the United States using a normative sample of 1,700 children. The sample was stratified
on the characteristics of age, sex, ethnicity, geographic region, and parental education
(Wechsler, 2002b). The scores from the test were found to have good reliability with
internal consistency indices ranging from .94 to .96 for the VIQ, from .89 to .95 for the
PIQ, and from .95 to .97 for the FSIQ. Split-half reliability estimates for the core subtests
ranged from .83 (Symbol Search) to .91 (Word Reasoning; Wechsler, 2002b). As
reported in the test’s technical manual (Wechsler, 2002b) and through a principal axis
factor analysis by Sattler (2008), the WPPSI-III is comprised of two factors for the
younger age range and four factors for the older age range. These factors align with the
composite and subtest groups described previously. For the younger children, significant
factor loadings for the subtests onto their respective factors ranged from .59 (Object
Assembly on PIQ) to .85 (Receptive Vocabulary on VIQ). For the older children,
significant factor loadings for the subtests onto their respective factors ranged from .38
(Matrix Reasoning on PIQ) to .88 (Vocabulary on VIQ).
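The split-half reliability estimates reported above are typically computed by correlating scores on two halves of a subtest and stepping the correlation up to full test length with the Spearman-Brown correction. A minimal sketch with simulated item scores (not actual WPPSI-III data):

```python
import numpy as np

def split_half_reliability(item_scores):
    """Split-half reliability with the Spearman-Brown correction.

    item_scores: 2-D array, rows = examinees, columns = items.
    Splits items into odd/even halves, correlates the half scores,
    then projects the correlation up to full test length.
    """
    scores = np.asarray(item_scores, dtype=float)
    half_a = scores[:, 0::2].sum(axis=1)   # odd-numbered items
    half_b = scores[:, 1::2].sum(axis=1)   # even-numbered items
    r = np.corrcoef(half_a, half_b)[0, 1]
    return 2 * r / (1 + r)                 # Spearman-Brown prophecy formula

# Simulated responses: 100 examinees, 20 items driven by one latent ability.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, (100, 1))
items = ability + rng.normal(0, 1.0, (100, 20))  # each item = ability + noise

rel = split_half_reliability(items)
print(f"split-half reliability = {rel:.2f}")
```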
However, Gordon (2004) notes in his review of the WPPSI-III that the
intercorrelations among the subtests point to a potential one-factor
structure, with all subtests loading on a general intellectual factor. In support of a two-
factor structure, the VIQ subtests would be expected to correlate more highly with each
other (convergent validity) than with the PIQ subtests (discriminant validity). Across
both age bands, the VIQ subtests do correlate more highly with each other than
with the subtests from the PIQ composite. This pattern, however, does not hold for the
PIQ subtests, which correlate equally highly with each other and with the
subtests from the VIQ factor. The WPPSI-III test manual (Wechsler, 2002b) addresses
these validity concerns. The authors posit that this lack of discriminant validity may be a
reflection of less differentiation between cognitive abilities evidenced among young
children and may be due to the high g (general intelligence) loadings of all subtests.
Based on these arguments and the intercorrelations presented, Gordon (2004) questions whether a one-factor structure may be more appropriate, particularly for the younger age band. The factor structure presented in the manual was replicated through
principal axis factor analysis (Sattler, 2008) and item response theory (Price, Raju, Lurie,
Wilkins, & Zhu, 2006). Beyond the intercorrelational evidence highlighted by Gordon
(2004), no further studies were found to either confirm or dismiss the superiority of a
one-factor structure.
The WPPSI-III has been translated, adapted, and standardized for use in the
following languages: Spanish (normed in Spain), French (normed in France), French
Canadian, German, Italian, Swedish, Korean, Japanese, and Dutch. Standardization also
occurred in Australia, the United Kingdom, and Canada (Visser, Ruiter, van der Meulen,
Ruijssenaars, & Timmerman, 2012). Few further adaptations of the WPPSI-III for use in
a culture different from the original normative culture were found within the literature.
While Bagdonas, Pociute, Rimkute, and Valickas (2008) refer to the adaptation of the
WPPSI-III for use in Lithuania, no studies confirming this adaptation process were found.
Furthermore, Wasserman and colleagues (2004) outlined the translation and adaptation of
the WPPSI-III for use among young children in Bangladesh. However, no description of
reliability or validity evidence for scores from the adapted measure was provided. Similarly, Karino, Laros, and Ribeiro de Jesus (2011) used an adapted version of the WPPSI-III in a study with no mention of the adaptation and validation process. It
should be noted that the languages in which such studies are published limit this search for evidence of cross-cultural validation of the WPPSI-III. Many studies may provide this evidence and be accessible to researchers or clinicians within a given country who use the adapted measure. For the purpose of this study, however, it
remains unclear whether the factor structures established within the standardization samples of the WPPSI-III are replicated when these assessments are adapted for use in a dissimilar culture.
Spanish adaptation. In 2009, a Spanish-language version of the WPPSI-III, the
Escala de Inteligencia de Wechsler para Preescolar y Primaria – III (hereafter identified
as the WPPSI-III-SP; Wechsler, 2009) was adapted and normed for use in Spain. The test
was normed on a sample of 1,220 Spanish children (Rodriguez & Miguel, 2012). This
test contains the same subtests and composite scores as the English-language version.
Through the adaptation process, however, items were changed or adapted to be culturally
appropriate for use in Spain. The order of items was also necessarily changed to ensure that items became increasingly difficult for the Spanish children.
Present Study
The question remains, however, whether an adapted version of the WPPSI-III-SP reliably and fairly measures intelligence in a cohort of children in rural
Peru. The primary purpose of this study, therefore, is to examine the extent to which a
model based on the factor structure derived from scores on the WPPSI-III-SP completed
by the normative Spain sample fits the scores from an adapted WPPSI-III-SP completed
by children from rural communities in Peru (see Figures 1 and 3). Given the cultural,
developmental, construct, and adaptation considerations, however, the present study also
proposes to determine if another factor structure would provide a better fit. As such, the
study attempts to answer the following questions:
1. Is the factor structure of scores from the original version of the WPPSI-III
replicated with children living in rural Peruvian communities?
2. If not, does another model fit the data better than the model outlined within the
normative sample?
In addition, this study aims to assess the construct validity of the scores from the
adapted measure by assessing the relationship of these scores to the scores of another
measure of cognitive development. The final research question addresses the convergent
validity of the adapted measure’s scores:
3. To what extent do the scores from the adapted WPPSI-III-SP correlate with the
scores from another, previously administered adapted measure of cognitive
development?
Possible alternative factor structures. Looking to other instances of intelligence test adaptation in the region may help predict possible factor structures of the adapted WPPSI-III-SP beyond the factor structure of the normative sample. As noted earlier, no studies were found that examine the validity of scores from a cross-culturally adapted version of the WPPSI-III. However, two studies of the adaptation of an intelligence measure for older children were conducted with children in Colombia and
Chile. While the cultures of Colombia and Chile are not the same as the culture in rural
Peru, the cultures may be more similar than a comparison between Spain and rural Peru.
As such, the standardization of intelligence tests in these two countries may offer possible
alternative factor structures to consider, despite outlining the psychometric properties of
an assessment aimed at older children.
Contreras and Rodriguez (2013) studied the reliability and validity of scores from the Spanish version of the Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV-SP; Wechsler, 2005) in a sample of children and adolescents from Bucaramanga, Colombia. Like the WPPSI-III, the WISC-IV-SP posits that its 15 subtests load on a four-factor structure. Regarding reliability, the WISC-IV scores had
similar reliability estimates for the Colombian sample as they had for the Spanish version's normative sample. For the overall assessment, Contreras and Rodriguez (2013) calculated a split-half reliability coefficient of .95 and a Cronbach's alpha coefficient of .98.
Regarding validity, the data did not support a four-factor structure as presented in the
original version of the test. Instead, the researchers found evidence for a single factor that
accounted for 70.26% of the total variance in the test scores (Contreras & Rodriguez,
2013). In addition, Baron and Leonberger (2012) argue that the intellectual functioning of
the preschool-aged child is more homogeneous than is the cognitive ability of an older
individual. As such, the alternative models described in Figures 2 and 4 may prove a
better fit for the data from the Peruvian sample of children.
In a second study of an intelligence test adaptation in South America, Ramirez
and Rosas (2007) adapted the Argentinian version of the Wechsler Intelligence Scale for
Children – Third Edition (WISC-III; Wechsler, 1991/1997) for use in Chile. The
researchers administered the adapted test to a stratified sample of 1,914 children, divided
into 11 age categories. The internal consistency of the subscale and composite scores and
the factorial structure of the test are reported. Regarding internal consistency, Cronbach’s
alpha coefficients ranged from .65 to .91 for the subscale scores and from .75 to .87 for
the composite scores. Using factor analysis, Ramirez and Rosas (2007) analyzed the
factor structure of the sample as a whole and for four age ranges (6 – 7, 8 – 10, 11 – 13,
and 14 – 16 years). Overall, the individual subtests loaded onto four distinct factors in a
manner consistent with the original test’s factor structure. In looking at the results for the
four age ranges, however, the Coding subtest loaded significantly on the factor
representing Perceptual Organization and not on the supplemental Processing Speed
factor, as was Coding's loading in the Argentinian sample. The authors argue that this result perhaps demonstrates that children's cognitive skills have not yet fully differentiated and reflects preoperational thought. In other words, the Coding subtest may require
more abstract reasoning than previously theorized (Ramirez & Rosas, 2007).
This interpretation may be especially relevant for a population in which children
may not have had extensive exposure to paper-and-pencil educational activities. To be
able to quickly complete the Coding subtest, a child relies on fluent skills in shape and
symbol identification. A child who has not had formal exposure to this type of paper-and-pencil task may rely more on his or her perceptual reasoning to interpret the larger shape and then identify which symbol goes into that shape. As such, the task becomes more a performance task than a processing-speed task. Therefore, a third
proposed model for the older cohort of children posits the loading of Coding on the PIQ
factor (see Figure 5).
In summary, for each age group of children (i.e., 36-month and 48-month
cohorts), two potential models are proposed: (1) models identical to the factor structure
demonstrated in the normative data (see Figures 1 and 3) and (2) a one-factor model, such
that all subtests will load on one general latent factor of intelligence (i.e., no separate
composite scores; see Figures 2 and 4; Contreras & Rodriguez, 2013). For the 48-month-old children, a third model is proposed in which the Coding subtest loads on the
Performance factor (see Figure 5; Ramirez & Rosas, 2007).
Hypotheses. If the adapted WPPSI-III-SP measures the construct of intelligence
in the same manner as the original version, it was hypothesized that the theorized factor
structure of the adapted WPPSI-III-SP would be an adequate fit for the Peruvian cohort.
Given that, however, the cultures of Colombia and Chile may resemble more closely the
Peruvian culture as compared to Spain, it was also hypothesized that the models based on
the research in these South American countries would provide a better fit to the data than
the model based on the normative data. Finally, it was hypothesized that the scores from
an adapted measure of intelligence completed by the children at age 24 months would be
positively correlated with scores from the adapted WPPSI-III-SP.
Figure 1. Hypothesized model of the adapted WPPSI-III-SP scores among the Peruvian cohort at age 36 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 1-NormYoung for analyses. GLC = General Language Composite; VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient.
Figure 2. Hypothesized single-factor model of the adapted WPPSI-III-SP for the Peruvian cohort at age 36 months, referred to as Model 2-OneFactorYoung for analyses. This model proposes a single overall ability, as found by Contreras and Rodriguez (2013).
Figure 3. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian cohort at age 48 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 3-NormOlder for analyses. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient.
Figure 4. Hypothesized single-factor model of the adapted WPPSI-III-SP for the Peruvian cohort at age 48 months, referred to as Model 4-OneFactorOlder for analyses. This model proposes a single overall ability, as found by Contreras and Rodriguez (2013).
Figure 5. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian cohort at age 48 months, referred to as Model 5-AltOlder for analyses. This model proposes the loading of Coding on the factor representing the Performance Intelligence Quotient (PIQ), based on the findings of Ramirez and Rosas (2007). VIQ = Verbal Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient.
Chapter 3: Method
Sample

A total of 188 children (101 boys) completed the younger version only (10.63%
of children), older version only (18.09%), or both versions (71.28%) of the adapted
WPPSI-III-SP. Three children were missing gender data. Among this cohort of children,
on average, the children’s mothers completed 7.77 years of education (SD = 2.68). One
hundred and fifty-six children completed the younger version of the assessment, aged 35
to 47 months (M = 36.38, SD = 3.04). For the older version, 168 children completed the
test, all aged 48 months, with the exception of one child age 49 months and one child age
36 months. For brevity’s sake, these will be referred to as the younger and older cohorts,
respectively; however, substantial overlap of participants exists (i.e., the 71.28% noted above).
These children were drawn from a larger sample of children participating in the
Interactions of Malnutrition and Enteric Infections: Consequences for Child Health and
Development (MAL-ED) project overseen by the Foundation for the National Institutes
of Health and the Fogarty International Center. All participants lived in rural
communities in northeastern Peru. A review of this cultural context is provided in the
Introduction. Children were eligible to participate in the longitudinal study if their mother
was older than 16 years of age, if no other child in the household participated in the
study, if they were healthy (e.g., no congenital diseases or severe neonatal diseases
requiring prolonged hospitalization), if the family had no plans to move away from the
community within 6 months, and if the child was not part of a multiple pregnancy.
Measures
Adapted Wechsler Preschool and Primary Scale of Intelligence - Spanish
Version (adapted WPPSI-III-SP). The cognitive ability of each child was measured
through an adapted version of the Wechsler Preschool and Primary Scale of Intelligence -
Spanish Version (Wechsler, 2009). A review of the original WPPSI-III and the WPPSI-
III-SP is presented in Chapter 2. For use with the Peruvian sample, adaptations were
made to the WPPSI-III-SP by researchers from the Department of International Health at
Johns Hopkins University. These adaptations included altering pictures to be culturally
appropriate and rewording instructions to be appropriate for the dialect of Spanish spoken
in Peru.
For example, for an item on the Picture Concepts subtest, the picture of a
capybara replaced the picture of a squirrel. As squirrels are not native to the Peruvian
jungle environment, the children would have been unfamiliar with the animal and the
item may have been unfairly difficult for them to answer. In another example on this subtest, pictures of dogs that were more realistic and recognizable to the children replaced the original cartoon pictures. In an example of changes made for Peruvian Spanish, the item
asking children to define "Swing" on the Vocabulary subtest was altered. The word used
on the WPPSI-III-SP signifies both the object (e.g., a playground swing) and a movement
(e.g., to swing back and forth) in Castilian Spanish. In Peruvian Spanish, however, the
word only represents the object. Another word was substituted asking the child to
describe the movement of swinging (A. Orbe, personal communication, July 14, 2014).
For the Block Design subtest, children earned a possible score of 0, 1, or 2 for
each item. Scoring for this subtest depended not only on whether the child completed the construction within a specified time limit but also on whether the child required one or two trials to do so. Children could earn either a score of 0 (incorrect
answer) or 1 (correct answer) on each item of the Information, Receptive Vocabulary,
Word Reasoning, Matrix Reasoning, and Picture Concepts subtests. Scores on the Object
Assembly subtest were based on the number of junctures (i.e., the place where two
adjacent puzzle pieces meet) correctly joined, with a possible per item score ranging from
0 to 5 points. For the Coding subtest, children received one point for each correctly
paired symbol and shape. Finally, for each item on the Vocabulary subtest, children could
earn 0, 1, or 2 points, with more sophisticated and specific definitions earning a higher
score. Possible score ranges for the subtests were as follows: (1) Block Design: 0 - 40; (2)
Receptive Vocabulary: 0 - 38; (3) Information: 0 - 34; (4) Object Assembly: 0 - 70; (5)
Vocabulary: 0 - 43; (6) Word Reasoning: 0 - 28; (7) Picture Concepts: 0 - 28; (8) Matrix
Reasoning: 0 - 29; and (9) Coding: 0 - 50.
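The possible raw-score ranges listed above lend themselves to a simple data-cleaning check. The sketch below (a hypothetical helper, not part of the study's procedures) encodes each subtest's maximum from the text and flags out-of-range totals:

```python
# Maximum possible raw score per subtest, as listed above (minimum is 0).
SUBTEST_MAX = {
    "Block Design": 40, "Receptive Vocabulary": 38, "Information": 34,
    "Object Assembly": 70, "Vocabulary": 43, "Word Reasoning": 28,
    "Picture Concepts": 28, "Matrix Reasoning": 29, "Coding": 50,
}

def valid_raw_score(subtest: str, score: int) -> bool:
    """True if a raw total falls within the subtest's possible range."""
    return 0 <= score <= SUBTEST_MAX[subtest]
```

For example, a Coding total of 51 would be flagged as a data-entry error, since 50 is the highest attainable score.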
Adapted Bayley Scales of Infant and Toddler Development - Third Edition
(adapted Bayley-III). The Cognitive subscale of the Bayley-III (Bayley, 2005) was used
to assess cognitive development at age 24 months. The Bayley-III is an individually
administered standardized assessment of cognitive and motor ability for children ages 1
to 42 months. The Cognitive subscale assessed recognition memory, habituation, visual
preference, visual acuity skills, problem solving, number concepts, language, and social
development (Sattler, 2008). Similar to the adapted WPPSI-III-SP, this measure was
adapted for use in Peru. Murray-Kolb and colleagues (2014) provide an overview of the
selection, adaptation, and administration of this assessment for the MAL-ED project.
Adapted scores were calculated based on exploratory factor analyses, confirmatory factor
analyses, and item response theory analyses conducted on pooled data across all MAL-
ED project sites. The final adapted Bayley-III Cognitive subscale is comprised of 15
items and scores from the Peruvian cohort revealed good reliability (Cronbach's α = .82).
Scores from the Cognitive subscale of the Bayley (original version) were found to
correlate strongly (.79) with FSIQ composite scores from the WPPSI-III (original US
norms version; Sattler, 2008).
Procedure
For the MAL-ED project, expectant mothers were recruited to the study through
local health posts and lists of expectant mothers compiled by researchers within the
community. After the birth of the child, researchers approached the mothers for
participation in the study. The newborn was screened for participation and informed
consent was gathered from the child's mother and father or from the mother and
grandfather. Informed consent forms were reviewed verbally with all participants (Yori et
al., 2014). The research team followed children from birth through the present time. In
Peru, the Asociación Benéfica PRISMA and the Johns Hopkins Bloomberg School of
Public Health oversee this research team.
Children completed the adapted Bayley-III at 24 months old and the adapted
WPPSI-III-SP shortly before or shortly after either their 3rd (younger cohort) or 4th
(older cohort) birthday. One psychologist trained in using the adapted Bayley-III and the
adapted WPPSI-III-SP conducted all assessments. For the adapted WPPSI-III-SP,
standard administration procedures were followed, with children stopping a subtest after
reaching a subtest's discontinuation rule (e.g., three incorrect answers in a row). As such,
not all participants completed all items.
Data Analyses
The data were analyzed using the statistical techniques of structural equation
modeling (SEM), specifically confirmatory factor analysis (CFA). Confirmatory factor
analysis (Stein et al., 2006; Brown, 2015) is a useful tool for validating translated and
adapted assessments. The relationships between latent variables (i.e., factors) and
observed variables (i.e., indicators) are hypothesized, or specified a priori. This
hypothesized model is then estimated with collected data to determine whether or not the
relationships observed within the data are consistent with the modeled relationships.
Model-fit statistics provide a measure of the extent to which the hypothesized model is a
plausible representation of the relationships observed within the sample data. This
method of analysis is in contrast to the statistical technique of exploratory factor analysis
in which all observed variables are loaded on all possible factors such that a factor
structure is derived from the data (Lei & Wu, 2007).
Figures 1 through 5 represent the a priori specified models subsequently estimated
using CFA. For each cohort of children (i.e., younger and older), two potential models
were proposed: (1) models identical to the factor structure demonstrated in the normative
data (Model 1-NormYoung and Model 3-NormOlder) and (2) a one-factor model, such
that all subtests load on a single general latent factor of intelligence (Model 2-
OneFactorYoung and Model 4-OneFactorOlder). For the older cohort, a third model was
proposed in which the Coding subtest loads on the Performance factor (Model 5-
AltOlder). Data consisted of the sum of points earned on each subtest. The CFA analyses
were based on the covariance matrices of these raw data.
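As a minimal sketch of this input step (using NumPy and toy data, not the study's actual scores), the covariance matrix of subtest raw-score totals can be computed as:

```python
import numpy as np

# Toy raw-score totals: rows = children, columns = four subtest totals.
scores = np.array([
    [7, 10, 6, 3],
    [5,  8, 7, 2],
    [9, 12, 8, 4],
    [6,  9, 5, 1],
], dtype=float)

# Sample covariance matrix (rowvar=False treats columns as variables);
# a matrix of this form is what the CFAs take as input.
cov = np.cov(scores, rowvar=False)
```

The resulting matrix is square and symmetric, with subtest variances on the diagonal and pairwise covariances off the diagonal.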
Prior to analysis, a model must be theoretically identified. According to Kline
(1998), a model is theoretically identified when "it is possible...to derive a unique
estimate of every model parameter" (p. 49). Table 1 presents the requirements for
identification of standard CFA models as defined by Kline (1998). A standard model
contains no correlated errors or cross-loadings. As such, all five specified models are
standard CFA models. Models 1, 3, and 5 contain the second-order FSIQ factor. Second-order factors are identified in the same manner as first-order models, in that the first-order factors are considered the indicator variables for the second-order factor.
All CFA models must first meet two necessary but not sufficient rules for
identification: (1) the number of free parameters must be less than or equal to the number
of observations, and (2) all latent variables must have a scale. Regarding the first rule, the
number of observations is calculated through the following equation: v(v + 1)/2, where v
is the number of observed variables. Parameters represent the total number of variances
and covariances of the factors, measurement errors, and factor loadings. In analyzing the
parameters and observations of each proposed model as outlined in Table 2, all models
meet the first rule for identification (i.e., Parameters ≤ Observations).
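The counting rule above can be expressed directly in code; the following sketch (an illustrative helper, not part of the study's analyses) implements the v(v + 1)/2 formula and the necessary condition:

```python
def n_observations(v: int) -> int:
    """Unique variances and covariances among v observed variables: v(v + 1)/2."""
    return v * (v + 1) // 2

def meets_counting_rule(n_free_parameters: int, v: int) -> bool:
    """Necessary (but not sufficient) identification condition:
    free parameters must not exceed observations."""
    return n_free_parameters <= n_observations(v)
```

For instance, the four-subtest one-factor model has 8 free parameters against 4(5)/2 = 10 observations, so it passes the counting rule, consistent with Table 2.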
Table 1

Model Identification Rules for Standard CFA Models

Number of Latent Factors   Identification Conditions        Necessary or Sufficient?
1                          1. Parameters ≤ Observations     Necessary
                           2. Scale for every factor        Necessary
                           3. ≥ 3 Indicators                Sufficient
≥ 2                        1. Parameters ≤ Observations     Necessary
                           2. Scale for every factor        Necessary
                           3. ≥ 3 Indicators                Sufficient

Table 2

Free Parameters and Observations for Specified Models

Model                              Parameters   Observations
Model 1-NormYoung - First Order         9            10
Model 1-NormYoung - Second Order        2             3
Model 2-OneFactorYoung                  8            10
Model 3-NormOlder - First Order        13            21
Model 3-NormOlder - Second Order        6             6
Model 4-OneFactorOlder                 14            28
Model 5-AltOlder - First Order         15            28
Model 5-AltOlder - Second Order         2             3

Regarding the second rule, latent variables are unobserved variables and,
therefore, do not inherently have a scale. However, they require a measurement scale in
order to be estimated. According to Brown (2015) a scale is applied to a latent variable in
one of two ways: (1) by fixing the loading of one indicator per factor to 1.0, which
transfers the scale of the indicator to the latent variable, or (2) by fixing the variance of
the latent variable to a constant. This constant is usually 1.0, which standardizes the latent
variable. For each of the first-order models (Models 2 and 4), the latent variable was
scaled by fixing the variance of the factor to 1.0. For the second-order models (Models 1,
3, and 5), the first order latent variables (e.g., VIQ and PIQ) were scaled by fixing one
factor loading to 1.0. The second-order factor was scaled by fixing the loadings between
latent variables to 1.0.
LISREL Student Version 9.2 was used for all analyses. To correct for non-normality with continuous data, the Satorra-Bentler statistic was used. The Satorra-Bentler statistic is a scaled chi-square test statistic, computed under maximum likelihood estimation, that is robust to non-normality.
This statistic “adjusts the value of the standard χ2 downward by a constant that reflects
the degree of observed kurtosis” (Kline, 1998, p. 210). Kline's (1998) criteria were used to describe univariate skewness: mild = less than |1|; moderate = |1| to |3|. Univariate kurtosis values were considered mild between -1.3 and 7 (Kline, 1998). A covariance matrix was analyzed, and an asymptotic covariance matrix was also computed for use with the Satorra-Bentler statistic. No non-standard procedures (e.g., user-specified starting values,
changing convergence criterion, increasing the number of iterations) were used.
Goodness-of-fit was assessed using a variety of fit statistics. More specifically,
incremental and absolute measures of fit were analyzed (Worthington & Whittaker,
2006). Incremental fit indices indicate the improvement in fit by comparing a proposed
model to a baseline model, or a model in which all variables are unrelated (Bentler &
Bonnet, 1980). The incremental fit indices of the Comparative Fit Index (CFI; Bentler,
1990), the Non-Normed Fit Index (NNFI; Bentler & Bonnet, 1980), and the Incremental
Fit Index (IFI; Bollen, 1989) were analyzed. For these three indices, values above .95
were considered indicative of good fit (Hu & Bentler, 1995). Absolute indices indicate
the extent to which the proposed model explains the relationships observed within the
sample data. The Root Mean Square Error of Approximation (RMSEA) is an absolute
index that represents the lack of model fit when compared to a perfect model. RMSEA
values less than or equal to .06 are indicative of good fit, whereas values less than or
equal to .08 indicate acceptable fit (Hu & Bentler, 1995). Consistency or inconsistency in
findings across these fit statistics was used to assess overall model fit. More specifically,
the extent to which these indices were consistently high or low was the more important
consideration in overall model fit (Markland, 2006).
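The incremental and absolute indices described above follow standard formulas. The sketch below uses the common textbook definitions (not LISREL's exact output routines, so values may differ slightly from printed software output):

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index: improvement of the model over the baseline
    (independence) model, based on noncentrality (chi-square minus df)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, 0.0)
    return 1.0 - d_m / max(d_m, d_b, 1e-12)

def nnfi(chi2_m, df_m, chi2_b, df_b):
    """Non-Normed Fit Index (Tucker-Lewis Index)."""
    r_m, r_b = chi2_m / df_m, chi2_b / df_b
    return (r_b - r_m) / (r_b - 1.0)

def rmsea(chi2_m, df_m, n):
    """Root Mean Square Error of Approximation for sample size n."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))
```

Under these formulas, a model whose chi-square does not exceed its degrees of freedom yields an RMSEA of 0 and a CFI of 1.0, which is why small, nonsignificant chi-square values accompany the good-fit conclusions reported later.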
In addition to overall model fit, component fit was analyzed and reliability
estimates for each scale are reported. The following criteria from Barker, Pistrang, and
Elliot (2016) were used to describe reliability estimates: good ≥ .80; acceptable = .70 -
.79; low = .60 - .69; and poor ≤ .59. To describe the strength of intercorrelations the
following criteria were used (per Evans, 1996): very weak = 0 - .19; weak = .20 - .39;
moderate = .40 - .59; strong = .60 - .79; and very strong = .80 - 1.00.
Convergent validity analyses. Raw scores from the adapted WPPSI-III-SP
subtests were summed and standardized to provide an overall measure of cognitive
ability. The Cognitive subscale raw scores from the adapted Bayley-III at 24 months were
also standardized. The strength and direction of the relationship between scores on the
adapted WPPSI-III-SP and the adapted Bayley-III were calculated through a correlation
of the standardized adapted Bayley-III and adapted WPPSI-III-SP scores. A supplemental
convergent validity analysis was conducted among those children who completed the
adapted WPPSI-III-SP at both 36 and 48 months. The strength and direction of the
relationship between scores on the adapted WPPSI-III-SP at each time point were
calculated through a correlation of a child's total score (after standardization) at 36
months and at 48 months.
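The standardize-then-correlate procedure above can be sketched as follows (assuming NumPy; variable names are illustrative, not the study's):

```python
import numpy as np

def standardize(x):
    """z-scores: mean 0, SD 1 (sample SD, ddof=1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def convergent_r(scores_a, scores_b):
    """Pearson correlation between two standardized total-score vectors."""
    za, zb = standardize(scores_a), standardize(scores_b)
    return float(np.corrcoef(za, zb)[0, 1])
```

Note that standardization does not change the Pearson correlation itself; it simply places the two measures on a common metric before the scores are compared.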
Chapter 4: Results
Younger Cohort
Preliminary analyses and descriptive statistics. Preliminary analyses were
conducted to examine the scores from the younger cohort for outliers and missing data.
No cases were identified as having missing data. One case was determined to have an
outlying score on the Object Assembly subtest, as the score was greater than three
standard deviations from the mean (z = 7.19). After removal of this case, the distribution of Object Assembly scores more closely approached normality; the case was therefore removed listwise. In addition, seven cases were deleted because the children were 48 months old at the time of assessment and thus fell outside the age range delineated by the test's publisher (Wechsler, 2009). The final sample size for data
analysis was 147. Table 3 presents the descriptive statistics (i.e., intercorrelations, means,
skew, kurtosis, and coefficient alphas) for this dataset.
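The three-standard-deviation outlier screen applied above can be sketched as follows (NumPy, with illustrative data only):

```python
import numpy as np

def flag_outliers(scores, threshold=3.0):
    """Return z-scores and a boolean mask of cases beyond +/- threshold SDs."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return z, np.abs(z) > threshold
```

Cases flagged by the mask would then be inspected and, as here, removed listwise when their scores distort the distribution.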
Table 3

Intercorrelations, Descriptive Statistics, and Reliability Estimates for Adapted WPPSI-III-SP Subtest Scores (Younger Cohort)

Adapted WPPSI-III-SP Subtest    1      2      3      4
1. Receptive Vocabulary         --
2. Block Design                 .32    --
3. Information                  .41    .39    --
4. Object Assembly              .24    .26    .32    --
M                               7.01   9.61   6.58   2.87
SD                              4.10   3.92   2.30   1.69
Skew                            0.74   -0.04  -0.51  1.53
Kurtosis                        0.36   -0.40  0.44   2.35
Coefficient alpha (α)           .83    .71    .66    .53

Note. n = 147. WPPSI-III-SP = Wechsler Preschool and Primary Scale of Intelligence (Third Edition) - Spanish Version.

The simultaneous test of multivariate skewness and kurtosis was statistically significant, χ2(2) = 52.28, p < .001. However, the relative multivariate kurtosis was 1.08, indicating that multivariate kurtosis was 8% larger than expected under a multivariate normal distribution. Multivariate kurtosis, therefore, was considered mildly non-normal (per
Kline, 1998). The χ2 tests of simultaneous univariate skewness and kurtosis were also
statistically significant for all variables at the .05 level, with the exception of the Block
Design subtest. For the variables that had statistically significant simultaneous univariate
skewness and kurtosis, all variables had skewness values that were significantly different
from normal (p < .05). Object Assembly was the only subtest with a kurtosis value
significantly different from normal. In analyzing the skewness and kurtosis values
presented in Table 3, univariate skewness fell in the moderate range for Object Assembly
and in the mild range for all other variables. Univariate kurtosis was considered mild for
all variables. In sum, the data were considered to be mildly non-normal, justifying the use
of robust tests in the analyses.
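The univariate screening criteria applied above can be encoded directly. In the sketch below, the bands follow Kline (1998) as used in this chapter; the "severe" label for values beyond the moderate band is my own addition, not Kline's:

```python
def describe_skew(value: float) -> str:
    """Kline's (1998) skewness bands as applied here: mild < |1|; moderate |1|-|3|."""
    v = abs(value)
    if v < 1:
        return "mild"
    if v <= 3:
        return "moderate"
    return "severe"  # label for values beyond Kline's moderate band (my assumption)

def describe_kurtosis(value: float) -> str:
    """Mild if within Kline's (1998) range of -1.3 to 7; flagged otherwise."""
    return "mild" if -1.3 <= value <= 7 else "non-normal"
```

Applied to Table 3, Object Assembly's skew of 1.53 falls in the moderate band while its kurtosis of 2.35 remains mild, matching the description in the text.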
Intercorrelations were computed to determine the relationships among the various subtest scores, and all values are reported in Table 3. All subtests were positively
correlated with each other. The correlation between Receptive Vocabulary and
Information fell in the moderate range; all other correlations were weak. Scores from the
Receptive Vocabulary subtest were found to have good reliability, and Block Design
subtest scores demonstrated acceptable reliability, whereas Information and Object Assembly subtest scores revealed low and poor reliability, respectively (see Table 3). Overall, the average reliability coefficient across all four subtests was low
(Cronbach's α = 0.61).
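The coefficient alphas reported in Table 3 follow the standard internal-consistency formula; a minimal sketch (with toy item data, assuming NumPy) is:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha. items: 2-D array, rows = examinees, cols = item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)
```

When items vary together perfectly, alpha reaches 1.0; as item responses become less consistent across examinees, alpha shrinks toward (or below) zero.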
Model 1-NormYoung. The model based on the normative structure appears to
provide a reasonable fit to the data. Selected fit indices are presented in Table 6. The fit
indices of CFI, NNFI, and IFI fall above .95, indicating good fit. The RMSEA falls below
.06, also indicating good fit. Finally, the Satorra-Bentler Scaled Chi-Square is
nonsignificant, indicating good overall fit, χ2SB = 0.092, p = 0.762, df = 1.
Regarding component fit, parameter and standard error estimates are presented in
Table 4. All completely standardized factor loadings are within range and statistically
significant (z-test statistics > 1.96; see Figure 6). Furthermore, as expected, all path coefficients were positive (i.e., a positive relationship between each indicator and its latent variable). The standard errors are reasonable as they are smaller than the standard
deviations of the indicator variables. All standardized residuals are acceptable (< |2.58|).
No modification indices were greater than 3.84, and all standardized expected change
values were small, suggesting that no paths should have been freed. Measurement model
R2 values were poor (< .36) for Object Assembly (R2 = .21), Block Design (R2 = .33), and
Receptive Vocabulary (R2 = .32), and moderate for Information (R2 = .52).
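In a completely standardized solution, an indicator's R2 equals its squared standardized loading. A quick check against the loadings in Table 4 is sketched below; small discrepancies from the reported R2 values reflect rounding of the published loadings.

```python
# Standardized loadings as reported in Table 4
loadings = {"Receptive Vocabulary": 0.57, "Information": 0.73,
            "Block Design": 0.57, "Object Assembly": 0.46}

# R-squared of each indicator is the square of its standardized loading
r_squared = {name: round(lam ** 2, 2) for name, lam in loadings.items()}
print(r_squared)
```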
Table 4
Parameter and Standard Error Estimates for Model 1-NormYoung
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on VIQ
Receptive Vocabulary .57 2.33a 2.02
Information .73 1.67* 0.63
Loadings on PIQ
Block Design .57 2.25a 1.99
Object Assembly .46 0.78* 0.41
Loadings on FSIQ
VIQ .96 0.43a 0.21
PIQ .99 0.45a 0.33
Note. Table values are Maximum Likelihood estimates. VIQ = Verbal Intelligence Quotient; PIQ = Perceptual Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient. *p < .05. a Fixed factor loading.
Figure 6. Completely standardized factor loadings for Model 1-NormYoung. *p < .05. a Fixed factor loading.
Model 2-OneFactorYoung. The one-factor model based on the research of
Contreras and Rodriquez (2013) appears to provide a reasonable fit to the data. Selected
fit indices are presented in Table 6. The fit indices of CFI, NNFI, and IFI fall above .95,
indicating good fit. The RMSEA falls below .06, also indicating good fit. Finally, the
Satorra-Bentler Scaled Chi-Square is nonsignificant, indicating good overall fit, χ2SB =
0.188, p = .91, df = 2.
Regarding component fit, parameter and standard error estimates are presented in
Table 5. All completely standardized factor loadings are within range and statistically
significant (z-test statistics > 1.96; see Figure 7). As expected, all path coefficients were
positive (i.e., a positive relationship between indicator and latent variable). The standard
errors are reasonable as they are all smaller than the standard deviations of the indicator
variables. All standardized residuals are acceptable (< |2.58|). No modification indices
were greater than 3.84, and all standardized expected change values were small,
suggesting that no paths should have been freed. Measurement model R2 values were
poor (< .36) for Receptive Vocabulary (R2 = .32), Block Design (R2 = .31), and Object
Assembly (R2 = .20), and moderate for Information (R2 = .52).
Table 5
Parameter and Standard Error Estimates for Model 2-OneFactorYoung
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on FSIQ
Receptive Vocabulary .57 2.33* 0.32
Information .72 1.65* 0.25
Block Design .55 2.17* 0.35
Object Assembly .45 0.76* 0.14
Note. Table values are Maximum Likelihood estimates. FSIQ = Full Scale Intelligence Quotient. *p < .05
Figure 7. Completely standardized factor loadings for Model 2-OneFactorYoung. *p < .05.
Table 6
Selected Fit Indices for Younger Cohort Models
χ2SB df RMSEA (CI90) CFI NNFI IFI
Model 1-NormYoung 0.092 1 0.0 (0.0 - 0.15) 1.00 1.00 1.00
Model 2-OneFactorYoung 0.188 2 0.0 (0.0 - 0.07) 1.00 1.00 1.00
Note. χ2SB = Satorra-Bentler Chi-Square; df = degrees of freedom; RMSEA = Root Mean Square Error of Approximation; CI90 = 90% Confidence Interval for RMSEA; CFI = Comparative Fit Index; NNFI = Non-Normed Fit Index; IFI = Incremental Fit Index.
Model comparisons. To compare the overall fit of Model 1-NormYoung to
Model 2-OneFactorYoung, a Satorra-Bentler Chi-Square difference test was conducted.
As summarized in Table 7, results suggest that Model 1-NormYoung and Model 2-
OneFactorYoung are equivalent.
Table 7
Satorra-Bentler Chi-Square Difference Test
χ2SB df
Model 2-OneFactorYoung 0.188 2
Model 1-NormYoung 0.092 1
Difference 0.096 1
Note. Difference is statistically significant if greater than 3.84.
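The comparison in Table 7 can be sketched as a simple chi-square difference checked against the critical value for one degree of freedom. Note this is only the unscaled difference; a fully scaled Satorra-Bentler difference test would also incorporate each model's scaling correction factor, which is not reported here.

```python
from scipy.stats import chi2

# Chi-square values and degrees of freedom as reported in Table 7
chisq_one_factor, df_one_factor = 0.188, 2   # Model 2-OneFactorYoung
chisq_norm, df_norm = 0.092, 1               # Model 1-NormYoung

diff = chisq_one_factor - chisq_norm         # 0.096
df_diff = df_one_factor - df_norm            # 1
critical = chi2.ppf(0.95, df_diff)           # ~3.84, the table's cutoff
print(diff < critical)                       # True: the models fit equivalently
```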
Older Cohort
Preliminary analyses and descriptive statistics. Preliminary analyses were
conducted to examine the scores from the older cohort for outliers and missing data. The
data from one child aged 36 months were deleted listwise, as this age falls outside the
range of the assessment. Although some cases had scores greater than three standard
deviations from the mean, no Mahalanobis distance scores were significant (CV = 24.32
at p = .001). The final sample size for data analysis was 167. Table 8 presents the
descriptive statistics (i.e., intercorrelations, means, skew, and kurtosis) for this dataset.
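The outlier screen described above can be sketched as follows. The critical value of 24.32 corresponds to the chi-square distribution with 7 degrees of freedom (one per subtest) at p = .001; the data matrix below is a simulated placeholder, not the study data.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_d2(X: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample centroid."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Critical value for 7 variables at p = .001 (CV = 24.32 in the text)
cv = chi2.ppf(1 - 0.001, df=7)
print(round(cv, 2))   # 24.32

# Simulated data: 167 cases x 7 subtest scores
rng = np.random.default_rng(1)
X = rng.normal(size=(167, 7))
print((mahalanobis_d2(X) > cv).sum())   # count of flagged multivariate outliers
```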
Table 8
Intercorrelations, Descriptive Statistics, and Reliability Estimates for Adapted WPPSI-III-SP Subtest Scores (Older Cohort)
Adapted WPPSI-III-SP Subtest 1 2 3 4 5 6 7
1. Picture Concepts --
2. Block Design .07 --
3. Information .23 .32 --
4. Word Reasoning .32 .32 .47 --
5. Vocabulary .35 .28 .47 .50 --
6. Matrix Reasoning .49 .30 .28 .29 .38 --
7. Coding .29 .31 .22 .46 .26 .26 --
M 1.46 15.06 12.23 1.09 6.43 1.58 1.08
SD 2.05 3.44 3.21 1.81 4.03 2.14 2.53
Skew 1.72 -0.55 -0.15 2.56 0.72 1.61 2.56
Kurtosis 2.62 1.38 1.79 7.98 -0.19 3.07 6.37
Coefficient alpha (α) .80 .66 .78 .76 .81 .79 .90
Note. n = 167. WPPSI-III-SP = Wechsler Preschool and Primary Scale of Intelligence (Third Edition) - Spanish Version.
The simultaneous test of multivariate skewness and kurtosis was statistically
significant, χ2 (2) = 474.35, p < .001. The relative multivariate kurtosis was 1.52,
indicating that multivariate kurtosis was 51.8% larger than a multivariate normal
distribution. The χ2 tests of simultaneous univariate skewness and kurtosis were also
statistically significant for all variables at the .05 level. Regarding univariate skewness
and kurtosis, all variables had skewness and kurtosis values that were significantly
different from normal (p < .05), with two exceptions. The skewness value for Information
and the kurtosis value for Vocabulary were not significant. In analyzing the skewness and
kurtosis values presented in Table 8, univariate skewness fell in the moderate range for
all variables. Univariate kurtosis was considered mild for all variables, with the exception
of the Word Reasoning subtest. This subtest demonstrated moderate kurtosis. In sum, the
data were considered to be non-normal, justifying the use of robust tests in the analyses.
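The univariate tests described above resemble scipy's moment-based normality tests, where the "simultaneous" test is an omnibus chi-square combining skewness and kurtosis. The scores below are simulated right-skewed values, not the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated right-skewed subtest scores (e.g., many low raw scores)
scores = rng.gamma(shape=1.0, scale=2.0, size=167)

print(round(stats.skew(scores), 2), round(stats.kurtosis(scores), 2))
z_skew, p_skew = stats.skewtest(scores)        # univariate skewness test
z_kurt, p_kurt = stats.kurtosistest(scores)    # univariate kurtosis test
k2, p_omni = stats.normaltest(scores)          # simultaneous (omnibus) χ² test
print(p_skew < .05, p_omni < .05)
```

Significant results on these tests would, as in the text, justify robust estimators in the subsequent factor analyses.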
Intercorrelations were conducted to determine the relationships among the various
subtest scores and all values are reported in Table 8. All subtests were positively
correlated with each other. Correlations between Matrix Reasoning and Picture Concepts,
Information and Word Reasoning, Information and Vocabulary, and Vocabulary and
Word Reasoning fell in the moderate range, whereas all other correlations fell in the
weak to very weak ranges. Scores from the Vocabulary, Picture Concepts, and Coding
subtests were found to have good reliability. Scores from the Block Design, Information,
Matrix Reasoning, and Word Reasoning subtests were found to have acceptable
reliability. Overall, the average reliability estimate across subtest scores fell in the
acceptable range (Cronbach's α = .75).
Model 3-NormOlder. The model based on the normative structure appears to
provide a poor fit to the data. Initial analyses yielded a non-positive definite matrix for
latent variables, with a negative error variance for the PIQ factor. Gerbing and
Anderson (1987) studied three methods for respecification of initial models with one
negative estimate and small sample sizes. The authors suggested fixing the variance of
the improper parameter to a negligible number. This method is also noted by Brown
(2015). As such, the variance of PIQ was set to 0.001. Selected fit indices are presented
in Table 10. The fit indices of CFI, NNFI, and IFI fall below .95, indicating inadequate
fit. The RMSEA falls above .08, also indicating inadequate fit. Finally, the Satorra-
Bentler Scaled Chi-Square is significant, indicating poor overall fit, χ2SB = 38.79, p =
0.007, df = 14. Parameter and standard error estimates are presented in Table B1 in
Appendix B. While some modification indices were greater than 3.84, these
modifications to the model were not theoretically supported.
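As a sketch, the point estimate of RMSEA can be recovered from a model chi-square, its degrees of freedom, and the sample size. This is the unscaled formula; the robust (scaled) RMSEA values reported in Table 10 can differ somewhat.

```python
import math

def rmsea(chisq: float, df: int, n: int) -> float:
    """Unscaled point estimate of RMSEA from a model chi-square."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

# Model 3-NormOlder: chi-square 38.79 on 14 df, n = 167
print(round(rmsea(38.79, 14, 167), 3))   # 0.103
# Model 1-NormYoung: chi-square 0.092 on 1 df, n = 147
print(rmsea(0.092, 1, 147))              # 0.0, matching Table 6
```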
Model 4-OneFactorOlder. The one-factor model based on the research of
Contreras and Rodriquez (2013) appears to provide a poor fit to the data. Selected fit
indices are presented in Table 10. The fit indices of CFI, NNFI, and IFI fall below .95,
indicating inadequate fit. The RMSEA falls above .08, also indicating inadequate fit.
Finally, the Satorra-Bentler Scaled Chi-Square is significant, indicating poor overall fit,
χ2SB = 37.87, p < .001, df = 14. Parameter and standard error estimates are presented in
Table B2 in Appendix B.
Of the modification indices (MI) greater than 3.84, the MIs suggesting correlated
errors between Coding and Block Design and between Vocabulary and Word Reasoning
were the largest (20.78 and 8.62, respectively) and the most theoretically supported.
Coding and Block Design rely on recognizing and matching shapes. Vocabulary and
Word Reasoning draw on word knowledge and definition skills. The selected fit statistics
for this modified model are also presented in Table 10. While the IFI and CFI
fall at the threshold of .95, the NNFI falls below it. The RMSEA is equal to .08,
indicating acceptable fit. However, the Satorra-Bentler Scaled Chi-Square is significant,
indicating potentially unacceptable overall fit, χ2SB = 21.534, p = .04, df = 12. Taking into
account all fit evidence, it appears that, overall, correlating these errors provided an
adequate fitting model.
Regarding component fit, parameter and standard error estimates are presented in
Table 9. All completely standardized factor loadings are within range and statistically
significant (z-test statistics > 1.96; see Figure 8). As expected, all path coefficients were
positive (i.e., a positive relationship between indicator and latent variable). The standard
errors are reasonable as they are all smaller than the standard deviations of the indicator
variables. All standardized residuals are acceptable (<|2.58|). Measurement model R2
values were poor (< 0.36) for Information (R2 = 0.21), Block Design (R2 = 0.24), Coding
(R2 = 0.18), and Vocabulary (R2 = 0.18), and moderate for Picture Concepts (R2 = 0.42),
Word Reasoning (R2 = 0.49), and Matrix Reasoning (R2 = 0.51).
Table 9
Parameter and Standard Error Estimates for Model 4-OneFactorOlder with Modifications
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on FSIQ
Picture Concepts .65 2.08* 1.07
Information .46 1.57* 1.41
Block Design .49 1.05* 0.58
Word Reasoning .70 1.26* 0.49
Vocabulary .42 1.07* 1.04
Matrix Reasoning .72 2.88* 1.53
Coding .42 0.86* 0.49
Note. Table values are Maximum Likelihood estimates. FSIQ = Full Scale Intelligence Quotient. *p < .05.
Figure 8. Completely standardized factor loadings for modified Model 4-OneFactorOlder. *p < .05.
Model 5-AltOlder. The alternative second-order model based on the research of
Ramirez and Rosas (2007) appears to provide a poor fit to the data. Similar to Model 3-
NormOlder, initial analyses yielded a non-positive definite matrix for latent
variables, with a negative error variance for the VIQ factor. As such, the variance of
VIQ was set to 0.001 (Gerbing & Anderson, 1987). Selected fit indices are presented in
Table 10. The fit indices of CFI, NNFI, and IFI fall below .95, indicating inadequate fit.
The RMSEA falls above .08, also indicating inadequate fit. Finally, the Satorra-Bentler
Scaled Chi-Square is significant, indicating poor overall fit, χ2SB = 36.83, p < .001, df =
13. Parameter and standard error estimates are presented in Table B3 in Appendix B.
While some modification indices were greater than 3.84, these modifications to the model
were not theoretically supported.
Table 10
Selected Fit Indices for Older Cohort Models
χ2SB df RMSEA (CI90) CFI NNFI IFI
Model 3-NormOlder 38.70 14 0.11 (0.08 - 0.15) .88 .83 .89
Model 4-OneFactorOlder 37.87 14 0.13 (0.09 - 0.16) .86 .79 .86
Model 5-AltOlder 25.94 13 0.12 (0.08 - 0.16) .88 .81 .89
After Modifications
Model 4-OneFactorOlder 21.53 12 0.08 (0.01 - 0.18) .95 .92 .96
Note. χ2SB = Satorra-Bentler Chi-Square; df = degrees of freedom; RMSEA = Root Mean Square Error of Approximation; CI90 = 90% Confidence Interval for RMSEA; CFI = Comparative Fit Index; NNFI = Non-Normed Fit Index; IFI = Incremental Fit Index.
Convergent Validity Analyses
Of the 147 children in the younger cohort who were included in the sample for
confirmatory factor analyses, 141 of these children completed the adapted Bayley-III
Cognitive subtest at 24 months. Among the older cohort, 158 children completed the
adapted Bayley-III and the adapted WPPSI-III-SP. Table 11 presents the descriptive
statistics (i.e., means, skew, kurtosis, and coefficient alphas) for these samples. In
addition, 134 children completed the adapted WPPSI-III-SP at 36 and 48 months. Data
for eight children from this sample were deleted due to the child's age being outside the
range of the assessment. As such, the final sample size for examining the direction and
strength of the relationship between scores at each time point was 126.
Cognitive subtest scores from the adapted Bayley-III at 24 months were
significantly and positively correlated (r = .21; p < .05) with scores from the adapted
WPPSI-III-SP at 36 months. This correlation was weak. For the older cohort, cognitive
subtest scores from the adapted Bayley-III at 24 months were significantly and positively
correlated (r = .28; p < .05) with scores from the adapted WPPSI-III-SP at 48 months.
This relationship was also weak. Regarding the cohort of children who completed the
adapted WPPSI-III-SP at 36 and 48 months, scores from these time points were
significantly, positively, and moderately correlated (r = .55; p < .05).
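The correlations reported above can be examined with a standard Pearson test. The paired totals below are simulated to mimic the 36- and 48-month comparison (n = 126) and are not the study data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
# Simulated paired totals for 126 children tested at 36 and 48 months,
# sharing a common "ability" component to induce a moderate correlation
shared = rng.normal(size=126)
score_36 = shared + rng.normal(scale=1.1, size=126)
score_48 = shared + rng.normal(scale=1.1, size=126)

r, p = pearsonr(score_36, score_48)
print(round(r, 2), p < .05)
```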
Table 11
Descriptive Statistics and Reliability Estimates for Adapted WPPSI-III-SP Total Scores and Adapted Bayley-III Scores (Younger and Older Cohorts)
Bayley-III (Younger Cohort)    Bayley-III (Older Cohort)    WPPSI-III-SP (Younger Cohort)    WPPSI-III-SP (Older Cohort)
M 4.66 6.37 26.06 39.11
SD 2.98 2.35 8.64 12.66
Skew 0.45 0.01 -0.19 0.82
Kurtosis -0.57 -0.39 -0.03 1.11
Note. n =141 for younger cohort; n = 158 for older cohort.
Chapter 5: Discussion
The primary purpose of this study was to examine the construct validity of the
WPPSI-III-SP adapted for use among children in rural Peru. Using confirmatory factor
analyses, data from a younger and older cohort of children were fitted to models based on
the normative structure of the WPPSI-III-SP. If the adapted version measured
intelligence in the same manner as the original version, it was hypothesized that these
normative models would provide an adequate fit for the scores from the Peruvian cohorts.
The process of adapting a test for use among a different culture, however, is complicated
and must take into account many cultural, developmental, construct, and adaptation
considerations. For example, the process considers differences as to how the construct in
question is expressed across cultures, differences in language that may affect how items
are interpreted across cultures, and differences in exposure to the assessment's tasks (e.g.,
completing pencil-and-paper tasks) across cultures.
Given these adaptation considerations, the present study also proposed to
determine if another factor structure would provide a better fit to the scores. Based on the
research of others (Contreras & Rodriquez, 2013; Ramirez & Rosas, 2007) conducting
test adaptation research in Colombia and Chile, three additional factor structures were
proposed. For both the younger and older cohorts, a one-factor model was proposed. For
the older cohort, an alternative to the normative structure was proposed in which the
Coding subtest loaded on the Performance factor instead of on the overall Full Scale
Intelligence Quotient factor. As the cultures of Colombia and Chile may more closely
resemble the Peruvian culture (in comparison to Spanish culture), it was hypothesized
that the models based on the research in South America would provide a better fit to the
data than the models based on the normative data. Finally, to examine evidence of
convergent validity the total subtest scores from the younger cohort and from the older
cohort were correlated with scores from another adapted cognitive ability measure
completed by the children at 24 months. It was expected that the scores from these
measures of cognitive ability would be positively correlated. A supplemental analysis
was conducted to examine the direction and strength of the relationship between scores
on the adapted WPPSI-III-SP from children who completed the assessment at 36 and 48
months.
Fit of the Hypothesized Models
Younger cohort. For the younger cohort, both the model based on the normative
data (Model 1-NormYoung) and the one-factor model based on the research of Contreras
and Rodriquez (2013; Model 2-OneFactorYoung) appear to provide an adequate fit to the
data. Furthermore, Model 1-NormYoung appeared to fit the data as well as Model 2-
OneFactorYoung. As such, the first hypothesis was supported as Model 1-NormYoung
provided an adequate fit for the data. The second hypothesis, however, was not supported
based on the confirmatory factor analyses. In comparing the fit of Model 2-
OneFactorYoung and Model 1-NormYoung, the former model did not provide a better fit
to the data than the latter.
Regarding component fit, all standardized estimates were significant across
models. Scores from the Information subtest demonstrated the strongest relationship with
VIQ and with the latent global intelligence factor, while Receptive Vocabulary, Block
Design, and Object Assembly were moderately related to the latent factors. It should be
noted that measurement model R2 values were poor for three out of four subtests
(Receptive Vocabulary, Block Design, and Object Assembly), indicating that much of the
variance associated with these subtests was left unexplained. Finally, the second-order
factor loadings were larger than the subtest loadings on the first-order factors,
suggesting the verbal and performance factors were strongly influenced by overall
cognitive ability.
The relationships between subtest scores demonstrated potential support for the
one-factor structure. In looking at evidence for convergent validity, the scores from the
VIQ subtests (Receptive Vocabulary and Information) were moderately correlated.
However, the relationship between scores from the PIQ subtests (Object Assembly and
Block Design) was weak. Furthermore, little evidence was present for discriminant
validity. Information subtest scores demonstrated a similar strength relationship with
scores from the Block Design subtest and scores from the Receptive Vocabulary subtest.
Scores from the Block Design subtest were more strongly related to scores from the VIQ
subtests as compared to Object Assembly subtest scores. In other words, scores did not
demonstrate consistently stronger relationships among subtest scores from the same
factor (convergent validity) and consistently weaker relationships among subtest scores
from differing factors (discriminant validity). As such, this pattern may provide evidence
of the superiority of a one-factor structure with an overall cognitive ability factor.
Reliability. In exploring the validity of an assessment it is imperative to also look
at the test's scores' reliability, as reliability is the foundation of validity (Sattler, 2008).
The reliability estimate of Object Assembly subtest scores is questionable. As noted
earlier, scores from assessments of young children struggle to produce high reliability
estimates (Alfonso & Flanagan, 1999; Sattler, 2008). In addition, the test taking behavior
of young children adds potential error and inconsistency to scores (Frisby, 1999b). Young
children have shorter attention spans, less expressive language, and less exposure to the
formal assessment process (Alfonso & Flanagan, 1999; Baron & Leonberger, 2012; Berk,
2009). The influence of test taking behavior may be particularly impactful for this cohort
of children. Prior to Kindergarten, children living in these Peruvian communities often
have limited exposure to formal schooling, let alone exposure to formal testing situations.
It is customary for children to remain in the care of a parent or family member during the
day or to attend a day-care. In addition, it is customary for children within these
communities to be more introverted and quiet around new individuals and within new
situations (A. Orbe, personal communication, April 25, 2016). This culturally influenced
temperamental characteristic may add to inconsistency in responding and to the potential
measurement error of scores.
The poor reliability estimate for the scores from the Object Assembly subtest
suggests a higher amount of measurement error for this subtest. Similar to other puzzle
task adaptations (Malda et al., 2010), the Peruvian children appeared to struggle
completing the puzzles. While children could earn between 0 and 70 points on this
subtest, the subtest mean score was 2.87. Due to discontinuation rules, most children
completed a short assessment. As noted by Sattler (2008), the length of an
assessment may impact the reliability of scores, such that scores from longer measures
tend to yield higher reliability estimates.
Furthermore, many potential sources of measurement error may arise due to
adaptation concerns. Perhaps, and similar to the children in India (Malda et al., 2010), the
Peruvian children were not familiar with the task as they have little everyday exposure to
puzzles. In addition, perhaps the children were not familiar with the stimuli enough to
understand how to compile the puzzle pieces. For example, one puzzle is of a stereotypic
American house (i.e., door in the center with two windows on either side, triangular roof
with a chimney, yellow in color). Children living in these communities may live in
houses with a dirt floor and a thatched roof (Yori et al., 2014), or a flat metal roof. Thus,
this puzzle may not conform to the child's idea of a house. These children may struggle to
put the pieces together without the context of knowing that the completed picture is of a
house. The puzzles of a bird, fish, or dog, on the other hand, may have been comparatively
easy for the Peruvian children as these are animals they would encounter frequently in
their environment. A child's performance on this subtest, therefore, should be interpreted
with caution. Instead, the child’s overall performance on other subtests and on the overall
test should be considered as these scores had higher reliability estimates.
Older cohort. For the older cohort, both the model based on the normative data
(Model 3-NormOlder) and the alternative to the normative model based on the research
of Ramirez and Rosas (2007; Model 5-AltOlder) appear to provide a poor fit to the data.
These results should also be interpreted with caution as each model indicated a potential
specification error (i.e., a negative error variance for latent variables). The first
hypothesis was not supported, whereas the second hypothesis was supported. The one-
factor model (Model 4-OneFactorOlder) yielded an adequate fit, after modifications,
while the models based on the normative structure did not yield adequate fit. Scores from
the subtests demonstrated adequate reliability.
Regarding component fit, all standardized estimates were significant. Factor
loadings were moderate to strong, with scores from Picture Concepts, Matrix Reasoning,
and Word Reasoning demonstrating the strongest relationship with the general
intelligence factor. The measurement model R2 values were poor for four out of seven
subtests (Information, Block Design, Coding, and Vocabulary), indicating that much of
the variance associated with these subtests was left unexplained. Scores from the
Information subtest were moderately related to scores from Word Reasoning and
Vocabulary subtests. In addition, the Vocabulary and Word Reasoning subtests' scores
were moderately related. Other intercorrelations between subtests fell in the very weak to
moderate range. In other words, the scores from the VIQ subtests were moderately
correlated with each other and more weakly correlated with scores from the PIQ subtests.
However, scores from the PIQ subtests were not consistently more strongly correlated
with each other than with VIQ subtest scores. Gordon (2004) noted a similar pattern of
relationships among scores from the VIQ and PIQ subtests in the original U.S. version of
the WPPSI-III. Similar to the younger cohort, this lack of evidence for discriminant
validity and the moderate to strong factor loadings within the one-factor model provides
support for a one-factor structure.
Floor items. Alfonso and Flanagan (1999) highlight the importance of adequate
floor items on intelligence tests for preschool-aged children. A test's floor is the lowest
possible score a child could earn on the test. Intelligence tests with adequate floor items
are able to distinguish among children with lower levels of cognitive ability (Sattler,
2008). For the older age band of the WPPSI-III this question becomes even more crucial,
as the assessment must be able to adequately capture the ability of a low performing four-
year-old child as well as the ability of a typically developing seven-year-old child.
Children at these extreme ages may vary widely in the differentiation of cognitive
abilities, in their exposure to formal test taking situations, and in their development of
attention and emotional regulation. Flanagan and Alfonso (1995) posit that a test has a
sufficient floor if a raw score of 1 on an assessment is more than two standard deviations
below the mean. Gordon’s (2004) review of the original WPPSI-III (U.S. version) noted
that some of the subtests (Picture Concepts, Word Reasoning, and Coding) failed to meet
this requirement for children at age 48 months.
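The Flanagan and Alfonso (1995) criterion can be sketched as a simple check: a raw score of 1 should fall more than two standard deviations below the subtest mean. Applied loosely here to the sample statistics in Table 8 (the criterion is defined with respect to normative, not sample, statistics):

```python
# Floor criterion: a raw score of 1 should fall more than two
# standard deviations below the subtest mean.
def has_adequate_floor(mean: float, sd: float, low_raw: float = 1.0) -> bool:
    return low_raw < mean - 2 * sd

# Means and SDs from Table 8 (older cohort)
subtests = {"Picture Concepts": (1.46, 2.05), "Word Reasoning": (1.09, 1.81),
            "Matrix Reasoning": (1.58, 2.14), "Coding": (1.08, 2.53),
            "Information": (12.23, 3.21)}
for name, (m, sd) in subtests.items():
    print(name, has_adequate_floor(m, sd))
```

On these sample statistics only Information clears the bar, consistent with the floor concerns discussed below.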
In examining the scores for the older cohort, the children earned, on average, at or
below two points total on the Picture Concepts, Word Reasoning, Matrix Reasoning, and
Coding subtests. In other words, the Peruvian children struggled to correctly answer
items on these subtests. A low raw score was typical rather than uncommon.
Consequently, the subtests may have difficulty distinguishing between ability levels for
these low-scoring children. The task demands and item content of these tasks in particular
should be further examined for potential sources of bias. A discussion of floor effects and
potential sources of bias as result of cultural influence is an important aspect of the larger
discussion of item quality. Item quality refers to an item's ability to differentiate between
a lower- and higher-performing child. A test is said to have items of good quality if those
children with higher overall scores tend to answer the more difficult items correctly
whereas children with lower overall scores tend to answer these items wrong (Sattler,
2008). Item discrimination analyses should be conducted to allow for the examination of
this response pattern.
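One common form of item discrimination analysis is the corrected item-total correlation, sketched here on simulated dichotomous responses (the response model and all values are hypothetical, not the study data).

```python
import numpy as np

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the total of the remaining items."""
    total = items.sum(axis=1)
    out = []
    for j in range(items.shape[1]):
        rest = total - items[:, j]          # exclude the item itself
        out.append(np.corrcoef(items[:, j], rest)[0, 1])
    return np.array(out)

# Simulated right/wrong responses: 167 children, 10 items of rising difficulty
rng = np.random.default_rng(4)
ability = rng.normal(size=(167, 1))
difficulty = np.linspace(-1, 1, 10)
items = (ability + rng.normal(size=(167, 10)) > difficulty).astype(float)
print(np.round(corrected_item_total(items), 2))
```

Items with near-zero or negative corrected item-total correlations would be candidates for the kind of bias review the text recommends.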
One-factor models. Across both cohorts, confirmatory factor analyses and
intercorrelations between subtests provided support for a one-factor structure such that all
subtests load on a latent factor of general intelligence. This support was stronger for the
older cohort, as the scores from those subtests demonstrated stronger factor loadings
overall as compared to the factor loadings of subtests for the younger cohort. These
results support criticism leveled against the original WPPSI-III (U.S. version). Gordon
(2004) cites a lack of clear discriminant validity among subtest score intercorrelations
and high g factor loadings as the foundation of an argument for a one-factor structure to
the WPPSI-III.
Developmentally, the intellectual functioning of young children may be
comprised of more homogenous abilities as compared to the greater differentiation in
abilities among older children (Baron & Leonberger, 2012). Thus, a one-factor solution
may better represent the cognitive developmental stage of a preschool-age child. While a
child may possess differing verbal and performance abilities, he or she may still rely on
all skills to successfully navigate his or her environment. As a child grows, these skills
may become more specific and differentiated from other abilities. The superior overall
and component fit of the one-factor models within this study lends support to the idea
that the homogeneity of cognitive abilities in young children is universal across cultures.
Outside of developmental considerations, a one-factor structure may be more
appropriate when conducting cross-cultural intelligence measurement. Researchers
(Georgas, 2003; Georgas et al., 2003; Irvine, 1979; Van de Vijver, 1997) specify
universalities across cultures as to what constitutes "intelligence". Cultures may differ,
however, in how these abilities are manifested and the extent to which these
manifestations contribute to success (Sternberg & Grigorenko, 2004). A one-factor
structure may better capture this universality despite variability in the manifestation of
separate cognitive abilities.
Convergent Validity Evidence
In line with predictions, the scores from the adapted cognitive ability assessment
completed at 24 months were significantly and positively related to cognitive ability
scores from the adapted WPPSI-III-SP for both the younger and older cohorts. These
relationships, however, were weak. Thus, evidence for convergent validity was minimally
supported. Children typically demonstrate rapid cognitive, linguistic, and physical
development between their second and third year (Berk, 2009). Scores from the Bayley-
III have previously demonstrated strong positive correlations with cognitive ability scores
from the WPPSI-III (Bode, D'Eugenio, Mettelman, & Gross, 2014; Sattler, 2008).
However, these previous convergent validity studies were not conducted using adapted
measures. The adapted Bayley-III and the adapted WPPSI-III-SP have somewhat
differing definitions as to what constitutes intelligent behavior. For example, the adapted
Bayley-III Cognitive subscale includes the measurement of social development, visual
preference, and number concepts. This emphasis on somewhat differing cognitive
abilities may contribute to weaker relationships between scores from the scales,
particularly within a cross-cultural context. While these cross-assessment abilities may
have been somewhat commensurate within American culture (the original culture of
development), differential importance and manifestation of abilities within the Peruvian
culture may exaggerate the differences in definitions.
As Van de Vijver (1997) noted, the activities included within a cognitive ability
assessment and an individual's exposure to these tasks may impact performance. The
tasks included within the adapted Bayley-III differ from the tasks included in the adapted
WPPSI-III-SP. The adapted Bayley-III involves more play-based activities and has a
more interactive administration. As the adapted Bayley-III aims to assess the abilities and
skills of infants and toddlers, an interactive assessment that includes age-appropriate
manipulatives to engage the child’s interest follows the guidance of Alfonso and
Flanagan (1999). However, the relationship between overall scores from these
assessments may be impacted by differential familiarity with task demands.
To potentially reduce the impact of task unfamiliarity, the scores from the adapted
WPPSI-III-SP administered at 36 months were correlated with 48-month-administration
scores of the adapted measure. These scores across time points were positively,
moderately, and significantly related. While some subtests were new to the child retaking
the adapted measure at 48 months, the testing situation itself was less novel. It is possible
that within a more consistent testing environment, the children performed more
consistently over time. As such, this relationship provides evidence for convergent
validity.
Furthermore, cognitive ability assessments for young children often face the
challenge of adequately capturing not only large changes in ability but also the
incremental development of skills. As these cognitive changes become more stable over
time, the scores from an intelligence test typically demonstrate more consistency and
have a higher relationship with related assessments (Baron & Leonberger, 2012). This
pattern was demonstrated within this study as the strength of the relationship between
scores on the adapted measures increased as the children aged. Further support should be
sought, however, to build upon this evidence. Sternberg and colleagues (2001) noted that
scores from cognitive ability assessments become better predictors of later cognitive
ability scores after approximately the age of 6. Considering the developmental trend
toward the stability of intelligence, the current adapted WPPSI-III-SP scores should be
correlated with the scores from any future administrations of the adapted WPPSI-III-SP.
Limitations and Future Research
The findings of this study should be interpreted in light of demographic and
methodological limitations. Several demographic characteristics of the sample of children
limit the generalizability of the study. All data were collected within small communities
in Peru's department of Loreto. Geographically, Peru is a diverse country (CIA, 2014),
ranging from large urban centers to agrarian communities in the Andes Mountains and the
Amazonian jungle. These regions are also differentially influenced by the cultures of
indigenous populations that inhabit them. Consequently, the findings of this study may
not generalize to other regions of Peru with different socio-cultural characteristics.
Adapting and validating scores from the WPPSI-III-SP for use in this segment of the
Peruvian population lays the foundation for validating scores from the measure among
children in other regions of the country.
Additionally, while the WPPSI-III is appropriate for use among children ages 2
years 6 months to 7 years 3 months, the age range of participants within this study was
restricted to three- and four-year-old children. Thus, it remains unclear which factor
structure is most appropriate across the full age range of the assessment. Given the rapid
cognitive development within this age range and the trend toward differentiation of
ability as a child ages, it is crucial to explore the best factor structure across the entire age
range of the test. Finally, the sample sizes for each cohort were somewhat small for use in
structural equation modeling. A general rule of thumb for researchers using CFA is to
have a minimum of 200 participants (Brown, 2015). The specification errors encountered in
the second-order models for the older cohort may have been avoided with a larger sample
size. A larger sample size across cohorts would increase the statistical power and
precision of the parameter estimates and estimates of overall model fit.
Reliability estimates for adapted WPPSI-III-SP scores for the younger cohort
were also somewhat low, particularly for the Object Assembly subtest. This suggests that
a child's performance on this subtest should be interpreted with caution. Sattler (2008)
suggests that increasing the length of a test may increase reliability. For the adapted
WPPSI-III-SP, adding more items to the end of the subtests may not be feasible as many
children reached a subtest's ceiling many items from the end. Adding items to the
beginning of a subtest, however, may be a feasible way in which to address not only poor
score reliability but also the test floor issue demonstrated by many subtests. Increasing
the number of less difficult items may help to better distinguish the abilities of lower
performing children. In addition, giving all children the opportunity to answer all items
may also allow for a more accurate analysis of score reliability. Increasing the length of
the test, however, must be done in consideration of what is developmentally appropriate
for a preschool-aged child. Care must be taken to ensure that a subtest does not become
overly taxing for the children.
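Sattler's point about test length and reliability is usually formalized with the Spearman-Brown prophecy formula, which projects the reliability of a lengthened test from the current reliability and the length ratio. The sketch below is illustrative only: the .60 starting reliability is a hypothetical value, not an estimate from this study, and the formula assumes the added items are parallel to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability of a test lengthened by `length_factor`.

    length_factor is the ratio of new to original test length
    (e.g., 1.5 when adding half again as many items). The formula
    assumes the added items are parallel to the existing items.
    """
    r, n = reliability, length_factor
    return (n * r) / (1 + (n - 1) * r)

# A hypothetical subtest with reliability .60:
projected = spearman_brown(0.60, 1.5)  # lengthened by 50%, about .69
doubled = spearman_brown(0.60, 2.0)    # doubled in length, .75
```

The gain diminishes as items are added, which is one reason lengthening must be weighed against what is developmentally tolerable for a preschool-aged child.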
Future research on the validity of scores from the adapted WPPSI-III-SP should
include more diverse samples of Peruvian children, in terms of both geography and age.
Demonstrating a consistent factor structure across ages and cultural groups within Peru
may help strengthen the argument that this adapted test is appropriate for use among a
wider range of Peruvian preschoolers. Furthermore, this validation process should
consider having all children answer all items on each subtest. If all children answer all
items, further information can be gathered as to item quality, appropriate item order, and
the appropriateness of overall task demands. Item analysis, or the process of determining
how well an item is functioning, may also provide valuable information as to the validity
of scores from the adapted WPPSI-III-SP. These analyses include the examination of not
only item discrimination (as described earlier) but also item difficulty (i.e., the proportion
of children who answer an item correctly; Sattler, 2008). Item response theory may also
be a helpful tool for conducting item analysis (de Gruijter & van der Kamp, 2005). The
validation of the test for a more diverse population and an analysis of item quality
continue the iterative process of adapting this measure (Geisinger, 1994).
Future research should also continue collecting evidence for construct validity by
examining the adapted WPPSI-III-SP scores' convergent, discriminant, and predictive
validity. The scores from this adapted measure of cognitive ability did not evidence a
strong relationship with scores from an earlier measure of cognitive development.
However, a moderate and positive relationship was observed between scores at 36
months and at 48 months for the adapted WPPSI-III-SP. Further research should continue
to explore how scores from the adapted WPPSI-III-SP relate to scores from other
measures, such as measures of future academic achievement (predictive validity), other
measures of intelligence and measures of memory and executive functioning (convergent
validity), and measures of behavioral functioning (discriminant validity).
Implications
The findings of this study support the use of the adapted WPPSI-III-SP as a
measure of general intellectual ability among three- and four-year-old children living in
small river communities of the Peruvian Amazon basin. Children growing up in these
communities face distinct contextual challenges to their growth and development (e.g.,
malnutrition, disease, limited access to health care). Measuring a child's cognitive ability
may be an important component of research examining how children’s development is
affected by biological and environmental factors (Fernald, Kariger, Engle, & Raikes,
2009). For example, malnutrition has been associated with negative impacts on brain
structure and function and on the development of cognitive processes (Kar et al., 2008;
Levitsky & Strupp, 1995). The adapted WPPSI-III-SP can be used to measure these
potential negative impacts. Results from this research can then inform and provide insight
for others studying the negative impact of poverty, malnutrition, and disease in other
areas of the world.
The use of scores from the adapted WPPSI-III-SP to make individual treatment
decisions for children, however, is not supported. Scores from intelligence tests are often
used to inform diagnostic and treatment decisions for children (Bracken, 1994). Lower
reliability estimates suggest that measurement error may be influencing test scores and
that scores are not consistent over time or across situations. In other words, the chance of
misdiagnosis or misinterpretation is higher. Misdiagnosis or misinterpretation may have
serious consequences for a child. Therefore, a high reliability estimate (≥.9) is suggested
when using the scores of a test to inform decisions (Sattler, 2008). The reliability
estimates for the adapted WPPSI-III-SP did not reach this threshold. Thus, while the
scores may be appropriate for research purposes, their use for the diagnosis and
treatment of an individual child is questionable.
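The link between reliability and diagnostic risk can be made concrete with the standard error of measurement, which widens the confidence band around any observed score as reliability falls. The values below are illustrative: Wechsler composite scores are scaled to a mean of 100 and a standard deviation of 15, but the reliabilities shown are hypothetical rather than estimates from this study.

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def confidence_band(observed, sd, reliability, z=1.96):
    """Approximate 95% confidence band around an observed score."""
    margin = z * sem(sd, reliability)
    return observed - margin, observed + margin

# With reliability .90, a score of 100 carries a band of about +/- 9 points;
band_high = confidence_band(100, 15, 0.90)
# with reliability .70, the band widens to about +/- 16 points.
band_low = confidence_band(100, 15, 0.70)
```

A band that wide can easily straddle a diagnostic cutoff, which is why score-based decisions for an individual child demand the high reliability threshold cited above.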
Conclusions
A critical step in the adaptation of a measure for use within a new population is
the establishment of evidence that scores from the adapted assessment continue to
measure the same construct in the same manner as scores from the original measure
(Stein et al., 2006). The present study examined evidence for construct validity of a
version of the WPPSI-III-SP adapted for use among three- and four-year-old children
living in rural Peru. The WPPSI-III has been adapted and validated for use within a
number of countries (Visser et al., 2012). In considering overall model fit and component
fit, a one-factor structure was supported across cohorts such that all subtests load on a
general intelligence factor. Evidence for convergent validity was minimally supported,
as the scores from the adapted test were weakly related to scores from an earlier-
administered adapted measure of cognitive development completed by the children. A
moderately positive relationship, however, was observed between scores on the adapted
WPPSI-III-SP at 36 months and scores at 48 months. As the adaptation of a test is an
iterative process (Geisinger, 1994), the findings from this study should be used to inform
future changes to the adapted measure to make it a more accurate assessment of
intelligence for a wider range of Peruvian children.
References
Alfonso, V. C., & Flanagan, D. P. (1999). Assessment of cognitive functioning in
preschoolers. In E. N. Nuttall, I. Romero, & J. Kalesnik (Eds.), Assessing and
screening preschoolers: Psychological and educational dimensions, (pp. 186-
217). Needham Heights, MA: Allyn & Bacon.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for
educational and psychological testing. Washington, DC: American Educational
Research Association.
American Psychological Association. (2002). American Psychological Association
ethical principles of psychologists and code of conduct. Retrieved from
http://www.apa.org
Bagdonas, A., Pociute, B., Rimkute, E., & Valickas, G. (2008). The history of Lithuanian
psychology. European Psychologist, 13, 227-237. doi: 10.1027/1016-
9040.13.3.227
Barker, C., Pistrang, N., & Elliott, R. (2016). Research methods in clinical psychology:
An introduction for students and practitioners (3rd ed.) [Wiley Online Library
version]. doi: 10.1002/9781119154082
Baron, I. S., & Leonberger, K. A. (2012). Assessment of intelligence in the preschool
years. Neuropsychology Review, 22, 334-344. doi: 10.1007/s11065-012-
9215-0
Bayley, N. (2005). Bayley Scales of Infant and Toddler Development (3rd ed.). San
Antonio, TX: Psychological Corporation.
Benson, S., Hellander, P., & Wlodarski, R. (2007). Peru (6th Ed.). Oakland, CA: Lonely
Planet Publications.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological
Bulletin, 107, 238-246.
Bentler, P. M., & Bonnet, D. G. (1980). Significance tests and goodness of fit in the
analysis of covariance structures. Psychological Bulletin, 88, 588-606.
Berk, L. E. (2009). History, theory, and applied directions. In L. E. Berk (Ed.), Child
development: Custom edition for Pennsylvania State University. Boston: Pearson
Custom Publishing.
Boake, C. (2002). From the Binet-Simon to the Wechsler-Bellevue: Tracing the history
of intelligence testing. Journal of Clinical and Experimental Neuropsychology,
24, 383-405. doi: 1380-3395/02/2402-383
Bode, M. M., D'Eugenio, D. B., Mettelman, B. B., & Gross, S. J. (2014). Predictive
validity of the Bayley, third edition for 2 years for intelligence quotient at 4 years
for preterm infants. Journal of Developmental and Behavioral Pediatrics, 35,
570-575. doi: 10.1097/DBP.0000000000000110
Bollen, K. A. (1989). A new incremental fit index for general structural models.
Sociological Methods and Research, 17, 303-316.
Bracken, B. B. (1994). Advocating for effective preschool assessment practices: A
comment on Bagnato and Neisworth. School Psychology Quarterly, 9, 103-108.
Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.).
New York, NY: The Guilford Press.
Brown, R. T., Reynolds, C. R., & Whitaker, J. S. (1999). Bias in mental testing since Bias
in Mental Testing. School Psychology Quarterly, 14, 208-238.
Burt, C. (1948). The factorial study of temperamental traits. British Journal of Statistical
Psychology, 1, 178-203.
Central Intelligence Agency. (2014). The world factbook: Peru. Retrieved from
https://www.cia.gov/library/publications/the-world-factbook/geos/pe.html
Cohen, A. B. (2009). Many forms of culture. American Psychologist, 64, 194 – 204. doi:
10.1037/a0015308
Contreras, D. M. M., & Rodriguez, A. P. A. (2013). Estudio preliminar de las
propiedades psicométricas del WISC-IV en una muestra de escolares de
Bucaramanga. Informes Psicológicos, 13, 13-25.
de Gruijter, D. N. M., & van der Kamp, L. J. T. (2005). Statistical test theory for
education and psychology. Retrieved from http://irt.com.ne.kr/data/
test_theory.pdf
Diana v. State Board of Education, Civ. Act. No. C-70-37 (N.D. Cal., 1970, further
order, 1973).
Edwards, C. P. (1999). Development in the preschool years: The typical path. In E. N.
Nuttall, I. Romero, & J. Kalesnik (Eds.), Assessing and screening preschoolers:
Psychological and educational dimensions, (pp. 186 – 217). Needham Heights,
MA: Allyn & Bacon.
Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Pacific Grove,
CA: Brooks/Cole Publishing.
Fernald, L. C. H., Kariger, P., Engle, P., & Raikes, A. (2009). Examining early child
development in low-income countries: A toolkit for the assessment of children in
the first five years of life. Washington, D.C.: The World Bank.
Flanagan, D. P., & Alfonso, V. C. (1995). A critical review of the technical
characteristics of new and recently revised intelligence tests for children. Journal
of Psychoeducational Assessment, 13, 66-90.
Foundation for the National Institutes of Health. (2016). Iquitos, Peru: About the site.
Retrieved from: http://mal-ed.fnih.org/?page_id=329
Frisby, C. L. (1999a). Culture and test session behavior: Part I. School Psychology
Quarterly, 14, 263 – 280.
Frisby, C. L. (1999b). Culture and test session behavior: Part II. School Psychology
Quarterly, 14, 281 – 303.
Fry, A. F., & Hale, S. (2000). Relationships among processing speed, working memory,
and fluid intelligence in children. Biological Psychology, 54, 1-34.
Geisinger, K. F. (1994). Cross-cultural normative assessment: Translation and adaptation
issues influencing the normative interpretation of assessment instruments.
Psychological Assessment, 6, 304 – 312.
Georgas, J. (2003). Cross-cultural psychology, intelligence and cognitive processes. In J.
Georgas, F. J. R. van de Vijver, L. G. Weiss, & D. H. Saklofske (Eds.), Culture
and children’s intelligence: Cross-cultural analysis of the WISC-III (pp. 23-37).
San Diego: Academic Press.
Georgas, J., van de Vijver, F. J. R., Weiss, L. G., & Saklofske, D. H. (2003). A cross-
cultural analysis of the WISC-III. In J. Georgas, F. J. R. van de Vijver, L. G.
Weiss, & D. H. Saklofske (Eds.), Culture and children’s intelligence: Cross-
cultural analysis of the WISC-III (pp. 277-313). San Diego: Academic Press.
Gerbing, D. W., & Anderson, J. C. (1987). Improper solutions in the analysis of
covariance structures: Their interpretation and a comparison of alternate
respecifications. Psychometrika, 52, 99-111.
Gordon, B. (2004). [Review of Wechsler Preschool and Primary Scale of Intelligence-
Third Edition by D. Wechsler]. Canadian Journal of School Psychology, 19, 205-
220.
Greenfield, P. M. (1985). You can’t take it with you: Ability assessments don’t cross
cultures. American Psychologist, 52, 1115 – 1124.
Hambleton, R. K. (2001). The next generation of the ITC test translation and adaptation
guidelines. European Journal of Psychological Assessment, 17, 164 – 172.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6, 1-55. doi: 10.1080/10705519909540118
Individuals with Disabilities Education Act of 2004. 20 U.S.C. § 1400 et seq.
Irvine, S. H. (1979). The place of factor analysis in cross-cultural methodology and its
contribution to cognitive theory. In L. Eckensberger, W. Lonner, & Y. H.
Poortinga (Eds.). Cross-cultural contributions to psychology, (pp. 300-343).
Lisse, The Netherlands: Swets and Zeitlinger.
Jacob, S. J., Decker, D. M., & Hartshorne, T. S. (2011). Ethics and law for school
psychologists (6th ed). Hoboken, NJ: John Wiley & Sons, Inc.
Kamphaus, R. W., Winsor, A. P., Rowe, E. W., & Kim, S. (2012). A history of
intelligence test interpretation. In D. P. Flanagan, & P. L. Harrison (Eds.),
Contemporary intellectual assessment: Theories, tests, and issues, (pp. 3 – 55).
New York, NY: The Guildford Press.
Kar, B. R., Rao, S. L., & Chandramouli, B. A. (2008). Cognitive development in children
with chronic protein energy malnutrition. Behavioral and Brain Functions, 4, 31-
42. doi: 10.1186/1744-9081-4-31
Karino, C. A., Laros, J. A., & Ribeiro de Jesus, G. (2011). Evidence of convergent
validity of Son-R 2 1/2-7 [a] with WPPSI-III and WISC-III. Psicologia: Reflexão
e Crítica, 24, 621-629. doi: 10.1590/S0102-79722011000400001
Kaufman, A. S., & Kaufman, N. L. (2004). Kaufman Assessment Battery for Children,
Second Edition: Manual. Circle Pines, MN: AGS Publishing.
Kline, R. B. (1998). Principles and Practice of Structural Equation Modeling. New
York: Guilford Press.
Larry P. v. Riles, 343 F. Supp. 1306 (D.C. N.D. Cal., 1972), aff’d 506 F.2d 963 (9th Cir.
1974), further proceedings, 495 F. Supp. 926 (D.C. N.D. Cal., 1979), aff’d, 502 F.
2d 693 (9th Cir. 1984).
Lei, P. W., & Wu, Q. (2007). Introduction to structural equation modeling: Issues and
practical considerations. Educational Measurement: Issues and Practices, 26, 33-
43.
Levitsky, D. A., & Strupp, B. J. (1995). Malnutrition and the brain: Changing concepts,
changing concerns. The Journal of Nutrition, 125, 2212S-2220S.
Lichtenberger, E. O., & Kaufman, A. S. (2004). Essentials of WPPSI-III assessment.
Hoboken, NJ: John Wiley & Sons, Inc.
Louro, C. R., & Yupanqui, M. J. (2011). Otra mirada a los procesos de gramaticalización
del presente perfecto español: Perú y Argentina. Studies in Hispanic and
Lusophone Linguistics, 4, 55-80.
Markland, D. (2006). The golden rule is that there are no golden rules: A commentary on
Paul Barrett's recommendations for reporting model fit in structural equation
modeling. Personality and Individual Differences, 42, 851-858. doi:
10.1016/j.paid.2006.09.023
Malda, M., van de Vijver, F. J. R., Srinivasan, K., Transler, C., & Sukumar, P. (2010).
Traveling with cognitive tests: Testing the validity of a KABC-II adaptation in
India. Assessment, 17, 107-115. doi: 10.1177/1073191109341445
Malda, M., van de Vijver, F. J. R., Srinivasan, K., Transler, C., Sukumar, P., & Rao, K.
(2008). Adapting a cognitive test for a different culture: An illustration of
qualitative procedures, Psychology Science Quarterly, 50, 451-468.
Murray-Kolb, L. E., Rasmussen, Z. A., Scharf, R. J., Rasheed, M. A., Svensen, E.,
Seidman, J. C. … Lang, D. (2014). The MAL-ED cohort study: Methods and
lessons learned when assessing early child development and caregiving mediators
in infants and young children in 8 low- and middle-income countries. Clinical
Infectious Diseases, 59, S261 – S272. doi: 10.1093/cid/ciu437
Nair, R. L., White, R. M. B., Knight, G. P., & Roosa, M. W. (2009). Cross-language
measurement equivalence of parenting measures for use with Mexican American
population. Journal of Family Psychology, 5, 680-689. doi: 10.1037/a0016142
Neisworth, J. T., & Bagnato, S. J. (1992). The case against intelligence testing in early
intervention. Topics in Early Childhood Special Education, 12, 1-20.
Perez-Pereira, M., & Resches, M. (2011). Concurrent and predictive validity of the
Galician CDI. Journal of Child Language, 38, 121-140. doi:
10.1017/S0305000909990262
Price, L. R., Raju, N., Lurie, A., Wilkins, C., & Zhu, J. (2006). Conditional standard
errors of measurement for composite scores on the Wechsler Preschool and
Primary Scale of Intelligence-Third edition. Psychological Reports, 98, 237-252.
Ramirez, V., & Rosas, R. (2007). Estandarizacion del WISC-III en Chile: Descripcion del
test, estructura factorial y consistencia interna de las escalas. Psykhe, 16, 91 –
109.
Rodriguez, L. J. S., & Miguel, C. A. S. (2012). Evaluación en psicología clínica.
Retrieved from http://www.pir.es
Sattler, J. M. (2008). Assessment of children: Cognitive foundations. La Mesa, CA:
Jerome M. Sattler Publisher.
Smolik, F., & Malkova, G. S. (2011). Validity of language sample measures taken from
structured elicitation procedures in Czech. Ceskoslovenska psychologie, 55, 448-
458.
Stein, J. A., Lee, J. W., & Jones, P. S. (2006). Assessing cross-cultural differences
through use of multiple-group invariance analyses. Journal of Personality
Assessment, 87, 249 – 258.
Sternberg, R. J., & Grigorenko, E. L. (2004). Intelligence and culture: how culture shapes
what intelligence means, and the implications for a science of well being.
Philosophical Transactions of the Royal Society of London, 359, 1427-1434. doi:
10.1098/rstb.2004.1514
Sternberg, R. J., Grigorenko, E. L., & Bundy, D. A. (2001). The predictive value of IQ.
Merrill-Palmer Quarterly, 47, 1-41.
UNICEF, World Health Organization, & World Bank Group. (2015). Levels and trends
in child malnutrition. Retrieved from: http://data.unicef.org/corecode/uploads/
document6/uploaded_pdfs/corecode/JME-2015-edition-Sept-2015_203.pdf
Van de Vijver, F. (1997). Meta-analysis of cross-cultural comparisons of cognitive test
performance. Journal of Cross-Cultural Psychology, 28, 678-709.
Visser, L., Ruiter, S. A. J., van der Meulen, B. F., Ruijssenaars, W. A. J. J. M., &
Timmerman, M. E. (2012). A review of standardized developmental assessment
instruments for young children and their applicability for children with special
needs. Journal of Cognitive Education and Psychology, 11, 102-127. doi:
10.1891/1945-8959.11.2.102
Wasserman, G. A., Liu, X., Parvez, F., Ahsan, H., Factor-Litvak, P., van Geen, A. ...
Graziano, J. H. (2004). Water arsenic exposure and children's intellectual function
in Araihazar, Bangladesh. Environmental Health Perspectives, 112, 1329-1333.
Wasserman, J. D. (2012). A history of intelligence assessment: The unfinished tapestry.
In D. P. Flanagan, & P. L. Harrison (Eds.), Contemporary intellectual assessment:
Theories, tests, and issues, (pp. 3 – 55). New York, NY: The Guilford Press.
Wechsler, D. (1944). The measurement of adult intelligence. Baltimore, MD: Waverly
Press.
Wechsler, D. (1991). Wechsler Intelligence Test for Children –Third Edition. San
Antonio, TX: The Psychological Corporation.
Wechsler, D. (1991/1997). Test de Inteligencia Para Ninos WISC-III: Manual
(Translation by Ofelia Castillo). Buenos Aires: Paidós.
Wechsler, D. (2002a). WPPSI-III: Administration and scoring manual. San Antonio, TX:
The Psychological Corporation.
Wechsler, D. (2002b). WPPSI-III: Technical and interpretive manual. San Antonio, TX:
The Psychological Corporation.
Wechsler, D. (2005). Escala de inteligencia de Wechsler para niños IV (WISC IV).
Madrid: TEA Ediciones.
Wechsler, D. (2009). Escala de inteligencia de Wechsler para preescolar y primaria –
III. Madrid: TEA Ediciones.
World Food Programme (2016). Hunger statistics. Retrieved from:
https://www.wfp.org/hunger/stats
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content
analysis and recommendations for best practice. The Counseling Psychologist, 34,
806-838. doi: 10.1177/0011000006288127
Yori, P. P., Lee, G., Olortegui, M. P., Chavez, C. B., Flores, J. T., Vasquez, A. O. …
Kosek, M. (2014). Santa-Clara de Nanay: The MAL-ED cohort in Peru. Clinical
Infectious Diseases, 59, S310-316. doi: 10.1093/cid/ciu460
Appendix A
Overview of WPPSI Content
The following table (Table A1) provides an overview of the revisions to the WPPSI's
content and construction from the original WPPSI published in 1967 through the WPPSI-
III. Core and supplemental subtests are included for reference.
Table A1
Subtests and Composite Scores of the WPPSI, WPPSI-R, and WPPSI-III by Age
WPPSI (1967), ages 4:0-6:6
  Performance subtests (PIQ, FSIQ): Geometric Design, Block Design, Mazes, Picture Completion; supplemental: Animal Houses
  Verbal subtests (VIQ, FSIQ): Information, Comprehension, Arithmetic, Vocabulary, Similarities; supplemental: Sentences

WPPSI-R (1991), ages 3:0-7:3
  Performance subtests (PIQ, FSIQ): Object Assembly, Geometric Design, Block Design, Mazes, Picture Completion; supplemental: Animal Pegs
  Verbal subtests (VIQ, FSIQ): Information, Comprehension, Arithmetic, Vocabulary, Similarities; supplemental: Sentences

WPPSI-III (2002), ages 2:6-3:11
  Core subtests: Receptive Vocabulary (VIQ), Information (VIQ), Block Design (PIQ), Object Assembly (PIQ); supplemental: Picture Naming (GLC)

WPPSI-III (2002), ages 4:0-7:3
  Core Verbal subtests (VIQ): Vocabulary, Information, Word Reasoning
  Core Performance subtests (PIQ): Block Design, Matrix Reasoning, Picture Concepts
  Processing Speed: Coding (PSQ); supplemental: Symbol Search (PSQ)
  Other supplemental subtests: Comprehension, Similarities, Object Assembly, Picture Completion

Note. WPPSI = Wechsler Preschool and Primary Scale of Intelligence. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient; GLC = General Language Composite; PSQ = Processing Speed Quotient. Composite membership is indicated in parentheses following each subtest. Reviewed tests are based on the US normative sample.
Appendix B
Parameter Estimates and Standard Errors for Models 3, 4, and 5
Table B1
Parameter and Standard Error Estimates for Model 3-NormOlder
Model Parameters          Standardized Estimate   Unstandardized Estimate   Standard Error
Loadings on VIQ
Vocabulary .56 1.38* 0.81
Information .39 1.27a 1.45
Word Reasoning .81 1.46* 0.38
Loadings on PIQ
Block Design .56 1.19* 0.54
Matrix Reasoning .72 2.86* 1.28
Picture Concepts .61 1.94* 1.10
Loadings on FSIQ
VIQ .79 -- 0.32
PIQ -- -- --
Coding .52 1.06* 0.44
Note. Table values are Maximum Likelihood estimates. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient. *p < .05. a = fixed factor loading.
Table B2

Parameter and Standard Error Estimates for Model 4-OneFactorOlder

Model Parameters          Standardized Estimate   Unstandardized Estimate   Standard Error
Loadings on FSIQ
Picture Concepts .61 1.96* 0.29
Information .45 1.56* 1.43
Block Design .54 1.16* 0.21
Word Reasoning .72 1.30* 0.42
Vocabulary .51 1.28* 0.87
Matrix Reasoning .68 1.07* 0.99
Coding .49 1.00* 0.45
Note. Table values are Maximum Likelihood estimates. FSIQ = Full Scale Intelligence Quotient. *p < .05.
Table B3
Parameter and Standard Error Estimates for Model 5-AltOlder
Model Parameters          Standardized Estimate   Unstandardized Estimate   Standard Error
Loadings on VIQ
Vocabulary .56 0.87* 0.80
Information .46 0.97* 1.42
Word Reasoning .80 0.89* 0.36
Loadings on PIQ
Block Design .56 1.20* 0.53
Matrix Reasoning .72 2.88* 1.25
Picture Concepts .62 1.97a 1.09
Coding .52 1.06* 0.44
Loadings on FSIQ
VIQ -- -- --
PIQ .51 -- 0.22
Note. Table values are Maximum Likelihood estimates. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient. *p < .05. a = fixed factor loading.
VITA
Abigail E. Crimmins
96 Sunnyside Dr., Elmira, NY 14905
aec210@psu.edu
607-215-3252

Education
• 2011 – present: The Pennsylvania State University, M.S. (August 2013), Ph.D. (exp. August 2016), School Psychology
• 2005 – 2009: Hamilton College, B.A. (May 2009; GPA 3.85), Psychology and Hispanic Studies

Research Experience
• Research Assistant, Penn State Department of Special Education, Summer 2014
• Research Assistant, LEGACY Project, Summer 2013
• Predissertation Research Project, Student-Teacher Relationships among Children with Autism: Contribution of Students’ Social Skills, August 2013
• Honors Thesis, The Use of Thought Suppression to Cope with Ego-Threat among Those with Fragile Self-Esteem, May 2009
• Research Assistant, Department of Psychology, Hamilton College, Summer 2007

Clinical Experience
• Doctoral School Psychology Intern, Letchworth Central School District, 2015 – present
• CEDAR Clinic Mobile Clinician, Juniata County School District, Spring 2015
• CEDAR Clinic Student Supervisor, Penn State CEDAR Clinic, 2014 – 2015
• School Psychology Practicum Intern, State College Area School District, 2013 – 2014
• School Psychology Student Clinician, Penn State CEDAR Clinic, 2011 – 2014

Teaching Experience
• Graduate Teaching Assistant, Human Development and Family Studies, 2011 – 2014
• Clinical Graduate Assistant, School Psychology Program, 2012 – 2014
• Statistics Teaching Assistant, Psychology Department, 2007 – 2009

Work Experience
• Respite Care Specialist, 2011 – present
• Level II Teacher, New England Center for Children, 2009 – 2011
• Undergraduate Counselor Intern for Children with ADHD, Center for Children and Families, 2008

Publications and Presentations
Woika, S. A., & Crimmins, A. E. (2014, October). Practical Guidance for Supervisors of School Psychologists. Presentation at the meeting of the Association of School Psychologists of Pennsylvania, State College, PA.
Crimmins, A. E. (2014, February). Student-Teacher Relationships among Children with Autism: Contribution of Students’ Social Skills. Poster presented at the meeting of the National Association of School Psychologists, Washington, DC.
Clark, T. C., Crimmins, A. E., & Leposa, B. (2012, February). Reading first, or is it? Paper presented at the meeting of the National Association of School Psychologists, Philadelphia, PA.
Borton, J. L. S., Crimmins, A. E., Ashby, R. S., & Ruddiman, J. F. (2012). How do individuals with fragile self-esteem cope with intrusive thoughts following ego threat? Self and Identity, 11, 16 – 35.

Awards and Honors
• Membership Award, Pennsylvania Psychologists Association, June 2013