DESCRIPTION
The Art and Science of Applied Test Development. This is the first in a series of PPT modules explicating the development of psychological tests in the domain of cognitive ability using contemporary methods (e.g., theory-driven test specification, IRT/Rasch scaling). The presentations are intended to be conceptual, not statistical, in nature. Feedback is appreciated.
The Art and Science of Test Development—Part A
Planning, development frameworks & domain/test specification blueprints
The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock
Kevin S. McGrew, PhD.
Educational Psychologist
Research Director, Woodcock-Muñoz Foundation
Part A: Planning, development frameworks & domain/test specification blueprints
Part B: Test and Item Development
Part C: Use of Rasch Technology
Part D: Develop norm (standardization) plan
Part E: Calculate norms and derived scores
Part F: Psychometric/technical and statistical analysis: Internal
Part G: Psychometric/technical and statistical analysis: External
The Art and Science of Test Development
The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.
The current module is designated by red bold font lettering
“In an ever-changing world, psychological testing remains the flagship of applied psychology”
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8(4), 341-349.
Desirable Personality Traits of Test Developers
• Obsessive-compulsive
• Intellectually inquisitive
• Masochistic
• 1/99 % I/P ratio
• Sadistic
• Tough-skinned
• Willingness to take risks and giant leaps of faith
X = T + E (observed score = true score + error)
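The classical model above can be made concrete with a small simulation. This is a minimal sketch, assuming normally distributed true scores and errors that are independent of each other; the function names `simulate_ctt` and `reliability` are hypothetical. Under those assumptions, reliability is the share of observed-score variance that is true-score variance, Var(T) / Var(X).

```python
import random
import statistics


def simulate_ctt(n_examinees: int, true_sd: float, error_sd: float, seed: int = 1):
    """Simulate scores under the classical model X = T + E (hypothetical helper).

    Assumes T and E are independent and normally distributed with mean 0.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_examinees)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_examinees)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


def reliability(true_scores, observed):
    """Proportion of observed-score variance that is true-score variance."""
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


T, E, X = simulate_ctt(n_examinees=20_000, true_sd=1.0, error_sd=0.5)
# Theoretical value: 1.0 / (1.0 + 0.25) = 0.8; the simulated estimate should land near it.
print(f"estimated reliability: {reliability(T, X):.3f}")
```

Larger error variance (a bigger `error_sd`) lowers the estimate, which is the sense in which "error" in X = T + E limits what an observed score can tell us.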
The approach to test development used here: Item Response Theory (IRT)
Classical Test Theory (CTT) and IRT-versus-CTT comparisons are not covered in this presentation
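IRT and Rasch scaling are treated in later modules (Part C). As a minimal orientation only, the sketch below evaluates the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability (theta) and item difficulty (b), both expressed in logits. The function name `rasch_probability` is hypothetical.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability in logits
    b:     item difficulty in logits
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# When ability equals item difficulty, the probability of success is 0.5.
print(rasch_probability(0.0, 0.0))  # 0.5

# Probabilities for one item (b = 0) across a range of abilities.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  P(correct)={rasch_probability(theta, b=0.0):.3f}")
```

Because only the difference theta - b enters the formula, person abilities and item difficulties share one logit scale.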
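The bible of test development: The “Joint Standards” (the Standards for Educational and Psychological Testing, published jointly by AERA, APA, and NCME)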
Test development is a complex series of interconnected steps
• The reality of the complexity of test development is not fully appreciated by most test users
• The following complex flow-charts are intended to illustrate the magnitude of the overall project complexity
• This presentation will focus on the more general, broad stroke test development framework
• The process is much more non-linear than depicted by flow charts and presentations
“Generic” Woodcock test development flowchart
Test/Battery Development: Practical “Broad Stroke” Framework
(Woodcock)
A detailed description of a test, often called a test blueprint, that specifies:
• The number or proportion of items that assess each content and process/skill area
• The format of items and responses, and the scoring rubrics and procedures; and
• The desired psychometric properties of the items and test, such as the distribution of item difficulty and discrimination
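One way to make the blueprint elements above explicit is to record them as data. This is a hypothetical sketch: the `Blueprint` class, its field names, the content-area labels, and the logit difficulty range are illustrative assumptions, not part of the Woodcock procedures. The `allocate_items` method converts target proportions into whole-item counts using the largest-remainder method, one reasonable choice among several.

```python
import math
from dataclasses import dataclass


@dataclass
class Blueprint:
    """Hypothetical record of the blueprint elements listed above."""
    total_items: int
    content_proportions: dict[str, float]  # content/skill area -> share of items
    item_format: str                        # e.g., "multiple choice" (illustrative)
    difficulty_range: tuple[float, float] = (-3.0, 3.0)  # desired item difficulties, logits (illustrative)

    def allocate_items(self) -> dict[str, int]:
        """Turn target proportions into whole-item counts (largest-remainder method)."""
        total_share = sum(self.content_proportions.values())
        if not math.isclose(total_share, 1.0):
            raise ValueError(f"proportions sum to {total_share}, not 1.0")
        exact = {area: share * self.total_items for area, share in self.content_proportions.items()}
        counts = {area: math.floor(value) for area, value in exact.items()}
        leftover = self.total_items - sum(counts.values())
        # Hand any leftover items to the areas with the largest fractional remainders.
        by_remainder = sorted(exact, key=lambda area: exact[area] - counts[area], reverse=True)
        for area in by_remainder[:leftover]:
            counts[area] += 1
        return counts


blueprint = Blueprint(
    total_items=40,
    content_proportions={"Area A": 0.5, "Area B": 0.3, "Area C": 0.2},
    item_format="multiple choice",
)
print(blueprint.allocate_items())  # {'Area A': 20, 'Area B': 12, 'Area C': 8}
```

Rounding each proportion independently can leave the total one item short or over; the largest-remainder step keeps the counts summing to `total_items`.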
Test/Battery Development: Common Conceptual Psychometric Validity Framework
(Benson, 1998 summary)
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities: cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Theory development & validation
• Generate definitions
• Item and scale development
• Content validation
• Evaluate construct underrepresentation and construct irrelevancy
Characteristics of strong test validity program:
• A strong psychological theory plays a prominent role
• Theory provides a well-specified and bounded domain of constructs
• The empirical domain includes measures of all potential constructs (i.e., adequate construct representation)
• The empirical domain includes measures that only contain reliable variance related to the theoretical constructs (i.e., construct relevance)
Structural (Internal) Stage of Test Development
Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)
Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?
Method and concepts:
• Internal domain studies
• Item/subscale intercorrelations
• Exploratory/confirmatory factor analysis
• Item response theory (IRT)
• Multitrait-Multimethod matrix
• Generalizability theory
Characteristics of strong test validity program:
• Moderate item internal consistency
• Measures co-vary in a manner consistent with the intended theoretical structure
• Factors reflect trait rather than method variance
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure
• The theoretical/empirical model is deemed plausible (especially when compared against other competing models) based on substantive and statistical criteria
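Coefficient alpha is one widely used index of item internal consistency. The sketch below, assuming a simple items-by-examinees score matrix, computes it from item and total-score variances; the function name `cronbach_alpha` and the response data are hypothetical. It only computes the index and sets no threshold for what counts as "moderate."

```python
import statistics


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Coefficient alpha for an items-by-examinees score matrix (hypothetical helper).

    item_scores[i][p] is examinee p's score on item i.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Total score for each examinee = sum of that examinee's item scores.
    totals = [sum(person) for person in zip(*item_scores)]
    total_variance = statistics.pvariance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    item_variance_sum = sum(statistics.pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Illustrative right/wrong (1/0) responses: 4 items x 6 examinees.
responses = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```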
External Stage of Test Development
Purpose: Examine the external relations between the focal construct (i.e., intelligence or cognitive abilities) and other constructs and/or subject characteristics
Questions asked: Do the focal constructs and observed measures “fit” within a network of expected construct relations (i.e., the nomological network)?
Method and concepts:
• Group differentiation
• Structural equation modeling
• Correlation of observed measures with other measures
• Multitrait-Multimethod matrix
Characteristics of strong test validity program:
• Focal constructs vary in theorized ways with other constructs
• Measures of the constructs differentiate existing groups that are known to differ on the constructs
• Measures of focal constructs correlate with other validated measures of the same constructs
• Theory-based hypotheses are supported, particularly when compared to rival hypotheses
Practical “Broad Stroke” Framework: Typical Questions to Ask (Woodcock)
What is the intended purpose?
Who are the potential users?
Who are the intended examinees?
What domain(s) of behavior are to be measured and in what proportion?
• Content/substantive validity
• Maximize construct representation
• Minimize construct-irrelevant variance
What type, or types, of items are to be used?
How is the test to be scored?
• By hand, by machine, or by computer
• Scoring rubrics/guides
• Correction for guessing
What types of derived scores will be provided?
How are the scores to be interpreted?
• Types of profiles to provide
What physical materials are needed and how should they appear?
• Test books
• Test records
• Manipulatives
• Audio tapes/CDs
• Computer disks
• Scoring keys
• Manuals
• Training materials
• etc.
Practical “Broad Stroke” Framework
Common Conceptual Psychometric Validity Framework
This presentation is an integration of the practical and psychometric test/battery frameworks
Substantive Stage
Structural (Internal) & External Stages
Based on the presenter’s experience as a coauthor of the Woodcock-Johnson Battery, Third Edition (WJ III; 2001)
Examples used in this presentation come from the domain of intelligence or cognitive abilities (cognitive + achievement)
Typically there are two types of test specification blueprints
• Well defined a priori (typically theory-based) blueprints
• Less well-defined (emerging) data-driven (empirical) blueprints
Possible theory-based intelligence model test design blueprints (select examples)
Das-Naglieri PASS Theory
Gardner MI theory
Cattell-Horn-Carroll (CHC) theory
Possible emerging, empirical, or pragmatic intelligence model test design blueprints (select examples)
Original Wechsler Verbal/Nonverbal model
1977 WJ Pragmatic Decision-Making model
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities: cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Theory development & validation
Characteristics of strong test validity program:
• A strong psychological theory plays a prominent role
• Theory provides a well-specified and bounded domain of constructs
• Psychometric approach: is the dominant approach, has inspired the most research, and is used most widely in practical settings (p. 77).
• Several theorists argue that there are many different “intelligences” (systems of abilities), only a few of which can be captured by standard psychometric tests (p. 78)
CHC Theory Defined
• Combination of research by Raymond Cattell, John Horn, and John Carroll
• The most empirically-supported, psychometric-based, contemporary description of the structure of human cognitive abilities
• Based on the analyses of hundreds of data sets that were not restricted to a particular test battery
• The theory describes cognitive abilities as a function of degree of breadth/generality
– Broad and narrow cognitive abilities
Figure: Carroll and Cattell-Horn Model Comparison
Carroll (with g): Gf Fluid Intelligence; Gc Crystallized Intelligence; Gy General Memory & Learning; Gv Broad Visual Perception; Gu Broad Auditory Perception; Gr Broad Retrieval Ability; Gs Broad Cognitive Speediness; Gt Decision/Reaction Time/Speed
Cattell-Horn: Gf Fluid Intelligence; Gc Crystallized Intelligence; Gq Quantitative Knowledge; Gsm Short-Term Memory; Gv Visual Processing; Ga Auditory Processing; Glr Long-Term Retrieval; Gs Processing Speed; CDS Correct Decision Speed; Grw Reading/Writing
...most disciplines have a common set of terms and definitions (i.e., a standard nomenclature) that facilitates communication among professionals and guards against misinterpretations. In chemistry, this standard nomenclature is reflected in the ‘Table of Periodic Elements’. Carroll (1993a) has provided an analogous table for intelligence...
(Flanagan & McGrew, 1998)
Richard Snow (1993): “John Carroll has done a magnificent thing. He has reviewed and reanalyzed the world’s literature on individual differences in cognitive abilities…no one else could have done it… it defines the taxonomy of cognitive differential psychology for many years to come.”
Burns (1994): Carroll’s book “is simply the finest work of research and scholarship I have read and is destined to be the classic study and reference work on human abilities for decades to come” (p. 35).
John Horn (1998):
A “tour de force summary and integration” that is the “definitive foundation for current theory” (p. 58). Horn compared Carroll’s summary to “Mendelyev’s first presentation of a periodic table of elements in chemistry” (p. 58).
Arthur Jensen (2004): “…on my first reading this tome, in 1993, I was reminded of the conductor Hans von Bülow’s exclamation on first reading the full orchestral score of Wagner’s Die Meistersinger, ‘It’s impossible, but there it is!’”
“Carroll’s magnum opus thus distills and synthesizes the results of a century of factor analyses of mental tests. It is virtually the grand finale of the era of psychometric description and taxonomy of human cognitive abilities. It is unlikely that his monumental feat will ever be attempted again by anyone, or that it could be much improved on. It will long be the key reference point and a solid foundation for the explanatory era of differential psychology that we now see burgeoning in genetics and the brain sciences” (p. 5).
The verdict is unanimous re: the importance of Carroll’s (1993) work
Figure: CHC Broad (Stratum II) Ability Domains
A. Carroll Three-Stratum Model: g (Stratum III); Gf, Gc, Gy, Gv, Gu, Gr, Gs, Gt (Stratum II)
B. Cattell-Horn Extended Gf-Gc Model: Gf, Gc, SAR/Gsm, Gv, Ga, TSR/Glm, Gs, CDS, Gq, Grw
C. Cattell-Horn-Carroll (CHC) Integrated Model: g (Stratum III); Gf, Gq, Gc, Gsm, Gv, Ga, Glr, Gs, Gt, Grw (Stratum II)
D. Tentatively identified Stratum II (broad) domains¹: Gkn, Gh, Gk, Go, Gp, Gps
Carroll and Cattell-Horn Broad Ability Correspondence (vertically-aligned ovals represent similar broad domains)
(Missing g-to-broad ability arrows acknowledge that Carroll and Cattell-Horn disagreed on the validity of the general factor)
Notes. Broad ability factor codes are based on Carroll (1993) and Horn and Blankson (2005). See Table 1 for additional explanation.
80+ Stratum I (narrow) abilities have been identified under the Stratum II broad abilities. They are not listed here due to space limitations (see Table 1).
Placement of g to the left side of the Carroll Three-Stratum Model (A) is consistent with Carroll's (1993) published figures, a placement reflecting his finding that the broad abilities towards the left (e.g., Gf, Gc) had the highest loadings on the g factor. The placement of the Grw and Gq factors in the Cattell-Horn Extended Gf-Gc Model (B) is not consistent with this g-broad ability representation, as Grw and Gq typically demonstrate high g loadings. Grw and Gq are placed to the right in B to reflect their absence in model A.
Broad ability codes (see Table 1 for definitions):
• Gf: Fluid reasoning
• Gc: Comprehension-knowledge
• Gsm: Short-term memory
• Gv: Visual processing
• Ga: Auditory processing
• Glr: Long-term storage and retrieval
• Gs: Processing speed
• Gt: Decision and reaction speed
• Grw: Reading and writing
• Gq: Quantitative knowledge
• Gkn: General (domain-specific) knowledge
• Gh: Tactile abilities
• Gk: Kinesthetic abilities
• Go: Olfactory abilities
• Gp: Psychomotor abilities
• Gps: Psychomotor speed
¹ See McGrew (2004, 2005) for literature review supporting these domains
© Institute for Applied Psychometrics, LLC Kevin S. McGrew 7-22-08
Figure: Theoretical Domain Specification. g above the broad CHC abilities Gf, Gc, Gsm, Gv, Ga, Glr, and Gs (cylinders = broad CHC abilities; circles = narrow CHC abilities)
Substantive Stage of Test Development: Develop Test Design and Specification Blueprint
• What is the theoretical domain?
• How should intelligence be defined?
• What intelligence theory has the best validity evidence?
Answer: Cattell-Horn-Carroll (CHC) theory of cognitive abilities
What broad and narrow ability domain(s) are to be measured and in what proportion?
• The answer depends on the intended purpose of the battery, the intended examinees, and the intended users.
• How do we assure adequate construct representation?
How do we define the broad and narrow ability constructs?
• Content validity important
Figure: Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities. g above the broad CHC abilities Gf, Gc, Gsm, Gv, Ga, Glr, and Gs (cylinders = broad CHC abilities; circles = narrow CHC abilities)
Substantive Stage of Test Development: Develop Test Design and Specification Blueprint
Example domain to be used for illustration of process: Gv (Visual Processing)
What narrow Gv ability domain(s) are to be measured and in what proportion?
• The answer depends on the intended purpose of the battery, the intended examinees, and the intended users.
• How do we assure adequate construct representation?
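The question of proportion can be answered at two levels of the CHC hierarchy: a share of the battery for each broad ability and, within a broad ability, a share for each narrow ability. The sketch below multiplies the two. All weights, the total item count, and the names `BROAD_WEIGHTS`, `NARROW_WEIGHTS`, `check_weights`, and `items_per_narrow_ability` are hypothetical placeholders, not values from any published battery; the Gv narrow codes (SR, Vz, MV) are the ones defined later in this module.

```python
import math

# Hypothetical blueprint weights (illustrative only, not taken from WJ III):
# share of all items for each broad CHC ability ...
BROAD_WEIGHTS = {"Gf": 0.15, "Gc": 0.15, "Gsm": 0.15, "Gv": 0.15,
                 "Ga": 0.10, "Glr": 0.15, "Gs": 0.15}
# ... and share of a broad ability's items for each of its narrow abilities.
NARROW_WEIGHTS = {"Gv": {"SR": 0.30, "Vz": 0.50, "MV": 0.20}}


def check_weights(weights: dict[str, float]) -> None:
    """Raise if a set of proportions does not sum to 1."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, not 1.0")


def items_per_narrow_ability(broad: str, total_items: int) -> dict[str, int]:
    """Items per narrow ability = total items x broad share x narrow share (rounded)."""
    check_weights(BROAD_WEIGHTS)
    check_weights(NARROW_WEIGHTS[broad])
    broad_items = total_items * BROAD_WEIGHTS[broad]
    return {narrow: round(broad_items * share)
            for narrow, share in NARROW_WEIGHTS[broad].items()}


print(items_per_narrow_ability("Gv", total_items=200))  # {'SR': 9, 'Vz': 15, 'MV': 6}
```

Simple rounding is used here, so for other weights the narrow counts may not add up exactly to the broad-ability total; a largest-remainder step would address that.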
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities: cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Generate definitions
Characteristics of strong test validity program:
• The empirical domain includes measures of all potential constructs (i.e., adequate construct representation)
Definition of broad Gv (Visual Processing)
• Ability to perceive, analyze, synthesize and think with visual patterns
• Ability to store and recall visual representations
• Fluent thinking with stimuli that are visual in the “mind’s eye”
Narrow Gv ability definitions
Spatial Relations (SR): Ability to rapidly perceive and manipulate relatively simple visual patterns or to maintain orientation with respect to objects in space.
Visualization (Vz): The ability to apprehend a spatial form, object, or scene and match it with another spatial object, form, or scene with the requirement to rotate it (one or more times) in two or three dimensions. Requires the ability to mentally imagine, manipulate or transform objects or visual patterns (without regard to speed of responding).
Visual Memory (MV): Ability to form and store a mental representation or image of a visual stimulus and then recognize or recall it later.
We will focus on one: Visualization (Vz)
Substantive Stage of Test Development
Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities: cognitive + achievement)
Questions asked: How should intelligence be defined and operationally measured?
Method and concepts:
• Content validation
Characteristics of strong test validity program:
Content validity evidence
Refers to logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores (Joint Test Standards)
This is a non-statistical type of validity that involves “the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured” (Anastasi & Urbina, 1997)
Knowledge and skills covered (sampled) by the test items should be representative of the larger population domain of knowledge and skills.
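Content validation is mainly a matter of systematic expert judgment, but one bookkeeping step, comparing the share of items tagged to each domain against the blueprint target, is easy to automate. This is a minimal sketch, assuming each item has already been tagged by reviewers with one narrow-ability code; the function name `coverage_report`, the target proportions, and the 0.05 tolerance are hypothetical.

```python
from collections import Counter


def coverage_report(item_tags: list[str], target: dict[str, float],
                    tolerance: float = 0.05) -> dict[str, str]:
    """Compare each domain's observed share of items with its blueprint target.

    item_tags: one reviewer-assigned domain code per item (hypothetical input)
    target:    blueprint proportion for each domain
    tolerance: allowed absolute deviation (an illustrative assumption)
    """
    if not item_tags:
        raise ValueError("no items to evaluate")
    counts = Counter(item_tags)
    n = len(item_tags)
    report = {}
    for domain, target_share in target.items():
        observed_share = counts[domain] / n
        if observed_share < target_share - tolerance:
            report[domain] = "under-represented"  # possible construct underrepresentation
        elif observed_share > target_share + tolerance:
            report[domain] = "over-represented"
        else:
            report[domain] = "ok"
    # Items tagged outside the blueprint may warrant review for construct-irrelevant content.
    for domain in sorted(counts.keys() - target.keys()):
        report[domain] = "not in blueprint"
    return report


tags = ["SR"] * 9 + ["Vz"] * 9 + ["MV"] * 12
print(coverage_report(tags, {"SR": 0.30, "Vz": 0.50, "MV": 0.20}))
# {'SR': 'ok', 'Vz': 'under-represented', 'MV': 'over-represented'}
```

The flags are prompts for reviewers, not verdicts; whether an item truly samples its tagged domain remains a judgment call.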
Content validity evidence: One example
Etc…….
Content validity evidence: One example (cont. – for all tests in battery)
Content validity evidence: Another example in the domain of reading: Logical—theoretical skill hierarchy task analysis model
End of Part A
Additional steps in test development process will be presented in subsequent modules as they are developed