Applied Psych Test Design: Part A--Planning, development frameworks & domain/test specification blueprints

The Art and Science of Test Development—Part A

Planning, development frameworks & domain/test specification blueprints

The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock.

Kevin S. McGrew, PhD.

Educational Psychologist

Research Director, Woodcock-Muñoz Foundation

Part A: Planning, development frameworks & domain/test specification blueprints

Part B: Test and Item Development

Part C: Use of Rasch Technology

Part D: Develop norm (standardization) plan

Part E: Calculate norms and derived scores

Part F: Psychometric/technical and statistical analysis: Internal

Part G: Psychometric/technical and statistical analysis: External

The Art and Science of Test Development

The above-titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.

The current module is designated by red bold font lettering

“In an ever-changing world, psychological testing remains the flagship of applied psychology”

Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8 (4), 341-349.

Desirable Personality Traits of Test Developers

• Obsessive-compulsive

• Intellectually inquisitive

• Masochistic

• 1/99% I/P (inspiration/perspiration) ratio

• Sadistic

• Tough-skinned

Desirable Personality Traits of Test Developers

• Willingness to take risks and giant leaps of faith

X = T + E (observed score = true score + error)
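The decomposition above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the score distributions, the means and standard deviations, and the function name `simulate_observed_scores` are hypothetical and do not come from the presentation. It shows that when true scores and errors are independent, the share of observed-score variance attributable to true scores should land near var(T) / (var(T) + var(E)).

```python
import random
import statistics


def simulate_observed_scores(n_examinees=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate X = T + E with independent, normally distributed T and E.

    All parameter values are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n_examinees)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_examinees)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


true_scores, errors, observed = simulate_observed_scores()

# Proportion of observed-score variance that is true-score variance.
# With the assumed SDs the theoretical value is 100 / (100 + 25) = 0.8.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(f"Estimated reliability: {reliability:.3f}")
```

In practice true scores are never observed directly, which is why the simulation (where T is known by construction) is useful only as an illustration of the equation.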

The approach to test development used: Item Response Theory (IRT)

Classical Test Theory (CTT), and IRT vs CTT comparisons, are not covered in this presentation

The bible of test development: The “Joint Standards”

Test development is a complex series of interconnected steps

• The reality of the complexity of test development is not fully appreciated by most test users

• The following complex flow-charts are intended to illustrate the magnitude of the overall project complexity

• This presentation will focus on the more general, broad stroke test development framework

• The process is much more non-linear than depicted by flow charts and presentations

“Generic” Woodcock test development

flowchart

Test/Battery Development: Practical “Broad Stroke” Framework

(Woodcock)

A detailed description for a test, often called a test blueprint, that specifies:

• The number or proportion of items that assess each content and process/skill area

• The format of items, response, and scoring rubrics and procedures, and

• The desired psychometric properties of the items and test such as the distribution of item difficulty and discrimination abilities
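A blueprint of this kind can be written down as a small data structure. The sketch below is illustrative only: the class name `Blueprint`, its fields, the 40-item length, the proportions, and the difficulty range are all hypothetical and are not taken from any actual battery. It shows one way to turn content proportions into whole-item counts.

```python
from dataclasses import dataclass


@dataclass
class Blueprint:
    """Hypothetical test blueprint; field names are illustrative assumptions."""
    name: str
    total_items: int
    content_proportions: dict      # content/skill area -> proportion of items
    item_format: str
    target_difficulty_range: tuple  # (min, max) on an assumed difficulty scale

    def __post_init__(self):
        if abs(sum(self.content_proportions.values()) - 1.0) > 1e-9:
            raise ValueError("content proportions must sum to 1.0")

    def items_per_area(self):
        """Allocate whole items to each area using largest-remainder rounding."""
        raw = {a: p * self.total_items for a, p in self.content_proportions.items()}
        counts = {a: int(v) for a, v in raw.items()}
        leftover = self.total_items - sum(counts.values())
        # Give any leftover items to the areas with the largest fractional parts.
        by_remainder = sorted(raw, key=lambda a: raw[a] - counts[a], reverse=True)
        for area in by_remainder[:leftover]:
            counts[area] += 1
        return counts


# Hypothetical example: a 40-item visual-processing test.
blueprint = Blueprint(
    name="Example Gv test",
    total_items=40,
    content_proportions={"SR": 0.3, "Vz": 0.5, "MV": 0.2},
    item_format="multiple choice",
    target_difficulty_range=(-3.0, 3.0),
)
print(blueprint.items_per_area())
```

Largest-remainder rounding is used so the per-area counts always sum to the intended test length, even when the proportions do not divide evenly.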

Test/Battery Development: Common Conceptual Psychometric Validity Framework

(Benson, 1998 summary)

Substantive Stage of Test Development

Purpose: Define the theoretical and empirical/measurement domains of interest (e.g., intelligence or cognitive abilities: cognitive + achievement)

Questions asked: How should intelligence be defined and operationally measured?

Method and concepts:

• Theory development & validation

• Generate definitions

• Item and scale development

• Content validation

• Evaluate construct underrepresentation and construct irrelevancy

Characteristics of strong test validity program

• A strong psychological theory plays a prominent role

• Theory provides a well-specified and bounded domain of constructs

• The empirical domain includes measures of all potential constructs (i.e., adequate construct representation)

• The empirical domain includes measures that only contain reliable variance related to the theoretical constructs (i.e., construct relevance)

Structural (Internal) Stage of Test Development

Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)

Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?

Method and concepts:

• Internal domain studies

• Item/subscale intercorrelations

• Exploratory/confirmatory factor analysis

• Item response theory (IRT)

• Multitrait-Multimethod matrix

• Generalizability theory


• Moderate item internal consistency

• Measures co-vary in a manner consistent with the intended theoretical structure

• Factors reflect trait rather than method variance

• Items/measures are representative of the empirical domain

• Items fit the theoretical structure

• The theoretical/empirical model is deemed plausible (especially when compared against other competing models) based on substantive and statistical criteria
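Item internal consistency, the first criterion above, is often summarized with a single coefficient. The sketch below computes Cronbach's alpha, which is one common index; the presentation does not say which index is used, so treat the choice as an assumption. The function name and the response data are hypothetical.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of examinee rows, each holding k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])
    item_variances = [
        statistics.variance([row[i] for row in item_scores]) for i in range(k)
    ]
    total_variance = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical right/wrong (1/0) responses: 6 examinees x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Alpha is a classical index; in an IRT-based program, comparable information would typically come from item fit and person/item separation statistics rather than from alpha alone.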

External Stage of Test Development

Purpose: Examine the external relations among the focal construct (i.e., intelligence or cognitive abilities) and other constructs and/or subject characteristics

Questions asked: Do the focal constructs and observed measures “fit” within a network of expected construct relations (i.e., the nomological network)?

Method and concepts:

• Group differentiation

• Structural equation modeling

• Correlation of observed measures with other measures

• Multitrait-Multimethod matrix


• Focal constructs vary in theorized ways with other constructs

• Measures of the constructs differentiate existing groups that are known to differ on the constructs

• Measures of focal constructs correlate with other validated measures of the same constructs

• Theory-based hypotheses are supported, particularly when compared to rival hypotheses
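Two of the external checks listed above, group differentiation and correlation with other validated measures, reduce to simple statistics. The sketch below is a minimal illustration, assuming Cohen's d as the group-difference index and Pearson's r as the correlation; neither choice is specified in the presentation, and the function names and scores are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled SD)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical scores on a new measure for two groups known to differ.
known_high = [108, 112, 115, 120, 125]
known_low = [90, 95, 99, 104, 107]
print(f"Group differentiation (d): {cohens_d(known_high, known_low):.2f}")

# Hypothetical scores on the new measure and an established measure.
new_measure = [90, 95, 99, 104, 107, 112]
established = [88, 97, 101, 100, 110, 115]
print(f"Convergent correlation (r): {pearson_r(new_measure, established):.2f}")
```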

What is the intended purpose?

Who are the potential users?

Who are the intended examinees?

What domain(s) of behavior are to be measured and in what proportion?

• Content/substantive validity

• Maximize construct representation

• Minimize construct-irrelevant variance

What type, or types, of items are to be used?

How is the test to be scored?

• By hand, machine, computer

• Scoring rubrics/guides

• Correction for guessing
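For the last scoring option, one widely taught correction for guessing is the formula score: rights minus wrongs divided by the number of answer options less one, with omitted items ignored. The sketch below assumes that formula; the presentation does not specify which correction is intended, and the function name is hypothetical.

```python
def corrected_score(num_right, num_wrong, options_per_item):
    """Formula score: R - W / (k - 1), where k is the number of answer options.

    Omitted items are counted as neither right nor wrong.
    """
    if options_per_item < 2:
        raise ValueError("items need at least two answer options")
    return num_right - num_wrong / (options_per_item - 1)


# 30 right and 8 wrong on five-option items: 30 - 8/4 = 28.0
print(corrected_score(30, 8, 5))
```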

What types of derived scores will be provided?

Practical “Broad Stroke” Framework: Typical Questions to Ask

(Woodcock)


How are the scores to be interpreted?

• Types of profiles to provide

What physical materials are needed and how should they appear?

• Test books

• Test records

• Manipulatives

• Audio tapes/CDs

• Computer disks

• Scoring keys

• Manuals

• Training materials

• etc.

Practical “Broad Stroke” Framework

Common Conceptual Psychometric Validity Framework

This presentation is an integration of the practical and psychometric test/battery frameworks

Substantive Stage

Structural (Internal) & External Stages

Based on the presenter's experience as a coauthor of the Woodcock-Johnson Battery, Third Edition (WJ III; 2001)

Examples used in this presentation come from the domain of intelligence or cognitive abilities (cognitive + achievement)

Typically there are two types of test specification blueprints

• Well defined a priori (typically theory-based) blueprints

• Less well-defined (emerging) data-driven (empirical) blueprints

Possible theory-based intelligence model test design blueprints (select examples)

Das-Naglieri PASS Theory

Gardner MI theory

Cattell-Horn-Carroll (CHC) theory

Possible emerging, empirical, or pragmatic intelligence model test design blueprints (select examples)

Original Wechsler Verbal/Nonverbal model

1977 WJ Pragmatic Decision-Making model




Method and concepts • Theory development & validation


• A strong psychological theory plays a prominent role

• Theory provides a well-specified and bounded domain of constructs

• Psychometric approach: the dominant approach; it has inspired the most research and is used most widely in practical settings (p. 77).

• Several theorists argue that there are many different “intelligences” (systems of abilities), only a few of which can be captured by standard psychometric tests (p. 78)

CHC Theory Defined

• Combination of research by Raymond Cattell, John Horn, and John Carroll

• The most empirically-supported, psychometric-based, contemporary description of the structure of human cognitive abilities

• Based on the analyses of hundreds of data sets that were not restricted to a particular test battery

• The theory describes cognitive abilities as a function of degree of breadth/generality

– Broad and narrow cognitive abilities
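The breadth/generality hierarchy can be pictured as a nested mapping from the general factor to broad abilities to narrow abilities. The sketch below is illustrative only: it uses ability codes that appear elsewhere in this presentation, fills in narrow abilities only for Gv (SR, Vz, MV), and leaves the other broad abilities empty rather than guessing. The names `CHC_MODEL` and `broad_for_narrow` are hypothetical.

```python
# Stratum III (general) -> Stratum II (broad) -> Stratum I (narrow).
# Only the Gv narrow abilities defined later in the presentation are listed;
# the empty lists are placeholders, not a claim that no narrow abilities exist.
CHC_MODEL = {
    "g": {
        "Gf": [],
        "Gc": [],
        "Gsm": [],
        "Gv": ["SR", "Vz", "MV"],
        "Ga": [],
        "Glr": [],
        "Gs": [],
    }
}


def broad_for_narrow(narrow_code, model=CHC_MODEL):
    """Return the broad ability code that contains a narrow ability, or None."""
    for broad_abilities in model.values():
        for broad, narrow_list in broad_abilities.items():
            if narrow_code in narrow_list:
                return broad
    return None


print(broad_for_narrow("Vz"))
```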

Carroll and Cattell-Horn Model Comparison

[Figure: broad abilities in the Carroll and Cattell-Horn models, shown side by side]

Carroll: Fluid Intelligence (Gf), Crystallized Intelligence (Gc), General Memory & Learning (Gy), Broad Visual Perception (Gv), Broad Auditory Perception (Gu), Broad Retrieval Ability (Gr), Broad Cognitive Speediness (Gs), Decision/Reaction Time/Speed (Gt)

Cattell-Horn: Fluid Intelligence (Gf), Crystallized Intelligence (Gc), Quantitative Knowledge (Gq), Short-Term Memory (Gsm), Visual Processing (Gv), Auditory Processing (Ga), Long-Term Retrieval (Glr), Processing Speed (Gs), Correct Decision Speed (CDS), Reading/Writing (Grw)

...most disciplines have a common set of terms and definitions (i.e., a standard nomenclature) that facilitates communication among professionals and guards against misinterpretations. In chemistry, this standard nomenclature is reflected in the ‘Table of Periodic Elements’. Carroll (1993a) has provided an analogous table for intelligence…

(Flanagan & McGrew, 1998)

Richard Snow (1993): “John Carroll has done a magnificent thing. He has reviewed and reanalyzed the world’s literature on individual differences in cognitive abilities…no one else could have done it… it defines the taxonomy of cognitive differential psychology for many years to come.”

Burns (1994): Carroll’s book “is simply the finest work of research and scholarship I have read and is destined to be the classic study and reference work on human abilities for decades to come” (p. 35).

John Horn (1998):

A “tour de force summary and integration” that is the “definitive foundation for current theory” (p. 58). Horn compared Carroll’s summary to “Mendelyev’s first presentation of a periodic table of elements in chemistry” (p. 58).

Arthur Jensen (2004): “…on my first reading this tome, in 1993, I was reminded of the conductor Hans von Bülow’s exclamation on first reading the full orchestral score of Wagner’s Die Meistersinger, ‘It’s impossible, but there it is!’”

“Carroll’s magnum opus thus distills and synthesizes the results of a century of factor analyses of mental tests. It is virtually the grand finale of the era of psychometric description and taxonomy of human cognitive abilities. It is unlikely that his monumental feat will ever be attempted again by anyone, or that it could be much improved on. It will long be the key reference point and a solid foundation for the explanatory era of differential psychology that we now see burgeoning in genetics and the brain sciences” (p. 5).

The verdict is unanimous re: the importance of Carroll’s (1993) work

Carroll and Cattell-Horn Broad Ability Correspondence (vertically-aligned ovals represent similar broad domains)

[Figure: four panels of ability codes. Row labels: Stratum III (general), Stratum II (broad)]

A. Carroll Three-Stratum Model: g; Gf, Gc, Gy, Gv, Gu, Gr, Gs, Gt

B. Cattell-Horn Extended Gf-Gc Model: Gf, Gc, Gq, SAR/Gsm, Gv, Ga, TSR/Glm, Gs, CDS, Grw

C. Cattell-Horn-Carroll (CHC) Integrated Model: g; Gf, Gc, Gq, Gsm, Gv, Ga, Glr, Gs, Gt, Grw

D. Tentatively identified Stratum II (broad) domains (see note 1): Gkn, Gh, Gk, Go, Gp, Gps

Notes. Broad ability factor codes based on Carroll (1993) and Horn and Blankson (2005). See Table 1 for additional explanation.

80+ Stratum I (narrow) abilities have been identified under the Stratum II broad abilities. They are not listed here due to space limitations (see Table 1).

Placement of g to the left side of the Carroll Three-Stratum Model (A) is consistent with Carroll's (1993) published figures, a placement reflecting his finding that the broad abilities towards the left (e.g., Gf, Gc) had the highest loadings on the g-factor. The placement of the Grw and Gq factors in the Cattell-Horn Extended Gf-Gc Model (B) is not consistent with this g-broad ability representation, as Grw and Gq typically demonstrate high g-loadings. Grw and Gq are placed to the right in B to reflect their absence in model A.

Gf: Fluid reasoning

Gc: Comprehension-knowledge

Gsm: Short-term memory

Gv: Visual processing

Ga: Auditory processing

Glr: Long-term storage and retrieval

Gs: Processing speed

Gt: Decision and reaction speed

Grw: Reading and writing

Gq: Quantitative knowledge

Gkn: General (domain-specific) knowledge

Gh: Tactile abilities

Gk: Kinesthetic abilities

Go: Olfactory abilities

Gp: Psychomotor abilities

Gps: Psychomotor speed

(see Table 1 for definitions)

1 See McGrew (2004, 2005) for literature review supporting these domains

CHC Broad (Stratum II) Ability Domains

(Missing g-to-broad ability arrows acknowledges that Carroll and Cattell-Horn disagreed on the validity of the general factor)

© Institute for Applied Psychometrics, LLC Kevin S. McGrew 7-22-08

Theoretical Domain Specification

[Figure: g above the broad CHC abilities Gf, Gc, Gsm, Gv, Ga, Glr, Gs. Cylinders = broad CHC abilities; circles = narrow CHC abilities]

Substantive Stage of Test Development: Develop Test Design and Specification Blueprint

• What is the theoretical domain?

• How should intelligence be defined?

• What intelligence theory has the best validity evidence?

Answer: Cattell-Horn-Carroll (CHC) theory of cognitive abilities

What broad and narrow ability domain(s) are to be measured and in what proportion?

• Answer relates to questions regarding intended purpose of battery, intended examinees, and intended users.

• How do we assure adequate construct representation?

How do we define the broad and narrow ability constructs?

• Content validity important

Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities

[Figure: g above the broad CHC abilities Gf, Gc, Gsm, Gv, Ga, Glr, Gs]






Example domain to be used for illustration of process: Gv (Visual Processing)

What narrow Gv ability domain(s) are to be measured and in what proportion?

• Answer relates to questions regarding intended purpose of battery, intended examinees, and intended users.

• How do we assure adequate construct representation?




Method and concepts • Generate definitions


• The empirical domain includes measures of all potential constructs (i.e., adequate construct representation)


Definition of broad Gv (Visual Processing)

• Ability to perceive, analyze, synthesize and think with visual patterns

• Ability to store and recall visual representations

• Fluent thinking with stimuli that are visual in the “mind’s eye”

Narrow Gv ability definitions

Spatial Relations (SR): Ability to rapidly perceive and manipulate relatively simple visual patterns or to maintain orientation with respect to objects in space.

Visualization (Vz): The ability to apprehend a spatial form, object, or scene and match it with another spatial object, form, or scene with the requirement to rotate it (one or more times) in two or three dimensions. Requires the ability to mentally imagine, manipulate or transform objects or visual patterns (without regard to speed of responding).

Visual Memory (MV): Ability to form and store a mental representation or image of a visual stimulus and then recognize or recall it later.

We will focus on one: Visualization (Vz)




Method and concepts • Content validation


Content validity evidence

Refers to logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores (Joint Test Standards)

This is a non-statistical type of validity that involves “the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured” (Anastasi & Urbina, 1997)

Knowledge and skills covered (sampled) by the test items should be representative of the larger population domain of knowledge and skills.
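One simple, non-statistical way to audit representativeness is to compare how a draft item pool is distributed across content areas with the proportions the blueprint calls for. The sketch below assumes each item has been tagged with a single narrow ability; the tags, target proportions, tolerance, and the function name `coverage_gaps` are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def coverage_gaps(item_tags, target_proportions, tolerance=0.05):
    """Return areas whose observed share of items differs from the target.

    The result maps each flagged area to (observed - target), rounded to 3 places.
    """
    counts = Counter(item_tags)
    n_items = len(item_tags)
    gaps = {}
    for area, target in target_proportions.items():
        observed = counts.get(area, 0) / n_items
        if abs(observed - target) > tolerance:
            gaps[area] = round(observed - target, 3)
    return gaps


# Hypothetical draft pool of 10 items tagged by narrow Gv ability.
draft_pool = ["Vz"] * 6 + ["SR"] * 3 + ["MV"] * 1
targets = {"Vz": 0.5, "SR": 0.3, "MV": 0.2}
print(coverage_gaps(draft_pool, targets))
```

A check like this supports, but does not replace, expert review of whether the items actually sample the intended domain.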

Content validity evidence: One example

Etc.

Content validity evidence: One example (cont. – for all tests in battery)

Content validity evidence: Another example in the domain of reading: Logical—theoretical skill hierarchy task analysis model

End of Part A

Additional steps in the test development process will be presented in subsequent modules as they are developed.