Applied Psych Test Design: Part B - Test and Item Development

The Art and Science of Test Development—Part B

Test and item development

The basic structure and content of this presentation is grounded extensively in the test development procedures developed by Dr. Richard Woodcock

Kevin S. McGrew, PhD.

Educational Psychologist

Research Director, Woodcock-Muñoz Foundation

Part A: Planning, development frameworks & domain/test specification blueprints

Part B: Test and Item Development

Part C: Use of Rasch Technology

Part D: Develop norm (standardization) plan

Part E: Calculate norms and derived scores

Part F: Psychometric/technical and statistical analysis: Internal

Part G: Psychometric/technical and statistical analysis: External

The Art and Science of Test Development

The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.

The current module is designated by red bold font lettering

Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities – Gv domain & 3 selected narrow Gv abilities

Gv

Substantive Stage of Test Development: Develop Test Design and Specification Blueprint – Item Scale Development

Measurement or empirical domain

What types of tasks, items, format, materials, examiner responses, etc.?

How do we minimize construct irrelevant variance and maximize construct relevant variance?

There is a universe of potential Vz item tasks.

Which type/format should be selected?

Visualization (Vz): The ability to apprehend a spatial form, object, or scene and match it with another spatial object, form, or scene with the requirement to rotate it (one or more times) in two or three dimensions. Requires the ability to mentally imagine, manipulate or transform objects or visual patterns (without regard to speed of responding).

Visualization (Vz) Item Development: How do you do it?

What type, or types, of items are to be used?

• Stimulus presentation (e.g., visual-graphic, oral, etc.)
• Response mode (e.g., verbal, pointing, drawing, etc.)
• etc.

What physical materials are needed and how should they appear?

• Test books
• Test records
• Manipulatives
• Audio tapes/CDs
• etc.

How is the test to be scored?

• Dichotomous (0/1), polychotomous (0/1/2/3/4/…)
• By hand, machine, computer
• Scoring rubrics/guides
• etc.

Visualization (Vz) Item Development: How do you do it? (cont.)

Use definition to identify essential task requirements

Possible sources for prototype items (select list)

• Tests that appear to measure same ability on other test batteries

• Source books
  • Carroll’s 1993 book (lists tasks he found in his factor analysis review)
  • International Directory of Spatial Tests (1983)

• Journal articles (old classic articles; recent articles for newer ideas)

• Published test resources (e.g., ETS Kit of Factor-Referenced Cognitive Tests)

• Internet

• Domain related books

• Create your own ideas

• etc

[Some examples from select sources follow on next series of slides]

Vz test design decision: Unspeeded 3-D Block Rotation type task

Primary reasons for decision

• Historical research base supports the use of these tasks (construct validity)

• No manipulatives (actual blocks) required (minimize construct irrelevant variance: no psychomotor abilities involved)

• Ease of administration and scoring (increased reliability)

• Items could span a wide developmental age-range (intended use of battery)

• Primary elements in Vz definition included in this type of task (content/construct validity)

• Added 3-D Gv task to WJ III – all other Gv tests were 2-D (content/face validity; goal to increase predictive validity of Gv cluster)

• Believed block format would be enjoyed by young children (part of the intended examinees)

• Future considerations: Readily adaptable to possible computer administration

Develop pool of items, from easy to difficult, that will then be ordered from low to high on the underlying Block Rotation measurement scale (analogy = developing a yardstick or ruler)

Low ability/easy items

High ability/difficult items

This is the “try out” phase of the new test idea

Pay extra attention to the development of very “easy” and very “difficult” items

Be prepared for the possibility that your initial item type/format may not be the one you end up with. The empirical data may send you in a different direction, or down a slightly modified path.

Typically these initial prototype try out items are not of publication quality

Often the hardest part of item development is the development of the examiner’s verbal instructions, which need to be succinct, easy to understand, not complicated, etc.

This is one of the reasons for the try-out phase: to catch possible problems with the items.

This is typically done by authors themselves (and other trained staff) with very small numbers of subjects (as few as 10-12) to see how the instructions and ideas work

Carefully observe examinee performance and make note of any ideas regarding how to improve the items and instructions

• It is sometimes useful to do a post-testing interview with the examinee about their experience (e.g., “How did you go about working the problems?”).

Need to try out with young/immature and mature subjects

• Be careful not to use only very select subjects (e.g., friends and children of university faculty)

Make changes in items, instructions, materials, etc. and continue to try out items with additional subjects. Repeat this iterative process as many times as necessary.

• More iterations are better: better safe than sorry down the road

Wise to have multiple individuals try out the items to gather multiple sources of feedback

Structural (Internal) Stage of Test Development

Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)

Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?

Method and concepts: Internal domain studies; item/subscale intercorrelations; item response theory (IRT)

Characteristics of strong test validity program

• Moderate item internal consistency
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure

Prepare and evaluate “first draft” version of the test

Develop 2 to 3 times as many items as needed for the final test

• In our Block Rotation example, we crafted an initial set of 44 items

Develop preliminary examiner test record for test. You must have a standardized test record system for quality control purposes

Administer to 50+ subjects (administration by authors and trained staff)

• Need to try out with young/immature and mature subjects

Carefully observe examinee performance and make note of any ideas regarding how to improve the items and instructions

Run initial IRT/Rasch on data to get some sense of item difficulty

Make modifications that are desirable, including dropping and initial re-ordering of items
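A rough first look at item difficulty can be had even before a full Rasch run. The Python sketch below converts each item's proportion correct into a centered log-odds value and sorts the items from easy to hard. It is a simple approximation for illustration, not the estimation routine used by Rasch software, and the function names and tryout data are hypothetical.

```python
import math

def logit_difficulties(responses):
    """Approximate item difficulties (in logits) from scored responses.

    responses: one list per examinee; each entry is 1 (correct),
    0 (incorrect), or None (item not administered).
    """
    n_items = len(responses[0])
    difficulties = []
    for i in range(n_items):
        scored = [row[i] for row in responses if row[i] is not None]
        # Add 0.5 to each count so items everyone passed or failed stay finite.
        right = sum(scored) + 0.5
        wrong = len(scored) - sum(scored) + 0.5
        difficulties.append(math.log(wrong / right))
    mean = sum(difficulties) / n_items
    # Center the scale at zero so values are relative to the average item.
    return [d - mean for d in difficulties]

def reorder_items(item_ids, difficulties):
    """Return item ids sorted from easiest (lowest logit) to hardest."""
    return [item for _, item in sorted(zip(difficulties, item_ids))]

# Hypothetical tryout data: 5 examinees x 4 items.
tryout = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
items = ["A", "B", "C", "D"]
diffs = logit_difficulties(tryout)
print(reorder_items(items, diffs))
```

Items that land far from their intended position, or that nearly everyone passes or fails, are candidates for revision, re-ordering, or removal.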

Prepare and evaluate “calibration” version of the test

Administer to 200 to 300+ subjects

• Need to administer to a “rectangular” (aka, uniform) distribution of subjects

Important concept: You are “norming the scale” – not yet constructing the “norms for the test”
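One way to approximate a rectangular sample is to draw the same number of subjects from each band of a recruitment pool. The Python sketch below assumes age bands serve as a rough proxy for ability level; that proxy, the band boundaries, the pool, and the function name are all illustrative assumptions, not part of the procedure described in the slides.

```python
import random
from collections import defaultdict

def rectangular_sample(pool, bands, per_band, seed=0):
    """Draw the same number of subjects from every age band.

    pool: list of (subject_id, age) tuples (hypothetical recruitment pool).
    bands: list of (low, high) age ranges; low inclusive, high exclusive.
    per_band: number of subjects to draw from each band.
    """
    rng = random.Random(seed)
    by_band = defaultdict(list)
    for subject_id, age in pool:
        for low, high in bands:
            if low <= age < high:
                by_band[(low, high)].append(subject_id)
                break
    sample = {}
    for band in bands:
        candidates = by_band[band]
        if len(candidates) < per_band:
            raise ValueError(f"Band {band} has only {len(candidates)} candidates")
        sample[band] = rng.sample(candidates, per_band)
    return sample

# Hypothetical pool of 660 volunteers, deliberately heavy on ages 10 to 12.
ages = [3, 5, 8, 9, 10, 10, 11, 12, 16, 25, 40]
pool = [(i, ages[i % len(ages)]) for i in range(660)]
bands = [(2, 6), (6, 10), (10, 14), (14, 20), (20, 90)]

sample = rectangular_sample(pool, bands, per_band=40)
print({band: len(ids) for band, ids in sample.items()})
```

With five bands of 40 subjects each, the sketch yields 200 subjects spread evenly across the range, so the easy and difficult ends of the scale receive as much data as the middle.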

[Figure: two sample distributions compared, labeled “…is better than…”]

[cont. next slide]

Prepare and evaluate “calibration” version of the test. Then develop “norming” test

Enter calibration data and run IRT/Rasch (WINSTEPS; BIGSTEPS)

• [Note – sample Rasch output will be displayed later]
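For orientation, the dichotomous Rasch model can be written as below. The notation is chosen here for illustration and does not come from the slides: \theta_n is the ability of person n, b_i is the difficulty of item i, and X_{ni} is the scored response (1 = correct).

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

When a person's ability equals an item's difficulty, the probability of a correct response is 0.5. Because abilities and difficulties sit on the same logit scale, calibrated items can be laid out like marks on a ruler.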

Revise calibration test as indicated based on Rasch findings

• Drop, add, possibly re-order items
• Improve instructions and item scoring keys

Have an expert panel review items for potential bias

Prepare “norming” test form

[Note: There is also a “norming-calibration” process that will not be covered during this broad-stroke presentation. If this process is used, one would prepare the norming-calibration form of the test at this stage.]

[Figure: Theoretical Domain - CHC. g, with the broad abilities Gf, Gv, Glr, Gs, Gc, Gsm, and Ga. Cylinders = broad CHC abilities; circles = narrow CHC abilities.]

The process just described occurs for all the tests you are developing

Specification of test design blueprint is critically important

Development of operational definition of constructs to be measured is important. Should flow from theory and research

You don’t always need to re-invent the wheel when deciding on item format, type, etc. Lots of existing resources are often available for ideas

Rectangular distributions are critical during item development and try out phases. Pay extra attention to the development of very “easy” and very “difficult” items

Don’t grow too attached to your items or tests. Some will not survive.

IRT/Rasch analysis, although made easy with contemporary software, is not for novices/rookies. You need to know, or have a consultant who knows, how these programs work.

• Common serious error by novices (one example): Not leaving “blanks” for items not administered below the basal or above the ceiling. IRT technology/software knows how to handle the blanks. DON’T impute 1’s below the basal and 0’s above the ceiling: doing so makes it all but impossible to know how the items are “behaving” and then to construct a test from the pool of items. 1/0 imputation can be done later (in new separate files) for internal consistency reliability calculations or special Rasch-based reasons.
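The effect of imputing scores for unadministered items can be seen with a small, made-up data set. In the Python sketch below, None marks an item that was not administered; the data, the helper names, and the rule that blanks occur only before the first or after the last administered item are assumptions for illustration.

```python
def proportion_correct(responses, item):
    """Proportion correct on one item, using only examinees who took it."""
    scored = [row[item] for row in responses if row[item] is not None]
    return sum(scored) / len(scored) if scored else None

def impute_basal_ceiling(row):
    """Fill blanks below the basal with 1 and above the ceiling with 0.

    Shown for contrast only; this is the practice the slide warns against
    when the file is used for item calibration.
    """
    first = next(i for i, x in enumerate(row) if x is not None)
    last = max(i for i, x in enumerate(row) if x is not None)
    return [1 if i < first else 0 if i > last else x for i, x in enumerate(row)]

# Hypothetical records: 4 examinees x 6 items ordered easy to hard.
raw = [
    [1, 1, 0, 0, None, None],        # young child, stopped at ceiling
    [None, None, 1, 1, 0, 0],        # older child, started at a later item
    [None, None, None, 1, 1, 0],     # adult, started later still
    [None, 0, 1, 1, 0, None],        # missed the second item when given it
]
imputed = [impute_basal_ceiling(row) for row in raw]

item = 1  # second item in the pool
print("blanks kept:", proportion_correct(raw, item))
print("1/0 imputed:", proportion_correct(imputed, item))
```

Only two examinees actually took the second item, and one of them missed it, so the observed proportion correct is 0.50. After imputation, the two examinees who never saw the item are counted as passing it, the proportion rises to 0.75, and the item looks easier than the data show.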

You should develop 2-3 times as many items as you will need in the final test

Spend considerable time and effort in developing concise standardized examiner instructions. Edit, revise, edit, revise, etc.

More small-sample item tryouts, followed by iterative revisions and new item development, are better than fewer.

Test authors must be intimately involved in item development and initial item try-outs

Test design engineering (Woodcock, 1970, 1975, 1999, 2008)

Test design engineering section will be part of a future “advanced module”

End of Part B

Additional steps in test development process will be presented in subsequent modules as they are developed