The Art and Science of Applied Test Development. This is the second in a series of PPT modules explicating the development of psychological tests in the domain of cognitive ability using contemporary methods (e.g., theory-driven test specification; IRT-Rasch scaling; etc.). The presentations are intended to be conceptual and not statistical in nature. Feedback is appreciated.
The Art and Science of Test Development—Part B
Test and item development
The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock
Kevin S. McGrew, PhD.
Educational Psychologist
Research Director, Woodcock-Muñoz Foundation
Part A: Planning, development frameworks & domain/test specification blueprints
Part B: Test and Item Development
Part C: Use of Rasch Technology
Part D: Develop norm (standardization) plan
Part E: Calculate norms and derived scores
Part F: Psychometric/technical and statistical analysis: Internal
Part G: Psychometric/technical and statistical analysis: External
The Art and Science of Test Development
The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.
The current module is designated by red bold font lettering
Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities – Gv domain & 3 selected narrow Gv abilities
Substantive Stage of Test Development: Develop Test Design and Specification Blueprint – Item Scale Development
Measurement or empirical domain
What types of tasks, items, format, materials, examiner responses, etc.?
How do we minimize construct irrelevant variance and maximize construct relevant variance?
There is a universe of potential Vz item tasks.
Which type/format should be selected?
Visualization (Vz): The ability to apprehend a spatial form, object, or scene and match it with another spatial object, form, or scene with the requirement to rotate it (one or more times) in two or three dimensions. Requires the ability to mentally imagine, manipulate or transform objects or visual patterns (without regard to speed of responding).
Visualization (Vz) Item Development: How do you do it?
What type, or types, of items are to be used?
• Stimulus presentation (e.g., visual-graphic, oral, etc.)
• Response mode (e.g., verbal, pointing, drawing, etc.)
• etc.
What physical materials are needed and how should they appear?
• Test books
• Test records
• Manipulatives
• Audio tapes/CDs
• etc.
How is the test to be scored?
• Dichotomous (0/1), polychotomous (0/1/2/3/4/…)
• By hand, machine, computer
• Scoring rubrics/guides
• etc.
Visualization (Vz) Item Development: How do you do it? (cont.)
Use definition to identify essential task requirements
Possible sources for prototype items (select list)
• Tests that appear to measure same ability on other test batteries
• Source books
  • Carroll’s 1993 book lists tasks he found in his factor analysis review
  • International Directory of Spatial Tests (1983)
• Journal articles (old classic articles; recent articles for newer ideas)
• Published test resources (e.g., ETS Kit of Factor-Referenced Cognitive Tests)
• Internet
• Domain related books
• Create your own ideas
• etc
[Some examples from select sources follow on next series of slides]
Vz test design decision: Un-speeded 3-D Block Rotation type task
Primary reasons for decision
• Historical research base supporting construct validity of these tasks (construct validity)
• No manipulatives (actual blocks) required (minimize construct irrelevant variance: no psychomotor abilities involved)
• Ease of administration and scoring (increased reliability)
• Items could span a wide developmental age-range (intended use of battery)
• Primary elements in Vz definition included in this type of task (content /construct validity)
• Added 3-D Gv task to WJ III – all other Gv tests were 2-D (content/face validity; goal to increase predictive validity of Gv cluster)
• Believed block format would be enjoyed by young children (part of intended examinees)
• Future considerations: Readily adaptable to possible computer administration
Develop pool of items, from easy to difficult, that will then be ordered from low to high on the underlying Block Rotation measurement scale (analogy = developing a yardstick or ruler)
Low ability/easy items
High ability/difficult items
This is the “try out” phase of the new test idea
Pay extra attention to the development of very “easy” and very “difficult” items
Be prepared for the possibility that your initial item type/format may not be the one you end up with. The empirical data may send you in a different direction, or down a slightly modified path.
Typically these initial prototype try out items are not of publication quality
Often the hardest part of item development is the development of the examiner’s verbal instructions – they need to be succinct, easy to understand, not complicated, etc.
This is one of the reasons for the try out phase: to catch possible problems with the items.
This is typically done by authors themselves (and other trained staff) with very small numbers of subjects (as few as 10-12) to see how the instructions and ideas work
Carefully observe examinee performance and make note of any ideas regarding how to improve the items and instructions
•Sometimes useful to do post-testing interview of examinee about their experience—e.g., “how did you go about working the problems?”, etc.
Need to try out with young/immature and mature subjects
•Be careful not to use only very select subjects (friends and children of university faculty)
Make changes in items, instructions, materials, etc. and continue to try out items with additional subjects. Repeat this iterative process as many times as necessary.
•More times are better—better to be safe rather than sorry down the road
Wise to have multiple individuals try out the items to gather multiple sources of feedback
Structural (Internal) Stage of Test Development
Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)
Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?
Method and concepts: Internal domain studies; item/subscale intercorrelations; item response theory (IRT)
Characteristics of strong test validity program
• Moderate item internal consistency
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure
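As a rough illustration of what "item internal consistency" can look like numerically, the sketch below computes KR-20 for a small matrix of dichotomously scored items. The slides do not name a specific coefficient, so the choice of KR-20, the function name and the data are all assumptions made for the example. It expects a complete 0/1 matrix with no blanks.

```python
def kr20(matrix):
    """KR-20 for a complete 0/1 matrix (rows = subjects, columns = items)."""
    n = len(matrix)          # number of subjects
    k = len(matrix[0])       # number of items
    totals = [sum(row) for row in matrix]
    mean_total = sum(totals) / n
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of p*(1-p) over items, where p is the proportion passing the item.
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical scores: 5 subjects by 4 items, items ordered easy to hard.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 3))
```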
Prepare and evaluate “first draft” version of the test
Develop 2 to 3 times as many items as needed for the final test
•In our Block Rotation example, we crafted an initial set of 44 items
Develop preliminary examiner test record for test. You must have a standardized test record system for quality control purposes
Administer to 50+ subjects (authors and trained staff)
•Need to try out with young/immature and mature subjects
Carefully observe examinee performance and make note of any ideas regarding how to improve the items and instructions
Run initial IRT/Rasch on data to get some sense of item difficulty
Make modifications that are desirable, including dropping and initial re-ordering of items
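A very rough first look at item difficulty, before any dedicated Rasch run, can be had by turning each item's proportion correct into a logit and sorting the items by it. The sketch below only illustrates that idea and is not what Rasch software does internally; the function names and the tiny data set are hypothetical. Items not administered are stored as None and skipped.

```python
import math

def rough_difficulty(responses):
    """Return {item: logit difficulty} from scored responses.

    responses: list of dicts mapping item name -> 1, 0, or None
    (None = item not administered to that subject).
    """
    difficulties = {}
    items = sorted({item for r in responses for item in r})
    for item in items:
        scores = [r[item] for r in responses if r.get(item) is not None]
        n = len(scores)
        if n == 0:
            continue  # nobody took this item, so there is no estimate
        # Small adjustment keeps p away from exactly 0 or 1 so the log is defined.
        p = (sum(scores) + 0.5) / (n + 1.0)
        difficulties[item] = math.log((1 - p) / p)
    return difficulties

def order_easy_to_hard(difficulties):
    """Item names sorted from lowest (easiest) to highest (hardest) logit."""
    return sorted(difficulties, key=difficulties.get)

# Hypothetical try-out data for four items.
data = [
    {"A": 1, "B": 1, "C": 0, "D": None},
    {"A": 1, "B": 0, "C": 0, "D": None},
    {"A": 1, "B": 1, "C": 1, "D": 0},
    {"A": None, "B": 1, "C": 0, "D": 0},
]
diffs = rough_difficulty(data)
print(order_easy_to_hard(diffs))
```

An ordering like this is only a starting point for dropping and re-ordering items; the calibration run on a larger sample is what places the items on the measurement scale.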
Prepare and evaluate “calibration” version of the test
Administer to 200 to 300+ subjects
• Need to administer to a “rectangular” (aka, uniform) distribution of subjects
Important concept: You are “norming the scale” – not yet constructing the “norms for the test”
[Figure: a rectangular (uniform) distribution of subjects is better than a non-rectangular one for calibration]
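The idea of a rectangular sample can be sketched as drawing the same number of subjects from each band of the ability range, rather than sampling in proportion to how common each band is. In the sketch below, age bands stand in for ability levels; the band labels, the volunteer pool and the per-band count are made-up values for illustration only.

```python
import random
from collections import Counter

def rectangular_sample(pool, per_band, seed=0):
    """Pick up to `per_band` subjects from every band.

    pool: list of (subject_id, band) tuples.
    """
    rng = random.Random(seed)
    by_band = {}
    for subject_id, band in pool:
        by_band.setdefault(band, []).append(subject_id)
    sample = []
    for band in sorted(by_band):
        ids = by_band[band]
        k = min(per_band, len(ids))  # a band may have fewer subjects than requested
        sample.extend((sid, band) for sid in rng.sample(ids, k))
    return sample

# Hypothetical volunteer pool: many more mid-range subjects than extremes.
pool = ([(f"s{i}", "young") for i in range(30)]
        + [(f"s{i}", "middle") for i in range(30, 230)]
        + [(f"s{i}", "older") for i in range(230, 260)])
sample = rectangular_sample(pool, per_band=25)
print(Counter(band for _, band in sample))
```

Equal counts per band mean the easiest and hardest items are answered by about as many relevant subjects as the mid-range items.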
[cont. next slide]
Prepare and evaluate “calibration” version of the test. Then develop “norming” test
Enter calibration data and run IRT/Rasch (WINSTEPS; BIGSTEPS)
• [Note – sample Rasch output will be displayed later]
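For orientation, the dichotomous Rasch model that programs of this kind fit can be written as below. The symbols are standard notation introduced here rather than taken from the slides: \theta_n is the ability of person n and b_i is the difficulty of item i.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

When a person's ability equals an item's difficulty, the probability of a correct response is 0.5. Persons and items sit on the same logit scale, which is what allows items to be ordered like marks on a ruler.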
Revise calibration test as indicated based on Rasch findings
• Drop, add, possibly re-order items
• Improve instructions and item scoring keys
Have an expert panel review items for potential bias
Prepare “norming” test form
[Note: There is also a “norming-calibration” process that will not be covered during this broad-stroke presentation. If this process is used, one would prepare the norming-calibration form of the test at this stage.]
[Figure: Theoretical domain (CHC): g and the broad abilities Gf, Gc, Gv, Glr, Gs, Gsm and Ga, mapped onto the measurement or empirical domain. Cylinders = broad CHC abilities; circles = narrow CHC abilities.]
The process just described occurs for all the tests you are developing
Specification of test design blueprint is critically important
Development of operational definition of constructs to be measured is important. Should flow from theory and research
You don’t always need to re-invent the wheel when deciding on item format, type, etc. Lots of existing resources are often available for ideas
Rectangular distributions are critical during item development and try out phases. Pay extra attention to the development of very “easy” and very “difficult” items
Don’t grow too attached to your items or tests. Some will not survive.
IRT/Rasch analysis, although made easy with contemporary software, is not for novices/rookies. You need to know, or have a consultant who knows, how these programs work.
• Common serious error by novices (one example): Not leaving “blanks” for items not administered below the basal or above the ceiling. This makes it all but impossible to know how items are “behaving.” IRT technology/software knows how to handle the blanks. DON’T impute 1’s below the basal and 0’s above the ceiling. It makes it all but impossible to then construct a test from the pool of items. 1/0 imputation can be done later (in new separate files) for internal consistency reliability calculations or special Rasch-based reasons.
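A small sketch of why this matters, using made-up responses: a hard item that was administered only to the few subjects who reached it looks very different once 0's are imputed for everyone who stopped before it. Blank (not administered) responses are represented as None, and the helper name is hypothetical.

```python
def proportion_correct(scores):
    """Proportion correct among administered responses; None = blank."""
    taken = [s for s in scores if s is not None]
    return sum(taken) / len(taken) if taken else None

# Hypothetical hard item: 10 subjects, only 4 reached it
# (it was above the ceiling for the other 6).
with_blanks = [None, None, None, None, None, None, 1, 0, 1, 1]

# The novice error: fill every blank above the ceiling with 0.
imputed = [0 if s is None else s for s in with_blanks]

print(proportion_correct(with_blanks))  # 0.75
print(proportion_correct(imputed))      # 0.3
```

With blanks, the item is judged only on the subjects who actually attempted it. With imputed 0's, six responses that were never observed are treated as failures, so the item's apparent behavior no longer reflects the data collected.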
You should develop 2-3 times as many items as you will need in the final test
Spend considerable time and effort in developing concise standardized examiner instructions. Edit, revise, edit, revise, etc.
More small-sample item tryouts, followed by iterative revisions and new item development, are better than fewer.
Test authors must be intimately involved in item development and initial item try-outs
Test design engineering (Woodcock, 1970, 1975, 1999, 2008)
Test design engineering section will be part of a future “advanced module”
End of Part B
Additional steps in test development process will be presented in subsequent modules as they are developed