1
Class 3
Classical Methods of Scale Construction
October 13, 2005
Anita L. Stewart
Institute for Health & Aging
University of California, San Francisco
2
Readings and Homework
Homework as stated in syllabus is for the following week
Readings are relevant to the current week
3
Overview of class
Types of measurement scales
Rationale for multi-item measures
Scale construction methods
Error concepts
4
Types of Measurement Scales
Categorical (nominal)
– Classification
– Numbers are labels for categories
Continuous (along a continuum)
– Ordinal
– Interval
– Ratio
5
Classification vs. Continuous Scores
CES-D continuous score
– 20 items summed using Likert scaling methods
– Range of sum is 0-60, used as continuous score in correlational studies
CES-D classification score
– Those scoring 16 or higher are “classified” as having likely depression
» Referred for further screening
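The two uses of the CES-D can be sketched in a few lines. This is a minimal illustration, not the official scoring algorithm: it assumes each of the 20 items is already coded 0-3 in the same direction (which is what a 0-60 range implies), and the function names and the example respondent are hypothetical.

```python
# Sketch: CES-D continuous score and classification score.
# Assumption: 20 items, each coded 0-3, all already scored so that
# a higher number means more depressive symptoms.
CUTOFF = 16  # cutoff for "likely depression" stated on the slide


def cesd_score(items):
    """Continuous score: simple sum of the 20 item responses (0-60)."""
    if len(items) != 20:
        raise ValueError("expected 20 item responses")
    if any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("each response must be 0, 1, 2, or 3")
    return sum(items)


def cesd_classification(score):
    """Classification score: True if the sum is 16 or higher."""
    return score >= CUTOFF


# Hypothetical respondent: twelve items answered 1, eight answered 0.
responses = [1] * 12 + [0] * 8
total = cesd_score(responses)
likely_depression = cesd_classification(total)
print(total, likely_depression)
```

The same sum serves both purposes: it is kept as a number for correlational work and compared with the cutoff for screening.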
6
Categorical (Nominal) Scales/Measures
Primary language
1 Spanish
2 English
3 Other
Can you walk without help?
1 Yes
2 No
Numbers have no inherent meaning
7
Ordinal Scales: Numbers Reflect Increasing Level
Change in health:
1 Better
2 No change
3 Worse
Income:
1 < $10,000
2 $10,000 - <$20,000
3 $20,000 - <$30,000
4 ≥ $30,000
Numbers have no inherent meaning other than “more” or “less.”
8
Another Example of Ordinal Scale
How much pain did you have this past week?
1 None
2 Very mild
3 Mild
4 Moderate
5 Severe
6 Very severe
9
Feature of Ordinal Scales
Distances between numbers are unknown and probably vary
– Some are closer together in meaning than others
When ordinal responses indicate extent of agreement (agree, disagree)
– Referred to as a Likert scale
The term “Likert scale” has since come to have other meanings in health measurement
10
Interval Scales
Numbers have equal intervals
A unit change is constant across the scale
Example: temperature
– Can add and subtract scores
– A 2-unit change is the same at lower temperatures as at higher temperatures
11
Ratio Scale
Has a meaningful zero point
Change scores have specific meaning, and scores can be multiplied
– e.g., one score can be 2 or 3 times another
Examples
– Weight in pounds
– Income in dollars
– Number of visits
12
Types of Measurement Scales and Their Properties
Property of numbers
Type of scale   Rank order   Equal interval   Absolute zero
Nominal         No           No               No
Ordinal         Yes          No               No
Interval        Yes          Yes              No
Ratio           Yes          Yes              Yes
13
Overview of class
Types of measurement scales
Rationale for multi-item measures
Scale construction methods
Error concepts
14
Single- and Multi-Item Measures
Advantages of single items
– Response choices are interpretable
Disadvantages
– Numbers are not easily interpretable
– Limited variability
» Easy to get skewed distributions
– Reliability is usually low
– Difficult to assess a complex concept with one item
15
Interpretability of “Numbers” in Single Item Ordinal Scale
How much pain did you have this past week?
1 - none
2 – very mild
3 - mild
4 - moderate
5 - severe
6 – very severe
16
Interpretability of “Numbers” in Single Item Ordinal Scale
How much pain did you have this past week?
1 - none
2 – very mild
3 - mild
4 - moderate
5 - severe
6 – very severe
Is “very severe” twice as painful as “mild”?
17
Estimated Distance Between Levels in Ordinal Scale (N=2,928) (0-100 scale)
How much pain did you have this past week?
Response          0-100 transform   Mean pain scale score
1 - none          0                 3.30
2 - very mild     20                12.19
3 - mild          40                21.89
4 - moderate      60                38.76
5 - severe        80                59.43
6 - very severe   100               75.38
18
Distance Between Levels in an Ordinal Scale (N=2,928)
How much pain did you have this past week?
Response          Mean: pain scale   Distance from previous level
1 - none          3.30
2 - very mild     12.19              9
3 - mild          21.89              10
4 - moderate      38.76              17
5 - severe        59.43              20
6 - very severe   75.38              16
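The distances between adjacent response levels can be recomputed directly from the means. A minimal sketch, using only the six mean pain scale scores reported on the slide; the function name is an assumption made for illustration.

```python
# Sketch: distance between adjacent levels of an ordinal item,
# computed from the mean pain scale score at each response level.
# The means are the values reported on the slide (N=2,928).
level_means = [
    ("none", 3.30),
    ("very mild", 12.19),
    ("mild", 21.89),
    ("moderate", 38.76),
    ("severe", 59.43),
    ("very severe", 75.38),
]


def adjacent_gaps(means):
    """Return (lower label, higher label, difference in means) per adjacent pair."""
    return [
        (lo_label, hi_label, round(hi - lo, 2))
        for (lo_label, lo), (hi_label, hi) in zip(means, means[1:])
    ]


gaps = adjacent_gaps(level_means)
for lo_label, hi_label, gap in gaps:
    print(f"{lo_label} -> {hi_label}: {gap:.2f}")
```

If the response levels were equally spaced, every gap would be the same size. Unequal gaps are the reason the numbers 1-6 cannot be read as an interval scale.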
19
Distance Between Levels: “In general, how would you rate your health?”
Mean: current health scale
Response        0-100 transform   Screening (N=~11,000)   Baseline (N=3,054)
1 - poor        0                 10.8                    10.8
2 - fair        25                30.0                    30.6
3 - good        50                57.6                    55.9
4 - very good   75                75.5                    75.4
5 - excellent   100               87.9                    86.9
20
Distance Between Levels: “In general, how would you rate your health?”
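Mean: current health scale
Response        Screening (N=~11,000)   Baseline (N=3,054)
1 - poor        10.8                    10.8
2 - fair        30.0                    30.6
3 - good        57.6                    55.9
4 - very good   75.5                    75.4
5 - excellent   87.9                    86.9
Approximate distances between adjacent levels: poor to fair 20, fair to good 26, good to very good 18, very good to excellent 11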
21
Multi-Item Measures or Scales
Multi-item measures are created by combining two or more items into an overall measure or scale score
22
Advantages of Multi-item measures
More scale values (enhances sensitivity)
Improves score distribution (more normal)
Reduces number of variables needed to measure one concept
Improves reliability (reduces random error)
Can estimate a score if some items are missing
Enriches the concept being measured (more valid)
23
Overview of class
Types of measurement scales
Rationale for multi-item measures
Scale construction methods
Error concepts
24
Types of Scale Construction
Summated ratings scales
– Likert scaling
Utility weighting or preference-based measures (econometric scales)
Guttman scaling
Thurstone scales
Many others
25
Example of a 2-item Summated Ratings Scale
How much of the time .... tired?
1 - All of the time
2 - Most of the time
3 - Some of the time
4 - A little of the time
5 - None of the time
How much of the time…. full of energy?
1 - All of the time
2 - Most of the time
3 - Some of the time
4 - A little of the time
5 - None of the time
26
Step 1: Reverse One Item So They Are All in the Same Direction
How much of the time .... tired?
1 - All of the time
2 - Most of the time
3 - Some of the time
4 - A little of the time
5 - None of the time
How much of the time…. full of energy?
1=5 All of the time
2=4 Most of the time
3=3 Some of the time
4=2 A little of the time
5=1 None of the time
Reverse “energy” item so high score = more energy
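Reverse scoring on a 1-5 response scale amounts to subtracting the response from 6 (the lowest code plus the highest). A minimal sketch; the function name and its `low`/`high` parameters are assumptions made for illustration.

```python
# Sketch: reverse-score an item so that all items run in the same direction.
# For a 1-5 response scale the reversed value is (1 + 5) - response.
def reverse_score(response, low=1, high=5):
    """Map low->high, ..., high->low (for 1-5: 1->5, 2->4, 3->3, 4->2, 5->1)."""
    if not low <= response <= high:
        raise ValueError("response is outside the response scale")
    return low + high - response


# "Full of energy" item: 1 = all of the time ... 5 = none of the time.
# After reversal, a high score means more energy.
recoded = {raw: reverse_score(raw) for raw in range(1, 6)}
print(recoded)
```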
27
Step 2: Sum the Two Items
How much of the time .... tired?
1 - All of the time
2 - Most of the time
3 - Some of the time
4 - A little of the time
5 - None of the time
How much of the time…. full of energy?
5 All of the time
4 Most of the time
3 Some of the time
2 A little of the time
1 None of the time
Highest = 10 (tired none of the time, full of energy all of the time)
Lowest = 2 (tired all of the time, full of energy none of the time)
28
Step 2 (Alternative): Average the Two Items
How much of the time .... tired?
1 - All of the time
2 - Most of the time
3 - Some of the time
4 - A little of the time
5 - None of the time
How much of the time…. full of energy?
5 All of the time
4 Most of the time
3 Some of the time
2 A little of the time
1 None of the time
Highest = 5.0 (tired none of the time, full of energy all of the time)
Lowest = 1.0 (tired all of the time, full of energy none of the time)
29
Summed or Averaged: Increase Number of Levels from 5 to 9
Summed Averaged
2 1.0
3 1.5
4 2.0
5 2.5
6 3.0
7 3.5
8 4.0
9 4.5
10 5.0
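The nine levels in the table come from enumerating every combination of two 1-5 items. A minimal sketch of the scoring for the two-item example (tired, full of energy); the function name is hypothetical, and the energy item is reversed inside the function so that a high score means more energy.

```python
# Sketch: score a 2-item summated ratings scale, as a sum or as an average.
from itertools import product


def scale_scores(tired, energy_raw):
    """Return (summed score, averaged score) for one respondent.

    tired:      1 = all of the time ... 5 = none of the time
    energy_raw: 1 = all of the time ... 5 = none of the time (reversed below)
    """
    energy = 6 - energy_raw  # reverse so that a high score means more energy
    total = tired + energy
    return total, total / 2


# Every possible pair of item scores gives only nine distinct scale values.
sums = sorted({a + b for a, b in product(range(1, 6), repeat=2)})
averages = [s / 2 for s in sums]
print(sums)
print(averages)
```

Summing and averaging rank respondents identically. The average simply keeps the score on the original 1-5 metric of the items.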
30
Summated Scales: Scaling Analyses
To create a summated scale, one first needs to test whether a set of items that appear to measure the same concept can be combined
– Need to test the hypothesis that the items do indeed belong together to form a single concept
Five criteria need to be met to combine items into a summated scale
31
Five Criteria to Meet to Qualify as a Summated Scale
Item convergence
Item discrimination
No unhypothesized dimensions
Items contribute similar proportion of information to score
Items have equal variances
32
First Criterion: Item Convergence
Each item correlates substantially with the total score of all items
– With the item taken out, or “corrected for overlap”
Typical criterion is >= .30
– For well-developed scales, often set at >= .40
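An item-scale correlation "corrected for overlap" correlates each item with the sum of the other items, leaving the item itself out of the total. The sketch below is a plain-Python illustration rather than the output of any particular statistics package; the response data are hypothetical, and the function names are assumptions.

```python
# Sketch: item-scale correlations corrected for overlap.
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(data):
    """data: one list of item scores per respondent.

    For each item, correlate it with the sum of all the OTHER items.
    """
    n_items = len(data[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in data]
        rest = [sum(row) - row[j] for row in data]  # total with item j removed
        correlations.append(pearson(item, rest))
    return correlations


def weak_items(correlations, criterion=0.30):
    """Positions of items whose corrected correlation falls below the criterion."""
    return [j for j, r in enumerate(correlations) if r < criterion]


# Hypothetical responses: 5 respondents x 4 items.
data = [
    [1, 2, 1, 3],
    [2, 2, 3, 1],
    [3, 4, 3, 2],
    [4, 4, 5, 3],
    [5, 5, 4, 1],
]
correlations = corrected_item_total(data)
print([round(r, 2) for r in correlations])
print(weak_items(correlations))
```

Raising `criterion` to 0.40 applies the stricter standard mentioned for well-developed scales.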
33
Example: Analyzing Convergent Validity for Adaptive Coping Scale
Item-scale correlations
Adaptive coping (alpha = .70)
5 Get emotional support from others .49
11 See it in a different light .62
18 Accept the reality of it .25
20 Find comfort in religion .58
13 Get comfort from someone .45
21 Learn to live with it .21
23 Pray or meditate .39
Moody-Ayers SY et al. Prevalence and correlates of perceived societal racism in older African American adults with type 2 diabetes mellitus. J Amer Geriatr Soc, 2005, in press.
34
Example: Analyzing Convergent Validity for Adaptive Coping Scale
Item-scale correlations
Adaptive coping (alpha = .70)
5 Get emotional support from others .49
11 See it in a different light .62
18 Accept the reality of it .25 <.30
20 Find comfort in religion .58
13 Get comfort from someone .45
21 Learn to live with it .21 <.30
23 Pray or meditate .39
35
Example: Analyzing Convergent Validity for Adaptive Coping Scale
Item-scale correlations
Adaptive coping (alpha = .76)
5 Get emotional support from others .45
11 See it in a different light .59
20 Find comfort in religion .73
13 Get comfort from someone .45
23 Pray or meditate .51
Acceptance (alpha = .67)
21 Learn to live with it .50
18 Accept the reality of it .50
36
SAS/SPSS Make Item Convergence Analysis Easy
Reliability programs provide this
– Item-scale correlations corrected for overlap
– Internal consistency reliability (coefficient alpha)
– Reliability with each item removed
» To see the effect of removing a bad item
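Coefficient alpha, and alpha with each item removed, can also be computed by hand. This is a minimal sketch of the standard coefficient alpha formula, k/(k-1) × (1 − sum of item variances / variance of the total score); it does not reproduce SAS or SPSS output, and the function names and data are hypothetical.

```python
# Sketch: coefficient alpha and "alpha if item removed".
from statistics import variance  # sample variance (n - 1 denominator)


def cronbach_alpha(data):
    """data: one list of item scores per respondent (at least 2 items)."""
    k = len(data[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance([row[j] for row in data]) for j in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_removed(data):
    """Alpha recomputed with each item left out in turn."""
    k = len(data[0])
    return [
        cronbach_alpha([row[:j] + row[j + 1:] for row in data])
        for j in range(k)
    ]


# Hypothetical responses: 5 respondents x 4 items.
data = [
    [1, 2, 1, 3],
    [2, 2, 3, 1],
    [3, 4, 3, 2],
    [4, 4, 5, 3],
    [5, 5, 4, 1],
]
print(round(cronbach_alpha(data), 2))
print([round(a, 2) for a in alpha_if_item_removed(data)])
```

An item whose removal raises alpha is a candidate "bad item", which is the pattern in the adaptive coping example: alpha rose from .70 to .76 once the two low-correlating items were moved to their own scale.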
37
Second Criterion: Item Discrimination
Each item correlates significantly higher with the construct it is hypothesized to measure than with other constructs
– Item discrimination
Statistical significance is determined by the standard error of the correlation
– Determined by sample size
38
Multitrait Scaling - An Approach to Constructing Multi-item Scales
Confirms whether hypothesized item groupings can be summed into a scale score
Examines extent to which all five criteria are met
Examines resulting scales
39
Example: Two Subscales Being Developed
Depression and Anxiety subscales of MOS Psychological Distress measure
40
Example of Multitrait Scaling Matrix: Hypothesized Scales
                      ANXIETY   DEPRESSION
ANXIETY
Nervous person        .80       .65
Tense, high strung    .83       .70
Anxious, worried      .78       .78
Restless, fidgety     .76       .68
DEPRESSION
Low spirits           .75       .89
Downhearted           .74       .88
Depressed             .76       .90
Moody                 .77       .82
41
Example of Multitrait Scaling Matrix: Item Convergence
                      ANXIETY   DEPRESSION
ANXIETY
Nervous person        .80*      .65
Tense, high strung    .83*      .70
Anxious, worried      .78*      .78
Restless, fidgety     .76*      .68
DEPRESSION
Low spirits           .75       .89*
Downhearted           .74       .88*
Depressed             .76       .90*
Moody                 .77       .82*
43
Example of Multitrait Scaling Matrix: Item Discrimination
                      ANXIETY   DEPRESSION
ANXIETY
Nervous person        .80*      .65
Tense, high strung    .83*      .70
Anxious, worried      .78*      .78
Restless, fidgety     .76*      .68
DEPRESSION
Low spirits           .75       .89*
Downhearted           .74       .88*
Depressed             .76       .90*
Moody                 .77       .82*
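The item discrimination check can be run on the matrix above: each item should correlate higher with its own scale than with the other scale, by more than sampling error allows. A minimal sketch under stated assumptions: the slides do not give the sample size or the exact test, so `n` is a hypothetical parameter, the standard error of a correlation is approximated as 1/sqrt(n), and "significantly higher" is taken to mean a difference of more than two standard errors.

```python
# Sketch: item discrimination test on a multitrait scaling matrix.
from math import sqrt

# item -> (hypothesized scale, correlation with each scale), from the slide
matrix = {
    "Nervous person":     ("ANXIETY",    {"ANXIETY": 0.80, "DEPRESSION": 0.65}),
    "Tense, high strung": ("ANXIETY",    {"ANXIETY": 0.83, "DEPRESSION": 0.70}),
    "Anxious, worried":   ("ANXIETY",    {"ANXIETY": 0.78, "DEPRESSION": 0.78}),
    "Restless, fidgety":  ("ANXIETY",    {"ANXIETY": 0.76, "DEPRESSION": 0.68}),
    "Low spirits":        ("DEPRESSION", {"ANXIETY": 0.75, "DEPRESSION": 0.89}),
    "Downhearted":        ("DEPRESSION", {"ANXIETY": 0.74, "DEPRESSION": 0.88}),
    "Depressed":          ("DEPRESSION", {"ANXIETY": 0.76, "DEPRESSION": 0.90}),
    "Moody":              ("DEPRESSION", {"ANXIETY": 0.77, "DEPRESSION": 0.82}),
}


def discrimination_failures(matrix, n):
    """Items whose own-scale correlation does not exceed every other-scale
    correlation by more than 2 standard errors (SE approximated as 1/sqrt(n))."""
    margin = 2 / sqrt(n)
    failures = []
    for item, (own_scale, correlations) in matrix.items():
        own = correlations[own_scale]
        others = [r for scale, r in correlations.items() if scale != own_scale]
        if any(own - other <= margin for other in others):
            failures.append(item)
    return failures


# n is NOT given on the slides; 1000 is a hypothetical sample size.
print(discrimination_failures(matrix, n=1000))
```

"Anxious, worried" fails at any sample size because it correlates equally (.78) with both scales. Items with small differences pass or fail depending on `n`, which is why the slide ties significance to sample size.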
44
Preference Based or Utility Measures
Utilities are numeric measurements that reflect the desirability people associate with a health state or condition
– Value of that health state
– Preference for that health state (rather than another)
45
Methods for Assigning Values?
Four steps:
– Identify the population of judges who will assign “preferences”
– Sample and describe health states to be assigned utilities
– Select a preference measurement method
– Collect preference judgments, analyze the data, and assign weights to the health states
46
Preference Based or Utility Measures (cont.)
Advantages
– Combine complex health states into a single number
Score reflects the value or preference for the overall health state
Need two absolute reference points
– 0 represents death
– 1 represents perfect health
Methods for obtaining value weights
– Time tradeoff, standard gamble, rating scales
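How a judgment becomes a weight on the 0-1 scale can be illustrated for two of the named methods. The slides do not spell out the calculations, so this sketch uses the textbook definitions as an assumption: in a time tradeoff, the utility is the number of years in perfect health judged equivalent to a longer time in the health state, divided by that longer time; in a standard gamble, it is the probability of perfect health (versus death) at which the judge is indifferent between the gamble and the health state for certain. Function names and example values are hypothetical.

```python
# Sketch: turning preference judgments into 0-1 utility weights.
# 0 represents death and 1 represents perfect health, as on the slide.


def time_tradeoff_utility(years_full_health, years_in_state):
    """Utility = x / t, where x years in perfect health are judged
    equivalent to t years in the health state."""
    if years_in_state <= 0 or not 0 <= years_full_health <= years_in_state:
        raise ValueError("need 0 <= years_full_health <= years_in_state, t > 0")
    return years_full_health / years_in_state


def standard_gamble_utility(p_indifference):
    """Utility = probability of perfect health at the indifference point."""
    if not 0 <= p_indifference <= 1:
        raise ValueError("probability must be between 0 and 1")
    return p_indifference


# Hypothetical judgments for one health state.
tto = time_tradeoff_utility(7, 10)   # 7 healthy years ~ 10 years in the state
sg = standard_gamble_utility(0.85)   # indifferent at an 85% chance of perfect health
print(tto, sg)
```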
47
Readings on Utility Measurement
A huge literature
Some readings available on request
48
Overview
Types of measurement scales
Rationale for multi-item measures
Scale construction methods
Error concepts
49
Concepts of Error
How to depict error
Distinction between random error and systematic error
50
Components of an Individual’s Observed Item Score
(NOTE: Simplistic view)
Observed item score = true score + error
51
Components of Variability in Item Scores of a Group of Individuals
Observed score variance = true score variance + error variance
Observed score variance is the total variance (the variation across all observed item scores)
52
Combining Items into Multi-Item Scales
When items are combined into a scale score, error cancels out to some extent
– Error variance is reduced as more items are combined
– As you reduce random error, the proportion of “true score” in the observed score increases
– Multi-item scale is thus more reliable than any single item
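The cancelling of random error can be made concrete with the variance decomposition. This is a minimal sketch under simplifying assumptions that are not stated on the slides: every item measures the same true score, every item has the same error variance, and the random errors are independent, so averaging k items divides the error variance by k. The function name and the variance values are hypothetical.

```python
# Sketch: reliability of a k-item averaged scale under the simple model
# observed = true score + random error.
# Assumptions: items share one true score, have equal error variance,
# and their errors are independent.


def scale_reliability(true_var, error_var, k):
    """Share of observed variance that is true-score variance
    when k items are averaged (error variance shrinks to error_var / k)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return true_var / (true_var + error_var / k)


# Hypothetical single item: half true-score variance, half error variance.
for k in (1, 2, 4, 9):
    print(k, round(scale_reliability(1.0, 1.0, k), 2))
```

Reliability rises with each added item but by less each time, because the remaining error variance being divided is already small.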
53
Sources of Error
Subjects
Observers or interviewers
Measure or instrument
54
Measuring Weight in Pounds of Children: Weight without shoes
An observed score is a linear combination of many sources of variation for an individual
55
Measuring Weight in Pounds of Children: Weight without shoes
Observed weight =
  True weight
  + Amount of water past 30 min
  + Weight of clothes
  + Scale is miscalibrated
  + Person weighing children is not very precise
56
Measuring Weight in Pounds of Children: Weight without shoes
Observed weight (83 lbs) =
  True weight (80 lbs)
  + Amount of water past 30 min (+.25 lb)
  + Weight of clothes (+.75 lb)
  + Scale is miscalibrated (+1 lb)
  + Person weighing children is not very precise (+1 lb)
83 = 80 + .25 + .75 + 1 + 1
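The arithmetic of the weight example, written out as a linear combination. A minimal sketch using the values from the slide; the variable names are assumptions.

```python
# Sketch: observed weight = true weight + sum of the error components (lbs).
true_weight = 80.0
error_sources = {
    "amount of water in past 30 min": 0.25,
    "weight of clothes": 0.75,
    "scale is miscalibrated": 1.0,
    "person weighing children is not very precise": 1.0,
}

total_error = sum(error_sources.values())
observed_weight = true_weight + total_error
print(observed_weight, total_error)
```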
57
Sources of Error
Weight of clothes
– Subject source of error
Person weighing child is not precise
– Observer source of error
Scale is miscalibrated
– Instrument source of error
58
Measuring Depressive Symptoms in Asian and Latino Men
Observed depression score =
  “True” depression
  + Low awareness of negative affect
  + Unwillingness to tell interviewer
  + Depression measure not culturally sensitive
59
Measuring Depressive Symptoms in Asian and Latino Men
Observed depression score (13) =
  “True” depression (16)
  + Hard to choose one number on the 1-6 response choices (+2)
  + Unwillingness to tell interviewer (-3)
  + Measure not culturally sensitive (-2)
13 = 16 + 2 - 3 - 2
60
Return to Components of an Individual’s Observed Item Score
Observed item score = true score + error
61
Components of an Individual’s Observed Item Score
Observed item score = true score + error
Error = random error + systematic error
62
Sources of Error in Measuring Weight
Weight of clothes
– Subject source of random error
Scale is miscalibrated
– Instrument source of systematic error
Person weighing child is not precise
– Observer source of random error
63
Sources of Error in Measuring Depression
Hard to choose one number on 1-6 response scale
– Subject source of random error
Unwillingness to tell interviewer
– Subject source of systematic error (underreporting true depression)
Instrument is not culturally sensitive (missing some components)
– Instrument source of systematic error
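The depression example can be split into its random and systematic parts. A minimal sketch using the values from the earlier depression slide (true score 16, errors of +2, -3, and -2) and the classification on this slide; the data layout and function name are assumptions.

```python
# Sketch: observed score = true score + random error + systematic error.
true_score = 16

# (description, source, kind of error, size)
errors = [
    ("hard to choose one number on the 1-6 response scale", "subject", "random", 2),
    ("unwillingness to tell interviewer", "subject", "systematic", -3),
    ("instrument is not culturally sensitive", "instrument", "systematic", -2),
]


def total_by_kind(errors, kind):
    """Sum the error sizes of one kind ('random' or 'systematic')."""
    return sum(size for _, _, error_kind, size in errors if error_kind == kind)


random_error = total_by_kind(errors, "random")
systematic_error = total_by_kind(errors, "systematic")
observed_score = true_score + random_error + systematic_error
print(random_error, systematic_error, observed_score)
```

In this example both systematic errors push in the same direction (underreporting), while the single random error happens to be positive for this individual.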
64
Next Week – Week 4
Variability
Reliability
Interpretability
65
Homework for Week 4
Complete rows 1-12 on the matrix for each measure you want to review
– Handout
– On the web site for this class
Download the matrix and fill it in