
1

Class 4

Psychometric Characteristics Part I: Sources of Error, Variability, Reliability, Interpretability

October 12, 2006

Anita L. Stewart, Institute for Health & Aging

University of California, San Francisco

2

Overview of Class 4

Concepts of error Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

3

Components of an Individual’s Observed Item Score

(NOTE: Simplistic view)

Observed item score = true score + error

4

Components of Variability in Item Scores of a Group of Individuals

Observed score variance (total variance) = true score variance + error variance

(Total variance is the variation summed across all observed item scores)

5

Combining Items into Multi-Item Scales

When items are combined into a scale score, error cancels out to some extent
– Error variance is reduced as more items are combined
– As random error is reduced, the amount of “true score” variance increases
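The cancellation described above can be illustrated with a toy simulation (not from the slides; the true score, item count, and error SD are arbitrary assumptions): averaging k items, each contaminated with independent random error, shrinks the error variance to roughly 1/k of a single item's.

```python
# Toy illustration (hypothetical values): each item = true score + random error.
# Averaging k such items cuts the error variance to roughly 1/k of one item's.
import random

random.seed(0)

def observed_scale_score(true_score, k, error_sd=1.0):
    # Mean of k items, each carrying independent random error
    return sum(true_score + random.gauss(0, error_sd) for _ in range(k)) / k

def error_variance(k, n=20000):
    # Empirical variance of (observed - true) across n simulated respondents
    true_score = 5.0
    errs = [observed_scale_score(true_score, k) - true_score for _ in range(n)]
    mean_err = sum(errs) / n
    return sum((e - mean_err) ** 2 for e in errs) / n

print(error_variance(1))   # close to 1.0
print(error_variance(10))  # close to 0.1
```

As the error variance shrinks, the share of observed-score variance that is "true score" variance grows, which is the point of the slide.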

6

Sources of Error

Subjects
Observers or interviewers
Measure or instrument

7

Measuring Weight in Pounds of Children: Weight without shoes

An observed score is a linear combination of many sources of variation for an individual

8

Measuring Weight in Pounds of Children: Weight without shoes

Observed weight = true weight + weight of clothes + amount of water drunk in past 30 min + (person weighing children is not very precise) + (scale is miscalibrated)

9

Measuring Weight in Pounds of Children: Weight without shoes

Observed weight (83 lbs) = true weight (80 lbs) + amount of water drunk in past 30 min (+.25 lb) + weight of clothes (+.75 lb) + person weighing children is not very precise (+1 lb) + scale is miscalibrated (+1 lb)

83 = 80 + .25 + .75 + 1 + 1
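The slide's arithmetic can be written out directly; a minimal sketch using the slide's own numbers, with each error component labeled by its source:

```python
# The slide's decomposition: observed score = true score + error components
true_weight = 80.0  # lbs
errors = {
    "water drunk in past 30 min": 0.25,  # subject source
    "weight of clothes": 0.75,           # subject source
    "imprecise observer": 1.0,           # observer source
    "miscalibrated scale": 1.0,          # instrument source
}
observed_weight = true_weight + sum(errors.values())
print(observed_weight)  # 83.0
```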

10

Sources of Error

Weight of clothes – subject source of error
Person weighing child is not precise – observer source of error
Scale is miscalibrated – instrument source of error

11

Measuring Depressive Symptoms in Asian and Latino Men

Observed depression score = “true” depression + hard to choose one number on the 1-6 response choice scale + unwillingness to tell interviewer + measure not culturally sensitive

12

Measuring Depressive Symptoms in Asian and Latino Men

Observed depression score (13) = “true” depression (16) + hard to choose one number on the 1-6 response scale (+2) + unwillingness to tell interviewer (-3) + measure not culturally sensitive (-2)

13 = 16 + 2 - 3 - 2

13

Return to Components of an Individual’s Observed Item Score

Observed item score = true score + error

14

Components of an Individual’s Observed Item Score

Observed item score = true score + error (random + systematic)

15

Sources of Error in Measuring Weight

Weight of clothes – subject source of random error
Scale is miscalibrated – instrument source of systematic error
Person weighing child is not precise – observer source of random error

16

Sources of Error in Measuring Depression

Hard to choose one number on 1-6 response scale – subject source of random error
Unwillingness to tell interviewer – subject source of systematic error (underreporting true depression)
Instrument is not culturally sensitive (missing some components) – instrument source of systematic error

17

Memory Errors – From Cognitive Psychology

Error remembering “when” and “how often” something occurred within some time frame
Memory and emotion – tend to remember:
– positive more than negative experiences
– more emotionally intense than neutral experiences
Memory for threatening, sensitive events is more error prone than for non-threatening events

AA Stone et al. (eds), The Science of Self-Report. London: Lawrence Erlbaum, 2000.

18

Overview

Concepts of error Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

19

Variability

Good variability
– All (or nearly all) scale levels are represented
– Distribution approximates a bell-shaped normal
Variability is a function of the sample
– Need to understand variability of the measure of interest in a sample similar to the one you are studying
Review criteria
– Adequate variability in a range that is relevant to your study

20

Common Indicators of Variability

Range of scores (possible, observed)
Mean, median, mode
Standard deviation (standard error)
Skewness
% at floor (lowest score)
% at ceiling (highest score)

21

Range of Scores

Especially important for multi-item measures
Possible and observed
Example of difference:
– CES-D possible range is 0-30
– Wong et al. study of mothers of young children: observed range was 0-23
» missing entire high end of the distribution (none had high levels of depression)

22

Mean, Median, Mode

Mean - average
Median - midpoint
Mode - most frequent score
In normally distributed measures, these are all the same
In non-normal distributions, they will vary

23

Mean and Standard Deviation

Most information on variability is from the mean and standard deviation
– Can envision how scores are distributed over the possible range

24

Normal Distributions(Or Approximately Normal)

Mean and SD tell the entire story of the distribution
± 1 SD on each side of the mean covers about 68% of the scores

25

Skewness

Positive skew - scores bunched at low end, long tail to the right
Negative skew - opposite pattern
Coefficient ranges from negative infinity to positive infinity
– the closer to zero, the more normal
Test whether the skewness coefficient is significantly different from zero
– thus depends on sample size
Coefficients beyond ±2.0 are cause for concern
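For readers who want to compute the coefficient themselves, here is a minimal moment-based skewness sketch (the standard third-standardized-moment formula, not specific to these slides; the example scores are made up):

```python
def skewness(scores):
    # Third standardized moment: mean cubed deviation / SD cubed.
    # Positive -> long right tail; near zero -> roughly symmetric.
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    return sum((x - mean) ** 3 for x in scores) / n / sd ** 3

print(skewness([1, 1, 1, 2, 10]))  # positive: scores bunched low, tail to the right
print(skewness([1, 2, 3]))         # 0.0: symmetric
```

Statistical packages apply small-sample bias corrections, so their values differ slightly from this population-moment version.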

26

Skewed Distributions

Mean and SD are not as useful
– SD often extends beyond the maximum or minimum possible score

27

Ceiling and Floor Effects: Similar to Skewness Information

Ceiling effects: a substantial number of people get the highest possible score
Floor effects: the opposite
Not very meaningful for continuous scales
– there will usually be very few at either end
More helpful for single-item measures or coarse scales with only a few levels

28

… to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?

[Bar chart: % choosing each response - Not at all / Slightly / Moderately / Quite a bit / Extremely; y-axis 0-50%]

49% not limited at all (can’t improve)


30

SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)

              Physical   Role-      Mental   Vitality
              function   physical   health   (energy)
Range         0-100      0-100      0-100    0-100
Mean          80         75         71       54
SD            27         41         21       22
Skewness      -.99       -.26       -.83     -.24
% floor       <1         24         <1       <1
% ceiling     19         37         4        <1

McHorney C et al. Med Care. 1994;32:40-66.

31

SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)

Physicalfunction

Role-physical

Mental health

Vitality (energy)

0-100 0-100 0-100 0-100

Mean 80 75 71 54

SD 27 41 21 22

Skewness - .99 - .26 - .83 - .24

% floor < 1 <1 <1

% ceiling 19 4 <1

McHorney C et al. Med Care. 1994;32:40-66.

24

37

32

Reasons for Poor Variability

Low variability in construct being measured in that “sample” (true low variation)

Items not adequately tapping construct– If only one item, especially hard

Items not detecting important differences in construct at one or the other end of the continuum

Solutions if one is in the process of developing measures: add items

33

Advantages of multi-item scales revisited

Using multi-item scales minimizes likelihood of ceiling/floor effects

When items are skewed, multi-item scale “normalizes” the skew

34

Percent with Highest (Best) Score:MOS 5-Item Mental Health Index

Items (6 pt scale - all of the time to none of the time): – Very nervous person - 34% none of the time– Felt calm and peaceful - 4% all of the time– Felt downhearted and blue - 33% none of the time– Happy person - 10% all of the time– So down in the dumps nothing could cheer you up – 63%

none of the time Summated 5-item scale (0-100 scale)

– Only 5% had highest scoreStewart A. et al., MOS book, 1992

35

Overview

Concepts of error Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

36

Reliability

Extent to which an observed score is free of random error
– Produces the same score each time it is administered (all else being equal)
Population-specific; reliability increases with:
– sample size
– variability in scores (dispersion)
– a person’s level on the scale

37

Components of Variability in Item Scores of a Group of Individuals

Observed score variance (total variance) = true score variance + error variance

(Total variance is the variation summed across all observed item scores)

38

Reliability Depends on True Score Variance

Reliability is a group-level statistic
Reliability = 1 - (error variance / total variance)
Equivalently, reliability is the proportion of variance due to true score:
(true score variance) / (total variance)

39

Reliability Depends on True Score Variance

Reliability = (total variance - error variance) / total variance
            = proportion of variance due to true score

Example: .70 = (100% - 30%) / 100%
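The slide's arithmetic as a one-line function (a sketch; the variance units are arbitrary):

```python
def reliability(total_variance, error_variance):
    # Proportion of total variance due to true score
    return (total_variance - error_variance) / total_variance

print(reliability(100.0, 30.0))  # 0.7
```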

40

Reliability Depends on True Score Variance

Reliability of .70 means 30% of the variance in the observed score is explained by error

Reliability = (total variance - error variance) / total variance
            = proportion of variance due to true score

41

Importance of Reliability

Necessary for validity (but not sufficient)
– Low reliability attenuates correlations with other variables (harder to detect true correlations among variables)
– May conclude that two variables are not related when they are
Greater reliability, greater power
– Thus the more reliable your scales, the smaller the sample size you need to detect an association

42

Reliability Coefficient

Typically ranges from .00 to 1.00
Higher values indicate better reliability

43

How Do You Know if a Scale or Measure Has Adequate Reliability?

Adequacy of reliability is judged according to standard criteria
– Criteria depend on the type of coefficient

44

Types of Reliability Tests

Internal-consistency
Test-retest
Inter-rater
Intra-rater

45

Internal Consistency Reliability: Cronbach’s Alpha

Requires multiple items (all supposedly measuring the same construct) to calculate

Extent to which all items measure the same construct (same latent variable)

46

Internal-Consistency Reliability

For multi-item scales
Cronbach’s alpha
– for ordinal scales
Kuder-Richardson 20 (KR-20)
– for dichotomous items
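A minimal sketch of Cronbach's alpha from raw item scores, using the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of the total score); the example data are invented:

```python
def cronbach_alpha(items):
    # items: one inner list per item, aligned across respondents
    k = len(items)
    n = len(items[0])

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Total (summated) score per respondent
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_var_sum = sum(sample_var(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / sample_var(totals))

# Three perfectly parallel items -> alpha of ~1.0
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))
```

Note that alpha rises mechanically with the number of items, which is the caveat a later slide makes about very long scales.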

47

Minimum Standardsfor Internal Consistency Reliability

For group comparisons (e.g., regression, correlational analyses)
– .70 or above is minimum (Nunnally, 1978)
– .80 is optimal
– above .90 is unnecessary
For individual assessment (e.g., treatment decisions)
– .90 or above (.95) is preferred (Nunnally, 1978)

48

Internal-Consistency Reliability Can be Spurious

Based on only those who answered all questions in the measure
– If many people have trouble with the items and skip some, they are not included in the test of reliability

49

Internal-Consistency Reliability is a Function of Number of Items in Scale

Increases with the number of items
Very large scales (20 or more items) can have high reliability without other good scaling properties

50

Example: 20 item Beck Depression Inventory (BDI)

BDI 1978 version (past week)
– reliability .86

– 3 items correlated < .30 with other items in the scale

Beck AT et al. J Clin Psychol. 1984;40:1365-1367

51

Test-Retest Reliability

Repeat the assessment on individuals who are not expected to change
Time between assessments should be:
– Short enough so no change occurs
– Long enough so subjects don’t recall their first response
Coefficient is a correlation between the two measurements
For single-item measures, the only way to test reliability

52

Appropriate Test-Retest Coefficients by Type of Measure

Continuous scales (ratio or interval scales, multi-item Likert scales):
– Pearson
Ordinal or non-normally distributed scales:
– Spearman
– Kendall’s tau
Dichotomous (categorical) measures:
– Phi
– Kappa

53

Minimum Standards for Test-Retest Reliability

Significance of a test-retest correlation has NOTHING to do with the adequacy of the reliability
Criteria: similar to those for internal consistency
– >.70 is desirable
– >.80 is optimal

54

Observer or Rater Reliability

Inter-rater reliability (across two or more raters)
– Consistency (correlation) between two or more observers rating the same subjects (one point in time)
Intra-rater reliability (within one rater)
– A test-retest within one observer
– Correlation among repeated values obtained by the same observer (over time)

55

Observer or Rater Reliability

Sometimes Pearson correlations are used - correlate one observer with another
– Assesses association only
.65 to .95 are typical correlations
>.85 is considered acceptable

McDowell and Newell

56

Association vs. Agreement When Correlating Two Times or Ratings

Association is the degree to which one score linearly predicts the other score
Agreement is the extent to which the same score is obtained on the second measurement (retest, second observer)
Can have high correlation and poor agreement
– If the second score is consistently higher for all subjects, can obtain a high correlation
– Need a second test of mean differences

57

Hypothetical Scores on 4 Subjects by 2 Observers

[Chart: scores (1-7) for subjects S1-S4 as rated by each of 2 observers]

58

Example of Association and Agreement

Scores by observer 1 are exactly 2 points above scores by observer 2
– Correlation (association) would be perfect (r = 1.0)
– Agreement is poor (no agreement on any score - a difference of 2 between the scores on each subject)
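The observer example can be checked numerically; a sketch with made-up ratings in which observer 1 is always exactly 2 points higher:

```python
def pearson_r(x, y):
    # Pearson correlation: degree of linear association between two score lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

observer2 = [1, 3, 4, 6]                # hypothetical ratings of 4 subjects
observer1 = [s + 2 for s in observer2]  # always exactly 2 points higher

print(pearson_r(observer1, observer2))                    # ~1.0: perfect association
print(sum(a == b for a, b in zip(observer1, observer2)))  # 0: no exact agreements
```

This is why a high correlation alone cannot establish agreement; a test of the mean difference between raters is also needed.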

59

Intraclass Correlation Coefficient for Testing Inter-rater Reliability (Kappa)

Coefficient indicates the level of agreement of two or more judges, exceeding that which would be expected by chance
Appropriate for dichotomous (categorical) scales and ordinal scales
Several forms of kappa:
– e.g., Cohen’s kappa is for 2 judges, dichotomous scale
Sensitive to number of observations, distribution of data

60

Interpreting Kappa: Level of Reliability

<0.00       Poor
.00 - .20   Slight
.21 - .40   Fair
.41 - .60   Moderate
.61 - .80   Substantial
.81 - 1.00  Almost perfect

.60 or higher is acceptable (Landis, 1977)
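A minimal Cohen's kappa sketch for two raters (the standard observed-vs-chance agreement formula; the example ratings are invented):

```python
def cohens_kappa(rater1, rater2):
    # Agreement beyond chance for two raters over the same subjects
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    categories = set(rater1) | set(rater2)
    # Chance agreement: product of each rater's marginal proportions per category
    chance = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)
    return (observed - chance) / (1 - chance)

print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0: perfect agreement
print(cohens_kappa([1, 0, 1, 0], [0, 1, 0, 1]))  # -1.0: systematic disagreement
```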

61

Reliable Scale?

NO! There is no such thing as a “reliable” scale
We accumulate “evidence” of reliability in a variety of populations in which the scale has been tested

62

Reliability Often Poorer in Lower SES Groups

More random error due to:
– Reading problems
– Difficulty understanding complex questions
– Unfamiliarity with questionnaires and surveys

63

Advantages of multi-item scales revisited

Using multi-item scales improves reliability

Random error is “canceled out” across multiple items

64

Overview

Concepts of error Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

65

Interpretability of Scale Scores: What does a Score Mean?

Meaning of scores
What are the endpoints?
Direction of scoring - what does a high score mean?
Compared to norms - is the score average, low, or high compared to norms?

Single items: more easily interpretable
Multi-item scales: no inherent meaning to scores

66

Endpoints

What is the minimum and maximum possible?
– To enable interpretation of the mean score
Endpoints of summated scales depend on number of items & number of response choices
– 5 items, 4 response choices = 5 - 20
– 3 items, 5 response choices = 3 - 15
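The endpoint arithmetic can be sketched as a tiny helper (hypothetical function, assuming every item shares the same response range):

```python
def summated_range(n_items, lowest_choice, highest_choice):
    # Possible range of a summated scale: every item at its lowest / highest choice
    return n_items * lowest_choice, n_items * highest_choice

print(summated_range(5, 1, 4))  # (5, 20)
print(summated_range(3, 1, 5))  # (3, 15)
```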

67

Direction of Scoring

What does a high score mean?
Where in the range does the mean score lie?
– Toward the top, bottom?
– In the middle?

68

Descriptive Statistics for 3193 Women

M (SD) Min Max

Age 46.2 (2.7) 44.0 52.9

Activity 7.7 (1.8) 3.0 14.0

Stress 8.6 (2.9) 4.0 19.0

Avis NE et al. Med Care, 2003;41:1262-1276

69

Sample Results: Mean Scores in a Sample of Older Adults

                      Mean
Physical functioning  45.0
Sleep                 28.1
Disability            35.7

70

Example of Table Labeling Scores: Making it Easier to Interpret

                      Mean*
Physical functioning  45.0
Sleep                 28.1
Disability            35.7

* All scores 0-100

71

Example of Table Labeling Scores: Making it Easier to Interpret

                          Mean*
Physical functioning (+)  45.0
Sleep (-)                 28.1
Disability (-)            35.7

* All scores 0-100
(+) indicates higher score is better health
(-) indicates lower score is better health

72

Solutions

Can include in label (+) or (-)– Can label scale so that higher score is more

of “label” Can easily put score range next to label if

they differ in one table

73

Mean Has to be Interpreted Within the Possible Range

                                    M     SD
Parents’ harsh discipline practices*
  Interviewers’ ratings of mother   2.55  .74
  Husbands’ reports of wife         5.32  3.30

*Note: high score indicates more harsh practices

74

Mean Has to be Interpreted Within the Possible Range

                                         M     SD
Parents’ harsh discipline practices*
  Interviewers’ ratings of mother (1-5)  2.55  .74
  Husbands’ reports of wife (1-7)        5.32  3.30

*Note: high score indicates more harsh practices

75

Mean Has to be Interpreted Within the Possible Range

                                         M     SD
Parents’ harsh discipline practices*
  Interviewers’ ratings of mother (1-5)  2.55  .74
  Husbands’ reports of wife (1-7)        5.32  3.30

Interviewer scale: 1 2 3 4 5       (mean 2.55)
Husband scale:     1 2 3 4 5 6 7   (mean 5.32)

*Note: high score indicates more harsh practices

76

Mean Has to be Interpreted Within the Possible Range: Adding SD Information

                                         M     SD
Parents’ harsh discipline practices*
  Interviewers’ ratings of mother (1-5)  2.55  .74
  Husbands’ reports of wife (1-7)        5.32  3.30

Interviewer scale: 1 2 3 4 5       (mean 2.55)
Husband scale:     1 2 3 4 5 6 7   (mean 5.32)

*Note: high score indicates more harsh practices

77

Transforming a Summated Scale to 0-100 Scale

Works with any ordinal or summated scale
Transforms it so 0 is the lowest possible score and 100 is the highest possible
Eases interpretation across numerous scales

Transformed score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
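The slide's formula as a function, with the 5-item, 4-choice scale from an earlier slide (range 5-20) as an illustrative input:

```python
def to_0_100(observed, min_possible, max_possible):
    # Linear rescaling: lowest possible score -> 0, highest possible -> 100
    return 100 * (observed - min_possible) / (max_possible - min_possible)

print(to_0_100(5, 5, 20))   # 0.0
print(to_0_100(20, 5, 20))  # 100.0
print(to_0_100(12, 5, 20))  # ~46.7
```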

78

Homework for Next Class

Complete rows in the matrix for your two measures
– Rows 13-18: Nature of samples on which it has been tested, data quality
– Rows 19-26: Variability, reliability, interpretability

79

Next Class (Class 5)

Guest lecture: Steve Gregorich
Factor analysis

80

Two Readings for Next Week

Selected by Steve Gregorich
– Kline
– Mulaik
Suggest reading them ahead to be able to ask questions (recommended)