Reliability & Validity

Reliability & Validity. 2 Overview for this lecture Ethical considerations in testing Reliability of tests –Split-half reliability Validity of tests Reliability


Reliability & Validity


Overview for this lecture

• Ethical considerations in testing
• Reliability of tests
  – Split-half reliability
• Validity of tests
• Reliability and validity in designed research
  – Internal and external validity


What does this resemble?


Rorschach test

• You look at several images like this, and say what they resemble

• At the end of the test, the tester says …
  – you need therapy
  – or you can't work for this company

What assurance would you expect about the test?


Or imagine someone asks your child to draw a human figure

The tester says this shows “signs” that your child is a victim of sexual abuse.

What questions would you ask?


What questions would you ask?

• Is it valid for the purpose to which you plan to put it?
• Can it be faked?
• How were the norms constructed?
• Can we see the data on which the norm is based?
• Are there tester effects?
• Is scoring reliable?
• Is it culture fair – are there separate norms for my culture?


Ethics – developmental role for a test

Sometimes said: “a good test will let you give the subject a debrief that they can use to help…”

- personal decisions

- career

- choice of therapy

- personal development targets

e.g. learning styles & study practices

But how reliable / specific is the test, really?


Psychological Testing

• Occurs widely …
  – in personnel selection
  – in clinical settings
  – in education

• Test construction is an industry
  – There are many standard tests available

What constitutes a good test?


Working assumption - a test is:

a set of items

questions, pictures, …

to which an individual responds

rating, comment, yes/no ….

The responses to these items are added up (combined in some way) to create an overall score that assesses one psychological construct

Also called a ‘scale’


E.g. the Warwick Sweetness Scale

1, 2, 3, 4, 5

How much do you like sugar in coffee?

How much do you like toffee?

How much do you like ice-cream?

How much do you like pudding?

How much do you like chocolate cake?

How much do you like honey?


Specificity & sensitivity

Critical for diagnostic tests (dyslexic; autistic; diabetic)

Sensitivity: the test picks out people who really do have the condition

Specificity: the test excludes people who do not have the condition
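
The two definitions above can be turned into proportions over a table of test outcomes. The following is a minimal sketch in Python; the function names and the counts are hypothetical and chosen only for illustration, they are not data from the lecture.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of people WITH the condition whom the test picks out."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of people WITHOUT the condition whom the test excludes."""
    return true_neg / (true_neg + false_pos)


# Hypothetical screening outcome (illustrative counts, not real data).
tp, fn = 45, 5      # 50 people who really have the condition
tn, fp = 180, 20    # 200 people who do not have it

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A test can score well on one measure and poorly on the other, which is why diagnostic tests are usually reported with both.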


Reliability

consistency

• Test-retest reliability
• Parallel forms reliability
• Split-half reliability
• Intraclass correlation (ICC, Cronbach’s alpha)
• Inter-rater reliability (kappa, ICC)


Split-half reliability

Item                             Score   Odd   Even
1. sugar in coffee?                3      3
2. toffee?                         4             4
3. ice-cream?                      2      2
4. pudding?                        3             3
5. chocolate cake?                 5      5
6. honey?                          4             4
Total Warwick Sweetness score     21     10     11


Split-half reliability

• Split the test into two halves – do you get similar scores on the halves?
  – Compute separate sub-totals for odd and even items (for each subject)
  – Correlate these sub-totals (r_half)

• Adjust the reliability estimate with the Spearman-Brown correction

r_test = (2 × r_half) / (1 + r_half)
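
The procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecturer's own code: the helper names are hypothetical, the first row of `responses` is the worked example from the previous slide (odd sub-total 10, even sub-total 11), and the other four rows are invented so that there is something to correlate.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return (2 * r_half) / (1 + r_half)


def split_half_reliability(responses):
    """Return (r_half, r_test) for a list of per-subject item scores."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6
    r_half = pearson(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# One row per subject, six items each, rated 1 to 5.
responses = [
    [3, 4, 2, 3, 5, 4],  # example from the slide: odd = 10, even = 11
    [1, 2, 1, 1, 2, 2],  # invented
    [5, 5, 4, 5, 5, 4],  # invented
    [2, 3, 3, 2, 2, 3],  # invented
    [4, 3, 4, 4, 3, 5],  # invented
]

r_half, r_test = split_half_reliability(responses)
print(f"r_half = {r_half:.3f}, r_test = {r_test:.3f}")
```

The correction is needed because each half is only half as long as the full test, and shorter tests tend to be less reliable; for a positive r_half the corrected value is never smaller than r_half.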


Reliability v. accuracy

Can be reliable but not accurate

m1 m2 m3

1 11 21

2 12 22

3 13 23

4 14 24

5 15 25
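
One way to read the table above, assuming m1, m2 and m3 are three instruments measuring the same five things: the instruments agree perfectly on the ordering and spacing of the values (high reliability in the correlational sense), yet their readings differ by a constant offset, so they cannot all be accurate. A small Python sketch of that reading, with a hypothetical `pearson` helper:

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Values copied from the table on the slide.
m1 = [1, 2, 3, 4, 5]
m2 = [11, 12, 13, 14, 15]
m3 = [21, 22, 23, 24, 25]

r_12 = pearson(m1, m2)           # agreement in pattern between m1 and m2
r_13 = pearson(m1, m3)           # agreement in pattern between m1 and m3
offset_12 = mean(m2) - mean(m1)  # systematic difference in level
offset_13 = mean(m3) - mean(m1)

print(f"r(m1, m2) = {r_12:.2f}, mean difference = {offset_12}")
print(f"r(m1, m3) = {r_13:.2f}, mean difference = {offset_13}")
```

A correlation is blind to constant offsets, which is why a reliability coefficient alone says nothing about accuracy.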


Validity

Interpretation; link to reality

The relationship between test scores and the conclusions we draw from them.

"The degree to which evidence and theory support the interpretation of test scores entailed by proposed use of tests." (AERA/APA/NCME, 1999)

IQ tests – “intelligence”

Personality tests – “personality”


Validity

Fast cars

are powerful – the bhp test

are red – the colour test

move quickly – the speed test


Validity

• "Validation is inquiry into the soundness of the interpretations proposed for scores from a test" Cronbach (1990, p. 145)

• Face validity
• Content validity
• Construct validity
• Criterion validity


Face validity

• Does a test, on the face of it, seem to be a good measure of the construct?

E.g., how fast can a particular car go?
  – time it over a fixed distance

Direct measurement of speed has good face validity


Face validity

The bishop / colonel question


Content validity

Does the test systematically cover all parts of the construct?

E.g. the examination for a module

Topics taught: Soup, Fish, Beetroot, Custard, Rice

Topics examined: Soup, Beetroot, Custard


Content validity

Spider phobia

Aspects of the construct
• Strength of fear reaction
• Persistence of reaction
• Invariability of reaction
• Recognition that reaction is unreasonable
• Avoidance of spiders
• …

Aspects assessed


Construct validity

Measuring things that are in our theory of a domain.

e.g. engine power propels car

• A construct is a mechanism that is believed to account for some aspect of behaviour
  – working memory
  – trait introversion/extroversion

• E.g., children's spelling ability in native language is correlated with learning of second language


Construct validity

The construct is sometimes called a latent variable

You can’t directly observe the construct

You can only measure its surface manifestations

[Diagram: the construct Extroversion (latent variable) is linked to its measurements (manifest variables): a personality questionnaire and behavioural observation]


Construct validity

Measuring construct validity

• Convergent validity
  – Agrees with other measures of the same thing

• Divergent validity
  – Does not agree with measures of different things

(Campbell & Fiske, 1959)

‘Warwick spider phobia questionnaire’
  – positive correlation with SPQ
  – no correlation with BDI
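
A check of this kind comes down to two correlations. The sketch below is only an illustration: all three score lists are invented for the example (they are not real questionnaire data), and the `pearson` helper and variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# INVENTED scores for six respondents, for illustration only.
new_scale = [2, 5, 8, 11, 14, 17]   # the new spider phobia questionnaire
spq = [3, 6, 7, 12, 13, 18]         # an established measure of the same thing
bdi = [10, 4, 12, 12, 4, 10]        # a measure of a different construct

r_convergent = pearson(new_scale, spq)  # should be clearly positive
r_divergent = pearson(new_scale, bdi)   # should be close to zero

print(f"convergent r = {r_convergent:.2f}")
print(f"divergent r  = {r_divergent:.2f}")
```

Evidence for construct validity needs both halves: a high convergent correlation alone could simply mean the new scale measures general distress.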


Criterion validity

• A test has high criterion validity if it correlates highly with some external benchmark
  – e.g. a spelling test predicts learning of a second language
  – e.g. the "Bishop/colonel" test might predict good cleaners

• Concurrent validity
• Predictive validity


Criterion / predictive validity

• Graphology for job selection
  – Candidate writes something: validity = .18
  – But untrained graphologists, too…
  – Candidate copies something: validity = none

Schmidt & Hunter (1998) in Psychological Bulletin, 124, 262-274


Reliability and validity

Reliability limits validity

- without reliability, there is no validity

- Measures of validity cannot exceed measures of reliability

validity ≤ reliability
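
The inequality on this slide is a simplified form. In classical test theory the bound is usually written with a square root. The symbols below are notation introduced here rather than taken from the slides: r_xy is the validity coefficient (the correlation between test x and criterion y), and r_xx and r_yy are the reliabilities of the test and of the criterion.

```latex
r_{xy} \;\le\; \sqrt{r_{xx}\, r_{yy}} \;\le\; \sqrt{r_{xx}}
```

For example, a test with reliability 0.49 can have a validity coefficient of at most 0.70. Either way the message of the slide holds: as reliability approaches zero, so does the ceiling on validity.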


Replicability

Can the result be repeated?

Drachnik (1994)

43 children abused; 14 included tongues

194 not abused – only 2 …

d = 1.44


Replicability

Does it replicate?

1. Chase (1987)

34 abused, 26 not abused d = 0.09

2. Grobstein (1996)

81 abused, 82 not abused d = 0.08


Reliability in designed research

Use reliable measurement instruments

Standardized questionnaires

Accurate and reliable clocks

Repeat measurements

Many participants

Many trials

Eliminate (control) sources of ‘noise’ – irrelevant factors that randomly affect the outcome variable

Temperature

Time of day


Tip:

Reduce irrelevant individual differences

e.g. test only female participants

test only a narrow age band

Why? – reduces error variance, makes test more powerful

Cost? – ability to generalise to other groups or situations is reduced


Validity in designed research

Internal validity

Are there flaws in the design or method?

Can the study generate data that allows suitable conclusions to be drawn?

External validity

How well do the results carry over from the sample to the population? How well do they generalise?


Lecture Overview

• Ethical considerations in testing
  – Results can be used to make important decisions; is the test good enough to justify these?

• Reliability
  – Test-retest; internal consistency (split-half)
  – Accuracy; specificity & sensitivity

• Validity
  – Face, content, construct, criterion
  – Divergent & convergent

• Replicability

• Reliability and validity in designed research
  – Internal and external validity


http://wilderdom.com/personality/L3-2EssentialsGoodPsychologicalTest.html

• Standardization: standardized tests are
  – administered under uniform conditions, i.e. no matter where, when, by whom or to whom it is given, the test is administered in a similar way.
  – scored objectively, i.e. the procedures for scoring the test are specified in detail so that any number of trained scorers will arrive at the same score for the same set of responses. So, for example, questions that need subjective evaluation (e.g. essay questions) are generally not included in standardized tests.
  – designed to measure relative performance, i.e. they are not designed to measure ABSOLUTE ability on a task. In order to measure relative performance, standardized tests are interpreted with reference to a comparable group of people: the standardization, or normative, sample. E.g. the highest possible grade in a test is 100 and a child scores 60 on a standardized achievement test. You may feel that the child has not demonstrated mastery of the material covered in the test (absolute ability), BUT if the average of the standardization sample was 55, the child has done quite well (RELATIVE performance).

• The normative sample should (for hopefully obvious reasons!) be representative of the target population. However, this is not always the case, so the norms and the structure of the test need to be interpreted with appropriate caution.
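
Relative performance is commonly expressed as a standard (z) score against the normative sample. The sketch below reuses the score of 60 and the norm mean of 55 from the example above. The standard deviation of 10 and the assumption that norm scores are roughly normally distributed are NOT given in the text; they are assumptions made only so that the arithmetic can be shown.

```python
from statistics import NormalDist


def z_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Distance of a raw score from the norm mean, in standard deviations."""
    return (raw - norm_mean) / norm_sd


child_score = 60   # from the example in the text
norm_mean = 55     # from the example in the text
norm_sd = 10       # ASSUMPTION: the text does not give a standard deviation

z = z_score(child_score, norm_mean, norm_sd)

# ASSUMPTION: norm scores are approximately normal, so the standard normal
# CDF gives the share of the normative sample scoring below the child.
percentile = NormalDist().cdf(z) * 100

print(f"z = {z:+.2f}, percentile rank = {percentile:.0f}")
```

With a different standard deviation the same raw score would land at a different percentile, which is why a raw score of 60 out of 100 cannot be interpreted without the norms.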