
4. Sampling & Measuring

1

Material From Text Covered in This Section

• Mainly Chapter 4, but also:

• Parts of chapter 9 about correlations and scatterplots

• Part of Chapter 12 on sampling

2

Context

• So far, we’ve covered foundational stuff:

• Scientific thinking

• Ethics

• Getting ideas for research

• Now we get into the technical stuff:

• Getting participants

• Measuring their behaviour

• Analyzing data from measurements

3

Topics

• Sampling & Recruiting

• Measurement

• Data analysis

• Data presentation

4

5

4.1 Sampling & Recruiting

• Population: Entire group of individuals of interest

• Sample: Subset of population tested

• Sampling: Method of selecting individuals to:

• Naturalistically observe

• Invite to participate in study

• Recruiting: Method of inviting the selected individuals to participate in study

Some Vocabulary

6

• If you want to know what the population is like, your sample must be representative: It must reflect the attributes of the population

• Representative samples are important in some kinds of observational research (e.g., surveys)

• But, if you want to know how manipulated variables will affect each other, it is less important

• Representativeness is less important where low variability exists in the population

Sampling

7

• Best way to get representative sample

• Each population member has an equal chance of appearing in the sample

• Typical methods: Random number generation & physical mixing

Random Sampling

8

Random ≠ Arbitrary!

9

Random Sampling in MS Excel

• Column A: Place all population members’ identifiers (i.e., names or numbers)

• Column B: Set B1 to =RAND(), then use Ctrl-D to fill the rest of the column

• Then sort (“Data” menu) by Column B. Equivalent to shuffling.

• Then take the first n individuals as your sample.
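The same shuffle-then-take-n idea can be sketched outside Excel. This is a minimal Python sketch, not a prescribed method; the function name `simple_random_sample`, the participant IDs, and the seed are all illustrative assumptions.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Shuffle a copy of the population and take the first n members.

    Mirrors the Excel recipe: attach a random number to each member,
    sort by it (equivalent to shuffling), keep the first n.
    """
    rng = random.Random(seed)      # seed is optional; it only makes the run repeatable
    shuffled = list(population)    # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n]

# Hypothetical population of 50 participant identifiers
ids = [f"P{i:03d}" for i in range(1, 51)]
sample = simple_random_sample(ids, 10, seed=1)
print(sample)
```

Every member has the same chance of landing in the first n positions, which is what makes the sample random rather than arbitrary.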

10

• Used when sub-populations vary significantly. Example: Voters in different parts of the country

• Each sub-population (stratum) is randomly sampled independently

• Several different versions

Stratified Random Sampling

11

• Proportionate SRS: n’s in sample are proportional to n’s in population. Example: Typical psych class has N=50, with 90% F and 10% M. To get a sample with n=10, randomly sample 9/45 F and 1/5 M.

• Disproportionate SRS: Numbers in sample are higher for sub-pops with greater variability. Example: If males vary more in IQ, you might sample more males than females
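The proportionate example above (N = 50, 45 F and 5 M, n = 10) can be sketched in Python. The function name and the member labels are hypothetical, and the rounding rule is an assumption: with awkward proportions the rounded stratum sizes may not add up to exactly n.

```python
import random

def proportionate_stratified_sample(strata, n, seed=None):
    """Randomly sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Stratum share of the sample; rounding is an assumption of this sketch
        k = round(n * len(members) / total)
        sample[name] = rng.sample(members, k)   # independent random draw per stratum
    return sample

# Hypothetical class list: 45 women and 5 men
strata = {
    "F": [f"F{i}" for i in range(45)],
    "M": [f"M{i}" for i in range(5)],
}
picked = proportionate_stratified_sample(strata, 10, seed=1)
print({name: len(members) for name, members in picked.items()})
```

A disproportionate design would replace the proportional `k` with stratum sizes chosen by the researcher.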

Stratified Random Sampling

12

• Sampling larger fractions from strata living in sparsely populated areas.

• Sampling equal numbers from sub-populations that vary widely in size (e.g., males and females in psych classes)

• Ensures that one has homogeneity of variance in samples when comparing sub-populations statistically

Other SRS Strategies

13

Used when random sampling is not practical due to size of population

1. Break pop down into sub-populations

2. Randomly select some sub-pops

3. Randomly select some members of each of the selected sub-pops

Cluster Sampling

14

Example: How to select a sample of 50 from Dawson City (pop 1327)?

1. The town consists of 60 city blocks, each with 20-25 people.

2. Select 10/60 city blocks at random

3. Select 5 individuals randomly from each selected city block.
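The two-stage selection in this example can be sketched as follows. The town layout is made up for illustration (60 blocks of exactly 22 residents each, where the example says 20-25), and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick members within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

# Hypothetical town: 60 city blocks, 22 residents per block
blocks = {b: [f"block{b}-person{p}" for p in range(22)] for b in range(60)}

sample = cluster_sample(blocks, n_clusters=10, n_per_cluster=5, seed=1)
total_selected = sum(len(people) for people in sample.values())
print(total_selected)
```

Only the 10 selected blocks need a full list of residents, which is the practical advantage over a simple random sample of the whole town.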

Cluster Sampling

15

• Most common form of sampling in reality. Also called arbitrary or haphazard sampling

• In most studies, participants are self-sampled, responding to posters, ads, subject pools, etc.

• Participants are not forced to participate, so some degree of self-selection is unavoidable. Exception: Naturalistic observation. But subjects still selected arbitrarily (e.g., those who happen by) not randomly

Convenience Sampling

16

• Non-random sampling can lead to lack of generalizability. BUT, this depends on:

• Variability in the population. Example: Number of eyes vs. political affiliation

• Probability of interaction between variables of interest and variations in population. Example: Political affiliation won’t affect number of eyes but will affect attitudes toward abortion

• Every sample is equivalent to a random sample from some population

Convenience Sampling

17

• In order to ensure that one is drawing from the population of interest, one may apply inclusion or exclusion criteria. Example: You must have normal vision to participate in some perception experiments

• Whether a criterion is inclusive or exclusive can sometimes be a matter of semantics

Inclusion/Exclusion Criteria

18

Recruiting

• Many methods can be used:

• Ads: TV, radio, poster, web...

• Direct appeal: In person, phone, email, social networking sites...

• Main ethical issue: Informed consent

• With vulnerable pops, must be extra careful about coercion

19

Discussion / Questions

• What is the difference between random and arbitrary?

• When would you use cluster sampling?

• Discussion: The participant pool controversy. What do you think?

20

4.2 Issues in Measurement

21

The Importance of Careful Measurement

• Of utmost importance in science. We want data, not just casual observations

• Especially important in Psych: 1) Data is noisy enough as it is without adding undue measurement error 2) We have limitations on repeatability

• Treat each measurement as precious

22

• Construct: Abstract concept of interest, such as intelligence, aggression, attitude towards women...

• Measure: Concrete means for measuring construct

• Operational definition = Construct+Measure.

• Choice of construct is guided by question

• Choice of measure usually guided by previous work, or you can come up with your own.

Measures & Constructs

23

Signal to Noise

• Any measured value is a combination of signal (true score) and noise (measurement error).

• Measurement error can be:

• Systematic: E.g., you set your scale 5 kg lower

• Random: E.g., a variety of other factors, such as your posture, the temperature, etc. can affect the scale as well. Arises from factors we cannot or do not control.

24

True Score

Measured score (medium reliability)

Measured score (low reliability)

Measured score (high reliability)

25

True Score

Measured score (medium reliability)

+ Positive Systematic error

+ Negative Systematic error

26

Systematic Error

• Easy to compensate for if you know it’s there.

• Not a problem if comparing relative differences and it’s the same amplitude across groups. Example: Two groups of dieters both use a scale that’s off by 5 kg.

• Problematic if you don’t know it’s there and need the absolute value of the thing you’re measuring.

• Extremely problematic if you don’t know it’s there and it’s different for different groups. (=Confound)

27

Random Error

• Is always present to some degree. You can never quite get the true value

• Can be a problem when comparing groups Example: Which dieters are heavier? Random error may make measured values for the heavier ones lower than those for the lighter ones

• Is dealt with by averaging over many measurements (up/down errors cancel) and by using inferential statistics (on which, more later...)
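A small simulation can illustrate why averaging helps with random error but not with systematic error. This is a sketch under made-up assumptions: a true weight of 70 kg, a scale set 5 kg too low, and normally distributed random error with a standard deviation of 2 kg.

```python
import random
import statistics

def measure(true_score, bias=0.0, noise_sd=1.0, rng=random):
    """Measured score = true score + systematic error + random error."""
    return true_score + bias + rng.gauss(0, noise_sd)

rng = random.Random(42)     # fixed seed so the run is repeatable
true_weight = 70.0          # hypothetical true score (kg)

# Scale set 5 kg too low (systematic), plus random error on every reading
readings = [measure(true_weight, bias=-5.0, noise_sd=2.0, rng=rng)
            for _ in range(10_000)]

single = readings[0]                   # one reading: off by bias AND noise
average = statistics.fmean(readings)   # noise largely cancels; bias remains

print(round(single, 2), round(average, 2))
```

The average settles near 65 kg, not 70 kg: the up/down random errors cancel, but the 5 kg systematic error survives any amount of averaging.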

28


• When is measurement error most problematic in research?

• How does one mainly deal with random error?

• What is an operational definition?

29

• Reaction Time (aka, RT or Latency)

• Accuracy (= 100% - Error Rate)

• Preferential looking

• Verbal and written responses

• Behaviour inventories

• Physiological responses

• Brain imaging

• Etc. etc. etc...

Examples of Common Measures

30

• a.k.a. Latency. Time between stimulus (e.g. computer image) and response (e.g. mouse click).

• Usually measured in ms (min. value ≈ 100 ms)

• Inherently skewed (asymmetrical) distribution

Reaction Time

Careful: Important technical issues exist with gathering accurate RTs via computer (MacInnes & Taylor, 2001).

31

• Accuracy: Proportion of times correct answer given

• Error = 100% - Accuracy;

• Normally distributed but only at moderate difficulty

• Individual performance not linear with regard to difficulty, but rather sigmoidal.

• Error is good in combo with RT because “up = bad” for both

Accuracy / Error

32

Error

[Figure: Error plotted against Difficulty (Easy to Hard), alongside a frequency distribution of Error scores (axis ticks at 0 and 50)]

33

Accuracy

[Figure: Accuracy plotted against Difficulty (Easy to Hard), alongside a frequency distribution of Accuracy scores (axis ticks at 50 and 100)]

34

Speed/Accuracy Trade-Off

• Accuracy/Error and RT are both measures of “performance”

• Faster and/or Fewer errors = better

• Slower and/or More errors = worse

• But what if one goes up while the other goes down?

• Faster and More errors = ???

• Slower and Fewer errors = ???

35

Word or Scramble?

TABLE   BLTEA   EXTREMISM   MSIEXTMRE

“As quickly and accurately as possible, say whether the string of letters forms a word or a scrambled word.”

36

Speed-Accuracy Trade-off

[Figure: two panels plotted against Word Length (Large vs. Small). One panel shows Error (%), axis 0 to 10; the other shows RT (msec), axis 0 to 1,000]

37


Sensitivity

• Sensitivity is symbolized d' (“dee-prime”).

• Measure of one’s ability to detect a given signal (e.g., a dim light, or an ambiguous diagnosis)

• How might we measure sensitivity?

38

(a bad way of) Measuring Sensitivity

• Present stimulus 100 times and note % of times participant says he detects it?

• Problem: P who says “yes” all the time will do very well. (the very problem we are trying to avoid!)

• Need measure that reflects ability to discriminate between “signal present” and “signal absent”

39

(a good way of) Measuring Sensitivity

• Present stimulus (signal + noise) on only half of the trials, (test trials).

• Present no stimulus (noise) on other half of trials, (catch trials).

• Many trials are presented. For each, the participant says “yes, the signal is there” or “no, it isn’t”.

40

Four Possible Results on Each Trial

                       The signal is really...
Participant says...    Present    Absent
Yes, I see it          Hit        False Alarm
No, I don’t            Miss       Correct Rejection

41

d’ and Signal Detection Theory

                                   Patient is really...
Psychologist says patient is...    ill                          healthy
ill                                Hit (illness detected)       False Alarm (health falsely seen as illness)
healthy                            Miss (illness overlooked)    Correct Rejection (health correctly identified as such)

42

Also Known As...

                            The difference is really...
Statistical test says...    Present                           Absent
Yes, it’s there             True Positive                     False Positive (type I error)
No, it’s not                False Negative (type II error)    True Negative

43

The stimulus is really...

Albert
Participant says...    Present (n = 100)    Absent (n = 100)
Yes, I see it          90                   20
No, I don’t            10                   80

Benny
Participant says...    Present (n = 100)    Absent (n = 100)
Yes, I see it          80                   10
No, I don’t            20                   90

Claire
Participant says...    Present (n = 100)    Absent (n = 100)
Yes, I see it          90                   30
No, I don’t            10                   70

44


Some Example Results

Sensitivity

• The results of such an experiment yield:

• Proportion of hits: Ph = Nhits / Ntest trials

• Proportion of FAs: Pfa = Nfa / Ncatch trials

• For example, for Albert: Ph = 90 / 100 = .9; Pfa = 20 / 100 = .2

• Note that we could calc proportions of misses and correct rejections too, but these are redundant (Pm = 1-Ph; Pcr = 1-Pfa)

45

Questions

• What is Benny’s proportion of hits (Ph)?

• What is his proportion of FA’s (Pfa)?

46

Sensitivity

• Perfect participant: Ph = 1, Pfa= 0

• Participant just guessing: Ph = .5, Pfa= .5

• Worst possible participant (perfectly backwards) : Ph = 0, Pfa= 1

• In calculating sensitivity, we want to reward hits and punish FAs, so we could just use “Basic Sensitivity”: BS = Ph - Pfa

47

Sensitivity

• BS equals:

• Perfect participant: Ph of 1 - PFA of 0 = 1

• Participant guessing: Ph of .5 - PFA of .5 = 0

• Backward participant: Ph of 0 - PFA of 1 = -1

• So BS seems to work, right?

48

However, for obscure statistical reasons, BS is, well, B.S.

• Instead we calculate d' = z(Ph) - z(Pfa)

• Converting the proportions of hits and FAs to z-scores yields a more valid result.

• d’ is measured in standard deviation units.

• How to calculate the z scores?

• In Excel, use norminv(P, 0, 1)

• Table A5.1 from MacMillan & Creelman

• Or the “unit normal” table from any stats textbook.

49

Normal distribution for finding d' (based on Macmillan & Creelman (2005), Signal Detection: A User’s Guide)

50

Sensitivity

• d' = z(Ph) - z(Pfa)

• Albert: Ph = .9 ∴ z = 1.28; Pfa = .2 ∴ z = -0.84; ∴ d’ = 1.28 - (-0.84) = 2.12

• Benny: Ph = .8 ∴ z = 0.84; Pfa = .1 ∴ z = -1.28; ∴ d’ = 0.84 - (-1.28) = 2.12

• Claire: Ph = .9 ∴ z = 1.28; Pfa = .3 ∴ z = -0.52; ∴ d’ = 1.28 - (-0.52) = 1.80
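These calculations can be checked with a short Python sketch. It uses `statistics.NormalDist().inv_cdf` from the standard library as the z conversion, which plays the same role as norminv(P, 0, 1) in Excel; the function and variable names are illustrative, not part of the course material.

```python
from statistics import NormalDist

def z(p):
    """z score for a proportion: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(p)

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(Ph) - z(Pfa), computed from the four cell counts."""
    p_h = hits / (hits + misses)                               # proportion of hits
    p_fa = false_alarms / (false_alarms + correct_rejections)  # proportion of FAs
    return z(p_h) - z(p_fa)

# Cell counts from the example results for the three participants
results = {
    "Albert": d_prime(90, 10, 20, 80),
    "Benny": d_prime(80, 20, 10, 90),
    "Claire": d_prime(90, 10, 30, 70),
}
for name, value in results.items():
    print(name, round(value, 3))
```

Because the z scores are not rounded to two decimals before subtracting, the printed values can differ slightly from the hand calculations (Claire comes out near 1.806 rather than 1.80).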

51


Zero and One

• What to do with proportions of 0 and 1?

• Technically, these yield z scores of -∞ and +∞

• There are many ways of getting around this (MacMillan & Creelman, 2005, chp. 1). For our purposes, just substitute values of 0.01 and 0.99, respectively.
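The substitution rule can be sketched as a small wrapper around the d' calculation. The helper names are hypothetical, and this implements only the simple 0.01 / 0.99 substitution described above, not the other corrections discussed by Macmillan & Creelman.

```python
from statistics import NormalDist

def substitute_extremes(p):
    """Replace proportions of exactly 0 or 1, which have no finite z score."""
    if p == 0:
        return 0.01
    if p == 1:
        return 0.99
    return p

def d_prime_from_proportions(p_h, p_fa):
    """d' = z(Ph) - z(Pfa), with 0 and 1 substituted before conversion."""
    z = NormalDist().inv_cdf
    return z(substitute_extremes(p_h)) - z(substitute_extremes(p_fa))

# A "perfect" participant: Ph = 1, Pfa = 0
perfect = d_prime_from_proportions(1.0, 0.0)
print(round(perfect, 2))
```

With this rule the largest d' that can be reported is z(.99) - z(.01), a little over 4.6.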

52

• Example for self-test: Zack does a sensitivity experiment with 20 test trials and 20 catch trials. He gets 10 hits and 5 false alarms.

• What are his Ph and Pfa values?

• What is his d’?

53

Ph = .50, Pfa = .25

d’ = .67

Questions

Dr. X tests a new field diagnostic procedure by applying it to 50 individuals known to have an illness. He finds that it correctly labels all 50 of them as ill, whereas his old procedure labeled only 40 of them as such. He concludes that the new procedure is better. Is his conclusion sound? Why or why not?

54

Questions

• Used with kids, because no verbal feedback required

• Infants will look longer at stimuli that are more interesting / surprising

• Measure looking time with “blind” raters. Also use several raters and check inter-rater reliability

• More in PSY2105 & 3140

Preferential Looking

55

• Very common way to measure higher-level functions such as cognitive ability, personality, social attitudes...

• Responses usually gathered by questionnaires, tests, or interviews

• Entire field devoted to questionnaire/instrument development, called Psychometrics

• There is an art and science to creating good questions

• More in PSY3307

Verbal/Written Responses

Example item: “I find PSY 2174 to be exciting and informative”

Response options: Strongly Agree / Agree / Not sure / Disagree / Strongly Disagree

56

• Questions should be clear and simple, with proper grammar and spelling

• They should be minimum in number, though asking the same question multiple ways gets more reliable data...

• Be careful of question order. Previous questions can influence later answers.

• Avoid leading questions. Example: “Do you support Stephen Harper’s harmful and exploitative foreign policy?”

• Avoid “double questions” (see previous slide)

Good Questions & Bad

57

• Close-ended: Yes/No, Multiple-choice, Likert Scale, Numerical ratings...

• Answer options should be inclusive and exhaustive. Often a good idea to include “Other (please specify):_____”

• Answer options should be non-overlapping (age ranges of 18-30, 25-40, 35-60, 50+ make no sense)

• Open ended: Sentence completion, word association, completely unstructured, etc. (must use text analysis)

Types of Questions

58

59

• Center for Epidemiological Studies Depression scale

• 9 items, each rated on a 4-point scale. A “Matrix Questionnaire”

• Notice: Same thing asked multiple times, simple structure, few questions

• Has surprisingly good reliability & validity

An Example Questionnaire: CES-D

60

• Observer visually assesses behaviour. Example: Counts occurrences of a behaviour. Example: Times how long a behaviour takes

• Problem: Whether or not a behaviour has occurred can be ambiguous and open to interpretation. Solutions: Use several observers and measure their agreement with one another. Also, make sure behaviours are defined in detail

Behaviour Inventories

61

Behaviour Inventory

P#   Start Time   End Time   Waiting?   Driver Sex   Driver Race   Observed Car     Intruder Car
1    12:56:37     12:57:44   Y          M            Cau           Matrix, new      F-150, new
2    12:58:01     12:59:33   N          M            AA            Sebring, old
3    1:07:37      1:08:09    N          F            Cau           BMW, new
4    1:11:55      1:12:30    Y          F            East Ind      Windstar, new    Cherokee, old

• Start timing when driver opens driver-side door

• Stop timing when the front bumper clears the parking space

• A car is “waiting” if someone is waiting for the spot and the driver turns toward the intruding car before entering car

• Record model and approximate age of car

62

• E.g., heart rate, levels of stress hormones in blood, galvanic skin response.

• Good reliability, but can have questionable validity (physiological response is quite similar for fear, anger, sexual arousal, etc.)

Physiological Responses

63

• fMRI, CAT, EEG, ERP, NIRS, etc.

• All ways of measuring activity in various parts of the brain in response to given stimuli

• Like other physiological indicators, good reliability but questions arise about what exactly one is measuring

Brain Imaging

64

And Many More...

• Text and image analysis, archival research, and on and on...

65


• What are some issues with RT measurements?

• A survey asks “Do you oppose the tyrannical policies of Alan Rock?” What’s wrong with this question?

• What is inter-rater reliability? In what situations would it be an issue?

66

4.4 Measuring Measures

Is Your Measure a Good One?

67

• Reliability: The degree to which successive measurement values are the same.

• Measurement Error: Difference between successive measures of the same thing.

• Systematic error (“bias”); inexact ruler

• Random error (“noise”); rubber ruler

Is Your Measure Reliable?

68

• Validity: The degree to which the measure actually measures what you think it does.

• Face Validity: Does the measure intuitively seem to measure the desired construct?

• Construct Validity: Does the measure actually measure the desired construct?

Is Your Measure Valid?

69

Construct validity is bolstered if your measurements...

...correlate with things they should correlate with: Criterion Validity

...correlate with others from instruments claiming to measure the same thing: Convergent Validity

... don’t (or weakly) correlate with things they shouldn’t correlate with: Discriminant Validity


70

Values from your new IQ test, if it really measures intelligence, should:

• Correlate with grades (+), income (+), crime (-) (criterion validity)

• Correlate with other established IQ tests (+) (convergent validity)

• Not correlate with personality tests (0) (discriminant validity)


71

Construct Validity

[Diagram: Mood Disorders, branching into Depression and Anxiety. Symptoms shown: Poor Sleep, Negative Mood, Hopelessness, Suicidal Ideation, Irritated Mood, Apprehension, Muscle Tension]

Does your measure capture “depression” per se or all mood disorders? Or does it merely capture some aspects of depression?

72

• A measure can be reliable but not valid.

• But validity is impossible without reliability.

• Beware tendency to choose reliable (objective / impartial / systematic) measures instead of valid ones

• Reliability without validity is meaningless

Reliability vs. Validity

73

“Because we cannot measure what we value, we begin to value what we measure” - Unknown

Levels of Measurement: N.O.I.R.

• Measurement: “The assigning of numbers to objects based on a rule”

• Can be done at 4 basic levels: Nominal, Ordinal, Interval, or Ratio

• Level determines what math can be done with the measurements and therefore what stats can be used

74

• Labels are applied to participants. Labels are not mathematically related. Examples: Gender (male or female); sport jersey numbers (11, 22, 44); country of origin (Afghanistan...Zambia)

• No math can be done with nominal labels

• Use frequency statistics (e.g., chi square test)

Nominal Scale Measurement (aka Categorization)

75

Frequency Data Example

[Figure: bar graph of hair colour frequency (Blonde, Brown, Black, Red) for Magazines vs. Mall; y-axis 0 to 500]

• No math can be done with nominal labels, but instances can be counted, generating frequency data

• Example: Measure hair colour of sample of 1000 people appearing in magazines vs. 1000 naturalistically observed at the mall

• Out of 1000 of each, what is the frequency of blond, brown, black, or red hair?

76

• Ranks are assigned to participants. Ranks have logical order, but spacing is undefined. Examples: Birth order (1st born, 2nd born, 3rd...); social class (lower, middle, upper); GPA (the numbers do not indicate even intervals)

• Because units are not evenly spaced, math options are limited and so are statistical procedures

• Use nonparametric inferential stats for this kind of data

Ordinal Scale Measurement (AKA Ranking)

77

• Numbers on a scale are assigned to participants. Numbers have meaningful equal spacing, but the zero value is arbitrary. Example: Celsius temperature. 0° ≠ total absence of heat (thankfully!)

• Addition and subtraction are meaningful, but not division/multiplication.

• Use parametric stats for these kinds of data (if other assumptions, such as normality, are met)

Interval Scale Measurement

78

• Numbers on a scale are assigned to participants. Numbers have meaningful equal spacing and meaningful zero. Example: Kelvin temperature (0 K = no heat), height, weight, RT, age, any physical measure

• All mathematical functions are possible. These are “true numbers”

• Use parametric stats for these kinds of data (if other assumptions, such as normality, are met)

Ratio Scale Measurement

79

• Psychometric tests (IQ, personality tests, etc.) are treated as being interval scale, but are they?

• Depends if you choose to focus on the measure (IQ) or the construct (intelligence)

• Consider: Is the difference in intelligence represented by IQ scores of 100 and 110 the same as the intelligence difference between IQs of 140 and 150?

• There’s really no way to answer this question, but people typically treat such measures as interval

The Scale Debate

80

• Along with design of study, level of measurement determines which descriptive and inferential stats techniques are appropriate

• More techniques exist (and more powerful ones) for interval / ratio than for ordinal or nominal

• Some mis-use statistical analyses on the wrong scale of data, making for meaningless results.

The Importance of Level of Measurement

81

Questions

• What are some aspects of construct validity?

• A researcher measures religiosity in terms of # of times Ps attend church in a year. What’s the level of measurement?

• What kind of mathematical operations can be done with nominal data?

82

4.5 Statistical Analysis

83

• Exploratory: Examine the raw data to get a better understanding of it

• Descriptive statistics: Summarize the characteristics of the sample(s)

• Inferential stats: Infer things about the population(s) from the sample(s)

Three Types of Stats

84

• MoCTs give a summary impression of group data.

• Arithmetic Mean: Simply the average. Sum of all scores over number of scores. (see also Geometric Mean & Harmonic Mean)

• Median: The middle score when all scores are placed in order, (or the arithmetic mean of the two middle scores if an even number of scores).

• Mode: The most frequently occurring score (not always well-defined)

• Do examples in text to get practice!

Measures of Central Tendency (MoCTs)

85

MoCTs: Examples

Data (x): 80, 140, 90, 110, 100, 100, 110, 110, 120, 90

Sorted: 80, 90, 90, 100, 100, 110, 110, 110, 120, 140

Arithmetic Mean: ∑x = 1050, N = 10, ∴ μ = 105

Median: N = 10; (N+1)/2 = 5.5; ∴ M = (5th score + 6th score) / 2 = (100 + 110) / 2 = 105

Mode: Determine frequencies of each score. Score with max frequency is 110

Bins   Freq.
80     1
90     2
100    2
110    3
120    1
130    0
140    1
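The three measures for this data set can be verified with Python's standard `statistics` module. This is only a checking sketch; the variable names are arbitrary.

```python
import statistics
from collections import Counter

scores = [80, 140, 90, 110, 100, 100, 110, 110, 120, 90]

mean = statistics.mean(scores)       # sum of scores / number of scores
median = statistics.median(scores)   # average of the two middle scores (even n)
mode = statistics.mode(scores)       # most frequently occurring score
frequencies = Counter(scores)        # score -> how many times it occurs

print(mean, median, mode)
print(sorted(frequencies.items()))
```

`Counter` reproduces the frequency table, except that scores that never occur (here 130) simply do not appear.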

86

MoCTs: Things to Watch Out For

• Mean is strongly affected by outliers

• Therefore, median is a better choice for asymmetrical data distributions (e.g., income). Also used for ordinal scale data.

• Mode can be ill-defined, but is the only one usable with nominal scale data

87

Remember “Lying with Averages”

• Year 1: 90% of people make 10 000$, 10% make 1 000 000$. Average = 109 000$

• Year 2: 90% of people still make 10 000$, 10% now make 2 000 000$. Average = 209 000$

• The economy’s doing great! Average income has almost doubled!

• The rich 10% are outliers, and make the mean meaningless. The median for both years is 10 000$ and gives a better model of the data
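The income example can be reproduced with a hypothetical population of 100 people (90 earning 10 000$ and 10 earning the high income). The population size and the helper name are assumptions made for illustration.

```python
import statistics

def incomes(rich_income, n_people=100):
    """90% of people earn 10 000$; the remaining 10% earn rich_income."""
    n_rich = n_people // 10
    return [10_000] * (n_people - n_rich) + [rich_income] * n_rich

year1 = incomes(1_000_000)
year2 = incomes(2_000_000)

summary = {
    "year1": (statistics.mean(year1), statistics.median(year1)),
    "year2": (statistics.mean(year2), statistics.median(year2)),
}
print(summary)
```

The mean nearly doubles between the two years while the median does not move, which is why the median is the better summary for a distribution this skewed.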

88

Average is Not Enough

• A MoCT is a very simple model of the data in a sample. Much detail is lost.

• It pays to explore the data more closely (e.g., using frequency histogram)

• Samples with similar MoCTs can have very different frequency distributions

89

All of These Groups Have the Same Mean

[Figure: four frequency histograms of groups that share the same mean but have differently shaped distributions; y-axis: Frequency]

90

For more practice with measures of central tendency (which I recommend), go to khanacademy.org

http://tinyurl.com/26mksz9 (the average) http://tinyurl.com/3qdjlhf (sample vs. pop.)

91

Questions

• Would it be a good idea to use the mean to assess the central tendency of RT data? Why or why not?

• What is the median of this set of data:1, 2, 3, 4, 5, 6

• What is the median of this set of data:1, 2, 3, 4, 5, 1000

92

• MoVs give some idea of how individual scores are spread around a measure of central tendency

• Examples: Range, average deviation, standard deviation, standard error of the mean, etc. etc.

• A comparison of measures of central tendency is meaningless without a measure of variability.

Measures of Variability (MoVs)

93

Range

• Simple range = max score - min score. Extremely subject to outliers, gives no info as to distribution.

• Interquartile range = 3rd quartile-1st quartile. Much more robust, but still not preferred in most fields.

94

Interquartile Range

• Interquartile range = 3rd quartile-1st quartile. Much more robust.

• To find 1st and 3rd quartiles, first find the median

set A = 2 4 6 10 12 20 40 90

set B = 2 4 5 10 12 20 40 90 95

• Recall this is the value at position (n+1)/2

set A: (8+1)/2 = 4.5th value, ∴ 11

set B: (9+1)/2 = 5th value, ∴ 12

95

Interquartile Range

• Now take each group and divide into halves, excluding the median if n is odd

set A = (2 4 6 10) (12 20 40 90)

set B = (2 4 5 10) 12 (20 40 90 95)

• Now take the medians of those subgroups

set A: 1st Q = 5.0, 3rd Q = 30

set B: 1st Q = 4.5, 3rd Q = 65

96

Interquartile Range

• Finally, subtract the first quartile from the 3rd

• set A IQR = 30 - 5 = 25

• set B IQR = 65 - 4.5 = 60.5
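The median-of-halves procedure can be sketched in Python as below. The function names are illustrative, and the sketch assumes at least two scores. Note that statistical software often computes quartiles by interpolation instead, so its values may differ from this hand method.

```python
import statistics

def quartiles_by_halves(data):
    """1st and 3rd quartiles as the medians of the lower and upper halves.

    When n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]     # first half of the sorted scores
    upper = ordered[-half:]    # last half; skips the middle score when n is odd
    return statistics.median(lower), statistics.median(upper)

def iqr(data):
    """Interquartile range = 3rd quartile - 1st quartile."""
    q1, q3 = quartiles_by_halves(data)
    return q3 - q1

set_a = [2, 4, 6, 10, 12, 20, 40, 90]
set_b = [2, 4, 5, 10, 12, 20, 40, 90, 95]
print(quartiles_by_halves(set_a), iqr(set_a))
print(quartiles_by_halves(set_b), iqr(set_b))
```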

97

• Take the difference between the mean and each score, and add up the differences. Simple, right?

• Problem! It always adds up to zero. In fact, the mean is defined as such.

• Could use mean absolute deviation, but the standard deviation is much more common

Average Deviation

98

           x            Deviation from the mean    Absolute deviation from the mean
Amy        0            -2                         2
Bob        0            -2                         2
Ced        2            0                          0
Dan        4            2                          2
Eva        4            2                          2
Sum:       10           0                          8
Sum / n:   2 (=mean)    0                          1.6

99

Standard Deviation

• Standard deviation (σ) is the most common MoV

• For normally-distributed data, there are known percentages of scores that lie within x standard deviations of the mean. Example: μ ± 1 σ encompasses 68.27% of scores. Example: μ ± 1.96 σ encompasses 95% of scores

• Variance is simply σ squared

100

“Square-root of the mean of the squared deviations from the sample mean”

Same Mean, Different Standard Deviations

[Figure: four frequency histograms with the same mean but different standard deviations; y-axis: Frequency]

101

• Take each score (x) and determine its difference from the mean (μ), then square the difference (to get rid of negatives and also to “penalize” outlying scores).

• Add up the squared differences (∑) and divide by the number of scores (n, or n-1 if doing this for a sample) to get the mean squared difference. This is the Variance.

• Take the square-root of the variance (to “undo” the squaring from before) to get the Standard Deviation.

Calculating σ

σ = √( ∑(x - μ)² / (n - 1) )

102

             x            Deviation from the mean    Squared deviation from the mean
Amy          0            -2                         4
Bob          0            -2                         4
Ced          2            0                          0
Dan          4            2                          4
Eva          4            2                          4
Sum:         10           0                          16
Sum / n:     2 (=mean)    0                          3.2
Sum / n-1:   n/a          n/a                        4 (=variance)

Standard Deviation = Square root of variance = √4 = 2
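The worked table can be reproduced step by step in Python. This sketch follows the sample (n - 1) version used in the table; the variable names are arbitrary.

```python
import math

scores = {"Amy": 0, "Bob": 0, "Ced": 2, "Dan": 4, "Eva": 4}
values = list(scores.values())
n = len(values)

mean = sum(values) / n
deviations = [x - mean for x in values]

sum_of_deviations = sum(deviations)                   # always zero around the mean
mean_abs_dev = sum(abs(d) for d in deviations) / n    # mean absolute deviation
sum_sq = sum(d ** 2 for d in deviations)              # sum of squared deviations
variance = sum_sq / (n - 1)                           # sample variance (n - 1)
sd = math.sqrt(variance)                              # standard deviation

print(mean, sum_of_deviations, mean_abs_dev, sum_sq, variance, sd)
```

`statistics.stdev(values)` gives the same standard deviation, since it also divides by n - 1.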

103

• The SEM is the σ divided by the square root of n. SEM = σ / √n

• In our example, the SEM is 2 / √5 = .894

• “Standard” means “expected”, as in “you can expect that the difference between the sample mean and population mean will be ≤ SEM.”

• In our example, the pop mean is predicted to be between 1.106 and 2.894 (2 ± .894)

• SEM is the most commonly used measure of variability for error bars in graphs

Standard Error of the Mean

104

Confidence Intervals

• For normally-distributed data, there is a 95% chance that the pop mean is within 1.96 SEM of the sample mean

• Therefore, mean ± 1.96 SEM is the “95% confidence interval”

• 95% CI is sometimes used as error bar value in graphs
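Using the same five scores as in the standard deviation example (0, 0, 2, 4, 4), the SEM and the 95% confidence interval can be sketched as follows. The 1.96 multiplier follows the rule stated above; for a sample this small, a multiplier from the t distribution would usually be used instead and would give a wider interval.

```python
import math
import statistics

values = [0, 0, 2, 4, 4]

mean = statistics.mean(values)
sd = statistics.stdev(values)          # sample standard deviation (n - 1)
sem = sd / math.sqrt(len(values))      # standard error of the mean

# 95% confidence interval using the normal-distribution multiplier of 1.96
ci_95 = (mean - 1.96 * sem, mean + 1.96 * sem)

print(round(sem, 3), tuple(round(bound, 3) for bound in ci_95))
```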

105

Is This Drug Effective?

[Figure 1. Mean symptom severity as a function of drug dosage. x-axis: Drug Dosage (mg), 0 to 100; y-axis: Symptom Severity, 0 to 20]

106

• A graph feature showing a measure of variability around a mean or median, typically 1 SEM or 1.96 SEM (= 95% CI)

• A graph showing means/medians without error bars is uninterpretable

• ALWAYS have error bars on graphs showing mean/median. Exception: error bars may be too small to show, if you’re lucky (but say that!)

Error Bars

107

• In reality the relationship between error bars and significance is complex and one should never rely on this kind of visual analysis alone

• However, as a rule of thumb, the following can be taken as statistically significant differences:

• SEM error bars have a gap between them that is equal in size to the average length of the error bars

• 95% CI error bars are overlapped by no more than a quarter of their length

• Exception: If design is within-subjects, it gets complicated

Assessing Significant Differences via Error Bars

108

[Figure 1, shown in two versions. Mean symptom severity as a function of drug dosage. Error bars show one SEM. x-axis: Drug Dosage (mg), 0 to 100; y-axis: Symptom Severity, 0 to 20]

Is This Drug Effective?

109

For more practice with measures of variability (which I recommend), go to khanacademy.org

http://tinyurl.com/3hdbww6 (pop. variance) http://tinyurl.com/3vmhagf (sample variance) http://tinyurl.com/3j755nk (standard dev.)

etc. etc...

110

Questions

• Why are error bars important to graphs that show means/medians?

• What is the mean deviation from the mean of this set of data: 3, 4, 9.5, 11.3634

• What is the most commonly-used MoV?

• What is the most common MoV used for error bars in graphs?

111

Measures of Association

• Correlation: The degree to which two variables are linearly related

• Many statistical values are used to measure this; Almost all range from -1 to +1

• Sign (+/-) gives direction of relationship

• Number (0 to 1) gives strength

• Careful, most relationships are not linear!
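A hand-rolled Pearson's r shows both points: the value runs from -1 to +1, and a perfectly lawful but non-linear relationship can still produce r = 0. The data sets here are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: co-variation of x and y scaled by their separate variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Perfect positive and negative linear relationships
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# Perfectly lawful but non-linear (y = x squared): linear correlation misses it
xs = [-2, -1, 0, 1, 2]
r_curve = pearson_r(xs, [x ** 2 for x in xs])

print(r_pos, r_neg, r_curve)
```

This is why a scatterplot should be inspected before trusting a correlation coefficient.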

112

• Row 1: Clouds of data points showing various degrees of relationship. Note varying strength and direction

• Row 2: Degree of correlation does not tell about slope of the relationship between the variables

• Row 3: Several lawful but non-linear relationships, which are not detected by correlation (linear) stats.

113

Some Measures of Correlation

• Pearson’s r for continuous normal interval / ratio data.

• Spearman’s ρ for continuous data that is non-normal and/or ordinal

• Many others exist: Kendall’s tau, Point-biserial, etc. These are used for discrete data, nominal data, etc.

• More on these in chapter 9
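One common way to compute Spearman's ρ is to convert each variable to ranks and then apply Pearson's r to the ranks. The sketch below assumes that approach, with tied scores sharing their average rank; the data and helper names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)


def ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = shared_rank
        start = end + 1
    return result


def spearman_rho(xs, ys):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Hypothetical data: y always rises with x, but not in a straight line
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

print("Pearson r:   ", pearson_r(xs, ys))
print("Spearman rho:", spearman_rho(xs, ys))
```

Because ρ only uses rank order, it reaches 1 for any relationship that always rises, whereas r falls below 1 when the rise is not a straight line.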

114

Four sets of data with the same correlation of 0.81 (also same mean and standard deviation!)
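The slide's figure is not reproduced here. The best-known data with this property is Anscombe's quartet, and the sketch below assumes that is what the slide showed. It checks that the four sets share nearly identical means, standard deviations, and correlations, even though their scatterplots look completely different.

```python
import math
import statistics

# Anscombe's quartet (assumed to be the data behind the slide's figure)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson's r computed from deviations around the means."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)


# Summary statistics for each of the four sets
summary = {
    name: {
        "mean_y": statistics.mean(ys),
        "sd_y": statistics.stdev(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Summary statistics alone cannot tell these four sets apart; only plotting them does.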

115

Exploratory Graphs

• Frequency distributions/histograms

• Scatterplots

• Box plots

• And many more...

116

Frequency Histogram: Not the same as a bar graph. Shows the frequency of different scores occurring

[Figure: frequency histogram. Vertical axis: Frequency (0 to 20). Horizontal axis: Grade (1 to 14).]

117

Frequency Histogram

• Tells more about group than just mean and other summary stats will

• Groups with same mean can have very different distributions, and vice versa

• FH can help to check assumptions of inferential statistical tests
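A small sketch of the second point: two hypothetical groups with the same mean but very different distributions. The scores and function names are invented for illustration, and the histogram is drawn as plain text.

```python
from collections import Counter
from statistics import mean


def frequency_table(scores):
    """Count how often each score occurs, ordered by score."""
    return dict(sorted(Counter(scores).items()))


def text_histogram(scores):
    """Draw one row per score, with one '#' per occurrence."""
    table = frequency_table(scores)
    return "\n".join(f"{score:>3} | {'#' * count}" for score, count in table.items())


# Hypothetical groups: same mean, different shapes
clustered = [4, 5, 5, 5, 5, 6]   # everyone near the middle
split = [1, 1, 2, 8, 9, 9]       # two clumps, nobody in the middle

print("means:", mean(clustered), mean(split))
print(text_histogram(clustered))
print()
print(text_histogram(split))
```

Both groups have a mean of 5, but the second is bimodal, which the mean alone would never reveal.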

118

• Normal (“Bell curve”, “Gaussian”)

• Many things in nature are like this

• Many inferential stats techniques require that your data is (roughly) normally distributed

• Flat (“White”, “Uniform”)

• All values have roughly same frequency

Common Frequency Distribution Shapes

119

• Skewed Distribution:

• Asymmetrical tails. “Skewer” points either in the positive or negative direction

• Examples: Income, Reaction time.

• Bi-modal (or multimodal) distribution: Multiple peaks. Usually a result of mixing two (or several) unimodal distributions

Common Frequency Distributions

120

• Good way to represent relationship between two (sometimes more) variables

• Each point represents the scores of a single individual. Keeps emphasis on raw data (good!)

• Scatterplot gives better idea of details of the relationship than simply looking at a summary statistic (e.g., Pearson’s r; Spearman’s rho).

Scatterplots

121

Looking at a scatterplot gives a better idea of the details of the data than simply looking at a summary statistic. This can:

1. Uncover non-linear relationships in the data that the most common summary stats may not pick up.

2. Uncover unusual groups of data points that may point to problems with sample selection, or lead to new research ideas.

3. Uncover inhomogeneity of variance that violates assumptions of Pearson’s r and Spearman’s rho

Scatterplots

122

Examples of APA Style Scatterplots

123

Fancy Scatterplot (Actually a “Bubble Plot”)

124

Another Scatterplot Example

125

• Look at axis labels (and decide if up=good or up=bad)

• Look at ranges of axes to get sense of scale

• If there are multiple lines or sets of points on the graph, pick one and figure it out, then compare to others (use the legend)

• If the points on the graph represent MoCTs and there are no error bars, stand up and yell “there are no error bars, your graph is meaningless!” (exception: error bars too small to show up)

How to Read a Graph

126

0123456789

1 2 3 4

YX

Therapy Type

Weeks of Therapy

Reading Graphs: A Tale of Two Therapies

127

128

Questions

• How is a frequency histogram different from a bar chart?

• What is the main problem with relying solely on summary stats, such as correlation measures?

129

Inferential Statistics

• Goal is to generalize from sample to population. Example: If a sample has a mean of 10 and an SEM of 5, one can infer a 68% probability that the population from which it was drawn has a mean between 5 and 15. There is a 95% chance that the population mean lies within 10 ± (1.96 × 5), that is, between 0.2 and 19.8.

• Above assumes that the population is normally distributed, and that the sample had n>30 and was randomly selected
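The arithmetic in the example can be checked with a short sketch. It uses the normal distribution from Python's standard library to show where the 68% and 95% figures come from; the function names are hypothetical, and the same normality and sample-size assumptions apply.

```python
from statistics import NormalDist


def interval_around_mean(sample_mean, sem, multiplier):
    """Return (lower, upper) = mean minus/plus multiplier x SEM."""
    half_width = multiplier * sem
    return (sample_mean - half_width, sample_mean + half_width)


def coverage(multiplier):
    """Share of a normal distribution lying within +/- multiplier SDs."""
    standard_normal = NormalDist()
    return standard_normal.cdf(multiplier) - standard_normal.cdf(-multiplier)


# The slide's example: sample mean of 10, SEM of 5
band_68 = interval_around_mean(10, 5, 1.0)
band_95 = interval_around_mean(10, 5, 1.96)

print(band_68, round(coverage(1.0), 3))
print(band_95, round(coverage(1.96), 3))
```

One SEM either side of the mean covers about 68% of a normal distribution, and 1.96 SEM either side covers about 95%.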

130

Inferential Statistics

• Given two samples with means±SEM of 10±1 and 8±1, what is the chance that the populations they were drawn from differ?

• Inferential stats techniques calculate the probability (p) that two such samples could have been drawn from the same population. That is, the chance that the difference is just due to random error.

• As the difference between the samples grows and the SEM shrinks, p goes down. Thus one’s confidence that the difference is “real” grows.
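As a rough sketch of how p responds to the difference and the SEM, the code below uses a normal (z) approximation for two independent samples. This is an assumption made for illustration; with small samples a t-test would normally be used and would give a somewhat larger p. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approximate_p(mean_a, sem_a, mean_b, sem_b):
    """Two-tailed p for the difference between two independent means.

    Uses a normal approximation: the standard error of the difference
    is the square root of the sum of the squared SEMs.
    """
    se_difference = math.sqrt(sem_a ** 2 + sem_b ** 2)
    z = abs(mean_a - mean_b) / se_difference
    return 2 * (1 - NormalDist().cdf(z))


p_slide_example = approximate_p(10, 1, 8, 1)        # the 10±1 vs 8±1 case
p_bigger_difference = approximate_p(12, 1, 8, 1)    # larger difference, same SEM
p_smaller_sem = approximate_p(10, 0.5, 8, 0.5)      # same difference, smaller SEM

print(round(p_slide_example, 3))
print(round(p_bigger_difference, 4))
print(round(p_smaller_sem, 4))
```

Under this approximation the 10±1 versus 8±1 case gives p of roughly .16, and p drops sharply when either the difference grows or the SEM shrinks.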

131

Reading Inferential Statistics

“An independent groups t-test showed that the difference between groups was significant, t(29) = 4.23, p < .05”

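Inferential statistical technique (is it the right one?)

Obtained t value (the test statistic, which is compared against a critical value)

Degrees of freedom (related to sample size)

Probability of a difference at least this large if the null hypothesis is true

(depends on degrees of freedom and the obtained t value)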

132
