4. Sampling & Measuring
1
Material From Text Covered in This Section
• Mainly Chapter 4, but also:
• Parts of chapter 9 about correlations and scatterplots
• Part of Chapter 12 on sampling
2
Context
• So far, we’ve covered foundational stuff:
• Scientific thinking
• Ethics
• Getting ideas for research
• Now we get into the technical stuff:
• Getting participants
• Measuring their behaviour
• Analyzing data from measurements
3
Topics
• Sampling & Recruiting
• Measurement
• Data analysis
• Data presentation
4
5
4.1 Sampling & Recruiting
• Population: Entire group of individuals of interest
• Sample: Subset of population tested
• Sampling: Method of selecting individuals to:
• Naturalistically observe
• Invite to participate in study
• Recruiting: Method of inviting the selected individuals to participate in study
Some Vocabulary
6
• If you want to know what the population is like, your sample must be representative: It must reflect the attributes of the population
• Representative samples are important in some kinds of observational research (e.g., surveys)
• But, if you want to know how manipulated variables will affect each other, it is less important
• Representativeness is less important where low variability exists in the population
Sampling
7
• Best way to get representative sample
• Each population member has an equal chance of appearing in the sample
• Typical methods: Random number generation & physical mixing
Random Sampling
8
Random ≠ Arbitrary!
9
Random Sampling in MS Excel
• Column A: Place all population members’ identifiers (i.e., names or numbers)
• Column B: Set B1 to =RAND(), then select from B1 down to the last row and use Ctrl-D to fill the rest of the column
• Then sort (“Data” menu) by Column B. Equivalent to shuffling.
• Then take the first n individuals as your sample.
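The same shuffle-and-take-the-first-n idea can be sketched outside Excel. This is only a minimal illustration in Python: the participant identifiers and the function name `simple_random_sample` are made up for the example, and the seed is there only so the result can be reproduced.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Shuffle a copy of the population and keep the first n members."""
    rng = random.Random(seed)      # seeded generator so the draw can be repeated
    shuffled = list(population)    # copy, so the original list is untouched
    rng.shuffle(shuffled)          # equivalent to sorting by a column of random numbers
    return shuffled[:n]            # take the first n individuals

# Hypothetical identifiers for a population of 50 people
population = [f"P{i:03d}" for i in range(1, 51)]
sample = simple_random_sample(population, n=10, seed=1)
print(sample)
```

Every member has the same chance of ending up in the first n positions, which is what makes the sample random rather than arbitrary.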
10
• Used when sub-populations vary significantly. Example: Voters in different parts of the country
• Each sub-population (stratum) is randomly sampled independently
• Several different versions
Stratified Random Sampling
11
• Proportionate SRS: n’s in sample are proportional to n’s in population. Example: Typical psych class has N=50, with 90% F (45) and 10% M (5). To get a sample with n=10, randomly sample 9 of the 45 F and 1 of the 5 M.
• Disproportionate SRS: Numbers in sample are higher for sub-pops with greater variability. Example: If males vary more in IQ, you might sample more males than females
Stratified Random Sampling
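As a rough sketch of proportionate stratified random sampling, the psych class example (45 F, 5 M, n = 10) might look like this in Python. The function name and the member labels are hypothetical, and simple rounding is assumed for the per-stratum counts, which will not always add up to exactly n for other proportions.

```python
import random

def proportionate_stratified_sample(strata, n, seed=None):
    """Randomly sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(n * len(members) / total)   # assumed rounding rule
        sample[name] = rng.sample(members, k)  # independent random draw per stratum
    return sample

# Hypothetical class: 45 women and 5 men
strata = {
    "F": [f"F{i}" for i in range(45)],
    "M": [f"M{i}" for i in range(5)],
}
sample = proportionate_stratified_sample(strata, n=10, seed=1)
print({name: len(members) for name, members in sample.items()})
```

For a disproportionate design, the per-stratum counts would simply be set by hand instead of being derived from the population shares.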
12
• Sampling larger fractions from strata living in sparsely populated areas.
• Sampling equal numbers from sub-populations that vary widely in size (e.g., males and females in psych classes)
• Ensures that one has homogeneity of variance in samples when comparing sub-populations statistically
Other SRS Strategies
13
Used when random sampling is not practical due to size of population
1. Break pop down into sub-populations
2. Randomly select some sub-pops
3. Randomly select some members of each of the selected sub-pops
Cluster Sampling
14
Example: How to select a sample of 50 from Dawson City (pop 1327)?
1. The town consists of 60 city blocks, each with 20-25 people.
2. Select 10/60 city blocks at random
3. Select 5 individuals randomly from each selected city block.
Cluster Sampling
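The two-stage procedure in the town example can be sketched as follows. This is an illustration only: the town is simulated with a made-up 22 residents on each of 60 blocks, and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: pick members within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

# Simulated town: 60 city blocks, each with 22 residents (made-up numbers)
town = {block: [f"b{block}-r{r}" for r in range(22)] for block in range(1, 61)}

sample = cluster_sample(town, n_clusters=10, n_per_cluster=5, seed=1)
total = sum(len(people) for people in sample.values())
print(len(sample), "blocks,", total, "people")
```

Only the 10 selected blocks need a full list of residents, which is the practical advantage over listing the whole population.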
15
• Most common form of sampling in reality. Also called arbitrary or haphazard sampling
• In most studies, participants are self-sampled, responding to posters, ads, subject pools, etc.
• Participants are not forced to participate, so some degree of self-selection is unavoidable. Exception: Naturalistic observation. But subjects still selected arbitrarily (e.g., those who happen by) not randomly
Convenience Sampling
16
• Non-random sampling can lead to lack of generalizability. BUT, this depends on:
• Variability in the population. Example: Number of eyes vs. political affiliation
• Probability of interaction between variables of interest and variations in population Example: Political affiliation won’t affect number of eyes but will affect attitudes toward abortion
• Every sample is equivalent to a random sample from some population
Convenience Sampling
17
• In order to ensure that one is drawing from the population of interest, one may apply inclusion or exclusion criteria. Example: You must have normal vision to participate in some perception experiments
• Whether a criterion is inclusive or exclusive can sometimes be a matter of semantics
Inclusion/Exclusion Criteria
18
Recruiting
• Many methods can be used:
• Ads: TV, radio, poster, web...
• Direct appeal: In person, phone, email, social networking sites...
• Main ethical issue: Informed consent
• With vulnerable pops, must be extra careful about coercion
19
Discussion / Questions
• What is the difference between random and arbitrary?
• When would you use cluster sampling?
• Discussion: The participant pool controversy. What do you think?
20
4.2 Issues in Measurement
21
The Importance of Careful Measurement
• Of utmost importance in science. We want data, not just casual observations
• Especially important in Psych: 1) Data is noisy enough as it is without adding undue measurement error 2) We have limitations on repeatability
• Treat each measurement as precious
22
• Construct: Abstract concept of interest, such as intelligence, aggression, attitude towards women...
• Measure: Concrete means for measuring construct
• Operational definition = Construct+Measure.
• Choice of construct is guided by question
• Choice of measure usually guided by previous work, or you can come up with your own.
Measures & Constructs
23
Signal to Noise
• Any measured value is a combination of signal (true score) and noise (measurement error).
• Measurement error can be:
• Systematic: E.g., you set your scale 5 kg lower
• Random: E.g., a variety of other factors, such as your posture, the temperature, etc. can affect the scale as well. Arises from factors we cannot or do not control.
24
True Score
Measured score (medium reliability)
Measured score (low reliability)
Measured score (high reliability)
25
True Score
Measured score (medium reliability)
+ Positive Systematic error
+ Negative Systematic error
26
Systematic Error
• Easy to compensate for if you know it’s there.
• Not a problem if comparing relative differences and it’s the same size across groups. Example: Two groups of dieters both use a scale that’s off by 5 kg.
• Problematic if you don’t know it’s there and need the absolute value of the thing you’re measuring.
• Extremely problematic if you don’t know it’s there and it’s different for different groups. (=Confound)
27
Random Error
• Is always present to some degree. You can never quite get the true value
• Can be a problem when comparing groups Example: Which dieters are heavier? Random error may make measured values for the heavier ones lower than those for the lighter ones
• Is dealt with by averaging over many measurements (up/down errors cancel) and by using inferential statistics (on which, more later...)
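A small simulation can illustrate why averaging helps with random error but not with systematic error. All numbers here are invented for the sketch (a true weight of 70 kg, a scale that reads 5 kg low, and random noise with a standard deviation of 2 kg).

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the simulation can be repeated

true_weight = 70.0        # hypothetical true score (kg)
bias = -5.0               # systematic error: the scale reads 5 kg low
noise_sd = 2.0            # spread of the random error (kg)

def measure():
    """One measurement = true score + systematic error + random error."""
    return true_weight + bias + rng.gauss(0, noise_sd)

readings = [measure() for _ in range(1000)]
mean_reading = statistics.mean(readings)

# The random errors largely cancel in the mean, but the mean settles
# near 65 kg (true score plus bias), not near the true 70 kg.
print(round(mean_reading, 2))
```

The average converges on the true score plus the bias, so an unknown systematic error survives any amount of averaging.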
28
Discussion / Questions
• When is measurement error most problematic in research?
• How does one mainly deal with random error?
• What is an operational definition?
29
• Reaction Time (aka, RT or Latency)
• Accuracy (= 100% - Error Rate)
• Preferential looking
• Verbal and written responses
• Behaviour inventories
• Physiological responses
• Brain imaging
• Etc. etc. etc...
Examples of Common Measures
30
• a.k.a. Latency. Time between stimulus (e.g. computer image) and response (e.g. mouse click).
• Usually measured in ms (min. value ≈ 100 ms)
• Inherently skewed (asymmetrical) distribution
Reaction Time
Careful: Important technical issues exist with gathering accurate RTs via computer (MacInnes & Taylor, 2001).
31
• Accuracy: Proportion of times correct answer given
• Error = 100% - Accuracy;
• Normally distributed but only at moderate difficulty
• Individual performance not linear with regard to difficulty, but rather sigmoidal.
• Error is good in combo with RT because “up = bad” for both
Accuracy / Error
32
[Figure: Error. One panel plots Error against Difficulty (Easy to Hard); the other plots Frequency against Error (0 to 50)]
33
[Figure: Accuracy. One panel plots Accuracy against Difficulty (Easy to Hard); the other plots Frequency against Accuracy (50 to 100)]
34
Speed/Accuracy Trade-Off
• Accuracy/Error and RT are both measures of “performance”
• Faster and/or Fewer errors = better
• Slower and/or More errors = worse
• But what if one goes up while the other goes down?
• Faster and More errors = ???
• Slower and Fewer errors = ???
35
Word or Scramble?
TABLE   BLTEA   EXTREMISM   MSIEXTMRE
“As quickly and accurately as possible, say whether the string of letters forms a word or a scrambled word.”
36
Speed-Accuracy Trade-off
[Figure: RT (msec, 0 to 1,000) and Error (%, 0 to 10) plotted against Word Length (Large vs. Small)]
37
Sensitivity
• Sensitivity is symbolized d' (“dee-prime”).
• Measure of one’s ability to detect a given signal (e.g., a dim light, or an ambiguous diagnosis)
• How might we measure sensitivity?
38
(a bad way of) Measuring Sensitivity
• Present stimulus 100 times and note % of times participant says he detects it?
• Problem: P who says “yes” all the time will do very well. (the very problem we are trying to avoid!)
• Need measure that reflects ability to discriminate between “signal present” and “signal absent”
39
(a good way of) Measuring Sensitivity
• Present stimulus (signal + noise) on only half of the trials, (test trials).
• Present no stimulus (noise) on other half of trials, (catch trials).
• Many trials are presented. For each, the participant says “yes, the signal is there” or “no, it isn’t”.
40
Four Possible Results on Each Trial
Participant says... | Signal is really present | Signal is really absent
Yes, I see it | Hit | False Alarm
No, I don’t | Miss | Correct Rejection
41
d’ and Signal Detection Theory
Psychologist says patient is... | Patient is really ill | Patient is really healthy
ill | Hit (illness detected) | False Alarm (health falsely seen as illness)
healthy | Miss (illness overlooked) | Correct Rejection (health correctly identified as such)
42
Also Known As...
Statistical test says... | Difference is really present | Difference is really absent
Yes, it’s there | True Positive | False Positive (type I error)
No, it’s not | False Negative (type II error) | True Negative
43
Albert
Participant says... | Stimulus is really present (n = 100) | Stimulus is really absent (n = 100)
Yes, I see it | 90 | 20
No, I don’t | 10 | 80

Benny
Participant says... | Stimulus is really present (n = 100) | Stimulus is really absent (n = 100)
Yes, I see it | 80 | 10
No, I don’t | 20 | 90

Claire
Participant says... | Stimulus is really present (n = 100) | Stimulus is really absent (n = 100)
Yes, I see it | 90 | 30
No, I don’t | 10 | 70
44
Some Example Results
Sensitivity
• The results of such an experiment yield:
proportion of hits (Ph = Nhits / Ntesttrials)
proportion of FAs (Pfa = Nfa / Ncatchtrials)
• For example, for Albert:
Ph = 90 / 100 = .9
Pfa = 20 / 100 = .2
• Note that we could calc proportions of misses and correct rejections too, but these are redundant (Pm = 1-Ph; Pcr = 1-Pfa)
45
Questions
• What is Benny’s proportion of hits (Ph)?
• What is his proportion of FA’s (Pfa)?
46
Sensitivity
• Perfect participant: Ph = 1, Pfa= 0
• Participant just guessing: Ph = .5, Pfa= .5
• Worst possible participant (perfectly backwards) : Ph = 0, Pfa= 1
• In calculating sensitivity, we want to reward hits and punish FAs, so we could just use “Basic Sensitivity”: BS = Ph - Pfa
47
Sensitivity
• BS equals:
• Perfect participant: Ph of 1 - PFA of 0 = 1
• Participant guessing: Ph of .5 - PFA of .5 = 0
• Backward participant: Ph of 0 - PFA of 1 = -1
• So BS seems to work, right?
48
However, for obscure statistical reasons, BS is, well, B.S.
• Instead we calculate d' = z(Ph) - z(Pfa)
• Converting the proportions of hits and FAs to z-scores yields a more valid result.
• d’ is measured in standard deviation units.
• How to calculate the z scores?
• In Excel, use norminv(P, 0, 1)
• Table A5.1 from Macmillan & Creelman (2005)
• Or the “unit normal” table from any stats textbook.
49
Normal distribution for finding d' (based on Macmillan & Creelman (2005), Signal Detection: A User’s Guide)
50
Sensitivity
• d' = z(Ph) - z(Pfa)
• Albert: Ph = .9 ∴ z = 1.28; Pfa = .2 ∴ z = -0.84; ∴ d’ = 1.28 - (-0.84) = 2.12
• Benny: Ph = .8 ∴ z = 0.84; Pfa = .1 ∴ z = -1.28; ∴ d’ = 0.84 - (-1.28) = 2.12
• Claire: Ph = .9 ∴ z = 1.28; Pfa = .3 ∴ z = -0.52; ∴ d’ = 1.28 - (-0.52) = 1.80
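The same calculation can be sketched in Python, using the standard library's `statistics.NormalDist().inv_cdf` as the z transform (the counterpart of the Excel function mentioned earlier). The function name `d_prime` and the layout of the `results` dictionary are choices made for this example only.

```python
from statistics import NormalDist

def d_prime(hits, n_signal, false_alarms, n_noise):
    """d' = z(Ph) - z(Pfa), with z taken from the standard normal distribution."""
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    p_hit = hits / n_signal           # proportion of hits on test trials
    p_fa = false_alarms / n_noise     # proportion of false alarms on catch trials
    return z(p_hit) - z(p_fa)

# (hits, false alarms) out of 100 test trials and 100 catch trials each
results = {"Albert": (90, 20), "Benny": (80, 10), "Claire": (90, 30)}

for name, (hits, fas) in results.items():
    print(name, round(d_prime(hits, 100, fas, 100), 2))
# Albert 2.12
# Benny 2.12
# Claire 1.81
```

Claire's value comes out as 1.81 rather than 1.80 because the code keeps the unrounded z scores instead of the two-decimal table values. Note that `inv_cdf` raises an error for proportions of exactly 0 or 1.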
51
Zero and One
• What to do with proportions of 0 and 1?
• Technically, these yield z scores of -∞ and +∞
• There are many ways of getting around this (Macmillan & Creelman, 2005, chp. 1). For our purposes, just substitute values of 0.01 and 0.99, respectively.
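The substitution rule can be sketched as a small clamping step applied before the z transform. This is just one way to write it: the helper names are hypothetical, and the 0.01 and 0.99 limits are the ones suggested here for this course, not a general standard.

```python
from statistics import NormalDist

def clamp_proportion(p, low=0.01, high=0.99):
    """Replace proportions of 0 and 1 (or anything beyond the limits) with 0.01 / 0.99."""
    return min(max(p, low), high)

def d_prime_from_proportions(p_hit, p_fa):
    """d' = z(Ph) - z(Pfa), after clamping both proportions."""
    z = NormalDist().inv_cdf
    return z(clamp_proportion(p_hit)) - z(clamp_proportion(p_fa))

# A "perfect" participant: Ph = 1, Pfa = 0. Without clamping, z would be infinite.
print(round(d_prime_from_proportions(1.0, 0.0), 2))   # 4.65
```

With these limits, the largest d' that can be reported is about 4.65, so the choice of substitute values effectively sets a ceiling on the measure.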
52
• Example for self-test: Zack does a sensitivity experiment with 20 test trials and 20 catch trials. He gets 10 hits and 5 false alarms.
• What are his Ph and Pfa values?
• What is his d’?
53
Ph = .50; Pfa = .25
d’ = .67
Questions
Dr. X tests a new field diagnostic procedure by applying it to 50 individuals known to have an illness. He finds that it correctly labels all 50 of them as ill, whereas his old procedure labeled only 40 of them as such. He concludes that the new procedure is better. Is his conclusion sound? Why or why not?
54
• Used with kids, because no verbal feedback required
• Infants will look longer at stimuli that are more interesting / surprising
• Measure looking time with “blind” raters. Also use several raters and check inter-rater reliability
• More in PSY2105 & 3140
Preferential Looking
55
• Very common way to measure higher-level functions such as cognitive ability, personality, social attitudes...
• Responses usually gathered by questionnaires, tests, or interviews
• Entire field devoted to questionnaire/instrument development, called Psychometrics
• There is an art and science to creating good questions
• More in PSY3307
Verbal/Written Responses
Example item: “I find PSY 2174 to be exciting and informative”
Response options: Strongly Agree / Agree / Not sure / Disagree / Strongly Disagree
56
• Questions should be clear and simple, with proper grammar and spelling
• They should be minimum in number, though asking the same question multiple ways gets more reliable data...
• Be careful of question order. Previous questions can influence later answers.
• Avoid leading questions “Do you support Stephen Harper’s harmful and exploitative foreign policy?”
• Avoid “double questions” (see previous slide)
Good Questions & Bad
57
• Close-ended: Yes/No, Multiple-choice, Likert Scale, Numerical ratings...
• Answer options should be inclusive and exhaustive. Often a good idea to include “Other (please specify):_____”
• Answer options should be non-overlapping (age ranges of 18-30, 25-40, 35-60, 50+ make no sense)
• Open ended: Sentence completion, word association, completely unstructured, etc. (must use text analysis)
Types of Questions
58
59
• Center for Epidemiologic Studies Depression Scale (CES-D)
• 9 items, each rated on a 4-point scale. A “Matrix Questionnaire”
• Notice: Same thing asked multiple times, simple structure, few questions
• Has surprisingly good reliability & validity
An Example Questionnaire: CES-D
60
• Observer visually assesses behaviour: Example: Counts occurrences of a behaviour Example: Times how long a behaviour takes
• Problem: Whether or not a behaviour has occurred can be ambiguous and open to interpretation. Solutions: Use several observers and measure their agreement with one another. Also, make sure behaviours are defined in detail
Behaviour Inventories
61
Behaviour Inventory
P# | Start Time | End Time | Waiting? | Driver Sex | Driver Race | Observed Car | Intruder Car
1 | 12:56:37 | 12:57:44 | Y | M | Cau | Matrix, new | F-150, new
2 | 12:58:01 | 12:59:33 | N | M | AA | Sebring, old |
3 | 1:07:37 | 1:08:09 | N | F | Cau | BMW, new |
4 | 1:11:55 | 1:12:30 | Y | F | East Ind | Windstar, new | Cherokee, old
• Start timing when driver opens driver-side door
• Stop timing when the front bumper clears the parking space
• A car is “waiting” if someone is waiting for the spot and the driver turns toward the intruding car before entering car
• Record model and approximate age of car
62
• E.g., heart rate, levels of stress hormones in blood, galvanic skin response.
• Good reliability, but can have questionable validity (physiological response is quite similar for fear, anger, sexual arousal, etc.)
Physiological Responses
63
• fMRI, CAT, EEG, ERP, NIRS, etc.
• All ways of measuring activity in various parts of the brain in response to given stimuli
• Like other physiological indicators, good reliability but questions arise about what exactly one is measuring
Brain Imaging
64
And Many More...
• Text and image analysis, archival research, and on and on...
65
Discussion / Questions
• What are some issues with RT measurements?
• A survey asks “Do you oppose the tyrannical policies of Alan Rock?” What’s wrong with this question?
• What is inter-rater reliability? In what situations would it be an issue?
66
4.4 Measuring Measures
Is Your Measure a Good One?
67
• Reliability: The degree to which successive measurement values are the same.
• Measurement Error: Difference between successive measures of the same thing.
• Systematic error (“bias”); inexact ruler
• Random error (“noise”); rubber ruler
Is Your Measure Reliable?
68
• Validity: The degree to which the measure actually measures what you think it does.
• Face Validity: Does the measure intuitively seem to measure the desired construct?
• Construct Validity: Does the measure actually measure the desired construct?
Is Your Measure Valid?
69
Construct validity is bolstered if your measurements...
...correlate with things they should correlate with: Criterion Validity
...correlate with others from instruments claiming to measure the same thing: Convergent Validity
... don’t (or weakly) correlate with things they shouldn’t correlate with: Discriminant Validity
Is Your Measure Valid?
70
Values from your new IQ test, if it really measures intelligence, should:
• Correlate with grades (+), income (+), crime (-)(criterion validity)
• Correlate with other established IQ tests (+)(convergent validity)
• Not correlate with personality tests (0)(discriminant validity)
Is Your Measure Valid?
71
Construct Validity
[Diagram: Mood Disorders, divided into Depression and Anxiety, with the symptoms Poor Sleep, Negative Mood, Hopelessness, Suicidal Ideation, Irritated Mood, Apprehension, and Muscle Tension]
Does your measure capture “depression” per se or all mood disorders? Or does it merely capture some aspects of depression?
72
• A measure can be reliable but not valid.
• But validity is impossible without reliability.
• Beware tendency to choose reliable (objective / impartial / systematic) measures instead of valid ones
• Reliability without validity is meaningless
Reliability vs. Validity
73
“Because we cannot measure what we value, we begin to value what we measure” - Unknown
Levels of Measurement: N.O.I.R.
• Measurement: “The assigning of numbers to objects based on a rule”
• Can be done at 4 basic levels: Nominal, Ordinal, Interval, or Ratio
• Level determines what math can be done with the measurements and therefore what stats can be used
74
• Labels are applied to participants. Labels are not mathematically related. Examples: Gender (male or female); Sport jersey numbers (11, 22, 44); Country of origin (Afghanistan...Zambia)
• No math can be done with nominal labels
• Use frequency statistics (e.g., chi square test)
Nominal Scale Measurement (aka Categorization)
75
Frequency Data Example
• No math can be done with nominal labels, but instances can be counted, generating frequency data
• Example: Measure hair colour of sample of 1000 people appearing in magazines vs. 1000 naturalistically observed at the mall
• Out of 1000 of each, what is the frequency of blond, brown, black, or red hair?
[Figure: frequency (0 to 500) of each hair colour (Blonde, Brown, Black, Red) for Magazines vs. Mall]
76
• Ranks are assigned to participants. Ranks have logical order, but spacing is undefined. Examples: Birth order (1st born, 2nd born, 3rd...); Social class (lower, middle, upper); GPA (the numbers do not indicate even intervals)
• Because units are not evenly spaced, math options are limited and so are statistical procedures
• Use nonparametric inferential stats for this kind of data
Ordinal Scale Measurement (AKA Ranking)
77
• Numbers on a scale are assigned to participants. Numbers have meaningful equal spacing, but the zero value is arbitrary. Example: Celsius temperature. 0° ≠ total absence of heat (thankfully!)
• Addition and subtraction are meaningful, but not division/multiplication.
• Use parametric stats for these kinds of data (if other assumptions, such as normality, are met)
Interval Scale Measurement
78
• Numbers on a scale are assigned to participants. Numbers have meaningful equal spacing and meaningful zero. Example: Kelvin temperature (0° = no heat), height, weight, RT, age, any physical measure
• All mathematical functions are possible. These are “true numbers”
• Use parametric stats for these kinds of data (if other assumptions, such as normality, are met)
Ratio Scale Measurement
79
• Psychometric tests (IQ, personality tests, etc.) are treated as being interval scale, but are they?
• Depends if you choose to focus on the measure (IQ) or the construct (intelligence)
• Consider: Is the difference in intelligence represented by IQ scores of 100 and 110 the same as the intelligence difference between IQs of 140 and 150?
• There’s really no way to answer this question, but people typically treat such measures as interval
The Scale Debate
80
• Along with design of study, level of measurement determines which descriptive and inferential stats techniques are appropriate
• More techniques exist (and more powerful ones) for interval / ratio than for ordinal or nominal
• Some mis-use statistical analyses on the wrong scale of data, making for meaningless results.
The Importance of Level of Measurement
81
Questions
• What are some aspects of construct validity?
• A researcher measures religiosity in terms of # of times Ps attend church in a year. What’s the level of measurement?
• What kind of mathematical operations can be done with nominal data?
82
4.5 Statistical Analysis
83
• Exploratory: Examine the raw data to get a better understanding of it
• Descriptive statistics: Summarize the characteristics of the sample(s)
• Inferential stats: Infer things about the population(s) from the sample(s)
Three Types of Stats
84
• MoCTs give a summary impression of group data.
• Arithmetic Mean: Simply the average. Sum of all scores over number of scores. (see also Geometric Mean & Harmonic Mean)
• Median: The middle score when all scores are placed in order, (or the arithmetic mean of the two middle scores if an even number of scores).
• Mode: The most frequently occurring score (not always well-defined)
• Do examples in text to get practice!
Measures of Central Tendency (MoCTs)
85
MoCTs: Examples
Data (x): 80, 140, 90, 110, 100, 100, 110, 110, 120, 90
Data in order: 80, 90, 90, 100, 100, 110, 110, 110, 120, 140
Arithmetic Mean: ∑x = 1050, N = 10, ∴ μ = 105
Median: N = 10; (N+1)/2 = 5.5; ∴ M = (5th score + 6th score) / 2 = (100 + 110) / 2 = 105
Bins | Freq.
80 | 1
90 | 2
100 | 2
110 | 3
120 | 1
130 | 0
140 | 1
Mode: Determine frequencies of each score. Score with max frequency is 110
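The three measures for this example can be checked with Python's standard `statistics` module. This is a quick sketch using the ten scores from the slide; the variable names are chosen for the example.

```python
import statistics
from collections import Counter

scores = [80, 140, 90, 110, 100, 100, 110, 110, 120, 90]

mean = statistics.mean(scores)       # sum of scores / number of scores
median = statistics.median(scores)   # mean of the 5th and 6th ordered scores
mode = statistics.mode(scores)       # most frequently occurring score

print(mean, median, mode)            # 105 105.0 110

# Frequency table, as in the "Bins / Freq." columns
frequencies = Counter(scores)
for score in sorted(frequencies):
    print(score, frequencies[score])
```

The mean and median happen to coincide here, which is not generally the case once a distribution is skewed.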
86
MoCTs: Things to Watch Out For
• Mean is strongly affected by outliers
• Therefore, median is a better choice for asymmetrical data distributions (e.g., income). Also used for ordinal scale data.
• Mode can be ill-defined, but is the only one usable with nominal scale data
87
Remember “Lying with Averages”
• Year 1: 90% of people make 10 000$, 10% make 1 000 000$. Average = 109 000$
• Year 2: 90% of people still make 10 000$, 10% now make 2 000 000$. Average = 209 000$
• The economy’s doing great! Average income has almost doubled!
• The rich 10% are outliers, and make the mean meaningless. The median for both years is 10 000$ and gives a better model of the data
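The income example can be reproduced in a few lines. This sketch assumes a made-up population of 100 people (90 at the lower income, 10 at the higher one), which matches the percentages in the example.

```python
import statistics

def incomes(rich_income):
    """100 hypothetical people: 90 earn 10 000$, 10 earn rich_income."""
    return [10_000] * 90 + [rich_income] * 10

year1 = incomes(1_000_000)
year2 = incomes(2_000_000)

for label, data in (("Year 1", year1), ("Year 2", year2)):
    print(label, "mean:", statistics.mean(data), "median:", statistics.median(data))
# Year 1 mean: 109000 median: 10000.0
# Year 2 mean: 209000 median: 10000.0
```

The mean nearly doubles while the median does not move, because only the top 10% changed.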
88
Average is Not Enough
• A MoCT is a very simple model of the data in a sample. Much detail is lost.
• It pays to explore the data more closely (e.g., using frequency histogram)
• Samples with similar MoCTs can have very different frequency distributions
89
All of These Groups Have the Same Mean
[Figure: four frequency histograms (y-axis Frequency, 0 to 5; x-axis scores from roughly 80 to 140) with differently shaped distributions]
90
For more practice with measures of central tendency (which I recommend), go to khanacademy.org
http://tinyurl.com/26mksz9 (the average) http://tinyurl.com/3qdjlhf (sample vs. pop.)
91
Questions
• Would it be a good idea to use the mean to assess the central tendency of RT data? Why or why not?
• What is the median of this set of data: 1, 2, 3, 4, 5, 6
• What is the median of this set of data: 1, 2, 3, 4, 5, 1000
92
• MoVs give some idea of how individual scores are spread around a measure of central tendency
• Examples: Range, average deviation, standard deviation, standard error of the mean, etc. etc.
• A comparison of measures of central tendency is meaningless without a measure of variability.
Measures of Variability (MoVs)
93
Range
• Simple range = max score - min score. Extremely subject to outliers, gives no info as to distribution.
• Interquartile range = 3rd quartile-1st quartile. Much more robust, but still not preferred in most fields.
94
Interquartile Range
• Interquartile range = 3rd quartile-1st quartile. Much more robust.
• To find 1st and 3rd quartiles, first find the median
set A = 2 4 6 10 12 20 40 90
set B = 2 4 5 10 12 20 40 90 95
• Recall this is the value at position (n+1)/2
set A = (8+1)/2 = 4.5th value, ∴ (10 + 12)/2 = 11
set B = (9+1)/2 = 5th value, ∴ 12
95
Interquartile Range
• Now take each group and divide into halves, excluding the median if n is odd
set A = (2 4 6 10) (12 20 40 90)
set B = (2 4 5 10) 12 (20 40 90 95)
• Now take the medians of those subgroups
set A: 1st Q = 5.0, 3rd Q = 30
set B: 1st Q = 4.5, 3rd Q = 65
96
Interquartile Range
• Finally, subtract the first quartile from the 3rd
• set A IQR = 30 - 5 = 25
• set B IQR = 65 - 4.5 = 60.5
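The median-of-halves procedure used on these slides can be sketched as below. The function name is made up for the example. Be aware that statistics packages use several different quartile conventions, so built-in quantile functions may give slightly different values for the same data.

```python
import statistics

def quartiles_by_halves(data):
    """1st and 3rd quartiles as medians of the lower and upper halves.

    When n is odd, the overall median is excluded from both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]       # first half of the ordered scores
    upper = ordered[-half:]      # last half (skips the median when n is odd)
    return statistics.median(lower), statistics.median(upper)

set_a = [2, 4, 6, 10, 12, 20, 40, 90]
set_b = [2, 4, 5, 10, 12, 20, 40, 90, 95]

for name, data in (("A", set_a), ("B", set_b)):
    q1, q3 = quartiles_by_halves(data)
    print(name, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
# A Q1 = 5.0 Q3 = 30.0 IQR = 25.0
# B Q1 = 4.5 Q3 = 65.0 IQR = 60.5
```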
97
• Take the difference between the mean and each score, and add up the differences. Simple, right?
• Problem! It always adds up to zero. In fact, the mean is the one value for which this is true.
• Could use mean absolute deviation, but the standard deviation is much more common
Average Deviation
98
Name | x | Deviation from the mean | Absolute deviation from the mean
Amy | 0 | -2 | 2
Bob | 0 | -2 | 2
Ced | 2 | 0 | 0
Dan | 4 | 2 | 2
Eva | 4 | 2 | 2
Sum | 10 | 0 | 8
Sum / n | 2 (= mean) | 0 | 1.6
99
Standard Deviation
• Standard deviation (σ) is the most common MoV
• For normally-distributed data, there are known percentages of scores that lie within a given number of standard deviations of the mean. Example: μ ± 1 σ encompasses 68.27% of scores. Example: μ ± 1.96 σ encompasses 95% of scores
• Variance is simply σ squared
100
“Square-root of the mean of the squared deviations from the sample mean”
Same Mean, Different Standard Deviations
[Figure: four frequency histograms (y-axis Frequency; x-axis scores from roughly 80 to 140) centred on the same mean but with different spreads]
101
• Take each score (x) and determine its difference from the mean (μ), then square the difference (to get rid of negatives and also to “penalize” outlying scores).
• Add up the squared differences (∑) and divide by the number of scores (n, or n-1 if doing this for a sample) to get the mean squared difference. This is the Variance.
• Take the square-root of the variance (to “undo” the squaring from before) to get the Standard Deviation.
Calculating the Standard Deviation (σ)
σ = √( ∑(x - μ)² / (n - 1) )
102
Name | x | Deviation from the mean | Squared deviation from the mean
Amy | 0 | -2 | 4
Bob | 0 | -2 | 4
Ced | 2 | 0 | 0
Dan | 4 | 2 | 4
Eva | 4 | 2 | 4
Sum | 10 | 0 | 16
Sum / n | 2 (= mean) | 0 | 3.2
Sum / (n-1) | n/a | n/a | 4 (= variance)
Standard Deviation = Square root of variance = √4 = 2
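The steps in the table can be followed line by line in Python. This is a sketch using the five scores from the example, with the sample (n - 1) version of the variance; the last line cross-checks against the standard library's `statistics.stdev`, which uses the same n - 1 convention.

```python
import math
import statistics

scores = {"Amy": 0, "Bob": 0, "Ced": 2, "Dan": 4, "Eva": 4}
values = list(scores.values())

n = len(values)
mean = sum(values) / n                              # 2.0
squared_devs = [(x - mean) ** 2 for x in values]    # 4, 4, 0, 4, 4
variance = sum(squared_devs) / (n - 1)              # 16 / 4 = 4.0
sd = math.sqrt(variance)                            # 2.0

print(mean, sum(squared_devs), variance, sd)        # 2.0 16.0 4.0 2.0
print(statistics.stdev(values))                     # 2.0
```

Dividing by n instead of n - 1 would give the population variance of 3.2 shown in the "Sum / n" row.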
103
• The SEM is the standard deviation (σ) divided by the square root of n. SEM = σ / √n
• In our example, the SEM is 2 / √5 = .894
• “Standard” means “expected”, as in “you can expect the difference between the sample mean and the population mean to be about one SEM or less” (true about 68% of the time for normally-distributed data).
• In our example, the pop mean is predicted to be between 1.106 and 2.894 (2 ± .894)
• SEM is the most commonly used measure of variability for error bars in graphs
Standard Error of the Mean
104
Confidence Intervals
• For normally-distributed data, there is a 95% chance that the pop mean is within 1.96 SEM of the sample mean
• Therefore, mean ± 1.96 SEM is the “95% confidence interval”
• 95% CI is sometimes used as error bar value in graphs
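Continuing the five-score example (mean 2, standard deviation 2, n = 5), the SEM and the 95% confidence interval described above can be sketched as follows. The 1.96 multiplier is the one given on this slide; with a sample this small, a t-based multiplier would normally be used instead, so treat the interval as an illustration of the arithmetic only.

```python
import math

mean = 2.0   # sample mean from the running example
sd = 2.0     # sample standard deviation from the running example
n = 5        # number of scores

sem = sd / math.sqrt(n)                  # standard error of the mean
ci_low = mean - 1.96 * sem               # lower limit of the 95% CI
ci_high = mean + 1.96 * sem              # upper limit of the 95% CI

print(round(sem, 3))                          # 0.894
print(round(ci_low, 3), round(ci_high, 3))    # 0.247 3.753
```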
105
Is This Drug Effective?
[Figure: mean symptom severity (0 to 20) plotted against drug dosage (0, 50, 100 mg)]
Figure 1. Mean symptom severity as a function of drug dosage
106
• A graph feature showing a measure of variability around a mean or median, typically 1 SEM or 1.96 SEM (= 95% CI)
• A graph showing means/medians without error bars is uninterpretable
• ALWAYS have error bars on graphs showing mean/median. Exception: error bars may be too small to show, if you’re lucky (but say that!)
Error Bars
107
• In reality the relationship between error bars and significance is complex and one should never rely on this kind of visual analysis alone
• However, as a rule of thumb, the following can be taken as statistically significant differences:
• SEM error bars have a gap between them that is equal in size to the average length of the error bars
• 95% CI error bars are overlapped by no more than a quarter of their length
• Exception: If design is within-subjects, it gets complicated
Assessing Significant Differences via Error Bars
108
[Figures: two versions of the graph of mean symptom severity (0 to 20) against drug dosage (0, 50, 100 mg), each with error bars]
Figure 1. Mean symptom severity as a function of drug dosage. Error bars show one SEM
Is This Drug Effective?
109
For more practice with measures of variability (which I recommend), go to khanacademy.org
http://tinyurl.com/3hdbww6 (pop. variance) http://tinyurl.com/3vmhagf (sample variance) http://tinyurl.com/3j755nk (standard dev.)
etc. etc...
110
Questions
• Why are error bars important to graphs that show means/medians?
• What is the mean deviation from the mean of this set of data: 3, 4, 9.5, 11.3634
• What is the most commonly-used MoV?
• What is the most common MoV used for error bars in graphs?
111
Measures of Association
• Correlation: The degree to which two variables are linearly related
• Many statistical values are used to measure this; Almost all range from -1 to +1
• Sign (+/-) gives direction of relationship
• Number (0 to 1) gives strength
• Careful, most relationships are not linear!
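To make the last point concrete, here is a small sketch that computes Pearson's r by hand for a perfectly linear relationship and for a perfectly lawful but curved one. The data are invented for the illustration and `pearson_r` is a hypothetical helper written out from the usual definition.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y scaled by their individual variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(-5, 6))            # -5, -4, ..., 5
linear = [2 * x + 1 for x in xs]   # y rises steadily with x
curved = [x ** 2 for x in xs]      # y depends entirely on x, but in a U shape

print(round(pearson_r(xs, linear), 2))   # 1.0
print(round(pearson_r(xs, curved), 2))   # 0.0
```

In the curved case y is completely determined by x, yet r is zero, which is why a scatterplot should be checked before trusting a correlation coefficient.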
112
• Row 1: Clouds of data points showing various degrees of relationship. Note varying strength and direction
• Row 2: Degree of correlation does not tell about slope of the relationship between the variables
• Row 3: Several lawful but non-linear relationships, which are not detected by correlation (linear) stats.
113
Some Measures of Correlation
• Pearson’s r for continuous normal interval / ratio data.
• Spearman’s ρ for continuous data that is non-normal and/or ordinal
• Many others exist: Kendall’s tau, Point-biserial, etc. These are used for discrete data, nominal data, etc.
• More on these in chapter 9
114
Four sets of data with the same correlation of 0.81 (also same mean and standard deviation!)
115
Exploratory Graphs
• Frequency distributions/histograms
• Scatterplots
• Box plots
• And many more...
116
Frequency Histogram: Not the same as a bar graph. Shows the frequency of different scores occurring
[Figure: frequency histogram; y-axis Frequency (0 to 20), x-axis values from 1 to 14]
117
Frequency Histogram
• Tells more about group than just mean and other summary stats will
• Groups with same mean can have very different distributions, and vice versa
• FH can help to check assumptions of inferential statistical tests
118
• Normal (“Bell curve”, “Gaussian”)
• Many things in nature are like this
• Many inferential stats techniques require that your data is (roughly) normally distributed
• Flat (“White”, “Uniform”)
• All values have roughly same frequency
Common Frequency Distribution Shapes
119
• Skewed Distribution:
• Asymmetrical tails. “Skewer” points either in the positive or negative direction
• Examples: Income, Reaction time.
• Bi-modal (or multimodal) distribution: Multiple peaks. Usually a result of mixing two (or several) unimodal distributions
Common Frequency Distributions
120
• Good way to represent relationship between two (sometimes more) variables
• Each point represents the scores of a single individual. Keeps emphasis on raw data (good!)
• Scatterplot gives better idea of details of the relationship than simply looking at a summary statistic (e.g., Pearson’s r; Spearman’s rho).
Scatterplots
121
Looking at a scatterplot gives a better idea of the details of the data than simply looking at a summary statistic. This can:
1. Uncover non-linear relationships in the data that the most common summary stats may not pick up.
2. Uncover unusual groups of data points that may point to problems with sample selection, or lead to new research ideas.
3. Uncover inhomogeneity of variance that violates assumptions of Pearson’s r and Spearman’s rho
Scatterplots
122
Examples of APA Style Scatterplots
123
Fancy Scatterplot (Actually a “Bubble Plot”)
124
Another Scatterplot Example
125
• Look at axis labels (and decide if up=good or up=bad)
• Look at ranges of axes to get sense of scale
• If there are multiple lines or sets of points on the graph, pick one and figure it out, then compare to others (use the legend)
• If the points on the graph represent MoCTs and there are no error bars, stand up and yell “there are no error bars, your graph is meaningless!” (exception: error bars too small to show up)
How to Read a Graph
126
[Figure: graph with y-axis from 0 to 9 and x-axis Weeks of Therapy (1 to 4), with separate data series for Therapy Type X and Therapy Type Y]
Reading Graphs: A Tale of Two Therapies
127
128
Questions
• How is a frequency histogram different from a bar chart?
• What is the main problem with relying solely on summary stats, such as correlation measures?
129
Inferential Statistics
• Goal is to generalize from sample to population. Example: If sample has mean of 10 and SEM of 5, one can infer a 68% probability that the population from which it was drawn has a mean between 5 and 15. There is a 95% chance that the population mean lies between 10 ± (1.96 x 5), or 0.2 and 19.8.
• Above assumes that the population is normally distributed, and that the sample had n>30 and was randomly selected
130
Inferential Statistics
• Given two samples with means±SEM of 10±1 and 8±1, what is the chance that the populations they were drawn from differ?
• Inferential stats techniques calculate the probability (p) that two such samples could have been drawn from the same population. That is, the chance that the difference is just due to random error.
• As the difference between the samples grows and the SEM shrinks, p goes down. Thus one’s confidence that the difference is “real” grows.
131
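Reading Inferential Statistics
“An independent groups t-test showed that the difference between groups was significant, t(29) = 4.23, p < .05”
• “independent groups t-test”: Inferential statistical technique (is it the right one?)
• “29”: Degrees of freedom (related to sample size)
• “4.23”: Obtained value of the test statistic (to be compared against a critical value)
• “p < .05”: Probability of getting a result at least this extreme if the null hypothesis is true (depends on degrees of freedom and the obtained value)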
132