Lecture Week 4 Inspecting Data: Distributions
Introduction to Research Methods & Statistics
2013 – 2014
Hemmo Smit
So next week
No lecture & workgroups
But…
Practice Test on-line (BB)
Enter data for your own research
Practice SPSS skills with own data
Overview
Descriptive research
Describing and presenting data
Frequency distributions
Graphical displays (1)
Measures of Central Tendency and Variability
Graphical displays (2): Boxplots
Read:
Leary: Chapter 6
Howell: Chapter 2
Types of descriptive research
Survey research: attitudes, lifestyles, behaviors, problems
Demographic research: patterns of basic life events (birth, marriage, migration, death)
Epidemiological research: occurrence of disease and death
3 types of surveys
Cross-sectional
- One-shot "cross-section" of the population
Successive independent samples
- Changes over time
- Different respondents each time
- ! Are samples comparable?
Longitudinal (panel survey design)
- Changes over time
- Same respondents more than once
- ! Drop out
Describing and presenting data
3 criteria for a good description:
1) Accurate
2) Concise
3) Comprehensible
Data can be presented in numerical and graphical format
Beware: Scale of measurement?!?
TIP: Always start with graphs
Trade-off
- Loss of information
- Possible distortion
How to describe a distribution?
A) Overall pattern
1) Shape
- number of peaks (uni-, bi- or multi-modal)?
- symmetrical or skewed?
2) Central tendency / Location: midpoint
3) Spread: a little or a lot?
B) Deviations from the pattern
- Outliers: observations that lie far from the majority
- Tails: thick or thin?
Frequency distributions: Example
How do children recall stories?
Respondents: 25 children
Task: Tell researcher about a movie
Dependent variable: number of “and then…” statements
(see Howell, Exercise 2.1, p.55)
Raw data and frequency distributions
Table 1. Number of "and then" statements (raw data)
18 17 16 18 15
15 18 16 20 18
22 20 17 21 17
19 17 21 20 19
18 12 23 20 20
Table 2. Number of "and then" statements (frequency distribution)
Score f P
12 1 0.04
15 2 0.08
16 2 0.08
17 4 0.16
18 5 0.20
19 2 0.08
20 5 0.20
21 2 0.08
22 1 0.04
23 1 0.04
Total 25 1.00
Absolute and relative frequencies
Absolute frequency (f)
= Number of respondents with a given score
Disadvantage: hard to interpret / compare
Relative frequency (P)
= Proportion of the total with a given score (P = f / n)
Advantage: easy to interpret
Note:
0 ≤ P ≤ 1
P × 100 = %
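The definitions above (f as a count per score, P = f / n) can be sketched in a few lines of Python. This is only an illustration using the standard library, not the SPSS procedure used in the course; the variable names are chosen here for the example.

```python
from collections import Counter

# Raw scores from the example: number of "and then..." statements per child.
scores = [18, 17, 16, 18, 15,
          15, 18, 16, 20, 18,
          22, 20, 17, 21, 17,
          19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
absolute = Counter(scores)                                  # f: count per score
relative = {score: f / n for score, f in absolute.items()}  # P = f / n

print("Score   f     P     %")
for score in sorted(absolute):
    f, p = absolute[score], relative[score]
    print(f"{score:>5} {f:>3}  {p:.2f}  {p * 100:>4.0f}")
print(f"Total {n:>3}  {sum(relative.values()):.2f}")
```

The printed rows should match the frequency table for the 25 children, with the P column summing to 1.00.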
SPSS: Frequencies - Menu
Analyze > Descriptive Statistics > Frequencies
SPSS: Frequencies – Dialog box
SPSS: Frequencies - Output
Grouped frequency distribution (1)
Simple frequency distributions unclear in case of:
- small number of participants in each category and/or
- variables with many categories
Solution: grouped frequency table
Distribute the raw data over K class intervals and make a new frequency distribution
Make sure all intervals are:
- exhaustive and mutually exclusive
- of equal width
Grouped frequency distribution (2)
Score f P
12-14 1 0.04
15-17 8 0.32
18-20 12 0.48
21-23 4 0.16
total 25 1.00
Rule 1: number of classes (K) = √n
Rule 2: class interval width (I) = range / number of classes
(Range (R) = highest score – lowest score)
In our example
Number of intervals = √25 = 5
Range = 23 – 12 = 11
Interval width = 11 / 5 = 2.2, rounded to 2 or 3 (the grouped table uses width 3, which gives 4 intervals)
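The two rules of thumb can be checked with a short Python sketch. The choice of width 3 and of starting the first interval at the lowest score are assumptions made here to reproduce the grouped table; the rules themselves only suggest a width of about 2.2.

```python
import math

scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18,
          22, 20, 17, 21, 17, 19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
k = round(math.sqrt(n))                   # Rule 1: number of classes
score_range = max(scores) - min(scores)   # highest score - lowest score
raw_width = score_range / k               # Rule 2: range / number of classes

# Assumption: round the width up to 3 and start at the lowest score.
width = 3
grouped = {}
lower = min(scores)
while lower <= max(scores):
    upper = lower + width - 1
    # Intervals are mutually exclusive and of equal width.
    grouped[(lower, upper)] = sum(lower <= s <= upper for s in scores)
    lower += width

print(f"K = {k}, range = {score_range}, suggested width = {raw_width:.1f}")
for (lo, hi), f in grouped.items():
    print(f"{lo}-{hi}  f = {f:>2}  P = {f / n:.2f}")
```

With width 3 the 12 possible scores (12 to 23) fit into 4 intervals rather than the 5 suggested by Rule 1, which shows that the rules are starting points and not strict requirements.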
SPSS: Grouped frequency distribution (1)
SPSS: Grouped frequency distribution (2)
SPSS: Grouped frequency distribution (3)
SPSS: Grouped frequency distribution (4)
SPSS: Grouped frequency distribution (5)
Cumulative frequency distributions (1)
Real lower limit = lower limit – 0.5
Real upper limit = upper limit + 0.5
Midpoint = (upper limit + lower limit) / 2
Class interval   Real lower limit   Real upper limit   Midpoint   f   P   F
12-14            11.5               14.5               13
15-17            14.5               17.5               16
18-20            17.5               20.5               19
21-23            20.5               23.5               22
Total
Cumulative frequency distributions (2)
F = Cumulative Relative Frequency (CRF): the proportion for this class plus the proportions of all lower classes.
Class interval   Real lower limit   Real upper limit   Midpoint   f    P      F
12-14            11.5               14.5               13         1    0.04
15-17            14.5               17.5               16         8    0.32
18-20            17.5               20.5               19         12   0.48
21-23            20.5               23.5               22         4    0.16
Total                                                             25   1.00
Cumulative frequency distributions (3)
NB. Also possible: cumulative absolute frequency
Class interval   Real lower limit   Real upper limit   Midpoint   f    P      F
12-14            11.5               14.5               13         1    0.04   0.04
15-17            14.5               17.5               16         8    0.32   0.36
18-20            17.5               20.5               19         12   0.48   0.84
21-23            20.5               23.5               22         4    0.16   1.00
Total                                                             25   1.00
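The columns of the cumulative table can be derived from the class limits and frequencies alone. The following Python sketch is one way to do this; the dictionary keys are names chosen for this example.

```python
# Class intervals (lower limit, upper limit, f) from the grouped table.
classes = [(12, 14, 1), (15, 17, 8), (18, 20, 12), (21, 23, 4)]

n = sum(f for _, _, f in classes)
rows = []
cum_f = 0  # cumulative absolute frequency
for lower, upper, f in classes:
    cum_f += f
    rows.append({
        "interval": f"{lower}-{upper}",
        "real_lower": lower - 0.5,        # real lower limit
        "real_upper": upper + 0.5,        # real upper limit
        "midpoint": (upper + lower) / 2,  # (upper + lower) / 2
        "f": f,
        "P": f / n,                       # relative frequency
        "F": cum_f / n,                   # cumulative relative frequency
    })

for r in rows:
    print(f"{r['interval']:>6} {r['real_lower']:>5} {r['real_upper']:>5} "
          f"{r['midpoint']:>4.0f} {r['f']:>3} {r['P']:.2f} {r['F']:.2f}")
```

F is computed from the running count divided by n, which also gives the cumulative absolute frequency (`cum_f`) as a by-product.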
Cumulative frequency distributions (4)
The cumulative relative frequency polygon plots, for each score X,
the proportion of respondents with a score of X or lower.
Graphical displays: Nominal / Ordinal
[Figure: bar charts (Count by score) and pie charts, for raw data (scores 2 to 9) and grouped data (classes 2-3, 4-5, 6-7, 8-9)]
Graphical displays: Interval
Freq. Stem & Leaf
1,00 Extremes (=<12,0)
2,00 15. 00
2,00 16. 00
4,00 17. 0000
5,00 18. 00000
2,00 19. 00
5,00 20. 00000
2,00 21. 00
1,00 22. 0
1,00 23. 0
Stem width: 1
Each leaf: 1 case(s)
[Figure panels: Histogram; Stem & Leaf Display]
Histogram – symmetrical or skewed?
[Figure: three histograms, labelled Negatively skewed, Symmetrical and Positively skewed]
SPSS: Graphs – Chart Builder / Legacy Dialogs
SPSS: Graphs > Legacy Dialogs
SPSS - Graphs > Chart builder
Measures of central tendency
1. Mode (Mo) = most common score
2. Median (Mdn) = middle score (50th percentile)
3. Mean (M) = average
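Median location:
\[ \text{Median location} = \frac{N + 1}{2} \]
Mean:
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \quad \text{or} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]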
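As a hedged illustration, the three measures can be computed for the 25 story-recall scores with Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The median is also looked up by hand through its location, (N + 1) / 2, to show that both routes agree.

```python
import statistics

scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18,
          22, 20, 17, 21, 17, 19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
ordered = sorted(scores)

modes = statistics.multimode(scores)   # all most common scores
median = statistics.median(scores)     # middle score
mean = statistics.mean(scores)         # sum of scores / n

# Median by hand: its location in the ordered scores is (N + 1) / 2.
median_location = (n + 1) / 2
median_by_hand = ordered[int(median_location) - 1]  # n is odd here

print("Mode(s):", sorted(modes))
print("Median location:", median_location, "-> median:", median_by_hand)
print("Mean:", round(mean, 2))
```

For these data two scores share the highest frequency, so the distribution has two modes; this is why the sketch uses `multimode` and not `mode`.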
Central tendency and skewness
Shape     positive skew   symmetrical   negative skew
Mode      A               A             C
Median    B               A             B
Mean      C               A             A
Measures of variability
1. Range (R) = Highest score – Lowest score
2. Interquartile range (IQR) = Q3 – Q1
3. Standard deviation (s or σ) = square root of the variance; spread around the mean in the original units
4. Variance (s² or σ²) = mean squared deviation from the mean (for a sample: divide by n – 1)
Variance and standard deviation
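Score   Deviation         Squared deviation
$x_1$   $x_1 - \bar{x}$   $(x_1 - \bar{x})^2$
$x_2$   $x_2 - \bar{x}$   $(x_2 - \bar{x})^2$
$x_3$   $x_3 - \bar{x}$   $(x_3 - \bar{x})^2$
…       …                 …
$x_n$   $x_n - \bar{x}$   $(x_n - \bar{x})^2$
Sum     0                 ≥ 0
Variance:
\[ s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Standard deviation:
\[ s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]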
The standard deviation and variance are:
- only suitable as measures of spread around the mean
- not robust against outliers
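The Score / Deviation / Squared table translates directly into code. The sketch below applies it to the 25 story-recall scores and uses the sample formulas with n - 1 in the denominator; the variable names are chosen for this example.

```python
import math

scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18,
          22, 20, 17, 21, 17, 19, 17, 21, 20, 19,
          18, 12, 23, 20, 20]

n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]   # x_i - mean; these sum to 0
squared = [d ** 2 for d in deviations]    # (x_i - mean)^2; never negative

variance = sum(squared) / (n - 1)         # s^2
sd = math.sqrt(variance)                  # s

print(f"Sum of deviations: {sum(deviations):.10f}")
print(f"Sum of squared deviations: {sum(squared):.2f}")
print(f"Variance: {variance:.2f}  Standard deviation: {sd:.2f}")
```

Because the deviations are squared, a single extreme score contributes disproportionately to the sum, which is one way to see why these measures are not robust against outliers.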
Five-number summary and boxplot
Five-number summary consists of:
Minimum = Lowest (non-outlying) score
Q1 = 25th percentile (25% lower, 75% higher)
Median (=Q2) = 50th percentile
Q3 = 75th percentile
Maximum = Highest (non-outlying) score
Graphical display: Boxplot
Boxplot - Example
Numerical (five-number summary) Graphical (boxplot)
Data: 3 13 17 19 22 24 25 28 35 39 44 45 83 86 93
Min = 3
Q1 = 19
Mdn = 28
Q3 = 45
Max = 93
IQR = 45 – 19 = 26
Q1 – 1.5*IQR = -20
Q3 + 1.5*IQR = 84
Outliers: 86 and 93 (both above 84); highest non-outlying score = 83
Rule of thumb
Outlier = observation that lies more than 1.5 × IQR
above Q3 or below Q1.
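The boxplot example and the 1.5 × IQR rule of thumb can be reproduced with a short Python sketch. Quartile definitions differ between textbooks and software; the sketch assumes the default "exclusive" method of `statistics.quantiles` (Python 3.8 or later), which for these 15 scores gives the same Q1, median and Q3 as the example.

```python
import statistics

data = [3, 13, 17, 19, 22, 24, 25, 28, 35, 39, 44, 45, 83, 86, 93]

# Assumption: default method="exclusive"; other methods can give other quartiles.
q1, mdn, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inliers = [x for x in data if lower_fence <= x <= upper_fence]

# The whiskers of the boxplot end at the most extreme non-outlying scores.
whisker_low, whisker_high = min(inliers), max(inliers)

print(f"Min = {min(data)}, Q1 = {q1}, Mdn = {mdn}, Q3 = {q3}, Max = {max(data)}")
print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}, whiskers: {whisker_low} to {whisker_high}")
```

The lower fence falls below the smallest score, so only the upper tail produces outliers in this data set.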
Overview
Scale of measurement    Graphical                          Central tendency   Spread
Nominal                 Bar chart, (Pie chart)             Mode               ---
Ordinal                 Boxplot                            Median             Range, IQR
Interval (and higher)   Histogram, (Stem & Leaf display)   Mean               Standard deviation, Variance
What have you learned today?
What are the various ways to represent distributions
numerically?
What are the various ways to represent distributions
graphically?
How to describe a distribution
How to create and evaluate various numerical and
graphical representations of distributions
How to determine what numerical and graphical
representation is suitable for a variable.
Next week
No lecture and workgroups
Practice test on Blackboard
Enter your own data
Read:
Howell: Chapter 3
In two weeks
Normal distribution and standard scores