Basic Statistics
“I always find that statistics are hard to swallow and impossible to digest.
The only one I can remember is that if all the people who go to sleep in church were laid end to end they would be a lot more comfortable.”
[Mrs Robert A Taft]
“Data! Data! Data!” he cried impatiently.
“I can’t make bricks without clay”
[Sherlock Holmes]
Qualitative
a) Nominal data
(dead/alive, blood group O,A,B,AB)
b) Ordered categorical/ranked data
(mild/moderate/severe)
Quantitative
a) Numerical discrete
(no. of deaths in a hospital per year)
b) Numerical continuous
(age, weight, blood pressure)
Presenting data
• Graphs
• Summary statistics
• Tables
Graphical methods
• Pie chart
• Bar chart
• Histogram
• Scattergram
Pie chart
[Figure: pie chart of self-reported pain (extreme pain, moderate pain, no pain)]
Bar chart
[Figure: bar chart of self-reported pain (extreme pain, moderate pain, no pain); y-axis: No. of subjects]
Histogram
[Figure: histogram of Age]
Boxplot
[Figure: boxplot of Age (years) by Gender (male, female)]
Error bar plot
[Figure: error bar plot of 95% CI health status score by mobility (severe problem, some problems, no problem)]
Scattergram
[Figure: scattergram of creatinine vs. digoxin; x-axis: Digoxin, y-axis: Creatinine]
Graph Example
[Figure: SF36 score by SF36 sub-scale (General, Pain, Vitality, Mental, Role emo, Role phys, Soc fun, Phys fun) for two groups: Not ill, Long term ill]
Graph (solution)
[Figure: SF36 score by SF36 sub-scale (General, Pain, Vitality, Mental, Role emo, Role phys, Soc fun, Phys fun) for two groups: Not ill, Long term ill]
Summary statistics
Qualitative data
• Percentages
• Numbers
Secondary prevention of coronary heart disease
                     Respondents (n=1343)   Non-respondents (n=578)
Male                 58% (782)              54% (314)
Urban practice       54% (720)              57% (331)
Practice size:
  < 5,000            14% (190)              18% (105)
  5,000 – 10,000     39% (523)              41% (238)
  > 10,000           47% (630)              41% (235)
Summarizing data example I
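The "percentage (number)" entries in this table can be reproduced from the raw counts and group sizes. A minimal sketch; the helper name `pct_n` is an assumption for illustration and does not come from the slides.

```python
def pct_n(count, total):
    """Format a count as 'percent% (count)', rounded to a whole percent."""
    return f"{round(100 * count / total)}% ({count})"

respondents = 1343       # group sizes taken from the table
non_respondents = 578

print(pct_n(782, respondents))      # male respondents -> 58% (782)
print(pct_n(314, non_respondents))  # male non-respondents -> 54% (314)
```

Reporting the count next to the percentage lets a reader recover the denominator and judge how much data each percentage rests on.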
Summary Statistics
Quantitative data
• Non-normal
median
range
inter-quartile range
• Normal
mean
standard deviation
variance
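Both sets of summaries can be computed with Python's standard `statistics` module. This is a sketch only: the data are hypothetical, the function name `summarize` is an assumption, and quartile definitions differ between packages, so the inter-quartile range may not match other software exactly.

```python
import statistics

def summarize(values):
    """Return the summaries listed above for a list of numbers."""
    xs = sorted(values)
    q1, _, q3 = statistics.quantiles(xs, n=4)  # quartile cut points (default method)
    return {
        "median": statistics.median(xs),
        "range": (xs[0], xs[-1]),
        "iqr": (q1, q3),
        "mean": statistics.mean(xs),
        "sd": statistics.stdev(xs),          # sample standard deviation
        "variance": statistics.variance(xs), # sample variance, equal to sd squared
    }

# Hypothetical lengths of stay in days, made up for illustration only.
stay = [4, 5, 6, 7, 8, 8, 9, 12, 20, 36]
summary = summarize(stay)
for name, value in summary.items():
    print(name, value)
```

For skewed data such as these, the mean is pulled towards the long tail while the median is not, which is why the median and range or inter-quartile range are preferred for non-normal data.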
Boxplot
[Figure: boxplot of Age (years) by Gender (male, female)]
Summary Statistics
Normal data
Approximately 95% of observations lie between the mean plus or minus 2 standard deviations
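The 95% rule can be checked by simulation. The sketch below draws simulated Normal data (the mean of 50, SD of 10, sample size and seed are arbitrary assumptions) and counts how many values fall within two standard deviations of the mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
sample = [random.gauss(50, 10) for _ in range(100_000)]  # simulated Normal data

mean = statistics.mean(sample)
sd = statistics.stdev(sample)
low, high = mean - 2 * sd, mean + 2 * sd

# Proportion of observations inside (mean - 2 SD, mean + 2 SD)
inside = sum(low <= x <= high for x in sample) / len(sample)
print(f"{inside:.1%} of observations lie within mean +/- 2 SD")
```

The printed proportion should be close to 95%. The rule holds only for Normally distributed data; for skewed data the interval can include impossible values.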
Histogram
[Figure: histogram of Age]
Histogram of IgM values
[Figure: histogram of IgM values]
How to test for Normality
• Mean = Median
• (mean-2sd, mean+2sd) reasonable range
• -1 < skewness < 1
• -1 < kurtosis < 1
• Histogram shows symmetric bell shape
Checking for Normality
            Age     Length of stay   Satisfaction score
Mean        66.2    12.1             5.2
Median      67      8                9
SD          8.2     9.0              4.3
Minimum     49      4                1
Maximum     80      36               10
Skewness    -0.2    1.8              -2.5
Kurtosis    0.5     1.3              4.6
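The rules of thumb listed under "How to test for Normality" can be applied mechanically to these summary statistics. This is a rough sketch, not a formal test: the function name, the tolerance used for "Mean = Median" (half an SD) and the lowest possible value assumed for each variable are all assumptions, and only the lower end of the mean ± 2 SD range is checked.

```python
def normality_flags(mean, median, sd, skewness, kurtosis, lowest_possible):
    """Apply the rules of thumb to one variable's summary statistics."""
    return {
        # "Mean = Median": the 0.5 SD tolerance is an assumption, not from the slides.
        "mean_close_to_median": abs(mean - median) <= 0.5 * sd,
        # mean - 2 SD should not fall below the smallest value the variable can take.
        "range_reasonable": mean - 2 * sd >= lowest_possible,
        "skewness_ok": -1 < skewness < 1,
        "kurtosis_ok": -1 < kurtosis < 1,
    }

# Summary statistics copied from the table; lowest_possible values are assumptions.
variables = {
    "Age": dict(mean=66.2, median=67, sd=8.2, skewness=-0.2,
                kurtosis=0.5, lowest_possible=0),
    "Length of stay": dict(mean=12.1, median=8, sd=9.0, skewness=1.8,
                           kurtosis=1.3, lowest_possible=0),
    "Satisfaction score": dict(mean=5.2, median=9, sd=4.3, skewness=-2.5,
                               kurtosis=4.6, lowest_possible=1),
}

for name, stats in variables.items():
    flags = normality_flags(**stats)
    verdict = "plausibly Normal" if all(flags.values()) else "not Normal"
    print(f"{name}: {verdict}")
```

Under these assumptions only Age passes every check. Length of stay fails on skewness, kurtosis and a negative lower limit, and the satisfaction score fails every check.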
Secondary prevention of coronary heart disease
Mean (sd)
                          Respondents (n=1343)   Non-respondents (n=578)
Age (years)               66.2 (8.2)             66.6 (8.7)
Time since MI (mths) *    10 (6, 35)             15 (8, 47)
Cholesterol (mmol/l)      6.5 (1.2)              6.6 (1.2)
[* Median (range)]
Summary statistics example II
Natural log transformation
• Can transform positively skewed data to ‘Normal’ data
• Use transformed data in analysis
• Resulting mean value transformed back (using e^x) to give geometric mean
• Present geometric mean and range
Effect of log_e transformation
            Length of stay   Log_e length of stay
Mean        12.1             2.2
Median      8                2.1
SD          9.0              0.5
Minimum     4                1.4
Maximum     36               3.6
Skewness    1.8              0.4
Kurtosis    1.3              0.7
[Geometric mean = e^2.2 = 9.0]
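The back-transformation is a one-line calculation. The sketch below reproduces the geometric mean from the mean of the logged values and then shows the full procedure on raw data; the function name and the sample data are hypothetical.

```python
import math

def geometric_mean(values):
    """Mean of the natural logs, transformed back with e**x."""
    logs = [math.log(v) for v in values]  # requires every value > 0
    return math.exp(sum(logs) / len(logs))

# Back-transforming the mean of log_e length of stay from the table:
print(round(math.exp(2.2), 1))  # 9.0

# Hypothetical positively skewed lengths of stay, made up for illustration.
stay = [4, 5, 6, 7, 8, 8, 9, 12, 20, 36]
print(geometric_mean(stay) < sum(stay) / len(stay))  # True
```

For positively skewed data the geometric mean is smaller than the arithmetic mean because the log scale reduces the influence of the largest values.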
Secondary prevention of coronary heart disease
Mean (sd)
                          Respondents (n=1343)   Non-respondents (n=578)
Age (years)               66.2 (8.2)             66.6 (8.7)
Time since MI (mths) *    10 (6, 35)             15 (8, 47)
Cholesterol (mmol/l)      6.5 (1.2)              6.6 (1.2)
Length of stay #          9.0 (4, 36)            11.2 (6, 83)
[* Median (range), # Geometric mean (range)]
Confidence Interval
“The estimated mean difference in systolic blood pressure between 100 diabetic and 100 non-diabetic men was 6.0 mmHg with 95% confidence interval (1.1 mmHg, 10.9 mmHg)”
Confidence Interval
• Contains information about the (im)precision of the estimated effect size
• Presents a range of values, on the basis of the sample data, in which the population value for such an effect size may lie
Confidence Interval
95% CI for mean = mean +/- 1.96 SEM
90% CI for mean = mean +/- 1.64 SEM
SEM = sd / sqrt(n)
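These formulas translate directly into code. A minimal sketch, using the age summary for respondents reported elsewhere in these slides (mean 66.2, sd 8.2, n = 1343); the function name `mean_ci` is an assumption.

```python
import math

def mean_ci(mean, sd, n, z=1.96):
    """Normal-approximation CI for a mean: mean +/- z * SEM, with SEM = sd / sqrt(n)."""
    sem = sd / math.sqrt(n)
    return mean - z * sem, mean + z * sem

low95, high95 = mean_ci(66.2, 8.2, 1343)          # 95% CI
low90, high90 = mean_ci(66.2, 8.2, 1343, z=1.64)  # 90% CI
print(f"95% CI: ({low95:.2f}, {high95:.2f})")
print(f"90% CI: ({low90:.2f}, {high90:.2f})")
```

The 90% interval is narrower than the 95% interval, and both shrink as n grows because the SEM falls with the square root of the sample size.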
Confidence Interval
• The 95% CI is a range of values which we are 95% confident covers the true population mean
• About 5% of 95% CIs calculated from repeated samples would not contain the ‘true’ mean
Error bar plot
[Figure: error bar plot of 95% CI health status score by mobility (severe problem, some problems, no problem)]
Confidence Interval Example
Significance/hypothesis tests
Measure strength of evidence provided by the data for or against some proposition of interest
E.g. Is the survival rate after X better than after Y?
Significance/hypothesis tests
Null hypothesis:
“Effects of X and Y are the same”
Alternative hypothesis:
“Effects of X and Y are different”
Significance/hypothesis tests
One-sided:
“X is better than Y”
Two-sided:
“X and Y have different effects”
P-value
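P is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true; it is not the probability that the null hypothesis is true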
P-value
P <= 0.05
• the data provide evidence against the null hypothesis, so it is rejected
• the data suggest there is a difference between X and Y
• result is statistically significant
P-value
P > 0.05
• null hypothesis cannot be rejected (it may be true)
• insufficient evidence of a difference between X and Y
• result is not statistically significant
P-value
Power of study
• probability of rejecting null hypothesis when false
• increased by increasing sample size
• increased if true difference between treatments is large
P-value
Statistical significance does not imply clinical significance
A statistician is a person whose lifetime ambition is to be wrong 5% of the time
Types of significance tests
Chi-square test:
“28 out of 70 smokers have a cough compared with 5 out of 50 non-smokers: is there a significant difference?”
[28/70 = 40% compared with 5/50 = 10%]
Chi-square test result
“P=0.001”
There is a significant relationship between smoking and cough
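The chi-square statistic for this 2x2 table can be computed by hand from the observed and expected counts. The sketch below uses only the standard library; the function name is an assumption, and because the slides do not say whether a continuity correction was used, both versions are shown.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test for the 2x2 table [[a, b], [c, d]]; returns (statistic, p)."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # count expected if there is no association
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)   # Yates continuity correction
            stat += diff ** 2 / expected
    p = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 degree of freedom
    return stat, p

# Smokers: 28 with cough, 42 without. Non-smokers: 5 with cough, 45 without.
stat, p = chi_square_2x2(28, 42, 5, 45)
stat_c, p_c = chi_square_2x2(28, 42, 5, 45, yates=True)
print(f"uncorrected: chi-square = {stat:.2f}, P = {p:.4f}")
print(f"with continuity correction: chi-square = {stat_c:.2f}, P = {p_c:.4f}")
```

Both versions give a P-value below 0.001, which leads to the same conclusion as the slide: the difference in cough rates between smokers and non-smokers is statistically significant.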
Types of significance tests
Two-sample t-test:
“Is there a difference in the 24 hour energy expenditure between groups of lean and obese women?”
Types of significance tests
Mann-Whitney U-test:
“Is there a difference in the nausea score between chemo patients receiving an active anti-emetic treatment and those receiving placebo?”
Types of significance tests
Paired t-test:
“Is there a difference in the dietary intake of a group of students in the week before and after finals?”
Types of significance tests
Wilcoxon matched pairs signed rank test or the Sign test:
“Is there a difference in the units of alcohol consumed by students in the week before and after finals?”
Significance test example
Correlation
Measures the strength of the relationship between two variables
Scattergram
[Figure: scattergram of creatinine vs. digoxin; x-axis: Digoxin, y-axis: Creatinine]
Correlation
Pearson correlation:
• Used for Normally distributed data
• Measures linear relation between variables
Correlation
• r = 0 no relationship
• r = 1 perfect +ve relationship
• r = -1 perfect –ve relationship
Scattergram
[Figure: scattergram of creatinine vs. digoxin; x-axis: Digoxin, y-axis: Creatinine]
Correlation
Spearman correlation:
• Used for non-Normally distributed data
• Measures monotonic relationship between variables
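The difference between the two coefficients shows up clearly on data that rise steadily but not in a straight line. The sketch below implements both by hand; the function names and the data are hypothetical, and Spearman is computed as the Pearson correlation of the ranks.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r: strength of the linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y always increases with x, but along a curve.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(f"Pearson r = {pearson(x, y):.2f}")
print(f"Spearman r = {spearman(x, y):.2f}")
```

Spearman's coefficient is 1 here because the ordering of x and y agrees perfectly, while Pearson's r is below 1 because the points do not lie on a straight line.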
Correlation Example
Correlation
[Figure: scattergram of change in left-ventricular mass (g) vs. change in IGF-1 (ng/ml) for two groups: placebo, rhGH]
“The government are very keen on amassing statistics. They collect them, add them, raise them to the n’th power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases”
[Comment of a judge on the subject of government statistics, 1920]