Introduction to Statistics
Dr Linda Morgan
Clinical Chemistry Division
School of Clinical Laboratory Sciences
Outline
• Types of data
• Descriptive statistics
• Estimates and confidence intervals
• Hypothesis testing
• Comparing groups
• Relation between variables
• Statistical aspects of study design
• Pitfalls
Descriptive statistics: Categorical variables
• Graphical representation – bar diagram
• Numbers and proportions in each category
[Figure: bar diagram of smoking habit (heavy smoker, light smoker, ex-smoker, non-smoker); y-axis: Percent, 0 to 50]
Descriptive statistics: Continuous variables
• Distributions
  – Gaussian
  – Lognormal
  – Non-parametric
• Central tendency
  – Mean
  – Median
• Scatter
  – Standard deviation
  – Range
  – Interquartile range
[Figure: histogram of maternal age; Mean = 28.0, Std. Dev = 4.84, N = 223]
Gaussian (normal) distribution
• Central tendency
Mean: $\bar{x} = \frac{\sum x}{n}$
• Scatter
Variance $= \frac{\sum (x - \bar{x})^2}{n - 1}$
Standard deviation: $\mathrm{SD} = \sqrt{\text{variance}}$
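The three formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the `ages` list is hypothetical illustration data, not taken from the maternal age sample.

```python
import math

def mean(xs):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical ages, for illustration only
ages = [22, 25, 28, 31, 34]
print(mean(ages), sample_variance(ages), round(sample_sd(ages), 2))
```

Dividing by n - 1 rather than n matches the variance formula on the slide, which estimates the population variance from a sample.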
Gaussian (normal) distribution
[Figure: histogram of plasma renin concentration; Mean = 9.3, Std. Dev = 5.36, N = 73]
Lognormal distribution
[Figure: histogram of log plasma renin concentration; Mean = 0.91, Std. Dev = 0.21, N = 73]
Lognormal distribution
Lognormal distribution
• Mean of the log values $= \frac{\sum \log x}{n}$
• Geometric mean = antilog of the mean of the log values ($10^{\text{mean}}$)
• Median
  – Rank data in order
  – Median = the $\frac{n+1}{2}$th observation
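As a sketch of the two summaries above, the code below computes a geometric mean via base-10 logs and a median by ranking. The `renin` values are hypothetical. For an even number of observations the (n+1)/2th position falls between two ranks, and this sketch assumes the usual convention of averaging the two middle values.

```python
import math

def geometric_mean(xs):
    """Antilog (base 10) of the mean of the log10 values."""
    log_mean = sum(math.log10(x) for x in xs) / len(xs)
    return 10 ** log_mean

def median(xs):
    """Middle value of the ranked data; the (n + 1) / 2th observation."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even n: average the two observations either side of position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical skewed measurements, for illustration only
renin = [1.0, 10.0, 100.0]
print(geometric_mean(renin), median(renin))
```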
Variability
• Variance $= \frac{\sum (x - \bar{x})^2}{n - 1}$
• Standard deviation $= \sqrt{\text{variance}}$
• Range
• Interquartile range
Variability of Sample Mean
• The sample mean is an estimate of the population mean
• The standard error of the mean describes the distribution of the sample mean
• Estimated SEM $= \mathrm{SD}/\sqrt{n}$
• The distribution of the sample mean is Normal provided n is large
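For a concrete sense of scale, the SEM formula can be applied to the summary figures reported with the maternal age histogram (Std. Dev = 4.84, N = 223). This is a minimal sketch; the function name `sem` is just an illustrative choice.

```python
import math

def sem(sd, n):
    """Estimated standard error of the mean: SD divided by sqrt(n)."""
    return sd / math.sqrt(n)

# Summary statistics quoted with the maternal age histogram
maternal_age_sem = sem(4.84, 223)
print(round(maternal_age_sem, 3))  # about 0.324 years
```

The SEM is much smaller than the SD because it describes the uncertainty in the mean, not the spread of individual ages.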
Standard error of the difference between two means
• SEM $= \mathrm{SD}/\sqrt{n}$
• Variance of the mean $= \mathrm{SD}^2/n$
• Variance of the difference between two sample means = sum of the variances of the two means $= \mathrm{SD}_1^2/n_1 + \mathrm{SD}_2^2/n_2$
• SE of difference between means $= \sqrt{\mathrm{SD}_1^2/n_1 + \mathrm{SD}_2^2/n_2}$
Variability of a sample proportion
• Assume Normal distribution when np and n(1-p) are > 5
• SE of a Binomial proportion $= \sqrt{pq/n}$, where $q = 1 - p$
Standard error of the difference between two proportions
• SE $(p_1 - p_2) = \sqrt{\text{variance}(p_1) + \text{variance}(p_2)} = \sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}$
Confidence intervals of means
• 95% CI for the mean = sample mean $\pm\ 1.96 \times$ SEM
• 95% CI for difference between 2 means = $(\text{mean}_1 - \text{mean}_2) \pm 1.96 \times$ SE of difference
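The two intervals above can be sketched as follows. The single-mean example uses the maternal age summary figures quoted earlier (mean 28.0, SD 4.84, n 223); the two-group figures are hypothetical. The multiplier 1.96 assumes a large sample, as on the slide.

```python
import math

Z_95 = 1.96  # Normal multiplier for a 95% interval (large-sample assumption)

def ci_mean(mean, sd, n, z=Z_95):
    """95% CI for a mean: mean +/- z * SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se

def ci_mean_difference(mean1, sd1, n1, mean2, sd2, n2, z=Z_95):
    """95% CI for a difference between two independent means."""
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - z * se_diff, diff + z * se_diff

# Maternal age summary statistics from the histogram
low, high = ci_mean(28.0, 4.84, 223)
print(round(low, 2), round(high, 2))

# Hypothetical two-group comparison, for illustration only
d_low, d_high = ci_mean_difference(30.0, 5.0, 100, 28.0, 5.0, 100)
print(round(d_low, 2), round(d_high, 2))
```

If the interval for a difference excludes zero, the difference is statistically significant at the 5% level.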
Confidence intervals of proportions
• 95% CI for proportion $= p \pm 1.96 \sqrt{pq/n}$
• 95% CI for difference between two proportions $= (p_1 - p_2) \pm 1.96 \times \mathrm{SE}(p_1 - p_2)$
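A minimal sketch of the proportion formulas, including the np and n(1-p) > 5 check for the Normal approximation. For data it borrows the smoker counts from the case-control example later in these slides (70 of 250 patients, 30 of 250 controls); the function names are illustrative.

```python
import math

Z_95 = 1.96  # Normal multiplier for a 95% interval

def se_proportion(p, n):
    """SE of a Binomial proportion: sqrt(p * q / n) with q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

def normal_approx_ok(p, n):
    """Rule of thumb from the slides: np and n(1 - p) both above 5."""
    return n * p > 5 and n * (1 - p) > 5

def ci_proportion(p, n, z=Z_95):
    se = se_proportion(p, n)
    return p - z * se, p + z * se

def ci_proportion_difference(p1, n1, p2, n2, z=Z_95):
    """CI for p1 - p2; the variances of the two proportions add."""
    se_diff = math.sqrt(se_proportion(p1, n1) ** 2 + se_proportion(p2, n2) ** 2)
    diff = p1 - p2
    return diff - z * se_diff, diff + z * se_diff

# Proportion of smokers among IHD patients and among controls
p_ihd, p_control = 70 / 250, 30 / 250
low, high = ci_proportion_difference(p_ihd, 250, p_control, 250)
print(normal_approx_ok(p_ihd, 250), round(low, 3), round(high, 3))
```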
Comparing 2 groups of continuous data
• Normal distribution:
paired or unpaired t test
• Non-Normal distribution:
transform data
OR
Mann-Whitney-Wilcoxon test
Paired t test
We wish to compare the fasting blood cholesterol levels in 10 subjects before and after treatment with a new drug.
What is the null hypothesis?
Paired t test
Subject   Fasting cholesterol      D
number    Predrug    Postdrug
01        6.7        4.4            2.3
02        7.8        7.0            0.8
03        8.1        6.0            2.1
04        5.5        5.8           -0.3
05        8.6        9.0           -0.4
06        6.7        6.1            0.6
07        7.1        7.3           -0.2
08        9.9        9.9            0
09        8.2        6.3            1.9
10        6.5        7.1           -0.6
Paired t test
• Calculate the mean and SEM of D
• The null hypothesis is that the mean of D = 0
• The test statistic $t = \frac{\text{mean}(D) - 0}{\text{SEM}(D)}$
Paired t test
• Mean = 0.62
• SEM = 0.351
• t = 1.766
• Degrees of freedom = n - 1 = 9
• From tables of t, 2-tailed probability (P) is between 0.1 and 0.2
• How would you interpret this?
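The paired calculation can be reproduced from the cholesterol table. This sketch uses only the standard library and stops at the t statistic, leaving the P value to tables as the slides do. Carried through without intermediate rounding it gives t of about 1.764; the quoted 1.766 is what results from dividing the rounded mean (0.62) by the rounded SEM (0.351).

```python
import math

# Fasting cholesterol before and after treatment, subjects 01 to 10
pre = [6.7, 7.8, 8.1, 5.5, 8.6, 6.7, 7.1, 9.9, 8.2, 6.5]
post = [4.4, 7.0, 6.0, 5.8, 9.0, 6.1, 7.3, 9.9, 6.3, 7.1]

# D = predrug minus postdrug for each subject
d = [a - b for a, b in zip(pre, post)]
n = len(d)

mean_d = sum(d) / n
sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))
sem_d = sd_d / math.sqrt(n)

# Test statistic under the null hypothesis that the mean of D is 0
t = (mean_d - 0) / sem_d
df = n - 1

print(round(mean_d, 2), round(sem_d, 3), round(t, 3), df)
```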
Comparing 2 groups of categorical data
• In a study of the effect of smoking on the risk of developing ischaemic heart disease, 250 men with IHD and 250 age-matched healthy controls were asked about their current smoking habits.
• What is the null hypothesis?
Results
• 70 of the 250 patients were smokers
• 30 of the healthy controls were smokers
           Smoker    Non-smoker    Total
IHD        70        180           250
Control    30        220           250
Total      100       400           500
Calculate expected values, E, for each cell (shown in brackets after the observed value)
           Smoker     Non-smoker    Total
IHD        70 (50)    180 (200)     250
Control    30 (50)    220 (200)     250
Total      100        400           500
Calculate (observed – expected) value, D
           Smoker            Non-smoker
IHD        70 - 50 = 20      180 - 200 = -20
Control    30 - 50 = -20     220 - 200 = 20
Calculate the sum of $D^2/E$
8 + 8 + 2 + 2 = 20
This is the test statistic, chi squared.
Compare with tables of chi squared with (r-1)(c-1) degrees of freedom.
In this case, chi squared with 1 df has a P value of < 0.001.
How do you interpret this?
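The chi squared steps above can be sketched in a few lines. Expected counts are taken as row total × column total / grand total, which reproduces the 50 and 200 in the table. The P value line relies on an identity that holds only for 1 degree of freedom (the upper tail equals erfc(sqrt(x/2))); it is included here as a convenience in place of tables, and no continuity correction is applied.

```python
import math

# Observed counts: rows = IHD, Control; columns = Smoker, Non-smoker
observed = [[70, 180],
            [30, 220]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under the null hypothesis of no association
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this shortcut is valid for 1 df only
p_value = math.erfc(math.sqrt(chi_squared / 2))

print(chi_squared, df, p_value < 0.001)
```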
Statistical power
• The power of a study to detect an effect depends on:
  – The size of the effect
  – The sample size
• The probability of failing to detect an effect where one exists is called $\beta$
• The power of a study is $100(1 - \beta)\%$
• Wide confidence intervals indicate low statistical power
Statistical power
• The necessary sample size to detect the effect of interest should be calculated in advance
• Pilot data are usually required for these calculations
Statistical power - example
• 30% of the population are carriers of a genetic variant. You wish to test whether this variant increases the risk of Alzheimer's disease.
• For P < 0.05, and 80% power, number of controls and cases required:
Control carriers    Case carriers    Sample size
30%                 50%              100
30%                 40%              350
30%                 35%              1400
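The slides do not say how these sample sizes were obtained. As an assumption, the sketch below uses a common Normal-approximation formula for comparing two proportions, n per group = (z for alpha + z for power)² × (p1q1 + p2q2) / (p1 - p2)², and treats the result as the number needed in each group. It gives figures of the same order as the table (roughly 90, 350 and 1370), which suggests the table values are rounded.

```python
import math
from statistics import NormalDist

def n_per_group(p_control, p_case, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of
    two proportions (Normal approximation, no continuity correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_control * (1 - p_control) + p_case * (1 - p_case)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_control - p_case) ** 2
    return math.ceil(n)

for p_case in (0.50, 0.40, 0.35):
    print(p_case, n_per_group(0.30, p_case))
```

Halving the difference to be detected roughly quadruples the required sample size, because the difference enters the formula squared.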
Multiple testing
Number of tests    Probability of false positive
1                  0.05
2                  0.10
3                  0.14
4                  0.19
5                  0.23
10                 0.40
20                 0.64
Bonferroni correction: Divide 0.05 by the number of tests to provide the required P value for hypothesis testing at the conventional level of statistical significance
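The probabilities in the table are consistent with 1 - 0.95^k for k tests, which assumes the tests are independent and each uses a 0.05 threshold. A minimal sketch of that calculation and of the Bonferroni threshold follows; the function names are illustrative.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P value required after Bonferroni correction."""
    return alpha / n_tests

for k in (1, 2, 3, 4, 5, 10, 20):
    print(k, round(familywise_error(k), 2), bonferroni_threshold(k))
```

With 20 tests, for example, each individual test would need P < 0.0025 to be declared significant at the conventional overall level.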
Data trawling
• Decide in advance which statistical tests are to be performed
• Post hoc testing of subgroups should be viewed with caution
• Multiple correlations should be avoided