Provide the basic concept and application of bio-statistics using a practical model coupled with the essential theoretical background.
Medical Statistics 2013
Dr Tarek Tawfik Amin
Introduction
- Questions
- Why statistics?
- The process
- The resources
How?
• Book: Statistics at Square One 11th ed. “Campbell and Swinscow”
• SPSS practical sessions: PASW guide.
• Practical sessions using SPSS v. 17.0
Statistics “an overview”
Data
Population
Sample
Analysis
Interpretation
Information
Parameters
Statistics
Reference range
Researches
Statistical analysis
Variables (data):
- Qualitative (categorical): nominal, ordinal
- Quantitative (numerical): interval/ratio; discrete or continuous
Type of analysis: descriptive or inferential
Depends on the sample (s) and objectives of analysis
Tables Graphs Measures
I-Descriptive Statistics
Goals
Summarizing
Overview
Data checking
PATNR  AGE  SEX  SMOKE  HEIGHT  WEIGHT  SBP1  SBP2  INSULIN  CHOL  HBA1C  DIABDU  DEAD
1      57   0    0      177     98      140   154   0        6.30  7.62   5       #NULL!
2      74   1    0      172     69      150   145   1        5.10  8.30   11      0
3      38   1    0      155     70      120   126   0        6.50  11.00  2       #NULL!
4      73   1    0      165     72      180   157   0        5.80  7.00   21      0
5      53   1    2      174     109     140   119   1        6.80  10.60  7       0
6      74   1    0      171     83      151   145   0        6.25  7.62   7       0
7      81   0    2      175     60      140   113   0        6.50  6.40   6       0
8      86   1    0      164     59      140   158   0        5.20  5.30   4       0
9      78   0    1      171     83      151   148   0        5.60  5.90   1       1
10     78   1    0      171     83      151   159   1        5.00  8.00   23      1
11     91   0    0      171     83      151   140   0        4.30  9.70   4       1
12     77   0    2      176     87      170   198   0        6.40  6.60   7       2
13     77   1    0      171     83      151   152   0        5.20  4.90   26      1
14     84   0    0      171     62      160   148   0        7.00  7.80   8       1
15     72   1    0      154     63      145   148   0        6.20  7.80   0       1
diabIB
I-Tables
Frequency Contingency
SEX
         Frequency  Percent  Valid Percent  Cumulative Percent
male     145        52.2     52.2           52.2
female   133        47.8     47.8           100.0
Total    278        100.0    100.0
smoking history * SEX Crosstabulation (Count)
smoking history    male  female  Total
never              26    110     136
stopped smoking    64    14      78
yes                55    9       64
Total              145   133     278
Tables can summarize counts, frequency (categorical), measures (numerical)
For comparison (2 or more variables)
Table 3 Daily servings of calcium and vitamin D rich foods in relation to body mass index classification of the included adults.
Values are median (mean ±SD) unless stated otherwise.
Food items (servings/day) *              Obese (N=91)         Non-obese (N=125)    P value a
Milk                                     0.52 (0.71±0.3)      0.65 (0.88±0.7)      0.031
Milk beverage                            0.45 (0.59±0.4)      0.35 (0.53±0.4)      0.279
Milk in cereals                          0.20 (0.33±0.2)      0.50 (0.58±0.4)      0.001
Milk in coffee or tea                    0.15 (0.25±0.6)      0.20 (0.23±0.6)      0.790
- Total milk                             0.90 (1.03±0.3)      1.20 (1.34±0.7)      0.001
Yoghurt                                  0.10 (0.12±0.6)      0.20 (0.14±0.5)      0.790
Cheese                                   0.20 (0.24±0.9)      0.20 (0.29±0.8)      0.661
Ice cream                                0.15 (0.14±0.6)      0.06 (0.09±0.3)      0.422
- Total dairy                            0.25 (0.45±0.6)      0.30 (0.43±0.7)      0.826
Tuna (canned)                            0.05 (0.03±0.1)      0.03 (0.04±0.3)      0.761
Fish                                     0.15 (0.19±0.7)      0.10 (0.18±0.5)      0.902
Half cooked fish                         0.06 (0.11±0.5)      0.25 (0.27±0.6)      0.029
Shrimp/oyster                            0.05 (0.08±0.1)      0.05 (0.06±0.1)      0.149
Eggs                                     0.85 (0.81±1.1)      0.80 (0.76±0.7)      0.797
Liver (including chicken livers)         0.02 (0.04±0.4)      0.05 (0.06±0.3)      0.834
Others !                                 0.20 (0.23±0.3)      0.40 (0.55±0.5)      0.549
- Dietary vitamin D (IU/day)             111.6 (118.1±73.5)   123.7 (132.2±67.4)   0.034
  Low dietary intake c (<200 IU/day): No. (%)    56 (62.2)    47 (37.6)            0.003 b
- Dietary calcium (mg/day)               660.0 (698.8±261.9)  692.0 (717.9±245.9)  0.223
  Low calcium intake d (<1000 mg/day): No. (%)   51 (56.7)    49 (39.2)            0.011 b
Assignment I
Table 1 Basic characteristics for the patients examined (N=278).
Baseline characteristics 1996                                      Total (N=278)
1- Men (%)                                                         52.2
2- Insulin users (%)                                               25.5
3- Smokers (%)                                                     23.0
4- Ex-smokers (%)                                                  28.1
5- Non-smokers (%)                                                 48.9
6- Age in years (mean ±SD)                                         67.24 ±11.74
7- Systolic blood pressure at starting point, mmHg (mean ±SD)      151.20 ±22.00
8- Systolic blood pressure at two years, mmHg (mean ±SD)           153.83 ±29.1
9- Duration of diabetes (median/quartiles 1-3)                     6.0 (2.75-12.25)
10- Missed values                                                  0.0
II-Graphs
Goals: impression, comparison, data checking, clustering, trend
II- Graphs
Figure 1Outcomes of the included diabetic patients (1996)
other cause of death
died from CVD
alive
Missing
Selection of graphs
1- Types of variables (categorical or numerical)
2- Number of variables
3- Objectives
Figure 2: Smoking status of the included diabetic patients. [Bar chart: percent (0 to 60) by smoking history (yes, stopped smoking, never).]
Figure 3: Total cholesterol level in diabetic patients 1996, in mmol/l. [Histogram: mean = 6.25, Std. Dev = 1.33, N = 278.]
For numerical variables
Figure 4: Systolic blood pressure at starting point among diabetic patients 1996 (mmHg). [Boxplot by sex: male N = 145, female N = 133; y-axis: syst. blood pressure at start, 80 to 240 mmHg; a few outlying cases are flagged.]
Figure 6: Total cholesterol level in relation to gender and smoking status among diabetic patients 1996. [Error-bar chart: 95% CI of total cholesterol (mmol/l), axis 5.0 to 8.5, by sex and smoking history (never, stopped smoking, yes).]
Figure 7: Duration of diabetes among the included patients 1996 (in years). [Histogram: mean = 7.9, Std. Dev = 6.96, median = 6.0, N = 278.]
Checking for normality
In a normal distribution the mode, median and mean coincide; skewness (+ or -) and outliers pull them apart.
Mode=1
III-Measures (numerical variables)
Central tendency (how the data aggregate around a central point): mean, median, mode, percentiles
Dispersion (how the data vary): range (max-min), interquartile range, variance, standard deviation, coefficient of variation
Central tendency
Mean = sum of the observations / their number = (x1+x2+x3+...)/n. Affected by extreme values.
Mode = the most frequently occurring value in a set of observations.
Median = the middle value, which divides the ordered data set 50/50. Not affected by extreme values.
Age of sample: 3, 7, 37
Median = 7, Mean = (3+7+37)/3 = 15.7
Age of sample: 3, 7, 11
Median = 7, Mean = (3+7+11)/3 = 7
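The two age samples above can be checked with a few lines of code. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the course material.

```python
from statistics import mean, median

ages_with_extreme = [3, 7, 37]   # the slide's sample with one extreme value
ages_without_extreme = [3, 7, 11]

# The mean is pulled upward by the extreme value 37 ...
mean_extreme = mean(ages_with_extreme)
# ... while the median (middle value of the ordered data) stays at 7.
median_extreme = median(ages_with_extreme)

mean_plain = mean(ages_without_extreme)
median_plain = median(ages_without_extreme)

print(round(mean_extreme, 1), median_extreme)
print(mean_plain, median_plain)
```

Replacing 37 by 11 changes the mean from about 15.7 to 7 but leaves the median untouched, which is the point the slide makes.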
Dispersion
1 6 8 10 16 17 23 43 531
Range = 53 - 1 = 52. Affected by extreme values.
Median = 13: 50% of the data lie below it (50th percentile = 13).
3rd quartile = 75th percentile: 75% of the data lie below it.
1st quartile = 25th percentile: 25% of the data lie below it.
Interquartile range = 3rd quartile - 1st quartile = 23 - 6 = 17
The IQR is not affected by extreme values.
Standard deviation and variance
Sample of 3, their ages in years: 3, 7, 17
Mean age = (3+7+17)/3 = 9
Differences from the mean: -6, -2, +8
The sum of the differences between the individual values and the mean = 0, so the mean deviation = 0.
To overcome the 0, sum the squared differences and divide by (number - 1). This is the variance:
Variance = [(3-9)² + (7-9)² + (17-9)²]/(3-1) = 52
The amount of dispersion around the mean = 52 years² (wrong scale).
Hence we need to convert back to the usual (natural) scale, using the standard deviation: √Variance = ±7.2 years
The sample disperses around the mean (= 9 years) by 7.2 years in both directions.
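The same calculation can be traced step by step. The sketch below assumes Python's standard `statistics` module, whose `variance` and `stdev` functions use the n-1 denominator described above; the variable names are illustrative.

```python
from statistics import mean, variance, stdev

ages = [3, 7, 17]                      # sample of 3, ages in years
m = mean(ages)                         # 9

deviations = [x - m for x in ages]     # -6, -2, +8
sum_of_deviations = sum(deviations)    # always 0, so it cannot measure spread

sample_variance = variance(ages)       # sum of squared deviations / (n - 1)
sample_sd = stdev(ages)                # square root of the variance, in years

print(sum_of_deviations, sample_variance, round(sample_sd, 1))
```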
Description of a binary (dichotomous) variable
o A binary variable has only two outcomes (diseased or not diseased).
o The proportion of the population that is diseased (at a certain point of time) is called prevalence.
o The occurrence of new cases (over a period of time) is called incidence.
Dichotomous variables
Prevalence= All cases (new or old)/at risk population
Incidence= New cases/total population at risk
Probability and Odds
o Odds = chance.
o In a population of 1000, 200 have a certain disease.
o When we randomly take one person out, the probability that this person is diseased = 200/1000 = 0.2 (this is the probability).
o The chance (the odds) that this person is diseased = probability of having the disease / probability of not having the disease.
o Odds = P/(1-P) = 0.2/0.8 = 1/4; the odds are 1 to 4.
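The conversion from probability to odds is a one-line formula. The helper below is a hypothetical illustration (the function name is an assumption), applied to the 200-in-1000 example.

```python
def odds_from_probability(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

probability = 200 / 1000                    # 0.2
odds = odds_from_probability(probability)   # 0.2 / 0.8, i.e. 1 to 4

print(probability, round(odds, 2))
```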
The following table shows the outcomes of an isoniazid/placebo trial among children with HIV (death within 6 months).
Intervention   Dead (within 6 months)   Alive   Total
Placebo        21                       110     131
Isoniazid      11                       121     132
What is the risk of dying?
Risk (placebo) = 21/131 = 0.160
Risk (isoniazid) = 11/132 = 0.083
Absolute risk difference (ARD) = risk in placebo - risk in isoniazid = 0.077
Relative risk (RR) = risk in placebo / risk in isoniazid = 1.92
Relative risk reduction (RRR) = (risk in placebo - risk in isoniazid) / risk in placebo = 0.48
Number needed to treat (NNT) = 1/ARD = 1/0.077 = 13
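The risk measures above all follow from the four cell counts. This is a minimal sketch; the function name and the dictionary keys are illustrative choices, and the counts are the ones in the trial table.

```python
def risk_measures(events_control, n_control, events_treated, n_treated):
    """Risk-based effect measures for a two-arm trial."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    ard = risk_control - risk_treated          # absolute risk difference
    return {
        "risk_control": risk_control,
        "risk_treated": risk_treated,
        "ARD": ard,
        "RR": risk_control / risk_treated,     # relative risk (placebo vs treated)
        "RRR": ard / risk_control,             # relative risk reduction
        "NNT": 1 / ard,                        # number needed to treat
    }

trial = risk_measures(21, 131, 11, 132)
for name, value in trial.items():
    print(name, round(value, 3))
```

Working from unrounded risks gives a relative risk of about 1.92; dividing the rounded values 0.160/0.083 gives the slightly larger 1.93.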
Odds ratio (OR)
o An odds ratio (OR) is a measure of association between an exposure and an outcome.
o The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
o Odds ratios are most commonly used in case-control studies; they can also be used in cross-sectional and cohort study designs (with some modifications and/or assumptions).
Basic structure of case-control design
[Diagram: starting at the present time, a sample of diseased subjects (cases) and disease-free subjects (controls) is drawn from the population and traced back into the past to find who was exposed and who was unexposed to the factor (cells a to d). The odds ("chance") of exposure is calculated for both groups and compared.]
Calculation
Case-control study
              Diseased                 Not diseased                    Total
Exposed       Cases, exposed (a)       Not diseased, exposed (b)       a+b
Non-exposed   Cases, not exposed (c)   Not diseased, not exposed (d)   c+d
Odds ratio = (a/c) ÷ (b/d) = ad/bc
= odds of exposure among the diseased / odds of exposure among the non-diseased
OR=1: exposure does not affect the odds of the outcome
OR>1: exposure is associated with higher odds of the outcome
OR<1: exposure is associated with lower odds of the outcome
Odds ratio
Case-control study
          Lung cancer   No lung cancer   Total
Smoking   a = 80        b = 30           110
None      c = 20        d = 70           90
ad = 80 x 70 = 5600; bc = 30 x 20 = 600
OR = 5600/600 = 9.3
Or (80/20) ÷ (30/70) = 9.3
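The cross-product formula can be written as a small helper. This is an illustrative sketch (the function name is an assumption); the cell labels a, b, c, d follow the table above.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad/bc for a 2x2 table.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    return (a * d) / (b * c)

or_smoking = odds_ratio(80, 30, 20, 70)        # 5600 / 600
or_via_odds = (80 / 20) / (30 / 70)            # odds of exposure in cases / in controls

print(round(or_smoking, 2), round(or_via_odds, 2))
```

Both routes give the same value, about 9.3.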
Basic structure of cohort study
[Diagram: at the starting point (present time), a disease-free sample is drawn from the population and classified as exposed or unexposed to the factor. Both groups are followed into the future. Exposed: develop disease (a) or stay disease-free (b). Unexposed: develop disease (c) or stay disease-free (d). The incidence of disease in each group is compared.]
The relative risk is calculated for the exposure.
Relative risk (RR)
Mammography   Breast cancer   No breast cancer   Total
Positive      a = 10          b = 90             100
Negative      c = 20          d = 100,080        100,100
In cohort design:
RR = [a/(a+b)] ÷ [c/(c+d)] = (10/100) ÷ (20/100,100) = 0.1/0.0002 = 500
The relative risk (RR)
Cohort study
              Lung cancer   No lung cancer   Total
Smokers       18            582              600
Non-smokers   6             1194             1200
Risk for smokers = 18/600 = 0.03
Risk for non-smokers = 6/1200 = 0.005
RR = 0.03/0.005 = 6
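The relative risk is the ratio of the two incidences. The helper below is a hypothetical sketch (its name and argument names are assumptions) applied to the cohort table above.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

rr_smoking = relative_risk(18, 600, 6, 1200)   # 0.03 / 0.005

print(round(rr_smoking, 2))
```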
The odds ratio (OR)
Case-control study
              Lung cancer   No lung cancer   Total
Smokers       80            30               110
Non-smokers   20            70               90
Odds for smokers = 80/30 = 2.67
Odds for non-smokers = 20/70 = 0.29
OR = (80 x 70)/(30 x 20) = 9.33
Assignment I
Table 1 Basic characteristics for the patients examined (N=278).
Baseline characteristics 1996                                      Total (N=278)
1- Men (%)                                                         52.2
2- Insulin users (%)                                               25.5
3- Smokers (%)                                                     23.0
4- Ex-smokers (%)                                                  28.1
5- Non-smokers (%)                                                 48.9
6- Age in years (mean ±SD)                                         67.24 ±11.74
7- Systolic blood pressure at starting point, mmHg (mean ±SD)      151.20 ±22.00
8- Systolic blood pressure at two years, mmHg (mean ±SD)           153.83 ±29.1
9- Duration of diabetes (median/quartiles 1st-3rd)                 6.0 (2.75-12.25)
10- Missed values                                                  --
2a. Smoking history (all subjects). [Bar chart, percent: yes 23, stopped smoking 28, never 49.]
2b. Smoking history by sex. [Clustered bar chart, percent within sex. Male: yes 38, stopped smoking 44, never 18. Female: yes 7, stopped smoking 11, never 83.]
3a. Age using a bar chart (mean used as summary). [Bar chart: mean age (years) by sex; y-axis 64 to 70.]
3b. Boxplot of age (years) by sex (male N = 145, female N = 133). [One outlying case is flagged.]
This graph checks the data distribution and reveals outliers.
4a. Height of the included subjects (cm). [Histogram: mean = 170.5, Std. Dev = 8.89, median = 170.55 cm, N = 278.]
4b. Duration of diabetes (years). [Histogram: mean = 7.9, Std. Dev = 6.96, median = 6.0 years, N = 278.]
syst. blood pressure at start
Value   Frequency   Percent   Valid Percent   Cumulative Percent
100     1           .4        .4              .4
110     1           .4        .4              .7
112     2           .7        .7              1.4
115     1           .4        .4              1.8
116     2           .7        .7              2.5
120     21          7.6       7.6             10.1
121     2           .7        .7              10.8
122     1           .4        .4              11.2
124     1           .4        .4              11.5
125     6           2.2       2.2             13.7
127     1           .4        .4              14.0
130     16          5.8       5.8             19.8
131     1           .4        .4              20.1
132     2           .7        .7              20.9
134     1           .4        .4              21.2
135     11          4.0       4.0             25.2
136     1           .4        .4              25.5
137     2           .7        .7              26.3
139     1           .4        .4              26.6
140     28          10.1      10.1            36.7
141     2           .7        .7              37.4
144     4           1.4       1.4             38.8
145     12          4.3       4.3             43.2
147     1           .4        .4              43.5
148     1           .4        .4              43.9
150     31          11.2      11.2            55.0
151     1           .4        .4              55.4
151     23          8.3       8.3             63.7
152     1           .4        .4              64.0
153     1           .4        .4              64.4
155     2           .7        .7              65.1
158     1           .4        .4              65.5
160     21          7.6       7.6             73.0
161     1           .4        .4              73.4
162     1           .4        .4              73.7
163     1           .4        .4              74.1
164     1           .4        .4              74.5
165     5           1.8       1.8             76.3
167     1           .4        .4              76.6
168     2           .7        .7              77.3
170     14          5.0       5.0             82.4
171     1           .4        .4              82.7
172     2           .7        .7              83.5
175     4           1.4       1.4             84.9
176     1           .4        .4              85.3
177     1           .4        .4              85.6
178     1           .4        .4              86.0
179     2           .7        .7              86.7
180     14          5.0       5.0             91.7
182     2           .7        .7              92.4
184     1           .4        .4              92.8
185     1           .4        .4              93.2
187     1           .4        .4              93.5
189     1           .4        .4              93.9
190     6           2.2       2.2             96.0
194     1           .4        .4              96.4
195     1           .4        .4              96.8
200     2           .7        .7              97.5
205     1           .4        .4              97.8
209     1           .4        .4              98.2
210     3           1.1       1.1             99.3
216     1           .4        .4              99.6
220     1           .4        .4              100.0
Total   278         100.0     100.0
5a- Using the frequency table: P95 ≈ 189-190 mmHg
Or using the formula: P95, P5 = Mean ± Z score at the specified percentile x standard deviation
P95 SBP1 = 151.2 + 1.645(22.0) = 187.4 mmHg
Probability distribution of the normal curve: page 180
duration of diabetes
Value   Frequency   Percent   Valid Percent   Cumulative Percent
0       12          4.3       4.3             4.3
1       35          12.6      12.6            16.9
2       22          7.9       7.9             24.8
3       21          7.6       7.6             32.4
4       24          8.6       8.6             41.0
5       20          7.2       7.2             48.2
6       23          8.3       8.3             56.5
7       19          6.8       6.8             63.3
8       6           2.2       2.2             65.5
9       6           2.2       2.2             67.6
10      6           2.2       2.2             69.8
11      13          4.7       4.7             74.5
12      2           .7        .7              75.2
13      7           2.5       2.5             77.7
14      6           2.2       2.2             79.9
15      5           1.8       1.8             81.7
16      11          4.0       4.0             85.6
17      8           2.9       2.9             88.5
18      6           2.2       2.2             90.6
19      5           1.8       1.8             92.4
20      3           1.1       1.1             93.5
21      5           1.8       1.8             95.3
22      2           .7        .7              96.0
23      2           .7        .7              96.8
25      3           1.1       1.1             97.8
26      1           .4        .4              98.2
27      1           .4        .4              98.6
28      2           .7        .7              99.3
31      1           .4        .4              99.6
32      1           .4        .4              100.0
Total   278         100.0     100.0
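5b- P5 for duration of diabetes
Or using the formula: Mean - Z score (1.645) x SD = 7.9 - 1.645(6.96) = -3.6 years
A negative duration is impossible: the duration of diabetes is skewed, so the normal-curve formula does not apply to it.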
Total population N=278, μ=67.24 years, σ=11.743
28 samples of 150 from a total population of 278
Sample no.   Mean    SD
1            67.6    12.07
2            67.13   11.81
3            67      11.98
4            67.8    11.63
5            66.33   11.44
6            67.44   11.95
7            67.84   12.42
8            66.59   11.36
9            67      12
10           66.38   11.9
11           68.06   12.06
12           67.61   11.02
13           67.31   11.33
14           66.44   11.91
15           66.87   11.26
16           66.8    11.5
17           66.73   12.37
18           66.38   11.77
19           67.03   11.22
20           66.58   12.13
21           66.81   11.55
22           66.58   12.21
23           67.2    11.61
24           66.48   11.48
25           67.53   12.1
26           67.58   10.6
27           67      11.91
28           67.31   11.59
Mean of the means   67.03   11.3
[Chart: mean and SD of age (in years) plotted for each of the 28 samples.]
Population and Sample
o In scientific research we want to make a statement (conclusion) about the population.
o Studying the whole population is usually impossible in terms of money/time/labor.
o Instead, we draw a random sample from the population and infer the needed conclusions from the sample data.
o The task of statistics is to quantify the uncertainty (how well the sample really represents that population).
The concept of sampling
Study population: sampling units
You select a few sampling units from the study population (the sample).
You collect information from these people to find answers to your research questions.
You make an estimate ("prediction") extrapolated to the study population (prevalence, outcomes, etc.).
What would be the mean systolic blood pressure of older subjects (65+) in Al Hassa?
Population mean (μ) = unknown
Sample values: 175, 165, 180, 155
From our sample we calculate an estimate of the population parameter.
The good sample (the estimator)
Should be :
Unbiased:
The mean of sample = population mean
Precise: (narrow dispersion about the mean)
The dispersion in repeated samples is small
This is a dream
Sampling error
Four individuals: A = 18 years, B = 20 years, C = 23 years, D = 25 years.
Their mean age = (18+20+23+25)/4 = 86/4 = 21.5 years (population mean μ).
Possible samples of two individuals (6 possible samples):
A+B = (18+20)/2 = 19.0 years
A+C = (18+23)/2 = 20.5 years
A+D = (18+25)/2 = 21.5 years
B+C = (20+23)/2 = 21.5 years
B+D = (20+25)/2 = 22.5 years
C+D = (23+25)/2 = 24.0 years
Sampling error = population mean - sample mean: ranges from -2.5 to +2.5 years.
Possible samples of three individuals (4 possible samples):
A+B+C = (18+20+23)/3 = 20.33 years
A+B+D = (18+20+25)/3 = 21.00 years
A+C+D = (18+23+25)/3 = 22.00 years
B+C+D = (20+23+25)/3 = 22.67 years
Sampling error: ranges from -1.17 to +1.17 years.
If C = 32 (instead of 23) years and D = 40 (instead of 25) years, the population mean becomes 27.5 years and the sampling error ranges from -8.5 to +8.5 years for samples of 2, and from -3.17 to +4.17 years for samples of 3.
The greater the variability of a given variable, the larger the sampling error for a given sample size.
With infinitely many samples, the average of the sample means equals the mean of the population they came from (good estimator).
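Because the population has only four members, every possible sample can be listed and its sampling error computed. The sketch below uses Python's standard `itertools.combinations`; the function and variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

ages = [18, 20, 23, 25]            # individuals A, B, C, D
population_mean = mean(ages)       # 21.5

def sampling_errors(values, sample_size):
    """Population mean minus sample mean, for every possible sample."""
    mu = mean(values)
    return [mu - mean(sample) for sample in combinations(values, sample_size)]

errors_2 = sampling_errors(ages, 2)    # 6 possible samples of two
errors_3 = sampling_errors(ages, 3)    # 4 possible samples of three

print(min(errors_2), max(errors_2))
print(round(min(errors_3), 2), round(max(errors_3), 2))

# More variable population: C = 32 and D = 40 instead of 23 and 25.
wide_errors_2 = sampling_errors([18, 20, 32, 40], 2)
print(min(wide_errors_2), max(wide_errors_2))
```

The errors shrink when the sample grows from 2 to 3 and widen when the ages are more spread out, and in every case they average to zero over all possible samples.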
Part 2
o The normal distribution
o The standard error of the mean
o Estimation:
- Reference interval
- Confidence intervals for: a mean, a proportion, a difference between means/proportions, RR and OR
Normal Distribution: Many human traits, such as intelligence, personality and attitudes, as well as weight and height, are distributed among populations in a fairly normal way.
The normal distribution
About 68% of values lie within μ ± 1 SD (σ)
About 95% of values lie within μ ± 2 SD (σ)
>2 SDs: possible outliers
>3 SDs: definite outliers
One more: the Z score, which measures how many standard deviations a particular data point lies above or below the mean.
o Unusual observations have a Z score above +2 or below -2.
o Extreme observations have a Z score above +3 or below -3 and should be investigated as potential outliers.
$Z = \dfrac{X - \bar{X}}{s}$
Areas under the standard normal curve.
Z        Area between both points (around the mean)   Beyond both points (two tails)   Beyond one point (one tail)
±0.1     0.080    0.920    0.4600
±0.2     0.159    0.841    0.4205
±0.3     0.236    0.764    0.3820
±0.4     0.311    0.689    0.3445
±0.5     0.383    0.617    0.3085
±0.6     0.451    0.549    0.2745
±0.7     0.516    0.484    0.2420
±0.8     0.576    0.424    0.2120
±0.9     0.632    0.368    0.1840
±1       0.683    0.317    0.1585
±1.1     0.729    0.271    0.1355
±1.2     0.770    0.230    0.1150
±1.3     0.806    0.194    0.0970
±1.4     0.838    0.162    0.0810
±1.5     0.866    0.134    0.0670
±1.6     0.890    0.110    0.0550
±1.645   0.900    0.100    0.0500
±1.7     0.911    0.089    0.0445
±1.8     0.928    0.072    0.0360
±1.9     0.943    0.057    0.0290
±1.96    0.950    0.050    0.0250
±2       0.954    0.046    0.0230
±2.1     0.964    0.036    0.0180
±2.2     0.972    0.028    0.0140
±2.3     0.979    0.021    0.0105
±2.4     0.984    0.016    0.0080
±2.576   0.990    0.010    0.0050
Calculating values from Z-scores
Xi = Mean± Z (standard deviation).
Value (percentiles) =Mean± Z score*(SD)
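The formula can be applied without a printed table. The sketch below assumes Python's standard `statistics.NormalDist` for the Z critical values and uses the systolic blood pressure summary from this dataset (mean 151.2, SD 22.0); the function name is illustrative.

```python
from statistics import NormalDist

def percentile_value(mean, sd, percentile):
    """Value at a given percentile, assuming the variable is normally distributed."""
    z = NormalDist().inv_cdf(percentile / 100)   # e.g. 95 -> about +1.645
    return mean + z * sd

p95_sbp = percentile_value(151.2, 22.0, 95)
p5_sbp = percentile_value(151.2, 22.0, 5)

# The reverse direction: the Z score of a single observation.
z_of_190 = (190 - 151.2) / 22.0

print(round(p95_sbp, 1), round(p5_sbp, 1), round(z_of_190, 2))
```

The result is only as good as the normality assumption: for a skewed variable such as duration of diabetes the same formula can return impossible values.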
Random sample for estimating a population mean
μ?
X1=128
X2=133
X3=129
From the information in the sample, we will estimate the unknown population mean (the sample mean X̄ is an estimator for μ).
What could have happened if we had another random sample?
What is the measure of variation of sample means?
The Sampling Distribution of a Sample Statistic
≈ Let's assume that we want to survey a community of 400 people whose ages were recorded, with the following parameters:
µ = 35 years σ = 13 years
≈ Let's assume, however, that we do not survey all 400; instead we randomly select 120 people, ask them about their ages and calculate the mean age.
≈ Then we put them back into the community and randomly select another 120 residents (which may include members of the first sample).
≈ We do this over and over, and each time we calculate the mean age.
≈ The results will be like those in the following table.
Distribution of 20 random sample means (n=20)
Sample number   Sample mean
1               34.7
2               35.9
3               35.5
4               34.7
5               34.5
6               34.4
7               35.7
8               34.6
9               37.4
10              35.3
11              34.1
12              35.5
13              34.9
14              36.2
15              35.6
16              35.0
17              35.1
18              36.4
19              35.6
20              33.6
SD of the means 13.37
[Dot plot of the 20 sample means on an axis from 33 to 37 years.]
All the results are clustered around the population value (35 years), with a few scores a bit further out and one extreme score of 37.4 years (random variation = 1/20 = 5%).
Those 400 people range in age from 2 to 69 years, while the sample means span a very narrow range of about 4 years, and 10 samples coincide with the population mean (35 years).
μ
Most of the samples will cluster around the population parameter, with an occasional sample result falling further to one side or the other of the distribution. This is called the sampling distribution of sample means. It has the following properties:
- The mean of the sampling distribution is equal to the population mean: the average of the averages (µx̄) will be the same as the population mean.
- The standard deviation of the sample means = the standard error, SE = σ/√n (σ = population SD).
- The distribution of the sample means is Normal if the population distribution is Normal.
- If the population distribution is not Normal, the distribution of the sample means is almost Normal when n is large (Central Limit Theorem).
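These properties can be checked by simulation. The sketch below is an illustration under stated assumptions: ages are drawn from an idealised normal population with µ = 35 and σ = 13 (rather than from the finite community of 400), the number of repetitions (2000) and the random seed are arbitrary, and all names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)                       # arbitrary seed, for reproducibility only
MU, SIGMA = 35, 13                   # population parameters from the example
SAMPLE_SIZE, REPEATS = 120, 2000     # REPEATS is an arbitrary choice

# Draw many samples and keep the mean of each one.
sample_means = [
    mean(random.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE))
    for _ in range(REPEATS)
]

mean_of_means = mean(sample_means)           # should be close to MU
observed_se = stdev(sample_means)            # SD of the sample means
theoretical_se = SIGMA / sqrt(SAMPLE_SIZE)   # SE = sigma / sqrt(n)

print(round(mean_of_means, 2), round(observed_se, 2), round(theoretical_se, 2))
```

The individual ages vary with an SD of 13 years, but the means of samples of 120 vary far less, close to the value σ/√n predicts.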
PopulationParameters
Mean S.D
Sample
Mean S.D
Standard error of the mean
The degree to which the sample statistics deviate from the population parameters.
The term "error" reflects the fact that, due to sampling error, each sample mean is likely to deviate somewhat from the true population mean.
Sample mean
Central Limit Theorem
The formula for SE = SD/√n.
The formula indicates that we are estimating the SE given the S.D of a sample of size n.
For a sample of 100 and S.D of 40, the SE = 40/√100 = 4.
For a sample of 1000 and S.D of 40, the SE = 40/√1000 = 1.26.
Two factors influence the SE: the sample size and the S.D of the sample.
Sample size appears in the denominator (as √n), so the larger the sample, the smaller the SE.
For a sample of 100 and S.D of 20, the SE = 20/√100 = 2.
For a sample of 100 and S.D of 40, the SE = 40/√100 = 4.
The more variability there is within a sample, the greater the SE.
Confidence Interval (CI)
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.
We need to know the smallest and the largest μ (range) we think is likely, using sample statistics. The sample mean is the point estimate of μ.
c = level of confidence   Zc = Z critical value (under the normal curve)
90%                       1.645
95%                       1.960
99%                       2.576
$\bar{X} \pm Z_c \dfrac{\sigma}{\sqrt{n}}$
C.I = mean of the sample ± Z critical score x SEM
SEM = SD/√n
C.I
• The confidence interval provides a range that is highly likely (often 95% or 99%) to contain the true population parameter that is being estimated.
• The narrower the interval the more informative is the result.
• It is usually calculated using the estimate (sample mean) and its standard error (SEM).
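The calculation can be sketched in a few lines. The example assumes Python's standard `statistics.NormalDist` for the Z critical value and uses the systolic blood pressure summary of this dataset (mean 151.20, SD 21.997, N = 278 as reported in the SPSS output); the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(sample_mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval: mean +/- Zc * SD / sqrt(n)."""
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    sem = sd / sqrt(n)                       # standard error of the mean
    return sample_mean - z_critical * sem, sample_mean + z_critical * sem

intervals = {
    level: mean_ci(151.20, 21.997, 278, level)
    for level in (0.90, 0.95, 0.99)
}
for level, (low, high) in intervals.items():
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f} mmHg")
```

SPSS bases its limits on the t distribution, so its bounds can differ from these in the second decimal place; the higher the confidence level, the wider the interval.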
CI for μ: systolic blood pressure in 278 diabetic patients
Descriptives: syst. blood pressure at start
Mean: 151.20 (Std. Error 1.319)
90% Confidence Interval for Mean: 149.02 (lower bound) to 153.38 (upper bound)
5% Trimmed Mean: 150.30
Median: 150.00
Variance: 483.880
Std. Deviation: 21.997
Minimum: 100; Maximum: 220; Range: 120
Interquartile Range: 30.00
Skewness: .540 (Std. Error .146)
Kurtosis: .152 (Std. Error .291)
90% C.I = 151.20 ± 1.65(21.997/√278)
C.I = 149.02-153.38 mmHg
Descriptives: syst. blood pressure at start (random sample of 50 out of 278)
Mean: 155.06 (Std. Error 3.064)
90% Confidence Interval for Mean: 149.92 (lower bound) to 160.20 (upper bound)
5% Trimmed Mean: 154.72
Median: 151.20
Variance: 460.033
Std. Deviation: 21.448
Minimum: 115; Maximum: 205; Range: 90
Interquartile Range: 30.00
Skewness: .263 (Std. Error .340)
Kurtosis: -.506 (Std. Error .668)
Descriptives: syst. blood pressure at start (all 278 patients)
Mean: 151.20 (Std. Error 1.319)
95% Confidence Interval for Mean: 148.60 (lower bound) to 153.80 (upper bound)
5% Trimmed Mean: 150.30
Median: 150.00
Variance: 483.880
Std. Deviation: 21.997
Minimum: 100; Maximum: 220; Range: 120
Interquartile Range: 30.00
Skewness: .540 (Std. Error .146)
Kurtosis: .152 (Std. Error .291)
95% C.I = 151.20 ± 1.96(21.997/√278)
C.I = 148.60-153.80 mmHg
Descriptives: syst. blood pressure at start (random sample of 50 out of 278)
Mean: 155.06 (Std. Error 3.064)
95% Confidence Interval for Mean: 148.90 (lower bound) to 161.22 (upper bound)
5% Trimmed Mean: 154.72
Median: 151.20
Variance: 460.033
Std. Deviation: 21.448
Minimum: 115; Maximum: 205; Range: 90
Interquartile Range: 30.00
Skewness: .263 (Std. Error .340)
Kurtosis: -.506 (Std. Error .668)
Descriptives: syst. blood pressure at start (all 278 patients)
Mean: 151.20 (Std. Error 1.319)
99% Confidence Interval for Mean: 147.78 (lower bound) to 154.62 (upper bound)
5% Trimmed Mean: 150.30
Median: 150.00
Variance: 483.880
Std. Deviation: 21.997
Minimum: 100; Maximum: 220; Range: 120
Interquartile Range: 30.00
Skewness: .540 (Std. Error .146)
Kurtosis: .152 (Std. Error .291)
99% C.I = 151.20 ± 2.58(21.997/√278)
C.I = 147.78-154.62 mmHg
Descriptives: syst. blood pressure at start (random sample of 50 out of 278)
Mean: 155.06 (Std. Error 3.064)
99% Confidence Interval for Mean: 146.84 (lower bound) to 163.28 (upper bound)
5% Trimmed Mean: 154.72
Median: 151.20
Variance: 460.033
Std. Deviation: 21.448
Minimum: 115; Maximum: 205; Range: 90
Interquartile Range: 30.00
Skewness: .263 (Std. Error .340)
Kurtosis: -.506 (Std. Error .668)
90% C.I = 151.20 ± 1.65(21.997/√278); C.I = 149.02-153.38 mmHg
95% C.I = 151.20 ± 1.96(21.997/√278); C.I = 148.60-153.80 mmHg
99% C.I = 151.20 ± 2.58(21.997/√278); C.I = 147.78-154.62 mmHg
What does this mean? It means that if the same population is sampled on numerous occasions and an interval estimate is made on each occasion, the resulting intervals would bracket the true population parameter in approximately 90, 95 and 99% of the cases.
The sampling distribution of a proportion
$p = K/n$
$SE(p) = \sqrt{\dfrac{p(1-p)}{n}}$
$95\%\ CI = p \pm 1.96 \times SE(p)$ (1.96 is the Z critical score for 95%)
Smokers among diabetics
Sample = 400, smokers = 40
p = 40/400 = 0.1
SE(p) = √(0.1 x 0.9/400) = 0.015
95% CI for p = 0.1 ± 1.96(0.015) = [0.07-0.13]
For percentages it is the same: SE = 1.5%, C.I = [7-13]%
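The smokers example can be reproduced directly from the formula. This is a minimal sketch; the function name is illustrative and the fixed multiplier 1.96 is the 95% Z critical score used above.

```python
from math import sqrt

def proportion_ci(successes, n, z_critical=1.96):
    """Proportion, its standard error and a normal-approximation CI."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return p, se, (p - z_critical * se, p + z_critical * se)

p, se, (low, high) = proportion_ci(40, 400)

print(round(p, 2), round(se, 3), round(low, 2), round(high, 2))
```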
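95% CI for the difference between two means (μ1-μ2)
Smoke        n     Mean SBP   SE (mean)
No           214   153.1      1.50
Yes          64    144.8      2.62
Difference         8.3
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{SE(\bar{X}_1)^2 + SE(\bar{X}_2)^2}$
$95\%\ CI = (\bar{X}_1 - \bar{X}_2) \pm 1.96 \times SE(\bar{X}_1 - \bar{X}_2)$
C.I = 2.4 to 14.2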
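The interval for the difference combines the two standard errors. The sketch below uses the values from the table (means 153.1 and 144.8, standard errors 1.50 and 2.62); the function name is an illustrative assumption.

```python
from math import sqrt

def difference_ci(estimate_1, se_1, estimate_2, se_2, z_critical=1.96):
    """CI for the difference between two independent estimates."""
    difference = estimate_1 - estimate_2
    se_difference = sqrt(se_1 ** 2 + se_2 ** 2)   # SEs combine as a root sum of squares
    margin = z_critical * se_difference
    return difference, se_difference, (difference - margin, difference + margin)

diff, se_diff, (low, high) = difference_ci(153.1, 1.50, 144.8, 2.62)

print(round(diff, 1), round(se_diff, 2), round(low, 1), round(high, 1))
```

The interval excludes 0, which agrees with the later t-test finding that the two group means differ. The same function applies to two percentages when their standard errors are supplied.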
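95% CI for the difference between two percentages
Smoke (n)    % died   SE
No (212)     28.8     3.11
Yes (64)     23.4     5.30
Difference = 5.4%
$SE(P_1 - P_2) = \sqrt{\dfrac{P_1(100 - P_1)}{n_1} + \dfrac{P_2(100 - P_2)}{n_2}}$
$95\%\ CI = (P_1 - P_2) \pm 1.96 \times SE(P_1 - P_2)$
(1 = non-smokers, 2 = smokers)
95% C.I = -6.7% to 17.4%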
95% CI for RR and OR
Use available software
http://www.medcalc.org/calc/relative_risk.php
http://www.medcalc.org/calc/odds_ratio.php
vl.academicdirect.org/applied_statistics/.../CIcalculator.xls
Assignment II
Inferential Statistics
Testing in research
o In scientific research we would like to test if our research ideas are true.
o Based on previous observations (studies) we know that the mean cholesterol of patients with diabetes is higher than those without the disease.
o We will take samples and check whether the results will agree with our expectations.
o Meaning we are going to test the situation using a statistical test.
The Z-test for one sample
Serum cholesterol in the general (diabetes-free) population: μ = 5 mmol/L, σ = 1.5
Diabetic patients: mean cholesterol > 5?
Considering σ = 1.5, is there any difference between the diabetes-free population and the diabetic patients regarding serum cholesterol? Let's perform a Z test.
Research question (hypothesis)
The research hypothesis would be:
The mean cholesterol of diabetics is > 5 mmol/L
Null hypothesis H0: μ = 5 (the mean of the diabetic population equals the reference value)
Alternative hypothesis H1: μ > 5 (one sided)
Or H1: μ ≠ 5 (two sided)
Procedure
[Histogram: total cholesterol level of diabetic patients in mmol/L (mean = 6.25, Std. Dev = 1.33, N = 278), with the reference value μ = 5 and the mean of the sample marked.]
If the sample mean is close to the hypothesized population mean, we do not reject the null hypothesis.
If the sample mean differs markedly from the hypothesized population mean, we REJECT the null.
The P value
The probability of obtaining a sample result at least as extreme as the one observed if the null hypothesis is true (that is, if there is really no difference between the population mean and the mean the sample comes from).
The α level
The maximum probability we accept of rejecting the null hypothesis falsely.
α = 0.05
P ≤ 0.05 (α): reject the null (sample mean ≠ population mean)
P > 0.05 (α): do not reject the null (no evidence that the sample mean differs from the population mean)
Calculation (σ=1.5)
SEM = σ/√n = 0.3
Z = (sample mean - μ)/SEM
P(sample mean ≥ 6) = P(Z ≥ (6-5)/0.3) = P(Z ≥ 3.33) ≈ 0.0005
Under the normal curve, the area of rejection lies beyond Z = 1.96.
P = 0.0005: if the diabetic patients had the same mean cholesterol as the disease-free population, a sample mean this high would occur only about 5 times if we repeated this test 10,000 times.
P < 0.05, so we reject the null. The diabetics have a higher mean cholesterol level than the normal population.
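The Z test above can be reproduced with the standard library. Two values in the sketch are assumptions inferred from the slide rather than stated in it: a sample mean of 6 mmol/L and a sample size of 25 (the size for which σ/√n = 1.5/5 = 0.3). The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Z statistic with one-sided (upper tail) and two-sided P values."""
    sem = sigma / sqrt(n)
    z = (sample_mean - mu0) / sem
    p_upper = 1 - NormalDist().cdf(z)              # P(Z >= z)
    p_two_sided = 2 * min(p_upper, 1 - p_upper)
    return z, p_upper, p_two_sided

# Assumed inputs: sample mean 6 and n = 25 (so that SEM = 0.3).
z, p_one_sided, p_two_sided = one_sample_z(6.0, 5.0, 1.5, 25)

print(round(z, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```

Both P values are far below 0.05, so the conclusion to reject the null does not depend on whether the one-sided or the two-sided alternative is used.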
In reality
It is unlikely that the σ (population SD) is known.
In most of the cases, σ will be unknown and we will be able to apply neither the formula nor the table of normal distribution (areas under the curve=Z score).
We resort to other statistical tests.
Possible situations in hypothesis testing
                      Decision
Reality               Reject H0             Do not reject H0
H0 is true            Type I error (α)      OK (1-α)
H0 is not true        OK (1-β)              Type II error (β)
α = level of significance
1-β = power: the probability of rejecting the null hypothesis if it is NOT TRUE. Usually 80% is the least required for any test.
Errors of Hypothesis Testing and Power
Decisions and errors in hypothesis testing
Conclusion from hypothesis testing (study results) versus the true situation:
Difference exists (reject H0):
- True situation, difference exists (H1): correct decision (power, or 1-β)
- True situation, no difference (H0): Type I error (α), false rejection: H0 is rejected when it is true; a difference is declared when there really is none.
No difference (do not reject H0):
- True situation, difference exists (H1): Type II error (β), false acceptance: no difference is declared when one is really present.
- True situation, no difference (H0): correct decision
Passive smoking and lung cancer
Conclusions are based on the results from a study of a sample of the population; the truth is about the population.
Reject the null hypothesis (rates in the study appear to be different):
- Truth: passive smoking is related to lung cancer: correct decision
- Truth: not related to lung cancer: Type I error (incorrect rejection): concluding that passive smoking is related to lung cancer when it really is not.
Accept the null hypothesis (rates in the study appear similar):
- Truth: passive smoking is related to lung cancer: Type II error (incorrect acceptance): concluding that passive smoking is not related to lung cancer when it really is.
- Truth: not related to lung cancer: correct decision
The Alpha-Fetoprotein (AFP) test has both Type I and Type II error possibilities.
This test screens the mother's blood during pregnancy for AFP and determines risk. Abnormally high or low levels may indicate Down syndrome.
H0: patient is healthy
Ha: patient is unhealthy
Type I error (false positive, or false rejection): the test wrongly indicates Down syndrome, which could lead to the pregnancy being terminated for no reason.
Type II error (false negative, or false acceptance): the test is negative and the child is born with multiple anomalies.
Hypothesis Test
This is the distribution given the null hypothesis is true
Type I and Type II Error
False rejection
False acceptance
One Sample
The distribution of X under the null and alternative hypotheses.
t-distribution
In real-life situations we estimate the unknown population SD using the sample SD.
Results are standardized to the t-distribution:
$t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$
Z test for the normal distribution (the population SD is known):
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
t-distribution
Heavier tails than the Z distribution
df = number of observations (sample size) - 1
Degree of freedom (df)
For all sample statistics (variance, SD) we used n-1. All the observations in any given sample are free except one (complementary effect).
Example: four values with total = 50. Once 7, 15 and 12 are chosen freely, the last value is restricted to 16.
df = n-1
t-test: steps to determine the statistical difference
When? Descriptive statistics: mean ± standard deviation
Number of samples:
- One sample vs. population mean
- Two independent samples
- Two dependent samples (paired t): repeated measures, matched pairs
Steps:
1- State the hypothesis to be tested. Null: mean = mean. Alternative: mean ≠ mean (non-directional, two-tailed) or mean > mean / mean < mean (unidirectional, one-tailed).
2- Find the calculated t value using the formulae.
3- Find the degrees of freedom: n-1 (for two independent samples, df = (n1-1)+(n2-1) = n1+n2-2).
4- Find the P value using the tables of the t-distribution.
5- Conclude: if P < 0.05, the null is rejected; if P > 0.05, the null is not rejected.
One sample: $t = \dfrac{\bar{X} - \mu}{SD/\sqrt{n}}$
Two independent samples: $t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{SD_1^2}{n_1} + \dfrac{SD_2^2}{n_2}}}$
Two dependent samples: $t = \dfrac{\bar{d}}{SE(\bar{d})}$
t-test (Student's t-test), one sample
$t = \dfrac{\bar{X} - \mu}{SD/\sqrt{n}}$
Using diabetes data: Is the mean age of diabetics > 65 years?
Statistics: age (years)
N: 278 valid, 0 missing
Mean: 67.24
Std. Error of Mean: .704
Std. Deviation: 11.743
Variance: 137.902
H0: μ=65
H1: μ≠65
t (one sample) = (67.24-65)/(SD/√n) = 3.18
t distribution: P=0.002. Reject the null. Diabetics are significantly older than 65 years.
One-Sample Test (Test Value = 65)
              t       df    Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference (Lower, Upper)
age (years)   3.182   277   .002              2.24              .85, 3.63
Sig. (2-tailed) is the two-sided P value; df is the degree of freedom.
Assuming that the distribution of age is normal and the population SD (σ) is unknown.
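The t value in the SPSS output can be recomputed from the summary statistics. The sketch below computes only the statistic and its degrees of freedom; the P value still has to be read from a t table or statistical software. The function name is illustrative.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, sd, n):
    """One-sample t statistic and its degrees of freedom."""
    standard_error = sd / sqrt(n)
    t_value = (sample_mean - mu0) / standard_error
    return t_value, n - 1

t_value, df = one_sample_t(67.24, 65, 11.743, 278)

print(round(t_value, 2), df)
```

The small difference from the 3.182 reported by SPSS comes from using the rounded mean 67.24.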
t-test for comparison of means of two independent samples
H0: Smoking has no effect on systolic blood pressureMean S= Mean NS or Mean S-mean NS=0
H1: smoking has an effect Mean S≠ Mean NS or Mean S-Mean NS≠0
Assumptions:•Independent observations (2 samples)•Normally distributed •Equal variances (for the pooled t-test)
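Three formulae
(1) Standardized difference:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$
The 0 is the expected difference if H0 is true; the denominator is the SD of the difference.
(2) If SDs are equal (pooled SD):
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_p^2}{n_1} + \dfrac{S_p^2}{n_2}}}, \qquad S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1-1) + (n_2-1)}$$
(3) If SDs are not equal:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$
Decision based on Levene's test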
Group Statistics: syst. blood pressure at start

| SMOKING | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| no | 214 | 153.11 | 21.995 | 1.504 |
| smokers | 64 | 144.82 | 20.934 | 2.617 |
Independent Samples Test: syst. blood pressure at start

Levene's Test for Equality of Variances: F = .006, Sig. = .936

t-test for Equality of Means:

| | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference: Lower | Upper |
|---|---|---|---|---|---|---|---|
| Equal variances assumed | 2.674 | 276 | .008 | 8.29 | 3.100 | 2.188 | 14.392 |
| Equal variances not assumed | 2.747 | 107.982 | .007 | 8.29 | 3.018 | 2.308 | 14.272 |

• Levene's test: not significant (Sig. = .936), which means equal variances. Variances are apparently equal.
• Two separate t-tests are reported (equal variances assumed / not assumed).
• t-test: P value < 0.05, reject H0.
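Both rows of the SPSS output can be recomputed from the group statistics using the pooled and unequal-variance formulae. This is a sketch under stated assumptions: the function and variable names are illustrative, and the degrees of freedom for the unequal-variance row use the Welch-Satterthwaite approximation, which the slides do not spell out but which reproduces the reported df of about 108.

```python
import math

# Group statistics from the slide (systolic blood pressure at start)
n1, mean1, sd1 = 214, 153.11, 21.995   # non-smokers
n2, mean2, sd2 = 64, 144.82, 20.934    # smokers

mean_difference = mean1 - mean2

# Formula (2): pooled variance, used when variances are assumed equal
pooled_variance = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_variance / n1 + pooled_variance / n2)
t_pooled = mean_difference / se_pooled
df_pooled = n1 + n2 - 2

# Formula (3): separate variances, used when variances are not assumed equal
v1, v2 = sd1**2 / n1, sd2**2 / n2
se_unequal = math.sqrt(v1 + v2)
t_unequal = mean_difference / se_unequal
# Welch-Satterthwaite approximation for the degrees of freedom
df_unequal = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled:  t = {t_pooled:.3f}, df = {df_pooled}")
print(f"unequal: t = {t_unequal:.3f}, df = {df_unequal:.1f}")
```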
Paired t-test
If we have paired data (two repeated measurements on the same subjects, or before and after)
If the differences of the paired observations are normally distributed
Paired samples (dependent)
(Paired / dependent 2-sample t-test)
• To compare observations collected from the same group of individuals on 2 separate occasions (dependent observations or paired samples).
• The paired t statistic is calculated by:
- Calculate the difference between the 2 measurements taken on each individual.
- Calculate the mean of the differences.
- Calculate the SE of the observed differences.
- Under the null hypothesis of no difference (difference = 0), the paired t statistic takes the form: t = Mean difference / SE of the difference.
- It follows a t-distribution with degrees of freedom = (n-1)
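$$t = \frac{\bar{d} - 0}{SE_d}$$
where $\bar{d}$ is the mean of the differences and $SE_d$ is the standard error of the differences.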
Example: Four students had the following scores in 2 subsequent tests. Is there a significant difference in their performance?

| Number | Name | Test 1 | Test 2 | Dif (Test 1 - Test 2) |
|---|---|---|---|---|
| 1 | Mike | 35% | 67% | -32 |
| 2 | Melanie | 50% | 46% | 4 |
| 3 | Melissa | 90% | 86% | 4 |
| 4 | Mitchell | 78% | 91% | -13 |

Mean Dif = -9.25, SD Dif = 17.15, SE Dif = 8.58
Calculated paired t = -9.25/8.58 = -1.078, df = n-1 = 3
Critical values of the t-distribution

| df | One-tail 0.10 / Two-tail 0.20 | One-tail 0.05 / Two-tail 0.10 | One-tail 0.025 / Two-tail 0.05 | One-tail 0.01 / Two-tail 0.02 | One-tail 0.005 / Two-tail 0.01 |
|---|---|---|---|---|---|
| 1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 |
| 2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 |
| 3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 |
| 4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 |
| 5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 |
| 6 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 |
| 7 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 |
| 8 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 |
| 9 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 |
| 10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 |
| 11 | 1.363 | 1.796 | 2.201 | 2.718 | 3.106 |
| 12 | 1.356 | 1.782 | 2.179 | 2.681 | 3.055 |
| 13 | 1.350 | 1.771 | 2.160 | 2.650 | 3.012 |
| 14 | 1.345 | 1.761 | 2.145 | 2.624 | 2.977 |
| 15 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 |
| 16 | 1.337 | 1.746 | 2.120 | 2.583 | 2.921 |
| 17 | 1.333 | 1.740 | 2.110 | 2.567 | 2.898 |
| 18 | 1.330 | 1.734 | 2.101 | 2.552 | 2.878 |
| 19 | 1.328 | 1.729 | 2.093 | 2.539 | 2.861 |
| 20 | 1.325 | 1.725 | 2.086 | 2.528 | 2.845 |
| 21 | 1.323 | 1.721 | 2.080 | 2.518 | 2.831 |
| 35 | 1.306 | 1.690 | 2.030 | 2.438 | 2.724 |
| 50 | 1.299 | 1.676 | 2.009 | 2.403 | 2.678 |
| ∞ | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 |

P value: the calculated |t| = 1.078 is smaller than 1.638 (df = 3, two-tail 0.20), so P > 0.20 and the null is not rejected.
Conclusion
The observed difference can be encountered in 36 out of 100 cases (actual P value = 0.362), i.e. we cannot reject the null hypothesis of no difference between the first and second test.
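The four-student example can be worked through step by step in Python. This is a minimal sketch with illustrative variable names; the critical value 3.182 is read from the t-table (df = 3, two-tail 0.05) rather than computed.

```python
import math
import statistics

# Scores of the four students (percent) in the two tests
test1 = [35, 50, 90, 78]
test2 = [67, 46, 86, 91]

# Step 1: difference for each individual (Test 1 - Test 2)
differences = [a - b for a, b in zip(test1, test2)]

# Step 2: mean of the differences
mean_diff = statistics.mean(differences)

# Step 3: SE of the differences (sample SD uses n-1)
n = len(differences)
sd_diff = statistics.stdev(differences)
se_diff = sd_diff / math.sqrt(n)

# Step 4: paired t statistic under H0 (difference = 0)
t_value = (mean_diff - 0) / se_diff
df = n - 1

# Two-tailed critical value for df = 3 at the 0.05 level (from the table)
t_critical = 3.182
reject_null = abs(t_value) > t_critical

print(differences, mean_diff, round(sd_diff, 2), round(se_diff, 2), round(t_value, 3), reject_null)
```

With only four pairs the test has very little power, so failing to reject the null here says little about whether a real difference exists.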
Paired Samples Statistics

| Pair 1 | Mean | N | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| syst. blood pressure at start | 151.20 | 278 | 21.997 | 1.319 |
| syst. blood pressure after 2 years | 153.83 | 278 | 29.076 | 1.744 |

Paired Samples Test (Paired Differences)

| Pair 1 | Mean | Std. Deviation | Std. Error Mean | 95% CI of the Difference: Lower | Upper | t | df | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|---|
| syst. blood pressure at start - syst. blood pressure after 2 years | -2.63 | 17.920 | 1.075 | -4.74 | -.51 | -2.443 | 277 | .015 |
Test of significance: interval/ratio data
Parametric, assuming normal distribution
• Known population variance (σ): one-sample Z-test, $Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$; rejection limit: Z beyond ±1.96
• Unknown population variance: t-test, reject if P ≤ 0.05. By number of samples:
- One sample vs. population: one-sample t-test
- Two samples, independent: independent t-test
- Two samples, dependent: paired t-test
The Chi-Square test (χ2)
Used for hypothesis testing for categorical variables
Many types, depending on design, distribution of variables and objectives of testing
Example:
Vaccination against influenza decreases the risk of getting the disease.
Study:
Compare the effectiveness of 5 vaccines with respect to the probability of getting influenza.
The comparison is with respect to a nominal variable (getting influenza: yes or no).
Effectiveness of Five Vaccines

| Vaccines | Influenza No | Influenza Yes | Total |
|---|---|---|---|
| 1 | 237 | 43 | 280 |
| 2 | 198 | 52 | 250 |
| 3 | 245 | 25 | 270 |
| 4 | 212 | 48 | 260 |
| 5 | 233 | 57 | 290 |
| Total | 1125 | 225 | 1350 |
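Frequency: % within Vaccines

| Vaccines | Influenza No | Influenza Yes | Total |
|---|---|---|---|
| 1 | 84.6 | 15.4 | 100 |
| 2 | 79.2 | 20.8 | 100 |
| 3 | 90.7 | 9.3 | 100 |
| 4 | 81.5 | 18.5 | 100 |
| 5 | 80.3 | 19.7 | 100 |
| Total | 83.3 | 16.7 | 100 |

Data cross tabulated 2X5; response variable: Influenza
The "Influenza Yes" column is the probability of getting influenza.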
The null hypothesis states that the probability of getting influenza is independent of the vaccine.
The alternative states that a dependency exists.
Effectiveness of Five Vaccines
If H0 is true: the probability of getting influenza in every group should be the same, i.e. equal to the probability in the total population:
225/1350 = 0.167 (16.7%)
Vaccine 1 was used in 280 persons; if H0 is true, we expect 16.7% (≈47) to get influenza.
However, this is not what was observed (43 got influenza).
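Expected frequencies

| Vaccines | | Influenza No | Influenza Yes | Total |
|---|---|---|---|---|
| 1 | Observed | 237 | 43 | 280 |
| | Expected | 233.3 | 46.7 | |
| 2 | Observed | 198 | 52 | 250 |
| | Expected | 208.3 | 41.7 | |
| 3 | Observed | 245 | 25 | 270 |
| | Expected | 225.0 | 45.0 | |
| 4 | Observed | 212 | 48 | 260 |
| | Expected | 216.7 | 43.3 | |
| 5 | Observed | 233 | 57 | 290 |
| | Expected | 241.7 | 48.3 | |
| Total | | 1125 | 225 | 1350 |

For any cell: Expected frequency = Row total * Column total / Grand total
Examples: 280*225/1350 = 46.7 (vaccine 1, influenza yes); 260*1125/1350 = 216.7 (vaccine 4, influenza no)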
Pearson Chi-square test
1- Calculate the expected frequencies (assuming H0 is true) for all ten cells.
2- Calculate Chi-square, where Of = observed frequency and Ef = expected frequency:
$$\chi^2 = \sum \frac{(O_f - E_f)^2}{E_f}$$
3- Determine the degree of freedom: df = (r-1)*(c-1)
4- Reject H0 if χ2 is large, using the Chi-square distribution.
Chi-square distribution
Critical values for Chi-square, by df and level of significance

| df | 0.99 | 0.90 | 0.70 | 0.50 | 0.30 | 0.20 | 0.10 | 0.05 | 0.01 | 0.001 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00016 | 0.0158 | 0.148 | 0.455 | 1.074 | 1.642 | 2.706 | 3.841 | 6.635 | 10.827 |
| 2 | 0.0201 | 0.211 | 0.713 | 1.386 | 2.408 | 3.219 | 4.605 | 5.991 | 9.210 | 13.815 |
| 3 | 0.115 | 0.584 | 1.424 | 2.366 | 3.665 | 4.642 | 6.251 | 7.815 | 11.341 | 16.268 |
| 4 | 0.297 | 1.064 | 2.195 | 3.357 | 4.878 | 5.989 | 7.779 | 9.488 | 13.277 | 18.465 |
| 5 | 0.554 | 1.610 | 3.000 | 4.351 | 6.064 | 7.289 | 9.236 | 11.070 | 15.086 | 20.517 |
| ... | | | | | | | | | | |
| 30 | 14.953 | 20.599 | 25.508 | 29.336 | 33.530 | 36.250 | 40.256 | 43.773 | 50.892 | 59.703 |
χ2 critical = 9.488 (df = 4, level of significance 0.05)
Calculated χ2 = 16.555, df = (2-1)(5-1) = 4
P = 0.002
There is a relation (dependence) between type of vaccine and influenza prevention.
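The whole vaccine calculation (expected frequencies, χ2 and df) can be sketched in plain Python. The variable names are illustrative. The P value line is an assumption worth flagging: it uses the closed-form upper-tail probability of the Chi-square distribution that holds for df = 4 only, chosen here so that no statistics library is needed.

```python
import math

# Observed counts per vaccine: [influenza no, influenza yes]
observed = [
    [237, 43],
    [198, 52],
    [245, 25],
    [212, 48],
    [233, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Pearson Chi-square: sum over all cells of (O - E)^2 / E
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(col_totals) - 1)

# Upper-tail probability; this closed form is valid for df = 4 only
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)

chi_square_critical = 9.488   # from the table: df = 4, level 0.05
reject_null = chi_square > chi_square_critical

print(f"chi-square = {chi_square:.3f}, df = {df}, P = {p_value:.4f}, reject H0: {reject_null}")
```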
SMOKING * SEX Crosstabulation

| SMOKING | | male | female | Total |
|---|---|---|---|---|
| no | Count | 90 | 124 | 214 |
| | % within SMOKING | 42.1% | 57.9% | 100.0% |
| smokers | Count | 55 | 9 | 64 |
| | % within SMOKING | 85.9% | 14.1% | 100.0% |
| Total | Count | 145 | 133 | 278 |
| | % within SMOKING | 52.2% | 47.8% | 100.0% |
Chi-Square Tests

| | Value | df | Asymp. Sig. (2-sided) | Exact Sig. (2-sided) | Exact Sig. (1-sided) |
|---|---|---|---|---|---|
| Pearson Chi-Square | 38.017 (b) | 1 | .000 | | |
| Continuity Correction (a) | 36.279 | 1 | .000 | | |
| Likelihood Ratio | 41.649 | 1 | .000 | | |
| Fisher's Exact Test | | | | .000 | .000 |
| Linear-by-Linear Association | 37.880 | 1 | .000 | | |
| N of Valid Cases | 278 | | | | |

a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.62.
At least 80% of cells must have Ef >5
We can't use Pearson Chi-square if the expected frequency is <5
In this case we use Fisher’s Exact test
status * SEX Crosstabulation (Count)

| status | male | female | Total |
|---|---|---|---|
| alive | 24 | 15 | 39 |
| died from CVD | 4 | 1 | 5 |
| other cause of death | 2 | 2 | 4 |
| Total | 30 | 18 | 48 |
Expected f = 5*18/48 = 1.875 (<5)
Expected f = 4*30/48 = 2.5 (<5)
Fisher's Exact test provides the correction
Chi-Square Tests

| | Value | df | Asymp. Sig. (2-sided) |
|---|---|---|---|
| Pearson Chi-Square | .935 (a) | 2 | .626 |
| Likelihood Ratio | .991 | 2 | .609 |
| Linear-by-Linear Association | .004 | 1 | .951 |
| N of Valid Cases | 48 | | |

a. 4 cells (66.7%) have expected count less than 5. The minimum expected count is 1.50.

Chi-square is not valid
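The check behind SPSS's footnote can be sketched as follows: compute every expected count in the status * SEX table and count how many fall below 5. The variable names are illustrative, and the 80% threshold is the rule of thumb stated on the earlier slide, not a value returned by any library.

```python
# Observed counts: rows = alive, died from CVD, other cause of death;
# columns = male, female
observed = [
    [24, 15],
    [4, 1],
    [2, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

all_expected = [e for row in expected for e in row]
cells_below_5 = sum(1 for e in all_expected if e < 5)
percent_below_5 = 100 * cells_below_5 / len(all_expected)
minimum_expected = min(all_expected)

# Rule of thumb from the slides: at least 80% of cells must have Ef > 5
pearson_valid = percent_below_5 <= 20

print(f"{cells_below_5} cells ({percent_below_5:.1f}%) below 5, "
      f"minimum expected = {minimum_expected:.2f}, Pearson valid: {pearson_valid}")
```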
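McNemar test: paired data in a cross tabulation

| | Ointment B + | Ointment B No | Total |
|---|---|---|---|
| Ointment A + | 16 | 10 | 26 |
| Ointment A No | 23 | 5 | 28 |
| Total | 39 | 15 | 54 |

54 persons with eczema on both arms; ointment A or B is assigned to each arm (randomized)
The McNemar test only takes the discordant pairs into account:
$$\chi^2 = \frac{(23-10)^2}{23+10}, \qquad df = 1$$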
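The McNemar statistic for the ointment table uses only the two discordant cells (10 and 23). The sketch below follows the uncorrected formula on the slide; the variable names are illustrative, and the P value uses the identity for the Chi-square distribution with df = 1 (upper tail = erfc(√(χ2/2))), which is an addition here and not shown on the slides.

```python
import math

# Discordant pairs from the ointment table
a_positive_b_negative = 10    # improved with A but not with B
a_negative_b_positive = 23    # improved with B but not with A

# McNemar Chi-square (no continuity correction, as on the slide)
difference = a_negative_b_positive - a_positive_b_negative
chi_square = difference ** 2 / (a_negative_b_positive + a_positive_b_negative)
df = 1

# Upper-tail probability of a Chi-square variable with df = 1
p_value = math.erfc(math.sqrt(chi_square / 2))

chi_square_critical = 3.841   # from the table: df = 1, level 0.05
reject_null = chi_square > chi_square_critical

print(f"chi-square = {chi_square:.2f}, df = {df}, P = {p_value:.3f}, reject H0: {reject_null}")
```

The concordant pairs (16 and 5) carry no information about which ointment performs better, which is why they do not appear in the formula.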
Questions
Thank you