Medical statistics Basic concept and applications [Square one]

Medical Statistics 2013

Dr Tarek Tawfik Amin

Introduction

- Questions
- Why statistics?
- The process
- The resources

How?

• Book: Statistics at Square One 11th ed. “Campbell and Swinscow”

• Practical sessions using SPSS v. 17.0 (PASW guide)

Statistics “an overview”

Data

Population

Sample

AnalysisInterpretation

Information

Parameters

Statistics

Reference range

Researches

Statistical analysis

Variables (data)

Qualitative (categorical): nominal, ordinal

Quantitative (numerical): interval/ratio; discrete or continuous

Statistical analysis: descriptive or inferential, depending on the sample(s) and the objectives of the analysis

Descriptive: tables, graphs, measures

I-Descriptive Statistics

Goals: summarizing, overview, data checking

PATNR AGE SEX SMOKE HEIGHT WEIGHT SBP1 SBP2 INSULIN CHOL HBA1C DIABDU DEAD

157001779814015406.307.625#NULL!

274101726915014515.108.30110

338101557012012606.5011.002#NULL!

473101657218015705.807.00210

5531217410914011916.8010.6070

674101718315114506.257.6270

781021756014011306.506.4060

886101645914015805.205.3040

978011718315114805.605.9011

1078101718315115915.008.00231

1191001718315114004.309.7041

1277021768717019806.406.6072

1377101718315115205.204.90261

1484001716216014807.007.8081

1572101546314514806.207.8001

diabIB

I-Tables

Frequency Contingency

SEX

          Frequency   Percent   Valid Percent   Cumulative Percent
male      145         52.2      52.2            52.2
female    133         47.8      47.8            100.0
Total     278         100.0     100.0

smoking history * SEX Crosstabulation (Count)

smoking history    male   female   Total
never              26     110      136
stopped smoking    64     14       78
yes                55     9        64
Total              145    133      278

Tables can summarize counts, frequency (categorical), measures (numerical)

For comparison (2 or more variables)

Food items (servings/day) *Subjects classificationP value a

Obese (N=91)Non-obese (N=125)

Milk Milk beverage Milk in cereals Milk in coffee or tea - Total milk Yoghurt Cheese Ice cream- Total dairy Tuna (canned) Fish Half cooked fish Shrimp/oyster Eggs Liver (including chicken livers) Others! -Dietary vitamin D (IU/day): Median

(mean ±SD) Low dietary intake c (< 200 IU/day): No.

(%)-Dietary calcium (mg/day): Median

(mean ±SD) Low calcium intake d (<1000mg/day): No.

(%)

0.52(0.71±0.3)0.45(0.59±0.4)0.20(0.33±0.2)0.15(0.25±0.6)

0.90(1.03±0.3)0.10(0.12±0.6)0.20(0.24±0.9)0.15(0.14±0.6)

0.25(0.45±0.6)0.05(0.03±0.1)0.15(0.19±0.7)0.06(0.11±0.5)0.05(0.08±0.1)0.85(0.81±1.1)0.02(0.04±0.4)0.20(0.23±0.3)

(111.6)118.1±73.5

56(62.2) (660.0)698.8±26

1.951(56.7)

0.65(0.88±0.7)0.35(0.53±0.4)0.50(0.58±0.4)0.20(0.23±0.6)

1.20(1.34±0.7)0.20(0.14±0.5)0.20(0.29±0.8)0.06(0.09±0.3)

0.30(0.43±0.7)0.03(0.04±0.3)0.10(0.18±0.5)0.25(0.27±0.6)0.05(0.06±0.1)0.80(0.76±0.7)0.05(0.06±0.3)0.40(0.55±0.5)

(123.7)132.2±67.447(37.6)

(692.0)717.9±245.949(39.2)

0.0310.2790.0010.7900.0010.7900.6610.4220.8260.7610.9020.0290.1490.7970.8340.5490.0340.003b

0.2230.011b

Table 3 Daily servings of calcium and vitamin D rich foods in relation to body mass index classification of the included adults .

Assignment I Table 1 Basic characteristics for the patients examined (N=278).

Baseline characteristics 1996                                         Total (N=278)
1 - Men (%)                                                           52.2
2 - Insulin users (%)                                                 25.5
3 - Smokers (%)                                                       23.0
4 - Ex-smokers (%)                                                    28.1
5 - Non-smokers (%)                                                   48.9
6 - Age in years (mean ±SD)                                           67.24 ±11.74
7 - Systolic blood pressure at starting point, mmHg (mean ±SD)        151.20 ±22.00
8 - Systolic blood pressure at two years, mmHg (mean ±SD)             153.83 ±29.1
9 - Duration of diabetes (median / quartiles 1-3)                     6.0 (2.75-12.25)
10 - Missed values                                                    0.0

II-Graphs

Goals: impression, comparison, data checking, clustering, trend

Figure 1Outcomes of the included diabetic patients (1996)

other cause of death

died from CVD

alive

Missing

Selection of graphs 1-Types of variables

2-Number of variables 3-Objectives

Categorical Numerical

Figure 2: Smoking status of the included diabetic patients. [Bar chart: percent (0 to 60) by smoking history (never, stopped smoking, yes).]

Figure 3: Total cholesterol level in diabetic patients 1996. [Histogram of total cholesterol in mmol/l: Mean = 6.25, Std. Dev = 1.33, N = 278.]

For numerical variables

Figure 4: Systolic blood pressure at starting point among diabetic patients 1996 (mmHg). [Boxplot by sex: male N = 145, female N = 133; y-axis: syst. blood pressure at start, 80 to 240 mmHg; a few cases are flagged as outliers.]

Figure 6: Total cholesterol level in relation to gender and smoking status among diabetic patients 1996. [Error-bar chart: 95% CI of total cholesterol (mmol/l), y-axis 5.0 to 8.5, by sex and smoking history (never, stopped smoking, yes). Group sizes: male 26, 64, 55; female 110, 14, 9.]

Figure 7: Duration of diabetes among the included patients 1996 (in years). [Histogram, 0.0 to 32.5 years: Mean = 7.9, Std. Dev = 6.96, Median = 6.0, N = 278.]

Mode

Median

Mean

Normal distribution

+-

Outliers

Checking for normality

Mode=1

III-Measures (numerical variables)

Central tendency (how the data aggregate around a central point): mean, median, mode, percentiles

Dispersion (how the data vary): range (max - min), interquartile range, variance, standard deviation, coefficient of variation

Central tendency

Mean = sum of the observations / their number: $\bar{x} = (x_1 + x_2 + x_3 + \dots + x_n)/n$. Affected by extreme values.

Mode= The most frequently occurring values in a set of observations

Median = the middle value, which divides the ordered data set 50/50. Not affected by extreme values.

Age of sample: 3, 7, 37. Median = 7; Mean = (3+7+37)/3 = 15.7

Age of sample: 3, 7, 11. Median = 7; Mean = (3+7+11)/3 = 7
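The two small age samples above can be checked with a short Python sketch using only the standard library. The variable names are illustrative and not part of the course material.

```python
from statistics import mean, median

# The two samples from the slide: one with an extreme value, one without
ages_with_extreme = [3, 7, 37]
ages_without_extreme = [3, 7, 11]

for ages in (ages_with_extreme, ages_without_extreme):
    # The median stays at 7 in both samples; only the mean moves with the extreme value
    print(ages, "mean =", round(mean(ages), 1), "median =", median(ages))
```

The median is identical in both samples, while the mean is pulled upward by the single extreme age.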

Dispersion

1 6 8 10 16 17 23 43 531

Range = 53 - 1 = 52. Affected by extreme values.

Median = 13 = 50th percentile (50% of the data lie below it)

3rd quartile = 75th percentile (75% of the data lie below it)

1st quartile = 25th percentile (25% of the data lie below it)

Interquartile range = 3rd quartile - 1st quartile = 23 - 6 = 17

The IQR is not affected by extreme values

Standard deviation and variance

Sample of 3, their ages in years: 3, 7, 17

Mean age = (3+7+17)/3 = 9

Differences from the mean (9): -6, -2, +8

The sum of the differences between the individual values and the mean = 0, so the mean deviation = 0.

To overcome the zero sum, square the differences, add them up and divide by (number - 1): Variance = [(3-9)² + (7-9)² + (17-9)²]/(3-1) = 104/2 = 52

The amount of dispersion around the mean = 52 years² (wrong scale)

Hence we need to convert back to the usual (natural) scale using the standard deviation: SD = √Variance = 7.2 years

The sample disperses around the mean (9 years) by 7.2 years in both directions
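As a check on the arithmetic above, here is a minimal Python sketch with the standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator. The variable names are illustrative.

```python
from statistics import mean, stdev, variance

ages = [3, 7, 17]  # sample of three ages from the slide

# Deviations from the mean always sum to zero, which is why they are squared
deviations = [x - mean(ages) for x in ages]

sample_variance = variance(ages)  # sum of squared deviations / (n - 1)
sample_sd = stdev(ages)           # square root of the variance, back in years

print("deviations:", deviations, "sum =", sum(deviations))
print("variance =", sample_variance, "years^2; SD =", round(sample_sd, 1), "years")
```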

Description of a binary (dichotomous) variable

o A binary variable: Has only two outcomes (diseased or not diseased).

o The proportion of the population that is diseased (at certain point of time) is called prevalence.

o The new cases occurring is called incidence.

Dichotomous variables

Prevalence= All cases (new or old)/at risk population

Incidence= New cases/total population at risk

Probability and Odds

o Odds = chance.

o In a population of 1000, 200 have a certain disease.

o When we randomly take one person out, the probability that this person is diseased = 200/1000 = 0.2 (this is the probability).

o The chance (the odds) that this person is diseased = probability of having the disease / probability of not having the disease.

o Odds = P/(1-P), where P is the probability of disease: 0.2/0.8 = 1/4, so the odds are 1 to 4.

The following table shows the outcomes of an isoniazid/placebo trial among children with HIV (death within 6 months).

Intervention   Dead (within 6 months)   Alive   Total
Placebo        21                       110     131
Isoniazid      11                       121     132

What is the risk of dying?

Risk (placebo) = 21/131 = 0.160

Risk (isoniazid) = 11/132 = 0.083

Absolute risk difference (ARD) = risk in placebo - risk in isoniazid = 0.077

Net relative risk (NRR) = risk in placebo / risk in isoniazid = (21/131)/(11/132) ≈ 1.92

Relative risk reduction (RRR) = (risk in placebo - risk in isoniazid) / risk in placebo = 0.077/0.160 = 0.48

Number needed to treat (NNT) = 1/ARD = 1/0.077 = 13
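The risk measures for the isoniazid trial can be reproduced with a few lines of Python. This is only a sketch: the function name `risk_measures` and its argument names are made up for illustration.

```python
def risk_measures(events_control, n_control, events_treated, n_treated):
    """Return the basic risk measures for a two-arm trial (illustrative helper)."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    ard = risk_control - risk_treated          # absolute risk difference
    return {
        "risk_control": risk_control,
        "risk_treated": risk_treated,
        "ARD": ard,
        "RR": risk_control / risk_treated,     # risk in placebo / risk in isoniazid
        "RRR": ard / risk_control,             # relative risk reduction
        "NNT": 1 / ard,                        # number needed to treat
    }

# Placebo: 21 deaths out of 131; isoniazid: 11 deaths out of 132
trial = risk_measures(21, 131, 11, 132)
for name, value in trial.items():
    print(f"{name}: {value:.3f}")
```

Working from the unrounded risks gives a ratio of about 1.92; dividing the rounded risks 0.160/0.083 gives the slightly larger 1.928.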

Odds ratio (OR)

o An odds ratio (OR) is a measure of association between an exposure and an outcome.

o The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.

o Odds ratios are most commonly used in case-control studies, however they can also be used in cross-sectional and cohort study designs as well (with some modifications and/or assumptions).

Basic structure of case-control design

[Diagram: at the starting point (present time) a sample of diseased subjects (cases) and disease-free subjects (controls) is drawn from the population. Both groups are traced back in time and each subject is classified as exposed or unexposed to the factor (cells a, b, c, d). The odds ("chance") of exposure is calculated and compared between both groups.]

Calculation

Case control study

               Diseased                   Not diseased                    Total
Exposed        Cases, exposed (a)         Exposed, not diseased (b)       a+b
Non-exposed    Cases, not exposed (c)     Not exposed, not diseased (d)   c+d

Odds ratio = (a/c) ÷ (b/d) = ad/bc

= odds of exposure among the diseased / odds of exposure among the non-diseased

OR = 1: exposure does not affect the odds of the outcome

OR > 1: exposure is associated with higher odds of the outcome

OR < 1: exposure is associated with lower odds of the outcome

Odds ratio

Case control study

           Lung cancer   No lung cancer   Total
Smoking    a = 80        b = 30           110
None       c = 20        d = 70           90

ad = 80 x 70 = 5600; bc = 30 x 20 = 600

OR = 5600/600 = 9.3

Or (80/20) ÷ (30/70) = 9.3
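The cross-product calculation can be written as a one-line function. A minimal sketch, with the hypothetical name `odds_ratio` and the cell labels a, b, c, d used in the table:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: a, b = exposed cases/controls; c, d = unexposed cases/controls."""
    return (a * d) / (b * c)

# Smoking and lung cancer example: a = 80, b = 30, c = 20, d = 70
or_lung = odds_ratio(80, 30, 20, 70)
print("OR =", round(or_lung, 2))
```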

Basic structure of cohort study

[Diagram: at the starting point (present time) a disease-free sample is drawn from the population and divided into those exposed and those unexposed to the factor. Both groups are followed into the future: the exposed either develop disease (a) or stay disease-free (b); the unexposed either develop disease (c) or stay disease-free (d). The incidence of disease is compared between the groups.]

The Relative Risk is calculated for exposure

Relative risk (RR)

Mammography

Breast cancer No breast cancer

Total

Positive a-10b-90100

Negative c-20d-998980100,100

In Cohort design

RR = [a/(a+b)] ÷ [c/(c+d)] = (10/100) ÷ (20/100,100) = 0.1/0.0002 = 500

The relative risk (RR)

Cohort study

              Lung cancer   No lung cancer   Total
Smokers       18            582              600
Non-smokers   6             1194             1200

Risk for smokers = 18/600 = 0.03

Risk for non-smokers = 6/1200 = 0.005

RR = 0.03/0.005 = 6
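The same cohort calculation as a small Python sketch. The function name `relative_risk` is illustrative; a, b are the exposed with and without disease, and c, d the unexposed with and without disease.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a cohort 2x2 table (risk in exposed / risk in unexposed)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Smokers: 18 of 600 develop lung cancer; non-smokers: 6 of 1200
rr_smoking = relative_risk(18, 582, 6, 1194)
print("RR =", round(rr_smoking, 2))
```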

The Odds ratio (OR)

Case control study

              Lung cancer   No lung cancer   Total
Smokers       80            30               110
Non-smokers   20            70               90

Odds for smokers = 80/30 = 2.67

Odds for non-smokers = 20/70 = 0.29

OR = (80 x 70)/(30 x 20) = 9.33

Assignment I Table 1 Basic characteristics for the patients examined (N=278).

Baseline characteristics 1996                                         Total (N=278)
1 - Men (%)                                                           52.2
2 - Insulin users (%)                                                 25.5
3 - Smokers (%)                                                       23.0
4 - Ex-smokers (%)                                                    28.1
5 - Non-smokers (%)                                                   48.9
6 - Age in years (mean ±SD)                                           67.24 ±11.74
7 - Systolic blood pressure at starting point, mmHg (mean ±SD)        151.20 ±22.00
8 - Systolic blood pressure at two years, mmHg (mean ±SD)             153.83 ±29.1
9 - Duration of diabetes (median / quartiles 1st-3rd)                 6.0 (2.75-12.25)
10 - Missed values                                                    --

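Figure 2a: Smoking history (all subjects). [Bar chart, percent by smoking history: smokers 23%, ex-smokers 28%, never smoked 49%.]

Figure 2b: Smoking history by sex. [Clustered bar chart, percent by smoking history within each sex: male 38% yes, 44% stopped smoking, 18% never; female 7% yes, 11% stopped smoking, 83% never.]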

Figure 3a: Age using a bar chart (mean used as summary). [Bar chart of mean age (years) by sex; y-axis 64 to 70.]

Figure 3b: Boxplot of age by sex. [Male N = 145, female N = 133; y-axis: age (years), 0 to 120; one labelled point (195) lies apart from the rest.]

The boxplot shows the data distribution and allows checking for outliers.

Figure 4a: Height of the included subjects. [Histogram of height (cm): Mean = 170.5, Std. Dev = 8.89, Median = 170.55 cm, N = 278.]

Figure 4b: Duration of diabetes. [Histogram, 0.0 to 32.5 years: Mean = 7.9, Std. Dev = 6.96, Median = 6.0 years, N = 278.]

syst. blood pressure at start

Value   Frequency   Percent   Valid Percent   Cumulative Percent
100     1           .4        .4              .4
110     1           .4        .4              .7
112     2           .7        .7              1.4
115     1           .4        .4              1.8
116     2           .7        .7              2.5
120     21          7.6       7.6             10.1
121     2           .7        .7              10.8
122     1           .4        .4              11.2
124     1           .4        .4              11.5
125     6           2.2       2.2             13.7
127     1           .4        .4              14.0
130     16          5.8       5.8             19.8
131     1           .4        .4              20.1
132     2           .7        .7              20.9
134     1           .4        .4              21.2
135     11          4.0       4.0             25.2
136     1           .4        .4              25.5
137     2           .7        .7              26.3
139     1           .4        .4              26.6
140     28          10.1      10.1            36.7
141     2           .7        .7              37.4
144     4           1.4       1.4             38.8
145     12          4.3       4.3             43.2
147     1           .4        .4              43.5
148     1           .4        .4              43.9
150     31          11.2      11.2            55.0
151     1           .4        .4              55.4
151*    23          8.3       8.3             63.7
152     1           .4        .4              64.0
153     1           .4        .4              64.4
155     2           .7        .7              65.1
158     1           .4        .4              65.5
160     21          7.6       7.6             73.0
161     1           .4        .4              73.4
162     1           .4        .4              73.7
163     1           .4        .4              74.1
164     1           .4        .4              74.5
165     5           1.8       1.8             76.3
167     1           .4        .4              76.6
168     2           .7        .7              77.3
170     14          5.0       5.0             82.4
171     1           .4        .4              82.7
172     2           .7        .7              83.5
175     4           1.4       1.4             84.9
176     1           .4        .4              85.3
177     1           .4        .4              85.6
178     1           .4        .4              86.0
179     2           .7        .7              86.7
180     14          5.0       5.0             91.7
182     2           .7        .7              92.4
184     1           .4        .4              92.8
185     1           .4        .4              93.2
187     1           .4        .4              93.5
189     1           .4        .4              93.9
190     6           2.2       2.2             96.0
194     1           .4        .4              96.4
195     1           .4        .4              96.8
200     2           .7        .7              97.5
205     1           .4        .4              97.8
209     1           .4        .4              98.2
210     3           1.1       1.1             99.3
216     1           .4        .4              99.6
220     1           .4        .4              100.0
Total   278         100.0     100.0

* The value 151 is listed twice in the output; the second row probably holds a non-integer value displayed as 151.

5a - Using the frequency table: P95 ≈ 189-190 mmHg (the cumulative percent passes 95% between 189 and 190)

P95, P5 = Mean ± Z score at the specified percentile x standard deviation

P95 SBP1 = 151.2 + 1.645(22.0) = 187.4 mmHg

Probability distribution of the normal curve: page 180
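Instead of reading the Z score from the printed table, the normal-curve percentile can be computed directly. A minimal sketch using `statistics.NormalDist` from the Python standard library, assuming systolic blood pressure is roughly normal with the mean and SD quoted above (the variable names are illustrative):

```python
from statistics import NormalDist

# Normal approximation to SBP at start: mean 151.2 mmHg, SD 22.0 mmHg
sbp = NormalDist(mu=151.2, sigma=22.0)

p95 = sbp.inv_cdf(0.95)  # mean + 1.645 * SD
p5 = sbp.inv_cdf(0.05)   # mean - 1.645 * SD

print("P95 =", round(p95, 1), "mmHg; P5 =", round(p5, 1), "mmHg")
```

The result is only as good as the normality assumption, which is why it differs slightly from the value read from the frequency table.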



5b - P5 for duration of diabetes

duration of diabetes (years)

Value   Frequency   Percent   Valid Percent   Cumulative Percent
0       12          4.3       4.3             4.3
1       35          12.6      12.6            16.9
2       22          7.9       7.9             24.8
3       21          7.6       7.6             32.4
4       24          8.6       8.6             41.0
5       20          7.2       7.2             48.2
6       23          8.3       8.3             56.5
7       19          6.8       6.8             63.3
8       6           2.2       2.2             65.5
9       6           2.2       2.2             67.6
10      6           2.2       2.2             69.8
11      13          4.7       4.7             74.5
12      2           .7        .7              75.2
13      7           2.5       2.5             77.7
14      6           2.2       2.2             79.9
15      5           1.8       1.8             81.7
16      11          4.0       4.0             85.6
17      8           2.9       2.9             88.5
18      6           2.2       2.2             90.6
19      5           1.8       1.8             92.4
20      3           1.1       1.1             93.5
21      5           1.8       1.8             95.3
22      2           .7        .7              96.0
23      2           .7        .7              96.8
25      3           1.1       1.1             97.8
26      1           .4        .4              98.2
27      1           .4        .4              98.6
28      2           .7        .7              99.3
31      1           .4        .4              99.6
32      1           .4        .4              100.0
Total   278         100.0     100.0

Using the frequency table: the cumulative percent passes 5% at 1 year.

Or using the formula: Mean - Z score (1.645) x SD = 7.9 - 1.645(6.96) = -3.6 years (an impossible value, because the duration of diabetes is not normally distributed)

Total population n = 278, μ = 67.24 years, σ = 11.743 years

28 samples of 150 from a total population of 278 (age in years)

Sample no.          Mean     SD
1                   67.6     12.07
2                   67.13    11.81
3                   67       11.98
4                   67.8     11.63
5                   66.33    11.44
6                   67.44    11.95
7                   67.84    12.42
8                   66.59    11.36
9                   67       12
10                  66.38    11.9
11                  68.06    12.06
12                  67.61    11.02
13                  67.31    11.33
14                  66.44    11.91
15                  66.87    11.26
16                  66.8     11.5
17                  66.73    12.37
18                  66.38    11.77
19                  67.03    11.22
20                  66.58    12.13
21                  66.81    11.55
22                  66.58    12.21
23                  67.2     11.61
24                  66.48    11.48
25                  67.53    12.1
26                  67.58    10.6
27                  67       11.91
28                  67.31    11.59
Mean of the means   67.03    11.3

[Chart: mean and SD of age in years plotted for each of the 28 samples; y-axis 0 to 80.]

Population and Sample

o In scientific research we want to make a statement (conclusion) about the population.

o Studying the whole population is usually impossible in terms of money, time and labor.

o Instead, we take a random sample from the population and infer the needed conclusions from the sample data.

o The task of statistics is to quantify the uncertainty (how well the sample really represents that population).

The concept of sampling

Study population:Sampling units

You select a few sampling unitsfrom the study population

Sample

You collect informationfrom these people to find answers to your research questions.

You make an estimate “prediction” extrapolated to the study population

(prevalence, outcomes etc.)

What would be the mean systolic blood pressure of older subjects (65+) in Al Hassa? The population mean (μ) is unknown.

[Diagram: example values 175, 165, 180 and 155 drawn from the population.]

From our sample we calculate an estimate of the population parameter

The good sample (the estimator)

Should be :

Unbiased:

The mean of sample = population mean

Precise: (narrow dispersion about the mean)

The dispersion in repeated samples is small

This is a dream

Sampling error

Four individuals: A = 18 years, B = 20 years, C = 23 years, D = 25 years. Their mean age = (18+20+23+25)/4 = 86/4 = 21.5 years (population mean μ).

Sampling two individuals (6 possible samples):

A+B = (18+20)/2 = 19.0 years
A+C = (18+23)/2 = 20.5 years
A+D = (18+25)/2 = 21.5 years
B+C = (20+23)/2 = 21.5 years
B+D = (20+25)/2 = 22.5 years
C+D = (23+25)/2 = 24.0 years

Sampling three individuals (4 possible samples):

A+B+C = (18+20+23)/3 = 20.33 years
A+B+D = (18+20+25)/3 = 21.00 years
A+C+D = (18+23+25)/3 = 22.00 years
B+C+D = (20+23+25)/3 = 22.67 years

If C = 32 (instead of 23) years and D = 40 (instead of 25) years, the population mean becomes 27.5 years: the sampling error is then -8.5 to +8.5 years for samples of 2 and -3.17 to +4.17 years for samples of 3.

Sampling error = population mean - sample mean. For samples of two it ranges from -2.5 to +2.5 years.

For samples of three it ranges from -1.17 to +1.17 years.

The greater the variability of a given variable the larger the sampling error for a given sample size.
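The four-person example is small enough to enumerate every possible sample. The sketch below does that with `itertools.combinations`; the function name `sampling_errors` is hypothetical.

```python
from itertools import combinations
from statistics import mean

def sampling_errors(ages, sample_size):
    """Population mean minus sample mean, for every possible sample of the given size."""
    mu = mean(ages)
    return [mu - mean(sample) for sample in combinations(ages, sample_size)]

ages = [18, 20, 23, 25]  # individuals A, B, C, D; population mean 21.5 years

errors_2 = sampling_errors(ages, 2)  # 6 possible samples of two
errors_3 = sampling_errors(ages, 3)  # 4 possible samples of three

print("n=2:", round(min(errors_2), 2), "to", round(max(errors_2), 2))
print("n=3:", round(min(errors_3), 2), "to", round(max(errors_3), 2))
```

Re-running the function with more spread-out ages widens both ranges, and the larger sample size narrows them.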

Over an infinite number of samples, a good estimator should represent the population the samples came from.

2

o The normal distribution

o The standard error of the mean

o Estimation:

- Reference interval
- Confidence intervals for: a mean; a proportion; the difference between means/proportions; RR and OR


Normal distribution: many human traits, such as intelligence, personality and attitudes, as well as weight and height, are distributed among populations in a fairly normal way.

The normal distribution

About 68% of values lie within μ ± 1 SD (σ)

About 95% of values lie within μ ± 2 SD (σ)

Beyond 2 SDs: possible outliers

Beyond 3 SDs: definite outliers

One more: the Z score, which measures how many standard deviations a particular data point is above or below the mean.

o Unusual observations would have a Z score above +2 or below -2.

o Extreme observations would have Z scores above +3 or below -3 and should be investigated as potential outliers.

$Z = \dfrac{X_i - \bar{X}}{s}$

Areas under the standard normal curve.

Z        Area between both points    Beyond both points   Beyond one point
         (around the mean)           (two tails)          (one tail)
±0.1     0.080                       0.920                0.4600
±0.2     0.159                       0.841                0.4205
±0.3     0.236                       0.764                0.3820
±0.4     0.311                       0.689                0.3445
±0.5     0.383                       0.617                0.3085
±0.6     0.451                       0.549                0.2745
±0.7     0.516                       0.484                0.2420
±0.8     0.576                       0.424                0.2120
±0.9     0.632                       0.368                0.1840
±1       0.683                       0.317                0.1585
±1.1     0.729                       0.271                0.1355
±1.2     0.770                       0.230                0.1150
±1.3     0.806                       0.194                0.0970
±1.4     0.838                       0.162                0.0810
±1.5     0.866                       0.134                0.0670
±1.6     0.890                       0.110                0.0550
±1.645   0.900                       0.100                0.0500
±1.7     0.911                       0.089                0.0445
±1.8     0.928                       0.072                0.0360
±1.9     0.943                       0.057                0.0290
±1.96    0.950                       0.050                0.0250
±2       0.954                       0.046                0.0230
±2.1     0.964                       0.036                0.0180
±2.2     0.972                       0.028                0.0140
±2.3     0.979                       0.021                0.0105
±2.4     0.984                       0.016                0.0080
±2.576   0.990                       0.010                0.0050

Calculating values from Z-scores

Xi = Mean± Z (standard deviation).

Value (percentiles) =Mean± Z score*(SD)

Random sample for estimating a population mean

μ?

X1=128

X2=133

X3=129

From the information in the sample, we will estimate the unknown population mean (the sample mean X̄ is an estimator for μ). What could have happened if we had another random sample?

What is the measure of variation of sample means?

The Sampling Distribution of a Sample Statistics

≈ Let's assume that we want to survey a community of 400 people whose ages have been recorded, with the following parameters:

µ = 35 years σ = 13 years

≈ Let’s assume, however, that we do not survey all 400, instead we randomly select 120 people and ask them about their ages and calculate the mean age.

≈ Then, we put them back into the community and randomly select another 120 residents (may include members of the first sample).

≈ We did this over and over and each time we calculate the mean age.

≈ The results will be like those in the following table.

Distribution of 20 random sample means (n=20)

Sample number     Sample mean
1                 34.7
2                 35.9
3                 35.5
4                 34.7
5                 34.5
6                 34.4
7                 35.7
8                 34.6
9                 37.4
10                35.3
11                34.1
12                35.5
13                34.9
14                36.2
15                35.6
16                35.0
17                35.1
18                36.4
19                35.6
20                33.6
SD of the means   13.37

[Dot plot of the 20 sample means on an axis from 33 to 37 years.]

All the results are clustered around the population value (35 years), with a few scores a bit further out and one extreme score of 37.4 years (random variation=1/20=5%).

Those 400 people have age range from 2 to 69 years ,while the means of the samples have a very narrow range of value of about 4 years and 10 samples coincide with the population mean (35 years).

μ

Most of the samples will cluster around the population parameter, with an occasional sample result falling relatively further to one side or the other of the distribution (this is called the sampling distribution of sample means). It has the following properties:

- The mean of the sampling distribution is equal to the population mean: the average of the averages ($\mu_{\bar{x}}$) will be the same as the population mean.
- The standard deviation of the sample means = the standard error, SE = σ/√n (σ = population SD).
- The distribution of the sample means is Normal if the population distribution is Normal.
- If the population distribution is not Normal, the distribution of the sample means is almost Normal when n is large (Central Limit Theorem).
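These properties can be illustrated with a small simulation. The sketch below is not the course's data: it invents a hypothetical community of 400 ages (roughly μ = 35, σ = 13), draws many samples of 120 with replacement, and compares the SD of the sample means with σ/√n. The fixed seed and the number of repetitions are arbitrary choices.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical community of 400 ages, roughly matching mu = 35 and sigma = 13
population = [random.gauss(35, 13) for _ in range(400)]
mu = mean(population)
sigma = pstdev(population)  # population SD

n = 120  # size of each sample
# Draw 2000 samples (with replacement) and keep each sample mean
sample_means = [mean(random.choices(population, k=n)) for _ in range(2000)]

observed_se = stdev(sample_means)  # SD of the sample means
expected_se = sigma / n ** 0.5     # SE = sigma / sqrt(n)

print("population mean:", round(mu, 2), "mean of means:", round(mean(sample_means), 2))
print("observed SE:", round(observed_se, 2), "expected SE:", round(expected_se, 2))
```

Sampling with replacement is used so that SE = σ/√n applies directly; sampling 120 of 400 without replacement would give a somewhat smaller spread.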

PopulationParameters

Mean S.D

Sample

Mean S.D

Standard error of the mean

The degree to which the sample statistics deviate from the population parameters.

The term "error" reflects the fact that, due to sampling error, each sample mean is likely to deviate somewhat from the true population mean.

Sample mean

Central Limit Theorem

The formula for SE = SD/√n. The formula indicates that we are estimating the SE given the S.D of a sample of size n.

For a sample of 100 and S.D of 40, the SE = 40/√100 = 4.

For a sample of 1000 and S.D of 40, the SE = 40/√1000 = 1.26.

Two factors influence the SE: the sample size and the S.D of the sample.

Sample size has the greater impact, as it appears in the denominator.

For a sample of 100 and S.D of 20, the SE = 20/√100 = 2.

For a sample of 100 and S.D of 40, the SE = 40/√100 = 4.

The more variability there is within a sample, the greater the SE.

Confidence Interval (CI)

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.

We need to know the smallest and the largest μ (range) we think is likely, using sample statistics. The sample mean is our point estimate of μ.

c = level of confidence    Z c = Z critical value (under normal curve)
90%                        1.645
95%                        1.960
99%                        2.576

C.I = mean of the sample ± Z c x SEM, where SEM = SD/√n

C.I

• The confidence interval provides a range that is highly likely (often 95% or 99%) to contain the true population parameter that is being estimated.

• The narrower the interval the more informative is the result.

• It is usually calculated using the estimate (sample mean) and its standard error (SEM).

CI for μ: systolic blood pressure in 278 diabetic patients

Descriptives: syst. blood pressure at start (all patients), 90% Confidence Interval for Mean
Mean 151.20 (Std. Error 1.319); 90% CI for mean: lower bound 149.02, upper bound 153.38
5% Trimmed Mean 150.30; Median 150.00; Variance 483.880; Std. Deviation 21.997
Minimum 100; Maximum 220; Range 120; Interquartile Range 30.00
Skewness .540 (Std. Error .146); Kurtosis .152 (Std. Error .291)

90% C.I = 151.20 ± 1.65 (21.997/√278) = 149.02 to 153.38 mmHg

Descriptives: syst. blood pressure at start (random sample of 50 out of 278), 90% Confidence Interval for Mean
Mean 155.06 (Std. Error 3.064); 90% CI for mean: lower bound 149.92, upper bound 160.20
5% Trimmed Mean 154.72; Median 151.20; Variance 460.033; Std. Deviation 21.448
Minimum 115; Maximum 205; Range 90; Interquartile Range 30.00
Skewness .263 (Std. Error .340); Kurtosis -.506 (Std. Error .668)

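Descriptives: syst. blood pressure at start (all patients), 95% Confidence Interval for Mean
Mean 151.20 (Std. Error 1.319); 95% CI for mean: lower bound 148.60, upper bound 153.80
5% Trimmed Mean 150.30; Median 150.00; Variance 483.880; Std. Deviation 21.997
Minimum 100; Maximum 220; Range 120; Interquartile Range 30.00
Skewness .540 (Std. Error .146); Kurtosis .152 (Std. Error .291)

95% C.I = 151.20 ± 1.96 (21.997/√278) = 148.60 to 153.80 mmHg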

Descriptives: syst. blood pressure at start (random sample of 50 out of 278), 95% Confidence Interval for Mean
Mean 155.06 (Std. Error 3.064); 95% CI for mean: lower bound 148.90, upper bound 161.22
5% Trimmed Mean 154.72; Median 151.20; Variance 460.033; Std. Deviation 21.448
Minimum 115; Maximum 205; Range 90; Interquartile Range 30.00
Skewness .263 (Std. Error .340); Kurtosis -.506 (Std. Error .668)

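Descriptives: syst. blood pressure at start (all patients), 99% Confidence Interval for Mean
Mean 151.20 (Std. Error 1.319); 99% CI for mean: lower bound 147.78, upper bound 154.62
5% Trimmed Mean 150.30; Median 150.00; Variance 483.880; Std. Deviation 21.997
Minimum 100; Maximum 220; Range 120; Interquartile Range 30.00
Skewness .540 (Std. Error .146); Kurtosis .152 (Std. Error .291)

99% C.I = 151.20 ± 2.58 (21.997/√278) = 147.78 to 154.62 mmHg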

Descriptives: syst. blood pressure at start (random sample of 50 out of 278), 99% Confidence Interval for Mean
Mean 155.06 (Std. Error 3.064); 99% CI for mean: lower bound 146.84, upper bound 163.28
5% Trimmed Mean 154.72; Median 151.20; Variance 460.033; Std. Deviation 21.448
Minimum 115; Maximum 205; Range 90; Interquartile Range 30.00
Skewness .263 (Std. Error .340); Kurtosis -.506 (Std. Error .668)

90% C.I = 151.20 ± 1.65 (21.997/√278) = 149.02 to 153.38 mmHg

95% C.I = 151.20 ± 1.96 (21.997/√278) = 148.60 to 153.80 mmHg

99% C.I = 151.20 ± 2.58 (21.997/√278) = 147.78 to 154.62 mmHg

What does this mean? It means that if the same population is sampled on numerous occasions and interval estimates are made on each occasion, the resulting intervals would bracket the true population parameter in approximately 90, 95 and 99% of the cases, respectively.
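The three intervals can be recomputed from the summary statistics. The sketch below uses the normal (Z) critical values via `statistics.NormalDist`; the function name `ci_mean` is illustrative, and n = 278 is the number of patients in the data set. SPSS uses the t-distribution, so its bounds differ slightly in the second decimal.

```python
from statistics import NormalDist

def ci_mean(sample_mean, sd, n, level=0.95):
    """Z-based confidence interval for a mean: mean +/- Z critical * SD / sqrt(n)."""
    z_critical = NormalDist().inv_cdf(0.5 + level / 2)
    sem = sd / n ** 0.5
    return sample_mean - z_critical * sem, sample_mean + z_critical * sem

# Systolic blood pressure at start: mean 151.20, SD 21.997, n = 278
for level in (0.90, 0.95, 0.99):
    lower, upper = ci_mean(151.20, 21.997, 278, level)
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f} mmHg")
```

The interval widens as the confidence level increases, which is the pattern shown in the three SPSS outputs.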

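The sampling distribution of a proportion

$p = K/n$ (K = number with the characteristic, n = sample size)

$SE(p) = \sqrt{p(1-p)/n}$

$95\%\ CI = p \pm 1.96 \times SE(p)$ (1.96 is the Z critical score for 95%)

Smokers among diabetics

Sample = 400; smokers = 40; p = 40/400 = 0.1; SE(p) = √(0.1 x 0.9/400) = 0.015

95% CI for p = 0.1 ± 1.96(0.015) = [0.07 to 0.13]

Expressed as percentages the result is the same: SE = 1.5%, C.I = [7% to 13%]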
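A short sketch of the same calculation in Python; the function name `ci_proportion` and its arguments are illustrative.

```python
def ci_proportion(k, n, z=1.96):
    """Proportion, its standard error and a Z-based confidence interval."""
    p = k / n
    se = (p * (1 - p) / n) ** 0.5
    return p, se, (p - z * se, p + z * se)

# 40 smokers in a sample of 400 diabetics
p, se, (lower, upper) = ci_proportion(40, 400)
print(f"p = {p:.2f}, SE = {se:.3f}, 95% CI = {lower:.2f} to {upper:.2f}")
```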

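95% CI for the difference between two means (μ1-μ2)

Smoke   n     Mean SBP   SE (mean)
No      214   153.1      1.50
Yes     64    144.8      2.62

Difference = 8.3

$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{SE(\bar{x}_1)^2 + SE(\bar{x}_2)^2} = \sqrt{1.50^2 + 2.62^2} = 3.02$

$95\%\ CI = (\bar{x}_1 - \bar{x}_2) \pm 1.96 \times SE(\bar{x}_1 - \bar{x}_2) = 8.3 \pm 1.96 \times 3.02$

C.I = 2.4 to 14.2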
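The interval for the difference can be checked from the two standard errors. A minimal sketch; `ci_difference` is a hypothetical helper name.

```python
def ci_difference(estimate1, se1, estimate2, se2, z=1.96):
    """Difference between two independent estimates with a Z-based confidence interval."""
    diff = estimate1 - estimate2
    se_diff = (se1 ** 2 + se2 ** 2) ** 0.5  # SEs combine as the root of the summed squares
    return diff, se_diff, (diff - z * se_diff, diff + z * se_diff)

# Mean SBP: non-smokers 153.1 (SE 1.50) vs smokers 144.8 (SE 2.62)
diff, se_diff, (lower, upper) = ci_difference(153.1, 1.50, 144.8, 2.62)
print(f"difference = {diff:.1f}, SE = {se_diff:.2f}, 95% CI = {lower:.1f} to {upper:.1f}")
```

Because the interval does not include 0, the data are not compatible with equal mean blood pressure in the two groups at the 5% level.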

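95% CI for the difference between two percentages

Smoke (n)   % died   SE
No (212)    28.8     3.11
Yes (64)    23.4     5.30

Difference = 5.4%

$SE(P_{ns} - P_{s}) = \sqrt{\dfrac{P_{ns}(100 - P_{ns})}{n_{ns}} + \dfrac{P_{s}(100 - P_{s})}{n_{s}}}$ (ns = non-smokers, s = smokers)

$95\%\ CI = (P_{ns} - P_{s}) \pm 1.96 \times SE(P_{ns} - P_{s})$

95% C.I = -6.7% to 17.4%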

95% CI for RR and OR

Use available software

http://www.medcalc.org/calc/relative_risk.php

http://www.medcalc.org/calc/odds_ratio.php

vl.academicdirect.org/applied_statistics/.../CIcalculator.xls
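For readers who want to see what such calculators do, the sketch below computes a 95% CI for an odds ratio with the common log (Woolf) method, SE(ln OR) = √(1/a + 1/b + 1/c + 1/d). This is an assumed method chosen for illustration; the linked tools may use a different one. The function name `odds_ratio_ci` is hypothetical, and the numbers are the smoking and lung cancer case-control table used earlier.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a confidence interval computed on the log scale (Woolf method)."""
    or_estimate = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(or_estimate) - z * se_log_or)
    upper = exp(log(or_estimate) + z * se_log_or)
    return or_estimate, (lower, upper)

# a = 80 exposed cases, b = 30 exposed controls, c = 20 unexposed cases, d = 70 unexposed controls
or_estimate, (or_lower, or_upper) = odds_ratio_ci(80, 30, 20, 70)
print(f"OR = {or_estimate:.2f}, 95% CI = {or_lower:.2f} to {or_upper:.2f}")
```

The interval is not symmetric around the OR because it is built on the log scale and then transformed back.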

Assignment II

Inferential StatisticsTesting in research

o In scientific research we would like to test if our research ideas are true.

o Based on previous observations (studies) we know that the mean cholesterol of patients with diabetes is higher than those without the disease.

o We will take samples and check whether the results will agree with our expectations.

o Meaning we are going to test the situation using a statistical test.

The Z-test for one sample

Serum cholesterol in the diabetes-free population: μ = 5 mmol/L, σ = 1.5

Diabetic patients: mean cholesterol > 5?

Is there any difference between the diabetes-free population and the diabetic patients regarding serum cholesterol, considering σ = 1.5? Let's perform a Z test.

Research question (hypothesis)

The research hypothesis would be: the mean cholesterol of diabetics is > 5 mmol/L

Null hypothesis H0: μ (diabetics) = 5

Alternative hypothesis H1: μ > 5 (one sided)

Or H1: μ ≠ 5 (two sided)

Procedure

Cholesterol level of diabetic patients in mmol/L. [Histogram of total cholesterol, 3.00 to 13.00 mmol/L: Mean = 6.25, Std. Dev = 1.33, N = 278. The population value μ = 5 and the mean of the sample are marked.]

If the sample mean is close to the population mean: we do not reject the null hypothesis

If the sample mean differs from the population mean: we REJECT the null

The P value

The probability of obtaining a sample result at least as extreme as the one observed if the null hypothesis is true (that is, if there is no difference between the population mean and the mean of the population the sample came from).

The α level

The maximum probability we accept of rejecting the null hypothesis falsely

α = 0.05

P ≤ 0.05 (α): reject the null (sample mean ≠ population mean)

P > 0.05 (α): do not reject the null (the sample mean is compatible with the population mean)

Calculation (σ = 1.5)

SEM = σ/√n = 0.3; Z = (sample mean - μ)/SEM

P (mean of the sample ≥ 6) = P(Z ≥ (6-5)/0.3) = P(Z ≥ 3.33) ≈ 0.0005. Under the normal curve the area of rejection is Z > 1.96.

P = 0.0005: if the diabetic patients had the same mean cholesterol as the disease-free population, a sample mean this high would be seen only about 5 times if we repeated this test 10,000 times.

P < 0.05, so we reject the null. The diabetics have a larger mean cholesterol level than the normal population.
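The Z-test arithmetic can be sketched as follows. The sample size n = 25 is an assumption inferred from SEM = σ/√n = 0.3 with σ = 1.5; it is not stated on the slide. The function name `one_sample_z` is illustrative.

```python
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Z statistic and one-sided P value for a sample mean when sigma is known."""
    sem = sigma / n ** 0.5
    z = (sample_mean - mu0) / sem
    p_one_sided = 1 - NormalDist().cdf(z)  # area beyond z in the upper tail
    return z, p_one_sided

# Sample mean 6 mmol/L against mu = 5, sigma = 1.5, n = 25 (assumed, gives SEM = 0.3)
z, p_value = one_sample_z(6.0, 5.0, 1.5, 25)
print(f"Z = {z:.2f}, one-sided P = {p_value:.4f}")
```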

In reality

It is unlikely that the σ (population SD) is known.

In most of the cases, σ will be unknown and we will be able to apply neither the formula nor the table of normal distribution (areas under the curve=Z score).

We resort to other statistical tests.

Possible situations in hypothesis testing

Reality \ Decision   Reject H0           Do not reject H0
H0 is true           Type I error (α)    OK (1-α)
H0 is not true       OK (1-β)            Type II error (β)

α = level of significance

1-β = Power: the probability of rejecting the null hypothesis if it is NOT TRUE. Usually 80% is the least required for any test.

Errors of Hypothesis Testing and Power

Decisions and errors in hypothesis testing

Conclusion from hypothesis testing (study results): difference exists, reject H0
- True situation, difference exists (H1): correct decision (power or 1-β)
- True situation, no difference (H0): Type I error or α. False rejection: we conclude there is a difference when there really is not.

Conclusion from hypothesis testing (study results): no difference, do not reject H0
- True situation, difference exists (H1): Type II or β error. False acceptance: we conclude there is no difference when one is really present.
- True situation, no difference (H0): correct decision

Passive smoking and lung cancer

Conclusions are based on results from a study of a sample of the population and compared with the truth about the population.

Reject the null hypothesis (rates in the study appear to be different):
- Truth: passive smoking is related to lung cancer: correct decision
- Truth: passive smoking is not related to lung cancer: Type I error (incorrect rejection). We conclude that passive smoking is related to lung cancer when it really is not.

Accept the null hypothesis (rates in the study appear similar):
- Truth: passive smoking is related to lung cancer: Type II error (incorrect acceptance). We conclude that passive smoking is not related to lung cancer when it really is.
- Truth: passive smoking is not related to lung cancer: correct decision

The Alpha-Fetoprotein (AFP) test has both Type I and Type II error possibilities.

This test screens the mother's blood during pregnancy for AFP and determines risk. Abnormally high or low levels may indicate Down syndrome.

H0: patient is healthy

Ha: patient is unhealthy

Type I error (false positive or false rejection): the test wrongly indicates that the child has Down syndrome, which may lead to a pregnancy being terminated for no reason.

Type II error (false negative or false acceptance): the test is negative and the child is born with multiple anomalies.

Hypothesis Test

This is the distribution given the null hypothesis is true

Type I and Type II Error

False rejection

False acceptance

One Sample

The distribution of X under the null and alternative hypotheses.

t-distribution

In real-life situations we estimate the unknown population SD using the sample SD.

Results are standardized to the t-distribution:

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

$Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ (Z test for the normal distribution; the population SD is known)

t-distribution

Heavier tails than the Z distribution

df = number of observations (sample size) - 1

Degree of freedom (df)

For sample statistics such as the variance and SD we use n-1: all the observations in any given sample are free to vary except one (complementary effect).

Example: four values with a fixed total of 50. Once 7, 15 and 12 are known, the fourth value is restricted to 16, so df = n-1 = 3.

t-test: steps to determine the statistical difference

When? The descriptive statistics are mean ± standard deviation

Number of samples:

- One sample vs. population mean
- Two independent samples
- Two dependent samples (paired t): repeated measures, matched pairs

Steps:

1- State the hypothesis to be tested: the null (mean = mean) and the alternative, either non-directional/two-tailed (mean ≠ mean) or unidirectional/one-tailed (mean > mean or mean < mean).
2- Find the calculated t value using the formulae.
3- Find the degrees of freedom: n-1 (for two independent samples df = (n1-1) + (n2-1) = n1+n2-2).
4- Find the P value using the tables of the t-distribution.
5- Conclude: if P < 0.05 the null is rejected; if P > 0.05 the null is not rejected.

One sample: $t = \dfrac{\bar{x} - \mu}{SD/\sqrt{n}}$

Two independent samples: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{SD_1^2/n_1 + SD_2^2/n_2}}$

Two dependent samples: $t = \dfrac{\bar{d}}{SE(\bar{d})}$, where $\bar{d}$ is the mean of the paired differences

t-test (student’s t-test) one sample

nSDt /

Using diabetes data: Is the mean age of diabetics > 65 years?

Statistics: age (years)

N Valid               278
N Missing             0
Mean                  67.24
Std. Error of Mean    .704
Std. Deviation        11.743
Variance              137.902

H0: μ = 65

H1: μ ≠ 65

$$t_{one\ sample} = \frac{67.24 - 65}{SD / \sqrt{n}} = \frac{2.24}{11.743 / \sqrt{278}} = 3.18$$

t distribution: P = 0.002. Reject the null. The mean age of diabetics is significantly higher than 65 years.

One-Sample Test (Test Value = 65)

               t       df    Sig. (2-tailed)   Mean Difference   95% CI of the Difference (Lower, Upper)
age (years)    3.182   277   .002              2.24              .85, 3.63

Sig. (2-tailed) is the two-sided P value; df is the degree of freedom.

Assumptions: the distribution of age is normal, and the population SD (σ) is unknown.
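The one-sample calculation can be reproduced from the summary statistics alone. The following is a minimal Python sketch, not the SPSS procedure: the function name `one_sample_t` is an illustrative assumption, and the critical value 1.960 is the large-sample (∞ row) two-tailed 0.05 value from a standard t table, used here as an approximation for df = 277.

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """Return (t, df) for a one-sample t-test computed from summary statistics."""
    se = sd / math.sqrt(n)      # standard error of the mean
    t = (mean - mu0) / se       # standardized distance from the test value
    return t, n - 1

# Summary statistics for age (years) from the diabetes data
t, df = one_sample_t(mean=67.24, sd=11.743, n=278, mu0=65)
print(f"t = {t:.2f}, df = {df}")

# Two-tailed decision at the 0.05 level
T_CRIT = 1.960  # assumption: large-sample critical value, adequate for df = 277
print("reject H0" if abs(t) > T_CRIT else "do not reject H0")
```

Because the inputs are rounded summary values, the result agrees with the SPSS output (3.182) only to about two decimal places.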

t-test for comparison of means of two independent samples

H0: Smoking has no effect on systolic blood pressure: Mean S = Mean NS, or Mean S - Mean NS = 0

H1: Smoking has an effect: Mean S ≠ Mean NS, or Mean S - Mean NS ≠ 0

Assumptions:

• Independent observations (2 samples)

• Normally distributed

• Equal variances (for the pooled t-test)

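Three formulae

General form (standardized difference):

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$

The 0 is the expected difference if H0 is true; the denominator is the SD of the difference.

If SDs are equal (pooled SD):

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{(n_1 - 1) + (n_2 - 1)}, \qquad t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}}}$$

If SDs are not equal:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$

Decision based on Levene's test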

Group Statistics

SMOKING    N     Mean     Std. Deviation   Std. Error Mean
no         214   153.11   21.995           1.504
smokers    64    144.82   20.934           2.617

Independent Samples Test

Levene's Test for Equality of Variances: F = .006, Sig. = .936

t-test for Equality of Means:

                              t       df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       2.674   276       .008              8.29              3.100                   2.188          14.392
Equal variances not assumed   2.747   107.982   .007              8.29              3.018                   2.308          14.272

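Levene's test: not significant (Sig. = .936), which means equal variances.

t-test: P value < 0.05, reject H0.

SPSS reports two separate t-tests (equal variances assumed and not assumed).

Variances are apparently equal, so the "equal variances assumed" row is read.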
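Both rows of the Independent Samples Test can be approximated from the group summary statistics. This is a hedged sketch rather than the SPSS implementation: `two_sample_t` is a hypothetical helper name, and the df for the unequal-variance row uses the Welch-Satterthwaite approximation, which the slides do not state but which is the usual formula behind a fractional df.

```python
import math

def two_sample_t(m1, s1, n1, m2, s2, n2):
    """Return ((t, df) pooled, (t, df) unequal-variance) from summary statistics."""
    diff = m1 - m2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t_pooled = diff / math.sqrt(sp2 / n1 + sp2 / n2)
    df_pooled = n1 + n2 - 2
    # Unequal variances: each group keeps its own variance
    v1, v2 = s1**2 / n1, s2**2 / n2
    t_unequal = diff / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom (assumption)
    df_unequal = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (t_pooled, df_pooled), (t_unequal, df_unequal)

# Non-smokers vs. smokers, values from the Group Statistics table
pooled, unequal = two_sample_t(153.11, 21.995, 214, 144.82, 20.934, 64)
print(f"equal variances assumed:     t = {pooled[0]:.3f}, df = {pooled[1]}")
print(f"equal variances not assumed: t = {unequal[0]:.3f}, df = {unequal[1]:.1f}")
```

The two t values differ only slightly here because the two sample SDs are close, which is consistent with the non-significant Levene's test.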

Paired t-test

Used when we have paired data: two repeated measurements on the same subjects, or before-and-after measurements.

Assumption: the differences between the paired observations are Normally distributed.

Paired samples (dependent)

(Paired / dependent 2-sample t-test)

• To compare observations collected from the same group of individuals on 2 separate occasions (dependent observations or paired samples).

• The paired t statistic is calculated as follows:

- Calculate the difference between the 2 measurements taken on each individual.

- Calculate the mean of the differences.

- Calculate the SE of the observed differences.

- Under the null hypothesis of no difference (difference = 0), the paired t statistic takes the form: t = mean difference / SE of the difference.

- It follows a t-distribution with degrees of freedom = (n-1).

$$t = \frac{\bar{d} - 0}{SE(\bar{d})} = \frac{\bar{d}}{SE(\bar{d})}$$

Example: Four students had the following scores in 2 subsequent tests. Is there a significant difference in their performance?

Number   Name       Test 1   Test 2   Dif
1        Mike       35%      67%      -32
2        Melanie    50%      46%      4
3        Melissa    90%      86%      4
4        Mitchell   78%      91%      -13

Mean Dif = -9.25, SD Dif = 17.152, SE Dif = 8.58

Calculated paired t = -9.25/8.58 = -1.078, df = n-1 = 3

$$t = \frac{\bar{d} - 0}{SE(\bar{d})}$$

Critical values of the t-distribution

Level of significance for one-tail test:   0.10    0.05    0.025    0.01     0.005
Level of significance for two-tail test:   0.20    0.10    0.05     0.02     0.01

df
1                                          3.078   6.314   12.706   31.821   63.657
2                                          1.886   2.920   4.303    6.965    9.925
3                                          1.638   2.353   3.182    4.541    5.841
4                                          1.533   2.132   2.776    3.747    4.604
5                                          1.476   2.015   2.571    3.365    4.032
6                                          1.440   1.943   2.447    3.143    3.707
7                                          1.415   1.895   2.365    2.998    3.499
8                                          1.397   1.860   2.306    2.896    3.355
9                                          1.383   1.833   2.262    2.821    3.250
10                                         1.372   1.812   2.228    2.764    3.169
11                                         1.363   1.796   2.201    2.718    3.106
12                                         1.356   1.782   2.179    2.681    3.055
13                                         1.350   1.771   2.160    2.650    3.012
14                                         1.345   1.761   2.145    2.624    2.977
15                                         1.341   1.753   2.131    2.602    2.947
16                                         1.337   1.746   2.120    2.583    2.921
17                                         1.333   1.740   2.110    2.567    2.898
18                                         1.330   1.734   2.101    2.552    2.878
19                                         1.328   1.729   2.093    2.539    2.861
20                                         1.325   1.725   2.086    2.528    2.845
21                                         1.323   1.721   2.080    2.518    2.831
35                                         1.306   1.690   2.030    2.438    2.724
50                                         1.299   1.676   2.009    2.403    2.678
∞                                          1.282   1.645   1.960    2.326    2.576

For the example (t = 1.078, df = 3): the calculated t is below 1.638, so the two-tailed P value is > 0.20 and the null is not rejected.

P value

Conclusion

If the null hypothesis were true, a difference at least this large would be encountered in about 36 out of 100 cases (actual P value = 0.362), i.e. we do not reject the null hypothesis of no difference between the first and second test.
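The paired calculation for the four students can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the critical value 3.182 is the standard two-tailed 0.05 value for df = 3.

```python
import math
import statistics

test1 = [35, 50, 90, 78]   # Mike, Melanie, Melissa, Mitchell
test2 = [67, 46, 86, 91]

# Difference for each student (Test 1 minus Test 2, as in the example)
diffs = [a - b for a, b in zip(test1, test2)]

mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)          # sample SD, divides by n-1
se_d = sd_d / math.sqrt(len(diffs))     # SE of the mean difference
t = mean_d / se_d
df = len(diffs) - 1

T_CRIT_DF3 = 3.182  # two-tailed 0.05 critical value for df = 3
print(f"mean = {mean_d}, SD = {sd_d:.2f}, SE = {se_d:.2f}, t = {t:.3f}, df = {df}")
print("reject H0" if abs(t) > T_CRIT_DF3 else "do not reject H0")
```

Reversing the subtraction (Test 2 minus Test 1) only flips the sign of t; the two-tailed conclusion is unchanged.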

Paired Samples Statistics

Pair 1                                Mean     N     Std. Deviation   Std. Error Mean
syst. blood pressure at start         151.20   278   21.997           1.319
syst. blood pressure after 2 years    153.83   278   29.076           1.744

Paired Samples Test

Pair 1: syst. blood pressure at start - syst. blood pressure after 2 years

Paired Differences: Mean = -2.63, Std. Deviation = 17.920, Std. Error Mean = 1.075, 95% Confidence Interval of the Difference = -4.74 (Lower) to -.51 (Upper)

t = -2.443, df = 277, Sig. (2-tailed) = .015

Test of significance: interval/ratio data

Parametric, assuming normal distribution

• Known population variance (σ known): one-sample Z-test

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Z test: reject if Z is beyond ±1.96

• Unknown population variance: t-test, chosen by the number of samples

- One sample vs. population: one-sample t-test

- Two independent samples: independent t-test

- Two dependent samples: paired t-test

t-test: reject if P ≤ 0.05

The Chi-Square test χ2

Used for hypothesis testing for categorical variables.

Many types exist, depending on the design, the distribution of the variables and the objectives of testing.

χ2

Example:

Vaccination against influenza decreases the risk of getting the disease.

Study:

Compare the effectiveness of 5 vaccines with respect to the probability of getting influenza.

The comparison is made with respect to a nominal variable (getting influenza: yes or no).

Effectiveness of Five Vaccines

Vaccine   Influenza No   Influenza Yes   Total
1         237            43              280
2         198            52              250
3         245            25              270
4         212            48              260
5         233            57              290
Total     1125           225             1350

Data cross-tabulated (2×5); response variable: influenza. Frequency: % within vaccine.

Vaccine   Influenza No (%)   Influenza Yes (%)   Total (%)
1         84.6               15.4                100
2         79.2               20.8                100
3         90.7               9.3                 100
4         81.5               18.5                100
5         80.3               19.7                100
Total     83.3               16.7                100

The "Influenza Yes" column is the probability of getting influenza.

The null hypothesis states that the probability of getting influenza is independent of the vaccine.

The alternative states that a dependency exists.

Effectiveness of Five Vaccines

If H0 is true: the probability of getting influenza should be the same in every group and equal to the probability in the total population:

225/1350 = 0.167 (16.7%)

Vaccine 1 was used in 280 persons. If H0 is true, we expect 16.7% of them (≈47) to get influenza.

However, this is not what was observed: 43 got influenza.

Expected frequencies

Vaccine              Influenza No   Influenza Yes   Total
1   Observed         237            43              280
    Expected         233.3          46.7
2   Observed         198            52              250
    Expected         208.3          41.7
3   Observed         245            25              270
    Expected         225.0          45.0
4   Observed         212            48              260
    Expected         216.7          43.3
5   Observed         233            57              290
    Expected         241.7          48.3
Total                1125           225             1350

For any cell: Expected frequency = (row total × column total) / grand total

Vaccine 1, influenza yes: 280 × 225 / 1350 = 46.7 (row total × column total / grand total)

Vaccine 4, influenza no: 260 × 1125 / 1350 = 216.7

Pearson Chi-square test

Calculate the expected frequencies (assuming H0 is true) for all ten cells.

Calculate Chi-square, where O_f = observed frequency and E_f = expected frequency:

$$\chi^2 = \sum \frac{(O_f - E_f)^2}{E_f}$$

Reject H0 if χ2 is large. Use the Chi-square distribution after determining the degree of freedom (df): df = (r-1)*(c-1)

Chi-square distribution

Critical values for Chi-square

      Level of significance
df    0.99      0.90     0.70     0.50     0.30     0.20     0.10     0.05     0.01     0.001
1     0.00016   0.0158   0.148    0.455    1.074    1.642    2.706    3.841    6.635    10.827
2     0.0201    0.211    0.713    1.386    2.408    3.219    4.605    5.991    9.210    13.815
3     0.115     0.584    1.424    2.366    3.665    4.642    6.251    7.815    11.341   16.268
4     0.297     1.064    2.195    3.357    4.878    5.989    7.779    9.488    13.277   18.465
5     0.554     1.610    3.000    4.351    6.064    7.289    9.236    11.070   15.086   20.517
.
.
30    14.953    20.599   25.508   29.336   33.530   36.250   40.256   43.773   50.892   59.703

df = (2-1)(5-1) = 4

χ2 critical (df = 4, level of significance 0.05) = 9.488

χ2 calculated = 16.555

P = 0.002

There is a relation (dependence) between the type of vaccine and influenza prevention.
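The whole Pearson Chi-square calculation for the vaccine table can be sketched in Python using only the standard library. The helper `chi2_sf_even_df` is a hypothetical name; it uses a closed-form upper-tail probability that is exact only when df is even (as it is here, df = 4), so it is an illustration rather than a general-purpose P value routine.

```python
import math

# Observed counts; rows: vaccines 1-5, columns: influenza no, influenza yes
observed = [[237, 43], [198, 52], [245, 25], [212, 48], [233, 57]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square = sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x); closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

p_value = chi2_sf_even_df(chi2, df)
print(f"chi-square = {chi2:.3f}, df = {df}, P = {p_value:.3f}")
```

Vaccine 3 contributes most to the statistic: 25 cases were observed against 45 expected.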

SMOKING * SEX Crosstabulation

SMOKING                        male     female   Total
no        Count                90       124      214
          % within SMOKING     42.1%    57.9%    100.0%
smokers   Count                55       9        64
          % within SMOKING     85.9%    14.1%    100.0%
Total     Count                145      133      278
          % within SMOKING     52.2%    47.8%    100.0%

Chi-Square Tests

                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             38.017 b   1    .000
Continuity Correction a        36.279     1    .000
Likelihood Ratio               41.649     1    .000
Fisher's Exact Test                                                    .000                   .000
Linear-by-Linear Association   37.880     1    .000
N of Valid Cases               278

a. Computed only for a 2x2 table.

b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.62.

At least 80% of cells must have an expected frequency (Ef) of 5 or more.

We can't use the Pearson Chi-square test if too many cells have an expected frequency < 5.

In this case we use Fisher's Exact test.

status * SEX Crosstabulation (Count)

status                  male   female   Total
alive                   24     15       39
died from CVD           4      1        5
other cause of death    2      2        4
Total                   30     18       48

Ef (died from CVD, female) = 5*18/48 = 1.875 (< 5)

Ef (other cause of death, male) = 4*30/48 = 2.5 (< 5)

Fisher's Exact test provides a correction.
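The check on expected counts can be automated before trusting a Pearson Chi-square result. This is a small Python sketch of that check for the status by sex table; the threshold of 5 and the 80% rule come from the slides, while the variable names are illustrative assumptions.

```python
# Observed counts; rows: status, columns: (male, female)
observed = {
    "alive": (24, 15),
    "died from CVD": (4, 1),
    "other cause of death": (2, 2),
}

row_totals = {status: sum(counts) for status, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

# Expected count for every cell: row total * column total / grand total
expected = [
    row_totals[status] * col / grand_total
    for status in observed
    for col in col_totals
]

small_cells = sum(1 for e in expected if e < 5)
share_small = 100 * small_cells / len(expected)
min_expected = min(expected)

print(f"{small_cells} cells ({share_small:.1f}%) have expected count < 5; "
      f"minimum expected count = {min_expected:.2f}")
# Rule from the slides: at least 80% of cells need an expected count of 5 or more
print("Pearson Chi-square valid" if share_small <= 20 else "Pearson Chi-square not valid")
```

The printed summary corresponds to the footnote that SPSS adds beneath its Chi-Square Tests table.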

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             .935 a    2    .626
Likelihood Ratio               .991      2    .609
Linear-by-Linear Association   .004      1    .951
N of Valid Cases               48

a. 4 cells (66.7%) have expected count less than 5. The minimum expected count is 1.50.

Chi-square is not valid


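McNemar test: paired data in a cross tabulation

                 Ointment B +   Ointment B No   Total
Ointment A +     16             10              26
Ointment A No    23             5               28
Total            39             15              54

54 persons with eczema on both arms; each person used ointment A on one arm and ointment B on the other (randomized).

The McNemar test takes only the discordant pairs into account:

$$\chi^2 = \frac{(23 - 10)^2}{23 + 10} = \frac{169}{33} = 5.12, \quad df = 1$$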
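The McNemar statistic depends only on the two discordant counts (23 and 10). A minimal Python sketch follows; `mcnemar_chi2` is a hypothetical helper name, the formula is the uncorrected one shown on the slide (no continuity correction), and 3.841 is the 0.05 critical value for df = 1 from the Chi-square table.

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square (no continuity correction) from the discordant counts."""
    return (b - c) ** 2 / (b + c)

# Discordant pairs: B worked but A did not (23), A worked but B did not (10)
chi2 = mcnemar_chi2(23, 10)

CHI2_CRIT_DF1 = 3.841  # 0.05 critical value for df = 1
print(f"chi-square = {chi2:.2f}, df = 1")
print("reject H0" if chi2 > CHI2_CRIT_DF1 else "do not reject H0")
```

The concordant pairs (16 and 5) do not enter the calculation because they carry no information about which ointment performs better.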

Questions

Thank you
