
Basic Statistics

“I always find that statistics are hard to swallow and impossible to digest.

The only one I can remember is that if all the people who go to sleep in church were laid end to end they would be a lot more comfortable.”

[Mrs Robert A Taft]

“Data! Data! Data!” he cried impatiently.

“I can’t make bricks without clay.”

[Sherlock Holmes]

Qualitative

a) Nominal data

(dead/alive, blood group O,A,B,AB)

b) Ordered categorical/ranked data

(mild/moderate/severe)

Quantitative

a) Numerical discrete

(no. of deaths in a hospital per year)

b) Numerical continuous

(age, weight, blood pressure)

Presenting data

• Graphs

• Summary statistics

• Tables

Graphical methods

• Pie chart

• Bar chart

• Histogram

• Scattergram

Pie chart

[Figure: pie chart of self-reported pain; categories: extreme pain, moderate pain, no pain]

Bar chart

[Figure: bar chart of self-reported pain (extreme pain, moderate pain, no pain); y-axis: No. of subjects, 0 to 5000]

Histogram

[Figure: histogram of Age; x-axis: Age, 20.0 to 65.0; y-axis: 0 to 50]

Boxplot

[Figure: boxplots of Age (years) by Gender; male N = 3600, female N = 4465; y-axis: 0 to 100]

Error bar plot

[Figure: error bar plot of 95% CI health status score by mobility (severe problem, some problems, no problem); y-axis: 0 to 400]

Scattergram

[Figure: scattergram of creatinine vs. digoxin; x-axis: Digoxin, 0 to 120; y-axis: Creatinine, 0 to 140]

Graph Example

[Figure: SF36 score (0 to 100) by SF36 sub-scale (General, Pain, Vitality, Mental, Role emo, Role phys, Soc fun, Phys fun) for two groups: Not ill, Long term ill]

Graph Solution

[Figure: SF36 score (0 to 100) by SF36 sub-scale (General, Pain, Vitality, Mental, Role emo, Role phys, Soc fun, Phys fun) for two groups: Not ill, Long term ill]

Summary statistics

Qualitative data

• Percentages

• Numbers

Summarizing data example I

Secondary prevention of coronary heart disease

                     Respondents    Non-respondents
                     (n=1343)       (n=578)
Male                 58% (782)      54% (314)
Urban Practice       54% (720)      57% (331)
Practice size:
  < 5,000            14% (190)      18% (105)
  5,000 – 10,000     39% (523)      41% (238)
  > 10,000           47% (630)      41% (235)

Summary Statistics

Quantitative data

• Non-normal

median

range

inter-quartile range

• Normal

mean

standard deviation

variance
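As a rough illustration of the two sets of summaries listed above, the sketch below uses Python's standard `statistics` module. The data values are made up for illustration and are not taken from the study in these slides.

```python
import statistics

# Hypothetical lengths of stay in days (made-up, positively skewed values).
length_of_stay = [4, 5, 6, 7, 8, 8, 9, 12, 20, 36]

# Summaries suited to non-Normal data.
median = statistics.median(length_of_stay)
data_range = (min(length_of_stay), max(length_of_stay))
q1, q2, q3 = statistics.quantiles(length_of_stay, n=4)  # quartile cut points
iqr = (q1, q3)

# Summaries suited to approximately Normal data.
mean = statistics.mean(length_of_stay)
sd = statistics.stdev(length_of_stay)            # sample standard deviation
variance = statistics.variance(length_of_stay)   # sample variance, equal to sd squared

print(f"median={median}, range={data_range}, IQR={iqr}")
print(f"mean={mean:.1f}, sd={sd:.1f}, variance={variance:.1f}")
```

For skewed data like these, the mean is pulled towards the long tail, which is why the median and inter-quartile range are usually the safer summaries.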

Boxplot

[Figure: boxplots of Age (years) by Gender; male N = 3600, female N = 4465; y-axis: 0 to 100]

Summary Statistics

Normal data

Approximately 95% of observations lie within 2 standard deviations of the mean
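The 95% figure can be checked against the standard Normal distribution. The sketch below is only an illustration; it applies the rule to the age summary quoted elsewhere in these slides (mean 66.2, sd 8.2) and assumes those ages are roughly Normal.

```python
from statistics import NormalDist

# Proportion of a Normal distribution within 2 standard deviations of the mean.
within_2sd = NormalDist().cdf(2) - NormalDist().cdf(-2)

# Applied to the age summary from the slides (mean 66.2, sd 8.2).
mean_age, sd_age = 66.2, 8.2
low, high = mean_age - 2 * sd_age, mean_age + 2 * sd_age

print(f"proportion within 2 sd: {within_2sd:.3f}")
print(f"approximate 95% range for age: {low:.1f} to {high:.1f}")
```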

Histogram

[Figure: histogram of Age; x-axis: Age, 20.0 to 65.0; y-axis: 0 to 50]

Histogram of IgM values

[Figure: histogram of IgM; x-axis: IgM, 0.00 to 3.00; y-axis: 0 to 140]

How to test for Normality

• Mean approximately equal to median

• (mean - 2sd, mean + 2sd) gives a reasonable range for the data

• -1 < skewness < 1

• -1 < kurtosis < 1

• Histogram shows symmetric bell shape
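The numeric checks in this list can be sketched in a few lines. The functions below use simple moment-based formulas, which is an assumption: statistical packages apply slightly different small-sample corrections, so their skewness and kurtosis values may differ a little. The tolerance for "mean close to median" and the data are also illustrative choices.

```python
import statistics

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Normal distribution scores 0."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

def looks_normal(values):
    """Apply the rough numeric checks from the list; returns a dict of booleans."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return {
        # Assumed tolerance: mean within half a standard deviation of the median.
        "mean_close_to_median": abs(mean - statistics.median(values)) < 0.5 * sd,
        "skewness_ok": -1 < skewness(values) < 1,
        "kurtosis_ok": -1 < excess_kurtosis(values) < 1,
    }

# Hypothetical data sets (made-up values).
symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]
skewed = [4, 5, 6, 7, 8, 8, 9, 12, 20, 36]

print(looks_normal(symmetric))
print(looks_normal(skewed))
```

These checks are rules of thumb; the histogram remains the most direct way to judge the shape.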

Checking for Normality

            Age     Length of stay   Satisfaction score
Mean        66.2    12.1             5.2
Median      67      8                9
SD          8.2     9.0              4.3
Minimum     49      4                1
Maximum     80      36               10
Skewness    -0.2    1.8              -2.5
Kurtosis    0.5     1.3              4.6


Summary statistics example II

Mean (sd)                   Respondents     Non-respondents
                            (n=1343)        (n=578)
Age (years)                 66.2 (8.2)      66.6 (8.7)
Time since MI (mths) *      10 (6, 35)      15 (8, 47)
Cholesterol (mmol/l)        6.5 (1.2)       6.6 (1.2)

[* Median (range)]

Natural log transformation

• Can transform +vely skewed data to ‘Normal’ data

• Use transformed data in analysis

• Resulting mean value transformed back (using e^x) to give geometric mean

• Present geometric mean and range
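The steps above can be sketched as follows. The data are made-up lengths of stay, not the study data, so the numbers will not match the tables in these slides.

```python
import math
import statistics

# Hypothetical, positively skewed lengths of stay in days (made-up values).
length_of_stay = [4, 5, 6, 7, 8, 8, 9, 12, 20, 36]

log_values = [math.log(x) for x in length_of_stay]  # natural log transform
mean_log = statistics.fmean(log_values)             # analyse on the log scale
geometric_mean = math.exp(mean_log)                 # back-transform with e**x

print(f"arithmetic mean = {statistics.fmean(length_of_stay):.1f}")
print(f"geometric mean  = {geometric_mean:.1f}")
print(f"range           = {min(length_of_stay)} to {max(length_of_stay)}")
```

The geometric mean is less affected by the few long stays than the arithmetic mean, so it sits closer to the bulk of the data.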

Effect of log_e transformation

              Length of stay    Log_e length of stay
Mean          12.1              2.2
Median        8                 2.1
SD            9.0               0.5
Minimum       4                 1.4
Maximum       36                3.6
Skewness      1.8               0.4
Kurtosis      1.3               0.7

[Geometric mean = e^2.2 = 9.0]


Mean (sd)                   Respondents     Non-respondents
                            (n=1343)        (n=578)
Age (years)                 66.2 (8.2)      66.6 (8.7)
Time since MI (mths) *      10 (6, 35)      15 (8, 47)
Cholesterol (mmol/l)        6.5 (1.2)       6.6 (1.2)
Length of stay #            9.0 (4, 36)     11.2 (6, 83)

[* Median (range), # Geometric mean (range)]

Confidence Interval

“The estimated mean difference in systolic blood pressure between 100 diabetic and 100 non-diabetic men was 6.0 mmHg with 95% confidence interval (1.1 mmHg, 10.9 mmHg)”

Confidence Interval

• Contains information about the (im)precision of the estimated effect size

• Presents a range of values, on the basis of the sample data, in which the population value for such an effect size may lie

Confidence Interval

95% CI for mean = mean +/- 1.96 SEM

90% CI for mean = mean +/- 1.64 SEM

SEM = sd / sqrt(n)
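As a worked example of these formulas, the sketch below computes 95% and 90% intervals for the mean age of respondents reported in these slides (mean 66.2, sd 8.2, n = 1343). The function name is illustrative, and the Normal multiplier is only appropriate for reasonably large samples.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-approximation CI for a mean: mean +/- z * sd / sqrt(n)."""
    sem = sd / math.sqrt(n)                    # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # 1.96 for 95%, about 1.64 for 90%
    return mean - z * sem, mean + z * sem

# Age of respondents from the slides: mean 66.2, sd 8.2, n = 1343.
low95, high95 = mean_confidence_interval(66.2, 8.2, 1343)
low90, high90 = mean_confidence_interval(66.2, 8.2, 1343, level=0.90)

print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"90% CI: {low90:.2f} to {high90:.2f}")
```

With a sample this large the SEM is small, so the interval is narrow even though individual ages vary widely.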

Confidence Interval

• The 95% CI is a range of values which we are 95% confident covers the true population mean

• If the study were repeated many times, about 5% of the 95% CIs calculated would not cover the ‘true’ mean

Error bar plot

[Figure: error bar plot of 95% CI health status score by mobility (severe problem, some problems, no problem); y-axis: 0 to 400]

Confidence Interval Example

Significance/hypothesis tests

Measure the strength of evidence provided by the data for or against some proposition of interest

E.g. Is the survival rate after X better than after Y?


Null hypothesis:

“Effects of X and Y are the same”

Alternative hypothesis:

“Effects of X and Y are different”


One-sided:

“X is better than Y”

Two-sided:

“X and Y have different effects”

P-value

P is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true

P-value

P <= 0.05

• evidence against the null hypothesis, which is rejected

• the data suggest a difference between X and Y

• result is statistically significant

P-value

P > 0.05

• null hypothesis may be true, so it is not rejected

• insufficient evidence of a difference between X and Y

• result is not statistically significant

P-value

Power of study

• probability of rejecting null hypothesis when false

• increased by increasing sample size

• increased if true difference between treatments is large
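The two bullets on sample size and effect size can be illustrated with an approximate power calculation. The sketch below assumes a two-sided comparison of two means with equal group sizes and a common standard deviation, and uses a Normal approximation. The function name and all input values are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a true mean difference `delta`.

    Normal approximation for a two-sided, two-sample comparison of means;
    the negligible chance of rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)     # 1.96 when alpha = 0.05
    return z.cdf(abs(delta) / se - z_crit)

# Hypothetical scenarios: true difference 5 or 10 units, sd 10 units.
for n in (25, 64, 100):
    print(f"n={n:3d} per group: power={approx_power(5, 10, n):.2f} (difference 5), "
          f"{approx_power(10, 10, n):.2f} (difference 10)")
```

Under these assumptions, power rises with the number of subjects per group and with the size of the true difference.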

P-value

Statistical significance does not imply clinical significance

A statistician is a person whose lifetime ambition is to be wrong

5% of the time

Types of significance tests

Chi-square test:

“28 out of 70 smokers have a cough compared with 5 out of 50 non-smokers

- is there a significant difference?”

[28/70 = 40% compared with 5/50=10%]

Chi-square test result

“P=0.001”

There is a significant relationship between smoking and cough
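The smoker example can be reproduced with a short, standard-library-only sketch of the chi-square test for a 2x2 table. Whether the slide's P-value used a continuity correction is not stated, so both versions are computed; the function name is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test for the 2x2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count for each cell = row total * column total / grand total.
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    correction = 0.5 if yates else 0.0  # optional Yates continuity correction
    chi2 = sum(
        max(abs(o - e) - correction, 0.0) ** 2 / e
        for o, e in zip(observed, expected)
    )
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts from the slide: 28 of 70 smokers and 5 of 50 non-smokers have a cough.
chi2, p = chi_square_2x2(28, 70 - 28, 5, 50 - 5)
chi2_yates, p_yates = chi_square_2x2(28, 70 - 28, 5, 50 - 5, yates=True)

print(f"uncorrected: chi2={chi2:.2f}, P={p:.4f}")
print(f"with Yates:  chi2={chi2_yates:.2f}, P={p_yates:.4f}")
```

Either way the P-value is well below 0.05, in line with the conclusion that smoking and cough are related in these data.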


Two-sample t-test:

“Is there a difference in the 24 hour energy expenditure between groups of lean and

obese women?”
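A minimal sketch of the two-sample t statistic, assuming equal variances in the two groups. The energy expenditure values are made up for illustration. Turning the statistic into a P-value needs the t distribution (from tables or a statistics package), which is not done here.

```python
import math
import statistics

def two_sample_t(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))  # SE of the difference in means
    t = (statistics.fmean(x) - statistics.fmean(y)) / se
    return t, nx + ny - 2

# Hypothetical 24-hour energy expenditure (MJ/day); values are made up.
lean = [6.1, 7.0, 7.5, 7.9, 8.1, 8.4]
obese = [8.8, 9.2, 9.7, 10.0, 10.9, 11.5]

t, df = two_sample_t(lean, obese)
print(f"t = {t:.2f} on {df} degrees of freedom")
```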


Mann-Whitney U-test:

“Is there a difference in the nausea score between chemo patients receiving an active anti-emetic treatment and those receiving

placebo?”


Paired t-test:

“Is there a difference in the dietary intake of a group of students in the week before

and after Finals?”


Wilcoxon matched pairs signed rank test or the Sign test:

“Is there a difference in the units of alcohol consumed by students in the week before

and after finals?”
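Of the two paired non-parametric tests named above, the Sign test is simple enough to sketch directly: it looks only at whether each student's value went up or down. The alcohol figures below are made up, and the function name is illustrative.

```python
from math import comb

def sign_test(before, after):
    """Two-sided exact sign test on paired observations."""
    # Keep only pairs that changed; tied pairs carry no sign and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Under the null hypothesis, increases and decreases are equally likely,
    # so the number of increases follows a Binomial(n, 0.5) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, n, min(1.0, 2 * tail)

# Hypothetical units of alcohol per student, week before vs week after finals.
before = [10, 4, 12, 0, 8, 15, 6, 9, 3, 7]
after = [14, 9, 12, 5, 13, 20, 5, 16, 8, 12]

positives, n, p = sign_test(before, after)
print(f"{positives} of {n} students increased, P = {p:.3f}")
```

The Wilcoxon matched pairs signed rank test also uses the size of each difference, so it is generally more powerful than the Sign test.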

Significance test example

Correlation

Measures the strength of the relationship between two variables


Scattergram

[Figure: scattergram of creatinine vs. digoxin; x-axis: Digoxin, 0 to 120; y-axis: Creatinine, 0 to 140]

Correlation

Pearson correlation:

• Used for Normally distributed data

• Measures linear relation between variables

Correlation

• r = 0 no relationship

• r = 1 perfect +ve relationship

• r = -1 perfect –ve relationship


Scattergram

[Figure: scattergram of creatinine vs. digoxin; x-axis: Digoxin, 0 to 120; y-axis: Creatinine, 0 to 140]

Correlation

Spearman correlation:

• Used for non-Normally distributed data

• Measures monotonic relationship between variables
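The difference between the two coefficients can be seen on a small made-up data set where y always increases with x but not in a straight line. The sketch below implements both from first principles; Spearman's coefficient is computed as the Pearson correlation of the ranks.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical monotonic but non-linear data (y = x cubed).
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(f"Pearson r    = {pearson_r(x, y):.2f}")
print(f"Spearman rho = {spearman_rho(x, y):.2f}")
```

Spearman's coefficient is 1 here because the ordering is perfectly preserved, while Pearson's is below 1 because the points do not lie on a straight line.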

Correlation Example

Correlation

[Figure: scattergram of change in left-ventricular mass (g), -40 to 120, against change in IGF-1 (ng/ml), -200 to 200; groups: placebo, rhGH]

“The government are very keen on amassing statistics. They collect them, add them, raise them to the n’th power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.”

[Comment of a judge on the subject of government statistics, 1920]
