Data Handling II: Describing and Depicting your Data Dr Yanzhong Wang Lecturer in Medical Statistics Division of Health and Social Care Research King's College London Email: [email protected] Drug Development Statistics & Data Management


Data Handling II: Describing and Depicting your Data

Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: [email protected]

Drug Development Statistics & Data Management


Types of data

• Quantitative data
– continuous, discrete
– distributions may be symmetric or skewed

• Qualitative (categorical) data
– binary
– nominal, ordinal


Skewed Distributions

[Figure: two histograms, y-axis frequency. Positively skewed data: long tail to right. Negatively skewed data: long tail to left.]


Symmetric Distribution

[Figure: density curve of a symmetric distribution; x-axis 0 to 6, y-axis 0 to 0.4.]


Summary statistics

• ‘Where the data are’ - location
– mean, median, mode, geometric mean

• Used to describe baseline data and main outcomes

• ‘How variable the data are’ - spread
– standard deviation, variance, range, interquartile range, 95% range

• Needed (primarily) to describe baseline data in RCT and cohort study


Definition of the Mean

The mean of a sample of values is the arithmetic average and is determined by dividing the sum of the values by the number of the values.


Definition of the Median

The median is the middle value of the ordered sample.

It is not affected by skewness or outliers, but in theory it is less precise than the mean.


Ordered Blood Glucose Values

2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6 3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0 4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5 4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0


Definition of the Mode

The mode is the most frequent value.
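As a minimal sketch, the definitions of the mean, median and mode can be checked on the 40 ordered blood glucose values shown above using Python's standard `statistics` module. The variable names are illustrative and not part of the original slides.

```python
import statistics

# The 40 ordered blood glucose values (mmol/litre) from the slide
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

mean_glucose = statistics.mean(glucose)      # sum of the values / number of values
median_glucose = statistics.median(glucose)  # middle of the ordered values
mode_glucose = statistics.mode(glucose)      # most frequent value

print(f"mean   = {mean_glucose:.3f}")
print(f"median = {median_glucose:.1f}")
print(f"mode   = {mode_glucose:.1f}")
```

With an even number of values there is no single middle value, so the median is taken as the average of the two middle ones (here the 20th and 21st ordered values).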


Location = Central Tendency

[Figure: histogram of blood glucose (mmol/litre); x-axis 2 to 6, y-axis count 0 to 7.]

• Arithmetic Mean - outlier prone
• Median - only uses relative magnitudes
• Mode - not necessarily central (categorical data)


Relation of mean, median and mode

• If distribution is unimodal (has only one mode) then:

• Mean=median=mode for symmetric distribution.

• Mean>median>mode for positively skewed distribution.

• Mean<median<mode for negatively skewed distribution.


Serum Triglyceride Levels from Cord Blood of 282 Babies

[Figure: histogram; x-axis serum triglyceride levels 0.2 to 1.7, y-axis count 0 to 80.]


Log(Serum Triglyceride Levels) from Cord Blood of 282 Babies

[Figure: histogram; x-axis log(serum triglyceride) levels -1.9 to 0.5, y-axis count 0 to 35.]


Definition of the Geometric Mean

The geometric mean of a sample of n values is determined by multiplying all the values together and taking the nth root (for only two values this is the more familiar square root).


Geometric Mean

• A common example of when the geometric mean is the correct choice of average is when averaging growth rates.

• Another Method: Take log of each value, find arithmetic mean and anti-log the result.

Exp( (log(0.15) + … + log(1.66) )/40) = 0.467
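The two routes to the geometric mean described here (nth root of the product, or anti-log of the mean of the logs) can be sketched as follows. The growth factors are hypothetical illustrative values, not data from the slides, and the variable names are assumptions.

```python
import math
import statistics

# Hypothetical yearly growth factors (illustrative only): +10%, +20%, -10%
growth = [1.10, 1.20, 0.90]

# Method 1: multiply all the values together and take the nth root
gm_root = math.prod(growth) ** (1 / len(growth))

# Method 2: take the log of each value, find the arithmetic mean, anti-log the result
gm_log = math.exp(sum(math.log(x) for x in growth) / len(growth))

# Cross-check against the standard library (Python 3.8+)
gm_lib = statistics.geometric_mean(growth)

print(f"nth root: {gm_root:.4f}  via logs: {gm_log:.4f}  library: {gm_lib:.4f}")
```

The log route is usually preferred for long series because the running product of many values can overflow or underflow.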


Serum Triglyceride Levels from Cord Blood of 282 Babies

[Figure: histogram; x-axis serum triglyceride levels 0.2 to 1.7, y-axis count 0 to 80; annotated with Mean=0.506, Median=0.460 and Geometric Mean=0.467.]


Why measures of variability are important

Production of Aspirin

• New production process of 100 mg tabs
• Random sample from new process
– 96 97 100 101 101 mg - mean 99 mg
• Random sample from old process
– 88 93 100 104 110 mg - mean 99 mg
• Same means, but the new process is better because it is less variable


Definition of Range

The range of a sample of values is the largest value minus the smallest value.

• For the new process the range is 101 - 96 = 5
• For the old process the range is 110 - 88 = 22

• Range is simple ... BUT
– Only uses min and max
– Tends to get larger as sample size increases
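A short sketch of the range calculation for the two aspirin samples; the function name `sample_range` is an illustrative choice.

```python
new_process = [96, 97, 100, 101, 101]  # mg, from the slide
old_process = [88, 93, 100, 104, 110]  # mg, from the slide

def sample_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

print(sample_range(new_process))  # 5
print(sample_range(old_process))  # 22
```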


Definition of Inter-quartile Range

The inter-quartile range of a sample of values is the difference between the upper and lower quartiles. The lower quartile is the value which is greater than ¼ of the sample and less than ¾ of the sample. Conversely, the upper quartile is the value which is greater than ¾ of the sample and less than ¼ of the sample.


Ordered Blood Glucose Values

2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6 3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0 4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5 4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0

1/4 of 40 = 10 3/4 of 40 = 30
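The quartile positions counted here can be turned into a small sketch. The slides do not state an interpolation rule, so the positional approach below (taking the 10th and 30th ordered values) is an assumption; statistical packages use several slightly different conventions, one of which is shown for comparison.

```python
import statistics

# The 40 ordered blood glucose values (mmol/litre) from the slide
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

n = len(glucose)

# Simple positional convention (an assumption): 10th and 30th ordered values
lower_q = glucose[n // 4 - 1]
upper_q = glucose[3 * n // 4 - 1]
iqr_simple = upper_q - lower_q

# Standard-library quartiles, which interpolate between neighbouring values
q1, q2, q3 = statistics.quantiles(glucose, n=4)
iqr_lib = q3 - q1

print(f"positional: lower={lower_q}, upper={upper_q}, IQR={iqr_simple:.2f}")
print(f"library:    lower={q1:.3f}, upper={q3:.3f}, IQR={iqr_lib:.3f}")
```

The two conventions agree on the lower quartile here and differ slightly on the upper quartile, which is typical for small samples.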


Inter-Quartile Range

[Figure: histogram of blood glucose (mmol/litre); x-axis 2 to 6, y-axis count 0 to 7; the lower quartile, the upper quartile and the inter-quartile range between them are marked.]


Standard deviation

• Neither measure (range or inter-quartile range) uses all the numerical values - only relative magnitudes

• A measure accounting for the values is the standard deviation

• Consider the aspirin data from the new process: 96 97 100 101 101 (mean 99 mg)

• Determine deviations from mean: -3 -2 1 2 2

• Square, add, average and square-root:

$$\sqrt{\frac{9 + 4 + 1 + 4 + 4}{5}} = \sqrt{4.4} = 2.098$$
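The steps on this slide (deviations, square, add, average, square-root) can be sketched in Python. The slide averages over n = 5; many packages divide by n - 1 instead, so both versions are shown. Variable names are illustrative.

```python
import math
import statistics

new_process = [96, 97, 100, 101, 101]  # mg, from the slide

mean = sum(new_process) / len(new_process)         # 99.0
deviations = [x - mean for x in new_process]       # -3, -2, 1, 2, 2
squared = [d ** 2 for d in deviations]             # 9, 4, 1, 4, 4

# Average of the squared deviations (dividing by n, as on the slide), then square-root
sd_n = math.sqrt(sum(squared) / len(new_process))

# The usual sample standard deviation divides by n - 1 instead
sd_n_minus_1 = statistics.stdev(new_process)

print(f"sd (divide by n)     = {sd_n:.3f}")
print(f"sd (divide by n - 1) = {sd_n_minus_1:.3f}")
```

The n - 1 version is slightly larger; the difference becomes negligible as the sample size grows.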


Measures of scatter/dispersion – ‘how variable the data are’

• Range
– smallest to biggest value
– increases with sample size

• Standard deviation
– measure of variation around the mean
– affected by skewness and outliers

• Variance = square of standard deviation

• Interquartile range (IQR)
– from 25th centile to 75th centile


Plotting Data

• Histograms
• Stem and Leaf Plots
• Box Plots

Stem  Leaf     #
  60  0        1
  58
  56
  54
  52
  50  00       2
  48  000      3
  46  0000     4
  44  0000     4
  42  00       2
  40  000000   6
  38  0000     4
  36  000000   6
  34  000      3
  32  000      3
  30
  28  0        1
  26
  24
  22  0        1
      ----+----+----+----+
Multiply Stem.Leaf by 10**-1

[Figure: box plot of blood glucose (mmol/litre); axis 2 to 6.]


Mean and standard deviation

• Best description if distribution reasonably symmetric (and single mode)

• Gives a full description if the data have a Normal distribution


[Figure: three Normal density curves over x from 0 to 10 (y-axis 0 to 0.4): mean 3, s.d. 1; mean 5, s.d. 1; mean 5, s.d. 2.]


Properties of Normal distribution

• Symmetric distribution – mean, median and mode equal

• Completely specified by mean and standard deviation

• 95% of distribution contained within mean ± 1.96 standard deviations

• 68% within mean ± 1 standard deviation
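The two coverage figures can be checked numerically with `statistics.NormalDist` from the Python standard library. This is a minimal sketch; the helper name `coverage` is an assumption for illustration.

```python
from statistics import NormalDist

def coverage(k, mean=0.0, sd=1.0):
    """Proportion of a Normal distribution lying within mean +/- k standard deviations."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

print(f"within 1.96 sd: {coverage(1.96):.3f}")
print(f"within 1 sd:    {coverage(1.0):.3f}")

# The proportion does not depend on the particular mean and sd
print(f"mean 5, sd 2:   {coverage(1.96, mean=5.0, sd=2.0):.3f}")
```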


Continuous data, not Normally distributed

• If symmetric use mean and standard deviation
• If skewed use median and IQR

Unless
• Positively skewed, but log transformation creates symmetric distribution – use geometric mean


Nominal categorical data

• Mode.
• % in each category, especially when binary.

Wheeze in last 12 months

        Frequency (n)    %
No      1945             75.2
Yes     642              24.8
Total   2587             100.0
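The percentages and the mode in this table follow directly from the counts. A minimal sketch, with illustrative variable names:

```python
# Wheeze in last 12 months: frequencies from the slide
counts = {"No": 1945, "Yes": 642}

total = sum(counts.values())

# Percentage in each category, rounded to one decimal place
percentages = {category: round(100 * n / total, 1) for category, n in counts.items()}

# The mode of nominal data is the most frequent category
mode_category = max(counts, key=counts.get)

print(total, percentages, mode_category)
```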


Ordinal categorical data

• Median and IQR if enough separate values.
• Otherwise as for nominal.


Discrete quantitative data

• As for continuous data if many values, as for ordinal data if fewer.


Difference Between Standard Deviation & Standard Error


Measure of Variability of the Sample Mean

• Range, inter-quartile range and standard deviation describe the variability of the individual values in the population (sample), not the variability of the mean.

• To understand the difference carry out a sampling experiment using the Ritchie Index values


Values of the Ritchie Index (Measure of Joint Stiffness) in 50 Untreated Patients

14 9 8 9 1 20 3 3 2 4 2 3 6 1 2 11 16 24 16 21 19 22 33 12 12 12 19 10 33 2 19 40 1 20 1 2 4 7 9 4 9 6 14 8 27 10 27 7 24 21

Mean = (14+…+21)/50 = 12.18
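As a check on the arithmetic, the summary statistics of the 50 Ritchie Index values can be computed with the standard library. This is a sketch with illustrative variable names; `statistics.stdev` divides by n - 1.

```python
import statistics

# The 50 Ritchie Index values from the slide
ritchie = [
    14, 9, 8, 9, 1, 20, 3, 3, 2, 4,
    2, 3, 6, 1, 2, 11, 16, 24, 16, 21,
    19, 22, 33, 12, 12, 12, 19, 10, 33, 2,
    19, 40, 1, 20, 1, 2, 4, 7, 9, 4,
    9, 6, 14, 8, 27, 10, 27, 7, 24, 21,
]

ritchie_mean = statistics.mean(ritchie)
ritchie_median = statistics.median(ritchie)
ritchie_sd = statistics.stdev(ritchie)  # sample sd (divides by n - 1)

print(f"n = {len(ritchie)}")
print(f"mean = {ritchie_mean:.2f}, median = {ritchie_median:.1f}, sd = {ritchie_sd:.2f}")
```

The mean exceeds the median, which is consistent with a positively skewed distribution.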


Location = Central Tendency

[Figure: histogram of the Ritchie Index values; x-axis bins 0 - 5 to 36 - 40, y-axis count 0 to 16.]

• Arithmetic Mean - outlier prone
• Median - only uses relative magnitudes
• Mode - not necessarily central (categorical data)


Sampling Experiment

• Take a random sample (10) from the 50 values
• Determine the mean of the 10 values
• Repeat 50 times
• These means show variation - HOW LARGE IS IT?
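The experiment described above can be simulated in a few lines. This is a sketch under stated assumptions: each sample of 10 is drawn without replacement, and the seed is an arbitrary choice made only so that a run is repeatable. The numbers it prints will not reproduce the slides' figures exactly, because the original random draws are unknown.

```python
import random
import statistics

# The 50 Ritchie Index values from the slides
ritchie = [
    14, 9, 8, 9, 1, 20, 3, 3, 2, 4,
    2, 3, 6, 1, 2, 11, 16, 24, 16, 21,
    19, 22, 33, 12, 12, 12, 19, 10, 33, 2,
    19, 40, 1, 20, 1, 2, 4, 7, 9, 4,
    9, 6, 14, 8, 27, 10, 27, 7, 24, 21,
]

rng = random.Random(0)  # arbitrary fixed seed (assumption) for repeatability

# Take a random sample of 10 values, determine its mean, repeat 50 times
sample_means = [statistics.mean(rng.sample(ritchie, 10)) for _ in range(50)]

print(f"mean of sample means = {statistics.mean(sample_means):.2f}")
print(f"sd of sample means   = {statistics.stdev(sample_means):.2f}")
print(f"sd of original data  = {statistics.stdev(ritchie):.2f}")
```

The sample means vary much less than the individual values do, which is the point of the experiment.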


Variations in Samples

[Figure: five histograms of Ritchie Index values (x-axis bins 0 - 5 to 36 - 40, y-axis count 0 to 16), labelled Mean=12.18, Mean=10.00, Mean=12.60, Mean=13.40 and Mean=11.50.]


Ritchie Values

[Figure: histogram of the original Ritchie Index values; x-axis bins 0 - 5 to 36 - 40, y-axis 0 to 30.]

Original values: mean 12.18, sd 9.69


Ritchie Values
Sampling Experiment – Sample Means

[Figure: histograms of the sample means and of the original values; x-axis bins 0 - 5 to 36 - 40, y-axis 0 to 30.]

Sample means: mean 12.21, sd 2.97
Original values: mean 12.18, sd 9.69


Definition of the Standard Error

The standard deviation of the sampling distribution of the mean is called the standard error of the mean.


Increasing Sample Size

• Increased precision (smaller standard error)
• Less skewness

[Figure: two histograms of sample means; x-axis bins 0 - 5 to 36 - 40, y-axis 0 to 40.]

n=10: sample means have mean 12.21, sd 2.97
n=15: sample means have mean 12.37, sd 2.43


Standard error of the mean as a function of the sample size

[Figure: standard error of the mean (y-axis, 0 to 10) plotted against sample size (x-axis, 0 to 40).]

$$se = \frac{sd}{\sqrt{n}}$$
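The relationship on this slide, se = sd / √n, can be evaluated for the Ritchie data as a minimal sketch. The function name `standard_error` is illustrative; the sd of 9.69 is the value quoted on the earlier slides.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of the sample size."""
    return sd / math.sqrt(n)

sd_ritchie = 9.69  # sd of the 50 original Ritchie Index values, from the slides

for n in (10, 15, 50):
    print(f"n = {n:2d}: se = {standard_error(sd_ritchie, n):.2f}")
```

The values for n = 10 and n = 15 are of the same order as the empirical sds of the sample means reported for the sampling experiment (2.97 and 2.43); exact agreement is not expected from only 50 repeated samples. For n = 50 the formula gives about 1.37.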


Population of Gene Lengths
n=20,290

[Figure: histogram; x-axis gene length (# of nucleotides), 0 to 15000; y-axis frequency, 0 to 3000.]


Samples of size: n=100

[Figure: histogram; x-axis gene length (# of nucleotides), 0 to 15000; y-axis frequency, 0 to 300.]


Practical Confusion

• A mean is often reported in medical papers as 12.18 ± 1.37

• What is 1.37: sd or se?


Thanks!

Tea break