INTRODUCTORY STATISTICS
March 2008
Francesca Little
•INTRODUCTION : What is Statistics?
•TYPES OF DATA
•GRAPHICAL METHODS FOR DISPLAYING DATA
•SUMMARIZING DATA
•STATISTICAL INFERENCE
•COMPARING GROUPS
Continuous data – Parametric
Non-parametric
Categorical data
•MEASURES OF DISEASE FREQUENCY AND EFFECT
•SAMPLE SIZE ESTIMATION
•ESTIMATION VERSUS HYPOTHESIS TESTING
INTRODUCTION:
Statistics = the art of decision making
Collection of data
Organization of data
Summarizing and displaying data
Analysis of data
Interpretation
References:
1. Pagano, M. & Gauvreau, K. Principles of Biostatistics. 2nd edition. Duxbury, 2000.
2. Connor, J.T. The Value of a p-Valueless Paper. Am J Gastroenterol 2004;99:1638-1640.
3. Armitage, P. & Berry, G. Statistical Methods in Medical Research. 2nd edition. Blackwell, 1987.
4. Fisher, L.D. & Van Belle, G. Biostatistics: A Methodology for the Health Sciences. Wiley, 1993.
5. Beagle et al. Introduction to Epidemiology. WHO, 1993.
EXAMPLE : Randomised Clinical Trial to investigate and compare the efficacy
of two treatment regimes in the treatment of malaria
Data include information on
•Patient demographics
•Baseline diagnostics
•Treatment and Outcome
•Kinetics
Statistical Software
•SPlus
•Stata
•Excel
•Statistica
DEMOGRAPHIC AND BASELINE DIAGNOSTIC DATA
+------------------------------------------------------------------+
| subject site age gender weight feverhis pardens0 |
|------------------------------------------------------------------|
1. | MOC001 SiteA 2 M 10.7 Y 46 |
2. | MOC002 SiteA 2 F 10.7 Y 523 |
3. | MOC003 SiteA 13 F 31.5 Y 174 |
4. | MOC004 SiteA 6 F 20 Y 152 |
5. | MOC006 SiteA 3 F 11.7 Y 372 |
|------------------------------------------------------------------|
6. | MOC009 SiteA 31 M 55 Y 239 |
7. | MOC010 SiteA 2 M 10.6 Y 30595 |
8. | MOC011 SiteA 12 M 33.6 Y 217 |
9. | MOC012 SiteA 2 M 11.6 Y 5338 |
10. | MOC013 SiteA 4 F 15 Y 78 |
|------------------------------------------------------------------|
11. | MOC016 SiteA 14 F 33.1 Y 94 |
12. | MOC017 SiteA 4 F 12.4 Y 108 |
13. | MOC018 SiteA 14 F 47 N 251 |
14. | MOC019 SiteA 6 M 14.5 N 502 |
15. | MOC020 SiteA 12 F 20.3 N 3193 |
|------------------------------------------------------------------|
16. | MOC021 SiteA 12 M 20.5 N 284 |
17. | MOC022 SiteA 11 M 21.2 N 1045 |
18. | MOC023 SiteA 7 F 18.1 N 4821 |
19. | MOC024 SiteA 6 M 13.4 N 347 |
20. | MOC025 SiteA 6 F 19.1 Y 522 |
|------------------------------------------------------------------|
21. | MOC026 SiteA 5 M 15.1 Y 3499 |
22. | MOC027 SiteA 6 F 16.4 Y 614 |
23. | MOC028 SiteA 8 F 25.6 N 153 |
24. | MOC029 SiteA 14 M 36.6 N 325 |
25. | MOC030 SiteA 9 F 28.1 N 536 |
|------------------------------------------------------------------|
TREATMENT AND OUTCOME DATA
+---------------------------------------------------------------+
| subject A1dose Bdose A2dose trt outcome |
|---------------------------------------------------------------|
1. | MOC001 93.45795 0 4.672897 trtA ACPR |
2. | MOC002 46.72897 0 2.336449 trtA ACPR |
3. | MOC003 31.74603 0 1.587302 trtA ACPR |
4. | MOC004 25 0 1.25 trtA ACPR |
5. | MOC006 42.73504 6.410256 2.136752 trtB ACPR |
|---------------------------------------------------------------|
6. | MOC009 27.27273 5.454545 1.363636 trtB LTFU |
7. | MOC010 94.33962 9.433962 4.716981 trtB ACPR |
8. | MOC011 29.76191 4.464286 1.488095 trtB ACPR |
9. | MOC012 43.10345 6.465517 2.155172 trtB ACPR |
10. | MOC013 33.33333 0 1.666667 trtA ACPR |
|---------------------------------------------------------------|
11. | MOC016 45.31722 9.063444 2.265861 trtB ACPR |
12. | MOC017 40.32258 6.048387 2.016129 trtB ACPR |
13. | MOC018 31.91489 0 1.595745 trtA ACPR |
14. | MOC019 34.48276 5.172414 1.724138 trtB ACPR |
15. | MOC020 49.26109 7.389163 2.463054 trtB ACPR |
|---------------------------------------------------------------|
16. | MOC021 48.78049 0 2.439024 trtA ACPR |
17. | MOC022 47.16981 7.075471 2.35849 trtB ACPR |
18. | MOC023 55.24862 8.287292 2.762431 trtB ACPR |
19. | MOC024 37.31343 5.597015 1.865672 trtB ACPR |
20. | MOC025 52.35602 0 2.617801 trtA ACPR |
|---------------------------------------------------------------|
21. | MOC026 33.11258 0 1.655629 trtA ACPR |
22. | MOC027 30.48781 0 1.52439 trtA ACPR |
23. | MOC028 39.0625 5.859375 1.953125 trtB ACPR |
24. | MOC029 40.98361 8.196722 2.049181 trtB ACPR |
25. | MOC030 35.58719 0 1.779359 trtA ACPR |
|---------------------------------------------------------------|
trtA = two drugs, A1 and A2; trtB = trtA + a third drug, B
PK DATA
+---------------------------------------------------------------------+
| subject AUCall_A1 Cmax_A1 Tmax_A1 AUCall_A2 Tmax_A2 Cmax_A2 |
|---------------------------------------------------------------------|
1. | MOC001 1481.1 123.655 1 1323.425 1 386.898 |
2. | MOC002 694.0709 106.517 2 4619.423 2 404.834 |
3. | MOC003 1019.58 103.838 1 5512.589 1 433.755 |
4. | MOC004 742.2638 75.3804 1 3275.157 0 312.978 |
5. | MOC006 548.6373 72.0582 1 1907.779 1 284.006 |
|---------------------------------------------------------------------|
6. | MOC009 745.3734 69.0091 2 1414.887 1 270.758 |
7. | MOC010 626.12 95.014 1 2166.401 1 469.149 |
8. | MOC011 906.2881 86.4171 1 2749.568 1 481.745 |
9. | MOC012 734.5026 70.785 2 1980.176 2 282.227 |
10. | MOC013 855.3212 83.2393 3 1942.275 1 340.038 |
|---------------------------------------------------------------------|
11. | MOC016 976.5257 94.5215 1 2886.02 1 625.695 |
12. | MOC017 531.5585 78.3498 1 1050.526 1 198.951 |
13. | MOC018 899.3121 105.459 1 1608.391 1 342.87 |
14. | MOC019 762.8975 111.888 1 1402.796 1 356.158 |
15. | MOC020 884.7603 111.346 2 2041.504 2 434.162 |
|---------------------------------------------------------------------|
16. | MOC021 614.1311 109.219 1 1791.219 1 493.847 |
17. | MOC022 745.5765 88.314 1 2049.41 1 449.877 |
18. | MOC023 34.721 6.4265 1 113.2341 1 18.1723 |
19. | MOC024 643.1728 64.3864 2 1281.929 2 196.15 |
20. | MOC025 1202.952 124.234 1 3983.165 1 647.307 |
|---------------------------------------------------------------------|
21. | MOC026 615.6737 75.757 1 1901.963 1 264.39 |
22. | MOC027 1199.341 139.315 1 3936.105 1 591.919 |
23. | MOC028 1317.269 127.166 1 3276.55 1 582.077 |
24. | MOC029 1158.077 105.017 1 3078.129 1 388.982 |
25. | MOC030 1078.056 94 1 2668.494 1 509.955 |
|---------------------------------------------------------------------|
TYPES OF DATA
NOMINAL, the values fall into unordered categories
eg., site = SiteA or SiteB
gender = male or female
outcome= ACPR or LTFU or ETF or LPF or LCF
ORDINAL, where order is important: the values in one category are in some
way less or worse than the values in another category
eg., severity of symptoms = none, mild, moderate, severe
DISCRETE, where both order and magnitude are important but the variables
can take on only isolated values, usually integers, that differ by fixed amounts
eg., number of children for one woman
number of organisms in sample
CONTINUOUS, where the data represent measurable quantities that can be
measured to any degree of accuracy
eg., parasite density on day 0
Area under the curve for Trt A1
Age
GRAPHS
BAR CHARTS to display nominal or ordinal data
[Figure: two bar charts of outcome (ACPR, LTFU, ETF, LPF, LCF). First chart, y-axis Frequency, bar labels 106, 12, 2, 5, 3. Second chart, y-axis Percent, bar labels 82.81, 9.375, 1.563, 3.906, 2.344.]
The heights of the vertical bars show either the frequency or relative
frequency (percent) of observations in each class.
HISTOGRAMS illustrate frequency distributions for discrete or continuous
data
Take care when choosing the width of the bins on the X-axis.
Remember that it is the area of each bin that represents the
frequency.
[Figure: histogram of Cmax_A1 (x-axis 0 to 200, y-axis Frequency 0 to 20)]
ILLUSTRATING THE RELATIONSHIP BETWEEN TWO VARIABLES
Two categorical variables:
Bar graphs of one variable by the other variable
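[Figure: bar graph of count of patno by treatment group (TrtA, TrtB), with separate bars for SiteA and SiteB]
[Figure: bar graph of count of patno by treatment group (TrtA, TrtB), with separate bars for each outcome (ACPR, LTFU, ETF, LPF, LCF)]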
Two continuous variables – Scatter plots
The basic graphical technique for the two-variable situation is the scatter
diagram. In general the data refer to a number of individuals, each of which
provides observations on two variables. In the scatter diagram each variable is
allotted one of two co-ordinate axes and each observation defines a point, of
which the co-ordinates are the observed values of the two variables. The
scatter diagram gives a compact illustration of the relationship between the two
variables.
[Figure: scatter plot of StuSitWeightKG (y-axis) against StuSitDOBAge (x-axis, 0 to 60)]
[Figure: the same scatter plot shown twice more, each with fitted values overlaid]
You may try to fit a straight line or a quadratic curve to the data to summarize the
suggested relationship.
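As a rough sketch of what fitting a straight line involves, the Python snippet below applies ordinary least squares to the age and weight values of the 25 subjects listed in the demographic table earlier. The function name `fit_line` is illustrative and not part of the original notes, and the plots in these notes use the full data set, so the coefficients here need not match them.

```python
# Age (years) and weight (kg) for the 25 subjects listed in the
# demographic table earlier in these notes.
ages = [2, 2, 13, 6, 3, 31, 2, 12, 2, 4, 14, 4, 14,
        6, 12, 12, 11, 7, 6, 6, 5, 6, 8, 14, 9]
weights = [10.7, 10.7, 31.5, 20, 11.7, 55, 10.6, 33.6, 11.6, 15,
           33.1, 12.4, 47, 14.5, 20.3, 20.5, 21.2, 18.1, 13.4, 19.1,
           15.1, 16.4, 25.6, 36.6, 28.1]


def fit_line(x, y):
    """Least-squares straight line y = intercept + slope * x (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(ages, weights)
# Fitted values: the points on the line at each observed age.
fitted = [intercept + slope * a for a in ages]
print(f"weight = {intercept:.2f} + {slope:.2f} x age")
```

The fitted values are what a "Fitted values" line on a scatter plot displays; a quadratic fit would add an age-squared term in the same spirit.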
EXERCISE:
Think about your specific research project and consider the nature of the
expected results in terms of the type of information that you are collecting
or generating. Will you be able to organize this information in terms of
variables and observations, and can you identify what kind of variables you
will have?
Use the data from the malaria example discussed earlier and group all
variables according to their type, i.e., nominal, ordinal, discrete, continuous.
Discuss how you would illustrate the information collected on gender and
how you would further illustrate the relationship between gender and
outcome.
You are interested in whether there is a relationship between the dose of
trt A1 received and the AUC for trt A1. How would you illustrate this
relationship?
The graph below illustrates the use of other medications for the two
sites. Comment and compare the two sites with respect to use of other
medications.
[Figure: bar graph of count of patno by site (SiteA, SiteB), with separate bars for Antipyretic, Antimalarial, Topical and Rehydration medications]
The graph below illustrates the relationship between Cmax for trt A1 and
dose. Comment on this relationship.
[Figure: scatter plot of Cmax_A1 (y-axis, 0 to 200) against A1dose (x-axis, 20 to 100), with fitted values overlaid]
SUMMARIZING DATA
CATEGORICAL DATA
Create frequency distributions that give the number or percentage of observations in each category of
the nominal or ordinal variable.
For example
outcome | Freq. Percent Cum.
------------+-----------------------------------
ACPR | 106 82.81 82.81
LTFU | 12 9.38 92.19
ETF | 2 1.56 93.75
LPF | 5 3.91 97.66
LCF | 3 2.34 100.00
------------+---------------------------------
Total| 128 100.00
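The percent and cumulative percent columns follow directly from the counts. A minimal Python sketch, using only the frequencies in the table above (the variable names are illustrative):

```python
# Outcome counts taken from the frequency table above.
counts = {"ACPR": 106, "LTFU": 12, "ETF": 2, "LPF": 5, "LCF": 3}

total = sum(counts.values())

rows = []
cumulative = 0.0
for outcome, freq in counts.items():
    percent = 100 * freq / total      # relative frequency as a percentage
    cumulative += percent             # running total of the percentages
    rows.append((outcome, freq, percent, cumulative))

for outcome, freq, percent, cum in rows:
    print(f"{outcome:<6}{freq:>6}{percent:>9.2f}{cum:>9.2f}")
print(f"{'Total':<6}{total:>6}{100:>9.2f}")
```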
Use two-way frequency tables to summarize the relationships between two categorical variables:
           |                        outcome
trt | ACPR LTFU ETF LPF LCF | Total
-----------+-------------------------------------------------------+----------
trtA | 50 5 1 4 2 | 62
| 80.65 8.06 1.61 6.45 3.23 | 100.00
-----------+-------------------------------------------------------+----------
trtB | 56 7 1 1 1 | 66
| 84.85 10.61 1.52 1.52 1.52 | 100.00
-----------+-------------------------------------------------------+----------
Total | 106 12 2 5 3 | 128
| 82.81 9.38 1.56 3.91 2.34 | 100.00
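The second line in each row of the table is a row percentage: each count divided by its row total. A short Python sketch of that calculation, using the counts above (names are illustrative):

```python
# Counts from the two-way table above, in the order of `outcomes`.
outcomes = ["ACPR", "LTFU", "ETF", "LPF", "LCF"]
table = {
    "trtA": [50, 5, 1, 4, 2],
    "trtB": [56, 7, 1, 1, 1],
}

row_totals = {trt: sum(freqs) for trt, freqs in table.items()}

# Row percentages: each cell as a percentage of its treatment group total.
row_percent = {
    trt: [round(100 * f / row_totals[trt], 2) for f in freqs]
    for trt, freqs in table.items()
}

# Column totals reproduce the overall frequency distribution of outcome.
column_totals = [sum(col) for col in zip(*table.values())]

for trt in table:
    print(trt, row_totals[trt], row_percent[trt])
print("Total", sum(row_totals.values()), column_totals)
```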
Continuous data
Recall the histogram of Cmax_A1.
How can we summarize the information on Cmax_A1 values for our sample using a few numbers?
What is it that we want to know about the Cmax_A1 values?
Measures of central tendency:
Mean
Median
Mode
Measures of dispersion:
Range
Interquartile range
Variance and standard deviation
Coefficient of variation
[Figure: histogram of Cmax_A1 (x-axis 0 to 200, y-axis Frequency 0 to 20)]
THE MEAN
= the average value
= the sum of all the values / number of observations
     +-----+
     | age |
     |-----|
  1. |  31 |
  2. |   6 |
  3. |  21 |
  4. |  27 |
  5. |  15 |
  6. |   3 |
  7. |  17 |
  8. |  35 |
  9. |   3 |
 10. |  56 |
 11. |  23 |
 12. |   3 |
     +-----+
mean age = (31+6+21+27+15+3+17+35+3+56+23+3)/12 = 240/12 = 20
Say the age of 56 was an error and should really have been recorded as 36.
Then the new mean is mean age =
(31+6+21+27+15+3+17+35+3+36+23+3)/12 = 220/12 = 18.33
The mean takes into consideration the actual magnitude of the values and is
very sensitive to unusual values.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
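A small Python sketch of the mean for the 12 ages, showing how a single corrected value (56 recorded instead of 36) shifts it; the variable names are illustrative:

```python
# The 12 ages from the example above.
ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]

mean_age = sum(ages) / len(ages)                 # 240 / 12 = 20.0

# Replace the suspect value 56 with 36 and recompute.
corrected = [36 if a == 56 else a for a in ages]
mean_corrected = sum(corrected) / len(corrected)  # 220 / 12 = 18.33...

print(f"mean with 56: {mean_age:.2f}")
print(f"mean with 36: {mean_corrected:.2f}")
```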
[Figure: histogram of StuSitDOBAge (x-axis 0 to 60, y-axis Frequency)]
THE MEDIAN = the central observation,
50% of the values lie below this value and 50% lie above.
For n observations, where n is odd, the median is the [(n+1)/2]th largest value.
When n is even, the median is the average of the two middle observations, the
(n/2)th and [(n/2)+1]th observations.
+-----+
| age |
|-----|
1. | 31 |
2. | 6 |
3. | 21 |
4. | 27 |
5. | 15 |
6. | 3 |
7. | 17 |
8. | 35 |
9. | 3 |
10. | 56 |
11. | 23 |
12. | 3 |
+-----+
Sorted from smallest to largest:
+-----+
| age |
|-----|
1. | 3 |
2. | 3 |
3. | 3 |
4. | 6 |
5. | 15 |
6. | 17 |
7. | 21 |
8. | 23 |
9. | 27 |
10. | 31 |
11. | 35 |
12. | 56 |
+-----+
N=12 is even, so the (n/2 = 6)th obs = 17
and the [(n/2)+1] = 7th obs = 21,
so median = (17+21)/2 = 19
Changing 56 to 36 does not affect
the median because it is still the
largest value and so the rank or
order of the observations has
remained the same. The median is
much more robust than the mean.
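The rule for odd and even n can be sketched in Python as follows. The function `median` is an illustrative helper written for these notes, not a library call:

```python
def median(values):
    """Median by the rule above: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the [(n+1)/2]th observation (1-based rank).
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n/2)th and [(n/2)+1]th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]
corrected = [36 if a == 56 else a for a in ages]

median_age = median(ages)              # (17 + 21) / 2 = 19.0
median_corrected = median(corrected)   # unchanged: 19.0

print(median_age, median_corrected)
```

Both calls give 19, whereas the mean drops from 20 to 18.33 when 56 is replaced by 36.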
In the previous example the mean and median gave similar values for the
central age. However, this is not always the case.
Cmax_A1 has a symmetrical distribution, thus the mean=84.67 is very similar to the median=81.33.
[Figure: density plot of Cmax_A1 (x-axis 0 to 200)]
Parasite density on day 0 has a very skew distribution, thus the mean=33533 is very different from the median=5280. For very skew distributions, the median is a more appropriate measure of the central position than the mean.
[Figure: density plot of parasite density on day 0 (pardens)]
THE MODE = the value that occurs most frequently.
     +-----+
     | age |
     |-----|
  1. |   3 |
  2. |   3 |
  3. |   3 |
  4. |   6 |
  5. |  15 |
  6. |  17 |
  7. |  21 |
  8. |  23 |
  9. |  27 |
 10. |  31 |
 11. |  35 |
 12. |  56 |
     +-----+
So for the 12 ages, the mode = 3
More appropriate for discrete data, where you may be interested in the most
frequently observed response, e.g., when your variable measures the number of
children for one mother, the mode will give you the most common family size.
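A minimal Python sketch of finding the mode by counting how often each value occurs. It returns a list because a data set can have more than one most-frequent value; the variable names are illustrative:

```python
from collections import Counter

ages = [3, 3, 3, 6, 15, 17, 21, 23, 27, 31, 35, 56]

# Count the occurrences of each distinct value.
counts = Counter(ages)
highest = max(counts.values())

# Keep every value that reaches the highest count.
modes = [value for value, count in counts.items() if count == highest]

print(modes, "occurs", highest, "times")
```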
MEASURES OF SPREAD
RANGE = difference between largest and smallest value
INTERQUARTILE RANGE = the range of the central 50% of the values
+-----+
| age |
|-----|
1. | 3 |
2. | 3 |
3. | 3 |
4. | 6 |
5. | 15 |
6. | 17 |
7. | 21 |
8. | 23 |
9. | 27 |
10. | 31 |
11. | 35 |
12. | 56 |
+-----+
The range=56-3=53,
more commonly expressed as the two limits, (3-56).
To calculate the interquartile range, we need to identify the 25th
and 75th percentiles for the data.
In general, with the observations ranked from smallest to largest, the kth percentile is the average of the observations with rank nk/100 and (nk/100 + 1) if nk/100 is an integer. If nk/100 is not an integer, the kth percentile is the observation with rank (j+1), where j is the largest integer less than nk/100.
For our example, n=12:
12x25/100=3, an integer, so the 25th percentile is the average of the 3rd and 4th observations: (3+6)/2 = 4.5
12x75/100=9, an integer, so the 75th percentile is the average of the 9th and 10th observations: (27+31)/2 = 29
Thus interquartile range=29-4.5=24.5, also expressed as (4.5-29).
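The percentile rule can be sketched in Python as below. The function `percentile` is an illustrative helper that follows the rule in these notes; statistical packages use several slightly different percentile definitions, so their results may differ for small samples.

```python
def percentile(values, k):
    """kth percentile (integer k, 0 < k < 100) by the rank rule described above."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    rank, remainder = divmod(n * k, 100)
    if remainder == 0:
        # nk/100 is an integer: average the observations with
        # ranks nk/100 and nk/100 + 1 (1-based ranks).
        return (ordered[rank - 1] + ordered[rank]) / 2
    # Otherwise take the (j+1)th observation, where j = floor(nk/100).
    return ordered[rank]


ages = [3, 3, 3, 6, 15, 17, 21, 23, 27, 31, 35, 56]

p25 = percentile(ages, 25)   # (3 + 6) / 2 = 4.5
p75 = percentile(ages, 75)   # (27 + 31) / 2 = 29.0
iqr = p75 - p25              # 24.5

print(p25, p75, iqr)
```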
VARIANCE = the amount of variability around the mean
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
It can be thought of as the average of the squared deviations
from the mean.
   age    age-mean    (age-mean)^2
    31        11          121
     6       -14          196
    21         1            1
    27         7           49
    15        -5           25
     3       -17          289
    17        -3            9
    35        15          225
     3       -17          289
    56        36         1296
    23         3            9
     3       -17          289
 Total 240     0         2798
Variance = 2798/11 = 254.36
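The variance calculation for the 12 ages can be reproduced step by step in Python; this sketch builds the deviations and squared deviations explicitly (variable names are illustrative):

```python
ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]

n = len(ages)
mean_age = sum(ages) / n                      # 20.0

deviations = [a - mean_age for a in ages]     # age - mean
squared = [d ** 2 for d in deviations]        # (age - mean)^2

sum_of_squares = sum(squared)                 # 2798.0
variance = sum_of_squares / (n - 1)           # divide by n - 1 = 11

print(f"sum of deviations = {sum(deviations):.0f}")
print(f"sum of squared deviations = {sum_of_squares:.0f}")
print(f"variance = {variance:.2f}")
```

The deviations always sum to zero, which is why they are squared before averaging.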
THE STANDARD DEVIATION:
The units of measurement of the variance are not the same as the units of
measurement of the variable for which you have calculated the variance. For
this reason we calculate the standard deviation which is the positive square
root of the variance. For the 12 ages, the standard deviation will thus equal
$s = \sqrt{254.36} = 15.95$
This now has the same units as age.
The variable with the larger standard deviation is the more variable.
However, if variables have different units of measurement, it is not
appropriate to compare the standard deviations.
To get rid of the units of measurement, we calculate the
COEFFICIENT OF VARIATION as
$CV = \frac{s}{\bar{x}} \times 100 = (15.95/20) \times 100 \approx 79.74$ (using the unrounded s),
which expresses the standard deviation as a percentage of the
mean, or the variability as a percentage of the central position.
It is dimensionless and can be used to evaluate the relative
variability of two groups of observations.
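A short Python sketch of the standard deviation and coefficient of variation for the 12 ages, using the standard library `statistics` module (`stdev` uses the n-1 divisor, as in the variance formula above):

```python
import statistics

ages = [31, 6, 21, 27, 15, 3, 17, 35, 3, 56, 23, 3]

mean_age = statistics.mean(ages)   # 20
sd_age = statistics.stdev(ages)    # sample standard deviation (n - 1 divisor)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_age = 100 * sd_age / mean_age

print(f"s = {sd_age:.2f}")
print(f"CV = {cv_age:.2f}")
```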
Illustrating (graphing) the measures of central position and spread:
[Figure: box plot of StuSitDOBAge (y-axis 0 to 60), annotated from top to bottom: maximum, adjacent value, upper quartile, median, lower quartile, minimum]
The adjacent value is the most extreme value not more than 1.5 times the height of the box
beyond either quartile, so here not more than
1.5x(18-3)=22.5 above the upper quartile, i.e., 22.5+18 = 40.5
Side-by-side box plots are very useful to compare groups:
•median age slightly higher for trtB group than for trtA group
•interquartile range wider for trtB group than for trtA group
•age-distribution for trtA-group skew because of outlying values
[Figure: side-by-side box plots of StuSitDOBAge (y-axis 0 to 60) for trtA and trtB; graphs by trt]
Sometimes bar graphs with error bars are used to display means and
standard deviations:
[Figure: mean plot with error bars showing the mean and mean±SD of age for the two treatment groups, trtA and trtB]
•Mean ages for the two treatment groups are the same; ages in the
trtB group are slightly less variable than in the trtA group
CLASS EXERCISE:
(From Pagano & Gauvreau)
In Massachusetts, 8 individuals experienced
unexplained episodes of vitamin D intoxication
that required hospitalization; it was thought that
these unusual occurrences might be as a result of excessive supplementation of dairy milk.
Blood levels of calcium and albumin for each
subject at the time of hospitalization are listed.
For these calcium and albumin levels,
calculate the
Mean
Median
Mode
Range
Interquartile range
Standard deviation
Coefficient of variation
If you wish, use a statistical software package to illustrate these measures of central tendency
and spread.
Calcium Albumin
(mmol/l) (g/l)
2.92 43
3.84 42
2.37 42
2.99 40
2.67 42
3.17 38
3.74 34
3.44 42
The tables and graphs below summarise Cmax values for drug A1 for the two treatment groups.
From these tables and graphs:
Compare the means and medians for the two groups. Discuss which of the two measures of central position is the
most appropriate.
Comment on the variability of the Cmax values within each treatment group by referring to the standard
deviations, the ranges and interquartile ranges.
Calculate and compare the coefficients of variation of Cmax within the two treatment groups.
. table trt,c(n Cmax_A1 mean Cmax_A1 sd Cmax_A1)
----------------------------------------------------
trt | N(Cmax_A1) mean(Cmax_A1) sd(Cmax_A1)
----------+-----------------------------------------
trtA | 62 87.55352 29.73334
trtB | 63 81.82308 28.72239
----------------------------------------------------
. table trt,c(min Cmax_A1 p25 Cmax_A1 med Cmax_A1 p75 Cmax_A1 max Cmax_A1)
---------------------------------------------------------------------------
trt | min(Cmax_A1) p25(Cmax_A1) med(Cmax_A1) p75(Cmax_A1) max(Cmax_A1)
----------+----------------------------------------------------------------
trtA | 28.2339 64.2546 86.60575 108.267 165.744
trtB | 6.4265 64.3864 80.7916 95.014 152.769
---------------------------------------------------------------------------
[Figure: box plots of Cmax_A1 (y-axis 0 to 200) for trtA and trtB, graphs by trt; density plot of Cmax_A1 (x-axis 0 to 200)]
The tables and graphs below summarise Cmax values for drug A2 for the two treatment groups.
From these tables and graphs:
Compare the means and medians for the two groups. Discuss which of the two measures of central position is the
most appropriate.
Comment on the variability of the Cmax values within each treatment group by referring to the standard
deviations, the ranges and interquartile ranges.
Calculate and compare the coefficients of variation of Cmax within the two treatment groups.
. table trt,c(n Cmax_A2 mean Cmax_A2 sd Cmax_A2)
----------------------------------------------------
trt | N(Cmax_A2) mean(Cmax_A2) sd(Cmax_A2)
----------+-----------------------------------------
trtA | 62 340.6208 159.762
trtB | 63 292.24 145.7977
----------------------------------------------------
. table trt,c(min Cmax_A2 p25 Cmax_A2 med Cmax_A2 p75 Cmax_A2 max Cmax_A2)
---------------------------------------------------------------------------
trt | min(Cmax_A2) p25(Cmax_A2) med(Cmax_A2) p75(Cmax_A2) max(Cmax_A2)
----------+----------------------------------------------------------------
trtA | 97.0572 236.945 307.553 423.281 1000.9
trtB | 18.1723 195.749 268.026 358.674 712.894
---------------------------------------------------------------------------
[Figure: density plot of Cmax for drug A2 (x-axis labelled Cmax_P, 0 to 1000); box plots by treatment group (y-axis 0 to 1,000, groups labelled SP and SP/ART), graphs by trt]
INTRODUCTORY STATISTICS
continued
PROBABILITY AND STATISTICAL INFERENCE
PROBABILITY:
All statistical summaries and hence decisions are subject to uncertainty.
The appropriate tool for measuring uncertainty is the theory of probability.
In frequentist statistics, probabilities are just relative frequencies that express the number of times that a given outcome was observed as a proportion of the total number of trials.
For example,
for the malaria data, we had 128 subjects, of whom 67 were female. Thus the probability of a female subject is 67/128 = 0.5234.
Of the 128 subjects, 106 had an “adequate clinical response”, hence the probability of successful
treatment was 106/128=0.8281.
Within each treatment group we observed that
50 of the 62 subjects on trt A had an adequate clinical response, so probability = 50/62 = 0.8065
56 of the 66 subjects on trt B had an adequate clinical response, so probability = 56/66 = 0.8485.
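These relative frequencies can be reproduced with a few lines of Python. The helper `relative_frequency` is an illustrative name, and the counts are the ones quoted above:

```python
def relative_frequency(count, total):
    """Probability estimated as the proportion of trials with the outcome."""
    return count / total


p_female = relative_frequency(67, 128)      # female subjects
p_acpr = relative_frequency(106, 128)       # adequate clinical response, overall
p_acpr_trtA = relative_frequency(50, 62)    # adequate clinical response on trt A
p_acpr_trtB = relative_frequency(56, 66)    # adequate clinical response on trt B

for label, p in [("female", p_female), ("ACPR", p_acpr),
                 ("ACPR | trtA", p_acpr_trtA), ("ACPR | trtB", p_acpr_trtB)]:
    print(f"P({label}) = {p:.4f}")
```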