55
Biostatistics Lecture 2 Graphical Presentation of Data

Biostatistics Lecture 2 Graphical Presentation of Data

Embed Size (px)

Citation preview

Page 1: Biostatistics Lecture 2 Graphical Presentation of Data

Biostatistics

Lecture 2Graphical Presentation of

Data

Page 2: Biostatistics Lecture 2 Graphical Presentation of Data

Data Organization

• Measurements that have not been organized, summarized or otherwise manipulated are called raw data.

• Unless the number of observations is extremely small, it will be unlikely that these raw data will impart much information until they have been put into some kind of order.• Always it is easier to analyze organized data

Page 3: Biostatistics Lecture 2 Graphical Presentation of Data

The ordered array• The preparation of the ordered array is the first step in

organizing data.

• An ordered array is a listing of the values of a collection (either population or sample) from the smallest value to the largest value.

• The ordered array enables one to determine quickly the value of the smallest measurement, the value of the largest measurement and the general trends in the data.

Raw data 13 3 17 9 5 7 15 11

Organized 3 5 7 9 11 13 15 17

Page 4: Biostatistics Lecture 2 Graphical Presentation of Data

Grouped DataThe frequency distribution

• Although a set of observation can be made more comprehensible and meaningful by means of an ordered array, further useful summarization may be achieved by grouping the data.

• To group a set of observations, we select a set of non-overlapping intervals such that each value in the data set of observations can be placed in one, and only one, interval.

• These intervals are usually referred to as Class Intervals.

• Usually class intervals are ordered from smallest to largest.

Page 5: Biostatistics Lecture 2 Graphical Presentation of Data

Grouped dataThe frequency distribution

• How many intervals should we use? (0-100 years)– Too few intervals are undesirable because of the resulting loss of

information. (eg. 0-50, 51-100) two intervals– Too many intervals, on the other hand, will not meet the

objective of summarization. (eg. 0-1,2-3,4-5,…….99-100)!!– A commonly used rule is there should be no fewer than six

intervals and no more than 15. (6-15 is optimal)– Sturges rule: where k is the number of

class intervals and n is the number of values in the data set under consideration. (rounded to nearest integer)

– The size of the class interval is often selected as 5, 10, 15 or 20 etc

)(log3229.31 nk

Page 6: Biostatistics Lecture 2 Graphical Presentation of Data

Grouped dataThe frequency distribution

• The width of class intervals:– Class intervals should be generally of the same width.– The width may be obtained by dividing the range by k, the

number of class intervals.

eg: tablet hardness values range between 50 and 120 N, calculate the recommended number of intervals and the interval width for data contains 60 values of tablet hardness??

n= 60Range (k)=largest – smallest=120-50=70#intervals=1+3.329(logn)=1+3.329(log60)=7Interval width=k/n=70/7=10

Page 7: Biostatistics Lecture 2 Graphical Presentation of Data

Grouped dataThe frequency distribution

Class Interval Frequency

10-19 4

20-29 66

30-39 47

40-49 36

50-59 12

60-69 4

Total 169

Frequency distribution of ages of 169 subjects.

How many subjects are there in each class interval?

non-overlapping intervals

Variables range = 69-10 + 1 = 60

Interval width

Page 8: Biostatistics Lecture 2 Graphical Presentation of Data

Grouped dataThe relative frequency distribution

• It may be useful sometimes to know the proportion rather than the number, of values falling between a particular class interval.

• We obtain this information by dividing the number of values in the particular class interval by the total number of values.

• We refer to the proportion of values falling within a class interval as the relative frequency of values in that interval.

• We may sum (cumulate) the frequencies and relative frequencies to facilitate obtaining information regarding frequency or relative frequency of values within two or more contiguous class intervals.

Class Interval

Frequency Relative Frequency

Cumulative Frequency

Cumulative Relative

Frequency

10-19 4 0.0237 4 0.0237

20-29 66 0.3905 70 0.4142

30-39 47 0.2781 117 0.6923

40-49 36 0.2130 153 0.9053

50-59 12 0.0710 165 0.9763

60-69 4 0.0237 169 1.0000

Total 169 1.0000

Page 9: Biostatistics Lecture 2 Graphical Presentation of Data

Grouped dataThe relative frequency distribution

• We use true limits to fill the gaps between intervals for a continuous variable.

• Using true limits is very essential to calculate statistics (range, median,…etc) of grouped data.

• Upper true limit = upper class value + 0.5.

• Lower true limit = lower class value - 0.5.

Intervals True limits frequency10--19 9.5-19.5 420-29 19.5-29.5 6630-39 29.5-39.5 4740-49 39.5-49.5 3650-59 49.5-59.5 1260-69 59.5-69.5 4

Page 10: Biostatistics Lecture 2 Graphical Presentation of Data

Histogram

• We may display a frequency distribution (or a relative frequency distribution) graphically in the form of a histogram, which is a special type of bar graphs. This histogram is a probability distribution that consists of adjacent columns to represent a continuous variable such as weight, height, age..etc.

• When we construct a histogram, the variable under consideration are represented by the horizontal (x) axis, while the the frequency (or relative frequency) of occurrence is the (y) axis.

Freq

uenc

y

Age interval, yrs (variable)

Histogram

Page 11: Biostatistics Lecture 2 Graphical Presentation of Data

Class Problem I• Everyone: Choose a color from the list below: Green, Blue, Red, Yellow,

Black and type it on your notebook• Let us select a random variable sample from this population (you all!!)• Let us count the frequency for each selected color

Color Frequency

Green

Blue

Red

Yellow

Black

Draw a representative histogram for the variable frequency of the listed colors (Use Excel to draw it, HW1-B)

Dr. Alkilany 2012

Page 12: Biostatistics Lecture 2 Graphical Presentation of Data

Class Problem IIA school nurse weighed 30 students in Year 10. Their weights (in kg) were recorded as follows:

50 52 53 54 55 65 60 70 48 6374 40 46 59 68 44 47 56 49 5863 66 68 61 57 58 62 52 56 58

1. Use the data above to construct a frequency table

Range = 74-40=34Let width of class interval =5#intervals=34/5=7There are 7 class intervals. This is reasonable for the given data. The frequency table is as follows:

2. Complete the table to calculate: cumulative frequency, relative frequencies, cumulative relative frequencies (HW1-C)

Page 13: Biostatistics Lecture 2 Graphical Presentation of Data

Biostatistics

Lecture 3Descriptive Statistics

Page 14: Biostatistics Lecture 2 Graphical Presentation of Data

Descriptive Statistics• With interval scale (continuous measurement) data, there are two

aspects to the figures that we should be trying to describe:

– How large are they? ‘indicator of central tendency’– How variable are they? ‘indicator of dispersion’

– FBG for two sets of patients as follows: Set A: 84, 85,89, 89, 93, 94.Set B:72, 82,89, 89, 96, 106. which is larger? Which is more

variable?• ‘indicator of central tendency’ describes any statistic that is used to

indicate an average value around which the data are clustered• Three possible indicators of central tendency are in common use –

the mean, median and mode.

Dr. Alkilany 2012

Central tendency

Dispersion

Page 15: Biostatistics Lecture 2 Graphical Presentation of Data

Mean• The usual approach to showing the central tendency of a set of data is to

quote the average or the ‘mean’. • Example: Potency data of different vaccine batches.

– Each batch is intended to be of equal potency, but some manufacturing variability is unavoidable.

– A series of 10 batches has been analyzed and the results are shown in the following table:

Sum = 991.5n=10Mean=99.15

Dr. Alkilany 2012

Page 16: Biostatistics Lecture 2 Graphical Presentation of Data

Types of Mean1. Arithmetic mean2. Geometric Mean3. Harmonic mean

Arithmetic mean

Page 17: Biostatistics Lecture 2 Graphical Presentation of Data

Arithmetic

Mean

• Arithmetic mean represents the balance point of the distribution.

Mean ( )

• The arithmetic mean has the following properties:– Uniqueness, for a given set of data there is one and only one mean.– Simplicity, easily to be understood and computed.– Not robust to extreme values, it is affected by each value in the data.e.g. 5,10,15: mean=10…………..5,10,150: mean=55

Tail to the leftTail to the right

Symmetrical

Page 18: Biostatistics Lecture 2 Graphical Presentation of Data

Geometric Mean

• Geometric Mean:– Is the anti-log of the average of the logarithms of the

observations.

– Example, for the values 50, 100, 200Geometric mean = Antilog[(log50+log100+log200)/3]=100

while the arithmetic mean is 116.67.

– Is meaningful for data with logarithmic relationships as in the case of the current procedure in bioequivalence studies where the ratios of log-transformed parameters are compared (log (AUC), log(Cmax)).

Dr. Alkilany 2012

GM anti log[ logX i /n]i1

n

Page 19: Biostatistics Lecture 2 Graphical Presentation of Data

Harmonic Mean• Harmonic Mean: Is the appropriate mean following reciprocal

transformation.

• Example, the half-lives of a certain drug in 3 subjects were 2, 4, 8 hrs. determine the harmonic mean half-life for this drug?

• While the arithmetic mean is 4.667 hrs.

)/1(/ ixnH

H 3

1

2

1

4

1

8

3.429

Page 20: Biostatistics Lecture 2 Graphical Presentation of Data

Median

• The point that divides the distribution into two equal parts, or the point between the upper and lower halves of the distribution.

• Accordingly, if we have a finite number of values, then the median is the value that divides those values into two parts such that the number of values equal to or greater than the median is equal to the number of values equal to or less than the median.

Median7 variables

7 variables

Dr. Alkilany 2012

Page 21: Biostatistics Lecture 2 Graphical Presentation of Data

Median (example)

• Fifteen patients were provided with their drugs in a child-proof container of a design that they had not previously experienced.

• The time it took each patient to open the container was measured. • The results are shown below.• The mean = 7.09 s, Is this the most representative/descriptive figure?

Some outliers shifted the mean and thus median can tell us better information in this case

Values are clustering here

Page 22: Biostatistics Lecture 2 Graphical Presentation of Data

Median

• Most patients have got the idea more or less straight away and have taken only 2–5 s to open the container.

• However, four seem to have got the wrong end of the stick and have ended up taking anything up to 25 s.

• These four have contributed a disproportionate amount of time (65.6 s) to the overall total. This has then increased the mean to 7.09 s.

• We would not consider a patient who took 7.09 s to be remotely typical. In fact they would be distinctly slow.

Page 23: Biostatistics Lecture 2 Graphical Presentation of Data

Median (other example)

• This problem of mean values being disproportionately affected by a minority of outliers arises quite frequently in biological and medical research.

• A useful approach in such a case is to use the median.eg: Blood Glucose Level (mg/dl): 80, 81, 82, 83, 84, 84, 86, 86, 180Mean: 93Median: 84The outlier 180 shifted the mean to higher value, which is not

descriptive for the data set in this case!!

MeanMedian

Values are clustering hereoutlier

Page 24: Biostatistics Lecture 2 Graphical Presentation of Data

Median (how to determine it in ordered array?)

• When n is an odd number, then the median is the value number (n+1)/2 in an ordered array Example, what is the median for the following data set: 10, 15, 12, 25, 20.

– Rank the data: 10, 12, 15, 20, 25.

– The median is the value number (n+1)/2(5+1)/2=3rd so the median is 15.

• When n is an even number, then the median is the mean of the two middle values (n/2)th and ((n/2) + 1)th in an ordered array .

Example, what is the median for the following data set: 10, 15, 20, 25, 30, 5

– Rank the data: 5, 10, 15, 20, 25, 30

The median is the average of (n/2)th and the (n/2 + 1)th values: 3rd and 4th

(15+20)/2= 17.5

Page 25: Biostatistics Lecture 2 Graphical Presentation of Data

Mean Vs. MedianProperties Mean Median

Uniqueness Yes Yes

Simplicity Yes Yes

Robustness to extreme values No Yes

The median is robust to extreme outliers.

The term ‘robust’ is used to indicate that a statistic or a procedure will continue to give a reasonable outcome even if some of the data are aberrant.

eg. 2, 4, 6, 8, 10 median=6 2, 4, 6, 8, 1000 median=6

If last variable increased to 1000 instead of 10, the median will stay the same (6), while the the mean would be hugely inflated!!!!

Page 26: Biostatistics Lecture 2 Graphical Presentation of Data

Mode• Mode: value which occurs most frequently.

• If all values are different there is no mode and a set of values may have more than one mode.

• Used for quick estimation and for identifying the most common observation.

• Properties:– Not unique– Simple– Not robust, less stable than the median and the mean.

Dr. Alkilany 2012

Page 27: Biostatistics Lecture 2 Graphical Presentation of Data

Mode• The condition of sixty

patients with arthritis is recorded using a global assessment variable.

• A positive score indicates an improvement and a negative one a deterioration in the patient’s condition after treatment.

• The mean (0.77) [Do you think the mean is the best descriptive parameter for these data?

Dr. Alkilany 2012

Page 28: Biostatistics Lecture 2 Graphical Presentation of Data

Mode• A histogram of the above data shows that there are two distinct sub-populations. • Slightly under half the patients have improved quality of life, but for the remainder, their

lives are actually made considerably worse.

Dr. Alkilany 2012

Page 29: Biostatistics Lecture 2 Graphical Presentation of Data

Mode

• Neither the mean nor the median indicator remotely describes the situation. • The mean is particularly unhelpful as it indicates a value that is very untypical –

very few patients show changes close to zero.• We need to describe the fact that in this case, there are two distinct groups. The data consisted of values clustered

around some central points.

Dr. Alkilany 2012

Page 30: Biostatistics Lecture 2 Graphical Presentation of Data

Mode

• Data distribution can be ‘unimodal’ or ‘polymodal’ in the case with several clustering.

• If we want to be more precise, we use terms such as bimodal or trimodal to describe the exact number of clusters.

Dr. Alkilany 2012

Page 31: Biostatistics Lecture 2 Graphical Presentation of Data

How mean, median, and mode are related?

Dr. Alkilany 2012

For symmetric distributions:the mean and median are equal For skewed distributions with a single modethe three measures differ

Page 32: Biostatistics Lecture 2 Graphical Presentation of Data

How mean, median, and mode are related?

Dr. Alkilany 2012

For skewed distributions with a single modethe three measures differ: mean>median>mode (positively skewed distributions) mean<median<mode (negatively skewed distributions)

Page 33: Biostatistics Lecture 2 Graphical Presentation of Data

Measures of central tendency for grouped data

Calculate the mean. median, and mode.

Page 34: Biostatistics Lecture 2 Graphical Presentation of Data

Biostatistics

Lecture 4Descriptive Statistics

“Indicators of dispersion”

Dr. Alkilany 2012

Page 35: Biostatistics Lecture 2 Graphical Presentation of Data

Indicators of dispersion

• If all observations are the same, there is no variability.If they are not all the same, then dispersion is present in the data.

• Variation is an inherent characteristic of experimental observations due to several reasons.

• it is always important to get an estimate of how much given objects tend to differ from that central tendency

• In any experiment, variation will depend on:• The instrument used for analysis.• The analyst performing the assay.• The particular sample chosen.• Unidentified error commonly known as noise.

Dr. Alkilany 2012Central tendency

Dispersion (variability)

Page 36: Biostatistics Lecture 2 Graphical Presentation of Data

Indicators of dispersion (why we need them?)

Dr. Alkilany 2012

A

B

C

1. A, B, C have the same mean

2.Based on similarity of the mean, can we say the data sets are the same?

3.What is the differences between these data sets?

4.How we can describe the (differences)?

Dr. Alkilany 2012

Page 37: Biostatistics Lecture 2 Graphical Presentation of Data

Indicators of dispersion

Dr. Alkilany 2012

Page 38: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation

Dr. Alkilany 2012

• Two tabletting machines producing erythromycin tablets with a nominal content of 250 mg. • 500 tablets are randomly selected from each machine and their erythromycin contents was assayed.

Mean~250 mg/tablet Mean~250 mg/tablet

Although tablets from both machines had equal mean, do you think the two machine still differ? How?

AlphaBravo

Page 39: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation

• The two machines are very similar in terms of average drug content for the tablets, both producing tablets with a mean very close to 250 mg. However, the two products clearly differ.

• With the Alpha machine, there is a considerable proportion of tablets with a content differing by more than 20 mg from the nominal dose (i.e. below 230 mg or above 270 mg), whereas with the Bravo machine, such outliers are much rarer.

• An ‘indicator of dispersion’ is required in order to convey this difference in variability and to decide which one has better performance!!

Dr. Alkilany 2012

Page 40: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation

Dr. Alkilany 2012

SD (X i X)2

i1

n

n 1

•This is the standard deviation (SD) for the sample•For population it is usually donated : σ •Same unit of the mean

Standard deviation is a widely used measure of variability and central dispersion

Let us go back to tabletting machines (raw data)!

Page 41: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation

Dr. Alkilany 2012

SD(X i X)2

i1

n

n 1

X=__

Xi

X i X

Page 42: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation• The Alpha machine produces rather variable tablets and so

several of the tablets deviate considerably from the overall mean.

• These relatively large figures then feed through the rest of the calculation, producing a high final SD (8.72 mg).

• In contrast, the Bravo machine is more consistent and individual tablets never have a drug content much above or below the overall average.

• The small figures in the column of individual deviations, leading to a lower SD (3.78 mg).

Dr. Alkilany 2012

Page 43: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation• Reporting the SD:

– The symbol is used in reporting the SD– The symbol reasonably interpreted as meaning ‘more or

less’.– is used to indicate variability. – With the tablets from our two machines, we would report

their drug contents as:• Alpha machine: 248.78.72 mg (MeanSD mg)• Bravo machine: 251.13.78 mg (MeanSD mg)

– The figures quoted before summarize the true situation. The two machines produce tablets with almost identical mean contents, but those from the Alpha machine are two to three times more variable.

Dr. Alkilany 2012

Page 44: Biostatistics Lecture 2 Graphical Presentation of Data

Standard deviation and Coefficient of variation

• Elephant tail: CV=10/150x100=6.7%• Mouse tail: CV= 3/7x100=42.8%• Coefficient of variation (CV) expresses variation

relative to the magnitude of data• Useful to compare variation in two or more sets

of data with different mean values• CV is has no unit (it is a ratio!)

Dr. Alkilany 2012

- Elephant tail=150±10 cm- Mouse tail=7±3 cmWith this in mind, which is more variable: the elephant tail length results or the one for the mouse?

CV SD /Mean *100

Page 45: Biostatistics Lecture 2 Graphical Presentation of Data

Variance

N

X

22 )(

Dr. Alkilany 2012

• The Variance: 2 (population) or S2 (sample) is a measure of spread that is related to the deviations of the data values from their mean.

• Unit: same as mean but squared. If mean in mg, variance will be in mg2

2

SD (X X)2n 1

Variance SD2

samplePopulation

Page 46: Biostatistics Lecture 2 Graphical Presentation of Data

Quartiles

• Quartiles: the three points that divide the data set into four equal groups, each representing a fourth of the population being sampled

• The median= Q2

Dr. Alkilany 2012

First Quartile Q1 cuts off lowest 25% of data 25th percentile

Second Quartile Q2 cuts data set in half 50th percentile

Third Quartile Q3 cuts off highest 25% of data, or lowest 75%

75th percentile

Q1

Q2

MedianQ3

Page 47: Biostatistics Lecture 2 Graphical Presentation of Data

Dr. Alkilany 2012

Quartiles

Interquartile range: difference between the upper and lower quartiles IQR= (Q3 – Q1)

Page 48: Biostatistics Lecture 2 Graphical Presentation of Data

Finding Quartiles• To find the quartiles for a set of data, do the following:1. Arrange the data from smallest to highest (ordered array)2. Locate the median (Q2)3. The half to the left: locate their median (Q1)4. The half to the right: Locate their median (Q3)

Dr. Alkilany 2012

Q1 Q2 Q3

MedianHalf to the left Half to the right

Page 49: Biostatistics Lecture 2 Graphical Presentation of Data

Finding Quartiles• Example with odd (n)Times needed for 15 tablets to disintegrate in minutes: 5, 10 10 10 10 12 15 20 20 25 30 30 40 40 601. Data is already in an order from smallest to highest2. Median is the (n+1/2)th=8th=20 (in bold red)3. For the half to the right: n=7, median=4th=10 minutes4. For the half to the right: n=7, median=4th=30 minutes5. Q1=10 minutes; Q2= 20 minutes; Q3=30 minutes. IQR=Q3-Q1=20 minutes6. This means that 25% of tablets need less than 10 minutes to disintegrate. Also 50% of tablets need 20 minutes

to disintegrate. Before 30 minutes, 75% of all tables were disintegrated. 25% only of these tablets need more than 30 minutes to disintegrate.

This question can come in this form

Dr. Alkilany 2012

Disintegration time (min) Frequency

5 1

10 4

12 1

15 1

20 1

25 2

30 2

40 2

60 1

Total 15

Page 50: Biostatistics Lecture 2 Graphical Presentation of Data

Finding Quartiles• Example with even (n)Times needed for 20 capsules to disintegrate in minutes: 5, 10, 10, 15, 15, 15, 15, 20, 20, 20, 25, 30, 30 40, 40, 45, 60, 60, 65, 851. Data is already in an order from smallest to highest2. Median is the mean of the two middle values (n/2)th and ((n/2) + 1)th (in bold red)=10th and

11th=(20+25)/2= 22.53. For the half to the right: n=10, median=mean of 5th & 6th=15 minutes4. For the half to the right: n=10, median=mean of 5th & 6th=42.55. Q1=15minutes; Q2= 22.5 minutes; Q3=42.5 minutes. IRQ=??

Dr. Alkilany 2012

Disintegration time (min)

Frequency

5 1

10 2

15 4

20 3

25 1

30 2

40 2

45 1

60 2

65 1

85 1

Total 20

Page 51: Biostatistics Lecture 2 Graphical Presentation of Data

Quartiles• Consider the elimination half-lives of two synthetic steroids have been

determined using two groups, each containing 15 volunteers.• The results are shown in the following table with the values ranked from

lowest to highest for each steroid.

Dr. Alkilany 2012

Page 52: Biostatistics Lecture 2 Graphical Presentation of Data

Quartiles and IQR as a measurement for data spread The IQR for the half life of steroid 2 is only half that for steroid 1, duly reflecting its less variable nature.

Just as the median is a robust indicator of central tendency, the interquartile range is a robust indicator of dispersion. The interquartile range is a more useful measure of spread than range as it describes the middle 50% of the data values and thus less affected by outliers.

Page 53: Biostatistics Lecture 2 Graphical Presentation of Data

Box and whisker plot• A box-and-whisker plot can be useful for handling many

data values.

• It shows only certain statistics rather than all the data.

• Five-number summary is another name for the visual representations of the box-and-whisker plot.

• The five-number summary consists of the median, the quartiles, and the smallest and greatest values in the distribution (not including outliers).

• Immediate visuals of a box-and-whisker plot are the center, the spread, and the overall range of distribution.

Page 54: Biostatistics Lecture 2 Graphical Presentation of Data

Box and whisker plot• The first step in constructing a box-and-whisker plot is to first find the

median (Q2), the lower quartile (Q1) and the upper quartile (Q3) of a given set of data.

• Example: The following set of numbers are weights of 10 patients in hospital (kgs):

75 1 62 Smallest (S)

62 2 67

78 3 73 Q1

96 4 75

73 5 78Median=78.5 (Q2)

93 6 79

85 7 81

81 8 85 Q3

67 9 93

79 10 96 Largest (L)

L

Q3

Q2Q1

S

Page 55: Biostatistics Lecture 2 Graphical Presentation of Data

Outliers

• Outliers (extreme values) are values that are much bigger or smaller (distant) than the rest of the data.

• In order to be an outlier, the data value must be:

• larger than Q3 by at least 1.5 times the interquartile range (IQR), or

• smaller than Q1 by at least 1.5 times the IQR.

• Represented by a dot on the box and whisker plot