Chapter 2
SUMMARIZING DATA
2.1 Background
Recall:
• Data elements/points
• Population and Sample
• Descriptive Statistics are statistical procedures used to summarize, organize, and simplify data. In statistics or Biometry, the basis for any analysis is a clear understanding of the data. Describing, summarizing, and presenting data are central to all statistical applications.
• Inferential Statistics
Background…
The first step in data analysis is to describe the data in some concise manner.
Descriptive statistics that involve numeric or graphic display are crucial in capturing and conveying the final results of studies in publications.
2.1.2 Classification of variables
• Quantitative variable (continuous or discrete): Records the amount of something (How much or how many?) – e.g. Age or weight
– Continuous quantitative variable: Measured on a continuous scale (always a value between any 2 given values) – e.g. Distance in m; Concentration of chemical etc.
– Discrete quantitative variable: Can list all the possible values (counting something) – e.g. Number of children in household, Number of subjects per semester etc.
Classification of variables…..
• Ordinal variable: Natural order to the different categories. Examples:
– Cloudiness – Mostly cloudy, Partly cloudy & Sunny
– Self‐reported health status – Excellent, very good, good, fair, poor
• Categorical or nominal variable: Records which category a person or item is in (e.g. Blood type, Response to therapy). No natural order to the different categories (e.g. Blood type – A, B, AB, O; Gender: Male or Female).
Note: It is important to determine the nature of the variable under investigation, as the selection of the most appropriate technique to summarize it depends on whether the variable is continuous or discrete.
Notation
• Population vs. Sample
• Population size vs. Sample size
• Population parameter vs. Sample statistic
Notation for variables and observations
• Variables: represent by uppercase – X, Y, Z etc.
• Observations on a variable: represent by lower case with subscript – $x_1, x_2, \ldots, x_n$.
Statistical analysis…..
• Data elements or data points: Each data element (also called observation) is a representation of a particular measurable characteristic or variable.
Examples
– Individuals' ages measured in years.
– Patients' systolic blood pressures measured in mmHg.
2.2 Descriptive Statistics and Graphical Methods
• Assume that the subjects in the sample were selected at random from the population of interest, i.e. the sample is representative of the population.
Numerical summaries for continuous variables
Sample mean: is the sum of all observations divided by the number of observations. That is,
\[ \bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i \]
• It gives a sense of what a typical value looks like.
Sigma ∑ is a summation sign.
$\sum_{i=1}^{n} x_i$ implies $(x_1 + x_2 + \cdots + x_n)$.
If a and b are integers where a ≤ b, then $\sum_{i=a}^{b} x_i$ means $(x_a + x_{a+1} + \cdots + x_b)$.
If a = b, then $\sum_{i=a}^{b} x_i = x_a$.
If c is some constant, then $\sum_{i=1}^{n} c\,x_i = c \sum_{i=1}^{n} x_i$.
Example 2.1 Systolic blood pressures
• The sample data are: 121 110 114 100 160 130 130
• Sample size (n) = 7
• Let X denote the systolic blood pressure.
\[ x_1 = 121,\; x_2 = 110,\; x_3 = 114,\; x_4 = 100,\; x_5 = 160,\; x_6 = 130,\; x_7 = 130 \]
\[ \bar{x} = \frac{121 + 110 + 114 + 100 + 160 + 130 + 130}{7} = 123.6 \]
• Note that some of the observed systolic blood pressures are above and some below the mean.
• The mean of 123.6 is interpreted as the average systolic blood pressure in the sample.
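The sample mean of Example 2.1 can be checked with a few lines of Python. This is only a sketch; the variable names (`sbp`, `total`, `mean`) are illustrative and not part of the original notes.

```python
# Systolic blood pressures from Example 2.1 (mmHg)
sbp = [121, 110, 114, 100, 160, 130, 130]

n = len(sbp)        # sample size
total = sum(sbp)    # x_1 + x_2 + ... + x_n
mean = total / n    # sample mean

print(n, total, round(mean, 1))  # 7 865 123.6
```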
Example 2.1….
\[ \sum_{i=1}^{n}(x_i - \bar{x}) = (121 - 123.6) + \cdots + (130 - 123.6) = 0 \]
• The deviations from the mean sum to zero, since negative deviations cancel out the positive deviations.
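A short derivation shows why the deviations always sum to zero, using only the definition of the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

\[ \sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0 \]

The result is exact only with the unrounded mean. With the rounded value 123.6 the arithmetic gives 865 - 7(123.6) = -0.2, which is rounding error.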
Variance and Standard Deviation
If the center of the sample is defined as the sample mean, then a measure that can summarize the differences (or deviations) between the individual sample points and the sample mean is needed; that is,
\[ x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}. \]
One of the measures that would seem to accomplish this goal is:
\[ \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}). \]
However, the sum of the deviations of the individual observations of a sample about the sample mean is always zero.
Variance and Standard deviation…..
• Mean absolute deviation (MAD), expressed as
\[ \mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \bar{x} \rvert, \]
may be used.
• Alternatively, the sample variance or variance, which is essentially the average of the squares of the deviations from the sample mean (with divisor n − 1), may be used:
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]
• Another commonly used measure of spread is the sample standard deviation:
\[ s = \sqrt{s^2} \]
Example 2.1….
• Sample variance
\[ s^2 = \frac{(121 - 123.6)^2 + (110 - 123.6)^2 + \cdots + (130 - 123.6)^2}{7 - 1} = 374.6 \]
• Sample standard deviation
\[ s = \sqrt{374.6} = 19.4 \]
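As a check on these numbers, the sketch below computes the mean absolute deviation, the sample variance and the sample standard deviation for the Example 2.1 data. It assumes the n − 1 divisor used in the example; the variable names are illustrative.

```python
import math
import statistics

sbp = [121, 110, 114, 100, 160, 130, 130]
n = len(sbp)
mean = sum(sbp) / n

# Mean absolute deviation: average distance from the mean
mad = sum(abs(x - mean) for x in sbp) / n

# Sum of squared deviations from the sample mean
ss = sum((x - mean) ** 2 for x in sbp)
variance = ss / (n - 1)     # sample variance s^2 (divisor n - 1)
sd = math.sqrt(variance)    # sample standard deviation s

# The standard library uses the same n - 1 definition
assert math.isclose(variance, statistics.variance(sbp))
assert math.isclose(sd, statistics.stdev(sbp))

print(round(variance, 1), round(sd, 1))  # 374.6 19.4
```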
Alternative formula for the Sample variance…..
• Sample variance ($s^2$) can equivalently be computed as
\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1} \]
Example 2.1
\[ s^2 = \frac{109{,}137 - (865)^2/7}{7 - 1} = \frac{109{,}137 - 106{,}889.3}{6} = 374.6 \]
• Sample standard deviation (s) is defined as
\[ s = \sqrt{s^2} \]
• Standard deviation is preferable to variance because it is in the same units as the original data.
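The agreement between the alternative formula and the definition can be verified numerically. The following is a minimal sketch on the Example 2.1 data; the names `s2_shortcut` and `s2_definition` are illustrative.

```python
import math

sbp = [121, 110, 114, 100, 160, 130, 130]
n = len(sbp)

sum_x = sum(sbp)                    # sum of the observations
sum_x2 = sum(x * x for x in sbp)    # sum of the squared observations

# Alternative formula: the mean is not needed explicitly
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition: squared deviations from the sample mean
mean = sum_x / n
s2_definition = sum((x - mean) ** 2 for x in sbp) / (n - 1)

assert math.isclose(s2_shortcut, s2_definition)
print(sum_x, sum_x2, round(s2_shortcut, 1))  # 865 109137 374.6
```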
Measures of Location
Data summarization is important before any inferences can be made about the population from which the sample points have been obtained.
Measure of location is a type of measure useful for data summarization that defines the center or middle of the sample: sample mean, median and mode.
• A standard data summary for a continuous variable in a sample consists of three statistics:
– Sample size
– Sample mean
– Sample standard deviation
• These three statistics provide the information on the number of subjects in the sample, the location, and the dispersion of the sample, respectively.
Coefficient of Variation (CV)
• The coefficient of variation (CV) is defined as the ratio of the sample standard deviation to the sample mean, expressed as a percentage, that is
\[ CV = 100\% \times \frac{s}{\bar{x}} \]
• Remains the same regardless of units used.
• Useful in comparing variability of different samples with different arithmetic means and when samples have different units of measurements.
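The unit-free property can be illustrated on the Example 2.1 data. In this sketch the helper `coefficient_of_variation` and the conversion to kilopascals (1 mmHg is roughly 0.1333 kPa) are assumptions added for illustration, not part of the original notes.

```python
import statistics

def coefficient_of_variation(values):
    """Return CV = 100 * s / mean, as a percentage."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

sbp_mmhg = [121, 110, 114, 100, 160, 130, 130]
# Same measurements rescaled to kilopascals (approximate factor)
sbp_kpa = [x * 0.1333 for x in sbp_mmhg]

cv_mmhg = coefficient_of_variation(sbp_mmhg)
cv_kpa = coefficient_of_variation(sbp_kpa)

print(round(cv_mmhg, 1), round(cv_kpa, 1))  # 15.7 15.7
```

Note that this invariance holds for a change of scale (multiplying every value by a constant). A change of units that also shifts the origin, such as Celsius to Fahrenheit, changes the mean but not the spread in the same proportion, so the CV changes.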
Median
• Sample median is defined as the middle value.
• 50% of the values in the sample are greater than or equal to the median and the remaining 50% are less than or equal to the median.
To calculate the median:
• Arrange the observations from smallest to largest. Then the median is
– The $\left(\frac{n+1}{2}\right)$th largest observation if n is odd
– Average of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2}+1\right)$th largest observations if n is even
Example 2.1
Step 1: Arrange the data in ascending order:
100 110 114 121 130 130 160
n = 7 is odd, therefore the median is the ((n+1)/2 = (7+1)/2 = 4)th largest observation, or the observation in the fourth position in the ordered data set.
Median = 121.
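The odd/even rule for the median can be sketched as a small function. The function name `sample_median` is illustrative, and the six-value sample used for the even case is simply the first six Example 2.1 values, chosen here for demonstration.

```python
def sample_median(values):
    """Median by the ordered-position rule (odd n: middle; even n: average)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # ((n + 1) / 2)th ordered value, i.e. 0-based index n // 2
        return ordered[mid]
    # Average of the (n / 2)th and (n / 2 + 1)th ordered values
    return (ordered[mid - 1] + ordered[mid]) / 2

sbp = [121, 110, 114, 100, 160, 130, 130]
print(sample_median(sbp))      # 121   (n = 7, odd)
print(sample_median(sbp[:6]))  # 117.5 (n = 6, even: average of 114 and 121)
```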
Mode
Mode: the most frequently occurring value among all the observations in a sample.
Data distributions may have one or more modes.
• One mode = unimodal
• Two modes = bimodal
• Three modes = trimodal and so on.
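Python's standard library can list every mode of a sample. The sketch below uses `statistics.multimode` (available from Python 3.8); the second data set is a made-up illustration of a bimodal sample.

```python
import statistics

sbp = [121, 110, 114, 100, 160, 130, 130]
modes_sbp = statistics.multimode(sbp)       # values with the highest count
print(modes_sbp)                            # [130] -> unimodal

toy = [1, 1, 2, 3, 3]                       # hypothetical bimodal sample
modes_toy = statistics.multimode(toy)
print(modes_toy)                            # [1, 3] -> bimodal
```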
Range
• Range is the difference between the largest and smallest observations in a sample. Once the sample is ordered, it is very easy to compute the range.
• Range is very sensitive to extreme observations or outliers.
• The larger the sample size (n), the larger the range tends to be, and the more difficult the comparison between ranges from data sets of varying sizes.
Quartiles
• The first quartile of the sample is the sample value that holds approximately 25% of the data elements at or below it and approximately 75% above or equal to it. It is denoted by Q1.
• The third quartile of the sample holds approximately 25% of the data elements at or above it and approximately 75% below or equal to it. It is denoted by Q3.
• The median is also referred to as the second quartile, Q2.
• When n is odd, the positions of the quartiles are determined by
\[ \left[\frac{n+3}{4}\right] \]
where [k] is the greatest integer less than k.
Quartiles…..
• When n is even, the positions of the quartiles are determined by
\[ \left[\frac{n+2}{4}\right]. \]
• The interquartile range is the difference between the first and third quartiles: $Q_3 - Q_1$.
• If there are outliers, the median and the interquartile deviation, which is defined as
\[ \frac{Q_3 - Q_1}{2}, \]
are the most appropriate measures of location and dispersion, respectively.
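Quartiles, the interquartile range and the interquartile deviation can be computed with the standard library. This sketch uses `statistics.quantiles` with `method="inclusive"`, which is one common software convention and an assumption here; it is not necessarily the position rule given above, and different conventions can give slightly different quartiles on small samples.

```python
import statistics

sbp = [121, 110, 114, 100, 160, 130, 130]

# Three cut points dividing the ordered data into four parts.
# method="inclusive" is an assumed convention, not the notes' own rule.
q1, q2, q3 = statistics.quantiles(sbp, n=4, method="inclusive")

iqr = q3 - q1             # interquartile range
iq_deviation = iqr / 2    # interquartile deviation

print(q1, q2, q3)         # 112.0 121.0 130.0
print(iqr, iq_deviation)  # 18.0 9.0
```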
Interpretation of Standard Deviation (Empirical Rule)
• The mean and standard deviation are used to understand where the data are located and how they are spread.
• If a distribution of values is normal, then approximately
– 68% of the observations fall between $\bar{x} - s$ and $\bar{x} + s$.
– 95% of the observations fall between $\bar{x} - 2s$ and $\bar{x} + 2s$.
– Almost all (about 99.7%) of the observations fall between $\bar{x} - 3s$ and $\bar{x} + 3s$.
• If a distribution is not normal, then it is difficult to infer properties of the distribution from the mean and SD.
• The median and interquartile deviation should be used instead.
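The percentages in the empirical rule follow from the normal distribution itself. As a quick check, the sketch below uses `statistics.NormalDist` (Python 3.8+) to compute the probability of falling within 1, 2 and 3 standard deviations of the mean; the name `coverage` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of lying within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} SD: {p:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```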
Outliers
• Outliers in a sample are
– Observations outside the range $\bar{x} \pm 3s$.
– Observations above $Q_3 + 1.5\,(IQR)$ or below $Q_1 - 1.5\,(IQR)$, where $IQR = Q_3 - Q_1$ = interquartile range.
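Both outlier rules can be applied to the Example 2.1 data. In the sketch below the function name `find_outliers` is illustrative, and the quartiles come from `statistics.quantiles(..., method="inclusive")`, which is an assumed convention; another quartile rule can move the fences and change which points are flagged.

```python
import statistics

def find_outliers(values):
    """Return (outliers by the 3-SD rule, outliers by the 1.5*IQR rule)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    # Assumed quartile convention; see the note above
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    by_sd = [x for x in values if abs(x - mean) > 3 * s]
    by_iqr = [x for x in values if x < lower or x > upper]
    return by_sd, by_iqr

sbp = [121, 110, 114, 100, 160, 130, 130]
by_sd, by_iqr = find_outliers(sbp)
print(by_sd)   # []    (no value is more than 3 SD from the mean)
print(by_iqr)  # [160] (upper fence is 130 + 1.5 * 18 = 157)
```

The two rules need not agree: here the value 160 is inside the 3-SD range but above the upper IQR fence.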
Summary
• If there are outliers in a data set, the most appropriate measure of location is the median and the most appropriate measure of dispersion is the interquartile deviation.
• If there are no outliers in the data, then the sample mean and standard deviation are the most appropriate measures of location and dispersion, respectively.
Frequency Distribution (pages 45‐51)
• A frequency distribution is the organization of raw data in table form, using classes and frequencies.
– A frequency distribution table is a useful summary for discrete data.
– When the data are in original form, they are called raw data.
– The frequency of a class is the number of data values contained in a specific class.
1) The categorical frequency distribution is used for data that can be placed in specific categories, such as nominal or ordinal level data, e.g. Religious affiliation, gender, blood type etc.
2) Grouped frequency distribution: Numerical data are grouped into classes formed by two numbers called class limits.
• Lower class limit: represents the smallest data value that can be included in the class.
• Upper class limit: represents the largest data value that can be included in the class.
• Class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
• To construct a grouped frequency distribution for a continuous variable:
Step 1: Determine the classes
– Find the range: Highest value – Lowest value
– Select the number of classes desired for the frequency distribution
– Find the width by dividing the range by the number of classes and rounding up.
– Select the starting point (usually the lowest value or any convenient number less than the lowest value); add the width to get the lower limits.
– Find the upper class limits.
Step 2: Tally the data.
Step 3: Find the numerical frequencies from the tallies.
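The three steps can be sketched in Python. The example uses the 24 cholesterol values that appear with the stem-and-leaf plot later in these notes; the choice of 6 classes, the function name `grouped_frequency`, and the rule of always rounding the width up to the next whole number (even when the division is exact) are assumptions made for illustration, and the code assumes whole-number data.

```python
def grouped_frequency(values, num_classes):
    """Build integer class limits and count the values in each class."""
    lowest, highest = min(values), max(values)
    # Step 1: range / number of classes, rounded up to the next whole number
    width = (highest - lowest) // num_classes + 1
    table = []
    for k in range(num_classes):
        lower = lowest + k * width      # lower class limit
        upper = lower + width - 1       # upper class limit
        # Steps 2 and 3: tally the values and record the frequency
        count = sum(1 for v in values if lower <= v <= upper)
        table.append((lower, upper, count))
    return width, table

cholesterol = [195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197,
               169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137]

width, table = grouped_frequency(cholesterol, num_classes=6)
print("class width:", width)  # class width: 19
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

With these choices the classes are 137-155, 156-174, 175-193, 194-212, 213-231 and 232-250, and the frequencies add up to the sample size of 24.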
Grouped Data ….
When sample size is too large to display all the raw data, data are frequently collected in grouped form.
The simplest way to display the data is to generate a frequency distribution using a statistical package.
A frequency distribution is an ordered display of each value in a data set together with its frequency, that is, the number of times that value occurs in the data set.
If the number of unique sample values is large, then a frequency distribution may still be too detailed.
If the data set is too large, then the data are categorized into broader groups.
2.2.2 Graphic Summaries for Continuous Variables
• Provide a simple, complete and accurate representation of the data.
• Box‐and‐whisker plot: This plot incorporates the minimum and maximum, the median, and the quartiles. Useful to compare samples.
• Stem‐and‐Leaf plots: easy to compute the median and other quartiles. Each data point is converted into stem and leaf, e.g., 438 (stem: 43; leaf: 8).
Graphical Summaries for Discrete Variables
• Bar chart: used to display a categorical variable;
• Histogram: used to display an ordinal variable;
• Bar charts and histograms are based on either the frequency or the relative frequency of responses in each category.
See Examples 2.10, 2.11 and 2.12 (Pages 51‐55).
Features of good numeric or graphic form of data summarization:
• Self‐contained
• Understandable without reading the text
• Clearly labeled attributes with well‐defined terms
Bar Graphs
To construct:
• Divide the data into groups or categories.
• Construct a rectangle for each group with a base of a constant width and a height proportional to the frequency within that group.
• Note that the rectangles are not contiguous and are equally spaced from each other.
Example: A sample of 182 poinsettia plants was categorized by color in the following table.

Color    Frequency (Number of plants = f)    Relative Frequency (f/n)*100    Cumulative frequency
Red      108                                 59.34%                          108
Pink     34                                  18.68%                          142
White    40                                  21.98%                          182
Total    182                                 100%
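The relative and cumulative frequency columns follow directly from the counts. A minimal sketch, with illustrative variable names:

```python
from itertools import accumulate

colors = ["Red", "Pink", "White"]
freq = [108, 34, 40]                  # number of plants per color

n = sum(freq)                         # total sample size
relative = [round(100 * f / n, 2) for f in freq]   # (f / n) * 100
cumulative = list(accumulate(freq))   # running total of the frequencies

for color, f, rel, cum in zip(colors, freq, relative, cumulative):
    print(f"{color:<6}{f:>5}{rel:>8.2f}%{cum:>6}")
```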
Figure 2.1. Bar chart of color of 182 poinsettias (horizontal axis: Color, with bars for Red, Pink and White; vertical axis: Number of plants)
Histogram
To construct the histogram:
• List each response option along the horizontal axis (the x axis).
• Scale the vertical axis to accommodate the relative frequencies.
• Draw rectangles above each response option to reflect the proportion of subjects in each.
Note that there are no breaks between rectangles, suggesting the ordering of the response options.
See Examples 2.10, 2.11 and 2.12
Box‐and‐whisker plot:
• Top of box is upper quartile (Q3)
• Bottom of box is lower quartile (Q1)
• Midline of box is median
• Vertical line extends from top of box to largest (maximum) non‐outlying value
• Vertical line extends from bottom of box to smallest (minimum) non‐outlying value
• Outlying value is defined as any value either > Q3 + 1.5 (Q3 – Q1) or < Q1 – 1.5 (Q3 – Q1)
Note: An outlier is a data point which differs so much from the rest of the data that it doesn't seem to belong with the other data.
• Outliers are designated by *'s
• An extreme outlying value is defined as any value either > Q3 + 3 (Q3 – Q1) or < Q1 – 3 (Q3 – Q1).
Stem and Leaf Plot
To construct use the following steps:
1. Separate each data value into a leaf, the least significant digit(s), and a stem, the most significant digit(s). The stem of 483 is 48 and the leaf is 3.
2. Write the smallest stem in the data set in the upper left‐hand corner of the plot.
3. Write the second stem, which equals the first stem + 1, below the first stem.
4. Continue with Step 3 until you reach the largest stem in the data set.
5. Draw a vertical bar to the right of the column of stems.
6. For each number in the data set, find the appropriate stem and write the leaf to the right of the vertical bar.
Stem and leaf plot of cholesterol
195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197, 169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137
13 | 7
14 | 5
15 | 189
16 | 167889
17 | 8
18 | 0
19 | 2577
20 | 5
21 |
22 | 24
23 | 68
24 | 4
25 | 0

Median = (178+180)/2 = 179
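The construction steps can be sketched in Python for the cholesterol data. The function name `stem_and_leaf` is illustrative, and the sketch assumes whole-number values with the last digit as the leaf.

```python
import statistics
from collections import defaultdict

def stem_and_leaf(values):
    """Return plot lines; stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so empty stems stay visible
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}".rstrip())
    return lines

cholesterol = [195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197,
               169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137]

lines = stem_and_leaf(cholesterol)
print("\n".join(lines))
median = statistics.median(cholesterol)
print("Median =", median)  # Median = 179.0
```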
Obtaining descriptive statistics using a computer
Numerous statistical packages may be used.
Excel may be used to compute average (for the arithmetic mean), median (for the median), Stdev (for the standard deviation), Var (for the variance).