© 2007 Thomson South-Western. All Rights Reserved
Chapter 3
Descriptive Statistics: Numerical Measures
Measures of Location
Measures of Variability
Measures of Distribution Shape, Relative Location, and Detecting Outliers
Measures of Association Between Two Variables
Weighted Mean
Measures of Location
If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.
Mean
Median
Mode
Percentiles
Quartiles
Mean
The mean of a data set is the average of all the data values.
The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.
Sample Mean $\bar{x}$
$$\bar{x} = \frac{\sum x_i}{n}$$
where $\sum x_i$ is the sum of the values of the $n$ observations and $n$ is the number of observations in the sample.
Population Mean $\mu$
$$\mu = \frac{\sum x_i}{N}$$
where $\sum x_i$ is the sum of the values of the $N$ observations and $N$ is the number of observations in the population.
Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.
Median
For an odd number of observations, the median is the middle value.
Data (7 observations): 26 28 27 22 24 27 12
In ascending order: 12 22 24 26 27 27 28
Median = 26
Median
For an even number of observations, the median is the average of the middle two values.
Data (8 observations) in ascending order: 12 22 24 26 27 27 28 30
Median = (26 + 27)/2 = 26.5
Mean vs. Median
The mean is affected by outliers (extreme observations).
The median is resistant to outliers: an extreme observation has little or no effect on it.
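The contrast can be checked numerically. The following is a minimal sketch that uses the seven observations from the median example and appends one extreme value; the value 300 is hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# The seven observations from the median example.
data = [26, 28, 27, 22, 24, 27, 12]

# Hypothetical extreme observation, added only to illustrate the effect.
data_with_outlier = data + [300]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(data_with_outlier), median(data_with_outlier)

# The mean moves a long way; the median barely moves.
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

In this sketch the mean more than doubles, while the median shifts by only half a unit.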
Mode
The mode of a data set is the value that occurs with greatest frequency.
The greatest frequency can occur at two or more different values.
If the data have exactly two modes, the data are bimodal.
If the data have more than two modes, the data are multimodal.
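As a small illustration of these definitions, the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later). The helper name `describe_modes` and the second data set are assumptions made for the example.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label based on how many there are."""
    modes = multimode(values)  # every value that ties for the greatest frequency
    if len(modes) == 1:
        label = "one mode"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        # Note: when every value occurs exactly once, all values tie,
        # so this branch is also reached for data with no repeated values.
        label = "multimodal"
    return modes, label

# Seven observations from the median example: 27 occurs twice.
single = describe_modes([26, 28, 27, 22, 24, 27, 12])

# Hypothetical data in which two values tie for the greatest frequency.
double = describe_modes([1, 1, 2, 2, 3])

print(single)
print(double)
```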
Percentiles
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
Admission test scores for colleges and universities are frequently reported in terms of percentiles.
Percentiles
The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
Percentiles
Arrange the data in ascending order.
Compute index i, the position of the pth percentile.
i = (p/100)n
If i is not an integer, round up. The pth percentile is the value in the ith position.
If i is an integer, the pth percentile is the average of the values in positions i and i+1.
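The procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `textbook_percentile` is an assumption, and p is restricted to values strictly between 0 and 100 so that position i+1 always exists.

```python
import math

def textbook_percentile(values, p):
    """pth percentile using the round-up / average-two-positions rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)      # step 1: ascending order
    n = len(ordered)
    i = p * n / 100               # step 2: index i = (p/100)n
    if i != int(i):
        # i is not an integer: round up and take the value in that position.
        return ordered[math.ceil(i) - 1]   # positions are 1-based
    # i is an integer: average the values in positions i and i+1.
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

# Observations from the two median examples.
seven = [26, 28, 27, 22, 24, 27, 12]
eight = [12, 22, 24, 26, 27, 27, 28, 30]

print(textbook_percentile(seven, 50))  # i = 3.5, rounds up to position 4
print(textbook_percentile(eight, 50))  # i = 4, averages positions 4 and 5
```

With p = 50 the function reproduces the two medians from the earlier examples (26 and 26.5).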
Note on Excel’s Percentile Function
The formula that Excel uses is different from the one used in the textbook!
To find the position of the pth percentile, Excel uses the following formula:
Lp = (p/100)n + (1 – p/100)
Once the position is identified:
1. If Lp is a whole number (e.g. 12), Excel’s result will be the same as the textbook’s.
2. If Lp is not a whole number (e.g. 12.3), Excel interpolates between the neighboring values, so its result may differ from the textbook’s.
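To see how the position formula behaves, here is a hedged sketch. It assumes that a fractional Lp is resolved by linear interpolation between the two neighboring ordered values, which is my understanding of Excel's PERCENTILE behavior; the function name `excel_style_percentile` is hypothetical.

```python
import math

def excel_style_percentile(values, p):
    """Percentile at position Lp = (p/100)n + (1 - p/100), 1-based.

    Assumption: a fractional position is resolved by linear interpolation
    between the neighboring ordered values.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    lp = (p / 100) * n + (1 - p / 100)
    lower = math.floor(lp)        # position just below (or equal to) Lp
    fraction = lp - lower
    if fraction == 0 or lower >= n:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

seven = [26, 28, 27, 22, 24, 27, 12]
eight = [12, 22, 24, 26, 27, 27, 28, 30]

print(excel_style_percentile(seven, 50))  # Lp = 4, a whole number
print(excel_style_percentile(eight, 25))  # Lp = 2.75, interpolated
```

For the eight observations, the 25th percentile at Lp = 2.75 is interpolated to 23.5, whereas the textbook rule gives 23 (the average of positions 2 and 3).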
Quartiles
Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile
Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference between the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data values.
Interquartile Range
The interquartile range of a data set is the difference between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
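A short sketch of both measures follows. It computes quartiles with the round-up / average rule for percentiles described earlier; the helper name `percentile` is an assumption, and the data are the eight observations from the even-sized median example.

```python
import math

def percentile(values, p):
    """pth percentile (0 < p < 100) using the round-up / average rule."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i != int(i):
        return ordered[math.ceil(i) - 1]
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

data = [12, 22, 24, 26, 27, 27, 28, 30]

data_range = max(data) - min(data)   # largest value minus smallest value
q1 = percentile(data, 25)            # first quartile
q3 = percentile(data, 75)            # third quartile
iqr = q3 - q1                        # range of the middle 50% of the data

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Here the smallest value (12) sits far from the rest, so it widens the range considerably but does not enter the interquartile range at all.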
Variance
The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
Variance
The variance is the average of the squared differences between each data value and the mean.
The variance is computed as follows:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$ for a sample
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$ for a population
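The two variance formulas differ only in the divisor (n - 1 for a sample, N for a population). The sketch below writes both out directly and cross-checks them against the Python standard library; the function names and the data set are hypothetical, chosen so the arithmetic is easy to follow.

```python
from statistics import pvariance, variance

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n

# Hypothetical data: mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)            # 32 / 7
sigma2 = population_variance(data)    # 32 / 8

# The standard library functions implement the same two definitions.
print(s2, variance(data))
print(sigma2, pvariance(data))
```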
Standard Deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily interpreted than the variance.
Standard Deviation
The standard deviation is computed as follows:
$$s = \sqrt{s^2}$$ for a sample
$$\sigma = \sqrt{\sigma^2}$$ for a population
Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
$$\left(\frac{s}{\bar{x}} \times 100\right)\%$$ for a sample
$$\left(\frac{\sigma}{\mu} \times 100\right)\%$$ for a population
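As a rough illustration of the standard deviation and the coefficient of variation, the sketch below uses `stdev` and `pstdev` from the Python standard library. The data set is hypothetical, and the coefficient of variation is only meaningful when the mean is not zero.

```python
from statistics import mean, pstdev, stdev

# Hypothetical data: mean 5, population variance 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(data)

s = stdev(data)        # sample standard deviation (divisor n - 1)
sigma = pstdev(data)   # population standard deviation (divisor N)

# Standard deviation expressed as a percentage of the mean.
cv_sample = s / center * 100
cv_population = sigma / center * 100

print(f"s = {s:.3f}, CV = {cv_sample:.1f}%")
print(f"sigma = {sigma:.3f}, CV = {cv_population:.1f}%")
```

Treating the data as a population gives a standard deviation of 2 against a mean of 5, so the coefficient of variation is 40%.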
Measures of Distribution Shape, Relative Location, and Detecting Outliers
Distribution Shape
z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness.
The formula for computing skewness for a data set is somewhat complex.
• Skewness can be easily computed using statistical software.
Excel’s SKEW function can be used to compute the skewness of a data set.
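For readers without Excel, the following is a hedged sketch of one common sample skewness formula, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations. To my understanding this is the formula that Excel's SKEW documents, but treat that correspondence as an assumption; the function name and data sets are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - xbar)/s)**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    xbar = mean(values)
    s = stdev(values)   # sample standard deviation
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in values)

# Hypothetical data sets chosen to show the sign of the result.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 10]              # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image, tail to the left

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
print(sample_skewness(left_skewed))
```

The sign of the result indicates the direction of the longer tail: about zero for symmetric data, positive for a right tail, negative for a left tail.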
Distribution Shape: Skewness
Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution; Skewness = 0]
Distribution Shape: Skewness
Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: relative frequency histogram of a distribution moderately skewed left; Skewness = -.31]
Distribution Shape: Skewness
Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a distribution moderately skewed right; Skewness = .31]
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data value $x_i$ is from the mean.
$$z_i = \frac{x_i - \bar{x}}{s}$$
z-Scores
A data value less than the sample mean will have a z-score less than zero.
A data value greater than the sample mean will have a z-score greater than zero.
A data value equal to the sample mean will have a z-score of zero.
An observation’s z-score is a measure of the relative location of the observation in a data set.
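A minimal sketch of the z-score calculation, applied to the seven observations from the median example and using the sample mean and sample standard deviation:

```python
from statistics import mean, stdev

# The seven observations from the median example.
data = [26, 28, 27, 22, 24, 27, 12]

xbar = mean(data)   # sample mean
s = stdev(data)     # sample standard deviation

# Number of standard deviations each value lies from the mean.
z_scores = [(x - xbar) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"{x:>3}  z = {z:+.2f}")
```

Values below the mean print with a negative sign and values above it with a positive sign, and the z-scores of a data set always sum to zero (up to rounding).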
Chebyshev’s Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within $z$ standard deviations of the mean, where $z$ is any value greater than 1.
Chebyshev’s Theorem
At least 75% of the data values must be within z = 2 standard deviations of the mean.
At least 89% of the data values must be within z = 3 standard deviations of the mean.
At least 94% of the data values must be within z = 4 standard deviations of the mean.
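These percentages follow directly from the bound 1 - 1/z². A small sketch (the function name `chebyshev_lower_bound` is an assumption):

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the theorem applies only for z greater than 1")
    return 1 - 1 / z ** 2

bounds = {z: chebyshev_lower_bound(z) for z in (2, 3, 4)}

for z, bound in bounds.items():
    print(f"z = {z}: at least {bound:.2%} of the values")
```

The exact bounds are 3/4, 8/9 and 15/16, which round to the 75%, 89% and 94% quoted above.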
Empirical Rule
For data having a bell-shaped distribution:
68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
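These percentages can be checked against the normal distribution itself. The sketch below relies on the identity P(|Z| ≤ k) = erf(k/√2) for a standard normal variable Z; the function name `share_within` is an assumption.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"+/- {k} standard deviation(s): {share:.4%}")
```

The computed values agree with the percentages above to within about 0.01 percentage point; the small difference is presumably due to rounding in printed normal tables.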
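Empirical Rule
[Figure: normal curve over x, with the axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - 1\sigma$, $\mu$, $\mu + 1\sigma$, $\mu + 2\sigma$, $\mu + 3\sigma$; 68.26% of the values lie within $\pm 1\sigma$, 95.44% within $\pm 2\sigma$, and 99.72% within $\pm 3\sigma$]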
Detecting Outliers
An outlier is an unusually small or unusually large value in a data set.
A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the data set
• a correctly recorded data value that belongs in the data set
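The z-score screen can be sketched as follows. The data are hypothetical, and the function name `flag_outliers` is an assumption. Note that with the sample standard deviation a z-score beyond ±3 is only possible when the data set has at least 11 observations, so the example uses 15.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    xbar = mean(values)
    s = stdev(values)
    return [x for x in values if abs((x - xbar) / s) > threshold]

# Hypothetical data: fourteen values near 50 and one suspicious value of 120.
data = [48, 50, 51, 49, 52, 50, 47, 53, 50, 49, 51, 50, 48, 52, 120]

flagged = flag_outliers(data)
print(flagged)
```

A flagged value is only a candidate for review: as the list above says, it may be a recording error, a value that does not belong in the data set, or a legitimate observation.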
Measures of Association Between Two Variables
Covariance
Correlation Coefficient
Covariance
The covariance is a measure of the linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$ for samples
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$ for populations
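Both covariance formulas can be written out directly. In this sketch the function names and the paired data are hypothetical; the data were chosen so that the cross-products are easy to verify by hand.

```python
def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def population_covariance(x, y):
    """Sum of cross-products of deviations from the means, divided by N."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    big_n = len(x)
    mu_x, mu_y = sum(x) / big_n, sum(y) / big_n
    return sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / big_n

# Hypothetical paired data: means are 3 and 4, cross-products sum to 6.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)          # 6 / 4
sigma_xy = population_covariance(x, y)  # 6 / 5

print(s_xy, sigma_xy)
```

Both results are positive, which indicates a positive linear relationship between x and y in this example.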
Correlation Coefficient
Values near +1 indicate a strong positive linear relationship.
Values near -1 indicate a strong negative linear relationship.
The coefficient can take on values between -1 and +1.
Correlation Coefficient
The correlation coefficient is computed as follows:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$ for samples
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$ for populations
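A minimal sketch of the sample correlation coefficient, computed as the sample covariance divided by the product of the two sample standard deviations. The function name and the data are hypothetical.

```python
import math

def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y), using n - 1 divisors throughout."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)

# A perfectly linear relationship (y = 2x + 1) for comparison.
r_linear = sample_correlation(x, [2 * xi + 1 for xi in x])

print(f"r = {r:.4f}, r for a perfect line = {r_linear:.4f}")
```

Unlike the covariance, the result does not depend on the units of x and y, which is why it always falls between -1 and +1.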
Correlation Coefficient
Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
Correlation is a measure of linear association and not necessarily causation.
Weighted Mean
When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.
In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.
When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.
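The GPA case can be sketched with the usual weighted mean formula, the sum of weight times value divided by the sum of the weights. The function name, grade points, and credit hours below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
unweighted = sum(grade_points) / len(grade_points)

print(f"GPA (weighted) = {gpa:.2f}, simple mean = {unweighted:.2f}")
```

Because the four-credit course counts more than the two-credit course, the weighted result (28/9, about 3.11) differs from the simple mean of the three grades (3.00).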