Chapter 3 Descriptive Statistics: Numerical Methods Statistics for Business (Env) 1

Preview:

Citation preview

Chapter 3

Descriptive Statistics: Numerical Methods

Statistics for Business(Env)

1

Descriptive Statistics

3.1 Describing Central Tendency3.2 Measures of Variation3.3 Percentiles, Quartiles and Box-and-Whiskers

Displays3.4 Covariance, Correlation, and the Least Square

Line3.5 Weighted Means and Grouped Data

(Optional)3.6 The Geometric Mean (Optional)

Describing Central Tendency

• In addition to describing the shape of a distribution, want to describe the data set’s central tendency– A measure of central tendency represents the

center or middle of the data– It is most typical or most representative of the

entire data

Parameters and Statistics• A population parameter is a number

calculated from all the population measurements that describes some aspect of the population

• A sample statistic is a number calculated using the sample measurements that describes some aspect of the sample

Measures of Central Tendency

Mean, The average or expected value Median, Md The value of the middle point of

the ordered measurementsMode, Mo The most frequent value

The MeanPopulation X1, X2, …, XN

Population Mean

N

X

N

=1ii

Sample x1, x2, …, xn

Sample Mean

x

n

x x

n

=1ii

The Sample Mean

and is a point estimate of the population mean • It is the value to expect, on average and in the long run

• And the amount each member gets when the total is distributed equally within the sample

n

xxx

n

xx n

n

ii

...211

For a sample of size n, the sample mean is defined as

8

Mean as the balance point for a distributionData: 2, 2, 6, 10mean=(2+2+6+10)/4=5

9

Data: 3, 6, 6, 9, 11mean=(3+6+6+9+11)/5=7

What will happen to the mean if we add one more number to the data?

The Median

The median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it

1. If the number of measurements is odd, the median is the middlemost measurement in the ordering

2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordering

11

Data: 3, 5, 8, 10, 11median=8

12

Data: 3, 3, 4, 5, 7, 8median=(4+5)/2=4.5

13

Data: 1, 2, 2, 3, 4, 4, 4, 4, 4, 5median=4??

14

Example for non-integer data

• Example 3.1: First five observations from Table 3.1:

30.8, 31.7, 30.1, 31.6, 32.1

• In order: 30.1, 30.8, 31.6, 31.7, 32.1

• There is an odd so median is one in middle, or 31.6

16

Data: 2, 2, 2, 3, 3, 12 mean=4median=(2+3)/2=2.5

The Mode

The mode Mo of a population or sample of measurements is the measurement that occurs most frequently– Modes are the values that are observed “most typically”– Sometimes higher frequencies at two or more values

• If there are two modes, the data is bimodal• If more than two modes, the data is multimodal

– When data are in classes, the class with the highest frequency is the modal class

• The tallest box in the histogram

Histogram Describing the 50 Mileages

19

Selecting a measure of Central Tendency

• Usually the mean is a good measure, because it uses every score in the distribution.

• There are some extreme cases in which the mean is not representative (or calculable). Then the mode and the median are used.

20

21

Mean=(10+11*4+12*3+13+100)/10=20.3

Median=(11+12)/2=11.5

Mode=11

22

Mean – not computableMedian=(12+13)/2=12.5Mode – not meaningful

Open-ended distributions A distribution is said to be open-ended when there is no upper limit (or lower limit) for one of the categories

Measures of Variation/variability• Knowing the measures of central tendency is not

enough• Both of the distributions below have identical

measures of central tendency

24

Measures of VariationRange Largest minus the smallest measurement

Variance The average of the squared deviations of all the population measurements from the population mean

Standard The square root of the varianceDeviation

They provide quantitative measures of the degree to which data in a distribution are spread out or clustered together.

Range for discrete & continuous data

• The range is the distance between the largest score (Xmax) and the smallest score (Xmin) in the distribution for discrete data.

• For continuous data, you must also take into account the real limits of the maximum and minimum X values.

• range = URL Xmax - LRL Xmin

26

Population Variance and Standard Deviation

• The population variance (σ2) is the average of the squared deviations of the individual population measurements from the population mean (µ)

• The population standard deviation (σ) is the positive square root of the population variance

Variance• For a population of size N, the population

variance σ2 is:

• For a sample of size n, the sample variance s2 is:

N

xxx

N

xN

N

ii 22

22

11

2

2

11

222

211

2

2

n

xxxxxx

n

xxs n

n

ii

29

Sample variability tends to underestimate the population value

Standard Deviation

• Population standard deviation (σ):

• Sample standard deviation (s):

2

2ss

Example: Sample Variance and Standard Deviation

• Data points are: 60, 41, 15, 30, 34• Mean is 36• Variance is:

Standard deviation is:

4.2165

1082

5

436441255765

36343630361536413660 222222

71.144.216

32

X

Percentiles & Quartiles

For a set of measurements arranged in increasing order, the pth percentile is a value such that p percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above the value

• The first quartile Q1 is the 25th percentile • The second quartile (or median) is the 50th percentile• The third quartile Q3 is the 75th percentile• The interquartile range IQR is Q3 - Q1

Cumulative percentages & PERCENTILES

34

X=2 means that the measurement was somewhere between the real limits of 1.5 and 2.5.

30% of the individuals have been accumulated by the time you reach the top of the interval for X=2.

Q3

Q1

35

36

What is the 95th percentile? (Answer: X = 4.5.)

What is the percentile rank for X = 3.5? (Answer: 70%.)

What is the 50th percentile?What is the percentile rank for X = 4?estimates of these values by a standard procedure known as interpolation

37

38

39

Using the following distribution of scores, we will use interpolation to find the 50th percentile:

40

For the scores, the width of the interval is 5 points. For the percentages, thewidth is 50 points.The value of 50% is located 10 points from the top of the percentage interval. Asa fraction of the whole interval, this is 10 out of 50, or 1/5 of the total interval.

The 50th percentile is X = 8.5.

41

USING INTERPOLATION TO FIND THE MEDIAN

Answer: X = 3.70 is the medianNotice that this is exactly the same answer we obtained using the graphic method of interpolation in Figure 3.7

42

43

Example: Quartiles

20 customer satisfaction ratings:

1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10

Md = (8+8)/2 = 8

Q1 = (7+8)/2 = 7.5 Q3 = (9+9)/2 = 9

IQR = Q3 Q1 = 9 7.5 = 1.5

A slightly different way to find the quartiles (without using interpolation).

Five Number Summary in descriptive statistic

1. The smallest measurement2. The first quartile, Q1

3. The median, Md

4. The third quartile, Q3

5. The largest measurement

• Displayed visually using a box-and-whiskers plot

Box-and-whisker plots

46

A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.

Outliers

• Outliers are measurements that are very different from other measurements– They are either much larger or much smaller than most of

the other measurements

• Outliers lie beyond the fences of the box-and-whiskers plot– Measurements between the inner and outer fences are

mild outliers– Measurements beyond the outer fences are severe

outliers

Box-and-Whiskers Plots

• The box plots the: – first quartile, Q1

– median, Md

– third quartile, Q3

– inner fences– outer fences

From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree,

Box-and-Whiskers Plots Continued

• Inner fences– Located 1.5IQR away from the quartiles:

• Q1 – (1.5 IQR)• Q3 + (1.5 IQR)

• Outer fences– Located 3IQR away from the quartiles:

• Q1 – (3 IQR)• Q3 + (3 IQR)

From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree,

Box-and-Whiskers Plots Continued

• The “whiskers” are dashed lines that plot the range of the data– A dashed line drawn from the box below Q1 down

to the smallest measurement between the inner fences

– Another dashed line drawn from the box above Q3 up to the largest measurement between the inner fences

Box-and-Whiskers Plots Continued

Symmetric distributionSymmetric distribution: A distribution having the same shape on either side of the center

Skewed distributionSkewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution.

Can be positively or negatively skewed, or bimodal

negative

positive

The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution

M o d e

M ed ia n

M ea n

Symmetric distributionSymmetric distribution: mean = median = mode

54

Relationships Among Mean, Medianand Mode

The The leftleft tail is longer; the mass of the tail is longer; the mass of the distribution is concentrated on the distribution is concentrated on the right of the figure. The distribution is right of the figure. The distribution is also said to be also said to be left-skewedleft-skewed..

The The rightright tail is longer; the mass of the tail is longer; the mass of the distribution is concentrated on the left distribution is concentrated on the left of the figure. The distribution is also of the figure. The distribution is also said to be said to be right-skewedright-skewed..

Mean<Median<Mode Mode<Median<Mean

Relative Positions of the Mean, Median, and Mode

Skewness is the measurement of the lack of symmetry of the distribution.

The coefficient of skewness can range from -3.00 up to 3.00 when using the following formula:

A value of 0 indicates a symmetric distribution.Negatively Skewed Distribution has negative coefficient of skewness. Positively Skewed Distribution has positive coefficient of skewness.

Some software packages use a different formula which results in a wider range for the coefficient.

s

MedianXSk

3

Normal distribution (Gaussian Normal distribution (Gaussian distribution) is a symmetric distribution) is a symmetric distribution that often gives a good distribution that often gives a good description of data that cluster description of data that cluster around the mean. The graph of the around the mean. The graph of the distribution is bell-shaped, with a distribution is bell-shaped, with a peak at the mean, and is known as peak at the mean, and is known as the Gaussian function or bell curve. the Gaussian function or bell curve.

Normal distributionNormal distribution

The bean machine is a device The bean machine is a device invented by Sir Francis Galton to invented by Sir Francis Galton to demonstrate how the normal demonstrate how the normal distribution appears in nature. This distribution appears in nature. This machine consists of a vertical board machine consists of a vertical board with interleaved rows of pins. Small with interleaved rows of pins. Small balls are dropped from the top and balls are dropped from the top and then bounce randomly left or right then bounce randomly left or right as they hit the pins. The balls are as they hit the pins. The balls are collected into bins at the bottom collected into bins at the bottom and settle down into a pattern and settle down into a pattern resembling the Gaussian curve.resembling the Gaussian curve.

Normal distribution in natureNormal distribution in nature

Height (in.)Height (in.)

Normal distribution in natureNormal distribution in nature

Distribution of the heights of 1052 women fits the normal distribution, Distribution of the heights of 1052 women fits the normal distribution, with a goodness of fit p value of 0.75with a goodness of fit p value of 0.75

Histogram of daily percentage changes in the S&P 500 index

The Empirical Rule for a Normal/Gaussian distribution

• If a population has mean µ and standard deviation σ and is described by a normal distribution, then– 68.26% of the population measurements lie within one

standard deviation of the mean: [µ-σ, µ+σ]– 95.44% of the population measurements lie within two

standard deviations of the mean: [µ-2σ, µ+2σ]– 99.73% of the population measurements lie within three

standard deviations of the mean: [µ-3σ, µ+3σ]

Origin and meaning of “Six Sigma Process"

If one has six standard deviations between the process mean and If one has six standard deviations between the process mean and the nearest specification limit, as shown in the graph, practically the nearest specification limit, as shown in the graph, practically no items will fail to meet specifications.no items will fail to meet specifications.

If the upper and lower specification limits (USL, LSL) are at a distance of 6σ from the If the upper and lower specification limits (USL, LSL) are at a distance of 6σ from the mean, values lying that far away from the mean are extremely unlikely. Even if the mean, values lying that far away from the mean are extremely unlikely. Even if the mean were to move right or left by 1.5σ at some point in the future (1.5 sigma shift), mean were to move right or left by 1.5σ at some point in the future (1.5 sigma shift), there is still a good safety cushion. This is why if the mean is at least 6σ away from there is still a good safety cushion. This is why if the mean is at least 6σ away from the nearest specification limit, the process will be under good quality control. the nearest specification limit, the process will be under good quality control.

Six Sigma – quality control of process outputs

Chebyshev’s Theorem

• Let µ and σ be a population’s mean and standard deviation, then for any value k> 1

• At least 100(1 - 1/k2 )% of the population measurements lie in the interval [µ-kσ, µ+kσ]

• Or, at most 100/k2 % of the population measurements lie outside the interval [µ-kσ, µ+kσ]

• Chebyshev’s Theorem tells us roughly the percentage of data with values that will fall within a certain number of standard deviations of the mean.

• Chebyshev's Theorem applies to all distributions regardless of their shape and can therefore be used for non normal distributions and distributions where the shape is unknown.

Chebyshev’s Theorem: what is it?

Chebyshev’s Theorem: comparision

Minimum percentage of the population lie within the interval

One study of heights in the U.S. concluded that women have a mean height of 65 inches with a standard deviation of 2.5 inches. If this is correct, what is the percentage of women that have heights between 60 and 70 inches according to Chebyshev’s Theorem? If we assume the distribution is a normal distribution, what is the percentage of women with heights between 60 and 70 inches?

Since the interval between 60 & 70 inches is 2 σ away from the mean, according to Chebyshev’s Theorem, there are at least 100(1 - 1/22 )% or 75% of women with heights between 60 and 70 inches. But if we know the distribution is a normal distribution, then there is 95.44% of the population measurements lie within two standard deviations from the mean.

Chebyshev’s Theorem: Example

z Scores• For any x in a population or sample, the associated z

score is defined as

deviation standard

mean

xz

Z-score is: the exact location of a scorewithin a distribution

The z-score transforms each X value into a signed number so that

1. The sign tells whether the score is located above (+) or below (-) the mean, and2. The number tells the distance between the score and the mean in terms of the number of standard deviations.

69

70

Two distributions of exam scores. For both distributions, = 70, but for one distribution, = 3, and for the other, = 12. The position of X = 76 is very different for these two distributions.

σ σ

Z=(76-70)/12=0.5Z=(76-70)/3=2

What if you score a 65 on a test?On the surface it might look bad.But what if that was the mean is 45 and the standard deviation is 10. Then your z-Score is Z = (65 – 45)/10 = 2.That means you are 2 σ above the mean value.From Chebyshev’s Theorem, you are at worse in the top 12.5% of the class if the distribution is symmetric. If the distribution is normal you are in the top 2-3% of the class.

z Scores: application

z Scores: assignment

In order to get an “A”, you should be in the top 10% of the In order to get an “A”, you should be in the top 10% of the class.class.

What should be your z-Score to get an “A”, according to What should be your z-Score to get an “A”, according to Chebyshev’s Theorem and assuming the distribution is Chebyshev’s Theorem and assuming the distribution is symmetrical?symmetrical?

Example: z Score• Population of profit margins for five American

companies: 8%, 10%, 15%, 12%, 5%

• µ = 10%, σ = 3.406%

74

75

TRANSFORMING z-SCORES TO A DISTRIBUTION WITH A PREDETERMINED mean and standard deviation

An instructor gives an exam to a psychology class. For this exam, the distributionof raw scores has a mean of 57 with sd= 14. The instructor would like to simplify the distribution by transforming all scores into a new, standardized distribution with mean= 50 and sd= 10. To demonstrate this process, we will consider what happens to two specific students: Joe, who has a raw score of X= 64 in the original distribution; and Maria, whose original raw score is X= 43.

Step 1: Transform each of the original raw scores into z-scores. For Joe, X= 64, so hisz-score is 0.5 For Maria, X= 43, and her z-score is -1.0

Step 2: Change the z-scores to the new, standardized scores.Joe’s z-score, z=0.5, indicates that he is above the mean by exactly 1/2 standard deviation, so his standardized score would be 55. Maria’s score is located 1 standard deviation below the mean. So her new score would be X= 40.

76

Example: A psychologist has developed a new intelligence test. For years, the test has been given to a large number of people; for this population, mean= 65 and sd= 10. The psychologist would like to make the scores of this test comparable to scores on other IQ tests, which have mean= 100 and sd= 15. If the test is standardized so that it is comparable (has the same mean and sd) to other tests, what would be the standardized scores for the following individuals?

For A z=(75-65)/10=1 standardized score=100+1x15=115 B z=(45-65)/10=-2 100-2x15=70 C z=(67-65)/10=0.2 100+0.2x15=103

Coefficient of Variation

• Measures the size of the standard deviation relative to the size of the mean

• Coefficient of variation =standard deviation/mean × 100%

• Used to:– Compare the relative variabilities of values about the

mean– Compare the relative variability of populations or samples

with different means and different standard deviations– Measure risk

Covariance, Correlation, and the Least Squares Line• When points on a scatter plot seem to

fluctuate around a straight line, there is a linear relationship between x and y

• A measure of the strength of a linear relationship is the covariance sxy

1

1

n

yyxxs

n

iii

xy

Covariance

• A positive covariance indicates a positive linear relationship between x and y– As x increases, y increases

• A negative covariance indicates a negative linear relationship between x and y– As x increases, y decreases

Interpretation of the Sample Covariance

Interpretation of the Sample Covariance Continued

• Points in quadrant I correspond to xi and yi both greater than their averages– (x-) and (y-y ̅) both positive so covariance positive

• Points in quadrant III correspond to xi and yi both less than their averages– (x-) and (y-y ̅) both negative so covariance positive

• If sxy is positive, the points having the greatest influence are in quadrants I and III

• Therefore, a positive sxy indicates a positive linear relationship

x

x

Interpretation of the Sample Covariance Continued

• Points in quadrant II correspond to xi less than and yi greater than y – (x-) is negative and (y-y ̅) is positive so covariance negative

• Points in quadrant IV correspond to xi greater than and yi less than y ̅– (x-) is positive and (y- y ̅) is negative so covariance

negative• If sxy is negative, the points having the greatest

influence are in quadrants II and IV• Therefore, a negative sxy indicates a negative linear

relationship

x

x

x

x

Correlation Coefficient• Magnitude of covariance does not indicate

the strength of the relationship– Magnitude depends on the unit of

measurement used for the data

• Sample correlation coefficient (r) is a measure of the strength of the relationship that does not depend on the magnitude of the data

yx

xy

ss

sr

Correlation Coefficient Continued

• Sample correlation coefficient r is always between -1 and +1– Values near -1 show strong negative correlation– Values near 0 show no correlation– Values near +1 show strong positive correlation

• Sample correlation coefficient is the point estimate for the population correlation coefficient ρ

Least Squares Line

• If there is a linear relationship between x and y, might wish to predict y on the basis of x

• This requires the equation of a line describing the linear relationship

• Line is calculated based on least squares line– Discussed in detail in Chapter 13

Least Squares Line Continued

• Need to calculate slope (b1) and y-intercept (b0)

21x

xy

s

sb

xbyb 10

Least Squares Line for the Sales Volume Data

Weighted Means• Sometimes, some measurements are more

important than others– Assign numerical “weights” to the data

• Weights measure relative importance of the value

• Calculate weighted mean as

where wi is the weight assigned to the ith measurement xi

i

ii

w

xw

Example: Weighted Mean

• June 2001 unemployment rates by census region– Northeast, 26.9 million in civilian labor force, 4.1%

unemployment rate– South, 50.6 million, 4.7% unemployment– Midwest, 34.7 million, 4.4% unemployment– West, 32.5 million, 5.0 unemployment

• Want the mean unemployment rate for the US

Example: Weighted Mean Continued

• Want the mean unemployment rate for the U.S.

• Calculate it as a weighted mean– So that the bigger the region, the more heavily it

counts in the mean• The data values are the regional

unemployment rates• The weights are the sizes of the regional labor

forces

Example: Weighted Mean Continued

• Note that the unweigthed mean is 4.55%, which underestimates the true rate by 0.03%– That is, 0.0003 144.7 million = 43,410

workers

%5847144

29663532525734650926

05532447347465014926

..

......

........

Descriptive Statistics for Grouped Data

• Data already categorized into a frequency distribution or a histogram is called grouped data

• Can calculate the mean and variance even when the raw data is not available

• Calculations are slightly different for data from a sample and data from a population

Descriptive Statistics for Grouped Data (Sample)

• Sample mean for grouped data:

• Sample variance for grouped data:

fi is the frequency for class i

Mi is the midpoint of class i

n = Σfi = sample size

n

Mf

f

Mfx ii

i

ii

1

22

n

xMfs ii

Descriptive Statistics for Grouped Data (Population)

• Population mean for grouped data:

• Population variance for grouped data:

fi is the frequency for class i Mi is the midpoint of class iN = Σfi = population size

N

Mf

f

Mf ii

i

ii

N

xMf ii

22

The Geometric Mean (Optional)• For rates of return of an investment, use

the geometric mean to give the correct wealth at the end of the investment

• Suppose the rates of return (expressed as decimal fractions) are R1, R2, …, Rn for periods 1, 2, …, n

• The mean of all these returns is the calculated as the geometric mean:

1111 21 n

ng RRRR

The arithmetic mean of the absolute values of the deviations from the arithmetic mean.

The main features: All values are used in the calculation. It is sensitive to unusual data. The absolute values are difficult to

manipulate. So, it is not used in inferential stat.

Mean Deviation

M D = X - X

n

3- 96

Mean DeviationMean Deviation

Chapter Three: Summary

ONECalculate the arithmetic mean, median, mode, weighted mean, and the geometric mean.

TWO Explain the characteristics of each measure of location.

THREEIdentify the position of the arithmetic mean, median, and mode for both a symmetrical and a skewed distribution.

FOUR

Compute and interpret the range, the mean deviation, the variance, and the standard deviation of ungrouped data.

FIVEExplain the characteristics, uses, advantages, and disadvantages of each measure of dispersion.

SIX

Understand Chebyshev’s theorem and the Empirical Rule as they relate to a set of observations.

Chapter Three: Summary