Data Description Chapter 3


Page 1: Data Description

Data Description

Chapter 3

Page 2: Data Description

Through this chapter you will learn:

o Measures of central tendency

o Measures of dispersion

o Measures of position

Page 3: Data Description

A statistic is a characteristic or measure obtained by using the data values from a sample.

A parameter is a characteristic or measure obtained by using all the data values for a specific population.

Page 4: Data Description

Population Arithmetic Mean

$\mu = \dfrac{\sum X}{N}$

X: each value; N: total number of values in the population

Page 5: Data Description

Sample Arithmetic Mean

$\bar{X} = \dfrac{\sum X}{n}$

X: each value in the sample; n: total number of observations in the sample (sample size)

Page 6: Data Description

Example 1

Find the mean of the following sample data:

7 4 8 8 10 12 12

$\sum X = 7 + 4 + 8 + 8 + 10 + 12 + 12 = 61$

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{61}{7} \approx 8.71$

Page 7: Data Description

Estimating the Mean of Data Grouped into a Frequency Distribution

$\bar{X} = \dfrac{\sum f \cdot X_m}{n}$

f: frequency of each class

$X_m$: class midpoint of each class

n: total number of frequencies

Page 8: Data Description

Example 2

Given the frequency distribution below, estimate the mean.

Class boundaries    Frequency
5.5-10.5            1
10.5-15.5           2
15.5-20.5           3
20.5-25.5           5
25.5-30.5           4
30.5-35.5           3
35.5-40.5           2

Page 9: Data Description

Example 2 (Cont.)

Class boundaries    Frequency f    Midpoint X_m    f·X_m
5.5-10.5            1              8               8
10.5-15.5           2              13              26
15.5-20.5           3              18              54
20.5-25.5           5              23              115
25.5-30.5           4              28              112
30.5-35.5           3              33              99
35.5-40.5           2              38              76
Total               n = Σf = 20                    Σ f·X_m = 490

Page 10: Data Description

Example 2 (Cont.)

$\bar{X} = \dfrac{\sum f \cdot X_m}{n} = \dfrac{490}{20} = 24.5$
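The grouped-mean estimate from Example 2 can be reproduced with a short Python sketch. The function name `grouped_mean` and the list layout are illustrative choices, not part of the slides.

```python
def grouped_mean(boundaries, freqs):
    """Estimate the mean of grouped data as sum(f * Xm) / n."""
    # Class midpoint Xm = (lower boundary + upper boundary) / 2
    midpoints = [(lo + hi) / 2 for lo, hi in boundaries]
    n = sum(freqs)  # total number of frequencies
    return sum(f * xm for f, xm in zip(freqs, midpoints)) / n


# Frequency distribution from Example 2
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [1, 2, 3, 5, 4, 3, 2]

print(grouped_mean(boundaries, freqs))  # 24.5
```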

Page 11: Data Description

Median

A median is the midpoint of the data array.

Steps in finding the median of a data array:

Step 1: Arrange the data in order.

Step 2: Select the midpoint of the array as the median.

Page 12: Data Description

Example 3

Find the median of the scores 7 2 3 7 6 9 10 8 9 9 10.

Arrange the data in order to obtain

2 3 6 7 7 8 9 9 9 10 10

We have 11 values. 8 is the exact middle value and hence it is the median.

Page 13: Data Description

Example 4

Find the median of the scores 7 2 3 7 6 9 10 8 9 9

Arrange the data in order

2 3 6 7 7 8 9 9 9 10

With these ten scores, no single score is at the exact middle. Instead, the two scores of 7 and 8 share the middle. We therefore find the mean of these two scores.

Page 14: Data Description

Example 4 (Cont.)

$\text{Median} = \dfrac{7 + 8}{2} = 7.5$

Hence, the median is 7.5.
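The two cases in Examples 3 and 4 (odd and even number of values) can be sketched in Python as follows. The helper name `median` is an illustrative choice.

```python
def median(values):
    """Return the midpoint of the sorted data array."""
    data = sorted(values)          # Step 1: arrange the data in order
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                 # odd count: one exact middle value
        return data[mid]
    # even count: mean of the two values that share the middle
    return (data[mid - 1] + data[mid]) / 2


print(median([7, 2, 3, 7, 6, 9, 10, 8, 9, 9, 10]))  # 8   (Example 3)
print(median([7, 2, 3, 7, 6, 9, 10, 8, 9, 9]))      # 7.5 (Example 4)
```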

Page 15: Data Description

Estimating the Median of Data Grouped into a Frequency Distribution

$\text{Median} = LB + \dfrac{n/2 - CF}{f} \cdot w$

Page 16: Data Description

LB: lower boundary of the median class

n: total number of frequencies

f: frequency of the median class

CF: cumulative frequency of the class preceding the median class

w: class width

Page 17: Data Description

Example 5

Given the frequency distribution below, estimate the median.

Class    Frequency
30-39    4
40-49    6
50-59    8
60-69    12
70-79    9
80-89    7
90-99    4

Page 18: Data Description

Example 5 (Cont.)

First find the cumulative frequency.

Class    Frequency    CF
30-39    4            4
40-49    6            10
50-59    8            18
60-69    12           30
70-79    9            39
80-89    7            46
90-99    4            50

Page 19: Data Description

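Example 5 (Cont.)

w = 10, n = 50, and hence n/2 = 25. The median falls in the class 60-69 (class boundaries 59.5-69.5), so LB = 59.5, CF = 18, and f = 12.

$\text{Median} = LB + \dfrac{n/2 - CF}{f} \cdot w = 59.5 + \dfrac{25 - 18}{12} \cdot 10 \approx 65.33$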
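As a rough check of Example 5, the grouped-median formula can be coded directly. This is a minimal sketch; the function name `grouped_median` is hypothetical, and it assumes equal class widths and that the median class is the first class whose cumulative frequency reaches n/2.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Estimate the median as LB + (n/2 - CF) / f * w."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    cf = 0  # cumulative frequency of the classes before the current one
    for lb, f in zip(lower_boundaries, freqs):
        if cf + f >= n / 2:        # this is the median class
            return lb + (n / 2 - cf) / f * width
        cf += f


# Example 5: classes 30-39, ..., 90-99 have lower boundaries 29.5, ..., 89.5
lower_boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
freqs = [4, 6, 8, 12, 9, 7, 4]

print(round(grouped_median(lower_boundaries, freqs, 10), 2))  # 65.33
```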

Page 20: Data Description

Example 6

Estimate the median for the frequency distribution below.

Class      Frequency
80-89      5
90-99      9
100-109    20
110-119    8
120-129    6
130-139    2

Page 21: Data Description

Mode

o For data grouped into a frequency distribution, the mode can be estimated by the class midpoint of the modal class (the class with the highest frequency).

o It can also be found by the formula

$\text{Mode} = LB + \dfrac{d_1}{d_1 + d_2} \cdot w$

Page 22: Data Description

where

o LB: lower boundary of the modal class

o w: class width

o $d_1$: difference between the frequency of the modal class and that of the class preceding it

o $d_2$: difference between the frequency of the modal class and that of the class right after it

Page 23: Data Description

Example 7

Estimate the mode of the distribution below.

Class        Frequency (f)
5.5-10.5     1
10.5-15.5    2
15.5-20.5    3
20.5-25.5    5    (modal class)
25.5-30.5    4
30.5-35.5    3
35.5-40.5    2

Page 24: Data Description

Example 7 (cont.)

LB = 20.5

w = 5

$d_1 = 5 - 3 = 2$

$d_2 = 5 - 4 = 1$

$\text{Mode} = 20.5 + \dfrac{2}{2 + 1} \cdot 5 \approx 23.83$
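The mode formula can be sketched in Python as below. The function name `grouped_mode` is illustrative, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch that the slides do not address.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Estimate the mode as LB + d1 / (d1 + d2) * w."""
    # Modal class: the class with the highest frequency (first one on ties)
    i = max(range(len(freqs)), key=freqs.__getitem__)
    # Assumption: a class with no neighbour is compared against frequency 0
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    return lower_boundaries[i] + d1 / (d1 + d2) * width


# Example 7
lower_boundaries = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5]
freqs = [1, 2, 3, 5, 4, 3, 2]

print(round(grouped_mode(lower_boundaries, freqs, 5), 2))  # 23.83
```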

Page 25: Data Description

The Midrange

$MR = \dfrac{\text{lowest value} + \text{highest value}}{2}$

Page 26: Data Description

Example 8

The midrange of this data set: 2, 3, 6, 8, 4, 1 is

MR=(8+1)/2=4.5

Page 27: Data Description

The Weighted Mean

$\bar{X}_w = \dfrac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \dfrac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$

$X_i$: the values

$w_i$: the weights

Page 28: Data Description

Example 8

A student obtained 40, 50, 60, 80, and 55 marks in Math, Statistics, Physics, Chemistry, and Biology, respectively. Assuming weights of 5, 2, 4, 3, and 1, respectively, for these subjects, find the weighted arithmetic mean per subject.

Page 29: Data Description

Example 8 (cont.)

Subject       Marks obtained (x)    Weight (w)    wx
Math          40                    5             200
Statistics    50                    2             100
Physics       60                    4             240
Chemistry     80                    3             240
Biology       55                    1             55
Total                               15            835

Page 30: Data Description

Example 8 (cont.)

$\bar{X}_w = \dfrac{835}{15} \approx 55.667$ marks/subject
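The weighted mean can be verified with a few lines of Python. This sketch uses the marks and weights from the table in this example; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * X_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Math, Statistics, Physics, Chemistry, Biology (values from the table)
marks = [40, 50, 60, 80, 55]
weights = [5, 2, 4, 3, 1]

print(round(weighted_mean(marks, weights), 3))  # 55.667
```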

Page 31: Data Description

Distribution Shapes

(a) Positively skewed or right-skewed

[Figure: positively skewed distribution; marked from left to right along the x-axis: mode, median, mean]

Page 32: Data Description

Distribution Shapes (cont.)

(b) Negatively skewed or left-skewed

[Figure: negatively skewed distribution; marked from left to right along the x-axis: mean, median, mode]

Page 33: Data Description

Distribution Shapes (cont.)

[Figure: symmetric distribution, for which Mean = Median = Mode]

Page 34: Data Description

Range

The range is the highest value minus the lowest value. The symbol R is used for the range.

$R = \text{highest value} - \text{lowest value}$

Page 35: Data Description

Mean Deviation

$\text{Mean Deviation} = \dfrac{\sum |X - \bar{X}|}{n}$

Page 36: Data Description

Example 9

The number of patients seen in the emergency room in a hospital for a sample of 5 days last year was: 103, 97, 101, 106, and 103. Determine the mean deviation and interpret.

Page 37: Data Description

Example 9 (Cont.)

First find the arithmetic mean:

$\bar{X} = \dfrac{103 + 97 + 101 + 106 + 103}{5} = 102$

Page 38: Data Description

Example 9 (Cont.)

Number of cases    Deviation          Absolute deviation
103                103 - 102 = 1      1
97                 97 - 102 = -5      5
101                101 - 102 = -1     1
106                106 - 102 = 4      4
103                103 - 102 = 1      1
Total                                 12

Page 39: Data Description

Example 9 (Cont.)

$MD = \dfrac{\sum |X - \bar{X}|}{n} = \dfrac{12}{5} = 2.4$

Hence the mean deviation is 2.4 patients per day. The number of patients deviates, on average, by 2.4 patients from the mean of 102 patients per day.
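The mean deviation in Example 9 can be reproduced with the following Python sketch (the function name `mean_deviation` is an illustrative choice).

```python
def mean_deviation(values):
    """Return the average absolute deviation from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


patients = [103, 97, 101, 106, 103]
print(mean_deviation(patients))  # 2.4
```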

Page 40: Data Description

Example 10

The weight of a group of crates being shipped to Ireland is (in pounds)

95, 103, 105, 110, 104, 105, 112, and 90.

a) What is the range of the weights?

b) Compute the arithmetic mean weight.

c) Compute the mean deviation of the weights.

(Answer: a) 22, b) 103, c) 5.25 pounds)

Page 41: Data Description

Population Variance and Standard Deviation

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

$\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$

Remember: Standard deviation is the positive square root of variance.

Page 42: Data Description

Example 11

Find the variance and standard deviation for the population data: 35, 45, 30, 35, 40, 25

Solution

First find the arithmetic mean:

$\sum X = 35 + 45 + 30 + 35 + 40 + 25 = 210$

$\mu = 210/6 = 35$

Then construct the table.

Page 43: Data Description

Example 11 (cont.)

X     X - μ    (X - μ)²
35    0        0
45    10       100
30    -5       25
35    0        0
40    5        25
25    -10      100

Page 44: Data Description

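Example 11 (cont.)

The population variance is

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N} = \dfrac{250}{6} \approx 41.7$

The population standard deviation is

$\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}} = \sqrt{41.7} \approx 6.5$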
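Example 11 can be checked with a short Python sketch. The function name `population_variance` is illustrative; the standard deviation is taken as the positive square root of the variance.

```python
import math


def population_variance(values):
    """Return sum((X - mu)^2) / N for a full population."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


data = [35, 45, 30, 35, 40, 25]
variance = population_variance(data)
std_dev = math.sqrt(variance)

print(round(variance, 1))  # 41.7
print(round(std_dev, 1))   # 6.5
```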

Page 45: Data Description

Sample Variance and Standard Deviation

Sample Variance (Conceptual formula)

$s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

Sample Variance (Computational formula)

$s^2 = \dfrac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}$

Page 46: Data Description

Sample Variance and Standard Deviation (Cont.)

Sample Standard Deviation (Conceptual formula)

$s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$

Sample Standard Deviation (Computational formula)

$s = \sqrt{\dfrac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}}$

Page 47: Data Description

Example 12

Find the sample variance and standard deviation for the amount of European auto sales for a sample of 6 years shown. The data are in millions of dollars.

11.2, 11.9, 12.0, 12.8, 13.4, 14.3

Page 48: Data Description

Example 12 (Cont.)

Method 1

Find the mean: $\bar{X} = 12.6$

X        X - X̄    (X - X̄)²
11.20    -1.40     1.96
11.90    -0.70     0.49
12.00    -0.60     0.36
12.80    0.20      0.04
13.40    0.80      0.64
14.30    1.70      2.89
Total              6.38

Page 49: Data Description

Example 12 (Cont.)

Method 1

The variance is

$s^2 = \dfrac{6.38}{6 - 1} \approx 1.28$

and hence, the standard deviation is

$s = \sqrt{1.28} \approx 1.13$

Page 50: Data Description

Example 12 (Cont.)

Method 2

We compute

$\sum X = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6$

$\sum X^2 = 11.2^2 + 11.9^2 + 12.0^2 + 12.8^2 + 13.4^2 + 14.3^2 = 958.94$

The variance is computed by

$s^2 = \dfrac{958.94 - 75.6^2 / 6}{5} \approx 1.28$

The standard deviation is 1.13.
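Both methods in Example 12 can be compared side by side in Python. This is a sketch with illustrative function names; the two formulas should agree up to floating-point rounding.

```python
import math


def sample_variance_conceptual(xs):
    """s^2 = sum((X - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_computational(xs):
    """s^2 = (sum(X^2) - (sum X)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
v1 = sample_variance_conceptual(sales)
v2 = sample_variance_computational(sales)

print(round(v1, 2), round(v2, 2))  # 1.28 1.28
print(round(math.sqrt(v1), 2))     # 1.13
```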

Page 51: Data Description

Example 13

Suppose the numbers of minutes you spent traveling to school on the last 7 days were 9, 12, 9, 15, 10, 11, 15. Find the variance of the number of minutes using both formulas.

Page 52: Data Description

Variance and Standard Deviation for Grouped Data

$s^2 = \dfrac{\sum f \cdot X_m^2 - \left(\sum f \cdot X_m\right)^2 / n}{n - 1}$

f: class frequency

$X_m$: class midpoint (class mark)

n: total number of frequencies

Page 53: Data Description

Example 14

Find the variance and the standard deviation for the frequency distribution of the data representing the number of miles that 20 runners ran during one week.

Page 54: Data Description

Example 14 (cont.)

Class        Frequency
5.5-10.5     1
10.5-15.5    2
15.5-20.5    3
20.5-25.5    5
25.5-30.5    4
30.5-35.5    3
35.5-40.5    2

Page 55: Data Description

Example 14 (cont.)

Class boundary    Freq. f    Midpoint X_m    f·X_m    f·X_m²
5.5-10.5          1          8               8        64
10.5-15.5         2          13              26       338
15.5-20.5         3          18              54       972
20.5-25.5         5          23              115      2645
25.5-30.5         4          28              112      3136
30.5-35.5         3          33              99       3267
35.5-40.5         2          38              76       2888
Total                                        490      13310

Page 56: Data Description

Example 14 (cont.)

Hence, the variance is

$s^2 = \dfrac{13310 - 490^2 / 20}{20 - 1} \approx 68.7$

and the standard deviation is 8.3.
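The grouped-data variance in Example 14 can be recomputed from the frequencies and midpoints. The following is a sketch with an illustrative function name, `grouped_variance`.

```python
import math


def grouped_variance(midpoints, freqs):
    """s^2 = (sum(f * Xm^2) - (sum(f * Xm))^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_fx = sum(f * xm for f, xm in zip(freqs, midpoints))
    sum_fx2 = sum(f * xm ** 2 for f, xm in zip(freqs, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


midpoints = [8, 13, 18, 23, 28, 33, 38]
freqs = [1, 2, 3, 5, 4, 3, 2]

s2 = grouped_variance(midpoints, freqs)
print(round(s2, 1))             # 68.7
print(round(math.sqrt(s2), 1))  # 8.3
```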

Page 57: Data Description

Coefficient of Variation

The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage.

For samples: $CVar = \dfrac{s}{\bar{X}} \cdot 100\%$

For populations: $CVar = \dfrac{\sigma}{\mu} \cdot 100\%$

Page 58: Data Description

Example 15

The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.

Page 59: Data Description

Example 15 (Cont.)

Sales

$CVar = \dfrac{s}{\bar{X}} \cdot 100\% = \dfrac{5}{87} \cdot 100\% \approx 5.7\%$

Commissions

$CVar = \dfrac{773}{5225} \cdot 100\% \approx 14.8\%$

Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales.
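The comparison in Example 15 amounts to two divisions, sketched here in Python (the function name `coefficient_of_variation` is illustrative).

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1))         # 5.7
print(round(cv_commissions, 1))   # 14.8
print(cv_commissions > cv_sales)  # True: commissions are more variable
```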

Page 60: Data Description

Example 16

The mean for the number of pages of a sample of women's fitness magazines is 132, with a variance of 23; the mean for the number of advertisements of a sample of women's fitness magazines is 182, with a variance of 62. Compare the variations. (Answer: 3.6% for pages, 4.3% for advertisements)

Page 61: Data Description

Chebyshev's theorem

The proportion of values from a data set that will fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1 (k is not necessarily an integer).

Page 62: Data Description

Chebyshev’s theorem

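[Figure: at least 75% of the data values fall between $\bar{X} - 2s$ and $\bar{X} + 2s$; at least 88.89% fall between $\bar{X} - 3s$ and $\bar{X} + 3s$]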

Page 63: Data Description

Example 17

The mean price of houses in a certain neighborhood is $50,000, and the standard deviation is $10,000. Find the price range for which at least 75% of the houses will sell.

Page 64: Data Description

Example 17 (Cont.)

Chebyshev's theorem states that at least three-fourths, or 75%, of the data values will fall within 2 standard deviations of the mean. Thus,

$50,000 + 2($10,000) = $70,000

and

$50,000 - 2($10,000) = $30,000

Hence, at least 75% of all homes sold in the area will have a price between $30,000 and $70,000.
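Chebyshev's bound and the interval in Example 17 can be sketched in Python as follows. Both function names are illustrative.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def interval_around_mean(mean, std_dev, k):
    """Return (mean - k*s, mean + k*s)."""
    return mean - k * std_dev, mean + k * std_dev


print(chebyshev_min_proportion(2))                   # 0.75
print(round(chebyshev_min_proportion(3) * 100, 2))   # 88.89
print(interval_around_mean(50_000, 10_000, 2))       # (30000, 70000)
```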

Page 65: Data Description

Example 18

A survey of local companies found that the mean amount of travel allowance for executives was $0.25 per mile. The standard deviation was $0.02. Using Chebyshev's theorem, find the minimum percentage of the data values that will fall between $0.20 and $0.30.

Page 66: Data Description

The Empirical (Normal) Rule

Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.

Page 67: Data Description

The Empirical (Normal) Rule

Approximately 68% of the data values will fall within 1 standard deviation of the mean.

Approximately 95% of the data values will fall within 2 standard deviations of the mean.

Approximately 99.7% (almost all) of the data values will fall within 3 standard deviations of the mean.

Page 68: Data Description

The Empirical (Normal) Rule

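[Figure: bell-shaped curve; about 68% of the values lie within $\bar{X} \pm 1s$, about 95% within $\bar{X} \pm 2s$, and about 99.7% within $\bar{X} \pm 3s$]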

Page 69: Data Description

Measures of Position

Standard Scores

A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The symbol for a standard score is z.

$z = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$

Page 70: Data Description

Measures of Position

Standard Scores

The z score represents the number of standard deviations that a data value falls above or below the mean.

Page 71: Data Description

Example 19 A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative position on the two tests.

Page 72: Data Description

Example 19 (Cont.)

For calculus, the z score is

$z = \dfrac{X - \bar{X}}{s} = \dfrac{65 - 50}{10} = 1.5$

For history, the z score is

$z = \dfrac{X - \bar{X}}{s} = \dfrac{30 - 25}{5} = 1.0$

Since the z score for calculus is larger, her relative position in the calculus class is higher than her relative position in the history class.
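The comparison in Example 19 can be expressed in a few lines of Python (the function name `z_score` is illustrative).

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus, z_history)   # 1.5 1.0
print(z_calculus > z_history)  # True: relatively better in calculus
```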

Page 73: Data Description

Percentiles

Percentiles divide the data set into 100 equal groups.

There are several mathematical methods for computing percentiles for data. These methods can be used to find approximate percentile rank of a data value or to find a data value corresponding to a given percentile.

Page 74: Data Description

Finding a Percentile Rank Corresponding to a Value

The percentile corresponding to a given value X is computed by using the following formula:

$\text{Percentile} = \dfrac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$

Page 75: Data Description

Example 20

A teacher gives a 20-point test to 10 students. The scores are shown here. Find the percentile rank of a score of 12.

18, 15, 12, 6, 8, 2, 3, 5, 20, 10

Page 76: Data Description

Example 20 (Cont.)

Arrange the data in order from lowest to highest:

2, 3, 5, 6, 8, 10, 12, 15, 18, 20

Six values are below 12, so

$\text{Percentile} = \dfrac{6 + 0.5}{10} \cdot 100\% = 65\text{th percentile}$

Thus, a student whose score was 12 did better than 65% of the class.
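The percentile-rank formula used in Example 20 can be sketched in Python as follows (the function name `percentile_rank` is illustrative).

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))  # 65.0
```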

Page 77: Data Description

Finding a Data Value Corresponding to a Given Percentile

o Arrange the data in order from lowest to highest.

o Compute c = (n·p)/100, where n is the total number of observations and p is the percentile.

o If c is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded-up value.

o If c is a whole number, use the value halfway between the cth and (c+1)th values when counting up from the lowest value.

Page 78: Data Description

Example 21

A teacher gives a 20-point test to 10 students. The scores are shown here. Find the value corresponding to the 25th percentile.

18, 15, 12, 6, 8, 2, 3, 5, 20, 10

Page 79: Data Description

Example 21 (Cont.)

o Arrange the data in order from lowest to highest: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20

o n = 10, p = 25, so c = (10 × 25)/100 = 2.5

o We round it up to get c = 3. Start at the lowest value and count over to the third value, which is 5. Hence, the value 5 corresponds to the 25th percentile.

Page 80: Data Description

Example 22

A teacher gives a 20-point test to 10 students. The scores are shown here. Find the value corresponding to the 60th percentile.

18, 15, 12, 6, 8, 2, 3, 5, 20, 10

Page 81: Data Description

Example 22 (Cont.)

o Arrange the data in order from smallest to largest: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20

o n = 10, p = 60, so c = (10 × 60)/100 = 6

o Since c is a whole number, we use the value halfway between the 6th and 7th values when counting up from the lowest value.

o The 60th percentile is (10 + 12)/2 = 11.
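The counting procedure used in Examples 21 and 22 can be sketched in Python. The function name `percentile_value` is illustrative, and the sketch assumes p is an integer from 1 to 99 so that c = n·p/100 can be handled with exact integer arithmetic.

```python
def percentile_value(values, p):
    """Data value at the p-th percentile (p an integer, 1 <= p <= 99)."""
    data = sorted(values)
    # c = n*p/100 split into whole part q and remainder r
    q, r = divmod(len(data) * p, 100)
    if r != 0:
        # c is not whole: round up to q + 1, i.e. index q (0-based)
        return data[q]
    # c is whole: halfway between the c-th and (c+1)-th values
    return (data[q - 1] + data[q]) / 2


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_value(scores, 25))  # 5    (Example 21)
print(percentile_value(scores, 60))  # 11.0 (Example 22)
```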

Page 82: Data Description

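Quartiles

Quartiles divide the distribution into 4 equal groups, separated by Q1, Q2, and Q3.

[Figure: number line from the lowest value L to the highest value H, split by Q1, Q2, and Q3 into four groups, each containing 25% of the data]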

Page 83: Data Description

Quartiles

Quartiles can be computed using the formula for computing percentiles.

o 1st quartile corresponds to the 25th percentile.

o 2nd quartile corresponds to the 50th percentile.

o 3rd quartile corresponds to the 75th percentile.

2nd quartile = 50th percentile = median

Page 84: Data Description

Example 23

Find first quartile, second quartile and third quartile for the data set 15, 13, 6, 5, 12, 50, 22, 18.

Arrange the data in order from smallest to the largest. 5 6 12 13 15 18 22 50

Page 85: Data Description

Example 23 (Cont.)

o First quartile = 25th percentile

c = (8 × 25)/100 = 2

Hence, the first quartile is the value halfway between the second and third values. That is, Q1 = (6 + 12)/2 = 9.

o Second quartile = 50th percentile

c = (8 × 50)/100 = 4

Hence, Q2 = (4th value + 5th value)/2 = (13 + 15)/2 = 14.

Page 86: Data Description

Example 23 (Cont.)

o Third quartile = 75th percentile

c = (8 × 75)/100 = 6

Hence, Q3 = (6th value + 7th value)/2 = (18 + 22)/2 = 20.

Page 87: Data Description

Interquartile Range, Quartile Deviation and Midquartile Range

o Interquartile range: IQR = Q3 - Q1

o Quartile deviation: QD = (Q3 - Q1)/2

o The quartile deviation is also referred to as the semi-interquartile range.

o Midquartile range: (Q3 + Q1)/2
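These three measures can be computed for the Example 23 data set with a short Python sketch. The helper `percentile_value` is an illustrative implementation of the percentile counting rule (assuming an integer percentile between 1 and 99), not a function from the slides.

```python
def percentile_value(values, p):
    """Data value at the p-th percentile (p an integer, 1 <= p <= 99)."""
    data = sorted(values)
    q, r = divmod(len(data) * p, 100)   # c = n*p/100 as whole part and remainder
    if r != 0:
        return data[q]                  # round c up and count to that position
    return (data[q - 1] + data[q]) / 2  # halfway between c-th and (c+1)-th values


data = [15, 13, 6, 5, 12, 50, 22, 18]
q1 = percentile_value(data, 25)
q3 = percentile_value(data, 75)

iqr = q3 - q1               # interquartile range
qd = (q3 - q1) / 2          # quartile deviation (semi-interquartile range)
midquartile = (q3 + q1) / 2

print(q1, q3)                # 9.0 20.0
print(iqr, qd, midquartile)  # 11.0 5.5 14.5
```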

Page 88: Data Description

Quartiles of Data Grouped into a Freq. Dist.

o First quartile

$Q_1 = LB + \dfrac{n/4 - CF}{f} \cdot w$

o Second quartile (Median)

$Q_2 = LB + \dfrac{n/2 - CF}{f} \cdot w$

o Third quartile

$Q_3 = LB + \dfrac{3n/4 - CF}{f} \cdot w$

Page 89: Data Description

Example 24

The office manager of the Mallard Glass Co. is investigating the ages in months of the company's PCs currently in use. The ages of 30 units selected at random were organized into a frequency distribution. Compute the quartile deviation.

Page 90: Data Description

Example 24 (Cont.)

Age (in months)    # of PCs
20-24              3
25-29              5
30-34              10
35-39              7
40-44              4
45-49              1

Page 91: Data Description

Example 24 (Cont.)

Age (in months)    # of PCs    Cumu. Freq.
20-24              3           3
25-29              5           8
30-34              10          18
35-39              7           25
40-44              4           29
45-49              1           30

Page 92: Data Description

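Example 24 (Cont.)

$Q_1 = 24.5 + \dfrac{30/4 - 3}{5} \cdot 5 = 29$

The third quartile falls in the class 35-39, whose lower boundary is 34.5:

$Q_3 = 34.5 + \dfrac{3(30)/4 - 18}{7} \cdot 5 \approx 37.71$

Hence, $QD = \dfrac{37.71 - 29}{2} \approx 4.36$ months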

Page 93: Data Description

Example 25

The weekly income of a sample of 60 part time employees of a fast-food restaurant chain was organized into the following frequency distribution. Compute the standard deviation and quartile deviation.

Page 94: Data Description

Outliers

An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.

An outlier can strongly affect the mean and standard deviation of a variable.

There are several ways to check a data set for outliers. One of them is shown as follows:

Page 95: Data Description

Example 25 (Cont.)

Weekly Incomes    Number of Employees
100-149           5
150-199           9
200-249           20
250-299           18
300-349           5
350-399           3

Page 96: Data Description

Outliers (Cont.)

Step 1: Arrange the data in order and find Q1 and Q3.

Step 2: Find the interquartile range: IQR = Q3 - Q1.

Step 3: Multiply the IQR by 1.5.

Step 4: Check the data set for any data value that is smaller than Q1 - 1.5·IQR or larger than Q3 + 1.5·IQR.

Page 97: Data Description

Outliers: Example 26

Check the following data set for outliers.

5, 6, 12, 13, 15, 18, 22, 50

We found Q1 = 9 and Q3 = 20.

Interquartile range: IQR = 20 - 9 = 11

Compute the dividing points:

Q1 - 1.5·IQR = 9 - 1.5(11) = -7.5

Q3 + 1.5·IQR = 20 + 1.5(11) = 36.5

The data value of 50 is greater than the upper dividing point of 36.5. So, the data value of 50 is considered an outlier.
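The outlier check in Example 26 can be sketched in Python, taking Q1 and Q3 as given. The function name `iqr_outliers` is illustrative.

```python
def iqr_outliers(values, q1, q3):
    """Return the lower and upper dividing points and the values outside them."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


data = [5, 6, 12, 13, 15, 18, 22, 50]
lower, upper, outliers = iqr_outliers(data, q1=9, q3=20)

print(lower, upper)  # -7.5 36.5
print(outliers)      # [50]
```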

Page 98: Data Description

Exploratory Data Analysis

o In exploratory data analysis (EDA) the data are presented graphically using a box plot (sometimes called a box-and-whisker plot).

o The purpose of exploratory data analysis is to examine data to find out what information can be discovered about the data, such as the center and the spread.

o EDA was developed by John Tukey.

Page 99: Data Description

Exploratory Data Analysis

A box plot can be used to graphically represent the data set. These plots involve five specific values:

o The lowest value (i.e., minimum)

o Q1

o Median (Q2)

o Q3

o The highest value (i.e., maximum)

Page 100: Data Description

Example 27 (Box-plot)

A stockbroker recorded the number of clients she saw each day over an 11-day period. The data are shown below. Construct a box plot for the data.

33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31

Page 101: Data Description

Example 27 (Box-plot) (Cont.)

o Arrange the data in order from lowest to highest: 23, 27, 29, 30, 31, 33, 38, 40, 42, 43, 51

o We obtain: the lowest value = 23, Q1 = 29, median Q2 = 33, Q3 = 42, and the highest value = 51.

[Figure: box plot on an axis from 20 to 50, with whiskers ending at 23 and 51, box edges at 29 and 42, and the median line at 33]

Page 102: Data Description

THE END!