Basic Stat Notes


Business Statistics

Graphs, Charts, and Tables – Describing Your Data

Dr. M. Raghunadh Acharya
04/15/23

Contents …

• Construct a frequency distribution both manually and with a computer

• Construct and interpret a histogram

• Create and interpret bar charts, pie charts, and stem-and-leaf diagrams

• Present and interpret data in line charts and scatter diagrams


Frequency Distributions

What is a Frequency Distribution?

• A frequency distribution is a list or a table containing the values of a variable (or a set of ranges within which the data fall) and the corresponding frequencies with which each value occurs (or frequencies with which data fall within each range)

Why Use Frequency Distributions?

• A frequency distribution is a way to summarize data

• The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data

Frequency Distribution: Discrete Data

• Discrete data: possible values are countable

Example: An advertiser asks 200 customers how many days per week they read the daily newspaper.

Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200

Relative Frequency

Relative Frequency: What proportion is in each category?

Number of days read    Frequency    Relative Frequency
0                      44           .22
1                      24           .12
2                      18           .09
3                      16           .08
4                      20           .10
5                      22           .11
6                      26           .13
7                      30           .15
Total                  200          1.00

$$\frac{44}{200} = .22$$

22% of the people in the sample report that they read the newspaper 0 days per week
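The relative frequencies can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the variable names are illustrative, and the counts are the ones from the newspaper example.

```python
# Frequency table from the newspaper example: days read per week -> count
frequencies = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}

total = sum(frequencies.values())  # number of customers surveyed

# Relative frequency = class frequency / total number of observations
relative = {days: count / total for days, count in frequencies.items()}

for days, count in frequencies.items():
    print(f"{days:>5} {count:>9} {relative[days]:>9.2f}")
print(f"Total {total:>9} {sum(relative.values()):>9.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no category was missed.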


Frequency Distribution: Continuous Data

• Continuous Data: may take on any value in some interval

Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

(Temperature is a continuous variable because it could be measured to any degree of precision desired)


Grouping Data by Classes

Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Find range: 58 - 12 = 46

• Select number of classes: 5 (usually between 5 and 20)

• Compute class width: 10 (46/5 = 9.2, then round up)

• Determine class boundaries (lower limits): 10, 20, 30, 40, 50

• Compute class midpoints: 15, 25, 35, 45, 55

• Count observations & assign to classes

Frequency Distribution Example

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Frequency Distribution

Class              Frequency    Relative Frequency
10 but under 20    3            .15
20 but under 30    6            .30
30 but under 40    5            .25
40 but under 50    4            .20
50 but under 60    2            .10
Total              20           1.00
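The grouping steps for the temperature data can be sketched in Python as follows. This is an illustrative sketch rather than the method used to build the slides: rounding the width up with `math.ceil` and starting the first class at a multiple of the width are assumptions that happen to reproduce the classes shown here.

```python
import math

# Daily high temperatures from the insulation example
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temps) - min(temps)          # 58 - 12 = 46
width = math.ceil(data_range / num_classes)   # 9.2 rounded up to 10
start = (min(temps) // width) * width         # assumption: first class starts at 10

# Count observations in each class "lower but under lower + width"
counts = [0] * num_classes
for t in temps:
    counts[(t - start) // width] += 1

n = len(temps)
relative = [c / n for c in counts]
for k, count in enumerate(counts):
    lower = start + k * width
    print(f"{lower} but under {lower + width}: {count:>2}  {relative[k]:.2f}")
```

Trying a different `num_classes` shows how the shape of the distribution changes with the class width.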


Histograms

• The classes or intervals are shown on the horizontal axis

• Frequency is measured on the vertical axis

• Bars of the appropriate heights can be used to represent the number of observations within each class

• Such a graph is called a histogram


Histogram Example

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: histogram of the data. Horizontal axis: Class Midpoints (15, 25, 35, 45, 55); vertical axis: Frequency (bar heights 3, 6, 5, 4, 2)]

No gaps between bars, since the data are continuous

Questions for Grouping Data into Classes

• 1. How wide should each interval be? (How many classes should be used?)

• 2. How should the endpoints of the intervals be determined?

• Often answered by trial and error, subject to user judgment

• The goal is to create a distribution that is neither too "jagged" nor too "blocky"

• The goal is to appropriately show the pattern of variation in the data

How Many Class Intervals?

• Many (narrow class intervals)
– May yield a very jagged distribution with gaps from empty classes
– Can give a poor indication of how frequency varies across classes

• Few (wide class intervals)
– May compress variation too much and yield a blocky distribution
– Can obscure important patterns of variation

[Figure: two histograms of Frequency vs. Temperature, one with wide classes (upper endpoints 0, 30, 60) and one with narrow classes (upper endpoints 4, 8, ..., 60)]

(X axis labels are upper class endpoints)

General Guidelines

Number of Data Points    Number of Classes
under 50                 5 - 7
50 – 100                 6 - 10
100 – 250                7 - 12
over 250                 10 - 20

– Class widths can typically be reduced as the number of observations increases

– Distributions with numerous observations are more likely to be smooth and have gaps filled since data are plentiful


Class Width

• The class width is the distance between the lowest possible value and the highest possible value for a frequency class

• The minimum class width is

$$W = \frac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Classes}}$$

Histograms in Excel

1. Select Tools / Data Analysis

2. Choose Histogram

3. Input data and bin ranges, then select Chart Output

Stem and Leaf Diagram

• A simple way to see distribution details in a data set

METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)


Example:

• Here, use the 10’s digit for the stem unit:

                     Stem | Leaf
• 12 is shown as        1 | 2
• 35 is shown as        3 | 5

Example:

• Completed stem-and-leaf diagram:

Stem | Leaves
   1 | 2 3 7
   2 | 1 4 4 6 7 7
   3 | 0 2 5 7 8
   4 | 1 3 4 6
   5 | 3 8
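A stem-and-leaf diagram for the temperature data can be produced with a short Python sketch. The code below is illustrative only; it assumes two-digit values and uses the 10's digit as the stem, as in the example.

```python
from collections import defaultdict

# Daily high temperatures from the insulation example
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

# Split each sorted value into stem (10's digit) and leaf (1's digit)
stems = defaultdict(list)
for value in sorted(temps):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

lines = [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
         for stem, leaves in sorted(stems.items())]
print("\n".join(lines))
```

Because the values are sorted before splitting, the leaves on each row appear in ascending order.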


Using other stem units

• Using the 100’s digit as the stem:

– Round off the 10’s digit to form the leaves

                        Stem | Leaf
– 613 would become         6 | 1
– 776 would become         7 | 8
– . . .
– 1224 becomes            12 | 2

Graphing Categorical Data

Categorical data can be graphed with:

• Pie Charts
• Pareto Diagrams
• Bar Charts

Bar and Pie Charts

• Bar charts and Pie charts are often used for qualitative (category) data

• Height of bar or size of pie slice shows the frequency or percentage for each category


Pie Chart Example

Investment Type    Amount (in thousands $)    Percentage
Stocks             46.5                       42.27
Bonds              32.0                       29.09
CD                 15.5                       14.09
Savings            16.0                       14.55
Total              110                        100

(Variables are Qualitative)

[Figure: pie chart "Current Investment Portfolio": Stocks 42%, Bonds 29%, Savings 15%, CD 14%]

Percentages are rounded to the nearest percent

Bar Chart Example

[Figure: horizontal bar chart "Investor's Portfolio". Categories: Stocks, Bonds, CD, Savings; horizontal axis: Amount in $1000's (0 to 50)]

Pareto Diagram Example

[Figure: Pareto diagram for the categories Stocks, Bonds, Savings, CD. Left axis: % invested in each category (bar graph, 0% to 45%); right axis: cumulative % invested (line graph, 0% to 100%)]
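The two series plotted in a Pareto diagram, the percentage in each category in descending order and the running cumulative percentage, can be computed as in the sketch below. The amounts are the portfolio figures from the pie chart example; the variable names are illustrative.

```python
# Investment amounts in thousands of dollars (from the portfolio example)
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(amounts.values())

# A Pareto diagram orders the categories from largest to smallest
ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)

rows = []
cumulative = 0.0
for category, amount in ordered:
    percent = 100 * amount / total   # bar height
    cumulative += percent            # point on the cumulative line
    rows.append((category, percent, cumulative))
    print(f"{category:<8} {percent:6.2f}% {cumulative:7.2f}%")
```

Sorting before accumulating is what distinguishes a Pareto diagram from an ordinary bar chart with a cumulative line.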


Bar Chart Example

[Figure: bar chart "Newspaper readership per week". Horizontal axis: Number of days newspaper is read per week (0 to 7); vertical axis: Frequency (0 to 50)]

Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200

Tabulating and Graphing Multivariate Categorical Data

• Investment in thousands of dollars

Investment Category    Investor A    Investor B    Investor C    Total
Stocks                 46.5          55            27.5          129
Bonds                  32.0          44            19.0          95
CD                     15.5          20            13.5          49
Savings                16.0          28            7.0           51
Total                  110.0         147           67.0          324

Tabulating and Graphing Multivariate Categorical Data (continued)

• Side by side charts

[Figure: side-by-side horizontal bar chart "Comparing Investors". Categories: Stocks, Bonds, CD, Savings; series: Investor A, Investor B, Investor C; horizontal axis: 0 to 60]

Side-by-Side Chart Example

• Sales by quarter for three sales territories:

         1st Qtr    2nd Qtr    3rd Qtr    4th Qtr
East     20.4       27.4       59         20.4
West     30.6       38.6       34.6       31.6
North    45.9       46.9       45         43.9

[Figure: side-by-side bar chart of the table above. Horizontal axis: 1st Qtr to 4th Qtr; vertical axis: 0 to 60; series: East, West, North]

Line Charts and Scatter Diagrams

• Line charts show values of one variable vs. time
– Time is traditionally shown on the horizontal axis

• Scatter diagrams show points for bivariate data
– One variable is measured on the vertical axis and the other variable is measured on the horizontal axis

Line Chart Example

[Figure: line chart "U.S. Inflation Rate". Horizontal axis: Year (1984 to 2002); vertical axis: Inflation Rate (%), 0 to 6]

Year    Inflation Rate
1985    3.56
1986    1.86
1987    3.65
1988    4.14
1989    4.82
1990    5.40
1991    4.21
1992    3.01
1993    2.99
1994    2.56
1995    2.83
1996    2.95
1997    2.29
1998    1.56
1999    2.21
2000    3.36
2001    2.85
2002    1.58

Scatter Diagram Example

[Figure: scatter diagram "Production Volume vs. Cost per Day". Horizontal axis: Volume per Day (0 to 70); vertical axis: Cost per Day (0 to 250)]

Volume per day    Cost per day
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200

Types of Relationships

• Linear Relationships

• Curvilinear Relationships

• No Relationship

[Figure: a pair of X-Y scatter plots illustrating each type of relationship]

Chapter Summary

• Data in raw form are usually not easy to use for decision making. Some type of organization is needed: a table or a graph

• Techniques reviewed in this chapter:
– Frequency Distributions and Histograms
– Bar Charts and Pie Charts
– Stem and Leaf Diagrams
– Line Charts and Scatter Diagrams

Summarization Measures

Summarization measures represent the data with a single number or a few numbers. They help to describe a data set and to compare data sets. Population measures can be estimated from the summary measures of a sample.

The measures used to represent the data are as follows:

1. Measures of Center and Location: mean, median, mode, geometric mean, midrange

2. Other Measures of Location: weighted mean, percentiles, quartiles

3. Measures of Variation: range, interquartile range, variance and standard deviation, coefficient of variation

Summary Measures

Describing Data Numerically

• Center and Location: Mean, Median, Mode, Weighted Mean

• Other Measures of Location: Percentiles, Quartiles

• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

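Overview: Measures of Center and Location

Center and Location: Mean, Median, Mode, Weighted Mean

Mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Weighted Mean:

$$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} \qquad \mu_W = \frac{\sum w_i x_i}{\sum w_i}$$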

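Mean (Arithmetic Average)

• The Mean is the arithmetic average of data values

– Sample mean (n = Sample Size):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

– Population mean (N = Population Size):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$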

Mean (Arithmetic Average)

• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

[Figure: two dot plots on a 0 to 10 scale, one marked Mean = 3 and one marked Mean = 4]

$$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \qquad \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$

Median

• Not affected by extreme values

• In an ordered array, the median is the “middle” number
– If n or N is odd, the median is the middle number
– If n or N is even, the median is the average of the two middle numbers

[Figure: two dot plots on a 0 to 10 scale, each marked Median = 3]

Mode

• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes

[Figure: two dot plots, one on a 0 to 14 scale marked Mode = 5 and one on a 0 to 6 scale marked No Mode]

Weighted Mean

• Used when values are grouped by frequency or relative importance

Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted Mean Days to Complete:

$$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4)(5) + (12)(6) + (8)(7) + (2)(8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days}$$
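The weighted mean for the repair-project example can be reproduced with a minimal Python sketch. The list names are illustrative; the frequencies act as the weights.

```python
# Repair-project example: days to complete and how often each value occurred
days = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]   # weights w_i

# Weighted mean = sum(w_i * x_i) / sum(w_i)
weighted_sum = sum(w * x for w, x in zip(frequency, days))
weighted_mean = weighted_sum / sum(frequency)

print(weighted_sum, sum(frequency))   # 164 26
print(round(weighted_mean, 2))        # 6.31
```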


Review Example

• Five houses on a hill by the beach

House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000

[Figure: five houses on a hill, labeled $2,000 K, $500 K, $300 K, $100 K, $100 K]

Summary Statistics

House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum: $3,000,000)

• Mean: ($3,000,000/5) = $600,000

• Median: middle value of ranked data = $300,000

• Mode: most frequent value = $100,000
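The three measures for the house-price example can be checked with Python's standard `statistics` module. This is a small sketch for verification, not part of the original slides.

```python
import statistics

# House prices from the review example
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of values / number of values
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)   # 600000 300000 100000
```

The single $2,000,000 house pulls the mean to twice the median, which illustrates why the mean is sensitive to outliers.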


Which measure of location is the “best”?

• Mean is generally used, unless extreme values (outliers) exist

• Then median is often used, since the median is not sensitive to extreme values.
– Example: Median home prices may be reported for a region, since the median is less sensitive to outliers

Shape of a Distribution

• Describes how data is distributed

• Symmetric or skewed

Left-Skewed     Mean < Median < Mode    (Longer tail extends to left)
Symmetric       Mean = Median = Mode
Right-Skewed    Mode < Median < Mean    (Longer tail extends to right)

Other Location Measures

Other Measures of Location: Percentiles and Quartiles

Percentiles

The pth percentile in a data array (where 0 ≤ p ≤ 100):
• p% are less than or equal to this value
• (100 – p)% are greater than or equal to this value

Quartiles

• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile = median
• 3rd quartile = 75th percentile

Percentiles

• The pth percentile in an ordered array of n values is the value in ith position, where

$$i = \frac{p}{100}(n + 1)$$

• Example: The 60th percentile in an ordered array of 19 values is the value in 12th position:

$$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$$

Quartiles

• Quartiles split the ranked data into 4 equal groups

25% | 25% | 25% | 25%
    Q1    Q2    Q3

• Example: Find the first quartile

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22    (n = 9)

Q1 = 25th percentile, so find the $\frac{25}{100}(9 + 1) = 2.5$ position, so use the value half way between the 2nd and 3rd values, so Q1 = 12.5
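The position rule i = (p/100)(n + 1) can be sketched in Python as below. The function names are hypothetical, and two choices are assumptions beyond what the slides state: fractional positions are handled by linear interpolation (which gives the "half way between" value for position 2.5), and positions outside 1..n are clamped to the smallest or largest value.

```python
import math

def percentile_position(p, n):
    """1-based position i = (p/100)(n + 1) in an ordered array of n values."""
    return p * (n + 1) / 100

def percentile(values, p):
    """Value at the pth percentile position, interpolating between neighbors."""
    data = sorted(values)
    n = len(data)
    i = percentile_position(p, n)
    lower = math.floor(i)
    if lower < 1:            # assumption: clamp below the first position
        return data[0]
    if lower >= n:           # assumption: clamp above the last position
        return data[-1]
    fraction = i - lower
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(percentile_position(60, 19))   # 12.0
print(percentile(sample, 25))        # 12.5
```

Other textbooks and software use slightly different position rules, so results for small samples can differ from this one.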


Box and Whisker Plot

• A graphical display of data using the 5-number summary:

Minimum -- Q1 -- Median -- Q3 -- Maximum

Example:

[Figure: box and whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the data in each of the four segments]

Shape of Box and Whisker Plots

• The Box and central line are centered between the endpoints if data is symmetric around the median

• A Box and Whisker plot can be shown in either vertical or horizontal format

Distribution Shape and Box and Whisker Plot

[Figure: box and whisker plots, each marked Q1, Q2, Q3, for Left-Skewed, Symmetric, and Right-Skewed distributions]

Box-and-Whisker Plot Example

• Below is a Box-and-Whisker plot for the following data:

0 2 2 2 3 3 4 5 5 10 27

Min    Q1    Q2    Q3    Max
0      2     3     5     27

[Figure: box-and-whisker plot with Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27]

• This data is very right skewed, as the plot depicts
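The 5-number summary behind this plot can be computed with the standard library. In this sketch, `statistics.quantiles` with `method="exclusive"` is used because it places quartiles at position (p/100)(n + 1), the same rule as in these notes; other methods may give different quartiles.

```python
import statistics

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# Quartiles at positions (p/100)(n + 1), matching the rule used in the notes
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)
```

The long distance from Q3 to the maximum compared with the distance from the minimum to Q1 is what makes the plot right skewed.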


Measures of Variation

• Range
• Interquartile Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation

Variation

• Measures of variation give information on the spread or variability of the data values.

[Figure: two distributions with the same center but different variation]

Range

• Difference between the largest and the smallest observations.

$$\text{Range} = x_{\text{maximum}} - x_{\text{minimum}}$$

Example:

[Figure: dot plot on a 0 to 14 scale]

Range = 14 - 1 = 13

Disadvantages of the Range

• Ignores the way in which data are distributed

[Figure: two dot plots on a 7 to 12 scale with different distributions, each with Range = 12 - 7 = 5]

• Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5      Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120    Range = 120 - 1 = 119

Interquartile Range

• Can eliminate some outlier problems by using the interquartile range

• Eliminate some high- and low-valued observations and calculate the range from the remaining values.

• Interquartile range = 3rd quartile – 1st quartile

Interquartile Range

Example:

X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70

(25% of the data lie in each of the four segments)

Interquartile range = 57 – 30 = 27

Variance

• Average of squared deviations of values from the mean

– Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

– Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Standard Deviation

• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

– Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

– Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

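Calculation Example: Sample Standard Deviation

Sample Data ($x_i$): 10 12 14 15 17 18 18 24

n = 8    Mean = $\bar{x}$ = 16

$$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$$

$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$$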
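The calculation can be recomputed step by step from the sample data with a short Python sketch. The variable names are illustrative; the formula is the sample standard deviation with n - 1 in the denominator.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                                    # 128 / 8 = 16.0

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)          # divide by n - 1
sample_std_dev = math.sqrt(sample_variance)

print(mean, squared_deviations)
print(round(sample_variance, 4), round(sample_std_dev, 4))
```

`statistics.stdev(data)` from the standard library returns the same sample standard deviation and can serve as a cross-check.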


Comparing Standard Deviations

[Figure: three dot plots on an 11 to 21 scale, each with the same mean]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57

Coefficient of Variation

• Measures relative variation

• Always in percentage (%)

• Shows variation relative to mean

• Is used to compare two or more sets of data measured in different units

Population:

$$CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$$

Sample:

$$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$$

Comparing Coefficient of Variation

• Stock A:
– Average price last year = $50
– Standard deviation = $5

$$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$

• Stock B:
– Average price last year = $100
– Standard deviation = $5

$$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price
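The stock comparison can be sketched in Python as follows. The helper name `coefficient_of_variation` is hypothetical and simply applies CV = (s / mean) · 100%.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(cv_a, cv_b)   # 10.0 5.0
```

Because the CV is unit-free, the same function can compare data sets measured in different units.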


The Empirical Rule

• If the data distribution is bell-shaped, then the interval:
– $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
– $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
– $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample

[Figure: bell-shaped curves centered at μ with the intervals μ ± 1σ (68%), μ ± 2σ (95%), and μ ± 3σ (99.7%) marked]

Tchebysheff’s Theorem

• Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within k standard deviations of the mean

• Examples:

At least                  within
$(1 - 1/1^2)$ = 0%        k=1 (μ ± 1σ)
$(1 - 1/2^2)$ = 75%       k=2 (μ ± 2σ)
$(1 - 1/3^2)$ = 89%       k=3 (μ ± 3σ)
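The bound in the theorem is easy to tabulate. The sketch below uses a hypothetical helper, `chebyshev_bound`, that returns the minimum proportion of values within k standard deviations of the mean.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # The bound is only informative for k > 1; it is floored at 0 otherwise
    return max(0.0, 1 - 1 / k**2)

for k in (1, 2, 3):
    print(f"k={k}: at least {chebyshev_bound(k):.0%}")
```

For bell-shaped data the empirical rule gives much tighter figures; this bound holds for any distribution.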


Standardized Data Values

• A standardized data value refers to the number of standard deviations a value is from the mean

• Standardized data values are sometimes referred to as z-scores

Standardized Population Values

$$z = \frac{x - \mu}{\sigma}$$

where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score (number of standard deviations x is from μ)

Standardized Sample Values

$$z = \frac{x - \bar{x}}{s}$$

where:
• x = original data value
• $\bar{x}$ = sample mean
• s = sample standard deviation
• z = standard score (number of standard deviations x is from $\bar{x}$)

Remark: The standardized sample values are used for constructing the confidence limits for the population parameters.
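A standardized sample value can be computed as in the following sketch. The function name `z_score` is hypothetical, and reusing the eight-value sample from the standard deviation example is an illustrative choice, not something the slides do.

```python
import statistics

def z_score(x, center, spread):
    """Number of standard deviations x lies from the center."""
    return (x - center) / spread

# Illustrative data: the sample from the standard deviation example
data = [10, 12, 14, 15, 17, 18, 18, 24]
x_bar = statistics.mean(data)    # sample mean
s = statistics.stdev(data)       # sample standard deviation (n - 1)

z_largest = z_score(24, x_bar, s)
print(round(z_largest, 3))
```

Passing μ and σ instead of the sample mean and s gives the standardized population value with the same function.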
