Business Statistics
Graphs, Charts, and Tables – Describing Your Data
Dr. M. Raghunadh Acharya, 04/15/23
Contents …
• Construct a frequency distribution both manually and with a computer
• Construct and interpret a histogram
• Create and interpret bar charts, pie charts, and stem-and-leaf diagrams
• Present and interpret data in line charts and scatter diagrams
Frequency Distributions
What is a Frequency Distribution?
• A frequency distribution is a list or a table …
• containing the values of a variable (or a set of ranges within which the data falls) ...
• and the corresponding frequencies with which each value occurs (or frequencies with which data falls within each range)
Why Use Frequency Distributions?
• A frequency distribution is a way to summarize data
• The distribution condenses the raw data into a more useful form...
• and allows for a quick visual interpretation of the data
Frequency Distribution: Discrete Data
• Discrete data: possible values are countable
Example: An advertiser asks 200 customers how many days per week they read the daily newspaper.
Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200
Relative Frequency
Relative frequency: what proportion is in each category?
Number of days read    Frequency    Relative Frequency
0                      44           .22
1                      24           .12
2                      18           .09
3                      16           .08
4                      20           .10
5                      22           .11
6                      26           .13
7                      30           .15
Total                  200          1.00
$\frac{44}{200} = .22$
22% of the people in the sample report that they read the newspaper 0 days per week
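The relative frequency column can be reproduced with a short Python sketch. The variable names are illustrative and not part of the original slides; the counts are the ones from the newspaper example.

```python
# Frequencies from the newspaper example: days read per week -> number of customers
frequency = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}

total = sum(frequency.values())  # 200 customers in the sample

# Relative frequency = class frequency / total number of observations
relative_frequency = {days: count / total for days, count in frequency.items()}

for days, count in frequency.items():
    print(f"{days}  {count:3d}  {relative_frequency[days]:.2f}")
```

The relative frequencies sum to 1.00, which is a quick check that no category was missed.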
Frequency Distribution: Continuous Data
• Continuous Data: may take on any value in some interval
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
(Temperature is a continuous variable because it could be measured to any degree of precision desired)
Grouping Data by Classes
• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 20)
• Compute class width: 10 (46/5 = 9.2, then round up)
• Determine class boundaries: 10, 20, 30, 40, 50, 60
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations and assign to classes
Frequency Distribution Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class              Frequency    Relative Frequency
10 but under 20    3            .15
20 but under 30    6            .30
30 but under 40    5            .25
40 but under 50    4            .20
50 but under 60    2            .10
Total              20           1.00
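The grouping steps for the temperature data can be sketched in Python as follows. Rounding the class width up to the next multiple of 10 and starting the first class at a multiple of the width are assumptions made here for illustration; the slides only say to "round off".

```python
import math

temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temperatures) - min(temperatures)  # 58 - 12 = 46

# Assumption: round 46 / 5 = 9.2 up to the next multiple of 10
class_width = math.ceil(data_range / num_classes / 10) * 10

# Assumption: the first class starts at a multiple of the width at or below the minimum
first_lower = (min(temperatures) // class_width) * class_width

# Each class covers "lower but under upper"
counts = [0] * num_classes
for t in temperatures:
    counts[(t - first_lower) // class_width] += 1

for k, count in enumerate(counts):
    lower = first_lower + k * class_width
    print(f"{lower} but under {lower + class_width}: "
          f"{count}  {count / len(temperatures):.2f}")
```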
Histograms
• The classes or intervals are shown on the horizontal axis
• Frequency is measured on the vertical axis
• Bars of the appropriate heights can be used to represent the number of observations within each class
• Such a graph is called a histogram
Histogram Example
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of the temperature data. Horizontal axis: class midpoints (15, 25, 35, 45, 55); vertical axis: frequency. Bar heights: 3, 6, 5, 4, 2.]
• No gaps between bars, since the data are continuous
Questions for Grouping Data into Classes
• 1. How wide should each interval be? (How many classes should be used?)
• 2. How should the endpoints of the intervals be determined?
• Often answered by trial and error, subject to user judgment
• The goal is to create a distribution that is neither too "jagged" nor too "blocky"
• The goal is to appropriately show the pattern of variation in the data
How Many Class Intervals?
• Many (narrow class intervals)
– May yield a very jagged distribution with gaps from empty classes
– Can give a poor indication of how frequency varies across classes
• Few (wide class intervals)
– May compress variation too much and yield a blocky distribution
– Can obscure important patterns of variation
[Figure: two histograms of temperature (horizontal axis) against frequency (vertical axis), one with wide classes (upper endpoints 0, 30, 60) and one with narrow classes (upper endpoints 4, 8, ..., 60). X axis labels are upper class endpoints.]
General Guidelines
Number of Data Points    Number of Classes
under 50                 5 - 7
50 – 100                 6 - 10
100 – 250                7 - 12
over 250                 10 - 20
– Class widths can typically be reduced as the number of observations increases
– Distributions with numerous observations are more likely to be smooth and have gaps filled since data are plentiful
Class Width
• The class width is the distance between the lowest possible value and the highest possible value for a frequency class
• The minimum class width is
$W = \frac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Classes}}$
Histograms in Excel
1. Select Tools / Data Analysis
2. Choose Histogram
3. Input data and bin ranges, then select Chart Output
Stem and Leaf Diagram
• A simple way to see distribution details in a data set
METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
Example:
• Here, use the 10’s digit for the stem unit:
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• 12 is shown as stem 1, leaf 2
• 35 is shown as stem 3, leaf 5
Example:
• Completed Stem-and-leaf diagram:
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Stem    Leaves
1       2 3 7
2       1 4 4 6 7 7
3       0 2 5 7 8
4       1 3 4 6
5       3 8
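A stem-and-leaf diagram for the temperature data can be built with a few lines of Python. This is a minimal sketch; the function name `stem_and_leaf` and its `stem_unit` parameter are illustrative choices, not something defined in the slides.

```python
from collections import defaultdict

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

def stem_and_leaf(values, stem_unit=10):
    """Group each value's trailing digit (leaf) under its leading digits (stem)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_unit)
        rows[stem].append(leaf)
    return dict(rows)

diagram = stem_and_leaf(data)
for stem, leaves in diagram.items():
    print(stem, " ".join(str(leaf) for leaf in leaves))
```

Because the values are sorted before grouping, both the stems and the leaves within each stem come out in ascending order.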
Using other stem units
• Using the 100’s digit as the stem:
– Round off the 10’s digit to form the leaves
Stem    Leaf
6       1       (613 would become 6 1)
7       8       (776 would become 7 8)
. . .
12      2       (1224 becomes 12 2)
Graphing Categorical Data
Categorical Data
Pie Charts
Pareto Diagram
Bar Charts
Bar and Pie Charts
• Bar charts and Pie charts are often used for qualitative (category) data
• Height of bar or size of pie slice shows the frequency or percentage for each category
Pie Chart Example
Investment Type    Amount (in thousands $)    Percentage
Stocks             46.5                       42.27
Bonds              32.0                       29.09
CD                 15.5                       14.09
Savings            16.0                       14.55
Total              110                        100
(Variables are qualitative)
[Figure: pie chart "Current Investment Portfolio": Stocks 42%, Bonds 29%, Savings 15%, CD 14%. Percentages are rounded to the nearest percent.]
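The percentage column behind the pie slices is each amount divided by the portfolio total. A small Python sketch, with illustrative variable names:

```python
# Investment amounts in thousands of dollars, from the portfolio table
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(amounts.values())  # 110.0

# Percentage of the portfolio held in each category
percentages = {name: 100 * amount / total for name, amount in amounts.items()}

for name, pct in percentages.items():
    print(f"{name:8s} {amounts[name]:5.1f} {pct:6.2f}%")
```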
Bar Chart Example
[Figure: horizontal bar chart "Investor's Portfolio". Categories: Stocks, Bonds, CD, Savings; horizontal axis: amount in $1000's (0 to 50).]
Pareto Diagram Example
[Figure: Pareto diagram of the investment portfolio. Categories in descending order: Stocks, Bonds, Savings, CD. Bars show % invested in each category (axis 0% to 45%); the line graph shows cumulative % invested (axis 0% to 100%).]
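The two series in a Pareto diagram, the sorted category percentages and their running total, can be computed as in the sketch below. It reuses the portfolio amounts from the pie chart example; the variable names are illustrative.

```python
from itertools import accumulate

# Investment amounts in thousands of dollars
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}
total = sum(amounts.values())

# A Pareto diagram orders categories from largest to smallest
ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)

shares = [100 * amount / total for _, amount in ordered]  # bar heights (%)
cumulative = list(accumulate(shares))                     # line graph (%)

for (name, _), share, running in zip(ordered, shares, cumulative):
    print(f"{name:8s} {share:6.2f}% {running:7.2f}%")
```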
Bar Chart Example
[Figure: bar chart "Newspaper readership per week". Horizontal axis: number of days newspaper is read per week (0 to 7); vertical axis: frequency (0 to 50).]
Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200
Tabulating and Graphing Multivariate Categorical Data
• Investment in thousands of dollars
Investment Category    Investor A    Investor B    Investor C    Total
Stocks                 46.5          55            27.5          129
Bonds                  32.0          44            19.0          95
CD                     15.5          20            13.5          49
Savings                16.0          28            7.0           51
Total                  110.0         147           67.0          324
Tabulating and Graphing Multivariate Categorical Data (continued)
• Side-by-side charts
[Figure: side-by-side bar chart "Comparing Investors". Categories: Stocks, Bonds, CD, Savings; one bar each for Investor A, Investor B, and Investor C; axis from 0 to 60.]
Side-by-Side Chart Example
• Sales by quarter for three sales territories:
         1st Qtr    2nd Qtr    3rd Qtr    4th Qtr
East     20.4       27.4       59         20.4
West     30.6       38.6       34.6       31.6
North    45.9       46.9       45         43.9
[Figure: side-by-side bar chart of sales by quarter (1st Qtr to 4th Qtr) for the East, West, and North territories; vertical axis from 0 to 60.]
Line Charts and Scatter Diagrams
• Line charts show values of one variable vs. time
– Time is traditionally shown on the horizontal axis
• Scatter diagrams show points for bivariate data
– One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
Line Chart Example
[Figure: line chart "U.S. Inflation Rate". Horizontal axis: year (1984 to 2002); vertical axis: inflation rate (%), 0 to 6.]
Year    Inflation Rate
1985    3.56
1986    1.86
1987    3.65
1988    4.14
1989    4.82
1990    5.40
1991    4.21
1992    3.01
1993    2.99
1994    2.56
1995    2.83
1996    2.95
1997    2.29
1998    1.56
1999    2.21
2000    3.36
2001    2.85
2002    1.58
Scatter Diagram Example
[Figure: scatter diagram "Production Volume vs. Cost per Day". Horizontal axis: volume per day (0 to 70); vertical axis: cost per day (0 to 250).]
Volume per day    Cost per day
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200
Types of Relationships
• Linear relationships
• Curvilinear relationships
• No relationship
[Figure: example scatter diagrams for each type, with X on the horizontal axis and Y on the vertical axis.]
Chapter Summary
• Data in raw form are usually not easy to use for decision making. Some type of organization is needed:
– Table
– Graph
• Techniques reviewed in this chapter:
– Frequency Distributions and Histograms
– Bar Charts and Pie Charts
– Stem and Leaf Diagrams
– Line Charts and Scatter Diagrams
Summarization measures …
Summarization measures represent the data with a single number or a few numbers. They help describe a data set and compare data sets. Population measures can be estimated from the summary measures of a sample.
The measures used to represent the data are:
1. Measures of center and location: mean, median, mode, geometric mean, midrange
2. Other measures of location: weighted mean, percentiles, quartiles
3. Measures of variation: range, interquartile range, variance and standard deviation, coefficient of variation
Summary Measures
Describing Data Numerically
• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
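Overview: Measures of Center and Location
Center and Location
• Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (sample), $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ (population)
• Median
• Mode
• Weighted Mean: $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$ (sample), $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$ (population)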
Mean (Arithmetic Average)
• The mean is the arithmetic average of data values
– Sample mean (n = sample size): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
– Population mean (N = population size): $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
[Figure: two number lines from 0 to 10, one with Mean = 3 and one with Mean = 4.]
$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$
Median
• Not affected by extreme values
• In an ordered array, the median is the “middle” number
– If n or N is odd, the median is the middle number
– If n or N is even, the median is the average of the two middle numbers
[Figure: two number lines from 0 to 10; Median = 3 in both.]
Mode
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
[Figure: two number lines, one from 0 to 14 with Mode = 5 and one from 0 to 6 with no mode.]
Weighted Mean
• Used when values are grouped by frequency or relative importance
Example: Sample of 26 Repair Projects
Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2
Weighted Mean Days to Complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
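The weighted mean for the repair-project example can be checked with a short Python sketch, using the frequencies as weights. Variable names are illustrative.

```python
# Repair-project example: the frequency of each completion time acts as its weight
days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]

weighted_sum = sum(w * x for w, x in zip(frequency, days_to_complete))  # 164
total_weight = sum(frequency)                                           # 26

weighted_mean = weighted_sum / total_weight
print(round(weighted_mean, 2))  # 6.31
```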
Review Example
• Five houses on a hill by the beach
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
Summary Statistics
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum: $3,000,000)
• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
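The three summary statistics for the house prices can be reproduced with Python's standard `statistics` module, as a quick sketch:

```python
import statistics

# House prices from the review example, in dollars
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # pulled upward by the $2,000,000 house
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single expensive house doubles the mean relative to the median, which is why the median is often preferred when outliers are present.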
Which measure of location is the “best”?
• The mean is generally used, unless extreme values (outliers) exist
• Then the median is often used, since the median is not sensitive to extreme values
– Example: Median home prices may be reported for a region because they are less sensitive to outliers
Shape of a Distribution
• Describes how data are distributed
• Symmetric or skewed
– Left-skewed (longer tail extends to left): Mean < Median < Mode
– Symmetric: Mean = Median = Mode
– Right-skewed (longer tail extends to right): Mode < Median < Mean
Other Location Measures
Other Measures of Location
• Percentiles: the pth percentile in a data array (where 0 ≤ p ≤ 100):
– p% are less than or equal to this value
– (100 – p)% are greater than or equal to this value
• Quartiles:
– 1st quartile = 25th percentile
– 2nd quartile = 50th percentile = median
– 3rd quartile = 75th percentile
Percentiles
• The pth percentile in an ordered array of n values is the value in the ith position, where
$i = \frac{p}{100}(n + 1)$
• Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position:
$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$
Quartiles
• Quartiles split the ranked data into 4 equal groups (25% in each group, separated by Q1, Q2, and Q3)
• Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 = 25th percentile, so find the $\frac{25}{100}(9 + 1) = 2.5$ position,
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
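The position rule $i = \frac{p}{100}(n + 1)$ can be sketched in Python as below. The helper name `percentile` is hypothetical. The slides only show the halfway case, so using linear interpolation for every fractional position, and clamping the position to the range 1 to n, are assumptions made for this sketch.

```python
import math

def percentile(ordered, p):
    """Value at position i = (p / 100) * (n + 1) in an ordered array.

    Positions are 1-based. Assumption: a fractional position is resolved by
    linear interpolation between its two neighbors, and the position is
    clamped to the range 1..n.
    """
    n = len(ordered)
    i = p * (n + 1) / 100
    i = min(max(i, 1), n)
    lower = math.floor(i)
    upper = math.ceil(i)
    fraction = i - lower
    return ordered[lower - 1] + fraction * (ordered[upper - 1] - ordered[lower - 1])

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]

q1 = percentile(data, 25)  # position 2.5: halfway between 12 and 13
q2 = percentile(data, 50)  # position 5: the median
q3 = percentile(data, 75)  # position 7.5: halfway between 18 and 21

# Position of the 60th percentile in an ordered array of 19 values
position_60th = 60 * (19 + 1) / 100

print(q1, q2, q3, position_60th)
```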
Box and Whisker Plot
• A graphical display of data using the 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
Example:
[Figure: box and whisker plot marking the minimum, 1st quartile, median, 3rd quartile, and maximum; 25% of the values lie in each of the four segments.]
Shape of Box and Whisker Plots
• The box and central line are centered between the endpoints if the data are symmetric around the median
• A box and whisker plot can be shown in either vertical or horizontal format
Distribution Shape and Box and Whisker Plot
[Figure: box and whisker plots, each marking Q1, Q2, and Q3, for left-skewed, symmetric, and right-skewed distributions.]
Box-and-Whisker Plot Example
• Below is a box-and-whisker plot for the following data:
0 2 2 2 3 3 4 5 5 10 27
[Figure: box-and-whisker plot of the data.]
Min    Q1    Q2    Q3    Max
0      2     3     5     27
• These data are very right-skewed, as the plot depicts
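The 5-number summary behind this plot can be computed with the standard library. In the sketch below, `statistics.quantiles` with `method="exclusive"` places the quartiles at the $(n + 1)$-based positions, which matches the percentile position rule used in these slides; the variable names are illustrative.

```python
import statistics

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# n=4 asks for the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
interquartile_range = q3 - q1

print(five_number_summary, interquartile_range)
```

The distance from Q3 to the maximum (22) is far larger than the distance from the minimum to Q1 (2), which is the right skew the plot shows.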
Measures of Variation
Variation
• Range
• Interquartile Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation
• Coefficient of Variation
Variation
• Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center and different variation.]
Range
• Difference between the largest and the smallest observations
$\text{Range} = x_{\text{maximum}} - x_{\text{minimum}}$
Example:
[Figure: observations plotted on a number line from 0 to 14.]
Range = 14 - 1 = 13
Disadvantages of the Range
• Ignores the way in which data are distributed
[Figure: two differently distributed data sets on number lines from 7 to 12; in both, Range = 12 - 7 = 5.]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Interquartile Range
• Can eliminate some outlier problems by using the interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values
• Interquartile range = 3rd quartile – 1st quartile
Example:
X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70
(25% of the values lie in each of the four intervals)
Interquartile range = 57 – 30 = 27
Variance
• Average of squared deviations of values from the mean
– Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
– Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
– Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
– Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
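Calculation Example: Sample Standard Deviation
Sample Data ($x_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{x}$ = 16
$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$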
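The sample standard deviation for this data set can be worked through step by step in Python and cross-checked against `statistics.stdev`, which also divides by n - 1. A minimal sketch with illustrative variable names:

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n  # 16.0

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)  # divide by n - 1 for a sample
sample_std = math.sqrt(sample_variance)

print(squared_deviations, round(sample_variance, 4), round(sample_std, 4))

# Cross-check against the standard library
assert math.isclose(sample_std, statistics.stdev(data))
```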
Comparing Standard Deviations
[Figure: three dot plots on a common axis from 11 to 21.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different units
Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
Sample: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
Comparing Coefficient of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
– Average price last year = $100
– Standard deviation = $5
$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price
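The stock comparison reduces to one small function. This is an illustrative sketch; the function name is not from the slides.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)   # Stock A: mean $50, s = $5
cv_b = coefficient_of_variation(5, 100)  # Stock B: mean $100, s = $5

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```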
The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
– $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
– $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
– $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample
Tchebysheff’s Theorem
• Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within $k$ standard deviations of the mean
• Examples:
At least                  within
$(1 - 1/1^2) = 0\%$       k=1 (μ ± 1σ)
$(1 - 1/2^2) = 75\%$      k=2 (μ ± 2σ)
$(1 - 1/3^2) = 89\%$      k=3 (μ ± 3σ)
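The bound in Tchebysheff's theorem is easy to tabulate. A short Python sketch, with an illustrative function name:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}

for k, bound in bounds.items():
    print(f"k={k}: at least {bound:.0%} within mu +/- {k} sigma")
```

These are guaranteed minimums for any distribution, so they are lower than the 68%, 95%, and 99.7% figures that apply to bell-shaped data.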
Standardized Data Values
• A standardized data value refers to the number of standard deviations a value is from the mean
• Standardized data values are sometimes referred to as z-scores
Standardized Population Values
$z = \frac{x - \mu}{\sigma}$
where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score (number of standard deviations x is from μ)
Standardized Sample Values
$z = \frac{x - \bar{x}}{s}$
where:
• x = original data value
• $\bar{x}$ = sample mean
• s = sample standard deviation
• z = standard score (number of standard deviations x is from $\bar{x}$)
Remark: The standardized sample values are used for constructing the confidence limits for the population parameters.
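As a sketch of standardizing sample values, the code below computes z-scores for the sample used in the standard deviation calculation example. Reusing that data set here is a choice made for illustration.

```python
import statistics

# Sample from the standard deviation calculation example
data = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = statistics.mean(data)  # sample mean
s = statistics.stdev(data)     # sample standard deviation (n - 1 divisor)

# z = (x - x_bar) / s: number of standard deviations each value is from the mean
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"x = {x:2d}  z = {z:+.2f}")
```

Values below the mean get negative z-scores and values above it get positive ones, and the z-scores always sum to zero.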