aaron-harrington
IB Standard level stats
Part 2
Statistics topics left
Standard deviation: a recap
Cumulative frequency: medians, quartiles, IQR, deciles, percentiles and box-and-whisker plots
Histograms
Random variables: probability distributions, expectation
Binomial distribution
Normal distribution
Mean
The mean is the most widely used average in statistics. It is found by adding up all the values in the data and dividing by how many values there are.
Notation: If the data values are $x_1, x_2, x_3, \ldots, x_n$, then the mean is
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$
Here $\bar{x}$ is the mean symbol, and $\sum x_i$ means the total of all the x values.
Note: The mean takes into account every piece of data, so it is affected by outliers in the data. The median is preferred over the mean if the data contains outliers or is skewed.
If data are presented in a frequency table:
Value    Frequency
$x_1$    $f_1$
$x_2$    $f_2$
…        …
$x_n$    $f_n$
then the mean is
$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_n f_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum x_i f_i}{\sum f_i}$$
Example: The table shows the results of a survey into household size. Find the mean size.
To find the mean, we add a 3rd column to the table.
Household size, x    Frequency, f    x × f
1                    20              20
2                    28              56
3                    25              75
4                    19              76
5                    16              80
6                    6               36
TOTAL                114             343
Mean = 343 ÷ 114 = 3.01 (3 s.f.)
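The extra column can also be built programmatically. The following is a minimal sketch in Python; the function name `mean_from_frequency_table` is made up for illustration and is not part of the original slides.

```python
# Household-size survey from the example: (value x, frequency f) pairs.
table = [(1, 20), (2, 28), (3, 25), (4, 19), (5, 16), (6, 6)]

def mean_from_frequency_table(rows):
    """Return (sum of f, sum of x*f, mean) for a list of (x, f) pairs."""
    total_f = sum(f for _, f in rows)
    total_xf = sum(x * f for x, f in rows)   # the "x × f" column, added up
    return total_f, total_xf, total_xf / total_f

total_f, total_xf, mean = mean_from_frequency_table(table)
print(total_f, total_xf, round(mean, 2))
```

The two totals correspond to the TOTAL row of the table, and the mean is their quotient.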
Standard deviation
There are three commonly used measures of spread (or dispersion): the range, the inter-quartile range and the standard deviation.
The standard deviation is widely used in statistics to measure spread. It is based on all the values in the data, so it is sensitive to the presence of outliers in the data.
The variance is related to the standard deviation:
variance = (standard deviation)²
The following formulae can be used to find the variance and s.d.
$$\text{variance} = \frac{\sum (x_i - \bar{x})^2}{n}$$
$$\text{s.d.} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
Example: The mid-day temperatures (in ˚C) recorded for one week in June were: 21, 23, 24, 19, 19, 20, 21
First we find the mean:
$$\bar{x} = \frac{21 + 23 + \cdots + 21}{7} = \frac{147}{7} = 21$$
so the mean is 21 ˚C. Then we use
$$\text{variance} = \frac{\sum (x_i - \bar{x})^2}{n}$$
$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
21       0                  0
23       2                  4
24       3                  9
19       -2                 4
19       -2                 4
20       -1                 1
21       0                  0
Total:                      22
So variance = 22 ÷ 7 = 3.143
So, s.d. = 1.77 ˚C (3 s.f.)
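The same working can be checked with a few lines of Python. This is only a sketch of the definition-based calculation (dividing by n, as in the formula used in this example); the variable names are illustrative.

```python
import math

temps = [21, 23, 24, 19, 19, 20, 21]   # mid-day temperatures in degrees C

n = len(temps)
mean = sum(temps) / n
squared_devs = [(x - mean) ** 2 for x in temps]   # the (x_i - mean)^2 column
variance = sum(squared_devs) / n                  # divide by n, not n - 1
sd = math.sqrt(variance)

print(mean, sum(squared_devs), round(variance, 3), round(sd, 2))
```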
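There is an alternative formula which is usually a more convenient way to find the variance. The variance is defined as
$$\text{variance} = \frac{\sum (x_i - \bar{x})^2}{n}$$
But,
$$\sum (x_i - \bar{x})^2 = \sum \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right) = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 = \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2$$
Therefore,
$$\text{variance} = \frac{\sum x_i^2}{n} - \bar{x}^2$$
and
$$\text{s.d.} = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$$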
Example (continued): Looking again at the temperature data for June: 21, 23, 24, 19, 19, 20, 21
We know that
$$\bar{x} = \frac{147}{7} = 21$$
Also,
$$\sum x_i^2 = 21^2 + 23^2 + \cdots + 21^2 = 3109$$
So,
$$\text{variance} = \frac{\sum x_i^2}{n} - \bar{x}^2 = \frac{3109}{7} - 21^2 = 3.143$$
$$\text{s.d.} = \sqrt{3.143} = 1.77$$
so the standard deviation is 1.77 ˚C.
Note: Essentially the standard deviation is a measure of how close the values are to the mean value.
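As a quick check that the alternative formula agrees with the definition, here is a short Python sketch on the same temperature data. The variable names are illustrative only.

```python
import math

temps = [21, 23, 24, 19, 19, 20, 21]
n = len(temps)
mean = sum(temps) / n

sum_sq = sum(x * x for x in temps)                         # sum of x_i squared
var_shortcut = sum_sq / n - mean ** 2                      # sum(x^2)/n - mean^2
var_definition = sum((x - mean) ** 2 for x in temps) / n   # sum((x - mean)^2)/n

print(sum_sq, round(var_shortcut, 3), round(math.sqrt(var_shortcut), 2))
```

Both expressions give the same variance, up to floating-point rounding.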
Calculating standard deviation from a table
When the data is presented in a frequency table, the formula for finding the standard deviation needs to be adjusted slightly:
$$\text{s.d.} = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2}$$
Example: A class of 20 students were asked how many times they exercise in a normal week.
Find the mean and the standard deviation.
The table can be extended to help find the mean and the s.d.
No. of times exercise taken, x    Frequency, f    x × f    x² × f
0                                 5               0        0
1                                 3               3        3
2                                 5               10       20
3                                 4               12       36
4                                 2               8        32
5                                 1               5        25
TOTAL:                            20              38       116
$$\bar{x} = \frac{38}{20} = 1.9$$
$$\text{s.d.} = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2} = \sqrt{\frac{116}{20} - 1.9^2} = 1.48$$
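The extended table translates directly into code. The sketch below assumes the data is held as (x, f) pairs; the helper name `mean_and_sd` is hypothetical.

```python
import math

# (x, f): number of times exercise is taken, and how many students gave that answer.
table = [(0, 5), (1, 3), (2, 5), (3, 4), (4, 2), (5, 1)]

def mean_and_sd(rows):
    """Mean and s.d. (dividing by the total frequency) from (x, f) pairs."""
    total_f = sum(f for _, f in rows)
    total_xf = sum(x * f for x, f in rows)        # "x × f" column total
    total_x2f = sum(x * x * f for x, f in rows)   # "x² × f" column total
    mean = total_xf / total_f
    variance = total_x2f / total_f - mean ** 2
    return mean, math.sqrt(variance)

mean, sd = mean_and_sd(table)
print(round(mean, 2), round(sd, 2))
```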
If data is presented in a grouped frequency table, it is only possible to estimate the mean and the standard deviation. This is because the exact data values are not known.
An estimate is obtained by using the mid-point of an interval to represent each of the values in that interval.
Example: The table shows the annual mileage for the employees of an insurance company.
Estimate the mean and standard deviation.
Annual mileage         Frequency, f    Mid-point, x    f × x      f × x²
0 ≤ x < 5000           6               2500            15,000     37,500,000
5000 ≤ x < 10,000      17              7500            127,500    956,250,000
10,000 ≤ x < 15,000    14              12,500          175,000    2,187,500,000
15,000 ≤ x < 20,000    5               17,500          87,500     1,531,250,000
20,000 ≤ x < 30,000    3               25,000          75,000     1,875,000,000
TOTAL                  45                              480,000    6,587,500,000
$$\bar{x} = \frac{480{,}000}{45} \approx 10{,}667 \text{ miles}$$
$$\text{s.d.} = \sqrt{\frac{6{,}587{,}500{,}000}{45} - \left(\frac{480{,}000}{45}\right)^2} \approx 5711 \text{ miles}$$
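The mid-point estimate can be sketched in Python as follows. The frequencies are the ones used in the worked table (6, 17, 14, 5, 3), and the variable names are illustrative.

```python
import math

# (lower bound, upper bound, frequency) for each mileage class,
# using the frequencies from the worked table.
classes = [(0, 5000, 6), (5000, 10000, 17), (10000, 15000, 14),
           (15000, 20000, 5), (20000, 30000, 3)]

total_f = total_fx = total_fx2 = 0
for lower, upper, f in classes:
    mid = (lower + upper) / 2     # mid-point stands in for every value in the class
    total_f += f
    total_fx += f * mid
    total_fx2 += f * mid ** 2

mean = total_fx / total_f
sd = math.sqrt(total_fx2 / total_f - mean ** 2)
print(total_f, total_fx, total_fx2, round(mean), round(sd))
```

Because every value is replaced by its class mid-point, both results are estimates rather than exact values.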
Notes about standard deviation
Here are some notes to consider about standard deviation.
In many distributions, about 67% (roughly two-thirds) of the data will lie within 1 standard deviation of the mean, whilst nearly all the data values will lie within 2 standard deviations of the mean.
Values that lie more than 2 standard deviations from the mean are sometimes classed as outliers; any such values should be treated carefully.
Standard deviation is measured in the same units as the original data. Variance is measured in the same units squared.
Most calculators have built-in functions which will find the standard deviation for you. Learn how to use this facility on your calculator.
Examination style question: The ages of the people in a cinema queue one Monday afternoon are shown in the stem-and-leaf diagram:
2 | 3 6
3 | 1 6 6
4 | 1 2 5 6 9
5 | 0 4 7
6 | 1
Key: 2 | 3 means 23 years old
a) Explain why the diagram suggests that the mean and standard deviation can be sensibly used as measures of location and spread respectively.
b) Calculate the mean and the standard deviation of the ages.
c) The mean and the standard deviation of the ages of the people in the queue on Monday evening were 29 and 6.2 respectively. Compare the ages of the people queuing at the cinema in the afternoon with those in the evening.
a) The mean and the standard deviation are appropriate, as the distribution of ages is roughly symmetrical and there are no outliers.
b) $\sum x = 597$, so $\bar{x} = \frac{597}{14} = 42.64286 \approx 42.6$
$\sum x^2 = 27{,}131$, so $\text{s.d.} = \sqrt{\frac{27{,}131}{14} - 42.64286^2} \approx 10.9$
c) The cinemagoers in the evening had a smaller mean age, meaning that they were, on average, younger than those in the afternoon.
The standard deviation for the ages in the evening was also smaller, suggesting that the evening audience were closer together in age.
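For part (b), the stem-and-leaf diagram can be expanded back into the raw ages before applying the formulae. A minimal Python sketch, assuming the diagram is typed in as a dictionary of stems and leaves (an illustrative representation, not something from the slides):

```python
import math

# Stem-and-leaf rows from the question: stem (tens) -> leaves (units).
stem_and_leaf = {2: [3, 6], 3: [1, 6, 6], 4: [1, 2, 5, 6, 9], 5: [0, 4, 7], 6: [1]}

ages = [10 * stem + leaf for stem, leaves in stem_and_leaf.items() for leaf in leaves]

n = len(ages)
sum_x = sum(ages)
sum_x2 = sum(a * a for a in ages)
mean = sum_x / n
sd = math.sqrt(sum_x2 / n - mean ** 2)
print(n, sum_x, sum_x2, round(mean, 1), round(sd, 1))
```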
Combining sets of data
Sometimes in examination questions you are asked to pool two sets of data together.
Example: Six male and five female students sit an A level examination.
The mean marks were 52% and 57% for the males and females respectively. The standard deviations were 14 and 18 respectively.
Find the combined mean and the standard deviation for the marks of all 11 students.
Let $x_1, \ldots, x_6$ be the marks for the 6 male students.
Let $y_1, \ldots, y_5$ be the marks of the 5 female students.
To find the overall mean, we first need to find the total marks for all 11 students.
As $\bar{x} = 52$, $\sum x = 6 \times 52 = 312$. As $\bar{y} = 57$, $\sum y = 5 \times 57 = 285$.
Therefore
$$\sum x + \sum y = 312 + 285 = 597$$
So the combined mean is:
$$\frac{597}{11} = 54.2727\ldots \approx 54.3\%$$
To find the overall standard deviation, we need to find the total of the marks squared for all 11 students.
Notice that the formula
$$\text{s.d.} = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$$
rearranges to give
$$\sum x_i^2 = n\left(\text{s.d.}^2 + \bar{x}^2\right)$$
As $\text{s.d.}_x = 14$, $\sum x^2 = 6\,(14^2 + 52^2) = 17{,}400$.
As $\text{s.d.}_y = 18$, $\sum y^2 = 5\,(18^2 + 57^2) = 17{,}865$.
Therefore,
$$\sum x^2 + \sum y^2 = 35{,}265$$
So the combined s.d. is:
$$\sqrt{\frac{35{,}265}{11} - 54.2727^2} = 16.1\% \text{ to 3 s.f.}$$
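The pooling steps can be sketched in Python as below. The function names `sums_from_summary` and `pool` are made up for illustration, and the s.d. is assumed to be the divide-by-n form used throughout these notes.

```python
import math

def sums_from_summary(n, mean, sd):
    """Recover (sum of x, sum of x squared) from n, the mean and the s.d."""
    return n * mean, n * (sd ** 2 + mean ** 2)

def pool(groups):
    """Combined (mean, s.d.) for a list of (n, mean, sd) summaries."""
    n_total = sum(n for n, _, _ in groups)
    sum_x = sum(sums_from_summary(*g)[0] for g in groups)
    sum_x2 = sum(sums_from_summary(*g)[1] for g in groups)
    mean = sum_x / n_total
    return mean, math.sqrt(sum_x2 / n_total - mean ** 2)

mean, sd = pool([(6, 52, 14), (5, 57, 18)])   # males, females
print(round(mean, 1), round(sd, 1))
```

Working from the totals rather than averaging the two standard deviations is the key design choice: the combined spread also depends on how far apart the two group means are.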
VARIANCE: you consider how the values are spread about the mean
To calculate VARIANCE, σ², for a POPULATION
For a list of data:
$$\sigma^2 = \frac{\sum x^2}{n} - \mu^2$$
For grouped data:
$$\sigma^2 = \frac{\sum f x^2}{\sum f} - \mu^2$$
Remember: μ is the mean of the population
As the variance is measured in terms of x², its units are the square of the units of x, which is a bit strange when comparing with the original values.
So to measure in terms of x we often calculate the STANDARD DEVIATION
STANDARD DEVIATION: the positive square root of the variance
Therefore STANDARD DEVIATION, σ, of a POPULATION
For a list of data:
$$\sigma = \sqrt{\frac{\sum x^2}{n} - \mu^2}$$
For grouped data:
$$\sigma = \sqrt{\frac{\sum f x^2}{\sum f} - \mu^2}$$
Example: 4 6 7 5 9 10 6 6 4 7 8
This is the whole POPULATION
For a list of data:
$$\sigma^2 = \frac{\sum x^2}{n} - \mu^2$$
Σ x² =
Σ x =
n =
So μ = Σ x ÷ n =
σ² =
Example:
Visits      f      Mid pt (x)
0 – 4       32     2
5 – 9       71     7
10 – 14     20     12
15 – 19     14     17
20 – 24     10     22
25 – 29     3      27
Total       150
∑ fx = 1340
∑ f = 150
So μ = 1340 ÷ 150 = 8.93
∑ fx² = 17560
$$\sigma^2 = \frac{\sum f x^2}{\sum f} - \mu^2 = \frac{17560}{150} - 8.93^2 = 37.32$$
σ = 6.11
(These values use μ rounded to 8.93. Keeping the unrounded mean 8.9333… gives σ² = 37.26 and σ = 6.10.)
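Both population examples can be checked with one small helper. This is a sketch only: the function name `population_variance` is hypothetical, and a plain list is treated as grouped data in which every frequency is 1. Note that the code carries the unrounded mean, so for the visits table it gives 37.26 rather than the 37.32 obtained by first rounding the mean to 8.93.

```python
import math

def population_variance(values, freqs=None):
    """sum(f x^2)/sum(f) - mu^2; every value has frequency 1 when freqs is None."""
    if freqs is None:
        freqs = [1] * len(values)
    total_f = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / total_f
    return sum(f * x * x for x, f in zip(values, freqs)) / total_f - mu ** 2

# The list of 11 values, treated as a whole population.
data = [4, 6, 7, 5, 9, 10, 6, 6, 4, 7, 8]
var_list = population_variance(data)

# The grouped "Visits" table: class mid-points and frequencies.
mid_points = [2, 7, 12, 17, 22, 27]
freqs = [32, 71, 20, 14, 10, 3]
var_grouped = population_variance(mid_points, freqs)
sd_grouped = math.sqrt(var_grouped)

print(round(var_list, 2), round(var_grouped, 2), round(sd_grouped, 2))
```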
Quartiles and box plots
A set of data can be summarised using 5 key statistics:
the median value (denoted Q2): this is the middle number once the data has been written in order. If there are n numbers in order, the median lies in position ½ (n + 1);
the lower quartile (Q1): this value lies one quarter of the way through the ordered data;
the upper quartile (Q3): this lies three quarters of the way through the distribution;
the smallest value;
and the largest value.
These five numbers can be shown on a simple diagram known as a box-and-whisker plot (or box plot):
[Figure: box plot marking the smallest value, Q1, Q2, Q3 and the largest value]
Note: The box width is the inter-quartile range.
Inter-quartile range = Q3 – Q1
The inter-quartile range is a measure of spread.
The semi-inter-quartile range = ½ (Q3 – Q1).
Example: The (ordered) ages of 15 brides marrying at a registry office one month in 1991 were:
18, 20, 20, 22, 23, 23, 25, 26, 29, 30, 32, 34, 38, 44, 53
The median is the ½(15 + 1) = 8th number. So, Q2 = 26.
The lower quartile is the median of the 7 numbers below Q2. So, Q1 = 22.
The upper quartile is the median of the 7 numbers above Q2. So, Q3 = 34.
The smallest and largest numbers are 18 and 53.
The (ordered) ages of 12 brides marrying at the registry office in the same month in 2005 were:
21, 24, 25, 25, 27, 28, 31, 34, 37, 43, 47, 61
Q2 is half-way between the 6th and 7th numbers: Q2 = 29.5.
Q1 is the median of the smallest 6 numbers: Q1 = 25
Q3 is the median of the highest 6 numbers: Q3 = 40.
The smallest and highest numbers are 21 and 61.
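The method used in both bride examples (quartiles as the medians of the lower and upper halves) can be sketched in Python. The function name `five_number_summary` is illustrative. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Smallest value, Q1, Q2, Q3, largest value, with Q1 and Q3 taken as the
    medians of the lower and upper halves (the middle value is left out when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

brides_1991 = [18, 20, 20, 22, 23, 23, 25, 26, 29, 30, 32, 34, 38, 44, 53]
brides_2005 = [21, 24, 25, 25, 27, 28, 31, 34, 37, 43, 47, 61]

for year, ages in (("1991", brides_1991), ("2005", brides_2005)):
    low, q1, q2, q3, high = five_number_summary(ages)
    print(year, low, q1, q2, q3, high, "IQR =", q3 - q1)
```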
We can use the box plots to compare the two distributions.
The median values show that the brides in 1991 were generally younger than in 2005. The inter-quartile range was larger in 2005 (15 years compared with 12), meaning that there was greater variation in the ages of brides in 2005.
Note: When asked to compare data, always write your comparisons in the context of the question.
[Figure: A box plot to compare the ages of brides in 1991 and 2005]
It is important that the two box plots are drawn on the same scale.
Cumulative frequency diagrams
A cumulative frequency diagram is useful for finding the median and the quartiles from data given in a grouped frequency table.
There are some important points to remember:
the cumulative frequencies should be plotted at the upper class boundaries of the intervals (don’t use the mid-point);
points can be joined by straight lines (for a cumulative frequency polygon) or by a curve (for a cumulative frequency curve).
[Figure: A cumulative frequency polygon]
Example: A survey was carried out into the number of hours a group of employees worked.
The table below shows the frequencies, together with the upper class boundaries (u.c.b.) and the cumulative frequencies (c.f.):
Hours worked    Frequency    u.c.b.    c.f.
1 – 9           3            9.5       3
10 – 19         5            19.5      8
20 – 29         5            29.5      13
30 – 39         35           39.5      48
40 – 49         65           49.5      113
50 – 59         27           59.5      140
The upper class boundary (u.c.b.) of the first interval is actually 9.5 (as it contains all values from 0.5 up to 9.5).
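The upper class boundaries and cumulative frequencies can be generated as in the sketch below. It assumes hours are recorded to the nearest whole number, so each boundary lies half a unit above the highest value in the class.

```python
from itertools import accumulate

# (lowest hours, highest hours, frequency) as printed in the table.
classes = [(1, 9, 3), (10, 19, 5), (20, 29, 5), (30, 39, 35), (40, 49, 65), (50, 59, 27)]

upper_boundaries = [high + 0.5 for _, high, _ in classes]   # e.g. 9 -> 9.5
cumulative = list(accumulate(f for _, _, f in classes))     # running total of frequencies

for ucb, cf in zip(upper_boundaries, cumulative):
    print(ucb, cf)
```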
[Figure: Cumulative frequency diagram to show hours worked; horizontal axis: hours worked (0 to 70), vertical axis: cumulative frequency (0 to 160)]
As well as plotting the points given in the previous table, we also plot the point (0.5, 0), since no one worked less than 0.5 hours.
We can estimate the median by drawing a line across at one half of the total frequency, i.e. at 70. We see that Q2 ≈ 43.
For the lower quartile, a line is drawn at 0.25 × 140 = 35. This gives Q1 ≈ 36.
Drawing a line at 0.75 × 140 = 105, we see that Q3 ≈ 48.
So the inter-quartile range is 48 – 36 = 12.
Note: You don’t need to add 1 before halving the frequency when the data is cumulative.
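Reading across a cumulative frequency polygon is linear interpolation between neighbouring plotted points. The sketch below assumes the points are joined by straight lines (a polygon, not a curve); the function name `read_across` is made up for illustration.

```python
# Points of the cumulative frequency polygon: (upper class boundary, cumulative frequency),
# starting from (0.5, 0).
points = [(0.5, 0), (9.5, 3), (19.5, 8), (29.5, 13), (39.5, 48), (49.5, 113), (59.5, 140)]

def read_across(points, target_cf):
    """Value on the horizontal axis where the polygon reaches target_cf."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1:
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target_cf is outside the range of the diagram")

total = points[-1][1]                     # total frequency, 140
q1 = read_across(points, 0.25 * total)
q2 = read_across(points, 0.50 * total)
q3 = read_across(points, 0.75 * total)
print(round(q1), round(q2), round(q3), round(q3) - round(q1))
```

The rounded results agree with the values read from the diagram. A hand-drawn curve would give slightly different readings.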
Percentiles and deciles
Instead of looking at quartiles we can also split the data up in terms of tenths (deciles) and hundredths (percentiles).
We can estimate the 3rd decile by drawing a line across at 3/10 of the total frequency, i.e. at 42. We see that D3 ≈ .
For the 7th decile, a line is drawn at 0.7 × 140 = 98. This gives D7 ≈ .
Examination style question: The cumulative frequency diagram shows the marks achieved by 220 students in a maths examination.
[Figure: A cumulative frequency diagram to show the marks in an exam; horizontal axis: mark (%), 30 to 100; vertical axis: cumulative frequency, 0 to 250]
a) Estimate the median and the 95th percentile.
b) Where should the pass mark of the examination be set if the college wishes 70% of candidates to pass?
a) The median will be approximately the 220 ÷ 2 = 110th value. This is about 61%.
The 95th percentile lies 95% of the way through the data. A line is drawn across at 0.95 × 220 = 209. This gives a mark of 84%.
b) The college wants 70% of 220 = 154 students to pass. Therefore 66 students will get a mark below the pass mark.
Drawing a line across at 66 gives a pass mark of about 55%.
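The heights at which the lines are drawn in parts (a) and (b) come from simple arithmetic on the total frequency. A small Python sketch (the variable names are illustrative, and the marks themselves still have to be read from the diagram):

```python
total_students = 220

median_position = total_students // 2                 # line for the median
percentile_95_position = total_students * 95 // 100   # line for the 95th percentile
must_pass = total_students * 70 // 100                # 70% of candidates pass
below_pass_mark = total_students - must_pass          # line for the pass mark

print(median_position, percentile_95_position, must_pass, below_pass_mark)
```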
Histograms
A histogram can be used to display grouped continuous data. There are some important points to remember:
The area of each bar in a histogram should be in proportion to the frequency.
When the class widths are not all equal, proportional areas can be achieved by plotting the frequency density on the vertical axis, where
$$\text{frequency density} = \frac{\text{frequency}}{\text{class width}}$$
The class width of an interval is calculated as the difference between the smallest and largest values that could occur in that interval, i.e. upper class boundary minus lower class boundary.
Example: 50 overweight adults tested a new diet. The table shows the amount of weight they lost (in kg) in 6 months.
Weight loss (kg)    Frequency    Class width    Frequency density
0 – 4               12           4              3.0
4 – 6               13           2              6.5
6 – 8               11           2              5.5
8 – 10              7            2              3.5
10 – 15             5            5              1
15 – 25             2            10             0.2
Notice that the class widths are not all equal, so frequency densities need to be used.
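The frequency density column can be reproduced with a short Python sketch. The function name `frequency_density` is illustrative.

```python
# (lower boundary, upper boundary, frequency) for each weight-loss class, in kg.
classes = [(0, 4, 12), (4, 6, 13), (6, 8, 11), (8, 10, 7), (10, 15, 5), (15, 25, 2)]

def frequency_density(lower, upper, frequency):
    """Frequency divided by class width."""
    return frequency / (upper - lower)

densities = [frequency_density(*c) for c in classes]
print(densities)
```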
[Figure: Histogram to show weight loss; horizontal axis: weight loss (kg), vertical axis: frequency density]
When you draw a histogram, remember to:
plot the frequency densities on the vertical axis;
choose sensible scales for your axes;
label both your axes;
give the histogram a title.
We can use the histogram to estimate, for example, the number of people who lost at least 12 kg:
There were 2 people who lost between 15 and 25 kg.
To estimate how many people lost between 12 and 15 kg, multiply this new class width by the frequency density for that class: 3 × 1 = 3.
That means that about 5 people lost at least 12 kg.
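Estimating a frequency from a histogram amounts to adding up the bar areas beyond the chosen value. A minimal Python sketch, assuming values are spread evenly within each class; the function name `estimate_at_least` is hypothetical.

```python
# (lower boundary, upper boundary, frequency density) from the weight-loss table.
bars = [(0, 4, 3.0), (4, 6, 6.5), (6, 8, 5.5), (8, 10, 3.5), (10, 15, 1.0), (15, 25, 0.2)]

def estimate_at_least(bars, threshold):
    """Estimated frequency at or above threshold: area of each bar to the right of it."""
    total = 0.0
    for lower, upper, density in bars:
        overlap = upper - max(lower, threshold)   # width of the bar beyond the threshold
        if overlap > 0:
            total += overlap * density
    return total

print(estimate_at_least(bars, 12))
```

With a threshold of 0 the same function returns the total frequency, which is a useful sanity check on the densities.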
Example: An ornithologist measures the wing spans (to the nearest mm) of 40 adult robins. Her results are shown below.
Wing span (mm)    Frequency    Actual interval    Freq. density
195 – 204         8            194.5 – 204.5      0.8
205 – 209         9            204.5 – 209.5      1.8
210 – 214         11           209.5 – 214.5      2.2
215 – 224         9            214.5 – 224.5      0.9
225 or over       3            224.5 – 244.5      0.15
The measurements are to the nearest millimetre, so the first interval actually contains all wing spans between 194.5 and 204.5 mm.
The last interval is open-ended. We assume that its width is twice that of the previous interval.
[Figure: A histogram showing the wing spans of robins; horizontal axis: wing span (mm), vertical axis: freq. density]
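The actual intervals and frequency densities for the robin data can be derived as in the sketch below. It builds in the two assumptions made in the example: measurements are rounded to the nearest mm, and the open-ended class is given twice the width of the class before it.

```python
# (lowest recorded value, highest recorded value or None if open-ended, frequency), in mm.
classes = [(195, 204, 8), (205, 209, 9), (210, 214, 11), (215, 224, 9), (225, None, 3)]

rows = []
previous_width = None
for low, high, freq in classes:
    lower = low - 0.5                        # values are rounded to the nearest mm
    if high is None:
        upper = lower + 2 * previous_width   # assumption: twice the previous class width
    else:
        upper = high + 0.5
    width = upper - lower
    rows.append((lower, upper, freq / width))
    previous_width = width

for lower, upper, density in rows:
    print(lower, upper, round(density, 2))
```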