When information for a statistical investigation is collected and recorded, the information is
referred to as data.
There are four processes involved in a statistical investigation:
Collection of data (information)
Data for a statistical investigation can be collected from records, from surveys (either face-
to-face, telephone, or postal), by direct observation or by measuring or counting. Unless the
correct data is collected, valid conclusions cannot be made.
Organisation and display of data
Data can be organised into tables and displayed on a graph. This allows us to identify features
of the data more easily.
Calculation of descriptive statistics
Some statistics used to describe a set of data are the centre and the spread of the data. These
give us a picture of the sample or population under investigation.
Interpretation of statistics
This process involves explaining the meaning of the table, graph or descriptive statistics in
terms of the variable, or theory, being investigated.
The variable is the subject that we are investigating.
The entire group of objects from which information is required is called the population.
Gathering statistical information properly is vitally important. If gathered incorrectly then any
resulting analysis of the data would almost certainly lead to incorrect conclusions about the
population.
The gathering of statistical data may take the form of:
• a census, where information is collected from the whole population, or
• a survey, where information is collected from a much smaller group of the
population, called a sample.
For example:
• The Australian Bureau of Statistics conducts a census of the whole population of
Australia every five years.
• In opinion polls before an election, a survey is conducted to see which way a
sample of the population will vote.
• The students in a school are to vote for a new school captain. If 20 students from
the school are asked how they will vote, then the population is all the students
who attend the school, and the 20 students are a sample.
A DATA
WHAT IS A STATISTICAL INVESTIGATION?
COLLECTION OF DATA
10 DATA ANALYSIS – CORE MATERIAL
When taking a sample it is hoped that the information gathered is representative of the entire
population.
For accurate information when sampling, it is essential that:
• the number of individuals in the sample is large enough
• the individuals involved in the survey are randomly chosen from the
population. This means that every member of the population has an equal
chance of being chosen.
If the individuals are not randomly chosen or the sample is too small, the data collected may
be biased towards a particular outcome.
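As a minimal sketch of random selection, the code below draws a sample in which every member of the population has an equal chance of being chosen. The population of 820 numbered students, the sample size of 40 and the function name are illustrative assumptions, not part of the text above.

```python
import random

# Hypothetical population: student ID numbers 1 to 820 (illustrative only).
population = list(range(1, 821))


def draw_simple_random_sample(members, sample_size, seed=None):
    """Return sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(members, sample_size)


sample = draw_simple_random_sample(population, 40, seed=1)
print(len(sample), "students selected from", len(population))
```

Selecting only the first 40 names on a class list, by contrast, would not give every student an equal chance and could bias the data.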
For example:
If the purpose of a survey is to investigate how the population of Melbourne will vote at the
next election, then surveying the residents of only one suburb would not provide information
that represents all of Melbourne.
Data are individual observations of a variable. A variable is a quantity that can have a value
recorded for it or to which we can assign an attribute or quality.
Two types of variable that we commonly deal with are categorical variables and numerical
variables.
For a categorical variable, a quality or category is recorded. The information collected is
called categorical data.
Examples of categorical variables and their possible categories include:
Colour of eyes: blue, brown, hazel, green and violet
Continent of birth: Europe, Asia, North America, South America, Africa, Australia and
Antarctica
Gender: male or female
Type of car: General Motors, Toyota, Ford, Mazda, BMW, Subaru, etc.
For a numerical variable, a number is recorded. The information collected is called numerical
data.
There are two types of numerical variables:
Discrete numerical variables
A discrete variable can only take distinct values and these values are often obtained by
counting.
Examples of discrete numerical variables and their possible values include:
The number of children in a family: 0, 1, 2, 3, ...
The score on a test, out of 30 marks: 0, 1, 2 ..., 29, 30.
TYPES OF DATA
CATEGORICAL VARIABLES
NUMERICAL VARIABLES
Chapter 1 UNIVARIATE DATA 11
Continuous numerical variables
A continuous numerical variable can theoretically take any value on a part of the number line.
Its value often has to be measured.
Examples of continuous numerical variables and their possible values include:
The height of Year 9 students: any value from about 120 cm to 200 cm
The speed of cars on a stretch of highway: any value from 0 km/h to the fastest speed that a car
can travel, but most likely in the range 30 km/h to 120 km/h
The weight of newborn babies: any value from 0 kg to 10 kg but most likely in the range
0.5 kg to 5 kg
The time taken to run 100 m: any value from 9 seconds to 30 seconds.
EXERCISE 1A
1 40 students, from a school with 820 students, are randomly selected to complete a survey
on their school uniform. In this situation:
a what is the population size
b what is the size of the sample?
2 A television station is conducting a viewer telephone poll, in which viewers phone in to the
station, on the question ‘Should Australia become a republic?’
a What is the population being surveyed in this situation?
b How is the data biased if it is used to represent the views of all Australians?
3 A polling agency is employed to survey the voting intention of residents of a particular
electorate in the next election. From the data collected they are to predict the election
result in that electorate.
Explain why each of the following situations would produce a biased sample.
a A random selection of people in the local large shopping complex is surveyed
between 1 pm and 3 pm on a weekday.
b All the members of the local golf club are surveyed.
c A random sample of people on the local train station between 7 am and 9 am is
surveyed.
d A doorknock is undertaken, surveying every voter in a particular street.
4 Classify the following data as categorical, discrete numerical or continuous numerical:
a the quantity of soil in a particular size of potplant
b the number of pages in a daily newspaper
c the number of cousins a person has
d the speed of cars on a particular stretch of highway
e the state of Australia where a person was born
f the maximum daily temperature in Melbourne
g the manufacturer of a car
h the preferred football code
i the position taken by a player on a football field
j the time it takes 12-year-olds to run one kilometre
k the length of feet
l the number of goals shot by a netballer
m the amount spent weekly, by an individual, at the supermarket.
5 A sample of public trees in a municipality was surveyed for the following data:
a the diameter of the tree (in centimetres) measured 1 metre above the ground
b the type of tree
c the location of the tree (nature strip, park, reserve, roundabout)
d the height of the tree, in metres
e the time (in months) since the last inspection
f the number of inspections since planting
g the condition of the tree (very good, good, fair, unsatisfactory).
Classify the data collected as categorical, discrete numerical or continuous numerical.
Tally and frequency tables are used to organise categorical data and there are several types
of graphs that can be used to effectively display the data.
For example:
A centrally-located school is investigating how their students get to school. This is of interest
to them because of local traffic problems. A sample of 50 students was asked which of the
following five categories they used most.
The results were:
BBCWTn TnTnTmCC WCCBC CWBBTn TmCBWTn
WWTnTnC TmTnCCTm BBBBW CCBWC TnBCBB
(Tn = train, Tm = tram, B = bus, W = walk, C = private car)
The variable ‘mode of transport to school’ is a categorical variable.
We can organise the data using a tally and frequency table.
One stroke for each data value is recorded in the tally column.
Four strokes crossed by a fifth, written ||||/, represent a tally of five.
Mode of transport   Tally                 Frequency
Train               ||||/ ||||                9
Tram                ||||                      4
Bus                 ||||/ ||||/ ||||         14
Walk                ||||/ |||                 8
Private car         ||||/ ||||/ ||||/        15
Total                                        50
From the frequency table we can see:
• The most favoured ‘mode of transport’ in the sample was ‘Private car’.
• 9 + 4 + 14 = 27 of the 50 students came by public transport (train, tram, or bus).
• Only 8 of the 50 students (16%) walked to school.
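The tally above can be checked with a short sketch. It assumes the raw results are typed in exactly as listed, with the codes Tn, Tm, B, W and C; the variable names are our own.

```python
import re
from collections import Counter

# Raw survey results, copied from the two rows of data above.
raw = ("BBCWTn TnTnTmCC WCCBC CWBBTn TmCBWTn "
       "WWTnTnC TmTnCCTm BBBBW CCBWC TnBCBB")
labels = {"Tn": "Train", "Tm": "Tram", "B": "Bus", "W": "Walk", "C": "Private car"}

# The two-letter codes are listed first so Tn and Tm are read as single codes.
codes = re.findall(r"Tn|Tm|B|W|C", raw)
frequency = Counter(codes)

for code, name in labels.items():
    print(f"{name:<12}{frequency[code]:>3}")
print(f"{'Total':<12}{sum(frequency.values()):>3}")
```

The printed frequencies should agree with the frequency column of the table.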
B ORGANISING AND DISPLAYING DATA
CATEGORICAL DATA
1 A barchart (or column graph) is usually drawn with the categories along the horizontal
axis and the frequency on the vertical axis.
Each bar (or column) is drawn with height equal to the frequency of its category.
The ‘bars’ are equally spaced (not joined together) and are of the same width.
Below is a barchart for the example. Note: A barchart can also be drawn
with horizontal bars.
2 A segmented barchart is a single ‘bar’ divided into segments so that the length of each
segment is proportional to the frequency.
A percentaged segmented barchart can also be produced.
The percentage for each category is calculated using
$\dfrac{\text{frequency of category}}{\text{total}} \times 100\%$.
For example, for the traffic data shown previously:
The category with the highest frequency of 15 was Private car.
So, $\dfrac{15}{50} \times \dfrac{100}{1} = 30\%$ of the students came by private car.
27 students came by public transport.
So the percentage who came by public transport was $\dfrac{27}{50} \times \dfrac{100}{1} = 54\%$.
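The same calculation can be applied to every category at once. The sketch below uses the frequencies from the transport table; the dictionary and variable names are our own.

```python
# Frequencies from the 'mode of transport to school' table.
frequency = {"Train": 9, "Tram": 4, "Bus": 14, "Walk": 8, "Private car": 15}
total = sum(frequency.values())

# percentage = frequency of category / total x 100
percentage = {mode: 100 * count / total for mode, count in frequency.items()}
public_transport = sum(percentage[mode] for mode in ("Train", "Tram", "Bus"))

for mode, value in percentage.items():
    print(f"{mode:<12}{value:>5.0f}%")
print(f"Public transport: {public_transport:.0f}%")
```

These percentages give the segment lengths for a percentaged segmented barchart.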
Following is a segmented barchart and a percentaged segmented barchart for the above
example.
The segments can be labelled, or shaded including a legend.
GRAPHS TO DISPLAY CATEGORICAL DATA
[Figures: barchart of mode of transport to school (train, tram, bus, walk, private car) against frequency, drawn with vertical bars and with horizontal bars; segmented barchart (frequency, 0 to 50) and percentaged segmented barchart (% frequency, 0% to 100%) of the same data.]
1 55 randomly selected year eight students were asked to nominate their favourite subject
studied at school.
The results of the survey are displayed in the barchart alongside.
a Which subject was the most favoured?
b How many students chose Art as their favourite subject?
c What percentage of the students nominated Mathematics as their favourite subject?
d What percentage of the students chose either Music or Art as their favourite subject?
2 A randomly selected sample of adults was asked to
nominate the evening television news service that
they watched. The results alongside were obtained:
News Service   Frequency
ABC               40
Channel 7         45
Channel 9         64
Channel 10        25
SBS               23
None               3
a Construct a barchart for this data.
b Use the table and graph to answer the following questions about the data.
i How many adults were surveyed?
ii Which news service is the most popular?
iii What percentage of those surveyed watched the most popular news service?
iv What percentage of those surveyed watched the news service on Channel 7?
3 Construct a percentaged segmented barchart
for the following categorical data, shading
the categories and including a legend.
Expenditure item   Weekly household expenditure ($)
Food                   60
Clothing               30
Rent                  120
Travel                 15
Utilities              30
Entertainment          45
A discrete numerical variable can take only distinct values.
The data is often obtained by counting.
For example, a farmer has a crop of peas and wishes to investigate the number of peas in
the pods. He takes a random sample of 50 pods and counts the number of peas in each pod,
obtaining the following data:
6 6 5 4 9 8 7 7 7 6 5 6 7 8 8 8 7 5 2 4 7 7 6 7 8
8 7 8 6 6 4 2 9 1 3 3 5 9 8 8 7 7 6 7 7 6 8 4 5 5
The variable in this situation is the discrete numerical variable ‘the number of peas in a pod’.
The data could only take the discrete numerical values 0, 1, 2, 3, 4, ....
EXERCISE 1B.1
[Figure for question 1: barchart of favourite subject (English, Mathematics, Science, Language, History, Geography, Music, Art) against frequency, 0 to 10.]
DISCRETE NUMERICAL DATA
To organise his data the farmer could use
the tally and frequency table shown.
A barchart could be used to display the
results.
No. peas in pod   Tally                 Frequency
1                 |                         1
2                 ||                        2
3                 ||                        2
4                 ||||                      4
5                 ||||/ |                   6
6                 ||||/ ||||                9
7                 ||||/ ||||/ |||          13
8                 ||||/ ||||/              10
9                 |||                       3
Total                                      50
Alternatively, the farmer could use a dot plot which is a convenient method of tallying the
data and at the same time displaying the frequencies.
To draw a dot plot:
1 Draw a horizontal axis and mark it with the values that the variable can take. For this
example, the variable took values from 1 to 9, so we mark the axis from 0 to 10.
2 Label the axis with a description, in this case: number of peas in pod.
3 Systematically go through the data, placing a dot or cross above the appropriate position
on the axis.
The dot plot for this example is:
Notice that the dots are evenly spaced so the final plot looks similar to the barchart.
From both the barchart and the dot plot it can be seen that:
• Seven was the most frequently occurring number of peas in a pod.
• $\dfrac{35}{50} \times \dfrac{100}{1} = 70\%$ of the pods yielded six or more peas.
• 10% of the pods had fewer than 4 peas in them.
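As a rough check of these observations, the sketch below tallies the farmer's 50 values and prints a sideways text version of the dot plot, one asterisk per pod. The layout and variable names are our own choices.

```python
from collections import Counter

# Number of peas in each of the 50 pods, as listed in the example.
peas = [6, 6, 5, 4, 9, 8, 7, 7, 7, 6, 5, 6, 7, 8, 8, 8, 7, 5, 2, 4, 7, 7, 6, 7, 8,
        8, 7, 8, 6, 6, 4, 2, 9, 1, 3, 3, 5, 9, 8, 8, 7, 7, 6, 7, 7, 6, 8, 4, 5, 5]
frequency = Counter(peas)

# Sideways dot plot: one row per value on the axis from 0 to 10.
for value in range(0, 11):
    print(f"{value:>2} | " + "*" * frequency[value])

most_frequent = max(frequency, key=frequency.get)
six_or_more = sum(count for value, count in frequency.items() if value >= 6)
fewer_than_four = sum(count for value, count in frequency.items() if value < 4)

print("Most frequent value:", most_frequent)
print("Six or more peas:", 100 * six_or_more / len(peas), "%")
print("Fewer than 4 peas:", 100 * fewer_than_four / len(peas), "%")
```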
The distribution of a set of data is the pattern or shape of its graph.
For the example above, the graph has the
general shape shown alongside:
This distribution of the data is said to be negatively skewed because it is stretched to the
left (the negative direction).
TABLES AND GRAPHS
[Figures: barchart and dot plot of number of peas in pod (horizontal axis) against frequency.]
DESCRIBING THE DISTRIBUTION OF A SET OF DATA
[Figure: sketch of a distribution stretched to the left.]
A positively skewed distribution of data would have a shape stretched to the right.
[Figure: sketch of a distribution stretched to the right.]
A symmetrical distribution of data is neither positively nor negatively skewed, but is
symmetrical about a central value.
A set of data whose graph has two peaks is said to be bimodal.
Outliers are data values that are either much larger or much smaller than the general body
of data. Outliers appear separated from the body of data on a frequency graph.
For the example, if the farmer found one pod in his sample contained 13 peas then the data
value 13 would be considered an outlier. It is much larger than the other data in the sample.
On the column graph it appears separated.
[Figure: column graph of number of peas in pod (0 to 13) against frequency, with the outlier at 13 marked.]
Note that the horizontal axis is a number line with numbers in ascending order from left to
right.
EXERCISE 1B.2
1 A randomly selected sample of households has been asked, “How many people live
in your household?” A column graph has been constructed for the results.
[Figure: column graph ‘Size of households’: number of people in the household (1 to 10) against frequency (0 to 8).]
a How many households were surveyed?
b How many households had only one or two occupants?
c What percentage of the households had five or more occupants?
d Describe the distribution of the data.
2 a Construct a barchart for the discrete numerical data below.
Number of toothpicks   Frequency
33                         1
34                         5
35                         7
36                        13
37                        12
38                         8
39                         2
b Comment on the distribution of the data (positively or negatively skewed
or symmetric).
3 A bowler has recorded the number of wickets he has taken in each of the last 30 innings
he has played: 1 1 3 2 0 0 4 2 2 4 3 1 0 1 0 2 1 5 1 3 7 2 2 2 4 3 1 1 0 3
a Construct a dot plot for the raw data.
b Comment on the distribution of the data, noting any outliers.
4 For an investigation into the number of phonecalls made by teenagers, a sample of
50 fifteen-year-olds were asked the question, “How many phonecalls did you make
yesterday?” The following dot plot was constructed for the data.
a What is the variable in this investigation?
b Explain why the data is discrete numerical data.
c What percentage of the fifteen-year-olds did not make any phonecalls?
d What percentage of the fifteen-year-olds made 5 or more phonecalls?
e Copy and complete: “The most frequent number of phonecalls made was .........”.
f Describe the distribution of the data.
g How would you describe the data value ‘11’?
5 The number of matches in a box is stated as 50, but the actual number of matches has
been found to vary. To investigate this, the number of matches in a box is counted for
a sample of 60 boxes:
51 50 50 51 52 49 50 48 51 50 47 50 52 48 50 49 51 50 50 52
52 51 50 50 52 50 53 48 50 51 50 50 49 48 51 49 52 50 49 50
50 52 50 51 49 52 52 50 49 50 49 51 50 50 51 50 53 48 49 49
a What is the variable in this investigation?
b Is the data continuous or discrete numerical data?
c Construct a dot plot for this data.
d Describe the distribution of the data.
e What percentage of the boxes contained exactly 50 matches?
The height of 14-year-old children is being investigated. The variable ‘height of 14-year-old
children’ is a continuous numerical variable because the values recorded for the variable
could, theoretically, be any value on the number line. They are most likely to fall between
120 and 190 centimetres.
The heights of thirty children are measured in centimetres. The measurements are rounded to
one decimal place, and the values recorded below:
163.0 154.2 152.8 160.5 148.3 149.2 154.7 172.7 171.3 162.5
165.0 160.2 166.2 175.3 143.4 174.6 180.9 162.4 167.3 158.4
159.4 164.5 163.7 183.8 150.8 163.4 181.9 158.3 165.0 156.8
[Figure for Exercise 1B.2, question 4: dot plot of the number of phone calls made in a day by a sample of 50 fifteen-year-olds; horizontal axis: number of phone calls, 0 to 11.]
CONTINUOUS NUMERICAL DATA
Note that these rounded values are actually discrete. However, when we tally them, we use
continuous class intervals as follows:
The smallest height is 143.4 cm and the largest is 183.8 cm so we will use class intervals 140
up to 150 (this does not include 150), 150 up to 160, 160 up to 170, 170 up to 180, 180 up
to 190. Note that we choose class intervals of the same width.
These class intervals are written as 140 - , 150 - , 160 - , etc. in the frequency table. The
final class interval is written as 180 - < 190 which means 180 cm up to a height that is less
than 190 cm.
A tally-frequency table for this example is:
Height (cm)    Tally               Frequency
140 -          |||                     3
150 -          ||||/ |||               8
160 -          ||||/ ||||/ ||         12
170 -          ||||                    4
180 - < 190    |||                     3
Total                                 30
A histogram is used to display continuous numerical data. This is similar to a barchart but
because of the continuous nature of the variable, the ‘bars’ are joined together. The frequency
is represented by the height of the ‘bars’.
A histogram for this example is
shown opposite:
Note: The two oblique lines that cross the horizontal axis indicate that the numbers on
this axis do not start at zero. This can also be shown using an axis-break symbol.
A relative frequency table and histogram can also be drawn:
Height (cm)    Frequency   Relative frequency (%)
140 -              3       3/30 × 100 = 10%
150 -              8       26.7%
160 -             12       40%
170 -              4       13.3%
180 - < 190        3       10%
Total             30       100%
From the tables and graphs we can see:
• More children had a height in the class interval 160 up to 170 cm than any other
class interval. This class interval is called the modal class.
$\dfrac{12}{30} \times 100 = 40\%$ of the children had a height in this class.
• Three of the children ($\dfrac{3}{30} \times 100 = 10\%$) had a height less than 150 cm.
[Figures: ‘Heights of a sample of fourteen-year-old children’: histogram of height (cm), 140 to 190, against frequency; relative frequency histogram of height (cm), 140 to 190, against relative frequency %.]
• Three of the children (10%) were 180 cm or more tall.
• The distribution of heights was approximately symmetrical.
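The grouping into class intervals can be sketched as follows. The code assumes class intervals of width 10 starting at 140 cm, as chosen above; the function name `class_start` is our own.

```python
# Heights (cm) of the thirty children, as listed in the example.
heights = [163.0, 154.2, 152.8, 160.5, 148.3, 149.2, 154.7, 172.7, 171.3, 162.5,
           165.0, 160.2, 166.2, 175.3, 143.4, 174.6, 180.9, 162.4, 167.3, 158.4,
           159.4, 164.5, 163.7, 183.8, 150.8, 163.4, 181.9, 158.3, 165.0, 156.8]


def class_start(value, low=140, width=10):
    """Return the lower end of the class interval that contains value."""
    return low + width * int((value - low) // width)


# Start every class at zero so that empty classes would still be shown.
frequency = {start: 0 for start in range(140, 190, 10)}
for height in heights:
    frequency[class_start(height)] += 1

relative = {start: round(100 * count / len(heights), 1)
            for start, count in frequency.items()}
modal_class = max(frequency, key=frequency.get)

for start in frequency:
    print(f"{start} - < {start + 10}: {frequency[start]:>2}  {relative[start]}%")
print("Modal class starts at", modal_class)
```

A value such as 150 would fall in the class 150 up to 160, matching the convention that each interval excludes its upper end.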
EXERCISE 1B.3
1 Construct a histogram for the following continuous numerical data.
Time to complete 100 m swim (secs)   Number of swimmers
50 -                                      3
55 -                                      6
60 -                                     16
65 -                                     11
70 -                                      2
75 - < 80                                 2
2 The speed of vehicles travelling along a section of highway has been recorded and
displayed using the histogram alongside.
[Figure: histogram of speed (km/h), 50 to 130, against number of vehicles, 0 to 200.]
a How many vehicles were included in this survey?
b What percentage of the vehicles were travelling at speeds equal to or
greater than 100 km/h?
c What percentage of the vehicles were travelling at a speed from 100 up to
110 km/h?
d What percentage of the vehicles were travelling at a speed less than 80 km/h?
e If the owners of the vehicles travelling at 110 km/h or more were fined $165 each,
what amount would be collected in fines?
3 The daily maximum temperature (°C) to the nearest degree, in Melbourne, for each day
in January 2001, is recorded below:
34 38 31 38 23 24 25 26 29 35 41 23 32 36 22 21
24 26 35 36 25 32 27 30 34 30 27 25 26 23 25
a Using class intervals of 5 degrees construct a tally and frequency table for the data.
b Construct a histogram to display the data.
c Describe the distribution of Melbourne’s daily maximum temperatures in January
2001.
4 The height of each member of a basketball squad has been measured and the results are
displayed using the frequency table below.
Height (cm)    Frequency
165 -              1
170 -              3
175 -              5
180 -             12
185 -              7
190 -              5
195 -              2
200 - < 205        1
a Calculate the relative frequencies and construct a
relative frequency histogram for the data.
b Comment on the distribution of the heights.
c Find the percentage of members of the squad
whose height is
i greater than 180 cm
ii less than 170 cm
iii between 175 and 190 cm.
Constructing a stem-and-leaf plot, commonly called a stemplot, is often a convenient method
to organise and display a set of numerical data. A stemplot groups the data and shows the
relative frequencies but has the added advantage of retaining the actual data values.
Data values such as 25 36 38 49 23 46 47 15 28 38 34 are all two digit numbers, so
the first digit will be the ‘stem’ and the last digit the ‘leaf’ for each of the numbers. The
stems will be 1, 2, 3, 4 to allow for numbers from 10 to 49.
The stemplot for the data is shown alongside.
Notice that:
• 1 | 5 represents 15
• 2 | 3 5 8 represents 23, 25 and 28
• the data in the leaves is evenly spaced with no commas
• the leaves are placed in increasing order, so this stemplot is ordered
• the scale (sometimes called the key) tells us the place value of each leaf.
Stem | Leaf
   1 | 5
   2 | 3 5 8
   3 | 4 6 8 8
   4 | 6 7 9          2 | 3 means 23
If the scale was 2 | 3 means 2.3, then 4 | 6 7 9 would represent 4.6, 4.7 and 4.9.
For data values such as 195 199 207 183 201 ...... the first two digits are the stem and
the last digit is the leaf.
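The construction of an ordered stemplot can be sketched in a few lines. This sketch assumes whole-number data where the last digit is the leaf and the remaining digits are the stem; the function name `stemplot` is our own.

```python
from collections import defaultdict


def stemplot(values):
    """Return an ordered stemplot as {stem: leaves}, with the units digit as leaf."""
    stems = defaultdict(list)
    for value in sorted(values):  # sorting first puts the leaves in increasing order
        stems[value // 10].append(value % 10)
    return dict(stems)


data = [25, 36, 38, 49, 23, 46, 47, 15, 28, 38, 34]
plot = stemplot(data)

# Print every stem from smallest to largest, including any stem with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
print("Scale: 2 | 3 means 23")
```

For three-digit values such as 195 or 207 the same rule gives stems 19 and 20, in line with the remark above.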
The score, out of 50, on a test was recorded for 36 students.
25 36 38 49 23 46 47 15 28 38 34 9
30 24 27 27 42 16 28 31 24 46 25 31
37 35 32 39 43 40 50 47 29 36 35 33
a Organise the data using a stemplot.
b Comment on the distribution of the data.
a Recording the data from the list gives an unordered stemplot:
Stem | Leaf
   0 | 9
   1 | 5 6
   2 | 5 3 8 4 7 7 8 4 5 9
   3 | 6 8 8 4 0 1 1 7 5 2 9 6 5 3
   4 | 9 6 7 2 6 3 0 7
   5 | 0                              2 | 4 means 24 marks
Ordering the data from smallest to largest for each stem gives an ordered stemplot:
Stem | Leaf
   0 | 9
   1 | 5 6
   2 | 3 4 4 5 5 7 7 8 8 9
   3 | 0 1 1 2 3 4 5 5 6 6 7 8 8 9
   4 | 0 2 3 6 6 7 7 9
   5 | 0
C STEM-AND-LEAF PLOTS (STEMPLOTS)
CONSTRUCTING A STEMPLOT
Example 1
b The shape of the distribution can be seen when the stemplot is rotated.
The data is slightly negatively skewed.
We also observe these important features:
• The minimum (smallest) test score is 9.
• The maximum (largest) test score is 50.
• The modal class is 30 - 39.
Consider the following example:
The residue that results when a cigarette is smoked collects in the filter. This residue has
been weighed for twenty cigarettes, giving the following data, in milligrams.
1.62 1.55 1.59 1.56 1.56 1.55 1.63
1.59 1.56 1.69 1.61 1.57 1.56 1.55
1.62 1.61 1.52 1.58 1.63 1.58
Scanning the data reveals that there will be only two ‘stems’, i.e., 15 and 16. In cases like
this we will need to split the stems.
If we use the stem 15 to represent data with values 1.50 to 1.54 and 15* to represent data
with values 1.55 to 1.59 etc., we can construct a stemplot with four stems:
Stem | Leaf
 15  | 2
 15* | 5 5 5 6 6 6 6 7 8 8 9 9
 16  | 1 1 2 2 3 3
 16* | 9                              15 | 2 means 1.52
If we split the stems five ways, where 15₀ represents data with values 1.50 and 1.51, 15₂
represents data with values 1.52 and 1.53 etc., the stemplot becomes:
Stem | Leaf
 15₀ |
 15₂ | 2
 15₄ | 5 5 5
 15₆ | 6 6 6 6 7
 15₈ | 8 8 9 9
 16₀ | 1 1
 16₂ | 2 2 3 3
 16₄ |
 16₆ |
 16₈ | 9
The stemplot with the stems split five ways clearly gives a better view of the distribution
of the data. The value 1.69 appears as an outlier in this graph.
The stemplot with the stems split two ways was not sensitive enough to show this.
SPLIT STEMS
[Figure: the ordered stemplot from Example 1, rotated to show the shape of the distribution.]
1 A school has conducted a survey of 60 of its students to investigate the time it takes
for students to travel to school. The following data gives the travel time to the nearest
minute.
12 15 16 8 10 17 25 34 42 18 24 18 45 33 38 45 40 3 20 12
10 10 27 16 37 45 15 16 26 32 35 8 14 18 15 27 19 32 6 12
14 20 10 16 14 28 31 21 25 8 32 46 14 15 20 18 8 10 25 22
a Is travel time a discrete or continuous variable?
b Construct a stemplot for the data using stems 0, 1, 2, ....
c Describe the distribution of the data.
d Copy and complete: “Most students spent between ...... and ...... minutes travelling
to school.”
2 The weight of 900 g loaves of bread varies
slightly from loaf to loaf. A manufacturer of
bread is concerned that he may be producing
too many underweight loaves of bread in his
900 gram range. He weighs a sample of sixty
900 g loaves and records their weight to the
nearest gram. Construct a stemplot for the
following data and comment on the distribution of the data.
901 904 913 924 921 893 894 895 878 885 896 910 901 903 907
907 904 892 888 905 907 901 915 901 909 917 889 891 894 894
898 895 904 908 913 924 927 885 898 903 903 913 916 931 882
893 894 903 900 906 910 928 901 896 886 897 899 908 904 889
3 A taxi driver has recorded the fares, to the
nearest dollar, of 60 passengers that he has
collected from Melbourne airport:
25 32 35 16 39 18 19 25 16 41 40 43 16
13 9 48 42 20 20 22 23 33 35 24 23 14
34 37 36 36 44 51 22 48 55 13 16 20 26
30 12 30 33 35 41 17 22 54 24 20 21 35
42 43 54 28 38 37 46 25
a Construct a stemplot with stems 0, 1, 2, 3, ...... Comment on the distribution of
the data.
b Construct a stemplot with two-way split stems. Comment on the feature of the
distribution that is revealed by this split-stem stemplot.
4 The time spent (minutes) by 20 people in a queue at a bank, waiting to be attended by
a teller, has been recorded:
3.4 2.1 3.8 2.2 4.5 1.4 0 0 1.6 4.8 1.5 1.9 0 3.6 5.2 2.7 3.0 0.8 3.8 5.2
Construct a stemplot for this data (include a legend). Comment on the distribution of
the data.
EXERCISE 1C
D SAMPLE SUMMARY STATISTICS: MEASURES OF CENTRE
MEASURES OF CENTRE
A picture of a data set can be obtained if we have an indication of the centre of the data and
the spread of the data.
Three statistics that provide a measure of the centre of a set of data are:
• the mean
• the median
• the mode.
THE MEAN
The mean $\bar{x}$ is the statistical name for ‘average’. The mean is calculated by adding all the
data values $x$ then dividing this sum by the number of data values $n$.
$$\text{mean} = \frac{\text{sum of the data values}}{\text{number of data values}}, \quad \text{denoted} \quad \bar{x} = \frac{\sum x}{n}$$
Note: The Greek letter sigma, $\Sigma$, means ‘the sum of’.
• The mean involves all the data values.
• If you are told that the mean mark for a test is 65% then there will be some marks
higher than 65% and some marks lower than 65%.
• The mean does not have to be one of the data values.
For example:
The mean number of children per family is 1.8 in Melbourne.
It is obvious that a family cannot have 1.8 children but this statistic tells us that most
families have either 1 or 2 children, with more families having 2 children.
Example 2
Find the mean of the following data:
5 5 7 3 8 2 3 4 6 5 7 6 4
There are 13 data values in this set, so n = 13.
$$\text{Mean} = \frac{5 + 5 + 7 + 3 + 8 + 2 + 3 + 4 + 6 + 5 + 7 + 6 + 4}{13} = \frac{65}{13} = 5$$
Example 3
Megan has had three Maths tests and her mean (average) mark is 78.
a What is the total of Megan’s marks for the three tests?
b She scores 82 marks for her next test. What is the mean mark for the four tests?
c How many marks did she need to score for the fourth test so that her overall
mean mark would increase to 80?
a The total number of marks for the three tests is 78 × 3 = 234.
b The average of her marks for the four tests is $\dfrac{234 + 82}{4} = 79$.
c To get an average mark of 80 in four tests, Megan needed to score a total of
4 × 80 = 320 marks.
Hence she needed to score 320 − 234 = 86 marks on the fourth test to bring
her overall mean mark to 80.
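The arithmetic in these two examples can be checked with a short sketch. The variable names are our own; the numbers are those given in the examples.

```python
# Mean of a data set: the sum of the data values divided by the number of values.
data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
mean = sum(data) / len(data)
print("Mean of the data set:", mean)

# Megan's marks: a mean of 78 over three tests fixes the total.
total_three = 78 * 3
mean_four = (total_three + 82) / 4   # mean after scoring 82 on the fourth test
needed = 4 * 80 - total_three        # fourth mark needed for a mean of 80

print("Total for three tests:", total_three)
print("Mean for four tests:", mean_four)
print("Mark needed for a mean of 80:", needed)
```

The key step in both parts is to move between the mean and the total using total = mean × number of values.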
The median is the middle value of an ordered set of data.
An ordered set of data is the data listed from smallest to largest value (or largest to smallest).
The median splits the data set into two halves: half of the data have values less than or equal
to the median and half have values greater than or equal to the median.
For example, if the median mark for a test is 65%, then half the marks scored are greater
than or equal to 65% and half the marks scored are lower than or equal to 65%.
To find the median:
1 Order the data by rearranging the values from smallest to largest.
2 Locate the middle of the data values.
• If there is an odd number of data then the median will be one of the data values.
The median is the $\dfrac{n + 1}{2}$th value in a data set of n values.
• If there is an even number of data then the median is the average of the two
middle values and may not be equal to any of the data values.
Find the median for the following data sets:
a 5 5 7 3 8 2 3 4 6 5 7 6 4
b 3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10
a When ordered (arranged from smallest to largest), the data set is:
2 3 3 4 4 5 (5) 5 6 6 7 7 8
The median is the $\dfrac{13 + 1}{2} = 7$th value (shown in brackets).
The median is 5.
b There are 16 data values so the median is the average of the 8th and 9th values
(shown in brackets). 3 5 5 5 5 6 6 (6) (7) 7 7 8 8 8 9 10
The median is $\dfrac{6 + 7}{2} = 6.5$ (Note: This is not one of the data values.)
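The two steps for finding the median can be sketched as a small function. The function name `median` is our own; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Return the middle value of the ordered data (or the mean of the two middle values)."""
    ordered = sorted(values)       # step 1: order the data
    n = len(ordered)
    middle = n // 2                # step 2: locate the middle
    if n % 2 == 1:
        return ordered[middle]     # odd n: the (n + 1)/2 th value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: average the two middle values


set_a = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
set_b = [3, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 10]
print("Median of a:", median(set_a))
print("Median of b:", median(set_b))
```

Note that list positions in Python count from 0, so the 7th value of 13 is `ordered[6]`.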
THE MODE
The mode is the most frequently occurring value in the data set.
THE MEDIAN
Example 4
This statistic can usually be found easily from a frequency table, barchart or dot plot.
If there are two modes in a data set then the data can be described as bimodal.
If there are more than two modes then it is said that “the mode is not distinct” and the mode
is not useful as a descriptive statistic.
For continuous data, the class interval with the highest frequency is the modal class.
1 Find the i mean ii median iii mode for each of the following data sets:
a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
b 10, 12, 12, 15, 15, 16, 16, 17, 18, 18, 18, 18, 19, 20, 21
c 22.4, 24.6, 21.8, 26.4, 24.9, 25.0, 23.5, 26.1, 25.3, 29.5, 23.5
d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140,
125, 124, 119, 128, 141.
2 Consider the following two data sets:
Data set A: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10
Data set B: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 15
a Find the mean for both Data set A and Data set B.
b Find the median of both Data set A and Data set B.
c Explain why the mean of Data set A is less than the mean of Data set B.
d Explain why the median of Data set A is the same as the median of Data set B.
3 A cricketer has scored an average of 25:4 runs in his last 10 innings. He scores 58 and
16 runs in his next two innings. What is his new batting average?
4 On the first five days of his holiday David drove an average of 256 kilometres per day
and on the next three days he drove an average of 172 kilometres per day.
a What is the total distance that David drove in the first five days?
b What is the total distance that David drove in the next three days?
c What is the mean distance travelled per day over the eight days?
5 A basketball team scored 43, 55, 41 and 37 goals in their first four matches.
a What is the mean number of goals scored for the first four matches?
b What score will the team need to shoot in the next match so that they maintain the
same mean score?
c The team shoots only 25 goals in the fifth match. What is the mean number of
goals scored for the five matches?
d The team shoots 41 goals in their sixth and final match. Will this increase or
decrease their previous mean score? What is the mean score for all six matches?
Find the mode for the following data: 5 5 7 3 8 2 3 4 6 5 7 6 4
The mode is the most frequently occurring value.
There are three 5s and the most we have of any other number is two.
Example 5
EXERCISE 1D.1
So, the mode is .5
COMPARING MEASURES OF CENTRE
Consider the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 used in Examples 4a and 5. For this data set the mean, median and mode all had the same value, 5, and this fact indicates that the distribution of data in this set is symmetrical.
A dot plot of the data confirms this:
[Dot plot: data values from 2 to 8 on the horizontal axis; the mean, median and mode are marked at the same point.]
When the distribution of data is not symmetrical the measures of centre can have different values.
CALCULATING MEASURES OF CENTRE FROM A FREQUENCY TABLE
When the same data appear several times we often summarise the data in table form.
Consider the data of the given table.
We can find the measures of the centre directly from the table.
Data value   Frequency   Data value × frequency
3            1           3 × 1 = 3
4            1           4 × 1 = 4
5            3           5 × 3 = 15
6            7           6 × 7 = 42
7            15          7 × 15 = 105
8            8           8 × 8 = 64
9            5           9 × 5 = 45
Total        40          278
The mode
The mode is 7. There are 15 of data value 7 which is more than any other data value.
The mean
There are 40 data in this set, made up of one 3, one 4, three 5s, seven 6s and so on.
The data in an ordered list would look like
3 4 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ......
To add these numbers we could say
3 × 1 + 4 × 1 + 5 × 3 + 6 × 7 + 7 × 15 + ......
so it is not necessary to write out all the data values.
Adding a ‘Data value × frequency’ column to the table helps to add all the scores. For example, there are 15 data of value 7 and these add to 7 × 15 = 105.
Since the total of the 40 data values is 278, the mean $= \frac{278}{40} = 6.95$.
The median
Since there are 40 data in this set, if the data is written out in order from smallest to largest then the median will be the average of the two middle values, i.e., the 20th and 21st values.
The median can be found by counting down the frequency table.
Data value   Frequency   Accumulated total
3            1           1     one number is 3
4            1           2     two numbers are 4 or less
5            3           5     five numbers are 5 or less
6            7           12    12 numbers are 6 or less
7            15          27    27 numbers are 7 or less
8            8
9            5
Total        40
In the table, the numbers in the third column (printed in blue) show us accumulated values. We can see that the 20th and 21st data values (in order) must both be 7s, so the median $= \frac{7+7}{2} = 7$.
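As a rough sketch of the frequency-table method just described, the Python below computes the mean from the sum of value × frequency and finds the median by counting down the accumulated frequencies. The function name and the list-of-pairs layout are assumptions made for illustration; the data is the 40-value table above.

```python
def centre_from_frequency_table(table):
    """table is a list of (data value, frequency) pairs."""
    rows = sorted(table)
    n = sum(freq for _, freq in rows)                   # number of data
    total = sum(value * freq for value, freq in rows)   # sum of value x frequency
    mean = total / n

    def value_at(position):
        # count down the table until the accumulated frequency reaches `position`
        accumulated = 0
        for value, freq in rows:
            accumulated += freq
            if accumulated >= position:
                return value

    if n % 2 == 1:
        median = value_at((n + 1) // 2)
    else:
        median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2

    highest = max(freq for _, freq in rows)
    modes = [value for value, freq in rows if freq == highest]
    return mean, median, modes


table = [(3, 1), (4, 1), (5, 3), (6, 7), (7, 15), (8, 8), (9, 5)]
mean, median, modes = centre_from_frequency_table(table)
print(mean, median, modes)   # 6.95 7.0 [7]
```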
Example 6
Find the mean, median and mode for the data given in the following frequency table.
Data value   Frequency
2            1
4            1
5            2
6            3
7            4
8            9
9            6
Total        26
Adding a ‘Data value × Frequency’ column, we get:
Data value   Freq   Data value × Freq
2            1      2 × 1 = 2
4            1      4 × 1 = 4
5            2      5 × 2 = 10
6            3      6 × 3 = 18
7            4      7 × 4 = 28
8            9      8 × 9 = 72
9            6      9 × 6 = 54
Total        26     188
The mean $= \frac{188}{26} \approx 7.23$.
There are 26 data in this set, so the median will be the average of the 13th and 14th values.
Data value   Freq
2            1      1st value
4            1      2nd value
5            2      3rd and 4th values
6            3      5th, 6th and 7th values
7            4      8th, 9th, 10th and 11th values
8            9      12th, 13th, 14th, 15th to 20th values
9            6      21st to 26th values
Total        26
The 13th and 14th values are both 8 so their average is 8.
The median is 8.
8 is the data value with the highest frequency of 9, so the mode is 8.
Which measure of centre is the most suitable to use?
In Example 6, the mean (7.23) is less than the median (8) and mode (8).
A dot plot shows the distribution of the data:
[Dot plot: data values from 2 to 9 on the horizontal axis; the outliers, the mean, and the median and mode are marked.]
The data is negatively skewed and the data values 2 and 4 are much smaller than most of the
data values.
The mean depends on the actual values of the data so it has been ‘dragged’ towards these
outliers.
If the data value ‘2’ was replaced by a ‘7’ then the overall total would increase by 5 and
hence the mean would increase.
The median is not influenced by extreme values because it depends on the position of data
rather than their value. If the data value ‘2’ was replaced by a ‘7’ then the median would not
change; the middle values would remain the same.
In cases where there are outliers in one direction so the distribution is skewed, the most
suitable measure of centre to use is the median or the mode. In this case the mode has the
same value as the median and would be a suitable measure of centre for the data.
However, because the mode does not take all the data values into account, in some situations
it is not representative of a data set.
For example, the data set 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 9 has a
mode of 2 and this is not representative of the data set.
A more suitable measure of centre for this data set would be the median 4 or the mean 4.5.
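The claim that an outlier drags the mean but not the median can be checked numerically. The sketch below expands the Example 6 frequency table into a list and then replaces the data value 2 by a 7, as suggested above; it uses Python's standard `statistics` module, and the variable names are our own.

```python
from statistics import mean, median

# Example 6 data, expanded from its frequency table
data = [2, 4] + [5] * 2 + [6] * 3 + [7] * 4 + [8] * 9 + [9] * 6
changed = [7 if x == 2 else x for x in data]   # replace the '2' by a '7'

print(round(mean(data), 2), median(data))        # 7.23 8.0
print(round(mean(changed), 2), median(changed))  # 7.42 8.0
```

The total rises by 5, so the mean rises from about 7.23 to about 7.42, while the two middle values, and hence the median, are unchanged.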
MEASURES OF CENTRE FROM A STEMPLOT
Example 7
Find the mean, median and mode from the ordered stemplot shown.
Stem   Leaf
1      6 7 8 8
2      2 3 3 4 4 6 7 8 9
3      1 2 5 8
4      0 4 6
5      1
(The value 27 is the median, the 11th value.)
The mean is found by dividing the sum of all the data values by the number of data. We must make sure that the ‘stem’ is included with the ‘leaf’.
Mean $= \frac{16 + 17 + 18 + 18 + 22 + 23 + 23 + \dots + 51}{21} = 29.14$
The median is the middle value, the 11th value in this ordered data set.
Counting the leaves from the beginning gives a median of 27.
The mode is the most frequently occurring value; there are two 18s, two 23s and two 24s in this set of data. We can say that the mode is not distinct in this case and is not useful as a measure of centre.
Note: The mean of 29.14 is larger than the median of 27, indicating that the distribution is positively skewed. This can be seen from the stemplot.
USING A CALCULATOR TO FIND THE MEAN AND MEDIAN
Consider the data 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9.
The data is entered into the calculator under the STAT menu.
Choose STAT 1:Edit ENTER.
Use List 1 (L1), and after checking that the cursor is in the first position of List 1 we can type the first data value. This value will appear at the bottom of the screen as L1(1)=2.
Press ENTER and ‘2’ appears in the list.
Continue in a similar way through the list of data, pressing ENTER after each data entry to move the cursor to the next position.
To find the descriptive statistics for the data:
STAT ► will get you into the CALC menus for finding descriptive statistics.
We are dealing with only one variable so we choose 1:1-Var Stats ENTER.
1-Var Stats appears on the home screen. We need to tell the calculator which list our data is entered in, so type 2nd 1 ENTER.
All the available descriptive statistics for this variable appear on the screen:
The first statistic, $\bar{x}$, is the mean.
The mean of the data is 4.867 (to 3 decimal places).
The second statistic, $\sum x = 73$, means that the sum of all the data values is 73.
The next three statistics we will consider in Section 1E.
‘n=15’ indicates that there are 15 data values in the set.
The arrow beside n=15 means that there are other entries for this screen. Scroll down using ▼.
Med=5 means the median is 5.
The other statistics on this part of the screen give the statistics of the five-number summary which is also covered in Section 1E.
EXERCISE 1D.2
1 Find the mean, median and mode for each of the following data sets given as frequency tables:
a
Data value   Frequency
1            2
2            5
3            8
4            6
5            4
b
Number of rooms   Frequency
2                 1
3                 4
4                 12
5                 15
6                 2
7                 4
8                 2
2 The test scores, out of 30 marks, for a class of twenty-two students are:
15, 16, 18, 23, 22, 28, 29, 25, 25, 24, 27, 18, 11, 20, 23, 26, 26, 30, 25, 18, 15, 17
a Find the i mean ii median iii mode for the data.
b Explain why the mean is not the most suitable measure of centre for this set of data.
c Explain why the mode is not the most suitable measure of centre for this set of data.
3 a Find the i mean ii median iii mode for the data displayed in the following stem-and-leaf plot:
Stem   Leaf
5      3 5 6
6      0 1 2 4 6 7 9
7      3 3 6 8
8      4 7
9      1
b Which measure of centre would be the best representative for this set of data?
4 The following data is the daily rainfall (to the nearest millimetre) for the month of October 2000 in Melbourne:
3, 1, 0, 0, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 7, 1, 1, 0, 3, 8, 0, 0, 0, 32, 38, 3, 0, 3, 1, 0, 0
a Find the i mean ii median iii mode for this data.
b Explain why the median is not the most suitable measure of centre for this data.
c Explain why the mode is not the most suitable measure of centre for this data.
5 The frequency table alongside records the number of phonecalls made in a day by 50 fifteen-year-olds.
Number of phonecalls   Frequency
0                      5
1                      8
2                      13
3                      8
4                      6
5                      3
6                      3
7                      2
8                      1
9                      0
10                     0
11                     1
a Find the: i mean ii median iii mode for this data.
b Construct a barchart for the data and show the position of the measures of centre (mean, median and mode) on the horizontal axis.
c Describe the distribution of the data.
d Why is the mean larger than the median for this data?
e Which measure of centre would be the most suitable for this data set?
6 Which one of the following will always be true for the mean, median and mode of a set
of discrete numerical data, assuming a distinct mode exists?
A The mean always equals one of the data values in the set.
B The median always equals one of the data values in the set.
C The mode always equals one of the data values in the set.
D The median is distorted by extreme values.
E In a positively skewed set of data, the median will be greater than the mean.
E SAMPLE SUMMARY STATISTICS: MEASURES OF SPREAD
MEASURES OF SPREAD
Three commonly used statistics that indicate the spread of a set of data are:
• the range
• the interquartile range
• the standard deviation.
THE RANGE AND INTERQUARTILE RANGE
The range is the difference between the maximum (largest) data value and the minimum (smallest) data value.
Range = maximum data value − minimum data value
Example 8
Find the range for the data set: 5 5 7 3 8 2 3 4 6 5 7 6 4.
Scanning the data we can see that the minimum is 2 and the maximum is 8.
Hence the range is 8 − 2 = 6.
Now the median divides an ordered data set into two halves. These halves are divided in half again by the quartiles. The median is denoted Q2.
The middle value of the lower half is called the lower quartile, denoted Q1. One quarter (25%) of the data have values less than or equal to the lower quartile. Three quarters (75%) of the data have values greater than or equal to the lower quartile.
The middle value of the upper half is called the upper quartile, denoted Q3. One quarter (25%) of the data have values greater than or equal to the upper quartile. Three quarters (75%) of the data have values less than or equal to the upper quartile.
The interquartile range (IQR) is the spread of the middle half (50%) of the data.
Interquartile range (IQR) = upper quartile − lower quartile
= Q3 − Q1
Example 9
For the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 find the:
a median b lower quartile c upper quartile d interquartile range
The ordered data set is 2 3 3 4 4 5 5 5 6 6 7 7 8
a There are 13 data values so the median is the 7th value, which is 5.
There is an odd number of data and the median is one of the values so it divides the data into two halves of six values each:
2 3 3 4 4 5 (6 values)   [5] median   5 6 6 7 7 8 (6 values)
Note: For an odd number of data the median data value is not included in the lower or upper half for the calculation of the quartiles.
b The middle value of the lower half is the average of the 3rd and 4th values.
Lower quartile $= \frac{3+4}{2} = 3.5$
c Similarly, the middle value of the upper half is the average of the 10th and 11th values.
Upper quartile $= \frac{6+7}{2} = 6.5$
d Interquartile range = upper quartile − lower quartile
= 6.5 − 3.5
= 3
So, the middle half of the data has a spread of 3.
A summary for the set of data in Example 9 is:
2 3 3 4 4 5 5 5 6 6 7 7 8
Range = 8 − 2 = 6
Lower quartile = 3.5, Median = 5, Upper quartile = 6.5
Interquartile range = 3
The data has a spread of 6 (range = 6), centred around the value 5 (median = 5).
The middle half of the data has a spread of 3 (interquartile range = 3).
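The quartile method used in this section (split the ordered data at the median, leaving the median itself out of both halves when the number of data is odd) can be sketched in Python as below. The function names are our own, and statistical software often uses other quartile conventions, so a package may report slightly different values for the same data.

```python
def median(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the halves on either side of the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # for an odd number of data, skip the median value itself
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
q1, q2, q3 = quartiles(data)
print("range =", max(data) - min(data))   # range = 6
print(q1, q2, q3, "IQR =", q3 - q1)       # 3.5 5 6.5 IQR = 3.0
```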
Example 10
Find the range and the interquartile range and describe the distribution of the data:
8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12
The ordered data set (there are 18 data values) is:
3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 11, 12, 14
The range = 14 − 3 = 11
The median will be the average of the 9th and 10th values:
Median $= \frac{7+8}{2} = 7.5$
The median divides the data set into two sets of 9 values:
Lower half (9 values): 3, 3, 4, 5, 5, 6, 6, 7, 7
Upper half (9 values): 8, 8, 9, 9, 9, 10, 11, 12, 14
The lower quartile is the middle value of the lower half, which is 5, and the upper quartile is the middle value of the upper half, which is 9.
The interquartile range = 9 − 5 = 4
The data is centred at 7.5 (median) and has a spread of 11 (range).
The middle half of the data has a spread of 4 (interquartile range).
USING THE CALCULATOR TO FIND THE RANGE AND INTERQUARTILE RANGE
Key the data into a list. The data does not have to be ordered.
Enter STAT ► (CALC) and choose 1:1-Var Stats ENTER.
Press 2nd 1 ENTER to select the list L1.
The calculator screens show all the statistics for the data. Use ▼ to scroll down and reveal the lower part of the screen.
The range is maxX − minX = 14 − 3 = 11
The IQR = Q3 − Q1 = 9 − 5 = 4
MEASURES OF SPREAD FROM A STEMPLOT
The median, range, and interquartile range can be found easily from an ordered stemplot.
Example 11
The number of cars travelling along a particular road were counted for 21 days and the data was recorded in this ordered stemplot.
Find the median, range and interquartile range for this data.
Stem   Leaf
1      6 7 8 8
2      2 4 7 8 9
3      0 2 3 3 4 5 6 8
4      0 4 6
5      1
The data is ordered so we can read from the smallest value to the largest value.
Combining the ‘stem’ with the ‘leaf’, we get:
16, 17, 18, 18, 22, 24, 27, ......, 40, 44, 46, 51.
The minimum is 16 and the maximum is 51, so the range = 51 − 16 = 35.
The median is the middle value (the 11th data value in a list of 21) and counting from the beginning, the median = 32.
The median divides the data into two groups of 10 data values.
The average of the middle values of these groups gives the lower and upper quartiles.
Lower quartile $= \frac{22+24}{2} = 23$
Upper quartile $= \frac{36+38}{2} = 37$
Interquartile range = Upper quartile − Lower quartile
= 37 − 23
= 14
EXERCISE 1E.1
1 For each of the following data sets, find:
i the median (make sure the data is ordered)
ii the upper and lower quartiles
iii the range
iv the interquartile range.
a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
b 10, 12, 15, 12, 24, 18, 19, 18, 18, 15, 16, 20, 21, 17, 18, 16, 22, 14
c 21.8, 22.4, 23.5, 23.5, 24.6, 24.9, 25, 25.3, 26.1, 26.4, 29.5
d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140,
125, 124, 119, 128, 141.
2 For the data given in the following ordered stem-and-leaf plot, find the:
a median b upper quartile c lower quartile d range e interquartile range
Stem   Leaf
0      3 4 7 9
1      0 3 4 6 7 8
2      0 0 3 5 6 9 9 9
3      1 3 7 8
4      2
3 The time spent (in minutes) by 20 people in a queue at a bank has been recorded:
3.4, 2.1, 3.8, 2.2, 4.5, 1.4, 0, 0, 1.6, 4.8, 1.5, 1.9, 0, 3.6, 5.2, 2.7, 3.0, 0.8, 3.8, 5.2
a Find the median waiting time and the upper and lower quartiles.
b Find the range and interquartile range of the waiting times.
c Copy and complete the following statements:
i “50% of the waiting times were greater than ...... minutes.”
ii “75% of the waiting times were less than or equal to ...... minutes.”
iii “The minimum waiting time was ...... minutes and the maximum waiting time was ...... minutes. The waiting times were spread over ...... minutes.”
4 The following data gives the number of novels counted in 30 households.
Stem   Leaf
2      0 2 5 5 8 9 9
3      0 1 3 5 6 6 8 9
4      2 2 4 7 7 8 9
5      0 0 1 2 6
6      2 5
7      2
a Find the median number of novels per household and the upper and lower quartiles of the data.
b Copy and complete the following statements:
i “Half of the households have more than ...... novels.”
ii “75% of the households have at least ...... novels.”
c Find the i range ii interquartile range for the number of novels per household.
d Describe the distribution of the data using the statistics found.
5 The height (to the nearest centimetre) of 20 ten year olds is recorded in the following stemplot.
Stem   Leaf
10     9
11     1 3 4 4 8 9
12     2 2 4 4 6 8 9 9
13     1 2 5 8 8
a Find the i median height ii upper and lower quartiles of the data.
b Copy and complete the following statements:
i “Half of the children are less than or ...... cm tall.”
ii “75% of the children are less than ...... cm tall.”
iii “The middle 50% of the children have heights spread over ...... cm.”
THE VARIANCE AND STANDARD DEVIATION
Now the range and IQR both only use two values in their calculation. It is sometimes better to use a measure of spread that includes all of the data values in its calculation. One such statistic is the variance, which measures the average of the squared deviations of each data value from the mean. The deviation of a data value $x$ from the mean $\bar{x}$ is given by $x - \bar{x}$.
For a sample, i.e., when we have surveyed a portion of the population:
• the variance is $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ where $n$ is the sample size
• the standard deviation $s$ is the square root of the variance, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.
Note: The variance and standard deviation for a whole population have slightly different formulae. However, we do not use these in this course.
Example 12
Use the formula to find the variance and the standard deviation of the sample data:
3, 4, 4, 8, 7, 6, 10
The mean, $\bar{x}$, of the data is $\frac{3 + 4 + 4 + 8 + 7 + 6 + 10}{7} = \frac{42}{7} = 6$
Using a table for the calculations:
$x$     $x - \bar{x}$   $(x - \bar{x})^2$
3       −3              9
4       −2              4
4       −2              4
8       2               4
7       1               1
6       0               0
10      4               16
Total                   38   $= \sum (x - \bar{x})^2$
variance $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{38}{6} = 6.3333....$
standard deviation $s = \sqrt{\text{variance}} = \sqrt{\frac{38}{6}} = 2.5166$ (4 d.p.)
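The sample variance formula with divisor n − 1 can be checked with a few lines of Python. This is a sketch only; the function name `sample_variance` is our own, and the data is the sample from Example 12.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [3, 4, 4, 8, 7, 6, 10]
variance = sample_variance(data)
print(round(variance, 4), round(sqrt(variance), 4))   # 6.3333 2.5166
```

Python's standard `statistics` module also provides `variance` and `stdev`, which use the same n − 1 divisor.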
Using a table to calculate the standard deviation is an interesting exercise, but you will normally use your calculator to find this statistic.
USING THE CALCULATOR TO FIND THE STANDARD DEVIATION
Press STAT and choose 1:Edit. Key data into list L1.
Press STAT ► to choose CALC, then choose 1:1-Var Stats.
Press 2nd 1 ENTER to choose list L1.
[Calculator screen, with the sample standard deviation labelled.]
Note: The variance is not given on the screen, but it can be found by squaring the standard deviation.
STANDARD DEVIATION FOR GROUPED DATA
Example 13
The frequency table alongside shows data collected from a random sample of 50 households in a particular suburb, investigating the number of people in the household.
Use the calculator to find the standard deviation of the number of people in a household for this sample.
Number of people in the household   Frequency
1                                   5
2                                   8
3                                   13
4                                   14
5                                   7
6                                   3
Press STAT and choose 1:Edit.
Key the variable values into L1 and the frequency values into L2.
Press STAT ► to choose CALC, then choose 1:1-Var Stats.
Enter L1, L2 by pressing 2nd 1 , 2nd 2 ENTER.
The sample standard deviation is 1.3536....
Note: If you do not include L2 then you will still get a screen of statistics, but they will be for L1 only.
GIVING MEANING TO THE STANDARD DEVIATION
Many data sets have frequency distributions that are ‘bell-shaped’ and symmetrical about the mean.
For example, the histogram alongside exhibits this typical ‘bell-shape’. The data represents the heights of a group of adult women and has a mean of 165 and a standard deviation of 8.
[Histogram: frequency (0 to 25) against height (cm), from 140 to 190.]
The data is centred about the mean and spreads from 140 to 190. However, most of the data have values between 155 and 170 and not many have values more than 180 or less than 150.
The Normal distribution is an important bell-shaped distribution.
For the Normal distribution it can be shown that:
• 68% of the data will have values within one standard deviation of the mean.
• 95% of the data will have values within two standard deviations of the mean.
• 99.7% of the data will have values within three standard deviations of the mean.
Graphically this can be summarised:
[Three bell curves centred on the mean $\bar{x}$: 68% of data between $\bar{x} - s$ and $\bar{x} + s$; 95% of data between $\bar{x} - 2s$ and $\bar{x} + 2s$; 99.7% of data between $\bar{x} - 3s$ and $\bar{x} + 3s$.]
If we model the bell-shaped data above using the Normal distribution:
• 68% of the heights will have values between 165 − 8 = 157 and 165 + 8 = 173, i.e., between 157 and 173 cm.
68% of the data values will be in the interval $[\bar{x} - s, \bar{x} + s]$.
• 95% of the heights will have values between 165 − 2 × 8 = 149 and 165 + 2 × 8 = 181, i.e., between 149 and 181 cm.
95% of the data values will be in the interval $[\bar{x} - 2s, \bar{x} + 2s]$.
• 99.7% of the heights will have values between 165 − 3 × 8 = 141 and 165 + 3 × 8 = 189, i.e., between 141 and 189 cm.
99.7% of the data values will be in the interval $[\bar{x} - 3s, \bar{x} + 3s]$.
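The three intervals follow directly from the mean and standard deviation, so they are easy to tabulate. The sketch below uses the heights model from this section (mean 165, standard deviation 8); the function name `normal_intervals` is illustrative only.

```python
def normal_intervals(mean, sd):
    """Intervals expected to hold about 68%, 95% and 99.7% of Normal data."""
    return {
        "68%": (mean - sd, mean + sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }


for share, (low, high) in normal_intervals(165, 8).items():
    print(share, low, high)
# 68% 157 173
# 95% 149 181
# 99.7% 141 189
```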
Example 14
A set of data has a Normal distribution with a mean $\bar{x} = 30$ and a standard deviation of $s = 7$. What percentage of the data is:
a greater than 30 b between 23 and 37 c more than 37
d between 16 and 44 e more than 44 f between 37 and 44?
a The distribution of data is symmetrical about the mean, so 50% of the data have a value greater than 30.
b Now $\bar{x} - s = 30 - 7 = 23$ and $\bar{x} + s = 30 + 7 = 37$.
68% of the data are between 23 and 37.
c Since 68% of scores are between 23 and 37, 32% are outside this interval. The distribution of scores is symmetrical, so 16% are greater than 37.
d Now $\bar{x} + 2s = 30 + 14 = 44$ and $\bar{x} - 2s = 30 - 14 = 16$.
95% of the data fall between these two values.
e Since 95% of the data are between 16 and 44, 5% are outside this interval. The distribution is symmetric so 2.5% of the data are greater than 44.
f From c, we know that 16% of the data are greater than 37, and from e, we know 2.5% of the data are greater than 44.
So, 16% − 2.5% = 13.5% of the data lie between 37 and 44.
[Diagram: Normal curve with 23, 30, 37 and 44 marked on the horizontal axis.]
Example 15
The contents of a sample of two hundred ‘800 gram packets’ of muesli were weighed and the weights were found to have a bell-shaped distribution with a mean of 800 grams and a standard deviation of 8 grams. How many of the packets in the sample would be expected to have a weight of more than 792 grams?
We model the bell-shaped distribution using the Normal distribution.
Now 792 = 800 − 8.
So, 792 g is one standard deviation less than the mean.
Since 68% of the weights are within one standard deviation of the mean, 32% are outside this range.
Since the distribution is symmetric, $\frac{32}{2} = 16\%$ of the weights are lower than 792 g.
84% of the weights are above 792 g.
84% of 200 $= \frac{84}{100} \times 200 = 168$.
So 168 of the 200 packets in the sample would be expected to have a weight greater than 792 grams.
[Diagram: Normal curve of weight in grams with 792 and 800 marked; 16% of the weights lie below 792 g.]
EXERCISE 1E.2
1 a Use the formula to find the standard deviation of the following set of data:
3 3 4 4 5 6 6 7 8 8 9 9
b Check your answer to a using your calculator.
2 Use your calculator to find the standard deviation and variance of the following data:
25.6, 32.8, 24.7, 36.0, 32.1, 30.9, 34.4, 27.5
3 Find the standard deviation of the data given in the frequency table below.
Number of cars owned by the business   Frequency
0                                      3
1                                      4
2                                      6
3                                      9
4                                      12
5                                      10
6                                      10
7                                      8
8                                      5
9                                      2
10                                     1
11                                     0
4 The following data are the heights, to the nearest centimetre, of the thirty footballers that
belong to an AFL club.
192 185 189 183 189 191 190 192 198 187 191 194 198 181 189
191 190 187 189 194 198 191 187 196 181 193 187 196 192 178
a Find the i mean, $\bar{x}$ ii standard deviation, $s$ of the height of the footballers in this club.
b i Calculate the interval $[\bar{x} - s, \bar{x} + s]$.
ii What percentage of the heights would be expected to fall in this interval?
iii What percentage of the actual heights fall in this interval?
c What percentage of the actual heights fall in the interval $[\bar{x} - 2s, \bar{x} + 2s]$?
What percentage would you expect to fall in this interval?
5 The distribution of weights of 600 g loaves of bread is bell-shaped with a mean weight
of 605 g and a standard deviation of 8 g. What percentage of the loaves can be expected
to have a weight between 597 g and 613 g? (Use the Normal distribution as a model.)
6 [1997 FM CAT 2 Q4]
The distribution of the weight of ice-cream served in a single scoop of Danish Delight is
known to be bell-shaped with a mean of 104 grams and a standard deviation of 2 grams.
The percentage of single scoops of Danish Delight containing less than 100 grams will
be closest to:
A 0% B 2.5% C 5% D 16% E 95%
7 The diameters of washers produced by a machine have a bell-shaped distribution with a mean diameter of 10 mm and a standard deviation of 0.3 mm. Using the Normal distribution as a model, find the percentage of the washers that would have a diameter:
a between 9.7 mm and 10.3 mm b greater than 10 mm
c greater than 10.6 mm d between 9.4 and 9.7 mm
e greater than 9.7 mm?
8 The distribution of exam scores for 780 students who sat an exam is Normal with a mean
of 55 and a standard deviation of 15.
a Find the number of students who would be expected to obtain a score:
i greater than 70 ii less than 55
iii less than 25 iv between 70 and 85
b If the pass mark for the exam was 40, then how many students are expected to pass
the exam?
9 The distribution of times taken to swim 50 metres by a group of 16 year-olds is bell-shaped with a mean of 38 seconds and a standard deviation of 3 seconds. The slowest 16% of the students would be expected to have a swim-time:
A greater than 32 seconds but less than 35 seconds
B less than 32 seconds
C greater than 35 seconds but less than 38 seconds
D greater than 35 seconds
E greater than 41 seconds.
STANDARD SCORES (z-SCORES)
The relative significance of a particular data value can be considered in terms of the number of standard deviations that it differs from the mean.
This is called the standard score or z-score of the data value, and the process of finding the standard score is called standardisation. Non-standardised data are often referred to as raw scores.
CALCULATING STANDARD SCORES
Standard score (z-score) $= \frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$
Example 16
The mean percentage on a mathematics exam is 60 and the standard deviation is 13.
a Find the standard scores for students who, on the exam, scored:
i 82% ii 45% iii 73%
b Find the raw score of a student whose standardised score was 0.61.
a Using the formula for standard score:
i standard score $= \frac{82 - 60}{13} = 1.69$ (2 dec. pl.)
ii standard score $= \frac{45 - 60}{13} = -1.15$ (2 dec. pl.)
iii standard score $= \frac{73 - 60}{13} = 1$
b z-score $= \frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$
$0.61 = \frac{\text{raw score} - 60}{13}$
0.61 × 13 = raw score − 60
7.93 + 60 = raw score
raw score = 67.93
So, the student’s raw score would have been 68%.
Consider the following example:
The bell-shaped distribution alongside has mean 35 and standard deviation 10.
[Diagram: bell-shaped curve on an x-axis centred at 35, with 68% of the data within one standard deviation of the mean and 95% of the data within two.]
The distribution of the standardised data is shown alongside.
[Diagram: the same bell-shaped curve on a z-axis, with 68% of the data between −1 and 1 and 95% of the data between −2 and 2.]
Notice that:
• the shape of the distribution is unchanged
• the values on the x-axis have been scaled so that:
  ◦ the 68% of the data within one standard deviation of the mean have z-scores between −1 and 1
  ◦ the 95% of the data within two standard deviations of the mean have z-scores between −2 and 2
• a standard score of 0 represents a raw score of the same value as the mean
• a positive standard score represents a raw score that is greater than the mean
• a negative standard score represents a raw score that is less than the mean.
These facts are always true when we standardise a bell-shaped distribution.
Example 17
Find the percentage of scores that come from a Normal distribution that will have a z-score:
a greater than 0 b between −2 and 2 c between 1 and 2
d less than −2 e more than 3.
a A z-score of 0 corresponds to a raw score of the mean and 50% of the data will have a value greater than the mean.
b 95% of the data will have a z-score between −2 and 2.
c $\frac{95 - 68}{2} = 13.5\%$ of raw scores will have a z-score between 1 and 2.
d If 95% of the raw scores have a z-score between −2 and 2 then 2.5% ($\frac{1}{2}$ of 5%) will have a z-score less than −2.
[Diagram: Normal curve on a z-axis from −3 to 3, with 95% in the middle and 2.5% in each tail.]
e If 99.7% of the raw scores have a z-score between −3 and 3 then $\frac{100 - 99.7}{2} = \frac{0.3}{2} = 0.15\%$ will have a z-score more than 3.
COMPARING RAW SCORES FROM DIFFERENT DATA SETS
Since standard scores:
• keep the relative value of raw scores within a data set
• scale the x-axis of distributions in terms of their standard deviations,
standard scores are useful for comparing scores from different data sets.
Example 18
Archie scored 62% on his Mathematics exam. This exam had a mean of 57 and a standard deviation of 5. In his English exam Archie scored 75% and this exam had a mean of 70 and a standard deviation of 6.
In which subject was his relative performance better?
In Maths: standard score $= \frac{62 - 57}{5} = \frac{5}{5} = 1$
In English: standard score $= \frac{75 - 70}{6} = \frac{5}{6} = 0.83$
Since Archie’s standard score for Maths was greater than his standard score for English, his Maths result was further to the right in the distribution of the scores of the class.
Archie’s relative performance was better in Maths.
EXERCISE 1E.3
1 Find the standard scores for the following raw scores that come from a set of data that has a mean of 6.4 and a standard deviation of 2.
a 10 b 5.2 c 12 d 6.5
2 A raw score from a data set has a z-score of −0.85. If the data set has a mean of 50 and a standard deviation of 5.6, find the value of the raw score.
3 A raw score of 72 has a z-score of 1.25. If the standard deviation from the data set is 8, find the mean of the data.
4 A raw score of 20 has a z-score of −1.6. If the mean of the data set is 28, find the
standard deviation.
5 Peter has had four Mathematics tests for the year and his results and the class averages
and standard deviations are given in the table below.
Test   Peter’s mark   Class average   Standard deviation
1      58             60              7
2      72             65              12
3      68             60              10
4      78             72              9
a Calculate Peter’s standard score for each test.
b In which test did Peter perform best?
6 The semester English exam results for four students are given in the table below.
The mean was 60 for both exams, and the standard deviation was 15 for the
Semester 1 exam and 8 for the Semester 2 exam.
Student   Semester 1   Semester 2
David     70           65
Rodney    54           58
Gavan     92           75
Daniel    75           70
a Which of the students improved their performance from Semester 1 to Semester 2?
b Which student improved the most?
c Which student’s performance was the most consistent for the year?
7 For a set of data that has a bell-shaped distribution, find the percentage of raw scores
that have a z-score:
a less than 0 b between −1 and 1 c greater than 2
d between −1 and 0 e between −1 and −2 f between 0 and 3
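
The comparison of raw scores through standard scores, as in Example 18, can be sketched in Python. This is only an illustration: the function names `standard_score` and `raw_score` are made up for this sketch and do not come from the text.

```python
def standard_score(raw, mean, sd):
    """Number of standard deviations that a raw score lies above the mean."""
    return (raw - mean) / sd


def raw_score(z, mean, sd):
    """Invert the standard score formula to recover the raw score."""
    return mean + z * sd


# Archie's two exams from Example 18: (raw score, class mean, standard deviation)
maths = standard_score(62, 57, 5)
english = standard_score(75, 70, 6)

print(f"Maths z-score:   {maths:.2f}")
print(f"English z-score: {english:.2f}")

# The larger standard score marks the better relative performance.
better = "Maths" if maths > english else "English"
print("Better relative performance:", better)
```

The same two functions cover the exercise types above: a missing raw score comes from `raw_score`, while a missing mean or standard deviation comes from rearranging the same formula.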
F THE BOXPLOT (BOX-AND-WHISKER PLOT)
A boxplot is a visual display of some of the descriptive statistics of a set of data, namely its
minimum and maximum values, the median and the upper and lower quartiles. These five
statistics form what is called the five-number summary of the data set.
CONSTRUCTING A BOXPLOT
A boxplot (box-and-whisker plot) is constructed above a number line (labelled and scaled)
which is drawn so that it covers all the data values in the data set.
The boxplot is drawn with a rectangular ‘box’ representing the middle half of the data. The
‘box’ goes from the lower quartile to the upper quartile.
The ‘whiskers’ extend from the ‘box’ to the maximum value and to the minimum value.
A vertical line marks the position of the median in the ‘box’.
For example, for the data set 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:
The ordered data set is 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7, 8, 9 (15 data).
There are 7 values on each side of the median (Q2); Q1 is the middle of the lower 7 values and Q3 is the middle of the upper 7 values.
The minimum is 1.
The maximum is 9.
The median is the 8th value, 5.
The lower quartile is the 4th value, 3.
The upper quartile is the 12th value, 7.
These 5 statistics form the five-number summary.
[Figure: boxplot above a number line from 1 to 9, labelled minimum value, lower quartile, median, upper quartile and maximum value, with a whisker on each side of the box]

Using the graphics calculator to find descriptive statistics and construct a boxplot
Press STAT and choose 1:Edit.
Enter the data from the example above into L1.
Statistical graphs are drawn using STAT PLOT, which is located above the Y= key.
Press 2nd Y= to use it.
Press ENTER to use Plot 1.
Turn the plot On by pressing ENTER, then use the arrow keys to choose the boxplot icon and press ENTER.
Press ZOOM 9 to draw the boxplot.
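
As a check on the hand calculation, here is a minimal Python sketch of the five-number summary. It assumes the quartile convention used in the worked example (when the number of data is odd, the median is left out of both halves); calculators and software packages may use other conventions and give slightly different quartiles. The function name `five_number_summary` is illustrative only.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]            # values below the median position
    upper_half = values[half + n % 2:]    # values above the median position
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


data = [2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9]
minimum, q1, med, q3, maximum = five_number_summary(data)
print("min:", minimum, "Q1:", q1, "median:", med, "Q3:", q3, "max:", maximum)
```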
TRACE can be used to locate the statistics of the five-number summary. The arrow keys move backwards and forwards between them.
[Screenshot: in this screen, the cursor is on the median.]

INTERPRETING A BOXPLOT
A set of data with a symmetric distribution will have a symmetric boxplot.
For example:
[Figure: symmetric histogram (x from 10 to 20) with its boxplot drawn on the same scale]
The whiskers of the boxplot are the same length and the median line is in the centre of the
box.
A set of data which is positively skewed will have a positively skewed boxplot.
For example:
[Figure: positively skewed histogram (x from 1 to 8) with its boxplot drawn on the same scale]
The right whisker is longer than the left whisker and the median line is to the left of the
centre of the box.
A set of data which is negatively skewed will have a boxplot that appears stretched to the
left.
For example:
[Figure: negatively skewed data (x from 1 to 9) with its boxplot drawn on the same scale]
The left whisker is longer than the right whisker and the median line is to the right of the
centre of the box.
Example 19
A boxplot has been drawn to show the distribution of marks (out of 100) in a test
for a particular class:
[Figure: boxplot above an axis labelled ‘score on test’, scaled from 0 to 100]
a What was the highest mark scored for this test?
b What was the median test score for the class?
c What is the range of marks scored for this test?
d What percentage of students scored 60 or more for the test?
e What was the lowest mark scored?
f What is the interquartile range for this test?
g The top 25% of students scored a mark between ...... and ......
h If you scored 70 on this test, would you be in the top 50% of students in the
class?
i Comment on the symmetry of the distribution of marks.

a The highest score corresponds to the end of the upper whisker, so the
highest mark scored was 98.
b The median corresponds to the vertical line inside the box, which is at 73.
c The range = maximum score − minimum score = 98 − 30 = 68
d The score of 60 corresponds to the lower quartile.
25% of the students have a score less than or equal to the lower quartile, so 75%
scored 60 or more.
e The lowest score corresponds to the end of the lower whisker, so the lowest
score was 30.
f The interquartile range = upper quartile − lower quartile = 82 − 60 = 22
g The top 25% of scores correspond to the upper whisker.
So, the top 25% of students scored a mark between 82 and 98.
h The top 50% of students had a mark greater than or equal to the median of 73.
You would not be in the top 50% of students if you scored 70 for the test.
i The distribution of test scores is stretched to the left, and is therefore negatively
skewed. The lower whisker is longer than the upper whisker and the median is
not in the centre of the box but further towards the upper end.
The distribution is therefore not symmetrical.
[Figure: the same boxplot, annotated ‘stretched to the left’]
TESTING FOR OUTLIERS
Outliers are extraordinary data that are either much larger or much smaller than the main
body of the data.
There are several tests that identify outliers. One commonly used test involves the following
calculation of ‘boundaries’:
The upper boundary = upper quartile + 1.5 × IQR.
Any data larger than this number is an outlier.
The lower boundary = lower quartile − 1.5 × IQR.
Any data smaller than this value is an outlier.
When outliers exist, the ‘whiskers’ of a boxplot extend to the last value that is not an outlier.
Each outlier is marked with an asterisk; it is possible to have more than one outlier at either end.

Example 20
Draw a boxplot for the following data, identifying any outliers.
1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7
The ordered data is:
1, 1, 3, 4, 5, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 11, 12, 13, 14, 16, 17
There are 13 values on each side of the median: lower quartile = 7, median = 8, upper quartile = 10.
The five-number summary is:
minimum value is 1
lower quartile is 7
median is 8
upper quartile is 10
maximum value is 17
[Screenshot: the same summary found using the calculator]
IQR = 10 − 7 = 3
The upper boundary = upper quartile + 1.5 × IQR = 10 + 1.5 × 3 = 14.5
The lower boundary = lower quartile − 1.5 × IQR = 7 − 1.5 × 3 = 2.5
Values outside the interval [2.5, 14.5] are outliers. Hence the two outliers at the
upper end are the data values 16 and 17, and the two at the lower end are both the
data value 1.
We now have all the information to draw the boxplot:
[Figure: boxplot above an axis labelled ‘variable’, scaled from 0 to 17. Two outliers of the same value are shown one above the other. Each whisker is drawn to the last value that is not an outlier.]
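
The boundary test used in Example 20 can be sketched in Python as follows. The quartiles are found with the same halving convention as the worked example, and the variable names are illustrative rather than taken from the text.

```python
from statistics import median

data = [1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16,
        8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7]

values = sorted(data)
n = len(values)
half = n // 2
q1 = median(values[:half])             # median of the lower half
q3 = median(values[half + n % 2:])     # median of the upper half
iqr = q3 - q1

lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# Anything outside the boundaries is flagged as an outlier.
outliers = [x for x in values if x < lower_boundary or x > upper_boundary]

# The whiskers stop at the most extreme values that are not outliers.
inside = [x for x in values if lower_boundary <= x <= upper_boundary]
whisker_ends = (inside[0], inside[-1])

print("boundaries:", lower_boundary, upper_boundary)
print("outliers:", outliers)
print("whiskers end at:", whisker_ends)
```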
Using the calculator to draw the boxplot in Example 20 above, we begin by entering the data in L1.
Use STAT PLOT by pressing 2nd Y=.
Press ENTER to use Plot 1.
Turn the plot On, then use the arrow keys to choose the ‘boxplot with outliers’ icon.
Then press ENTER.
Press ZOOM 9 to draw the boxplot.
Note that only one of the outliers at 1 appears on the screen.
Press TRACE and use the arrow keys to move the cursor through the summary statistics. Note that both values at 1 are included.

Note:
You may wonder why we would need both the boxplot and the stemplot or histogram. Each complements the other and shows slightly different things.
Boxplots provide an excellent display of the summary statistics, while stemplots and histograms illustrate the shape of the distribution more accurately.
Consider the following example:
 3 | 2 5 8
 4 | 2 3 3
 5 | 3 8 8
 6 | 0 5 5 5 7 9 9 9 9 9 9 9
 7 | 0 0
 8 | 4 8 9
 9 | 0 1 1 3 3 4 4 4 5 5 6 7 8
10 | 2 3 3 4 5 6
11 | 2 9          leaf unit: 0.1 cm
[Figure: boxplot of the same data above an axis labelled ‘height (cm)’, scaled from 3 to 13]
These graphs display the same distribution.
The boxplot displays the summary statistics, while the stemplot reveals the bimodal nature of the distribution. Hence both graphics are of value.
EXERCISE 1F
1 The following boxplot summarises the heights of the players in an AFL team.
[Figure: boxplot with one outlier above an axis labelled ‘height (cm)’, scaled from 165 to 215]
Use the boxplot to find:
a the median height of the team
b the range of heights of the team (ignoring the outlier)
c the height that 75% of the team are taller than
d the height of the player that is an outlier
e the interquartile range of the heights.
2 Find the five-number summary (minimum, lower quartile, median, upper quartile,
maximum) for each of the following data sets, and construct a boxplot for the data.
a Essendon’s game scores for the year 2000 (not including the finals):
156, 130, 124, 137, 123, 144, 140, 127, 106, 132, 145, 169, 119, 89, 108,
89, 167, 159, 165, 109, 81, 97
b Number of toothpicks   Frequency
  33                     1
  34                     5
  35                     7
  36                     13
  37                     12
  38                     8
  39                     2
c The daily maximum temperature (°C) in Melbourne for the month of March 2001:
  Stem | Leaf
  1    |
  1*   | 7 8 8 8 9 9
  2    | 0 0 0 2 2 2 2 3 3 3 4 4
  2*   | 5 5 6 7 8 8
  3    | 0 0 1 2 2 3
  3*   | 5
  2 | 4 represents 24°C
3 A set of data has a lower quartile of 31.5, median of 37, and upper quartile of 43.5.
a Calculate the interquartile range for this data set.
b Calculate the boundaries that identify outliers.
c Which of the data 22, 13.2, 60, 65 would be outliers?
4 The boxplot below shows the distribution of weights of a sample of Jack Russell terriers:
[Figure: boxplot above an axis labelled ‘weight (kg)’, scaled from 4 to 10]
Which one of the following would not be true for this data?
A The interquartile range is more than 1.5 kg.
B The heaviest 25% of the dogs all weighed more than 8 kg.
C The median weight was 7 kg.
D At least 75% of the weights were more than 6 kg.
E The lightest 25% weighed less than or equal to 6.2 kg.
5 The boxplot below shows the distribution of taxi fares for 50 trips taken from Melbourne
Airport.
[Figure: boxplot above an axis labelled ‘fare ($)’, scaled from 15 to 45]
a Find: i the median fare ii the range of fares iii the IQR of fares.
b Write a sentence describing the distribution of the data, mentioning each of the
statistics from a.
c Complete the following:
i Approximately ...... % of fares were greater than $32.
ii The minimum fare was $ ......
iii 75% of the fares were greater than $ ......
6 Match the histograms A, B, C, D and E to the boxplots I, II, III, IV and V.
[Figures: A, B, C and D are frequency histograms; I, II, III, IV and V are boxplots drawn above x-axes; E is the stemplot below]
E Stem | Leaf
  7    | 1 2 2 4 6
  6    | 2 2 3 4 4 5 5 5 5 6 8 9
  5    | 0 7 9 9 9
  4    | 0 2 3 4 6 6 6 7 7 8
  3    | 2 5 6 6 8
  2    | 2 9
  1    | 5          leaf unit: 0.1
G RANDOM SAMPLES
When we conduct a statistical survey, it is important that our data reflects the whole population.
If data is to be collected from a sample then the sample must accurately represent the
population. Otherwise, reliable conclusions about the population cannot be made. Samples must
be chosen so that the results will not show bias towards a particular outcome.
The sample size is also an important feature to be considered if conclusions about the
population are to be made from the sample.
For example:
Measuring a group of three fifteen-year-olds would not give a very reliable estimate of the
height of fifteen-year-olds all over the world. We therefore need to choose a random sample
that is large enough to represent the population. Note that conclusions based on a sample
will never be as accurate as conclusions made from the whole population, but if we choose
our sample carefully, they will be a good representation.
In a simple random sample, every member of the population has an equal chance of being
chosen, and each member is chosen independently of any other member.
CHOOSING A RANDOM SAMPLE
Random samples can be chosen using coins, dice, numbered tokens, random number tables,
or random number generators on computers or calculators.
For example:
Suppose you wish to choose Tattslotto numbers. The population of numbers is the integers 1
to 45 inclusive and you are going to choose a ‘sample’ of six different numbers. How could
you choose these numbers randomly?
Three possible methods:
1 Number forty-five pieces of paper, place them in a container and select six pieces of
paper without looking.
2 Use a random number table (Table 1).
Table 1
39634 62349 74088 65564 16379 19713 39153 69459 17986 24537
14595 35050 40469 27478 44526 67331 93365 54526 22356 93208
30734 71571 83722 79712 25775 65178 07763 82928 31131 30196
64628 89126 91254 24090 25752 03091 39411 73146 06089 15630
42831 95113 43511 42082 15140 34733 68076 18292 69486 80468

80583 70361 41047 26792 78466 03395 17635 09697 82447 31405
00209 90404 99457 72570 42194 49043 24330 14939 09865 45906
05409 20830 01911 60767 55248 79253 12317 84120 77772 50103
95836 22530 91785 80210 34361 52228 33869 94332 83868 61672
15358 70469 87149 89509 72176 18103 55169 79954 72002 20582
The digits in the table are generated by computer and printed in groups of five for easy
reading. You can start anywhere in the table and move across or down.
To choose numbers between 1 and 45, you need to look at two digits at a time. If the
digits are 04 then the chosen number is ‘4’. If the digits give a number greater than 45
then you ignore it. If you get a repeat of a number then you also ignore it.
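
The rule for reading a random number table (take two digits at a time, ignore numbers outside the range, ignore repeats) can be sketched in Python. The digit string below is the first row of Table 1; the function name `pick_from_digits` is an illustrative choice, not something from the text.

```python
def pick_from_digits(digits, low, high, count):
    """Read a digit string two digits at a time and keep the first `count`
    different numbers that fall between low and high inclusive."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])      # '08' is read as 8
        if low <= number <= high and number not in chosen:
            chosen.append(number)
        if len(chosen) == count:
            break
    return chosen


first_row = "39634 62349 74088 65564 16379 19713 39153 69459 17986 24537"
digits = first_row.replace(" ", "")        # the spaces are only for easy reading

tattslotto = pick_from_digits(digits, 1, 45, 6)
print(tattslotto)
```

Starting somewhere else in the table, or reading down instead of across, only changes the digit string that is passed in.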
Starting in the top left hand corner and going across (crossing out the inappropriate
numbers until you have six numbers), the numbers are:
39, 63, 46, 23, 49, 74, 08, 86, 55, 64, 16, 37, 91, 97, 13
Your chosen numbers would be 39, 23, 8, 16, 37 and 13.
3 Use the random number generator on the calculator.
This can be found in the MATH menu as follows:
Press MATH then the left arrow key to select PRB.
Choose 5:randInt(.
This will bring randInt( to the screen.
We need to type in the range of random integers that we are considering, i.e., 1 to 45.
Press 1 , 4 5 ).
Pressing ENTER repeatedly will give random integers between 1 and 45.
In this case the first six numbers were all different numbers, so these are the randomly chosen Tattslotto numbers. If numbers were repeated, we would generate more until we had six different ones.
You could also type in the sample size of 6 as shown alongside. However, if this gave
repeats in the sample, you would need to repeat the procedure.

Example 21
The table below gives the monthly sales figures, in thousands of dollars, for a shop over
a six year period.
a Choose a year at random.
b Choose a month at random.
c Choose three consecutive years.
d Choose a period of three consecutive years (36 months) starting with any month.
Month       2000   2001   2002   2003   2004   2005
January     43.1   48.7   45.7   44.0   48.6   46.3
February    38.2   35.3   36.4   38.3   37.7   40.2
March       38.6   36.0   36.2   34.8   35.3   33.3
April       40.2   40.9   42.4   42.5   43.8   35.7
May         43.2   44.2   47.0   48.7   50.3   52.4
June        27.8   32.3   33.5   34.1   32.2   35.8
July        26.4   27.2   23.5   27.2   27.7   28.1
August      23.8   24.9   24.8   27.6   26.1   28.2
September   27.4   30.8   32.7   33.6   34.9   35.1
October     40.4   39.3   38.7   41.3   42.4   44.9
November    68.3   67.4   67.3   69.8   70.4   72.6
December    81.2   83.9   84.6   85.5   88.3   87.2
a There are six years from which to choose. We could use a die to randomly choose one of these
years; the year 2000 would be represented by 1, 2001 by 2, ......, 2005 by 6.
Alternatively, we could use the random generator on a calculator: the randomly chosen year is
2004.
b There are twelve months from which we need to choose one month.
We use the calculator, with 1 representing January, 2 representing February, etc.
The randomly chosen month is November.
c To choose three consecutive years, we need to establish the number of sets of
three consecutive years that are possible:
1 2000 - 2002
2 2001 - 2003
3 2002 - 2004
4 2003 - 2005
There are four possibilities, from which we have to choose one. Using the calculator, the
randomly chosen period is number 3, 2002 to 2004.
d To choose a period of three consecutive years starting with any month, we need
to establish the number of sets that are possible:
1 Jan 2000 - Dec 2002
2 Feb 2000 - Jan 2003
3 Mar 2000 - Feb 2003
...
37 Jan 2003 - Dec 2005
There are thirty-seven possibilities, from which we have to choose one.
Using the calculator, the randomly chosen period is number 11, November 2000 to
October 2003.

TO CHOOSE A SIMPLE RANDOM SAMPLE:
1 State the sample size needed.
2 State the number of possibilities from which you can choose, and number them if
necessary.
3 State the random number generator that you are using.
4 Explain what you will do with repeated random numbers if repeats are not allowed.
5 State the random number(s) chosen and the data that is now in your sample.
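
The steps for choosing a simple random sample can be sketched in Python using Example 21 d: the possibilities are numbered, then one is chosen with a random number generator. Here Python's `random` module stands in for the calculator's generator, and the helper `consecutive_periods` is a hypothetical name for this sketch. Because the generator is not seeded, the chosen period will differ from run to run.

```python
import random

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def consecutive_periods(first_year, last_year, length_months):
    """List every run of length_months consecutive months as (start, end) labels."""
    labels = [f"{month} {year}"
              for year in range(first_year, last_year + 1)
              for month in MONTHS]
    return [(labels[start], labels[start + length_months - 1])
            for start in range(len(labels) - length_months + 1)]


# Steps 1 and 2: sample size 1, chosen from the numbered possibilities.
periods = consecutive_periods(2000, 2005, 36)
print("number of possibilities:", len(periods))

# Steps 3 and 5: generate a random integer and read off the matching period.
rng = random.Random()
choice = rng.randint(1, len(periods))
start, end = periods[choice - 1]
print(f"period {choice}: {start} to {end}")

# Step 4: when repeats are not allowed, sample without replacement,
# e.g. six different Tattslotto numbers from 1 to 45.
tattslotto = rng.sample(range(1, 46), 6)
print(sorted(tattslotto))
```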
EXERCISE 1G
1 Use the random number table (Table 1), starting at the top left corner and working
down, to:
a select a random sample of six different numbers between 1 and 45 inclusive
b select a random sample of 5 different numbers between 100 and 499 inclusive.
2 Use your calculator to:
a select a random sample of six different numbers between 5 and 25 inclusive
b select a random sample of 10 different numbers between 1 and 25 inclusive.
3 The following calendar for 2006 shows the weeks of the year. Each of the days is
numbered.
Using a random number generator, choose a sample from the calendar of:
a five different dates
b a complete week starting with a Monday
c a month
d three different months
e three consecutive months
f a four week period starting on a Saturday
g a four week period starting on any day.
Explain your method of selection in each case.
[Calendar for 2006, summarised. In the original every date is listed with its day of the week and, in brackets, its day number in the year; the days between the first and last day of each month follow consecutively.]
Month       First day     Last day       Weeks shown
January     Su 1 (1)      Tu 31 (31)     Wk 1 to Wk 5
February    We 1 (32)     Tu 28 (59)     Wk 6 to Wk 9
March       We 1 (60)     Fr 31 (90)     Wk 10 to Wk 13
April       Sa 1 (91)     Su 30 (120)    Wk 14 to Wk 18
May         Mo 1 (121)    We 31 (151)    Wk 19 to Wk 22
June        Th 1 (152)    Fr 30 (181)    Wk 23 to Wk 26
July        Sa 1 (182)    Mo 31 (212)    Wk 27 to Wk 31
August      Tu 1 (213)    Th 31 (243)    Wk 32 to Wk 35
September   Fr 1 (244)    Sa 30 (273)    Wk 36 to Wk 39
October     Su 1 (274)    Tu 31 (304)    Wk 40 to Wk 44
November    We 1 (305)    Th 30 (334)    Wk 45 to Wk 48
December    Fr 1 (335)    Su 31 (365)    Wk 49 to Wk 53
BIVARIATE DATA
Many statistical investigations involve analysing the relationship between two variables. We
call the data in these investigations bivariate data. The way that bivariate data is analysed
depends on whether the data is categorical or numerical.
In this chapter we will study the display and analysis of bivariate data where:
• one variable is a categorical variable and the other is a numerical variable
• both variables are categorical
• both variables are numerical.
For any pair of variables, one of the pair is described as the dependent or response variable,
while the other is the independent or explanatory variable.
The dependent variable responds to changes in the independent variable.
The independent variable explains the changes in the dependent variable.
For example, the number of children in a family influences the type of car they have, but not
the other way around. The type of car is therefore the dependent variable and the number of
children is the independent variable.

A COMPARING ONE CATEGORICAL AND ONE NUMERICAL VARIABLE
BACK-TO-BACK STEMPLOTS
If the categorical variable has only two categories then a back-to-back stemplot is useful. It
is a visual display that enables easy analysis and comparison of the data.
Consider this example:
An office worker has the choice of travelling to work by tram or train. He has recorded the
travel times from recent journeys on both of these types of transport. He wishes to know
which type of transport is quicker and which is the more reliable.
Recent tram journey times (minutes):
21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Recent train journey times (minutes):
23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16
The type of transport is the independent variable and the travel time is the dependent variable,
because the travel time depends on the type of transport.
A back-to-back stemplot could be used to display the relationship between the categorical
variable type of transport, which has two categories (or levels), and the numerical variable
travel time.
A back-to-back stemplot is constructed with only one stem. The leaves are grouped on either
side of this central stem. The ordered back-to-back stemplot for the data is shown below:
     Train leaf | Stem | Tram leaf
8 8 8 7 7 6 6 6 |  1   | 3 4 8 8 9
    8 3 1 1 0 0 |  2   | 1 2 2 4 5 7 8
              0 |  3   | 0 3
                |  4   | 3
The most frequently occurring travel times by train were between 10 and 20 minutes whereas
the most frequently occurring travel times by tram were between 20 and 30 minutes.
The median train travel time is 18 minutes and the median tram travel time is 22 minutes.
This supports the observation that train journeys are generally shorter.
The range of the train travel times is 30 − 16 = 14 minutes while the range of the tram
travel times is 43 − 13 = 30 minutes.
The interquartile range of travel times for the train is 21 − 17 = 4 minutes, while the IQR
for tram travel times is 28 − 18 = 10 minutes.
Comparison of these measures of spread indicates that the train travel times are less ‘spread
out’ than the tram travel times. The train travel times are therefore more predictable or
reliable.
In conclusion, it is generally quicker and the travel times are more reliable if the worker
travels by train to work.
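
The back-to-back stemplot for the train and tram data can also be built with a short Python sketch. It assumes whole-number data where the tens digit is the stem and the units digit is the leaf; the function name `back_to_back_stemplot` is illustrative only.

```python
from collections import defaultdict


def back_to_back_stemplot(left, right):
    """Return the rows of an ordered back-to-back stemplot as text lines.
    Left-hand leaves are written in descending order so that the smallest
    leaf sits next to the stem."""
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    for value in left:
        left_leaves[value // 10].append(value % 10)
    for value in right:
        right_leaves[value // 10].append(value % 10)

    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    rows = []
    for stem in range(lowest, highest + 1):
        l = " ".join(str(d) for d in sorted(left_leaves[stem], reverse=True))
        r = " ".join(str(d) for d in sorted(right_leaves[stem]))
        rows.append((l, stem, r))

    width = max(len(l) for l, _, _ in rows)
    return [f"{l:>{width}} | {stem} | {r}".rstrip() for l, stem, r in rows]


train = [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16]
tram = [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24]

lines = back_to_back_stemplot(train, tram)
print("Train leaf | Stem | Tram leaf")
for line in lines:
    print(line)
```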
EXERCISE 2A.1
1 The heights (to the nearest centimetre) of Year 10 boys and girls in a school are being
investigated. The sample data are as follows:
Boys:
164 168 175 169 172 171 171 180 168 168 166 168 170 165 171 173 187
179 181 175 174 165 167 163 160 169 167 172 174 177 188 177 185 167
160
Girls:
165 170 158 166 168 163 170 171 177 169 168 165 156 159 165 164 154
170 171 172 166 152 169 170 163 162 165 163 168 155 175 176 170 166
a What are the two variables in this investigation? Classify the variables as categorical
or numerical, dependent or independent.
b Construct a back-to-back stemplot for the data.
c Find the statistics in the five-number summaries for each of the data sets.
d Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
2 A new cancer drug is being developed and is being tested on rats. Two groups of twenty
rats with cancer were formed; one group was given the drug while the other was not.
The survival time of each rat in the experiment was recorded up to a maximum of 192
days.
Survival times of rats that were given the drug:
64 78 106 106 106 127 127 134 148 186
192* 192* 192* 192* 192* 192* 64 78 106 106
Survival times of rats that were not given the drug:
37 38 42 43 43 43 43 43 48 49
51 51 55 57 59 62 66 69 86 37
* denotes that the rat was still alive at the end of the experiment
a What are the variables in this investigation? Classify the variables as categorical or
numerical, dependent or independent.
b Construct a back-to-back stemplot for the data and find the statistics that make up
the 5-number summaries.
c Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
3 Peter and John are competing taxi-drivers who wish to know who earns more money.
They have recorded the amount of money (in dollars) collected per hour for five hours
over five days:
Peter: 17.3 11.3 15.7 18.9 9.6 13 19.1 18.3 22.8 16.7 11.7 15.8
12.8 24 15 13 12.3 21.1 18.6 18.9 13.9 11.7 15.5 15.2 18.6
John: 23.7 10.1 8.8 13.3 12.2 11.1 12.2 13.5 12.3 14.2 18.6 18.9
15.7 13.3 20.1 14 12.7 13.8 10.1 13.5 14.6 13.3 13.4 13.6 14.2
a Construct a back-to-back stemplot for the data and find the statistics that make up
the 5-number summaries.
b Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
4 The residue that results when a cigarette is smoked collects in the filter. The residue
from twenty cigarettes from each of two different brands was measured, giving the following
data, in milligrams:
Brand X: 1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58
Brand Y: 1.61 1.62 1.69 1.62 1.60 1.59 1.66 1.55 1.61 1.62
1.64 1.61 1.58 1.57 1.57 1.57 1.58 1.60 1.63 1.59
a Copy and complete the back-to-back stemplot for this data:
Brand Y | Stem | Brand X
        | 150  |
        | 152  | 2
        | 154  | 5 5 5
        | 156  | 6 6 6 6 7
        | 158  | 8 8 9 9
        | 160  | 1 1
        | 162  | 2 2 3 3
        | 164  |
        | 166  |
        | 168  | 9
156 includes values 1.56 and 1.57
b Comment on and compare the shape of the distributions.

PARALLEL BOXPLOTS
Parallel boxplots are used to display and compare data where one of the variables is numerical
and the other is a categorical variable with two or more categories.
For example:
If additional data, car travel time, is available to the office worker in the back-to-back
stemplot example, we can use parallel boxplots to compare the data. They help us decide
which type of transport is the quickest to get him to work and which is the most reliable.
Car travel times (minutes): 30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22
The categorical variable type of transport now has three categories and is the independent
variable.
Ordering the car travel times we get: 16, 17, 18, 19, 19, 21, 22, 23, 24, 25, 25, 28, 29, 30
The 5-number summary is:
min = 16, max = 30, median =22 + 23
2= 22:5, lower quartile = 19, upper quartile = 25
The three boxplots are drawn on the one axis:
The car travel times have almost the same spread (range = 14 mins, IQR = 6 mins) as the
train travel times (range = 14 mins, IQR = 4 mins), suggesting that the car travel time is as
reliable as the train travel time.
However, the train travel times include two outliers which may be due to extraordinary events.
If these are ignored then the range of travel times for the train would be 7 minutes, which is
considerably less than the ranges for the car and tram.
The median car travel time is 22:5 minutes, compared to 18 minutes for the train and 22minutes for the tram, so it is still generally quicker to travel by train.
In conclusion: From the data given, it is generally quicker and more reliable to travel by
train than it is by either tram or car.
[Figure: the three parallel boxplots (car, train, tram) drawn on one axis. The horizontal axis is travel time (minutes), the numerical variable, marked from 10 to 45; type of transport is the categorical variable with three categories.]
Using the graphing calculator to graph parallel boxplots
The data for each of the transport types is entered in separate lists.
Press STAT and choose 1:Edit. Press ENTER.
Press 2nd Y= to select STAT PLOT.
The three boxplots can be drawn on the screen at the same time by turning each of them On.
Make sure that the ‘boxplot with outliers’ icon and the correct list are selected for each plot.
ZOOM 9 will bring the graphs to the screen.
TRACE, then the arrows, can be used to find the ‘5-number summary’ values on the screen.
General rules for interpreting and comparing the distribution of bivariate data:
1 Comment on the shape of the distributions (symmetric, positively skewed, negatively
skewed, outliers).
2 Comment on and compare the centres of the data (median and mean).
3 Comment on and compare the spread of the data (range, interquartile range).
EXERCISE 2A.2
1 The percentage scores on a SAC for three classes of Further Mathematics students have
been recorded and the distributions of results for the three classes are summarised on the
graph below:
[Figure: parallel boxplots of score on SAC (%) for class A, class B and class C, on an axis from 0 to 100.]
a In which class was:
i the highest mark scored ii the lowest mark scored?
b Comment on the shape of the distribution of marks in each of the classes.
c Comment on and compare the centre of the scores for the classes.
d Comment on and compare the spread of the scores for the classes.
2 [VCAA FM 2001 Q6]
[Figure: parallel boxplots of age (years) for the female (n = 26) and male (n = 23) elephants, on an axis from 0 to 45.]
A conservation park in Thailand is home to 49 elephants, of which 26 are females and
23 are males. The parallel boxplots above show the distribution of their ages by sex.
Based on the information contained in the parallel boxplots, which one of the following
statements is incorrect?
A The youngest elephant is male.
B There are fewer female elephants under the age of 15 years than male elephants
under the age of 15 years.
C There are no female elephants over the age of 40 years.
D The median age of the female elephants is approximately the same as the median
age of the male elephants.
E Approximately 25% of the male elephants are 30 years of age or older.
3 The daily maximum temperatures in Melbourne for June 21st and December 21st (the
solstices) are being compared. The data for the 20 years from 1981 to 2000 is given
below:
June 21st: 13.6, 10.6, 19.1, 14.2, 12.2, 11.9, 18.3, 14.9, 14.6, 15.1,
17.4, 13.5, 16.7, 14.0, 11.1, 17.0, 15.4, 16.3, 15.6, 16.3
December 21st: 24.2, 19.4, 21.4, 22.7, 21.4, 20.0, 22.3, 21.1, 18.9, 23.5,
21.3, 23.0, 28.1, 20.3, 17.2, 35.0, 33.7, 21.9, 21.4, 38.6
a What are the variables in this investigation? Classify the variables as categorical or
numerical, dependent or independent.
b Find the statistics that make up the 5-number summaries and construct parallel
boxplots for the data.
c Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
4 Using the data from question 4, Exercise 2A.1, find five-number summaries and construct
parallel boxplots to summarise the distributions of residue for the two types of cigarettes.
What conclusions can be made from comparing the boxplots? Support your statements
with statistics.
5 Plant fertilisers come in many different brands, but there are essentially two types:
organic and inorganic. A student was interested to discover whether radish plants re-
sponded better to organic or inorganic fertiliser. He prepared three identical plots of
ground, named plots A, B and C, in his mother’s garden, and planted 40 radish seeds
in each plot. After planting, each plot was treated in an identical manner, except for the
way they were fertilised. Cost prevented him using a variety of fertilisers, so he chose
one organic and one inorganic fertiliser. Plot A received no fertiliser, plot B received the
organic fertiliser as prescribed on the packet, and plot C received the inorganic fertiliser
as prescribed on the packet. The student was interested in the weight of the root that
forms under the ground.
The data supplied below is the weight of the root (measured to the nearest gram) of the
individual plants:
Data from plot A: 27 29 9 10 8 36 36 42 32 32 32 30 38 39 38 50 34 41 39 40 12 14 35 35 42 25 32 30 34 22
Data from plot B: 51 54 56 41 50 47 47 46 48 52 34 20 28 47 58 56 63 66 54 48 48 53 47 29 46 33 45 58 34
Data from plot C: 55 76 65 61 67 69 68 64 76 59 56 79 70 69 70 76 43 70 62 60 58 79 65 75 60 39 68 68 63 54 61 72 58 77 66 65 47 50
a Produce parallel boxplots for the data.
b Compare and comment on the distributions of the weights of the root for each
plot, mentioning the shape, centre and spread and quoting statistics to support your
statements.
B TWO CATEGORICAL VARIABLES
Two-way frequency tables are used to demonstrate the relationship between two categorical
variables. Percentaged segmented barcharts give a visual display of the data.
TWO-WAY FREQUENCY TABLES
In two-way frequency tables, the independent variable fills the columns.
Example 1
A town council is considering bringing in a rule banning the drinking of alcohol in
public places. A random survey of 60 residents gave the following results:
Of the 35 women surveyed, 20 were in favour of the rule. However only 11 of the
men were in favour of it.
a Construct a two-way frequency table to summarise these findings.
b Construct a two-way percentaged frequency table and answer the following:
i What percentage of those surveyed were female?
ii What percentage of those surveyed were in favour of the proposal?
iii What percentage of the females surveyed were in favour of the proposal?
c Do the results of the survey support the theory that females would be more in
favour of this rule than males?
The two categorical variables involved in this question are:
Gender: Male or Female
Opinion about rule: In favour or Against
Opinion about rule depends on gender so the variable gender is the independent
variable.
a The two-way frequency table is:
                   Gender
Opinion       Male   Female   Total
In favour      11      20      31
Against        14      15      29
Total          25      35      60
b The two-way percentaged frequency table is:
                   Gender
Opinion       Male                  Female
In favour     11/25 × 100 = 44%     20/35 × 100 = 57%
Against       14/25 × 100 = 56%     15/35 × 100 = 43%
Total         100%                  100%
i 35/60 × 100 = 58.33% of those surveyed are female.
ii 31/60 × 100 = 51.67% of those surveyed were in favour of the rule.
iii 57% of the females were in favour of the rule.
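The percentages in part b of Example 1 can be reproduced with a few lines of code. This is a sketch only: the helper name `column_percentages` and the dictionary layout are assumptions made for illustration, with one dictionary per category of the independent variable (gender), so that each column is converted to percentages of its own total.

```python
def column_percentages(table):
    """Convert each column of counts to percentages of that column's total."""
    result = {}
    for column, counts in table.items():
        total = sum(counts.values())
        result[column] = {row: 100 * count / total for row, count in counts.items()}
    return result


# Counts from Example 1, keyed by the independent variable (gender).
counts = {
    "Male": {"In favour": 11, "Against": 14},
    "Female": {"In favour": 20, "Against": 15},
}

percentages = column_percentages(counts)
for gender, column in percentages.items():
    for opinion, pct in column.items():
        print(f"{gender:6} {opinion:9} {pct:.0f}%")

# Overall percentages use the grand total of 60 residents.
surveyed = sum(sum(column.values()) for column in counts.values())
female = sum(counts["Female"].values())
in_favour = sum(column["In favour"] for column in counts.values())
print(f"female: {100 * female / surveyed:.2f}%")
print(f"in favour: {100 * in_favour / surveyed:.2f}%")
```

Rounded to whole percentages, the columns come out as 44% and 56% for males and 57% and 43% for females, and the overall figures are 58.33% female and 51.67% in favour, in line with the worked answers.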
c 57% of the females surveyed were in favour of the proposed rule compared with 44% of the males. This shows a difference of 13%. The results support the theory.
PERCENTAGED SEGMENTED BARCHARTS
The percentaged frequency table in Example 1 can be graphed using a percentaged segmented barchart:
[Figure: percentaged segmented barchart with gender (male, female) on the horizontal axis and percentage (0 to 100) on the vertical axis; each bar is split into in favour and against.]
EXERCISE 2B
1 A survey of Victorians was recently conducted to ascertain their interest in AFL football.
The data was presented in the following two-way percentaged frequency table:
                          Gender
Level of interest    Male   Female   Total
Very interested       28      18      22
Somewhat              25      19      21
Not very              19      20      20
Not at all            28      43      37
Total                100     100     100
a Use the table to find:
i the percentage of those surveyed who are very interested in football
ii the percentage of women who are either very or somewhat interested in football.
b Construct a percentaged segmented barchart that compares the interest in Australian
Rules for men and women.
c Does the data support the theory that gender influences the level of interest in AFL
football? Quote percentages to support your statement.
2 A survey of sixteen-year-old students revealed that 32 of the 48 boys and 23 of the 37 girls played a team sport outside school.
a Copy and complete the two-way frequency table shown:
Play team sport            Gender
outside school?      Boys    Girls   Total
Yes                  .....   .....   .....
No                   .....   .....   .....
Total                .....   .....   .....
b Find the percentage of all the students who play a team sport outside school.
c Find the percentage of girls who play a team sport outside school.
d Construct a two-way percentaged frequency table.
e Do the figures support the theory that more boys than girls play a team sport outside
school? Quote some percentages to support your statement.
3 A market research company is contracted to investigate the age of people
who listen to the three radio stations, A, B or C, in a city. The results of their
survey are given in the table alongside:
                 Age group
Station    < 30   30 - 60   > 60   Total
A           35      30      200    .....
B           40      83       68    .....
C          175      37      132    .....
Total     .....   .....    .....   .....
a Complete the Totals row and column in the table alongside.
b Why do we need a two-way percentaged frequency table to help analyse the data?
c Construct the two-way percentaged frequency table.
d Compare and comment on which age groups listen to which radio station.
4 The two-way percentaged frequency table alongside was produced to show
the labour force status of parents from one-parent families.
Labour force status         Father   Mother
Employed full-time           48.6     16.8
Employed part-time           13.3     27.2
Unemployed                    8.3      8.9
Not in the labour force      29.8     47.0
Total                       100.0    100.0
(Source: ABS June 2002 Labour Force Survey)
a What are the variables in this survey? Classify them as categorical
or numerical, independent or dependent.
b Construct a percentaged segmented barchart to illustrate the data.
c What conclusions can be made from this table and graph?
Support your statements with percentages from the table.
5 A polling agency wants to test the theory that in a particular municipality, “more of the
female residents vote for female candidates”. A random sample of eighty residents in the
municipality were asked their voting preference, either Smith the female candidate, or
Jones the male candidate. Of the 35 female residents in the sample, 20 said they would
vote for Smith, whereas 25 of the male residents said they would vote for Jones.
a Fill in the missing values on the two-way frequency table alongside.
                        Gender
Voting intention   Male    Female   Total
Smith              .....     20     .....
Jones               25     .....    .....
Total              .....     35      80
b Construct a two-way percentaged frequency table for the data.
c Use the figures in the table to comment on the validity of the theory.
C TWO NUMERICAL VARIABLES
Scatterplots are used to demonstrate and visualise the relationship between two numerical
variables.
The data is plotted as points on a graph where the independent variable is the horizontal axis
and the dependent variable is the vertical axis.
CONSTRUCTING A SCATTERPLOT
The pattern formed by the points on a scatterplot indicates the strength of the relationship
between the two variables.
For example:
The relationship between weight and height of members
of an AFL football team is being investigated.
We expect there to be a fairly strong association between
these variables as it is generally perceived that the taller
a person is, the more they will weigh.
The height and weight of each of the players in the team
is recorded and these values form a coordinate pair for
each of the players:
Player  Height (cm)  Weight (kg) | Player  Height (cm)  Weight (kg) | Player  Height (cm)  Weight (kg)
  1        203          106      |   7        180           78      |  13        178           80
  2        189           93      |   8        186           84      |  14        178           77
  3        193           95      |   9        188           93      |  15        186           90
  4        187           86      |  10        181           84      |  16        190           86
  5        186           85      |  11        179           86      |  17        189           95
  6        197           92      |  12        191           92      |  18        193           89
Before a scatterplot is constructed you need to establish which of the variables is the independent
variable and which is the dependent variable.
In this case we assume that weight depends on height and so weight is the dependent variable
and height is the independent variable.
The points are therefore plotted as
coordinate pairs (height, weight) for
the individuals in the investigation.
[Figure: scatterplot titled Weight versus Height, with height (cm) from 175 to 205 on the horizontal axis and weight (kg) from 75 to 105 on the vertical axis.]
Using the calculator to construct a scatterplot
Press STAT and choose 1:Edit. Press ENTER.
Enter the data into lists. The independent variable should be L1 and the dependent variable should be L2.
Press 2nd Y= to select STAT PLOT.
Press ENTER.
Turn the plot On and select the scatterplot icon. The XList is L1 for the independent variable and the YList is L2 for the dependent variable.
Press ZOOM 9 to view the scatterplot.
You can press TRACE and use the arrow keys to identify the points.
INTERPRETATION OF A SCATTERPLOT
There are four aspects that we need to consider:
1 Direction
Positive association
The points generally go up as x increases,
similar to a straight line with positive gradient.
“As the independent variable (x) increases,
the dependent variable (y) also increases.”
[Figure: scatterplot of y against x showing positive association.]
Negative association
The points generally go down as x increases,
similar to a straight line with negative gradient.
“As the independent variable (x) increases,
the dependent variable (y) decreases.”
[Figure: scatterplot of y against x showing negative association.]
2 Form
In the scatterplots above, the points are generally in a straight line. The relationship
between the variables is said to be linear.
These scatterplots show relationships which are not linear.
3 Strength
If the points form a well-ordered pattern then the strength of the association is said to
be strong.
For example:
Strong positive Strong negative Strong non-linear
If the points form a pattern which is less well defined, then the strength is said to be
moderate.
For example: Moderate positive Moderate negative
If the points are scattered but a general pattern is still discernable then the association is
said to be weak.
For example: Weak positive Weak negative
If the points appear to be randomly scattered then
there is no association between the variables.
An example of this is shown opposite.
4 Outliers
Outliers stand out from the general body of data.
The example opposite shows a “moderate positive
association with one outlier”.
Outliers should be checked to ensure they are genuine outstanding data and not errors
in the data or errors in plotting. A decision can be made to ignore them as they will
influence correlation measures and models fitted to the data, but this should only be done
after careful consideration.
We can interpret the Weight versus Height scatterplot
from earlier as follows:
“There is a moderate positive association between the
variables height and weight. This means that as height
increases, weight increases. The relationship appears
linear and there are no obvious outliers.”
EXERCISE 2C
1 For each of the following pairs of variables, state whether you would expect to find
positive, negative, or no association. Indicate the strength (none, weak,
moderate or strong) of the association.
a Shoe size and height.
b Speed and time taken for a journey.
c The number of occupants in a household and the water consumption of the household.
d Maximum daily temperature and the number of newspapers sold.
e Age and hearing ability.
2 Copy and complete the following:
a If the variables x and y are positively associated then as x increases, y ..........
b If there is negative association between the variables m and n then as m increases,
n ..........
c If there is no association between two variables then the points on the scatterplot
appear to be .......... ..........
3 For each of the scatterplots below, state:
i whether there is positive, negative or no association between the variables
ii the strength of the association between the variables (zero, weak, moderate or
strong)
iii whether the relationship between the variables appears to be linear or not
iv the presence of outliers.
a b c
[Figure: scatterplots a, b and c, each plotting y against x.]
d e f
[Figure: scatterplots d, e and f, each plotting y against x.]
4 Consider the data:
x 1 2 3 4 5 6 7 8 9 10
y 2 1 4 3 5 6 5 5 7 8
a Construct a scatterplot for the data.
b State whether the association between the variables is:
i positive, negative or no association ii weak, moderate or strong
iii linear or not.
5 The following data was collected by a milkbar owner over fifteen consecutive days:
Max. daily temp. (°C)     29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold   119 164 131 152 206 169 122 143  63 208 155  96 125 248 139
a Which of the two variables is the independent variable?
b Construct a scatterplot of the data.
c Interpret the scatterplot in terms of the variables, mentioning direction, strength,
linearity and outliers.
6 A class of 25 students was asked to record their times (in minutes) spent preparing for
a test. The table below gives the score that they achieved on the test and the recorded
preparation time.
Score                     25  31  30  38  55  20  39  47  35  45  32  33  34
Minutes spent preparing   75  30  35  65 110  60  40  80  56  70  50 110  18
Score                     38  17  38  17  17  26  41  50  30  45  36  23
Minutes spent preparing   80  22  30  15  10  85 100  60  55  80  50  75
a Which of the two variables is the independent variable?
b Construct a scatterplot of the data.
c Interpret the scatterplot in terms of the variables, mentioning direction, strength,
linearity and outliers.
D CORRELATION
Correlation is a statistical word that means relationship or association. We can talk about
the correlation/relationship/association between two variables and mean the same thing.
PEARSON’S CORRELATION COEFFICIENT r
The correlation between two numerical variables can be measured by a correlation coefficient.
There are several correlation coefficients that can be used, but the most widely used coefficient
is Pearson’s correlation coefficient, named after the statistician Karl Pearson who developed
it. Its full name is Pearson’s product-moment correlation coefficient, and it is denoted r.
For a set of n bivariate numerical data with variables x and y, Pearson’s correlation
coefficient is:
$$ r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right) $$
where $\bar{x}$ and $\bar{y}$ are the means of the x and y data respectively and $s_x$ and $s_y$ are
their standard deviations.
This formula is tedious to use, so in all situations you will be using your calculator to find r.
INTERPRETATION OF PEARSON’S CORRELATION COEFFICIENT
Pearson’s correlation coefficient gives a measure of the relationship between two variables on
a scale from −1 to 1. Word descriptors based on r-values seem doubtful at the best of times
and the majority of texts on this subject do not include them. Many texts and Internet sites
vary on the advice they give. Here is one possible interpretation.
r               Description                      r                 Description
1               perfect positive correlation     −1                perfect negative correlation
0.75 to 1       strong positive correlation      −1 to −0.75       strong negative correlation
0.50 to 0.75    moderate positive correlation    −0.75 to −0.50    moderate negative correlation
0.25 to 0.50    weak positive correlation        −0.50 to −0.25    weak negative correlation
0 to 0.25       almost no correlation            −0.25 to 0        almost no correlation
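The formula and the table of descriptors above can be tried out in code. The sketch below is illustrative only: the names `pearson_r` and `describe` are invented here, the data is the ten (x, y) pairs used in this section's calculator example, and because the bands in the table share their end points, this sketch assumes that a boundary value such as 0.75 falls in the stronger band.

```python
from math import isclose
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's r as 1/(n - 1) times the sum of products of standardised values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


def describe(r):
    """Word descriptor for r, following the table in this section."""
    size = abs(r)
    if isclose(size, 1.0):            # tolerate floating-point rounding
        strength = "perfect"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.50:
        strength = "moderate"
    elif size >= 0.25:
        strength = "weak"
    else:
        return "almost no correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]
r = pearson_r(x, y)
print(f"r = {r:.4f}: {describe(r)}")
```

For this data the script prints r = 0.9130 and the descriptor strong positive correlation, which agrees with the value the calculator reports for the same data.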
Notes about Pearson’s correlation coefficient:
• It is designed for linear data only.
• It should be used with caution if there are outliers.
For example, the data in the two scatterplots below both have a correlation coefficient
of r = 0.8. The presence of the outlier in the second graph has greatly reduced the
r value, however; without this point, r would equal 1.
[Figure: two scatterplots of y against x; in the second, one point is labelled as an outlier.]
Using the calculator to find Pearson’s correlation coefficient
We consider finding Pearson’s correlation coefficient for the data below:
x 1 2 3 4 5 6 7 8 9 10
y 2 1 4 3 5 6 5 5 7 8
The first step is to activate the diagnostic tools on the calculator. Once turned on these will
remain on, but if the memory is cleared or battery changed then the calculator will revert
back to the default functions that do not include r.
To activate the diagnostic tools:
Locate the CATALOG menu using 2nd 0.
Use the arrow keys to scroll down to DiagnosticOn and press ENTER.
DiagnosticOn will appear on the screen.
Press ENTER and you will have turned the diagnostic tools on.
Enter the data into lists, the x-data into L1 and the y-data into L2.
We check the scatterplot at this stage as it will
reveal any errors made in entering the data, and
any outliers. It will also indicate whether the
data is linear.
Press STAT, then the right arrow, to select CALC and choose 4:LinReg(ax+b).
(This means we are fitting a linear model or
linear regression of the form y = ax + b to
the data.) Regression will be discussed in greater
detail in Chapter 3.
LinReg(ax+b) appears on the screen. You need to tell the calculator where your data is:
Enter L1, L2 by pressing 2nd 1 , 2nd 2.
Press ENTER.
The linear regression screen appears and the last
figure r = 0.9130... is Pearson’s correlation
coefficient for this data set.
The r value indicates a strong positive correlation, which agrees with the scatterplot.
CAUSATION
When analysing data, we must be aware of causation. A high degree of correlation between two variables does not necessarily imply that a change in one variable causes the other to change.
For example:
1 The heights and reading speeds of children were measured and a strong positive correlation
was found. Does this mean that increasing height makes you read faster or that
increasing your reading speed will cause you to grow? These suggestions are obviously
not sensible. The strong correlation results because both variables are closely associated
with age. As age increases, both the variables height and reading speed increase. It is
age which causes height and reading speed to increase.
2 The number of television sets sold in Ballarat and
the number of stray dogs collected in Bendigo were
recorded over several years and a strong positive
association was found between the variables.
Obviously the number of television sets sold in
Ballarat was not influencing the number of stray
dogs collected in Bendigo. Both variables have
simply been increasing over the period of time that
their numbers were recorded.
If a change in one variable causes a change in the other variable then we say that a causal
relationship exists between them.
For example:
The age and height of a group of children is measured and there is a strong positive correlation
between these variables. This will be a causal relationship because an increase in age will
cause an increase in height.
EXERCISE 2D
1 a Use your calculator to find Pearson’s correlation coefficient for the data given in
question 5, Exercise 2C.
Max. daily temp. (°C)     29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold   119 164 131 152 206 169 122 143  63 208 155  96 125 248 139
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the
scatterplot? Was it appropriate to find r for this data? Explain.
2 a Use your calculator to find Pearson’s correlation coefficient for the data given in
question 6, Exercise 2C:
Score                     25  31  30  38  55  20  39  47  35  45  32  33  34
Minutes spent preparing   75  30  35  65 110  60  40  80  56  70  50 110  18
Score                     38  17  38  17  17  26  41  50  30  45  36  23
Minutes spent preparing   80  22  30  15  10  85 100  60  55  80  50  75
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the
scatterplot? Was it appropriate to find r for this data? Explain.
3 [VCAA FM 2000 Q5]
The scatterplot alongside shows the birth
rate and the average food intake for 14 different countries.
[Figure: scatterplot of birth rate against average food intake (1000 calories per person).]
The value of the product moment correlation
coefficient, r, for this data is closest
to:
A −0.6 B −0.2 C 0.2
D 0.6 E 0.9
4 Which one of the following is true for Pearson’s correlation coefficient r?
A The addition of an outlier to a set of data would always result in a lesser value
of r.
B An r value of 1 represents a stronger relationship between the variables than an
r value of −1.
C A high value of r means that one variable is causing the other variable to change.
D An r value of −0.8 means that as the independent variable increases, the dependent
variable will tend to decrease.
E It can take values between 0 and 1 inclusive.
5 The following pairs of variables were measured and a strong positive correlation between
them was found. Discuss whether a causal relationship exists between the variables. If
not, suggest a third variable to which they may both be related.
a The lengths of one’s left and right feet.
b The damage caused by a fire and the number of firemen who attend it.
c Company expenditure on advertising, and sales.
d The height of parents and the height of their adult children.
e The number of hotels and the number of churches in rural towns.
E THE COEFFICIENT OF DETERMINATION
In a bivariate set of numerical data, the coefficient of determination gives us a means of
measuring the influence that one variable has over the other variable.
CALCULATION OF THE COEFFICIENT OF DETERMINATION
Coefficient of determination = r² = (Pearson’s correlation coefficient)²
r² is found on the linear regression screen of
your calculator as shown opposite.
Alternatively, if the value of r is known, then
this can simply be squared.
INTERPRETATION OF THE COEFFICIENT OF DETERMINATION
r² indicates the strength of association between the dependent or response
variable and the independent or explanatory variable.
If there is a causal relationship then r² indicates the degree to which change in the explanatory
variable explains change in the response variable.
For example: An investigation into many different brands of muesli found that there is strong positive correlation between the variables fat content and kilojoule content.
Pearson’s correlation coefficient, r, was found to be 0.8625.
The coefficient of determination for this study is (0.8625)² = 0.7439.
An interpretation of this r² value is “the proportion of variation in kilojoule content that can
be explained by the variation in fat content of muesli is 0.7439.”
It is usual to quote the coefficient of determination as a percentage. A proportion of 0.7439 is
equivalent to 0.7439 × 100 = 74.39%.
The interpretation becomes:
74.39% of the variation in kilojoule content (the dependent variable) of muesli can be explained by the variation in
fat content (the independent variable) of muesli.
If 74.4% of the variation in kilojoule content of muesli can be explained by the fat content of
muesli then we can assume that the other 25.6% (100% − 74.4%) of the variation in kilojoule
content of muesli can be explained by other factors (which may or may not be known).
Note:
• Since −1 ≤ r ≤ 1, 0 ≤ r² ≤ 1.
• If r = −0.625 then r² = (−0.625)² = 0.3906, a positive value.
• It is only appropriate to use r² values, like r values, in situations where there is a
linear relationship between the two variables.
• r² values of 10% or more are worth mentioning.
• If you are finding an r value from an r² value then you must consider that the r
value can be positive or negative. The solutions to r² = a are r = √a and
r = −√a. Your calculator will only give you a positive value.
Example 2
A study has found that 45% of the variation in selling price can be explained by the variation in age of a used car.
If this statement was based on the coefficient of determination then what would be the
value of Pearson’s correlation coefficient for this study?
We are told that r² = 0.45 so r is a square root of 0.45. (√0.45 ≈ 0.6708)
At this point we need to consider the variables involved: selling price and age of a car. We would assume that as the age of a car increases then the selling price of a car would decrease, i.e., there is negative correlation between the variables.
Hence we can conclude for this study that Pearson’s correlation coefficient, r, will be −0.6708.
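The two calculations in this section, squaring r to get r² and recovering r from r², can be sketched in code. The function names below are invented for illustration. The second function assumes the direction of the association is supplied by the user, since it has to be judged from the variables or the scatterplot and cannot be read off r² itself.

```python
from math import sqrt


def coefficient_of_determination(r):
    """Square Pearson's r to get r^2, the proportion of variation explained."""
    return r ** 2


def r_from_r_squared(r_squared, direction):
    """Recover r from r^2; the sign comes from the direction of the association."""
    if direction not in ("positive", "negative"):
        raise ValueError("direction must be 'positive' or 'negative'")
    root = sqrt(r_squared)
    return root if direction == "positive" else -root


# Muesli example: r = 0.8625 between fat content and kilojoule content.
r_squared = coefficient_of_determination(0.8625)
explained = 100 * r_squared
print(f"r^2 = {r_squared:.4f}, explained = {explained:.2f}%, other factors = {100 - explained:.2f}%")

# Example 2: r^2 = 0.45, and selling price falls as age rises (negative association).
r = r_from_r_squared(0.45, "negative")
print(f"r = {r:.4f}")
```

The first line of output reports r² = 0.7439, or 74.39% explained, and the second reports r = -0.6708, matching the worked values above.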
EXERCISE 2E
1 In an investigation the coefficient of determination for the variables preparation time
and exam score is found to be 0.5624. Complete the following interpretation of the
coefficient of determination:
...... % of the variation in .......... can be explained by the .......... in preparation time.
2 For each of the following find the value of the coefficient of determination correct to
four decimal places, and interpret it in terms of the variables.
a An investigation has found the association between the variables time spent gambling
and money lost has an r value of 0.4732.
b For a group of children a product-moment correlation coefficient of −0.365 is found
between the variables heart rate and age.
c In a study of a sample of countries, Pearson’s correlation coefficient for the variables
female literacy and gross domestic product is found to be 0.7723.
3 A study of the relationship between stress levels and productivity has produced a
product-moment correlation coefficient of 0.5629. Which one of the following would be an
interpretation that could be made from this study?
A 56.3% of the variation in productivity can be explained by the variation in stress
levels.
B 75% of the variation in productivity can be explained by the variation in stress
levels.
C 31.7% of the variation in productivity is caused by the variation in stress levels.
D 56.3% of the variation in productivity is caused by the variation in stress levels.
E 31.7% of the variation in productivity can be explained by the variation in stress
levels.
4 A rural school has investigated the relationship between the time spent travelling to
school (minutes) and a student’s year ten average (%) for a sample of students.
The results are given in the table below:
Travel time (mins)     10 33 18 43 34 30 24 47 44 41 17 45 39 31 23 11 14 25 16 17
Year 10 average (%)    51 78 97 56 90 70 64 67 37 46 95 67 31 57 43 99 98 82 40 67
a Construct a scatterplot of the data and interpret the scatterplot.
b Find Pearson’s correlation coefficient for the data and interpret.
c Calculate the coefficient of determination and interpret this in terms of the variables.