DATA ANALYSIS – CORE MATERIAL

A DATA

WHAT IS A STATISTICAL INVESTIGATION?

When information for a statistical investigation is collected and recorded, the information is referred to as data.

There are four processes involved in a statistical investigation:

Collection of data (information)

Data for a statistical investigation can be collected from records, from surveys (face-to-face, telephone, or postal), by direct observation, or by measuring or counting. Unless the correct data is collected, valid conclusions cannot be made.

Organisation and display of data

Data can be organised into tables and displayed on a graph. This allows us to identify features of the data more easily.

Calculation of descriptive statistics

Some statistics used to describe a set of data are the centre and the spread of the data. These give us a picture of the sample or population under investigation.

Interpretation of statistics

This process involves explaining the meaning of the table, graph or descriptive statistics in terms of the variable, or theory, being investigated.

The variable is the subject that we are investigating.

COLLECTION OF DATA

The entire group of objects from which information is required is called the population.

Gathering statistical information properly is vitally important. If it is gathered incorrectly, then any resulting analysis of the data would almost certainly lead to incorrect conclusions about the population.

The gathering of statistical data may take the form of:

• a census, where information is collected from the whole population, or
• a survey, where information is collected from a much smaller group of the population, called a sample.

For example:

• The Australian Bureau of Statistics conducts a census of the whole population of Australia every five years.
• In opinion polls before an election, a survey is conducted to see which way a sample of the population will vote.
• The students in a school are to vote for a new school captain. If 20 students from the school are asked how they will vote, then the population is all the students who attend the school, and the 20 students are a sample.

When taking a sample it is hoped that the information gathered is representative of the entire population.

For accurate information when sampling, it is essential that:

• the number of individuals in the sample is large enough
• the individuals involved in the survey are randomly chosen from the population. This means that every member of the population has an equal chance of being chosen.

If the individuals are not randomly chosen or the sample is too small, the data collected may be biased towards a particular outcome.

For example:

If the purpose of a survey is to investigate how the population of Melbourne will vote at the

next election, then surveying the residents of only one suburb would not provide information

that represents all of Melbourne.
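The idea that every member of the population has an equal chance of being chosen can be illustrated with a short Python sketch. It is an illustration only, not part of the original text: the population of 1000 numbered residents is hypothetical, `draw_simple_random_sample` is a made-up helper name, and the standard library function `random.sample` is used to draw without replacement.

```python
import random

def draw_simple_random_sample(population, sample_size, seed=None):
    """Return a simple random sample (without replacement) from population."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: residents numbered 1 to 1000.
population = list(range(1, 1001))
sample = draw_simple_random_sample(population, 50, seed=1)

print(len(population))  # population size
print(len(sample))      # sample size
```

Surveying only one suburb corresponds to drawing from a slice of `population` rather than from the whole list, which is why such a sample cannot be expected to be representative.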

TYPES OF DATA

Data are individual observations of a variable. A variable is a quantity that can have a value recorded for it or to which we can assign an attribute or quality.

Two types of variable that we commonly deal with are categorical variables and numerical variables.

CATEGORICAL VARIABLES

A quality or category is recorded for a categorical variable. The information collected is called categorical data.

Examples of categorical variables and their possible categories include:

Colour of eyes: blue, brown, hazel, green and violet
Continent of birth: Europe, Asia, North America, South America, Africa, Australia and Antarctica
Gender: male or female
Type of car: General Motors, Toyota, Ford, Mazda, BMW, Subaru, etc.

NUMERICAL VARIABLES

A number is recorded for a numerical variable. The information collected is called numerical data.

There are two types of numerical variables:

Discrete numerical variables

A discrete variable can only take distinct values and these values are often obtained by counting.

Examples of discrete numerical variables and their possible values include:

The number of children in a family: 0, 1, 2, 3, ...
The score on a test, out of 30 marks: 0, 1, 2, ..., 29, 30.

Chapter 1 UNIVARIATE DATA 11



Continuous numerical variables

A continuous numerical variable can theoretically take any value on a part of the number line. Its value often has to be measured.

Examples of continuous numerical variables and their possible values include:

The height of Year 9 students: any value from about 120 cm to 200 cm
The speed of cars on a stretch of highway: any value from 0 km/h to the fastest speed that a car can travel, but most likely in the range 30 km/h to 120 km/h
The weight of newborn babies: any value from 0 kg to 10 kg, but most likely in the range 0.5 kg to 5 kg
The time taken to run 100 m: any value from 9 seconds to 30 seconds.

EXERCISE 1A

1 40 students, from a school with 820 students, are randomly selected to complete a survey on their school uniform. In this situation:

a What is the population size?
b What is the size of the sample?

2 A television station is conducting a viewer telephone poll, in which viewers phone into the station, on the question ‘Should Australia become a republic?’

a What is the population being surveyed in this situation?
b How is the data biased if it is used to represent the views of all Australians?

3 A polling agency is employed to survey the voting intention of residents of a particular electorate in the next election. From the data collected they are to predict the election result in that electorate.

Explain why each of the following situations would produce a biased sample.

a A random selection of people in the local large shopping complex is surveyed between 1 pm and 3 pm on a weekday.
b All the members of the local golf club are surveyed.
c A random sample of people on the local train station between 7 am and 9 am is surveyed.
d A doorknock is undertaken, surveying every voter in a particular street.

4 Classify the following data as categorical, discrete numerical or continuous numerical:

a the quantity of soil in a particular size of potplant
b the number of pages in a daily newspaper
c the number of cousins a person has
d the speed of cars on a particular stretch of highway
e the state of Australia where a person was born
f the maximum daily temperature in Melbourne
g the manufacturer of a car
h the preferred football code
i the position taken by a player on a football field
j the time it takes 12-year-olds to run one kilometre
k the length of feet
l the number of goals shot by a netballer
m the amount spent weekly, by an individual, at the supermarket.




5 A sample of public trees in a municipality was surveyed for the following data:

a the diameter of the tree (in centimetres) measured 1 metre above the ground

b the type of tree

c the location of the tree (nature strip, park, reserve, roundabout)

d the height of the tree, in metres

e the time (in months) since the last inspection

f the number of inspections since planting

g the condition of the tree (very good, good, fair, unsatisfactory).

Classify the data collected as categorical, discrete numerical or continuous numerical.

B ORGANISING AND DISPLAYING DATA

CATEGORICAL DATA

Tally and frequency tables are used to organise categorical data, and there are several types of graphs that can be used to display the data effectively.

For example:

A centrally-located school is investigating how its students get to school. This is of interest to the school because of local traffic problems. A sample of 50 students was asked which of five modes of transport (train, tram, bus, walk, private car) they used most.

The results were:

B B C W Tn    Tn Tn Tm C C    W C C B C    C W B B Tn    Tm C B W Tn
W W Tn Tn C   Tm Tn C C Tm    B B B B W    C C B W C     Tn B C B B

(Tn = train, Tm = tram, B = bus, W = walk, C = private car)

The variable ‘mode of transport to school’ is a categorical variable.

We can organise the data using a tally and frequency table. One stroke for each data value is recorded in the tally column. ||||/ (four strokes crossed by a fifth) represents a tally of five.

Mode of transport   Tally               Frequency
Train               ||||/ ||||          9
Tram                ||||                4
Bus                 ||||/ ||||/ ||||    14
Walk                ||||/ |||           8
Private car         ||||/ ||||/ ||||/   15
Total                                   50

From the frequency table we can see:

• The most favoured ‘mode of transport’ in the sample was ‘Private car’.
• 9 + 4 + 14 = 27 of the 50 students came by public transport (train, tram, or bus).
• Only 8 of the 50 students (16%) walked to school.
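The tally and frequency table for the transport survey can also be produced programmatically. The following is a minimal Python sketch, not part of the original text: it assumes the 50 responses are typed in as one string of the codes Tn, Tm, B, W and C, and the helper name `frequency_table` is hypothetical. The standard library class `collections.Counter` does the counting.

```python
import re
from collections import Counter

# The 50 survey responses, using the codes from the text.
RESPONSES = (
    "BBCWTn TnTnTmCC WCCBC CWBBTn TmCBWTn "
    "WWTnTnC TmTnCCTm BBBBW CCBWC TnBCBB"
)
CODES = {"Tn": "Train", "Tm": "Tram", "B": "Bus", "W": "Walk", "C": "Private car"}

def frequency_table(responses):
    """Count how often each transport code occurs in the response string."""
    # The two-letter codes are listed first so 'Tn' and 'Tm' are read whole.
    tokens = re.findall(r"Tn|Tm|B|W|C", responses)
    counts = Counter(tokens)
    return {CODES[code]: counts[code] for code in CODES}

table = frequency_table(RESPONSES)
for mode, frequency in table.items():
    print(f"{mode:<12}{frequency:>3}")
print(f"{'Total':<12}{sum(table.values()):>3}")
```

Checking that the total equals the sample size (50) is a quick way to catch a response that was missed or counted twice.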




GRAPHS TO DISPLAY CATEGORICAL DATA

1 A barchart (or column graph) is usually drawn with the categories along the horizontal axis and the frequency on the vertical axis.

Each bar (or column) is drawn with height equal to the frequency of its category.

The ‘bars’ are equally spaced (not joined together) and are of the same width.

Below is a barchart for the example. Note: A barchart can also be drawn with horizontal bars.

[Figure: barchart of ‘Mode of transport to school’ (train, tram, bus, walk, private car) against frequency, drawn once with vertical bars and once with horizontal bars]

2 A segmented barchart is a single ‘bar’ divided into segments so that the length of each segment is proportional to the frequency.

A percentaged segmented barchart can also be produced.

The percentage for each category is calculated using $\frac{\text{frequency of category}}{\text{total}} \times 100\%$.

For example, for the traffic data shown previously:

The category with the highest frequency of 15 was Private car.

So, $\frac{15}{50} \times 100\% = 30\%$ of the students came by private car.

27 students came by public transport.

So the percentage who came by public transport was $\frac{27}{50} \times 100\% = 54\%$.

Following is a segmented barchart and a percentaged segmented barchart for the above example.

The segments can be labelled, or shaded including a legend.

[Figure: segmented barchart (frequency scale 0 to 50) and percentaged segmented barchart (% frequency scale 0% to 100%) for mode of transport to school, with segments for train, tram, bus, walk and private car]
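The percentage calculation behind a percentaged segmented barchart can be sketched in Python. This is an illustrative sketch only, using the frequencies from the transport example; the function name `percentages` is a hypothetical helper, and the running totals show where each segment of the bar would end.

```python
from itertools import accumulate

FREQUENCIES = {"Train": 9, "Tram": 4, "Bus": 14, "Walk": 8, "Private car": 15}

def percentages(frequencies):
    """Convert each frequency to a percentage of the total."""
    total = sum(frequencies.values())
    return {category: 100 * f / total for category, f in frequencies.items()}

percent = percentages(FREQUENCIES)
# Cumulative percentages give the upper boundary of each segment.
boundaries = list(accumulate(percent.values()))

for (category, p), upper in zip(percent.items(), boundaries):
    print(f"{category:<12}{p:5.1f}%  segment ends at {upper:5.1f}%")
```

The last boundary should always be 100%, which is a useful check on the arithmetic.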




EXERCISE 1B.1

1 55 randomly selected Year 8 students were asked to nominate their favourite subject studied at school. The results of the survey are displayed in the barchart alongside.

[Figure: horizontal barchart of frequency (scale 0 to 10) for each subject: English, Mathematics, Science, Language, History, Geography, Music, Art]

a Which subject was the most favoured?
b How many students chose Art as their favourite subject?
c What percentage of the students nominated Mathematics as their favourite subject?
d What percentage of the students chose either Music or Art as their favourite subject?

2 A randomly selected sample of adults was asked to nominate the evening television news service that they watched. The following results were obtained:

News Service   Frequency
ABC            40
Channel 7      45
Channel 9      64
Channel 10     25
SBS            23
None           3

a Construct a barchart for this data.
b Use the table and graph to answer the following questions about the data.
i How many adults were surveyed?
ii Which news service is the most popular?
iii What percentage of those surveyed watched the most popular news service?
iv What percentage of those surveyed watched the news service on Channel 7?

3 Construct a percentaged segmented barchart for the following categorical data, shading the categories and including a legend.

Expenditure item   Weekly household expenditure ($)
Food               60
Clothing           30
Rent               120
Travel             15
Utilities          30
Entertainment      45

DISCRETE NUMERICAL DATA

A discrete numerical variable can take only distinct values. The data is often obtained by counting.

For example, a farmer has a crop of peas and wishes to investigate the number of peas in the pods. He takes a random sample of 50 pods and counts the number of peas in each pod, obtaining the following data:

6 6 5 4 9 8 7 7 7 6 5 6 7 8 8 8 7 5 2 4 7 7 6 7 8
8 7 8 6 6 4 2 9 1 3 3 5 9 8 8 7 7 6 7 7 6 8 4 5 5

The variable in this situation is the discrete numerical variable ‘the number of peas in a pod’.

The data could only take the discrete numerical values 0, 1, 2, 3, 4, ....




TABLES AND GRAPHS

To organise his data the farmer could use the tally and frequency table shown. A barchart could be used to display the results.

No. peas in pod   Tally               Frequency
1                 |                   1
2                 ||                  2
3                 ||                  2
4                 ||||                4
5                 ||||/ |             6
6                 ||||/ ||||          9
7                 ||||/ ||||/ |||     13
8                 ||||/ ||||/         10
9                 |||                 3
Total                                 50

[Figure: barchart of frequency against number of peas in pod]

Alternatively, the farmer could use a dot plot, which is a convenient method of tallying the data and at the same time displaying the frequencies.

To draw a dot plot:

1 Draw a horizontal axis and mark it with the values that the variable can take. For this example, the variable took values from 1 to 9, so we mark the axis from 0 to 10.
2 Label the axis with a description, in this case: number of peas in pod.
3 Systematically go through the data, placing a dot or cross above the appropriate position on the axis.

The dot plot for this example is:

[Figure: dot plot of number of peas in pod, axis marked from 0 to 10]

Notice that the dots are evenly spaced so the final plot looks similar to the barchart.

From both the barchart and the dot plot it can be seen that:

• Seven was the most frequently occurring number of peas in a pod.
• $\frac{35}{50} \times 100\% = 70\%$ of the pods yielded six or more peas.
• 10% of the pods had fewer than 4 peas in them.

DESCRIBING THE DISTRIBUTION OF A SET OF DATA

The distribution of a set of data is the pattern or shape of its graph.

For the example above, the graph has the general shape shown alongside:

[Figure: curve with its peak on the right and a long tail stretched to the left]

This distribution of the data is said to be negatively skewed because it is stretched to the left (the negative direction).
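The tally and dot plot for the pea-pod data can also be produced programmatically. The sketch below is illustrative rather than part of the original text: it uses the 50 pod counts listed earlier, the helper name `dot_plot` is hypothetical, and the plot is drawn with text characters, one `*` per data value stacked above its position on the axis.

```python
from collections import Counter

PEAS = [
    6, 6, 5, 4, 9, 8, 7, 7, 7, 6, 5, 6, 7, 8, 8, 8, 7, 5, 2, 4, 7, 7, 6, 7, 8,
    8, 7, 8, 6, 6, 4, 2, 9, 1, 3, 3, 5, 9, 8, 8, 7, 7, 6, 7, 7, 6, 8, 4, 5, 5,
]

def dot_plot(data, low, high):
    """Return the lines of a text dot plot for integer data on [low, high]."""
    counts = Counter(data)
    tallest = max(counts.values())
    lines = []
    # Build rows from the top of the tallest column down to the axis.
    for level in range(tallest, 0, -1):
        row = "".join(
            " * " if counts[v] >= level else "   " for v in range(low, high + 1)
        )
        lines.append(row.rstrip())
    # The axis: each value is centred in a cell three characters wide.
    lines.append("".join(f"{v:^3}" for v in range(low, high + 1)).rstrip())
    return lines

counts = Counter(PEAS)
print("\n".join(dot_plot(PEAS, 0, 10)))

six_or_more = sum(f for v, f in counts.items() if v >= 6)
print(f"{100 * six_or_more / len(PEAS):.0f}% of the pods had six or more peas")
```

In the printed plot the columns are tallest on the right and tail off towards the left, which is the negatively skewed shape described in the text.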




A positively skewed distribution of data would have a shape:

[Figure: curve with its peak on the left and a long tail stretched to the right]

A symmetrical distribution of data is neither positively nor negatively skewed, but is symmetrical about a central value.

A set of data whose graph has two peaks is said to be bimodal.

Outliers are data values that are either much larger or much smaller than the general body of data. Outliers appear separated from the body of data on a frequency graph.

For the example, if the farmer found one pod in his sample contained 13 peas then the data value 13 would be considered an outlier. It is much larger than the other data in the sample. On the column graph it appears separated.

[Figure: column graph of frequency against number of peas in pod, axis from 0 to 13, with the value 13 marked as an outlier]

Note that the horizontal axis is a number line with numbers in ascending order from left to right.

EXERCISE 1B.2

1 A randomly selected sample of households has been asked, “How many people live in your household?” A column graph has been constructed for the results.

[Figure: column graph ‘Size of households’, frequency (scale 0 to 8) against number of people in the household (1 to 10)]

a How many households were surveyed?
b How many households had only one or two occupants?
c What percentage of the households had five or more occupants?
d Describe the distribution of the data.

2 a Construct a barchart for the discrete numerical data below.

Number of toothpicks   Frequency
33                     1
34                     5
35                     7
36                     13
37                     12
38                     8
39                     2

b Comment on the distribution of the data (positively or negatively skewed or symmetric).



3 A bowler has recorded the number of wickets he has taken in each of the last 30 innings he has played:

1 1 3 2 0 0 4 2 2 4 3 1 0 1 0 2 1 5 1 3 7 2 2 2 4 3 1 1 0 3

a Construct a dot plot for the raw data.
b Comment on the distribution of the data, noting any outliers.

4 For an investigation into the number of phonecalls made by teenagers, a sample of 50 fifteen-year-olds were asked the question, “How many phonecalls did you make yesterday?” The following dot plot was constructed for the data.

[Figure: dot plot ‘The number of phone calls made in a day by a sample of 50 fifteen year olds’, axis: number of phone calls, 0 to 11]

a What is the variable in this investigation?
b Explain why the data is discrete numerical data.
c What percentage of the fifteen-year-olds did not make any phonecalls?
d What percentage of the fifteen-year-olds made 5 or more phonecalls?
e Copy and complete: “The most frequent number of phonecalls made was .........”.
f Describe the distribution of the data.
g How would you describe the data value ‘11’?

5 The number of matches in a box is stated as 50, but the actual number of matches has been found to vary. To investigate this, the number of matches in a box is counted for a sample of 60 boxes:

51 50 50 51 52 49 50 48 51 50 47 50 52 48 50 49 51 50 50 52
52 51 50 50 52 50 53 48 50 51 50 50 49 48 51 49 52 50 49 50
50 52 50 51 49 52 52 50 49 50 49 51 50 50 51 50 53 48 49 49

a What is the variable in this investigation?
b Is the data continuous or discrete numerical data?
c Construct a dot plot for this data.
d Describe the distribution of the data.
e What percentage of the boxes contained exactly 50 matches?

CONTINUOUS NUMERICAL DATA

The height of 14-year-old children is being investigated. The variable ‘height of 14-year-old children’ is a continuous numerical variable because the values recorded for the variable could, theoretically, be any value on the number line. They are most likely to fall between 120 and 190 centimetres.

The heights of thirty children are measured in centimetres. The measurements are rounded to one decimal place, and the values recorded below:

163.0 154.2 152.8 160.5 148.3 149.2 154.7 172.7 171.3 162.5
165.0 160.2 166.2 175.3 143.4 174.6 180.9 162.4 167.3 158.4
159.4 164.5 163.7 183.8 150.8 163.4 181.9 158.3 165.0 156.8




Note that these rounded values are actually discrete. However, when we tally them, we use continuous class intervals as follows:

The smallest height is 143.4 cm and the largest is 183.8 cm, so we will use class intervals 140 up to 150 (this does not include 150), 150 up to 160, 160 up to 170, 170 up to 180, and 180 up to 190. Note that we choose class intervals of the same width.

These class intervals are written as 140 - , 150 - , 160 - , etc. in the frequency table. The final class interval is written as 180 - < 190 which means 180 cm up to a height that is less than 190 cm.

A tally-frequency table for this example is:

Height (cm)   Tally               Frequency
140 -         |||                 3
150 -         ||||/ |||           8
160 -         ||||/ ||||/ ||      12
170 -         ||||                4
180 - < 190   |||                 3
Total                             30

A histogram is used to display continuous numerical data. This is similar to a barchart but because of the continuous nature of the variable, the ‘bars’ are joined together. The frequency is represented by the height of the ‘bars’.

A histogram for this example is shown opposite:

[Figure: histogram ‘Heights of a sample of fourteen-year-old children’, frequency (0 to 12) against height (cm), 140 to 190]

Note: The two oblique lines that cross the horizontal axis indicate that the numbers on this axis are not starting at zero. This can also be shown using [symbol not reproduced].

A relative frequency table and histogram can also be drawn:

Height (cm)   Frequency   Relative %
140 -         3           $\frac{3}{30} \times 100\% = 10\%$
150 -         8           26.7%
160 -         12          40%
170 -         4           13.3%
180 - < 190   3           10%
Total         30          100%

[Figure: relative frequency histogram, relative frequency % (0 to 40) against height (cm), 140 to 190]

From the tables and graphs we can see:

• More children had a height in the class interval 160 up to 170 cm than any other class interval. This class interval is called the modal class. $\frac{12}{30} \times 100\% = 40\%$ of the children had a height in this class.
• Three of the children ($\frac{3}{30} \times 100\% = 10\%$) had a height less than 150 cm.
• Three of the children (10%) were 180 cm or more tall.
• The distribution of heights was approximately symmetrical.
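Grouping the heights into class intervals and computing relative frequencies can be sketched as follows. This is an illustrative Python sketch, not part of the original text: it assumes equal-width classes of 10 cm starting at 140 cm, and the helper name `class_frequencies` is hypothetical. Each class includes its lower boundary and excludes its upper boundary, matching the ‘140 up to 150’ convention.

```python
HEIGHTS = [
    163.0, 154.2, 152.8, 160.5, 148.3, 149.2, 154.7, 172.7, 171.3, 162.5,
    165.0, 160.2, 166.2, 175.3, 143.4, 174.6, 180.9, 162.4, 167.3, 158.4,
    159.4, 164.5, 163.7, 183.8, 150.8, 163.4, 181.9, 158.3, 165.0, 156.8,
]

def class_frequencies(data, start, width, classes):
    """Count the values falling in each class [lower, lower + width)."""
    counts = [0] * classes
    for value in data:
        index = int((value - start) // width)
        if not 0 <= index < classes:
            raise ValueError(f"{value} lies outside the class intervals")
        counts[index] += 1
    return counts

frequencies = class_frequencies(HEIGHTS, start=140, width=10, classes=5)
total = sum(frequencies)
for i, f in enumerate(frequencies):
    lower = 140 + 10 * i
    print(f"{lower} - < {lower + 10}: {f:2d}  {100 * f / total:5.1f}%")

modal_index = frequencies.index(max(frequencies))
print(f"modal class: {140 + 10 * modal_index} up to {150 + 10 * modal_index} cm")
```

Raising an error for values outside the intervals is a design choice in this sketch; it guards against silently dropping a measurement when the class boundaries are chosen too narrowly.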

EXERCISE 1B.3

1 Construct a histogram for the following continuous numerical data.

Time to complete 100 m swim (secs)   Number of swimmers
50 -                                 3
55 -                                 6
60 -                                 16
65 -                                 11
70 -                                 2
75 - < 80                            2

2 The speed of vehicles travelling along a section of highway has been recorded and displayed using the histogram alongside.

[Figure: histogram of number of vehicles (scale 0 to 200) against speed (km/h), 50 to 130]

a How many vehicles were included in this survey?
b What percentage of the vehicles were travelling at speeds equal to or greater than 100 km/h?
c What percentage of the vehicles were travelling at a speed from 100 up to 110 km/h?
d What percentage of the vehicles were travelling at a speed less than 80 km/h?
e If the owners of the vehicles travelling at 110 km/h or more were fined $165 each, what amount would be collected in fines?

3 The daily maximum temperature (°C) to the nearest degree, in Melbourne, for each day in January 2001, is recorded below:

34 38 31 38 23 24 25 26 29 35 41 23 32 36 22 21
24 26 35 36 25 32 27 30 34 30 27 25 26 23 25

a Using class intervals of 5 degrees construct a tally and frequency table for the data.
b Construct a histogram to display the data.
c Describe the distribution of Melbourne’s daily maximum temperatures in January 2001.

4 The height of each member of a basketball squad has been measured and the results are displayed using the frequency table below.

Height (cm)   Frequency
165 -         1
170 -         3
175 -         5
180 -         12
185 -         7
190 -         5
195 -         2
200 - < 205   1

a Calculate the relative frequencies and construct a relative frequency histogram for the data.
b Comment on the distribution of the heights.
c Find the percentage of members of the squad whose height is
i greater than 180 cm
ii less than 170 cm
iii between 175 and 190 cm.




C STEM-AND-LEAF PLOTS (STEMPLOTS)

Constructing a stem-and-leaf plot, commonly called a stemplot, is often a convenient method to organise and display a set of numerical data. A stemplot groups the data and shows the relative frequencies but has the added advantage of retaining the actual data values.

CONSTRUCTING A STEMPLOT

Data values such as 25 36 38 49 23 46 47 15 28 38 34 are all two digit numbers, so the first digit will be the ‘stem’ and the last digit the ‘leaf’ for each of the numbers. The stems will be 1, 2, 3, 4 to allow for numbers from 10 to 49.

The stemplot for the data is shown below.

Stem | Leaf
1    | 5
2    | 3 5 8
3    | 4 6 8 8
4    | 6 7 9          2 | 3 means 23

Notice that:

• 1 | 5 represents 15
• 2 | 3 5 8 represents 23, 25 and 28
• the data in the leaves is evenly spaced with no commas
• the leaves are placed in increasing order, so this stemplot is ordered
• the scale (sometimes called the key) tells us the place value of each leaf.

If the scale was 2 | 3 means 2.3, then 4 | 6 7 9 would represent 4.6, 4.7 and 4.9.

For data values such as 195 199 207 183 201 ...... the first two digits are the stem and the last digit is the leaf.

Example 1

The score, out of 50, on a test was recorded for 36 students.

25 36 38 49 23 46 47 15 28 38 34 9
30 24 27 27 42 16 28 31 24 46 25 31
37 35 32 39 43 40 50 47 29 36 35 33

a Organise the data using a stemplot.
b Comment on the distribution of the data.

a Recording the data from the list gives an unordered stemplot:

Stem | Leaf
0    | 9
1    | 5 6
2    | 5 3 8 4 7 7 8 4 5 9
3    | 6 8 8 4 0 1 1 7 5 2 9 6 5 3
4    | 9 6 7 2 6 3 0 7
5    | 0              2 | 4 means 24 marks

Ordering the data from smallest to largest for each stem gives an ordered stemplot:

Stem | Leaf
0    | 9
1    | 5 6
2    | 3 4 4 5 5 7 7 8 8 9
3    | 0 1 1 2 3 4 5 5 6 6 7 8 8 9
4    | 0 2 3 6 6 7 7 9
5    | 0




b The shape of the distribution can be seen when the stemplot is rotated:

[Figure: the ordered stemplot rotated so that the stems run along the horizontal axis and the leaves form columns]

The data is slightly negatively skewed.

We also observe these important features:

• The minimum (smallest) test score is 9.
• The maximum (largest) test score is 50.
• The modal class is 30 - 39.

SPLIT STEMS

Consider the following example:

The residue that results when a cigarette is smoked collects in the filter. This residue has been weighed for twenty cigarettes, giving the following data, in milligrams.

1.62 1.55 1.59 1.56 1.56 1.55 1.63
1.59 1.56 1.69 1.61 1.57 1.56 1.55
1.62 1.61 1.52 1.58 1.63 1.58

Scanning the data reveals that there will be only two ‘stems’, i.e., 15 and 16. In cases like this we will need to split the stems.

If we use the stem 15 to represent data with values 1.50 to 1.54 and 15* to represent data with values 1.55 to 1.59 etc., we can construct a stemplot with four stems:

Stem | Leaf
15   | 2
15*  | 5 5 5 6 6 6 6 7 8 8 9 9
16   | 1 1 2 2 3 3
16*  | 9              15 | 2 means 1.52

If we split the stems five ways, where 150 represents data with values 1.50 and 1.51, 152 represents data with values 1.52 and 1.53 etc., the stemplot becomes:

Stem | Leaf
150  |
152  | 2
154  | 5 5 5
156  | 6 6 6 6 7
158  | 8 8 9 9
160  | 1 1
162  | 2 2 3 3
164  |
166  |
168  | 9

The stemplot with the stems split five ways clearly gives a better view of the distribution of the data. The value 1.69 appears as an outlier in this graph.

The stemplot with the stems split two ways was not sensitive enough to show this.
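An ordered stemplot, with or without split stems, can be sketched in Python. The code below is an illustration only, not part of the original text: the function name `stemplot` and its `ways` parameter are hypothetical, values are supplied as whole numbers (the residue weights are entered in hundredths of a milligram), and when `ways` is greater than 1 each row is labelled with the stem followed, in brackets, by the first leaf digit that row covers. Only splits that divide the ten leaf digits evenly (1, 2 or 5 ways) are handled, and rows before the smallest value or after the largest are not printed.

```python
def stemplot(values, ways=1):
    """Return the rows of an ordered stemplot as (label, leaves) pairs.

    values: non-negative integers; the last digit of each is its leaf.
    ways:   number of rows per stem (1, 2 or 5).
    """
    if ways not in (1, 2, 5):
        raise ValueError("ways must be 1, 2 or 5")
    band = 10 // ways                      # leaf digits covered by each row
    rows = {}
    for value in sorted(values):           # sorting makes the plot ordered
        stem, leaf = divmod(value, 10)
        rows.setdefault((stem, leaf // band), []).append(leaf)
    first, last = min(rows), max(rows)
    result = []
    # Walk every row between the smallest and largest, keeping empty rows.
    for stem in range(first[0], last[0] + 1):
        for part in range(ways):
            if first <= (stem, part) <= last:
                label = str(stem) if ways == 1 else f"{stem}({part * band})"
                result.append((label, rows.get((stem, part), [])))
    return result

SCORES = [25, 36, 38, 49, 23, 46, 47, 15, 28, 38, 34, 9,
          30, 24, 27, 27, 42, 16, 28, 31, 24, 46, 25, 31,
          37, 35, 32, 39, 43, 40, 50, 47, 29, 36, 35, 33]
RESIDUE = [162, 155, 159, 156, 156, 155, 163, 159, 156, 169,
           161, 157, 156, 155, 162, 161, 152, 158, 163, 158]

for label, leaves in stemplot(SCORES):
    print(f"{label:>6} | {' '.join(map(str, leaves))}")
print()
for label, leaves in stemplot(RESIDUE, ways=5):
    print(f"{label:>6} | {' '.join(map(str, leaves))}")
```

Keeping the empty rows matters: in the five-way split of the residue data it is the two empty rows before the final leaf that make 1.69 stand out as an outlier.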




EXERCISE 1C

1 A school has conducted a survey of 60 of their students to investigate the time it takes for students to travel to school. The following data gives the travel time to the nearest minute.

12 15 16 8 10 17 25 34 42 18 24 18 45 33 38 45 40 3 20 12
10 10 27 16 37 45 15 16 26 32 35 8 14 18 15 27 19 32 6 12
14 20 10 16 14 28 31 21 25 8 32 46 14 15 20 18 8 10 25 22

a Is travel time a discrete or continuous variable?
b Construct a stemplot for the data using stems 0, 1, 2, ....
c Describe the distribution of the data.
d Copy and complete: “Most students spent between ...... and ...... minutes travelling to school.”

2 The weight of 900 g loaves of bread varies slightly from loaf to loaf. A manufacturer of bread is concerned that he may be producing too many underweight loaves of bread in his 900 gram range. He weighs a sample of sixty 900 g loaves and records their weight to the nearest gram. Construct a stemplot for the following data and comment on the distribution of the data.

901 904 913 924 921 893 894 895 878 885 896 910 901 903 907
907 904 892 888 905 907 901 915 901 909 917 889 891 894 894
898 895 904 908 913 924 927 885 898 903 903 913 916 931 882
893 894 903 900 906 910 928 901 896 886 897 899 908 904 889

3 A taxi driver has recorded the fares, to the nearest dollar, of 60 passengers that he has collected from Melbourne airport:

25 32 35 16 39 18 19 25 16 41 40 43 16
13 9 48 42 20 20 22 23 33 35 24 23 14
34 37 36 36 44 51 22 48 55 13 16 20 26
30 12 30 33 35 41 17 22 54 24 20 21 35
42 43 54 28 38 37 46 25

a Construct a stemplot with stems 0, 1, 2, 3, ...... Comment on the distribution of the data.
b Construct a stemplot with two-way split stems. Comment on the feature of the distribution that is revealed by this split-stem stemplot.

4 The time spent (minutes) by 20 people in a queue at a bank, waiting to be attended by a teller, has been recorded:

3.4 2.1 3.8 2.2 4.5 1.4 0 0 1.6 4.8 1.5 1.9 0 3.6 5.2 2.7 3.0 0.8 3.8 5.2

Construct a stemplot for this data (include a legend). Comment on the distribution of the data.




D SAMPLE SUMMARY STATISTICS: MEASURES OF CENTRE

A picture of a data set can be obtained if we have an indication of the centre of the data and the spread of the data.

MEASURES OF CENTRE

Three statistics that provide a measure of the centre of a set of data are:

• the mean
• the median
• the mode.

THE MEAN

The mean $\bar{x}$ is the statistical name for ‘average’. The mean is calculated by adding all the data values $x$ then dividing this sum by the number of data values $n$.

mean $= \frac{\text{sum of the data values}}{\text{number of data values}}$, denoted $\bar{x} = \frac{\sum x}{n}$

Note: The Greek letter sigma, $\sum$, means ‘the sum of’.

• The mean involves all the data values.
• If you are told that the mean mark for a test is 65% then, unless every mark is exactly 65%, there will be some marks higher than 65% and some marks lower than 65%.
• The mean does not have to be one of the data values.

For example:

The mean number of children per family is 1.8 in Melbourne.

It is obvious that a family cannot have 1.8 children, but this statistic suggests that a typical family has either 1 or 2 children, with 2 children being the more common.

Example 2

Find the mean of the following data:

5 5 7 3 8 2 3 4 6 5 7 6 4

There are 13 data values in this set, so n = 13.

Mean $= \frac{5 + 5 + 7 + 3 + 8 + 2 + 3 + 4 + 6 + 5 + 7 + 6 + 4}{13} = \frac{65}{13} = 5$

Example 3

Megan has had three Maths tests and her mean (average) mark is 78.

a What is the total of Megan’s marks for the three tests?
b She scores 82 marks for her next test. What is the mean mark for the four tests?
c How many marks did she need to score for the fourth test so that her overall mean mark would increase to 80?

a The total number of marks for the three tests is 78 × 3 = 234.
b The average of her marks for the four tests is $\frac{234 + 82}{4} = 79$.
c To get an average mark of 80 in four tests, Megan needed to score a total of 4 × 80 = 320 marks. Hence she needed to score 320 − 234 = 86 marks on the fourth test to bring her overall mean mark to 80.
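The mean calculations in Examples 2 and 3 can be checked with a short Python sketch. This is illustrative only and not part of the original text; the function names `mean` and `score_needed` are hypothetical helpers.

```python
def mean(values):
    """Mean = sum of the data values / number of data values."""
    return sum(values) / len(values)

def score_needed(current_total, current_count, target_mean):
    """Score required on one more test to reach target_mean overall."""
    return target_mean * (current_count + 1) - current_total

data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
example_2 = mean(data)                           # Example 2

total_three = 78 * 3                             # Example 3a: total of three tests
mean_four = (total_three + 82) / 4               # Example 3b: mean after a fourth test
needed = score_needed(total_three, 3, 80)        # Example 3c

print(example_2, total_three, mean_four, needed)
```

Working from totals rather than from individual marks mirrors the method in Example 3: the three original marks are never needed, only their sum.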

THE MEDIAN

The median is the middle value of an ordered set of data.

An ordered set of data is the data listed from smallest to largest value (or largest to smallest).

The median splits the data set into two halves: half of the data have values less than or equal to the median and half have values greater than or equal to the median.

For example, if the median mark for a test is 65%, then half the marks scored are greater than or equal to 65% and half the marks scored are lower than or equal to 65%.

To find the median:

1 Order the data by rearranging the values from smallest to largest.
2 Locate the middle of the data values.

• If there is an odd number of data then the median will be one of the data values. The median is the $\frac{n + 1}{2}$th value in a data set of $n$ values.
• If there is an even number of data then the median is the average of the two middle values and may not be equal to any of the data values.

Example 4

Find the median for the following data sets:

a 5 5 7 3 8 2 3 4 6 5 7 6 4
b 3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10

a The data set is ordered (arranged from smallest to largest).

2 3 3 4 4 5 [5] 5 6 6 7 7 8

The median is the $\frac{13 + 1}{2} = 7$th value (shown in brackets).

The median is 5.

b There are 16 data values so the median is the average of the 8th and 9th values (shown in brackets).

3 5 5 5 5 6 6 [6] [7] 7 7 8 8 8 9 10

The median is $\frac{6 + 7}{2} = 6.5$ (Note: This is not one of the data values.)

THE MODE

The mode is the most frequently occurring value in the data set.

This statistic can usually be found easily from a frequency table, barchart or dot plot.

If there are two modes in a data set then the data can be described as bimodal.

If there are more than two modes then it is said that “the mode is not distinct” and the mode is not useful as a descriptive statistic.

For continuous data, the class interval with the highest frequency is the modal class.
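The procedures for finding the median and the mode can be sketched in Python. This is an illustrative sketch only, not part of the original text: `median` and `modes` are hypothetical helper names, and `modes` returns every value that shares the highest frequency so that bimodal data, or data with no distinct mode, can be recognised from the length of the result.

```python
from collections import Counter

def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]            # the (n + 1)/2 th value
    return (ordered[middle - 1] + ordered[middle]) / 2

def modes(values):
    """All values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, f in counts.items() if f == highest)

set_a = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
set_b = [3, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 10]

print(median(set_a), median(set_b))
print(modes(set_a))
```

For `set_a` and `set_b` the medians agree with the values found by hand for the same two data sets. Python's standard `statistics` module also provides `median` and `multimode` functions that could be used instead of these hand-written helpers.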

Example 5

Find the mode for the following data: 5 5 7 3 8 2 3 4 6 5 7 6 4

The mode is the most frequently occurring value.

There are three 5s and the most we have of any other number is two.

So, the mode is 5.

EXERCISE 1D.1

1 Find the i mean ii median iii mode for each of the following data sets:

a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
b 10, 12, 12, 15, 15, 16, 16, 17, 18, 18, 18, 18, 19, 20, 21
c 22.4, 24.6, 21.8, 26.4, 24.9, 25.0, 23.5, 26.1, 25.3, 29.5, 23.5
d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140, 125, 124, 119, 128, 141.

2 Consider the following two data sets:

Data set A: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10
Data set B: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 15

a Find the mean for both Data set A and Data set B.
b Find the median of both Data set A and Data set B.
c Explain why the mean of Data set A is less than the mean of Data set B.
d Explain why the median of Data set A is the same as the median of Data set B.

3 A cricketer has scored an average of 25.4 runs in his last 10 innings. He scores 58 and 16 runs in his next two innings. What is his new batting average?

4 On the first five days of his holiday David drove an average of 256 kilometres per day and on the next three days he drove an average of 172 kilometres per day.

a What is the total distance that David drove in the first five days?
b What is the total distance that David drove in the next three days?
c What is the mean distance travelled per day over the eight days?

5 A basketball team scored 43, 55, 41 and 37 goals in their first four matches.

a What is the mean number of goals scored for the first four matches?
b What score will the team need to shoot in the next match so that they maintain the same mean score?
c The team shoots only 25 goals in the fifth match. What is the mean number of goals scored for the five matches?
d The team shoots 41 goals in their sixth and final match. Will this increase or decrease their previous mean score? What is the mean score for all six matches?




Consider the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 used in Examples 4a and 5. For this data

set the mean, median and mode all had the same value, 5, and this fact indicates that the

distribution of data in this set is symmetrical.

A dot plot of the data confirms

this:

When the distribution of data is

not symmetrical the measures of

centre can have different values.

When the same data appear several times

we often summarise the data in table form.

Consider the data of the given table.

We can find the measures of the centre

directly from the table.

The mode

The mode is 7. There are 15 of data value

7 which is more than any other data value.

Data value   Frequency   Data value × frequency
    3            1          3 × 1 = 3
    4            1          4 × 1 = 4
    5            3          5 × 3 = 15
    6            7          6 × 7 = 42
    7           15          7 × 15 = 105
    8            8          8 × 8 = 64
    9            5          9 × 5 = 45
  Total         40          278

The mean

There are 40 data in this set, made up of one 3, one 4, three 5s, seven 6s and so on.

The data in an ordered list would look like

3 4 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ......

To add these numbers we could say

3 × 1 + 4 × 1 + 5 × 3 + 6 × 7 + 7 × 15 + ......

so it is not necessary to write out all the data values.

Adding a ‘Data value × frequency’ column to the table helps to add all the scores. For example, there are 15 data of value 7 and these add to 7 × 15 = 105.

Since the total of the 40 data values is 278, the mean = $\frac{278}{40} = 6.95$.

The median

Since there are 40 data in this set, if the data is written out in order from smallest to largest then the median will be the average of the two middle values, i.e., the 20th and 21st values.

The median can be found by counting down the frequency table.

COMPARING MEASURES OF CENTRE

[Dot plot: data values 2 to 8 on the horizontal axis, with the position of the mean, median and mode marked.]

CALCULATING MEASURES OF SPREAD FROM A FREQUENCY TABLE

Data value   Frequency   Accumulated total
    3            1              1     one number is 3
    4            1              2     two numbers are 4 or less
    5            3              5     five numbers are 5 or less
    6            7             12     12 numbers are 6 or less
    7           15             27     27 numbers are 7 or less
    8            8
    9            5
  Total         40




In the table, the third column of numbers shows the accumulated values. We can see that the 20th and 21st data values (in order) must both be 7s, so the median = $\frac{7+7}{2} = 7$.

Which measure of centre is the most suitable to use?

Find the mean, median and mode for the data given in the following frequency table.

Data value   Frequency
    2            1
    4            1
    5            2
    6            3
    7            4
    8            9
    9            6
  Total         26

Adding a Data value × Frequency column, we get:

Data value   Freq   Data value × Freq
    2          1       2 × 1 = 2
    4          1       4 × 1 = 4
    5          2       5 × 2 = 10
    6          3       6 × 3 = 18
    7          4       7 × 4 = 28
    8          9       8 × 9 = 72
    9          6       9 × 6 = 54
  Total       26       188

the mean = $\frac{188}{26} \approx 7.23$

There are 26 data in this set, so the median will be the average of the 13th and 14th values.

The 13th and 14th values are both 8 so their average is 8.

The median is 8.

8 is the data value with the highest frequency of 9, so the mode is 8.

Data value Freq

2 1 1st value

4 1 2nd value

5 2 3rd and 4th values

6 3 5th, 6th and 7th values

7 4 8th, 9th, 10th and 11th values

8 9 12th, 13th, 14th, 15th to 20th values

9 6 21st to 26th values

Total 26
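The frequency-table working above can be checked with a short Python sketch. The dictionary layout and the function names (`mean_from_table`, `value_at_position`, `median_from_table`, `mode_from_table`) are assumptions made for this illustration; the mode helper assumes a single distinct mode exists.

```python
# Frequency table from the worked example: data value -> frequency
table = {2: 1, 4: 1, 5: 2, 6: 3, 7: 4, 8: 9, 9: 6}


def mean_from_table(table):
    """Sum of (data value x frequency) divided by the total frequency."""
    total = sum(value * freq for value, freq in table.items())
    return total / sum(table.values())


def value_at_position(table, position):
    """Data value at a 1-based position in the ordered data, found by accumulating frequencies."""
    running = 0
    for value in sorted(table):
        running += table[value]
        if position <= running:
            return value
    raise ValueError("position is beyond the end of the data")


def median_from_table(table):
    n = sum(table.values())
    if n % 2 == 1:
        return value_at_position(table, (n + 1) // 2)
    # even n: average the two middle values
    return (value_at_position(table, n // 2) + value_at_position(table, n // 2 + 1)) / 2


def mode_from_table(table):
    """Data value with the highest frequency (assumes the mode is distinct)."""
    return max(table, key=table.get)


print(round(mean_from_table(table), 2), median_from_table(table), mode_from_table(table))
```

Counting down the accumulated frequencies, rather than writing out all 26 values, is the same idea as the position table above.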

Example 6

In Example 6, the mean (7.23) is less than the median (8) and mode (8).

A dot plot shows the distribution of the data:

[Dot plot: data values 2 to 9 on the horizontal axis, with the outliers, the mean, and the median and mode marked.]




The data is negatively skewed and the data values 2 and 4 are much smaller than most of the

data values.

The mean depends on the actual values of the data so it has been ‘dragged’ towards these

outliers.

If the data value ‘2’ was replaced by a ‘7’ then the overall total would increase by 5 and

hence the mean would increase.

The median is not influenced by extreme values because it depends on the position of data

rather than their value. If the data value ‘2’ was replaced by a ‘7’ then the median would not

change; the middle values would remain the same.
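The effect described above can be checked numerically. The sketch below rebuilds the 26 data values of Example 6 from their frequencies and replaces the ‘2’ by a ‘7’; the variable names are illustrative only.

```python
from statistics import mean, median

# The 26 values of Example 6, written out from the frequency table
data = [2, 4] + [5] * 2 + [6] * 3 + [7] * 4 + [8] * 9 + [9] * 6

# Replace the data value 2 by 7 and compare the measures of centre
changed = [7 if x == 2 else x for x in data]

print("total:", sum(data), "->", sum(changed))
print("mean:", round(mean(data), 2), "->", round(mean(changed), 2))
print("median:", median(data), "->", median(changed))
```

The total rises by 5, so the mean rises, while the two middle values (and hence the median) are unchanged.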

In cases where there are outliers in one direction so the distribution is skewed, the most

suitable measure of centre to use is the median or the mode. In this case the mode has the

same value as the median and would be a suitable measure of centre for the data.

However, because the mode does not take all the data values into account, in some situations

it is not representative of a data set.

For example, the data set 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 9 has a

mode of 2 and this is not representative of the data set.

A more suitable measure of centre for this data set would be the median 4 or the mean 4.5.

Find the mean, median and mode from the ordered stemplot shown.

Stem | Leaf
  1  | 6 7 8 8
  2  | 2 3 3 4 4 6 7 8 9     (27 is the median, the 11th value)
  3  | 1 2 5 8
  4  | 0 4 6
  5  | 1

The mean is found by dividing the sum of all the data values by the number of data. We must make sure that the ‘stem’ is included with the ‘leaf’.

Mean = $\frac{16 + 17 + 18 + 18 + 22 + 23 + 23 + \dots + 51}{21} = 29.14$

The median is the middle value, the 11th value in this ordered data set.

Counting the leaves from the beginning gives a median of 27.

The mode is the most frequently occurring value; there are two 18s, two 23s and

two 24s in this set of data. We can say that the mode is not distinct in this case and

is not useful as a measure of centre.

Note: The mean of 29.14 is larger than the median of 27, indicating that the distribution is positively skewed. This can be seen from the stemplot.
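As a hedged sketch of the stemplot calculation, the Python below combines each stem with its leaves before finding the mean and median. Storing the stemplot as a dictionary of stem to leaves is an assumption made for this illustration.

```python
from statistics import mean, median

# Ordered stemplot from the example: stem -> list of leaves
stemplot = {
    1: [6, 7, 8, 8],
    2: [2, 3, 3, 4, 4, 6, 7, 8, 9],
    3: [1, 2, 5, 8],
    4: [0, 4, 6],
    5: [1],
}

# The stem must be included with the leaf: stem 2, leaf 7 is the value 27
values = [10 * stem + leaf for stem, leaves in stemplot.items() for leaf in leaves]

print(len(values), round(mean(values), 2), median(values))
```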

MEASURES OF CENTRE FROM A STEMPLOT

Example 7




USING A CALCULATOR TO FIND THE MEAN AND MEDIAN

Consider the data 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9.

The data is entered into the calculator under the STAT menu. Press STAT, choose 1:Edit and press ENTER.

Use List 1 (L1), and after checking that the cursor is in the first position of List 1 we can type the first data value. This value will appear at the bottom of the screen as L1(1)=2.

Press ENTER and ‘2’ appears in the list.

Continue in a similar way through the list of data, pressing ENTER after each data entry to move the cursor to the next position.

To find the descriptive statistics for the data:

STAT ► will get you into the CALC menus for finding descriptive statistics.

We are dealing with only one variable so we choose 1:1-Var Stats and press ENTER.

1-Var Stats appears on the home screen. We need to tell the calculator which list our data is entered in, so type 2nd 1 ENTER.

All the available descriptive statistics for this variable appear on the screen:

The first statistic, $\bar{x}$, is the mean. The mean of the data is 4.867 (to 3 decimal places).

The second statistic, $\sum x = 73$, means that the sum of all the data values is 73.

The next three statistics we will consider in Section 1E.

‘n=15’ indicates that there are 15 data values in the set.

The arrow beside n=15 means that there are other entries for this screen. Scroll down using ▼.

Med=5 means the median is 5.

The other statistics on this part of the screen give the statistics of the five-number summary which is also covered in Section 1E.




1 Find the mean, median and mode for each of the following data sets given as frequency

tables:

a Data value   Frequency
      1            2
      2            5
      3            8
      4            6
      5            4

b Number of rooms   Frequency
         2              1
         3              4
         4             12
         5             15
         6              2
         7              4
         8              2

2 The test scores, out of 30 marks, for a class of twenty-two students are:

15, 16, 18, 23, 22, 28, 29, 25, 25, 24, 27, 18, 11, 20, 23, 26, 26, 30, 25, 18, 15, 17

a Find the i mean ii median iii mode for the data.

b Explain why the mean is not the most suitable measure of centre for this set of data.

c Explain why the mode is not the most suitable measure of centre for this set of data.

3 a Find the i mean ii median iii mode for the data displayed in the following stem-and-leaf plot:

Stem | Leaf
  5  | 3 5 6
  6  | 0 1 2 4 6 7 9
  7  | 3 3 6 8
  8  | 4 7
  9  | 1

b Which measure of centre would be the best representative for this set of data?

4 The following data is the daily rainfall (to the nearest millimetre) for the month of

October 2000 in Melbourne:

3, 1, 0, 0, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 7, 1, 1, 0, 3, 8, 0, 0, 0, 32, 38, 3, 0, 3, 1, 0, 0

a Find the i mean ii median iii mode for this data.

b Explain why the median is not the most suitable measure of centre for this data.

c Explain why the mode is not the most suitable measure of centre for this data.

5 The frequency table alongside records the number of phonecalls made in a day by 50 fifteen-year-olds.

Number of phonecalls   Frequency
         0                 5
         1                 8
         2                13
         3                 8
         4                 6
         5                 3
         6                 3
         7                 2
         8                 1
         9                 0
        10                 0
        11                 1

a Find the: i mean ii median iii mode for this data.

b Construct a barchart for the data and show the position of the measures of centre (mean, median and mode) on the horizontal axis.

c Describe the distribution of the data.

d Why is the mean larger than the median for this data?

e Which measure of centre would be the most suitable for this data set?

EXERCISE 1D.2




6 Which one of the following will always be true for the mean, median and mode of a set

of discrete numerical data, assuming a distinct mode exists?

A The mean always equals one of the data values in the set.

B The median always equals one of the data values in the set.

C The mode always equals one of the data values in the set.

D The median is distorted by extreme values.

E In a positively skewed set of data, the median will be greater than the mean.

E SAMPLE SUMMARY STATISTICS: MEASURES OF SPREAD

MEASURES OF SPREAD

Three commonly used statistics that indicate the spread of a set of data are:

• the range
• the interquartile range
• the standard deviation.

THE RANGE AND INTERQUARTILE RANGE

The range is the difference between the maximum (largest) data value and the minimum (smallest) data value.

Range = maximum data value − minimum data value

Example 8

Find the range for the data set: 5 5 7 3 8 2 3 4 6 5 7 6 4.

Scanning the data we can see that the minimum is 2 and the maximum is 8.

Hence the range is 8 − 2 = 6.

Now the median divides an ordered data set into two halves. These halves are divided in half again by the quartiles. The median is denoted Q2.

The middle value of the lower half is called the lower quartile, denoted Q1. One quarter (25%) of the data have values less than or equal to the lower quartile. Three quarters (75%) of the data have values greater than or equal to the lower quartile.

The middle value of the upper half is called the upper quartile, denoted Q3. One quarter (25%) of the data have values greater than or equal to the upper quartile. Three quarters (75%) of the data have values less than or equal to the upper quartile.

The interquartile range (IQR) is the spread of the middle half (50%) of the data.

Interquartile range (IQR) = upper quartile − lower quartile = Q3 − Q1




A summary for the set of data in Example 9 is:

The data has a spread of 6 (range = 6), centred around the value 5 (median = 5).

The middle half of the data has a spread of 3 (interquartile range = 3).

Example 9

For the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 find the:

a median b lower quartile c upper quartile d interquartile range

The ordered data set is 2 3 3 4 4 5 5 5 6 6 7 7 8

a There are 13 data values so the median is the 7th value. The median is 5.

There is an odd number of data and the median is one of the values so it divides the data into two halves of six values each.

Note: For an odd number of data the median data value is not included in the lower or upper half for the calculation of the quartiles.

b The middle value of the lower half is the average of the 3rd and 4th values.

Lower half (6 values): 2 3 3 4 4 5     Median: 5     Upper half (6 values): 5 6 6 7 7 8

Lower quartile = $\frac{3+4}{2} = 3.5$

c Similarly, the middle value of the upper half is the average of the 10th and 11th values of 2 3 3 4 4 5 5 5 6 6 7 7 8.

Upper quartile = $\frac{6+7}{2} = 6.5$

d Interquartile range = upper quartile − lower quartile
= 6.5 − 3.5
= 3

So, the middle half of the data has a spread of 3.

[Summary diagram: the ordered data 2 3 3 4 4 5 5 5 6 6 7 7 8 with the lower quartile 3.5, median 5 and upper quartile 6.5 marked.]

Range = 8 − 2 = 6
Interquartile range = 3
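The quartile method used above (leave the median out of both halves when the number of data is odd, then take the middle of each half) can be sketched in Python as follows. The function names are illustrative assumptions. Other software may use slightly different quartile conventions, so results can differ from this method for some data sets.

```python
def middle(ordered):
    """Middle value of an already ordered list (average of two if the length is even)."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def spread_summary(values):
    """Range, quartiles and IQR using the median-of-halves method described above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # for an odd number of data the median itself is excluded from both halves
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    q1, q2, q3 = middle(lower), middle(ordered), middle(upper)
    return {"range": ordered[-1] - ordered[0], "Q1": q1, "median": q2, "Q3": q3, "IQR": q3 - q1}


print(spread_summary([5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]))
```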




Example 10

Find the range and the interquartile range and describe the distribution of the data:

8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12

The ordered data set (there are 18 data values) is:

3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 11, 12, 14

The range = 14 − 3 = 11

The median will be the average of the 9th and 10th values:

Median = $\frac{7+8}{2} = 7.5$

The median divides the data set into two sets of 9 values:

Lower half (9 values): 3, 3, 4, 5, 5, 6, 6, 7, 7     lower quartile = 5
Median = 7.5
Upper half (9 values): 8, 8, 9, 9, 9, 10, 11, 12, 14     upper quartile = 9

The lower quartile is the middle value of the lower half and the upper quartile is the middle value of the upper half.

The interquartile range = 9 − 5 = 4

The data is centred at 7.5 (median) and has a spread of 11 (range).

The middle half of the data has a spread of 4 (interquartile range).

USING THE CALCULATOR TO FIND THE RANGE AND INTERQUARTILE RANGE

Key the data into a list. The data does not have to be ordered.

Enter STAT ► to reach the CALC menu, choose 1:1-Var Stats and press ENTER.

Press 2nd 1 ENTER to select the list L1.




The screens below show all the statistics for the data. Use ▼ to scroll down and reveal the lower part of the screen.

The range is maxX − minX = 14 − 3 = 11

The IQR = Q3 − Q1 = 9 − 5 = 4

MEASURES OF SPREAD FROM A STEMPLOT

The median, range, and interquartile range can be found easily from an ordered stemplot.

Example 11

The number of cars travelling along a particular road were counted for 21 days and the data was recorded in this ordered stemplot.

Find the median, range and interquartile range for this data.

Stem | Leaf
  1  | 6 7 8 8
  2  | 2 4 7 8 9
  3  | 0 2 3 3 4 5 6 8
  4  | 0 4 6
  5  | 1

The data is ordered so we can read from the smallest value to the largest value.

Combining the ‘stem’ with the ‘leaf’, we get:

16, 17, 18, 18, 22, 24, 27, ......, 40, 44, 46, 51.

The minimum is 16 and the maximum is 51, so the range = 51 − 16 = 35.

The median is the middle value (the 11th data value in a list of 21) and counting from the beginning, the median = 32.

The median divides the data into two groups of 10 data values.

The average of the middle values of these groups gives the lower and upper quartiles.

Lower quartile = $\frac{22+24}{2} = 23$     Upper quartile = $\frac{36+38}{2} = 37$

Interquartile range = Upper quartile − Lower quartile
= 37 − 23 = 14




EXERCISE 1E.1

1 For each of the following data sets, find:

i the median (make sure the data is ordered)
ii the upper and lower quartiles
iii the range
iv the interquartile range.

a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9

b 10, 12, 15, 12, 24, 18, 19, 18, 18, 15, 16, 20, 21, 17, 18, 16, 22, 14

c 21.8, 22.4, 23.5, 23.5, 24.6, 24.9, 25, 25.3, 26.1, 26.4, 29.5

d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140, 125, 124, 119, 128, 141.

2 For the data given in the following ordered stem-and-leaf plot, find the:

Stem | Leaf
  0  | 3 4 7 9
  1  | 0 3 4 6 7 8
  2  | 0 0 3 5 6 9 9 9
  3  | 1 3 7 8
  4  | 2

a median b upper quartile
c lower quartile d range
e interquartile range

3 The time spent (in minutes) by 20 people in a queue at a bank has been recorded:

3.4, 2.1, 3.8, 2.2, 4.5, 1.4, 0, 0, 1.6, 4.8, 1.5, 1.9, 0, 3.6, 5.2, 2.7, 3.0, 0.8, 3.8, 5.2

a Find the median waiting time and the upper and lower quartiles.

b Find the range and interquartile range of the waiting times.

c Copy and complete the following statements:

i “50% of the waiting times were greater than ...... minutes.”

ii “75% of the waiting times were less than or equal to ...... minutes.”

iii “The minimum waiting time was ...... minutes and the maximum waiting time was ...... minutes. The waiting times were spread over ...... minutes.”

4 The following data gives the number of novels counted in 30 households.

Stem | Leaf
  2  | 0 2 5 5 8 9 9
  3  | 0 1 3 5 6 6 8 9
  4  | 2 2 4 7 7 8 9
  5  | 0 0 1 2 6
  6  | 2 5
  7  | 2

a Find the median number of novels per household and the upper and lower quartiles of the data.

b Copy and complete the following statements:

i “Half of the households have more than ...... novels.”

ii “75% of the households have at least ...... novels.”

c Find the i range ii interquartile range for the number of novels per household.

d Describe the distribution of the data using the statistics found.




5 The height (to the nearest centimetre) of 20 ten year olds is recorded in the following stemplot.

Stem | Leaf
 10  | 9
 11  | 1 3 4 4 8 9
 12  | 2 2 4 4 6 8 9 9
 13  | 1 2 5 8 8

a Find the i median height

ii upper and lower quartiles of the data.

b Copy and complete the following statements:

i

ii “75% of the children are less than ...... cm tall.”

iii “The middle 50% of the children have heights spread over ...... cm.”

Now the range and IQR both only use two values in their calculation. It is sometimes better

to use a measure of spread that includes all of the data values in its calculation. One such

statistic is the variance, which measures the average of the squared deviations of each data

value from the mean. The deviation of a data value $x$ from the mean $\bar{x}$ is given by $x - \bar{x}$.

For a sample, i.e., when we have surveyed a portion of the population:

• the variance is $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ where $n$ is the sample size

• the standard deviation $s$ is the square root of the variance, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.

Note: The variance and standard deviation for a whole population have slightly different

formulae. However, we do not use these in this course.

THE VARIANCE AND STANDARD DEVIATION

Example 12

Use the formula to find the variance and the standard deviation of the sample data:

3, 4, 4, 8, 7, 6, 10

The mean, $\bar{x}$, of the data is $\frac{3 + 4 + 4 + 8 + 7 + 6 + 10}{7} = \frac{42}{7} = 6$

Using a table for the calculations:

   x     x − x̄    (x − x̄)²
   3      −3         9
   4      −2         4
   4      −2         4
   8       2         4
   7       1         1
   6       0         0
  10       4        16
 Total              38  = $\sum (x - \bar{x})^2$

variance $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{38}{6} = 6.3333\ldots$

standard deviation $s = \sqrt{\text{variance}} = \sqrt{\frac{38}{6}} = 2.5166$ (4 d.p.)
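The table calculation for the sample variance and standard deviation can be mirrored step by step in Python. This is a sketch only; the function names `sample_variance` and `sample_sd` are chosen here for illustration.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two data values are needed for a sample variance")
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]  # the (x - mean)^2 column
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: the square root of the sample variance."""
    return sqrt(sample_variance(values))


data = [3, 4, 4, 8, 7, 6, 10]
print(round(sample_variance(data), 4), round(sample_sd(data), 4))
```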

“Half of the children are less than or ...... cm tall.”




USING THE CALCULATOR TO FIND THE STANDARD DEVIATION

Using a table to calculate the standard deviation is an interesting exercise, but you will normally use your calculator to find this statistic.

Press STAT and choose 1:Edit. Key data into list L1.

Press STAT ► to choose CALC, then choose 1:1-Var Stats.

Press 2nd 1 ENTER to choose list L1.

The screen of statistics includes the sample standard deviation.

Note: The variance is not given on the screen, but it can be found by squaring the standard deviation.

STANDARD DEVIATION FOR GROUPED DATA

Example 13

The frequency table alongside shows data collected from a random sample of 50 households in a particular suburb, investigating the number of people in the household.

Use the calculator to find the standard deviation of the number of people in a household for this sample.

Number of people in the household   Frequency
               1                        5
               2                        8
               3                       13
               4                       14
               5                        7
               6                        3

Press STAT and choose 1:Edit.

Key the variable values into L1 and the frequency values into L2.

Press STAT ► to choose CALC, then choose 1:1-Var Stats.




Many data sets have frequency distributions that are ‘bell-shaped’ and symmetrical about the

mean.

For example, the histogram alongside exhibits this typical ‘bell-shape’. The data represents the heights of a group of adult women and has a mean of 165 and a standard deviation of 8.

The data is centred about the mean and spreads from 140 to 190. However, most of the data have values between 155 and 170 and not many have values more than 180 or less than 150.

The Normal distribution is an important bell-shaped distribution.

For the Normal distribution it can be shown that:

• 68% of the data will have values within one standard deviation of the mean.

• 95% of the data will have values within two standard deviations of the mean.

• 99.7% of the data will have values within three standard deviations of the mean.

Graphically this can be summarised:

If we model the bell-shaped data above using the Normal distribution:

• 68% of the heights will have values between 165 − 8 = 157 and 165 + 8 = 173, i.e., between 157 and 173 cm.

68% of the data values will be in the interval $[\bar{x} - s, \bar{x} + s]$.

Enter L1, L2 by pressing 2nd 1 , 2nd 2 ENTER.

The sample standard deviation is 1.3536....

Note: If you do not include L2 then you will still get a screen of statistics, but they will be for L1 only.
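The grouped-data result for the household sample (frequencies 5, 8, 13, 14, 7, 3 for 1 to 6 people) can be cross-checked with a short Python sketch that weights each squared deviation by its frequency. The function name and the dictionary layout are assumptions made for this illustration; the calculator remains the method the text describes.

```python
from math import sqrt


def grouped_sample_sd(table):
    """Sample standard deviation from a {data value: frequency} table."""
    n = sum(table.values())
    x_bar = sum(x * f for x, f in table.items()) / n
    # each squared deviation is counted once for every time the value occurs
    sum_sq = sum(f * (x - x_bar) ** 2 for x, f in table.items())
    return sqrt(sum_sq / (n - 1))


households = {1: 5, 2: 8, 3: 13, 4: 14, 5: 7, 6: 3}
print(round(grouped_sample_sd(households), 4))
```

Leaving out the frequencies would treat the six data values 1 to 6 as if each occurred once, which is the situation the calculator note warns about.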

GIVING MEANING TO THE STANDARD DEVIATION

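[Figure: histogram of height (cm), from 140 to 190, against frequency, from 0 to 25.]

[Figures: three bell-shaped curves centred on the mean $\bar{x}$, showing 68% of data within one standard deviation, 95% of data within two standard deviations, and 99.7% of data within three standard deviations of the mean.]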




• 95% of the heights will have values between 165 − 2 × 8 = 149 and 165 + 2 × 8 = 181, i.e., between 149 and 181 cm.

95% of the data values will be in the interval $[\bar{x} - 2s, \bar{x} + 2s]$.

• 99.7% of the heights will have values between 165 − 3 × 8 = 141 and 165 + 3 × 8 = 189, i.e., between 141 and 189 cm.

99.7% of the data values will be in the interval $[\bar{x} - 3s, \bar{x} + 3s]$.
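The three intervals for the heights (mean 165, standard deviation 8) can be generated with a small Python sketch of the 68%, 95%, 99.7% rule. The names `RULE`, `interval` and `percent_above` are illustrative assumptions, and the percentages are the rounded values quoted in the text rather than exact Normal probabilities.

```python
# Percentage of Normal data within k standard deviations of the mean (rounded, as quoted above)
RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def interval(mean, sd, k):
    """The interval [mean - k*sd, mean + k*sd]."""
    return (mean - k * sd, mean + k * sd)


def percent_above(k):
    """Percentage lying more than k standard deviations above the mean, by symmetry."""
    return (100 - RULE[k]) / 2


for k in (1, 2, 3):
    low, high = interval(165, 8, k)
    print(f"{RULE[k]}% of heights between {low} and {high} cm")
```

The same symmetry argument gives the tail percentages: half of what lies outside an interval is above it and half is below it.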

Example 14

A set of data has a Normal distribution with a mean $\bar{x} = 30$ and a standard deviation of $s = 7$. What percentage of the data is:

a greater than 30 b between 23 and 37 c more than 37
d between 16 and 44 e more than 44 f between 37 and 44?

a The distribution of data is symmetrical about the mean, so 50% of the data have a value greater than 30.

b Now $\bar{x} - s = 30 - 7 = 23$ and $\bar{x} + s = 30 + 7 = 37$.
68% of the data are between 23 and 37.

c Since 68% of scores are between 23 and 37, 32% are outside this interval. The distribution of scores is symmetrical, so 16% are greater than 37.

d Now $\bar{x} + 2s = 30 + 14 = 44$ and $\bar{x} - 2s = 30 - 14 = 16$.
95% of the data fall between these two values.

e Since 95% of the data are between 16 and 44, 5% are outside this interval. The distribution is symmetric so 2.5% of the data are greater than 44.

[Figures: Normal curves centred at 30, showing 68% between 23 and 37 with 16% in each tail, and 95% between 16 and 44 with 2.5% in each tail.]




f From c, we know that 16% of the data are greater than 37, and from e, we know 2.5% of the data are greater than 44.

So 16% − 2.5% = 13.5% of the data lie between 37 and 44.

[Figure: Normal curve marked at 23, 30, 37 and 44, showing the 68% and 2.5% regions.]

Example 15

The contents of a sample of two hundred ‘800 gram packets’ of muesli were weighed and the weights were found to have a bell-shaped distribution with a mean of 800 grams and a standard deviation of 8 grams. How many of the packets in the sample would be expected to have a weight of more than 792 grams?

We model the bell-shaped distribution using the Normal distribution.

Now 792 = 800 − 8

So, 792 g is one standard deviation less than the mean.

Since 68% of the weights are within one standard deviation of the mean, 32% are outside this range.

Since the distribution is symmetric, $\frac{32}{2} = 16\%$ of the weights are lower than 792 g.

So 84% of the weights are above 792 g.

84% of 200 = $\frac{84}{100} \times 200 = 168$.

So 168 of the 200 packets in the sample would be expected to have a weight greater than 792 grams.

[Figure: Normal curve of weight in grams, marked at 792 and 800, with 68% in the centre and 16% in each tail.]

EXERCISE 1E.2

1 a Use the formula to find the standard deviation of the following set of data:

3 3 4 4 5 6 6 7 8 8 9 9

b Check your answer to a using your calculator.

2 Use your calculator to find the standard deviation and variance of the following data:

25.6, 32.8, 24.7, 36.0, 32.1, 30.9, 34.4, 27.5

3 Find the standard deviation of the data given in the frequency table below.

Number of cars owned by the business   Frequency
                0                          3
                1                          4
                2                          6
                3                          9
                4                         12
                5                         10
                6                         10
                7                          8
                8                          5
                9                          2
               10                          1
               11                          0


4 The following data are the heights, to the nearest centimetre, of the thirty footballers that

belong to an AFL club.

192 185 189 183 189 191 190 192 198 187 191 194 198 181 189
191 190 187 189 194 198 191 187 196 181 193 187 196 192 178

a Find the i mean, $\bar{x}$ ii standard deviation, $s$ of the height of the footballers in this club.

b i Calculate the interval $[\bar{x} - s, \bar{x} + s]$.

ii What percentage of the heights would be expected to fall in this interval?

iii What percentage of the actual heights fall in this interval?

c What percentage of the actual heights fall in the interval $[\bar{x} - 2s, \bar{x} + 2s]$?

What percentage would you expect to fall in this interval?

5 The distribution of weights of 600 g loaves of bread is bell-shaped with a mean weight

of 605 g and a standard deviation of 8 g. What percentage of the loaves can be expected

to have a weight between 597 g and 613 g? (Use the Normal distribution as a model.)

6 [1997 FM CAT 2 Q4]

The distribution of the weight of ice-cream served in a single scoop of Danish Delight is

known to be bell-shaped with a mean of 104 grams and a standard deviation of 2 grams.

The percentage of single scoops of Danish Delight containing less than 100 grams will

be closest to:

A 0% B 2.5% C 5% D 16% E 95%

7 The diameters of washers produced by a machine have a bell-shaped distribution with a mean diameter of 10 mm and a standard deviation of 0.3 mm. Using the Normal distribution as a model, find the percentage of the washers that would have a diameter:

a between 9.7 mm and 10.3 mm b greater than 10 mm
c greater than 10.6 mm d between 9.4 and 9.7 mm
e greater than 9.7 mm?

8 The distribution of exam scores for 780 students who sat an exam is Normal with a mean

of 55 and a standard deviation of 15.

a Find the number of students who would be expected to obtain a score:

i greater than 70 ii less than 55

iii less than 25 iv between 70 and 85

b If the pass mark for the exam was 40, then how many students are expected to pass

the exam?

9 The distribution of times taken to swim 50 metres by a group of 16 year-olds is bell-shaped with a mean of 38 seconds and a standard deviation of 3 seconds. The slowest 16% of the students would be expected to have a swim-time:

A greater than 32 seconds but less than 35 seconds
B less than 32 seconds
C greater than 35 seconds but less than 38 seconds
D greater than 35 seconds
E greater than 41 seconds.




STANDARD SCORES (z-SCORES)

The relative significance of a particular data value can be considered in terms of the number of standard deviations that it differs from the mean.

This is called the standard score or z-score of the data value, and the process of finding the standard score is called standardisation. Non-standardised data are often referred to as raw scores.

CALCULATING STANDARD SCORES

Standard score (z-score) = $\frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$

Example 16

The mean percentage on a mathematics exam is 60 and the standard deviation is 13.

a Find the standard scores for students who, on the exam, scored:
i 82% ii 45% iii 73%

b Find the raw score of a student whose standardised score was 0.61.

a Using the formula for standard score:

i standard score = $\frac{82 - 60}{13} = 1.69$ (2 dec. pl.)

ii standard score = $\frac{45 - 60}{13} = -1.15$ (2 dec. pl.)

iii standard score = $\frac{73 - 60}{13} = 1$

b z-score = $\frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$

$0.61 = \frac{\text{raw score} - 60}{13}$

$0.61 \times 13 = \text{raw score} - 60$

$7.93 + 60 = \text{raw score}$

raw score = 67.93

So, the student’s raw score would have been 68%.

The bell-shaped distribution alongside has mean 35 and standard deviation 10.

[Figure: bell-shaped distribution of x centred at 35, showing the central 68% and 95% of the data.]
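The standard score formula and its rearrangement for the raw score can be sketched in Python as below, using a mean of 60 and a standard deviation of 13 as in the mathematics exam example. The function names `z_score` and `raw_score` are assumptions made for this illustration.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations that a raw score lies from the mean."""
    return (raw - mean) / sd


def raw_score(z, mean, sd):
    """Rearranged formula: raw score = mean + z * standard deviation."""
    return mean + z * sd


# Mathematics exam: mean 60, standard deviation 13
for mark in (82, 45, 73):
    print(mark, round(z_score(mark, 60, 13), 2))

print(round(raw_score(0.61, 60, 13), 2))
```

A negative result means the raw score is below the mean, and a result of 1 means it is exactly one standard deviation above the mean.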




The distribution of the standardised data is shown alongside.

[Figure: bell-shaped distribution on a z-axis, showing the central 68% and 95% of the data.]

Notice that:

• the shape of the distribution is unchanged
• the values on the x-axis have been scaled so that:
  − the 68% of the data within one standard deviation of the mean have z-scores between −1 and 1
  − the 95% of the data within two standard deviations of the mean have z-scores between −2 and 2
• a standard score of 0 represents a raw score of the same value as the mean
• a positive standard score represents a raw score that is greater than the mean
• a negative standard score represents a raw score that is less than the mean.

These facts are always true when we standardise a bell-shaped distribution.

Example 17

Find the percentage of scores that come from a Normal distribution that will have a z-score:

a greater than 0 b between −2 and 2 c between 1 and 2
d less than −2 e more than 3.

a A z-score of 0 corresponds to a raw score of the mean and 50% of the data will have a value greater than the mean.

b 95% of the data will have a z-score between −2 and 2.

c $\frac{95 - 68}{2} = 13.5\%$ of raw scores will have a z-score between 1 and 2.

[Figures: Normal curves on a z-axis from −3 to 3, showing 50% above 0, 95% between −2 and 2, and 13.5% between 1 and 2.]




d If 95% of the raw scores have a z-score between −2 and 2 then 2.5% ($\frac{1}{2}$ of 5%) will have a z-score less than −2.

[Figure: Normal curve on a z-axis from −3 to 3, showing 95% between −2 and 2 and 2.5% in each tail.]

e If 99.7% of the raw scores have a z-score between −3 and 3 then $\frac{100 - 99.7}{2} = \frac{0.3}{2} = 0.15\%$ will have a z-score more than 3.

COMPARING RAW SCORES FROM DIFFERENT DATA SETS

Since standard scores:

• keep the relative value of raw scores within a data set
• scale the x-axis of distributions in terms of their standard deviations,

standard scores are useful for comparing scores from different data sets.

Example 18

Archie scored 62% on his Mathematics exam. This exam had a mean of 57 and a standard deviation of 5. In his English exam Archie scored 75% and this exam had a mean of 70 and a standard deviation of 6.

In which subject was his relative performance better?

In Maths: standard score = $\frac{62 - 57}{5} = \frac{5}{5} = 1$

In English: standard score = $\frac{75 - 70}{6} = \frac{5}{6} = 0.83$

Since Archie’s standard score for Maths was greater than his standard score for English, his Maths result was further to the right in the distribution of the scores of the class.

Archie’s relative performance was better in Maths.

EXERCISE 1E.3

1 Find the standard scores for the following raw scores that come from a set of data that has a mean of 6.4 and a standard deviation of 2.

a 10 b 5.2 c 12 d 6.5

2 A raw score from a data set has a z-score of −0.85. If the data set has a mean of 50 and a standard deviation of 5.6, find the value of the raw score.

3 A raw score of 72 has a z-score of 1.25. If the standard deviation from the data set is 8, find the mean of the data.




4 A raw score of 20 has a z-score of −1.6. If the mean of the data set is 28, find the standard deviation.

5 Peter has had four Mathematics tests for the year and his results and the class averages and standard deviations are given in the table below.

Test   Peter’s mark   Class average   Standard deviation
1      58             60              7
2      72             65              12
3      68             60              10
4      78             72              9

a Calculate Peter’s standard score for each test.

b In which test did Peter perform best?

6 The semester English exam results for four students are given in the table below.

Student   Semester 1   Semester 2
David     70           65
Rodney    54           58
Gavan     92           75
Daniel    75           70

If the mean was 60 for both exams and the standard deviation was 15 for the Semester 1 exam and 8 for the Semester 2 exam:

a Which of the students improved their performance from Semester 1 to Semester 2?

b Which student improved the most?

c Which student’s performance was the most consistent for the year?

7 For a set of data that has a bell-shaped distribution, find the percentage of raw scores that have a z-score:

a less than 0 b between −1 and 1 c greater than 2

d between −1 and 0 e between −1 and −2 f between 0 and 3
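The standard-score calculation used throughout this section can also be checked with a few lines of code. The sketch below is only an illustration in Python (a language the text itself does not use); the function names standard_score and raw_score are hypothetical. It reproduces the comparison of Archie's two exams from Example 18.

```python
def standard_score(raw, mean, sd):
    """Return the z-score of a raw score: (raw - mean) / sd."""
    return (raw - mean) / sd


def raw_score(z, mean, sd):
    """Undo the standardisation: raw = mean + z * sd."""
    return mean + z * sd


# Archie's results from Example 18 (score, class mean, standard deviation)
maths_z = standard_score(62, 57, 5)
english_z = standard_score(75, 70, 6)

# The larger standard score marks the better relative performance
better = "Maths" if maths_z > english_z else "English"
print(f"Maths z = {maths_z:.2f}, English z = {english_z:.2f}")
print(f"Relative performance was better in {better}")
```

The second function shows that a standard score can be converted back to a raw score once the mean and standard deviation are known, which is the rearrangement that several of the exercise questions rely on.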

F THE BOXPLOT (BOX-AND-WHISKER PLOT)

A boxplot is a visual display of some of the descriptive statistics of a set of data, namely its minimum and maximum values, the median and the upper and lower quartiles. These five statistics form what is called the five-number summary of the data set.

CONSTRUCTING A BOXPLOT

A boxplot (box-and-whisker plot) is constructed above a number line (labelled and scaled) which is drawn so that it covers all the data values in the data set.

The boxplot is drawn with a rectangular ‘box’ representing the middle half of the data. The ‘box’ goes from the lower quartile to the upper quartile.

The ‘whiskers’ extend from the ‘box’ to the maximum value and to the minimum value.

A vertical line marks the position of the median in the ‘box’.

For example, for the data set 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:

The ordered data set is 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7, 8, 9 (15 data).

The median (Q2) is the 8th value, with 7 values below it and 7 values above it. The lower quartile (Q1) is the median of the lower 7 values and the upper quartile (Q3) is the median of the upper 7 values.

The minimum is 1.
The maximum is 9.
The median is the 8th value, 5.
The lower quartile is the 4th value, 3.
The upper quartile is the 12th value, 7.

These 5 statistics form the five-number summary.

[Figure: boxplot drawn above a number line from 1 to 9, with the minimum value, lower quartile, median, upper quartile and maximum value labelled, and a whisker on each side of the box.]

Using the graphics calculator to find descriptive statistics and construct a boxplot

Press STAT and choose 1:Edit.
Enter the data from the example above into L1.
Statistical graphs are drawn using STAT PLOT, which is located above the Y= key. Press 2nd Y= to use it.
Press ENTER to use Plot 1.
Turn the plot On by pressing ENTER, then use the arrow keys to choose the boxplot icon and press ENTER.
Press ZOOM 9 to draw the boxplot.
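The five-number summary found above by hand and on the calculator can also be checked in code. The following is a minimal sketch in Python, assuming the same quartile convention as the worked example (the quartiles are the medians of the lower and upper halves, leaving out the middle value when the number of data is odd). The function name five_number_summary is hypothetical, and other software may use a different quartile convention and give slightly different quartiles.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]        # values below the median position
    upper_half = values[(n + 1) // 2:]   # values above the median position
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


data = [2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9]
summary = five_number_summary(data)
print(summary)
```

For the 15 data values of the example this gives the minimum 1, lower quartile 3, median 5, upper quartile 7 and maximum 9 listed above.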




On the calculator, TRACE can be used to locate the statistics of the five-number summary. The arrow keys move backwards and forwards between them. In the screen shown, the cursor is on the median.

INTERPRETING A BOXPLOT

A set of data with a symmetric distribution will have a symmetric boxplot.

For example: [Figure: histogram and boxplot of a symmetric distribution, with x from 10 to 20.]

The whiskers of the boxplot are the same length and the median line is in the centre of the box.

A set of data which is positively skewed will have a positively skewed boxplot.

For example: [Figure: histogram and boxplot of a positively skewed distribution, with x from 1 to 8.]

The right whisker is longer than the left whisker and the median line is to the left of the centre of the box.

A set of data which is negatively skewed will have a boxplot that appears stretched to the left.

For example: [Figure: histogram and boxplot of a negatively skewed distribution, with x from 1 to 9.]

The left whisker is longer than the right and the median line is to the right of the centre of the box.




Example 19

A boxplot has been drawn to show the distribution of marks (out of 100) in a test for a particular class:

[Figure: boxplot of score on test, drawn above a number line from 0 to 100.]

a What was the highest mark scored for this test?
b What was the median test score for the class?
c What is the range of marks scored for this test?
d What percentage of students scored 60 or more for the test?
e What was the lowest mark scored?
f What is the interquartile range for this test?
g The top 25% of students scored a mark between ...... and ......
h If you scored 70 on this test, would you be in the top 50% of students in the class?
i Comment on the symmetry of the distribution of marks.

a The highest score corresponds to the end of the upper whisker, so the highest mark scored was 98.

b The median corresponds to the vertical line inside the box, which is at 73.

c The range = maximum score − minimum score = 98 − 30 = 68

d The score of 60 corresponds to the lower quartile. 25% of the students have a score less than or equal to the lower quartile so 75% scored 60 or more.

e The lowest score corresponds to the end of the lower whisker, so the lowest score was 30.

f The interquartile range = upper quartile − lower quartile = 82 − 60 = 22

g The top 25% of scores correspond to the upper whisker. So, the top 25% of students scored a mark between 82 and 98.

h The top 50% of students had a mark greater than or equal to the median of 73. You would not be in the top 50% of students if you scored 70 for the test.

i The distribution of test scores is stretched to the left, and is therefore negatively skewed. The lower whisker is longer than the upper whisker and the median is not in the centre of the box but further towards the upper end. The distribution is therefore not symmetrical.




TESTING FOR OUTLIERS

Outliers are extraordinary data that are either much larger or much smaller than the main body of the data.

There are several tests that identify outliers. One commonly used test involves the following calculation of ‘boundaries’:

The upper boundary = upper quartile + 1.5 × IQR.
Any data larger than this number is an outlier.

The lower boundary = lower quartile − 1.5 × IQR.
Any data smaller than this value is an outlier.

Example 20

Draw a boxplot for the following data, identifying any outliers.

1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7

The ordered data is:

1, 1, 3, 4, 5, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 11, 12, 13, 14, 16, 17

There are 13 values in each half of the ordered data, so lower quartile = 7, median = 8 and upper quartile = 10.

The five-number summary is:

minimum value is 1
lower quartile is 7
median is 8
upper quartile is 10
maximum value is 17

IQR = 10 − 7 = 3

The upper boundary = upper quartile + 1.5 × IQR = 10 + 1.5 × 3 = 14.5

The lower boundary = lower quartile − 1.5 × IQR = 7 − 1.5 × 3 = 2.5

Values outside the interval [2.5, 14.5] are outliers. Hence the two outliers at the upper end are the data values 16 and 17, and the two at the lower end are both the data value 1.

We now have all the information to draw the boxplot:

[Figure: boxplot of the variable above a number line from 0 to 17. The two outliers of the same value are both shown, and each whisker is drawn to the last value that is not an outlier.]

When outliers exist, the ‘whiskers’ of a boxplot extend to the last value that is not an outlier. Each outlier is marked with an asterisk; it is possible to have more than one outlier at either end.
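The boundary test of Example 20 can be sketched in code as follows. This is an illustration only, written in Python with hypothetical function names (quartiles, find_outliers), and it assumes the same quartile convention as the worked example (medians of the lower and upper halves of the ordered data).

```python
from statistics import median


def quartiles(data):
    """Lower and upper quartiles as medians of the two halves of the data."""
    values = sorted(data)
    n = len(values)
    return median(values[: n // 2]), median(values[(n + 1) // 2:])


def find_outliers(data):
    """Return the boundaries, the outliers, and the whisker end points."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower boundary
    upper = q3 + 1.5 * iqr   # upper boundary
    outliers = sorted(x for x in data if x < lower or x > upper)
    inside = [x for x in data if lower <= x <= upper]
    # Whiskers stop at the last values that are not outliers
    return lower, upper, outliers, (min(inside), max(inside))


data = [1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13,
        7, 6, 8, 11, 17, 7]
lower, upper, outliers, whiskers = find_outliers(data)
print(lower, upper, outliers, whiskers)
```

For this data the boundaries are 2.5 and 14.5, the outliers are 1, 1, 16 and 17, and the whiskers end at 3 and 14, in agreement with the worked solution.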




Using the calculator to draw the boxplot in Example 20 above, we begin by entering the data in L1.

Use STAT PLOT by pressing 2nd Y=.
Press ENTER to use Plot 1.
Turn the plot On, then use the arrow keys to choose the ‘boxplot with outliers’ icon. Then press ENTER.
Press ZOOM 9 to draw the boxplot. Note that only one of the outliers at 1 appears on the screen.
Press TRACE and use the arrow keys to move the cursor through the summary statistics. Note that both values at 1 are included.

Note:

You may wonder why we would need both the boxplot and the stemplot or histogram. Each complements the other and shows slightly different things.

Boxplots provide an excellent display of the summary statistics, while stemplots and histograms illustrate the shape of the distribution more accurately.

These graphs display the same distribution.

[Figure: boxplot of height (cm) drawn above an axis from 3 to 13.]

Stem | Leaf
 3 | 2 5 8
 4 | 2 3 3
 5 | 3 8 8
 6 | 0 5 5 5 7 9 9 9 9 9 9 9
 7 | 0 0
 8 | 4 8 9
 9 | 0 1 1 3 3 4 4 4 5 5 6 7 8
10 | 2 3 3 4 5 6
11 | 2 9
leaf unit: 0.1 cm

The boxplot displays the summary statistics, while the stemplot reveals the bimodal nature of the distribution. Hence both graphics are of value.




EXERCISE 1F

1 The following boxplot summarises the heights of the players in an AFL team.

[Figure: boxplot of height (cm) drawn above a number line from 165 to 215.]

Use the boxplot to find:

a the median height of the team
b the range of heights of the team (ignoring the outlier)
c the height that 75% of the team are taller than
d the height of the player that is an outlier
e the interquartile range of the heights.

2 Find the five-number summary (minimum, lower quartile, median, upper quartile, maximum) for each of the following data sets, and construct a boxplot for the data.

a Essendon’s game scores for the year 2000 (not including the finals):

156, 130, 124, 137, 123, 144, 140, 127, 106, 132, 145, 169, 119, 89, 108, 89, 167, 159, 165, 109, 81, 97

b
Number of toothpicks   Frequency
33                     1
34                     5
35                     7
36                     13
37                     12
38                     8
39                     2

c The daily maximum temperature (°C) in Melbourne for the month of March 2001:

Stem | Leaf
1  |
1* | 7 8 8 8 9 9
2  | 0 0 0 2 2 2 2 3 3 3 4 4
2* | 5 5 6 7 8 8
3  | 0 0 1 2 2 3
3* | 5
2 | 4 represents 24°C

3 A set of data has a lower quartile of 31.5, median of 37, and upper quartile of 43.5.

a Calculate the interquartile range for this data set.
b Calculate the boundaries that identify outliers.
c Which of the data 22, 13.2, 60, 65 would be outliers?

4 The boxplot below shows the distribution of weights of a sample of Jack Russell terriers:

[Figure: boxplot of weight (kg) drawn above a number line from 4 to 10.]

Which one of the following would not be true for this data?

A The interquartile range is more than 1.5 kg.
B The heaviest 25% of the dogs all weighed more than 8 kg.
C The median weight was 7 kg.
D At least 75% of the weights were more than 6 kg.
E The lightest 25% weighed less than or equal to 6.2 kg.




5 The boxplot below shows the distribution of taxi fares for 50 trips taken from Melbourne Airport.

[Figure: boxplot of fare ($) drawn above a number line from 15 to 45.]

a Find: i the median fare ii the range of fares iii the IQR of fares.

b Write a sentence describing the distribution of the data, mentioning each of the statistics from a.

c Complete the following:

i Approximately ...... % of fares were greater than $32.
ii The minimum fare was $ ......
iii 75% of the fares were greater than $ ......

6 Match the histograms A, B, C, D and E to the boxplots I, II, III, IV and V.

[Figures: histograms A, B, C and D (frequency against x) and boxplots I, II, III, IV and V.]

E
Stem | Leaf
7 | 1 2 2 4 6
6 | 2 2 3 4 4 5 5 5 5 6 8 9
5 | 0 7 9 9 9
4 | 0 2 3 4 6 6 6 7 7 8
3 | 2 5 6 6 8
2 | 2 9
1 | 5
leaf unit: 0.1




G RANDOM SAMPLES

When we conduct a statistical survey, it is important that our data reflects the whole population.

If data is to be collected from a sample then the sample must accurately represent the population. Otherwise, reliable conclusions about the population cannot be made. Samples must be chosen so that the results will not show bias towards a particular outcome.

The sample size is also an important feature to be considered if conclusions about the population are to be made from the sample.

For example:

Measuring a group of three fifteen-year-olds would not give a very reliable estimate of the height of fifteen-year-olds all over the world. We therefore need to choose a random sample that is large enough to represent the population. Note that conclusions based on a sample will never be as accurate as conclusions made from the whole population, but if we choose our sample carefully, they will be a good representation.

CHOOSING A RANDOM SAMPLE

In a simple random sample, every member of the population has an equal chance of being chosen, and each member is chosen independently of any other member.

Random samples can be chosen using coins, dice, numbered tokens, random number tables, or random number generators on computers or calculators.

For example:

Suppose you wish to choose Tattslotto numbers. The population of numbers is the integers 1 to 45 inclusive and you are going to choose a ‘sample’ of six different numbers. How could you choose these numbers randomly?

Three possible methods:

1 Number forty five pieces of paper, place them in a container and select six pieces of paper without looking.

2 Use a random number table (Table 1).

39634 62349 74088 65564 16379 19713 39153 69459 17986 24537
14595 35050 40469 27478 44526 67331 93365 54526 22356 93208
30734 71571 83722 79712 25775 65178 07763 82928 31131 30196
64628 89126 91254 24090 25752 03091 39411 73146 06089 15630
42831 95113 43511 42082 15140 34733 68076 18292 69486 80468

80583 70361 41047 26792 78466 03395 17635 09697 82447 31405
00209 90404 99457 72570 42194 49043 24330 14939 09865 45906
05409 20830 01911 60767 55248 79253 12317 84120 77772 50103
95836 22530 91785 80210 34361 52228 33869 94332 83868 61672
15358 70469 87149 89509 72176 18103 55169 79954 72002 20582

The digits in the table are generated by computer in groups of five for easy reading. You can start anywhere in the table and move across or down.

To choose numbers between 1 and 45, you need to look at two digits at a time. If the digits are 04 then the chosen number is ‘4’. If the digits give a number greater than 45 then you ignore it. If you get a repeat of a number then you will also ignore it.




Starting in the top left hand corner and going across (crossing out the inappropriate numbers until you have six numbers), the numbers are:

39, 63, 46, 23, 49, 74, 08, 86, 55, 64, 16, 37, 91, 97, 13

Your chosen numbers would be 39, 23, 8, 16, 37 and 13.

3 Use the random number generator on the calculator.

This can be found in the MATH menu as follows:

Press MATH then use the arrow key to select PRB.
Choose 5:randInt(.
This will bring randInt( to the screen.
We need to type in the range of random integers that we are considering, i.e., 1 to 45. Press 1 , 4 5 ).
Pressing ENTER repeatedly will give random integers between 1 and 45.

In this case the first six numbers were all different numbers, so these are the randomly chosen Tattslotto numbers. If numbers were repeated, we would generate more until we had six different ones.

You could also type in the sample size of 6 as shown alongside. However, if this gave repeats in the sample, you would need to repeat the procedure.

Example 21

The table below gives the monthly sales figures, in thousands of dollars, for a shop over a six year period.

a Choose a year at random.
b Choose a month at random.
c Choose three consecutive years.
d Choose a period of three consecutive years (36 months) starting with any month.

            2000   2001   2002   2003   2004   2005
January     43.1   48.7   45.7   44.0   48.6   46.3
February    38.2   35.3   36.4   38.3   37.7   40.2
March       38.6   36.0   36.2   34.8   35.3   33.3
April       40.2   40.9   42.4   42.5   43.8   35.7
May         43.2   44.2   47.0   48.7   50.3   52.4
June        27.8   32.3   33.5   34.1   32.2   35.8
July        26.4   27.2   23.5   27.2   27.7   28.1
August      23.8   24.9   24.8   27.6   26.1   28.2
September   27.4   30.8   32.7   33.6   34.9   35.1
October     40.4   39.3   38.7   41.3   42.4   44.9
November    68.3   67.4   67.3   69.8   70.4   72.6
December    81.2   83.9   84.6   85.5   88.3   87.2

a There are six years from which to choose. We could use a die to randomly choose one of these years; the year 2000 would be represented by 1, 2001 by 2, ......, 2005 by 6.

Alternatively, we could use the random generator on a calculator: The randomly chosen year is 2004.

b There are twelve months from which we need to choose one month.

We use the calculator, with 1 representing January, 2 representing February, etc.

The randomly chosen month is November.

c To choose three consecutive years, we need to establish the number of sets of three consecutive years that are possible:

1 2000 - 2002
2 2001 - 2003
3 2002 - 2004
4 2003 - 2005

There are four possibilities, from which we have to choose one. Using the calculator, the randomly chosen period is 3, 2002 to 2004.

d To choose a period of three consecutive years starting with any month, we need to establish the number of sets that are possible:

1 Jan 2000 - Dec 2002
2 Feb 2000 - Jan 2003
3 Mar 2000 - Feb 2003
...
37 Jan 2003 - Dec 2005

There are thirty seven possibilities, from which we have to choose one.

Using the calculator, the randomly chosen period is 11, November 2000 to October 2003.

TO CHOOSE A SIMPLE RANDOM SAMPLE:

1 State the sample size needed.
2 State the number of possibilities from which you can choose, and number them if necessary.
3 State the random number generator that you are using.
4 Explain what you will do if repeated random numbers are not applicable.
5 State the random number(s) chosen and the data that is now in your sample.
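The two computational methods described in this section, reading a random number table two digits at a time and using a random number generator, can be sketched as follows. This is an illustration in Python rather than on the calculator; the function names are hypothetical, and Python's random.sample plays the role that randInt( plays in the text, with the difference that it never returns repeats.

```python
import random


def sample_from_digit_table(digits, last, size):
    """Read random digits two at a time, ignoring 00, numbers above
    `last`, and repeats, until `size` different numbers are chosen."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])
        if 1 <= number <= last and number not in chosen:
            chosen.append(number)
        if len(chosen) == size:
            break
    return chosen


def choose_sample(first, last, size, rng=random):
    """Simple random sample of `size` different integers from first..last."""
    return sorted(rng.sample(range(first, last + 1), size))


# First six groups of the top row of Table 1
groups = ["39634", "62349", "74088", "65564", "16379", "19713"]
table_numbers = sample_from_digit_table("".join(groups), 45, 6)
print(table_numbers)             # Tattslotto numbers from the table

print(choose_sample(1, 45, 6))   # six different numbers from 1 to 45
print(choose_sample(1, 37, 1))   # one of the 37 periods in Example 21 d
```

The table method reproduces the numbers 39, 23, 8, 16, 37 and 13 chosen in the text. The generator method gives a different sample each time it is run, so no particular output is shown for it.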




EXERCISE 1G

1 Use the random number table from page 54, starting at the top left corner and working down, to:

a select a random sample of six different numbers between 1 and 45 inclusive
b select a random sample of 5 different numbers between 100 and 499 inclusive.

2 Use your calculator to:

a select a random sample of six different numbers between 5 and 25 inclusive
b select a random sample of 10 different numbers between 1 and 25 inclusive.

3 The following calendar for 2006 shows the weeks of the year. Each of the days is numbered.

Using a random number generator, choose a sample from the calendar of:

a five different dates
b a complete week starting with a Monday
c a month
d three different months
e three consecutive months
f a four week period starting on a Saturday
g a four week period starting on any day.

Explain your method of selection in each case.




[Calendar for 2006, January to December. Each date is listed with its day of the week and, in brackets, its day number in the year, from Sunday 1 January (1) to Sunday 31 December (365). The months begin at day numbers 1 (January), 32 (February), 60 (March), 91 (April), 121 (May), 152 (June), 182 (July), 213 (August), 244 (September), 274 (October), 305 (November) and 335 (December). The weeks are labelled Wk 1 to Wk 53.]




BIVARIATE DATA

Many statistical investigations involve analysing the relationship between two variables. We call the data in these investigations bivariate data. The way that bivariate data is analysed depends on whether the data is categorical or numerical.

In this chapter we will study the display and analysis of bivariate data where:

• one variable is a categorical variable and the other is a numerical variable
• both variables are categorical
• both variables are numerical.

For any pair of variables, one of the pair is described as the dependent or response variable, while the other is the independent or explanatory variable.

The dependent variable responds to changes in the independent variable.

The independent variable explains the changes in the dependent variable.

For example, the number of children in a family influences the type of car they have, but not the other way around. The type of car is therefore the dependent variable and the number of children is the independent variable.

A COMPARING ONE CATEGORICAL AND ONE NUMERICAL VARIABLE

BACK-TO-BACK STEMPLOTS

If the categorical variable has only two categories then a back-to-back stemplot is useful. It is a visual display that enables easy analysis and comparison of the data.

Consider this example:

An office worker has the choice of travelling to work by tram or train. He has recorded the travel times from recent journeys on both of these types of transport. He wishes to know which type of transport is quicker and which is the more reliable.

Recent tram journey times (minutes):

21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24

Recent train journey times (minutes):

23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16

The type of transport is the independent variable and the travel time is the dependent variable, because the travel time depends on the type of transport.

A back-to-back stemplot could be used to display the relationship between the categorical variable type of transport, which has two categories (or levels), and the numerical variable travel time.

A back-to-back stemplot is constructed with only one stem. The leaves are grouped on either side of this central stem. The ordered back-to-back stemplot for the data is shown below:

     Train leaf | Stem | Tram leaf
8 8 8 7 7 6 6 6 |  1   | 3 4 8 8 9
    8 3 1 1 0 0 |  2   | 1 2 2 4 5 7 8
              0 |  3   | 0 3
                |  4   | 3

The most frequently occurring travel times by train were between 10 and 20 minutes whereas the most frequently occurring travel times by tram were between 20 and 30 minutes.




The median train travel time is 18 minutes and the median tram travel time is 22 minutes. This supports the observation that train journeys are generally shorter.

The range of the train travel times is 30 − 16 = 14 minutes while the range of the tram travel times is 43 − 13 = 30 minutes.

The interquartile range of travel times for the train is 21 − 17 = 4 minutes, while the IQR for tram travel times is 28 − 18 = 10 minutes.

Comparison of these measures of spread indicates that the train travel times are less ‘spread out’ than the tram travel times. The train travel times are therefore more predictable or reliable.

In conclusion, it is generally quicker and the travel times are more reliable if the worker travels by train to work.
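The centre and spread statistics quoted in this comparison can be reproduced with a short sketch. It is written in Python with a hypothetical helper spread_summary, and it assumes the quartile convention used earlier in the text (medians of the lower and upper halves of the ordered data).

```python
from statistics import median


def spread_summary(times):
    """Median, range and interquartile range of a list of travel times."""
    values = sorted(times)
    n = len(values)
    q1 = median(values[: n // 2])
    q3 = median(values[(n + 1) // 2:])
    return {"median": median(values),
            "range": values[-1] - values[0],
            "iqr": q3 - q1}


tram = [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24]
train = [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16]

train_stats = spread_summary(train)
tram_stats = spread_summary(tram)
print("train:", train_stats)
print("tram: ", tram_stats)
```

The smaller median shows that the train is generally quicker, and the smaller range and IQR show that its travel times are less spread out.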

EXERCISE 2A.1

1 The heights (to the nearest centimetre) of Year 10 boys and girls in a school are being investigated. The sample data are as follows:

Boys:
164 168 175 169 172 171 171 180 168 168 166 168 170 165 171 173 187
179 181 175 174 165 167 163 160 169 167 172 174 177 188 177 185 167
160

Girls:
165 170 158 166 168 163 170 171 177 169 168 165 156 159 165 164 154
170 171 172 166 152 169 170 163 162 165 163 168 155 175 176 170 166

a What are the two variables in this investigation? Classify the variables as categorical or numerical, dependent or independent.
b Construct a back-to-back stemplot for the data.
c Find the statistics in the five-number summaries for each of the data sets.
d Compare and comment on the distributions of the data, mentioning the shape, centre and spread and quoting statistics to support your statements.

2 A new cancer drug is being developed and is being tested on rats. Two groups of twenty rats with cancer were formed; one group was given the drug while the other was not. The survival time of each rat in the experiment was recorded up to a maximum of 192 days.

Survival times of rats that were given the drug:

64 78 106 106 106 127 127 134 148 186
192* 192* 192* 192* 192* 192* 64 78 106 106

Survival times of rats that were not given the drug:

37 38 42 43 43 43 43 43 48 49
51 51 55 57 59 62 66 69 86 37

* denotes that the rat was still alive at the end of the experiment

a What are the variables in this investigation? Classify the variables as categorical or numerical, dependent or independent.
b Construct a back-to-back stemplot for the data and find the statistics that make up the 5-number summaries.
c Compare and comment on the distributions of the data, mentioning the shape, centre and spread.



3 Peter and John are competing taxi-drivers who wish to know who earns more money. They have recorded the amount of money (in dollars) collected per hour for five hours over five days:

Peter:
17.3 11.3 15.7 18.9 9.6 13 19.1 18.3 22.8 16.7 11.7 15.8
12.8 24 15 13 12.3 21.1 18.6 18.9 13.9 11.7 15.5 15.2 18.6

John:
23.7 10.1 8.8 13.3 12.2 11.1 12.2 13.5 12.3 14.2 18.6 18.9
15.7 13.3 20.1 14 12.7 13.8 10.1 13.5 14.6 13.3 13.4 13.6 14.2

a Construct a back-to-back stemplot for the data and find the statistics that make up the 5-number summaries.
b Compare and comment on the distributions of the data, mentioning the shape, centre and spread.

4 The residue that results when a cigarette is smoked collects in the filter. The residue from twenty cigarettes from the two different brands was measured, giving the following data, in milligrams:

Brand X:
1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58

Brand Y:
1.61 1.62 1.69 1.62 1.60 1.59 1.66 1.55 1.61 1.62
1.64 1.61 1.58 1.57 1.57 1.57 1.58 1.60 1.63 1.59

a Copy and complete the back-to-back stemplot for this data:

Brand Y | Stem | Brand X
        | 150  |
        | 152  | 2
        | 154  | 5 5 5
        | 156  | 6 6 6 6 7
        | 158  | 8 8 9 9
        | 160  | 1 1
        | 162  | 2 2 3 3
        | 164  |
        | 166  |
        | 168  | 9
stem 156 includes values 1.56 and 1.57

b Comment on and compare the shape of the distributions.

PARALLEL BOXPLOTS

Parallel boxplots are used to display and compare data where one of the variables is numerical and the other is a categorical variable with two or more categories.

For example:

If additional data, car travel time, is available to the office worker in the example on page 60, we can use parallel boxplots to compare the data. They help us decide which type of transport is the quickest to get him to work and which is the most reliable.

Car travel times (minutes): 30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22

The categorical variable type of transport now has three categories and is the independent variable.

Ordering the car travel times we get: 16, 17, 18, 19, 19, 21, 22, 23, 24, 25, 25, 28, 29, 30




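The 5-number summary is:

min = 16, max = 30, median = (22 + 23)/2 = 22.5, lower quartile = 19, upper quartile = 25

The three boxplots are drawn on the one axis:

[Figure: parallel boxplots for car, train and tram (the categorical variable with three categories) drawn above a common axis of travel time in minutes (the numerical variable), scaled from 10 to 45.]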

The car travel times have almost the same spread (range = 14 mins, IQR = 6 mins) as the

train travel times (range = 14 mins, IQR = 4 mins), suggesting that the car travel time is as

reliable as the train travel time.

However, the train travel times include two outliers which may be due to extraordinary events.

If these are ignored then the range of travel times for the train would be 7 minutes, which is

considerably less than the ranges for the car and tram.

The median car travel time is 22:5 minutes, compared to 18 minutes for the train and 22minutes for the tram, so it is still generally quicker to travel by train.

In conclusion: From the data given, it is generally quicker and more reliable to travel by

train than it is by either tram or car.

Using the graphing calculator to graph parallel boxplots

The data for each of the transport types is entered

in separate lists.

The three boxplots can be drawn on the screen at

the same time by turning each of them On.

‘5-number summary’ values on the screen.

10 15 20 25 30 35 40 45

car

train

tram

categoricalvariable withthree categories

travel time (minutes)numerical variable

Press and choose . Press .y o Í1:Edit

Press to select .y o STAT PLOT

Make sure that the ‘boxplot with outliers’ iconand the correct list is selected for each plot.

Õ

q ® will bring the graphs to the screen:

r, then the arrows, can be used to find



0 5 25

75

50

95

100

0 5 25

75

95

100

50

General rules for interpreting and comparing the distribution of bivariate data:

1 Comment on the shape of the distributions (symmetric, positively skewed, negatively

skewed, outliers).

2 Comment on and compare the centres of the data (median and mean).

3 Comment on and compare the spread of the data (range, interquartile range).

EXERCISE 2A.2

1 The percentage scores on a SAC for three classes of Further Mathematics students have been recorded and the distribution of results for the three classes is summarised on the graph below:

[Figure: parallel boxplots of score on SAC (%), axis from 0 to 100, for class A, class B and class C]

a In which class was:
i the highest mark scored
ii the lowest mark scored?

b Comment on the shape of the distribution of marks in each of the classes.

c Comment on and compare the centre of the scores for the classes.

d Comment on and compare the spread of the scores for the classes.

2 [VCAA FM 2001 Q6]

[Figure: parallel boxplots of age (years), axis from 0 to 45, for female and male elephants]

A conservation park in Thailand is home to 49 elephants, of which 26 are females and 23 are males. The parallel boxplots above show the distribution of their ages by sex.

Based on the information contained in the parallel boxplots, which one of the following statements is incorrect?

A The youngest elephant is male.

B There are fewer female elephants under the age of 15 years than male elephants under the age of 15 years.

C There are no female elephants over the age of 40 years.

D The median age of the female elephants is approximately the same as the median age of the male elephants.

E Approximately 25% of the male elephants are 30 years of age or older.

3 The daily maximum temperatures in Melbourne for June 21st and December 21st (the solstices) are being compared. The data for the 20 years from 1981 to 2000 is given below:

June 21st: 13.6, 10.6, 19.1, 14.2, 12.2, 11.9, 18.3, 14.9, 14.6, 15.1,
17.4, 13.5, 16.7, 14.0, 11.1, 17.0, 15.4, 16.3, 15.6, 16.3

December 21st: 24.2, 19.4, 21.4, 22.7, 21.4, 20.0, 22.3, 21.1, 18.9, 23.5,
21.3, 23.0, 28.1, 20.3, 17.2, 35.0, 33.7, 21.9, 21.4, 38.6

a What are the variables in this investigation? Classify the variables as categorical or

numerical, dependent or independent.

b Find the statistics that make up the 5-number summaries and construct parallel

boxplots for the data.

c Compare and comment on the distributions of the data, mentioning the shape, centre and spread.


4 Using the data from question 4, Exercise 2A.1, find five-number summaries and construct

parallel boxplots to summarise the distributions of residue for the two types of cigarettes.

What conclusions can be made from comparing the boxplots? Support your statements

with statistics.

5 Plant fertilisers come in many different brands, but there are essentially two types: organic and inorganic. A student was interested to discover whether radish plants responded better to organic or inorganic fertiliser. He prepared three identical plots of ground, named plots A, B and C, in his mother’s garden, and planted 40 radish seeds in each plot. After planting, each plot was treated in an identical manner, except for the way they were fertilised. Cost prevented him using a variety of fertilisers, so he chose one organic and one inorganic fertiliser. Plot A received no fertiliser, plot B received the organic fertiliser as prescribed on the packet, and plot C received the inorganic fertiliser as prescribed on the packet. The student was interested in the weight of the root that forms under the ground.

The data supplied below is the weight of the root (measured to the nearest gram) of the individual plants:

Data from plot A: 27 29 9 10 8 36 36 42 32 32 32 30 38
                  39 38 50 34 41 39 40 12 14 35 35 42 25
                  32 30 34 22

Data from plot B: 51 54 56 41 50 47 47 46 48 52 34 20 28
                  47 58 56 63 66 54 48 48 53 47 29 46 33
                  45 58 34

Data from plot C: 55 76 65 61 67 69 68 64 76 59 56 79 70
                  69 70 76 43 70 62 60 58 79 65 75 60 39
                  68 68 63 54 61 72 58 77 66 65 47 50

a Produce parallel boxplots for the data.

b Compare and comment on the distributions of the weights of the root for each

plot, mentioning the shape, centre and spread and quoting statistics to support your

statements.




B TWO CATEGORICAL VARIABLES

Two-way frequency tables are used to demonstrate the relationship between two categorical variables. Percentaged segmented barcharts give a visual display of the data.

TWO-WAY FREQUENCY TABLES

In two-way frequency tables, the independent variable fills the columns.

Example 1

A town council is considering bringing in a rule banning the drinking of alcohol in public places. A random survey of 60 residents gave the following results:

Of the 35 women surveyed, 20 were in favour of the rule. However only 11 of the men were in favour of it.

a Construct a two-way frequency table to summarise these findings.

b Construct a two-way percentaged frequency table and answer the following:
i What percentage of those surveyed were female?
ii What percentage of those surveyed were in favour of the proposal?
iii What percentage of the females surveyed were in favour of the proposal?

c Do the results of the survey support the theory that females would be more in favour of this rule than males?

The two categorical variables involved in this question are:

Gender: Male or Female
Opinion about rule: In favour or Against

Opinion about rule depends on gender so the variable gender is the independent variable.

a
                   Gender
Opinion       Male   Female   Total
In favour      11      20      31
Against        14      15      29
Total          25      35      60

b The two-way percentaged frequency table is:

                   Gender
Opinion       Male                   Female
In favour     11/25 × 100 = 44%      20/35 × 100 = 57%
Against       14/25 × 100 = 56%      15/35 × 100 = 43%
Total         100%                   100%

i 35/60 × 100 = 58.33% of those surveyed were female.

ii 31/60 × 100 = 51.67% of those surveyed were in favour of the rule.

iii 57% of the females were in favour of the rule.

c 57% of the females surveyed were in favour of the proposed rule compared with 44% of the males. This shows a difference of 13%. The results support the theory.

PERCENTAGED SEGMENTED BARCHARTS

The percentaged frequency table in Example 1 can be graphed using a percentaged segmented barchart:

[Figure: percentaged segmented barchart with gender (male, female) on the horizontal axis and percentage (0 to 100) on the vertical axis; each bar is split into the segments in favour and against]

EXERCISE 2B

1 A survey of Victorians was recently conducted to ascertain their interest in AFL football. The data was presented in the following two-way percentaged frequency table:

                         Gender
Level of interest   Male   Female   Total
Very interested      28      18      22
Somewhat             25      19      21
Not very             19      20      20
Not at all           28      43      37
Total               100     100     100

a Use the table to find:
i the percentage of those surveyed who are very interested in football
ii the percentage of women who are either very or somewhat interested in football.

b Construct a percentaged segmented barchart that compares the interest in Australian Rules for men and women.

c Does the data support the theory that gender influences the level of interest in AFL football? Quote percentages to support your statement.

2 A survey of sixteen-year-old students revealed that 32 of the 48 boys and 23 of the 37 girls played a team sport outside school.

a Copy and complete the two-way frequency table shown:

                                        Gender
Play team sport outside school?   Boys   Girls   Total
Yes
No
Total

b Find the percentage of all the students who play a team sport outside school.

c Find the percentage of girls who play a team sport outside school.

d Construct a two-way percentaged frequency table.

e Do the figures support the theory that more boys than girls play a team sport outside school? Quote some percentages to support your statement.

3 A market research company is contracted to investigate the age of people who listen to the three radio stations, A, B or C, in a city. The results of their survey are given in the table below:

                  Age group
Station    < 30   30 - 60   > 60   Total
A           35       30      200
B           40       83       68
C          175       37      132
Total

a Complete the Totals row and column in the table.

b Why do we need a two-way percentaged frequency table to help analyse the data?

c Construct the two-way percentaged frequency table.

d Compare and comment on which age groups listen to which radio station.

4 The two-way percentaged frequency table below was produced to show the labour force status of parents from one-parent families.

Labour force status        Father   Mother
Employed full-time          48.6     16.8
Employed part-time          13.3     27.2
Unemployed                   8.3      8.9
Not in the labour force     29.8     47.0
Total                      100.0    100.0

(Source: ABS June 2002 Labour Force Survey)

a What are the variables in this survey? Classify them as categorical or numerical, independent or dependent.

b Construct a percentaged segmented barchart to illustrate the data.

c What conclusions can be made from this table and graph? Support your statements with percentages from the table.

5 A polling agency wants to test the theory that in a particular municipality, “more of the female residents vote for female candidates”. A random sample of eighty residents in the municipality were asked their voting preference, either Smith the female candidate, or Jones the male candidate. Of the 35 female residents in the sample, 20 said they would vote for Smith, whereas 25 of the male residents said they would vote for Jones.

a Fill in the missing values on the two-way frequency table below.

                        Gender
Voting intention   Male     Female   Total
Smith              ......     20     ......
Jones                25     ......   ......
Total              ......     35       80

b Construct a two-way percentaged frequency table for the data.

c Use the figures in the table to comment on the validity of the theory.
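The column percentages in a two-way percentaged frequency table, such as the one in Example 1, can be reproduced with a short calculation. The following Python sketch is one possible way to do it, assuming the counts are stored with the independent variable (gender) as the outer key. The names `counts` and `column_percentages` are illustrative only.

```python
# Two-way frequency table from Example 1: columns are the categories of the
# independent variable (gender), rows are the opinion about the rule.
counts = {
    "Male": {"In favour": 11, "Against": 14},
    "Female": {"In favour": 20, "Against": 15},
}

def column_percentages(table):
    """Convert each column of counts to percentages of that column's total."""
    result = {}
    for group, cells in table.items():
        total = sum(cells.values())
        result[group] = {row: 100 * n / total for row, n in cells.items()}
    return result

percentages = column_percentages(counts)
for group, cells in percentages.items():
    for opinion, pct in cells.items():
        print(f"{group:6} {opinion:9} {pct:5.1f}%")

# Difference between the percentage of females and males in favour
difference = percentages["Female"]["In favour"] - percentages["Male"]["In favour"]
print(f"Difference: {difference:.0f}%")
```

Each column is percentaged separately because the groups have different sizes (25 men and 35 women), so the raw counts cannot be compared directly.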

C TWO NUMERICAL VARIABLES

Scatterplots are used to demonstrate and visualise the relationship between two numerical variables.

The data is plotted as points on a graph where the independent variable is the horizontal axis and the dependent variable is the vertical axis.

The pattern formed by the points on a scatterplot indicates the strength of the relationship between the two variables.

CONSTRUCTING A SCATTERPLOT

For example:

The relationship between weight and height of members of an AFL football team is being investigated.

We expect there to be a fairly strong association between these variables as it is generally perceived that the taller a person is, the more they will weigh.

The height (cm) and weight (kg) of each of the players in the team are recorded and these values form a coordinate pair for each of the players:

Player  Height  Weight     Player  Height  Weight     Player  Height  Weight
  1      203     106         7      180      78         13     178      80
  2      189      93         8      186      84         14     178      77
  3      193      95         9      188      93         15     186      90
  4      187      86        10      181      84         16     190      86
  5      186      85        11      179      86         17     189      95
  6      197      92        12      191      92         18     193      89

Before a scatterplot is constructed you need to establish which of the variables is the independent variable and which is the dependent variable.

In this case we assume that weight depends on height and so weight is the dependent variable and height is the independent variable.

The points are therefore plotted as coordinate pairs (height, weight) for the individuals in the investigation.

[Figure: scatterplot titled Weight versus Height, with height (cm) from 175 to 205 on the horizontal axis and weight (kg) from 75 to 105 on the vertical axis]

Using the calculator to construct a scatterplot

Press [STAT] and choose 1:Edit. Press [ENTER].

Enter the data into lists. The independent variable should be L1 and the dependent variable should be L2.

Press [2nd] [Y=] to select STAT PLOT. Press [ENTER].

Turn the plot On and select the scatterplot icon. The XList is for the independent variable L1 and the YList is for the dependent variable L2.

Press [ZOOM] [9] to view the scatterplot.

You can press [TRACE] and use the arrow keys to identify the points.

INTERPRETATION OF A SCATTERPLOT

There are four aspects that we need to consider:

1 Direction

Positive association
The points generally go up as x increases, similar to a straight line with positive gradient.
“As the independent variable (x) increases, the dependent variable (y) also increases.”

Negative association
The points generally go down as x increases, similar to a straight line with negative gradient.
“As the independent variable (x) increases, the dependent variable (y) decreases.”

[Figures: two scatterplots of y against x, one illustrating positive association and one illustrating negative association]

2 Form

In the scatterplots above, the points are generally in a straight line. The relationship between the variables is said to be linear.

These scatterplots show relationships which are not linear.

3 Strength

If the points form a well-ordered pattern then the strength of the association is said to

be strong.

For example:

Strong positive Strong negative Strong non-linear

If the points form a pattern which is less well defined, then the strength is said to be

moderate.

For example: Moderate positive Moderate negative

If the points are scattered but a general pattern is still discernable then the association is

said to be weak.

For example: Weak positive Weak negative

If the points appear to be randomly scattered then

there is no association between the variables.

An example of this is shown opposite.

4 Outliers

Outliers stand out from the general body of data.

The example opposite shows a “moderate positive

association with one outlier”.


Outliers should be checked to ensure they are genuine outstanding data and not errors

in the data or errors in plotting. A decision can be made to ignore them as they will

influence correlation measures and models fitted to the data, but this should only be done

after careful consideration.

We can interpret the Weight versus Height scatterplot

from earlier as follows:

“There is a moderate positive association between the

variables height and weight. This means that as height

increases, weight increases. The relationship appears

linear and there are no obvious outliers.”

EXERCISE 2C

1 For each of the following, state whether you would expect to find positive, negative, or no association between the variables. Indicate the strength (none, weak, moderate or strong) of the association.

a Shoe size and height.
b Speed and time taken for a journey.
c The number of occupants in a household and the water consumption of the household.
d Maximum daily temperature and the number of newspapers sold.
e Age and hearing ability.

2 Copy and complete the following:

a If the variables x and y are positively associated then as x increases, y ..........
b If there is negative association between the variables m and n then as m increases, n ..........
c If there is no association between two variables then the points on the scatterplot appear to be .......... ..........

3 For each of the scatterplots below, state:

i whether there is positive, negative or no association between the variables
ii the strength of the association between the variables (zero, weak, moderate or strong)
iii whether the relationship between the variables appears to be linear or not
iv the presence of outliers.

[Figures a to f: six scatterplots of y against x]

4 Consider the data:

x   1  2  3  4  5  6  7  8  9  10
y   2  1  4  3  5  6  5  5  7   8

a Construct a scatterplot for the data.

b State whether the association between the variables is:
i positive, negative or no association
ii weak, moderate or strong
iii linear or not.

5 The following data was collected by a milkbar owner over fifteen consecutive days:

Max. daily temp. (°C):     29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold:   119 164 131 152 206 169 122 143  63 208 155  96 125 248 139

a Which of the two variables is the independent variable?

b Construct a scatterplot of the data.

c Interpret the scatterplot in terms of the variables, mentioning direction, strength, linearity and outliers.

6 A class of 25 students was asked to record their times (in minutes) spent preparing for a test. The table below gives the score that they achieved on the test and the recorded preparation time.

Score:                     25  31  30  38  55  20  39  47  35  45  32  33  34
Minutes spent preparing:   75  30  35  65 110  60  40  80  56  70  50 110  18

Score:                     38  17  38  17  17  26  41  50  30  45  36  23
Minutes spent preparing:   80  22  30  15  10  85 100  60  55  80  50  75

a Which of the two variables is the independent variable?

b Construct a scatterplot of the data.

c Interpret the scatterplot in terms of the variables, mentioning direction, strength, linearity and outliers.

D CORRELATION

Correlation is a statistical word that means relationship or association. We can talk about the correlation/relationship/association between two variables and mean the same thing.

The correlation between two numerical variables can be measured by a correlation coefficient. There are several correlation coefficients that can be used, but the most widely used coefficient is Pearson’s correlation coefficient, named after the statistician Karl Pearson who developed it. Its full name is Pearson’s product-moment correlation coefficient, and it is denoted r.

PEARSON’S CORRELATION COEFFICIENT r

For a set of n bivariate numerical data with variables x and y, Pearson’s correlation coefficient is:

$$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$$

where $\bar{x}$ and $\bar{y}$ are the means of the x and y data respectively and $s_x$ and $s_y$ are their standard deviations.

This formula is tedious to use, so in all situations you will be using your calculator to find r.

INTERPRETATION OF PEARSON’S CORRELATION COEFFICIENT

Pearson’s correlation coefficient gives a measure of the relationship between two variables on a scale from −1 to 1. Word descriptors based on r-values seem doubtful at the best of times and the majority of texts on this subject do not include them. Many texts and Internet sites vary on the advice they give. Here is one possible interpretation.

r               Description                       r                 Description
1               perfect positive correlation      −1                perfect negative correlation
0.75 to 1       strong positive correlation       −1 to −0.75       strong negative correlation
0.50 to 0.75    moderate positive correlation     −0.75 to −0.50    moderate negative correlation
0.25 to 0.50    weak positive correlation         −0.50 to −0.25    weak negative correlation
0 to 0.25       almost no correlation             −0.25 to 0        almost no correlation

Notes about Pearson’s correlation coefficient:

• It is designed for linear data only.
• It should be used with caution if there are outliers.
For example, the data in the two scatterplots below both have a correlation coefficient of r = 0.8. The presence of the outlier in the second graph has greatly reduced the r value; without this point, r would equal 1.

[Figure: two scatterplots of y against x, each with r = 0.8; the second contains one point labelled outlier]

Using the calculator to find Pearson’s correlation coefficient

The first step is to activate the diagnostic tools on the calculator. Once turned on these will remain on, but if the memory is cleared or battery changed then the calculator will revert back to the default functions that do not include r.

To activate the diagnostic tools:

Locate the CATALOG menu using [2nd] [0].

Use the arrow keys to scroll down to DiagnosticOn and press [ENTER].

DiagnosticOn will appear on the screen.

Press [ENTER] and you will have turned the diagnostic tools on.

We consider finding Pearson’s correlation coefficient for the data below:

x   1  2  3  4  5  6  7  8  9  10
y   2  1  4  3  5  6  5  5  7   8

Enter the data into lists, the x-data into L1 and the y-data into L2.

We check the scatterplot at this stage as it will reveal any errors made in entering the data, and any outliers. It will also indicate whether the data is linear.

Press [STAT] [►] to select CALC and choose 4:LinReg(ax+b). (This means we are fitting a linear model or linear regression of the form y = ax + b to the data.) Regression will be discussed in greater detail in Chapter 3.

LinReg(ax+b) appears on the screen. You need to tell the calculator where your data is: enter L1, L2 by pressing [2nd] [1] [,] [2nd] [2]. Press [ENTER].

The linear regression screen appears and the last figure r = .9130.... is Pearson’s correlation coefficient for this data set.

The r value indicates a strong positive correlation, which agrees with the scatterplot.

CAUSATION

When analysing data, we must be aware of causation. A high degree of correlation between two variables does not necessarily imply that a change in one variable causes the other to change.

For example:

1 The heights and reading speeds of children were measured and a strong positive correlation was found. Does this mean that increasing height makes you read faster or that increasing your reading speed will cause you to grow? These suggestions are obviously not sensible. The strong correlation results because both variables are closely associated with age. As age increases, both the variables height and reading speed increase. It is age which causes height and reading speed to increase.

2 The number of television sets sold in Ballarat and the number of stray dogs collected in Bendigo were recorded over several years and a strong positive association was found between the variables. Obviously the number of television sets sold in Ballarat was not influencing the number of stray dogs collected in Bendigo. Both variables have simply been increasing over the period of time that their numbers were recorded.

If a change in one variable causes a change in the other variable then we say that a causal relationship exists between them.

For example:

The age and height of a group of children is measured and there is a strong positive correlation between these variables. This will be a causal relationship because an increase in age will cause an increase in height.
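Although the calculator is the intended tool, Pearson’s correlation coefficient can be checked directly from its definition as the sum of products of standardised values divided by n − 1. The following Python sketch applies that definition to the ten-point x and y data set used in this section. The function name `pearson_r` is an illustrative choice, and `stdev` from the standard library is assumed to match the sample standard deviations used in the formula.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r: sum of products of standardised values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Data set used in the calculator walkthrough
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]

r = pearson_r(x, y)
print(round(r, 4))  # 0.913
```

The result agrees with the value r = .9130.... shown on the linear regression screen.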

EXERCISE 2D

1 a Use your calculator to find Pearson’s correlation coefficient for the data given in question 5, Exercise 2C.

Max. daily temp. (°C):     29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold:   119 164 131 152 206 169 122 143  63 208 155  96 125 248 139

b Interpret the value of r in terms of strength and direction.

c Does the value of the correlation coefficient confirm your observations from the scatterplot? Was it appropriate to find r for this data? Explain.

2 a Use your calculator to find Pearson’s correlation coefficient for the data given in question 6, Exercise 2C:

Score:                     25  31  30  38  55  20  39  47  35  45  32  33  34
Minutes spent preparing:   75  30  35  65 110  60  40  80  56  70  50 110  18

Score:                     38  17  38  17  17  26  41  50  30  45  36  23
Minutes spent preparing:   80  22  30  15  10  85 100  60  55  80  50  75

b Interpret the value of r in terms of strength and direction.

c Does the value of the correlation coefficient confirm your observations from the scatterplot? Was it appropriate to find r for this data? Explain.

3 [VCAA FM 2000 Q5]

The scatterplot alongside shows the birth rate and the average food intake for 14 different countries.

[Figure: scatterplot of birth rate (per 100 000), from 20 to 50, against average food intake (1000 calories per person), from 1.7 to 2.7]

The value of the product moment correlation coefficient, r, for this data is closest to:

A −0.6   B −0.2   C 0.2   D 0.6   E 0.9

4 Which one of the following is true for Pearson’s correlation coefficient r?

A The addition of an outlier to a set of data would always result in a lesser value of r.

B An r value of 1 represents a stronger relationship between the variables than an r value of −1.

C A high value of r means that one variable is causing the other variable to change.

D An r value of −0.8 means that as the independent variable increases, the dependent variable will tend to decrease.

E It can take values between 0 and 1 inclusive.

5 The following pairs of variables were measured and a strong positive correlation between them was found. Discuss whether a causal relationship exists between the variables. If not, suggest a third variable to which they may both be related.

a The lengths of one’s left and right feet.
b The damage caused by a fire and the number of firemen who attend it.
c Company expenditure on advertising, and sales.
d The height of parents and the height of their adult children.
e The number of hotels and the number of churches in rural towns.

E THE COEFFICIENT OF DETERMINATION

In a bivariate set of numerical data, the coefficient of determination gives us a means of measuring the influence that one variable has over the other variable.

CALCULATION OF THE COEFFICIENT OF DETERMINATION

Coefficient of determination = r² = (Pearson’s correlation coefficient)²

r² is found on the linear regression screen of your calculator as shown opposite. Alternatively, if the value of r is known, then this can simply be squared.

INTERPRETATION OF THE COEFFICIENT OF DETERMINATION

r² indicates the strength of association between the dependent or response variable and the independent or explanatory variable.

If there is a causal relationship then r² indicates the degree to which change in the explanatory variable explains change in the response variable.

For example: An investigation into many different brands of muesli found that there is strong positive correlation between the variables fat content (the independent variable) and kilojoule content (the dependent variable).

Pearson’s correlation coefficient, r, was found to be 0.8625.

The coefficient of determination for this study is (0.8625)² = 0.7439.

An interpretation of this r² value is “the proportion of variation in kilojoule content that can be explained by the variation in fat content of muesli is 0.7439.”

It is usual to quote the coefficient of determination as a percentage. A proportion of 0.7439 is equivalent to 0.7439 × 100 = 74.39%.

The interpretation becomes:

74.39% of the variation in kilojoule content of muesli can be explained by the variation in fat content of muesli.

If 74.4% of the variation in kilojoule content of muesli can be explained by the fat content of muesli then we can assume that the other 25.6% (100% − 74.4%) of the variation in kilojoule content of muesli can be explained by other factors (which may or may not be known).

Note:

• Since −1 ≤ r ≤ 1, 0 ≤ r² ≤ 1.
• If r = −0.625 then r² = (−0.625)² = 0.3906, a positive value.
• It is only appropriate to use r² values, like r values, in situations where there is a linear relationship between the two variables.
• r² values of 10% or more are worth mentioning.
• If you are finding an r value from an r² value then you must consider that the r value can be positive or negative. The solutions to r² = a are r = √a and r = −√a. Your calculator will only give you a positive value.

Example 2

A study has found that 45% of the variation in selling price can be explained by the variation in age of a used car.

If this statement was based on the coefficient of determination then what would be the value of Pearson’s correlation coefficient for this study?

We are told that r² = 0.45 so r is the square root of 0.45. (√0.45 ≈ 0.6708)

At this point we need to consider the variables involved: selling price and age of a car. We would assume that as the age of a car increases then the selling price of a car would decrease, i.e., there is negative correlation between the variables.

Hence we can conclude for this study that Pearson’s correlation coefficient, r, will be −0.6708.
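The two calculations in this section, squaring r to get r² and recovering r from r² once the direction of the association is known, can be sketched in a few lines of Python. The function names and the `direction` parameter are illustrative assumptions and do not come from the text. The direction must be supplied from the context of the variables because the square root alone is always positive.

```python
import math

def coefficient_of_determination(r):
    """Square Pearson's r to obtain r^2."""
    return r ** 2

def r_from_r_squared(r_squared, direction):
    """Recover r from r^2; direction is +1 for positive association, -1 for negative."""
    root = math.sqrt(r_squared)
    return root if direction > 0 else -root

# Muesli example: r = 0.8625 for fat content and kilojoule content
muesli_r2 = coefficient_of_determination(0.8625)
print(f"{muesli_r2:.4f} = {muesli_r2 * 100:.2f}%")   # 0.7439 = 74.39%

# Example 2: 45% of the variation in selling price is explained by age,
# and selling price is expected to fall as age rises (negative association)
car_r = r_from_r_squared(0.45, direction=-1)
print(f"{car_r:.4f}")                                # -0.6708
```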

EXERCISE 2E

1 In an investigation the coefficient of determination for the variables preparation time and exam score is found to be 0.5624. Complete the following interpretation of the coefficient of determination:

...... % of the variation in .......... can be explained by the .......... in preparation time.

2 For each of the following find the value of the coefficient of determination correct to four decimal places, and interpret it in terms of the variables.

a An investigation has found the association between the variables time spent gambling and money lost has an r value of 0.4732.

b For a group of children a product-moment correlation coefficient of −0.365 is found between the variables heart rate and age.

c In a study of a sample of countries, Pearson’s correlation coefficient for the variables female literacy and gross domestic product is found to be 0.7723.

3 A study of the relationship between stress levels and productivity has produced a product-moment correlation coefficient of 0.5629. Which one of the following would be an interpretation that could be made from this study?

A 56.3% of the variation in productivity can be explained by the variation in stress levels.

B 75% of the variation in productivity can be explained by the variation in stress levels.

C 31.7% of the variation in productivity is caused by the variation in stress levels.

D 56.3% of the variation in productivity is caused by the variation in stress levels.

E 31.7% of the variation in productivity can be explained by the variation in stress levels.

4 A rural school has investigated the relationship between the time spent travelling to school (minutes) and a student’s year ten average (%) for a sample of students.

The results are given in the table below:

Travel time (mins):     10 33 18 43 34 30 24 47 44 41 17 45 39 31 23 11 14 25 16 17
Year 10 average (%):    51 78 97 56 90 70 64 67 37 46 95 67 31 57 43 99 98 82 40 67

a Construct a scatterplot of the data and interpret the scatterplot.

b Find Pearson’s correlation coefficient for the data and interpret.

c Calculate the coefficient of determination and interpret this in terms of the variables.
