Chapter 7: What to do when you have the data

We saw in the previous chapters how to collect data. We will spend the rest of this course looking at how to analyse the data that we have collected.

Stem and Leaf Diagrams

Stem and Leaf Diagrams are graphical ways to display a group of integers in a dataset.

Steps for Constructing a Stem and Leaf Diagram

1. Select one or more of the leading digits to be the Stem values; the remaining digits become the Leaves.

2. List the possible Stem values in a column.

3. Record the Leaf for every observation beside the corresponding Stem value.

4. Indicate on the display what units are used for the Stems and Leaves.

Example: The following are a selection of exam marks:

71 52 52 75 64 60 48 56 67 29 11 53 25 46 58 46 49 62 66 40 19 54 57 54 60 19 59 43 51 40 21 45 46 62 73 59 36 45 55 46 45 32 55 46 51 46 65 49 61 40

A Stem And Leaf Diagram will look like this:

1 | 1 9 9
2 | 1 5 9
3 | 2 6
4 | 0 0 0 3 5 5 5 6 6 6 6 6 6 8 9 9
5 | 1 1 2 2 3 4 4 5 5 6 7 8 9 9
6 | 0 0 1 2 2 4 5 6 7
7 | 1 3 5

STEM UNIT = TENS
LEAF UNIT = ONES
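The four construction steps can be sketched in a few lines of Python. This is only an illustrative sketch, assuming two-digit integer marks with the tens digit as the Stem and the ones digit as the Leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (tens digit); leaves are the ones digits, in order."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 46 -> stem 4, leaf 6
        groups[stem].append(leaf)
    return dict(groups)

# The 50 exam marks from the example above.
marks = [71, 52, 52, 75, 64, 60, 48, 56, 67, 29,
         11, 53, 25, 46, 58, 46, 49, 62, 66, 40,
         19, 54, 57, 54, 60, 19, 59, 43, 51, 40,
         21, 45, 46, 62, 73, 59, 36, 45, 55, 46,
         45, 32, 55, 46, 51, 46, 65, 49, 61, 40]

diagram = stem_and_leaf(marks)
for stem in sorted(diagram):
    print(stem, "|", " ".join(str(leaf) for leaf in diagram[stem]))
print("STEM UNIT = TENS, LEAF UNIT = ONES")
```

Sorting the marks first means the leaves on each row come out in ascending order, which is how the diagram above is drawn.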

Histogram for Discrete Numerical Data

1. Draw a horizontal X-axis and on it mark the possible values taken by the observations

2. Draw a vertical Y-axis marked with either relative frequencies or frequencies

3. Above each possible value on the X-axis draw a rectangle centred on the value with width 1 and height equal to the relative frequency or frequency of that value.

Value   Frequency
30      100
40      150
50      200
60      100

[Figure: histogram of the frequency table; X-axis: Value (30, 40, 50, 60); Y-axis: Frequency (0 to 250)]
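As a rough sketch of what goes on the Y-axis, the frequencies in the table can be converted to relative frequencies and drawn as a crude text histogram. The variable names and the scale of one `#` per 10 observations are choices made for this illustration only.

```python
# Frequency table from the example: value -> frequency.
freq = {30: 100, 40: 150, 50: 200, 60: 100}

total = sum(freq.values())
# Relative frequency = frequency / total number of observations.
rel_freq = {value: count / total for value, count in freq.items()}

for value, count in freq.items():
    bar = "#" * (count // 10)   # one '#' per 10 observations
    print(f"{value:>3} | {bar} {count} (relative frequency {rel_freq[value]:.3f})")
```

Whichever scale is used on the Y-axis, the shape of the histogram is the same, because every relative frequency is the frequency divided by the same total.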

The Shape of Histograms

The general shape of a histogram is important.

The number of peaks in the histogram determines whether a distribution is classed as Unimodal, Bimodal or Multimodal.

In addition to this classification we can further classify Unimodal distributions according to whether they are symmetric or not.

A unimodal distribution is defined to be Symmetric if there is a vertical line of symmetry through the middle of the distribution such that the distribution to the left of this line is the mirror image of the distribution to the right of this line.

The right part of a unimodal distribution is called the Upper Tail of the distribution, while the left part is called the Lower Tail.

A Unimodal distribution which is not symmetric is called skewed. There are two types of skewness.

Positive Skew: If the upper tail of the distribution stretches out more than the lower tail then the distribution is said to be positively skewed.

Negative Skew: If the Lower tail of the distribution stretches out more than the upper tail then the distribution is said to be negatively skewed.

[Figures: Symmetric Distributions; Positively Skewed Distribution; Negatively Skewed Distribution]

Definitions

Mean: The Mean of a quantitative dataset is the sum of the observations in the dataset divided by the number of observations in the dataset.

Median: The Median (m) of a quantitative dataset is the middle number when the observations are arranged in ascending order.

Mode: The Mode of a dataset is the observation that occurs most frequently in the dataset.

How to calculate these

Dataset: X1, X2, X3, ..., Xn

Mean = (X1+ X2 + X3+ . .+ Xn)/n

Median: Arrange the n observations in order from smallest to largest, then:

if n is odd, the median (m) is the middle number,

if n is even, the median is the mean of the middle two numbers

Mode: If given a dataset, the mode is easily chosen as the value which appears most often.

Example A:
Dataset: 5, 3, 8, 5, 6
Mean = 27/5 = 5.4
Mode = 5
Median: 3, 5, 5, 6, 8 so m = 5
Note: 5.4 is not one of the original values in the dataset.

Example B:
Dataset: 11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165
n = 12
Mean = 1046/12 ≈ 87.17
Median: 11, 14, 23, 45, 56, 78, 93, 98, 123, 140, 165, 200 so m = (78 + 93)/2 = 85.5
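The calculations in Examples A and B can be checked with a short Python sketch. The helper `median_by_hand` is a hypothetical name; it simply follows the odd/even rule given above, and `statistics.mode` from the Python standard library picks out the most frequent value.

```python
import statistics

def median_by_hand(values):
    """Median using the rule above: middle value, or mean of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # n odd: the single middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of middle two

example_a = [5, 3, 8, 5, 6]
example_b = [11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165]

mean_a = sum(example_a) / len(example_a)
median_a = median_by_hand(example_a)
mode_a = statistics.mode(example_a)

mean_b = sum(example_b) / len(example_b)
median_b = median_by_hand(example_b)

print("Example A:", mean_a, median_a, mode_a)
print("Example B:", round(mean_b, 2), median_b)
```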

Example C: Generate a dataset containing 9 numbers using the Day, Month and Year of your birth and that of the people sitting to your left and right, i.e. DD/MM/YY.

Mean vs Median vs Mode - which measures the centre best?

Choosing which of these three measures to use in practice can sometimes seem like a difficult task. However if we understand a little about the relative merits of each we should at least be able to make an informed decision.

If the distribution is symmetric then Mean = Median.

If the distribution is Positively Skewed (to the right) then Median < Mean.

If the distribution is Negatively Skewed (to the left) then Mean < Median.

So the difference between the mean and median can be used to measure the skewness of a dataset.

Note: The presence of outliers can change the mean considerably but has little or no effect on the median. This can be seen from the diagrams and from the following example.

Example: Ten statistics graduates who are now working as statisticians are surveyed for their annual salary. The survey produced the following dataset:

£60,000 £20,000 £19,000 £22,000 £21,500 £21,000 £18,000 £16,000 £17,500 £20,000

Mode = £20,000 Median = £20,000 Mean = £23,500

Notice that the distribution is positively skewed. The presence of the one high earner has affected the Mean, causing it to be £1,500 higher than the highest of all the salaries excluding £60,000. For this dataset the Mean is therefore not a good measure of the centre of the dataset.

Notice also that the median would be unaffected if the £60,000 was changed to a value like £23,000 which is more in line with the rest of the data.
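The effect of the single high salary can be checked numerically. The sketch below uses the ten salaries from the example and Python's standard `statistics` module; the list `adjusted`, in which £60,000 is replaced by £23,000, is the hypothetical change described above.

```python
import statistics

salaries = [60000, 20000, 19000, 22000, 21500,
            21000, 18000, 16000, 17500, 20000]

# Replace the one high earner with a value more in line with the rest.
adjusted = [23000 if s == 60000 else s for s in salaries]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(adjusted)
median_after = statistics.median(adjusted)

print("Original: mean", mean_before, "median", median_before)
print("Adjusted: mean", mean_after, "median", median_after)
```

The mean moves by several thousand pounds when one value changes, while the median stays where it was.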

Examples

Would you expect the datasets described below to be symmetric, skewed to the right or skewed to the left?

A. The salaries of people employed by UCD.
B. The grades on an easy exam.
C. The grades on a difficult exam.
D. The amount of time spent by students in a difficult 3 hour exam.
E. The amount of time students in this class studied last week.
F. The age of cars on a used car lot.

Example: The median age of the population in Ireland is now 32 years old. The median age of the Irish population in 1986 was 27. Interpret these values and explain the trend. What implications does this data have for Irish society? What are the consequences for the entertainment industry in Ireland?

Numerical Measures of Variability

When we want to describe a dataset, providing a measure of the centre of that dataset is only part of the story. Consider the following two distributions:

Both of these distributions are symmetric and meanA = meanB, modeA = modeB and medianA = medianB. However, these two distributions are obviously different: the data in A is quite spread out compared to the data in B.

This spread is technically called variability and we will now examine how best to measure it.

Revision Tutorials

[Timetable: Monday to Friday, hours 11 to 6]

Definitions

Range: The Range of a quantitative dataset is equal to the largest value minus the smallest value.

Sample Variance: The Sample Variance is equal to the sum of the squared distances from the mean divided by n-1.

Standard Deviation: The Sample Standard Deviation, s, is defined as the positive square root of the Sample Variance, s^2.

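Sample Variance

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

or, equivalently,

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]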
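The Sample Variance can be computed either directly from the definition (squared distances from the mean) or with the algebraically equivalent shortcut that uses only the sum of the values and the sum of their squares. The sketch below implements both; the function names and the small dataset are chosen purely for illustration.

```python
import math

def sample_variance_definition(xs):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Equivalent form: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [5, 3, 8, 5, 6]            # illustrative dataset
v_definition = sample_variance_definition(data)
v_shortcut = sample_variance_shortcut(data)
s = math.sqrt(v_definition)       # Sample Standard Deviation

print(v_definition, v_shortcut, s)
```

The two functions agree up to floating-point rounding; the shortcut form is convenient for hand calculation because the mean does not have to be subtracted from every observation.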

Which is best?

The meaning of the Range is easily seen from its definition. It is a very crude measure of the variability contained in a dataset, as it is only interested in the largest and smallest values and does not measure the variability of the rest of the dataset.

Example: These two datasets have the same range but do they have the same variability?

Dataset1: 1, 5, 5, 5, 9
Dataset2: 1, 2, 5, 8, 9

No, Dataset2 is obviously more spread out than Dataset1, which has three values clustered at 5.
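A quick numerical check of this example, using Python's standard `statistics` module, shows the two datasets sharing a Range while differing in Sample Variance and Standard Deviation. The helper `value_range` is a name introduced just for this sketch.

```python
import statistics

dataset1 = [1, 5, 5, 5, 9]
dataset2 = [1, 2, 5, 8, 9]

def value_range(xs):
    """Range = largest value minus smallest value."""
    return max(xs) - min(xs)

range1, range2 = value_range(dataset1), value_range(dataset2)
var1 = statistics.variance(dataset1)   # sample variance (divides by n - 1)
var2 = statistics.variance(dataset2)
sd1 = statistics.stdev(dataset1)       # sample standard deviation
sd2 = statistics.stdev(dataset2)

print("Ranges:", range1, range2)
print("Variances:", var1, var2)
print("Standard deviations:", round(sd1, 3), round(sd2, 3))
```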

Example: Once upon a time there were two lecturers, A and B, who each delivered the same course to two different classes. When exam time came both classes had the same average mark of 70%. The marks for Lecturer A’s class, however, had a Standard Deviation of 25% whereas the Standard Deviation for Lecturer B’s class was 5%. Whose class would you rather be in?

Chapter 8: Normal Curves and Relative Standing

We have just seen how datasets can be described by histograms. For large datasets of continuous variables the histograms have so many possible values that it would be impracticable to draw all of the really narrow rectangles necessary.

Instead we represent these datasets by curves (distributions). The curve can be thought of as joining the centre points of the tops of all the rectangles in the histogram.

These distributions, which are like generalised relative frequency histograms, can take many different shapes, some symmetrical, some skewed.

There is one shape however that crops up all through the natural world and that is …

THE NORMAL DISTRIBUTION

aka The Gaussian Distribution or The Bell Curve

The Normal Curve

The Normal Distribution is Symmetric.

There are many different Normal curves; some are fat, some are thin.

Some are centred at 0, some at 1, some at 5, etc.

Each Normal curve can be uniquely identified by two parameters: the Mean and the Standard Deviation.

Once you know the Mean and the Standard Deviation for a Normal curve, it is possible to draw the curve.

Normal curves are centred at the Mean, and the Standard Deviation describes how spread out they are.

[Figure: The Normal Curve, with the Mean and the Standard Deviation marked]

The area under a Normal curve to the left of the mean is 0.5. This indicates that the probability that something which is normally distributed is less than its mean is 0.5.

The area under the curve to the left of any point A on the X axis represents the probability that a Normal variable is less than A.

If X ~ Normal, then Probability(X < A) is the area under the curve to the left of A.

[Figure: Normal curve with the Mean and the point A marked on the X axis]

There are an infinite number of different Normal curves, one for each possible combination of values of the mean and the standard deviation.

However there is a relationship between all Normal curves.

All Normal variables X can be transformed into a Standard Normal Variable Z.

Z is Normal with Mean 0 and Standard Deviation 1.

\[ Z = \frac{X - \mu}{\sigma} \]

where \mu is the Mean and \sigma is the Standard Deviation of X.

We can use tables to look up areas under the Standard Normal Curve.

Example: Find the Probability that a Normal variable with Mean 3 and Standard Deviation 2 is less than 4.

\[
\begin{aligned}
\Pr(X < 4) &= \Pr(X - 3 < 4 - 3) \\
           &= \Pr\left(\frac{X - 3}{2} < \frac{4 - 3}{2}\right) \\
           &= \Pr(Z < 0.5) \\
           &= 0.6915
\end{aligned}
\]
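The table lookup in this example can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is only a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd, a = 3, 2, 4

# Standardise: subtract the mean, then divide by the standard deviation.
z = (a - mean) / sd

# Area to the left of z under the Standard Normal curve (mean 0, sd 1).
p_standard = NormalDist().cdf(z)

# The same probability computed directly from the original Normal curve.
p_direct = NormalDist(mu=mean, sigma=sd).cdf(a)

print(f"z = {z}, Pr(Z < z) = {p_standard:.4f}, Pr(X < 4) = {p_direct:.4f}")
```

Both routes give the same area, which is why a single table for the Standard Normal curve is enough for every Normal curve.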

Interpreting the Standard Deviation: the Empirical Rule

We have seen that the Variance, and hence the Standard Deviation, of a dataset provides us with a relative measure of the variability contained in a dataset: if we are given two datasets, the one with the larger Standard Deviation will be the dataset which exhibits the greater variability.

Is it possible for the Standard Deviation to give more than a relative measure of variability?

Can we actually say how spread out the data is?

The answer is yes. We will see later how to give detailed answers for particular distributions. In the meantime there are two rules which will provide us with a good deal of information about some general datasets.

The Empirical Rule provides us with some definite statements about the proportion of observations in a specified interval. It only works for Symmetric Bell-Shaped (mound-shaped) distributions. The rule is also an approximation: more or less data than is indicated by the rule may lie in each interval.

The Empirical Rule

For a Symmetric Bell-Shaped distribution (Normal or close to Normal):

Approximately 68% of the observations are within 1 Standard Deviation of the Mean.

Approximately 95% of the observations are within 2 Standard Deviations of the Mean.

Approximately 99.7% of the observations are within 3 Standard Deviations of the Mean.
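For an exactly Normal distribution these percentages can be checked as areas under the Standard Normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the dictionary name `within` is chosen for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # Mean 0, Standard Deviation 1

# Proportion of a Normal distribution within k Standard Deviations of the Mean:
# area to the left of +k minus area to the left of -k.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"Within {k} Standard Deviation(s): {proportion:.1%}")
```

The exact Normal values are slightly different from the rounded figures of 68%, 95% and 99.7%, and real datasets that are only roughly bell-shaped will deviate a little more.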

Example: In Tombstone, Arizona Territory, people used Colt .45 revolvers. However, people used different ammunition.

Wyatt Earp knew that his brothers and Doc Holliday were the only ones in the territory who used Colt .45s with Winchester ammunition.

The Earp brothers conducted tests on many different combinations of weapons and ammunition. They found that the dataset of observations produced by the combination of Colt .45 with Winchester shells showed a Mean velocity of 936 feet/second and a Standard Deviation of 10 feet/second.

The measurements were taken at a distance of 15 feet from the gun.

When Wyatt examined the body of a cowboy shot in the back in cold blood he concluded that he was shot at a distance of 15 feet and that the velocity of the bullet at impact was 1,000 feet/second.

The dastardly Ike Clanton claimed that this cowboy was shot by the Earp brothers or Doc Holliday. Was Wyatt able to clear his good name using the Empirical Rule?

The distribution of this bullet velocity data should be approximately bell-shaped. This implies that the empirical rule should give a good estimation of the percentages of the data within each interval.

k (number of Standard Deviations)   Interval     Empirical approximate percentage
2                                   916, 956     95%
3                                   906, 966     99.7%
4                                   896, 976     ~100%
5                                   886, 986     ~100%
6                                   876, 996     ~100%
7                                   866, 1006    ~100%

This table quite clearly demonstrates that the bullet velocity in the shooting, 1000 ft/sec, lies more than 6 Standard Deviations away from the mean, so the probability is extremely high that the Earps were not responsible for this shooting.

This is especially evident from looking at the column showing percentages from the Empirical Rule.

Practically 100% of bullet velocities should be between 896 and 976 ft/sec.
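The intervals in the table, and the number of Standard Deviations separating the observed velocity from the mean, can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the variable names are made up for the illustration.

```python
mean_velocity = 936    # feet/second, from the Earp brothers' tests
sd_velocity = 10       # feet/second
observed = 1000        # velocity of the bullet at impact

# Number of Standard Deviations between the observation and the mean.
z = (observed - mean_velocity) / sd_velocity

intervals = {}
for k in range(2, 8):
    low = mean_velocity - k * sd_velocity
    high = mean_velocity + k * sd_velocity
    intervals[k] = (low, high)
    inside = low <= observed <= high
    print(f"k = {k}: ({low}, {high}) contains {observed}? {inside}")

print("Standard Deviations from the mean:", z)
```

The observed velocity falls outside every interval up to k = 6 and only enters the interval at k = 7.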

Numerical Measures of Relative Standing

While it is useful to know how to measure the centre of a dataset and the variability of a dataset, many times we want to be able to compare one observation with the rest of the observations in the dataset. Is one observation larger than many others?

For example, suppose you get 35% on the exam for this course. You will probably feel quite bad about your performance, but what if 90% of the class actually did worse than you? Then you might feel a bit better about your 35%.

So in some cases knowing how one observation compares with others can be more useful than just knowing the value of that observation.

We will now look at some different ways of measuring Relative Standing.

Definitions

Percentile: For any dataset the pth percentile is the observation which is greater in value than p% of all the numbers. Consequently this observation will be smaller than (100 - p)% of the data.

Z-Score: The Z-Score of an observation is the distance between that observation and the mean expressed in units of standard deviations. So:

\[ Z = \frac{X - \mu}{\sigma} \]

where X is the observation, \mu is the mean and \sigma is the standard deviation of the dataset.

The numerical value of the Z-score reflects the relative standing of the observation.

A large positive Z-score implies that the observation is larger than most of the other observations.

A large negative Z-score indicates that the observation is smaller than almost all the other observations.

A Z score of zero or close to 0 means that the observation is located close to the mean of the dataset.

Example: A sample of 120 statistics students was chosen and their exam results summarised. The mean and standard deviation were shown to be: mean = 53%, st.dev. = 7%.

Eric and Kenny are two students in this class. Eric’s exam result was 47%; what was his Z-score?

If Kenny’s Z-Score is 2, what was his percentage on the exam?

What happens to Kenny then?
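One way to check answers to the first two questions is to apply the Z-Score definition in both directions. The helpers `z_score` and `value_from_z` below are hypothetical names written for this sketch; they are a direct rearrangement of the formula rather than anything specific to this course.

```python
def z_score(x, mean, sd):
    """Distance between x and the mean, in units of standard deviations."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the Z-Score: recover the observation from its Z-Score."""
    return mean + z * sd

mean_mark, sd_mark = 53, 7

eric_z = z_score(47, mean_mark, sd_mark)          # Eric scored 47%
kenny_mark = value_from_z(2, mean_mark, sd_mark)  # Kenny's Z-Score is 2

print(f"Eric's Z-score: {eric_z:.3f}")
print(f"Kenny's mark: {kenny_mark}%")
```

A negative Z-Score places Eric below the class mean, while a Z-Score of 2 places Kenny two standard deviations above it.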
