
Measures of Dispersion

There are many, many model probability distributions

Here’s a link to a map of 50+ probability distributions, showing how they all relate!

Features of Probability Distributions

● The center is the mean, median, or mode.

● The spread is the variability of the data.

● Shape can be described by symmetry, skewness, number of peaks (modes), etc.

Descriptive Statistics

● Measures of central tendency

○ Mean, median, mode

● Measures of spread

○ Variance, standard deviation

○ Range, max, min, quartiles

● Measures of shape

○ Skew, modalities, etc.

Spread

Measurements Vary

● Measurements vary for one of two reasons:

○ Systematic bias: underlying problems with how the data were collected, leading to inaccuracies (e.g., thermometer off by a degree)

○ Randomness: natural fluctuations in measurements (aka “noise”)

● You can and will get different results taking the same measurements over and over again, as a result of the randomness.

Spread

● Measures of central tendency tell us something about the center of a probability distribution or a histogram.

● But they don’t tell us anything about spread.

One Measure of Spread

● Data: Change in the valuation of different financial sectors.

● Let’s think up a metric we can use to measure the spread of a distribution.

● A starting point: what is the average deviation of the values from the mean?

○ [(1.66 - 0.58) + (1.08 - 0.58) + (0.89 - 0.58) + … + (0.28 - 0.58) + (0.13 - 0.58) + (-0.28 - 0.58)] / 10 = 0

● What is the problem with this measure?

● The problem is that we get positive and negative deviations, and they end up cancelling each other out, so the average deviation is 0!

● How can we make all deviations positive?

○ We can take absolute values, or

○ We can square everything!

● Our measure of spread will be the average squared deviation of values from the mean. This is called the sample variance.
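The cancellation described above can be checked numerically. The following is a minimal sketch in Python using the ten sector values listed in these slides; the variable names are illustrative and not part of the original deck.

```python
# Ten sector valuation changes, as listed in the slides.
values = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]

n = len(values)
mean = sum(values) / n

# Signed deviation of each value from the mean.
deviations = [x - mean for x in values]

# Positive and negative deviations cancel, so this is zero
# (up to floating-point rounding error).
avg_deviation = sum(deviations) / n

# Two ways to make every deviation non-negative before averaging.
avg_abs_deviation = sum(abs(d) for d in deviations) / n
avg_sq_deviation = sum(d * d for d in deviations) / n

print(f"mean = {mean:.2f}")
print(f"average deviation is effectively zero: {abs(avg_deviation) < 1e-9}")
print(f"average absolute deviation = {avg_abs_deviation:.3f}")
print(f"average squared deviation = {avg_sq_deviation:.3f}")
```

Both alternatives avoid the cancellation; the slides go on to use the squared version.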

Calculating Sample Variance

● The sample mean is ≈ 0.58.

● The total sum of the squared deviations:

○ (1.66 - 0.58)² + (1.08 - 0.58)² + (0.89 - 0.58)² + (0.69 - 0.58)² + (0.52 - 0.58)² + (0.47 - 0.58)² + (0.36 - 0.58)² + (0.28 - 0.58)² + (0.13 - 0.58)² + (-0.28 - 0.58)² ≈ 2.62

● To find the average, we divide by 10.

● So, the sample variance is ≈ 0.262.

● Why such a big spread? Well, because we squared the deviations!

● Arguably of more interest, since it is of the same magnitude as the values, is the square root of the sample variance: i.e., the sample standard deviation.
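As a rough check of the arithmetic, here is a short Python sketch that repeats the calculation on the same ten values. It divides by n = 10, as the slides do. Note that many textbooks and libraries reserve the name "sample variance" for the version that divides by n - 1; in Python's standard library, `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.

```python
import math
import statistics

# Ten sector valuation changes, as listed in the slides.
values = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]

n = len(values)
mean = sum(values) / n

# Total of the squared deviations from the mean.
sum_sq = sum((x - mean) ** 2 for x in values)

# Average squared deviation, dividing by n as in the slides.
variance = sum_sq / n

# Square root brings the measure back to the scale of the data.
std_dev = math.sqrt(variance)

print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"variance (divide by n)    = {variance:.3f}")
print(f"standard deviation        = {std_dev:.3f}")

# Standard-library equivalents of the two conventions.
print(f"statistics.pvariance (n)     = {statistics.pvariance(values):.3f}")
print(f"statistics.variance  (n - 1) = {statistics.variance(values):.3f}")
```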

Interquartile Range (IQR)

Maximum, minimum, and range

● The maximum is an outcome that is greater than or equal to all others in our sample

● The minimum is an outcome that is less than or equal to all others in our sample

● The range is the difference between the maximum and the minimum

● The maximum and minimum are sensitive to outliers

○ If an outcome is added to our sample that is less than the minimum, then the minimum changes

○ If an outcome is added to our sample that is greater than the maximum, then the maximum changes

● To determine if the maximum and the minimum in our data set are indeed outliers, we can use a rule of thumb called the interquartile range rule
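To illustrate how sensitive these three quantities are, the sketch below computes them for the ten sector values from the earlier slides, then again after adding one extreme observation. The value 5.0 is a made-up number chosen only for illustration, and `min_max_range` is a hypothetical helper name.

```python
# Ten sector valuation changes, as listed in the earlier slides.
values = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]


def min_max_range(sample):
    """Return (minimum, maximum, range) of a non-empty sample."""
    lo, hi = min(sample), max(sample)
    return lo, hi, hi - lo


lo, hi, spread = min_max_range(values)
print(f"min = {lo}, max = {hi}, range = {spread:.2f}")

# Hypothetical extreme observation (not from the original data).
extreme = 5.0
lo2, hi2, spread2 = min_max_range(values + [extreme])
print(f"min = {lo2}, max = {hi2}, range = {spread2:.2f}")
```

A single added value moves the maximum, and with it the range, while the minimum is unchanged.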

Quartiles

● Quartiles within an ordered set of data are three points that divide the dataset into four equal groups, each containing a quarter of the data

● The first quartile (Q1) is the median of the lower half of the data, i.e., the point halfway (by count) between the minimum and the median

● The second quartile (Q2) is the median

● The third quartile (Q3) is the median of the upper half of the data, i.e., the point halfway (by count) between the median and the maximum

● The interquartile range (IQR) is the difference between Q3 and Q1 (IQR = Q3 - Q1)

Computing quartiles

● Find the median, and use it to divide the dataset into two halves

○ If there is an odd number of data points in the dataset, do not include the median in either half

○ If there is an even number of data points in the dataset, split it exactly in half

● The lower quartile value is the median of the lower half of the data

● The upper quartile value is the median of the upper half of the data
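The procedure above can be sketched in Python as follows. This is one possible implementation of the median-of-halves method, not a library routine. The function names are illustrative, and the two inputs are the ordered samples used in the worked examples in these slides.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two data points.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Odd n: skip the median itself; even n: split exactly in half.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles([3, 4, 4, 5, 6, 8, 8])
print(q1, q2, q3, "IQR =", q3 - q1)

q1, q2, q3 = quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
print(q1, q2, q3, "IQR =", q3 - q1)
```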

Example of computing quartiles

● Computing quartiles when the sample size is odd

● Ordered sample: 3, 4, 4, 5, 6, 8, 8

First Quartile (Q1): 4

Second Quartile (Q2) (Median): 5

Third Quartile (Q3): 8

Interquartile Range (IQR = Q3 - Q1): 4

Another example of computing quartiles

● Computing quartiles when the sample size is even

● Ordered sample: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8

First Quartile (Q1): 3

Second Quartile (Q2) (Median): 5.5

Third Quartile (Q3): 7

Interquartile Range (IQR = Q3 - Q1): 4
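Quartile conventions differ between textbooks and software. As a hedged illustration, Python's `statistics.quantiles` interpolates between data points (its default method is "exclusive") rather than taking the median of each half, so its first and third quartiles need not match the values worked out by hand above. The median agrees under either convention.

```python
import statistics

# Even-sized ordered sample from the slides.
sample = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

# n=4 asks for the three cut points that split the data into quarters.
cuts = statistics.quantiles(sample, n=4)
q1, q2, q3 = cuts

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

When comparing results across tools, it helps to check which quartile method each one uses.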

Box and whisker plots: for visualizing quartiles

● Depict minimum, first quartile, median, third quartile and maximum

● The upper whisker (from the maximum to the third quartile) represents the upper 25% of the distribution

● The interquartile range (IQR) represents the middle 50% of the data

● The lower whisker (from the first quartile to the minimum) represents the lower 25% of the distribution
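The five values a box and whisker plot depicts are often called the five-number summary. The sketch below computes them with the median-of-halves quartile method described in these slides. The function name `five_number_summary` is illustrative, and the input is the even-sized sample from the quartile example.

```python
import statistics


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (median-of-halves quartiles).

    Assumes at least two data points.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Odd n: leave the median out of both halves.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return {
        "min": data[0],
        "q1": statistics.median(lower),
        "median": statistics.median(data),
        "q3": statistics.median(upper),
        "max": data[-1],
    }


summary = five_number_summary([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

The box spans q1 to q3, the line inside it marks the median, and the whiskers extend to min and max.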

Do pets relieve stress?

● Does someone experience different levels of stress when doing tasks with a pet, a good friend, or alone?

● Allen et al. had 45 people count backwards by 13s and 17s

● The people were randomly assigned to 3 different groups: pet (P), friend (F), and alone (C, for control)

● The dependent variable measured was the subject’s average heart rate during the task

Study Results

● The task was most stressful around friends and least stressful around pets

● We are comparing levels of a quantitative variable (heart rate) across levels of a categorical (qualitative) variable (treatment)

Side-by-side box and whisker plots (double the fun!)

● Side-by-side box and whisker plots are a method for visualizing data when one variable is categorical (qualitative) and the other is quantitative

● They can be used to compare the distributions of a quantitative variable across the levels of a categorical variable

● In this plot, the stars are outliers

Interquartile range rule for detecting outliers

1. Calculate the interquartile range (Q3-Q1).

2. Multiply the interquartile range (IQR) by 1.5.

3. Add 1.5*IQR to the third quartile. This value is called the upper fence. Values greater than this are suspected outliers.

4. Subtract 1.5*IQR from the first quartile. This value is called the lower fence. Values less than this are suspected outliers.
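The four steps above can be sketched in Python as follows. The quartiles use the median-of-halves method from the earlier slides. The function names are illustrative, and the extra value 20 is a made-up observation added only to show a suspected outlier.

```python
import statistics


def iqr_fences(values):
    """Return (lower fence, upper fence) from the 1.5 * IQR rule.

    Assumes at least two data points.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = statistics.median(data[:half])
    # Odd n: leave the median out of the upper half.
    q3 = statistics.median(data[half + 1:] if n % 2 == 1 else data[half:])
    iqr = q3 - q1                      # step 1
    margin = 1.5 * iqr                 # step 2
    return q1 - margin, q3 + margin    # steps 4 and 3


def suspected_outliers(values):
    """Values lying strictly outside the fences."""
    lower, upper = iqr_fences(values)
    return [x for x in values if x < lower or x > upper]


sample = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]   # even-sized sample from the slides
print(iqr_fences(sample))
print(suspected_outliers(sample))

# Hypothetical extreme value, not part of the original sample.
print(suspected_outliers(sample + [20]))
```

Adding a value changes the quartiles, so the fences are recomputed on the enlarged sample before the check.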

Summary

Descriptive statistics don’t tell the whole story

[Figure: example distribution shapes, labeled symmetric unimodal, skewed right, skewed left, bimodal, multimodal, and symmetric]

Summary

● We can summarize data sets by the features of their frequency distributions, such as their center, dispersion, shape, etc.

○ Mean, median, and mode are three measures of central tendency

○ Variance, standard deviation, and IQR are three measures of dispersion

● Descriptive statistics can be more informative than raw data, but they do not tell the whole story, so they can also be misleading