Marc H. Mehlman [email protected]/mhm/courses/estat/slides/data.pdf · 2017. 1. 31. · Marc Mehlman Statistics Marc H. Mehlman [email protected] University of

Marc Mehlman

Statistics

Marc H. Mehlman

University of New Haven

“To understand God’s thoughts, we must study statistics, for these are the measure of his purpose.” – Florence Nightingale

“Statistics: the mathematical theory of ignorance.” – Morris Kline

“Statistics means never having to say you’re certain.” – Anonymous


Table of Contents

1 Introduction

2 Graphical Representation of Distributions

3 Measuring the Center

4 Measuring the Spread

5 Normal Distribution

6 Misuse of Statistics

7 Chapter #1 R Assignment



Introduction

Definition

Given a population, one often examines a sample of the population in order to draw inferences about the entire population. A variable is a measurable characteristic of individuals within the population. The distribution of a variable describes which values it takes and how frequently it takes them. Data are a variable’s values from the sample. Statistics is the science of drawing inferences about the population from data.

Example

From the 50,000 residents of the town of Milford, 300 were selected randomly and asked for their highest academic degree. The population is the 50,000 residents, the sample is the 300 randomly selected residents, and the variable is the level of education of the resident. It was too costly to contact all 50,000 residents, so the actual distribution of terminal degrees among the entire population is inferred from the distribution of the terminal degrees of the 300 randomly sampled residents.

Origins of statistics: anecdotes and noticing patterns in random happenings.
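The Milford example can be mimicked with a small simulation. This is only a sketch under stated assumptions: the degree categories and their shares are invented for illustration (they are not Milford data), and `set.seed` is used only to make the run repeatable.

```r
# Hypothetical population of 50,000 residents; the degree shares are invented.
set.seed(1)
degrees <- c("None", "High school", "Bachelor", "Graduate")
population <- sample(degrees, size = 50000, replace = TRUE,
                     prob = c(0.10, 0.45, 0.30, 0.15))

# Simple random sample of 300 residents, drawn without replacement.
survey <- sample(population, size = 300)

# Distribution of the variable in the sample versus the whole population.
sample_dist     <- prop.table(table(survey))
population_dist <- prop.table(table(population))
print(round(sample_dist, 3))
print(round(population_dist, 3))
```

The two printed distributions should be close but not identical, which is the sense in which the sample is used to infer the population distribution.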



“Data. Data. Data. I can’t make bricks without clay.” – Sherlock Holmes

“In God we trust. All others must bring data.” - W. Edwards Deming

Definition (Types of Variables)

qualitative (categorical): descriptive. Examples: color of eyes, gender, city born in.

quantitative: numeric. Examples: height, miles per gallon, temperature, etc.

Definition (Types of Quantitative Variables)

discrete: discrete range. Examples: number of children someone has, number of coins in pocket.

continuous: continuous range. Examples: weight, speed.


Graphical Representation of Distributions

Distribution of a Variable

To examine a single variable, we graphically display its distribution.

The distribution of a variable tells us what values it takes and how often it takes these values.

Distributions can be displayed using a variety of graphical tools. The proper choice of graph depends on the nature of the variable.

Categorical variable: pie chart, bar graph.

Quantitative variable: histogram, stemplot.


Categorical Variables

The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into that category.

Pie Charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.

Bar Graphs represent each category as a bar whose height shows the category count or percent.
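The counts and percents behind a pie chart or bar graph can be tabulated directly. The sketch below reuses the counts from the "Favorite Colors" bar graph example in these slides; rebuilding one answer per individual with `rep` is an illustrative assumption about what the raw data looked like.

```r
# Counts from the "Favorite Colors" bar graph example in these slides.
counts <- c(Red = 40, Blue = 30, Green = 20, Brown = 10)

# Rebuild one observation per individual (assumed raw form), then tabulate.
answers <- rep(names(counts), times = counts)
tab <- table(answers)            # count of individuals in each category
pct <- 100 * prop.table(tab)     # percent of individuals in each category
print(tab)
print(pct)
```

Either `tab` or `pct` can be passed to `barplot` or `pie`; the shape of the graph is the same, only the scale changes.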




> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)

> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")

> pie(pie.sales, labels = lbls, main="Pie Sales")

[Figure: pie chart titled "Pie Sales" with slices Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream]




> counts=c(40,30,20,10)

> colors=c("Red","Blue","Green","Brown")

> barplot(counts,names.arg=colors,main="Favorite Colors")

[Figure: bar graph titled "Favorite Colors" with bars for Red, Blue, Green, Brown; vertical axis from 0 to 40]


Quantitative Variables

The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.

Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.

Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.


Histograms

Histograms suit quantitative variables that take many values and/or large datasets.

Divide the possible values into classes (equal widths).

Count how many observations fall into each interval (may change to percents).

Draw a picture representing the distribution: bar heights are equal to the number (percent) of observations in each interval.
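The histogram steps can be carried out by hand before any drawing. This is a minimal sketch using the built-in `trees` dataset that these slides use elsewhere; the class width of 2 inches is an assumption chosen for illustration.

```r
girth <- trees$Girth                      # built-in dataset used in these slides

# Step 1: equal-width classes covering the data (width 2 is an assumption).
breaks <- seq(8, 22, by = 2)

# Step 2: count the observations in each class; intervals are of the form (a, b].
classes <- cut(girth, breaks = breaks)
counts  <- table(classes)
print(counts)

# Step 3: hist() with the same breaks reports these counts as its bar heights.
h <- hist(girth, breaks = breaks, plot = FALSE)
print(h$counts)
```

Changing `breaks` changes the picture, so it is worth trying more than one class width before describing the shape.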




> hist(trees$Girth,main="Girth of Black Cherry Trees",xlab="Diameter in Inches")

[Figure: histogram titled "Girth of Black Cherry Trees"; horizontal axis "Diameter in Inches" from 8 to 22; vertical axis "Frequency" from 0 to 12]


Stemplots

To construct a stemplot:

Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).

Write the stems in a vertical column; draw a vertical line to the right of the stems.

Write each leaf in the row to the right of its stem; order leaves if desired.
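The construction steps can be followed literally in code. The sketch below assumes observations with one decimal place, as in the built-in `trees$Girth` data, takes the whole-number part as the stem and the tenths digit as the leaf, and uses names (`stems`, `leaves`, `rows`) that are purely illustrative.

```r
girth <- sort(trees$Girth)                      # ordering the data orders the leaves

# Stem = whole-number part, leaf = tenths digit (assumes one decimal place).
stems  <- factor(floor(girth), levels = 8:20)   # keep stems that have no leaves
leaves <- round(10 * (girth - floor(girth)))

# Paste the leaves that share a stem into one row per stem.
rows <- tapply(leaves, stems, paste, collapse = "")
rows[is.na(rows)] <- ""                         # stems with no observations

for (s in names(rows)) cat(sprintf("%3s | %s\n", s, rows[[s]]))
```

R's built-in `stem` function does the same job and also chooses the stem width for you.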




> Girth=trees$Girth

> stem(Girth) # stem and leaf plot

  The decimal point is at the |

   8 | 368
  10 | 57800123447
  12 | 099378
  14 | 025
  16 | 03359
  18 | 00
  20 | 6

> stem(Girth, scale=2)

  The decimal point is at the |

   8 | 368
   9 |
  10 | 578
  11 | 00123447
  12 | 099
  13 | 378
  14 | 025
  15 |
  16 | 03
  17 | 359
  18 | 00
  19 |
  20 | 6


Examining Distributions

In any graph of data, look for the overall pattern and for striking deviations from that pattern.

You can describe the overall pattern by its shape, center, and spread.

An important kind of deviation is an outlier, an individual that falls outside the overall pattern.


A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.

A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.

It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.

[Figure: three example distributions labeled Symmetric, Skewed-left, Skewed-right]


Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: distribution of elderly representation by state, with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for two states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.


Measuring the Center

Measures of the Center

Definition

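Given $x_1, x_2, \dots, x_n$, the sample mean is

$$\bar{x} \stackrel{\text{def}}{=} \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{j=1}^{n} x_j .$$

The population mean is

$$\mu \stackrel{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N} x_j .$$

If one orders the data from smallest to largest, the median is

$$M \stackrel{\text{def}}{=} \begin{cases} \text{the middle value of the data} & \text{if } n \text{ is odd}, \\ \text{the average of the middle two values of the data} & \text{if } n \text{ is even}. \end{cases}$$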

Laymen refer to the mean as the average.

Example

The median sales price of a house in Milford was $212,175 for Feb–Apr 2013. If Bill Gates buys a house in Milford for $100 million, what will that do to the mean cost of a house in Milford? To the median cost? Which is a better measure of the cost of buying a house in Milford, the mean or the median?
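The effect in this example can be tried on a toy dataset. In the sketch below the five sale prices are hypothetical, chosen only so that their median matches the $212,175 figure from the slide; the $100 million purchase is the one posed in the question.

```r
# Hypothetical sale prices in dollars; only the median 212,175 comes from the slide.
prices <- c(150000, 180000, 212175, 260000, 340000)

# Add the single $100 million purchase from the example.
with_gates <- c(prices, 100e6)

print(c(mean = mean(prices),     median = median(prices)))
print(c(mean = mean(with_gates), median = median(with_gates)))
```

One extreme sale moves the mean into the millions while the median shifts only to the midpoint of two ordinary prices, which is why the median is the more resistant measure here.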


“Statistically, if you lie with your head in the oven and your feet in the fridge, on average you will be comfortably warm.” – Anonymous

“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates

“The average human has one breast and one testicle.” – humorist Des McHale


Comparing Mean and Median

The mean and median measure center in different ways, and both are useful.

The mean and median of a roughly symmetric distribution are close together.

If the distribution is exactly symmetric, the mean and median are exactly the same.

In a skewed distribution, the mean is usually farther out in the long tail than is the median.


Measuring the Spread

Measuring Spread: The Quartiles

A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.

How to Calculate the Quartiles and the Interquartile Range

To calculate the quartiles:

• Arrange the observations in increasing order and locate the median M.

• The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.

• The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.

The interquartile range (IQR) is defined as: IQR = Q3 − Q1.
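The quartile rule above can be written as a short function. This is a sketch, not a base R routine: `slide_quartiles` is a hypothetical helper name, and the data are the mail counts from the outlier example in these slides. R's own `quantile` interpolates by default, so its answers can differ from this rule.

```r
# Quartiles as defined on the slide: medians of the halves on either side of M.
# slide_quartiles is a hypothetical helper name, not a base R function.
slide_quartiles <- function(x) {
  x <- sort(x)
  half <- floor(length(x) / 2)   # size of each half; the middle value is left out when n is odd
  c(Q1 = median(head(x, half)),
    M  = median(x),
    Q3 = median(tail(x, half)))
}

mail <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)   # mail counts from these slides
q <- slide_quartiles(mail)
print(q)
print(q[["Q3"]] - q[["Q1"]])                         # IQR by the slide's rule

# For comparison: quantile() uses interpolation (type = 7) by default.
print(quantile(mail, c(0.25, 0.75)))
```

For these eleven values the slide's rule and the default `quantile` method do not give the same quartiles, so it helps to state which convention is in use when reporting an IQR.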


Definition (The 1.5 x IQR Rule for Outliers)

Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.

Example

The number of items of mail that eleven professors, chosen at random, received on September 3rd, 2013 is given below.

18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7.

Are there any outlier professors?

Solution: We sort the data and find the quartiles:

3 5 7 (Q1) 9 11 13 (M) 15 16 18 (Q3) 23 35

Since IQR = 18 − 7 = 11 and 35 − Q3 = 17 > 1.5 x IQR = 16.5, one identifies 35 as an outlier. No observation lies more than 16.5 below Q1 = 7, so 35 is the only outlier.
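The same check can be scripted. The sketch below applies the 1.5 x IQR rule with the slide's quartiles and then, for comparison, calls `boxplot.stats`, which uses hinges rather than the slide's quartile rule; the variable names are illustrative.

```r
mail <- sort(c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7))

# Quartiles by the slide's rule (n = 11 is odd, so the median is left out of both halves).
q1  <- median(head(mail, 5))
q3  <- median(tail(mail, 5))
iqr <- q3 - q1

# Anything beyond these fences is called an outlier.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- mail[mail < lower_fence | mail > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)

# boxplot.stats() computes its fences from hinges, not the slide's quartiles,
# yet it flags the same observation for these data.
print(boxplot.stats(mail)$out)
```

This is also why R's `boxplot` draws a separate point for such an observation instead of extending the whisker to it.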




Definition

Given x1, x2, · · · , xn, the five number summary is

minimum, Q1, M, Q3, maximum.

Definition

Given x1, x2, · · · , xn, to create a boxplot (also called a box and whiskers plot)

1 draw and label a vertical number line that includes the range of the distribution.

2 draw a box from height Q1 to Q3.

3 draw a horizontal line inside the box at the height of the median.

4 draw vertical line segments (whiskers) from the bottom and top of the box to the minimum and maximum data values that are not outliers.

5 sometimes outliers are identified with ◦’s (R does this).

Boxplots are often useful when comparing the values of two different variables.




> boxplot(trees$Height, main="Heights of Black Cherry Trees")

> boxplot(USJudgeRatings$DMNR,USJudgeRatings$DILG,

+ main="Lawyers’ Demeanor/Diligence ratings of US Superior Court state judges")

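[Figure: two boxplots. First: "Heights of Black Cherry Trees", vertical axis from 65 to 85. Second: "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges", vertical axis from 5 to 9, with outliers marked as points]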


Definition

The population variance is

$$\sigma^2 \stackrel{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2$$

and the population standard deviation is

$$\sigma \stackrel{\text{def}}{=} \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2}.$$

However, one often has only a random sample to examine, not the entire population. With only a random sample, one cannot calculate the population mean, µ, so the best one can do is use the sample mean, x̄, instead.

Definition

The sample variance is

$$s^2 \stackrel{\text{def}}{=} \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$$

and the sample standard deviation is

$$s \stackrel{\text{def}}{=} \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2}.$$

Notice the use of n − 1 instead of n for the sample variance and standard deviation.




Properties of the Sample Standard Deviation

1 s measures the amount the data is dispersed about the mean.

2 s ≥ 0 and if s = 0 then all the data values are the same.

3 s has the same units of measurement as the data.

4 s is sensitive to the existence of outliers.

Example

Suppose our random sample is

4.5, 3.7, 2.8, 5.3, 4.6.

Then

$$\bar{x} = \frac{1}{5}\left[4.5 + 3.7 + 2.8 + 5.3 + 4.6\right] = 4.18$$

$$s^2 = \frac{1}{5-1}\left[(4.5-4.18)^2 + (3.7-4.18)^2 + (2.8-4.18)^2 + (5.3-4.18)^2 + (4.6-4.18)^2\right] = 0.917$$

$$s = \sqrt{0.917} = 0.9576012.$$
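The arithmetic in this example can be checked in R. The sketch below computes the sample variance from the definition with the n − 1 divisor, compares it with the built-in `var` and `sd`, and also shows the value obtained if one divided by n instead; the variable names are illustrative.

```r
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)      # the random sample from the example
n <- length(x)
xbar <- mean(x)

# Sample variance and standard deviation straight from the definition (divisor n - 1).
s2_by_hand <- sum((x - xbar)^2) / (n - 1)
s_by_hand  <- sqrt(s2_by_hand)

# Dividing by n instead gives a smaller number.
v_div_n <- sum((x - xbar)^2) / n

print(c(xbar = xbar, s2 = s2_by_hand, s = s_by_hand, v_div_n = v_div_n))

# R's var() and sd() use the n - 1 divisor, so they agree with the hand computation.
print(all.equal(var(x), s2_by_hand))
print(all.equal(sd(x), s_by_hand))
```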


Choosing Measures of Center and Spread

We now have a choice between two descriptions of center and spread:

Mean and Standard Deviation

Median and Interquartile Range

The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.

Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.

NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!




R Commands:

Example

> mean(trees$Volume)

[1] 30.17097

> median(trees$Volume)

[1] 24.2

> summary(trees$Volume)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.20 19.40 24.20 30.17 37.30 77.00

> IQR(trees$Volume)

[1] 17.9

> var(trees$Volume)

[1] 270.2028

> sd(trees$Volume)

[1] 16.43785


Normal Distribution

Density Curves

A density curve is a curve that:

• is always on or above the horizontal axis

• has an area of exactly 1 underneath it

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
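Both defining properties can be checked numerically for a concrete curve. The sketch below uses the standard Normal density `dnorm` as the example curve (an assumption made for illustration; any density would do) and the base R function `integrate` to compute areas.

```r
# Example density curve: the standard Normal density (chosen for illustration).
grid <- seq(-6, 6, by = 0.1)
never_negative <- all(dnorm(grid) >= 0)           # on or above the horizontal axis

# Total area under the curve.
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Area above the range from -1 to 2 = proportion of observations in that range.
area <- integrate(dnorm, lower = -1, upper = 2)$value

print(never_negative)
print(total)
print(area)
print(pnorm(2) - pnorm(-1))                       # same proportion from the cumulative function
```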


Distinguishing the Median and Mean of a Density Curve

Our measures of center and spread apply to density curves as well as to actual sets of observations.

• The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.

• The mean of a density curve is the balance point, at which the curve would balance if made of solid material.

• The median and the mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.


Normal Distributions

One particularly important class of density curves is the class of Normal curves, which describe Normal distributions.

All Normal curves are symmetric, single-peaked, and bell-shaped.

A specific Normal curve is described by giving its mean µ and standard deviation σ.


A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ.

The mean of a Normal distribution is the center of the symmetric Normal curve.

The standard deviation is the distance from the center to the change-of-curvature points on either side.

We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ).


The 68-95-99.7 Rule

In the Normal distribution with mean µ and standard deviation σ:

Approximately 68% of the observations fall within σ of µ.

Approximately 95% of the observations fall within 2σ of µ.

Approximately 99.7% of the observations fall within 3σ of µ.
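The three percentages can be recovered from the Normal cumulative distribution function `pnorm`. In this sketch the second calculation uses N(70, 2.8), the height distribution from an example in these slides, only to illustrate that the proportions do not depend on µ and σ.

```r
k <- 1:3                                   # number of standard deviations from the mean

# Standard Normal: proportion within k standard deviations of the mean.
within_k <- pnorm(k) - pnorm(-k)
print(round(100 * within_k, 2))

# Any other Normal distribution gives the same proportions, e.g. N(70, 2.8).
mu <- 70
sigma <- 2.8
within_k_heights <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
                    pnorm(mu - k * sigma, mean = mu, sd = sigma)
print(round(100 * within_k_heights, 2))
```

The printed values are slightly more precise than 68, 95, and 99.7, which is why the rule is stated with "approximately".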


Standardizing Observations

All Normal distributions are the same if we measure in units of size σ from the mean µ as center.

If a variable x has a distribution with mean µ and standard deviation σ, then the standardized value of x, or its z-score, is

$$z = \frac{x - \mu}{\sigma}.$$

The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1. That is, the standard Normal distribution is N(0,1).
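Standardizing is a one-line computation. The sketch below uses µ = 5 and σ = 3 with the values 4 and 7 (the numbers from an example in these slides) and checks that areas to the left are the same whether computed on the original scale or on the z scale.

```r
mu <- 5
sigma <- 3
x <- c(4, 7)

# z-score: how many standard deviations each x lies from the mean.
z <- (x - mu) / sigma
print(z)

# Area to the left of x under N(5, 3) equals area to the left of z under N(0, 1).
print(pnorm(x, mean = mu, sd = sigma))
print(pnorm(z))
```

This equality is what allows a single standard Normal table to serve every Normal distribution.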


The Standard Normal Table

Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table.

Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.



Example

Given X ∼ N(5, 3), what is the probability that 4 ≤ X ≤ 7?

Solution: Using the table:

$$P(4 \le X \le 7) = P\left(\frac{4-5}{3} \le \frac{X-5}{3} \le \frac{7-5}{3}\right) = P(-0.33 \le Z \le 0.67)$$

$$= P(Z \le 0.67) - P(Z \le -0.33)$$

$$= 0.7486 - 0.3707 = 0.3779$$

or, using R:

> pnorm(7,5,3) - pnorm(4,5,3)

[1] 0.3780661


Example

According to the National Health and Nutrition Examination Study 1976–1980, the heights (in inches) of adult men aged 18–24 are N(70, 2.8). What is the tallest a man aged 18–24 can be and still be among the shortest 10% of all such men?

Solution: Using the table:

$$0.1 = P(X \le x) = P\left(\frac{X-70}{2.8} \le \frac{x-70}{2.8}\right) = P\left(Z \le \frac{x-70}{2.8}\right).$$

Using reverse table lookup one has

$$-1.28 = \frac{x-70}{2.8} \;\Rightarrow\; x = 66.416.$$

Or, using R:

> qnorm(0.1,70,2.8)

[1] 66.41166


Normal Quantile Plots

One way to assess whether a distribution is indeed approximately normal is to plot the data on a normal quantile plot.

The data points are ranked and the percentile ranks are converted to z-scores with Table A. The z-scores are then used for the x axis against which the data are plotted on the y axis of the normal quantile plot.

If the distribution is indeed normal, the plot will show a straight line, indicating a good match between the data and a normal distribution.

Systematic deviations from a straight line indicate a non-normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
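The construction described above can be reproduced step by step. This sketch uses the built-in `trees$Girth` data and `qnorm` in place of a table lookup; the percentile position (i − 0.5)/n is the convention R's `ppoints` uses for samples of more than 10 observations, which is an assumption relative to the slide's wording.

```r
girth <- trees$Girth
n <- length(girth)

# Rank the data, then turn ranks into percentile positions.
# (i - 0.5) / n is the convention R's ppoints() uses when n > 10.
p <- (rank(girth, ties.method = "first") - 0.5) / n

# Convert percentile positions to z-scores (qnorm replaces the table lookup).
z <- qnorm(p)

# z-scores on the x axis, data on the y axis.
plot(z, girth, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal quantile plot built by hand")

# The built-in qqnorm() computes the same x coordinates.
print(all.equal(z, qqnorm(girth, plot.it = FALSE)$x))
```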


Normal quantile plots are complex to do by hand, but they are standard features in most statistical software.

[Figure: two normal quantile plots, one of rainwater pH values and one of survival times]

Good fit to a straight line: the distribution of rainwater pH values is close to normal.

Curved pattern: the data are not normally distributed. Instead, they show a right skew: a few individuals have particularly long survival times.



R commands:

> dat=rnorm(500,4,3)

> qqnorm(dat); qqline(dat, col="red")

> qqnorm(trees$Girth); qqline(trees$Girth, col="red")

[Figure: two plots titled "Normal Q−Q Plot", each with horizontal axis "Theoretical Quantiles" and vertical axis "Sample Quantiles": one for the simulated data dat, one for trees$Girth]

Misuse of Statistics


“He uses statistics as a drunken man uses lamp-posts . . . for support rather than illumination.” – Andrew Lang (1844–1912)

“Figures fool when fools figure.” – Oliver Lancaster, professor of mathematical statistics at Sydney University

“There are three kinds of lies: lies, damned lies, and statistics.” – Mark Twain

“Facts are stubborn, but statistics are more pliable.” – Mark Twain

“Torture numbers, and they’ll confess to anything.” – Gregg Easterbrook

Daily Cal: Forty percent of the workforce is women, yet only thirty percent of women on TV series work.


Chapter #1 R Assignment


Fifty-eight sailors are sampled and their eye color is noted below.

blue  brown  green  hazel  red
11    32     8      5      2

1 Create a barplot and pie chart of eye color from the sailor sample.

2 Create a histogram and a stemplot of the height of loblolly trees from the dataset “Loblolly”. The dataset “Loblolly” comes with R, just as “trees” does. To observe “Loblolly”, type “Loblolly” at the R prompt (without the quotes). To learn more about the dataset, type “help(Loblolly)” at the R prompt.

3 Find the mean, median, five number summary, variance and standard deviation from the sample of heights in the dataset “Loblolly”.

4 If X ∼ N(2, 3), find P(1.3 ≤ X ≤ 5.8).

5 Create a Normal Quantile Plot of the height of loblolly trees from the dataset “Loblolly” and decide if the heights came from a normal distribution.

