Marc Mehlman
Statistics
Marc H. Mehlman
University of New Haven
“To understand God’s thoughts, we must study statistics, for these are the measure of his purpose.” – Florence Nightingale
“Statistics: the mathematical theory of ignorance.” – Morris Kline
“Statistics means never having to say you’re certain.” – Anonymous
Marc Mehlman (University of New Haven) Statistics 1 / 48
Table of Contents
1 Introduction
2 Graphical Representation of Distributions
3 Measuring the Center
4 Measuring the Spread
5 Normal Distribution
6 Misuse of Statistics
7 Chapter #1 R Assignment
Introduction
Definition
Given a population, one often examines a sample of the population in order to draw inference about the entire population. A variable is a measurable characteristic of individuals within the population. The distribution of a variable describes the values it takes and how often it takes them. Data are a variable’s values from the sample. Statistics is the science of drawing inference from data about the population.
Example
From the 50,000 residents of the town of Milford, 300 were selected randomly and asked for their highest academic degree. The population is the 50,000 residents, the sample is the 300 randomly selected residents and the variable is the level of education of the resident. It was too costly to contact all 50,000 residents, so the actual distribution of terminal degrees among the entire population is inferred from the distribution of the terminal degrees of the 300 randomly sampled residents.
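The sampling idea in the Milford example can be sketched in R. This is a simulation under stated assumptions: the degree categories and their population proportions are invented for illustration (the slides give no actual figures), and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 50,000 residents. The categories and proportions
# are made up for illustration, not taken from real data on Milford.
degrees <- c("None", "High school", "Bachelor", "Graduate")
population <- sample(degrees, size = 50000, replace = TRUE,
                     prob = c(0.10, 0.45, 0.30, 0.15))

# Draw a simple random sample of 300 residents (without replacement).
resident_sample <- sample(population, size = 300)

# Compare the (normally unknown) population distribution with the sample's.
pop_dist  <- prop.table(table(factor(population, levels = degrees)))
samp_dist <- prop.table(table(factor(resident_sample, levels = degrees)))
print(round(rbind(population = pop_dist, sample = samp_dist), 3))
```

In practice only the second row would be available; the first row is shown here only because the population was simulated.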
Origins of statistics: anecdotes and noticing patterns in random happenings.
“Data. Data. Data. I can’t make bricks without clay.” – Sherlock Holmes
“In God we trust. All others must bring data.” - W. Edwards Deming
Definition (Types of Variables)
qualitative (categorical): descriptive
Examples: color of eyes, gender, city born in.
quantitative: numeric
Examples: height, miles per gallon, temperature, etc.
Definition (Types of Quantitative Variables)
discrete: discrete range
Examples: number of children someone has, number of coins in pocket.
continuous: continuous range
Examples: weight, speed.
Graphical Representation of Distributions
Distribution of a Variable
To examine a single variable, we graphically display its distribution.
The distribution of a variable tells us what values it takes and how often it takes these values.
Distributions can be displayed using a variety of graphical tools. The proper choice of graph depends on the nature of the variable.
Categorical variable: pie chart, bar graph
Quantitative variable: histogram, stemplot
Categorical Variables
The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into that category.
Pie Charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.
Bar Graphs represent each category as a bar whose height shows the category count or percent.
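Before drawing either chart, the counts and percents themselves have to be tabulated. A minimal sketch, assuming a small made-up vector of raw responses (the values below are hypothetical, not from the slides):

```r
# Hypothetical raw responses for a categorical variable.
favorite <- c("Red", "Blue", "Red", "Green", "Blue",
              "Red", "Brown", "Red", "Green", "Blue")

counts   <- table(favorite)             # count of individuals per category
percents <- 100 * prop.table(counts)    # same distribution as percents

print(counts)
print(percents)
```

Either `counts` or `percents` can then be passed to `pie()` or `barplot()`.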
> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")
> pie(pie.sales, labels = lbls, main="Pie Sales")
[Figure: pie chart titled "Pie Sales" with slices Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream]
> counts=c(40,30,20,10)
> colors=c("Red","Blue","Green","Brown")
> barplot(counts,names.arg=colors,main="Favorite Colors")
[Figure: bar graph titled "Favorite Colors" with bars for Red, Blue, Green, Brown; vertical axis from 0 to 40]
Quantitative Variables
The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.
Histograms
Use histograms for quantitative variables that take many values and/or for large datasets.
Divide the possible values into classes (equal widths).
Count how many observations fall into each interval (may change to percents).
Draw a picture representing the distribution: bar heights are equivalent to the number (percent) of observations in each interval.
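The class-and-count steps can be carried out by hand before any picture is drawn. The sketch below uses the `trees` dataset that ships with R; the class boundaries 8, 10, ..., 22 are an assumption chosen here for illustration.

```r
girth <- trees$Girth

# Step 1: equal-width classes (8,10], (10,12], ..., (20,22].
breaks  <- seq(8, 22, by = 2)
classes <- cut(girth, breaks = breaks, right = TRUE)

# Step 2: count the observations in each class, then convert to percents.
counts   <- table(classes)
percents <- round(100 * counts / length(girth), 1)

print(counts)
print(percents)
```

These counts are the bar heights that `hist(girth, breaks = breaks)` would draw.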
> hist(trees$Girth,main="Girth of Black Cherry Trees",xlab="Diameter in Inches")
[Figure: histogram titled "Girth of Black Cherry Trees"; x-axis: Diameter in Inches (8 to 22); y-axis: Frequency (0 to 12)]
Stemplots
To construct a stemplot:
Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).
Write the stems in a vertical column; draw a vertical line to the right of the stems.
Write each leaf in the row to the right of its stem; order leaves if desired.
> Girth=trees$Girth
> stem(Girth) # stem and leaf plot
The decimal point is at the |
8 | 368
10 | 57800123447
12 | 099378
14 | 025
16 | 03359
18 | 00
20 | 6
> stem(Girth, scale=2)
The decimal point is at the |
8 | 368
9 |
10 | 578
11 | 00123447
12 | 099
13 | 378
14 | 025
15 |
16 | 03
17 | 359
18 | 00
19 |
20 | 6
Examining Distributions
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern by its shape, center, and spread.
An important kind of deviation is an outlier, an individual that falls outside the overall pattern.
A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.
[Figure: three example distributions labeled Symmetric, Skewed-left, Skewed-right]
Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
[Figure: distribution with Alaska and Florida marked as outliers]
The overall pattern is fairly symmetrical except for two states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
Measuring the Center
Measures of the Center
Definition
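Given $x_1, x_2, \ldots, x_n$, the sample mean is
$$\bar{x} \overset{\text{def}}{=} \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{j=1}^{n} x_j.$$
The population mean is
$$\mu \overset{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N} x_j.$$
If one orders the data from smallest to largest, the median is
$$M \overset{\text{def}}{=} \begin{cases} \text{middle value of data} & \text{if } n \text{ is odd} \\ \text{the average of the middle two values of data} & \text{if } n \text{ is even.} \end{cases}$$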
Laymen refer to the mean as the average.
Example
The median sales price of a house in Milford was $212,175 for Feb–Apr 2013. If Bill Gates buys a house in Milford for $100 million, what will that do to the mean cost of a house in Milford? To the median cost of a house in Milford? What is a better measure of the cost of buying a house in Milford, the mean or the median?
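A small numerical experiment makes the contrast concrete. The seven sale prices below are hypothetical, chosen only so that their median equals the $212,175 quoted above; the eighth value is the $100 million purchase.

```r
# Hypothetical sale prices (illustrative only), with median 212,175.
prices <- c(180000, 195000, 205000, 212175, 220000, 240000, 260000)

# Add one extreme purchase of $100 million.
with_mansion <- c(prices, 1e8)

# The mean jumps by a factor of more than fifty; the median barely moves.
print(c(mean_before = mean(prices), mean_after = mean(with_mansion)))
print(c(median_before = median(prices), median_after = median(with_mansion)))
```

The median is said to be resistant to extreme observations, while the mean is not.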
“Statistically, if you lie with your head in the oven and your feet in the fridge, on average you will be comfortably warm.” – Anonymous
“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates
“The average human has one breast and one testicle.” – humorist Des McHale
Comparing Mean and Median
The mean and median measure center in different ways, and both are useful.
The mean and median of a roughly symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
Measuring the Spread
Measuring Spread: The Quartiles
A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.
How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
• Arrange the observations in increasing order and locate the median M.
• The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
• The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as: IQR = Q3 – Q1.
Definition (The 1.5 × IQR Rule for Outliers)
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
Example
The number of items of mail that eleven professors, chosen at random, received on September 3rd, 2013 is given below.
18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7.
Are there any outlier professors?
Solution: We figure out the quartiles from the ordered data:
3, 5, 7, 9, 11, 13, 15, 16, 18, 23, 35, so Q1 = 7, M = 13 and Q3 = 18.
Since IQR = 18 − 7 = 11 and 35 − Q3 = 17 > 1.5 × IQR = 16.5, one identifies 35 as an outlier.
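The same calculation can be sketched in R. The helper `quartiles_by_halves` is a hypothetical name introduced here; it follows the median-of-halves definition from these slides. R's built-in `quantile()` and `IQR()` use a different interpolation rule by default, so they can give slightly different quartiles.

```r
mail <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)

# Quartiles as medians of the lower and upper halves of the ordered data
# (the middle observation is left out of both halves when n is odd).
quartiles_by_halves <- function(x) {
  stopifnot(length(x) >= 2)
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)
  c(Q1 = median(x[seq_len(half)]),
    M  = median(x),
    Q3 = median(x[(n - half + 1):n]))
}

q   <- quartiles_by_halves(mail)
iqr <- q[["Q3"]] - q[["Q1"]]

# 1.5 x IQR rule: flag anything beyond the fences.
lower_fence <- q[["Q1"]] - 1.5 * iqr
upper_fence <- q[["Q3"]] + 1.5 * iqr
outliers <- mail[mail < lower_fence | mail > upper_fence]

print(q)
print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```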
Definition
Given x1, x2, · · · , xn, the five number summary is
minimum, Q1, M, Q3, maximum.
Definition
Given x1, x2, · · · , xn, to create a boxplot (also called a box and whiskers plot)
1 draw and label a vertical number line that includes the range of the distribution.
2 draw a box from height Q1 to Q3.
3 draw a horizontal line inside the box at the height of the median.
4 draw vertical line segments (whiskers) from the bottom and top of the box to the minimum and maximum data values that are not outliers.
5 sometimes outliers are identified with ◦’s (R does this).
Boxplots are often useful when comparing the values of two different variables.
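The numbers behind a boxplot can be inspected without drawing it. The sketch below reuses the professors' mail data from the earlier example. Note that R's `fivenum()` and `boxplot.stats()` work with Tukey's hinges, which need not coincide with the quartiles as defined in these slides.

```r
mail <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)

# Five number summary (minimum, lower hinge, median, upper hinge, maximum).
five <- fivenum(mail)

# What boxplot() would draw: whisker ends, box edges, median, and outliers.
bp <- boxplot.stats(mail)

print(five)
print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$out)    # points drawn individually as outliers
```

Here the upper whisker stops at 23, the largest value that is not flagged, and 35 is reported separately.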
> boxplot(trees$Height, main="Heights of Black Cherry Trees")
> boxplot(USJudgeRatings$DMNR,USJudgeRatings$DILG,
+ main="Lawyers’ Demeanor/Diligence ratings of US Superior Court state judges")
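[Figure: boxplot titled "Heights of Black Cherry Trees" (vertical axis 65 to 85) and side-by-side boxplots titled "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges" (vertical axis 5 to 9, with two outliers marked)]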
Definition
The population variance is
$$\sigma^2 \overset{\text{def}}{=} \frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2$$
and the population standard deviation is
$$\sigma \overset{\text{def}}{=} \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2}.$$
However, one often has only a random sample to examine, not the entire population. With only a random sample, one cannot calculate the population mean, $\mu$, so the best one can do is use the sample mean, $\bar{x}$, instead.
Definition
The sample variance is
$$s^2 \overset{\text{def}}{=} \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$$
and the sample standard deviation is
$$s \overset{\text{def}}{=} \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2}.$$
Notice the use of n − 1 instead of n for the sample variance and standard deviation.
Properties of the Sample Standard Deviation
1 s measures the amount the data is dispersed about the mean.
2 s ≥ 0 and if s = 0 then all the data values are the same.
3 s has the same units of measurement as the data.
4 s is sensitive to the existence of outliers.
Example
Suppose our random sample is
4.5, 3.7, 2.8, 5.3, 4.6.
Then
$$\bar{x} = \frac{1}{5}\left[4.5 + 3.7 + 2.8 + 5.3 + 4.6\right] = 4.18$$
$$s^2 = \frac{1}{5-1}\left[(4.5 - 4.18)^2 + (3.7 - 4.18)^2 + (2.8 - 4.18)^2 + (5.3 - 4.18)^2 + (4.6 - 4.18)^2\right] = 0.917$$
$$s = \sqrt{0.917} = 0.9576012.$$
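The hand calculation can be checked in R. The sketch computes the sample variance directly from the definition and compares it with the built-in `var()` and `sd()`, which also divide by n − 1; the variable names are chosen here for illustration.

```r
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)
n <- length(x)

x_bar     <- sum(x) / n                      # sample mean
s2_manual <- sum((x - x_bar)^2) / (n - 1)    # sample variance, divisor n - 1
s_manual  <- sqrt(s2_manual)                 # sample standard deviation

# For contrast: dividing by n instead gives a smaller value.
s2_div_n <- sum((x - x_bar)^2) / n

print(c(mean = x_bar, s2 = s2_manual, s = s_manual, s2_div_n = s2_div_n))
print(all.equal(s2_manual, var(x)))   # var() uses the n - 1 divisor
print(all.equal(s_manual, sd(x)))
```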
Choosing Measures of Center and Spread
We now have a choice between two descriptions for center and spread:
Mean and Standard Deviation
Median and Interquartile Range
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
R Commands:
Example
> mean(trees$Volume)
[1] 30.17097
> median(trees$Volume)
[1] 24.2
> summary(trees$Volume)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.20 19.40 24.20 30.17 37.30 77.00
> IQR(trees$Volume)
[1] 17.9
> var(trees$Volume)
[1] 270.2028
> sd(trees$Volume)
[1] 16.43785
Normal Distribution
Density Curves
A density curve is a curve that:
• is always on or above the horizontal axis
• has an area of exactly 1 underneath it
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
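Both properties can be checked numerically. The sketch below uses R's standard Normal density `dnorm` simply as one convenient density curve; any other density function could be substituted.

```r
# Total area under the curve should be 1.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Area above the range [0, 1] is the proportion of observations in that range.
area_0_to_1 <- integrate(dnorm, lower = 0, upper = 1)$value

print(total_area)
print(area_0_to_1)
print(pnorm(1) - pnorm(0))  # same proportion from the cumulative distribution
```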
Our measures of center and spread apply to density curves as well as to actual sets of observations.
Distinguishing the Median and Mean of a Density Curve
• The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.
• The mean of a density curve is the balance point, at which the curve would balance if made of solid material.
• The median and the mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.
Normal Distributions
One particularly important class of density curves is the Normal curves, which describe Normal distributions.
All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean µ and standard deviation σ.
A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ.
The mean of a Normal distribution is the center of the symmetric Normal curve.
The standard deviation is the distance from the center to the change-of-curvature points on either side.
We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ).
The 68-95-99.7 Rule
In the Normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of µ.
Approximately 95% of the observations fall within 2σ of µ.
Approximately 99.7% of the observations fall within 3σ of µ.
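The three percentages can be recovered from R's Normal cumulative distribution function `pnorm`. The second half of the sketch repeats the calculation for N(70, 2.8), picked only as an example, to illustrate that the proportions do not depend on µ and σ.

```r
k <- 1:3

# Proportion of a standard Normal distribution within k standard deviations of 0.
within_k_sd <- pnorm(k) - pnorm(-k)
print(round(100 * within_k_sd, 2))

# Same calculation for a Normal distribution with mu = 70, sigma = 2.8.
mu <- 70
sigma <- 2.8
check <- pnorm(mu + k * sigma, mu, sigma) - pnorm(mu - k * sigma, mu, sigma)
print(all.equal(within_k_sd, check))
```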
Standardizing Observations
All Normal distributions are the same if we measure in units of size σ from the mean µ as center.
If a variable x has a distribution with mean µ and standard deviation σ, then the standardized value of x, or its z-score, is
$$z = \frac{x - \mu}{\sigma}.$$
The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1. That is, the standard Normal distribution is N(0,1).
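A short sketch of standardization, assuming a hypothetical N(100, 15) variable (these parameter values and observations are not from the slides): the z-scores are computed from the formula, and probabilities agree whether computed on the original or the standardized scale.

```r
# Hypothetical Normal distribution and a few observations from its scale.
mu <- 100
sigma <- 15
x <- c(85, 100, 130)

# z-score: distance from the mean in units of standard deviations.
z <- (x - mu) / sigma
print(z)

# P(X <= x) under N(mu, sigma) equals P(Z <= z) under N(0, 1).
print(pnorm(x, mean = mu, sd = sigma))
print(pnorm(z))
```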
The Standard Normal Table
Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table.
Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
Example
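Given $X \sim N(5, 3)$, what is the probability that $4 \le X \le 7$?
Solution: Using the table:
$$\begin{aligned} P(4 \le X \le 7) &= P\left(\frac{4-5}{3} \le \frac{X-5}{3} \le \frac{7-5}{3}\right) = P(-0.33 \le Z \le 0.67) \\ &= P(Z \le 0.67) - P(Z \le -0.33) \\ &= 0.7486 - 0.3707 = 0.3779 \end{aligned}$$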
or
> pnorm(7,5,3) - pnorm(4,5,3)
[1] 0.3780661
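The small gap between the table answer (0.3779) and the R answer (0.3780661) appears to come from rounding the z-scores to two decimals for the table lookup. A sketch that reproduces both numbers:

```r
# Exact z-scores for the endpoints 4 and 7 under N(5, 3).
z_lo <- (4 - 5) / 3
z_hi <- (7 - 5) / 3

# No rounding: matches pnorm(7, 5, 3) - pnorm(4, 5, 3).
exact <- pnorm(z_hi) - pnorm(z_lo)

# Table-style: round the z-scores to two decimals first, as a printed table requires.
table_style <- pnorm(round(z_hi, 2)) - pnorm(round(z_lo, 2))

print(exact)
print(round(table_style, 4))
```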
Example
According to the National Health and Nutrition Examination Study 1976–1980, the heights (in inches) of adult men aged 18–24 are N(70, 2.8). What is the tallest a man aged 18–24 can be and still be in the bottom 10% of all such men by height?
Solution: Using the table:
$$0.1 = P(X \le x) = P\left(\frac{X - 70}{2.8} \le \frac{x - 70}{2.8}\right) = P\left(Z \le \frac{x - 70}{2.8}\right).$$
Using reverse table lookup one has
$$-1.28 = \frac{x - 70}{2.8} \Rightarrow x = 66.416.$$
Or, using R:
> qnorm(0.1,70,2.8)
[1] 66.41166
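Reverse table lookup corresponds to R's quantile function `qnorm`. The sketch below finds the standard Normal z-score for the bottom 10% and unstandardizes it by solving z = (x − 70)/2.8 for x; it also shows where the table answer 66.416 comes from.

```r
# z-score with 10% of the standard Normal distribution to its left.
z <- qnorm(0.1)

# Unstandardize: x = mu + sigma * z.
x_exact <- 70 + 2.8 * z

# Table-style answer: z rounded to two decimals, as read from a printed table.
x_table <- 70 + 2.8 * round(z, 2)

print(z)
print(x_exact)
print(x_table)
```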
Normal Quantile Plots
One way to assess if a distribution is indeed approximately normal is to plot the data on a normal quantile plot.
The data points are ranked and the percentile ranks are converted to z-scores with Table A. The z-scores are then used for the x axis against which the data are plotted on the y axis of the normal quantile plot.
If the distribution is indeed normal the plot will show a straight line, indicating a good match between the data and a normal distribution.
Systematic deviations from a straight line indicate a non-normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
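The construction can be sketched by hand in R using the `trees` dataset. The percentile ranks here come from `ppoints()`, which is R's particular choice of plotting positions and an assumption relative to the slide's description; `qnorm()` plays the role of Table A.

```r
# Step 1: rank the data.
girth_sorted <- sort(trees$Girth)
n <- length(girth_sorted)

# Step 2: percentile ranks for the ordered observations.
probs <- ppoints(n)

# Step 3: convert the percentile ranks to z-scores.
theoretical <- qnorm(probs)

# The same coordinates as computed by R's built-in function (no plot drawn).
qq <- qqnorm(trees$Girth, plot.it = FALSE)
print(all.equal(sort(qq$x), theoretical))
print(all.equal(sort(qq$y), girth_sorted))
```

Calling `plot(theoretical, girth_sorted)` would then draw the normal quantile plot from these coordinates.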
Normal quantile plots are complex to do by hand, but they are standard features in most statistical software.
[Figure: two example normal quantile plots, one for rainwater pH values and one for survival times]
Good fit to a straight line: the distribution of rainwater pH values is close to normal.
Curved pattern: the data are not normally distributed. Instead, they show a right skew: a few individuals have particularly long survival times.
R commands:
> dat=rnorm(500,4,3)
> qqnorm(dat); qqline(dat, col="red")
> qqnorm(trees$Girth); qqline(trees$Girth, col="red")
[Figure: two plots titled "Normal Q-Q Plot" (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles), one for the simulated data dat and one for trees$Girth, each with a red reference line]
Misuse of Statistics
“He uses statistics as a drunken man uses lamp-posts . . . for support rather than illumination.” – Andrew Lang (1844–1912)
“Figures fool when fools figure.” – Oliver Lancaster, professor of mathematical statistics at Sydney University
“There are three kinds of lies: lies, damned lies, and statistics.” – Mark Twain
“Facts are stubborn, but statistics are more pliable.” – Mark Twain
“Torture numbers, and they’ll confess to anything.” – Gregg Easterbrook
Daily Cal: Forty percent of workforce is women, yet only thirty percent of women on tv series work.
Chapter #1 R Assignment
Fifty-eight sailors are sampled and their eye color is noted as below:
blue   brown   green   hazel   red
11     32      8       5       2
1 Create a barplot and pie chart of eye color from the sailor sample.
2 Create a histogram and a stemplot of the height of loblolly trees from the dataset “Loblolly”. The dataset “Loblolly” comes with R, just as “trees” does. To observe “Loblolly”, type “Loblolly” at the R prompt (without the quotes). To learn more about the dataset, type “help(Loblolly)” at the R prompt.
3 Find the mean, median, five number summary, variance and standard deviation from the sample of heights in the dataset “Loblolly”.
4 If X ∼ N(2, 3), find P(1.3 ≤ X ≤ 5.8).
5 Create a Normal Quantile Plot of the height of loblolly trees from the dataset “Loblolly” and decide if the heights came from a normal distribution.