
Introduction to Statistical Data Analysis

James V. Lambers

Department of Mathematics, The University of Southern Mississippi

August 18, 2016


Introduction

This course is an introduction to statistical data analysis.

The purpose of the course is to acquaint students with fundamental techniques for gathering data, describing data sets, and most importantly, drawing conclusions based on data.

Topics that will be covered include probability, probability distributions, sampling, confidence intervals, hypothesis testing, correlation, and regression.

Notes and other materials are posted at

http://www.math.usm.edu/lambers/sda


The R Project

To illustrate and work with concepts and techniques presented in this course, we will use a software tool known as R, which provides a programming environment for statistical computing and graphics. It is freely available for download from the site

http://www.r-project.org/

Throughout this course, as concepts are presented, relevant R functions and sample code will be given.


Descriptive Statistics

The purpose of descriptive statistics is to summarize and display data in such a way that it can readily be interpreted. Examples of descriptive statistics are as follows:

• The average, or mean, is a convenient way of describing a set of many numbers with just a single number.

• A chart is useful for organizing and summarizing data in meaningful ways.


Inferential Statistics

The other, much more sophisticated branch of statistics is inferential statistics, which is used to make actual claims about an entire (large) population based on a (relatively small) sample of data.

Related topics:

• Confidence intervals

• Hypothesis testing

• Goodness-of-fit tests

• Correlation and regression


Example

For example, suppose that a pollster wanted to determine the percentage of all registered voters in California that would support a certain ballot measure.

It would not be practical to question the entire population consisting of all of these voters, as there are millions of them.

Instead, the pollster would question a sample consisting of a reasonable number of these voters (such as, for example, 200 voters), and then use inferential statistics to draw a conclusion about the voting preference of the entire population based on the data obtained from the sample.


The Distinction

The essential difference between descriptive and inferential statistics lies in the size of the population about which conclusions are being made.

In descriptive statistics, conclusions are made about a relatively small population based on direct observations of every member of that population.

In inferential statistics, conclusions are made about a relatively large population based on descriptive statistics applied to a small sample from that population.


Ethics in Statistics

The example of inferential statistics given above, concerning a pollster, can be expanded to illustrate important aspects of ethics in statistics.

In order to draw sound conclusions about a large population, it is essential that a sample of that population be representative of that population; otherwise, the sample is said to be biased.


1936 Presidential Election

A famous example of such bias occurred during the presidential election of 1936, in which a poll of a sample of voters was conducted in order to determine whether the majority would vote for Franklin D. Roosevelt, the Democratic candidate, or Alf Landon, the Republican candidate.

The conclusion made from the poll was that Landon would win the election, when in fact Roosevelt won.


Where Did They Go Wrong?

The reason why the poll yielded an incorrect conclusion was that telephone directories were used to obtain voter names, and in 1936, telephones existed primarily in more affluent households, which tended to vote Republican.

That is, the method of polling led to an unintentional bias.

In some cases, unfortunately, a sample can be biased intentionally, in order to make a false conclusion that supports one’s agenda.


Frequency Distributions

A frequency distribution is a table that lists specific intervals, called classes, along with the number of data observations that fall into each class.

The number of observations belonging to a particular class is called a frequency.


Example

Suppose that a survey of 100 voters is taken, in which the age of each respondent is recorded. The ages of the respondents are

48 55 73 54 36 82 30 37 63 50
25 64 48 84 34 18 69 72 66 64
60 47 24 63 65 50 51 31 63 72
51 75 37 85 77 48 29 38 84 43
67 68 29 35 42 50 42 24 33 64
67 86 38 65 73 72 61 58 68 47
63 55 49 38 65 41 31 66 35 77
20 41 55 65 18 73 70 56 26 76
23 25 50 67 60 51 35 48 61 36
40 61 79 23 45 21 82 63 50 61


Example, cont’d

Since voters must be at least 18 years of age, classes could be chosen as follows: 18-27, 28-37, and so on, up to 78-87, since the maximum age among all respondents is 86. Then, the frequency distribution is

Age Range   Number of Respondents
18-27       11
28-37       14
38-47       12
48-57       18
58-67       24
68-77       14
78-87        7

Frequency distribution of ages of 100 voters surveyed


Frequency Distributions in R

Suppose that the 100 ages from the preceding example are stored in a text file, called ages.txt, as a simple list of numbers separated by spaces. To create this frequency distribution in R, the following commands can be used:

> ages=scan("ages.txt")

> breaks = seq(min(ages),max(ages)+10,by=10)

> freq = table(cut(ages,breaks,right=FALSE))

> freq

[18,28) [28,38) [38,48) [48,58) [58,68) [68,78) [78,88)

11 14 12 18 24 14 7


Class Selection

In determining the classes for a frequency distribution, the following guidelines should be observed:

• All classes should be of equal size, so that the number of observations in each class can be compared in a meaningful way.

• There should be between 5 and 15 classes. Using too few classes fails to give a sense of the distribution of observations, and having too many classes makes comparing classes less useful.

• Classes should not be “open-ended”, if possible. For example, if observations are ages, there should not be a class of “over age 50”.

• Classes should be exhaustive, so that all data observations can be included.

Note that the frequency distribution in the preceding example follows these guidelines; had classes spanned 20 years instead of 10, there would have been too few, as the example below illustrates.
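For instance, grouping the same ages into hypothetical 20-year classes (the break points below are illustrative, not part of the original example) leaves only four classes:

> table(cut(ages, seq(18, 98, by = 20), right = FALSE))

[18,38) [38,58) [58,78) [78,98)
     25      30      38       7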


Measures of Central Tendency

It is highly desirable to be able to characterize a data set using a single value.

Suppose that a data set consists of numerical values, and that the observations are plotted as points on the real number line.

Then, a number that is at the “center” of these points can serve as such a characterizing value.

This value is called a measure of central tendency.


Mean

Given a set of n numerical observations $\{x_1, x_2, \ldots, x_n\}$ of a population, the mean of the set is

\[ \mu = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]

When the observations are drawn from a sample, rather than an entire population, then the mean is denoted by $\bar{x}$:

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]

The mean can be defined more concisely using sigma notation:

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i. \]


The Mean in R

To compute the mean of a data set in R, the mean function can be used.

For example, with the age data used in the previous example, we have:

> mean(ages)

[1] 52.55


Weighted Mean

In some instances, a measure of central tendency needs to be computed from the values in a data set, in which some values should be assigned more weight than others.

This leads to the notion of a weighted mean

\[ \mu = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}. \]

The weights must all be positive.


Example

Suppose that an overall course grade is computed by weighting a homework average h by 10%, two test grades t_1 and t_2 by 25% each, and a final exam f by 40%.

Then the overall grade is

\[ \frac{10h + 25t_1 + 25t_2 + 40f}{10 + 25 + 25 + 40}. \]


Weighted Mean in R

To compute a weighted mean in R, the weighted.mean function can be used.

The first argument is a vector of observations, and the second argument is a vector of weights.

For example, suppose the homework average is 80, the test scores are 75 and 85, and the final exam score is 90. Then, the weighted mean is

> grades <- c(80,75,85,90)

> weighted.mean(grades,c(10,25,25,40))

[1] 84


Mean of Grouped Data

When data observations are summarized in a frequency distribution, an approximation of their mean can readily be obtained.

Suppose that the frequency distribution has n classes, with frequencies f_1, f_2, ..., f_n.

Furthermore, suppose that the ith class has a representative value c_i; for example, it could be the average of the lower and upper bounds of the class.


Approximating the Mean

Then an approximation of the mean is

\[ \mu = \frac{\sum_{i=1}^{n} c_i f_i}{\sum_{i=1}^{n} f_i}. \]

It follows that if each class contains only a single value, then this approximate mean is given by a weighted mean of these values, in which the frequencies are the weights.


Example

Consider the frequency distribution of age data given earlier. The classes are age ranges 18-27, 28-37, and so on.

If we average the upper and lower bounds of each class, we obtain representative values of the classes.

In R, this can be accomplished using the following statements, and the breaks variable that was defined earlier.

> breaks

[1] 18 28 38 48 58 68 78 88

> class_midpoints=(breaks[1:7]+(breaks[2:8]-1))/2

> class_midpoints

[1] 22.5 32.5 42.5 52.5 62.5 72.5 82.5


Vectors in R

Note that components of a vector are accessed using indices enclosed in square brackets, and that the first component of each vector has the index 1.

Also, a contiguous portion of a vector can be extracted by specifying a range of indices with a colon.

For example, breaks[1:7] is a vector consisting of the first 7 elements, numbered 1 through 7, of breaks.
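For instance, using the breaks vector defined above:

> breaks[2:4]

[1] 28 38 48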


Example, cont’d

Now, an approximate mean can be computed using the grouped-data formula given earlier:

> sum(class_midpoints*freq)/sum(freq)

[1] 52.5

Note that this approximation is very close to the actual mean of 52.55.

Also, note that vectors of the same length can be multiplied; the result is a vector of products of corresponding components of the vectors.

Then, sum can be used to compute the sum of all of the components of a vector.


Median

The median of a data set is, informally, the value such that half of the values in the set are less than the median, and half are greater than the median.

Specifically, if the number n of observations in the set is odd, then the median is the middle value of the set, at position (n + 1)/2, if the values are sorted.

If n is even, then the median is defined to be the average of the values at positions n/2 and n/2 + 1.

The median function in R can be used to compute the median of a vector of observations. For example, using the age data, we have

> median(ages)

[1] 52.5


Choosing a Measure

Finally, the mode of a data set is the value that occurs most often within the set. It is possible for a data set to have more than one mode.

Given these three measures of central tendency, it is natural to ask which one should be used.

The mean can be skewed if the data set contains outliers, thus making it an unreliable measure.

The median, on the other hand, is not susceptible to such bias.

Finally, the mode is not often used, except with nominal data, which cannot be compared or added anyway.
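R has no built-in function for the mode of a data set; a minimal sketch using table is shown below (the variable name age_table is illustrative). For the age data, two values are tied for most frequent:

> age_table <- table(ages)

> as.numeric(names(age_table)[age_table == max(age_table)])

[1] 50 63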


Measures of Dispersion

A measure of central tendency is quite limited in its ability to describe a data set.

For example, the values may be clustered closely around the mean or median, or they may be widely spread out.

As such, we can use a measure of dispersion that describes how far individual data values deviate from a measure of central tendency.


Range

The range of a set of data observations is simply the difference between the largest and smallest values.

This measure of dispersion has the advantage that it is very easy to compute.

However, it uses very little of the data, and is unduly influenced by outliers.

The range function in R returns the smallest and largest values of a set of observations, from which the range can be obtained.

> range(ages)

[1] 18 86


Population Variance

The variance of a population, denoted by σ², is obtained from the deviation of each observation from the mean:

\[ \sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2. \]

An equivalent formula, that is less tedious for larger populations, is

\[ \sigma^2 = \left( \frac{1}{N} \sum_{j=1}^{N} x_j^2 \right) - \mu^2. \]
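R's var function computes the sample variance (discussed next), so the population variance must be computed directly; a minimal sketch for the age data:

> N <- length(ages)

> sum((ages - mean(ages))^2) / N

[1] 321.7875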


Sample Variance

The formula for the variance of a sample, denoted by s², is slightly different:

\[ s^2 = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2. \]

The division by (N − 1) instead of N is intended to compensate for the tendency of the sample variance, when dividing by N, to underestimate the population variance.

The var function in R computes the sample variance of a vector of observations that is given as an argument.


Standard Deviation

For both a population and a sample, the standard deviation is the square root of the variance. That is, the standard deviation of a population is

\[ \sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2}, \]

whereas for a sample, we have

\[ s = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2}. \]

An advantage of the standard deviation over the variance, as a measure of dispersion, is that the standard deviation is measured using the same units as the original data.


Standard Deviation in R

The sd function in R computes the sample standard deviation of a given vector of observations. For example, from the age data, we obtain

> var(ages)

[1] 325.0379

> sd(ages)

[1] 18.02881


Standard Deviation of Grouped Data

For grouped data in a relative frequency distribution, with n classes, class values c_j (for example, the midpoint of the values in the class), and relative frequencies f_j, j = 1, 2, ..., n, the population standard deviation can be computed as follows:

\[ \sigma = \sqrt{\left( \sum_{j=1}^{n} c_j^2 f_j \right) - \mu^2}. \]
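For the grouped age data, this can be evaluated in R using the class_midpoints and freq vectors defined earlier; dividing by sum(freq) converts the raw frequencies into relative frequencies. This is a sketch of the computation, not a built-in function:

> mu <- sum(class_midpoints * freq) / sum(freq)

> sqrt(sum(class_midpoints^2 * freq) / sum(freq) - mu^2)

[1] 17.60682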


Empirical Rule

The empirical rule states that if the distribution of a set of observations is “bell-shaped”, meaning that the distribution is symmetric around the mean and decreases toward zero away from the mean, then approximately 68, 95, and 99.7% of the observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
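As a rough check against the age data (using the sample standard deviation computed earlier), the proportion of ages within one standard deviation of the mean can be found as follows; the result of 0.65 is reasonably close to 68%, even though the age distribution is only roughly bell-shaped:

> mean(abs(ages - mean(ages)) <= sd(ages))

[1] 0.65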


Chebyshev’s Theorem

Another rule of thumb, that applies even to distributions that are not bell-shaped or symmetric, is Chebyshev’s Theorem, which states that if k > 1, then at least

\[ \left( 1 - \frac{1}{k^2} \right) \times 100\% \]

of the observations fall within k standard deviations of the mean.
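For example, with k = 2, Chebyshev’s Theorem guarantees that at least 75% of the observations lie within two standard deviations of the mean. For the age data this can be checked directly; in fact, all of the ages fall within that interval:

> mean(abs(ages - mean(ages)) <= 2 * sd(ages))

[1] 1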


Events

Informally, probability is the likelihood that a particular event will occur.

To be able to compute probabilities, though, we need precise definitions of the concepts included in this informal definition.

• An experiment is a process of measuring or observing an activity for the purpose of collecting data.

• An outcome is a result of an experiment.

• A sample space is the set of all possible outcomes of an experiment.

• An event is an outcome, or a set of outcomes, of interest. Mathematically, an event is a subset of the sample space.


Definition of Probability

Classical probability is the number of outcomes contained in an event, relative to the size of the sample space.

That is, if E is an event, and S is the sample space, then the probability of E, denoted by P(E), is defined by

\[ P(E) = \frac{|E|}{|S|}, \]

where, for any set A, |A| denotes the cardinality of A, which is simply the number of elements contained in A.


Example

Consider the result of rolling a single six-sided die, which is an experiment.

The outcome is the number showing on the die after it is rolled.

The sample space is the set S = {1, 2, 3, 4, 5, 6}, which contains all possible results of the die roll.

Examples of events would be “rolling a 6”, which is the set {6}, or “rolling an odd number”, which is the set {1, 3, 5}.

If E is the event “rolling a number higher than 4”, which is the set {5, 6}, then

\[ P(E) = \frac{|E|}{|S|} = \frac{2}{6} = \frac{1}{3}. \]


Properties of Probability

Regardless of the type of probability that is being measured, there are certain properties that the probability of an event E must satisfy.

1. P(E) = 1 if the event E is certain to occur.

2. P(E) = 0 if it is certain that E will not occur.

3. P(E) must satisfy 0 ≤ P(E) ≤ 1.

4. If E_1, E_2, ..., E_n are mutually exclusive events, meaning that no two of these events can occur simultaneously, then

\[ P(E_1 \cup E_2 \cup \cdots \cup E_n) = P(E_1) + P(E_2) + \cdots + P(E_n) = \sum_{i=1}^{n} P(E_i). \]


Complementary Probability

A consequence of the first and fourth properties is that if we denote by E′ the complement of an event E, which consists of all outcomes in the sample space that are not contained in E, then

\[ P(E') = 1 - P(E), \]

because either E or E′ is certain to occur, due to all outcomes in the sample space belonging to one event or the other, but not both.


Example

Let E be the event that the sun is going to rise tomorrow. As the often-used quintessential certainty, it is safe to say that P(E) = 1.

After limited experimentation, I believe it is equally safe to say that if L is the event in which I will ever choose winning lottery numbers, then P(L) = 0, and this will certainly be the case if I make the wise choice to give up on playing.

There is no circumstance under which an event can have a negative probability, or a probability greater than 1.

If A is the event that a student earns an A in a particular course, and B is the event that they earn a B, and so on, then these events are mutually exclusive, since the student can only be assigned one grade. Therefore,

P(A ∪ B ∪ C ∪ D ∪ F) = P(A) + P(B) + P(C) + P(D) + P(F).


Simple and Conditional Probability

Simple probability, also known as prior probability, is probability that is determined solely from the number of observations of an experiment.

On the other hand, conditional probability, also known as posterior probability, is the probability that an event A will occur, given that another event B has already occurred. It is denoted by P(A|B); some sources use the notation P(A/B).

One can think of conditional probability as using a reduced sample space. When measuring P(A|B), one is not considering the whole of the sample space from which A and B originate; instead, one is only considering the subset B of that sample space, and then determining how many elements of that subset also belong to A.


Independent Events

Informally, two events A and B are said to be independent if neither one is influenced by the other.

Mathematically, we say that A is independent of B if

P(A|B) = P(A).


Example

Let A be the event that John is late for work, and B be the event that Jane, who has no connection to John whatsoever and in fact lives and works in a different city from John, is late for work.

These two events are independent, so P(A|B) = P(A).

On the other hand, suppose John drives to work and that C is the event that there is a major traffic jam in his city.

This event, if it occurs, could cause him to be late for work, so the probability of A is influenced by C.

That is, P(A) is not the same as P(A|C). On the other hand, B and C are independent, so P(B|C) = P(B).


Intersection of Events

Let A and B be two events. Then, the intersection of A and B, denoted by A ∩ B, is the event consisting of all outcomes that belong to both A and B; its probability, P(A ∩ B), is called the joint probability of A and B.

Since events are defined to be subsets of the sample space, the intersection of events is simply the intersection of the corresponding sets.


Contingency Tables

Joint probabilities arise in contingency tables, which list the number of outcomes that correspond to each possible pairing of results of two experiments.

In a contingency table, each row corresponds to a value of one variable (that is, one possible result of an experiment), and each column corresponds to a value of a second variable.

Then, the entry in row i, column j of the table is the number of outcomes corresponding to the ith value of the first variable and the jth value of the second.


Example

Based on a survey of 100 adults, the following contingency table lists the number of respondents for each combination of values of two variables, which are gender and choice of smartphone purchase; dividing each entry by 100 gives the corresponding joint probability.

Gender   iPhone   Samsung   Neither   Total
Male         16        18        14      48
Female       20        16        16      52
Total        36        34        30     100

From this table, it can be seen that if one respondent is randomly chosen from those surveyed, and if M is the event that the respondent is male, and I is the event that the respondent owns an iPhone, then P(M ∩ I) = 16/100 = 0.16, whereas P(M) = 0.48 and P(I) = 0.36.
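A sketch of how this table could be entered and converted to joint probabilities in R (the variable name counts is illustrative):

> counts <- matrix(c(16,18,14,20,16,16), nrow=2, byrow=TRUE,
+                  dimnames=list(c("Male","Female"), c("iPhone","Samsung","Neither")))

> counts / sum(counts)

       iPhone Samsung Neither
Male     0.16    0.18    0.14
Female   0.20    0.16    0.16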


Multiplication Rule

Using the concept of intersection of events, we can now give a simple formula for conditional probability, based on the definition given earlier:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)}. \]

Combining this formula with the definition of independent events, it follows that if A and B are independent events, then

\[ P(A \cap B) = P(A)P(B). \]

This formula is called the multiplication rule for independent events. If the events A and B are dependent, then the multiplication rule takes a different form:

\[ P(A \cap B) = P(A|B)P(B). \]


Having the Same Birthday

We will use the multiplication rule to compute the probability that out of 23 people, at least 2 of them have the same birthday.

For simplicity, we work with a 365-day year. First, we note that the probability that two people have different birthdays is 364/365, because once the first person’s birthday is known, the second person’s birthday can fall on any one of the other 364 days.

Then, given that the first two people have different birthdays, the probability that the third person has a different birthday is 363/365.


Using the Multiplication Rule

Continuing this process, if we let A_i be the event that the ith person has a different birthday than the first i − 1 people, the probability that all 23 people have different birthdays is

\[ P(A_2 \cap A_3 \cap \cdots \cap A_{23}) = P(A_2)\, P(A_3|A_2)\, P(A_4|A_2 \cap A_3) \cdots P(A_{23}|A_2 \cap A_3 \cap \cdots \cap A_{22}) = \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{343}{365} \approx 0.493. \]

Therefore, the probability that at least two of the 23 people have the same birthday is approximately 1 − 0.493 = 0.507. That is, there is slightly better than a 50% chance that at least two of them have the same birthday.
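This product can be evaluated directly in R:

> 1 - prod((365 - 1:22) / 365)

[1] 0.5072972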


Example

As before, let M be the event that a randomly chosen respondent is male, and let I be the event that they own an iPhone. Then

\[ P(M|I) = \frac{P(M \cap I)}{P(I)} = \frac{0.16}{0.36} \approx 0.44. \]

This can also be seen by considering only the column of the table that corresponds to iPhone owners: there are 36 respondents who are iPhone owners, and 16 of those are male, so based on that, P(M|I) = 16/36 ≈ 0.44.


Example, cont’d

The table can be used to determine whether the events M and I are independent. We know that P(M ∩ I) = 0.16. From the totals of the first row and first column of the table, we have P(M) = 0.48 and P(I) = 0.36. However, because

\[ P(M)P(I) = (0.48)(0.36) = 0.1728 \neq P(M \cap I), \]

we conclude that these events are dependent.

On the other hand, suppose two six-sided dice are rolled. The number shown on each die is independent of the other, and since the probability of either die roll being a 6 is 1/6, we can conclude that the probability of rolling double sixes is (1/6)(1/6) = 1/36.
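This can be illustrated with a small simulation in R (the approach and sample size are illustrative; the estimate will vary from run to run, but it should be close to 1/36 ≈ 0.0278):

> rolls <- matrix(sample(1:6, 2 * 100000, replace = TRUE), ncol = 2)

> mean(rolls[,1] == 6 & rolls[,2] == 6)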


Interpretation of Conditional Probability

To reinforce the notion that conditional probability is the probability of an event with respect to a reduced sample space, we note that if S is the original sample space, then

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{|A \cap B|/|S|}{|B|/|S|} = \frac{|A \cap B|}{|B|}. \]

That is, P(A|B) is obtained by restricting the sample space to all outcomes in B.


Mutually Exclusive Events

Two events A and B are said to be mutually exclusive if it is not possible for A and B to occur simultaneously.

In set notation, we say that A and B are disjoint, or that A ∩ B = ∅.

Since there are no outcomes that belong to both A and B, it follows that for mutually exclusive events A and B,

P(A ∩ B) = 0.


Union of Events

The union of two events A and B is the event consisting of all outcomes that belong to either A or B (and possibly both; the “or” is inclusive).

Using set notation again, we denote this event by A ∪ B.


Addition Rule

If two events A and B are mutually exclusive, then, from one of the properties of probability stated earlier, it follows that

\[ P(A \cup B) = P(A) + P(B). \]

On the other hand, if A and B are not mutually exclusive, then the above formula does not hold, because outcomes that are in both A and B end up being counted twice.

Therefore, we need to correct the formula as follows:

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \]


Example

Consider the act of drawing a single card from a standard 52-card deck.

Let A be the event that the card drawn is a spade, let B be the event that the card drawn is a heart, and let C be the event that the card drawn is a face card (jack, queen, or king).

Then, the events A and B are mutually exclusive, but the events A and C are not, because it is possible to draw a jack, queen, or king of spades.


Example, cont’d

From

P(A) = P(B) = 1/4, P(C) = 3/13, P(A ∩ C) = 3/52,

we obtain

P(A ∪ B) = P(A) + P(B) = 1/4 + 1/4 = 1/2,

and

P(A ∪ C) = P(A) + P(C) − P(A ∩ C) = 1/4 + 3/13 − 3/52 = 11/26.


Bayes’ Theorem

Given two events A and B, Bayes’ Theorem is a result that relates the conditional probabilities P(A|B) and P(B|A).

It states that

P(B|A) = P(B)P(A|B) / [P(B)P(A|B) + P(B′)P(A|B′)].

To see why this theorem is true, note that by the multiplication rule, the numerator on the right-hand side is simply P(A ∩ B), and the denominator becomes P(A ∩ B) + P(A ∩ B′).


Bayes’ Theorem, cont’d

Because B and B′ are mutually exclusive, but also exhaustive (meaning B ∪ B′ is equal to the entire sample space), this expression becomes P((A ∩ B) ∪ (A ∩ B′)) = P(A). We therefore have

P(B|A) = P(A ∩ B) / P(A),

which can be rearranged to again obtain the multiplication rule.


Alternative Form of Bayes’ Theorem

Also, if we keep the original numerator but use the simplified denominator, we obtain another commonly used statement of Bayes’ Theorem,

P(B|A) = P(B)P(A|B) / P(A).

This form is very useful for computing one conditional probability from another that may be easier to obtain.


Example

Suppose an insurance company classifies people as accident-prone or not accident-prone.

Furthermore, they determine that the probability of an accident-prone person actually having an accident within the next year is 0.4, whereas the probability of a non-accident-prone person having an accident within the next year is 0.2.

If 30% of people are accident-prone, then what is the probability that someone who does have an accident within the next year actually is accident-prone?


Applying Bayes’ Theorem

To answer this question, we let A be the event that the person has an accident within the next year, and let B be the event that the person is accident-prone.

From the given information, we have

P(A|B) = 0.4, P(A|B′) = 0.2, P(B) = 0.3.

From these probabilities, we conclude that

P(A) = P(A|B)P(B) + P(A|B′)P(B′) = (0.4)(0.3) + (0.2)(0.7) = 0.26.

Using Bayes’ Theorem, we conclude that the probability of someone who has an accident being accident-prone is

P(B|A) = P(B)P(A|B) / P(A) = (0.3)(0.4) / 0.26 ≈ 0.4615.
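The same computation can be carried out in R as a quick check (a direct transcription of the numbers above):

p_B   <- 0.3  # P(B): person is accident-prone
p_AB  <- 0.4  # P(A|B)
p_ABc <- 0.2  # P(A|B')

p_A <- p_AB * p_B + p_ABc * (1 - p_B)  # P(A) = 0.26
p_B * p_AB / p_A                       # P(B|A), approximately 0.4615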


Counting Principles

In order to compute probabilities using the definition, it is necessary to be able to determine the number of outcomes in an event or a sample space.

In this section, we present some techniques for counting how many elements are in a given set.


The Fundamental Counting Principle

The Fundamental Counting Principle states that if there are m ways to perform task A, and n ways to perform task B, then

- The number of ways to perform task A and task B is mn, and

- The number of ways to perform task A or task B (but not both) is m + n.


Example

Suppose that an ice cream shop offers a selection of ten different flavors, five different toppings, and three different sizes.

Then the number of possible orders of ice cream is 10(5)(3) = 150.

On the other hand, suppose that at a particular restaurant, one entree selection offers either steak or chicken, and a choice of a side dish.

If there are 7 different steak selections, 4 different chicken selections, and 10 side dishes, then the number of possible variations of this entree is (7 + 4)(10) = 110.


Example

Standard license plates in California have a digit, followed by 3 letters, followed by another 3 digits.

Therefore, the number of possible license plates is

10 · 26 · 26 · 26 · 10 · 10 · 10 = 10^4 · 26^3 = 175,760,000.

It can be seen from this example that if there are n ways to perform a certain task, and it must be performed r times, then the number of ways to do so is n^r.
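As a quick check of this count in R:

# one digit, three letters, three more digits
10 * 26^3 * 10^3  # 175760000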


Permutations

In many situations, it is necessary to know the number of possible arrangements of things, or the number of ways to perform a task in which there is some sort of ordering.

Equivalently, it is often necessary to sample a number of objects in such a way that (1) the order in which the objects are sampled is relevant, and (2) after an object is sampled, it is removed from the set so that it cannot be chosen again (this is known as sampling without replacement).

To see the equivalence, consider the task of arranging n objects. Once the first object is assigned its position, it should not be considered when placing the second object, and then the second object should not be considered when placing the third, and so on.


Permutations, cont’d

To sample r objects, in order, from a set of n, without replacement, we first note that there are n ways to choose the first object.

Then, the chosen object is removed from consideration, meaning that there are n − 1 ways to choose the second object.

Then, that object is removed from consideration, leaving n − 2 ways to choose the third object, and so on.


Counting Permutations

Therefore, the number of ways to choose r objects from a set of n, without replacement, is

n(n − 1)(n − 2) · · · (n − r + 1) = n! / (n − r)! = nPr.

Since this is also the number of ways to arrange r objects chosen from a set of n, we call this the number of permutations of these objects.


Example

Suppose that a club has 25 members, and it is necessary to elect a president, vice-president, secretary, and treasurer.

Then, the number of ways to choose 4 members to fill these positions is

25P4 = 25! / (25 − 4)! = 25 · 24 · 23 · 22 = 303,600.


Example

We know from the Fundamental Counting Principle that the number of possible 4-letter words is 26^4.

This is an instance of sampling with replacement, because once the first letter is chosen, it can be chosen again for the second letter, and so on.

However, if we require that all of the letters in each word are different, then we must sample without replacement, so the number of such words is 26P4 = 26(25)(24)(23).


Combinations

It is often the case that a number of objects must be sampled without replacement, but the order in which they are sampled is irrelevant.

In order to determine the number of ways in which such a sampling may be performed, we can start by computing nPr, where n is the number of objects to choose from and r is the number of objects to be chosen, but then we must divide by rPr = r!, the number of ways to arrange r objects.


Binomial Coefficients

The result is

nCr = n! / (r!(n − r)!),

which is called the number of combinations of r objects chosen from a set of n, also referred to as “n-choose-r”.

It is also known as a binomial coefficient, as it arises naturally when computing powers of binomials.


Example

Suppose we wish to count the number of possible poker hands.

This means counting the number of ways to choose 5 cards from a deck of 52.

The order in which the cards are chosen is irrelevant, so we use combinations instead of permutations.

The number of hands is

52C5 = 52! / (5!(52 − 5)!) = (52 · 51 · 50 · 49 · 48) / (5 · 4 · 3 · 2 · 1) = 2,598,960.


Example

As another example, consider the 25-member club from the discussion of permutations.

Suppose that they need to form a 4-person committee.

The number of ways to do this is 25C4, the number of ways to choose 4 members from a set of 25.

The reason why 25C4 is used here, as opposed to 25P4 for electing 4 officers, is that a member’s position within the committee is irrelevant, whereas once 4 members are chosen to be officers, it matters which one of them is chosen to be president, which is chosen to be vice-president, and so on.


Permutations and Combinations in R

In R, the choose and factorial functions can be used to compute the quantities nPr and nCr.

To compute nCr , use choose(n,r).

To compute nPr , use choose(n,r)*factorial(r).
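For example, the counts from the club-officer and poker-hand examples can be reproduced as follows:

choose(25, 4) * factorial(4)  # 25P4 = 303600
choose(52, 5)                 # 52C5 = 2598960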


Enumerating Combinations

The combn function can be used to actually enumerate all of the combinations of elements of a vector.

For example, the combinations of 3 numbers chosen from the set {1, 2, 3, 4, 5} are

> v=c(1:5)

> combn(v,3)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 1 1 1 1 1 1 2 2 2 3

[2,] 2 2 2 3 3 4 3 3 4 4

[3,] 3 4 5 4 5 5 4 5 5 5


Introduction

Now that we know how to compute probabilities of events, we can study the behavior of the probability across all possible outcomes of an experiment; that is, the distribution of the probability across the sample space.

Our understanding of the probability distribution will eventually allow us to make inferences from the data from which the distribution arises.


Random Variables

A random variable, usually denoted by a capital letter such as X, is an outcome of an experiment that has a numerical value.

The value itself is usually denoted by the lower-case version of the letter used to denote the variable itself; that is, a random variable X takes on numerical values that are denoted by x.

Random variables can either be continuous or discrete.

A continuous random variable can assume a value equal to any real number within some interval, whereas a discrete random variable can only assume selected numerical values, such as, for example, nonnegative integers. We will study random variables of both kinds.


Discrete Probability Distributions

A discrete probability distribution is a listing of all possible values of a discrete random variable, along with the probability of each value being assumed by the variable.


Example

Let X be a discrete random variable whose outcomes correspond to where one finishes in a race: first, second, third, etc.

If there are 10 runners in the race, then X can assume as a value any positive integer between 1 and 10.


The Distribution

The probability distribution might look like the following:

x     P(X = x)
1     0.10
2     0.15
3     0.23
4     0.18
5     0.15
6     0.10
7     0.04
8     0.02
9     0.02
10    0.01

Note that the notation P(X = x) is used to refer to the probability that the random variable X assumes the value x.
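In R, a discrete distribution such as this one can be stored as a pair of vectors (the names x and p below are our own choice), which makes it easy to check that the probabilities sum to 1:

x <- 1:10
p <- c(0.10, 0.15, 0.23, 0.18, 0.15, 0.10, 0.04, 0.02, 0.02, 0.01)
sum(p)  # should equal 1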


Rules for Discrete Distributions

A discrete probability distribution must follow these rules:

- Each outcome must be mutually exclusive of the others; that is, we cannot have X assume two values simultaneously as the result of an experiment.

- For each outcome x, we must have 0 ≤ P(X = x) ≤ 1.

- If the distribution has n possible outcomes x1, x2, . . . , xn, then we must have

  Σ_{i=1}^{n} P(X = xi) = 1.


Mean

For a given probability distribution, it is very helpful to know the “most likely”, or expected, value that the variable will assume.

This can be obtained by computing a weighted mean of the outcomes, where the probabilities serve as the weights.

We therefore define the mean, or expected value, of the discrete random variable X by

E[X] = µ = Σ_{i=1}^{n} xi P(X = xi).
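Using the finishing-position distribution stored earlier as the vectors x and p, the mean is simply the probability-weighted sum:

x <- 1:10
p <- c(0.10, 0.15, 0.23, 0.18, 0.15, 0.10, 0.04, 0.02, 0.02, 0.01)
sum(x * p)  # E[X] = 3.88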


Example

Consider a raffle, in which each ticket costs $5.

There is one grand prize of $100, two first prizes of $50 each, and four second prizes of $25 each.

If 200 tickets are sold, then the probability of winning the grand prize is 1/200 = 0.005, while the probabilities of winning first prize and second prize are 2/200 = 0.01 and 4/200 = 0.02, respectively.

Then, the expected amount of winnings is

E [X ] = 100(0.005) + 50(0.01) + 25(0.02) + 0(0.965) = 1.5.


Interpretation

That is, a ticket holder can expect to win, on average, $1.50.

However, we must account for the cost of the ticket, which applies to all participants; therefore, the expected net winnings are $1.50 − $5.00 = −$3.50.

Since the expected amount is negative, the raffle is not fair to the ticket holders; if the expected value were zero, then the raffle would be considered a “fair game”.


Variance and Standard Deviation

Using the mean of X, we can then characterize the dispersion of the outcomes by defining the variance of X as follows:

σ² = Σ_{i=1}^{n} (xi − µ)² P(X = xi).

An equivalent formula, in terms of expected values, is

σ² = E[X²] − E[X]².

Note that in the first term, the values of X are squared, and then they are multiplied by the probabilities and summed, whereas in the second term, the expected value is computed first, and then squared.
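Using the race-finish distribution from before (again stored as vectors x and p), the two formulas give the same result in R:

x <- 1:10
p <- c(0.10, 0.15, 0.23, 0.18, 0.15, 0.10, 0.04, 0.02, 0.02, 0.01)
mu <- sum(x * p)
sum((x - mu)^2 * p)  # definition of the variance
sum(x^2 * p) - mu^2  # E[X^2] - E[X]^2, the same value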


Uniform Distribution

The uniform distribution U{a, b} is the probability distribution for a random variable X with domain {a, a + 1, . . . , b} in which each value in the domain of X is equally likely to be observed.

It follows that the probability mass function for this distribution is

P(X = k) = 1/n, where n = b − a + 1 and k ∈ {a, a + 1, . . . , b}.


Mean and Variance

Using the above definitions of the mean and variance of a discrete random variable, it can be shown that

E[X] = (a + b)/2,  σ² = ((b − a + 1)² − 1)/12.

If a random variable X has the distribution U{a, b}, we write X ∼ U{a, b}.

We will use similar notation with other probability distributions, in order to indicate that a given random variable has a particular distribution.
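As a quick numerical check of the formulas above, consider U{1, 6} (the roll of a fair die):

a <- 1; b <- 6
k <- a:b
p <- rep(1 / (b - a + 1), b - a + 1)
sum(k * p)                   # (a + b)/2 = 3.5
sum(k^2 * p) - sum(k * p)^2  # ((b - a + 1)^2 - 1)/12 = 35/12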


Binomial Experiments

Suppose that an experiment is performed n times, and it can have only two outcomes, which are classified as “success” and “failure”.

Each of these individual experiments is referred to as a trial.

Furthermore, suppose that each trial is independent of the others, and that the probability of a trial being successful is p, where 0 < p < 1 (and therefore, the probability of failure is q = 1 − p).

These trials are called Bernoulli trials.


Examples

Examples of Bernoulli trials are:

- Testing for defective parts, in which n is the number of parts to be checked, p is the probability that a part is not defective, and k is the number of parts that are not defective.

- Observing the number of correct responses on an exam, in which n is the total number of questions, p is the probability of getting the correct answer on a single question, and k is the number of correct responses.

- Counting the number of households with an internet connection, in which n is the number of households, p is the probability of a single household having an internet connection, and k is the number of households that have an internet connection.


The Binomial Distribution

The binomial distribution B(n, p) is the probability distribution for the discrete random variable X whose value is the number of successes, denoted by k, in n Bernoulli trials, with probability of success p for each trial.

Given a value for k, 0 ≤ k ≤ n, what is P(X = k), the probability that X is equal to k?

First, we note that because the trials are independent, the probability of success (or failure) in consecutive trials can be obtained simply by multiplying the probabilities of the outcomes of the individual trials.

It follows that the probability of k successes, followed by n − k failures, is

p^k (1 − p)^(n−k).


Probability Mass Function

However, to determine the probability that any k of the n trials are successful, we have to consider all possible ways to choose k trials out of the n to be successful.

That is, we must multiply the above expression by nCk.

We conclude that the probability mass function for the binomial distribution is

P(X = k) = nCk p^k (1 − p)^(n−k) = [n! / (k!(n − k)!)] p^k (1 − p)^(n−k).

Using properties of the binomial coefficients, it can be verified that the sum of all of these probabilities, for k = 0, 1, 2, . . . , n, is equal to 1.


Examples

[Figure: The binomial distribution, for various values of n and p]


Behavior of the Distribution

Note that the binomial distribution is symmetric if p = 0.5, in which case the probability mass function simplifies to P(X = k) = nCk 2^(−n).

Otherwise, the probability mass is concentrated at smaller values of k if p < 0.5, because there is a greater probability of more failures, and at larger values of k if p > 0.5, since there is a greater probability of more successes.


Binomial Distribution in R

In R, the function dbinom can be used to compute probabilities from a binomial distribution.

Its first argument is a value, or vector of values, of k (number of successes).

The second argument is n, the number of trials, and the third argument is p, the probability of success.

An example of its usage is:

> dbinom(c(0,1,2,3,4),4,0.5)

[1] 0.0625 0.2500 0.3750 0.2500 0.0625

The output lists P(X = k), for k = 0, 1, 2, 3, 4, with p = 0.5 and n = 4.
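These values agree with the probability mass function written out explicitly:

n <- 4; p <- 0.5; k <- 0:4
choose(n, k) * p^k * (1 - p)^(n - k)  # same values as dbinom(k, n, p)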


The Mean and Standard Deviation

Using the definition of expected value, and properties of binomial coefficients, it can be shown by direct computation, and a lot of algebraic manipulation, that if X is a discrete random variable with a binomial distribution corresponding to n trials and probability of success p, then

E[X] = µ = np.

It can also be shown that the standard deviation is given by

σ = √(np(1 − p)).
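As a numerical sketch (with n and p chosen only for illustration), these formulas can be verified against the definitions of the mean and variance:

n <- 30; p <- 0.25
k <- 0:n
probs <- dbinom(k, n, p)
sum(k * probs)                             # equals n*p = 7.5
sqrt(sum(k^2 * probs) - sum(k * probs)^2)  # equals sqrt(n*p*(1-p)), about 2.37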


Continuous Probability Distribution

Recall that a continuous random variable is a random variable X whose domain is an interval D = [a, b], which is a subset of R, the set of real numbers.

A continuous probability distribution is a function f : D → [0, 1] whose value at x ∈ D is the probability P(X = x).

The function f(x) is the probability density function of X. By analogy with the requirement that the sum of all probabilities in a discrete probability distribution must equal one, a probability density function for a continuous random variable X must satisfy

∫_a^b f(x) dx = 1,

where the interval [a, b] is the domain of X.


Mean and Variance

The mean, or expected value, of a continuous random variable X is defined by

E[X] = ∫_a^b x f(x) dx.

Then, we can define the variance in the same way as for a discrete random variable:

Var[X] = E[X²] − E[X]².


Continuous Uniform Distribution

The continuous uniform distribution U(a, b) is the probability distribution for a random variable X with domain [a, b] in which all subintervals of [a, b] of the same width are equally likely to be observed.

It follows that the probability density function for this distribution is

f(x) = 1/(b − a), x ∈ [a, b].

Using the above definitions of the mean and variance of a continuous random variable, it can be shown that

E[X] = (a + b)/2,  σ² = (b − a)²/12.


Continuous Uniform Distribution in R

The R function dunif gives the value of this probability density function at x (first argument) on a specified interval [a, b] (second and third arguments). It simply returns 1/(b − a) if a ≤ x ≤ b, and 0 otherwise.

To easily obtain cumulative probabilities, use the punif function. The first argument is c, the largest desired outcome, and the second and third arguments are the endpoints a and b, respectively, of the domain of X.

Finally, given a probability p, the function qunif(p,a,b) returns the value of x (that is, the quantile) such that P(X ≤ x) = p. It can easily be determined that x = p(b − a) + a.
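For example, for X ∼ U(0, 10):

dunif(3, 0, 10)    # density: 1/(b - a) = 0.1
punif(3, 0, 10)    # P(X <= 3) = 0.3
qunif(0.3, 0, 10)  # quantile: returns 3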


Normal Distribution

The normal distribution is a probability distribution that is followed by continuous random variables, which can assume any real value within some interval.

A normal distribution has two parameters, its mean µ and its standard deviation σ; often, N(µ, σ) is used to refer to a specific normal distribution.

Its mean, median and mode are all the same, and equal to µ.


Characteristics

The distribution is “bell-shaped”, and is symmetric around the mean. In view of the essential properties of probability, the area under the entire bell-shaped normal distribution curve must be equal to 1.

The probability density is always strictly positive; it can never be zero, though it approaches zero for values of the variable that are far from the mean.

The probability density function is

P(X = x) = [1/(σ√(2π))] e^(−(x−µ)²/(2σ²)).


Normal Distribution in R

This function can be evaluated in R using its function dnorm; for example, dnorm(1,0.5,2) computes P(X = 1) for the normal distribution with mean µ = 0.5 and standard deviation σ = 2.

If the third argument is omitted, then σ is assumed to be 1; if the second argument is omitted as well, then µ is assumed to be 0.

This corresponds to the notion of the standard normal distribution, that has mean 0 and standard deviation 1.
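We can also confirm that dnorm evaluates the density formula given earlier; the particular values of x, µ and σ below are only an example:

x <- 1; mu <- 0.5; sigma <- 2
dnorm(x, mu, sigma)
1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))  # same value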


The Standard Normal Distribution

[Figure: The standard normal distribution, with mean 0 and standard deviation 1]


Calculating Probabilities

Suppose we wish to determine P(X ≤ x0), which happens to be the area of the region bounded by the normal distribution curve, the x-axis, and the vertical line x = x0.

As such, this probability would be given by

P(X ≤ x0) = ∫_{−∞}^{x0} P(X = x) dx = [1/(σ√(2π))] ∫_{−∞}^{x0} e^(−(x−µ)²/(2σ²)) dx,

but this integral cannot be evaluated using analytical techniques from calculus.

It must instead be evaluated numerically, which is cumbersome.


More R Functions

In R, we can use the pnorm function; for example, pnorm(1) computes P(X ≤ 1) for the normal distribution with µ = 0 and σ = 1.

More generally, to compute the probability P(X ≤ x0): pnorm(x0,m,s)

where m is the mean and s is the standard deviation

To find the quantile x0 such that P(X ≤ x0) = a: x0=qnorm(a,m,s)

As with dnorm, the default values of m and s are 0 and 1, respectively
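For example, for a normal distribution with mean 100 and standard deviation 15 (numbers chosen only for illustration):

pnorm(115, 100, 15)    # P(X <= 115), about 0.841
qnorm(0.975, 100, 15)  # the x0 for which P(X <= x0) = 0.975, about 129.4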


Tables and z-scores

Tables are often used to evaluate normal distribution probabilities.

Such tables use the standard normal distribution N(0, 1); therefore, if a different distribution is being used, a conversion to the standard distribution must be performed first.

This involves computing the z-score,

z = (x − µ)/σ.

If x is a value of the normal distribution N(µ, σ), then z is the corresponding value in N(0, 1); more precisely, it is the number of standard deviations between x and µ.
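For example, if X has the distribution N(100, 15), then x = 130 has z-score (130 − 100)/15 = 2, and the two ways of computing P(X ≤ 130) agree:

mu <- 100; sigma <- 15; x <- 130
z <- (x - mu) / sigma  # z-score: 2
pnorm(x, mu, sigma)    # P(X <= 130)
pnorm(z)               # the same probability, from the standard normal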


Using Symmetry

We can now describe how to compute various probabilities using normal distribution tables. In the following, we assume that z0 is the z-score for x0.

- P(X ≤ x0): Obtain P(Z ≤ z0) from a standard normal distribution table, or by using pnorm.

- P(X > x0) = 1 − P(X ≤ x0), because the events X > x0 and X ≤ x0 are complementary. That is, they are mutually exclusive and exhaustive, so their probabilities must sum to 1.

- P(X ≤ µ − x0) = 1 − P(X ≤ µ + x0), by the symmetry of the normal distribution.

- P(X > µ − x0) = P(X ≤ µ + x0), again by symmetry.

- P(x1 ≤ X ≤ x2) = P(X ≤ x2) − P(X ≤ x1).
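Each of these identities can be checked numerically with pnorm; for instance, with µ = 0, σ = 1 and arbitrarily chosen x0, x1, x2:

x0 <- 0.7
1 - pnorm(x0)         # P(X > x0)
pnorm(-x0)            # P(X <= mu - x0) = 1 - P(X <= mu + x0) here, since mu = 0
pnorm(1) - pnorm(-1)  # P(x1 <= X <= x2) with x1 = -1, x2 = 1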


The Empirical Rule, Revisited

The empirical rule, introduced previously, can be used to estimate normal distribution probabilities.

While it is approximately true for any bell-shaped, symmetric distribution, it is exact for any normal distribution.

In fact, the rule is derived from the behavior of the normal distribution.

Expressed in terms of probabilities, the empirical rule states that

P(−1 ≤ Z ≤ 1) ≈ 0.68,

P(−2 ≤ Z ≤ 2) ≈ 0.95,

P(−3 ≤ Z ≤ 3) ≈ 0.997.
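These values can be obtained directly from pnorm:

pnorm(1) - pnorm(-1)  # about 0.6827
pnorm(2) - pnorm(-2)  # about 0.9545
pnorm(3) - pnorm(-3)  # about 0.9973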


Approximating the Binomial Distribution

Like the Poisson distribution, the normal distribution can be used to approximate the binomial distribution, as long as the number of trials n and the probability of success p satisfy

np ≥ 5, n(1− p) ≥ 5.


Un-discretization

For computing probabilities, it is best to use the midpoints between the discrete values of the number of successes.

For example, to approximate P(X ≤ 5), where X is a discrete random variable with a binomial distribution, one should work with a continuous random variable Y with a normal distribution N(np, √(np(1 − p))) and compute P(Y ≤ 5.5), rather than P(Y ≤ 5).

This is due to the change from a discrete random variable to a continuous random variable.
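As a sketch of this approximation in R, with n = 30 and p = 0.25 (so that np = 7.5 ≥ 5 and n(1 − p) = 22.5 ≥ 5):

n <- 30; p <- 0.25
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
pbinom(5, n, p)        # exact binomial probability P(X <= 5)
pnorm(5.5, mu, sigma)  # normal approximation, using the midpoint 5.5; both near 0.2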


Example

[Figure: Approximation of the binomial distribution with n = 30 and p = 0.25 (blue circles) by N(np, √(np(1 − p))) (red curve)]


Other Probability Distributions

- Hypergeometric distribution: for sampling without replacement

- Exponential distribution: for studying time between events in a Poisson process

- Chi-square distribution: for goodness-of-fit, independence tests

- F-distribution: for analysis of variance

- and others...


Introduction

In order to complete the transition from descriptive statistics to inferential statistics, we need to know how to work with a sample of a population, since in many cases gathering descriptive statistics from the entire population is impractical.

Therefore, in this section, we discuss sampling techniques.


Methods of Sampling

Once the determination is made that only a sample of a population of interest can be studied, how to obtain that sample is far from a trivial matter.

It is essential that the sample not be biased; that is, the sample must be representative of the entire population, or any inferences made from the sample will not be reliable.

To reduce the chance of bias, it is best to use random sampling, which means that every member of the population has a chance of being selected.

We now discuss various approaches to random sampling.


Simple Sampling

In simple sampling, each member of the population has an equal chance of selection.

Typically, tables of random numbers are used to assist in such a selection process.

For example, suppose all members of the population can be numbered. Then, the table of random numbers can be used to determine the numbers of members of the population who are to be included in the sample.
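In R, the sample function can play the role of a table of random numbers; here is a minimal sketch that draws a simple random sample of 10 member numbers from a population of 500 (sizes chosen only for illustration):

set.seed(1)        # fix the random seed so the result is reproducible
sample(1:500, 10)  # 10 member numbers, drawn without replacement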


Systematic Sampling

Simple sampling is susceptible to bias if some aid such as a table of random numbers cannot be used.

To avoid this bias, one can use systematic sampling, which consists of selecting every kth member of the population.

If the population has N members and a sample of size n is desired, then one should choose k ≈ N/n.
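
A minimal R sketch of systematic sampling, again with the values of N and n chosen only for illustration; the starting point is picked at random from the first k members:

> N=500; n=25                # hypothetical population and sample sizes
> k=floor(N/n)               # sampling interval, k ≈ N/n
> start=sample(1:k,1)        # random starting point among the first k members
> chosen=seq(start,N,by=k)   # every kth member thereafter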


Cluster Sampling

In cluster sampling, the population is divided into groups, called clusters, and then random sampling is applied to the clusters.

That is, entire clusters are chosen to obtain the sample.

This is effective if each cluster is representative of the entire population.


Stratified Sampling

In stratified sampling, the population is divided into mutually exclusive groups, called strata, and then random sampling is performed within each stratum.

This approach can be used to ensure that each stratum is treated equally within the sample.

For example, suppose that for a national poll, it was desired to have a sample in which each state was represented equally.

Then, the strata would be the states, and a sample could be obtained from the population of each state.


Sampling Pitfalls

Sampling must be performed with care, so that any inferences made about the population from the sample have at least some validity.


Sampling Errors

A descriptive statistic computed from a sample is only an estimate of the corresponding statistic for the population, which, in most cases, cannot be obtained.

However, it is possible to estimate the error in the sample statistic, called the sampling error; we will learn how to do so later, using confidence intervals.

As we will see then, choosing a larger sample reduces the sampling error. It can be made arbitrarily small by choosing a sample close to the size of the entire population, but usually this is not practical.


Poor Sampling Technique

Even if a very large sample is chosen, conclusions made about the sample do not apply to the population if the sample is biased.

On the other hand, if a sample is truly representative of the population, then it does not need to be large to be reliable.

It is also important to avoid making unrealistic assumptions about the sample.


1948 Presidential Election

In a poll conducted during the 1948 presidential election, voters in the sample were classified as supporting Harry Truman, supporting Thomas Dewey, or undecided.

The polling organization assumed that the undecided voters would split between the two candidates in the same way that the decided voters had, which led to the conclusion that Dewey would win.

However, the undecided voters were actually more in favor of Truman, thus leading to his victory.


Sampling Distributions

Suppose that it is desired to measure some quantifiable characteristic of a population, such as average height, or the percentage of the population that votes Republican.

A sample of the population can be taken, and then the characteristic of the sample, whatever it is, can be computed from information obtained from each member of the sample.

Now, suppose that many samples are taken, with each sample being the same size.

Then, the values that are computed from these samples form a set of outcomes, where the experiment in question is the computation of the desired characteristic of the sample.

This set of outcomes obtained from samples is called a sampling distribution.


Sampling Distribution of the Mean

Sampling distributions apply to a number of different statistics, but the most commonly used is the mean.

The sampling distribution of the mean is the pattern of means that is obtained from computing the sample means from all possible samples of the population.


Example

We will illustrate the sampling distribution of the mean for an example of rolling a six-sided die.

Each of the six numbers has an equal likelihood of appearing face up, so these values follow a discrete uniform probability distribution, which is a distribution that assigns the same probability to each discrete event.


Example, cont’d

The mean of such a distribution is

µ = (a + b)/2,

where a and b are the minimum and maximum values, respectively, of the distribution. The variance is given by

σ² = [(b − a + 1)² − 1]/12.

Therefore, for the case of a six-sided die, for which a = 1 and b = 6, we have µ = 3.5 and σ² = 35/12.
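
These values can be checked directly in R from the six equally likely outcomes (note that the population variance divides by 6, so the var function, which divides by n − 1, is not used here):

> x=1:6
> mu=mean(x)              # 3.5
> sigma2=mean((x-mu)^2)   # 2.9167, i.e., 35/12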


Example, cont’d

Now, suppose we roll the die n times, where n is the size of our sample, and compute the sample mean x.

Then, we repeat this process m times, gathering m samples, each of size n.

The m sample means form a sampling distribution of the mean, which we can then display in a histogram.


Displaying the Sample Means

This is accomplished in R using the following statements (assuming the values of n, the sample size, and m, the number of samples, are already defined):

> means=c()                                               # vector to hold the m sample means
> for (i in 1:m) means[i]=mean(round(runif(n,0.5,6.5)))   # each sample: n simulated die rolls
> hist(means,seq(1,6,0.5))                                # histogram of the sample means, with bins of width 0.5


Sampling Distribution of the Mean, n = 2

Suppose we use a small sample of size n = 2, and compute m = 50 samples. The means are well-distributed across the interval from 1 to 6.


Increasing the Sample Size

Now, suppose that we increase n (keeping m fixed) and see what happens to the distribution. We see that the distribution becomes like that of a normal distribution, with its mean roughly that of the original uniform distribution.


The Central Limit Theorem

The behavior in the preceding example is no coincidence; it is actually an illustration of what is known as the Central Limit Theorem.

This theorem states that as the sample size n increases, the sample means tend to converge to a normal distribution around the true population mean, regardless of the distribution of the population from which the sample is taken.


Standard Error of the Mean

The Central Limit Theorem also states that as the sample size n increases, the standard deviation of the sample means, denoted by σx, converges to

σx = σ/√n,

where σ is the standard deviation of the population.

This standard deviation of the sample means is called the standard error of the mean.

Using the standard error σx and the population mean µ, one can use the fact that the sample mean is normally distributed for sufficiently large n to compute the probability that the sample mean will fall within a certain interval, as has been shown previously for a general normal distribution.


Example

In the case of the roll of a six-sided die, with a sample size of n = 20, the standard error is

σx = σ/√n = √(35/12)/√20 = 0.382.

Therefore, to obtain the probability that the sample mean will be greater than 4, we compute the z-score for 4:

(4 − µ)/σx = (4 − 3.5)/0.382 = 1.309.

We conclude that

P(X > 4) = 1 − P(X ≤ 4) = 1 − P(Z ≤ 1.309) = 1 − 0.9047 = 0.0953.

That is, there is a less than 10% chance that the sample mean will be greater than 4.
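
The same probability can be computed in R with pnorm, using the standard error as the standard deviation of the sampling distribution:

> se=sqrt(35/12)/sqrt(20)     # standard error of the mean, about 0.382
> 1-pnorm(4,mean=3.5,sd=se)   # P(sample mean > 4), about 0.095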


Sampling Distribution of the Sum

Suppose that instead of taking the mean of the observations in each sample, we instead take the sum.

If the population mean and standard deviation are µ and σ, respectively, then as n increases, the sampling distribution of the sum converges to N(nµ, σ√n).

That is, the mean and standard deviation of the sampling distribution of the mean are simply multiplied by n.


Sampling Distribution of the Proportion

In addition to the mean, we can measure the proportion of the population that possesses a characteristic that is binary in nature, such as whether a person agrees with a particular statement.

Because of the binary nature of the characteristic, the experiment of determining its value for members of the population follows a binomial distribution.

That is, the act of inquiring of each member of the population is a Bernoulli trial, in which “success” and “failure” correspond to “yes” or “no” responses.

However, as noted previously, if the number of trials n is sufficiently large that np ≥ 5 and n(1 − p) ≥ 5, where p is the probability of “success”, then this binomial distribution can be approximated by a normal distribution.


Standard Error of the Proportion

We therefore need the mean and standard deviation of this normal distribution.

Because the population proportion p is unknown, we must instead use the sample proportion ps, which is defined to be the number of successes in the sample, divided by the sample size n.

Several samples can be taken, and then their sample proportions can be averaged to obtain an approximate value for p.

The standard deviation of this distribution, called the standard error of the proportion, is given by

σp = √(p(1 − p)/n).


Relation to the Binomial Distribution

It is worth noting that σp is equal to the standard deviation of the binomial distribution, √(np(1 − p)), divided by n.

This makes sense because in the sampling distribution of the proportion, we are not measuring the number of successes, as we are in the binomial distribution.

Rather, we are measuring the proportion of successes, thus requiring the division of both the binomial distribution’s mean and standard deviation by n.


Example

Suppose that through sampling, with samples of size n = 100, it is determined that 60% of voters in California support a particular ballot initiative (that is, p = 0.6).

Because np = 100(0.6) = 60 and n(1 − p) = 100(0.4) = 40 are large enough, we may use a normal distribution to model the sampling distribution of the proportion.


Example, cont’d

The standard error of the proportion is

σp = √(0.6(1 − 0.6)/100) = 0.049.

Therefore, the probability that more than 65% of the next sample will support the initiative is

P(ps > 0.65) = 1 − P(ps ≤ 0.65) = 1 − P(Z ≤ 1.02) = 1 − 0.8461 = 0.1539,

where the z-score for 0.65 is

(0.65 − p)/σp = (0.65 − 0.6)/0.049 = 1.02.
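
In R, this computation can be carried out with pnorm, using the standard error of the proportion:

> p=0.6; n=100
> sigma_p=sqrt(p*(1-p)/n)           # standard error of the proportion, about 0.049
> 1-pnorm(0.65,mean=p,sd=sigma_p)   # P(sample proportion > 0.65), about 0.15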


Introduction

Now that we have learned about sampling and sampling distributions, we are ready to learn how to use inferential statistics to make conclusions about populations based on information obtained from samples.

A key component of inferential statistics is to quantify the uncertainty that is inherent in using only a sample.

An example of this is polling; a statement of a poll result is accompanied by an indication of the sampling error.


Confidence Intervals for Means

Suppose that we wish to know the population mean, but only have a sample mean.

We can construct a confidence interval that is centered at the sample mean and can provide an indication of the population mean.


Large Samples

We first consider the case where the sample size n is sufficiently large, meaning that n ≥ 30.

If this is the case, then, by the Central Limit Theorem, the sample means are approximately normally distributed, even if the population is not.


Estimators

The sample mean is an example of what is called a point estimate, which is a single value that describes the population.

Point estimators are easy to compute, but impossible to validate.

To gauge the validity of a sample mean, we will rely on an interval estimate, which is a range of values that describes the population.

The particular interval estimate we will use is called a confidence interval.


Confidence Levels

The first step in constructing a confidence interval is choosing a confidence level, which is the probability that the interval estimate will include the population parameter (in this case, the population mean).

For example, for a 90% confidence interval, the confidence level is 0.9.

Subtracting this value from 1 yields the significance level α; that is, for a 90% confidence interval, the significance level is 0.1.


Constructing a Confidence Interval

When the population standard deviation σ is known, the confidence interval is determined as follows:

1. Compute the standard error of the mean, σx = σ/√n.

2. Find the z-value zα/2 such that, for a random variable Z with the standard normal distribution N(0, 1), P(Z ≤ zα/2) = 1 − α/2, where α is the level of significance and 1 − α is the corresponding confidence level. The value of zα/2 can be found by looking up the probability 1 − α/2 in a normal distribution table, or by using the R function qnorm with argument 1 − α/2.

3. Compute the margin of error E = zα/2 σx.

4. Then, the confidence interval is [x − E, x + E].


Meaning of zα/2


Example

Suppose that a signal with value µ is received with a value that is normally distributed around µ with variance 4.

To reduce error, the signal is transmitted 10 times.

If the values received are 8.5, 9.5, 9.5, 7.5, 9, 8.5, 10.5, 11, 11 and 7.5, then what is a 95% confidence interval for µ?


Example, cont’d

1. First, we compute the sample mean, x = 9.25.

2. Then, we compute the standard error of the mean, σx = σ/√n = 2/√10 = 0.6325.

3. Using α = 0.05, we obtain zα/2 = 1.96.

4. The margin of error is then E = zα/2 σx = (1.96)(0.6325) = 1.24.

5. Finally, the confidence interval is [x − E, x + E] = [9.25 − 1.24, 9.25 + 1.24] = [8.01, 10.49].
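
The same confidence interval can be computed in R using qnorm, following the steps above:

> x=c(8.5,9.5,9.5,7.5,9,8.5,10.5,11,11,7.5)   # the received values
> sigma=2                                     # population standard deviation (variance 4)
> se=sigma/sqrt(length(x))                    # standard error of the mean, 0.6325
> E=qnorm(0.975)*se                           # margin of error, about 1.24
> mean(x)+c(-E,E)                             # 95% confidence interval, about [8.01, 10.49]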


Interpreting Confidence Intervals

Once the confidence interval is obtained, it is essential to interpret it correctly.

Given a 90% confidence interval, it is not true that the population mean has a 90% probability of falling within the interval.

Instead, what we know is that there is a 90% probability that any given confidence interval from a random sample will contain the population mean.

Note that all confidence intervals for a given confidence level and sample size have the same margin of error E, and hence the same width, but the center is the sample mean, which can vary.


Changing the Confidence Level

The significance level α represents the probability of erroneously concluding that the population mean is outside the confidence interval, when in fact it lies within the interval.

As the confidence level 1 − α increases, the significance level α decreases (since these two quantities must sum to one), which causes the z-score zα/2 to increase, and therefore the interval widens.

As a result, the chance of erroneously concluding that the population mean is outside the confidence interval decreases.


Changing the Sample Size

As the sample size n increases, the standard error of the mean decreases.

It follows that the margin of error decreases, and therefore the confidence interval shrinks.

This makes sense because with a larger sample size, the sample mean should more accurately approximate the population mean.

In fact, this is confirmed by the Law of Large Numbers, which states that as n → ∞, the sample mean x converges to the population mean µ.


Choosing the Sample Size for the Mean

Given a desired margin of error E, one can solve for the sample size n that would produce this margin of error.

Rearranging the formulas presented earlier for the construction of the confidence interval, we obtain

n = (σ/σx)² = (zα/2 σ/E)².

We can see from this formula that as the margin of error E decreases, the sample size n must increase.
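
A short R sketch of this calculation, with the values of σ, E, and α chosen only for illustration; the result is rounded up since n must be a whole number:

> sigma=2; E=0.5; alpha=0.05   # hypothetical values
> z=qnorm(1-alpha/2)           # z-value for the chosen confidence level
> ceiling((z*sigma/E)^2)       # required sample size, 62 in this case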


When σ is Unknown

If the population standard deviation σ is unknown, a confidence interval can be obtained by substituting the sample standard deviation s.

That is, the standard error of the mean is taken to be σx = s/√n.


Small Samples

When the sample size n is considered small (that is, n < 30), we can no longer rely on the Central Limit Theorem to conclude that the sampling distribution of the mean is normal.

We must instead assume that the population itself is normal.

When the population standard deviation σ is known, then we can proceed in the same way as for large samples.


When σ is Unknown

When σ is unknown, we can substitute s for σ as is done for large samples, but to determine the margin of error E, instead of using the z-value zα/2 from the normal distribution, we use the Student’s t-distribution.

This distribution, like the normal distribution, is bell-shaped and symmetric around the mean, and the area under the probability density curve is 1, but the shape of this curve depends on the degrees of freedom, which is n − 1.

This is because there are n observations in the sample, but one degree of freedom is removed because the sample mean is used in computing s.

The Student’s t-distribution curve is flatter than the normal distribution curve, but it converges to a normal distribution as n increases.


Using the Student’s t-distribution

In this scenario, the confidence interval is given by

[x − tα/2,n−1 σx, x + tα/2,n−1 σx], where σx = s/√n.

The value of tα/2,n−1 can be obtained by looking up the probability 1 − α/2 in a Student’s t-distribution table, or using the R function qt with arguments 1 − α/2 and n − 1.
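
A minimal R sketch of a small-sample confidence interval built this way, with a hypothetical sample x and α = 0.05:

> x=c(12.1,11.4,12.8,11.9,12.3)   # hypothetical small sample
> n=length(x); alpha=0.05
> se=sd(x)/sqrt(n)                # standard error using the sample standard deviation s
> t=qt(1-alpha/2,n-1)             # t-value with n-1 degrees of freedom
> mean(x)+c(-t*se,t*se)           # confidence interval for the population mean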


Introduction

In this section, we explore one of the most useful applications of inferential statistics, one that truly demonstrates its power: hypothesis testing, in which a sample is used to determine, within a certain level of confidence, whether to reject a hypothesis about the population from which the sample was drawn.

This is a prime example of how statistics is useful for acquiring insight into populations from raw data.

A hypothesis is defined to be an assumption about a population parameter.

In this section, we will formulate hypotheses about whether a certain parameter is less than, equal to, or greater than a certain value, and then use confidence intervals to test whether these hypotheses should be rejected.


The Null and Alternative Hypotheses

For hypothesis testing, we use two hypotheses:

- The null hypothesis, denoted by H0, represents the “status quo”. It states a belief about how a population parameter, such as the mean or a proportion, compares to a specific value.

- The alternative hypothesis, denoted by H1, is the opposite of H0.


Stating the Null and Alternative Hypotheses

For hypothesis testing to be as useful as possible, it is important to choose the alternative hypothesis H1 wisely.

The alternative hypothesis plays the role of the “research hypothesis”; that is, it corresponds to the position that the researcher wants to establish.


Example

Suppose that a brand of lightbulbs has a mean lifetime of 2000 hours, but an improvement has been made to their design that may extend their lifetime.

Then, an appropriate null hypothesis for this situation would be H0 : µ ≤ 2000, and the corresponding alternative hypothesis would be H1 : µ > 2000.

Therefore, if it is determined that H0 should be rejected, then there is evidence to support the claim that the newly designed lightbulbs do in fact have a longer lifetime.


The Process of Hypothesis Testing

A hypothesis test proceeds as follows:

1. First, we determine a rejection region of the sampling distribution for the parameter featured in H0 (for example, the sampling distribution of the mean).

2. Then, we check whether an appropriate test statistic (for example, the sample mean) falls within the rejection region.

3. If so, we choose to reject H0, and conclude that there is sufficient evidence to support the claim made by H1.

Otherwise, we choose not to reject H0, and conclude that there is not sufficient evidence to support the claim made by H1.

It is important to note that a hypothesis test does not provide enough evidence to accept H0; we are only concerned with whether to reject it.


Type I and Type II Errors

Because of the reliance on a sample, it is possible for the conclusion of a hypothesis test to be erroneous. There are two kinds of erroneous conclusions:

- A Type I error is committed when the decision is made to reject H0, even though it is actually valid. This kind of error is often due to a sampling error. The probability of a Type I error is the level of significance used to construct the confidence interval used for the hypothesis test; as before, this probability is denoted by α.

- A Type II error is committed when the decision is made not to reject H0 even though it is actually false. The probability of such an error is denoted by β.

For a fixed sample size, β decreases as α increases.

However, the probability of both errors can be decreased by increasing the sample size.


Two-Tail Hypothesis Testing

A two-tail hypothesis test is a hypothesis test in which the null hypothesis H0 is a statement of equality.

For example, a null hypothesis for the mean would be of the form H0 : µ = µ0, for some chosen value of µ0.


The Role of Confidence Intervals

We first choose the significance level α, based on what is considered an acceptable probability of making a Type I error.

Then, we construct a confidence interval around µ0, which is

[µ0 − zα/2σx, µ0 + zα/2σx].

If the sample mean x falls within this confidence interval, then we do not reject H0.

Otherwise, we say that x falls within the rejection region (that is, the subset of the real number line outside of the confidence interval), and we reject H0.


The Test Statistic

By rearranging algebraically, we obtain the equivalent condition that we do not reject H0 if the test statistic

z∗ = (x − µ0)/σx

satisfies

−zα/2 ≤ z∗ ≤ zα/2.
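
A minimal R sketch of this two-tail decision rule, with all of the values below chosen only for illustration:

> mu0=100; xbar=102.3; sigma=15; n=36; alpha=0.05   # hypothetical values
> z_star=(xbar-mu0)/(sigma/sqrt(n))                 # test statistic, 0.92 here
> abs(z_star)<=qnorm(1-alpha/2)                     # TRUE means we do not reject H0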


One-Tail Hypothesis Testing

A one-tail hypothesis test is a hypothesis test in which the null hypothesis H0 is an inequality.

For example, a null hypothesis for the mean would be of the form H0 : µ ≤ µ0 or H0 : µ ≥ µ0.


One-sided Confidence Intervals

As with the two-tail test, we first choose the significance level α. Then, we construct a one-sided confidence interval.

For the null hypothesis H0 : µ ≤ µ0, the interval is

(−∞, µ0 + zασx].

If the sample mean x falls within this confidence interval (that is, x ≤ µ0 + zασx), then we do not reject H0.

Otherwise, if x > µ0 + zασx, then x falls within the rejection region, and we reject H0.


Test Statistics with One-Tail Tests

Equivalently, we do not reject H0 if the test statistic satisfies

z∗ ≤ zα.

Similarly, if the null hypothesis is H0 : µ ≥ µ0, we do not reject H0 if

z∗ ≥ −zα.

Note that one-tail hypothesis testing uses the same test statistic as in the two-tail case, but it is compared to different values.


Hypothesis Testing with One Sample

We now consider hypothesis testing in several scenarios, all of which involve only one sample.

In each scenario, the general idea is the same: a confidence interval needs to be constructed around the value that is compared to the parameter in H0.

Outside this interval lies the rejection region; if the test statistic falls within the rejection region, then H0 is rejected.

The differences between scenarios relate to the various parameters used to construct the confidence interval.


Testing for the Mean, Large Sample

First, we consider hypothesis testing for the case in which the parameter of interest is the mean, and the sample is large (that is, of size 30 or more).

Under this assumption, by the Central Limit Theorem, the sampling distribution of the mean is well approximated by a normal distribution.

We will consider both one-tail hypothesis tests, for which the null hypothesis is of the form H0 : µ ≥ µ0 or H0 : µ ≤ µ0, and two-tail tests, for which the null hypothesis is of the form H0 : µ = µ0.


When σ is Known

When the population standard deviation σ is known, then the appropriate test statistic is the one introduced in the preceding discussion,

z∗ = (x − µ0)/σx,

where σx = σ/√n is the standard error of the mean.


Example

A commercial hatchery grows salmon whose weights are normally distributed with a standard deviation of 1.2 pounds.

The hatchery claims that the mean weight is at least 7.6 pounds. Suppose a random sample of 40 fish yields an average weight of 7.2 pounds.

Is this strong enough evidence to reject the hatchery’s claim at the 5% level of significance?


Testing the Hypothesis

We use the null hypothesis H0 : µ ≥ 7.6, and alternative hypothesis H1 : µ < 7.6. The standard error is

σx = σ/√n = 1.2/√40 = 0.1897.

The test statistic is

z∗ = (x − µ0) / σx = (7.2 − 7.6) / 0.1897 = −2.1082.

We then compare this value to −zα = −z0.05 = −1.6449.

Because z∗ < −zα, the test statistic falls within the rejection region, and therefore we reject H0 and conclude that the hatchery's claim does not have merit.
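The same calculation can be carried out in R; the following is a minimal sketch using the numbers from this example, with qnorm supplying the critical value −zα.

# One-tail z-test for the hatchery example (sigma known)
xbar  <- 7.2                  # sample mean
mu0   <- 7.6                  # hypothesized mean under H0
sigma <- 1.2                  # population standard deviation
n     <- 40                   # sample size
se    <- sigma / sqrt(n)      # standard error of the mean, 0.1897
zstar <- (xbar - mu0) / se    # test statistic, -2.1082
zcrit <- qnorm(0.05)          # -z_alpha = -1.6449
zstar < zcrit                 # TRUE, so we reject H0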

When σ is Unknown

By contrast, when σ is unknown, we substitute s, the sample standard deviation, for σ and proceed as before.

Because the sample is large, it is assumed that s is a reasonably accurate approximation for σ.

Therefore, the test statistic is

z∗ = (x − µ0) / (s/√n).

Example

Twenty years ago, male students at a high school could do an average of 24 pushups in 60 seconds.

To determine whether this is still true today, a sample of 50 male students was chosen.

If the sample mean was 22.5 pushups and the sample standard deviation was 3.1, can we conclude that the mean is no longer 24?

Example, cont’d

Our null hypothesis is H0 : µ = 24, and the alternative hypothesis is H1 : µ ≠ 24. We test at the 5% level of significance.

Since the population standard deviation is unknown, but the sample is sufficiently large, we use the sample standard deviation instead.

Then, the standard error is

σx = s/√n = 3.1/√50 = 0.4384.

Example, cont’d

Therefore, the test statistic is

z∗ = (x − µ0) / σx = (22.5 − 24) / 0.4384 = −3.4215.

Because this is a two-tail test, we compare z∗ to zα/2 = z0.025 = 1.96.

We have |z∗| > zα/2, which means z∗ falls within the rejection region.

Therefore, we reject H0 and conclude that the mean is no longer 24.
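As a check, this two-tail test can be reproduced in R (a sketch using the numbers above):

# Two-tail z-test for the pushup example (sigma unknown, large sample)
xbar <- 22.5; mu0 <- 24; s <- 3.1; n <- 50
zstar <- (xbar - mu0) / (s / sqrt(n))   # -3.4215
abs(zstar) > qnorm(1 - 0.05/2)          # TRUE: |z*| > 1.96, so we reject H0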

The Role of α

It can be seen from examination of a normal distribution table that as α increases, zα (or zα/2, for that matter) decreases, because zα is the z-value for which P(Z > zα) = α, or, equivalently, P(Z ≤ zα) = 1 − α.

It follows that the test statistic is less likely to fall within the appropriate confidence interval for the hypothesis test; that is, it is more likely that H0 will be rejected.

The Role of α

Considering that the alternative hypothesis H1 is generally the one that supports a position that a researcher is trying to establish, it is in the researcher's interest that H0 be rejected.

As such, they can help their cause by choosing a larger value of α, which corresponds to a lower confidence level 1 − α.

This is an important ethical consideration for a statistician, and underscores the importance of knowing the parameters used in any statistical analysis that is used to support a particular position.

The smaller the value of α, the more confidence (pun intended) one can have in the result of a hypothesis test.

p-Values

Since it is important to avoid a Type I error (rejecting H0 when it is actually valid), and the probability of making this error is the level of significance α, it is helpful to have some guidance in choosing α.

For this reason, we introduce the concept of a p-value, which is defined to be the smallest level of significance at which H0 would be rejected, assuming it is true.

One-Tail Tests

Recall that for a one-tail test of a hypothesis of the form H0 : µ ≤ µ0, with alternative hypothesis H1 : µ > µ0, H0 should be rejected if

z∗ > zα.

Because zα satisfies P(Z > zα) = α, it follows that H0 will be rejected if

P(Z > z∗) < α.

We therefore take the p-value for such a hypothesis test to be

P(Z > z∗).

What is the p-value? [figure omitted]

The Case of H0 : µ ≥ µ0

On the other hand, if the null hypothesis is H0 : µ ≥ µ0, then H0 is rejected if

z∗ < −zα.

Because P(Z ≤ −zα) = α, it follows that H0 is to be rejected if

P(Z ≤ z∗) < α.

We conclude that the p-value is P(Z ≤ z∗).

Two-Tail Tests

Finding a p-value for a two-tail test is similar to the one-tail case.

For such a test, the null hypothesis H0 : µ = µ0 is rejected if

|z∗| > zα/2.

Because zα/2 satisfies P(|Z| > zα/2) = α, it follows that H0 is rejected if

P(|Z| > |z∗|) < α.

Due to the symmetry of the normal distribution, this condition is equivalent to

P(Z > |z∗|) < α/2.

That is, the p-value for a two-tail test is twice the p-value of the corresponding one-tail test.

Computing p-values in R

To compute p-values in R, given test statistic z:

▶ H0 : µ ≤ µ0: 1-pnorm(z)

▶ H0 : µ ≥ µ0: pnorm(z)

▶ H0 : µ = µ0: 2*(1-pnorm(abs(z)))

Analogous expressions can be used when working with other distributions, as will be discussed later.
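For instance, applying these expressions to the test statistics computed in the two earlier examples (a sketch):

# One-tail test (hatchery example), H0: mu >= 7.6, z* = -2.1082
pnorm(-2.1082)                    # p-value, approximately 0.0175

# Two-tail test (pushup example), H0: mu = 24, z* = -3.4215
2*(1 - pnorm(abs(-3.4215)))       # p-value, approximately 0.0006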

Interpreting p-values

The null hypothesis should be rejected if the p-value is smaller than the significance level α.

A p-value of, for example, 0.02 would therefore imply that H0 is rejected at the 95% confidence level, but not at the 99% level.

As such, researchers like small p-values, as they imply statistically significant results.

Manipulating p-values

Unfortunately, this means researchers may (intentionally or otherwise) improperly influence p-values to drive them toward zero.

For example, this can readily be achieved by measuring many quantities on a small sample, which almost guarantees a statistically significant result.

For more information: "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How." by John Bohannon, posted on io9.com

Remember the advice of Ronald Fisher, who introduced p-values: a small p-value does not prove a research hypothesis true! It only means the evidence is worth a second look.

Testing for the Mean, Small Sample

When the sample size is small (that is, less than 30), we can no longer rely on the Central Limit Theorem and automatically treat the sampling distribution of the mean as a normal distribution.

Therefore, we must instead assume that the population itself is normally distributed for our hypothesis testing procedures to remain valid.

When σ is Known

When the population standard deviation σ is known, then we can proceed with hypothesis testing as before, provided that the population is in fact normally distributed.

If this is not the case, then the result of a hypothesis test may be unreliable.

When σ is Unknown

When σ is unknown, but it can be assumed that the population is normally distributed, then we need an alternative approach to computing the threshold against which to compare the test statistic.

That is, we need an alternative value that plays the role of zα in a one-tail test or zα/2 in a two-tail test.

For this purpose, we use the Student's t-distribution, as we did before when constructing confidence intervals using small samples with σ unknown.

Using the Student’s t-distribution

As in the case of a large sample with σ unknown, our test statistic is

t∗ = (x − µ0) / (s/√n).

This test statistic is compared to a value of the Student's t-distribution with n − 1 degrees of freedom, where n is the sample size.

For a given significance level α, we let tα,n−1 be the t-value such that P(Tn−1 > tα,n−1) = α, where Tn−1 is a random variable that follows the Student's t-distribution with n − 1 degrees of freedom.

When to Reject?

For a one-tail test, with null hypothesis H0 : µ ≤ µ0, we reject H0 if t∗ > tα,n−1, and do not reject H0 otherwise.

On the other hand, if the null hypothesis is H0 : µ ≥ µ0, we reject H0 if t∗ < −tα,n−1, and do not reject H0 otherwise.

Finally, for a two-tail test with null hypothesis H0 : µ = µ0, we reject H0 if |t∗| > tα/2,n−1, and do not reject H0 if |t∗| ≤ tα/2,n−1.
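In R, the thresholds tα,n−1 and tα/2,n−1 can be obtained with qt, which expects a left-tail probability. The following sketch assumes the test statistic tstar, the significance level alpha, and the sample size n have already been computed.

# Critical values from the Student's t-distribution with n - 1 degrees of freedom
# (assumes tstar, alpha, and n are already defined)
tcrit <- qt(1 - alpha, df = n - 1)         # t_{alpha, n-1}
tstar > tcrit                              # reject H0: mu <= mu0 if TRUE
tstar < -tcrit                             # reject H0: mu >= mu0 if TRUE
abs(tstar) > qt(1 - alpha/2, df = n - 1)   # reject H0: mu = mu0 if TRUE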

Hypothesis Testing with Two Samples

We now consider testing hypotheses involving characteristics of two populations.

Examples of situations that call for a two-sample test include

▶ investigating differences in test scores between males and females,

▶ comparison of long-life vs standard light bulbs, and

▶ average selling prices of homes in different areas.

Sampling Distribution for the Difference of Means

Two-sample hypothesis testing can be used to compare means.

For this purpose, we use the sampling distribution for the difference in means, which describes the probability of observing various values of the difference between two sample means.

To perform hypothesis testing with this distribution, we need the standard error of the difference,

σx1−x2 = √( σ1²/n1 + σ2²/n2 ),

where the first sample, of size n1, is drawn from a population with standard deviation σ1, and the second sample, of size n2, is drawn from a population with standard deviation σ2.

Testing for Difference of Means, Large Samples

If the sample sizes n1 and n2 are large, then it can be assumed that the sampling distribution of the difference of means follows a normal distribution.

The test statistic is then

z∗ = (x1 − x2) / σx1−x2,

where σx1−x2 was defined earlier.

For this discussion, we need to assume that the two samples are independent of one another.

Example

Two new methods of producing a tire are to be compared.

For the first method, n1 = 40 tires are tested at location A and found to have a mean lifetime of x1 = 40,000 miles, while for the second method, n2 = 50 tires are tested at location B and found to have a mean lifetime of x2 = 42,000 miles.

It is known that tires tested at location A have a standard deviation of σ1 = 4,000 miles, while tires tested at location B have a standard deviation of σ2 = 5,000 miles.

We wish to test the hypothesis that both methods produce tires with the same average lifetimes.

Example, cont’d

The null hypothesis is H0 : µ1 = µ2, and the alternative hypothesis is H1 : µ1 ≠ µ2. We test at the 5% significance level.

The standard error is

σx1−x2 = √( σ1²/n1 + σ2²/n2 ) = √( 4000²/40 + 5000²/50 ) = 948.6833.

Then, the test statistic is

z∗ = (x1 − x2) / σx1−x2 = (40,000 − 42,000) / 948.6833 = −2.1082.

We compare this against zα/2 = z0.025 = 1.96.

Example, cont’d

Since |z∗| > zα/2, we reject the null hypothesis and conclude that the two methods produce tires with statistically different average lifetimes.

It is worth noting that the p-value is

P(|Z| > |−2.1082|) = 2P(Z > 2.1082) = 2(1 − P(Z ≤ 2.1082)) = 0.035.

That is, the null hypothesis would be rejected at any significance level above 3.5%.
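A sketch of this example in R, including the p-value computed with pnorm:

# Two-sample z-test for the tire example (population standard deviations known)
x1 <- 40000; x2 <- 42000
sigma1 <- 4000; sigma2 <- 5000
n1 <- 40; n2 <- 50
se    <- sqrt(sigma1^2/n1 + sigma2^2/n2)   # 948.6833
zstar <- (x1 - x2) / se                    # -2.1082
2*(1 - pnorm(abs(zstar)))                  # two-tail p-value, about 0.035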

Testing for Difference of Means, Unknown Variance

If the standard deviations are unknown, then we must use the Student's t-distribution.

When the sample sizes are small (that is, n1, n2 < 30) we must assume that the populations are normally distributed.

For now, we assume that the samples are independent; this kind of hypothesis test is called an unpaired t-test.

Equal Standard Deviations

When the population standard deviations are unknown but assumed to be equal, we use the sample standard deviations to obtain a pooled estimate of standard deviation:

sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )

Note that the degrees of freedom of the samples, n1 − 1 and n2 − 1, are added to obtain the degrees of freedom to be used for the test, n1 + n2 − 2.

Equal Standard Deviations, cont’d

We then obtain the standard error of the difference of means as follows:

σx1−x2 = sp √( 1/n1 + 1/n2 )

The corresponding test statistic is

t∗ = (d − d0) / σd,

where, for conciseness, the variable d represents x1 − x2, d0 is the value of d against which we are testing, and σd = σx1−x2 is the standard error just defined.
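A minimal sketch of the pooled calculation in R, assuming the two samples are stored in numeric vectors x1 and x2 and the hypothesized difference d0 has been set (d0 = 0 for H0 : µ1 = µ2):

# Pooled two-sample t statistic (assumes vectors x1, x2 and a value d0 exist)
n1 <- length(x1); n2 <- length(x2)
sp <- sqrt(((n1 - 1)*var(x1) + (n2 - 1)*var(x2)) / (n1 + n2 - 2))  # pooled s
se <- sp * sqrt(1/n1 + 1/n2)                # standard error of the difference
tstar <- (mean(x1) - mean(x2) - d0) / se    # compare to t with n1 + n2 - 2 d.f.

The built-in function t.test(x1, x2, var.equal = TRUE) performs the same pooled test directly.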

Unequal Standard Deviations

When the standard deviations are unequal, we perform an unpooled test.

We first define the standard error of the difference of means as

σx1−x2 = √( s1²/n1 + s2²/n2 ).

In this case, we define the number of degrees of freedom by

d.f. = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ],

which must be rounded to an integer.
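These quantities can be computed in R from the summary statistics; a sketch, assuming s1, s2, n1, and n2 are already defined. Note that R's t.test, given two samples, performs this unpooled (Welch) test by default, using the unrounded degrees of freedom.

# Unpooled (Welch) standard error and degrees of freedom
# (assumes s1, s2, n1, n2 are already defined)
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)                                    # standard error of x1 - x2
df <- (v1 + v2)^2 / (v1^2/(n1 - 1) + v2^2/(n2 - 1))    # Welch degrees of freedom
round(df)                                              # rounded to an integer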

Testing for Difference of Means, Dependent Samples

Now, suppose that the two samples are actually dependent on one another.

An example would be testing the average weight loss of a group of individuals, in which each person's original weight must be paired with their current weight.

In this situation, we use the Student's t-distribution for what is called a paired t-test.

Example

Suppose that a group of 10 patients is given medication that is intended to lower their cholesterol.

Their cholesterol is tested before and after being given the medication, and they are found to have their cholesterol level lowered by an average of d = 10 mg/dL, with a sample standard deviation of sd = 8 mg/dL.

If we test at the 1% significance level, is the reduction in cholesterol level statistically significant?

Example, cont’d

Our null hypothesis is that the medication does not help; that is, H0 : µ ≤ 0, where µ is the mean reduction in cholesterol level. The alternative hypothesis is H1 : µ > 0.

The standard error is

σd = sd/√n = 8/√10 = 2.5298.

Then, the test statistic is

t∗ = (d − 0) / σd = 10 / 2.5298 = 3.9528.

We compare this value to tα,n−1 = t0.01,9 = 2.8214.

Because t∗ > tα,n−1, we reject H0 and conclude that the reduction in cholesterol level is statistically significant.
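A sketch of this calculation in R from the summary statistics; with the raw before-and-after measurements, t.test with paired = TRUE would perform the equivalent test.

# Paired t-test for the cholesterol example, from summary statistics
dbar <- 10; sd_d <- 8; n <- 10
se    <- sd_d / sqrt(n)            # 2.5298
tstar <- (dbar - 0) / se           # 3.9528
tcrit <- qt(1 - 0.01, df = n - 1)  # t_{0.01,9} = 2.8214
tstar > tcrit                      # TRUE, so we reject H0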

Testing for Difference of Proportions

We now consider testing for the difference of proportions.

As in the one-sample case, we assume that the sample sizes are large enough that the binomial distributions followed by the variables X1 and X2 can be approximated by normal distributions.

Let p1 and p2 be the true proportions of the two populations, and let p̂1 and p̂2 be the sample proportions from samples of size n1 and n2, respectively.

Testing for Difference of Proportions, cont’d

If the null hypothesis is H0 : p1 = p2, then we use this assumption of equality to compute the following estimate of the overall proportion of the two populations:

p = (n1p̂1 + n2p̂2) / (n1 + n2).

Then, the standard error of the proportion is

σp1−p2 = √( p(1 − p)(1/n1 + 1/n2) ).

Assuming Unequal Proportions

On the other hand, if the null hypothesis is H0 : p1 − p2 = d0 for some nonzero value d0, then we use the standard error

σp1−p2 = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ).

In either case, we use the test statistic

z∗ = ((p̂1 − p̂2) − d0) / σp1−p2,

where we assume d0 = 0 in the case of the null hypothesis H0 : p1 = p2.
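A sketch of the pooled version (H0 : p1 = p2) in R; the counts used here are hypothetical, chosen only to illustrate the computation.

# Two-proportion z-test of H0: p1 = p2 (hypothetical counts)
x1 <- 54; n1 <- 120            # successes and sample size for group 1
x2 <- 65; n2 <- 110            # successes and sample size for group 2
p1hat <- x1/n1; p2hat <- x2/n2
pbar  <- (x1 + x2)/(n1 + n2)                     # pooled proportion estimate
se    <- sqrt(pbar*(1 - pbar)*(1/n1 + 1/n2))
zstar <- (p1hat - p2hat)/se
2*(1 - pnorm(abs(zstar)))                        # two-tail p-value
# prop.test(c(x1, x2), c(n1, n2), correct = FALSE) should give the same p-value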

Summary

As we have seen, there are several hypothesis tests for different situations, and various ways to conduct the test that are all mathematically equivalent.

To help keep track of these situations:

▶ The test statistic is always the difference between the value of the variable being tested (e.g. the sample mean) and the value it's being tested against (e.g. µ0 if the null hypothesis is H0 : µ = µ0), divided by the standard error.

▶ We use the Student's t-distribution if the variances are unknown. We use sample standard deviations to obtain the standard error.

Standard Errors

Characteristic    Scenario                          Standard Error

µ                 Variance known                    σx = σ/√n

µ                 Variance unknown                  σx = s/√n

p                 Large sample                      σp = √( p0(1 − p0)/n )

µ1 − µ2           ni large, σi known,               σx1−x2 = √( σ1²/n1 + σ2²/n2 )
                  independent samples

µ1 − µ2           ni large, σi unknown,             σx1−x2 = sp √( 1/n1 + 1/n2 ),
                  σ1 = σ2, independent              sp = √( ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) )

Standard Errors, cont’d

Characteristic    Scenario                          Standard Error

µ1 − µ2           ni large, σi unknown,             σx1−x2 = √( s1²/n1 + s2²/n2 )
                  σ1 ≠ σ2, independent

d = µ1 − µ2       n large, σ unknown,               σd = sd/√n
                  dependent samples

p1 − p2           ni large, H0 : p1 = p2            σp1−p2 = √( p(1 − p)(1/n1 + 1/n2) ),
                                                    p = (n1p̂1 + n2p̂2)/(n1 + n2)

p1 − p2           ni large, H0 : p1 − p2 = d        σp1−p2 = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

Introduction

In this section, we will use hypothesis testing for new purposes:

▶ To determine whether a given data set follows a specific probability distribution, and

▶ To determine whether two random variables are statistically independent.

Review of Data Measurement Scales

There are four data measurement scales: nominal, ordinal, interval, and ratio.

The hypothesis testing techniques presented earlier only apply to the scales that are more quantitative: interval and ratio.

Now, though, we can use hypothesis testing for data measured in nominal or ordinal scales as well.

This is because we are working with frequency distributions, which can be constructed from any data set, regardless of its measurement scale.

The Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test uses a sample to determine whether the frequency distribution of the population conforms to a particular probability distribution that it is believed to follow.

Example: Suppose that a six-sided die is rolled 150 times, and the result of each roll is recorded. The number of rolls that are a 1, 2, 3, 4, 5, or 6 should follow a uniform distribution.

A chi-square goodness-of-fit test can be used to compare the observed number of rolls for each value, from 1 to 6, to the expected number of rolls for each value, which is 150/6 = 25.

Stating the Hypotheses

For the chi-square goodness-of-fit test, the null hypothesis H0 is that the population does follow the predicted distribution, and the alternative hypothesis H1 is that it does not.

Observed and Expected Frequencies

The chi-square goodness-of-fit test works with two frequency distributions, with the same classes, and frequencies denoted by {Oi} and {Ei}, respectively.

Each frequency Oi is the actual number of observations from the sample that belong to the ith class.

Each frequency Ei is the expected number of observations that should belong to class i, assuming H0 is true.

It is essential that the total numbers of observations in the two frequency distributions are equal; that is,

O1 + O2 + · · · + On = E1 + E2 + · · · + En,

where n is the number of classes.

Calculating the Chi-Square Statistic

The test statistic for the chi-square goodness-of-fit test, also known as the chi-square score, is given by

χ² = ∑ (Oi − Ei)² / Ei,

where the sum is taken over i = 1, . . . , n and, as before, n is the number of classes.

Determining the Critical Chi-Square Score

Once we have computed the test statistic, we compare it against the critical value χ²c, which can be obtained as follows:

▶ It can be looked up in a table of right-tail areas for the chi-square distribution, with the degrees of freedom d.f. = n − 1 and chosen significance level α, or

▶ One can use the R function qchisq with first parameter 1 − α and second parameter d.f. = n − 1; this function expects a left-tail area (probability), in contrast to the table given in Appendix A, which is why 1 − α is given as the first parameter instead of α.

If the chi-square score χ² is greater than this critical value χ²c, then we reject H0; otherwise we do not reject H0.

Because the test statistic and critical value are always positive, the chi-square goodness-of-fit test is always a one-tail test.
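A sketch of the die-rolling example in R, using hypothetical observed counts (the expected count under H0 is 25 for each face):

# Chi-square goodness-of-fit test for a fair die (hypothetical observed counts)
obs <- c(22, 28, 24, 31, 19, 26)          # observed rolls of 1 through 6 (sums to 150)
ex  <- rep(150/6, 6)                      # expected counts under H0: 25 each
chisq   <- sum((obs - ex)^2 / ex)         # chi-square score, 3.68 here
chicrit <- qchisq(1 - 0.05, df = 6 - 1)   # critical value, 11.07
chisq > chicrit                           # FALSE here, so we do not reject H0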

Characteristics of a Chi-Square Distribution

The chi-square distribution is of a very different character than other distributions that we have seen.

If Z1, Z2, . . . , Zn are independent standard normal random variables, then the random variable Q defined by

Q = Z1² + Z2² + · · · + Zn²

follows the chi-square distribution with n degrees of freedom.

It is not symmetric; rather, its values are skewed toward zero, which is the leftmost value of the distribution.

However, as the number of degrees of freedom (d.f.) increases, the distribution becomes more symmetric.

Characteristics, cont’d

The probability density function for n degrees of freedom is

fn(x) = x^(n/2 − 1) e^(−x/2) / ( 2^(n/2) Γ(n/2) ),  for x > 0.
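The change in shape as the degrees of freedom increase can be seen by plotting the density function with R's dchisq (a sketch):

# Chi-square densities for several degrees of freedom
x <- seq(0, 30, by = 0.1)
plot(x, dchisq(x, df = 2), type = "l", ylab = "density")  # strongly skewed toward zero
lines(x, dchisq(x, df = 5))                               # less skewed
lines(x, dchisq(x, df = 15))                              # nearly symmetric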

A Goodness-of-Fit Test with the Binomial Distribution

Suppose a coin is flipped 10 times, and the number of times it comes up heads is recorded.

Then, this process is repeated several times, for a total of 100 sequences of 10 flips each.

Since coin flips are Bernoulli trials, the number of heads follows a binomial distribution, which yields the expected number of sequences that produce k heads.

Observed and Expected Values

Number of heads   Observed Sequences   Expected Sequences
 0                  1                    0.098
 1                  2                    0.977
 2                  3                    4.395
 3                  9                   11.719
 4                 18                   20.508
 5                 26                   24.609
 6                 21                   20.508
 7                 13                   11.719
 8                  5                    4.395
 9                  2                    0.977
10                  0                    0.098

Performing the Chi-Square Test

Our null hypothesis H0 is that the number of heads does in fact follow a binomial distribution.

The chi-square score is

χ² = ∑ (Oi − Ei)²/Ei
   = (1 − 0.098)²/0.098 + (2 − 0.977)²/0.977 + (3 − 4.395)²/4.395 + (9 − 11.719)²/11.719
     + (18 − 20.508)²/20.508 + (26 − 24.609)²/24.609 + (21 − 20.508)²/20.508 + (13 − 11.719)²/11.719
     + (5 − 4.395)²/4.395 + (2 − 0.977)²/0.977 + (0 − 0.098)²/0.098
   = 12.274,

where the sum is taken over the classes i = 0, 1, . . . , 10.

And the Verdict is...

This is compared to the critical value χ²c, with degrees of freedom d.f. = n − 1 = 10, since there are n = 11 classes, with level of significance α = 0.05.

We can use the R expression qchisq(1-0.05,10) to obtain χ²c = 18.307.

Since χ² < χ²c, we do not reject H0, and conclude that the distribution of the number of heads from each sequence of 10 flips follows a binomial distribution, as expected.

Chi-Square Goodness-of-fit Test in R

> obs=c(1,2,3,9,18,26,21,13,5,2,0)

> pexp=dbinom(0:10,10,0.5)

> chisq.test(obs,p=pexp)

Chi-squared test for given probabilities

data: obs

X-squared = 12.2743, df = 10, p-value = 0.2671

Chi-Square Test for Independence

Now, we use the chi-square distribution to test whether two given random variables are statistically independent.

For this test, the null hypothesis H0 is that the variables are independent, while the alternative hypothesis H1 is that they are not.

Contingency Tables

To compute the test statistic, we construct a contingency table, which is a two-dimensional array, or a matrix, in which each cell contains an observed frequency of an ordered pair of values of the two variables.

That is, the entry in row i, column j, which we denote by Oi,j, contains the number of observations that fall into class i of the first variable and class j of the second.

The frequencies in this table are the observed frequencies for the chi-square goodness-of-fit test.

Computing Expected Frequencies

Next, for each row i and each column j, we compute Ei,j, which is:

(sum of entries in row i) × (sum of entries of column j),

divided by the total number of observations, to get the expected frequencies for the chi-square goodness-of-fit test.

Relation to Independent Events

That is, if the contingency table has m rows and n columns, then

Ei,j = ( ∑k Oi,k ) ( ∑ℓ Oℓ,j ) / ( ∑ℓ ∑k Oℓ,k ),

where k ranges over the columns 1, . . . , n and ℓ ranges over the rows 1, . . . , m.

It should be noted that this quantity, divided again by the total number of observations, is exactly P(Ai)P(Bj), where Ai is the event that the first variable falls into class i, and Bj is the event that the second variable falls into class j.

By the multiplication rule, this probability would equal P(Ai ∩ Bj) if the variables were independent.

The Test Statistic

Then, the test statistic is

χ² = ∑i ∑j (Oi,j − Ei,j)² / Ei,j,

where i ranges from 1 to m and j from 1 to n.

We then obtain the critical value χ²c using d.f. = (m − 1)(n − 1) and our chosen level of significance α.

As before, if χ² > χ²c, then we reject H0 and conclude that the variables are in fact statistically dependent.
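In R, both the expected frequencies and the test statistic can be computed directly from a contingency table; a sketch, assuming the observed counts are stored in a matrix M (as in the R example later in this section):

# Expected frequencies E[i,j] = (row i total)(column j total) / (grand total)
# (assumes the observed contingency table is stored in the matrix M)
E <- outer(rowSums(M), colSums(M)) / sum(M)
chisq <- sum((M - E)^2 / E)     # chi-square test statistic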

Example

Suppose that 300 voters were surveyed, and classified according to gender and political affiliation: Democrat, Republican, or Independent.

The contingency table for these classifications is as follows:

                       Affiliation
Gender     Democrat    Republican    Independent    Total
Female        68           56             32         156
Male          52           72             20         144
Total        120          128             52         300

That is, 68 of the voters are female and Democrat, 72 of the voters are male and Republican, and so on.

The entry in row i and column j is the observation Oi,j.

Computing Expected Frequencies

Let Gi be the event that the voter is of the gender for row i , i = 1, 2, andlet Aj be the event that the voter’s affiliation corresponds to column j ,j = 1, 2, 3. Then, we compute the expected observations as follows:

(i, j)   Gi ∩ Aj               Ei,j
(1, 1)   Female, Democrat      (156)(120)/300 = 62.40
(1, 2)   Female, Republican    (156)(128)/300 = 66.56
(1, 3)   Female, Independent   (156)(52)/300 = 27.04
(2, 1)   Male, Democrat        (144)(120)/300 = 57.60
(2, 2)   Male, Republican      (144)(128)/300 = 61.44
(2, 3)   Male, Independent     (144)(52)/300 = 24.96
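
As a quick check, the expected frequencies can be computed in R directly from the observed contingency table. This is a minimal sketch, not part of the original example; the matrix M is the same one entered in the chisq.test call later in this section.

> M=matrix(c(68,52,56,72,32,20),nrow=2,ncol=3)   # rows: gender; columns: affiliation
> E=outer(rowSums(M),colSums(M))/sum(M)          # (row total)(column total)/(number of observations)
> E                                              # should reproduce 62.40, 66.56, 27.04, 57.60, 61.44, 24.96

The same matrix is also returned by chisq.test(M)$expected.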

The Test Statistic

Then, the test statistic is

\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{3} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}
       = \frac{(68 - 62.4)^2}{62.4} + \frac{(56 - 66.56)^2}{66.56} + \frac{(32 - 27.04)^2}{27.04} + \frac{(52 - 57.60)^2}{57.60} + \cdots
       = 6.433.

We compare this value against the critical value χ²_c, with degrees of freedom d.f. = (2 − 1)(3 − 1) = 2 and significance level 0.05.

Since this critical value is χ²_c = 5.991, and χ² > χ²_c, we reject the null hypothesis that gender and political affiliation are independent.

Independence Test in R

> M=matrix(c(68,52,56,72,32,20),nrow=2,ncol=3)   # rows: gender; columns: affiliation

> chisq.test(M)

Pearson’s Chi-squared test

data: M

X-squared = 6.4329, df = 2, p-value = 0.0401

Introduction

In the previous section, we learned how to determine whether two random variables were statistically dependent on one another, using the chi-square test for independence.

However, that test alone does not give us any indication of how the variables are related.

In this section, we will learn how to use correlation and regression to gain some insight into the nature of the relationship between two variables.

Independent and Dependent Variables

In the following discussion, we classify one of the variables, x, as the independent variable, and the other variable, y, as the dependent variable.

This means that x serves as the "input" and y serves as the "output".

Mathematically, y is a function of x, meaning that y is determined from x in some systematic way.

Therefore, for each value of x, there is only one value of y, whereas one value of y can correspond to more than one value of x.

Correlation

Correlation measures the strength and direction of the relationship between x and y. Types of correlation are:

I positive linear correlation, which means that as x increases, y increases linearly,

I negative linear correlation, which means that as x increases, y decreases linearly,

I nonlinear correlation, which means that there is a clear relationship between x and y, but the dependence of y on x cannot be described graphically using a straight line, and

I no correlation, which means that there is no clear relationship between x and y.

In the remainder of this discussion, we will limit ourselves to linear correlation.

Correlation Coefficient

To determine the correlation between two variables x and y, for which we have n observations of each, we compute the correlation coefficient, which is defined by

r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}.

Geometrically, r is the cosine of the angle between the vector of x-values and the vector of y-values, with their respective means subtracted. It follows from this interpretation that |r| ≤ 1.
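
In R, the correlation coefficient is computed by the built-in function cor. As a small illustration (not part of the original slides), we can apply it to the data vectors that are reused in the regression example later in this section.

> x=c(1:10)
> y=c(8,6,10,6,10,13,9,11,15,17)
> cor(x,y)    # evaluates the formula above; roughly 0.81 for these data, a fairly strong positive linear correlation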

Interpretation

If r > 0, then x and y have a positive linear correlation, whereas if r < 0, then x and y have a negative linear correlation.

If r = 0, then there is no correlation between x and y.

In the extreme cases r = ±1, the points lie exactly on a line y = a + bx whose slope b is positive (r = 1) or negative (r = −1).

The benefit of knowing whether two variables are linearly correlated is that we can, at least approximately, predict values of the dependent variable y from values of the independent variable x.

Of course, the accuracy of this prediction depends on |r|; if r is nearly zero, such a prediction is not likely to be reliable.

Testing the Significance of r

Suppose we have determined that x and y are linearly correlated, based on the value of the correlation coefficient r obtained from a sample.

How do we know whether a similar correlation applies to the entire population?

We can answer this question by performing a hypothesis test on the population correlation coefficient, which we denote by ρ.

If we only wish to test whether ρ is nonzero, then we can use a two-tail test, with null hypothesis H0 : ρ = 0 and alternative hypothesis H1 : ρ ≠ 0.

On the other hand, if we wish to test for a positive linear correlation, we can perform a one-tail test with null hypothesis H0 : ρ ≤ 0 and alternative hypothesis H1 : ρ > 0; testing for a negative linear correlation is similar.

Performing the Test

For this test, we use the Student t-distribution. The test statistic is

t^* = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}},

where, as before, n is the sample size for each variable, d.f. = n − 2 is the number of degrees of freedom, and \sqrt{(1 - r^2)/(n - 2)} is the standard error of the correlation coefficient.

For the one-tail test with H0 : ρ ≤ 0, we reject H0 and conclude that x and y have a positive linear correlation if t* > tα.

For the two-tail test with H0 : ρ = 0, we reject H0 and conclude that x and y are linearly correlated if |t*| > tα/2.
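
In R, this test is carried out by the built-in function cor.test, which reports the test statistic t* and a p-value. The following sketch (not from the original slides) uses the same small data set as the regression example later in this section.

> x=c(1:10)
> y=c(8,6,10,6,10,13,9,11,15,17)
> cor.test(x,y)                          # two-tail test of H0: rho = 0
> cor.test(x,y,alternative="greater")    # one-tail test of H0: rho <= 0

We reject H0 if the reported p-value is less than our chosen level of significance α.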

Correlation vs. Causation

Always keep in mind: correlation does not imply causation!

Meaning: it often occurs that variables exhibit a correlation with one another even though neither has any influence whatsoever on the other

Even if there is a causal relationship, it is not always clear which is the cause and which is the effect!

Reverse Causality

Case in point: the "effect" of Course Signals on student retention at Purdue University

Purdue developed Course Signals to use analytics to alert faculty and staff to potential problems for students

Purdue claimed that when students took at least two courses that used Course Signals, retention improved by 21%!

This conclusion was supported by appropriate data, so what could be the problem?

Look for Anomalies!

It was observed from the data that taking two Course Signals courses greatly improved retention, whereas taking only one did not help at all

Also, an initial bump in retention rate quickly faded after Course Signals had been in use for a few years

What the data was really showing was that students were taking more Course Signals courses because they were taking more courses overall (that is, they did not control for freshmen dropping out early)

In other words, it was retention that led to increased use of Course Signals, not the other way around!

Reference: "What the Course Signals 'Kerfuffle' is About, and What it Means to You" by Michael Caulfield, posted at educause.edu

Causal Inference

Given that two variables are correlated, the ideal approach to establishing causation is to understand the mechanism by which it acts

Failing that, another approach, if less effective, is to perform a controlled intervention study

Establishing causation based solely on observations is much less reliable, but more broadly applicable

In fact, this is impossible without making assumptions about the data

Reference: Max Planck Institute

Simple Regression

If x and y are found to be linearly correlated, then we can use simple regression to find the straight line that best fits the ordered pairs (xi, yi), i = 1, 2, . . . , n.

The equation of this line is

ŷ = a + bx,

where ŷ is the predicted value of y obtained from x.

The y-intercept a and slope b need to be determined.

The Least Squares Method

To find the values of a and b such that the line ŷ = a + bx best fits the sample data, we use the least squares method.

In this method, we compute a and b so as to minimize

\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2.

The name of the method comes from the fact that we are trying to minimize a sum of squares, of the deviations between y and ŷ.

The line ŷ = a + bx that minimizes this sum of squares, and therefore best fits the data, is called the regression line.

Solving the Least Squares Problem

The criterion of minimizing the sum of squares is chosen because it is differentiable, and is therefore suitable for minimization techniques from calculus. The minimizing coefficients are

b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}, \qquad a = \bar{y} - b\bar{x},

where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i are the sample means.
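
As a sketch (not part of the original slides), these formulas can be evaluated directly in R; the vectors below are the ones used in the lsfit example that follows, so the results should agree with that example's output.

> x=c(1:10)
> y=c(8,6,10,6,10,13,9,11,15,17)
> n=length(x)
> b=(n*sum(x*y)-sum(x)*sum(y))/(n*sum(x^2)-sum(x)^2)   # slope from the formula above
> a=mean(y)-b*mean(x)                                  # intercept a = ybar - b*xbar
> c(a,b)                                               # should match the coefficients reported by lsfit(x,y)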

Discussion

It should be noted that b is closely related to the correlation coefficient r; the formulas have the same numerator.

It follows that the slope is positive if and only if the correlation coefficient indicates that x and y have a positive linear correlation.

In R, the least squares method is implemented in the function lsfit. Its simplest usage is to specify two arguments, which are vectors consisting of the x- and y-values, respectively.

It returns a data structure called a named list, which includes the coefficients a and b of the regression line.

Example

The following code illustrates the use of lsfit, including extraction of the y-intercept a and slope b. Then, both the data points and regression line are plotted.

> x=c(1:10)

> y=c(8,6,10,6,10,13,9,11,15,17)

> lslist=lsfit(x,y)

> coefs=lslist[["coefficients"]]

> coefs

Intercept X

5.1333333 0.9757576

Extracting the Coefficients

> a=coefs[["Intercept"]]

> b=coefs[["X"]]

> a

[1] 5.133333

> b

[1] 0.9757576

> plot(x,y)

> abline(a,b)

Plot of Regression Line

It is merely coincidence that in this example, the regression line happens to pass through one of the points; in general this does not happen, as the goal of the least squares method is to minimize the distance between all of the predicted y-values and observed y-values.

Confidence Interval for the Regression Line

To measure how well the regression line fits the data, we can construct a confidence interval. We use the standard error of the estimate,

s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - a\sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i y_i}{n - 2}},

which measures the amount of dispersion of the observations around the regression line.

The smaller s_e is, the closer the points are to the regression line.

It is worth noting the similarity between this formula and the sample standard deviation; the number of degrees of freedom is n − 2 since two degrees of freedom are taken away by the coefficients a and b of the regression line.
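
A minimal sketch (not from the original slides): s_e can be computed in R from the residuals returned by lsfit, using the data from the earlier example.

> x=c(1:10)
> y=c(8,6,10,6,10,13,9,11,15,17)
> fit=lsfit(x,y)
> se=sqrt(sum(fit$residuals^2)/(length(x)-2))   # standard error of the estimate
> se                                            # smaller values indicate a tighter fit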

Testing the Slope of the Regression Line

We need to determine whether the slope b of the regression line is indicative of the slope β for the population. To that end, we can perform a hypothesis test.

For example, we can use the null hypothesis H0 : β = β0 and alternative hypothesis H1 : β ≠ β0 for a two-tail test.

If β0 = 0, then we are testing whether there is any linear relationship between x and y, and rejection of H0 would imply that this is the case.

Standard Error of the Slope

The standard error of the slope is

s_b = \frac{s_e}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}},

where s_e is the standard error of the estimate, defined earlier.

Note that s_b is the standard deviation of the y-values about the regression line, divided by approximately √n times the standard deviation of the x-values, which intuitively makes sense because we are testing the slope, which is the change in y divided by the change in x.

Test Statistic

As with the test of the correlation coefficient, we use the Student t-distribution to determine the critical value.

The test statistic is

t^* = \frac{b - \beta_0}{s_b}.

This is compared to the critical value tα/2,n−2, the t-value satisfying P(Tn−2 > tα/2,n−2) = α/2.

If |t*| > tα/2,n−2, then we reject H0 and conclude that β ≠ β0.

If β0 = 0, then our conclusion is that x and y are linearly correlated.
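
In practice, this test is usually read off the output of lm. The following sketch (an assumption about typical R usage, not taken from the original slides) shows the coefficient table produced by summary, which contains, for each coefficient, its estimate, its standard error, the t statistic for H0 : β = 0, and the corresponding two-tail p-value.

> x=c(1:10)
> y=c(8,6,10,6,10,13,9,11,15,17)
> summary(lm(y ~ x))$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)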

Assumptions

For the least squares method to be valid, we need to make the following assumptions:

I Individual differences between yi and ŷi, i = 1, 2, . . . , n, are independent of one another.

I The observed values of y are normally distributed around ŷ.

I The variation of y around the regression line is equal for all values of x.

Polynomial Regression

In linear regression, we are trying to find constants a and b such that the function y = a + bx best fits the data (xi, yi), i = 1, 2, . . . , n, in the least-squares sense.

The method of least squares can readily be generalized to the problem of finding constants c0, c1, . . . , cm such that the function

y = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m,

a polynomial of degree m, best fits the data.

System Set-up

We define the n × (m + 1) matrix

A = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix},

and the vectors

\mathbf{c} = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_m \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

A is known as a Vandermonde matrix.

The Normal Equations

Then, by solving the normal equations

A^T A \mathbf{c} = A^T \mathbf{y},

we obtain the coefficients of the best-fitting polynomial of degree m.

Note that A^T is the transpose of A, which is obtained by changing rows into columns; that is, (A^T)_{ij} = a_{ji}.
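
The following sketch (hypothetical data, not from the original slides) solves the normal equations directly for a degree-2 fit. In practice the function lm, used in the example that follows, is preferred because it solves the least squares problem with a more numerically stable QR factorization rather than forming A^T A explicitly.

> x=c(0.1,0.4,0.5,0.7,0.9,1.2)       # hypothetical data for illustration
> y=c(1.0,1.3,1.4,1.9,2.4,3.3)
> m=2
> A=outer(x,0:m,"^")                 # Vandermonde matrix with columns 1, x, x^2
> solve(t(A)%*%A,t(A)%*%y)           # solves A^T A c = A^T y for c0, c1, c2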

Example

The following R statements construct data vectors x and y, and then call the function lm (short for "linear model") to obtain the coefficients of the best-fitting quadratic:

> x=c(0.6291,0.2956,0.6170,0.9885,0.3440,0.2396,0.0004,...

> y=c(0.7487,0.6169,0.1834,0.8436,0.7160,0.6518,0.6128,...

> lm(y ~ poly(x,2,raw=TRUE))

Call:

lm(formula = y ~ poly(x, 2, raw = TRUE))

Coefficients:

(Intercept) poly(x,2,raw=TRUE)1 poly(x,2,raw=TRUE)2

0.6741 -0.4575 0.6512

That is, the quadratic function that best fits the data is y = 0.6512x² − 0.4575x + 0.6741.

Multiple Linear Regression

A similar approach can be used for multiple linear regression, in which we seek a model of the form

y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_m x_m.

Let x_{ij} be the ith observation of x_j. We define the matrix A by

A = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1m} \\ 1 & x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}.

Then, we solve the normal equations

A^T A \mathbf{c} = A^T \mathbf{y}

to obtain the coefficients c_0, c_1, \ldots, c_m.

Example

Suppose that we have a set of n observations (xi1, xi2, yi), i = 1, 2, . . . , n, and seek the coefficients c0, c1, c2 so that the model

y = c0 + c1x1 + c2x2

best fits the data in the least-squares sense.

Getting the Job Done in R

The following R statements obtain these coefficients.

> x1=c(0.4092,0.9977,0.6238,0.3532,0.1827,0.3209,..

> x2=c(0.9525,0.8742,0.1622,0.1467,0.6498,0.7901,...

> y=c(0.2549,0.9122,0.3675,0.0380,0.6508,0.8164,...

> lm(y ~ x1 + x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:

(Intercept) x1 x2

0.03273 0.18821 0.59569

That is, c0 = 0.03273, c1 = 0.18821, and c2 = 0.59569.

Maximum Likelihood

Let x1, x2, . . . , xn be a sample of n i.i.d. (independent and identically distributed) observations, coming from an unknown distribution with probability distribution function of the form f(x, θ)

The method of maximum likelihood is used to obtain an estimate θ̂ of the unknown parameter θ

Because the observations are independent, we have

f(x_1 \cap x_2 \cap \cdots \cap x_n \,|\, \theta) = f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta)

The maximum likelihood estimator (MLE) is the value of θ that maximizes the average log-likelihood

\hat{\ell} = \frac{1}{n}\sum_{i=1}^{n} \ln f(x_i|\theta)

Example

Let the n observations be coin flips of an unfair coin, and let h be the number of heads. These flips follow a binomial distribution

f(X = h \,|\, \theta) = \binom{n}{h}\theta^h(1 - \theta)^{n-h}

with unknown probability of success θ

The MLE θ̂ maximizes

\frac{1}{n}\ln\left[\binom{n}{h}\theta^h(1 - \theta)^{n-h}\right] = \frac{1}{n}\left[\ln\binom{n}{h} + h\ln\theta + (n - h)\ln(1 - \theta)\right]

which, through calculus (setting the derivative with respect to θ equal to zero gives h/θ − (n − h)/(1 − θ) = 0), is maximized at θ̂ = h/n
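
A minimal numerical check (not part of the original slides; n and h are hypothetical values): maximizing the binomial log-likelihood with R's optimize should return a value very close to h/n.

> n=100; h=37                                        # hypothetical number of flips and heads
> loglik=function(theta) dbinom(h,size=n,prob=theta,log=TRUE)
> optimize(loglik,interval=c(0.001,0.999),maximum=TRUE)$maximum   # should be close to h/n = 0.37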

Analysis of Variance

Previously, we have learned how to test a single mean, and how to compare two means.

Analysis of variance, also known as ANOVA, is useful for comparing three or more population means.

The Hypotheses

Suppose that we have m samples, each of size ni, for i = 1, 2, . . . , m, drawn from m populations with means µ1, µ2, . . . , µm.

Let the ith sample consist of observations xij, j = 1, 2, . . . , ni, with sample mean x̄i.

For one-way ANOVA, the null hypothesis is H0 : µ1 = µ2 = · · · = µm. That is, all of the population means are equal.

The alternative hypothesis H1 is that there is a statistically significant difference between at least two of the population means.

Stuff We Need

To perform one-way ANOVA, we need to compute the following:

\bar{\bar{x}} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i}

is the grand mean, which is the mean of all observations from all samples.

SSW = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2

is the "sum of squares within groups".

SSB = \sum_{i=1}^{m} n_i(\bar{x}_i - \bar{\bar{x}})^2

is the "sum of squares between groups".

More Stuff We Need

Let n = n1 + n2 + · · · + nm be the total number of observations from all samples.

MSW = \frac{SSW}{n - m}

is the "within-sample variance", also known as the mean square error (MSE). It is an estimate of the variance σ² whether H0 is true or not.

MSB = \frac{SSB}{m - 1}

is the "between-sample variance". It is an estimate of the variance σ² only if H0 is true. Otherwise, it is quite large compared to MSW.

The Test

The test statistic is

F^* = \frac{MSB}{MSW}.

This is compared to the critical value Fα,m−1,n−m from the F-distribution.

If F* ≤ Fα,m−1,n−m, then we do not reject H0, and conclude that the population means are equal.

Alternatively, we could compute the p-value P(F > F*) and compare it to our level of significance α.

The F-distribution

The F-distribution:

1. is not symmetric; it is skewed toward zero.

2. becomes more symmetric as its degrees of freedom increase.

3. has a total area under the curve of 1.

4. has a mean that is approximately 1.

[Figure: the F-distribution]

Pairwise Comparisons

Suppose that H0 from ANOVA is rejected, so that we know that at least two of the means are statistically different.

To find out which means are different, we can use the Scheffé test to compare each pair of sample means.

For each pair (i, j), i, j = 1, 2, . . . , m, the test statistic is

F^*_S = \frac{(\bar{x}_i - \bar{x}_j)^2}{\dfrac{SSW}{n - m}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}.

This is compared to F_SC = (m − 1)Fα,m−1,n−m.

If F*_S > F_SC, we conclude that means i and j are different.

Example

A consumer group is testing the gas mileage of three different models of cars. Each car was driven 500 miles and the mileage recorded as follows:

Model 1   Model 2   Model 3
22.5      18.7      17.2
20.8      19.8      18.0
22.0      20.4      21.1
23.6      18.0      19.8
21.3      21.4      18.6
22.5      19.7

Example: One-way ANOVA

We have n1 = n2 = 6, n3 = 5, n = n1 + n2 + n3 = 17, and m = 3.

Our means are x̄1 = 22.2, x̄2 = 19.7, x̄3 = 18.9, and

\bar{\bar{x}} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + n_3\bar{x}_3}{n_1 + n_2 + n_3} = 20.3.

We then compute SSW = 21.6 and SSB = 31.5, followed by

MSW = \frac{SSW}{n - m} = \frac{21.6}{14} = 1.54, \qquad MSB = \frac{SSB}{m - 1} = \frac{31.5}{2} = 15.73,

which yields F* = MSB/MSW = 10.19 (here MSB and F* are computed from the unrounded sums of squares).

This is compared to F0.05,2,14 = 3.74. Since F* > F0.05,2,14, we reject H0 and conclude that the means are not equal.
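
The same analysis can be carried out with R's aov function. This is a sketch assuming the mileage data are entered as a single vector with a factor indicating the model; the reported F value and degrees of freedom should agree with the computation above, up to rounding.

> mileage=c(22.5,20.8,22.0,23.6,21.3,22.5,    # Model 1
+           18.7,19.8,20.4,18.0,21.4,19.7,    # Model 2
+           17.2,18.0,21.1,19.8,18.6)         # Model 3
> model=factor(rep(c("Model 1","Model 2","Model 3"),times=c(6,6,5)))
> summary(aov(mileage~model))                 # one-way ANOVA table with F value and p-value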

Example: Pairwise Comparisons

To see which means are not equal, we obtain

F_SC = (m − 1)Fα,m−1,n−m = 2F0.05,2,14 = 7.48.

The test statistic F*_S is computed for all 3 pairs of means and compared against F_SC:

F^*_{S,1,2} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\dfrac{SSW}{n - m}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} = 11.7,

and similarly, F*_{S,1,3} = 17.83 and F*_{S,2,3} = 0.93.

Since F*_{S,1,2} and F*_{S,1,3} are both greater than F_SC, we conclude those pairs of means are different, whereas F*_{S,2,3} < F_SC, so µ2 and µ3 are equal.

What is Biostatistics?

Biostatistics is the application of statistics to biological research

Biostatistics deals with the design of experiments, the collection of data, the analysis of data, and the interpretation of results

Applications

I Public health: epidemiology, health services research, nutrition, environmental health, healthcare policy and management

Check out https://publichealthwatch.wordpress.com by @RVAwonk

I Design and analysis of clinical trials

I Assessment of the severity state of a patient

I Population genetics: correlating phenotype with genotype

I Human genetics: correlating alleles with diseases

I Climate envelope modeling: correlations between species distributions and environmental variables define a species' tolerance

I Sequence analysis: matching new and existing DNA/RNA/peptide sequences to understand the biology of organisms

Studies in epidemiology

Types of studies:

I Case series

I Case-control studies

I Cohort studies

Case Series

A case series study compares periods of time during which patients are exposed to some potentially illness-causing factor with periods without exposure

Poisson regression techniques are used to compare incidence rates between exposed and unexposed periods

In Poisson regression, the response Y is assumed to follow a Poisson distribution, and the logarithm of its expected value depends linearly on the independent variables

R code: glm(y ~ offset(log(exposure)) + x, family = poisson(link = log))

Used to study adverse reactions to vaccination

Case-control Studies

Case-control studies are retrospective studies that select patients based on their disease status

Given this table:

            Cases   Controls
Exposed     A       B
Unexposed   C       D

these studies examine the statistic AD/BC, the odds ratio (OR)

If OR ≫ 1, cases are likely linked to exposure. If OR ∼ 1, exposure is not likely associated with the disease. If OR ≪ 1, exposure is protective

Drawbacks: sensitive to bias, and cost prohibitive for smaller values of OR

Cohort Studies

Cohort studies are prospective studies that select subjects based on their exposure status

Using the same table as in case-control studies, the statistic of interest is the relative risk

RR = \frac{P_e}{P_u}, \qquad P_e = \frac{A}{A + B}, \qquad P_u = \frac{C}{C + D}

RR > 1 shows association

RR is more reliable than OR, but the studies are more costly, and follow-up is problematic due to their long duration
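
As a small sketch (hypothetical counts, not from the original slides), both the odds ratio and the relative risk can be computed directly from the 2×2 table above.

> A=40; B=60                 # exposed: cases, controls (hypothetical counts)
> C=10; D=90                 # unexposed: cases, controls
> OR=(A*D)/(B*C)             # odds ratio AD/BC
> Pe=A/(A+B); Pu=C/(C+D)
> RR=Pe/Pu                   # relative risk
> c(OR=OR,RR=RR)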

Precision and Bias

Sources of error in epidemiological studies:

I Random error: can be reduced by increasing the sample size or reducing measurement error, but both are costly

I Selection bias: for example, nonsmokers tend to participate in studies more often than smokers, which can skew results

I Information bias: systematic error in the assessment of variables, for example recall bias

I Confounding: bias due to mixing of extraneous effects (confounders) with effects of interest

Jackknifing

Jackknifing is a resampling technique useful for variance and bias estimation

After estimating a parameter from a sample of size n (e.g. estimating a population mean with a sample mean), jackknifing entails computing new estimates based on each subset of n − 1 observations from the sample (leaving out one observation at a time), and then averaging those new estimates

This allows bias to be substantially reduced
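
A minimal sketch of the leave-one-out mechanics (hypothetical data, not from the original slides). For the sample mean, the jackknife average reproduces the ordinary sample mean exactly; its value shows up when the estimator of interest is biased.

> x=c(12.1,9.8,11.4,10.3,13.0,9.5,10.9,11.7)          # hypothetical sample
> jack=sapply(seq_along(x),function(i) mean(x[-i]))   # estimate with observation i left out
> mean(jack)                                          # average of the leave-one-out estimates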

Bootstrapping

Inspired by jackknifing, bootstrapping (Efron 1979) is a resampling technique based on the idea that making an inference about a population based on a sample is analogous to making an inference about the sample based on a resample, but the accuracy of the latter can be measured because the "exact answer" is known

Process: given an original sample of size n, take resamples of size n using sampling with replacement, to obtain a sampling distribution of the desired parameter

Useful when the underlying distribution of the population is unknown, or when the sample size is too small for standard hypothesis testing

Particularly advantageous for obtaining standard errors and confidence intervals for complex estimators such as correlation coefficients
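
A minimal sketch (not from the original slides) of a nonparametric bootstrap for the correlation coefficient, reusing the small x and y vectors from the regression examples earlier in this section.

> x=c(1:10)
> y=c(8,6,10,6,10,13,9,11,15,17)
> B=2000                                                          # number of resamples
> rboot=replicate(B,{i=sample(length(x),replace=TRUE); cor(x[i],y[i])})
> sd(rboot)                                                       # bootstrap standard error of r
> quantile(rboot,c(0.025,0.975))                                  # simple 95% percentile interval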

What is Big Data?

Wikipedia: "Big data" refers to any collection of data sets so large and complex that processing with on-hand data management tools or traditional data processing applications becomes problematic.

SAS/Doug Laney: Big data is characterized by the "three V's": volume, velocity and variety. SAS adds two more dimensions: variability and complexity.

Whereas business intelligence uses descriptive statistics, big data uses inferential statistics

Examples

I UPS uses telematic sensors in its trucks, in conjunction with map data, to redesign delivery routes and reduce fuel costs

I United Healthcare analyzes voice data from customer calls to detect dissatisfied customers for possible intervention

I Bank of America analyzes customer interactions in order to present appealing offers

I Sears has used big data technology and real-time processing to release complex marketing campaigns in one week rather than eight weeks

I GE installs sensors on the blades of its gas turbines that generate gigabytes of data per day (each) to detect and analyze defects

I Schneider uses driving sensors on its trucks to detect indicators of potential accidents and allows intervention before they occur

Case Study: 2012 Presidential Election

Organization of Obama campaign: driven by centralized analytics

Key ingredients:

I Dedicated analytics department, and analysts at campaign offices and in the field

I HP Vertica, R and Stata

I Hadoop considered, but not as scalable or efficient, plus steeper learning curve

I AirWolf: data from the field fed back into Vertica to drive the digital arm of the campaign

I Media Optimizer: combined voter data, TV ratings data and ad price data for targeted ad buys

Exploratory Data Analysis

The size and complexity of raw data sets inhibits the discovery of insight from Big Data. Only if the data is viewed in the right way can such insight be gained

Statistical methods and machine learning algorithms can only help so much. The best tool for the job? Our brains!

Exploratory Data Analysis (EDA) involves "playing with data", and is the first step in working with Big Data

Techniques: visualization, principal component analysis, multidimensional scaling, etc.

Requirement: creativity!

Business Analytics

Business analytics uses data from past performance to predict future performance and guide business planning

Types of business analytics:

I Descriptive analytics: descriptive statistics applied to historical data

I Predictive analytics: techniques from inferential statistics and machine learning to make predictions from historical data

I Prescriptive analytics: a variation of predictive analytics in which data is used to guide decisions and predict their effects

Predictive analytics

Main idea: use predictive models to forecast data that does not yet exist

Techniques used: regression (linear, logistic, time series, etc.), machine learning (neural networks, support vector machines, etc.)

Examples:

I CRM: using customer information (e.g. order history) to predict future purchases and promote relevant products at touch points

I Clinical decision support: using patient data to predict development of disease

I Non-temporal prediction: using social media activity data to predict potential to influence, or predicting one's sentiment from their postings

Prescriptive analytics

Beyond predicting what may happen, prescriptive analytics guides decisions based on predictions

Prescriptive analytics adds these components to predictive analytics:

I Actionability: must be able to act on whatever prediction is produced

I Feedback: the result of any action is fed back into the predictive model to refine the prediction

Requirement: predictive window > reaction time

Examples:

I Oil and gas: guiding deployment of capital to maximize effectiveness of resource extraction in spite of uncertainty

I Healthcare: guiding care delivery and capital investments by providers, or administration of clinical trials for pharmaceutical companies


Google Flu Trends

In 2008, Google launched “Google Flu Trends” to track the spread of flu across the United States.

Using their top 50 million search terms, they looked for a correlation between searches and flu symptoms.

As a result, Google Flu Trends could track flu more rapidly than the CDC, without the physician-reported data the CDC relies upon, and without any hypothesis about which search terms might correlate with flu symptoms: the algorithms did all the work.


Target

In Minnesota, a man complained to Target because they were sending maternity-oriented coupons to his teenage daughter.

As it turned out, she really was pregnant. Target had suspected it before her father did, based on her purchases of unscented wipes and magnesium supplements.


Street Bump

The city of Boston released the Street Bump app to enable smartphones to automatically detect potholes using their accelerometers.

This way, city workers did not need to patrol the streets looking for potholes: the app could tell them where to go!


Implications?

Incidents like these can lead us to believe:

- Data analysis can produce amazingly accurate results

- Every single data point can be captured (“n = All”), so who needs sampling anymore?

- Why worry about causation, if correlation tells us what we need to know?

- Given enough data, the numbers speak for themselves, so who needs theory?


Not So Fast...

By 2013, Google Flu Trends had lost its accuracy, dramatically overstating the prevalence of flu.

The lesson: correlation isn’t everything. If you don’t know the reason behind a correlation, you also don’t know what can cause it to break down!

Possible explanation: news stories about the flu in 2012 may have prompted flu-related searches by healthy people.

Remember history: The Literary Digest sampled over 2 million readers to predict the result of the 1936 presidential election and got it wrong, while Gallup got it right with 3,000. Statistical techniques matter.


Oh, What About Target?

They may have “discovered” one pregnancy, but they used similar coupon-distribution techniques on many other customers for whom their “discoveries” were wrong.

The numbers don’t tell a story by themselves, no matter how many of them there are. Rudimentary analysis produces, at best, a somewhat educated guess.

“There are a lot of small data problems in big data... they don’t disappear because you’ve got lots of the stuff. They get worse.” – Prof. David Spiegelhalter, Cambridge


And, About those Potholes...

The Street Bump app works well, but consider this: which residents does it really serve?

The effect of the app was that potholes were fixed in areas with young, affluent residents, who are more likely to own smartphones.

Every bump from every enabled smartphone may have been recorded, but many potholes were still overlooked!

The lesson: “n = All” is a seductive illusion. Don’t fall for it!


Big Data: Warning #1

Using very large data sets can help detect correlations that might be overlooked in smaller samples.

However, that doesn’t mean that these correlations are meaningful!

From 2006 to 2011, the market share of Internet Explorer dropped precipitously... as did the murder rate in the United States. What does this tell us?
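To see how easily a strong but meaningless correlation arises between two series that merely trend in the same direction, here is a small R sketch; the numbers below are simulated for illustration, not the actual browser-share or crime figures.

# Illustrative only: two unrelated, independently declining series
# still show a very strong correlation simply because both trend downward.
set.seed(42)
years <- 2006:2011
ie_share    <- 65 - 8 * (years - 2006) + rnorm(length(years), sd = 2)      # simulated
murder_rate <- 5.8 - 0.3 * (years - 2006) + rnorm(length(years), sd = 0.1) # simulated

cor(ie_share, murder_rate)   # typically close to +1, yet the series are unrelated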


Big Data: Warning #2

Big data should be considered a tool for scientific inquiry, not a replacement for it!

Example: molecular biologists can’t understand the structure of proteins from data alone; any statistical analysis must be informed by knowledge of the underlying physics and biochemistry.


Big Data: Warning #3

Tools based on big data can be gamed!

Example: “Google bombing” to manipulate search results

A more insidious example: big data programs for grading student essays tend to favor characteristics like sentence length or word sophistication, due to their correlation with grades given by humans. What do you expect will happen?


Big Data: Warning #4

Large data sets are constantly changing, and so is the way in which they are collected.

This is especially true for data based on web requests, such as the data that Google collects.

In fact, the changing nature of Google’s data, and the failure to take this change into account, is considered a contributing factor in the problems with Google Flu Trends.


Big Data: Warning #5

Beware of the echo-chamber effect: much of big data comes from the web, but much of what’s on the web comes from big data!

Example: Google Translate relies on pairs of parallel texts written in different languages, such as Wikipedia articles written in two languages.

But how many of these foreign-language Wikipedia pages were written using Google Translate?!


Big Data: Warning #6

If you look long enough for a correlation, you will find one!

That is, given enough data, strong correlations may appear simply by chance.

The more data, the more bogus patterns can be found, resulting in badly flawed inferences!
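A small simulation (illustrative only) shows how readily large correlations appear in pure noise once enough variables are compared:

# Illustrative simulation: with enough variables, strong pairwise correlations
# appear in pure noise. Here, 200 unrelated variables with only 20 observations each.
set.seed(7)
noise <- matrix(rnorm(20 * 200), nrow = 20, ncol = 200)
r <- cor(noise)          # all pairwise correlations between the 200 columns
diag(r) <- 0             # ignore each variable's correlation with itself
max(abs(r))              # often exceeds 0.7, despite no real relationships existing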


Big Data: Warning #7

Don’t be fooled into thinking that big data can deliver scientific-sounding answers to questions that we could never answer precisely.

Example: Wikipedia is being used to rank people in terms of “historical importance” or “cultural contributions.”

Two separate projects of this kind correctly identify Jesus, Lincoln, and Shakespeare as very important people, but are we really supposed to believe that Nostradamus was the 20th most important writer in history, or that Francis Scott Key was the 19th most important poet?

Big data can reduce anything to a single number, but that doesn’t mean it should be accepted!


Big Data: Warning #8

Big data works well when analyzing patterns that appear very frequently, but falters on rare occurrences.

Example: search engines rely on trigrams, which are three-word sequences.

Use Google Translate to translate the trigram “dumbed-down escapist fare” into German, and then back to English.


Big Data: Warning #9

The biggest problem of all with big data? People who fall for the hype!

We need to view big data with a reasonable perspective. There is no silver bullet.

Big data is a valuable resource, it’s not going away, and its full implications are yet to be realized. But that doesn’t mean we should even consider tossing out the centuries of ingenuity that have led to the data analysis techniques we have today.

In fact, the arrival of big data means we need these techniques more than ever!
