FPP 16-18 Expected Values, Standard Errors, Central Limit Theorem


Statistical inference

Up to this point we have focused primarily on exploratory statistical analyses (with a little probability thrown in).

We will now dive into the realm of statistical inference.

The ideas associated with sampling distributions, p-values, and confidence intervals are more abstract and are therefore slightly harder.

These concepts are also very powerful: for good if used correctly, for bad if used incorrectly.

Statistics vs probability modeling

Probability: know the truth, want to estimate the chances that data occur

Statistics: know the data that occur, want to infer about the truth

Coin toss

Suppose we tossed a coin 50 times. We are interested to know if this coin is fair. If the coin is fair, then a straightforward model that mimics reality is:

# heads = 0.5(# of tosses)

It should be fairly obvious that the number of heads won’t be exactly 25. How far away from 25 would convince us that the coin isn’t fair?

Statistical model:

# heads = 0.5(# of tosses) + chance error

This chance error will help us answer the question: how many heads is too many for the coin to be fair?

We will study this chance error quite rigorously.

Study of chance error

Plan of attack for study of chance error:

Law of averages

Sampling distributions

Central limit theorem

Our main tool will be so-called “box models”

Law of averages

What does the law of averages say?

Toss a coin

As the # of tosses increases:

|#heads – 0.5(#tosses)| tends to get bigger

|%heads – 50%| tends to get smaller

In words: as the number of tosses goes up

The difference between the number of heads and half the number of tosses tends to get bigger

The difference between the percentage of heads and 50% tends to get smaller (if the coin is fair)
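
The two statements above can be watched in a small simulation. This is only a sketch: the function name `toss_summary` and the fixed seed are illustrative choices, and any single run can wobble, so the pattern shows up as a tendency rather than a guarantee.

```python
import random


def toss_summary(n_tosses: int, rng: random.Random) -> tuple[float, float]:
    """Toss a fair coin n_tosses times.

    Returns (|#heads - 0.5 * #tosses|, |%heads - 50|).
    """
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    abs_error = abs(heads - 0.5 * n_tosses)
    pct_error = abs(100 * heads / n_tosses - 50)
    return abs_error, pct_error


rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
for n in (100, 10_000, 100_000):
    abs_error, pct_error = toss_summary(n, rng)
    print(f"{n:>7} tosses: off by {abs_error:.0f} heads, "
          f"{pct_error:.3f} percentage points")
```

The absolute error is measured in heads and the percentage error in percentage points, so the two columns are the same chance error on two different scales.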

Law of averages

A die is thrown some number of times, and the object is to guess the total number of spots. There is a one-dollar penalty for each spot that the guess is off. For instance, if you guess 200 and the total is 215, you lose $15.

Which do you prefer: 50 throws, or 100?

Chance processes

When tossing a coin:

Actual #heads ≠ Expected #heads

What is the likely size of the difference?

Strategy: Find an analogy between the process being studied and drawing numbers at random from a box (box model)

Box models

A so-called box model is a good starting point for statistical inference

The purpose of these very simple models is to analyze chance variability

They are a construction for learning about characteristics of populations

They help us incorporate the probability techniques we learned in studying chance error.

Box model

A die is thrown some number of times, and the object is to guess the total number of spots. What is a “typical” total number of spots after 50 throws? After 100 throws?

Create a box model for this
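
One possible box model for the die is a box with six tickets, 1 through 6, with one draw per throw. The sketch below assumes the formulas these notes develop for the sum of draws (expected value = n × average of box, typical deviation = √n × SD of box); the helper name `ev_and_se_of_sum` is made up for illustration.

```python
import math
from statistics import mean, pstdev

die_box = [1, 2, 3, 4, 5, 6]  # one ticket per face of the die


def ev_and_se_of_sum(box, n_draws):
    """Expected value and typical deviation of the sum of n_draws draws
    made at random with replacement from the box."""
    ev = n_draws * mean(box)
    se = math.sqrt(n_draws) * pstdev(box)  # pstdev divides by the number of tickets
    return ev, se


for n in (50, 100):
    ev, se = ev_and_se_of_sum(die_box, n)
    print(f"{n} throws: guess {ev:.0f}, typical miss about {se:.1f} spots")
```

Under these assumptions the typical miss is larger for 100 throws than for 50, which is the point of the betting question.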

Constructing box models

A quiz has 25 multiple choice questions. Each question has 5 possible answers, one of which is correct. A correct answer is worth 4 points, but a point is taken off for each incorrect answer. A student answers all of the questions by guessing randomly.

What is the box model for this scenario?

What is the “expected” score on the quiz?

What is the range of scores?

What is the SD of scores?
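
One way to answer the quiz questions is sketched below. It assumes the box holds one ticket per equally likely outcome of a single guess (one ticket marked 4 and four tickets marked -1), that the score is the sum of 25 draws with replacement, and that "SD of scores" means the typical deviation of that sum. Variable names are illustrative.

```python
import math
from statistics import mean, pstdev

# One ticket per equally likely outcome of a single guessed question:
# 1 chance in 5 of +4 points, 4 chances in 5 of -1 point.
quiz_box = [4, -1, -1, -1, -1]
n_questions = 25

box_avg = mean(quiz_box)
box_sd = pstdev(quiz_box)  # divides by the number of tickets

expected_score = n_questions * box_avg
se_score = math.sqrt(n_questions) * box_sd
lowest = n_questions * min(quiz_box)   # every guess wrong
highest = n_questions * max(quiz_box)  # every guess right

print(f"expected score: {expected_score}, give or take about {se_score:.0f}")
print(f"possible range: {lowest} to {highest}")
```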

Duke donor example

Population: 119,106 graduates of Duke

Variable: donation amount in dollars to the Duke Annual Fund in 2001

Box model: make a ticket for every alumnus containing his/her donation amount. Put all these tickets in a hypothetical box.

Box models: typical questions

Pick 100 tickets at random from the box, with replacement

1. Before collecting the data, what do you expect the sum of these 100 alumni donations to equal?

2. What do you think is a typical deviation from this expected value?

We can answer these questions with a box model

3. Before collecting the data, how many of the 100 alumni do you expect to be donators?

4. What do you think is a typical deviation from this expected value?

To answer these questions we need another box model

Characteristics of alumni donations

For the 119,106 alumni:

Average of all donations = $735

SD of donations = $23,827

42,938 donated (36%)

76,168 did not donate (64%)

Learning about the sample sum

When we sample randomly, the sum of the 100 tickets will differ for different samples

What is the expected value (EV) of the sample sum?

E(sample sum) = n × (average of box) = n × μ

What is a typical deviation of a sample sum from this expected value?

Standard error (SE) of sum = √n × (SD of box) = √n × σ

Sample sum of donations for 100 alumni

So the sum of the 100 alumni donations should be:

E(sample sum) = 100 × ($735) = $73,500

give or take the SE:

SE = √100 × ($23,827) = $238,270

How sure are we about the sum of donations using a sample of 100?

Key idea: If we take independent samples of 100 alumni over and over again, recording the sum of each sample, then

The average of the sample sums should be around $73,500

The SD of the sample sums should be around $238,270
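
The arithmetic for the sum of 100 donations can be reproduced in a few lines. This is a sketch using only the box average and box SD quoted in these notes; the variable names are illustrative.

```python
import math

box_average = 735  # average donation in dollars (from the notes)
box_sd = 23_827    # SD of donations in dollars (from the notes)
n = 100            # number of tickets drawn

ev_sum = n * box_average         # expected value of the sample sum
se_sum = math.sqrt(n) * box_sd   # standard error of the sample sum

print(f"EV = ${ev_sum:,.0f}, SE = ${se_sum:,.0f}")
print(f"the SE is about {se_sum / ev_sum:.1f} times the EV")
```

The last line speaks to the "how sure are we" question: the give-or-take number is several times larger than the expected value itself.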

Box model for binary (dichotomous) outcomes

42,938 donated and 76,168 did not

Make a box with tickets consisting of 42,938 ones and 76,168 zeros.

Average of box = % of ones = 0.36 = p

SD of box = 0.48

Short cut for SD for binary box models (and only for binary box models):

SD = √((% of 1's) × (1 − % of 1's)) = √(p(1 − p))

Sample 100 tickets out of the box with replacement. What does this process remind you of?
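
The short cut for a 0-1 box, SD = √(p(1 − p)), can be checked against the SD computed directly from the tickets. The sketch below applies the short cut to the donor counts and then verifies it on a small stand-in box of 9 ones and 16 zeros (also 36% ones), which is a made-up box chosen only to keep the direct calculation tiny.

```python
import math
from statistics import pstdev

# Short cut applied to the donor box: 42,938 ones and 76,168 zeros
n_ones, n_zeros = 42_938, 76_168
p = n_ones / (n_ones + n_zeros)
shortcut_sd = math.sqrt(p * (1 - p))
print(f"p = {p:.2f}, SD of box = {shortcut_sd:.2f}")

# Direct check on a small hypothetical box with the same fraction of ones
small_box = [1] * 9 + [0] * 16
small_p = sum(small_box) / len(small_box)
direct_sd = pstdev(small_box)  # root-mean-square deviation from the average
print(f"direct SD = {direct_sd:.2f}, short cut = {math.sqrt(small_p * (1 - small_p)):.2f}")
```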

Sample number of donators out of 100 alumni

The number of donators in the sample equals the sample sum of the 0-1 tickets

Thus, the expected number of donators is

EV of sample sum = n × (average of box) = 100 × 0.36 = 36

The typical deviation of the sample sum from its expected value is the standard error:

SE of sum = √n × (SD of box) = 10 × 0.48 = 4.8

Sample number of donators out of 100 alumni

Hence, the number of alumni who donated out of a random sample of 100 should be 36, give or take around 5 people (SE = 4.8).

Compared to the average donation per alumnus, how “confident” are we that any given sample of 100 will produce 36 donors?

Key idea

If we take independent samples of 100 alumni over and over again, recording the number of donators in each sample

The average of the sample numbers of donators should be around 36

The SD of the sample numbers of donators should be around 4.8

Chance error / Standard error

Standard error allows us to assess how big the chance error will be in the model

sum = expected value + chance error

Chance error is the difference between an observed value and the expected value

A problem from the text

100 draws are made with replacement from a box containing the seven numbers

101 102 103 104 105 106 107

Suppose you were betting. The closer your guess is to the sample sum, the more money you win. What number would you guess?

Use the expected value as your guess: 100 × 104 = 10,400

How much would you expect the sample sum to be off from the expected value of the sum?

This is the standard error. The SD of the box is 2, so SE = √100 × 2 = 20
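
The numbers for this problem can be recomputed directly from the seven tickets. The sketch assumes the SD of the box divides by the number of tickets (Python's `statistics.pstdev`), which gives 2 and an SE of 20. For comparison it also prints `statistics.stdev`, which divides by one less than the number of tickets and gives about 2.16, or an SE of about 21.6.

```python
import math
from statistics import mean, pstdev, stdev

box = [101, 102, 103, 104, 105, 106, 107]
n_draws = 100

ev_sum = n_draws * mean(box)

sd_box = pstdev(box)  # divisor 7: root-mean-square deviation of the tickets
sd_plus = stdev(box)  # divisor 6: shown only for comparison

se_sum = math.sqrt(n_draws) * sd_box

print(f"EV of sum = {ev_sum}")
print(f"SD of box = {sd_box:.2f} (divisor 7) vs {sd_plus:.2f} (divisor 6)")
print(f"SE of sum = {se_sum:.1f}")
```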

Difference between SD and SE

SD is the typical deviation from the average in a box. SD is a property of the box; it doesn’t depend on random sampling

SE is the typical deviation from the expected value in a random sample. SE results from random sampling

SE gives an idea of how large the chance error is

Sum of draws is likely to be around its expected value, but to be off by a chance error similar in size to its SE

Sum of draws = EV ± chance error

EV and SE of the sample average or percent

Since sample average (or percent) = sample sum / n, we get

1. Just like sample sums, sample averages and sample percentages are subject to chance variation

2. EV for sample average (or %) = EV of sample sum / n = average of box

3. SE for sample average (or %) = SE for sample sum / n = SD of box / √n

Common theme for SE of sample average and sample percentage

For a binary variable, the population SD = √(p(1 − p))

So both the sample average and sample percentage have a standard error of the form

SE = Population SD / √n

Sample averages and percentages

In a random sample of 100 alumni, we expect the sample average donation to equal $735, give or take $2,382.70. We expect 36% to donate, give or take 4.8%

If we take independent samples of 100 alumni over and over again, recording the average donation and the percentage of donators in each sample

The average of the sample averages of donations should be around $735

The SD of the sample averages of donations should be around $2,382.70

The average of the sample percentages of donators should be around 0.36

The SD of the sample percentages of donators should be around 0.048

Law of averages

Plot the SE of the sample average donation for increasing sample sizes taken from the box

As n increases, the SE of the sample average decreases

This is called the law of averages

Vegas was built on this law
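
In place of the plot, a short table makes the same point. This sketch assumes SE of the sample average = SD of box / √n with the donation SD quoted in these notes; the particular sample sizes are arbitrary and chosen so that each is four times the previous one.

```python
import math

box_sd = 23_827  # SD of donations in dollars (from the notes)
sample_sizes = [100, 400, 1_600, 6_400, 25_600]  # each 4x the previous

# SE of the sample average for each sample size
se_by_n = {n: box_sd / math.sqrt(n) for n in sample_sizes}

for n, se in se_by_n.items():
    print(f"n = {n:>6}: SE of sample average = ${se:,.2f}")
```

Because of the square root, quadrupling the sample size only halves the SE.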

Shape of chance process

The expected value and the standard error provide a measure of center and spread for the chance process

What about the shape?

The book introduces something called the “probability histogram”

This is a histogram of the values a statistic (such as the sample sum) takes across samples taken from the box model.

What shape will this histogram take on?

Parameters vs statistics

A parameter is a number that describes the population

a fixed number

in practice, we don’t know its value

A statistic is a number that describes a sample

its value is known when we have taken a sample

value can change from sample to sample

often used to estimate an unknown parameter

Sampling distributions

The box model is meant to motivate the ideas surrounding a sampling distribution

All statistics have a sampling distribution

Formal definition

The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

Note that a statistic’s sampling distribution depends on the sample size

Sampling distribution construction

From a given population, exhaust all possible samples of size n

For each sample compute the statistic

Treat these statistics as the “data” and plot a histogram

The histogram displays the sampling distribution

I believe FPP calls these distributions probability histograms

Note that these distributions are highly dependent on the sample size
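
The construction steps above can be carried out exactly when the box is tiny. The sketch below uses a hypothetical six-ticket box (a die), a sample size of 3, and the sample sum as the statistic. It enumerates every ordered sample drawn with replacement, all of which are equally likely.

```python
import math
from collections import Counter
from itertools import product
from statistics import mean, pstdev

box = [1, 2, 3, 4, 5, 6]  # hypothetical small box (a die)
n = 3                     # sample size

# Step 1: exhaust all 6**3 equally likely ordered samples (with replacement).
# Step 2: compute the statistic (here the sum) for each sample.
sample_sums = [sum(sample) for sample in product(box, repeat=n)]

# Step 3: treat the statistics as data and draw a (text) histogram.
sampling_distribution = Counter(sample_sums)
for value in sorted(sampling_distribution):
    print(f"{value:>2} {'#' * sampling_distribution[value]}")

# Center and spread of the sampling distribution vs the box formulas
print(mean(sample_sums), n * mean(box))
print(pstdev(sample_sums), math.sqrt(n) * pstdev(box))
```

The last two lines print matching pairs: the average and SD of the exhaustive list of sums agree with n × (average of box) and √n × (SD of box).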

Silly example

Approximating sampling distributions

What if the population is such that exhausting all samples of size n is impossible?

The sampling distribution can be well approximated using a ton of samples instead of all samples
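
A simulation along these lines is sketched below. It assumes a 0-1 box with 36% ones as a stand-in for the donor box, 10,000 simulated samples of size 100, and a fixed seed; the function name `simulate_sample_sums` and all of these settings are illustrative choices.

```python
import random
from statistics import mean, pstdev


def simulate_sample_sums(box, n, n_samples, seed=0):
    """Draw n_samples random samples of size n (with replacement) from the box
    and return the sum of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.choices(box, k=n)) for _ in range(n_samples)]


box = [1] * 36 + [0] * 64  # 0-1 box with 36% ones
sums = simulate_sample_sums(box, n=100, n_samples=10_000)

print(f"average of simulated sums: {mean(sums):.2f} (EV from the formula is 36)")
print(f"SD of simulated sums:      {pstdev(sums):.2f} (SE from the formula is 4.8)")
```

The simulated average and SD should land close to the EV and SE, though not exactly on them, since only a subset of all possible samples is used.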

Cool applet

Central Limit Theorem

When dealing with a statistic that uses a sum of some sort, we can theoretically show what the sampling distribution will be like through the Central Limit Theorem

The central limit theorem

Take many random samples with replacement from a box model, all of the samples of size n. When n is sufficiently large, the distribution of the sample average (or sample %) is well-described by a normal curve

The mean of this normal curve is the EV and the standard deviation for this normal curve is the SE

The Central Limit Theorem

What does the CLT give us? A ton of stuff

We can find probabilities and percentiles using the normal table

We can predict fairly accurately how unlikely it is to observe a given sample mean

We can assess rather accurately how likely it is that a population mean lies within an interval

Central Limit Theorem

What happens if the distribution of the original variable is not symmetric (or think about the distribution of the values on the tickets in a box)?

The central limit theorem still kicks in (the sample size n just needs to be bigger)

What happens if the distribution of the original variable is bimodal?

The central limit theorem still kicks in (the sample size n just needs to be bigger)

This is an absolutely fantastic result!

Does CLT apply?

A box consists of 9 ones and 1 zero. A random sample of size 50 is drawn with replacement from the box and the number of ones is counted.

A box consists of the ages of the 100 students in our stat class (assume that the mean is 20 and SD is 1). A random sample of size 50 is drawn with replacement from the box and the 25th percentile is computed.

Central Limit Theorem M&Ms

Pick 50 M&Ms at random (from a bag).

How likely is it to have less than 40% yellow and brown M&Ms in the bag?

Assume 50% of all M&M’s are yellow and brown (source: M&M’s home page)

For a sample proportion of yellow and brown M&Ms

EV = 0.5 and SE = √(0.50 × 0.50) / √50 = 0.0707

Size of sample

For binomial data (categorical data with two categories), the CLT usually kicks in pretty well when both of the following conditions on sample size are met

n × (% of 1's) = n × p > 10

n × (% of 0's) = n × (1 − p) > 10
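
The two conditions translate into a one-line check. This is a sketch of the rule of thumb stated above; the function name `clt_ok_for_binary` and the example inputs are illustrative.

```python
def clt_ok_for_binary(n: int, p: float, threshold: float = 10) -> bool:
    """Rule of thumb for 0-1 boxes: both n*p and n*(1-p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold


# 50 draws from a box with 50% ones: n*p = 25 and n*(1-p) = 25
print(clt_ok_for_binary(50, 0.5))

# 50 draws from a box with 90% ones: n*p = 45 but n*(1-p) is only about 5
print(clt_ok_for_binary(50, 0.9))
```

The second call shows that a sample size that is adequate for a balanced box can be too small when the box is lopsided.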

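CLT and M&Ms

Since n = 50 and p = 0.5, n × p = n × (1 − p) = 25 > 10, so the CLT applies

The probability of getting less than 40% yellow and brown M&Ms in a bag of 50 is the area under the standard normal curve to the left of z = (0.40 − 0.50) / 0.0707 ≈ −1.41, which is about 0.08

It is somewhat unusual to get less than 40% yellow and brown M&Ms (about 8 chances in 100)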
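
The normal-curve calculation for the M&Ms can be sketched with Python's standard library in place of the normal table. `statistics.NormalDist` is part of the standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

p = 0.5  # assumed fraction of yellow and brown M&Ms
n = 50   # number of M&Ms picked

se = math.sqrt(p * (1 - p) / n)  # SE of the sample proportion
z = (0.40 - p) / se              # 40% expressed in standard units
chance = NormalDist().cdf(z)     # area under the normal curve to the left of z

print(f"SE = {se:.4f}, z = {z:.2f}, chance = {chance:.3f}")
```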

CLT household example

The average size of U.S. households is 2.6 people. The SD of household size is 1.42. (These are true values from the U.S. Census).

Pick 200 houses at random in the U.S.

How likely is it that we’ll get a sample average household size of 3 or more?

CLT household example

For a sample average of 200 households

EV = 2.6 and SE = 1.42 / √200 = 0.1004

The chance of getting an average household size greater than 3 equals the area under the standard normal curve to the right of z = (3 − 2.6) / 0.1004 ≈ 4. This is a very small chance
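
To put a number on "very small", the same kind of calculation can be run for the household example. This sketch takes the average and SD quoted above and uses the normal approximation; variable names are illustrative.

```python
import math
from statistics import NormalDist

mu = 2.6      # average household size
sigma = 1.42  # SD of household size
n = 200       # number of households sampled

se = sigma / math.sqrt(n)         # SE of the sample average
z = (3 - mu) / se                 # 3 people expressed in standard units
chance = 1 - NormalDist().cdf(z)  # area to the right of z

print(f"SE = {se:.4f}, z = {z:.2f}, chance = {chance:.6f}")
```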

Alumni donations example

In a random sample of 100 alumni, what is the chance that more than half donated?

Alumni donations example

What is the chance that the sample average of donations from 100 randomly picked alumni will be between $50 and $100?
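
One way to attack both alumni questions with the normal approximation is sketched below, using the EV and SE values computed earlier in these notes (36 and 4.8 for the number of donators, $735 and $2,382.70 for the sample average). Treat the second answer with caution: the donation amounts are extremely skewed (SD far larger than the average), so a normal curve may describe the sample average poorly at n = 100.

```python
from statistics import NormalDist

standard_normal = NormalDist()

# Question 1: number of donators in a sample of 100 (EV = 36, SE = 4.8)
z_half = (50 - 36) / 4.8
chance_more_than_half = 1 - standard_normal.cdf(z_half)

# Question 2: sample average donation (EV = $735, SE = $2,382.70)
z_low = (50 - 735) / 2382.70
z_high = (100 - 735) / 2382.70
chance_between = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"more than half donate: z = {z_half:.2f}, chance = {chance_more_than_half:.4f}")
print(f"average between $50 and $100: chance = {chance_between:.4f}")
```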

CLT under three conditions

1. If the original variable follows a normal distribution, there is no need for the CLT. We know the sampling distribution of a sum theoretically

2. If the distribution of the original variable is symmetric and unimodal, then the CLT holds even for a small sample size (say less than 15)

3. If the distribution is skewed or not unimodal, then the CLT holds only for a larger sample size. How large depends on the sharpness of the skew. In this class we will follow convention and say 30.

[Diagram: Sample → Statistic x̄ → Inference → Parameter μ]
