GOV óþþþ Section ƒ: Probability Distributions & Sampling

Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions

GOV 2000 Section 4:Probability Distributions & Sampling

Distributions

Konstantin Kashin1

Harvard University

September 26, 2012

1�ese notes and accompanying code draw on the notes fromMolly Roberts,Maya Sen, Iain Osgood, Brandon Stewart, and TF’s from previous years


Outline

Administrative Details

Probability Distributions

Sampling Distributions

Estimating Sampling Distributions


Problem Set Expectations

▸ �ird problem set distributed yesterday.▸ Corrections for �rst problem set due next Tuesday.▸ Must be typeset using LATEX or Word and submitted as onedocument containing graphics and explanation electronically inpdf form

▸ Must be accompanied by source-able, commented code


Midterm Exam

▸ Window of exam: Tuesday, October 9th (a�er class) - Sunday,October 14th at 11.59pm

▸ Needs to be completed in 5 hours▸ Open note / book, but no collaboration allowed


Outline






Sampling from Common Probability Distributions

How do we sample 10 observations from the normal distribution withµ = 0 and σ = 2 in Stata?

set seed 02138

set obs 10

drawnorm x, mean (0) sd(2)

How do we sample 10 observations from the uniform distribution onthe interval [0, 10] in R?

set seed 02138

set obs 10

gen u = 10*uniform ()


What is the probability that a random variable liesin a particular subdomain?

X is the height (in meters) of the US adult population (18+). LetX ∼ N (1.7, 0.0625). What is the probability that X ∈ [1.5, 1.7], thatone’s height is between 1.5 and 1.7 meters?

0.0

0.5

1.0

1.5

0.5 1.0 1.5 2.0 2.5 3.0X (Height in Meters)

Density


What is the probability that a random variable liesin a particular subdomain?

X is the height (in meters) of the US adult population (18+). LetX ∼ N (1.7, 0.0625). What is the probability that X ∈ [1.5, 1.7], thatone’s height is between 1.5 and 1.7 meters?

0.0

0.5

1.0

1.5

0.5 1.0 1.5 2.0 2.5 3.0X (Height in Meters)

Density


Analytic Approach: CDFs

P(1.5 ≤ X ≤ 1.7) = P(X ≤ 1.7) − P(X ≤ 1.5) = FX(1.7) − FX(1.5)

In Stata:

di normal ((1.7 -1.7)/0.25) - normal ((1.5 -1.7)/0.25)

∴P(1.5 ≤ X ≤ 1.7) ≈ 0.288


Simulation Approach

1. Sample from distribution of interest to approximate it2. Calculate proportion of observations in sample that fall in

subdomain of interest

In Stata:

clear

set seed 02138

set obs 1000

drawnorm x, mean (1.7) sd (0.25)

gen area = x < 1.7 & x > 1.5

quietly sum area

di r(mean)

∴P(1.5 ≤ X ≤ 1.7) ≈ 0.2839


Outline






Our Data

We are going to work with fulton.csv.

▸ Election data at precinct level for Fulton County, Georgia.▸ Population: 268 precincts▸ Variables: turnout rate, % black, % female, mean age, turnout inDem. primary, turnout in Rep. primary, dummy variable forwhether precinct is in Atlanta, and location dummies for pollingstations

Let’s de�ne X = % black


Visualizing the Population

What does the population of X = % black look like?

0

20

40

60

80

0.0 0.2 0.4 0.6 0.8 1.0% Black in Precinct (X)

Frequency


Calculate True (Population) Mean

sum black

di r(mean)

µ = 0.506


Sampling of Precincts

Suppose we can only sample (SRS without replacement) n = 40precincts. How do we do this in Stata?

set seed 02138

sample 40, count


Visualizing the Sample

What does the sample of X = % black look like?

0

5

10

15

0.0 0.2 0.4 0.6 0.8 1.0% Black in Precinct (X)

Frequency


Calculate SampleMean (X̄)

sum black

di r(mean)

X̄ = 0.412


Connecting this to Resampling

We can think of our sample as one of many possible samples we candraw from our population:

Samples from Pop.1 2 3 ⋯ 10000

X1 0.496 x1,2 x1,3 ⋯ x1,10000X2 0.320 x2,2 x2,3 ⋯ x2,10000⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 x40,2 x40,3 ⋯ x40,10000X̄40 0.412 x̄2 x̄3 ⋯ x̄10000S̄2 0.160 s22 s23 ⋯ s210000


Connecting this to Resampling

We can think of our sample as one of many possible samples we candraw from our population:


X1 0.496 x1,2 x1,3 ⋯ x1,10000X2 0.320 x2,2 x2,3 ⋯ x2,10000⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 x40,2 x40,3 ⋯ x40,10000

µ̂ = X̄40 0.412 x̄2 x̄3 ⋯ x̄10000σ̂2 = S̄2 0.160 s22 s23 ⋯ s210000


Sampling Distribution of X̄

�e sampling distribution of X̄ is the distribution of the followingvector:


X1 0.496 x1,2 x1,3 ⋯ x1,10000X2 0.320 x2,2 x2,3 ⋯ x2,10000⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 x40,2 x40,3 ⋯ x40,10000

µ̂ = X̄40 0.412 x̄2 x̄3 ⋯ x̄10000σ̂2 = S̄2 0.160 s22 s23 ⋯ s210000


Calculating Sampling Distribution with KnownPopulation

1. Start with complete population.2. De�ne a quantity of interest (the parameter). For us, it’s µ.3. Choose a plausible estimator. For us, it’s µ̂ = X̄.4. Draw a sample from the population and calculate the estimate

using the estimator.5. Repeat step 4 many times.


Calculating Sampling Distribution with KnownPopulation

We can approximate a sampling distribution for the mean andvariance by doing the sampling we did before repeatedly and manuallystoring the estimates.set seed 02138

insheet using fulton.csv , comma clear

sample 40, count

quietly sum receipts , detail

di r(mean)

di r(Var)


sample 40, count

quietly sum receipts , detail

di r(mean)

di r(Var)

*** and so on...


Visualizing Sampling Distributions

We can now plot the manually stored estimates of the populationmean.

matrix m=(0.4117 \ 0.5123 \ 0.5398 \ 0.6038\ 0.5376\

0.5901\ 0.5087\ 0.4707\ 0.5044\ 0.5646)

svmat m, name(simu)

hist(simu)


Can we automate this a bit more?


foreach i of numlist 1/20 {

preserve

quietly sample 40, count

quietly sum black

mat holder = (nullmat(holder) \ r(mean))

restore

}

svmat holder

hist holder1 , frequency



Here is the sampling distribution (with 10,000 samples) for X̄:

0

2

4

6

0.3 0.4 0.5 0.6 0.7Sample Mean

Density



Here is the sampling distribution (with 10,000 samples) for X̄:

0

2

4

6 Original Sample Mean Pop. Mean

0.3 0.4 0.5 0.6 0.7Sample Mean

Density


Characterizing Sampling Distribution of SampleMean

We know that in large sample (large enough for Central Limit�eorem to kick in), the sample mean will be distributed as:

X̄ ∼ N (µ, σ2/n)


Sampling Distributions for Other Statistics

We can calculate in�nitely many statistics from a sample. If we havethe population, we can simulate a sampling distribution for those!

For example, we can look at the sampling distribution of S - the samplestandard deviation.



Here is the sampling distribution (with 10,000 draws) for S:

0

5

10

15

20

25

0.35 0.40 0.45Sample Standard Deviation

Density


Outline






Sampling Distribution of X̄ with UnknownPopulation

In reality, we only draw one sample and cannot directly observe thesampling distribution of X̄:


X1 0.496 ? ? ⋯ ?X2 0.320 ? ? ⋯ ?⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 ? ? ⋯ ?

µ̂ = X̄40 0.412 ? ? ⋯ ?σ̂2 = S̄2 0.160 ? ? ⋯ ?


Estimating the Sampling Distribution of SampleMean

We can estimate the equation in the previous slide using estimators forthe population mean µ and the population variance σ2. We will usethe following estimators:

▸ µ̂ = X̄ = 1n ∑

ni=1 Xi

▸ σ̂2 = S2X = 1n−1 ∑

ni=1(Xi − X̄)2

�e estimated sampling distribution is thus:

X̄ ∼ N (X̄, S2X/n)


Bootstrapping the Sampling Distribution of SampleMean

1. Start with sample.2. De�ne a quantity of interest (the parameter). For us, it’s µ.3. Choose a plausible estimator. For us, it’s µ̂ = X̄.4. Take a resample of size n (with replacement) from the sample and

calculate the estimate using the estimator.5. Repeat step 4 many times (we will do 10,000).


Visualizing the Bootstrapped Sampling Distribution

Here is the bootstrapped sampling distribution:

0

2

4

6

Sample Mean

0.2 0.3 0.4 0.5 0.6 0.7Sample Mean

Density



However, any given sample may be far away from the truth!

0

2

4

6

Sample Mean Population Mean

0.2 0.3 0.4 0.5 0.6 0.7Sample Mean

Density



However, any given sample may be far away from the truth!

0

2

4

6

Sample Mean Population Mean

0.2 0.3 0.4 0.5 0.6 0.7Sample Mean

Density


Conclusion

ANY QUESTIONS?

Documents

GOV óþþþ Section ƒ: Probability Distributions & Sampling