Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
GOV 2000 Section 4:Probability Distributions & Sampling
Distributions
Konstantin Kashin1
Harvard University
September 26, 2012
1�ese notes and accompanying code draw on the notes fromMolly Roberts,Maya Sen, Iain Osgood, Brandon Stewart, and TF’s from previous years
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Outline
Administrative Details
Probability Distributions
Sampling Distributions
Estimating Sampling Distributions
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Problem Set Expectations
▸ �ird problem set distributed yesterday.▸ Corrections for �rst problem set due next Tuesday.▸ Must be typeset using LATEX or Word and submitted as onedocument containing graphics and explanation electronically inpdf form
▸ Must be accompanied by source-able, commented code
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Midterm Exam
▸ Window of exam: Tuesday, October 9th (a�er class) - Sunday,October 14th at 11.59pm
▸ Needs to be completed in 5 hours▸ Open note / book, but no collaboration allowed
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Outline
Administrative Details
Probability Distributions
Sampling Distributions
Estimating Sampling Distributions
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Sampling from Common Probability Distributions
How do we sample 10 observations from the normal distribution withµ = 0 and σ = 2 in Stata?
set seed 02138
set obs 10
drawnorm x, mean (0) sd(2)
How do we sample 10 observations from the uniform distribution onthe interval [0, 10] in R?
set seed 02138
set obs 10
gen u = 10*uniform ()
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
What is the probability that a random variable liesin a particular subdomain?
X is the height (in meters) of the US adult population (18+). LetX ∼ N (1.7, 0.0625). What is the probability that X ∈ [1.5, 1.7], thatone’s height is between 1.5 and 1.7 meters?
0.0
0.5
1.0
1.5
0.5 1.0 1.5 2.0 2.5 3.0X (Height in Meters)
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
What is the probability that a random variable liesin a particular subdomain?
X is the height (in meters) of the US adult population (18+). LetX ∼ N (1.7, 0.0625). What is the probability that X ∈ [1.5, 1.7], thatone’s height is between 1.5 and 1.7 meters?
0.0
0.5
1.0
1.5
0.5 1.0 1.5 2.0 2.5 3.0X (Height in Meters)
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Analytic Approach: CDFs
P(1.5 ≤ X ≤ 1.7) = P(X ≤ 1.7) − P(X ≤ 1.5) = FX(1.7) − FX(1.5)
In Stata:
di normal ((1.7 -1.7)/0.25) - normal ((1.5 -1.7)/0.25)
∴P(1.5 ≤ X ≤ 1.7) ≈ 0.288
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Simulation Approach
1. Sample from distribution of interest to approximate it2. Calculate proportion of observations in sample that fall in
subdomain of interest
In Stata:
clear
set seed 02138
set obs 1000
drawnorm x, mean (1.7) sd (0.25)
gen area = x < 1.7 & x > 1.5
quietly sum area
di r(mean)
∴P(1.5 ≤ X ≤ 1.7) ≈ 0.2839
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Outline
Administrative Details
Probability Distributions
Sampling Distributions
Estimating Sampling Distributions
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Our Data
We are going to work with fulton.csv.
▸ Election data at precinct level for Fulton County, Georgia.▸ Population: 268 precincts▸ Variables: turnout rate, % black, % female, mean age, turnout inDem. primary, turnout in Rep. primary, dummy variable forwhether precinct is in Atlanta, and location dummies for pollingstations
Let’s de�ne X = % black
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing the Population
What does the population of X = % black look like?
0
20
40
60
80
0.0 0.2 0.4 0.6 0.8 1.0% Black in Precinct (X)
Frequency
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Calculate True (Population) Mean
sum black
di r(mean)
µ = 0.506
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Sampling of Precincts
Suppose we can only sample (SRS without replacement) n = 40precincts. How do we do this in Stata?
set seed 02138
sample 40, count
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing the Sample
What does the sample of X = % black look like?
0
5
10
15
0.0 0.2 0.4 0.6 0.8 1.0% Black in Precinct (X)
Frequency
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Calculate SampleMean (X̄)
sum black
di r(mean)
X̄ = 0.412
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Connecting this to Resampling
We can think of our sample as one of many possible samples we candraw from our population:
Samples from Pop.1 2 3 ⋯ 10000
X1 0.496 x1,2 x1,3 ⋯ x1,10000X2 0.320 x2,2 x2,3 ⋯ x2,10000⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 x40,2 x40,3 ⋯ x40,10000X̄40 0.412 x̄2 x̄3 ⋯ x̄10000S̄2 0.160 s22 s23 ⋯ s210000
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Connecting this to Resampling
We can think of our sample as one of many possible samples we candraw from our population:
Samples from Pop.1 2 3 ⋯ 10000
X1 0.496 x1,2 x1,3 ⋯ x1,10000X2 0.320 x2,2 x2,3 ⋯ x2,10000⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 x40,2 x40,3 ⋯ x40,10000
µ̂ = X̄40 0.412 x̄2 x̄3 ⋯ x̄10000σ̂2 = S̄2 0.160 s22 s23 ⋯ s210000
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Sampling Distribution of X̄
�e sampling distribution of X̄ is the distribution of the followingvector:
Samples from Pop.1 2 3 ⋯ 10000
X1 0.496 x1,2 x1,3 ⋯ x1,10000X2 0.320 x2,2 x2,3 ⋯ x2,10000⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 x40,2 x40,3 ⋯ x40,10000
µ̂ = X̄40 0.412 x̄2 x̄3 ⋯ x̄10000σ̂2 = S̄2 0.160 s22 s23 ⋯ s210000
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Calculating Sampling Distribution with KnownPopulation
1. Start with complete population.2. De�ne a quantity of interest (the parameter). For us, it’s µ.3. Choose a plausible estimator. For us, it’s µ̂ = X̄.4. Draw a sample from the population and calculate the estimate
using the estimator.5. Repeat step 4 many times.
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Calculating Sampling Distribution with KnownPopulation
We can approximate a sampling distribution for the mean andvariance by doing the sampling we did before repeatedly and manuallystoring the estimates.set seed 02138
insheet using fulton.csv , comma clear
sample 40, count
quietly sum receipts , detail
di r(mean)
di r(Var)
insheet using fulton.csv , comma clear
sample 40, count
quietly sum receipts , detail
di r(mean)
di r(Var)
*** and so on...
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing Sampling Distributions
We can now plot the manually stored estimates of the populationmean.
matrix m=(0.4117 \ 0.5123 \ 0.5398 \ 0.6038\ 0.5376\
0.5901\ 0.5087\ 0.4707\ 0.5044\ 0.5646)
svmat m, name(simu)
hist(simu)
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Can we automate this a bit more?
insheet using fulton.csv , comma clear
foreach i of numlist 1/20 {
preserve
quietly sample 40, count
quietly sum black
mat holder = (nullmat(holder) \ r(mean))
restore
}
svmat holder
hist holder1 , frequency
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing Sampling Distributions
Here is the sampling distribution (with 10,000 samples) for X̄:
0
2
4
6
0.3 0.4 0.5 0.6 0.7Sample Mean
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing Sampling Distributions
Here is the sampling distribution (with 10,000 samples) for X̄:
0
2
4
6 Original Sample Mean Pop. Mean
0.3 0.4 0.5 0.6 0.7Sample Mean
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Characterizing Sampling Distribution of SampleMean
We know that in large sample (large enough for Central Limit�eorem to kick in), the sample mean will be distributed as:
X̄ ∼ N (µ, σ2/n)
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Sampling Distributions for Other Statistics
We can calculate in�nitely many statistics from a sample. If we havethe population, we can simulate a sampling distribution for those!
For example, we can look at the sampling distribution of S - the samplestandard deviation.
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing Sampling Distributions
Here is the sampling distribution (with 10,000 draws) for S:
0
5
10
15
20
25
0.35 0.40 0.45Sample Standard Deviation
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Outline
Administrative Details
Probability Distributions
Sampling Distributions
Estimating Sampling Distributions
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Sampling Distribution of X̄ with UnknownPopulation
In reality, we only draw one sample and cannot directly observe thesampling distribution of X̄:
Samples from Pop.1 2 3 ⋯ 10000
X1 0.496 ? ? ⋯ ?X2 0.320 ? ? ⋯ ?⋮ ⋮ ⋮ ⋮ ⋱ ⋮X40 0.172 ? ? ⋯ ?
µ̂ = X̄40 0.412 ? ? ⋯ ?σ̂2 = S̄2 0.160 ? ? ⋯ ?
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Estimating the Sampling Distribution of SampleMean
We can estimate the equation in the previous slide using estimators forthe population mean µ and the population variance σ2. We will usethe following estimators:
▸ µ̂ = X̄ = 1n ∑
ni=1 Xi
▸ σ̂2 = S2X = 1n−1 ∑
ni=1(Xi − X̄)2
�e estimated sampling distribution is thus:
X̄ ∼ N (X̄, S2X/n)
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Bootstrapping the Sampling Distribution of SampleMean
1. Start with sample.2. De�ne a quantity of interest (the parameter). For us, it’s µ.3. Choose a plausible estimator. For us, it’s µ̂ = X̄.4. Take a resample of size n (with replacement) from the sample and
calculate the estimate using the estimator.5. Repeat step 4 many times (we will do 10,000).
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing the Bootstrapped Sampling Distribution
Here is the bootstrapped sampling distribution:
0
2
4
6
Sample Mean
0.2 0.3 0.4 0.5 0.6 0.7Sample Mean
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing the Bootstrapped Sampling Distribution
However, any given sample may be far away from the truth!
0
2
4
6
Sample Mean Population Mean
0.2 0.3 0.4 0.5 0.6 0.7Sample Mean
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Visualizing the Bootstrapped Sampling Distribution
However, any given sample may be far away from the truth!
0
2
4
6
Sample Mean Population Mean
0.2 0.3 0.4 0.5 0.6 0.7Sample Mean
Density
Administrative Details Probability Distributions Sampling Distributions Estimating Sampling Distributions
Conclusion
ANY QUESTIONS?