70
1 Sampling and Inference The Quality of Data and Measures March 23, 2006

Sampling and Inference

  • Upload
    atara

  • View
    34

  • Download
    0

Embed Size (px)

DESCRIPTION

Sampling and Inference. The Quality of Data and Measures March 23, 2006. Why we talk about sampling. General citizen education Understand data you’ll be using Understand how to draw a sample, if you need to Make statistical inferences. Why do we sample?. How do we sample?. - PowerPoint PPT Presentation

Citation preview

Page 1: Sampling and Inference

1

Sampling and Inference

The Quality of Data and Measures

March 23, 2006

Page 2: Sampling and Inference

2

Why we talk about sampling

• General citizen education• Understand data you’ll be using• Understand how to draw a sample, if you

need to• Make statistical inferences

Page 3: Sampling and Inference

3

Why do we sample?

N

Cost/benefit Benefit

(precision)

Cost(hassle factor)

Page 4: Sampling and Inference

4

How do we sample?

• Simple random sample– Variant: systematic sample with a random start

• Stratified• Cluster

Page 5: Sampling and Inference

5

Stratification

• Divide sample into subsamples, based on known characteristics (race, sex, religiousity, continent, department)

• Benefit: preserve or enhance variability

Page 6: Sampling and Inference

6

Cluster sampling

Block

HH Unit

Individual

Page 7: Sampling and Inference

7

Effects of samples

• Obvious: influences marginals• Less obvious

– Allows effective use of time and effort– Effect on multivariate techniques

• Sampling of independent variable: greater precision in regression estimates

• Sampling on dependent variable: bias

Page 8: Sampling and Inference

8

Sampling on Independent Variable

x

y

x

y

Page 9: Sampling and Inference

9

Sampling on Dependent Variable

x

y

x

y

Page 10: Sampling and Inference

10

Sampling

Consequences for Statistical Inference

Page 11: Sampling and Inference

11

Statistical Inference:Learning About the Unknown From the

Known• Reasoning forward: distributions of sample

means, when the population mean, s.d., and n are known.

• Reasoning backward: learning about the population mean when only the sample, s.d., and n are known

Page 12: Sampling and Inference

12

Reasoning Forward

Page 13: Sampling and Inference

13

Exponential Distribution Example

Frac

tion

inc0 500000 1.0e+06

0

.271441

Mean = 250,000Median=125,000s.d. = 283,474Min = 0Max = 1,000,000

Page 14: Sampling and Inference

14

Consider 10 random samples, of n = 100 apiece

Sample mean

1 253,396.9

2 198.789.6

3 271,074.2

4 238,928.7

5 280,657.3

6 241,369.8

7 249,036.7

8 226,422.7

9 210,593.4

10 212,137.3

Frac

tion

inc0 250000 500000 1.0e+06

0

.271441

Page 15: Sampling and Inference

15

Consider 10,000 samples of n = 100

Frac

tion

(mean) inc0 250000 500000 1.0e+06

0

.275972N = 10,000Mean = 249,993s.d. = 28,559Skewness = 0.060Kurtosis = 2.92

Page 16: Sampling and Inference

16

Consider 1,000 samples of various sizes

10 100 1000

Mean =250,105s.d.= 90,891Skew= 0.38Kurt= 3.13

Mean = 250,498s.d.= 28,297Skew= 0.02Kurt= 2.90

Mean = 249,938s.d.= 9,376Skew= -0.50Kurt= 6.80

Frac

tion

(mean) inc0 250000 500000 1.0e+06

0

.731

Frac

tion

(mean) inc0 250000 500000 1.0e+06

0

.731

Frac

tion

(mean) inc0 250000 500000 1.0e+06

0

.731

Page 17: Sampling and Inference

17

Difference of means example

Frac

tion

inc0 250000 500000 1.0e+06

0

.280203

Frac

tion

inc20 250000 500000 1.0e+06

0

.251984

State 1Mean = 250,000

State 2Mean = 300,000

Page 18: Sampling and Inference

18

Take 1,000 samples of 10, of each state, and compare them

First 10 samplesSample State 1 State 2

1 311,410 < 365,2242 184,571 < 243,0623 468,574 > 438,3364 253,374 < 557,9095 220,934 > 189,6746 270,400 < 284,3097 127,115 < 210,9708 253,885 < 333,2089 152,678 < 314,882

10 222,725 > 152,312

Page 19: Sampling and Inference

19

1,000 samples of 10(m

ean)

inc2

(mean) inc0 1.1e+06

0

1.1e+06

State 2 > State 1: 673 times

Page 20: Sampling and Inference

20

1,000 samples of 100(m

ean)

inc2

(mean) inc0 1.1e+06

0

1.1e+06

State 2 > State 1: 909 times

Page 21: Sampling and Inference

21

1,000 samples of 1,000

State 2 > State 1: 1,000 times

(mea

n) in

c2

(mean) inc0 1.1e+06

0

1.1e+06

Page 22: Sampling and Inference

22

Another way of looking at it:The distribution of Inc2 – Inc1

n = 10 n = 100 n = 1,000

Mean = 51,845s.d. = 124,815

Mean = 49,704s.d. = 38,774

Mean = 49,816s.d. = 13,932

Frac

tion

diff-400000 0 600000

0

.565

Frac

tion

diff-400000 050000 600000

0

.565

Frac

tion

diff-400000 050000 600000

0

.565

Page 23: Sampling and Inference

23

Play with some simulations

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm

Page 24: Sampling and Inference

24

Reasoning Backward

about somethingsay obut want t , and ,X , knowyou When sn

Page 25: Sampling and Inference

25

Central Limit Theorem

As the sample size n increases, the distribution of the mean of a random sample taken from practically any population approaches a normal distribution, with mean and standard deviation

X

n

Page 26: Sampling and Inference

26

Calculating Standard Errors

In general:

ns

err. std.

Page 27: Sampling and Inference

27

Most important standard errors

2

22

1

11 )1()1(n

ppn

pp

Mean

Proportion

Diff. of 2 means

Diff. of 2 proportionsDiff of 2 means (paired data)Regression (slope) coeff.

ns

npp )1(

2

22

1

21

ns

ns

xsnres 11...

nsd

Page 28: Sampling and Inference

28

Using Standard Errors, we can construct “confidence intervals”

• Confidence interval (ci): an interval between two numbers, where there is a certain specified level of confidence that a population parameter lies

• ci = sample parameter +• multiple * sample standard error

Page 29: Sampling and Inference

29

Constructing Confidence Intervals

• Let’s say we draw a sample of tuitions from 15 private universities. Can we estimate what the average of all private university tuitions is?

• N = 15• Average = 29,735• S.d. = 2,196• S.e. = 567

15196,2

ns

Page 30: Sampling and Inference

30

The Picturey

Mean

.000134

.398942

2 3 4234 68%

95%99%

N = 15; avg. = 29,735; s.d. = 2,196; s.e. = s/√n = 567

29,735

29,735+567=30,30229,735-567=29,168

29,735+2*567=30,869

29,735-2*567=28,601

Page 31: Sampling and Inference

31

Confidence Intervals for Tuition Example

• 68% confidence interval = 29,735+567 = [29,168 to 30,302]

• 95% confidence interval = 29,735+2*567 = [28,601 to 30,869]

• 99% confidence interval = 29,735+3*567 = [28,034 to 31,436]

Page 32: Sampling and Inference

32

What if someone (ahead of time) had said, “I think the average tuition of

major research universities is $25k”?• Note that $25,000 is well out of the 99%

confidence interval, [28,034 to 31,436]• Q: How far away is the $25k estimate from

the sample mean?– A: Do it in z-scores: (29,735-25,000)/567 = 8.35

Page 33: Sampling and Inference

33

Constructing confidence intervals of proportions

• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president. (March 13-16, 2006 Gallup Poll) Can we estimate the % of all American adults who approve?

• N = 1000• p = .37• s.e. = 02.0

1000)37.1(37.)1(

n

pp

Page 34: Sampling and Inference

34

The Picturey

Mean

.000134

.398942

2 3 4234 68%

95%99%

N = 1,000; p. = .37; s.e. = √p(1-p)/n = .02

.37

.37+.02=.39.37-.02=.35

.37+2*.02=.41.37-2*.02=.33

Page 35: Sampling and Inference

35

Confidence Intervals for Bush approval example

• 68% confidence interval = .37+.02 = [.35 to .39]• 95% confidence interval = .37+2*.02 = [.33 to .41]• 99% confidence interval = .37+3*.02 = [ .31 to .43]

Page 36: Sampling and Inference

36

What Gallup said about the confidence interval

• Results are based on telephone interviews with 1,000 national adults, aged 18 and older, conducted March 13-16, 2006. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is ±3 percentage points [because the actual standard error is 1.5%, not 2%].

Page 37: Sampling and Inference

37

What if someone (ahead of time) had said, “I think Americans are equally

divided in how they think about Bush.”

• Note that 50% is well out of the 99% confidence interval, [31% to 43%]

• Q: How far away is the 50% estimate from the sample proportion?– A: Do it in z-scores: (.37-.5)/.02 = -6.5 [-8.7 if

we divide by 0.15]

Page 38: Sampling and Inference

38

Constructing confidence intervals of differences of means

• Let’s say we draw a sample of tuitions from 15 private and public universities. Can we estimate what the difference in average tuitions is between the two types of universities?

• N = 15 in both cases• Average = 29,735 (private); 5,498 (public); diff = 24,238• s.d. = 2,196 (private); 1,894 (public)• s.e. =

74915

3,587,23615

4,822,416

2

22

1

21

ns

ns

Page 39: Sampling and Inference

39

The Picturey

Mean

.000134

.398942

2 3 4234 68%

95%99%

N = 15 twice; diff = 24,238; s.e. = 749

24,238

24,238+749=24,98724,238-749=23,489

24,238+2*749=25,736

24,238-2*749=22,740

Page 40: Sampling and Inference

40

Confidence Intervals for difference of tuition means example

• 68% confidence interval = 24,238+749 = [23,489 to 24,987]• 95% confidence interval = 24,238+2*749 =

[22,740 to 25,736]• 99% confidence interval =24,238+3*749 = • [21,991 to 26,485]

Page 41: Sampling and Inference

41

What if someone (ahead of time) had said, “Private universities are no

more expensive than public universities”

• Note that $0 is well out of the 99% confidence interval, [$21,991 to $26,485]

• Q: How far away is the $0 estimate from the sample proportion?– A: Do it in z-scores: (24,238-0)/749 = 32.4

Page 42: Sampling and Inference

42

Constructing confidence intervals of difference of proportions

• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president. (March 13-16, 2006 Gallup Poll). We focus on the 600 who are either independents or Democrats. Can we estimate whether independents and Democrats view Bus differently?

• N = 300 ind; 300 Dem.• p = .29 (ind.); .10 (Dem.); diff = .19• s.e. =

03.300

)10.1(10.300

)29.1(29.)1()1(

2

22

1

11

npp

npp

Page 43: Sampling and Inference

43

The Picturey

Mean

.000134

.398942

2 3 4234 68%

95%99%

diff. p. = .19; s.e. = .03

.19

.19+.03=.22.19-.03=.16

.19+2*.03=.25.19-2*.03=.13

Page 44: Sampling and Inference

44

Confidence Intervals for Bush Ind/Dem approval example

• 68% confidence interval = .19+.03 = [.16 to .22]• 95% confidence interval = .19+2*.03 = [.13 to .25]• 99% confidence interval = .19+3*.03 = [ .10 to .28]

Page 45: Sampling and Inference

45

What if someone (ahead of time) had said, “I think Democrats and

Independents are equally unsupportive of Bush”?

• Note that 0% is well out of the 99% confidence interval, [10% to 28%]

• Q: How far away is the 0% estimate from the sample proportion?– A: Do it in z-scores: (.19-0)/.03 = 6.33

Page 46: Sampling and Inference

46

Constructing confidence intervals of differences of means in a paired

sample• Let’s say we draw a sample of tuitions from 15

private universities, in 2003 and again in 2004. Can we estimate what the difference in average tuitions is between the two years?

• N = 15 • Averages = 28,102 (2003); 29,735 (2004); diff = 1,632• s.d. = 2,196 (private); 1,894 (public); 886 (diff)• s.e. =

22915

886

nsd

Page 47: Sampling and Inference

47

The Picturey

Mean

.000134

.398942

2 3 4234 68%

95%99%

N = 15; diff = 1,632; s.e. = 229

1,632

1,632+229=1,8611,632-229=1,403

1632+2*229=2,090

1,632-2*229=1,174

Page 48: Sampling and Inference

48

Confidence Intervals for second difference of tuition means example

• 68% confidence interval = 1,632+ 229= [1,403 to 1,861]• 95% confidence interval = 1,632+ 2*229 =

[1,174 to 2,090]• 99% confidence interval = 1,632+3*229 =

[945 to 2,319]

Page 49: Sampling and Inference

49

What if someone (ahead of time) had said, “Private university tuitions did

not grow from 2003 to 2004”

• Note that $0 is well out of the 99% confidence interval, [$1,174 to $2,090]

• Q: How far away is the $0 estimate from the sample proportion?– A: Do it in z-scores: (1,632-0)/229 = 7.13

Page 50: Sampling and Inference

50

The Stata output

. gen difftuition=tuition2004-tuition2003

. ttest diff=0 in 1/15

One-sample t test------------------------------------------------------------------------------Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]---------+--------------------------------------------------------------------difftu~n | 15 1631.6 228.6886 885.707 1141.112 2122.088------------------------------------------------------------------------------ mean = mean(difftuition) t = 7.1346Ho: mean = 0 degrees of freedom = 14

Ha: mean < 0 Ha: mean != 0 Ha: mean > 0 Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000

Page 51: Sampling and Inference

51

Constructing confidence intervals of regression coefficients

• Let’s look at the relationship between the mid-term seat loss by the President’s party at midterm and the President’s Gallup poll rating

1938

1942

1946

1950

1954

1958

1962

1966

1970

1974

1978

1982

19861990

1994

19982002

-80

-60

-40

-20

0C

hang

e in

Hou

se s

eats

30 40 50 60 70Gallup approval rating (Nov.)

loss Fitted valuesFitted values

Slope = 1.97N = 14s.e.r. = 13.8sx = 8.14s.e.slope =

47.014.81

138.131

1...

xsnres

Page 52: Sampling and Inference

52

The Picturey

Mean

.000134

.398942

2 3 4234 68%

95%99%

N = 14; slope=1.97; s.e. = 0.45

1.97

1.97+0.47=2.441.97-0.47=1.50

1.97+2*0.47=2.911.97-2*0.47=1.03

Page 53: Sampling and Inference

53

Confidence Intervals for regression example

• 68% confidence interval = 1.97+ 0.47= [1.50 to 2.44]• 95% confidence interval = 1.97+ 2*0.47 =

[1.03 to 2.91]• 99% confidence interval = 1.97+3*0.47 =

[0.62 to 3.32]

Page 54: Sampling and Inference

54

What if someone (ahead of time) had said, “There is no relationship

between the president’s popularity and how his party’s House members

do at midterm”?• Note that 0 is well out of the 99%

confidence interval, [0.62 to 3.32]• Q: How far away is the 0 estimate from the

sample proportion?– A: Do it in z-scores: (1.97-0)/0.47 = 4.19

Page 55: Sampling and Inference

55

The Stata output. reg loss gallup if year>1948

Source | SS df MS Number of obs = 14-------------+------------------------------ F( 1, 12) = 17.53 Model | 3332.58872 1 3332.58872 Prob > F = 0.0013 Residual | 2280.83985 12 190.069988 R-squared = 0.5937-------------+------------------------------ Adj R-squared = 0.5598 Total | 5613.42857 13 431.802198 Root MSE = 13.787

------------------------------------------------------------------------------ loss | Coef. Std. Err. t P>|t| [95% Conf. Interval]-------------+---------------------------------------------------------------- gallup | 1.96812 .4700211 4.19 0.001 .9440315 2.992208 _cons | -127.4281 25.54753 -4.99 0.000 -183.0914 -71.76486------------------------------------------------------------------------------

Page 56: Sampling and Inference

56

z vs. t

Page 57: Sampling and Inference

57

If n is sufficiently large, we know the distribution of sample means/coeffs. will

obey the normal curve

y

Mean

.000134

.398942

2 3 4234 68%

95%99%

Page 58: Sampling and Inference

58

If n is sufficiently large, we know the distribution of sample means/coeffs. will

obey the normal curve

y

Mean

.000134

.398942

2 3 4234 68%

95%99%

Page 59: Sampling and Inference

59

If n is sufficiently large, we know the distribution of sample means/coeffs. will

obey the normal curve

y

Mean

.000134

.398942

2 3 4234 68%

95%99%

Page 60: Sampling and Inference

60

If n is sufficiently large, we know the distribution of sample means/coeffs. will

obey the normal curve

y

Mean

.000134

.398942

2 3 4234 68%

95%99%

Page 61: Sampling and Inference

61

Therefore….

• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table

Z = (H1 - H0) / s.e.

Page 62: Sampling and Inference

62

Reading a z table

Page 63: Sampling and Inference

63

Page 64: Sampling and Inference

64

Therefore….

• When the sample size is small (i.e., <150), convert the difference into t units and consult a t table

t = (H1 - H0) / s.e.

Page 65: Sampling and Inference

65

t (when the sample is small)

z-4 -2 0 2 4

.000045

.003989

t-distribution

z (normal) distribution

Page 66: Sampling and Inference

66

Reading a t table

Page 67: Sampling and Inference

67

A word about standard errors and collinearity

• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y

Page 68: Sampling and Inference

68

How does having another collinear independent variable affect standard

errors?

s eN n

SS

RR

Y

X

Y

X

. . ( )1

2

2

2

2

11

11

1 1

R2 of the “auxiliary regression” of X1 on allthe other independent variables

Page 69: Sampling and Inference

69

Example: Effect of party, ideology, and religiosity on feelings toward

Quincy BushBush

FeelingsConserv. Repub. Religious

Bush Feelings

1.0 .39 .57 .16

Conserv. 1.0 .46 .18

Repub. 1.0 .06

Relig. 1.0

Page 70: Sampling and Inference

70

Regression table(1) (2) (3) (4)

Intercept 32.7(0.85)

32.9(1.08)

32.6(1.20)

29.3(1.31)

Repub. 6.73(0.244)

5.86(0.27)

6.64(0.241)

5.88(0.27)

Conserv. --- 2.11(0.30)

--- 1.87(0.30)

Relig. --- --- 7.92(1.18)

5.78(1.19)

N 1575 1575 1575 1575

R2 .32 .35 .35 .36