Weakly informative priors · Interactions in before-after studies · Interactions in regressions · Conclusions
Creating structured and flexible models: some open problems

Andrew Gelman, Department of Statistics and Department of Political Science
Columbia University
8 June 2009
Andrew Gelman Creating structured and flexible models: some open problems
Themes

- Goal: models that are structured enough to be able to learn from data but not so strong as to overwhelm the data
- Collaborators: Joe Bafumi, Valerie Chan, Samantha Cook, Jeronimo Cortina, Zaiying Huang, Aleks Jakulin, Jouni Kerman, Gary King, Iain Pardoe, David Park, Maria Grazia Pittau, Boris Shor, Matt Stevens, Yu-Sung Su, Francis Tuerlinckx, Masanao Yajima, Shouhao Zhou
Separation in logistic regression · Bayesian solution · Prior information · Evaluation using a corpus of datasets
Logistic regression
[Figure: the curve y = logit⁻¹(x) over x from −6 to 6, rising from 0 to 1; slope = 1/4 at the center]
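The divide-by-4 shorthand can be checked numerically. This is a minimal sketch; the `inv_logit` helper is just for illustration:

```python
import math

def inv_logit(x):
    """Inverse logit: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Slope of the curve at its center, via a central finite difference.
h = 1e-6
slope_at_zero = (inv_logit(h) - inv_logit(-h)) / (2 * h)
print(round(slope_at_zero, 4))  # 0.25, i.e. 1/4
```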
A clean example
[Figure: data and fitted curve, estimated Pr(y=1) = logit⁻¹(−1.40 + 0.33 x), over x from −10 to 20; slope = 0.33/4 at the midpoint]
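Applying the divide-by-4 rule to this fit: the curve is steepest where Pr(y=1) = 0.5, and its slope there is 0.33/4. A quick check, with the slide's coefficients hard-coded:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

a, b = -1.40, 0.33          # fitted intercept and slope from the slide
x0 = -a / b                 # the point where Pr(y=1) = 0.5
h = 1e-6
max_slope = (inv_logit(a + b * (x0 + h)) - inv_logit(a + b * (x0 - h))) / (2 * h)
print(round(max_slope, 4))  # ≈ 0.0825, i.e. 0.33/4
```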
The problem of separation
[Figure: binary data y vs. x from −6 to 6 showing complete separation; slope = infinity?]
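A toy illustration of why no finite slope maximizes the likelihood under separation (the data here are made up for the sketch):

```python
import math

# Perfectly separated data: y = 0 whenever x < 0, y = 1 whenever x > 0.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

def log_lik(beta):
    """Log-likelihood of a logistic regression with slope beta (no intercept)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# The log-likelihood keeps increasing as the slope grows: no finite maximum.
print([round(log_lik(b), 3) for b in (1, 5, 10, 50)])
```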
Separation is no joke!
glm (vote ~ female + black + income, family=binomial(link="logit"))
                 1960                              1968
            coef.est  coef.se                 coef.est  coef.se
(Intercept)   -0.14     0.23     (Intercept)    0.47      0.24
female         0.24     0.14     female        -0.01      0.15
black         -1.03     0.36     black         -3.64      0.59
income         0.03     0.06     income        -0.03      0.07

                 1964                              1972
            coef.est  coef.se                 coef.est  coef.se
(Intercept)   -1.15     0.22     (Intercept)    0.67      0.18
female        -0.09     0.14     female        -0.25      0.12
black        -16.83   420.40     black         -2.63      0.27
income         0.19     0.06     income         0.09      0.05
bayesglm()
- Bayesian logistic regression
- In the arm (Applied Regression and Multilevel modeling) package in R
- Replaces glm(); estimates are more numerically and computationally stable
- Student-t prior distributions for regression coefs
- Uses an EM-like algorithm
- We went inside glm.fit to augment the iteratively weighted least squares step
- Default choices for tuning parameters (we'll get back to this!)
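bayesglm itself augments the IWLS step inside glm.fit; as a rough stand-in, this sketch finds the posterior mode of a single slope under a Cauchy(0, 2.5) prior by brute-force grid search. The data, the grid, and the one-parameter setup are illustrative assumptions, not the package's algorithm:

```python
import math

# Separated data, for which the maximum-likelihood slope is infinite.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

def log_post(beta, scale=2.5):
    """Log-posterior: logistic log-likelihood plus an independent
    Cauchy(0, scale) log-prior on the slope (2.5 is the default scale)."""
    lp = -math.log(math.pi * scale * (1.0 + (beta / scale) ** 2))
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        lp += math.log(p if y == 1 else 1.0 - p)
    return lp

# Coarse 1-D maximization; unlike the MLE, the posterior mode is finite.
grid = [i / 100.0 for i in range(1, 2000)]
beta_map = max(grid, key=log_post)
print(round(beta_map, 2))  # a finite estimate, around 1.9 for these data
```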
Regularization in action!
What else is out there?
- glm (maximum likelihood): fails under separation, gives noisy answers for sparse data
- Augment with prior “successes” and “failures”: doesn’t work well for multiple predictors
- brlr (Jeffreys-like prior distribution): computationally unstable
- brglm (improvement on brlr): doesn’t do enough smoothing
- BBR (Laplace prior distribution): OK, not quite as good as bayesglm
- Non-Bayesian machine learning algorithms: understate uncertainty in predictions
Information in prior distributions
- Informative prior dist
  - A full generative model for the data
- Noninformative prior dist
  - Let the data speak
  - Goal: valid inference for any θ
- Weakly informative prior dist
  - Purposely include less information than we actually have
  - Goal: regularization, stabilization
Weakly informative priors for logistic regression coefficients
- Separation in logistic regression
- Some prior info: logistic regression coefs are almost always between −5 and 5:
  - 5 on the logit scale takes you from 0.01 to 0.50, or from 0.50 to 0.99
  - Smoking and lung cancer
- Independent Cauchy prior dists with center 0 and scale 2.5
- Rescale each predictor to have mean 0 and sd 1/2
- Fast implementation using EM; easy adaptation of glm
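A sketch of the rescaling convention (using the population sd and a simple binary check as simplifying assumptions; the arm package's own rules are more detailed):

```python
import statistics

def rescale(x):
    """Shift a predictor to mean 0 and scale it to sd 0.5; binary
    inputs are only centered, not scaled (a simplified convention)."""
    m = statistics.mean(x)
    if set(x) <= {0, 1}:                # binary: center only
        return [xi - m for xi in x]
    s = statistics.pstdev(x)            # population sd, a simplifying choice
    return [(xi - m) / (2 * s) for xi in x]

income = [1, 2, 3, 4, 5]                # hypothetical predictor values
z = rescale(income)
print(round(statistics.pstdev(z), 2))   # 0.5
```

Putting every coefficient on this common scale is what makes a single default prior scale of 2.5 reasonable across predictors.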
Prior distributions
[Figure: prior densities for a coefficient θ, plotted over −10 to 10]
Another example
 Dose   #deaths/#animals
−0.86        0/5
−0.30        1/5
−0.05        3/5
 0.73        5/5
- Slope of a logistic regression of Pr(death) on dose:
  - Maximum likelihood est is 7.8 ± 4.9
  - With weakly informative prior: Bayes est is 4.4 ± 1.9
- Which is truly conservative?
- The sociology of shrinkage
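The maximum-likelihood slope can be reproduced from the table above. This sketch uses plain Newton-Raphson rather than the actual glm machinery:

```python
import math

# Bioassay data from the slide: dose, deaths out of 5 animals each.
doses  = [-0.86, -0.30, -0.05, 0.73]
deaths = [0, 1, 3, 5]
n = 5

def fit_mle(iters=25):
    """Maximum-likelihood logistic regression (intercept a, slope b)
    via Newton-Raphson on the binomial log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(doses, deaths):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            g0 += y - n * p                  # score for the intercept
            g1 += (y - n * p) * x            # score for the slope
            w = n * p * (1.0 - p)            # IWLS-style weight
            h00 += w; h01 += w * x; h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det     # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_mle()
print(round(b, 1))  # close to the 7.8 reported on the slide
```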
Maximum likelihood and Bayesian estimates
[Figure: probability of death vs. dose, with fitted curves from glm and bayesglm]
Conservatism of Bayesian inference
- Problems with maximum likelihood when data show separation:
  - Coefficient estimate of −∞
  - Estimated predictive probability of 0 for new cases
- Is this conservative?
- Not if evaluated by log score or predictive log-likelihood
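The log-score point can be made concrete: a predicted probability of 0 is ruinous the moment the event occurs even once. The clipping constant eps is an illustrative assumption:

```python
import math

def log_score(p, y, eps=1e-12):
    """Log predictive score for a single binary outcome; probabilities
    are clipped away from 0 and 1 so the log is finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p if y == 1 else 1.0 - p)

# The "conservative-looking" extreme prediction Pr(y=1) = 0 scores far
# worse than a shrunk prediction when the event happens.
print(round(log_score(0.0, 1), 1))   # -27.6 (at eps = 1e-12)
print(round(log_score(0.05, 1), 2))  # -3.0: shrinkage bounds the loss
```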
Which one is conservative?

[Figure: the same plot of probability of death vs. dose (0 to 20), comparing the glm and bayesglm fits]
Prior as population distribution

- Consider many possible datasets
- The "true prior" is the distribution of β's across these datasets
- Fit one dataset at a time
- A "weakly informative prior" has less information (wider variance) than the true prior
- Open question: how to formalize the tradeoffs from using different priors?
Evaluation using a corpus of datasets

- Compare classical glm to Bayesian estimates using various prior distributions
- Evaluate using 5-fold cross-validation and average predictive error
- The optimal prior distribution for the β's is approximately Cauchy(0, 1)
- Our Cauchy(0, 2.5) prior distribution is weakly informative!
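The evaluation scheme can be sketched on a toy problem. This is an illustrative sketch, not the corpus or code from the actual study (the data generator, fitting routine, and candidate scales are all invented): each prior scale is scored by 5-fold cross-validated average −log test likelihood, using a one-coefficient logistic model fit by gradient ascent.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_map(pairs, prior_scale, steps=1000, lr=0.02):
    # MAP estimate of a one-coefficient logistic model, Cauchy(0, prior_scale) prior
    b = 0.0
    for _ in range(steps):
        g = sum(xi * (yi - sigmoid(b * xi)) for xi, yi in pairs)
        g += -2 * b / (b * b + prior_scale ** 2)  # gradient of the log-prior
        b += lr * g
    return b

def neg_log_lik(pairs, b):
    # average -log predictive likelihood on held-out pairs (lower is better)
    return -sum(math.log(sigmoid(b * xi if yi else -b * xi))
                for xi, yi in pairs) / len(pairs)

# toy data generated with true coefficient 1.0
data = [(xi, int(random.random() < sigmoid(xi)))
        for xi in (random.gauss(0, 1) for _ in range(100))]

folds = [data[i::5] for i in range(5)]  # 5-fold split
results = {}
for scale in (0.5, 2.5, 10.0):
    losses = []
    for k in range(5):
        train = [p for j, f in enumerate(folds) if j != k for p in f]
        losses.append(neg_log_lik(folds[k], fit_map(train, scale)))
    results[scale] = sum(losses) / len(losses)
    print(f"prior scale {scale}: mean -log test likelihood = {results[scale]:.3f}")
```

The same loop, run over a real corpus of regressions rather than one toy dataset, is what produces a curve of predictive loss as a function of the prior scale.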
Expected predictive loss, averaged over a corpus of datasets

[Figure: −log test likelihood (0.29 to 0.33) vs. scale of prior (0 to 5), with curves for t priors with df = 0.5, 1.0, 2.0, 4.0, 8.0, plus BBR(l) and BBR(g) comparisons; the GLM baseline sits off the chart at 1.79]
Other examples of weakly informative priors

- Variance parameters
- Covariance matrices
- Population variation in a physiological model
- Mixture models
- Intentional underpooling in hierarchical models
End of part 1

- "Noninformative priors" are actually weakly informative
- "Weakly informative" is a more general and useful concept
  - Regularization:
    - Better inferences
    - Stability of computation (bayesglm)
- Why use weakly informative priors rather than informative priors?
  - Conformity with statistical culture ("conservatism")
  - Labor-saving device
  - Robustness
No-interaction model

- Before-after data with treatment and control groups
- Default model: constant treatment effects
- Fisher's classical null hypothesis: effect is zero for all cases
- Regression model: y_i = T_i θ + X_i β + ε_i

[Figure: "after" measurement y vs. "before" measurement x, with fitted lines for the control and treatment groups]
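The constant-effects regression can be sketched on simulated data. This is an illustrative sketch with invented numbers, not data from the talk: generate before-after observations with a constant treatment effect θ = 2, then recover θ and β by ordinary least squares, using the fact that OLS with an intercept equals OLS on centered variables.

```python
import random

random.seed(2)

n = 1000
T, x, y = [], [], []
for i in range(n):
    Ti = i % 2                     # alternate treatment assignment
    xi = random.gauss(0, 1)        # "before" measurement
    # constant treatment effect theta = 2, slope beta = 0.8
    yi = 2.0 * Ti + 0.8 * xi + random.gauss(0, 0.5)
    T.append(Ti); x.append(xi); y.append(yi)

def center(v):
    m = sum(v) / len(v)
    return [u - m for u in v]

# Centering absorbs the intercept, leaving a 2x2 normal-equations solve.
Tc, xc, yc = center(T), center(x), center(y)
Stt = sum(t * t for t in Tc)
Sxx = sum(u * u for u in xc)
Stx = sum(t * u for t, u in zip(Tc, xc))
Sty = sum(t * v for t, v in zip(Tc, yc))
Sxy = sum(u * v for u, v in zip(xc, yc))
det = Stt * Sxx - Stx * Stx
theta = (Sty * Sxx - Stx * Sxy) / det  # estimated treatment effect
beta = (Stt * Sxy - Stx * Sty) / det   # estimated "before" slope
print(theta, beta)
```

Because the model forces a single θ for every unit, the fitted treatment and control lines are parallel; the next slides are about what happens when real data violate that assumption.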
Actual data show interactions

- Treatment interacts with the "before" measurement
- Before-after correlation is higher for controls than for treated units
- Examples:
  - An observational study of legislative redistricting
  - An experiment with pre-test, post-test data
  - Congressional elections with incumbents and open seats
Observational study of legislative redistricting: before-after data

[Figure: estimated partisan bias, adjusted for state (−0.05 to 0.05), vs. estimated partisan bias in the previous election, with separate symbols for no redistricting, bipartisan redistricting, Democratic redistricting, and Republican redistricting; annotations mark the directions favoring Democrats and favoring Republicans]
Educational experiment: correlation between pre-test and post-test data for controls and for treated units

[Figure: pre-test/post-test correlation (0.8 to 1.0) vs. grade (1 to 4), plotted separately for controls and treated units]
Correlation between two successive Congressional elections for incumbents running (controls) and open seats (treated)

[Figure: correlation (0.0 to 0.8) vs. year (1900 to 2000), plotted separately for incumbents and open seats]
Interactions as variance components

- y_i = T_i θ + X_i β + ε_i + η_i
- Unit-level "error term" η_i
- For control units, η_i persists from time 1 to time 2
- For treatment units, η_i changes:
  - Subtractive treatment error (η_i only at time 1)
  - Additive treatment error (η_i only at time 2)
  - Replacement treatment error
- Under all these models, the before-after correlation is higher for controls than for treated units
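The claim in the last bullet can be checked by simulation. A minimal sketch with invented variances, using the replacement-error variant: controls keep their unit-level term η from time 1 to time 2, while for treated units η is redrawn at time 2, which drives the before-after correlation toward zero for the treated group. (For simplicity the same "before" draws are reused for both scenarios.)

```python
import random

random.seed(1)

def corr(a, b):
    # plain Pearson correlation
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

before, after_control, after_treated = [], [], []
for _ in range(5000):
    eta = random.gauss(0, 1)  # persistent unit-level term
    before.append(eta + random.gauss(0, 0.5))
    # control: the same eta persists to time 2
    after_control.append(eta + random.gauss(0, 0.5))
    # treated (replacement error): eta is redrawn at time 2
    after_treated.append(random.gauss(0, 1) + random.gauss(0, 0.5))

r_control = corr(before, after_control)  # about 1 / 1.25 = 0.8 in expectation
r_treated = corr(before, after_treated)  # about 0 in expectation
print(r_control, r_treated)
```

The subtractive and additive variants give a smaller but still positive gap; under all three, controls show the higher before-after correlation, matching the patterns in the redistricting, educational, and incumbency examples.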
End of part 2
▸ Treatment and control can be modeled asymmetrically
▸ Additional variance component
Federal spending · Vote preferences · Income and voting · Incentives in sample surveys
Examples of interactions in regression
▸ Federal spending by state, year, category (50 × 19 × 10)
▸ Vote preference given state and demographic variables (50 × 2 × 2 × 4 × 4)
▸ Rich state, poor state, red state, blue state (50 × 2 for each election)
▸ Meta-analysis of incentives in sample surveys (26)
General concerns
▸ Lots of potential interactions
▸ Setting high-level interactions to zero? Too extreme, especially when interactions are of substantive interest
▸ A simple hierarchical model for interactions is too crude
▸ Model: large main effects can have large interactions. In a hierarchical setting, the model should come "naturally"
Federal spending by state
▸ Federal spending by state, year, category (50 × 19 × 10)
▸ For each spending category, a 50 × 19 data structure
▸ y_jt = α_j + β_t + γ_jt
▸ Possible model: γ_jt ∼ N(0, A + B|α_j β_t|)
▸ Some example data
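The possible model can be turned into a small generative sketch. A and B here are invented hyperparameters, not estimates from the spending data; the point is just that cells with large main-effect products |α_j β_t| get noisier interactions γ_jt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the talk; A, B and the main-effect scales are assumptions.
J, T = 50, 19
A, B = 0.001, 2.0
alpha = rng.normal(0, 0.05, J)        # state main effects
beta = rng.normal(0, 0.05, T)         # year main effects
prod = np.abs(np.outer(alpha, beta))  # |alpha_j * beta_t| for every cell

# Interaction variance grows with the main-effect product: A + B*|alpha_j*beta_t|
gamma = rng.normal(0, np.sqrt(A + B * prod))

# Cells with above-median |alpha_j*beta_t| should show larger |gamma_jt|
hi = np.abs(gamma[prod > np.median(prod)]).mean()
lo = np.abs(gamma[prod <= np.median(prod)]).mean()
print(hi > lo)
```

This is the pattern the next figure looks for in the actual data: |γ_jt| increasing with |α_j β_t|.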
Interactions |γ_jt| plotted vs. main effects |α_j β_t|
[Figure: |γ_jt| (0–0.20) plotted against |α_j β_t| (0–0.08) for the "Disretire" category of federal spending, inflation-adjusted]
Logistic regression for pre-election polls
▸ Logistic regression predicting vote from demographic and geographic factors: sex, ethnicity, age, education, state
▸ Hierarchical model for 4 age levels, 4 education levels, 16 age × education, 50 states
▸ Also consider interactions such as ethnicity × state
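Treating a batch like the 16 age × education terms hierarchically amounts to partial pooling toward zero. A minimal sketch of that shrinkage, with made-up cell estimates, standard errors, and hierarchical sd τ (not the actual poll fit):

```python
import numpy as np

# Hypothetical raw age x education interaction estimates (logit scale)
# and their standard errors; tau is an assumed hierarchical sd.
raw = np.array([0.40, -0.25, 0.10, -0.60])  # a few of the 16 cells
se = np.array([0.30, 0.15, 0.50, 0.20])
tau = 0.25

# Classical partial-pooling formula: shrink each estimate toward the
# group mean (here 0) by a factor depending on se relative to tau.
shrinkage = tau**2 / (tau**2 + se**2)
pooled = shrinkage * raw

print(np.round(pooled, 3))  # noisier cells (larger se) are pulled harder toward 0
```

In the full model τ is itself estimated from the 16 cells, so the data decide how much the interaction batch is shrunk.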
Prediction error as function of # of predictors
[Figure: two panels, "MSE: training sample" and "MSE: test sample", each plotting MSE (0.22–0.25) against number of parameters (0–150)]
Red state, blue state, rich state, poor state
▸ Richer voters favor the Republicans, but
▸ Richer states favor the Democrats
▸ Hierarchical logistic regression: predict your vote given your income and your state ("varying-intercept model")
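In formula terms the varying-intercept model is Pr(Republican vote) = logit⁻¹(a_state + b · income). A toy version with invented coefficients (the real a_state values come from the hierarchical fit to the poll data):

```python
import numpy as np

def invlogit(x):
    return 1 / (1 + np.exp(-x))

# Assumed coefficients for illustration only: a common income slope b,
# plus a state-specific intercept a_state (higher = more Republican).
b = 0.3
a_state = {"Connecticut": -0.6, "Ohio": 0.0, "Mississippi": 0.5}

income = np.linspace(-2, 2, 5)  # standardized income
for state, a in a_state.items():
    p = invlogit(a + b * income)
    print(state, np.round(p, 2))
```

Within each state the probability rises with income; across states the whole curve shifts up or down, which is exactly the pattern in the figure below.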
Varying-intercept model
[Figure: probability of voting Republican (0.25–0.75) vs. income (−2 to 2), one fitted curve each for Connecticut, Ohio, and Mississippi]
Varying-intercept, varying-slope model
[Figure: probability of voting Republican (0.4–0.8) vs. income (−2 to 2) for Connecticut, Ohio, and Mississippi, with state-specific intercepts and slopes]
Interactions!
[Figure: "Avg Income 2000 vs. Var Slope 2000": estimated income slope (0.0–0.5) for each state plotted against average state income in $10k (2.0–3.5), points labeled by state abbreviation]
3-way interactions!
[Figure: six panels, one per election year from 1980 to 2000: probability of voting Republican vs. income, with separate curves for Connecticut, Ohio, and Mississippi in each panel]
Meta-analysis of effects of incentives on survey response rates

▸ 6 factors:
  ▸ Incentive or not
  ▸ Value of incentive
  ▸ Form (gift or cash)
  ▸ Timing (before or after)
  ▸ Mode (telephone or face-to-face)
  ▸ Burden (short or long survey)
▸ Models:
  ▸ No interactions: estimates don't make sense
  ▸ Interactions: estimates are out of control
Model without interactions
▸ Estimated effects on response rate (in percentage points):

                         Beta   (s.e.)
    Intercept             1.4   (1.6)
    Value of incentive    0.34  (0.17)
    Prepayment            2.8   (1.8)
    Gift                 −6.9   (1.5)
    Burden                3.3   (1.3)

▸ Will a low-value postpaid gift really reduce response rates by 7 percentage points??
Models with interactions
                                 Model I       Model II      Model III     Model IV
    Constant                     60.7 (2.2)    60.8 (2.5)    61.0 (2.5)    60.1 (2.5)
    Incentive                     5.4 (0.7)     3.7 (0.8)     2.8 (1.0)     6.1 (1.2)
    Mode                         15.2 (4.7)    16.1 (5.1)    16.0 (4.9)    18.0 (4.6)
    Burden                       −7.2 (4.3)    −8.9 (5.0)    −8.7 (5.0)    −9.9 (5.0)
    Mode × Burden                              −7.6 (9.8)    −7.8 (9.4)    −4.9 (9.1)
    Incentive × Value                           0.14 (0.03)   0.33 (0.09)   0.26 (0.09)
    Incentive × Timing                          4.4 (1.3)     1.7 (1.7)    −0.2 (2.1)
    Incentive × Form                            1.4 (1.3)     1.1 (1.2)    −1.2 (2.0)
    Incentive × Mode                           −2.3 (1.6)    −2.0 (1.7)     7.8 (2.9)
    Incentive × Burden                          4.8 (1.5)     5.4 (1.8)    −5.2 (2.7)
    Incentive × Value × Timing                                0.40 (0.17)   0.58 (0.18)
    Incentive × Value × Burden                               −0.06 (0.06)   1.10 (0.24)
    Incentive × Timing × Burden                                            11.1 (3.9)
    Incentive × Value × Form                                                0.30 (0.20)
    Incentive × Value × Mode                                               −1.20 (0.24)
    Incentive × Timing × Form                                               9.9 (2.7)
    Incentive × Timing × Mode                                             −17.4 (4.1)
    Incentive × Form × Mode                                                −0.3 (2.5)
    Incentive × Form × Burden                                               5.9 (3.2)
    Incentive × Mode × Burden                                              −5.8 (3.0)
    Within-study sd, σ            4.2 (0.3)     3.6 (0.3)     3.6 (0.3)     2.8 (0.3)
    Between-study sd, τ          18 (2)        19 (2)        18 (2)        18 (2)
Structured hierarchical models
▸ Need to go beyond exchangeability to shrink batches of parameters in a reasonable way
▸ For example, parameter matrices α_jk don't look like exchangeable vectors
▸ Similar problems arise in shrinking higher-order terms in neural nets, wavelets, tree models, image models, . . .
▸ Recall the "blessing of dimensionality": as the number of factors, and the number of levels per factor, increases, more information is available to estimate the hyperparameters of the big model
▸ In the background: advances in Bayesian computation, including parameter expansion (Meng, Liu, Liu, Rubin, van Dyk), adaptive Metropolis algorithms (Pasarica), and structured computations (Kerman)
Andrew Gelman Creating structured and flexible models: some open problems
Weakly informative priorsInteractions in before-after studies
Interactions in regressionsConclusions
Federal spendingVote preferencesIncome and votingIncentives in sample surveys
Structured hierarchical models
I Need to go beyond exchangeability to shrink batches ofparameters in a reasonable way
I For example, parameter matrices αjk don’t look likeexchangeable vectors
I Similar problems arise in shrinking higher-order terms inneural nets, wavelets, tree models, image models, . . .
I Recall the “blessing of dimensionality”: as the number offactors, and the number of levels per factor, increases, moreinformation is available to estimate the hyperparameters ofthe big model
I In the background: advances in Bayesian computationincluding parameter expansion (Meng, Liu, Liu, Rubin, vanDyk), adaptive Metropolis algorithms (Pasarica), structuredcomputations (Kerman)
Andrew Gelman Creating structured and flexible models: some open problems
Weakly informative priorsInteractions in before-after studies
Interactions in regressionsConclusions
Federal spendingVote preferencesIncome and votingIncentives in sample surveys
Structured hierarchical models
I Need to go beyond exchangeability to shrink batches ofparameters in a reasonable way
I For example, parameter matrices αjk don’t look likeexchangeable vectors
I Similar problems arise in shrinking higher-order terms inneural nets, wavelets, tree models, image models, . . .
I Recall the “blessing of dimensionality”: as the number offactors, and the number of levels per factor, increases, moreinformation is available to estimate the hyperparameters ofthe big model
I In the background: advances in Bayesian computationincluding parameter expansion (Meng, Liu, Liu, Rubin, vanDyk), adaptive Metropolis algorithms (Pasarica), structuredcomputations (Kerman)
Andrew Gelman Creating structured and flexible models: some open problems
Weakly informative priorsInteractions in before-after studies
Interactions in regressionsConclusions
Federal spendingVote preferencesIncome and votingIncentives in sample surveys
Structured hierarchical models
I Need to go beyond exchangeability to shrink batches ofparameters in a reasonable way
I For example, parameter matrices αjk don’t look likeexchangeable vectors
I Similar problems arise in shrinking higher-order terms inneural nets, wavelets, tree models, image models, . . .
I Recall the “blessing of dimensionality”: as the number offactors, and the number of levels per factor, increases, moreinformation is available to estimate the hyperparameters ofthe big model
I In the background: advances in Bayesian computationincluding parameter expansion (Meng, Liu, Liu, Rubin, vanDyk), adaptive Metropolis algorithms (Pasarica), structuredcomputations (Kerman)
Andrew Gelman Creating structured and flexible models: some open problems
Weakly informative priorsInteractions in before-after studies
Interactions in regressionsConclusions
Federal spendingVote preferencesIncome and votingIncentives in sample surveys
End of part 3
I Interactions are important
I Technical challenges in modeling and computation
I Blessing of dimensionality
Another example: who supports school vouchers?
What have we learned?
I Models need structure but not too much structure
I Interactions are important
I Treatment interactions in before-after studies
I 2-way, 3-way, . . . interactions in regression models
I Conservatism in statistics
I Weak prior information is key
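As a concrete sketch of that last point (the data and step sizes below are invented for illustration): a weakly informative Cauchy(0, 2.5) prior on a logistic-regression coefficient, in the spirit of the Gelman, Jakulin, Pittau, and Su proposal, keeps the estimate finite even under complete separation, where the maximum likelihood estimate diverges.

```python
import numpy as np

# Hypothetical illustration: a weakly informative Cauchy(0, 2.5) prior on a
# single logistic-regression slope.  The toy data are completely separated,
# so the MLE is infinite; the MAP estimate under the weak prior stays finite.

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])   # complete separation
scale = 2.5                          # Cauchy prior scale

def grad(beta):
    """Gradient of the log posterior: logistic log-likelihood plus
    log Cauchy(0, scale) prior density."""
    p = 1 / (1 + np.exp(-beta * x))
    return np.sum((y - p) * x) - 2 * beta / (scale**2 + beta**2)

beta = 0.0
for _ in range(5000):                # simple gradient ascent to the MAP
    beta += 0.05 * grad(beta)

print(beta)   # finite; without the prior term, beta would grow without bound
```

The prior is weak enough to let the data dominate when they are informative, but strong enough to rule out absurdly large coefficients — "structured enough to learn from data but not so strong as to overwhelm the data."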