Regression Discontinuity Designs

PUBL0050 – Week 9

Jack Blumenau, Department of Political Science

UCL

1 / 53

Course outline

- 1: Potential Outcomes and Causal Inference
- 2: Randomized Experiments
- 3: Selection on Observables (I)
- 4: Selection on Observables (II)
- 5: Difference-in-Differences and Panel Data
- 6: Synthetic Control Method
- 7: Instrumental Variables (I)
- 8: Instrumental Variables (II)
- 9: Regression Discontinuity Designs
- 10: Overview and Review

2 / 53

Lecture outline

Motivation

Sharp RDD

RDD estimation

Validating RDD

Fuzzy RDD

Conclusions

3 / 53

Motivation

Running example

What happens when extremists win primaries?
What are the consequences of nominating an extremist candidate in a primary election for downstream electoral outcomes? Andy Hall (2015) studies a sample of primary elections for the US House between 1980 and 2010 where the primary was contested between an extremist candidate and a moderate candidate. Extremism is determined by receiving donations from extreme interest groups. He uses the outcomes of these races to compare the electoral outcomes of extremists and moderates in subsequent general elections.

- Outcome (Yi,p,t): Party vote share in district i in the general election at time t
- Treatment (Di,p,t): 1 if the party’s primary winner in district i is an “extremist”
- Running variable (Xi,p,t): The extremist candidate’s margin of victory in the primary election in district i

4 / 53

Extremist candidates

Why can’t we interpret this causally?

vote_share_extreme <- mean(hall$vote_share_general[hall$extreme == 1])
vote_share_moderate <- mean(hall$vote_share_general[hall$extreme == 0])

vote_share_extreme - vote_share_moderate

## [1] -0.02736295

Selection bias: Extremists may differ in many ways from moderates
- Candidate differences
  - Less experienced
  - Less well financed
  - Less supported by local party
- District differences
  - May be selected in districts where the party performs poorly historically

5 / 53

Extremist candidates

What approach might we use to estimate a causal effect?
- Condition on observed differences between extremists and moderates
- Find an instrument that increases the probability of an extremist winning the primary, but that has no other effect on the general election
- Use variation over time in party vote shares in a difference-in-differences analysis

An alternative approach (RDD):
- Compare the vote share of parties in districts where extremists narrowly won their primary races to the vote share of parties in districts where extremists narrowly lost
- → Assume that winning the election is as good as random in close races

6 / 53

RDD outline
- Each unit has a score on a “running variable” which determines treatment
- Treatment is:
  - assigned to units whose score on the running variable exceeds a known cutoff
  - not assigned to units whose score is below the cutoff
- Key feature: the probability of receiving the treatment changes abruptly at the cutoff
- The discontinuous change in this probability can be used to learn about the local causal effect of the treatment on an outcome of interest

Intuition
Units with scores barely below the cutoff can be used as counterfactuals for units with scores barely above it.
I.e. districts where the extremist narrowly wins their election are comparable to districts in which the extremist narrowly loses.

7 / 53

RDD outline

RDD is an appropriate strategy when we know that treatment and control conditions are not randomly assigned, but we know the assignment rule that determines how units are assigned or selected into the treatment.

The design relies on us having access to a forcing variable that determines treatment status.

RDD is widely used in rule-based settings, where it is clear how and when Di = 1 is assigned:

- Elections
- Administrative programmes
- Geographic boundaries

8 / 53

Sharp RDD

Sharp Regression Discontinuity Designs

Imagine that our binary treatment variable, Di, is completely determined by the value of an explanatory variable, Xi, according to:

\[
D_i = \mathbf{1}(X_i \geq c), \quad \text{so} \quad
D_i = \begin{cases} 1 & \text{if } X_i \geq c \\ 0 & \text{if } X_i < c \end{cases}
\]

where
- Xi is known as the “forcing” or “running” variable, and may be correlated with the outcomes (Yi) and potential outcomes (Y1i, Y0i)
- c is a fixed cutoff point

Implications:
- Di is a deterministic function of Xi → when we know Xi, we know Di
- Di is a discontinuous function of Xi → no matter how close to c we are, Di = 0 until Xi reaches c
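The sharp assignment rule can be sketched in a few lines of R. The running-variable values and the cutoff of 0 below are made up for illustration and are not taken from the lecture data.

```r
# Hypothetical running-variable values and cutoff (illustrative only)
x <- c(-0.20, -0.01, 0.00, 0.01, 0.15)
cutoff <- 0

# Sharp rule: treatment switches on exactly when x reaches the cutoff
d <- as.integer(x >= cutoff)

# Knowing x tells us d with certainty: no unit below the cutoff is treated
assignment <- data.frame(x = x, d = d)
print(assignment)
```

Because `d` is computed from `x` alone, there is no overlap: every value of `x` maps to exactly one treatment status.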

9 / 53

Examples of Xi, Di, and c
- Eggers (2015)
  - Yi – turnout (aggregate)
  - Di – proportional representation in a French town
  - Xi – population of the town
  - c – 3500
- de Kadt (2017)
  - Yi – turnout (individual)
  - Di – voting in South Africa in 1994
  - Xi – age in 1994
  - c – 18
- Hall (2015)
  - Yi – party vote share in general election
  - Di – primary election won by an extremist
  - Xi – margin of victory in the primary election
  - c – 0

10 / 53

Graphical illustration

Do scholarships increase earnings?
Thistlethwaite and Campbell (1960) study the effects of college scholarships on employment outcomes for students later in life. They study the allocation of “merit awards”, which were given out to students based on a score: anyone with a score at or above some cutoff received the merit award, whereas everyone below that cutoff did not.

- Outcome (Yi): Adult earnings ($)
- Treatment (Di): Receipt of a merit award
- Running variable (Xi): Score on a standardized test
- Cutoff (c): Scores of 2000 or more on Xi result in a merit award

11 / 53

Graphical illustration (Xi and Di)

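[Figure: treatment status D (0 to 1) plotted against the running variable X (1600 to 2400), with a vertical line at Xi = c; units on one side are labelled "Assigned to control" and units on the other "Assigned to treatment".]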

12 / 53

Graphical illustration (Xi and Yi)

[Figure: outcome Y (20000 to 35000) plotted against the running variable X (1600 to 2400), with a vertical line at Xi = c.]

13 / 53

Graphical illustration (Xi , Y1i and Y0i)

[Figure: Y (20000 to 35000) plotted against X (1600 to 2400), with four labelled segments: observed outcomes (control), observed outcomes (treatment), unobserved outcomes (control) and unobserved outcomes (treatment).]

14 / 53

Graphical illustration (τATE at c)

[Figure: Y (20000 to 35000) plotted against X (1600 to 2400), with a vertical line at Xi = c; the jump in Y at the cutoff is labelled LATE.]

15 / 53

Sharp RDD: Identification

We want to be able to estimate the difference between Di = 1 and Di = 0 at the threshold c.

Can we estimate this?

\[
\begin{aligned}
\tau_{LATE} &= E[Y_{1i} \mid X_i = c] - E[Y_{0i} \mid X_i = c] \\
            &= E[Y_i \mid X_i = c, D_i = 1] - E[Y_i \mid X_i = c, D_i = 0]
\end{aligned}
\]

No! We never observe both Di = 1 and Di = 0 at c.

We have a complete absence of common support: no treatment unit will have the same value of Xi as a control unit, because Di is a discontinuous function of Xi (where the discontinuity is defined at c).

16 / 53

Sharp RDD: Identification

Identification assumptions
E[Y1i | Xi, Di] and E[Y0i | Xi, Di] are continuous in X around the threshold X = c

Identification result
The treatment effect at the threshold c is identified by:

\[
\begin{aligned}
\tau_{LATE} &= E[Y_{1i} - Y_{0i} \mid X_i = c] \\
            &= E[Y_{1i} \mid X_i = c] - E[Y_{0i} \mid X_i = c] \\
            &= \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]
\end{aligned}
\]

Implications:
- We extrapolate a small amount to infer potential outcomes at c
- Without further assumptions, the LATE only identifies the ATE at c
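The limit argument can be illustrated numerically. The sketch below assumes two made-up conditional expectation functions, `mu0` and `mu1`, both continuous at the cutoff; they are hypothetical and not estimated from any data in the lecture. It shows the gap between the observed conditional expectation just above and just below the cutoff converging to the effect at the cutoff.

```r
# Hypothetical conditional expectation functions, both continuous in x
mu0 <- function(x) 0.60 + 0.20 * x   # E[Y0 | X = x]
mu1 <- function(x) 0.50 + 0.35 * x   # E[Y1 | X = x]
cutoff <- 0

# Observed CEF: we only see mu1 at or above the cutoff and mu0 below it
observed_cef <- function(x) ifelse(x >= cutoff, mu1(x), mu0(x))

# Approach the cutoff from above and from below with shrinking distances
eps <- 10^(-(1:6))
gap <- observed_cef(cutoff + eps) - observed_cef(cutoff - eps)

# The one-sided gap converges to the true effect at the cutoff
true_effect <- mu1(cutoff) - mu0(cutoff)
print(data.frame(eps = eps, gap = gap))
print(true_effect)
```

If either function jumped at the cutoff for reasons other than treatment, the gap would converge to the wrong quantity, which is why continuity is the key assumption.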

17 / 53

Local nature of the RD effect

[Figure: three panels plotting E[Y1|X] and E[Y0|X] against X (1600 to 2400), titled "No heterogeneity", "Moderate heterogeneity" and "Severe heterogeneity".]

18 / 53

RDD estimation

Estimation

1. Recode the running variable to deviations from c: X̃i = Xi − c
   - X̃i = 0 if Xi = c
   - X̃i > 0 if Xi > c and so Di = 1
   - X̃i < 0 if Xi < c and so Di = 0

2. Decide on a regression model for E[Yi | Xi, Di]
   - Linear, same slope for E[Y0i | Xi] and E[Y1i | Xi]
   - Linear, different slopes
   - Polynomial
   - Local linear

3. Produce an RD plot, visualising the discontinuity

4. Inference via regression standard errors

19 / 53

Estimation
Consider the following model where X̃i = Xi − c:

\[
E[Y_i \mid D_i, X_i] = \alpha + \beta \tilde{X}_i + \tau D_i
\]

Why does τ identify the LATE in this model (i.e. the difference between E[Yi | Xi = c, Di = 1] and E[Yi | Xi = c, Di = 0])?

If Xi = c then X̃i = 0:

\[
E[Y_i \mid \tilde{X}_i = 0, D_i = 1] = \alpha + \beta \cdot 0 + \tau \cdot 1 = \alpha + \tau
\]

and

\[
E[Y_i \mid \tilde{X}_i = 0, D_i = 0] = \alpha + \beta \cdot 0 + \tau \cdot 0 = \alpha
\]

and so:

\[
E[Y_i \mid X_i = c, D_i = 1] - E[Y_i \mid X_i = c, D_i = 0] = (\alpha + \tau) - \alpha = \tau
\]
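The algebra above can be checked on simulated data. The following is a minimal sketch with invented numbers (a cutoff of 2000, a true jump of 3000 and a noise standard deviation of 1000 are all assumptions chosen for illustration). It confirms that the coefficient on the treatment indicator equals the difference between the two fitted values at the cutoff.

```r
set.seed(1)

# Simulated data with a known jump at the cutoff (all values hypothetical)
n <- 1000
cutoff <- 2000
x <- runif(n, 1600, 2400)
d <- as.integer(x >= cutoff)
x_tilde <- x - cutoff                      # centre the running variable
y <- 25000 + 5 * x_tilde + 3000 * d + rnorm(n, sd = 1000)

# Same-slope linear model: E[Y | X, D] = alpha + beta * x_tilde + tau * D
fit <- lm(y ~ x_tilde + d)
tau_hat <- unname(coef(fit)["d"])

# Fitted values for a treated and a control unit exactly at the cutoff
at_cutoff <- data.frame(x_tilde = c(0, 0), d = c(1, 0))
pred <- predict(fit, newdata = at_cutoff)
jump <- unname(pred[1] - pred[2])

print(c(tau_hat = tau_hat, jump_at_cutoff = jump))
```

The two printed numbers coincide by construction, which is the regression counterpart of the derivation (α + τ) − α = τ.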

20 / 53

Estimation in R (I)

same_slope_model <- lm(vote_share_general ~ extreme + running_variable,
                       data = hall_subset)

different_slope_model <- lm(vote_share_general ~ extreme * running_variable,
                            data = hall_subset)

polynomial_model <- lm(vote_share_general ~ extreme * running_variable +
                         extreme * I(running_variable^2) +
                         extreme * I(running_variable^3),
                       data = hall_subset)

21 / 53

Linear model, same slopes

[Figure: general election vote share (0.4 to 0.8) plotted against extremist primary election margin (−0.2 to 0.2), with the fitted model below.]

\[
E[Y_i \mid \tilde{X}_i, D_i] = \alpha + \beta_1 \tilde{X}_i + \tau D_i
\]

22 / 53

Linear model, different slopes

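[Figure: general election vote share (0.4 to 0.8) plotted against extremist primary election margin (−0.2 to 0.2), with the fitted model below.]

\[
E[Y_i \mid \tilde{X}_i, D_i] = \alpha + \beta_{01} \tilde{X}_i + \beta_1 (\tilde{X}_i D_i) + \tau D_i
\]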

23 / 53

Non-linear model

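[Figure: general election vote share (0.4 to 0.8) plotted against extremist primary election margin (−0.2 to 0.2), with the fitted model below.]

\[
E[Y_i \mid \tilde{X}_i, D_i] = \alpha + \beta_{01} \tilde{X}_i + \beta_{02} \tilde{X}_i^2 + \beta_{03} \tilde{X}_i^3 + \beta_1 (\tilde{X}_i D_i) + \beta_2 (\tilde{X}_i^2 D_i) + \beta_3 (\tilde{X}_i^3 D_i) + \tau D_i
\]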

24 / 53

Comparing models

                Same slope   Different slope   Polynomial
                   (1)            (2)             (3)

extreme          −0.098         −0.095          −0.116
                 (0.034)        (0.034)         (0.074)

Constant          0.643          0.606           0.624
                 (0.019)        (0.024)         (0.053)

Observations       233            233             233
R2                0.035          0.060           0.102

Note: Standard errors in parentheses

Implication: When an extremist wins a “coin-flip” election over a moderate, the party’s general-election vote share decreases on average by approximately 9-12 percentage points.

25 / 53

Non-linearity mistaken for discontinuity

It is often the case that the choice of functional form for X̃ is consequential for the inference about τ̂LATE:

[Figure: outcome (−0.5 to 1.5) plotted against the running variable (0 to 1).]

26 / 53

Bandwidth selection

One way to reduce this type of model dependence is to focus only on observations that are close to the cutoff.

In practice, only keep observations with:

\[
c - h \leq X_i \leq c + h
\]

where h is a positive value determining the window or bandwidth size.

The bandwidth, h, controls the width of the neighbourhood around the cutoff that is used to calculate the discontinuity.

h directly affects the properties of the estimation process, and empirical findings can be sensitive to the particular value that one chooses for h.
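Restricting estimation to the window c − h ≤ Xi ≤ c + h amounts to a simple subsetting step before fitting the model. The sketch below uses simulated data; the data frame `sim`, the true effect of −0.1 and the bandwidth of 0.1 are all hypothetical choices made for illustration.

```r
set.seed(2)

# Simulated data with a known discontinuity of -0.1 at zero (hypothetical)
n <- 2000
x <- runif(n, -0.5, 0.5)
d <- as.integer(x >= 0)
y <- 0.6 + 0.2 * x - 0.1 * d + rnorm(n, sd = 0.05)
sim <- data.frame(y = y, x = x, d = d)

cutoff <- 0
h <- 0.1                                    # bandwidth chosen by hand here

# Keep only observations with c - h <= x <= c + h
in_window <- abs(sim$x - cutoff) <= h
local_data <- sim[in_window, ]

# Linear model with different slopes on each side, fitted within the window
local_fit <- lm(y ~ d * x, data = local_data)
print(nrow(local_data))
print(coef(local_fit)[["d"]])
```

Only about a fifth of the simulated observations survive the restriction, which is the source of the precision cost discussed next.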

27 / 53

Bandwidth selection

[Figure sequence: outcome (−5 to 10) plotted against the running variable (−30 to 30) with linear and quadratic fits, shown over successive slides for bandwidths of 30, 20, 10 and 5.]

Implications
Comparing average outcomes in a small neighbourhood to the right and left of the cutoff:

1. Makes estimates of the LATE less dependent on the functional form specification for X̃
2. Decreases the bias that comes from misspecification
3. Leads to a smaller sample size, thus increasing the variance

In picking h we face a bias-variance trade-off:
- Smaller values of h → less bias in τ̂LATE
- Smaller values of h → greater variance in τ̂LATE (i.e. SE(τ̂LATE) ↑)
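The trade-off can be seen in a small simulation. The sketch below assumes a deliberately curved (cubic) relationship between the outcome and the running variable, so that a linear model is misspecified over wide windows; the function `fit_at_bandwidth`, the true effect of −0.1 and the list of bandwidths are hypothetical.

```r
set.seed(3)

# Simulated data: cubic trend plus a true jump of -0.1 at zero (hypothetical)
n <- 4000
tau <- -0.1
x <- runif(n, -0.5, 0.5)
d <- as.integer(x >= 0)
y <- 0.6 + 2 * x^3 + tau * d + rnorm(n, sd = 0.05)
sim <- data.frame(y = y, x = x, d = d)

# Fit a linear, different-slopes model within a window of half-width h
fit_at_bandwidth <- function(h, data) {
  window <- data[abs(data$x) <= h, ]
  fit <- lm(y ~ d * x, data = window)
  est <- summary(fit)$coefficients["d", c("Estimate", "Std. Error")]
  c(h = h, n = nrow(window),
    estimate = est[["Estimate"]], se = est[["Std. Error"]])
}

bandwidths <- c(0.5, 0.25, 0.1, 0.05)
results <- as.data.frame(t(sapply(bandwidths, fit_at_bandwidth, data = sim)))
print(round(results, 4))
```

With wide windows the linear fit absorbs some of the curvature into the estimated jump, while narrow windows use fewer observations and report larger standard errors. Printing estimates across several bandwidths in this way is also the basis of a bandwidth sensitivity check.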

28 / 53

How do we pick h?

The choice of h is important for the estimates of τ̂LATE.

Two approaches to choosing h:

1. “Optimal” bandwidth selection
   - Use algorithmic bandwidth selection methods
   - Most common → Imbens-Kalyanaraman procedure
     - Choose h to balance the bias-variance trade-off
     - h is chosen to minimise the expected mean-squared error of the RD estimator

2. Reporting results from multiple bandwidths
   - In practice, it is common to show how much (if at all) the estimate of τ̂LATE changes as we vary the bandwidth

29 / 53

Extremist candidates – optimal bandwidth

library(rdd)

optimal_bandwidth <- IKbandwidth(X = hall$running_variable,
                                 Y = hall$vote_share_general,
                                 cutpoint = 0)

optimal_bandwidth

## [1] 0.0851

30 / 53

Extremist candidates – optimal bandwidth

rd_est <- RDestimate(vote_share_general ~ running_variable,
                     cutpoint = 0,
                     bw = optimal_bandwidth,
                     data = hall)

rd_est

##
## Call:
## RDestimate(formula = vote_share_general ~ running_variable, data = hall,
##     cutpoint = 0, bw = optimal_bandwidth)
##
## Coefficients:
##      LATE    Half-BW  Double-BW
##  -0.07504   -0.06580   -0.06792

31 / 53

Extremist candidates – bandwidth sensitivity

[Figure: estimated treatment effect (−0.4 to 0.1) plotted against bandwidth (0.05 to 0.25).]

32 / 53

Break

Validating RDD

Falsification checks

1. Balance checks: Are covariates discontinuous at the threshold?

2. Placebo thresholds: Do we estimate significant treatment effects at “placebo” thresholds, c∗?

3. Sorting: Are units able to “sort” around the threshold?

33 / 53

Balance checks

If treatment is “as good as random” around the threshold, then in expectation treated and control units around the threshold should be the same with respect to both observed and unobserved covariates.

We cannot check balance of unobserved covariates, but we can assess balance on observed covariates (Zi, not an instrument!):

- Visual inspection
  - Plot E[Zi | X̃i, Di] – there should be no discontinuities around c
  - The relationship between X̃i and Zi should be smooth around c
- RD model for covariates
  - Estimate E[Zi | X̃i, Di] = α + β01 X̃i + β1 (X̃i Di) + τZ Di
  - This should yield τZ = 0 if Zi is balanced at the threshold
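The covariate RD model can be sketched as follows. The data are simulated so that the pre-treatment covariate `z` varies smoothly with the running variable and has no jump at the cutoff; the variable names, the bandwidth of 0.2 and all numerical values are assumptions made for illustration.

```r
set.seed(4)

# Simulated pre-treatment covariate that is smooth through the cutoff
n <- 2000
x_tilde <- runif(n, -0.5, 0.5)              # running variable, centred at c
d <- as.integer(x_tilde >= 0)
z <- 0.3 + 0.5 * x_tilde + rnorm(n, sd = 0.1)
sim <- data.frame(z = z, x_tilde = x_tilde, d = d)

# Same RD specification as for the outcome, with the covariate on the left
h <- 0.2
balance_fit <- lm(z ~ d * x_tilde, data = sim[abs(sim$x_tilde) <= h, ])

# tau_Z is the coefficient on d; it should be close to zero under balance
tau_z <- coef(balance_fit)[["d"]]
print(tau_z)
print(confint(balance_fit)["d", ])
```

A confidence interval for τZ that comfortably covers zero is consistent with balance at the threshold, though it cannot rule out imbalance on unobserved covariates.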

34 / 53

Balance checks for extremist candidates

[Figure: estimated RD treatment effects (x-axis from −0.4 to 0.4) for three covariates: probability female, probability experienced, and share of donations.]

35 / 53

Placebo thresholds

We can also check whether the discontinuity only appears where it “should” appear, and that it is zero at other values of the cutoff.

If we have a placebo value c∗ ≠ c, then define X̃∗i = Xi − c∗ and estimate:

\[
E[Y_i \mid \tilde{X}^*_i, D_i] = \alpha + \beta_{01} \tilde{X}^*_i + \beta_1 (\tilde{X}^*_i D_i) + \tau^* D_i
\]

or more flexible specifications thereof.

Implication: If our RDD is valid, we should find no significant treatment effects, τ∗, for any c∗ ≠ c.
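A placebo exercise of this kind can be sketched on simulated data. Two assumptions are made here that go beyond the slide: the indicator in the model is recoded as crossing the placebo cutoff c∗, and the placebo cutoffs are placed far enough from the true cutoff that no estimation window straddles it. The helper `rd_jump` and all numerical values are hypothetical.

```r
set.seed(5)

# Simulated data with a single true jump of -0.1 at zero (hypothetical)
n <- 4000
x <- runif(n, -0.5, 0.5)
d <- as.integer(x >= 0)
y <- 0.6 + 0.2 * x - 0.1 * d + rnorm(n, sd = 0.05)
sim <- data.frame(y = y, x = x)

# Estimate the jump at an arbitrary cutoff using a window of half-width h
rd_jump <- function(cstar, data, h = 0.1) {
  xs <- data$x - cstar                       # recentre at the (placebo) cutoff
  keep <- abs(xs) <= h
  window <- data.frame(y = data$y[keep], xs = xs[keep],
                       above = as.integer(xs[keep] >= 0))
  unname(coef(lm(y ~ above * xs, data = window))["above"])
}

# True cutoff (0) plus placebo cutoffs at least 2h away from it
cutoffs <- c(-0.3, -0.2, 0, 0.2, 0.3)
jumps <- sapply(cutoffs, rd_jump, data = sim)
print(data.frame(cutoff = cutoffs, jump = round(jumps, 3)))
```

In this simulation the estimated jump is sizeable only at the true cutoff, which is the pattern a valid design should produce.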

36 / 53

Placebo test for extremist candidates

[Figure: estimated LATE (−0.2 to 0.3) plotted against placebo cut point (−0.15 to 0.15).]

37 / 53

Sorting
The RDD is based on the assumption that there is continuity in the potential outcomes at the threshold.

One way this assumption might be violated is if units can control their values of the running variable.

Examples of sorting:
- Population thresholds: Administrators might misreport population in a town/district if particular benefits are received at certain thresholds (e.g. Eggers et al., 2018)
- Earnings thresholds: Individuals may reduce their earnings if benefits are granted to those below a certain income (e.g. McCrary, 2008)
- Geographic thresholds: Businesses might locate in different areas if benefits are allocated differentially across localities (e.g. Keele and Titiunik, 2015)

38 / 53

Sorting
McCrary (2008) proposes a test to detect sorting in X̃i:

- Looks for evidence of discontinuous jumps in the density of the running variable at the threshold
- Null hypothesis is that there is no sorting, so small p-values from the test suggest evidence of sorting
- DCdensity(running_variable) in R
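The intuition behind density tests can be shown with a much cruder check than McCrary's. The sketch below is not the McCrary test: it simply counts units in a narrow window on each side of the cutoff and applies a binomial test of equal shares. The helper `density_check`, the window width and the artificial data are all hypothetical.

```r
# Crude stand-in for a density test (NOT McCrary's local-linear procedure):
# compare the number of units just below and just above the cutoff
density_check <- function(x, cutoff = 0, h = 0.1) {
  n_below <- sum(x >= cutoff - h & x < cutoff)
  n_above <- sum(x >= cutoff & x <= cutoff + h)
  test <- binom.test(n_above, n_below + n_above, p = 0.5)
  c(n_below = n_below, n_above = n_above, p_value = test$p.value)
}

# Evenly spaced running variable: no sorting by construction
x_clean <- (seq_len(1000) - 500.5) / 1000

# Sorting: units that would narrowly miss the cutoff push themselves over it
x_sorted <- ifelse(x_clean > -0.02 & x_clean < 0, x_clean + 0.02, x_clean)

res_clean <- density_check(x_clean)
res_sorted <- density_check(x_sorted)
print(rbind(clean = res_clean, sorted = res_sorted))
```

The sorted version piles up units just above the cutoff and empties the region just below it, which is the kind of jump in the density that a formal test is designed to detect.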

39 / 53

Sorting of extreme candidates

DCdensity(hall$running_variable)

## [1] 0.9563002

[Figure: estimated density of the running variable on either side of the cutoff (x-axis from −0.2 to 0.2).]

40 / 53

Compound treatments

RDD assumes that the only thing that is determined by Xi at the cutoff is the probability of receiving the treatment.

It is often the case that there are multiple changes at a given cutoff, and so we can only estimate a compound treatment effect.

Eggers (2015) uses the fact that French towns with ≥ 3500 people hold PR elections while towns with < 3500 hold majoritarian elections.

- Outcome (Yi): Turnout in municipality
- Treatment (Di): PR election system
- Running variable (Xi): Population of municipality
- Cutoff (c): 3500

Key question: Is the electoral system the only thing that changes at 3500?

41 / 53

Compound treatments (Eggers et al., 2018)

42 / 53

Fuzzy RDD

Fuzzy RDD

Thresholds/cutoffs may not perfectly determine treatment status, but might still create discontinuities in the probability of treatment exposure.

Incentives to participate in a program may change discontinuously at a threshold, but the incentives are not powerful enough to move all units from nonparticipation to participation.

We can think of the cutoff as assigning units to a treatment condition, where only some units will comply with the treatment.

→ We can use discontinuities to produce instrumental variable estimators of the treatment effect (close to the discontinuity).

43 / 53

Assumptions in Fuzzy RDD

1. First stage
   - There should be a discontinuity in treatment probability at the cutoff
   - Empirically: check RD plots with running variable on X and treatment probability on Y

2. Local independence
   - The treatment assignment should be as good as random around the cutoff
   - Empirically: check RD balance plots of covariates

3. Monotonicity
   - No units should be discouraged from taking the treatment at the cutoff
   - Generally trivial

4. Exclusion restriction
   - Crossing the cutoff should only affect the outcome through a unit’s treatment values, not through any other channel
   - Often plausible, so long as it is only D that is affected at c (no compound effects)

44 / 53

Fuzzy RD example

Does education decrease anti-immigrant views?
Although low levels of education are powerful predictors of anti-immigrant sentiment, it is difficult to establish a causal relationship between education and attitudes towards immigrants. Marshall and Cavaille (2018) use an RDD to address this question by exploiting changes to the length of mandatory education in five countries (Denmark, France, UK, Netherlands, and Sweden).

- Outcome (Yi): Index of anti-immigrant attitudes
- Treatment (Di): 1 if respondent was affected by the reform
- Running variable (Xi): Birth year of the respondent minus the birth year of those first affected by the policy

45 / 53

Schooling and immigration attitudes

46 / 53

Schooling and immigration attitudes

Here, treatment is determined by age:

\[
D_{i,c} = \begin{cases}
1 & \text{if birth year}_{i,c} \geq \text{birth year of first affected}_c \\
0 & \text{if birth year}_{i,c} < \text{birth year of first affected}_c
\end{cases}
\]

But many students would have stayed in school longer even in the absence of a reform. We therefore have some non-compliance (always-takers).

47 / 53

Fuzzy RD estimation

1. Restrict data to a small window above and below the cutoff (±h)
2. Code the instrument, Zi, using the running variable (Zi = 1{Xi ≥ c})
3. Fit 2SLS

\[
Y_i = \alpha + \beta_1 \tilde{X}_i + \beta_2 Z_i \tilde{X}_i + \tau \hat{D}_i
\]

   where D̂i is instrumented by Zi and X̃i = Xi − c

4. We can, as before, add more flexible specifications for X̃i
5. We would normally also plot and estimate both the first- and second-stage discontinuities
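The steps above can be sketched on simulated data using only base R. With one instrument and one treatment, the 2SLS point estimate equals the reduced-form jump divided by the first-stage jump, so the sketch computes that ratio from two `lm()` fits. The compliance rates, the true effect of −0.2 and the bandwidth are assumptions made for illustration, and standard errors would need a dedicated 2SLS routine.

```r
set.seed(6)

# Simulated fuzzy design (all values hypothetical)
n <- 10000
x_tilde <- runif(n, -0.5, 0.5)               # running variable, centred at c
z <- as.integer(x_tilde >= 0)                # instrument: crossing the cutoff

# Compliance types via a latent draw: 20% always-takers, 60% compliers,
# 20% never-takers, so crossing the cutoff never discourages treatment
u <- runif(n)
d <- as.integer(u < 0.2 + 0.6 * z)
y <- 0.5 + 0.3 * x_tilde - 0.2 * d + rnorm(n, sd = 0.1)
sim <- data.frame(y = y, d = d, z = z, x_tilde = x_tilde)

# Step 1: restrict to a window around the cutoff
h <- 0.2
window <- sim[abs(sim$x_tilde) <= h, ]

# First stage: jump in treatment probability at the cutoff
first_stage <- coef(lm(d ~ z * x_tilde, data = window))[["z"]]

# Reduced form: jump in the outcome at the cutoff
reduced_form <- coef(lm(y ~ z * x_tilde, data = window))[["z"]]

# Ratio of the two jumps: the 2SLS point estimate of the LATE for compliers
late_hat <- reduced_form / first_stage
print(c(first_stage = first_stage, reduced_form = reduced_form,
        late = late_hat))
```

Dividing by the first-stage jump rescales the reduced-form effect of crossing the cutoff into an effect of the treatment itself, for the units whose treatment status is changed by the cutoff.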

48 / 53

Schooling and immigration attitudes – first stage

Implication: On average, reforms increase a student’s secondary schooling by 0.29 years.

49 / 53

Schooling and immigration attitudes – reduced form

Implication: On many indexes, reform-affected students are less opposed to immigration.

50 / 53

Schooling and immigration attitudes – LATE

Note that the LATE estimated in a fuzzy RD is “local” in two ways:
- Local to the threshold
- Local to the compliers

51 / 53

Conclusions

Internal and external validity

- Internal validity
  - RDD is a transparent approach to inference which requires less stringent assumptions than IV (at least in the Sharp RDD case)
  - Many of the key identifying assumptions are empirically verifiable
  - RDD has been shown to do a very good job at recovering known experimental benchmarks (Cook et al., 2008)

- External validity
  - Sharp RDD only identifies the ATE at the point of the discontinuity
  - Fuzzy RDD only identifies the ATE at the point of the discontinuity, amongst compliers
  - Generalizability depends on how weird the units are at the cutoff, and how weird the compliers are

52 / 53

Next week

1. Advice for coursework

2. Overview of course

3. Topics for future study

53 / 53
