241
Foundations of Causal Inference Jiaming Mao Xiamen University

Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Foundations of Causal Inference

Jiaming MaoXiamen University

Page 2: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Copyright © 2017–2019, by Jiaming Mao

This version: Spring 2019

Contact: [email protected]

Course homepage: jiamingmao.github.io/data-analysis

All materials are licensed under the Creative CommonsAttribution-NonCommercial 4.0 International License.

Page 3: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

“Causa latet: vis est notissima (The cause is hidden, but theresult is known)” – Ovid, Metamorphoses, IV. 287.

“Felix qui potuit rerum cognoscere causas (happy be the manwho has been able to learn the causes of things)” – Virgil,Georgics, II, 490.

“Shallow men believe in luck or in circumstance. Strong menbelieve in cause and effect.” Ralph Waldo Emerson

© Jiaming Mao

Page 4: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Inference

© Jiaming Mao

Page 5: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Inference

Learning p (y |x) or p (x , y) tells us nothing about whether thereexists a causal relationship between x and y .

Causal inference is concerned with the following questions:

1 Does x have a causal effect on y? If so, how large is the effect?(causal effect learning)

2 If a causal effect exists, what is the mechanism by which it occurs?(causal mechanism learning)

© Jiaming Mao

Page 6: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Correlation does not imply Causation

Auto Sales and Search for Indian Restaurants. Source: Google Correlate© Jiaming Mao

Page 7: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Correlation does not imply Causation

1990 2000 20101995 2005 2015

0

20

40

60

80

100

120

140

0

300

600

900

1,200

1,500

1,800

2,100

CrudeOilPrices:Brent-Europe(left)

GoldFixingPrice10:30A.M.(Londontime)inLondonBullionMarket,basedinU.S.Dollars(right)

Do

lla

rsp

er

Ba

rre

lU

.S.D

olla

rspe

rTroy

Ou

nc

e

ShadedareasindicateU.S.recessions Sources:EIA,IBA myf.red/g/mfcv

© Jiaming Mao

Page 8: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Anecdotes are not enough

Many people have strong beliefs about causal effects in their own lives.

Emily used to suffer from chronic migraine but no longer does. Sheclaims that her secret to getting rid of migraines is doing the embryoyoga pose every morning while watching TV news.

All we know is that Emily is no longer experiencing migraines, andthat she does the yoga pose every morning (while watching TV).

We do not know whether doing so contributed to the alleviation ofher symptoms. We do not know what would happen if other peopleadopted this practice.

© Jiaming Mao

Page 9: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Causal Inference

The urge and the capacity to find causal explanations for observedphenomena has been an essential characteristic of human beings sincethe very beginning of human development and is the very goal ofmodern science and social science.

Why do we want to know how things work? – an obvious answer isthat it makes a big difference in how we act. If the rooster’s crowcauses the sun to rise, we could make the night shorter by waking upour rooster earlier.

Ultimately, every question related to the effect of actions must bedecided by causal considerations. Statistical information alone isinsufficient.

True understanding enables predictions under a wide range ofcircumstances, including new hypothetical situations.

© Jiaming Mao

Page 10: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Russell’s Chicken

Bertrand Russella told the following cautionary tale of the perils of notunderstanding causal mechanisms:

A chicken infers, on repeated evidence, that when the farmercomes in the morning, he feeds her. The inference serves her welluntil Christmas morning, when he wrings her neck and serves herfor dinner.

aRussell (1912), via Deaton and Cartwright (2018).

© Jiaming Mao

Page 11: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Simpson’s Paradox

Should a doctor prescribe this drug?

Any statistical relationship between two variables may be reversed byincluding additional factors in the analysis.

Causal relationships are more stable than statistical relationships.

© Jiaming Mao

Page 12: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Causal Inference

Research on causal inference methodologies has taken on new importancewith the development of artificial intelligence (AI).

How should a robot acquire causal information through interactionwith its environment?

How should a robot receive causal information from humans?

© Jiaming Mao

Page 13: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Artificial Intelligence

© Jiaming Mao

Page 14: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Seeing vs. Doing

The do operator:do(x = a) : set x = a

Barometer readings are useful for predicting rain:

Pr (rain | barometer = low) > Pr (rain | barometer = high)

But hacking a barometer won’t change the probability of raining:

Pr (rain | do (barometer = low)) = Pr (rain | do (barometer = high))

© Jiaming Mao

Page 15: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Seeing vs. Doing

Doing: if x has a causal effect on y , then we can change x andexpect it to cause a change in y .

Seeing: If x is correlated1 with y but does not have a causal effect ony , then we can only observe the correlation without the ability tochange y by manipulating x .

1In this lecture, we use the term “correlation” in its broad sense to mean statisticaldependence (association).

© Jiaming Mao

Page 16: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Seeing vs. Doing

Being able to manipulate x to see its effect on y is essential to theconcept of causality.

Holland (1986): “No causation without manipulation.” – if there is noway to manipulate a variable, at least in principle, then it is hard todefine what its causal effect means2.

I A thought experiment that is often used to determine whether avariable x is manipulable in principle is to imagine a hypotheticalexperiment that assigns different values to x3.

2What about causal effects of immutable variables like age and gender?3Angrist and Pischke (2009): “research questions that cannot be answered by any

experiment are FUQ’d: Fundamentally Unidentified Questions.”© Jiaming Mao

Page 17: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal vs. Statistical Predictions

Causal prediction: What will y be if I set x = a?

I E [y |do(x = a)]4

Statistical prediction: What will y be if I observe x = a?

I E [y |x = a]

4Assuming we minimize the expected L2 loss in prediction.© Jiaming Mao

Page 18: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Hospitalization and Health

Q1: what is the expected health status of someone who has receivedhospitalization? (statistical prediction)

Q2: what will my health status be if I receive hospitalization? (causalprediction)

© Jiaming Mao

Page 19: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Advertising and Sales

Q1: what is the expected sales of a company with a given amount ofTV ad spending? (statistical prediction)

Q2: how much will my sales increase if I increase my TV ad spendingby a certain amount? (causal prediction)

© Jiaming Mao

Page 20: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

The potential outcomes framework, also called the Rubin causalmodel (RCM)5, is a framework for causal inference thatconceptualizes observed data as if they were outcomes of experiments,conducted either by the researcher – as in actual experiments, or bythe subjects of the research themselves – as in observational studies.

Using the analogy of an experiment, when investigating the causaleffect of x on y , x is referred to as treatment or intervention and yas outcome.

5due to Neyman (1923) and Rubin (1974, 1978). The RCM represents an effort touse the language of statistical analysis of experiments to model causality.

© Jiaming Mao

Page 21: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Suppose x takes on a set of discrete values {1, . . . ,A}. The RCMposits that y ∈

{Y1, . . . ,YA

}, where {Ya}Aa=1 is a set of random

variables, with each Ya being the potential outcome under thetreatment x = a6:

Ya ≡ y |do (x = a)

Thus under RCM, the relationship between treatment x and outcomey is described by the joint distribution p

(x ,Y1, . . . ,YA

)and

y =A∑

a=1YaI (x = a)

6See Appendix II for a structural interpretation of the potential outcome.© Jiaming Mao

Page 22: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Now consider a binary treatment x ∈ {0, 1}. The potential outcomesare

{Y0,Y1} and the outcome y can be written as:

y = xY1 + (1− x)Y0

x is said to have a causal effect on y if p(Y0) 6= p

(Y1).7

I Causal effects are defined by comparing potential outcomes.I How do we measure the size of causal effects?

7Causal effects are also called treatment effects in the RCM. In this lecture, we usethe two terms interchangeably.

© Jiaming Mao

Page 23: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Let τ ≡ Y1 − Y0.8

Average treatment effect (ATE): E [τ ]

Average treatment effect on the treated (ATT): E [τ | x = 1]

Average treatment effect on the untreated (ATU): E [τ | x = 0]

8More generally, for any discrete or continuous x , we can define the causal effect of xon y to be

τ (x , y) = ∂p (y |do (x))∂x

, which is a function of both x and y . The average treatment effect (ATE) is then

ATE (x) =∫τ (x , y) p (y) dy = dE [y |do (x)]

dx

Here we use the do operator because the RCM notation is not naturally suited forcontinuous x .

© Jiaming Mao

Page 24: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

© Jiaming Mao

Page 25: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Let the observed data be D = {(x1, y1) , . . . , (xN , yN)}, where xi ∈ {0, 1},

yi = xiY1i + (1− xi )Y0

i

, and9 {(x1,Y0

1 ,Y11

), . . . ,

(xN ,Y0

N ,Y1N

)} i .i .d .∼ p(x ,Y0,Y1

)

9Equivalently, {(x1, y1) , . . . , (xN , yN)} i.i.d.∼ p (x , y), where

p (x , y) = x∫

p(x ,Y0,Y1) dY0 + (1− x)

∫p(x ,Y0,Y1) dY1

© Jiaming Mao

Page 26: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

τi ≡ Y1i − Y0

i is referred to as the individual treatment effect. τi isnever observed10. For each individual unit, we only observe eitheryi = Y0

i or yi = Y1i .

The potential outcomes that are not observed are calledcounterfactual outcomes.

I Before treatment: any outcome is a potential outcomeI After treatment: observed (realized) outcome and counterfactual

outcomes.

10If we consider treatments assigned at different times to the same individual as eitherdifferent treatments or as assigned to different subjects.

© Jiaming Mao

Page 27: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Did influenza vaccine prevent me from getting the flu?What actually happened:

I got the vaccine and did not get sick.

My actual treatment: x = 1

My observed outcome: y = Y1

What would have happened:

Had I not gotten the vaccine, would I have gotten sick?

My counterfactual treatment: x = 0

My counterfactual outcome: Y0 =?

© Jiaming Mao

Page 28: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Since counterfactual outcomes are not observed, we are not able tolearn individual treatment effects. This has been called thefundamental problem of causal inference11.

We can only learn causal effects at the population level.

I Therefore, when learning a causal effect, we should always be clearabout the population on which it is defined.

11Rubin (1974, 1978) claim that causal inference is fundamentally a missing dataproblem: given any treatment assigned to an individual, the potential outcomesassociated with any alternate treatments are missing. We note, however, that this is nota unique problem facing causal inference: the goal of predictive modeling – and hencestatistical learning – can also be thought of as predicting what yi will be if xi takes onsome other value.

© Jiaming Mao

Page 29: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

Treatment Observed Outcome Potential Outcomesx y Y0 Y1

0 -.34 -.34 3.460 1.67 1.67 4.030 -.77 -.77 3.080 2.64 2.64 .900 -.02 -.02 .961 2.31 -1.52 2.311 2.79 1.05 2.791 1.53 -.13 1.531 3.61 -1.41 3.611 3.36 .60 3.36

Red: observed outcome. Grey: counterfactual outcome.

© Jiaming Mao

Page 30: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Potential Outcomes Framework

0

1

2

3

0 1Treatment (x)

Ob

serv

ed

Ou

tco

me

(y)

© Jiaming Mao

Page 31: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Learning Causal Effects

Given observed data D, we can learn p(Y1∣∣ x = 1

)= p (y | x = 1)

and p(Y0∣∣ x = 0

)= p (y | x = 0).

To compute ATT however, we need information about p(Y0∣∣ x = 1

):

ATT = E[Y1 − Y0

∣∣∣ x = 1]

= E[Y1∣∣∣ x = 1

]− E

[Y0∣∣∣ x = 1

]Similarly, to compute ATU, we need information about p

(Y1∣∣ x = 0

).

© Jiaming Mao

Page 32: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Learning Causal Effects

To compute ATE, we need information about both p(Y0∣∣ x = 1

)and

p(Y1∣∣ x = 0

):

ATE = E[Y1 − Y0

]= E

[Y1 − Y0

∣∣∣ x = 1]p (x = 1)

+ E[Y1 − Y0

∣∣∣ x = 0]p (x = 0)

= ATT× p (x = 1) + ATU× p (x = 0)

We can think about causal effect learning as trying to learn thesecounterfactual outcome probabilities.

© Jiaming Mao

Page 33: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

In randomized controlled trials (RCT), the treatment x is assignedrandomly to the population.

For a binary treatment, if x is assigned randomly, then x and(Y0,Y1) are independent, denoted by x q

(Y0,Y1). Hence

p(Y0)

= p(Y0∣∣∣ x = 1

)= p

(Y0∣∣∣ x = 0

)= p (y | x = 0)

p(Y1)

= p(Y1∣∣∣ x = 0

)= p

(Y1∣∣∣ x = 1

)= p (y | x = 1)

, i.e. under random assignment, had the group that receivedtreatment b received treatment a, the outcome probabilities would bethe same as the group that actually received treatment a.

© Jiaming Mao

Page 34: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

As a result12,

ATE = ATT = ATU = E [y | x = 1]− E [y | x = 0]

12Note: we are able to calculate

ATE = E[Y1 − Y0] [1]= E

[Y1]− E

[Y0]

[2]= E [y | x = 1]− E [y | x = 0]

because of random assignment ([2]) and because the mean is a linear operator ([1]).

In general, however, without further assumptions, we will not be able to learn from anRCT other characteristics of the treatment effect distribution, such as its median,variance, and percentiles, as they are not linear operators. E.g.,

Median[Y1 − Y0] 6= Median

[Y1]−Median

[Y0]

© Jiaming Mao

Page 35: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Without random assignment of x , E [y | x = 1]− E [y | x = 0] generallydoes not give us the average treatment effect.

E[Y1 − Y0] compares what would happen if the same population

receives treatment x = 1 versus x = 0.

E [y | x = 1]− E [y | x = 0] is comparing two different populations.

© Jiaming Mao

Page 36: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

© Jiaming Mao

Page 37: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

© Jiaming Mao

Page 38: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Consider, for example, the example of hospitalization and health:

E [y | x = 1]− E [y | x = 0] = E [y | x = 1]− E[Y0∣∣∣ x = 1

]︸ ︷︷ ︸

ATT

+ E[Y0∣∣∣ x = 1

]− E [y | x = 0]︸ ︷︷ ︸

selection bias

, where y denotes health outcome and x denotes whether the individualhas received hospitalization (x = 1) or stayed home (x = 0).

© Jiaming Mao

Page 39: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

E [y | x = 1]− E[Y0∣∣ x = 1

]= average health outcome of those who

received hospitalization − their average health outcome if they hadstayed home instead.

E[Y0∣∣ x = 1

]− E [y | x = 0] = average health outcome of those who

received hospitalization if they had stayed home instead − averagehealth outcome of those who did not receive hospitalization.

If individuals chose whether to receive hospitalization themselves,then it is likely that E

[Y0∣∣ x = 1

]− E [y | x = 0] < 0. This is called

self-selection effect or self-selection bias.

© Jiaming Mao

Page 40: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Self-selection bias is a central concern to causal inference based onobserved socio-economic data generated by individual choices.

When individuals choose their own treatments (self-selection)13, thosewho choose to receive a treatment can be systematically differentthan those who choose not to, leading to a correlation betweentreatment and outcome that is not due to direct causation.

Random assignment of treatment removes such self-selection effect.

13Or is it really self-selection? Do we really have free will? Why did the chicken crossthe road? – I choose not to ponder these questions here ,

© Jiaming Mao

Page 41: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

More generally, for x ∈ {1, . . . ,A},

Exchangeability: x q(Y1, . . . ,YA

)Random treatment assignment leads to exchangeability.

I Under random assignment, groups that receive different treatmentvalues are ex ante similar, or, exchangeable. In this case, we also saythat the treatment is exogenous.

© Jiaming Mao

Page 42: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Under random treatment assignment, correlation implies causation.

Association (Correlation): p (y |x = a) 6= p (y |x = b)Causation: p (Ya) 6= p

(Yb)

Random assignment of x ⇒ p (Ya) = p (y |x = a) ∀a

© Jiaming Mao

Page 43: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

In conditionally randomized experiments, the treatment assignmentprobabilities depend on the values of some variable(s) s14.

The treatment x is randomly assigned within each sub-populationwith a fixed value of s.

I e.g., given a binary treatment x ∈ {0, 1} and a population consisting ofmales (s = 1) and females (s = 2), a conditionally randomizedexperiment would assign x = 1 to males with probability p1 andfemales with probability p2.

Conditional random assignment leads to conditionalexchangeability: x q

(Y1, . . . ,YA

)∣∣∣ s14s can be multi-dimensional: s = (s1, . . . , sp)

© Jiaming Mao

Page 44: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Suppose s ∈ {1, . . . ,S}. Under conditional random assignment,

E [Ya] =S∑

j=1E [Ya| s = j] p (s = j) (1)

=S∑

j=1E [y | x = a, s = j] p (s = j)

© Jiaming Mao

Page 45: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

In the experimental design literature, s is called nuisance factorsthat the experimenter wishes to control when conducting an RCT.

A nuisance factor is a variable that can affect y either directly orindirectly, but is not of primary interest to the experimenter. Anuisance factor is also called an effect modifier: consider a binarytreatment x , s is an effect modifier if p

(Y1 − Y0) 6= p

(Y1 − Y0∣∣ s).

Various experimental designs exist to efficiently control for s andconduct conditionally randomized experiments.

© Jiaming Mao

Page 46: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Randomized Complete Block Design (RCBD)a

α1 α2 α3 α4b a a ca c b bc b c a

x ∈ {a, b, c}, s ∈ {α1, α2, α3, α4}

aThe original use of the term blocking for removing sources of variation dueto nuisance factors comes from agriculture, where a block is typically a set ofhomogeneous (contiguous) plots of land with similar fertility, moisture, andweather, which are typical nuisance factors in agricultural studies,

© Jiaming Mao

Page 47: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Random Treatment Assignment

Latin Square Design (LSD)a,b

α1 α2 α3β1 a b cβ2 b c aβ3 c a b

x ∈ {a, b, c}, s1 ∈ {α1, α2, α3, }, s2 ∈ {β1, β2, β3}

aNot to be confused with lysergic acid diethylamide.bA Latin square of order n is an n × n array of cells in whichn symbols are placed, one per cell, in such a way that eachsymbol occurs once in each row and once in each column.

© Jiaming Mao

Page 48: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Using RCTs for Causal Effect Learning

Because random treatment assignment results in exchangeability, RCTscan be used to produce an unbiased estimate of the ATE in the populationfrom which the trial sample is a random sample15.

15The population could be the trial sample itself.© Jiaming Mao

Page 49: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Demand EstimationGoal: want to know consumer demand for a product.Using the RCM, the problem can be stated as:

treatment: price (p)outcome: purchases (q)

Suppose there are two price levels: p ∈ {L,H}.Potential outcomes:

{QL,QH

}Desired causal effect: ATE = E

[QH −QL

]Data: D = {(p1, q1) , . . . , (pN , qN)}16

16D can be generated by either experimental or observational studies, where

qi = QLi I (pi = L) +QH

i I (pi = H)

, and {(p1,QL

1 ,QH1), . . . ,

(pN ,QL

N ,QHN)} i.i.d.∼ p

(p,QL,QH)

© Jiaming Mao

Page 50: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Demand EstimationFrom the data we can learn p (q| p = a), a ∈ {L,H}17.Problem: without exchangeability, p (Qa) 6= p (Qa| p = a) = p (q| p = a).The group that “received” the treatment p = L could be systematicallydifferent than the group that “received” p = H18.

People that buy when the price is high can be richer than those whobuy when the price is low.If we observe the person over time, then her income may be differentwhen the price is low vs. when the price is high.

Solution: randomized experiments.

e.g., companies could run experiments by randomly assigning prices tocustomers in different markets and over time.

17We abuse notations here by using p to denote both price and probability.18Here we assume there are no other unobserved “treatments” that affect

{QL,QH},

such as the prices of related goods.© Jiaming Mao

Page 51: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Giffen Behavior

Economic theory has long speculated the existence of Giffen behavior:when poor consumers face price increase of inferior goods that areessential to them but have no close substitutes, they may end upbuying more of these goods, not less.

In reality, we often observe higher prices associated with morepurchases, but are they cases of

I higher demand causing both higher prices and more purchases, orI higher prices causing people to buy more (Giffen behavior)?

© Jiaming Mao

Page 52: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Giffen Behavior

Jensen and Miller (2008) provides the first empirical evidence of theexistence of Giffen behavior using a randomized field experiment.

randomly selected 1,300 households (3,661 individuals) from twoChinese provinces (Hunan, Gansu) who live under the poverty line.

Households were randomly given subsidies of .10, .20 or .30 yuan perjin (500g) for staple food consumption (Hunan: rice; Gansu: wheat).

Theory predicts that households with high but not too high staplecalorie shares should exhibit Giffen behavior – they would buy less notmore staple food if the price becomes cheaper.

© Jiaming Mao

Page 53: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Giffen Behavior

Percentage decrease in consumption as a result of a one percent subsidy-induceddecrease in price, plotted against household staple calorie share. Positive value

indicates a decrease in consumption (Jensen and Miller, 2008).

© Jiaming Mao

Page 54: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Program Evaluation

Evaluating policy is a central concern to governments. The programevaluation literature in applied economic research is concerned withevaluating and predicting the effects of various government programs andeconomic policies:

Effect of worker training programs on employment

Effect of early childhood interventions on adult outcomes

Effect of negative income taxes on labor supply

Effect of environmental regulations on pollution emission

. . .

© Jiaming Mao

Page 55: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Classroom Size and Student Learning

Governments operating public schools want to know whether theexpense of smaller classes has a payoff in terms of higher studentachievement.

Many studies of education production using non-experimental datasuggest there is little or no link between class size and studentlearning.

Problem: weaker students often deliberately grouped into smallerclasses.

© Jiaming Mao

Page 56: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Classroom Size and Student Learning

The 2002 U.S. Education Sciences Reform Act mandates the use ofrigorous experimental or quasi-experimental research designs for allfederally-funded education studies.

The Tennessee STAR project: randomly assigned a cohort ofkindergarten students to one of three groups: small classes (13–17students per teacher), regular-size classes (22–25 students), andregular/aide classes (22–25 students) which also included a full-timeteacher’s aide. The experiment ran for 4 years and a total of 11,600students from 80 schools were involved (random assignment tookplace within schools).

© Jiaming Mao

Page 57: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Classroom Size and Student Learning

Krueger (1999)

© Jiaming Mao

Page 58: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Classroom Size and Student Learning

Krueger (1999) © Jiaming Mao

Page 59: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

But wait! What have we learned?

Have we learned from the STAR project that small classes are betterfor student learning everywhere, regardless of culture, studentcomposition, teacher quality, etc.? Probably not. So what is themeaning of this causal effect estimate?

I In particular, does the effect apply to inner-city students in Chicago orHarry Porter and his friends at Hogwarts? Does it apply if the subjectof study is economics, or if the teachers come from China?

© Jiaming Mao

Page 60: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

But wait! What have we learned?

The answer depends on our understanding of the underlying causalmechanism: if we believe small class affects learning by allowingstudents to see the blackboard more clearly and hence does notdepend on student composition, then the effect probably should applyto inner-city students in Chicago. But since professors at Hogwarts donot use a blackboard, the effect probably does not apply there.Importantly, if we believe the observed effect was due to the initialdiscomfort felt by Kindergarten students enrolled in large classes, thenthe effect may not even apply to the same cohort of Tennesseestudents if we were to conduct the experiment on them again.

© Jiaming Mao

Page 61: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal Effects

Causal effects do not exist in a vacuum. In social scientific research,the causal effects of interest are often the results of complex socialand economic processes.

Causal effects are population-specific: when we talk about “the causaleffect of x on y”, it is always with respect to a specific populationwithin a specific social, cultural, and economic environment.

The set of populations in which a causal effect applies is called itsscope. A causal effect is only meaningful if we can define its scope.

© Jiaming Mao

Page 62: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal EffectsConsider a binary treatment x . In a specific population with a specificsocial, cultural, and economic environment,

E[Y1 − Y0

]=∫

E[Y1 − Y0

∣∣∣ s] p (s) ds (2)

, where s = s (M) is the set of effect modifiers according to the causalmechanismM by which x affects y in that population.

The causal effect of x on y can differ in two populations because:1 The causal mechanism is different.

I The effect of decreasing oil supply on gas price will be different incountries where gas prices are determined by market and countrieswhere gas prices are controlled by governments.

2 Or, the mechanism is the same, but p (s) is different.I The impact of raising the retirement age on the economy will be

different for two countries with different population age structures.© Jiaming Mao

Page 63: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal Effects

An experiment or observational study is conducted on a specificsample, drawn likewise from a specific population with a specificsocial, cultural, and economic environment – call it the studypopulation.

Often this study population is not we are interested in. Instead, weare interested in learning the causal effect in a target population.

The result of an experimental or observational study is said to haveinternal validity if it is valid for the study population. It is said tohave external validity if it can be generalized or extrapolated toother populations. The ability to be extrapolated from one populationto another is also referred to as the transportability of a result.

A causal effect that we have learned is transportable from the studypopulation to the target population if both are within its scope.

© Jiaming Mao

Page 64: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal Effects

To understand the meaning and scope of a causal effect, we need anunderstanding of the underlying causal mechanism19, which should bebased on prior information and analyses, i.e., our prior knowledge.

Understanding the underlying causal mechanism is not only necessaryfor understanding a causal effect – what it means, where it applies,but also for determining whether a causal effect estimated from astudy population applies to the target population – whether the samemechanism holds in both populations and whether the relevant effectmodifiers have the same distribution20.

19Heckman and Vytlacil (2007) thus criticize the RCM: “Rooted in biostatistics, theyare motivated by the experiment as an ideal. They do not clearly specify themechanisms determining how hypothetical counterfactuals are realized or howhypothetical interventions are implemented except to compare “randomized” with“nonrandomized” interventions.”

20This is in fact true for all statistical inferences, causal or non-causal.© Jiaming Mao

Page 65: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal Effects

Understanding the causal mechanism also helps us to raise moreinteresting questions and design studies to learn more useful effects.

I Have we learned from the STAR project why small classes were better?There are many ways class size could affect student learning. Teacherscould employ different teaching methods in small classes. Studentsmay interact more with each other, therefore learning better in smallclasses. Or it could be simply the fact that sitting closer to theblackboard allows students to see and therefore learn better. Theoverall effect of small class on learning masks the effect of each ofthese possible channels.

I An understanding of the existence of these different channels allows usto design our studies specifically to learn their respective effects, whichwould be more interesting than the overall effect.

© Jiaming Mao

Page 66: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal Effects

Fumigation and YieldFumigation is the use of fumigants to control eelworms which affectscrop yield. Suppose we conduct an RCT to study the effect offumigation on barley yield by randomly selecting N barley fields inLlanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch (short:Llanfairpwll)a and randomly applying fumigation to M of them. Theresult shows that fumigation increases barley yield by 20%.

What does this result mean? Where does it apply? Is the result validfor barley fields in China? Does it apply to Okay, Oklahomab? Or canwe even say that it applies to the same fields in Llanfairpwll, in thesense that we will observe the same effect if we are to conduct thesame experiment again next year?

aA village in Wales. See here.bAnother actual town.

© Jiaming Mao

Page 67: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Meaningful Causal Effects

Fumigation and Yield (cont.)The understanding of the result depends on our understanding of thecausal mechanism and its implied effect modifiers.

Let τ denote the possible effects in the study population and let τdenote the possible effects for any barley field in the world. Supposewe believe the effect of fumigation depends on the season in which afield is fumigated, and suppose the experiment is conducted in thesummer, then the result we get from the experiment is reallyE [τ ] = E [τ |summer].

If, in addition, we believe the effect also depends on what crops weregrown last year and in Llanfairpwll, 50% of the barley fields grewbarley last year and the other 50% grew wheat, then the result we getis E [τ ] = 0.5× E [τ |summer, barley] + 0.5× E [τ |summer,wheat],where we condition on season and last year’s crop.

© Jiaming Mao

Page 68: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

External Validity

“Psychology is the study of psychology students.” – Anonymous

A 2008 survey of the top psychology journals found that 96% of subjects werefrom Western, educated, industrialized, rich and democratic (WEIRD) societies –

particularly American undergraduates.

© Jiaming Mao

Page 69: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Multiple Versions of Treatment

The problem of multiple versions of treatment, also called the problem ofill-defined intervention, is another illustration that a causal effect can bemeaningless without a careful thinking about the underlying mechanism.

The effect of democracy on economic growthThere are different forms of democracy: parliamentary system,presidential system, ... and there are many ways for a country tobecome democratic: through peaceful transition, civil uprising andarmed revolt, foreign invasion, ...

If, based on prior knowledge, we believe these different forms andchannels of transition can have very different effects on economicgrowth, then the so-called “effect of democracy” may not bewell-defined or meaningful.

In this case, we say that democracy as a treatment has “multipleversions” and is ill-defined.

© Jiaming Mao

Page 70: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Multiple Versions of Treatment

The effect of obesity on healthWhat is obesity as a treatment? How do we intervene on obesity?

Multiple channels to becoming obese or un-obese: (lack of) exercise,(un)healthy diet, surgery, ...

The apparently straightforward comparison of the health outcomes ofobese and non-obese individuals masks the true complexity of theinterventions “make someone obese” and “make someone non-obese.”

© Jiaming Mao

Page 71: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Experimental Ideal and Its Limitations

RCTs are often regarded as the gold standard for causal inference.However, there is a popular but mistaken belief that they require noassumptions on causal mechanisms – as we have discussed, withoutan understanding of – or making assumptions on – the underlyingcausal mechanism21, any causal effect estimate is meaningless,whether produced by RCTs or observational studies22,23.

In addition, there are limits to the practicality and usefulness of RCTs.

21See page 118 for a discussion on the exact knowledge requirement of RCTs.22Heckman and Vytlacil (2007) on the consequence of conducting RCTs without an

understanding of the underlying mechanism: “Simplicity in estimation is oftenaccompanied by obscurity in interpretation ... Blind empiricism leads nowhere.”

23Deaton and Cartwright (2018): “There is no escape from thinking about the waythings work; the why as well as the what.”

© Jiaming Mao

Page 72: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Experimental Ideal and Its Limitations

For many causal inference problems, RCTs are impossible orimpractical to run.

I infeasibility (e.g., monetary policy)I ethical reasons (e.g., smoking and lung cancer)I cost and duration (e.g., childhood intervention and adult outcomes)

F RCTs require special conditions if they are to be conducted successfully– local agreements, compliant subjects, affordable administrators,multiple blinding, people competent to measure and record outcomesreliably, etc.,

F Long duration studies often suffer from significant (non-random)attrition.

I high-dimensional treatment or nuisance factors

© Jiaming Mao

Page 73: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Experimental Ideal and Its Limitations

Many causal inference problems involve a large number of effectmodifiers s, making it infeasible to conduct RCTs that control for alldimensions of s or have enough randomization such that enoughvalues of s in each dimension are observed.

If we do not specifically control for or randomize over s, however,then the causal effect estimate we get may be very local (conditionalon fixed values of s in many dimensions). This limits the usefulness ofRCTs and often means that we have to rely on observational studies,which are less subject to the constraints listed on the preceding page.

© Jiaming Mao

Page 74: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Experimental Ideal and Its Limitations

Fumigation and Yield (cont.)If we believe the effect of fumigation on barley yield varies by seasons,then we need to control for or randomize over seasons in order toobtain causal effect estimates that are not conditional on a particularseason. This, however, increases the duration of the experiment andwould significantly increase its cost.

When many other factors also determine the effect of fumigation andwe do not specifically control or randomize over them, then we mayend up getting a very local effect such asE [τ |summer, lcrop=barley, rainfall=high, altitude=low, ...]

© Jiaming Mao

Page 75: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Scaling Up

Often, RCTs are conducted for program evaluation purposes. Becauseof cost and duration constraints, they are typically small-scale, but ifdeemed successful, the program is then a candidate for scaling-up –applying the same intervention to a much larger area and population.Predicting the same results at scale as in the trial can be problematic,however, as the larger target population can be very different fromthe study population so that causal effects are not transportable.But even if the trial sample is a random sample of the targetpopulation, so that the target population = the study population24,applying the same intervention to everyone in the population couldgenerate very different effects than in the trial due to generalequilibrium effects – a particular problem that often limits theusefulness of RCTs for program evaluation.

24e.g., the target population being the national population and the trial sample beinga nationally representative sample.

© Jiaming Mao

Page 76: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Scaling Up

Fumigation and Yield (cont.)Suppose a government is interested in finding ways to help farmersincrease their income. Since in the fumigation study, farmers whosefields have been fumigated see significant increase in their cropproduction and hence income, the government believes that it is agood idea to subsidize fumigation.

However, if the use of fumigants on barley fields is scaled up to thewhole country, then the price will drop – assuming a closed domesticbarley market – and if the demand for barley is price inelastic, thenfarmers’ incomes will fall.

In this case, the scaled-up effect is opposite in sign to the trial effect.

© Jiaming Mao

Page 77: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

The existence of general equilibrium effects can be thought of as aviolation of the stable unit treatment value assumption(SUTVA), which is an assumption that an individual unit’s potentialoutcome under a treatment does not depend on the treatmentsreceived by other individual units25.

SUTVA is implicitly assumed in the RCM: yi = Y(x1,...,xN)i = Yxi

i

SUTVA can be violated if there exists interaction among individualunits26.

25An individual unit could be a person, a firm, a country, etc. at a given point in time.26Of which general equilibrium effect is a manifestation.

© Jiaming Mao

Page 78: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

The causal effect of d on y depends on how many individuals receive d = 1.SUTVA is violated if d is treatment and y is outcome, in which case there is

treatment dilution: the more treated, the less effective the treatment.

© Jiaming Mao

Page 79: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

When SUTVA is violated,

p (yi | do (xi = a, xj = b)) 6= p (yi | do (xi = a) , xj = b)

6=p (yi | do (xi = a)) =∫

p (yi | do (xi = a) , xj) p (xj) dxj

In this case, do (xi = a, xj = b), do (xi = a, xj = c), do (xi = a)| xj = b,and do (xi = a)| xj = c are effectively different interventions27.

27Thus, the violation of SUTVA can also be viewed as a problem of ill-definedinterventions.

© Jiaming Mao

Page 80: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

SUTVA can be thought of as an i.i.d. assumption on causal effects. Ifviolated, then we need to take the interaction into account. Assuming apopulation of {i , j}, there are two approaches to do so:

1. Learn p (yi | do (xi )) or p (yi | do (xi ) , xj), treating xj as an effectmodifier.

I When estimating the treatment effect on an individual unit, if SUTVAis violated, then we need to consider the treatments received by otherindividual units as effect modifiers.

© Jiaming Mao

Page 81: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

2. Learn p (yi | do (xi , xj)).I This requires changing the unit of analysis from the original individual

unit to a population of those units where interaction occurs and isconfined in28.

28For example, for the problem on page 78 , we can define each individual unit i to bea “local population” where such interference occurs and is confined in, and let theunderlying population be a population of such local populations. Define the treatmentvariable x to be

x =

1 if 1 individual in the local population receives d = 12 if 2 individuals in the local population receive d = 1...

...

SUTVA would be satisfied if we look at the causal effect of x on y .© Jiaming Mao

Page 82: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

Socio-economic outcomes are often the results of individualinteraction29. Individual choices are rarely independent and eachperson’s choice can affect other people30.

I micro-scale: social and strategic interaction (e.g., firm competition inoligopolistic markets)

I macro-scale: general equilibrium effects

Thus, SUTVA would often be violated if we do not properly takethese interaction effects into account.

29Unlike, say, in the medical sciences.30Such interaction effects are sometimes negligible, as in the case of buyers and sellers

in competitive markets. Here we emphasize that they are often not.© Jiaming Mao

Page 83: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

SUTVA

College Education and WageLabor economists are perpetually interested in the effect of educationon labor market returns. But questions such as “what is the effect ofcollege education on wage?” needs some clarity.Wage is an equilibrium outcome. How much wage a person wouldearn if she receives college education depends on labor demand andlabor supply. Labor supply, in turn, depends on how many otherpeople have received college educationa.The effect of college education on wage is clearly different if only oneperson receives college education and if all individuals do.When people ask this question, the causal effect they most likely reallyhave in mind is the effect of an individual receiving college educationon her wage conditional on current labor demand and labor supply.

aDisregarding the heterogeneity in college education (good college, badcollege, history major, economics major) and treating it as homogeneous here.

© Jiaming Mao

Page 84: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Prior Knowledge

Causal inference generally requires prior knowledge regarding thedata-generating causal mechanisms. Such knowledge can only exist asa result of previously observed information and conducted studies.Causal inference therefore builds on causal inference31.

31See page 150 for a discussion on causal mechanism learning and the process ofscientific progress.

© Jiaming Mao

Page 85: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Model

How to represent our knowledge of a causal mechanism? Given variables xand y , we can say

x ∼ U (0, 1) (3)y = 2x

But without giving “=” a causal reading, this is just a statistical modelthat gives us the joint distribution p (x , y).

© Jiaming Mao

Page 86: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Model

(3) becomes a causal model if we imbue “=” with the causal meaning thatthe variable on the left is determined by the variables on the right.

Here: x causes y , so if we set x = 1, y will be 2, but setting the valueof y will not affect x .

Once given a causal meaning, “y = 2x” becomes a structuralequation and is sometimes written as “y ← 2x”.

© Jiaming Mao

Page 87: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Model

A causal modelM32 for a set of random variables {x1, . . . , xn} is amodel of the joint distribution p (x1, . . . , xn), as well as the causalstructure governing {x1, . . . , xn}, which describes the causalrelationships among the variables.

32Also called scientific model.© Jiaming Mao

Page 88: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Diagrams

A causal diagram G33 is a graph that can be used to represent acausal structure and therefore describe our qualitative knowledgeabout a causal mechanism34,35.

33Also called causal graph.34A causal model that uses causal diagram to represent causal structure is called a

causal graphical model or causal Bayesian network, and can be written asM = (H,G), where H is a generative statistical model and G is a causal diagram. See

Appendix I for an introduction to graphical models and Bayesian networks.35While the RCM and the potential outcomes framework were developed in statistics,

the modern theory of causal graphical models arose within the disciplines of computerscience and artificial intelligence. See Pearl (2009).

© Jiaming Mao

Page 89: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal DiagramsFor example, suppose we see a boa constrictor that looks like this:

Our theory of the causal mechanism that leads to the boa constrictorlooking like this is that it has just swallowed a baby elephant:

© Jiaming Mao

Page 90: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Diagrams

How do we represent our theory? We can write out a full causal model:

log he ∼ N (0, 0.1)log `e ∼ N (0.5, 0.1)log `b ∼ N (1.5, 0.2)

a∣∣∣`e , `b, `b > `e ∼ U

(0, `b − `e

)y ←

{heI (a ≤ x ≤ a + `e) I (E = 1) `e < `b

0 `e ≥ `b

, where y is the height of the boa constrictor, x is the distance along thebody of the boa constrictor from its head, (he , `e) are respectively theheight and length of the baby elephant, `b is the length of the boaconstrictor, and E ∈ {0, 1} is the event that the boa constrictor hasswallowed the elephant.

© Jiaming Mao

Page 91: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Diagrams

Or we can draw a causal diagram:

© Jiaming Mao

Page 92: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Diagrams

In a causal diagram, the nodes (vertices) are {x1, . . . , xn}, withdirected edges (arrows) representing direct causation.

The presence of an arrow that points from xi to xj indicates eitherthat xi has a direct causal effect on xj – an effect not mediatedthrough any other variables on the graph, or that we are unwilling toassume such a effect does not exist. The lack of an arrow from xi toxj then indicates the absence of a direct effect36.

36The absence of an arrow therefore represents a more substantive assumption.© Jiaming Mao

Page 93: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Diagrams

If an arrow points from xi to xj , then xi is a parent of xj and xj achild of xi

37.

A path is a sequence of connected nodes. The path is causal if all itsarrows point in the same direction. Otherwise it is noncausal.

All nodes on a causal path that begins with xi are descendants of xi .Those on a causal path that leads to xj are ancestors of xj . Avariable is a cause of all its descendants.

Variables with no parents are said to be exogenous to the causalmodel represented by the causal diagram. Others are endogenous.

37For detailed definitions on the concepts introduced here, see Appendix I

© Jiaming Mao

Page 94: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Diagrams

A causal diagram compatible with a joint distribution p (x1, . . . , xn) mustsatisfy the following properties:

Causal Markov Condition: xi q nd (xi )| pa (xi ), where pa (xi ) andnd (xi ) denote respectively the parents and non-descendants of xi

38.

Completeness: All common causes, even if unmeasured, of any pairof variables on the graph are themselves on the graph39.

Faithfulness: The joint distribution p (x1, . . . , xn) has all of theconditional independence relations implied by the causal diagram, andonly those conditional independence relations.

38i.e., a variable xi is independent of any other variables (except its own effects)conditional on its direct causes.

39As it turns out, this property is tantamount to the requirement that all relevantfactors are accounted for. Our ability to extract causal information from data ispredicated on this untestable assumption.

© Jiaming Mao

Page 95: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Independence in Causal Diagrams

Given a path P, a collider is a node c on P with neighbors a and bon P such that a→ c ← b.

A path is said to be blocked if it contains a noncollider that has beenconditioned on, or it contains a collider that has not been conditionedon and has no descendants that have been conditioned on.

Two variables are said to be d-separated if all paths between themare blocked. Otherwise they are d-connected.

If two variables are d-separated (after conditioning on a set ofvariables), then they are (conditionally) independent. If two variablesare d-connected (after conditioning on a set of variables), then theyare (conditionally) dependent (associated)40.

40Without the faithfulness condition, d-connection does not necessarily implyconditional dependence.

© Jiaming Mao

Page 96: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Basic Patterns of Causal Relations

Basic patterns of causal relationships among three variables

© Jiaming Mao

Page 97: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

L has a causal effect on both A and Y . A does not have a causaleffect on Y . A depends on L and on no other causes of Y .

L is called a common cause to A and Y .

A and Y are d-connected and hence associated, because there existsan open path, A← L→ Y , between them.

Having information about A improves our ability to predict Y , eventhough A does not have a causal effect on Y .

E.g., A : carrying a lighter; Y : lung cancer; L : smoking

© Jiaming Mao

Page 98: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

Both A and Y have a causal effect on L. A does not have a causaleffect on Y .

L is called a common effect of A and Y .

L is a collider on the path A→ L← Y that has not been conditionedon. Hence L blocks the path.

A and Y are d-separated and hence independent, because the onlypath between them, A→ L← Y , is blocked.

E.g., A : family heart disease history; Y : smoking; L : heart disease

© Jiaming Mao

Page 99: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

Box indicates conditioning

Conditioning on B and L block the paths A→ B → Y andA← L→ Y .

A and Y are d-separated after conditioning on B and L. Therefore,they are conditionally independent given B and L, even though theyare marginally associated in both graphs.

E.g. (left), A : smoking; B : tar deposits in lung; Y : lung cancer

© Jiaming Mao

Page 100: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

Conditioning on collider L or its descendent C opens the pathA→ L← Y , which is blocked otherwise.

A and Y are d-connected after conditioning on L and C . Therefore,they are conditionally associated given L and C , even though they aremarginally independent.

E.g. (right), A : family heart disease history; Y : smoking; L : heartdisease; C : taking heart disease medication

© Jiaming Mao

Page 101: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

Conditioning on CollidersSuppose a light bulb (C) is controlled by two on/off switches (A and B).The states of A and B are independent. C is lit up only if both A and Bare in the “on” state. Then the causal diagram is:

A→ C ← B

A and B are independent: the state of A tells you nothing about thestate of B.

A and B are dependent conditional on C: conditional on the lightbeing off, A must be off if B is on, and vice versa.

© Jiaming Mao

Page 102: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

●●

●●

● ●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

−3 −2 −1 0 1 2 3

−3

−2

−1

01

2

Hypothetical College Admission

Exam Score

Inte

rvie

w S

core

RejectedAdmitted

© Jiaming Mao

Page 103: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Association and Causation

In summary, there are three structural reasons why two variables may beassociated:

1 One causes the other41

2 They share common causes3 The analysis is conditioned on their common effects42

41either directly or through mediating variables, i.e. there exists an open causal paththat connects the two variables.

42or the consequences of the common effects (i.e., common descendants).© Jiaming Mao

Page 104: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Confounding

When x and y share common causes, they are correlated even if theydo not cause each other. This makes it harder for us to learn thecausal effect of x on y or vice versa. We call this problemconfounding. The common causes are called confounders. In thepresence of confounding, p (y |x) 6= p (y |do (x)).

Self-selection bias is an important type of confounding. Considertreatment x and outcome y . When x is selected based on the valuesof z , if z also has a causal effect on y , then z is a confounder andthere is self-selection bias.

When z is fully observed, we say there is selection on observables(or, there exists no unmeasured confounding). When z is not fullyobserved, there is selection on unobservables (or, there existsunmeasured confounding).

© Jiaming Mao

Page 105: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Identifiability of Causal Effects

Given a modelM with variables x = (x1, . . . , xn) and a set ofobserved random variables v = (v1, . . . , vm), let θ (M) = gM (x) be afunction of x according toM, we say θ (M) is identifiable if it canbe uniquely determined based on observations of v43.

Thus, letM be a causal model with variables x and let θ be thecausal effect of, say xi on xj , according toM. Let v be our observedvariables – the data we observe are a random sample D ∼ p (v) –then θ is identifiable if it can be uniquely determined from D44.

43Technically, θ is identifiable if the map from the space of its possible values to thespace of probability distributions of observables is invertible.

44We have said nothing about the size of D. In particular, the identification questioncan be phrased as: given infinite data – if we actually know the true distribution of theobserved variables – can we learn θ?

© Jiaming Mao

Page 106: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Intervention in Causal Diagrams

Given a causal diagram G with variables x = (x1, . . . , xn),

p (x) =n∏

j=1p (xj | pa (xj)) (4)

The effect of an intervention do (xi = a) is to transform thepre-intervention distribution (4) into the post-interventiondistribution:

p (x−i | do (xi = a)) =∏j 6=i

p (xj | pa (xj)) (5)

= p (x−i , xi = a)p (xi = a| pa (xi ))

= p (x−i |xi = a, pa (xi )) p (pa (xi ))

© Jiaming Mao

Page 107: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Intervention in Causal Diagrams

Graphically, the transformation amounts to removing all arrows enteringxi , while setting xi equal to a.

p (A,B,C ,D) =p (D|C) p (C |A,B) p (A) p (B)

−→

p (A,B,C ,D|do (C = c)) =p (D|C = c) p (A) p (B)

It is clear from the post-intervention graph that p (A|do (C)) = p (A), thoughp (A|C) 6= p (A).

© Jiaming Mao

Page 108: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Intervention in Causal Diagrams

The reason that do (xi = a) corresponds to removing arrows enteringxi is that before the intervention, xi is the consequence of pa (xi ). Bysetting xi = a, we replace pa (xi ) as the cause of xi . Thus,intervention changes this local part of the causal mechanism45.

Since incoming arrows to xi has been removed, the associationbetween xi and any other variable xj on the graph, if exists, must be aresult of xi causing xj , rather than of common causes. Notice alsothat in the post-intervention graph, since xi no longer has parents, ithas become exogenous to the model.

45Pearl (2009) describes intervention as “surgery on the causal diagram.”© Jiaming Mao

Page 109: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Intervention in Causal Diagrams

© Jiaming Mao

Page 110: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning

Given a causal model with variables {x , y , s1 . . . , sn}, to learn the causaleffect of x on y , or to make a causal prediction of y after intervening on x ,it suffices to make p (y |do (x)) our target distribution, from which we cancalculate various causal effects of interest (ATE, ATT ..) or make a causalprediction46.

Causal effect learning = statistical learning with E [y |do (x)] as the targetfunction, or p (y |do (x)) as the target distribution.

46Pearl (2009) defines causal effect of x on y as p (y |do (x)).© Jiaming Mao

Page 111: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning

Let sY = pa (y)\ {x} be the direct causes of y other than x . Then (5) ⇒

p (y |do (x)) =∫

p(y |do (x) , sY

)p(sY)dsY (6)

Note that sY are the set of potential effect modifiers47: given anys ∈ sY , if the effect of s on y interacts with that of x on y , then s isan effect modifier.

47Hence (6) corresponds to (2).© Jiaming Mao

Page 112: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning: Two Stages

Identification(causalreasoning)

Estimation(statisticallearning)

© Jiaming Mao

Page 113: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning: Identification

The identification question: is it possible to learn the causal effect ofinterest from our observed variables48? What causal assumptions dowe need in order to do so?

More rigorously, suppose we are interested in learning p (y |do (x)) andv is the set of observed variables49. The identification problem iswhether we can use p (v) to (uniquely) determine p (y |do (x)) –equivalently, whether p (y |do (x)) can be expressed uniquely as afunction of p (v).

48i.e., given infinite data on our observed variables – if we actually know their truedistribution – can we learn our causal effect of interest?

49v can include {x , y , . . .}.© Jiaming Mao

Page 114: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning: Identification

Given a causal modelM = (H,G) that describes that mechanismlinking x and y , if we can express p (y |do (x)) in terms of p (v)without making any parametric assumptions on the relationshipsamong the variables, then we say p (y |do (x)) is nonparametricallyidentified.

I Nonparametric identification is based on knowledge of G only and doesnot rely on any statistical or functional form assumptions in H.

If statistical and functional form assumptions are involved inexpressing p (y |do (x)) in terms of p (v), then p (y |do (x)) isparametrically identified.

© Jiaming Mao

Page 115: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning: Estimation

Once we have expressed p (y |do (x)) in terms of p (v), sayp (y |do (x)) = g (p (v)), then we can estimate g (p (v)) from theobserved data D ∼p (v) using any appropriate statistical models –parametric or nonparametric50.

The estimation question: how to learn the identified causal effectfrom finite sample.

Herein lies the connection between statistical learning and causaleffect learning: once we have established identification using causalreasoning based on causal models, we are left with a pure statisticallearning problem.

50Hence, we could have nonparametrically identified–nonparametrically estimatedcausal effect, nonparametrically identified–parametrically estimated causal effect, etc.

© Jiaming Mao

Page 116: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Randomized Experiments

To learn p (y |do (x)), the simplest way is to “just do it” – sampledirectly from p (y |do (x)). This is what an RCT does.

An RCT that randomly assigns values to treatment x with probabilityp (x) and observes outcomes y generates dataD ∼ p (x , y) = p (y |do (x)) p (x).

Randomization helps ensure that the links from pa (x) to x arebroken, so that x q pa (x). x becomes exogenous, so there is noconfounding to worry about.

x q pa (x)⇒ p (pa (x) |x = a) = p (pa (x) |x = b) – as a result ofrandomization, the distribution of pa (x) is balanced across thedifferent values of treatment x .

p (y |do (x)) is nonparametrically identified: p (y |do (x)) = p (y |x),and can be estimated from D using any appropriate statistical models.

© Jiaming Mao

Page 117: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Randomized Experiments

© Jiaming Mao

Page 118: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Randomized Experiments

Compared with observational studies, RCTs have less knowledgerequirement: we do not need to know the treatment assignmentmechanisms that generate the observational data, since we nowdetermine the mechanism ourselves. In graph terms, only knowledgeof the post-intervention causal diagram is needed.

A major limitation of RCTs, as already discussed, is that it is hard –often impossible – to generate data from a p (y |do (x)) that is theresult of a desired distribution of effect modifiers.

© Jiaming Mao

Page 119: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Randomized Experiments

For RCTs, whether S1 or S2 is a cause of X pre-intervention is immaterial tolearning the causal effect of X on Y . As an example, suppose X is a workertraining program and Y is employment outcome, S1 is ability and S2 is

motivation. An RCT investigator who randomly assigns workers to trainingprograms needs not worry about whether in non-experimental settings, workerswho enroll in such programs have higher motivation or higher ability or both.

However, knowledge about S1 and S2 as causes of Y is still needed and a roughidea about how they are distributed in the trial sample, even if unobserved, is

necessary for defining the scope of the resulting causal effect estimate.

© Jiaming Mao

Page 120: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Observational Studies

For observational studies, given a causal diagram with variables{x , y , s1 . . . , sn}, if all variables are observed, then (5) provides thefollowing way for computing p (y |do (x)):

p (y |do (x = a)) =∫

p (y |x = a, pa (x)) p (pa (x)) dpa (x) (7)

Hence p (y |do (x)) is nonparametrically identifiable and can be estimatedby estimating p (y |x , pa (x)) and p (pa (x)) respectively from data.

© Jiaming Mao

Page 121: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Observational Studies

Fumigation and Yield (cont.)Suppose we cannot do an RCT to investigate the effect of fumigation oncrop yield, so instead we rely on observational data. Suppose themechanism that generates our observed data works as follows: fumigation(F ) helps control eelworms (E ), which affects crop yield (Y ). Farmers’ useof fumigation is affected by weather (W ) and the price of fumigants (C).Weather also affects yield independently. Finally, we observe theequilibrium crop price (P), which is affected by both the price of fumigantsand the realized crop yield.

© Jiaming Mao

Page 122: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Observational Studies

Fumigation and Yield (cont.)

Figure 1:Fumigation andYield

© Jiaming Mao

Page 123: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Observational Studies

Fumigation and Yield (cont.)Assuming all variables are discrete, then (7) ⇒

p (Y |do (F = 1)) =∑w ,c

p (Y |F = 1,W = w ,C = c) (8)

× p (W = w) p (C = c)

, where F ∈ {0, 1} is treated as a binary variable.

© Jiaming Mao

Page 124: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Observational Studies

Fumigation and Yield (cont.)If we observe {F ,W ,C ,Y }, then p (Y |F ,W ,C) , p (W ) , p (C) can all beestimated from data, from which we can calculate p (Y |do (F = 1))a.

But what if we do not observe W or C?

aIf our target function is E [Y |do (F )], then (8) ⇒

E [Y |do (F )] =∑w,c

E [Y |F ,W ,C ] p (W ) p (C)

E [Y |F ,W ,C ], p (W ) and p (C) can be estimated from data using anyappropriate statistical models. For example, we can fit a simple linear model toE [Y |F ,W ,C ]:

E [Y |F ,W ,C ] = β0 + β1F + β2W + β3C

and estimate p (W ) and p (C) using their empirical distributions.

© Jiaming Mao

Page 125: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Effect Learning from Observational Studies

When pa (x) are fully observed, p (y |do (x)) is always nonparametricallyidentifiable. This is no longer true when pa (x) are not fully observed, inwhich case we have the following identification criteria:

1 The back-door criterion

2 The front-door criterion

These criteria provide sufficient conditions for the nonparametricidentification of causal effects.

© Jiaming Mao

Page 126: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Back-Door Criterion

A back-door path is a path between treatment x and outcome ythat has an arrow into x .

I These are the paths that, if left open, induce association between xand y that is not a result of x causing y .

A set of variables sB satisfies the back-door criterion if (i)conditioning on sB blocks every back-door path from x to y , and (ii)no variable in sB is a descendant of x51.

51(ii) is sometimes stated as the requirement that sB includes only pre-treatmentvariables.

© Jiaming Mao

Page 127: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Back-Door Criterion

If we observe a set of variables sB that meets the back-door criterion52,then p (y |do (x)) is nonparametrically identifiable53:

p(y |do (x) , sB

)= p

(y |x , sB

)(9)

p (y |do (x)) =∫

p(y |x , sB

)p(sB)dsB

52i.e., the observed variables include{x , y , sB, . . .

}53Note that in general,

p (y |do (x)) =∫

p(y |x , sB

)p(sB)dsB

6=∫

p(y |x , sB

)p(sB∣∣ x) dsB = p (y |x)

© Jiaming Mao

Page 128: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Back-Door Criterion

x is said to be exogenous to y if there is no open back-door pathfrom x to y , in which case p (y |do (x)) = p (y |x).

x is conditionally exogenous to y if x is exogenous to y afterconditioning on s, in which case p (y |do (x) , s) = p (y |x , s).

When x is (conditionally) exogenous to y , individual units withdifferent values of x are (conditionally) exchangeable with respect toy , in which case an observational study resembles a (conditionally)randomized experiment54.

Exchangeability and conditional exchangeability are therefore coded ingraph language as the lack of open back-door paths between thetreatment and outcome.

54in which case we also say that the treatment assignment mechanism is ignorable.© Jiaming Mao

Page 129: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Back-Door Criterion

Fumigation and Yield (cont.)The back-door path F ← C → P ← Y is already blocked by the collider P.Therefore, by the back-door criterion, we only need to condition on W a:

p (Y |do (F = 1)) =∑w

p (Y |F = 1,W = w) p (W = w) (10)

Estimation using this strategy requires only the observation of {F ,Y ,W }.

a(10) can be derived from (8) by noticing that conditioning on F blocks theback-door path from C to Y . Hence p (Y |F ,W ,C) = p (Y |F ,W ).

© Jiaming Mao

Page 130: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Back-Door Criterion

Condition (i) of the back-door criterion is equivalent to the requirementthat sY q x

∣∣∣ sB.

If there exists an open back-door path from x to a variable s ∈ sY ,such that s��qx , then there exists an open back-door path from x to y .

Therefore, if conditioning on sB blocks all back-door paths from x toy , then we have s q x | sB ∀s ∈ sY .

Therefore, the back-door criterion can be equivalently stated as requiringthat conditional on sB, no direct causes of y are correlated with x .

Equivalently, x is exogenous to y conditioning on sB if sY q x∣∣∣ sB.

Otherwise, x is endogenous55.55This is the way the econometrics literature defines endogeneity. See Appendix III for a

discussion on the econometric approach to causal reasoning and its limitations.© Jiaming Mao

Page 131: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Front-door Criterion

If we can establish an isolated and exhaustive mechanism that relatesx to y , then the causal effect of x on y can be calculated as itpropagates through the mechanism.

A set of variables sF satisfies the front-door criterion when (i)conditioning on sF blocks all causal paths from x to y ; (ii) no openback-door paths exist from x to sF , i.e. x is exogenous to sF ; (iii)conditioning on x blocks all back-door paths from sF to y , i.e. sF isexogenous to y conditional on x .

© Jiaming Mao

Page 132: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Front-door Criterion

If we observe a set of variables sF that meets the front-door criterion,then p (y |do (x)) is nonparametrically identifiable:

p (y |do (x = a)) =∫

p(sF∣∣∣ do (x = a)

)p(y |do

(sF))

dsF (11)

=∫

p(sF∣∣∣ x = a

)(∫p(y∣∣∣sF , x

)p (x) dx

)dsF

© Jiaming Mao

Page 133: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Front-door Criterion

The causal path X → M → Y represents an isolated (not affected by U) andexhaustive (only causal path from X to Y ) mechanism. All of the effect of X onY is mediated through X ’s effect on M. M’s effect on Y is confounded by theback-door path M ← X ← U → Y , but X blocks this path. So we can use

back-door adjustment to find p (Y |do (M)) and directly findp (M|do (X )) = p (M|X ). Putting these together gives p (Y |do (X )).

© Jiaming Mao

Page 134: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Front-door Criterion

Fumigation and Yield (cont.)The causal path F → E → Y constitutes an isolated and exhaustivemechanism. Hence E satisfies the front-door criterion. (11) ⇒

p (Y |do (F = 1)) =∑

ep (E = e|F = 1)× (p (Y |E = e,F = 0) p (F = 0)

+ p (Y |E = e,F = 1) p (F = 1))

Estimation using this strategy requires only the observation of {F ,Y ,E}.

© Jiaming Mao

Page 135: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Instrumental Variables

If we use observational data, but neither the back-door nor thefront-door criterion is satisfied by the variables that we observe, thenwe may need additional parametric assumptions in order to identifythe causal effect of interest.

Figure 2: The causal effect of x on y is not identified if U is not observed and weassume only the causal relationships encoded in this diagram. If in addition weassume Y = βX + αU, then β can be identified using I as an instrument.

© Jiaming Mao

Page 136: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Instrumental Variables

One strategy that can be used to identify causal effects underadditional parametric assumptions is the instrumental variablestrategy.

A variable sI can serve an instrumental variable (IV) for identifyingthe causal effect of x on y if (i) sI is associated with x , (ii) everyopen path connecting sI and y has an arrow pointing into x .

I (ii) ⇒ sI is d-separated and independent from any common causes ofx and y , since x is a collider on their paths.

I (i) and (ii) imply that any association between sI and y could only bea result of the association between sI and x and the causal effect of xon y – a condition the IV strategy relies on to identify the causal effect.

© Jiaming Mao

Page 137: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Instrumental Variables

One of the simplest parametric assumptions under which an IVstrategy works is that of a linear model: y = β′x + α′u, where u arethe causes of y other than x , including those that are x and y ’scommon causes.

Consider figure 2: assume Y = βX + αU + ε, where U and ε areunobserved. Then β ≡ ∂E [Y |do (X = a)]/ ∂a represents the causaleffect of X on Y and can be estimated by using I as an instrumentalvariable. By assumption, we have:

Cov (Y , I) = Cov (βX + αU + ε, I) = βCov (X , I)

⇒β = Cov (Y , I)

Cov (X , I)

© Jiaming Mao

Page 138: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Instrumental Variables

Fumigation and Yield (cont.)The variable C can serve as an instrument for F : it affects Y only throughits effect on F a. Assume Y = αW + γE + ε and E = βF + ε, thenY = αW + βγF + γε+ ε. βγ represents the causal effect of F on Y andcan be estimated by:

βγ = Cov (Y ,C)Cov (F ,C)

Estimation using this strategy requires only the observation of {F ,Y ,C}.

aThe collider P blocks the path C → P ← Y .

© Jiaming Mao

Page 139: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Non-random Sampling

So far we have assumed that we can at least observe a random sample of{x , y}. But what if we don’t?

Mutual Fund PerformanceSuppose we are interested in how the size of assets under managementaffects a fund’s performance. If we simply look at the relationship betweenfund size and returns among existing funds, however, there will be what isreferred to as a survival bias: we do not observe funds that have closeddue to bad performance. So if fund size negatively affects performance, wemay end up under-estimating the magnitude of the effect.

© Jiaming Mao

Page 140: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Survival Bias

Similarly, a study of how intelligence and work ethic are related amongpigs will generate biased results if we only look at pigs that survive wolfattacks (the ones that built brick houses).

© Jiaming Mao

Page 141: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Survival Bias

© Jiaming Mao

Page 142: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Non-random Sampling

When the data we observe is not a random sample of the populationwe are interested in56 – if we have a non-representative sample – thensampling bias may arise.

Sampling bias is a general problem that can arise when we try tomake inference – whether statistical or causal – about a populationusing data collected from another population.

Sample-selection bias is a type of sampling bias that arises when wetry to make inference about a larger population from a sample that isdrawn from a distinct subpopulation.

I Survival bias is a special type of sample-selection bias.I More generally, sample-selection bias can be thought of as a missingdata problem, where data are not missing at random (NMAR)57.

56Equivalently, the study population is different from the target population.57As opposed to missing at random (MAR)

© Jiaming Mao

Page 143: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Non-random Sampling

wage if those not working had worked observed

© Jiaming Mao

Page 144: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Non-random Sampling

0 500 1000 1500 2000 2500

020000

40000

60000

Balance

Inco

me

Income, balance, and default status for credit card holdersCan the relationship learned from this data set be used to predict default rate for

a random credit card applicant?© Jiaming Mao

Page 145: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Sample Selection Bias

Let c ∈ {0, 1} indicate whether an individual unit belongs to thesubpopulation. Analyses based on samples drawn from thesubpopulation can be thought of as analyses using the wholepopulation, but conditional on c = 1.

The reason that doing so may produce biased learning about thewhole population is that the subpopulation with c = 0 and thesubpopulation with c = 1 are not exchangeable.

Suppose we want to learn p (y) in the whole population, then c canbe considered as a treatment itself: p (y) = p (y |do (c = 1)), i.e. thedistribution of y when everyone is “assigned” to the “observed group.”

Hence nonparametric identification of p (y) relies on whether we canmake c exogenous to y , which can be done by blocking all backdoorpaths between c and y .

© Jiaming Mao

Page 146: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Sample Selection Bias

A: income; Y ∈ {0, 1}: default; C ∈ {0, 1}: credit card holder

Suppose credit card companies determine whether to accept (C) creditcard applications based solely on income (A). Once a person is issued acredit card, income determines the probability of her default (Y ).Assume we observe A for both C = 0 and C = 1, but only observe Y forC = 158.

58This is a case of censoring, where, given a random sample of individuals drawnfrom the population of interest, some variables – in particular the outcome variable – areobserved only on individuals belonging to a subpopulation, while other variables areobserved on all individuals in the sample. If all variables are observed only on individualsbelonging to a subpopulation, then we have truncation. Truncation entails greaterinformation loss than censoring.

© Jiaming Mao

Page 147: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Sample Selection Bias

If the goal is to learn p (Y ), then sincep (Y ) = p (Y |do (C = 1)) 6= p (Y |C = 1), there is sample selectionbias59. To correct the bias, notice that A satisfies the back-doorcriterion. Hence,

p (Y ) = p (Y |do (C = 1)) =∫

p (Y |C = 1,A) p (A) dA

If the goal is to learn p (Y |do (A)), then

p (Y |do (A)) = p (Y |do (A,C = 1)) = p (Y |do (A) ,C = 1)

, i.e. there is no sample selection bias here in learning the causaleffect of A on Y .

59As this example demonstrates, an understanding of the data-generating causalmechanism is needed for non-causal statistical inferences as well if we want toextrapolate the results from one population to another.

© Jiaming Mao

Page 148: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Sample Selection Bias

Suppose the goal is to learn p (Y |do (A)). Notice that C is a collider onthe path A→ C ← L← U → Y . Without conditioning on C , the path isblocked. Conditioning on C = 1 opens the path, making A and Yassociated even though they have no causal effect on each other, leadingto sample selection bias in learning p (Y |do (A)).

© Jiaming Mao

Page 149: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Sample Selection Bias

Assume we observe {A, L} for both C = 0 and C = 1, but only observe Yfor C = 1. Since conditioning on L blocks all back-door paths from A toY and from C to Y , we have:

p (Y |do (A)) = p (Y |do (A,C = 1))

=∫

p (Y |A, L,C = 1) p (L) dL

, i.e., p (Y |do (A)) is nonparametrically identifiable and can be estimatedby estimating p (Y |A, L,C = 1) and p (L) from data respectively.

© Jiaming Mao

Page 150: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Mechanism Learning

Our discussion so far focuses on learning a causal effect based on ourprior knowledge of a causal mechanism. But how do we learn causalmechanisms60?

Conceptually, we begin with a hypothesis set containing causalmodels. The learning problem is to choose the one that best fits ourobserved data.

Recall that a causal model contains a specification of the jointdistribution and the causal structure. A causal model is identifiable ifit can be uniquely determined (out of the hypothesis set of causalmodels) once we observe the entire population.

60The distinction is between understanding the “effects of causes” (causal effectlearning) and understanding the “causes of effects” (causal mechanism learning).

© Jiaming Mao

Page 151: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Mechanism Learning

Often, our interest lies mainly in learning the causal structure61.Since each causal structure implies, statistically, a set of conditionalindependence relations, their identifiability depends on whether theycan be uniquely determined by the set of conditional independencerelations observed in the population.

61As our discussion on meaningful causal effects shows, causal effects are often asunstable as statistical relationships. Both can change from population to populationeven though the underlying causal relationships are the same. Causal relationships aremore stable than causal effects and statistical relationships. That’s why our knowledgeabout the physical world is largely encoded and transmitted in the qualitative languageof causal relationships (“pushing the glass off the table will cause it to break”), ratherthan the quantitative language of causal effects and statistical relationships (“pushingthe glass off the table will result in a 95% probability of breakage.”).

© Jiaming Mao

Page 152: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Mechanism Learning

If two causal structures imply the same set of conditionalindependence relations, then they are said to be observationallyequivalent62. In this case, they cannot be distinguished withoutresorting to manipulative experimentation or temporal information.Hence, experiments play an important role in identifyingobservationally equivalent causal structures.

62See Appendix I

© Jiaming Mao

Page 153: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Causal Mechanism Learning

In practice, we (human beings) have been learning causal mechanismsby formulating models, then conducting experiments or observationalstudies, and based on the results of which, updating our belief abouteach model’s probability of being true. This, in essence, is the processof scientific progress63,64.

63The big question for AI is whether this process can be automated.64This view of scientific progress as continuous Bayesian updating based on evidence

has been challenged by historians like Thomas Kuhn, who pointed out that thesociological nature of the scientific community leads to periodic paradigm shifts ratherthan continuous progress.

© Jiaming Mao

Page 154: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Structual Estimation

Econometrics began as a discipline that seeks to link economic theoryto data.

Today in the econometrics literature, causal models based oneconomic theory are referred to as structural models. Theirestimation is called structural estimation.

© Jiaming Mao

Page 155: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Birth of Econometrics

The Econometric Society was founded on 29th December 1930 at agathering during the annual joint meeting of the American EconomicAssociation and the American Statistical Association.

Ragnar Frisch, a founding member of the Econometric Society andthe first editor-in-chief of Econometrica, wrote in 192365:

Intermediate between mathematics, statistics, and economics, wefind a new discipline which for lack of a better name, may becalled econometrics. Econometrics has as its aim to subjectabstract laws of theoretical political economy or ’pure’ economicsto experimental and numerical verification, and thus to turn pureeconomics, as far as possible, into a science in the strict sense ofthe word.

65Frisch (1926). Quoted is an English translation of the French original. Underliningmine.

© Jiaming Mao

Page 156: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Birth of Econometrics

Regarding the name “Econometrics”, Frisch later wrote66:

So far, we have been unable to find any better word than"econometrics". We are aware of the fact that in the beginningsomebody might misinterpret this word to mean economicstatistics only. But ... we believe that it will soon become clearto everybody that the society is interested in economic theoryjust as much as in anything else.

66Bjerkholt (1998).© Jiaming Mao

Page 157: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Birth of Econometrics

The Cowles Commission for Research in Economics was founded in1932 and moved to the University of Chicago from 1939 to 195567.

I Motto: “Theory and Measurement”68

The Cowles Commission made foundational contributions to the earlydevelopment of Econometrics during its Chicago years69.

1 Introducing the probabilistic framework as well as the methods ofmodern statistical inference into Econometrics

2 Estimation of structural simultaneous equations models (SEMs).

67The commission moved to Yale in 1955 and has been renamed the CowlesFoundation.

68According to Cowles’ own statement: “This motto replaced the original CowlesCommission motto ’Science is Measurement,’ reflecting the importance of theory thatbecame clear early in the history of Cowles.”

69See Haavelmo (1943, 1944), Koopmans (1950), Hood and Koopmans (1953).© Jiaming Mao

Page 158: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

The Birth of Econometrics

Thus at its inception, Econometrics was conceived as “a branch ofeconomics in which economic theory and statistical method are fusedin the analysis of numerical and institutional data” (Hood andKoopmans 1953)70.

70Emphasis mine© Jiaming Mao

Page 159: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Structual Estimation

A complete structural model may specify preferences, technology, theinformation available to agents, the constraints under which theyoperate, and the rules of interaction among agents in market andsocial settings.

More generally, we refer to any causal models that use economictheory to specify the functional form of causal relationships asstructural models.

© Jiaming Mao

Page 160: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Structual Estimation

Given a structural modelM with variables X = {x1, . . . , xn} andcausal structure G, let XE ⊂ X be the variables that are exogenous toG. The structural model specifies the statistical distributions of XE aswell as the functional forms governing G, which together determinethe joint distribution p (X ).

Let θ ∈ Θ be the set of parameters that parametrize the statisticaldistributions and functional forms inM. The hypothesis setcorresponding toM is therefore {p (X ; θ) , θ ∈ Θ}, which we denoteas p (X ;M). The goal of structural estimation is to learn θ.

© Jiaming Mao

Page 161: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Structual Estimation

Because structural estimation learns the entire causal model, once amodel is learned, we can use it to derive p (xj | do (xi )) for any{xi , xj} ⊂ X .

In contrast, the two-stage causal effect learning procedure introducedon page 112 , whose goal is to learn a single causal effect, has beencalled reduced-form analysis in the econometrics literature71.

Difference between the two: structural estimation is a generativeapproach to causal inference, while reduced-form is a discriminativeapproach.

71Historically, given a structural model g (x , y) = 0 that specifies the relationshipgoverning exogenous variable x and endogenous variable y , if y is solved as a function ofx , i.e. y = f (x), then f is referred to as the reduced form of g .

© Jiaming Mao

Page 162: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Structual Estimation

Discriminative ModelDiscriminative statistical learning p (xj | xi )Causal effect learning p (xj | do (xi ))

Generative ModelGenerative statistical learning p (x1, . . . xN ;M);M : statistical modelStructural estimation p (x1, . . . xN ;M);M : causal model

© Jiaming Mao

Page 163: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Auction

●● ●

●●

● ●

●●

●●●

5 10 15 20 25Number of Bidders

Win

nin

g B

id

First-price Sealed-bid Auctions for Identical Goods© Jiaming Mao

Page 164: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Auction

ModelN risk-neutral biddersIndependent private value vi ∼i .i .d . F (.)Each bidder knows her own vi and the distribution F , but not the viof othersObserved bids are the Bayesian Nash equilibrium outcome of the game

⇒ Equilibrium bidding strategy:

bi = vi −1

F (vi )N−1

∫ vi

0F (x)N−1 dx

= vi −1

N − 1GN (bi )gN (bi )

, where GN (.) and gN (.) are the c.d.f. and p.d.f. of the bid distribution.

© Jiaming Mao

Page 165: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Auction

Structural Estimation1 For each auctiona, nonparametrically estimate GN (.) and gN (.) from

observed bids {b1, . . . , bN}.2 For each bidder, calculate

vi = bi + 1N − 1

GN (bi )gN (bi )

(12)

3 Use vi to nonparametrically estimate F (.)4 F (.) can be used to predict the winning bid in an N−bidder auction:

E [max {bi}] = E[

max{vi −

1F (vi )N−1

∫ vi

0F (x)N−1 dx

}]

aSee Guerre et al. (2000).© Jiaming Mao

Page 166: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Auction

●● ●

●●

● ●

●●

●●●

5 10 15 20 25Number of Bidders

Win

nin

g B

id

Fit Linear Model Quadratic

© Jiaming Mao

Page 167: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Auction

● ●●

●●

●●

● ●● ●

●●

●●●

● ●●

●● ●

●●●●●

●●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

10 20 30 40 50Number of Bidders

Win

nin

g B

id

Prediction Linear Model Quadratic

© Jiaming Mao

Page 168: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Auction

Here, there is no confounding between N (the number of bidders) andbmax (the winning bid). Hence f (N) ≡ E [bmax|N] nonparametricallyidentifies the effect of N on bmax.

The estimation problem is to learn f (N) from data. Here, theoryhelps specify the functional form of f (N) and therefore serves as amodel selection mechanism.

Theory also helps us to learn the values of the bidders (equation (12))– which cannot be identified nonparametrically – by specifying thefunctional form of the mapping from {vi} to {bi}.

© Jiaming Mao

Page 169: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Monopoly

A monopoly firm’s pricing and sales in different geographical marketsData: price, sales, average income, population for each market

price

sale

s/p

op

ula

tion

income low medium high

© Jiaming Mao

Page 170: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Monopoly

Model: DemandIn each market m with population Nm and mean income Im, consumerschoose between the monopoly product and an outside good. Individualutilities are given by:

Umi0 = εmi0 (13)

Umi1 = β0 + β1Im − β2pm + εmi1

, where (Umi0 ,Um

i1 ) are respectively the indirect utilities of the outside goodand the monopoly product, and εmij ∼ Gumbel (0, 1).

(13) ⇒ qm ∼ Binomial (Nm, πm), where

πm = exp (β0 + β1Im − β2pm)1 + exp (β0 + β1Im − β2pm)

© Jiaming Mao

Page 171: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Monopoly

Model: SupplyFor each market m, given demand qm (p), the monopoly firm chooses p tomaximize:

maxp{p × qm (p)− c(qm (p))} (14)

, where c (q) is the firm’s cost function.

(14) ⇒c ′ (qm) = pm +

[q′m (pm)

]−1 qm (15)

© Jiaming Mao

Page 172: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Monopoly

Q

Pest. demand est. mc monopoly supply

Estimated marginal cost and demand curvesfor a market with median income and population © Jiaming Mao

Page 173: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Monopoly

Here, theory helps us to learn the marginal cost function of themonopoly firm as well as the consumer utility function – neither ofwhich is observed and neither can be nonparametrically identified.

Using the estimation results, we can conduct welfare analysis andmake normative statements.

I For example, calculating the total deadweight loss due to monopoly.

© Jiaming Mao

Page 174: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Counterfactual Simulation

One of the main benefits of learning a structural model is that itallows us to predict the effect of a completely new treatment – atreatment that has never been observed before.

I If in the observed data, xj is always equal to 0, what would be theeffect of do (xj = 1)?

Because structural models are generative models, once we havelearned a model, we can use it to generate synthetic dataD = {(xi ,1, . . . , xi ,j−1, xi ,j = a, xi ,j+1, xi ,n)} fromp (x1, . . . , xn| do (xj = a)). This is called counterfactual simulation.

© Jiaming Mao

Page 175: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Counterfactual Simulation

© Jiaming Mao

Page 176: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Counterfactual Simulation

What if Chuji Qu never visited Niu’s Village?© Jiaming Mao

Page 177: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Counterfactual Simulation

Or Caesar never crossed the Rubicon?© Jiaming Mao

Page 178: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Monopoly

What happens if the government imposes a 20% sales tax on thecompany?

After tax:

∆ Consumer Surplus: −27.83%∆ Total Surplus: −27.95%

Tax incidence:

Consumer: 26.65%

© Jiaming Mao

Page 179: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Dynamic Structural Model

In a changing environment, with new information arriving eachperiod, individual are forward-looking when making decisions: choicesare made partly based on expectations of the future.

Decisions are also often influenced by the past. Since it can be costlyto transition from one state to another, payoffs to different choicesare often history-dependent: our past partly shapes our future.

In dynamic models, treatment effects can be time-varying and it’soften useful to distinguish between short-run and long-run effects.

© Jiaming Mao

Page 180: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Dynamic Structural Model

Negative Political Advertising

Candidate decides whether to go negative based on pollinga.

Going negative affects future polling, which in turn, affects futurenegative advertising decisions.

Outcome: final vote share.aThis example is taken from Blackwell (2015).

© Jiaming Mao

Page 181: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Dynamic Structural Model

Negative Political Advertising (cont.)Static (single-shot) causal inference:

© Jiaming Mao

Page 182: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Dynamic Structural Model

Negative Political Advertising (cont.)Dynamic causal inference:

© Jiaming Mao

Page 183: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Crop Supply

Data: crop price and percentage of cropland cultivated in a county fort = 1, . . . ,T

●●

●●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●

●●●●

● ●

●●●

● ●●

●●

●●●

●●● ●

●●

●● ●●

●●

●●●

●●●

●●

● ●

●●●

●● ●

●●●●● ●

●●

●●

●●

●●

●●

●●

0.25

0.50

0.75

1.00

100 110 120 130Crop Price ($/MT)

Cro

pla

nd

(%

Cu

ltiva

ted

)

© Jiaming Mao

Page 184: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Crop Supply

●●●●●●

●●●●

●●●●●

●●●●●

●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●

●●●●●

●●●●●

●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●●●●●●

100

110

120

130

0.25

0.50

0.75

1.00

0 25 50 75 100time

$/M

T%

Cu

ltiva

ted

Price

Crop Supply

© Jiaming Mao

Page 185: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Crop Supply

ModelAt the beginning of each period t, each field owner decides whetheror not to plant the crop in the current period.

The decision is based on observed period−t price as well asexpectations of future prices.

If a field has not been cultivated for k periods, then in order to(re)-cultivate it, the farmer needs to pay a one time cost c (k).

Farmers have rational expectation: their expectations of futureprices are unbiased conditional on the information they have.

I Here we assume that crop prices follow an AR(1) process, which isknown to the farmers.

© Jiaming Mao

Page 186: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Crop Supply

Counterfactual Simulation:

How would crop supply change in response to changes in crop pricesif farmers are myoptic: if they are not forward-looking?

How would crop supply change in response to changes in crop pricesif farmers are static: if they are neither forward-looking, nor subject toany re-cultivation costs, so that planting decisions are made entirelybased on current prices?

© Jiaming Mao

Page 187: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Crop Supply

●●●●●●

●●●●

●●●●

●●●●●●

●●●●●●●●●

●●●●

●●●

●●●●●

●●●●

●●●●●●

●●●●

●●

●●●

●●●

●●●●

●●●

●●●

●●●●●●

●●●

●●●

●●●

●●●

●●

0.25

0.50

0.75

1.00

0 25 50 75 100Time

% C

ulti

vate

d

Model Dynamic Myopic Static

© Jiaming Mao

Page 188: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Crop Supply

In general, if we are interested in the effect of x on y , but x isself-selected based on expectations of y , then without any measuresof such expectations, the causal effect cannot be nonparametricallyidentified and we need to rely on theory to specify how expectationsare formed.

Here, farmers take crop prices as given and decide whether or not toplant72. Models like this are called dynamic discrete choicemodels.

If prices are endogenous – if farmers’ planting decisions affectequilibrium prices73 – then we need to model both crop supply andcrop demand. Such models are called dynamic general equilibriummodels.

72This is a reasonable assumption since we are looking at a single county.73For example, if we look at all the farmers in a country or in the world.

© Jiaming Mao

Page 189: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Structural Estimation

S-0. Structural models, by explicitly modeling the data-generating causalmechanisms, make clear what prior knowledge (assumptions) arerelied upon to draw causal inference74.

74There is a mistaken belief among some practitioners that structural estimation iscausal mechanism learning. This is incorrect: structural estimation is learning based onan assumed causal mechanism.

© Jiaming Mao

Page 190: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Structural Estimation

By using theory to specify the functional forms of causal relationships,structural models can be used to:

S-1. identify causal effects or the values of unobserved variables thatcannot be nonparametrically identified75.

S-2. serve as a model selection mechanism76 for causal effects that can beidentified nonparametrically.

75For this reason, structural estimation is sometimes described as identification byfunctional form.

76i.e., determine the functional form of E [y |do (x)].© Jiaming Mao

Page 191: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Structural Estimation

S-3.1. Using structural models, what we learn from one set of dataD ∼ p (x , y) can be potentially used to explain and predict datadrawn from another distribution, say p (u, v), if {x , y} and {u, v}are generated from a similar causal mechanism.

I In other words, what we learn from one observed phenomenon canbe used to explain and predict other related phenomena.

I For example, we can learn individuals’ risk aversion from theirinvestment behavior, which in turn, can help explain and predicttheir career choices.

© Jiaming Mao

Page 192: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Structural Estimation

S-3.2. Structural models make it possible to predict the effects of existingtreatments in a new population/environment, or the effects ofcompletely new treatments.

I To do so, a structural model must be “deep” enough so that itsparameters remain invariant in the new population/environment, orwhen new treatments are applied.

I The concept of invariance is closely related to the concept ofstability for causal relationships. The need for invariant parameters iskey to causal analysis and policy evaluation.

S-4. Once we have learned a structural model, we can use it to generatesynthetic data and perform counterfactual simulations.

© Jiaming Mao

Page 193: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Three Types of Program Evaluation Problems

P-1 Evaluating the impacts of historical programs on outcomes.

P-2 Forecasting the impacts of programs implemented in onepopulation/environment in other populations/environments.

P-3 Forecasting the impacts of programs never historicallyexperienced.

P-1 is the problem of internal validity.P-2 is the problem of external validity.P-2 and P-3 often require the use of structural models.For all three types of problems, if we want to evaluate welfare impact,we need a structural model.

© Jiaming Mao

Page 194: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Structural Estimation

S-5. Structural models, by linking economic theories based on individualpreferences to data, allow the economist to make welfare calculationsand normative statements.

I Individual choices reveal information about their preferences and thepotential outcomes they face77.

77Heckman and Vytlacil (2007): “Incorporating choice into the analysis of treatmenteffects is an essential and distinctive ingredient of the econometric approach .. Anassignment in [the RCM] is an assignment to treatment, not an assignment of incentivesand eligibility for treatment with the agent making treatment choices. [The statisticaltreatment effect approach] has only one assignment mechanism and treatsnoncompliance with it as a problem rather than as a source of information on agentpreferences, as in the econometric approach ... Accounting for uncertainty and subjectivevaluations of outcomes ... is a major contribution of the econometric approach.”

© Jiaming Mao

Page 195: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Why Structural Estimation

S-6. Like all scientific models, structural models can potentially deliverbetter predictive performance than statistical models trained on singledata sets, because their parameters can be learned from acombination of data from various sources that share the sameunderlying causal mechanism78.

78This point is related to S-3.1 and S-3.2: different observed phenomena can all helpinform the values of the same set of “deep” parameters.

© Jiaming Mao

Page 196: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Graphs

A graph consists of nodes (vertices) and undirected or directed links(edges) between nodes.

A path from Xi to Xj is a sequence of connected nodes starting at Xi andending at Xj .

© Jiaming Mao

Page 197: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Directed Graphs

Directed Graphs are graphs in which all the edges are directed.

Directed Acyclic Graph (DAG): Graph in which by following thedirection of the arrows a node will never be visited more than once.

© Jiaming Mao

Page 198: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Directed Graphs

If there is a link from Xi to Xj , then Xi is called a parent of Xj and Xj is achild of Xi .

The ancestors of a node Xi are the nodes with a directed path ending atXi . The descendants of Xi are the nodes with a directed path beginningat Xi .

© Jiaming Mao

Page 199: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Bayesian Networks

A Bayesian network (BN) is a DAG in which each node has associatedthe conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditionalprobabilities:

p (A,B,C ,D,E ) = P (A)P (B)P (C |A,B)P (D|C)P (E |B,C)

© Jiaming Mao

Page 200: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Bayesian Networks

Formally, a Bayesian network79 is a distribution of the form:

p (x1, . . . , xN) =N∏

i=1p (xi | pa (xi ))

, where pa (xi ) denotes the parental variables of xi . The BN is representedas a DAG with the i th node in the graph corresponding to the factorp (xi | pa (xi )).

A BN encodes the following set of conditional independence assumptions,called the local independencies:

xi q nd (xi )| pa (xi )

, where nd (xi ) denotes variables that are not descendants of xi .79Also called belief network.

© Jiaming Mao

Page 201: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Bayesian Networks

The possible ways variables can interact is extremely large. Given Nbinary variables {x1, . . . , xN}, specifying all the joint probabilitiesp (x1, . . . , xN) would take O

(2N)space – impractical for large N.

To render specification and inference tractable, we need to constrainthe nature of the variable interactions. The key idea is to specifywhich variables are independent of others, leading to a structuredfactorization of the joint probability distribution.

Bayesian networks provide a convenient framework for representingsuch independence assumptions via graphs.

© Jiaming Mao

Page 202: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Bayesian Networks

Sally’s burglar Alarm (A) is sounding. Has she been Burgled (B), or wasthe alarm triggered by an Earthquake (E)? She turns the car Radio (R) onfor news of earthquakes.

Without loss of generality, we can write

p (A,R,E ,B) = p (A|R,E ,B) p (R|E ,B) p (E |B) p (B)

Independence Assumptions:p (A|R,E ,B) = p (A|E ,B)p (R|E ,B) = p (R|E )p (E |B) = P (E )

Given these assumptions:

p (A,R,E ,B) = p (A|E ,B) p (R|E ) p (E ) p (B)© Jiaming Mao

Page 203: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Bayesian Networks

Specifying the tables:p (B = 1) = .01p (E = 1) = .000001

p (A|B,E ) p (R|E )

© Jiaming Mao

Page 204: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Bayesian Networks

Initial Evidence: The alarm is sounding

p (B = 1|A = 1) =∑

E ,R p (B = 1,E ,A = 1,R)∑B,E ,R p (B,E ,A = 1,R)

=∑

E ,R p (A = 1|B = 1,E ) p (B = 1) p (E ) p (R|E )∑B,E ,R p (A = 1|B,E ) p (B) p (E ) p (R|E )

≈ 0.99

Additional Evidence: The radio broadcasts an earthquake warning

A similar calculation gives p (B = 1|A = 1) ≈ 0.01

Initially, because the alarm sounds, Sally thinks that she’s been burgled.However, this probability drops dramatically when she hears there’s beenan earthquake. The earthquake “explains away” the alarm.

© Jiaming Mao

Page 205: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Independence in Bayesian Networks

All BNs with three nodes and two links:

In (a), (b) and (c), A, B are conditionally independent given C .

(a) p (A,B|C) = p(A,B,C)p(C) = p(A|C)p(B|C)p(C)

p(C) = p (A|C) p (B|C)

(b) p (A,B|C) = p(A,C)p(B|A,C)p(C) = p(A,C)p(B|C)

p(C) = p (A|C) p (B|C)

(c) is equivalent to (b). In (d), A,B are conditionally dependent given C :p (A,B|C) ∝ p (C |A,B) p (A) p (B).

© Jiaming Mao

Page 206: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Independence in Bayesian Networks

All BNs with three nodes and two links:

In (a), (b) and (c), A, B are marginally dependent.

In (d), A,B are marginally independent:

p (A,B) =∑C

p (A,B,C) =∑C

p (C |A,B) p (A) p (B) = p (A) p (B)

© Jiaming Mao

Page 207: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Collider

Given a path P, a collider is a node c on P with neighbors a and bon P such that a→ c ← b80.

Given a set of nodes C, a path P is blocked by C if at least one ofthe following conditions is satisfied:

1 ∃ a non-collider on P that ∈ C.2 ∃ a collider on P such that neither the collider nor any of its

descendants ∈ C.

80Note that a collider is path specific.© Jiaming Mao

Page 208: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: D-Separation

Given three disjoint sets of nodes {X ,Y, C}, X and Y are said to bed-separated by C if all paths from any element of X to any elementof Y are blocked by C . Otherwise they are d-connected by C .

X and Y are d-separated by C ⇒ X q Y| C81.

81D-connection, however, does not necessarily imply conditional dependence. Moreprecisely, if X and Y are d-separated by C, then X q Y| C in every distributioncompatible with the causal diagram. If X and Y are d-connected by C, then X�qY| C inat least one distribution compatible with the causal diagram.

© Jiaming Mao

Page 209: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Independence in Bayesian Networks

x��qy , x q y | u, x��qy∣∣w , x��qy

∣∣ z ,x��qy

∣∣ {u,w}, x��qy∣∣ {u, z},

x��qy∣∣ {w , z}, x��qy ∣∣ {u,w , z}

x q y , x q y | u, x q y |w , x q y | z ,x q y | {u,w}, x q y | {u, z},x��qy

∣∣ {w , z}, x q y | {u,w , z}

© Jiaming Mao

Page 210: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Observational Equivalence

When two DAGs are compatible with the same probability distribution, wesay they are observationally equivalent.

These two DAGs are observationally equivalent:

A→ B B ← A

Both convey that A and B are dependent: P (A,B) 6= P (A)P (B).

© Jiaming Mao

Page 211: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Observational Equivalence

SkeletonFormed from a directed graph by removing the arrows

ImmoralityAn immorality in a DAG is a configuration of three nodes, A,B,C , suchthat C is a child of both A and B, with A and B not directly connected82.

Observational Equivalence (Markov Equivalence)Two DAGs are observationally equivalent, in the sense that they representthe same set of independence assumptions, if and only if they have thesame skeleton and the same set of immoralities.

82i.e., not “married.”© Jiaming Mao

Page 212: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Observational Equivalence

(a), (b) and (c) are observationally equivalent, while (d) is observationallydifferent from (a) - (c).

© Jiaming Mao

Page 213: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix I: Observational Equivalence

© Jiaming Mao

Page 214: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix II: Structural Interpretation of the RCM

Let x be a potential cause of y and let v represent all other causes of y ,observed and unobserved83. Then, taking a deterministic view of theworld84, there exists a function f such that

y ← f (x , v) (16)

, where we use ← to make clear (16) is a structural equation with thecausal meaning that y is determined by x and v .

83We use sY to denote pa (y)\ {x} on the causal diagram. A causal diagram onlyneeds to include enough variables to satisfy the completeness and faithfulnessconditions. We use v to denote all causes of y (other than x), including those not onthe causal diagram, so that y is completely determined by x and v .

84Note that quantum physics take a non-deterministic view of the world.© Jiaming Mao

Page 215: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix II: Structural Interpretation of the RCM

When x ∈ {0, 1}, for each individual i , we can write

Y0i = f (0, vi )Y1

i = f (1, vi )

, i.e., the effect of an intervention do (x) is to change the value of x whileholding the other determinants of y constant.

Individual treatment effect:

τi = f (1, vi )− f (0, vi )

© Jiaming Mao

Page 216: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix II: Structural Interpretation of the RCM

At the population level, we have:

E [Yx=a] =∫

f (a, v) p (v) dv

E [y |x = a] =∫

yp (y |x = a) dy

=∫

f (a, v) p (v |x = a) dv

Therefore, E [Yx ] = E [y |x ] if v and x are independent such thatp (v |x) = p (v)85.

85This is another way of saying that when x is exogenous to y, their correlation impliescausation.

© Jiaming Mao

Page 217: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix II: Structural Interpretation of the RCM

When x ∈ {0, 1}, we can always write86

E [y |x ] = α0 + α1x

When v q x , α1 is the ATE:

α1 = E [y |x = 1]− E [y |x = 0]

= E[Y1]− E

[Y0]

= E [τ ]

86without imposing statistical assumptions.© Jiaming Mao

Page 218: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix II: Structural Interpretation of the RCM

Equivalently, with binary x , we can write:

y = x · f (1, v) + (1− x) · f (0, v)= f (0, v) + (f (1, v)− f (0, v)) · x= E [f (0, v)] + E [f (1, v)− f (0, v)] · x + u= α0 + α1x + u

, where E [.] is with respect to v , and

u = u (x , v) ={f (0, v)− E [f (0, v)] if x = 0f (1, v)− E [f (1, v)] if x = 1

= f (0, v)− E [f (0, v)] + (τ − E [τ ]) · x

So u captures “control group” variation + treatment effect heterogeneity.

© Jiaming Mao

Page 219: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix II: Structural Interpretation of the RCM

If v q x , then

E [u|x = a] = E [ f (x , v)− E [f (x , v)]| x = a]= E [ f (a, v)| x = a]− E [f (a, v)][1]= 0

, where [1] follows from v q x .

Therefore, v q x ⇒ u has mean 0 and is mean independent of x :

E [u|x ] = E [u] = 0

, which further implies that u and x are linearly uncorrelated:

E [ux ] = 0

© Jiaming Mao

Page 220: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

The econometric approach87 to causal reasoning starts with assuming a“true model”:

y = α0 + α1x + u (17)

, where x is observed, u is unobserved, and α1 represents the averagecausal effect of x on y .

To estimate α, we can run a linear regression:

y = β0 + β1x + e (18)

If u and x are linearly uncorrelated, then we say x is exogenous and OLSestimation of (18) produces β1 that is an unbiased estimate of α1.Otherwise, x is endogenous and β1 will be a biased estimate of α1.

87We focus on the reduced-form approach.© Jiaming Mao

Page 221: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

Limitations:

1. The econometric approach does not clearly distinguish between causalreasoning and statistical modeling, between what assumptions arecausal and what are statistical.

© Jiaming Mao

Page 222: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

(17) is often confused with (18). (18) is a statistical model, wherethe linear relation between x and y is purely a statistical assumption.OLS estimation of (18) will always produce β that are unbiasedestimates of

β∗ = arg minβ=(β0,β1)

E[(y − β0 − β1x)2

], in the sense that E

(β)

= β∗.

© Jiaming Mao

Page 223: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

(17), on the other hand, is (part of) a causal model that makes thecausal assumption that y is determined by x and variables in u, inaddition to making a statistical assumption on the functional form oftheir relationship88. In other words, (17) should be considered astructural equation.

β1 is a statistical parameter. α1 is a causal parameter.

88See page 218

© Jiaming Mao

Page 224: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

When the econometrics literature states that: “when the error term iscorrelated with the regressor, the estimation result is biased,” it reallyhas in mind two models: (17) and (18).

I By “the error term,” it is referring to u not to e.

I By “the estimation result is biased,” it is saying that E(β)6= α, i.e.,

the statistical model (18) produces a biased estimate of the causalmodel (17).

F β1 is an unbiased estimate of the statistical parameter β∗1 , but a biased

estimate of the causal parameter α1.

© Jiaming Mao

Page 225: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

However, this is not always made clear. Most econometrics textbooksuse one equation to represent both models, confusing what is causaland what is statistical.

The requirement that u be (linearly) uncorrelated with x is oftenstated as an essential assumption on the linear regression model itself,under which the OLS estimator is unbiased89,90 – again confusingwhat is causal and what is statistical.

89For example, in Stock and Watson (2010), this is listed as one of the “LeastSquares Assumptions.”

90The linear regression model as a statistical model, of course, places no suchassumptions on the error term.

© Jiaming Mao

Page 226: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in EconometricsStructural vs. Statistical EquationFailure to understand (17) as a structural equation can lead to confusionand unhappiness in life. Take, for example, the following causal model:

x ∼ U (0, 1) , y ∼ U (0, 1)u ← y − x

The causal effect of x on y is 0 – no effect. However, the causal modelimplies the following statistical equation:

y = αx + u (19)

, where α = 1 and u is correlated with both x and y .

Failure to understand (19) as a statistical rather than structural equationwill lead us to wrongly conclude that a regression of y on x will producebiased causal estimates – it won’t. It will only produce a biased estimateof α, but α is not the average causal effect of x on y . © Jiaming Mao

Page 227: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

Limitations:

2. The econometric approach does not clearly distinguish betweennonparametric and parametric identification, and betweenidentification and estimation.

© Jiaming Mao

Page 228: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

Identification in the econometric approach refers to whether theparameters in a “true model” can be uniquely determined giveninfinite data on the observed variables. A “true model” like (17),however, already makes parametric assumptions on the underlyingcausal structure.

Therefore, while causal inference based on causal graphical modelsrecognizes causal effect learning as a two-stage process –identification and estimation – and uses the back-door and front-doorcriteria to establish clear rules for nonparametric identification, theeconometric approach fails to do so91.

91This failure leads to confusion over how to apply statistical and machine learningmodels to causal inference.

© Jiaming Mao

Page 229: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

Limitations:

3. The econometric approach does not provide an easily operational wayof choosing control variables for identifying a desired causal effect.

© Jiaming Mao

Page 230: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

If x is endogenous, we can try to find control variables z such thatconditional on z , u is no longer correlated with x in the followingmodel:

y = α0 + α1x + α2z + u (20)

Then we can run the following regression:

y = β0 + β1x + β2z + e (21)

and β1 will be an unbiased estimate of α1 – our causal parameter ofinterest.

© Jiaming Mao

Page 231: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

The requirement that u be (linearly) uncorrelated with x conditionalon z can be thought of as implied by the back-door criterion.

However, unlike the back-door criterion, the econometric approach tocausal reasoning does not offer a clear guidance on the choice of z ,because it is not based on a clear thinking of the underlying causalmechanism (such as a causal diagram representation)92.

92Clear thinking on causal mechanism forms part of the appeal of structuralestimation over the reduced-form approach.

© Jiaming Mao

Page 232: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

The statistics and econometrics literature often state that in order forthe strategy outlined on page 230 to be successful – in order for causaleffect to be identifiable by conditioning – there must be nounmeasured confounding (statistics) or selection on observables(econometrics).

However, the back-door criterion shows that we do not need toobserve and condition on all confounders, only a sufficient set ofvariables that renders all back-door paths blocked.

I In Figure 3, W is the confounder, but we do not need to observe it:the causal effect of X on Y is identifiable by conditioning on C .

© Jiaming Mao

Page 233: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

There are now more than one “true models”, because for mostproblems, multiple sets of variables exist that can render all back-doorpaths blocked ({C} , {W } , {C ,W } in Figure 3).

The econometrics literature defines omitted variable bias as the biasthat arises when a variable is omitted from the “true model.” Butwith more than one true model – with multiple sets of z that cansufficiently control for confounding – what is an omitted variable?

© Jiaming Mao

Page 234: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

Not only is (20) no longer unique, its meaning is now no longer clear.Once we include control variables z , (20) can no longer be interpretedas a structural equation in a causal model. The back-door criterionmakes it clear that z can include not only direct causes of y (W inFigure 3), but also variables that are not causes of y (C in Figure 3).Therefore, (20) is no longer a structural equation specifying therelation between y and its determinants. The meaning of theequation itself has become unclear, not to mention its ability to offermeaningful guidance on the choice of z .

The back-door criterion also makes it clear that z should not includedescendants of x (C in Figure 4) and should not include colliders thatmay open a back-door path (E in Figure 4). The econometricapproach offers none of these guidances.

© Jiaming Mao

Page 235: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Appendix III: Causal Reasoning in Econometrics

X Y

C W

x Y

U S

x Y

U S

Figure 3:

X Y

C

W

x Y

U S

x Y

U S

E

P

Figure 4:

© Jiaming Mao

Page 236: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Acknowledgement I

Part of this lecture is adapted from the following sources:

Antoine de Saint-Exupéry. 2001. Le Petit Prince. Harcourt, Inc.(Original work published 1943.)Barber, D. 2012. Bayesian Reasoning and Machine Learning.Cambridge University Press.Blackwell, M. 2015. Causal Inference. Lecture at Harvard University,retrieved on 2017.01.01. [link]Hernán, M. A. and J. M. Robins. 2018. Causal Inference. BocaRaton: Chapman & Hall/CRC, forthcoming.James, G., D. Witten, T. Hastie, R. Tibshirani. 2013. AnIntroduction to Statistical Learning: with Applications in R. Springer.

© Jiaming Mao

Page 237: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Acknowledgement IIMorgan, S. L. and C. Winship. 2007. Counterfactuals and CausalInference: Methods and Principles for Social Research. CambridgeUniversity Press.Pearl, J. 2009. Causality: Models, Reasoning and Inference (2nd ed.).Cambridge University Press.Pearl, J., M. Glymour, and N. P. Jewell. 2016. Causal Inference inStatistics: A Primer. Wiley.Roy, J. A. A Crash Course in Causality: Inferring Causal Effects fromObservational Data, Lecture at the University of Pennsylvania,retrieved on 2019.01.01. [link]Silva, R. Causal Inference in Machine Learning. Talk at ImperialCollege London, retrieved on 2017.01.01. [link]Varian, H. R., Machine Learning and Econometrics. Talk at Google,retrieved on 2017.01.01. [link]

© Jiaming Mao

Page 238: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Reference I

Angrist, J. D. and J. Pischke. 2009. Mostly Harmless Econometrics: AnEmpiricist’s Companion. Princeton University Press.

Bjerkholt, O. 1998. “Ragnar Frisch and the Foundation of the EconometricSociety,” In Steinar Strøm (ed.), Econometrics and Economic Theory in the 20thCentury: The Ragnar Frisch Centennial Symposium, Cambridge University Press.

Dawid, A. P. 2000. “Causal inference without counterfactuals,” Journal of theAmerican Statistical Association, 95(450).

Deaton, A. and N. Cartwright. 2018. “Understanding and misunderstandingrandomized controlled trials,” Social Science & Medicine, 210.

Frisch, R. 1926. “Sur un problème d’économie pure,” Norsk Matematisk ForeningsSkrifter, Series I, No. 16.

Guerre, E., I. Perrigne, and Q. Vuong. 2000. “Optimal Nonparametric Estimationof First-Price Auctions,” Econometrica, 68(3).

© Jiaming Mao

Page 239: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Reference II

Haavelmo, T. 1943. “The Statistical Implications of a System of SimultaneousEquations,” Econometrica, 11(1).

Haavelmo, T. 1944. “The Probability Approach in Econometrics,” Econometrica,12(Supplement).

Heckman, J. J. and E. J. Vytlacil. 2007. “Econometric Evaluation of SocialPrograms, Part I: Causal Models, Structural Models and Econometric PolicyEvaluation.” In J. Heckman and E. Leamer (eds.), Handbook of Econometrics, Vol.6, Elsevier.

Holland, P. W. 1986. “Statistics and causal inference,” Journal of the AmericanStatistical Association, 81(396).

Hood, W. C. and T. C. Koopmans (eds.) 1953. Studies in Econometric Method,Cowles Commission Monograph 14, Wiley.

Jensen, R. T. and N. H. Miller. 2008. “Giffen Behavior and SubsistenceConsumption,” American Economic Review, 98(4).

© Jiaming Mao

Page 240: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Reference III

Koopmans, T. C. (ed.) 1950. Statistical Inference in Dynamic Economic Models,Cowles Commission Monograph 10, Wiley.

Krueger, Alan. 1999. “Experimental Estimates of Education Production Functions,”Quarterly Journal of Economics, 114.

Neyman, J. 1923. “On the Application of Probability Theory to AgriculturalExperiments. Essay on Principles. Section 9,” translated in Statistical Science, 5(4),(Nov., 1990).

Rubin, D. B. 1974. “Estimating causal effects of treatments in randomized andnonrandomized studies,” Journal of Educational Psychology, 56.

Rubin, D. B. 1978. “Bayesian inference for causal effects: The role ofrandomization,” Annals of Statistics, 6.

Russell, B., 1912. The Problems of Philosophy. Arc Manor, Rockville, MD (2008).

Scott, P. T. 2013. “Dynamic Discrete Choice Estimation of Agricultural Land Use,”Working Paper.

© Jiaming Mao

Page 241: Foundations of Causal Inference · 2019-12-03 · Anecdotesarenotenough Manypeoplehavestrongbeliefsaboutcausaleffectsintheirownlives. Emilyusedtosufferfromchronicmigrainebutnolongerdoes

Reference IV

Stock, J. H. and M. W. Watson. 2010. Introduction to Econometrics (3rd ed.).Pearson Education.

© Jiaming Mao