
Informative Subspace Learning for Counterfactual Inference


Yale Chang, Jennifer G. Dy
Department of Electrical and Computer Engineering

Northeastern University

February 9, 2017

Motivation: Why Causal Inference?

Treatment → Outcome?

Ø Healthcare: New Medication → Blood Pressure?
Ø Economics: Job Training → Employee’s Income?
Ø Advertising: Advertising Campaign → Company’s Revenue?

Question of Interest: what is the causal effect of the treatment on the outcome?

Challenges

Figures: Shalit & Sontag www.cs.nyu.edu/~shalit/tutorial.html

Potential Outcome Framework
Ø Only one outcome can be observed

Randomized Controlled Trial vs. Observational Data
Ø Confounding factors

Contributions of This Work

Ø Propose a novel approach for causal inference on observational data.

Ø Speed up the proposed approach via randomized approximation (reducing complexity from quadratic to linear in the sample size) and prove an upper bound on the approximation error.

Ø Empirical results on simulated and real-world data demonstrate that our proposed approach outperforms competing methods.

Potential Outcome Framework

[Figure: blood pressure vs. age, showing each sample’s control outcome, treatment outcome, observed factual outcome, and unobserved counterfactual outcome]

ITE: Individual Treatment Effect
ATE: Average Treatment Effect
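Stated formally (a brief restatement in standard potential-outcome notation; the symbols Y_1(x), Y_0(x) for the treatment and control outcomes are not spelled out on the slides):

    % Hypothetical notation: Y_1(x_i), Y_0(x_i) denote sample i's potential
    % outcomes under treatment and under control.
    \mathrm{ITE}(x_i) = Y_1(x_i) - Y_0(x_i)                    % effect of the treatment on sample i
    \mathrm{ATE} = \mathbb{E}_{x}\big[ Y_1(x) - Y_0(x) \big]   % ITE averaged over the population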

Nearest Neighbor Matching

Ø Set each sample’s counterfactual outcome equal to the factual outcome of its nearest neighbor in the opposite group (see the sketch after the figure below)

Ø Distances can be measured with the Euclidean metric

[Figure: nearest-neighbor matching between the treatment and control groups, shown in the blood pressure vs. age plot]
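A minimal sketch of this matching step (plain NumPy; the function and array names are illustrative, not from the paper’s code):

    import numpy as np

    def nn_match_counterfactuals(X_t, y_t, X_c, y_c):
        """1-nearest-neighbor counterfactual imputation.

        X_t, y_t: covariates and factual outcomes of the treated samples.
        X_c, y_c: covariates and factual outcomes of the control samples.
        Returns imputed counterfactual outcomes for treated and control samples.
        """
        # Pairwise Euclidean distances between treated and control samples.
        dists = np.linalg.norm(X_t[:, None, :] - X_c[None, :, :], axis=-1)
        # Counterfactual of a treated sample = factual outcome of its nearest control.
        y_t_cf = y_c[dists.argmin(axis=1)]
        # Counterfactual of a control sample = factual outcome of its nearest treated sample.
        y_c_cf = y_t[dists.argmin(axis=0)]
        return y_t_cf, y_c_cf

Estimated ITEs then follow as y_t − y_t_cf for treated samples and y_c_cf − y_c for controls; the ATE is their average.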

Nearest Neighbor Matching

However!

Ø Not all features affect the outcome. In this example only age affects the outcome, so Euclidean matching over all features (including the irrelevant weight) can select poor neighbors.

Ø Need to learn informative subspaces (predictive of the outcomes) for both the treatment and the control group before matching.

[Figure: blood pressure vs. age and weight; only age is relevant to the outcome]

Informative Subspace Learning

Key Property: samples $x_i$ with similar outcomes $y_i$ should be close in the learned subspace.

$$K_Y = \begin{bmatrix} \mathrm{sim}(y_1, y_1) & \cdots & \mathrm{sim}(y_1, y_n) \\ \vdots & \ddots & \vdots \\ \mathrm{sim}(y_n, y_1) & \cdots & \mathrm{sim}(y_n, y_n) \end{bmatrix}$$

Learn a projection matrix $W \in \mathbb{R}^{d \times q}$ that maps $x_i \in \mathbb{R}^d$ to its low-dimensional embedding $z_i = W^{\top} x_i \in \mathbb{R}^q$ while preserving the similarity structure in $Y$.

$$K_Z = \begin{bmatrix} \mathrm{sim}(z_1, z_1) & \cdots & \mathrm{sim}(z_1, z_n) \\ \vdots & \ddots & \vdots \\ \mathrm{sim}(z_n, z_1) & \cdots & \mathrm{sim}(z_n, z_n) \end{bmatrix}$$

Maximize the Hilbert-Schmidt Independence Criterion (HSIC) between $Z$ and $Y$:

$$\mathrm{HSIC}(Z, Y) = \frac{1}{n(n-1)} \operatorname{Tr}(K_Z K_Y) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} K_Z(i, j)\, K_Y(i, j)$$

Error Bound on HSIC Approximation

Challenge: storing and computing the kernel matrices is quadratic in the sample size.

Solution: approximate the kernel matrices with random Fourier features.

$K_Z \approx F F^{\top}$, with $F \in \mathbb{R}^{n \times m}$ (instead of $K_Z \in \mathbb{R}^{n \times n}$)

$K_Y \approx G G^{\top}$, with $G \in \mathbb{R}^{n \times l}$ (instead of $K_Y \in \mathbb{R}^{n \times n}$)

$m, l$ are the numbers of random Fourier features, $m, l \ll n$.
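A sketch of one standard way to build such features for an RBF kernel (Rahimi-Recht random Fourier features; the helper name and parameters are illustrative):

    import numpy as np

    def random_fourier_features(A, num_features, gamma=1.0, rng=None):
        """Features F such that F @ F.T approximates the RBF kernel
        exp(-gamma * ||a_i - a_j||^2) (Rahimi & Recht, 2007)."""
        rng = np.random.default_rng(rng)
        d = A.shape[1]
        # Frequencies drawn from the spectral density of the RBF kernel.
        Omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
        return np.sqrt(2.0 / num_features) * np.cos(A @ Omega + b)

    # With F = random_fourier_features(Z, m) and G = random_fourier_features(Y, l),
    # Tr(K_Z K_Y) is approximated by Tr((F F^T)(G G^T)) = ||F^T G||_F^2,
    # which costs O(n m l) instead of O(n^2).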

Approximation Error Bound

The expected error $\mathbb{E}\,|\mathrm{error}|$ of the random-feature estimate of HSIC is bounded above by a quantity that involves a $\log(nml)$ factor and shrinks as the numbers of random features $m$ and $l$ grow.

Learning Objective

$$\max_{W} \;\; \mathrm{HSIC}(Z, Y) - \lambda \|W\|_F^2$$

Ø Solved with L-BFGS (see the sketch after this list)

Ø Time complexity: 𝒪(𝑛(𝑚𝑑 +𝑚𝑙 + 𝑑𝑞))

Ø Storage cost: 𝒪(𝑛(𝑑 +𝑚 + 𝑙))
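A rough end-to-end sketch of this optimization, assuming scipy’s L-BFGS-B with numerical gradients and reusing the illustrative random_fourier_features helper above (none of these names come from the paper’s code):

    import numpy as np
    from scipy.optimize import minimize

    def fit_projection(X, Y, q, m=100, l=100, lam=0.1, seed=0):
        """Learn W by maximizing the random-feature HSIC objective minus lam * ||W||_F^2."""
        n, d = X.shape
        Y = np.asarray(Y).reshape(n, -1)
        G = random_fourier_features(Y, l, rng=seed)        # outcome features, computed once

        def neg_objective(w_flat):
            W = w_flat.reshape(d, q)
            # Re-seeding keeps the random frequencies fixed across L-BFGS iterations.
            F = random_fourier_features(X @ W, m, rng=seed + 1)
            hsic = np.linalg.norm(F.T @ G, "fro") ** 2 / (n * (n - 1))
            return -(hsic - lam * np.sum(W ** 2))

        w0 = np.random.default_rng(seed).normal(size=d * q)
        res = minimize(neg_objective, w0, method="L-BFGS-B")  # gradients estimated numerically
        return res.x.reshape(d, q)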

Infant Health and Development Program (IHDP) Data

[Results figure: proposed method compared with MDM, PSM, RLP, LASSO, BART, and Causal Forest]

News Data

[Results figure: proposed method compared with MDM, PSM, RLP, LASSO, BART, and Causal Forest]

Summary

Ø Significantly improve nearest-neighbor matching for counterfactual inference through informative subspace learning.

Ø Speed up the HSIC computation via random Fourier features and prove an upper bound on the approximation error.

Ø Empirically show state-of-the-art performance on real datasets.
