Matching Methods for High-Dimensional Data with Applications to Text
Molly Roberts, Brandon Stewart and Rich Nielsen
UCSD, Princeton and MIT
May 11, 2016
Roberts (UCSD) Text Matching May 11, 2016 1 / 32
Introduction

How do people react to online repression?

- Lots of governments try to control online information
- Censoring the whole internet is hard (# of bloggers ≫ # of censors)
- Limited external enforcement → governments scare people into self-policing
- Governments might jail some bloggers to scare people
- Then encourage self-censorship by signaling off-limits topics

Roberts (UCSD) Text Matching 28 April 2016 2 / 32
- But this could turn out one of two ways:
  - Bloggers might take cues to self-censor, OR
  - Bloggers might hate censorship and rebel
- How censorship works: important to understand self-censorship
- BUT self-censorship is very hard to measure

Roberts (UCSD) Text Matching 28 April 2016 3 / 32
Introduction
The perfect experiment
1. Be the Chinese government
2. Randomly assign censorship
3. See what bloggers write after censorship
Problem 1: unethical
Problem 2: we aren’t the Chinese government
Roberts (UCSD) Text Matching 28 April 2016 4 / 32
Introduction
How Can We Measure Deterrence?
The best approximation:
Find two bloggers with
✓ similar users,
✓ similar censorship histories,
✓ similar numbers of posts,
✓ similar previous post sensitivity, ...
with similar posts written on the same day.

Only one censored → a censorship ‘mistake’

Does the censored blogger’s behavior change?
- Does the censored blogger stay away from the topic?
- Does the censored blogger pursue the topic?
Roberts (UCSD) Text Matching 28 April 2016 5 / 32
Introduction
Text Matching
- Text as pre-treatment confounder: a surprisingly frequent problem
- Applications:
  - Does censorship change a blogger’s behavior?
  - Do targeted killings of Islamic extremists create interest in their work?
  - In International Relations, are women cited less frequently than men?
  - Control for letters of recommendation, trade treaties, Congressional bills, etc.
- BUT existing matching methods are impossible to apply to high-dimensional data
  - You can’t possibly match on every word! (and you wouldn’t want to)
  - We care about controlling for covariates predictive of treatment
  - But with text, we don’t know what predicts treatment

Very little work on this.
Roberts (UCSD) Text Matching 28 April 2016 6 / 32
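The claim that you can’t match on every word can be made concrete with a toy simulation (not from the paper): draw documents as random binary word-indicator vectors and count how often a "treated" document has an exact twin among the controls. The match rate collapses as the vocabulary grows.

```python
import random

random.seed(0)

def exact_match_rate(n_docs, vocab_size):
    """Fraction of treated docs with an exact bag-of-words twin among
    an equal number of control docs, drawing word indicators at random."""
    def doc():
        return tuple(random.randint(0, 1) for _ in range(vocab_size))
    treated = [doc() for _ in range(n_docs)]
    controls = {doc() for _ in range(n_docs)}
    return sum(t in controls for t in treated) / n_docs

# With a handful of "words", exact twins are everywhere; with even a
# modest vocabulary (real corpora have tens of thousands of words),
# exact twins essentially never exist.
small = exact_match_rate(500, 3)     # almost every doc finds a twin
large = exact_match_rate(500, 100)   # essentially none do
print(small, large)
```

With 3 binary word indicators there are only 8 possible documents, so 500 controls cover them all; with 100 indicators the space has 2^100 cells and collisions are vanishingly rare.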
Introduction
Our Approach to Text Matching
1. Construct analogs to current methods
   - Propensity score matching → Multinomial Inverse Regression
   - Coarsened exact matching → Topically Coarsened Matching
2. Identify benefits and drawbacks of each
3. Create a new method, Topical Inverse Regression Matching (TIRM), by combining the two
Roberts (UCSD) Text Matching 28 April 2016 7 / 32
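A minimal sketch of the TIRM idea on invented toy data: stratify documents by a coarsened topic and by a binned stand-in for the treatment propensity, then match treated to control documents only within cells. The keyword-based `topic` and `propensity_bin` helpers here are illustrative stand-ins; the actual method estimates these quantities with a topic model and a multinomial inverse regression rather than hand-picked keywords.

```python
from collections import defaultdict

# Toy corpus: (doc_id, text, treated) where treated = 1 means censored.
docs = [
    (0, "protest street police arrest", 1),
    (1, "protest march police crowd",   0),
    (2, "recipe soup noodle garlic",    1),
    (3, "recipe stew noodle onion",     0),
    (4, "protest police tear gas",      0),
]

SENSITIVE = {"protest", "police", "arrest", "tear", "gas"}

def topic(text):
    # Coarsened "topic": does the post lean political or not?
    # (A hand-picked keyword list; the real method estimates topics.)
    return "political" if sum(w in SENSITIVE for w in text.split()) >= 2 else "other"

def propensity_bin(text, width=0.5):
    # Crude stand-in for a treatment propensity: share of sensitive
    # words, coarsened into bins of the given width.
    words = text.split()
    return int(sum(w in SENSITIVE for w in words) / len(words) / width)

# Match censored posts only to uncensored posts that share BOTH the
# topic stratum and the coarsened propensity bin.
cells = defaultdict(lambda: {1: [], 0: []})
for doc_id, text, t in docs:
    cells[(topic(text), propensity_bin(text))][t].append(doc_id)

pairs = [(c[1], c[0]) for c in cells.values() if c[1] and c[0]]
print(pairs)
```

Here the two protest posts with similar shares of sensitive words pair up, the two recipe posts pair up, and the highly sensitive unmatched control is discarded, which mirrors the combination of topical coarsening and propensity binning.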
Introduction
Outline of Talk
- A Quick Review of Matching
- Text Analogs to Current Matching Methods
- Topical Inverse Regression Matching
- Applications
Roberts (UCSD) Text Matching 28 April 2016 8 / 32
Matching Methods for Text
Previous Approaches to Matching
- Goal: t_i ⊥⊥ y_i(1), y_i(0) | x_i
- Many approaches: propensity score matching, coarsened exact matching, genetic matching, covariate-balanced propensity scores, entropy balancing, synthetic matching, Mahalanobis matching, exact matching, subclassification matching, nearest-neighbor matching, full matching ...
- Today, two of these strategies:
  1. model p(t_i | x_i) → propensity scores
  2. match on all x_i → coarsened exact matching
- Both strategies scale poorly with high-dimensional covariates.
Roberts (UCSD) Text Matching 28 April 2016 9 / 32
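The two strategies on this slide can be sketched in miniature (toy numbers, not the authors' data): nearest-neighbor matching on a scalar propensity score, and coarsened exact matching on binned covariates.

```python
# Strategy 1: nearest-neighbor matching on a scalar propensity score.
# Scores here are made up; in practice they come from a fitted model
# of p(t_i | x_i), e.g. a logistic regression.
treated  = {"t1": 0.81, "t2": 0.42, "t3": 0.65}
controls = {"c1": 0.80, "c2": 0.40, "c3": 0.10}

matches = {
    name: min(controls, key=lambda c: abs(controls[c] - score))
    for name, score in treated.items()
}
print(matches)  # each treated unit paired with its closest control

# Strategy 2: coarsened exact matching. Bin each covariate, then
# require exact agreement on the binned values.
def coarsen(age, gender, width=10):
    return (age // width, gender)

strata = {u: coarsen(*x) for u, x in
          {"t1": (23, 1), "c1": (27, 1), "c2": (55, 1)}.items()}
# t1 and c1 share a stratum (same age decade, same gender); c2 does not.
```

Note the first sketch matches with replacement (two treated units may reuse the same control), a design choice real implementations often restrict with calipers or one-to-one constraints. Both sketches break down when x_i is a document-term matrix with tens of thousands of columns, which motivates the text-specific methods above.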
Matching Methods for Text
Previous Approaches to Matching
I Goal: ti ⊥⊥ yi (1), yi (0)|~xi
I Many approaches: propensity score matching, coarsened exact matching, genetic
matching, covariate-balanced propensity scores, entropy balancing, synthetic matching,
mahalanobis matching, exact matching, subclassification matching, nearest neighbor
matching, full matching . . .
I Today two of these strategies:
1. model p(ti |~xi ) propensity scores2. match on all ~xi coarsened exact matching
I Both strategies scale poorly with high-dimensional covariates.
Roberts (UCSD) Text Matching 28 April 2016 9 / 32
Matching Methods for Text
Previous Approaches to Matching
I Goal: ti ⊥⊥ yi (1), yi (0)|~xiI Many approaches: propensity score matching
, coarsened exact matching, genetic
matching, covariate-balanced propensity scores, entropy balancing, synthetic matching,
mahalanobis matching, exact matching, subclassification matching, nearest neighbor
matching, full matching . . .
I Today two of these strategies:
1. model p(ti |~xi ) propensity scores2. match on all ~xi coarsened exact matching
I Both strategies scale poorly with high-dimensional covariates.
Roberts (UCSD) Text Matching 28 April 2016 9 / 32
Matching Methods for Text
Previous Approaches to Matching
I Goal: ti ⊥⊥ yi (1), yi (0)|~xiI Many approaches: propensity score matching, coarsened exact matching
, genetic
matching, covariate-balanced propensity scores, entropy balancing, synthetic matching,
mahalanobis matching, exact matching, subclassification matching, nearest neighbor
matching, full matching . . .
I Today two of these strategies:
1. model p(ti |~xi ) propensity scores2. match on all ~xi coarsened exact matching
I Both strategies scale poorly with high-dimensional covariates.
Roberts (UCSD) Text Matching 28 April 2016 9 / 32
Matching Methods for Text

Propensity Scores: An Analog for Text

I Classical approach
I fit logistic regression π̂i = p(ti | ~xi )
I match units with similar probability of treatment
I pros: units matched on a scalar (π̂i ) instead of a long vector (~xi )
I cons: only produces balance in expectation
I Problem: high-dimensional confounders
I X is N × V (# of documents by # of words in vocab)
I can only estimate π̂i well when N ≫ V , which isn't the case!

Roberts (UCSD) Text Matching 28 April 2016 10 / 32
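The classical two-step procedure can be sketched as follows. This is an illustrative toy example on synthetic low-dimensional data, not the authors' implementation; the logistic fit is a plain gradient-ascent loop to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 units, 3 covariates; treatment depends on the first one.
X = rng.normal(size=(200, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

# Step 1: fit the propensity model pi_i = p(t_i | x_i) by gradient ascent.
w = np.zeros(3)
for _ in range(1000):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.5 * X.T @ (t - p) / len(t)
pi_hat = 1 / (1 + np.exp(-X @ w))

# Step 2: match each treated unit to the control with the closest scalar score.
treated = np.where(t == 1)[0]
control = np.where(t == 0)[0]
matches = {i: control[np.argmin(np.abs(pi_hat[control] - pi_hat[i]))]
           for i in treated}
```

The slide's point is visible here: matching happens on the scalar pi_hat rather than the full covariate vector, which is exactly the step that breaks down when X has as many columns as a vocabulary.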
Matching Methods for Text

Propensity Scores: An Analog for Text

I Solution: Multinomial Inverse Regression (Cook 2007, Taddy 2013)
I assume ~xi ∼ Multinomial(~qi , mi = Σv xi,v )
I where qi,v ∝ exp(αv + ti φv )
I φv measures the relationship between treatment and word v
I the projection zi = Φ′(~xi/mi ) is a sufficient reduction: X ⊥⊥ T | Z, so estimate π̂i with the projection
I Match on zi or π̂i

Roberts (UCSD) Text Matching 28 April 2016 11 / 32
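The sufficient-reduction step is easy to sketch once the loadings are in hand. Here phi is a hand-picked stand-in for the fitted MNIR loadings, used only to show the projection zi = Φ′(~xi/mi ); in practice it comes from the multinomial inverse regression fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy corpus: 5 documents over a 6-word vocabulary (rows of word counts).
X = rng.poisson(2.0, size=(5, 6)) + 1   # +1 guarantees nonzero document lengths
m = X.sum(axis=1)                       # document lengths m_i = sum_v x_iv

# Stand-in for the fitted loadings phi_v (one per word); hypothetical values.
phi = np.array([0.8, -0.2, 0.0, 1.1, -0.5, 0.3])

# Sufficient reduction: z_i = phi' (x_i / m_i), one scalar per document.
z = (X / m[:, None]) @ phi

# Documents would then be matched on z (or on a propensity score fit to z).
```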
Matching Methods for Text
Problems with MNIR Matching
Posts equally likely to be treated are not always semantically similar:
I wouldn’t be a problem in expectation, BUT
I hard to assess balance in the text case
I could be more efficient if matches were more similar
Roberts (UCSD) Text Matching 28 April 2016 12 / 32
Matching Methods for Text

Coarsened Exact Matching: An Analog for Text

I Classical approach
I coarsen each variable into natural categories, e.g. years of education → {elementary school, high school, college}
I exactly match on the coarsened variables
I pros: bounds imbalance on each variable
I Problem: high-dimensional confounder
I thousands of variables; even if we coarsen, no exact matches

Roberts (UCSD) Text Matching 28 April 2016 13 / 32
Matching Methods for Text

Coarsened Exact Matching: An Analog for Text

I Solution: topically coarsened matching
I innovation: coarsen across variables
simple example: “tax”, “income”, “tariff” → “economics”
I topics must be equivalent across documents instead of words
I bounds imbalance across groups of stochastically equivalent words
I Estimate a topic model
I Match on the topic density rather than raw word counts

Roberts (UCSD) Text Matching 28 April 2016 14 / 32
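A minimal sketch of the topically coarsened version, assuming topic proportions θ have already been estimated by a topic model (the θ values below are hand-entered stand-ins, not a real fit):

```python
import numpy as np
from collections import defaultdict

# Stand-in topic proportions theta_i (rows sum to 1); in practice these come
# from a fitted topic model, not hand-entered values.
theta = np.array([
    [0.70, 0.20, 0.10],   # doc 0: mostly topic 0
    [0.72, 0.18, 0.10],   # doc 1: mostly topic 0
    [0.10, 0.80, 0.10],   # doc 2: mostly topic 1
    [0.15, 0.75, 0.10],   # doc 3: mostly topic 1
])
t = np.array([1, 0, 1, 0])   # treatment indicator

# Coarsen each topic proportion into low / medium / high terciles, then
# exact match on the coarsened topic signature rather than raw word counts.
bins = np.digitize(theta, [1/3, 2/3])
strata = defaultdict(list)
for i, key in enumerate(map(tuple, bins)):
    strata[key].append(i)
matched = {k: idx for k, idx in strata.items()
           if {0, 1} <= {t[i] for i in idx}}
```

Documents that would never share an exact word-count signature still land in the same coarsened topic stratum.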
Matching Methods for Text
Problems with Topical CEM
Topics aren’t always the most important predictor of treatment:
Roberts (UCSD) Text Matching 28 April 2016 15 / 32
Matching Methods for Text

Topical Inverse Regression Matching (TIRM)

I We need something that:
1. Bounds imbalance between documents
2. Doesn’t leave out important words
I TIRM: Jointly estimate the probability of treatment and the topic density
I Match on topic proportions & topic-specific probability of treatment
I topical bounding properties
I estimates which words are associated with treatment
I Ingredients:
I Structural Topic Model (Roberts, Stewart, Tingley et al. 2014)
I with treatment as a content covariate

Roberts (UCSD) Text Matching 28 April 2016 16 / 32
Matching Methods for Text

Structural Topic Model

I STM adds “structure” to Latent Dirichlet Allocation (Blei, Ng and Jordan 2003) via a prior
I Replace the topic prevalence prior → (heuristically) a GLM with arbitrary covariates (Blei and Lafferty 2006, Mimno and McCallum 2008)
I Replace the distribution over words → multinomial logit (Eisenstein, Ahmed and Xing 2011)
I Documents have different expected topic proportions based on observed covariates.
I Topics are now deviations from a baseline distribution.

P(word | topic, doc) ∝ exp(κ(m) + topic ∗ κ(k) + covariate_doc ∗ κ(c) + topic ∗ covariate_doc ∗ κ(int))

κ(c) and κ(int) capture how words are related to treatment.

Roberts (UCSD) Text Matching 28 April 2016 17 / 32
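The displayed proportionality is just a multinomial logit, which a few lines make concrete. The κ vectors below are random stand-ins, not fitted values; the point is only the functional form.

```python
import numpy as np

rng = np.random.default_rng(3)
V = 5   # toy vocabulary size

kappa_m   = rng.normal(size=V)   # baseline log-frequencies
kappa_k   = rng.normal(size=V)   # topic deviation
kappa_c   = rng.normal(size=V)   # covariate (treatment) deviation
kappa_int = rng.normal(size=V)   # topic x covariate interaction

def word_dist(topic, covariate):
    """P(word | topic, doc) via the multinomial logit on the slide."""
    eta = kappa_m + topic * kappa_k + covariate * kappa_c \
          + (topic * covariate) * kappa_int
    e = np.exp(eta - eta.max())   # numerically stable softmax
    return e / e.sum()

p_treated = word_dist(topic=1, covariate=1)
p_control = word_dist(topic=1, covariate=0)
```

Because κ(c) and κ(int) enter only when the covariate is on, comparing p_treated and p_control isolates how words are related to treatment.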
Matching Methods for Text
TIRM
Match on:
1. θ: Estimated topic proportions (K covariates)
2. proj:
I let (x_i/m_i) be the % of document i that is each word
I (κ(c))′(x_i/m_i) (the covariate-only projection)
I (κ(c))′(x_i/m_i) + (1/m_i) Σ_v x_{i,v} ((κ_v^(int))′ θ_i) (the topic-covariate projection)
3. Any other covariates you think are important
We generally use CEM to match, but other methods could be used.
Limitations of TIRM
I New: relies on a parametric method to reduce dimensions
I Old: requires SUTVA, relevant covariates
Roberts (UCSD) Text Matching 28 April 2016 18 / 32
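The two projections above can be sketched in numpy; the names `kappa_c`, `kappa_int`, and the random inputs are assumptions standing in for estimated STM coefficients and a real document.

```python
import numpy as np

def tirm_projections(x, kappa_c, kappa_int, theta):
    """Sketch of TIRM's projections for one document.
    x: word counts (V,); kappa_c: (V,); kappa_int: (K, V); theta: (K,)."""
    m = x.sum()
    rates = x / m                          # word proportions x_i / m_i
    cov_only = kappa_c @ rates             # (kappa^(c))' (x_i/m_i)
    # (1/m_i) sum_v x_{i,v} (kappa_v^(int))' theta_i
    interaction = (x * (theta @ kappa_int)).sum() / m
    return cov_only, cov_only + interaction

rng = np.random.default_rng(1)
V, K = 6, 3
x = rng.integers(0, 5, size=V).astype(float)
x[0] += 1                                  # guarantee a non-empty document
theta = rng.dirichlet(np.ones(K))          # estimated topic proportions
kappa_c = rng.normal(size=V)
kappa_int = rng.normal(size=(K, V))
proj_c, proj_tc = tirm_projections(x, kappa_c, kappa_int, theta)
```

Each document then contributes K topic proportions plus one or both scalar projections as matching covariates.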
Matching Methods for Text
Simulations
Set up:
1. Simulate 200 outcome and treatment with confounding topics and words
2. Estimate STM
3. Condition on topics and projection
[Figure: densities of the estimated effect under the Naive Estimator, Topics Only, and Topics and Projection, against the True effect]
Roberts (UCSD) Text Matching 28 April 2016 19 / 32
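The conditioning step in the simulation uses coarsened exact matching; a toy CEM implementation on simulated covariates (not the authors' code, and with illustrative bin and sample counts) looks like this:

```python
import numpy as np
from collections import defaultdict

def cem_strata(X, treated, n_bins=3):
    """Toy coarsened exact matching: coarsen each covariate into equal-width
    bins, exact-match on the bin signature, and keep only strata that
    contain both treated and control units."""
    X = np.asarray(X, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    bins = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        inner = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)[1:-1]
        bins[:, j] = np.digitize(X[:, j], inner)   # bin index 0..n_bins-1
    strata = defaultdict(list)
    for i, sig in enumerate(map(tuple, bins)):
        strata[sig].append(i)
    keep = [i for members in strata.values()
            if any(treated[m] for m in members)
            and any(not treated[m] for m in members)
            for i in members]
    return sorted(keep)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))       # e.g. a topic proportion and a projection
treated = rng.random(200) < 0.5
keep = cem_strata(X, treated)
```

Units in pruned (all-treated or all-control) strata are discarded, which is how matching removes the confounded comparisons in the simulation.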
Estimating Censorship Effects
Example 1: How do bloggers react to censorship?
I Data: 593 bloggers over 6 months spanning 2011 and 2012
I 150,000 posts
I Return to blogs to measure censorship
I Find censors’ mistakes: two similar blogs, different censorship
I Also match on date, previous censorship, previous sensitivity.
I How do ‘treated’ bloggers react to censorship?
I Outcome: Bloggers’ writings after censorship:
I censorship rate after
I sensitivity of blog text after (estimated by TIRM)
I topical content of blogs after
Roberts (UCSD) Text Matching 28 April 2016 20 / 32
Estimating Censorship Effects
TIRM Finds Almost Identical Posts
[Figure: histograms of string kernel similarity (0.0 to 1.0) between matched post pairs, shown for the Original Data, TIRM, Topic Match, and MNIR Match]
Roberts (UCSD) Text Matching 28 April 2016 21 / 32
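String kernels score texts by shared character subsequences; a cosine-normalized character n-gram (spectrum) kernel is a simple stand-in that makes the similarity measure concrete (a sketch, not necessarily the exact kernel used for the histograms).

```python
from collections import Counter
from math import sqrt

def spectrum_similarity(a, b, n=3):
    """Cosine-normalized character n-gram (spectrum) kernel:
    1.0 for identical texts, near 0 for texts sharing no n-grams."""
    def grams(s):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ga, gb = grams(a), grams(b)
    dot = sum(c * gb[g] for g, c in ga.items())
    norm = (sqrt(sum(c * c for c in ga.values()))
            * sqrt(sum(c * c for c in gb.values())))
    return dot / norm if norm else 0.0

s_same = spectrum_similarity("the censors deleted this post",
                             "the censors deleted this post")
s_diff = spectrum_similarity("the censors deleted this post",
                             "this post was deleted by the censors")
```

Near-duplicate posts score close to 1, which is why a right-skewed histogram indicates that TIRM is recovering censors' mistakes on nearly identical text.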
Estimating Censorship Effects
Results
I We find 46 matched blogs (censors’ mistakes)
I Nearly perfect matches
I Most matched posts are about the Bo Xilai incident, Maoist protests
I 5 posts before treatment:
I No statistical difference in actual censorship
I No statistical difference in TIRM-predicted censorship
I (Not surprising, we are matching on these!)
I 5 posts after treatment:
I Treated group: 20% censorship; Control group: 7% censorship
I TIRM estimates treated text significantly more sensitive than control
I Treated group talks significantly more about the Bo Xilai incident after censorship than control
I Treated group talks significantly more about CCP History/Mao after censorship than control
Roberts (UCSD) Text Matching 28 April 2016 22 / 32
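A comparison like the 20% vs. 7% post-treatment censorship rates can be tested with a standard two-proportion z test; the post counts below (46 matched bloggers × 5 posts per arm) are a hypothetical illustration, not figures from the paper.

```python
from math import sqrt, erf

def two_proportion_z(p1, n1, p2, n2):
    """Normal-approximation z test for a difference in two proportions."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # pooled standard error
    z = (p1 - p2) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))        # standard normal CDF
    return z, 2 * (1 - phi)                        # z and two-sided p-value

# hypothetical counts: 46 matched bloggers, 5 post-treatment posts each
z, p = two_proportion_z(0.20, 230, 0.07, 230)
```

At anything like these sample sizes the treated/control gap in censorship rates is far from what chance alone would produce.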
Estimating Effects of Gender Citations in IR
Example 2: Does gender affect citations in Political Science?
I Maliniak, Powers, Walter (2013): women get cited less than men in IR
I Problem: women write about different topics than men
I Maliniak et al. solution: Code articles into (many) categories
I Our solution: Text matching!
I Data: 3,201 journal articles from top 12 IR journals, 1980-2006.
I Code lots of variables, including gender, article age, tenure, etc.
I Treatment: all-female; Control: co-ed/all-male
I Our motive: Find similar articles, see how they are cited differently.
Roberts (UCSD) Text Matching 28 April 2016 23 / 32
Estimating Effects of Gender Citations in IR
Words men and women use differently in IR
[Figure: word plots of gendered word use, compared across the Original Data, Topic Matching, and TIRM]
Roberts (UCSD) Text Matching 28 April 2016 24 / 32
Estimating Effects of Gender Citations in IR
TIRM Reduces Topical Differences
Mean topic difference (Women − Men), plotted from −0.10 to 0.10
[Figure omitted: per-topic estimates compared across the Full Data Set (Unmatched), TIRM, MNIR, Topic Matching, and Human Coding Matched]
Topic 2: model, variabl, data, effect, measur
Topic 9: nuclear, weapon, arm, forc, defens
Topic 6: game, will, cooper, can, strategi
Topic 14: war, conflict, state, disput, democraci
Topic 1: state, power, intern, system, polit
Topic 7: polici, foreign, public, polit, decis
Topic 12: soviet, militari, war, forc, defens
Topic 13: trade, econom, polici, bank, intern
Topic 8: polit, parti, polici, govern, vote
Topic 15: war, israel, peac, conflict, arab
Topic 3: polit, conflict, group, ethnic, state
Topic 4: econom, develop, industri, countri, world
Topic 10: state, china, unit, foreign, polici
Topic 11: intern, state, organ, institut, law
Topic 5: polit, social, one, theori, world
Roberts (UCSD) Text Matching 28 April 2016 25 / 32
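The balance check plotted above can be sketched as follows. The topic proportions are simulated here (the paper's model has 15 topics); only the diagnostic itself, the per-topic difference in group means, reflects the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 15  # number of topics in the slide's model

# Simulated topic proportions for articles by women (treated) and men (control);
# each row is a document's distribution over topics and sums to 1.
theta_women = rng.dirichlet(np.ones(K), size=200)
theta_men = rng.dirichlet(np.ones(K), size=800)

# Balance diagnostic: mean topic difference (Women - Men), one value per topic.
diff = theta_women.mean(axis=0) - theta_men.mean(axis=0)
print(np.round(diff, 3))
```

Because each document's proportions sum to one, the per-topic differences necessarily sum to (approximately) zero; matching shrinks each individual difference toward zero.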
Estimating Effects of Gender Citations in IR
Results
I Maliniak et al: Women receive 80% of the citations of men
I In our data: women receive fewer citations, robust across matching methods
I Final match: Women receive 40-60% of the citations of men
I Still looking into why we are getting more extreme results
I Could be that the difference lies in very high citation counts
Roberts (UCSD) Text Matching 28 April 2016 26 / 32
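One way the last bullet could be probed: with fabricated counts (purely illustrative, not the paper's data), a single heavily cited article pulls a mean-based citation ratio far more than a median-based one:

```python
import numpy as np

# Fabricated matched-sample citation counts, for illustration only.
cites_women = np.array([4, 10, 2, 8, 6])
cites_men = np.array([5, 9, 3, 7, 120])  # one very highly cited article

# Ratio of means is dragged down by the outlier; ratio of medians is not.
mean_ratio = cites_women.mean() / cites_men.mean()
median_ratio = np.median(cites_women) / np.median(cites_men)
print(round(mean_ratio, 2), round(median_ratio, 2))  # 0.21 0.86
```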
Estimating Effects of Gender Citations in IR
Ex. 3: Did killing Bin Laden make his ideas less popular?
“His death will serve as a global clarion call for another generation of jihadists.” – Ed Husain (CFR)
“al-Qaida may emerge even more radical, and more closely united under the banner of an iconic martyr.” – Abdel Bari Atwan (The Guardian)
Roberts (UCSD) Text Matching 28 April 2016 27 / 32
Estimating Effects of Gender Citations in IR
Ex. 3: Did killing Bin Laden make his ideas less popular?
“The idea that Obama made a strategic misstep by killing a man responsible for the death of thousands of U.S. citizens and committed to killing thousands more is absurd. Rather than making him a martyr, Bin Laden’s killing demonstrated that he was, like the rest of us, mortal.” – Robert Simcox (LA Times)
Roberts (UCSD) Text Matching 28 April 2016 28 / 32
Estimating Effects of Gender Citations in IR
Ex. 3: Did killing Bin Laden make his ideas less popular?
We don’t really know.
Usama Bin Laden (killed 5/2/2011), Anwar al-Awlaki (9/30/2011), Abu Yahya al-Libi (6/5/2012)
Roberts (UCSD) Text Matching 28 April 2016 29 / 32
Estimating Effects of Gender Citations in IR
Empirical Strategy
I View-count data from a Jihadist website, scraped over time
I Does targeted killing of Bin Laden increase views of his work?
I TIRM matching + match on pre-treatment page views.
I QOI is the ATT: nearest-neighbor matching instead of CEM
I Validation: Matches accord with sub-pages on website
Roberts (UCSD) Text Matching 28 April 2016 30 / 32
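A minimal sketch of 1-nearest-neighbor matching with replacement for the ATT, assuming the covariates (topic proportions, treatment probability, pre-treatment page views) are already stacked into arrays; this is illustrative, not the authors' implementation:

```python
import numpy as np

def nn_match_att(X_t, X_c, y_t, y_c):
    """1-NN matching with replacement on covariates; returns the ATT estimate."""
    gaps = []
    for xt, yt in zip(X_t, y_t):
        j = int(np.argmin(np.linalg.norm(X_c - xt, axis=1)))  # closest control
        gaps.append(yt - y_c[j])
    return float(np.mean(gaps))

# Toy example: two treated documents, two controls, one covariate each.
X_t = np.array([[0.0], [1.0]]); y_t = np.array([5.0, 7.0])
X_c = np.array([[0.1], [0.9]]); y_c = np.array([4.0, 6.0])
print(nn_match_att(X_t, X_c, y_t, y_c))  # 1.0
```

Matching with replacement keeps every treated unit, which is what targeting the ATT requires.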
Estimating Effects of Gender Citations in IR
Martyr Effect: Clear short-term increase in page views
[Figure omitted: estimated increase in page views per document (y-axis −3000 to 3000) over 2012-2014, with an inset of daily estimates from 2011-05-02 to 2011-11-02; matching on topics and page views]
Figure: Estimated effects of Usama Bin Laden’s death (on May 2, 2011) on subsequent page views of his documents on a large jihadist web-library.
Roberts (UCSD) Text Matching 28 April 2016 31 / 32
Estimating Effects of Gender Citations in IR
Conclusion
I Lots of applications measure pre-treatment confounders with text
I No methods developed yet to do this
I We develop a new method, Topical Inverse Regression Matching
I Matching on the topical density estimate bounds differences between topics
I Matching on the probability of treatment balances on words related to treatment
I Future work:
I Develop theoretical properties of TIRM
I Extend to high-dimensional cases other than text
I Create an R package
Roberts (UCSD) Text Matching 28 April 2016 32 / 32
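The two core conclusion steps, matching on a topical density estimate and on the probability of treatment, can be sketched end-to-end. This uses scikit-learn's LDA and logistic regression as stand-ins (the paper uses STM and inverse regression), and the six toy documents and treatment labels are invented:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["war conflict state dispute", "trade economy bank policy",
        "war dispute democracy state", "economy development industry trade",
        "nuclear weapon arms defense", "party vote government policy"]
treated = np.array([1, 0, 1, 0, 1, 0])  # hypothetical treatment labels

# Step 1: estimate each document's topic proportions (topical density).
dtm = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(dtm)

# Step 2: estimate each document's probability of treatment from the topics.
phat = LogisticRegression().fit(theta, treated).predict_proba(theta)[:, 1]

# Step 3: match each treated document to the control closest in (topics, Pr(treated)).
Z = np.column_stack([theta, phat])
controls = np.where(treated == 0)[0]
pairs = {int(i): int(controls[np.argmin(np.linalg.norm(Z[controls] - Z[i], axis=1))])
         for i in np.where(treated == 1)[0]}
print(pairs)
```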