Connecting the Dots Between News Articles Dafna Shahaf and
Carlos Guestrin
Slide 2
Slide 3
Information overload is everywhere
Slide 4
Well, we have Google
Slide 5
Search Limitations InputOutput Interaction New query
Slide 6
Our Approach InputOutput Interaction Phrase complex information
needs Structured, annotated output New queryRicher forms of
interaction
Slide 7
Connecting the Dots: News Domain 3.19.2008
Slide 8
Input: Pick two articles (start, goal) Output: Bridge the gap
with a smooth chain of articles Input: Pick two articles (start,
goal) Output: Bridge the gap with a smooth chain of articles Input:
Pick two articles (start, goal) InputOutput Interaction InputOutput
Interaction Bailout Housing Bubble
Slide 9
Keeping Borrowers Afloat A Mortgage Crisis Begins to Spiral,...
Investors Grow Wary of Bank's Reliance on Debt Markets Can't Wait
for Congress to Act Bailout Plan Wins Approval Housing Bubble
Bailout InputOutput Interaction Input: Pick two articles (start,
goal) Output: Bridge the gap with a smooth chain of articles Input:
Pick two articles (start, goal) Output: Bridge the gap with a
smooth chain of articles
Slide 10
Game Plan What is a good chain? Formalize objective Score a
chain Find a good chain
Slide 11
What is a Good Chain? Whats wrong with shortest-path? Build a
graph Node for every article Edges based on similarity
Chronological order (DAG) Run BFS s t
Slide 12
Shortest-path A1: alks Over Ex-Intern's Testimony On Clinton
Appear to Bog A2: Judge Sides with the Government in Microsoft
Antitrust Trial A3: Who will be the Next Microsoft? trading at a
market capitalization A4: Palestinians Planning to Offer Bonds on
Euro. Markets A5: Clinton Watches as Palestinians Vote to Rescind
1964 Provision A6: ontesting the Vote: The Overview; Gore asks
Public For Lewinsky Florida recount Talks Over Ex-Intern's
Testimony On Clinton Appear to Bog Down Contesting the Vote: The
Overview; Gore asks Public For Patience;
Slide 13
Shortest-path A1: alks Over Ex-Intern's Testimony On Clinton
Appear to Bog A2: Judge Sides with the Government in Microsoft
Antitrust Trial A3: Who will be the Next Microsoft? trading at a
market capitalization A4: Palestinians Planning to Offer Bonds on
Euro. Markets A5: Clinton Watches as Palestinians Vote to Rescind
1964 Provision A6: ontesting the Vote: The Overview; Gore asks
Public For Talks Over Ex-Intern's Testimony On Clinton Appear to
Bog Down Contesting the Vote: The Overview; Gore asks Public For
Patience;
Slide 14
Shortest-path A1: A2: Judge Sides with the Government in
Microsoft Antitrust Trial A3: Who will be the Next Microsoft?
trading at a market capitalization A4: Palestinians Planning to
Offer Bonds on Euro. Markets A5: Clinton Watches as Palestinians
Vote to Rescind 1964 Provision A6: Talks Over Ex-Intern's Testimony
On Clinton Appear to Bog Down Contesting the Vote: The Overview;
Gore asks Public For Patience; Stream of consciousness? - Each
transition is strong - No global theme Stream of consciousness? -
Each transition is strong - No global theme
Slide 15
More-Coherent Chain Lewinsky Florida recount Talks Over
Ex-Intern's Testimony On Clinton Appear to Bog Down Contesting the
Vote: The Overview; Gore asks Public For Patience; B1: B2: Clinton
Admits Lewinsky Liaison to Jury B3: G.O.P. Vote Counter in House
Predicts Impeachment of Clinton B4: Clinton Impeached; He Faces a
Senate Trial B5: Clintons Acquittal; Senators Talk About Their
Votes B6: Aides Say Clinton Is Angered As Gore Tries to Break Away
B7: As Election Draws Near, the Race Turns Mean B8:
Slide 16
More-Coherent Chain Talks Over Ex-Intern's Testimony On Clinton
Appear to Bog Down Contesting the Vote: The Overview; Gore asks
Public For Patience; B1: B2: Clinton Admits Lewinsky Liaison to
Jury B3: G.O.P. Vote Counter in House Predicts Impeachment of
Clinton B4: Clinton Impeached; He Faces a Senate Trial B5: Clintons
Acquittal; Senators Talk About Their Votes B6: Aides Say Clinton Is
Angered As Gore Tries to Break Away B7: As Election Draws Near, the
Race Turns Mean B8: What makes it coherent?
Slide 17
For Shortest Path Chain Topic changes every transition
(jittery) Word Patterns
Slide 18
For Coherent Chain Topic consistent over transitions Use this
intuition to estimate coherence of chains
Slide 19
What is a Good Chain? Every transition is strong Global theme
No jitteriness (back-and-forth) Short (5-6 articles?)
Slide 20
What is a Good Chain? Every transition is strong Global theme
No jitteriness (back-and-forth) Short (5-6 articles?)
Slide 21
Strong transitions between consecutive documents d1 w1:
Lewinsky w2: Clinton w5: Microsoft w4: Intern w3: Oath 4 4 3 3 1 1
min(4,3,1)=1 d2d3 d4
Slide 22
Strong transitions between consecutive documents min(4,3,1)=1
Too coarse Word importance in transition Missing words ???
Intuitively, high iff d i and d i+1 very related w plays an
important role in the relationship Intuitively, high iff d i and d
i+1 very related w plays an important role in the relationship
Slide 23
Influence min(4,3,1)=1 Intuitively, high iff d i and d i+1 very
related w plays an important role in the relationship Intuitively,
high iff d i and d i+1 very related w plays an important role in
the relationship Most methods assume edges Influence propagates
through the edges No edges in our dataset
Slide 24
Computing Influence(d i, d j | w) Clinton Admits Lewinsky
Contest the Vote Judge Sides with the Govmnt The Next Microsoft
Clinton Judge Microsoft Gore didi didi djdj djdj w w
Slide 25
Computing Influence(d i, d j | w) Clinton Admits Lewinsky
Contest the Vote Judge Sides with the Govmnt The Next Microsoft
Clinton Judg e Microsof t Gore didi didi djdj djdj w w 1. Run
random walks - Random restarts from d i - controls expected length
1. Run random walks - Random restarts from d i - controls expected
length
Slide 26
Computing Influence(d i, d j | w) Clinton Admits Lewinsky
Contest the Vote Judge Sides with the Govmnt The Next Microsoft
Clinton Judge Microsoft Gore didi didi djdj djdj w w
Slide 27
Computing Influence(d i, d j | w) Clinton Admits Lewinsky
Contest the Vote Judge Sides with the Govmnt The Next Microsoft
Clinton Judge Microsoft Gore didi didi djdj djdj w w Calculate
stationary distribution of dj - Intuitively, high if documents are
related Calculate stationary distribution of dj - Intuitively, high
if documents are related How important is w? -Check how many walks
went through w How important is w? -Check how many walks went
through w
Slide 28
Computing Influence(d i, d j | w) Clinton Admits Lewinsky
Contest the Vote Judge Sides with the Govmnt The Next Microsoft
Clinton Judge Microsoft Gore didi didi djdj djdj w w d j no longer
reachable: All influence is due to w d j no longer reachable: All
influence is due to w 2. Influence(d i, d j | w) = Stationary
distribution(d j ) with w - Stationary distribution(d j ) without w
2. Influence(d i, d j | w) = Stationary distribution(d j ) with w -
Stationary distribution(d j ) without w
Slide 29
Influence: Reality Check d i : OJ Simpson trial article d j :
DNA evidence in OJ trial d j : Super Bowl 49ers
Slide 30
Coherence formulation No edges. Computed using random
walks
Slide 31
What is a Good Chain? Every transition is strong Global theme
No jitteriness (back-and-forth) Short (5-6 articles?)
Slide 32
Global Theme, No Jitter Jittery chain can score well! But need
a lot of words Good chains can often be represented by a small
number of segments
Slide 33
Global Theme, No Jitter Choose 3 segments to be scored on Good
score Score = 0
Slide 34
Coherence: New Objective Maximize over legal activations: Limit
total number of active words Limit number of words per transition
Each word to be activated at most once
Slide 35
Game Plan What is a good chain? Formalize objective Score a
chain Find a good chain
Slide 36
Scoring a Chain Problem is NP-Complete Softer notion of
activation: [0,1] Natural formalization as a linear program
(LP)
Slide 37
LP: Objective Pre-computed
Slide 38
LP: Smoothness A word is active if either Active before Just
initialized Each word is initialized at most once
Slide 39
LP: Activation Limit #words Limit #words per transition
Slide 40
Scoring a chain September 11 th to Daniel Pearl Example
Activation levels weighted by influence (rescaled)
Slide 41
Game Plan What is a good chain? Formalize objective Score a
chain Find a good chain
Slide 42
Finding a good chain Cant brute-force n d possible chains:
>>10 20 after pruning Joint LP: optimize activation and chain
New variables: Is document d i a part of the chain? Does document d
j come after d i in the chain? New constraints: Chain structure
Length = K s23t next(s,3) next(s,2) next(s,t)
Slide 43
Unlike previous LP, we need to round Extract a chain
Approximation guarantees Chain length K in expectation Objective:
O(sqrt(ln(n/ )) with probability 1- Rounding s23t 0.7 0.3 0.6 0.1
0.9
Slide 44
Scaling Up LP has variables Polynomial, but D is large
Restricting number of documents Sparsifying the graph Random
walks
Slide 45
Game Plan What is a good chain? Formalize objective Score a
chain Find a good chain How good is it?
Slide 46
Evaluation: Competitors Shortest path Google Timeline Enter a
query Pick k equally-spaced articles Event threading (TDT)
[Nallapati et al 04] Generate cluster graph Representative articles
from clusters
Slide 47
Example Chain (1) Simpson Strategy: There Were Several Killers
O.J. Simpson's book deal controversy CNN OJ Simpson Trial News:
April Transcripts Tandoori murder case a rival for OJ Simpson case
Google News Timeline Simpson trial Simpson verdict
Slide 48
Example Chain (2) Issue of Racism Erupts in Simpson Trial
Ex-Detective's Tapes Fan Racial Tensions in LA Many Black Officers
Say Bias Is Rampant in LA Police Force With Tale of Racism and
Error, Lawyers Seek Acquittal Connect-the-Dots Simpson trial
Simpson Verdict
Slide 49
Evaluation #1: Familiarity 18 users Show two articles 5 news
stories Before and after reading the chain Do you know a coherent
story linking these articles?
Slide 50
Average fraction of gap closed Better Base familiarity: 2.1 3.1
3.2 3.4 1.9 Effectiveness (improvement in familiarity)
Slide 51
Average fraction of gap closed Better Base familiarity: 2.1 3.1
3.2 3.4 1.9 Effectiveness (improvement in familiarity)
Slide 52
Average fraction of gap closed Better Base familiarity: 2.1 3.1
3.2 3.4 1.9 Effectiveness (improvement in familiarity)
Slide 53
Average fraction of gap closed Better Base familiarity: 2.1 3.1
3.2 3.4 1.9 Effectiveness (improvement in familiarity)
Slide 54
Average fraction of gap closed Base familiarity: 2.1 3.1 3.2
3.4 1.9 Better We are better almost everywhere
Slide 55
Evaluation #2: Chain Quality Compare two chains for Coherence
Relevance Redundancy
Slide 56
Relevance, Coherence, Non- redundancy Across complex stories
Better CoherenceRelevanceNon-redundancy
Slide 57
Relevance, Coherence, Non- redundancy Across complex stories
Better CoherenceRelevanceNon-redundancy
Slide 58
Relevance, Coherence, Non- redundancy Across complex stories
Better CoherenceRelevanceNon-redundancy
Slide 59
Relevance, Coherence, Non- redundancy Across complex stories
Better CoherenceRelevanceNon-redundancy
Interaction Defense cross-examines state DNA expert With fiber
evidence, prosecution Simpson Defense Drops DNA Challenge Simpson
Verdict A day the country stood still In the joy of victory,
defense team in discord Many black officers say bias Is rampant in
LA police force Racial split at the end Verdict Race Blood, glove
Algorithmic ideas from online learning
Slide 63
Interaction User Study Refinement 72% prefer chains refined our
way User Interests 63.3% able to identify intruders 2 correct words
out of 10
Slide 64
Conclusions Fight information overload Provide a structured,
easy way to navigate topics New task Explored desired properties LP
formalization Efficient algorithm Evaluated over real news data
Demonstrate effectiveness Interaction Complex information needs
Structured, annotated output
Slide 65
Since KDD New domains Research papers Email New questions Fixed
endpoints? No endpoints? New forms of output
Slide 66
Issue Maps 66
Slide 67
Issue Maps
Slide 68
machines cant have emotions is supported by concept of feeling
only applies to living organisms [Ziff 59] we can imagine artifacts
that have feelings [Smart 59] is disputed by Challenge: Build
automatically!
Slide 69
OJ Simpson Trial With Tale of Racism and Error, Simpson Lawyers
Seek Acquittal Simpson Defense Lawyers Unleash Sharp Assault on
Police Inquiry at Murder Scene 69
Slide 70
Conclusions Fight information overload Provide a structured,
easy way to navigate topics New task Explored desired properties LP
formalization Efficient algorithm Evaluated over real news data
Demonstrate effectiveness Interaction Complex information needs
Structured, annotated output