Connecting the Dots Between News Articles Dafna Shahaf and Carlos Guestrin

  • Slide 1
  • Connecting the Dots Between News Articles Dafna Shahaf and Carlos Guestrin
  • Slide 2
  • Slide 3
  • Information overload is everywhere
  • Slide 4
  • Well, we have Google
  • Slide 5
  • Search Limitations: Input, Output, Interaction (new query)
  • Slide 6
  • Our Approach: Input: phrase complex information needs. Output: structured, annotated output. Interaction: new query, plus richer forms of interaction.
  • Slide 7
  • Connecting the Dots: News Domain 3.19.2008
  • Slide 8
  • Input: Pick two articles (start, goal). Output: Bridge the gap with a smooth chain of articles. Example endpoints: Housing Bubble, Bailout.
  • Slide 9
  • Input: Pick two articles (start, goal). Output: Bridge the gap with a smooth chain of articles. Example chain from Housing Bubble to Bailout: Keeping Borrowers Afloat; A Mortgage Crisis Begins to Spiral,...; Investors Grow Wary of Bank's Reliance on Debt; Markets Can't Wait for Congress to Act; Bailout Plan Wins Approval.
  • Slide 10
  • Game Plan: What is a good chain? Formalize objective. Score a chain. Find a good chain.
  • Slide 11
  • What is a Good Chain? What's wrong with shortest-path? Build a graph: node for every article, edges based on similarity, chronological order (DAG). Run BFS from s to t.
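The shortest-path baseline on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: articles are assumed to be listed in chronological order, and `similar` is a hypothetical similarity test supplied by the caller.

```python
from collections import deque

def build_dag(articles, similar):
    """Edges only go forward in time, so the graph is a DAG.

    articles: list in chronological order (assumption).
    similar:  hypothetical predicate deciding whether two articles are linked.
    """
    edges = {i: [] for i in range(len(articles))}
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if similar(articles[i], articles[j]):
                edges[i].append(j)
    return edges

def shortest_chain(edges, s, t):
    """Breadth-first search from s; returns the node indices of a shortest s-t path."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in edges[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None  # t is not reachable from s

# Toy articles represented as word sets (made-up data).
toy = [
    {"clinton", "lewinsky"},   # 0: start
    {"clinton", "microsoft"},  # 1
    {"microsoft", "bonds"},    # 2
    {"bonds", "gore"},         # 3: goal
]
toy_edges = build_dag(toy, lambda a, b: len(a & b) > 0)
toy_chain = shortest_chain(toy_edges, 0, 3)
```

In the toy data every hop shares a word with its neighbor, yet the topic drifts from Clinton to Microsoft to bonds to Gore, which is the failure mode the next slides illustrate.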
  • Slide 12
  • Shortest-path (Lewinsky to Florida recount): A1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down; A2: Judge Sides with the Government in Microsoft Antitrust Trial; A3: Who will be the Next Microsoft? (trading at a market capitalization); A4: Palestinians Planning to Offer Bonds on Euro. Markets; A5: Clinton Watches as Palestinians Vote to Rescind 1964 Provision; A6: Contesting the Vote: The Overview; Gore asks Public For Patience
  • Slide 13
  • Shortest-path: A1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down; A2: Judge Sides with the Government in Microsoft Antitrust Trial; A3: Who will be the Next Microsoft? (trading at a market capitalization); A4: Palestinians Planning to Offer Bonds on Euro. Markets; A5: Clinton Watches as Palestinians Vote to Rescind 1964 Provision; A6: Contesting the Vote: The Overview; Gore asks Public For Patience
  • Slide 14
  • Shortest-path: A1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down; A2: Judge Sides with the Government in Microsoft Antitrust Trial; A3: Who will be the Next Microsoft? (trading at a market capitalization); A4: Palestinians Planning to Offer Bonds on Euro. Markets; A5: Clinton Watches as Palestinians Vote to Rescind 1964 Provision; A6: Contesting the Vote: The Overview; Gore asks Public For Patience. Stream of consciousness? Each transition is strong, but there is no global theme.
  • Slide 15
  • More-Coherent Chain (Lewinsky to Florida recount): B1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down; B2: Clinton Admits Lewinsky Liaison to Jury; B3: G.O.P. Vote Counter in House Predicts Impeachment of Clinton; B4: Clinton Impeached; He Faces a Senate Trial; B5: Clinton's Acquittal; Senators Talk About Their Votes; B6: Aides Say Clinton Is Angered As Gore Tries to Break Away; B7: As Election Draws Near, the Race Turns Mean; B8: Contesting the Vote: The Overview; Gore asks Public For Patience
  • Slide 16
  • More-Coherent Chain: B1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down; B2: Clinton Admits Lewinsky Liaison to Jury; B3: G.O.P. Vote Counter in House Predicts Impeachment of Clinton; B4: Clinton Impeached; He Faces a Senate Trial; B5: Clinton's Acquittal; Senators Talk About Their Votes; B6: Aides Say Clinton Is Angered As Gore Tries to Break Away; B7: As Election Draws Near, the Race Turns Mean; B8: Contesting the Vote: The Overview; Gore asks Public For Patience. What makes it coherent?
  • Slide 17
  • Word Patterns for Shortest-Path Chain: Topic changes at every transition (jittery)
  • Slide 18
  • For Coherent Chain: Topic is consistent over transitions. Use this intuition to estimate coherence of chains.
  • Slide 19
  • What is a Good Chain? Every transition is strong; global theme; no jitteriness (back-and-forth); short (5-6 articles?)
  • Slide 21
  • Strong transitions between consecutive documents. [Figure: documents d1, d2, d3, d4 linked to words w1: Lewinsky, w2: Clinton, w3: Oath, w4: Intern, w5: Microsoft; transition strengths 4, 3, 1; min(4,3,1)=1]
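A minimal sketch of this first scoring attempt, assuming a transition's strength is the number of words its two documents share and the chain score is the weakest transition. The word assignments below are made up so that the strengths come out as the 4, 3, 1 shown on the slide.

```python
def transition_strength(doc_a, doc_b):
    """Number of words two consecutive documents have in common."""
    return len(set(doc_a) & set(doc_b))

def naive_coherence(chain):
    """Score a chain by its weakest transition."""
    return min(transition_strength(a, b) for a, b in zip(chain, chain[1:]))

# Hypothetical word sets for d1..d4 (not taken from the slide figure).
d1 = {"lewinsky", "clinton", "oath", "intern"}
d2 = {"lewinsky", "clinton", "oath", "intern"}
d3 = {"clinton", "oath", "intern", "microsoft"}
d4 = {"microsoft"}
toy_chain = [d1, d2, d3, d4]

strengths = [transition_strength(a, b) for a, b in zip(toy_chain, toy_chain[1:])]
score = naive_coherence(toy_chain)
```

Counting shared words treats every word as equally important, which is the coarseness the following slides address.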
  • Slide 22
  • Strong transitions between consecutive documents: min(4,3,1)=1 is too coarse. It ignores word importance in the transition and missing words. Replace with ???: intuitively, high iff d_i and d_i+1 are very related and w plays an important role in the relationship.
  • Slide 23
  • Influence: replaces the word count in min(4,3,1)=1. Intuitively, high iff d_i and d_i+1 are very related and w plays an important role in the relationship. Most methods assume edges: influence propagates through the edges. No edges in our dataset.
  • Slide 24
  • Computing Influence(d_i, d_j | w). [Figure: graph linking documents (Clinton Admits Lewinsky; Contest the Vote; Judge Sides with the Govmnt; The Next Microsoft) to words (Clinton, Judge, Microsoft, Gore), with d_i, d_j, and w marked]
  • Slide 25
  • Computing Influence(d_i, d_j | w). [Figure: graph linking documents (Clinton Admits Lewinsky; Contest the Vote; Judge Sides with the Govmnt; The Next Microsoft) to words (Clinton, Judge, Microsoft, Gore), with d_i, d_j, and w marked] 1. Run random walks: random restarts from d_i (the restart probability controls expected length)
  • Slide 27
  • Computing Influence(d_i, d_j | w). [Figure: graph linking documents (Clinton Admits Lewinsky; Contest the Vote; Judge Sides with the Govmnt; The Next Microsoft) to words (Clinton, Judge, Microsoft, Gore), with d_i, d_j, and w marked] Calculate stationary distribution of d_j: intuitively, high if documents are related. How important is w? Check how many walks went through w.
  • Slide 28
  • Computing Influence(d_i, d_j | w). [Figure: graph linking documents (Clinton Admits Lewinsky; Contest the Vote; Judge Sides with the Govmnt; The Next Microsoft) to words (Clinton, Judge, Microsoft, Gore), with d_i, d_j, and w marked] In the example, d_j is no longer reachable without w: all influence is due to w. 2. Influence(d_i, d_j | w) = Stationary distribution(d_j) with w - Stationary distribution(d_j) without w
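The two steps above can be sketched with a random walk with restart on a document-word graph. This is an illustration under stated assumptions, not the authors' implementation: edges are weighted uniformly, the restart probability of 0.3 is arbitrary, "without w" is modeled by turning w into a sink that forwards no probability, and the toy documents are made up. The names `rwr_scores` and `influence` are hypothetical.

```python
def rwr_scores(doc_words, start, restart=0.3, sink_word=None, iters=200):
    """Fixed point of a random walk with restart from `start`.

    doc_words: dict mapping each document to a non-empty list of its words.
    sink_word: if given, walks that reach this word stop there.
    """
    word_docs = {}
    for d, ws in doc_words.items():
        for w in ws:
            word_docs.setdefault(w, []).append(d)
    score = {("doc", d): 0.0 for d in doc_words}
    score.update({("word", w): 0.0 for w in word_docs})
    score[("doc", start)] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(score, 0.0)
        nxt[("doc", start)] = restart  # mass returned by random restarts
        for d, ws in doc_words.items():
            share = (1 - restart) * score[("doc", d)] / len(ws)
            for w in ws:
                nxt[("word", w)] += share
        for w, ds in word_docs.items():
            if w == sink_word:
                continue  # a sink passes nothing on to documents
            share = (1 - restart) * score[("word", w)] / len(ds)
            for d in ds:
                nxt[("doc", d)] += share
        score = nxt
    return score

def influence(doc_words, d_i, d_j, w, restart=0.3):
    """Score of d_j with w present minus score of d_j with w as a sink."""
    with_w = rwr_scores(doc_words, d_i, restart)[("doc", d_j)]
    without_w = rwr_scores(doc_words, d_i, restart, sink_word=w)[("doc", d_j)]
    return with_w - without_w

# Made-up toy graph loosely echoing the slide's labels.
toy = {
    "Clinton Admits Lewinsky": ["clinton", "lewinsky"],
    "Contest the Vote": ["clinton", "gore"],
    "The Next Microsoft": ["microsoft"],
}
via_clinton = influence(toy, "Clinton Admits Lewinsky", "Contest the Vote", "clinton")
via_microsoft = influence(toy, "Clinton Admits Lewinsky", "Contest the Vote", "microsoft")
```

In this toy graph "clinton" is the only route from the first document to the second, so removing it leaves d_j unreachable and the full score is attributed to that word, while "microsoft" contributes nothing.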
  • Slide 29
  • Influence: Reality Check. d_i: OJ Simpson trial article; d_j: DNA evidence in OJ trial; d_j: Super Bowl 49ers
  • Slide 30
  • Coherence formulation: No edges. Influence is computed using random walks.
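The formula on this slide did not survive extraction. One reading consistent with the preceding slides, offered as an assumption rather than a quotation, swaps the shared-word count for summed influence and keeps the minimum over transitions:

```latex
\mathrm{Coherence}(d_1, \ldots, d_n)
  = \min_{i = 1, \ldots, n-1} \; \sum_{w} \mathrm{Influence}(d_i, d_{i+1} \mid w)
```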
  • Slide 31
  • What is a Good Chain? Every transition is strong; global theme; no jitteriness (back-and-forth); short (5-6 articles?)
  • Slide 32
  • Global Theme, No Jitter: A jittery chain can score well! But it needs a lot of words. Good chains can often be represented by a small number of segments.
  • Slide 33
  • Global Theme, No Jitter: Choose 3 segments to be scored on. [Figure labels: Good score; Score = 0]
  • Slide 34
  • Coherence: New Objective. Maximize over legal activations: limit total number of active words; limit number of words per transition; each word to be activated at most once.
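The slide's equation is missing from the transcript. A sketch of an objective matching the three bullet constraints, with the indicator notation being an assumption, is:

```latex
\mathrm{Coherence}(d_1, \ldots, d_n)
  = \max_{\text{legal activations}} \; \min_{i = 1, \ldots, n-1} \;
    \sum_{w} \mathrm{Influence}(d_i, d_{i+1} \mid w)\,
    \mathbb{1}\bigl(w \text{ active in } d_i, d_{i+1}\bigr)
```

Here an activation is "legal" when it respects the limits listed on the slide: a cap on the total number of active words, a cap per transition, and at most one activation per word.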
  • Slide 35
  • Game Plan: What is a good chain? Formalize objective. Score a chain. Find a good chain.
  • Slide 36
  • Scoring a Chain: Problem is NP-Complete. Softer notion of activation: [0,1]. Natural formalization as a linear program (LP).
  • Slide 37
  • LP: Objective. [Equation annotation: Pre-computed]
  • Slide 38
  • LP: Smoothness. A word is active if it was either active before or just initialized. Each word is initialized at most once.
  • Slide 39
  • LP: Activation. Limit #words. Limit #words per transition.
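The LP equations from slides 37-39 are not in the transcript. The block below is a reconstruction sketch from the slide text only; the variable names (active, init, minedge) and the caps k_total and k_trans are assumptions, and Influence values are treated as pre-computed constants.

```latex
\begin{aligned}
\max\ & \mathit{minedge} \\
\text{s.t.}\ 
& \mathit{minedge} \le \sum_{w} \mathit{active}_{w,i}\,
    \mathrm{Influence}(d_i, d_{i+1} \mid w) && \forall i
    && \text{(objective)} \\
& \mathit{active}_{w,i} \le \mathit{active}_{w,i-1} + \mathit{init}_{w,i}
    && \forall w, i && \text{(smoothness)} \\
& \sum_{i} \mathit{init}_{w,i} \le 1 && \forall w
    && \text{(initialized at most once)} \\
& \sum_{w} \sum_{i} \mathit{init}_{w,i} \le k_{\mathrm{total}}
    && && \text{(limit \#words)} \\
& \sum_{w} \mathit{active}_{w,i} \le k_{\mathrm{trans}} && \forall i
    && \text{(limit \#words per transition)} \\
& \mathit{active}_{w,i},\ \mathit{init}_{w,i} \in [0, 1],
    \quad \mathit{active}_{w,0} = 0
\end{aligned}
```

Maximizing minedge subject to the first constraint is the standard way to express a max-min objective linearly.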
  • Slide 40
  • Scoring a chain: Example, September 11th to Daniel Pearl. [Figure: activation levels weighted by influence (rescaled)]
  • Slide 41
  • Game Plan: What is a good chain? Formalize objective. Score a chain. Find a good chain.
  • Slide 42
  • Finding a good chain: Can't brute-force: n^d possible chains, >>10^20 after pruning. Joint LP: optimize activation and chain. New variables: Is document d_i a part of the chain? Does document d_j come after d_i in the chain? New constraints: chain structure; length = K. [Figure: nodes s, 2, 3, t with variables next(s,2), next(s,3), next(s,t)]
  • Slide 43
  • Rounding: Unlike the previous LP, we need to round to extract a chain. Approximation guarantees: chain length K in expectation; objective: O(sqrt(ln(n/δ))) with probability 1-δ. [Figure: nodes s, 2, 3, t with fractional edge values 0.7, 0.3, 0.6, 0.1, 0.9]
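To give a feel for extracting a chain from fractional edge values, here is a small sketch that samples each successor in proportion to its LP value. It is plain proportional sampling chosen for illustration; the slide does not spell out the rounding scheme, so this should not be read as the authors' procedure, and it carries none of the stated guarantees. The edge values are made up.

```python
import random

def round_chain(next_frac, s, t, rng):
    """Walk from s to t, picking each successor with probability
    proportional to its fractional value.

    next_frac: dict node -> dict of successor -> fractional value.
    Edges are assumed to go forward in time, so the walk terminates.
    """
    chain = [s]
    node = s
    while node != t:
        successors = list(next_frac[node])
        weights = [next_frac[node][v] for v in successors]
        node = rng.choices(successors, weights=weights, k=1)[0]
        chain.append(node)
    return chain

# Hypothetical fractional solution over nodes s, 2, 3, t.
toy_next = {
    "s": {"2": 0.7, "3": 0.3},
    "2": {"3": 0.6, "t": 0.1},
    "3": {"t": 0.9},
}
sampled = round_chain(toy_next, "s", "t", random.Random(0))
```

Every sampled chain starts at s, ends at t, and follows only edges with positive fractional value; repeated sampling would favor the heavier edges.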
  • Slide 44
  • Scaling Up: The LP has a polynomial number of variables, but D is large. Remedies: restricting the number of documents; sparsifying the graph; random walks.
  • Slide 45
  • Game Plan: What is a good chain? Formalize objective. Score a chain. Find a good chain. How good is it?
  • Slide 46
  • Evaluation: Competitors. Shortest path. Google Timeline: enter a query, pick k equally-spaced articles. Event threading (TDT) [Nallapati et al 04]: generate cluster graph, take representative articles from clusters.
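The "pick k equally-spaced articles" step of the timeline baseline can be sketched as below. It only illustrates the selection step on an already-ordered result list; how the results are retrieved and ordered is outside the sketch, and the function name is hypothetical.

```python
def equally_spaced(items, k):
    """Return k items spread evenly over an ordered list, endpoints included."""
    k = min(k, len(items))
    if k <= 0:
        return []
    if k == 1:
        return [items[0]]
    step = (len(items) - 1) / (k - 1)
    return [items[round(i * step)] for i in range(k)]

picked = equally_spaced(list(range(10)), 4)
```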
  • Slide 47
  • Example Chain (1): Google News Timeline, Simpson trial to Simpson verdict. Simpson Strategy: There Were Several Killers; O.J. Simpson's book deal controversy; CNN OJ Simpson Trial News: April Transcripts; Tandoori murder case a rival for OJ Simpson case
  • Slide 48
  • Example Chain (2): Connect-the-Dots, Simpson trial to Simpson verdict. Issue of Racism Erupts in Simpson Trial; Ex-Detective's Tapes Fan Racial Tensions in LA; Many Black Officers Say Bias Is Rampant in LA Police Force; With Tale of Racism and Error, Lawyers Seek Acquittal
  • Slide 49
  • Evaluation #1: Familiarity. 18 users, 5 news stories. Show two articles and ask, before and after reading the chain: Do you know a coherent story linking these articles?
  • Slide 50
  • Effectiveness (improvement in familiarity). [Figure: average fraction of gap closed, higher is better; base familiarity: 2.1, 3.1, 3.2, 3.4, 1.9]
  • Slide 54
  • Effectiveness (improvement in familiarity). [Figure: average fraction of gap closed, higher is better; base familiarity: 2.1, 3.1, 3.2, 3.4, 1.9] We are better almost everywhere.
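The slides do not define "fraction of gap closed". One plausible reading, stated here as an assumption, is the share of the distance between a user's initial familiarity and the top of the rating scale that was covered after reading the chain. The 5-point scale maximum is also an assumption.

```python
def gap_closed(before, after, scale_max=5.0):
    """Fraction of the remaining familiarity gap closed (assumed definition)."""
    gap = scale_max - before
    if gap <= 0:
        return 0.0  # already at the top of the scale, nothing to close
    return (after - before) / gap

example = gap_closed(3, 4)
```

Under this reading, moving from 3 to 4 on a 5-point scale closes half of the gap.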
  • Slide 55
  • Evaluation #2: Chain Quality. Compare two chains for coherence, relevance, redundancy.
  • Slide 56
  • Relevance, Coherence, Non-redundancy across complex stories. [Figure: results for Coherence, Relevance, Non-redundancy; higher is better]
  • Slide 60
  • What's left? Interaction. [Figure: Two documents; Chain]
  • Slide 61
  • Interaction Types: 1. Refinement; 2. User interests. [Figure: d1, d2, d3, d4, ???]
  • Slide 62
  • Interaction. Articles: Defense cross-examines state DNA expert; With fiber evidence, prosecution; Simpson Defense Drops DNA Challenge; Simpson Verdict; A day the country stood still; In the joy of victory, defense team in discord; Many black officers say bias Is rampant in LA police force; Racial split at the end. Words: Verdict; Race; Blood, glove. Algorithmic ideas from online learning.
  • Slide 63
  • Interaction User Study. Refinement: 72% prefer chains refined our way. User Interests: 63.3% able to identify intruders; 2 correct words out of 10.
  • Slide 64
  • Conclusions. Fight information overload: provide a structured, easy way to navigate topics. New task: explored desired properties; LP formalization; efficient algorithm. Evaluated over real news data: demonstrate effectiveness. Interaction: complex information needs; structured, annotated output.
  • Slide 65
  • Since KDD. New domains: research papers, email. New questions: fixed endpoints? No endpoints? New forms of output.
  • Slide 66
  • Issue Maps
  • Slide 67
  • Issue Maps
  • Slide 68
  • Claim: "machines can't have emotions". Is supported by: "concept of feeling only applies to living organisms" [Ziff 59]. Is disputed by: "we can imagine artifacts that have feelings" [Smart 59]. Challenge: Build automatically!
  • Slide 69
  • OJ Simpson Trial: With Tale of Racism and Error, Simpson Lawyers Seek Acquittal; Simpson Defense Lawyers Unleash Sharp Assault on Police Inquiry at Murder Scene
  • Slide 70
  • Conclusions. Fight information overload: provide a structured, easy way to navigate topics. New task: explored desired properties; LP formalization; efficient algorithm. Evaluated over real news data: demonstrate effectiveness. Interaction: complex information needs; structured, annotated output.