
Modeling Document Dynamics: An Evolutionary Approach

Jahna Otterbacher, Dragomir Radev
Computational Linguistics And Information Retrieval (CLAIR)
{jahna, radev} @ umich.edu

What are dynamic texts?

• Sets of topically related documents (news stories, Web pages, etc.)

• Multiple sources

• Written/published at different points in time
  – may change over time

• Challenging features:
  – Paraphrases
  – Contradictions
  – Incorrect/biased information

Milan plane crash: April 18, 2002

04/18/02 13:17 (CNN)The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.

04/18/02 13:42 (ABCNews)The plane was destined for Italy’s capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.

04/18/02 13:42 (CNN)The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.

04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan’s Linate airport, De Simone said.

Problem for IR systems

• User poses a question or query to a system
  – Known facts change at different points in time
  – Sources contradict one another
  – Many paraphrases – similar, but not necessarily equivalent, information

• What is the “correct” information? What should be returned to the user?

Current Goals

• Propose that dynamic texts “evolve” over time

• Chronology recovery task

• Approaches
  – Phylogenetics: reconstruct the history of a set of species based on DNA
  – Language modeling: an LM constructed from the first document should fit less well over time

Phylogenetic models

• [Fitch & Margoliash, 67]

• Given a set of species and information about their DNA, construct a tree that describes how they are related, w.r.t. a common ancestor

• Statistically optimal tree minimizes the deviation between the original distances and those represented in the tree

Distance matrix (dog, wolf, bear):

      D    W    B
  D   0    4   56
  W   4    0   44
  B  56   44    0

[Figure: candidate tree relating dog, wolf and bear, with branch lengths]
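A minimal sketch (Python, not part of the original slides) of the Fitch–Margoliash idea described above: score a candidate tree by the deviation between the observed pairwise distances and the leaf-to-leaf distances implied by its branch lengths. The hard-coded topology and branch lengths below are illustrative assumptions.

```python
# Minimal sketch of the Fitch-Margoliash least-squares criterion (illustrative only).
# A candidate tree is scored by the percent deviation between observed distances
# and the leaf-to-leaf path distances implied by assumed branch lengths.

import math

# Observed pairwise distances (dog/wolf/bear example).
observed = {("dog", "wolf"): 4, ("dog", "bear"): 56, ("wolf", "bear"): 44}

# Assumed branch lengths of one candidate tree:
# dog and wolf hang off internal node X; bear hangs off internal node Y; X-Y is an edge.
branch = {("dog", "X"): 2, ("wolf", "X"): 2, ("bear", "Y"): 30, ("X", "Y"): 20}

def path_length(a, b):
    """Leaf-to-leaf distance implied by the (hard-coded) tree topology."""
    if {a, b} == {"dog", "wolf"}:
        return branch[("dog", "X")] + branch[("wolf", "X")]
    # Any pair involving bear crosses the internal X-Y edge.
    leaf = a if b == "bear" else b
    return branch[(leaf, "X")] + branch[("X", "Y")] + branch[("bear", "Y")]

def fm_score():
    """Sum of squared relative deviations, reported as a percent standard deviation."""
    total = 0.0
    for (a, b), d_obs in observed.items():
        d_tree = path_length(a, b)
        total += ((d_obs - d_tree) / d_obs) ** 2
    return math.sqrt(total / len(observed)) * 100

print(f"FM deviation for this candidate tree: {fm_score():.1f}%")
```

The statistically optimal tree is the one whose branch lengths (and topology) minimize this deviation; real tools search over many candidate trees rather than scoring a single hard-coded one.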

Phylogenetic models (2)

• History of chain letters [Bennett&al,03]

– “Genes” were facts in the letters:
  • Names/titles of people
  • Dates
  • Threats to those who don’t send the letter on

– Distance metric was the amount of shared information between two chain letters

– Used Fitch/Margoliash method to construct trees

• Result: An almost perfect phylogeny. Letters that were close to one another in the tree shared similar dates, “genes” and even geographical properties.
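As an illustration (not from the study), one simple way to realize a “shared information” distance between two letters is one minus the Jaccard overlap of their extracted “genes”; the gene sets below are hypothetical placeholders.

```python
# Illustrative sketch: a distance based on shared "genes" (facts) between two chain letters.
# The gene sets are hypothetical placeholders, not data from the study.

def shared_info_distance(genes_a: set, genes_b: set) -> float:
    """1 - Jaccard overlap: 0 means identical gene sets, 1 means nothing shared."""
    if not genes_a and not genes_b:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(genes_a | genes_b)

letter_1 = {"name:Gen. Patton", "date:1983", "threat:bad luck in 7 days"}
letter_2 = {"name:Gen. Patton", "date:1990", "threat:bad luck in 7 days"}

print(shared_info_distance(letter_1, letter_2))  # 0.5: two of four distinct genes shared
```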

Procedure: Phylogenetics

• For each document cluster and representation, generate a phylogenetic tree using Fitch [Felsenstein, 95]

– Representations: full document, extractive summaries
– Generate the Levenshtein distance matrix
– Input the matrix into Fitch to obtain an unrooted tree

• “Reroot” the unrooted tree at the first document in the cluster.

• To obtain the chronological ordering, traverse the rerooted tree.

• Assign chronological ranks, starting with ‘1’ for the root.

[Figures: the unrooted tree produced by Fitch for documents S1–S4 (with per-node distances), and the same tree rerooted at the first document S1, shown against the timeline S1, S2, S3, S4]
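A minimal sketch (Python, assumptions noted in the comments) of the distance step and a simplified chronology recovery: pairwise Levenshtein distances are computed directly, and instead of running the external Fitch program and traversing the rerooted tree, documents are ranked by their distance from the earliest one.

```python
# Sketch of the distance step: pairwise Levenshtein distances between documents,
# plus a simplified chronology-recovery stand-in (ordering by distance from the
# earliest document rather than traversing the Fitch tree). Illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def distance_matrix(docs):
    n = len(docs)
    return [[levenshtein(docs[i], docs[j]) for j in range(n)] for i in range(n)]

def recover_chronology(docs):
    """Rank documents by edit distance from the first (earliest) document.
    A crude stand-in for rerooting the tree at document 1 and traversing it."""
    d = distance_matrix(docs)
    order = sorted(range(len(docs)), key=lambda i: d[0][i])
    return {doc_index: rank + 1 for rank, doc_index in enumerate(order)}

docs = ["the plane crashed into the tower",
        "the plane crashed into the 26th floor of the tower",
        "officials say the small plane crashed into the 26th floor of the Pirelli tower"]
print(recover_chronology(docs))  # e.g. {0: 1, 1: 2, 2: 3}
```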

Procedure: LM Approach

• Inspiration: document ranking for IR
  – If a candidate document’s LM assigns high probability to the query, the document is likely relevant [Ponte & Croft, 98]

• Create an LM from the earliest document
  – Trigram backoff model using the CMU-Cambridge toolkit [Clarkson & Rosenfeld, 97]

• Evaluate it on the remaining documents
  – Use fit to rank them: OOV rates (increasing), trigram-hit ratios (decreasing) and unigram-hit ratios (increasing)
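A minimal sketch (Python) of the ranking idea, using only the OOV rate with a plain vocabulary check in place of the CMU-Cambridge trigram backoff model; the documents are made-up examples.

```python
# Simplified stand-in for the LM approach: build a vocabulary from the earliest
# document and rank the remaining documents by their OOV rate (higher OOV rate
# = worse fit = assumed to be later). The talk used a trigram backoff model.

def tokenize(text: str):
    return text.lower().split()

def oov_rate(doc: str, vocabulary: set) -> float:
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)

def rank_by_oov(earliest: str, later_docs):
    vocab = set(tokenize(earliest))
    ranked = sorted(later_docs, key=lambda d: oov_rate(d, vocab))
    return [(round(oov_rate(doc, vocab), 2), doc) for doc in ranked]

earliest = "the plane crashed into the pirelli building in milan"
later = ["rescue workers searched the pirelli building after the plane crashed",
         "investigators later ruled out terrorism in the milan crash"]
for rate, doc in rank_by_oov(earliest, later):
    print(rate, doc)
```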

Evaluation

• Metric: Kendall’s rank-order correlation coefficient (Kendall’s τ) [Siegel & Castellan, 88]
  – -1 ≤ τ ≤ 1
  – Expresses the extent to which the chronological rankings assigned by the algorithm agree with the actual rankings

• Randomly assigned rankings have, on average, τ = 0.
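For instance, Kendall’s τ between a recovered ordering and the true chronological order can be computed with scipy.stats.kendalltau; the rankings below are made-up.

```python
# Compare an algorithm's chronological ranking against the true one with Kendall's tau.
# The rankings below are made-up examples.

from scipy.stats import kendalltau

true_ranks      = [1, 2, 3, 4, 5]   # actual publication order
predicted_ranks = [1, 3, 2, 4, 5]   # order recovered by the algorithm

tau, p_value = kendalltau(true_ranks, predicted_ranks)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")  # tau = 0.80 for this example
```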

Dataset

• 36 document sets
  – Manually collected (6)
  – NewsInEssence clusters (3)
  – TREC Novelty clusters (27) [Soboroff & Harman, 03]

• 15 training, 6 dev/test, 15 test

• Example topics

Story                                #Doc.   Time Span   #Sources
Milan plane crash                    56      1.5 days    5
N33  Russian submarine Kursk sinks   25      1 month     3
N48  Human genome decoded            25      2 years     3

Training Phase

Method          Median τ   # Significant (α = 0.10)
Full document   0.16       8/15
Summ-1          0.13       6/15
Summ-5          0.17       6/15
3-gram hit      0.17       7/15
1-gram hit      0.21       11/15
OOV             0.28       13/15

Training Phase (2)

Novelty Training Clusters

Method    Median τ   # Sig.
Summ-5    0.05       3/11
1-gram    0.20       8/11
OOV       0.19       8/11

Manual Training Clusters

Method    Median τ   # Sig.
Summ-5    0.32       3/3
1-gram    0.42       3/3
OOV       0.26       3/3

Test Phase (15 clusters)

Method       Median τ   # Significant
Summ-5       0.15       5/15
1-gram hit   0.14       6/15
OOV          0.22       9/15

Manual Clusters

Story                    OOV    Summ-5
Gulfair plane crash      0.37   0.39
Honduras bus hijacking   0.12   0.17
Columbia shuttle         0.56   0.48
Milan plane crash        0.26   0.33
RI nightclub fire        0.58   0.32
Iraq bombing             0.24   0.17
Median                   0.31   0.33
# Significant            5/6    6/6

Conclusions

• Over all clusters, LM approach based on OOV had best performance

• LM and phylogenetic models had similar performance on manual clusters
  – These have more salient “evolutionary” properties

Future work

• Tracking facts in multiple news stories over time

• Produce a timeline of known facts

• Determine if the facts have settled at each time

Time             Reported origin                           Source                                  Settled?
04/18/02 13:17   Locarno, Switzerland                      Journalist Desidera Cavvina told CNN    No
04/18/02 13:42   Locarno, Switzerland or Sofia, Bulgaria   ABC                                     No
