Jahna Otterbacher, Dragomir Radev Computational Linguistics And Information Retrieval (CLAIR) {jahna, radev} @ umich.edu Modeling Document Dynamics: An Evolutionary Approach

Page 1:

Jahna Otterbacher, Dragomir Radev Computational Linguistics And Information Retrieval (CLAIR)

{jahna, radev} @ umich.edu

Modeling Document Dynamics: An Evolutionary Approach

Page 2:

What are dynamic texts?

• Sets of topically related documents (news stories, Web pages, etc.)
• Multiple sources
• Written/published at different points in time; may change over time
• Challenging features:
  – Paraphrases
  – Contradictions
  – Incorrect/biased information

Page 3:

Milan plane crash: April 18, 2002

04/18/02 13:17 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.

04/18/02 13:42 (ABCNews) The plane was destined for Italy’s capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.

04/18/02 13:42 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.

04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan’s Linate airport, De Simone said.

Page 4:

Problem for IR systems

• User poses a question or query to a system
  – Known facts change at different points in time
  – Sources contradict one another
  – Many paraphrases: similar, but not necessarily equivalent, information
• What is the “correct” information? What should be returned to the user?

Page 5:

Current Goals

• Propose that dynamic texts “evolve” over time
• Chronology recovery task
• Approaches
  – Phylogenetics: reconstruct the history of a set of species based on DNA
  – Language modeling: an LM constructed from the first document should fit later documents less well over time

Page 6:

Phylogenetic models

• [Fitch & Margoliash, 67]
• Given a set of species and information about their DNA, construct a tree that describes how they are related with respect to a common ancestor
• The statistically optimal tree minimizes the deviation between the original distances and those represented in the tree

Distance matrix

     D    W    B
D    0    4   56
W    4    0   44
B   56   44    0

Candidate tree

[Figure: candidate tree over dog, wolf, and bear with internal nodes 1 and 2; edge labels 24, 24, 20, 22]
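The "deviation" that the optimal tree minimizes can be sketched as a weighted sum of squared differences between observed distances and path lengths in a candidate tree. The slide does not state the weighting; the 1/d² weight below is the one usually associated with the Fitch-Margoliash criterion and is an assumption here, as are the function names and the toy data.

```python
from itertools import combinations


def leaf_path_lengths(edges, leaves):
    """Return path lengths between every pair of leaves in a weighted tree."""
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    def distances_from(source):
        # Depth-first walk that accumulates branch lengths from the source.
        dist = {source: 0.0}
        stack = [source]
        while stack:
            node = stack.pop()
            for neighbor, length in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + length
                    stack.append(neighbor)
        return dist

    return {(a, b): distances_from(a)[b] for a, b in combinations(leaves, 2)}


def fm_deviation(observed, edges, leaves):
    """Weighted squared deviation between observed and tree distances."""
    tree = leaf_path_lengths(edges, leaves)
    return sum((observed[pair] - tree[pair]) ** 2 / observed[pair] ** 2
               for pair in tree)


# Hypothetical toy data (not from the slides): a star tree that fits exactly.
leaves = ["A", "B", "C"]
edges = {("x", "A"): 1.0, ("x", "B"): 2.0, ("x", "C"): 3.0}
observed = {("A", "B"): 3.0, ("A", "C"): 4.0, ("B", "C"): 5.0}
print(fm_deviation(observed, edges, leaves))  # 0.0 for a perfect fit
```

A tree-building program searches over topologies and branch lengths for the lowest value of this score; the sketch only evaluates one candidate.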

Page 7:

Phylogenetic models (2)

• History of chain letters [Bennett et al., 03]
  – “Genes” were facts in the letters:
    • Names/titles of people
    • Dates
    • Threats to those who don’t send the letter on
  – Distance metric was the amount of shared information between two chain letters
  – Used the Fitch/Margoliash method to construct trees
• Result: An almost perfect phylogeny. Letters that were close to one another in the tree shared similar dates, “genes” and even geographical properties.

Page 8:

Procedure: Phylogenetics

• For each document cluster and representation, generate a phylogenetic tree using Fitch [Felsenstein, 95]
  – Representations: full document, extractive summaries
  – Generate the Levenshtein distance matrix
  – Input the matrix into Fitch to obtain an unrooted tree
• “Reroot” the unrooted tree at the first document in the cluster.
• To obtain the chronological ordering, traverse the rerooted tree.
• Assign chronological ranks, starting with ‘1’ for the root.
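The distance-matrix step can be sketched as follows. The slides do not say whether edit distance was computed over characters or words; this sketch assumes whitespace-separated word tokens, and the function names and sample sentences are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute = 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(documents):
    """Symmetric matrix of pairwise edit distances over word tokens."""
    tokens = [doc.split() for doc in documents]
    n = len(tokens)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(tokens[i], tokens[j])
    return matrix


# Hypothetical sentences standing in for documents or summaries.
docs = [
    "the plane left Locarno",
    "the plane left Sofia",
    "a plane left Locarno for Rome",
]
for row in distance_matrix(docs):
    print(row)
```

The resulting matrix is the kind of input a distance-based tree builder such as Fitch expects; calling that program is outside the scope of this sketch.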

Page 9:

Unrooted tree

[Figure: unrooted tree. Node labels: 1 (d=0), S1 (d=3.5), S2 (d=6.5), 2 (d=8.5), S3 (d=0), S4 (d=1). Time axis t: S1, S2, S3, S4]

Page 10:

Rerooted tree

[Figure: rerooted tree. Node labels: S1 (d=0), 1 (d=3.5), S2 (d=10), 2 (d=12), S3 (d=12), S4 (d=13). Time axis t: S1, S2, S3, S4]
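The rerooting and ranking steps can be sketched with the numbers from the two tree slides. Two things here are assumptions rather than facts from the slides: the topology (internal node 1 joined to S1, S2, and internal node 2; node 2 joined to S3 and S4), which is inferred because it reproduces the rerooted distances, and the rule of ranking documents by path distance from the root.

```python
def rank_from_root(edges, root, documents):
    """Rank documents by path distance from the chosen root (1 = root)."""
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    # Depth-first traversal accumulating branch lengths from the root.
    dist = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbor, length in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)

    ordered = sorted(documents, key=lambda doc: dist[doc])
    return {doc: rank for rank, doc in enumerate(ordered, start=1)}, dist


# Branch lengths read off the unrooted-tree slide; the topology is assumed.
edges = {
    ("1", "S1"): 3.5,
    ("1", "S2"): 6.5,
    ("1", "2"): 8.5,
    ("2", "S3"): 0.0,
    ("2", "S4"): 1.0,
}
ranks, dist = rank_from_root(edges, "S1", ["S1", "S2", "S3", "S4"])
print(ranks)  # {'S1': 1, 'S2': 2, 'S3': 3, 'S4': 4}
```

Under these assumptions, the distances from S1 come out as 10 for S2, 12 for S3, and 13 for S4, matching the labels on the rerooted tree.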

Page 11:

Procedure: LM Approach

• Inspiration: document ranking for IR
  – A candidate document is judged relevant if its LM assigns high probability to the query [Ponte & Croft, 98]
• Create an LM from the earliest document
  – Trigram backoff model built with the CMU-Cambridge toolkit [Clarkson & Rosenfeld, 97]
• Evaluate it on the remaining documents
  – Use the fit to rank them: OOV rates (increasing), trigram-hit ratios (decreasing), and unigram-hit ratios (increasing)
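The OOV-based ranking can be illustrated without the toolkit. This is a minimal sketch, assuming lowercased whitespace tokens and a vocabulary taken directly from the earliest document; it does not reproduce the trigram backoff model, and the function names and sample texts are hypothetical.

```python
def oov_rate(vocabulary, tokens):
    """Fraction of tokens that never occur in the reference vocabulary."""
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


def rank_by_oov(first_document, later_documents):
    """Order later documents by increasing OOV rate against the first one."""
    vocabulary = set(first_document.lower().split())
    scored = [(oov_rate(vocabulary, text.lower().split()), name)
              for name, text in later_documents.items()]
    return [name for _, name in sorted(scored)]


# Hypothetical texts: "b" reuses most of the first document, "c" does not.
first = "the plane hit the tower"
later = {
    "c": "officials in milan said the pilot reported trouble",
    "b": "the plane hit the tower in milan",
}
print(rank_by_oov(first, later))  # ['b', 'c']
```

Documents with a low OOV rate are placed earlier, on the premise that vocabulary drifts away from the first report as the story develops.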

Page 12:

Evaluation

• Metric: Kendall’s rank-order correlation coefficient (Kendall’s τ) [Siegel & Castellan, 88]
  – −1 ≤ τ ≤ 1
  – Expresses the extent to which the chronological rankings assigned by the algorithm agree with the actual rankings
• Randomly assigned rankings have, on average, τ = 0.
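For reference, Kendall’s τ can be computed by counting concordant and discordant pairs. The sketch below uses the simple tau-a form with no tie correction, which is an assumption; the slides do not say which variant was used.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a for two rankings of the same items (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        product = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(predicted) * (len(predicted) - 1) / 2
    return (concordant - discordant) / pairs


print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau([4, 3, 2, 1], [1, 2, 3, 4]))  # -1.0
print(kendall_tau([1, 3, 2, 4], [1, 2, 3, 4]))  # one swapped pair out of six
```

With one of six pairs out of order, five pairs are concordant and one is discordant, so τ = (5 − 1) / 6.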

Page 13:

Dataset

• 36 document sets
  – Manually collected (6)
  – NewsInEssence clusters (3)
  – TREC Novelty clusters (27) [Soboroff & Harman, 03]
• 15 training, 6 dev/test, 15 test
• Example topics

Story                               #Doc.   Time Span   #Sources
Milan plane crash                   56      1.5 days    5
N33 Russian submarine Kursk sinks   25      1 month     3
N48 Human genome decoded            25      2 years     3

Page 14:

Training Phase

                Median τ   # Significant (α = 0.10)
Full document   0.16       8/15
Summ-1          0.13       6
Summ-5          0.17       6
3-gram hit      0.17       7
1-gram hit      0.21       11
OOV             0.28       13

Page 15:

Training Phase (2)

Novelty Training Clusters

          Median τ   # Sig.
Summ-5    0.05       3/11
1-gram    0.20       8
OOV       0.19       8

Manual Training Clusters

          Median τ   # Sig.
Summ-5    0.32       3/3
1-gram    0.42       3
OOV       0.26       3

Page 16:

Test Phase (15 clusters)

             Median τ   # Significant
Summ-5       0.15       5/15
1-gram hit   0.14       6
OOV          0.22       9

Page 17:

Manual Clusters

Story                    OOV    Summ-5
Gulfair Plane Crash      0.37   0.39
Honduras bus hijacking   0.12   0.17
Columbia shuttle         0.56   0.48
Milan plane crash        0.26   0.33
RI nightclub fire        0.58   0.32
Iraq bombing             0.24   0.17
Med.                     0.31   0.33
# Significant            5/6    6/6

Page 18:

Conclusions

• Over all clusters, the LM approach based on OOV had the best performance
• LM and phylogenetic models had similar performance on manual clusters
  – These clusters have more salient “evolutionary” properties

Page 19:

Future work

• Tracking facts in multiple news stories over time
• Produce a timeline of known facts
• Determine if the facts have settled at each time

04/18/02 13:17   Locarno, Switzerland                      Journalist Desidera Cavvina told CNN   No
04/18/02 13:42   Locarno, Switzerland or Sofia, Bulgaria   ABC                                    No