
Modeling Document Dynamics: An Evolutionary Approach

Jahna Otterbacher, Dragomir Radev
Computational Linguistics And Information Retrieval (CLAIR)
{jahna, radev} @ umich.edu

What are dynamic texts?

• Sets of topically related documents (news stories, Web pages, etc.)

• Multiple sources

• Written/published at different points in time
  – may change over time

• Challenging features:
  – Paraphrases
  – Contradictions
  – Incorrect/biased information

Milan plane crash: April 18, 2002

04/18/02 13:17 (CNN)The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.

04/18/02 13:42 (ABCNews)The plane was destined for Italy’s capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.

04/18/02 13:42 (CNN)The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building’s 26th floor at 5:50pm (1450 GMT) on Thursday.

04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan’s Linate airport, De Simone said.

Problem for IR systems

• User poses a question or query to a system
  – Known facts change at different points in time
  – Sources contradict one another
  – Many paraphrases – similar, but not necessarily equivalent, information

• What is the “correct” information? What should be returned to the user?

Current Goals

• Propose that dynamic texts “evolve” over time

• Chronology recovery task

• Approaches
  – Phylogenetics: reconstruct the history of a set of species based on DNA
  – Language modeling: an LM constructed from the first document should fit less well over time

Phylogenetic models

• [Fitch & Margoliash, 67]

• Given a set of species and information about their DNA, construct a tree that describes how they are related, w.r.t. a common ancestor

• Statistically optimal tree minimizes the deviation between the original distances and those represented in the tree

Distance matrix (dog, wolf, bear):

      D    W    B
  D   0    4   56
  W   4    0   44
  B  56   44    0

[Figure: candidate tree relating dog, wolf and bear, with branch lengths]
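A minimal sketch (Python, not part of the original slides) of the Fitch–Margoliash idea described above: score a candidate tree by the deviation between the observed pairwise distances and the leaf-to-leaf distances implied by its branch lengths. The hard-coded topology and branch lengths below are illustrative assumptions.

```python
# Minimal sketch of the Fitch-Margoliash least-squares criterion (illustrative only).
# A candidate tree is scored by the percent deviation between observed distances
# and the leaf-to-leaf path distances implied by assumed branch lengths.

import math

# Observed pairwise distances (dog/wolf/bear example).
observed = {("dog", "wolf"): 4, ("dog", "bear"): 56, ("wolf", "bear"): 44}

# Assumed branch lengths of one candidate tree:
# dog and wolf hang off internal node X; bear hangs off internal node Y; X-Y is an edge.
branch = {("dog", "X"): 2, ("wolf", "X"): 2, ("bear", "Y"): 30, ("X", "Y"): 20}

def path_length(a, b):
    """Leaf-to-leaf distance implied by the (hard-coded) tree topology."""
    if {a, b} == {"dog", "wolf"}:
        return branch[("dog", "X")] + branch[("wolf", "X")]
    # Any pair involving bear crosses the internal X-Y edge.
    leaf = a if b == "bear" else b
    return branch[(leaf, "X")] + branch[("X", "Y")] + branch[("bear", "Y")]

def fm_score():
    """Sum of squared relative deviations, reported as a percent standard deviation."""
    total = 0.0
    for (a, b), d_obs in observed.items():
        d_tree = path_length(a, b)
        total += ((d_obs - d_tree) / d_obs) ** 2
    return math.sqrt(total / len(observed)) * 100

print(f"FM deviation for this candidate tree: {fm_score():.1f}%")
```

The statistically optimal tree is the one whose branch lengths (and topology) minimize this deviation; real tools search over many candidate trees rather than scoring a single hard-coded one.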

Phylogenetic models (2)

• History of chain letters [Bennett&al,03]

– “Genes” were facts in the letters:
  • Names/titles of people
  • Dates
  • Threats to those who don’t send the letter on

– Distance metric was the amount of shared information between two chain letters

– Used Fitch/Margoliash method to construct trees

• Result: An almost perfect phylogeny. Letters that were close to one another in the tree shared similar dates, “genes” and even geographical properties.
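As an illustration (not from the study), one simple way to realize a “shared information” distance between two letters is one minus the Jaccard overlap of their extracted “genes”; the gene sets below are hypothetical placeholders.

```python
# Illustrative sketch: a distance based on shared "genes" (facts) between two chain letters.
# The gene sets are hypothetical placeholders, not data from the study.

def shared_info_distance(genes_a: set, genes_b: set) -> float:
    """1 - Jaccard overlap: 0 means identical gene sets, 1 means nothing shared."""
    if not genes_a and not genes_b:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(genes_a | genes_b)

letter_1 = {"name:Gen. Patton", "date:1983", "threat:bad luck in 7 days"}
letter_2 = {"name:Gen. Patton", "date:1990", "threat:bad luck in 7 days"}

print(shared_info_distance(letter_1, letter_2))  # 0.5: two of four distinct genes shared
```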

Procedure: Phylogenetics

• For each document cluster and representation, generate a phylogenetic tree using Fitch [Felsenstein, 95]

– Representations: full document, extractive summaries
– Generate the Levenshtein distance matrix
– Input the matrix into Fitch to obtain an unrooted tree

• “Reroot” the unrooted tree at the first document in the cluster.

• To obtain the chronological ordering, traverse the rerooted tree.

• Assign chronological ranks, starting with ‘1’ for the root.

[Figures: the unrooted tree produced by Fitch for documents S1–S4 (with per-node distances), and the same tree rerooted at the first document S1, shown against the timeline S1, S2, S3, S4]
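A minimal sketch (Python, assumptions noted in the comments) of the distance step and a simplified chronology recovery: pairwise Levenshtein distances are computed directly, and instead of running the external Fitch program and traversing the rerooted tree, documents are ranked by their distance from the earliest one.

```python
# Sketch of the distance step: pairwise Levenshtein distances between documents,
# plus a simplified chronology-recovery stand-in (ordering by distance from the
# earliest document rather than traversing the Fitch tree). Illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def distance_matrix(docs):
    n = len(docs)
    return [[levenshtein(docs[i], docs[j]) for j in range(n)] for i in range(n)]

def recover_chronology(docs):
    """Rank documents by edit distance from the first (earliest) document.
    A crude stand-in for rerooting the tree at document 1 and traversing it."""
    d = distance_matrix(docs)
    order = sorted(range(len(docs)), key=lambda i: d[0][i])
    return {doc_index: rank + 1 for rank, doc_index in enumerate(order)}

docs = ["the plane crashed into the tower",
        "the plane crashed into the 26th floor of the tower",
        "officials say the small plane crashed into the 26th floor of the Pirelli tower"]
print(recover_chronology(docs))  # e.g. {0: 1, 1: 2, 2: 3}
```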

Procedure: LM Approach

• Inspiration: document ranking for IR
  – If a candidate document’s LM assigns high probability to the query, the document is likely relevant [Ponte & Croft, 98]

• Create an LM from the earliest document
  – Trigram backoff model using the CMU-Cambridge toolkit [Clarkson & Rosenfeld, 97]

• Evaluate it on the remaining documents
  – Use fit to rank them: OOV rates (increasing), trigram-hit ratios (decreasing) and unigram-hit ratios (increasing)
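A minimal sketch (Python) of the ranking idea, using only the OOV rate with a plain vocabulary check in place of the CMU-Cambridge trigram backoff model; the documents are made-up examples.

```python
# Simplified stand-in for the LM approach: build a vocabulary from the earliest
# document and rank the remaining documents by their OOV rate (higher OOV rate
# = worse fit = assumed to be later). The talk used a trigram backoff model.

def tokenize(text: str):
    return text.lower().split()

def oov_rate(doc: str, vocabulary: set) -> float:
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)

def rank_by_oov(earliest: str, later_docs):
    vocab = set(tokenize(earliest))
    ranked = sorted(later_docs, key=lambda d: oov_rate(d, vocab))
    return [(round(oov_rate(doc, vocab), 2), doc) for doc in ranked]

earliest = "the plane crashed into the pirelli building in milan"
later = ["rescue workers searched the pirelli building after the plane crashed",
         "investigators later ruled out terrorism in the milan crash"]
for rate, doc in rank_by_oov(earliest, later):
    print(rate, doc)
```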

Evaluation

• Metric: Kendall’s rank-order correlation coefficient (Kendall’s τ) [Siegel & Castellan, 88]
  – -1 ≤ τ ≤ 1
  – Expresses the extent to which the chronological rankings assigned by the algorithm agree with the actual rankings

• Randomly assigned rankings have, on average, τ = 0.
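For instance, Kendall’s τ between a recovered ordering and the true chronological order can be computed with scipy.stats.kendalltau; the rankings below are made-up.

```python
# Compare an algorithm's chronological ranking against the true one with Kendall's tau.
# The rankings below are made-up examples.

from scipy.stats import kendalltau

true_ranks      = [1, 2, 3, 4, 5]   # actual publication order
predicted_ranks = [1, 3, 2, 4, 5]   # order recovered by the algorithm

tau, p_value = kendalltau(true_ranks, predicted_ranks)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")  # tau = 0.80 for this example
```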

Dataset

• 36 document sets
  – Manually collected (6)
  – NewsInEssence clusters (3)
  – TREC Novelty clusters (27) [Soboroff & Harman, 03]

• 15 training, 6 dev/test, 15 test

• Example topics

Story                                #Doc.   Time Span   #Sources
Milan plane crash                    56      1.5 days    5
N33  Russian submarine Kursk sinks   25      1 month     3
N48  Human genome decoded            25      2 years     3

Training Phase

Method          Median τ   # Significant (α = 0.10)
Full document   0.16       8/15
Summ-1          0.13       6/15
Summ-5          0.17       6/15
3-gram hit      0.17       7/15
1-gram hit      0.21       11/15
OOV             0.28       13/15

Training Phase (2)

Novelty Training Clusters

Method    Median τ   # Sig.
Summ-5    0.05       3/11
1-gram    0.20       8/11
OOV       0.19       8/11

Manual Training Clusters

Method    Median τ   # Sig.
Summ-5    0.32       3/3
1-gram    0.42       3/3
OOV       0.26       3/3

Test Phase (15 clusters)

Method       Median τ   # Significant
Summ-5       0.15       5/15
1-gram hit   0.14       6/15
OOV          0.22       9/15

Manual Clusters

Story                    OOV    Summ-5
Gulfair plane crash      0.37   0.39
Honduras bus hijacking   0.12   0.17
Columbia shuttle         0.56   0.48
Milan plane crash        0.26   0.33
RI nightclub fire        0.58   0.32
Iraq bombing             0.24   0.17
Median                   0.31   0.33
# Significant            5/6    6/6

Conclusions

• Over all clusters, LM approach based on OOV had best performance

• LM and phylogenetic models had similar performance on manual clusters
  – These have more salient “evolutionary” properties

Future work

• Tracking facts in multiple news stories over time

• Produce a timeline of known facts

• Determine if the facts have settled at each time

Time             Reported origin                           Source                                  Settled?
04/18/02 13:17   Locarno, Switzerland                      Journalist Desidera Cavvina told CNN    No
04/18/02 13:42   Locarno, Switzerland or Sofia, Bulgaria   ABC                                     No
