
Page 1

Multiple Alternative Sentence Compressions (MASC)

A Framework for Automatic Summarization

Nitin Madnani, David Zajic, Bonnie Dorr

Necip Fazil Ayan, Jimmy Lin

University of Maryland, College Park

Page 2

Outline

• Problem Description

• MASC Architecture

• MASC Results

• Improving Candidate Selection

• Summary & Future Work

Page 3

Problem Description

• Sentence-level extractive summarization
  – Source sentences contain a mixture of relevant/non-relevant, novel/redundant information.

• Compression
  – A single output compression can't provide the best compression of each sentence for every user need.

• Multiple Alternative Sentence Compressions
  – Generation of multiple candidate compressions of source sentences.
  – Feature-based selection to choose among candidates.

Page 4

Outline

• Problem Description

• MASC Architecture

• MASC Results

• Improving Candidate Selection

• Summary & Future Work

Page 5

MASC Architecture

[Pipeline diagram] Documents → Sentence Filtering → Sentences → Sentence Compression → Candidates → Candidate Selection → Summary. Sentence compression is performed by HMM Hedge, Trimmer, or Topiary; candidate selection uses task-specific features (e.g. the query). (Zajic et al., 2005; Zajic et al., 2006)
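To make the data flow concrete, here is a minimal Python sketch of the pipeline with stand-in callables for the three stages; none of the function names below come from the MASC implementation itself.

```python
from typing import Callable, Dict, Iterable, List

def masc_pipeline(
    documents: Iterable[str],
    filter_sentences: Callable[[Iterable[str]], List[str]],   # Sentence Filtering
    compressors: List[Callable[[str], List[str]]],             # e.g. HMM Hedge, Trimmer, Topiary
    select: Callable[[List[str], Dict], List[str]],            # Candidate Selection
    task_features: Dict,                                        # e.g. {"query": "..."}
) -> List[str]:
    """Documents -> sentences -> candidate compressions -> selected summary."""
    sentences = filter_sentences(documents)
    candidates: List[str] = []
    for sentence in sentences:
        for compress in compressors:
            # Each compressor contributes multiple alternative compressions per sentence.
            candidates.extend(compress(sentence))
    return select(candidates, task_features)
```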

Page 6

HMM Hedge Architecture

[Pipeline diagram] Sentence → part-of-speech tagger (TreeTagger; Schmid, 1994) → sentence with verb tags (VERB) → HMM Hedge → compressions. HMM Hedge uses a headline language model and a story language model, both built from 242,918 AP headlines and stories from the Tipster Corpus.

Page 7

HMM Hedge: Multiple Alternative Compressions

• Calculate the best compression at each word length from 5 to 15 words

• Calculate the 5 best compressions at each word length
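A small sketch of the bookkeeping implied by these two bullets, assuming a decoder function `nbest_compressions(sentence, length, n)` that returns the n best compressions at a given word length (the function name is hypothetical, not HMM Hedge's actual interface):

```python
from typing import Callable, Dict, List

def hedge_candidates(
    sentence: str,
    nbest_compressions: Callable[[str, int, int], List[str]],  # hypothetical decoder interface
    min_len: int = 5,
    max_len: int = 15,
    n_best: int = 5,
) -> Dict[int, List[str]]:
    """Collect the n best compressions at every target length from 5 to 15 words."""
    return {length: nbest_compressions(sentence, length, n_best)
            for length in range(min_len, max_len + 1)}
```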

Page 8

Trimmer Architecture

[Pipeline diagram] Sentence → entity tagger (BBN IdentiFinder; Bikel et al., 1999) → sentence with entity tags (e.g. PERSON, TIME EXPR) → parser (Charniak Parser; Charniak, 2000) → parse → Trimmer → compressions.

Page 9

Multi-candidate Trimmer

• How to generate multiple candidate compressions?
  – Use the state of the parse tree after each rule application as a candidate
  – Use rules that generate multiple candidates
  – 9 single-output rules, 3 multi-output rules

• Zajic et al., 2005, 2006; Zajic, 2007
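A rough sketch of this idea: apply Trimmer rules in sequence, record the sentence yielded by the parse tree after every rule application, and let multi-output rules return several alternative trees. The `Rule` interface and `yield_string` helper are hypothetical stand-ins, not Trimmer's actual API.

```python
from typing import Any, Callable, List

Tree = Any                                   # placeholder for a parse-tree type
Rule = Callable[[Tree], List[Tree]]          # single-output rules return at most one tree,
                                             # multi-output rules return several

def trimmer_candidates(tree: Tree, rules: List[Rule],
                       yield_string: Callable[[Tree], str]) -> List[str]:
    """Record a candidate compression after every rule application."""
    candidates = [yield_string(tree)]        # the untrimmed sentence is also a candidate
    frontier = [tree]
    for rule in rules:
        next_frontier = []
        for t in frontier:
            for out in (rule(t) or [t]):     # keep the tree unchanged if the rule doesn't apply
                candidates.append(yield_string(out))   # intermediate state = candidate
                next_frontier.append(out)
        frontier = next_frontier
    return candidates
```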

Page 10

Trimmer Rule: Root-S

• Select the node to be the root of the compression
• Consider any S node with NP, VP children

Example parse (from the slide):
  (S1 (S (S2 The latest flood crest passed Chongqing in southwest China)
         (CC and)
         (S3 waters were rising in Yichang on the middle reaches of the Yangtze))
      (NP state television)
      (VP reported Sunday))

Page 11

Trimmer Rule: Conjunction

• The conjunction rule removes the right child, the left child, or neither.

Example parse (from the slide):
  (S (NP Illegal fireworks)
     (VP (VP injured hundreds of people)
         (CC and)
         (VP started six fires)))
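A toy illustration of this multi-output rule using the sentence above, with parse nodes represented as (label, children) tuples (a simplification of the slide's tree, not Trimmer's data structure):

```python
from typing import List, Tuple

Node = Tuple[str, list]      # (label, children); leaves are plain word strings

def conjunction_rule(vp: Node) -> List[Node]:
    """For (VP (VP ...) (CC ...) (VP ...)): remove the right child, the left child, or neither."""
    label, children = vp
    if label != "VP" or len(children) != 3 or children[1][0] != "CC":
        return [vp]                           # rule does not apply
    left, _cc, right = children
    return [vp, left, right]                  # neither removed / right removed / left removed

# "Illegal fireworks injured hundreds of people and started six fires."
vp = ("VP", [("VP", ["injured", "hundreds", "of", "people"]),
             ("CC", ["and"]),
             ("VP", ["started", "six", "fires"])])

for candidate in conjunction_rule(vp):
    print(candidate)
```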

Page 12

Topiary Architecture

[Pipeline diagram] The document, together with the document corpus, goes through topic assignment (BBN Unsupervised Topic Detection) to produce topic terms; the sentence goes through Trimmer to produce compressions; Topiary combines the topic terms with the compressions to form candidates.
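The Topiary outputs on the next slide look like topic terms prefixed to a Trimmer compression (e.g. "PINOCHET: wife appealed ..."). A hedged sketch of that combination step, under the assumption that topic terms simply fill whatever room the length budget leaves; the budgeting detail is an assumption, not stated on the slide.

```python
from typing import List

def topiary(topic_terms: List[str], compression: str, budget: int) -> str:
    """Prefix topic terms to a compressed sentence, staying within a word budget."""
    words_left = budget - len(compression.split())
    chosen = topic_terms[:max(words_left, 0)]
    return (" ".join(chosen) + ": " + compression) if chosen else compression

print(topiary(["PINOCHET"], "wife appealed saying he too sick to be extradited", 12))
```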

Page 13

Topiary Examples (DUC 2004)

PINOCHET: wife appealed saying he too sick to be extradited to face charges

MAHATHIR ANWAR_IBRAHIM: Lawyers went to court to demand client's release
  – Mahathir Mohamad is the former Prime Minister of Malaysia
  – Anwar bin Ibrahim is a former deputy prime minister and finance minister of Malaysia, convicted of corruption in 1998

Page 14

Selector Architecture

[Pipeline diagram] Candidates + features → Relevance & Centrality Scorer (the Uniform Retrieval Architecture (URA), UMD's software infrastructure for IR tasks), which scores each candidate against the document, the document set, and the query → candidates + more features → Sentence Selector, which applies feature weights to pick candidates for the summary, with a cull-and-rescore loop over the remaining candidates.
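A hedged sketch of the selector loop suggested by the diagram: score each candidate as a weighted sum of its features, add the best one to the summary, then cull and rescore the rest. The feature representation and the `rescore` hook are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, object]   # e.g. {"text": str, "source": id, "features": Dict[str, float]}

def score(cand: Candidate, weights: Dict[str, float]) -> float:
    return sum(weights.get(name, 0.0) * value
               for name, value in cand["features"].items())

def select_summary(candidates: List[Candidate],
                   weights: Dict[str, float],
                   rescore: Callable[[Candidate, List[str]], Candidate],
                   length_limit: int) -> List[str]:
    summary: List[str] = []
    used, pool = 0, list(candidates)
    while pool:
        best = max(pool, key=lambda c: score(c, weights))
        n = len(best["text"].split())
        if used + n > length_limit:
            break
        summary.append(best["text"])
        used += n
        # Cull other candidates drawn from the same source sentence, then rescore the rest
        # (e.g. redundancy features change once the summary grows).
        pool = [rescore(c, summary) for c in pool if c.get("source") != best.get("source")]
    return summary
```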

Page 15

Outline

• Problem Description

• MASC Architecture

• MASC Results

• Improving Candidate Selection

• Summary & Future Work

Page 16

Evaluation of Headline Generation Systems

DUC 2004 test data, ROUGE recall with unigrams

[Bar chart: ROUGE-1 recall, y-axis 0.16–0.30, for HMM Hedge, Trimmer, and Topiary without MASC and with MASC; chart labels include "First 75 UTD Topics".]

Page 17

Evaluation of Multi-Document Summarization Systems

DUC 2006 test data

[Bar chart: ROUGE-2 recall, y-axis 0.05–0.075, for No Compression, HMM Hedge, and Trimmer.]

Page 18

Outline

• Problem Description

• MASC Architecture

• MASC Results

• Improving Candidate Selection

• Summary & Future Work

Page 19

Tuning Feature Weights with ΔROUGE

The selection loop scores each of the current k-best candidates c1, ..., ck by its marginal ROUGE gain Δ1, ..., Δk, collects the scored candidates as hypotheses H, and grows the summary S:

  Initialize: S = {}, H = {}
  repeat until |S| > L:
      C ← current k-best candidates
      for c ∈ C:
          ΔROUGE(c) = R2R(S ∪ {c}) − R2R(S)        (R2R: ROUGE-2 recall)
      add the scored candidates (hypotheses) to H
      S ← S ∪ {c1}
      update the remaining candidates
  wopt ← powellROUGE(H, w0)
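A Python sketch of this loop, with scipy's Powell optimizer standing in for powellROUGE. The `kbest` and `rouge_2_recall` callables, the feature representation, and the squared-error objective are assumptions made for illustration; the slide does not specify the exact objective.

```python
import numpy as np
from scipy.optimize import minimize

def collect_hypotheses(kbest, rouge_2_recall, length_limit):
    """Greedy pass that records DeltaROUGE for every k-best candidate at each step."""
    S, H = [], []                              # summary so far, (features, delta) hypotheses
    while sum(len(c["text"].split()) for c in S) <= length_limit:
        C = kbest(S)                           # current k-best candidates, updated as S grows
        if not C:
            break
        base = rouge_2_recall(S)
        for c in C:
            H.append((c["features"], rouge_2_recall(S + [c]) - base))   # DeltaROUGE(c)
        S.append(C[0])                         # add the top candidate c1 to the summary
    return S, H

def tune_weights(H, names, w0):
    """Powell search for weights whose candidate scores track DeltaROUGE."""
    def loss(w):
        return sum((np.dot(w, [feats.get(n, 0.0) for n in names]) - delta) ** 2
                   for feats, delta in H)
    return minimize(loss, w0, method="Powell").x
```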

Page 20

Optimization Results

DUC 2007 data; all differences significant at p < 0.05

  ROUGE        Manual    ΔROUGE (k = 10)
  ROUGE-1      0.363     0.403
  ROUGE-2      0.081     0.104
  ROUGE-SU4    0.126     0.154

Manual: feature weights optimized manually to maximize ROUGE-2 recall on the final system output.

Key insights for ΔROUGE optimization:
• Uses multiple alternative sentence compressions
• Directly optimizes the candidate selection process

Page 21

Redundancy

• Candidate words can be emitted by two disparate word distributions: a REDUNDANT one (words already covered by the summary S) and a NON-REDUNDANT one (general English language L), mixed as

  P(w) = λ P(w | S) + (1 − λ) P(w | L)

• Assuming candidate words are i.i.d., the redundancy feature for a given candidate c is

  R(c) = log P(c) = Σ_{w ∈ c} log [ λ P(w | S) + (1 − λ) P(w | L) ]

  where

  P(w | S) = n(w, S) / |S|,    P(w | L) = n(w, L) / |L|

• Other documents in the same cluster are used to represent the general language.
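A direct transcription of the formula into Python; the interpolation weight λ and the use of raw counts from the summary and from the other cluster documents follow the slide, while the smoothing floor is an added assumption to avoid log(0).

```python
import math
from collections import Counter
from typing import List

def redundancy(candidate: List[str],
               summary_words: List[str],
               language_words: List[str],      # words from other documents in the cluster
               lam: float) -> float:
    """R(c) = sum over w in c of log( lam * P(w|S) + (1 - lam) * P(w|L) )."""
    n_S, n_L = Counter(summary_words), Counter(language_words)
    size_S, size_L = max(len(summary_words), 1), max(len(language_words), 1)
    score = 0.0
    for w in candidate:
        p_s = n_S[w] / size_S                  # redundant component, P(w|S) = n(w,S)/|S|
        p_l = n_L[w] / size_L                  # non-redundant component, P(w|L) = n(w,L)/|L|
        score += math.log(max(lam * p_s + (1 - lam) * p_l, 1e-12))
    return score
```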

Page 22

Incorporating Paraphrases

• Redundancy uses bags-of-words to compute P(w | S) = n(w, S) / |S|

• Not useful if the candidate word is a paraphrase of a summary word (it is classified as non-redundant)

• Add another bag-of-words P such that, ∀ w ∈ S, P contains a paraphrase for w

• Use n(w, P) for the redundancy computation if n(w, S) = 0

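A small sketch of the back-off just described: when a candidate word never occurs in the summary, count the summary words it paraphrases instead. The `paraphrases` lookup (word → set of paraphrases) is a hypothetical helper.

```python
from collections import Counter
from typing import Callable, List, Set

def p_w_given_summary(w: str,
                      summary_words: List[str],
                      paraphrases: Callable[[str], Set[str]]) -> float:
    """P(w|S) = n(w,S)/|S|, falling back to n(w,P)/|S| when n(w,S) = 0."""
    counts = Counter(summary_words)
    size = max(len(summary_words), 1)
    if counts[w] > 0:
        return counts[w] / size
    # n(w, P): occurrences of summary words for which w is a paraphrase.
    return sum(1 for s in summary_words if w in paraphrases(s)) / size
```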
Page 23

Generating Paraphrases

• Leverage a phrase-based MT system
  – Use E–F correspondences extracted from word-aligned bitext
  – Pivot each pair of E–F correspondences with a common foreign side to get an E–E correspondence:

      c(e1, e2) = Σ_f c(e1, f) · c(f, e2)

• Example

      上升 ||| climbed   ||| 1.0         increased ||| climbed   ||| 2.0
      上升 ||| increased ||| 2.0    →    climbed   ||| uplifted  ||| 1.0
      上升 ||| uplifted  ||| 1.0         uplifted  ||| increased ||| 2.0
                                         . . .

• Pick the most frequent correspondence for w
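A sketch of the pivot computation over phrase-table-style lines in the `foreign ||| english ||| count` format shown above; `pivot_paraphrases` and its return values are illustrative names, but the pair counts reproduce the slide's example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def pivot_paraphrases(lines):
    """c(e1, e2) = sum over f of c(e1, f) * c(f, e2); then keep the most frequent e2 per e1."""
    by_foreign = defaultdict(Counter)                 # f -> {e: c(e, f)}
    for line in lines:
        f, e, c = (field.strip() for field in line.split("|||"))
        by_foreign[f][e] += float(c)

    pair_counts = Counter()                           # (e1, e2) -> pivoted count
    for e_counts in by_foreign.values():
        for e1, e2 in combinations(sorted(e_counts), 2):
            pair_counts[(e1, e2)] += e_counts[e1] * e_counts[e2]
            pair_counts[(e2, e1)] += e_counts[e1] * e_counts[e2]

    best = {}                                         # most frequent correspondence for each word
    for (e1, e2), count in pair_counts.items():
        if count > pair_counts.get((e1, best.get(e1)), 0.0):
            best[e1] = e2
    return pair_counts, best

table = ["上升 ||| climbed ||| 1.0",
         "上升 ||| increased ||| 2.0",
         "上升 ||| uplifted ||| 1.0"]
pairs, best = pivot_paraphrases(table)
print(pairs[("increased", "climbed")])    # 2.0, matching the slide
print(best["climbed"])                    # 'increased'
```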

Page 24

Paraphrase Results

• Using paraphrases yields no significant improvements

• Unrelated to the quality of the paraphrases

• Anomalous cases occur extremely rarely
  – The original bag-of-words is sufficient to capture candidate redundancy almost all the time

Page 25

Outline

• Problem Description

• MASC Architecture

• MASC Results

• Improving Candidate Selection

• Summary & Future Work

Page 26

DUC 2007 Results

• Systems 7, 36
• Main:
  – Responsiveness = 3.089 (4th)
  – ROUGE-2 = 0.108 (8th)
  – ROUGE-SU4 = 0.158 (11th)
• Update:
  – Responsiveness = 2.800 (2nd)
  – ROUGE-2 = 0.086 (9th)
  – ROUGE-SU4 = 0.124 (8th)

Page 27

Summary

• MASC with feature-based candidate selection improves headline generation and shows promise for multi-document summarization.

• Optimizing for ΔROUGE provides significant improvements over the previous approach.

• The redundancy feature works at the lexical as well as the document level.

• Using paraphrases requires a novel formulation.

Page 28

Future Work

• Fully explore the Trimmer search space
• Split the redundancy feature into its components and tune λ automatically
• Use an n-gram LM to estimate P(w | L)
• Continue to experiment with paraphrase-based approaches to redundancy
  – Scale up to phrase-level paraphrases
  – Use a combination of high-coverage and high-quality paraphrases