Overview of the TAC 2008 Update Summarization Task Hoa Trang Dang, Karolina Owczarzak

Overview of the TAC 2008 Update Summarization Task · 2009. 2. 27. · Los Angeles Times – Washington Post News Service New York Times ... – manual evaluation for 1st and 2nd



Overview of the TAC 2008 Update Summarization Task

Hoa Trang Dang, Karolina Owczarzak


Update Summarization Task

● Task
  – main: produce a 100-word summary from a set of 10 documents (Summary A)
  – update: produce a 100-word summary from a set of 10 subsequent documents, with the assumption that the information in the first set is already known to the reader (Summary B)


Update Summarization Task

● 48 topics
● 20 documents per topic in chronological order:
  – main summary (first 10 documents)
  – update summary (second 10 documents)
● 100 words per summary
● 4 model summaries
  – one summary by topic creator


Data

● AQUAINT-2 Corpus
  – part of LDC English Gigaword corpus 3rd Ed.
  – 2.5GB of text
  – news articles Oct 2004 – Mar 2006:
    ● Agence France Presse
    ● Xinhua News Agency
    ● Los Angeles Times – Washington Post News Service
    ● New York Times
    ● Associated Press
● Average length of selected doc: 3368 words


Topics

● D0820D

  Title: Submarine Rescue

  Narrative: Describe efforts of the Russian navy to rescue the trapped submariners and any assistance provided by other countries. Include information regarding the results of the rescue mission and the results and consequences of the subsequent investigation into the matter.


Participants

● 33 teams
● 71 runs (up to 3 per team)
  – manual evaluation for 1st and 2nd priority runs (57)
  – automatic evaluation for all runs
● NIST baseline
  – first sentence(s) of the most recent document
  – up to 100 words
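The baseline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NIST's actual implementation: the function name `baseline_summary`, the `(date, text)` input format, the regex sentence splitter, and the choice to keep whole sentences only are all hypothetical.

```python
import re


def baseline_summary(documents, max_words=100):
    """Return leading sentences of the most recent document, up to max_words.

    documents: list of (date, text) tuples; dates must be comparable
    (for example ISO strings). This input format is an assumption.
    """
    # Pick the document with the latest date.
    _, text = max(documents, key=lambda doc: doc[0])
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    summary, count = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if count + n > max_words:
            break
        summary.append(sentence)
        count += n
    # If even the first sentence is too long, fall back to truncating it.
    if not summary and sentences:
        return " ".join(sentences[0].split()[:max_words])
    return " ".join(summary)


docs = [
    ("2005-08-05", "Old news. More old news."),
    ("2005-08-07", "A sub was trapped. Rescue began. Crew saved."),
]
print(baseline_summary(docs, max_words=7))
```

With the tiny 7-word budget used here, the sketch keeps the first two sentences of the newer document and drops the third.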


Manual Evaluation

● Overall Responsiveness
  How well does the summary respond to the information need contained in the topic statement? How good is the structure of the summary and its linguistic quality?

● Overall Readability
  What is the overall linguistic quality of the summary, independent of content? Note the fluency, structure, grammaticality, non-redundancy, referential clarity, focus, coherence.


Manual Evaluation

● Overall Responsiveness
  1 = Very Poor, 2 = Poor, 3 = Barely Acceptable, 4 = Good, 5 = Very Good

● Overall Readability
  1 = Very Poor, 2 = Poor, 3 = Barely Acceptable, 4 = Good, 5 = Very Good


Manual Evaluation

● Pyramid framework (Passonneau et al., 2005)

Model summaries: Model1, Model2, Model3, Model4

Summary Content Units (SCUs):

- Mini-submarine trapped underwater (4)
- Mini-sub snagged by underwater cables (3)
- Britain sent a robotic vehicle (3)
- U.S. sent underwater vehicles (2)
- Japan sent four vessels (2)
- British arrived first (2)
- Crew taken for medical examination (1)
- Military submarine (1)
- Mini-sub trapped in eastern Russia (1)
- U.S. sent equipment (1)


Manual Evaluation

● Pyramid framework (Passonneau et al., 2005)

SCU (4): Mini-submarine trapped underwater

contributor1: mini-submarine... became trapped... on the sea floor
contributor2: a small... submarine... snagged... at a depth of 625 feet
contributor3: mini-submarine was trapped... below the surface
contributor4: A small... submarine... was trapped on the seabed


Manual Evaluation

● Pyramid framework (Passonneau et al., 2005)

\[
\text{score} = \frac{\text{total SCU weight}}{\text{max SCU weight possible with average SCU count}}
\]

Candidate Summary
- Mini-submarine trapped underwater (4)
- Mini-sub trapped in eastern Russia (1)
- U.S. sent equipment (1)
Total SCU count: 3
Total SCU weight: 6

Model summaries (M1, M2, M3, M4)
- Mini-submarine trapped underwater (4)
- Mini-sub snagged by underwater cables (3)
- Britain sent a robotic vehicle (3)
- U.S. sent underwater vehicles (2)
- Japan sent four vessels (2)
- British arrived first (2)
- Crew taken for medical examination (1)
- Military submarine (1)
- Mini-sub trapped in eastern Russia (1)
- U.S. sent equipment (1)
Average model SCU count: 8
Max weight with 8 SCUs: 18

\[
\text{score} = \frac{6}{18} = 0.33
\]
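The worked example on this slide can be reproduced with a short Python sketch. The function name `pyramid_score` and its argument layout are illustrative assumptions; only the SCU weights and the average model SCU count of 8 come from the slide.

```python
def pyramid_score(candidate_weights, pyramid_weights, avg_model_scu_count):
    """Pyramid score as defined on the slide.

    candidate_weights: weights of the SCUs found in the candidate summary.
    pyramid_weights: weights of all SCUs extracted from the model summaries.
    avg_model_scu_count: average number of SCUs per model summary.
    """
    total_weight = sum(candidate_weights)
    # Best achievable weight: take the heaviest SCUs, as many as an
    # average model summary contains.
    top = sorted(pyramid_weights, reverse=True)[:avg_model_scu_count]
    return total_weight / sum(top)


# SCU weights from the submarine rescue example.
pyramid = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]
candidate = [4, 1, 1]

score = pyramid_score(candidate, pyramid, avg_model_scu_count=8)
print(round(score, 2))  # 0.33
```

The eight heaviest SCUs sum to 4 + 3 + 3 + 2 + 2 + 2 + 1 + 1 = 18, which is the denominator shown on the slide.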


Automatic Evaluation

● ROUGE (Lin, 2004)
  – ROUGE-2 recall: matching bigrams
  – ROUGE-SU4 recall: matching skip-bigrams (skip up to 4 intervening words)
● BE (Hovy et al., 2005)
  – BE-HM: matching head-modifier pairs
● Jackknifing for all metrics
  – evaluate each model summary against the remaining 3 models
  – evaluate each automatic summary 4 times, each time against a different set of 3 models, then average

Example BE-HM pairs:
sent | call (obj)
sent | they (subj)
call | help (for)
help | international (mod)
sent | out (guest)
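The jackknifing procedure can be sketched as follows, using a simplified bigram recall as the metric. This is a minimal illustration, not the ROUGE toolkit: tokenization is plain whitespace splitting with lowercasing, there is no stemming or stopword handling, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bigrams(text):
    """Count bigrams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(candidate, references):
    """Simplified ROUGE-2 recall: matched reference bigrams / reference bigrams."""
    cand = bigrams(candidate)
    matched = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref)
        # Clip each match by how often the bigram occurs in the candidate.
        matched += sum(min(n, cand[bg]) for bg, n in ref_bigrams.items())
        total += sum(ref_bigrams.values())
    return matched / total if total else 0.0


def jackknife_system(candidate, models, metric=rouge2_recall):
    """Score an automatic summary against every leave-one-out set of models."""
    subsets = list(combinations(models, len(models) - 1))
    return sum(metric(candidate, subset) for subset in subsets) / len(subsets)


def jackknife_model(index, models, metric=rouge2_recall):
    """Score one model summary against the remaining models."""
    others = models[:index] + models[index + 1:]
    return metric(models[index], others)


# Toy data (made up for illustration only).
models = ["a b c", "a b d", "a b e", "x y z"]
print(jackknife_system("a b c", models))
print(jackknife_model(0, models))
```

With 4 models, every summary, human or automatic, ends up being compared against sets of exactly 3 references, which is what makes model and system scores comparable.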


Results – Main vs Update

               Responsiveness       Readability          Pyramid
               models   systems     models   systems     models   systems
Summaries A    4.620    2.324*      4.786    2.347       0.663    0.260*
Summaries B    4.625    2.024*      4.800    2.337       0.630    0.204*

               ROUGE-2              ROUGE-SU4            BE-HM
               models   systems     models   systems     models   systems
Summaries A    0.117    0.079*      0.154    0.116*      0.078    0.038
Summaries B    0.117    0.068*      0.150    0.107*      0.089    0.039

Macro-average per-topic scores

* difference statistically significant with p < 0.05


Results – Models vs Systems

RESPONSIVENESS

models:  D 4.833, F 4.729, G 4.708, A 4.688, B 4.583, H 4.583, C 4.500, E 4.354
systems: 23 2.667, 49 2.667, 44 2.635, 50 2.625, 14 2.615, 11 2.542, 24 2.521, 52 2.479,
         25 2.479, 41 2.479, 37 2.479, 26 2.469, 6 2.469, 51 2.448, 1 2.427, 13 2.427,
         42 2.417, 45 2.385, 34 2.385, 2 2.385, 12 2.344, 46 2.333, 17 2.323, 19 2.312,
         43 2.260, 3 2.240, 35 2.219, 10 2.219, 15 2.208, 22 2.198, 54 2.188, 48 2.177,
         4 2.167, 36 2.156, 16 2.115, 5 2.104, 33 2.104, 29 2.083, 0 2.073, 55 2.073,
         57 2.073, 20 2.062, 27 2.052, 32 2.031, 21 2.021, 40 1.990, 56 1.948, 31 1.938,
         53 1.917, 30 1.917, 28 1.740, 7 1.688, 47 1.656, 8 1.542, 38 1.510, 18 1.479,
         39 1.417, 9 1.198

READABILITY

models:  D 4.917, F 4.896, G 4.854, A 4.833, B 4.812, E 4.729, H 4.688, C 4.604
systems: 0 3.333, 49 3.073, 23 2.958, 50 2.896, 52 2.896, 24 2.885, 26 2.885, 51 2.812,
         44 2.792, 25 2.771, 34 2.760, 1 2.719, 14 2.708, 46 2.646, 6 2.594, 17 2.562,
         37 2.552, 45 2.521, 13 2.479, 16 2.458, 10 2.448, 31 2.438, 33 2.438, 35 2.427,
         5 2.427, 4 2.417, 22 2.406, 11 2.406, 27 2.375, 15 2.365, 20 2.354, 2 2.354,
         47 2.344, 3 2.333, 41 2.323, 53 2.302, 54 2.292, 57 2.281, 36 2.240, 48 2.208,
         19 2.188, 21 2.177, 56 2.156, 12 2.031, 42 2.031, 32 2.010, 43 2.000, 40 1.958,
         30 1.938, 55 1.833, 29 1.802, 39 1.771, 18 1.760, 7 1.677, 9 1.635, 28 1.625,
         38 1.448, 8 1.312

PYRAMID

models:  G 0.805, D 0.708, H 0.655, C 0.651, B 0.625, F 0.613, A 0.608, E 0.511
systems: 11 0.331, 44 0.319, 14 0.317, 41 0.313, 23 0.304, 37 0.301, 49 0.299, 6 0.296,
         13 0.295, 25 0.290, 50 0.287, 43 0.285, 45 0.284, 12 0.282, 42 0.280, 51 0.278,
         2 0.276, 19 0.276, 24 0.275, 52 0.272, 48 0.263, 15 0.263, 1 0.261, 34 0.260,
         26 0.258, 35 0.250, 17 0.249, 3 0.242, 10 0.238, 36 0.234, 46 0.234, 29 0.234,
         22 0.232, 54 0.230, 4 0.229, 55 0.222, 16 0.222, 20 0.219, 40 0.212, 21 0.212,
         27 0.212, 32 0.206, 30 0.204, 57 0.202, 28 0.191, 5 0.190, 33 0.186, 53 0.184,
         56 0.180, 0 0.163, 31 0.160, 8 0.153, 38 0.140, 7 0.138, 47 0.130, 18 0.085,
         39 0.073, 9 0.055


Results – Models vs Systems

           Responsiveness   Readability   Pyramid
models     4.622*           4.792*        0.647*
systems    2.174*           2.342*        0.232*

           ROUGE-2          ROUGE-SU4     BE-HM
models     0.117*           0.152*        0.084*
systems    0.074*           0.111*        0.045*

Macro­average submission scores

* difference statistically significant with p < 0.05


Manual Metrics – Correlation

Correlation between average Responsiveness and average Readability/Pyramid

               Pearson's            Spearman's
               models   systems     models   systems
Readability    0.778*   0.763*      0.910*   0.750*
Pyramid        0.64     0.950*      0.46     0.941*

* correlation statistically significant with p < 0.05

● Overall Readability – evaluation of form
● Pyramid – evaluation of content
● Overall Responsiveness – evaluation of form + content
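As a rough check of how such coefficients are obtained, the sketch below computes Pearson's and Spearman's correlation between the per-model Responsiveness and Pyramid averages listed on the Models vs Systems slide (models A to H). The helper functions are a plain-Python illustration written for this example, and treating those listed averages as the inputs to the slide's correlation is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Per-model averages for models A..H, in that order.
responsiveness = [4.688, 4.583, 4.500, 4.833, 4.354, 4.729, 4.708, 4.583]
pyramid = [0.608, 0.625, 0.651, 0.708, 0.511, 0.613, 0.805, 0.655]

print(round(pearson(responsiveness, pyramid), 2))
print(round(spearman(responsiveness, pyramid), 2))
```

On these eight data points the two values land close to the 0.64 and 0.46 reported for models in the Pyramid row. Significance testing is not covered by this sketch.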


Manual and Automatic Metrics

Correlation between Pyramid score and ROUGE/BE

             Pearson's            Spearman's
             models   systems     models   systems
ROUGE-2      0.276    0.946*      0.429    0.967*
ROUGE-SU4    0.457    0.928*      0.595    0.951*
BE-HM        0.423    0.949*      0.309    0.950*

Correlation between Responsiveness score and ROUGE/BE

             Pearson's            Spearman's
             models   systems     models   systems
ROUGE-2      0.725*   0.894*      0.874*   0.920*
ROUGE-SU4    0.866*   0.874*      0.898*   0.909*
BE-HM        0.656    0.911*      0.683    0.910*

* correlation statistically significant with p < 0.05


Conclusions

● Update summaries more difficult for automatic systems than main summaries
  – lower Overall Responsiveness
  – lower Pyramid scores
● Gap between automatic and human summaries
  – Overall Responsiveness
  – Overall Readability
  – Pyramid score
● NIST baseline best in Readability, low in content (Pyramid)


Thank you