Multi-Document Summary Space: What do People Agree is Important?


  • Multi-Document Summary Space: What do People Agree is Important?
    John M. Conroy
    Institute for Defense Analyses
    Center for Computing Sciences
    Bowie, MD

  • Outline
    Problem statement.
    Human summaries.
    Oracle estimates.
    Algorithms.

  • Query-Based Multi-document Summarization
    User types a query.
    Relevant documents are retrieved.
    Retrieved documents are clustered.
    Summaries for each cluster are displayed.
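
    A minimal Python sketch of this pipeline; retrieve, cluster, and summarize are hypothetical stand-ins for the actual retrieval, clustering, and summarization components:

        def query_based_summaries(query, corpus, retrieve, cluster, summarize):
            """Sketch of the query-based multi-document summarization pipeline."""
            docs = retrieve(query, corpus)    # relevant documents are retrieved
            clusters = cluster(docs)          # retrieved documents are clustered
            # one summary per cluster is displayed to the user
            return [summarize(c, query) for c in clusters]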

  • Example Query: hurricane earthquake

  • Columbia

  • Michigan

  • Recent Evaluation and Problem Definition Efforts
    Document Understanding Conferences (DUC):
    2001-2004: 100-word generic summaries.
    2005-2006: 250-word focused summaries.
    http://duc.nist.gov/
    Multi-lingual Summarization Evaluation (MSE) 2005-2006:
    Given a cluster of translated documents and English documents, produce a 100-word summary.
    http://www.isi.edu/~cyl/MTSE2005/

  • Overview of Techniques
    Linguistic tools (find sentence boundaries, shorten sentences, extract features):
    Part-of-speech tagging.
    Parsing.
    Entity extraction.
    Bag of words, position in document.
    Statistical classifiers:
    Linear classifiers.
    Bayesian methods, HMM, SVM, etc.
    Redundancy removal:
    Maximum marginal relevance (MMR).
    Pivoted QR.
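
    As one concrete instance of redundancy removal, a minimal sketch of greedy MMR selection, assuming a similarity function sim (e.g., cosine over bag-of-words vectors) is supplied; lam is the usual relevance/novelty trade-off:

        def mmr_select(sentences, query, sim, k, lam=0.7):
            """Greedy maximum marginal relevance: pick k sentences that are
            relevant to the query but not redundant with earlier picks."""
            selected, candidates = [], list(sentences)
            while candidates and len(selected) < k:
                def mmr_score(s):
                    redundancy = max((sim(s, t) for t in selected), default=0.0)
                    return lam * sim(s, query) - (1.0 - lam) * redundancy
                best = max(candidates, key=mmr_score)
                selected.append(best)
                candidates.remove(best)
            return selected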

  • Sample Data
    DUC 2005.
    50 topics.
    25 to 50 relevant documents per topic.
    4 or 9 human summaries per topic.

  • Linguistic Processing
    Use heuristic patterns to find phrases/clauses/words to eliminate.
    Shallow processing.
    Value of full sentence elimination?

  • Linguistic Processing: Phrase Elimination
    Gerund phrases. Example:
    Suicide bombers targeted a crowded open-air market Friday, setting off blasts that killed the two assailants, injured 21 shoppers and passersby and prompted the Israeli Cabinet to put off action on ...

  • Example Topic Description
    Title: Reasons for Train Wrecks
    Narrative: What causes train wrecks and what can be done to prevent them? Train wrecks are those events that result in actual damage to the trains themselves, not just accidents where people are killed or injured.
    Type: General

  • Example Human Summary
    Train wrecks are caused by a number of factors: human, mechanical and equipment errors, spotty maintenance, insufficient training, load shifting, vandalism, and natural phenomenon. The most common types of mechanical and equipment errors are: brake failures, signal light and gate failures, track defects, and rail bed collapses. Spotty maintenance is characterized by failure to consistently inspect and repair equipment. Lack of electricians and mechanics results in letting equipment run down until someone complains. Engineers are often unprepared to detect or prevent operating problems because of the lack of follow-up training needed to handle updated high technology equipment. Load shiftings derail trains when a curve is taken too fast or there is a track defect. Natural phenomenon such as heavy fog, torrential rain, or floods causes some accidents. Vandalism in the form of leaving switches open or stealing parts from them leads to serious accidents. Human errors may be the most common cause of train accidents. Cars and trucks carelessly crossing or left on tracks cause frequent accidents. Train crews often make inaccurate tonnage measurements that cause derailments or brake failures, fail to heed single-track switching precautions, make faulty car hook-ups, and, in some instances, operate locomotives while under the influence of alcohol or drugs. Some freak accidents occur when moving trains are not warned about other trains stalled on the tracks. Recommendations for preventing accidents are: increase the number of inspectors, improve emergency training procedures, install state-of-the-art warning, control, speed and weight monitoring mechanisms, and institute closer driver fitness supervision.

  • Another Example Topic
    Title: Human Toll of Tropical Storms
    Narrative: What has been the human toll in death or injury of tropical storms in recent years? Where and when have each of the storms caused human casualties? What are the approximate total number of casualties attributed to each of the storms?
    Granularity: Specific

  • Example Human Summary
    January 1989 through October 1994 tolled 641,257 tropical storm deaths and 5,277 injuries world-wide. In May 1991, Bangladesh suffered 500,000 deaths; 140,000 in March 1993; and 110 deaths and 5,000 injuries in May 1994. The Philippines had 29 deaths in July 1989 and 149 in October; 30 in June 1990, 13 in August and 14 in November. South Carolina had 18 deaths and two injuries in October 1989; 29 deaths in April 1990 and three in October. North Carolina had one death in July 1989 and three in October 1990. Louisiana had three deaths in July 1989; and two deaths and 75 injuries in August 1992. Georgia had three deaths in October 1990 and 19 in July 1994. Florida had 15 in August 1992. Alabama had one in July 1994. Mississippi had five in July 1989. Texas had four in July 1989 and two in October. September 1989 Atlantic storms killed three. The Bahamas had four in August 1992. The Virgin Islands had five in December 1990. Mexico had 19 in July 1993. Martinique had six in October 1990 and 10 injuries in August 1993. September 1993 Caribbean storms killed three Puerto Ricans and 22 others. China had 48 deaths and 190 injuries in September 1989, and 216 deaths in August 1990. Taiwan had 30 in October 1994. In September 1990, Japan had 15 and Vietnam had 10. Nicaragua had 116 in January 1989. Venezuela had 300 in August 1993.

  • Inter-Human Word Agreement

  • Evaluation of Summaries
    Ideally, each machine summary would be judged by multiple humans for:
    1. Responsiveness to the query.
    2. Cohesiveness, grammar, etc.
    Reality: this would take too much time!
    Plan: use a metric which correlates at 90-97% with human responsiveness judgments.

  • ROUGE: Recall-Oriented Understudy for Gisting Evaluation
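
    A minimal recall-only sketch of ROUGE-n over pre-tokenized summaries (the official ROUGE toolkit adds stemming, stopword options, and jackknifing over references):

        from collections import Counter

        def ngrams(tokens, n):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

        def rouge_n(candidate, references, n=1):
            """Fraction of reference n-grams (clipped counts) that also appear
            in the candidate summary; n=1 gives ROUGE-1, n=2 gives ROUGE-2."""
            cand = ngrams(candidate, n)
            matched = total = 0
            for ref in references:
                ref_counts = ngrams(ref, n)
                matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
                total += sum(ref_counts.values())
            return matched / total if total else 0.0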

  • ROUGE-1 Scores

  • ROUGE-2 Scores

  • Frequency and Summarization
    Ani Nenkova (Columbia) and Lucy Vanderwende (Microsoft) report:
    High-frequency content words in the documents correlate with the words humans choose for their summaries.
    SumBasic, a simple method based on this principle, produces state-of-the-art generic summaries, e.g., at DUC 04 and MSE 05.
    See also Van Halteren and Teufel 2003, Radev et al. 2003, Copeck and Szpakowicz 2004.
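
    A sketch of SumBasic, assuming sentences arrive as non-empty lists of lowercased content words; it is slightly simplified in that it takes the best-scoring sentence directly rather than first picking the currently highest-probability word:

        from collections import Counter

        def sumbasic(sentences, word_budget):
            """Score sentences by the mean unigram probability of their words,
            pick the best, then down-weight its words (p <- p^2) so later
            picks avoid repeating the same content."""
            words = [w for s in sentences for w in s]
            p = {w: c / len(words) for w, c in Counter(words).items()}
            summary, used, remaining = [], 0, list(sentences)
            while remaining and used < word_budget:
                best = max(remaining, key=lambda s: sum(p[w] for w in s) / len(s))
                summary.append(best)
                used += len(best)
                remaining.remove(best)
                for w in best:              # the redundancy update
                    p[w] = p[w] ** 2
            return summary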

  • What is Summary Space?
    Is there enough information in the documents to approach human performance as measured by ROUGE?
    Do humans abstract so much that extracts don't suffice?
    Is a unigram distribution enough?

  • A Candidate
    Suppose an oracle gave us:
    Pr(t) = the probability that a human will choose term t to be included in a summary,
    where t is a non-stop-word term.
    Estimate based on our data:
    e.g., 0, 1/4, 1/2, 3/4, or 1 if 4 human summaries are provided.
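
    A sketch of that estimate, assuming each human summary is already tokenized into lowercased words:

        def oracle_pr(human_summaries, stopwords):
            """Pr(t): the fraction of human summaries whose non-stop terms
            include t. With 4 human summaries this is 0, 1/4, 1/2, 3/4, or 1."""
            term_sets = [{w for w in s if w not in stopwords}
                         for s in human_summaries]
            vocab = set().union(*term_sets)
            return {t: sum(t in ts for ts in term_sets) / len(term_sets)
                    for t in vocab}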

  • A Simple Oracle Score
    Generate extracts:
    Score sentences by the expected percentage of abstract terms they contain.
    Discard very short and very long sentences.
    Use pivoted QR to remove redundancy.
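
    The sentence score is then just the mean of Pr(t) over the sentence's terms; a sketch:

        def oracle_score(sentence_terms, pr):
            """Expected fraction of the sentence's terms that a human would
            put in a summary: the mean of Pr(t) over its non-stop terms."""
            if not sentence_terms:
                return 0.0
            return sum(pr.get(t, 0.0) for t in sentence_terms) / len(sentence_terms)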

  • The Oracle Pleases Everyone!

  • Approximate Pr(t)
    Two bits of information:
    Topic description: extract query phrases.
    Retrieved documents: extract terms which are indicative of the documents (their signature).

  • Query Terms
    Given the topic description:
    Tag it for part of speech.
    Take any NN (noun), VB (verb), JJ (adjective), RB (adverb), and multi-word groupings of NNP (proper nouns).
    E.g.: train, wrecks, train wrecks, causes, prevent, events, result, actual, actual damage, trains, accidents, killed, injured.
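
    A sketch of this extraction using NLTK's off-the-shelf tokenizer and tagger in place of whatever tagger was originally used (the NLTK model data must be installed); the slide's example also lists adjacent-word pairs like "train wrecks", which this sketch omits, keeping only single terms and NNP runs:

        import nltk  # assumes the punkt and averaged_perceptron_tagger data

        def query_terms(topic_description):
            """Keep nouns, verbs, adjectives, and adverbs, and join
            consecutive proper nouns (NNP) into multi-word terms."""
            tagged = nltk.pos_tag(nltk.word_tokenize(topic_description))
            terms, nnp_run = [], []
            for word, tag in tagged:
                if tag.startswith('NNP'):      # collect proper-noun runs
                    nnp_run.append(word)
                    continue
                if nnp_run:
                    terms.append(' '.join(nnp_run))
                    nnp_run = []
                if tag[:2] in ('NN', 'VB', 'JJ', 'RB'):
                    terms.append(word.lower())
            if nnp_run:
                terms.append(' '.join(nnp_run))
            return terms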

  • Signature Terms

    Term: a space-delimited string of characters from {a, b, c, ..., z}, after the text is lower-cased and all other characters and stop words are removed.
    We need to restrict our attention to indicative terms (signature terms):
    terms that occur more often than expected.

  • Signature Terms
    Terms that occur more often than expected:
    Based on a 2x2 contingency table of relevance counts.
    Log-likelihood ratio; equivalent to mutual information.
    Dunning 1993; Lin & Hovy 2000.

  • Hypothesis Testing
    H0: P(C | ti) = p = P(C | ~ti)
    H1: P(C | ti) = p1 ≠ p2 = P(C | ~ti)
    Maximum-likelihood estimates of p, p1, and p2.

             C     ~C
      ti    O11    O12
     ~ti    O21    O22

  • Likelihood of H0 vs. H1 and Mutual Information
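
    The likelihood ratio on that slide reduces to Dunning's G-squared statistic; a sketch over the table's counts (O11 = occurrences of term ti in the relevant documents C, O12 = its occurrences elsewhere, and so on):

        from math import log

        def log_likelihood_ratio(o11, o12, o21, o22):
            """-2 log lambda for a 2x2 table: G^2 = 2 * sum O * ln(O / E),
            with expected counts E taken from the marginals. Terms whose G^2
            exceeds a chi-square cutoff (1 dof) are signature terms."""
            n = o11 + o12 + o21 + o22
            cells = ((o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
                     (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22))
            g2 = 0.0
            for o, row, col in cells:
                if o > 0:
                    g2 += 2.0 * o * log(o / (row * col / n))
            return g2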

  • Example Signature Terms
    accident accidents ammunition angeles avenue beach bernardino blamed board boulevard boxcars brake brakes braking cab car cargo cars caused cc cd collided collision column conductor coroner crash crew crews crossing curve derail derailed desk driver edition emergency engineer engineers equipment failures fe fog freight ft grade holland injured injuries investigators killed line loaded locomotives los maintenance mechanical metro miles nn ntsb occurred pacific page part passenger path photo pipeline rail railroad railroads railway runaway safety san santa scene seal shells sheriff signals southern speed staff station switch track tracks train trains transportation truck weight westminster words workers wreck yard yesterday

  • An Approximation of Pr(t)
    For a given data set and topic description:
    Let Q be the set of query terms.
    Let S be the set of signature terms.
    Estimate Pr(t) = (χ_Q(t) + χ_S(t)) / 2, where χ_A(t) = 1 if t ∈ A and 0 otherwise.
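
    In code this estimate is nearly a one-liner; a sketch:

        def approximate_pr(query_terms, signature_terms):
            """Pr(t) ~ (chi_Q(t) + chi_S(t)) / 2: 1 if t is both a query and a
            signature term, 1/2 if it is exactly one of the two, 0 otherwise."""
            q, s = set(query_terms), set(signature_terms)
            return {t: ((t in q) + (t in s)) / 2 for t in q | s}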

  • Our Approach
    Use the expected abstract word score to select candidate sentences (totaling about twice the summary word budget, ~2w).
    Terms as sentence features:
    terms {t1, ..., tm} index the rows and sentences {s1, ..., sn} the columns of a term-sentence matrix, so each sentence is a vector in R^m.
    Scaling: each column is scaled by its sentence score.
    Use pivoted QR to select sentences.

  • Redundancy Removal: Pivoted QR
    Choose the column with maximum norm (aj).
    Subtract components along aj from the remaining columns, i.e., make the remaining columns orthogonal to the chosen column.
    Stopping criterion: the chosen sentences (columns) total ~w (or ~2w) words.
    This removes semantic redundancy.
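
    A sketch using SciPy's column-pivoted QR in place of a hand-rolled loop; A is the term-by-sentence matrix with columns already scaled by sentence score:

        from scipy.linalg import qr

        def select_sentences(A, sentence_lengths, word_budget):
            """The pivot order greedily takes the column of largest remaining
            norm after orthogonalizing against the columns chosen so far, so
            nearly dependent (redundant) sentences are pushed to the end."""
            _, _, pivots = qr(A, pivoting=True)
            chosen, words = [], 0
            for j in pivots:
                if words >= word_budget:
                    break
                chosen.append(int(j))
                words += sentence_lengths[j]
            return sorted(chosen)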

  • Results

  • Conclusions
    Pr(t), the oracle score, produces summaries which please everyone.
    A simple estimate of Pr(t), induced by query and signature terms, gives rise to a top-scoring system.

  • Future Work
    Better estimates for Pr(t):
    Pseudo-relevance feedback.
    LSI or similar dimension-reduction tricks?
    Ordering of sentences for readability is important (with Dianne O'Leary); a 250-word summary has approximately 12 sentences.
    Two directions in linguistic preprocessing:
    Eugene Charniak's parser (with Bonnie Dorr and David Zajic).
    Simple rule-based "POS lite" (with Judith Schlesinger).

  • On Brevity

  • "I will be brief. Not nearly so brief as Salvador Dali, who gave the world's shortest speech. He said, 'I will be so brief I have already finished,' and he sat down." - Edward O. Wilson

    About a 16% increase in the number of sentences in a 100-word summary, based on DUC 03 data.
    See the LAMP talk for data.
    Death tolls in tropical storms.
    Baseline = 1: the most recent 250 words from the most recent document.
    Pr(t) is, in a sense, a sufficient statistic.