SIGIR 2001 – WTS / DUC, 13 Sep 2001

Centrifuser Output
Min Yen Kan, 2001

Centrifuser’s output comes in three parts:

  • Navigation;
  • Informative extract, based on similarities;
  • Indicative generated text, based on differences.

Centrifuser can currently produce this output for documents with the same domain and genre.



SIGIR 2001 WTS / DUC, 13 Sep 2001

Part 1: Informative Summaries


Informative Summaries
Informative = replaces the document with a shorter version

  • Task: Provide the most important aspects of the document(s)
  • Interaction type: Browsing
  • Strategy: Since search results are similar, put together similarities across documents


Algorithm
1. *Convert each document to a Document Topic Tree
2. *Compute Composite Topic Tree
3. Align query and topics across trees
4. Extract sentences
5. Order into summary


1. Document Topic Tree
  • Hierarchical view of the document
  • Layout (Hu et al. 99)
  • Lexical chains (Hearst 94, Choi 00)
  • Done offline per document

[Figure: example Document Topic Tree. Each node is annotated with Level, Order, Style (Prose or Bulleted) and Contents (e.g. 1 Table, 5 items, 3 Headers). Node labels include: High Blood Pressure; AHA Recommendation; Related AHA publications; See also in this guide.]


2. Composite Topic Tree
  • Norm for a particular type of document
  • Created by aligning topics in example trees by similarity
  • Stores order, frequency and variants of each topic
  • Done offline per domain and genre combination handled

[Figure: doc tree 1 (yellow) and doc tree 2 (blue) are merged by joining nodes level by level: joined node at level 1 (e.g. disease), newly joined node at level 2 (e.g. symptoms), newly joined node at level 3 (e.g. nausea).]
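The merging step can be sketched roughly as follows. This is only an illustration, not Centrifuser's implementation: the `TopicNode` class, the `build_composite` function and the exact-match `normalize` helper are all hypothetical, and exact label matching stands in for the similarity-based alignment the slide describes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """One topic in a document topic tree (hypothetical minimal structure)."""
    label: str
    children: list = field(default_factory=list)


def normalize(label):
    # Assumption: exact match after lowercasing replaces the real similarity metric.
    return label.strip().lower()


def build_composite(doc_trees):
    """Join nodes level by level; record frequency, mean order and label variants."""
    stats = defaultdict(lambda: {"docs": set(), "orders": [], "variants": set()})

    def walk(node, parent_path, order, doc_id):
        path = parent_path + (normalize(node.label),)
        entry = stats[path]
        entry["docs"].add(doc_id)
        entry["orders"].append(order)
        entry["variants"].add(node.label)
        for position, child in enumerate(node.children, start=1):
            walk(child, path, position, doc_id)

    for doc_id, root in enumerate(doc_trees):
        walk(root, (), 1, doc_id)

    n_docs = len(doc_trees)
    return {
        path: {
            "freq": len(e["docs"]) / n_docs,          # share of documents with this topic
            "mean_order": sum(e["orders"]) / len(e["orders"]),
            "variants": sorted(e["variants"]),
        }
        for path, e in stats.items()
    }


doc1 = TopicNode("Disease", [TopicNode("Symptoms", [TopicNode("Nausea")]),
                             TopicNode("Treatment")])
doc2 = TopicNode("disease", [TopicNode("Treatment"), TopicNode("Symptoms")])
composite = build_composite([doc1, doc2])
print(composite[("disease", "symptoms", "nausea")]["freq"])
```

Keying each composite node by its full path keeps topics with the same label under different parents apart, which is one simple way to respect the level-by-level joining shown in the figure.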


3. Topic Alignment
  • Use similarity metric to map query to composite and document trees
  • Focus topic defines 3 regions
  • Done online, to find scope of information needed in summary

[Figure: composite tree and document trees for the query "Hypertension", shown once with the root as focus topic (e.g. About hypertension) and once with a 2nd level subtopic as focus topic (e.g. Guide to Cardiac Diseases). Legend: focus topic, relevant, irrelevant, too detailed.]
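A rough sketch of how a focus topic could split a tree into regions is given below. The slides do not define the region boundaries, so the rules here are assumptions: topics outside the focus topic's subtree are treated as irrelevant, descendants within `max_depth` levels as relevant, and deeper descendants as too detailed. The word-overlap similarity and all function names are likewise hypothetical.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercased word sets (stand-in similarity metric)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def find_focus(query, paths):
    """Return the topic path whose last label is most similar to the query."""
    return max(paths, key=lambda path: token_overlap(query, path[-1]))


def region(path, focus, max_depth=1):
    """Classify a topic path relative to the focus topic (assumed rules)."""
    if path == focus:
        return "focus"
    if path[:len(focus)] != focus:
        return "irrelevant"                 # not inside the focus subtree
    depth_below = len(path) - len(focus)
    return "relevant" if depth_below <= max_depth else "too detailed"


paths = [
    ("hypertension",),
    ("hypertension", "symptoms"),
    ("hypertension", "symptoms", "nausea"),
    ("hypertension", "treatment"),
    ("hypertension", "treatment", "drugs"),
]
root_focus = find_focus("Hypertension", paths)
sub_focus = find_focus("treatment", paths)
for p in paths:
    print(p, region(p, root_focus), region(p, sub_focus))
```

With the root as focus, most of the tree is in scope; with a subtopic as focus, only that subtree is, which matches the contrast between the two cases in the figure.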


4. Sentence Extraction
  • Aligned topics chosen in descending typicality
  • Use SimFinder to choose sentences
  • Cover as many topics as possible to ensure breadth of summary

[Figure: composite topic tree with a frequency on each topic. Legend: focus topic, aligned, unaligned (no instance in documents).]

Topic                  Freq
*Disease*              1.0
Diet                   0.6
For more information   0.7
Treatment              0.9
Diagnosis              0.8
Surgery                0.3
Drugs                  0.7
Definition             0.2
Causes                 0.8
Symptoms               0.8
Nausea                 0.2

Extracted Sentences
1.0 (hypertension) Since blood is carried "If a drug that blocks
0.9 (treatment) How Can I Reduce High How Do I Manage My
0.8 (causes) Blood pressure is
0.7 (drugs) "Over-the-counter
0.7 (for more information) 2000 Heart and Stroke
0.6 (diet) Everybody's looking for
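The selection order above can be sketched as below. This is an illustration under assumptions: `extract_sentences` is a hypothetical name, the candidate sentences are placeholders, and picking the first candidate per topic merely stands in for SimFinder, whose interface is not shown in the slides.

```python
def extract_sentences(topic_freq, candidates, max_sentences):
    """Pick one sentence per aligned topic, most typical topics first."""
    # A topic counts as aligned here if some document supplies a candidate sentence.
    aligned = [topic for topic in topic_freq if candidates.get(topic)]
    # sorted() is stable, so topics with equal frequency keep their input order.
    ranked = sorted(aligned, key=lambda topic: topic_freq[topic], reverse=True)
    summary = []
    for topic in ranked[:max_sentences]:
        # Placeholder for SimFinder: take the first candidate sentence.
        summary.append((topic_freq[topic], topic, candidates[topic][0]))
    return summary


# Frequencies from the slide; "hypertension" instantiates the *Disease* node.
topic_freq = {
    "hypertension": 1.0, "treatment": 0.9, "causes": 0.8, "symptoms": 0.8,
    "diagnosis": 0.8, "drugs": 0.7, "for more information": 0.7, "diet": 0.6,
    "surgery": 0.3, "definition": 0.2, "nausea": 0.2,
}
# Hypothetical candidate sentences, one list per topic found in the documents.
candidates = {
    "hypertension": ["<sentence on hypertension>"],
    "treatment": ["<sentence on treatment>"],
    "causes": ["<sentence on causes>"],
    "drugs": ["<sentence on drugs>"],
    "for more information": ["<sentence on further information>"],
    "diet": ["<sentence on diet>"],
}
summary = extract_sentences(topic_freq, candidates, max_sentences=6)
for freq, topic, sentence in summary:
    print(freq, topic, sentence)
```

Taking one sentence per topic before revisiting any topic is one way to favour breadth, as the slide asks.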


5. Sentence Ordering
  • Order extracted sentences by order in composite tree (by norm)
  • Order by norm order to get best results

Extracted Sentences (ordered by typicality)
1.0 (hypertension) Since blood is carried "If a drug that blocks
0.9 (treatment) How Can I Reduce High How Do I Manage My
0.8 (causes) Blood pressure is
0.7 (drugs) "Over-the-counter
0.7 (for more information) 2000 Heart and Stroke
0.6 (diet) Everybody's looking for

Reordered Sentences (ordered by normal first appearance)
1. (hypertension) Since blood is carried "If a drug that blocks
1.4 (causes) Blood pressure is
1.5 (treatment) How Can I Reduce High How Do I Manage My
1.5.1 (drugs) "Over-the-counter
1.5.2 (diet) Everybody's looking for
1.6 (for more information) 2000 Heart and Stroke
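The reordering can be reproduced with a small sort on the composite-tree outline numbers. This is a minimal sketch; `outline_key` is a hypothetical helper, and the numbers are the ones shown on the slide.

```python
def outline_key(number):
    """Turn an outline number such as '1.5.2' (or '1.') into a sortable tuple."""
    return tuple(int(part) for part in number.strip(".").split("."))


# (topic, position in the composite tree), listed in typicality order.
extracted = [
    ("hypertension", "1."),
    ("treatment", "1.5"),
    ("causes", "1.4"),
    ("drugs", "1.5.1"),
    ("for more information", "1.6"),
    ("diet", "1.5.2"),
]
reordered = sorted(extracted, key=lambda item: outline_key(item[1]))
for topic, number in reordered:
    print(number, topic)
```

Comparing tuples of integers rather than strings keeps a parent such as 1.5 ahead of its children 1.5.1 and 1.5.2, and would also sort 1.10 after 1.9.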


Part 2: Indicative Summaries


Indicative Summaries
Indicative = helps decide whether a document is worthwhile for retrieval

  • Task: Show salient differences from other candidates
  • Interaction type: Searching
  • Strategy: Identify content and non-content aspects in which each source is different


What goes into an Indicative Summary?
  • Examine existing indicative summaries: Library card catalog
  • Examine multidocument scenarios


Corpus Parameters
  • 82 summaries from CU's online catalog
  • Healthcare domain
  • Catalogued types of information present
      • Document-derived features
      • Metadata features

Example: Practical Interventional Cardiology represents a practical reference for the interventional cardiologist and those in training, as well as the non-invasive cardiologist and physician. [...] Rather than providing detailed and exhaustive reviews, the purpose of this book is to present practical information regarding cardiac interventional procedures. [...]


Corpus Analysis Results

Document Feature (Document Derived)   Freq     Document Feature (Metadata)   Freq
Topicality                            100%     Title                         31%
Content Types                         37%      Revised/Edition               28%
Readability                           18%      Author/Editor                 21%
Internal Structure                    17%      Purpose                       18%
Special Content                       7%       Audience                      17%


Analysis - Multidocument
  • Prescriptive Guidelines
  • Open Directory Project website hierarchy

Differences are important!
1. Differences between documents
2. Differences from the norm
3. Those relevant to the query (Grice '75)

Make clear what makes a site different from the rest


Corpus Analysis Discussion
  • Topicality (i.e. content) is most important
  • Other features have a strong role

For Centrifuser
  • Design summary around topics
  • When space allows, add other features as needed
  • When feature differs from the norm
  • Future work: mimic the percentages in study
  • Differences drive the text
  • Query and norm should affect the summary content.


Algorithm
1. *Make Composite and Document Topic Trees
2. Align query and topics across trees
3. Use region ratios to compute document categories
4. Decide messages to realize
5. Order messages
6. Generate the text


2. (recap) Align query and topics
  • Map the query to a topic
  • Query node divides nodes into relevant, irrelevant and intricate regions

[Figure: two trees, one with the root as focus topic and one with a 2nd level subtopic as focus topic. Queries shown: Angina; Treatments of Angina. Legend: focus topic, relevant, irrelevant, intricate.]

Attributing the effect of the query on the generated text


Notes: We've described topic trees, but recall the goal is to differentiate documents by their topical distribution. To do this we have another intermediate step: we have to categorize each document's topic distribution. (Same spelled wrong; put queries on slide.)

Classifying Topics By Norm
  • Relevant nodes divided into typical and rare

[Figure: composite topic tree and document topic tree. Legend: focus topic, typical node (freq >= .5), rare node (freq < .5), unaligned topic.]

Attributing the effect of the norm on the generated text
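The typical/rare split is a simple threshold on composite-tree frequency. A minimal sketch, with the hypothetical function name `classify_by_norm` and example frequencies taken from the sentence extraction slide:

```python
def classify_by_norm(freq, threshold=0.5):
    """Label a relevant topic as typical (freq >= threshold) or rare (freq < threshold)."""
    return "typical" if freq >= threshold else "rare"


# Frequencies as shown on the composite topic tree in the extraction slide.
relevant_topics = {"treatment": 0.9, "diet": 0.6, "surgery": 0.3, "definition": 0.2}
labels = {topic: classify_by_norm(freq) for topic, freq in relevant_topics.items()}
print(labels)
```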


3. Categorizing Documents
  • Ratio of typical, rare, intricate and irrelevant determines category
  • 7 categories altogether

Examples
  • Irrelevant Document (50+% irrelevant): 3 typical, 2 rare, 2 intricate and 8 irrelevant
  • Specialized Document (> 50% typical, < 50% of all possible typical): 5 typical, 2 rare, 2 intricate
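The two rules shown on the slide can be written out as below. This is a sketch under assumptions: only these two of the seven categories are specified in the slides, so everything else falls through to a placeholder "other"; reading "all possible typical" as the number of typical topics in the composite tree is an interpretation, and the value 12 used for it is made up for illustration.

```python
def categorize(typical, rare, intricate, irrelevant, possible_typical):
    """Apply the two category rules given on the slide; other categories are unspecified."""
    total = typical + rare + intricate + irrelevant
    if total == 0 or possible_typical <= 0:
        raise ValueError("need at least one topic and a positive possible_typical")
    if irrelevant / total >= 0.5:
        return "irrelevant"
    # Mostly typical topics, but covering under half of the norm's typical topics.
    if typical / total > 0.5 and typical / possible_typical < 0.5:
        return "specialized"
    return "other"  # the remaining five categories are not described in the slides


# Topic counts from the slide; possible_typical=12 is a hypothetical value.
print(categorize(3, 2, 2, 8, possible_typical=12))
print(categorize(5, 2, 2, 0, possible_typical=12))
```

Checking the irrelevant rule first means a document dominated by off-topic material is never reported as specialized.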


4. Forming Messages
Messages and the text that they eventually realize

Document category description
  Relation: category-description
  Args: [ docCat: atypical ] [] [] []

Documents belonging to category
  Relation: category-elements
  Args: docCat: atypical; element: AMA Guide; element: CU Guide

Topics in category
  Relation: has-topics
  Args: docCat: atypical; topic: definition; topic: risks; []

Realized text: More information on additional topics which are not included in the summary are available in these files (The American Medical Association family medical guide and The Columbia University College of Physicians and Surgeon complete home medical guide). The topics include definition and what are

Other messages may include:
  • Number of categories in summary
  • Other optional information (e.g. content type)
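The three messages above can be held in a small relation-plus-arguments structure. This is a sketch only; the `Message` class and the `values` helper are hypothetical, while the relation names and argument values are the ones on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """A content-planning message: a relation name plus (argument, value) pairs."""
    relation: str
    args: tuple


def values(message, name):
    """Return all values of a repeated argument, e.g. every 'element'."""
    return [value for key, value in message.args if key == name]


messages = [
    Message("category-description", (("docCat", "atypical"),)),
    Message("category-elements", (("docCat", "atypical"),
                                  ("element", "AMA Guide"),
                                  ("element", "CU Guide"))),
    Message("has-topics", (("docCat", "atypical"),
                           ("topic", "definition"),
                           ("topic", "risks"))),
]
print(values(messages[1], "element"))
```

Storing arguments as pairs rather than a dictionary keeps repeated arguments such as `element` and `topic` in their original order.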


Notes: Done with content planning after this slide, not before. Put an example and correlate with categories from prev slide, e.g. prototype = sentence from this document category.

5. Ordering Messages
  • Inter-category: by importance of dominant topic type.
  • Intra-category: document category and elements before optional information.


Notes: Really confusing. Make a link between the ordering and the descriptive analysis. Why is something oblig vs opti

6. Text Generation
  • Use a small grammar to realize the messages
  • Referring Expression Issues
      • Size of referring expressions
      • Re-ordering documents in the set
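As a rough illustration of realization, the sketch below turns messages into sentences with fixed templates. Templates are a simplification standing in for the small grammar the slide mentions, and the template wording, `TEMPLATES`, `join_and` and `realize` are all hypothetical rather than Centrifuser's actual output.

```python
def join_and(items):
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


# One hypothetical sentence template per relation.
TEMPLATES = {
    "category-elements":
        lambda a: f"The {a['docCat']} documents are {join_and(a['element'])}.",
    "has-topics":
        lambda a: f"Their topics include {join_and(a['topic'])}.",
}


def realize(messages):
    """Realize (relation, args) messages in the given order as one paragraph."""
    return " ".join(TEMPLATES[relation](args) for relation, args in messages)


text = realize([
    ("category-elements", {"docCat": "atypical",
                           "element": ["AMA Guide", "CU Guide"]}),
    ("has-topics", {"docCat": "atypical", "topic": ["definition", "risks"]}),
])
print(text)
```

A real grammar would also have to handle the referring expression issues listed above, for example shortening a document name after its first mention, which fixed templates do not attempt.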


Task Based Evaluation
Scenario: You've been diagnosed with cancer

Compare against 3 real-world systems
  • IR engine (google);
  • Hub (yahoo);
  • Human expert (about.com).

Goals
  • Evaluate on subjective criteria, use think aloud techniques
  • See which document features best fit user need

Pilot study complete; full study going on now


Conclusion
  • An application of summarization for IR
  • Performs informative and indicative summarization
  • By using extraction and text generation techniques
  • To support browsing and searching

http://centrifuser.cs.columbia.edu
