A SUMMARIZATION JOURNEY
Search and Information Extraction Lab
IIIT Hyderabad
Information Overload
Explosive growth of information on web
Failure of information retrieval systems to satisfy the user’s information need
Need for sophisticated information access solutions
Summarization
A summary is a condensed version of source document(s) having a recognizable genre and a specific purpose: to give the reader an exact and concise idea of the contents of the source.
Text interpretation
Extraction of Relevant information
Condensing Extracted Information
Summary Generation
Flavors of Summarization
Progressive
Single document
Query Focused
Opinion/Sentiment
Code
Comparative
Guided
Personalized
Extract Vs. Abstract
Extract: An extract is a summary consisting entirely of material from the input text.
Abstract: An abstract is a summary at least some of whose material is not present in the input, e.g. paraphrases of content, subject categories.
Towards Abstraction
Personalized, Cross Lingual Summarization
Guided Summarization, Code Summarization, Comparison Summarization
Blog Summarization, Progressive Summarization
Abstractive
Single Document, Query Focused Multi Document Summarization
Technological Aspects
Summarization
Support Vector Regression
Relevance based Language Models
External Knowledge – Web, Wikipedia
User Modeling
Statistics – word and document
Similarity measures, Novelty detection
Graph Clustering – Topic identification
EXTRACTIVE SUMMARIZERS
Query Focused Summarization
Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system
Query Dependent ranking: Relevance Based Language Models (RBLM), Language Models (PHAL)
Query Independent ranking: Sentence Prior
RBLM is an IR approach that computes the conditional probabilities of relevance from document and query
PHAL is a probabilistic extension to HAL spaces
HAL constructs dependencies of a term w on other terms based on their occurrence in its context in the corpus
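The HAL idea above can be sketched in a few lines. This is a minimal illustration, not the lab's implementation: the window size, the linearly decaying weight, and the row normalization used to obtain probabilities are all assumptions, and the actual PHAL formulation may normalize differently.

```python
from collections import defaultdict

def build_hal(tokens, window=3):
    """Accumulate HAL-style weighted co-occurrence counts.

    For each token, every preceding token within `window` positions
    contributes a weight that decreases linearly with distance.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break
            hal[word][tokens[j]] += window - d + 1
    return hal

def normalize_rows(hal):
    """Turn each row of weights into a distribution (assumed normalization)."""
    probs = {}
    for word, row in hal.items():
        total = sum(row.values())
        probs[word] = {ctx: weight / total for ctx, weight in row.items()}
    return probs

tokens = "the cat sat on the mat".split()
hal = build_hal(tokens, window=2)
probs = normalize_rows(hal)
```

Each row of `probs` can then be read as an estimate of how strongly a term depends on the terms that appear in its context.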
DUC Performance
38 systems participated in 2006
Significant difference between first two systems
Extract vs. Abstract Summarization
We conducted a study (post TAC 2006)
Generated the best possible extracts
Calculated the scores for these extracts
Evaluation with respect to the reference summaries
                 ROUGE-2    ROUGE-SU4
Human Answers    0.1025     0.1624
Best Answers     0.09965    0.15407
HAL Feature      0.07618    0.13805
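For readers unfamiliar with the metric, the sketch below shows roughly what a ROUGE-2 score measures: clipped bigram overlap between a candidate summary and a reference. It is a simplified recall variant under stated assumptions (single reference, whitespace tokenization, no stemming or stopword handling) and is not the official ROUGE toolkit used to produce the numbers in the table.

```python
from collections import Counter

def bigrams(text):
    """Count adjacent word pairs in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(candidate, reference):
    """Fraction of reference bigrams that also occur in the candidate (clipped)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
score = rouge2_recall(candidate, reference)
```

ROUGE-SU4 follows the same pattern but counts unigrams plus word pairs separated by up to four intervening words.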
Cross Lingual Summarization
A bridge between CLIR and MT
Extended our mono-lingual summarization framework to a cross-lingual setting in the RBLM framework
Designed a cross-lingual experimental setup using DUC 2005 dataset
Experiments were conducted for Telugu-English language pair
Comparison with the mono-lingual baseline shows about 90% of its performance in ROUGE-SU4 and about 85% in ROUGE-2 F-measure
Progressive Summarization
Emerging area of research in summarization
Summarization with a sense of prior knowledge
Introduced as “Update Summarization” at DUC 2007, TAC 2008, TAC 2009
Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles.
To keep track of temporal news stories
Key challenge
To detect information that is not only relevant but also new given the prior knowledge of reader
Relevant and new vs. non-relevant and new vs. relevant and redundant
Three level approach to Novelty Detection
Sentence Scoring
Developing new features that capture novelty along with relevance of a sentence
NF, NW
Ranking
Sentences are re-ranked based on the amount of novelty they contain
ITSim, CoSim
Summary Generation
A selected pool of sentences that contain novel facts. All remaining sentences are filtered out
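The re-ranking and filtering levels above can be sketched as follows. The slides do not define NF, NW, ITSim, or CoSim, so this example substitutes a plain bag-of-words cosine similarity as a stand-in; the `threshold` parameter and the "one minus maximum similarity" novelty score are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def novelty(sentence, prior_sentences):
    """Novelty as 1 minus the similarity to the closest already-read sentence."""
    if not prior_sentences:
        return 1.0
    return 1.0 - max(cosine(sentence, p) for p in prior_sentences)

def rerank_and_filter(candidates, prior_sentences, threshold=0.5):
    """Drop sentences below the novelty threshold, then sort by novelty."""
    scored = [(novelty(s, prior_sentences), s) for s in candidates]
    kept = [(n, s) for n, s in scored if n >= threshold]
    return [s for n, s in sorted(kept, key=lambda pair: pair[0], reverse=True)]

prior = ["the storm hit the coast on monday"]
candidates = [
    "the storm hit the coast on monday",
    "rescue teams reached the flooded villages on tuesday",
]
update = rerank_and_filter(candidates, prior)
```

In a full system the novelty score would be combined with a relevance score, since a sentence can be new without being relevant.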
Evaluations
TAC 2008 Update Summarization data for training: 48 topics
Each topic is divided into clusters A and B, with 10 documents each
The summary for cluster A is a normal summary and the summary for cluster B is an update summary
TAC 2009 Update Summarization data for testing: 44 topics
Baseline summarizer generates the summary by picking the first 100 words of the last document
Run1 – DFS + SL1
Run2 – PHAL + KL
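The baseline described above is simple enough to write down directly. A minimal sketch, assuming the documents are passed in chronological order (so the last one is the most recent) and that words are split on whitespace:

```python
def baseline_summary(documents, max_words=100):
    """Return the first `max_words` words of the last document in the list."""
    if not documents:
        return ""
    words = documents[-1].split()
    return " ".join(words[:max_words])

# Toy cluster: the second document stands in for the most recent article.
cluster = ["an earlier article", " ".join(str(i) for i in range(150))]
summary = baseline_summary(cluster)
```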
Personalized Summarization
Perception of text differs with the background of the reader
Need for incorporating user background in the summarization process
Summarization is not only a function of the input text but also of the reader
Example: the word “Serve” is read differently by a tennis player, a hotel manager, and a politician
Web-based profile creation: personal information available on the web, e.g. a conference page, a project page, an online paper, or even a weblog
Estimate a user model P(w|Mu) to incorporate the user in the sentence extraction process
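One simple way to picture the user model is as a smoothed unigram distribution estimated from the user's web profile, which is then used to score candidate sentences. The sketch below is an assumption-laden illustration: the additive smoothing, the average log-probability score, and the toy profile text are not taken from the slides.

```python
import math
from collections import Counter

def estimate_user_model(profile_texts, vocabulary, alpha=1.0):
    """Additive-smoothed unigram estimate of P(w|Mu) over `vocabulary`."""
    counts = Counter(
        w for text in profile_texts for w in text.lower().split() if w in vocabulary
    )
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def sentence_score(sentence, user_model):
    """Average log-probability of the sentence's in-vocabulary words."""
    words = [w for w in sentence.lower().split() if w in user_model]
    if not words:
        return float("-inf")
    return sum(math.log(user_model[w]) for w in words) / len(words)

sentences = [
    "the player won the match with a strong serve",
    "the waiter will serve dinner in the hotel restaurant",
]
vocabulary = {w for s in sentences for w in s.split()}
# Hypothetical profile text for a tennis player.
tennis_profile = ["match report serve ace player ranking", "player wins match"]
model = estimate_user_model(tennis_profile, vocabulary)
ranked = sorted(sentences, key=lambda s: sentence_score(s, model), reverse=True)
```

With a tennis-oriented profile the sentence about the match ranks first; a hotel manager's profile would be expected to favour the other reading of the same word.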
Opinion Summarization
Sentiment Analysis
User-generated content is growing rapidly through blogs
Sentiment analysis provides better access to information
Sentiment
Textual information on the Web can be categorized as facts and opinions
Computational study of opinions and sentiments from a market perspective
Optimization of sentiment in the summary to the maximum extent
Sentiment summarization as a two stage classification problem at sentence level
Stage 1: Opinion/fact classification
Stage 2: Polarity Estimation (Positive/Negative)
SEMI ABSTRACTIVE SUMMARIZERS
Comparative Summarization
Summaries for comparing multiple items belonging to a category
The category “Mobile phones” will have “Nokia” and “BlackBerry” as its items
Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item, e.g. “Memory”, “Display”, “Battery Life”
Comparative Summaries Generation
Attribute Extraction
Find the attributes of the product class
Attribute Ranking
Rank the attributes according to importance in comparison
Summary Generation
Find the occurrence of attributes in various products
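The ranking and generation steps above can be sketched as a small table builder. This assumes attribute extraction has already been done and uses "number of products that mention the attribute" as the importance measure, which is a stand-in rather than the ranking used in the actual system. The product names and values are placeholders, not real specifications.

```python
from collections import Counter

def rank_attributes(products):
    """Rank attributes by how many products mention them (assumed importance proxy)."""
    coverage = Counter(attr for attrs in products.values() for attr in attrs)
    return [attr for attr, _ in sorted(coverage.items(), key=lambda kv: (-kv[1], kv[0]))]

def comparative_summary(products, top_k=2):
    """Build rows of (attribute, {product: value or None}) for the top attributes."""
    rows = []
    for attr in rank_attributes(products)[:top_k]:
        rows.append((attr, {name: attrs.get(attr) for name, attrs in products.items()}))
    return rows

# Placeholder values for illustration only.
products = {
    "Phone A": {"Memory": "value-1", "Battery Life": "value-2", "Display": "value-3"},
    "Phone B": {"Memory": "value-4", "Battery Life": "value-5"},
}
summary = comparative_summary(products)
```

Each row pairs one attribute with its value for every product, leaving `None` where a product does not mention the attribute.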
Guided Summarization
Query Focused Summarization
User’s information need expressed as a query along with a narrative
Set of documents related to the topic
Goal is to produce a short coherent summary focusing on the answer to the query
Guided Summarization
Each topic is classified into a set of predefined categories
Each category has a template of important aspects about the topic
Summary is expected to answer all the aspects of template while containing other relevant information
Guided summarization
Encourage deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts
Shares similarity with information extraction
Specific information from unstructured text is identified and consequently classified into a set of semantic labels (templates)
Makes information more suitable for other information processing tasks
A guided summarization system has to produce a readable summary encompassing all the information about the templates
Very few investigations exploring the potential of merging summarization with information extraction techniques
Our approach
Building a domain model
Essential background knowledge for information extraction
Sentence Annotations
To identify sentences having answers to aspects of the template
Concept Mining
To use semantic concepts instead of words to calculate sentence importance
Summary Extraction
Modification of the summary extraction algorithm to adapt to the requirements using sentence annotations
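One possible shape for an annotation-aware extraction step is a greedy selector that first tries to cover every template aspect and only then falls back on sentence importance. The sketch below is hypothetical: the aspect labels, the importance scores, and the greedy strategy are illustrative assumptions, not the modified algorithm referred to above.

```python
def select_sentences(sentences, template, max_words=100):
    """Greedy extraction that favours sentences answering uncovered aspects.

    `sentences` is a list of (text, aspects, score) tuples, where `aspects`
    is the set of template aspects the sentence was annotated with.
    Returns the chosen sentences and the aspects left unanswered.
    """
    remaining = set(template)
    pool = list(sentences)
    summary, used = [], 0
    while pool:
        # Prefer new aspect coverage, then the sentence's own importance score.
        best = max(pool, key=lambda s: (len(s[1] & remaining), s[2]))
        pool.remove(best)
        length = len(best[0].split())
        if used + length > max_words:
            continue
        summary.append(best[0])
        used += length
        remaining -= best[1]
    return summary, remaining

# Hypothetical aspect labels and scores.
template = {"WHAT", "WHEN", "WHERE"}
sentences = [
    ("A train derailed near the station.", {"WHAT", "WHERE"}, 0.9),
    ("The derailment happened on Friday morning.", {"WHEN"}, 0.6),
    ("Officials described the derailment of the train.", {"WHAT"}, 0.8),
]
summary, missing = select_sentences(sentences, template, max_words=12)
```

Here the lower-scored "WHEN" sentence is chosen ahead of the higher-scored but redundant "WHAT" sentence, which is the behaviour the template requirement asks for.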
THANKS