Text Summarization
Jagadish M (07305050), Annervaz K M (07305063), Joshi Prasad (07305047), Ajesh Kumar S (07305065), Shalini Gupta (07305R02)
Introduction
Summary: Brief but accurate representation of the contents of a document
Goal: Take an information source, extract the most important content from it and present it to the user in a condensed form and in a manner sensitive to the user’s needs.
Compression: the ratio of the length of the summary to the length of the source.
MSWord AutoSummarize
Presentation Outline
Motivation
Different Genres
Simple Statistical Techniques
Degree Centrality
LexRank
Lexical/Co-reference Chains
Rhetorical Structure Theory
WordNet Based Methods
DUC/TAC
Motivation
Abstracts for Scientific and other articles
News summarization (mostly Multiple document summarization)
Classification of articles and other written data
Web pages for search engines
Web access from PDAs, cell phones
Question answering and data gathering
Genres
Indicative vs. informative: used for quick categorization vs. content processing
Extract vs. abstract: lists fragments of text vs. re-phrases content coherently
Generic vs. query-oriented: provides the author's view vs. reflects the user's interest
Background vs. just-the-news: assumes the reader's prior knowledge is poor vs. up-to-date
Single-document vs. multi-document source: based on one text vs. fuses together many texts
Statistical scoring
Scoring techniques:
Word frequencies throughout the text (Luhn58)
Position in the text (Edmundson69)
Title method (Edmundson69)
Cue phrases in sentences (Edmundson69)
Luhn58
Important words occur fairly frequently
Earliest work in field
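Luhn's idea can be sketched in a few lines: score each sentence by the document-wide frequencies of its non-stop words. A minimal sketch; the sample sentences, stop-word list, and tokenizer are illustrative, not from Luhn's paper.

```python
# Luhn-style sentence scoring: sentences containing words that occur
# frequently in the whole document get higher scores.
import re
from collections import Counter

STOP = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "for"}

def luhn_scores(sentences):
    # document-wide frequencies of significant (non-stop) words
    words = [w for s in sentences for w in re.findall(r"[a-z]+", s.lower())
             if w not in STOP]
    freq = Counter(words)
    scores = []
    for s in sentences:
        toks = [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP]
        # average frequency of the sentence's significant words
        scores.append(sum(freq[w] for w in toks) / (len(toks) or 1))
    return scores

sentences = [
    "Important words occur frequently in the text.",
    "A summary keeps the sentences with the most frequent words.",
    "Unrelated filler sentences score lower.",
]
print(luhn_scores(sentences))
```

A real system would normalize for sentence length and position as in Edmundson69.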
Statistical Approaches (contd.)
Degree Centrality LexRank Continuous LexRank
Degree Centrality
Problem formulation:
Represent each sentence by a vector
Denote each sentence as a node of a graph
Cosine similarity determines the edges between nodes
Since we are interested in significant similarities, we can eliminate some low values in this matrix by defining a threshold.
Compute the degree of each sentence
Pick the nodes (sentences) with high degrees
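The steps above can be sketched as follows: term-frequency vectors, cosine similarity, a threshold to keep only significant edges, then the degree of each node. The threshold value 0.1 and the sample sentences are illustrative.

```python
# Degree centrality over a cosine-similarity sentence graph.
import math
import re
from collections import Counter

def tf_vector(sentence):
    # bag-of-words term-frequency vector
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def degree_centrality(sentences, threshold=0.1):
    vecs = [tf_vector(s) for s in sentences]
    degree = [0] * len(vecs)
    for i, u in enumerate(vecs):
        for j, v in enumerate(vecs):
            # keep only significant similarities (above the threshold)
            if i != j and cosine(u, v) > threshold:
                degree[i] += 1
    return degree

sentences = [
    "Mars experiences frigid weather conditions.",
    "Martian weather involves blowing dust.",
    "The summary picks central sentences.",
]
print(degree_centrality(sentences))
```

The sentences with the highest degrees are then picked for the summary.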
Disadvantage of the Degree Centrality approach: every edge counts equally, so a sentence's score ignores how central its neighbours themselves are
LexRank
Centrality vector p, which gives the LexRank of each sentence (similar to PageRank), is defined by p = B^T p, i.e. p is the stationary distribution of the Markov chain with transition matrix B
What Should B Satisfy?
Stochastic matrix (Markov chain property)
Irreducible
Aperiodic
Perron-Frobenius Theorem
An irreducible and aperiodic Markov chain is guaranteed to converge to a stationary distribution
Reducibility: a Markov chain is irreducible if every state is reachable from every other state
Aperiodicity: a state's period is the gcd of the lengths of all paths returning to it; the chain is aperiodic if every state has period 1
B is a stochastic matrix
Is it an irreducible and aperiodic matrix?
Dampening (Page et al. 1998)
Matrix form of p with dampening: p = [dU + (1 − d)B]^T p, where U is the square matrix with all entries 1/N and d is the damping factor
Solve for p using the power method
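The dampened update can be solved by the power method, as sketched below. B must be a stochastic (row-normalized) similarity matrix; the damping factor d = 0.85 and the 3x3 toy matrix are illustrative values.

```python
# Damped LexRank solved by power iteration:
# p(i) = d/N + (1 - d) * sum_j B[j][i] * p(j)
def lexrank(B, d=0.85, iters=50):
    n = len(B)
    p = [1.0 / n] * n                    # start from the uniform vector
    for _ in range(iters):
        p = [d / n + (1 - d) * sum(B[j][i] * p[j] for j in range(n))
             for i in range(n)]
    return p

# Toy 3-sentence stochastic matrix (each row sums to 1).
B = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
print(lexrank(B))
```

Because B is stochastic and p starts as a distribution, p keeps summing to 1 at every step.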
Continuous LexRank: instead of thresholding, use the cosine similarity values themselves (row-normalized) as edge weights
Linguistic/Semantic Methods
Co-reference /Lexical Chain Rhetorical Analysis
Co-reference/Lexical Chains
Assumption/Observation :- Important parts in a text will be more related in a semantic interpretation
Co-reference / Lexical Chains (Object-Action, Part-of relation, Semantically related)
Important sentences will be traversed by a greater number of such chains
Co-reference/Lexical Chains
Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient
Rhetorical Structure Theory
Mann & Thompson 88
Rhetorical relation: holds between two non-overlapping text snippets
Nucleus: the core idea, the writer's purpose
Satellite: referred to in context of the nucleus, for justifying, evidencing, contradicting, etc.
The nucleus of a rhetorical relation is comprehensible independent of the satellite, but not vice versa
Not all rhetorical relations are nucleus-satellite relations; Contrast is a multinuclear relation
Example: evidence [The truth is that the pressure to smoke in 'junior high' is greater than it will be any other time of one’s life:][ we know that 3,000 teens start smoking each day.]
Rhetorical Structure Theory
Rhetorical parsing:
Breaks the text into elementary units
Uses cue phrases (discourse markers) and a notion of semantic similarity to hypothesize rhetorical relations
Rhetorical relations can be assembled into rhetorical structure trees (RS-trees) by recursively applying individual relations across the whole text
[RS-tree figure (tree structure lost in extraction). The example text, segmented into elementary units:
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions.
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles.
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure.
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.
The tree links these units with relations including Elaboration, Background, Justification, Example, Contrast, Evidence, Cause, Concession, and Antithesis.]
RST Based Summarization
Multiple RS-trees
A built RS-tree captures the relations in the text and can be used for high-quality summarization
Pick the 'K' nodes nearest to the root
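Picking the K units nearest the root can be sketched as a breadth-first walk that explores each nucleus before its satellite. The nested-tuple tree encoding (relation, nucleus, satellite) and the sample tree are illustrative assumptions, not Marcu's actual data structures.

```python
# Select the K elementary text units closest to the root of an RS-tree.
# Encoding (assumed for illustration): a leaf is a string, an internal
# node is a tuple (relation, nucleus, satellite).
from collections import deque

def top_k_units(tree, k):
    selected, queue = [], deque([tree])
    while queue and len(selected) < k:
        node = queue.popleft()
        if isinstance(node, str):       # leaf: an elementary text unit
            selected.append(node)
        else:
            relation, nucleus, satellite = node
            queue.append(nucleus)       # nucleus is explored before satellite
            queue.append(satellite)
    return selected

tree = ("Elaboration",
        "Mars has frigid weather.",
        ("Evidence",
         "Temperatures average -60 C.",
         "They dip to -123 C near the poles."))
print(top_k_units(tree, 2))
```

Exploring nuclei first reflects the RST idea that the nucleus carries the writer's main point.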
Disadvantages
WordNet based Approach for Summarization
Preprocessing of text
Constructing sub-graph from WordNet
Synset Ranking
Sentence Selection
Principal Component Analysis
Preprocessing
Break text into sentences
Apply POS tagging
Identify collocations in the text
Remove the stop words
Sequence is important
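The first and last steps can be sketched with the standard library alone; POS tagging and collocation detection need a real NLP toolkit (e.g. NLTK). The stop-word list, tokenizer, and sample text here are illustrative.

```python
# Minimal preprocessing sketch: sentence splitting and stop-word removal.
import re

STOP = {"the", "a", "an", "is", "of", "to", "and", "in"}

def preprocess(text):
    # break the text into sentences on terminal punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # tokenize each sentence and drop the stop words
    tokens = [[w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP]
              for s in sentences]
    return sentences, tokens

sents, toks = preprocess("Mars is cold. The atmosphere is thin.")
print(sents, toks)
```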
Constructing sub-graph from WordNet
Mark all the words and collocations in the WordNet graph which are present in the text
Traverse the generalization edges up to a fixed depth, and mark the synsets you visit
Construct a graph containing only the marked synsets
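The fixed-depth traversal can be sketched as below. HYPERNYM is a toy stand-in for WordNet's generalization (hypernym) edges; with NLTK one would follow synset.hypernyms() instead. Words absent from the graph are skipped.

```python
# Mark all nodes reachable from the text's words by following
# generalization edges up to a fixed depth.
HYPERNYM = {
    "dog": ["canine"], "canine": ["mammal"], "mammal": ["animal"],
    "cat": ["feline"], "feline": ["mammal"],
}

def mark_subgraph(words, depth):
    marked = {w for w in words if w in HYPERNYM}   # words found in the graph
    frontier = set(marked)
    for _ in range(depth):                         # fixed-depth traversal
        nxt = set()
        for node in frontier:
            for parent in HYPERNYM.get(node, []):
                if parent not in marked:
                    marked.add(parent)
                    nxt.add(parent)
        frontier = nxt
    return marked

print(mark_subgraph(["dog", "cat", "rover"], depth=2))
```

The sub-graph then keeps only the marked nodes and the edges among them.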
Synset Ranking
Rank synsets based on their relevance to text
Construct a rank vector R with one entry per node of the graph, each initialized to 1/√n, where n is the number of nodes in the graph
Create an authority matrix: A(i,j) = 1/num_of_predecessors(j) if j is a child of i, and 0 otherwise
Update the R vector iteratively as R ← A·R, normalizing R after each step (power iteration)
A higher value implies a better rank and higher relevance
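The iterative step can be sketched as a power iteration on A with renormalization. The 3x3 matrix below is toy data to exercise the update, not a real child/predecessor structure from WordNet, and the update rule R ← A·R is our reading of the slide.

```python
# Iterative synset ranking: repeatedly apply the authority matrix A
# to the rank vector R and renormalize to unit length.
import math

def rank_synsets(A, iters=30):
    n = len(A)
    R = [1.0 / math.sqrt(n)] * n          # each entry initialized to 1/sqrt(n)
    for _ in range(iters):
        R = [sum(A[i][j] * R[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in R)) or 1.0
        R = [x / norm for x in R]         # keep R at unit length
    return R

A = [[0.0, 1.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
print(rank_synsets(A))
```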
Sentence Selection
Construct a matrix, M with m rows and n columns
m is number of sentences and n is number of nodes
For each sentence Si
Traverse graph G, starting with words present in Si and following generalization edges
Find set of reachable synsets, SYi
For each syij ∈ SYi
set M[Si][syij] to rank of syij calculated in previous step
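The construction of M can be sketched as follows: for each sentence, follow generalization edges from its words and write the precomputed rank into the columns of the reachable synsets. HYPERNYM, RANK, and the sample sentences are toy data standing in for WordNet and the rank vector of the previous step.

```python
# Build the sentence-synset matrix M (m sentences x n synsets).
HYPERNYM = {"dog": ["mammal"], "cat": ["mammal"], "mammal": ["animal"]}
RANK = {"dog": 0.2, "cat": 0.2, "mammal": 0.5, "animal": 0.4}

def reachable(words):
    # all nodes reachable from the sentence's words via generalization edges
    seen, frontier = set(), [w for w in words if w in RANK]
    while frontier:
        node = frontier.pop()
        if node not in seen:
            seen.add(node)
            frontier.extend(HYPERNYM.get(node, []))
    return seen

def build_M(sentences, synsets):
    M = []
    for words in sentences:
        reach = reachable(words)
        # M[i][j] = rank of synset j if sentence i reaches it, else 0
        M.append([RANK[s] if s in reach else 0.0 for s in synsets])
    return M

synsets = ["dog", "cat", "mammal", "animal"]
sentences = [["the", "dog", "runs"], ["a", "cat", "sleeps"]]
print(build_M(sentences, synsets))
```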
Principal Component Analysis
Apply PCA on matrix M to get the set of principal components (eigenvectors)
The eigenvalue of each eigenvector is a measure of that eigenvector's relevance to the meaning of the text
Sort the eigenvectors by their eigenvalues
For each eigenvector, find its projection on each sentence
Select the top n_select(i) sentences for each eigenvector
n_select(i) is proportional to the eigenvalue of eigenvector i:
n_select(i) ∝ λ_i / Σ_j λ_j, where λ_i is the eigenvalue corresponding to eigenvector i
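A simplified pure-Python sketch of this step: extract only the top principal component of M by power iteration and rank sentences by the magnitude of their projection on it. The matrix M is toy data, and using a single component simplifies the multi-component, eigenvalue-proportional selection described above.

```python
# Rank sentences by their projection on the top principal component of M.
import math

def top_component(M, iters=100):
    n = len(M[0])
    # mean-center the columns and form their covariance-style matrix C
    means = [sum(row[j] for row in M) / len(M) for j in range(n)]
    X = [[row[j] - means[j] for j in range(n)] for row in M]
    C = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(n)]
         for i in range(n)]
    # power iteration for the dominant eigenvector of C
    v = [1.0] * n
    for _ in range(iters):
        v = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return v

def select_sentences(M, k):
    v = top_component(M)
    # score each sentence by |projection on v|, then take the top k
    proj = sorted(((abs(sum(M[i][j] * v[j] for j in range(len(v)))), i)
                   for i in range(len(M))), reverse=True)
    return [i for _, i in proj[:k]]

M = [[4.0, 0.0],
     [0.0, 1.0],
     [3.0, 0.0]]
print(select_sentences(M, 2))
```

A full implementation would keep several eigenvectors and allot each a sentence quota proportional to its eigenvalue.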
Document Understanding Conference (DUC)
Text Analysis Conference (TAC)
Interest and activity aimed at building powerful multi-purpose information systems
Evaluation results of various summarization techniques
www-nlpir.nist.gov/projects/duc/data.html
Human Summary of Our Presentation :)
What is Text Summarization?
Why Text Summarization?
Methods of Summarization:
LexRank
Lexical Chains
Rhetorical Structure Theory
WordNet Based
Challenges ahead..
Ensuring text coherency: sentences may have dangling anaphors
Summarizing non-textual data
Handling multiple sources effectively
High reduction rates are needed
Achieving human-quality summarization!!
References
Erkan, G. and D. R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479.
Barzilay, R. and M. Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, 10–17. Madrid, Spain.
Mann, W.C. and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8(3), 243–281. Also available as USC/Information Sciences Institute Research Report RR-87-190.
Baldwin, B. and T. Morton. 1998. Coreference-Based Summarization. In T. Firmin Hand and B. Sundheim (eds). TIPSTER-SUMMAC Summarization Evaluation. Proceedings of the TIPSTER Text Phase III Workshop. Washington.
Marcu, D. 1998. Improving Summarization Through Rhetorical Parsing Tuning. Proceedings of the Workshop on Very Large Corpora. Montreal, Canada.
Ramakrishnan and Bhattacharyya. 2003. Text Representation with WordNet Synsets. Eighth International Conference on Applications of Natural Language to Information Systems (NLDB 2003).
Bellare, Anish S., Atish S., Loiwal, Bhattacharya, Mehta, Ramakrishnan. 2004. Generic Text Summarization using WordNet.
Inderjeet Mani and Mark T. Maybury (eds). Advances in Automatic Text. Summarization. MIT Press, 1999. ISBN 0-262-13359-8.
www.wikipedia.com
Thank You