EXTRACTION BASED AUTOMATIC SUMMARIZATION

Shashank Singh, University of Southern California
Dimitris Stripelis, University of Southern California
Nishant Parikh, University of Southern California

1. INTRODUCTION
It is undeniable that in today's online world the volume of information we receive on a daily basis is staggering. Millions of documents and articles are created every day, while the overall capacity of the Web grows at an exponential pace. It is estimated that each individual receives 54,000 words on average just through reading online content. Tools that ease digestion and provide timely access to numerous diverse sources, alleviating the information overload people face, are therefore a necessity rather than a luxury. A key component for addressing this problem is automatic summarization. Since most contemporary summarization methods do not perform at the expected level, we want to provide a set of summarization mechanisms that can earn people's trust and are flexible enough to be employed for text analysis and compression.

Types of Automatic Summarization
In the current literature several distinctions are made in the automatic summarization process. Two of the most widely used approaches are abstraction and extraction. Extractive summaries (extracts) are produced by concatenating several sentences taken exactly as they appear in the information source. Abstractive summaries (abstracts) are written to convey the main information in the input and may reuse phrases or clauses from it, but are overall expressed in the words of the summary author. In the present paper we introduce the implementation and evaluation of three different approaches to extraction-based summarization. The algorithms we developed, and evaluated against a baseline approach and other well-known online summarization tools, are K-Medoid, LexRank and LSA.

Methods of Evaluation
Most of the time, the flow of information in a given document or article is not uniformly distributed, so some parts of the document are more important than others. As a consequence, one of the major challenges in automatic summarization lies in the ability to distinguish the more informative parts of a document from the less informative ones. Another challenging task in automatic summarization is evaluation. Evaluation in automatic summarization is classified into two separate categories, extrinsic and intrinsic. Extrinsic evaluation techniques measure the impact of the summary with task-oriented techniques (e.g. a task related to information extraction), while intrinsic metrics judge the quality of a text summary by comparing it to a reference summary (gold standard) manually created by human experts. Since extrinsic evaluations are expensive and require very careful planning, they are not suitable for system comparisons and for evaluation during development. Therefore, in this paper we produce our evaluation metrics by employing the intrinsic approach.

Challenges
Previous research has observed that inter-human agreement in creating extracts from documents is quite low (~40% for single documents and ~29% for multi-document collections). Because of this, developing an efficient supervised learning technique that generates trustworthy results is particularly difficult. In addition, building an effective summarizer is limited by constraints that arise from the diversity of the existing corpus. Specifically, even though a document or article may belong to a superset of categories, it can also be characterized by multiple other subset categories, which in turn may be superset categories themselves. For instance, even if a document belongs to the sports category, its central point could be a computing device that tracks an individual's training metrics. In such cases, a system that produces summaries for general-purpose categories may not be highly effective.

2. RELATED WORK

A Frequent Term and Semantic Similarity based Single Document Text Summarization Algorithm
Naresh Kumar Nagwani, Asst. Professor, Department of CS&E, NIT Raipur; Dr. Shrish Verma, Asst. Professor, Department of CS&E, NIT Raipur
In this paper, a frequent-term based text summarization algorithm is designed and implemented. The algorithm works in three steps. In the first step, the document to be summarized is processed by eliminating stop words and applying stemming. In the second step, term-frequency data is calculated from the document and frequent terms are selected; for these selected terms, semantically equivalent terms are also generated. Finally, in the third step, all sentences in the document that contain the frequent and semantically equivalent terms are filtered for summarization. The paper approaches the problem with a clear and concise method, and its results are quite promising. Given a single document, however, this approach may or may not produce a good result, because the summary of a document depends on factors other than its frequent terms. The proposed method of evaluation is outdated and cannot be trusted for a single-document summarizer, and the summary may end up containing sentences with redundant information. In our K-Medoid based approach, we use the DISCO API to generate semantically similar words as proposed in this paper. However, we also use Jaccard similarity, Jaccard distance and the K-Medoid algorithm on top of it. This approach also tackles the problem of eliminating redundant information in the final summary by applying K-Medoid clustering to the sentences. Our method uses the DUC2002 dataset, whereas the method proposed in this paper uses the TIPSTER SUMMAC corpus for evaluating the summarizer.

LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
Güneş Erkan, Dragomir R. Radev, Department of EECS, School of Information, University of Michigan, Ann Arbor, MI 48109, USA
The automatic summarization approach proposed in this paper focuses on assessing the centrality of each sentence inside a cluster and on investigating different ways of defining the lexical centrality principle in multi-document summarization. Inspired by this technique, we modify the centrality evaluation to operate on the sentences inside a single document, given that our implementation concentrates on producing single-document summaries. Moreover, the reference paper considers a cluster of related documents: since many sentences of the cluster are expected to be somewhat similar to each other because they address the same topic, the goal is to identify the sentences with the most significant similarities. Following this approach, we set our threshold to 0.1 and discard any similarity pairs that fall below this limit. In addition, in order to calculate the similarity between any two sentences, we extend the cosine distance computation by also considering the 5 most similar words for every word of the sentence, using the results from the DISCO API.

Text Summarization within the Latent Semantic Analysis Framework: Comparative Study
Rasha Mohammed Badry, Ahmed Sharaf Eldin, Doaa Saad Elzanfally
This paper compares various approaches to summarizing multiple documents using LSA. It mostly describes generic LSA and the intuition behind its application to document summarization. For our implementation we followed Gong and Liu's approach (2001) of selecting the best sentence for each concept in the Vt matrix. Furthermore, we selected the number of sentences taken from each concept based on its relative weight (Σ) compared to the other concepts, following Murray, Renals and Carletta's approach (2005). Lastly, we applied OzSoy's approach (2010), the Cross method: take the average of each concept (row of Vt) and zero out the values in the concept that fall below the average, in order to remove below-average similarity between sentences that may arise from stop-word matching and other unimportant word matching, and then select sentences based on length. During our development, we applied the Cross method with a few modifications:

• Stopword removal.
• POS tagging of sentences to assign more weight to nouns than to verbs, as nouns play a major role in summarization.
• Using the DISCO API to generate the 5 most similar words for each noun, to better calculate the similarity between sentences.
• Using the Cross method, but zeroing out values below the median (to reduce the effect of extremes).

 3. DATASETS

The datasets that we used in our implementation are arranged into three categories.

Training Data
The training data consist of the Open American National Corpus (OANC, from the paths /data/written_1/ and /data/written_2/) and the Wikipedia dump articles (enwiki 2014-0811 /-1008 /-1208, 20150205). The total number of documents and articles from the OANC dataset is 2,475, with an overall size of ~50MB. Because of the small size of the OANC dataset, we also collected articles from the Wikipedia database. Since the Wikipedia corpus is provided in XML format, we used an open-source tool (Annotated Wiki-Extractor) to retrieve the required documents in JSON format, and afterwards extracted the raw text of each document and article (see the sketch at the end of this section). The total number of Wikipedia documents and articles we collected is 498,991, with a combined size of ~1.89GB. Due to system limitations, the actual training dataset we use is ~500MB; it includes the complete OANC dataset and a random selection of articles and documents from the Wikipedia corpus.

Staging Data
During the implementation of all the summarization techniques we employ the DISCO (extracting DIstributionally related words using CO-occurrences) API to retrieve the semantic similarity between arbitrary words. The corpus used for this purpose during development is the British National Corpus (BNC), with a total size of ~1.6GB.

Testing Data
The Document Understanding Conferences (DUC) dataset of 2002 is used for benchmarking the performance of our implementation. The 2002 DUC corpus is a comprehensive collection of 557 documents together with manually annotated summaries written by human experts. For evaluating our implementation, a total of around 100 documents were used.
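As an illustration of the raw-text extraction step mentioned under Training Data, the following minimal Python sketch reads the Wiki-Extractor JSON output; the directory layout and the "text" field name are assumptions about the extractor's output format rather than documented facts.

import json
import os

def load_wiki_articles(json_dir):
    # Yield the raw text of every extracted Wikipedia article.
    # Assumes each file in json_dir holds one JSON object per line with a
    # "text" field; adjust the field name if the actual dump differs.
    for name in os.listdir(json_dir):
        path = os.path.join(json_dir, name)
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                article = json.loads(line)
                text = article.get("text", "")
                if text:
                    yield text

# Example usage (hypothetical directory name):
# for raw_text in load_wiki_articles("wiki_json/"):
#     do_something_with(raw_text)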

4. K-MEDOID
The K-Medoid algorithm starts with a fixed number of clusters (based on the percentage of summary required) and, in its first iteration, randomly allocates a medoid for each cluster from the set of sentences. The algorithm assigns each sentence to the cluster whose medoid (itself a sentence) is closest to it among the medoids of all the clusters. After each iteration, it recalculates the medoid of each cluster and reassigns the sentences to clusters. This process is repeated until the clusters converge, i.e. the medoids do not change between iterations. The medoid of each cluster is extracted, the medoids are sorted according to their index in the original document, and the result is presented as the summary of the document.

Document Pre-Processing
Sentence Extraction. This step involves extracting the sentences from a document. The DUC dataset provides documents with line breaks, which have been used to generate the summaries; each sentence is extracted based on these line breaks.
Stopwords Removal. A list of more than 400 carefully defined stop words is used to find and remove stop words in each sentence.
Part-of-Speech Tagging. The NLTK toolkit is used to tag words after tokenizing each sentence with NLTK's word tokenizer.
Extraction of Nouns. The target words used to represent sentences in this approach are specifically nouns.
Stemming. NLTK's Lancaster stemmer is used to reduce each word to its stem or root, so that words which share a root but do not match exactly are not discarded.

Implementation
1. The DISCO (extracting DIstributionally related words using CO-occurrences) API is used to retrieve the semantically most similar words for an input word. The similarity is based on a statistical analysis of the British National Corpus (BNC), a dictionary of more than 1.5GB. For each representative word in the sentence, the API retrieves the 10 most similar words.

2. Each sentence is essentially represented by the aggregate set of the words actually present in the original sentence plus the semantically similar words returned by the DISCO API.

3. Cosine similarity between each pair of sentences is used to measure their semantic similarity. This measure counts the number of matching words between the two sentences and normalizes the count by the total number of words representing each sentence. The similarity and the corresponding distance between each pair of sentences are calculated as:

Sim(si, sj) = 2 * |si ∩ sj| / (|si| + |sj|)
Distance(si, sj) = 1 - Sim(si, sj)

This process builds a distance matrix for the sentences of the document, where each element represents the distance between a pair of sentences. The K-Medoid clustering algorithm is performed on this matrix to group sentences with relatively small distances between them (i.e. similar meaning) into single clusters. This algorithm is used here because each sentence is represented by a set of words rather than by a vector of features. The medoid of each cluster is the sentence with the least average distance to all the other sentences within the same cluster (i.e. the sentence most similar to all the sentences in that cluster).
Note: sentence clustering is an important aspect of this approach, because sub-topics and multiple themes in a document should be properly identified. Clustering groups sentences with similar meaning together, which helps weed out redundant sentences from the final summary and makes room for sentences that build a more diverse and effective summary.
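A minimal sketch of this clustering pipeline is shown below, assuming each sentence has already been reduced to a set of representative (stemmed, DISCO-expanded) words; the similarity measure is the one defined above, while the K-Medoid loop is a simplified re-implementation for illustration rather than the exact code used in our system.

import random

def similarity(words_i, words_j):
    # Sim(si, sj) = 2 * |si ∩ sj| / (|si| + |sj|), as defined above.
    if not words_i or not words_j:
        return 0.0
    return 2.0 * len(words_i & words_j) / (len(words_i) + len(words_j))

def distance_matrix(sentence_sets):
    n = len(sentence_sets)
    return [[1.0 - similarity(sentence_sets[i], sentence_sets[j])
             for j in range(n)] for i in range(n)]

def k_medoid(dist, k, max_iter=100, seed=0):
    # Cluster sentence indices around k medoids; return the medoid indices.
    n = len(dist)
    random.seed(seed)
    medoids = random.sample(range(n), k)
    for _ in range(max_iter):
        # Assign every sentence to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            closest = min(medoids, key=lambda m: dist[i][m])
            clusters[closest].append(i)
        # Recompute each cluster's medoid: the member with the least total
        # distance to all other members of the same cluster.
        new_medoids = [min(members,
                           key=lambda c: sum(dist[c][o] for o in members))
                       for members in clusters.values() if members]
        if set(new_medoids) == set(medoids):
            break                        # converged: medoids no longer change
        medoids = new_medoids
    return sorted(medoids)               # document order, as in the summary

# Example usage with toy word sets (real sentences are preprocessed and
# DISCO-expanded before this point):
sentences = [{"dog", "bark"}, {"dog", "howl"}, {"stock", "market"}]
medoid_ids = k_medoid(distance_matrix(sentences), k=2)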

5. LEXRANK
The LexRank method is a graph-based algorithm for computing the centrality of sentences from a sentence similarity matrix. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of the sentences. A single document can be viewed as a network of sentences that are related to each other. We argue that the sentences that are similar to many of the other sentences in a document are more central to the main topic of the document. We define similarity between sentences using a bag-of-words model, representing each sentence as an N-dimensional vector, where N is the number of all possible words in the target language. For each word that occurs in a sentence, the value of the corresponding dimension in the sentence's vector is the number of occurrences of the word inside the sentence (term frequency, TF) times the inverse document frequency (IDF) of the word. The IDF value of a word is calculated from the cluster that the test document falls into. The similarity between any two sentences is then defined as the cosine of their corresponding vectors.
Outline of the Algorithm

p^T = p^T B

OR, with a damping factor d,

p^T = p^T [ d*U + (1 - d)*B ]

(Here d is the damping factor, N is the number of sentences, and U is a square matrix with all elements equal to 1/N.) p^T is the left eigenvector of the matrix B with corresponding eigenvalue equal to 1. The matrix B is obtained from the adjacency matrix (A) of the similarity graph by dividing each element by the corresponding row sum:

B(i, j) = A(i, j) / Σk A(i, k)

Since the similarity matrix B satisfies the properties of a stochastic matrix, we can treat it as a Markov chain. The centrality vector p corresponds to the stationary distribution of B. However, we need to make sure that the similarity matrix is always irreducible and aperiodic.

Our Approach
1. The documents collected from Wikipedia and OANC are clustered into 20 clusters using the K-means clustering algorithm with cosine similarity as the similarity measure.
2. The IDF of the words in each cluster is calculated and stored. To summarize a document, we first determine which cluster it belongs to and use the IDF of that cluster in the TF*IDF weight of each word. Afterwards, we compute the sentence similarity matrix B.
3. Sentence similarity is computed by removing punctuation and stopwords, applying POS tagging with the Natural Language Toolkit (NLTK) to the original sentence, and generating the 5 most similar words for each noun using the DISCO API.
3.1 TF is the number of times a word appears inside a sentence.
3.2 IDF is computed from the number of documents in the cluster in which the word appears.
3.3 The modified weighted cosine similarity is then calculated.
4. The damping factor (d) is set to 0.85. With respect to the damping factor, the sentence similarity values are adjusted; the newly computed matrix B is then both aperiodic and irreducible.
5. The eigenvector corresponding to eigenvalue 1 is calculated iteratively through the power method until an error threshold of 0.1 is reached.
6. The final eigenvector corresponds to the stationary distribution p.
7. The top K sentences according to their respective values in the final vector are then chosen for the document summary.
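A compact sketch of steps 4–7 in Python/NumPy is given below: it row-normalizes a precomputed sentence similarity matrix, applies the damping factor, and runs the power method until the error threshold is reached. The similarity matrix itself is assumed to have been built with the TF*IDF-weighted, DISCO-expanded cosine measure described in step 3.

import numpy as np

def lexrank_scores(sim, damping=0.85, error_threshold=0.1):
    # Return the LexRank centrality of each sentence. sim is an n x n matrix
    # of pairwise sentence similarities (pairs below the 0.1 similarity
    # cut-off are assumed to have been zeroed out already).
    n = sim.shape[0]
    # Row-normalize the adjacency matrix to obtain the stochastic matrix B.
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard against isolated sentences
    B = sim / row_sums
    # Damping keeps the chain aperiodic and irreducible:
    # M = d * U + (1 - d) * B, where U is the all-1/n matrix.
    M = damping / n + (1.0 - damping) * B
    # Power method: iterate p^T = p^T * M until the change is small enough.
    p = np.full(n, 1.0 / n)
    while True:
        p_next = p @ M
        if np.linalg.norm(p_next - p, 1) < error_threshold:
            return p_next
        p = p_next

def top_k_sentences(sentences, scores, k):
    # Pick the k highest-scoring sentences and restore document order.
    ranked = np.argsort(scores)[::-1][:k]
    return [sentences[i] for i in sorted(ranked)]

# Example usage with a toy 3-sentence similarity matrix:
toy_sim = np.array([[1.0, 0.4, 0.0],
                    [0.4, 1.0, 0.3],
                    [0.0, 0.3, 1.0]])
scores = lexrank_scores(toy_sim)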

6. LSA
Latent Semantic Analysis (LSA) is used in many applications (e.g. information retrieval, document categorization, information filtering and text summarization). LSA is a technique based on statistical calculations that extracts and represents the contextual meaning of words and the similarity of sentences. It is an unsupervised method for deriving a vector-space semantic representation from a large corpus of data, which does not need any training or external knowledge. LSA uses context from the input document and extracts information such as which words are used together and which common words are shared by different sentences. Relying on these principles, we can conclude that if the number of common words between two sentences is high, then the sentences are more semantically related. LSA has three main steps:
1. Creation of an input matrix
The text (input document) is represented as a matrix in which each row represents a word and each column represents a sentence. The cell value represents the importance of the word; there are many different ways to fill the cell values, such as the frequency of the word inside the sentence.

2. Singular Value Decomposition (SVD)
Singular value decomposition is a mathematical method applied to the input matrix. SVD is used to identify patterns in the relationships between terms and sentences. For an m×n input matrix M, the decomposition is written as M = U Σ V^T.

3. Sentence Selection
After applying the SVD, its result is used to select the sentences for generating the summary.
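As a concrete illustration of steps 1 and 2, the short Python/NumPy sketch below builds a toy term-by-sentence matrix and decomposes it; the word counts are invented purely for demonstration (our system fills the cells with weighted TF values instead of raw counts).

import numpy as np

# Toy input matrix: rows = words, columns = sentences, cell = raw frequency.
M = np.array([[2.0, 0.0, 1.0],    # word "summary"
              [1.0, 1.0, 0.0],    # word "sentence"
              [0.0, 2.0, 1.0]])   # word "cluster"

# SVD: M = U * diag(sigma) * Vt
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

# Each row of Vt is a latent concept and each column corresponds to a
# sentence; the magnitude of a cell indicates how strongly that sentence
# expresses the concept, and sigma gives the relative weight of each concept.
print(sigma)    # concept weights, in decreasing order
print(Vt)       # concept-by-sentence matrix used for sentence selection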

Our Approach
1. Punctuation and stopwords are removed, and POS tagging is applied to each sentence using NLTK.
2. The 5 most similar words for each noun are generated using the DISCO API and appended to the original sentence.
3. Words with a noun tag are given higher weights compared to words with other tags: in our case, nouns have a weight of 3, verbs 1.5 and the rest 0.5.
4. The input matrix (LSA step 1) is created, wherein each row is a unique word and each column is a sentence; the respective cell value is (TF of the word in the sentence * weight).
5. The SVD decomposition is computed using the "Sparsesvd" library.
6. Sentence selection is done using OzSoy's approach.

Cross Method
The Cross method adds a preprocessing step between the SVD calculation step and the sentence selection step, after which the Vt matrix is used for sentence selection. The preprocessing step tries to remove sentences that are not among the most important sentences for each concept: for each concept (row of Vt), the average cell value is calculated and any cell whose value is less than or equal to that average is set to zero.
After preprocessing, the total length of each sentence vector is calculated, and the sentences with the longest vector length are selected for the summary. The length is computed from the matrix Σ * V(mod)^T.
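A simplified sketch of this selection step is shown below in Python/NumPy, operating on the sigma and Vt matrices produced by the SVD; the input-matrix construction and weighting are assumed to follow the steps listed above, and absolute values are used when summing so that negative concept loadings do not cancel out (a sketch-level simplification).

import numpy as np

def cross_method_select(sigma, Vt, num_sentences, threshold=np.mean):
    # OzSoy-style Cross method sentence selection (simplified sketch).
    # sigma: concept weights from the SVD; Vt: concept-by-sentence matrix.
    # threshold: np.mean for the classic variant, np.median for the
    # modification described in Section 2.
    Vt = Vt.copy()
    # Preprocessing: for each concept (row of Vt), zero out the cells whose
    # value is less than or equal to the row's threshold value.
    for row in Vt:
        cutoff = threshold(row)
        row[row <= cutoff] = 0.0
    # Sentence "length": weight the remaining concept contributions by sigma
    # and sum over concepts (a column-wise total of diag(sigma) * Vt).
    lengths = sigma @ np.abs(Vt)
    # Pick the longest sentences and restore their original document order.
    chosen = np.argsort(lengths)[::-1][:num_sentences]
    return sorted(chosen.tolist())

# Example usage with the sigma and Vt of the toy matrix from the previous
# sketch:
# summary_sentence_ids = cross_method_select(sigma, Vt, num_sentences=2)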

7. EVALUATION
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) software package is used for evaluating our system-generated document summaries. It is based on the computation of n-gram overlap between a summary and a set of models, and this recall-oriented n-gram counting has been shown to correlate well with human judgments on DUC data. The Document Understanding Conference (DUC) has been carrying out large-scale evaluations of summarization systems on a common dataset since 2001. DUC content evaluation is based on a single human model; however, in order to mitigate the bias of using gold standards from only one person, different annotators create the models for different subsets of the test data. To provide finer analysis granularity than the sentence level, DUC used elementary discourse units (EDUs) as the basis of evaluation: machine-generated summaries are evaluated by the degree to which they cover each EDU in the model.
The dataset used for evaluation purposes was the DUC 2002 dataset, which contains 557 news articles in different categories along with a manually annotated extractive summary per document. The summaries provided by the DUC 2002 dataset are on average 100 words long, and the summaries produced by our three algorithms are on average 100–120 words long, so our summaries have a length comparable to the corresponding manual summaries and the evaluation is fair. The manual summaries provided with the DUC 2002 dataset are defined as reference summaries in the ROUGE software, and the summaries generated by our three approaches are defined as candidate summaries. The metric used for testing the performance of our system-generated summaries is the F score produced by ROUGE, which is based on the precision and recall of the n-gram overlap between the manual and system-generated summaries.
Baseline system
We extract the first sentence of each paragraph in the document, or the top k sentences from the document (where k is based on the percentage of summary required for a document). The summaries generated by our three summarizers resulted in F scores that were better than those produced by the baseline system over the entire test dataset.
Comparable systems
We used two of the best available online summarizers, Sumplify and Tools4Noobs, to compare their summaries against our system-generated summaries. The summaries generated by the online summarizers were better than the baseline summaries based on their F scores. However, at least one of our summarizers was better than both online summarizers over the entire DUC test dataset, based on the F score reported by ROUGE.
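To make the scoring concrete, here is a simplified ROUGE-1 computation in Python: it counts clipped unigram overlap between a candidate and a reference summary and derives precision, recall, and the F score. It is only an approximation of what the ROUGE package reports (no stemming, stopword handling, or support for multiple references).

from collections import Counter

def rouge_1_f(candidate, reference):
    # Simplified ROUGE-1: unigram overlap between candidate and reference.
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts only as often as it appears in both.
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example usage:
# f_score = rouge_1_f(system_summary_text, duc_reference_summary_text)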


Our systems achieve good F-scores on the DUC2002 dataset. Based on our reading of the literature and on these promising results, we believe our methodology could be used to build a proper summarization service. Our systems were tested only on the DUC2002 dataset, because trustworthy manually annotated summaries are required for evaluation; to build a production summarization service, we would need a way to evaluate our algorithms against other kinds of test data, such as news articles, text conversations and discussion-board datasets. The comparative study is presented below: a table depicting the F-scores of our approaches alongside those of the baseline and the two online summarizers for a set of 10 documents, and bar plots indicating the better performance of our summarizers against the baseline and the two online summarizers.

Figure 1: K-Medoid Evaluation    

 Figure 2: LexRank Evaluation    

Figure 3: LSA Evaluation
Figure details: x-axis, set of 10 documents; y-axis, F-score produced by ROUGE.

Figure 4: Evaluation Analysis

ACKNOWLEDGMENTS
Shashank Singh: Responsible for the K-Medoid based summarizer and for evaluating the summaries generated by our algorithms against the baseline and the online summarizers. The evaluation plan involved finding a suitable evaluation system, finding a baseline system, extracting the summaries from the required summarizers, and testing all the summarizers against the reference summaries in the DUC2002 dataset using ROUGE.
Dimitris Stripelis: Responsible for dataset collection, which involved extraction and processing of the OANC and Wikipedia dump files as well as obtaining the DUC dataset, and for building the LexRank based summarizer.

Nishant Parikh: Responsible for building the K-means clustering used to generate TF-IDF values from the Wikipedia and OANC datasets, for building the LSA based summarizer, and for plotting the bar graphs of the final evaluation scores.


REFERENCES
Vishal Gupta. A Survey of Various Summary Evaluation Techniques. UIET, Panjab University, Chandigarh, India, 2014.
Dipanjan Das and André F. T. Martins. A Survey on Automatic Text Summarization. Language Technologies Institute, Carnegie Mellon University, 2007.
Rasha Mohammed Badry, Ahmed Sharaf Eldin, and Doaa Saad Elzanfally. Text Summarization within the Latent Semantic Analysis Framework: Comparative Study.
Naresh Kumar Nagwani and Shrish Verma. A Frequent Term and Semantic Similarity based Single Document Text Summarization Algorithm. Department of CS&E, NIT Raipur.
Güneş Erkan and Dragomir R. Radev. LexRank: Graph-Based Lexical Centrality as Salience in Text Summarization.
Pankaj Bhole and A. J. Agrawal. Single Document Text Summarization Using Clustering Approach Implementation for News Article.
Alkesh Patel, Tanveer Siddiqui, and U. S. Tiwary. A Language Independent Approach to Multilingual Text Summarization. Indian Institute of Information Technology, Allahabad.
Niladri Chatterjee and Shiwali Mohan. Extraction-Based Single-Document Summarization Using Random Indexing. IIT Delhi; Netaji Subhas Institute of Technology.
Christian Bauckhage. NumPy/SciPy Recipes for Data Science: k-Medoids Clustering. B-IT, University of Bonn; Fraunhofer IAIS, Sankt Augustin, Germany.
Palakorn Achananuparp, Xiaohua Hu, and Shen Xiajiong. The Evaluation of Sentence Similarity Measures.
Sandeep Sripada, Venu Gopal Kasturi, and Gautam Kumar Parai. Multi-Document Extraction Based Summarization.
Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, and Keeley Crockett. Sentence Similarity Based on Semantic Nets and Corpus Statistics.