
Department of Computer Engineering
Software Engineering Department

System Requirements Document

By:
- Bar Dromi
- Yevgeni Krapivin

Page 2: img2.tapuz.co.ilimg2.tapuz.co.il/forums/1_174160515.docx  · Web viewExtractive summarization is aimed at the selection of a subset of the most relevant fragments, which can be paragraphs,

Department of Computer Engineering

מחלקת הנדסת תוכנה

Table of Contents

Revision table
Distribution table
Introduction
    Description of problem
    Importance of solving the problem
    Related work
    Multilingual Sentence Extractor
        Proposed updates
    Workflow
        Text input
        Text Pre-processing
        Feature Extraction
        Summarization
        Training
        Representation
Functional Requirements
    Input consumption
    Text pre-processing
    Statistics gathering
    Model construction
        Structure model
        Vector-Space model
        Graph model
    Feature extraction
    Linear combination using a weighting vector
    Output representation
    Training
        Quality Assurance
Non-Functional Requirements
    Portability
    Modularity
    Extensibility
    Performance
System Requirements
    Operating system
    Java Virtual Machine (JRE)
Hardware Requirements
Development Requirements
    Environment
        Operating System
        Editor
        Java Development Kit
        Source Control
        Maven
        Testing
Diagrams
    Class diagrams
    Sequence
    Use case
Figure Table
References


Revision table

Revision #  Revision Date  Description of Change                                 Author
1           28/11/2013     Original document conception                          Yevgeni Krapivin, Bar Dromi
2           30/11/2013     Added revision and distribution tables; added         Yevgeni Krapivin
                           performance section to non-functional requirements
3           2/12/2013      Added proposed features and optimizations; diagrams   Bar Dromi, Yevgeni Krapivin

Distribution table

Recipient Name     Recipient Organization  Distribution Method
Dr. Marina Litvak  Project consultant      e-mail


Introduction

Description of problem

In a world of ever-growing data, finding the relevant information becomes a genuinely time-consuming problem. Every year researchers submit new articles that sometimes contain duplicate data and almost always contain irrelevant data.

Summarizing texts in exotic languages is a major problem. Most summarization algorithms need golden standards to "learn" how to summarize a language, or use language-specific features, making those algorithms language dependent.

Importance of solving the problem

Solving this problem is important because of the unstoppable growth of data in the coming years and decades. Moreover, the ability to summarize information in exotic languages is very important for the intelligence community.

Related work

Extractive summarization is aimed at the selection of a subset of the most relevant fragments, which can be paragraphs, sentences, key phrases, or keywords from a given source text. The extractive summarization process usually involves ranking, in which each fragment of the summarized text gets a relevance score, and extraction, during which the top-ranked fragments are extracted and arranged in a summary in the same order they appeared in the original text. Statistical methods for calculating the relevance score of each fragment can rely on such information as the fragment's position inside the document, its length, and whether it contains keywords or title words. [1]

In a research study conducted by Dr. Litvak [1], 31 different linguistic features are used to rank sentences. The 31 scores are combined using a weighting function whose weights come from a genetic algorithm [2] that finds the optimal weight ratio providing the best summarization (according to evaluations).

Figure 1: Original MUSE features taxonomy


Most current sentence-extraction algorithms use language-specific features, rendering them unusable on a large variety of languages unless they are retrained specifically for the needed language.

The process of retraining the algorithms may not even be possible because of a tight dependency on the language. Retraining some algorithms requires preparing a special corpus of texts called a "golden standard" with a corresponding set of human-made summarizations. The creation of such a golden standard is very expensive and time consuming, and in the case of rare and exotic languages it may not even be possible due to insufficient manpower.

Multilingual Sentence Extractor

The MUltilingual SEntence Extractor (MUSE) algorithm tries to deal with some of the problems presented previously.

MUSE is a language-independent algorithm (hence "multilingual") that can be trained once and used successfully on a variety of languages, providing high-quality results.

The original MUSE algorithm [3] described in Dr. Litvak's paper contains 31 features [4]. These features are divided into three primary groups:

- Structural features – concerned with the placement of the sentences in the text and the length of those sentences.
- Vector-Space features – concerned with term frequencies in the sentence.
- Graph-Space features – using a graph representation of the vector space to apply graph-specific algorithms.

Using a genetic algorithm [2] to create a linear combination of the rankings given by each of the features, MUSE selects the top-ranked sentences and reorders them into their original order, creating a summarization composed entirely of the original sentences of the text.
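To make the combine-and-extract step concrete, here is a minimal sketch: per-feature scores are collapsed into a single rank via a dot product with the trained weight vector, the top-k sentences are selected, and the selection is restored to original text order. Class and method names (LinearCombinationRanker, combine, extract) are illustrative only and not taken from the actual MUSE code base.

```java
import java.util.*;

public class LinearCombinationRanker {

    // Combine per-feature scores into a single rank: dot product of the
    // sentence's feature scores with the trained weight vector.
    static double combine(double[] featureScores, double[] weights) {
        double rank = 0.0;
        for (int i = 0; i < featureScores.length; i++) {
            rank += weights[i] * featureScores[i];
        }
        return rank;
    }

    // Select the indices of the top-k sentences, then restore the original
    // order so the summary reads in the same order as the source text.
    static List<Integer> extract(double[][] scores, double[] weights, int k) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort sentence indices by combined rank, highest first.
        Arrays.sort(order, (a, b) ->
                Double.compare(combine(scores[b], weights), combine(scores[a], weights)));
        List<Integer> selected = new ArrayList<>(Arrays.asList(order).subList(0, k));
        Collections.sort(selected); // back to original sentence order
        return selected;
    }
}
```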

Figure 2: Original MUSE workflow


The same linear combination can potentially be used on other languages without retraining the algorithm, giving MUSE an advantage over other summarization algorithms.

The main problem with the MUSE algorithm is its runtime, caused by a non-parallelized implementation.

The aim of our project is to find ways to parallelize the algorithm, giving it a better implementation, decreasing runtime, and increasing the amount of work done in a single run. The research part of the project aims at improving the quality of summarizations in popular languages such as English by adding language-specific features to the summarization of English texts.

Proposed updates

Introduction of new features

We propose adding new language-dependent features to MUSE. We believe that language-dependent features will improve the quality of the summary.

Features depending upon lemmatization

n   Feature     Source                         Taxonomy
1   KEY         Edmundson (1969)               Vector-based
2   COV         Liu et al. (2006a)             Vector-based
3   TF          Vanderwende et al. (2007)      Vector-based
4   TFISF       Neto et al. (2000)             Vector-based
5   SVD         Steinberger and Jezek (2004)   Vector-based
6   TITLE_O     Edmundson (1969)               Vector-based
7   TITLE_J     Edmundson (1969)               Vector-based
8   TITLE_C     Edmundson (1969)               Vector-based
9   D_COV_O     Litvak et al. (2010b)          Vector-based
10  D_COV_J     Litvak et al. (2010b)          Vector-based
11  D_COV_C     Litvak et al. (2010b)          Vector-based
12  LUHN_DEG    Litvak et al. (2010b)          Graph-based
13  KEY_DEG     Litvak et al. (2010b)          Graph-based
14  COV_DEG     Litvak et al. (2010b)          Graph-based
15  DEG         Litvak et al. (2010b)          Graph-based
16  TITLE_E_O   Litvak et al. (2010b)          Graph-based
17  TITLE_E_J   Litvak et al. (2010b)          Graph-based
18  D_COV_E_O   Litvak et al. (2010b)          Graph-based
19  D_COV_E_J   Litvak et al. (2010b)          Graph-based

Figure 3: Proposed new lemmatization features

Features depending upon Named Entities

Each proposed feature is listed with the existing feature it is based on (where applicable), its formula, and a description:

- NE_LOC – number of Location NEs in the sentence.
- NE_PER – number of Person NEs in the sentence.
- NE_DATE – number of Date-Time NEs in the sentence.
- NE_QU – number of Quantitative NEs in the sentence.
- NE – |NamedEnt(S)|; total number of NEs in the sentence.
- NE_LOC_RAT – ratio of Location NEs to all words in the sentence.
- NE_PER_RAT – ratio of Person NEs to all words in the sentence.
- NE_DATE_RAT – ratio of Date-Time NEs to all words in the sentence.
- NE_QU_RAT – ratio of Quantitative NEs to all words in the sentence.
- NE_RAT – |NamedEnt(S)| / N; ratio of the total number of NEs to all words in the sentence.
- NE_COV (based on COV) – |NamedEnt(S)| / |NamedEnt(D)|; ratio of NEs in the sentence to NEs in the document.
- NE_TFISF (based on TFISF) – Σ_{NE∈S} tf(NE)·isf(NE), where isf(NE) = 1 − log(n(NE)) / log(n) and n(NE) is the number of sentences containing the NE.
- NE_FRQ_SUM (based on KEY) – Σ_{NE∈S} tf(NE); sum of the NE frequencies.
- NE_TITLE – number of NEs mentioned in the title.
- NE_DEG_SUM (based on KEY_DEG) – Σ_{NE∈S} Deg(NE); sum of NE degrees.

Figure 4: Proposed new NE features
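As a worked example of the NE_TFISF formula, the following sketch computes isf(NE) = 1 − log(n(NE))/log(n) and sums tf(NE)·isf(NE) over the named entities of one sentence. The class and method names are illustrative, not part of the MUSE code base.

```java
public class NeTfIsf {

    // isf(NE) = 1 - log(n(NE)) / log(n), where n is the total number of
    // sentences in the document and nNe the number of sentences containing NE.
    static double isf(int nNe, int n) {
        return 1.0 - Math.log(nNe) / Math.log(n);
    }

    // NE_TFISF for one sentence: sum tf(NE) * isf(NE) over its named entities.
    // tf[i] is the document frequency of entity i, nNe[i] the number of
    // sentences containing entity i.
    static double neTfIsf(double[] tf, int[] nNe, int n) {
        double score = 0.0;
        for (int i = 0; i < tf.length; i++) {
            score += tf[i] * isf(nNe[i], n);
        }
        return score;
    }
}
```

Note that an entity appearing in every sentence gets isf = 0, so it contributes nothing to the score, exactly as with the term-based TFISF feature.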

Notation descriptions:

- NE – named entity
- S – sentence
- D – document
- i – sentence index
- n – number of sentences in the document
- N – number of words in the sentence
- tf(NE) – named entity frequency in the document
- Deg(NE) – degree of the named entity node

Figure 5: Annotations

Optimizations

Lemmatization

Lemma (headword) – in morphology and lexicography, a lemma (plural lemmas) is the canonical form, dictionary form, or citation form of a set of words. [5]

Stem – in linguistics, a stem is a part of a word. The term is used with slightly different meanings; in one usage, a stem is a form to which affixes can be attached. [6]

One of the optimizations this project researches and introduces to MUSE is the usage of lemmas instead of stems (or in cooperation with stems). This method should find better keywords in the text for all the keyword-based metrics. Using stems loses information about keywords that hold the same lexical meaning but differ morphologically. By finding all the lemmas in the text, the keyword-based metrics will be more accurate in rating the sentences.
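A toy illustration of why lemmas help the keyword metrics: tokens that differ morphologically but share a lemma are counted as one keyword. The hand-made lemma map below stands in for a real lemmatizer (the pipeline uses the Stanford NLP toolkit for this); all names are illustrative.

```java
import java.util.*;

public class LemmaCounter {

    // Count keyword frequencies after normalizing each token through a
    // lemma lookup. Tokens absent from the map are counted as themselves.
    static Map<String, Integer> lemmaFrequencies(List<String> tokens,
                                                 Map<String, String> lemmaOf) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            String lemma = lemmaOf.getOrDefault(token, token);
            freq.merge(lemma, 1, Integer::sum);
        }
        return freq;
    }
}
```

With stemming, "was" and "were" would typically remain distinct; lemmatization merges them under "be", strengthening that keyword's frequency signal.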

Reduction of IO usage

This version of MUSE uses less IO than the previous version: it reads from the HDD (hard disk drive) only once. All configuration files and input texts are loaded into memory and stay there for the duration of the run. This means fewer IO interrupts and less time spent waiting on information requests. On the downside, the amount of RAM (Random Access Memory) needed by the program is increased.

Workflow

The new workflow of MUSE resembles the original workflow; the difference is the parallelization of each step.

Figure 6 : New MUSE workflow

Text input

In this stage we read/receive the raw texts (no XML or other annotations) and store them in RAM for fast access and consumption.

9

Page 10: img2.tapuz.co.ilimg2.tapuz.co.il/forums/1_174160515.docx  · Web viewExtractive summarization is aimed at the selection of a subset of the most relevant fragments, which can be paragraphs,

Department of Computer Engineering

מחלקת הנדסת תוכנה

Text Pre-processing

In this stage the pre-processor splits each text into sentences and tokenizes the sentences to obtain separate words. Those words are stemmed, lemmatized and tagged.

All these pre-processing steps are performed by the Stanford NLP toolkit.

In this stage, data about the text, sentences and tokens is collected and saved for later use in the feature extraction stage.
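The real pipeline delegates splitting and tokenization to the Stanford NLP toolkit; purely as an illustration of what this stage produces, here is a naive regex-based stand-in (not the actual implementation, and far less robust than a real NLP splitter):

```java
import java.util.*;

public class NaivePreprocessor {

    // Split raw text into sentences on terminal punctuation followed by
    // whitespace. A toy stand-in for a real sentence splitter.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }

    // Tokenize a sentence into lower-cased words, dropping punctuation.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}
```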

Feature Extraction

In this stage each sentence in each text gets its rankings according to the data collected in the pre-processing stage. At the end of the stage, each sentence has as many rankings as there are features involved in the feature extraction process.

Summarization

In this stage the algorithm takes the rankings of each sentence and creates a single rank using the linear combination created in the training process.

Using the ranking of each sentence, the algorithm then selects the top-ranked sentences and uses them as the summarization of the text.

Training

In this stage the trainer uses the output of the feature extraction stage explained earlier. Using a genetic algorithm and the quality-evaluation packages ROUGE-1 and ROUGE-2, the trainer creates the most suitable linear combination of the given features, i.e. the one producing the highest-quality summarizations. The trainer also needs a golden standard against which to check the quality of the summarizations created during training and to evaluate the quality of the linear-combination weights.

Figure 7 : Genetic algorithm workflow
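A minimal sketch of such a genetic training loop follows. The Fitness interface is a placeholder for the ROUGE evaluation against a golden standard, and the population size, selection scheme and mutation rate are illustrative choices, not MUSE's actual parameters.

```java
import java.util.*;

public class WeightTrainer {

    // Placeholder for ROUGE-based evaluation of a candidate weight vector.
    interface Fitness { double score(double[] weights); }

    static double[] train(int dim, int popSize, int generations,
                          Fitness fitness, Random rnd) {
        // Random initial population of weight vectors.
        double[][] pop = new double[popSize][dim];
        for (double[] w : pop)
            for (int i = 0; i < dim; i++) w[i] = rnd.nextDouble();

        for (int g = 0; g < generations; g++) {
            // Rank by fitness, best first; the best half survives unchanged.
            Arrays.sort(pop, (a, b) -> Double.compare(fitness.score(b), fitness.score(a)));
            // Replace the worst half with mutated crossovers of the best half.
            for (int i = popSize / 2; i < popSize; i++) {
                double[] p1 = pop[rnd.nextInt(popSize / 2)];
                double[] p2 = pop[rnd.nextInt(popSize / 2)];
                for (int j = 0; j < dim; j++) {
                    pop[i][j] = (rnd.nextBoolean() ? p1[j] : p2[j])  // crossover
                              + rnd.nextGaussian() * 0.05;           // mutation
                }
            }
        }
        Arrays.sort(pop, (a, b) -> Double.compare(fitness.score(b), fitness.score(a)));
        return pop[0]; // best weight vector found
    }
}
```

Because the top half is carried over unchanged each generation (elitism), the best candidate never gets worse, only the expensive fitness evaluation (ROUGE, in the real trainer) limits how many generations are practical.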

Representation

The representation of the final outcome of the summarization process can be suited to the needs of the user. The output can be annotated (as needed by the ROUGE evaluation kits), exported as raw *.txt files, uploaded to a server, etc.

The representation of a summarization varies with the user's needs: it can be a raw text containing only the sentences extracted from the original text, or a highlighting of the selected sentences within the original text.


Functional Requirements

The MUSE implementation is a pipeline-style framework. Pipelines are constructed using the Parallel-Processing-Pipeline library (GPL license).

Input consumption

The input texts should be raw *.txt files. The presence of annotations of any kind (XML, HTML, JSON, etc.) will make the program behave in unpredictable ways, yield summarizations of low quality, or even crash.

Text pre-processing

Text pre-processing is done using the Stanford NLP toolkit.

Statistics gathering

The pre-processor should gather the following data about the text:

- Sentence splitting
- Tokenization
- Stem frequency
- Lemma frequency
- Part-Of-Speech (POS) tagging
- Named-Entity (NE) recognition
- NE frequency
- Inverse term-sentence mapping
- Inverse lemma-sentence mapping
- Inverse NE-sentence mapping
- Sentence vector normal calculation
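For instance, the inverse term-sentence mapping can be sketched as a simple posting-list construction over the tokenized sentences (class and method names are illustrative):

```java
import java.util.*;

public class InverseIndex {

    // Build the inverse term-to-sentence mapping: for each term, the set
    // of indices of the sentences that contain it. The same construction
    // applies to the inverse lemma- and NE-sentence mappings.
    static Map<String, Set<Integer>> build(List<List<String>> sentences) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < sentences.size(); i++) {
            for (String term : sentences.get(i)) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(i);
            }
        }
        return index;
    }
}
```

This mapping makes sentence-frequency lookups (such as the n(NE) count used by the TFISF-style features) a constant-time set-size query instead of a scan over all sentences.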

Model construction

Structure model

The structure model depicts the sentence's position in the text, its character length and its word count. This is a trivial model, created during statistics gathering.

Vector-Space model

The vector space model depicts the term-to-sentence relations. Each sentence has a vector built from the terms used in the sentence itself. The terms are taken after stemming or lemmatization.

Graph model

The graph model depicts the term-to-term relations in the whole document and the sentence-to-sentence relations. Two primary graphs are constructed for this model:

- Word graph – essentially a directed bi-gram graph with labeled edges; each label denotes the sentence in which the word progression was made.
- Sentence graph – a weighted, undirected graph that shows similarity relations between sentences. Each edge is weighted by the cosine similarity between the two sentences it connects. A threshold filters the edges, so non-similar sentences are not connected by an edge at all.
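The cosine-similarity edges of the sentence graph can be sketched like this, where the input vectors are the term-frequency vectors from the vector-space model; the threshold value and all names are illustrative:

```java
import java.util.*;

public class SentenceGraph {

    // Cosine similarity between two term-frequency vectors.
    static double cosine(double[] u, double[] v) {
        double dot = 0, nu = 0, nv = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            nu += u[i] * u[i];
            nv += v[i] * v[i];
        }
        return dot / (Math.sqrt(nu) * Math.sqrt(nv));
    }

    // Build the undirected edge list: each entry is {i, j, weight}, kept
    // only when the similarity reaches the threshold, so non-similar
    // sentences are not connected at all.
    static List<double[]> edges(double[][] vectors, double threshold) {
        List<double[]> result = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                double w = cosine(vectors[i], vectors[j]);
                if (w >= threshold) result.add(new double[]{i, j, w});
            }
        }
        return result;
    }
}
```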


Feature extraction

The original features used in MUSE are denoted and explained in the papers about MUSE by Dr. Litvak et al. [4] [7] [8] New linguistic features are introduced in this research.

Linear combination using a weighting vector

Feature extraction provides each sentence with a set of ranks. To find the best-ranked sentences in the text, the MUSE algorithm uses a linear combination of the ranks given by the features, with a weighting vector that holds a calculated weight for each feature in the linear combination. The weighting vector is explained below.

Output representation

The algorithm has two different output representations:

- Summary – holds only the selected sentences in their original order.
- Annotated – holds an annotated representation of the summary for quality assurance.

Training

The training of the algorithm is done by a genetic algorithm [2].

The training algorithm should be interchangeable with other types of training mechanisms, for example an Artificial Neural Network (ANN).

Quality Assurance

The quality of the summarizations is evaluated by the ROUGE-1 and ROUGE-2 suites.

These suites are used during the training sessions of the algorithm. The genetic algorithm tries different variations of linear combinations; the ROUGE suites test the results of those combinations and evaluate them against a golden standard.

Non-Functional Requirements

Portability

The MUSE implementation should be portable to different operating systems and machines that meet the system and hardware requirements stated below.

The whole project is written in Java (JDK 7), which is portable to any machine that can run the JVM (Java Virtual Machine). As a data storage format the project uses JSON strings and structures. This format is widely used on the web for communication, which makes it possible to use MUSE in an internet environment.

Modularity

The stages of the algorithm should be modular and replaceable, so that a future user (developer) can change or add stages (the pre-processing, for example) with ease.

Extensibility

The framework should be extensible enough to work with different summarization algorithms, training techniques and output representations.

The MUSE algorithm itself is also extensible and should be able to receive newly written feature calculators without the need to recompile/rebuild the whole project.


Performance

Figure 8 : New MUSE parallelized workflow

The implementation should utilize the multi-core architecture of modern CPUs. Using multiple cores and multiple threads will enable the MUSE algorithm to work on multiple texts at once, or to utilize a pipeline-like processing structure that increases the throughput of the original MUSE implementation.
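One way to meet this requirement is a fixed thread pool sized to the core count, dispatching one text per task. The trivial summarize body below is a placeholder for the real ranking-and-extraction pipeline, and all names are illustrative rather than taken from the actual implementation:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelSummarizer {

    // Placeholder summarizer: returns the first sentence. The real
    // pipeline would rank and extract the top sentences instead.
    static String summarize(String text) {
        return text.split("(?<=[.!?])\\s+")[0];
    }

    // Summarize many texts concurrently on a pool sized to the CPU cores.
    static List<String> summarizeAll(List<String> texts) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final String text : texts) {
                futures.add(pool.submit(() -> summarize(text)));
            }
            List<String> summaries = new ArrayList<>();
            for (Future<String> f : futures) {
                summaries.add(f.get()); // results keep the input order
            }
            return summaries;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting results through the Future list preserves the input order even though the texts finish in arbitrary order, which keeps the batch output deterministic.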

System Requirements

Operating system

The operating system should be 64-bit in order to support the RAM requirements noted below.

Any OS that has a compatible JRE implementation should suffice. Note that this project will be tested on Windows 7 64-bit Home Premium edition.

Java Virtual Machine (JRE)

JRE 7 64-bit should be installed on the computer.


Hardware Requirements

The hardware on which the project was implemented should be used as the minimal hardware requirements:

CPU: Intel i7 2670QM.

Memory: 8 GB DDR3 1600 MHz.

HDD: at least 350 MB of free space, not including the space needed for input and output.

Development Requirements

Environment

Operating System

The minimal needed version of the OS is as described above.

Editor

Eclipse 4.3 (Kepler) is used in the development of this project. Using the Eclipse editor, the developer can work on either Windows or Linux.

Java Development Kit

JDK 7u45 is used to build this project. Any newer version of JDK 7 should suffice. Note that using JDK 8 could potentially raise warnings due to deprecation.

Source Control

The source control software used in this project is Mercurial, with a BitBucket server as the repository host.

Maven

Maven is an open-source build-automation tool (under the Apache License, version 2). This project leverages Maven in multiple ways:

- Dependency resolution
- Build tool

Testing

This project uses the JUnit 4 testing library, an open-source tool under the Eclipse Public License.


Diagrams

This set of diagrams shows a preliminary structure of the project's software implementation.

Class diagrams

Figure 9 : Calculator and Feature base hierarchy

Figure 10 : Document and Summary hierarchy


Sequence

Figure 11: MUSE initiation sequence

Figure 12: MUSE running


Figure 13 : Random stage initiation

Use case


Figure Table

Figure 1: Original MUSE features taxonomy
Figure 2: Original MUSE workflow
Figure 3: Proposed new lemmatization features
Figure 4: Proposed new NE features
Figure 5: Annotations
Figure 6: New MUSE workflow
Figure 7: Genetic algorithm workflow
Figure 8: New MUSE parallelized workflow
Figure 9: Calculator and Feature base hierarchy
Figure 10: Document and Summary hierarchy
Figure 11: MUSE initiation sequence
Figure 12: MUSE running
Figure 13: Random stage initiation


References

[1] M. Litvak, H. Lipman, A. Ben Gur, M. Last, S. Kisilevich and D. Keim, "Towards multi-lingual summarization: A comparative analysis of sentence extraction methods on English and Hebrew corpora," in Proceedings of CLIA/COLING 2010, Beijing, 2010.

[2] M. Litvak, M. Last and M. Friedman, "A New Approach to Improving Multilingual Summarization using a Genetic Algorithm," in Proceedings of the Association for Computational Linguistics (ACL), Uppsala, 2010.

[3] M. Litvak and M. Last, "MUSE - A Multilingual Sentence Extractor," in Proceedings of Computational Linguistics-Applications (CL-A'11), Jachranka, 2011.

[4] M. Litvak and M. Last, "Cross-lingual training of summarization systems using annotated corpora in a foreign language," Information Retrieval, vol. 16, no. 5, pp. 629-656, 2013.

[5] Wikipedia, "Lemma (morphology)," 1 July 2013. [Online]. Available: http://en.wikipedia.org/wiki/Lemma_(morphology). [Accessed 2 December 2013].

[6] Wikipedia, "Word stem," 28 October 2013. [Online]. Available: http://en.wikipedia.org/wiki/Word_stem. [Accessed 2 December 2013].

[7] M. Litvak and M. Last, "Multilingual Single-Document Summarization with MUSE," in Proceedings of ACL/MultiLing, Sofia, 2013.

[8] D. Hakkani-Tür, G. Tur, M. Levit, D. Gillick, A. Singla and S. Yaman, "Statistical Sentence Extraction for Multilingual Information Distillation," 2009.

[9] L. Wang, H. Raghavan, V. Castelli, R. Florian and C. Cardie, "A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization".

[10] J. Mayfield, P. McNamee and C. Piatko, "Named Entity Recognition using Hundreds of Thousands of Features".

[11] M. Litvak and M. Last, "Language-independent Techniques for Automated Text Summarization," Web Intelligence and Security, IOS Press, NATO Science for Peace and Security Series, pp. 209-240, 2010.

[12] M. Litvak and M. Last, "Graph-Based Keyword Extraction for Single-Document Summarization," in Proceedings of the 2nd Workshop on Multi-source, Multilingual Information Extraction and Summarization, Manchester, 2008.
