Text Summarisation based on Human Language Technologies and its Applications
Elena Lloret Pastor
Supervisor: Dr. Manuel Palomar
Seminar - June 2011
Outline
• Introduction
• State of the Art
• COMPENDIUM Text Summarisation Tool
• Evaluation and Experiments
• COMPENDIUM in HLT Applications
• Conclusion
Introduction: MOTIVATION
• Human Language Technologies (HLT)
▫ Allow people to communicate with machines by using natural language (Cole, 1997)
• Intelligent applications based on HLT
▫ Information Retrieval
▫ Question Answering
▫ Text Classification
▫ Opinion Mining
▫ Text Summarisation
Introduction: MOTIVATION
• Why is Text Summarisation (TS) needed?
▫ To condense information while keeping the most relevant parts
▫ To help users manage and process large amounts of information
The 2008 Summer Olympics took place in Beijing, China, from August 8 to August 24, 2008. A total of 11,028 athletes from 204 National Olympic Committees (NOCs) competed in 28 sports and 302 events. It was the third time that the Summer Olympic Games were held in Asia, after Tokyo, Japan in 1964 and Seoul, South Korea in 1988. The program for the Beijing Games was quite similar to that of the 2004 Summer Olympics held in Athens. There were 28 sports and 302 events. Moreover, there were 43 new world records and 132 new Olympic records set at the 2008 Summer Olympics. Chinese athletes won the most gold medals, with 51, and 100 medals altogether, while the United States had the most medals total with 110. There were many memorable champions but it was Michael Phelps and Usain Bolt who stole the headlines.
Source documents:
http://en.wikipedia.org/wiki/2008_Summer_Olympics
http://en.beijing2008.cn/#
http://www.olympic.org/beijing-2008-summer-olympics
[Screenshot: a web search returning 17,500,000 results]
State of the Art: TYPES OF SUMMARIES
• MEDIA: text, speech, video, hypertext
• LANGUAGE: mono-lingual, multi-lingual, cross-lingual
• INPUT: single-document, multi-document
• PURPOSE: generic, personalised, query-focused, update, sentiment-based, indicative, informative
• OUTPUT: extract, abstract, headline
State of the Art: TEXT SUMMARISATION PROCESS
• Topic identification
▫ What the document is about
• Interpretation or topic fusion
▫ Important topics are expressed using a new formulation
• Summary generation
▫ Natural Language Generation is applied to build the final summary
[Figure: the process of summarisation, a pipeline of topic identification, interpretation and summary generation]
State of the Art: GENERATION OF SUMMARIES
• Approaches
▫ Statistical-based: tf, tf*idf (e.g. Orăsan, 2009)
▫ Topic-based: event words (e.g. Kuo & Chen, 2008)
▫ Graph-based: LexRank (e.g. Erkan & Radev, 2004)
▫ Discourse-based: lexical chains (e.g. Barzilay & Elhadad, 1999)
▫ Machine learning-based: neural nets (e.g. Svore et al., 2007)
State of the Art: GENERATION OF SUMMARIES
• New types of summaries
▫ Personalised summaries: user profiles (e.g. Díaz & Gervás, 2007)
▫ Update summaries: "history" (e.g. Li et al., 2008)
▫ Sentiment-based summaries: multi-aspect rating model (e.g. Titov & McDonald, 2008)
▫ Survey summaries: Wikipedia articles (e.g. Sauper & Barzilay, 2009)
▫ Abstractive summaries: sentence compression (e.g. Filippova, 2010)
State of the Art: GENERATION OF SUMMARIES
• New scenarios
▫ Literary texts: books (e.g. Ceylan & Mihalcea, 2009)
▫ Patent claims (e.g. Trappey et al., 2009)
▫ Image captioning (e.g. Aker & Gaizauskas, 2010)
▫ Web 2.0 textual genres: blogs (e.g. Balahur et al., 2009)
State of the Art: EVALUATION OF SUMMARIES
• Types of evaluation
▫ Intrinsic: evaluate the summary on its own
  Informativeness assessment: Pyramid, QARLA, ROUGE, Basic Elements
  Quality assessment: indicativeness, grammaticality, coherence, non-redundancy
▫ Extrinsic: evaluate how good the summaries are for performing other tasks
COMPENDIUM TS tool: TYPES OF SUMMARIES
• MEDIA: text
• LANGUAGE: mono-lingual
• INPUT: single-document, multi-document
• PURPOSE: generic, query-focused, sentiment-based, informative
• OUTPUT: extract, abstract
COMPENDIUM TS tool: ARCHITECTURE
[Figure: architecture diagram. Core stages, taking a document or set of documents (DOC/DOCS) as input: Surface Linguistic Analysis, Redundancy Detection, Topic Identification, Relevance Detection and Summary Generation, producing an extractive summary. Additional stages: Query Similarity (with a query as input) for query-focused summaries, Subjective Information Detection (with an opinion query) for sentiment-based summaries, and Information Compression and Fusion for abstractive summaries.]
COMPENDIUM TS tool: CORE STAGES
• Surface Linguistic Analysis
▫ Pre-process the input text by employing state-of-the-art tools: sentence segmentation, tokenisation, part-of-speech tagging, stemming, stop word identification
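The pre-processing steps above can be sketched as a minimal pipeline. The regex-based segmenter, tiny stop-word list and crude suffix-stripping stemmer below are illustrative stand-ins for the dedicated tools COMPENDIUM employs, and the function name is hypothetical:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "were"}

def preprocess(text):
    """Illustrative surface linguistic analysis: sentence segmentation,
    tokenisation, crude stemming and stop-word flagging."""
    # Split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sent in sentences:
        tokens = re.findall(r"[A-Za-z]+", sent.lower())
        analysed = []
        for tok in tokens:
            # Naive stemmer: strip a few common English suffixes.
            stem = re.sub(r"(ing|ed|es|s)$", "", tok) if len(tok) > 4 else tok
            analysed.append({"token": tok, "stem": stem,
                             "stop": tok in STOP_WORDS})
        processed.append(analysed)
    return processed
```

A real pipeline would also attach part-of-speech tags to each token; that step is omitted here to keep the sketch dependency-free.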
COMPENDIUM TS tool: CORE STAGES
• Redundancy Detection
▫ Identify and remove repeated information: Textual Entailment (Ferrández, 2009)
▫ The main idea behind using TE for detecting redundancy is that sentences whose meaning is already contained in other sentences can be discarded, as the information has been previously mentioned
Examples:
T: The man was killed last week / H: The man is dead → TRUE
T: The man was shot in his shoulder / H: The man is dead → FALSE
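As a toy illustration of the idea, not the TE system of Ferrández (2009) that COMPENDIUM actually uses, redundancy can be approximated by checking whether a candidate sentence's words are already covered by an earlier sentence (the function name and threshold are assumptions):

```python
def is_redundant(candidate, previous, threshold=0.8):
    """Word-overlap proxy for textual entailment: if most of the
    candidate's words are already contained in some earlier sentence,
    the candidate is treated as redundant and discarded."""
    cand = set(candidate.lower().split())
    for prev in previous:
        prev_words = set(prev.lower().split())
        if cand and len(cand & prev_words) / len(cand) >= threshold:
            return True
    return False

kept = []
for sent in ["the man was killed last week",
             "the man was killed",
             "the man was shot in his shoulder"]:
    if not is_redundant(sent, kept):
        kept.append(sent)
```

The second sentence is filtered out because its meaning is fully contained in the first; the third survives, mirroring the TRUE/FALSE entailment examples above.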
COMPENDIUM TS tool: CORE STAGES
• Topic Identification
▫ Identify the most relevant topics: term frequency (Luhn, 1958)
▫ The most frequent words (stop words excluded) can be considered the main topics of a document
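Luhn-style topic identification reduces to counting content words; a minimal sketch (the function name and the small stop-word list are illustrative):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "were"}

def main_topics(text, n=3):
    """Term-frequency topic identification (Luhn, 1958): the most
    frequent words, stop words excluded, are the main topics."""
    words = [w for w in text.lower().split()
             if w.isalpha() and w not in STOP_WORDS]
    return [w for w, _ in Counter(words).most_common(n)]
```

For ties, `Counter.most_common` keeps first-seen order, so the result is deterministic for a given document.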
COMPENDIUM TS tool: CORE STAGES
• Relevance Detection
▫ Compute a weight for each sentence depending on its importance
▫ The Code Quantity Principle (Givón, 1990); coding element: noun phrase
▫ Sentences containing a noun phrase with highly frequent words are considered more important
▫ A score is computed for each sentence
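A simplified sketch of this scoring idea follows. COMPENDIUM counts words inside noun phrases as the coding elements; lacking a parser here, every content word stands in for one, and the function name and stop-word set are assumptions:

```python
from collections import Counter

def score_sentences(sentences, stop_words=frozenset({"the", "a", "in", "was"})):
    """Code Quantity Principle sketch: weight each sentence by the
    average document frequency of its content words, so sentences
    built from frequent (topical) words score higher."""
    freq = Counter(w for s in sentences
                   for w in s.lower().split() if w not in stop_words)
    scores = []
    for s in sentences:
        words = [w for w in s.lower().split() if w not in stop_words]
        scores.append(sum(freq[w] for w in words) / max(len(words), 1))
    return scores
```

Averaging (rather than summing) keeps long sentences from winning on length alone, one of several reasonable normalisation choices.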
COMPENDIUM TS tool: CORE STAGES
• Summary Generation
▫ Summary size: number of words, compression rate
▫ The highest-scored sentences, up to the desired length, are selected and extracted
▫ Sentences are ordered as they appear in the document
• Type of summaries (output)
▫ Generic extracts: COMPENDIUM-E
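The selection step described above can be sketched directly (the function name is illustrative): rank sentences by score, fill the word budget greedily, then restore document order:

```python
def generate_summary(sentences, scores, max_words=20):
    """Extractive summary generation: pick the highest-scored sentences
    until the word budget is filled, then output them in their
    original document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    # Re-sorting the indices restores document order.
    return [sentences[i] for i in sorted(chosen)]
```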
COMPENDIUM TS tool: ADDITIONAL STAGES
• Query Similarity
▫ Cosine similarity: qSim
▫ A score is computed for each sentence
• Type of summaries (output)
▫ Query-focused extracts: COMPENDIUM-QE
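The qSim score is a standard cosine similarity over bag-of-words vectors; a self-contained sketch (the function name is an assumption):

```python
import math
from collections import Counter

def q_sim(query, sentence):
    """Cosine similarity between the bag-of-words vectors of the query
    and a sentence, used to bias sentence selection towards the query."""
    q = Counter(query.lower().split())
    s = Counter(sentence.lower().split())
    dot = sum(q[w] * s[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in s.values())))
    return dot / norm if norm else 0.0
```

In the query-focused configuration this similarity is combined with the relevance score of each sentence, so that sentences both topical and close to the query rank highest.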
COMPENDIUM TS tool: ADDITIONAL STAGES
• Subjective Information Detection
▫ Opinion mining techniques (Balahur-Dobrescu et al., 2009)
▫ Select the most relevant sentences among the subjective ones
• Type of summaries (output)
▫ Sentiment-based extracts: COMPENDIUM-SE
COMPENDIUM TS tool: ADDITIONAL STAGES
• Information Compression and Fusion
▫ Word graphs
▫ Combine extracted and newly generated information
• Type of summaries (output)
▫ Abstractive-oriented summaries: COMPENDIUM-E-A
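A minimal word-graph sketch, in the spirit of Filippova (2010) but not COMPENDIUM's actual algorithm: sentences sharing words are merged into one graph of word transitions, and the shortest start-to-end path yields a compressed fusion. Real systems weight edges and filter ungrammatical paths; this unweighted BFS is only an illustration, and the function name is hypothetical:

```python
from collections import defaultdict, deque

def word_graph_compress(sentences):
    """Merge sentences into a word-transition graph and return the
    shortest <s>-to-</s> path as a compressed fused sentence."""
    graph = defaultdict(set)
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            graph[a].add(b)
    # Breadth-first search: the first path reaching </s> is shortest.
    queue = deque([["<s>"]])
    seen = {"<s>"}
    while queue:
        path = queue.popleft()
        for nxt in graph[path[-1]]:
            if nxt == "</s>":
                return " ".join(path[1:])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return ""
```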
EVALUATION AND EXPERIMENTS: EVALUATION METHODOLOGY
• Type of evaluation: intrinsic
• What are we going to assess?
▫ COMPENDIUM in different domains and contexts
• Which criteria are we going to use for the evaluation?
▫ Content (automatically): ROUGE (Lin, 2004)
▫ Quality (manually): readability and user satisfaction
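The ROUGE-1 F-measure used in the content evaluation can be sketched as clipped unigram overlap between a candidate summary and a reference (a simplified, single-reference version of Lin, 2004; the function name is illustrative):

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """ROUGE-1 F-measure sketch: unigram overlap between candidate and
    reference, with each match clipped by the reference count."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

The official toolkit additionally handles stemming, stop-word options and multiple references, which this sketch omits.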
EVALUATION AND EXPERIMENTS: RESULTS
• Newswire
▫ Single-document generic extracts: ~45% (F-measure, ROUGE-1)
▫ Multi-document: ~30% (F-measure, ROUGE-1)
• Blogs
▫ Multi-document sentiment-based summaries: ~64% (F-measure, Pyramid)
• Image captions
▫ Multi-document query-focused summaries: ~36% (F-measure, ROUGE-1)
• Medical research papers
▫ Single-document abstractive-oriented summaries: ~42% (F-measure, ROUGE-1)
COMPENDIUM in HLT APPLICATIONS: QUESTION ANSWERING
• Question Answering
▫ Allows users to formulate questions in natural language and provides them with the exact information required
• Objective
▫ Integrate COMPENDIUM with a Web-based question answering approach: COMPENDIUM-QE
COMPENDIUM in HLT: QUESTION ANSWERING
• Proposed approach
▫ Question analysis: question type, focus and keywords
▫ Information retrieval: retrieve the first 20 documents from Google
▫ Summarisation: COMPENDIUM-QE; summary size = length of snippets
▫ Answer extraction: named entities, semantic roles
COMPENDIUM in HLT: QUESTION ANSWERING
• Data
▫ 100 factual questions: person, location, temporal, organization
• Evaluation
▫ Correct / Incorrect / Non-answered
Examples:
Question (temporal): When was the first Barbie produced? Answer: 1959
Question (location): Where is the pancreas located? Answer: abdomen
COMPENDIUM in HLT: QUESTION ANSWERING
• Results (F-measure, %)
▫ Named entity-based QA

                 Person   Organization   Temporal   Location
  Snippets        53.3        52.2         48.9       61.2
  COMPENDIUM-QE   56.5        53.3         66.7       65.3

▫ Semantic role-based QA

                 Person   Organization   Temporal   Location
  Snippets        43.9        25.0         24.2       48.9
  COMPENDIUM-QE   53.7        41.2         60.0       62.5

▫ Improvement over snippets: NE-based QA 12%; SR-based QA 48%
CONCLUSION
• The proposed techniques are appropriate for TS
▫ Textual entailment: appropriate to tackle redundancy
▫ Code Quantity Principle: detecting relevant information
▫ Word graph-based algorithms: compressing and merging information
• Summaries, although imperfect in nature, can improve the performance of other HLT tasks
▫ Question Answering