
Text Summarization -- In Search of Effective Ideas and Techniques

Shuhua Liu, Assistant Professor
Department of Information Systems
Åbo Akademi University, Finland & University of California, Berkeley

Modified By Shinta P., 2012


Headline news — informing


TV-GUIDES — decision making


Abstracts of papers — time saving


Graphical maps — orienting

What is text summarization?

To reduce (long) textual information to its most essential points

To distill the most important information from a source or sources and produce an abridged version of it (Endres-Niggemeyer, 1998; Mani and Maybury, 1999; Spärck-Jones, 1999).

Text summarization: a context-dependent activity


‘Genres’ of Summary?

Indicative vs. informative: used for quick categorization vs. content processing.

Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.

Generic vs. query-oriented: provides author’s view vs. reflects user’s interest.

Background vs. just-the-news: assumes reader’s prior knowledge is poor vs. up-to-date.

Single-document vs. multi-document source: based on one text vs. fuses together many texts.

Shuhua Liu, IIS/IAMSR, ÅA

Text summarization

Key issues:

How to identify the most important content out of the rest of the text?

How to synthesize the substance and formulate a summary text based on the identified content?

Major approaches:

Selection based: produce "extracts"

Text understanding based: produce "abstracts"


Selection based summarization: how does it work?

The most content-bearing sentences or passages are identified and selected to compose a summary.

Compute a significance value for each sentence (Luhn, 1958; Edmundson, 1969), based on:

the frequency of the words it contains;

the keywords, title words, and cue words it contains;

the position of the sentence.

RST (Rhetorical Structure Theory) based discourse analysis (Marcu, 1997)

Passage and sentence similarity analysis (Goldstein et al, 2000; CMU)
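
The sentence-scoring idea can be illustrated with a minimal sketch. It assumes a simple weighted sum of four features (average word frequency, title-word overlap, cue-word count, and a bonus for the first and last sentence). The stopword list, cue-word list, weights, and function names are hypothetical choices made for this illustration, not the settings used by Luhn or Edmundson.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "this", "that", "for", "are", "on"}
CUE_WORDS = {"significant", "important", "conclusion", "summary"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def score_sentences(sentences, title, w_freq=1.0, w_title=1.0, w_cue=1.0, w_pos=1.0):
    """Return one significance value per sentence (weighted sum of four features)."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    title_words = set(tokenize(title))
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            scores.append(0.0)
            continue
        f = sum(freq[w] for w in words) / len(words)        # average word frequency
        t = sum(1 for w in words if w in title_words)       # title-word hits
        c = sum(1 for w in words if w in CUE_WORDS)         # cue-word hits
        p = 1.0 if i in (0, len(sentences) - 1) else 0.0    # position bonus (assumed rule)
        scores.append(w_freq * f + w_title * t + w_cue * c + w_pos * p)
    return scores


def extract_summary(sentences, title, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in document order is a small design choice that keeps the extract as readable as the source allows.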


MSWord AutoSummarize


Text understanding system

A text understanding task often aims to recover all of the information that there is in a text, including what is only implicit in what is actually written.

“All the richness of natural language becomes fair game, including metaphor, metonymy, discourse structure, and the recognition of the author's underlying intentions, and the full interplay between language and world knowledge becomes central to the task.”


Text understanding based summarization

Depends on complete sentence analysis and discourse analysis with full knowledge support:

Syntactic parser, semantic interpreter

Linguistic knowledge, world knowledge, domain knowledge

Reasoning mechanisms that work effectively over huge knowledge collections.


Selection based vs. Understanding based

Selection based: generally applicable, but produces incoherent content and poor readability due to unclear relationships between the selected text excerpts, dangling references, and so on.

Understanding based: high precision, but very slow, with a large amount of wasted computation, and highly domain specific.

Endres-Niggemeyer (2000) found that people sometimes prefer extractive summaries to glossed-over abstractive summaries!


The reality:

The dominant approach in practice is still selection-based;

Understanding based systems exist only in theory, and will continue to do so for quite a while;

However, certain text understanding tasks in small scale or restricted domains can be done.


Topic guided text summarization

Text summarization as a process of topic analysis, passage extraction, text understanding, information integration/fusion, and text generation.

Passage extraction guided by topic structure is expected to keep the logical relationships between the extracted text parts, e.g. sentences are arranged logically according to topic structure.

Topic representation will also be very helpful in the next phase of text analysis and information integration.


Phase 1: Theme detection, topic labels, sentence/passage selection

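Theme detection through pairwise passage similarity analysis

Vector space model of terms and documents

TF-IDF: baseline method

\[ w_{ij} = f_{ij} \, \log \frac{N}{n} \]

\[ \mathrm{similarity}(D_i, D_j) = \frac{\sum_{k=1}^{t} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{t} (w_{ik})^{2} \cdot \sum_{k=1}^{t} (w_{jk})^{2}}} \]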
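
As a rough illustration of the TF-IDF baseline, the sketch below weights each term by its frequency in a passage times log(N/n) and compares two passages by the cosine of their weight vectors. It assumes that N is the number of passages and n the number of passages containing the term; the slide does not define these symbols, so this reading and the function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(passages):
    """passages: list of token lists. Returns one {term: weight} dict per passage."""
    n_passages = len(passages)                       # N (assumed: number of passages)
    doc_freq = Counter(term for p in passages for term in set(p))  # n per term (assumed)
    vectors = []
    for passage in passages:
        term_freq = Counter(passage)                 # f: raw term frequency in the passage
        vectors.append({
            term: f * math.log(n_passages / doc_freq[term])
            for term, f in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that a term occurring in every passage gets weight log(1) = 0, so it contributes nothing to the similarity.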


Passage similarity analysis with LSA method

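LSA (Latent Semantic Analysis)

Similar results as using TF-IDF

Fuzzy LSI approach (Nikravesh, 2002)

\[ D = \{d_1, d_2, \ldots, d_n\}, \qquad W = \{w_1, w_2, \ldots, w_m\} \]

\[ N = \big( n(w_i, d_j) \big)_{ij} \]

\[ N = U \Sigma V^{t}, \qquad \tilde{N} = U \tilde{\Sigma} V^{t} \approx U \Sigma V^{t} = N \]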
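
The core step of LSA, replacing the term-by-passage matrix with a low-rank approximation obtained from a singular value decomposition, can be sketched in pure Python. The version below uses power iteration with deflation only to stay dependency-free; in practice a library SVD routine would normally be used. The function names, the fixed random seed, and the iteration count are assumptions made for this illustration.

```python
import math
import random


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def top_singular_triplet(A, iters=200, seed=0):
    """Power iteration on A^t A; returns (sigma, u, v) for the largest singular value."""
    At = transpose(A)
    rng = random.Random(seed)                        # fixed seed keeps the sketch reproducible
    v = [rng.random() + 0.1 for _ in range(len(At))]
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                              # A is (numerically) the zero matrix
            return 0.0, [0.0] * len(A), v
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av] if sigma else [0.0] * len(A)
    return sigma, u, v


def rank_k_approximation(N, k):
    """Sum of the k leading terms sigma * u * v^t, found by repeated deflation."""
    residual = [row[:] for row in N]
    approx = [[0.0] * len(N[0]) for _ in N]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(len(N)):
            for j in range(len(N[0])):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term
    return approx
```

Passage similarity would then be computed on the columns of the approximated matrix rather than on the raw counts.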


Passage adjacency matrix (partial)

similarity strength >= 0.35

       s21  s22  s23  s24  s25  s26  s27  s28
s21     0    1    0    1    0    1    0    0
s22     1    0    0    0    0    1    0    0
s23     0    0    0    0    0    0    0    0
s24     1    0    0    0    1    1    1    1
s25     0    0    0    1    0    0    1    1
s26     1    1    0    1    0    0    0    1
s27     0    0    0    1    1    0    0    1
s28     0    0    0    1    1    1    1    0
s29     0    0    0    0    0    0    0    0
s210    0    1    0    0    0    0    0    0
s211    0    0    0    0    0    0    0    1
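
A minimal sketch of how such a matrix could be produced and read: pairwise similarities are thresholded at 0.35, and passages are then grouped by connectivity. Grouping by connected components is an assumption about how clusters might be read off the matrix, and the function names are hypothetical. `SLIDE_ADJ` copies the square s21 to s28 block of the partial matrix shown on the slide.

```python
def threshold_adjacency(sim, threshold=0.35):
    """Turn a square similarity matrix into a 0/1 adjacency matrix (no self-links)."""
    n = len(sim)
    return [
        [1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def connected_components(adj):
    """Group passage indices that are linked directly or through other passages."""
    n = len(adj)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if adj[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(component))
    return components


# Square block (s21..s28) of the partial adjacency matrix from the slide.
SLIDE_LABELS = ["s21", "s22", "s23", "s24", "s25", "s26", "s27", "s28"]
SLIDE_ADJ = [
    [0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
```

Within this block, s23 has no links and forms a group of its own, while the other seven passages are all connected to one another directly or indirectly.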


Passage Relation Map


Passage Extraction Rules

Passage clusters help us to identify themes and topics; unconnected passages form distinct topics covered in a document.

The MMR algorithm (CMU) (Goldstein et al., 2000)

A sentence/passage closest to the centroid of the cluster is chosen to be included in the summary.

Sentences that are maximally similar to the document and maximally dissimilar to sentences already in the summary are selected to compose a summary.
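
The second rule can be sketched as a greedy loop: at each step, pick the sentence with the best trade-off between similarity to the whole document and similarity to what has already been selected. The sketch below is an illustration under stated assumptions, not the CMU implementation: it uses Jaccard word overlap as the similarity measure and a hypothetical trade-off parameter `lam`.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token lists (an assumed stand-in measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(sentences, k, lam=0.7, sim=jaccard):
    """Greedy maximal-marginal-relevance selection.

    sentences: list of token lists; returns the indices of the k chosen sentences
    in the order they were selected.
    """
    document = [w for s in sentences for w in s]
    selected = []
    candidates = list(range(len(sentences)))

    def marginal_relevance(i):
        relevance = sim(sentences[i], document)
        redundancy = max((sim(sentences[i], sentences[j]) for j in selected), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=marginal_relevance)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the redundancy term is ignored and the loop reduces to ranking by relevance alone; lower values penalise near-duplicate sentences more strongly.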


Creating theme labels

Keywords (TF based)

Word families (semantically related words in a passage cluster)

Key phrases:

Linguistic approach

Statistical + simple heuristics (Kelledy and Smeaton, 1997): seems quite effective.


Next step


WordNet, since 1985

Lexical database developed at Princeton University, led by George Miller

Hand-coded, freely available

Word knowledge of: nouns, verbs, adjectives, adverbs

Semantic network representation with only a few semantic relations: synonym, hypernym

Categorization relation: Is-a

Widely used in query expansion, word similarity determination (based on synsets)


Table: Semantic Relations in WordNet (Miller, 1995)

Semantic Relation        Syntactic Category   Examples
Synonymy (similar)       N, V, Aj, Av         Pipe, tube; rise, ascend; sad, unhappy; rapidly, speedily
Antonymy (opposite)      Aj, Av, (N, V)       Wet, dry; powerful, powerless; friendly, unfriendly; rapidly, slowly
Hyponymy (subordinate)   N                    Sugar maple, maple, maple tree, plant
Meronymy (part)          N                    Brim, hat; gin, martini; ship, fleet
Troponymy (manner)       V                    March, walk; whisper, speak
Entailment               V                    Drive, ride; divorce, marry

Note: N = Nouns, Aj = Adjectives, V = Verbs, Av = Adverbs


ConceptNet, MIT Media Lab

Common sense knowledge base with NLP capability

Extracted automatically from common sense knowledge expressed in semi-structured NL sentences from OMCSNet (Open Mind Common Sense), applying about 50 extraction rules, e.g.:

"The Effect of [falling off a bike] is [you get hurt]."

"A lime is a very sour fruit" at OMCS is extracted into two assertions:

IsA (lime, fruit)

PropertyOf (lime, very sour)
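
To make the idea of an extraction rule concrete, here is a toy sketch with two regular-expression patterns that cover only the two example sentences on this slide. The patterns and the function name are hypothetical; they are not the roughly 50 rules actually used to build ConceptNet.

```python
import re

# Hypothetical patterns written for the two slide examples only.
EFFECT_RULE = re.compile(r"^The Effect of \[(.+?)\] is \[(.+?)\]\.?$", re.IGNORECASE)
ISA_RULE = re.compile(r"^An? (\w+) is an? (?:(.+) )?(\w+)\.?$", re.IGNORECASE)


def extract_assertions(sentence):
    """Return a list of (relation, concept1, concept2) tuples found in the sentence."""
    sentence = sentence.strip()
    match = EFFECT_RULE.match(sentence)
    if match:
        return [("EffectOf", match.group(1), match.group(2))]
    match = ISA_RULE.match(sentence)
    if match:
        subject, modifier, category = match.groups()
        assertions = [("IsA", subject, category)]
        if modifier:  # e.g. "very sour" in "a very sour fruit"
            assertions.append(("PropertyOf", subject, modifier))
        return assertions
    return []
```

For instance, `extract_assertions("A lime is a very sour fruit")` yields an `IsA` assertion for (lime, fruit) and a `PropertyOf` assertion for (lime, very sour), matching the example on the slide.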


Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
IsA: (IsA "apple" "fruit")
PartOf: (PartOf "CPU" "computer")
PropertyOf: (PropertyOf "coffee" "wet")
MadeOf: (MadeOf "bread" "flour")
DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
SubeventOf: (SubeventOf "play sport" "score goal")
FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
EffectOf: (EffectOf "view video" "entertainment")
DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
DesireOf: (DesireOf "person" "not be depressed")
MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
UsedFor: (UsedFor "fireplace" "burn wood")
CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
ThematicKLine: (ThematicKLine "wedding dress" "veil")
ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")


ConceptNet (Liu and Singh, 2004a, 2004b)

Inference

Spreading activation: node activation radiating outward from an origin node

GetContext (node)

GetAnalogousConcept (node)

Graph traversal: FindPathBetweenNodes (node1, node2)
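
The two inference styles can be illustrated on a toy graph. The sketch below is not the ConceptNet API: `get_context` and `find_path_between_nodes` are hypothetical Python analogues of the operations named on the slide, the graph ignores relation types and direction, and the decay factor and hop limit are assumed parameters.

```python
from collections import deque


def build_graph(assertions):
    """Build an undirected adjacency map from (relation, concept1, concept2) tuples."""
    graph = {}
    for _relation, a, b in assertions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def get_context(graph, origin, decay=0.5, max_hops=2):
    """Spreading activation: energy 1.0 at the origin, multiplied by `decay` per hop."""
    activation = {origin: 1.0}
    frontier = deque([(origin, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in activation:      # breadth-first: first visit is the closest
                activation[neighbour] = activation[node] * decay
                frontier.append((neighbour, hops + 1))
    del activation[origin]
    return sorted(activation.items(), key=lambda item: (-item[1], item[0]))


def find_path_between_nodes(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list of nodes, or None."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                frontier.append(neighbour)
    return None
```

Because activation halves with each hop in this sketch, concepts directly linked to the origin dominate the returned context, and the hop limit keeps the search local.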


ConceptNet (Liu and Singh, 2004a, 2004b)

Support:

Topic sensing

Query expansion

Semantic similarity of words

Lexical generalization

Thematic generalization

Much needs to be examined: uncontrolled vocabulary, can be biased in terms of content, but seems to be quite reliable knowledge.


Topic-Sensing


Eurovoc: multilingual thesaurus

Controlled vocabulary, 20 languages, broad fields:

politics, international relations, European Communities, law, economics, trade, finance, social questions, education, science, international organizations, employment and working conditions, industry, business and competition, production, technology and research, transport, environment, energy, agriculture, forestry and fisheries, agri-foodstuffs, geography
