Summarizing Encyclopedic Term Descriptions on the Web from Coling 2004 Atsushi Fujii and Tetsuya...

Summarizing Encyclopedic Term Descriptions on the Web

from Coling 2004

Atsushi Fujii and Tetsuya Ishikawa

Graduate School of Library, Information and Media Studies, University of Tsukuba

Motivation

Existing encyclopedias often lack new terms and new definitions for existing terms

The Web contains an enormous volume of up-to-date information and is a source of new term descriptions

Using existing search engines for this purpose has many problems

Search engine?

Search engines often retrieve extraneous pages that do not describe the submitted term

A user has to identify page fragments describing the term

Descriptions in multiple pages are independent

Word senses are not distinguished for ambiguous terms

They propose a summarization method that produces a concise and condensed term description from multiple paragraphs

In this paper, they focus on Japanese technical terms in the computer domain

Overview of CYCLONE

Summarization Method

Given a set of paragraph-style descriptions for a single term in a specific domain, their summarization method produces a concise text describing the term from different viewpoints

12 viewpoints in the computer domain: definition, abbreviation, exemplification, purpose, synonym, reference, product, advantage, drawback, history, component, function

Four steps

Identification: Recognize the language unit associated with a viewpoint

Classification: Merge units with the same viewpoint into a single group

Selection: Determine one or more representative units for each group

Presentation: Produce a summary in a format

Identification

A sentence is often associated with multiple viewpoints, e.g. "XML is an abbreviation for eXtensible Markup Language, and is markup language"

Segment Japanese sentences into simple sentences; zero pronoun detection and anaphora resolution can be used for this

"XML is an abbreviation for eXtensible Markup Language": abbreviation viewpoint

"XML is markup language": definition viewpoint
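
As a toy illustration of the segmentation above, the XML example can be split in a few lines of Python. This is not the Japanese zero pronoun detection or anaphora resolution the slides mention: the function name `split_coordinated`, the `", and "` splitting rule, and passing the omitted subject in by hand are all assumptions made for the sketch.

```python
def split_coordinated(sentence, subject):
    """Split a sentence at ', and ' and restore the omitted subject.

    Assumption: the caller already knows the subject; a real system
    would recover it with zero pronoun detection.
    """
    clauses = [clause.strip() for clause in sentence.split(", and ")]
    return [
        clause if clause.startswith(subject) else f"{subject} {clause}"
        for clause in clauses
    ]


simple_sentences = split_coordinated(
    "XML is an abbreviation for eXtensible Markup Language, and is markup language",
    subject="XML",
)
```

Each resulting simple sentence can then be associated with a single viewpoint.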

Classification

For the 12 viewpoints, 36 linguistic patterns are used to describe terms from a specific viewpoint

A simple sentence that matches patterns for multiple viewpoints is classified into each corresponding viewpoint group
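
The pattern-based classification can be sketched as follows. The slides do not list the 36 Japanese patterns, so the three English regular expressions below are hypothetical stand-ins, and `classify_by_patterns` is an illustrative name rather than anything from the paper.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for the 36 patterns, which the slides do not list.
PATTERNS = {
    "abbreviation": [re.compile(r"\bis an abbreviation for\b")],
    "definition": [re.compile(r"\bis an? (?!abbreviation\b)\w+")],
    "purpose": [re.compile(r"\bis used (to|for)\b")],
}


def classify_by_patterns(sentences):
    """Put each sentence into every viewpoint group whose pattern it matches."""
    groups = defaultdict(list)
    unclassified = []
    for sentence in sentences:
        matched = [
            viewpoint
            for viewpoint, patterns in PATTERNS.items()
            if any(pattern.search(sentence) for pattern in patterns)
        ]
        for viewpoint in matched:
            groups[viewpoint].append(sentence)
        if not matched:
            # Left for a later similarity-based step.
            unclassified.append(sentence)
    return dict(groups), unclassified
```

A sentence such as "XML is a markup language that is used for data exchange" matches two of these patterns and therefore lands in both the definition and the purpose group.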

Classification (cont)

What about sentences that do not match any pattern?

Classify each remaining sentence into the group to which its most similar sentence belongs

Compute the similarity between an unclassified sentence and each of the classified sentences (Dice coefficient)

“miscellaneous” group
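
A minimal sketch of this fallback step, under several assumptions: the Dice coefficient is computed over whitespace-separated word sets (Japanese text would need a morphological analyzer instead), and the `min_similarity` cutoff that sends a sentence to the "miscellaneous" group is a made-up parameter, since the slides do not say how that group is filled.

```python
def dice(a, b):
    """Dice coefficient between two sentences, computed over their word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    total = len(words_a) + len(words_b)
    return 2 * len(words_a & words_b) / total if total else 0.0


def assign_by_similarity(sentence, groups, min_similarity=0.2):
    """Return the group of the most similar classified sentence.

    `min_similarity` is an assumed cutoff, not a value from the slides:
    below it the sentence goes to the "miscellaneous" group.
    """
    best_group, best_score = "miscellaneous", 0.0
    for viewpoint, members in groups.items():
        for member in members:
            score = dice(sentence, member)
            if score > best_score:
                best_group, best_score = viewpoint, score
    return best_group if best_score >= min_similarity else "miscellaneous"
```

With `groups = {"definition": ["XML is a markup language"]}`, the sentence "SGML is a markup language" shares four of five words with the classified one and is assigned to the definition group.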

Example

Selection

The number of sentences selected from each group depends on the desired size of the resultant summary

Compute the score for each sentence and select sentences with greater scores in each group

# of common words included (W): sentences including frequent words are preferred

Rank order in CYCLONE (R)

# of characters included (C): short sentences are preferred

Normalize each factor and compute the final score as a weighted average of the three factors above (W>R>C)
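
The scoring can be sketched as below. The slides give only the ordering W>R>C, so the weights `(0.5, 0.3, 0.2)`, the min-max normalization, and the reading that a smaller CYCLONE rank is better are all assumptions of this sketch.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip the scale when smaller raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def score_sentences(candidates, weights=(0.5, 0.3, 0.2)):
    """Score (sentence, common_word_count, cyclone_rank) tuples.

    The default weights are assumed values that only respect W > R > C.
    """
    w = min_max([cand[1] for cand in candidates])
    r = min_max([cand[2] for cand in candidates], higher_is_better=False)
    c = min_max([len(cand[0]) for cand in candidates], higher_is_better=False)
    ww, wr, wc = weights
    total = ww + wr + wc
    return [
        (cand[0], (ww * wi + wr * ri + wc * ci) / total)
        for cand, wi, ri, ci in zip(candidates, w, r, c)
    ]


def select_top(candidates, n=1):
    """Return the n highest-scoring sentences of one viewpoint group."""
    ranked = sorted(score_sentences(candidates), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:n]]
```

Raising `n` corresponds to asking for a larger summary, since more representative sentences are kept per group.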

Selection (cont)

For the miscellaneous group, they select the sentence most dissimilar to the representative sentences selected from the regular groups
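
One way to read "most dissimilar" is to pick the sentence whose closest representative is still far away. The sketch below makes that assumption and reuses a word-set Dice coefficient as the similarity measure; neither choice is confirmed by the slides, and `pick_most_dissimilar` is an illustrative name.

```python
def dice(a, b):
    """Dice coefficient between two sentences, computed over their word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    total = len(words_a) + len(words_b)
    return 2 * len(words_a & words_b) / total if total else 0.0


def pick_most_dissimilar(misc_sentences, representatives):
    """Pick the miscellaneous sentence least similar to any representative.

    Assumption: dissimilarity is judged by the highest similarity to any
    single representative sentence, and the lowest such value wins.
    """
    def max_similarity(sentence):
        return max((dice(sentence, rep) for rep in representatives), default=0.0)

    return min(misc_sentences, key=max_similarity)
```

The intent is that the miscellaneous sentence adds information not already covered by the regular viewpoint groups.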

Presentation

Top 50 paragraphs for the term “XML”

Only one sentence was selected from each group

Each viewpoint label or sentence is hyperlinked to the associated group or the source paragraph

Presentation (cont)

Evaluation

Summarization evaluation can be classified into intrinsic and extrinsic approaches

Intrinsic: the quality of a text, informativeness

Extrinsic: if a summary improves the efficiency of a specific task

Evaluation (cont)

15 Japanese terms are used as test inputs

To calculate the coverage, for each of the 15 terms, two students annotate each simple sentence in the top 50 paragraphs of the CYCLONE results with one or more viewpoints

They define 28 viewpoints, including the 12 viewpoints used by the method

Compression ratio and coverage were calculated over the top 50 paragraphs
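
The slides do not spell out the two formulas. The sketch below assumes the usual readings: compression ratio as summary characters divided by characters in the source paragraphs, and coverage as the share of annotated viewpoints in the source that also appear in the summary. Both function names and definitions are assumptions.

```python
def compression_ratio(summary, paragraphs):
    """Characters in the summary divided by characters in the source paragraphs."""
    source_chars = sum(len(paragraph) for paragraph in paragraphs)
    return len(summary) / source_chars


def coverage(summary_viewpoints, source_viewpoints):
    """Fraction of annotated source viewpoints that also appear in the summary."""
    source = set(source_viewpoints)
    return len(set(summary_viewpoints) & source) / len(source)
```

Under these definitions a lower compression ratio means a shorter summary, and a higher coverage means fewer viewpoints were lost.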

Results

#Reps: the number of representative sentences selected from each viewpoint group

#Chars: the number of characters in a summary

They select five sentences from the miscellaneous group

VBS: viewpoint-based summarization method

Lead: baseline that systematically extracts the top N characters from the CYCLONE results
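
The Lead baseline is simple enough to sketch in one function. It assumes the CYCLONE results arrive as a ranked list of paragraph strings that are concatenated without separators; the name `lead_baseline` is illustrative.

```python
def lead_baseline(paragraphs, n_chars):
    """Concatenate ranked paragraphs and keep only the first n_chars characters."""
    return "".join(paragraphs)[:n_chars]
```

Setting `n_chars` to the length of a VBS summary would give both methods the same character budget.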

Conclusion

To compile encyclopedic term descriptions from the Web, they introduced a summarization method

They identify simple sentences, classify those sentences into viewpoint groups, select representative sentences from each group, and present them

VBS achieved a good compression ratio, and its coverage score was better than that of the baseline

Future work includes generating a coherent text and performing an extrinsic evaluation
