Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

12th International Conference on Web Information System Engineering (WISE 2011)
October 14, 2011

Jeroen de Knijff, Kevin Meijer, Flavius Frasincar, Frederik Hogenboom

Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands



Page 2: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Introduction (1)

• An increasing number of documents is digitally stored on the Web

• Documents can be structured through taxonomies

• Many documents are unstructured, which drives the need for taxonomy construction

Page 3: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Introduction (2)

• Taxonomy construction:
  – Manual:
    • More accurate
    • Main method
  – Automatic:
    • Less knowledge needed
    • Less time consuming

• Taxonomy construction enables interoperability between Web sites, tools, etc., because knowledge is aggregated into shared taxonomies

Page 4: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Introduction (3)

What’s new?

Page 5: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Introduction (4)

• Taxonomy construction is a mature and widely researched topic

• Little literature exists on applying Word Sense Disambiguation (WSD), even though WSD improves the results of commonly used techniques such as clustering

• Hence, we propose the Automatic Taxonomy Construction from Text (ATCT) framework, which implements WSD

Page 6: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Framework (1)

Page 7: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Framework (2)

• Term extraction:
  – Part-of-Speech (POS) tagging
  – All nouns are extracted

• Term filtering:
  – Based on domain pertinence and lexical cohesion
  – Most relevant terms are subsequently selected through a score, based on domain pertinence, domain consensus and structural relevance

• Measures used (slide callouts):
  – Importance of term: term freq. in corpus
  – Importance of term: appearance (position) in document
  – Relevance w.r.t. target domain: term freq. in domain corpus / term freq. in contrastive corpus
  – Cohesion among words in compound nouns: (# words × term freq. in corpus × log(term freq.)) / word freq. in corpus
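The two ratio-based measures named above (relevance to the target domain and cohesion among the words of a compound noun) can be illustrated with a minimal sketch. It is not the ATCT implementation: the function names, the toy counts, the natural logarithm, the guard against a zero contrastive frequency, and the reading of "word freq. corpus" as the summed frequency of the constituent words are all assumptions made for illustration.

```python
import math


def domain_pertinence(tf_domain: int, tf_contrastive: int) -> float:
    """Term frequency in the domain corpus over term frequency in the contrastive corpus."""
    # Assumption: clamp the denominator to 1 so unseen contrastive terms do not divide by zero.
    return tf_domain / max(tf_contrastive, 1)


def lexical_cohesion(term: str, term_freq: dict, word_freq: dict) -> float:
    """(# words * term freq * log(term freq)) / summed frequency of the constituent words."""
    words = term.split()
    tf = term_freq[term]
    # Assumption: "word freq. corpus" is the sum of the corpus frequencies of the words.
    total_word_freq = sum(word_freq[w] for w in words)
    return len(words) * tf * math.log(tf) / total_word_freq


# Hypothetical toy counts, not taken from the evaluation corpora.
term_freq = {"interest rate": 8}
word_freq = {"interest": 20, "rate": 12}

dp = domain_pertinence(tf_domain=8, tf_contrastive=2)
lc = lexical_cohesion("interest rate", term_freq, word_freq)
print(f"domain pertinence = {dp:.2f}, lexical cohesion = {lc:.4f}")
```

A compound noun whose words rarely occur outside the compound receives a higher cohesion score, which is why the summed word frequency sits in the denominator.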

Page 8: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Framework (3)

• Word Sense Disambiguation:
  – Optional step
  – Synsets are retrieved from a semantic lexicon
  – Structural Semantic Interconnections (SSI)
  – Utilizes a similarity measure proposed by Jiang and Conrath (1997)
  – Terms with similar senses are removed
  – Term counts are aggregated per concept
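To make the role of the Jiang and Conrath measure concrete, the sketch below computes the Jiang-Conrath distance, IC(a) + IC(b) - 2 · IC(lcs(a, b)) with IC(c) = -log p(c), on a tiny hand-made hierarchy and picks the sense closest to a context concept. The hierarchy, the probabilities and the greedy sense choice are hypothetical; the sketch does not implement SSI and does not query a real semantic lexicon.

```python
import math

# Hypothetical toy hierarchy (child -> parent) and concept probabilities.
PARENT = {
    "bank_institution": "organization",
    "company": "organization",
    "organization": "entity",
    "bank_river": "entity",
}
PROB = {
    "entity": 1.0,
    "organization": 0.4,
    "company": 0.1,
    "bank_institution": 0.05,
    "bank_river": 0.02,
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def information_content(concept):
    return -math.log(PROB[concept])


def lowest_common_subsumer(a, b):
    ancestors_b = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in ancestors_b:
            return candidate
    raise ValueError("concepts share no ancestor")


def jc_distance(a, b):
    lcs = lowest_common_subsumer(a, b)
    return information_content(a) + information_content(b) - 2 * information_content(lcs)


def jc_similarity(a, b):
    distance = jc_distance(a, b)
    return math.inf if distance == 0 else 1.0 / distance


# Greedy choice (an assumption, not SSI): keep the sense most similar to the context concept.
senses = ["bank_institution", "bank_river"]
best_sense = max(senses, key=lambda s: jc_similarity(s, "company"))
print(best_sense)
```

Once terms are mapped to senses this way, terms that resolve to the same sense can be merged and their counts summed per concept.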

Page 9: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Framework (4)

• Concept hierarchy creation:
  – Based on the subsumption algorithm, which determines potential parents (subsumers) of concepts:
    • x potentially subsumes y, if:
      1) x appears in at least the proportion t of all documents in which y appears
      2) y appears in less than the proportion t of all documents in which x appears
  – Additionally takes into account ancestor positions:
    • Weighting scheme based on the number of layers between terms x and y
    • Close parents get assigned more weight
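The two subsumption conditions above can be sketched over sets of document identifiers. The threshold value t = 0.8, the function names and the toy document sets are assumptions for illustration; the ancestor-position weighting is not shown because the slides do not give its formula.

```python
def potentially_subsumes(x, y, docs_by_term, t=0.8):
    """True if x is a potential parent (subsumer) of y under threshold t."""
    docs_x = docs_by_term[x]
    docs_y = docs_by_term[y]
    if not docs_x or not docs_y:
        return False
    shared = len(docs_x & docs_y)
    # 1) x appears in at least the proportion t of the documents containing y
    # 2) y appears in less than the proportion t of the documents containing x
    return shared / len(docs_y) >= t and shared / len(docs_x) < t


def potential_parents(docs_by_term, t=0.8):
    """Map every term to the sorted list of terms that potentially subsume it."""
    return {
        y: sorted(
            x for x in docs_by_term
            if x != y and potentially_subsumes(x, y, docs_by_term, t)
        )
        for y in docs_by_term
    }


# Hypothetical toy corpus: term -> identifiers of documents containing it.
docs_by_term = {
    "economy": {1, 2, 3, 4, 5, 6},
    "market": {1, 2, 3, 4, 5},
    "inflation": {2, 3, 4},
}

parents = potential_parents(docs_by_term)
print(parents)
```

In this toy corpus "economy" and "market" co-occur too symmetrically for either to subsume the other, while both qualify as potential parents of "inflation".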

Page 10: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Framework (5)

• Concept hierarchy creation (cont’d):
  – Evaluating taxonomy concepts is not trivial:
    • Reference taxonomy: [figure]
    • Generated taxonomy: [figure]

Page 11: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Framework (6)

• Concept hierarchy creation (cont’d):
  – Look at senses through taxonomy concept disambiguation:
    • Similar to term WSD from text, but now surrounding concepts are used instead of surrounding words
    • Terms with a single sense in the lexicon are disambiguated directly
    • Other terms are disambiguated using their surrounding terms:
      – Concept neighborhood of 2 (up/down)
      – Root node is disregarded
    • Lexicon senses are compared
    • In case no sense is available (e.g., compound nouns):
      – Lexical matching
      – Descendant / ancestor comparison
    • Graph distances are calculated
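The concept neighborhood used as disambiguation context can be sketched as follows. The sketch assumes that "neighborhood of 2 (up/down)" means ancestors up to two levels above and descendants up to two levels below the concept, with siblings excluded; the taxonomy and function name are hypothetical.

```python
def concept_neighborhood(concept, parent_of, depth=2):
    """Collect ancestors and descendants within `depth` levels, excluding the root."""
    # Invert the child -> parent map to find children quickly.
    children_of = {}
    for child, parent in parent_of.items():
        children_of.setdefault(parent, []).append(child)

    neighbors = set()

    # Walk up at most `depth` levels.
    node = concept
    for _ in range(depth):
        node = parent_of.get(node)
        if node is None:
            break
        neighbors.add(node)

    # Walk down at most `depth` levels.
    frontier = [concept]
    for _ in range(depth):
        frontier = [c for n in frontier for c in children_of.get(n, [])]
        neighbors.update(frontier)

    # The root is the only node without a parent entry; it is disregarded.
    return {n for n in neighbors if n in parent_of}


# Hypothetical toy taxonomy (child -> parent); "root" has no parent.
parent_of = {
    "economics": "root",
    "finance": "economics",
    "trade": "economics",
    "banking": "finance",
    "loan": "banking",
    "mortgage": "loan",
}

print(sorted(concept_neighborhood("banking", parent_of)))
print(sorted(concept_neighborhood("finance", parent_of)))
```

The senses of the concepts returned here would then play the role that surrounding words play in term WSD from text.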

Page 12: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

ATCT: Implementation

• Java-based pipeline

• Noun parsing with the Stanford parser

• RDF implementation using Jena

• Domain taxonomies are expressed in SKOS
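As an illustration of what a taxonomy expressed in SKOS looks like, the sketch below serializes a small parent/child map as Turtle using the standard `skos:Concept`, `skos:prefLabel` and `skos:broader` terms. It is written in Python with plain string formatting and does not use Jena; the base URI, the `ex:` prefix and the function name are hypothetical.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def to_skos_turtle(parent_of, base="http://example.org/taxonomy/"):
    """Serialize a child -> parent map as SKOS concepts in Turtle syntax."""

    def local_name(label):
        # Assumption: labels only need spaces replaced to be valid local names.
        return label.replace(" ", "_")

    lines = [f"@prefix skos: <{SKOS_NS}> .", f"@prefix ex: <{base}> .", ""]
    concepts = sorted(set(parent_of) | set(parent_of.values()))
    for concept in concepts:
        properties = [f'skos:prefLabel "{concept}"@en']
        if concept in parent_of:
            properties.append(f"skos:broader ex:{local_name(parent_of[concept])}")
        lines.append(
            f"ex:{local_name(concept)} a skos:Concept ;\n    "
            + " ;\n    ".join(properties)
            + " ."
        )
    return "\n".join(lines)


# Hypothetical toy taxonomy (child -> parent).
turtle = to_skos_turtle({"inflation": "economy", "interest rate": "economy"})
print(turtle)
```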

Page 13: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Evaluation (1)

• Data:
  – Economics & management:
    • 25,000 abstracts from RePub & RePEc
    • 2,000 distinct concepts
    • Golden taxonomy using STW Thesaurus annotations
  – Medicine & health:
    • 10,000 abstracts from RePub
    • 1,000 distinct concepts
    • Golden taxonomy using MeSH annotations

• Measures:
  – Precision
  – Recall
  – F-measure
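The three measures can be sketched with set overlap between a generated and a reference taxonomy. Which items are compared (concepts, relations, or disambiguated senses) is not stated on the slide, so comparing plain concept sets is an assumption here, and the toy sets are hypothetical.

```python
def precision_recall_f(generated: set, reference: set):
    """Return (precision, recall, F-measure) for two sets of taxonomy items."""
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure as the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy concept sets.
generated = {"economy", "inflation", "market", "bank"}
reference = {"economy", "inflation", "trade", "bank", "tax"}

precision, recall, f_measure = precision_recall_f(generated, reference)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```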

Page 14: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Evaluation (2)

Domain  Taxonomy     Precision  Recall  F-Measure
E&M     Without WSD  0.7382     0.5082  0.6023
E&M     With WSD     0.8056     0.5813  0.6753
M&H     Without WSD  0.5681     0.6051  0.5860
M&H     With WSD     0.5907     0.6016  0.5961
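The relative change in F-measure per domain follows directly from the table. The short sketch below copies the F-measure values from the table and computes (with - without) / without; the dictionary layout and function name are illustrative only.

```python
# F-measure values copied from the table above.
F_MEASURE = {
    "E&M": {"without_wsd": 0.6023, "with_wsd": 0.6753},
    "M&H": {"without_wsd": 0.5860, "with_wsd": 0.5961},
}


def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


improvements = {
    domain: relative_improvement(scores["without_wsd"], scores["with_wsd"])
    for domain, scores in F_MEASURE.items()
}

for domain, gain in improvements.items():
    print(f"{domain}: {gain:.2f}% relative F-measure improvement")
```

For E&M this gives about 12.12%, and for M&H about 1.72%.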

Page 15: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Conclusions

• ATCT framework:
  – Extracts potential taxonomy terms from large corpora
  – Filters relevant terms
  – Performs WSD to remove redundant terms
  – Creates a taxonomy using a subsumption method

• Evaluation shows a performance improvement when using WSD (up to 12.12% relative improvement in F-measure)

• Future work:
  – Benchmark against other taxonomy creation methods (hierarchical clustering, classification, etc.)
  – Explore other domains (law, chemistry, physics, history, etc.)

Page 16: Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Questions
