Unsupervised Ontology Acquisition from plain texts: The OntoGain method Efthymios Drymonas Kalliopi Zervanou Euripides G.M. Petrakis Intelligent Systems Laboratory http://www.intelligence.tuc.gr Technical University of Crete (TUC), Chania, Greece


Page 1: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Unsupervised Ontology Acquisition from plain texts: The OntoGain method

Efthymios Drymonas, Kalliopi Zervanou, Euripides G.M. Petrakis
Intelligent Systems Laboratory
http://www.intelligence.tuc.gr

Technical University of Crete (TUC), Chania, Greece

Page 2: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

OntoGain

  • A platform for unsupervised ontology acquisition from text
  • Application independent
  • Ontology of multi-word term concepts
  • Adjusts existing methods for taxonomy & relation acquisition to handle multi-word concepts
  • Outputs ontology in OWL
  • Good results on medical and computer science corpora

Page 3: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Why multi-word term concepts?

  • Majority of terminological expressions
  • Convey classificatory information, expressed as modifiers
      • e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”
  • Leads to more expressive and compact ontology lexicon
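The modifier-based classification in the “carotid artery disease” example can be sketched in a few lines of Python. This is only an illustration of the idea, not OntoGain's implementation: the function name `broader_terms` is hypothetical, and it assumes that the head is the last word of the term and that modifiers are stripped from the left.

```python
def broader_terms(term: str) -> list[str]:
    """Return the chain of broader terms obtained by dropping leading modifiers.

    Assumption (not from the slides): the head is the last word and
    modifiers are removed one at a time from the left.
    """
    words = term.split()
    return [" ".join(words[i:]) for i in range(1, len(words))]


# "carotid artery disease" is a type of "artery disease", which is a type of "disease".
print(broader_terms("carotid artery disease"))
```

A single-word term has no broader term under this scheme, so the function returns an empty list for it.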

Page 4: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Ontology Learning Steps

  • Concept Extraction
      • C/NC-value
  • Taxonomy Induction
      • Clustering, Formal Concept Analysis
  • Non-taxonomic Relations
      • Association Rules, Probabilistic algorithm

Page 5: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

The C/NC-Value method [Frantzi et al., 2000]

  • Identifies multi-word term phrases denoting domain concepts
  • Noun phrases are extracted first:
      ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
  • C-Value: term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms
  • NC-Value: uses context information (valid terms tend to appear in specific contexts and co-occur with other terms)
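The noun-phrase filter above can be tried out as a regular expression over part-of-speech tags. The sketch below is a hedged illustration: the Penn Treebank tag names in `TAG_CLASSES` and the helper `is_candidate` are assumptions made for the example, not details given in the slides.

```python
import re

# Collapse POS tags into one letter each: A = adjective, N = noun, P = preposition.
# The Penn Treebank tag names are an assumption for this example.
TAG_CLASSES = {"JJ": "A", "NN": "N", "NNS": "N", "IN": "P"}

# ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def is_candidate(tags):
    """True if the whole tag sequence matches the noun-phrase pattern."""
    encoded = "".join(TAG_CLASSES.get(tag, "x") for tag in tags)
    return TERM_PATTERN.fullmatch(encoded) is not None


print(is_candidate(["JJ", "NN"]))        # adjective + noun
print(is_candidate(["NN", "IN", "NN"]))  # noun + preposition + noun
print(is_candidate(["DT", "NN"]))        # determiner is not part of the pattern
```

Any tag outside the three classes is mapped to a letter the pattern never accepts, so phrases containing determiners or verbs are rejected.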

Page 6: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

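C-Value: Statistical Part

For a candidate term a:

  • f(a): total frequency of occurrence of a
  • T_a: the set of longer candidate terms that contain a
  • f(b): frequency of a longer term b in T_a, i.e. of a occurring as part of a longer term
  • P(T_a): number of these longer terms
  • |a|: the length of the candidate string (in words)

\[
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested} \\
\log_2 |a| \cdot \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
\]

Concept Extraction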
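A minimal sketch of the C-value formula, assuming candidate terms are given as word tuples with their corpus frequencies. It is a simplification: nesting is detected by contiguous sub-sequence matching against the other candidates, the published algorithm's longest-first bookkeeping is left out, and the frequencies in the usage example are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_values(freq):
    """freq maps a term (tuple of words) to its total frequency f(a)."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies f(b) of the longer candidate terms b that contain a (the set T_a).
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        if nesting:
            # Nested case: subtract the average frequency of the longer terms.
            f_a = f_a - sum(nesting) / len(nesting)
        scores[a] = math.log2(len(a)) * f_a
    return scores


# Hypothetical frequencies, for illustration only.
freq = {("artery", "disease"): 10, ("carotid", "artery", "disease"): 4}
for term, score in c_values(freq).items():
    print(" ".join(term), round(score, 2))
```

Because of the log2 |a| factor, a single-word candidate always scores 0, which matches the method's focus on multi-word terms.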

Page 7: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

C/NC-Value sample results

output term                    c-nc value
web page                       1740.11
information retrieval          1274.14
search engine                  1103.99
machine learning               727.70
computer science               723.82
experimental result            655.125
text mining                    645.57
natural language processing    582.83
world wide web                 557.33
large number                   530.67
artificial intelligence        515.73
relevant document              468.22
similarity measure             464.64
information extraction         443.29
knowledge discovery            435.79

Page 8: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Ontology Learning Steps

  • Preprocessing
  • Concept Extraction
  • Taxonomy Induction
  • Non-taxonomic Relations

Page 9: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Taxonomy Induction

  • Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms
  • Two methods in OntoGain
      • Agglomerative clustering
      • Formal Concept Analysis (FCA)

Page 10: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Agglomerative Clustering

  • Proceeds bottom-up: at each step, the most similar clusters are merged
  • Initially each term is considered a cluster
  • Similarity between all pairs of clusters is computed
  • The most similar clusters are merged as long as they share terms with common heads
  • Group average for clusters, Dice-like formula for terms
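The clustering loop described above might look roughly like the following sketch. Two assumptions go beyond the slide: the "Dice-like" term similarity is taken to be the plain Dice coefficient over the words of two terms, and the head of a term is taken to be its last word. The function names are hypothetical.

```python
from itertools import combinations


def dice(term_a, term_b):
    """Dice coefficient over word sets (an assumed stand-in for the Dice-like formula)."""
    a, b = set(term_a.split()), set(term_b.split())
    return 2 * len(a & b) / (len(a) + len(b))


def head(term):
    return term.split()[-1]


def group_average(c1, c2):
    """Average pairwise term similarity between two clusters."""
    return sum(dice(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def share_head(c1, c2):
    return not {head(t) for t in c1}.isdisjoint(head(t) for t in c2)


def agglomerate(terms):
    clusters = [(t,) for t in terms]  # initially each term is its own cluster
    merges = []
    while True:
        candidates = [
            (group_average(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
            if share_head(clusters[i], clusters[j])
        ]
        if not candidates:  # no remaining pair shares a head
            return clusters, merges
        sim, i, j = max(candidates)
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


clusters, merges = agglomerate(
    ["clustering method", "agglomerative clustering method",
     "hierarchical method", "web page"]
)
print(clusters)
```

The recorded merge order can be read as a hierarchy: earlier merges join the most closely related terms, and "web page" stays apart because it shares no head with the other terms.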

Page 11: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Formal Concept Analysis (FCA) [Ganter et al., 1999]

  • FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)
  • Finds common attributes (verbs) between objects and forms object clusters that share common attributes
  • Formal concepts are connected with the sub-concept relationship:

\[ (O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \; (\iff A_1 \supseteq A_2) \]

Page 12: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

FCA Example

  • Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)

Attributes (verbs): submit, test, describe, print, compute, search

Html form                  * * *
Hierarchical clustering    * *
Text retrieval             *
Root node                  * * * *
Single cluster             * * *
Web page                   * *

Page 13: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

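FCA Taxonomy

  • Formal concepts
      • ({hierarchical clustering, root node, single cluster}, {compute, search})
      • ({html form, web page}, {print, search})
  • Not all dependencies (c, v) are interesting; they are weighted by a conditional probability that is compared against a threshold t:

\[ P(c \mid v) = \frac{f(c, v)}{f(v)} \]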
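To make the notion of a formal concept concrete, here is a small sketch that enumerates all formal concepts of a term-verb context by closing the object intents under intersection. The context uses only the term-verb incidences implied by the two formal concepts listed above; the remaining marks of the example matrix are left out, so the resulting lattice is an illustration rather than the one from the slides. The function names are hypothetical.

```python
def formal_concepts(context):
    """context maps each object (term) to its set of attributes (verbs)."""
    all_attributes = frozenset().union(*context.values())
    # Every concept intent is an intersection of object intents (or the full attribute set).
    intents = {all_attributes}
    for attributes in context.values():
        intents |= {frozenset(attributes) & intent for intent in intents}
    concepts = []
    for intent in intents:
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return concepts


def is_subconcept(c1, c2):
    """(O1, A1) <= (O2, A2) iff O1 is a subset of O2."""
    return c1[0] <= c2[0]


# Partial context: only the incidences implied by the listed formal concepts.
context = {
    "hierarchical clustering": {"compute", "search"},
    "root node": {"compute", "search"},
    "single cluster": {"compute", "search"},
    "html form": {"print", "search"},
    "web page": {"print", "search"},
}
for extent, intent in sorted(formal_concepts(context), key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

On this partial context the enumeration yields four concepts: a top concept sharing only "search", the two concepts listed above, and an empty bottom concept carrying all three verbs.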

Page 14: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Non-Taxonomic Relations extraction phase

  • Concept Extraction
  • Taxonomy Induction
  • Non-Taxonomic Relations

Page 15: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Non-Taxonomic Relations

  • Concepts are also characterized by attributes and relations to other concepts in the hierarchy
  • Typically expressed by a verb relating a pair of concepts
  • Two approaches
      • Association rules
      • Probabilistic

Page 16: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Association Rules [Agrawal et al., 1993]

  • Introduced to predict the purchase behavior of customers
  • Extract terms connected with some subject-verb-object relation
  • Enhance with general terms from the taxonomy
  • Eliminate redundant relations: predictive accuracy < t
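The filtering step can be sketched as a standard support/confidence computation over subject-verb-object triples. The sketch assumes that "predictive accuracy" can be approximated by rule confidence, which the slides do not state; the function name `mine_relations` and the triples in the usage example are hypothetical.

```python
from collections import Counter, defaultdict


def mine_relations(triples, min_confidence):
    """triples: (subject concept, verb, object concept) tuples from the corpus."""
    subject_counts = Counter(s for s, _, _ in triples)
    pair_counts = Counter((s, o) for s, _, o in triples)
    verbs = defaultdict(Counter)
    for s, v, o in triples:
        verbs[(s, o)][v] += 1

    rules = []
    for (s, o), n in pair_counts.items():
        support = n / len(triples)          # share of all triples with this pair
        confidence = n / subject_counts[s]  # assumed proxy for predictive accuracy
        if confidence >= min_confidence:    # drop rules with accuracy < t
            label = verbs[(s, o)].most_common(1)[0][0]
            rules.append((s, o, label, support, confidence))
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


# Hypothetical triples, for illustration only.
triples = [("search engine", "retrieve", "web page")] * 3 + [
    ("search engine", "print", "html form")
]
print(mine_relations(triples, min_confidence=0.5))
```

Each surviving rule is reported as (domain, range, label, support, confidence), mirroring the Domain / Range / Label layout used for the extracted relations.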

Page 17: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Association Rules: Example

Domain                           Range                            Label
chiasmal syndrome                pituitary disproportion          cause by
medial collateral ligament       surgical treatment               need
blood transfusion                antibiotic prophylaxis           result
lipid peroxidation               cardiopulmonary bypass           lead to
prostate specific antigen        prostatectomy                    follow
chronic fatigue syndrome         cardiac function                 yield
right ventricular infarction     radionuclide ventriculography    analyze by
creatinine clearance             arteriovenous hemofiltration     achieve
cardioplegic solution            superoxide dismutase             give
bacterial translocation          antibiotic prophylaxis           decrease
accurate diagnosis               clinical suspicion               depend
ultrasound examination           clinical suspicion               give
total body oxygen consumption    epidural analgesia               attenuate by
coronary arteriography           physician                        perform by

Page 18: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Probabilistic approach [Cimiano et al., 2006]

  • Collect verbal relations from the corpus
  • Find the most general relation w.r.t. the verb using frequency of occurrence
      • Suffer_from(man, head_ache)
      • Suffer_from(woman, stomach_ache)
      • Suffer_from(patient, ache)
  • Select relationships satisfying a conditional probability measure
  • Associations > t become accepted
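The generalization in the Suffer_from example can be illustrated with the sketch below. It assumes a taxonomy in which "man" and "woman" are narrower than "patient" and both aches are narrower than "ache" (implied by the example but not given explicitly), counts each observed relation for every pair of ancestors of its arguments, and accepts a relation when its frequency relative to the verb exceeds t. This simple relative frequency is only one possible conditional probability measure, and all names are hypothetical.

```python
from collections import Counter


def ancestors(concept, parent):
    """The concept followed by its broader concepts up to the taxonomy root."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def generalize(relations, parent, t):
    """relations: (verb, domain concept, range concept) tuples seen in the corpus."""
    verb_freq = Counter(verb for verb, _, _ in relations)
    pair_freq = Counter()
    for verb, domain, rng in relations:
        # Credit the observation to every generalization of its arguments.
        for d in ancestors(domain, parent):
            for r in ancestors(rng, parent):
                pair_freq[(verb, d, r)] += 1
    accepted = {}
    for (verb, d, r), n in pair_freq.items():
        probability = n / verb_freq[verb]
        if probability > t:  # associations > t become accepted
            accepted[(verb, d, r)] = probability
    return accepted


# Assumed taxonomy for the slide's example.
parent = {"man": "patient", "woman": "patient",
          "head_ache": "ache", "stomach_ache": "ache"}
relations = [("suffer_from", "man", "head_ache"),
             ("suffer_from", "woman", "stomach_ache")]
print(generalize(relations, parent, t=0.5))
```

With this toy input only the generalized relation between "patient" and "ache" covers both observations, so it is the single relation that passes the threshold.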

Page 19: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Evaluation

  • Relevance judgments are provided by humans
  • Precision - Recall
  • We examined the 200 top-ranked concepts and their respective relations in 500 lines
  • Results from the OhsuMed and Computer Science corpora
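For reference, precision and recall against human relevance judgments can be computed as in the sketch below. The function name and the two small term lists are hypothetical and serve only to show the arithmetic.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted items against human-judged relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and judgments, for illustration only.
extracted = ["web page", "large number", "search engine"]
relevant = ["web page", "search engine", "text mining", "machine learning"]
print(precision_recall(extracted, relevant))
```

Here two of the three extracted terms are judged relevant, and two of the four relevant terms are found.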

Page 20: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Results

Processing Layer           Method                     Precision (OhsuMed)   Recall (OhsuMed)   Precision (Comp. Science)   Recall (Comp. Science)
Concept Extraction         C/NC-Value                 89.7%                 91.4%              86.7%                       89.6%
Taxonomic Relations        Formal Concept Analysis    47.1%                 41.6%              44.2%                       48.6%
Taxonomic Relations        Hierarchical Clustering    71.2%                 67.3%              71.3%                       62.7%
Non-Taxonomic Relations    Association Rules          71.8%                 67.7%              72.8%                       61.7%
Non-Taxonomic Relations    Probabilistic              62.7%                 55.9%              61.6%                       49.4%

Page 21: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Comparison with Text2Onto [Cimiano & Völker, 2005]

  • Huge lists of plain single-word terms, and relations lacking semantic meaning
  • Text2Onto cannot work with big texts
  • Cannot export results in OWL

Page 22: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Conclusions

  • OntoGain
      • Multi-word term concepts
      • Exports ontology in OWL
      • Domain independent
  • Results
      • C/NC-Value yields good results
      • Clustering outperforms FCA
      • Association Rules perform better than Verbal Expressions

Page 23: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Future Work

  • Explore more methods / combinations, e.g., clustering, FCA
  • Hearst patterns for discovering additional relation types (Part-of)
  • Discover attributes and cardinality constraints
  • Incorporate term similarity information from WordNet, MeSH
  • Resolve term ambiguities

Page 24: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Thank you!

Questions?

Page 25: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

Preprocessing

  • Tokenization, POS tagging, shallow parsing (OpenNLP suite)
  • Lemmatization (WordNet Java Library)
  • Applied to all steps of OntoGain
  • Shallow parsing is used in relations acquisition for the detection of verbal dependencies

Page 26: Unsupervised Ontology Acquisition from plain texts :  The  OntoGain  method

  • Terms sharing a head tend to be similar
      • e.g. hierarchical method and agglomerative method are both methods
  • Nested terms are related to each other
      • e.g. agglomerative clustering method and clustering method should be associated