© Paul Buitelaar: KnowledgeWeb Summer School, Spain - July 2004

Human Language Technology in Ontology Engineering

Ontology Learning from Text

Paul Buitelaar
DFKI GmbH
Language Technology Lab
DFKI Competence Center Semantic Web
Saarbrücken, Germany

Overview

HLT and Ontology Engineering

Automated Linguistic Analysis

Ontology Learning from Text

Further Issues: Evaluation

Conclusions

Ontology Lifecycle

Creating

Populating

Validating

Evolving

Maintaining

Deploying

HLT in the Ontology Lifecycle

HLT for Ontology Learning and Population from Text

Human Language Technology = Automated Linguistic Analysis

Documents (Text)

Ontology Learning (Development & Evolution): Linguistic Analysis to Extract Classes / Relations

Ontology Population (Knowledge Base Generation): Linguistic Analysis to Extract Instances

Ontology (Knowledge): Classes, Relations/Properties

Instances

Automated Linguistic Analysis

Linguistic Analysis: Example

The Dell computer with a flat screen had to be rejected because of a failure in the motherboard.

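[Diagram: the sentence annotated with the entities Dell computer, flat screen, motherboard, failure and animate-entity, the predicate reject, and the relations has-a (twice) and location-of.]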

Levels of Linguistic Analysis

Lexical Analysis
  Word Class: Part-of-Speech (also Semantic Class)
  Word Structure: Morphology

Phrase Analysis
  Sentence Structure: Phrases (if 'shallow': Chunks)
  Semantic Units

Dependency Structure Analysis
  Sentence Meaning: Predicate Argument Structure (Clause)
  Semantic Structure

Part-of-Speech, Morphology

Part-of-Speech
  e.g.: noun, verb, adjective, preposition, …
  PoS tag sets may have between 10 and 50 (or more) tags

Morphology
  Most languages have inflection and declension, e.g.:
    Singular/Plural: computer, computers
    Present/Past: reject, rejected
  Many languages also have complex (de)composition, e.g.:
    Flachbildschirm (flat screen) > flach + Bildschirm > flach + Bild + Schirm
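The decomposition example can be sketched in a few lines of Python. This is only an illustration, not the analyzer used in the talk: the tiny lexicon and the function name split_compound are assumptions, and linking elements (such as the German "-s-") are ignored.

```python
# Hypothetical toy lexicon of known (lower-cased) word forms.
LEXICON = {"flach", "bild", "schirm", "bildschirm"}


def split_compound(word, lexicon=LEXICON):
    """Return every way to split `word` into a sequence of lexicon entries."""
    word = word.lower()
    if not word:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            # Recurse on the remainder and prepend the matched prefix.
            for rest in split_compound(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


SPLITS = split_compound("Flachbildschirm")
print(SPLITS)
```

Both readings from the slide come back: the full decomposition into three parts and the coarser one that keeps Bildschirm whole. A real system would have to rank such alternatives, for example by lexicon frequency.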

Phrases, Terms, Named Entities

Semantic Units

Phrases (e.g. nominal - NP, prepositional - PP)
  NP: a flat screen
  PP: with a flat screen
  NP (recursive): the Dell computer with a flat screen
  NP (recursive): a failure in the motherboard

Terms (domain-specific phrases)
  Dell computer
  Dell computer with a flat screen

Named Entities (phrases corresponding to dates, names, …)
  COMPANY: Dell
  COMPANY: Dell Computer Corporation
  PERSON: Michael Dell
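A shallow chunker for the non-recursive NP and PP patterns can be sketched as follows. The coarse tag names (DET, ADJ, NOUN, PREP, VERB), the two patterns NP = DET? ADJ* NOUN+ and PP = PREP NP, and the hand-tagged input are assumptions made for this illustration; they are not taken from a particular tagger or from the talk.

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat NP and PP chunks.

    NP = DET? ADJ* NOUN+      PP = PREP NP
    Tokens that do not start a chunk are skipped.
    """
    chunks, i = [], 0
    while i < len(tagged):
        start = i
        has_prep = tagged[i][1] == "PREP"
        j = i + 1 if has_prep else i
        if j < len(tagged) and tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        noun_start = j
        while j < len(tagged) and tagged[j][1] == "NOUN":
            j += 1
        if j > noun_start:  # at least one noun was found
            text = " ".join(word for word, _ in tagged[start:j])
            chunks.append(("PP" if has_prep else "NP", text))
            i = j
        else:
            i = start + 1
    return chunks


# Hand-tagged fragment of the example sentence ("Dell" simplified to NOUN).
TAGGED = [("the", "DET"), ("Dell", "NOUN"), ("computer", "NOUN"),
          ("with", "PREP"), ("a", "DET"), ("flat", "ADJ"), ("screen", "NOUN"),
          ("had", "VERB")]
print(chunk(TAGGED))
```

The recursive NP on the slide ("the Dell computer with a flat screen") would need a further step that attaches the PP to the preceding NP; that attachment decision is exactly what makes full phrase analysis harder than chunking.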

Dependency Structure (I)

Semantic Structure

Dependencies between Predicates and Arguments

the Dell computer with a flat screen had to be rejected

PRED: reject
ARG1: ENTITY
ARG2: 'the Dell computer with a flat screen'

‘Logical Form’ : reject(x,y) & animate-entity(x) & computer(y) & …

Dependency Structure Analysis is based on:

Sub-categorization Frames

reject :: Subj:NP, Obj:NP

Selection Restrictions

reject :: Subj:NP:ANIMATE-ENTITY, Obj:NP:ENTITY
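The two notations above can be read as a lookup table plus a type check. The sketch below assumes a small hand-made class hierarchy (PERSON and COMPUTER are invented for the example) and a hypothetical function satisfies; only the frame for reject comes from the slide.

```python
# Assumed toy hierarchy: semantic class -> parent class.
ISA = {"ANIMATE-ENTITY": "ENTITY", "PERSON": "ANIMATE-ENTITY", "COMPUTER": "ENTITY"}

# Sub-categorization frame with selection restrictions, as on the slide:
# reject :: Subj:NP:ANIMATE-ENTITY, Obj:NP:ENTITY
FRAMES = {"reject": {"Subj": ("NP", "ANIMATE-ENTITY"), "Obj": ("NP", "ENTITY")}}


def is_a(sem_class, required):
    """True if `sem_class` equals `required` or is one of its descendants."""
    while sem_class is not None:
        if sem_class == required:
            return True
        sem_class = ISA.get(sem_class)
    return False


def satisfies(pred, args):
    """Check args (role -> (phrase type, semantic class)) against the frame."""
    frame = FRAMES.get(pred)
    if frame is None or set(args) != set(frame):
        return False
    return all(
        args[role][0] == category and is_a(args[role][1], sem_class)
        for role, (category, sem_class) in frame.items()
    )


print(satisfies("reject", {"Subj": ("NP", "PERSON"), "Obj": ("NP", "COMPUTER")}))
print(satisfies("reject", {"Subj": ("NP", "COMPUTER"), "Obj": ("NP", "COMPUTER")}))
```

The second call fails because a computer is not an animate entity, which is how a selection restriction rules out an implausible subject for reject.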

Dependency Structure (II)

The Dell computer that has been rejected was claimed to have suffered from handling.

reject(e1,x1,y1) & animate-entity(x1) & Dell_computer(y1) & claim(e2,x2,e3) & animate-entity(x2) & suffer_from(e3,y1,y2) & handling(y2)

Lexical Functional Grammar (LFG)

[Figure: LFG feature structure and dependency graph for the sentence. Recoverable predicates: claim < NULL, XCOMP >, reject < NULL, SUBJ >, suffer < SUBJ, OBL-from >. Recoverable labels: SUBJ, XCOMP, MOD (Dell), ADJUNCT, OBL-from (handling); y1 : computer.]

Ontology Learning from Text

Some History

Lexical Knowledge Extraction
  Extraction of lexical semantic representations (word meaning) from Machine Readable Dictionaries – 70's/80's
  Extraction of semantic lexicons from corpora for Information Extraction systems – 80's/90's, e.g. CRYSTAL (Soderland)
  Answer extraction in Question Answering, e.g. Webclopedia (Hovy)

Thesaurus Extraction
  Similar work, (complex, multilingual) term extraction, e.g. Sextant (Grefenstette); DR-Link (Liddy)

Ontology Learning from Text
  Similar work, (domain-specific) term / relation extraction, e.g. TextToOnto (Maedche & Staab), OntoLearn (Velardi et al.)
  Discussed here: OntoLT (Buitelaar, Olejnik & Sintek)

TextToOnto

Association Rules

OntoLearn

Domain-Specific WordNet Tuning and Extension

OntoLT: Some Background

Ontology Learning from Text
  Taxonomy Extraction, Document Clustering
    String-based, Document Level
  "Unnamed" Relation Extraction, Word Clustering
    Stemming & Part-of-Speech, Token Level
  Extraction of Terms, "Named" Relations
    Pred-Arg & Head-Mod Structure, Term Level

Text in Ontology Engineering
  Textual Grounding of Concepts
    Retain Linguistic Contexts and Realizations
  Text-based Ontology Monitoring
    Compare Language Use over Time

[Slide labels: TextToOnto, OntoLearn]

OntoLT: Some Background

Ontology Learning from Text
  Taxonomy Extraction, Document Clustering
    String-based, Document Level
  "Unnamed" Relation Extraction, Word Clustering
    Stemming & Part-of-Speech, Token Level
  Extraction of Terms, "Named" Relations
    Pred-Arg & Head-Mod Structure, Term Level

Text in Ontology Engineering
  Textual Grounding of Concepts
    Retain Linguistic Contexts and Realizations
  Text-based Ontology Monitoring
    Compare Language Use over Time

[Slide label: OntoLT]

OntoLT

What is it?
OntoLT provides a middleware solution in ontology development that enables the ontology engineer to bootstrap or extend a domain-specific ontology from a relevant text collection.

How does it work?
1. automatic linguistic annotation
2. automatic statistical preprocessing
3. interactive definition of mapping rules
4. interactive user validation of candidates
5. automatic integration into an ontology

OntoLT: Architecture

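[Architecture diagram. Recoverable components: Corpus; Linguistic Annotation; Annotated Corpus (XML); Definition of Mappings; Mappings: XML (Linguistic Structure) <=> Protégé (Classes, Slots); Extraction; Extracted Ontology; Edit Extracted Ontology (Protégé). Definition of Mappings, Mappings and Extraction are grouped under the label OntoLT.]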

Linguistic Annotation

An 40 Kniegelenkpräparaten wurden mittlere Patellarsehnendrittel mit einer neuen Knochenverblockungstechnik in einem zweistufigen Bohrkanal femoral fixiert.
(English: In 40 knee joint specimens, middle thirds of the patellar tendon were fixed femorally in a two-stage drill channel using a new bone-blocking technique.)

mittlere Patellarsehnendrittel (mid patellar ligament third)

<sentence … >
  …
  <text> … </text>
  <phrases> … </phrases>
  <clauses> … </clauses>
</sentence>

<text>
  …
  <token id="t5" pos="ADJA" str="mittlere">
    <lemma id="t5.l1">mittler</lemma>
  </token>
  <token id="t6" pos="NN" str="Patellarsehnendrittel">
    <lemma id="t6.l1">patellar</lemma>
    <lemma id="t6.l2">Sehne</lemma>
    <lemma id="t6.l3">Drittel</lemma>
  </token>
  …
</text>

<phrases>
  …
  <phrase id="p2" from="t5" to="t6" type="NP">
    <mod from="t5" to="t5" />
    <head from="t6" to="t6" />
  </phrase>
  …
</phrases>

<clauses>
  <clause id="cl1" from="p1" to="p5" pred="p5" type="pass">
    <arg id="a1" type="SUBJ" phrase="none" />
    <arg id="a2" type="IOBJ" phrase="p1" />
    <arg id="a3" type="DOBJ" phrase="p2" />
  </clause>
</clauses>
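To show how such an annotation can be queried, here is a minimal sketch using Python's standard xml.etree.ElementTree. The fragment is a cut-down, well-formed version of the annotation on this slide; the id="s1" attribute on the sentence element and the function name head_modifier_pairs are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# Reduced version of the slide's annotation; the elided parts are left out.
SENTENCE_XML = """
<sentence id="s1">
  <text>
    <token id="t5" pos="ADJA" str="mittlere">
      <lemma id="t5.l1">mittler</lemma>
    </token>
    <token id="t6" pos="NN" str="Patellarsehnendrittel">
      <lemma id="t6.l1">patellar</lemma>
      <lemma id="t6.l2">Sehne</lemma>
      <lemma id="t6.l3">Drittel</lemma>
    </token>
  </text>
  <phrases>
    <phrase id="p2" from="t5" to="t6" type="NP">
      <mod from="t5" to="t5" />
      <head from="t6" to="t6" />
    </phrase>
  </phrases>
</sentence>
"""


def head_modifier_pairs(xml_text):
    """Return (modifier, head) surface strings for every NP phrase."""
    root = ET.fromstring(xml_text)
    # Map token ids to their surface strings.
    surface = {tok.get("id"): tok.get("str") for tok in root.iter("token")}
    pairs = []
    for phrase in root.findall(".//phrase[@type='NP']"):
        mod, head = phrase.find("mod"), phrase.find("head")
        if mod is None or head is None:
            continue
        # Simplification: only the first token of each span is used.
        pairs.append((surface[mod.get("from")], surface[head.get("from")]))
    return pairs


PAIRS = head_modifier_pairs(SENTENCE_XML)
print(PAIRS)
```

The multiple lemma elements under token t6 carry the compound decomposition, so the same lookup could return lemmas instead of surface strings if a normalized class name is wanted.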

Mapping Rules

Precondition Language
  Var (Y, XPath (Y)): get all occurrences of element Y, e.g. HeadNoun, Modifier, Subject, …
  Concat
  ConcatList
  combined through AND, OR, NOT, EQUAL

Operators
  CreateCls: create a new class with super-class
  AddSlot: add a slot with range to a new or existing class
  CreateInst: introduce an instance for a new or existing class
  FillSlot: set the value of a slot of an instance
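A mapping rule pairs a precondition over the linguistic annotation with one or more operators on the ontology. The sketch below is not the OntoLT implementation or its rule syntax: the Ontology class, the two example rules and the dictionary inputs are invented for illustration, and only the operator names CreateCls and AddSlot are borrowed from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    superclass: dict = field(default_factory=dict)  # class -> super-class or None
    slots: dict = field(default_factory=dict)       # class -> {slot: range class}

    def create_cls(self, name, super_cls=None):
        """CreateCls: create a new class with a super-class (kept if it exists)."""
        self.superclass.setdefault(name, super_cls)
        self.slots.setdefault(name, {})

    def add_slot(self, cls, slot, range_cls):
        """AddSlot: add a slot with a range to a new or existing class."""
        self.create_cls(cls)
        self.create_cls(range_cls)
        self.slots[cls][slot] = range_cls


def head_mod_rule(phrase, onto):
    """Precondition: HeadNoun AND Modifier are present in the phrase."""
    head, mod = phrase.get("HeadNoun"), phrase.get("Modifier")
    if head and mod:
        onto.create_cls(head)
        onto.create_cls(f"{mod}_{head}", super_cls=head)


def subj_obj_rule(clause, onto):
    """Precondition: Subject AND Predicate AND Object are present in the clause."""
    subj, pred, obj = clause.get("Subject"), clause.get("Predicate"), clause.get("Object")
    if subj and pred and obj:
        onto.add_slot(subj, pred, obj)


onto = Ontology()
head_mod_rule({"HeadNoun": "screen", "Modifier": "flat"}, onto)
subj_obj_rule({"Subject": "computer", "Predicate": "has", "Object": "screen"}, onto)
print(onto.superclass)
print(onto.slots)
```

In the interactive workflow described in the talk, the output of such rules would be a list of class and slot candidates for the user to validate, not a direct change to the ontology.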

Example Experiment

Ontology Extraction for Neurology
  Neurology Section of a Medical Corpus
  Medical Scientific Journal Abstracts – MuchMore Project

XML-based Linguistic Annotation
  PoS, Lemmatization, Phrases, Pred-Arg Structure

Statistical Preprocessing (chi-square)
  Select Domain-Relevant Linguistic Entities

Definition of Mapping Rules
  Define Operators for Selected Linguistic Entities

Generate & Validate Class/Slot Candidates
  Select Candidates for Integration in Neurology Ontology

Generate "Ontology Fragments" for Neurology
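The chi-square step can be illustrated with a 2x2 contingency test that compares how often a term occurs in the domain corpus and in a reference corpus. This is a sketch under assumptions: the talk does not spell out the exact formulation, the counts below are invented, and the function names are hypothetical. The default threshold 3.84 is the standard critical value for one degree of freedom at p = 0.05.

```python
def chi_square(k_domain, n_domain, k_ref, n_ref):
    """Pearson chi-square for a 2x2 table (no continuity correction).

    k_* = occurrences of the term, n_* = total tokens in each corpus.
    """
    a, b = k_domain, n_domain - k_domain
    c, d = k_ref, n_ref - k_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_relevant(domain_counts, ref_counts, threshold=3.84):
    """Keep terms that are significantly over-represented in the domain corpus."""
    n_domain = sum(domain_counts.values())
    n_ref = sum(ref_counts.values())
    selected = {}
    for term, k_domain in domain_counts.items():
        k_ref = ref_counts.get(term, 0)
        # Compare relative frequencies without dividing.
        over_represented = k_domain * n_ref > k_ref * n_domain
        score = chi_square(k_domain, n_domain, k_ref, n_ref)
        if over_represented and score >= threshold:
            selected[term] = score
    return selected


# Invented toy counts, for illustration only.
DOMAIN = {"neuron": 30, "patient": 20, "the": 50}
REFERENCE = {"neuron": 1, "patient": 19, "the": 80}
print(select_relevant(DOMAIN, REFERENCE))
```

With these counts only "neuron" is kept: "patient" is about equally frequent in both corpora, and "the" is more frequent in the reference corpus.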

Further Issues

Future Development
  Organization of Class/Slot Candidate List
    Inference & Clustering - "Graph Restructuring"
  Extend Statistical Preprocessing
    Multiple Reference Corpora
    Extended Frequency Information
  Include Machine Learning Approach
    Semi-Automatic Definition of Mapping Rules

Performance Evaluation
  Guidelines
    ECAI04 Workshop on OLP
  Benchmark
    Challenge within PASCAL NoE

Evaluation: What? -- Subtasks

Classes
  (Multilingual) Term Extraction
  Named-Entity Recognition
  Similarity Thesaurus
  Term, Document Clustering

Class-Hierarchy (Taxonomy)
  Thesaurus Extraction
  Term, Document Clustering

Class-Properties (Relations)
  Relation Extraction
  ? Formal Properties of Relations (Properties)

Class-Instances (Individuals)
  (Multilingual) Term Extraction
  Named-Entity Recognition
  Term, Document Classification

Evaluation: How?

By Sub-Task – Evaluation of:
  Classes – Term, NE Extraction, Clustering
  Class-Hierarchy – Thesaurus Extraction
  Class-Properties – Relation Extraction
  Class-Instances – Term, NE Extraction, Classification

By Application – Evaluation of:
  Ontology Learning and Population – Gold Standard
  IR, QA – Precision/Recall Increase with Ontology?
  Interactive QA – Increased User Satisfaction?
  Information Access – Increased User Performance?
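For the gold-standard comparison, precision and recall over sets of extracted items are the usual starting point. The sketch below assumes exact string matching between extracted and gold-standard class names, which is a simplification; the example terms are invented.

```python
def precision_recall_f1(extracted, gold):
    """Set-based precision, recall and F1 against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four extracted class candidates, three gold classes.
EXTRACTED = {"neuron", "axon", "patient", "study"}
GOLD = {"neuron", "axon", "synapse"}
P, R, F1 = precision_recall_f1(EXTRACTED, GOLD)
print(P, R, F1)
```

Two of the four candidates are correct (precision 0.5) and two of the three gold classes are found (recall 2/3). The same function applies to hierarchy links or relations if each item is represented as a tuple.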

Conclusions

Stay Tuned

OntoLT Release

To be Announced on Protégé-Discussion List

http://protege.stanford.edu/mailing-lists

Evaluation

Ontology Learning & Population (OLP) Challenge

Within PASCAL NoE - First Task Spring 2005

ECAI04 Workshop: Evaluation of Text-based OLP

http://olp.dfki.de/ECAI04/cfp.htm