Language Technology I 2005/06
Knowledge Extraction/Semantic Web
Paul Buitelaar, German Research Center for Artificial Intelligence (DFKI)
© 2006 Paul Buitelaar


Overview

• Semantic Web
  – Introduction
  – Semantic Web Representation and Query Languages
  – Semantic Web Tools
• Ontologies and Knowledge Markup
  – Ontologies and other Knowledge Organization Systems
  – Knowledge Markup for Ontology Population
  – Ontology Life-Cycle
• Knowledge Extraction
  – Ontology Population
  – Ontology Learning


Semantic Web

Web
[Figure labels: Web; Docs, Data]

Web > Semantic Web
[Figure, built up over several slides. Labels: Web; Docs, Data; Knowledge Markup; Ontologies]

Accessing the Semantic Web - Machines
[Figure labels: Knowledge Markup; Ontologies; Semantic Web Services]

Accessing the Semantic Web - Humans
[Figure labels: Knowledge Markup; Ontologies; Semantic Web Services; Intelligent Man-Machine Interface]

Semantic Web Layer cake

• Introduced by Tim Berners-Lee in 2001
• Built upon existing WWW standards

Resource Description Framework (RDF)

• RDF is an extensible language for expressing graph structures
• Serializes to XML

[Figure: graph with node1 and three outgoing edges: name to "DFKI GmbH", location to "Kaiserslautern", www to http://www.dfki.de]

<?xml version='1.0' ?>
<rdf:RDF
    xmlns:rdf="… rdf-syntax-ns#"
    xmlns:rdfs="… rdf-schema#"
    xmlns="http://example.org">
  <rdf:Description rdf:nodeID="node1">
    <name>DFKI GmbH</name>
    <location>Kaiserslautern</location>
    <www rdf:resource="http://www.dfki.de" />
  </rdf:Description>
</rdf:RDF>
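The correspondence between the XML serialization and the graph can be sketched in a few lines of Python. This is only an illustration for the flat `rdf:Description` pattern used on this slide, not a general RDF/XML parser. The function name `triples_from_rdfxml` is made up for the sketch, the RDF namespace is written out in full, and the default namespace is given a trailing slash (an assumption) so that property URIs come out readable.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The slide's example, with full namespace URIs (assumed "http://example.org/").
RDF_XML = """<?xml version='1.0'?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://example.org/">
  <rdf:Description rdf:nodeID="node1">
    <name>DFKI GmbH</name>
    <location>Kaiserslautern</location>
    <www rdf:resource="http://www.dfki.de"/>
  </rdf:Description>
</rdf:RDF>"""


def triples_from_rdfxml(text):
    """Turn flat rdf:Description elements into (subject, predicate, object) triples."""
    root = ET.fromstring(text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = "_:" + desc.get(f"{{{RDF_NS}}}nodeID")  # blank node label
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join them into one URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            obj = resource if resource is not None else prop.text
            triples.append((subject, predicate, obj))
    return triples


for triple in triples_from_rdfxml(RDF_XML):
    print(triple)
```

Each child element of the description becomes one edge of the graph: a literal object when the element has text, a resource object when it carries `rdf:resource`.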


RDF Schema (RDFS)

• Adds a vocabulary for representing classes and properties to RDF

[Figure: example schema graph. Node labels: Person, Teacher, Student, Course, rdfs:Literal. Edge labels: name, teaches, enrolledIn, is-a (twice).]
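What a class vocabulary buys can be sketched as a small inference step: once is-a links are stated, an instance of a subclass is also an instance of its superclasses. The sketch below assumes that the two is-a edges of the figure run from Teacher and Student to Person; the instance name `anna` and the function names are made up for illustration.

```python
# Assumed reading of the figure: Teacher is-a Person, Student is-a Person.
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}


def superclasses(cls, subclass_of):
    """Follow is-a links upwards and return all superclasses, nearest first."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain


def infer_types(instance_types, subclass_of):
    """Extend each instance's asserted class with all inherited classes."""
    return {
        instance: [cls] + superclasses(cls, subclass_of)
        for instance, cls in instance_types.items()
    }


print(infer_types({"anna": "Teacher"}, SUBCLASS_OF))
```

The sketch handles single inheritance only; RDFS itself allows a class to have several superclasses.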


Web Ontology Language (OWL)

• OWL - Based on Description Logics
• Adds further modelling vocabulary on top of RDFS

[Figure: language stack ordered from Syntax to Semantics]
• XML: XML Schema, Namespaces
• RDF: Interpretation Context
• RDF Schema: Formalization of Classes (Inheritance), Properties
• OWL: Formalization of Classes, Class Definitions, Properties, Property Types (e.g. Transitivity)
(additional label in the figure: Data Types)


Semantic Web Query Languages - SPARQL

• SPARQL - query language developed by W3C
• Syntactically based on SQL:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?foafName
    WHERE {
        ?x foaf:name ?foafName .
        OPTIONAL { ?x foaf:mbox ?mbox } .
    }

• Results available as XML Documents
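The effect of the OPTIONAL clause can be illustrated with a hand-written evaluation over an in-memory list of triples. This is a sketch of the query's semantics under simplifying assumptions, not a SPARQL engine: the data (`Alice`, `Bob`) is invented, and the function returns the optional `?mbox` binding alongside `?foafName` only to make the left-join behaviour visible (the query on the slide projects `?foafName` alone).

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Invented example data: only one of the two persons has a mailbox.
triples = [
    ("_:a", FOAF + "name", "Alice"),
    ("_:a", FOAF + "mbox", "mailto:alice@example.org"),
    ("_:b", FOAF + "name", "Bob"),
]


def select_names(graph):
    """Match '?x foaf:name ?foafName' and left-join '?x foaf:mbox ?mbox'."""
    rows = []
    for subject, predicate, obj in graph:
        if predicate != FOAF + "name":
            continue
        mboxes = [o for s, p, o in graph if s == subject and p == FOAF + "mbox"]
        # OPTIONAL: keep the row even when no mailbox matches (binding stays None).
        for mbox in mboxes or [None]:
            rows.append({"foafName": obj, "mbox": mbox})
    return rows


for row in select_names(triples):
    print(row)
```

Without OPTIONAL, the row for Bob would be dropped because the second pattern has no match for him.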


Semantic Web Tools

• Programming APIs
  – Jena - Java
  – Redland – Python, …
  – RAP - PHP
• Editors
  – Protégé
  – OntoStudio
  – Triple20 - Prolog
• Storage
  – Sesame
  – OntoBroker


Ontologies and Knowledge Markup


Ontologies in Philosophy

• Ontology is a branch of philosophy that deals with the nature and the organization of reality

• Science of Being (Aristotle, Metaphysics)
  – What characterizes being?
  – Ultimately, what is being?


Ontologies in Computer Science

• Ontology refers to an engineering artifact
  – a specific vocabulary used to describe a certain reality
  – a set of explicit assumptions regarding the intended meaning of the vocabulary
• An Ontology is
  – an explicit specification of a conceptualization [Gruber 93]
  – a shared understanding of a domain of interest [Uschold/Gruninger 96]


Why Develop an Ontology?

• Make domain assumptions explicit
  – Easier to change domain assumptions
  – Easier to understand and update legacy data

• Separate domain knowledge from operational knowledge
  – Re-use domain and operational knowledge separately

• A community reference for applications

• Shared understanding of what information means


Types of Ontologies

[Guarino, 98]

• Top-level ontologies describe very general concepts like space, time, event, which are independent of a particular problem or domain. It seems reasonable to have unified top-level ontologies for large communities of users.
• Domain ontologies describe the vocabulary related to a generic domain by specializing the concepts introduced in the top-level ontology.
• Task ontologies describe the vocabulary related to a generic task or activity by specializing the top-level ontologies.
• Application ontologies are the most specific ontologies. Concepts in application ontologies often correspond to roles played by domain entities while performing a certain activity.


Ontologies and Their Relatives

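[Figure: spectrum of knowledge organization systems, in the order shown on the slide]
Catalog / ID; Terms / Glossary; Thesauri; Informal Is-a; Formal Is-a; Formal Instance; Frames; Value Restrictions; General logical constraints; Axioms (Disjoint, Inverse Relations, ...)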


Knowledge Organization Systems

• Semantic Lexicons – e.g. WordNet
  … group together words according to lexical semantic relations like synonymy, hyponymy, meronymy, antonymy, etc.

• Thesauri
  … group together domain terms according to a set of taxonomic relations, including broader term, narrower term, sibling, etc.

• Semantic Networks and Ontologies
  … group together classes of objects according to a set of relations that originate in the nature of the domain of application.

Ontologies are defined by a formal semantics, but semantic networks may be informally defined. Therefore all ontologies are semantic networks, but not all semantic networks are ontologies.


Thesauri - Examples

MeSH (Medical Subject Headings) is organized by terms (currently over 250,000), each of which corresponds to a specific medical subject. For each term a list of syntactic, morphological or semantic variants is given.

MeSH Heading   Databases, Genetic
Entry Term     Genetic Databases
Entry Term     Genetic Sequence Databases
Entry Term     OMIM
Entry Term     Online Mendelian Inheritance in Man
Entry Term     Genetic Data Banks
Entry Term     Genetic Data Bases
Entry Term     Genetic Databanks
Entry Term     Genetic Information Databases
See Also       Genetic Screening

EuroVoc covers terminology in all of the official EU languages for all fields that concern the EU institutions (e.g., politics, trade, law, science, energy, agriculture), 27 such fields in total.

MT    3606 natural and applied sciences
UF    gene pool
      genetic resource
      genetic stock
      genotype
      heredity
BT1   biology
BT2   life sciences
NT1   DNA
NT1   eugenics
RT    genetic engineering (6411)


Semantic Networks - Examples

UMLS (Unified Medical Language System) integrates linguistic, terminological and semantic information. The Semantic Network consists of 134 semantic types and 54 relations between types.

Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function

GO (Gene Ontology) allows for “consistent descriptions of gene products in different databases, including several of the world’s major repositories for plant, animal and microbial genomes…” Organizing principles are molecular function, biological process and cellular component.

Accession: GO:0009292
Ontology: biological process
Synonyms: broad: genetic exchange
Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineage
all : all (164142)
  GO:0008150 : biological process (115947)
    GO:0007275 : development (11892)
      GO:0009292 : genetic transfer (69)


Example Ontology

Consider an Example Ontology for the Newspaper Domain


Knowledge Markup

• Ontologies are used to semantically organize and retrieve data (structured, textual, multimedia) through knowledge markup

• Knowledge Markup from Text is based on Named-Entity Recognition, Semantic Tagging (Term to Class Mapping) and Relation Extraction

Consider the following example:

<news:story xmlns:jobs="http://www.jobs.org/owl-jobs#"
            xmlns:com="http://www.companies.org/owl-companies#"
            xmlns:it="http://www.it.net/owl-it#">

“We were surprised by several of the results, particularly the order of finish,” said <jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst>. <com:Company>IBM</com:Company> finished first with very strong results, and <com:Company>HP</com:Company> scored a solid number two; we expected to see <com:Company>Sun Microsystems</com:Company> challenging for first place or at least a strong second place. As the largest <it:operatingsystem>UNIX</it:operatingsystem> vendor in terms of number of installed systems, a third place finish should put their management on notice that their installed base may be vulnerable.

</news:story>
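Once text carries this kind of markup, the annotated instances can be read off mechanically. The following Python sketch collects (instance text, class URI) pairs from an abbreviated version of the example. Two things are assumptions made for the sketch: the `news` namespace URI (`http://example.org/news#`), which is not declared on the slide, and the shortened story text.

```python
import xml.etree.ElementTree as ET

# Abbreviated example; the "news" namespace URI is a placeholder assumption.
MARKUP = """<news:story xmlns:news="http://example.org/news#"
    xmlns:jobs="http://www.jobs.org/owl-jobs#"
    xmlns:com="http://www.companies.org/owl-companies#"
    xmlns:it="http://www.it.net/owl-it#">
<jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst> said that
<com:Company>IBM</com:Company> finished first and <com:Company>HP</com:Company> second,
ahead of the largest <it:operatingsystem>UNIX</it:operatingsystem> vendor.
</news:story>"""


def extract_instances(xml_text):
    """Return (surface text, ontology class URI) for every annotated span."""
    root = ET.fromstring(xml_text)
    instances = []
    for element in root.iter():
        if element is root:
            continue  # the story element itself is a container, not an annotation
        # Tags look like "{namespace}ClassName"; rebuild the class URI from both parts.
        namespace, _, class_name = element.tag[1:].partition("}")
        instances.append((element.text, namespace + class_name))
    return instances


for text, cls in extract_instances(MARKUP):
    print(text, "->", cls)
```

The resulting pairs are the raw material for ontology population: each one proposes an instance of a class defined in one of the referenced ontologies.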


Knowledge Markup - Images

Semantic Annotation of Medical Images

(miAKT Project - UK)


Knowledge Markup - Images

Semantic Annotation of Video

(SmartMedia – DFKI KM)


Ontology Life-Cycle

• Create/Select: Development and/or Selection
• Populate: Knowledge Base Generation
• Validate: Consistency Checks
• Evolve: Extension, Modification
• Maintain: Usability Tests
• Deploy: Knowledge Retrieval


Knowledge Extraction

Ontology Population & Ontology Learning


Ontology Life-Cycle – Ontology Population

[Figure: the ontology life-cycle diagram (Create/Select, Populate, Validate, Evolve, Maintain, Deploy), shown again; this part covers the Populate step, Knowledge Base Generation.]

Ontology Population with SOBA

SOBA: SmartWeb Ontology-based Annotation

• Application Context
  – SmartWeb (http://www.smartweb-projekt.de/) – German Project around World-Cup 2006
  – Integrates
      Multimodal Dialog Processing
      IR-based Question Answering
      Ontology-Based Information Extraction
      Semantic Web Services

• Ontology-Based Information Extraction …
  – Combines:
      Semantic Wrapping of Semi-Structured Data
      Semantic and Linguistic Annotation of Free Text
      Inference Rules for Instantiation and Integration of Annotated Entities and Events

• … and Display
  – Ontology-driven Hyperlink Generation for Display of Extracted Information

SOBA – Processing and Data Flow

[Figure: data-flow diagram. Processing components: Wrapping of Semi-Structured Data; PDF Analysis; Image Extraction; Linguistic Annotation; Named Entity Recognition & Semantic Tagging; Inference Rules for Instantiation & Integration. Data stores: Documents, Ontologies, Knowledge Base.]

SWIntO: SmartWeb Integrated Ontology

[Figure: excerpt of the class hierarchy with SmartDOLCE:Entity, SmartSUMO:Attribute, SmartSUMO:SocialRole, SmartSUMO:Proposition, SportEvent:FootballPlayer, SportEvent:Goalkeeper, SportEvent:FootballOrganizationPerson, SportEvent:FootballClubPresident, …]

SWIntO (by AIFB, DFKI KM/IUI, EML) covers
• Foundational (DOLCE) and General (SUMO) Knowledge
• Domain- and Task-Specific Knowledge
  – Football / Sport Events
  – Navigation, Discourse, Multimedia
  – other

SmartWeb Integrated Ontology (by AIFB, DFKI KM/IUI, EML)

SmartWeb Corpus

• (Growing) Web Corpus through Monitor on
  – http://fifaworldcup.yahoo.com/
  – http://www.uefa.com/competitions/worldcup

• Semi-Structured Data
  – Tabular: Match Reports, Teams, etc.

• Free Text
  – Match Reports
  – Image Captions


Semi-Structured Data - HTML


Semi-Structured Data - XML


Semi-Structured Data – F-Logic

Information Extraction from Free Text

[Figure: match report text with extracted MatchEvent [Score, Team1, Team2] and FootballPlayer]

Information Extraction from Image Captions

[Figure: image caption with extracted FoulEvent [FootballPlayer] and FootballPlayer]


Linguistic and Semantic Annotation

Mark Crossley saved twice with his legs from Huckerby.

Named Entity Recognition & Semantic Tagging

[Mark Crossley GOALKEEPER] [saved GOALKEEPER_ACTION] twice with his legs from [Huckerby PLAYER].

Linguistic Annotation

[Mark Crossley GOALKEEPER : SUBJ] [saved PRED : GOALKEEPER_ACTION] twice [with his legs PP_OBJ] [from [Huckerby PLAYER] PP_ADJUNCT].

[ GOALKEEPER_ACTION = 'save', GOALKEEPER = 'Mark Crossley', PLAYER = 'Huckerby', MANNER = 'legs' ]
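The step from a semantically tagged sentence to a filled template can be sketched with a regular expression over the bracket notation used above. This is only an illustration of the idea: it covers the semantic tags, not the linguistic annotation, so it keeps the surface form "saved" rather than the lemma "save" and it cannot fill the MANNER slot, both of which need the parsing step shown on the slide. The names `SPAN` and `fill_template` are made up.

```python
import re

TAGGED = (
    "[Mark Crossley GOALKEEPER] [saved GOALKEEPER_ACTION] "
    "twice with his legs from [Huckerby PLAYER]."
)

# A tagged span is "[surface text LABEL]", with LABEL in upper case.
SPAN = re.compile(r"\[([^\[\]]+?) ([A-Z_]+)\]")


def fill_template(tagged_sentence):
    """Map each semantic label to the text span it annotates."""
    return {label: text for text, label in SPAN.findall(tagged_sentence)}


print(fill_template(TAGGED))
```

A dictionary keyed by label is enough here because each label occurs once; a sentence with two PLAYER spans would need a list per label.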


Annotation/Extraction Example

Example Sentence from Match Report

Allerdings ist Petrow fuer die Partie gegen Schweden gesperrt und kann erst gegen Ungarn eingesetzt werden.

“However, Petrow is banned for the match against Sweden and can only be fielded again against Hungary.”

Annotated/Extracted Information (with SProUT IE Tool - DFKI-LT )

player_action & [GAME_EVENT "Ban",
                 AGENT player & [SURNAME "PETROW"],
                 IN_MATCH game & [TEAM2 "SWE", TOURNAMENT "Match"]]
team & [NAME "HUN"]

Knowledge Base Generation

Transformation of SProUT Output to F-Logic via Declarative Mappings, e.g.:

<type orig="player" target="dolce#natural-person-denomination">
  <link type="dolce#natural-person" method="dolce#HAS-DENOMINATION" id=""/>
  <map>
    <simple-mapping>
      <input>
        <arg orig="GIVEN_NAME" target="VAR1"/>
      </input>
      <output method="dolce#FIRSTNAME" value="VAR1"/>
    </simple-mapping>
    <simple-mapping>
      <input>
        <arg orig="SURNAME" target="VAR1"/>
      </input>
      <output method="dolce#LASTNAME" value="VAR1"/>
    </simple-mapping>
  </map>
</type>

SProUT to F-Logic

SProUT:

<FS type="player_action">
  <F name="GAME_EVENT"><FS type="world champion"/></F>
  <F name="ACTION_TIME"><FS type="1990"/></F>
  <F name="ACTION_LOCATION"><FS type="Italy"/></F>
  <F name="AGENT">
    <FS type="player">
      <F name="SURNAME"><FS type="Buchwald"/></F>
      <F name="GIVEN_NAME"><FS type="Guido"/></F>
    </FS>
  </F>
</FS>

F-Logic:

soba#player124:sportevent#FootballPlayer
  [sportevent#impersonatedBy -> soba#Guido_BUCHWALD].

soba#Guido_BUCHWALD:dolce#"natural-person"
  [dolce#"HAS-DENOMINATION" -> soba#Guido_BUCHWALD_Denomination].

soba#Guido_BUCHWALD_Denomination:dolce#"natural-person-denomination"
  [dolce#LASTNAME -> "Buchwald"; dolce#FIRSTNAME -> "Guido"].
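The mapping step can be sketched as a table lookup: a declarative mapping says which source feature feeds which ontology method, and a small function applies it to one feature structure. The sketch below is a simplified stand-in, not the SOBA implementation. The dictionary form of the feature structure, the `MAPPING` layout and the function `to_flogic` are assumptions, and only the denomination fact of the example is generated.

```python
# Hypothetical in-memory form of the AGENT feature structure from the example.
agent = {"type": "player", "SURNAME": "Buchwald", "GIVEN_NAME": "Guido"}

# Declarative mapping: source type -> target class and feature-to-method table.
MAPPING = {
    "player": {
        "target": 'dolce#"natural-person-denomination"',
        "features": {"GIVEN_NAME": "dolce#FIRSTNAME", "SURNAME": "dolce#LASTNAME"},
    }
}


def to_flogic(fs, mapping):
    """Render one feature structure as an F-Logic instance statement."""
    rule = mapping[fs["type"]]
    # Instance identifier built from the name parts, as in the slide's example.
    instance = "soba#{}_{}_Denomination".format(fs["GIVEN_NAME"], fs["SURNAME"].upper())
    slots = "; ".join(
        '{} -> "{}"'.format(method, fs[feature])
        for feature, method in rule["features"].items()
        if feature in fs
    )
    return "{}:{}[{}].".format(instance, rule["target"], slots)


print(to_flogic(agent, MAPPING))
```

Keeping the mapping in data rather than in code is the point of the declarative approach: supporting a new source type means adding a table entry, not writing a new transformation.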

A Complex Example

semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO":sportevent#FieldMatchFootballPlayer [
  externalRepresentation@(de) ->> "Luis CRISTALDO (7)";
  sportevent#number -> 7;
  sportevent#impersonatedBy -> semistruct#"Luis_CRISTALDO"
].

semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00" [
  sportevent#matchEvents -> soba#ID25
].

soba#ID25:sportevent#Foul [
  sportevent#commitedBy -> semistruct#"Bolivien_vs_Brasilien_09_Oct_05_Luis_CRISTALDO"
].

mediainst#ID67:media#Picture [
  media#URL -> "http://fifaworldcup.yahoo.com/06/de/photos/index.html?aid=124155&d=1";
  media#shows -> ID25
].


Display of Extracted Information


Ontology Life-Cycle – Ontology Learning

[Figure: the ontology life-cycle diagram (Create/Select, Populate, Validate, Evolve, Maintain, Deploy), shown again; this part covers ontology learning.]

Ontology Learning Layer Cake

[Figure: layers from bottom to top, each with an example]
• Terms: disease, doctor, hospital
• (Multilingual) Synonyms: {disease, illness, Krankheit}
• Concepts: DISEASE:=<Int, Ext, Lex>
• Taxonomy: is_a(DOCTOR, PERSON)
• Relations: cure(dom:DOCTOR, range:DISEASE)
• Rules & Axioms: $\forall x, y\,(\mathit{sufferFrom}(x, y) \rightarrow \mathit{ill}(x))$

Introduced in: Philipp Cimiano, PhD Thesis, University of Karlsruhe, forthcoming


Some Current Work on Ontology Learning from Text

• Term Extraction
  – Statistical Analysis
  – Patterns
  – (Shallow) Linguistic Parsing
  – Term Disambiguation & Compositional Interpretation
  – Combinations

• Taxonomy Extraction
  – Statistical Analysis & Clustering (e.g. FCA)
  – Patterns
  – (Shallow) Linguistic Parsing
  – WordNet
  – Combinations

• Relation Extraction
  – Anonymous Relations (e.g. with Association Rules)
  – Named Relations (Linguistic Parsing)
  – (Linguistic) Compound Analysis
  – Web Mining, Social Network Analysis
  – Combinations

• Definition Extraction
  – (Linguistic) Compound Analysis (incl. WordNet)

Overview of Current Work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini, Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.

RelExt - Relation Extraction for Ontology Learning

[Figure: the Ontology Learning Layer Cake (Terms, (Multilingual) Synonyms, Concepts, Taxonomy, Relations, Rules & Axioms), shown again.]


RelExt - Motivation

• Extend Ontology with Relations
  – Currently ~ 60 Relations in the Sport Events Ontology
  – Mostly Properties, e.g. hasName, atMinute, …
• Representation of (Verbal) Relations Enables Better Modeling of Events for Information Extraction Purposes

Example

“Ballack shoots the ball into the net.”

Relation: Shoot (Domain: FootballPlayer, Range: BallObject)

RelExt – System Architecture

[Figure: processing pipeline in three stages]
• Linguistic Annotation: Corpus; Named-Entity Rec. & Semantic Tagging; Shallow Parsing; Annotated Corpus
• Statistical Processing: Relevance Measure (Frequencies in BNC, NZZ); Relevance Scores (Heads, Preds); Co-occurrence Measure; Co-occurrence Scores (Heads <> Preds)
• Relation Extraction and Evaluation: Triple Generation; Triples (Head : Pred : Head); Evaluation
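The relevance step compares how often a head noun or predicate occurs in the domain corpus with its frequency in a reference corpus such as BNC or NZZ. The sketch below assumes that the measure is a chi-square statistic over a 2x2 contingency table; the architecture slide does not name the measure, although the ranking tables carry a column that looks like a χ² score. All counts are invented and the function names are hypothetical.

```python
def chi_square(term_domain, total_domain, term_reference, total_reference):
    """Pearson chi-square for a 2x2 table: term vs. other tokens, domain vs. reference."""
    a = term_domain                       # term occurrences in the domain corpus
    b = term_reference                    # term occurrences in the reference corpus
    c = total_domain - term_domain        # other tokens in the domain corpus
    d = total_reference - term_reference  # other tokens in the reference corpus
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def rank_by_relevance(domain_counts, reference_counts, total_domain, total_reference):
    """Sort terms by decreasing chi-square score."""
    scored = [
        (term, chi_square(count, total_domain, reference_counts.get(term, 0), total_reference))
        for term, count in domain_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy counts: "Ball" is typical for the domain, "Jahr" is not.
domain = {"Ball": 50, "Jahr": 10}
reference = {"Ball": 10, "Jahr": 10}
for term, score in rank_by_relevance(domain, reference, 100, 100):
    print(term, round(score, 2))
```

A term that is equally frequent in both corpora scores zero, so general vocabulary drops to the bottom of the ranking while domain-specific heads and predicates rise.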


Linguistic Annotation

• Named-Entity Recognition
  “Michael Ballack” : FootballPlayer

• Semantic Tagging
  “Ball” (ball), “Leder” (leather) : BallObject

• Shallow Parsing
  – Part-of-Speech Tagging
    Fussballspieler (soccer player): Noun
  – Morphological Analysis
    Fussballspieler: Fussball – Spieler
  – Dependency Structure Analysis
    “The team won the second match.” SUBJECT PREDICATE DIRECT_OBJECT

Relevance Ranking

Top-10 Head-Nouns before and after mapping to Ontology Classes

Rank  χ²          Headnoun                   Frequency
1     125245.24   Ball (ball)                6849
2     121888.52   Tor (goal)                 7767
3     95003.21    Meter (meters)             5967
4     64157.18    Schuss (shot / drive)      3575
5     57185.76    Ecke (corner)              3132
6     45474.96    Strafraum (penalty area)   2298
7     34668.11    Freistoss (freekick)       1752
8     30017.75    Leder (leather / ball)     1561
9     27989.09    Flanke (cross)             1479
10    27414.66    Pfosten (post)             1457

Rank  χ²          Concept Label    Frequency
1     565510.99   FOOTBALLPLAYER   28494
2     162137.82   GOALOBJECT       8188
3     143528.88   BALLOBJECT       7249
4     138535.44   GOALKEEPER       6887
5     70814.86    SHOT             3578
6     49018.16    TEAM             2477
7     45474.96    PENALTYAREA      2298
8     34668.11    FREEKICK         1752
9     29324.54    WING             1482
10    28829.78    POST             1457

Top-10 Predicates

Rank  χ²         Predicate                       Frequency
1     27167.41   flanken (to cross)              1373
2     22045.39   klaeren (to clear)              1435
3     21908.37   schiessen (to shoot)            1503
4     20439.09   koepfen (to head)               1033
5     16342.99   lassen (to let / to leave)      826
6     9563.41    ziehen (to pull / to drag)      1548
7     9468.57    passen (to pass / to play)      814
8     7752.84    spielen (to play / to pass)     1559
9     7653.68    lenken (to divert)              537
10    7637.45    parieren (to parry / to save)   405

Co-Occurrence Analysis

Rank  χ²          Concept Label    Frequency
1     565510.99   FOOTBALLPLAYER   28494
2     162137.82   GOALOBJECT       8188
3     143528.88   BALLOBJECT       7249
4     138535.44   GOALKEEPER       6887
…

Rank  χ²         Predicate              Frequency
1     27167.41   flanken (to cross)     1373
2     22045.39   klaeren (to clear)     1435
3     21908.37   schiessen (to shoot)   1503
4     20439.09   koepfen (to head)      1033
…

flanken SUBJ:FOOTBALLPLAYER “Klasnic”
flanken DOBJ:FOOTBALLPLAYER “Klose”
flanken_in PP_ADJ “Zuschauer” (audience)
…
beschimpfen (to insult) SUBJ:FOOTBALLPLAYER “Klasnic”
…
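How co-occurrence observations of this kind could be turned into Head : Pred : Head triples is sketched below. The scoring is deliberately crude (raw counts, most frequent concept per grammatical role) and is an assumption of the sketch, since the slides do not state which co-occurrence measure is used. The observations are invented and `candidate_triples` is a hypothetical name.

```python
from collections import Counter

# Invented dependency observations: (predicate, grammatical role, concept label).
observations = [
    ("flanken", "SUBJ", "FOOTBALLPLAYER"),
    ("flanken", "SUBJ", "FOOTBALLPLAYER"),
    ("flanken", "DOBJ", "BALLOBJECT"),
    ("flanken", "DOBJ", "FOOTBALLPLAYER"),
    ("flanken", "DOBJ", "BALLOBJECT"),
    ("beschimpfen", "SUBJ", "FOOTBALLPLAYER"),
]


def candidate_triples(obs):
    """Pair the most frequent subject concept with the most frequent object concept."""
    counts = Counter(obs)
    best = {}  # (predicate, role) -> (concept, count)
    for (pred, role, concept), n in counts.items():
        if n > best.get((pred, role), ("", 0))[1]:
            best[(pred, role)] = (concept, n)
    triples = []
    for (pred, role), (concept, _) in best.items():
        # A triple needs both ends: a subject head and a direct-object head.
        if role == "SUBJ" and (pred, "DOBJ") in best:
            triples.append((concept, pred, best[(pred, "DOBJ")][0]))
    return triples


print(candidate_triples(observations))
```

Predicates observed with only one role, like beschimpfen here, yield no triple; a real system would also weigh the counts against chance co-occurrence before proposing a relation.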


Integration into Ontology Development

OntoLT – Protégé Plug-In for Ontology Extraction from Text

[Figure: the Ontology Learning Layer Cake (Terms, (Multilingual) Synonyms, Concepts, Taxonomy, Relations, Rules & Axioms), shown again.]


OntoLT – Basic Idea

• Middleware Solution in Ontology Development
  – Supports the Ontology Engineer through Semi-Automatic Extraction of Ontology Fragments from Domain-Relevant Document Collections
  – Download http://olp.dfki.de/OntoLT/OntoLT.htm

• Based on
  – Automatic Linguistic Annotation
  – Manual Definition of Mapping Rules
  – Statistical Preprocessing (Option)
  – Interactive Validation of Candidates
  – Generation in Protégé of Ontology Fragments

OntoLT – System Architecture

[Figure: Corpus; Linguistic Annotation; Annotated Corpus (XML); Definition of Mappings; Mappings: XML (Linguistic Structure) <=> Protégé (Classes, Slots); Extraction; Extracted Ontology; Edit Extracted Ontology in Protégé. The mapping and extraction components form OntoLT.]
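The idea of mapping linguistic structure to classes and slots can be sketched as follows. This is not OntoLT's rule language: the two rules (head noun becomes a class, a subject-predicate-object pattern becomes a slot) and all names are hypothetical, and the input is a hand-made stand-in for the XML annotation, using the sentence "The team won the second match." from the linguistic annotation example.

```python
# Hand-made stand-in for linguistically annotated sentences.
annotated = [
    {"subject_head": "team", "predicate": "win", "object_head": "match"},
]


def apply_mappings(sentences):
    """Hypothetical rules: head nouns become classes, SUBJ-PRED-OBJ becomes a slot."""
    classes, slots = set(), set()
    for s in sentences:
        subject_class = s["subject_head"].capitalize()
        classes.add(subject_class)
        if s.get("object_head"):
            object_class = s["object_head"].capitalize()
            classes.add(object_class)
            # Slot on the subject class, named after the predicate, ranging over the object class.
            slots.add((subject_class, s["predicate"], object_class))
    return classes, slots


classes, slots = apply_mappings(annotated)
print(sorted(classes), sorted(slots))
```

In the workflow described on the slides, candidates produced this way are validated interactively before ontology fragments are generated in Protégé.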


Corpus Example – KMI News


Mapping Rules


Statistical Relevance


Extract Candidates


Generate Ontology Fragments


Exercises

• Knowledge Extraction
  – Ontology Modeling (from Text)
  – Ontology Population
  – Ontology Learning (Extension)
  – Ontology Mapping