AKSW, Universität Leipzig - BOA: How To Integrate Your Language - Daniel Gerber, Axel-Cyrille Ngonga Ngomo

BOA - How To Integrate Your Language

BOA extracts knowledge, in the form of binary relations, from unstructured data such as free text. This tutorial uses Korean as the running example to show how to adapt the BOA approach to your own language.

Page 1: BOA - How To Integrate Your Language

AKSW, Universität Leipzig

BOA: How To Integrate Your Language

Daniel Gerber, Axel-Cyrille Ngonga Ngomo

Page 2: BOA - How To Integrate Your Language

Bootstrapping the Data Web

AKSW@KAIST - http://boa.aksw.org - 17.01.2012

General Overview

1. Corpus
2. Indexing
3. Background knowledge
4. Surface forms
5. Korean features
6. Search & scoring
7. RDF extraction
8. Evaluation

Page 3: BOA - How To Integrate Your Language

1. Create a corpus in your language

๏ At least 25M sentences

๏ Chunked into one sentence per line

๏ No HTML

๏ UTF-8?

๏ For later coreference resolution, the resource URL needs to be available
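
The corpus requirements above can be sketched as a small preparation step. This is only an illustration, assuming the raw documents arrive as HTML strings: the tag-stripping regex, the naive sentence splitter and the names `to_sentence_lines` and `write_corpus` are assumptions made for this example and are not part of BOA. A real pipeline would use a proper sentence boundary detector for the target language.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # crude HTML tag matcher (assumption)
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")  # naive boundary: end punctuation + whitespace


def to_sentence_lines(raw_html):
    """Strip markup and return the text as a list of sentences."""
    text = html.unescape(TAG_RE.sub(" ", raw_html))
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENTENCE_END_RE.split(text) if s]


def write_corpus(documents, path):
    """Write all sentences to `path`, one sentence per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        for doc in documents:
            for sentence in to_sentence_lines(doc):
                out.write(sentence + "\n")
```

Keeping the source URL next to each sentence (for example in a parallel file) would cover the coreference requirement, but how BOA stores it is not stated here.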

Page 4: BOA - How To Integrate Your Language

2. Corpus indexing

๏ Apache Lucene 3.4.0

๏ Set of >20 UTF-8 RegEx filters

๏ Whitespace Analyzer

➡ No stemming

➡ Tokenization on whitespace only

➡ Stop-words included in index

➡ Lowercase version
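
To make the indexing choices concrete, here is a plain-Python stand-in for the index. It does not use the Lucene API; it only mimics the behaviour listed above (whitespace tokenization, no stemming, stop words kept). Lowercasing everything before indexing is an assumption about what "lowercase version" means, and the function names are hypothetical.

```python
from collections import defaultdict


def tokenize(sentence):
    """Split on whitespace only: no stemming, no stop-word removal."""
    return sentence.split()


def build_index(sentences):
    """Map each lowercased token to the ids of the sentences containing it."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for token in tokenize(sentence.lower()):
            index[token].add(sentence_id)
    return index


def search(index, phrase):
    """Return ids of sentences that contain every token of the phrase."""
    tokens = tokenize(phrase.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because tokens are split on whitespace only, punctuation stays attached to words ("Honolulu." is a different token from "Honolulu"), which is one reason a set of cleaning filters before indexing is useful.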

Page 5: BOA - How To Integrate Your Language

3. Background knowledge I

ObjectProperties vs. DatatypeProperties

Page 6: BOA - How To Integrate Your Language

3. Background knowledge II

Field | Line #1 | Line #2
URI1 | http://dbpedia.org/resource/South_Korea | http://dbpedia.org/resource/KAIST
Label1 | 대한민국 | 한국 과학 기술원
Property | http://dbpedia.org/ontology/capital | http://dbpedia.org/ontology/country
URI2 | http://dbpedia.org/resource/Seoul | http://dbpedia.org/resource/South_Korea
Label2 | 서울 | 대한민국
Domain | http://dbpedia.org/ontology/PopulatedPlace | ⎯
Range | http://dbpedia.org/ontology/PopulatedPlace | http://dbpedia.org/ontology/Country
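
One way to hold such a background knowledge line in code is a small record with the seven fields shown above. The slides do not specify the file format, so the tab separator, the "empty field means missing" rule and the names `BackgroundKnowledge` and `parse_line` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackgroundKnowledge:
    uri1: str
    label1: str
    prop: str
    uri2: str
    label2: str
    domain: Optional[str]   # may be missing, as in line #2 above
    range_: Optional[str]


def parse_line(line, sep="\t"):
    """Parse one record; the separator is an assumption, not BOA's format."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    uri1, label1, prop, uri2, label2, domain, range_ = fields
    return BackgroundKnowledge(
        uri1, label1, prop, uri2, label2, domain or None, range_ or None
    )
```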

Page 7: BOA - How To Integrate Your Language

4. Surface form generation

๏ DBpedia Spotlight

➡ Labels

➡ Redirects

➡ Disambiguation

๏ Datatype Properties

➡ Person XY is born on 1st of October in 1972.

➡ Person XY is born on 1 October in 1972.

➡ Person XY is born on a Thursday in 1972

๏ Find and create those surface forms
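
For datatype properties such as dates, surface forms can be generated rather than looked up. The sketch below produces a few English variants of a date value as a stand-in for the target language; the chosen variants and the function names are assumptions, and a Korean version would need its own month, weekday and ordinal conventions.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]  # date.weekday() returns 0 for Monday


def ordinal(day):
    """Return the English ordinal for a day of the month, e.g. 1 -> '1st'."""
    if 11 <= day % 100 <= 13:
        return f"{day}th"
    return f"{day}" + {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")


def date_surface_forms(d):
    """Return several textual variants under which a date may appear."""
    month = MONTHS[d.month - 1]
    return [
        f"{ordinal(d.day)} of {month}",
        f"{d.day} {month}",
        f"{month} {d.day}",
        WEEKDAYS[d.weekday()],
        str(d.year),
    ]
```

Calling `date_surface_forms(date(1972, 10, 1))` yields, among others, "1st of October" and "1 October", which correspond to the first two example sentences above.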

Page 8: BOA - How To Integrate Your Language

5. Korean feature extraction

Language-dependent features

๏ ReVerb

๏ WordNet distance

๏ ?

๏ ?

Language-independent features

๏ # of words

๏ # of stopwords

๏ # of occurrences

๏ ?

๏ ?
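
The three count-based features can be computed without language-specific tools, apart from a stop-word list. A minimal sketch follows, assuming a pattern is a plain string and "occurrences" means the number of corpus sentences that contain it; both the definition and the function name are assumptions, not BOA's exact feature definitions.

```python
def language_independent_features(pattern, sentences, stopwords):
    """Compute simple count features for one pattern string."""
    needle = pattern.lower()
    tokens = needle.split()
    return {
        "words": len(tokens),
        "stopwords": sum(1 for t in tokens if t in stopwords),
        # number of sentences in which the pattern occurs as a substring
        "occurrences": sum(1 for s in sentences if needle in s.lower()),
    }
```

The stop-word list itself has to be supplied per language, so only the computation is language independent.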

Page 9: BOA - How To Integrate Your Language

6. Pattern search and scoring

Barack Obama was born in Honolulu.

➡ Pattern: was born in

버락 오바마는 호놀룰루에서 태어났습니다.

➡ Subject? Object? Predicate?
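
The difficulty hinted at above can be seen by splitting each sentence around the two entity labels. In the English sentence the predicate sits between the labels; in the Korean sentence it follows the second label. The helper below is a hypothetical illustration of that observation, not BOA's search code.

```python
def split_around_labels(sentence, first_label, second_label):
    """Return (text between the labels, text after the second label), or None."""
    start = sentence.find(first_label)
    if start < 0:
        return None
    end_first = start + len(first_label)
    second = sentence.find(second_label, end_first)
    if second < 0:
        return None
    between = sentence[end_first:second].strip()
    after = sentence[second + len(second_label):].strip()
    return between, after


english = split_around_labels(
    "Barack Obama was born in Honolulu.", "Barack Obama", "Honolulu")
korean = split_around_labels(
    "버락 오바마는 호놀룰루에서 태어났습니다.", "버락 오바마", "호놀룰루")
```

Here `english` is `("was born in", ".")`, while `korean` is `("는", "에서 태어났습니다.")`: a pattern search for Korean therefore has to look at the text after the second label, not only between the two.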

Page 10: BOA - How To Integrate Your Language

Named Entity Disambiguation!

7. RDF extraction

Barack Obama was born in Honolulu.

➡ Pattern: was born in

➡ Triple: Barack Obama, dbpedia-owl:birthPlace, Honolulu

버락 오바마는 호놀룰루에서 태어났습니다.

➡ Pattern: 에서 태어났습니다.

➡ Triple: 버락 오바마, dbpedia-owl:birthPlace, 호놀룰루
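
Once a pattern is mapped to a property and both labels are resolved to URIs, writing the triple is mechanical. The sketch below emits one N-Triples line. The two dictionaries are hypothetical stand-ins for pattern scoring and named entity disambiguation, which are the hard parts and are not implemented here.

```python
DBPEDIA_OWL = "http://dbpedia.org/ontology/"
DBPEDIA_RES = "http://dbpedia.org/resource/"

# Hypothetical lookup tables; in practice these come from pattern scoring
# and named entity disambiguation.
PATTERN_TO_PROPERTY = {
    "was born in": DBPEDIA_OWL + "birthPlace",
    "에서 태어났습니다.": DBPEDIA_OWL + "birthPlace",
}
LABEL_TO_URI = {
    "Barack Obama": DBPEDIA_RES + "Barack_Obama",
    "버락 오바마": DBPEDIA_RES + "Barack_Obama",
    "Honolulu": DBPEDIA_RES + "Honolulu",
    "호놀룰루": DBPEDIA_RES + "Honolulu",
}


def to_ntriple(subject_label, pattern, object_label):
    """Return one N-Triples line for a matched pattern and its two labels."""
    s = LABEL_TO_URI[subject_label]
    p = PATTERN_TO_PROPERTY[pattern]
    o = LABEL_TO_URI[object_label]
    return f"<{s}> <{p}> <{o}> ."
```

With these tables, the English and the Korean sentence both lead to the same triple, which is the goal of integrating a new language.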

Page 11: BOA - How To Integrate Your Language

8. Evaluation

1. Select properties P to evaluate (T100)

2. Query DBpedia for triples (and labels) with p ∈ P

3. Find sentences that contain the labels

4. Assess whether the triple can be found in the sentence

➡ Gold standard with 1000 annotated sentence/triple pairs

5. Run one BOA iteration on Gold Standard

6. Measure Precision/Recall/F-Measure
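
Step 6 uses the standard definitions: precision is the share of extracted triples that are in the gold standard, recall is the share of gold triples that were extracted, and the F-measure is their harmonic mean. A minimal sketch, assuming triples are represented as hashable tuples (the representation and function name are assumptions):

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted triples against gold-standard triples."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, extracting three triples of which two are among four gold triples gives a precision of 2/3, a recall of 1/2 and an F-measure of 4/7.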

Page 12: BOA - How To Integrate Your Language

Necessary resources for new language

๏ 50M sentences (ideally covering general knowledge)

๏ Sentence Boundary Disambiguation

๏ Part-of-speech tagger (helpful)

๏ Named Entity Recognition

๏ Named Entity Disambiguation

๏ Labels for resources

๏ SPARQL endpoint
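
Labels in the target language are typically fetched from the SPARQL endpoint together with the triples. The sketch below only builds the query string, so it runs offline; the function name is hypothetical, and sending the query to an endpoint is left to whatever SPARQL client is in use.

```python
def build_label_query(property_uri, lang, limit=100):
    """Build a SPARQL query for triples of one property with labels in `lang`."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s ?sLabel ?o ?oLabel WHERE {\n"
        f"  ?s <{property_uri}> ?o .\n"
        "  ?s rdfs:label ?sLabel .\n"
        "  ?o rdfs:label ?oLabel .\n"
        f'  FILTER (lang(?sLabel) = "{lang}" && lang(?oLabel) = "{lang}")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


query = build_label_query("http://dbpedia.org/ontology/birthPlace", "ko", 10)
```

If few resources have labels in the target language, the query returns little background knowledge, which is why labels are listed as a necessary resource.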

Page 13: BOA - How To Integrate Your Language

LOD2 Presentation . 02.09.2010 . Page http://lod2.eu

Thank you! Questions?

Daniel Gerber
Johannisgasse 26, Room 5-21
04103 Leipzig, Germany
SIMBA@AKSW
http://bis.informatik.uni-leipzig.de/DanielGerber
http://boa.aksw.org
http://code.google.com/p/boa