DESCRIPTION
Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they encompass solely a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, an iterative bootstrapping strategy for extracting RDF from unstructured data. The idea behind BOA is to use the Data Web as background knowledge for the extraction of natural language patterns that represent predicates found on the Data Web. These patterns are used to extract instance knowledge from natural language text. This knowledge is finally fed back into the Data Web, therewith closing the loop. We evaluate our approach on two data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy. Moreover, we provide the first repository of natural language representations of predicates found on the Data Web.
AKSW, Universität Leipzig
Daniel Gerber, Axel-Cyrille Ngonga Ngomo
Bootstrapping the Data Web
WeKEx@ISWC - http://boa.aksw.org - 17.01.2012
Motivation
๏ Most knowledge bases extracted from (semi)-structured data
๏ Only 15-20% of information is available as structured data
๏ Semantic Web ⬌ Document Web
๏ How can we extract data from the document-oriented web?
Idea I
Background knowledge on the Data Web:
๏ dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii
๏ dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party
๏ dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama
Idea II
Barack Obama was born in Honolulu, Hawaii.
Barack Hussein Obama is a politician of the Democratic Party.
Obama married Michelle Robinson in 1992.
Patterns found between the entity labels:
๏ was born in
๏ is a politician of the
๏ married
Idea III
Patterns:
๏ is a politician of the
๏ married
๏ was born in
Matching sentences:
Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.
Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.
Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.
Related Work
๏ ReadTheWeb Project: N(ever) E(nding) L(anguage) L(earner)
๏ PROSPERA: Scalable Knowledge Harvesting with High Precision and High Recall
The BOA approach
[Figure: BOA architecture]
๏ Corpus Extraction: Crawler, Cleaner, Indexer (Web → Corpora)
๏ Knowledge Acquisition: SPARQL (Data Web → Background Knowledge)
๏ Pattern Search, Filtering, Pattern Scoring (→ Patterns)
๏ RDF Generation (→ Data Web, use in next iteration)
Knowledge acquisition
Query template (bracketed alternatives separated by "|"):
    SELECT ?x ?xLabel ?prop ?y ?yLabel ?domain ?range
    WHERE {
      ?x rdf:type dbpedia-owl:[Organisation|Person|Place] .
      ?x rdfs:label ?xLabel .
      ?y rdfs:label ?yLabel .
      [?y ?prop ?x | ?x ?prop ?y] .
      FILTER ( lang(?xLabel) = 'en' && lang(?yLabel) = 'en' ) .
      ?prop rdfs:range ?range .
      ?prop rdfs:domain ?domain .
    }
Example result:
    ?x      = http://dbpedia.org/resource/Google
    ?xLabel = "Google"
    ?prop   = http://dbpedia.org/ontology/subsidiary
    ?y      = http://dbpedia.org/resource/YouTube
    ?yLabel = "Youtube"
    ?domain = http://dbpedia.org/ontology/Company
    ?range  = http://dbpedia.org/ontology/Company
Pattern Search
(1) Set of entities s and o connected through p
(2) Find all sentences which contain s and o
(3) Replace labels with variables (?D?, ?R?)
BOA pattern:
๏ dbpedia-owl:spouse: “?D? with his wife ?R?”
BOA pattern mapping:
๏ dbpedia-owl:spouse: “?D? with his wife ?R?”
๏ dbpedia-owl:spouse: “?D? and his wife ?R?”
๏ dbpedia-owl:spouse: “?D? and her husband ?R?”
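The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: labels are matched as plain substrings, and the function name `find_patterns` is made up for this sketch rather than taken from BOA.

```python
import re

def find_patterns(sentences, subject_label, object_label):
    """Return the text spans that connect the two labels, with the
    subject label replaced by ?D? and the object label by ?R?."""
    patterns = []
    for sentence in sentences:
        # Step 2: keep only sentences that mention both labels.
        if subject_label not in sentence or object_label not in sentence:
            continue
        # Step 3: replace the labels with the placeholder variables.
        text = sentence.replace(subject_label, "?D?").replace(object_label, "?R?")
        # Keep both variables and the words between them, in either order.
        match = re.search(r"\?D\?.*?\?R\?|\?R\?.*?\?D\?", text)
        if match:
            patterns.append(match.group(0))
    return patterns

# Step 1: one (s, o) pair connected through dbpedia-owl:birthPlace.
sentences = ["Barack Obama was born in Honolulu, Hawaii."]
birth_place_patterns = find_patterns(sentences, "Barack Obama", "Honolulu, Hawaii")
print(birth_place_patterns)
```

Running the same function over every pair connected through one property would collect the patterns that make up that property's pattern mapping.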
Pattern Scoring - Support
Support: a pattern should be used across several triples in the background knowledge
subsidiary ↣ “?R? was acquired by ?D?”
๏ [Google, DoubleClick] ↣ 2
๏ [General Motors, Opel] ↣ 1
๏ [Cablevision, Rainbow Media] ↣ 4
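One way to turn these counts into a number is sketched below. The slide only states the intuition, so the log-product combination in `support` is an assumption made for this illustration and should not be read as the exact BOA formula.

```python
import math

# Occurrence counts from the slide for the subsidiary pattern
# "?R? was acquired by ?D?", keyed by entity pair.
occurrences = {
    ("Google", "DoubleClick"): 2,
    ("General Motors", "Opel"): 1,
    ("Cablevision", "Rainbow Media"): 4,
}

def support(pair_counts):
    """Illustrative support score (assumed form): grows with the number of
    distinct pairs that use the pattern and with the highest per-pair count."""
    distinct_pairs = len(pair_counts)
    max_count = max(pair_counts.values())
    return math.log(max_count + 1) * math.log(distinct_pairs + 1)

print(support(occurrences))
```

Under this sketch, a pattern seen with three different entity pairs scores higher than one seen with a single pair, which is the behaviour the slide asks for.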
Pattern Scoring - Specificity
Specificity: a pattern should not be used by many pattern mappings
๏ subsidiary: “?D? agreed to buy ?R?”
๏ subsidiary: “?R? is a part of ?D?”
๏ foundationOrganisation: “?R? is a part of ?D?”
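A hedged sketch of this idea treats specificity like an inverse document frequency over pattern mappings. The IDF-style formula is an assumption of this example, and the dictionary below contains only the three entries shown on the slide.

```python
import math

# Pattern mappings from the slide: property -> set of patterns.
pattern_mappings = {
    "subsidiary": {"?D? agreed to buy ?R?", "?R? is a part of ?D?"},
    "foundationOrganisation": {"?R? is a part of ?D?"},
}

def specificity(pattern, mappings):
    """Illustrative IDF-style score (assumed form): the fewer mappings
    contain the pattern, the higher the score."""
    containing = sum(1 for patterns in mappings.values() if pattern in patterns)
    if containing == 0:
        raise ValueError("pattern does not occur in any mapping")
    return math.log(len(mappings) / containing)

print(specificity("?D? agreed to buy ?R?", pattern_mappings))
print(specificity("?R? is a part of ?D?", pattern_mappings))
```

Here "?R? is a part of ?D?" occurs in both mappings and therefore gets the lowest possible score, while "?D? agreed to buy ?R?" is specific to subsidiary.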
Pattern Scoring - Typicity
Typicity: a pattern should be used to connect entities of the correct type
๏ Hypercom was acquired by Verifone .
๏ Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O
๏ Maktoob was acquired by Yahoo!
๏ Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O
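The type check can be sketched as the share of pattern matches whose neighbouring entities carry the expected NER tags. This simple ratio is an assumption of the example, not the scoring formula used by BOA, and it inspects only the single token on each side of the pattern.

```python
def parse_tagged(sentence):
    """Split 'Word_TAG Word_TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in sentence.split()]

def typicity(tagged_sentences, middle, left_type, right_type):
    """Fraction of pattern occurrences whose left and right neighbours
    have the expected NER tags (illustrative definition)."""
    matches = correct = 0
    n = len(middle)
    for sentence in tagged_sentences:
        tokens = parse_tagged(sentence)
        words = [word for word, _ in tokens]
        # Leave room for one token before and one after the pattern text.
        for i in range(1, len(words) - n):
            if words[i:i + n] == middle:
                matches += 1
                if tokens[i - 1][1] == left_type and tokens[i + n][1] == right_type:
                    correct += 1
    return correct / matches if matches else 0.0

tagged = [
    "Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O",
    "Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O",
]
# "?R? was acquired by ?D?": both sides should be organisations.
score = typicity(tagged, ["was", "acquired", "by"], "ORG", "ORG")
print(score)
```

The second sentence lowers the score because the tagger labelled Maktoob as a person rather than an organisation.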
RDF Generation
Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...
NER-tagged: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O
Pattern: ?D? with his wife ?R?
Generated RDF (four elements of the graph are marked NEW on the slide):
๏ dbpedia:Abel_Pacheco dbpedia-owl:spouse boa:Leyla_Rodriguez_Stahl
๏ dbpedia:Abel_Pacheco rdfs:label "Abel Pacheco"@en
๏ dbpedia:Abel_Pacheco rdf:type dbpedia-owl:Person
๏ boa:Leyla_Rodriguez_Stahl rdfs:label "Leyla Rodriguez Stahl"@en
๏ boa:Leyla_Rodriguez_Stahl rdf:type dbpedia-owl:Person
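The generation step can be sketched as follows. Several parts are assumptions of this example: the pattern text is located in the tagged sentence and the nearest PER-tagged entity on each side is taken as ?D? and ?R?, the hypothetical `known_uris` dictionary stands in for a lookup against the background knowledge, and unknown labels get a URI in the boa: namespace.

```python
def parse_tagged(sentence):
    """Split 'Word_TAG Word_TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in sentence.split()]

def entity_left(tokens, end, tag):
    """Nearest run of `tag` tokens that ends before index `end`."""
    i = end - 1
    while i >= 0 and tokens[i][1] != tag:
        i -= 1
    words = []
    while i >= 0 and tokens[i][1] == tag:
        words.insert(0, tokens[i][0])
        i -= 1
    return " ".join(words)

def entity_right(tokens, start, tag):
    """Nearest run of `tag` tokens that starts at or after index `start`."""
    i = start
    while i < len(tokens) and tokens[i][1] != tag:
        i += 1
    words = []
    while i < len(tokens) and tokens[i][1] == tag:
        words.append(tokens[i][0])
        i += 1
    return " ".join(words)

def to_uri(label, known_uris):
    # Hypothetical lookup: reuse a known URI, otherwise mint a boa: URI.
    return known_uris.get(label, "boa:" + label.replace(" ", "_"))

def extract_triple(tagged_sentence, middle, prop, tag, known_uris):
    tokens = parse_tagged(tagged_sentence)
    words = [word for word, _ in tokens]
    n = len(middle)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == middle:
            subject = entity_left(tokens, i, tag)
            obj = entity_right(tokens, i + n, tag)
            if subject and obj:
                return (to_uri(subject, known_uris), prop, to_uri(obj, known_uris))
    return None

tagged = ("Pacheco_PER arrived_O with_O his_O wife_O "
          "Leyla_PER Rodriguez_PER Stahl_PER and_O")
triple = extract_triple(tagged, ["with", "his", "wife"], "dbpedia-owl:spouse",
                        "PER", {"Pacheco": "dbpedia:Abel_Pacheco"})
print(triple)
```

The rdfs:label and rdf:type statements for a newly minted resource could be emitted in the same way once the triple has been accepted.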
Evaluation I
              en-wiki              en-news
Language      English              English
Topic         general knowledge    news
# of lines    44.7M                256.1M
# of words    1,032.1M             5,068.7M
[Figure: # of triples for Place, Person and Organisation, split into "is subject" and "is object". Properties labelled in the chart: riverMouth, musicalArtist, musicalBand, award, writer, almaMater, occupation, formerTeam, deathPlace, birthPlace. Values as extracted: 137990, 158697, 327430, 64239, 551693, 72820.]
Evaluation II
                          en-wiki                      en-news
                          LOC       PER       ORG      LOC       PER       ORG
Triples extracted         1465      8817      2567     488       903       916
Triples in DBpedia        138       183       48       52        44        7
Evaluated triples         100 (8)   100 (1)   100 (1)  100 (1)   100 (7)   100 (0)
Precision (%)             90.5      97        99       61.5      73.5      91
New true statements*      1200      8375      2494     268       631       827
Found pattern mappings    62        72        59       49        70        55
Found patterns            123k      136k      38k      569k      465k      92k
Scored patterns           1045      612       241      3832      7294      1077
* Number of extracted statements not found in DBpedia multiplied by the precision of our approach
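The footnote's estimate can be reproduced from the table rows. The sketch below uses only numbers from the slide, with the dictionary layout and function name chosen for this example. The computed values agree with the "New true statements" row to within one statement, and the exact rounding used on the slide is not stated.

```python
# (corpus, class) -> (triples extracted, triples in DBpedia, precision in %)
results = {
    ("en-wiki", "LOC"): (1465, 138, 90.5),
    ("en-wiki", "PER"): (8817, 183, 97.0),
    ("en-wiki", "ORG"): (2567, 48, 99.0),
    ("en-news", "LOC"): (488, 52, 61.5),
    ("en-news", "PER"): (903, 44, 73.5),
    ("en-news", "ORG"): (916, 7, 91.0),
}

def new_true_statements(extracted, in_dbpedia, precision_percent):
    """Extracted statements not already in DBpedia, scaled by precision."""
    return (extracted - in_dbpedia) * precision_percent / 100

estimates = {key: new_true_statements(*row) for key, row in results.items()}
for key, value in estimates.items():
    print(key, round(value, 1))
```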
Future Work
๏ Iteration 1+
๏ Human feedback
๏ Pattern generalization
๏ Datatype Properties
๏ Languages/Corpora
๏ Webservices
Conclusion
๏ No manually created seed patterns needed
๏ 95.5% Precision on DBpedia/Wikipedia
๏ Output easily integrable in LOD Cloud
๏ Library of natural-language representations of formal relations, Demo
๏ Quasi language independent (German/Korean)
Thank you! Questions?
Daniel Gerber
Johannisgasse 26, Room 5-21
04103 Leipzig, Germany
SIMBA@AKSW
http://bis.informatik.uni-leipzig.de/DanielGerber
http://boa.aksw.org
http://code.google.com/p/boa