Presentation of the Linked Data Lifecycle given at the ICCL Summer School 2013.
The Linked Data Life-Cycle
Jens Lehmann, Lorenz Bühmann
Contributors: Quan Nguyen, Sören Auer, Richard Cyganiak, Daniel Gerber, Sebastian Hellmann, Anja Jentzsch, Dimitris Kontokostas, Axel Ngonga, Claus Stadler, Christina Unger
2013-08-23
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 1 / 252
Outline
1 Introduction to Linked Data
2 Linked Dataset Example: DBpedia
3 Linked Data Life-Cycle Overview
4 Knowledge Extraction
5 Data Integration / Linking
6 Enrichment
7 Repair
8 Knowledge Base Exploration / Querying
[Figure: Linked Data Lifecycle with the stages Extraction, Storage/Querying, Manual revision/Authoring, Interlinking/Fusing, Classification/Enrichment, Quality Analysis, Evolution/Repair, Search/Browsing/Exploration]
The Linked Data Principles
The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web.
Linked Data principles:
1 Use URIs as names for things.
2 Use HTTP URIs, so that people can look up those names.
3 When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4 Include links to other URIs, so that they can discover more things.
LOD Cloud
Linked Data Principles Detailed: 1 + 2
1 URI references identify not just Web documents and digital content, but also real-world objects and abstract concepts
tangible things: people, places
abstract things: relationship type of knowing somebody
2 HTTP URIs enable re-use of the Web architecture; Linked Data gives emphasis to the "Web" in "Semantic Web"
Resource dereferencing
Re-use of standard tools for security, load-balancing etc.
Principles Detailed: 3 Content Negotiation
Humans and machines should be able to retrieve appropriate representations of resources: HTML for humans, RDF for machines
Achievable using an HTTP mechanism called content negotiation
Basic idea: HTTP clients send HTTP headers with each request to indicate what kinds of documents they prefer
Servers can inspect the headers and select an appropriate response
Two strategies:
303 URIs
Hash URIs
Both ensure that objects and the documents that describe them are not confused, and that humans and machines can retrieve appropriate representations
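A minimal sketch of the server-side selection step described above. The function names and the two offered media types are assumptions chosen for illustration; real servers handle more cases (wildcard subtypes, parameters other than q).

```python
# Sketch of content negotiation: pick a representation from the Accept header.
# OFFERED and the function names are illustrative assumptions.

OFFERED = ("text/html", "application/rdf+xml")


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        if media_type:
            prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, offered=OFFERED):
    """Return the offered media type the client prefers most."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in offered:
            return media_type
        if q > 0 and media_type == "*/*":
            return offered[0]
    # Nothing acceptable was named: fall back to the default representation
    return offered[0]
```

A browser-style header such as "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8" selects HTML, while an RDF client sending "application/rdf+xml" gets RDF.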
303 URIs
303 Redirect: instead of sending the object itself over the network, the server responds to the client with the HTTP response code 303 See Other and the URI of a Web document which describes the real-world object
Second step: the client dereferences the new URI and gets a Web document describing the real-world object
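The two steps can be simulated without a network. This is a sketch only: the /resource/, /page/ and /data/ path layout is an assumption borrowed from the DBpedia slides in this deck, and respond() stands in for a real HTTP server.

```python
# Sketch of the 303 strategy: a URI for the thing redirects to a document
# about the thing. Paths and function names are illustrative assumptions.

def respond(path, accept):
    """Return (status, headers) for a request to the toy server."""
    if path.startswith("/resource/"):
        name = path[len("/resource/"):]
        wants_rdf = "application/rdf+xml" in accept
        target = ("/data/" if wants_rdf else "/page/") + name
        return 303, {"Location": target}  # See Other: never the object itself
    if path.startswith("/data/"):
        return 200, {"Content-Type": "application/rdf+xml"}
    if path.startswith("/page/"):
        return 200, {"Content-Type": "text/html"}
    return 404, {}


def dereference(path, accept):
    """Follow at most one 303 redirect and record the requested paths."""
    hops = [path]
    status, headers = respond(path, accept)
    if status == 303:
        hops.append(headers["Location"])
        status, headers = respond(headers["Location"], accept)
    return status, headers, hops
```

Calling dereference("/resource/Dresden", "text/html") takes two requests and ends at the HTML document, which is the cost of the 303 approach.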
Hash URIs
The hash URI strategy builds on the characteristic that URIs may contain a special part (fragment identifier) separated from their base part by a hash symbol (#)
The HTTP protocol requires the fragment part to be stripped off before requesting the URI from the server
→ a URI that includes a hash cannot be retrieved directly and therefore does not necessarily identify a Web document
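The stripping step can be seen with Python's standard library. The example.org URIs are made up for illustration.

```python
# Sketch: what an HTTP client actually requests for a hash URI.
from urllib.parse import urldefrag


def request_target(uri):
    """Split a URI into the part sent to the server and the fragment."""
    base, fragment = urldefrag(uri)
    return base, fragment


# Two different things, described by one and the same retrievable document
alice = request_target("http://example.org/people#alice")
bob = request_target("http://example.org/people#bob")
```

Both URIs lead to a request for http://example.org/people, so the server returns the descriptions of all resources in that document together.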
Hash versus 303
Hash URIs
(+) Reduced number of necessary HTTP round-trips → reduces access latency
(-) Descriptions of all resources sharing the same non-fragment URI part are always returned to the client together → can lead to large amounts of data being unnecessarily transmitted to the client
303 URIs
(+) Flexible because the redirection target can be configured separately for each resource (usually points to a single document for each resource, but could also summarise several resources)
(-) Requires two HTTP requests to retrieve a single description of a real-world object
Principles Detailed: 4 Links
If an RDF triple connects URIs in different namespaces/datasets, it is called a link (no unique syntactical definition of link exists)
Basic idea of Linked Data: apply the general hyperlink-based architecture of the World Wide Web to the task of sharing structured data on a global scale
Research challenge: efficient creation of links with high precision and recall
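A sketch of both ideas. Since no unique definition of a link exists, the code assumes one possible reading: the "namespace" of a URI is its scheme plus host. The precision/recall helper compares discovered links against a reference set; all names are illustrative.

```python
# Sketch: classify triples as links and score a set of discovered links.
# Treating scheme + host as the namespace is an assumption, not a standard.
from urllib.parse import urlsplit


def namespace(uri):
    parts = urlsplit(uri)
    return parts.scheme + "://" + parts.netloc


def is_link(triple):
    """True if subject and object are URIs from different namespaces."""
    subject, _predicate, obj = triple
    return obj.startswith("http") and namespace(subject) != namespace(obj)


def precision_recall(found, reference):
    """Precision and recall of found links against a reference link set."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```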
Why Linked Data?
Problem: Try to search for these things on the current Web:
Apartments near German-Russian bilingual childcare in Leipzig.
ERP service providers with offices in Vienna and London.
Researchers working on multimedia topics in Eastern Europe.
Information is available on the Web, but opaque to current Web search.
Solution: complement text on Web pages with structured linked open data and intelligently combine/integrate such structured information from different sources.
How to get there?
Tim Berners-Lee's 5-star plan for an open web of data
★ Make data available on the Web under an open license
★★ Make it available as structured data
★★★ Use a non-proprietary format
★★★★ Use URIs to identify things
★★★★★ Link your data to other people's data to provide context
The 0th star
Data catalog with good metadata
Make your data findable
★ Data on the Web, Open License
Open vs. Closed:
Data used to be closed by default
In the future, it may be open by default.
Publishers: sharing data to make it more visible
E-Commerce: data sharing for increasing traffic
Community: collaboratively created databases
Good reasons against opening data
Privacy
Competitive advantage
Producing data and charging for it as business model
Can't get license from upstream
★★ Structured Data
Enabling re-use:
Delivering data to end users in different forms
Combining data with other data
3rd party analysis of data
Formats:
Good for re-use / Structured: MS Excel, CSV, XML, JSON, Microdata
Not so good for re-use: Pure websites, MS Word
Bad for re-use: PDF
Really bad for re-use: Only charts/maps without numbers
★★★ Non-Proprietary Formats
Specialist tools often have specialist formats
Few people have the tools
Expensive
Difficult to re-use
(Geospatial tools, statistics packages, etc.)
Non-proprietary:
CSV (dead simple)
XML
JSON
RDF (good for 4+5 stars)
★★★★ URIs as Identifiers
URI design: prefer stable, implementation-independent URIs
Turning local identifiers into URIs
Why?
Make them globally unique
Clarify authority
Make them resolvable
Make them linkable
★★★★★ Links to Other Data
Hyperlinks are the soul of the Web. The Web of Data is no different.
Summary
Linked Data Principles:
1 Use URIs to name things (not only documents, but also people, locations, concepts, etc.)
2 To enable agents (human users and machine agents alike) to look up those names, use HTTP URIs
3 When someone looks up a URI, provide useful information (structured data in RDF, SPARQL).
4 Include links to other URIs allowing agents to discover more things
5-Star-Data:
Five-star plan for realising an emerging web of data, dataset by dataset
2 stars: re-usable data
3 stars: open standards
4+5 stars: connect data silos
DBpedia
Community effort to extract structured information from Wikipedia and to make this information available on the Web
Allows asking sophisticated queries against Wikipedia, and linking other data sets on the Web to Wikipedia data
Semi-structured Wiki markup → structured information
Wikipedia Limitations
Simple questions that are hard to answer with Wikipedia:
What do Innsbruck and Leipzig have in common?
Who are the mayors of central European towns at an elevation of more than 1000 m?
Which movies star both Brad Pitt and Angelina Jolie?
All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
Structure in Wikipedia
Title
Abstract
Infoboxes
Geo-coordinates
Categories
Images
Links
Other language versions
Other Wikipedia pages
To the Web
Redirects
Disambiguation
...
DBpedia Information Extraction Framework (DIEF)
Started in 2007
Hosted on Sourceforge and Github
Initially written in PHP, later fully rewritten in Scala and Java
Around 40 contributors
See https://www.ohloh.net/p/dbpedia for a detailed overview
Can potentially be adapted to other MediaWikis
Currently Wiktionary: http://wiktionary.dbpedia.org
DIEF - Overview
DIEF - Raw Infobox Extractor
WikiText syntax:
{{Infobox Korean settlement
| title = Busan Metropolitan City
...
| area_km2 = 763.46
| pop = 3635389
| region = [[Yeongnam]]
}}
RDF serialization:
dbp:Busan dbp:title "Busan Metropolitan City"
dbp:Busan dbp:area_km2 "763.46"^^xsd:float
dbp:Busan dbp:pop "3635389"^^xsd:int
dbp:Busan dbp:region dbp:Yeongnam
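The idea of the raw extraction can be sketched in a few lines. This is a toy illustration, not the actual DIEF code (which is written in Scala and Java); the regular expressions and datatype rules are assumptions that only cover the simple cases in the Busan example.

```python
# Toy sketch of raw infobox extraction: one triple per "| key = value" line.
# Not the DIEF implementation; parsing rules here are simplifying assumptions.
import re

SAMPLE = """{{Infobox Korean settlement
| title = Busan Metropolitan City
| area_km2 = 763.46
| pop = 3635389
| region = [[Yeongnam]]
}}"""


def extract_infobox(subject, wikitext):
    triples = []
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*(\w+)\s*=\s*(.+?)\s*$", line)
        if not match:
            continue  # template name, closing braces, blank lines
        key, value = match.groups()
        link = re.fullmatch(r"\[\[([^\]|]+)\]\]", value)
        if link:
            # A wiki link becomes a resource, not a literal
            obj = "dbp:" + link.group(1).replace(" ", "_")
        elif re.fullmatch(r"-?\d+", value):
            obj = '"%s"^^xsd:int' % value
        elif re.fullmatch(r"-?\d+\.\d+", value):
            obj = '"%s"^^xsd:float' % value
        else:
            obj = '"%s"' % value
        triples.append((subject, "dbp:" + key, obj))
    return triples


busan = extract_infobox("dbp:Busan", SAMPLE)
```

Property names are taken over unchanged from the infobox, which is why the raw extractor produces many near-duplicate properties across templates.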
DIEF - Raw Infobox Extractor/Diversity
DIEF - Mapping-Based Infobox Extractor
Cleaner data:
Combine what belongs together (birth_place, birthplace)
Separate what is different (bornIn, birthplace)
Correct handling of datatypes
Mappings Wiki:
http://mappings.dbpedia.org
Everybody can contribute to new mappings or improve existing ones
≈ 170 editors
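A sketch of what a mapping achieves, assuming a hypothetical lookup table from raw infobox keys to ontology properties. The real mappings are edited in the Mappings Wiki and are defined per infobox template; the table below, including the target chosen for bornIn, is invented for illustration.

```python
# Sketch: normalise raw infobox properties through a mapping table.
# MAPPING is a hypothetical example, not taken from mappings.dbpedia.org.

MAPPING = {
    "birth_place": "dbo:birthPlace",  # combine what belongs together
    "birthplace": "dbo:birthPlace",
    "bornIn": "dbo:birthYear",        # separate what is different (assumed)
}


def map_triples(raw_triples, mapping=MAPPING):
    """Rewrite dbp: properties to ontology properties; drop unmapped ones."""
    mapped = []
    for subject, prop, obj in raw_triples:
        key = prop.split(":", 1)[1]
        if key in mapping:
            mapped.append((subject, mapping[key], obj))
    return mapped
```

After mapping, a single query on the ontology property finds values that the raw data spread over several differently spelled properties.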
URI/IRI schemes
http://lang.dbpedia.org is the main domain
For every article there exists a DBpedia resource of the form: http://lang.dbpedia.org/resource/ArticleName
Properties from the raw infobox extractor use the http://lang.dbpedia.org/property/ namespace
The ontology is global for all languages and lives under the http://dbpedia.org/ontology/ namespace
Note that no language code is used for English:
http://dbpedia.org as main domain
http://dbpedia.org/resource/title for articles
http://dbpedia.org/property/title for properties
Linked Data Publication via 303 Redirects
http://dbpedia.org/resource/Dresden - URI of the city of Dresden
http://dbpedia.org/page/Dresden - information resource describing the city of Dresden in HTML format
http://dbpedia.org/data/Dresden - information resource describing the city of Dresden in RDF/XML format
Further formats are supported, e.g. http://dbpedia.org/data/Dresden.n3 for N3
DBpedia Links
Data set                  Predicate                Count        Tool
Amsterdam Museum          owl:sameAs               627          S
BBC Wildlife Finder       owl:sameAs               444          S
Book Mashup               rdf:type, owl:sameAs     9 100
Bricklink                 dc:publisher             10 100
CORDIS                    owl:sameAs               314          S
Dailymed                  owl:sameAs               894          S
DBLP Bibliography         owl:sameAs               196          S
DBTune                    owl:sameAs               838          S
Diseasome                 owl:sameAs               2 300        S
Drugbank                  owl:sameAs               4 800        S
EUNIS                     owl:sameAs               3 100        S
Eurostat (Linked Stats)   owl:sameAs               253          S
Eurostat (WBSG)           owl:sameAs               137
CIA World Factbook        owl:sameAs               545          S
flickr wrappr             dbp:hasPhotoCollection   3 800 000    C
Freebase                  owl:sameAs               3 600 000    C
GADM                      owl:sameAs               1 900
GeoNames                  owl:sameAs               86 500       S
GeoSpecies                owl:sameAs               16 000       S
GHO                       owl:sameAs               196          L
Project Gutenberg         owl:sameAs               2 500        S
Italian Public Schools    owl:sameAs               5 800        S
LinkedGeoData             owl:sameAs               103 600      S
LinkedMDB                 owl:sameAs               13 800       S
MusicBrainz               owl:sameAs               23 000
New York Times            owl:sameAs               9 700
OpenCyc                   owl:sameAs               27 100       C
OpenEI (Open Energy)      owl:sameAs               678          S
Revyu                     owl:sameAs               6
Sider                     owl:sameAs               2 000        S
TCMGeneDIT                owl:sameAs               904
UMBEL                     rdf:type                 896 400
US Census                 owl:sameAs               12 600
WikiCompany               owl:sameAs               8 300
WordNet                   dbp:wordnet_type         467 100
YAGO2                     rdf:type                 18 100 000
Sum                                                27 211 732
(S: Silk, L: LIMES, C: custom script, missing: no regeneration)
DBpedia Links - Query Example
Compare funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)
SELECT * {
  { SELECT ?ftsyear ?dbpcountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?ftscountry .
      ?ftscountry owl:sameAs ?dbpcountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER (?ftsyear = str(?gdpyear))
}
Infrastructure
DBpedia has two extraction modes:
Wikipedia-database-dump-based extraction
DBpedia Live synchronisation (more later)
DBpedia Dumps:
The DBpedia dump archive is located at: http://downloads.dbpedia.org/
The latest downloads are described at: http://dbpedia.org/Downloads
Official Endpoint (by OpenLink): http://dbpedia.org/sparql
Query Answering
Back to our Wikipedia questions:
What do Innsbruck and Leipzig have in common?
Who are the mayors of central European towns at an elevation of more than 1000 m?
Which movies star both Brad Pitt and Angelina Jolie?
All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
Using the data extracted from Wikipedia and the public SPARQL endpoint, DBpedia can answer these questions.
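As a sketch, the movie question could be phrased as follows and sent to the public endpoint over the SPARQL protocol. The query is an assumption about how the data is modelled (dbo:starring and the two resource names); the code only builds the request and does not contact the server.

```python
# Sketch: build (but do not send) a SPARQL protocol GET request.
# The modelling via dbo:starring is an assumption about the dataset.
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:starring dbr:Brad_Pitt .
  ?film dbo:starring dbr:Angelina_Jolie .
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Encode the query in the URL and ask for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)
```

Passing the request to urllib.request.urlopen would execute it; the result set depends on the endpoint's current data.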
DBpedia Live
DBpedia dumps are generated on a bi-annual basis
Wikipedia has around 100,000 to 150,000 page edits per day
DBpedia Live pulls page updates in real time, and the extraction results update the triple store
In practice, a 5 minute update delay increases performance by 15%
Links
SPARQL Endpoint: http://live.dbpedia.org/sparql
Documentation: http://wiki.dbpedia.org/DBpediaLive
Statistics: http://live.dbpedia.org/LiveStats/
DBpedia Live - Overview
DBpedia Internationalization (I18n)
DBpedia Internationalization Committee founded:
http://wiki.dbpedia.org/Internationalization
DBpedia language editions are available in:
Korean, Greek, German, Polish, Russian, Dutch, Portuguese, Spanish, Italian, Japanese, French
Use the corresponding Wikipedia language edition for input
Mappings available for 23 languages
DBpedia I18n - Overview
Applications: Disambiguation
Named entity recognition and disambiguation
Tools such as: DBpedia Spotlight, AlchemyAPI, Semantic API, Open Calais, Zemanta and Apache Stanbol
Applications: Question Answering
DBpedia is the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series
IBM Watson also relied on DBpedia
Applications: Faceted Browsing
Neofonie Browser
gFacet
OpenLink faceted browser (fct)
Applications: Search and Querying
Query Builder
RelFinder
SemLens
Applications: Digital Libraries & Archives
Virtual International Authority File (VIAF) project as Linked Data
VIAF added a total of 250,000 reciprocal authority links to Wikipedia.
DBpedia can also provide:
Context information for bibliographic and archive records (e.g. an author's demographics, a film's homepage, an image etc.)
Stable and curated identifiers for linking.
The broad range of Wikipedia topics can form the basis for a thesaurus for subject indexing.
Applications: DBpedia Mobile
DBpedia Mobile is a location-centric DBpedia client application for mobile devices consisting of a map view, the Marbles Linked Data Browser and a GPS-enabled launcher application.
Applications: DBpedia Wiktionary
Wiktionary is a Wikimedia project: http://wiktionary.org
171 languages, 3M words for English.
Extracted Using the DBpedia Information Extraction Framework
Easily configurable for every Wiktionary language edition
Pre-configured for German, Greek, English, Russian and French.
http://Wiktionary.dbpedia.org
100 million triples
Lemon model
Other Applications
See http://wiki.dbpedia.org/Applications for a more complete list
Linked Data - Achievements and Challenges
Achievements:
1 Extension of the Web with a data commons (50B facts)
2 Vibrant, global RTD community
3 Industrial uptake begins (e.g. BBC, Thomson Reuters, Eli Lilly, NY Times, Facebook, Google, Yahoo)
4 Governmental adoption in sight
5 Establishing Linked Data as a deployment path for the Semantic Web
Challenges:
1 Coherence: relatively few, expensively maintained links
2 Quality: partly low-quality data and inconsistencies
3 Performance: still substantial penalties compared to relational data management
4 Data consumption: large-scale processing, schema mapping and data fusion still in their infancy
5 Usability: missing direct end-user tools and network effect
These issues are closely related, and addressing them should ultimately lead to an ecosystem of interlinked knowledge!
Extraction
From unstructured sources
Formats: plain text
Methods: NLP, text mining, ontology learning
From semi-structured sources
Formats: wiki markup, tags
Tools: DBpedia framework (Wikipedia, Wiktionary)
From structured sources
Formats: databases, spreadsheets, XML
RDB2RDF tools: Sparqlify, D2R, Triplify
CSV converters: RDF extension of Google Refine
Extraction Challenges
From unstructured sources
Improve F-Measure of existing NLP approaches (OpenCalais, Ontos API)
Develop standardized, LOD-enabled interfaces between NLP tools (NLP2RDF)
From semi-structured sources
Efficient bi-directional synchronization
From structured sources
Declarative syntax and semantics of data model transformations (W3C WG RDB2RDF)
Orthogonal challenges
Using LOD as background knowledge
Provenance
RDF Data Management
From unstructured sources
RDF access via SPARQL is still slower than relational data management by a factor of 2-10
Performance increases steadily
Comprehensive, well-supported open-source and commercial implementations are available:
OpenLink's Virtuoso (os+commercial)
OWLIM-Lite (free), OWLIM-SE, OWLIM-Enterprise
Talis (hosted)
Bigdata (distributed)
Allegrograph (commercial)
Mulgara (os)
Storage and Querying Challenges
Reduce the performance gap between relational and RDF data management
SPARQL query extensions: spatial/semantic/temporal data management
View maintenance / adaptive reorganization based on common access patterns
More realistic benchmarks
Authoring
Integrated in Existing Environments: Tiki
Data oriented: RDFauthor, rdfEditor
Schema oriented: Protégé, TopBraid Composer, NeOn Toolkit, Swoop, Neologism, Knoodl
Authoring: Semantic Wikis
1 Semantic (Text) Wikis
Authoring of semantically annotated texts
Semantic MediaWiki, KiWi, (Wikipedia+DBpedia)
2 Semantic Data Wikis
Direct authoring of structured information (i.e. RDF, RDF-Schema, OWL)
OntoWiki
Interlinking
The Data Web is an uncontrolled environment → proliferation of equivalent or similar entities → need for links / merging
Currently only few RDF triples are links
Manual Link Discovery:
Sindice Integration, LODStats, Semantic Pingback
Tool supported / Semi-Automatic:
SILK, LIMES, COMA, RDF-AI
Usually via mapping specifications / heuristics
Machine Learning / Automatic:
RAVEN, EAGLE, SILK GP
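A sketch of the heuristic flavour of link discovery: compare labels with a string similarity measure and emit a link above a threshold. This is not how SILK or LIMES are configured or implemented; the similarity measure (difflib from the Python standard library), the threshold and the example URIs are all assumptions for illustration.

```python
# Sketch of threshold-based link discovery over entity labels.
# Measure, threshold and data are illustrative assumptions.
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, threshold=0.9):
    """Compare every source label with every target label (quadratic)."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if similarity(s_label, t_label) >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links


SOURCE = {"http://a.example/Leipzig": "Leipzig",
          "http://a.example/Dresden": "Dresden"}
TARGET = {"http://b.example/1": "leipzig",
          "http://b.example/2": "Innsbruck"}
links = discover_links(SOURCE, TARGET)
```

The nested loop compares all pairs, which does not scale to large datasets; avoiding most of these comparisons is one of the efficiency problems dedicated link discovery frameworks address.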
Interlinking Challenges
Apply work in the de-duplication/record linkage literature
Consider the open world nature of Linked Data
Use LOD background knowledge
Zero-configuration linking
Explore active learning approaches, which integrate users in a feedback loop
Maintain a 24/7 linking service: Linked Open Data Around-The-Clock project (http://latc-project.eu/)
Enrichment
Currently, lack of knowledge bases with sophisticated schema information and instance data adhering to this schema
Goal: powerful reasoning, consistency checking and querying
Manual:
Via ontology editors, DBpedia mappings
(Semi-)Automatic:
DL-Learner, Statistical Schema Induction
Enrichment: Example
Given: knowledge base with property birthPlace (i.e. triples using that property) but no information on the semantics of birthPlace
Possible enrichment:
ObjectProperty: birthPlace
Characteristics: Functional
Domain: Person
Range: Place
SubPropertyOf: hasBeenAt
Benefits:
axioms serve as documentation for purpose and correct usage of schema elements
additional implicit information can be inferred
improve the applicability of schema debugging techniques
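A sketch of the last two benefits, using the birthPlace axioms from the example. The dictionary encoding and function names are assumptions; a real system would use an OWL reasoner. Note that under OWL semantics two values for a functional property are inferred to denote the same individual, so the second function only reports candidates worth inspecting.

```python
# Sketch: what the learned birthPlace axioms make possible.
# The axiom encoding is an illustrative assumption, not an OWL reasoner.

AXIOMS = {
    "birthPlace": {"domain": "Person", "range": "Place",
                   "functional": True, "subPropertyOf": "hasBeenAt"},
}


def infer(triples, axioms=AXIOMS):
    """Derive implicit triples from domain, range and subPropertyOf."""
    inferred = set()
    for s, p, o in triples:
        axiom = axioms.get(p)
        if axiom is None:
            continue
        inferred.add((s, "rdf:type", axiom["domain"]))
        inferred.add((o, "rdf:type", axiom["range"]))
        inferred.add((s, axiom["subPropertyOf"], o))
    return inferred - set(triples)


def functional_candidates(triples, axioms=AXIOMS):
    """Subjects with more than one value for a functional property."""
    values = {}
    for s, p, o in triples:
        if axioms.get(p, {}).get("functional"):
            values.setdefault((s, p), set()).add(o)
    return {key: objs for key, objs in values.items() if len(objs) > 1}
```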
Repair
Ontology Debugging: OWL reasoning to detect inconsistencies and unsatisfiable classes + detect the most likely sources for the problems
basic task: provide feedback to user for resolving undesired entailments
justification J ⊆ O of an entailment is a minimal set of axioms from which the entailment can be drawn
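The definition can be made concrete with a brute-force sketch. It assumes a deliberately tiny logic in which the ontology O is a list of subclass axioms (A, B), read as "A is a subclass of B", and entailment is just transitivity; real debuggers call an OWL reasoner and use far better search strategies than enumerating all subsets.

```python
# Sketch: enumerate all justifications J ⊆ O of an entailment sub ⊑ sup.
# The subclass-only logic and the exhaustive search are simplifications.
from itertools import combinations


def entails(axioms, sub, sup):
    """True if sub ⊑ sup follows from the subclass axioms by transitivity."""
    reached, frontier = {sub}, [sub]
    while frontier:
        current = frontier.pop()
        for a, b in axioms:
            if a == current and b not in reached:
                reached.add(b)
                frontier.append(b)
    return sup in reached


def justifications(ontology, sub, sup):
    """All minimal axiom sets entailing sub ⊑ sup, smallest first."""
    found = []
    for size in range(1, len(ontology) + 1):
        for subset in combinations(ontology, size):
            candidate = set(subset)
            if any(j <= candidate for j in found):
                continue  # contains a smaller justification: not minimal
            if entails(candidate, sub, sup):
                found.append(candidate)
    return found


O = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
J = justifications(O, "A", "C")
```

Here A ⊑ C has two justifications, so removing a single axiom is not enough to get rid of the entailment; at least one axiom from each justification has to go.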
Linked Data Quality Analysis
Quality on the Data Web varies a lot
Hand-crafted or expensively curated knowledge bases (e.g. DBLP, UMLS) vs. knowledge bases extracted from text or Web 2.0 sources (DBpedia)
Quality = fitness for use
Often it is not necessary to fix all problems, but to know about them
30+ quality dimensions defined in a recent survey
Research Challenge
Establish measures for assessing the authority, provenance and reliability of Data Web resources
Evolution (© CC-BY-SA by alasis on flickr)
KB Evolution
Tasks:
Performing knowledge base changes / refactoring
Ensuring consistency of related knowledge
Managing changes, e.g. undo operations
Update materialized inferred data upon changes
Update materialised links to other data upon changes
Tools:
Protégé - PROMPT and change management plugins
EvoPat - easily re-usable and sharable evolution patterns defined via SPARQL
PatOMat - ontology transformation framework
Exploration
RDF data can be complex (as discussed by Pascal Hitzler)
Exploration phase aims to make data accessible to non-experts
Options:
Faceted Browsing
Question Answering
Query Builders
Visualisation of statistical or geospatial data
...
Catalogus Professorum Lipsiensis
Visual Query Builder
Relationship Finder in CPL
Make the Web a Linked Data Washing Machine
Tool Support for Life-Cycle?
Many SW tools support one or more life-cycle stages
Linked Data Stack (http://stack.linkeddata.org) provides a consolidated repository of such tools
Each tool is a Debian package
Lightweight integration between tools via common vocabularies and SPARQL
Demonstrator interfaces for showing tools in combination
Developed by LOD2 and GeoKnow EU projects
Knowledge Extraction
Knowledge Extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources.
Resulting knowledge needs to be in a machine-readable and machine-interpretable format and facilitate inferencing
Similar to Information Extraction (NLP) and ETL (Data Warehouse), but main difference: extraction result goes beyond the creation of structured information or the transformation into a relational schema
Requires re-use of existing formal knowledge (reusing ontologies) or the generation of a schema based on the source data
Categorisation of Approaches
Source - Examples: plain text, relational databases, XML, CSV
Exposition - How is the extracted knowledge made explicit? How can you query and perform inference?
Synchronization - Is the knowledge extraction process executed once to produce a dump or is the result synchronized with the source? Are changes to the result written back (bi-directional)?
Reuse of Vocabularies - Can popular ontologies (Good Relations, FOAF, . . . ) be re-used to simplify global data integration?
Automatisation - manual, semi-automatic, automatic
Domain Ontology Required - Does the approach require a pre-defined ontology or can it create a schema from the source (e.g. ontology learning)?
Extraction from Structured Sources to RDF
Simple mappings from RDB tables/views to RDF
Direct mapping of the model of relational databases to RDF
Table ↦ OWL class
Row ↦ Instance s of this class
Cell with value o in column p ↦ Triple (s,p,o)
Details: http://www.w3.org/TR/rdb-direct-mapping/
Complex mappings of relational databases to RDF
Additional refinements can be applied to the 1:1 mapping to improve the usefulness of the RDF output
Extract or learn an OWL schema from the given database schema
Map the schema and its contents to a pre-existing domain ontology
Powerful mapping languages: R2RML, SML
XML
XML tree structure can be directly converted to RDF graph structure
Complex mappings possible, e.g. via XSLT processors
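The direct mapping rule for relational data (table to class, row to instance, cell to triple) can be sketched in a few lines of Python. This is only an illustration: the function name `direct_mapping`, the `http://example.org/` base IRI and the IRI layout are assumptions made for this sketch and do not reproduce the exact IRI scheme of the W3C Direct Mapping.

```python
def direct_mapping(table, rows, key, base="http://example.org/"):
    """Map rows of one table to (s, p, o) triples.

    Table -> class, row -> instance of that class,
    cell with value o in column p -> triple (s, p, o).
    The IRI layout is a hypothetical choice for this sketch.
    """
    triples = []
    cls = f"{base}{table}"
    for row in rows:
        s = f"{base}{table}/{row[key]}"          # one subject per row
        triples.append((s, "rdf:type", cls))      # row is an instance of the table class
        for column, value in row.items():
            if value is not None:                 # NULL cells produce no triple
                triples.append((s, f"{base}{table}#{column}", value))
    return triples


rows = [{"id": 1, "name": "Leipzig"}, {"id": 2, "name": None}]
triples = direct_mapping("city", rows, key="id")
for t in triples:
    print(t)
```

Skipping NULL cells is a common design choice, since RDF has no direct counterpart to a NULL value.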
Extraction from Natural Language Sources
80% of the information in business documents is in unstructured natural language [1]
(-) Increased complexity and decreased quality of extraction
(+) Potential for a massive acquisition of extracted knowledge
Traditional Information Extraction (IE)
Recognize and categorise elements in text
Techniques: Named Entity Recognition (NER), Coreference Resolution (CO), . . .
Ontology Learning (OL) from Text
Learn whole ontologies from natural language text
Usually extracted (semi-)automatically
[1] Wimalasuriya, Dou. "Ontology-based information extraction: [. . . ]" Journal of Information Science
LinkedGeoData + Sparqlify
Example: LinkedGeoData Knowledge Extraction Project using Sparqlify
Structure
Motivation
OpenStreetMap
LGD Architecture
Mapping
Access (How LinkedGeoData is published)
Use Cases
Motivation
Ease information integration tasks that require spatial knowledge, such as
Offerings of bakeries next door
Map of distributed branches of a company
Historical sights along a bicycle track
LOD cloud contains data sets with spatial features
e.g. Geonames, DBpedia, US census, EuroStat
But: they are restricted to popular or large entities like countries, famous places etc. or specific regions
Therefore they lack buildings, roads, mailboxes, etc.
OpenStreetMap - Datamodel
Basic entities are:
Nodes: Latitude, Longitude.
Ways: Sequence of nodes.
Relations: Associations between any number of nodes, ways and relations. Every member in a relation plays a certain role.
Each entity may be described with tags (= key-value pairs)
A way is closed if the ID of the last referenced node equals that of the first one.
Whether a closed way denotes a linear ring or a polygon (i.e. whether the enclosed area is part of the respective OSM entity) depends on the tags.
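The data model above can be sketched with two small Python classes. The class and field names are assumptions made for this illustration and are not taken from any OSM library; the tag values are merely example data. Only the closed-way rule (last node ID equals first node ID) comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)   # key-value pairs


@dataclass
class Way:
    id: int
    node_ids: list                              # ordered sequence of node IDs
    tags: dict = field(default_factory=dict)

    def is_closed(self):
        # A way is closed if the last referenced node ID equals the first one.
        return len(self.node_ids) > 1 and self.node_ids[0] == self.node_ids[-1]


ring = Way(id=10, node_ids=[1, 2, 3, 1], tags={"leisure": "park"})
road = Way(id=11, node_ids=[1, 2, 3], tags={"highway": "residential"})
print(ring.is_closed(), road.is_closed())
```

Whether the closed way `ring` is interpreted as a polygon or as a linear ring would still have to be decided from its tags, which this sketch does not attempt.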
Example: Leipzig's Zoo
Comparison: Leipzig's Zoo (OpenStreetMap)
Comparison: Leipzig's Zoo (GoogleMaps)
LGD Architecture
Tag Mappings
Key-value pairs will be assigned to RDF resources
Each pair (k, v) can be annotated with datatypes, language tags, classes
Mappings are themselves tables
Example table: lgd_map_literal
k | property | lang
name | rdfs:label |
name:en | rdfs:label | en
alt_label | skos:altLabel |
note | rdfs:comment |
. . . | . . . | . . .
View Definition
RDF mapping of the data from a PostgreSQL database
Create View lgd_nodes As
  Construct {
    ?n a lgdm:Node .
    ?n geom:geometry ?g .
    ?g ogc:asWKT ?o .
  }
  With
    ?n = uri(lgd:node, ?id)
    ?g = uri(lgd-geom:node, ?id)
    ?o = typedLiteral(?geom, ogc:wktLiteral)
  From
    nodes
Sparqlify
SPARQL-SQL Rewriter
Rewrites SPARQL queries according to the view definition
Platform module offers SPARQL endpoint and Linked Data interface
https://github.com/AKSW/Sparqlify
Rest-API
Offers REST methods for frequent queries
Based on SPARQL (Virtuoso) endpoint
Downloads
RDF dataset for download
Generated using Construct { ?s ?p ?o }
http://downloads.linkedgeodata.org
Ontology
Enriched classes and properties with multilingual labels from TranslateWiki
http://translatewiki.net
Imported icons for 90 classes from the freely available icon collection of SJJB Management
http://www.sjjb.co.uk/mapicons/
SML Mapping Examples
The following slides demonstrate how to map relational data to RDF with the Sparqlification Mapping Language (SML).
The following prefixes are used:
prefix | IRI
rdfs | http://www.w3.org/2000/01/rdf-schema#
ogc | http://www.opengis.net/ont/geosparql#
geom | http://geovocab.org/geometry#
lgd | http://linkedgeodata.org/triplify/
lgd-geom | http://linkedgeodata.org/geometry/
SML - Mapping Example I: The Goal (1/4)
Input Table
nodes
id | geom
1 | POINT(0 0)
2 | POINT(1 1)
How to map tables to RDF?
How to introduce the distinction, commonly used in GIS, between feature and geometry?
Aimed for RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
lgd:node1 geom:geometry lgd-geom:node1 .
lgd:node2 geom:geometry lgd-geom:node2 .
lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .
lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .
SML - Mapping Example I: SML Syntax Outline (2/4)
Create View myNodesView As
  Construct {
    ...
  }
  With
    ...
  From
    ...
SML - Mapping Example I: Construct and From (3/4)
Create View myNodesView As
  Construct {
    ?n geom:geometry ?g .
    ?g ogc:asWKT ?o
  }
  With
    ...
  From
    nodes
SML - Mapping Example I: Complete! (4/4)
Input Table: nodes
id | geom
1 | POINT(0 0)
2 | POINT(1 1)
Create View myNodesView As
  Construct {
    ?n geom:geometry ?g .
    ?g ogc:asWKT ?o
  }
  With
    ?n = uri(lgd:node, ?id)
    ?g = uri(lgd-geom:node, ?id)
    ?o = typedLiteral(?geom, ogc:wktLiteral)
  From
    nodes
Aimed for RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
lgd:node1 geom:geometry lgd-geom:node1 .
lgd:node2 geom:geometry lgd-geom:node2 .
lgd-geom:node1 ogc:asWKT "POINT(0 0)"^^ogc:wktLiteral .
lgd-geom:node2 ogc:asWKT "POINT(1 1)"^^ogc:wktLiteral .
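To make the semantics of the view concrete, the following Python sketch emulates what the view computes for each row of the input table. It is not how Sparqlify works internally (Sparqlify rewrites SPARQL to SQL rather than materialising triples this way); the helper names `uri`, `typed_literal` and `my_nodes_view` are assumptions chosen to mirror the SML term constructors.

```python
def uri(prefix, local_id):
    # Mirrors uri(lgd:node, ?id): concatenate the prefixed name and the id.
    return f"{prefix}{local_id}"


def typed_literal(value, datatype):
    # Mirrors typedLiteral(?geom, ogc:wktLiteral) in Turtle-like notation.
    return f'"{value}"^^{datatype}'


def my_nodes_view(rows):
    """Emit the triples of the Construct template once per row of `nodes`."""
    triples = []
    for row in rows:
        n = uri("lgd:node", row["id"])
        g = uri("lgd-geom:node", row["id"])
        o = typed_literal(row["geom"], "ogc:wktLiteral")
        triples.append((n, "geom:geometry", g))   # feature -> geometry
        triples.append((g, "ogc:asWKT", o))       # geometry -> WKT literal
    return triples


nodes = [{"id": 1, "geom": "POINT(0 0)"}, {"id": 2, "geom": "POINT(1 1)"}]
triples = my_nodes_view(nodes)
for s, p, o in triples:
    print(s, p, o, ".")
```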
SML Mapping Examples
A more complex example, which demonstrates the use of an SQL mapping table and an SQL helper view.
SML - Mapping Example II: The Goal (1/8)
Input Table: node_tags
id | k | v
1 | name | Universitaet Leipzig
1 | name:en | University of Leipzig
1 | amenity | university
1 | addr:street | Augustusplatz
1 | addr:city | Leipzig
Aimed for RDF Output
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lgd: <http://linkedgeodata.org/triplify/> .
lgd:node1 rdfs:label "Universitaet Leipzig" .
lgd:node1 rdfs:label "University of Leipzig"@en .
SML - Mapping Example II: Source Data (2/8)
SML - Mapping Example II: Mapping Table (3/8)
lgd_map_literal
k | property | lang
name | rdfs:label |
name:en | rdfs:label | en
alt_label | skos:altLabel |
note | rdfs:comment |
. . . | . . . | . . .
SML - Mapping Example II: Helper View (4/8)
Helper View: lgd_node_tags_literal
id | property | v | lang
1 | rdfs:label | Universitaet Leipzig |
1 | rdfs:label | University of Leipzig | en
. . . | . . . | . . . | . . .
SELECT id, property, v, lang FROM node_tags, lgd_map_literal
WHERE node_tags.k = lgd_map_literal.k
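The helper view can be tried out with an in-memory database. The sketch below uses Python's built-in `sqlite3` module purely for convenience (the slides state that LinkedGeoData itself uses PostgreSQL); the table contents are a subset of the example rows, and the `ORDER BY` clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node_tags (id INTEGER, k TEXT, v TEXT);
CREATE TABLE lgd_map_literal (k TEXT, property TEXT, lang TEXT);

INSERT INTO node_tags VALUES
  (1, 'name',    'Universitaet Leipzig'),
  (1, 'name:en', 'University of Leipzig'),
  (1, 'amenity', 'university');

INSERT INTO lgd_map_literal VALUES
  ('name',    'rdfs:label',   NULL),
  ('name:en', 'rdfs:label',   'en'),
  ('note',    'rdfs:comment', NULL);

-- Join each tag with its literal mapping; unmapped keys (amenity) drop out.
CREATE VIEW lgd_node_tags_literal AS
  SELECT id, property, v, lang FROM node_tags, lgd_map_literal
  WHERE node_tags.k = lgd_map_literal.k;
""")

rows = conn.execute(
    "SELECT id, property, v, lang FROM lgd_node_tags_literal ORDER BY v"
).fetchall()
for row in rows:
    print(row)
```

Tags without an entry in the mapping table, such as `amenity` here, do not appear in the helper view; they are handled by other mapping tables.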
SML - Mapping Example II: SML View (5/8)
Logical Table: lgd_node_tags_literal
id | property | v | lang
1 | rdfs:label | Univ. L. |
1 | rdfs:label | Univ. of L. | en
. . . | . . . | . . . | . . .
SML View:
Create View lgd_node_tags_text As
Construct
SML - Mapping Example II: SML View (6/8)
Create View lgd_node_tags_text As
  Construct {
    ?s ?p ?o .
  }
  With
    ...
  From
    lgd_node_tags_literal
SML - Mapping Example II: SML View (7/8)
Create View lgd_node_tags_text As
  Construct {
    ?s ?p ?o .
  }
  With
    ?s = uri(lgd:node, ?id)
    ?p = uri(?property)
    ?o = plainLiteral(?v, ?lang)
  From
    lgd_node_tags_literal
SML - Mapping Example II: SML View (8/8)
Create View lgd_node_tags_text As
  Construct {
    ?s ?p ?o .
  }
  With
    ?s = uri(lgd:node, ?id)
    ?p = uri(?property)
    ?o = plainLiteral(?v, ?lang)
  From
    lgd_node_tags_literal
Resulting RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lgd: <http://linkedgeodata.org/triplify/> .
lgd:node1 rdfs:label "Universitaet Leipzig" .
lgd:node1 rdfs:label "University of Leipzig"@en .
Further Tag Mappings
lgd_map_datatype
k | datatype
seats | integer
unisex | boolean
lgd_map_property
k | property
website | foaf:homepage
lgd_map_resource_k
k | property | object
highway | rdf:type | lgdo:HighwayThing
lgd_map_resource_kv
k | v | property | object
waterway | river | rdf:type | lgdo:River
LGD Edit Tool
Multi User Tag Mapping WebApp
Resources
Sparqlify: http://sparqlify.org
LinkedGeoData: http://linkedgeodata.org
Tag Mappings: https://github.com/GeoKnow/LinkedGeoData/blob/master/linkedgeodata-core/src/main/resources/org/aksw/linkedgeodata/sql/Mappings.sql
SML View Definitions: https://github.com/GeoKnow/LinkedGeoData/blob/master/linkedgeodata-core/src/main/resources/org/aksw/linkedgeodata/sml/LinkedGeoData-Triplify-IndividualViews.sml
Statistics (15 August 2013)
Complete OSM planet file corresponds to ∼ 20.000.000.000 triples
Virtual access via Sparqlify
Downloads limited to selected classes: 292.780.188 triples
153.613.243 triples of Nodes
139.166.945 triples of Ways
Relations not yet available for download
Among them
532.812 PlaceOfWorship
82.788 RailwayStation
72.091 Toilets
71.613 Town
19.937 City
Access
Materialized SPARQL Endpoint (based on Virtuoso DB, download datasets loaded)
http://linkedgeodata.org/sparql
http://linkedgeodata.org/snorql
Virtual SPARQL Endpoint (based on Sparqlify, access to 20B triples, limited SPARQL 1.0 support)
http://linkedgeodata.org/vsparql
http://linkedgeodata.org/vsnorql
REST Interface (based on the Virtual SPARQL Endpoint)
Supports limited queries (e.g. circular/rectangular area, filtering by labels)
Downloads
http://downloads.linkedgeodata.org
Monthly updates on the above datasets envisioned
Use Cases Augmented Reality
Use Cases Generic Browsing
Why Link Discovery?
1 Fourth Linked Data principle
2 Links are central for
Cross-ontology QA
Data Integration
Reasoning
Federated Queries
...
3 2011 topology of the LOD Cloud:
31+ billion triples
≈ 0.5 billion links
owl:sameAs in most cases
Why is it difficult?
1 Time complexity
Large number of triples
Quadratic a-priori runtime
69 days for mapping cities from DBpedia to Geonames (1 ms per comparison)
Decades for linking DBpedia and LGD
. . .
Definition (Link Discovery)
Given sets S and T of resources and relation R
Task: Find $M = \{(s, t) \in S \times T : R(s, t)\}$
Common approaches:
Find $M' = \{(s, t) \in S \times T : \sigma(s, t) \geq \theta\}$
Find $M' = \{(s, t) \in S \times T : \delta(s, t) \leq \theta\}$
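The similarity-based formulation of link discovery, together with its quadratic cost, can be illustrated with a brute-force sketch. The function names and the toy similarity `toy_sigma` are assumptions for this example; real link specifications combine several, usually dataset-dependent, measures.

```python
from itertools import product


def naive_link_discovery(S, T, sigma, theta):
    """Return M' = {(s, t) : sigma(s, t) >= theta} and the number of comparisons.

    Every pair in S x T is compared, hence |S| * |T| comparisons.
    """
    links = [(s, t) for s, t in product(S, T) if sigma(s, t) >= theta]
    return links, len(S) * len(T)


def toy_sigma(a, b):
    # Hypothetical similarity: Jaccard overlap of the lower-cased character sets.
    x, y = set(a.lower()), set(b.lower())
    return len(x & y) / len(x | y)


S = ["Leipzig", "Ulm"]
T = ["leipzig", "Hamburg", "ulm"]
links, comparisons = naive_link_discovery(S, T, toy_sigma, theta=1.0)
print(links, comparisons)
```

The number of comparisons grows with the product of the two set sizes, which is what makes the brute-force approach impractical for large datasets.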
Why is it difficult?
2 Complexity of specifications
Combination of several attributes required for high precision
Tedious discovery of most adequate mapping
Dataset-dependent similarity functions
LIMES Framework
Runtime Optimization
Reduce the number of comparisons $C(A) \geq |M'|$ (assuming we need all σ/θ values for links)
Maximize reduction ratio: $RR(A) = 1 - \frac{C(A)}{|S||T|}$
Question
Can we devise lossless approaches with guaranteed RR?
Advantages
Space management
Runtime prediction
Resource scheduling
RR Guarantee
Best achievable reduction ratio: $RR_{max} = 1 - \frac{|M'|}{|S||T|}$
Approach H(α) fulfills the RR guarantee criterion iff:
$\forall r < RR_{max}, \exists \alpha : RR(H(\alpha)) \geq r$
Here, we use relative reduction ratio (RRR):
$RRR(A) = \frac{RR_{max}}{RR(A)}$
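As a small worked example of these ratios, the following sketch computes RR and RRR for made-up numbers. The set sizes, the number of matches and the number of comparisons are hypothetical values chosen only to show the arithmetic.

```python
def reduction_ratio(comparisons, size_s, size_t):
    # RR(A) = 1 - C(A) / (|S| |T|)
    return 1 - comparisons / (size_s * size_t)


def relative_reduction_ratio(comparisons, matches, size_s, size_t):
    # RRR(A) = RR_max / RR(A), with RR_max = 1 - |M'| / (|S| |T|)
    rr_max = reduction_ratio(matches, size_s, size_t)
    return rr_max / reduction_ratio(comparisons, size_s, size_t)


# Hypothetical setting: |S| = |T| = 1000, |M'| = 500, 2000 comparisons carried out.
rr = reduction_ratio(2000, 1000, 1000)
rrr = relative_reduction_ratio(2000, 500, 1000, 1000)
print(rr, rrr)
```

Because no lossless approach can use fewer than |M'| comparisons, RRR is at least 1, and values close to 1 indicate a nearly optimal reduction.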
Goal
Formal Goal
Devise H(α) : ∀r > 1, ∃α : RRR(H(α)) ≤ r
Restrictions
Minkowski Distance
$\delta(s, t) = \sqrt[p]{\sum_{i=1}^{n} |s_i - t_i|^p}, \quad p \geq 2$
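For reference, the Minkowski distance can be written directly in Python. The function name and the example points are assumptions for this sketch.

```python
def minkowski(s, t, p=2):
    """Minkowski distance of order p between two points of equal dimension."""
    if len(s) != len(t):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) ** p for a, b in zip(s, t)) ** (1 / p)


d = minkowski((0, 0), (3, 4))   # p = 2 is the Euclidean distance
print(d)
```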
Space Tiling
HYPPO
δ(s, t) ≤ θ describes a hypersphere
Approximate hypersphere by using a hypercube
Easy to compute
No loss of recall (blocking)
Space Tiling
Set width of single hypercube to Δ = θ/α
Tile Ω = S ∪ T into the adjacent cubes C
Coordinates: $(c_1, \ldots, c_n) \in \mathbb{N}^n$
Contains points $\omega \in \Omega : \forall i \in \{1, \ldots, n\},\ c_i \Delta \leq \omega_i < (c_i + 1)\Delta$
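The tiling step can be sketched as follows: each point is assigned to the cube whose coordinates are obtained by integer division of its components by Δ = θ/α. The function names and the sample points are assumptions for illustration.

```python
import math
from collections import defaultdict


def cube_coordinates(point, theta, alpha):
    """Coordinates (c_1, ..., c_n) of the cube with c_i * delta <= w_i < (c_i + 1) * delta."""
    delta = theta / alpha
    return tuple(math.floor(x / delta) for x in point)


def tile(points, theta, alpha):
    """Group points by the hypercube they fall into."""
    cubes = defaultdict(list)
    for p in points:
        cubes[cube_coordinates(p, theta, alpha)].append(p)
    return dict(cubes)


# theta = 1.0 and alpha = 2 give cubes of width 0.5.
cubes = tile([(0.1, 0.2), (0.4, 0.4), (1.2, 0.1)], theta=1.0, alpha=2)
print(cubes)
```

A larger α yields smaller cubes and therefore a finer approximation of the hypersphere, at the cost of more cubes to index.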
HYPPO
Combine $(2\alpha + 1)^n$ hypercubes around C(ω) to approximate hypersphere
$RRR(HYPPO(\alpha)) = \frac{(2\alpha + 1)^n}{\alpha^n S(n)}$
$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{2^n}{S(n)}$
HYPPO
RRR(HYPPO) for p = 2, n = 2, 3, 4 and 2 ≤ α ≤ 50
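$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{4}{\pi} \approx 1.27$ (n = 2)
$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{6}{\pi} \approx 1.91$ (n = 3)
$\lim_{\alpha \to \infty} RRR(HYPPO(\alpha)) = \frac{32}{\pi^2} \approx 3.24$ (n = 4)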
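The three limits can be checked numerically. The sketch below assumes that S(n) denotes the volume of the n-dimensional unit ball for p = 2 (π for n = 2, 4π/3 for n = 3, π²/2 for n = 4); this reading is an inference from the stated limits, not a definition given on the slides.

```python
import math


def unit_ball_volume(n):
    # Volume of the Euclidean unit ball in n dimensions (assumed meaning of S(n)).
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def hyppo_rrr(alpha, n):
    # RRR(HYPPO(alpha)) = (2 alpha + 1)^n / (alpha^n S(n))
    return (2 * alpha + 1) ** n / (alpha ** n * unit_ball_volume(n))


def hyppo_rrr_limit(n):
    # Limit for alpha -> infinity: 2^n / S(n)
    return 2 ** n / unit_ball_volume(n)


for n in (2, 3, 4):
    print(n, round(hyppo_rrr(50, n), 3), round(hyppo_rrr_limit(n), 3))
```

The limit grows with the dimension n, so HYPPO cannot reach an RRR of 1 no matter how large α is chosen.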
HR3: Idea
$index(C, \omega) = \begin{cases} 0 & \text{if } \exists i : |c_i - c(\omega)_i| \leq 1,\ 1 \leq i \leq n, \\ \sum_{i=1}^{n} (|c_i - c(\omega)_i| - 1)^p & \text{else} \end{cases}$
HR3: Idea
Compare C(ω) with C iff $index(C, \omega) \leq \alpha^p$
[Figure: cubes selected for α = 4, p = 2]
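The index function and the resulting comparison rule can be sketched as follows. Cubes are given by their integer coordinates; the function names are assumptions for this illustration, and the code only covers the selection of cubes, not the full HR3 algorithm.

```python
def hr3_index(cube, ref_cube, p=2):
    """index(C, w): 0 if some coordinate differs by at most 1 from C(w),
    otherwise the sum of (|c_i - c(w)_i| - 1)^p over all dimensions."""
    diffs = [abs(c - r) for c, r in zip(cube, ref_cube)]
    if any(d <= 1 for d in diffs):
        return 0
    return sum((d - 1) ** p for d in diffs)


def should_compare(cube, ref_cube, alpha, p=2):
    # Points in C(w) are only compared with points in C if index(C, w) <= alpha^p.
    return hr3_index(cube, ref_cube, p) <= alpha ** p


origin = (0, 0)
for cube in [(4, 3), (4, 4), (5, 1)]:
    print(cube, hr3_index(cube, origin), should_compare(cube, origin, alpha=4))
```

For α = 4 and p = 2, the corner cube (4, 4) lies inside the block of (2α + 1)^n cubes that HYPPO considers, but its index of 18 exceeds α^p = 16, so this rule discards it.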
HR3: Idea
Lemma
$\forall s \in S : index(C, s) > \alpha^p$ implies that all t ∈ C are non-matches
Claims
No loss of recall ✓
$\lim_{\alpha \to \infty} RRR(HR3(\alpha)) = 1$
HR3: Lemma 3
Lemma
∀α > 1 RRR(HR3(2α)) < RRR(HR3(α))
[Figures: cubes selected by HR3 for p = 2 with α = 4, 8, 25 and 50]
HR3: Idea
Theorem
$\lim_{\alpha \to \infty} RRR(HR3(\alpha)) = 1$
Claims
No loss of recall ✓
$\lim_{\alpha \to \infty} RRR(HR3(\alpha)) = 1$ ✓
HR3: Experiments
Compare HR3 with LIMES 0.5's HYPPO and SILK 2.5.1
Experimental Setup:
Deduplicating DBpedia places by minimum elevation, elevation and maximum elevation (θ = 49 m, 99 m).
Linking Geonames and LinkedGeoData by longitude and latitude (θ = 1, 9)
64-bit computer with a 2.8 GHz i7 processor and 8 GB RAM.
HR3: Experiments (Comparisons)
Experiment 2: Deduplicating DBpedia places, θ = 99 m
$0.64 \times 10^6$ fewer comparisons
Experiment 4: Linking Geonames and LinkedGeoData, θ = 9
$4.3 \times 10^6$ fewer comparisons
HR3: Experiments (Runtime)
Experiment 1, 2: DBpedia, θ = 49 m, 99 m
Experiment 3, 4: Geonames and LGD, θ = 1, 9
[Bar chart: runtime in seconds (logarithmic axis from 10^0 to 10^4) of HR3, HYPPO and SILK for Exp. 1 to Exp. 4]
HR3: Summary
Mission
New category of algorithms for link discovery
Presented HR3
Link discovery in affine spaces with Minkowski measures
Outperforms the state of the art (runtime, comparisons)
Optimal reduction ratio
Integrated in LIMES
Learning Complex Specifications
Supervised (mostly active, e.g., RAVEN, EAGLE, SILK)
Unsupervised (e.g., KnoFuss, EUCLID, EAGLE)
Learning Complex Specifications
Insight
Choice of right example is key for learning
So far, only use of informativeness
Question
Can we do better by using more information?
Higher F-measure
Often slower
Basic Idea
Use similarity of link candidates when selecting most informative examples (intra + inter class similarity)
Similarity of Candidates
Link candidate x = (s, t) can be regarded as vector $(\sigma_1(x), \ldots, \sigma_n(x)) \in [0, 1]^n$.
Similarity of link candidates x and y:
$sim(x, y) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} (\sigma_i(x) - \sigma_i(y))^2}}$ (1)
Allows exploiting both intra- and inter-class similarity
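Equation (1) translates directly into code. In this sketch a link candidate is represented by its vector of similarity values; the function name and the example vectors are assumptions.

```python
import math


def candidate_similarity(x, y):
    """sim(x, y) = 1 / (1 + Euclidean distance of the two similarity vectors)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1 / (1 + distance)


same = candidate_similarity((0.9, 0.8), (0.9, 0.8))   # identical vectors
apart = candidate_similarity((1.0, 0.0), (0.0, 0.0))  # distance 1
print(same, apart)
```

The value is 1 for identical vectors and decreases towards 0 as the vectors move apart, so it can serve as an edge weight between link candidates.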
Graph Clustering
Rationale: Use intra-class similarity
Approach
Cluster elements of S+ and S− independently
Choose one element per cluster as representative
Present oracle with most informative representatives
[Figure: weighted similarity graphs over S+ and S− with nodes a to l and edge weights 0.25, 0.8 and 0.9]
BorderFlow
G = (V ,E , ω) with V = S+ or V = S−
ω(x , y) = sim(x , y)
Keep best ec edges for each x ∈ V
BorderFlow
Seed-based algorithm
Goal: Maximize borderflow ratio $bf(X) = \frac{\Omega(b(X), X)}{\Omega(b(X), n(X))}$
http://sourceforge.net/projects/cugar-framework/
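The border flow ratio can be computed for a small weighted graph as sketched below. The slides do not define the symbols, so the sketch assumes the usual reading: b(X) are the nodes of X with at least one edge leaving X, n(X) are the nodes outside X adjacent to X, and Ω(A, B) is the total weight of edges from A to B. The function names and the example graph are illustrative, and the seed-based cluster growth itself is not shown.

```python
def omega(weights, A, B):
    """Total weight of the directed edges going from a node in A to a node in B."""
    return sum(w for (u, v), w in weights.items() if u in A and v in B)


def border_flow_ratio(weights, X):
    X = set(X)
    border = {u for (u, v) in weights if u in X and v not in X}      # assumed b(X)
    neighbours = {v for (u, v) in weights if u in X and v not in X}  # assumed n(X)
    outflow = omega(weights, border, neighbours)
    if outflow == 0:
        return float("inf")   # no edge leaves the cluster
    return omega(weights, border, X) / outflow


def symmetric(edges):
    # Store every undirected edge in both directions.
    out = {}
    for (u, v), w in edges.items():
        out[(u, v)] = w
        out[(v, u)] = w
    return out


weights = symmetric({("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.8, ("c", "d"): 0.25})
ratio = border_flow_ratio(weights, {"a", "b", "c"})
smaller = border_flow_ratio(weights, {"a", "b"})
print(ratio, smaller)
```

In this toy graph the cluster {a, b, c} has a higher ratio than {a, b}, because its only border node c is connected much more strongly to the inside than to the outside.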
Conclusion
Can be combined with arbitrary active learning ML algorithms
Was experimentally combined with EAGLE (genetic programming) and RAVEN (linear classifier) and shown to outperform the plain informativeness function in terms of F-measure
Choice of example important to minimise user effort
Contact me for detailed experimental results
Longer runtimes (up to 2×)
Summary
Linking is a crucial task in the web of data
Two key problems
1 Efficient execution of link specifications
2 Creation of link specifications
Presented HR3 to handle the first problem
Presented COALA as building block for the second problem
Motivation
rise in the availability and usage of knowledge bases
still a lack of knowledge bases that consist of sophisticated schema information and instance data adhering to this schema
e.g. in the life sciences several knowledge bases
only consist of schema information
are, to a large extent, a collection of facts without a clear structure (e.g. information extracted from databases)
combination of sophisticated schema and instance data would allow powerful reasoning, consistency checking, and improved querying
→ create schemata based on existing data
Example
dbr:Brad_Pitt :birthPlace dbr:Shawnee,_Oklahoma ;
    a :Person .
dbr:Angela_Merkel :birthPlace dbr:Hamburg ;
    a :Person .
dbr:Albert_Einstein :birthPlace dbr:Ulm ;
    a :Person .
dbr:Shawnee,_Oklahoma a :Place .
dbr:Ulm a :Place .
dbr:Hamburg a :Place .
Suggestions: birthPlace
ObjectProperty: birthPlace
    Characteristics: Functional
    Domain: Person
    Range: Place
    SubPropertyOf: hasBeenAt
Benefits of an expressive schema
Axioms serve as documentation for the purpose and correct usage of schema elements
Additional implicit information can be inferred
Improve query optimisation
Improve/allow the application of schema debugging techniques
Each person was only born at one place?!
[Diagram: one subject with two birthPlace edges pointing to different (≠) objects]
birthPlace is functional
SELECT ?s WHERE {
    ?s dbo:birthPlace ?o1 .
    ?s dbo:birthPlace ?o2 .
    FILTER (?o1 != ?o2)
}
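The check behind this query can also be sketched outside SPARQL. The following is a minimal Python sketch, assuming the data is already available as an in-memory list of (subject, predicate, object) tuples. The function name `functional_violations` and the second birth place for Brad Pitt are made up for illustration and do not come from DBpedia.

```python
from collections import defaultdict

def functional_violations(triples, prop):
    """Return the subjects that have more than one distinct value for prop."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: objs for s, objs in values.items() if len(objs) > 1}

triples = [
    ("dbr:Brad_Pitt", "dbo:birthPlace", "dbr:Shawnee,_Oklahoma"),
    ("dbr:Brad_Pitt", "dbo:birthPlace", "dbr:Oklahoma"),  # hypothetical second value
    ("dbr:Angela_Merkel", "dbo:birthPlace", "dbr:Hamburg"),
]
violations = functional_violations(triples, "dbo:birthPlace")
print(sorted(violations))
```

Like the `FILTER (?o1 != ?o2)` condition, this compares identifiers only. Two different URIs may still denote the same place, so the hits are candidates for inspection rather than proven errors.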
Where was Julia Nannie Wallace born?
Julia Nannie Wallace was born in Lacrosse?
No, Julia Nannie Wallace was born in La Crosse!
[Diagram: the object of birthPlace has rdf:type Sport; the range axiom additionally implies rdf:type Place; Sport ≠ Place]
birthPlace range Place
Place disjointWith Sport
City ⊑ Place
SELECT ?s ?place WHERE {
    ?s dbo:birthPlace ?place .
    ?place rdf:type/rdfs:subClassOf* ?type1 .
    ?type2 rdfs:subClassOf* dbo:Place .
    ?type1 owl:disjointWith ?type2 .
}
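The pattern of this query can be transcribed to plain Python to see which facts it flags. This is only a sketch under assumptions: the class hierarchy, the disjointness pairs and the type assertions are passed in as small dictionaries, the function names are hypothetical, and the resource name `dbr:La_Crosse` is illustrative. Unlike the query, which matches `owl:disjointWith` only in the direction in which the triple is stored, the sketch checks both directions.

```python
def superclasses(cls, sub_class_of):
    """Reflexive-transitive closure of rdfs:subClassOf starting at cls."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(sub_class_of.get(c, ()))
    return seen

def range_conflicts(facts, types, sub_class_of, disjoint, range_cls):
    """Flag (subject, object) pairs whose object has an asserted or inherited
    type that is declared disjoint with range_cls or one of its subclasses."""
    all_classes = set(sub_class_of) | {range_cls}
    for sups in sub_class_of.values():
        all_classes |= set(sups)
    range_side = {c for c in all_classes if range_cls in superclasses(c, sub_class_of)}
    hits = []
    for s, o in facts:
        obj_types = set()
        for t in types.get(o, ()):
            obj_types |= superclasses(t, sub_class_of)
        if any((t1, t2) in disjoint or (t2, t1) in disjoint
               for t1 in obj_types for t2 in range_side):
            hits.append((s, o))
    return hits

sub_class_of = {"dbo:City": ["dbo:Place"], "dbo:Sport": ["dbo:Activity"]}
disjoint = {("dbo:Place", "dbo:Sport")}
types = {"dbr:Lacrosse": ["dbo:Sport"], "dbr:La_Crosse": ["dbo:City"]}

wrong = range_conflicts([("dbr:Julia_Nannie_Wallace", "dbr:Lacrosse")],
                        types, sub_class_of, disjoint, "dbo:Place")
right = range_conflicts([("dbr:Julia_Nannie_Wallace", "dbr:La_Crosse")],
                        types, sub_class_of, disjoint, "dbo:Place")
print(wrong, right)
```

Only the link to the sport is reported; the link to the city passes the check.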
3 Steps to get a schema
3-Phase Enrichment Learning Approach:
Input: Entity URI, Axiom Type, Knowledge Base (SPARQL Endpoint)
1. obtain schema information (only executed once per knowledge base) → Background Knowledge
2. obtain axiom type and entity specific data (sample data if necessary) → Background Knowledge + Relevant Instance Data
3. run machine learning algorithm → List of Axiom Suggestions + Metadata
Components: SPARQL Endpoint, Reasoner (optional invocation), Learner (DL-Learner), Enrichment Ontology
iterate over all axiom types and schema entities for full enrichment
Starting Point
SPARQL endpoint: http://dbpedia.org/sparql
Entity URI: http://dbpedia.org/ontology/author
Axiom Type: Object Property Domain
Step 1 - Obtaining Schema Information
CONSTRUCT WHERE { ?sub rdfs:subClassOf ?sup . }
ORDER BY DESC(?sub) LIMIT 1000 OFFSET 1000
dbo:Disease rdfs:subClassOf owl:Thing .
dbo:Book rdfs:subClassOf dbo:WrittenWork .
dbo:WrittenWork rdfs:subClassOf dbo:Work .
dbo:Work rdfs:subClassOf owl:Thing .
dbo:Philosopher rdfs:subClassOf dbo:Person .
dbo:Person rdfs:subClassOf dbo:Agent .
dbo:Agent rdfs:subClassOf owl:Thing .
dbo:Sport rdfs:subClassOf dbo:Activity .
dbo:Activity rdfs:subClassOf owl:Thing .
dbo:Fish rdfs:subClassOf dbo:Animal .
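The LIMIT/OFFSET clause suggests that the schema is fetched page by page, presumably because public endpoints cap the size of a single result. A minimal sketch of generating such paged queries follows; the helper `paged_queries` and its `max_pages` parameter are hypothetical, and the PREFIX declarations are omitted as on the slide.

```python
def paged_queries(pattern, order_var, page_size=1000, max_pages=3):
    """Yield CONSTRUCT queries that walk through a result set page by page."""
    for page in range(max_pages):
        yield (
            f"CONSTRUCT WHERE {{ {pattern} }} "
            f"ORDER BY DESC(?{order_var}) "
            f"LIMIT {page_size} OFFSET {page * page_size}"
        )

queries = list(paged_queries("?sub rdfs:subClassOf ?sup .", "sub"))
for q in queries:
    print(q)
```

In practice the loop would stop as soon as a page returns fewer than `page_size` triples instead of relying on a fixed page count.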
Step 2 - Obtain axiom type and entity specific data
SELECT ?type (COUNT(DISTINCT ?s) AS ?cnt) WHERE { ?s dbo:author ?o . ?s a ?type . }
GROUP BY ?type ORDER BY DESC(?cnt)
type                  cnt
owl:Thing             30284
dbo:Work              30284
schema:CreativeWork   30284
dbo:WrittenWork       25730
dbo:Book              24673
schema:Book           24673
dbo:TelevisionShow    2567
dbo:Play              1057
...                   ...
CONSTRUCT WHERE { ?ind dbo:author ?o . ?ind a ?type . }
ORDER BY DESC(?ind) LIMIT 1000 OFFSET 2000
...
dbpedia:The_Adventures_of_Tom_Sawyer dbo:author dbpedia:Mark_Twain ;
    rdf:type dbo:Book .
dbpedia:The_Zombie_Survival_Guide dbo:author dbpedia:Max_Brooks ;
    rdf:type dbo:WrittenWork .
dbpedia:Web_Therapy dbo:author dbpedia:Lisa_Kudrow ;
    rdf:type dbo:Book .
...
Step 3 - Scoring
dbpedia:The_Adventures_of_Tom_Sawyer dbo:author dbpedia:Mark_Twain ;
    rdf:type dbo:Book .
dbpedia:The_Zombie_Survival_Guide dbo:author dbpedia:Max_Brooks ;
    rdf:type dbo:WrittenWork .
dbpedia:Web_Therapy dbo:author dbpedia:Lisa_Kudrow ;
    rdf:type dbo:Book .
Score(Domain(dbo:author, dbo:Book)) = 2/3 ≈ 66.7%
Score(Domain(dbo:author, dbo:WrittenWork)) = 1/3 ≈ 33.3%
dbo:Book rdfs:subClassOf dbo:WrittenWork .
Score(Domain(dbo:author, dbo:WrittenWork)) = 3/3 = 100%
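The three scores can be reproduced with a few lines of Python. This is a sketch of the counting idea only, not the DL-Learner implementation. It assumes the instance data is given as a dictionary from subject to asserted types and the schema as a dictionary from class to direct superclasses; `domain_score` is a hypothetical name.

```python
def domain_score(instances, candidate, sub_class_of=None):
    """Fraction of subjects whose asserted or inherited types include candidate."""
    sub_class_of = sub_class_of or {}

    def closure(cls):
        # the class itself plus all of its (transitive) superclasses
        seen, stack = set(), [cls]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(sub_class_of.get(c, ()))
        return seen

    hits = sum(
        1 for types in instances.values()
        if candidate in set().union(*(closure(t) for t in types))
    )
    return hits / len(instances)

instances = {
    "dbpedia:The_Adventures_of_Tom_Sawyer": ["dbo:Book"],
    "dbpedia:The_Zombie_Survival_Guide": ["dbo:WrittenWork"],
    "dbpedia:Web_Therapy": ["dbo:Book"],
}
schema = {"dbo:Book": ["dbo:WrittenWork"]}

book = domain_score(instances, "dbo:Book")
written_plain = domain_score(instances, "dbo:WrittenWork")
written_inferred = domain_score(instances, "dbo:WrittenWork", schema)
print(book, written_plain, written_inferred)
```

Passing the schema from Step 1 is what lifts dbo:WrittenWork from 1/3 to 3/3: every dbo:Book then also counts as a dbo:WrittenWork.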
Step 3 - Scoring (2)
Problem:
support for axiom in KB not taken into account → no difference between 3 out of 3 and 100 out of 100
Solution:
Average of 95% confidence interval (adjusted Wald method)
$p' = \frac{s+2}{m+4}$, where $s$ = #success and $m$ = #total
upper bound: $\min\left(1,\; p' + 1.96 \cdot \sqrt{\frac{p' \cdot (1-p')}{m+4}}\right)$
lower bound: $\max\left(0,\; p' - 1.96 \cdot \sqrt{\frac{p' \cdot (1-p')}{m+4}}\right)$
In 95% of the intervals the true value is between ... and ...
Score(Domain(dbo:author, dbo:Book)) ≈ 57.3%
Score(Domain(dbo:author, dbo:WrittenWork)) ≈ 69.1%
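The interval-based score is easy to compute directly. The sketch below is an illustration under assumptions, not the DL-Learner code: `interval_score` is a hypothetical name, and the `rounded` switch is added here to compare two variants. With the rounded constants 2 and 4 shown in the formula, the two example scores come out near 57.1% and 69.0%. Replacing 4 by 1.96² ≈ 3.84 (and 2 by half of that) gives roughly 57.3% and 69.1%, which suggests, but does not confirm, that the reported numbers were computed with the unrounded constants.

```python
from math import sqrt

def interval_score(s, m, z=1.96, rounded=True):
    """Centre of the clipped 95% interval for s successes out of m trials."""
    z2 = 4.0 if rounded else z * z   # the formula on the slide uses 2 and 4
    p = (s + z2 / 2) / (m + z2)
    half_width = z * sqrt(p * (1 - p) / (m + z2))
    upper = min(1.0, p + half_width)
    lower = max(0.0, p - half_width)
    return (lower + upper) / 2

for s, m in [(2, 3), (3, 3), (100, 100)]:
    print(s, m,
          round(interval_score(s, m), 4),
          round(interval_score(s, m, rounded=False), 4))
```

The last row illustrates the motivation: 100 out of 100 now scores clearly higher than 3 out of 3, although both have a raw ratio of 100%.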
More Complex Axioms
"Pattern Based Knowledge Base Enrichment", ISWC 2013
Outlook and Summary
Schema in the Linked Data Web often shallow → tools needed to support knowledge engineers
Showed some techniques for learning OWL axioms on large knowledge bases available as SPARQL endpoints
More complex axioms require:
OWL-SPARQL rewriting or
Fragment extraction
Small and medium-sized knowledge bases can be handled via techniques from Inductive Logic Programming
All algorithms implemented in DL-Learner framework (http://dl-learner.org)
7 Repair
Motivation
increasing number of knowledge bases in the Semantic Web (see e.g. LOD cloud)
maintenance of knowledge bases with expressive semantics is challenging
(Automatically) Detectable Ontology Problems
Common problems:
Syntactic Problems
Structural Problems
Semantic Problems (focus of talk)
Task Based Problems:
Reasoning Related Problems
Linked Data Related Problems
Syntactic Problems
Syntactic errors are mainly violations of conventions of the language in which the ontology is modelled.
Example (Validity of XML)
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
</rdf:RDF>
FatalError: The element type rdf:Description must be terminated by the matching end-tag </rdf:Description>. [Line = 7, Column = 3]
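This kind of error can be detected with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the helper name `xml_syntax_error` is made up, and the check covers XML well-formedness only, not whether the document is valid RDF/XML. The exact wording of the error differs from the message quoted above, which comes from a different parser.

```python
import xml.etree.ElementTree as ET

BROKEN_RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
</rdf:RDF>
"""

def xml_syntax_error(text):
    """Return the parser's error for malformed XML, or None if it parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err
    return None

error = xml_syntax_error(BROKEN_RDF_XML)
print(error)
```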
Structural Problems
Problems in the taxonomy
Example (Circularities)
A ⊑ B, B ⊑ C, C ⊑ A
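Circularities of this kind are cycles in the asserted subclass graph and can be found with a depth-first search. The sketch below assumes the taxonomy is given as a dictionary from class to its direct superclasses; `find_subclass_cycle` is a hypothetical helper, not part of any tool named in these slides.

```python
def find_subclass_cycle(sub_class_of):
    """Return one cycle in the subclass graph as a list of classes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(c):
        colour[c] = GREY
        path.append(c)
        for sup in sub_class_of.get(c, ()):
            state = colour.get(sup, WHITE)
            if state == GREY:                  # back edge closes a cycle
                return path[path.index(sup):] + [sup]
            if state == WHITE:
                found = visit(sup)
                if found:
                    return found
        path.pop()
        colour[c] = BLACK
        return None

    for c in list(sub_class_of):
        if colour.get(c, WHITE) == WHITE:
            found = visit(c)
            if found:
                return found
    return None

cycle = find_subclass_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
no_cycle = find_subclass_cycle({"A": ["B"], "B": ["C"]})
print(cycle, no_cycle)
```

All classes on such a cycle are logically equivalent, which is rarely what the modeller intended. For very deep hierarchies an iterative search would avoid Python's recursion limit.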
Reasoning Related Problems
Problems which negatively affect the performance of reasoning over expressive knowledge bases
Example (A named concept is equivalent to an AllValues restriction)
A ≡ ∀r.C
Reasoning complexity:
A universal restriction does not require a property value to exist; it only restricts the values that do exist
Any concept B whose instances cannot have r-fillers satisfies the restriction, i.e. B ⊑ ∀r.C, and becomes a subclass of A
Typically leads to unintended inferences, and the additional inferences may eventually slow down reasoning
Can be checked via Pellint (part of Pellet)
Linked Data Related Problems
Problems which are specific to publishing RDF using the Linked Data principles
Incorrect implementation of content negotiation
Mixing up information and non-information resources
Semantic Problems
Logical contradictions in the underlying knowledge base
Example (Unsatisfiable classes)
O = {A ⊑ B ⊓ C, C ⊑ ¬B} ⊨ A ⊑ ⊥
Example (Inconsistent ontology)
O = {A ⊑ B ⊓ C, C ⊑ ¬B, A(x)} ⊨ ⊤ ⊑ ⊥
Usually handled by Ontology Debugging
Ontology Debugging
Problem: We have undesirable entailments
Solution: Repair (Delete/Modify) responsible axioms
Question: Which axioms?
Justification
For an ontology O and an entailment η where O ⊨ η, a set of axioms J is a justification for η in O if J ⊆ O, J ⊨ η, and if J′ ⊂ J then J′ ⊭ η.
Minimal subsets of an ontology that are sufficient for a given entailment to hold
Synonyms: MUPS (Minimal Unsatisfiability Preserving Sub-TBoxes), MinAs (Minimal Axiom sets), Kernels
Observations:
there can be multiple justifications for a single entailment
an axiom can be part of multiple justifications
Justification - Example
O = {
B ⊑ ∃r.D (1)
B ⊑ ∀r.¬D (2)
A ⊑ B ⊓ C (3)
B ⊑ ¬C (4)
A ⊑ E (5)
A ⊑ ¬E ⊓ F (6)
} ⊨ A ⊑ ⊥
J1 = {1, 2, 3}
J2 = {5, 6}
J3 = {3, 4}
Justification Based Repair
For a repair, at least one axiom from every justification needs to be removed.
For a repair plan, all justifications are needed.
Justification Algorithms
Single justification:
Glass-Box: Modifying underlying reasoning algorithm (tableau tracing)
Black-Box: Using reasoner as oracle
All justifications:
Reiter's Hitting Set Tree Algorithm (HST)
Black-Box
Expansion-Contraction Strategy
Expansion: Add axioms to empty set until entailment holds
Contraction: Remove axioms from the set such that it becomes minimal and the entailment can still be derived.
Figure 3.1: A Depiction of a Black-Box Expand-Contract Strategy (phases: Expansion, Contraction; key: axiom, axiom in justification, selected axiom)
3.2 Black-Box Algorithms for Computing Single Justifications
The basic idea behind a black-box justification finding algorithm is to systematically test different subsets of an ontology in order to find one that corresponds to a justification. As depicted in Figure 3.1, subsets of an ontology are typically explored using an “expand-contract” strategy. In order to compute a justification for O ⊨ η, an initial, small, subset S of O (represented by circles with thick black borders in Figure 3.1) is selected. The axioms in S are typically the axioms whose signature has a non-empty intersection with the signature of η, or axioms that “define”² terms in the signature of η. A reasoner is then used to check if S ⊨ η, and if not, S is expanded by adding a few more axioms from O. This incremental expansion phase continues until S is large enough so that it entails η. When this happens, either S, or some subset of S, is guaranteed to be a justification for η. At this point S is gradually contracted until it is a minimal set of axioms that entails η, i.e. a justification for η in O.
In some black-box algorithms the expand phase may be trivial, or “empty”, where S is immediately expanded to all input axioms. In this situation it is the contraction phase which “does all the work”. An example of such a strategy is presented in Algorithm 3.1. In this algorithm a set of axioms S is initialised (expanded) with all of the axioms in O so that S ⊨ η. S is then pruned one axiom at a time, so that for each α ∈ S, if S \ {α} ⊨ η, then S = S \ {α}. This process terminates when all axioms α ∈ S have been examined, at which point ...
² For example, the axiom A ⊑ B defines the class name A
Source: M. Horridge: Justification Based Explanation in Ontologies (PhD Thesis)
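The contraction-only variant described in the excerpt (start with all axioms, then prune one axiom at a time) can be sketched as follows. The reasoner is replaced by a stand-in oracle for the six-axiom example used earlier: it reports the entailment A ⊑ ⊥ whenever the axiom set contains one of the three known justifications. The names `contract`, `entails` and `KNOWN` are assumptions for this illustration; a real system would call a DL reasoner instead.

```python
def contract(axioms, entails):
    """Prune axioms one at a time while the entailment still holds."""
    s = list(axioms)
    for alpha in list(s):
        candidate = [a for a in s if a != alpha]
        if entails(candidate):      # alpha is not needed, drop it for good
            s = candidate
    return s

# Stand-in oracle: axiom ids 1..6 refer to the running example and the
# entailment holds iff one of its known justifications is contained in the set.
KNOWN = [{1, 2, 3}, {5, 6}, {3, 4}]

def entails(axiom_ids):
    return any(j <= set(axiom_ids) for j in KNOWN)

justification = contract([1, 2, 3, 4, 5, 6], entails)
print(justification)
```

Because entailment is monotonic, a single pass is enough to reach a minimal set. Which justification is found depends on the order in which axioms are examined, and each oracle call corresponds to one reasoner invocation, which is the dominant cost in practice.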
Hitting Set Tree Algorithm
from the field of Model Based Diagnosis
given a faulty system (ontology), it constructs a finite tree whose nodes are labelled with conflict sets (justifications), and whose edges are labelled with components (axioms)
finds all minimal hitting sets, which represent diagnoses for the conflict sets in the system
diagnosis = repair
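The branching idea can be illustrated on the three justifications of the earlier example. This is a simplified sketch and not Reiter's full algorithm: the conflict sets are assumed to be known up front instead of being computed on demand by a reasoner, and the pruning rules are replaced by a final minimality filter. `minimal_hitting_sets` is a hypothetical name.

```python
def minimal_hitting_sets(conflict_sets):
    """Enumerate minimal hitting sets by branching on one unhit set at a time."""
    found = set()

    def expand(path):
        unhit = next((c for c in conflict_sets if not (c & path)), None)
        if unhit is None:               # every conflict set is hit by the path
            found.add(path)
            return
        for axiom in sorted(unhit):     # one successor edge per axiom in the label
            expand(path | {axiom})

    expand(frozenset())
    # Without pruning, some paths are non-minimal; keep only the minimal ones.
    return sorted(
        (set(h) for h in found if not any(other < h for other in found)),
        key=sorted,
    )

justifications = [frozenset({1, 2, 3}), frozenset({5, 6}), frozenset({3, 4})]
repairs = minimal_hitting_sets(justifications)
print(repairs)
```

Each resulting set is a candidate repair: removing its axioms breaks every justification. The two-element repairs both contain axiom 3, which occurs in two of the three justifications.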
Hitting Set Tree Algorithm - Example
O = {A ⊑ B, B ⊑ D, A ⊑ ∃R.C, ∃R.⊤ ⊑ D} ⊨ A ⊑ D
Figure 3.2: An Example of a Hitting Set Tree (node labels: J1 = {A ⊑ B, B ⊑ D}, J2 = {A ⊑ ∃R.C, ∃R.⊤ ⊑ D}, J′2 = {A ⊑ ∃R.C, ∃R.⊤ ⊑ D}; edge labels: A ⊑ B, B ⊑ D, A ⊑ ∃R.C, ∃R.⊤ ⊑ D)
... bottom right hand successor to the node labelled with J′2 and whose successor edge is labelled with ∃R.⊤ ⊑ D was generated by considering O \ S where S = {B ⊑ D, ∃R.⊤ ⊑ D} (∃R.⊤ plus the label of the path to the root) and noting that O \ S does not contain a justification for A ⊑ D. A fresh successor node was therefore generated and labelled with the empty set, with the successor edge label being set as ∃R.⊤ ⊑ D.
When no more successor nodes can be generated the hitting set tree is complete. At this point, all justifications for O ⊨ η occur as labels of nodes in the tree. Additionally, all minimal repairs (diagnoses) for O ⊨ η are contained as leaf-root paths in the tree.
3.3.3 Model Based Diagnosis Optimisations
The above description of a hitting set tree and illustrative example do not take into consideration any optimisations. In order to achieve acceptable performance it is necessary to consider two important optimisations: (1) Early Path Termination, and (2) Justification Reuse, which are detailed below:
Early path termination
In the unoptimised version of the algorithm a node can be extended with successor edges provided there is an axiom in its label which does not already label one of the ...
Source: M. Horridge: Justification Based Explanation in Ontologies (PhD Thesis)
Justification Scenarios
A user can be faced with the following situations:
Small number of small justifications
☺ Easy and pleasant to inspect
Small number of large justifications
☺ Better than alternatives
Large number of justifications
☹ Pretty hopeless with current mechanisms
Idea: Find source of unsatisfiability
Root Unsatisfiability - Definitions
A root UC (unsatisfiable class) is a class whose unsatisfiability does not depend on another class, otherwise it is a derived UC.
A derived UC for which there is some justification that is not a strict superset of a justification for another UC is a partial derived UC.
Root Unsatisfiable Class
A class A is a root unsatisfiable class if there is no justification J ⊨ A ⊑ ⊥ such that J is a strict superset of a justification for some other unsatisfiable class.
Root Unsatisfiability - Approaches
Approaches:
1: compute all justifications for each unsatisfiable class and apply the definition → computationally often too expensive
2: heuristics for structural analysis of axioms
Debugging Unsatisfiable Classes in OWL Ontologies, Kalyanpur, Parsia, Sirin, Hendler, J. Web Sem., 2005.
Root Unsatisfiability - Example
O = {
B ⊑ ∃r.D (1)
B ⊑ ∀r.¬D (2)
A ⊑ B ⊓ C (3)
B ⊑ ¬C (4)
A ⊑ E (5)
A ⊑ ¬E ⊓ F (6)
}
⊨ A ⊑ ⊥: J1 = {1, 2, 3}, J2 = {5, 6}, J3 = {3, 4} → partial derived (J4 ⊂ J1)
⊨ B ⊑ ⊥: J4 = {1, 2} → root
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥
J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥ J4 = 1, 2 root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥ J4 = 1, 2 root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥
J4 = 1, 2 root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥ J4 = 1, 2
root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥ J4 = 1, 2 root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥ J4 = 1, 2 root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Root Unsatisability - Example
O =
B v ∃r .D (1)
B v ∀r .¬D (2)
A v B u C (3)
B v ¬C (4)
A v E (5)
A v ¬E u F (6)
|= A v ⊥J1 = 1, 2, 3J2 = 5, 6J3 = 3, 4
|= B v ⊥ J4 = 1, 2 root
partial
(J4 ⊂ J1)
Lehmann, Bühmann (Univ. Leipzig) The Linked Data Life-Cycle 2013-08-23 213 / 252
Axiom Relevance
resolving a justification requires deleting or editing axioms
ranking methods highlight the most probable causes of problems
methods:
frequency
syntactic relevance
semantic relevance
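As an illustration of the frequency method only, the sketch below counts in how many justifications each axiom occurs. Axioms shared by many justifications are candidates whose removal resolves several problems at once. The function name and the input (the justification sets J1 to J4 of the root unsatisfiability example) are assumptions made for this sketch; syntactic and semantic relevance are not covered.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_by_frequency(justifications: Iterable[Set[int]]) -> List[Tuple[int, int]]:
    """Return (axiom, count) pairs, most frequent axiom first."""
    counts = Counter(axiom for j in justifications for axiom in j)
    # Sort by descending frequency, then by axiom number for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


justifications = [{1, 2, 3}, {5, 6}, {3, 4}, {1, 2}]
for axiom, count in rank_by_frequency(justifications):
    print(f"axiom ({axiom}) occurs in {count} justification(s)")
```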
Repair Consequences
after the repair process, axioms have been deleted or modified
→ desired entailments may be lost or new entailments obtained (including inconsistencies!)
→ user can decide to preserve them
SPARQL Endpoint Support
Previously mentioned approaches are implemented in the ORE tool (http://ore-tool.net)
ORE supports using SPARQL endpoints
implements an incremental load procedure
knowledge base is loaded in small chunks:
count number of axioms by type
priority-based loading procedure
e.g. disjointness axioms have higher priority than class assertion axioms
uses Pellet incremental reasoning
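The priority-based chunked loading can be pictured roughly as follows. This is a hypothetical sketch, not ORE's actual implementation: the function `plan_loading`, the axiom type names, the priority values, and the chunk size are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def plan_loading(counts: Dict[str, int],
                 priority: Dict[str, int],
                 chunk_size: int) -> List[Tuple[str, int, int]]:
    """Return (axiom_type, offset, size) chunks, highest-priority types first.

    counts:   number of axioms per type (e.g. obtained via COUNT queries)
    priority: lower number = loaded earlier; unknown types are loaded last
    """
    plan = []
    ordered = sorted(counts, key=lambda t: (priority.get(t, len(priority)), t))
    for axiom_type in ordered:
        total = counts[axiom_type]
        for offset in range(0, total, chunk_size):
            plan.append((axiom_type, offset, min(chunk_size, total - offset)))
    return plan


counts = {"ClassAssertion": 5, "DisjointClasses": 2}
priority = {"DisjointClasses": 0, "ClassAssertion": 1}
for chunk in plan_loading(counts, priority, chunk_size=2):
    print(chunk)
```

Each chunk could then be fetched with a paginated query (OFFSET and LIMIT) and handed to an incremental reasoner before the next chunk is requested.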
Learning of OWL Class Descriptions on Very Large Knowledge Bases,
Hellmann, Lehmann, Auer, Int. Journal Semantic Web Inf. Syst, 2009
SPARQL Endpoint Support II
algorithm performs sanity checks, e.g. SPARQL queries which probe for typical inconsistent axiom sets
can fetch additional Linked Data
different termination criteria
overall:
ORE allows state-of-the-art ontology debugging methods to be applied on a larger scale than was previously possible
aims at stronger support for the web aspect of the Semantic Web and the high popularity of the Web of Data initiative
DBpedia Live Demo
Inconsistency in DBpedia Live:
Individual: dbr:Purify_(album)
Facts: dbo:artist dbr:Axis_of_Advance
Individual: dbr:Axis_of_Advance
Types: dbo:Organisation
Class: dbo:Organisation
DisjointWith dbo:Person
ObjectProperty: dbo:artist
Range: dbo:Person
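The pattern behind this inconsistency (an object whose asserted type is disjoint with the range of the property pointing to it) can be checked with a few lines of code. The sketch below works on plain Python data structures holding the axioms of this slide; the function name `range_violations` and the data layout are assumptions for illustration, and a real tool would rely on a reasoner instead.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]


def range_violations(facts: List[Triple],
                     types: Dict[str, Set[str]],
                     ranges: Dict[str, str],
                     disjoint: Set[FrozenSet[str]]) -> List[Tuple[str, str, str, str, str]]:
    """Find facts whose object has a type declared disjoint with the property range."""
    violations = []
    for s, p, o in facts:
        rng = ranges.get(p)
        if rng is None:
            continue  # no range axiom for this property
        for t in sorted(types.get(o, ())):
            if frozenset({t, rng}) in disjoint:
                violations.append((s, p, o, t, rng))
    return violations


facts = [("dbr:Purify_(album)", "dbo:artist", "dbr:Axis_of_Advance")]
types = {"dbr:Axis_of_Advance": {"dbo:Organisation"}}
ranges = {"dbo:artist": "dbo:Person"}
disjoint = {frozenset({"dbo:Organisation", "dbo:Person"})}

for s, p, o, t, rng in range_violations(facts, types, ranges, disjoint):
    print(f"{o} is a {t}, but {p} requires {rng} (asserted by {s})")
```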
DBpedia Live Demo 2
Inconsistency in DBpedia in combination with WGS84 (Linked Data):
Individual: dbr:WKWS
Facts: geo:long -81.76833343505859
Types: dbo:Organisation
DataProperty: geo:long
Domain: geo:SpatialThing
Class: dbo:Organisation
DisjointWith: geo:SpatialThing
OpenCyc Demo
Inconsistency in OpenCyc:
Individual: 'PopulatedPlace'
Types: 'ArtifactualFeatureType', 'ExistingStuffType'
Class: 'ArtifactualFeatureType'
SubClassOf: 'ExistingObjectType'
Class: 'ExistingObjectType'
DisjointWith: 'ExistingStuffType'
ORE - Screenshot
Related Tools
Swoop
can compute justifications for unsatisfiability of classes and offers repair mode
fine-grained justification computation algorithm is incomplete
can also compute justifications for an inconsistent ontology, but does not offer repair mode in this case
does not extract locality-based modules, which leads to lower performance for large ontologies
RaDON
plugin for the NeOn toolkit
offers a number of techniques for working with inconsistent or incoherent ontologies
allows reasoning with inconsistent ontologies and can handle sets of ontologies (ontology networks)
no fine-grained justifications, no repair impact analysis
Pellint
searches for common patterns which lead to potential reasoning performance problems
integration in ORE planned
Related Tools II
PION and DION
developed in the SEKT project to deal with inconsistencies
PION is an inconsistency-tolerant reasoner (four-valued paraconsistent logic)
DION offers the possibility to compute justifications, but no repair
Explanation Workbench
Protégé plugin for reasoner requests like class unsatisfiability or inferred subsumption relations
can compute regular and laconic justifications
motivated the ORE debugging interface
current version of Explanation Workbench does not allow removing axioms in laconic justifications
RepairTab
supports the user in finding and detecting errors in ontologies
RepairTab uses a modified tableau algorithm
shows inferences which can no longer be drawn after removing an axiom (inspired ORE)
8 Knowledge Base Exploration / Querying
Motivation
User Query Interfaces:
Knowledge Base Specific Interfaces
Facet-Based Browsing
Visual SPARQL Query Builders
Question Answering
Which tools for creating (SPARQL) queries are end-user friendly?
Strengths and Weaknesses of Query Interfaces
[Figure: comparison of Knowledge Base Specific Interfaces, Facet-Based Browsing, Visual SPARQL Query Builders and Question Answering along four criteria: Easy to Use, Robust, Flexible Queries, Expressive.]
Faceted Browsing
Simple way to browse structured information
User starts with all resources and then drills down via facets
Multiple dimensions can be supported for facets, e.g. taxonomy, existence of properties, values of properties
Can be combined with text search: previously, information was browsed either via a fixed classification scheme or text search (with the latter being dominant); facet-based browsing allows a combination of both
Facete
Generic Facet-Based Browser
RDF properties are facets (sub-facets are supported)
Each facet serves as a source for columns (table rendering) and points of interest (map rendering)
JavaScript implementation - SPARQL queries are issued by the client
Any SPARQL endpoint can serve as a backend; no API needs to be implemented
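Because facets are RDF properties, a facet selection translates directly into SPARQL triple patterns, which is why a plain endpoint suffices as backend. The sketch below illustrates the idea in Python; it is not Facete's actual query generation (Facete is implemented in JavaScript), and the function `facets_to_sparql` and its constraint format are assumptions.

```python
from typing import List, Optional, Tuple

# A constraint is (property IRI, value); value None means "property exists".
Constraint = Tuple[str, Optional[str]]


def facets_to_sparql(constraints: List[Constraint], limit: int = 100) -> str:
    """Build a SELECT query returning resources that satisfy all facet constraints."""
    patterns = []
    for i, (prop, value) in enumerate(constraints):
        if value is None:
            patterns.append(f"?s <{prop}> ?v{i} .")    # facet: property exists
        else:
            patterns.append(f"?s <{prop}> {value} .")  # facet: property has this value
    body = "\n  ".join(patterns)
    return f"SELECT DISTINCT ?s WHERE {{\n  {body}\n}} LIMIT {limit}"


query = facets_to_sparql([
    ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "<http://dbpedia.org/ontology/City>"),
    ("http://dbpedia.org/ontology/populationTotal", None),
])
print(query)
```

Each additional facet the user selects adds one triple pattern, which narrows the result set step by step.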
Question Answering - State of the art
1 Map natural language question to triple-based representation.
2 Match triple-based representation against RDF data.
Example:
Where did Abraham Lincoln die?
SELECT ?x WHERE {
  res:Abraham_Lincoln dbo:deathPlace ?x .
}
PowerAqua:
Triple representation:
〈state/place, die, Abraham Lincoln〉
Ontology mappings:
〈Place, deathPlace, Abraham_Lincoln〉
Problem
Triples do not always provide a faithful representation of the semantic structure of the question.
Thus more expressive queries cannot be answered.
Example 1:
Which cities have more than three universities?
SELECT ?y WHERE {
  ?x rdf:type dbo:University .
  ?x dbo:city ?y .
}
HAVING (COUNT(?x) > 3)
PowerAqua (triple representation): 〈cities, more than, universities three〉
Example 2:
Who produced the most films?
SELECT ?y WHERE {
  ?x rdf:type dbo:Film .
  ?x dbo:producer ?y .
}
ORDER BY DESC(COUNT(?x)) LIMIT 1
PowerAqua (triple representation): 〈person/organization, produced, most films〉
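The two queries on these slides are written in a compact form. In standard SPARQL 1.1, counting ?x per ?y additionally requires grouping by ?y. The sketch below shows how the complete queries could be assembled; the helper names `count_filter_query` and `superlative_query` are hypothetical, and the prefixes `rdf:` and `dbo:` are assumed to be declared by the endpoint or the caller.

```python
def count_filter_query(cls: str, prop: str, threshold: int) -> str:
    """'Which ?y have more than <threshold> ?x of class <cls>?'"""
    return (
        "SELECT ?y WHERE {\n"
        f"  ?x rdf:type {cls} .\n"
        f"  ?x {prop} ?y .\n"
        "}\n"
        "GROUP BY ?y\n"                      # one group per candidate answer
        f"HAVING (COUNT(?x) > {threshold})"  # filter groups by size
    )


def superlative_query(cls: str, prop: str) -> str:
    """'Which ?y has the most ?x of class <cls>?'"""
    return (
        "SELECT ?y WHERE {\n"
        f"  ?x rdf:type {cls} .\n"
        f"  ?x {prop} ?y .\n"
        "}\n"
        "GROUP BY ?y\n"
        "ORDER BY DESC(COUNT(?x)) LIMIT 1"   # largest group first
    )


print(count_filter_query("dbo:University", "dbo:city", 3))
print(superlative_query("dbo:Film", "dbo:producer"))
```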
Goal
In order to understand a user question, we need to understand:
The words
Find a mapping from natural language expressions to ontology concepts.
Abraham Lincoln → res:Abraham_Lincoln
died in → dbo:deathPlace
The semantic structure
Determine the triple structure as well as filters and aggregation functions (order by, count, etc.).
the most N → ORDER BY DESC(COUNT(?n)) LIMIT 1
more than three N → HAVING (COUNT(?n) > 3)
Goal: an approach that combines both an analysis of the semantic structure and a mapping of words to URIs
Template-based question answering
1 Template generation (Understanding the semantic structure)
Parse the question to produce a SPARQL template that directly mirrors the structure of the question, including filters and aggregation operations.
2 Template instantiation (Understanding the words)
Instantiate the SPARQL template by matching natural language expressions with ontology concepts using statistical entity identification and predicate detection.
Example: Who produced the most films?
1 SPARQL template:
SELECT ?x WHERE {
  ?y rdf:type ?c .
  ?y ?p ?x .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?c CLASS [films]
?p PROPERTY [produced]
2 Instantiations:
?c = <http://dbpedia.org/ontology/Film>
?p = <http://dbpedia.org/ontology/producer>
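Template instantiation amounts to filling each slot with one of its candidate URIs and enumerating all combinations. A minimal sketch, assuming slots are plain variable names that are textually replaced; the function `instantiate` is hypothetical and not the system's actual code. The template text is reproduced as shown on the slide.

```python
from itertools import product
from typing import Dict, List

TEMPLATE = (
    "SELECT ?x WHERE {\n"
    "  ?y rdf:type ?c .\n"
    "  ?y ?p ?x .\n"
    "}\n"
    "ORDER BY DESC(COUNT(?y)) LIMIT 1"
)


def instantiate(template: str, candidates: Dict[str, List[str]]) -> List[str]:
    """Return one query per combination of slot candidates."""
    slots = sorted(candidates)
    queries = []
    for combo in product(*(candidates[slot] for slot in slots)):
        query = template
        for slot, uri in zip(slots, combo):
            # Naive textual substitution; assumes no other variable
            # name starts with the slot name.
            query = query.replace(slot, f"<{uri}>")
        queries.append(query)
    return queries


candidates = {
    "?c": ["http://dbpedia.org/ontology/Film",
           "http://dbpedia.org/ontology/FilmFestival"],
    "?p": ["http://dbpedia.org/ontology/producer"],
}
for q in instantiate(TEMPLATE, candidates):
    print(q, end="\n\n")
```

The number of generated queries is the product of the candidate counts per slot, which is why candidates are ranked and pruned before instantiation.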
Architecture
[Figure: architecture diagram. Labels: Natural Language Question, Tagged Question, Parsing, Domain Independent Lexicon, Domain Dependent Lexicon, Semantic Representation, SPARQL Query Templates, Entity identification, Resources and Classes, Properties, BOA Pattern Library, Corpora, Loading, LOD, Templates with URI slots, Entity and Query Ranking, Type Checking and Prominence, Ranked SPARQL Queries, Query Selection, SPARQL Endpoint, Answer. Legend: State, Process, Uses.]
Step 1: Template generation - Linguistic processing
1 The natural language question is tagged with part-of-speech information.
2 Based on the POS tags, lexical entries are built on the fly.
Lexical entries are pairs of:
tree structures (Lexicalized Tree Adjoining Grammar)
semantic representations (ext. Discourse Representation Structures)
3 These lexical entries, together with domain-independent lexical entries, are used for parsing the question.
4 The resulting semantic representation is translated into a SPARQL template.
Example: Who produced the most films?
domain-independent: who, the most
domain-dependent: produced/VBD, films/NNS
SPARQL template 1:
SELECT ?x WHERE {
  ?x ?p ?y .
  ?y rdf:type ?c .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?c CLASS [films]
?p PROPERTY [produced]
SPARQL template 2:
SELECT ?x WHERE {
  ?x ?p ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?p PROPERTY [films]
Step 2: Template instantiation - Entity identification
1 For resources and classes:
Identify synonyms of the label using WordNet.
Retrieve entities with a label similar to the slot label based on string similarities (trigram, Levenshtein, substring).
2 For property labels, the label is additionally compared to natural language expressions stored in the BOA pattern library.
3 The highest ranking entities are returned as candidates for filling the query slots.
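The string-similarity part of this step can be sketched as follows. The trigram and Levenshtein measures are standard; how the system combines them is not stated on the slide, so taking the maximum of the two is an assumption, as are the function names and the small label table. WordNet synonyms and BOA patterns are left out.

```python
from typing import Dict, List


def trigrams(s: str) -> set:
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()
    lev = 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
    return max(trigram_similarity(a, b), lev)  # combination rule is an assumption


def rank_candidates(slot_label: str, labels: Dict[str, str], top_k: int = 3) -> List[str]:
    """Return the top_k entity identifiers whose label is most similar to the slot label."""
    scored = sorted(labels.items(), key=lambda kv: (-similarity(slot_label, kv[1]), kv[0]))
    return [uri for uri, _ in scored[:top_k]]


labels = {"dbo:Film": "film", "dbo:FilmFestival": "film festival", "dbo:Wine": "wine"}
print(rank_candidates("films", labels, top_k=2))
```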
BOA
The BOA pattern library is a repository of natural language representations of Semantic Web predicates.
Idea:
For each predicate P in a data repository (e.g. DBpedia), collect the set of entities S and O connected through P.
Search a text corpus (e.g. Wikipedia) for all sentences containing the labels of S and O.
For all retrieved sentences, the natural language predicate is a potential pattern for P. The potential patterns are then scored by a neural network (e.g. according to frequency) and filtered.
BOA: Example
Predicate: http://dbpedia.org/ontology/subsidiary
RDF snippet:
<http://dbpedia.org/resource/Google>
<http://dbpedia.org/ontology/subsidiary>
<http://dbpedia.org/resource/YouTube> .
<http://dbpedia.org/resource/Google> rdfs:label "Google"@en .
<http://dbpedia.org/resource/YouTube> rdfs:label "Youtube"@en .
Sentences:
Google's acquisition of Youtube comes as online video is really starting to hit its stride.
Youtube, a division of Google, is exploring a new way to get more high-quality clips on its site: financing amateur video creators.
Patterns:
subsidiary: S's acquisition of O
subsidiary: O, a division of S
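The pattern search can be illustrated with a few lines of Python that take the text between the two labels in each sentence. This is a simplified sketch under stated assumptions, not BOA's implementation: the function `extract_patterns` is hypothetical, matching is a case-insensitive substring search on the first occurrence of each label, and the scoring and filtering step is omitted.

```python
from collections import Counter
from typing import Iterable, Tuple


def extract_patterns(pairs: Iterable[Tuple[str, str]],
                     sentences: Iterable[str]) -> Counter:
    """Count candidate patterns 'S ... O' or 'O ... S' for (subject, object) label pairs."""
    sentences = list(sentences)
    patterns = Counter()
    for s_label, o_label in pairs:
        for sentence in sentences:
            lower = sentence.lower()
            s_pos = lower.find(s_label.lower())
            o_pos = lower.find(o_label.lower())
            if s_pos < 0 or o_pos < 0:
                continue  # sentence does not mention both entities
            if s_pos < o_pos:
                between = sentence[s_pos + len(s_label):o_pos]
                patterns[f"S{between}O"] += 1
            else:
                between = sentence[o_pos + len(o_label):s_pos]
                patterns[f"O{between}S"] += 1
    return patterns


sentences = [
    "Google's acquisition of Youtube comes as online video is really starting to hit its stride.",
    "Youtube, a division of Google, is exploring a new way to get more high-quality clips on its site.",
]
patterns = extract_patterns([("Google", "Youtube")], sentences)
for pattern, count in patterns.items():
    print(f"subsidiary: {pattern} ({count})")
```

Counting occurrences over a large corpus gives the frequency signal on which a later scoring step can build.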
BOA
The use of BOA patterns allows us to match natural language expressions and ontology concepts even if they are not string similar and not covered by WordNet.
Examples:
married to → http://dbpedia.org/ontology/spouse
was born in → http://dbpedia.org/ontology/birthPlace
graduated from → http://dbpedia.org/ontology/almaMater
write → http://dbpedia.org/ontology/author
Example: Who produced the most films?
Candidates for filling query slots:
?c CLASS [films]
<http://dbpedia.org/ontology/Film>
<http://dbpedia.org/ontology/FilmFestival>
. . .
?p PROPERTY [produced]
<http://dbpedia.org/ontology/producer>
<http://dbpedia.org/property/producer>
<http://dbpedia.org/ontology/wineProduced>
. . .
Step 3: Query ranking and selection
1 Every entity receives a score considering string similarity and prominence (roughly how often it occurs in the knowledge base).
2 The score of a query is then computed as the average of the scores of the entities used to fill its slots.
3 In addition, type checks are performed: For all triples ?x rdf:type <class>, all query triples ?x p e and e p ?x are checked w.r.t. whether domain/range of p is consistent with <class>. (If not, the query is rejected.)
4 Of the remaining queries, the one with the highest score that returns a result is chosen to retrieve an answer.
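The ranking and selection steps can be sketched as below. Only the averaging rule and the reject-on-type-mismatch rule come from the slide. The notion of "consistent" used here (the declared domain or range is absent, equal to the class, or one of its superclasses), the toy `ex:` vocabulary, and all function names are assumptions.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def query_score(entity_scores: List[float]) -> float:
    """Score of a query = average score of the entities filling its slots."""
    return mean(entity_scores)


def passes_type_check(var_types: Dict[str, str], triples: List[Triple],
                      domains: Dict[str, str], ranges: Dict[str, str],
                      superclasses: Dict[str, Set[str]]) -> bool:
    def consistent(declared: Optional[str], cls: str) -> bool:
        return (declared is None or declared == cls
                or declared in superclasses.get(cls, set()))

    for s, p, o in triples:
        if s in var_types and not consistent(domains.get(p), var_types[s]):
            return False  # typed variable in subject position vs. domain of p
        if o in var_types and not consistent(ranges.get(p), var_types[o]):
            return False  # typed variable in object position vs. range of p
    return True


def select_query(scored: List[Tuple[str, float]],
                 returns_result: Callable[[str], bool]) -> Optional[str]:
    """Pick the highest-scoring query that yields a non-empty result."""
    for query, _ in sorted(scored, key=lambda qs: -qs[1]):
        if returns_result(query):
            return query
    return None


# Toy ontology fragment (hypothetical, not DBpedia's actual axioms).
domains = {"ex:producer": "ex:Work"}
superclasses = {"ex:Film": {"ex:Work"}, "ex:FilmFestival": {"ex:Event"}}
triples = [("?film", "ex:producer", "?who")]

ok = passes_type_check({"?film": "ex:Film"}, triples, domains, {}, superclasses)
bad = passes_type_check({"?film": "ex:FilmFestival"}, triples, domains, {}, superclasses)
print(ok, bad, query_score([0.8, 0.6]))
```

`returns_result` stands for a call to the SPARQL endpoint; keeping it as a parameter lets the selection logic be tried out without network access.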
Example: Who produced the most films?
SELECT ?x WHERE {
  ?x <http://dbpedia.org/ontology/producer> ?y .
  ?y rdf:type <http://dbpedia.org/ontology/Film> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.7592425075864263
SELECT ?x WHERE {
  ?x <http://dbpedia.org/ontology/film> ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.6264001353183296
SELECT ?x WHERE {
  ?x <http://dbpedia.org/ontology/producer> ?y .
  ?y rdf:type <http://dbpedia.org/ontology/FilmFestival> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.6012584940627768
Evaluation set-up
Question set: 39 DBpedia training questions from QALD-1
The other 11 questions rely on namespaces which were not incorporated in predicate detection (FOAF and YAGO).
POS tags were idealized, in order to exclude tagging errors.
Evaluation measures:
Precision = (number of correct resources returned by system) / (number of resources returned by system)
Recall = (number of correct resources returned by system) / (number of resources in gold standard answer)
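Both measures are computed per question over sets of resources. A short sketch follows; the function names are illustrative, returning 0.0 for an empty answer set is a convention assumed here, and the F-measure is assumed to be the usual harmonic mean of precision and recall.

```python
from typing import Set


def precision(returned: Set[str], gold: Set[str]) -> float:
    """Fraction of returned resources that are correct."""
    return len(returned & gold) / len(returned) if returned else 0.0


def recall(returned: Set[str], gold: Set[str]) -> float:
    """Fraction of gold standard resources that were returned."""
    return len(returned & gold) / len(gold) if gold else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


returned = {"res:a", "res:b", "res:c", "res:d"}
gold = {"res:a", "res:b", "res:e"}
p, r = precision(returned, gold), recall(returned, gold)
print(f"precision={p:.2f} recall={r:.2f} f={f_measure(p, r):.2f}")
```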
Results
Of the 39 questions...
5 could not be parsed due to unknown syntactic constructions or uncovered domain-independent expressions
19 were answered exactly as required by the benchmark (with precision and recall 1.0)
another 2 were answered almost correctly (with precision and recall greater than 0.8)
Mean precision: 0.61
Mean recall: 0.63
F-measure: 0.62
Discussion: Main sources of error
Incorrect templates
Template structure does not coincide with structure of the data:
When did Germany join the EU?
res:Germany dbp:accessioneudate ?x .
Predicate detection fails
inhabitants ↛ dbp:population, dbp:populationTotal
owns ↛ dbo:keyPerson
higher ↛ dbp:elevationM
Wrong query is selected
Who wrote The pillars of the Earth?
res:The_Pillars_of_the_Earth_(TV_Miniseries) dbo:writer ?x .
res:The_Pillars_of_the_Earth dbo:author ?x .
Conclusion
Two-step approach:
1 Build templates that capture the semantic structure of a question.
map complex expressions (quantifiers, comparatives, superlatives, etc.) to aggregation functions
2 Instantiate templates by mapping natural language expressions to URIs.
BOA patterns for cases where string similarity and WordNet are not sufficient
Outlook
Current work: Entity identification should take into account whether candidate entities actually have any connection in the dataset.
Future work: Make templates less rigid and determine the exact triple structure on the basis of data exploration.
The created template structure does not always coincide with how the data is actually modelled.
Considering all possibilities of how the data could be modelled leads to a huge number of templates (and even more queries) for one question.
Links
Web: http://aksw.org/Projects/AutoSPARQL
Demo: http://autosparql-tbsl.dl-learner.org
BOA: http://boa.aksw.org
Daniel Gerber and Axel-Cyrille Ngonga Ngomo: Bootstrapping the Linked Data Web. In: Proceedings of the Web Scale Knowledge Extraction Workshop (WekEx), ISWC 2011.
QALD: http://www.sc.cit-ec.uni-bielefeld.de/ild/
The End
Jens Lehmann, Uni Leipzig
GeoKnow
http://geoknow.eu