Democratizing Big Semantic Data management
… or how to query a labelled graph with 28 billion edges on a standard laptop
Javier D. Fernández
26th September 2017
WECOS workshop, CSH Vienna
0 Zero knowledge
1 I have just heard of RDF and/or Linked Data
2 I know the basic foundations and I gave it a try
3 I often manage RDF/Linked Data
Knowledge of RDF/Linked Data
Image: Nick Youngson
Linked Data Introduction
Preliminaries
"Linked Data is simply about using the Web to create typed links between data from different sources."
A practical scenario…
computer scientists working in Vienna, younger than 40
The information is already on the Web… but with no structure
https://www.wu.ac.at/en/infobiz/team/fernandez    http://myPersonalWeb/CV
… Javier Fernández … 33 years old …
… Javier David Fernandez … WU (Vienna University of Economics and Business) … is a postdoctoral researcher …
The Web of Data (Semantic Web): linking data to data
https://www.wu.ac.at/en/infobiz/team/fernandez    http://myPersonalWeb/CV
… Javier Fernández … 33 years old …
… Javier David Fernandez … WU (Vienna University of Economics and Business) … is a postdoctoral researcher …
[Figure: the same two pages as a graph — "Javier Fernández" (age 33) is the same as "Javier David Fernández", who works at WU and is a postdoctoral researcher; WU is located in Vienna.]
Quick intro to
Resource Description Framework (W3C Rec. 2004)
Machine-processable descriptions
Web services, protocols, persons, proteins, geography…
Data model: based on triples/sentences (Subject, Predicate, Object)
Example
Javier isA Person
Javier hasName "Javier Fernandez"
Javier worksAt WU
Javier knows tim
Javier knows axel
axel hasName "Axel Polleres"
tim hasName "Tim Berners-Lee"
tim hasCreated http://linkeddata.org

Is this the same Javier as Javier Bardem (the actor)?
Is "worksAt" the same as "researchAt"?
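The triple model above maps naturally onto plain tuples. A minimal sketch (pure Python, using the slide's informal names, not real IRIs) of storing triples and matching a pattern with wildcards:

```python
# Triples as (subject, predicate, object) tuples, from the slide's example.
triples = [
    ("Javier", "isA", "Person"),
    ("Javier", "hasName", "Javier Fernandez"),
    ("Javier", "worksAt", "WU"),
    ("Javier", "knows", "tim"),
    ("Javier", "knows", "axel"),
    ("axel", "hasName", "Axel Polleres"),
    ("tim", "hasName", "Tim Berners-Lee"),
]

def match(pattern, data):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Who does Javier know?
print(match(("Javier", "knows", None), triples))
```

This "triple pattern" with wildcards is exactly the basic query unit SPARQL builds on, as the next slides show.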
Quick intro to
Resource Description Framework (W3C Rec. 2004)
Machine-processable descriptions
Web services, protocols, persons, proteins, geography…
Data model: based on triples/sentences (Subject, Predicate, Object)
Example
<http://Fernandez.net/Javier> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/name> "Javier Fernandez" .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.wu.ac.at> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://tim.org/foaf.rdf#tim> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://polleres.net/me> .
<http://polleres.net/me> <http://xmlns.com/foaf/0.1/name> "Axel Polleres" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/name> "Tim Berners-Lee" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/made> <http://linkeddata.org> .

URIs × URIs × (URIs ∪ Literals)
[Figure: the triples as a graph — <http://Fernandez.net/Javier> foaf:knows <http://polleres.net/me>; both are rdf:type foaf:Person, with foaf:name "Javier Fernandez" and "Axel Polleres".]
Formal query: SPARQL
Similar to SQL
SELECT ?people ?name
WHERE {
  ?people foaf:knows <http://polleres.net/me> .
  ?people foaf:name ?name .
}

Result:
?people = <http://Fernandez.net/Javier>    ?name = "Javier Fernandez"
[Figure: the graph from the previous slide, highlighting the foaf:knows edge and foaf:name literals matched by the query.]
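Under the hood, a query like the one above is a join of two triple patterns on the shared variable. A sketch of that bindings-and-join semantics in plain Python (a toy evaluator over the slide's data, not a real SPARQL engine):

```python
# Evaluate the two-pattern join above: solutions are {variable: value} dicts.
FOAF = "http://xmlns.com/foaf/0.1/"
triples = [
    ("http://Fernandez.net/Javier", FOAF + "knows", "http://polleres.net/me"),
    ("http://Fernandez.net/Javier", FOAF + "name", "Javier Fernandez"),
    ("http://polleres.net/me", FOAF + "name", "Axel Polleres"),
]

def bindings(var_s, pred, obj=None, var_o=None):
    """Solutions for one triple pattern with a fixed predicate."""
    out = []
    for s, p, o in triples:
        if p != pred or (obj is not None and o != obj):
            continue
        b = {var_s: s}
        if var_o is not None:
            b[var_o] = o
        out.append(b)
    return out

def join(left, right):
    """Merge compatible solutions: shared variables must agree."""
    return [{**l, **r} for l in left for r in right
            if all(l[k] == r[k] for k in l.keys() & r.keys())]

result = join(bindings("?people", FOAF + "knows", obj="http://polleres.net/me"),
              bindings("?people", FOAF + "name", var_o="?name"))
print(result)  # [{'?people': 'http://Fernandez.net/Javier', '?name': 'Javier Fernandez'}]
```

Axel's own foaf:name triple is filtered out by the join, because his node does not satisfy the foaf:knows pattern.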
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSION
~10K datasets organized into 9 domains, covering many and varied knowledge fields
150B statements, including entity descriptions and (inter/intra-dataset) links between them
>500 live endpoints serving this data
http://lod-cloud.net
http://stats.lod2.eu
http://sparqles.ai.wu.ac.at
Big Semantic Data
The greatness of Linked Open Data
> 150B triples
1K-6K datasets
> 557 SPARQL Endpoints
http://lod-cloud.net    https://datahub.io    http://stats.lod2.eu    http://sparqles.ai.wu.ac.at
But what about Web-scale queries?
E.g., retrieve all entities in LOD with the label "Axel Polleres"
Solutions
SELECT DISTINCT ?x
WHERE { ?x rdfs:label "Axel Polleres" }
Let's fish in our Linked Data ecosystem
A) Federated Queries
1. Get a list of potential SPARQL Endpoints
datahub.io, LOV, other catalogs
2. Query each SPARQL Endpoint
Problems:
Many SPARQL Endpoints have low availability
The Web of Data ecosystem
http://sparqles.ai.wu.ac.at
A) Federated Queries
1. Get a list of potential SPARQL Endpoints
datahub.io, LOV, other catalogs
2. Query each SPARQL Endpoint
Problems:
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeouts/result limits)
Moreover, it can be tricky with complex queries (joins) due to intermediary results, delays, etc.
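The federation workflow above can be sketched in a few lines. This is a toy: `fetch()` is stubbed with canned data standing in for an HTTP SPARQL request, and the endpoint URLs and results are illustrative, but it shows the merge-with-DISTINCT step and why low availability must be handled explicitly:

```python
# Federation sketch: send the same query to several endpoints, merge answers.
ENDPOINTS = {
    "http://dbpedia.org/sparql": [{"x": "http://dbpedia.org/resource/Axel_Polleres"}],
    "http://example.org/sparql": [],          # endpoint with no matches
    "http://down.example/sparql": None,       # unavailable endpoint (common in practice)
}

def fetch(endpoint):
    """Stand-in for an HTTP SPARQL request; returns None on failure."""
    return ENDPOINTS[endpoint]

def federated_select(endpoints):
    seen, merged, failures = set(), [], 0
    for ep in endpoints:
        rows = fetch(ep)
        if rows is None:            # low availability is the norm, not the exception
            failures += 1
            continue
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:     # DISTINCT across endpoints
                seen.add(key)
                merged.append(row)
    return merged, failures

results, failed = federated_select(ENDPOINTS)
print(results, failed)
```

A real federated engine additionally has to ship joins across endpoints, which is where the intermediary-result and delay problems mentioned above bite.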
The Web of Data ecosystem
B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter for the results you are interested in
Problems:
You need some initial seed
DBpedia could be a good start
It's slow (fetching many documents)
Where should I start for unbounded queries?
?x rdfs:label "Axel Polleres"
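Follow-your-nose is essentially a crawl: dereference a seed IRI, collect its triples, and enqueue every object IRI that can itself be dereferenced. A sketch over an in-memory "web" (the `ex:` dataset here is invented for illustration; a real client would do HTTP GETs):

```python
from collections import deque

# Each IRI "dereferences" to a small set of triples describing it.
WEB = {
    "ex:seed":   [("ex:seed", "ex:knows", "ex:axel")],
    "ex:axel":   [("ex:axel", "rdfs:label", "Axel Polleres"),
                  ("ex:axel", "ex:colleague", "ex:javier")],
    "ex:javier": [("ex:javier", "rdfs:label", "Javier Fernandez")],
}

def follow_your_nose(seed, predicate, value):
    """Crawl from seed, dereferencing each new IRI once; collect subjects
    whose description contains the wanted (predicate, value) pair."""
    queue, visited, matches = deque([seed]), set(), []
    while queue:
        iri = queue.popleft()
        if iri in visited or iri not in WEB:
            continue
        visited.add(iri)
        for s, p, o in WEB[iri]:
            if p == predicate and o == value:
                matches.append(s)
            if o in WEB:              # follow links to other dereferenceable IRIs
                queue.append(o)
    return matches

print(follow_your_nose("ex:seed", "rdfs:label", "Axel Polleres"))  # ['ex:axel']
```

The seed-dependence is visible in the code: an unbounded pattern like `?x rdfs:label "Axel Polleres"` only finds matches reachable by links from wherever you started.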
The Web of Data ecosystem
C) Use the RDF dumps by yourself
1. Crawl the Web of Data
Probably start with datahub.io, LOV, other catalogs
2. Download the datasets
You'd better have some free space on your machine
3. Index the datasets locally
You'd better be patient and survive the parsing errors
4. Query all the datasets
You'd better still be alive by then
Problems:
Huge resources
+ the messiness of the data
The Web of Data ecosystem
1) LOD Laundromat
Challenges
Still, you need to query 650K datasets
Of course, it does not contain all of LOD, but "a good approximation"
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (~3%) memory footprint
Very fast on basic queries (triple patterns): up to 15x faster than Virtuoso, Jena, RDF-3X
Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions
Challenges
The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently)
A Linked Data hacker toolkit
~431 Mtriples
NT: 63 GB
NT + gzip: 5 GB
HDT: 6.6 GB — slightly more, but you can query it
https://github.com/rdfhdt — C++ and Java tools
Only in the last two weeks…
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays, intermediate results, …)
A Linked Data hacker toolkit
LDF interfaces
LOD-a-lot
But what about Web-scale queries?
- flashback -
[Figure: LOD Laundromat crawls Linked Open Data and cleans each of its ~650K datasets into N-Triples (zip), plus a SPARQL endpoint for the metadata; LOD-a-lot integrates all of them into a single file.]
LOD-a-lot: 28B triples
Disk size:
HDT: 304 GB
HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query):
15.7 GB of RAM (~3% of the size)
144 seconds loading time
8 cores (2.6 GHz), 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds
LOD-a-lot (some numbers)
305 €
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
Query resolution at Web scale
Using LDF, Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
LOD-a-lot (some use cases)
subjects, predicates, objects
Identity closure:
?x owl:sameAs ?y
Graph navigations:
E.g., shortest path, random walk
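The identity-closure use case is a nice fit for union-find: owl:sameAs is symmetric and transitive, so its closure partitions IRIs into identity sets, and union-find computes those sets without materialising all derivable pairs. A sketch with invented sameAs links (the `wikidata:`/`viaf:` identifiers below are illustrative, not real data):

```python
# Union-find over owl:sameAs statements; co-identical IRIs share a root.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

sameas = [("dbpedia:Axel_Polleres", "wikidata:Q123"),
          ("wikidata:Q123", "viaf:456"),
          ("dbpedia:Vienna", "geonames:2761369")]
for x, y in sameas:
    union(x, y)

# Transitivity: Axel's DBpedia IRI and the VIAF IRI end up co-identical.
print(find("dbpedia:Axel_Polleres") == find("viaf:456"))  # True
```

Running this over the 28B-triple file only needs a single pass over the ?x owl:sameAs ?y pattern, which is exactly the kind of scan HDT serves cheaply.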
LOD-a-lot (some use cases)
Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
More use cases
http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22
Retrieve all entities in LOD with the label "Axel Polleres"
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
… implement the use cases and help the community democratize access to LOD
low-cost access to LOD = high-impact research
Roadmap
ACKs
RDF Archiving: archiving policies
a) Independent Copies/Snapshots (IC): every version is stored as a full snapshot, e.g.
V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
b) Change-based approach (CB): only the sets of added/deleted triples between consecutive versions are stored.
c) Timestamp-based approach (TB): each triple is annotated with the versions in which it holds, e.g.
ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]
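The three policies trade storage for materialisation cost. A toy sketch (invented `ex:` triples, two versions) that stores the same history all three ways:

```python
# Two versions of a tiny dataset, as sets of triples.
v1 = {("ex:C1", "ex:hasProfessor", "ex:P1"), ("ex:S1", "ex:study", "ex:C1")}
v2 = {("ex:C1", "ex:hasProfessor", "ex:P1"), ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S3", "ex:study", "ex:C1")}

# a) IC: keep full snapshots -- instant access, storage grows per version.
ic = [v1, v2]

# b) CB: keep the first snapshot plus per-version (added, deleted) deltas.
cb = [v1, (v2 - v1, v1 - v2)]

def cb_materialise(archive, i):
    """Rebuild version i by replaying deltas over the initial snapshot."""
    current = set(archive[0])
    for added, deleted in archive[1:i + 1]:
        current = (current | added) - deleted
    return current

# c) TB: annotate each triple with the set of versions in which it holds.
tb = {}
for i, version in enumerate([v1, v2], start=1):
    for triple in version:
        tb.setdefault(triple, set()).add(i)

assert cb_materialise(cb, 1) == v2
print(tb[("ex:S3", "ex:study", "ex:C1")])  # {2}
```

IC makes version queries trivial but wastes space; CB is compact but must replay deltas; TB answers "in which versions does this triple hold?" directly, as in the example above.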
Each policy is accessed through a retrieval mediator.
Democratizing Open Data preservation/monitoring
Enhance the usability of Open Data and its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN-powered Open Data Portals
Quality assessment
Evolution tracking
Metadata
Data
The CommuniData Project
http://data.wu.ac.at/portalwatch
Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter
We are currently facing Big Linked Data challenges
Generation, publication and consumption
Archiving, evolution…
Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow
Compression democratizes access to Big Linked Data
= cheap, scalable consumers
low-cost access to LOD = high-impact research
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
0 Zero knowledge
1 I have just heard of RDF andor Linked Data
2 I know the basic foundations and I gave it a try
3 I often manage RDFLinked Data
PAGE 2
Knowledge of RDFLinked Data
img Nick Youngson
Linked Data Introduction
Preliminaries
Linked Data is simply about using the Web to create typed links between data from different sourcesldquo
A practical scenariohellip
computer scientists working in Vienna younger than 40
4
5
The information is already in the Webhellip but with no structure
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
The Web of Data (Semantic Web)Linking data to data
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
Javier Fernaacutendez
33
age
Javier David Fernaacutendez
WU
works
postdoctoral researcher
Vienna
is a
is located in
same as
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg
Is this the same Javier as Javier Bardem (actor)
Is ldquoWorksAtrdquo thesame as
ldquoresearchAtrdquo
I
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
Linked Data Introduction
Preliminaries
Linked Data is simply about using the Web to create typed links between data from different sourcesldquo
A practical scenariohellip
computer scientists working in Vienna younger than 40
4
5
The information is already in the Webhellip but with no structure
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
The Web of Data (Semantic Web)Linking data to data
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
Javier Fernaacutendez
33
age
Javier David Fernaacutendez
WU
works
postdoctoral researcher
Vienna
is a
is located in
same as
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg
Is this the same Javier as Javier Bardem (actor)
Is ldquoWorksAtrdquo thesame as
ldquoresearchAtrdquo
I
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
A practical scenariohellip
computer scientists working in Vienna younger than 40
4
5
The information is already in the Webhellip but with no structure
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
The Web of Data (Semantic Web)Linking data to data
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
Javier Fernaacutendez
33
age
Javier David Fernaacutendez
WU
works
postdoctoral researcher
Vienna
is a
is located in
same as
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg
Is this the same Javier as Javier Bardem (actor)
Is ldquoWorksAtrdquo thesame as
ldquoresearchAtrdquo
I
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSION
~10K datasets organized into 9 domains, covering many and varied knowledge fields
150B statements, including entity descriptions and (inter-/intra-dataset) links between them
>500 live endpoints serving this data
http://lod-cloud.net
http://stats.lod2.eu
http://sparqles.ai.wu.ac.at
Big Semantic Data
The greatness of Linked Open Data
13
> 150B triples
1K-6K datasets
> 557 SPARQL Endpoints
http://lod-cloud.net | https://datahub.io | http://stats.lod2.eu | http://sparqles.ai.wu.ac.at
But what about Web-scale queries?
E.g. retrieve all entities in LOD with the label "Axel Polleres"
Solutions
14
select distinct ?x
where { ?x rdfs:label "Axel Polleres" }
15
Let's fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
http://sparqles.ai.wu.ac.at
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeouts, result limits)
Moreover, it can be tricky with complex queries (joins) due to intermediate results, delays, etc.
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
It's slow (fetching many documents)
Where should I start for unbounded queries?
?x rdfs:label "Axel Polleres"
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl the Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You'd better have some free space on your machine
3 Index the datasets locally
You'd better be patient and survive parsing errors
4 Query all datasets
You'd better still be alive by then
Problems
Huge resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still, you need to query 650K datasets
Of course, it does not contain all LOD, but "a good approximation"
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (~3%) memory footprint
Very fast on basic queries (triple patterns): ×1.5 faster than Virtuoso, Jena, RDF-3X
Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions
Challenges
The publisher pays a small one-time overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently)
21
A Linked Data hacker toolkit
~431M triples
N-Triples: 63 GB
N-Triples + gzip: 5 GB
HDT: 6.6 GB
Slightly more than gzip, but you can query it!
https://github.com/rdfhdt: C++ and Java tools
Only in the last two weeks…
HDT-cpp
HDT-java
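The core trick behind HDT's compactness can be sketched in a few lines: a dictionary maps every distinct term to an integer ID, and triples become small, sorted ID tuples. The real format additionally compresses both components with succinct data structures; this sketch only illustrates the dictionary-encoding idea, with made-up terms:

```python
def encode(triples):
    """Dictionary-encode triples: each distinct term is stored once;
    triples become compact integer tuples."""
    dictionary = {}  # term -> integer ID

    def term_id(term):
        if term not in dictionary:
            dictionary[term] = len(dictionary) + 1
        return dictionary[term]

    id_triples = sorted((term_id(s), term_id(p), term_id(o)) for s, p, o in triples)
    return dictionary, id_triples

def decode(dictionary, id_triples):
    """Invert the dictionary to recover the original triples."""
    inverse = {i: term for term, i in dictionary.items()}
    return {(inverse[s], inverse[p], inverse[o]) for s, p, o in id_triples}

triples = {
    ("ex:Javier", "foaf:knows", "ex:Axel"),
    ("ex:Javier", "foaf:name", "Javier Fernandez"),
    ("ex:Axel", "foaf:name", "Axel Polleres"),
}
dictionary, id_triples = encode(triples)
```

Long IRIs repeated across millions of triples are stored once; sorting the ID triples is what later makes triple-pattern lookups fast.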
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays, intermediate results, …)
23
A Linked Data hacker toolkit
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries?
- flashback -
26
[Diagram: the LOD Laundromat crawls the Linked Open Data cloud into 650K datasets, each republished as zipped N-Triples plus a SPARQL endpoint for metadata; LOD-a-lot packs them into a single file]
LOD-a-lot: 28B triples
Disk size:
HDT: 304 GB
HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query):
15.7 GB of RAM (3% of the size)
144 seconds loading time
(8 cores at 2.6 GHz, 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS)
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305 €
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)
28
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
?x owl:sameAs ?y
Graph navigations
E.g. shortest path, random walk
30
LOD-a-lot (some use cases)
Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
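The identity-closure use case above amounts to computing, for a given IRI, everything reachable over owl:sameAs links (which are symmetric and transitive). A minimal BFS sketch, over hypothetical prefixed IRIs:

```python
from collections import defaultdict, deque

def same_as_closure(pairs, start):
    """All terms transitively connected to `start` via owl:sameAs statements."""
    adj = defaultdict(set)
    for x, y in pairs:
        adj[x].add(y)
        adj[y].add(x)  # owl:sameAs is symmetric
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

# Hypothetical owl:sameAs statements collected from several datasets
pairs = [("ex:axel", "db:Axel_Polleres"),
         ("db:Axel_Polleres", "wd:axel"),
         ("ex:other", "ex:unrelated")]
```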
More use cases
http://hdt.lodlabs.vu.nl/triple?object=%22Axel%20Polleres%22
Retrieve all entities in LOD with the label "Axel Polleres"
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
… implement the use cases and help the community democratize access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving: Archiving policies

a) Independent Copies/Snapshots (IC): store every version in full.
V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

b) Change-based approach (CB): store V1 plus the deltas between consecutive versions.
V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
V1→V2: added ex:S3 ex:study ex:C1 ; deleted ex:S2 ex:study ex:C1
V2→V3: added ex:C1 ex:hasProfessor ex:P2 and ex:C1 ex:hasProfessor ex:S2 ; deleted ex:C1 ex:hasProfessor ex:P1

c) Timestamp-based approach (TB): annotate each triple with the versions in which it holds.
ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]

In all three policies, queries over versions go through a RETRIEVAL MEDIATOR.
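With versions represented as sets of triples, the change-based (CB) policy above is plain set arithmetic. A sketch using the V1/V2 example from this slide:

```python
def diff(old, new):
    """Change-based (CB) archiving stores only what a version adds or deletes."""
    return {"added": new - old, "deleted": old - new}

def apply_delta(version, delta):
    """Reconstruct the next version from the previous one plus its delta."""
    return (version | delta["added"]) - delta["deleted"]

V1 = {("ex:C1", "ex:hasProfessor", "ex:P1"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S2", "ex:study", "ex:C1")}
V2 = {("ex:C1", "ex:hasProfessor", "ex:P1"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S3", "ex:study", "ex:C1")}

delta = diff(V1, V2)
```

IC trades space for direct access to any version; CB is compact but must replay deltas, as `apply_delta` does.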
Democratizing Open Data preservationmonitoring
Enhance the usability of Open Data and its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN-powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
http://data.wu.ac.at/portalwatch
Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter
We are currently facing Big Linked Data challenges
Generation, publication and consumption
Archiving, evolution…
Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
5
The information is already in the Webhellip but with no structure
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
The Web of Data (Semantic Web)Linking data to data
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
Javier Fernaacutendez
33
age
Javier David Fernaacutendez
WU
works
postdoctoral researcher
Vienna
is a
is located in
same as
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg
Is this the same Javier as Javier Bardem (actor)
Is ldquoWorksAtrdquo thesame as
ldquoresearchAtrdquo
I
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
The information is already in the Webhellip but with no structure
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
The Web of Data (Semantic Web)Linking data to data
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
Javier Fernaacutendez
33
age
Javier David Fernaacutendez
WU
works
postdoctoral researcher
Vienna
is a
is located in
same as
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg
Is this the same Javier as Javier Bardem (actor)
Is ldquoWorksAtrdquo thesame as
ldquoresearchAtrdquo
I
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
The Web of Data (Semantic Web)Linking data to data
httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV
hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip
hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip
Javier Fernaacutendez
33
age
Javier David Fernaacutendez
WU
works
postdoctoral researcher
Vienna
is a
is located in
same as
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg
Is this the same Javier as Javier Bardem (actor)
Is ldquoWorksAtrdquo thesame as
ldquoresearchAtrdquo
I
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving: Archiving policies

V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

a) Independent Copies/Snapshots (IC): each version V1, V2, V3 is stored in full.
b) Change-based approach (CB): V1 is stored in full, later versions as deltas (V1→V2: add ex:S3 ex:study ex:C1, delete ex:S2 ex:study ex:C1; V2→V3: add ex:C1 ex:hasProfessor ex:P2 and ex:C1 ex:hasProfessor ex:S2, delete ex:C1 ex:hasProfessor ex:P1).
c) Timestamp-based approach (TB): each triple is stored once, annotated with its versions:
ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]

A RETRIEVAL MEDIATOR on top of each policy answers queries across versions.
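The timestamp-based (TB) policy can be sketched directly on the example data: store each triple once with its version annotations, then materialize any version, or derive a change-based delta, by filtering. A toy sketch; `materialize` and `delta` are hypothetical helper names, not from any archiving system.

```python
# TB storage: triple -> set of versions in which it holds.
annotated = {
    ("ex:C1", "ex:hasProfessor", "ex:P1"): {1, 2},
    ("ex:C1", "ex:hasProfessor", "ex:P2"): {3},
    ("ex:C1", "ex:hasProfessor", "ex:S2"): {3},
    ("ex:S1", "ex:study", "ex:C1"): {1, 2, 3},
    ("ex:S2", "ex:study", "ex:C1"): {1},
    ("ex:S3", "ex:study", "ex:C1"): {2, 3},
}

def materialize(version):
    """Version materialization: all triples valid in `version`."""
    return {t for t, versions in annotated.items() if version in versions}

def delta(v_from, v_to):
    """Change-based (CB) view derived from TB annotations: (added, deleted)."""
    a, b = materialize(v_from), materialize(v_to)
    return b - a, a - b

v1 = materialize(1)
added, deleted = delta(1, 2)
```

The trade-off the three policies make is visible here: IC is fast to materialize but redundant, CB is compact but needs delta replay, and TB sits in between with per-triple annotations.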
Democratizing Open Data preservation/monitoring
Enhance the usability of Open Data and its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN-powered Open Data Portals
Quality assessment
Evolution tracking
Metadata
Data
The CommuniData Project
http://data.wu.ac.at/portalwatch
Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow
Compression democratizes the access to Big Linked Data
= Cheap, scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
8
Example
Javier isA Person .
Javier hasName "Javier Fernandez" .
Javier worksAt WU .
Javier knows tim .
Javier knows axel .
axel hasName "Axel Polleres" .
tim hasName "Tim Berners-Lee" .
tim hasCreated http://linkeddata.org
Is this the same Javier as Javier Bardem (actor)?
Is "worksAt" the same as "researchAt"?
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
<http://Fernandez.net/Javier> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/name> "Javier Fernandez" .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.wu.ac.at> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://tim.org/foaf.rdf#tim> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://polleres.net/me> .
<http://polleres.net/me> <http://xmlns.com/foaf/0.1/name> "Axel Polleres" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/name> "Tim Berners-Lee" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/made> <http://linkeddata.org> .
URIs × URIs × (URIs ∪ Literals)
[Figure: RDF graph with nodes <http://Fernandez.net/Javier> and <http://polleres.net/me>, labelled "Javier Fernandez" and "Axel Polleres", both of rdf:type foaf:Person]
Formal Query SPARQL
Similar to SQL
SELECT ?people ?name
WHERE {
  ?people foaf:knows <http://polleres.net/me> .
  ?people foaf:name ?name .
}

Result (?people, ?name):
<http://Fernandez.net/Javier>, "Javier Fernandez"

10
[Figure: the matching subgraph: <http://Fernandez.net/Javier> foaf:knows <http://polleres.net/me>; its foaf:name is "Javier Fernandez"; rdf:type foaf:Person]
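How the two triple patterns of the query join on ?people can be sketched as a nested-loop join over a toy in-memory graph; a simplification of what a real SPARQL engine does, with `match` as a hypothetical helper.

```python
# Toy graph with the example triples (IRIs and literals as plain strings).
graph = [
    ("<http://Fernandez.net/Javier>", "foaf:knows", "<http://polleres.net/me>"),
    ("<http://Fernandez.net/Javier>", "foaf:name", '"Javier Fernandez"'),
    ("<http://polleres.net/me>", "foaf:name", '"Axel Polleres"'),
]

def match(pattern):
    """Yield variable bindings (terms starting with '?') for one pattern."""
    for triple in graph:
        binding = {}
        if all(term == value or (term.startswith("?") and
                                 binding.setdefault(term, value) == value)
               for term, value in zip(pattern, triple)):
            yield binding

# Nested-loop join of the two patterns, merging compatible bindings.
results = [
    {**b1, **b2}
    for b1 in match(("?people", "foaf:knows", "<http://polleres.net/me>"))
    for b2 in match(("?people", "foaf:name", "?name"))
    if b1["?people"] == b2["?people"]
]
```

Real engines replace the nested loop with index lookups and smarter join orders, but the binding-and-merge logic is the same.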
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSION
PAGE 11
~10K datasets organized into 9 domains, covering many and varied knowledge fields
150B statements, including entity descriptions and (inter/intra-dataset) links between them
>500 live endpoints serving this data
http://lod-cloud.net
http://stats.lod2.eu
http://sparqles.ai.wu.ac.at
Big Semantic Data
The greatness of Linked Open Data
13
> 150B triples
1K-6K datasets
> 557 SPARQL Endpoints
http://lod-cloud.net
https://datahub.io
http://stats.lod2.eu
http://sparqles.ai.wu.ac.at
But what about Web-scale queries?
E.g. retrieve all entities in LOD with the label "Axel Polleres"
Solutions
14
SELECT DISTINCT ?x
WHERE { ?x rdfs:label "Axel Polleres" }
15
Let's fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahub.io, LOV, other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
http://sparqles.ai.wu.ac.at
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahub.io, LOV, other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeouts, result limits)
Moreover, it can be tricky with complex queries (joins) due to intermediate results, delays, etc.
17
The Web of Data Eco System
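The federation steps above can be sketched as follows. Everything here is a stand-in: the endpoint URLs are illustrative, the lambdas replace real SPARQL HTTP requests, and `None` models an unavailable endpoint (in practice, a timeout or HTTP error).

```python
# Toy federated query: send the same query to every catalogued endpoint,
# tolerating the (common) unavailable ones, and union the answers.

endpoints = {
    "http://example.org/sparql-a": lambda q: ["ex:Axel"],
    "http://example.org/sparql-b": None,            # endpoint down
    "http://example.org/sparql-c": lambda q: ["ex:Axel2"],
}

def federated(query_string):
    results, failures = set(), []
    for url, endpoint in endpoints.items():
        if endpoint is None:          # in practice: catch timeout / HTTP error
            failures.append(url)
            continue
        results.update(endpoint(query_string))
    return results, failures

results, failures = federated('SELECT ?x WHERE { ?x rdfs:label "Axel Polleres" }')
```

Even this sketch shows the weakness the slide points out: completeness depends on every endpoint being up, and a real engine must also ship intermediate join results between endpoints.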
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
It's slow (fetching many documents)
Where should I start for unbounded queries
?x rdfs:label "Axel Polleres"
18
The Web of Data Eco System
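Follow-your-nose can be sketched as a breadth-first traversal: dereference a seed IRI, collect its triples, and follow the IRIs found in the data. `WEB` is a toy stand-in for HTTP dereferencing; all names are illustrative.

```python
from collections import deque

# Toy "Web": IRI -> triples served when that IRI is dereferenced.
WEB = {
    "ex:seed": [("ex:seed", "foaf:knows", "ex:axel")],
    "ex:axel": [("ex:axel", "rdfs:label", "Axel Polleres")],
}

def follow_your_nose(seed, max_docs=100):
    seen, triples = set(), []
    queue = deque([seed])
    while queue and len(seen) < max_docs:
        iri = queue.popleft()
        if iri in seen:
            continue
        seen.add(iri)
        for t in WEB.get(iri, []):       # dereference (fetch the document)
            triples.append(t)
            for term in (t[0], t[2]):    # follow IRIs mentioned in the data
                if term in WEB and term not in seen:
                    queue.append(term)
    return triples

triples = follow_your_nose("ex:seed")
labels = [t for t in triples if t[1] == "rdfs:label"]
```

The sketch makes both slide problems concrete: without a seed the traversal never starts, and an unbounded pattern like `?x rdfs:label "Axel Polleres"` would require crawling everything reachable.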
C) Use the RDF dumps by yourself
1 Crawl the Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You'd better have some free space on your machine
3 Index the datasets locally
You'd better be patient and survive parsing errors
4 Query all datasets
You'd better still be alive by then
Problems
Huge resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still, you need to query 650K datasets
Of course it does not contain all LOD, but "a good approximation"
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (~3%) memory footprint
Very fast on basic queries (triple patterns): 15x faster than Virtuoso, Jena, RDF3X
Supports full SPARQL as the compressed back-end store of Jena, with efficiency on the same scale as current, more optimized solutions
Challenges
The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently)
21
A Linked Data hacker toolkit
~431 Mtriples
N-Triples: 63 GB
N-Triples + gzip: 5 GB
HDT: 6.6 GB
Slightly more, but you can query
https://github.com/rdfhdt (C++ and Java tools)
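The core trick behind HDT's compactness can be sketched as dictionary encoding: map every term to an integer ID once, so the triples component stores only ID tuples, which then compress and index well. This is only the idea, not the actual HDT format (which adds bitmap-compressed tree structures on top); `build_hdt_like` and `decode` are illustrative names.

```python
# Dictionary-encode a set of triples: terms -> integer IDs, triples -> ID tuples.

def build_hdt_like(triples):
    terms = sorted({t for triple in triples for t in triple})
    dictionary = {term: i + 1 for i, term in enumerate(terms)}  # IDs from 1
    id_triples = sorted(tuple(dictionary[t] for t in triple) for triple in triples)
    return dictionary, id_triples

def decode(dictionary, id_triple):
    reverse = {i: term for term, i in dictionary.items()}
    return tuple(reverse[i] for i in id_triple)

triples = [
    ("ex:Javier", "foaf:knows", "ex:Axel"),
    ("ex:Javier", "foaf:name", "Javier Fernandez"),
]
dictionary, id_triples = build_hdt_like(triples)
roundtrip = [decode(dictionary, t) for t in id_triples]
```

Because long IRIs appear many times but are stored once, and sorted ID triples share prefixes, both components shrink dramatically at Web scale, which is what makes the 28B-triple LOD-a-lot file queryable on a laptop.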
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
Quick intro to
Resource Description Framework (W3C Rec 2004)
Machine processable descriptions
Webs services protocols Persons Proteins geographyhellip
Data model Based on Triplessentences Subject Predicate Object
9
Example
lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt
URIs x URIs x (URIs U Literals)
I
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo
foafPerson
rdftype
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
Formal Query SPARQL
Similar to SQL
SELECT people name
WHERE
people foafknows lthttppolleresnetmegt
people foafname name
people
name
people name
lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo
lthttpFernandeznetJaviergtlthttppolleresnetmegt
ldquoJavier Fernandezrdquo
foafPerson
rdftype
10
lthttppolleresnetmegt
foafknows
Current RDF data
Query
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
The Web of Linked Data (2017)
BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11
~10K datasets organized into 9 domains which include many and varied knowledge fields
150B statements including entity descriptions and (interintra-dataset) links between them
gt500 live endpoints serving this data
httplod-cloudnet
httpstatslod2eu
httpsparqlesaiwuacat
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
Big Semantic Data
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still, you need to query 650K datasets
Of course it does not contain all LOD, but "a good approximation"
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (~3%) memory footprint
Very fast on basic queries (triple patterns): 1.5× faster than Virtuoso, Jena, RDF-3X
Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions
Challenges
The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
~431 Mtriples
NT: 63 GB
NT + gzip: 5 GB
HDT: 6.6 GB
Slightly more, but you can query
https://github.com/rdfhdt (C++ and Java tools)
Only in the last two weeks…
HDT-cpp
HDT-java
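The core HDT idea (greatly simplified, not the real binary format) is a term dictionary plus triples stored once as sorted ID tuples, so subject-bound patterns resolve by binary search; object- or predicate-bound patterns need the extra HDT-FoQ indexes. A sketch with hypothetical data:

```python
import bisect

# Simplified HDT-style layout: dictionary of terms -> integer IDs,
# triples stored as ID tuples sorted in SPO order.
terms = sorted(["ex:axel", "ex:javier", "rdfs:label", "Axel Polleres", "foaf:knows"])
ID = {t: i for i, t in enumerate(terms)}
triples = sorted([
    (ID["ex:axel"], ID["rdfs:label"], ID["Axel Polleres"]),
    (ID["ex:javier"], ID["foaf:knows"], ID["ex:axel"]),
])

def with_subject(s):
    """Resolve the (s ? ?) pattern by binary search on the sorted SPO array."""
    sid = ID[s]
    lo = bisect.bisect_left(triples, (sid, -1, -1))
    hi = bisect.bisect_left(triples, (sid + 1, -1, -1))
    return [(terms[a], terms[b], terms[c]) for a, b, c in triples[lo:hi]]

print(with_subject("ex:axel"))  # [('ex:axel', 'rdfs:label', 'Axel Polleres')]
```

The compression win comes from storing each term once in the dictionary and each triple once as small integers, while the data stays directly queryable.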
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays, intermediate results, …)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
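A Triple Pattern Fragments-style interface hands out one page of matches at a time plus a pointer to the next page, shifting query effort to the client. A minimal sketch with a simulated server (data and page size are made up):

```python
# Hypothetical server-side data: 5 matching triples, paged 2 at a time.
DATA = [("ex:s%d" % i, "rdfs:label", "Axel Polleres") for i in range(5)]
PAGE_SIZE = 2

def fragment(pattern, page):
    """Simulated TPF server: one page of matches + the next-page pointer."""
    matches = [t for t in DATA if t[1] == pattern[1] and t[2] == pattern[2]]
    chunk = matches[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]
    next_page = page + 1 if (page + 1) * PAGE_SIZE < len(matches) else None
    return chunk, next_page

def all_matches(pattern):
    """Client side: keep following 'next page' links until exhausted."""
    results, page = [], 0
    while page is not None:
        chunk, page = fragment(pattern, page)
        results.extend(chunk)
    return results

print(len(all_matches((None, "rdfs:label", "Axel Polleres"))))  # 5
```

Each request is cheap for the server (hence high availability), but joins over many patterns generate many such paging loops — the optimization gap noted above.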
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
[Diagram: Linked Open Data → LOD Laundromat (650K datasets as N-Triples.zip, plus a SPARQL endpoint for the metadata) → LOD-a-lot]
LOD-a-lot: 28B triples
Disk size
HDT: 304 GB
HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query)
15.7 GB of RAM (3% of the size)
14.4 seconds loading time
8 cores (2.6 GHz), 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305 €
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)
28
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
Query resolution at Web scale
Using LDF, Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
?x owl:sameAs ?y
Graph navigations
E.g. shortest path, random walk
30
LOD-a-lot (some use cases)
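The identity-closure use case — grouping all IRIs connected by owl:sameAs into equivalence classes — is a natural fit for union-find. A sketch over hypothetical sameAs links (the IRIs are invented):

```python
# owl:sameAs closure via union-find: every IRI in one class denotes the same entity.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps lookups near-constant
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Hypothetical sameAs statements harvested from LOD.
SAME_AS = [("ex:axel", "dbp:Axel_Polleres"), ("dbp:Axel_Polleres", "wd:Q123")]
for a, b in SAME_AS:
    union(a, b)

cls = {x for x in parent if find(x) == find("ex:axel")}
print(sorted(cls))  # ['dbp:Axel_Polleres', 'ex:axel', 'wd:Q123']
```

At LOD-a-lot scale this is exactly the kind of single-pass computation that becomes feasible once all 28B triples sit in one queryable file.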
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22
Retrieve all entities in LOD with the label "Axel Polleres"
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
… implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
32
ACKs
36
RDF Archiving: archiving policies

a) Independent Copies/Snapshots (IC): store each version in full
V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1
V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1
V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1

b) Change-based approach (CB): store V1 plus the deltas between versions
V1→V2: added ex:S3 ex:study ex:C1 ; deleted ex:S2 ex:study ex:C1
V2→V3: added ex:C1 ex:hasProfessor ex:P2 and ex:C1 ex:hasProfessor ex:S2 ; deleted ex:C1 ex:hasProfessor ex:P1

c) Timestamp-based approach (TB): annotate each triple with the versions in which it holds
ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]

(Each policy is accessed through a retrieval mediator.)
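The timestamp-based policy can be sketched in a few lines: store each triple once with its version set, and materialize any version on demand (using the ex: example data from this slide):

```python
# Timestamp-based (TB) archiving sketch: triple -> set of versions in which it holds.
tb = {
    ("ex:C1", "ex:hasProfessor", "ex:P1"): {1, 2},
    ("ex:C1", "ex:hasProfessor", "ex:P2"): {3},
    ("ex:C1", "ex:hasProfessor", "ex:S2"): {3},
    ("ex:S1", "ex:study", "ex:C1"): {1, 2, 3},
    ("ex:S2", "ex:study", "ex:C1"): {1},
    ("ex:S3", "ex:study", "ex:C1"): {2, 3},
}

def materialize(version):
    """Reconstruct a full snapshot for the given version."""
    return sorted(t for t, versions in tb.items() if version in versions)

v1 = materialize(1)
print(len(v1))  # 3
```

Compared with IC (fast retrieval, heavy storage) and CB (light storage, delta replay on retrieval), TB pays once per distinct triple and answers version-bound queries directly.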
Democratizing Open Data preservation/monitoring
Enhance the usability of Open Data and its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN-powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
http://data.wu.ac.at/portalwatch
Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
The greatness of Linked Open Data
13
gt 150B triples
1K-6K datasets
gt557 SPARQL Endpoints
httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
But what about Web-scale queries
Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Solutions
14
select distinct x
x rdfslabel Axel Polleres
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
15
Letrsquos fish in our Linked Data Eco System
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
16
The Web of Data Eco System
httpsparqlesaiwuacat
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
A) Federated Queries
1 Get a list of potential SPARQL Endpoints
datahubio LOV other catalogs
2 Query each SPARQL Endpoint
Problems
Many SPARQL Endpoints have low availability
SPARQL Endpoints are usually restricted (timeoutresults)
Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc
17
The Web of Data Eco System
B) Follow-your-nose
1 Follow self-descriptive IRIs and links
2 Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
Itrsquos slow (fetching many documents)
Where should I start for unbounded queries
x rdfslabel ldquoAxel Polleres
18
The Web of Data Eco System
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN-powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
http://data.wu.ac.at/portalwatch
Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving, evolution…
Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in
Problems
You need some initial seed
DBpedia could be a good start
It's slow (fetching many documents)
Where should I start for unbounded queries
?x rdfs:label "Axel Polleres"
18
The Web of Data Ecosystem
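The follow-your-nose strategy above can be sketched as a traversal loop. To keep the sketch runnable, a dict stands in for HTTP dereferencing (in reality each IRI would be fetched and its RDF parsed); all IRIs and triples are invented for illustration.

```python
# Hypothetical sketch of "follow-your-nose" Linked Data traversal.
# TOY_WEB simulates dereferencing an IRI; everything here is made up.

TOY_WEB = {  # IRI -> triples returned when dereferencing it
    "ex:axel":   [("ex:axel", "rdfs:label", '"Axel Polleres"'),
                  ("ex:axel", "ex:knows", "ex:javier")],
    "ex:javier": [("ex:javier", "rdfs:label", '"Javier Fernandez"'),
                  ("ex:javier", "ex:worksAt", "ex:wu")],
    "ex:wu":     [("ex:wu", "rdfs:label", '"WU Vienna"')],
}

def follow_your_nose(seed, max_docs=100):
    """Start from a seed IRI, dereference it, and follow discovered IRIs."""
    seen, frontier, triples = set(), [seed], []
    while frontier and len(seen) < max_docs:
        iri = frontier.pop()
        if iri in seen or iri not in TOY_WEB:
            continue
        seen.add(iri)
        for s, p, o in TOY_WEB[iri]:       # 1. follow self-descriptive IRIs
            triples.append((s, p, o))
            if not o.startswith('"'):      # follow object IRIs, not literals
                frontier.append(o)
    return triples

# 2. filter the results you are interested in
labels = [t for t in follow_your_nose("ex:axel") if t[1] == "rdfs:label"]
print(labels)
```

The sketch also makes the problems visible: without a seed the loop never starts, and every hop is one more (slow) fetch.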
C) Use the RDF dumps by yourself
1. Crawl the Web of Data
Probably start with datahub.io, LOV, or other catalogs
2. Download datasets
You'd better have some free space on your machine
3. Index the datasets locally
You'd better be patient and survive the parsing errors
4. Query all datasets
You'd better still be alive by then
Problems
Huge resources
+ Messiness of the data
19
The Web of Data Ecosystem
1) LOD Laundromat
Challenges
Still, you need to query 650K datasets
Of course it does not contain all LOD, but "a good approximation"
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with a small (~3%) memory footprint
Very fast on basic queries (triple patterns): x1.5 faster than Virtuoso, Jena, RDF3X
Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
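The core idea behind HDT's compressed querying can be sketched in miniature: a dictionary maps every RDF term to an integer ID, triples are stored sorted in ID space, and triple patterns are resolved by binary search without decompression. This is an illustrative toy, not the actual HDT binary format; the data is made up.

```python
# Toy sketch of the HDT idea: dictionary + sorted ID triples + binary search.
import bisect

class TinyHDT:
    def __init__(self, triples):
        terms = sorted({t for triple in triples for t in triple})
        self.ids = {t: i for i, t in enumerate(terms)}    # dictionary component
        self.terms = terms
        # triples component: sorted tuples of IDs (SPO order)
        self.spo = sorted(tuple(self.ids[t] for t in tr) for tr in triples)

    def search(self, s):
        """Resolve the triple pattern (s, ?, ?) directly in ID space."""
        sid = self.ids[s]
        lo = bisect.bisect_left(self.spo, (sid, -1, -1))
        hi = bisect.bisect_left(self.spo, (sid + 1, -1, -1))
        return [tuple(self.terms[i] for i in tr) for tr in self.spo[lo:hi]]

hdt = TinyHDT([
    ("ex:javier", "ex:worksAt", "ex:wu"),
    ("ex:javier", "ex:knows", "ex:axel"),
    ("ex:axel", "ex:worksAt", "ex:wu"),
])
print(hdt.search("ex:javier"))
```

The real HDT adds bitmap-compressed tries and the HDT-FoQ indexes so that all triple-pattern shapes, not only (s, ?, ?), are answered this way.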
~431 Mtriples
NT: 63 GB
NT + gzip: 5 GB
HDT: 6.6 GB
Slightly more, but you can query!
https://github.com/rdfhdt (C++ and Java tools)
Only in the last two weeks…
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays, intermediate results, …)
23
A Linked Data hacker toolkit
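The Linked Data Fragments idea can be sketched as follows: the server only answers single triple patterns, returning results in pages together with a total count, and the client does the paging and the joins. Server and data below are simulated in memory; this is an illustrative sketch of the Triple Pattern Fragments interface, not its HTTP protocol.

```python
# Sketch of the Triple Pattern Fragments idea (server simulated in memory).
PAGE_SIZE = 2
DATA = [
    ("ex:axel", "rdfs:label", '"Axel Polleres"'),
    ("ex:axel", "ex:worksAt", "ex:wu"),
    ("ex:javier", "ex:worksAt", "ex:wu"),
    ("ex:javier", "rdfs:label", '"Javier Fernandez"'),
]

def fragment(s=None, p=None, o=None, page=0):
    """Simulated TPF server: one triple pattern, paged, plus a total count."""
    matches = [t for t in DATA
               if (s is None or t[0] == s)
               and (p is None or t[1] == p)
               and (o is None or t[2] == o)]
    start = page * PAGE_SIZE
    return matches[start:start + PAGE_SIZE], len(matches)

def fetch_all(s=None, p=None, o=None):
    """Client side: keep requesting pages until the count is exhausted."""
    triples, page = [], 0
    while True:
        chunk, total = fragment(s, p, o, page)
        triples.extend(chunk)
        page += 1
        if len(triples) >= total or not chunk:
            return triples

print(fetch_all(p="ex:worksAt"))   # the client pages through the fragment
```

Keeping the server this simple is what makes the interface cheap to host; the flip side, as noted above, is that complex federated queries pay in round trips and intermediate results on the client.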
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries?
- flashback -
26
(Figure: the LOD Laundromat cleans Linked Open Data into 650K datasets, Dataset 1 … Dataset 650K, each published as zipped N-Triples, plus a SPARQL endpoint for the metadata; all of them are merged into the single LOD-a-lot file)
LOD-a-lot: 28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
15.7 GB of RAM (~3% of the size)
144 seconds loading time
8 cores (2.6 GHz), 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305 €
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
C) Use the RDF dumps by yourself
1 Crawl de Web of Data
Probably start with datahubio LOV other catalogs
2 Download datasets
You better have some free space in your machine
3 Index the datasets locally
You better are patience and survive parsing errors
4 Query all datasets
You better are alive by then
Problems
Hugh resources
+ Messiness of the data
19
The Web of Data Eco System
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
1) LOD Laundromat
Challenges
Still you need to query 650K datasets
Of course it does not contain all LOD but ldquoa good approximationrdquo
20
A Linked Data hacker toolkit
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
2) HDT
Highly compact serialization of RDF
Allows fast RDF retrieval in compressed space (without prior decompression)
Includes internal indexes to solve basic queries with small (3) memory footprint
Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X
Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions
Challenges
Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)
21
A Linked Data hacker toolkit
431 Mtriples~
63 GB
NT + gzip5 GB
HDT 66 GB
Slightly more but you can query
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
httpsgithubcomrdfhdt C++ and Java tools
Only in the last two weekshellip
HDT-cpp
HDT-java
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
3) Linked Data Fragments
Challenges
Still room for optimization for complex federated queries (delays intermediate results hellip)
23
A Linked Data hacker toolkit
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
PAGE 24
LDF interfaces
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
LOD-a-lot
25
But what about Web-scale queries
- flashback -
26
LOD
Laundromat
Dataset 1
N-Triples (zip)
Dataset 650K
N-Triples (zip)
Linked Open Data
SPARQL endpoint
(metadata)
LOD-a-lot
LOD-a-lot28B triples
Disk size
HDT 304 GB
HDT-FoQ (additional indexes) 133 GB
Memory footprint (to query)
157 GB of RAM (3 of the size)
144 seconds loading time
8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305euro
(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)
28
LOD-a-lot
httpsdatahubiodatasetlod-a-lot
httppurlorgHDTlod-a-lot
Query resolution at Web scale
Using LDF Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects predicates objects
Identity closure
x owlsameAs y
Graph navigations
Eg shortest path random walk
30
LOD-a-lot (some use cases)
Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017
More use cases
httphdtlodlabsvunltripleobject=22Axel20Polleres22
Retrieve all entities in LOD with the label ldquoAxel Polleresldquo
Update LOD-a-lot regularly
More and newer datasets from the LOD Cloud
Keep named graphs with the provenance of each triple
Currently supported only via LOD Laundromat
hellip implement the use cases and help the community to democratize the access to LOD
low-cost access to LOD = high-impact research
Roadmap
21
32
ACKs
36
RDF Archiving Archiving policies
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1
V2 V3
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
V1
exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1
exS3 exstudy exC1
exS2 exstudy exC1
exC1 exhasProfessor exP1
exC1 exhasProfessor exP2 exC1 exhasProfessor exS2
V12
3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]
a) Independent CopiesSnapshots (IC)
b) Change-based approach (CB)
c) Timestamp-based approach (TB)
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
RETRIEVAL MEDIATOR
Democratizing Open Data preservationmonitoring
Enhance usability of Open Data and to enhance its accessibility for non-expert users
Deep search and re-usable visualization components
Integrate Open Data support into the online discussion and Web Intelligence platforms
OPEN DATA PORTAL WATCH
Vadim Savenkov
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
The CommuniData Project
httpdatawuacatportalwatch
Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter
We are currently facing Big Linked Data challenges
Generation publication and consumption
Archiving evolutionhellip
Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow
Compression democratizes the access to Big Linked Data
= Cheap scalable consumers
low-cost access to LOD = high-impact research
PAGE 40
Take-home messages
Thank you
TOP-K shortest path
Vadim Savenkov
26
[Figure: LOD Laundromat crawls and cleans Linked Open Data (Dataset 1 … Dataset 650K), republishing each dataset as zipped N-Triples and exposing a SPARQL endpoint over the metadata; LOD-a-lot integrates all of them into a single file of 28B triples]
Disk size
HDT: 304 GB
HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query)
15.7 GB of RAM (3% of the size)
144 seconds loading time
8 cores (2.6 GHz), 32 GB RAM, SATA HDD, Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds
27
LOD-a-lot (some numbers)
305 €
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)
28
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
Query resolution at Web scale
Using LDF, Jena
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
29
LOD-a-lot (some use cases)
subjects, predicates, objects
Identity closure
?x owl:sameAs ?y
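Computing the identity closure means grouping all terms connected by owl:sameAs links into equivalence classes; union-find does this in near-linear time. A hedged sketch under made-up IRIs (the real closure over LOD-a-lot runs over billions of sameAs statements):

```python
# Union-find over owl:sameAs pairs: after all unions, two terms share
# a representative iff they are (transitively) declared identical.
sameas_pairs = [
    ("dbpedia:Axel_Polleres", "wikidata:Q123"),  # hypothetical links
    ("wikidata:Q123", "wu:axel_polleres"),
    ("dbpedia:Vienna", "wikidata:Q1741"),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:                  # path halving keeps trees flat
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in sameas_pairs:
    union(a, b)

classes = {}
for term in list(parent):
    classes.setdefault(find(term), set()).add(term)
print(list(classes.values()))
```

Here the first three terms collapse into one identity class, while the Vienna pair forms a separate one.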
Graph navigations
E.g. shortest path, random walk
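Graph navigation over RDF treats each triple as an edge. A minimal sketch of BFS shortest path over a tiny, made-up graph (a random walk would instead sample a neighbor at each step); edges are taken as bidirectional for illustration:

```python
from collections import deque

# Each triple (s, p, o) contributes an edge s <-> o.
triples = [
    ("ex:Javier", "ex:knows", "ex:Axel"),
    ("ex:Axel", "ex:knows", "ex:Tim"),
    ("ex:Tim", "ex:created", "ex:LinkedData"),
]

neighbors = {}
for s, _, o in triples:
    neighbors.setdefault(s, set()).add(o)
    neighbors.setdefault(o, set()).add(s)

def shortest_path(src, dst):
    """Plain BFS; returns the node sequence or None if unreachable."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("ex:Javier", "ex:LinkedData"))
# -> ['ex:Javier', 'ex:Axel', 'ex:Tim', 'ex:LinkedData']
```

Running such navigations directly on the compressed HDT representation is what makes them feasible at the 28B-triple scale.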
30
LOD-a-lot (some use cases)
Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
More use cases
http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22
Retrieve all entities in LOD with the label "Axel Polleres"
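The URL above resolves a triple pattern (?s, ?p, "Axel Polleres") against LOD-a-lot. A minimal in-memory version of such pattern matching (the real service answers it over the compressed HDT file; the sample triples are made up):

```python
# Triple-pattern lookup: None acts as a wildcard, as in an LDF request.
triples = [
    ("ex:axel", "rdfs:label", "Axel Polleres"),
    ("ex:axel", "ex:worksAt", "ex:WU"),
    ("ex:javier", "rdfs:label", "Javier Fernandez"),
]

def search(s=None, p=None, o=None):
    """Return all triples matching the (s, p, o) pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(search(o="Axel Polleres"))  # all triples whose object is that label
```

Triple-pattern resolution is exactly the primitive HDT indexes support, and the building block from which LDF clients assemble full SPARQL answers.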