
Democratizing Big Semantic Data management

… or how to query a labelled graph with 28 billion edges on a standard laptop

Javier D. Fernández

26th September 2017

WECOS workshop, CSH Vienna

Knowledge of RDF/Linked Data

0. Zero knowledge
1. I have just heard of RDF and/or Linked Data
2. I know the basic foundations and I gave it a try
3. I often manage RDF/Linked Data

(img: Nick Youngson)

Linked Data Introduction

Preliminaries

"Linked Data is simply about using the Web to create typed links between data from different sources."

A practical scenario…

Computer scientists working in Vienna, younger than 40.


The information is already on the Web… but with no structure

https://www.wu.ac.at/en/infobiz/team/fernandez   http://myPersonalWebCV

… Javier Fernández … 33 years old …

… Javier David Fernandez … WU (Vienna University of Economics and Business) … is a postdoctoral researcher …

The Web of Data (Semantic Web): linking data to data

https://www.wu.ac.at/en/infobiz/team/fernandez   http://myPersonalWebCV

… Javier Fernández … 33 years old …

… Javier David Fernandez … WU (Vienna University of Economics and Business) … is a postdoctoral researcher …

[Graph: Javier Fernández -age-> 33; Javier David Fernández -works-> WU; Javier David Fernández -is a-> postdoctoral researcher; WU -is located in-> Vienna; Javier Fernández -same as-> Javier David Fernández]

Quick intro to RDF

Resource Description Framework (W3C Rec. 2004)

Machine-processable descriptions: Web services, protocols, persons, proteins, geography…

Data model based on triples/sentences: Subject, Predicate, Object

Example:

Javier isA Person
Javier hasName "Javier Fernandez"
Javier worksAt WU
Javier knows tim
Javier knows axel
axel hasName "Axel Polleres"
tim hasName "Tim Berners-Lee"
tim hasCreated http://linkeddata.org

Is this the same Javier as Javier Bardem (the actor)?

Is "worksAt" the same as "researchAt"?

Quick intro to RDF

Resource Description Framework (W3C Rec. 2004)

Machine-processable descriptions: Web services, protocols, persons, proteins, geography…

Data model based on triples/sentences: Subject, Predicate, Object

Example:

<http://Fernandez.net/Javier> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/name> "Javier Fernandez" .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.wu.ac.at> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://tim.org/foaf.rdf#tim> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://polleres.net/me> .
<http://polleres.net/me> <http://xmlns.com/foaf/0.1/name> "Axel Polleres" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/name> "Tim Berners-Lee" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/made> <http://linkeddata.org> .

URIs × URIs × (URIs ∪ Literals)


[Graph: <http://Fernandez.net/Javier> -foaf:knows-> <http://polleres.net/me>; with foaf:name "Javier Fernandez" and "Axel Polleres"; both rdf:type foaf:Person]

Formal query: SPARQL

Similar to SQL:

SELECT ?people ?name
WHERE {
  ?people foaf:knows <http://polleres.net/me> .
  ?people foaf:name ?name .
}

Result: ?people = <http://Fernandez.net/Javier>, ?name = "Javier Fernandez"

[Graph: <http://Fernandez.net/Javier> -foaf:knows-> <http://polleres.net/me>, -foaf:name-> "Javier Fernandez", rdf:type foaf:Person]
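The mechanics behind this query can be sketched in a few lines: each triple pattern yields variable bindings, and the two patterns are joined on their shared variable. This is a toy illustration, not the talk's software; the data and variable convention mirror the example above.

```python
# Minimal sketch of SPARQL basic-graph-pattern evaluation over an
# in-memory triple list: pattern matching plus a nested-loop join.
FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    ("http://Fernandez.net/Javier", FOAF + "knows", "http://polleres.net/me"),
    ("http://Fernandez.net/Javier", FOAF + "name", "Javier Fernandez"),
    ("http://polleres.net/me", FOAF + "name", "Axel Polleres"),
]

def match(pattern):
    """Yield one binding dict per triple matching the pattern; '?x' is a variable."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding

def join(left, right):
    """Nested-loop join of two binding lists on their shared variables."""
    for a in left:
        for b in right:
            if all(a[v] == b[v] for v in a.keys() & b.keys()):
                yield {**a, **b}

# ?people foaf:knows <http://polleres.net/me> .  ?people foaf:name ?name .
p1 = list(match(("?people", FOAF + "knows", "http://polleres.net/me")))
p2 = list(match(("?people", FOAF + "name", "?name")))
results = list(join(p1, p2))
print(results)  # one row: Javier and his name
```

Real engines replace the nested loop with index lookups and join reordering, but the binding-and-join structure is the same.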

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSION

~10K datasets organized into 9 domains, covering many and varied knowledge fields

150B statements, including entity descriptions and (inter-/intra-dataset) links between them

>500 live endpoints serving this data

http://lod-cloud.net
http://stats.lod2.eu
http://sparqles.ai.wu.ac.at

Big Semantic Data

The greatness of Linked Open Data


>150B triples

1K-6K datasets

>557 SPARQL Endpoints

http://lod-cloud.net, https://datahub.io, http://stats.lod2.eu, http://sparqles.ai.wu.ac.at

But what about Web-scale queries?

E.g. retrieve all entities in LOD with the label "Axel Polleres"

Solutions?

select distinct ?x
where { ?x rdfs:label "Axel Polleres" }

Let's fish in our Linked Data Eco System


The Web of Data Eco System

http://sparqles.ai.wu.ac.at

A) Federated Queries

1. Get a list of potential SPARQL Endpoints
   (datahub.io, LOV, other catalogs)
2. Query each SPARQL Endpoint

Problems:

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeout/number of results)

Moreover, it can be tricky with complex queries (joins) due to intermediate results, delays, etc.
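A naive federation loop can be sketched as below. The endpoints here are stubbed local functions (no network, hypothetical names), so the availability problem is simulated: each endpoint gets a per-request timeout, and slow or dead ones are simply skipped.

```python
# Sketch of naive federation: send one query to every known endpoint,
# with a timeout per endpoint; unavailable endpoints are skipped.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def endpoint_dbpedia(query):
    # Stub standing in for a responsive SPARQL endpoint.
    return [{"x": "http://dbpedia.org/resource/Axel_Polleres"}]

def endpoint_dead(query):
    # Stub standing in for a low-availability endpoint.
    time.sleep(2)
    return []

ENDPOINTS = {"dbpedia": endpoint_dbpedia, "dead": endpoint_dead}

def federate(query, timeout=0.5):
    results, skipped = [], []
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in ENDPOINTS.items()}
        for name, fut in futures.items():
            try:
                results.extend(fut.result(timeout=timeout))
            except TimeoutError:
                skipped.append(name)  # low availability: move on
    return results, skipped

rows, skipped = federate('select distinct ?x where { ?x rdfs:label "Axel Polleres" }')
print(rows, skipped)  # the dead endpoint is skipped, the live one answers
```

Note this only covers availability; the harder federation problems the slide mentions (joins across endpoints, intermediate result shipping) need a query planner, not just timeouts.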


B) Follow-your-nose

1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in

Problems:

You need some initial seed (DBpedia could be a good start)

It's slow (fetching many documents)

Where should I start for unbounded queries such as ?x rdfs:label "Axel Polleres"?


C) Use the RDF dumps by yourself

1. Crawl the Web of Data
   (probably starting with datahub.io, LOV, other catalogs)
2. Download the datasets
   (you'd better have some free space on your machine)
3. Index the datasets locally
   (you'd better be patient and survive parsing errors)
4. Query all datasets
   (you'd better be alive by then)

Problems:

Huge resources + messiness of the data


1) LOD Laundromat

Challenges:

You still need to query 650K datasets

Of course it does not contain all of LOD, but "a good approximation"

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with a small (3%) memory footprint

Very fast on basic queries (triple patterns): up to 15x faster than Virtuoso, Jena, RDF-3X

Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions

Challenges:

The publisher pays a small overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

A Linked Data hacker toolkit

~431 Mtriples

N-Triples: 63 GB
N-Triples + gzip: 5 GB
HDT: 6.6 GB (slightly more, but you can query it)

https://github.com/rdfhdt (C++ and Java tools)
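The core HDT idea can be illustrated in miniature: a dictionary maps every distinct IRI/literal to a small integer ID, and the triples are stored as sorted ID tuples, so triple patterns are resolved by binary search without ever decompressing. This is a toy sketch of that layout, not the real HDT format (which additionally bit-packs and front-codes the dictionary).

```python
# Toy sketch of HDT's dictionary + triples split: long strings become
# integer IDs, and an (S,?,?) pattern is answered with two binary searches.
from bisect import bisect_left, bisect_right

triples = [
    ("http://Fernandez.net/Javier", "http://xmlns.com/foaf/0.1/knows", "http://polleres.net/me"),
    ("http://Fernandez.net/Javier", "http://xmlns.com/foaf/0.1/name", "Javier Fernandez"),
    ("http://polleres.net/me", "http://xmlns.com/foaf/0.1/name", "Axel Polleres"),
]

# Dictionary component: every distinct term gets one integer ID.
terms = sorted({t for triple in triples for t in triple})
term_id = {t: i for i, t in enumerate(terms)}

# Triples component: ID tuples sorted by (subject, predicate, object),
# so all triples of a given subject form one contiguous run.
encoded = sorted((term_id[s], term_id[p], term_id[o]) for s, p, o in triples)
subjects = [s for s, _, _ in encoded]

def search_by_subject(subject_iri):
    """Resolve the triple pattern (S, ?, ?) on the sorted ID array."""
    sid = term_id[subject_iri]
    lo, hi = bisect_left(subjects, sid), bisect_right(subjects, sid)
    return [(terms[s], terms[p], terms[o]) for s, p, o in encoded[lo:hi]]

print(search_by_subject("http://Fernandez.net/Javier"))  # two triples
```

Because each IRI string is stored once and triples shrink to three small integers, the structure is both the compressed archive and the index, which is what lets HDT query without prior decompression.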

Only in the last two weeks… [screenshots: HDT-cpp and HDT-java repository activity]

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)


LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

[Diagram: Linked Open Data -> LOD Laundromat -> Dataset 1 … Dataset 650K, each as N-Triples (zip), plus a SPARQL endpoint for the metadata -> combined into LOD-a-lot]

LOD-a-lot: 28B triples

Disk size:
  HDT: 304 GB
  HDT-FoQ (additional indexes): 133 GB

Memory footprint (to query):
  15.7 GB of RAM (3% of the size)
  144 seconds loading time
  (8 cores at 2.6 GHz, 32 GB RAM, SATA HDD, Ubuntu 14.04.5 LTS)

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305 €

(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)


LOD-a-lot

https://datahub.io/dataset/lod-a-lot

http://purl.org/HDT/lod-a-lot

Query resolution at Web scale
  (using LDF, Jena)

Evaluation and benchmarking
  (no excuse!)

RDF metrics and analytics

LOD-a-lot (some use cases)

(subjects, predicates, objects)

Identity closure: ?x owl:sameAs ?y

Graph navigations: e.g. shortest path, random walk
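The identity-closure use case amounts to computing connected components over the owl:sameAs links. A minimal sketch with union-find, on hypothetical example IRIs (not LOD-a-lot data):

```python
# Sketch: owl:sameAs identity closure via union-find, so every IRI
# naming the same real-world entity ends up in one identity class.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    ("ex:JavierWU", SAME_AS, "ex:JavierDBpedia"),
    ("ex:JavierDBpedia", SAME_AS, "ex:JavierORCID"),
    ("ex:Axel", SAME_AS, "ex:AxelDBpedia"),
]

parent = {}

def find(x):
    """Representative of x's class, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for s, p, o in triples:
    if p == SAME_AS:          # sameAs is symmetric/transitive: merge classes
        union(s, o)

# Group IRIs by representative: each group is one identity class.
classes = {}
for iri in parent:
    classes.setdefault(find(iri), set()).add(iri)
print(sorted(map(sorted, classes.values())))
```

At LOD-a-lot scale the same algorithm works streaming over the single HDT file, which is exactly why a one-file view of the LOD Cloud makes this analysis cheap.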


LOD-a-lot (some use cases)

Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.

More use cases

http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22

Retrieve all entities in LOD with the label "Axel Polleres"

Update LOD-a-lot regularly
  More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple
  Currently supported only via LOD Laundromat

… implement the use cases and help the community democratize access to LOD

low-cost access to LOD = high-impact research

Roadmap


ACKs


RDF Archiving: Archiving policies

a) Independent Copies/Snapshots (IC):
V1: exC1 ex:hasProfessor exP1 . exS1 ex:study exC1 . exS2 ex:study exC1 .
V2: exC1 ex:hasProfessor exP1 . exS1 ex:study exC1 . exS3 ex:study exC1 .
V3: exC1 ex:hasProfessor exP2 . exC1 ex:hasProfessor exS2 . exS1 ex:study exC1 . exS3 ex:study exC1 .

b) Change-based approach (CB):
V1 (full): exC1 ex:hasProfessor exP1 . exS1 ex:study exC1 . exS2 ex:study exC1 .
V1->V2: added exS3 ex:study exC1 ; deleted exS2 ex:study exC1
V2->V3: added exC1 ex:hasProfessor exP2 and exC1 ex:hasProfessor exS2 ; deleted exC1 ex:hasProfessor exP1

c) Timestamp-based approach (TB):
exC1 ex:hasProfessor exP1 [V1, V2]
exC1 ex:hasProfessor exP2 [V3]
exC1 ex:hasProfessor exS2 [V3]
exS1 ex:study exC1 [V1, V2, V3]
exS2 ex:study exC1 [V1]
exS3 ex:study exC1 [V2, V3]

[Diagram: each policy is accessed through a retrieval mediator]
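The timestamp-based (TB) policy above is easy to sketch: annotate each triple with the set of versions in which it holds, then both version materialization and delta queries become set filters. A minimal illustration on the slide's example data (the `ex:` terms are the slide's toy vocabulary):

```python
# Sketch of the timestamp-based (TB) archiving policy: one copy of each
# triple, annotated with the versions in which it holds.
annotated = {
    ("exC1", "ex:hasProfessor", "exP1"): {1, 2},
    ("exC1", "ex:hasProfessor", "exP2"): {3},
    ("exC1", "ex:hasProfessor", "exS2"): {3},
    ("exS1", "ex:study", "exC1"): {1, 2, 3},
    ("exS2", "ex:study", "exC1"): {1},
    ("exS3", "ex:study", "exC1"): {2, 3},
}

def materialize(version):
    """Snapshot of one version (a 'version materialization' query)."""
    return {t for t, versions in annotated.items() if version in versions}

def diff(v_from, v_to):
    """Added and deleted triples between two versions (a 'delta' query)."""
    a, b = materialize(v_from), materialize(v_to)
    return b - a, a - b

added, deleted = diff(1, 2)
print(added)    # exS3 started studying exC1 in V2
print(deleted)  # exS2 stopped studying exC1 after V1
```

Compared with IC (fast snapshots, heavy storage) and CB (small deltas, slow materialization), TB trades one annotated copy per triple for cheap cross-version queries.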

Democratizing Open Data preservationmonitoring

Enhance the usability of Open Data and enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN-powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

http://data.wu.ac.at/portalwatch

Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter

We are currently facing Big Linked Data challenges:

generation, publication and consumption

archiving, evolution…

Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow

Compression democratizes the access to Big Linked Data = cheap, scalable consumers

low-cost access to LOD = high-impact research


Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 2: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

0 Zero knowledge

1 I have just heard of RDF andor Linked Data

2 I know the basic foundations and I gave it a try

3 I often manage RDFLinked Data

PAGE 2

Knowledge of RDFLinked Data

img Nick Youngson

Linked Data Introduction

Preliminaries

Linked Data is simply about using the Web to create typed links between data from different sourcesldquo

A practical scenariohellip

computer scientists working in Vienna younger than 40

4

5

The information is already in the Webhellip but with no structure

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

The Web of Data (Semantic Web)Linking data to data

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

Javier Fernaacutendez

33

age

Javier David Fernaacutendez

WU

works

postdoctoral researcher

Vienna

is a

is located in

same as

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

8

Example

Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg

Is this the same Javier as Javier Bardem (actor)

Is ldquoWorksAtrdquo thesame as

ldquoresearchAtrdquo

I

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 3: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Linked Data Introduction

Preliminaries

Linked Data is simply about using the Web to create typed links between data from different sourcesldquo

A practical scenariohellip

computer scientists working in Vienna younger than 40

4

5

The information is already in the Webhellip but with no structure

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

The Web of Data (Semantic Web)Linking data to data

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

Javier Fernaacutendez

33

age

Javier David Fernaacutendez

WU

works

postdoctoral researcher

Vienna

is a

is located in

same as

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

8

Example

Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg

Is this the same Javier as Javier Bardem (actor)

Is ldquoWorksAtrdquo thesame as

ldquoresearchAtrdquo

I

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 4: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

A practical scenariohellip

computer scientists working in Vienna younger than 40

4

5

The information is already in the Webhellip but with no structure

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

The Web of Data (Semantic Web)Linking data to data

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

[Diagram: the same facts as a linked-data graph: "Javier Fernández" -age-> 33; "Javier David Fernández" -works-> WU, -is a-> postdoctoral researcher; WU -is located in-> Vienna; the two "Javier" nodes connected by a "same as" link.]

Quick intro to RDF

Resource Description Framework (W3C Rec. 2004)

Machine-processable descriptions: Web services, protocols, persons, proteins, geography…

Data model based on triples/sentences: Subject, Predicate, Object

Example

Javier isA Person .
Javier hasName "Javier Fernandez" .
Javier worksAt WU .
Javier knows tim .
Javier knows axel .
axel hasName "Axel Polleres" .
tim hasName "Tim Berners-Lee" .
tim hasCreated <http://linkeddata.org> .

Is this the same Javier as Javier Bardem (the actor)?

Is "worksAt" the same as "researchAt"?

Quick intro to RDF

Resource Description Framework (W3C Rec. 2004)

Machine-processable descriptions: Web services, protocols, persons, proteins, geography…

Data model based on triples/sentences: Subject, Predicate, Object

Example

<http://Fernandez.net/Javier> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/name> "Javier Fernandez" .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.wu.ac.at> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://tim.org/foaf.rdf#tim> .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://polleres.net/me> .
<http://polleres.net/me> <http://xmlns.com/foaf/0.1/name> "Axel Polleres" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/name> "Tim Berners-Lee" .
<http://tim.org/foaf.rdf#tim> <http://xmlns.com/foaf/0.1/made> <http://linkeddata.org> .

URIs × URIs × (URIs ∪ Literals)

[Diagram: <http://Fernandez.net/Javier> and <http://polleres.net/me> as nodes of rdf:type foaf:Person, with foaf:name "Javier Fernandez" and "Axel Polleres".]
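Since each N-Triples statement is just three terms, parsing it into (subject, predicate, object) tuples takes only a few lines. A minimal Python sketch, illustrative only and far from the full N-Triples grammar (it handles only IRIs in angle brackets and plain double-quoted literals):

```python
import re

# Matches one term of our tiny N-Triples subset: <iri> or "literal".
TERM = re.compile(r'<([^>]*)>|"([^"]*)"')

def parse_ntriples(text):
    """Yield (subject, predicate, object) tuples from N-Triples-like lines."""
    for line in text.strip().splitlines():
        terms = [iri or lit for iri, lit in TERM.findall(line)]
        if len(terms) == 3:  # skip blank or malformed lines
            yield tuple(terms)

doc = '''
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/name> "Javier Fernandez" .
<http://Fernandez.net/Javier> <http://xmlns.com/foaf/0.1/knows> <http://polleres.net/me> .
'''
triples = list(parse_ntriples(doc))
print(triples[0][2])  # prints: Javier Fernandez
```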

Formal Query: SPARQL

Similar to SQL:

select ?people ?name
where {
  ?people foaf:knows <http://polleres.net/me> .
  ?people foaf:name ?name .
}

Result:

?people: <http://Fernandez.net/Javier>, ?name: "Javier Fernandez"

[Diagram: the query matched against the current RDF data, where <http://Fernandez.net/Javier> foaf:knows <http://polleres.net/me> and has foaf:name "Javier Fernandez".]
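Conceptually, evaluating this query is a join of two triple patterns over the data. A naive in-memory sketch in Python (illustrative only, not how real SPARQL engines are implemented):

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# The example data as a set of (subject, predicate, object) tuples.
triples = {
    ("http://Fernandez.net/Javier", FOAF + "knows", "http://polleres.net/me"),
    ("http://Fernandez.net/Javier", FOAF + "name", "Javier Fernandez"),
    ("http://polleres.net/me", FOAF + "name", "Axel Polleres"),
}

def match(pattern, triples):
    """Match one triple pattern: '?var' terms bind, other terms must be equal."""
    for t in triples:
        binding = {}
        for p, v in zip(pattern, t):
            if p.startswith("?"):
                binding[p] = v
            elif p != v:
                break
        else:
            yield binding

def query(patterns, triples):
    """Join triple patterns left to right, substituting bound variables."""
    results = [{}]
    for pat in patterns:
        results = [
            {**r, **b}
            for r in results
            for b in match(tuple(r.get(term, term) for term in pat), triples)
        ]
    return results

rows = query([
    ("?people", FOAF + "knows", "http://polleres.net/me"),
    ("?people", FOAF + "name", "?name"),
], triples)
print(rows)  # [{'?people': 'http://Fernandez.net/Javier', '?name': 'Javier Fernandez'}]
```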

The Web of Linked Data (2017)

Big (Linked) Semantic Data Compression

~10K datasets organized into 9 domains covering many and varied knowledge fields

150B statements, including entity descriptions and (inter-/intra-dataset) links between them

>500 live endpoints serving this data

http://lod-cloud.net

http://stats.lod2.eu

http://sparqles.ai.wu.ac.at

Big Semantic Data

The greatness of Linked Open Data

> 150B triples

1K–6K datasets

> 557 SPARQL Endpoints

http://lod-cloud.net, https://datahub.io, http://stats.lod2.eu, http://sparqles.ai.wu.ac.at

But what about Web-scale queries?

E.g. retrieve all entities in LOD with the label "Axel Polleres"

Solutions

select distinct ?x
where { ?x rdfs:label "Axel Polleres" }

Let's fish in our Linked Data ecosystem

A) Federated Queries

1. Get a list of potential SPARQL endpoints

datahub.io, LOV, other catalogs

2. Query each SPARQL endpoint

Problems:

Many SPARQL endpoints have low availability

The Web of Data ecosystem

http://sparqles.ai.wu.ac.at

A) Federated Queries

1. Get a list of potential SPARQL endpoints

datahub.io, LOV, other catalogs

2. Query each SPARQL endpoint

Problems:

Many SPARQL endpoints have low availability

SPARQL endpoints are usually restricted (timeouts, result limits)

Moreover, complex queries (joins) can be tricky due to intermediate results, delays, etc.

The Web of Data ecosystem
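The federation idea, and why low availability hurts it, can be sketched with stub endpoints (plain Python functions standing in for remote SPARQL services; all names and data here are hypothetical):

```python
def endpoint_a(pattern):
    # Stub for a live endpoint holding one matching triple.
    return [("http://example.org/axel", "rdfs:label", "Axel Polleres")]

def endpoint_b(pattern):
    # Stub for an endpoint with low availability.
    raise TimeoutError("endpoint unreachable")

def federated_query(pattern, endpoints):
    """Query every endpoint, union the results, tolerate failures."""
    results, failures = [], 0
    for ep in endpoints:
        try:
            results.extend(ep(pattern))
        except Exception:
            failures += 1  # low availability: skip this endpoint and move on
    return results, failures

rows, down = federated_query(("?x", "rdfs:label", "Axel Polleres"),
                             [endpoint_a, endpoint_b])
print(rows, down)  # one result row, one endpoint down
```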

B) Follow-your-nose

1. Follow self-descriptive IRIs and links

2. Filter the results you are interested in

Problems:

You need some initial seed (DBpedia could be a good start)

It's slow (fetching many documents)

Where should I start for unbounded queries?

?x rdfs:label "Axel Polleres"

The Web of Data ecosystem
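Follow-your-nose is essentially a breadth-first traversal of dereferenceable documents. A toy sketch over an in-memory "web" (all IRIs and data hypothetical; a real crawler would HTTP-dereference each IRI):

```python
from collections import deque

# Toy web: each IRI "dereferences" to a document of triples.
WEB = {
    "http://dbpedia.org/resource/Vienna": [
        ("http://dbpedia.org/resource/Vienna", "related", "http://example.org/axel"),
    ],
    "http://example.org/axel": [
        ("http://example.org/axel", "rdfs:label", "Axel Polleres"),
    ],
}

def follow_your_nose(seed, want_label):
    """BFS over linked documents, collecting subjects carrying a given label."""
    seen, queue, hits = set(), deque([seed]), []
    while queue:
        iri = queue.popleft()
        if iri in seen:
            continue
        seen.add(iri)
        for s, p, o in WEB.get(iri, []):       # "dereference" the IRI
            if p == "rdfs:label" and o == want_label:
                hits.append(s)
            elif o.startswith("http") and o not in seen:
                queue.append(o)                # follow the link
    return hits

print(follow_your_nose("http://dbpedia.org/resource/Vienna", "Axel Polleres"))
# ['http://example.org/axel']
```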

C) Use the RDF dumps yourself

1. Crawl the Web of Data

Probably start with datahub.io, LOV, other catalogs

2. Download the datasets

You'd better have some free space on your machine

3. Index the datasets locally

You'd better be patient and survive the parsing errors

4. Query all datasets

You'd better still be alive by then

Problems:

Huge resource requirements + messiness of the data

The Web of Data ecosystem

1) LOD Laundromat

Challenges:

You still need to query 650K datasets

Of course it does not contain all of LOD, but "a good approximation"

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with a small (~3%) memory footprint

Very fast on basic queries (triple patterns): 1.5x faster than Virtuoso, Jena, RDF-3X

Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions

Challenges:

The publisher pays a bit of overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently)

A Linked Data hacker toolkit
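The core idea behind HDT-style compression is dictionary encoding: store every term once, represent triples as integer ID tuples, and answer triple patterns directly over the IDs. A minimal sketch of that idea (not the actual HDT format, which additionally bit-packs and indexes the ID streams):

```python
class DictEncodedGraph:
    """Toy dictionary-encoded triple store: terms stored once, triples as int tuples."""

    def __init__(self):
        self.term2id, self.id2term = {}, []
        self.triples = set()

    def _id(self, term):
        # Assign each distinct term a small integer ID, once.
        if term not in self.term2id:
            self.term2id[term] = len(self.id2term)
            self.id2term.append(term)
        return self.term2id[term]

    def add(self, s, p, o):
        self.triples.add((self._id(s), self._id(p), self._id(o)))

    def pattern(self, s=None, p=None, o=None):
        """Resolve a triple pattern over the IDs, decoding only the matches."""
        ids = tuple(None if t is None else self.term2id.get(t, -1) for t in (s, p, o))
        for t in self.triples:
            if all(q is None or q == v for q, v in zip(ids, t)):
                yield tuple(self.id2term[v] for v in t)

g = DictEncodedGraph()
g.add("ex:Javier", "foaf:knows", "ex:Axel")
g.add("ex:Javier", "foaf:name", "Javier Fernandez")
print(list(g.pattern(s="ex:Javier", p="foaf:name")))
# [('ex:Javier', 'foaf:name', 'Javier Fernandez')]
```

Repeated terms (long IRIs shared by many triples) are what makes this pay off: each repetition costs one integer instead of one string.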

~431M triples:

N-Triples: 63 GB
N-Triples + gzip: 5 GB
HDT: 6.6 GB (slightly more than gzip, but you can query it)

https://github.com/rdfhdt — C++ and Java tools

Only in the last two weeks… (commit activity in HDT-cpp and HDT-java)

3) Linked Data Fragments

Challenges:

Still room for optimization for complex federated queries (delays, intermediate results, …)

A Linked Data hacker toolkit

PAGE 24

LDF interfaces
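A Linked Data Fragments server exposes only paginated triple-pattern lookups, leaving joins to the client. A stub of that interface (hypothetical page size and data; real Triple Pattern Fragments servers return pages of around 100 triples with count metadata):

```python
PAGE_SIZE = 2  # tiny for illustration; real TPF servers use e.g. 100

# Toy data: five entities sharing the same label.
DATA = [("ex:s%d" % i, "rdfs:label", "Axel Polleres") for i in range(5)]

def fragment(pattern, page):
    """Server side: one page of matches plus the total count, like a TPF response."""
    matches = [t for t in DATA
               if all(q is None or q == v for q, v in zip(pattern, t))]
    start = page * PAGE_SIZE
    return matches[start:start + PAGE_SIZE], len(matches)

def all_matches(pattern):
    """Client side: keep requesting pages until the fragment is exhausted."""
    page, out = 0, []
    while True:
        chunk, total = fragment(pattern, page)
        out.extend(chunk)
        page += 1
        if len(out) >= total or not chunk:
            return out

print(len(all_matches((None, "rdfs:label", "Axel Polleres"))))  # 5
```

The design choice: the server stays cheap and cacheable because it only ever answers single triple patterns; query planning and joining move to the client.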

LOD-a-lot

But what about Web-scale queries?

- flashback -

[Diagram: the LOD Laundromat crawls Linked Open Data, cleans each of the 650K datasets and republishes it as zipped N-Triples plus a SPARQL endpoint over the metadata; LOD-a-lot merges all of them into a single file.]

LOD-a-lot: 28B triples

Disk size:

HDT: 304 GB
HDT-FoQ (additional indexes): 133 GB

Memory footprint (to query):

15.7 GB of RAM (~3% of the size)
14.4 seconds loading time
(8 cores @ 2.6 GHz, 32 GB RAM, SATA HDD, Ubuntu 14.04.5 LTS)

LDF page resolution in milliseconds

LOD-a-lot (some numbers)

305 €

(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)

LOD-a-lot

https://datahub.io/dataset/lod-a-lot

http://purl.org/HDT/lod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics


LOD-a-lot (some use cases)

Statistics over subjects, predicates and objects

Identity closure:

?x owl:sameAs ?y

Graph navigations:

e.g. shortest path, random walk

LOD-a-lot (some use cases)

Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
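Computing the identity closure over owl:sameAs links is a connected-components problem, for which union-find is the standard tool. A sketch (the IRIs are hypothetical examples):

```python
parent = {}

def find(x):
    """Return the representative of x's identity cluster (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    """Merge the clusters of x and y (an owl:sameAs edge)."""
    parent[find(x)] = find(y)

same_as = [
    ("dbpedia:Axel_Polleres", "wikidata:Q123"),  # hypothetical identifiers
    ("wikidata:Q123", "wu:axel"),
]
for x, y in same_as:
    union(x, y)

# All three IRIs now land in one identity cluster.
print(find("dbpedia:Axel_Polleres") == find("wu:axel"))  # True
```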

More use cases:

http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22

Retrieve all entities in LOD with the label "Axel Polleres"

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

… implement the use cases and help the community democratize access to LOD

low-cost access to LOD = high-impact research

Roadmap

ACKs

RDF Archiving: archiving policies

Example versions:

V1: {exC1 ex:hasProfessor exP1 . exS1 ex:study exC1 . exS2 ex:study exC1}
V2: {exC1 ex:hasProfessor exP1 . exS1 ex:study exC1 . exS3 ex:study exC1}
V3: {exC1 ex:hasProfessor exP2 . exC1 ex:hasProfessor exS2 . exS1 ex:study exC1 . exS3 ex:study exC1}

a) Independent Copies/Snapshots (IC): store each version in full, behind a retrieval mediator.

b) Change-based approach (CB): store V1 in full plus the deltas between consecutive versions (e.g. V1→V2 adds exS3 ex:study exC1 and deletes exS2 ex:study exC1).

c) Timestamp-based approach (TB): annotate each triple with the versions in which it holds:

exC1 ex:hasProfessor exP1 [V1,V2]
exC1 ex:hasProfessor exP2 [V3]
exC1 ex:hasProfessor exS2 [V3]
exS1 ex:study exC1 [V1,V2,V3]
exS2 ex:study exC1 [V1]
exS3 ex:study exC1 [V2,V3]
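The three policies trade storage space against retrieval cost. A sketch that materializes versions under IC, CB and TB, using the example triples above (set operations stand in for a real delta store):

```python
V1 = {("exC1", "ex:hasProfessor", "exP1"),
      ("exS1", "ex:study", "exC1"),
      ("exS2", "ex:study", "exC1")}
V2 = {("exC1", "ex:hasProfessor", "exP1"),
      ("exS1", "ex:study", "exC1"),
      ("exS3", "ex:study", "exC1")}
V3 = {("exC1", "ex:hasProfessor", "exP2"),
      ("exC1", "ex:hasProfessor", "exS2"),
      ("exS1", "ex:study", "exC1"),
      ("exS3", "ex:study", "exC1")}

# IC: store every snapshot in full (cheap retrieval, expensive storage).
ic = [V1, V2, V3]

# CB: store the first snapshot plus (added, deleted) deltas between versions.
cb = [V1, (V2 - V1, V1 - V2), (V3 - V2, V2 - V3)]

def cb_materialize(store, i):
    """Replay deltas on top of the initial snapshot to rebuild version i (0-based)."""
    current = set(store[0])
    for added, deleted in store[1:i + 1]:
        current = (current - deleted) | added
    return current

# TB: annotate each triple with the versions in which it holds.
tb = {}
for v, snapshot in enumerate(ic, start=1):
    for triple in snapshot:
        tb.setdefault(triple, []).append(v)

def tb_materialize(store, v):
    """Select the triples annotated with version v."""
    return {t for t, versions in store.items() if v in versions}

print(cb_materialize(cb, 2) == V3 == tb_materialize(tb, 3))  # True
```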

Democratizing Open Data preservationmonitoring

Enhance the usability of Open Data and its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 5: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

5

The information is already in the Webhellip but with no structure

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

The Web of Data (Semantic Web)Linking data to data

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

Javier Fernaacutendez

33

age

Javier David Fernaacutendez

WU

works

postdoctoral researcher

Vienna

is a

is located in

same as

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

8

Example

Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg

Is this the same Javier as Javier Bardem (actor)

Is ldquoWorksAtrdquo thesame as

ldquoresearchAtrdquo

I

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 6: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

The information is already in the Webhellip but with no structure

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

The Web of Data (Semantic Web)Linking data to data

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

Javier Fernaacutendez

33

age

Javier David Fernaacutendez

WU

works

postdoctoral researcher

Vienna

is a

is located in

same as

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

8

Example

Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg

Is this the same Javier as Javier Bardem (actor)

Is ldquoWorksAtrdquo thesame as

ldquoresearchAtrdquo

I

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 7: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

The Web of Data (Semantic Web)Linking data to data

httpswwwwuacateninfobizteamfernandezhttpmyPersonalWebCV

hellip Javier Fernaacutendez helliphelliphelliphelliphellip33 years oldhelliphelliphelliphelliphelliphelliphelliphellip helliphellip

hellip Javier David FernandezhelliphelliphelliphellipWU (Vienna University of Economics and Business)helliphelliphellip hellipis a postdoctoral researcherhelliphelliphelliphellip

Javier Fernaacutendez

33

age

Javier David Fernaacutendez

WU

works

postdoctoral researcher

Vienna

is a

is located in

same as

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

8

Example

Javier isA Person Javier hasName ldquoJavier Fernandezrdquo Javier worksAt WU Javier knows tim Javier knows axelaxel hasName ldquoAxel Polleresldquotim hasName ldquoTim Berners-Leeldquo tim hasCreated httplinkeddataorg

Is this the same Javier as Javier Bardem (actor)

Is ldquoWorksAtrdquo thesame as

ldquoresearchAtrdquo

I

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1. Follow self-descriptive IRIs and links

2. Filter the results you are interested in

Problems:

You need some initial seed (DBpedia could be a good start)

It's slow (fetching many documents)

Where should I start for unbounded queries such as ?x rdfs:label "Axel Polleres"?

The Web of Data Eco System
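The follow-your-nose strategy above can be sketched as a breadth-first crawl from a seed IRI. This is a hedged toy sketch: the `WEB` dictionary is a made-up stand-in for real HTTP dereferencing, and all IRIs in it are hypothetical.

```python
# "Follow your nose": dereference a seed IRI, collect the triples it serves,
# and recursively follow every IRI seen so far.
from collections import deque

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

WEB = {  # IRI -> triples served when dereferencing it (toy data)
    "http://example.org/axel": [
        ("http://example.org/axel", RDFS_LABEL, "Axel Polleres"),
        ("http://example.org/axel", "http://xmlns.com/foaf/0.1/knows",
         "http://example.org/javier"),
    ],
    "http://example.org/javier": [
        ("http://example.org/javier", RDFS_LABEL, "Javier Fernandez"),
    ],
}

def crawl(seed):
    seen, queue, triples = {seed}, deque([seed]), []
    while queue:
        iri = queue.popleft()
        for s, p, o in WEB.get(iri, []):
            triples.append((s, p, o))
            for term in (s, o):  # follow any IRI we have not visited yet
                if term.startswith("http://") and term not in seen:
                    seen.add(term)
                    queue.append(term)
    return triples

labels = [s for s, p, o in crawl("http://example.org/axel")
          if p == RDFS_LABEL and o == "Axel Polleres"]
print(labels)  # ['http://example.org/axel']
```

The sketch makes the two problems visible: everything hinges on the seed, and an unbounded query only finds what the crawl happens to reach.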

C) Use the RDF dumps by yourself

1. Crawl the Web of Data (probably starting with datahub.io, LOV, other catalogs)

2. Download the datasets (you'd better have some free space on your machine)

3. Index the datasets locally (you'd better be patient and survive the parsing errors)

4. Query all datasets (you'd better still be alive by then)

Problems:

Huge resources + messiness of the data

The Web of Data Eco System

1) LOD Laundromat

Challenges:

Still, you need to query 650K datasets

Of course, it does not contain all LOD, but "a good approximation"

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with a small (3%) memory footprint

Very fast on basic queries (triple patterns): ×15 faster than Virtuoso, Jena, RDF3X

Supports full SPARQL as the compressed backend store of Jena, with an efficiency on the same scale as current, more optimized solutions

Challenges:

The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently)

A Linked Data hacker toolkit
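A rough, simplified sketch of HDT's core idea (an assumption about the design, not the actual file format): map every term to a dictionary ID and keep the triples as sorted integer tuples, so a pattern with a bound subject becomes a binary search instead of a scan over strings.

```python
# Dictionary-encode toy triples and answer an (S ? ?) pattern by binary search.
import bisect

raw = [
    ("ex:Javier", "foaf:knows", "ex:Axel"),
    ("ex:Javier", "foaf:name", '"Javier Fernandez"'),
    ("ex:Axel", "foaf:name", '"Axel Polleres"'),
]

terms = sorted({t for triple in raw for t in triple})
to_id = {t: i for i, t in enumerate(terms)}               # dictionary: term -> int
ids = sorted(tuple(to_id[t] for t in tr) for tr in raw)   # SPO-sorted ID triples

def pattern_s(subject):
    """All triples with the given subject (pattern S ? ?), via binary search."""
    s = to_id[subject]
    lo = bisect.bisect_left(ids, (s,))
    out = []
    while lo < len(ids) and ids[lo][0] == s:
        out.append(tuple(terms[i] for i in ids[lo]))      # decode IDs back to terms
        lo += 1
    return out

print(pattern_s("ex:Javier"))
```

The compression comes from storing each (long, repetitive) IRI string once in the dictionary; the real HDT adds succinct bitmap structures on top, but the dictionary-plus-sorted-IDs layout is the essence.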

~431M triples

NT: 63 GB

NT + gzip: 5 GB

HDT: 6.6 GB (slightly more, but you can query!)

https://github.com/rdfhdt: C++ and Java tools

Only in the last two weeks… [GitHub activity for HDT-cpp and HDT-java]

3) Linked Data Fragments

Challenges:

Still room for optimization for complex, federated queries (delays, intermediate results, …)

A Linked Data hacker toolkit

LDF interfaces
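The client/server split behind a Triple Pattern Fragments-style interface can be sketched as follows (a simplified assumption about the interface, not the real LDF API): the server only answers single triple patterns, one page at a time, and the client pages until the fragment is exhausted, doing any joins itself.

```python
# Toy paged triple-pattern interface: a low-cost server, a patient client.
PAGE_SIZE = 1  # tiny page size so the paging loop is actually exercised

DATA = [
    ("ex:s1", "rdfs:label", '"Axel Polleres"'),
    ("ex:s2", "rdfs:label", '"Axel Polleres"'),
    ("ex:s3", "rdfs:label", '"Javier Fernandez"'),
]

def fragment(pattern, page):
    """Server side: one triple pattern, one page of matches, plus total count."""
    hits = [t for t in DATA
            if all(q is None or q == v for q, v in zip(pattern, t))]
    return hits[page * PAGE_SIZE:(page + 1) * PAGE_SIZE], len(hits)

def resolve(pattern):
    """Client side: keep requesting pages until all matches are in."""
    out, page = [], 0
    while True:
        hits, total = fragment(pattern, page)
        out.extend(hits)
        if len(out) >= total:
            return out
        page += 1

print(resolve((None, "rdfs:label", '"Axel Polleres"')))
```

Because each request is a cheap pattern lookup, the server stays highly available; the price is that complex queries cost the client many round trips, which is exactly the optimization headroom noted above.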

LOD-a-lot

But what about Web-scale queries?

- flashback -

[Figure: Linked Open Data crawled into LOD Laundromat as 650K cleaned datasets in N-Triples (zip), with a SPARQL endpoint for the metadata, all merged into a single LOD-a-lot file]

LOD-a-lot: 28B triples

Disk size:

HDT: 304 GB

HDT-FoQ (additional indexes): 133 GB

Memory footprint (to query):

15.7 GB of RAM (3% of the size), 144 seconds loading time

8 cores (2.6 GHz), 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS

LDF page resolution in milliseconds

LOD-a-lot (some numbers)

305 €

(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)

LOD-a-lot

https://datahub.io/dataset/lod-a-lot

http://purl.org/HDT/lod-a-lot

Query resolution at Web scale: using LDF, Jena

Evaluation and benchmarking: no excuse!

RDF metrics and analytics

LOD-a-lot (some use cases)

subjects / predicates / objects

Identity closure: ?x owl:sameAs ?y

Graph navigations: e.g., shortest path, random walk

LOD-a-lot (some use cases)
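The identity-closure use case boils down to connected components over owl:sameAs links. A minimal sketch with toy, hypothetical IRIs (the real task runs over the hundreds of millions of sameAs links in LOD-a-lot):

```python
# owl:sameAs identity closure: treat sameAs links as undirected edges and
# compute the component (identity set) an IRI belongs to.
from collections import defaultdict

same_as = [  # hypothetical sameAs links
    ("dbpedia:Axel_Polleres", "wikidata:Q123"),
    ("wikidata:Q123", "ex:axel"),
    ("dbpedia:Vienna", "wikidata:Q1741"),
]

adj = defaultdict(set)
for x, y in same_as:
    adj[x].add(y)
    adj[y].add(x)

def identity_set(iri):
    """All IRIs transitively reachable via sameAs (including iri itself)."""
    seen, stack = {iri}, [iri]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(identity_set("ex:axel")))
# ['dbpedia:Axel_Polleres', 'ex:axel', 'wikidata:Q123']
```

Having all of LOD in one local HDT file is what makes computing such a closure feasible at all: no federation, no crawling, just pattern lookups and graph traversal.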

Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.

More use cases?

http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22

Retrieve all entities in LOD with the label "Axel Polleres"

Update LOD-a-lot regularly: more and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple (currently supported only via LOD Laundromat)

… implement the use cases and help the community democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

ACKs

RDF Archiving: archiving policies

Example: three versions of a graph.

V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

a) Independent Copies/Snapshots (IC): store each version V1, V2, V3 in full.

b) Change-based approach (CB): store V1 in full, plus the changesets between consecutive versions, e.g. V1→V2 adds ex:S3 ex:study ex:C1 and deletes ex:S2 ex:study ex:C1.

c) Timestamp-based approach (TB): annotate each triple with the versions in which it holds:

ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]

All three policies are accessed through a retrieval mediator.
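A minimal sketch of the three archiving policies on toy data (two versions only, invented ex: triples): IC stores full snapshots, CB stores a base version plus (added, deleted) changesets, TB annotates each triple with the versions in which it holds.

```python
# IC, CB and TB archiving of two toy graph versions.
V1 = {("ex:C1", "ex:hasProfessor", "ex:P1"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S2", "ex:study", "ex:C1")}
V2 = {("ex:C1", "ex:hasProfessor", "ex:P1"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S3", "ex:study", "ex:C1")}

ic = {1: V1, 2: V2}                        # a) independent copies: full snapshots

delta_12 = (V2 - V1, V1 - V2)              # b) change-based: (added, deleted)
cb = {"base": V1, "changes": {2: delta_12}}

def materialize(cb, version):
    """Rebuild a version from the CB base by replaying changesets."""
    g = set(cb["base"])
    for v in range(2, version + 1):
        added, deleted = cb["changes"][v]
        g = (g - deleted) | added
    return g

tb = {}                                    # c) timestamp-based: triple -> versions
for v, snapshot in ic.items():
    for t in snapshot:
        tb.setdefault(t, set()).add(v)

assert materialize(cb, 2) == ic[2]         # CB replay reproduces the IC snapshot
print(tb[("ex:S1", "ex:study", "ex:C1")])  # {1, 2}
```

The trade-off the sketch exposes: IC is fast to query but wasteful in space, CB is compact but pays replay cost to materialize a version, and TB answers "in which versions does this triple hold?" directly.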

Democratizing Open Data preservation/monitoring

Enhance the usability of Open Data and its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH (Vadim Savenkov)

Periodically monitoring a list of Open Data Portals: 90 CKAN-powered Open Data Portals

Quality assessment and evolution tracking of metadata and data

The CommuniData Project

http://data.wu.ac.at/portalwatch

Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter

We are currently facing Big Linked Data challenges:

Generation, publication and consumption

Archiving, evolution…

Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow

Compression democratizes the access to Big Linked Data = cheap, scalable consumers

low-cost access to LOD = high-impact research

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Quick intro to RDF

Resource Description Framework (W3C Rec. 2004)

Machine-processable descriptions: Web services, protocols, persons, proteins, geography…

Data model based on triples/sentences: Subject, Predicate, Object

Example:

Javier isA Person .
Javier hasName "Javier Fernandez" .
Javier worksAt WU .
Javier knows tim .
Javier knows axel .
axel hasName "Axel Polleres" .
tim hasName "Tim Berners-Lee" .
tim hasCreated http://linkeddata.org .

Is this the same Javier as Javier Bardem (the actor)? Is "worksAt" the same as "researchAt"?

I

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 9: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Quick intro to

Resource Description Framework (W3C Rec 2004)

Machine processable descriptions

Webs services protocols Persons Proteins geographyhellip

Data model Based on Triplessentences Subject Predicate Object

9

Example

lthttpFernandeznetJaviergt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpxmlnscomfoaf01Persongt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01namegt ldquoJavier Fernandezrdquo lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01workplaceHomepagegt lthttpwwwwuacatgt lthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttptimorgfoafrdftimgtlthttpFernandeznetJaviergt lthttpxmlnscomfoaf01knowsgt lthttppolleresnetmegt lthttppolleresnetmegt lthttpxmlnscomfoaf01namegt ldquoAxel Polleresrdquolthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01namegt ldquoTim Berners-Leeldquo lthttptimorgfoafrdftimgt lthttpxmlnscomfoaf01madegt lthttplinkeddataorggt

URIs x URIs x (URIs U Literals)

I

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo ldquoAxel Polleresrdquo

foafPerson

rdftype

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 10: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Formal Query SPARQL

Similar to SQL

SELECT people name

WHERE

people foafknows lthttppolleresnetmegt

people foafname name

people

name

people name

lthttpFernandeznetJaviergt ldquoJavier Fernandezrdquo

lthttpFernandeznetJaviergtlthttppolleresnetmegt

ldquoJavier Fernandezrdquo

foafPerson

rdftype

10

lthttppolleresnetmegt

foafknows

Current RDF data

Query

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

httpsgithubcomrdfhdt C++ and Java tools

Only in the last two weekshellip

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays intermediate results hellip)

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 11: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

~10K datasets organized into 9 domains which include many and varied knowledge fields

150B statements including entity descriptions and (interintra-dataset) links between them

gt500 live endpoints serving this data

httplod-cloudnet

httpstatslod2eu

httpsparqlesaiwuacat

Big Semantic Data

The greatness of Linked Open Data

13

gt 150B triples

1K-6K datasets

gt557 SPARQL Endpoints

httplod-cloudnethttpsdatahubiohttpstatslod2euhttpsparqlesaiwuacat

But what about Web-scale queries

Eg retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Solutions

14

select distinct x

x rdfslabel Axel Polleres

15

Letrsquos fish in our Linked Data Eco System

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

16

The Web of Data Eco System

httpsparqlesaiwuacat

A) Federated Queries

1 Get a list of potential SPARQL Endpoints

datahubio LOV other catalogs

2 Query each SPARQL Endpoint

Problems

Many SPARQL Endpoints have low availability

SPARQL Endpoints are usually restricted (timeoutresults)

Moreover it can be tricky with complex queries (joins) due to intermediary results delays etc

17

The Web of Data Eco System

B) Follow-your-nose

1 Follow self-descriptive IRIs and links

2 Filter the results you are interested in

Problems

You need some initial seed

DBpedia could be a good start

Itrsquos slow (fetching many documents)

Where should I start for unbounded queries

x rdfslabel ldquoAxel Polleres

18

The Web of Data Eco System

C) Use the RDF dumps by yourself

1 Crawl de Web of Data

Probably start with datahubio LOV other catalogs

2 Download datasets

You better have some free space in your machine

3 Index the datasets locally

You better are patience and survive parsing errors

4 Query all datasets

You better are alive by then

Problems

Hugh resources

+ Messiness of the data

19

The Web of Data Eco System

1) LOD Laundromat

Challenges

Still you need to query 650K datasets

Of course it does not contain all LOD but ldquoa good approximationrdquo

20

A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with small (3) memory footprint

Very fast on basic queries (triple patterns) x 15 faster than Virtuoso Jena RDF3X

Supports FULL SPARQL as the compressed backend store of Jena with an efficiency on the same scale as current more optimized solutions

Challenges

Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently)

21

A Linked Data hacker toolkit

431 Mtriples~

63 GB

NT + gzip5 GB

HDT 66 GB

Slightly more but you can query

https://github.com/rdfhdt : C++ and Java tools

Only in the last two weeks…

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges

Still room for optimization for complex federated queries (delays, intermediate results, …)
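A toy version of the Triple Pattern Fragments idea: the server only answers single triple patterns, and the client performs the join itself (all data below is made up):

```python
# Miniature LDF setup: server-side fragments + client-side join.
DATA = [
    ("ex:axel", "rdfs:label", "Axel Polleres"),
    ("ex:axel", "ex:worksAt", "ex:WU"),
    ("ex:javier", "ex:worksAt", "ex:WU"),
    ("ex:javier", "rdfs:label", "Javier Fernandez"),
]

def fragment(s=None, p=None, o=None):
    """'Server' side: one triple pattern per request, nothing more."""
    return [t for t in DATA
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# 'Client' side: who works at ex:WU, and what is their label?
# Evaluate the selective pattern first, then one request per binding --
# this per-binding chatter is where the delays mentioned above come from.
people = [s for s, _, _ in fragment(p="ex:worksAt", o="ex:WU")]
labels = {s: [o for _, _, o in fragment(s=s, p="rdfs:label")] for s in people}
```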

23

A Linked Data hacker toolkit

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries?

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot: 28B triples

Disk size

HDT: 304 GB

HDT-FoQ (additional indexes): 133 GB

Memory footprint (to query):

15.7 GB of RAM (~3% of the size)

144 seconds loading time

8 cores (2.6 GHz), 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305 €

(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ took 8 h & 250 GB RAM)

28

LOD-a-lot

https://datahub.io/dataset/lod-a-lot

http://purl.org/HDT/lod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

?x owl:sameAs ?y

Graph navigations

E.g. shortest path, random walk
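The identity-closure use case can be sketched with union-find over owl:sameAs statements (the IRIs below are illustrative): every connected component groups IRIs that denote the same real-world entity.

```python
# Union-find over owl:sameAs links; not how LOD-a-lot itself does it,
# just the standard technique for computing the closure.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

same_as = [                             # made-up ?x owl:sameAs ?y results
    ("dbr:Vienna", "wd:Q1741"),
    ("wd:Q1741", "ex:Wien"),
    ("dbr:Graz", "wd:Q13298"),
]
for x, y in same_as:
    union(x, y)

# Group IRIs into identity sets (the closure).
closure = {}
for iri in list(parent):
    closure.setdefault(find(iri), set()).add(iri)
```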

30

LOD-a-lot (some use cases)

Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017

More use cases

http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22

Retrieve all entities in LOD with the label "Axel Polleres"

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

… implement the use cases and help the community democratize access to LOD

low-cost access to LOD = high-impact research

Roadmap


32

ACKs

36

RDF Archiving: Archiving policies

a) Independent Copies/Snapshots (IC):

V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .

V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

b) Change-based approach (CB):

V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .

ΔV1,V2: added ex:S3 ex:study ex:C1 ; deleted ex:S2 ex:study ex:C1

ΔV2,V3: added ex:C1 ex:hasProfessor ex:P2 , ex:C1 ex:hasProfessor ex:S2 ; deleted ex:C1 ex:hasProfessor ex:P1

c) Timestamp-based approach (TB):

ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]
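The three policies can be computed from the toy course example above (version contents as reconstructed from the TB annotations; ex: prefixes abbreviated):

```python
# a) IC: store every snapshot in full.
V1 = {("C1", "hasProfessor", "P1"), ("S1", "study", "C1"), ("S2", "study", "C1")}
V2 = {("C1", "hasProfessor", "P1"), ("S1", "study", "C1"), ("S3", "study", "C1")}
V3 = {("C1", "hasProfessor", "P2"), ("C1", "hasProfessor", "S2"),
      ("S1", "study", "C1"), ("S3", "study", "C1")}
versions = [V1, V2, V3]

# b) CB: store V1 plus per-version deltas (added, deleted) as set differences.
deltas = [(new - old, old - new) for old, new in zip(versions, versions[1:])]

# c) TB: annotate each distinct triple with the set of versions it belongs to.
tb = {}
for i, v in enumerate(versions, start=1):
    for triple in v:
        tb.setdefault(triple, set()).add(i)
```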

RETRIEVAL MEDIATOR


Democratizing Open Data preservation/monitoring

Enhance the usability of Open Data and its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN-powered Open Data Portals

Quality assessment

Evolution tracking

Metadata

Data

The CommuniData Project

http://data.wu.ac.at/portalwatch

Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving, evolution…

Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow

Compression democratizes the access to Big Linked Data

= Cheap, scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov


We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 17: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

A) Federated Queries

1. Get a list of potential SPARQL endpoints

(datahub.io, LOV, other catalogs)

2. Query each SPARQL endpoint

Problems:

Many SPARQL endpoints have low availability

SPARQL endpoints are usually restricted (timeouts, limited results)

Moreover, it can be tricky with complex queries (joins) due to intermediate results, delays, etc.
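A toy sketch of why this is hard (the endpoints and triples here are invented for illustration, echoing the "computer scientists in Vienna younger than 40" scenario): a conjunctive query must join bindings across endpoints, so every extra round-trip adds delay and every server restriction can truncate the intermediate results.

```python
# Minimal sketch of the federated-query problem: each "endpoint" only holds
# part of the data, so a join query must ship intermediate results around.
endpoint_a = [("ex:javier", "ex:worksIn", "ex:vienna"),
              ("ex:axel", "ex:worksIn", "ex:vienna")]
endpoint_b = [("ex:javier", "ex:age", 33),
              ("ex:axel", "ex:age", 45)]

def ask(endpoint, p):
    """Stand-in for one SPARQL request: all (s, o) pairs for predicate p."""
    return [(s, o) for s, pp, o in endpoint if pp == p]

# ?x ex:worksIn ex:vienna . ?x ex:age ?a . FILTER(?a < 40)
workers = [s for s, o in ask(endpoint_a, "ex:worksIn") if o == "ex:vienna"]
ages = dict(ask(endpoint_b, "ex:age"))          # second round-trip
result = [x for x in workers if ages.get(x, 99) < 40]
print(result)  # -> ['ex:javier']
```

Each `ask` call stands for one HTTP request; with real endpoints, the `workers` list would have to be shipped to the second endpoint (or fetched entirely), which is exactly where timeouts and result limits bite.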

The Web of Data Ecosystem

B) Follow-your-nose

1. Follow self-descriptive IRIs and links

2. Filter the results you are interested in

Problems:

You need some initial seed

(DBpedia could be a good start)

It's slow (fetching many documents)

Where should I start for unbounded queries?

?x rdfs:label "Axel Polleres"
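A minimal follow-your-nose traversal, with the "Web" mocked as an in-memory dict (all IRIs and documents here are made up): dereference a seed, collect matching labels, and follow outgoing links breadth-first.

```python
# Toy follow-your-nose traversal: "dereferencing" an IRI returns the triples
# of its document; we follow links from a seed and filter for an rdfs:label.
web = {
    "ex:dbpedia": [("ex:dbpedia", "ex:links", "ex:axel")],
    "ex:axel":    [("ex:axel", "rdfs:label", "Axel Polleres"),
                   ("ex:axel", "ex:links", "ex:wu")],
    "ex:wu":      [("ex:wu", "rdfs:label", "WU Vienna")],
}

def follow_your_nose(seed, label):
    seen, frontier, hits = set(), [seed], []
    while frontier:
        iri = frontier.pop(0)
        if iri in seen:
            continue
        seen.add(iri)
        for s, p, o in web.get(iri, []):       # one HTTP fetch per document
            if p == "rdfs:label" and o == label:
                hits.append(s)
            if isinstance(o, str) and o in web and o not in seen:
                frontier.append(o)             # follow the link
    return hits

print(follow_your_nose("ex:dbpedia", "Axel Polleres"))  # -> ['ex:axel']
```

The two problems from the slide are visible directly: the answer depends on the `seed`, and each hop is a separate fetch.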


C) Use the RDF dumps by yourself

1. Crawl the Web of Data

(Probably start with datahub.io, LOV, other catalogs)

2. Download the datasets

You'd better have some free space on your machine

3. Index the datasets locally

You'd better be patient and survive the parsing errors

4. Query all datasets

You'd better still be alive by then

Problems:

Huge resources + messiness of the data


1) LOD Laundromat

Challenges:

Still, you need to query 650K datasets

Of course it does not contain all of LOD, but "a good approximation"


A Linked Data hacker toolkit

2) HDT

Highly compact serialization of RDF

Allows fast RDF retrieval in compressed space (without prior decompression)

Includes internal indexes to solve basic queries with a small (~3%) memory footprint

Very fast on basic queries (triple patterns): x15 faster than Virtuoso, Jena, RDF3X

Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions

Challenges:

The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to be consumed efficiently)
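The idea behind HDT-style retrieval can be sketched in plain Python (a toy illustration, not the actual HDT binary format): terms go into a shared dictionary, triples become sorted ID tuples, and prefix-bound triple patterns resolve by binary search instead of decompression.

```python
from bisect import bisect_left, bisect_right

# Toy HDT-like layout: a term dictionary plus ID-triples sorted in SPO order,
# so (s ? ?), (s p ?) and (s p o) patterns become range scans.
triples = [
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S3", "ex:study", "ex:C1"),
    ("ex:C1", "ex:hasProfessor", "ex:P1"),
]

terms = sorted({t for tr in triples for t in tr})               # shared dictionary
tid = {term: i for i, term in enumerate(terms)}                 # term -> ID
ids = sorted((tid[s], tid[p], tid[o]) for s, p, o in triples)   # SPO index

def query(s=None, p=None, o=None):
    """Resolve a triple pattern whose bound terms form a prefix (s / s,p / s,p,o)."""
    prefix = tuple(tid[x] for x in (s, p, o) if x is not None)
    lo = bisect_left(ids, prefix)
    hi = bisect_right(ids, prefix + (len(terms),) * (3 - len(prefix)))
    return [tuple(terms[i] for i in ids[k]) for k in range(lo, hi)]

print(query(s="ex:C1"))  # -> [('ex:C1', 'ex:hasProfessor', 'ex:P1')]
```

Patterns not bound by subject (e.g. only the object bound) would need extra orderings; in real HDT those are the role of the additional HDT-FoQ indexes mentioned later in the deck.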


~431 Mtriples

NT: 63 GB

NT + gzip: 5 GB

HDT: 6.6 GB

Slightly more than gzip, but you can query!

https://github.com/rdfhdt (C++ and Java tools)

Only in the last two weeks…

HDT-cpp

HDT-java

3) Linked Data Fragments

Challenges:

Still room for optimization for complex federated queries (delays, intermediate results, …)
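The LDF idea (in its Triple Pattern Fragments flavor) can be simulated in a few lines, with hypothetical data and a fixed page size: the server only answers single triple patterns, one page at a time, and the client loops over pages and does the joins itself.

```python
# Sketch of a Triple Pattern Fragments-style interface: cheap server calls,
# paged responses, client-side query processing.
PAGE_SIZE = 2
data = [("ex:s%d" % i, "ex:p", "ex:o") for i in range(5)]  # made-up triples

def fragment(s=None, p=None, o=None, page=0):
    """One server call: matches for a single triple pattern, one page at a time."""
    match = [t for t in data
             if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]
    return match[page * PAGE_SIZE:(page + 1) * PAGE_SIZE], len(match)

def all_matches(**pattern):
    """Client side: keep requesting pages until the fragment is exhausted."""
    page, out = 0, []
    while True:
        chunk, total = fragment(page=page, **pattern)
        out.extend(chunk)
        page += 1
        if len(out) >= total or not chunk:
            return out

print(len(all_matches(p="ex:p")))  # -> 5
```

The trade-off from the slide follows directly: the server stays cheap and highly available, but a complex federated query turns into many page requests with intermediate results held on the client.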



LDF interfaces

LOD-a-lot


But what about Web-scale queries?

- flashback -

[Diagram: LOD Laundromat crawls Linked Open Data and republishes each of the 650K datasets as N-Triples (zip), with a SPARQL endpoint for the metadata]

LOD-a-lot

28B triples

Disk size:

HDT: 304 GB

HDT-FoQ (additional indexes): 133 GB

Memory footprint (to query):

15.7 GB of RAM (~3% of the size)

144 seconds loading time

(8 cores at 2.6 GHz, 32 GB RAM, SATA HDD, on Ubuntu 14.04.5 LTS)

LDF page resolution in milliseconds


LOD-a-lot (some numbers)

305€ (cost of the machine used to query it)

(LOD-a-lot creation took 64 h & 170 GB RAM; building HDT-FoQ took 8 h & 250 GB RAM)


LOD-a-lot

https://datahub.io/dataset/lod-a-lot

http://purl.org/HDT/lod-a-lot

Query resolution at Web scale

(using LDF, Jena)

Evaluation and Benchmarking

No excuse!

RDF metrics and analytics

LOD-a-lot (some use cases)

subjects / predicates / objects

Identity closure:

?x owl:sameAs ?y

Graph navigations:

E.g., shortest path, random walk
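The identity-closure use case is essentially union-find over owl:sameAs pairs; a sketch with invented IRIs:

```python
# owl:sameAs identity closure as union-find: every chain of co-identical IRIs
# ends up in the same equivalence class (toy pairs, made up for illustration).
pairs = [("dbpedia:Vienna", "wikidata:Q1741"),
         ("wikidata:Q1741", "geonames:2761369"),
         ("dbpedia:Graz", "wikidata:Q13298")]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in pairs:
    union(a, b)

classes = {}
for iri in parent:
    classes.setdefault(find(iri), set()).add(iri)
print(sorted(len(c) for c in classes.values()))  # -> [2, 3]
```

At LOD-a-lot scale the same pass runs over hundreds of millions of sameAs triples, which is exactly what a single-file, sequentially readable HDT copy makes practical.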


Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.

More use cases:

http://hdt.lod.labs.vu.nl/triple?object=%22Axel%20Polleres%22

Retrieve all entities in LOD with the label "Axel Polleres"

Update LOD-a-lot regularly:

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

(currently supported only via LOD Laundromat)

… implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

ACKs

RDF Archiving: archiving policies

a) Independent Copies/Snapshots (IC): each version is stored as a full snapshot

V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

b) Change-based approach (CB): V1 is stored as a full snapshot; later versions only as deltas

Δ(V1,V2): added ex:S3 ex:study ex:C1 — deleted ex:S2 ex:study ex:C1
Δ(V2,V3): added ex:C1 ex:hasProfessor ex:P2 and ex:C1 ex:hasProfessor ex:S2 — deleted ex:C1 ex:hasProfessor ex:P1

c) Timestamp-based approach (TB): each triple is annotated with the versions in which it holds

ex:C1 ex:hasProfessor ex:P1 [V1,V2]
ex:C1 ex:hasProfessor ex:P2 [V3]
ex:C1 ex:hasProfessor ex:S2 [V3]
ex:S1 ex:study ex:C1 [V1,V2,V3]
ex:S2 ex:study ex:C1 [V1]
ex:S3 ex:study ex:C1 [V2,V3]

A retrieval mediator sits on top of each storage policy to answer queries across versions.
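The three archiving policies can be computed mechanically from version snapshots; a sketch using the same toy course/student data as the example:

```python
# IC, CB and TB representations derived from toy version snapshots.
V1 = {("ex:C1", "ex:hasProfessor", "ex:P1"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S2", "ex:study", "ex:C1")}
V2 = {("ex:C1", "ex:hasProfessor", "ex:P1"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S3", "ex:study", "ex:C1")}
V3 = {("ex:C1", "ex:hasProfessor", "ex:P2"),
      ("ex:C1", "ex:hasProfessor", "ex:S2"),
      ("ex:S1", "ex:study", "ex:C1"),
      ("ex:S3", "ex:study", "ex:C1")}
versions = [V1, V2, V3]

# a) IC: store every snapshot as-is -> just `versions`

# b) CB: store V1 plus (added, deleted) deltas between consecutive versions
deltas = [(new - old, old - new) for old, new in zip(versions, versions[1:])]

# c) TB: annotate each triple with the versions in which it holds
tb = {}
for i, v in enumerate(versions, start=1):
    for t in v:
        tb.setdefault(t, []).append("V%d" % i)

print(deltas[0])  # delta(V1,V2): added S3-study-C1, deleted S2-study-C1
print(tb[("ex:S1", "ex:study", "ex:C1")])  # -> ['V1', 'V2', 'V3']
```

The trade-off is the classic one: IC is cheap to query but wasteful in space, CB is compact but needs delta replay to materialize a version, and TB supports cross-version queries directly at the cost of per-triple annotations.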

Democratizing Open Data preservation/monitoring

Enhance the usability of Open Data and enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN-powered Open Data Portals

Quality assessment

Evolution tracking

(Metadata and Data)

The CommuniData Project

http://data.wu.ac.at/portalwatch

Jürgen Umbrich, Sebastian Neumaier, Axel Polleres. Images: Ad Meskens, Doug Coulter

We are currently facing Big Linked Data challenges:

Generation, publication and consumption

Archiving, evolution, …

Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow

Compression democratizes the access to Big Linked Data

= Cheap, scalable consumers

low-cost access to LOD = high-impact research

Take-home messages

Thank you!

TOP-K shortest path

Vadim Savenkov


c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 24: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

PAGE 24

LDF interfaces

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 25: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

LOD-a-lot

25

But what about Web-scale queries

- flashback -

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 26: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

26

LOD

Laundromat

Dataset 1

N-Triples (zip)

Dataset 650K

N-Triples (zip)

Linked Open Data

SPARQL endpoint

(metadata)

LOD-a-lot

LOD-a-lot28B triples

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 27: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Disk size

HDT 304 GB

HDT-FoQ (additional indexes) 133 GB

Memory footprint (to query)

157 GB of RAM (3 of the size)

144 seconds loading time

8 cores (26 GHz) RAM 32 GB SATA HDD on Ubuntu 14045 LTS

LDF page resolution in milliseconds

27

LOD-a-lot (some numbers)

305euro

(LOD-a-lot creation took 64 h amp 170GB RAM HDT-FoQ took 8 h amp 250GB RAM)

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 28: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

28

LOD-a-lot

httpsdatahubiodatasetlod-a-lot

httppurlorgHDTlod-a-lot

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 29: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Query resolution at Web scale

Using LDF Jena

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

29

LOD-a-lot (some use cases)

subjects predicates objects

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 30: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Identity closure

x owlsameAs y

Graph navigations

Eg shortest path random walk

30

LOD-a-lot (some use cases)

Wouter Beek Javier D Fernaacutendez and Ruben Verborgh LOD-a-lot A Single-File Enabler for Data Science In Proc of SEMANTiCS 2017

More use cases

httphdtlodlabsvunltripleobject=22Axel20Polleres22

Retrieve all entities in LOD with the label ldquoAxel Polleresldquo

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 31: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

Update LOD-a-lot regularly

More and newer datasets from the LOD Cloud

Keep named graphs with the provenance of each triple

Currently supported only via LOD Laundromat

hellip implement the use cases and help the community to democratize the access to LOD

low-cost access to LOD = high-impact research

Roadmap

21

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 32: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

32

ACKs

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov

Page 33: Democratizing Big Semantic Data management · Democratizing Big Semantic Data management … or how to query a labelled graph with 28 billion edges in a standard laptop Javier D

36

RDF Archiving Archiving policies

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS3 exstudy exC1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2 exS1 exstudy exC1 exS3 exstudy exC1

V2 V3

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

V1

exC1 exhasProfessor exP1 exS1 exstudy exC1 exS2 exstudy exC1

exS3 exstudy exC1

exS2 exstudy exC1

exC1 exhasProfessor exP1

exC1 exhasProfessor exP2 exC1 exhasProfessor exS2

V12

3exC1 exhasProfessor exP1 [V1V2]exC1 exhasProfessor exP2 [V3]exC1 exhasProfessor exS2 [V3]exS1 exstudy exC1 [V1V2V3]exS2 exstudy exC1 [V1]exS3 exstudy exC1 [V2V3]

a) Independent CopiesSnapshots (IC)

b) Change-based approach (CB)

c) Timestamp-based approach (TB)

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

RETRIEVAL MEDIATOR

Democratizing Open Data preservationmonitoring

Enhance usability of Open Data and to enhance its accessibility for non-expert users

Deep search and re-usable visualization components

Integrate Open Data support into the online discussion and Web Intelligence platforms

OPEN DATA PORTAL WATCH

Vadim Savenkov

Periodically monitoring a list of Open Data Portals

90 CKAN powered Open Data Portals

Quality assessment

Evolution tracking

Meta data

Data

The CommuniData Project

httpdatawuacatportalwatch

Juumlrgen UmbrichSebastian NeumaierAxel Polleres ImagesAd Meskens Doug Coulter

We are currently facing Big Linked Data challenges

Generation publication and consumption

Archiving evolutionhellip

Thanks to compression the Big Linked Data today will be the ldquopocketrdquo data tomorrow

Compression democratizes the access to Big Linked Data

= Cheap scalable consumers

low-cost access to LOD = high-impact research

PAGE 40

Take-home messages

Thank you

TOP-K shortest path

Vadim Savenkov
