Compressed RDF: Practical Uses & Hands-on. Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto. 23rd August 2017. 3rd KEYSTONE Training School: Keyword search in Big Linked Data

Page 1: Compressed RDF: Practical Uses & Hands-on

Compressed RDF: Practical Uses & Hands-on

Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto

23RD AUGUST 2017

3rd KEYSTONE Training School

Keyword search in Big Linked Data

Page 2: Compressed RDF: Practical Uses & Hands-on

General agenda

Session I (09:00 - 10:30) “Basics of Compression for Big Linked Data Management”

Big (Linked) Semantic Data Compression: motivation & challenges

Compact Data Structures

Session II (13:30 - 15:00) “RDF Compression”

RDF Compression. HDT

RDF Dictionaries

RDF Triples

Session III (15:30 - 17:00) “Compressed RDF: Practical Uses & Hands-on”

Practical Uses (LOD-a-lot, RDF Archiving, etc.)

Hands on

images: zurb.com

Page 3: Compressed RDF: Practical Uses & Hands-on

Agenda of this session

Practical uses

LOD-a-lot: Web-scale queries in your pocket

RDF archiving

Linked Data markets (Linked Close Data)

Hands on

HDT-it

Command line tools

HDT and Fuseki

HDT and Linked Data Fragments

HDT and C++/Java

HDT and Jena

Page 4: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot

Use case 1

Page 5: Compressed RDF: Practical Uses & Hands-on

Still… what about Web-scale queries?

E.g. retrieve all entities in LOD with the label “Axel Polleres”:

select distinct ?x { ?x rdfs:label "Axel Polleres" }

Options:

Crawl and index LOD locally (-no-)

Follow-your-nose (where should I start?)

Federated querying (as good as the endpoints you query)

Use LOD Laundromat as a “good approximation” (still querying 650K datasets)

Page 6: Compressed RDF: Practical Uses & Hands-on

LOD Laundromat

[Figure: Linked Open Data is processed by LOD Laundromat into Dataset 1 … Dataset 650K, each available as N-Triples (zip), plus a SPARQL endpoint (metadata)]

Page 7: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot

But what about Web-scale queries

- flashback -

Page 8: Compressed RDF: Practical Uses & Hands-on

The real motivation

consume

Page 9: Compressed RDF: Practical Uses & Hands-on

The real motivation

“Oh man, I’m hungry and I don’t even know if I will like whatever you are cooking”

consume

Image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/

Page 11: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot

But what about Web-scale queries?

But one could be really hungry

Image: https://hwy55burgers.wordpress.com/tag/food-challenge/

Page 12: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot

[Figure: Linked Open Data is processed by LOD Laundromat into Dataset 1 … Dataset 650K, each available as N-Triples (zip), plus a SPARQL endpoint (metadata); all datasets are then merged into LOD-a-lot]

28B triples

Kudos: Javier D. Fernandez, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias

Page 13: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot (some numbers)

Disk size:

HDT: 304 GB

HDT-FoQ (additional indexes): 133 GB

Memory footprint (to query):

15.7 GB of RAM (3% of the size)

144 seconds loading time

8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS

LDF page resolution in milliseconds.

305€

(LOD-a-lot creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM)

Page 14: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot

https://datahub.io/dataset/lod-a-lot

http://purl.org/HDT/lod-a-lot

Page 15: Compressed RDF: Practical Uses & Hands-on

LOD-a-lot (some use cases)

Query resolution at Web scale

Evaluation and Benchmarking

No excuse

RDF metrics and analytics

[Figure: subjects, predicates, objects]

Page 16: Compressed RDF: Practical Uses & Hands-on

ACKs LOD-a-lot

Page 17: Compressed RDF: Practical Uses & Hands-on

Archiving

Use case 2

Page 18: Compressed RDF: Practical Uses & Hands-on

So far so good... But RDF is evolving

[Figure: number of sources (10^0 to 10^6) against update rate (year, month, week, day, hour, minute, second), placing DBpedia, BTC, Dyldo, Internet of Things and Virtual/Augmented Reality; where do LOD-a-lot versions fit?]

Source: Andreas Harth, Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015

Page 19: Compressed RDF: Practical Uses & Hands-on

Linked Data Archives: The missing link in the RDF evolution

Most Semantic Web/Linked Data tools are focused on this “static view” but do not consider versioning/evolution

Sindice, SWSE, Swoogle, LOD Cache, LOD-Laundromat… so far, no versions!

Page 20: Compressed RDF: Practical Uses & Hands-on

Preservation matters

Web archives: Common Crawl, Internet Memory, Internet Archive, …

Page 21: Compressed RDF: Practical Uses & Hands-on

…in the last few years:

Research projects

Managing the Evolution and Preservation of the Data Web (FP7)

Preserving Linked Data (FP7)

Archives

Tools

Benchmarking

one of the fundamental problems in the Web of Data

BEnchmark of RDF ARchives

RDF evolution at Scale

v-RDFCSA

Page 23: Compressed RDF: Practical Uses & Hands-on

RDF Archiving. Archiving policies

a) Independent Copies/Snapshots (IC)

V1:
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S2 ex:study ex:C1 .

V2:
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S3 ex:study ex:C1 .

V3:
ex:C1 ex:hasProfessor ex:P2 .
ex:C1 ex:hasProfessor ex:S2 .
ex:S1 ex:study ex:C1 .
ex:S3 ex:study ex:C1 .

b) Change-based approach (CB)

V1:
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S2 ex:study ex:C1 .

V1 to V2, added:
ex:S3 ex:study ex:C1 .

V1 to V2, deleted:
ex:S2 ex:study ex:C1 .

V2 to V3, added:
ex:C1 ex:hasProfessor ex:P2 .
ex:C1 ex:hasProfessor ex:S2 .

V2 to V3, deleted:
ex:C1 ex:hasProfessor ex:P1 .

c) Timestamp-based approach (TB)

V1,2,3:
ex:C1 ex:hasProfessor ex:P1 [V1,V2] .
ex:C1 ex:hasProfessor ex:P2 [V3] .
ex:C1 ex:hasProfessor ex:S2 [V3] .
ex:S1 ex:study ex:C1 [V1,V2,V3] .
ex:S2 ex:study ex:C1 [V1] .
ex:S3 ex:study ex:C1 [V2,V3] .

RETRIEVAL MEDIATOR (one per policy)
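The three policies above can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: triples are plain strings without the ex: prefix, everything lives in memory, and the class and method names (ArchivePolicies, Delta, materializeCB, materializeTB) are hypothetical, not part of HDT, Jena or BEAR. It derives the CB deltas and the TB annotations from the IC snapshots and rebuilds a version from each layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative sketch only: the V1..V3 example stored under the IC, CB and TB policies. */
class ArchivePolicies {

    /** CB: triples added and deleted between two consecutive versions. */
    static final class Delta {
        final Set<String> added;
        final Set<String> deleted;

        Delta(Set<String> added, Set<String> deleted) {
            this.added = added;
            this.deleted = deleted;
        }
    }

    private static Set<String> version(String... triples) {
        return new TreeSet<>(Arrays.asList(triples));
    }

    /** IC: every version is stored as a full snapshot (index 0 holds V1). */
    static List<Set<String>> snapshots() {
        return Arrays.asList(
            version("C1 hasProfessor P1", "S1 study C1", "S2 study C1"),
            version("C1 hasProfessor P1", "S1 study C1", "S3 study C1"),
            version("C1 hasProfessor P2", "C1 hasProfessor S2", "S1 study C1", "S3 study C1"));
    }

    /** CB: keep V1 in full and only the changes between consecutive versions. */
    static List<Delta> toChangeBased(List<Set<String>> ic) {
        List<Delta> deltas = new ArrayList<>();
        for (int i = 1; i < ic.size(); i++) {
            Set<String> added = new TreeSet<>(ic.get(i));
            added.removeAll(ic.get(i - 1));
            Set<String> deleted = new TreeSet<>(ic.get(i - 1));
            deleted.removeAll(ic.get(i));
            deltas.add(new Delta(added, deleted));
        }
        return deltas;
    }

    /** Rebuilds version v (1-based) by replaying the first v-1 deltas on top of V1. */
    static Set<String> materializeCB(Set<String> first, List<Delta> deltas, int v) {
        Set<String> result = new TreeSet<>(first);
        for (int i = 0; i < v - 1; i++) {
            result.removeAll(deltas.get(i).deleted);
            result.addAll(deltas.get(i).added);
        }
        return result;
    }

    /** TB: each distinct triple is stored once, annotated with the versions it occurs in. */
    static Map<String, Set<Integer>> toTimestampBased(List<Set<String>> ic) {
        Map<String, Set<Integer>> tb = new TreeMap<>();
        for (int v = 1; v <= ic.size(); v++) {
            for (String triple : ic.get(v - 1)) {
                tb.computeIfAbsent(triple, t -> new TreeSet<>()).add(v);
            }
        }
        return tb;
    }

    /** Rebuilds version v (1-based) by keeping the triples annotated with v. */
    static Set<String> materializeTB(Map<String, Set<Integer>> tb, int v) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : tb.entrySet()) {
            if (entry.getValue().contains(v)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> ic = snapshots();
        List<Delta> cb = toChangeBased(ic);
        Map<String, Set<Integer>> tb = toTimestampBased(ic);
        System.out.println("CB delta V2 to V3: +" + cb.get(1).added + " -" + cb.get(1).deleted);
        System.out.println("TB annotations: " + tb);
        System.out.println("V3 from CB matches snapshot: " + materializeCB(ic.get(0), cb, 3).equals(ic.get(2)));
        System.out.println("V3 from TB matches snapshot: " + materializeTB(tb, 3).equals(ic.get(2)));
    }
}
```

The trade-off shows even at this size: IC answers a version lookup directly but repeats unchanged triples, CB stores the least per version but has to replay deltas, and TB stores each triple once at the cost of filtering by annotation.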

Page 24: Compressed RDF: Practical Uses & Hands-on

BEAR

https://aic.ai.wu.ac.at/qadlod/bear.html

Page 25: Compressed RDF: Practical Uses & Hands-on

BEAR: Benchmarking the Efficiency of RDF Archiving

Queries and systems

We implemented and evaluated archiving systems on Jena-TDB and HDT, based on the IC, CB and TB policies.

These serve as an initial baseline to compare archiving systems.

More info: https://aic.ai.wu.ac.at/qadlod/bear.html

Page 27: Compressed RDF: Practical Uses & Hands-on

Benchmarking: Define the queries

Instantiation of archive queries in AnQL [1]

Mat(Q,V1): version materialization

SELECT * WHERE { Q :[v1] }

Diff(Q,V1,V2)

Ver(Q)

join(Q1,vi,Q2,vj)

Change(Q)

[1] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics (JWS), 12:72--95, March 2012.

Page 28: Compressed RDF: Practical Uses & Hands-on

Benchmarking: Define the queries

Instantiation of archive queries in AnQL

Mat(Q,V1)

Diff(Q,V1,V2): delta materialization

SELECT * WHERE {
  { { {Q :[v1]} MINUS {Q :[v2]} } BIND (v1 AS ?V ) }
  UNION
  { { {Q :[v2]} MINUS {Q :[v1]} } BIND (v2 AS ?V ) }
}

Ver(Q)

join(Q1,vi,Q2,vj)

Change(Q)

Page 29: Compressed RDF: Practical Uses & Hands-on

Benchmarking: Define the queries

Instantiation of archive queries in AnQL

Mat(Q,V1)

Diff(Q,V1,V2)

Ver(Q): results of Q annotated with the version

SELECT * WHERE { Q :?V }

join(Q1,vi,Q2,vj)

Change(Q)

Page 30: Compressed RDF: Practical Uses & Hands-on

Benchmarking: Define the queries

Instantiation of archive queries in AnQL

Mat(Q,V1)

Diff(Q,V1,V2)

Ver(Q)

join(Q1,v1,Q2,v2)

SELECT * WHERE { {Q1 :[v1]} {Q2 :[v2]} }

Change(Q)

Page 31: Compressed RDF: Practical Uses & Hands-on

Benchmarking: Define the queries

Instantiation of archive queries in AnQL

Mat(Q,V1)

Diff(Q,V1,V2)

Ver(Q)

join(Q1,vi,Q2,vj)

Change(Q): returns consecutive versions in which Diff of a query is not null

SELECT ?V1 ?V2 WHERE {
  { {Q :?V1} MINUS {Q :?V2} }
  UNION
  { {Q :?V2} MINUS {Q :?V1} }
  FILTER( abs(?V1-?V2) = 1 )
}

Open question remains: What is the right query syntax for archive queries?
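To make the semantics of the archive queries above concrete, here is a minimal sketch in plain Java. It does not use AnQL or any archive system: the archive is a list of in-memory snapshots holding the V1..V3 example, the query Q is approximated by a predicate over triple strings, and all names (ArchiveQueries, withPredicate, mat, diff, ver, change) are hypothetical. The join query is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Illustrative sketch only: Mat, Diff, Ver and Change over in-memory snapshots. */
class ArchiveQueries {
    private final List<Set<String>> versions; // index 0 holds V1

    ArchiveQueries(List<Set<String>> versions) {
        this.versions = versions;
    }

    /** Hypothetical stand-in for a query Q: keeps "s p o" triples with the given predicate. */
    static Predicate<String> withPredicate(String predicate) {
        return triple -> triple.split(" ")[1].equals(predicate);
    }

    /** Mat(Q, v): results of Q in version v (1-based). */
    Set<String> mat(Predicate<String> q, int v) {
        Set<String> out = new TreeSet<>();
        for (String triple : versions.get(v - 1)) {
            if (q.test(triple)) {
                out.add(triple);
            }
        }
        return out;
    }

    /** Diff(Q, vi, vj): results found only in vi (key vi) and only in vj (key vj). */
    Map<Integer, Set<String>> diff(Predicate<String> q, int vi, int vj) {
        Set<String> onlyI = mat(q, vi);
        onlyI.removeAll(mat(q, vj));
        Set<String> onlyJ = mat(q, vj);
        onlyJ.removeAll(mat(q, vi));
        Map<Integer, Set<String>> out = new TreeMap<>();
        out.put(vi, onlyI);
        out.put(vj, onlyJ);
        return out;
    }

    /** Ver(Q): every result of Q annotated with the versions in which it holds. */
    Map<String, Set<Integer>> ver(Predicate<String> q) {
        Map<String, Set<Integer>> out = new TreeMap<>();
        for (int v = 1; v <= versions.size(); v++) {
            for (String triple : mat(q, v)) {
                out.computeIfAbsent(triple, t -> new TreeSet<>()).add(v);
            }
        }
        return out;
    }

    /** Change(Q): each v such that Diff(Q, v, v+1) is not empty. */
    List<Integer> change(Predicate<String> q) {
        List<Integer> out = new ArrayList<>();
        for (int v = 1; v < versions.size(); v++) {
            Map<Integer, Set<String>> d = diff(q, v, v + 1);
            if (!d.get(v).isEmpty() || !d.get(v + 1).isEmpty()) {
                out.add(v);
            }
        }
        return out;
    }

    private static Set<String> version(String... triples) {
        return new TreeSet<>(Arrays.asList(triples));
    }

    /** The V1..V3 example from the archiving policies slide, without the ex: prefix. */
    static ArchiveQueries example() {
        return new ArchiveQueries(Arrays.asList(
            version("C1 hasProfessor P1", "S1 study C1", "S2 study C1"),
            version("C1 hasProfessor P1", "S1 study C1", "S3 study C1"),
            version("C1 hasProfessor P2", "C1 hasProfessor S2", "S1 study C1", "S3 study C1")));
    }

    public static void main(String[] args) {
        ArchiveQueries archive = example();
        Predicate<String> q = withPredicate("hasProfessor");
        System.out.println("Mat(Q,1)    = " + archive.mat(q, 1));
        System.out.println("Diff(Q,1,3) = " + archive.diff(q, 1, 3));
        System.out.println("Ver(Q)      = " + archive.ver(q));
        System.out.println("Change(Q)   = " + archive.change(q));
    }
}
```

Evaluating over full snapshots corresponds to the IC policy; a CB or TB store would need to rebuild or filter versions first, which is the cost difference that a benchmark such as BEAR measures.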

Page 32: Compressed RDF: Practical Uses & Hands-on

Time-based access. Queries

Materialize (s,?,? ; version)

Page 33: Compressed RDF: Practical Uses & Hands-on

Time-based access. Queries

diff(?,?,o ; version0 ; version t)

Page 34: Compressed RDF: Practical Uses & Hands-on

Self-Indexing RDF Archives: v-RDFCSA

RDFCSA: Compressed Suffix Array

v-RDFCSA [2] is designed as a lightweight TB approach

Version information encoding

Any triple can be identified by the position of its subject within SA

Let N be the number of different versions and n the number of version-oblivious triples

Two encoding strategies

tpv: N bitsequences Bv_i[1, n], each encoding which triples appear in version i

vpt: n bitsequences Bt_k[1, N], each encoding the versions in which the k-th triple occurs

tpv (one bitsequence per version):

Triples  1 2 3 4 5
Bv1      0 1 1 0 1
Bv2      0 1 0 1 0
Bv3      1 0 0 0 1

vpt (one bitsequence per triple):

           Bt1 Bt2 Bt3 Bt4 Bt5
Version 1  0   1   1   0   1
Version 2  0   1   0   1   0
Version 3  1   0   0   0   1

v-RDFCSA resolves queries more than one order of magnitude faster than Jena-TDB

[2] Ana Cerdeira-Pena, Antonio Fariña, Javier D. Fernández, and Miguel A. Martínez-Prieto. Self-Indexing RDF Archives. Data Compression Conference (DCC), 2016.
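The tpv and vpt layouts above are two views of the same triple-by-version occurrence matrix. The sketch below builds both for the 3-version, 5-triple example on the slide. It is only an illustration: java.util.BitSet stands in for the compact bitsequences used by v-RDFCSA, triple identifiers are plain positions 1..5 rather than suffix-array positions, and the names (VersionBitmaps, setPositions) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative sketch only: the tpv and vpt encodings of the slide example. */
class VersionBitmaps {

    /** Occurrence matrix from the slide: one row per version, one column per triple. */
    static final int[][] EXAMPLE = {
        {0, 1, 1, 0, 1},  // version 1
        {0, 1, 0, 1, 0},  // version 2
        {1, 0, 0, 0, 1},  // version 3
    };

    /** tpv: N bitsequences of n bits; bit k-1 of sequence i-1 is set if triple k is in version i. */
    static BitSet[] tpv(int[][] occurs) {
        BitSet[] bv = new BitSet[occurs.length];
        for (int v = 0; v < occurs.length; v++) {
            bv[v] = new BitSet(occurs[v].length);
            for (int k = 0; k < occurs[v].length; k++) {
                if (occurs[v][k] == 1) {
                    bv[v].set(k);
                }
            }
        }
        return bv;
    }

    /** vpt: n bitsequences of N bits; bit i-1 of sequence k-1 is set if triple k is in version i. */
    static BitSet[] vpt(int[][] occurs) {
        int n = occurs[0].length;
        BitSet[] bt = new BitSet[n];
        for (int k = 0; k < n; k++) {
            bt[k] = new BitSet(occurs.length);
            for (int v = 0; v < occurs.length; v++) {
                if (occurs[v][k] == 1) {
                    bt[k].set(v);
                }
            }
        }
        return bt;
    }

    /** Returns the 1-based positions of the set bits. */
    static List<Integer> setPositions(BitSet bits) {
        List<Integer> out = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out.add(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet[] byVersion = tpv(EXAMPLE);
        BitSet[] byTriple = vpt(EXAMPLE);
        // tpv answers "which triples are in version 1?" with a single bitsequence.
        System.out.println("Triples in version 1: " + setPositions(byVersion[0]));
        // vpt answers "in which versions does triple 5 occur?" with a single bitsequence.
        System.out.println("Versions of triple 5: " + setPositions(byTriple[4]));
    }
}
```

Each layout favours one access path: tpv reads one bitsequence to materialize a version, while vpt reads one bitsequence to list the versions of a triple; answering the opposite question requires probing every bitsequence.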

Page 35: Compressed RDF: Practical Uses & Hands-on

Linked Open/Close Data (Linked Data markets)

Use case 3

Page 36: Compressed RDF: Practical Uses & Hands-on

So far so good, but... Linked Open/Close Data

[Figure: the Linked Open Data Cloud (dbpedia, …) next to a Linked Closed Data Cloud made of graphs G1a, G1b, G1c, G2a, G2b, G2c, G3a, G3b, G4a]

“Deep Semantic Web”

Page 37: Compressed RDF: Practical Uses & Hands-on

Linked Open/Close Data

Page 38: Compressed RDF: Practical Uses & Hands-on

Linked Open/Close Data

A) Efficient Exchange: Compression + Encryption (hdtcrypt)

Page 39: Compressed RDF: Practical Uses & Hands-on

Linked Open/Close Data

B) A secure LD Endpoint

Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal. In ESWC’17

Future work:

Page 40: Compressed RDF: Practical Uses & Hands-on

Hands on!

Find these slides at: https://aic.ai.wu.ac.at/qadlod/presentations/keystoneHandsOn2017.pdf

Code: https://aic.ai.wu.ac.at/qadlod/presentations/codeKeystone2017

Page 41: Compressed RDF: Practical Uses & Hands-on

1) Desktop tool HDT-it!

Thanks to Mario Arias

Consuming HDT

Page 42: Compressed RDF: Practical Uses & Hands-on

1) Desktop tool HDT-it!

Download the tool for your OS:

http://www.rdfhdt.org/downloads/

Get an HDT dataset from the web

http://www.rdfhdt.org/datasets/

OR

http://lodlaundromat.org/wardrobe/

OR convert your RDF dataset with the tool.

As a suggestion of small datasets:

SWDF (242K triples) or the bigger DBLP (55M triples)

Consuming HDT

Page 43: Compressed RDF: Practical Uses & Hands-on

Consuming HDT

2) Command line Tools (C++ and Java)

rdfhdt.org

                             HDT-C++   HDT-Java
Command Line tools           X         X
TP search                    X         X
Full SPARQL                  -         with Jena
Parametrizable Compression   X         -
Full text support            X         -
Practical Uses               LDF       Jena, Fuseki

Page 44: Compressed RDF: Practical Uses & Hands-on

2) Command line Tools (C++ and Java)

For simplicity, in this lecture we will use Java

Download hdt-java library from https://github.com/rdfhdt/hdt-java/

git clone https://github.com/rdfhdt/hdt-java.git

or download https://github.com/rdfhdt/hdt-java/archive/master.zip

Install the library with maven:

mvn install

Query an HDT file:

Go to hdt-java-cli and execute:

./bin/hdtSearch.sh /path/to/your/hdt

This will open a simple console where you can query triple patterns

Export/Import

$> rdf2hdt file.nt output.hdt

$> hdt2rdf file.hdt output.nt

Consuming HDT

Page 45: Compressed RDF: Practical Uses & Hands-on

3) Set up a SPARQL Endpoint with HDT and Fuseki

Go to hdt-fuseki and compile adding the dependencies:

mvn package dependency:copy-dependencies

Run fuseki

./bin/hdtEndpoint.sh --hdt path/to/dataset.hdt /mydataset

Open your Web Browser and go to: http://localhost:3030

Select Control Panel / Dataset / myDataset and click Select

Type your SPARQL Query and see the results.

Be careful with the number of results: unlike e.g. Virtuoso, this endpoint does not cap the result size, so add a LIMIT yourself:

select * WHERE{ ?s ?p ?o} LIMIT 400

Consuming HDT

Page 46: Compressed RDF: Practical Uses & Hands-on

4) Set up a Linked Data Fragments Endpoint with HDT

Download the LDF Server (the Node.js version is the best one, but we will use the Java version because it is simpler to install).

git clone https://github.com/LinkedDataFragments/Server.Java.git

or download https://github.com/LinkedDataFragments/Server.Java/archive/master.zip

Install the server, avoid the test (it fails :)

mvn install -Dmaven.test.skip=true

Open the file config-example.json and modify the settings to point to your hdt, e.g.

"settings": { "file": "/home/user/myfile.hdt" }

Run the server

java -jar target/ldf-server.jar

Access http://localhost:8080

Consuming HDT

Page 47: Compressed RDF: Practical Uses & Hands-on

5) Access with the HDT C++/Java libraries (again, we restrict here to Java)

JAVADOC:

http://purl.org/HDT/javadoc/api

http://purl.org/HDT/javadoc/core

I will refer to Eclipse and Maven but you can use your preferred environment

Consuming HDT

Page 48: Compressed RDF: Practical Uses & Hands-on

Setting up the environment…

Create a new maven project

Consuming HDT / HDT-java library

Page 49: Compressed RDF: Practical Uses & Hands-on

Setting up the environment…

Create a new maven project

Select to create a simple project (skip archetype selection)

Consuming HDT / HDT-java library

Page 50: Compressed RDF: Practical Uses & Hands-on

Setting up the environment…

Create a new maven project

With a simple archetype

And any metadata

Consuming HDT / HDT-java library

Page 51: Compressed RDF: Practical Uses & Hands-on

Setting up the environment…

Include the maven dependency of hdt-java-core in the pom.xml

Consuming HDT / HDT-java library

Page 52: Compressed RDF: Practical Uses & Hands-on

Setting up the environment…

Include the maven dependency of hdt-java-core in the pom.xml

Finally, let’s create a new Class and query our HDT

Consuming HDT / HDT-java library

- Test other queries

- Get the S, P, O of the solution

Page 53: Compressed RDF: Practical Uses & Hands-on

Let’s access the dictionary of terms in HDT

Consuming HDT / HDT-java library

- Open two HDT files

- Use the dictionaries to get the common predicates used in both

Page 54: Compressed RDF: Practical Uses & Hands-on

Let’s access the terms as IDs

Consuming HDT / HDT-java library

- Use the estimation of results to count the cardinality of all subjects

- We can build a histogram and see the distribution

Page 55: Compressed RDF: Practical Uses & Hands-on

6) Query full SPARQL with Jena and HDT

First, include the hdt-jena dependency in pom.xml

Consuming HDT

Page 56: Compressed RDF: Practical Uses & Hands-on

6) Query full SPARQL with Jena and HDT

First, include the hdt-jena dependency in pom.xml

Import HDT into a model and query!

Consuming HDT

- Test other queries over your data

Page 57: Compressed RDF: Practical Uses & Hands-on

+) Query LOD-a-lot

First, get the correct hdt-java branch to deal with really long IDs

git clone -b long-dict-id https://github.com/rdfhdt/hdt-java/

Install, avoid the test

mvn install -Dmaven.test.skip=true

Increase the Java heap space

export MAVEN_OPTS="-Xmx25G"

In hdt-java-cli

./bin/hdtSearch.sh /media/javi/data/lod-a-lot/LOD_a_lot_v1.hdt

Consuming HDT

Page 58: Compressed RDF: Practical Uses & Hands-on

Let’s end the lecture…

Page 59: Compressed RDF: Practical Uses & Hands-on

Take-home messages

We are currently facing Big Linked Data challenges

Generation, publication and consumption

Thanks to compression, the Big Linked Data today will be the “pocket” data tomorrow

Compression is not just about space

Fast exchange

Fast processing/management

Fast querying

Compression democratizes the access to Big Linked Data

= Cheap, scalable consumers

Page 60: Compressed RDF: Practical Uses & Hands-on

Thank you!
