SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

ISWC'12 research track

Page 1: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Institute for Web Science and Technologies

University of Koblenz ▪ Landau, Germany

Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Olaf Görlitz, Matthias Thimm, Steffen Staab

Page 2: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

ISWC'12, Boston, 11/15/2012

SPLODGE: Systematic LOD Benchmark Query Generation
Olaf Görlitz, Matthias Thimm, Steffen Staab

Linked Data Federation

SPARQL Queries on the Linked Data Cloud

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 3: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

The Problem

distributed queries

federation implementation

Why not use benchmark queries?

Page 4: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

RDF Benchmarks

LUBM, BSBM, SP²B, ...

• Synthetic datasets
• Domain-specific
• Highly structured
• Sophisticated queries

FedBench (ISWC'11)

• 10 Linked Data sets (~170M triples)
• 25 handpicked distributed queries

Centralized Fixed

Scalable, Flexible, Expressive Linked Data Benchmark

Page 5: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Overview

Benchmark Idea Methodology Evaluation

Page 6: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Linked Data Benchmark Features

Scalability: Real Linked Data Sets
Flexibility: Customization
Expressiveness: Typical + Complex Queries

Systematic SPARQL Benchmark Query Generator for Linked Open Data

Page 7: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Requirements

1. Define Query Characteristics
2. Automatic Query Generation
3. Query Validation

What we want:

• Customize Benchmark
• Random Queries
• #results > 0

Page 8: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Contribution

Methodology and toolset for systematic query generation

Parameterization → Query Generation → Query Validation

Linked Data

Config

Benchmark Queries

Page 9: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Overview

Benchmark Idea Methodology Evaluation

Page 10: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Define typical + challenging distributed queries

No federation query logs available

Analyze queries of benchmarks

SELECT ?drug ?keggUrl ?chebiImage WHERE {
  ?drug rdf:type drugbank:drugs .
  ?drug drugbank:keggCompoundId ?keggDrug .
  ?keggDrug bio2rdf:url ?keggUrl .
  ?drug drugbank:genericName ?drugBankName .
  ?chebiDrug purl:title ?drugBankName .
  ?chebiDrug chebi:image ?chebiImage .
}

FedBench/LifeScience#5
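
Analyzing a benchmark query such as the one above can be partly automated. The following is a minimal sketch, not the authors' tooling: it records at which positions (subject or object) each shared variable occurs and labels the join coarsely. The helper names `join_positions` and `classify_join` are hypothetical, and the three labels are a simplification assumed for illustration.

```python
from collections import defaultdict


def join_positions(patterns):
    """Return {variable: [positions]} for variables shared by two or more patterns."""
    positions = defaultdict(list)
    for subject, _predicate, obj in patterns:
        if subject.startswith("?"):
            positions[subject].append("s")
        if obj.startswith("?"):
            positions[obj].append("o")
    return {var: pos for var, pos in positions.items() if len(pos) > 1}


def classify_join(positions):
    """Label a join variable: only subjects, only objects, or mixed (path)."""
    kinds = set(positions)
    if kinds == {"s"}:
        return "star (subject)"
    if kinds == {"o"}:
        return "star (object)"
    return "path"


# Triple patterns of FedBench LifeScience#5 as (subject, predicate, object).
LS5 = [
    ("?drug", "rdf:type", "drugbank:drugs"),
    ("?drug", "drugbank:keggCompoundId", "?keggDrug"),
    ("?keggDrug", "bio2rdf:url", "?keggUrl"),
    ("?drug", "drugbank:genericName", "?drugBankName"),
    ("?chebiDrug", "purl:title", "?drugBankName"),
    ("?chebiDrug", "chebi:image", "?chebiImage"),
]

joins = {var: classify_join(pos) for var, pos in join_positions(LS5).items()}
```

Under these assumptions the query mixes both join kinds: `?drug` and `?chebiDrug` act as star anchors, while `?keggDrug` connects two patterns as a path.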

Page 11: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Algebra

• Query Form (Select, Construct, ...)
• Join Type (conj. / disj. / left-join)
• Result Modifiers (limit, offset, order by)

Structure

• Variable Patterns (s, o, s+o, ...)
• Join Patterns (star, path)
• Cross Product

Cardinality

• # Data Sources
• # Joins / Patterns
• # Results
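
The parameters listed on this slide could be collected in one configuration object. The sketch below is only an illustration under assumptions: the class `QueryConfig`, its field names, and its defaults are hypothetical and do not reflect the actual SPLODGE configuration format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JoinPattern(Enum):
    STAR = "star"
    PATH = "path"


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical parameter set; field names are illustrative only."""

    # Algebra
    query_form: str = "SELECT"        # SELECT, CONSTRUCT, ...
    join_type: str = "conjunctive"    # conjunctive / disjunctive / left-join
    limit: Optional[int] = None       # result modifier
    # Structure
    join_pattern: JoinPattern = JoinPattern.PATH
    allow_cross_product: bool = False
    # Cardinality
    num_patterns: int = 3             # number of triple patterns
    num_sources: int = 2              # number of data sources
    min_results: int = 1              # require #results > 0

    def validate(self) -> None:
        """Reject parameter combinations that cannot be satisfied."""
        if self.num_patterns < 1 or self.num_sources < 1:
            raise ValueError("need at least one pattern and one source")
        if self.num_sources > self.num_patterns:
            # Each pattern is answered by one source, so m <= n.
            raise ValueError("more sources than triple patterns")
```

A call such as `QueryConfig(num_patterns=4, num_sources=3).validate()` passes, while asking for more sources than patterns raises a `ValueError`.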

Page 12: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Main query parameter: join structure

FedBench queries: star join, path join

Page 13: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Path-join: n triple patterns, m sources (m ≤ n)

Star-join: n triple patterns, anchor node (s/o)

Additional query parameters:

• # triple patterns
• # data sources
• result size
• ...
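
The two join structures can be made concrete with a small query builder. This is a sketch under assumptions: the function names `path_join` and `star_join` are hypothetical, and the `?varN` naming simply follows the example query shown in the evaluation setup.

```python
def _wrap(patterns):
    """Render triple patterns as a SELECT * query."""
    return "SELECT * WHERE {\n  " + "\n  ".join(patterns) + "\n}"


def path_join(predicates):
    """Chain patterns: the object of pattern i is the subject of pattern i+1."""
    patterns = [
        f"?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return _wrap(patterns)


def star_join(predicates, anchor="s"):
    """All patterns share one anchor variable, as subject ('s') or object ('o')."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"?anchor <{p}> ?var{i} .")
        else:
            patterns.append(f"?var{i} <{p}> ?anchor .")
    return _wrap(patterns)
```

For n predicates, `path_join` uses n + 1 distinct variables, while `star_join` binds every pattern to the single `?anchor` variable.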

Page 14: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Iteratively add random triple pattern

Need background knowledge: level of detail?

#results > 0 ?

Predicate combinations: how provided?

owl:sameAs rdf:type

rdfs:label

foaf:knows

Page 15: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

owl:sameAs rdf:type

rdfs:label

foaf:knows

Linked Predicates

(owl:sameAs → rdf:type)

• DBpedia → geonames (43, 58)
• freebase → DBpedia (86, 72)
• ...

Characteristic Sets*

{rdfs:label, foaf:knows, …}

• DBpedia (322), rdfs:label (437), foaf:knows (322)
• ...

*[Neumann, Moerkotte, ICDE 2011]
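
Both kinds of statistics can be derived from quads in one or two passes. The sketch below is a simplified in-memory illustration, not the SPLODGE implementation: it assumes quads are `(subject, predicate, object, source)` tuples held in a list, and it counts only link occurrences, without reproducing the exact numbers shown on the slide. A characteristic set is the set of predicates that occur together on one subject.

```python
from collections import Counter, defaultdict


def characteristic_sets(quads):
    """Count how many subjects share each set of outgoing predicates."""
    preds_by_subject = defaultdict(set)
    for s, p, _o, _src in quads:
        preds_by_subject[s].add(p)
    return Counter(frozenset(ps) for ps in preds_by_subject.values())


def linked_predicates(quads):
    """Count links (p1, src1) -> (p2, src2): an object of p1 is a subject of p2."""
    out_by_subject = defaultdict(set)
    for s, p, _o, src in quads:
        out_by_subject[s].add((p, src))
    links = Counter()
    for _s, p1, o, src1 in quads:
        for p2, src2 in out_by_subject.get(o, ()):
            links[(p1, src1, p2, src2)] += 1
    return links
```

For a triple with `owl:sameAs` in one source whose object carries an `rdf:type` statement in another source, `linked_predicates` records one link between the two (predicate, source) pairs.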

Page 16: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Linked Predicates: (p1 → p2) ⊗ (p2 → p3) ⊗ (p3 → pi)

Characteristic Sets: {p1, p4}, {p1, p4, ...}
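
Chaining linked predicates can be read as a random walk over known predicate links. The following sketch illustrates that reading only; the function `generate_path` and the `links` dictionary layout (predicate mapped to the predicates known to follow it) are assumptions, not the authors' data structures.

```python
import random


def generate_path(links, length, rng=None):
    """Pick a predicate sequence in which each consecutive pair is a known link.

    Returns None when the walk reaches a predicate without successors;
    the caller may simply retry.
    """
    rng = rng or random.Random()
    candidates = sorted(links)
    if not candidates or length < 1:
        return None
    path = [rng.choice(candidates)]
    while len(path) < length:
        successors = sorted(links.get(path[-1], ()))
        if not successors:
            return None
        path.append(rng.choice(successors))
    return path
```

Restricting each step to predicates that are actually linked in the data avoids many combinations that could never produce a result, which is the motivation for using background knowledge instead of purely random predicates.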

Page 17: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

How to evaluate? Compute confidence value

Verify generated queries (#results >0)

minimum join selectivity > ε
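
A threshold on the minimum join selectivity could be checked as sketched below. The selectivity formula used here (link count divided by the product of the two predicate cardinalities) is a common textbook estimate and an assumption of this example; the slides do not state how SPLODGE computes its value. The names `join_selectivity` and `passes_threshold` are hypothetical.

```python
def join_selectivity(link_count, card_left, card_right):
    """Estimate selectivity as matching pairs over all possible pairs."""
    if card_left == 0 or card_right == 0:
        return 0.0
    return link_count / (card_left * card_right)


def passes_threshold(path, pred_counts, link_counts, epsilon=1e-3):
    """Accept a predicate path only if every join exceeds the threshold."""
    selectivities = [
        join_selectivity(
            link_counts.get((p1, p2), 0),
            pred_counts.get(p1, 0),
            pred_counts.get(p2, 0),
        )
        for p1, p2 in zip(path, path[1:])
    ]
    # A path with fewer than two predicates has no join to check.
    return bool(selectivities) and min(selectivities) > epsilon
```

Raising `epsilon` discards more candidate queries but makes it more likely that the remaining ones return at least one result.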

Page 18: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Overview

Benchmark Idea Methodology Evaluation

Page 19: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Evaluation Objective

• Verify generation of valid queries (#results > 0)
• Compare variations of query generation algorithms

Metrics:

• #queries with non-empty results
• #results per query

Baseline: "random" predicate

SPLODGElite: background knowledge

SPLODGE: + minimum join selectivity (> 10^-4 / 10^-3 / 10^-2)

Page 20: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Evaluation Setup

Real Linked Data: Billion Triple Challenge Dataset

Random queries:

• Path-joins across data sources
• 3-6 patterns, bound predicates
• 100 queries per batch

SELECT * WHERE {
  ?var1 <http://dbpedia.org/property/description> ?var2 .
  ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
  ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
  ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
  ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
}

Triple Store: RDF3X

Page 21: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Evaluation Results

[Plot: #queries vs. joined triple patterns]

Page 22: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Evaluation Results

[Plot: #results vs. joined triple patterns]

Page 23: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Estimated vs. actual result size

[Plot: actual result size vs. estimated result size]

Page 24: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Predicate Occurrence in Queries

Page 25: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Conclusion

SPLODGE provides

• Flexible query characterization + parameterization
• Methodology for Systematic & Scalable Query Generation
• Toolset as Open Source (http://code.google.com/p/splodge/)

Future Work:

• Create a LOD Federation Benchmark
• Interactive SPARQL query construction

Questions?

Page 26: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Evaluation Setup

BTC 2011 dataset in RDF3X

• pure triples, no context
• 160 GB repository file (14h loading, 200 GB tmp mem)

Page 27: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

SPLODGE Pre-Processing for BTC data

• Identify common domains (e.g. jane08.lifejournal.com/home): 3.0 h, 17 GB gzip
• Replace quad context (reduce number of sources): 4.4 h
• Sort quads + remove duplicates: 8.5 h
• Build predicate/context dictionary: 1.0 h, <1 MB gzip
• Create resource in/out-link index: 9.7 h, 1.7 GB gzip
• Create linked predicate stats, compute characteristic sets: 1.6 h
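
The context replacement and deduplication steps can be illustrated on a small scale. This sketch assumes that a "source" is approximated by the host name of the quad's context URI, which may be coarser or finer than the domains the authors actually identified, and it sorts in memory, which would not scale to the full BTC data. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def context_to_domain(context_uri):
    """Map a context URI to its host name; fall back to the URI itself."""
    host = urlsplit(context_uri).hostname
    return host if host else context_uri


def normalize_quads(quads):
    """Replace each quad's context by its domain, then sort and deduplicate."""
    replaced = ((s, p, o, context_to_domain(c)) for s, p, o, c in quads)
    return sorted(set(replaced))
```

Two quads that state the same triple in different documents of one domain collapse into a single quad after normalization, which is how the number of distinct sources shrinks.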