LODOP - Multi-Query Optimization for Linked Data Profiling Queries

Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann
Hasso Plattner Institute, Potsdam, Germany

1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014
2014/05/26

Talk at PROFILES2014, ESWC2014

LODOP - Multi-Query Optimization for Linked Data Profiling Queries. A. Jentzsch. PROFILES2014, ESWC2014.

OUTLINE

1. Challenges of Linked Data Profiling
2. Profiling Tasks
3. LODOP
4. Multi-Query Optimizations

LINKED DATA PROFILING

• Metadata often not available
  • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
  • Data registries, VoID, and Semantic Sitemaps provide only basic information, e.g. description, author and license information, estimated triple and link count
• Use cases requiring metadata
  • Query optimization
  • Data cleansing
  • Data integration
  • Schema induction
• Data profiling: methods for computing metrics / metadata for datasets

TRADITIONAL VS LINKED DATA PROFILING

• State of the art data profiling
  • Based on columns
  • Assumes well-defined semantics
  • Expects regular data
• Heterogeneity on the Web of Data
  • Diverse sources
  • Diverse structures
  • Diverse views

CHALLENGES OF LD PROFILING

• Heterogeneity
  • Nested graphs: makes reasoning difficult
  • Loose structure: things have different predicate sets
  • Incomplete: missing property definitions
  • Poorly formatted: property types used inconsistently
  • Inconsistent: multiple representations claim opposite things
• Existing (relational) data profiling tools don’t work
• Volume of data
  • Requires parallelization

LODOP - CONTRIBUTIONS

• Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
• System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop
• Development and evaluation of 3 multi-script optimization rules
• Apache Pig:
  • Platform for analyzing large datasets
  • High-level language: Pig Latin
  • Scripts executed on Hadoop / MapReduce

PROFILING TASKS

• Groupings
  • e.g. by resource, class, property type, language, vocabulary, …
• Tasks
  • Number of triples
  • Average number of triples per resource
  • Average number of triples per object URI
  • Average number of triples per context URL
  • Number of property types
  • Average number of property values
  • Number of resources
  • Number of inlinks / outlinks
  • Number of context URLs
  • Number of context PLDs
  • Property co-occurrence
  • Inverse Properties
  • URI-Literal ratio
  • Property value ranges
  • Average value length
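A few of the tasks listed above can be illustrated on a handful of triples. The following is a minimal Python sketch, not the Pig scripts used in LODOP: the toy data, the `profile` function, and the reading of "resource" as a distinct subject and of the URI-Literal ratio as a ratio over object values are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object, object_is_literal)
triples = [
    ("ex:Berlin", "rdf:type", "ex:City", False),
    ("ex:Berlin", "ex:name", "Berlin", True),
    ("ex:Berlin", "ex:country", "ex:Germany", False),
    ("ex:Germany", "rdf:type", "ex:Country", False),
    ("ex:Germany", "ex:name", "Germany", True),
]

def profile(triples):
    """Compute a few of the listed metrics over a non-empty list of triples."""
    by_resource = defaultdict(list)
    for s, p, o, is_literal in triples:
        by_resource[s].append((p, o, is_literal))
    literals = sum(1 for t in triples if t[3])
    uris = len(triples) - literals
    return {
        "triples": len(triples),
        # Assumption: a resource is a distinct subject.
        "resources": len(by_resource),
        "avg_triples_per_resource": len(triples) / len(by_resource),
        "property_types": len({t[1] for t in triples}),
        # Assumption: ratio of URI objects to literal objects.
        "uri_literal_ratio": uris / literals if literals else None,
    }

stats = profile(triples)
```

Each metric here is a single pass with a grouping step, which is the shape that maps naturally onto GROUP and FOREACH operators in a dataflow language.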

DATASETS STATISTICS

Statistics for 1M triples              DBpedia*   Freebase*   WDC RDFa**   EUNIS Species***
Number of resources                    169,035    226,834     168,736      65,843
Avg. number of triples per resource    5.9        4.4         5.9          15.2
Number of classes                      19,585     1,928       61           1
Number of property types               7,844      2,748       477          16
Number of URIs                         519,692    642,183     174,317      407,418
Number of inlinks                      207,712    192,179     35,329       78,377
Number of literals                     480,279    357,817     825,564      592,582
Avg. number of property values         127.5      363.9       2096.2       62,500.0

* source: BTC 2012 dataset
** WDC = Web Data Commons
*** EUNIS = European Environment Agency

PERFORMANCE EVALUATION

• 10-15s scheduling overhead per MapReduce job (~3.4 jobs per script)
• Earlier MapReduce jobs have longer runtimes
  • Earlier jobs handle more data ⇒ more HDFS activity
• Most scripts scale linearly
  • Most scripts reduce amount of data in workflow
  • Exceptions e.g. property co-occurrence scripts
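To put the scheduling overhead in perspective, the two figures from the first bullet can be multiplied out. This is a back-of-the-envelope sketch only: it assumes the jobs of one script run one after another, so that their scheduling overheads simply add up, and the function name is hypothetical.

```python
JOBS_PER_SCRIPT = 3.4        # average from the slide
OVERHEAD_RANGE_S = (10, 15)  # per-job scheduling overhead from the slide, in seconds

def overhead_per_script(overhead_s, jobs=JOBS_PER_SCRIPT):
    """Scheduling overhead of one script, assuming its jobs run sequentially."""
    return overhead_s * jobs

low, high = (overhead_per_script(o) for o in OVERHEAD_RANGE_S)
print(f"about {low:.0f} to {high:.0f} s of scheduling overhead per script")
```

Under that assumption a script spends roughly 34 to 51 seconds on scheduling alone, which is why reducing the number of MapReduce jobs is an optimization target in its own right.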

OPTIMIZATION GOALS

• Optimize concurrent execution of multiple scripts
  • Reduce number of operators
  • Reduce data flow between operators

NUMBER OF INSTANCES (PIG)

LODOP - SYSTEM OVERVIEW

MULTI-QUERY OPTIMIZATION

1. Merging identical operators
2. Combining FILTER operators
3. Combining FOREACH operators

STEP 0: MASTER PLAN

• Merging all logical plans into one master plan
  • Allows parallel execution
  • Reduces runtime to 25-30% of sequential execution
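The master plan idea can be sketched with a tiny operator tree. This is an illustrative Python model, not LODOP's or Pig's internal representation: the `Op` class, the textual signatures, the `ROOT` node, and the two example scripts are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """A logical operator; `signature` is a hypothetical textual description."""
    signature: str
    children: list = field(default_factory=list)

def chain(*signatures):
    """Build a linear plan (LOAD -> ... -> STORE) and return its first operator."""
    ops = [Op(s) for s in signatures]
    for parent, child in zip(ops, ops[1:]):
        parent.children.append(child)
    return ops[0]

def master_plan(plans):
    """Attach every script's plan to one virtual root, giving a single plan."""
    return Op("ROOT", list(plans))

def count_ops(op):
    """Count operators in the plan rooted at `op`, including `op` itself."""
    return 1 + sum(count_ops(c) for c in op.children)

script_a = chain("LOAD quads", "FILTER p == rdf:type", "GROUP BY o", "STORE a")
script_b = chain("LOAD quads", "FILTER p == rdf:type", "FOREACH GENERATE s", "STORE b")
master = master_plan([script_a, script_b])
```

At this stage nothing is shared yet: the master plan only places all scripts in one structure so that they can be scheduled together and so that the later rules can look for overlap between them.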

1. MERGING IDENTICAL OPERATORS

Example scripts: "Number of property types per class" and "URI Literal Ratio per class"

1. Identify and compare sibling operators

2. Merge matching siblings

• Number of operators reduced from 365 to 267
• Number of MapReduce jobs reduced from 176 to 140
• Frees up cluster resources
• Prerequisite step for other optimisations
• Restricts parallelism
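The two steps of this rule (compare sibling operators, merge the matching ones) can be sketched on a small plan tree. This is a simplified Python model under stated assumptions: operators are compared by a hypothetical textual `signature`, and the plan representation is invented for illustration rather than taken from Pig.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    signature: str  # hypothetical textual description of the operator
    children: list = field(default_factory=list)

def chain(*signatures):
    """Build a linear plan and return its first operator."""
    ops = [Op(s) for s in signatures]
    for parent, child in zip(ops, ops[1:]):
        parent.children.append(child)
    return ops[0]

def count_ops(op):
    return 1 + sum(count_ops(c) for c in op.children)

def merge_identical_siblings(parent):
    """Merge children of `parent` with equal signatures, then recurse top-down."""
    merged = {}
    for child in parent.children:
        if child.signature in merged:
            # Matching sibling: keep one operator and adopt the other's children.
            merged[child.signature].children.extend(child.children)
        else:
            merged[child.signature] = child
    parent.children = list(merged.values())
    for child in parent.children:
        merge_identical_siblings(child)  # adopted children may now match as well
    return parent

root = Op("ROOT", [
    chain("LOAD quads", "FILTER p == rdf:type", "GROUP BY o", "STORE a"),
    chain("LOAD quads", "FILTER p == rdf:type", "FOREACH GENERATE s", "STORE b"),
])
before = count_ops(root)
merge_identical_siblings(root)
after = count_ops(root)
```

Comparing only siblings keeps the check cheap: siblings read the same input, so an equal signature is enough to treat them as the same operator. Merging one level can create new matching siblings one level down, which is why the sketch recurses after each merge.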

2. COMBINING FILTER OPERATORS

1. Create combined FILTER operator

2. Rearrange original FILTER operators

3. Remove redundant operators
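One way to read steps 1 and 2 is that sibling FILTER operators on the same input get a shared FILTER in front of them whose condition is the OR of their conditions, and the original filters then run on its smaller output. The Python sketch below follows that reading, which is an assumption rather than something the slides spell out; the data and predicate names are hypothetical, and step 3 (plan-level removal of operators that became redundant) is not modeled.

```python
# Hypothetical quads: (subject, predicate, object, object_kind)
rows = [
    ("ex:a", "rdf:type", "ex:City", "uri"),
    ("ex:a", "ex:name", "A", "literal"),
    ("ex:a", "ex:link", "ex:b", "uri"),
    ("ex:b", "ex:name", "B", "literal"),
    ("ex:b", "ex:near", "_:x", "bnode"),
]

# Two sibling filters that read the same input.
filters = {
    "is_type": lambda r: r[1] == "rdf:type",
    "is_literal": lambda r: r[3] == "literal",
}

def run_separately(rows, filters):
    """Original plan: every filter scans the full input."""
    return {name: [r for r in rows if pred(r)] for name, pred in filters.items()}

def run_combined(rows, filters):
    """Rewritten plan: one combined filter, then the originals on its output."""
    # Step 1: combined filter keeps a row if any original predicate accepts it.
    combined = [r for r in rows if any(pred(r) for pred in filters.values())]
    # Step 2: original filters are rearranged to read the combined output.
    outputs = {name: [r for r in combined if pred(r)] for name, pred in filters.items()}
    return combined, outputs

combined, outputs = run_combined(rows, filters)
```

The rewrite adds an operator but shrinks the data handed to everything downstream of it. Whether that pays off depends on how selective the combined condition is, which fits the later remark that the rules should not be applied in all cases.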

3. COMBINING FOREACH OPERATORS

1. Create combined FOREACH operator

2. Replace with simple projections

3. Remove redundant projection
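The three steps can be sketched as follows, assuming that sibling FOREACH operators read the same input and that equal output column names stand for equal expressions. The Python model, the column names, and the helper functions are hypothetical and only illustrate the rewrite; they are not Pig's plan API.

```python
# Hypothetical triples: (subject, predicate, object)
rows = [("ex:a", "ex:name", "Anna"), ("ex:b", "ex:name", "Bo")]

# Two sibling FOREACH operators, each mapping output column -> expression.
foreachs = {
    "subjects_only": {"s": lambda r: r[0]},
    "subject_and_length": {"s": lambda r: r[0], "obj_len": lambda r: len(r[2])},
}

def combine_foreach(foreachs):
    """Steps 1-3 on the plan level; returns combined exprs, projections, redundant names."""
    combined = {}
    for exprs in foreachs.values():
        combined.update(exprs)  # step 1: union of all generated columns
    # Step 2: each original FOREACH becomes a simple projection by column name.
    projections = {name: list(exprs) for name, exprs in foreachs.items()}
    # Step 3: a projection that keeps every combined column changes nothing.
    redundant = [n for n, cols in projections.items() if set(cols) == set(combined)]
    return combined, projections, redundant

def run(rows, combined, projections):
    """Evaluate the combined FOREACH once, then apply the simple projections."""
    wide = [{col: expr(r) for col, expr in combined.items()} for r in rows]
    return {n: [tuple(w[c] for c in cols) for w in wide] for n, cols in projections.items()}

combined, projections, redundant = combine_foreach(foreachs)
result = run(rows, combined, projections)
```

Here `subject_and_length` selects every column of the combined operator, so its projection is the one that step 3 would drop; `subjects_only` remains as a cheap column selection.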

ALL OPTIMIZATIONS

SUMMARY

• Optimizations reduce
  • Number of operations
  • Number of MapReduce jobs
  • Data flow between operators → less HDFS I/O → Improved execution time
• Reduces execution time by 70%
  • … but rules should not be applied in all cases
• More advanced (cost-based) approach is needed

FUTURE WORK

• Additional logical optimization rules
  • Ignore projections if it allows further merging of operators
• Advanced optimization strategies
  • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
• Materialization of intermediate results
  • Materialize common subsets, e.g. only triples with typed object values for later scripts

http://github.com/bforchhammer/lodop/

@anjeve