SPARQL Basic Graph Pattern Processing with Iterative MapReduce
2010-04-26
Presented by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
Copyright 2010 by CEBT
MapReduce
MapReduce is easily accessible
The Hadoop project provides an open-source MR implementation
MapReduce gives users a simple abstraction for utilizing parallel and distributed systems
Programming Model
– Map(k,v) -> list(k’, v’)
– Reduce(k’, list(v’)) -> list(v’’)
Useful for Massive Data Processing
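The programming model above can be sketched in a few lines of Python. This is only an in-memory illustration, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names, and the grouping dictionary stands in for the shuffle phase that a real framework performs across machines.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        # Map(k, v) -> list(k', v')
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        # Reduce(k', list(v')) -> list(v'')
        results.extend((out_key, v) for v in reduce_fn(out_key, groups[out_key]))
    return results

def count_map(_, line):
    # Emit (word, 1) for every word of the input line.
    return [(word, 1) for word in line.split()]

def count_reduce(_, counts):
    # Sum the partial counts of one word.
    return [sum(counts)]

word_counts = run_mapreduce(
    [(0, "rdf sparql rdf"), (1, "sparql mapreduce")],
    count_map, count_reduce)
print(word_counts)  # [('mapreduce', 1), ('rdf', 2), ('sparql', 2)]
```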
Center for E-Business Technology MDAC 2010 – 2/23
MR & Cloud Computing
MapReduce is a kind of platform
MapReduce utilizes a number of commodity machines
There can be a number of applications using MapReduce
[Figure: several applications (App.) running on top of MapReduce]
RDF Data Warehouse using MapReduce
Data Warehouse using MapReduce
Extensive studies have shown that MR is well suited to large-scale, fault-tolerant data analyses
Hive, CloudBase
– Data warehousing solutions built on top of Hadoop
Advantages
– Scalability
– Extensibility
– Fault-tolerance
My Research Interest
RDF Data Warehouse using MapReduce
Why RDF Data Warehouse?
Flexible Data Model
The underlying structure of any expression in RDF is a collection of triples (s, p, o)
Data Integration
RDB-to-RDF (intra)
Linked Open Data (inter)
Incremental Integration
Inference
We can discover some knowledge from what we already know
A goal of data analyses
Approaches & Advantages
[Figure: approaches to building a data warehouse, comparing conventional DW solutions with an RDF data warehouse, and centralized systems (before the cloud) with distributed and parallel systems (MR, cloud computing). Labels in the figure: Flexibility, Integration, Inference; Complexity, Large-scale data analyses; Scalability, Extensibility, Fault-tolerance; Support Tools; Simple, Fast; Performance, Optimization]
SPARQL BGP Processing with MapReduce
Both RDF and MapReduce can benefit a data warehouse
RDF is a data model
– Flexibility, Integration, Inference
MapReduce is a programming model
– Scalability, Extensibility, Fault-tolerance
It has been difficult to create synergy because only a few algorithms connect the data model and the framework
We should focus on a MR algorithm that manipulates RDF datasets
A MapReduce Algorithm for SPARQL Basic Graph Pattern Processing
SPARQL Basic Graph Pattern
SPARQL is a query language for RDF datasets
A Basic Graph Pattern (BGP) is a set of triple patterns
Triple patterns are similar to RDF triples (s, p, o) except that each of the subject, predicate and object can be a variable
BGP processing is important
– Most SPARQL queries have one or more BGPs
– BGPs require expensive join operations among triple patterns
Example BGP with five triple patterns (TP#1 to TP#5):
SELECT ?x ?y1 ?y2 ?y3
WHERE {
    ?x rdf:type ub:Professor .          # TP#1
    ?x ub:worksFor <Department0> .      # TP#2
    ?x ub:name ?y1 .                    # TP#3
    ?x ub:emailAddress ?y2 .            # TP#4
    ?x ub:telephone ?y3                 # TP#5
}
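To make the notion of a triple pattern concrete, here is a minimal Python sketch of matching one RDF triple against one pattern. The function name `match_pattern` and the convention that terms starting with `?` are variables are assumptions made for this illustration; they are not taken from the slides.

```python
def match_pattern(pattern, triple):
    """Return the variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable matches anything, but must stay consistent if repeated.
            if bindings.get(term, value) != value:
                return None
            bindings[term] = value
        elif term != value:
            # A constant must match exactly.
            return None
    return bindings

tp3 = ("?x", "ub:name", "?y1")
print(match_pattern(tp3, ("<Prof0>", "ub:name", '"Professor0"')))
print(match_pattern(tp3, ("<Prof0>", "rdf:type", "ub:Professor")))  # None
```

A BGP is then a set of such patterns whose bindings have to agree on the shared variables, which is where the join operations come from.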
SPARQL BGP Processing with MapReduce
Two Operations
MR-Selection
– Extracts RDF triples which satisfy at least one triple pattern
MR-Join
– Merges selected triples
[Figure: example of the two operations on the five-pattern query]
Input RDF triples:
    <Prof0> rdf:type ub:Professor
    <Prof0> ub:worksFor <Dept0>
    <Prof0> ub:name “Professor0”
    <Prof0> ub:email “[email protected]”
    <Prof0> ub:telephone “000-0000-0000”
    <Dept0> rdf:type ub:Department
    …
Result after MR-Selection and MR-Join:
    <Prof0> rdf:type ub:Professor
            ub:worksFor <Dept0>
            ub:name “Professor0”
            ub:email “[email protected]”
            ub:telephone “000-0000-0000”
MR-Selection
public void map() {
    read a triple (s, p, o)
    // example: s = Prof0, p = rdf:type, o = ub:Professor
    for each (triple pattern in the given query) {
        if (input triple satisfies the triple pattern) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 (number of the satisfied triple pattern)
            output (key, value)
        }
    }
}
public void reduce() {
    read input from the map function
    // input format: (key, list(satisfied tp_numbers))
    for each (value in the list of tp_numbers) {
        make a key and a value
        // key = <1>x, value = [x]Prof0
        output (key, value)
    }
}
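The MR-Selection pseudocode can be approximated by the following in-memory Python sketch. It is a simplification under stated assumptions: `QUERY`, `TRIPLES`, `bind` and `mr_selection` are hypothetical names, only triple patterns 1 and 3 of the example query are included, keys and values are tuples rather than the `[x]Prof0` and `<1>x` strings, and a dictionary stands in for the shuffle phase.

```python
from collections import defaultdict

QUERY = {  # triple pattern number -> (s, p, o)
    1: ("?x", "rdf:type", "ub:Professor"),
    3: ("?x", "ub:name", "?y1"),
}
TRIPLES = [
    ("<Prof0>", "rdf:type", "ub:Professor"),
    ("<Prof0>", "ub:name", '"Professor0"'),
    ("<Dept0>", "rdf:type", "ub:Department"),
]

def bind(pattern, triple):
    """Return ((variable, value), ...) if the triple satisfies the pattern, else None.
    Simplification: a variable repeated inside one pattern is not checked."""
    out = []
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            out.append((term[1:], value))
        elif term != value:
            return None
    return tuple(out)

def selection_map(triple):
    for tp_number, pattern in QUERY.items():
        bindings = bind(pattern, triple)
        if bindings is not None:
            yield bindings, tp_number           # key = [x]Prof0, value = 1

def selection_reduce(bindings, tp_numbers):
    variables = "|".join(name for name, _ in bindings)
    for tp_number in tp_numbers:
        yield (tp_number, variables), bindings  # key = <1>x, value = [x]Prof0

def mr_selection(triples):
    groups = defaultdict(list)                  # stands in for the shuffle phase
    for triple in triples:
        for key, value in selection_map(triple):
            groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in selection_reduce(key, values)]

selected = mr_selection(TRIPLES)
for key, value in selected:
    print(key, value)
```

The `<Dept0>` triple satisfies neither pattern, so it produces no output.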
MR-Selection
Conceptually, the MR-Selection algorithm produces one temporary table per triple pattern, holding the bindings that satisfy it
A result table has variable names, just as a relational table has attribute names
It also has values for the variable names, as does the relational table
The result tables are the input of the next MR-Join operation, if one is necessary
[Figure: temporary tables produced by MR-Selection, one per triple pattern, with their columns]
tp1: (x)
tp2: (x)
tp3: (x, y1)
tp4: (x, y2)
tp5: (x, y3)
MR-Join: Map
BGP Analyzer
BGP Analyzer examines a given query before execution and provides join-keys to the map function
Join-key (shared variable): ?x
[Figure: the Mapper groups the selected triples by the value of the join-key variable]
<Prof0>
    <Prof0> rdf:type ub:Professor
    <Prof0> ub:worksFor <Dept0>
    <Prof0> ub:name “Professor0”
    <Prof0> ub:email “[email protected]”
    <Prof0> ub:telephone “000-0000-0000”
<Prof1>
    <Prof1> ub:email “[email protected]”
    <Prof1> ub:telephone “111-1111-1111”
MR-Join: Map
public void map() {
    read input from MR-Selection
    // example input (<1>x, [x]Prof0)
    // example input (<3>x|y1, [x]Prof0|[y1]Professor0)
    get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    for each (join-key determined by BGP Analyzer) {
        if (input is related to the join-key) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = <tp>1</tp>[x]Prof0 (number of the satisfied triple pattern, variable name, value)
            // value = <tp>3</tp>[x]Prof0|[y1]Professor0
            output (key, value)
        }
    }
}
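A minimal Python sketch of this map step, assuming MR-Selection output is encoded as `((tp_number, variables), bindings)` tuples rather than the string format shown in the comments. `JOIN_KEYS` is a hypothetical stand-in for what the BGP Analyzer would provide, and `join_map` is an illustrative name.

```python
# Hypothetical BGP Analyzer output: join-key variable -> related tp_numbers.
JOIN_KEYS = {"x": (1, 2, 3, 4, 5)}

def join_map(selection_key, bindings):
    """Re-key one MR-Selection record by the value of each related join-key."""
    tp_number, _variables = selection_key
    bound = dict(bindings)
    for variable, tp_numbers in JOIN_KEYS.items():
        # The record is related to the join-key if its triple pattern is one of
        # the patterns to be joined and it actually binds the variable.
        if tp_number in tp_numbers and variable in bound:
            # key = [x]Prof0, value = <tp>3</tp>[x]Prof0|[y1]Professor0
            yield (variable, bound[variable]), (tp_number, bindings)

record = ((3, "x|y1"), (("x", "<Prof0>"), ("y1", '"Professor0"')))
emitted = list(join_map(*record))
print(emitted)
```

All records that share the same value of `x` end up under the same key, so a single reducer sees every triple pattern result for that value.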
MR-Join: Reduce
BGP Analyzer
BGP Analyzer can provide triple pattern numbers related to the join-key variable by examining a given query
Triple pattern numbers related to the join-key variable: <x> 1, 2, 3, 4, 5
[Figure: the Reducer applies the constraints for join-key variable x to each group]
Reducer input:
    <Prof0> rdf:type ub:Professor
    <Prof0> ub:worksFor <Dept0>
    <Prof0> ub:name “Professor0”
    <Prof0> ub:email “[email protected]”
    <Prof0> ub:telephone “000-0000-0000”
    <Prof1> ub:email “[email protected]”
    <Prof1> ub:telephone “111-1111-1111”
Reducer output:
    <Prof0> rdf:type ub:Professor
            ub:worksFor <Dept0>
            ub:name “Professor0”
            ub:email “[email protected]”
            ub:telephone “000-0000-0000”
MR-Join: Reduce
public void reduce() {
    read input from the map function
    // example input ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
    get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    create a temporary hashtable H
    for each (value in values) {
        add an element
        // key = <1>x, value = [x]Prof0
        // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
    } // H will be used for checking whether the input satisfies all related tps.
    if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = <1|3>x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
    }
}
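The reduce step can be sketched in Python as follows. This is an illustration under assumptions, not the author's implementation: `join_reduce` is a hypothetical name, values are `(tp_number, bindings)` tuples, and the sketch does not check that variables other than the join-key agree across triple patterns.

```python
from itertools import product

def join_reduce(join_key, values, required_tps):
    """Join all records that share one join-key value.
    `join_key` is kept only to mirror the reducer signature."""
    table = {}                                   # temporary hashtable H
    for tp_number, bindings in values:
        table.setdefault(tp_number, []).append(dict(bindings))
    if not set(required_tps) <= set(table):
        return []                                # some triple pattern is unmatched
    joined = []
    # Cartesian product among the per-pattern lists: (a1, b1), (a1, c1) => (a1, b1, c1)
    for combo in product(*(table[tp] for tp in required_tps)):
        row = {}
        for bindings in combo:
            row.update(bindings)
        joined.append(row)
    return joined

prof0 = [(1, (("x", "<Prof0>"),)),
         (3, (("x", "<Prof0>"), ("y1", '"Professor0"')))]
prof1 = [(3, (("x", "<Prof1>"), ("y1", '"Professor1"')))]

rows_prof0 = join_reduce(("x", "<Prof0>"), prof0, (1, 3))
rows_prof1 = join_reduce(("x", "<Prof1>"), prof1, (1, 3))
print(rows_prof0)  # [{'x': '<Prof0>', 'y1': '"Professor0"'}]
print(rows_prof1)  # []
```

The `<Prof1>` group is dropped because it has no record for triple pattern 1, which corresponds to the coverage check on H in the pseudocode.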
Join-key Selection Strategies
BGP Analyzer provides join-key variables by analyzing a query
How to select join-key variables?
If a BGP has one shared variable
– We can easily select the variable
If a BGP has two or more shared variables
– We applied two heuristics to select join-key variables
– Greedy Selection: select a join-key according to the number of related triple patterns
– Multiple Selection: select join-keys until every triple pattern participates in an MR-Join operation
Utilize the distributed and parallel system architecture
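One possible reading of the two heuristics, sketched in Python. The slides do not spell out the exact procedure, so this is an assumption: the greedy step picks the shared variable related to the most not-yet-covered triple patterns, and selection repeats until every triple pattern is covered. `select_join_keys` and the `patterns` encoding are hypothetical.

```python
def select_join_keys(patterns):
    """patterns: {tp_number: (s, p, o)}. Returns [(variable, related tp_numbers)]."""
    related = {}
    for tp_number, pattern in patterns.items():
        for term in pattern:
            if term.startswith("?"):
                related.setdefault(term, set()).add(tp_number)
    # Only variables shared by two or more triple patterns can be join-keys.
    shared = {v: tps for v, tps in related.items() if len(tps) > 1}
    chosen, covered = [], set()
    while shared and covered != set(patterns):
        # Greedy Selection: prefer the variable covering the most new patterns.
        variable = max(shared, key=lambda v: len(shared[v] - covered))
        if not shared[variable] - covered:
            break
        chosen.append((variable, sorted(shared[variable])))
        covered |= shared[variable]
        del shared[variable]   # Multiple Selection: continue with the remaining variables
    return chosen

bgp = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "<Department0>"),
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),
    5: ("?x", "ub:telephone", "?y3"),
    6: ("?y2", "ub:alias", "?y4"),
}
join_keys = select_join_keys(bgp)
print(join_keys)  # [('?x', [1, 2, 3, 4, 5]), ('?y2', [4, 6])]
```

In this example triple pattern 4 is related to both join-keys, which is why such a query needs more than one MR-Join iteration.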
SPARQL BGP Processing with MR
Advantages
MapReduce can benefit from the multi-way join technique
– If triple patterns share a variable, MR can join them all at once
– It is not unusual that a BGP has several triple patterns sharing the same variable because RDF has a fixed simple data model
[Figure: two join plans over the tables tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2) and tp5 (x, y3). A cascade of four 2-way joins produces (x), (x, y1), (x, y1, y2) and (x, y1, y2, y3) in turn; a single multi-way join on x produces (x, y1, y2, y3) at once. The panels are labeled (a) and (b).]
SPARQL BGP Processing with MR
Disadvantages
If we have two or more shared variables, we need expensive MR iterations
The triple patterns in a query cannot always be covered by a single variable
If we have two shared variables, MR iterations cannot be avoided
To reduce unnecessary MR iterations, join-key selection strategies should be applied
Example BGP with two shared variables (?x and ?y2):
SELECT ?x ?y1 ?y2 ?y3
WHERE {
    ?x rdf:type ub:Professor .          # TP#1
    ?x ub:worksFor <Department0> .      # TP#2
    ?x ub:name ?y1 .                    # TP#3
    ?x ub:emailAddress ?y2 .            # TP#4
    ?x ub:telephone ?y3 .               # TP#5
    ?y2 ub:alias ?y4                    # TP#6
}
[Figure: tp1 to tp5 are joined into (x, y1, y2, y3); the result is then joined with tp6 (y2, y4) in a further join]
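The need for iterations can be illustrated with a small in-memory Python sketch: one call to the hypothetical `mr_join` stands for one MR-Join job on a single join-key, and a query with two shared variables takes two calls. The table contents (`mail0`, `alias0`) are made-up values, and only tp3, tp4 and tp6 are included to keep the example short.

```python
from collections import defaultdict
from itertools import product

def mr_join(variable, tables):
    """One MR-Join iteration: join every table in `tables` on `variable`."""
    groups = defaultdict(lambda: [[] for _ in tables])   # map + shuffle
    for index, rows in enumerate(tables):
        for row in rows:
            groups[row[variable]][index].append(row)
    joined = []
    for per_table in groups.values():                     # reduce
        if all(per_table):                                # every table contributed
            for combo in product(*per_table):
                merged = {}
                for row in combo:
                    merged.update(row)
                joined.append(merged)
    return joined

tp3 = [{"x": "Prof0", "y1": "Professor0"}]
tp4 = [{"x": "Prof0", "y2": "mail0"}]
tp6 = [{"y2": "mail0", "y4": "alias0"}]

first = mr_join("x", [tp3, tp4])        # iteration 1: join-key x
second = mr_join("y2", [first, tp6])    # iteration 2: join-key y2
print(second)
```

The second iteration can only start once the first has produced its result, which is the cost the join-key selection strategies try to keep low.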
Experiment
Environment
LUBM Dataset
Amazon EC2, Cloudera’s Hadoop Distribution, Amazon EBS
The effect of multi-way join
Multi-way join technique reduces the execution time by joining several triple patterns at once
Some queries do not show a significant difference because they are too simple to take advantage of multi-way join
Query    2-way      Multi-way    Diff.
Q1       123.391    86.423       36.968
Q2       181.583    104.035      77.548
Q3       69.773     67.214       2.559
Q4       256.591    126.474      130.117
Q5       75.533     74.163       1.37
Q6       44.198     44.526       -0.328
Q7       205.636    135.047      70.589
Q8       232.551    140.414      92.137
Q9       256.031    152.747      103.284
Q10      68.834     73.337       -4.503
Q11      66.834     63.557       3.277
Q12      112.802    86.117       26.685
Q13      73.369     72.825       0.544
Q14      47.092     42.156       4.936
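As a quick check of the table, the Diff. row is the 2-way value minus the multi-way value for each query. The short Python sketch below recomputes it from the per-query values read off the table; the variable names are illustrative, and the numbers are in whatever time unit the table uses.

```python
two_way = [123.391, 181.583, 69.773, 256.591, 75.533, 44.198, 205.636,
           232.551, 256.031, 68.834, 66.834, 112.802, 73.369, 47.092]
multi_way = [86.423, 104.035, 67.214, 126.474, 74.163, 44.526, 135.047,
             140.414, 152.747, 73.337, 63.557, 86.117, 72.825, 42.156]

# Diff. = 2-way minus multi-way, rounded to the table's three decimals.
diffs = [round(a - b, 3) for a, b in zip(two_way, multi_way)]

# Queries where the multi-way join was slower than the 2-way join.
slower = [f"Q{i}" for i, d in enumerate(diffs, start=1) if d < 0]

mean_saving = round(sum(diffs) / len(diffs), 3)
print(diffs)
print(slower)  # ['Q6', 'Q10']
print(mean_saving)
```

Only Q6 and Q10 come out slightly negative, which matches the remark that some queries are too simple to benefit from the multi-way join.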
Experiment
Scalability
As the number of machines increases, the average execution time decreases
– The MR algorithm creates a sufficient number of reducers, so we can utilize many machines
As we increase the data size, the execution time of the algorithm remains scalable
Issues & Future Work – Indexing
Execution Time of MR-Selection and each MR-Join Iteration
MR-Selection can be a bottleneck because it takes about 40 seconds
The underlying storage structure is important
N-triple format -> HBase, Partitioning
Building an index needs a significant amount of loading time
Issues & Future Work – Pipelining
Hadoop’s MR implementation materializes intermediate results into the file system
This takes a long time because of disk I/O
Pipelining
Allows tasks and jobs to send and receive data without disk I/O
– Some implementations have become available
Hadoop Online Prototype (http://code.google.com/p/hop/)
CGL-MapReduce (eScience 2008)
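The difference between materializing and pipelining intermediate results can be illustrated with a toy Python sketch. It does not use Hadoop or either of the systems named above: `job1`, `job2`, `run_materialized` and `run_pipelined` are hypothetical names, a temporary file stands in for the distributed file system, and Python generators stand in for streaming between jobs.

```python
import json
import tempfile

def job1(rows):
    # First job: annotate each value with its length.
    for row in rows:
        yield {"x": row, "len": len(row)}

def job2(rows):
    # Second job: keep only the long values.
    for row in rows:
        if row["len"] > 5:
            yield row["x"]

def run_materialized(rows):
    """Write job1's output to a file, then let job2 read it back (disk I/O)."""
    with tempfile.TemporaryFile("w+") as spill:
        for row in job1(rows):
            spill.write(json.dumps(row) + "\n")
        spill.seek(0)
        return list(job2(json.loads(line) for line in spill))

def run_pipelined(rows):
    """Feed job1's output straight into job2, one record at a time."""
    return list(job2(job1(rows)))

data = ["Prof0", "Dept0", "Professor0"]
print(run_materialized(data))  # ['Professor0']
print(run_pipelined(data))     # ['Professor0']
```

Both variants return the same result; the pipelined one simply skips the write and re-read of the intermediate data.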
Conclusion
There still remain many issues
This work is still in progress
Conclusion
RDF Data Warehouse using MapReduce
– RDF: Flexibility, Integration, Inference
– MapReduce: Scalability, Extensibility, Fault-tolerance
SPARQL Processing with MapReduce
– Synergy effects between RDF and MapReduce
– Issues
System Architecture
Loading(Indexing), Pipelining, Encoding, …