Heaven: Supporting Systematic Comparative Research of RDF Stream Processing Engines
Master thesis by: Riccardo Tommasini (799120)
Advisor: Prof. Emanuele Della Valle Co-Advisors: Daniele Dell’Aglio and Marco Balduini
Scuola di Ingegneria Industriale e dell’Informazione, Computer Science and Engineering
Academic Year 2013 – 2014
Master Degree Thesis – Riccardo Tommasini
Agenda
Motivations
Research Question
Conclusion
Evaluation
Development
✓
Stream Reasoning
Reasoning upon rapidly changing information flows
- Emanuele Della Valle, Stefano Ceri, 2009
Computer Science research mainly focuses on proposing new systems and models, and lacks empirical evaluations of the existing ones.
- Walter F. Tichy, 25 August 1994
Motivations
RSP Engine | C-SPARQL Engine | CQELS | SPARQLstream | EP-SPARQL | INSTANS | SparkWAVE | DynamiTE | Trowl
C-SPARQL Engine ≡ ✔ ✔ ✔ ✔
CQELS ✔ ≡ ✔
SPARQLstream ✔ ≡
EP-SPARQL ✔ ✔ ≡
INSTANS ≡
SparkWAVE ✔ ≡
DynamiTE ≡
Trowl ≡
State of the art in RSP Comparison
Agenda
Motivations
Research Question
Conclusion
Evaluation
Development
✓Motivations
In Social Science…
Problem Setting - Comparative Method
Comparative Analysis is Case Driven
Cases are seen as a combination of properties
Similarities and differences are examined with shared methods
Baselines define analysis guidelines
Problem Setting - Test Stand
Evaluate engines with Test Stands
In Aerospace engineering…
Experimental Environment
Reproducibility, Repeatability, Comparability
Evaluation of running systems
RSP Engine <T, Q>
RDF Stream Processing Engine
integration of heterogeneous data streams through the RDF data model
continuously infers implied triples w.r.t. ontology T
continuously answers the query Q
RSP Engine - Complexity
RDF Stream Model + Execution Semantics + Inference Rule
Benchmark | Data Streams & Ontologies | Queries | Metrics                              | Test Stand | Baselines | Method
SRBench   | ✔                         | ✔       | Feasibility                          | ✖          | ✖         | ✖
LSBench   | ✔                         | ✔       | Feasibility, Throughput              | ≈          | ✖         | ≈
CSRBench  | ✔                         | ✔       | Feasibility, Throughput, Correctness | ≈          | ✖         | ✖
State of the art of RSP Engine Benchmarking
Research Question
Can an engine test stand, together with existing queries, datasets and metrics, enable a Systematic Comparative Research Approach of RSP Engines?
Contribution
We developed and released as open source Heaven, a framework to enable a Systematic Comparative Research Approach (SCRA) of RDF Stream Processing (RSP) Engines.
Agenda
Conclusion
Evaluation
Development
Research Question ✓
Motivations ✓
Research Question
Do not Influence the experiment
Extensible Design
Independence from the engine, query, dataset and ontology allows Heaven to exploit the existing solutions presented before
Moreover…
Extensible Measurement Set
Heaven - Test Stand Requirements
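Heaven - Test Stand
[Diagram: the Test Stand wraps an RSP Engine <T, Q> between an input interface and an output interface and controls its Start and Stop. The experiment is labelled with E, D, T, Q; a Streamer is labelled with D and a ResultCollector sits on the output side.]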
Heaven - Test Stand
[Diagram: TestStand with Start and Stop controls, running an Experiment across Streamer, RSPEngine and ResultCollector, together with Disk and Analyser; the label MB appears twice.]
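A minimal sketch of the test stand idea, in Python and under stated assumptions: the engine is modelled as a plain function from an event (a set of triples) to a set of result triples, and the names `SketchTestStand`, `ResultCollector` and `Measurement` are hypothetical, not Heaven's actual classes. It only illustrates how a stand can drive an engine from the outside and record latency and memory per event.

```python
import time
import tracemalloc
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Measurement:
    event_id: int
    latency_ms: float
    memory_mb: float
    result_size: int


@dataclass
class ResultCollector:
    """Stores one measurement per streamed event (hypothetical helper)."""
    measurements: List[Measurement] = field(default_factory=list)

    def collect(self, measurement: Measurement) -> None:
        self.measurements.append(measurement)


class SketchTestStand:
    """Drives an engine with a stream and measures it from the outside."""

    def __init__(self,
                 engine: Callable[[Set[Triple]], Set[Triple]],
                 collector: ResultCollector) -> None:
        self.engine = engine
        self.collector = collector

    def run(self, stream: Iterable[Set[Triple]]) -> None:
        tracemalloc.start()  # Python-level memory tracing, an assumption of this sketch
        try:
            for event_id, event in enumerate(stream):
                start = time.perf_counter()
                result = self.engine(event)
                latency_ms = (time.perf_counter() - start) * 1000.0
                current_bytes, _peak = tracemalloc.get_traced_memory()
                self.collector.collect(
                    Measurement(event_id, latency_ms, current_bytes / 1e6, len(result)))
        finally:
            tracemalloc.stop()


if __name__ == "__main__":
    collector = ResultCollector()
    # The "engine" here is an identity function standing in for a real RSP engine.
    stand = SketchTestStand(lambda event: set(event), collector)
    stand.run([{("a", "p", "b")}, {("c", "p", "d"), ("e", "p", "f")}])
    for m in collector.measurements:
        print(m)
```

Keeping the measurement code in the stand rather than in the engine is what lets the same experiment be repeated unchanged against different engines.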
Agenda
Conclusion
Evaluation
Development ✓
Research Question ✓
Motivations ✓
✓
Development
Do
Heaven extends the traditional top-down analysis, enabling comparative methods
How to start the research?
We evaluate four naive RSP Engines, called Baselines, included in the framework
Heaven - Baseline Engines
[Diagram: two baseline architectures over an active window of input triples and inferred triples. Naive: RDF Stream, DSMS, Reasoner, RDF Stream. Incremental: RDF Stream, DSMS, Incremental Reasoner, RDF Stream, with the window changes marked Δ+ and Δ-.]
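The difference between the Naive and Incremental baselines can be sketched as follows. This is an illustration under assumptions, not the baselines' actual code: the ontology is a hypothetical toy class hierarchy (`SUPERCLASSES`), the only rule is type propagation to superclasses, and the caller is assumed to pass true deltas (added triples were not in the window, removed ones were).

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"

# Hypothetical static ontology T: class -> all of its superclasses.
SUPERCLASSES: Dict[str, Set[str]] = {
    "GraduateStudent": {"Student", "Person"},
    "Student": {"Person"},
}


def infer(triple: Triple) -> Set[Triple]:
    """Triples implied by a single input triple under the toy ontology."""
    s, p, o = triple
    if p != RDF_TYPE:
        return set()
    return {(s, RDF_TYPE, sup) for sup in SUPERCLASSES.get(o, set())}


def naive_materialise(window: Iterable[Triple]) -> Set[Triple]:
    """Naive approach: recompute the full materialisation of the window."""
    window = set(window)
    result = set(window)
    for t in window:
        result |= infer(t)
    return result


class IncrementalMaterialiser:
    """Incremental approach: update the previous result using only the deltas."""

    def __init__(self) -> None:
        # triple -> number of justifications currently supporting it
        self._support: Counter = Counter()

    def update(self, added: Iterable[Triple], removed: Iterable[Triple]) -> Set[Triple]:
        for t in added:            # corresponds to the additions (Δ+)
            self._support[t] += 1
            for derived in infer(t):
                self._support[derived] += 1
        for t in removed:          # corresponds to the deletions (Δ-)
            self._support[t] -= 1
            for derived in infer(t):
                self._support[derived] -= 1
        self._support = +self._support  # unary plus drops non-positive counts
        return set(self._support)


if __name__ == "__main__":
    w1 = {("alice", RDF_TYPE, "GraduateStudent"), ("bob", RDF_TYPE, "Student")}
    w2 = {("bob", RDF_TYPE, "Student"), ("carol", RDF_TYPE, "Person")}
    inc = IncrementalMaterialiser()
    print(inc.update(added=w1, removed=set()) == naive_materialise(w1))
    print(inc.update(added=w2 - w1, removed=w1 - w2) == naive_materialise(w2))
```

The naive version does work proportional to the whole window on every evaluation, while the incremental one does work proportional to the size of the deltas, at the price of keeping the support counts in memory.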
Heaven - Data
The RDF2RDFStream Module
adapts LUBM data to a streaming scenario
generates many RDF Streams, controlling the number of contemporary triples
[Plots: Constant Flow Rate and Step Flow Rate; x-axis: time, y-axis: contemporary triples]
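A small sketch of how a static list of triples could be turned into a stream with a constant or a step flow rate. The function names, the 100 ms period and the step parameters are assumptions made for illustration; they are not taken from the RDF2RDFStream module.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Event = Tuple[int, List[Triple]]  # (timestamp in ms, triples sharing that timestamp)


def constant_rate(size: int) -> Callable[[int], int]:
    """Every event carries the same number of contemporary triples."""
    return lambda step: size


def step_rate(initial: int, increment: int, every: int) -> Callable[[int], int]:
    """The event size grows by `increment` every `every` events."""
    return lambda step: initial + increment * (step // every)


def to_stream(triples: Iterable[Triple],
              rate: Callable[[int], int],
              period_ms: int = 100) -> Iterator[Event]:
    """Group triples into timestamped events whose size follows `rate`."""
    it = iter(triples)
    step = 0
    while True:
        batch = list(islice(it, rate(step)))
        if not batch:  # input exhausted
            return
        yield (step * period_ms, batch)
        step += 1


if __name__ == "__main__":
    data = [(f"s{i}", "p", "o") for i in range(12)]
    print([len(b) for _, b in to_stream(data, constant_rate(3))])
    print([len(b) for _, b in to_stream(data, step_rate(1, 1, 2))])
```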
Doing - Queries
Variations of the full materialisation query
Window Dimension ω [ms]
Slide Parameter β = 100 [ms]
ω = β × S, S ∈ N
S = 1: Tumbling Window
S > 1: Sliding Window
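The relation ω = β × S can be made concrete with a short sketch that lists the windows produced for a given S. It assumes half-open intervals and a first window opening at time 0, neither of which is stated on the slide; the function names are hypothetical.

```python
from typing import List, Tuple


def window_dimension(beta_ms: int, s: int) -> int:
    """Window dimension ω = β × S, with S a positive integer."""
    if s < 1:
        raise ValueError("S must be a positive integer")
    return beta_ms * s


def windows(beta_ms: int, s: int, horizon_ms: int) -> List[Tuple[int, int]]:
    """Half-open windows [start, end) that close within the horizon.

    A new window opens every beta_ms; assumes the first opens at time 0.
    """
    omega = window_dimension(beta_ms, s)
    return [(start, start + omega)
            for start in range(0, horizon_ms - omega + 1, beta_ms)]


if __name__ == "__main__":
    print(windows(100, 1, 300))  # S = 1: consecutive windows do not overlap
    print(windows(100, 3, 500))  # S > 1: consecutive windows overlap
```

With S = 1 each triple falls in exactly one window (tumbling); with S > 1 consecutive windows share ω − β milliseconds of data (sliding).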
Experiments
15 SOAK Tests, 10 times for each baseline: 168 hours of execution
6 STEP Tests, 10 times for each baseline: 150 hours of execution
[Two plots of contemporary triples over time]
Heaven - Analyser
We exploit a layered investigation method, which answers different possible questions about RSP Engine analysis
L0 - How to choose an engine?
L1 - What distinguishes an engine?
L2 - When to choose an engine?
L3 - Why choose this engine?
Doing - Analyser L0 - Dashboard
[Dashboard: four panels, each showing Memory (MB) and Latency (ms) for increasing Window Dimension (ms)]
Doing - Analyser L1 - Statistical Comparison
6.3 SOAK Test Evaluation Results
(a) Incremental
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | G  |    |     |      |
10                               | G  | ≃  |     |      |
100                              | G  | ≃  | ≃   |      |
1000                             | G  | ≃  | ≃   | ≃    |
10000                            | NA | T  | S   | T    | T
(b) Triple
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | I  |    |     |      |
10                               | I  | I  |     |      |
100                              | N  | I  | I   |      |
1000                             | N  | I  | I   | I    |
10000                            | NA | I  | I   | I    | I
(c) Naive
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | ≃  |    |     |      |
10                               | ≃  | ≃  |     |      |
100                              | G  | ≃  | T   |      |
1000                             | G  | ≃  | T   | T    |
10000                            | NA | ≃  | ≃   | T    | T
(d) Graph
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | I  |    |     |      |
10                               | I  | I  |     |      |
100                              | ≃  | I  | I   |      |
1000                             | N  | I  | I   | I    |
10000                            | NA | I  | I   | I    | I
Table 6.7 – Analyser Investigation Stack - Level 1 - SOAK Test average latency comparison through a qualitative approach. The underlined symbols G, T, N, I indicate that the baseline has not reached the Steady State Condition. (a), (c): Graph-based vs Triple-based model, under the Incremental and Naive approaches respectively; (b), (d): Incremental vs Naive approach, under the Triple-based and Graph-based models respectively.
representation, but Heaven also allows more detailed analysis with quantitative comparisons, as shown in Section 5.4. To properly read the tables, note that they report that a baseline is better than another one when the difference in terms of latency or memory is bigger than 5%; otherwise we consider the two terms of comparison equal and we use the symbol ≃. Moreover, we indicate that the better solution has not reached the Steady State Condition with the underlined symbols G, T, N, I.
When N > 1, the results in Table 6.7.a and 6.7.c allow us to say that using a Triple-based RDF stream is faster than using a Graph-based one. In particular, for the case N=1000 when the window contains 1000 triples (i.e., each CTEvent contains only one triple), the Naive Triple-based approach is about 10% faster than the Naive Graph-based one, while the Incremental Graph-based is even about 20% faster. These findings confirm [Hp.2], while the cases with N=10 do not confirm the hypothesis, because the results can be considered equal (result differences are smaller than 5%). A possible explanation is that
(a) Incremental
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | T  |    |     |      |
10                               | G  | T  |     |      |
100                              | G  | T  | G   |      |
1000                             | G  | G  | G   | T    |
10000                            | NA | G  | G   | G    | G
(b) Triple
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | N  |    |     |      |
10                               | I  | N  |     |      |
100                              | N  | N  | I   |      |
1000                             | N  | I  | I   | I    |
10000                            | NA | I  | I   | I    | I
(c) Naive
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | G  |    |     |      |
10                               | G  | T  |     |      |
100                              | G  | G  | T   |      |
1000                             | G  | G  | G   | T    |
10000                            | NA | G  | G   | T    | T
(d) Graph
Triples in Window \ Slots Number | 1  | 10 | 100 | 1000 | 10000
1                                | N  |    |     |      |
10                               | N  | N  |     |      |
100                              | ≃  | N  | I   |      |
1000                             | ≃  | I  | I   | I    |
10000                            | NA | N  | I   | I    | I
Table 6.8 – Analyser Investigation Stack - Level 1 - SOAK Test average memory comparison through a qualitative approach. The underlined symbols G, T, N, I indicate that the baseline has not reached the Steady State Condition. (a), (c): Graph-based vs Triple-based model, under the Incremental and Naive approaches respectively; (b), (d): Incremental vs Naive approach, under the Triple-based and Graph-based models respectively.
the dimension of the graph cannot be considered small w.r.t. the window when N=10.
When N=1 (i.e., the window contains only one CTEvent) instead, the results in Table 6.7.b and Table 6.7.d show that for large events the Naive approach is faster than the Incremental one, as we stated when we formulated [Hp.1]. Instead, when a CTEvent contains only a few triples, the Incremental approach is faster. This is not intuitive, because to formulate [Hp.1] we considered the size of the changes as a percentage.
The results in Table 6.7.b and 6.7.d support [Hp.1] by stating that when the number of changing triples in Δ+ and Δ- (Section 4.2) is a small fraction of those in the window, an Incremental approach is faster than the Naive one. The exception is the case N=1, but it can be seen as a limit case, where the reasoner is asked to deduce all the implicit triples implied by the only explicit triple in the window.
I: Incremental, N: Naive
Window Dimension (ω) = Slide Parameter (β) × S
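The 5% rule used to fill the qualitative tables can be sketched as a small comparison function. The text only states the threshold; measuring the difference relative to the larger of the two values, and treating the lower value as the better one, are assumptions of this sketch, and `compare` is a hypothetical name.

```python
def compare(a_label: str, a_value: float,
            b_label: str, b_value: float,
            threshold: float = 0.05) -> str:
    """Return the label of the better (lower) value, or '≃' if within the threshold.

    Assumption: the difference is taken relative to the larger of the two values.
    """
    if a_value < 0 or b_value < 0:
        raise ValueError("latency and memory measurements must be non-negative")
    larger = max(a_value, b_value)
    if larger == 0:
        return "≃"
    if abs(a_value - b_value) / larger <= threshold:
        return "≃"
    return a_label if a_value < b_value else b_label


if __name__ == "__main__":
    print(compare("N", 100.0, "I", 97.0))  # within 5%, considered equal
    print(compare("N", 100.0, "I", 80.0))  # Incremental clearly lower
    print(compare("G", 50.0, "T", 100.0))  # Graph clearly lower
```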
Doing - Analyser L2 - Pattern Identification
6.3 SOAK Test Evaluation Results
[Two grids of plots, (a) Graph Naive and (b) Graph Incremental, with one cell per combination of triples in window (1 to 10000) and slots number (1 to 10000)]
Table 6.11 – The figure shows the representation in the time domain of memory for GN (a) and GI (b).
Doing - Analyser L3 - Visual Comparison
Agenda
Motivations
Research Question
Conclusion
Evaluation
Development
✓✓✓✓
✓Evaluation
Done
Can an engine test stand, together with existing queries, datasets and metrics, enable SCRA of RSP Engines?
My contributions are
Test Stand
Baselines
Method
Analysis
Future Works
SCRA of RSP Engines is just at the beginning. Further developments of Heaven are possible.
Benchmark Suite
Heaven as a Service
Research on Baselines
Research on Existing RSP Engines
Agenda
Motivations
Research Question
Conclusion
Evaluation
Development ✓✓
✓✓
✓Conclusion
Thank You
Questions?