Presented by: Dardan Xhymshiti Spring 2016cse.ucdenver.edu/~bdlab/seminar/2016/2.pdf · Skyline is expensive to compute especially in large datasets. The recent multi-core skyline

Presented by: Dardan Xhymshiti

Spring 2016

Authors:

Publication:

ICDE 2015

Type:

Research Paper

2

Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh*

*Data-Intensive Systems Group, Aarhus University, Denmakr

`Data-Intensive Applications and Systems Laboratory, EPFL, Switzerland

Skyline is expensive to compute especially in large datasets.

The recent multi-core skyline algorithms does not effectively reduce the dominance tests.

State-of-the art skyline algorithms outperform multi-core algorithms.

Most of the multi-core Skyline algorithms use the Divide-&-Conquer approach which has two drawbacks:

If the number of local skyline points is large, the merging step is expensive.

Increasing the cardinality of the dataset, the computation becomes expensive.

3

To come up with a new multi-core algorithm, which eliminate as much as it can dominance tests.

4

Provide an overview about skyline operator.

Introduces to the innovative skyline Hybrid algorithm.

Provide experiments which shows that this algorithm outperforms multi-core and sequential algorithms.

5

How to increase the performance of skyline algorithms:

Implementation in GPUs

Implementation in Multi-Core CPUs.

Implementing in distributing environments like MapReduce.

The authors have developed an algorithm called: Hybrid

The authors have chosen the Multi-Core CPU to do the implementation of the algorithm because of:

Cheaper shared data structures and

Parallel work need not be isolated.

6

All the skyline points are maintained in a shared global data structure.

This data structure gets updated regularly and is read by all threads.

The skyline points in the data structure are ordered to detect dominance relationships quickly.

The processing of points is done in blocks that guarantee each point is compared to at most 𝛼 points than in a sequential algorithm.

7

Sort-based algorithms (quickly detect dominance relationships)

SFS (Sort Filter Skyline)

LESS

SaLSa (Sort and Limit Skyline algorithm)

Object-based space partitioning (quickly detect incomparability)

Object-Space partitioning

BSky-Tree-P

8

Presented by: Dardan Xhymshiti

Spring 2016

Authors:

Publication:

ICDE 2015

Type:

Research Paper

10

Michael Shekelyam Gregor Josse Matthias Schubert

Institute of Informatics, Ludwig-Maximilians-University Munich

In many application areas, data is organized as a network of graph.

Important task: compute a cost-optimal path between a start node and a target one.

Example:

Road networks (Cost criteria: travel time, travel distance, energy consumption etc.)

Computer networks (Cost criteria: bandwidth and the latency between routers.)

Cost vector: when considering more than one criterion at a time, the cost of complete path is called cost vector.

11

Cost criteria 1 Cost criteria 2 Cost criteria 3

12

How to define if a path is an optimal path?

1. Map the cost vector to a value by employing a monotonic combination function, and then sort the paths. The top n paths are the optimal ones.

Problems:

a) Hard to find a suitable function,b) Different types of cost might have different levels of scale.

2. Compute the pareto optimal (mathematical definition of Skyline) cost vectors. (This is also known as conventional path skyline).

Problems:

a) The number of pareto optimal paths might increase exponentially as function of distance and the amount of considered cost criteria.

b) Showing to the user a large amount of results is not very helpful.

There actually exist solutions for computing linear path skylines, but they are restricted to the specific case of having just two cost criteria.

Problem: cannot be generalized to more criteria.

The number of skyline paths increases exponentially with the distance between the locations and the number of cost criteria. Thus the result set might be too big.

13

Come up with a new approach of computing the results set of path skylines, which provides better and faster results.

14

Recall: Conventional path skyline computes all potential optimal paths, but the result is too big.

Idea: reduce the result set, to only show the paths which are optimal under a weighted sum function or linear combination of cost criteria.

Intuitively saying: The user weights each type of cost with a percentage describing its importance.

How to compute the linear path skyline?

Naive approach: compute the conventional path skyline and then compute the convex hull on the resulting cost vectors. (Inefficient).

15

What is Convex Hull?

Definition: In mathematics, the convex hull or convex envelope of a set X of points in the Euclidean plane or Euclidean space is the smallest convex set that contains X.

16

The authors come up with an algorithm called LSCH (Linear Skyline Convex Hull) whichconstructs a linear path skyline successively while only adding new paths which aremembers of the result set.

Implementation overview:

1. To add a new cost vector, a single search is performed which combine the cost vectorsbased on the normal vectors of the hyper planes currently limiting the linear path skyline.

2. To identify the areas on the linear skyline where additional results might still exits, thealgorithms applies multidimensional convex hull computation.

17

Experiments are run in two different types of networks:

1. Munich road network with five cost criteria.

2. Artificial lattice graphs that allow to simulate different problem instances and parameter settings.

18

Computing route Skylines algorithms.

Convex hull algorithm.

19

Presented by: Shahab Helmi

Spring 2016

Authors:

Publication:

ICDE 2015

Type:

Research Paper

21

A recent study suggests that the routes provided by a leading navigation service often failto agree with the routes chosen by local drivers, why?

Limited number of travel costs, e.g., distance or travel time.

With the rapid development and continuing use of vehicle tracking technologies, it ispossible to learn and update individual drivers’ driving preferences according to theirtrajectories.

22

It proposes a novel problem on personalized route recommendation based on bigtrajectory data.

It proposes techniques to model and update driving preferences from drivers’trajectories.

The proposed driving preference model can support arbitrary number of travel costs ofinterest and distributions of cost ratios.

Comprehensive experiments were done conducted on a substantial, real trajectorydata set to show efficiency and effectiveness.

23

1. Indexing the road network.

2. Modeling drivers’ preferences from their trajectories considering multiple travel costs.

3. Selecting sub-trajectories according to source, destination, departure time and driver’spreferences.

4. constructing a small graph (Zohreh) with appropriate edge weights reflecting how thedriver would like to use the edges based on the selected trajectories.

5. Returning the shortest route in the small graph as the personalized route to the driver.

24

Route Planning.

Route Planning Using Trajectories: no driver modeling:

Most popular route (MPR)

Time period-based most popular route (TPMPR)

Top-k popular routes (TKPR)

Personalized Route Planning:

TRIP: closest work, tested over a smaller dataset, can only model travel time

25

GPS records:

52,211 taxis in Beijing.

during 2012-09-30 to 2012-11-30.

One GPS record is collected in every 5 seconds or less.

Road network:

6th street in Beijing.

28,342 vertices and 38,577 edges.

60 km60 km square region.

26

Trajectories:

32,379,248 trajectories.

starts when a passenger got in the taxi and ends when the passenger got off the taxi.

Travel Costs:

Travel distance.

Travel time.

Fuel consumption.

27

Presented by: Shahab Helmi

Spring 2016

Authors:

Publication:

ICDE 2015

Type:

Research Paper

29

an episode (serial episode) is usually defined as a totally ordered set of eventsthat occur relatively close to each other.

of an episode: how frequently it occurs in a sequence. Frequency count methods:

Window-based

Non-overlapped occurrences

Non-interleaved occurrences

Total Frequency

can capture the most intense correlation between events.

30

Previous studies on frequent episode mining (FEM) mostly process the data in offlinemode. Shortcomings:

This process may take hours or days.

Testing whether an episode occurs in a sequence is an NP-complete problem.

Real world application challenges:

Fast-growing data.

Recency effect: only freshets pattern from recent events are of interest (highfrequency trading → a trader holds stock only for 22 seconds in average).

Time-critical analysis: a delay may lead to drastic loss (predictive maintenance of datacenters).

31

Online FEM algorithm challenges:

1. Infrequent events at the current moment could become frequent in the future → theycannot be discarded.

2. Then a compact and effective data structure is required to store all episodeoccurrences.

3. Mining all minimal occurrences of episode also becomes a big challenge over thegrowing sequence.

For #1&#2 a data structure, named episode trie, is proposed. It’s a prefix tree similarto on we have in FP-Growth.

For #3, an algorithm, named mining frequent serial episode via last occurrence(MESELO), is proposed. Claimed to be the first online FEM algorithm.

The last occurrence concept is defined for the first time: when was the last time inwhich an episode occurred → using the last occurrence information, MESELO cangenerate new minimal frequent episodes after receiving new data.

Complexity analysis: O(MW), it’s claimed that in most real world applications W issmall, so it works well.

32

There is no previous work of online FEM.

Frequent episode mining on event sequences (main difference is the frequency count method):

Alarm sequences in telecom networks

Web navigation logs

Timestamped fault reports in car manufacturing plants

Sales transactions and stock data

News

Breath first approaches (Apriori-based)

Depth first approaches (prefix tree)

33

Online frequent itemset mining (approximate and exact methods):

MOMENT

Can-Tree

SWIM

FP-Stream

FDPM

…

They does not apply on episode mining since in frequent itemset mining the time information isnot important while episodes are ordered according to their occurrence time. Hence, keepingthe tree up-to-date is hard. Using the last occurrence concept it becomes more efficient.

34

Server 1 (for algorithm):

Intel Xeon E5-2620 2.00 GHz processor

32 GB memory

Windows server 2008

Server 2 (for MySQL database):

Intel Xeon E5-2620 2.00 GHz processor

16 GB memory

Linux

35

36

37

1. Online mode: MESELO-BS is similar to MESELObut does not use theconcept of last episodeoccurrence.

2. Offline mode: theperformance ofMESELO is compare tobaseline algorithms.

• X axis shows window size• MinSupport is 10

Window size

MinSupprt

Documents

Presented by: Dardan Xhymshiti Spring 2016cse.ucdenver.edu/~bdlab/seminar/2016/2.pdf · Skyline is expensive to compute especially in large datasets. The recent multi-core skyline