MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, Jianping Fan




INTRODUCTION

This paper mainly focuses on “Parallel Density-based Data Clustering” in a shared-nothing cluster environment.

Data clustering is an essential data mining technique that reveals macroscopic patterns in data. Due to the size of datasets, there is a need to develop parallel data clustering algorithms.

In this paper, the authors propose a parallel density-based clustering algorithm and implement it in a 4-stage MapReduce paradigm.

• Adopt a quick partitioning strategy for large-scale non-indexed data

• Study the metric of merging among bordering partitions, and optimizations

• Evaluate on real large-scale datasets (approx. 1.9 billion GPS log records)


Introduction

Clustering techniques

Pros of DBSCAN

• Divide data into clusters with arbitrary shapes

• Does not require the number of the clusters a priori

• Insensitive to the order of the points in the dataset

Cons of DBSCAN

• Datasets are growing so large that they cannot be held on a single machine

• Much higher computation complexity compared with K-means

=> PARALLELIZE using MapReduce!! (what a simple idea..)


Background : DBScan

DBSCAN (Martin Ester et al., KDD, 1996)

The key idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of points (MinPts).

• Directly density-reachable (DDR): p is DDR from o if p ∈ NEps(o) and Card(NEps(o)) ≥ MinPts.

• Density-reachable (DR): if there is a chain of points {pi | i = 0, ..., n} such that each pi+1 is DDR from pi, then every t ∈ {pj | j = i + 1, ..., n} is DR from pi. (canonical extension of DDR)

• Density-connected (DC): if p is DR from o and q is DR from o, then p is DC to q. (symmetric version)
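The definitions above can be made concrete with a small sketch. It is not the paper's code: the function names, the brute-force neighborhood search and the sample points are assumptions for illustration, and it uses the standard DBSCAN core condition Card(NEps(o)) ≥ MinPts.

```python
from math import dist

def region_query(points, o, eps):
    """Return N_Eps(o): indices of all points within eps of points[o] (o included)."""
    return [i for i, p in enumerate(points) if dist(points[o], p) <= eps]

def is_core(points, o, eps, min_pts):
    """o is a core point when Card(N_Eps(o)) >= MinPts."""
    return len(region_query(points, o, eps)) >= min_pts

def directly_density_reachable(points, p, o, eps, min_pts):
    """p is DDR from o: p lies in N_Eps(o) and o is a core point."""
    return p in region_query(points, o, eps) and is_core(points, o, eps, min_pts)

# Hypothetical sample data: a dense group, one point on its edge, one far outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.35, 0.1), (2.0, 2.0)]
EPS, MIN_PTS = 0.3, 4

edge_from_core = directly_density_reachable(points, 4, 3, EPS, MIN_PTS)  # True
core_from_edge = directly_density_reachable(points, 3, 4, EPS, MIN_PTS)  # False
```

Point 4 is DDR from the core point 3, but not the other way round, which is why DDR is not symmetric and the density-connected relation is needed.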


Background : DBScan

Classes of points:

• Unclassified

• Core

• Border

• Noise


Background : MapReduce

Borrows from functional programming

Users should implement two primary methods:

• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(k3, v3)
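To illustrate the two signatures, here is a single-process sketch of the map, shuffle and reduce flow. It does not use the Hadoop API; `run_mapreduce`, `word_map` and `count_reduce` are hypothetical names, and word counting is only a stand-in workload.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn, group intermediate pairs by key (the shuffle), then apply reduce_fn."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):          # Map: (k1, v1) -> list(k2, v2)
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))  # Reduce: (k2, list(v2)) -> list(k3, v3)
    return output

def word_map(_line_no, line):
    """Emit (word, 1) for every word of one input line."""
    return [(word, 1) for word in line.split()]

def count_reduce(word, ones):
    """Sum the counts collected for one word."""
    return [(word, sum(ones))]

result = run_mapreduce([(0, "a b a"), (1, "b c")], word_map, count_reduce)
# result == [("a", 2), ("b", 2), ("c", 1)]
```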


Design And Implementation

Problem Statement

Given a set of d-dimensional points DB = {p1, p2, ..., pn}, a minimal density of clusters defined by Eps and MinPts, and a set of computers CP = {C1, C2, ..., Cn} managed by a MapReduce platform, find the density-based clusters with respect to the given Eps and MinPts values.

Overall Framework


Stage 1 : Preprocessing

Summarize the spatial distribution, then generate a grid-based partition.

Main challenges for a partitioning strategy

1) Load balancing

2) Minimized communication

One possible solution is to build an efficient spatial index.

However, the authors do not adopt a well-known indexing method such as R-Tree or KD-Tree, because the recursive iteration needed to build a hierarchical structure is not practical in the MapReduce paradigm.

Instead, the authors use a partitioning algorithm on MapReduce adapted from the grid file.


Stage 1 : Preprocessing

1) Raw data

2) Bucket counting (in the example, 10 buckets created with an interval of 0.1)

3) Compute the spatial distribution for each dimension

4) Partitioning (proposed metrics: avg, m)

[Figure: histogram of Count per Bucket ID]
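The bucket-counting step might look like the following sketch. It assumes coordinates already normalized into [0, 1) and 10 buckets per dimension (interval 0.1); `bucket_counts` and the sample points are hypothetical, and the partitioning step with its metrics is not sketched here.

```python
from collections import Counter

def bucket_counts(points, n_buckets=10):
    """Count points per (dimension, bucket id); coordinates are assumed to lie in [0, 1)."""
    counts = Counter()
    for point in points:
        for dim, x in enumerate(point):
            # Multiply instead of dividing by the interval to avoid cases such as
            # int(0.3 / 0.1) == 2; the min() guards against a coordinate of exactly 1.0.
            bucket = min(int(x * n_buckets), n_buckets - 1)
            counts[(dim, bucket)] += 1
    return counts

sample = [(0.05, 0.93), (0.12, 0.95), (0.17, 0.41)]
histogram = bucket_counts(sample)
# Dimension 0: bucket 0 holds 1 point, bucket 1 holds 2.
# Dimension 1: bucket 4 holds 1 point, bucket 9 holds 2.
```

In a MapReduce job the same logic would split naturally: the map side emits ((dimension, bucket id), 1) per coordinate and the reduce side sums the ones.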


Stage 1 : Preprocessing

Shape of the partition

Necessity of access to remote data

• For a given Eps and MinPts = 5, if remote data cannot be accessed, the neighborhood of object p1 contains only 3 points, which is fewer than MinPts, so p1 would not be a core point.

• Therefore, to obtain correct clustering results, a “view” over the border of partitions is necessary

So, the shape of the partition is S + halo

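[Figure: two bordering partitions, S1 (or Si) and S2 (or Si+1), with the halo around their border split into an inner halo and an outer halo; Eps is marked on the figure]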
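The p1 example can be reproduced with a small sketch. It assumes partitions are vertical strips cut along dimension 0 and that the outer halo of a strip is the band of width Eps just outside it; `assign_with_halo`, the cut position and the sample points are all hypothetical.

```python
from math import dist

def assign_with_halo(points, cuts, eps):
    """Assign each point to its own strip and, as a replica, to strips within eps of it.

    cuts: sorted interior cut positions along dimension 0 (assumed layout).
    Returns {partition_id: [(point, is_replica), ...]}.
    """
    bounds = [float("-inf")] + list(cuts) + [float("inf")]
    parts = {i: [] for i in range(len(bounds) - 1)}
    for p in points:
        x = p[0]
        for i in range(len(bounds) - 1):
            lo, hi = bounds[i], bounds[i + 1]
            if lo <= x < hi:
                parts[i].append((p, False))      # home partition S_i
            elif lo - eps <= x < hi + eps:
                parts[i].append((p, True))       # replica in the outer halo of S_i
    return parts

EPS = 0.1
p1 = (0.48, 0.5)
points = [p1, (0.45, 0.5), (0.47, 0.55), (0.52, 0.5), (0.55, 0.52), (0.9, 0.9)]
parts = assign_with_halo(points, cuts=[0.5], eps=EPS)

# Neighborhood of p1 seen from partition 0, without and with the halo replicas.
local_only = [p for p, replica in parts[0] if not replica and dist(p, p1) <= EPS]
with_halo = [p for p, _ in parts[0] if dist(p, p1) <= EPS]
```

Without the halo p1 sees 3 points and fails MinPts = 5; with the replicated points it sees 5 and is a core point.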


Stage 2 : Local DBSCAN

Local DBSCAN is very similar to the original DBSCAN.

The difference: for a non-noise point q in the outer halo, we cannot tell at this stage whether q is a core point or a border point.

• (because each computing node works in a shared-nothing environment)

Such points are given the “Onqueue” status and put into the MergeCandidates set (MC).
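A rough sketch of this stage follows. It runs textbook DBSCAN with a brute-force neighbor search on one partition, then collects every non-noise point flagged as lying in the outer halo. The function name, the boolean halo flags and the sample data are assumptions, and the paper's actual procedure may track more state than this.

```python
from math import dist

NOISE = -1

def local_dbscan(points, in_outer_halo, eps, min_pts):
    """Plain DBSCAN on one partition; returns (labels, merge_candidates).

    points: coordinate tuples of the partition plus its halo replicas.
    in_outer_halo: list of bools, True where the point is a halo replica.
    """
    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)              # None means unclassified
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE                  # may turn into a border point later
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id         # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:           # j is a core point, keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    # Non-noise points in the outer halo: their true class is unknown locally.
    merge_candidates = [i for i, halo in enumerate(in_outer_halo)
                        if halo and labels[i] != NOISE]
    return labels, merge_candidates

pts = [(0.45, 0.5), (0.47, 0.55), (0.48, 0.5), (0.52, 0.5), (0.55, 0.52), (0.1, 0.1)]
halo = [False, False, False, True, True, False]
labels, mc = local_dbscan(pts, halo, eps=0.1, min_pts=5)
# labels == [0, 0, 0, 0, 0, -1]; mc == [3, 4]
```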


Stage 3 : Find Merging Mapping

Characteristics of the MC set

The composition of the MC set

The completeness of the MC set

• q is not in the halo

• q is a core point; more than one neighbor is in the halo

• o is a core point or a border point in the halo


Stage 3 : Find Merging Mapping

Deciding whether clusters in adjacent spaces need to be merged


Stage 3 : Find Merging Mapping

Let MC1(C1, S1) = AP1 ∪ BP1, where AP1 is the set of core points and BP1 is the set of border points.

Theorem 1: Let MC1(C1, S1) = AP1 ∪ BP1, where AP1 is the set of core points and BP1 is the set of border points w.r.t. space constraint S1. Let MC2(C2, S2) = AP2 ∪ BP2, where AP2 is the set of core points and BP2 is the set of border points w.r.t. space constraint S2. If S1 and S2 are bordering


Stage 4 : Merge

Build Global Mapping -> Merge and Relabel
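One common way to build such a global mapping is a union-find structure over local cluster ids. The slides do not say which data structure the authors use, so the sketch below is an assumption, as are the (partition, cluster) id format and the function names.

```python
def build_global_mapping(merge_pairs):
    """Union-find over local cluster ids; returns {local_id: global_id}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)   # smaller id becomes the representative
    return {x: find(x) for x in list(parent)}

def relabel(local_labels, mapping):
    """Replace each local cluster id by its global id; ids without a mapping are kept."""
    return [mapping.get(c, c) for c in local_labels]

# Hypothetical merge decisions between clusters of bordering partitions.
pairs = [(("S1", 0), ("S2", 3)), (("S2", 3), ("S3", 1))]
mapping = build_global_mapping(pairs)
relabeled = relabel([("S2", 3), ("S2", 7), ("S3", 1)], mapping)
# relabeled == [("S1", 0), ("S2", 7), ("S1", 0)]
```

Chained merges (S1 with S2, S2 with S3) collapse to a single global id, while a cluster that takes part in no merge keeps its local id.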


Evaluation

Experiment environment

• 13-node cluster

• Each node has a 3.0GHz i7 950 (quad-core), 8GB RAM, 2TB HDD

• Ubuntu 10.10

• Hadoop 0.20.2

• Block size: 64MB

Data set

• Shanghai taxi GPS logs


Evaluation

Each location point is normalized into the range [0, 1)

Two DBSCAN configurations

• WL-1: Eps = 0.002, MinPts = 1,000

• WL-2: Eps = 0.0002, MinPts = 100



Evaluation

[Figure: results for WL-1 (SPD=120), 12-node; datasets ds4, ds3, ds2, ds1; labels (2/12), (4/12), (6/12)]


Conclusions

In this paper, we implement an efficient parallel DBSCAN algorithm in a 4-stage MapReduce paradigm. We analyze and propose a practical data partitioning strategy for large-scale non-indexed spatial data. We apply our work to a real-world spatial dataset, which contains over 1.9 billion raw GPS records, and run our experiments on a lab-size 13-node cluster. The experimental results show very good speedup and scale-up.

We observe that road-map-based spatial data is highly skewed along the road network. If a main road happens to lie in the replication area after partitioning, computation and data replication increase dramatically. One direction for future work is to improve the partitioning strategy so that it takes this observation into account and minimizes the size of the MC sets. The challenge is that its performance is still highly restricted by the distribution of the raw spatial data.