MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce

Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, Jianping Fan

INTRODUCTION

This paper focuses on parallel density-based data clustering in a shared-nothing cluster environment.

Data clustering is an essential data mining technique that reveals macroscopic patterns in data. Because datasets have grown so large, parallel data clustering algorithms are needed.

In this paper, the authors propose a parallel density-based clustering algorithm and implement it in a 4-stage MapReduce paradigm. They:

• Adopt a quick partitioning strategy for large-scale non-indexed data

• Study the metric of merging among bordering partitions, and its optimizations

• Evaluate on a real large-scale dataset (approx. 1.9 billion GPS log records)


Introduction

Clustering techniques

Pros of DBScan

• Divide data into clusters with arbitrary shapes

• Does not require the number of the clusters a priori

• Insensitive to the order of the points in the dataset

Cons of DBScan

• The sizes of datasets are growing so large that they cannot be held on a single machine

• Much higher computational complexity than K-means

=> PARALLELIZE using MapReduce!! (how simple..)


Background : DBScan

DBSCAN (Martin Ester et al., KDD 1996)

The key idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of points (MinPts).

• Directly density-reachable (DDR): o is DDR p if p ∈ N_Eps(o) and Card(N_Eps(o)) ≥ MinPts.

• Density-reachable (DR): if there is a chain of points {p_i | i = 0, ..., n} such that each p_i is DDR p_{i+1}, then p_i is DR t, where t ∈ {p_j | j = i + 1, ..., n}. (canonical extension)

• Density-connected (DC): if o is DR p and o is DR q, then p is DC q. (symmetric version)
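The definitions above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation; the function names (`eps_neighborhood`, `is_core`, `directly_density_reaches`) and the sample points are assumptions made for the example, and the neighborhood is taken to include the point itself.

```python
from math import dist

def eps_neighborhood(points, o, eps):
    """Indices of all points within distance eps of points[o] (o itself included)."""
    return [i for i, p in enumerate(points) if dist(points[o], p) <= eps]

def is_core(points, o, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(eps_neighborhood(points, o, eps)) >= min_pts

def directly_density_reaches(points, o, p, eps, min_pts):
    """The DDR relation: p lies in N_Eps(o) and o is a core point."""
    return p in eps_neighborhood(points, o, eps) and is_core(points, o, eps, min_pts)

# Hypothetical sample: four points in a tight square plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.0, 1.0)]
```

With Eps = 0.15 and MinPts = 4, point 0 is a core point and directly density-reaches point 3, while the isolated point 4 is not a core point and reaches nothing.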


Background : DBScan

Class of point:

• Unclassified

• Core

• Border

• Noise


Background : MapReduce

Borrows from functional programming

Users should implement two primary methods:

• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(k3, v3)
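To make the two signatures concrete, here is a minimal single-process simulation of the map, group-by-key, and reduce steps. It is only a sketch of the programming model, not Hadoop's API; `run_mapreduce`, `map_taxi`, and `reduce_count` are hypothetical names, and the records (offset, taxi id) are invented for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (k1, v1), group the emitted values by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # sorted only to make the output order deterministic
        output.extend(reduce_fn(k2, groups[k2]))
    return output

def map_taxi(_offset, taxi_id):
    """Map: (k1, v1) -> list(k2, v2). Emit one count per GPS record, keyed by taxi id."""
    return [(taxi_id, 1)]

def reduce_count(taxi_id, ones):
    """Reduce: (k2, list(v2)) -> list(k3, v3). Sum the counts for one taxi id."""
    return [(taxi_id, sum(ones))]

records = [(0, "t1"), (1, "t2"), (2, "t1")]
result = run_mapreduce(records, map_taxi, reduce_count)
```

In a real deployment the grouping step is the shuffle performed by the framework across machines; here it is just a dictionary.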


Background : MapReduce


Design And Implementation

Problem Statement

Given a set of d-dimensional points DB = {p1, p2, ..., pn}, a minimal density of clusters defined by Eps and MinPts, and a set of computers CP = {C1, C2, ..., Cn} managed by a MapReduce platform, find the density-based clusters with respect to the given Eps and MinPts values.

Overall Framework


Stage 1 : Preprocessing

Summarize the spatial distribution, then generate a grid-based partition.

Main challenges for a partitioning strategy:

1) Load balancing

2) Minimized communication

One possible solution is to build an efficient spatial index.

However, the authors do not adopt a well-known indexing method such as R-Tree or KD-Tree, because the iterative recursion needed to build a hierarchical structure is not practical in the MapReduce paradigm.

Instead, the authors use a partitioning algorithm on MapReduce adapted from the grid file.



[Figure: partitioning pipeline. Raw data → bucket counting (in the example, 10 buckets created with interval 0.1) → compute the spatial distribution for each dimension → partitioning (proposed metrics: avg, m). Histogram axes: Bucket ID vs. Count.]
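The bucket-counting idea can be illustrated along a single dimension. The sketch below is an assumption-laden simplification, not the authors' partitioning algorithm: it builds an equal-width histogram over values in [0, 1) and then greedily closes a partition whenever the running count reaches the per-partition average. The names `bucket_counts` and `split_buckets` and the sample values are hypothetical.

```python
def bucket_counts(values, n_buckets=10):
    """Histogram of values in [0, 1) using equal-width buckets."""
    counts = [0] * n_buckets
    for v in values:
        counts[min(int(v * n_buckets), n_buckets - 1)] += 1
    return counts

def split_buckets(counts, n_parts):
    """Greedy split: close a partition once its running count reaches the average.

    Returns a list of (first_bucket, last_bucket) ranges, inclusive.
    """
    target = sum(counts) / n_parts
    ranges, start, running = [], 0, 0
    for b, c in enumerate(counts):
        running += c
        if running >= target and len(ranges) < n_parts - 1:
            ranges.append((start, b))
            start, running = b + 1, 0
    if start < len(counts):  # remaining buckets form the last partition
        ranges.append((start, len(counts) - 1))
    return ranges

values = [0.05, 0.07, 0.11, 0.52, 0.55, 0.58, 0.91, 0.93]
counts = bucket_counts(values)
parts = split_buckets(counts, 2)
```

In a MapReduce setting the histogram itself is a natural counting job (map each point to its bucket ID, reduce by summing), and only the small histogram is needed to choose the cuts.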



Shape of the Partition

Necessity of access to remote data

• For a given Eps and MinPts = 5, if remote data cannot be accessed, the neighborhood of object p1 would contain only 3 points, which is fewer than MinPts, and therefore p1 would not be classified as a core point.

• Therefore, to obtain correct clustering results, a “view” over the border of partitions is necessary

So, the shape of a partition is S + halo

[Figure: two bordering partitions with their halos. Labels: S1 (or i), S2 (or i+1), halo, outer halo, inner halo, Eps.]
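The "S + halo" shape can be sketched as a replication rule: a point is sent to every partition whose box, extended by the halo, contains it. The code below assumes axis-aligned boxes with half-open sides and a halo of width Eps on every side; these are assumptions for illustration, and `in_extended_box` and `partitions_for` are hypothetical names.

```python
def in_extended_box(point, box, eps):
    """True if point lies in box after extending every side by eps.

    box is a tuple of (lo, hi) half-open intervals, one per dimension.
    """
    return all(lo - eps <= x < hi + eps for x, (lo, hi) in zip(point, box))

def partitions_for(point, boxes, eps):
    """Ids of all partitions that must receive a copy of point."""
    return [i for i, box in enumerate(boxes) if in_extended_box(point, box, eps)]

# Two hypothetical partitions splitting the unit square at x = 0.5.
boxes = [((0.0, 0.5), (0.0, 1.0)), ((0.5, 1.0), (0.0, 1.0))]
eps = 0.05
```

A point far from the shared border goes to one partition only, while a point within Eps of the border is copied to both, which is what gives each node the "view" over its neighbor's border.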


Stage 2 : Local DBSCAN

The local DBSCAN algorithm is very similar to DBSCAN.

The difference: for a non-noise point q in the outer halo, we do not know at this stage whether q is a core point or a border point.

• (because each computing node runs in a shared-nothing environment)

Such points are given the status “Onqueue” and put into the MergeCandidates set (MC).
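A rough sketch of this stage: run an ordinary DBSCAN on one partition's points, then collect every clustered (non-noise) point that lies in the halo as a merge candidate. This is a brute-force illustration under stated assumptions, not the authors' code; `local_dbscan`, `merge_candidates`, and the caller-supplied `in_halo` predicate are hypothetical, and the "Onqueue" status is represented simply by membership in the returned list.

```python
from math import dist

NOISE = -1

def local_dbscan(points, eps, min_pts):
    """Plain DBSCAN over one partition; returns a cluster id (or NOISE) per point."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # only core points expand the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels

def merge_candidates(points, labels, in_halo):
    """(point index, local cluster id) for every non-noise point inside the halo."""
    return [(i, labels[i]) for i, p in enumerate(points)
            if labels[i] != NOISE and in_halo(p)]

# Hypothetical 1-D partition S = [0, 0.5) with Eps = 0.05, so the halo spans [0.45, 0.55).
points = [(0.40,), (0.43,), (0.46,), (0.49,), (0.52,), (0.90,)]
labels = local_dbscan(points, eps=0.05, min_pts=3)
mc = merge_candidates(points, labels, lambda p: 0.45 <= p[0] < 0.55)
```

Here the first five points form one local cluster, the last point is noise, and the three clustered points inside the halo end up in the MC set for the next stage.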


Stage 3 : Find Merging Mapping

Character of MC set

• The composition of MC set

• The completeness of MC set

[Figure labels: q is not in the halo; q is a core point; more than one neighbor is on the halo; O is a core point or border point on the halo]



Whether clusters of adjacent spaces need to be merged or not



Theorem 1: Let MC1(C1, S1) = AP1 ∪ BP1, where AP1 is the set of core points and BP1 is the set of border points w.r.t. space constraint S1. Let MC2(C2, S2) = AP2 ∪ BP2, where AP2 is the set of core points and BP2 is the set of border points w.r.t. space constraint S2. If S1 and S2 are bordering


Stage 4 : Merge

Build Global Mapping -> Merge and Relabel
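One common way to build such a global mapping is a union-find (disjoint-set) structure over local cluster ids; the slides do not say which data structure the authors use, so this is an assumed technique for illustration. Local clusters are identified here by hypothetical (partition id, local cluster id) pairs, and `merge_pairs` stands for the output of the previous stage.

```python
def build_global_mapping(local_clusters, merge_pairs):
    """Map each local cluster to a global cluster id, joining clusters listed in merge_pairs."""
    parent = {c: c for c in local_clusters}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps the trees shallow
            c = parent[c]
        return c

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller id becomes the representative

    roots = sorted({find(c) for c in local_clusters})
    global_id = {root: g for g, root in enumerate(roots)}
    return {c: global_id[find(c)] for c in local_clusters}

# Hypothetical input: cluster 0 of partitions 0, 1 and 2 form one chain of merges.
local_clusters = [(0, 0), (0, 1), (1, 0), (2, 0)]
merge_pairs = [((0, 0), (1, 0)), ((1, 0), (2, 0))]
mapping = build_global_mapping(local_clusters, merge_pairs)
```

Relabeling is then a lookup: each point's (partition id, local cluster id) is replaced by `mapping[...]`, which can be done in a final map-only pass.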


Evaluation

Experiment environment

• 13-node cluster

• Each node has a 3.0 GHz i7 950 (quad-core), 8 GB RAM, 2 TB HDD

• Ubuntu 10.10

• Hadoop 0.20.2

• Block size: 64 MB

Data Set

• Shanghai taxi GPS logs


Evaluation

Each location point is normalized into the range [0, 1)

Two DBSCAN configurations

WL-1

• Eps: 0.002, MinPts: 1,000

WL-2

• Eps: 0.0002, MinPts: 100
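Because Eps is expressed in normalized units, it helps to see how it relates to the original coordinates. The slides do not state how the normalization was done, so the sketch below assumes a simple linear rescaling with caller-supplied bounds; `normalize`, `eps_in_original_units`, and the bounds 121.0 and 122.0 are hypothetical.

```python
def normalize(value, lo, hi):
    """Linearly map value from [lo, hi) to [0, 1), up to floating-point rounding."""
    if not lo <= value < hi:
        raise ValueError("value outside [lo, hi)")
    return (value - lo) / (hi - lo)

def eps_in_original_units(eps, lo, hi):
    """Convert a radius given in normalized units back to the original coordinate units."""
    return eps * (hi - lo)

# Hypothetical bounds for one coordinate axis.
x = normalize(121.5, 121.0, 122.0)
radius = eps_in_original_units(0.002, 121.0, 122.0)
```

Under these assumed bounds the WL-1 radius of 0.002 corresponds to 0.002 units of the original axis, and the WL-2 radius is ten times smaller.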


Evaluation

[Figure: WL-1 results. Labels: SPD = 120; 12-node; datasets ds4, ds3, ds2, ds1; (2/12), (4/12), (6/12).]

Conclusions

In this paper, we implement an efficient parallel DBScan algorithm in a 4-stage MapReduce paradigm. We analyze and propose a practical data partitioning strategy for large-scale non-indexed spatial data. We apply our work to a real-world spatial dataset containing over 1.9 billion raw GPS records, and run our experiments on a lab-size 13-node cluster. The experimental results show good speedup and scale-up performance.

We observe that road-map-based spatial data is highly skewed along the road network. If a main road happens to lie in the replication area after partitioning, computation and data replication increase dramatically. One direction for future work is to make the partitioning strategy aware of this observation and minimize the size of the MC sets. The challenge is that its performance is still highly restricted by the distribution of the raw spatial data.
