Scalable Distributed Real-Time Clustering for Big Data Streams


European Masters in Distributed Computing (EMDC)

Student Antonio Severien [email protected]

Supervisors: Albert Bifet (Yahoo! Research), Gianmarco De Francisci Morales (Yahoo! Research), Marta Arias (Universitat Politecnica de Catalunya)

Contributions

¤  SAMOA (Scalable Advanced Massive Online Analysis)

    ¤  Stream Processing Engine (SPE) abstraction framework

    ¤  Machine learning libraries adapter layer

    ¤  API for implementing data flow topologies

¤  SAMOA Clustering Algorithm

    ¤  Distributed stream clustering algorithm based on CluStream*

    ¤  Parallelize clustering task and scale-up on resource usage

27/06/13


(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003

Motivation

¤  How BIG is BIG in BIG Data?

    ¤  2.5 quintillion bytes generated every day

    ¤  90% of today's data was generated in the last 2 years

    ¤  Sensors, social networks, e-business, mobile, internet logs, etc.

¤  Problems: the 3 Vs

    ¤  Volume: storing everything is unviable

    ¤  Velocity: the production rate keeps increasing

    ¤  Variety: different sources, different data, different types


Where is the Big Data?

¤  Where is the food?

    ¤  Databases?

    ¤  Data warehouses?

    ¤  Distributed databases?

    ¤  Distributed file systems?

¤  It’s flowing online! It’s Streaming!


Crunching Big Data

¤  Map and Reduce

    ¤  MapReduce/GFS

    ¤  Hadoop/HDFS

¤  Stream Processing Engines (SPE)

    ¤  Apache S4

    ¤  Twitter Storm


Distributed Systems

¤  Actors Model

    ¤  Independent concurrent processes

    ¤  Communicate asynchronously by message passing

¤  MapReduce Model

    ¤  Mappers: filtering and sorting

    ¤  Reducers: summary and aggregation

    ¤  Large volume of data, distributed

    ¤  Iterative: map-reduce-map-reduce…


Streaming

¤  Streaming Model

    ¤  One-pass processing: discard item after use

    ¤  Low memory usage: store statistics and summaries

    ¤  Unbounded flow of data

    ¤  Evolving data sets

    ¤  Limited processing time

    ¤  Arrival order is not guaranteed
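
The one-pass, low-memory idea can be sketched with a running summary. The class below is a hypothetical helper (not part of SAMOA or MOA) that assumes numeric items and uses Welford's update to keep a mean and variance without storing the stream.

```java
// Hypothetical helper, not part of SAMOA or MOA: keeps a running mean and
// variance with Welford's one-pass update, so each item can be discarded after use.
final class RunningStats {
    private long count = 0;     // items seen so far
    private double mean = 0.0;  // running mean
    private double m2 = 0.0;    // running sum of squared deviations from the mean

    // Consume one stream item in O(1) time and O(1) memory.
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    // Population variance of the items seen so far (0 for an empty stream).
    double variance() { return count > 0 ? m2 / count : 0.0; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.update(x); // the item is not stored anywhere
        }
        System.out.println("mean=" + stats.mean() + " variance=" + stats.variance());
    }
}
```

Only three numbers are kept no matter how long the stream runs, which is the property the micro-cluster summaries later in the deck rely on.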


Making sense

¤  Machine Learning & Data Mining

    ¤  Make sense, extract patterns and react accordingly

    ¤  Train machines to “think”

    ¤  Perceive behavior

    ¤  Relations between similar information

¤  Unsupervised Learning

    ¤  Clustering algorithms


Machine Learning Tools

¤  Mahout

    ¤  Machine learning framework used on top of Hadoop/HDFS

    ¤  Batch processing with MapReduce model

    ¤  Open-source and good community support

¤  Massive Online Analysis (MOA)

    ¤  Stream machine learning tool

    ¤  Many algorithms implemented; based on WEKA

    ¤  Single machine constraint

¤  Jubatus

    ¤  Distributed streaming machine learning framework

    ¤  No clustering algorithms yet

    ¤  No stream platform abstraction


Scalable Advanced Massive Online Analysis (SAMOA)

¤  Distributed data streaming machine learning framework

    ¤  Stream Processing Engine (SPE) abstraction

    ¤  Code once, run everywhere

    ¤  Focus on distributed algorithm design

    ¤  Fault-tolerance, communication, consistency and availability are provided by the underlying distributed processing platform

¤  Initial release provides integration with:

    ¤  Apache S4

    ¤  Twitter Storm
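
The "code once, run everywhere" idea is an adapter pattern. The sketch below only illustrates that pattern: `Engine`, `LocalEngine` and `CountingTask` are hypothetical names and are not the SAMOA, S4 or Storm APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// All names here are hypothetical and only illustrate the adapter idea;
// they are not the SAMOA, S4 or Storm APIs.
interface Engine {
    // Register a processing step and return a sink that feeds events into it.
    <T> Consumer<T> deploy(Consumer<T> processor);
}

// Stand-in "platform" that runs the processor in the calling thread.
final class LocalEngine implements Engine {
    @Override
    public <T> Consumer<T> deploy(Consumer<T> processor) {
        // A real adapter would wrap the processor in platform-specific code here.
        return processor;
    }
}

// Algorithm code depends only on the Engine interface, never on a concrete platform.
final class CountingTask {
    final List<String> seen = new ArrayList<>();

    Consumer<String> build(Engine engine) {
        return engine.<String>deploy(event -> seen.add(event));
    }

    public static void main(String[] args) {
        CountingTask task = new CountingTask();
        Consumer<String> sink = task.build(new LocalEngine());
        sink.accept("event-1");
        sink.accept("event-2");
        System.out.println("events processed: " + task.seen.size());
    }
}
```

Swapping `LocalEngine` for another `Engine` implementation would leave `CountingTask` untouched, which is the separation the SPE adapter layer aims for.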



[Figure: SAMOA architecture. Box labels: SAMOA Algorithms & SAMOA-API; SPE Adapter (S4, Storm, other SPE); SAMOA ML Adapter (MOA, other ML libraries)]


( Apache S4 )

¤  Distributed, semi fault-tolerant, stream processing platform

¤  Based on the Actors model and inspired by the MapReduce model

¤  Flexible data flow: any topology and processing unit can be built, not only the mappers-and-reducers design

¤  Specialized in processing events from a stream and emitting events into a stream



SAMOA Topology

[Figure: mapping between a SAMOA topology (Task, EPI, PIs, Stream, stream source) and an S4 App (PEs, Streams, stream source)]


How to use?

¤  Adding an SPE using the API

    ¤  S4ProcessingItem: processing element wrapper

    ¤  S4Stream: wrapper for an S4 stream

    ¤  S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams

    ¤  S4TopologyBuilder: creates the topology instances

¤  Adding an algorithm and building the topology

class SimpleTask {
    ...
    TopologyBuilder topologyBuilder = new TopologyBuilder();
    // Entrance PI wraps the processor that feeds events into the topology
    EntranceProcessingItem entranceProcessingItem = topologyBuilder.createEntrancePI(new SourceProcessor());
    // Stream carrying events out of the entrance PI
    Stream stream = topologyBuilder.createStream(entranceProcessingItem);
    // Downstream PI that consumes the stream
    ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
    processingItem.connectInputKey(stream);
    ...
}


Grouping the Best of All

¤  Flexible programming model

¤  Distributed stream processing engine abstraction

¤  Integrated machine learning and data mining algorithms

¤  Easy API to implement new algorithms and SPE adapters


SAMOA Clustering Algorithm

¤  Distributed stream clustering algorithm

¤  Validate SAMOA implementation and

¤  Integration with Apache S4 using the SAMOA-S4 adapter

¤  Deploy on Apache S4


Stream Clustering Algorithm

¤  CluStream Framework

    ¤  Based on k-means

    ¤  Online phase (micro-clustering)

    ¤  Offline phase (macro-clustering)

¤  k-means: partition a set of data into k distinct clusters according to a similarity function

    ¤  Minimization of squared Euclidean distance objective function:
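
In the usual notation (assumed here, since the slide defines no symbols), with clusters $C_1, \dots, C_k$ and $c_j$ the centroid of cluster $C_j$, the standard k-means objective can be written as:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2,
\qquad
c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```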


K-means Clustering Algorithm

¤  Advantages

    ¤  Simple, fast and efficient

¤  Known issues with k-means

    ¤  Sensitive to initial seeding

    ¤  Minimization problem is NP-hard even for simple configurations

        ¤  e.g. two clusters in general dimension, or points in the plane

    ¤  Global optimum not guaranteed

    ¤  Good for spherical clusters, not good for arbitrary shapes
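
A minimal sketch of the classic Lloyd iteration for k-means makes these points concrete. The class and method names are illustrative, and the initial centers are passed in by the caller, so a different seeding can lead to a different (only locally optimal) result.

```java
import java.util.Arrays;

// Minimal sketch of Lloyd's iteration for k-means; names are illustrative.
final class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Alternates assignment and update steps until assignments stop changing
    // or maxIter is reached. Centers are updated in place; returns the assignment.
    static int[] fit(double[][] points, double[][] centers, int maxIter) {
        int dim = points[0].length;
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point goes to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int j = 1; j < centers.length; j++) {
                    if (dist2(points[i], centers[j]) < dist2(points[i], centers[best])) {
                        best = j;
                    }
                }
                if (assign[i] != best) {
                    assign[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: each center moves to the mean of its points.
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (int i = 0; i < points.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assign[i]][d] += points[i][d];
                }
            }
            for (int j = 0; j < centers.length; j++) {
                if (counts[j] > 0) { // an empty cluster keeps its old center
                    for (int d = 0; d < dim; d++) {
                        centers[j][d] = sums[j][d] / counts[j];
                    }
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}}; // initial seeds chosen by hand
        int[] assign = fit(points, centers, 10);
        System.out.println(Arrays.toString(assign) + " " + Arrays.deepToString(centers));
    }
}
```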


Distributed Stream Clustering

¤  Online micro-clustering

    ¤  Applied in a local clustering phase

    ¤  Cluster Feature Vectors with Timestamp (CFT)

        ¤  N: number of data objects

        ¤  LS: linear sum of data objects

        ¤  SS: sum of squares of data objects

        ¤  LST: sum of timestamps

        ¤  SST: sum of squares of timestamps

¤  Offline macro-clustering

    ¤  Uses micro-clusters as weighted pseudo-points

    ¤  Applied in a global clustering phase with a weighted k-means

        ¤  Uses probabilistic seeding that depends on the micro-cluster weights
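
The CFT summary above can be sketched as a small class. The field names follow the slide (N, LS, SS, LST, SST); the class name, method names and the radius formula are assumptions for illustration and are not taken from the SAMOA code.

```java
// Illustrative cluster feature vector with timestamps (CFT). Field names follow
// the slide; the class and method names are assumptions, not SAMOA code.
final class ClusterFeature {
    long n;             // N: number of data objects
    final double[] ls;  // LS: per-dimension linear sum
    final double[] ss;  // SS: per-dimension sum of squares
    double lst;         // LST: sum of timestamps
    double sst;         // SST: sum of squares of timestamps

    ClusterFeature(int dim) {
        ls = new double[dim];
        ss = new double[dim];
    }

    // Absorb one point; the point itself does not need to be kept afterwards.
    void insert(double[] x, double timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += x[d];
            ss[d] += x[d] * x[d];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    // Additivity: merging two micro-clusters is a component-wise sum.
    void merge(ClusterFeature other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    // Centroid = LS / N (assumes n > 0). In macro-clustering, n acts as the weight.
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < ls.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    // RMS deviation from the centroid: sqrt(sum_d (SS_d / N - (LS_d / N)^2)).
    double radius() {
        double variance = 0.0;
        for (int d = 0; d < ls.length; d++) {
            double m = ls[d] / n;
            variance += ss[d] / n - m * m;
        }
        return Math.sqrt(Math.max(0.0, variance)); // guard against rounding below zero
    }

    public static void main(String[] args) {
        ClusterFeature cf = new ClusterFeature(2);
        cf.insert(new double[] {1, 2}, 1);
        cf.insert(new double[] {3, 4}, 2);
        System.out.println(java.util.Arrays.toString(cf.centroid()) + " r=" + cf.radius());
    }
}
```

Because every field is a sum, local micro-clusters can be merged or shipped to a global clustering step without access to the original points.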


CluStream Snapshot

[Figure: CluStream snapshot showing micro-clusters, macro-clusters and ground truth]


SAMOA Clustering Task

[Figure: clustering task topology with a clustering part and an evaluation part. Components: stream source, Distribution PI, Local Clustering PI, Global Clustering PI, Sampling PI, Evaluator PI, two outputs]


Experiments, Evaluation & Results

¤  Experimental Setup

    ¤  Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM

    ¤  Process parallelism level: 1, 8 & 16

    ¤  Instance dimensions: 3 & 15

    ¤  Source dataset: random events generator

    ¤  Noise: 0% & 10%

    ¤  Cluster movement speed: move 0.1 unit every 500 & 12000 instances

¤  Evaluations

    ¤  Scalability: measure throughput when adding concurrent processes

    ¤  Clustering quality: measure whether the clustering algorithm is accurate


Scalability

[Figure: Baseline Comparison. x-axis: evaluation step; y-axis: throughput (instances/second)]

Scalability

[Figure: Average Throughput with Dimensions 3 and 15. x-axis: process parallelism; y-axis: average throughput (instances/second)]

Scalability

[Figure: Parallelism Throughput with Dimension 3. x-axis: process parallelism; y-axis: avg. cumulative throughput (instances/sec)]

Clustering Quality Metrics

¤  Internal & external evaluations

    ¤  Internal evaluation uses attributes available from the clustering structure

    ¤  External evaluation uses external validation structures

        ¤  e.g. ground truth provided by the source generator

¤  Metrics

    ¤  Cohesion coefficient (SSE): the within-cluster sum of squared errors

    ¤  Separation coefficient (BSS): the between-cluster sum of squares
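
Both metrics can be computed directly from the points and their cluster assignment. The sketch below uses the common textbook definitions (SSE as squared distance of each point to its cluster centroid, BSS as size-weighted squared distance of each centroid to the global mean); it is an assumption that the evaluation in this work uses exactly these forms, and the class name is illustrative.

```java
// Illustrative computation of cohesion (SSE) and separation (BSS) using the
// common textbook definitions; assumed, not taken from the SAMOA evaluator.
final class ClusterQuality {
    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Mean of each cluster; a cluster with no points keeps a zero vector.
    static double[][] centroids(double[][] points, int[] assign, int k) {
        int dim = points[0].length;
        double[][] c = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assign[i]]++;
            for (int d = 0; d < dim; d++) {
                c[assign[i]][d] += points[i][d];
            }
        }
        for (int j = 0; j < k; j++) {
            if (counts[j] > 0) {
                for (int d = 0; d < dim; d++) {
                    c[j][d] /= counts[j];
                }
            }
        }
        return c;
    }

    // SSE: sum over all points of the squared distance to their own centroid.
    static double sse(double[][] points, int[] assign, double[][] centroids) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            total += dist2(points[i], centroids[assign[i]]);
        }
        return total;
    }

    // BSS: sum over clusters of size times squared distance to the global mean.
    static double bss(double[][] points, int[] assign, double[][] centroids) {
        int dim = points[0].length;
        double[] global = new double[dim];
        int[] counts = new int[centroids.length];
        for (int i = 0; i < points.length; i++) {
            counts[assign[i]]++;
            for (int d = 0; d < dim; d++) {
                global[d] += points[i][d] / points.length;
            }
        }
        double total = 0.0;
        for (int j = 0; j < centroids.length; j++) {
            total += counts[j] * dist2(centroids[j], global);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        int[] assign = {0, 0, 1, 1};
        double[][] c = centroids(points, assign, 2);
        System.out.println("SSE=" + sse(points, assign, c) + " BSS=" + bss(points, assign, c));
    }
}
```

Lower SSE means tighter clusters and higher BSS means better separated ones; when the centroids are the cluster means, the two add up to the total sum of squares around the global mean.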


Clustering Quality 0% Noise


Snapshot 25,000 instances




Ratio = BSS / GT





Good clustering

Poor clustering



Conclusion

¤  There is important information in the massive amount of data being produced and discarded

¤  There is a need for tools to deal with this efficiently

¤  Efforts have been made to crunch big data

¤  Interpreting and retrieving relevant information is where machine learning and data mining operate

¤  Real-time analysis allows a faster response to evolving data

¤  SAMOA abstracts the platform and maintains the algorithms; good to implement, test and use


Acknowledgements

¤  Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity

¤  Thanks to all the EMDC students

¤  Thanks to Yahoo! Research for the great project
