
Scalable Distributed Real-Time Clustering for Big Data Streams


Analyzing and applying machine learning algorithms to a possibly infinite flow of data is a challenging task. This presentation introduces the SAMOA framework, which allows machine learning algorithms to be developed on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream, running on the Apache S4 platform.


Page 1: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalable Distributed Real-Time Clustering for Big Data Streams

European Masters in Distributed Computing (EMDC)

Student Antonio Severien [email protected]

Supervisors Albert Bifet (Yahoo! Research) Gianmarco De Francisci Morales (Yahoo! Research) Marta Arias (Universitat Politecnica de Catalunya)

Page 2: Scalable Distributed Real-Time Clustering for Big Data Streams

Contributions

¤  SAMOA (Scalable Advanced Massive Online Analysis)

    ¤  Stream Processing Engine (SPE) abstraction framework

    ¤  Machine learning libraries adapter layer

    ¤  API for implementing data flow topologies

¤  SAMOA Clustering Algorithm

    ¤  Distributed stream clustering algorithm based on CluStream*

    ¤  Parallelizes the clustering task and scales up with resource usage

27/06/13

2

(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003

Page 3: Scalable Distributed Real-Time Clustering for Big Data Streams

Motivation

¤  How BIG is BIG in BIG Data?

    ¤  2.5 quintillion bytes are generated every day

    ¤  90% of today's data was generated in the last 2 years

    ¤  Sensors, social networks, e-business, mobile, internet logs, etc.

¤  Problems: the 3 Vs

    ¤  Storage is unviable due to massive Volume

    ¤  Production rate keeps increasing in Velocity

    ¤  Different sources, different data and different types mean Variety


Page 4: Scalable Distributed Real-Time Clustering for Big Data Streams

Where is the Big Data?

¤  Where is the food?

    ¤  Databases?

    ¤  Data warehouses?

    ¤  Distributed databases?

    ¤  Distributed file systems?

¤  It’s flowing online! It’s Streaming!


Page 5: Scalable Distributed Real-Time Clustering for Big Data Streams

Crunching Big Data

¤  Map and Reduce

    ¤  MapReduce/GFS

    ¤  Hadoop/HDFS

¤  Stream Processing Engines (SPE)

    ¤  Apache S4

    ¤  Twitter Storm


Page 6: Scalable Distributed Real-Time Clustering for Big Data Streams

Distributed Systems

¤  Actors Model

    ¤  Independent concurrent processes

    ¤  Communicate asynchronously by message passing

¤  MapReduce Model

    ¤  Mappers: filtering and sorting

    ¤  Reducers: summarization and aggregation

    ¤  Large volumes of distributed data

    ¤  Iterative: map-reduce-map-reduce…


Page 7: Scalable Distributed Real-Time Clustering for Big Data Streams

Streaming

¤  Streaming Model

    ¤  One-pass processing: discard item after use

    ¤  Low memory usage: store statistics and summaries

    ¤  Unbounded flow of data

    ¤  Evolving data sets

    ¤  Limited processing time

    ¤  Arrival order is not guaranteed
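
The one-pass, low-memory idea in the streaming model can be sketched as follows. This is a minimal illustration, not code from SAMOA or MOA: the class name `RunningStats` is hypothetical, and Welford's update is assumed here as one common way to keep a summary (count, mean, variance) while discarding each item after use.

```java
/** One-pass (streaming) mean and variance using Welford's update (illustrative sketch). */
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Consumes one item; only the summary is updated, the item is not stored. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Population variance of the items seen so far (0 when nothing was seen). */
    double variance() {
        return count > 0 ? m2 / count : 0.0;
    }
}
```

Memory use stays constant no matter how many items flow through `add`, which is the property the streaming model relies on.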


Page 8: Scalable Distributed Real-Time Clustering for Big Data Streams

Making sense

¤  Machine Learning & Data Mining

    ¤  Make sense, extract patterns and react accordingly

    ¤  Train machines to “think”

    ¤  Perceive behavior

    ¤  Relations between similar information

¤  Unsupervised Learning

    ¤  Clustering algorithms


Page 9: Scalable Distributed Real-Time Clustering for Big Data Streams

Machine Learning Tools

¤  Mahout

    ¤  Machine learning framework used on top of Hadoop/HDFS

    ¤  Batch processing with MapReduce model

    ¤  Open-source and good community support

¤  Massive Online Analysis (MOA)

    ¤  Stream machine learning tool

    ¤  Many algorithms implemented; based on WEKA

    ¤  Single machine constraint

¤  Jubatus

    ¤  Distributed streaming machine learning framework

    ¤  No clustering algorithms yet

    ¤  No stream platform abstraction


Page 10: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalable Advanced Massive Online Analysis (SAMOA)

¤  Distributed data streaming machine learning framework

    ¤  Stream Processing Engine abstraction

    ¤  Code once, run everywhere

    ¤  Focus on distributed algorithm design

    ¤  Fault tolerance, communication, consistency and availability are provided by the underlying distributed processing platform

¤  Initial release provides integration with:

    ¤  Apache S4

    ¤  Twitter Storm


Page 11: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalable Advanced Massive Online Analysis (SAMOA)

[Figure: SAMOA architecture. Labels: SAMOA Algorithms & SAMOA-API; SPE Adapter (S4, Storm, other SPE); ML Adapter (MOA, other ML libraries).]


Page 15: Scalable Distributed Real-Time Clustering for Big Data Streams

( Apache S4 )

¤  Distributed, semi fault-tolerant, stream processing platform

¤  Based on the Actors model and inspired by the MapReduce model

¤  Flexible data flow: any topology and processing unit can be built, not only the mappers-and-reducers design

¤  Specialized in processing events from a stream and emitting events into a stream


Page 16: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalable Advanced Massive Online Analysis (SAMOA)

[Figure: a SAMOA Topology mapped (MAP) onto an S4 App. SAMOA side: STREAM SOURCE, EPI, Stream, and PIs inside a Task. S4 side: STREAM SOURCE, Streams and PEs.]


Page 17: Scalable Distributed Real-Time Clustering for Big Data Streams

How to use?

¤  Adding SPE using API

    ¤  S4ProcessingItem: processing element wrapper

    ¤  S4Stream: wrapper for an S4 stream

    ¤  S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams

    ¤  S4TopologyBuilder: creates the topology instances

¤  Adding algorithm and building topology

    class SimpleTask {
      ...
      TopologyBuilder topologyBuilder = new TopologyBuilder();
      // Entrance processing item: reads from the source and feeds the topology
      EntranceProcessingItem entranceProcessingItem =
          topologyBuilder.createEntrancePI(new SourceProcessor());
      // Stream leaving the entrance processing item
      Stream stream = topologyBuilder.createStream(entranceProcessingItem);
      // Processing item that consumes the stream
      ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
      processingItem.connectInputKey(stream);
      ...


Page 18: Scalable Distributed Real-Time Clustering for Big Data Streams

Grouping the Best of All

¤  Flexible programming model

¤  Distributed stream processing engine abstraction

¤  Integrated machine learning and data mining algorithms

¤  Easy API to implement new algorithms and SPE adapters


Page 19: Scalable Distributed Real-Time Clustering for Big Data Streams

SAMOA Clustering Algorithm

¤  Distributed stream clustering algorithm

¤  Validate the SAMOA implementation and its integration with Apache S4 using the SAMOA-S4 adapter

¤  Deploy on Apache S4


Page 20: Scalable Distributed Real-Time Clustering for Big Data Streams

Stream Clustering Algorithm

¤  CluStream Framework

    ¤  Based on k-means

    ¤  Online phase (micro-clustering)

    ¤  Offline phase (macro-clustering)

¤  k-means: partition a set of data into k distinct clusters according to a similarity function

    ¤  Minimizes a squared Euclidean distance objective function
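
The objective itself is not spelled out in the text. In its standard form, assuming clusters $C_1, \dots, C_k$ with centroids $\mu_i$ (symbols chosen here for illustration), it reads:

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```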


Page 21: Scalable Distributed Real-Time Clustering for Big Data Streams

K-means Clustering Algorithm

¤  Advantages

    ¤  Simple, fast and efficient

¤  Known issues with k-means

    ¤  Sensitive to initial seeding

    ¤  Minimization problem is NP-hard even for simple configurations

        ¤  e.g., only 2 clusters, or points in the plane

    ¤  Global optimum not guaranteed

    ¤  Good for spherical clusters, not good for arbitrary shapes
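
The sensitivity to initial seeding can be illustrated with a small sketch of Lloyd's heuristic on one-dimensional points. The class `KMeans1D` and its methods are hypothetical names used only for this illustration; this is not the clustering code used in SAMOA or MOA.

```java
/** Lloyd's k-means heuristic on 1-D points (illustrative sketch). */
final class KMeans1D {
    /** Iterates assignment and update steps from the given seeds; returns the final centroids. */
    static double[] fit(double[] points, double[] seeds, int maxIterations) {
        double[] centroids = seeds.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point goes to its nearest centroid.
            for (double p : points) {
                int c = nearest(p, centroids);
                sum[c] += p;
                count[c]++;
            }
            // Update step: each centroid moves to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    changed = true;
                }
                centroids[c] = updated;
            }
            if (!changed) {
                break; // converged
            }
        }
        return centroids;
    }

    /** Index of the centroid closest to p (first one wins on ties). */
    static int nearest(double p, double[] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Sum of squared distances from each point to its nearest centroid. */
    static double sse(double[] points, double[] centroids) {
        double total = 0.0;
        for (double p : points) {
            double d = p - centroids[nearest(p, centroids)];
            total += d * d;
        }
        return total;
    }
}
```

For the points 0, 1, 2, 10, 11, 12, 20, 21, 22 with k = 3, seeds 1, 11, 21 converge to an SSE of 6, while seeds 0, 1, 2 get stuck at centroids 0, 1.5, 16 with an SSE of 154.5. Both runs are fixed points of the heuristic, so the outcome depends on the seeding alone.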


Page 22: Scalable Distributed Real-Time Clustering for Big Data Streams

Distributed Stream Clustering

¤  Online micro-clustering

    ¤  Applied in a local clustering phase

    ¤  Cluster Feature Vectors with Timestamp (CFT)

        ¤  N: number of data objects

        ¤  LS: linear sum of data objects

        ¤  SS: sum of squares of data objects

        ¤  LST: sum of timestamps

        ¤  SST: sum of squares of timestamps

¤  Offline macro-clustering

    ¤  Uses micro-clusters as weighted pseudo-points

    ¤  Applied in a global clustering phase with a weighted k-means

        ¤  Uses probabilistic seeding based on the weights of the micro-clusters
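
The cluster feature vector listed above can be sketched as a small data structure. This is an assumption-laden illustration rather than the SAMOA or MOA implementation: the class `MicroCluster` and its method names are hypothetical, and only the five fields (N, LS, SS, LST, SST) come from the slide.

```java
/** Cluster feature vector with timestamps, after the CFT fields above (illustrative sketch). */
final class MicroCluster {
    long n;             // N: number of data objects
    final double[] ls;  // LS: per-dimension linear sum of data objects
    final double[] ss;  // SS: per-dimension sum of squares of data objects
    double lst;         // LST: sum of timestamps
    double sst;         // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    /** Absorbs one point; the point itself is not stored. */
    void insert(double[] point, double timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    /** Merges another micro-cluster; every field is additive. */
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    /** Centroid LS / N, usable as a pseudo-point with weight N (requires n > 0). */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }
}
```

Because every field is a sum, summaries built in separate local clustering processes can be merged without revisiting the data, and the centroid with weight N serves as the weighted pseudo-point for the global phase.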


Page 23: Scalable Distributed Real-Time Clustering for Big Data Streams

CluStream Snapshot


[Figure: CluStream snapshot showing micro-clusters, macro-clusters and ground truth.]

Page 24: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalable Advanced Massive Online Analysis (SAMOA)

[Figure: SAMOA Clustering Task topology. Labels: STREAM SOURCE, Distribution PI, Local Clustering PI, Global Clustering PI, Sampling PI, Evaluator PI; groups Clustering and Evaluation, each with an OUTPUT.]


Page 25: Scalable Distributed Real-Time Clustering for Big Data Streams

Experiments, Evaluation & Results

¤  Experimental Setup

    ¤  Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM

    ¤  Process parallelism level: 1, 8 & 16

    ¤  Instance dimensions: 3 & 15

    ¤  Source dataset: random events generator

    ¤  Noise: 0% & 10%

    ¤  Cluster movement speed: move 0.1 unit every 500 & 12000 instances

¤  Evaluations

    ¤  Scalability: measure throughput when adding concurrent processes

    ¤  Clustering quality: measure whether the clustering algorithm is accurate


Page 26: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalability

[Figure: Baseline Comparison. x-axis: Evaluation Step; y-axis: Throughput (instances/second).]

Page 27: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalability

[Figure: Average Throughput with Dimensions 3 and 15. x-axis: Process Parallelism; y-axis: Average Throughput (instances/second).]

Page 28: Scalable Distributed Real-Time Clustering for Big Data Streams

Scalability

[Figure: Parallelism Throughput with Dimension 3. x-axis: Process Parallelism; y-axis: Avg. Cumulative Throughput (instances/sec).]

Page 29: Scalable Distributed Real-Time Clustering for Big Data Streams

Clustering Quality Metrics

¤  Internal & External evaluations

    ¤  Internal evaluation uses attributes available from the clustering structure.

    ¤  External evaluation uses external validation structures.

        ¤  e.g., ground truth provided by the source generator.

¤  Metrics

    ¤  Cohesion coefficient (SSE): measures the intra-cluster sum of squared errors

    ¤  Separation coefficient (BSS): measures the between-cluster sum of squares
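
The two metrics can be sketched as follows. The slides name SSE and BSS without giving formulas, so this sketch assumes their standard definitions: SSE sums squared distances from each point to its cluster mean, and BSS sums, over clusters, the cluster size times the squared distance from the cluster mean to the overall mean. The class `ClusterQuality` is a hypothetical name.

```java
/** Cohesion (SSE) and separation (BSS) under their standard definitions (illustrative sketch). */
final class ClusterQuality {
    /** Mean of a non-empty set of points. */
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.length;
        }
        return m;
    }

    static double squaredDistance(double[] a, double[] b) {
        double total = 0.0;
        for (int d = 0; d < a.length; d++) {
            total += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return total;
    }

    /** SSE: squared distance of every point to the mean of its own cluster. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] c = mean(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, c);
            }
        }
        return total;
    }

    /** BSS: cluster size times squared distance from cluster mean to overall mean. */
    static double bss(double[][][] clusters) {
        double[] overall = new double[clusters[0][0].length];
        int n = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < overall.length; d++) {
                    overall[d] += p[d];
                }
                n++;
            }
        }
        for (int d = 0; d < overall.length; d++) {
            overall[d] /= n;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(mean(cluster), overall);
        }
        return total;
    }
}
```

Lower SSE means tighter clusters and higher BSS means better separated ones. With cluster means as centroids, the two add up to the total sum of squares around the overall mean.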


Page 30: Scalable Distributed Real-Time Clustering for Big Data Streams

Clustering Quality 0% Noise


Snapshot 25,000 instances

Snapshot 45,000 instances

Page 31: Scalable Distributed Real-Time Clustering for Big Data Streams

Clustering Quality 0% Noise


Ratio = BSS / GT

Page 32: Scalable Distributed Real-Time Clustering for Big Data Streams

Clustering Quality 10% Noise


Snapshot 45,000 instances

Snapshot 25,000 instances

Good clustering

Poor clustering

Page 33: Scalable Distributed Real-Time Clustering for Big Data Streams

Clustering Quality 10% Noise


Page 34: Scalable Distributed Real-Time Clustering for Big Data Streams

Conclusion

¤  There is important information in the massive amount of data being produced and discarded

¤  There is a need for tools to deal with this efficiently

¤  Efforts have been made to crunch big data

¤  Interpreting and retrieving relevant information is where machine learning and data mining operate

¤  Real-time analysis responds faster to evolving data

¤  SAMOA abstracts the platform and maintains the algorithms, making them easy to implement, test and use.


Page 35: Scalable Distributed Real-Time Clustering for Big Data Streams

Acknowledgements

¤  Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity

¤  Thanks to all the EMDC students

¤  Thanks to Yahoo! Research for the great project
