Near-Realtime Datacenter Analytics: Model-Driven Pregel with Spark Streaming and GraphX

David Ohsie, Distinguished Engineer
Cheuk Lam, Consultant Software Engineer
EMC Corporation ([email protected]; [email protected])

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Vertex-centric Programming on Spark Streaming and GraphX by Cheuk Lam and David Ohsie


Page 2

The Problem

[Diagram labels: Metrics, Streaming]

Page 3

Model-Driven Approach: Continuous Eval of Expressions over Relationships

Tell me when I lose quorum

Cassandra Cluster
  Boolean LostQuorum = Sum(HasMember->IsUp) <= |HasMember|/2

Cassandra Node
  Boolean IsUp;

Relationship: HasMember (Cassandra Cluster to Cassandra Node)

Instances: Cluster1; Cassandra1, Cassandra2, Cassandra3, Cassandra4, Cassandra5
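The quorum rule on this slide can be read as a plain function over a cluster's members. The following is a minimal sketch in ordinary Scala, not the authors' implementation: the case classes, the `cluster` helper, and the object name are hypothetical stand-ins for the model classes. The comparison is written as `2 * up <= n` so that integer division does not round `|HasMember|/2`.

```scala
object QuorumSketch {
  // Hypothetical in-memory stand-ins for the slide's model classes.
  final case class CassandraNode(name: String, isUp: Boolean)

  final case class CassandraCluster(name: String, hasMember: Seq[CassandraNode]) {
    // Sum(HasMember->IsUp): number of members that are currently up.
    def upCount: Int = hasMember.count(_.isUp)

    // LostQuorum = Sum(HasMember->IsUp) <= |HasMember|/2,
    // expressed without division: 2 * up <= member count.
    def lostQuorum: Boolean = 2 * upCount <= hasMember.size
  }

  // Convenience builder (an assumption of this sketch): one node per flag,
  // named Cassandra1, Cassandra2, ... as in the slide.
  def cluster(upFlags: Boolean*): CassandraCluster =
    CassandraCluster("Cluster1", upFlags.zipWithIndex.map {
      case (up, i) => CassandraNode(s"Cassandra${i + 1}", up)
    })
}
```

With five members, `QuorumSketch.cluster(true, true, true, false, false).lostQuorum` is `false`, and once only two nodes remain up it becomes `true`.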

Page 4

Another Example

StorageDevice
  Float ReadBW;

Server
  Float TotalReadBW = Sum(Contains->ReadBW);

StoragePool
  Float AvgReadBW = Avg(Classifies->ReadBW);

Cluster
  Float TotalReadBW = Sum(Contains->TotalReadBW);

Relationships: Server Contains StorageDevice; StoragePool Classifies StorageDevice; Cluster Contains Server

Page 5

Eye Chart (medicinal purposes only)

Page 6

Problem Scale

• Size of a Single Distributed Storage System: 1.8MM Vertices and 3 MM Edges

• Message Rates: 90,000 – 9MM msgs/sec

Page 7

Success Criteria

• Current model-driven system in production for 15+ years
  – Manual Scale Out and HA
  – Current performance on single node: 1000’s of messages/sec
• Goals for Spark Implementation:
  – #1: Automate Scaleout
  – #2: Automate HA
  – #3: Operate on a Micro-Cluster
• Improve Scale Up (100,000’s – 1,000,000’s messages/sec on a micro-cluster of 2-4 nodes)
• Spark Based Solution Achieved…
  – Almost Linear Scaleout on 4 node micro-cluster
  – 160,000 messages/sec (with some headroom)

Page 8

Challenges

• Graph Processing Scaleout
• Graph Heterogeneity
  – Each type of vertex and edge in the graph has different expressions

Page 9

Challenge #1: Graph Scaleout

• Graph Processing is Not Embarrassingly Parallel
• Simply Sharding Vertices (or Edges) Won’t Do It
• Because Expressions Cross Vertex Boundaries
• And Expressions can Refer to Other Expressions, Crossing even more Node Boundaries

StorageDevice
  Float ReadBW;

Server
  Float TotalReadBW = Sum(Contains->ReadBW);

Cluster
  Float TotalReadBW = Sum(Contains->TotalReadBW);

Relationships: Server Contains StorageDevice; Cluster Contains Server

Page 10

Scaling Out using Pregel

Vertex Program: Run in each vertex in parallel to compute a new state

Send batch messages to neighbors

[Figure: six storage devices with ReadBW 10, 20, 30, 40, 50, 60. Storage pools (7.2K SP, 10K SP, SSD SP): Avg 25, Avg 35, Avg 45. Servers: Total 60, Total 150. Cluster: Avg 35, Total 210.]
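To make the superstep idea concrete, here is a single-machine imitation of a Pregel loop in plain Scala. It is a sketch under stated assumptions rather than GraphX itself: `MiniPregel`, `Triplet`, and `run` are invented names that only loosely mirror the vprog / sendMsg / mergeMsg shape, there is no initial message or partitioning, and the loop runs sequentially. The vertex ids and the assignment of disks to servers are also assumptions, chosen so that the totals match the figure (60, 150, 210).

```scala
object MiniPregel {
  type VertexId = Long
  final case class Edge(src: VertexId, dst: VertexId)
  // Simplified analogue of an edge triplet: endpoint ids plus their current state.
  final case class Triplet[VD](srcId: VertexId, srcAttr: VD, dstId: VertexId, dstAttr: VD)

  def run[VD, A](vertices: Map[VertexId, VD], edges: Seq[Edge], maxIterations: Int)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: Triplet[VD] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, VD] = {
    var state = vertices
    // In the first superstep every vertex is allowed to send.
    var active: Set[VertexId] = vertices.keySet
    var step = 0
    while (active.nonEmpty && step < maxIterations) {
      // 1. Vertices that changed in the previous step send along their out-edges.
      val sent: Seq[(VertexId, A)] = edges
        .filter(e => active(e.src))
        .flatMap(e => sendMsg(Triplet(e.src, state(e.src), e.dst, state(e.dst))))
      // 2. Messages for the same destination are combined with mergeMsg.
      val inbox: Map[VertexId, A] = sent
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }
      // 3. The vertex program computes a new state for each receiver.
      state = state ++ inbox.map { case (id, msg) => id -> vprog(id, state(id), msg) }
      active = inbox.keySet
      step += 1
    }
    state
  }
}

object ReadBwExample {
  import MiniPregel._

  // Disks 1..6 carry ReadBW; servers and the cluster start at 0 (ids are arbitrary).
  val server1 = 101L
  val server2 = 102L
  val clusterId = 1000L
  val disks: Map[VertexId, Double] =
    Map(1L -> 10.0, 2L -> 20.0, 3L -> 30.0, 4L -> 40.0, 5L -> 50.0, 6L -> 60.0)
  val vertices: Map[VertexId, Double] =
    disks ++ Map(server1 -> 0.0, server2 -> 0.0, clusterId -> 0.0)
  // Edges point from the contained vertex to its container.
  val edges: Seq[Edge] = Seq(
    Edge(1L, server1), Edge(2L, server1), Edge(3L, server1),
    Edge(4L, server2), Edge(5L, server2), Edge(6L, server2),
    Edge(server1, clusterId), Edge(server2, clusterId))

  val result: Map[VertexId, Double] = run[Double, Double](vertices, edges, 10)(
    (id, old, sum) => sum,                  // vprog: new state is the merged sum
    t => Iterator((t.dstId, t.srcAttr)),    // sendMsg: push current value upward
    (a, b) => a + b)                        // mergeMsg: Sum operator
}
```

The cluster needs one more superstep than the servers, because its expression refers to `TotalReadBW`, which is itself computed. That dependency chain is what keeps simple sharding from being enough.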

Page 11

Challenge #2: Heterogeneity

def pregel[A]
    (vprog: (VertexId, VD, A) => VD,
     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
     mergeMsg: (A, A) => A)
  : Graph[VD, ED]

Page 12

Our Solution: Model-Driven Pregel

class Server {
  TotalReadBW = Sum(Contains->ReadBW)
}

[Diagram labels: Topology, Telemetry, Spark Streaming, GraphX Graph, GraphX Pregel, Results]

Page 13

Model-Driven Vertex Program

Expression Model:
  Cluster::TotalReadBW (Computed Expression)
  Sum (Operator) over Contains (Relationship)
  Server::TotalReadBW (Computed Expression)
  Sum (Operator) over Contains (Relationship)
  Disk::ReadBW (Telemetry Input)

Topology:
  Cluster-1
  Contains
  Server-1
  Contains
  Disk-1, Disk-2, …

Values shown in the figure: 20, 10, 30
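One way to picture a model-driven vertex program is a single generic function that looks up the expressions for a vertex's class in a table and evaluates them over its neighbours. The sketch below is an illustration in plain Scala under several assumptions: `Expr`, `Vertex`, `Edge`, `vprog`, and `evaluate` are hypothetical names, not the authors' code. It is also an assumption which disks belong to which server and pool, chosen to be consistent with the averages and totals shown on the earlier Pregel slide.

```scala
object ModelDrivenSketch {
  sealed trait Op
  case object Sum extends Op
  case object Avg extends Op

  // One computed expression: cls::attr = op(rel->srcAttr).
  final case class Expr(cls: String, attr: String, op: Op, rel: String, srcAttr: String)

  // The expressions from the slides, held as data instead of per-class code.
  val model: Seq[Expr] = Seq(
    Expr("Server", "TotalReadBW", Sum, "Contains", "ReadBW"),
    Expr("StoragePool", "AvgReadBW", Avg, "Classifies", "ReadBW"),
    Expr("Cluster", "TotalReadBW", Sum, "Contains", "TotalReadBW"))

  final case class Vertex(id: String, cls: String, attrs: Map[String, Double] = Map.empty)
  final case class Edge(src: String, rel: String, dst: String) // src --rel--> dst

  // Generic vertex program: the same code runs for every class of vertex and
  // consults the model to decide which attributes to compute.
  def vprog(v: Vertex, neighbours: Map[String, Seq[Vertex]]): Vertex = {
    val computed = for {
      e <- model if e.cls == v.cls
      inputs = neighbours.getOrElse(e.rel, Seq.empty[Vertex]).flatMap(_.attrs.get(e.srcAttr))
      if inputs.nonEmpty // skip until the referenced attribute exists
    } yield e.attr -> (e.op match {
      case Sum => inputs.sum
      case Avg => inputs.sum / inputs.size
    })
    v.copy(attrs = v.attrs ++ computed)
  }

  // Repeat supersteps until no attribute changes (or maxSteps is reached).
  def evaluate(vs: Seq[Vertex], es: Seq[Edge], maxSteps: Int = 10): Map[String, Vertex] = {
    var state = vs.map(v => v.id -> v).toMap
    var changed = true
    var step = 0
    while (changed && step < maxSteps) {
      val next = state.map { case (id, v) =>
        val nbrs = es.filter(_.src == id).groupBy(_.rel)
          .map { case (rel, out) => rel -> out.map(e => state(e.dst)) }
        id -> vprog(v, nbrs)
      }
      changed = next != state
      state = next
      step += 1
    }
    state
  }

  // Example topology (assumed layout): two servers, three pools, one cluster.
  val disks: Seq[Vertex] = Seq(10.0, 20.0, 30.0, 40.0, 50.0, 60.0).zipWithIndex.map {
    case (bw, i) => Vertex(s"disk${i + 1}", "StorageDevice", Map("ReadBW" -> bw))
  }
  val vertices: Seq[Vertex] = disks ++ Seq(
    Vertex("server1", "Server"), Vertex("server2", "Server"),
    Vertex("pool7.2K", "StoragePool"), Vertex("pool10K", "StoragePool"),
    Vertex("poolSSD", "StoragePool"), Vertex("cluster", "Cluster"))
  val edges: Seq[Edge] =
    Seq("disk1", "disk2", "disk3").map(Edge("server1", "Contains", _)) ++
    Seq("disk4", "disk5", "disk6").map(Edge("server2", "Contains", _)) ++
    Seq("disk1", "disk4").map(Edge("pool7.2K", "Classifies", _)) ++
    Seq("disk2", "disk5").map(Edge("pool10K", "Classifies", _)) ++
    Seq("disk3", "disk6").map(Edge("poolSSD", "Classifies", _)) ++
    Seq("server1", "server2").map(Edge("cluster", "Contains", _))

  val result: Map[String, Vertex] = evaluate(vertices, edges)
}
```

Adding a new class or expression in this sketch means adding a row to `model`; `vprog` itself does not change, which is the property the model-driven approach is after.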

Page 14

Scaling Out

[Charts: Time consumed at each stage (seconds)]

Page 15

Lesson Learned on Batch Size

[Chart annotation: Sweet spot]

Page 16

Lesson Learned: Hidden Memory Cost

[Chart labels: Raw data, GraphX overhead, Model-driven overhead, Intermediate RDDs in Pregel]

• Extra copy of vertex data cached in edge partitions

Page 17

LESSON LEARNED: QUIZ TIME

Page 18

How many times does the map from RDD A → RDD B get evaluated?

Scala Collection → Parallelize → RDD A
RDD A → Map → RDD B
RDD B → Filter, Filter → RDD, RDD → Union → RDD C
RDD C → Filter, Filter → RDD, RDD → Union → RDD D
RDD D → count
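The quiz lineage can be explored without a cluster by modelling an uncached RDD as a value that re-runs its whole lineage every time it is needed. The sketch below is a toy in plain Scala, not Spark: `Lazy`, its `cache()` method, and `mapEvaluations` are hypothetical names, the data and filter predicates are arbitrary, and the toy does not model checkpointing or executor loss.

```scala
object LineageQuiz {
  // Toy stand-in for an RDD: `compute` re-runs the full lineage on every call.
  final class Lazy[A](val compute: () => Seq[A]) {
    def map[B](f: A => B): Lazy[B] = new Lazy(() => compute().map(f))
    def filter(p: A => Boolean): Lazy[A] = new Lazy(() => compute().filter(p))
    def union(other: Lazy[A]): Lazy[A] = new Lazy(() => compute() ++ other.compute())
    def count(): Int = compute().size
    // Memoise the first computation; only loosely analogous to caching an RDD.
    def cache(): Lazy[A] = { lazy val memo = compute(); new Lazy(() => memo) }
  }

  // Builds A -> map -> B -> 2x filter -> union -> C -> 2x filter -> union -> D,
  // runs count on D, and returns how many times the map stage ran over A.
  def mapEvaluations(cacheB: Boolean, cacheC: Boolean): Int = {
    var mapRuns = 0
    val a = new Lazy(() => Seq(1, 2, 3, 4)) // stands in for "parallelize"
    val mapped = new Lazy(() => { mapRuns += 1; a.compute().map(_ * 10) })
    val b = if (cacheB) mapped.cache() else mapped
    val cRaw = b.filter(_ > 10).union(b.filter(_ <= 10))
    val c = if (cacheC) cRaw.cache() else cRaw
    val d = c.filter(_ % 20 == 0).union(c.filter(_ % 20 != 0))
    d.count()
    mapRuns
  }
}
```

In this toy model, `mapEvaluations(false, false)` returns 4: D pulls C twice, and each pull of C pulls B twice. Memoising C drops it to 2 and memoising B drops it to 1. The slide itself does not state the answer for Spark, so treat these numbers as a property of the sketch.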

Page 19

Why Doesn’t Caching Fix the Problem?

[Diagram: the same lineage (Scala Collection → Parallelize → RDD A → Map → RDD B → Filter, Filter → Union → RDD C → Filter, Filter → Union → RDD D → count), annotated with Cache, Cache, and Checkpoint]

Workaround: First uncache as usual and then immediately cache again

Page 20

Conclusions

• Spark can solve datacenter near-realtime analysis
  – Achieved over 100K messages/sec on 4 node microcluster
  – With sufficient batch size (must tolerate some latency: 30s)
  – Combining streaming and batch (GraphX) was essential
• Pregel can implement heterogeneous graph algorithms using a model-driven approach
• Implementation Lessons:
  – Pay attention to hidden memory sinks
  – Pay attention to ALL of your partitioning
  – Explicitly test your HA with particular attention to caching