Near-Realtime Datacenter Analytics: Model-Driven Pregel with Spark Streaming and GraphX
David Ohsie, Distinguished Engineer
Cheuk Lam, Consultant Software Engineer
EMC Corporation ([email protected]; [email protected])
The Problem
[Diagram: metrics streaming into a cluster for continuous analysis]
Model-Driven Approach: Continuous Evaluation of Expressions over Relationships
Example: "Tell me when I lose quorum."

Cassandra Cluster:
  Boolean LostQuorum = Sum(HasMember->IsUp) <= |HasMember|/2
Cassandra Node:
  Boolean IsUp
Relationship: HasMember (Cassandra Cluster -> Cassandra Node)
[Diagram: a Cassandra cluster with members Cassandra1 through Cassandra5]
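For illustration only, here is a minimal Scala sketch of how the LostQuorum expression could be evaluated over the HasMember relationship. The CassandraNode/CassandraCluster case classes and the five-node example are assumptions, not the authors' modeling language.

// Hypothetical sketch of the LostQuorum expression above (not the authors' DSL):
// quorum is lost when at most half of the members are up.
case class CassandraNode(name: String, isUp: Boolean)

case class CassandraCluster(name: String, hasMember: Seq[CassandraNode]) {
  // Boolean LostQuorum = Sum(HasMember->IsUp) <= |HasMember|/2
  def lostQuorum: Boolean = hasMember.count(_.isUp) <= hasMember.size / 2
}

object QuorumExample extends App {
  val cluster = CassandraCluster("Cassandra Cluster", Seq(
    CassandraNode("Cassandra1", isUp = true),
    CassandraNode("Cassandra2", isUp = true),
    CassandraNode("Cassandra3", isUp = false),
    CassandraNode("Cassandra4", isUp = false),
    CassandraNode("Cassandra5", isUp = false)))

  println(s"LostQuorum = ${cluster.lostQuorum}")   // true: only 2 of 5 members are up
}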
Another Example

StorageDevice:
  Float ReadBW
Server:
  Float TotalReadBW = Sum(Contains->ReadBW)
StoragePool:
  Float AvgReadBW = Avg(Classifies->ReadBW)
Cluster:
  Float TotalReadBW = Sum(Contains->TotalReadBW)
Relationships: Contains (Cluster -> Server -> StorageDevice), Classifies (StoragePool -> StorageDevice)
Eye Chart (medicinal purposes only)
Problem Scale
• Size of a Single Distributed Storage System: 1.8MM Vertices and 3 MM Edges
• Message Rates: 90,000 – 9MM msgs/sec
Success Criteria
• Current model-driven system in production for 15+ years
  – Manual scale-out and HA
  – Current performance on a single node: 1000's of messages/sec
• Goals for the Spark implementation:
  – #1: Automate scale-out
  – #2: Automate HA
  – #3: Operate on a micro-cluster
• Improve scale-up (100,000's - 1,000,000's of messages/sec on a micro-cluster of 2-4 nodes)
• The Spark-based solution achieved:
  – Almost linear scale-out on a 4-node micro-cluster
  – 160,000 messages/sec (with some headroom)
Challenges
• Graph processing scale-out
• Graph heterogeneity
  – Each type of vertex and edge in the graph has different expressions
Challenge #1: Graph Scale-Out
• Graph processing is not embarrassingly parallel
• Simply sharding vertices (or edges) won't do it
• Because expressions cross vertex boundaries
• And an expression can refer to other expressions, crossing even more node boundaries

[Diagram: Cluster (Float TotalReadBW = Sum(Contains->TotalReadBW)) -Contains-> Server (Float TotalReadBW = Sum(Contains->ReadBW)) -Contains-> StorageDevice (Float ReadBW)]
[Worked example: disk ReadBW values 10, 20, 30, 40, 50, 60 roll up to storage-pool averages (7.2K SP: 25, 10K SP: 35, SSD SP: 45), server totals of 60 and 150, and a cluster with average 35 and total 210]
Scaling Out Using Pregel
• Vertex program: run in each vertex in parallel to compute a new state
• Send batch messages to neighbors
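To make the vertex-program and message flow concrete, below is a minimal GraphX Pregel sketch that performs the Contains rollup from the worked example: disks report ReadBW to their server, servers report totals to the cluster. The vertex IDs, the NodeValue case class, and the local[*] master are assumptions for illustration, not the authors' implementation.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object PregelRollupSketch {

  // "Disk" vertices carry telemetry; "Server" and "Cluster" values are computed.
  case class NodeValue(kind: String, value: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pregel-rollup").setMaster("local[*]"))

    // Cluster(0) -Contains-> Servers(1, 2) -Contains-> Disks(11..16) with ReadBW 10..60
    val vertices = sc.parallelize(Seq(
      (0L, NodeValue("Cluster", 0.0)),
      (1L, NodeValue("Server", 0.0)), (2L, NodeValue("Server", 0.0)),
      (11L, NodeValue("Disk", 10.0)), (12L, NodeValue("Disk", 20.0)), (13L, NodeValue("Disk", 30.0)),
      (14L, NodeValue("Disk", 40.0)), (15L, NodeValue("Disk", 50.0)), (16L, NodeValue("Disk", 60.0))))

    // Edge src = parent, dst = child ("Contains")
    val edges = sc.parallelize(Seq(
      Edge(0L, 1L, "Contains"), Edge(0L, 2L, "Contains"),
      Edge(1L, 11L, "Contains"), Edge(1L, 12L, "Contains"), Edge(1L, 13L, "Contains"),
      Edge(2L, 14L, "Contains"), Edge(2L, 15L, "Contains"), Edge(2L, 16L, "Contains")))

    val graph = Graph(vertices, edges)

    // Two supersteps are enough for a two-level hierarchy: disks -> servers -> cluster.
    val result = graph.pregel(initialMsg = 0.0, maxIterations = 2)(
      vprog = (_, attr, msgSum) => if (attr.kind == "Disk") attr else attr.copy(value = msgSum),
      sendMsg = triplet => Iterator((triplet.srcId, triplet.dstAttr.value)),  // child reports to parent
      mergeMsg = _ + _)

    // Servers end up with 60.0 and 150.0, the cluster with 210.0
    result.vertices.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}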
Challenge #2: Heterogeneity

def pregel[A](
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A
): Graph[VD, ED]
Our Solution: Model-Driven Pregel

class Server {
  TotalReadBW = Sum(Contains->ReadBW)
  …
}

[Architecture: Topology and Telemetry arrive via Spark Streaming, build a GraphX Graph, and GraphX Pregel evaluates the model to produce Results]
Model-Driven Vertex Program

Expression model:
  Cluster::TotalReadBW – computed expression (Sum operator over the Contains relationship)
  Server::TotalReadBW – computed expression (Sum operator over the Contains relationship)
  Disk::ReadBW – telemetry input

Topology:
  Cluster-1 -Contains-> Server-1 -Contains-> Disk-1, Disk-2, …
  (example values: disk readings 10 and 20 roll up to a server total of 30)
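Below is a minimal sketch of the dispatch idea behind a model-driven vertex program; the types, names, and tiny expression model are assumptions, not EMC's code. One generic vertex program looks up whichever expression the model defines for the vertex's class, so a single Pregel vprog can serve a heterogeneous graph despite the homogeneous pregel signature shown earlier.

import org.apache.spark.graphx._

object ModelDrivenVprogSketch {

  // Vertex state: its class in the model plus its current attribute values.
  case class VertexState(vertexClass: String, attrs: Map[String, Double])

  // One aggregate expression per vertex class: target attribute plus an operator
  // applied to values gathered over a relationship.
  case class Expression(targetAttr: String, op: Seq[Double] => Double)

  // Expression model, keyed by vertex class (mirrors the slide's example).
  val model: Map[String, Expression] = Map(
    "Server"  -> Expression("TotalReadBW", _.sum),   // Sum(Contains->ReadBW)
    "Cluster" -> Expression("TotalReadBW", _.sum))   // Sum(Contains->TotalReadBW)

  // Generic vertex program: apply whichever expression the model defines for this class.
  // The message is the batch of child values gathered over the Contains relationship.
  def vprog(id: VertexId, state: VertexState, childValues: Seq[Double]): VertexState =
    model.get(state.vertexClass) match {
      case Some(expr) if childValues.nonEmpty =>
        state.copy(attrs = state.attrs + (expr.targetAttr -> expr.op(childValues)))
      case _ => state  // e.g. Disk: telemetry input, nothing to compute
    }

  // mergeMsg concatenates child values so the operator sees the whole batch.
  def mergeMsg(a: Seq[Double], b: Seq[Double]): Seq[Double] = a ++ b

  def main(args: Array[String]): Unit = {
    val server = VertexState("Server", Map.empty)
    println(vprog(1L, server, Seq(10.0, 20.0)))   // attrs contain TotalReadBW -> 30.0
  }
}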
Scaling Out

[Charts: time consumed at each stage (seconds)]
Lesson Learned on Batch Size

[Chart showing a sweet spot in batch size]
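Assuming the batch size here refers to the Spark Streaming batch interval (the conclusions mention tolerating roughly 30 seconds of latency), a minimal configuration sketch with a made-up application name follows; the telemetry pipeline itself is elided.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchSizeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("model-driven-pregel").setMaster("local[*]")
    // Larger batch interval -> bigger batches -> better throughput, at the cost of latency.
    val ssc = new StreamingContext(conf, Seconds(30))
    // ... build the telemetry DStream and the GraphX/Pregel evaluation here ...
    // ssc.start(); ssc.awaitTermination()
  }
}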
Lesson Learned: Hidden Memory Cost

[Chart: memory breakdown into raw data, GraphX overhead, model-driven overhead, and intermediate RDDs in Pregel]
• Extra copy of vertex data cached in edge partitions
Lesson Learned: Quiz Time

How many times does the map from RDD A → RDD B get evaluated?

[DAG: a Scala collection is parallelized into RDD A; RDD A is mapped to RDD B; two filters over RDD B are unioned into RDD C, and two more filters over RDD B are unioned into RDD D; count is then called]
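The sketch below reproduces the quiz DAG with made-up data and predicates, assuming count is run on both RDD C and RDD D, and uses an accumulator to count how many elements flow through the map so the answer can be checked empirically.

import org.apache.spark.{SparkConf, SparkContext}

object LineageQuizSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-quiz").setMaster("local[*]"))
    val mapCalls = sc.longAccumulator("map evaluations")

    val rddA = sc.parallelize(1 to 100)                    // Scala collection -> RDD A
    val rddB = rddA.map { x => mapCalls.add(1); x * 2 }    // RDD A -> RDD B (the map in question)

    val rddC = rddB.filter(_ % 4 == 0).union(rddB.filter(_ % 4 != 0))  // two filters -> union -> RDD C
    val rddD = rddB.filter(_ < 100).union(rddB.filter(_ >= 100))       // two filters -> union -> RDD D

    rddC.count()
    rddD.count()
    println(s"map ran over ${mapCalls.value} elements; RDD A has ${rddA.count()} elements")
    sc.stop()
  }
}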
Why Doesn't Caching Fix the Problem?

[Same DAG as above, now annotated with Cache markers and a Checkpoint]
Workaround: First uncache as usual and then immediately cache again
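A minimal sketch of the mechanics of the workaround stated above, with a made-up RDD and checkpoint directory; the exact failure mode depends on the job, so this only shows the unpersist/persist/checkpoint ordering.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RecacheWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("recache-sketch").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory

    val derived = sc.parallelize(1 to 1000000).map(_ * 2)
    derived.persist(StorageLevel.MEMORY_ONLY)

    // Workaround: uncache as usual, then immediately cache again before the
    // checkpoint is materialized, so the copy used afterwards is a fresh one.
    derived.unpersist(blocking = true)
    derived.persist(StorageLevel.MEMORY_ONLY)
    derived.checkpoint()
    derived.count()   // first action: materializes the cache and writes the checkpoint

    sc.stop()
  }
}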
Conclusions
• Spark can solve near-realtime datacenter analysis
  – Achieved over 100K messages/sec on a 4-node micro-cluster
  – With sufficient batch size (must tolerate some latency, ~30 s)
  – Combining streaming and batch (GraphX) was essential
• Pregel can implement heterogeneous graph algorithms using a model-driven approach
• Implementation lessons:
  – Pay attention to hidden memory sinks
  – Pay attention to ALL of your partitioning
  – Explicitly test your HA, with particular attention to caching