Near-Realtime Datacenter Analytics: Model-Driven Pregel with Spark Streaming and GraphX
David Ohsie, Distinguished Engineer
Cheuk Lam, Consultant Software Engineer
EMC Corporation ([email protected]; [email protected])
The Problem
[Diagram: metrics streaming into a cluster for continuous analysis]
Model-Driven Approach: Continuous Evaluation of Expressions over Relationships
Example: "Tell me when I lose quorum."

Cassandra Cluster:
  Boolean LostQuorum = Sum(HasMember->IsUp) <= |HasMember|/2
Cassandra Node:
  Boolean IsUp
Relationship: HasMember (Cassandra Cluster -> Cassandra Node)
[Diagram: a Cassandra cluster with members Cassandra1 through Cassandra5]
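For illustration only, here is a minimal Scala sketch of how the LostQuorum expression could be evaluated over the HasMember relationship. The CassandraNode/CassandraCluster case classes and the five-node example are assumptions, not the authors' modeling language.

// Hypothetical sketch of the LostQuorum expression above (not the authors' DSL):
// quorum is lost when at most half of the members are up.
case class CassandraNode(name: String, isUp: Boolean)

case class CassandraCluster(name: String, hasMember: Seq[CassandraNode]) {
  // Boolean LostQuorum = Sum(HasMember->IsUp) <= |HasMember|/2
  def lostQuorum: Boolean = hasMember.count(_.isUp) <= hasMember.size / 2
}

object QuorumExample extends App {
  val cluster = CassandraCluster("Cassandra Cluster", Seq(
    CassandraNode("Cassandra1", isUp = true),
    CassandraNode("Cassandra2", isUp = true),
    CassandraNode("Cassandra3", isUp = false),
    CassandraNode("Cassandra4", isUp = false),
    CassandraNode("Cassandra5", isUp = false)))

  println(s"LostQuorum = ${cluster.lostQuorum}")   // true: only 2 of 5 members are up
}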
Another Example

StorageDevice:
  Float ReadBW
Server:
  Float TotalReadBW = Sum(Contains->ReadBW)
StoragePool:
  Float AvgReadBW = Avg(Classifies->ReadBW)
Cluster:
  Float TotalReadBW = Sum(Contains->TotalReadBW)
Relationships: Contains (Cluster -> Server -> StorageDevice), Classifies (StoragePool -> StorageDevice)
Eye Chart (medicinal purposes only)
Problem Scale
• Size of a Single Distributed Storage System: 1.8MM Vertices and 3 MM Edges
• Message Rates: 90,000 – 9MM msgs/sec
Success Criteria
• Current model-driven system in production for 15+ years
  – Manual scale-out and HA
  – Current performance on a single node: 1000's of messages/sec
• Goals for the Spark implementation:
  – #1: Automate scale-out
  – #2: Automate HA
  – #3: Operate on a micro-cluster
• Improve scale-up (100,000's - 1,000,000's of messages/sec on a micro-cluster of 2-4 nodes)
• The Spark-based solution achieved:
  – Almost linear scale-out on a 4-node micro-cluster
  – 160,000 messages/sec (with some headroom)
Challenges
• Graph processing scale-out
• Graph heterogeneity
  – Each type of vertex and edge in the graph has different expressions
Challenge #1: Graph Scale-Out
• Graph processing is not embarrassingly parallel
• Simply sharding vertices (or edges) won't do it
• Because expressions cross vertex boundaries
• And an expression can refer to other expressions, crossing even more node boundaries

[Diagram: Cluster (Float TotalReadBW = Sum(Contains->TotalReadBW)) -Contains-> Server (Float TotalReadBW = Sum(Contains->ReadBW)) -Contains-> StorageDevice (Float ReadBW)]
[Worked example: disk ReadBW values 10, 20, 30, 40, 50, 60 roll up to storage-pool averages (7.2K SP: 25, 10K SP: 35, SSD SP: 45), server totals of 60 and 150, and a cluster with average 35 and total 210]
Scaling Out Using Pregel
• Vertex program: run in each vertex in parallel to compute a new state
• Send batch messages to neighbors
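To make the vertex-program and message flow concrete, below is a minimal GraphX Pregel sketch that performs the Contains rollup from the worked example: disks report ReadBW to their server, servers report totals to the cluster. The vertex IDs, the NodeValue case class, and the local[*] master are assumptions for illustration, not the authors' implementation.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object PregelRollupSketch {

  // "Disk" vertices carry telemetry; "Server" and "Cluster" values are computed.
  case class NodeValue(kind: String, value: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pregel-rollup").setMaster("local[*]"))

    // Cluster(0) -Contains-> Servers(1, 2) -Contains-> Disks(11..16) with ReadBW 10..60
    val vertices = sc.parallelize(Seq(
      (0L, NodeValue("Cluster", 0.0)),
      (1L, NodeValue("Server", 0.0)), (2L, NodeValue("Server", 0.0)),
      (11L, NodeValue("Disk", 10.0)), (12L, NodeValue("Disk", 20.0)), (13L, NodeValue("Disk", 30.0)),
      (14L, NodeValue("Disk", 40.0)), (15L, NodeValue("Disk", 50.0)), (16L, NodeValue("Disk", 60.0))))

    // Edge src = parent, dst = child ("Contains")
    val edges = sc.parallelize(Seq(
      Edge(0L, 1L, "Contains"), Edge(0L, 2L, "Contains"),
      Edge(1L, 11L, "Contains"), Edge(1L, 12L, "Contains"), Edge(1L, 13L, "Contains"),
      Edge(2L, 14L, "Contains"), Edge(2L, 15L, "Contains"), Edge(2L, 16L, "Contains")))

    val graph = Graph(vertices, edges)

    // Two supersteps are enough for a two-level hierarchy: disks -> servers -> cluster.
    val result = graph.pregel(initialMsg = 0.0, maxIterations = 2)(
      vprog = (_, attr, msgSum) => if (attr.kind == "Disk") attr else attr.copy(value = msgSum),
      sendMsg = triplet => Iterator((triplet.srcId, triplet.dstAttr.value)),  // child reports to parent
      mergeMsg = _ + _)

    // Servers end up with 60.0 and 150.0, the cluster with 210.0
    result.vertices.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}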
Challenge #2: Heterogeneity

def pregel[A](
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A
): Graph[VD, ED]
Our Solution: Model-Driven Pregel

class Server {
  TotalReadBW = Sum(Contains->ReadBW)
  …
}

[Architecture: Topology and Telemetry arrive via Spark Streaming, build a GraphX Graph, and GraphX Pregel evaluates the model to produce Results]
Model-Driven Vertex Program

Expression model:
  Cluster::TotalReadBW – computed expression (Sum operator over the Contains relationship)
  Server::TotalReadBW – computed expression (Sum operator over the Contains relationship)
  Disk::ReadBW – telemetry input

Topology:
  Cluster-1 -Contains-> Server-1 -Contains-> Disk-1, Disk-2, …
  (example values: disk readings 10 and 20 roll up to a server total of 30)
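Below is a minimal sketch of the dispatch idea behind a model-driven vertex program; the types, names, and tiny expression model are assumptions, not EMC's code. One generic vertex program looks up whichever expression the model defines for the vertex's class, so a single Pregel vprog can serve a heterogeneous graph despite the homogeneous pregel signature shown earlier.

import org.apache.spark.graphx._

object ModelDrivenVprogSketch {

  // Vertex state: its class in the model plus its current attribute values.
  case class VertexState(vertexClass: String, attrs: Map[String, Double])

  // One aggregate expression per vertex class: target attribute plus an operator
  // applied to values gathered over a relationship.
  case class Expression(targetAttr: String, op: Seq[Double] => Double)

  // Expression model, keyed by vertex class (mirrors the slide's example).
  val model: Map[String, Expression] = Map(
    "Server"  -> Expression("TotalReadBW", _.sum),   // Sum(Contains->ReadBW)
    "Cluster" -> Expression("TotalReadBW", _.sum))   // Sum(Contains->TotalReadBW)

  // Generic vertex program: apply whichever expression the model defines for this class.
  // The message is the batch of child values gathered over the Contains relationship.
  def vprog(id: VertexId, state: VertexState, childValues: Seq[Double]): VertexState =
    model.get(state.vertexClass) match {
      case Some(expr) if childValues.nonEmpty =>
        state.copy(attrs = state.attrs + (expr.targetAttr -> expr.op(childValues)))
      case _ => state  // e.g. Disk: telemetry input, nothing to compute
    }

  // mergeMsg concatenates child values so the operator sees the whole batch.
  def mergeMsg(a: Seq[Double], b: Seq[Double]): Seq[Double] = a ++ b

  def main(args: Array[String]): Unit = {
    val server = VertexState("Server", Map.empty)
    println(vprog(1L, server, Seq(10.0, 20.0)))   // attrs contain TotalReadBW -> 30.0
  }
}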
Scaling Out

[Charts: time consumed at each stage (seconds)]
Lesson Learned on Batch Size

[Chart showing a sweet spot in batch size]
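Assuming the batch size here refers to the Spark Streaming batch interval (the conclusions mention tolerating roughly 30 seconds of latency), a minimal configuration sketch with a made-up application name follows; the telemetry pipeline itself is elided.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchSizeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("model-driven-pregel").setMaster("local[*]")
    // Larger batch interval -> bigger batches -> better throughput, at the cost of latency.
    val ssc = new StreamingContext(conf, Seconds(30))
    // ... build the telemetry DStream and the GraphX/Pregel evaluation here ...
    // ssc.start(); ssc.awaitTermination()
  }
}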
Lesson Learned: Hidden Memory Cost

[Chart: memory breakdown into raw data, GraphX overhead, model-driven overhead, and intermediate RDDs in Pregel]
• Extra copy of vertex data cached in edge partitions
Lesson Learned: Quiz Time

How many times does the map from RDD A → RDD B get evaluated?

[DAG: a Scala collection is parallelized into RDD A; RDD A is mapped to RDD B; two filters over RDD B are unioned into RDD C, and two more filters over RDD B are unioned into RDD D; count is then called]
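The sketch below reproduces the quiz DAG with made-up data and predicates, assuming count is run on both RDD C and RDD D, and uses an accumulator to count how many elements flow through the map so the answer can be checked empirically.

import org.apache.spark.{SparkConf, SparkContext}

object LineageQuizSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-quiz").setMaster("local[*]"))
    val mapCalls = sc.longAccumulator("map evaluations")

    val rddA = sc.parallelize(1 to 100)                    // Scala collection -> RDD A
    val rddB = rddA.map { x => mapCalls.add(1); x * 2 }    // RDD A -> RDD B (the map in question)

    val rddC = rddB.filter(_ % 4 == 0).union(rddB.filter(_ % 4 != 0))  // two filters -> union -> RDD C
    val rddD = rddB.filter(_ < 100).union(rddB.filter(_ >= 100))       // two filters -> union -> RDD D

    rddC.count()
    rddD.count()
    println(s"map ran over ${mapCalls.value} elements; RDD A has ${rddA.count()} elements")
    sc.stop()
  }
}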
Why Doesn't Caching Fix the Problem?

[Same DAG as above, now annotated with Cache markers and a Checkpoint]
Workaround: First uncache as usual and then immediately cache again
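A minimal sketch of the mechanics of the workaround stated above, with a made-up RDD and checkpoint directory; the exact failure mode depends on the job, so this only shows the unpersist/persist/checkpoint ordering.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RecacheWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("recache-sketch").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory

    val derived = sc.parallelize(1 to 1000000).map(_ * 2)
    derived.persist(StorageLevel.MEMORY_ONLY)

    // Workaround: uncache as usual, then immediately cache again before the
    // checkpoint is materialized, so the copy used afterwards is a fresh one.
    derived.unpersist(blocking = true)
    derived.persist(StorageLevel.MEMORY_ONLY)
    derived.checkpoint()
    derived.count()   // first action: materializes the cache and writes the checkpoint

    sc.stop()
  }
}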
Conclusions
• Spark can solve near-realtime datacenter analysis
  – Achieved over 100K messages/sec on a 4-node micro-cluster
  – With sufficient batch size (must tolerate some latency, ~30 s)
  – Combining streaming and batch (GraphX) was essential
• Pregel can implement heterogeneous graph algorithms using a model-driven approach
• Implementation lessons:
  – Pay attention to hidden memory sinks
  – Pay attention to ALL of your partitioning
  – Explicitly test your HA, with particular attention to caching