CIS 602-01: Scalable Data Analysis
Graph Data
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017
Machine Learning
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program

[P. Domingos]
Machine Learning
• Every machine learning algorithm has three components:
  - Representation
  - Evaluation
  - Optimization
Representation
• Decision trees
• Sets of rules / logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
[P. Domingos]
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / utility
• Margin
• Entropy
• K-L divergence
• Etc.
[P. Domingos]
Optimization
• Combinatorial optimization
  - E.g.: greedy search
• Convex optimization
  - E.g.: gradient descent
• Constrained optimization
  - E.g.: linear programming
[P. Domingos]
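To make the convex-optimization bullet concrete, here is a minimal gradient descent sketch (not from the slides): it minimizes the toy function f(x) = (x − 3)², whose gradient is 2(x − 3). The function name and parameters are illustrative only.

```python
# Toy gradient descent: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # step against the gradient direction
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each step shrinks the distance to the minimizer x = 3 by a constant factor (0.8 here), so a few dozen iterations suffice on this example.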
Types of Learning
• Supervised (inductive) learning
  - Training data includes desired outputs
• Unsupervised learning
  - Training data does not include desired outputs
• Semi-supervised learning
  - Training data includes a few desired outputs
• Reinforcement learning
  - Rewards from a sequence of actions
Scalable Machine Learning Constraints
• Constrained by size:
  - Training dataset fits in RAM, need to predict for a much larger dataset
  - Training dataset doesn't even fit in RAM; want to scale out by
    • adopting algorithms that work in batches locally, or
    • running on a distributed cluster
• Constrained by time:
  - Want to fit more models (think hyper-parameter optimization or ensemble learning) on my dataset in a given amount of time
  - Scale out by fitting more models in parallel:
    • on my laptop by using more cores, or
    • on a cluster
[T. Augspurger, 2017]
Sampling: How much training data?
• You may not need to scale
• How large should the sample be? Use statistics
[T. Augspurger, 2017]
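As a sketch of the "use statistics" point: the confidence half-width of a sample mean shrinks as 1/√n, so a required sample size can be solved for directly. The helper name and defaults below are illustrative, not from the slides.

```python
import math

def required_sample_size(sigma, margin, z=1.96):
    """Smallest n such that the z * sigma / sqrt(n) confidence
    half-width is at most `margin` (z=1.96 for ~95% confidence)."""
    return math.ceil((z * sigma / margin) ** 2)

# e.g. estimating a mean with standard deviation 10 to within +/-1
n = required_sample_size(sigma=10, margin=1)
```

Here n works out to a few hundred rows regardless of how large the full dataset is, which is the sense in which "you may not need to scale."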
Scaling scikit-learn
• Fit, then predict
• May be able to fit with training set in memory
• Can run predict out of core
  - Use dask map_partitions
  - Use the cloud (Kubernetes)
• Watch Tom's blog for more posts:
  - how dask can speed up your existing pipelines by executing them in parallel
  - scikit-learn's out-of-core API for when your training dataset doesn't fit in memory
  - using dask to implement distributed machine learning algorithms
[T. Augspurger, 2017]
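The fit-small/predict-big pattern can be sketched in plain Python: the batching below is the same idea dask's `map_partitions` applies to dataframe partitions. `ThresholdModel` is a hypothetical stand-in for a fitted scikit-learn estimator, not a real API.

```python
def batched_predict(model, rows, batch_size=1000):
    """Apply an in-memory model to a dataset too large to load at once,
    one batch at a time (the idea behind dask's map_partitions)."""
    batch = []
    for row in rows:                   # rows can be a lazy iterator over a file
        batch.append(row)
        if len(batch) == batch_size:
            yield from model.predict(batch)
            batch = []
    if batch:                          # flush the final partial batch
        yield from model.predict(batch)

class ThresholdModel:
    """Hypothetical stand-in for a fitted estimator."""
    def predict(self, xs):
        return [1 if x > 0 else 0 for x in xs]

preds = list(batched_predict(ThresholdModel(), range(-2, 3), batch_size=2))
```

Because the input is consumed lazily and each batch is discarded after prediction, peak memory is bounded by the batch size, not the dataset size.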
Deep Learning at Google
Growing Use of Deep Learning at Google
• Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, speech, translation, YouTube, … many others
• [Chart: number of unique project directories containing model description files, growing over time]
[J. Dean, 2015]
Developer Wish List for a Machine Learning System
• Ease of expression: for lots of crazy ML ideas/algorithms
• Scalability: can run experiments quickly
• Portability: can run on a wide variety of platforms
• Reproducibility: easy to share and reproduce research
• Production readiness: go from research to real products
[J. Dean, 2015]
Computation is a dataflow graph
• Graph of nodes, also called operations or ops
• [Diagram: examples and weights feed MatMul; its result and biases feed Add; Add feeds Relu; Relu's output and labels feed Xent]
[J. Dean, 2015]
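The dataflow idea can be sketched with a tiny evaluator (not TensorFlow's API): each node names an op and its inputs, and evaluation recurses through the graph. Scalars stand in for tensors, and the op names mirror the diagram.

```python
# A minimal dataflow-graph evaluator; scalars stand in for tensors.
ops = {
    "matmul": lambda a, b: a * b,      # scalar stand-in for MatMul
    "add":    lambda a, b: a + b,
    "relu":   lambda a: max(a, 0.0),
}

# Each entry: node name -> (op kind, input node names...) or ("const", value).
graph = {
    "examples": ("const", 2.0),
    "weights":  ("const", -3.0),
    "biases":   ("const", 1.0),
    "mm":       ("matmul", "examples", "weights"),
    "sum":      ("add", "mm", "biases"),
    "out":      ("relu", "sum"),
}

def evaluate(graph, node):
    kind, *args = graph[node]
    if kind == "const":
        return args[0]
    return ops[kind](*(evaluate(graph, a) for a in args))

result = evaluate(graph, "out")   # relu(2 * -3 + 1) = relu(-5)
```

Representing the computation as data, rather than as executed code, is what lets a runtime later partition the same graph across devices.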
Computation is a dataflow graph with tensors
• Edges are N-dimensional arrays: tensors
• [Same diagram: MatMul, Add, Relu, Xent over examples, weights, biases, labels, with tensors flowing along edges]
[J. Dean, 2015]
Computation is a dataflow graph with state
• 'Biases' is a variable
• −= updates biases
• Some ops compute gradients
• [Diagram: the biases variable is updated via Add/Mul ops scaled by the learning rate]
[J. Dean, 2015]
Computation is a distributed dataflow graph
• Devices: processes, machines, GPUs, etc.
• [Diagram: the same update graph partitioned across Device A and Device B]
[J. Dean, 2015]
Computation is a distributed dataflow graph
• Send and Receive nodes handle edges that cross devices
• [Diagram: Send/Recv node pairs inserted on each edge crossing the Device A / Device B boundary]
[J. Dean, 2015]
Experiment Turnaround Time & Research Productivity
• Minutes, hours:
  - Interactive research! Instant gratification!
• 1-4 days:
  - Tolerable
  - Interactivity replaced by running many experiments in parallel
• 1-4 weeks:
  - High-value experiments only
  - Progress stalls
• >1 month:
  - Don't even try
[J. Dean, 2015]
How to Scale?
• Model Parallelism
  - Best way to decrease training time: decrease the step time
  - Many models have lots of inherent parallelism
  - Problem is distributing work so communication doesn't kill you
  - Can do across a single core (SIMD), multiple cores, devices (e.g. GPUs), machines
• Data Parallelism
  - Use multiple model replicas to process different examples at once
  - Speedups depend on the kind of model
    • Dense models: 10-40x speedup from 50 replicas
    • Sparse models: support many more replicas (as many as 1,000)
  - Really important for Google's problems
[J. Dean, 2015]
Model Parallelism
[J. Dean, 2015]
Data Parallelism
• [Diagram: parameter servers hold the shared parameters p; model replicas read p, compute updates Δp on their data shards, and the servers apply p += Δp]
[J. Dean, 2015]
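The p += Δp loop in the diagram can be sketched as a toy, single-process simulation (the real system runs replicas and parameter servers on separate machines): two "replicas" each compute a gradient on their shard of points from y = 2x, and the "server" applies both deltas. All names and the least-squares task are illustrative.

```python
# Toy data parallelism: replicas compute gradient updates (delta_p) on
# their data shards; a parameter server applies p += delta_p.
data = [(x, 2.0 * x) for x in range(1, 9)]   # points on the line y = 2x
shards = [data[0::2], data[1::2]]            # one shard per model replica

p = 0.0                                      # shared parameter (true value: 2)
lr = 0.01
for _ in range(200):
    deltas = []
    for shard in shards:                     # in reality these run in parallel
        # gradient of mean squared error (p*x - y)^2 over the shard
        grad = sum(2 * (p * x - y) * x for x, y in shard) / len(shard)
        deltas.append(-lr * grad)
    for dp in deltas:                        # parameter server: p += delta_p
        p += dp
```

Because both replicas compute their deltas from the same snapshot of p before either update lands, this mimics the (slightly stale) synchronous variant of the scheme.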
Assignment 3
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment3.html
• Analysis of the IRS Form 990 Filings Data
• Sign up for AWS Educate
• Part 1: Run Spark Locally over Index
• Part 2: Run Spark Locally over Subset
• Part 3: Run Spark on AWS over Full Dataset (Year)
• Windows issues:
  - Install Spark separately from conda's pyspark install
  - Run the notebook normally and use findspark to link to the pyspark libraries
Projects
• Reviewing project proposals today and tomorrow
• Looking for scalability aspects: if you can easily complete analyses with pandas or matplotlib, it probably doesn't meet the criteria
• If your project doesn't quite meet the criteria, I will try to offer suggestions to satisfy the requirements
• Continue to work on projects
Graphs as Data

Nodes:
  ID  Atom  Electrons  Protons
  0   N     7          7
  1   C     6          6
  2   S     16         16
  3   C     6          6
  4   N     7          7

Edges:
  ID1  ID2  Bonds
  0    1    1
  1    2    1
  1    3    2
  3    4    1

[Diagram: a molecule with C, N, S, and H atoms drawn as a node-link graph]
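A short sketch of loading node and edge tables like the ones above into an adjacency-list graph (the values are taken from the tables; the dict-of-lists representation is one common choice, not the only one):

```python
# Build an adjacency list from the node and edge tables above.
nodes = {0: "N", 1: "C", 2: "S", 3: "C", 4: "N"}      # id -> atom symbol
edges = [(0, 1, 1), (1, 2, 1), (1, 3, 2), (3, 4, 1)]  # (id1, id2, bonds)

adjacency = {node_id: [] for node_id in nodes}
for a, b, bonds in edges:
    adjacency[a].append((b, bonds))   # chemical bonds are undirected,
    adjacency[b].append((a, bonds))   # so record both directions

degree = {n: len(nbrs) for n, nbrs in adjacency.items()}
```

Separating the node table (attributes) from the edge table (structure) is exactly how graph data is usually stored relationally before being turned into an in-memory graph.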
Social Networks
[P. Butler, 2010]
Graphs
• "Every time someone visits news feed there are on average 1,500 potential stories from friends, people they follow and pages for them to see, and most people don't have enough time to see them all"
  - Lars Backstrom, Facebook Engineer, 2013
Graph Analytics
K. Salama
Frameworks In-depth
SIGMOD 2016 Tutorial
Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets
N. Satish et al.
Evaluating Different Graph Algorithms
• Select a variety of different algorithms:
  - PageRank
  - Breadth-First Search
  - Collaborative Filtering
  - Triangle Counting
[Satish et al., 2014]
Table 1: Diversity in the characteristics of chosen graph algorithms.

  Algorithm             Graph type                             Vertex property            Edge access pattern  Message size (Bytes/edge)  Vertex active?
  PageRank              Directed, unweighted edges             Double (pagerank)          Streaming            Constant (8)               All iterations
  Breadth-First Search  Undirected, unweighted edges           Int (distance)             Random               Constant (4)               Some iterations
  Collaborative Filt.   Bipartite; undirected, weighted edges  Array of Doubles (pu/qv)   Streaming            Constant (8K)              All iterations
  Triangle Counting     Directed, unweighted edges             Long (Ntriangles)          Streaming            Variable (0-10^6)          Non-iterative
where u and v are indices over users and items respectively, R_uv is the rating of the u-th user for the v-th item, and p_u and q_v are dense vectors of length K corresponding to each user and item, respectively. This matrix factorization is typically done iteratively using Stochastic Gradient Descent (SGD) or Gradient Descent (GD). For SGD, each iteration consists of performing the following operation for all ratings in a random order:

  e_uv = R_uv − p_u^T q_v                        (5)
  p*_u = p_u + γ_t [e_uv q_v − λ_p p_u]          (6)
  q*_v = q_v + γ_t [e_uv p_u − λ_q q_v]          (7)
  (p_u, q_v) = (p*_u, q*_v)                      (8)

where γ_t is the step size for the t-th iteration (typically, γ_t = γ_0 s^t, and s is the step size reduction factor, 0 < s ≤ 1). GD performs similar operations but updates all the p_u and q_v once per iteration instead of once per rating.

We implement all these algorithms on all the graph frameworks that we use for comparison, except for those available publicly (details in Section 3).
We implement all these algorithms on all the graph frameworksthat we use for comparison, except for those available publicly (de-tails in Section 3).
2.1 Challenges

The chosen graph algorithms vary widely in their characteristics and, correspondingly, their implementations stress different aspects of the hardware system. Table 1 shows the characteristics of the different graph algorithms. The message passing characteristics are based on those of a vertex programming implementation. There are differences in the structure and properties of the graph itself, vertex properties, access patterns, message sizes, and whether vertices are active in all iterations.
The implications of these characteristics are discussed in Section 3 for those algorithms. For example, Triangle Counting and Collaborative Filtering have total message sizes that are much larger than that of the graph itself, necessitating modifications for Giraph.

We now discuss the graph frameworks considered and how the algorithms map to them.
3. CHOICE OF FRAMEWORKS

The wide variety in graph algorithms that need to be implemented has necessitated the creation of a variety of graph frameworks. There is clearly no consensus on even what programming model gives the best productivity-performance trade-off. In this paper, we consider the following popular graph frameworks: GraphLab, CombBLAS, SociaLite, Galois and Giraph. In addition, we also include hand-optimized code for the algorithms. Each of these frameworks is described below.
GraphLab [21] is a graph framework that provides a sequential, shared memory abstraction for running graph algorithms written as "vertex programs". GraphLab works by letting vertices in a graph read incoming messages, update their values and send messages asynchronously. GraphLab partitions the graph in a 1-D fashion (vertex partitioning). All graph algorithms must be expressed as a program running on a vertex, which can access its own value as well as those of its edges and neighboring vertices. The runtime takes care of scheduling, messaging and synchronization.
The Combinatorial BLAS [11] is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics. CombBLAS treats graphs as sparse matrices and partitions the non-zeros of the matrix (edges in the graph) across nodes. As such, this is the only framework that supports an edge-based partitioning of the graph (also referred to as 2-D partitioning in the paper). Graph computations are expressed as operations among sparse matrices and vectors using arbitrary user-defined semirings.
SociaLite [30, 31] is based on Datalog, a declarative language that can express various graph algorithms succinctly due to its better support for recursive queries compared to SQL [32]. In SociaLite, the graph and its metadata are stored in tables, and declarative rules are written to implement graph algorithms. SociaLite tables are horizontally partitioned, or sharded, to support parallelism. Users can specify how they want to shard a table at table declaration time, and the runtime partitions and distributes the tables accordingly. SociaLite only supports 1-D partitioning.
Giraph [8] is an iterative graph processing system that runs on top of the Hadoop framework. Computation proceeds as a sequence of iterations, called supersteps, in a bulk synchronous parallel (BSP) fashion. Initially, every vertex is active. In each superstep, each active vertex invokes a Compute method provided by the user. The Compute method: (1) receives messages sent to the vertex in the previous superstep, (2) computes using the messages and the vertex and outgoing edge values, which may result in modifications to the values, and (3) may send messages to other vertices. The Compute method does not have direct access to the values of other vertices and their outgoing edges. Inter-vertex communication occurs by sending messages. Computation halts if all vertices have voted to halt and there are no messages in flight. Giraph partitions the vertices in a 1-D fashion (vertex partitioning).
Since graph computations can be very irregular (little locality, varying amount of work per iteration, etc.), Galois [26], a framework developed for handling irregular computations, can also be used for graph processing. Galois is a work-item based parallelization framework that can handle graph computations (and other irregular problems as well). It provides a rich programming model with coordinated and autonomous scheduling, with and without application-defined priorities. Galois provides its own schedulers and scalable data structures, but does not impose a particular partitioning scheme, which may be edge- or vertex-based depending on how the computation is expressed in the framework.
Other than the explained differences, a major differentiator of the frameworks is the communication layer between different hardware nodes. Our native implementation and CombBLAS use MPI, whereas GraphLab and SociaLite use sockets. Giraph uses a network I/O library (Netty), while Galois does not have a multi-node implementation as yet.

Table 2 shows a high-level comparison between the different frameworks under consideration in this paper.
3.1 Example - PageRank

We explain the differences in programming model between the frameworks with a small example, showing how PageRank can be implemented in all the frameworks.
Graph Frameworks
Table 2: High level comparison of the graph frameworks

  Framework  Programming model  Multi-node  Language  Graph partitioning  Communication layer
  Native     N/A                Yes         C/C++     N/A                 MPI
  GraphLab   Vertex             Yes         C++       1-D                 Sockets
  CombBLAS   Sparse matrix      Yes         C++       2-D                 MPI
  SociaLite  Datalog            Yes         Java      1-D                 Sockets
  Galois     Task-based         No          C/C++     N/A                 N/A
  Giraph     Vertex             Yes         Java      1-D                 Netty
[Figure 2: An example directed graph with 4 vertices]
Let us begin by finding the optimal way to execute PageRank. We refer to this hand-optimized version of the algorithm as the native implementation. We observe that the PageRank computation as given by equation (1) performs one multiply-add operation per edge. Representing the graph in Compressed Sparse Row (CSR) format [15], an efficient way of storing a sparse matrix (graph) as an adjacency list, allows the edges to be stored as a single, contiguous array. This allows all accesses to the edge array to be regular and improves memory bandwidth utilization through hardware prefetching. Since each vertex has to access the PageRank values of all the vertices with incoming edges, we store the incoming edges in CSR format (not outgoing as would generally be the case). For the multi-node setup, the graph is partitioned in a 1-D fashion, i.e., partitioning the vertices (along with their corresponding in-edges) among the nodes so that each node has roughly the same number of edges. Each node calculates the local updates and packages the PageRank values to be sent to the other nodes. These messages are then used to calculate the remote updates. More details on the optimizations in PageRank are given in Section 6.
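A minimal sketch of the CSR in-edge layout described above (using the 4-vertex graph of Figure 2, with edges inferred from the adjacency matrix A given later in this section; variable names are illustrative):

```python
# CSR storage of in-edges: for vertex v, neighbors[offsets[v]:offsets[v+1]]
# lists its in-neighbors, packed into one contiguous array.
edges_in = {0: [], 1: [0], 2: [0, 1], 3: [1, 2]}   # vertex -> in-neighbors

offsets = [0]
neighbors = []
for v in sorted(edges_in):
    neighbors.extend(edges_in[v])
    offsets.append(len(neighbors))

def in_neighbors(v):
    return neighbors[offsets[v]:offsets[v + 1]]
```

Because all edges live in one array scanned front to back, per-vertex traversal becomes the streaming access pattern that hardware prefetchers reward, which is the point the paragraph makes.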
Let us see how the PageRank algorithm maps to vertex programming. In this model, we write a program that executes on a single vertex. This program can only access "local" data, i.e., information about the vertices and edges that are directly connected to a given vertex. An example vertex program (in pseudocode) is provided in Algorithm 1. The exact semantics of how messages are packed and received, how much local data can be accessed, global reductions, etc. vary across different implementations. GraphLab [21] and Giraph [8] are both examples of this approach to graph processing.
Algorithm 1: Vertex program for one iteration of PageRank
  begin
    PR ← r
    for msg ∈ incoming messages do
      PR ← PR + (1 − r) * msg
    Send PR / degree to all outgoing edges
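Algorithm 1 can be sketched as runnable Python on the Figure 2 graph (edges inferred from the adjacency matrix A given below; r = 0.15 is an assumed teleport constant, not specified in the paper):

```python
# Vertex-program PageRank (Algorithm 1) on the 4-vertex graph of Figure 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]   # directed edges
out_deg = {v: sum(1 for s, _ in edges if s == v) for v in range(4)}
r = 0.15
pr = {v: 1.0 for v in range(4)}

for _ in range(30):
    inbox = {v: [] for v in range(4)}
    for s, t in edges:
        inbox[t].append(pr[s] / out_deg[s])        # Send PR/degree
    pr = {v: r + (1 - r) * sum(inbox[v]) for v in range(4)}
```

Vertex 0 has no incoming edges, so its value settles at exactly r, and rank accumulates toward vertex 3, which every path feeds into.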
Another distinct approach to processing large graphs is to treat them as sparse matrices, an approach embodied in frameworks like CombBLAS [11]. The computations in a single iteration of PageRank can be expressed in matrix form as follows:

  p^{t+1} = r·1 + (1 − r) A^T p̃^t   (9)

where, for the graph in Figure 2,

  A = ( 0 1 1 0
        0 0 1 1
        0 0 0 1
        0 0 0 0 ),

p^t is the vector of all the PageRank values at iteration t, p̃^t(i) = p^t(i)/d(i), d is the vector of vertex out-degrees, and 1 is a vector of all 1's.

In SociaLite, the PageRank computation is expressed as follows:
  RANK[n](t+1, $SUM(v)) :- v = r.
                        :- INEDGE[n](s), RANK[s](t, v0), OUTDEG[s](d),
                           v = (1 − r) v0 / d.

where the PageRank value of a vertex n at iteration t+1 in the rule head (RANK[n](t+1, $SUM(v))) is declared as the summation of the constant term in the first rule and the normalized PageRank values from neighbor vertices at iteration t in the second rule. The incoming edge table, INEDGE, stores vertices in its first column and neighboring vertices in its second column. INEDGE is declared as a tail-nested table [30], effectively implementing the CSR format used in the native implementation and CombBLAS. Another version of PageRank is implemented from the perspective of a vertex that distributes its PageRank value to its neighboring vertices, which is expressed as follows:

  RANK[n](t+1, $SUM(v)) :- v = r;
                        :- RANK[s](t, v0), OUTEDGE[s](n), OUTDEG[s](d),
                           v = (1 − r) v0 / d.

In this implementation, all join operations in the rule body are locally computed, and there is only a single data transfer for the RANK table update in the rule head. Compared to the previous version, there is one less data transfer, reducing communication overheads. In terms of lock overhead, since we cannot determine which shard of RANK will be updated, locks must be held for every update. Hence, the first version is optimized for a single multi-core machine, while the second is optimized for distributed machines.
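The matrix form of equation (9) can be sketched directly with numpy, using the A matrix given for Figure 2 (r = 0.15 is an assumed teleport constant):

```python
import numpy as np

# One PageRank iteration in matrix form, eq. (9), for the Figure 2 graph.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
r = 0.15
d = A.sum(axis=1)                # out-degrees
p = np.ones(4)

for _ in range(30):
    # p_tilde(i) = p(i)/d(i); vertices with out-degree 0 contribute nothing
    p_tilde = np.divide(p, d, out=np.zeros_like(p), where=d > 0)
    p = r * np.ones(4) + (1 - r) * (A.T @ p_tilde)
```

This is the same fixed point the vertex program reaches; the sparse-matrix view just replaces per-vertex message passing with one matrix-vector product per iteration.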
The PageRank implementation in Galois is very similar to that of GraphLab or Giraph, i.e., the parallelization is over vertices. Each work item in Galois is a vertex program for updating its PageRank. Since Galois runs on only a single node with a shared memory abstraction, each task has access to all of the program's data.
3.2 Mapping Algorithms to Frameworks

As mentioned earlier, the PageRank algorithm can be expressed in a variety of frameworks. In a similar fashion, we also describe the implementations of the other algorithms here:
Breadth-First Search: Vertex programs are straightforward to write for this algorithm. Algorithm 2 shows the pseudocode of the BFS implementation. All distances are initialized to infinity (except the starting vertex, which is initialized to 0). The iterations continue until there are no updates to distances.
Algorithm 2: Vertex program for one iteration of BFS
  begin
    for msg ∈ incoming messages do
      Distance ← min(Distance, msg + 1)
    Send Distance to all outgoing edges
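Algorithm 2 can be sketched as runnable Python on the undirected version of the Figure 2 graph (edge list inferred from the adjacency matrix A; the loop-until-no-updates structure mirrors the iterations-until-convergence described above):

```python
import math

# Vertex-program BFS (Algorithm 2) on the undirected Figure 2 graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
neighbors = {v: set() for v in range(4)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

dist = {v: math.inf for v in range(4)}
dist[0] = 0                                   # starting vertex

changed = True
while changed:                                # iterate until no updates
    changed = False
    for v in range(4):
        for u in neighbors[v]:                # "message" from each neighbor
            if dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                changed = True
```

Unlike PageRank, only vertices whose distance just changed have anything new to send, which is why Table 1 lists BFS vertices as active in only some iterations.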
The CombBLAS implementation performs a matrix-vector multiplication in every iteration. For example, in order to traverse the graph from both vertices 0 and 1 in Figure 2, we only have to do
[Satish et al., 2014]
Real-world datasets
Galois is the only framework that implements SGD (not just GD) in a fashion similar to that of the native implementation. This is possible for two reasons: (1) since partitioning is flexible with Galois, we can apply the n^2 uniform 2-D chunk partitioning, and (2) since Galois is running only on a single node, it can maintain globally consistent state after any update without much performance degradation. Each work-item in Galois performs the SGD update on a single edge (u, v), i.e., it updates both p_u and q_v.

We note that SGD has much better convergence characteristics than Gradient Descent. For the Netflix dataset, given a fixed convergence criterion, SGD converges in about 40x fewer iterations than GD. Of course, like all machine learning algorithms, both SGD and GD have parameters that affect convergence (learning rate, step size reduction and so forth). We did a coarse sweep over these parameters to obtain best convergence.
4. EXPERIMENTAL SETUP

We now describe our experimental setup with details about our frameworks, the data sets used and our experimental platform.
4.1 Choice of Datasets

We use a mix of both real-world datasets and data generators for synthetic data in our experiments. Both classes of datasets ideally follow a power law (Zipf) distribution. For the PageRank and Triangle Counting problems we use directed graphs, whereas for BFS and Collaborative Filtering we use undirected graphs. Collaborative Filtering requires edge-weighted, bipartite graphs that represent the ratings matrix.

4.1.1 Real-world Datasets
Table 3: Real-world and largest synthetic datasets

  Dataset               # Vertices                          # Edges                 Brief Description
  Facebook [34]         2,937,612                           41,919,708              Facebook user interaction graph
  Wikipedia [14]        3,566,908                           84,751,827              Wikipedia link graph
  LiveJournal [14]      4,847,571                           85,702,475              LiveJournal follower graph
  Netflix [9]           480,189 users, 17,770 movies        99,072,112 ratings      Netflix Prize ratings
  Twitter [20]          61,578,415                          1,468,365,182           Twitter follower graph
  Yahoo Music [7]       1,000,990 users, 624,961 items      252,800,275 ratings     Yahoo! KDDCup 2011 music ratings
  Synthetic
    Graph500 [23]       536,870,912                         8,589,926,431           Described in Section 4
  Synthetic Collab.
    Filtering           63,367,472 users, 1,342,176 items   16,742,847,256 ratings  Described in Section 4
Table 3 provides details on the real-world datasets we used. The Facebook, Wikipedia, LiveJournal and Twitter graphs were used to run PageRank, BFS and Triangle Counting, while the Netflix and Yahoo Music datasets were used to run Collaborative Filtering.
The major issue with the real-world datasets is primarily their small size. While these datasets are useful for algorithmic exploration, they are not really scalable to multiple nodes. In particular, the Facebook, Wikipedia, LiveJournal and Netflix datasets were small enough to fit on a single machine. We do have two large real-world datasets: Twitter and Yahoo Music. These were run on 4 nodes, except for Triangle Counting on Twitter, which required 16 nodes to complete in a reasonable amount of time. Even though the Twitter graph itself can fit in fewer nodes, as mentioned in Section 3, the total message size is much larger than the graph and needs more resources. To provide a sense of the size of these datasets, the largest dataset, Twitter, is just over 30 GB in size. Although we do have a few large examples, developing data generators that can produce synthetic data that is representative of real-world workloads is essential to study scalability.
4.1.2 Synthetic Data Generation

We derive our synthetic data generators from the Graph500 RMAT data generators [23]. We generate data to scale the graphs up to 64 nodes. To give a sense of the size of the synthetic graphs, a large-scale 64-node run for PageRank and BFS processes over 8 billion edges. We now describe how we use the Graph500 generator to generate both directed graphs and ratings matrices for our applications.

We use the default RMAT parameters (A = 0.57, B = C = 0.19) used for Graph500 to generate undirected graphs. The RMAT generator only generates a list of edges (with possible duplicates). For PageRank, we assign a direction to all the edges generated. For BFS, depending on the framework used, we either provide it undirected edges or provide two edges in both directions for each generated edge. For Triangle Counting, we use slightly different RMAT parameters (A = 0.45, B = C = 0.15) to reduce the number of triangles in the graph. We then assign a direction to edges going from the vertex with smaller id to the one with larger id to avoid cycles. This is done to make the implementations efficient on all frameworks.
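The RMAT sampling scheme can be sketched as follows: each edge is placed by recursively descending into one quadrant of the adjacency matrix with probabilities (A, B, C, 1−A−B−C). This is a simplified sketch of the Graph500 generator (no deduplication or scrambling), with the default parameters quoted above.

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, rng=random.Random(42)):
    """Sample one edge of a 2^scale-vertex RMAT graph by repeatedly
    picking a quadrant with probabilities (a, b, c, 1-a-b-c)."""
    src = dst = 0
    for _ in range(scale):
        u = rng.random()
        src, dst = src * 2, dst * 2
        if u < a:
            pass                            # top-left quadrant
        elif u < a + b:
            dst += 1                        # top-right
        elif u < a + b + c:
            src += 1                        # bottom-left
        else:
            src, dst = src + 1, dst + 1     # bottom-right
    return src, dst

edges = [rmat_edge(scale=10) for _ in range(1000)]
```

Because the same skewed quadrant probabilities apply at every level of recursion, the resulting degree distribution is heavy-tailed, which is why RMAT graphs approximate the power-law shape of real networks.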
In order to generate large rating matrices, we wrote a rating matrix generator that follows the power-law distribution of the Netflix dataset. To achieve this, we start with graphs generated by the Graph500 graph generator. Through experimentation, we found that RMAT parameters of A = 0.40 and B = C = 0.22 generate degree distributions whose tail is reasonably close to that of the Netflix dataset. Finally, we post-processed the graphs to remove all vertices with degree < 5 from the graph. This yielded graphs that closely follow the degree distributions of the Netflix dataset. In order to convert these Graph500 graphs into bipartite graphs of N_users × N_movies (items are named movies in the case of the Netflix data), we first chunk the columns of the Graph500 matrix into chunks of size N_movies. We then "fold" the matrix by performing a logical "or" of these chunks.
Given that the power law distribution is what differentiates a real-world dataset, we believe our data generator is much more representative of real collaborative filtering workloads than other data generators such as that of [16], which generates data by sampling uniformly, matching the expected number of non-zeros overall but not following a power law distribution.
For our experiments, we generate synthetic data of up to 16 billion ratings for 64 million users × 800K movies. For scaling, we generate about 256 million ratings per node, or 16 billion ratings for 64 nodes.
4.2 Framework Versions

We used GraphLab v2.2 [3] for this comparison. We used the CombBLAS v1.3 code from [1]. We obtained the SociaLite code, with additional optimizations over the results in [30], from the authors. For Giraph, we used release 1.1.0 from [8]. Finally, we obtained Galois v2.2.0 from [2].

We implemented all the algorithms on these frameworks wherever a public version did not exist.
4.3 Experimental Platform

We run the benchmarks on an Intel Xeon CPU E5-2697 based system. Each node has 64 GB of DRAM and 24 cores supporting 2-way Simultaneous Multi-Threading, each running at 2.7 GHz. The nodes are connected with a Mellanox Infiniband
[Satish et al., 2014]
Results (Run Time)
[Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are small enough to run on a single node, comparing Native, CombBLAS, GraphLab, SociaLite, Giraph and Galois on (a) PageRank (time per iteration), (b) Breadth-First Search (overall time), (c) Collaborative Filtering (time per iteration) and (d) Triangle Counting (overall time). The y-axis represents runtime in log scale, so lower numbers are better.]
FDR interconnect. The cluster runs the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within a node and MPI code to parallelize across nodes in native code. We use the Intel C++ Composer XE 2013 SP1 Compiler and the Intel MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the multiple cores within the node. Moreover, CombBLAS requires the total number of processes to be a square (due to its 2-D partitioning approach). Hence we use 36 MPI processes per node to run on the 48 hardware threads, and we further run on a square number of nodes (1, 2, 4, 9, 16, 36 and 64 nodes). SociaLite uses Java threads and processes for parallelism. Giraph uses the Hadoop framework for parallelism (we run 4 workers per node). Finally, Galois is a single-node framework and uses OpenMP for parallelism.
5. RESULTS

5.1 Native Implementation Bottlenecks

Since different graph frameworks have their own programming models with different implementation trade-offs, it is hard to directly compare these frameworks with each other without a clear reference point. As explained in Section 1, we provide a well-optimized native implementation of these algorithms for both single-node and multi-node systems. Since the native code is optimized, it is easy to see which aspects of the system are stressed by a particular algorithm. Table 4 provides data on the achieved limits of the native implementations on single-node and 4-node systems.
Table 4: Efficiency achieved by native implementations of different algorithms on single and 4 nodes.

  Algorithm        Single node                      4 nodes
                   H/W limitation  Efficiency       H/W limitation  Efficiency
  PageRank         Memory BW       78 GBps (92%)    Network BW      2.3 GBps (42%)
  BFS              Memory BW       64 GBps (74%)    Memory BW       54 GBps (63%)
  Coll. Filtering  Memory BW       47 GBps (54%)    Memory BW       35 GBps (41%)
  Triangle Count.  Memory BW       45 GBps (52%)    Network BW      2.2 GBps (40%)
We find that on both single- and multi-node implementations, the algorithm performance is dependent on either memory or network bandwidth. The efficiencies are generally within 2-2.5x of the ideal results. Given the diversity of bottlenecks within the 4 algorithms, and also between single- and multi-node implementations, we expect that it would be difficult for any one framework to excel at all scales in terms of performance and productivity.
5.2 Single-node Results

We now show the results of running the PageRank, BFS, Collaborative Filtering and Triangle Counting algorithms on the CombBLAS, GraphLab, SociaLite, Giraph and Galois frameworks.

In the following, we only compare time taken per iteration on the various frameworks for Collaborative Filtering and PageRank. As described in Section 3, the native code for Collaborative Filtering implements Stochastic Gradient Descent, which converges much faster than the Gradient Descent implementation in the other frameworks. However, we do not see much difference in performance per iteration between Stochastic Gradient Descent and Gradient Descent in native code. Hence we compare time per iteration here, to separate the fact that the frameworks are not expressive enough to express SGD from other potential performance differences. Similarly, some PageRank implementations differ in whether early convergence is detected for the algorithm, and hence we report time per iteration to normalize for this.
Algorithm         CombBLAS  GraphLab  SociaLite  Giraph  Galois
PageRank              1.9       3.6       2.0      39.0     1.2
BFS                   2.5       9.3       7.3     567.8     1.1
Coll. Filtering       3.5       5.1       5.8      54.4     1.1
Triangle Count.      33.9       3.2       4.7     484.3     2.5

Table 5: Summary of performance differences for single node benchmarks on different frameworks for our applications. Each entry is a slowdown factor from native code, hence lower numbers indicate better performance.
Figures 3(a), 3(b) and 3(d) show the results of running PageRank, BFS, and Triangle Counting, respectively, on our frameworks using the real-world Livejournal, Facebook and Wikipedia datasets described in Section 4.1. We also show the results of running a synthetic scale-free RMAT graph (obtained using the Graph500 data generator). Figure 3(c) shows the performance of our frameworks on the Netflix [9] dataset, as well as a synthetically generated collaborative filtering dataset. For convenience, we also present the geometric mean of this data across datasets in Table 5. This table shows the slowdown of each framework w.r.t. native code.
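The per-framework summary numbers in Table 5 are geometric means of per-dataset slowdowns. A minimal sketch of that aggregation, using made-up slowdown factors rather than the paper's raw per-dataset data:

```python
import math

def geometric_mean(xs):
    # nth root of the product, computed in log space for numerical stability
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# hypothetical per-dataset slowdowns of one framework vs. native code
slowdowns = [1.5, 2.0, 3.0]
print(round(geometric_mean(slowdowns), 2))  # 2.08
```

The geometric mean is the right summary here because slowdowns are ratios: a 2x and a 0.5x result should average out to 1x, which the arithmetic mean would not give.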
We see the following key inferences: (1) Native code, as expected, delivers the best performance, as it is optimized for the underlying architecture. (2) Galois performs better than the other frameworks,
Results (Run Time)
34D. Koop, CIS 602-01, Fall 2017
[Figure 3 omitted: four log-scale bar charts, (a) PageRank, (b) Breadth-First Search, (c) Collaborative Filtering, (d) Triangle Counting, comparing runtimes in seconds for Native, CombBLAS, GraphLab, SociaLite, Giraph and Galois on the Livejournal, Facebook, Wikipedia, Netflix and synthetic datasets.]

Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis represents runtime (in log scale), therefore lower numbers are better.
FDR interconnect. The cluster runs the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the multiple cores within the node. Moreover, CombBLAS requires the total number of processes to be a square (due to their 2D partitioning approach). Hence we use 36 MPI processes per node to run on the 48 hardware threads; and we further run on a square number of nodes (1, 2, 4, 9, 16, 36 and 64 nodes). SociaLite uses Java threads and processes for parallelism. Giraph uses the Hadoop framework for parallelism (we run 4 workers per node). Finally, Galois is a single node framework and uses OpenMP for parallelism.
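The square-process-count constraint above is CombBLAS's (its 2D partitioning arranges processes in a p x p grid); the quick check below is only an illustration of why 36 processes per node on a square node count satisfies it:

```python
import math

def is_perfect_square(n):
    # integer square-root test avoids floating-point rounding issues
    r = math.isqrt(n)
    return r * r == n

# 36 MPI processes per node: with a square node count, the total
# process count stays a perfect square, as CombBLAS requires
for nodes in (1, 4, 9, 16, 36, 64):
    assert is_perfect_square(nodes * 36)
print("all totals are perfect squares")
```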
5. RESULTS
5.1 Native implementation bottlenecks
Since different graph frameworks have their own programming models with different implementation trade-offs, it is hard to directly compare these frameworks with respect to each other without a clear reference point. As explained in Section 1, we provide a well-optimized native implementation of these algorithms for both single- and multi-node systems. Since the native code is optimized, it is easy to see which aspects of the system are stressed by a particular algorithm. Table 4 provides data on the achieved limits of the native implementations on single-node and 4-node systems.
                       Single Node                   4 Nodes
Algorithm        H/W limitation  Efficiency    H/W limitation  Efficiency
PageRank         Memory BW       78 GBps (92%) Network BW      2.3 GBps (42%)
BFS              Memory BW       64 GBps (74%) Memory BW       54 GBps (63%)
Coll. Filtering  Memory BW       47 GBps (54%) Memory BW       35 GBps (41%)
Triangle Count.  Memory BW       45 GBps (52%) Network BW      2.2 GBps (40%)

Table 4: Efficiency achieved by native implementations of different algorithms on single and 4 nodes.
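Efficiency in Table 4 is achieved bandwidth divided by the hardware peak. A one-line sketch; the 85 GBps peak memory bandwidth below is an assumed figure for illustration, not a number from the paper:

```python
def efficiency(achieved, peak):
    # fraction of the hardware peak actually sustained
    return achieved / peak

# PageRank single node: 78 GBps achieved against an assumed ~85 GBps peak
print(f"{efficiency(78, 85):.0%}")  # 92%
```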
We find that on both single- and multi-node implementations, the algorithm performance is dependent on either memory or network bandwidth. The efficiencies are generally within 2-2.5X of the ideal results.
Results (Scalability): Synthetic Graphs
35D. Koop, CIS 602-01, Fall 2017
and is close to native performance (geometric mean of 1.1-1.2X for PageRank, BFS and Collaborative Filtering, and 2.5X for Triangle Counting). (3) Giraph, on the other hand, is 2-3 orders of magnitude slower than native code. (4) CombBLAS and GraphLab perform well on average. CombBLAS is very good for all algorithms except for Triangle Counting, where it ran out of memory for real-world inputs while computing the A^2 matrix product. This is an expressibility problem in CombBLAS. GraphLab is 3-9X off from native code, but performs reasonably consistently across algorithms. (5) SociaLite performance is typically comparable to GraphLab (sometimes slightly better and sometimes slightly worse).
Finally, note that the trends on the synthetic dataset are in line with real-world data, showing that our synthetic generators are effective in modeling real-world data.
5.3 Multi node results
We first show the scaling results of our frameworks on multiple nodes. A major reason for using multiple nodes to process graph data is to store the data in memory across the nodes. Hence a common use case is weak scaling, where the data per node is kept constant (and hence the total dataset size increases with the number of nodes). If we obtain perfect performance scaling, then the runtime should be constant as we increase node count and dataset size. In this study, we include the CombBLAS, GraphLab, SociaLite and Giraph frameworks. Galois is currently only a single node framework, and we hence do not include it here.
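A toy sketch of weak-scaling dataset sizing: the per-node data is fixed, so the total problem grows linearly with the node count, and perfect scaling means the runtime stays flat as both grow together.

```python
def total_edges(edges_per_node, nodes):
    # weak scaling: data per node is constant, total grows with node count
    return edges_per_node * nodes

# e.g. 128M edges/node, as in the PageRank weak-scaling runs
for nodes in (1, 2, 4, 8):
    print(nodes, total_edges(128_000_000, nodes))
```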
Figures 4(a), 4(b), 4(c) and 4(d) show the results of multi node runs on synthetically generated datasets for our benchmarks. The data sizes are chosen so that all frameworks could complete without running out of memory. Figure 5 shows the corresponding performance results for larger real-world graphs. We run each algorithm using one large dataset: we use the Twitter dataset [20] for PageRank, BFS and Triangle Counting, and the Yahoo Music KDDCup 2011 dataset for Collaborative Filtering [7].
Algorithm         CombBLAS  GraphLab  SociaLite  Giraph
PageRank              2.5      12.1       7.9      74.4
BFS                   7.1      29.5      18.9     494.3
Coll. Filtering       3.5       7.1       7.0      87.9
Triangle Count.      13.1       3.6       1.5      54.4

Table 6: Summary of performance differences for multi node benchmarks on different frameworks for our applications. Each entry is a slowdown factor, hence lower numbers indicate better performance.
As a convenient summary of performance, Table 6 shows the geometric mean of the performance differences between our frameworks, combining real-world and synthetic datasets at different scales. The table shows the performance slowdowns of different frameworks for specific algorithms compared to the native code for that algorithm; hence lower numbers are better.
We note the following trends in our multi-node results. (1) There is wide variability in our multi node results; as an example, native code performs anywhere between 2X and more than 560X better than the other frameworks on multi node runs (still up to 30X discounting Giraph runtimes). (2) Giraph performs worst by far among the frameworks and is frequently 2-3 orders of magnitude off from native performance. (3) CombBLAS is competitive for PageRank (geomean of 2.5X off native performance), BFS (7.1X off native) and Collaborative Filtering (3.5X off native). However, it performs poorly on Triangle Counting due to extra computations performed as a result of framework expressibility issues. CombBLAS also runs out of memory for the Twitter dataset, and hence this data point is not plotted. (4) GraphLab performs well for Triangle Counting, due to data structure optimizations performed for this case, namely the
[Figure 4 omitted: four weak-scaling charts over 1-64 nodes, (a) PageRank (128M edges/node), (b) Breadth-First Search (128M undirected edges/node), (c) Collaborative Filtering (250M edges/node), (d) Triangle Counting (32M edges/node), comparing Native, CombBLAS, GraphLab, SociaLite and Giraph; y-axes show runtime in seconds.]

Figure 4: Performance results for different algorithms on large scale synthetic graphs. The y-axis represents runtime in log scale. We perform weak scaling, where the amount of graph data per node is kept constant: (a) 128M edges/node for PageRank, (b) 128M edges/node for BFS, (c) 256M ratings/node for SGD, and (d) 32M edges/node for Triangle Counting. Horizontal lines represent perfect scaling.
Results (Multiple Nodes): Real-world Data
39D. Koop, CIS 602-01, Fall 2017
[Figure 5 omitted: bar chart (log scale, time in seconds) comparing Native, CombBLAS, GraphLab, SociaLite and Giraph on PageRank (Twitter, 4 nodes), BFS (Twitter, 4 nodes), Collaborative Filtering (Yahoo Music, 4 nodes) and Triangle Counting (Twitter, 16 nodes); one off-scale bar is labeled 96746.8 seconds.]

Figure 5: Performance results for large real world graphs run on multiple nodes. We show the results of running (in log scale) PageRank, BFS and Triangle Counting on a subset of Twitter [20], and Collaborative Filtering on the Yahoo Music KDDCup dataset [7].
cuckoo hash data structure that allows for a fast union of neighbor lists, which is the primary operation in the algorithm. (5) GraphLab performance drops off significantly for multi node runs (especially for PageRank) due to network bottlenecks. (6) SociaLite performance is typically between GraphLab and CombBLAS, except for Triangle Counting, where it performs best among our frameworks. It also shows more stable performance at different scales than GraphLab. (7) Native performance is relatively stable on the various benchmarks. The performance noticeably drops for Collaborative Filtering, where network traffic increases with the number of nodes, and the algorithm gradually becomes more network limited. (8) Finally, the trends we see in real-world data are broadly similar to those for synthetic data. In particular, CombBLAS performs best among the non-native frameworks for three of the four algorithms (PageRank, BFS and Collaborative Filtering), while SociaLite performs best for Triangle Counting.
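The neighbor-set operation GraphLab accelerates with its cuckoo hash can be sketched with ordinary hash sets; Python's built-in sets stand in for the cuckoo-hash structure here, so this is an illustration of the technique, not GraphLab's code:

```python
def count_triangles(adj):
    # adj: list of neighbor sets for an undirected graph
    total = 0
    for u in range(len(adj)):
        for v in adj[u]:
            if v > u:
                # common neighbors w with u < v < w count each triangle once;
                # the set intersection is the hash-accelerated step
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

# a 4-clique contains exactly 4 triangles
k4 = [set(range(4)) - {i} for i in range(4)]
print(count_triangles(k4))  # 4
```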
Given the various performance issues we see in the frameworks, we next delve deeper into metrics besides runtime to gain more insight into the frameworks.
5.4 Framework Analysis
We further investigated CPU utilization, memory footprint and network traffic to narrow down the reasons for the observed performance trends. We use the sar/sysstat monitoring tools [4] available in Linux to measure these. The results are summarized in Figure 6. We briefly describe our key findings in this section, and defer a detailed discussion of what can be done to address these problems to Section 6.
Memory footprint: First, note that our dataset sizes are chosen such that the memory footprint of one or more frameworks (usually Giraph) is more than 50% of total memory; hence we cannot increase the dataset size per node. The memory footprint is usually not due to just the input graph size, but due to the large intermediate data structures, e.g. message buffers, that need to be kept. It is possible to avoid buffering entire messages at once (as GraphLab does) and hence improve this behavior of Giraph, as discussed in Section 6. Large memory footprints are also seen in CombBLAS (Triangle Counting), where the problem is framework expressibility.
CPU utilization: The CPU utilization of the various frameworks is high when they are not I/O or network limited. Giraph has especially low CPU utilization across the board. This is because memory limitations restrict the number of workers that can be assigned to a single node to 4 (even though the number of cores per node is 24). Each worker requires more memory, and we cannot add more workers within our memory limit. This limits the utilization to 4/24 ≈ 16%. Our native code is network limited for Triangle Counting (unlike CombBLAS and GraphLab, which are memory bandwidth bound) and has relatively low CPU utilization. PageRank is limited by network traffic for all the frameworks as well.
Peak Network Transfer Rate: The peak network transfer rate when running different algorithms is relatively stable for a given framework, and depends mostly on the communication layer used. Note that Giraph has the lowest peak traffic rate, less than 0.5 gigabytes per second (GBps), and CombBLAS and native code the highest, over 5 GBps. Although SociaLite and GraphLab use similar communication layers, SociaLite achieves about twice the peak rate of GraphLab (discussion in Section 6). In fact, the currently available SociaLite code (from the authors) does indeed have low transfer rates, and we were able to improve this behavior substantially using network traffic analysis, which is an important contribution of this paper.
Data volumes: The amount of data transferred varies considerably by algorithm (as shown in Figure 6), as well as by platform (depending on the compression schemes and data structures used). We show the latter impact in Section 6.
These measurements are very useful for predicting the slowdowns we see with graph frameworks compared to the native implementation. For example, if we look only at the measured network parameters for PageRank to estimate performance differences (network bytes sent / peak network bandwidth) from Figure 6, we can estimate that the frameworks would be 1.75, 9.8, 5.6 and 32.7× slower than native code (assuming all network transfers happen at the peak measured rate). Even this rough estimate is within a factor of 2.5 of the observed performance differences (Table 6). Similar analysis can be done for other algorithms as well. Bandwidth-bound code would need to estimate the number of reads/writes and scale it with the memory footprint.
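The back-of-envelope model above (communication time ≈ bytes sent / peak rate, slowdown ≈ ratio of those times) can be sketched as follows; the numbers are placeholders, not the measured Figure 6 values:

```python
def comm_time(bytes_sent_gb, peak_rate_gbps):
    # lower bound on transfer time if everything moves at the peak rate
    return bytes_sent_gb / peak_rate_gbps

def slowdown_vs_native(fw_gb, fw_rate, native_gb, native_rate):
    # estimated slowdown of a framework relative to native code,
    # considering network transfers only
    return comm_time(fw_gb, fw_rate) / comm_time(native_gb, native_rate)

# placeholder measurements: (GB sent, peak GBps)
print(slowdown_vs_native(100, 2.0, 40, 5.0))  # 6.25
```

Because it ignores compute and memory traffic, this is a lower-bound style estimate, which is why the paper's version lands within a factor of 2.5 of the observed slowdowns rather than matching them exactly.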
In the next section, we discuss the optimizations we perform in the native code, show examples of how such optimizations are also useful in other frameworks, and describe our key recommendations for the frameworks.
6. DISCUSSION

In this section, we discuss the various optimization techniques used to improve the performance of our native code. We then present recommendations on how the frameworks can be optimized to achieve native-code-level performance.
6.1 Native implementations
6.1.1 Key optimization techniques

The key features for performance and scalability that our native code uses are listed below. Note that not every technique has been applied to, or is applicable to, every algorithm; however, we maximize the use of these techniques to obtain the best possible native code.
Data structures: Careful selection of data structures can benefit performance significantly. For example, algorithms like BFS and Triangle Counting can use bit-vectors instead of other data structures for constant-time lookups while minimizing cache misses. In our native BFS and Triangle Counting code, this yields a benefit of slightly over 2×. GraphLab keeps a cuckoo-hash data structure in Triangle Counting for the same reason, resulting in efficient code that runs only 2-3× off native performance.
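A bit-vector of the kind described above can be sketched in a few lines: one bit per vertex gives O(1) set/test while using an 8x-32x smaller footprint than a byte or integer array, which is what reduces cache misses. The class and method names here are illustrative, not the paper's code.

```python
class BitVector:
    """One bit per vertex: constant-time membership with minimal memory."""

    def __init__(self, num_vertices):
        # One byte covers 8 vertices; round up.
        self.bits = bytearray((num_vertices + 7) // 8)

    def set(self, v):
        # Byte index is v // 8, bit index within the byte is v % 8.
        self.bits[v >> 3] |= 1 << (v & 7)

    def test(self, v):
        return (self.bits[v >> 3] >> (v & 7)) & 1 == 1

# Marking visited vertices in a BFS-style traversal:
visited = BitVector(1000)
visited.set(42)
print(visited.test(42), visited.test(43))  # True False
```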
Data Compression: In many cases, the data communicated among nodes consists of the ids of the destination vertices of the edges traversed. Such data has been observed to be compressible using techniques like bit-vectors and delta coding [28]. For BFS and PageRank, this results in a net benefit of about 3.2× and 2.2× respectively (and a corresponding reduction in bytes transferred in Figure 6). GraphLab and CombBLAS perform a limited form of compression that takes
Results: Resource Utilization
40 D. Koop, CIS 602-01, Fall 2017
[Figure 6 charts, panels (a) PageRank, (b) Breadth-First Search, (c) Collaborative Filtering, (d) Triangle Counting: normalized bars (0-100) for CPU utilization, peak network BW, memory footprint, and network bytes sent, for Native, CombBLAS, GraphLab, SociaLite, and Giraph.]
Figure 6: Performance metrics, including CPU utilization, peak achieved network bandwidth, memory footprint, and bytes sent over the network, for a 4-node run of our benchmarks. Metrics are normalized so that a value of 100 on the y-axis corresponds to (1) 100% CPU utilization, (2) 5.5 GB/s/node peak network BW (the network limit), (3) 64 GB of memory/node (the memory capacity), and (4) the bytes sent by Giraph/node for each algorithm (1.4 GB/node for PageRank, 0.73 GB/node for BFS, 37.4 GB/node for Collaborative Filtering, and 13.0 GB/node for Triangle Counting). Higher numbers are better for CPU utilization and peak network traffic, while lower numbers are better for memory footprint and network bytes sent.
advantage of local reductions to avoid repeated communication of the same vertex data to different target vertices on the same node.
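The delta-coding idea mentioned above can be sketched briefly: destination vertex ids in a message are kept sorted, so sending successive differences (which are small) instead of raw ids makes the stream far more compressible, for example with a variable-length byte encoding. This is an illustrative sketch, not the paper's implementation.

```python
def delta_encode(sorted_ids):
    """Replace each sorted vertex id with its gap from the previous id."""
    prev, out = 0, []
    for v in sorted_ids:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Recover the original ids by running a prefix sum over the gaps."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

ids = [5, 9, 12, 40, 41]
print(delta_encode(ids))  # [5, 4, 3, 28, 1]
assert delta_decode(delta_encode(ids)) == ids
```

The gaps (mostly single-digit values here) fit in fewer bits than the raw ids, which is where the 2-3× reduction in transferred bytes for BFS and PageRank comes from.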
Overlap of Computation and Communication: Overlapping computation and communication is possible for many applications where an entire message need not be received before computation can start on the portion that has already arrived. This hides communication latency behind useful computation and has been shown to improve the performance of various optimized implementations [28]. The native codes for BFS, PageRank, and Triangle Counting all benefit by 1.2-2×. Apart from latency hiding, overlapping also allows a very large message to be blocked into multiple smaller ones, lowering the memory footprint needed for buffer storage. This optimization specifically helps Triangle Counting; both GraphLab and the native code perform it, leading to low memory footprints.
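The overlap pattern above can be sketched with a bounded queue: a receiver thread pushes fixed-size chunks as they arrive while a worker processes each chunk immediately, so computation overlaps the remaining transfer and the bounded queue caps buffer memory. The chunking, names, and per-chunk work are illustrative assumptions, not the paper's code.

```python
import queue
import threading

def receiver(chunks, q):
    """Stand-in for network receives: push each chunk as it 'arrives'."""
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # end-of-message sentinel

def worker(q, results):
    """Process chunks as soon as they are available, overlapping the transfer."""
    while True:
        chunk = q.get()
        if chunk is None:
            break
        results.append(sum(chunk))  # stand-in for per-chunk computation

q = queue.Queue(maxsize=4)  # bounded queue caps buffer memory
results = []
msg_chunks = [[1, 2], [3, 4], [5, 6]]

t = threading.Thread(target=worker, args=(q, results))
t.start()
receiver(msg_chunks, q)
t.join()
print(results)  # [3, 7, 11]
```

In the paper's setting the same structure appears at the MPI level (nonblocking receives of message blocks); the point of the sketch is only the pipeline shape, not the transport.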
Message passing mechanisms: Using the right underlying message passing mechanism can boost performance significantly. The native code uses MPI message passing [6] to drive the underlying fabric (an FDR InfiniBand network in our case) for high-bandwidth, low-latency communication among the nodes. Although it is possible to implement message passing with other techniques, such as TCP sockets using IP over IB [33], this suffers from 2.5-3× lower bandwidth than MPI code, as observed in the case of GraphLab. However, if it is not possible to use MPI, we observe that using multiple sockets between a pair of nodes regains up to 2× of this performance, as in SociaLite.
Partitioning schemes: Most frameworks running on multiple nodes partition the graph among the nodes either so that each node owns a subset of vertices (Giraph, SociaLite) or in a 2D fashion where each node stores a subset of edges (CombBLAS). GraphLab performs a more advanced partitioning scheme in which vertices with large degree are replicated across multiple nodes to avoid load imbalance during computation. In line with observations in recent research [25], 2D partitioning as in CombBLAS or advanced 1D partitioning as in GraphLab gives better load balancing, leading to better performance and scalability.
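The two partitioning styles above can be sketched as simple owner functions: 1D assigns each vertex (and its edges) to one node, while 2D tiles the adjacency matrix so edge (u, v) lands on node (row block of u, column block of v). With P = r·c nodes, 2D bounds each node's communication partners to roughly r + c instead of P. Hash-based assignment is shown for simplicity; real frameworks may use range blocks or more elaborate schemes.

```python
def owner_1d(v, num_nodes):
    """1D (vertex) partitioning: vertex v and all its edges live on one node."""
    return v % num_nodes

def owner_2d(u, v, rows, cols):
    """2D (edge) partitioning: edge (u, v) lives on the node at
    (row block of u, column block of v) in a rows x cols node grid."""
    return (u % rows) * cols + (v % cols)

# 8 nodes as a flat 1D array vs. a 2 x 4 grid:
print(owner_1d(10, 8))        # 2
print(owner_2d(10, 7, 2, 4))  # row 0, col 3 -> node 3
```

Note that under 1D partitioning the owner of edge (u, v) depends only on u, so a high-degree vertex concentrates work on one node; the 2D scheme spreads that vertex's edges across a whole row of the grid, which is the load-balancing benefit the paper refers to.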
6.1.2 Impact of the optimization techniques

We now look at each algorithm and discuss the optimizations with the highest impact on it (we do not describe all optimizations due to space limits). Figure 7 shows the impact of various performance optimizations on the native code for PageRank and BFS.

[Figure 7 chart: speedup (0-20×) for PageRank and BFS from successive optimizations: S/W prefetching, + compression, + overlap of comp. and comm., and (for BFS) data structure opt.]

Figure 7: Effect of optimizations performed on native implementations of PageRank and BFS.

The main optimizations come from software prefetch instructions, which help hide the long latency of irregular memory accesses, and from message optimizations (compressing the messages containing edges to be sent to other nodes, and overlapping computation with communication). For BFS, an additional optimization was to use bit-vectors (a data structure optimization) to compactly maintain the set of already-visited vertices [12, 28]. For Triangle Counting, the same data structure optimization (bit-vectors), used for quick constant-time lookups to identify common neighbors of different vertices, gave a speedup of around 2.2×.
For Collaborative Filtering, the key optimizations revolve around efficient parallelization of stochastic gradient descent (SGD). We adopt the diagonal parallelization technique of Gemulla et al. [16] to parallelize SGD both within and across nodes without using locks.
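The diagonal (stratified) scheduling idea can be sketched as follows: tile the rating matrix into P × P blocks, and in sub-epoch s have worker i process block (i, (i + s) mod P). No two blocks in a sub-epoch share a row of user factors or a column of item factors, so the P workers can update their factors without locks. This is a sketch of the scheduling pattern only, under the assumption of a square block grid; the full Gemulla et al. algorithm includes the SGD updates themselves.

```python
def strata(num_workers):
    """Yield, for each sub-epoch, the list of (row_block, col_block)
    pairs that the workers process in parallel."""
    P = num_workers
    for s in range(P):
        # Worker i takes block (i, (i + s) mod P): a shifted diagonal.
        yield [(i, (i + s) % P) for i in range(P)]

for blocks in strata(3):
    print(blocks)
# [(0, 0), (1, 1), (2, 2)]
# [(0, 1), (1, 2), (2, 0)]
# [(0, 2), (1, 0), (2, 1)]
```

Within each printed sub-epoch, every row index and every column index appears exactly once, which is precisely the property that removes the need for locking.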
6.1.3 Testcase: Improving SociaLite and Giraph

We select SociaLite and Giraph to demonstrate the performance improvements possible in the frameworks, using the analysis in Section 5.4 and our understanding of the optimization techniques.

SociaLite: We first observed that the code corresponding to the published SociaLite results (obtained from the authors) exhibited poor peak network performance of about 0.5 GBps (as Giraph does today), since it uses sockets for communication. Although changing SociaLite to use MPI would have been ideal, that is much more difficult due to language and other issues. However, using multiple sockets to communicate between two workers allows much higher peak bandwidths (close to 2 GBps). This, along with minor optimizations such as merging communication data for batch processing and using custom memory allocators to minimize garbage
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
J. Gonzalez
D. Koop, CIS 602-01, Fall 2017