CIS 602-01: Scalable Data Analysis
Graph Data
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017
Machine Learning
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program

[P. Domingos]
Machine Learning
• Every machine learning algorithm has three components:
  - Representation
  - Evaluation
  - Optimization
Representation
• Decision trees
• Sets of rules / logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
[P. Domingos]
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / utility
• Margin
• Entropy
• K-L divergence
• Etc.
[P. Domingos]
Optimization
• Combinatorial optimization
  - E.g.: greedy search
• Convex optimization
  - E.g.: gradient descent
• Constrained optimization
  - E.g.: linear programming
[P. Domingos]
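To make the convex-optimization bullet concrete, here is a minimal gradient descent sketch (not from the slides): it minimizes the toy function f(x) = (x − 3)², whose gradient is 2(x − 3). The function name and parameters are illustrative only.

```python
# Toy gradient descent: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # step against the gradient direction
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each step shrinks the distance to the minimizer x = 3 by a constant factor (0.8 here), so a few dozen iterations suffice on this example.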
Types of Learning
• Supervised (inductive) learning
  - Training data includes desired outputs
• Unsupervised learning
  - Training data does not include desired outputs
• Semi-supervised learning
  - Training data includes a few desired outputs
• Reinforcement learning
  - Rewards from a sequence of actions
Scalable Machine Learning Constraints
• Constrained by size:
  - Training dataset fits in RAM, need to predict for a much larger dataset
  - Training dataset doesn't even fit in RAM; want to scale out by
    • adopting algorithms that work in batches locally, or
    • running on a distributed cluster
• Constrained by time:
  - Want to fit more models (think hyper-parameter optimization or ensemble learning) on my dataset in a given amount of time
  - Scale out by fitting more models in parallel:
    • on my laptop by using more cores, or
    • on a cluster
[T. Augspurger, 2017]
Sampling: How much training data?
• You may not need to scale
• How large should the sample be? Use statistics
[T. Augspurger, 2017]
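As a sketch of the "use statistics" point: the confidence half-width of a sample mean shrinks as 1/√n, so a required sample size can be solved for directly. The helper name and defaults below are illustrative, not from the slides.

```python
import math

def required_sample_size(sigma, margin, z=1.96):
    """Smallest n such that the z * sigma / sqrt(n) confidence
    half-width is at most `margin` (z=1.96 for ~95% confidence)."""
    return math.ceil((z * sigma / margin) ** 2)

# e.g. estimating a mean with standard deviation 10 to within +/-1
n = required_sample_size(sigma=10, margin=1)
```

Here n works out to a few hundred rows regardless of how large the full dataset is, which is the sense in which "you may not need to scale."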
Scaling scikit-learn
• Fit, then predict
• May be able to fit with training set in memory
• Can run predict out of core
  - Use dask map_partitions
  - Use the cloud (Kubernetes)
• Watch Tom's blog for more posts:
  - how dask can speed up your existing pipelines by executing them in parallel
  - scikit-learn's out-of-core API for when your training dataset doesn't fit in memory
  - using dask to implement distributed machine learning algorithms
[T. Augspurger, 2017]
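The fit-small/predict-big pattern can be sketched in plain Python: the batching below is the same idea dask's `map_partitions` applies to dataframe partitions. `ThresholdModel` is a hypothetical stand-in for a fitted scikit-learn estimator, not a real API.

```python
def batched_predict(model, rows, batch_size=1000):
    """Apply an in-memory model to a dataset too large to load at once,
    one batch at a time (the idea behind dask's map_partitions)."""
    batch = []
    for row in rows:                   # rows can be a lazy iterator over a file
        batch.append(row)
        if len(batch) == batch_size:
            yield from model.predict(batch)
            batch = []
    if batch:                          # flush the final partial batch
        yield from model.predict(batch)

class ThresholdModel:
    """Hypothetical stand-in for a fitted estimator."""
    def predict(self, xs):
        return [1 if x > 0 else 0 for x in xs]

preds = list(batched_predict(ThresholdModel(), range(-2, 3), batch_size=2))
```

Because the input is consumed lazily and each batch is discarded after prediction, peak memory is bounded by the batch size, not the dataset size.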
Deep Learning at Google
Growing Use of Deep Learning at Google
• Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, speech, translation, YouTube, … many others
• [Chart: number of unique project directories containing model description files, growing over time]
[J. Dean, 2015]
Developer Wish List for a Machine Learning System
• Ease of expression: for lots of crazy ML ideas/algorithms
• Scalability: can run experiments quickly
• Portability: can run on a wide variety of platforms
• Reproducibility: easy to share and reproduce research
• Production readiness: go from research to real products
[J. Dean, 2015]
Computation is a dataflow graph
• Graph of nodes, also called operations or ops
• [Diagram: examples and weights feed MatMul; its result and biases feed Add; Add feeds Relu; Relu's output and labels feed Xent]
[J. Dean, 2015]
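The dataflow idea can be sketched with a tiny evaluator (not TensorFlow's API): each node names an op and its inputs, and evaluation recurses through the graph. Scalars stand in for tensors, and the op names mirror the diagram.

```python
# A minimal dataflow-graph evaluator; scalars stand in for tensors.
ops = {
    "matmul": lambda a, b: a * b,      # scalar stand-in for MatMul
    "add":    lambda a, b: a + b,
    "relu":   lambda a: max(a, 0.0),
}

# Each entry: node name -> (op kind, input node names...) or ("const", value).
graph = {
    "examples": ("const", 2.0),
    "weights":  ("const", -3.0),
    "biases":   ("const", 1.0),
    "mm":       ("matmul", "examples", "weights"),
    "sum":      ("add", "mm", "biases"),
    "out":      ("relu", "sum"),
}

def evaluate(graph, node):
    kind, *args = graph[node]
    if kind == "const":
        return args[0]
    return ops[kind](*(evaluate(graph, a) for a in args))

result = evaluate(graph, "out")   # relu(2 * -3 + 1) = relu(-5)
```

Representing the computation as data, rather than as executed code, is what lets a runtime later partition the same graph across devices.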
Computation is a dataflow graph with tensors
• Edges are N-dimensional arrays: tensors
• [Same diagram: MatMul, Add, Relu, Xent over examples, weights, biases, labels, with tensors flowing along edges]
[J. Dean, 2015]
Computation is a dataflow graph with state
• 'Biases' is a variable
• −= updates biases
• Some ops compute gradients
• [Diagram: the biases variable is updated via Add/Mul ops scaled by the learning rate]
[J. Dean, 2015]
Computation is a distributed dataflow graph
• Devices: processes, machines, GPUs, etc.
• [Diagram: the same update graph partitioned across Device A and Device B]
[J. Dean, 2015]
Computation is a distributed dataflow graph
• Send and Receive nodes handle edges that cross devices
• [Diagram: Send/Recv node pairs inserted on each edge crossing the Device A / Device B boundary]
[J. Dean, 2015]
Experiment Turnaround Time & Research Productivity
• Minutes, hours:
  - Interactive research! Instant gratification!
• 1-4 days:
  - Tolerable
  - Interactivity replaced by running many experiments in parallel
• 1-4 weeks:
  - High-value experiments only
  - Progress stalls
• >1 month:
  - Don't even try
[J. Dean, 2015]
How to Scale?
• Model Parallelism
  - Best way to decrease training time: decrease the step time
  - Many models have lots of inherent parallelism
  - Problem is distributing work so communication doesn't kill you
  - Can do across a single core (SIMD), multiple cores, devices (e.g. GPUs), machines
• Data Parallelism
  - Use multiple model replicas to process different examples at once
  - Speedups depend on the kind of model
    • Dense models: 10-40x speedup from 50 replicas
    • Sparse models: support many more replicas (as many as 1,000)
  - Really important for Google's problems
[J. Dean, 2015]
Model Parallelism
[J. Dean, 2015]
Data Parallelism
• [Diagram: parameter servers hold the shared parameters p; model replicas read p, compute updates Δp on their data shards, and the servers apply p += Δp]
[J. Dean, 2015]
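The p += Δp loop in the diagram can be sketched as a toy, single-process simulation (the real system runs replicas and parameter servers on separate machines): two "replicas" each compute a gradient on their shard of points from y = 2x, and the "server" applies both deltas. All names and the least-squares task are illustrative.

```python
# Toy data parallelism: replicas compute gradient updates (delta_p) on
# their data shards; a parameter server applies p += delta_p.
data = [(x, 2.0 * x) for x in range(1, 9)]   # points on the line y = 2x
shards = [data[0::2], data[1::2]]            # one shard per model replica

p = 0.0                                      # shared parameter (true value: 2)
lr = 0.01
for _ in range(200):
    deltas = []
    for shard in shards:                     # in reality these run in parallel
        # gradient of mean squared error (p*x - y)^2 over the shard
        grad = sum(2 * (p * x - y) * x for x, y in shard) / len(shard)
        deltas.append(-lr * grad)
    for dp in deltas:                        # parameter server: p += delta_p
        p += dp
```

Because both replicas compute their deltas from the same snapshot of p before either update lands, this mimics the (slightly stale) synchronous variant of the scheme.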
Assignment 3
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment3.html
• Analysis of the IRS Form 990 Filings Data
• Sign up for AWS Educate
• Part 1: Run Spark Locally over Index
• Part 2: Run Spark Locally over Subset
• Part 3: Run Spark on AWS over Full Dataset (Year)
• Windows issues:
  - Install Spark separately from conda's pyspark install
  - Run the notebook normally and use findspark to link to the pyspark libraries
Projects
• Reviewing project proposals today and tomorrow
• Looking for scalability aspects: if you can easily complete analyses with pandas or matplotlib, it probably doesn't meet the criteria
• If your project doesn't quite meet the criteria, I will try to offer suggestions to satisfy the requirements
• Continue to work on projects
Graphs as Data

Nodes:
  ID  Atom  Electrons  Protons
  0   N     7          7
  1   C     6          6
  2   S     16         16
  3   C     6          6
  4   N     7          7

Edges:
  ID1  ID2  Bonds
  0    1    1
  1    2    1
  1    3    2
  3    4    1

[Diagram: a molecule with C, N, S, and H atoms drawn as a node-link graph]
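A short sketch of loading node and edge tables like the ones above into an adjacency-list graph (the values are taken from the tables; the dict-of-lists representation is one common choice, not the only one):

```python
# Build an adjacency list from the node and edge tables above.
nodes = {0: "N", 1: "C", 2: "S", 3: "C", 4: "N"}      # id -> atom symbol
edges = [(0, 1, 1), (1, 2, 1), (1, 3, 2), (3, 4, 1)]  # (id1, id2, bonds)

adjacency = {node_id: [] for node_id in nodes}
for a, b, bonds in edges:
    adjacency[a].append((b, bonds))   # chemical bonds are undirected,
    adjacency[b].append((a, bonds))   # so record both directions

degree = {n: len(nbrs) for n, nbrs in adjacency.items()}
```

Separating the node table (attributes) from the edge table (structure) is exactly how graph data is usually stored relationally before being turned into an in-memory graph.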
Social Networks
[P. Butler, 2010]
Graphs
• "Every time someone visits news feed there are on average 1,500 potential stories from friends, people they follow and pages for them to see, and most people don't have enough time to see them all"
  - Lars Backstrom, Facebook Engineer, 2013
Graph Analytics
K. Salama
Frameworks In-depth
SIGMOD 2016 Tutorial
Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets
N. Satish et al.
Evaluating Different Graph Algorithms
• Select a variety of different algorithms:
  - PageRank
  - Breadth-First Search
  - Collaborative Filtering
  - Triangle Counting
[Satish et al., 2014]
Table 1: Diversity in the characteristics of chosen graph algorithms.

  Algorithm             Graph type                             Vertex property            Edge access pattern  Message size (Bytes/edge)  Vertex active?
  PageRank              Directed, unweighted edges             Double (pagerank)          Streaming            Constant (8)               All iterations
  Breadth-First Search  Undirected, unweighted edges           Int (distance)             Random               Constant (4)               Some iterations
  Collaborative Filt.   Bipartite; undirected, weighted edges  Array of Doubles (pu/qv)   Streaming            Constant (8K)              All iterations
  Triangle Counting     Directed, unweighted edges             Long (Ntriangles)          Streaming            Variable (0-10^6)          Non-iterative
where u and v are indices over users and items respectively, R_uv is the rating of the u-th user for the v-th item, and p_u and q_v are dense vectors of length K corresponding to each user and item, respectively. This matrix factorization is typically done iteratively using Stochastic Gradient Descent (SGD) or Gradient Descent (GD). For SGD, each iteration consists of performing the following operation for all ratings in a random order:

  e_uv = R_uv − p_u^T q_v                        (5)
  p*_u = p_u + γ_t [e_uv q_v − λ_p p_u]          (6)
  q*_v = q_v + γ_t [e_uv p_u − λ_q q_v]          (7)
  (p_u, q_v) = (p*_u, q*_v)                      (8)

where γ_t is the step size for the t-th iteration (typically, γ_t = γ_0 s^t, and s is the step size reduction factor, 0 < s ≤ 1). GD performs similar operations but updates all the p_u and q_v once per iteration instead of once per rating.

We implement all these algorithms on all the graph frameworks that we use for comparison, except for those available publicly (details in Section 3).
We implement all these algorithms on all the graph frameworksthat we use for comparison, except for those available publicly (de-tails in Section 3).
2.1 Challenges

The chosen graph algorithms vary widely in their characteristics and, correspondingly, their implementations stress different aspects of the hardware system. Table 1 shows the characteristics of the different graph algorithms. The message passing characteristics are based on those of a vertex programming implementation. There are differences in the structure and properties of the graph itself, vertex properties, access patterns, message sizes, and whether vertices are active in all iterations.
The implications of these characteristics are discussed in Section 3 for those algorithms. For example, Triangle Counting and Collaborative Filtering have total message sizes that are much larger than that of the graph itself, necessitating modifications for Giraph.

We now discuss the graph frameworks considered and how the algorithms map to them.
3. CHOICE OF FRAMEWORKS

The wide variety in graph algorithms that need to be implemented has necessitated the creation of a variety of graph frameworks. There is clearly no consensus on even what programming model gives the best productivity-performance trade-off. In this paper, we consider the following popular graph frameworks: GraphLab, CombBLAS, SociaLite, Galois and Giraph. In addition, we also include hand-optimized code for the algorithms. Each of these frameworks is described below.
GraphLab [21] is a graph framework that provides a sequential, shared memory abstraction for running graph algorithms written as "vertex programs". GraphLab works by letting vertices in a graph read incoming messages, update their values and send messages asynchronously. GraphLab partitions the graph in a 1-D fashion (vertex partitioning). All graph algorithms must be expressed as a program running on a vertex, which can access its own value as well as those of its edges and neighboring vertices. The runtime takes care of scheduling, messaging and synchronization.
The Combinatorial BLAS [11] is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics. CombBLAS treats graphs as sparse matrices and partitions the non-zeros of the matrix (edges in the graph) across nodes. As such, this is the only framework that supports an edge-based partitioning of the graph (also referred to as 2-D partitioning in the paper). Graph computations are expressed as operations among sparse matrices and vectors using arbitrary user-defined semirings.
SociaLite [30, 31] is based on Datalog, a declarative language that can express various graph algorithms succinctly due to its better support for recursive queries compared to SQL [32]. In SociaLite, the graph and its metadata are stored in tables, and declarative rules are written to implement graph algorithms. SociaLite tables are horizontally partitioned, or sharded, to support parallelism. Users can specify how they want to shard a table at table declaration time, and the runtime partitions and distributes the tables accordingly. SociaLite only supports 1-D partitioning.
Giraph [8] is an iterative graph processing system that runs on top of the Hadoop framework. Computation proceeds as a sequence of iterations, called supersteps, in a bulk synchronous parallel (BSP) fashion. Initially, every vertex is active. In each superstep, each active vertex invokes a Compute method provided by the user. The Compute method: (1) receives messages sent to the vertex in the previous superstep, (2) computes using the messages and the vertex and outgoing edge values, which may result in modifications to the values, and (3) may send messages to other vertices. The Compute method does not have direct access to the values of other vertices and their outgoing edges. Inter-vertex communication occurs by sending messages. Computation halts if all vertices have voted to halt and there are no messages in flight. Giraph partitions the vertices in a 1-D fashion (vertex partitioning).
Since graph computations can be very irregular (little locality, varying amount of work per iteration, etc.), Galois [26], a framework developed for handling irregular computations, can also be used for graph processing. Galois is a work-item based parallelization framework that can handle graph computations (and other irregular problems as well). It provides a rich programming model with coordinated and autonomous scheduling, with and without application-defined priorities. Galois provides its own schedulers and scalable data structures, but does not impose a particular partitioning scheme, which may be edge- or vertex-based depending on how the computation is expressed in the framework.
Other than the explained differences, a major differentiator of the frameworks is the communication layer between different hardware nodes. Our native implementation and CombBLAS use MPI, whereas GraphLab and SociaLite use sockets. Giraph uses a network I/O library (Netty), while Galois does not have a multi-node implementation as yet.

Table 2 shows a high-level comparison between the different frameworks under consideration in this paper.
3.1 Example - PageRank

We explain the differences in programming model between the frameworks with a small example, showing how PageRank can be implemented in all the frameworks.
Graph Frameworks
Table 2: High level comparison of the graph frameworks

  Framework  Programming model  Multi-node  Language  Graph partitioning  Communication layer
  Native     N/A                Yes         C/C++     N/A                 MPI
  GraphLab   Vertex             Yes         C++       1-D                 Sockets
  CombBLAS   Sparse matrix      Yes         C++       2-D                 MPI
  SociaLite  Datalog            Yes         Java      1-D                 Sockets
  Galois     Task-based         No          C/C++     N/A                 N/A
  Giraph     Vertex             Yes         Java      1-D                 Netty
[Figure 2: An example directed graph with 4 vertices]
Let us begin by finding the optimal way to execute PageRank. We refer to this hand-optimized version of the algorithm as the native implementation. We observe that the PageRank computation as given by equation (1) performs one multiply-add operation per edge. Representing the graph in Compressed Sparse Row (CSR) format [15], an efficient way of storing a sparse matrix (graph) as an adjacency list, allows the edges to be stored as a single, contiguous array. This allows all accesses to the edge array to be regular and improves memory bandwidth utilization through hardware prefetching. Since each vertex has to access the PageRank values of all the vertices with incoming edges, we store the incoming edges in CSR format (not outgoing as would generally be the case). For the multi-node setup, the graph is partitioned in a 1-D fashion, i.e., partitioning the vertices (along with their corresponding in-edges) among the nodes so that each node has roughly the same number of edges. Each node calculates the local updates and packages the PageRank values to be sent to the other nodes. These messages are then used to calculate the remote updates. More details on the optimizations in PageRank are given in Section 6.
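A minimal sketch of the CSR in-edge layout described above (using the 4-vertex graph of Figure 2, with edges inferred from the adjacency matrix A given later in this section; variable names are illustrative):

```python
# CSR storage of in-edges: for vertex v, neighbors[offsets[v]:offsets[v+1]]
# lists its in-neighbors, packed into one contiguous array.
edges_in = {0: [], 1: [0], 2: [0, 1], 3: [1, 2]}   # vertex -> in-neighbors

offsets = [0]
neighbors = []
for v in sorted(edges_in):
    neighbors.extend(edges_in[v])
    offsets.append(len(neighbors))

def in_neighbors(v):
    return neighbors[offsets[v]:offsets[v + 1]]
```

Because all edges live in one array scanned front to back, per-vertex traversal becomes the streaming access pattern that hardware prefetchers reward, which is the point the paragraph makes.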
Let us see how the PageRank algorithm maps to vertex programming. In this model, we write a program that executes on a single vertex. This program can only access "local" data, i.e., information about the vertices and edges that are directly connected to a given vertex. An example vertex program (in pseudocode) is provided in Algorithm 1. The exact semantics of how messages are packed and received, how much local data can be accessed, global reductions, etc. vary across different implementations. GraphLab [21] and Giraph [8] are both examples of this approach to graph processing.
Algorithm 1: Vertex program for one iteration of PageRank
  begin
    PR ← r
    for msg ∈ incoming messages do
      PR ← PR + (1 − r) * msg
    Send PR / degree to all outgoing edges
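Algorithm 1 can be sketched as runnable Python on the Figure 2 graph (edges inferred from the adjacency matrix A given below; r = 0.15 is an assumed teleport constant, not specified in the paper):

```python
# Vertex-program PageRank (Algorithm 1) on the 4-vertex graph of Figure 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]   # directed edges
out_deg = {v: sum(1 for s, _ in edges if s == v) for v in range(4)}
r = 0.15
pr = {v: 1.0 for v in range(4)}

for _ in range(30):
    inbox = {v: [] for v in range(4)}
    for s, t in edges:
        inbox[t].append(pr[s] / out_deg[s])        # Send PR/degree
    pr = {v: r + (1 - r) * sum(inbox[v]) for v in range(4)}
```

Vertex 0 has no incoming edges, so its value settles at exactly r, and rank accumulates toward vertex 3, which every path feeds into.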
Another distinct approach to processing large graphs is to treat them as sparse matrices, an approach embodied in frameworks like CombBLAS [11]. The computations in a single iteration of PageRank can be expressed in matrix form as follows:

  p^{t+1} = r·1 + (1 − r) A^T p̃^t   (9)

where, for the graph in Figure 2,

  A = ( 0 1 1 0
        0 0 1 1
        0 0 0 1
        0 0 0 0 ),

p^t is the vector of all the PageRank values at iteration t, p̃^t(i) = p^t(i)/d(i), d is the vector of vertex out-degrees, and 1 is a vector of all 1's.

In SociaLite, the PageRank computation is expressed as follows:
  RANK[n](t+1, $SUM(v)) :- v = r.
                        :- INEDGE[n](s), RANK[s](t, v0), OUTDEG[s](d),
                           v = (1 − r) v0 / d.

where the PageRank value of a vertex n at iteration t+1 in the rule head (RANK[n](t+1, $SUM(v))) is declared as the summation of the constant term in the first rule and the normalized PageRank values from neighbor vertices at iteration t in the second rule. The incoming edge table, INEDGE, stores vertices in its first column and neighboring vertices in its second column. INEDGE is declared as a tail-nested table [30], effectively implementing the CSR format used in the native implementation and CombBLAS. Another version of PageRank is implemented from the perspective of a vertex that distributes its PageRank value to its neighboring vertices, which is expressed as follows:

  RANK[n](t+1, $SUM(v)) :- v = r;
                        :- RANK[s](t, v0), OUTEDGE[s](n), OUTDEG[s](d),
                           v = (1 − r) v0 / d.

In this implementation, all join operations in the rule body are locally computed, and there is only a single data transfer for the RANK table update in the rule head. Compared to the previous version, there is one less data transfer, reducing communication overheads. In terms of lock overhead, since we cannot determine which shard of RANK will be updated, locks must be held for every update. Hence, the first version is optimized for a single multi-core machine, while the second is optimized for distributed machines.
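The matrix form of equation (9) can be sketched directly with numpy, using the A matrix given for Figure 2 (r = 0.15 is an assumed teleport constant):

```python
import numpy as np

# One PageRank iteration in matrix form, eq. (9), for the Figure 2 graph.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
r = 0.15
d = A.sum(axis=1)                # out-degrees
p = np.ones(4)

for _ in range(30):
    # p_tilde(i) = p(i)/d(i); vertices with out-degree 0 contribute nothing
    p_tilde = np.divide(p, d, out=np.zeros_like(p), where=d > 0)
    p = r * np.ones(4) + (1 - r) * (A.T @ p_tilde)
```

This is the same fixed point the vertex program reaches; the sparse-matrix view just replaces per-vertex message passing with one matrix-vector product per iteration.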
The PageRank implementation in Galois is very similar to that of GraphLab or Giraph, i.e., the parallelization is over vertices. Each work item in Galois is a vertex program for updating its PageRank. Since Galois runs on only a single node with a shared memory abstraction, each task has access to all of the program's data.
3.2 Mapping Algorithms to Frameworks

As mentioned earlier, the PageRank algorithm can be expressed in a variety of frameworks. In a similar fashion, we also describe the implementations of the other algorithms here:
Breadth-First Search: Vertex programs are straightforward to write for this algorithm. Algorithm 2 shows the pseudocode of the BFS implementation. All distances are initialized to infinity (except the starting vertex, which is initialized to 0). The iterations continue until there are no updates to distances.
Algorithm 2: Vertex program for one iteration of BFS
  begin
    for msg ∈ incoming messages do
      Distance ← min(Distance, msg + 1)
    Send Distance to all outgoing edges
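Algorithm 2 can be sketched as runnable Python on the undirected version of the Figure 2 graph (edge list inferred from the adjacency matrix A; the loop-until-no-updates structure mirrors the iterations-until-convergence described above):

```python
import math

# Vertex-program BFS (Algorithm 2) on the undirected Figure 2 graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
neighbors = {v: set() for v in range(4)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

dist = {v: math.inf for v in range(4)}
dist[0] = 0                                   # starting vertex

changed = True
while changed:                                # iterate until no updates
    changed = False
    for v in range(4):
        for u in neighbors[v]:                # "message" from each neighbor
            if dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                changed = True
```

Unlike PageRank, only vertices whose distance just changed have anything new to send, which is why Table 1 lists BFS vertices as active in only some iterations.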
The CombBLAS implementation performs a matrix-vector multiplication in every iteration. For example, in order to traverse the graph from both vertices 0 and 1 in Figure 2, we only have to do
[Satish et al., 2014]
Real-world datasets
Galois is the only framework that implements SGD (not just GD) in a fashion similar to that of the native implementation. This is possible for two reasons: (1) since partitioning is flexible with Galois, we can apply the n^2 uniform 2-D chunk partitioning, and (2) since Galois is running only on a single node, it can maintain globally consistent state after any update without much performance degradation. Each work-item in Galois performs the SGD update on a single edge (u, v), i.e., it updates both p_u and q_v.

We note that SGD has much better convergence characteristics than Gradient Descent. For the Netflix dataset, given a fixed convergence criterion, SGD converges in about 40x fewer iterations than GD. Of course, like all machine learning algorithms, both SGD and GD have parameters that affect convergence (learning rate, step size reduction and so forth). We did a coarse sweep over these parameters to obtain best convergence.
4. EXPERIMENTAL SETUP

We now describe our experimental setup with details about our frameworks, the data sets used and our experimental platform.
4.1 Choice of Datasets

We use a mix of both real-world datasets and data generators for synthetic data in our experiments. Both classes of datasets ideally follow a power law (Zipf) distribution. For the PageRank and Triangle Counting problems we use directed graphs, whereas for BFS and Collaborative Filtering we use undirected graphs. Collaborative Filtering requires edge-weighted, bipartite graphs that represent the ratings matrix.

4.1.1 Real-world Datasets
Table 3: Real-world and largest synthetic datasets

  Dataset               # Vertices                          # Edges                 Brief Description
  Facebook [34]         2,937,612                           41,919,708              Facebook user interaction graph
  Wikipedia [14]        3,566,908                           84,751,827              Wikipedia link graph
  LiveJournal [14]      4,847,571                           85,702,475              LiveJournal follower graph
  Netflix [9]           480,189 users, 17,770 movies        99,072,112 ratings      Netflix Prize ratings
  Twitter [20]          61,578,415                          1,468,365,182           Twitter follower graph
  Yahoo Music [7]       1,000,990 users, 624,961 items      252,800,275 ratings     Yahoo! KDDCup 2011 music ratings
  Synthetic
    Graph500 [23]       536,870,912                         8,589,926,431           Described in Section 4
  Synthetic Collab.
    Filtering           63,367,472 users, 1,342,176 items   16,742,847,256 ratings  Described in Section 4
Table 3 provides details on the real-world datasets we used. The Facebook, Wikipedia, LiveJournal and Twitter graphs were used to run PageRank, BFS and Triangle Counting, while the Netflix and Yahoo Music datasets were used to run Collaborative Filtering.
The major issue with the real-world datasets is primarily their small size. While these datasets are useful for algorithmic exploration, they are not really scalable to multiple nodes. In particular, the Facebook, Wikipedia, LiveJournal and Netflix datasets were small enough to fit on a single machine. We do have two large real-world datasets: Twitter and Yahoo Music. These were run on 4 nodes, except for Triangle Counting on Twitter, which required 16 nodes to complete in a reasonable amount of time. Even though the Twitter graph itself can fit in fewer nodes, as mentioned in Section 3, the total message size is much larger than the graph and needs more resources. To provide a sense of the size of these datasets, the largest dataset, Twitter, is just over 30 GB in size. Although we do have a few large examples, developing data generators that can produce synthetic data that is representative of real-world workloads is essential to study scalability.
4.1.2 Synthetic Data Generation

We derive our synthetic data generators from the Graph500 RMAT data generators [23]. We generate data to scale the graphs up to 64 nodes. To give a sense of the size of the synthetic graphs, a large-scale 64-node run for PageRank and BFS processes over 8 billion edges. We now describe how we use the Graph500 generator to generate both directed graphs and ratings matrices for our applications.

We use the default RMAT parameters (A = 0.57, B = C = 0.19) used for Graph500 to generate undirected graphs. The RMAT generator only generates a list of edges (with possible duplicates). For PageRank, we assign a direction to all the edges generated. For BFS, depending on the framework used, we either provide it undirected edges or provide two edges in both directions for each generated edge. For Triangle Counting, we use slightly different RMAT parameters (A = 0.45, B = C = 0.15) to reduce the number of triangles in the graph. We then assign a direction to edges going from the vertex with smaller id to the one with larger id to avoid cycles. This is done to make the implementations efficient on all frameworks.
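The RMAT sampling scheme can be sketched as follows: each edge is placed by recursively descending into one quadrant of the adjacency matrix with probabilities (A, B, C, 1−A−B−C). This is a simplified sketch of the Graph500 generator (no deduplication or scrambling), with the default parameters quoted above.

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, rng=random.Random(42)):
    """Sample one edge of a 2^scale-vertex RMAT graph by repeatedly
    picking a quadrant with probabilities (a, b, c, 1-a-b-c)."""
    src = dst = 0
    for _ in range(scale):
        u = rng.random()
        src, dst = src * 2, dst * 2
        if u < a:
            pass                            # top-left quadrant
        elif u < a + b:
            dst += 1                        # top-right
        elif u < a + b + c:
            src += 1                        # bottom-left
        else:
            src, dst = src + 1, dst + 1     # bottom-right
    return src, dst

edges = [rmat_edge(scale=10) for _ in range(1000)]
```

Because the same skewed quadrant probabilities apply at every level of recursion, the resulting degree distribution is heavy-tailed, which is why RMAT graphs approximate the power-law shape of real networks.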
In order to generate large rating matrices, we wrote a rating matrix generator that follows the power-law distribution of the Netflix dataset. To achieve this, we start with graphs generated by the Graph500 graph generator. Through experimentation, we found that RMAT parameters of A = 0.40 and B = C = 0.22 generate degree distributions whose tail is reasonably close to that of the Netflix dataset. Finally, we post-processed the graphs to remove all vertices with degree < 5 from the graph. This yielded graphs that closely follow the degree distributions of the Netflix dataset. In order to convert these Graph500 graphs into bipartite graphs of N_users × N_movies (items are named movies in the case of the Netflix data), we first chunk the columns of the Graph500 matrix into chunks of size N_movies. We then "fold" the matrix by performing a logical "or" of these chunks.
Given that the power law distribution is what differentiates a real-world dataset, we believe our data generator is much more representative of real collaborative filtering workloads than other data generators such as that of [16], which generates data by sampling uniformly, matching the expected number of non-zeros overall but not following a power law distribution.
For our experiments, we generate synthetic data of up to 16 billion ratings for 64 million users × 800K movies. For scaling, we generate about 256 million ratings per node, or 16 billion ratings for 64 nodes.
4.2 Framework Versions

We used GraphLab v2.2 [3] for this comparison. We used the CombBLAS v1.3 code from [1]. We obtained the SociaLite code, with additional optimizations over the results in [30], from the authors. For Giraph, we used release 1.1.0 from [8]. Finally, we obtained Galois v2.2.0 from [2].

We implemented all the algorithms on these frameworks wherever a public version did not exist.
4.3 Experimental Platform

We run the benchmarks on an Intel Xeon CPU E5-2697 based system. Each node has 64 GB of DRAM and 24 cores supporting 2-way Simultaneous Multi-Threading, each running at 2.7 GHz. The nodes are connected with a Mellanox Infiniband
[Satish et al., 2014]
Results (Run Time)
[Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are small enough to run on a single node, comparing Native, CombBLAS, GraphLab, SociaLite, Giraph and Galois on (a) PageRank (time per iteration), (b) Breadth-First Search (overall time), (c) Collaborative Filtering (time per iteration) and (d) Triangle Counting (overall time). The y-axis represents runtime in log scale, so lower numbers are better.]
FDR interconnect. The cluster runs the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within a node and MPI code to parallelize across nodes in native code. We use the Intel C++ Composer XE 2013 SP1 Compiler and the Intel MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the multiple cores within the node. Moreover, CombBLAS requires the total number of processes to be a square (due to its 2-D partitioning approach). Hence we use 36 MPI processes per node to run on the 48 hardware threads, and we further run on a square number of nodes (1, 2, 4, 9, 16, 36 and 64 nodes). SociaLite uses Java threads and processes for parallelism. Giraph uses the Hadoop framework for parallelism (we run 4 workers per node). Finally, Galois is a single-node framework and uses OpenMP for parallelism.
5. RESULTS

5.1 Native Implementation Bottlenecks

Since different graph frameworks have their own programming models with different implementation trade-offs, it is hard to directly compare these frameworks with each other without a clear reference point. As explained in Section 1, we provide a well-optimized native implementation of these algorithms for both single-node and multi-node systems. Since the native code is optimized, it is easy to see which aspects of the system are stressed by a particular algorithm. Table 4 provides data on the achieved limits of the native implementations on single-node and 4-node systems.
Table 4: Efficiency achieved by native implementations of different algorithms on single and 4 nodes.

  Algorithm        Single node                      4 nodes
                   H/W limitation  Efficiency       H/W limitation  Efficiency
  PageRank         Memory BW       78 GBps (92%)    Network BW      2.3 GBps (42%)
  BFS              Memory BW       64 GBps (74%)    Memory BW       54 GBps (63%)
  Coll. Filtering  Memory BW       47 GBps (54%)    Memory BW       35 GBps (41%)
  Triangle Count.  Memory BW       45 GBps (52%)    Network BW      2.2 GBps (40%)
We find that on both single- and multi-node implementations, the algorithm performance is dependent on either memory or network bandwidth. The efficiencies are generally within 2-2.5x of the ideal results. Given the diversity of bottlenecks within the 4 algorithms, and also between single- and multi-node implementations, we expect that it would be difficult for any one framework to excel at all scales in terms of performance and productivity.
5.2 Single-node Results

We now show the results of running the PageRank, BFS, Collaborative Filtering and Triangle Counting algorithms on the CombBLAS, GraphLab, SociaLite, Giraph and Galois frameworks.

In the following, we only compare time taken per iteration on the various frameworks for Collaborative Filtering and PageRank. As described in Section 3, the native code for Collaborative Filtering implements Stochastic Gradient Descent, which converges much faster than the Gradient Descent implementation in the other frameworks. However, we do not see much difference in performance per iteration between Stochastic Gradient Descent and Gradient Descent in native code. Hence we compare time per iteration here, to separate the fact that the frameworks are not expressive enough to express SGD from other potential performance differences. Similarly, some PageRank implementations differ in whether early convergence is detected for the algorithm, and hence we report time per iteration to normalize for this.
Algorithm         CombBLAS  GraphLab  SociaLite  Giraph  Galois
PageRank              1.9       3.6       2.0      39.0     1.2
BFS                   2.5       9.3       7.3     567.8     1.1
Coll. Filtering       3.5       5.1       5.8      54.4     1.1
Triangle Count.      33.9       3.2       4.7     484.3     2.5

Table 5: Summary of performance differences for single node benchmarks on different frameworks for our applications. Each entry is a slowdown factor from native code, hence lower numbers indicate better performance.
Figures 3(a), 3(b) and 3(d) show the results of running PageRank, BFS, and Triangle Counting, respectively, on our frameworks using the real-world Livejournal, Facebook and Wikipedia datasets described in Section 4.1. We also show the results of running a synthetic scale-free RMAT graph (obtained using the Graph500 data generator). Figure 3(c) shows the performance of our frameworks on the Netflix [9] dataset, as well as a synthetically generated collaborative filtering dataset. For convenience, we also present the geometric mean of this data across datasets in Table 5. This table shows the slowdown of each framework w.r.t. native code.
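The per-framework summary numbers in Table 5 are geometric means of per-dataset slowdowns. A minimal sketch of that aggregation, using made-up slowdown factors rather than the paper's raw per-dataset data:

```python
import math

def geometric_mean(xs):
    # nth root of the product, computed in log space for numerical stability
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# hypothetical per-dataset slowdowns of one framework vs. native code
slowdowns = [1.5, 2.0, 3.0]
print(round(geometric_mean(slowdowns), 2))  # 2.08
```

The geometric mean is the right summary here because slowdowns are ratios: a 2x and a 0.5x result should average out to 1x, which the arithmetic mean would not give.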
We see the following key inferences: (1) Native code, as expected, delivers the best performance, as it is optimized for the underlying architecture. (2) Galois performs better than the other frameworks,
Results (Run Time)
34D. Koop, CIS 602-01, Fall 2017
[Figure 3 omitted: four log-scale bar charts, (a) PageRank, (b) Breadth-First Search, (c) Collaborative Filtering, (d) Triangle Counting, comparing runtimes in seconds for Native, CombBLAS, GraphLab, SociaLite, Giraph and Galois on the Livejournal, Facebook, Wikipedia, Netflix and synthetic datasets.]

Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis represents runtime (in log scale), therefore lower numbers are better.
FDR interconnect. The cluster runs the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the multiple cores within the node. Moreover, CombBLAS requires the total number of processes to be a square (due to their 2D partitioning approach). Hence we use 36 MPI processes per node to run on the 48 hardware threads; and we further run on a square number of nodes (1, 2, 4, 9, 16, 36 and 64 nodes). SociaLite uses Java threads and processes for parallelism. Giraph uses the Hadoop framework for parallelism (we run 4 workers per node). Finally, Galois is a single node framework and uses OpenMP for parallelism.
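The square-process-count constraint above is CombBLAS's (its 2D partitioning arranges processes in a p x p grid); the quick check below is only an illustration of why 36 processes per node on a square node count satisfies it:

```python
import math

def is_perfect_square(n):
    # integer square-root test avoids floating-point rounding issues
    r = math.isqrt(n)
    return r * r == n

# 36 MPI processes per node: with a square node count, the total
# process count stays a perfect square, as CombBLAS requires
for nodes in (1, 4, 9, 16, 36, 64):
    assert is_perfect_square(nodes * 36)
print("all totals are perfect squares")
```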
5. RESULTS
5.1 Native implementation bottlenecks
Since different graph frameworks have their own programming models with different implementation trade-offs, it is hard to directly compare these frameworks with respect to each other without a clear reference point. As explained in Section 1, we provide a well-optimized native implementation of these algorithms for both single- and multi-node systems. Since the native code is optimized, it is easy to see which aspects of the system are stressed by a particular algorithm. Table 4 provides data on the achieved limits of the native implementations on single-node and 4-node systems.
                       Single Node                   4 Nodes
Algorithm        H/W limitation  Efficiency    H/W limitation  Efficiency
PageRank         Memory BW       78 GBps (92%) Network BW      2.3 GBps (42%)
BFS              Memory BW       64 GBps (74%) Memory BW       54 GBps (63%)
Coll. Filtering  Memory BW       47 GBps (54%) Memory BW       35 GBps (41%)
Triangle Count.  Memory BW       45 GBps (52%) Network BW      2.2 GBps (40%)

Table 4: Efficiency achieved by native implementations of different algorithms on single and 4 nodes.
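Efficiency in Table 4 is achieved bandwidth divided by the hardware peak. A one-line sketch; the 85 GBps peak memory bandwidth below is an assumed figure for illustration, not a number from the paper:

```python
def efficiency(achieved, peak):
    # fraction of the hardware peak actually sustained
    return achieved / peak

# PageRank single node: 78 GBps achieved against an assumed ~85 GBps peak
print(f"{efficiency(78, 85):.0%}")  # 92%
```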
We find that on both single- and multi-node implementations, the algorithm performance is dependent on either memory or network bandwidth. The efficiencies are generally within 2-2.5X of the ideal results.
Results (Scalability): Synthetic Graphs
35D. Koop, CIS 602-01, Fall 2017
and is close to native performance (geometric mean of 1.1-1.2X for PageRank, BFS and Collaborative Filtering, and 2.5X for Triangle Counting). (3) Giraph, on the other hand, is 2-3 orders of magnitude slower than native code. (4) CombBLAS and GraphLab perform well on average. CombBLAS is very good for all algorithms except for Triangle Counting, where it ran out of memory for real-world inputs while computing the A^2 matrix product. This is an expressibility problem in CombBLAS. GraphLab is 3-9X off from native code, but performs reasonably consistently across algorithms. (5) SociaLite performance is typically comparable to GraphLab (sometimes slightly better and sometimes slightly worse).
Finally, note that the trends on the synthetic dataset are in line with real-world data, showing that our synthetic generators are effective in modeling real-world data.
5.3 Multi node results
We first show the scaling results of our frameworks on multiple nodes. A major reason for using multiple nodes to process graph data is to store the data in memory across the nodes. Hence a common use case is weak scaling, where the data per node is kept constant (and hence the total dataset size increases with the number of nodes). If we obtain perfect performance scaling, then the runtime should be constant as we increase node count and dataset size. In this study, we include the CombBLAS, GraphLab, SociaLite and Giraph frameworks. Galois is currently only a single node framework, and we hence do not include it here.
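A toy sketch of weak-scaling dataset sizing: the per-node data is fixed, so the total problem grows linearly with the node count, and perfect scaling means the runtime stays flat as both grow together.

```python
def total_edges(edges_per_node, nodes):
    # weak scaling: data per node is constant, total grows with node count
    return edges_per_node * nodes

# e.g. 128M edges/node, as in the PageRank weak-scaling runs
for nodes in (1, 2, 4, 8):
    print(nodes, total_edges(128_000_000, nodes))
```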
Figures 4(a), 4(b), 4(c) and 4(d) show the results of multi node runs on synthetically generated datasets for our benchmarks. The data sizes are chosen so that all frameworks could complete without running out of memory. Figure 5 shows the corresponding performance results for larger real-world graphs. We run each algorithm using one large dataset: we use the Twitter dataset [20] for PageRank, BFS and Triangle Counting, and the Yahoo Music KDDCup 2011 dataset for Collaborative Filtering [7].
Algorithm         CombBLAS  GraphLab  SociaLite  Giraph
PageRank              2.5      12.1       7.9      74.4
BFS                   7.1      29.5      18.9     494.3
Coll. Filtering       3.5       7.1       7.0      87.9
Triangle Count.      13.1       3.6       1.5      54.4

Table 6: Summary of performance differences for multi node benchmarks on different frameworks for our applications. Each entry is a slowdown factor, hence lower numbers indicate better performance.
As a convenient summary of performance, Table 6 shows the geometric mean of the performance differences between our frameworks, combining real-world and synthetic datasets at different scales. The table shows the performance slowdowns of different frameworks for specific algorithms compared to the native code for that algorithm; hence lower numbers are better.
We note the following trends in our multi-node results. (1) There is wide variability in our multi node results; as an example, native code performs anywhere between 2X and more than 560X better than the other frameworks on multi node runs (still up to 30X discounting Giraph runtimes). (2) Giraph performs worst by far among the frameworks and is frequently 2-3 orders of magnitude off from native performance. (3) CombBLAS is competitive for PageRank (geomean of 2.5X off native performance), BFS (7.1X off native) and Collaborative Filtering (3.5X off native). However, it performs poorly on Triangle Counting due to extra computations performed as a result of framework expressibility issues. CombBLAS also runs out of memory for the Twitter dataset, and hence this data point is not plotted. (4) GraphLab performs well for Triangle Counting, due to data structure optimizations performed for this case, namely the
[Figure 4 omitted: four weak-scaling charts over 1-64 nodes, (a) PageRank (128M edges/node), (b) Breadth-First Search (128M undirected edges/node), (c) Collaborative Filtering (250M edges/node), (d) Triangle Counting (32M edges/node), comparing Native, CombBLAS, GraphLab, SociaLite and Giraph; y-axes show runtime in seconds.]

Figure 4: Performance results for different algorithms on large scale synthetic graphs. The y-axis represents runtime in log scale. We perform weak scaling, where the amount of graph data per node is kept constant: (a) 128M edges/node for PageRank, (b) 128M edges/node for BFS, (c) 256M ratings/node for SGD, and (d) 32M edges/node for Triangle Counting. Horizontal lines represent perfect scaling.
Results (Multiple Nodes): Real-world Data
39D. Koop, CIS 602-01, Fall 2017
[Figure 5 omitted: bar chart (log scale, time in seconds) comparing Native, CombBLAS, GraphLab, SociaLite and Giraph on PageRank (Twitter, 4 nodes), BFS (Twitter, 4 nodes), Collaborative Filtering (Yahoo Music, 4 nodes) and Triangle Counting (Twitter, 16 nodes); one off-scale bar is labeled 96746.8 seconds.]

Figure 5: Performance results for large real world graphs run on multiple nodes. We show the results of running (in log scale) PageRank, BFS and Triangle Counting on a subset of Twitter [20], and Collaborative Filtering on the Yahoo Music KDDCup dataset [7].
cuckoo hash data structure that allows for a fast union of neighbor lists, which is the primary operation in the algorithm. (5) GraphLab performance drops off significantly for multi node runs (especially for PageRank) due to network bottlenecks. (6) SociaLite performance is typically between GraphLab and CombBLAS, except for Triangle Counting, where it performs best among our frameworks. It also shows more stable performance at different scales than GraphLab. (7) Native performance is relatively stable on the various benchmarks. The performance noticeably drops for Collaborative Filtering, where network traffic increases with the number of nodes, and the algorithm gradually becomes more network limited. (8) Finally, the trends we see in real-world data are broadly similar to those for synthetic data. In particular, CombBLAS performs best among the non-native frameworks for three of the four algorithms (PageRank, BFS and Collaborative Filtering), while SociaLite performs best for Triangle Counting.
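The neighbor-set operation GraphLab accelerates with its cuckoo hash can be sketched with ordinary hash sets; Python's built-in sets stand in for the cuckoo-hash structure here, so this is an illustration of the technique, not GraphLab's code:

```python
def count_triangles(adj):
    # adj: list of neighbor sets for an undirected graph
    total = 0
    for u in range(len(adj)):
        for v in adj[u]:
            if v > u:
                # common neighbors w with u < v < w count each triangle once;
                # the set intersection is the hash-accelerated step
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

# a 4-clique contains exactly 4 triangles
k4 = [set(range(4)) - {i} for i in range(4)]
print(count_triangles(k4))  # 4
```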
Given the various performance issues we see in the frameworks, we next delve deeper into metrics besides runtime to gain more insight into the frameworks.
5.4 Framework Analysis
We further investigated CPU utilization, memory footprint and network traffic to narrow down the reasons for the observed performance trends. We use the sar/sysstat monitoring tools [4] available in Linux to measure these. The results are summarized in Figure 6. We briefly describe our key findings in this section, and defer a detailed discussion of what can be done to address these problems to Section 6.
Memory footprint: First, note that our dataset sizes are chosen such that the memory footprint of one or more frameworks (usually Giraph) is more than 50% of total memory; hence we cannot increase the dataset size per node. The memory footprint is usually not due to just the input graph size, but due to the large intermediate data structures, e.g. message buffers, that need to be kept. It is possible to avoid buffering entire messages at once (as GraphLab does) and hence improve this behavior of Giraph, as discussed in Section 6. Large memory footprints are also seen in CombBLAS (Triangle Counting), where the problem is framework expressibility.
CPU utilization: The CPU utilization of the various frameworks is high when they are not I/O or network limited. Giraph has especially low CPU utilization across the board. This is because memory limitations restrict the number of workers that can be assigned to a single node to 4 (even though the number of cores per node is 24). Each worker requires more memory, and we cannot add more workers within our memory limit. This limits the utilization to 4/24 ≈ 16%. Our native code is network limited for Triangle Counting (unlike CombBLAS and GraphLab, which are memory bandwidth bound) and has relatively low CPU utilization. PageRank is limited by network traffic for all the frameworks as well.
Peak Network Transfer Rate: The peak network transfer rate when running different algorithms is relatively stable for a given framework, and depends mostly on the communication layer used. Note that Giraph has the lowest peak traffic rate, less than 0.5 gigabytes per second (GBps), and CombBLAS and native code the highest, over 5 GBps. Although SociaLite and GraphLab use similar communication layers, SociaLite achieves about twice the peak rate of GraphLab (discussion in Section 6). In fact, the currently available SociaLite code (from the authors) does indeed have low transfer rates, and we were able to improve this behavior substantially using network traffic analysis, which is an important contribution of this paper.
Data volumes: The amount of data transferred varies considerably by algorithm (as shown in Figure 6), as well as by platform (depending on the compression schemes and data structures used). We show the latter impact in Section 6.
These measurements are very useful for predicting the slowdowns we see with graph frameworks compared to the native implementation. For example, if we look only at the measured network parameters for PageRank to estimate performance differences (network bytes sent / peak network bandwidth) from Figure 6, we can estimate that the frameworks would be 1.75, 9.8, 5.6 and 32.7× slower than native code (assuming all network transfers happen at the peak measured rate). Even this rough estimate is within a factor of 2.5 of the observed performance differences (Table 6). Similar analysis can be done for other algorithms as well. Bandwidth-bound code would need to estimate the number of reads/writes and scale it with the memory footprint.
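The back-of-envelope model above (communication time ≈ bytes sent / peak rate, slowdown ≈ ratio of those times) can be sketched as follows; the numbers are placeholders, not the measured Figure 6 values:

```python
def comm_time(bytes_sent_gb, peak_rate_gbps):
    # lower bound on transfer time if everything moves at the peak rate
    return bytes_sent_gb / peak_rate_gbps

def slowdown_vs_native(fw_gb, fw_rate, native_gb, native_rate):
    # estimated slowdown of a framework relative to native code,
    # considering network transfers only
    return comm_time(fw_gb, fw_rate) / comm_time(native_gb, native_rate)

# placeholder measurements: (GB sent, peak GBps)
print(slowdown_vs_native(100, 2.0, 40, 5.0))  # 6.25
```

Because it ignores compute and memory traffic, this is a lower-bound style estimate, which is why the paper's version lands within a factor of 2.5 of the observed slowdowns rather than matching them exactly.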
In the next section, we discuss the optimizations we perform in the native code, show examples of how such optimizations are also useful in other frameworks, and describe our key recommendations for the frameworks.
6. DISCUSSION

In this section, we discuss the various optimization techniques used to improve the performance of our native code. We then present recommendations on how the frameworks can be optimized to achieve native-code-level performance.
6.1 Native implementations
6.1.1 Key optimization techniques

The key features for performance and scalability that our native code uses are listed below. Note that not every technique has been applied to, or is applicable to, every algorithm; however, we maximize the use of these techniques to obtain the best possible native code.
Data structures: Careful selection of data structures can benefit performance significantly. For example, algorithms like BFS and Triangle Counting can use bit-vectors instead of other data structures for constant-time lookups while minimizing cache misses. In our native BFS and Triangle Counting code, this yields a benefit of slightly over 2×. GraphLab keeps a cuckoo-hash data structure in Triangle Counting for the same reason, resulting in efficient code that runs only 2-3× off native performance.
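A bit-vector of the kind described above can be sketched in a few lines: one bit per vertex gives O(1) set/test while using an 8x-32x smaller footprint than a byte or integer array, which is what reduces cache misses. The class and method names here are illustrative, not the paper's code.

```python
class BitVector:
    """One bit per vertex: constant-time membership with minimal memory."""

    def __init__(self, num_vertices):
        # One byte covers 8 vertices; round up.
        self.bits = bytearray((num_vertices + 7) // 8)

    def set(self, v):
        # Byte index is v // 8, bit index within the byte is v % 8.
        self.bits[v >> 3] |= 1 << (v & 7)

    def test(self, v):
        return (self.bits[v >> 3] >> (v & 7)) & 1 == 1

# Marking visited vertices in a BFS-style traversal:
visited = BitVector(1000)
visited.set(42)
print(visited.test(42), visited.test(43))  # True False
```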
Data Compression: In many cases, the data communicated among nodes consists of the ids of the destination vertices of the edges traversed. Such data has been observed to be compressible using techniques like bit-vectors and delta coding [28]. For BFS and PageRank, this results in a net benefit of about 3.2× and 2.2× respectively (and a corresponding reduction in bytes transferred in Figure 6). GraphLab and CombBLAS perform a limited form of compression that takes
Results: Resource Utilization
40 D. Koop, CIS 602-01, Fall 2017
[Figure 6 charts, panels (a) PageRank, (b) Breadth-First Search, (c) Collaborative Filtering, (d) Triangle Counting: normalized bars (0-100) for CPU utilization, peak network BW, memory footprint, and network bytes sent, for Native, CombBLAS, GraphLab, SociaLite, and Giraph.]
Figure 6: Performance metrics, including CPU utilization, peak achieved network bandwidth, memory footprint, and bytes sent over the network, for a 4-node run of our benchmarks. Metrics are normalized so that a value of 100 on the y-axis corresponds to (1) 100% CPU utilization, (2) 5.5 GB/s/node peak network BW (the network limit), (3) 64 GB of memory/node (the memory capacity), and (4) the bytes sent by Giraph/node for each algorithm (1.4 GB/node for PageRank, 0.73 GB/node for BFS, 37.4 GB/node for Collaborative Filtering, and 13.0 GB/node for Triangle Counting). Higher numbers are better for CPU utilization and peak network traffic, while lower numbers are better for memory footprint and network bytes sent.
advantage of local reductions to avoid repeated communication of the same vertex data to different target vertices on the same node.
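The delta-coding idea mentioned above can be sketched briefly: destination vertex ids in a message are kept sorted, so sending successive differences (which are small) instead of raw ids makes the stream far more compressible, for example with a variable-length byte encoding. This is an illustrative sketch, not the paper's implementation.

```python
def delta_encode(sorted_ids):
    """Replace each sorted vertex id with its gap from the previous id."""
    prev, out = 0, []
    for v in sorted_ids:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Recover the original ids by running a prefix sum over the gaps."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

ids = [5, 9, 12, 40, 41]
print(delta_encode(ids))  # [5, 4, 3, 28, 1]
assert delta_decode(delta_encode(ids)) == ids
```

The gaps (mostly single-digit values here) fit in fewer bits than the raw ids, which is where the 2-3× reduction in transferred bytes for BFS and PageRank comes from.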
Overlap of Computation and Communication: Overlapping computation and communication is possible for many applications where an entire message need not be received before computation can start on the portion that has already arrived. This hides communication latency behind useful computation and has been shown to improve the performance of various optimized implementations [28]. The native codes for BFS, PageRank, and Triangle Counting all benefit by 1.2-2×. Apart from latency hiding, overlapping also allows a very large message to be blocked into multiple smaller ones, lowering the memory footprint needed for buffer storage. This optimization specifically helps Triangle Counting; both GraphLab and the native code perform it, leading to low memory footprints.
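The overlap pattern above can be sketched with a bounded queue: a receiver thread pushes fixed-size chunks as they arrive while a worker processes each chunk immediately, so computation overlaps the remaining transfer and the bounded queue caps buffer memory. The chunking, names, and per-chunk work are illustrative assumptions, not the paper's code.

```python
import queue
import threading

def receiver(chunks, q):
    """Stand-in for network receives: push each chunk as it 'arrives'."""
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # end-of-message sentinel

def worker(q, results):
    """Process chunks as soon as they are available, overlapping the transfer."""
    while True:
        chunk = q.get()
        if chunk is None:
            break
        results.append(sum(chunk))  # stand-in for per-chunk computation

q = queue.Queue(maxsize=4)  # bounded queue caps buffer memory
results = []
msg_chunks = [[1, 2], [3, 4], [5, 6]]

t = threading.Thread(target=worker, args=(q, results))
t.start()
receiver(msg_chunks, q)
t.join()
print(results)  # [3, 7, 11]
```

In the paper's setting the same structure appears at the MPI level (nonblocking receives of message blocks); the point of the sketch is only the pipeline shape, not the transport.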
Message passing mechanisms: Using the right underlying message passing mechanism can boost performance significantly. The native code uses MPI message passing [6] to drive the underlying fabric (an FDR InfiniBand network in our case) for high-bandwidth, low-latency communication among the nodes. Although it is possible to implement message passing with other techniques, such as TCP sockets using IP over IB [33], this suffers from 2.5-3× lower bandwidth than MPI code, as observed in the case of GraphLab. However, if it is not possible to use MPI, we observe that using multiple sockets between a pair of nodes regains up to 2× of this performance, as in SociaLite.
Partitioning schemes: Most frameworks running on multiple nodes partition the graph among the nodes either so that each node owns a subset of vertices (Giraph, SociaLite) or in a 2D fashion where each node stores a subset of edges (CombBLAS). GraphLab performs a more advanced partitioning scheme in which vertices with large degree are replicated across multiple nodes to avoid load imbalance during computation. In line with observations in recent research [25], 2D partitioning as in CombBLAS or advanced 1D partitioning as in GraphLab gives better load balancing, leading to better performance and scalability.
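The two partitioning styles above can be sketched as simple owner functions: 1D assigns each vertex (and its edges) to one node, while 2D tiles the adjacency matrix so edge (u, v) lands on node (row block of u, column block of v). With P = r·c nodes, 2D bounds each node's communication partners to roughly r + c instead of P. Hash-based assignment is shown for simplicity; real frameworks may use range blocks or more elaborate schemes.

```python
def owner_1d(v, num_nodes):
    """1D (vertex) partitioning: vertex v and all its edges live on one node."""
    return v % num_nodes

def owner_2d(u, v, rows, cols):
    """2D (edge) partitioning: edge (u, v) lives on the node at
    (row block of u, column block of v) in a rows x cols node grid."""
    return (u % rows) * cols + (v % cols)

# 8 nodes as a flat 1D array vs. a 2 x 4 grid:
print(owner_1d(10, 8))        # 2
print(owner_2d(10, 7, 2, 4))  # row 0, col 3 -> node 3
```

Note that under 1D partitioning the owner of edge (u, v) depends only on u, so a high-degree vertex concentrates work on one node; the 2D scheme spreads that vertex's edges across a whole row of the grid, which is the load-balancing benefit the paper refers to.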
6.1.2 Impact of the optimization techniques

We now look at each algorithm and discuss the optimizations with the highest impact on it (we do not describe all optimizations due to space limits). Figure 7 shows the impact of various performance optimizations on the native code for PageRank and BFS.

[Figure 7 chart: speedup (0-20×) for PageRank and BFS from successive optimizations: S/W prefetching, + compression, + overlap of comp. and comm., and (for BFS) data structure opt.]

Figure 7: Effect of optimizations performed on native implementations of PageRank and BFS.

The main optimizations come from software prefetch instructions, which help hide the long latency of irregular memory accesses, and from message optimizations (compressing the messages containing edges to be sent to other nodes, and overlapping computation with communication). For BFS, an additional optimization was to use bit-vectors (a data structure optimization) to compactly maintain the set of already-visited vertices [12, 28]. For Triangle Counting, the same data structure optimization (bit-vectors), used for quick constant-time lookups to identify common neighbors of different vertices, gave a speedup of around 2.2×.
For Collaborative Filtering, the key optimizations revolve around efficient parallelization of stochastic gradient descent (SGD). We adopt the diagonal parallelization technique of Gemulla et al. [16] to parallelize SGD both within and across nodes without using locks.
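The diagonal (stratified) scheduling idea can be sketched as follows: tile the rating matrix into P × P blocks, and in sub-epoch s have worker i process block (i, (i + s) mod P). No two blocks in a sub-epoch share a row of user factors or a column of item factors, so the P workers can update their factors without locks. This is a sketch of the scheduling pattern only, under the assumption of a square block grid; the full Gemulla et al. algorithm includes the SGD updates themselves.

```python
def strata(num_workers):
    """Yield, for each sub-epoch, the list of (row_block, col_block)
    pairs that the workers process in parallel."""
    P = num_workers
    for s in range(P):
        # Worker i takes block (i, (i + s) mod P): a shifted diagonal.
        yield [(i, (i + s) % P) for i in range(P)]

for blocks in strata(3):
    print(blocks)
# [(0, 0), (1, 1), (2, 2)]
# [(0, 1), (1, 2), (2, 0)]
# [(0, 2), (1, 0), (2, 1)]
```

Within each printed sub-epoch, every row index and every column index appears exactly once, which is precisely the property that removes the need for locking.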
6.1.3 Testcase: Improving SociaLite and Giraph

We select SociaLite and Giraph to demonstrate the performance improvements possible in the frameworks, using the analysis in Section 5.4 and our understanding of the optimization techniques.

SociaLite: We first observed that the code corresponding to the published SociaLite results (obtained from the authors) exhibited poor peak network performance of about 0.5 GBps (as Giraph does today), since it uses sockets for communication. Although changing SociaLite to use MPI would have been ideal, that is much more difficult due to language and other issues. However, using multiple sockets to communicate between two workers allows much higher peak bandwidths (close to 2 GBps). This, along with minor optimizations such as merging communication data for batch processing and using custom memory allocators to minimize garbage
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
J. Gonzalez
D. Koop, CIS 602-01, Fall 2017