CIS 602-01: Scalable Data Analysis
Spatiotemporal Data & Topology
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017


Page 1

CIS 602-01: Scalable Data Analysis

Spatiotemporal Data & Topology Dr. David Koop

D. Koop, CIS 602-01, Fall 2017

Page 2

Graphs as Data


Nodes:
  ID | Atom | Electrons | Protons
  0  | N    | 7         | 7
  1  | C    | 6         | 6
  2  | S    | 16        | 16
  3  | C    | 6         | 6
  4  | N    | 7         | 7

Edges:
  ID1 | ID2 | Bonds
  0   | 1   | 1
  1   | 2   | 1
  1   | 3   | 2
  3   | 4   | 1

[Figure: structural diagram of the molecule, with C, N, S, and H atoms]
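The node and edge tables above can be joined into an in-memory graph; this is a plain-Python sketch (the dict layout and field names are our choices, not part of the slide):

```python
# Hypothetical sketch: build a graph from the node table (ID, Atom,
# Electrons, Protons) and the edge table (ID1, ID2, Bonds) above.
nodes = [(0, "N", 7, 7), (1, "C", 6, 6), (2, "S", 16, 16),
         (3, "C", 6, 6), (4, "N", 7, 7)]
edges = [(0, 1, 1), (1, 2, 1), (1, 3, 2), (3, 4, 1)]

atoms = {nid: {"atom": a, "electrons": e, "protons": p}
         for nid, a, e, p in nodes}
adjacency = {nid: [] for nid in atoms}
for u, v, bonds in edges:
    adjacency[u].append((v, bonds))   # store the bond count on each edge
    adjacency[v].append((u, bonds))   # chemical bonds are undirected

# e.g. everything bonded to atom 1:
print([(atoms[n]["atom"], b) for n, b in adjacency[1]])  # [('N', 1), ('S', 1), ('C', 2)]
```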

Page 3

What is a Graph?
• In computing, a graph is an abstract data structure that represents a set of objects and their relationships as vertices and edges, and supports a number of graph-related operations
• Objects (nodes): {A,B,C,D}
• Relationships (edges): {(D,B),(D,A),(B,C),(B,A),(C,A)}
• Operation: shortest path from D to A


[K. Salama, 2016]

[Figure: example graph with vertices A, B, C, D]
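The slide's example can be sketched in plain Python: store the edge set as an adjacency list and answer the shortest-path operation with breadth-first search. Treating the edges as undirected is our assumption.

```python
from collections import deque

# Edge set from the slide; adjacency list built in both directions
# (undirected is an assumption -- the slide does not say).
edges = [("D", "B"), ("D", "A"), ("B", "C"), ("B", "A"), ("C", "A")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

def shortest_path(start, goal):
    """Return one shortest path from start to goal (BFS, unweighted edges)."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:                  # walk parent pointers back to start
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

print(shortest_path("D", "A"))  # ['D', 'A'] -- the direct edge
```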

Page 4

Graphs with Properties
• Each vertex or edge may have properties associated with it
• May include identifiers or classes


[neo4j]

(Person {name: 'Tom Hanks', born: 1956}) -[:ACTED_IN {roles: ['Forrest']}]-> (Movie {title: 'Forrest Gump', released: 1994})

(Person {name: 'Robert Zemeckis', born: 1951}) -[:DIRECTED]-> (Movie {title: 'Forrest Gump', released: 1994})
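A minimal sketch of this property graph in plain Python (not the neo4j data model itself): vertices and edges each carry a dict of properties, including a label or relationship type.

```python
# Sketch of the slide's property graph; the dict layout is our choice.
nodes = {
    1: {"label": "Person", "name": "Tom Hanks", "born": 1956},
    2: {"label": "Movie", "title": "Forrest Gump", "released": 1994},
    3: {"label": "Person", "name": "Robert Zemeckis", "born": 1951},
}
edges = [
    {"src": 1, "dst": 2, "type": "ACTED_IN", "roles": ["Forrest"]},
    {"src": 3, "dst": 2, "type": "DIRECTED"},
]

def outgoing(node_id, edge_type=None):
    """Edges leaving node_id, optionally filtered by relationship type."""
    return [e for e in edges
            if e["src"] == node_id
            and (edge_type is None or e["type"] == edge_type)]

# Follow the ACTED_IN relationship from Tom Hanks (node 1).
for e in outgoing(1, "ACTED_IN"):
    print(nodes[e["src"]]["name"], "played", e["roles"],
          "in", nodes[e["dst"]]["title"])
```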

Page 5

Types of Graph Analytics
• Connectivity Analytics:
  - number of vertices/edges, in- and out-degrees of vertices
  - histogram of degrees can be useful in comparing graphs
• Path Analytics: cycles, reachability, shortest path, minimum spanning tree
• Community Analytics: clusters (cohesion and separation)
• Centrality Analytics: degree, vulnerability, PageRank
• Pattern Matching: subgraph isomorphism
  - can use properties
  - useful in fraud/threat detection, social network suggestions


[K. Salama, 2016]
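As a sketch of the connectivity analytics above, the degree counts and degree histogram of a directed edge list can be computed with `Counter`; the edge list here is made up for illustration.

```python
from collections import Counter

# Made-up directed edge list for illustration.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

out_degree = Counter(src for src, _ in edges)   # out-degree per vertex
in_degree = Counter(dst for _, dst in edges)    # in-degree per vertex

# Histogram of out-degrees: how many vertices have each degree value.
# (Vertices with no outgoing edges, like 3 here, do not appear.)
# Comparing such histograms is one cheap way to compare two graphs.
degree_histogram = Counter(out_degree.values())
print(dict(degree_histogram))  # {2: 2, 1: 1}
```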

Page 6

Scalable Graph Analytics
• Each node holds a subset of the graph
• Vertices can communicate with vertices on other nodes if there is an edge between them
• Process vertices in parallel
• Pregel (Google)
  - Uses Bulk Synchronous Parallelism
  - A superstep does update and compute over all vertices and then synchronizes
• Apache Giraph
• GraphX (Spark)

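The Pregel-style superstep loop can be simulated in a single process; this sketch runs a compute phase over all vertices on the previous superstep's messages, then exchanges messages at a barrier, halting when no messages are in flight. The max-value-propagation compute function is an illustrative choice, and real Pregel distributes vertices across machines.

```python
# Single-process sketch of Bulk Synchronous Parallelism (assumed example:
# every vertex converges to the maximum value in its connected component).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # undirected path 0-1-2-3
value = {0: 3, 1: 6, 2: 2, 3: 1}

# Superstep 0: every vertex announces its value to its neighbors.
inbox = {v: [] for v in graph}
for v in graph:
    for n in graph[v]:
        inbox[n].append(value[v])

while any(inbox.values()):               # halt: no messages in flight
    outbox = {v: [] for v in graph}
    for v, msgs in inbox.items():        # compute phase (conceptually parallel)
        new = max(msgs) if msgs else value[v]
        if new > value[v]:               # improved: update and notify neighbors
            value[v] = new
            for n in graph[v]:
                outbox[n].append(new)
    inbox = outbox                       # synchronization barrier

print(value)  # {0: 6, 1: 6, 2: 6, 3: 6}
```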

Page 7

Graph Databases and Graph Frameworks
• Graph Databases represent data as graphs
  - E.g. Neo4j, Titan, Apache Giraph
• Graph Frameworks allow graph operations to be computed efficiently but do not require data to be stored as graphs
  - E.g. Spark GraphX


Page 8

Evaluating Different Graph Algorithms
• Select a variety of different algorithms:
  - PageRank
  - Breadth First Search
  - Collaborative Filtering
  - Triangle Counting


[Satish et al., 2014]

Algorithm               | Graph type                                  | Vertex property               | Edge access pattern | Message size (Bytes/edge) | Vertex active?
PageRank                | Directed, unweighted edges                  | Double (pagerank)             | Streaming           | Constant (8)              | All iterations
Breadth First Search    | Undirected, unweighted edges                | Int (distance)                | Random              | Constant (4)              | Some iterations
Collaborative Filtering | Bipartite graph; undirected, weighted edges | Array of Doubles (p_u or q_v) | Streaming           | Constant (8K)             | All iterations
Triangle Counting       | Directed, unweighted edges                  | Long (N_triangles)            | Streaming           | Variable (0-10^6)         | Non-iterative

Table 1: Diversity in the characteristics of chosen graph algorithms.

where u & v are indices over users and items respectively, R_uv is the rating of the uth user for the vth item, and p_u & q_v are dense vectors of length K corresponding to each user and item, respectively. This matrix factorization is typically done iteratively using Stochastic Gradient Descent (SGD) or Gradient Descent (GD). For SGD, each iteration consists of performing the following operation for all ratings in a random order:

    e_uv = R_uv − p_u^T q_v                      (5)
    p*_u = p_u + γ_t [e_uv q_v − λ_p p_u]        (6)
    q*_v = q_v + γ_t [e_uv p_u − λ_q q_v]        (7)
    (p_u, q_v) = (p*_u, q*_v)                    (8)

where γ_t is the step size for the tth iteration (typically, γ_t = γ_0 s^t and s is the step size reduction factor, 0 < s ≤ 1). GD performs similar operations but updates all the p_u and q_v once per iteration instead of once per rating.
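Equations (5)-(8) translate directly to numpy; in this sketch a single (u, v) rating is updated repeatedly, and the factor length K, the rating, and the hyperparameter values are illustrative choices.

```python
import numpy as np

# SGD update for one (user, item) pair; all concrete values are made up.
rng = np.random.default_rng(0)
K = 4
p_u = rng.normal(size=K)       # user factor vector
q_v = rng.normal(size=K)       # item factor vector
R_uv = 4.0                     # observed rating
gamma, lam = 0.05, 0.01        # step size and regularization

for _ in range(100):
    e_uv = R_uv - p_u @ q_v                          # (5) prediction error
    p_new = p_u + gamma * (e_uv * q_v - lam * p_u)   # (6)
    q_new = q_v + gamma * (e_uv * p_u - lam * q_v)   # (7)
    p_u, q_v = p_new, q_new                          # (8) simultaneous update

print(round(float(p_u @ q_v), 3))   # prediction approaches the rating
```

Note that (6) and (7) both use the old p_u and q_v; per (8) the two factors are swapped in simultaneously, which the two temporaries preserve.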

We implement all these algorithms on all the graph frameworks that we use for comparison, except for those available publicly (details in Section 3).

2.1 Challenges
The chosen graph algorithms vary widely in their characteristics and, correspondingly, their implementations stress different aspects of the hardware system. Table 1 shows the characteristics of the different graph algorithms. The message passing characteristics are based on that of a vertex programming implementation. There are differences from the structure and properties of the graph itself, vertex properties, access patterns, message sizes, and whether vertices are active in all iterations.

The implications of these characteristics are discussed in Section 3 for those algorithms. For example, Triangle counting and Collaborative filtering have total message sizes that are much larger than that of the graph itself, necessitating modifications for Giraph.

We now discuss the graph frameworks considered and how the algorithms map to them.

3. CHOICE OF FRAMEWORKS
The wide variety in graph algorithms that need to be implemented has necessitated the creation of a variety of graph frameworks. There is clearly no consensus on even what programming model gives the best productivity-performance trade-off. In this paper, we consider the following popular graph frameworks - GraphLab, CombBLAS, SociaLite, Galois and Giraph. In addition, we also include hand-optimized code for the algorithms. Each of these frameworks is described below.

GraphLab [21] is a graph framework that provides a sequential, shared memory abstraction for running graph algorithms written as "vertex programs". GraphLab works by letting vertices in a graph read incoming messages, update the values and send messages asynchronously. GraphLab partitions the graph in a 1-D fashion (vertex partitioning). All graph algorithms must be expressed as a program running on a vertex, which can access its own value as well as that of its edges and neighboring vertices. The runtime takes care of scheduling, messaging and synchronization.

The Combinatorial BLAS [11] is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics. CombBLAS treats graphs as sparse matrices and partitions the non-zeros of the matrix (edges in the graph) across nodes. As such, this is the only framework that supports an edge-based partitioning of the graph (also referred to as 2-D partitioning in the paper). Graph computations are expressed as operations among sparse matrices and vectors using arbitrary user-defined semirings.

SociaLite [30, 31] is based on Datalog, a declarative language that can express various graph algorithms succinctly due to its better support for recursive queries compared to SQL [32]. In SociaLite, the graph and its metadata are stored in tables, and declarative rules are written to implement graph algorithms. SociaLite tables are horizontally partitioned, or sharded, to support parallelism. Users can specify how they want to shard a table at table declaration time, and the runtime partitions and distributes the tables accordingly. SociaLite only supports 1-D partitioning.

Giraph [8] is an iterative graph processing system that runs on top of the Hadoop framework. Computation proceeds as a sequence of iterations, called supersteps, in a bulk synchronous parallel (BSP) fashion. Initially, every vertex is active. In each superstep each active vertex invokes a Compute method provided by the user. The Compute method: (1) receives messages sent to the vertex in the previous superstep, (2) computes using the messages, and the vertex and outgoing edge values, which may result in modifications to the values, and (3) may send messages to other vertices. The Compute method does not have direct access to the values of other vertices and their outgoing edges. Inter-vertex communication occurs by sending messages. Computation halts if all vertices have voted to halt and there are no messages in flight. Giraph partitions the vertices in a 1-D fashion (vertex partitioning).

Since graph computations can be very irregular (little locality, varying amount of work per iteration, etc.), Galois [26], a framework developed for handling irregular computations, can also be used for graph processing. Galois is a work-item based parallelization framework that can handle graph computations (and other irregular problems as well). It provides a rich programming model with coordinated and autonomous scheduling, and with and without application-defined priorities. Galois provides its own schedulers and scalable data structures, but does not impose a particular partitioning scheme, which may be edge or vertex based depending on how the computation is expressed in the framework.

Other than the explained differences, a major differentiator of the frameworks is the communication layer between different hardware nodes. Our native implementation and CombBLAS use MPI, whereas GraphLab and SociaLite use sockets. Giraph uses a network I/O library (Netty), while Galois does not have a multi node implementation as yet.

Table 2 shows a high-level comparison between the different frameworks under consideration in this paper.

3.1 Example - PageRank
We explain the differences in programming model between the frameworks with a small example and see how PageRank can be implemented in all the frameworks.

Page 9

Graph Frameworks


Framework | Programming model | Multi node usage | Language | Graph partitioning | Communication layer
Native    | N/A               | Yes              | C/C++    | N/A                | MPI
GraphLab  | Vertex            | Yes              | C++      | 1-D                | Sockets
CombBLAS  | Sparse matrix     | Yes              | C++      | 2-D                | MPI
SociaLite | Datalog           | Yes              | Java     | 1-D                | Sockets
Galois    | Task-based        | No               | C/C++    | N/A                | N/A
Giraph    | Vertex            | Yes              | Java     | 1-D                | Netty

Table 2: High level comparison of the graph frameworks

Figure 2: An example directed graph with 4 vertices (0, 1, 2, 3)

Let us begin by finding the optimal way to execute PageRank. We refer to this hand-optimized version of the algorithm as the native implementation. We observe that the PageRank computation as given by equation (1) performs one multiply-add operation per edge. Representing the graph in a Compressed-Sparse Row (CSR) format [15], an efficient way of storing a sparse matrix (graph) as an adjacency list, allows for the edges to be stored as a single, contiguous array. This allows all the accesses to the edge array to be regular and improves the memory bandwidth utilization through hardware prefetching. Since each vertex has to access the pagerank values of all the vertices with incoming edges, we store the incoming edges in CSR format (not outgoing as would generally be the case). For the multi node setup, the graph is partitioned in a 1-D fashion, i.e. partitioning the vertices (along with corresponding in-edges) among the nodes so that each node has roughly the same number of edges. Each node calculates the local updates and packages the pagerank values to be sent to the other nodes. These messages are then used to calculate the remote updates. More details on the optimizations in pagerank are given in Section 6.

Let us see how the PageRank algorithm maps to vertex programming. In this model, we write a program that executes on a single vertex. This program can only access "local" data, i.e. information about the vertices and edges that are directly connected to a given vertex. An example of a vertex program (in pseudocode) is provided in Algorithm 1. The exact semantics of how the messages are packed and received, how much local data can be accessed, global reductions etc. vary across different implementations. GraphLab [21] and Giraph [8] are both examples of this approach to graph processing.

Algorithm 1: Vertex program for one iteration of PageRank
begin
    PR ← r
    for msg ∈ incoming messages do
        PR ← PR + (1 − r) * msg
    Send PR / degree to all outgoing edges
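Algorithm 1 can be simulated in plain Python on the Figure 2 graph; the superstep loop, the choice of r = 0.15, and the iteration count are our additions (the dangling vertex 3 simply sends nothing).

```python
# Vertex-program PageRank on the Figure 2 graph: 0->1, 0->2, 1->2, 1->3, 2->3.
out_edges = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
r = 0.15                                 # assumed damping constant
pr = {v: 1.0 for v in out_edges}

inbox = {v: [] for v in out_edges}
for _ in range(20):                      # 20 supersteps (assumed; plenty here)
    outbox = {v: [] for v in out_edges}
    for v in out_edges:
        pr[v] = r + (1 - r) * sum(inbox[v])       # compute phase of Algorithm 1
        for n in out_edges[v]:                    # send PR/degree on out-edges
            outbox[n].append(pr[v] / len(out_edges[v]))
    inbox = outbox                                # barrier between supersteps

print({v: round(x, 3) for v, x in pr.items()})
```

Note this follows the slide's un-normalized formulation (no 1/N term), so the values do not sum to 1.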

Another distinct approach to processing large graphs is to treat them as sparse matrices - an approach embodied in frameworks like CombBLAS [11]. The computations in a single iteration of PageRank can be expressed in matrix form as follows:

    p_{t+1} = r·1 + (1 − r) A^T p̃_t    (9)

where, for the graph in Figure 2,

    A = | 0 1 1 0 |
        | 0 0 1 1 |
        | 0 0 0 1 |
        | 0 0 0 0 |

p_t is the vector of all the page rank values at iteration t, p̃_t(i) = p_t(i) / d(i), d is the vector of vertex out-degrees, and 1 is a vector of all 1's.

In SociaLite, the PageRank computation is expressed as follows:

    RANK[n](t + 1, $SUM(v)) :- v = r;
                            :- INEDGE[n](s), RANK[s](t, v0), OUTDEG[s](d),
                               v = (1 − r) · v0 / d.

where the PageRank value of a vertex n at iteration t + 1 in the rule head (RANK[n](t + 1, $SUM(v))) is declared as the summation of the constant term in the first rule, and the normalized PageRank values from neighbor vertices at iteration t in the second rule. The incoming edge table, INEDGE, stores vertices in its first column and neighboring vertices in its second column. INEDGE is declared as a tail-nested table [30], effectively implementing the CSR format used in the native implementation and CombBLAS. Another version of PageRank is implemented from the perspective of a vertex that distributes its PageRank value to its neighboring vertices, which is expressed as follows:

    RANK[n](t + 1, $SUM(v)) :- v = r;
                            :- RANK[s](t, v0), OUTEDGE[s](n), OUTDEG[s](d),
                               v = (1 − r) · v0 / d.

In this implementation, all join operations in the rule body are locally computed, and there is only a single data transfer for the RANK table update in the rule head. Compared to the previous version, there is one less data transfer, reducing communication overheads. In terms of lock overhead, since we cannot determine which shard of RANK will be updated, locks must be held for every update. Hence, the first version is optimized for a single multi-core machine, while the second is optimized for distributed machines.
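Equation (9) can be checked numerically on the Figure 2 graph; the starting vector, r = 0.15, the iteration count, and the guard for the dangling vertex (out-degree 0) are our assumptions.

```python
import numpy as np

# Matrix-form PageRank, equation (9), on the Figure 2 adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
d = A.sum(axis=1)            # out-degrees: [2, 2, 1, 0]
r = 0.15                     # assumed damping constant
p = np.ones(4)               # assumed starting vector

for _ in range(20):
    # p_tilde(i) = p(i)/d(i); dangling vertex 3 contributes 0 (our guard).
    p_tilde = np.divide(p, d, out=np.zeros_like(p), where=d > 0)
    p = r * np.ones(4) + (1 - r) * (A.T @ p_tilde)    # equation (9)

print(np.round(p, 3))
```

The fixed point matches the vertex-program formulation, since equation (9) is the same per-edge multiply-add written as one sparse matrix-vector product.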

The PageRank implementation in Galois is very similar to that of GraphLab or Giraph, i.e. the parallelization is over vertices. Each work item in Galois is a vertex program for updating its pagerank. Since Galois runs on only a single node with a shared memory abstraction, each task has access to all of the program's data.

3.2 Mapping Algorithms to Frameworks
As mentioned earlier, the pagerank algorithm can be expressed in a variety of frameworks. In a similar fashion, we also describe the implementations of the other algorithms here:

Breadth First Search: Vertex programs are straightforward to write for this algorithm. Algorithm 2 shows the pseudocode of the BFS implementation. All distances are initialized to infinity (except the starting vertex, which is initialized to 0). The iterations continue until there are no updates to distances.

Algorithm 2: Vertex program for one iteration of BFS
begin
    for msg ∈ incoming messages do
        Distance ← min(Distance, msg + 1)
    Send Distance to all outgoing edges
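Algorithm 2 can likewise be simulated in plain Python; using an undirected version of the Figure 2 graph and starting from vertex 0 are our choices.

```python
import math

# Vertex-program BFS; messages propagate improved distances only.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
distance = {v: math.inf for v in neighbors}
distance[0] = 0                           # assumed source vertex

inbox = {v: [] for v in neighbors}
for n in neighbors[0]:                    # superstep 0: source announces itself
    inbox[n].append(distance[0])

while any(inbox.values()):                # iterate until no distance changes
    outbox = {v: [] for v in neighbors}
    for v, msgs in inbox.items():
        new = min([distance[v]] + [m + 1 for m in msgs])
        if new < distance[v]:             # improved: propagate to neighbors
            distance[v] = new
            for n in neighbors[v]:
                outbox[n].append(new)
    inbox = outbox

print(distance)  # {0: 0, 1: 1, 2: 1, 3: 2}
```

Sending only on improvement (rather than every superstep, as the pseudocode literally says) is a common optimization; the result is the same.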

The CombBLAS implementation performs matrix-vector multiplication in every iteration. For example, in order to traverse the graph from both vertices 0 and 1 in Figure 2, we only have to do

[Satish et al., 2014]

Page 10

Results (Run Time)


[Figure 3: log-scale runtimes of Native, CombBLAS, GraphLab, SociaLite, Giraph, and Galois for (a) PageRank, (b) Breadth-First Search, (c) Collaborative Filtering, and (d) Triangle counting on the Livejournal, Facebook, Wikipedia, Netflix, and Synthetic datasets]

Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis represents runtime (in log scale), therefore lower numbers are better.

FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler³ and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the multiple cores within the node. Moreover, CombBLAS requires the total number of processes to be a square (due to their 2D partitioning approach). Hence we use 36 MPI processes per node to run on the 48 hardware threads; and we further run on a square number of nodes (1, 2, 4, 9, 16, 36 and 64 nodes). SociaLite uses Java threads and processes for parallelism. Giraph uses the Hadoop framework for parallelism (we run 4 workers per node). Finally, Galois is a single node framework and uses OpenMP for parallelism.

5. RESULTS

5.1 Native implementation bottlenecks
Since different graph frameworks have their own programming models with different implementation trade-offs, it is hard to directly compare these frameworks with respect to each other without a clear reference point. As explained in Section 1, we provide a well-optimized native implementation of these algorithms for both single and multi node systems. Since the native code is optimized, it is easy to see which aspects of the system are stressed by a particular algorithm. Table 4 provides data on the achieved limits of the native implementations on single node and 4 node systems.

Algorithm       | Single node H/W limitation | Single node efficiency | 4 nodes H/W limitation | 4 nodes efficiency
PageRank        | Memory BW                  | 78 GBps (92%)          | Network BW             | 2.3 GBps (42%)
BFS             | Memory BW                  | 64 GBps (74%)          | Memory BW              | 54 GBps (63%)
Coll. Filtering | Memory BW                  | 47 GBps (54%)          | Memory BW              | 35 GBps (41%)
Triangle Count. | Memory BW                  | 45 GBps (52%)          | Network BW             | 2.2 GBps (40%)

Table 4: Efficiency achieved by native implementations of different algorithms on single and 4 nodes.

We find that on both single and multi node implementations, the algorithm performance is dependent on either memory or network bandwidth. The efficiencies are generally within 2-2.5X off the ideal results. Given the diversity of bottlenecks within the 4 algorithms, and also between single and multiple node implementations, we expect that it would be difficult for any one framework to excel at all scales in terms of performance and productivity.

³ Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. Notice revision #20110804.

5.2 Single node results
We now show the results of running the PageRank, BFS, Collaborative Filtering and Triangle Counting algorithms on the CombBLAS, GraphLab, SociaLite, Giraph and Galois frameworks.

In the following, we only compare time taken per iteration on various frameworks for Collaborative Filtering and PageRank. As described in Section 3, the native code for Collaborative Filtering implements Stochastic Gradient Descent, which converges much faster than the Gradient Descent implementation in other frameworks. However, we do not see much difference in performance per iteration between Stochastic Gradient Descent and Gradient Descent in native code. Hence we compare time/iteration here to separate out the fact that the frameworks are not expressive enough to express SGD from other potential performance differences. Similarly, some PageRank implementations differ in whether early convergence is detected for the algorithm, and hence we report time per iteration to normalize for this.

Algorithm       | CombBLAS | GraphLab | SociaLite | Giraph | Galois
PageRank        | 1.9      | 3.6      | 2.0       | 39.0   | 1.2
BFS             | 2.5      | 9.3      | 7.3       | 567.8  | 1.1
Coll. Filtering | 3.5      | 5.1      | 5.8       | 54.4   | 1.1
Triangle Count. | 33.9     | 3.2      | 4.7       | 484.3  | 2.5

Table 5: Summary of performance differences for single node benchmarks on different frameworks for our applications. Each entry is a slowdown factor from native code, hence lower numbers indicate better performance.

Figures 3(a), 3(b) and 3(d) show the results of running PageRank, BFS, and Triangle Counting respectively on the real-world Livejournal, Facebook and Wikipedia datasets described in Section 4.1 on our frameworks. We also show the results of running a synthetic scale-free RMAT graph (obtained using the Graph500 data generator). Figure 3(c) shows the performance of our frameworks on the Netflix [9] dataset, as well as a synthetically generated collaborative filtering dataset. For convenience, we also present the geometric mean of this data across datasets in Table 5. This table shows the slowdowns of each framework w.r.t. native code.

We see the following key inferences: (1) Native code, as expected, delivers the best performance as it is optimized for the underlying architecture. (2) Galois performs better than other frameworks,

[Satish et al., 2014]


Page 12

Graphs are Central to Analytics


[J. Gonzalez, 2014]

[Figure: Wikipedia analytics pipeline - raw Wikipedia XML is parsed into a text table (Title, Body); a hyperlink graph feeds PageRank to produce the top 20 pages; a term-doc graph feeds a topic model (LDA) producing word topics; an editor graph feeds community detection producing user communities; discussion and community-topic tables connect the results]

Page 13

Separate Systems to Support Each View


[J. Gonzalez, 2014]

[Figure: a table view (rows flowing into a result table) and a graph view (a dependency graph processed by Pregel), each supported by a separate system]

Page 14

PageRank on the Live-Journal Graph


[J. Gonzalez, 2014]

Runtime (in seconds, PageRank for 10 iterations):
  Mahout/Hadoop | 1340
  Naïve Spark   |  354
  GraphLab      |   22

GraphLab is 60x faster than Hadoop
GraphLab is 16x faster than Spark

Page 15

Moving between tables and graphs

15D. Koop, CIS 602-01, Fall 2017

[J. Gonzalez, 2014]

[Diagram: each stage of the XML-processing pipeline reads from and writes back to HDFS as data moves between table and graph systems.]

Page 16: CIS 602-01: Scalable Data Analysis

GraphX: Composable views of the same data

16D. Koop, CIS 602-01, Fall 2017

[J. Gonzalez, 2014]

GraphX: a unified representation exposes both a table view and a graph view of the same data. Each view has its own operators that exploit the semantics of the view to achieve efficient execution.

Page 17: CIS 602-01: Scalable Data Analysis

GraphX Example Pipeline Runtimes

17D. Koop, CIS 602-01, Fall 2017

[J. Gonzalez, 2014]

[Bar chart: total runtime in seconds for a Wikipedia pipeline (raw Wikipedia XML on HDFS → Spark preprocess → compute hyperlinks/PageRank/top 20 pages → Spark post-process) across Spark, Giraph + Spark, GraphX, and GraphLab + Spark; reported values include 43, 275, 342, 1492, 605, and 375 seconds.]

Page 18: CIS 602-01: Scalable Data Analysis

Projects
• Reviews complete
• Disappointed in some projects and their scalability challenges
• Focus on projects

18D. Koop, CIS 602-01, Fall 2017

Page 19: CIS 602-01: Scalable Data Analysis

Topological Data Analysis
• G. Carlsson
• Data has shape, blurry view
• Very High-Level Description [Video]

19D. Koop, CIS 602-01, Fall 2017

Page 20: CIS 602-01: Scalable Data Analysis

Persistent Homology
• Video

20D. Koop, CIS 602-01, Fall 2017

Page 21: CIS 602-01: Scalable Data Analysis

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets

F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire

D. Koop, CIS 602-01, Fall 2017

Page 22: CIS 602-01: Scalable Data Analysis

When searching for relationships…
• With enough data, you'll find something:
• "A new study shows that drinking a glass of wine is just as good as spending an hour at the gym" [Fox News, 02/15].
• "A new study shows how sugar might fuel the growth of cancer" [Today, 01/16].
• "A new study shows late night snacking could damage the part of your brain that creates and stores memories" [Fox News, 05/16].
• People who go to saunas are more likely to know that Mike Stonebraker is not a character in "The Simpsons"

22D. Koop, CIS 602-01, Fall 2017

[C. Binnig et al., 2017]

Page 23: CIS 602-01: Scalable Data Analysis

Testing Data Polygamy on Random Data

• Data Polygamy paper talks about the need to check potential relationships

• Issue may be the number of relationships to check

23D. Koop, CIS 602-01, Fall 2017

[C. Binnig et al., 2017]

(a) Random input
# Records: 100
# Attributes: 11
# Datasets: 2
Extreme data prob.: 20%

(b) False discoveries
# Trials: 50
Significance level: 0.05
# Incorrect rejections: 43
# Correct acceptances: 7

Figure 4: Data Polygamy [5] on random extreme points.

though many of which are statistically insignificant. Figures 3a-3d show example spurious recommendations. Suppose the user analyzes the preference of Potato Chips (Cheddar vs. Sour Cream) based on the Workspace Preference using the reference view in Figure 3a, which is already a questionable finding on its own. Still SeeDB recommends three of the top-ranked target views shown in Figures 3b-3d (ranked by the deviation beyond the threshold as in [34]'s example), which are even more questionable and do not hold up in our statistical test. For instance, the recommendation in Figure 3c shows that the disbelief in aliens reverts the trend compared to the reference view, though the correlation between disbelief in aliens and preference of potato chips is insignificant (p-value of 0.59). On the other hand, SeeDB also recommends views based on statistically significant yet questionable correlations, such as the correlation between Saunas and Stonebraker from the abstract, which even passes our post mortem statistical test (p-value of 0.036). Thus, even if the user would perform a statistical test after seeing the visualization, she might wrongly assume that the insight is significant, as she would certainly never consider the risk the visualization recommendation system introduced by searching for an "interesting" visualization.

2.3 Automatic Correlation Finders

As a last class of system, we analyze recent recommendation engines, which do not just suggest visualizations but try to automatically find insights through automatic hypothesis testing. One such system is Data Polygamy [5], which searches for statistically significant correlations in temporal-spatial datasets. Such correlations may exist at a certain time or location. For example, the wind speed may not correlate with the number of taxi trips during the year, but it may when a hurricane strikes. Data Polygamy first identifies extreme data points, then uses the F1 score to measure the relationship strength, and performs a Monte Carlo permutation test to determine the statistical significance given a predefined significance level [5].
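The last step can be sketched as follows. This is a simplified illustration of a Monte Carlo permutation test, not the paper's code: the indicator series, the use of F1 against shuffled indicators, and the helper names are all assumptions for the example.

```python
import numpy as np

def f1(a, b):
    # F1 score of "extreme event" indicator series b against series a
    tp = np.sum(a & b)
    if tp == 0:
        return 0.0
    precision, recall = tp / b.sum(), tp / a.sum()
    return 2 * precision * recall / (precision + recall)

def permutation_test(a, b, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = f1(a, b)
    # null distribution: shuffle b, destroying any temporal alignment
    null = np.array([f1(a, rng.permutation(b)) for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# two identical indicator series: a perfectly aligned relationship
a = np.array([1, 0, 1, 1, 0, 0, 1, 0] * 10, dtype=bool)
obs, p = permutation_test(a, a.copy())
print(obs, p)  # F1 of 1.0, p-value near 1/(n_perm + 1)
```

The p-value is simply the fraction of shuffled datasets whose relationship strength matches or exceeds the observed one; a small value means the alignment is unlikely to be an artifact of chance for this single test.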

Unfortunately, Data Polygamy ignores the problem of multiple comparisons and therefore its method is only sensible for a single compared relationship. Suppose there are two datasets of 5 attributes each, resulting in 25 pairwise relationships to test. With a significance level of 0.05, on average about one such relationship would pass the significance criteria even on random input. Data variety, however, is increasing quickly: the NYC Urban data collection has 228 features on weather monitoring, and over 1,300 data sets in the span of two years have been collected by the government agencies in NYC [5] [13]. Thus recommendation systems without controlling for multiple comparisons are not suitable for real-world datasets.
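The arithmetic above can be checked with a small simulation. This is a sketch under assumed settings (independent Gaussian noise, Pearson correlation tests), not Data Polygamy's pipeline: correlate pure-noise attribute pairs and count how many pass an uncorrected 0.05 test versus a Bonferroni-corrected one.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
m, alpha, n, trials = 25, 0.05, 100, 200  # pairs, level, records, repetitions
naive = bonferroni = 0
for _ in range(trials):
    for _ in range(m):
        x = rng.standard_normal(n)  # pure noise: any "discovery" is false
        y = rng.standard_normal(n)
        r, p = pearsonr(x, y)
        naive += p < alpha           # uncorrected test
        bonferroni += p < alpha / m  # Bonferroni-corrected test

print(f"false discoveries per trial: naive={naive / trials:.2f}, "
      f"bonferroni={bonferroni / trials:.3f}")
```

The uncorrected count hovers around m x alpha = 1.25 false discoveries per trial, matching the "about one" estimate in the text, while the corrected count stays near 0.05 for the whole family of tests.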

We downloaded the code of Data Polygamy and studied the number of false discoveries over random data with randomly introduced extreme data points, as summarized in Figure 4. Each extreme data point was sampled independently with 20% probability from a different uniform distribution than the normal data. With 100 records and 11 attributes per dataset, Data Polygamy found a total of 43 "bogus" relationships in 50 independent trials. Thus, without considering the risk of multiple comparisons, Data Polygamy "can be bad for you"; it is literally an automatic p-hacking system.

2.4 Automatic Model Finding

Finally, systems for automatic model building and tuning in data mining or machine learning (e.g. MLBase [24]) are also victims of the risk of multiple comparisons. To demonstrate the complexity of this problem, suppose that we evaluate a sequence of 20 possible models M1, M2, . . . , M20 for our observed data. We test each model using cross validation on different holdout sets, and accept a model if its estimation (prediction) error satisfies our requirement with significance level 0.05 (i.e., the probability that the model achieved that smaller level of estimation error on random data is bounded by 0.05). However, this also implies that at least one such model on average would pass our criteria even on random data.
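The thought experiment can be simulated directly. This is a sketch with assumed parameters (random binary labels, 100 holdout points, a binomial test against chance accuracy), not MLBase's actual pipeline: each of 20 "models" guesses labels at random, yet across many runs a spurious model frequently clears the 0.05 bar.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(1)
n, n_models, reps = 100, 20, 200  # holdout size, candidate models, repetitions
runs_with_false_pass = 0
for _ in range(reps):
    y = rng.integers(0, 2, n)          # random ground-truth labels
    passed = False
    for _ in range(n_models):
        pred = rng.integers(0, 2, n)   # "model" = random guessing
        acc = int((pred == y).sum())
        # one-sided binomial test: is accuracy better than chance (0.5)?
        if binomtest(acc, n, 0.5, alternative="greater").pvalue < 0.05:
            passed = True
    runs_with_false_pass += passed

# in well over half the runs, some random model "passes" the 0.05 criterion
print(runs_with_false_pass / reps)
```

With 20 independent candidates the chance that at least one passes is roughly 1 - 0.95^20, about 0.6, which is exactly the failure mode the paragraph describes.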

3. QUDE: A SYSTEM TO QUANTIFY THE UNCERTAINTY IN DATA EXPLORATION

As discussed in the previous section, the multi-hypothesis pitfall is a core problem affecting many recent systems for interactive data exploration, recommendations for visualizations and insights, as well as automatic model building. With QUDE (pronounced "cute") we are building the first system to automatically Quantify the Uncertainty in Data Exploration. QUDE is part of Brown's Interactive Data Exploration Stack (BIDES) and consists of a risk assessment engine as well as a user-facing component integrated into Vizdom, BIDES' user interface. While the main focus of QUDE is on the control of the risk of false discoveries due to the testing of multiple hypotheses, QUDE will also be able to detect other risk factors, as explained at the end of this section.

3.1 Controlling the Exploration Risk

When a user is exploring a larger number of hypotheses based on the data, either explicitly, indirectly through visualizations, or automatically through recommendation engines, there is a growing risk of flagging a random (i.e., non "statistically significant") fluctuation in the data as a significant discovery. Any sustainable data exploration system should therefore effectively control the risk of such "false discoveries".

3.1.1 Multi-Hypothesis Evaluation

The risk of false discovery is known as the problem of multiple comparisons, or multi-hypothesis evaluation. Two main fundamental challenges arise when attempting to automatically quantify the risk: (1) the traditional techniques either do not scale well with the number of hypotheses or cannot be used in an interactive environment and (2) in many cases it is not clear which hypothesis is currently being tested through a visualization by the user (i.e., the "user intent"). In the following, we describe various multiple-hypothesis control techniques and how well they work to address the first challenge, whereas in Section 3.1.2 we discuss how we plan to address the user intent challenge.

Family Wise Error Rate (FWER): Traditionally, frequentist methods for multiple hypothesis testing focus on correcting for modest numbers of comparisons. A natural generalization of the significance level to multi-hypothesis testing is the Family Wise Error Rate (FWER), which is the probability of incurring at least one Type I error (i.e., false positive: the null hypothesis is true, but is rejected) in any of the individual tests. The Bonferroni correction [3] was proposed to control FWER for m hypothesis tests at an upper bound α. The Bonferroni correction tests each null hypothesis with significance level α/m. However, this method is too conservative in that the power of the test is too low (i.e., the accepted significance level becomes extremely small) when m is large, resulting in many false negatives. Several methods have been proposed to improve the