Massive MapReduce Matrix Computations & Multicore Graph Algorithms
David F. Gleich, Computer Science, Purdue University
It’s a pleasure … I was an Intel intern in 2005 in the Application Research Lab in Santa Clara, resulting in one of my favorite papers!
Internet Mathematics Vol. 3, No. 3: 257-294
Approximating Personalized PageRank with Minimal Use of Web Graph Data
David Gleich and Marzia Polito
Abstract. In this paper, we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on techniques to improve speed by limiting the amount of web graph data we need to access.
Our algorithms provide both the approximation to the personalized PageRank score as well as guidance in using only the necessary information, and therefore sensibly reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages and prove a theoretical approximation bound for all. Finally, we propose a local, personalized web-search system for a future client system using our algorithms.
1. Introduction and Motivation
To have web search results that are personalized, we claim that there is no need to access data from the whole web. In fact, it is likely that the majority of the webpages are totally unrelated to the interests of any one user.
In the original PageRank paper [Brin and Page 98], Brin and Page proposed a personalized version of the algorithm for the goal of user-specific page ranking. While the PageRank algorithm models a random surfer that teleports everywhere in the web graph, the random surfer in the personalized PageRank Markov chain only teleports to a few pages of personal interest. As a consequence, the personalization vector is usually sparse, and the value of a personalized score will be negligible or zero on most of the web.
Could you run your own search engine and crawl the web to compute your own PageRank vector if you are highly concerned with privacy?
Yes! Theory, experiments, implementation!
Yangyang Hou (Purdue, CS); Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University); James Demmel (UC Berkeley); Joe Ruthruff, Jeremy Templeton (Sandia CA)
Massive MapReduce Matrix Computations
Funded by Sandia National Labs CSAR project.
[Figure: a tall matrix A partitioned into row blocks A1 through A4.]
By 2013(?) all Fortune 500 companies will have a data computer
Data computers I’ve worked with …
Nebula Cluster @ Sandia CA: 2TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost: $150k.
Student Cluster @ Stanford: 3TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost: $30k.
Magellan Cluster @ NERSC: 128GB/core storage, 80 nodes, 640 cores, infiniband.
These systems are good for working with enormous matrix data!
How do you program them?
MapReduce and Hadoop overview
MapReduce in a picture
[Figure: input records feed map tasks that run in parallel; a shuffle, like an MPI all-to-all, routes keys to reduce tasks that also run in parallel.]
Computing a histogram: a simple MapReduce example
Input: Key = ImageId, Value = Pixels
Map(ImageId, Pixels): for each pixel, emit Key = (r,g,b), Value = 1
Reduce(Color, Values): emit Key = Color, Value = sum(Values)
Output: Key = Color, Value = # of pixels
[Figure: maps emit (color, 1) pairs in parallel, the shuffle groups the pairs by color, and reduces sum the counts.]
Why a limited computational model? Data scalability, fault tolerance.
[Figure: data blocks 1 through 5 live on different nodes; map tasks are scheduled on the nodes that hold each block, then a shuffle feeds the reduce tasks.]
The idea: bring the computations to the data. MapReduce can schedule map functions without moving data.
After waiting in the queue for a month, and after 24 hours of finding eigenvalues, one node randomly hiccups.
The last page of a 136-page error dump.
[Figure: a tall-and-skinny matrix A of images from the tinyimages collection.]
Tall-and-skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000).
Used in:
- regression and general linear models with many samples
- block iterative methods, panel factorizations
- simulation data analysis
- big-data SVD/PCA
Scientific simulations as tall-and-skinny matrices
Input parameters s map to the time history of a simulation, f(s), about 100GB each:

f(s) = [ q(x1, t1, s); ...; q(xn, t1, s); q(x1, t2, s); ...; q(xn, t2, s); ...; q(xn, tk, s) ]

that is, the state q at every grid point x1, ..., xn and every time step t1, ..., tk, stacked into one long vector.
The simulation as a vector; the simulation as a matrix A (space by time).
A database of simulations: s1 -> f1, s2 -> f2, ..., sk -> fk.
Stacked as A (space-by-time rows, one parameter per column), the database is a very tall-and-skinny matrix.
A Large Scale Example
Model reduction: nonlinear heat transfer model, 80k nodes, 300 time-steps, 104 basis runs. SVD of a 24m x 104 data matrix. 500x reduction in wall clock time (100x including the SVD).
Constantine & Gleich, ICASSP 2012
PCA of 80,000,000 images
[Figure: A is 80,000,000 images by 1000 pixels. MapReduce phase: zero-mean the rows, then TSQR to get R. Post-processing: SVD of R yields V. Outputs: the first 16 columns of V as images, and the top 100 singular values (principal components).]
Constantine & Gleich, MapReduce 2010.
All that these applications need is tall-and-skinny QR.
Quick review of QR
Current MapReduce algorithms use the normal equations, which can limit numerical accuracy.
QR Factorization
Let A be m x n, m ≥ n, real. Then A = QR, where Q is orthogonal (QᵀQ = I) and R is upper triangular.
QR is block normalization: "normalize a vector" usually generalizes to computing the R in a QR factorization.
Using QR for regression: the least-squares solution of min ||Ax - b|| is given by the solution of Rx = Qᵀb.
The normal-equations route:
A = QR;  AᵀA -> (Cholesky) RᵀR;  Q = AR⁻¹
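To make the normal-equations route concrete, here is a minimal numpy sketch (my addition, not code from the talk) of Cholesky QR. The Gram matrix AᵀA squares the condition number, which is exactly what limits accuracy.

import numpy as np

def cholesky_qr(A):
    # Cholesky QR: A = QR via the normal equations.
    G = A.T @ A                      # Gram matrix: cond(G) = cond(A)^2
    R = np.linalg.cholesky(G).T      # A'A = R'R with R upper triangular
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1} without forming R^{-1}
    return Q, R

A = np.random.randn(100000, 8)
Q, R = cholesky_qr(A)
print(np.linalg.norm(Q.T @ Q - np.eye(8)))  # grows like cond(A)^2 * machine eps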
There are good MPI implementations. Why MapReduce?
Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # compute a local QR and replace the buffered rows with R
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)  # random keys spread rows over reducers
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
Tall-and-skinny matrix storage in MapReduce: A is m x n, m ≫ n. The key is an arbitrary row-id; the value is the 1 x n array for a row. Each submatrix Ai is the input to a map task.
[Figure: A partitioned into row blocks A1 through A4, one per map task.]
Numerical stability was a problem for prior approaches
[Figure: norm(QᵀQ - I) versus condition number (10^5 to 10^20) for AR⁻¹, AR⁻¹ + iterative refinement, and Direct TSQR.]
Prior work: Constantine & Gleich, MapReduce 2010. Direct TSQR: Benson, Gleich, Demmel, submitted.
Previous methods couldn't ensure that the matrix Q was orthogonal.
[Figure: mapper 1 runs serial TSQR on row blocks A1 through A4, emitting R4; mapper 2 does the same for A5 through A8, emitting R8; reducer 1 runs serial TSQR on the collected R factors and emits the final R.]
Algorithm: Data = rows of a matrix. Map = QR factorization of rows. Reduce = QR factorization of rows.
Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010).
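A minimal numpy sketch of the same map/reduce structure (my addition; the real code is the hadoopy listing above): QR each row block, stack the R factors, and QR again. The final R matches a direct QR of A up to row signs.

import numpy as np

def tsqr_R(blocks):
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]  # map: local QR per block
    return np.linalg.qr(np.vstack(Rs), mode='r')      # reduce: QR of the stacked Rs

A = np.random.randn(4000, 8)
R = tsqr_R(np.split(A, 4))
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r'))))  # True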
“Manual reduce” can make it faster by adding a second iteration. This computes only R and not Q; we can get Q via Q = AR⁻¹ with another MapReduce iteration. Use the standard Householder method?
Taking care of business by keeping track of Q
[Figure: each mapper computes a local QR, writing the Qi and Ri to separate files; a single task stacks the Ri, factors them, and produces the small pieces Q11 through Q41; mapper 3 multiplies each local Qi by its piece to form the true Q.]
1. Output local Q and R in separate files.
2. Collect R on one node; compute the small Q for each piece.
3. Distribute the pieces of Q (Q11 through Q41 in the figure) and form the true Q.
A numpy sketch of these three steps follows below.
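Here is a minimal numpy sketch of those three steps (my addition, a serial stand-in for the MapReduce version):

import numpy as np

blocks = [np.random.randn(1000, 8) for _ in range(4)]

# Step 1: local QR per block; keep both factors.
QRs = [np.linalg.qr(B) for B in blocks]

# Step 2: collect the R factors on one node and factor the stack.
Q2, R = np.linalg.qr(np.vstack([R for _, R in QRs]))
Q2s = np.split(Q2, 4)  # the small pieces, one per block

# Step 3: distribute the pieces and form the true Q block by block.
Q = np.vstack([Qi @ Q2i for (Qi, _), Q2i in zip(QRs, Q2s)])

A = np.vstack(blocks)
print(np.linalg.norm(Q @ R - A), np.linalg.norm(Q.T @ Q - np.eye(8)))  # both tiny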
The price is right! Based on a performance model and tests.
[Figure: running times between 500 and 2500 seconds on matrices of size 800M-by-10, 7.5B-by-4, 150M-by-100, and 500M-by-50.]
Direct TSQR is faster than refinement for few columns, and not any slower for many columns.
Experiment on the NERSC Magellan computer: 80 nodes, 640 processors, 80TB disk.
Ongoing work
- Make AR⁻¹ stable with targeted quad-precision arithmetic to get a numerically orthogonal Q. A performance model says it's feasible!
- How to handle more than ~10,000 columns? Some randomized methods?
- Do we need quad-precision for big data? Standard error analysis gives an nε bound for computing a sum; I've seen this with PageRank computations! (A compensated-summation sketch follows below.)
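A minimal sketch (my addition) of the issue and the classic fix: naive accumulation loses roughly nε of accuracy, while compensated (Kahan) summation recovers most of it without quad precision.

import numpy as np

def kahan_sum(xs):
    s = np.float32(0.0)
    c = np.float32(0.0)  # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y  # what the addition s + y actually lost
        s = t
    return s

xs = np.random.rand(10**5).astype(np.float32)
ref = np.sum(xs.astype(np.float64))  # trusted double-precision reference
naive = np.float32(0.0)
for x in xs:
    naive = naive + x
print(abs(naive - ref), abs(kahan_sum(xs) - ref))  # Kahan is far closer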
Assefaw Gebremedhin, Arif Khan, Alex Pothen, Ryan Rossi (Purdue, CS); Mahantesh Halappanavar (PNNL); Chen Greif, David Kurokawa (Univ. British Columbia); Mohsen Bayati, Amin Saberi, Ying Wang, now at Google (Stanford)
Multicore Graph Algorithms
Funded by DOE CSCAPES Institute grant (DE-FC02-08ER25864), NSF CAREER grant 1149756-CCF, and the Center for Adaptive Super Computing Software Multithreaded Architectures (CASS-MT) at PNNL.
[Figure: three CPU-plus-memory sockets in one shared-memory machine.]
Network alignment: What is the best way of matching graph A to B?
[Figure: two small graphs A and B on vertices r, s, t, u, v, w.]
Network alignment
[Slide: a page from a Communications of the ACM review article (May 2012, Vol. 55, No. 5, p. 91) on network alignment. It covers local and global network alignment, notes that these problems are NP-hard as generalizations of (sub)graph isomorphism, and describes heuristic approaches such as the NetworkBLAST local aligner and ILP formulations, with performance comparisons.]
From Sharan and Ideker, Modeling cellular machinery through biological network comparison. Nat. Biotechnol. 24, 4 (Apr. 2006), 427-433.
Network alignment: What is the best way of matching graph A to B using only edges in L?
[Figure: graphs A and B joined by a bipartite set L of candidate match edges.]
Network alignment. Matching? A 1-1 relationship. Best? Highest weight and overlap.
[Figure: a matching through L; two matched pairs whose endpoints share an edge in both A and B form an overlap.]
Our contributions:
- A new belief propagation method (Bayati et al. 2009, 2013) that outperformed state-of-the-art PageRank and optimization-based heuristic methods.
- High-performance C++ implementations (Khan et al. 2012): 40 times faster overall (C++ ~3x, complexity ~2x, threading ~8x); 5-million-edge alignments in ~10 sec.
www.cs.purdue.edu/~dgleich/codes/netalignmc
Each iteration involves:
- Matrix-vector-ish computations with a sparse matrix, e.g., sparse matrix-vector products in a semi-ring, dot products, axpy, etc.
- Bipartite max-weight matching using a different weight vector at each iteration.
There is no "convergence"; run 100-1000 iterations.

Let x[i] be the score for each pair-wise match in L
for i = 1 to ...
    update x[i] to y[i]
    compute a max-weight match with y
    update y[i] to x[i] (using the match in MR)
The methods. Each iteration involves:
- Matrix-vector-ish computations with a sparse matrix, e.g., sparse matrix-vector products in a semi-ring, dot products, axpy, etc.
- Bipartite max-weight matching using a different weight vector at each iteration.

Belief Propagation
Listing 2. A belief-propagation message-passing procedure for network alignment. See the text for a description of othermax and the round heuristic.

1   y(0) = 0, z(0) = 0, d(0) = 0, S(0) = 0
2   for k = 1 to niter
3       F = bound_[0,β][ βS + S(k-1)ᵀ ]              Step 1: compute F
4       d = αw + Fe                                  Step 2: compute d
5       y(k) = d - othermaxcol(z(k-1))               Step 3: othermax
6       z(k) = d - othermaxrow(y(k-1))
7       S(k) = diag(y(k) + z(k) - d) S - F           Step 4: update S
8       (y(k), z(k), S(k)) <- γk (y(k), z(k), S(k))
9                           + (1 - γk)(y(k-1), z(k-1), S(k-1))   Step 5: damping
10      round heuristic(y(k))                        Step 6: matching
11      round heuristic(z(k))
12  end
13  return y(k) or z(k) with the largest objective value
In the message-passing interpretation, the weight vectors are usually called messages as they communicate the "beliefs" of each "agent." In this particular problem, the neighborhood of an agent represents all of the other edges in graph L incident on the same vertex in graph A (1st vector), all edges in L incident on the same vertex in graph B (2nd vector), or the edges in L that are part of an overlap. The message vectors do not generally converge, and thus, the iteration is artificially damped to enforce convergence. We only describe one type of damping. See [13] for other variations.
After each update to the messages, we round the messages to a matching using a bipartite maximum weight matching procedure, and then evaluate the objective function.
We present pseudo-code for the method in Listing 2. This code uses the mildly curious function othermaxrow. Suppose that g is a weight vector on the edges of a bipartite graph L. This means we can index g with the edges of L such that g_{i,i'} is the weight on the edge (i, i') in E_L. The othermaxrow function then computes a new weight for each edge in L:

[othermaxrow(g)]_{i,i'} = bound_[0,∞][ max over (i,k') in E_L, k' != i' of g_{i,k'} ].

This function computes something rather simple. Given a row, replace all non-zeros in that row with the maximum value for the row; except, for the element that is the maximum value, replace it with the second-largest value. The othermaxcol function works on columns instead of rows.
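A small Python sketch (my addition) of othermax on a single row of nonzeros; the bound keeps the result in [0, ∞):

import numpy as np

def othermax_row(vals):
    """Each entry becomes the row max, except the argmax, which gets the second max."""
    out = np.full_like(vals, vals.max())
    i = np.argmax(vals)
    rest = np.delete(vals, i)
    out[i] = rest.max() if rest.size else 0.0  # second-largest (0 if the row has one entry)
    return np.maximum(out, 0.0)                # bound below at 0

print(othermax_row(np.array([3.0, 7.0, 5.0])))  # [7. 5. 7.]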
C. Stopping Criteria
Both algorithms generate a sequence of heuristic weight vectors whose solution quality varies continually. There is no monotonicity in the solution quality, which can vary greatly between iterations. Thus, no simple stopping criterion is possible. Due to the shrinking step length in Klau's method and the artificial damping in BP, there is also no point in running for more than 500-1000 iterations with reasonable choices of these two parameters.
D. Complexity
The complexity of each iteration of Klau's method and the BP method is O(nnz(S) + |E_L| + matching), where O(matching) is the complexity of the bipartite matching in step 5. Let N = |V_A| + |V_B|. Currently, the best known algorithm for computing an optimal edge-weighted matching has complexity O(|E_L| N + N^2 log N) [20]. Practical implementations have complexity O(|E_L| N log N) [21]. The half-approximate matching discussed below has complexity O(|E_L|). Thus, when we replace the exact matching step with approximate matching for our experiments, the complexity of each iteration will be O(nnz(S) + |E_L|).
IV. PARALLEL NETWORK ALIGNMENT IMPLEMENTATIONS
We now consider a shared-memory multi-core implementation of these procedures with OpenMP. All required memory is pre-allocated before the first iteration and there are no dynamic memory allocations. We avoid computing intermediate results whenever possible; see the online codes for details.

A. Matrix computations in both methods
All matrices are sparse and are stored as compressed sparse row arrays. All non-zero patterns and structures remain fixed throughout iterations. We found using simple OpenMP "parallel for" loops faster than using a matrix library such as Intel's Math Kernel Library. This result is due to the simplicity of the matrix computations. For instance, because S and U are structurally symmetric with the same structure, the transposes have the same row pointer and column index arrays, but the value array is permuted. So we compute the permutation, and whenever we need to transpose one of these matrices, we just permute the values array according to the permutation. Since these matrices do not change structure during the algorithm, we can compute the permutation once. Sometimes, such as line 5 of Klau's method or line 3 of BP, we simply use the permutation array to pull elements from appropriate memory locations without any intermediate write.

The matrix S can be highly imbalanced (some rows are empty and others have many non-zeros), and so we found that using a dynamic schedule in OpenMP's "parallel for" construction yielded better performance than a static schedule. After some experimentation, we found that a chunk size of 1000 seemed to produce the best performance for these operations. Indeed, we found this observation to be the case for all operations involving the matrix S. Synchronization only occurs at the end of each "parallel for" loop.

B. Specifics about Klau's method
In the first step of the iteration, we need to solve a bipartite matching problem for each row of the matrix S with weights that change based on U(k). We compute (β/2)S + U(k) - U(k)ᵀ using the permutation trick, and then we parallelize the operation over rows. Each of these matching problems is small because there are only a few non-zeros in each row of S, and so we do not consider using the parallel approximation here. We precompute the maximum memory required for p threads …
The NEW methods. Each iteration involves:
- Matrix-vector-ish computations with a sparse matrix, e.g., sparse matrix-vector products in a semi-ring, dot products, axpy, etc.
- Approximate bipartite max-weight matching is used here instead!
In the new methods, Step 6 becomes approximate matching, run in parallel.
Approximation doesn’t hurt the belief propagation algorithm
… problem from Klau [7] (homo-musm). The first is an alignment between protein interactions in a fly (D. melanogaster) and yeast (S. cerevisiae). The second is an alignment between humans (H. sapiens) and mice (M. musculus). We utilize these problems solely for the instances of a network alignment problem and do not focus on the biological insights suggested. The graph L and associated weights are from the original papers.
C. Ontology alignment
We consider two problems in ontology alignment from [13]. The first is an alignment between the Library of Congress subject headings and Wikipedia categories (lcsh-wiki). While both ontologies have a core hierarchical tree, they also have many cross edges for other types of relationships. Thus we can think of them as general graphs. The second problem is an alignment between the Library of Congress subject headings and its counterpart in the French National Library: Rameau. In both cases, the edges and weights in L are computed via a text-matching of the subject heading strings (and via translated strings in the case of Rameau). These problems are larger than the bioinformatics ones.
VII. NETWORK ALIGNMENT WITH APPROXIMATE MATCHING
In this section we address the question: how does the behavior of Klau's method and the BP method change when we substitute the approximate matching procedure from Section V for the bipartite matching step in each algorithm? Note that we always use exact matching in the first step of Klau's method (Step 1: row match) because the problems in each row tend to be small and we parallelize over rows. Note also that the bipartite matching is much more integral to Klau's method than the BP procedure. For the BP procedure, we only solve a bipartite matching problem to evaluate the quality of an iterate, whereas in Klau's method, the results of the matching determine the update to the Lagrange multipliers U. Put another way, the set of iterates from the BP method is independent of the choice of matching algorithm. At the end of the iteration, each of the methods returns the best heuristic it computed, and we perform one final step of exact maximum weight matching to convert this into the returned matching.

We begin by evaluating the solution quality on synthetic power-law problems. We use α = 1, β = 2 and 1000 iterations of each method. We evaluate each solution by comparison with the identity alignment. Note that the identity alignment, which assigns each vertex in graph A to its mirror image in graph B based on the original graph G, may not be the optimal alignment because the perturbations to the graph could introduce a better solution. This seems to occur because we compute objective values larger than the identity alignment for expected degree > 10. We also study how many of the correct matches each method generates with respect to the identity matching. The results are shown in Figure 2. In the top plot, we show the fraction of the objective from the identity matching achieved (y-axis) as the expected degree of random edges in L varies from 2 to 20 (x-axis). In the bottom plot, we show the fraction of correct matches (y-axis), again as the expected degree varies. Problems with more random edges are more challenging. The figures demonstrate that Klau's method is sensitive to using an approximate matching routine, whereas the BP method with exact and approximate matching are nearly indistinguishable.

[Fig. 2: rounded objective values and fraction of correct matches (y-axes, 0 to 1) versus the expected degree of noise in L (x-axis, 0 to 20) for MR-upper, MR, ApproxMR, BP, and ApproxBP. Alignment with a power-law graph shows the large effect that approximate rounding can have on solutions from Klau's method (MR): with exact rounding it yields the identity matching for all problems, whereas using the approximation results in over a 50% error rate. The results from the BP method with and without approximate matching are indistinguishable; small differences were randomly added to show both lines.]

We also evaluate how the matching weight (wᵀx, plotted on the x-axis) and overlap (xᵀSx/2, plotted on the y-axis) change for a bioinformatics problem (dmela-scere) and an ontology problem (lcsh-wiki) in the upper and lower plots in Figure 3. Again, the BP results with and without approximate matching are virtually indistinguishable. Klau's method, however, produces results that are considerably worse.
Randomly perturb one power-law graph to get A and B; generate L by the true match plus random edges.
BP and ApproxBP are indistinguishable.
The x-axis measures the amount of randomness in L as the average expected degree.
A local dominating edge method for bipartite matching
[Figure: graphs A and B joined by candidate edges L, with one highlighted edge (i, j).]
The method guarantees a 1/2 approximation and a maximal matching, based on work by Preis (1999), Manne and Bisseling (2008), and Halappanavar et al. (2012).
A locally dominating edge is an edge heavier than all neighboring edges. For bipartite problems, work on the smaller side only.
Queue all vertices.
Until the queue is empty, in parallel over vertices:
- Match to the heaviest edge; if there's a conflict, check the winner and find an alternative for the loser.
- Add the endpoints of non-dominating edges back to the queue.
A serial sketch of the dominating-edge idea follows below.
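A minimal serial Python sketch (my addition) of the locally-dominant-edge idea: processing edges heaviest-first makes every edge we take locally dominant among the surviving edges, giving a maximal, 1/2-approximate matching. The parallel version replaces the global sort with the queue-and-conflict scheme above.

def dominating_edge_matching(edges):
    """edges: (u, v, weight) triples of a bipartite graph; returns a matching dict."""
    matched = {}
    for u, v, w in sorted(edges, key=lambda e: -e[2]):  # heaviest first
        if u not in matched and v not in matched:       # edge is locally dominant
            matched[u] = v
            matched[v] = u
    return matched

print(dominating_edge_matching([('a', 'x', 3.0), ('a', 'y', 5.0), ('b', 'x', 4.0)]))
# {'a': 'y', 'y': 'a', 'b': 'x', 'x': 'b'}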
Customized first iteration (with all vertices). Use OpenMP locks to update choices. Use __sync_fetch_and_add for queue updates.
Remaining multi-threading procedures are straightforward.
Standard OpenMP for matrix computations; use schedule=dynamic to handle skew.
We can batch the matching procedures in the BP method for additional parallelism:

for i = 1 to ...
    update x[i] to y[i]
    save y[i] in a buffer
    when the buffer is full, compute the max-weight match for everything in the buffer and save the best
Performance evaluation: (2x4) 10-core Intel E7-8870 processors at 2.4 GHz (80 cores); 16 GB memory per processor (128 GB total).
Scaling study:
1. Thread binding: scattered vs. compact
2. Memory binding: interleaved vs. bind
[Figure: eight CPU-plus-memory sockets in the shared-memory machine.]
Scaling
[Figure: speedup (up to ~25x) versus number of threads (up to 80), with scatter thread binding and interleaved memory.]
BP with no batching on lcsh-rameau, 400 iterations: 1450 seconds on 1 thread, 115 seconds on 40 threads.
Ongoing work
- Better memory handling: numactl and affinity are insufficient for full scaling.
- Better models: these get to be much bigger computations.
- Distributed memory: trying to get an MPI version, looking into GraphLab.
PageRank details
[Figure: a six-node web graph.]

P = [ 1/6 1/2  0   0   0  0
      1/6  0   0  1/3  0  0
      1/6 1/2  0  1/3  0  0
      1/6  0  1/2  0   0  0
      1/6  0  1/2 1/3  0  1
      1/6  0   0   0   1  0 ],   Pij ≥ 0, eᵀP = eᵀ.

"Jump": v = [1/n, ..., 1/n]ᵀ, v ≥ 0, eᵀv = 1.
Markov chain: [αP + (1 - α)veᵀ] x = x has a unique x with xj ≥ 0, eᵀx = 1.
Linear system: (I - αP)x = (1 - α)v.
Dangling nodes are ignored here and patched back to v; algorithms later.
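A minimal numpy sketch (my addition) of solving the linear-system form with the simple Richardson/power iteration, using the six-node P above:

import numpy as np

P = np.array([
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
])

def pagerank(P, v, alpha=0.85, tol=1e-12):
    x = v.copy()
    while True:
        x_next = alpha * (P @ x) + (1 - alpha) * v  # one Richardson sweep
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next

n = P.shape[0]
v = np.full(n, 1.0 / n)  # uniform jump vector
x = pagerank(P, v)
print(x, x.sum())  # nonnegative entries, sums to 1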
PageRank by Google
[Figure: the same six-node graph.]
The model:
1. Follow edges uniformly with probability α, and
2. randomly jump with probability 1 - α; we'll assume everywhere is equally likely.
The places we find the surfer most often are important pages.
PageRank was created by Google to rank webpages.
Other uses for PageRank: what else people use PageRank to do
GeneRank
[Figure: a gene-expression heat map whose rows are gene identifiers.]
Use (I - αGD⁻¹)x = w to find "nearby" important genes.
ProteinRank, ObjectRank, EventRank, IsoRank, clustering (graph partitioning), sports ranking, food webs, centrality, teaching.
Note: conjectured new papers: TweetRank (done, WSDM 2010), WaveRank, BeachRank, PaperRank, UniversityRank, LabRank. I think the last one involves a random scientist!
Morrison et al., GeneRank, 2005.
Which sensitivity?
(I - αP)x = (1 - α)v
Sensitivity to the links: examined and understood.
Sensitivity to the jump: examined, understood, and useful.
Sensitivity to α: less well understood.
Multicore PageRank
… similar story …
- Serialized preprocessing.
- Parallelize the linear algebra via an asynchronous Gauss-Seidel iterative method.
- ~10x scaling on the same (80-core) machine (1M nodes, 15M edges, synthetic).
A serial Gauss-Seidel sketch follows below.
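A minimal serial numpy sketch (my addition) of Gauss-Seidel for (I - αP)x = (1 - α)v; the multicore version applies these in-place updates asynchronously across threads:

import numpy as np

def pagerank_gauss_seidel(P, v, alpha=0.85, sweeps=100):
    n = len(v)
    x = v.copy()
    for _ in range(sweeps):
        for i in range(n):
            # solve row i of (I - alpha P) x = (1 - alpha) v,
            # using the newest values of the other entries
            s = alpha * (P[i] @ x - P[i, i] * x[i])
            x[i] = ((1 - alpha) * v[i] + s) / (1 - alpha * P[i, i])
    return x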
Questions? Papers on my webpage: www.cs.purdue.edu/homes/dgleich
Codes:
github.com/arbenson/mrtsqr
www.cs.purdue.edu/homes/dgleich/codes/netalignmc
github.com/dgleich/prpack