Portable, usable, and efficient sparse matrix–vector multiplication
Albert-Jan Yzelman
Parallel Computing and Big Data
Huawei Technologies France
30th of November, 2016
A. N. Yzelman
Outline
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Usable sparse matrix–vector multiplication > Shared-memory architectures
Shared-memory architectures
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Moore’s Law (Waldrop, Nature vol. 530, 2016)
Hardware trends: core count
Transistor counts continue to double every two years (at least until 2021, cf. Waldrop), yet clock speeds have stalled. Why?
Graphics from Ron Maltiel, Maltiel Consulting (http://www.maltiel-consulting.com/ISSCC-2013-High-Performance-Digital-Trends.html)
M. W. Waldrop, “The chips are down for Moore’s Law”; Nature, Vol. 530, pp. 144–147, 2016.
Hardware trends: bandwidth
CPU speeds stall, but Moore's Law now translates into an increasing number of cores per die, i.e., the effective flop rate of processors still rises.
But what about bandwidth?
Technology   Year          Bandwidth     #cores
EDO          1970s         27 Mbyte/s    1
SDRAM        early 1990s   53 Mbyte/s    1
RDRAM        mid 1990s     1.2 Gbyte/s   1
DDR          2000          1.6 Gbyte/s   1
DDR2         2003          3.2 Gbyte/s   2
DDR3         2007          6.4 Gbyte/s   4
DDR3         2013          11 Gbyte/s    8
DDR4         2015          25 Gbyte/s    14
The effective bandwidth per core has, at best, stalled...
Arithmetic intensity
[Roofline figure: attainable Gflop/s versus arithmetic intensity (flop/byte). At low intensity, performance is capped by the peak memory bandwidth; at high intensity, by the peak floating-point rate, which is higher with vectorization.]
If your computation has enough work per data element, it is compute bound; otherwise, it is bandwidth bound.
If you are bandwidth bound, reducing your memory footprint, e.g., by compression, directly results in faster execution.
(Image courtesy of Prof. Wim Vanroose, UA)
Multi-socket architectures: NUMA
Each socket has local main memory where access is fast. Between sockets, access is slower.
[Figure: CPUs on two sockets, each attached to its own local memory.]
Access times to shared memory depend on the physical location, leading to non-uniform memory access (NUMA).
Dealing with NUMA: distribution types
Implicit distribution, centralised local allocation:
If each processor moves data to the same single memory element, the bandwidth is limited by that of a single memory controller.
Dealing with NUMA: distribution types
Implicit distribution, centralised interleaved allocation:
If each processor moves data from all memory elements, the bandwidth multiplies if accesses are uniformly random.
Dealing with NUMA: distribution types
Explicit distribution, distributed local allocation:
If each processor moves data from and to its own unique memory element, the bandwidth multiplies.
Caches
Divide the main memory (RAM) into lines of size L_S.
[Figure: main-memory lines mapped onto cache lines.]
The ith line in RAM is mapped to the cache line i mod L, where L is the number of available cache lines. Can we do better?
Caches
A smarter cache follows a pre-defined policy instead; for instance, the 'Least Recently Used (LRU)' policy. Example, with the most recently used line on top of the stack:
Req. x1, . . . , x4: (x4, x3, x2, x1) ⇒ Req. x2: (x2, x4, x3, x1) ⇒ Req. x5: (x5, x2, x4, x3); the least recently used line, x1, is evicted.
Caches
Realistic caches combine modulo-mapping and the LRU policy:
[Figure: main-memory (RAM) lines modulo-mapped onto k subcaches; each subcache is an LRU stack.]
k is the number of subcaches; there are L/k LRU stacks.
Caches
Realistic caches are used within multi-level memory hierarchies:
[Figure: CPU with an L1 and an L2 cache in front of the main memory (RAM).]
      Intel Core2 (Q6600)   AMD Phenom II (945e)   Intel Westmere (E7-2830)
L1:   S = 32 kB, k = 8      S = 64 kB, k = 2       S = 256 kB, k = 8
L2:   S = 4 MB, k = 16      S = 512 kB, k = 8      S = 2 MB, k = 8
L3:   -                     S = 6 MB, k = 48       S = 24 MB, k = 24
Caches and multiplication
Dense matrix–vector multiplication
[ a00 a01 a02 a03 ]   [ x0 ]   [ y0 ]
[ a10 a11 a12 a13 ] · [ x1 ] = [ y1 ]
[ a20 a21 a22 a23 ]   [ x2 ]   [ y2 ]
[ a30 a31 a32 a33 ]   [ x3 ]   [ y3 ]
Example with LRU caching and S = 4:
(x0) ⇒ read a00: (a00, x0) ⇒ update y0: (y0, a00, x0) ⇒ read x1: (x1, y0, a00, x0) ⇒ read a01: (a01, x1, y0, a00), pushing x0 to the bottom ⇒ update y0: (y0, a01, x1, a00), evicting x0.
Caches and multiplication: NUMA again
When k, L are larger, we can predict:
lower elements from x are evicted while processing the first row. This causes O(n) cache misses on each of the remaining m − 1 rows.
Fix:
stop processing a row before an element from x would be evicted; first continue with the next rows.
This results in column-wise ‘stripes’ of the dense A:
[Figure: A split into column-wise stripes.]
Caches and multiplication: NUMA again
[Figure: A split into column-wise stripes.]
But now:
elements from the vector y can be prematurely evicted; O(m) cache misses on each block of columns. (Already much better!)
Fix:
stop processing before an element from y is evicted; first do the remaining column blocks.
This is cache-aware blocking.
Caches and multicore
Most architectures employ shared caches.
[Figure: four cores, each with a private 64 kB L1 cache and a private 512 kB L2 cache, sharing a 6 MB L3 cache behind the system interface.]
In BSP: (4, 3 GHz, l, g). Is this a correct view?
Caches and multicore: NUMA
[Figure: four cores with private 32 kB L1 caches; two 4 MB L2 caches, each shared by two cores, behind the system interface.]
In BSP: (4, 2.4 GHz, l, g)... but with non-uniform memory access!
Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models
(Shared-memory) parallel programming models
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Shared-memory programming intro
Suppose x and y are in a shared memory. We calculate an inner product in parallel, using the cyclic distribution.
Input: s, the current processor ID; p, the total number of processors (threads); n, the size of the input vectors.
Output: x^T y
Shared-memory SPMD program with ‘double α;’ globally allocated:
α = 0.0
for i = s to n step p
    α += x_i · y_i
return α

Data race! (For n = p = 2, the output can be x_0·y_0, x_1·y_1, or x_0·y_0 + x_1·y_1.)
Shared-memory programming intro
Shared-memory SPMD program with ‘double α[p];’ globally allocated:
for i = s to n step p
    α_s += x_i · y_i
synchronise
return Σ_{i=0}^{p−1} α_i
False sharing! (processors access and update the same cache lines)
Shared-memory programming intro
Shared-memory SPMD program with ‘double α[8p];’ globally allocated:
for i = s to n step p
    α_{8s} += x_i · y_i
synchronise
return Σ_{i=0}^{p−1} α_{8i}
Inefficient cache use! (All threads access virtually all cache lines associated with x and y: Θ(pn) data movement.)
Shared-memory programming intro
Shared-memory SPMD program with ‘double α[8p];’ globally allocated:
for i = s · ⌈n/p⌉ to (s + 1) · ⌈n/p⌉
    α_{8s} += x_i · y_i
synchronise
return Σ_{i=0}^{p−1} α_{8i}
(Now inefficiency only at boundaries; O(n + p − 1) data movement)
Speedup
Definition (Speedup)
Let T_seq be the sequential running time required for solving a problem. Let T_p be the running time of a parallel algorithm using p processes, solving the same problem. Then the speedup is given by
S = T_seq / T_p.
Scalable in time:
Ideally, S = p;
if we are lucky, S > p;
realistically, 1 ≪ S < p;
if we do very badly, S < 1.
Maximum attainable speedup
Consider a graph G = (V, E) of a given algorithm, e.g.,
Nodes correspond to data; edges indicate which data is combined to generate a certain output.
Question: If we had an infinite number of processors, how fast would we be able to run the algorithm shown in the figure?
[Figure: example computation DAG with nodes such as a, f(a), g(a), h(f(a)), . . . ]
Answer: T_seq = |V| = 9, while the critical path length T_∞ equals 4.
The maximum speedup hence is:
T_seq / T_∞ = 9/4.
What is parallelism?
Definition (Parallelism)
The parallelism of a given algorithm is its maximum attainable speedup:
T_seq / T_∞.
T_∞ is known as the critical path length or the algorithmic span.
This leads to a theoretical upper bound on speedup:
S = T_seq / T_p ≤ T_seq / T_∞.
This type of analysis forms the basis of fine-grained parallel computation.
Fine-grained parallel computing
Decompose a problem into many small tasks that run concurrently (as much as possible). A run-time scheduler assigns tasks to processes.
What is small? Grain-size.
Performance model? Parallelism.
Algorithms can be implemented as graphs, explicitly or implicitly:
Intel: Threading Building Blocks (TBB),
OpenMP,
Intel / MIT / Cilk Arts: Cilk,
Google: Pregel,
. . .
By contrast, BSP computing is coarse-grained.
Cilk
Only two parallel programming primitives:
1 (binary) fork, and
2 (binary) join.
Cilk
Example: calculate x_4 from x_n = x_{n−2} + x_{n−1}, given x_0 = x_1 = 1:
int f( int n ) {
    if( n == 0 || n == 1 ) return 1;
    int x1 = cilk_spawn f( n-1 ); // fork
    int x2 = cilk_spawn f( n-2 ); // fork
    cilk_sync;                    // join
    return x1 + x2;
}

int main() {
    int x4 = f( 4 );
    printf( "x_4 = %d\n", x4 );
    return 0;
}
Cilk
Spawned function calls are assigned to one of the available processes by the Cilk run-time scheduler.
The Cilk scheduler guarantees, under some assumptions on the determinism of the algorithm, that the parallel run time T_p is bounded by
O(T_1/p + T_∞).
Not all run-time schedulers have such guarantees, and Cilk is but one of the many fine-grained parallel programming frameworks.
Is parallelism the way to go?
Example:
Consider the naive Θ(n^2) Fourier transformation; its span is Θ(log n), so its parallelism is Θ(n^2 / log n). Lots of parallelism!
The FFT formulation has work Θ(n log n), also with span Θ(log n), resulting in Θ(n log n / log n) = Θ(n) parallelism. Less parallelism...
Which would you prefer?
Is parallelism the way to go?
Is there a difference between considering T_seq or T_1?
Yes(!)
There may be multiple sequential algorithms to solve the same problem. When comparing, always compare to the best. For parallel sorting:
S_{odd-even sort} = T_seq^{qsort} / T_p^{odd-even sort}.
Is parallelism the way to go?
Are there any other issues that may be overlooked when focusing solely on parallelism?
Overheads
Definition (Overhead)
The overhead of parallel computation is any extra effort expended over the original amount of work T_seq:
T_o = p·T_p − T_seq.
The parallel computation time can be expressed in T_o:
T_p = (T_seq + T_o) / p.
Cilk: T_o = p·T_∞.
BSP: T_o = p · Σ_{i=0}^{N−1} (h_i·g + l) + p · Σ_{i=0}^{N−1} max_s w_i^{(s)} − T_seq.
Data movement, latency costs, and extra computations on top of the bare minimum required should be modelled. This enables parallel algorithm design, instead of simply enabling parallel execution.
MapReduce
Most parallel programming models only consider algorithmic structure: fork-join, parallel for, dataflows, etc. One other is MapReduce, which operates on key-value pairs from
K × V,
with K a set of possible keys and V a set of values. MapReduce defines two operations:
mapper: K × V → P(K × V );
reducer: K × P(V )→ V .
Mapper applies to all key-value pairs; embarrassingly parallel.
Reducer applies to a single key only; global communication.
Shuffling refers to the sorting-by-keys prior to the reduction stage.
Pregel
Consider a graph G = (V, E). Graph algorithms may be phrased as follows:
For each vertex v ∈ V, a thread executes a user-defined SPMD algorithm;
each algorithm consists of successive local compute phases and global communication phases;
during a communication phase, a vertex v can only send messages to N(v), where N(v) is the set of neighbouring vertices of v; i.e., N(v) = {w ∈ V | {v, w} ∈ E}.
MapReduce and Pregel are variants of the BSP algorithm model: a type of fine-grained BSP.
Spark
High-level language for large-scale computing: resilient, scalable, huge uptake; expressive and easy to use:
scala> val A = sc.textFile( "A.txt" ).map( x => x.split( " " ) match {
         case Array( a, b, c ) => ( a.toInt - 1, b.toInt - 1, c.toDouble )
       } ).groupBy( x => x._1 );
A: org.apache.spark.rdd.RDD[(Int, Iterable[(Int, Int, Double)])] = ShuffledRDD[8]...
scala>
Spark is implemented in Scala, runs on the JVM, relies on serialisation, and commonly uses HDFS for distributed and resilient storage.
Ref.: Zaharia, M. (2013). An architecture for fast and general data processing on large clusters. Dissertation, UCB.
Spark
Concepts:
RDDs are fine-grained data distributed by hashing;
Transformations (map, filter, groupBy) are lazy operators;
DAGs thus formed are resolved by actions: reduce, collect, ...
Computations are offloaded as close to the data as possible;
all-to-all data shuffles for communication required by actions.
Bridging HPC and Big Data
Platforms like Spark essentially do PRAM simulation:
automatic mode vs. direct mode; ease-of-use vs. performance
Ref.: Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8).
Both are scalable as long as the shuffle remains well balanced:
"First, [..] we show that RDDs can emulate any distributed system, and will do so efficiently as long as the system tolerates some network latency [..] because, once augmented with fast data sharing, MapReduce can emulate the BSP model of parallel computing, with the main drawback being the latency of each MapReduce step." – from the PhD thesis introducing Spark, page 4.
Ref.: Zaharia, M. (2013). An architecture for fast and general data processing on large clusters. Dissertation, UCB.
Multi-BSP
BSP itself is evolving as well:
Multi-BSP computer = p (subcomputers or processors) + M bytes of local memory + an interconnect.
A total of 4L parameters: (p_0, g_0, l_0, M_0, . . . , p_{L−1}, g_{L−1}, l_{L−1}, M_{L−1}).
memory-aware,
non-uniform!
However,
harder to prove optimality(?)
L. G. Valiant, A bridging model for multi-core computing, 2008, 2011.
Multi-BSP
An example with L = 4 quadlets (p, g, l, M):
Usable sparse matrix–vector multiplication > Application to sparse computing
Application to sparse computing
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Problem setting
Given a sparse m × n matrix A, and corresponding vectors x, y.
How to calculate y = Ax as fast as possible?
How to make the code usable for the 99%?
Figure: Wikipedia link matrix (’07) with on average ≈ 12.6 nonzeroes per row.
Problem setting
Shared-memory central obstacles for SpMV multiplication:
inefficient cache use,
limited memory bandwidth, and
non-uniform memory access (NUMA).
Distributed-memory:
inefficient network use.
Shared-memory and distributed-memory share their objectives:
cache misses == communication volume
Ref.: Yzelman and Bisseling, "Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods", SIAM Journal on Scientific Computing 31(4), pp. 3128–3154 (2009).
Inefficient cache use
Visualisation of the SpMV multiplication Ax = y with nonzeroes processed in row-major order:
Accesses on the input vector are completely unpredictable.
Enhanced cache use: nonzero reorderings
Blocking to cache subvectors, and cache-oblivious traversals.
Other approaches: no blocking (Haase et al.), Morton Z-curves and bisection (Martone et al.), Z-curve within blocks (Buluc et al.), composition of low-level blocking (Vuduc et al.), ...
Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”,IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).
Enhanced cache use: nonzero reorderings
Sequential SpMV multiplication on the Wikipedia '07 link matrix: 345 (CRS), 203 (Hilbert), 245 (blocked Hilbert) ms/mul.
Enhanced cache use: matrix permutations
[Figure: a matrix partitioned into four parts and permuted accordingly.]
(Upper bound on) the number of cache misses: Σ_i (λ_i − 1)
Ref.: Yzelman and Bisseling, "Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods", SIAM Journal on Scientific Computing 31(4), pp. 3128–3154 (2009).
Enhanced cache use: matrix permutations
cache misses ≤ Σ_i (λ_i − 1) = communication volume
Lengauer, T. (1990). Combinatorial algorithms for integrated circuit layout. Springer Science & Business Media.
Catalyurek, U. V., & Aykanat, C. (1999). Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7), 673–693.
Catalyurek, U. V., & Aykanat, C. (2001). A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices. In IPDPS (Vol. 1, p. 118).
Vastenhouw, B., & Bisseling, R. H. (2005). A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1), 67–95.
Bisseling, R. H., & Meesen, W. (2005). Communication balancing in parallel sparse matrix-vector multiplication.Electronic Transactions on Numerical Analysis, 21, 47-65.
Should we program shared-memory as though it were distributed?
Enhanced cache use: matrix permutations
Practical gains:
Figure: the Stanford link matrix (left) and its 20-part reordering (right).
Sequential execution using CRS on Stanford:
18.99 (original), 9.92 (1D), 9.35 (2D) ms/mul.
Ref.: Yzelman and Bisseling, "Two-dimensional cache-oblivious sparse matrix-vector multiplication", Parallel Computing 37(12), pp. 806–819 (2011).
Bandwidth
Exploiting sparsity through computation using only nonzeroes:
i = (0, 0, 1, 1, 2, 2, 2, 3)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a_00, a_04, . . . , a_32)

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}
The coordinate (COO) format: two flops versus five data words.
Θ(3·nz) storage. CRS has lower storage requirements, Θ(2·nz + m), and hence a higher arithmetic intensity.
Efficient bandwidth use
A = [ 4 1 3 0
      0 0 2 3
      1 0 0 2
      7 0 1 1 ]
Bi-directional incremental CRS (BICRS):
V  = [ 7  1  4  1  2  3  3  2  1  1 ]
∆J = [ 0  4  4  1  5  4  5  4  3  1 ]
∆I = [ 3 -1 -2  1 -1  1  1  1 ]
Storage requirements, allowing arbitrary traversals: Θ(2·nz + row jumps + 1).
Ref.: Yzelman and Bisseling, "A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve", Progress in Industrial Mathematics at ECMI 2010, pp. 627–634 (2012).
Efficient bandwidth use
With BICRS you can, distributed or not,
vectorise,
compress,
do blocking,
have arbitrary nonzero or block orders.
Optimised BICRS takes at most 2·nz + m of memory.
Ref.: Buluc, Fineman, Frigo, Gilbert, Leiserson (2009). Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (pp. 233–244). ACM.
Ref.: Yzelman and Bisseling (2009). Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. In SIAM Journal on Scientific Computing 31(4), pp. 3128–3154.
Ref.: Yzelman and Bisseling (2012). A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve. In Progress in Industrial Mathematics at ECMI 2010, pp. 627–634.
Ref.: Yzelman and Roose (2014). High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication. In IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31.
Ref.: Yzelman, A. N. (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.
One-dimensional data placement
Coarse-grain row-wise distribution, compressed, cache-optimised:
explicit allocation of separate matrix parts per core,
explicit allocation of the output vector on the various sockets,
interleaved allocation of the input vector.
Ref.: Yzelman and Roose, “High-level strategies for parallel shared-memory sparse matrix–vector multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
Two-dimensional data placement
Distribute row- and column-wise (individual nonzeroes):
all data allocation is explicit,
inter-process communication minimised by partitioning;
incurs the cost of partitioning.
Ref.: Yzelman and Roose, High-level strategies for parallel shared-memory sparse matrix–vector multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
Ref.: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Results
Sequential CRS on Wikipedia ’07: 472 ms/mul. 40 threads BICRS:
21.3 (1D), 20.7 (2D) ms/mul. Speedup: ≈ 22x.
Average speedup on six large matrices:

                                 2 x 6   4 x 10   8 x 8
  –, 1D fine-grained, CRS∗        4.6      6.8     6.2
  Hilbert, Blocking, 1D, BICRS∗   5.4     19.2    24.6
  Hilbert, Blocking, 2D, BICRS†    −      21.3    30.8
†: uses an updated test set. (Added for reference versus a good 2D algorithm.)
As NUMA increases, interleaved and 1D algorithms lose efficiency.
∗: Yzelman and Roose, High-level strategies for parallel shared-memory sparse matrix–vector multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
†: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Usability
All of the above is available as free software:
http://albert-jan.yzelman.net/software.php#SL
However, there are problems integrating into existing codes:
SPMD (PThreads, MPI, ...) vs. others (OpenMP, Cilk, ...).
Globally allocated vectors versus explicit data allocation.
Conversion between matrix data formats.
Usability
Wish list:
Performance and scalability.
Portable codes and/or APIs: GPUs, x86, ARM, phones, ...
Out-of-core, streaming capabilities, dynamic updates.
User-defined overloaded operations on user-defined data.
Ease of use!
GraphBLAS.org
Interoperability (PThreads + Cilk, MPI + OpenMP, DSLs!)
Bridging HPC and Big Data
Wanted: a bridge between Big Data and HPC
Our take:
Spark I/O via native RDDs and native Scala interfaces;
Rely on serialisation and the JNI to switch to C;
Intercept Spark’s execution model to switch to SPMD;
Set up and enable inter-process RDMA communications.
def createMatrix(
    rdd: RDD[(Int, Iterable[Int], Iterable[Double])],
    P: Int
): SparseMatrix = ...

def multiply(
    sc: org.apache.spark.SparkContext,
    x: DenseVector, A: SparseMatrix, y: DenseVector
): DenseVector = ... //--- this function calls 1D or 2D SpMVs ---//

def toRDD(
    sc: org.apache.spark.SparkContext,
    x: DenseVector
): RDD[(Int, Double)] = ...
DataBSP
We have a shared-memory prototype. Preliminary results:
SpMM multiply, SpMV multiply, and basic vector operations;
one machine learning application.
Cage15, n = 5 154 859, nz = 99 199 551. Using the 1D method:
Note: this is ongoing work; performance and functionality improvements are forthcoming.
Conclusions and future work
Needed for current algorithms:
faster partitioning to enable scalable 2D sparse computations,
integration in practical and extensible libraries (GraphBLAS),
making them interoperable with common use scenarios.
Extend application areas further:
sparse power kernels,
symmetric matrix support,
graph and sparse tensor computations,
support various hardware and execution platforms (Hadoop?).
Thank you!

The basic SpMV multiplication codes are free:
http://albert-jan.yzelman.net/software#SL
Backup slides
Results: cross platform
Cross platform results over 24 matrices:
                     Structured   Unstructured   Average
  Intel Xeon Phi        21.6           8.7        15.2
  2x Ivy Bridge CPU     23.5          14.6        19.0
  NVIDIA K20X GPU       16.7          13.3        15.0

No one solution fits all.
If we must, some generalising statements:
Large structured matrices: GPUs.
Large unstructured matrices: CPUs or GPUs.
Smaller matrices: Xeon Phi or CPUs.
Ref.: Yzelman (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.
BSP sparse matrix–vector multiplication
Variables As, xs, ys are local versions of the global variables A, x, y, distributed according to πA, πx, πy.
1: for j | ∃ aij ≠ 0 ∈ As and πx(j) ≠ s do
2:     get xπx(j),j
3: end for
4: sync {execute fan-out}
5: ys = As xs {local multiplication stage}
6: for i | ∃ aij ∈ As and πy(i) ≠ s do
7:     send (i, ys,i) to πy(i)
8: end for
9: sync {execute fan-in}
10: for all (i, α) received do
11:     add α to ys,i
12: end for
Rob H. Bisseling, “Parallel Scientific Computation”, Oxford Press, 2004.
Multi-BSP SpMV multiplication
SPMD-style Multi-BSP SpMV multiplication:
define process 0 at level −1 as the Multi-BSP root.
let process s at level k have parent t at level k − 1.
define (A−1,0, x−1,0, y−1,0) = (A, x , y), the original input.
variables Ak,s , xk,s , yk,s are local versions of Ak−1,t , xk−1,t , yk−1,t ,
{A, x, y}k−1,t was distributed into p̃k−1 parts,

where p̃k−1 ≥ pk−1 is such that all {A, x, y}k,s fit into Mk bytes.

1: do
2:     for j = 0 to p̃ step p
3:         get {A}k,j from parent
4:     down
5: while(up)
Mandatory input data movement only.
Multi-BSP SpMV multiplication
SPMD-style Multi-BSP SpMV multiplication:
1: do
2:     . . .
3:     for j = 0 to p̃ step p
4:         get {A, x, y}k,j from parent
5:     . . .
6:     if(not down)
7:         compute yk,j = Ak,j xk,j {only executed on leafs}
8:     . . .
9:     put yk,j into parent
10:    . . .
11: while(up)
Mandatory and mixed mandatory/overhead data movement.Minimal required work only.
Multi-BSP SpMV multiplication
SPMD-style Multi-BSP SpMV multiplication:
1: do
2:     ∀j, get separator x̄k,j and initialise ȳk,j iff j mod p = s
3:     for j = 0 to p̃ step p
4:         get {A, x, y}k,j from parent
5:     sync
6:     if(not down)
7:         compute yk,j = Ak,j xk,j {only executed on leafs}
8:     perform fan-in on separator ȳk,j
9:     put yk,j into parent
10:    sync
11:    put ȳk,j into parent and sync
12: while(up)

Mandatory costs plus overhead. Split vectors: {x, y}s versus {x̄, ȳ}s.
Flat partitioning for Multi-BSP
Can we reuse existing partitioning techniques?
1 Partition A = A0 ∪ . . . ∪ Ap−1 with p = Π_{l=0}^{L−1} pl?
No: As, xs, ys may not fit in ML−1.

2 Find the minimal k to partition A into, s.t. every {A, x, y}i fits into ML−1?
Very similar to previous work!

Y. and Bisseling, “Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning”, SISC, 2009.
Y. and Bisseling, “Two-dimensional cache-oblivious sparse matrix–vector multiplication”, Parallel Computing, 2011.

3 Hierarchical partitioning?
A = A0 ∪ . . . ∪ Ak0, Ai = Ai,0 ∪ . . . ∪ Ai,k1, etc.;
solves the assignment issue.

However, all of these do not take the different gl into account!
Hierarchical partitioning

If g0 < 2g1, greedy hierarchical partitioning is suboptimal.

Cutting at the upper level only:

             Upper level   Lower level
  Fan-out        6g0            0
  Fan-in         2g0            0

  Total: 8g0

Cutting per level:

             Upper level   Lower level
  Fan-out         0            6g1
  Fan-in         4g0           2g1

  Total: 4g0 + 8g1   (versus 8g0 previously)

Indeed, 4g0 + 8g1 exceeds 8g0 precisely when g0 < 2g1.
Multi-BSP aware partitioning
Slightly modified V-cycle:
1 coarsen
2 recurse or randomly partition
3 do k steps of HKLFM,
  calculating gains taking g0, . . . , gL−1 into account
4 refine
Claim: if g0 > g1 > g2 . . ., then HKLFM is a local operation.
By enumeration of all possibilities (L = 2). At level-1 refinement:
suppose we move a nonzero aij from As1,s2 to At1,t2 with s1 6= t1:
aij ∈ As1,s2 , aij /∈ As1 : gain is g1 − g0 or 2(g1 − g0).
aij /∈ As1,s2 : gain is 0, g1 − g0, or 2(g1 − g0).
Hence it suffices to perform HKLFM steps on each level separately.
Summary
Differences from flat BSP:
different notion of load balance:
parts must fit into local memory.

non-uniform communication costs:
implies different partitioning techniques.

non-uniform data locality...
with fine-grained distribution.
How does it compare?
ANSI C++11, parallelisation using std::thread,
implementation relies on shared-memory cache coherency
Mondriaan 4.0, medium-grain, symmetric doubly BBD reordering
Global arrays without blocking, nonzero reordering, compression.
         matrix       original   p = 1   p = max   Optimal
  2x8    G3 circuit     33.3      26.7    10.5      2.77
  2x8    FS1            83.5      65.3    22.0     10.3
  2x8    cage15        523       387      77.1     29.8
  2x10   G3 circuit     22.7      16.9     9.77     1.73
  2x10   FS1            83.5      65.3    22.0      7.56
  2x10   cage15        341       233      54.7     23.4

All numbers are in ms.
Y. and Bisseling, Cache-oblivious sparse matrix–vector multiplication, SISC 2009
Y. and Roose, High-level strategies for sparse matrix–vector multiplication, IEEE TPDS 2014
Vectorised BICRS
Incorporating vectorisation in the compressed data structure:
Ref.: Yzelman and Roose (2014). Sparse matrix computations on multi-core systems. In Intel European Exascale Labs report 2013, pp. 24–29, Intel.
Ref.: Yzelman (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.