Portable, usable, and efficient sparse matrix–vector multiplication
Albert-Jan Yzelman
Parallel Computing and Big Data
Huawei Technologies France
30th of November, 2016
A. N. Yzelman
Outline
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Usable sparse matrix–vector multiplication > Shared-memory architectures
Shared-memory architectures
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Moore’s Law (Waldrop, Nature vol. 530, 2016)
Hardware trends: core count
Transistor counts continue to double every two years (at least until 2021, cf. Waldrop), yet clock speeds have stalled. Why?
Graphics from Ron Maltiel, Maltiel Consulting (http://www.maltiel-consulting.com/ISSCC-2013-High-Performance-Digital-Trends.html)
M. W. Waldrop, “The chips are down for Moore’s Law”; Nature, Vol. 530, pp. 144–147, 2016.
Hardware trends: bandwidth
CPU speeds stall, but Moore's Law now translates into an increasing number of cores per die, i.e., the effective flop rate of processors still rises.
But what about bandwidth?
Technology   Year          Bandwidth     #cores
EDO          1970s         27 Mbyte/s    1
SDRAM        early 1990s   53 Mbyte/s    1
RDRAM        mid 1990s     1.2 Gbyte/s   1
DDR          2000          1.6 Gbyte/s   1
DDR2         2003          3.2 Gbyte/s   2
DDR3         2007          6.4 Gbyte/s   4
DDR3         2013          11 Gbyte/s    8
DDR4         2015          25 Gbyte/s    14
The effective bandwidth per core has, at best, stalled...
Arithmetic intensity
[Roofline figure: attainable Gflop/s versus arithmetic intensity (flop/byte). At low intensity, performance is capped by the peak memory bandwidth; at high intensity, by the peak floating-point rate, which is higher with vectorization.]
If your computation has enough work per data element, it is compute bound; otherwise, it is bandwidth bound.
If you are bandwidth bound, reducing your memory footprint, e.g., by compression, directly results in faster execution.
(Image courtesy of Prof. Wim Vanroose, UA)
Multi-socket architectures: NUMA
Each socket has local main memory where access is fast. Between sockets, access is slower.
[Figure: CPUs on two sockets, each attached to its own local memory.]
Access times to shared memory depend on the physical location, leading to non-uniform memory access (NUMA).
Dealing with NUMA: distribution types
Implicit distribution, centralised local allocation:
If each processor moves data to the same single memory element, the bandwidth is limited by that of a single memory controller.
Dealing with NUMA: distribution types
Implicit distribution, centralised interleaved allocation:
If each processor moves data from all memory elements, the bandwidth multiplies if accesses are uniformly random.
Dealing with NUMA: distribution types
Explicit distribution, distributed local allocation:
If each processor moves data from and to its own unique memory element, the bandwidth multiplies.
Caches
Divide the main memory (RAM) into lines of size L_S.
[Figure: main-memory lines mapped onto cache lines.]
The ith line in RAM is mapped to the cache line i mod L, where L is the number of available cache lines. Can we do better?
Caches
A smarter cache follows a pre-defined policy instead; for instance, the 'Least Recently Used (LRU)' policy. Example, with the most recently used line on top of the stack:
Req. x1, . . . , x4: (x4, x3, x2, x1) ⇒ Req. x2: (x2, x4, x3, x1) ⇒ Req. x5: (x5, x2, x4, x3); the least recently used line, x1, is evicted.
Caches
Realistic caches combine modulo-mapping and the LRU policy:
[Figure: main-memory (RAM) lines modulo-mapped onto k subcaches; each subcache is an LRU stack.]
k is the number of subcaches; there are L/k LRU stacks.
Caches
Realistic caches are used within multi-level memory hierarchies:
[Figure: CPU with an L1 and an L2 cache in front of the main memory (RAM).]
      Intel Core2 (Q6600)   AMD Phenom II (945e)   Intel Westmere (E7-2830)
L1:   S = 32 kB, k = 8      S = 64 kB, k = 2       S = 256 kB, k = 8
L2:   S = 4 MB, k = 16      S = 512 kB, k = 8      S = 2 MB, k = 8
L3:   -                     S = 6 MB, k = 48       S = 24 MB, k = 24
Caches and multiplication
Dense matrix–vector multiplication
[ a00 a01 a02 a03 ]   [ x0 ]   [ y0 ]
[ a10 a11 a12 a13 ] · [ x1 ] = [ y1 ]
[ a20 a21 a22 a23 ]   [ x2 ]   [ y2 ]
[ a30 a31 a32 a33 ]   [ x3 ]   [ y3 ]
Example with LRU caching and S = 4:
(x0) ⇒ read a00: (a00, x0) ⇒ update y0: (y0, a00, x0) ⇒ read x1: (x1, y0, a00, x0) ⇒ read a01: (a01, x1, y0, a00), pushing x0 to the bottom ⇒ update y0: (y0, a01, x1, a00), evicting x0.
Caches and multiplication: NUMA again
When k, L are larger, we can predict:
lower elements from x are evicted while processing the first row. This causes O(n) cache misses on each of the remaining m − 1 rows.
Fix:
stop processing a row before an element from x would be evicted; first continue with the next rows.
This results in column-wise ‘stripes’ of the dense A:
[Figure: A split into column-wise stripes.]
Caches and multiplication: NUMA again
[Figure: A split into column-wise stripes.]
But now:
elements from the vector y can be prematurely evicted; O(m) cache misses on each block of columns. (Already much better!)
Fix:
stop processing before an element from y is evicted; first do the remaining column blocks.
This is cache-aware blocking.
Caches and multicore
Most architectures employ shared caches.
[Figure: four cores, each with a private 64 kB L1 cache and a private 512 kB L2 cache, sharing a 6 MB L3 cache behind the system interface.]
In BSP: (4, 3 GHz, l, g). Is this a correct view?
Caches and multicore: NUMA
[Figure: four cores with private 32 kB L1 caches; two 4 MB L2 caches, each shared by two cores, behind the system interface.]
In BSP: (4, 2.4 GHz, l, g)... but with non-uniform memory access!
Usable sparse matrix–vector multiplication > (Shared-memory) parallel programming models
(Shared-memory) parallel programming models
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Shared-memory programming intro
Suppose x and y are in a shared memory. We calculate an inner product in parallel, using the cyclic distribution.
Input: s, the current processor ID; p, the total number of processors (threads); n, the size of the input vectors.
Output: x^T y
Shared-memory SPMD program with ‘double α;’ globally allocated:
α = 0.0
for i = s to n step p
    α += x_i · y_i
return α

Data race! (For n = p = 2, the output can be x_0·y_0, x_1·y_1, or x_0·y_0 + x_1·y_1.)
Shared-memory programming intro
Shared-memory SPMD program with ‘double α[p];’ globally allocated:
for i = s to n step p
    α_s += x_i · y_i
synchronise
return Σ_{i=0}^{p−1} α_i
False sharing! (processors access and update the same cache lines)
Shared-memory programming intro
Shared-memory SPMD program with ‘double α[8p];’ globally allocated:
for i = s to n step p
    α_{8s} += x_i · y_i
synchronise
return Σ_{i=0}^{p−1} α_{8i}
Inefficient cache use! (All threads access virtually all cache lines associated with x and y: Θ(pn) data movement.)
Shared-memory programming intro
Shared-memory SPMD program with ‘double α[8p];’ globally allocated:
for i = s · ⌈n/p⌉ to (s + 1) · ⌈n/p⌉
    α_{8s} += x_i · y_i
synchronise
return Σ_{i=0}^{p−1} α_{8i}
(Now inefficiency only at boundaries; O(n + p − 1) data movement)
Speedup
Definition (Speedup)
Let T_seq be the sequential running time required for solving a problem. Let T_p be the running time of a parallel algorithm using p processes, solving the same problem. Then the speedup is given by
S = T_seq / T_p.
Scalable in time:
Ideally, S = p;
if we are lucky, S > p;
realistically, 1 ≪ S < p;
if we do very badly, S < 1.
Maximum attainable speedup
Consider a graph G = (V, E) of a given algorithm, e.g.,
Nodes correspond to data; edges indicate which data is combined to generate a certain output.
Question: If we had an infinite number of processors, how fast would we be able to run the algorithm shown in the figure?
[Figure: example computation DAG with nodes such as a, f(a), g(a), h(f(a)), . . . ]
Answer: T_seq = |V| = 9, while the critical path length T_∞ equals 4.
The maximum speedup hence is:
T_seq / T_∞ = 9/4.
What is parallelism?
Definition (Parallelism)
The parallelism of a given algorithm is its maximum attainable speedup:
T_seq / T_∞.
T_∞ is known as the critical path length or the algorithmic span.
This leads to a theoretical upper bound on speedup:
S = T_seq / T_p ≤ T_seq / T_∞.
This type of analysis forms the basis of fine-grained parallel computation.
Fine-grained parallel computing
Decompose a problem into many small tasks that run concurrently (as much as possible). A run-time scheduler assigns tasks to processes.
What is small? Grain-size.
Performance model? Parallelism.
Algorithms can be implemented as graphs, explicitly or implicitly:
Intel: Threading Building Blocks (TBB),
OpenMP,
Intel / MIT / Cilk Arts: Cilk,
Google: Pregel,
. . .
By contrast, BSP computing is coarse-grained.
Cilk
Only two parallel programming primitives:
1 (binary) fork, and
2 (binary) join.
Cilk
Example: calculate x_4 from x_n = x_{n−2} + x_{n−1}, given x_0 = x_1 = 1:
int f( int n ) {
    if( n == 0 || n == 1 ) return 1;
    int x1 = cilk_spawn f( n-1 ); // fork
    int x2 = cilk_spawn f( n-2 ); // fork
    cilk_sync;                    // join
    return x1 + x2;
}

int main() {
    int x4 = f( 4 );
    printf( "x_4 = %d\n", x4 );
    return 0;
}
Cilk
Spawned function calls are assigned to one of the available processes by the Cilk run-time scheduler.
The Cilk scheduler guarantees, under some assumptions on the determinism of the algorithm, that the parallel run time T_p is bounded by
O(T_1/p + T_∞).
Not all run-time schedulers have such guarantees, and Cilk is but one of the many fine-grained parallel programming frameworks.
Is parallelism the way to go?
Example:
Consider the naive Θ(n^2) Fourier transformation; its span is Θ(log n), so its parallelism is Θ(n^2 / log n). Lots of parallelism!
The FFT formulation has work Θ(n log n), also with span Θ(log n), resulting in Θ(n log n / log n) = Θ(n) parallelism. Less parallelism...
Which would you prefer?
Is parallelism the way to go?
Is there a difference between considering T_seq or T_1?
Yes(!)
There may be multiple sequential algorithms to solve the same problem. When comparing, always compare to the best. For parallel sorting:
S_{odd-even sort} = T_seq^{qsort} / T_p^{odd-even sort}.
Is parallelism the way to go?
Are there any other issues that may be overlooked when focusing solely on parallelism?
Overheads
Definition (Overhead)
The overhead of parallel computation is any extra effort expended over the original amount of work T_seq:
T_o = p·T_p − T_seq.
The parallel computation time can be expressed in T_o:
T_p = (T_seq + T_o) / p.
Cilk: T_o = p·T_∞.
BSP: T_o = p · Σ_{i=0}^{N−1} (h_i·g + l) + p · Σ_{i=0}^{N−1} max_s w_i^{(s)} − T_seq.
Data movement, latency costs, and extra computations on top of the bare minimum required should be modelled. This enables parallel algorithm design, instead of simply enabling parallel execution.
MapReduce
Most parallel programming models only consider algorithmic structure: fork-join, parallel for, dataflows, etc. One other is MapReduce, which operates on key-value pairs from
K × V,
with K a set of possible keys and V a set of values. MapReduce defines two operations:
mapper: K × V → P(K × V );
reducer: K × P(V )→ V .
Mapper applies to all key-value pairs; embarrassingly parallel.
Reducer applies to a single key only; global communication.
Shuffling refers to the sorting-by-keys prior to the reduction stage.
Pregel
Consider a graph G = (V, E). Graph algorithms may be phrased as follows:
For each vertex v ∈ V, a thread executes a user-defined SPMD algorithm;
each algorithm consists of successive local compute phases and global communication phases;
during a communication phase, a vertex v can only send messages to N(v), where N(v) is the set of neighbouring vertices of v; i.e., N(v) = {w ∈ V | {v, w} ∈ E}.
MapReduce and Pregel are variants of the BSP algorithm model: a type of fine-grained BSP.
Spark
High-level language for large-scale computing: resilient, scalable, huge uptake; expressive and easy to use:
scala> val A = sc.textFile( "A.txt" ).map( x => x.split( " " ) match {
         case Array( a, b, c ) => ( a.toInt - 1, b.toInt - 1, c.toDouble )
       } ).groupBy( x => x._1 );
A: org.apache.spark.rdd.RDD[(Int, Iterable[(Int, Int, Double)])] = ShuffledRDD[8]...
scala>
Spark is implemented in Scala, runs on the JVM, relies on serialisation, and commonly uses HDFS for distributed and resilient storage.
Ref.: Zaharia, M. (2013). An architecture for fast and general data processing on large clusters. Dissertation, UCB.
Spark
Concepts:
RDDs are fine-grained data distributed by hashing;
Transformations (map, filter, groupBy) are lazy operators;
DAGs thus formed are resolved by actions: reduce, collect, ...
Computations are offloaded as close to the data as possible;
all-to-all data shuffles for communication required by actions.
Bridging HPC and Big Data
Platforms like Spark essentially do PRAM simulation:
automatic mode vs. direct mode; ease-of-use vs. performance
Ref.: Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8).
Both are scalable as long as the shuffle remains well balanced:
"First, [..] we show that RDDs can emulate any distributed system, and will do so efficiently as long as the system tolerates some network latency [..] because, once augmented with fast data sharing, MapReduce can emulate the BSP model of parallel computing, with the main drawback being the latency of each MapReduce step." – from the PhD thesis introducing Spark, page 4.
Ref.: Zaharia, M. (2013). An architecture for fast and general data processing on large clusters. Dissertation, UCB.
Multi-BSP
BSP itself is evolving as well:
Multi-BSP computer = p (subcomputers or processors) + M bytes of local memory + an interconnect.
A total of 4L parameters: (p_0, g_0, l_0, M_0, . . . , p_{L−1}, g_{L−1}, l_{L−1}, M_{L−1}).
memory-aware,
non-uniform!
However,
harder to prove optimality(?)
L. G. Valiant, A bridging model for multi-core computing, 2008, 2011.
Multi-BSP
An example with L = 4 quadlets (p, g, l, M):
Usable sparse matrix–vector multiplication > Application to sparse computing
Application to sparse computing
1 Shared-memory architectures
2 (Shared-memory) parallel programming models
3 Application to sparse computing
Problem setting
Given a sparse m × n matrix A, and corresponding vectors x, y.
How to calculate y = Ax as fast as possible?
How to make the code usable for the 99%?
Figure: Wikipedia link matrix (’07) with on average ≈ 12.6 nonzeroes per row.
Problem setting
Shared-memory central obstacles for SpMV multiplication:
inefficient cache use,
limited memory bandwidth, and
non-uniform memory access (NUMA).
Distributed-memory:
inefficient network use.
Shared-memory and distributed-memory share their objectives:
cache misses == communication volume
Ref.: Yzelman and Bisseling, "Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods", SIAM Journal on Scientific Computing 31(4), pp. 3128–3154 (2009).
Inefficient cache use
Visualisation of the SpMV multiplication Ax = y with nonzeroes processed in row-major order:
Accesses on the input vector are completely unpredictable.
Enhanced cache use: nonzero reorderings
Blocking to cache subvectors, and cache-oblivious traversals.
Other approaches: no blocking (Haase et al.), Morton Z-curves and bisection (Martone et al.), Z-curve within blocks (Buluc et al.), composition of low-level blocking (Vuduc et al.), ...
Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”,IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).
Enhanced cache use: nonzero reorderings
Sequential SpMV multiplication on the Wikipedia '07 link matrix: 345 (CRS), 203 (Hilbert), 245 (blocked Hilbert) ms/mul.
Enhanced cache use: matrix permutations
[Figure: a matrix partitioned into four parts and permuted accordingly.]
(Upper bound on) the number of cache misses: Σ_i (λ_i − 1)
Ref.: Yzelman and Bisseling, "Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods", SIAM Journal on Scientific Computing 31(4), pp. 3128–3154 (2009).
Enhanced cache use: matrix permutations
cache misses ≤ Σ_i (λ_i − 1) = communication volume
Lengauer, T. (1990). Combinatorial algorithms for integrated circuit layout. Springer Science & Business Media.
Catalyurek, U. V., & Aykanat, C. (1999). Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7), 673–693.
Catalyurek, U. V., & Aykanat, C. (2001). A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices. In IPDPS (Vol. 1, p. 118).
Vastenhouw, B., & Bisseling, R. H. (2005). A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1), 67–95.
Bisseling, R. H., & Meesen, W. (2005). Communication balancing in parallel sparse matrix-vector multiplication.Electronic Transactions on Numerical Analysis, 21, 47-65.
Should we program shared-memory as though it were distributed?
Enhanced cache use: matrix permutations
Practical gains:
Figure: the Stanford link matrix (left) and its 20-part reordering (right).
Sequential execution using CRS on Stanford:
18.99 (original), 9.92 (1D), 9.35 (2D) ms/mul.
Ref.: Yzelman and Bisseling, "Two-dimensional cache-oblivious sparse matrix-vector multiplication", Parallel Computing 37(12), pp. 806–819 (2011).
Bandwidth
Exploiting sparsity through computation using only nonzeroes:
i = (0, 0, 1, 1, 2, 2, 2, 3)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a_00, a_04, . . . , a_32)

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}
The coordinate (COO) format: two flops versus five data words.
Θ(3·nz) storage. CRS has lower storage requirements, Θ(2·nz + m), and hence a higher arithmetic intensity.
Efficient bandwidth use
A = [ 4 1 3 0
      0 0 2 3
      1 0 0 2
      7 0 1 1 ]
Bi-directional incremental CRS (BICRS):
V  = [ 7  1  4  1  2  3  3  2  1  1 ]
∆J = [ 0  4  4  1  5  4  5  4  3  1 ]
∆I = [ 3 -1 -2  1 -1  1  1  1 ]
Storage requirements, allowing arbitrary traversals: Θ(2·nz + row jumps + 1).
Ref.: Yzelman and Bisseling, "A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve", Progress in Industrial Mathematics at ECMI 2010, pp. 627–634 (2012).
Efficient bandwidth use
With BICRS you can, distributed or not,
vectorise,
compress,
do blocking,
have arbitrary nonzero or block orders.
Optimised BICRS takes at most 2·nz + m of memory.
Ref.: Buluc, Fineman, Frigo, Gilbert, Leiserson (2009). Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (pp. 233–244). ACM.
Ref.: Yzelman and Bisseling (2009). Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. In SIAM Journal on Scientific Computing 31(4), pp. 3128–3154.
Ref.: Yzelman and Bisseling (2012). A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve. In Progress in Industrial Mathematics at ECMI 2010, pp. 627–634.
Ref.: Yzelman and Roose (2014). High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication. In IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31.
Ref.: Yzelman, A. N. (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.
One-dimensional data placement
Coarse-grain row-wise distribution, compressed, cache-optimised:
explicit allocation of separate matrix parts per core,
explicit allocation of the output vector on the various sockets,
interleaved allocation of the input vector.
Ref.: Yzelman and Roose, “High-level strategies for parallel shared-memory sparse matrix–vector multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
Two-dimensional data placement
Distribute row- and column-wise (individual nonzeroes):
all data allocation is explicit,
inter-process communication minimised by partitioning;
incurs the cost of partitioning.
Ref.: Yzelman and Roose, High-level strategies for parallel shared-memory sparse matrix–vector multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
Ref.: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Results
Sequential CRS on Wikipedia ’07: 472 ms/mul. 40 threads BICRS:
21.3 (1D), 20.7 (2D) ms/mul. Speedup: ≈ 22x.
Average speedup on six large matrices:

                                 2 x 6   4 x 10   8 x 8
  –, 1D fine-grained, CRS∗        4.6      6.8     6.2
  Hilbert, Blocking, 1D, BICRS∗   5.4     19.2    24.6
  Hilbert, Blocking, 2D, BICRS†    −      21.3    30.8
†: uses an updated test set. (Added for reference versus a good 2D algorithm.)
As NUMA increases, interleaved and 1D algorithms lose efficiency.
∗: Yzelman and Roose, High-level strategies for parallel shared-memory sparse matrix–vector multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
†: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Usability
All of the above is available as free software:
http://albert-jan.yzelman.net/software.php#SL
However, there are problems integrating into existing codes:
SPMD (PThreads, MPI, ...) vs. others (OpenMP, Cilk, ...).
Globally allocated vectors versus explicit data allocation.
Conversion between matrix data formats.
Usability
Wish list:
Performance and scalability.
Portable codes and/or APIs: GPUs, x86, ARM, phones, ...
Out-of-core, streaming capabilities, dynamic updates.
User-defined overloaded operations on user-defined data.
Ease of use!
GraphBLAS.org
Interoperability (PThreads + Cilk, MPI + OpenMP, DSLs!)
Bridging HPC and Big Data
Wanted: a bridge between Big Data and HPC
Our take:
Spark I/O via native RDDs and native Scala interfaces;
Rely on serialisation and the JNI to switch to C;
Intercept Spark’s execution model to switch to SPMD;
Set up and enable inter-process RDMA communications.
def createMatrix(
    rdd: RDD[(Int, Iterable[Int], Iterable[Double])],
    P: Int
): SparseMatrix = ...

def multiply(
    sc: org.apache.spark.SparkContext,
    x: DenseVector, A: SparseMatrix, y: DenseVector
): DenseVector = ... //--- this function calls 1D or 2D SpMVs ---//

def toRDD(
    sc: org.apache.spark.SparkContext,
    x: DenseVector
): RDD[(Int, Double)] = ...
DataBSP
We have a shared-memory prototype. Preliminary results:
SpMM multiply, SpMV multiply, and basic vector operations;
one machine learning application.
Cage15, n = 5 154 859, nz = 99 199 551. Using the 1D method:
Note: this is ongoing work; performance and functionality improvements are forthcoming.
Conclusions and future work
Needed for current algorithms:
faster partitioning to enable scalable 2D sparse computations,
integration in practical and extensible libraries (GraphBLAS),
making them interoperable with common use scenarios.
Extend application areas further:
sparse power kernels,
symmetric matrix support,
graph and sparse tensor computations,
support various hardware and execution platforms (Hadoop?).
Thank you!

The basic SpMV multiplication codes are free:
http://albert-jan.yzelman.net/software#SL
Backup slides
Results: cross platform
Cross platform results over 24 matrices:
                     Structured   Unstructured   Average
  Intel Xeon Phi        21.6           8.7        15.2
  2x Ivy Bridge CPU     23.5          14.6        19.0
  NVIDIA K20X GPU       16.7          13.3        15.0

No one solution fits all.
If we must, some generalising statements:
Large structured matrices: GPUs.
Large unstructured matrices: CPUs or GPUs.
Smaller matrices: Xeon Phi or CPUs.
Ref.: Yzelman (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.
BSP sparse matrix–vector multiplication
Variables As, xs, ys are local versions of the global variables A, x, y, distributed according to πA, πx, πy.
1: for j | ∃ aij ≠ 0 ∈ As and πx(j) ≠ s do
2:     get xπx(j),j
3: end for
4: sync {execute fan-out}
5: ys = As xs {local multiplication stage}
6: for i | ∃ aij ∈ As and πy(i) ≠ s do
7:     send (i, ys,i) to πy(i)
8: end for
9: sync {execute fan-in}
10: for all (i, α) received do
11:     add α to ys,i
12: end for
Rob H. Bisseling, “Parallel Scientific Computation”, Oxford Press, 2004.
Multi-BSP SpMV multiplication
SPMD-style Multi-BSP SpMV multiplication:
define process 0 at level −1 as the Multi-BSP root.
let process s at level k have parent t at level k − 1.
define (A−1,0, x−1,0, y−1,0) = (A, x , y), the original input.
variables Ak,s , xk,s , yk,s are local versions of Ak−1,t , xk−1,t , yk−1,t ,
{A, x, y}k−1,t was distributed into p̃k−1 parts,

where p̃k−1 ≥ pk−1 is such that all {A, x, y}k,s fit into Mk bytes.

1: do
2:     for j = 0 to p̃ step p
3:         get {A}k,j from parent
4:     down
5: while(up)
Mandatory input data movement only.
Multi-BSP SpMV multiplication
SPMD-style Multi-BSP SpMV multiplication:
1: do
2:     . . .
3:     for j = 0 to p̃ step p
4:         get {A, x, y}k,j from parent
5:     . . .
6:     if(not down)
7:         compute yk,j = Ak,j xk,j {only executed on leafs}
8:     . . .
9:     put yk,j into parent
10:    . . .
11: while(up)
Mandatory and mixed mandatory/overhead data movement.Minimal required work only.
Multi-BSP SpMV multiplication
SPMD-style Multi-BSP SpMV multiplication:
1: do
2:     ∀j, get separator x̄k,j and initialise ȳk,j iff j mod p = s
3:     for j = 0 to p̃ step p
4:         get {A, x, y}k,j from parent
5:     sync
6:     if(not down)
7:         compute yk,j = Ak,j xk,j {only executed on leafs}
8:     perform fan-in on separator ȳk,j
9:     put yk,j into parent
10:    sync
11:    put ȳk,j into parent and sync
12: while(up)

Mandatory costs plus overhead. Split vectors: {x, y}s versus {x̄, ȳ}s.
Flat partitioning for Multi-BSP
Can we reuse existing partitioning techniques?
1 Partition A = A0 ∪ . . . ∪ Ap−1 with p = Π_{l=0}^{L−1} pl?
No: As, xs, ys may not fit in ML−1.

2 Find the minimal k to partition A into, s.t. every {A, x, y}i fits into ML−1?
Very similar to previous work!

Y. and Bisseling, “Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning”, SISC, 2009.
Y. and Bisseling, “Two-dimensional cache-oblivious sparse matrix–vector multiplication”, Parallel Computing, 2011.

3 Hierarchical partitioning?
A = A0 ∪ . . . ∪ Ak0, Ai = Ai,0 ∪ . . . ∪ Ai,k1, etc.;
solves the assignment issue.

However, all of these do not take the different gl into account!
Hierarchical partitioning

If g0 < 2g1, greedy hierarchical partitioning is suboptimal.

Cutting at the upper level only:

             Upper level   Lower level
  Fan-out        6g0            0
  Fan-in         2g0            0

  Total: 8g0

Cutting per level:

             Upper level   Lower level
  Fan-out         0            6g1
  Fan-in         4g0           2g1

  Total: 4g0 + 8g1   (versus 8g0 previously)

Indeed, 4g0 + 8g1 exceeds 8g0 precisely when g0 < 2g1.
Multi-BSP aware partitioning
Slightly modified V-cycle:
1 coarsen
2 recurse or randomly partition
3 do k steps of HKLFM,
  calculating gains taking g0, . . . , gL−1 into account
4 refine
Claim: if g0 > g1 > g2 . . ., then HKLFM is a local operation.
By enumeration of all possibilities (L = 2). At level-1 refinement:
suppose we move a nonzero aij from As1,s2 to At1,t2 with s1 6= t1:
aij ∈ As1,s2 , aij /∈ As1 : gain is g1 − g0 or 2(g1 − g0).
aij /∈ As1,s2 : gain is 0, g1 − g0, or 2(g1 − g0).
Hence it suffices to perform HKLFM steps on each level separately.
Summary
Differences from flat BSP:
different notion of load balance:
parts must fit into local memory.

non-uniform communication costs:
implies different partitioning techniques.

non-uniform data locality...
with fine-grained distribution.
How does it compare?
ANSI C++11, parallelisation using std::thread,
implementation relies on shared-memory cache coherency
Mondriaan 4.0, medium-grain, symmetric doubly BBD reordering
Global arrays without blocking, nonzero reordering, compression.
         matrix       original   p = 1   p = max   Optimal
  2x8    G3 circuit     33.3      26.7    10.5      2.77
  2x8    FS1            83.5      65.3    22.0     10.3
  2x8    cage15        523       387      77.1     29.8
  2x10   G3 circuit     22.7      16.9     9.77     1.73
  2x10   FS1            83.5      65.3    22.0      7.56
  2x10   cage15        341       233      54.7     23.4

All numbers are in ms.
Y. and Bisseling, Cache-oblivious sparse matrix–vector multiplication, SISC 2009
Y. and Roose, High-level strategies for sparse matrix–vector multiplication, IEEE TPDS 2014
Vectorised BICRS
Incorporating vectorisation in the compressed data structure:
Ref.: Yzelman and Roose (2014). Sparse matrix computations on multi-core systems. In Intel European Exascale Labs report 2013, pp. 24–29, Intel.
Ref.: Yzelman (2015). Generalised vectorisation for sparse matrix–vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. ACM.