
Data Parallel Dense and Sparse Linear Algebra using Global Arrays

Serge G. Petiton and Nahid Emad

November 1st, 2018

XMP Workshop, PARIS-SACLAY, FRANCE

Outline

• Introduction
• Global arrays and data parallelism
• Data parallel dense linear algebra
• Data parallel sparse linear algebra
• Conclusion


Introduction


The “true” data parallelism is back: the cores no longer all share a memory.

NoC (network on chip): distributed and parallel computing, even on a chip.

Runtime systems and communication layers will not be able to optimize everything.

The algorithms have to be data parallel, and communications, or stencils, have to be “compiled” whenever possible.

Outline

• Introduction
• Global arrays and data parallelism
• Data parallel dense linear algebra
• Data parallel sparse linear algebra
• Conclusion


Global arrays and data parallelism

• A global array is a “data parallel” variable.
• We assume that each element of a global array is stored in one private memory only, with no other element of the array (hypothesis).
• Reduction (prefix) operations on global arrays: associative operations.
• Let “epc” be the “elements per core” ratio: the number of elements of the global array on a physical core (the former “virtual processor ratio” and “virtual geometry”); see the sketch after this list.
• Spread: a data parallel operation along a given dimension of a given array.
• Scan data parallel operations?
• Spread_with_add and Reduce_with_add along a given dimension.
• Neighbor communications and stencils.
• Send/get: one_to_one, one_to_all, one_to_many, ….
• We have to map those “elements” onto the processors (depending on the communications and on the data parallel algorithm).
• We have to align “global arrays”.
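A minimal XcalableMP-style sketch of these notions, with assumed sizes (N = 1024 elements block-distributed over 4 cores, so epc = 256): directive syntax follows the XMP specification, but this is an illustration, not code from the talk.

/* A global array block-distributed so each element lives in exactly one
   private memory; the sum is an associative reduction with add. */
#include <stdio.h>
#define N 1024

#pragma xmp nodes p(4)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double g[N];                      /* the "data parallel" variable */
#pragma xmp align g[i] with t(i)

int main(void) {
  double sum = 0.0;
#pragma xmp loop on t(i) reduction(+:sum)
  for (int i = 0; i < N; i++) {   /* each core touches only its epc = N/4 elements */
    g[i] = 1.0;
    sum += g[i];
  }
#pragma xmp task on p(1)
  printf("sum = %f\n", sum);      /* sum == N after the reduction */
  return 0;
}

Ignoring the directives yields the sequential semantics, which is the usual XMP property; the distribution and the node count here are assumptions.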

Outline

• Introduction
• Global arrays and data parallelism
• Data parallel dense linear algebra
  – Matrix-vector multiplication: A(Ax)
  – Gauss elimination
  – Back substitution
  – Gauss-Jordan inversion
  – QR method
• Data parallel sparse linear algebra
• Conclusion


Data parallel A(Ax)

[Figure: the 6 × 6 global array A = (a_{i,j}) with a copy of the vector x = (x_1, …, x_6) spread on every row, so that x_j is aligned with column j of A.]

Data parallel multiplication

[Figure: the 6 × 6 global array T, computed by one data parallel elementwise multiplication.]

T_{i,j} = a_{i,j} x_j

Second step

[Figure: the same 6 × 6 array T; the second step is a summation (reduction operation with add) along each row.]

T_{i,j} = a_{i,j} x_j, then w_i = Σ_j T_{i,j}
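Putting the two steps together, a plain C sketch of the data parallel matrix-vector product (N = 6 as in the figures; the function name is ours):

/* w = A x in two data parallel steps: one elementwise multiplication
   T[i][j] = A[i][j] * x[j] (x already spread on every row), then one
   reduction with add along dimension 2 to form w[i]. */
#define N 6
void matvec(const double A[N][N], const double x[N], double w[N]) {
  double T[N][N];
  for (int i = 0; i < N; i++)        /* step 1: fully data parallel */
    for (int j = 0; j < N; j++)
      T[i][j] = A[i][j] * x[j];
  for (int i = 0; i < N; i++) {      /* step 2: reduce_with_add     */
    w[i] = 0.0;
    for (int j = 0; j < N; j++)
      w[i] += T[i][j];
  }
}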

A(Ax) or A(Ax+x)+x

[Figure: after the reduction, row i of the array holds w_i in every column, i.e. each column contains a full copy of w.]

Ax = w is now column mapped.


Spread operation

[Figure: the spread operation turns the column copy of w = (w_1, …, w_6) into a copy of w on every row, aligning w_j with column j for the second product.]
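A plain C sketch of this spread (names are ours): the column copy of w becomes a copy of w on every row, ready for the second product in A(Ax).

/* Spread along dimension 1: every row receives a full copy of w, so
   that w[j] is aligned with column j of A for the next multiplication. */
#define N 6
void spread_dim1(const double w[N], double wrep[N][N]) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      wrep[i][j] = w[j];            /* broadcast down each column */
}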


Data parallel Gauss Elimination

[Figures: successive Gauss elimination steps; at each step the pivot row is spread along dimension 1 before the data parallel update of the trailing submatrix.]
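A sequential C sketch of the pattern behind these figures, under the usual assumption of non-zero pivots (no pivoting): at step k the pivot row and the multipliers are spread, then the trailing submatrix is updated by one data parallel operation.

/* Gauss elimination, data parallel view: the two inner loops over i and j
   form one triadic (*,+) update that can run on all cores at once. */
#define N 8
void gauss_eliminate(double A[N][N]) {
  for (int k = 0; k < N - 1; k++) {
    double pivrow[N], mult[N];
    for (int j = k; j < N; j++)          /* spread the pivot row   */
      pivrow[j] = A[k][j];
    for (int i = k + 1; i < N; i++)      /* spread the multipliers */
      mult[i] = A[i][k] / A[k][k];
    for (int i = k + 1; i < N; i++)      /* data parallel update   */
      for (int j = k; j < N; j++)
        A[i][j] -= mult[i] * pivrow[j];
  }
}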

Spread number optimization


XMP:

[Figure: the XMP source for the data parallel Gauss elimination.]
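Since the listing itself is lost here, the following is a hypothetical XcalableMP reconstruction of one elimination step; the directives follow the XMP specification, but the names, the distribution choice, and the structure are all assumptions, not the code shown at the workshop.

/* Rows of A are block-distributed; gmove replicates the pivot row, and
   the xmp loop runs the update only on the cores owning each row. */
#define N 8
#pragma xmp nodes p(4)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double A[N][N];
#pragma xmp align A[i][*] with t(i)

void elimination_step(int k) {
  double pivrow[N];                 /* replicated on every node */
#pragma xmp gmove
  pivrow[0:N] = A[k][0:N];          /* broadcast the pivot row  */
#pragma xmp loop on t(i)
  for (int i = k + 1; i < N; i++) {
    double m = A[i][k] / pivrow[k];
    for (int j = k; j < N; j++)
      A[i][j] -= m * pivrow[j];
  }
}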

(non) data parallel back substitution


We have already computed x_8, x_7 and x_6; we then compute x_5.

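Why this is “(non) data parallel”, in a short C sketch (U is the upper triangular matrix produced by the elimination; N = 8 as in the figures):

/* Back substitution: x[i] needs every x[j] with j > i, so the outer loop
   is inherently sequential (x8, then x7, x6, x5, ...); only the inner dot
   product can be done as a data parallel reduce_with_add. */
#define N 8
void back_substitute(const double U[N][N], const double b[N], double x[N]) {
  for (int i = N - 1; i >= 0; i--) {
    double s = b[i];
    for (int j = i + 1; j < N; j++)   /* data parallel reduction */
      s -= U[i][j] * x[j];
    x[i] = s / U[i][i];
  }
}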

Data parallel Gauss-Jordan method


Spread number already optimized


Triadic (*,+) data parallel operation

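A plain C sketch of this triadic (*,+) operation as used by the Gauss-Jordan update; one multiply and one add per element, applied to the whole array at once (names are ours):

/* Triadic (*,+) data parallel operation: every row of A except the
   pivot row k is updated by one multiplication and one addition. */
#define N 8
void triadic_update(double A[N][N], const double mult[N],
                    const double pivrow[N], int k) {
  for (int i = 0; i < N; i++)
    if (i != k)                        /* Gauss-Jordan updates all other rows */
      for (int j = 0; j < N; j++)
        A[i][j] -= mult[i] * pivrow[j];
}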

XMP:

[Figure: the XMP source for the triadic update.]

(non) data parallel QR method

Outline

• Introduction
• Global arrays and data parallelism
• Data parallel dense linear algebra
• Data parallel sparse linear algebra
  – Sparse matrix-vector multiplication (iterative/restarted methods)
• Conclusion


[Figure: a sparse matrix compressed by columns; each compressed column is aligned with the vector element it multiplies (x_3, x_7, x_8, …).]

Each compressed column j is multiplied by x_j:

T[1:3,1:8] = A[1:3,1:8] * X[1:3,1:8]

We also need to store the row of each non-zero element: the ELLPACK format.
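A C sketch of this column-compressed product, with C = 3 non-zeros kept per compressed column and N = 8 columns as on the slide; the array names, the square shape, and the zero padding are assumptions.

/* ELLPACK by columns: Aval[c][j] is the c-th stored non-zero of column j
   and Arow[c][j] its row index; short columns are padded with zero values
   (and any valid row index), so padded products contribute nothing. */
#define C 3
#define N 8
void spmv_col(const double Aval[C][N], const int Arow[C][N],
              const double x[N], double w[N]) {
  double T[C][N];
  for (int c = 0; c < C; c++)          /* data parallel: column j times x[j] */
    for (int j = 0; j < N; j++)
      T[c][j] = Aval[c][j] * x[j];
  for (int i = 0; i < N; i++)
    w[i] = 0.0;
  for (int c = 0; c < C; c++)          /* scatter-add using the row indices  */
    for (int j = 0; j < N; j++)
      w[Arow[c][j]] += T[c][j];
}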


For the reduction/spread with addition, the best choice is row compression.

[Figure: the non-zeros of each compressed row summed with add.]

We also have to store the column of each non-zero element.
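The row-compressed counterpart in the same C sketch style: the summation becomes a reduction with add along the compressed dimension, which is the data parallel-friendly direction (again, names and padding are assumptions).

/* ELLPACK by rows: Aval[i][c] is the c-th stored non-zero of row i and
   Acol[i][c] its column index; padded entries carry a zero value. */
#define C 3
#define N 8
void spmv_row(const double Aval[N][C], const int Acol[N][C],
              const double x[N], double w[N]) {
  for (int i = 0; i < N; i++) {
    double s = 0.0;
    for (int c = 0; c < C; c++)        /* reduce_with_add along dimension 2 */
      s += Aval[i][c] * x[Acol[i][c]];
    w[i] = s;
  }
}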


ELLPACK format

[Figure: ELLPACK storage, with the column index kept for each non-zero value.]

jc: the non-zero is the jc-th element of its compressed row.

Sparse General Pattern (SGP): for each non-zero you need to have (a_{i,j}, i, j, ic, jc).

Acr = the non-zero elements of A.

We may use these parameters to change from a column compression to a row one.

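One SGP entry written as a C struct, following the tuple on the slide; the field names are ours.

/* Sparse General Pattern: for each non-zero of A keep its value, its
   coordinates (i, j), and its ranks ic and jc in the compressed column
   and compressed row, so either compression can be rebuilt. */
typedef struct {
  double a;     /* the non-zero value a(i,j)         */
  int    i, j;  /* row and column in the full matrix */
  int    ic;    /* rank within its compressed column */
  int    jc;    /* rank within its compressed row    */
} sgp_entry;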


From (ic, j) to (i, jc), or from (ic, j) to (jc, i), to keep a C-by-N global array; see the sketch below.
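A C sketch of that permutation, assuming for simplicity exactly C non-zeros per row and per column so that every slot is valid: the entry stored at (ic, j) in the column-compressed array moves to (jc, i) in the row-compressed one, and both stay C-by-N global arrays.

/* Column compression (value + row index i + rank jc) to row compression
   (value + column index j): a pure data parallel permutation. */
#define C 3
#define N 8
void col_to_row(const double Aval[C][N], const int Ai[C][N],
                const int Ajc[C][N],
                double Bval[C][N], int Bj[C][N]) {
  for (int ic = 0; ic < C; ic++)
    for (int j = 0; j < N; j++) {
      int i  = Ai[ic][j];             /* row of this non-zero           */
      int jc = Ajc[ic][j];            /* its rank in the compressed row */
      Bval[jc][i] = Aval[ic][j];      /* (ic, j) -> (jc, i)             */
      Bj[jc][i]   = j;                /* keep j for the product step    */
    }
}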



Outline

• Introduction
• Global arrays and data parallelism
• Data parallel dense linear algebra
• Data parallel sparse linear algebra
• Conclusion


Conclusion

• Global arrays allow data parallel algorithms.
• Sparse matrix linear algebra asks for new formats when cores/processors do not share any memory.
• The SGP format was the most efficient on past data parallel machines with such properties.
• Algorithms have to be developed using XMP and experimented with on new machines.


- William Ferng, Serge Petiton, Kesheng Wu, and Yousef Saad. Basic Sparse Matrix Computations on Massively Parallel Computers. In Parallel Processing for Scientific Computing, David Keyes et al., editors, SIAM, 1993.

- Serge Petiton and Nahid Emad. A Data Parallel Scientific Computing Introduction. In The Data Parallel Programming Model, LNCS 1132, pp. 45-64, Springer-Verlag, 1996.
