Sparse Matrix Dense Vector Multiplication
by Pedro A. Escallon
Parallel Processing Class
Florida Institute of Technology
April 2002
The Problem
• Improve the speed of sparse matrix-dense vector multiplication using MPI on a Beowulf parallel computer.
What To Improve
• Current algorithms use excessive indirect addressing
• Current optimizations depend on the structure of the matrix (distribution of the nonzero elements)
Sparse Matrix Representations
• Coordinate format
• Compressed Sparse Row (CSR)
• Compressed Sparse Column (CSC)
• Modified Sparse Row (MSR)
Compressed Sparse Row (CSR)
0 A01 A02 0
0 A11 0 A13
A20 0 0 0

rS:  0 2 4 5
ndx: 1 2 1 3 0
val: A01 A02 A11 A13 A20
CSR Code
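void sparseMul(int m, double *val, int *ndx, int *rS, double *x, double *y)
{
    int i, j;
    /* y is accumulated into, so the caller must zero it first */
    for (i = 0; i < m; i++) {
        /* rS[i] .. rS[i+1]-1 are the positions of row i in val and ndx */
        for (j = rS[i]; j < rS[i+1]; j++) {
            y[i] += (*val++) * x[*ndx++];
        }
    }
}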
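As a usage sketch, the CSR routine can be exercised on the 3 x 4 example matrix, whose stored values A01, A02, A11, A13, A20 sit in columns 1, 2, 1, 3, 0. The numeric values 1 to 5 standing in for those entries, and the names csr_mul and csr_example, are assumptions made only for this illustration. Unlike sparseMul, this variant uses indexed access and overwrites y rather than accumulating into it.

```c
#include <assert.h>

/* CSR product y = A*x: same loop structure as sparseMul, with indexed
   access and y[i] overwritten instead of accumulated. */
static void csr_mul(int m, const double *val, const int *ndx,
                    const int *rS, const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = rS[i]; j < rS[i + 1]; j++)
            sum += val[j] * x[ndx[j]];   /* indirect access through ndx */
        y[i] = sum;
    }
}

/* Hypothetical driver: A01..A20 replaced by the placeholders 1..5. */
static void csr_example(double y[3])
{
    static const int    rS[]  = {0, 2, 4, 5};      /* row start offsets */
    static const int    ndx[] = {1, 2, 1, 3, 0};   /* column of each value */
    static const double val[] = {1, 2, 3, 4, 5};
    static const double x[]   = {1, 1, 1, 1};
    csr_mul(3, val, ndx, rS, x, y);
}
```

With x set to all ones, each y[i] is simply the sum of the stored values in row i.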
Goals
• Eliminate indirect addressing
• Remove the dependency on the distribution of the nonzero elements
• Further compress the matrix storage
• Most of all, speed up the operation
Proposed Solution
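A =
0 A01 A02 0
0 A11 0 A13
A20 0 0 0

is stored as

{0,0} {1,A01} {2,A02} {-1,0} {1,A11} {3,A13} {-2,A20}

• Each row opens with {-row, value in column 0}
• The remaining elements of the row are {column, value}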
Data Structure
typedef struct {
    int rCol;
    double val;
} dSparS_t;

Each element: {rCol,val}
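A minimal sketch of how a matrix might be packed into dSparS_t elements. The convention is an assumption read off the example matrix: every row opens with {-row, A[row][0]}, stored even when that value is zero, followed by {column, value} for each nonzero in columns 1 and up. The function name encode_dense and the dense row-major input are hypothetical. The slides do not show how the real code builds the array.

```c
#include <assert.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Encode a dense row-major m x n matrix into the {rCol,val} stream.
   Assumed convention: row marker {-row, A[row][0]} first, then one
   {col, value} element per nonzero with col > 0.
   Returns the number of elements written; out needs room for m*n. */
static int encode_dense(int m, int n, const double *a, dSparS_t *out)
{
    assert(m > 0 && n > 0);
    int k = 0;
    for (int i = 0; i < m; i++) {
        out[k].rCol = -i;            /* row marker carries column 0 */
        out[k].val  = a[i * n];
        k++;
        for (int j = 1; j < n; j++) {
            if (a[i * n + j] != 0.0) {
                out[k].rCol = j;     /* positive rCol is a column index */
                out[k].val  = a[i * n + j];
                k++;
            }
        }
    }
    return k;
}
```

Applied to the example matrix, this yields seven elements, one per stored nonzero plus one for each row whose column-0 entry is zero.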
Process
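[Figure: A, with hdr.size elements, divided among processes 0, 1, 2, …, p in blocks of local_size elements, with a residual at the end]

local_size = hdr.size / p
residual = hdr.size % p
residual < p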
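The block arithmetic can be sketched in a few lines. The type split_t, the function split_elements and the sample sizes in the note are hypothetical. The slide does not state which process handles the residual elements, so that is left open here.

```c
#include <assert.h>

/* Hypothetical helper type holding the two quantities from the slide. */
typedef struct { int local_size; int residual; } split_t;

/* Integer division gives the block each process receives;
   the remainder is the residual, always smaller than p. */
static split_t split_elements(int hdr_size, int p)
{
    assert(hdr_size >= 0 && p > 0);
    split_t s;
    s.local_size = hdr_size / p;
    s.residual   = hdr_size % p;
    return s;
}
```

For instance, 23 elements over 4 processes gives blocks of 5 elements and a residual of 3.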
Scatter
[Figure: A is scattered in blocks of local_size elements to processes 0, 1, 2, …, p; each process holds its block as local_A]
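Multiplication Code

/* first element: a column element when the block starts mid-row,
   otherwise a row marker whose value belongs to column 0 */
if ((index = local_A[0].rCol) > 0)
    local_Y[0].val = local_A[0].val * X[index];
else
    local_Y[0].val = local_A[0].val * X[0];
local_Y[0].rCol = -1;
k = 1; h = 0;
while (k < local_size) {
    /* accumulate the column elements of the current row;
       the bounds check comes first so local_A[local_size] is never read */
    while ((k < local_size) && (0 < (index = local_A[k].rCol)))
        local_Y[h].val += local_A[k++].val * X[index];
    if (k < local_size) {
        /* row marker reached: label the finished row, start the next one */
        local_Y[h++].rCol = -index-1;
        local_Y[h].val = local_A[k++].val * X[0];
    }
}
/* label the last row; assumes at least one row was closed above (h >= 1) */
local_Y[h].rCol = local_Y[h-1].rCol + 1;
h++;
/* mark the unused entries */
while (h < stride) local_Y[h++].rCol = -1;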
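For comparison, here is a serial sketch of the same product over the whole element array, writing into a dense result vector instead of local_Y. It assumes the first element is a row marker and that markers carry -row. The name dspars_mul and the parameter m (number of rows) are hypothetical. This is not the MPI code from the slides.

```c
#include <assert.h>

typedef struct { int rCol; double val; } dSparS_t;

/* y = A*x with A stored as a {rCol,val} stream.
   rCol > 0  : column element, multiply by x[rCol]
   rCol <= 0 : row marker for row -rCol, its value multiplies x[0] */
static void dspars_mul(int size, const dSparS_t *a, const double *x,
                       int m, double *y)
{
    assert(size > 0 && a[0].rCol <= 0);
    for (int i = 0; i < m; i++)
        y[i] = 0.0;
    int row = 0;
    for (int k = 0; k < size; k++) {
        if (a[k].rCol > 0) {
            y[row] += a[k].val * x[a[k].rCol];
        } else {
            row = -a[k].rCol;        /* switch to the next row */
            assert(row < m);
            y[row] += a[k].val * x[0];
        }
    }
}
```

The row index is updated only when a marker is met, so no separate row-start array is needed.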
Multiplication
[Figure: local_A (local_size elements) * X (domain) = local_Y (stride elements, range)]
Algorithm
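local_A     X        Y.val           Y.rCol
{r0,v0}0    X[0]     =X[0]*v00       -
{c1,v1}0    X[c01]   +=X[c01]*v01    -
{c2,v2}0    X[c02]   +=X[c02]*v02    -r1-1
{r1,v0}1    X[0]     =X[0]*v10       -
{c1,v1}1    X[c11]   +=X[c11]*v11    -
..          ..

The trailing digit of each local_A element is its row; in c01 and v01 the first digit is the row and the second the position within the row.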
Gather
[Figure: the local_Y blocks of processes 0, 1, 2, …, p are gathered into gatherBuffer. Labels: residual, split element, stride, range]
Consolidation of Split Rows
[Figure: Y (nCols entries) += gatherBuffer; the residual is also shown]
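One way the consolidation step could look, as a sketch under stated assumptions: each gathered entry is a {rCol,val} pair whose rCol is the row index of a partial sum, entries with rCol = -1 are padding, and a row split across two processes appears as two entries with the same row. The function name consolidate and the parameter nRows are hypothetical. The slides do not show the actual consolidation code.

```c
#include <assert.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Add every gathered partial sum into its row of Y.
   Split rows contribute more than one entry, so += merges them. */
static void consolidate(int count, const dSparS_t *gatherBuffer,
                        int nRows, double *Y)
{
    assert(count >= 0 && nRows >= 0);
    for (int i = 0; i < nRows; i++)
        Y[i] = 0.0;
    for (int i = 0; i < count; i++) {
        int row = gatherBuffer[i].rCol;
        if (row < 0)
            continue;                /* padding entry, nothing to add */
        assert(row < nRows);
        Y[row] += gatherBuffer[i].val;
    }
}
```

Because the merge is a plain accumulation, the order in which the blocks arrive in the buffer does not matter.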
Results (vavasis3)
vavasis3.rua - Total non-zero values: 1,683,902 - p = 10
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.103930 2.380285 0.096051 0.012123
P1 0.107588 0.457140 0.012000 0.011504
P2 0.107667 0.706087 0.012022 0.011642
P3 0.103155 0.951814 0.011971 0.011560
P4 0.107644 1.206376 0.012210 0.011536
P5 0.109243 1.452563 0.012032 0.011506
P6 0.108477 1.702571 0.012044 0.011506
P7 0.109446 1.948481 0.012004 0.011658
P8 0.055822 2.208924 0.012079 0.011540
P9 0.059023 2.459900 0.012009 0.011438
Results (vavasis3)
vavasis3.rua - Total non-zero values: 1,683,902 - p = 8
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.089478 2.264316 0.121741 0.014860
P1 0.093083 0.569091 1.711789 0.014105
P2 0.093217 0.866460 1.429352 0.014227
P3 0.091012 1.160591 1.146954 0.014457
P4 0.081719 1.462335 0.865520 0.014365
P5 0.085375 1.756941 0.582353 0.014341
P6 0.085418 2.055651 0.299847 0.014362
P7 0.089087 2.350998 0.017813 0.014728
vavasis3.rua - Total non-zero values: 1,683,902 - p = 1
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.000002 1.412774 0.033015 0.112132
Results (vavasis3)
vavasis3.rua - Total non-zero values: 1,683,902 - p = 4
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.051980 3.026846 0.217574 0.028587
P1 0.055605 1.725272 1.027928 0.028258
P2 0.055703 2.319343 0.451021 0.028141
P3 0.056422 3.212518 0.018073 0.027988
vavasis3.rua - Total non-zero values: 1,683,902 - p = 2
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.233200 5.810814 0.426097 0.056334
P1 0.236864 6.521328 0.032125 0.055866
Results (vavasis3)
P Computation Speedup E_p Gather C_p
1 0.112132 --- --- 0.033015 1.294430
2 0.056334 1.990485 0.995243 0.426097 8.563763
4 0.028587 3.922482 0.980621 1.027928 36.957883
8 0.014860 7.545895 0.943237 1.711789 116.194415
10 0.012123 9.249526 0.924953 0.096051 8.923039
vavasis3.rua - Calculated Results
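The calculated columns can be reproduced from the per-process timings. In this sketch, speedup is the p = 1 computation time divided by the p-process computation time, and E_p is speedup divided by p. The formula used for C_p, (computation + gather) / computation, is an inference that matches the tabulated values rather than a definition given on the slides. The names metrics_t and metrics are hypothetical.

```c
#include <assert.h>

/* Hypothetical container for the three derived columns. */
typedef struct { double speedup, efficiency, c_p; } metrics_t;

/* t1       : computation time with one process
   tp       : computation time with p processes
   gather_p : gather time with p processes */
static metrics_t metrics(double t1, double tp, double gather_p, int p)
{
    assert(t1 > 0.0 && tp > 0.0 && p > 0);
    metrics_t m;
    m.speedup    = t1 / tp;
    m.efficiency = m.speedup / p;
    m.c_p        = (tp + gather_p) / tp;   /* inferred from the table */
    return m;
}
```

Calling metrics(0.112132, 0.056334, 0.426097, 2) reproduces the p = 2 row of the table to the printed precision.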
Results (bayer02)
bayer02.rua - Total non-zero values: 63,679 - p = 10
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.046136 0.093143 0.011733 0.000926
P1 0.048824 0.018207 0.001567 0.000423
P2 0.048627 0.027146 0.002054 0.000456
P3 0.044416 0.034386 0.002440 0.000445
P4 0.048214 0.046365 0.002457 0.000397
P5 0.048481 0.053511 0.001978 0.000425
P6 0.045666 0.063204 0.002015 0.000467
P7 0.048173 0.070167 0.002440 0.000419
P8 0.033947 0.088532 0.002323 0.000395
P9 0.032110 0.097866 0.001959 0.000479
Results (bayer02)
bayer02.rua - Total non-zero values: 63,679 - p = 8
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.040159 0.103422 0.011810 0.001020
P1 0.042743 0.023353 0.001728 0.000549
P2 0.042709 0.035670 0.001777 0.000607
P3 0.039322 0.047141 0.001738 0.000599
P4 0.041584 0.064024 0.001724 0.000702
P5 0.039229 0.075528 0.001725 0.000568
P6 0.037206 0.089757 0.001733 0.000565
P7 0.039912 0.101267 0.002111 0.000541
bayer02.rua - Total non-zero values: 63,679 - p = 1
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.000003 0.063824 0.010975 0.006090
Results (bayer02)
bayer02.rua - Total non-zero values: 63,679 - p = 4
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.049680 0.096930 0.018308 0.001888
P1 0.052379 0.048924 0.003765 0.001555
P2 0.051944 0.076405 0.003609 0.001561
P3 0.046413 0.101871 0.003636 0.001528
bayer02.rua - Total non-zero values: 63,679 - p = 2
Broadcast Time Scatter Time Gather Time Computation Time
P0 0.025494 0.520611 0.008192 0.003445
P1 0.028157 0.504081 0.032848 0.003121
Results (bayer02)
P Computation Speedup E_p Gather C_p
1 0.006090 --- --- 0.010975 2.802135
2 0.003445 1.767779 0.883890 0.032848 10.534978
4 0.001888 3.225636 0.806409 0.018308 10.697034
8 0.001020 5.970588 0.746324 0.011810 12.578431
10 0.000926 6.576674 0.657667 0.011733 13.670626
bayer02.rua - Calculated Results
Conclusions
• The proposed representation speeds up the matrix calculation
• Data mismatch solution before gather should be improved
• There seems to be a communication penalty for moving structured data
Bibliography
• “Optimizing the Performance of Sparse Matrix-Vector Multiplication”, dissertation by Eun-Jin Im
• “Iterative Methods for Sparse Linear Systems” by Yousef Saad
• “Users’ Guide for the Harwell-Boeing Sparse Matrix Collection” by Iain S. Duff