1
Lecture 6: Memory Hierarchy and Cache (Continued)
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
Cache: A safe place for hiding and storing things. Webster’s New World Dictionary (1976)
2
Homework Assignment
• Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:
• subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
• subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
• …
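As a starting point, here is a minimal C sketch of one variant. The interpretation of the argument list (C is m x n, A is m x k, B is k x n, all column-major with leading dimensions lda, ldb, ldc) is an assumption, since the slide does not spell it out:

/* Hypothetical reading of the homework interface: C = C + A*B with
   C (m x n), A (m x k), B (k x n), stored column-major (Fortran order)
   with leading dimensions lda, ldb, ldc. The other five variants differ
   only in the nesting order of the three loops, not in the arithmetic. */
void ijk(const double *a, int m, int n, int lda,
         const double *b, int k, int ldb,
         double *c, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                c[i + j * ldc] += a[i + p * lda] * b[p + j * ldb];
}

Swapping the loop order gives the other five subroutines; only the memory access pattern changes.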
3-11
6 Variations of Matrix Multiply

for i = 1:n
   for j = 1:n
      for k = 1:n
         C(i,j) = C(i,j) + A(i,k)*B(k,j)
      end
   end
end

The three loops can be nested in any of six orders: ijk, ikj, kij, kji, jki, jik. All perform the same arithmetic, C(i,j) = C(i,j) + A(i,k)*B(k,j), but with different memory access patterns. Which orderings are favorable depends on storage layout: Fortran stores matrices by column, C stores them by row.
However, loop order is only part of the story.

[Performance plot: SUN Ultra 2, 200 MHz (L1 = 16 KB, L2 = 1 MB); Mflop/s for the six orderings ijk, jki, kij, jik, kji, ikj and the vendor dgemm.]
13
Matrices in Cache
• L1 cache 16 KB: 16 KB / 8 bytes per double = 2048 doubles, so a square matrix of up to about 45 x 45 fits in L1
• L2 cache 2 MB: 2 MB / 8 bytes per double = 262,144 doubles, so a square matrix of up to about 512 x 512 fits in L2
14
15
Optimizing Matrix Addition for Caches
• Dimension A(n,n), B(n,n), C(n,n)
• A, B, C stored by column (as in Fortran)
• Algorithm 1:
  – for i = 1:n, for j = 1:n, A(i,j) = B(i,j) + C(i,j)
• Algorithm 2:
  – for j = 1:n, for i = 1:n, A(i,j) = B(i,j) + C(i,j)
• What is the “memory access pattern” for Algorithms 1 and 2?
• Which is faster?
• What if A, B, C are stored by row (as in C)?
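For concreteness, a small C sketch of the two algorithms (the names and the fixed size N are illustrative). Since C stores arrays by row, the comments describe the row-major case; with Fortran's column-major storage the conclusions swap:

#define N 1024

/* Algorithm 1: i outer, j inner. In row-major C the inner loop walks each
   row contiguously (stride 1), so consecutive accesses share cache lines. */
void add_alg1(double (*a)[N], double (*b)[N], double (*c)[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j] + c[i][j];
}

/* Algorithm 2: j outer, i inner. In row-major C the inner loop strides by
   N doubles, touching a new cache line on nearly every access. */
void add_alg2(double (*a)[N], double (*b)[N], double (*c)[N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j] + c[i][j];
}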
16
Using a Simpler Model of Memory to Optimize
• Assume just 2 levels in the hierarchy, fast and slow
• All data initially in slow memory
  – m  = number of memory elements (words) moved between fast and slow memory
  – tm = time per slow memory operation
  – f  = number of arithmetic operations
  – tf = time per arithmetic operation, tf < tm
  – q  = f/m = average number of flops per slow-memory access
• Minimum possible time = f*tf, when all data is in fast memory
• Actual time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
• Larger q means time closer to the minimum f*tf
17
Simple example using the memory model
• To see the effect of changing q, consider this simple computation:
    s = 0
    for i = 1 to n
       s = s + h(X[i])
• Assume tf = 1, i.e., the fast memory computes at 1 Mflop/s
• Assume moving one word of data costs tm = 10
• Assume h takes q flops
• Assume array X is in slow memory
• So m = n and f = q*n
• Time = read X + compute = 10*n + q*n
• Mflop/s = f/Time = q/(10 + q)
• As q increases, this approaches the "peak" speed of 1 Mflop/s (e.g., q = 10 gives 0.5 Mflop/s; q = 90 gives 0.9 Mflop/s)
18
Simple Example (continued)
• Algorithm 1
s1 = 0; s2 = 0
for j = 1 to n
s1 = s1+h1(X(j))
s2 = s2 + h2(X(j))
° Algorithm 2
s1 = 0; s2 = 0
for j = 1 to n
s1 = s1 + h1(X(j))
for j = 1 to n
s2 = s2 + h2(X(j))
° Which is faster?
19
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
   }

2 misses per access to a & c vs. one miss per access; improves temporal locality.
20
Optimizing Matrix Multiply for Caches
• Several techniques for making this faster on modern processors
  – heavily studied
• Some optimizations done automatically by the compiler, but can do much better
• In general, you should use optimized libraries (often supplied by vendor) for this and other very common linear algebra operations
  – BLAS = Basic Linear Algebra Subroutines
• Other algorithms you may want are not going to be supplied by the vendor, so need to know these techniques
21
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)

[Diagram: y(i) = y(i) + A(i,:) * x(:)]
22
Warm up: Matrix-vector multiplication y = y + A*x
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

° m = number of slow memory refs = 3n + n^2
° f = number of arithmetic operations = 2n^2
° q = f/m ~= 2
° Matrix-vector multiplication is limited by slow memory speed
23
Matrix Multiply C=C+A*B
for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
24
Matrix Multiply C = C + A*B (unblocked, or untiled)

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
25
Matrix Multiply (unblocked, or untiled)
Number of slow memory references on unblocked matrix multiply:
   m =   n^3    read each column of B n times
       + n^2    read each row of A once for each i
       + 2n^2   read and write each element of C once
     = n^3 + 3n^2
So q = f/m = 2n^3 / (n^3 + 3n^2) ~= 2 for large n: no improvement over matrix-vector multiply

[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
q=ops/slow mem ref
26
Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size

for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}

[Diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
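A minimal C sketch of the same tiled loop structure, assuming column-major storage and a square size n that is a multiple of the block size b (both assumptions for brevity; a full version would handle edge blocks):

/* Blocked (tiled) C = C + A*B for n x n column-major matrices.
   Each (i,j,k) iteration multiplies one b x b block of A by one b x b block
   of B and accumulates into a b x b block of C, so the three blocks can stay
   in cache while 2*b^3 flops are done on them. Assumes n % b == 0. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += b)
        for (int j = 0; j < n; j += b)
            for (int k = 0; k < n; k += b)
                /* multiply block A(i,k) by block B(k,j) into block C(i,j) */
                for (int jj = j; jj < j + b; jj++)
                    for (int kk = k; kk < k + b; kk++)
                        for (int ii = i; ii < i + b; ii++)
                            C[ii + jj * n] += A[ii + kk * n] * B[kk + jj * n];
}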
27
Matrix Multiply (blocked or tiled)
Why is this algorithm correct?
Number of slow memory references on blocked matrix multiply:
   m =   N*n^2   read each block of B N^3 times (N^3 * n/N * n/N)
       + N*n^2   read each block of A N^3 times
       + 2n^2    read and write each block of C once
     = (2*N + 2) * n^2
So q = f/m = 2n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n

So we can improve performance by increasing the block size b.
Can be much faster than matrix-vector multiply (q = 2).

Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3b^2 <= M, so q ~= b <= sqrt(M/3). (For example, with the 16 KB L1 cache above, M = 2048 doubles and b <= 26.)

Theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M))
q=ops/slow mem ref
28
Model
• As much as possible will be overlapped
• Dot product:
     ACC = 0
     do i = 1, n
        ACC = ACC + x(i)*y(i)
     end do
• Experiments done on an IBM RS6000/530
  – 25 MHz
  – 2 cycles to complete an FMA, which can be pipelined
    » => 50 Mflop/s peak
  – one cycle from cache
29
DOT Operation - Data in Cache
      DO 10 I = 1, n
         T = T + X(I)*Y(I)
10    CONTINUE

• Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data
• Pseudo-assembler:
         LOAD fp0,T
  label: LOAD fp1,X(I)
         LOAD fp2,Y(I)
         FMA fp0,fp0,fp1,fp2
         BRANCH label:

[Pipeline diagram: Load x, Load y, and FMA overlapped across iterations]
1 result per cycle = 25 Mflop/s
30
Matrix-Vector Product
• DOT version
      DO 20 I = 1, M
         DO 10 J = 1, N
            Y(I) = Y(I) + A(I,J)*X(J)
10       CONTINUE
20    CONTINUE
• From cache: 22.7 Mflops
• From memory: 12.4 Mflops
31
Loop Unrolling
      DO 20 I = 1, M, 2
         T1 = Y(I  )
         T2 = Y(I+1)
         DO 10 J = 1, N
            T1 = T1 + A(I,  J)*X(J)
            T2 = T2 + A(I+1,J)*X(J)
10       CONTINUE
         Y(I  ) = T1
         Y(I+1) = T2
20    CONTINUE

• 3 loads, 4 flops
• Speed of y = y + A^T*x, N = 48:

   Depth      1     2     3     4
   Speed     25.0  33.3  37.5  40.0   (peak 50)
   Measured  22.7  30.5  34.3  36.5
   Memory    12.4  12.7  12.7  12.6

• unroll 1: 2 loads : 2 ops per 2 cycles
• unroll 2: 3 loads : 4 ops per 3 cycles
• unroll 3: 4 loads : 6 ops per 4 cycles
• …
• unroll n: n+1 loads : 2n ops per n+1 cycles
• problem: only so many registers
32
Matrix Multiply
• DOT version - 25 Mflops in cache
      DO 30 J = 1, M
         DO 20 I = 1, M
            DO 10 K = 1, L
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
10          CONTINUE
20       CONTINUE
30    CONTINUE
33
How to Get Near Peak
      DO 30 J = 1, M, 2
         DO 20 I = 1, M, 2
            T11 = C(I,  J  )
            T12 = C(I,  J+1)
            T21 = C(I+1,J  )
            T22 = C(I+1,J+1)
            DO 10 K = 1, L
               T11 = T11 + A(I,  K)*B(K,J  )
               T12 = T12 + A(I,  K)*B(K,J+1)
               T21 = T21 + A(I+1,K)*B(K,J  )
               T22 = T22 + A(I+1,K)*B(K,J+1)
10          CONTINUE
            C(I,  J  ) = T11
            C(I,  J+1) = T12
            C(I+1,J  ) = T21
            C(I+1,J+1) = T22
20       CONTINUE
30    CONTINUE

• Inner loop: 4 loads, 8 operations, optimal
• In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache
34
BLAS -- Introduction
• Clarity: code is shorter and easier to read
• Modularity: gives the programmer larger building blocks
• Performance: manufacturers will provide tuned machine-specific BLAS
• Program portability: machine dependencies are confined to the BLAS
35
Memory Hierarchy
[Memory hierarchy: Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory]
• Key to high performance is effective use of the memory hierarchy
• True on all architectures
36
Level 1, 2 and 3 BLAS
• Level 1 BLAS: vector-vector operations
• Level 2 BLAS: matrix-vector operations
• Level 3 BLAS: matrix-matrix operations
37
More on BLAS (Basic Linear Algebra Subroutines)
• Industry standard interface (evolving)
• Vendors and others supply optimized implementations
• History
  – BLAS1 (1970s):
    » vector operations: dot product, saxpy (y = a*x + y), etc.
    » m = 2n, f = 2n, q ~ 1 or less
  – BLAS2 (mid 1980s):
    » matrix-vector operations: matrix-vector multiply, etc.
    » m = n^2, f = 2n^2, q ~ 2, less overhead
    » somewhat faster than BLAS1
  – BLAS3 (late 1980s):
    » matrix-matrix operations: matrix-matrix multiply, etc.
    » m >= 4n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
38
Why Higher Level BLAS?
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
   BLAS                    Memory refs   Flops    Flops / memory ref
   Level 1 (y = y + a*x)       3n          2n           2/3
   Level 2 (y = y + A*x)       n^2         2n^2         2
   Level 3 (C = C + A*B)       4n^2        2n^3         n/2
[Memory hierarchy: Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory]
39
BLAS for Performance
• Development of blocked algorithms important for performance
IBM RS/6000-590 (66 MHz, 264 Mflop/s Peak)
[Plot: Mflop/s (0 to 250) vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS.]
40
BLAS for Performance
• Development of blocked algorithms important for performance
Alpha EV 5/6 500MHz (1Gflop/s peak)
[Plot: Mflop/s (0 to 700) vs. order of vectors/matrices (10 to 500) for Level 1, Level 2, and Level 3 BLAS.]
BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors)
Fast linear algebra kernels: BLAS
• Simple linear algebra kernels such as matrix-matrix multiply
• More complicated algorithms can be built from these basic kernels.
• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS).
• Early agreement on a standard interface (~1980)
• Led to portable libraries for vector and shared memory parallel machines
• On distributed memory, there is a less standard interface called the PBLAS
Level 1 BLAS
• Operate on vectors or pairs of vectors
  – perform O(n) operations;
  – return either a vector or a scalar
• saxpy
  – y(i) = a * x(i) + y(i), for i = 1 to n
  – s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
• sscal: y = a * x, for scalar a and vectors x, y
• sdot computes s = sum over i = 1..n of x(i)*y(i)
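For illustration, here is how the double-precision versions are typically called through the C (CBLAS) interface; the header name and linkage vary by platform, so treat this as a sketch:

#include <cblas.h>   /* header name varies by vendor/platform */

void level1_examples(int n, double a, double *x, double *y)
{
    cblas_daxpy(n, a, x, 1, y, 1);          /* y = a*x + y   (daxpy)        */
    cblas_dscal(n, a, x, 1);                /* x = a*x       (dscal)        */
    double s = cblas_ddot(n, x, 1, y, 1);   /* s = sum_i x(i)*y(i)  (ddot)  */
    (void)s;
}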
Level 2 BLAS
• Operate on a matrix and a vector
  – return a matrix or a vector
  – O(n^2) operations
• sgemv: matrix-vector multiply
  – y = y + A*x
  – where A is m-by-n, x is n-by-1 and y is m-by-1
• sger: rank-one update
  – A = A + y*x^T, i.e., A(i,j) = A(i,j) + y(i)*x(j)
  – where A is m-by-n, y is m-by-1, x is n-by-1
• strsv: triangular solve
  – solves y = T*x for x, where T is triangular
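A corresponding CBLAS sketch for the Level 2 routines (double precision, column-major; the same caveats about the header apply):

#include <cblas.h>

void level2_examples(int m, int n, double *A, int lda, double *x, double *y)
{
    /* y = 1.0*A*x + 1.0*y  (dgemv): A is m-by-n, x is n-by-1, y is m-by-1 */
    cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, 1.0, A, lda, x, 1, 1.0, y, 1);

    /* A = A + y*x^T  (dger): rank-one update of the m-by-n matrix A */
    cblas_dger(CblasColMajor, m, n, 1.0, y, 1, x, 1, A, lda);
}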
Level 3 BLAS
• Operate on pairs or triples of matrices
  – returning a matrix
  – complexity is O(n^3)
• sgemm: matrix-matrix multiplication
  – C = C + A*B
  – where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm: multiple triangular solve
  – solves Y = T*X for X
  – where T is a triangular matrix and X is a rectangular matrix
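And a CBLAS sketch of the Level 3 workhorse (same caveats about the header):

#include <cblas.h>

/* C = 1.0*A*B + 1.0*C  (dgemm): C is m-by-n, A is m-by-k, B is k-by-n,
   all column-major with leading dimensions lda, ldb, ldc. */
void gemm_example(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, lda, B, ldb, 1.0, C, ldc);
}

Calling a tuned dgemm rather than writing the triple loop by hand is how libraries such as LAPACK obtain the q ~ b data reuse discussed earlier.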
45
Optimizing in practice
• Tiling for registers
  – loop unrolling, use of named "register" variables
• Tiling for multiple levels of cache
• Exploiting fine-grained parallelism within the processor
  – superscalar
  – pipelining
• Complicated compiler interactions
• Hard to do by hand (but you'll try)
• Automatic optimization is an active research area
  – PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
  – www.cs.berkeley.edu/~iyer/asci_slides.ps
  – ATLAS: www.netlib.org/atlas/index.html
46
BLAS -- References
• BLAS software and documentation can be obtained via:
– WWW: http://www.netlib.org/blas
– (anonymous) ftp ftp.netlib.org: cd blas; get index
– email netlib@www.netlib.org with the message: send index from blas
• Comments and questions can be addressed to: lapack@cs.utk.edu
47
BLAS Papers
• C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308--325, 1979.
• J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1--32, 1988.
• J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1--17, 1990.
Performance of BLAS
• BLAS are specially optimized by the vendor
  – Sun's BLAS uses features of the UltraSPARC
• Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1
• The top speed is that of the BLAS3
• Algorithms like Gaussian elimination are organized so that they use BLAS3
49
How To Get Performance From Commodity Processors?
• Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
• Routines have a large design space with many parameters
  – blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  – complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors
• A few months ago there was no tuned BLAS for the Pentium under Linux.
• Need for quick/dynamic deployment of optimized routines.
• ATLAS - Automatically Tuned Linear Algebra Software
  – PhiPac from Berkeley
[Diagram: on-chip multiply, C (M x N) += A (M x K) * B (K x N), blocked with block size NB]
Adaptive Approach for Level 3
• Do a parameter study of the operation on the target machine, done once
• The only generated code is the on-chip multiply
• BLAS operations are written in terms of the generated on-chip multiply
• All transpose cases are coerced through data copy to one case of the on-chip multiply
  – only 1 case generated per platform
51
Code Generation Strategy
• Code is iteratively generated and timed until the optimal case is found. We try:
  – differing NBs
  – breaking false dependencies
  – M, N and K loop unrolling
• The on-chip multiply optimizes for:
  – TLB access
  – L1 cache reuse
  – FP unit usage
  – memory fetch
  – register reuse
  – loop overhead minimization
• Takes a couple of hours to run.
52
500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures
[Bar chart: Mflop/s (0 to 700) for Vendor Matrix Multiply vs. ATLAS Matrix Multiply on DEC Alpha 21164a-433, HP PA8000 180 MHz, HP 9000/735/125, IBM Power2-135, IBM PowerPC 604e-332, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, SGI R5000, SGI R8000ip21, SGI R10000ip27, Sun Microsparc II Model 70, Sun Darwin-270, and Sun Ultra2 Model 2200.]
53
500 x 500 Double Precision LU Factorization Performance Across Multiple Architectures
[Bar chart: MFLOPS (0 to 600) for LU with Vendor BLAS vs. LU with ATLAS & GEMM-based BLAS on DCG LX 21164a-533, DEC Alpha 21164a-433, HP PA8000, IBM Power2-135, IBM PowerPC 604e-332, Pentium Pro-200, Pentium II-266, SGI R5000, SGI R10000ip27, Sun Darwin-270, and Sun Ultra2 Model 2200.]
54
500x500 gemm-based BLAS on SGI R10000ip28
[Bar chart: MFLOPS (0 to 300) for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM, comparing Vendor BLAS, ATLAS/SSBLAS, and Reference BLAS.]
55
500x500 gemm-based BLAS on UltraSparc 2200
[Bar chart: MFLOPS (0 to 300) by Level 3 BLAS routine (DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM), comparing Vendor BLAS, ATLAS/GEMM-based BLAS, and Reference BLAS.]
56
Recursive Approach for Other Level 3 BLAS
• Recur down to L1 cache block size
• Need kernel at bottom of recursion
– Use gemm-based kernel for portability
Recursive TRMM [recursion diagram]
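To make the recursion concrete, here is a rough C sketch of a recursive TRMM (B := L*B with L lower triangular, column-major). The cutoff, the naive base-case kernels, and all names are illustrative assumptions, not ATLAS's actual code; the point is that the off-diagonal work becomes a GEMM:

#define CUTOFF 32   /* assumed L1-sized base case */

/* Naive base-case TRMM: B (n x m) := L (n x n, lower triangular) * B.
   Row i of the result needs rows 0..i of the old B, so overwrite bottom-up. */
static void trmm_kernel(int n, int m, const double *L, int ldl, double *B, int ldb)
{
    for (int j = 0; j < m; j++)
        for (int i = n - 1; i >= 0; i--) {
            double s = 0.0;
            for (int k = 0; k <= i; k++)
                s += L[i + k * ldl] * B[k + j * ldb];
            B[i + j * ldb] = s;
        }
}

/* Naive base-case GEMM: C (n x m) += A (n x k) * B (k x m). */
static void gemm_kernel(int n, int m, int k, const double *A, int lda,
                        const double *B, int ldb, double *C, int ldc)
{
    for (int j = 0; j < m; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < n; i++)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

/* Recursive TRMM: split L = [L11 0; L21 L22] and B = [B1; B2], then
   B2 := L22*B2 (recursive TRMM), B2 += L21*B1 (GEMM), B1 := L11*B1. */
void trmm_recursive(int n, int m, const double *L, int ldl, double *B, int ldb)
{
    if (n <= CUTOFF) { trmm_kernel(n, m, L, ldl, B, ldb); return; }
    int h = n / 2;
    const double *L21 = L + h;             /* block starting at row h, col 0 */
    const double *L22 = L + h + h * ldl;   /* block starting at row h, col h */
    double *B1 = B, *B2 = B + h;
    trmm_recursive(n - h, m, L22, ldl, B2, ldb);
    gemm_kernel(n - h, m, h, L21, ldl, B1, ldb, B2, ldb);
    trmm_recursive(h, m, L, ldl, B1, ldb);
}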
57
500x500 Level 2 BLAS DGEMV
[Bar chart: MFLOPS (0 to 300) across architectures, comparing Vendor NoTrans, ATLAS NoTrans, and F77 NoTrans.]
58
Multi-Threaded DGEMM, Intel PIII 550 MHz
[Plot: Mflop/s (0 to 800) vs. matrix size for Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, and ATLAS 2 proc.]
59
ATLAS
• Keep a repository of kernels for specific machines.
• Develop a means of dynamically downloading code
• Extend work to allow sparse matrix operations
• Extend work to include arbitrary code segments
• See: http://www.netlib.org/atlas/
60
BLAS Technical Forum http://www.netlib.org/utk/papers/blast-forum.html
• Established a Forum to consider expanding the BLAS in light of modern software, language, and hardware developments.
• Minutes available from each meeting
• Working proposals for the following:
  – Dense/Band BLAS
  – Sparse BLAS
  – Extended Precision BLAS
  – Distributed Memory BLAS
  – C and Fortran90 interfaces to Legacy BLAS
61
Strassen’s Matrix Multiply
• The traditional algorithm (with or without tiling) has O(n^3) flops
• Strassen discovered an algorithm with asymptotically lower flop count: O(n^2.81)
• Consider a 2x2 matrix multiply, normally 8 multiplies; Strassen does it with 7
Let M = [m11 m12] = [a11 a12] * [b11 b12]
[m21 m22] [a21 a22] [b21 b22]
Let p1 = (a12 - a22) * (b21 + b22) p5 = a11 * (b12 - b22)
p2 = (a11 + a22) * (b11 + b22) p6 = a22 * (b21 - b11)
p3 = (a11 - a21) * (b11 + b12) p7 = (a21 + a22) * b11
p4 = (a11 + a12) * b22
Then m11 = p1 + p2 - p4 + p6
m12 = p4 + p5
m21 = p6 + p7
m22 = p2 - p3 + p5 - p7
Extends to nxn by divide&conquer
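A short C check of the 2x2 formulas against the ordinary 8-multiplication product (arbitrary test values, just to make the identities concrete):

#include <stdio.h>

int main(void) {
    double a11 = 1.5, a12 = -2.0, a21 = 0.25, a22 = 3.0;
    double b11 = 4.0, b12 = 0.5,  b21 = -1.0, b22 = 2.5;

    /* the seven Strassen products */
    double p1 = (a12 - a22) * (b21 + b22);
    double p2 = (a11 + a22) * (b11 + b22);
    double p3 = (a11 - a21) * (b11 + b12);
    double p4 = (a11 + a12) * b22;
    double p5 = a11 * (b12 - b22);
    double p6 = a22 * (b21 - b11);
    double p7 = (a21 + a22) * b11;

    /* Strassen combinations */
    double m11 = p1 + p2 - p4 + p6;
    double m12 = p4 + p5;
    double m21 = p6 + p7;
    double m22 = p2 - p3 + p5 - p7;

    /* ordinary 2x2 multiply, 8 multiplications */
    double c11 = a11*b11 + a12*b21, c12 = a11*b12 + a12*b22;
    double c21 = a21*b11 + a22*b21, c22 = a21*b12 + a22*b22;

    printf("Strassen: %g %g %g %g\n", m11, m12, m21, m22);
    printf("Direct:   %g %g %g %g\n", c11, c12, c21, c22);
    return 0;
}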
62
Strassen (continued)
T(n) = cost of multiplying n x n matrices
     = 7*T(n/2) + 18*(n/2)^2 = O(n^(log2 7)) = O(n^2.81)

° Available in several libraries
° Up to several times faster if n is large enough (100s)
° Needs more memory than the standard algorithm
° Can be less accurate because of roundoff error
° Current world's record is O(n^2.376...)
63
Summary
• Performance programming on uniprocessors requires
  – understanding of the memory system
    » levels, costs, sizes
  – understanding of fine-grained parallelism in the processor to produce a good instruction mix
• Blocking (tiling) is a basic approach that can be applied to many matrix algorithms
• Applies to uniprocessors and parallel processors
  – The technique works for any architecture, but choosing the block size b and other details depends on the architecture
• Similar techniques are possible on other data structures
64
Summary: Memory Hierarchy
• Virtual memory was controversial at the time:
can SW automatically manage 64KB across many programs?
– 1000X DRAM growth removed the controversy
• Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops):What does this mean to Compilers, Data structures, Algorithms?
65
   BLAS                    Memory refs   Flops    Flops / memory ref
   Level 1 (y = y + a*x)       3n          2n           2/3
   Level 2 (y = y + A*x)       n^2         2n^2         2
   Level 3 (C = C + A*B)       4n^2        2n^3         n/2
Performance = Effective Use of Memory Hierarchy
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
• Development of blocked algorithms important for performance
Level 1, 2 & 3 BLAS Intel PII 450MHz
[Plot: Mflop/s (0 to 400) vs. order of vectors/matrices (10 to 500) for Level 1, 2 and 3 BLAS.]
66
Engineering: SUN Enterprise
• Proc + mem card, I/O card
  – 16 cards of either type
  – All memory accessed over the bus, so symmetric
  – Higher bandwidth, higher latency bus
[Diagram: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (two processors with caches, a memory controller, and a bus interface per card) and I/O cards (bus interface/switch, SBUS slots, 2 FiberChannel, 100bT, SCSI).]
67
Engineering: Cray T3E
– Scale up to 1024 processors, 480 MB/s links
– Memory controller generates a request message for non-local references
– No hardware mechanism for coherence
  » SGI Origin etc. provide this
[Diagram: Cray T3E node, with processor and cache, local memory, memory controller and network interface (NI), attached to a switch with X, Y, Z links and external I/O.]
68
[Diagram: 3-bit hypercube with nodes labeled 000 through 111]
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  – HW close to programming model
  – synchronous ops
  – topology central (hypercube algorithms)
CalTech Cosmic Cube (Seitz, CACM, Jan. 1985)
69
Diminishing Role of Topology
• Shift to general links
  – DMA, enabling non-blocking ops
    » buffered by system at destination until recv
  – store & forward routing
• Diminishing role of topology
  – any-to-any pipelined routing
  – node-to-network interface dominates communication time
  – simplifies programming
  – allows richer design space
    » grids vs. hypercubes
• Message time over H hops: store & forward H x (T0 + n/B) vs. pipelined (cut-through) T0 + H + n/B
Intel iPSC/1 -> iPSC/2 -> iPSC/860
70
Example Intel Paragon
[Diagram: Intel Paragon node, with two i860 processors with L1 caches, a memory controller with 4-way interleaved DRAM, and a DMA-driven network interface (NI), all on a 64-bit, 50 MHz memory bus; nodes sit on a 2D grid network (8-bit links at 175 MHz, bidirectional) with a processing node attached to every switch.]
Sandia's Intel Paragon XP/S-based supercomputer
71
Building on the mainstream: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bandwidth limited by I/O bus)
[Diagram: IBM SP-2 node, with Power2 CPU, L2 cache and memory controller on the memory bus, 4-way interleaved DRAM; the network interface (i860, NI, DMA, DRAM) sits on the MicroChannel I/O bus; nodes are joined by a general interconnection network formed from 8-port switches.]
72
Berkeley NOW
• 100 Sun Ultra2 workstations
• Intelligent network interface
– proc + mem
• Myrinet Network
– 160 MB/s per link
– 300 ns per hop
73
Thanks
• These slides came in part from courses taught by the following people:
– Kathy Yelick, UC, Berkeley– Dave Patterson, UC, Berkeley– Randy Katz, UC, Berkeley– Craig Douglas, U of Kentucky
• Computer Architecture: A Quantitative Approach, Chapter 8, Hennessy and Patterson, Morgan Kaufmann Publishers.