High Performance Dense Linear Algebra on Spatially Distributed Processors
Jeffrey Diamond and Behnam Robatmili
Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger
Department of Computer Science, University of Texas at Austin
*Texas Advanced Computing Center, University of Texas at Austin
2
Trends in Chip Level Parallelism
Emerging architectures are more fine grained: on chip networks, precise control over communication, tight orchestration of computation across ALUs
Algorithmic insight from the most fine grained case
[Figure: spectrum from coarse grained to fine grained parallelism: Quad Core (MIMD), Cell, Tilera, TRIPS (SDU)]
3
Parallel Programming Paradigms
Programming occurs at many levels; trend towards an optimized library model
Special low level APIs for high performance; we're interested in these low level APIs
[Figure: software stack from high level API to low level API: Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.; dynamic run times / compilation; classic multithreading; high performance, low level libraries]
4
Case Study: Matrix Multiply
Implementing full scale DGEMM
High Performance Dense Linear Algebra Libraries (Level 3 BLAS) are layered on top of high performance Matrix Multiply kernels: SYMM, SYRK, TRSM, TRMM, etc.
  Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form
  Control theory: Sylvester equation, Lyapunov equation, and many, many others...
Regular operation is very amenable to algorithmic transformations and easy to reason about
5
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
  High Level Memory Management
  Low Level Blocking
  Inner Kernel
Optimizing Inner Kernel
Results
Conclusion
6
Spatially Distributed Uniprocessors (SDUs)
Single threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth…
Solution: SDU - partitioned register banks, functional units, …
  Still executing a single thread across multiple ALUs
  Where an instruction executes matters
  Program statically determines the location of instructions
  Examples include advanced VLIW processors in the embedded market
TRIPS partitions most aspects of a single core into tiles:
  Tiles connected by an on chip 2-D network
  Large number of distributed ALUs, registers, data ports
  Enormous aggregate bandwidth to registers and data, but…
  Communication between ALUs must go through the network
7
TRIPS - a modern SDU
8
TRIPS - a modern SDU
[Figure: TRIPS chip with Core 1, Core 2, and a shared L2]
9
TRIPS - a modern SDU
[Figure: one core's tile layout: register banks, L1 banks, L2 banks, and a grid of ALUs]
10
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
  High Level Memory Management
  Low Level Blocking
  Inner Kernel
Optimizing Inner Kernel
Results
Conclusion
11
Implementing Matrix Multiply
Outer level: Goto streaming algorithm
  At the heart of the GotoBLAS Linear Algebra Libraries
  Licensed by many of the top computer vendors
  Used by many supercomputers in the Top 500 list
Mid level: enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology
Inner kernel: novel algorithm suited to SDUs
12
Goto Streaming Algorithm
Classical blocking algorithm (C += AB): break matrices into square blocks just big enough for A, B and C to fit in the L1 cache
Goto: the L2 cache is actually fast enough to access directly from the inner kernel
Instead of small, square matrix blocks, use huge block-panel multiplies
Traversal order to maximize reuse
Stream full-sized panels of B and C directly out of DRAM
(see the loop sketch below)
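To make the streaming structure concrete, here is a minimal C sketch of the block-panel loop order described above, assuming row-major storage and illustrative block sizes MC, KC and NR; it omits the packing routines and the hand-tuned assembly kernel that GotoBLAS actually uses.

/* Sketch of a Goto-style block-panel GEMM, C += A * B (row-major).
   MC, KC, NR are illustrative placeholders, not tuned values. */
enum { MC = 128, KC = 512, NR = 16 };

static int imin(int a, int b) { return a < b ? a : b; }

void gemm_block_panel(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    for (int pc = 0; pc < k; pc += KC) {            /* K extent of the A block / B panel   */
        int kb = imin(KC, k - pc);
        for (int ic = 0; ic < m; ic += MC) {        /* MC x KC block of A: meant for L2    */
            int mb = imin(MC, m - ic);
            for (int jc = 0; jc < n; jc += NR) {    /* NR-wide slice of B: meant for L1    */
                int nb = imin(NR, n - jc);
                /* "Inner kernel": multiply the resident A block by the B slice,
                   streaming the matching slice of C straight from memory.       */
                for (int i = 0; i < mb; ++i)
                    for (int p = 0; p < kb; ++p) {
                        double a = A[(ic + i) * k + (pc + p)];
                        for (int j = 0; j < nb; ++j)
                            C[(ic + i) * n + (jc + j)] += a * B[(pc + p) * n + (jc + j)];
                    }
            }
        }
    }
}

In this sketch each B slice is reused across every row of the resident A block before the next slice is fetched, mirroring the traversal walked through in the later slides.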
13
Goto: High Level Blocking
[Figure: Original problem: C += A B, all dimensions in the thousands. High level blocking: C' += A' B' over panel slices, with the A' block (hundreds by hundreds) held in L2, the B' panel slices in L1, and C' streamed between DRAM and the registers.]
14
Enhancing Goto Algorithm
128 registers hold non-trivial sized blocks
The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)
Additionally store blocks of A in registers
  Bring in elements of A and B simultaneously and maximize bandwidth
  Maximize use of both horizontal and vertical network links
But to amortize use of the elements of A in registers, we need to add another level of low level blocking to the hierarchy
15
B’, C’ panel slices broken into mini-panels b’, c’ a’-block broken into mini-blocks, a’
a’ block and c mini panel held in registers 4x4 a’ amortized over 4x16 b’
Careful ordering of data movement preserves computational properties of larger block-panel multiply B slice stays in L1 for a LONG time, A stays even longer
A’C’ B’
(L2) (L1)(DRAM)
16 16444 4
+=Hundreds Hundreds
Low Level Blocking Scheme
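A minimal C sketch of the mini block-panel step just described, assuming row-major mini-tiles: a 4x4 a' is held in (ideally register-allocated) locals and amortized over a 4x16 b' streamed from L1, accumulating into a 4x16 c'. The function name and arguments are illustrative; the real TRIPS kernel is hand-scheduled.

/* Illustrative mini kernel: c' (4x16) += a' (4x4) * b' (4x16).
   lda, ldb, ldc are the row strides of the containing block/panels. */
void mini_kernel_4x4x16(const double *a, int lda,
                        const double *b, int ldb,
                        double *c, int ldc)
{
    double creg[4][16];                              /* c' accumulators kept in registers */
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 16; ++j)
            creg[i][j] = c[i * ldc + j];

    for (int p = 0; p < 4; ++p)
        for (int i = 0; i < 4; ++i) {
            double areg = a[i * lda + p];            /* a' element reused 16 times from a register */
            for (int j = 0; j < 16; ++j)
                creg[i][j] += areg * b[p * ldb + j]; /* b' row streamed from L1 */
        }

    for (int i = 0; i < 4; ++i)                      /* write c' back out */
        for (int j = 0; j < 16; ++j)
            c[i * ldc + j] = creg[i][j];
}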
16
How do we traverse?
A’C’
B’
128
512
128
512
X
• B’ slice fits in L1 cache• A’ block fits in L2 cache• C’ streams from DRAM
Load c’ and a’ blocks into Registers
+=
16164 44
17
A’C’
B’
128
512
128
1616
512
X
Stream b’(4x16) from L1 & multiply by a’(4x4)(Reuse a’ four times!)
+= B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
How do we traverse?
4 4
21
A’C’
B’
128
512
128
161651
2
X
Reuse register c’, next a’ right, next b’ below:
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
22
A’C’
B’
128
512
128
161651
2
X
Repeat until at bottom of B slice, right of A row
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
23
A’C’
B’
128
512
128
161651
2
X
Save c’s, load next row of a’ and c’, reuse entire B’ slice’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
24
A’C’
B’
128
512
128
161651
2
X
Repeat process over slice of B’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
25
A’C’
B’
128
512
128
161651
2
X
Continue over entire block of A’ and C’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
26
C’
B’
A’C’
B’
128
512
128
1616
X
Fetch next slice of B’ and move into next slice of C’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
27
A’C’
B’
128
512
128
1616
X
Complete B’, C’ Panels, load next A’ and repeat…
C’
B
C’
B
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
28
Defined Inner Kernel
[Figure: the full blocking hierarchy.
Original problem: C += A B, all dimensions in the thousands.
High level blocking: C' += A' B' over panel slices, with the A' block (hundreds by hundreds) in L2, the B' slices in L1, and C' streamed between DRAM and the registers.
Mini block-panel: c' (4x16, registers) += a' (4x4, registers) x b' (4x16, L1).]
29
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
Optimizing Inner Kernel
Results
Conclusion
30
Optimizing the Inner Kernel
Developed several optimization principles; first to apply these principles to TRIPS
Avoiding network contention is critical!
  A single overscheduled link can cut performance in half
  Avoided by datapath routing, direction oriented computation (DOC), register mirroring, data interleaving - got a 5x jump in Instructions Per Cycle, exceeding 10 IPC
Load balance every resource in the system
  In a loop, total performance is limited by the most used wire link or execution slot
  Loop body scaled to match register and data usage and to minimize architectural overheads
Results in "fragility" of optimization typical of spatial architectures with shared resources
31
Simplified Schedule
Step 1: Read A from the register files
Step 2: Load B and broadcast it across the rows
Step 3: Do the multiply, then add across the columns
Step 4: Write the results back to C
[Figure: 4x4 grid of ALUs fed by register tiles R0-R3, data tiles D0-D3, and the global tile GT]
32
What are the complications?
Every register use must be retrieved across the network
Every load and store needs to get an address
Need to interleave prefetching, writing, updating pointers, counters
Need to account for data movement instructions
(an illustrative interleaving sketch follows below)
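As an illustration of the bookkeeping listed above (not the actual TRIPS code, which additionally has to route every operand across the on-chip network), here is a hedged C sketch of an unrolled loop step that interleaves prefetching, pointer arithmetic, and the loop counter with the floating point work; __builtin_prefetch is the GCC/Clang intrinsic, and the prefetch distance is an arbitrary placeholder.

/* Illustrative only: interleaving prefetches, address arithmetic and the
   loop counter with the FLOPs in a 4-way unrolled dot product.
   Assumes n is a multiple of 4 to keep the sketch short. */
void dot_step_interleaved(const double *a, const double *b, double *acc, int n)
{
    double sum = *acc;
    for (int i = 0; i < n; i += 4) {                /* counter update             */
        __builtin_prefetch(a + i + 64);             /* prefetch well ahead of use */
        __builtin_prefetch(b + i + 64);
        sum += a[i + 0] * b[i + 0];                 /* FP work overlapped with    */
        sum += a[i + 1] * b[i + 1];                 /* address arithmetic and     */
        sum += a[i + 2] * b[i + 2];                 /* prefetch traffic           */
        sum += a[i + 3] * b[i + 3];
    }
    *acc = sum;                                     /* write the result back      */
}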
33
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
Optimizing Inner Kernel
Results
Conclusion
34
Comparison of FPC across major processors
[Bar chart: kernel FPC and DGEMM FPC (floating point operations per cycle, 0 to 7) for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS]
Execution bottlenecks: integer/network ops vs FLOPs; single operand per cycle
Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth
* Results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 34(3), 2008.
35
Performance vs Matrix Size
[Line chart: FPC (0 to 6) vs matrix size (0 to 4096) for DGEMM, the C Kernel + Goto, and the C Kernel without Goto]
36
Role of the Compiler
Kernel has 8x the performance of TRIPS C compiler code
Did exhaustive empirical studies to determine the individual performance contributions of optimizations and their interaction with the TRIPS compiler
TRIPS compiler does scheduling as a post process
Determined that the existing scheduler can handle orchestration well if the algorithm matches the topology: if the assembly for the inner loop is specified, the scheduler obtained 75% of total performance
Lesson: orchestration is not the difficult part
  Need to consider basic topology during compilation
  Blocking compilers and register clustering are active topics of research
  Annotations / hints to the compiler?
37
Conclusions
Fine grained architectures can boost single thread performance
Optimization principles we learned can be applied to many levels of architectural granularity, but they are critical for fine grained architectures
In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate
38
Thank You :)
Any Questions?
40
Back Up Slides
Just a list for now:
  Comparison of GotoBLAS against Atlas/LAPACK
  More detailed diagrams of the algorithm
  Other performance graphs
  Systolic array
  Diagrams of other canonical processors
41
Future work
Explore applicability of optimization principles beyond dense linear algebra, to irregular, control intensive algorithms
Quantify degree to which principles apply to coarser grained architectures (CMPs) and different memory topologies
42
Trends in Chip Level Parallelism
Multiple ways to exploit parallelism: Instruction/Thread/Data Level Parallelism; Coarse Grained vs Fine Grained
What's the programming model?
  High level paradigm of your choice…
  Dynamic compilation and run time systems
  Low level APIs for writing optimized libraries
  Likely need to rewrite applications
43
Trends in Computer Architecture
Emerging architectures are trending towards more fine grained control
  E.g. Intel Terascale, RAW, Tilera
  Tightly orchestrated computation
  On chip networks
  Precise control over communication
These represent a step down a path
Algorithmic insight can be gained by looking at the most fine grained examples
44
Spatially Distributed Uniprocessors
Scalability issues for both architectures and underlying technology: wire delay, power, issue width…
More and more components of microprocessors becoming distributed: partitioned register banks, functional units, …
SDU partitions all aspects of a single core into tiles
  Tiles connected by on chip 2-D network
  Large number of distributed registers, data ports
  Enormous aggregate bandwidth to registers and data, but…
  Communication between ALUs must go through the network
Key performance characteristic: Where an instruction executes matters!
45
TRIPS - a modern SDU
Grid of ALUs (16)
Large number of distributed registers
Large number of data ports
On chip 2-D mesh network
S-NUCA distributed L1 and L2 cache
46
TRIPS - a modern SDU
Potential Advantages for Matrix Multiply: large number of ALUs, precise placement of instructions
Not a MIMD machine
  Model of execution is block dataflow graphs
  Bring in graphs one at a time and execute
  Must also deal with data movement, registers, data bandwidth, control
47
Classical Matrix Multiply
Need to compute C = AB + C
Once just used a triply nested loop…
Want to amortize the O(n²) data movement over the 2n³ computation of matrix multiply
Break the A, B and C matrices into square blocks just small enough to fit A, B and C in the L1 cache
The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache
(a sketch follows below)
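A minimal C sketch of the classical scheme just described, assuming row-major storage and an illustrative block size NB chosen so that one NB x NB block each of A, B and C fits together in L1; for brevity it assumes the dimensions are multiples of NB.

/* Classical square blocking: C += A * B with NB x NB blocks (row-major).
   NB is an illustrative placeholder; m, n, k assumed multiples of NB. */
enum { NB = 32 };

void gemm_square_blocked(int m, int n, int k,
                         const double *A, const double *B, double *C)
{
    for (int ib = 0; ib < m; ib += NB)
        for (int jb = 0; jb < n; jb += NB)
            for (int pb = 0; pb < k; pb += NB)
                /* Inner kernel: one NB x NB block of C, with each C element
                   cached in a register while A and B are read from L1. */
                for (int i = ib; i < ib + NB; ++i)
                    for (int j = jb; j < jb + NB; ++j) {
                        double cij = C[i * n + j];
                        for (int p = pb; p < pb + NB; ++p)
                            cij += A[i * k + p] * B[p * n + j];
                        C[i * n + j] = cij;
                    }
}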
48
Performance for thin panels
[Chart: Performance vs Panel Thickness. FPC (0 to 6) vs k (0 to 4096, with m = n = 4096), for C(m x n) = A(m x k) x B(k x n)]
49
Goto’s Streaming Algorithm
Classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in L1 cache
Goto realized the L2 cache is actually fast enough to access directly from the inner kernel!
  Use most of the L2 cache for a giant block of A
  Inner kernel uses all levels of the memory hierarchy simultaneously
  Cache large slices of the B panel in the L1 cache, cache a small piece of C in registers
Instead of square matrix blocks, use block-panel multiplies, with a traversal order to maximize reuse
  Stream full-sized contiguous panels of B and C directly out of DRAM
Use extremely optimized hand tuned assembly
50
Methodology
So we compiled code using the TRIPS compiler.
And we ran it on a hardware prototype.
We kept making changes and seeing how fast it ran.
We made notes of the changes.
We made graphs from the notes.
We made slides based on the graphs.
We made conclusions based on the slides.
It's 130nm and 366 MHz, but that's OK.
51
Controlling The Cache
[Figure: C += A x B, with the A block 128x512]
• B slice fits in L1 cache
• A block fits in L2 cache
• C chunks from L2
How do we keep B in L1 cache while streaming all of A through?
52
A Buffer Size
Effect of Dimensions of A Buffer (same area)
[Chart: FPC (0 to 6) vs m = n = k (0 to 4096) for A buffer shapes 512x128, 256x256, and 128x512]
53
Block Panel Multiply
[Figure: C += A x B block-panel multiply]
Doing multiple GEMDOTS in parallel.