High Performance Dense Linear Algebra on Spatially Distributed Processors
Jeffrey Diamond and Behnam Robatmili
Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger
Department of Computer Science, University of Texas at Austin
*Texas Advanced Computing Center, University of Texas at Austin
2
Trends in Chip Level Parallelism
Emerging architectures are more fine grained: on chip networks, precise control over communication, tight orchestration of computation across ALUs
Algorithmic insight from the most fine grained case
[Figure: spectrum from coarse grained to fine grained parallelism: Quad Core (MIMD), Cell, Tilera, TRIPS (SDU)]
3
Parallel Programming Paradigms
Programming occurs at many levels; trend towards an optimized library model
Special low level APIs for high performance; we're interested in these low level APIs
[Figure: software stack from high level API to low level API: Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.; dynamic run times / compilation; classic multithreading; high performance, low level libraries]
4
Case Study: Matrix Multiply
Implementing full scale DGEMM
High Performance Dense Linear Algebra Libraries (Level 3 BLAS) are layered on top of high performance Matrix Multiply kernels: SYMM, SYRK, TRSM, TRMM, etc.
  Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form
  Control theory: Sylvester equation, Lyapunov equation, and many, many others...
Regular operation is very amenable to algorithmic transformations and easy to reason about
5
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
  High Level Memory Management
  Low Level Blocking
  Inner Kernel
Optimizing Inner Kernel
Results
Conclusion
6
Spatially Distributed Uniprocessors (SDUs)
Single threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth…
Solution: SDU - partitioned register banks, functional units, …
  Still executing a single thread across multiple ALUs
  Where an instruction executes matters
  Program statically determines the location of instructions
  Examples include advanced VLIW processors in the embedded market
TRIPS partitions most aspects of a single core into tiles:
  Tiles connected by an on chip 2-D network
  Large number of distributed ALUs, registers, data ports
  Enormous aggregate bandwidth to registers and data, but…
  Communication between ALUs must go through the network
7
TRIPS - a modern SDU
8
TRIPS - a modern SDU
[Figure: TRIPS chip with Core 1, Core 2, and a shared L2]
9
TRIPS - a modern SDU
[Figure: one core's tile layout: register banks, L1 banks, L2 banks, and a grid of ALUs]
10
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
  High Level Memory Management
  Low Level Blocking
  Inner Kernel
Optimizing Inner Kernel
Results
Conclusion
11
Implementing Matrix Multiply
Outer level: Goto streaming algorithm
  At the heart of the GotoBLAS Linear Algebra Libraries
  Licensed by many of the top computer vendors
  Used by many supercomputers in the Top 500 list
Mid level: enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology
Inner kernel: novel algorithm suited to SDUs
12
Goto Streaming Algorithm
Classical blocking algorithm (C += AB): break matrices into square blocks just big enough for A, B and C to fit in the L1 cache
Goto: the L2 cache is actually fast enough to access directly from the inner kernel
Instead of small, square matrix blocks, use huge block-panel multiplies
Traversal order to maximize reuse
Stream full-sized panels of B and C directly out of DRAM
(see the loop sketch below)
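To make the streaming structure concrete, here is a minimal C sketch of the block-panel loop order described above, assuming row-major storage and illustrative block sizes MC, KC and NR; it omits the packing routines and the hand-tuned assembly kernel that GotoBLAS actually uses.

/* Sketch of a Goto-style block-panel GEMM, C += A * B (row-major).
   MC, KC, NR are illustrative placeholders, not tuned values. */
enum { MC = 128, KC = 512, NR = 16 };

static int imin(int a, int b) { return a < b ? a : b; }

void gemm_block_panel(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    for (int pc = 0; pc < k; pc += KC) {            /* K extent of the A block / B panel   */
        int kb = imin(KC, k - pc);
        for (int ic = 0; ic < m; ic += MC) {        /* MC x KC block of A: meant for L2    */
            int mb = imin(MC, m - ic);
            for (int jc = 0; jc < n; jc += NR) {    /* NR-wide slice of B: meant for L1    */
                int nb = imin(NR, n - jc);
                /* "Inner kernel": multiply the resident A block by the B slice,
                   streaming the matching slice of C straight from memory.       */
                for (int i = 0; i < mb; ++i)
                    for (int p = 0; p < kb; ++p) {
                        double a = A[(ic + i) * k + (pc + p)];
                        for (int j = 0; j < nb; ++j)
                            C[(ic + i) * n + (jc + j)] += a * B[(pc + p) * n + (jc + j)];
                    }
            }
        }
    }
}

In this sketch each B slice is reused across every row of the resident A block before the next slice is fetched, mirroring the traversal walked through in the later slides.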
13
Goto: High Level Blocking
[Figure: Original problem: C += A B, all dimensions in the thousands. High level blocking: C' += A' B' over panel slices, with the A' block (hundreds by hundreds) held in L2, the B' panel slices in L1, and C' streamed between DRAM and the registers.]
14
Enhancing Goto Algorithm
128 registers hold non-trivial sized blocks
The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)
Additionally store blocks of A in registers
  Bring in elements of A and B simultaneously and maximize bandwidth
  Maximize use of both horizontal and vertical network links
But to amortize use of the elements of A in registers, we need to add another level of low level blocking to the hierarchy
15
B’, C’ panel slices broken into mini-panels b’, c’ a’-block broken into mini-blocks, a’
a’ block and c mini panel held in registers 4x4 a’ amortized over 4x16 b’
Careful ordering of data movement preserves computational properties of larger block-panel multiply B slice stays in L1 for a LONG time, A stays even longer
A’C’ B’
(L2) (L1)(DRAM)
16 16444 4
+=Hundreds Hundreds
Low Level Blocking Scheme
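A minimal C sketch of the mini block-panel step just described, assuming row-major mini-tiles: a 4x4 a' is held in (ideally register-allocated) locals and amortized over a 4x16 b' streamed from L1, accumulating into a 4x16 c'. The function name and arguments are illustrative; the real TRIPS kernel is hand-scheduled.

/* Illustrative mini kernel: c' (4x16) += a' (4x4) * b' (4x16).
   lda, ldb, ldc are the row strides of the containing block/panels. */
void mini_kernel_4x4x16(const double *a, int lda,
                        const double *b, int ldb,
                        double *c, int ldc)
{
    double creg[4][16];                              /* c' accumulators kept in registers */
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 16; ++j)
            creg[i][j] = c[i * ldc + j];

    for (int p = 0; p < 4; ++p)
        for (int i = 0; i < 4; ++i) {
            double areg = a[i * lda + p];            /* a' element reused 16 times from a register */
            for (int j = 0; j < 16; ++j)
                creg[i][j] += areg * b[p * ldb + j]; /* b' row streamed from L1 */
        }

    for (int i = 0; i < 4; ++i)                      /* write c' back out */
        for (int j = 0; j < 16; ++j)
            c[i * ldc + j] = creg[i][j];
}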
16
How do we traverse?
A’C’
B’
128
512
128
512
X
• B’ slice fits in L1 cache• A’ block fits in L2 cache• C’ streams from DRAM
Load c’ and a’ blocks into Registers
+=
16164 44
17
A’C’
B’
128
512
128
1616
512
X
Stream b’(4x16) from L1 & multiply by a’(4x4)(Reuse a’ four times!)
+= B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
How do we traverse?
4 4
21
A’C’
B’
128
512
128
161651
2
X
Reuse register c’, next a’ right, next b’ below:
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
22
A’C’
B’
128
512
128
161651
2
X
Repeat until at bottom of B slice, right of A row
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
23
A’C’
B’
128
512
128
161651
2
X
Save c’s, load next row of a’ and c’, reuse entire B’ slice’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
24
A’C’
B’
128
512
128
161651
2
X
Repeat process over slice of B’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
25
A’C’
B’
128
512
128
161651
2
X
Continue over entire block of A’ and C’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
26
C’
B’
A’C’
B’
128
512
128
1616
X
Fetch next slice of B’ and move into next slice of C’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
27
A’C’
B’
128
512
128
1616
X
Complete B’, C’ Panels, load next A’ and repeat…
C’
B
C’
B
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
28
Defined Inner Kernel
[Figure: the full blocking hierarchy.
Original problem: C += A B, all dimensions in the thousands.
High level blocking: C' += A' B' over panel slices, with the A' block (hundreds by hundreds) in L2, the B' slices in L1, and C' streamed between DRAM and the registers.
Mini block-panel: c' (4x16, registers) += a' (4x4, registers) x b' (4x16, L1).]
29
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
Optimizing Inner Kernel
Results
Conclusion
30
Optimizing the Inner Kernel
Developed several optimization principles; first to apply these principles to TRIPS
Avoiding network contention is critical!
  A single overscheduled link can cut performance in half
  Avoided by datapath routing, direction oriented computation (DOC), register mirroring, data interleaving - got a 5x jump in Instructions Per Cycle, exceeding 10 IPC
Load balance every resource in the system
  In a loop, total performance is limited by the most used wire link or execution slot
  Loop body scaled to match register and data usage and to minimize architectural overheads
Results in "fragility" of optimization typical of spatial architectures with shared resources
31
Simplified Schedule
Step 1: Read A from the register files
Step 2: Load B and broadcast it across the rows
Step 3: Do the multiply, then add across the columns
Step 4: Write the results back to C
[Figure: 4x4 grid of ALUs fed by register tiles R0-R3, data tiles D0-D3, and the global tile GT]
32
What are the complications?
Every register use must be retrieved across the network
Every load and store needs to get an address
Need to interleave prefetching, writing, updating pointers, counters
Need to account for data movement instructions
(an illustrative interleaving sketch follows below)
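As an illustration of the bookkeeping listed above (not the actual TRIPS code, which additionally has to route every operand across the on-chip network), here is a hedged C sketch of an unrolled loop step that interleaves prefetching, pointer arithmetic, and the loop counter with the floating point work; __builtin_prefetch is the GCC/Clang intrinsic, and the prefetch distance is an arbitrary placeholder.

/* Illustrative only: interleaving prefetches, address arithmetic and the
   loop counter with the FLOPs in a 4-way unrolled dot product.
   Assumes n is a multiple of 4 to keep the sketch short. */
void dot_step_interleaved(const double *a, const double *b, double *acc, int n)
{
    double sum = *acc;
    for (int i = 0; i < n; i += 4) {                /* counter update             */
        __builtin_prefetch(a + i + 64);             /* prefetch well ahead of use */
        __builtin_prefetch(b + i + 64);
        sum += a[i + 0] * b[i + 0];                 /* FP work overlapped with    */
        sum += a[i + 1] * b[i + 1];                 /* address arithmetic and     */
        sum += a[i + 2] * b[i + 2];                 /* prefetch traffic           */
        sum += a[i + 3] * b[i + 3];
    }
    *acc = sum;                                     /* write the result back      */
}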
33
Talk Outline
Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
Optimizing Inner Kernel
Results
Conclusion
34
Comparison of FPC across major processors
[Bar chart: kernel FPC and DGEMM FPC (floating point operations per cycle, 0 to 7) for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS]
Execution bottlenecks: integer/network ops vs FLOPs; single operand per cycle
Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth
* Results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 34(3), 2008.
35
Performance vs Matrix Size
[Line chart: FPC (0 to 6) vs matrix size (0 to 4096) for DGEMM, the C Kernel + Goto, and the C Kernel without Goto]
36
Role of the Compiler
Kernel has 8x the performance of TRIPS C compiler code
Did exhaustive empirical studies to determine the individual performance contributions of optimizations and their interaction with the TRIPS compiler
TRIPS compiler does scheduling as a post process
Determined that the existing scheduler can handle orchestration well if the algorithm matches the topology: if the assembly for the inner loop is specified, the scheduler obtained 75% of total performance
Lesson: orchestration is not the difficult part
  Need to consider basic topology during compilation
  Blocking compilers and register clustering are active topics of research
  Annotations / hints to the compiler?
37
Conclusions
Fine grained architectures can boost single thread performance
Optimization principles we learned can be applied to many levels of architectural granularity, but they are critical for fine grained architectures
In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate
38
Thank You :)
Any Questions?
40
Back Up Slides
Just a list for now:
  Comparison of GotoBLAS against Atlas/LAPACK
  More detailed diagrams of the algorithm
  Other performance graphs
  Systolic array
  Diagrams of other canonical processors
41
Future work
Explore applicability of optimization principles beyond dense linear algebra, to irregular, control intensive algorithms
Quantify degree to which principles apply to coarser grained architectures (CMPs) and different memory topologies
42
Trends in Chip Level Parallelism
Multiple ways to exploit parallelism: Instruction/Thread/Data Level Parallelism; Coarse Grained vs Fine Grained
What's the programming model?
  High level paradigm of your choice…
  Dynamic compilation and run time systems
  Low level APIs for writing optimized libraries
  Likely need to rewrite applications
43
Trends in Computer Architecture
Emerging architectures are trending towards more fine grained control
  E.g. Intel Terascale, RAW, Tilera
  Tightly orchestrated computation
  On chip networks
  Precise control over communication
These represent a step down a path
Algorithmic insight can be gained by looking at the most fine grained examples
44
Spatially Distributed Uniprocessors
Scalability issues for both architectures and underlying technology: wire delay, power, issue width…
More and more components of microprocessors becoming distributed: partitioned register banks, functional units, …
SDU partitions all aspects of a single core into tiles
  Tiles connected by on chip 2-D network
  Large number of distributed registers, data ports
  Enormous aggregate bandwidth to registers and data, but…
  Communication between ALUs must go through the network
Key performance characteristic: Where an instruction executes matters!
45
TRIPS - a modern SDU
Grid of ALUs (16)
Large number of distributed registers
Large number of data ports
On chip 2-D mesh network
S-NUCA distributed L1 and L2 cache
46
TRIPS - a modern SDU
Potential Advantages for Matrix Multiply: large number of ALUs, precise placement of instructions
Not a MIMD machine
  Model of execution is block dataflow graphs
  Bring in graphs one at a time and execute
  Must also deal with data movement, registers, data bandwidth, control
47
Classical Matrix Multiply
Need to compute C = AB + C
Once just used a triply nested loop…
Want to amortize the O(n²) data movement over the 2n³ computation of matrix multiply
Break the A, B and C matrices into square blocks just small enough to fit A, B and C in the L1 cache
The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache
(a sketch follows below)
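A minimal C sketch of the classical scheme just described, assuming row-major storage and an illustrative block size NB chosen so that one NB x NB block each of A, B and C fits together in L1; for brevity it assumes the dimensions are multiples of NB.

/* Classical square blocking: C += A * B with NB x NB blocks (row-major).
   NB is an illustrative placeholder; m, n, k assumed multiples of NB. */
enum { NB = 32 };

void gemm_square_blocked(int m, int n, int k,
                         const double *A, const double *B, double *C)
{
    for (int ib = 0; ib < m; ib += NB)
        for (int jb = 0; jb < n; jb += NB)
            for (int pb = 0; pb < k; pb += NB)
                /* Inner kernel: one NB x NB block of C, with each C element
                   cached in a register while A and B are read from L1. */
                for (int i = ib; i < ib + NB; ++i)
                    for (int j = jb; j < jb + NB; ++j) {
                        double cij = C[i * n + j];
                        for (int p = pb; p < pb + NB; ++p)
                            cij += A[i * k + p] * B[p * n + j];
                        C[i * n + j] = cij;
                    }
}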
48
Performance for thin panels
[Chart: Performance vs Panel Thickness. FPC (0 to 6) vs k (0 to 4096, with m = n = 4096), for C(m x n) = A(m x k) x B(k x n)]
49
Goto’s Streaming Algorithm
Classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in L1 cache
Goto realized the L2 cache is actually fast enough to access directly from the inner kernel!
  Use most of the L2 cache for a giant block of A
  Inner kernel uses all levels of the memory hierarchy simultaneously
  Cache large slices of the B panel in the L1 cache, cache a small piece of C in registers
Instead of square matrix blocks, use block-panel multiplies, with a traversal order to maximize reuse
  Stream full-sized contiguous panels of B and C directly out of DRAM
Use extremely optimized hand tuned assembly
50
Methodology
So we compiled code using the TRIPS compiler.
And we ran it on a hardware prototype.
We kept making changes and seeing how fast it ran.
We made notes of the changes.
We made graphs from the notes.
We made slides based on the graphs.
We made conclusions based on the slides.
It's 130nm and 366 MHz, but that's OK.
51
Controlling The Cache
[Figure: C += A x B, with the A block 128x512]
• B slice fits in L1 cache
• A block fits in L2 cache
• C chunks from L2
How do we keep B in L1 cache while streaming all of A through?
52
A Buffer Size
Effect of Dimensions of A Buffer (same area)
[Chart: FPC (0 to 6) vs m = n = k (0 to 4096) for A buffer shapes 512x128, 256x256, and 128x512]
53
Block Panel Multiply
[Figure: C += A x B block-panel multiply]
Doing multiple GEMDOTS in parallel.