43
Carnegie Mellon Supported by DARPA, ONR, NSF, Intel, Mercury Markus Püschel Computer Science …and the Spiral team (only part shown) Spiral Computer Synthesis of Computational Programs Joint work with … Franz Franchetti Yevgen Voronenko Srinivas Chellappa Frédéric de Mesmay Daniel McFarlin José Moura James Hoe

Spiral Presentation Template - Department of Computer Science

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Supported by DARPA, ONR, NSF, Intel, Mercury

Markus Püschel

Computer Science

…and the Spiral team (only part shown)

Spiral Computer Synthesis of Computational Programs

Joint work with … Franz Franchetti

Yevgen Voronenko Srinivas Chellappa

Frédéric de Mesmay Daniel McFarlin

José Moura James Hoe

TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAA

Page 2: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Audio/image/video processing

Scientific Computing

Physics/biology simulations

Consumer Computing

Computing Unlimited need for performance

Large set of applications, but …

Relatively small set of critical components (100s to 1000s)

Matrix multiplication

Discrete Fourier transform

Viterbi decoder

Filter/stencil

…. Embedded Computing

Signal processing, communication, control

Page 3: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

The Problem: Example DFT

0

5

10

15

20

25

30

35

40

16 64 256 1k 4k 16k 64k 256k 1M

DFT on Intel Core i7 (4 Cores, 2.66 GHz) Performance [Gflop/s]

Fastest program

Same number of operations

Best compiler

12x

35x

Direct implementation

Page 4: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

The Problem Is Everywhere

0

10

20

30

40

50

60

70

6 12 18 24 30 36 42 48 54

WiFi Receiver Performance [Mbit/s]

Matrix multiplication Performance [Gflop/s]

0

10

20

30

40

50

0 2000 4000 6000 8000 10000

160x 30x

Why is that?

Page 5: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

DFT: Analysis

Compiler doesn’t do it

Doing by hand: Very tough

0

5

10

15

20

25

30

35

40

16 64 256 1k 4k 16k 64k 256k 1M

DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz) Performance [Gflop/s]

3x

3x

5x

locality optimization

vectorization

threading

Page 6: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

And There Is Not Only Intel …

Source: IEEE SP Magazine, Vol. 26, November 2009

Core i7

Nvidia G200

TI TNETV3020 Tilera Tile64

Arm Cortex A9

Page 7: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Numerical problem

Computing platform

algorithm selection

compilation

hu

man

eff

ort

au

tom

ate

d

implementation

C program

auto

mat

ed

Numerical problem

Computing platform

Current Future

C code a singularity:

• Compiler has no access to high level information

• No structural optimization

• No evaluation of choices

Challenge: conquer the high abstraction level for complete automation

algorithm selection

compilation

implementation

Page 8: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Current Solution

Algorithm knowledge Processor knowledge

Optimal program (Repeated for every processor)

Page 9: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Our Research: The Computer Writes the Program

Optimal program (Regenerated for every processor)

Automation:

Spiral

Algorithm knowledge Processor knowledge

Page 10: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

“Computer Writes the Program”

Computer writes close to optimal code

Vectorized, parallelized, etc.

“click” “click”

Page 11: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Organization

Spiral: Basic system

Parallelism

General input size

Results

Final remarks

Page 12: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Linear Transforms

T =DFTn = [e¡2k`¼i=n]0·k;`<n

x=

0BBB@

x0x1...

xn¡1

1CCCA

0BBB@

y0y1...

yn¡1

1CCCA= y = Tx

Output

T ¢Input

Example:

Page 13: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Algorithms: Example FFT, n = 4

SPL (Signal processing language): Mathematical, declarative, point-free

Divide-and-conquer algorithms = breakdown rules in SPL

Fast Fourier transform (FFT)

Representation using matrix algebra

Page 14: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Decomposition Rules (>200 for >40 Transforms)

DFTn ! P>k=2;2m³DFT2m©

³Ik=2¡1 ­i C2m rDFT2m(i=k)

´´ ³RDFT0k­Im

´; k even;

¯̄¯̄¯̄¯̄¯

RDFTnRDFT0nDHTnDHT0n

¯̄¯̄¯̄¯̄¯! (P>k=2;m ­ I2)

0BBB@

¯̄¯̄¯̄¯̄¯

RDFT2m

RDFT02mDHT2mDHT02m

¯̄¯̄¯̄¯̄¯©

0BBB@Ik=2¡1 ­i D2m

¯̄¯̄¯̄¯̄¯

rDFT2m(i=k)

rDFT2m(i=k)

rDHT2m(i=k)

rDHT2m(i=k)

¯̄¯̄¯̄¯̄¯

1CCCA

1CCCA

0BBB@

¯̄¯̄¯̄¯̄¯

RDFT0kRDFT0kDHT0kDHT0k

¯̄¯̄¯̄¯̄¯­ Im

1CCCA ; k even;

¯̄¯̄¯rDFT2n(u)

rDHT2n(u)

¯̄¯̄¯! L2nm

ÃIk ­i

¯̄¯̄¯rDFT2m((i+ u)=k)

rDHT2m((i+ u)=k)

¯̄¯̄¯

! ï̄¯̄¯rDFT2k(u)

rDHT2k(u)

¯̄¯̄¯­ Im

!;

RDFT-3n ! (Q>k=2;m ­ I2) (Ik ­i rDFT2m)(i+1=2)=k)) (RDFT-3k­Im) ; k even;

DCT-2n ! P>k=2;2m³DCT-22mK2m

2 ©³Ik=2¡1 ­N2mRDFT-3

>2m

´´Bn(L

n=2

k=2­ I2)(Im ­RDFT0k)Qm=2;k;

DCT-3n ! DCT-2>n ;

DCT-4n ! Q>k=2;2m³Ik=2 ­N2mRDFT-3

>2m

´B0n(L

n=2

k=2­ I2)(Im ­RDFT-3k)Qm=2;k:

DFTn ! (DFTk­ Im)Tnm(Ik­DFTm)Lnk ; n = km

DFTn ! Pn(DFTk­DFTm)Qn; n = km; gcd(k;m) = 1

DFTp ! RTp (I1©DFTp¡1)Dp(I1©DFTp¡1)Rp; p prime

DCT-3n ! (Im©Jm)Lnm(DCT-3m(1=4)©DCT-3m(3=4))

¢(F2­ Im)

"Im 0©¡ Jm¡1

1p2(I1©2 Im)

#; n = 2m

DCT-4n ! SnDCT-2n diag0·k<n(1=(2 cos((2k+1)¼=4n)))

IMDCT2m ! (Jm© Im© Im© Jm)

ÃÃ"1

¡1

#­ Im

!©Ã"¡1¡1

#­ Im

!!J2mDCT-42m

WHT2k

!tY

i=1

(I2k1+¢¢¢+ki¡1 ­WHT

2ki­ I

2ki+1+¢¢¢+kt

); k = k1+ ¢ ¢ ¢+ kt

DFT2 ! F2

DCT-22 ! diag(1;1=p2)F2

DCT-42 ! J2R13¼=8

Decomposition rules = Algorithm knowledge in Spiral

(from ≈100 publications)

Combining these rules yields many algorithms for every given transform

Page 15: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

SPL to Code

Correct code: easy fast code: very difficult

Page 16: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Program Generation in Spiral Transform

C Program

Algorithm (SPL)

Algorithm (∑-SPL)

Decomposition rules

void sub(double *y, double *x) { double f0, f1, f2, f3, f4, f7, f8, f10, f11; f0 = x[0] - x[3];

f1 = x[0] + x[3]; f2 = x[1] - x[2]; f3 = x[1] + x[2]; f4 = f1 - f3; y[0] = f1 + f3; y[2] = 0.7071067811865476 * f4; f7 = 0.9238795325112867 * f0;

< more lines>

+ Search or Learning

parallelization vectorization

locality optimization

basic block optimizations

Page 17: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Organization

Spiral: Basic system

Parallelism

General input size

Results

Final remarks

Page 18: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Example: Vectorization in Spiral

Relationship SPL expressions ↔ vectorization?

y =³DFT2­ I4

´xy =DFT2x

one addition one subtraction

one (4-way) vector addition one (4-way) vector subtraction

Page 19: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Step 1: Identify “Good” Vector Constructs

Vector length:

Good (= easily vectorizable) SPL constructs:

Idea: Convert a given SPL expression into a “good” SPL expression through rewriting (structural manipulation)

A­ Iº

º

Lº2

º ; L2º2 ;L2ºº

SPL expressions recursively built from those

base cases

Page 20: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Step 2: Find Manipulation Rules

AB ! AB

Am£m ­ Iº !³Im­L2ºº

´³Am£m ­ Iº

´³Im­L2º2

´

Im­An£n ! Im­An£nD !

³In=º ­L2ºº

´~D³In=º ­L2º2

´

P ! P ­ I2

Lnºn !³In=º ­Lº

2

º

´³Lnn=º

­ Iº

´

Lnºº !³Lnº ­ Iº

´³In=º ­Lº

2

º

´

Lmnm !

³Lmn=ºm ­ Iº

´³Imn=º2­Lº

2

º

´³In=º ­Lm

m=º­ Iº

´

Il­Lkmnn ­ Ir !

³Il­Lknn ­ Imr

´³Ikl­Lmn

n ­ Ir

´

Il­Lkmnn ­ Ir !

³Il­Lkmn

kn ­ Ir

´³Il­Lkmn

mn ­ Ir

´

Il­Lkmnkm ­ Ir !

³Ikl­Lmn

m ­ Ir

´³Il­Lknk ­ Imr

´

Il­Lkmnkm ­ Ir !

³Il­Lkmn

k ­ Ir

´³Il­Lkmn

m ­ Ir

´³Im­An£n

´! Lmn

º

³(Im=º ­An£n)­ Iº

´Lmnmn=º³

Im­An£n´Lmnm !

µIm=º ­Lnºº

³An£n ­ Iº

´¶³Lmn=º

m=º­ Iº

´

Lmnn

³Im­An£n

´!

³Lmn=ºn ­ Iº

´µIm=º ­

³An£n ­ Iº

´Lnºn

µIk­

³Im­An£n

´Lmnm

¶Lkmnk !

³Lkmk ­ In

´µIm­

³Ik­An£n

´Lknk

¶³Lmnm ­ Ik

´

Lkmnmn

µIk­Lmn

n

³Im­An£n

´¶!

³Lmnn ­ Ik

´µIm­Lknn

³Ik­An£n

´¶³Lkmm ­ In

´

Manipulation rules = Processor knowledge in Spiral

Page 21: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Example

DFTmn| {z }vec(º)

! (DFTm­ In)Tmnn (Im­DFTn)Lmn

m| {z }vec(º)

:::

:::

:::

!µImnº­L2ºº

¶µDFTm­ In

º­ Iº

¶T0mnn

µImº­³L2nº ­ Iº

´ µI2nº­Lº

2

º

¶µInº­L2º2 ­ Iº

¶ ³DFTn ­ Iº

´¶µLmnºmº­L2º2

vectorized arithmetic vectorized data accesses

Page 22: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Automatically Generate Base Case Library

Goal: Given instruction set, generate base cases

Idea: Instructions as matrices + search

y = _mm_unpacklo_ps(x0, x1);

y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2));

y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));

y =

2641 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0

0 1 0 0 0 0 0 0

0 0 0 0 0 1 0 0

375"~x0~x1

#

y =

2641 0 0 0 0 0 0 0

0 1 0 0 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 1 0 0

375"~x0~x1

#

y =

2640 0 1 0 0 0 0 0

0 0 0 1 0 0 0 0

0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1

375"~x0~x1

#

º = 4 :nL42; I2­L42; L

42­ I2; L

82; L

84

o

No base case

2666664

1 0 0 0 0 0 0 00 0 0 0 1 0 0 00 1 0 0 0 0 0 00 0 0 0 0 1 0 00 0 1 0 0 0 0 00 0 0 1 0 0 0 00 0 0 0 0 0 1 00 0 0 0 0 0 0 1

3777775

y0 = _mm_unpacklo_ps(x[0], x[1]); y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4)); Base case

y0 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2)); y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));

2666664

1 0 0 0 0 0 0 00 1 0 0 0 0 0 00 0 0 0 1 0 0 00 0 0 0 0 1 0 00 0 1 0 0 0 0 00 0 0 1 0 0 0 00 0 0 0 0 0 1 00 0 0 0 0 0 0 1

3777775

Page 23: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Same Approach for Different Paradigms Vectorization: Threading:

GPUs: Verilog for FPGAs:

Rigorous, correct by construction

Overcomes compiler limitations

Page 24: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Optimal program (Regenerated for every processor)

Automation:

Spiral

Algorithm knowledge Processor knowledge

Page 25: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Optimal program (Regenerated for every processor)

Automation:

Spiral

Decomposition rules Manipulation rules

Page 26: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Organization

Spiral: Basic system

Parallelism

General input size

Results

Final remarks

Page 27: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Challenge: General Size Libraries

Challenge: Library for general input size DFT(n, x, y) {

for(i = …) {

DFT_strided(m, x+mi, y+i, 1, k)

}

}

• Algorithm cannot be fixed

• Recursive code

• Creates many challenges

So far: Code specialized to fixed input size DFT_384(x, y) {

for(i = …) {

t[2i] = x[2i] + x[2i+1]

t[2i+1] = x[2i] - x[2i+1]

}

}

• Algorithm fixed

• Nonrecursive code

Page 28: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Challenge: Recursion Steps

Cooley-Tukey FFT

Implementation that increases locality (e.g., FFTW 2.x)

void DFT(int n, cpx *y, cpx *x) { int k = choose_dft_radix(n); for (int i=0; i < k; ++i) DFTrec(m, y + m*i, x + i, k, 1); for (int j=0; j < m; ++j) DFTscaled(k, y + j, t[j], m); }

Page 29: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

S-SPL : Basic Idea Four additional matrix constructs: S, G, S, Perm

S (sum) explicit loop Gf (gather) load data with index mapping f Sf (scatter) store data with index mapping f Permf permute data with the index mapping f

S-SPL formulas = matrix factorizations

Example:

Page 30: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Find Recursion Step Closure

Repeat until closure

Page 31: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Recursion Step Closure: Examples

DFT: scalar code

DFT: full-fledged (vectorized and parallel code)

Page 32: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Summary: Complete Automation for Transforms

• Memory hierarchy optimization Rewriting and search for algorithm selection Rewriting for loop optimizations

• Vectorization Rewriting

• Parallelization Rewriting

• Derivation of library structure Rewriting Other methods

fixed input size code

general input size library

Page 33: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Organization

Spiral: Basic system

Parallelism

General input size

Results

Final remarks

Page 34: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

DFT on Intel Multicore

5MB vectorized, threaded, general-size, adaptive library Spiral

Page 35: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

0

2

4

6

8

10

12

32 64 96 128 160 192 224 256 288 320 352 384

input size n

DCTs on 2.66 GHz Core2 (4-way SSSE3) Performance [Gflop/s]

Often it Looks Like That

Spiral-generated

Available libraries

Page 36: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Computer generated Functions for Intel IPP 6.0

3984 C functions 1M lines of code Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT) Precision: single, double Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)

Computer generated

Results: SpiralGen Inc.

Page 37: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Mapping to FPGAs

best

Page 38: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Organization

Spiral: Basic system

Parallelism

General input size

Results

Final remarks

Page 39: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Beyond Transforms Viterbi Decoder Matrix-Matrix Multiplication

JPEG 2000 (Wavelet, EBCOT) Synthetic Aperture Radar (SAR)

interpolation 2D iFFT matched filtering

preprocessing DWT quantization entropy coding (EBCOT + MQ)

= x

Page 40: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Related Work: Autotuning

Goal: (Partial) automation of runtime optimization

Projects ATLAS (U. Tennessee)

FFTW (MIT)

BeBop/OSKI (U. Berkeley)

Spiral (Carnegie Mellon)

Tensor contractions (Ohio State)

Fenics (some universities)

FLAME (UT Austin)

Adaptive sorting (UIUC)

<many others>

Proceedings of the IEEE special issue, Feb. 2005

Page 41: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Spiral: Summary

Spiral:

Successful approach to automating the development of computing software

Commercial proof-of-concept

Key ideas:

Algorithm knowledge: Domain specific symbolic representation

Platform knowledge: Tagged rewrite rules, SIMD specification

void dft64(float *Y, float *X) {

__m512 U912, U913, U914, U915,...

__m512 *a2153, *a2155;

a2153 = ((__m512 *) X); s1107 = *(a2153);

s1108 = *((a2153 + 4)); t1323 = _mm512_add_ps(s1107,s1108);

t1324 = _mm512_sub_ps(s1107,s1108);

<many more lines>

U926 = _mm512_swizupconv_r32(…);

s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(

_mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),

_mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…),

_mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));

U927 = _mm512_swizupconv_r32

<many more lines>

}

Page 42: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

35x

Page 43: Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Programming languages Program generation

Symbolic Computation Rewriting

Compilers

Software Scientific Computing

Algorithms Mathematics

Automating High-Performance

Numerical Software Development Search

Learning