Spiral Presentation Template - Department of Computer Science

Carnegie Mellon

Supported by DARPA, ONR, NSF, Intel, Mercury

Markus Püschel

Computer Science

…and the Spiral team (only part shown)

Spiral Computer Synthesis of Computational Programs

Joint work with … Franz Franchetti

Yevgen Voronenko Srinivas Chellappa

Frédéric de Mesmay Daniel McFarlin

José Moura James Hoe

…

TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAA

Carnegie Mellon

Audio/image/video processing

Scientific Computing

Physics/biology simulations

Consumer Computing

Computing Unlimited need for performance

Large set of applications, but …

Relatively small set of critical components (100s to 1000s)

Matrix multiplication

Discrete Fourier transform

Viterbi decoder

Filter/stencil

…. Embedded Computing

Signal processing, communication, control

Carnegie Mellon

The Problem: Example DFT

0

5

10

15

20

25

30

35

40

16 64 256 1k 4k 16k 64k 256k 1M

DFT on Intel Core i7 (4 Cores, 2.66 GHz) Performance [Gflop/s]

Fastest program

Same number of operations

Best compiler

12x

35x

Direct implementation

Carnegie Mellon

The Problem Is Everywhere

0

10

20

30

40

50

60

70

6 12 18 24 30 36 42 48 54

WiFi Receiver Performance [Mbit/s]

Matrix multiplication Performance [Gflop/s]

0

10

20

30

40

50

0 2000 4000 6000 8000 10000

160x 30x

Why is that?

Carnegie Mellon

DFT: Analysis

Compiler doesn’t do it

Doing by hand: Very tough

0

5

10

15

20

25

30

35

40

16 64 256 1k 4k 16k 64k 256k 1M

DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz) Performance [Gflop/s]

3x

3x

5x

locality optimization

vectorization

threading

Carnegie Mellon

And There Is Not Only Intel …

Source: IEEE SP Magazine, Vol. 26, November 2009

Core i7

Nvidia G200

TI TNETV3020 Tilera Tile64

Arm Cortex A9

Carnegie Mellon

Numerical problem

Computing platform

algorithm selection

compilation

hu

man

eff

ort

au

tom

ate

d

implementation

C program

auto

mat

ed

Numerical problem

Computing platform

Current Future

C code a singularity:

• Compiler has no access to high level information

• No structural optimization

• No evaluation of choices

Challenge: conquer the high abstraction level for complete automation

algorithm selection

compilation

implementation

Carnegie Mellon

Current Solution

Algorithm knowledge Processor knowledge

Optimal program (Repeated for every processor)

Carnegie Mellon

Our Research: The Computer Writes the Program

Optimal program (Regenerated for every processor)

Automation:

Spiral


Carnegie Mellon

“Computer Writes the Program”

Computer writes close to optimal code

Vectorized, parallelized, etc.

“click” “click”

Carnegie Mellon

Organization

Spiral: Basic system

Parallelism

General input size

Results

Final remarks

Carnegie Mellon

Linear Transforms

T =DFTn = [e¡2k`¼i=n]0·k;`<n

x=

0BBB@

x0x1...

xn¡1

1CCCA

0BBB@

y0y1...

yn¡1

1CCCA= y = Tx

Output

T ¢Input

Example:

Carnegie Mellon

Algorithms: Example FFT, n = 4

SPL (Signal processing language): Mathematical, declarative, point-free

Divide-and-conquer algorithms = breakdown rules in SPL

Fast Fourier transform (FFT)

Representation using matrix algebra

Carnegie Mellon

Decomposition Rules (>200 for >40 Transforms)

DFTn ! P>k=2;2m³DFT2m©

³Ik=2¡1 i C2m rDFT2m(i=k)

´´ ³RDFT0kIm

´; k even;

¯̄¯̄¯̄¯̄¯

RDFTnRDFT0nDHTnDHT0n

¯̄¯̄¯̄¯̄¯! (P>k=2;m I2)

0BBB@

¯̄¯̄¯̄¯̄¯

RDFT2m

RDFT02mDHT2mDHT02m

¯̄¯̄¯̄¯̄¯©

0BBB@Ik=2¡1 i D2m

¯̄¯̄¯̄¯̄¯

rDFT2m(i=k)

rDFT2m(i=k)

rDHT2m(i=k)

rDHT2m(i=k)

¯̄¯̄¯̄¯̄¯

1CCCA

1CCCA

0BBB@

¯̄¯̄¯̄¯̄¯

RDFT0kRDFT0kDHT0kDHT0k

¯̄¯̄¯̄¯̄¯ Im

1CCCA ; k even;

¯̄¯̄¯rDFT2n(u)

rDHT2n(u)

¯̄¯̄¯! L2nm

ÃIk i

¯̄¯̄¯rDFT2m((i+ u)=k)

rDHT2m((i+ u)=k)

¯̄¯̄¯

! Ã¯̄¯̄¯rDFT2k(u)

rDHT2k(u)

¯̄¯̄¯ Im

!;

RDFT-3n ! (Q>k=2;m I2) (Ik i rDFT2m)(i+1=2)=k)) (RDFT-3kIm) ; k even;

DCT-2n ! P>k=2;2m³DCT-22mK2m

2 ©³Ik=2¡1 N2mRDFT-3

>2m

´´Bn(L

n=2

k=2 I2)(Im RDFT0k)Qm=2;k;

DCT-3n ! DCT-2>n ;

DCT-4n ! Q>k=2;2m³Ik=2 N2mRDFT-3

>2m

´B0n(L

n=2

k=2 I2)(Im RDFT-3k)Qm=2;k:

DFTn ! (DFTk Im)Tnm(IkDFTm)Lnk ; n = km

DFTn ! Pn(DFTkDFTm)Qn; n = km; gcd(k;m) = 1

DFTp ! RTp (I1©DFTp¡1)Dp(I1©DFTp¡1)Rp; p prime

DCT-3n ! (Im©Jm)Lnm(DCT-3m(1=4)©DCT-3m(3=4))

¢(F2 Im)

"Im 0©¡ Jm¡1

1p2(I1©2 Im)

#; n = 2m

DCT-4n ! SnDCT-2n diag0·k<n(1=(2 cos((2k+1)¼=4n)))

IMDCT2m ! (Jm© Im© Im© Jm)

ÃÃ"1

¡1

# Im

!©Ã"¡1¡1

# Im

!!J2mDCT-42m

WHT2k

!tY

i=1

(I2k1+¢¢¢+ki¡1 WHT

2ki I

2ki+1+¢¢¢+kt

); k = k1+ ¢ ¢ ¢+ kt

DFT2 ! F2

DCT-22 ! diag(1;1=p2)F2

DCT-42 ! J2R13¼=8

Decomposition rules = Algorithm knowledge in Spiral

(from ≈100 publications)

Combining these rules yields many algorithms for every given transform

Carnegie Mellon

SPL to Code

Correct code: easy fast code: very difficult

Carnegie Mellon

Program Generation in Spiral Transform

C Program

Algorithm (SPL)

Algorithm (∑-SPL)

Decomposition rules

void sub(double *y, double *x) { double f0, f1, f2, f3, f4, f7, f8, f10, f11; f0 = x[0] - x[3];

f1 = x[0] + x[3]; f2 = x[1] - x[2]; f3 = x[1] + x[2]; f4 = f1 - f3; y[0] = f1 + f3; y[2] = 0.7071067811865476 * f4; f7 = 0.9238795325112867 * f0;

< more lines>

+ Search or Learning

parallelization vectorization

locality optimization

basic block optimizations

Carnegie Mellon

Organization


Parallelism

General input size

Results

Final remarks

Carnegie Mellon

Example: Vectorization in Spiral

Relationship SPL expressions ↔ vectorization?

y =³DFT2 I4

´xy =DFT2x

one addition one subtraction

one (4-way) vector addition one (4-way) vector subtraction

Carnegie Mellon

Step 1: Identify “Good” Vector Constructs

Vector length:

Good (= easily vectorizable) SPL constructs:

Idea: Convert a given SPL expression into a “good” SPL expression through rewriting (structural manipulation)

A Iº

º

Lº2

º ; L2º2 ;L2ºº

SPL expressions recursively built from those

base cases

Carnegie Mellon

Step 2: Find Manipulation Rules

AB ! AB

Am£m Iº !³ImL2ºº

´³Am£m Iº

´³ImL2º2

´

ImAn£n ! ImAn£nD !

³In=º L2ºº

´~D³In=º L2º2

´

P ! P I2

Lnºn !³In=º Lº

2

º

´³Lnn=º

Iº

´

Lnºº !³Lnº Iº

´³In=º Lº

2

º

´

Lmnm !

³Lmn=ºm Iº

´³Imn=º2Lº

2

º

´³In=º Lm

m=º Iº

´

IlLkmnn Ir !

³IlLknn Imr

´³IklLmn

n Ir

´

IlLkmnn Ir !

³IlLkmn

kn Ir

´³IlLkmn

mn Ir

´

IlLkmnkm Ir !

³IklLmn

m Ir

´³IlLknk Imr

´

IlLkmnkm Ir !

³IlLkmn

k Ir

´³IlLkmn

m Ir

´³ImAn£n

´! Lmn

º

³(Im=º An£n) Iº

´Lmnmn=º³

ImAn£n´Lmnm !

µIm=º Lnºº

³An£n Iº

´¶³Lmn=º

m=º Iº

´

Lmnn

³ImAn£n

´!

³Lmn=ºn Iº

´µIm=º

³An£n Iº

´Lnºn

¶

µIk

³ImAn£n

´Lmnm

¶Lkmnk !

³Lkmk In

´µIm

³IkAn£n

´Lknk

¶³Lmnm Ik

´

Lkmnmn

µIkLmn

n

³ImAn£n

´¶!

³Lmnn Ik

´µImLknn

³IkAn£n

´¶³Lkmm In

´

Manipulation rules = Processor knowledge in Spiral

Carnegie Mellon

Example

DFTmn| {z }vec(º)

! (DFTm In)Tmnn (ImDFTn)Lmn

m| {z }vec(º)

:::

:::

:::

!µImnºL2ºº

¶µDFTm In

º Iº

¶T0mnn

µImº³L2nº Iº

´ µI2nºLº

2

º

¶µInºL2º2 Iº

¶ ³DFTn Iº

´¶µLmnºmºL2º2

¶

vectorized arithmetic vectorized data accesses

Carnegie Mellon

Automatically Generate Base Case Library

Goal: Given instruction set, generate base cases

Idea: Instructions as matrices + search

y = _mm_unpacklo_ps(x0, x1);

y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2));

y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));

y =

2641 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0

0 1 0 0 0 0 0 0

0 0 0 0 0 1 0 0

375"~x0~x1

#

y =

2641 0 0 0 0 0 0 0

0 1 0 0 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 1 0 0

375"~x0~x1

#

y =

2640 0 1 0 0 0 0 0

0 0 0 1 0 0 0 0

0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1

375"~x0~x1

#

º = 4 :nL42; I2L42; L

42 I2; L

82; L

84

o

No base case

2666664

1 0 0 0 0 0 0 00 0 0 0 1 0 0 00 1 0 0 0 0 0 00 0 0 0 0 1 0 00 0 1 0 0 0 0 00 0 0 1 0 0 0 00 0 0 0 0 0 1 00 0 0 0 0 0 0 1

3777775

y0 = _mm_unpacklo_ps(x[0], x[1]); y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4)); Base case

y0 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2)); y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));

2666664

1 0 0 0 0 0 0 00 1 0 0 0 0 0 00 0 0 0 1 0 0 00 0 0 0 0 1 0 00 0 1 0 0 0 0 00 0 0 1 0 0 0 00 0 0 0 0 0 1 00 0 0 0 0 0 0 1

3777775

Carnegie Mellon

Same Approach for Different Paradigms Vectorization: Threading:

GPUs: Verilog for FPGAs:

Rigorous, correct by construction

Overcomes compiler limitations

Carnegie Mellon


Automation:

Spiral


Carnegie Mellon


Automation:

Spiral

Decomposition rules Manipulation rules

Carnegie Mellon

Organization


Parallelism

General input size

Results

Final remarks

Carnegie Mellon

Challenge: General Size Libraries

Challenge: Library for general input size DFT(n, x, y) {

…

for(i = …) {

DFT_strided(m, x+mi, y+i, 1, k)

}

…

}

• Algorithm cannot be fixed

• Recursive code

• Creates many challenges

So far: Code specialized to fixed input size DFT_384(x, y) {

…

for(i = …) {

t[2i] = x[2i] + x[2i+1]

t[2i+1] = x[2i] - x[2i+1]

}

…

}

• Algorithm fixed

• Nonrecursive code

Carnegie Mellon

Challenge: Recursion Steps

Cooley-Tukey FFT

Implementation that increases locality (e.g., FFTW 2.x)

void DFT(int n, cpx *y, cpx *x) { int k = choose_dft_radix(n); for (int i=0; i < k; ++i) DFTrec(m, y + m*i, x + i, k, 1); for (int j=0; j < m; ++j) DFTscaled(k, y + j, t[j], m); }

Carnegie Mellon

S-SPL : Basic Idea Four additional matrix constructs: S, G, S, Perm

S (sum) explicit loop Gf (gather) load data with index mapping f Sf (scatter) store data with index mapping f Permf permute data with the index mapping f

S-SPL formulas = matrix factorizations

Example:

Carnegie Mellon

Find Recursion Step Closure

Repeat until closure

Carnegie Mellon

Recursion Step Closure: Examples

DFT: scalar code

DFT: full-fledged (vectorized and parallel code)

Carnegie Mellon

Summary: Complete Automation for Transforms

• Memory hierarchy optimization Rewriting and search for algorithm selection Rewriting for loop optimizations

• Vectorization Rewriting

• Parallelization Rewriting

• Derivation of library structure Rewriting Other methods

fixed input size code

general input size library

Carnegie Mellon

Organization


Parallelism

General input size

Results

Final remarks

Carnegie Mellon

DFT on Intel Multicore

5MB vectorized, threaded, general-size, adaptive library Spiral

Carnegie Mellon

0

2

4

6

8

10

12

32 64 96 128 160 192 224 256 288 320 352 384

input size n

DCTs on 2.66 GHz Core2 (4-way SSSE3) Performance [Gflop/s]

Often it Looks Like That

Spiral-generated

Available libraries

Carnegie Mellon

Computer generated Functions for Intel IPP 6.0

3984 C functions 1M lines of code Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT) Precision: single, double Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)

Computer generated

Results: SpiralGen Inc.

Carnegie Mellon

Mapping to FPGAs

best

Carnegie Mellon

Organization


Parallelism

General input size

Results

Final remarks

Carnegie Mellon

Beyond Transforms Viterbi Decoder Matrix-Matrix Multiplication

JPEG 2000 (Wavelet, EBCOT) Synthetic Aperture Radar (SAR)

interpolation 2D iFFT matched filtering

preprocessing DWT quantization entropy coding (EBCOT + MQ)

= x

Carnegie Mellon

Related Work: Autotuning

Goal: (Partial) automation of runtime optimization

Projects ATLAS (U. Tennessee)

FFTW (MIT)

BeBop/OSKI (U. Berkeley)

Spiral (Carnegie Mellon)

Tensor contractions (Ohio State)

Fenics (some universities)

FLAME (UT Austin)

Adaptive sorting (UIUC)

<many others>

Proceedings of the IEEE special issue, Feb. 2005

Carnegie Mellon

Spiral: Summary

Spiral:

Successful approach to automating the development of computing software

Commercial proof-of-concept

Key ideas:

Algorithm knowledge: Domain specific symbolic representation

Platform knowledge: Tagged rewrite rules, SIMD specification

void dft64(float *Y, float *X) {

__m512 U912, U913, U914, U915,...

__m512 *a2153, *a2155;

a2153 = ((__m512 *) X); s1107 = *(a2153);

s1108 = *((a2153 + 4)); t1323 = _mm512_add_ps(s1107,s1108);

t1324 = _mm512_sub_ps(s1107,s1108);

<many more lines>

U926 = _mm512_swizupconv_r32(…);

s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(

_mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),

_mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…),

_mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));

U927 = _mm512_swizupconv_r32

<many more lines>

}

Carnegie Mellon

35x

Carnegie Mellon

Programming languages Program generation

Symbolic Computation Rewriting

Compilers

Software Scientific Computing

Algorithms Mathematics

Automating High-Performance

Numerical Software Development Search

Learning

Documents

Spiral Presentation Template - Department of Computer Science