Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Carnegie Mellon
Supported by DARPA, ONR, NSF, Intel, Mercury
Markus Püschel
Computer Science
…and the Spiral team (only part shown)
Spiral Computer Synthesis of Computational Programs
Joint work with … Franz Franchetti
Yevgen Voronenko Srinivas Chellappa
Frédéric de Mesmay Daniel McFarlin
José Moura James Hoe
…
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAA
Carnegie Mellon
Audio/image/video processing
Scientific Computing
Physics/biology simulations
Consumer Computing
Computing Unlimited need for performance
Large set of applications, but …
Relatively small set of critical components (100s to 1000s)
Matrix multiplication
Discrete Fourier transform
Viterbi decoder
Filter/stencil
…. Embedded Computing
Signal processing, communication, control
Carnegie Mellon
The Problem: Example DFT
0
5
10
15
20
25
30
35
40
16 64 256 1k 4k 16k 64k 256k 1M
DFT on Intel Core i7 (4 Cores, 2.66 GHz) Performance [Gflop/s]
Fastest program
Same number of operations
Best compiler
12x
35x
Direct implementation
Carnegie Mellon
The Problem Is Everywhere
0
10
20
30
40
50
60
70
6 12 18 24 30 36 42 48 54
WiFi Receiver Performance [Mbit/s]
Matrix multiplication Performance [Gflop/s]
0
10
20
30
40
50
0 2000 4000 6000 8000 10000
160x 30x
Why is that?
Carnegie Mellon
DFT: Analysis
Compiler doesn’t do it
Doing by hand: Very tough
0
5
10
15
20
25
30
35
40
16 64 256 1k 4k 16k 64k 256k 1M
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz) Performance [Gflop/s]
3x
3x
5x
locality optimization
vectorization
threading
Carnegie Mellon
And There Is Not Only Intel …
Source: IEEE SP Magazine, Vol. 26, November 2009
Core i7
Nvidia G200
TI TNETV3020 Tilera Tile64
Arm Cortex A9
Carnegie Mellon
Numerical problem
Computing platform
algorithm selection
compilation
hu
man
eff
ort
au
tom
ate
d
implementation
C program
auto
mat
ed
Numerical problem
Computing platform
Current Future
C code a singularity:
• Compiler has no access to high level information
• No structural optimization
• No evaluation of choices
Challenge: conquer the high abstraction level for complete automation
algorithm selection
compilation
implementation
Carnegie Mellon
Current Solution
Algorithm knowledge Processor knowledge
Optimal program (Repeated for every processor)
Carnegie Mellon
Our Research: The Computer Writes the Program
Optimal program (Regenerated for every processor)
Automation:
Spiral
Algorithm knowledge Processor knowledge
Carnegie Mellon
“Computer Writes the Program”
Computer writes close to optimal code
Vectorized, parallelized, etc.
“click” “click”
Carnegie Mellon
Organization
Spiral: Basic system
Parallelism
General input size
Results
Final remarks
Carnegie Mellon
Linear Transforms
T =DFTn = [e¡2k`¼i=n]0·k;`<n
x=
0BBB@
x0x1...
xn¡1
1CCCA
0BBB@
y0y1...
yn¡1
1CCCA= y = Tx
Output
T ¢Input
Example:
Carnegie Mellon
Algorithms: Example FFT, n = 4
SPL (Signal processing language): Mathematical, declarative, point-free
Divide-and-conquer algorithms = breakdown rules in SPL
Fast Fourier transform (FFT)
Representation using matrix algebra
Carnegie Mellon
Decomposition Rules (>200 for >40 Transforms)
DFTn ! P>k=2;2m³DFT2m©
³Ik=2¡1 i C2m rDFT2m(i=k)
´´ ³RDFT0kIm
´; k even;
¯̄¯̄¯̄¯̄¯
RDFTnRDFT0nDHTnDHT0n
¯̄¯̄¯̄¯̄¯! (P>k=2;m I2)
0BBB@
¯̄¯̄¯̄¯̄¯
RDFT2m
RDFT02mDHT2mDHT02m
¯̄¯̄¯̄¯̄¯©
0BBB@Ik=2¡1 i D2m
¯̄¯̄¯̄¯̄¯
rDFT2m(i=k)
rDFT2m(i=k)
rDHT2m(i=k)
rDHT2m(i=k)
¯̄¯̄¯̄¯̄¯
1CCCA
1CCCA
0BBB@
¯̄¯̄¯̄¯̄¯
RDFT0kRDFT0kDHT0kDHT0k
¯̄¯̄¯̄¯̄¯ Im
1CCCA ; k even;
¯̄¯̄¯rDFT2n(u)
rDHT2n(u)
¯̄¯̄¯! L2nm
ÃIk i
¯̄¯̄¯rDFT2m((i+ u)=k)
rDHT2m((i+ u)=k)
¯̄¯̄¯
! ï̄¯̄¯rDFT2k(u)
rDHT2k(u)
¯̄¯̄¯ Im
!;
RDFT-3n ! (Q>k=2;m I2) (Ik i rDFT2m)(i+1=2)=k)) (RDFT-3kIm) ; k even;
DCT-2n ! P>k=2;2m³DCT-22mK2m
2 ©³Ik=2¡1 N2mRDFT-3
>2m
´´Bn(L
n=2
k=2 I2)(Im RDFT0k)Qm=2;k;
DCT-3n ! DCT-2>n ;
DCT-4n ! Q>k=2;2m³Ik=2 N2mRDFT-3
>2m
´B0n(L
n=2
k=2 I2)(Im RDFT-3k)Qm=2;k:
DFTn ! (DFTk Im)Tnm(IkDFTm)Lnk ; n = km
DFTn ! Pn(DFTkDFTm)Qn; n = km; gcd(k;m) = 1
DFTp ! RTp (I1©DFTp¡1)Dp(I1©DFTp¡1)Rp; p prime
DCT-3n ! (Im©Jm)Lnm(DCT-3m(1=4)©DCT-3m(3=4))
¢(F2 Im)
"Im 0©¡ Jm¡1
1p2(I1©2 Im)
#; n = 2m
DCT-4n ! SnDCT-2n diag0·k<n(1=(2 cos((2k+1)¼=4n)))
IMDCT2m ! (Jm© Im© Im© Jm)
ÃÃ"1
¡1
# Im
!©Ã"¡1¡1
# Im
!!J2mDCT-42m
WHT2k
!tY
i=1
(I2k1+¢¢¢+ki¡1 WHT
2ki I
2ki+1+¢¢¢+kt
); k = k1+ ¢ ¢ ¢+ kt
DFT2 ! F2
DCT-22 ! diag(1;1=p2)F2
DCT-42 ! J2R13¼=8
Decomposition rules = Algorithm knowledge in Spiral
(from ≈100 publications)
Combining these rules yields many algorithms for every given transform
Carnegie Mellon
SPL to Code
Correct code: easy fast code: very difficult
Carnegie Mellon
Program Generation in Spiral Transform
C Program
Algorithm (SPL)
Algorithm (∑-SPL)
Decomposition rules
void sub(double *y, double *x) { double f0, f1, f2, f3, f4, f7, f8, f10, f11; f0 = x[0] - x[3];
f1 = x[0] + x[3]; f2 = x[1] - x[2]; f3 = x[1] + x[2]; f4 = f1 - f3; y[0] = f1 + f3; y[2] = 0.7071067811865476 * f4; f7 = 0.9238795325112867 * f0;
< more lines>
+ Search or Learning
parallelization vectorization
locality optimization
basic block optimizations
Carnegie Mellon
Organization
Spiral: Basic system
Parallelism
General input size
Results
Final remarks
Carnegie Mellon
Example: Vectorization in Spiral
Relationship SPL expressions ↔ vectorization?
y =³DFT2 I4
´xy =DFT2x
one addition one subtraction
one (4-way) vector addition one (4-way) vector subtraction
Carnegie Mellon
Step 1: Identify “Good” Vector Constructs
Vector length:
Good (= easily vectorizable) SPL constructs:
Idea: Convert a given SPL expression into a “good” SPL expression through rewriting (structural manipulation)
A Iº
º
Lº2
º ; L2º2 ;L2ºº
SPL expressions recursively built from those
base cases
Carnegie Mellon
Step 2: Find Manipulation Rules
AB ! AB
Am£m Iº !³ImL2ºº
´³Am£m Iº
´³ImL2º2
´
ImAn£n ! ImAn£nD !
³In=º L2ºº
´~D³In=º L2º2
´
P ! P I2
Lnºn !³In=º Lº
2
º
´³Lnn=º
Iº
´
Lnºº !³Lnº Iº
´³In=º Lº
2
º
´
Lmnm !
³Lmn=ºm Iº
´³Imn=º2Lº
2
º
´³In=º Lm
m=º Iº
´
IlLkmnn Ir !
³IlLknn Imr
´³IklLmn
n Ir
´
IlLkmnn Ir !
³IlLkmn
kn Ir
´³IlLkmn
mn Ir
´
IlLkmnkm Ir !
³IklLmn
m Ir
´³IlLknk Imr
´
IlLkmnkm Ir !
³IlLkmn
k Ir
´³IlLkmn
m Ir
´³ImAn£n
´! Lmn
º
³(Im=º An£n) Iº
´Lmnmn=º³
ImAn£n´Lmnm !
µIm=º Lnºº
³An£n Iº
´¶³Lmn=º
m=º Iº
´
Lmnn
³ImAn£n
´!
³Lmn=ºn Iº
´µIm=º
³An£n Iº
´Lnºn
¶
µIk
³ImAn£n
´Lmnm
¶Lkmnk !
³Lkmk In
´µIm
³IkAn£n
´Lknk
¶³Lmnm Ik
´
Lkmnmn
µIkLmn
n
³ImAn£n
´¶!
³Lmnn Ik
´µImLknn
³IkAn£n
´¶³Lkmm In
´
Manipulation rules = Processor knowledge in Spiral
Carnegie Mellon
Example
DFTmn| {z }vec(º)
! (DFTm In)Tmnn (ImDFTn)Lmn
m| {z }vec(º)
:::
:::
:::
!µImnºL2ºº
¶µDFTm In
º Iº
¶T0mnn
µImº³L2nº Iº
´ µI2nºLº
2
º
¶µInºL2º2 Iº
¶ ³DFTn Iº
´¶µLmnºmºL2º2
¶
vectorized arithmetic vectorized data accesses
Carnegie Mellon
Automatically Generate Base Case Library
Goal: Given instruction set, generate base cases
Idea: Instructions as matrices + search
y = _mm_unpacklo_ps(x0, x1);
y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2));
y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));
y =
2641 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0
375"~x0~x1
#
y =
2641 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
375"~x0~x1
#
y =
2640 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
375"~x0~x1
#
º = 4 :nL42; I2L42; L
42 I2; L
82; L
84
o
No base case
2666664
1 0 0 0 0 0 0 00 0 0 0 1 0 0 00 1 0 0 0 0 0 00 0 0 0 0 1 0 00 0 1 0 0 0 0 00 0 0 1 0 0 0 00 0 0 0 0 0 1 00 0 0 0 0 0 0 1
3777775
y0 = _mm_unpacklo_ps(x[0], x[1]); y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4)); Base case
y0 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2)); y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));
2666664
1 0 0 0 0 0 0 00 1 0 0 0 0 0 00 0 0 0 1 0 0 00 0 0 0 0 1 0 00 0 1 0 0 0 0 00 0 0 1 0 0 0 00 0 0 0 0 0 1 00 0 0 0 0 0 0 1
3777775
Carnegie Mellon
Same Approach for Different Paradigms Vectorization: Threading:
GPUs: Verilog for FPGAs:
Rigorous, correct by construction
Overcomes compiler limitations
Carnegie Mellon
Optimal program (Regenerated for every processor)
Automation:
Spiral
Algorithm knowledge Processor knowledge
Carnegie Mellon
Optimal program (Regenerated for every processor)
Automation:
Spiral
Decomposition rules Manipulation rules
Carnegie Mellon
Organization
Spiral: Basic system
Parallelism
General input size
Results
Final remarks
Carnegie Mellon
Challenge: General Size Libraries
Challenge: Library for general input size DFT(n, x, y) {
…
for(i = …) {
DFT_strided(m, x+mi, y+i, 1, k)
}
…
}
• Algorithm cannot be fixed
• Recursive code
• Creates many challenges
So far: Code specialized to fixed input size DFT_384(x, y) {
…
for(i = …) {
t[2i] = x[2i] + x[2i+1]
t[2i+1] = x[2i] - x[2i+1]
}
…
}
• Algorithm fixed
• Nonrecursive code
Carnegie Mellon
Challenge: Recursion Steps
Cooley-Tukey FFT
Implementation that increases locality (e.g., FFTW 2.x)
void DFT(int n, cpx *y, cpx *x) { int k = choose_dft_radix(n); for (int i=0; i < k; ++i) DFTrec(m, y + m*i, x + i, k, 1); for (int j=0; j < m; ++j) DFTscaled(k, y + j, t[j], m); }
Carnegie Mellon
S-SPL : Basic Idea Four additional matrix constructs: S, G, S, Perm
S (sum) explicit loop Gf (gather) load data with index mapping f Sf (scatter) store data with index mapping f Permf permute data with the index mapping f
S-SPL formulas = matrix factorizations
Example:
Carnegie Mellon
Find Recursion Step Closure
Repeat until closure
Carnegie Mellon
Recursion Step Closure: Examples
DFT: scalar code
DFT: full-fledged (vectorized and parallel code)
Carnegie Mellon
Summary: Complete Automation for Transforms
• Memory hierarchy optimization Rewriting and search for algorithm selection Rewriting for loop optimizations
• Vectorization Rewriting
• Parallelization Rewriting
• Derivation of library structure Rewriting Other methods
fixed input size code
general input size library
Carnegie Mellon
Organization
Spiral: Basic system
Parallelism
General input size
Results
Final remarks
Carnegie Mellon
DFT on Intel Multicore
5MB vectorized, threaded, general-size, adaptive library Spiral
Carnegie Mellon
0
2
4
6
8
10
12
32 64 96 128 160 192 224 256 288 320 352 384
input size n
DCTs on 2.66 GHz Core2 (4-way SSSE3) Performance [Gflop/s]
Often it Looks Like That
Spiral-generated
Available libraries
Carnegie Mellon
Computer generated Functions for Intel IPP 6.0
3984 C functions 1M lines of code Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT) Precision: single, double Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)
Computer generated
Results: SpiralGen Inc.
Carnegie Mellon
Mapping to FPGAs
best
Carnegie Mellon
Organization
Spiral: Basic system
Parallelism
General input size
Results
Final remarks
Carnegie Mellon
Beyond Transforms Viterbi Decoder Matrix-Matrix Multiplication
JPEG 2000 (Wavelet, EBCOT) Synthetic Aperture Radar (SAR)
interpolation 2D iFFT matched filtering
preprocessing DWT quantization entropy coding (EBCOT + MQ)
= x
Carnegie Mellon
Related Work: Autotuning
Goal: (Partial) automation of runtime optimization
Projects ATLAS (U. Tennessee)
FFTW (MIT)
BeBop/OSKI (U. Berkeley)
Spiral (Carnegie Mellon)
Tensor contractions (Ohio State)
Fenics (some universities)
FLAME (UT Austin)
Adaptive sorting (UIUC)
<many others>
Proceedings of the IEEE special issue, Feb. 2005
Carnegie Mellon
Spiral: Summary
Spiral:
Successful approach to automating the development of computing software
Commercial proof-of-concept
Key ideas:
Algorithm knowledge: Domain specific symbolic representation
Platform knowledge: Tagged rewrite rules, SIMD specification
void dft64(float *Y, float *X) {
__m512 U912, U913, U914, U915,...
__m512 *a2153, *a2155;
a2153 = ((__m512 *) X); s1107 = *(a2153);
s1108 = *((a2153 + 4)); t1323 = _mm512_add_ps(s1107,s1108);
t1324 = _mm512_sub_ps(s1107,s1108);
<many more lines>
U926 = _mm512_swizupconv_r32(…);
s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
_mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),
_mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…),
_mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));
U927 = _mm512_swizupconv_r32
<many more lines>
}
Carnegie Mellon
35x
Carnegie Mellon
Programming languages Program generation
Symbolic Computation Rewriting
Compilers
Software Scientific Computing
Algorithms Mathematics
Automating High-Performance
Numerical Software Development Search
Learning