Parallel FFT-algorithms
Shmeleva Yulia
Saint-Petersburg State University
Department of Computational Physics
Joint Advanced Student School 2009
Jass 2009, Shmeleva Yulia, SPbSU
Outline
• Motivation
• Mathematical theory
• Serial FFT
• Parallel FFT
  • Binary-exchange algorithm
  • Transpose algorithm
• Conclusion
Motivation
• Linear partial differential equations
• Waveform analysis
• Convolution and correlation
• Digital signal processing
• Image filtering
Continuous Fourier Transform
• (Forward) Fourier Transform
$$H(f) = \int_{-\infty}^{+\infty} h(t)\, e^{2\pi i f t}\, dt,$$
• (Inverse) Fourier Transform
$$h(t) = \int_{-\infty}^{+\infty} H(f)\, e^{-2\pi i f t}\, df,$$
where $i = \sqrt{-1}$
Discrete Fourier transform
• Finite time series $h = \langle h[0], h[1], \ldots, h[N-1] \rangle$, sampled at an interval $\Delta$, where
$$h[k] = h(t_k) = h(k\Delta), \qquad k = 0, 1, \ldots, N-1$$
Discrete Fourier transform
• The discrete (forward) Fourier transform $H = \langle H[0], H[1], \ldots, H[N-1] \rangle$, where
$$H[j] = \sum_{k=0}^{N-1} h[k]\, e^{2\pi i k j / N}, \qquad j = 0, 1, \ldots, N-1$$
Discrete Fourier transform
• Let $W_N = e^{2\pi i / N}$
• The discrete (forward) Fourier transform
$$H[j] = \sum_{k=0}^{N-1} h[k]\, W_N^{jk}$$
• The discrete (inverse) Fourier transform
$$h[k] = \frac{1}{N} \sum_{j=0}^{N-1} H[j]\, W_N^{-jk}$$
Discrete Fourier transform
• Each H[j] requires N multiplications
⇒ The entire sequence H needs on the order of N² operations!
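The N² cost can be seen directly in a literal transcription of the two sums. The following is a minimal sketch, assuming the sign convention of these slides ($W_N = e^{2\pi i/N}$ in the forward transform, which is the opposite of what many libraries use); the function names `dft` and `idft` are illustrative only.

```python
import cmath

def dft(h):
    """Forward DFT as defined on the slides: H[j] = sum_k h[k] * W_N^(j*k)."""
    n = len(h)
    w = cmath.exp(2j * cmath.pi / n)  # W_N = e^(2*pi*i/N)
    # Two nested loops over N values each: on the order of N^2 operations.
    return [sum(h[k] * w ** (j * k) for k in range(n)) for j in range(n)]

def idft(H):
    """Inverse DFT: h[k] = (1/N) * sum_j H[j] * W_N^(-j*k)."""
    n = len(H)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(H[j] * w ** (-j * k) for j in range(n)) / n for k in range(n)]
```

Applying `idft` to the output of `dft` should reproduce the input up to rounding error.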
DFT property of symmetry
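Recall $W_N = e^{2\pi i / N} \Rightarrow W_N^{N} = 1,\ W_N^{N/2} = -1$
If we associate $h[k] \Leftrightarrow H[j]$
then $h[-k] \Leftrightarrow H[-j]$
$h[k+l] \Leftrightarrow W_N^{-jl}\, H[j]$
$W_N^{lk}\, h[k] \Leftrightarrow H[j+l]$
(all indices are taken modulo N)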
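As a short worked check of the shift property, assuming that h is extended periodically with period N (so that indices can be taken modulo N), substituting $k' = k + l$ into the forward transform gives:

```latex
\sum_{k=0}^{N-1} h[k+l]\, W_N^{jk}
  = \sum_{k'=0}^{N-1} h[k']\, W_N^{j(k'-l)}
  = W_N^{-jl} \sum_{k'=0}^{N-1} h[k']\, W_N^{jk'}
  = W_N^{-jl}\, H[j],
\qquad
W_N^{N/2} = e^{\pi i} = -1 .
```

The second identity is the one the FFT exploits: twiddle factors half a period apart differ only in sign.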
Serial FFT (Cooley and Tukey, 1965)
Assume that $N = 2^d$
$$H[j] = \sum_{k=0}^{N/2-1} h[2k]\, W_N^{2kj} + W_N^{j} \sum_{k=0}^{N/2-1} h[2k+1]\, W_N^{2kj}$$
$$H[j + N/2] = \sum_{k=0}^{N/2-1} h[2k]\, W_N^{2kj} - W_N^{j} \sum_{k=0}^{N/2-1} h[2k+1]\, W_N^{2kj}$$
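The even/odd split above translates almost line by line into a recursive routine. This is a minimal sketch, assuming N is a power of two and the slides' sign convention $W_N = e^{2\pi i/N}$; the name `fft_recursive` is illustrative.

```python
import cmath

def fft_recursive(h):
    """Radix-2 FFT following the even/odd split; len(h) must be a power of two."""
    n = len(h)
    if n == 1:
        return list(h)
    even = fft_recursive(h[0::2])  # N/2-point transform of h[0], h[2], ...
    odd = fft_recursive(h[1::2])   # N/2-point transform of h[1], h[3], ...
    out = [0j] * n
    for j in range(n // 2):
        t = cmath.exp(2j * cmath.pi * j / n) * odd[j]  # W_N^j times the odd part
        out[j] = even[j] + t           # H[j]
        out[j + n // 2] = even[j] - t  # H[j + N/2]; the sign flips since W_N^(N/2) = -1
    return out
```

Each level of the recursion does about N operations and there are $\log_2 N$ levels, which matches the operation count quoted on the following slides.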
Serial FFT
Butterfly
[Figure: butterfly operation. Inputs $a_j$ and $b_j$ with twiddle factor $W_N^j$; outputs $a_j + W_N^j b_j$ and $a_j - W_N^j b_j$.]
Serial FFT (decomposition)
h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7]
h[0] h[2] h[4] h[6] | h[1] h[3] h[5] h[7]
h[0] h[4] | h[2] h[6] | h[1] h[5] | h[3] h[7]
h[0] | h[4] | h[2] | h[6] | h[1] | h[5] | h[3] | h[7]
Serial FFT (Cooley and Tukey, 1965)
[Figure: butterfly diagram of an 8-point FFT. Inputs h[0], ..., h[7]; three stages m = 1, 2, 3 with twiddle factors $W_N^0, \ldots, W_N^7$; outputs H[0], ..., H[7].]
Serial FFT (Cooley and Tukey, 1965)
• $\log_2 N$ iterations
• During each iteration: $N$ operations
Serial FFT
Original DFT: $N^2$ operations
FFT: $N \log_2 N$ operations
Compare!
Serial FFT
• Bit reversal sorting algorithm
Original sequence        Rearranged sequence
decimal  binary    ⇒    binary  decimal
0        000             000     0
1        001             100     4
2        010             010     2
3        011             110     6
4        100             001     1
5        101             101     5
6        110             011     3
7        111             111     7
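The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming the sequence length is a power of two; the helper names `bit_reverse` and `bit_reversal_permutation` are illustrative.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`, e.g. 001 -> 100 for bits = 3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result

def bit_reversal_permutation(seq):
    """Return seq rearranged so that position i holds seq[bit_reverse(i)]."""
    n = len(seq)
    bits = n.bit_length() - 1
    if n != 1 << bits:
        raise ValueError("length must be a power of two")
    return [seq[bit_reverse(i, bits)] for i in range(n)]
```

For the indices 0 to 7 this gives the order 0, 4, 2, 6, 1, 5, 3, 7 listed in the right-hand column of the table.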
Definitions
• Bisection width: the minimum number of communication links that must be removed to break a network into two equal-sized disconnected networks
[Figure: example network whose bisection width is 1]
Definitions
• Speedup: a measure of the relative benefit of solving a problem in parallel
$$S = \frac{T_s}{T_p}$$
where $T_s$ is the serial run time and $T_p$ is the parallel run time on $p$ processing elements
Definitions
• Efficiency: a measure of the fraction of time for which a processing element is usefully employed
$$E = \frac{S}{p}$$
Definitions
• Scalability: a measure of the capacity to increase speedup in proportion to the number of processing elements, so that the efficiency stays fixed
Definitions
• Problem size: the number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element
Definitions
• Isoefficiency function: the function that dictates the growth rate of the problem size required to keep the efficiency fixed as the number of processors increases
Parallel FFT
1. The Binary-Exchange algorithm
2. The transpose algorithm
Binary-Exchange algorithm
• Full bandwidth network
• p parallel processes
• Bisection width is $\Theta(p)$
Example: hypercube network
Binary-Exchange algorithm
• Simple mapping: one task ↔ one process, $p = N$
• Assume that $N = 2^r$
Binary-Exchange algorithm
[Figure, built up over several slides: butterfly network of an 8-point FFT with one element per process. Inputs h[0], ..., h[7] are held by processes $P_0, \ldots, P_7$, labelled 000 to 111 in binary; the iterations are marked m = 1, 2, 3 and the outputs are H[0], ..., H[7].]
Binary-Exchange algorithm
• Another mapping: multiple tasks ↔ one process, $p < N$
• Assume that $N = 2^r$, $p = 2^d$
• Partition the sequence into blocks of $N/p$ elements: one block ↔ one process
Binary-Exchange algorithm
• Consider $N = 8$, $p = 4$ ($r = 3$, $d = 2$)
[Figure, built up over several slides: the same 8-point butterfly network with inputs h[0], ..., h[7] (labels 000 to 111) distributed over processes $P_0, \ldots, P_3$; iterations m = 1, 2, 3; outputs H[0], ..., H[7].]
Binary-Exchange algorithm
First $d$ iterations ($d = \log_2 p$): $N/p$ words of data are exchanged in each iteration
Last $(r - d)$ iterations ($r = \log_2 N$): no interprocess interaction
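The exchange pattern can be tried out serially. The sketch below simulates the one-element-per-process case ($p = N$) under several assumptions that the slides do not spell out: message passing is replaced by list indexing, in iteration m (counted from 0 here) process i exchanges its value with the process whose label differs in the m-th most significant bit, and the butterflies follow an "unordered" formulation whose results come out in bit-reversed order and are re-sorted at the end. The names `binary_exchange_fft` and `bit_reverse` are illustrative.

```python
import cmath

def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def binary_exchange_fft(h):
    """Serial simulation of binary exchange with one element per process."""
    n = len(h)
    r = n.bit_length() - 1
    if n != 1 << r:
        raise ValueError("length must be a power of two")
    w = cmath.exp(2j * cmath.pi / n)  # W_N as defined on the slides
    local = list(h)                   # local[i] is the value held by process i
    for m in range(r):
        shift = r - m - 1
        mask = 1 << shift             # bit in which the two partners' labels differ
        # Simulated exchange: every process receives its partner's value.
        received = [local[i ^ mask] for i in range(n)]
        updated = []
        for i in range(n):
            if i & mask:
                lower, upper = received[i], local[i]
            else:
                lower, upper = local[i], received[i]
            # Twiddle exponent: the top m+1 bits of i, reversed, padded with zeros.
            exponent = bit_reverse(i >> shift, m + 1) << shift
            updated.append(lower + w ** exponent * upper)
        local = updated
    # Process i now holds H[bit_reverse(i)]; gather the values in natural order.
    return [local[bit_reverse(j, r)] for j in range(n)]
```

With $p < N$, the same butterflies apply, but only the iterations whose `mask` crosses a block boundary would need real communication, which is the split into the first $d$ and last $r - d$ iterations described above.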
Binary-Exchange algorithm
• Denote:
$t_s$: the startup time for the data transfer
$t_w$: the per-word transfer time
$t_c$: the computation time
Binary-Exchange algorithm
• The parallel run time
$$T_p = t_c \frac{N}{p} \log_2 N + t_s \log_2 p + t_w \frac{N}{p} \log_2 p$$
• Speedup
$$S = \frac{t_c N \log_2 N}{T_p} = \frac{p N \log_2 N}{N \log_2 N + (t_s/t_c)\, p \log_2 p + (t_w/t_c)\, N \log_2 p}$$
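The run-time model is easy to evaluate numerically. The following is a minimal sketch of the formulas above for the full-bandwidth (hypercube) case; the function names are illustrative, and the default constants $t_c = 2$, $t_s = 25$, $t_w = 4$ are the example values assumed later in the slides.

```python
from math import log2

def run_time(n, p, tc=2.0, ts=25.0, tw=4.0):
    """T_p = tc*(N/p)*log2(N) + ts*log2(p) + tw*(N/p)*log2(p)."""
    return tc * (n / p) * log2(n) + ts * log2(p) + tw * (n / p) * log2(p)

def speedup(n, p, tc=2.0, ts=25.0, tw=4.0):
    """S = serial time / parallel time, with serial time tc*N*log2(N)."""
    return tc * n * log2(n) / run_time(n, p, tc, ts, tw)

def efficiency(n, p, tc=2.0, ts=25.0, tw=4.0):
    """E = S / p."""
    return speedup(n, p, tc, ts, tw) / p
```

For a fixed p, evaluating `efficiency` for growing N shows how quickly the problem size has to increase to keep E constant, which is what the isoefficiency function captures.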
Binary-Exchange algorithm
• Efficiency
$$E = \frac{1}{1 + (t_s\, p \log_2 p)/(t_c N \log_2 N) + (t_w \log_2 p)/(t_c \log_2 N)}$$
• Scalability
$$W = \max\left\{ p \log_2 p,\ K \frac{t_s}{t_c}\, p \log_2 p,\ K \frac{t_w}{t_c}\, p^{K t_w / t_c} \log_2 p \right\}, \qquad K = \frac{E}{1 - E}$$
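The terms inside the maximum can be recovered from the efficiency expression. As a sketch, assuming the problem size is $W = N \log_2 N$ and that each overhead term in the denominator of $E$ is balanced separately against $1/K$:

```latex
\frac{t_s\, p \log_2 p}{t_c\, N \log_2 N} = \frac{1}{K}
  \;\Rightarrow\; W = N \log_2 N = K \frac{t_s}{t_c}\, p \log_2 p ,
\\[4pt]
\frac{t_w \log_2 p}{t_c \log_2 N} = \frac{1}{K}
  \;\Rightarrow\; \log_2 N = K \frac{t_w}{t_c} \log_2 p
  \;\Rightarrow\; N = p^{K t_w / t_c}
  \;\Rightarrow\; W = K \frac{t_w}{t_c}\, p^{K t_w / t_c} \log_2 p .
```

The $t_w$ term grows polynomially in $p$ with an exponent that depends on the target efficiency, so it dominates whenever $K t_w / t_c > 1$.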
Binary-Exchange algorithm
• Assume that $t_c = 2$, $t_w = 4$, $t_s = 25$
Binary-Exchange algorithm
• Assume that $t_c = 2$, $t_w = 4$, $t_s = 25$, $p = 256$
Binary-Exchange algorithm
• Limited bandwidth network
• p parallel processes
• Bisection width is less than $\Theta(p)$
Example: a mesh interconnection network
Binary-Exchange algorithm
• Mapping: $N$ tasks ↔ $\sqrt{p} \times \sqrt{p}$ processes
• Assume that $N = 2^r$, $p = 2^d$
Binary-Exchange algorithm
$\log_2 \sqrt{p}$ steps: communicating processes are in the same row
$\log_2 \sqrt{p}$ steps: communicating processes are in the same column
Binary-Exchange algorithm
• The parallel run time
$$T_p = t_c \frac{N}{p} \log_2 N + t_s \log_2 p + 2 t_w \frac{N}{\sqrt{p}}$$
• Speedup
$$S = \frac{p N \log_2 N}{N \log_2 N + (t_s/t_c)\, p \log_2 p + 2 (t_w/t_c)\, N \sqrt{p}}$$
Binary-Exchange algorithm
• Efficiency
$$E = \frac{1}{1 + (t_s\, p \log_2 p)/(t_c N \log_2 N) + 2 (t_w \sqrt{p})/(t_c \log_2 N)}$$
• Scalability
$$W = \max\left\{ p \log_2 p,\ K \frac{t_s}{t_c}\, p \log_2 p,\ 2K \frac{t_w}{t_c} \sqrt{p} \cdot 2^{2K (t_w/t_c) \sqrt{p}} \right\}, \qquad K = \frac{E}{1 - E}$$
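The exponential term on the mesh follows from the same balancing argument as on the hypercube. As a sketch, again assuming $W = N \log_2 N$ and balancing the $t_w$ term of the efficiency against $1/K$:

```latex
\frac{2\, t_w \sqrt{p}}{t_c \log_2 N} = \frac{1}{K}
  \;\Rightarrow\; \log_2 N = 2K \frac{t_w}{t_c} \sqrt{p}
  \;\Rightarrow\; N = 2^{\,2K (t_w/t_c) \sqrt{p}}
  \;\Rightarrow\; W = N \log_2 N = 2K \frac{t_w}{t_c} \sqrt{p} \cdot 2^{\,2K (t_w/t_c) \sqrt{p}} .
```

Because the required problem size grows exponentially in $\sqrt{p}$, the binary-exchange algorithm scales poorly on a network with limited bisection width.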
Transpose algorithm
• Two-dimensional transpose algorithm
sequence of size $N$ → $\sqrt{N} \times \sqrt{N}$ array
Transpose algorithm
1. A $\sqrt{N}$-point FFT is computed for each column of the initial array
2. The array is transposed
3. A $\sqrt{N}$-point FFT is computed for each column of the transposed array
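The three steps can be sketched serially to check that they reproduce the DFT. Note the assumptions: this sketch uses the standard "four-step" formulation, which stores the sequence row by row and inserts an explicit twiddle-factor multiplication between steps 1 and 2 that the slide does not list (the slides may instead fold these factors into the butterflies). The column transforms use a plain O(n²) DFT for brevity, and all function names are illustrative.

```python
import cmath
from math import isqrt

def dft(seq):
    """Plain DFT with the slides' sign convention (stand-in for a column FFT)."""
    n = len(seq)
    return [sum(seq[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def column_dfts(a):
    """Transform every column of `a`, keeping the row/column layout."""
    return transpose([dft(col) for col in transpose(a)])

def transpose_fft(h):
    """Two-dimensional transpose algorithm, simulated on one process."""
    n = len(h)
    s = isqrt(n)
    if s * s != n:
        raise ValueError("length must be a perfect square")
    a = [list(h[row * s:(row + 1) * s]) for row in range(s)]  # row-major sqrt(N) x sqrt(N)
    b = column_dfts(a)                                        # step 1: column transforms
    # Extra step assumed here: multiply entry (j1, k2) by W_N^(j1*k2).
    c = [[b[j1][k2] * cmath.exp(2j * cmath.pi * j1 * k2 / n) for k2 in range(s)]
         for j1 in range(s)]
    d = column_dfts(transpose(c))                             # steps 2 and 3
    return [x for row in d for x in row]                      # H in natural order
```

In a parallel setting each process would own a group of columns, so steps 1 and 3 need no communication and all data movement is concentrated in the transposition.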
Transpose algorithm
• Simple mapping: one column ↔ one process, $p = \sqrt{N}$
• Assume that $N = 2^r$, $p = 2^d$
• Full bandwidth network
• Bisection width is $\Theta(p)$
• Example: hypercube
Transpose algorithm
[Figure: transposition of the array]
Transpose algorithm
• Another mapping: several columns ↔ one process, $p < \sqrt{N}$
• Assume that $N = 2^r$, $p = 2^d$
• Partition the array into blocks of $\sqrt{N}/p$ rows: one block ↔ one process
Transpose algorithm
• The parallel run time
$$T_p = t_c \frac{N}{p} \log_2 N + t_s (p - 1) + t_w \frac{N}{p}$$
• Speedup
$$S \approx \frac{p N \log_2 N}{N \log_2 N + (t_s/t_c)\, p^2 + (t_w/t_c)\, N}$$
Transpose algorithm
• Efficiency
$$E \approx \frac{1}{1 + (t_s\, p^2)/(t_c N \log_2 N) + t_w/(t_c \log_2 N)}$$
• Scalability
$$W = \Theta(p^2 \log_2 p)$$
Transpose algorithm
Compare two algorithms!
Binary-exchange algorithm:
$$T_p = t_c \frac{N}{p} \log_2 N + t_s \log_2 p + t_w \frac{N}{p} \log_2 p$$
Transpose algorithm:
$$T_p = t_c \frac{N}{p} \log_2 N + t_s (p - 1) + t_w \frac{N}{p}$$
Transpose algorithm
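• Three-dimensional transpose algorithm
sequence of size $N$ → $\sqrt[3]{N} \times \sqrt[3]{N} \times \sqrt[3]{N}$ array
• Mapping
$\sqrt[3]{N} \times \sqrt[3]{N} \times \sqrt[3]{N}$ array ↔ $\sqrt{p} \times \sqrt{p}$ mesh of processes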
Transpose algorithm
1. A $\sqrt[3]{N}$-point FFT along the z-axis
2. Each of the $\sqrt[3]{N} \times \sqrt[3]{N}$ cross-sections along the y-z plane is transposed
3. A $\sqrt[3]{N}$-point FFT along the z-axis
4. Each of the $\sqrt[3]{N} \times \sqrt[3]{N}$ cross-sections along the x-z plane is transposed
5. A $\sqrt[3]{N}$-point FFT along the z-axis
Transpose algorithm
• The transposition phases in the three-dimensional transpose algorithm
Transpose algorithm
• The parallel run time
$$T_p = t_c \frac{N}{p} \log_2 N + 2 t_s (\sqrt{p} - 1) + 2 t_w \frac{N}{p}$$
• Speedup
$$S \approx \frac{p N \log_2 N}{N \log_2 N + 2 (t_s/t_c)\, p (\sqrt{p} - 1) + 2 (t_w/t_c)\, N}$$
Comparison of described algorithms
• Assume that $t_c = 2$, $t_w = 4$, $t_s = 25$, $p = 64$
Conclusion
• Binary-exchange algorithm
  • high communication bandwidth
  • shared memory (OpenMP)
• Transpose algorithm
  • limited communication bandwidth
  • distributed memory (MPI)
Parallel FFT software
• Parallel Engineering and Scientific Subroutine Library (PESSL)
• “Fastest Fourier Transform in the West.” (FFTW)
• Intel® Math Kernel Library (Intel® MKL)
Acknowledgments
Sergei Andreevitch Nemnyugin
Sergei Yurievitch Slavyanov
Roman Kuralev
Thank you for your attention!