
Parallel FFT-algorithms

Shmeleva Yulia

Saint-Petersburg State University

Department of Computational Physics

Joint Advanced Student School 2009

Jass 2009, Shmeleva Yulia, SPbSU

Outline

• Motivation
• Mathematical theory
• Serial FFT
• Parallel FFT
  • Binary-exchange algorithm
  • Transpose algorithm
• Conclusion

Motivation

• Linear partial differential equations
• Waveform analysis
• Convolution and correlation
• Digital signal processing
• Image filtering

Continuous Fourier Transform

• (Forward) Fourier Transform

$$H(f) = \int_{-\infty}^{+\infty} h(t)\, e^{2\pi i f t}\, dt$$

• (Inverse) Fourier Transform

$$h(t) = \int_{-\infty}^{+\infty} H(f)\, e^{-2\pi i f t}\, df$$

where $i = \sqrt{-1}$

Discrete Fourier transform

• Finite time series $h = \langle h[0], h[1], \ldots, h[N-1] \rangle$, sampled at an interval $\Delta$,

where $h[k] = h(t_k) = h(k\Delta)$, $k = 0, 1, \ldots, N-1$

Discrete Fourier transform

• The discrete (forward) Fourier transform $H = \langle H[0], H[1], \ldots, H[N-1] \rangle$, where

$$H[j] = \sum_{k=0}^{N-1} h[k]\, e^{2\pi i k j / N}, \qquad j = 0, 1, \ldots, N-1$$

Discrete Fourier transform

• Let $W_N = e^{2\pi i / N}$

• The discrete (forward) Fourier transform

$$H[j] = \sum_{k=0}^{N-1} h[k]\, W_N^{jk}$$

• The discrete (inverse) Fourier transform

$$h[k] = \frac{1}{N} \sum_{j=0}^{N-1} H[j]\, W_N^{-jk}$$

Discrete Fourier transform

Each H[j] requires N multiplications, so the entire sequence H needs on the order of N² operations!
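The N² cost is visible in a direct evaluation of the definition. The following is a minimal sketch, not an optimized routine; the function names `dft` and `idft` are illustrative, and the sign convention (positive exponent in the forward transform) follows the formulas of these slides rather than the convention used by most libraries.

```python
import cmath

def dft(h):
    """Direct forward DFT: H[j] = sum_k h[k] * W_N**(j*k), W_N = exp(2*pi*i/N)."""
    n = len(h)
    w = cmath.exp(2j * cmath.pi / n)  # W_N, positive exponent as on the slides
    # Two nested loops over j and k: N multiplications for each of the N outputs.
    return [sum(h[k] * w ** (j * k) for k in range(n)) for j in range(n)]

def idft(H):
    """Direct inverse DFT: h[k] = (1/N) * sum_j H[j] * W_N**(-j*k)."""
    n = len(H)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(H[j] * w ** (-j * k) for j in range(n)) / n for k in range(n)]
```

Applying `idft(dft(h))` should return the original sequence up to rounding error.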

DFT property of symmetry

Recall $W_N = e^{2\pi i / N} \Rightarrow W_N^{N} = 1,\ W_N^{N/2} = -1$

If we associate $h[k] \Leftrightarrow H[j]$, then

$h[-k] \Leftrightarrow H[-j]$

$h[k+l] \Leftrightarrow W_N^{-lj}\, H[j]$

$W_N^{lk}\, h[k] \Leftrightarrow H[j+l]$

Serial FFT (Cooley and Tukey, 1965)

Assume that $N = 2^d$

$$H[j] = \sum_{k=0}^{N/2-1} h[2k]\, W_N^{2kj} + W_N^{j} \sum_{k=0}^{N/2-1} h[2k+1]\, W_N^{2kj}$$

$$H[j + N/2] = \sum_{k=0}^{N/2-1} h[2k]\, W_N^{2kj} - W_N^{j} \sum_{k=0}^{N/2-1} h[2k+1]\, W_N^{2kj}$$
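The even/odd splitting can be written as a short recursive routine. This is a sketch under the assumption that the length is a power of two; the name `fft` is illustrative, and the positive-exponent convention $W_N = e^{2\pi i/N}$ of these slides is used.

```python
import cmath

def fft(h):
    """Recursive radix-2 FFT with W_N = exp(2*pi*i/N); len(h) must be a power of two."""
    n = len(h)
    if n == 1:
        return list(h)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(h[0::2])  # transform of h[0], h[2], ...
    odd = fft(h[1::2])   # transform of h[1], h[3], ...
    H = [0j] * n
    for j in range(n // 2):
        t = cmath.exp(2j * cmath.pi * j / n) * odd[j]  # W_N^j times the odd part
        H[j] = even[j] + t            # H[j]
        H[j + n // 2] = even[j] - t   # H[j + N/2], using W_N^(N/2) = -1
    return H
```

Each level of the recursion does N/2 multiplications by $W_N^j$, and there are $\log_2 N$ levels.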

Serial FFT

Butterfly

Inputs $a_j$ and $b_j$, twiddle factor $W_N^{j}$; outputs:

$$a_j + W_N^{j}\, b_j, \qquad a_j - W_N^{j}\, b_j$$

Serial FFT (decomposition)

h[7] h[6] h[5] h[4] h[3] h[2] h[1] h[0]

h[6] h[4] h[2] h[0] h[7] h[5] h[3] h[1]

h[4] h[0] h[6] h[2] h[7] h[3] h[5] h[1]

h[0] h[4] h[2] h[6] h[5] h[1] h[3] h[7]

Serial FFT (Cooley and Tukey, 1965)

[Figure: butterfly network of the 8-point FFT with inputs h[0], ..., h[7], outputs H[0], ..., H[7], three iterations m = 1, 2, 3, and twiddle factors $W_N^0, \ldots, W_N^7$ on the edges.]

Serial FFT (Cooley and Tukey, 1965)

• $\log_2 N$ iterations

• During each iteration: N operations

Serial FFT

Original DFT: $N^2$ operations

FFT: $N \log_2 N$ operations

Compare!


Serial FFT

• Bit reversal sorting algorithm

Original sequence      Rearranged sequence
decimal   binary       binary   decimal
0         000          000      0
1         001          100      4
2         010          010      2
3         011          110      6
4         100          001      1
5         101          101      5
6         110          011      3
7         111          111      7
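The table can be reproduced by reversing the binary digits of each index. A small sketch follows; the helper names `bit_reverse` and `bit_reversal_permutation` are illustrative, and the sequence length is assumed to be a power of two.

```python
def bit_reverse(i, bits):
    """Return the integer whose `bits`-bit binary representation is that of i reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # move the lowest bit of i onto the end of r
        i >>= 1
    return r

def bit_reversal_permutation(seq):
    """Rearrange seq so that position i holds the element with bit-reversed index."""
    n = len(seq)
    bits = n.bit_length() - 1
    if 1 << bits != n:
        raise ValueError("length must be a power of two")
    return [seq[bit_reverse(i, bits)] for i in range(n)]
```

For an 8-element sequence 0, ..., 7 this yields the order 0, 4, 2, 6, 1, 5, 3, 7, matching the right-hand column of the table.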

Definitions

• Bisection width: the minimum number of communication links that must be removed to break a network into two equal-sized disconnected networks

[Figure: an example network whose bisection width is 1]

Definitions

• Speedup: a measure of the relative benefit of solving a problem in parallel

$$S = \frac{T_s}{T_p}$$

Definitions

• Efficiency: a measure of the fraction of time for which a processing element is usefully employed

$$E = \frac{S}{p}$$

Definitions

• Scalability: a measure of the capacity to increase speedup in proportion to the number of processing elements, so that efficiency stays fixed.

Definitions

• Problem size: the number of basic computation steps in the best sequential algorithm that solves the problem on a single processing element

Definitions

• Isoefficiency function: the function that dictates the growth rate of the problem size required to keep the efficiency fixed as the number of processors increases.

Parallel FFT

1. The Binary-Exchange algorithm

2. The transpose algorithm


Binary-Exchange algorithm

• Full bandwidth network
• p parallel processes
• bisection width is of order p

Example: hypercube network

Binary-Exchange algorithm

• Simple mapping:

One task ↔ one process, $p = N$

• Assume that $N = 2^r$

Binary-Exchange algorithm

[Figure, built up over several slides: the 8-point FFT on p = 8 processes P0, ..., P7, labelled 000, ..., 111. Each process holds one input element h[0], ..., h[7]; after iterations m = 1, 2, 3 the processes hold the outputs H[0], ..., H[7].]

Binary-Exchange algorithm

• Another mapping: multiple tasks ↔ one process, $p < N$

• Assume that $N = 2^r$, $p = 2^d$

• Partition the sequence into blocks of $N/p$ elements; one block ↔ one process

Binary-Exchange algorithm

• Consider $N = 8$, $p = 4$ ($r = 3$, $d = 2$)

[Figure, built up over three slides: the 8-point FFT with inputs h[0], ..., h[7] (indices 000, ..., 111) and outputs H[0], ..., H[7] on four processes P0, ..., P3, each holding a block of N/p = 2 elements; iterations m = 1, 2, 3.]

Binary-Exchange algorithm

First d iterations ($d = \log_2 p$): $N/p$ words of data are exchanged

Last (r − d) iterations ($r = \log_2 N$): no interprocess interaction
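The split into d communicating and r − d local iterations can be checked with a small simulation. The sketch below rests on two assumptions that the slides show only in figures: element i is stored on process i // (N/p) (block mapping), and in iteration m element i is combined with the element whose index differs in the m-th most significant bit. The function names are illustrative.

```python
def owner(i, N, p):
    """Process that holds element i under block mapping (assumed: blocks of N/p)."""
    return i // (N // p)

def communicating_iterations(N, p):
    """For each iteration m = 1..log2(N), report whether any butterfly pair
    spans two different processes (assumed pairing: flip the m-th most
    significant bit of the element index)."""
    r = N.bit_length() - 1
    result = []
    for m in range(1, r + 1):
        stride = 1 << (r - m)  # partner index is i XOR stride
        needs_exchange = any(
            owner(i, N, p) != owner(i ^ stride, N, p) for i in range(N)
        )
        result.append((m, needs_exchange))
    return result
```

For N = 8 and p = 4 this reports an exchange in iterations 1 and 2 and none in iteration 3, in line with d = 2 and r − d = 1.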

Binary-Exchange algorithm

• Denote:

$t_s$: the startup time for the data transfer

$t_w$: the per-word transfer time

$t_c$: computation time

Binary-Exchange algorithm

• The parallel run time

$$T_p = t_c \frac{N}{p} \log_2 N + t_s \log_2 p + t_w \frac{N}{p} \log_2 p$$

• Speedup

$$S = \frac{t_c N \log_2 N}{T_p} = \frac{pN \log_2 N}{N \log_2 N + (t_s/t_c)\, p \log_2 p + (t_w/t_c)\, N \log_2 p}$$

Binary-Exchange algorithm

• Efficiency

$$E = \frac{1}{1 + \dfrac{t_s\, p \log_2 p}{t_c\, N \log_2 N} + \dfrac{t_w \log_2 p}{t_c \log_2 N}}$$

• Scalability

$$W = \max\left\{ p \log_2 p,\ K \frac{t_s}{t_c}\, p \log_2 p,\ K \frac{t_w}{t_c}\, p^{K t_w / t_c} \log_2 p \right\}, \qquad K = \frac{E}{1-E}$$
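To get a feel for these formulas, the model can be evaluated numerically. The sketch below encodes the run-time, speedup and efficiency expressions for the full-bandwidth binary-exchange algorithm and searches for the smallest power-of-two N that reaches a target efficiency. The default parameter values $t_c = 2$, $t_w = 4$, $t_s = 25$ are the ones assumed elsewhere in this talk; the function names and the search limit `max_exp` are illustrative.

```python
import math

def run_time(N, p, tc=2.0, tw=4.0, ts=25.0):
    """T_p = tc*(N/p)*log2(N) + ts*log2(p) + tw*(N/p)*log2(p)."""
    return tc * (N / p) * math.log2(N) + ts * math.log2(p) + tw * (N / p) * math.log2(p)

def speedup(N, p, tc=2.0, tw=4.0, ts=25.0):
    """S = T_s / T_p with serial time T_s = tc * N * log2(N)."""
    return tc * N * math.log2(N) / run_time(N, p, tc, tw, ts)

def efficiency(N, p, tc=2.0, tw=4.0, ts=25.0):
    """E = S / p."""
    return speedup(N, p, tc, tw, ts) / p

def smallest_n_for_efficiency(target, p, max_exp=60, **params):
    """Smallest power of two N >= p with efficiency >= target, or None if
    no such N is found up to 2**max_exp (an arbitrary search limit)."""
    N = p
    while N <= 2 ** max_exp:
        if efficiency(N, p, **params) >= target:
            return N
        N *= 2
    return None
```

With p = 256 and the default parameters, `smallest_n_for_efficiency(0.5, 256)` shows how quickly the problem size has to grow to hold the efficiency at one half, which is what the isoefficiency function expresses in closed form.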

Binary-Exchange algorithm

• Assume that $t_c = 2$, $t_w = 4$, $t_s = 25$

Binary-Exchange algorithm

• Assume that $t_c = 2$, $t_w = 4$, $t_s = 25$, $p = 256$

Binary-Exchange algorithm

• Limited bandwidth network
• p parallel processes
• bisection width is less than Θ(p)

Example: a mesh interconnection network

Binary-Exchange algorithm

• Mapping:

N tasks ↔ $\sqrt{p} \times \sqrt{p}$ processes

• Assume that $N = 2^r$, $p = 2^d$

Binary-Exchange algorithm

$\log_2 \sqrt{p}$ steps: communicating processes are in the same row

$\log_2 \sqrt{p}$ steps: communicating processes are in the same column

Binary-Exchange algorithm

• The parallel run time

$$T_p = t_c \frac{N}{p} \log_2 N + t_s \log_2 p + 2\, t_w \frac{N}{\sqrt{p}}$$

• Speedup

$$S = \frac{pN \log_2 N}{N \log_2 N + (t_s/t_c)\, p \log_2 p + 2\,(t_w/t_c)\, N \sqrt{p}}$$

Binary-Exchange algorithm

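• Efficiency

$$E = \frac{1}{1 + \dfrac{t_s\, p \log_2 p}{t_c\, N \log_2 N} + \dfrac{2\, t_w \sqrt{p}}{t_c \log_2 N}}$$

• Scalability

$$W = \max\left\{ p \log_2 p,\ K \frac{t_s}{t_c}\, p \log_2 p,\ 2K \frac{t_w}{t_c} \sqrt{p}\; 2^{\,2K (t_w/t_c) \sqrt{p}} \right\}, \qquad K = \frac{E}{1-E}$$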

Transpose algorithm

• Two-dimensional transpose algorithm

sequence of size N → $\sqrt{N} \times \sqrt{N}$ array

Transpose algorithm

1. A $\sqrt{N}$-point FFT is computed for each column of the initial array

2. The array is transposed

3. A $\sqrt{N}$-point FFT is computed for each column of the transposed array
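The three steps can be tried out serially. The sketch below is not the formulation of the slides: it uses the closely related "four-step" variant, in which ordinary $\sqrt{N}$-point DFTs are applied to the columns and the twiddle factors $W_N^{ad}$ that couple the two stages are applied explicitly before the transposition. That substitution, the row-major layout `A[b][a] = h[a + n*b]` and all function names are assumptions made for illustration. The positive-exponent convention $W_N = e^{2\pi i/N}$ of these slides is used.

```python
import cmath
import math

def dft(x):
    """Direct DFT with W = exp(2*pi*i/len(x)); stands in for a sqrt(N)-point FFT."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft_columns(M):
    """Apply dft to every column of the square matrix M."""
    n = len(M)
    cols = [dft([M[r][c] for r in range(n)]) for c in range(n)]
    return [[cols[c][r] for c in range(n)] for r in range(n)]

def transpose_fft(h):
    """N-point DFT computed as column DFTs, twiddle, transpose, column DFTs."""
    N = len(h)
    n = math.isqrt(N)
    if n * n != N:
        raise ValueError("length must be a perfect square")
    A = [list(h[b * n:(b + 1) * n]) for b in range(n)]  # A[b][a] = h[a + n*b]
    Y = dft_columns(A)                                  # step 1: Y[d][a]
    Y = [[Y[d][a] * cmath.exp(2j * cmath.pi * a * d / N) for a in range(n)]
         for d in range(n)]                             # twiddle factors W_N^(a*d)
    Z = [list(row) for row in zip(*Y)]                  # step 2: transpose, Z[a][d]
    R = dft_columns(Z)                                  # step 3: R[c][d] = H[d + n*c]
    return [R[c][d] for c in range(n) for d in range(n)]
```

In a parallel setting each process would own some columns, so steps 1 and 3 need no communication and all data movement is concentrated in the transposition.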

Transpose algorithm

• Simple mapping: one column ↔ one process, $p = \sqrt{N}$

• Assume that $N = 2^r$, $p = 2^d$

• Full bandwidth network
• bisection width is of order p
• example: hypercube

Transpose algorithm

Transpose


Transpose algorithm

• Another mapping: several columns ↔ one process, $p < \sqrt{N}$

• Assume that $N = 2^r$, $p = 2^d$

• Partition the array into blocks of $\sqrt{N}/p$ rows; one block ↔ one process

Transpose algorithm

• The parallel run time

$$T_p = t_c \frac{N}{p} \log_2 N + t_s (p - 1) + t_w \frac{N}{p}$$

• Speedup

$$S \approx \frac{pN \log_2 N}{N \log_2 N + (t_s/t_c)\, p^2 + (t_w/t_c)\, N}$$

Transpose algorithm

• Efficiency

$$E \approx \frac{1}{1 + \dfrac{t_s\, p^2}{t_c\, N \log_2 N} + \dfrac{t_w}{t_c \log_2 N}}$$

• Scalability

$$W = \Theta(p^2 \log_2 p)$$

Transpose algorithm

Compare two algorithms!

Binary-exchange algorithm:

$$T_p = t_c \frac{N}{p} \log_2 N + t_s \log_2 p + t_w \frac{N}{p} \log_2 p$$

Transpose algorithm:

$$T_p = t_c \frac{N}{p} \log_2 N + t_s (p - 1) + t_w \frac{N}{p}$$

Transpose algorithm

• Three-dimensional transpose algorithm

sequence of size N → $\sqrt[3]{N} \times \sqrt[3]{N} \times \sqrt[3]{N}$ array

• Mapping

$\sqrt[3]{N} \times \sqrt[3]{N} \times \sqrt[3]{N}$ array ↔ $\sqrt{p} \times \sqrt{p}$ mesh of processes


Transpose algorithm

1. A $\sqrt[3]{N}$-point FFT along the z-axis
2. Each of the $\sqrt[3]{N} \times \sqrt[3]{N}$ cross-sections along the y-z plane is transposed
3. A $\sqrt[3]{N}$-point FFT along the z-axis
4. Each of the $\sqrt[3]{N} \times \sqrt[3]{N}$ cross-sections along the x-z plane is transposed
5. A $\sqrt[3]{N}$-point FFT along the z-axis

Transpose algorithm

• The transposition phases in the three-dimensional transpose algorithm

Transpose algorithm

• The parallel run time

$$T_p = t_c \frac{N}{p} \log_2 N + 2\, t_s (\sqrt{p} - 1) + 2\, t_w \frac{N}{p}$$

• Speedup

$$S \approx \frac{pN \log_2 N}{N \log_2 N + 2\,(t_s/t_c)\, p\,(\sqrt{p} - 1) + 2\,(t_w/t_c)\, N}$$

Comparison of described algorithms

• Assume that $t_c = 2$, $t_w = 4$, $t_s = 25$, $p = 64$
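The three run-time models of this talk can be tabulated for these parameter values. The sketch below simply evaluates the formulas for the full-bandwidth binary-exchange algorithm and for the two- and three-dimensional transpose algorithms; the function names and the chosen problem sizes are illustrative, and no claim is made about measured timings.

```python
import math

def t_binary_exchange(N, p, tc=2.0, tw=4.0, ts=25.0):
    """Full-bandwidth binary exchange: tc*(N/p)*log2 N + ts*log2 p + tw*(N/p)*log2 p."""
    return tc * (N / p) * math.log2(N) + ts * math.log2(p) + tw * (N / p) * math.log2(p)

def t_transpose_2d(N, p, tc=2.0, tw=4.0, ts=25.0):
    """Two-dimensional transpose: tc*(N/p)*log2 N + ts*(p - 1) + tw*N/p."""
    return tc * (N / p) * math.log2(N) + ts * (p - 1) + tw * N / p

def t_transpose_3d(N, p, tc=2.0, tw=4.0, ts=25.0):
    """Three-dimensional transpose: tc*(N/p)*log2 N + 2*ts*(sqrt(p) - 1) + 2*tw*N/p."""
    return tc * (N / p) * math.log2(N) + 2 * ts * (math.sqrt(p) - 1) + 2 * tw * N / p

def compare(p=64, exponents=(12, 16, 20)):
    """Return rows (N, T_binary_exchange, T_transpose_2d, T_transpose_3d)."""
    rows = []
    for r in exponents:
        N = 2 ** r
        rows.append((N, t_binary_exchange(N, p), t_transpose_2d(N, p), t_transpose_3d(N, p)))
    return rows
```

Calling `compare()` shows the trade-off built into the formulas: the transpose variants pay more startup cost ($t_s$ terms growing with p) but move less data per element ($t_w$ terms without the $\log_2 p$ factor), so they gain on binary exchange as N grows.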

Conclusion

• Binary-exchange algorithm
  • high communication bandwidth
  • shared memory (OpenMP)

• Transpose algorithm
  • limited communication bandwidth
  • distributed memory (MPI)

Parallel FFT software

• Parallel Engineering and Scientific Subroutine Library (PESSL)

• “Fastest Fourier Transform in the West.” (FFTW)

• Intel® Math Kernel Library (Intel® MKL)


Acknowledgments

Sergei Andreevitch Nemnyugin

Sergei Yurievitch Slavyanov

Roman Kuralev


Thank you for your attention!