
What is what Polyphase filter Implementation Results

The Implementation and Comparison of a Polyphase Filter on Many-Core Systems

Karel Adámek, Jan Novotný, Wes Armour

GPU 2014 Rome 15.9.2014


Astro-Accelerate

Astro-Accelerate is a many-core accelerated library for real-time processing of radio-astronomy data. Modules completed or in development are the polyphase filter, de-dispersion, RFI mitigation, acceleration search and novel algorithms for the detection of quasi-periodic signals. Astro-Accelerate is currently used in ARTEMIS, ALPHABURST and DRAGNET, and potentially in the SKA.

Many people have contributed to Astro-Accelerate:
Dan Curran, Simon McIntosh-Smith (Bristol)
Wes Armour, Sofia Dimoudi, Mike Giles, Aris Karastergiou, Craig Webb (Oxford)
Jan Novotný and Karel Adámek (Opava)
Mike Clark (NVIDIA)
Steve Casselman (ALTERA)


Polyphase filter - Why bother?

We are interested in the observed frequencies.

Discrete Fourier Transformation (DFT)

X(k) = \sum_{n=0}^{N-1} x(n) e^{-i 2\pi n k / N}

Broadband filter

Reduces DFT leakage, DFT scalloping loss


DFT leakage

Smears one frequency bin to all other bins
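
A minimal sketch of leakage, assuming a direct evaluation of the DFT formula in plain Python. The helper names `dft` and `tone` and the choice N = 16 are illustrative and not taken from the talk. A tone that completes a whole number of cycles in N samples lands in a single bin; a tone halfway between two bins spreads into every bin, and its peak is also lower (scalloping loss).

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT: X(k) = sum_n x(n) * exp(-2*pi*i*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]

def tone(cycles, N):
    """Complex sinusoid completing `cycles` periods in N samples."""
    return [cmath.exp(2j * math.pi * cycles * n / N) for n in range(N)]

N = 16
on_bin = [abs(X) for X in dft(tone(3.0, N))]   # frequency exactly on bin 3
off_bin = [abs(X) for X in dft(tone(3.5, N))]  # halfway between bins 3 and 4

# On-bin tone: all power sits in bin 3, the other bins are numerically zero.
leak_on = max(v for k, v in enumerate(on_bin) if k != 3)
# Off-bin tone: even the weakest bin receives a visible share of the power.
leak_off = min(off_bin)

print(f"on-bin:  peak = {on_bin[3]:.2f}, largest other bin = {leak_on:.2e}")
print(f"off-bin: peak = {max(off_bin):.2f}, smallest bin = {leak_off:.2f}")
```

The drop of the off-bin peak below N is the scalloping loss that the polyphase filter is meant to reduce alongside leakage.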


FIR filter

PPF = FIR filter + DFT

FIR vs IIR filters

FIR is convolution of a signal x(n) and impulse response b

y(n) = \sum_{i=0}^{T-1} x(n - i) b_i

+ Simpler implementation
+ Inherently stable
- More computationally demanding
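
The convolution can be written out directly. This is a minimal sketch in plain Python, assuming that samples before the start of the signal are zero (the slide does not specify boundary handling); the function name `fir` and the 4-tap moving-average coefficients are illustrative only.

```python
def fir(x, b):
    """Direct-form FIR: y(n) = sum_{i=0}^{T-1} x(n - i) * b[i].

    Samples before the start of the signal are treated as zero
    (an assumption made for this sketch).
    """
    T = len(b)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(T):
            if n - i >= 0:
                acc += x[n - i] * b[i]
        y.append(acc)
    return y

# A 4-tap moving average as a stand-in impulse response.
b = [0.25, 0.25, 0.25, 0.25]
x = [4.0, 8.0, 0.0, 4.0, 4.0, 4.0]
y = fir(x, b)
print(y)

# Feeding a unit impulse returns the coefficients themselves.
impulse_response = fir([1.0, 0.0, 0.0, 0.0, 0.0], b)[:4]
```

Each output sample costs T multiply-adds, which is where the higher computational demand comes from; with no feedback term, the output cannot grow without bound for bounded input.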


Hardware

Table: Specification comparison of our investigated many-core platforms.

Platform                 Memory bandwidth (GB/s)   Peak performance (GFLOP/s)
Xeon E5-2650 (2x)        102                       512
Xeon Phi 5110P           320                       1920
Fermi c2070 [1]          144                       1030
Kepler Tesla K40 [1]     288                       5045
Maxwell GTX 750 Ti [2]   86.4                      1605

[1] Scientific
[2] Low-end
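
One way to read the table is through the ratio of peak performance to memory bandwidth, often called machine balance. This ratio is not given in the talk; the sketch below simply derives it from the table values. A kernel that performs fewer FLOPs per byte than this ratio is likely to be limited by memory bandwidth rather than by arithmetic.

```python
# (memory bandwidth in GB/s, peak performance in GFLOP/s), copied from the table.
platforms = {
    "Xeon E5-2650 (2x)": (102.0, 512.0),
    "Xeon Phi 5110P": (320.0, 1920.0),
    "Fermi c2070": (144.0, 1030.0),
    "Kepler Tesla K40": (288.0, 5045.0),
    "Maxwell GTX 750 Ti": (86.4, 1605.0),
}

def machine_balance(bandwidth_gbs, peak_gflops):
    """FLOPs the device can execute per byte it can fetch from memory."""
    return peak_gflops / bandwidth_gbs

balance = {name: machine_balance(bw, pf) for name, (bw, pf) in platforms.items()}
for name, value in balance.items():
    print(f"{name:20s} {value:5.1f} FLOP/byte")
```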


Serial code

FFT libraries: MKL (CPU, Xeon Phi), cuFFT (GPU)
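
A minimal serial sketch of PPF = FIR filter + DFT, written in plain Python and not the authors' code. Assumptions: the input is a flat list of samples, spectrum s uses the n_taps blocks of n_channels samples starting at block s, the coefficient for tap t and channel c is stored at `coeffs[t * n_channels + c]`, and a direct O(N^2) DFT stands in for the FFT libraries named above. All function names are hypothetical.

```python
import cmath
import math

def ppf_fir(x, coeffs, n_channels, n_taps):
    """FIR stage: for each spectrum s and channel c, weight n_taps input
    samples spaced n_channels apart and sum them."""
    n_spectra = len(x) // n_channels - n_taps + 1
    out = []
    for s in range(n_spectra):
        spectrum = []
        for c in range(n_channels):
            acc = 0j
            for t in range(n_taps):
                acc += x[(s + t) * n_channels + c] * coeffs[t * n_channels + c]
            spectrum.append(acc)
        out.append(spectrum)
    return out

def dft(x):
    """Direct DFT; a real implementation would call an FFT library here."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]

def polyphase_filter(x, coeffs, n_channels, n_taps):
    """FIR stage followed by one DFT per filtered spectrum."""
    return [dft(s) for s in ppf_fir(x, coeffs, n_channels, n_taps)]

# Toy input: 3 blocks of 4 channels, 2 taps, constant signal and coefficients.
n_channels, n_taps = 4, 2
x = [1.0] * (3 * n_channels)
coeffs = [0.5] * (n_taps * n_channels)
spectra = polyphase_filter(x, coeffs, n_channels, n_taps)
print(len(spectra), [round(abs(v), 6) for v in spectra[0]])
```

A constant input ends up entirely in channel 0 of every output spectrum, which gives a quick sanity check for any faster implementation.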


GPU Implementation

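[Figure: layout of the input data in memory. Addresses (labels 96, 128, 160 and 608, 640, 682) run along the channels, nChannels per row, with one row per tap (1, 2, 3); each warp covers a run of consecutive addresses, indexed by thread ID.]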

coalesced and sequential data access

no constant memory (nothing to broadcast)

no shared memory yet (work in progress); L1 and L2 currently act as cache

Each block processes more than one spectrum (speed-up 1.29x)

Read-only data: const __restrict__ or __ldg() (speed-up 2.41x)
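
The coalesced access pattern can be emulated on the host. This is only a sketch in plain Python under assumed conventions that the slides do not spell out: one thread per channel, and input element (spectrum + tap, channel) stored at index (spectrum + tap) * n_channels + channel. The function names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def addresses_for_warp(warp, tap, spectrum, n_channels):
    """Input element indices read by the 32 threads of one warp for one tap.

    Assumed layout (hypothetical): element (spectrum + tap, channel) lives at
    (spectrum + tap) * n_channels + channel, and thread ID equals channel.
    """
    first_thread = warp * WARP_SIZE
    return [(spectrum + tap) * n_channels + thread
            for thread in range(first_thread, first_thread + WARP_SIZE)]

def is_coalesced(addresses):
    """True if consecutive threads touch consecutive elements."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))

n_channels = 512
for tap in range(3):
    addrs = addresses_for_warp(warp=1, tap=tap, spectrum=0, n_channels=n_channels)
    print(f"tap {tap}: elements {addrs[0]}..{addrs[-1]}, "
          f"coalesced = {is_coalesced(addrs)}")
```

Under this layout every warp reads one contiguous run per tap, and moving to the next tap shifts the run by n_channels elements. Assigning one thread per tap instead would make neighbouring threads n_channels elements apart and break coalescing.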


Maxwell GPU - Streams

Different implementation for each generation

Transfer latency fully hidden on scientific cards (6.2x)

Maxwell streams (only one copy engine)

Kepler streams (two copy engines)
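
Why the number of copy engines matters can be illustrated with a toy timing model. This is a sketch under strong assumptions and not a description of the authors' implementation: work is split into equal chunks, operations are issued chunk by chunk (upload, kernel, download), and each engine executes its operations strictly in issue order. The function name and the unit timings are hypothetical.

```python
def pipeline_time(n_chunks, h2d, kernel, d2h, copy_engines):
    """Finish time when chunks are issued one after another.

    Toy model: an operation starts once its engine is free and the previous
    step of the same chunk has finished. With one copy engine, uploads and
    downloads share a single queue; with two, each direction has its own.
    """
    if copy_engines not in (1, 2):
        raise ValueError("copy_engines must be 1 or 2")
    up = "copy0"
    down = "copy0" if copy_engines == 1 else "copy1"
    free = {"copy0": 0.0, "copy1": 0.0, "compute": 0.0}
    done = 0.0
    for _ in range(n_chunks):
        t = free[up] + h2d                    # host-to-device copy
        free[up] = t
        t = max(free["compute"], t) + kernel  # kernel waits for its input
        free["compute"] = t
        t = max(free[down], t) + d2h          # download waits for the kernel
        free[down] = t
        done = t
    return done

one_engine = pipeline_time(4, 1.0, 1.0, 1.0, copy_engines=1)
two_engines = pipeline_time(4, 1.0, 1.0, 1.0, copy_engines=2)
print(f"one copy engine:  {one_engine:.0f} time units")
print(f"two copy engines: {two_engines:.0f} time units")
```

In this model the chunk-by-chunk issue order gives no overlap at all with a single copy engine, because each upload queues behind the previous download. That is one plausible reason why the issue order has to differ between GPU generations.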


Xeon Phi 5110P

Each thread computes 2 spectra

Data access by cache lines (1x AVX-512)

Streams on Phi. You can hide computation behind:

incoming data (bugged)

outgoing data (worked) (2.35x)

Thread affinity: compact 3.1x faster than scatter


Two operational modes:

Single mode

The accelerator computes only the polyphase filter (PPF); the time needed for data transfer is included.

Streaming mode

Data for the PPF are already on the device and the results of the PPF are reused in some other module. Transfer times are not included.

Tests were performed at a data rate of 6.5 GB/s, the estimated data rate for single channels of the SKA's Low Frequency Aperture Array.


Multiples of real-time

A multiple of real-time is a measure of how fast we can process 1 s of data.

M = \frac{\text{one second}}{\text{time taken to process}}

We measure two versions of multiples of real-time.

Mb – For single mode (transfer times included)

Mc – For streaming mode (without transfer times)

For CPU Mb = Mc
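
A short worked example of the two measures. The timings below are hypothetical and are not measurements from the talk, and the single-mode figure assumes that transfers and computation do not overlap.

```python
def multiple_of_real_time(processing_time_s, data_duration_s=1.0):
    """M = (duration of the data) / (time taken to process it)."""
    if processing_time_s <= 0:
        raise ValueError("processing time must be positive")
    return data_duration_s / processing_time_s

# Hypothetical timings for one second of data (not from the talk).
compute_s = 0.20   # PPF computation only
transfer_s = 0.30  # host <-> device copies

M_c = multiple_of_real_time(compute_s)               # streaming mode
M_b = multiple_of_real_time(compute_s + transfer_s)  # single mode, no overlap
print(f"Mc = {M_c:.1f}, Mb = {M_b:.1f}, keeps up with real time: {M_b >= 1.0}")
```

A value of M above 1 means the platform processes data faster than it arrives; on a CPU there is no device transfer, so both measures coincide.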

Another bottleneck is network


Multiples of real-time - Streaming mode

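[Figure: two panels showing the multiple of real-time in streaming mode for CPU, Xeon Phi, Fermi, Kepler and Maxwell. First panel: multiple of real-time (0 to 30) against the number of channels (256, 512, 1024, 2048, 4096). Second panel: multiple of real-time (0 to 7) against the number of taps (5, 8, 16, 32, 64).]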


Multiples of real-time

Single run

Platform             Time (s)   Perf. (GFLOP/s)   Band. (GB/s)
Xeon E5-2650         0.053      74                37
Xeon Phi 5110P       0.028      140               70
Fermi c2070          0.025      157               79
Kepler Tesla K40     0.014      284               142
Maxwell GTX 750 Ti   0.037      106               53

Single mode and streaming mode

                     Single mode                     Streaming mode
Platform             Mb (1/s)   PCIe usage (GB/s)    Mc (1/s)   Perf. (GFLOP/s)   Bandwidth (GB/s)
Xeon E5-2650         1.92       -                    1.92       188 (37%)         50 (49%)
Xeon Phi 5110P       0.42       5.2 (33%)            2.72       267 (14%)         71 (22%)
Fermi c2070          0.44       5.8 (36%)            3.71       364 (35%)         97 (67%)
Kepler Tesla K40     1.38       17.9 (57%)           6.89       677 (13%)         181 (63%)
Maxwell GTX 750 Ti   0.82       10.7 (34%)           2.39       234 (15%)         63 (73%)
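
The percentages in parentheses for the streaming-mode performance and bandwidth appear to be the measured value as a fraction of the platform's peak from the hardware table. That reading is an inference, but recomputing it in a few lines of Python reproduces every quoted figure.

```python
# Peak (GFLOP/s, GB/s) from the hardware table.
peak = {
    "Xeon E5-2650": (512.0, 102.0),
    "Xeon Phi 5110P": (1920.0, 320.0),
    "Fermi c2070": (1030.0, 144.0),
    "Kepler Tesla K40": (5045.0, 288.0),
    "Maxwell GTX 750 Ti": (1605.0, 86.4),
}
# Measured streaming-mode (GFLOP/s, GB/s) from the results table.
measured = {
    "Xeon E5-2650": (188.0, 50.0),
    "Xeon Phi 5110P": (267.0, 71.0),
    "Fermi c2070": (364.0, 97.0),
    "Kepler Tesla K40": (677.0, 181.0),
    "Maxwell GTX 750 Ti": (234.0, 63.0),
}

def percent_of_peak(value, peak_value):
    """Measured value as a whole-number percentage of the peak value."""
    return round(100.0 * value / peak_value)

utilisation = {
    name: (percent_of_peak(measured[name][0], peak[name][0]),
           percent_of_peak(measured[name][1], peak[name][1]))
    for name in peak
}
for name, (flops_pct, bw_pct) in utilisation.items():
    print(f"{name:20s} compute {flops_pct:3d}%   bandwidth {bw_pct:3d}%")
```

On every platform the bandwidth fraction is higher than the compute fraction, which suggests the filter is closer to the memory-bandwidth limit than to the arithmetic limit.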


Conclusions

Ongoing work – shared memory; CPU blocking code

Presented at WDS 2014; peer-reviewed proceedings submitted

Publication in preparation


Thank you for your attention!

Thanks to: Zdeněk Stuchlík, Stanislav Hledík, John Miller and Aris Karastergiou
