What is what Polyphase filter Implementation Results
The Implementation and Comparison of a Polyphase Filter on Many-Core Systems
Karel Adamek, Jan Novotny, Wes Armour
GPU 2014 Rome 15.9.2014
Astro-Accelerate
Astro-Accelerate is a many-core accelerated library for real-time processing of radio-astronomy data. Modules completed or in development are: polyphase filter, de-dispersion, RFI mitigation, acceleration search, and novel algorithms for the detection of quasi-periodic signals. Astro-Accelerate is currently used in ARTEMIS, ALPHABURST and DRAGNET, and potentially in the SKA.
Many people have contributed to Astro-Accelerate:
Dan Curran, Simon McIntosh-Smith (Bristol)
Wes Armour, Sofia Dimoudi, Mike Giles, Aris Karastergiou, Craig Webb (Oxford)
Jan Novotny and Karel Adamek (Opava)
Mike Clark (NVIDIA)
Steve Casselman (ALTERA)
Polyphase filter - Why bother?
We are interested in the observed frequencies
Discrete Fourier Transform (DFT)
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i \frac{2\pi}{N} n k}$$
Broadband filter
Reduces DFT leakage, DFT scalloping loss
DFT leakage
Smears one frequency bin to all other bins
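The smearing can be reproduced with a direct evaluation of the DFT sum. The sketch below is illustrative only: the signal length, the tone frequencies and the helper names (`dft`, `tone`, `far_from_peak`) are assumptions chosen for the example, not values from the talk.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) exp(-i 2 pi n k / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]

def tone(cycles, N):
    """Real cosine completing `cycles` periods over N samples."""
    return [math.cos(2 * math.pi * cycles * n / N) for n in range(N)]

N = 32
# Tone exactly on bin 4: all energy lands in bin 4 and its mirror bin N - 4.
on_bin = [abs(v) for v in dft(tone(4.0, N))]
# Tone halfway between bins 4 and 5: energy leaks into every other bin.
off_bin = [abs(v) for v in dft(tone(4.5, N))]

def far_from_peak(mag):
    """Largest magnitude outside the bins nearest the tone and their mirrors."""
    return max(mag[k] for k in range(N) if k not in (4, 5, N - 5, N - 4))

print("on-bin tone, largest other bin: ", far_from_peak(on_bin))
print("off-bin tone, largest other bin:", far_from_peak(off_bin))
```

For the on-bin tone the remaining bins hold only rounding noise; for the off-bin tone they hold a clearly non-zero magnitude, which is the leakage the polyphase filter is meant to reduce.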
FIR filter
PPF = FIR filter + DFT
FIR vs IIR filters
FIR is the convolution of a signal x(n) and an impulse response b
$$y(n) = \sum_{i=0}^{T-1} x(n-i)\, b_i$$
+ Simpler implementation
+ Inherently stable
- More computationally demanding
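The FIR sum maps directly onto two nested loops. This is a minimal sketch, assuming samples before the start of the signal are zero; the function name `fir_filter` and the three example coefficients are hypothetical.

```python
def fir_filter(x, b):
    """y(n) = sum_{i=0}^{T-1} x(n - i) * b[i], taking x(n) = 0 for n < 0."""
    T = len(b)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(T):          # T multiply-adds per output sample
            if n - i >= 0:
                acc += x[n - i] * b[i]
        y.append(acc)
    return y

# Feeding a unit impulse returns the coefficients themselves,
# which is why b is called the impulse response.
b = [0.25, 0.5, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(fir_filter(impulse, b))
```

The output depends only on a finite window of inputs and never on earlier outputs, which is what makes the filter inherently stable; the price is T multiply-adds for every output sample.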
Hardware
Table: Specification comparison of our investigated many-core platforms.

Platform                  Memory bandwidth (GB/s)   Peak performance (GFLOP/s)
Xeon E5-2650 (2x)         102                       512
Xeon Phi 5110P            320                       1920
Fermi c2070 [1]           144                       1030
Kepler Tesla K40 [1]      288                       5045
Maxwell GTX 750 Ti [2]    86.4                      1605

[1] Scientific
[2] Low-end
Serial code
FFT libraries: MKL (CPU, Xeon Phi), cuFFT (GPU)
GPU Implementation
[Figure: memory access pattern of the GPU kernel. Labels: addresses, nChannels, taps, warps, thread ID.]
Coalesced and sequential data access
No constant memory (nothing to broadcast)
No shared memory yet (work in progress); for now L1 and L2 serve as cache
Each block processes several spectra (speed-up 1.29x)
Read-only data: const __restrict__ or __ldg() (2.41x)
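The access pattern can be sketched in plain Python. The indexing convention used here (output spectrum s, channel c, tap t reading `x[(s + t) * n_channels + c]`) is a common polyphase formulation and is an assumption, not something stated on the slides; the function names and the tiny test sizes are hypothetical as well. The loop over channels corresponds to what one thread per channel would do on a GPU, so neighbouring threads read neighbouring addresses and successive taps are `n_channels` apart.

```python
import cmath
import math

def ppf_fir_stage(x, coeffs, n_channels, n_taps):
    """FIR stage of a polyphase filter (assumed indexing, see lead-in).

    Output (s, c) combines n_taps input samples spaced n_channels apart.
    """
    n_spectra = len(x) // n_channels - n_taps + 1
    out = []
    for s in range(n_spectra):
        spectrum = []
        for c in range(n_channels):      # on a GPU: one thread per channel
            acc = 0.0
            for t in range(n_taps):      # stride of n_channels between taps
                acc += coeffs[t * n_channels + c] * x[(s + t) * n_channels + c]
            spectrum.append(acc)
        out.append(spectrum)
    return out

def dft(x):
    """Direct DFT, standing in for the FFT library call."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]

# Tiny example: 4 channels, 2 taps, all coefficients equal to one.
x = [float(v) for v in range(12)]
coeffs = [1.0] * 8
filtered = ppf_fir_stage(x, coeffs, 4, 2)
spectra = [dft(row) for row in filtered]   # PPF = FIR stage + DFT
print(filtered)
```

Both the input samples and the coefficients are only ever read in this loop, which is the property the read-only data path relies on.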
Maxwell GPU - Streams
Different implementation for each generation
Transfer latency fully hidden on scientific cards (6.2x)
Maxwell streams (only one copy engine)
Kepler streams (two copy engines)
Xeon Phi 5110P
Each thread computes 2 spectra
Data access by cache lines (1x AVX-512)
Streams on Phi. You can hide computation behind:
incoming data (bugged)
outgoing data (worked) (2.35x)
Thread affinity: compact 3.1x faster than scatter
Two operational modes:
Single mode
The accelerator computes only the polyphase filter (PPF); the time needed for data transfer is included.
Streaming mode
Data for the PPF are already on the device, and the results of the PPF are reused in another module. Transfer times are not included.
Tests were performed for a data rate of 6.5 GB/s, the estimated data rate for single channels of the SKA's Low Frequency Aperture Array.
Multiples of real-time
A measure of how fast we can process 1 s of data:
$$M = \frac{\text{one second}}{\text{time taken to process}}$$
We measure two versions of multiples of real-time.
Mb – For single mode (transfer times included)
Mc – For streaming mode (without transfer times)
For the CPU, Mb = Mc
Another bottleneck is the network
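As a worked illustration of the definition, the two variants differ only in which time goes into the denominator. The timings below are hypothetical, and the sketch assumes for simplicity that transfer and compute times simply add in single mode (no overlap); the function name is an assumption too.

```python
def multiple_of_real_time(processing_time_s, data_duration_s=1.0):
    """M = duration of the data / time taken to process it."""
    return data_duration_s / processing_time_s

# Hypothetical timings for 1 s of input data (not measured values).
compute_s = 0.20    # kernel time only
transfer_s = 0.55   # host-to-device plus device-to-host copies

M_c = multiple_of_real_time(compute_s)                # streaming mode
M_b = multiple_of_real_time(compute_s + transfer_s)   # single mode, no overlap

print(f"Mc = {M_c:.2f}, Mb = {M_b:.2f}")
```

Any value above 1 means the platform keeps up with the incoming data; a value below 1 means it falls behind.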
Multiples of real-time - Streaming mode
[Figure: multiple of real-time in streaming mode, plotted against the number of channels (256 to 4096) and against the number of taps (5 to 64), for CPU, Xeon Phi, Fermi, Kepler and Maxwell.]
Multiples of real-time
Single run:

Platform             Time (s)   Perf. (GFLOP/s)   Band. (GB/s)
Xeon E5-2650         0.053      74                37
Xeon Phi 5110P       0.028      140               70
Fermi c2070          0.025      157               79
Kepler Tesla K40     0.014      284               142
Maxwell GTX 750 Ti   0.037      106               53

Single mode and streaming mode:

                     Single mode                      Streaming mode
Platform             Mb (1/s)   PCIe usage (GB/s)     Mc (1/s)   Perf. (GFLOP/s)   Bandwidth (GB/s)
Xeon E5-2650         1.92       -                     1.92       188 (37%)         50 (49%)
Xeon Phi 5110P       0.42       5.2 (33%)             2.72       267 (14%)         71 (22%)
Fermi c2070          0.44       5.8 (36%)             3.71       364 (35%)         97 (67%)
Kepler Tesla K40     1.38       17.9 (57%)            6.89       677 (13%)         181 (63%)
Maxwell GTX 750 Ti   0.82       10.7 (34%)            2.39       234 (15%)         63 (73%)
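The percentages in the streaming-mode columns appear to be the measured value divided by the peak value from the hardware table; the slides do not say so explicitly, so treat that reading as an assumption. The short check below reproduces them from the tabulated numbers (the dictionary and function names are hypothetical).

```python
# Peak values from the hardware table: (memory bandwidth GB/s, GFLOP/s).
peak = {
    "Xeon E5-2650": (102, 512),
    "Xeon Phi 5110P": (320, 1920),
    "Fermi c2070": (144, 1030),
    "Kepler Tesla K40": (288, 5045),
    "Maxwell GTX 750 Ti": (86.4, 1605),
}

# Streaming-mode measurements from the results table: (GFLOP/s, GB/s).
streaming = {
    "Xeon E5-2650": (188, 50),
    "Xeon Phi 5110P": (267, 71),
    "Fermi c2070": (364, 97),
    "Kepler Tesla K40": (677, 181),
    "Maxwell GTX 750 Ti": (234, 63),
}

def percent_of_peak(platform):
    """Return (compute %, bandwidth %) of peak, rounded to whole percent."""
    bw_peak, flops_peak = peak[platform]
    flops, bw = streaming[platform]
    return round(100 * flops / flops_peak), round(100 * bw / bw_peak)

for name in peak:
    compute_pct, bw_pct = percent_of_peak(name)
    print(f"{name:20s} compute {compute_pct:3d}%   bandwidth {bw_pct:3d}%")
```

On the three GPUs the bandwidth fraction is much higher than the compute fraction, which is consistent with a kernel limited by memory bandwidth rather than by arithmetic.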
Conclusions
Ongoing work – shared memory; CPU blocking code
Presented at WDS 2014; peer reviewed proceedings submitted
Publication in preparation
Thank you for your attention!
Thanks to: Zdenek Stuchlík, Stanislav Hledík, John Miller and Aris Karastergiou