Powering Real-time Radio Astronomy Signal Processing with GPUs
Design of a GPU-based real-time backend for the upgraded GMRT
Harshavardhan Reddy Suda
NCRA, India
Pradeep Kumar Gupta
NVIDIA, India
Collaborating teams :
NCRA, Pune, India :
• Yashwant Gupta
• B. Ajithkumar Nair
• Harshavardhan Reddy Suda
• Sanjay Kudale
Swinburne University, Australia :
• Andrew Jameson
• Ben Barsdell
NVIDIA, India :
• Pradeep Kumar Gupta
Acknowledgements :
Matthew Bailes (Swinburne)
Jayanta Roy (NCRA)
Amit Bansod (IIT Bombay)
Dharam Vir Lal (NCRA)
Outline :
Introduction
• Overview of the GMRT
• GMRT receiver system
Digital Back-end for the GMRT
• Existing Digital Back-end
• Upgrade specifications and compute requirements
Upgraded Digital Back-end
• Design and Development
• GPU and IO performance
• Results
Future Prospects
Introducing the GMRT
The Giant Metrewave Radio Telescope (GMRT) is a world-class instrument for studying astrophysical phenomena at low radio frequencies (50 to 1450 MHz)
Located 80 km north of Pune, 160 km east of Mumbai
Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths; the largest in the world at these frequencies
Operational since 2001
Frequency range :
• 130-170 MHz
• 225-245 MHz
• 300-360 MHz
• 580-660 MHz
• 1000-1450 MHz
Effective collecting area (2-3% of SKA) :
• 30,000 sq m at lower frequencies
• 20,000 sq m at highest frequencies
Supports 2 modes of operation :
• Interferometry, aperture synthesis
• Array mode (incoherent & coherent)
GMRT is used by astronomers from all over the world for various kinds of astrophysical studies
The GMRT : A Quick Overview
[Array layout : central 1 km x 1 km compact square plus three arms, each ~14 km long]
GMRT Receiver System
Dual polarized feeds
Super-heterodyne receiver chain : IF & baseband sections
Tunable LO (30 – 1700 MHz)
Maximum IF bandwidth : 32 MHz
Digital Back-end : correlator (for imaging) + beamformer (for pulsar studies)
Digital Back-end
Signal flow, for antennas 1 to M :
• ADC sampling of each antenna signal
• Delay correction (integer clocks)
• FFT
• Phase correction
• Multiply and Accumulate (correlator) / Beamformer
• Data storage and analysis
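The delay/FFT/phase/MAC pipeline above is the classic FX correlator design. A minimal NumPy sketch (illustrative only; the antenna count, channel count, and function name here are placeholders, and the real backend runs these stages on GPUs) might look like:

```python
import numpy as np

def fx_correlate(signals, nchan=1024):
    """Minimal FX correlator sketch: channelize each antenna signal with
    an FFT (F stage), then multiply-and-accumulate every antenna pair per
    channel (X stage). signals: (n_ant, n_samples) real-valued voltages."""
    n_ant, n_samp = signals.shape
    n_spec = n_samp // nchan
    # F stage: segment each signal into blocks of nchan samples, FFT each block.
    blocks = signals[:, :n_spec * nchan].reshape(n_ant, n_spec, nchan)
    spectra = np.fft.rfft(blocks, axis=2)
    # X stage: accumulate cross-power for each baseline (i, j), i <= j,
    # giving M(M+1)/2 visibility spectra (autocorrelations included).
    vis = {}
    for i in range(n_ant):
        for j in range(i, n_ant):
            vis[(i, j)] = np.sum(spectra[i] * np.conj(spectra[j]), axis=0)
    return vis

rng = np.random.default_rng(0)
v = fx_correlate(rng.standard_normal((4, 16384)))
print(len(v))  # 4*5/2 = 10 baselines
```

The delay and phase corrections of the real pipeline would slot in before and after the FFT respectively; they are omitted here to keep the sketch short.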
Existing Digital Back-end : The GMRT Software Backend
Roy et al (2010)
Software-based back-ends :
• Few made-to-order hardware components ; mostly off-the-shelf items
• Easier to program ; more flexible
The GMRT Software Back-end (GSB) :
• 32 antennas
• 32 MHz bandwidth, dual polarisation
• Net input data rate : 2 Gsamples/sec
• FX correlator + beamformer
• Uses off-the-shelf ADC cards, CPUs & switches to implement a fully real-time back-end
• Current status : now working as the observatory back-end
Looking ahead : The GMRT upgrade
Seamless frequency coverage from ~30 MHz to 1500 MHz
Increased instantaneous bandwidth of 400 MHz (from the present maximum of 32 MHz), requiring a modern new digital back-end receiver
Compute requirements for the correlator pipeline (Sampler -> Fourier Transform -> Phase Correction -> MAC), for M = 64 antenna signals :

Stage                           GSB (32 MHz BW)             Upgraded digital back-end (400 MHz BW)
Fourier Transform, O(NlogN)     2k point FFT : 181 GFlops   16k point FFT : 2.9 TFlops
Phase Correction                8.5 GFlops                  0.1 TFlops
MAC, M(M+1)/2                   560 GFlops                  6.6 TFlops
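The FFT and MAC figures can be approximately reproduced with a back-of-envelope sketch; the flop-count constants used below (4·N·log2(N) per real-input FFT, 8 flops per complex multiply-accumulate) are assumptions on my part, not stated in the talk:

```python
from math import log2

def correlator_flops(bw_hz, nfft, n_signals):
    """Rough FX-correlator compute requirement.
    Assumes 4*N*log2(N) flops per FFT and 8 flops per complex MAC."""
    sample_rate = 2 * bw_hz                                   # real (Nyquist) sampling
    fft = 4 * log2(nfft) * sample_rate * n_signals            # F stage, per second
    mac = 8 * (n_signals * (n_signals + 1) // 2) * bw_hz      # X stage: M(M+1)/2 products per channel-sample
    return fft, mac

M = 64  # 32 antennas x 2 polarisations
for name, bw, nfft in [("GSB", 32e6, 2048), ("Upgrade", 400e6, 16384)]:
    fft, mac = correlator_flops(bw, nfft, M)
    print(f"{name}: FFT {fft/1e9:.0f} GFlops, MAC {mac/1e9:.0f} GFlops")
```

With these constants the sketch lands close to the quoted numbers (FFT ~180 GFlops vs 2.9 TFlops, MAC in the hundreds of GFlops vs 6.6 TFlops); the exact slide figures presumably use slightly different accounting.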
Why GPUs?
Total computation requirement for GSB ~750 GFlops
For upgraded digital backend ~10 TFlops
Given the peak single-precision floating point performance of the Fermi C2050 and the Kepler K20, ten C2050s or three K20s should be enough, provided the IO requirement can be handled.
IO requirement for upgraded digital back-end ~25 GB/s
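The ~25 GB/s figure is consistent with 64 input signals Nyquist-sampled at 800 Msps and packed to 4 bits/sample (the packing choice is taken from the packetizer description elsewhere in the talk, so treat this as a consistency check rather than the authors' exact derivation):

```python
# Aggregate input data rate into the correlator cluster.
n_signals = 64           # 32 antennas x 2 polarisations
sample_rate = 800e6      # samples/s for 400 MHz real bandwidth (Nyquist)
bits_per_sample = 4
total_bytes = n_signals * sample_rate * bits_per_sample / 8
print(f"Aggregate input rate: {total_bytes/1e9:.1f} GB/s")  # → Aggregate input rate: 25.6 GB/s
```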
Upgraded Digital Back-end design
Per-antenna signal chain (antennas 1 to 32, each delivering 400 MHz in 2 polarisations) :
• ADC (2 channels)
• FPGA (packetizer)
• CPU+GPU node (correlator)
All CPU+GPU nodes are interconnected through an Infiniband switch, with a separate data acquisition and control system.
Upgraded Digital Back-end development
Xilinx Virtex-5 FPGA boards with ADC cards connected
CPU hosts : DELL T7500; Myricom 10GbE NICs; Infiniband interconnect using an 8-port Mellanox switch
GPU cards : Tesla C2050 and K20
Work done in collaboration with NVIDIA, India and Swinburne University, Australia
Prototype 8-antenna system
Benchmarking results
Benchmarking on a single Fermi (C2050) :

Operation         Performance (GFLOPS)
FFT (2k point)    330
FFT (4k point)    322
FFT (8k point)    233
Phase shifting    167
MAC               340

Overall sustained performance with the Fermi C2050 is nearly 1/3rd of its peak single-precision floating point performance
Total number of Fermi C2050s required for the full correlator : ~30
Kepler K20 benchmarking : 33% improvement over the Fermi C2050
The number of GPUs required reduces to approximately twenty
Optimizing the code for the K20 will further reduce the number of GPUs required
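As a rough cross-check of the GPU counts (assuming, per the slide, sustained throughput of about one third of the C2050's ~1.03 TFlops single-precision peak, and a 33% sustained gain for the K20):

```python
# Estimate of GPUs required for the ~10 TFlops upgraded backend,
# from sustained (not peak) per-card throughput.
total_req = 10e12                  # total compute requirement, flops/s
c2050_sustained = 1.03e12 / 3      # ~1/3 of C2050 single-precision peak
n_c2050 = total_req / c2050_sustained
n_k20 = total_req / (c2050_sustained * 1.33)   # K20: 33% better sustained
print(round(n_c2050), round(n_k20))  # → 29 22
```

This matches the quoted ~30 C2050s and roughly twenty K20s.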
I/O considerations and MPI performance
From each FPGA board to its GPU host, the data rate is a constant 800 MB/s (for 200 MHz BW at 8 bits/sample, or 400 MHz BW at 4 bits/sample)
Between GPU hosts, the fraction of data (for each node) to be shared with the other nodes increases with the number of nodes (M) as (M-1)/M
Bi-directional bandwidth achieved on the 10 GbE interconnect between four nodes : 1.3 GB/s. This is not sufficient for more than four nodes.
Benchmarks on a GPU cluster with a Mellanox Infiniband interconnect give ~5 GB/s. For a 32-node cluster, the memory transfer takes 32% of the time.
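The (M-1)/M scaling comes from the corner-turn between nodes: each node keeps the 1/M of its data it will process itself and ships the rest. A tiny sketch tabulates the fraction:

```python
# Fraction of each node's data that must cross the interconnect
# in an M-node corner-turn: (M-1)/M, approaching 100% as M grows.
def shared_fraction(m):
    return (m - 1) / m

for m in (2, 4, 8, 16, 32):
    print(m, f"{100 * shared_fraction(m):.3f}%")
```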
Data I/O considerations
[Plots : No. of nodes vs % of data shared (50, 75, 87.5, 93.75 and 96.875 % for 2, 4, 8, 16 and 32 nodes), and No. of nodes vs % of time spent on data sharing]
Software Model : OpenMP, MPI and CUDA
The code uses OpenMP; parallel processing is done with multiple threads.
A main thread spawns a data acquisition path (shared memory read, MPI transfer) and a data processing path (correlation, visibility dump).
GPU features used
- Asynchronous data transfer with streams
- Pinned memory to achieve high H2D transfer bandwidth
- Shared memory to enhance MAC
- CUFFT library
First light image from the GPU Correlator
Image of 3C147 made from 4 hrs of observations with 8 antenna inputs (single polarisation)
RF : 1280 MHz; BW : 30 MHz
RMS noise : 7 mJy
Sample result from the new wideband signal path
First GMRT image using 100 MHz RF BW at L-band
RMS noise : 3 mJy
Courtesy Sanjay Kudale and Dharam Vir Lal
Future Plans
Build a 16-node cluster for 30 antennas of the GMRT, with each node having two 10GbE cards for data acquisition, latest-generation Kepler GPU cards for processing, and an Infiniband network for inter-node data transfer
Implement RFI filtering and Digital Down Conversion schemes
Tune the code and do performance benchmarking (DevTech support from NVIDIA)
Explore GPU features :
• GPUDirect RDMA
• Increased resources in Kepler
Proposed Plan : 400 MHz BW, dual pol, 32 antennas
• 32 ROACH packetizer boards (ROACH 1 to ROACH 32), each streaming 800 MB/s
• 16 CPU-GPU nodes (Node 1 to Node 16), two ROACH streams per node
• Infiniband switch (40 Gbps) for inter-node data transfer
• Server machine for data acquisition and control