
Multithreaded Data Transport


Poster presentation from the HPC India conference in 2009



High Speed Data Ingestion and Processing for MWA

The Murchison Widefield Array (MWA) is a low-frequency radio telescope currently being deployed in Western Australia using 512 dipole-based antennas. With over 130,000 baselines and around 800 fine frequency channels, there is a significant computational challenge facing the Real Time System (RTS) software. A prototype system with 32 antennas is presently being used to test the hardware and software solutions from end-to-end.

Before calibration and imaging can occur, the RTS must ingest and integrate correlated data at high speed, around 0.5 Gbit/s per network interface on a Beowulf-style cluster. The data is transferred using UDP packets over Gigabit Ethernet, with as close to zero data loss as possible.

School of Physics, University of Melbourne, Victoria 3010, Australia [email protected]

Stewart Gleadow (and the team from MWA)

MWA REAL TIME SYSTEM

The MWA hardware correlator sends out packet data representing a full set of visibilities and channels every 50 ms, which means only tens of µs per packet. The RTS runs on an 8 second cadence, so visibilities need to be integrated to this level.
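(For scale, using figures quoted elsewhere on this poster: 8 s at a 50 ms dump rate means roughly 160 correlator dumps are accumulated per RTS cadence, and ~0.5 Gbit/s per node carried in 1540-byte packets corresponds to roughly 40,000 packets per second, i.e. one packet every ~25 µs.)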

In order to avoid overflows or loss in the network card and kernel memory, a custom buffering system is required. The goal is to allow the correlator, network interface and the main RTS calibration and imaging to run in parallel, without losing data in between.

UDP does not guarantee successful transmission, but in our testing, with a direct Gigabit Ethernet connection (no switch), there is no packet loss other than from buffer overflows. This only occurs when packets are not read from the network interface fast enough.
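The poster does not show the RTS networking code, but a minimal sketch of the kind of socket setup implied here is given below: open a UDP socket and ask the kernel for a large receive buffer, so that brief stalls in the reader do not immediately cause overflows. The port number, buffer size and function name are placeholders, not values taken from the MWA system.

/* Sketch only: UDP receive socket with an enlarged kernel buffer.
 * The port and buffer size are illustrative placeholders. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int open_ingest_socket(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return -1; }

    /* First layer of buffering: a large socket receive buffer
     * (subject to the kernel's net.core.rmem_max limit). */
    int rcvbuf = 8 * 1024 * 1024;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(4001);    /* placeholder port */

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return -1;
    }
    return sock;
}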

DATA INGESTION CHALLENGE

When approaching link capacity, one thread is dedicated to constantly reading packets from the network interface to avoid buffer overflows and packet loss. In order to operate at close to Gigabit speeds, a hierarchy of parallel threads is required.
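A sketch of what such a dedicated reader thread might look like is shown below; it does nothing but pull datagrams off the socket into preallocated slots, so the kernel buffer is drained as quickly as possible. The slot count and the hand-off function are hypothetical names, not taken from the RTS source; the 1540-byte packet size is the one quoted later on this poster.

/* Sketch only: dedicated packet-reader thread.  N_SLOTS and
 * hand_off_packet() are invented for illustration. */
#include <sys/types.h>
#include <sys/socket.h>

#define PACKET_BYTES 1540
#define N_SLOTS      4096

extern void hand_off_packet(char *pkt, ssize_t len);   /* next stage (assumed) */

void *packet_reader(void *arg)
{
    int sock = *(int *)arg;                 /* socket opened as sketched above */
    static char slots[N_SLOTS][PACKET_BYTES];
    unsigned long i = 0;

    for (;;) {
        char *pkt = slots[i++ % N_SLOTS];
        ssize_t len = recv(sock, pkt, PACKET_BYTES, 0);  /* block for next datagram */
        if (len < 0)
            continue;                        /* transient error: keep reading */
        hand_off_packet(pkt, len);           /* minimal per-packet work */
    }
    return NULL;
}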

Buffering all packets for the full 8 seconds would impose heavy memory requirements. Hence, an intermediate thread integrating to a mid-level time resolution is required.
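To put rough numbers on this, using the ~0.5 Gbit/s per-node rate quoted elsewhere on the poster: holding raw packets for a full 8 s cadence would take roughly 0.5 GB per node, whereas buffering only 1 s of raw data before integration needs around 60 MB.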

Theoretical network performance is difficult to achieve using small packets, because the overhead of encoding, decoding and interrupt notification becomes too much for the network interface and operating system.

THREADED HIERARCHY

Each thread in the hierarchy does only a small amount of processing, so that it operates quickly while still producing the higher-level data products required by the rest of the calibration and imaging processes.

Each thread uses double buffers (shown in the diagram below): one set of data is currently being filled by the thread, while another, already full, is being passed on to the next level. This allows the threads to operate in parallel, while each set of data still passes through each phase in the order it arrived from the correlator.
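A minimal sketch of one such double-buffer hand-off between two threads (for example, the packet reader and the visibility integrator) is given below, using POSIX threads. The buffer type, sizes and function names are invented for illustration and are not taken from the RTS code.

/* Sketch only: double-buffered hand-off between a producer and a consumer
 * thread.  One buffer is being filled while the other, already full, is
 * processed.  All names and sizes are illustrative. */
#include <pthread.h>

#define BUF_SAMPLES (1 << 20)               /* illustrative buffer size */

typedef struct {
    float data[BUF_SAMPLES];
    int   ready;                            /* 1 when full and awaiting the consumer */
} vis_buffer;

static vis_buffer bufs[2];
static int fill_idx = 0;                    /* buffer the producer is filling */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  full = PTHREAD_COND_INITIALIZER;

/* Producer: called when the buffer being filled reaches its time boundary
 * (e.g. the end of an integration period).  A fuller version would also
 * check that the other buffer has been released before reusing it. */
void producer_swap(void)
{
    pthread_mutex_lock(&lock);
    bufs[fill_idx].ready = 1;
    fill_idx = 1 - fill_idx;                /* start filling the other buffer */
    pthread_cond_signal(&full);
    pthread_mutex_unlock(&lock);
}

/* Consumer thread: wait for a full buffer, process it outside the lock,
 * then release it and move on to the other buffer. */
void *consumer(void *arg)
{
    (void)arg;
    int idx = 0;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!bufs[idx].ready)
            pthread_cond_wait(&full, &lock);
        pthread_mutex_unlock(&lock);

        /* ... integrate or image bufs[idx].data here ... */

        pthread_mutex_lock(&lock);
        bufs[idx].ready = 0;                /* hand the buffer back */
        pthread_mutex_unlock(&lock);
        idx = 1 - idx;                      /* next full buffer is the other one */
    }
    return NULL;
}

Because whole buffers are exchanged only at fixed time boundaries, each stage can run flat out on its own buffer without fine-grained locking on every packet.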

CONCLUSION

While the new generation of radio telescopes poses great computational challenges, it is also pushing the boundaries of network capacity and performance. A combination of high-quality network hardware and multi-core processors is required to receive and process data simultaneously. Depending on the level of processing and integration required, and as a trade-off between memory usage and performance, parallel threads may be needed at multiple levels.

The architecture described above has been tested on Intel processors and network interfaces, running Ubuntu Linux, and successfully receives, processes and integrates many gigabytes of data without missing a single packet. Further work involves testing the architecture in a switched network environment and deploying the system in the field in late 2009.


The MWA radio telescope requires the interaction of hardware and software systems at close to link capacity, with minimal transmission loss and maximum throughput. Using the parallel thread architecture described here, we aim to operate high-speed network connections and process data products simultaneously.

[Figure: Basic structure of the MWA, from antennas to output data products. Antennas/beamformers, receivers and the correlator form the hardware side, feeding the Real Time System and output/storage on the software side; the main high-speed hardware-to-software interface is at the input from the correlator to the RTS.]

For the 32-tile demonstration, each of the four computing nodes receives:
•  correlations for both polarizations from all antennas
•  192 x 40 kHz frequency channels
•  ~0.5 Gbit/s of data
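As a rough consistency check, assuming four polarization products per baseline and 8-byte single-precision complex visibilities (neither of which is stated on the poster): 32 tiles give 528 baselines including autocorrelations, so 528 baselines x 4 products x 192 channels x 8 bytes is about 3.2 MB per 50 ms correlator dump, or roughly 0.5 Gbit/s, consistent with the figure above.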

[Figure: The threaded hierarchy with double buffering. The correlator feeds a packet reader (one packet per ~20 µs), a visibility integrator (20 µs to 1 s) and the main RTS (1 s to 8 s, on the 8 s cadence); each stage alternates between Buffer One and Buffer Two.]

[Plots: effective bandwidth (Mbit/sec) and percentage packet loss (%) against UDP datagram size (bytes), for payloads up to 2000 bytes. Tests performed by Steve Ord, Harvard-Smithsonian Center for Astrophysics.]

The poor network performance for small packets is caused by the kernel becoming flooded with interrupts faster than it can service them, to the point where not all interrupts are handled and packets start to be dropped. These results prompted a move from 388-byte packets (the original size) to 1540-byte packets.
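At the ~0.5 Gbit/s per-node rate quoted above, 388-byte payloads correspond to roughly 160,000 packets (and hence interrupts) per second, while 1540-byte payloads reduce this to around 40,000 per second, roughly a four-fold drop in per-packet overhead.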
