Solving Network Throughput Problems at the Diamond Light Source

Alex White (alex.white@diamond.ac.uk)

Campus network engineering workshop, 19/10/2016

Introduction to Diamond Light Source

So, what do we actually do?

The Diamond machine is a type of particle accelerator

CERN = smash high-energy particles together and analyse the “crash”!

Diamond = accelerate electrons to produce synchrotron light

Use this light to study matter – like a “super microscope”

Three particle accelerators:

Linear accelerator

Booster Synchrotron

Storage ring (48 straight sections angled together, 562 m long)

The Diamond machine

Simultaneous Experiments

Data-intensive research

Lustre and GPFS filesystems: 430 TB, 900 TB, 3.3 PB as of 2016

Typical X-ray camera: 4 MB frames at 100 Hz

An experiment can easily produce 300 GB-1 TB

Scientists want to take their data home
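As a back-of-the-envelope sketch of what those detector figures imply (the frame size and rate are taken from the bullets above; the one-hour run length is an assumed example, not a figure from the talk):

    # Rough data-rate arithmetic for a single detector (illustrative only).
    frame_bytes = 4 * 1024 * 1024      # 4 MB per frame (from the slide)
    frame_rate_hz = 100                # 100 frames per second (from the slide)

    bytes_per_second = frame_bytes * frame_rate_hz
    print(f"~{bytes_per_second / 1e6:.0f} MB/s, "
          f"~{bytes_per_second * 8 / 1e9:.1f} Gb/s on the wire")

    # An assumed hour of continuous collection lands in the hundreds of GB
    # to TB range, consistent with the per-experiment volumes above.
    one_hour_bytes = bytes_per_second * 3600
    print(f"~{one_hour_bytes / 1e12:.1f} TB per hour")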

Site Limitations

Scientific data download speeds from Diamond to visiting users' institutes were inconsistent and slow, even though the facility had a “10Gb/s” JANET connection from STFC.

The limit on download speeds was delaying post-experiment data analysis by academics at their home institutes.

How did we characterise the problem?

We set ourselves an initial target of “a stable 50 Mb/s over a 10 ms path”

Initial Findings

10 Gb/s inside our network, with no packet loss

Low speeds found with iperf over the STFC/JANET segment between Diamond's edge and the Physics Department at Oxford

We saw a small amount of packet loss over the STFC/JANET link
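A minimal sketch of the kind of host-to-host test used here, assuming iperf3 is installed at both ends and a server is already listening; the host name is a placeholder, not a real endpoint:

    import json
    import subprocess

    # Run a 30-second TCP test against a (hypothetical) remote iperf3 server
    # and report achieved throughput plus retransmissions, which are a good
    # hint that packets are being lost somewhere on the path.
    result = subprocess.run(
        ["iperf3", "-c", "perfsonar.example.ac.uk", "-t", "30", "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    sent = report["end"]["sum_sent"]
    print(f"throughput: {sent['bits_per_second'] / 1e6:.0f} Mb/s, "
          f"retransmits: {sent['retransmits']}")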

TCP Performance and the Mathis equation

Packet size

Latency (AKA round-trip time)

Packet loss

“Interesting” effects of packet loss


According to the Mathis equation, to achieve our initial goal of 50 Mb/s over a 10 ms path, the maximum tolerable packet loss is 0.026%.
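That figure comes from the Mathis et al. estimate for loss-limited TCP throughput, roughly throughput ≈ (MSS / RTT) × (C / sqrt(p)) with C close to 1. A minimal sketch of the same calculation; the segment size and constant below are assumptions, so the percentage it prints depends on them and may differ slightly from the 0.026% quoted above:

    # Mathis et al. model: throughput ~= (MSS / RTT) * (C / sqrt(p)).
    # Rearranged to ask: how much packet loss (p) can we tolerate and still
    # hit a target throughput?  MSS and C here are assumed values.
    def max_tolerable_loss(target_bps, rtt_s, mss_bytes=1460, c=1.0):
        sqrt_p = c * mss_bytes * 8 / (rtt_s * target_bps)
        return sqrt_p ** 2

    # Initial goal from the talk: a stable 50 Mb/s over a 10 ms path.
    p = max_tolerable_loss(50e6, 0.010)
    print(f"max tolerable loss ~ {p:.3%}")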

Finding the problem – the Last Mile

We worked with STFC to connect a perfSONAR server directly to the Harwell site border router.

Tests with this extra server allowed us to pinpoint the STFC firewall (our “last mile”) as the source of the insidious packet loss.

The Fix: Science DMZ

Globus GridFTP

Uses parallel TCP streams

Simple, web-based interface
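A rough sketch of why parallel streams help on a path with residual loss. This is not the GridFTP implementation itself, just the Mathis model applied per stream; the segment size, loss rate and link capacity below are assumed values:

    # Under the Mathis model each TCP stream is individually capped by loss,
    # so N parallel streams can approach N times the single-stream ceiling
    # until the link itself becomes the bottleneck.
    def mathis_throughput_bps(mss_bytes, rtt_s, loss, c=1.0):
        return c * mss_bytes * 8 / (rtt_s * loss ** 0.5)

    single = mathis_throughput_bps(1460, 0.010, 0.0005)   # assumed 0.05% loss
    link_capacity = 10e9                                   # assumed 10 Gb/s link
    for n in (1, 4, 8, 16):
        aggregate = min(n * single, link_capacity)
        print(f"{n:>2} streams: ~{aggregate / 1e6:.0f} Mb/s")

This is also why a single stream over a path with even a tiny amount of loss never fills a 10 Gb/s link, while a multi-stream transfer comes much closer.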

Performance with Science DMZ

Test data: 2 Gb/s+ consistently between DLS and Brookhaven National Laboratory (USA)!

Actual transfers in August 2016:

Fastest: crystallography dataset from DLS to Newcastle: 260 GB @ 480 Mb/s

Biggest: electron microscope data from DLS to Imperial: 1120 GB @ 290 Mb/s

Security in the Science DMZ

In Summary

1. Use real-world testing to find packet loss

2. Zero packet loss is crucial

3. The last mile is usually the problem

4. Firewalls have been shown to introduce packet loss – this is backed up by ESnet's own testing

5. Don't use SCP, as the common implementation has a fixed TCP window size – it will never grow to fill your link
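The window-size point is worth a number: regardless of loss, a TCP connection cannot exceed window ÷ RTT. A minimal illustration; the 2 MB window below is an assumed figure for stock OpenSSH's internal buffer, used only to show the shape of the limit:

    # Regardless of packet loss, TCP throughput is capped at window / RTT.
    # If the window never grows (as with stock scp's fixed internal buffer),
    # the ceiling falls as the path gets longer.  2 MB is an assumed figure.
    def window_limited_bps(window_bytes, rtt_s):
        return window_bytes * 8 / rtt_s

    for rtt_ms in (1, 10, 100):
        ceiling = window_limited_bps(2 * 1024 * 1024, rtt_ms / 1000)
        print(f"RTT {rtt_ms:>3} ms: ceiling ~ {ceiling / 1e9:.2f} Gb/s")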
