Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
How do we get the broken bits in the middle fixed?
Finding the person responsible for a broken router on the internet is hard.
12. What about the Physicists?
The LHC moves 20 PBytes/year across the internet to its processing sites.
Not really.
13. Dedicated 10GigE networking between CERN and the 10 Tier 1 centres.
Even with dedicated paths, it is still hard.
Multiple telcos involved, even for a point-to-point link.
14. Constant monitoring / bandwidth tests to ensure it stays working.
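A minimal sketch of such a periodic bandwidth check, assuming an iperf3 server is already running at the far end; the hostname, alert threshold and test interval are illustrative placeholders, not values from the talk:

import json
import subprocess
import time

IPERF_SERVER = "iperf.example.org"  # placeholder endpoint

def measure_throughput_gbps(server, seconds=10):
    """Run one iperf3 test and return the sender throughput in Gbit/s."""
    out = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(out.stdout)
    return report["end"]["sum_sent"]["bits_per_second"] / 1e9

while True:
    gbps = measure_throughput_gbps(IPERF_SERVER)
    print(f"{time.ctime()}: {gbps:.2f} Gbit/s")
    if gbps < 5.0:  # arbitrary alarm threshold for a nominal 10 Gbit/s path
        print("WARNING: link well below expected rate")
    time.sleep(3600)  # re-test hourly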
15. See HEPIX talks for gory details.
16. We need bigger networks:
A fast network is fundamental to moving data.
17. Is it the only thing we need to do?
18. Sanger Production Pipeline
Provides a nice example of moving large amounts of data in real life.
19. Sequencing data flow
[Diagram: sequencer → analysis/alignment → internal repository (Sanger) → EGA / SRA (EBI).]
20. Data movement between Sanger/EBI
This should be easy...
We are on the same campus.
21. 10 Gbit/s (1.2 GByte/s) link between EBI and Sanger (back-of-envelope numbers below).
22. We share a data-centre.
23. Physically near, so we do not need to worry about WAN issues.
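As a back-of-envelope sketch of what that link buys, assuming (purely for illustration) that about 70% of the line rate is achievable once protocol and disk overheads are paid:

LINK_GBPS = 10      # nominal line rate
EFFICIENCY = 0.7    # assumed usable fraction, not a measured figure

def hours_to_move(terabytes):
    """Hours to move a dataset over the link at the assumed effective rate."""
    effective_bytes_per_s = LINK_GBPS * 1e9 / 8 * EFFICIENCY
    return terabytes * 1e12 / effective_bytes_per_s / 3600

for tb in (1, 10, 100):
    print(f"{tb:>4} TB -> {hours_to_move(tb):6.1f} hours")
# 100 TB -> ~32 hours at ~0.9 GByte/s effective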
24. It is not just networks:
Speed will only be as fast as the slowest link (illustrated below).
25. Speed was not a design point for our holding area.
$ per TB was the overriding design goal, not speed.
[Diagram: end-to-end path — Sanger disk → network → server → firewall → Internet → firewall → server → network → EBI disk.]
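A toy illustration of the bottleneck point; the per-stage rates here are invented for the example, not measurements of the real systems:

stages_mbytes_per_s = {
    "source disk array": 400,
    "source network": 1200,        # ~10 Gbit/s
    "firewall": 600,
    "destination network": 1200,
    "destination disk ($/TB-optimised)": 150,
}

bottleneck = min(stages_mbytes_per_s, key=stages_mbytes_per_s.get)
print(f"End-to-end rate ~{stages_mbytes_per_s[bottleneck]} MByte/s, limited by: {bottleneck}")
# A 10 Gbit/s network buys nothing if the holding area can only sustain 150 MByte/s.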
26. Organisational issues:
Data movement was not considered until after Sanger/EBI started
building the systems.
Hard to do fast data transfers if your disk subsystem is not up
to the job.
Expectation management:
How fast should I be able to move data?
Good communication.
Multi-institute teams.
27. Need to take end-to-end ownership across institutions.
Application led:
Nobody cares about raw data rates; they care how fast their application can move data.
28. Need application developers and sys-admins to work together.
This needs to be in place before you start projects!
29. Do we need to move the data?
30. Centralised data
[Diagram: multiple sequencing centres all feeding data into a central Sequencing Centre + DCC.]
31. Example Problem:
We want to run our pipeline across 100 TB of data currently in EGA/SRA.
32. We will need to de-stage the data to Sanger, and then run the compute.
Extra 0.5 PB of storage, 1000 cores of compute (rough sizing sketch below).
33. 3 month lead time.
34. ~$1.5M capex.
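A rough sizing sketch of that example; the transfer efficiency and unit costs are assumptions chosen only to show the shape of the calculation:

DATA_TB = 100
LINK_GBPS = 10
EFFICIENCY = 0.7          # assumed usable fraction of the link
STORAGE_TB = 500          # the extra 0.5 PB of working storage
COST_PER_TB = 1000        # assumed $/TB for the storage tier
CORES = 1000
COST_PER_CORE = 1000      # assumed $/core including chassis, network, power

transfer_days = DATA_TB * 1e12 / (LINK_GBPS * 1e9 / 8 * EFFICIENCY) / 86400
capex = STORAGE_TB * COST_PER_TB + CORES * COST_PER_CORE
print(f"De-stage time : ~{transfer_days:.1f} days")
print(f"Capex estimate: ~${capex / 1e6:.1f}M (storage + compute)")
# The transfer itself is about a day; buying and installing the kit behind it
# is months of lead time and most of the cost.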
35. Federation: A Better way:
Collaborations are short term: 18 months-3 years.
[Diagram: sequencing centres linked by federated access rather than copies held centrally.]
36. Federation software:
Data size per genome: intensities / raw data (2 TB), alignments (200 GB), sequence + quality data (500 GB), variation data (1 GB), individual features (3 MB).
Unstructured data (flat files): iRODS (data grid software). Structured data (databases): BioMart.
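A hypothetical sketch of the federated-access pattern: look up which centre already holds a dataset and work there, instead of copying everything to one site. The Site class, centre names and dataset IDs are invented for illustration; real deployments would use tools such as iRODS or BioMart:

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    datasets: set          # dataset IDs this centre holds locally

SITES = [
    Site("centre_a", {"EGAD001", "EGAD002"}),
    Site("centre_b", {"EGAD003"}),
    Site("centre_c", {"EGAD002", "EGAD004"}),
]

def locate(dataset_id):
    """Return the first federated site that holds the dataset."""
    for site in SITES:
        if dataset_id in site.datasets:
            return site
    raise LookupError(f"{dataset_id} not held by any federated site")

print(locate("EGAD003").name)   # -> centre_b: run the analysis there, do not copy 100 TB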
37. Cloud / Computable archives
Can we move the compute to the data?
Upload workload onto VMs.
38. Put VMs on compute that is attached to the data.
[Diagram: VMs placed on CPUs directly attached to the data, rather than data shipped to remote CPUs.]
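A small sketch of why shipping the compute beats shipping the data: compare the bytes that cross the WAN in each case. The dataset, VM image and result sizes are assumptions for illustration:

DATASET_TB = 100       # archive-resident data the pipeline needs
VM_IMAGE_GB = 20       # assumed size of the packaged workload (VM image)
RESULTS_GB = 500       # assumed size of the results brought home

move_data_tb = DATASET_TB                            # classic de-staging
move_compute_tb = (VM_IMAGE_GB + RESULTS_GB) / 1000  # upload VM, download results

print(f"Move the data   : ~{move_data_tb:.0f} TB over the WAN")
print(f"Move the compute: ~{move_compute_tb:.2f} TB over the WAN")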
39. Summary
We need fast network links.
40. We need cross-site teams who can troubleshoot all potential trouble spots.
41. Teams need application & systems people.
42. Acknowledgements:
The HEPIX Community.
http://www.hepix.org
Team ISG:
James Beal
43. Gen-Tao Chiang
44. Pete Clapham
45. Simon Kelley