Details of our first attempts to deal with our next-generation sequencing machines. Talk given at International Supercomputing, 2008.
1. Cluster Filesystems and the next 1000 Human Genomes
Guy Coates, Wellcome Trust Sanger Institute
2. Introduction

3. About the Institute
- Funded by the Wellcome Trust.
- 2nd largest research charity in the world.
- ~700 employees.
- Large-scale genomic research.
- Sequenced 1/3 of the human genome (largest single contributor).
- Active cancer, malaria, pathogen and genomic variation studies.
- All data is made publicly available.
- Websites, FTP, direct database access, programmatic APIs.
6. New technology sequencing

7. Sequencing projects at the Sanger
The Human Genome Project:
- Worldwide collaboration: 6 countries, 5 major centres, many smaller labs.
- 13 years.
1000 Genomes Project:
- Study variation in human populations.
- 1000 genomes over 3 years by 5 centres.
- We have agreed to do 200 genomes.
And the rest:
- Cancer, malaria, pathogen, worm, human variation (WTCCC2), etc.
11. How is this achievable? Moore's Law of Sequencing
- Cost of sequencing halves every 2 years.
- Driven by multiple factors.
Economies of scale:
- Human Genome Project: 13 years, 23 labs, $500 million.
- Cost today: $10 million, several months in a single large genome centre.
New sequencing technologies:
- Illumina/Solexa machines.
- $100,000 for a human genome.
- Single machine, 3 days.
16. New sequencing technologies
Capillary sequencing:
- 96 sequencing reactions carried out per run.
- 0.5-1 hour run time.
Illumina sequencing:
- 52 million reactions per run.
- 3 day run time.
Machines are cheap(ish) and small.
19. Data centre
- 4 x 250 m² data centres.
- 2-4 kW/m² cooling.
- 3.4 MW power draw.
- Overhead aircon, power and networking.
- Allows counter-current cooling; more efficient.
Technology refresh:
- 1 data centre is an empty shell.
- Rotate into the empty room every 4 years.
- Refurb one of the in-use rooms with the current state of the art.
- "Fallow field" principle.
23. Highly disruptive
The sequencing centre runs 24x7.
Peak capacity of capillary sequencing:
Current Illumina sequencing:
- 262 Gbases/month in April.
- 1 Tbase/month predicted for September.
- Total sequence deposited in GenBank, for all time.
- 75x increase in sequencing output.

25. Gigabase != Gigabyte
We store ~8 bytes of data per base:
- Quality, error, and experimental information.
- ~10 TB/month of permanent archival storage.
Raw data from the machines is much larger:
- June 2007: 15 machines, 1 TB every 6 days.
- Sept 2007: 30 machines, 1 TB every 3 days.
- Jan 2008: 30 machines, 2 TB every 3 days.
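As a rough sanity check on the figures above (a sketch; reading "2 TB every 3 days" as a per-machine rate is my assumption, since a fleet-wide reading would not reach the ~120 TB/week quoted below):

```python
# Back-of-envelope check on the data-volume figures from the talk.
# Assumption: the "2 TB every 3 days" figure is per machine (Jan 2008, 30 machines).

def raw_tb_per_week(machines, tb_per_interval, interval_days):
    """Raw data produced per week by the whole fleet."""
    return machines * tb_per_interval / interval_days * 7

def archival_tb_per_month(bases_per_month, bytes_per_base=8):
    """Permanent archival storage at ~8 bytes stored per base."""
    return bases_per_month * bytes_per_base / 1e12  # bytes -> TB

print(raw_tb_per_week(30, 2, 3))    # 140.0 -- same order as the ~120 TB/week quoted
print(archival_tb_per_month(1e12))  # 8.0 -- close to the ~10 TB/month quoted
```

Both numbers land in the right ballpark, which supports the per-machine reading of the raw-data rates.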
The compute pipeline crunches 2 TB of raw data into 30 GB of sequence data. We need to capture ~120 TB of data per week before we can analyse it to produce the final sequence.

29. IT for new technology sequencing

30. Compute infrastructure
Problem 1:
- How do we capture the data coming off the sequencing machines?
Problem 2:
- How do we analyse the data coming off the sequencing machines?
Problem 3:
- How do we do this, from scratch, in 8 weeks?
31. Problem 1: Build a big file-system
3 x 100 TB file-systems to dump data to:
- Multiple file-systems in order to protect against catastrophic hardware failures.
Hold data for 2 weeks only:
- This should give us enough space to store ~2 weeks' worth of raw data.
- Once a run has passed QC, its raw data can be deleted.
Use Lustre (HP SFS, based on Lustre 1.4):
- Sustained write rate 1.6 Gbit/s (not huge).
- Reads will have to be much faster, so we can analyse data on our compute cluster faster than we capture it.
- We used it already; low risk for us.
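The two-week target follows from the same back-of-envelope approach (a sketch; the ~120 TB/week ingest figure is taken from earlier in the talk):

```python
# How long does a given amount of scratch space buffer incoming data?

def retention_weeks(capacity_tb, ingest_tb_per_week):
    """Weeks of data the file-systems can hold before filling up."""
    return capacity_tb / ingest_tb_per_week

# Three 100 TB file-systems, ~120 TB/week coming off the sequencers:
print(retention_weeks(3 * 100, 120))  # 2.5 -- roughly the "2 weeks" target
```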
36. Problem 2: Build a compute cluster
Compute was the easy part:
- The analysis pipeline is an embarrassingly parallel workload.
- Scales well on commodity clusters (after the bugs had been fixed).
8 chassis of HP BL460c blades:
- 128 nodes / 640 cores.
- We use blade systems already; excellent manageability.
- Fit into our existing machine-management structure.
- Once physically installed, we can deploy the cluster in a few hours.
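The talk does not show the pipeline code; as a generic illustration of why an embarrassingly parallel workload scales well on commodity clusters, each sequencing run can be processed with no communication between jobs, so throughput grows with core count. A minimal sketch (the `analyse` function is a hypothetical stand-in for one analysis job):

```python
from multiprocessing import Pool

def analyse(run_id):
    # Hypothetical stand-in for one independent analysis job: no shared
    # state, no communication between jobs, so adding cores adds throughput.
    return run_id, sum(i * i for i in range(1000))

if __name__ == "__main__":
    runs = list(range(16))
    with Pool(processes=4) as pool:        # one worker per core
        results = pool.map(analyse, runs)  # jobs dispatched independently
    print(len(results))  # 16
```

On a cluster the same shape holds, with a batch scheduler (LSF in this deck) playing the role of the pool.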
42. Add lots of networking
Black Diamond 8810 chassis.
Trunked GigE links:
- 8x per blade chassis (16 machines) for the Lustre network.
- 8x links to the sequencing centre.
44. Data pull
LSF reconfiguration allows processing capacity to be interchanged between real-time and offline analysis.
[Diagram: sequencers 1-30, data-pull "suckers", staging area (320 TB Lustre, EVA), real-time 1° analysis, scratch area (25 TB Lustre, SFS20), offline 2° analysis, final repository (100 TB/yr).]

45. Problem 3: How do we do it quickly?
Plan for more than you actually need:
- Make an estimate and add 1/3.
- Still was not enough in our case.
Go with technologies you know:
- Nothing works out of the box at this scale.
- There will inevitably be problems even with kit you do know: firmware/hardware skews, delivery.
- Other technologies might have been better on paper (e.g. Lustre 1.6, a big NAS box?), but might not have worked.
Good automated systems-management infrastructure:
- Machine and software configs all held in cfengine.
- Easy to add new hardware and make it identical to the rest.
49. Problems
A Lustre file-system is striped across a number of OSS servers:
- An OSS is a box with some disk attached.
- The original plan was for 6 EVA arrays (50 TB each) and 12 OSS servers.
A limit in the SFS failover code means that 1 OSS can only serve 8 LUNs:
- We were looking at 13 LUNs per server (26 in the case of failover).
- Required us to increase the number of OSSs from 6 to 28.
- Plus increased SAN / networking infrastructure.
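The failover constraint generalises: each OSS in a failover pair must be able to absorb its partner's LUNs, so the effective per-server limit is halved. A minimal sketch of the sizing rule (the inputs below are illustrative, not the exact SFS configuration):

```python
import math

def min_oss_servers(total_luns, max_luns_per_oss, failover_pairs=True):
    """Minimum OSS count given a hard per-server LUN limit.

    With failover pairs, a surviving server must also carry its partner's
    LUNs, so each server can normally be given only half the limit.
    """
    per_server = max_luns_per_oss // 2 if failover_pairs else max_luns_per_oss
    return math.ceil(total_luns / per_server)

print(min_oss_servers(24, 8, failover_pairs=False))  # 3
print(min_oss_servers(24, 8, failover_pairs=True))   # 6 -- failover doubles the server count
```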
50. More problems
Golden rule of storage systems:
- All disks go to 96% full and stay there.
The increase in data-production rates reduced the time we could buffer data for:
- Keep data for 2 weeks rather than 3.
We need to add another 100 TB / 6 OSSs:
- Expansion is currently ongoing.

df -h
Filesystem   Size  Used  Avail  Use%  Mounted on
XXX           97T   93T     4T   96%  /lustre/sf1
XXX          116T  111T     5T   96%  /lustre/sf2
XXX           97T   93T     4T   96%  /lustre/sf3
51. Even more problems
Out of memory...
- Changes in the analysis code + new sequencing machines meant that we were filling up our compute farm.
- Code requirement jumped from 1 GB/core to 2 GB/core.
- Under-commit machines with jobs to prevent memory exhaustion.
- Reduced overall capacity.
- Retro-fit underway to increase memory.
Out of machines...
- Changes in downstream analysis mean that we need twice as much CPU as we had.
- Installed 560 cores of IBM HS21 blades.
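Under-committing just means capping job slots by memory rather than cores; in LSF this cap would typically be applied via the per-host job-slot limit (MXJ in lsb.hosts). A sketch of the arithmetic (the node specs are hypothetical):

```python
def slots_per_node(cores, mem_gb, gb_per_job):
    """Job slots a node can safely run: limited by cores or by memory,
    whichever runs out first. Use this value as the scheduler's per-host
    slot limit to avoid memory exhaustion."""
    return min(cores, mem_gb // gb_per_job)

# Hypothetical 8-core node with 8 GB of RAM:
print(slots_per_node(8, 8, 1))  # 8 -- fully committed with 1 GB/core jobs
print(slots_per_node(8, 8, 2))  # 4 -- under-committed once jobs need 2 GB
```

This is why the requirement jump from 1 GB to 2 GB per core halves usable capacity until memory is retro-fitted.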
55. Can we do it better?
Rapidly changing environment:
- The sequencing machines are already here.
- Have to do something, and quickly.
Agile software development:
- The sequencing software team use agile development to cope with change in sequencing science and process.
- Very fast, incremental development (weekly releases).
- Get usable code into production very quickly, even if it is not feature complete.
Can we do agile systems?
- Software is ephemeral; hardware is not.
- You cannot magic 320 TB of disk and 1000 cores out of thin air...
- Or can you?
59. Possible future directions
Virtualisation:
- Should help us if we have excess capacity in our data-centre.
- We are not talking single machines with a bit of local disk: cluster file-systems, non-trivial networking.
- Requires over-provisioning of network and storage infrastructure.
- Is this the price of agility?
Grid / cloud / elastic computing:
- Can we use someone else's capacity instead?
- Can we find a sucker / valued partner to take a wedge of our data?
- Can we get data in and out of the grid quickly enough?
- Do the machines inside the cloud have fast data paths between them and the storage?
- Supercomputer, not web services.
We are starting work to look at all of these.

66. There is no end in sight!
We already have exponential growth in storage and compute:
- Storage doubles every 12 months.
- We crossed the 2 PB barrier last week.
Sequencing technologies are constantly evolving.
Known unknowns:
- Higher data output from our current machines.
- More machines.
Unknown unknowns:
- New big-science projects are just a good idea away...
- Gen 3 sequencing technologies.
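Doubling every 12 months compounds quickly; a quick projection from the 2 PB figure above (a sketch, not a capacity plan):

```python
def projected_pb(start_pb, years, doubling_months=12):
    """Storage after `years`, doubling every `doubling_months`."""
    return start_pb * 2 ** (years * 12 / doubling_months)

print(projected_pb(2, 3))  # 16.0 -- 2 PB today becomes 16 PB in three years
```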
70. Acknowledgements
Sanger:
- Systems, network, SAN and storage teams.
- Sequencing pipeline development team: code performance, testing.
Cambridge Online.
HP Galway:
- Eamonn O'Toole, Gavin Brebner.