Guy Coates

Big Data:Sanger Experiences

Guy Coates

Wellcome Trust Sanger Institute

gmpc@sanger.ac.uk

The Sanger Institute

Funded by Wellcome Trust.• 2nd largest research charity in the world.• ~700 employees.• Based in Hinxton Genome Campus,

Cambridge, UK.

Large scale genomic research.• Sequenced 1/3 of the human genome.

(largest single contributor).• Large scale sequencing with an impact

on human and animal health.

Data is freely available.• Websites, ftp, direct database access,

programmatic APIs.• Some restrictions for potentially

identifiable data.

My team:• Scientific computing systems architects.

DNA Sequencing

TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG

AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA

TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC

ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG

TGCACTCCAGCTTGGGTGACACAG CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG

AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA

ATGAAGTAAATCG ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC

250 Million * 75-108 Base fragments

~1 TByte / day / machineHuman Genome (3GBases)

Economic Trends:

Cost of sequencing halves every 12 months.

The Human genome project: • 13 years.• 23 labs.• $500 Million.

A Human genome today:• 3 days.• 1 machine.• $8,000.

Trend will continue:• $1000 genome is probable within 2 years.

The scary graph

Peak Yearly capillary sequencing: 30 Gbase

Current weekly sequencing:7-10 Tbases

Gen III Sequencers this year?

What are we doing with all these genomes?

UK10K• Find and understand impact of rare genetic

variants on disease.

Ensembl• Genome annotation.• Data resources and analysis pipelines.

Cancer Genome Project• Catalogue causal mutations in cancer.• Genomics of tumor drug sensitivity.

Pathogen Genomics• Bacterial / viral genomics• Malaria Genetics• Parasite genetic / tropical diseases.

All these programmes exist in frameworks of external collaboration.• Sharing data and resources is crucial.

19941995

19961997

19981999

20002001

20022003

20042005

20062007

20082009

Disk Storage

IT Requirements

Needs to match growth in sequencing technology.

Growth of compute & storage • Storage /compute doubles every 12

months.• 2012 ~17 PB Usable

Everything changes, all the time.• Science is very fluid.• Speed to deployment is critical.

Moore's law will not save us.

$1000 genome*• *Informatics not included

Sequencing data flow.

SequencerProcessing/

QCComparative

analysisArchive

Structured data(databases)

Unstructured data(Flat files)

InternetInternet

Alignments(200GB)

Variation data(1GB)

Feature(3MB)

Raw data(10 TB)

Sequence(500GB)

Agile Systems

Modular design.• Blocks of network, compute and

storage.• Assume from day 1 we will be adding

more.• Expand simply by adding more blocks.• Lots of automation.

Make storage visible from everywhere.• Key enabler; lots of 10Gig.

Compute

Network

Compute Compute

Disk Disk Disk

Compute

Compute Modules

Commodity Servers• Blade form-factor.• Automated Management.

Generic intel/AMD CPUs• Single threaded / embarrasingly parallel

workload.• No FPGAs or GPUs.

2000-10,000 core per cluster• 3 Gbyte/s memory per core.• A few bigger memory machines (0.5TB).

Storage Modules

Two flavours:

Scale up (Fast)• DDN storage arrays.• Lustre. 250-500TB per filesystem.• High performance. Expensive.

Scale out (Slow)• Linux NFS servers.• Nexsan Storage arrays.• 50-100TB per filesystem.• Cheap and cheerful.

How large?• More modules = more management overhead.• Fewer modules = large unit of failure.• 100-500 TB

Actual Architecture

Compute Silos• Beware of over-

consolidation.• Some workflows interact

badly with one another.• Separate out some work

onto different clusters.

Logically rather than physically separated.• LSF to manage workflow.• Simple software re-config to

move capacity between silos.

Farm2Farm 1 LSF

Fast disk

Slow disk

Network

Fast disk

Farm3LSF

Fast disk

Some things we learned

KISS! Keep It Simple, Stupid.• Simple solution may look less reliable on paper than the fully redundant

failover option.• Operational reality:

• Simple solutions are much quicker to fix when they break.• Not always possible (eg lustre use).

Good communication between science and IT teams.• Expose the IT costs to researchers.

Build systems Iteratively.• Constantly evolving systems.• Groups start out with everything on fast storage, but realise they can get

away with slower stuff.• More cost effective to do 3x1 yearly purchase rather than 1x 3 yearly?

Data Triage• What do we really want to keep?

QCComparative

analysisArchive

InternetInternet

Slow Fast

Sequencing(1K cores)

SlowFast

UK10K Farm(1.5K core)

General Farm(6K core)

CGP Farm(2K cores)

That was easy!

Variation data(1GB)

Alignments(200GB)

Pbytes!

QCComparative

analysisdatastore

InternetInternet

Feature(3MB)

Raw data(10 TB)

Sequence(500GB)

People = Unmanaged Data

Investigators take data and “do stuff” with it.

Data is left in the wrong place.• Typically left where it was created.

• Moving data is hard and slow.• Important data left in scratch areas, or high IO analysis being run against

slow storage.

Duplication.• Everyone take a copy of the data, “just to be sure.”

Capacity planning becomes impossible.• Who is using our disk space?

• “du” on 4PB is not going to work...

Not Just an IT Problem

100TB filesystem, 136M files.• “Where is the stuff we can delete so we can continue production...?”

#df -hFilesystem Size Used Avail Use% Mounted onlus02-mds1:/lus02 108T 107T 1T 99% /lustre/scratch102

#df -i Filesystem Inodes IUsed IFree IUse% Mounted onlus02-mds1:/lus02 300296107 136508072 163788035 45%/lustre/scratch102

Lost productivity

Data management impacts on research productivity.• Groups spend large amounts of time and effort just keeping track of data.• Groups who control their data get much more done.

• But they spend time writing data tracking applications.

Money talks:• “Group A only need ½ the storage budget of group B to do the same

analysis.”• Powerful message.

Need a common site-wide data management infrastructure.• We need something simple so people will want to use it.

Data management

iRODS: Integrated Rule-Oriented Data System.• Produced by DICE Group (Data Intensive Cyber Environments) at U.

North Carolina, Chapel Hill.

Successor to SRB.• SRB used by the High-Energy-Physics (HEP) community.

• 20PB/year LHC data.• HEP community has lots of “lessons learned” that we can benefit from.

ICATCataloguedatabase

Rule EngineImplements policies

Irods ServerData on disk

User interfaceWeb, command line, fuse, API

Irods ServerData in database

Irods ServerData in S3

Queryable metadata• SQL like language.

Scalable• Copes with PB of data and 100,000M+ files.• Data replication engine.• Fast parallel data transfers across local and wide area network links.

Customisable Rules• Trigger actions on data according to policy.

• Eg generate thumbnail for every image uploaded.

Federated• iRODS installs can be federated across institutions.• Sharing data is easy.

Open Source• BSD licensed.

Sequencing Archive

Final resting place for all our sequencing data.• Researchers pull data from irods for

further analysis.

2x 800TB space.• First deployment; KISS!

Simple ruleset.• Replicate & checksum data.• External scripts periodically scrub data.

Positively received.• Researchers are pushing us for new

instances to store their data.

Next Iterations:• Experiments with WAN, external

federations, complex rules.

Architecture

ICATOracle 10g RAC

Replica 1(green datacentre)

Irods Server

275TB(Nexsan)

120TB(Nexan) 480TB

Replica 2(red datacentre)

275TB(Nexsan)

120TB(Nexan) 480TB

Some thoughts on Clouds

Largest drag on response is dealing with real hardware.• Delivery lead times, racking, cabling etc.To the cloud!

Nothing about our IT approach precludes/mandates cloud.• Use it where it makes sense.

Public clouds for big-data.• Uploading data to the cloud takes along time.• Data Security.

• Need to do your due-diligence • (just like you should be doing in-house!)

• Cloud may be more appropriate than in house.

Currently cheaper for us to do production in-house.• But: Purely an economic decision.

Cloud Archives

Dark Archives• You can get data, but cannot compute

across it.• Nobody is going to download 400TB

of sequence data.

Cloud Archives• Cloud models allow compute to be

uploaded to the data and run “in-place.”

• Private clouds may simplify data governance.

• Can you do it more cheaply than public providers?

Summary

Modular Infrastructure.

Manage Data.

Data Triage.

Strong Collaboration / Dialogue with Researchers.

Acknowledgements

Sanger Systems Team• Phil Butcher (Director of IT)• Informatics Systems Group.• Networking, DBA, Infrastructure & helpdesk teams.• Cancer, human-genetics, uk10k informatics teams.

Resources:

• http://www.sanger.ac.uk/research• http://www.uk10k.org• http://www.sanger.ac.uk/genetics/cgp/cosmic• http://www.ensembl.org• http://www.irods.org• http://www.nanoporetech.com

Guy Coates

Technology

Coates Hire Coolangatta Gold 2017 Entry Conditions · Coates Hire Coolangatta Gold Entry Conditions – July 2017 Coates Hire Coolangatta Gold 2017 Entry Conditions Title: 2017 Coates

Advocamp: David Coates

Coates Presentation

Dev340 Thake Coates

Nahmias Coates Kvaran

HOLLYWOOD KIM COATES: TOUGH GU Y - … guy Kim Coates is his mom’s killer cinnamon buns and his take-a-bullet love for his ... Eric Bana, even Christopher Plummer. You get the idea

a celebration of the contributions of the Coates Family · retirement. Coates’ son Jesse became the first Chair of Chemical Engineering. In his personal life, Charles E. Coates

PATTY COATES - Ontario Federation of Labourofl.ca/wp-content/uploads/OFL-BIO-PattyCoates-2015.12.12.pdf · PATTY COATES SECRETARY-TREASURER, ONTARIO FEDERATION OF LABOUR Patty Coates

Estrada y Coates 1996

THE COATES-SINNOTTCONJECTURE FOR CYCLIC CUBIC … · The generalized Coates-Sinnottconjecture predicts that for n > 2, ... sion F/k. In this paper we examine the (motivic) Coates-Sinnottconjecture

Eric Coates Miniature Suite

Ellen Coates

30A Coates Gardens - media.scottishhomereports.commedia.scottishhomereports.com/MediaServer/PropertyMarketing/31234… · 30A Coates Gardens, West End, Edinburgh, EH12 5LE Viewing:

Rizal by austin coates

It's Christmas - Dan Coates

Heating Coates - perricom.com.hk

Coates - Properhood.pdf

Laurie Coates

Introduction Elaine Coates. Elaine Coates Head of ICT E-learning Consultant M-learning Consultant ICT tutor

Nigel Robbins, Coates Hire