Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP). Robert Grossman, Institute for Genomics & Systems Biology, Center for Research Informatics, Computation Institute, Department of Medicine, University of Chicago & Open Data Group. September 11, 2012

Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)


This is a talk I gave at XLDB 2012 on September 11, 2012 at Stanford University.


Page 1: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)

Robert Grossman

Institute for Genomics & Systems Biology
Center for Research Informatics
Computation Institute
Department of Medicine
University of Chicago
& Open Data Group

September 11, 2012

Page 2: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

The OSDC & Bionimbus Teams

• Open Science Data Cloud (OSDC) Team
– Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
– Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.

• Bionimbus Team
– Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
– Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.

Page 3: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Let's Step Back 20 Years

• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for SSC.

• It developed & benchmarked federated relational, OO DB, object store, & column-oriented data warehouse solutions at the TB-scale.

Page 4: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732

Page 5: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Part  1.  Genomics  as  a  Big  Data  Science  

Page 6: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Source:  Lincoln  Stein  

Page 7: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

One Million Genomes

• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
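
The arithmetic behind these estimates can be checked with a short sketch. The per-patient size and per-genome cost are the slide's figures; the 10x compression ratio is an assumption inferred from the slide's "1 EB" and "about 100 PB" numbers, and decimal units (1 PB = 1000 TB) are assumed throughout.

```python
# Back-of-the-envelope sizing for one million genomes.
GENOMES = 1_000_000
TB_PER_PATIENT = 1           # from the slide: ~1 TB per patient (tumor + normal)
COST_PER_GENOME_USD = 1_000  # from the slide: $1000/genome
COMPRESSION_RATIO = 10       # assumption: implied by 1 EB shrinking to ~100 PB

TB_PER_PB = 1_000            # decimal units assumed
PB_PER_EB = 1_000

raw_pb = GENOMES * TB_PER_PATIENT / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB
compressed_pb = raw_pb / COMPRESSION_RATIO
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD

print(f"raw data:        {raw_pb:,.0f} PB ({raw_eb:g} EB)")
print(f"compressed data: {compressed_pb:,.0f} PB")
print(f"sequencing cost: ${sequencing_cost_usd:,}")
```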

Page 8: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Big data driven discovery on 1,000,000 genomes and 1 EB of data.

• Genomic-driven diagnosis
• Improved understanding of genomic science
• Genomic-driven drug development
• Precision diagnosis and treatment. Preventive health care.

Page 9: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

[Figure panels: TNBC, ER+. Source: White Lab, University of Chicago.]

With genomics, we can stratify diseases and treat each stratum differently.

Page 10: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Clonal Evolution of Tumors

Tumors evolve temporally and spatially. Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.

Page 11: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Combinations of Rare Alleles

[Figure: penetrance (low, modest, intermediate, high) plotted against allele frequency (very rare, rare, uncommon, common; thresholds at 0.001, 0.01, 0.1). Labeled regions:
• alleles causing Mendelian disease
• low-frequency variants with intermediate penetrance
• most common variants implicated in common disease by GWA
• rare examples of high-penetrance common variants influencing common disease
• rare variants of small effect, very hard to identify by genetic means]

Source: Mark McCarthy

Page 12: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

TCGA  Analysis  of  Lung  Cancer  

•  178  cases  of  SQCC  (lung  cancer)  

•  Matched  tumor  &  normal  

• Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor

Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.

Page 13: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Some Examples of Big Data Science

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html

Page 14: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

One  large  instrument   Many  smaller  instruments  

Page 15: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Part  2.  What  Instrument  Do  we  Use  to    Make  Big  Data  Discoveries?  

How  do  we  build  a  “datascope?”  

Page 16: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

What  is  big  data?  

TB?  PB?  EB?  ZB?  

Page 17: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Another way:

Think of data as big if you measure it in MW; for example, Facebook's Prineville Data Center is 30 MW.

opencompute.org

Page 18: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

An algorithm and its computing infrastructure are "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
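
One way to read this definition is as a test on benchmark runs: grow racks and data together and check that the runtime stays roughly flat. The sketch below is only an illustration of that reading; the `Run` record, the 10% tolerance, and the sample timings are all hypothetical and not measurements from the OSDC.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    racks: int       # number of racks used
    data_tb: float   # total data processed, in TB
    seconds: float   # wall-clock time for the computation


def is_big_data_scalable(runs, tolerance=0.10):
    """Return True if runtime stays within `tolerance` of the smallest run
    while data grows in proportion to the number of racks."""
    base = min(runs, key=lambda r: r.racks)
    base_density = base.data_tb / base.racks
    for run in runs:
        # The comparison is only fair if each rack holds the same amount of data.
        if not math.isclose(run.data_tb / run.racks, base_density):
            raise ValueError("data per rack must be constant across runs")
        if abs(run.seconds - base.seconds) > tolerance * base.seconds:
            return False
    return True


# Hypothetical timings: runtime stays flat as racks and data double.
flat = [Run(1, 950, 3600), Run(2, 1900, 3650), Run(4, 3800, 3700)]
# Hypothetical timings: runtime grows by 50% when a rack is added.
growing = [Run(1, 950, 3600), Run(2, 1900, 5400)]

print(is_big_data_scalable(flat))     # True
print(is_big_data_scalable(growing))  # False
```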

Page 19: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Commercial Cloud Service Provider (CSP): 15 MW Data Center

• 100,000 servers
• 1 PB DRAM
• 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network
• ~1 Tbps egress bandwidth
• 25 operators for 15 MW Commercial Cloud

Page 20: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

What  are  some  of  the  important  differences  between  commercial  and  research-­‐focused  CSPs?    

Page 21: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Science Clouds: Science CSP vs. Commercial CSP

POV
– Science CSP: Democratize access to data. Integrate data to make discoveries. Long term archive.
– Commercial CSP: As long as you pay the bill; as long as the business model holds.

Data & Storage
– Science CSP: Data intensive computing & HP storage
– Commercial CSP: Internet style scale out and object-based storage

Flows
– Science CSP: Large data flows in and out
– Commercial CSP: Lots of small web flows

Streams
– Science CSP: Streaming processing required
– Commercial CSP: NA

Accounting
– Science CSP: Essential
– Commercial CSP: Essential

Lock in
– Science CSP: Moving environment between CSPs essential
– Commercial CSP: Lock in is good

Page 22: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Part 3. The Open Cloud Consortium's Open Science Data Cloud

Page 23: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

www.opencloudconsortium.org

• U.S. based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.

 

Page 24: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Cloud Services Operations Centers (CSOC)

• The OSDC operates a Cloud Services Operations Center (or CSOC).

• It is a CSOC focused on supporting Science Clouds for researchers.

• Compare to a Network Operations Center or NOC.

• Both are an important part of cyber infrastructure for big data science.

Page 25: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Different Styles of OSDC Racks

• Design 1: Put cores over spindles.
• Higher cost but easy to compute over all the data.
• Design 2: Separate (some of the) storage from the compute.

2012 OSDC rack design (draft)
• 950 TB / rack
• 600 cores / rack
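
As a rough sizing sketch, the draft rack figures can be turned into rack counts for the capacities mentioned elsewhere in the talk (10 PB and 100 PB). It assumes every rack follows the 2012 draft design, that the 950 TB figure is usable capacity, and decimal units; none of this is stated in the slides.

```python
import math

TB_PER_RACK = 950      # from the 2012 draft rack design
CORES_PER_RACK = 600   # from the 2012 draft rack design
TB_PER_PB = 1_000      # decimal units assumed


def racks_needed(capacity_pb):
    """Whole racks required to hold `capacity_pb` petabytes."""
    return math.ceil(capacity_pb * TB_PER_PB / TB_PER_RACK)


for capacity_pb in (10, 100):
    racks = racks_needed(capacity_pb)
    print(f"{capacity_pb:>3} PB -> {racks} racks, {racks * CORES_PER_RACK:,} cores")
```

Under these assumptions, 10 PB needs 11 racks and 100 PB needs 106 racks.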

Page 26: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Open Science Data Cloud

• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer Facing Portal (Tukey)
• Data center network
• ~100 Gbps bandwidth
• 5-12 operators to operate 1-5 MW Science Cloud
• Science Cloud SW & Services
• OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
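
To give a feel for what these bandwidth figures mean for large data flows, the sketch below estimates how long it takes to move 1 PB at ~100 Gbps (the science cloud figure) and at ~1 Tbps (the commercial CSP egress figure). The `efficiency` parameter is a hypothetical knob for protocol and disk overhead; with the default of 1.0 the result is a best-case lower bound.

```python
def transfer_hours(data_pb, link_gbps, efficiency=1.0):
    """Hours needed to move `data_pb` petabytes over a `link_gbps` link.

    `efficiency` (0 < efficiency <= 1) is an assumed fraction of the link
    rate that is actually achieved; it is not a figure from the talk.
    """
    bits = data_pb * 1e15 * 8                   # decimal petabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for label, gbps in (("science cloud, ~100 Gbps", 100),
                    ("commercial CSP egress, ~1 Tbps", 1000)):
    print(f"1 PB over {label}: {transfer_hours(1, gbps):.1f} hours")
```

At full line rate, 1 PB takes about 22 hours at 100 Gbps and a little over 2 hours at 1 Tbps.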

Page 27: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

OSDC Philosophy

• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
• We assign (permanent) IDs to data managed by the OSDC and manage associated metadata.
• We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections.
• We support RESTful interfaces.
• We do accounting for storage and core-hours.
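
The ID and permission bullets above can be sketched as a small data model. This is only an illustration: the `Node` class, the rule that a grant on a collection is inherited by everything inside it, and all names are hypothetical, and the slides do not describe how the OSDC actually implements permissions.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Node:
    """A file/object, a collection, or a collection of collections."""
    name: str
    parent: "Node | None" = None
    readers: set = field(default_factory=set)  # users or groups granted read access
    # Permanent ID, assigned once when the node is created.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def can_read(node, user, groups_of):
    """True if `user`, or any group the user belongs to, has been granted
    read access on `node` or on any enclosing collection (assumed rule)."""
    principals = {user} | groups_of.get(user, set())
    while node is not None:
        if node.readers & principals:
            return True
        node = node.parent
    return False


# Hypothetical project layout: collection of collections -> collection -> file.
groups_of = {"alice": {"lab-a"}, "bob": set()}
project = Node("project", readers={"lab-a"})
runs = Node("runs", parent=project)
bam = Node("sample.bam", parent=runs)

print(bam.id)                               # permanent ID for the file
print(can_read(bam, "alice", groups_of))    # True, via group "lab-a" on the project
print(can_read(bam, "bob", groups_of))      # False
```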

Page 28: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Some  Of  Our  Biggest  Mistakes  

• Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.

• Trying to support donated equipment without adequate staff.

• Being too optimistic about when big data software would be ready for prime time.

• Some problems with big data software don't show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.

Page 29: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Essential Services for a Science CSP

• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services

Page 30: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

[Figure: number of projects (10's, 100's, 1000's) plotted against data size (small, medium to large, very large).
Infrastructure labels: public infrastructure; shared community infrastructure; dedicated infrastructure.
Project labels: individual scientists & small projects; community based science via Science as a Service; very large projects.]

Page 31: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Part  4.    Bionimbus  

Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.

Page 32: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Step  1.  Prepare  a  Sample  

Page 33: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Step 2. Log in to Bionimbus and get a Bionimbus Key.

Page 34: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Step  3.    Send  your  sample  to  the  sequencing  center.    

Page 35: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Step 4. Log in to Bionimbus and view your data.

Page 36: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Step  5.    Use  Bionimbus  to  perform  standard  and  custom  pipelines.  

Bionimbus can launch multiple virtual machines.

Page 37: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Bionimbus Virtual Machine Releases

Category                  Tools
Peak Calling              MAT, MA2C, PeakSeq, MACS, SPP
Quality Control           Various
Alignment & Genotyping    Bowtie, TopHat, Samtools, Picard

Page 38: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Software Tools: Moving Genomes

Page 39: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Bionimbus Community Genomic Cloud

[Diagram: a researcher connected to]
• Personal "dropbox" + compute
• Cloud for Public Data (1K genomes, PubMed, etc.)

Page 40: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Bionimbus Private Genomic Cloud

[Diagram: a researcher connected to]
• Personal "dropbox" & compute
• Cloud for Public Data (1K genomes, PubMed, etc.)
• Cloud for Controlled Data (TCGA, dbGaP)

Page 41: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Bionimbus Private Biomedical Cloud

[Diagram: a researcher connected to]
• Personal "dropbox" plus compute
• Cloud for Public Data (1K genomes, PubMed, etc.)
• Cloud for Controlled Data (TCGA, dbGaP)
• Cloud for PHI data
• Clinical Research Data Warehouse
• Scatter, gather queries
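
The slide does not say how the scatter, gather queries are implemented. As a generic sketch of the pattern, the code below sends the same query to several independent data sources in parallel and collects the per-source answers. The three in-memory sources, their rows, and the gene-name query are hypothetical stand-ins for the public, controlled, and PHI clouds.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(rows):
    """Build a stand-in data source that filters its own rows by gene."""
    return lambda gene: [row for row in rows if row["gene"] == gene]


def scatter_gather(query, sources):
    """Scatter `query` to every source in parallel, then gather the
    answers into a dict keyed by source name."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical contents of each cloud.
sources = {
    "public": make_source([{"sample": "pub-1", "gene": "TP53"}]),
    "controlled": make_source([{"sample": "ctl-1", "gene": "TP53"},
                               {"sample": "ctl-2", "gene": "KRAS"}]),
    "phi": make_source([{"sample": "phi-1", "gene": "KRAS"}]),
}

results = scatter_gather("TP53", sources)
merged = [row for rows in results.values() for row in rows]
print(sorted(row["sample"] for row in merged))
```

In a real deployment each source would enforce its own access controls before answering, so the gather step only ever sees rows the caller is allowed to read.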

Page 42: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

[Diagram components: Bionimbus Private Cloud UC; Bionimbus Community Cloud; Bionimbus Private Cloud XY; Amazon; dbGaP; external sequencing partner; internal sequencers; BID Generator.]

Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc.

Step 2. Send sample to be sequenced.

Step 3a. Return raw reads.

Step 3b. Return variant calls, CNV, annotation…

Step 4. Secure data routing to appropriate cloud based upon BID.

Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
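
Steps 1 and 4 can be sketched as a registry that records the destination chosen when a BID is issued and is consulted when data comes back from the sequencer. Everything here is hypothetical: the BID format (a random UUID), the destination names, and the in-memory registry are illustrative choices, not the Bionimbus implementation.

```python
import uuid

# Hypothetical mapping from the choice made in Step 1 to a target cloud.
ROUTES = {
    "private": "bionimbus-private-cloud",
    "community": "bionimbus-community-cloud",
    "public": "public-cloud",
}

registry = {}  # BID -> metadata recorded when the BID was issued


def new_bid(project, destination):
    """Step 1: issue a BID and record its project and destination."""
    if destination not in ROUTES:
        raise ValueError(f"unknown destination: {destination!r}")
    bid = uuid.uuid4().hex
    registry[bid] = {"project": project, "destination": destination}
    return bid


def route(bid):
    """Step 4: look up which cloud the returned data should be sent to."""
    return ROUTES[registry[bid]["destination"]]


bid = new_bid("demo-project", "community")
print(bid, "->", route(bid))
```

Because the destination is fixed when the BID is issued, the sequencing center only needs to return the BID with the data; it never has to know which cloud the data belongs in.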

Page 43: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

[Architecture diagram. Components:]
• web2py-based Front End
• Analysis Pipelines & Re-analysis Services
• Database Services
• Data Cloud Services
• Data Ingestion Services
• Utility Cloud Services
• Intercloud Services

[Technologies named in the diagram:]
• Hadoop, Sector/Sphere
• Eucalyptus, OpenStack
• PostgreSQL
• IDs, etc.
• UDT, replication

Page 44: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

>300 ChIP datasets
- Chromatin/RNA timecourse
- CBP
- PolII
- Pho/silencers
- HDACs
- Insulators
- TFs

Predictions
• 537 silencers
• 2,307 new promoters
• 12,285 enhancers
• 14,145 insulators

www.modencode.org        

Negre  et  al.  Nature  2011  

Page 45: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Part  5.      Managing  One  Million  Genomes  

Page 46: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

[Diagram: three tiers of data, from largest to smallest]

• Sequence (BAM) Files (100-1000 PB): sequence data in binary form. NoSQL, DFS, file overlays?
• Variation (VCF) Files (1-10 PB): genomic variation. NoSQL & scientific databases.
• Summary level (10-100 TB): relational databases.

Enrich with clinical data.
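
Dividing each tier's range by one million genomes gives a per-genome budget, which makes it easier to see why the tiers might call for different storage technologies. The tier sizes are from the slide; the even split across genomes and the decimal units are simplifying assumptions.

```python
GENOMES = 1_000_000
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15  # decimal units assumed

# (low, high) total size of each tier in bytes, from the slide.
TIERS = {
    "Sequence (BAM)": (100 * PB, 1000 * PB),
    "Variation (VCF)": (1 * PB, 10 * PB),
    "Summary level": (10 * TB, 100 * TB),
}


def human(n_bytes):
    """Format a byte count using the largest decimal unit that fits."""
    for unit, size in (("TB", TB), ("GB", GB), ("MB", MB)):
        if n_bytes >= size:
            return f"{n_bytes / size:g} {unit}"
    return f"{n_bytes} B"


# Per-genome share of each tier, assuming data is spread evenly.
per_genome = {name: (low // GENOMES, high // GENOMES)
              for name, (low, high) in TIERS.items()}

for name, (low, high) in per_genome.items():
    print(f"{name:<16} {human(low)} to {human(high)} per genome")
```

Under these assumptions the per-genome share is roughly 100 GB to 1 TB of sequence data, 1 to 10 GB of variation data, and 10 to 100 MB of summary data.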

Page 47: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Acknowledgements

Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.

Additional funding for the OSDC has been provided by the following sponsors:

• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.

The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].

Page 48: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

For more information

• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu
• I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.

Center for Research Informatics

Page 49: Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Sources  for  images  

• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732