Chemogenomics in the cloud: Is the sky the limit?

Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics

June 28, 2012

The cloud as infrastructure

•  Cloud computing is a service for
   – Infrastructure
   – Platform
   – Software

•  Many of the benefits of cloud computing are
   – Economic
   – Political

•  Won’t be discussing the remote hosting aspects of clouds

Characteristics of the cloud

Cloud computing:

•  Virtually assemble
•  Pay-per-use
•  On-demand self service
•  Offsite technology
•  Shared workloads
•  Massive scale

http://www.slideshare.net/haslinatuanhim/slides-cloud-computing

Parallel computing in the cloud

•  Modern cloud vendors make provisioning compute resources easy
   – Allows one to handle unpredictable loads easily
   – Pay only for what you need

•  Chemistry applications don’t usually have very dynamic loads

•  But large scale resources are an opportunity for large scale (parallel) computations

Storing chemical information

•  Fill up a hard drive, mail to Amazon
•  Copy over the network
   – Aspera
   – GridFTP

•  Still need to pay for storage space

•  Lots of options on the cloud
   – S3, relational DBs

•  See Chris Dagdigian’s talk for views on storage
   http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches

Recoding for the cloud?

•  Only if we really have to
•  Large amounts of legacy code run perfectly well on local clusters
   – May not make sense to recode as a map-reduce job
   – May not be possible to

•  Different levels of HPC on the cloud
   – Legacy HPC
   – ‘Cloudy’ HPC
   – Big Data HPC

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud

Legacy HPC

•  Use cloud resources in the same way as a local cluster
•  MIT StarCluster makes this easy to do

Cloudy HPC

•  Make use of cloud capabilities
•  Old algorithms, new infrastructure
•  Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC

•  Huge datasets
•  Candidates for map-reduce
•  Involves algorithm (re)design

How does the cloud enable science?

•  How does the cloud change computational chemistry, cheminformatics, …
   – The way we do them
   – The scale at which we do them

Are there problems we can address that we could not without on-demand, scalable cloud resources?

Big data & cheminformatics

•  Computation over large chemical databases
   – PubChem, ChEMBL, …

•  What types of computations?
   – Searches (substructure, pharmacophore, …)
   – QSAR models over large data
   – Predictions for large data

•  Certain applications just need structures
•  Access to correspondingly massive experimental datasets is tough (impossible?)


•  GDB-13 is a truly big database
   – 977 million different structures
   – Current search interface is based on nearest-neighbor (NN) searches using a reduced representation
   – Could be a good candidate for a Hadoop-based analysis

•  More generally, enumerated virtual libraries can also lead to very big data
   – Time required to enumerate is a bottleneck


•  Fundamentally, “big chemical data” lets us explore larger chemical spaces
   – Can plow through large catalogs
   – e.g., identifying PKR inhibitors by ligand-based virtual screening (LBVS) of the ChemNavigator collection [Bryk et al.]

•  This can push predictive models to their limits
   – Brings us back to the global vs. local model arguments

The Hadoop ecosystem

•  A framework for the map-reduce algorithm
   – Not something you can download and just run
   – Need to set up the infrastructure and then develop code that runs on it

•  Low-level Hadoop programs can be large, complex and tedious

•  Abstractions have been developed that make Hadoop queries more SQL-like
   – Results in much more concise code
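To make the map-reduce pattern concrete, here is a minimal single-process sketch in Python. It is not Hadoop code; it only mimics the map, shuffle and reduce phases. The task (counting element symbols across SMILES strings) and the regex tokenizer are illustrative assumptions; the tokenizer covers only a few organic-subset atoms and will miscount bracket atoms such as [Na+].

```python
import re
from collections import defaultdict
from itertools import chain

# Assumption: a deliberately crude SMILES atom tokenizer (organic subset only).
ATOM = re.compile(r"Cl|Br|[BCNOSPFI]|[cnos]")

def mapper(smiles):
    """Map phase: emit (element, 1) for each atom token in one SMILES string."""
    for token in ATOM.findall(smiles):
        # Aromatic atoms are lower case in SMILES; fold them onto the element symbol.
        yield token.capitalize(), 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts collected for one element."""
    return key, sum(values)

def count_atoms(smiles_list):
    pairs = chain.from_iterable(mapper(s) for s in smiles_list)
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the mapper and reducer would run on different nodes and the framework would handle the shuffle; only the two small functions are problem-specific.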

The Hadoop ecosystem

[Diagram of the Hadoop ecosystem. Components shown: Hadoop Common, Hadoop Distributed Filesystem, Map Reduce Engine, Hive, Hama, Whirr, HBase, Pig, Avro, Mahout, Flume, Zookeeper, Chukwa]

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem

Simplifying Hadoop applications

•  Raw Hadoop programs can be very tedious to write

SMARTS-based substructure search

Pig & Pig Latin

•  Pig Latin programs are much simpler to write and get translated to Hadoop code

•  SQL-like; user-defined functions (UDFs) must be implemented to perform non-standard tasks

SMARTS search in Pig Latin

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

UDF for SMARTS search

    public class SMATCH extends FilterFunc {
        static SMARTSQueryTool sqt;
        static {
            try {
                sqt = new SMARTSQueryTool("C");
            } catch (CDKException e) {
                System.out.println(e);
            }
        }
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                sqt.setSmarts(query);
                IAtomContainer mol = sp.parseSmiles(target);
                return sqt.matches(mol);
            } catch (CDKException e) {
                throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
            }
        }
    }

Working on top of Hadoop

•  Hadoop doesn’t know anything about cheminformatics
   – Need to write your own code, UDFs, etc.

•  But application layers have been developed for other purposes
   – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
   – Possible to build virtual screening pipelines based on the Hadoop framework

What Hadoop is not for

•  Doesn’t replace an actual database
•  It’s not uniformly fast or efficient
•  Not good for ad hoc or real-time analysis
•  Not effective unless dealing with massive datasets

•  Not all algorithms are amenable to the map-reduce method
   – CPU-bound methods and those requiring communication

Cheminformatics on Hadoop

•  Hadoop and Atom Counting
•  Hadoop and SD Files
•  Cheminformatics, Hadoop and EC2
•  Pig and Cheminformatics

But are cheminformatics problems really big enough to justify all of this?

How big is big?

•  Bryk et al. performed a ligand-based virtual screen (LBVS) of 5 million compounds to identify PKR inhibitors
   – Pharmacophore fingerprints + perceptron
   – Required conformer generation

•  Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long

•  Example: a random forest (RF) model built on 512-bit binary fingerprints gives predictions for 5M fingerprints in 12 min [single core, 3 GHz Xeon, OS X 10.6.8]
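A back-of-envelope extrapolation of the timing quoted above, as a small Python sketch. It assumes throughput stays constant as the input grows and that work divides evenly across cores; neither assumption is tested here. The 977 million figure is the GDB-13 size mentioned earlier in the talk.

```python
def throughput_per_second(n_items, minutes):
    """Items processed per second, given a measured run."""
    return n_items / (minutes * 60)

def projected_hours(n_items, rate_per_second, n_cores=1):
    """Projected wall-clock hours, assuming perfectly linear scaling (an assumption)."""
    return n_items / (rate_per_second * n_cores) / 3600

# Measured: 5M fingerprints in 12 minutes on a single core.
rate = throughput_per_second(5_000_000, 12)

# Extrapolated: a GDB-13 sized collection (977M structures).
single_core = projected_hours(977_000_000, rate)
hundred_cores = projected_hours(977_000_000, rate, n_cores=100)
```

Under these assumptions the single-core rate is about 6,900 predictions per second, so a GDB-13 sized set would take roughly 39 hours on one core and well under an hour on 100 cores. Prediction alone is therefore not the bottleneck; conformer and descriptor generation are excluded from this estimate.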

Going beyond chunking?

•  All the preceding use cases are embarrassingly parallel
   – Chunking the input data and applying the same operation to each chunk
   – Very nice when you have a big cluster
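The chunking pattern described above can be sketched in a few lines of Python. The per-molecule function `toy_descriptor` is a hypothetical stand-in for a real calculation, and `map_fn` defaults to the built-in serial `map`; passing a parallel map (for example the `map` method of a `multiprocessing.Pool`) is the assumed route to running chunks concurrently.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def toy_descriptor(smiles):
    """Hypothetical per-molecule calculation: counts letters in the SMILES."""
    return sum(ch.isalpha() for ch in smiles)

def process_chunk(chunk):
    """The same operation applied independently to every chunk."""
    return [toy_descriptor(s) for s in chunk]

def process_in_chunks(items, size, map_fn=map):
    """Split the input, process each chunk, and concatenate results in order."""
    results = []
    for part in map_fn(process_chunk, chunked(items, size)):
        results.extend(part)
    return results
```

No chunk needs anything from any other chunk, which is exactly why this style scales with cluster size and also why it says nothing about the algorithm itself.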

Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?


•  Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
   – Doesn’t always avoid the O(N²) barrier
   – Bioisostere identification is one case that could be rephrased as a map-reduce problem

•  Search algorithms such as genetic algorithms (GAs) and particle swarms can make use of map-reduce
   – GA-based docking
   – Feature selection for QSAR models
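As an illustration of a pairwise calculation phrased in map-reduce terms, the following Python sketch builds nearest-neighbor lists from fingerprint similarities. Fingerprints are represented as sets of on-bit indices and similarity is Tanimoto; both are assumptions chosen for brevity. The map step handles one pair at a time and the reduce step gathers neighbors per molecule. Note that all N(N-1)/2 pairs are still evaluated, so this does not avoid the quadratic cost.

```python
from collections import defaultdict
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def map_pair(pair, fps, cutoff):
    """Map: for one pair, emit (molecule, (neighbor, similarity)) in both directions."""
    i, j = pair
    sim = tanimoto(fps[i], fps[j])
    if sim >= cutoff:
        yield i, (j, sim)
        yield j, (i, sim)

def reduce_neighbors(key, values):
    """Reduce: collect one molecule's neighbors, most similar first."""
    return key, sorted(values, key=lambda v: -v[1])

def neighbor_lists(fps, cutoff=0.5):
    grouped = defaultdict(list)  # stands in for the framework's shuffle
    for pair in combinations(sorted(fps), 2):
        for key, value in map_pair(pair, fps, cutoff):
            grouped[key].append(value)
    return dict(reduce_neighbors(k, v) for k, v in grouped.items())
```

Because each pair is independent, the pair stream can be partitioned across mappers in any way; only the grouping by molecule requires communication.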


•  Machine learning for massive chemical datasets?
   – MR jobs (descriptor generation) + Mahout (model building) lets us handle this in a straightforward manner

•  But will QSAR models benefit from more data?
   – Helgee et al. suggest global models are preferable
   – But diversity and the structure of the chemical space will affect performance of global models
   – Unsupervised methods may be more relevant
   – Philosophical question?


•  Many clustering algorithms are amenable to map-reduce style
   – K-means, Spectral, EM, minhash, …
   – Many are implemented in Mahout

Problems where we generate large numbers of combinations can be amenable to map-reduce
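To show why a clustering method such as k-means fits the map-reduce style, here is a minimal Python sketch of a single k-means iteration split into a map step (assign each point to its nearest centroid) and a reduce step (average the points assigned to each centroid). This is a generic illustration, not Mahout's implementation; points are plain tuples of floats and the function names are hypothetical.

```python
from collections import defaultdict

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(point, centroids):
    """Map: emit (index of nearest centroid, point)."""
    idx = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    return idx, point

def recompute(points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_step(points, centroids):
    """One full iteration; an empty cluster keeps its previous centroid."""
    groups = defaultdict(list)
    for p in points:
        idx, pt = assign(p, centroids)
        groups[idx].append(pt)
    return [
        recompute(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
```

Each iteration becomes one map-reduce job, with the current centroids broadcast to every mapper; the driver repeats the job until the centroids stop moving.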

Networks & integration

•  Network models of molecules and targets are common
   – Allows for the incorporation of lots of associated information
   – Diseases, pathways, OTEs

•  When linked with clinical data & outcomes, we can generate massive networks
   – Adverse events (FDA AERS)
   – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples

Yildirim, M.A. et al.
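A sketch of how drug-drug-reaction triples arise combinatorially from adverse event reports. The report layout (a dict with `drugs` and `reactions` lists) is a hypothetical format for illustration; it is neither the FDA AERS schema nor the method used in the Cloudera analysis. The per-report expansion is the map step and the counting is the reduce step.

```python
from collections import Counter
from itertools import combinations

def triples(report):
    """Map: emit every (drug, drug, reaction) triple implied by one report."""
    drugs = sorted(set(report["drugs"]))  # sort so (A, B) and (B, A) coincide
    for d1, d2 in combinations(drugs, 2):
        for reaction in set(report["reactions"]):
            yield d1, d2, reaction

def count_triples(reports):
    """Reduce: count how often each triple occurs across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(triples(report))
    return counts
```

A report listing d drugs and r reactions yields d(d-1)/2 × r triples, which is how a modest number of reports can expand into millions of keys.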


•  SAR data can be viewed in a network form
   – SALI, SARI based networks
   – Usually requires pairwise calculations of the metric

•  Current studies have focused on small datasets (< 1000 molecules)

•  Hadoop + Giraph could let us apply this to HTS-scale datasets

http://sali.rguha.net/
Peltason, L. et al.
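For reference, the SALI value for a pair of molecules is the absolute activity difference divided by one minus their structural similarity, SALI(i, j) = |A_i − A_j| / (1 − sim(i, j)). The Python sketch below computes it for all pairs and keeps those above a cutoff as network edges. Fingerprints as sets of on-bits, Tanimoto similarity, the handling of identical fingerprints, and directing each edge from the less active to the more active molecule are assumptions made for this illustration.

```python
import math
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sali(act_i, act_j, sim):
    """SALI = |A_i - A_j| / (1 - sim)."""
    diff = abs(act_i - act_j)
    if diff == 0:
        return 0.0
    if sim >= 1.0:
        return math.inf  # identical fingerprints, different activities
    return diff / (1.0 - sim)

def sali_edges(mols, cutoff):
    """mols maps name -> (activity, fingerprint). Returns (low, high, SALI) edges."""
    edges = []
    for a, b in combinations(sorted(mols), 2):
        (act_a, fp_a), (act_b, fp_b) = mols[a], mols[b]
        value = sali(act_a, act_b, tanimoto(fp_a, fp_b))
        if value >= cutoff:
            low, high = (a, b) if act_a <= act_b else (b, a)
            edges.append((low, high, value))
    return edges
```

The loop over `combinations` is the pairwise step that limits current studies to small datasets; it is also the part that a distributed framework would partition.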


•  When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
   – Network based similarity
   – Community detection (aka clustering)
   – PageRank style ranking (of targets, compounds, …)
   – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)

Bauer-Mehren et al.
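As an example of the PageRank style ranking mentioned above, here is a minimal power-iteration sketch in Python. The graph is a dict mapping each node to the nodes it links to; the compound-to-target example in the usage note is hypothetical. Nodes with no outgoing edges spread their rank uniformly, which is one common convention and an assumption here.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict of node -> list of linked nodes."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is shared by everyone.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        # "Map": each node emits a share of its rank to every node it links to.
        # "Reduce": the shares arriving at each node are summed.
        for u, targets in graph.items():
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank
```

For instance, with `{"c1": ["t1"], "c2": ["t1"], "c3": ["t1", "t2"]}` the target `t1` receives contributions from all three compounds and ranks highest. Each iteration is one emit-and-sum pass over the edges, which is the kind of computation that graph frameworks on Hadoop are built to distribute.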

Conclusions

•  Cheminformatics applications can be rewritten to take advantage of cloud resources
   – Remotely hosted
   – Embarrassingly parallel / chunked
   – Map/reduce

•  Ability to process larger structure collections lets us explore more chemical space

•  Integrating chemistry with clinical & pharmacological data can lead to big datasets

Conclusions

•  Q: But are cheminformatics problems really big enough to justify all of this?

•  A: Yes – virtual libraries, integrating chemical structure with other types and scales of data

•  Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?

•  A: Yes – especially when we consider problems with a combinatorial flavor
