Chemogenomics in the cloud: Is the sky the limit?
Rajarshi Guha, Ph.D. NIH Center for Translational Therapeutics
June 28, 2012
The cloud as infrastructure
• Cloud computing is a service for
  – Infrastructure
  – Platform
  – Software
• Many of the benefits of cloud computing are
  – Economic
  – Political
• Won’t be discussing the remote hosting aspects of clouds
Characteristics of the cloud
Cloud Computing
Virtually assemble
Pay-per-use
On-demand self service
Offsite technology
Shared workloads
Massive scale
http://www.slideshare.net/haslinatuanhim/slides-cloud-computing
Parallel computing in the cloud
• Modern cloud vendors make provisioning compute resources easy
  – Allows one to handle unpredictable loads easily
  – Pay only for what you need
• Chemistry applications don’t usually have very dynamic loads
• But large scale resources are an opportunity for large scale (parallel) computations
Storing chemical information
• Fill up a hard drive, mail to Amazon
• Copy over the network
  – Aspera
  – GridFTP
• Still need to pay for storage space
• Lots of options on the cloud
  – S3, relational DBs
• See Chris Dagdigian’s talk for views on storage: http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches
Recoding for the cloud?
• Only if we really have to
• Large amounts of legacy code run perfectly well on local clusters
  – May not make sense to recode as a map-reduce job
  – May not be possible to
• Different levels of HPC on the cloud
  – Legacy HPC
  – ‘Cloudy’ HPC
  – Big Data HPC
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
Recoding for the cloud?
Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do
Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc.
Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
How does the cloud enable science?
• How does the cloud change computational chemistry, cheminformatics, …
  – The way we do them
  – The scale at which we do them
Are there problems we can address that we could not have addressed without on-demand, scalable cloud resources?
Big data & cheminformatics
• Computation over large chemical databases
  – PubChem, ChEMBL, …
• What types of computations?
  – Searches (substructure, pharmacophore, …)
  – QSAR models over large data
  – Predictions for large data
• Certain applications just need structures
• Access to correspondingly massive experimental datasets is tough (impossible?)
Big data & cheminformatics
• GDB-13 is a truly big database
  – 977 million different structures
  – Current search interface is based on NN searches using a reduced representation
  – Could be a good candidate for a Hadoop-based analysis
• More generally, enumerated virtual libraries can also lead to very big data
  – Time required to enumerate is a bottleneck
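How quickly an enumerated library grows can be sketched in a few lines. The scaffold template and R-group lists below are hypothetical placeholders, and plain string substitution stands in for what a cheminformatics toolkit with proper reaction rules would do.

```python
from itertools import islice, product

# Hypothetical scaffold with three attachment points; the fragments are
# illustrative only and are not validated chemistry.
SCAFFOLD = "c1cc({r1})cc({r2})c1{r3}"
R_GROUPS = {
    "r1": ["C", "CC", "OC", "N", "Cl"],
    "r2": ["C", "F", "Br", "C#N"],
    "r3": ["C", "O", "N"],
}

def library_size(r_groups):
    """Number of products: the product of the R-group list lengths."""
    size = 1
    for options in r_groups.values():
        size *= len(options)
    return size

def enumerate_library(scaffold, r_groups):
    """Lazily yield one SMILES-like string per R-group combination."""
    names = list(r_groups)
    for combo in product(*(r_groups[n] for n in names)):
        yield scaffold.format(**dict(zip(names, combo)))

size = library_size(R_GROUPS)
first_three = list(islice(enumerate_library(SCAFFOLD, R_GROUPS), 3))
print(size, first_three)
```

With three positions and 1,000 options at each, the same arithmetic gives 10^9 products, which is the scale of GDB-13.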
Big data & cheminforma8cs
• Fundamentally, “big chemical data” lets us explore larger chemical spaces – Can plow through large catalogs – e.g., iden:fying PKR inhibitors by LBVS of the ChemNavigator collec:on [Bryk et al]
• This can push predic:ve models to their limits – Brings us back to the global vs local arguments
The Hadoop ecosystem
• A framework for the map-reduce algorithm
  – Not something you can download and just run
  – Need to set up the infrastructure and then develop code that runs on it
• Low-level Hadoop programs can be large, complex and tedious
• Abstractions have been developed that make Hadoop queries more SQL-like
  – Results in much more concise code
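The map, shuffle and reduce phases that the framework provides can be mimicked in memory to show the shape of a job. This is a minimal sketch, not Hadoop code: the function names are made up for illustration, and the regular-expression tokenizer is a crude assumption that ignores bracket atoms, charges and isotopes.

```python
import re
from collections import defaultdict

# Crude element tokenizer for SMILES (illustrative assumption only).
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def map_atoms(smiles):
    """Map step: emit (element, 1) for every atom token in one SMILES."""
    for token in ATOM_PATTERN.findall(smiles):
        yield token.capitalize(), 1

def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the counts for one element."""
    return key, sum(values)

def atom_counts(smiles_list):
    emitted = (pair for smi in smiles_list for pair in map_atoms(smi))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())

counts = atom_counts(["CCO", "c1ccccc1Cl", "NC(=O)C(=O)N"])
print(counts)
```

In a real deployment the map and reduce functions would run on different nodes and the shuffle would be handled by the framework; only the two user-supplied functions carry over.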
The Hadoop ecosystem
Hadoop Common
Hadoop Distributed Filesystem
Map Reduce Engine
Hive
Hama
Whirr
HBase
Pig
Avro
Mahout
Flume
Zookeeper
Chukwa
Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
Simplifying Hadoop applications
• Raw Hadoop programs can be very tedious to write
SMARTS based substructure search
Pig & Pig Latin
• Pig Latin programs are much simpler to write and get translated to Hadoop code
• SQL-like; requires UDFs (user-defined functions) to be implemented for non-standard tasks
SMARTS search in Pig Latin
    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';
UDF for SMARTS search
    public class SMATCH extends FilterFunc {
        static SMARTSQueryTool sqt;
        static {
            try {
                sqt = new SMARTSQueryTool("C");
            } catch (CDKException e) {
                System.out.println(e);
            }
        }
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                sqt.setSmarts(query);
                IAtomContainer mol = sp.parseSmiles(target);
                return sqt.matches(mol);
            } catch (CDKException e) {
                throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
            }
        }
    }
Working on top of Hadoop
• Hadoop doesn’t know anything about cheminformatics
  – Need to write your own code, UDFs etc.
• But application layers have been developed for other purposes
  – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
  – Possible to build virtual screening pipelines based on the Hadoop framework
What Hadoop is not for
• Doesn’t replace an actual database
• It’s not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method
  – CPU-bound methods and those requiring communication
Cheminformatics on Hadoop
• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics
But are cheminformatics problems really big enough to justify all of this?
How big is big?
• Bryk et al. performed a LBVS of 5 million compounds to identify PKR inhibitors
  – Pharmacophore fingerprints + perceptron
  – Required conformer generation
• Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long
• Example: RF models built on 512-bit binary fingerprints give predictions for 5M fingerprints in 12 min [Single core, 3 GHz Xeon, OS X 10.6.8]
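The quoted timing can be turned into a rough throughput estimate. The sketch below uses only the figures from the slides (5 million fingerprints in 12 minutes; 977 million structures in GDB-13) and assumes, optimistically, that prediction time scales linearly with library size and core count and that descriptor generation and I/O are excluded.

```python
# Figures quoted in the talk: 5 million fingerprints scored in 12 minutes
# on a single core, and 977 million structures in GDB-13.
N_SCREENED = 5_000_000
MINUTES = 12
GDB13_SIZE = 977_000_000

def throughput_per_second(n_items, minutes):
    """Predictions per second for a run of n_items taking `minutes`."""
    return n_items / (minutes * 60)

def hours_needed(n_items, rate_per_second, n_cores=1):
    """Wall-clock hours, assuming perfectly linear scaling over cores."""
    return n_items / (rate_per_second * n_cores) / 3600

rate = throughput_per_second(N_SCREENED, MINUTES)
single_core_hours = hours_needed(GDB13_SIZE, rate)
hundred_core_hours = hours_needed(GDB13_SIZE, rate, n_cores=100)
print(round(rate), round(single_core_hours, 2), round(hundred_core_hours * 60, 1))
```

Under these assumptions the prediction step alone comes to roughly 39 hours on one core for a GDB-13-sized collection, or under half an hour on 100 cores.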
Going beyond chunking?
• All the preceding use cases are embarrassingly parallel
  – Chunking the input data and applying the same operation to each chunk
  – Very nice when you have a big cluster
Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
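The chunking pattern described on this slide can be sketched as follows. Integer "fingerprints" and a bit-count stand in for a real per-molecule calculation, and a thread pool keeps the example self-contained; a CPU-bound job would use separate processes or cluster nodes instead.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def score_chunk(fingerprints):
    """Stand-in for a per-molecule calculation: count the set bits."""
    return [bin(fp).count("1") for fp in fingerprints]

def score_all(fingerprints, chunk_size=4, workers=3):
    chunks = chunked(fingerprints, chunk_size)
    # Each chunk is independent, so workers never need to communicate.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_chunk, chunks)  # preserves chunk order
        return [score for chunk in results for score in chunk]

fps = list(range(10))  # toy integer "fingerprints"
scores = score_all(fps)
print(scores)
```

Nothing in `score_chunk` depends on any other chunk, which is what makes the job embarrassingly parallel and also why it says little about map-reduce at the algorithmic level.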
Going beyond chunking?
• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
  – Doesn’t always avoid the O(N²) barrier
  – Bioisostere identification is one case that could be rephrased as a map-reduce problem
• Search algorithms such as GAs and particle swarms can make use of map-reduce
  – GA-based docking
  – Feature selection for QSAR models
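A pairwise calculation can be split into map tasks by giving each task a block of rows of the similarity matrix. The sketch below is one possible partitioning, not a prescribed one; fingerprints are plain Python integers and the block scheme and cutoff are arbitrary choices for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    return common / total if total else 0.0

def n_pairs(n):
    """Number of unordered pairs among n molecules: n(n-1)/2."""
    return n * (n - 1) // 2

def map_block(fps, rows):
    """Map task: compare the molecules in `rows` with all later ones."""
    for i in rows:
        for j in range(i + 1, len(fps)):
            yield (i, j), tanimoto(fps[i], fps[j])

def similar_pairs(fps, n_blocks=2, cutoff=0.5):
    """Reduce: keep pairs at or above the cutoff from all blocks."""
    blocks = [range(b, len(fps), n_blocks) for b in range(n_blocks)]
    hits = {}
    for rows in blocks:  # each block could run on its own node
        for pair, sim in map_block(fps, rows):
            if sim >= cutoff:
                hits[pair] = sim
    return hits

fps = [0b1111, 0b0111, 0b1000, 0b0011]
hits = similar_pairs(fps)
print(n_pairs(len(fps)), sorted(hits))
```

The total work is unchanged: for 5 million compounds `n_pairs` gives about 1.25 × 10^13 comparisons, so distributing the blocks shortens wall-clock time without removing the O(N²) cost.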
Going beyond chunking?
• Machine learning for massive chemical datasets?
  – MR jobs (descriptor generation) + Mahout (model building) let us handle this in a straightforward manner
• But will QSAR models benefit from more data?
  – Helgee et al. suggest global models are preferable
  – But diversity and the structure of the chemical space will affect performance of global models
  – Unsupervised methods may be more relevant
  – Philosophical question?
Going beyond chunking?
• Many clustering algorithms are amenable to map-reduce style
  – K-means, spectral, EM, minhash, …
  – Many are implemented in Mahout
Problems where we generate large numbers of combinations can be amenable to map-reduce
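As one instance of the clustering algorithms listed above, a single k-means iteration splits naturally into a map step (assign each point to its nearest centroid) and a reduce step (average the points in each cluster). This is a toy in-memory sketch with made-up 2-D descriptor vectors, not Mahout's implementation.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean)."""
    def dist2(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda k: dist2(centroids[k]))

def map_assign(points, centroids):
    """Map: emit (cluster index, point) for each descriptor vector."""
    for point in points:
        yield nearest(point, centroids), point

def reduce_mean(points):
    """Reduce: the new centroid is the mean of the assigned points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_step(points, centroids):
    groups = defaultdict(list)
    for k, point in map_assign(points, centroids):
        groups[k].append(point)
    # A centroid with no assigned points is left where it was.
    return [reduce_mean(groups[k]) if groups[k] else centroids[k]
            for k in range(len(centroids))]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
print(centroids)
```

Only the current centroids need to be broadcast to the mappers on each iteration, which is why the method scales to data that does not fit on one machine.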
Networks & integration
• Network models of molecules and targets are common
  – Allows for the incorporation of lots of associated information
  – Diseases, pathways, OTEs
• When linked with clinical data & outcomes, we can generate massive networks
  – Adverse events (FDA AERS)
  – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
Yildirim, M.A. et al.
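Counting drug-drug-reaction triples has the same emit-then-aggregate shape as other map-reduce jobs. The sketch below is only an illustration of that shape and is not the method used in the Cloudera analysis; the reports and drug names are hypothetical placeholders, not AERS records.

```python
from collections import Counter
from itertools import combinations

# Hypothetical adverse-event reports: (drugs taken, reactions observed).
REPORTS = [
    ({"drugA", "drugB"}, {"nausea"}),
    ({"drugA", "drugB", "drugC"}, {"nausea", "rash"}),
    ({"drugB", "drugC"}, {"rash"}),
]

def map_triples(drugs, reactions):
    """Map: emit every (drug, drug, reaction) triple in one report."""
    for d1, d2 in combinations(sorted(drugs), 2):
        for reaction in sorted(reactions):
            yield d1, d2, reaction

def count_triples(reports):
    """Reduce: total occurrences of each triple across reports."""
    counts = Counter()
    for drugs, reactions in reports:
        counts.update(map_triples(drugs, reactions))
    return counts

triples = count_triples(REPORTS)
print(len(triples), sum(triples.values()))
```

A report listing d drugs and r reactions emits d(d-1)/2 × r triples, so the number of triples grows much faster than the number of reports.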
Networks & integration
• SAR data can be viewed in a network form
  – SALI, SARI based networks
  – Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
http://sali.rguha.net/ Peltason, L. et al.
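For a pair of molecules i and j, the SALI value is |A_i − A_j| / (1 − sim(i, j)), where A is the activity and sim is a structural similarity. A small sketch of building a thresholded SALI network follows; the molecules, fingerprints, activities and cutoff are hypothetical, Tanimoto on integer fingerprints is an assumed similarity measure, and ordering each edge from the less to the more active molecule is a choice made for this example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    return common / total if total else 0.0

def sali(act_i, act_j, sim):
    """SALI = |A_i - A_j| / (1 - sim); infinite when sim is 1."""
    if sim >= 1.0:
        return float("inf")
    return abs(act_i - act_j) / (1.0 - sim)

def sali_edges(molecules, cutoff):
    """Edges (low-activity id, high-activity id, SALI) at or above cutoff."""
    edges = []
    ids = sorted(molecules)
    for a, i in enumerate(ids):
        for j in ids[a + 1:]:
            fp_i, act_i = molecules[i]
            fp_j, act_j = molecules[j]
            value = sali(act_i, act_j, tanimoto(fp_i, fp_j))
            if value >= cutoff:
                low, high = (i, j) if act_i <= act_j else (j, i)
                edges.append((low, high, value))
    return edges

# Hypothetical data: id -> (integer fingerprint, activity such as pIC50).
MOLS = {"m1": (0b1111, 5.0), "m2": (0b0111, 8.0), "m3": (0b1000, 5.5)}
edges = sali_edges(MOLS, cutoff=5.0)
print(edges)
```

The double loop is the pairwise step that limits current studies to small datasets; it is also the part that would be distributed in a Hadoop or Giraph setting.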
Networks & integration
• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
  – Network based similarity
  – Community detection (aka clustering)
  – PageRank style ranking (of targets, compounds, …)
  – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
Bauer-Mehren et al.
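PageRank-style ranking reduces to repeated propagation of scores along edges, which is the kind of per-node, per-iteration work that graph frameworks distribute. The sketch below is a plain power-iteration version on a tiny hypothetical compound-target graph; the node names, damping factor and iteration count are illustrative assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict of node -> list of out-neighbours."""
    nodes = sorted(set(graph) | {n for outs in graph.values() for n in outs})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            outs = graph.get(node, [])
            if outs:
                share = damping * rank[node] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / len(nodes)
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical compound -> target links; names are placeholders.
GRAPH = {
    "cmpd1": ["targetA"],
    "cmpd2": ["targetA", "targetB"],
    "cmpd3": ["targetB"],
    "targetA": ["targetB"],
}
ranks = pagerank(GRAPH)
print(max(ranks, key=ranks.get))
```

Each iteration only needs a node's current score and its outgoing edges, so the update can be expressed as messages passed between nodes.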
Conclusions
• Cheminformatics applications can be rewritten to take advantage of cloud resources
  – Remotely hosted
  – Embarrassingly parallel / chunked
  – Map/reduce
• Ability to process larger structure collections lets us explore more chemical space
• Integrating chemistry with clinical & pharmacological data can lead to big datasets
Conclusions
• Q: But are cheminformatics problems really big enough to justify all of this?
• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
• A: Yes – especially when we consider problems with a combinatorial flavor