Continuous modeling - automating model building on high-performance e-Infrastructures. Ola Spjuth, Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala, Sweden


Page 1: Continuous modeling - automating model building on high-performance e-Infrastructures

Continuous modeling - automating model building on high-performance e-Infrastructures

Ola Spjuth
Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala, Sweden

Page 2: Continuous modeling - automating model building on high-performance e-Infrastructures

Today: We have access to high-throughput technologies to study biological phenomena

Page 3: Continuous modeling - automating model building on high-performance e-Infrastructures

New challenges: Data management and analysis

• Storage
• Analysis methods, pipelines
• Scaling
• Automation
• Data integration, security
• Predictions
• …

Page 4: Continuous modeling - automating model building on high-performance e-Infrastructures

My research focus

• Enabling high-throughput biology, from e-infrastructures and up
– Massively parallel sequencing, metabolomics
– Predictive modeling in toxicology and pharmacology

• Particular focus on large-scale predictive modeling
– Tackle large problems
– Evaluate predictive performance
– Easy and secure sharing/consumption of models
– Automate re-building of models

Page 5: Continuous modeling - automating model building on high-performance e-Infrastructures

Observations

• Predictive toxicology and pharmacology are becoming data-intensive
– High-throughput technologies
• Drug/chemical screening
• Molecular biology (omics)
– More and bigger publicly available data sources
• Data is continuously updated

Page 6: Continuous modeling - automating model building on high-performance e-Infrastructures

QSAR modeling

• Signatures [1] descriptor in CDK [2]
– Canonical representation of atom environments

• Support Vector Machine (SVM)
– Robust modeling

1. Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and Computer Sciences, 2003, 43, 707-720

2. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500

Lars Carlsson, AstraZeneca R&D
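The idea of describing a molecule by canonical strings of its atom environments can be illustrated with a small sketch. This is a deliberately simplified toy, not the Signatures algorithm of reference 1 and not the CDK implementation: the graph encoding (`elements`, `bonds`) and the function names are assumptions made for illustration, bond orders and hydrogens are ignored, and the recursion is allowed to walk back to the atom it came from.

```python
from collections import Counter

def atom_signature(atom, elements, bonds, height=1):
    """Return a string for the environment of `atom` up to `height` bonds away."""
    if height == 0:
        return elements[atom]
    # Describe each neighbour one level shallower, then sort the strings so the
    # result does not depend on the order in which neighbours are stored.
    neighbours = sorted(
        atom_signature(n, elements, bonds, height - 1) for n in bonds[atom]
    )
    return elements[atom] + "(" + ",".join(neighbours) + ")"

def signature_counts(elements, bonds, height=1):
    """Count how often each atom-environment string occurs in the molecule."""
    return Counter(atom_signature(a, elements, bonds, height) for a in elements)

# Heavy atoms of ethanol, C0-C1-O2 (toy encoding, hydrogens omitted).
elements = {0: "C", 1: "C", 2: "O"}
bonds = {0: [1], 1: [0, 2], 2: [1]}
counts = signature_counts(elements, bonds, height=1)
```

The resulting counts form a sparse feature vector per molecule, which is the kind of input a learning method such as an SVM would then be trained on.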

Page 7: Continuous modeling - automating model building on high-performance e-Infrastructures

Interpretation of nonlinear QSAR models

• Method
– Compute gradient of decision function for prediction
– Extract descriptor(s) with largest component in the gradient
• Demonstrated on RF, SVM, and PLS

Carlsson, L., Helgee, E. A., and Boyer, S. Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49, 11 (Nov 2009), 2551–2558.

E. Ahlberg, O. Spjuth, C. Hasselgren, and L. Carlsson. Interpretation of Conformal Prediction Classification Models. In Statistical Learning and Data Sciences, vol. 9047 of Lecture Notes in Computer Science. Springer International Publishing, 2015, pp. 323–334.

Lars Carlsson, AstraZeneca R&D
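The two method steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited papers: the decision function is a hand-built RBF-kernel expression with made-up support vectors and coefficients, the gradient is approximated by central finite differences, and "largest component" is taken here to mean largest absolute value.

```python
import math

def rbf_decision(x, support_vectors, coefs, bias, gamma):
    """Decision function of a kernel model: bias + sum_i c_i * exp(-gamma * |x - sv_i|^2)."""
    total = bias
    for sv, c in zip(support_vectors, coefs):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, sv))
        total += c * math.exp(-gamma * sq_dist)
    return total

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at the point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def most_important_descriptor(f, x):
    """Index of the descriptor with the largest absolute gradient component."""
    grad = numerical_gradient(f, x)
    index = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return index, grad

# Hypothetical model with three descriptors (all numbers are made up).
support_vectors = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
coefs = [1.0, -1.0]

def decision(x):
    return rbf_decision(x, support_vectors, coefs, bias=0.0, gamma=0.5)

index, grad = most_important_descriptor(decision, [0.5, 0.2, 0.0])
```

For this toy model the prediction at the chosen point is most sensitive to the first descriptor, so that descriptor would be the one highlighted to the user.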

Page 8: Continuous modeling - automating model building on high-performance e-Infrastructures

Bioclipse Decision Support

Page 9: Continuous modeling - automating model building on high-performance e-Infrastructures

Modeling large number of observations on HPC

Aim: Measure predictive performance when QSAR datasets get larger

Research questions:
• When do we need HPC?
• How can we work efficiently with HPC in modeling?
• Are nonlinear methods required?

Page 10: Continuous modeling - automating model building on high-performance e-Infrastructures

High-Performance Computing

• Computationally expensive problems call for high-performance e-Infrastructures

• High-Performance Computing (HPC)
– Fast interconnect between compute nodes

• High-Throughput Computing (HTC)
– Fast interconnect not needed

• Cloud Computing (CC)
– Infrastructure as a Service (IaaS)

Page 11: Continuous modeling - automating model building on high-performance e-Infrastructures

UPPMAX high-performance computing center (Uppsala, Sweden)

• Get access to multiple nodes
– 16 compute cores per node

• Get access to large memory machines
– We have nodes with 128, 256, 512, or 2000 GB RAM

• OpenStack private cloud

• However, on HPC:
– Only terminal usage, no web server allowed (scripting in Bash, Perl, and Python is common)
– Queuing system (e.g. SLURM, SGE)
– Limited job length (e.g. 10 days)
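Working through a queuing system usually means wrapping each modeling run in a batch script. The sketch below renders a SLURM script from Python; the `-J`, `-A`, `-N`, `-n`, and `-t` options are standard `sbatch` options, while the project name, job name, and command are hypothetical placeholders and the defaults simply mirror the figures on this slide (16 cores, 10 days).

```python
def render_sbatch(job_name, project, command, cores=16, days=10):
    """Render a single-node SLURM batch script as a string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -J {job_name}",       # job name shown in the queue
        f"#SBATCH -A {project}",        # account/project to charge
        "#SBATCH -N 1",                 # one node
        f"#SBATCH -n {cores}",          # number of tasks (cores) requested
        f"#SBATCH -t {days}-00:00:00",  # wall-clock limit, days-hh:mm:ss
        command,
    ]
    return "\n".join(lines) + "\n"

# Hypothetical project and command names, for illustration only.
script = render_sbatch("qsar-train", "proj0000", "python train_model.py")
```

The rendered text would be written to a file and submitted with `sbatch`; generating it programmatically is what lets a workflow tool submit many such jobs without manual editing.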

Page 12: Continuous modeling - automating model building on high-performance e-Infrastructures

Project growth

Page 13: Continuous modeling - automating model building on high-performance e-Infrastructures

Bioinformatics has inefficient HPC usage

Page 14: Continuous modeling - automating model building on high-performance e-Infrastructures

Levels of automation in sequence analysis

• Production: Can be fully automated

• Secondary analysis: Partly automated

• Researchers: Basic science is not really useful to automate; flexibility is needed

Page 15: Continuous modeling - automating model building on high-performance e-Infrastructures

Training large number of datasets on HPC

Aim: Build models for hundreds or thousands of targets

– Challenge to automate data assembly/integration

– Challenge to automate model building

Hypothesis: Workflow systems can enable agile large-scale predictive modeling

Data sources

Samuel Lampa

Page 16: Continuous modeling - automating model building on high-performance e-Infrastructures

What is a workflow system

Page 17: Continuous modeling - automating model building on high-performance e-Infrastructures

The workflow landscape

Page 18: Continuous modeling - automating model building on high-performance e-Infrastructures

Automating analysis on clusters

• Workflow systems can aid development and deployment
• We extended the Luigi system into SciLuigi (https://github.com/samuell/sciluigi)
• Integrate with batch queuing system on HPC

Train and assess model

Samuel Lampa
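The core idea of a workflow system, tasks that declare their dependencies and are run in the right order, can be sketched in a few lines. This is a generic toy and does not use the Luigi or SciLuigi API; the task names, the tiny dataset, and the threshold "model" are all made up for illustration.

```python
def run_workflow(tasks, target, done=None, log=None):
    """Run `target` after its dependencies, running each task at most once."""
    done = {} if done is None else done
    log = [] if log is None else log
    if target in done:
        return done[target], log
    dependencies, func = tasks[target]
    inputs = [run_workflow(tasks, d, done, log)[0] for d in dependencies]
    done[target] = func(*inputs)
    log.append(target)
    return done[target], log

def fetch_data():
    # Made-up (descriptor value, class label) pairs.
    return [(0.1, 0), (0.9, 1), (0.8, 1), (0.2, 0)]

def split(rows):
    return rows[:2], rows[2:]

def train(parts):
    # Toy "model": the mean descriptor value of the training part is the threshold.
    training = parts[0]
    return sum(x for x, _ in training) / len(training)

def assess(parts, threshold):
    test = parts[1]
    hits = sum(1 for x, y in test if (x > threshold) == bool(y))
    return hits / len(test)

# Each task maps to (list of dependency names, function).
tasks = {
    "fetch_data": ([], fetch_data),
    "split": (["fetch_data"], split),
    "train": (["split"], train),
    "assess": (["split", "train"], assess),
}
accuracy, order = run_workflow(tasks, "assess")
```

A real workflow system adds what this sketch leaves out: persistent targets so finished steps are skipped on re-runs, and submission of each task to the batch queue instead of running it in-process.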

Page 19: Continuous modeling - automating model building on high-performance e-Infrastructures

Modeling large datasets on HPC

Jonathan Alvarsson

Page 20: Continuous modeling - automating model building on high-performance e-Infrastructures

Modeling large datasets on HPC

Jonathan Alvarsson

Page 21: Continuous modeling - automating model building on high-performance e-Infrastructures
Page 22: Continuous modeling - automating model building on high-performance e-Infrastructures

Publishing models

• Publish models for easy access and consumption

• We use P2 (OSGi) provisioning system

[Figure: published model versions v. 1.1, v. 1.2, v. 1.3; label: Use models]

Page 23: Continuous modeling - automating model building on high-performance e-Infrastructures

Bioclipse and OpenTox

E. Willighagen, N. Jeliazkova, B. Hardy, R. Grafström, and O. Spjuth. Computational toxicology using the OpenTox application programming interface and Bioclipse. BMC Research Notes 2011, 4:487

Page 24: Continuous modeling - automating model building on high-performance e-Infrastructures

Reactive/continuous modeling

[Diagram labels: Data sources; Coordinate, Integrate, Version, Monitor; Train and assess model; Publish models; Archive models; Bioclipse; User]
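One way to read the diagram is as a loop: monitor the data sources, rebuild when they change, then version, archive, and publish the new model. The sketch below illustrates that loop under loud assumptions: change detection by content hash, the `ModelRegistry` class, and the "1.N" version scheme are all hypothetical choices made for illustration and are not described in the slides.

```python
import hashlib

def fingerprint(records):
    """Content hash of a dataset, used to detect that a data source changed."""
    digest = hashlib.sha256()
    for record in sorted(records):
        digest.update(repr(record).encode("utf-8"))
    return digest.hexdigest()

class ModelRegistry:
    """Keeps every published model together with its version and data fingerprint."""

    def __init__(self):
        self.versions = []  # archived (version, fingerprint, model) tuples

    def latest_fingerprint(self):
        return self.versions[-1][1] if self.versions else None

    def publish(self, data_fingerprint, model):
        version = f"1.{len(self.versions) + 1}"
        self.versions.append((version, data_fingerprint, model))
        return version

def rebuild_if_changed(records, registry, train):
    """Retrain and publish only when the data differs from the last published model."""
    current = fingerprint(records)
    if current == registry.latest_fingerprint():
        return None
    return registry.publish(current, train(records))

def toy_train(records):
    # Placeholder for real model building: just remembers the dataset size.
    return {"n_training_records": len(records)}

registry = ModelRegistry()
data = [(1.0, 1), (2.0, 0)]
first = rebuild_if_changed(data, registry, toy_train)       # new data, model built
unchanged = rebuild_if_changed(data, registry, toy_train)   # nothing to do
data.append((3.0, 1))
second = rebuild_if_changed(data, registry, toy_train)      # data updated, rebuilt
```

Keeping the data fingerprint next to each archived model also records which state of the data a given model version was built from.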

Page 25: Continuous modeling - automating model building on high-performance e-Infrastructures

Could cloud computing improve/simplify modeling?

Page 26: Continuous modeling - automating model building on high-performance e-Infrastructures

Modeling on Amazon Elastic Cloud

B. T. Moghadam, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson, and O. Spjuth. Scaling predictive modeling in drug development with cloud computing. J. Chem. Inf. Model., 2015, 55 (1), pp 19-25

Page 27: Continuous modeling - automating model building on high-performance e-Infrastructures

• H2020 infrastructure project (2015-2018)

• Platform for metabolomics data analysis
– Study metabolites in primarily clinical studies

• Integrating data and tools
• Data management, privacy
• Cloud/Microservices architecture
• Predictions

http://phenomenal-h2020.eu/

Page 28: Continuous modeling - automating model building on high-performance e-Infrastructures

Could Big Data frameworks improve/simplify modeling?

• Map/Reduce, Hadoop, Spark, HDFS/distributed file systems and others…

• Recently received a lot of attention
• Allow for massively parallel analysis

• How useful are they in pharmaceutical bioinformatics?

Page 29: Continuous modeling - automating model building on high-performance e-Infrastructures

Hadoop (MapReduce) for massively parallel analysis

Page 30: Continuous modeling - automating model building on high-performance e-Infrastructures

Evaluating Hadoop for sequence analysis

• Compare Hadoop and HPC
– Create as identical pipelines as possible
– Investigate scaling and performance
– Shows the bottlenecks with current HPC

Alexey Siretskiy, former Postdoc

A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience. 2015; 4:26.
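A common way to quantify scaling when comparing two platforms is to compute speedup and parallel efficiency from wall-clock times. The sketch below shows the arithmetic only; the function name is an assumption, the timings are made-up numbers rather than results from the cited study, and the study itself may have used a different metric.

```python
def scaling_table(runtimes):
    """Map core count -> (speedup, efficiency), relative to the smallest run.

    `runtimes` maps number of cores to wall-clock seconds.
    """
    base_cores = min(runtimes)
    base_time = runtimes[base_cores]
    table = {}
    for cores, seconds in sorted(runtimes.items()):
        speedup = base_time / seconds
        efficiency = speedup / (cores / base_cores)
        table[cores] = (speedup, efficiency)
    return table

# Made-up timings, for illustration only.
table = scaling_table({1: 1000.0, 4: 300.0, 16: 125.0})
```

An efficiency that drops quickly as cores are added is one sign of a bottleneck, such as shared storage, rather than of the computation itself.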

Page 31: Continuous modeling - automating model building on high-performance e-Infrastructures

Distributed modeling with Spark

• Appealing programming methodology
• Built-in data locality and in-memory computing
– RDD (Resilient Distributed Dataset): distributed large-scale dataset abstraction
– MLlib: Spark-based distributed implementation of many ML algorithms

[Figure: Logistic regression in Hadoop and Spark]
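Logistic regression is a typical benchmark here because each training iteration is a map over data partitions followed by a reduce, and the data can stay in memory between iterations. The sketch below mimics that pattern with Python's built-in `map` and `functools.reduce` on in-process lists; it is not Spark or MLlib code, and the dataset, learning rate, and step count are made up for illustration.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partition_gradient(partition, w):
    """Map step: log-loss gradient summed over one data partition."""
    grad = [0.0] * len(w)
    for x, y in partition:
        error = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for i, xi in enumerate(x):
            grad[i] += error * xi
    return grad

def add_vectors(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [ai + bi for ai, bi in zip(a, b)]

def train(partitions, n_features, steps=200, learning_rate=0.5):
    w = [0.0] * n_features
    n = sum(len(p) for p in partitions)
    for _ in range(steps):
        partial = map(lambda p: partition_gradient(p, w), partitions)
        total = reduce(add_vectors, partial)
        w = [wi - learning_rate * gi / n for wi, gi in zip(w, total)]
    return w

# Two partitions of a made-up dataset; each x is (bias term, one descriptor).
partitions = [
    [((1.0, 0.0), 0), ((1.0, 1.0), 0)],
    [((1.0, 3.0), 1), ((1.0, 4.0), 1)],
]
w = train(partitions, n_features=2)
p_low = sigmoid(w[0] * 1.0 + w[1] * 0.0)
p_high = sigmoid(w[0] * 1.0 + w[1] * 4.0)
```

In a distributed setting only the small gradient vectors travel between workers, while the partitions themselves stay where they are stored.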

Page 32: Continuous modeling - automating model building on high-performance e-Infrastructures

Parallel Virtual Screening with Spark

Hypothesis: The Spark framework can be used for trivially parallelizable problems in pharmaceutical bioinformatics
• Demonstrate on Virtual Screening
• Used OpenEye suite

Preliminary results:
• Spark API allows for simple programmatic parallelization
• Good scalability in terms of speedup
• Lack of documentation

L. Ahmed, A. Edlund, E. Laure, and O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), vol. 2, pp. 27-32, 2013

Laeeq Ahmed, PhD Student

Valentin Georgiev, Researcher
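Virtual screening is trivially parallelizable because every molecule is scored independently, so the whole job is a parallel map followed by picking the best hits. The sketch below shows that shape with a thread pool from the Python standard library as a stand-in for Spark's map; `toy_score` is a made-up placeholder and not a call into the OpenEye suite, and the SMILES strings are only example inputs.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def toy_score(smiles):
    """Hypothetical stand-in for a docking score (lower is better)."""
    return abs(len(smiles) - 8)

def screen(library, top_k=2, workers=4):
    """Score every molecule independently, then keep the top_k best."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(toy_score, library))  # results keep input order
    return heapq.nsmallest(top_k, zip(scores, library))

library = ["CCO", "c1ccccc1", "CC(=O)O", "CCCCCCCCCCCC"]
hits = screen(library)
```

Because no molecule depends on another, the same function could be handed to any parallel map, which is what makes this class of problem a good first test of a framework.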

Page 33: Continuous modeling - automating model building on high-performance e-Infrastructures

Conformal Prediction in Spark

• Evaluate confidence in predictions
• We implemented Inductive Conformal Prediction (ICP) in Spark, extending MLlib
• Tested on 2 large data sets
– HIGGS: 11M examples. Task: distinguish between Higgs boson signal process and background process
– SUSY: 5M examples. Task: distinguish between supersymmetric particle signal process and background process

POSTER P-33

Marco Capuccini, PhD Student
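How an inductive conformal predictor turns a model's output into a prediction with confidence can be sketched in plain Python. This is a minimal illustration of the general ICP recipe, not the Spark/MLlib implementation described on this slide: the underlying "model" is a fixed logistic curve, the nonconformity measure (one minus the estimated probability of the label) is one common choice assumed here, and the calibration data are made up.

```python
import math

def prob_of(label, x):
    """Hypothetical underlying model: probability of `label` given descriptor x."""
    p1 = 1.0 / (1.0 + math.exp(-4.0 * (x - 0.5)))
    return p1 if label == 1 else 1.0 - p1

def nonconformity(x, label):
    """Higher means (x, label) looks stranger to the underlying model."""
    return 1.0 - prob_of(label, x)

def p_value(calibration_scores, x, label):
    """Fraction of calibration scores at least as nonconforming as (x, label)."""
    alpha = nonconformity(x, label)
    at_least = sum(1 for a in calibration_scores if a >= alpha)
    return (at_least + 1) / (len(calibration_scores) + 1)

def predict_set(calibration_scores, x, labels, significance):
    """Labels that cannot be rejected at the chosen significance level."""
    return {y for y in labels if p_value(calibration_scores, x, y) > significance}

# Made-up calibration examples, held out from training the underlying model.
calibration = [(0.1, 0), (0.2, 0), (0.3, 0), (0.6, 0),
               (0.45, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
calibration_scores = [nonconformity(x, y) for x, y in calibration]

confident = predict_set(calibration_scores, 0.95, [0, 1], significance=0.2)
uncertain = predict_set(calibration_scores, 0.5, [0, 1], significance=0.2)
```

A clear-cut object gets a single label, while an object near the decision boundary gets both labels, which is how the method signals low confidence instead of forcing a guess.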

Page 34: Continuous modeling - automating model building on high-performance e-Infrastructures

Conformal Prediction in Spark

Results:
• Valid predictions
• Good scalability

M. Capuccini, L. Carlsson, U. Norinder, and O. Spjuth. Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence. Accepted in IEEE Transactions on Cloud Computing, 2015.

POSTER P-33

Marco Capuccini, PhD Student

Page 35: Continuous modeling - automating model building on high-performance e-Infrastructures

Some conclusions

• Automation/continuous modeling is not trivial
– Data management, modeling, model management/governance

• Conformal prediction
– Predictions with confidence

• Large-scale problems require computational power
– Cloud computing vs High-Performance Computing

• Workflows and Big Data frameworks
– Immature technologies, not well documented
– Can be useful for large-scale analysis in pharmaceutical bioinformatics, especially for automation

Page 36: Continuous modeling - automating model building on high-performance e-Infrastructures

Some ongoing projects

• Augment parallel virtual screening with machine learning

• Further develop conformal predictions in distributed settings

• Large-scale target predictions
• Continue evaluating Spark vs Workflows, Cloud vs HPC
– Have not yet reached a good agile system, but we are getting closer

• The group is open for collaborations.

Page 37: Continuous modeling - automating model building on high-performance e-Infrastructures

Thank you

Ola Spjuth, [email protected]