Christophe Blanchet, Clément Gauthey Infrastructure Distributed for Biology IDB-IBCP CNRS FR3302 - LYON - FRANCE http://idee-b.ibcp.fr IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552 ) and the French National Research Agency's Arpege Programme (ANR-10-SEGI-001 ) IDB-Cloud Providing Bioinformatics Services on Cloud

IDB-Cloud Providing Bioinformatics Services on Cloud

DESCRIPTION

A presentation of IDB (Infrastructure Distributed for Biology) using StratusLab technology by Christophe Blanchet and Clément Gauthey at Lille, France, May 2013.

Page 1: IDB-Cloud Providing Bioinformatics Services on Cloud

Christophe Blanchet, Clément Gauthey

Infrastructure Distributed for Biology
IDB-IBCP CNRS FR3302 - LYON - FRANCE

http://idee-b.ibcp.fr

IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)

IDB-Cloud: Providing Bioinformatics Services on Cloud

Page 2: IDB-Cloud Providing Bioinformatics Services on Cloud

Réseau des Ingénieurs en Bioinformatique, Lille, 23 May 2013

Bioinformatics Today
• Biological data are big data
  • 1512 online databases (NAR Database Issue 2013)
  • Sanger Institute, UK, 5 PB
  • Beijing Genome Institute, China, 4 sites, 10 PB
  ➡ Big data in a lot of places
• Analysing such data has become difficult
  • Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
  • Lots of different tools in daily use
  • That need to be combined in workflows
  • Usual interfaces: portals, Web services, federation, ...
  ➡ Datacenters with ease of access/use
• Distributed resources
  • Experimental platforms: NGS, imaging, ...
  • Bioinformatics platforms
  ➡ Federation of datacenters

[Figure: map of distributed sites, labelled ADN (DNA), BI, M and CC]

Page 3: IDB-Cloud Providing Bioinformatics Services on Cloud

Sequencing Genomes

source: www.politigenomics.com/next-generation-sequencing-informatics

Complete genome sequencing has become a lab commodity with NGS (cheap and efficient)

source: www.genomesonline.org

Page 4: IDB-Cloud Providing Bioinformatics Services on Cloud

Infrastructures in Biology

Lots of tools and web services to process and visualize lots of data

Page 5: IDB-Cloud Providing Bioinformatics Services on Cloud

The scene
• Bioinformatics service providers
  • Is it easy to deploy a lot of (incompatible) tools?
  • To connect them to public databases?
  • To limit the transfer of huge data?
  • To provide users with their own computing resources?
  • With their own isolated storage?
• Scientists
  • Is it easy to access/use these tools?
  • To adapt them to your usage?
  • To get your/other tools deployed on a datacenter?
  • To combine them?
  • To get my own computing/storage resources?

[Figure: flowchart linking Scientists, Engineers, the Bioinformatics Center and Computer Resources. Labels: "French biologists have access to regional resources (RENABI)"; "toolX?"; "Availability? Yes/No"; "Compatible? Yes/No?"; "Usually one cluster for all use"; "installation time".]

Page 6: IDB-Cloud Providing Bioinformatics Services on Cloud

RENABI GRISBI: Bioinformatics French Grid (www.grisbio.fr)

[Figure: map of the RENABI regional centers, labelled RENABI-GO, APLIBIO, PRABI, RENABI-SO and RENABI-NE, with the label IBISA PF-2008 and per-center resources: 563 cores, 90 TB; 444 cores, 62 TB; 376 cores, 50 TB; 304 cores, 32 TB; 876 cores, 75 TB. © RENABI GRISBI]

• Working group on organization and technologies: e.g. gLite, DIET, GridWay, BioMaj, ActiveCircle, Caringo, HDFS, XtreemFS, dCache, …
• Distributed bioinformatics infrastructure
  • Financial support from RENABI, IBISA 2008-2011, Institut des Grilles 2009-2010
• Computing resources:
  • in the platforms (PFs): 2600 cores, 310 TB of storage
  • already on GRISBI: 860 cores, 26 TB of storage
• 5 RENABI regional centers
  • Production platforms in bioinformatics
  • Labelled RIO / IBISA
  • 9 sites: 7 CNRS, 2 INRA
  • ~70 registered members
• Collaboration with the national computing infrastructures: Institut des Grilles, Grid5000, GENCI, mesocentres

=> To structure the community and offer answers to the needs of biologists
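
The five per-center figures on this slide can be checked against the quoted platform totals (2600 cores, 310 TB). The sketch below is only an illustration: it reads each label as a pair (cores, terabytes), which is an interpretation of the slide, and the variable names are not from the presentation.

```python
# Per-center figures as read from the slide: (cores, storage in TB).
# Assumption: each label is a (cores, TB) pair; the slide text does not say
# which figure belongs to which regional center.
centers = [(563, 90), (444, 62), (376, 50), (304, 32), (876, 75)]

total_cores = sum(cores for cores, _ in centers)
total_tb = sum(tb for _, tb in centers)

print(total_cores, total_tb)  # 2563 309
```

Both sums fall within about 2% of the rounded totals given for the platforms, which supports reading the labels this way.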

Page 7: IDB-Cloud Providing Bioinformatics Services on Cloud

Meeting the needs (need: gLite / GRISBI)
• International databanks: ~ yes / BioMaj, NFS
• Personal space: ~ yes / XtreemFS?
• Shared space: ~ yes
• Simple access to storage: no / XtreemFS?
• Distribution of computations: WMS
• Integration of existing clusters: ~ yes / CE-gateway
• Software deployment: SWAREA / ++ human time
• Workflow/pipeline: ~ DAG
• Identity and access management: vo.renabi.fr / Shibboleth, LDAP
• Easy-to-use interface: ~ CLI / "GR commands"
• Public interface (anonymous access through portal and web services): no / ? robot certificates, MyProxy?

➡ The gLite software meets the need for computing power
➡ Its modes of data access and data management are less suited to the community's usage

Page 8: IDB-Cloud Providing Bioinformatics Services on Cloud

Cloud computing ?

[Figure: cloud computing diagram. Created by Sam Johnston. License: Creative Commons]

Page 9: IDB-Cloud Providing Bioinformatics Services on Cloud

StratusLab Project

Goal
§ Create a comprehensive, open-source IaaS cloud distribution

EU FP7 project
§ 1 June 2010 to 31 May 2012 (2 years)
§ 6 partners from 5 countries
§ Budget: 3.3 M€ (2.3 M€ EC)

Contacts
§ Website: http://stratuslab.eu/
§ Twitter: @StratusLab
§ Support: [email protected]

Partners: CNRS (FR), UCM (ES), GRNET (GR), SIXSQ (CH), TID (ES), TCD (IE)

Page 10: IDB-Cloud Providing Bioinformatics Services on Cloud

IDB’s Cloud
• Cloud workbench for Biology
  • 13 turnkey bioinformatics appliances (as of Apr. 2013)
  • Running since Sept. 2011, open to the Biology community
  • Lyon, France
• Powered by
  • StratusLab
    • Compute nodes, block storage
    • 900+ cores, 4+ TB RAM, 36 TB vdisks
    • Mainly Intel Sandy Bridge servers with 32 cores and 128 GB
    • Bigmem servers with 64 cores and 768 GB
    • VMs from 1 core/1 GB to 64 cores/768 GB RAM
  • + OpenStack
    • Object storage (Swift)
    • 200+ TB of redundant and scalable storage
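
To see what the VM size range means on the two server types listed on this slide, here is a minimal sketch that counts how many identical VMs fit on one server. It assumes no CPU or memory overcommit and no hypervisor overhead, neither of which is stated in the presentation; the 4-core/16 GB size is a hypothetical intermediate flavour.

```python
def vms_per_server(server_cores, server_ram_gb, vm_cores, vm_ram_gb):
    """Number of identical VMs that fit on one server without overcommit."""
    return min(server_cores // vm_cores, server_ram_gb // vm_ram_gb)

standard = (32, 128)  # standard server from the slide: 32 cores, 128 GB
bigmem = (64, 768)    # big-memory server from the slide: 64 cores, 768 GB

print(vms_per_server(*standard, 1, 1))     # 32 (smallest VM, limited by cores)
print(vms_per_server(*standard, 4, 16))    # 8  (hypothetical mid-size VM)
print(vms_per_server(*bigmem, 64, 768))    # 1  (largest VM fills a big-memory server)
print(vms_per_server(*standard, 64, 768))  # 0  (largest VM cannot run on a standard server)
```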

Page 11: IDB-Cloud Providing Bioinformatics Services on Cloud

Driven through a simple web interface

Page 12: IDB-Cloud Providing Bioinformatics Services on Cloud

Integrate Bioinformatics Tools in Cloud

[Figure: bioinformatics tools (BLAST, GOR4, FastA, SSearch, Abyss, ClustalW, Ray, BWA, PhyML) are installed on Linux virtual machines (RedHat, CentOS, Debian, Ubuntu, Suse) to create new appliances, which are listed in the Bioinformatics Marketplace (NGS, Structure, Galaxy, ARIA, Sequence, (…)).]

• Appliances are virtual machines
  • Small: a few GB, easy to convert to most virtualization formats
• Installed and pre-configured with common bioinformatics tools
  • e.g. BLAST, ClustalW, ARIA, MEME, HMMER, TopHat, BWA, SAMtools, etc.

Page 13: IDB-Cloud Providing Bioinformatics Services on Cloud

Bioinformatics Appliances

Page 14: IDB-Cloud Providing Bioinformatics Services on Cloud

Select your bioinformatics tools

Page 15: IDB-Cloud Providing Bioinformatics Services on Cloud

Run Bioinformatics Cloud Instances

[Figure: from the portal, instances are launched from the Bioinformatics Marketplace (NGS, Structure, Galaxy, ARIA, Sequence, (…)) onto IBCP's cloud resources. PaaS: BLAST, Clustal, etc. IaaS: a master & storage VM (ARIA) launches jobs over ssh on worker VMs (CNS), with a shared FS.]

Page 16: IDB-Cloud Providing Bioinformatics Services on Cloud

Manage your Cloud Instances

Page 17: IDB-Cloud Providing Bioinformatics Services on Cloud

Biological Data in Cloud

[Figure: the bioinformatics cloud (portal; PaaS: BLAST, Clustal, etc.; IaaS: master & storage VM running ARIA, worker VMs running CNS, shared FS, jobs launched over ssh) and its data flows.]

• Public data sources: UniProt, PDB, EMBL, PROSITE, genomes; shared (NFS)
• User persistent data: pdisk (iSCSI)
• Upload your data: scp, http/S3
• Get your results: scp, http/S3
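
As an illustration of the scp route for uploading data and retrieving results, the sketch below only builds the command lines and does not run them. The user name, host name and paths are hypothetical placeholders, not values from the presentation.

```python
import shlex

def scp_upload_command(local_path, user, host, remote_dir):
    """Build (but do not run) an scp command copying a local file to a VM."""
    return ["scp", local_path, f"{user}@{host}:{remote_dir}/"]

def scp_download_command(user, host, remote_path, local_dir):
    """Build (but do not run) an scp command fetching a result file from a VM."""
    return ["scp", f"{user}@{host}:{remote_path}", local_dir]

# Hypothetical VM address and paths, for illustration only.
upload = scp_upload_command("reads.fastq", "alice", "vm.example.org", "/data")
download = scp_download_command("alice", "vm.example.org", "/data/results.tar.gz", ".")

print(shlex.join(upload))
print(shlex.join(download))
```

Passing such a list to `subprocess.run(upload, check=True)` would perform the actual copy once a VM is running and reachable over ssh.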

Page 18: IDB-Cloud Providing Bioinformatics Services on Cloud

Biological examples

Page 19: IDB-Cloud Providing Bioinformatics Services on Cloud

Common bioinformatics node
• ‘Biocompute’ appliance
• Use your own instance(s)
• With pre-installed standard bioinformatics tools
  • BLAST, FastA, SSearch, HMM, ...
  • ClustalW2, Clustal-Omega, Muscle, ...
  • Bowtie(2), BWA, samtools, ...
  • MEME, R, etc.
• Connected to public reference data
  • Uniprot, EMBL, genomes, PDB, etc.
  • Automatically shared to the VMs
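
A sketch of how a user might query the shared reference data from such an instance. It assumes that the installed BLAST is NCBI BLAST+ (so that `blastp` is available) and that the reference databases are mounted under a hypothetical `/shared/databases` path; neither detail is given in the slides. The command is only built here, not executed.

```python
def blastp_command(query, db, out, evalue=1e-5, threads=4):
    """Build a BLAST+ blastp command line with tabular output (-outfmt 6)."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-outfmt", "6",
    ]

# Hypothetical mount point of the reference data inside the VM.
cmd = blastp_command("my_protein.fasta",
                     "/shared/databases/uniprot/uniprot_sprot",
                     "my_protein.blastp.tsv")
print(" ".join(cmd))
```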

Page 20: IDB-Cloud Providing Bioinformatics Services on Cloud

Structural Biology
• TOwards StruCtural AssignmeNt Improvement
  • To improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with the ARIA software
  • Large computational needs.
  • An NMR laboratory will not necessarily invest in building a cluster of about 100 nodes just to run such NMR structure calculations.
  • The flexibility of the cloud in deploying the different required bioinformatics tools can accelerate such a procedure.
  • Commercial interest in providing such tools to structural biologists on a “pay as you go” basis.
  • Endorsers: Institut Pasteur Paris and CNRS IBCP

Page 21: IDB-Cloud Providing Bioinformatics Services on Cloud

IaaS deployment of ARIA

[Figure: virtual cluster. A master & storage VM runs ARIA and launches jobs over ssh on worker VMs running CNS (label: 20-100); structure preparation is repeated (8x); intermediate results are read from and written to shared storage (shared FS) before the final results. Input data: 10s of MB. Results: GB.]

A significant increase in the number of calculated protein conformations improves the statistics on the NMR conformations and can help to overcome the ambiguity bottleneck.
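
The master/worker layout on this slide (a master VM that launches CNS jobs over ssh on worker VMs sharing a file system) can be illustrated with a simple round-robin job plan. This is a sketch under stated assumptions, not ARIA's own job manager: the worker names, the job count and the `run_structure_job.sh` script are hypothetical.

```python
from collections import defaultdict

def assign_jobs(n_jobs, workers):
    """Round-robin assignment of job ids 0..n_jobs-1 to worker VMs."""
    plan = defaultdict(list)
    for job_id in range(n_jobs):
        plan[workers[job_id % len(workers)]].append(job_id)
    return dict(plan)

def ssh_command(worker, job_id, shared_dir="/shared/aria_run"):
    """Build (but do not run) the ssh command starting one job on a worker.

    run_structure_job.sh is a hypothetical wrapper script; results are
    written to the shared file system so the master can collect them.
    """
    return ["ssh", worker, f"run_structure_job.sh {job_id} {shared_dir}"]

workers = [f"worker{i:02d}" for i in range(1, 21)]  # 20 hypothetical worker VMs
plan = assign_jobs(100, workers)

print(len(plan), len(plan["worker01"]))  # 20 5
print(ssh_command("worker01", plan["worker01"][0]))
```

Because every job writes to the shared directory, adding worker VMs only changes the `workers` list, which is the elasticity the cloud deployment relies on.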

Page 22: IDB-Cloud Providing Bioinformatics Services on Cloud

Galaxy portal for NGS analyses
• Analyse NGS data
  • The Galaxy portal is widely used in the community
  • Connected to large public data: sequences and indexes
  • Large user data (GBs)
  • Preserve workflows and results (persistent storage)

Page 23: IDB-Cloud Providing Bioinformatics Services on Cloud

Proteomics desktop
• Motivation
  • Collaboration with a mass spectrometry platform
  • Running out of space on their local resources
• Protein identification
  • Experimental mass spectrometry data
  • Reference databases: nr, Swiss-Prot
  • Reference screening tools: OMSSA, X!Tandem
• User interface
  • Remote display
    • NX
  • Reference GUIs
    • SearchGUI
    • PeptideShaker

[Figure source: PeptideShaker site]

Page 24: IDB-Cloud Providing Bioinformatics Services on Cloud

Conclusion
• Provide turnkey bioinformatics appliances
  • Standard tools and pipelines
  • Interoperability: ready to run on cloud
  • Easier to transfer appliances than data (GB vs TB)
• Provide a cloud infrastructure tightly connected to existing bioinformatics infrastructure
  • IDB’s public bioinformatics cloud
  • Linked to public biological databases
  • In collaboration with the French Bioinformatics Institute
• Ease the usage by scientists
  • Usual bioinformatics gateways
  • Persistent and large ubiquitous storage
  • Web interface for cloud management
  • Access on a registration basis and standard use
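
The "GB vs TB" point in this conclusion can be made concrete with a back-of-the-envelope transfer-time estimate. The sizes and the link speed below are assumed values chosen for illustration (a 5 GB appliance, a 2 TB dataset, an ideal 1 Gbit/s link with no protocol overhead); they are not figures from the presentation.

```python
def transfer_hours(size_gb, link_gbit_per_s):
    """Ideal transfer time in hours for size_gb gigabytes (decimal units)."""
    seconds = size_gb * 8 / link_gbit_per_s  # 1 GB = 8 Gbit
    return seconds / 3600

appliance_gb = 5    # assumed size of a "few GB" appliance
dataset_gb = 2000   # assumed 2 TB dataset
link = 1.0          # assumed 1 Gbit/s link

print(round(transfer_hours(appliance_gb, link) * 60, 1))  # 0.7 (minutes)
print(round(transfer_hours(dataset_gb, link), 1))         # 4.4 (hours)
```

Under these assumptions, moving the appliance to the data takes under a minute, while moving the data to the tools takes several hours.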

Page 25: IDB-Cloud Providing Bioinformatics Services on Cloud

Perspectives
• Define good practices to provide the academic community and industry with bioinformatics services!
• French Bioinformatics Institute - IFB
  • Goals are to provide core bioinformatics resources to the national and international life science research community in key fields such as genomics, proteomics, systems biology, etc.
  • Aims at building a national academic cloud devoted to bioinformatics, inspired by the model evaluated through IDB’s cloud.
• European ELIXIR infrastructure
  • To build a sustainable European infrastructure for biological information, supporting life science research and its translation
  • IFB will be the French representative in ELIXIR.

[Figure: flowchart linking Scientists, Engineers and the Bioinformatics Center. Labels: "French biologists have access to regional resources (RENABI)"; "toolX?"; "Available? Yes/No"; "Appliances catalog"; "Appliances: create new, register"; "Cloud: bioinformatics or public cloud; regional, national or a federation".]

Page 26: IDB-Cloud Providing Bioinformatics Services on Cloud

• Acknowledgment

• IDB members: Clément Gauthey, Simon Malesys

• StratusLab members

• co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).

Questions ?

http://idee-b.ibcp.fr