Cloudgene - an execution platform for MapReduce programs in public and private clouds

Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner

University of Innsbruck, Austria; Medical University Innsbruck, Austria

BOSC 2012


Presentation at BOSC2012 by L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds


MapReduce

Serial approach vs. parallel approach

Cluster or cloud (private / public)

MapReduce: Simplified Data Processing on Large Clusters; Dean & Ghemawat, 2004
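
As a minimal illustration of the parallel approach, the MapReduce model can be sketched in plain Python: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function aggregates each group. The sketch runs in a single process and uses none of Hadoop's APIs; all function names are chosen for illustration only.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Each key is reduced independently, which is what a cluster parallelizes.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "b c"], map_words, reduce_counts))
```

On a real cluster the map and reduce calls are distributed over many nodes; the grouping step in between is what the framework handles.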

How to support scientists when using (our) MapReduce programs?

Simplify the execution of MapReduce programs including data management

Simplify access to a working MapReduce cluster

Maintain data sensitivity

Page 3: L Forer - Cloudgene: an execution platform for MapReduce programs in public and private clouds

MapReduce in Genetics

CloudBurst

highly sensitive read mapping with MapReduce; Schatz, 2009

Crossbow

Searching for SNPs with cloud computing; Langmead et al., 2009

Myrna

Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al., 2010

Seal

a Distributed Short Read Mapping and Duplicate Removal Tool; Pireddu et al., 2012

Hadoop-BAM

directly manipulating next generation sequencing data in the cloud; Niemenmaa et al., 2012

CloudBioLinux

CloudBioLinux: pre-configured and on-demand bioinformatics computing for the genomics community; Krampis et al., 2012


Difficulties with MapReduce


Required steps when the cluster is up and running and Hadoop is installed

Additional steps when setting up a cluster in a public environment
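
The individual steps shown on this slide are not part of the transcript. As a rough sketch under that caveat, a typical manual run on an existing cluster involves copying the input into HDFS, launching the job and fetching the results. The helper name, paths and jar arguments below are hypothetical; only the standard `hadoop fs -put`, `hadoop jar` and `hadoop fs -get` commands are assumed, and the sketch prints them instead of executing them.

```python
import shlex


def manual_hadoop_steps(local_input, hdfs_input, jar, jar_args, hdfs_output, local_output):
    """Return the commands of a typical manual Hadoop run (nothing is executed)."""
    return [
        ["hadoop", "fs", "-put", local_input, hdfs_input],    # upload input to HDFS
        ["hadoop", "jar", jar, *jar_args],                    # launch the MapReduce job
        ["hadoop", "fs", "-get", hdfs_output, local_output],  # download the results
    ]


if __name__ == "__main__":
    steps = manual_hadoop_steps(
        "reads.fastq", "input/reads.fastq",       # hypothetical paths
        "myjob.jar", ["input", "output"],          # hypothetical jar and arguments
        "output", "results",
    )
    for command in steps:
        print(" ".join(shlex.quote(part) for part in command))
```

Every one of these steps needs shell access to the cluster, which is the kind of burden a graphical front end is meant to remove.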


Approaches

Possible approaches

Program-specific approach

Implement a GUI for every program

Redundant work for the developer

Heterogeneity

Workflow systems: Galaxy, Taverna, Mobyle

Possible, but no HDFS support; black box

Our approach for Hadoop MapReduce

One GUI for different programs

Feedback, standardized import/export

Integration of programs via a plugin interface


Cloudgene

What is Cloudgene?

Open-source platform to improve the usability of Hadoop MapReduce jobs

Provides a graphical web interface for their execution

Programs can be integrated by writing a simple configuration file

Public cloud & private cloud

Sets up a cluster in the cloud and installs all data on it

History of executed jobs with defined input/output parameters

Runs in your browser

CloudBurst, Crossbow, Seal, CloudBioLinux, Myrna


Features

Programs are easy to integrate:

standard MapReduce programs (Java -> CloudBurst)

streaming jobs (e.g. Mapper and Reducer using Perl -> Myrna)

command line programs (e.g. using Pydoop -> Seal)

Data can be imported from different sources: S3 / HTTP / FTP

Import of huge datasets

Export results to S3 (public cloud)

Connect different MapReduce programs to a pipeline

Install additional programs via a web repository
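
To make the "streaming jobs" feature above concrete, here is a minimal sketch of a mapper and reducer in the Hadoop Streaming style, where both sides exchange tab-separated `key<TAB>value` lines and the reducer receives its input sorted by key. The slide mentions Perl; Python is used here only for illustration, and the task (counting lines by length) is an assumed example. In a real streaming job the two functions would live in separate scripts that read from standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'length<TAB>1' for every non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield f"{len(line)}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; expects input sorted by key, as a streaming reducer does."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    # Local simulation of the pipeline: map | sort | reduce.
    sample = ["ACGT", "AC", "TTGA"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

The `sorted` call stands in for the shuffle phase that the framework performs between the two scripts.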


Cloudgene can be used on private and public clusters

Private cloud: sensitive data, local data

Public cloud: data on S3, no in-house cluster available

Open source


Cloudgene in Action

How to integrate a new program in Cloudgene

1. Implement the program (or use an existing one)

2. Write a plugin configuration file


Step 1 - Implement a program, executable via the command line

e.g. FASTQ pre-processing with MapReduce

base quality / sequence quality / duplication levels / length distribution

hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput
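
One of the statistics listed above, base quality, fits the MapReduce pattern naturally: the map step emits a score for each read position and the reduce step averages the scores per position. The following is a single-process sketch, not the code inside exomePreprocessing.jar. It assumes 4-line FASTQ records and a Phred+33 quality offset; all function names are illustrative.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (sequence, quality_string) pairs from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def map_base_quality(quality, offset=33):
    """Map step: emit (position, phred_score) for every base of one read."""
    for position, char in enumerate(quality):
        yield position, ord(char) - offset


def mean_quality_per_position(fastq_lines, offset=33):
    """Group scores by read position and reduce each group to its mean."""
    groups = defaultdict(list)
    for _, quality in parse_fastq(fastq_lines):
        for position, score in map_base_quality(quality, offset):
            groups[position].append(score)
    return {pos: sum(scores) / len(scores) for pos, scores in sorted(groups.items())}


if __name__ == "__main__":
    reads = ["@r1", "ACG", "+", "III", "@r2", "AC", "+", "5I"]
    print(mean_quality_per_position(reads))
```

The length distribution can be computed the same way by emitting `(len(sequence), 1)` in the map step and summing in the reduce step.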


Step 2 - Write configuration file including 3 parts

Part 1 – General information:


Part 2 – Public cloud information:


Part 3 – MapReduce information:
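
The configuration file shown on these slides is not part of the transcript. Purely as an illustration of the three-part idea, the sketch below models a plugin description as a Python dictionary and derives the Step 1 command line from it. Every key, value and function name is hypothetical and does not reflect Cloudgene's actual file format.

```python
import shlex
from string import Template

# Hypothetical plugin description with the three parts named on the slides.
# All keys and values are illustrative, not Cloudgene's real schema.
PLUGIN = {
    "general": {"name": "Exome Preprocessing", "version": "0.1"},
    "cloud": {"provider": "aws-ec2", "instance_type": "m1.large"},
    "mapred": {
        "jar": "exomePreprocessing.jar",
        "params": "-input $input -step baseJob -encoding 0 -output $output",
    },
}

REQUIRED_PARTS = ("general", "cloud", "mapred")


def validate(plugin):
    """Raise ValueError if one of the three parts is missing."""
    missing = [part for part in REQUIRED_PARTS if part not in plugin]
    if missing:
        raise ValueError("missing parts: " + ", ".join(missing))


def build_command(plugin, **values):
    """Fill the parameter template and return the hadoop command as a list."""
    validate(plugin)
    mapred = plugin["mapred"]
    params = Template(mapred["params"]).substitute(values)
    return ["hadoop", "jar", mapred["jar"], *shlex.split(params)]


if __name__ == "__main__":
    print(" ".join(build_command(PLUGIN, input="exomeData", output="resultsOutput")))
```

Keeping the command template in a declarative description is what lets one generic GUI render input and output fields for many different programs.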


Different application – different GUI


Technologies

Apache Hadoop: http://hadoop.apache.org

Apache Whirr: http://whirr.apache.org

Restlet: http://www.restlet.org

ExtJS: http://www.sencha.com

H2: http://www.h2database.com


Evaluation

Amazon Elastic MapReduce (EMR)

Graphical execution for MapReduce programs

Excellent solution for public clouds

Combination with S3

but:

Data sensitivity

Reproducibility

Additional costs

Chart: runtime in seconds (axis from 0 to 4000 sec) for Cloudgene vs. Amazon EMR, broken down into Setup, Import, Calculation and Export


Integrated programs

CloudBurst

Exome Preprocessing

Wordcount, Grep, etc.

Finding SNPs

in house


Acknowledgements


Project website: http://cloudgene.uibk.ac.at

Source code: http://github.com/genepi

Lukas Forer, Sebastian Schönherr, Hansi Weissensteiner

Anita Kloss-Brandstätter, Florian Kronenberg, Günther Specht

Thanks to the Open Source Community