R Analytics in the Cloud

Introduction

Radek Maciaszek

DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.

MSc in Bioinformatics at Birkbeck, University of London.

Project at UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.

Primer in Bioinformatics

Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.)

Ageing research strategy: solve the problem in a simple organism and apply the findings to more complex organisms (e.g. humans).

Goal: find genes responsible for ageing

Caenorhabditis elegans

Genes are encoded by the DNA.

Microarray (100 x 100)

Central dogma of molecular biology

• Database of 50 curated experiments.
• 10k genes compared with each other.
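To get a feel for the size of this task, the number of gene-to-gene comparisons can be worked out directly. This is a back-of-the-envelope sketch that assumes "10k" means exactly 10,000 genes; the variable names are illustrative only.

```r
# Assumption: "10k genes" is taken as exactly 10,000.
n.genes <- 10000

# Unordered pairs of distinct genes.
unique.pairs <- choose(n.genes, 2)

# Comparisons made when every gene is tested against every gene,
# including itself and both orderings (a plain nested loop).
all.pairs <- n.genes^2

format(unique.pairs, big.mark = ",", scientific = FALSE)  # "49,995,000"
format(all.pairs, big.mark = ",", scientific = FALSE)     # "100,000,000"
```

Either way the work runs to tens of millions of correlation tests, and each test is independent of the others.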

Why R?

• Very popular in bioinformatics
• Functional, scripting programming language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)

R limitations & Hadoop

• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/
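Of the options above, Hadoop Streaming is the most generic: a mapper is any program that reads lines from standard input and writes tab-separated key/value lines to standard output. The following is a minimal sketch of such a mapper in R. The input format (a probe id followed by comma-separated values) and the function names `map.line` and `run.mapper` are assumptions made for illustration, not part of the talk.

```r
# Hypothetical mapper for Hadoop Streaming.
# Assumption: each input line looks like "probe_id,v1,v2,...".
map.line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  values <- as.numeric(fields[-1])
  # Emit "key<TAB>value": the probe id and the mean of its values.
  paste(fields[1], mean(values), sep = "\t")
}

run.mapper <- function(input, output = stdout()) {
  for (line in readLines(input)) {
    writeLines(map.line(line), output)
  }
}

# Local try-out with an in-memory connection standing in for stdin.
demo <- textConnection(c("probeA,1,2,3", "probeB,10,20"))
run.mapper(demo)
close(demo)
```

When run as a real streaming mapper, the script would call `run.mapper(file("stdin"))` instead of using the in-memory connection.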

Segue

• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.
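The idea behind a distributed lapply() can be illustrated locally: split the list into chunks, apply the function to each chunk as a worker would, and reassemble the results in the original order. The sketch below is not Segue's implementation; `chunked.lapply` and its `n.workers` argument are hypothetical names, and the "workers" are simply iterations of a loop.

```r
# Hypothetical helper: NOT Segue's code, only the general idea.
chunked.lapply <- function(X, FUN, n.workers = 2) {
  # Deal element positions out to workers round-robin.
  worker.id <- rep_len(seq_len(n.workers), length(X))
  chunks <- split(seq_along(X), worker.id)

  result <- vector("list", length(X))
  names(result) <- names(X)
  for (idx in chunks) {
    # On a real cluster this call would run on a remote node.
    result[idx] <- lapply(X[idx], FUN)
  }
  result
}

m <- list(a = 1:10, b = exp(-3:3), c = c(2, 4))
identical(chunked.lapply(m, mean), lapply(m, mean))
```

Because each element is processed independently, the chunked result matches plain lapply(); this independence is what makes the computation easy to spread over many machines.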

Segue workflow (emrlapply)

[Diagram: Segue workflow. Labels: R, List (local), S3, Elastic MapReduce, Amazon AWS, List (remote).]

A very quick R example

> m <- list(a = 1:10, b = exp(-3:3))

> lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

Segue – large scale example

> AnalysePearsonCorelation <- function(probe) {
    # Expression values of the selected probe across all experiments
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    # Correlate the selected probe against every probe in the matrix
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }

> pearson.cor <- lapply(probes, AnalysePearsonCorelation)

Moving to the cloud in 3 lines of code!
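Before starting a cluster, the local version of the analysis can be tried on a small matrix. The sketch below is a variant of the function above, run on synthetic random data (6 probes by 50 experiments, made up for illustration; the real dataset is not reproduced here). It uses vapply() instead of growing a vector inside a loop, so unlike the original it returns a vector named by probe.

```r
# Synthetic stand-in for the real data: 6 probes x 50 experiments.
set.seed(1)
experiments.matrix <- matrix(rnorm(6 * 50), nrow = 6,
                             dimnames = list(paste0("probe", 1:6), NULL))
probes <- rownames(experiments.matrix)

AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  # vapply preallocates the result: one p-value per probe in the matrix.
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)
length(pearson.cor)       # one list entry per probe
length(pearson.cor[[1]])  # one p-value per probe compared against
```

Each call depends only on its own probe and the shared matrix, so the calls can be handed to separate workers without coordination.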

RNA Probes

Segue – large scale example

> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }

> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)

> myCluster <- createCluster(numInstances=5, masterBidPrice="0.68", slaveBidPrice="0.68", masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge", copy.image=TRUE)

> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)

> stopCluster(myCluster)

RNA Probes
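A running cluster stays up until it is stopped, so it is worth making sure the stop call runs even when the job fails. One possible pattern is sketched below; `RunOnCluster` is a hypothetical helper, not part of Segue, and the create and shutdown steps are passed in as functions so the pattern can be tried with stand-ins and no AWS account.

```r
# Hypothetical helper: guarantees the shutdown step runs, even on error.
RunOnCluster <- function(create, shutdown, work) {
  cluster <- create()
  # on.exit fires when the function exits, normally or through an error.
  on.exit(shutdown(cluster), add = TRUE)
  work(cluster)
}

# Try-out with stand-in functions that only record what happened.
events <- character(0)
result <- tryCatch(
  RunOnCluster(
    create   = function() { events <<- c(events, "created"); "fake-cluster" },
    shutdown = function(cl) events <<- c(events, "stopped"),
    work     = function(cl) stop("job failed")
  ),
  error = function(e) conditionMessage(e)
)
events  # "created" "stopped"

# With Segue (not run here), the same helper would be called roughly as:
# RunOnCluster(
#   create   = function() createCluster(numInstances = 5),
#   shutdown = stopCluster,
#   work     = function(cl) emrlapply(cl, probes, AnalysePearsonCorelation)
# )
```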

Discovering genes

Topomaps of clustered genes

This work was based on an approach similar to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)

Conclusions

• R is great for statistics.
• It’s easy to scale up R using Segue.
• We are all going to live very long.

Thanks!

Questions?

References:
http://code.google.com/r/radek-segue/
http://www.dataminelab.com
