R Analytics in the Cloud
Radek Maciaszek
DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.
MSc in Bioinformatics at Birkbeck, University of London.
Project at the UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.
Introduction
Primer in Bioinformatics
Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.)
Ageing strategy: solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).
Goal: find the genes responsible for ageing.
Model organism: Caenorhabditis elegans
Genes are encoded by the DNA.
[Figure: microarray (100 x 100)]
[Figure: central dogma of molecular biology]
• Database of 50 curated experiments.
• 10k genes compared to each other.
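To give a feel for why comparing 10k genes to each other is a computation problem rather than a data problem, the sketch below works out the number of comparisons. It assumes one expression profile per gene with one value per experiment (50 values), which is an interpretation of the slide rather than something it states; the variable names are illustrative.

```r
# Size of the all-against-all comparison, using the figures from the slide.
n.genes <- 10000        # "10k genes"
n.experiments <- 50     # "50 curated experiments" (assumed: one value per experiment)

# Every gene against every gene, including itself and both orderings: 10^8
ordered.pairs <- n.genes^2

# Each unordered pair of distinct genes counted once
unique.pairs <- choose(n.genes, 2)

# Approximate size of the input matrix in MB if stored as doubles (8 bytes each)
matrix.mb <- n.genes * n.experiments * 8 / 1024^2

print(format(ordered.pairs, scientific = FALSE, big.mark = ","))
print(format(unique.pairs, scientific = FALSE, big.mark = ","))
print(round(matrix.mb, 1))
```

Under these assumptions the input is only a few megabytes, while the number of correlation tests runs into the tens of millions.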
Why R?
• Very popular in bioinformatics
• Functional, scripting programming language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)
R limitations & Hadoop
• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/
Segue
• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.
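As a rough picture of what a distributed lapply() has to do, the sketch below splits a list into chunks, applies the function to each chunk, and reassembles the results in the original order. It runs entirely on the local machine and is not Segue's implementation; chunked.lapply and n.chunks are hypothetical names used only for illustration.

```r
# Conceptual local stand-in for a distributed lapply(); NOT Segue's code.
chunked.lapply <- function(X, FUN, n.chunks = 2) {
  # Assign each element to a chunk, as a scheduler might assign work to nodes
  chunk.id <- rep(seq_len(n.chunks), length.out = length(X))
  chunks <- split(seq_along(X), chunk.id)

  # Each chunk could run on a different machine; here they run in sequence
  partial <- lapply(chunks, function(idx) lapply(X[idx], FUN))

  # Put the partial results back into the original order
  result <- vector("list", length(X))
  for (k in seq_along(chunks)) {
    result[chunks[[k]]] <- partial[[k]]
  }
  names(result) <- names(X)
  result
}

m <- list(a = 1:10, b = exp(-3:3), c = c(2, 4, 6))
same.as.lapply <- identical(chunked.lapply(m, mean), lapply(m, mean))
print(same.as.lapply)
```

The point of the sketch is that the caller sees the same list-in, list-out contract as lapply(), whatever happens in between.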
Segue workflow (emrlapply)
[Diagram components: R, S3, Elastic MapReduce, Amazon AWS, List (local), List (remote)]
R very quick example
> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)
Moving to the cloud in 3 lines of code!
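The function on this slide depends on experiments.matrix and probes, which are defined elsewhere in the talk. To try the same computation locally before moving it to a cluster, a minimal sketch with simulated data might look like the following. The 6 x 50 matrix of random numbers, the probe names, and the use of vapply() in place of the growing vector are all assumptions made for illustration, not the author's data or code.

```r
# Simulated stand-in data: 6 probes x 50 experiments (assumed layout, not the real data)
set.seed(42)
experiments.matrix <- matrix(rnorm(6 * 50), nrow = 6,
                             dimnames = list(paste0("probe", 1:6), NULL))
probes <- as.list(rownames(experiments.matrix))

# Same computation as on the slide; vapply() returns the p-values directly
# instead of growing a vector inside a for loop
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    B.vector <- experiments.matrix[probe.name, ]
    cor.test(A.vector, B.vector)$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)

# One vector of p-values per probe, one p-value per comparison partner
print(length(pearson.cor))
print(length(pearson.cor[[1]]))
```

Running on a handful of probes first is a cheap way to check that the function returns what is expected before paying for cluster time.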
RNA Probes
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5,
                             masterBidPrice="0.68", slaveBidPrice="0.68",
                             masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge",
                             copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)
RNA Probes
Discovering genes
Topomaps of clustered genes
This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001).
Conclusions
• R is great for statistics.
• It’s easy to scale up R using Segue.
• We are all going to live very long.
Thanks!
Questions?
References:
http://code.google.com/r/radek-segue/
http://www.dataminelab.com