R in Bioinformatics: parallelization and web
Context
Parallelizing code
Web applications
Large data sets and parallelization
R, C, and compression on the fly
Conclusions et al.
What we are doing now
R, parallelization, massive data, and web applications: examples of the use of R in bioinformatics
Ramón Díaz-Uriarte
Dept. Bioquímica, Universidad Autónoma de Madrid
Madrid, [email protected]
http://ligarto.org/rdiaz
Facultad de Informática, Universidad Complutense de Madrid
9 May 2012
License and copyright
This work is Copyright ©, 2012, Ramón Díaz-Uriarte, and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
Please respect the copyright. This material is provided freely, and if you use it, I only ask that you use it according to the (very permissive) terms of the license: attribution, non-commercial use, and a share-alike license. If you have any doubts, ask me.
Outline
Context
Parallelizing code
Web applications
Large data sets and parallelization
R, C, and compression on the fly
Conclusions et al.
What we are doing now
Context: biological context, computational context
Chromosomes
From Wikipedia; original source: http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/karyotype.shtml
DNA → protein
(From O. Rueda’s PhD Thesis)
DNA, genes and probes (spots)
(From O. Rueda’s PhD Thesis)
[Figure: a nucleotide sequence; a gene with exons 1 to 3 and introns 1 and 2; probes 1 to 3; probe selection for the microarray.]
Two-color arrays
(From O. Rueda’s PhD Thesis)
[Figure: DNA samples from a tumor sample and a control sample, red and green fluorescent dyes, hybridization to the microarray chip (one spot per probe), and optical scanning.]
Data from a microarray experiment
Slide from Gema Moreno Bueno, Department of Biochemistry, UAM
More microarray data
Modified from http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/scaled_color_key.png
aCGH
[Figures: log2(ratio) plotted along the chromosome; from Olshen, 2005; Barrett et al., 2004; Hupe & Barillot, 2005.]
Arrays: each dot is a DNA fragment, each array is one sample, and each array covers all chromosomes. (For analysis, the location in the chromosome matters.)
Calling gains and losses: hypothesis testing.
Inferring the number of copy gains/losses: estimation.
Data, data, data (in Gigabytes)
Expression arrays (mRNA): > 40,000 probes.
Copy number with aCGH: > 400,000 common; some > 4 × 10^6.
...
aCGH: example of data
(From O. Rueda’s PhD Thesis)
[Figure: example aCGH data, with probes and genes marked.]
Computational issues et al.
We want to analyze, reanalyze, and combine.
Do it in a reasonably short time.
“Wet lab researchers” need user-friendly access to methods that are both statistically rigorous and computationally efficient.
BioConductor paper: second most accessed paper in Genome Biology; yearly “Web server issue” of Nucleic Acids Research.
Multicores and computing clusters
Increases in CPU speed have slowed down (< 20% per year since 2002).
Increase in the number of “cores”: 2, 4, 8. Next 10 years?
Inexpensive computing clusters with off-the-shelf components.
We must design our programs for this from the start: parallel programming.
Image from http://faq.distributed.net/
Parallelizing code
Standalone
[Diagram] Statistical computing in bioinformatics:
- Develop statistical methods
- Implement existing approaches
- Implement for statisticians and bioinformaticians
- Implement for wet lab users
- Parallel computing; fault tolerance
- Web apps: user friendly, no installation, statistical rigour, best practices
- Increased speed (40x - 60x)
R code
Code is available for many procedures (but a few years ago none of it was parallelized!).
Many computations are embarrassingly parallelizable:
- bootstrapping and cross-validation
- arrays (or samples)
- arrays by chromosomes
- parallel chains in MCMC
Figure production can be parallelized.
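The bootstrap case in the list above can be sketched in a few lines. This is only an illustration in Python (the language used for the control layer of the web applications later in the talk), not the R code behind the packages; `bootstrap_mean`, `parallel_bootstrap`, and the toy data are hypothetical names and values. Each replicate gets its own seed, so replicates share no state and can run in any order on any worker.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def bootstrap_mean(data, seed):
    """One bootstrap replicate: resample with replacement, return the mean."""
    rng = random.Random(seed)  # private generator, so replicates share no state
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def parallel_bootstrap(data, n_replicates, workers=4):
    """Run independent replicates on a pool of workers."""
    seeds = range(n_replicates)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in seed order, whichever worker finishes first
        return list(pool.map(bootstrap_mean, repeat(data), seeds))


data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3]  # toy values, invented for illustration
means = parallel_bootstrap(data, n_replicates=200)
print(len(means), "bootstrap means computed")
```

Threads are used so the sketch runs anywhere. For CPU-bound work in CPython, `ProcessPoolExecutor` offers the same `map` interface and would normally be the better choice.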
Parallelizing R code
(Implement missing functionality: R/C.)
MPI: R packages Rmpi, papply, snow, snowfall.
Load balanced.
Wrappers over “mid level” functions in package: ease updating.
Parallelize:
- bootstrap samples / cross-validation runs
- arrays
- arrays by chromosomes
- (or a combination of both)
- figures
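One way to picture "arrays by chromosomes" with load balancing is to build one task per (array, chromosome) pair and let idle workers pull the next task from a shared queue. The Python sketch below makes that assumption; it is not how Rmpi, papply, snow, or snowfall are implemented. `segment`, `run_tasks`, and the probe counts are hypothetical, and submitting the largest tasks first is a common scheduling heuristic, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def segment(array_id, chromosome, n_probes):
    """Placeholder for a per-array, per-chromosome analysis."""
    return array_id, chromosome, n_probes


def run_tasks(array_ids, probes_per_chromosome, workers=4):
    """One task per (array, chromosome) pair, largest tasks submitted first."""
    tasks = [(a, c, n)
             for a, (c, n) in product(array_ids, probes_per_chromosome.items())]
    tasks.sort(key=lambda t: t[2], reverse=True)  # big chromosomes first
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # workers take the next task as soon as they are free
        futures = [pool.submit(segment, *t) for t in tasks]
        return [f.result() for f in futures]


results = run_tasks(["sample1", "sample2"], {"chr1": 3000, "chr2": 2500, "chr21": 400})
print(len(results), "tasks finished")
```

Splitting by chromosome as well as by array gives more, smaller tasks, which tends to keep all workers busy when the number of arrays is small.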
Is it worth it?
Are speed improvements really worth the effort?
Over what range of problems do we see improvements?
With what hardware can we see improvements?
What do we gain?
[Figure: user wall time (seconds; axis from 20 to 10,000) against number of arrays (samples; 10 to 150), for 20,000 genes; sequential code versus parallelized code (legend values 10, 30, 60); panels: HMM, GLAD, CBS, BioHMM.]
Are speed improvements really worth the effort? Your effort: “R CMD INSTALL ADaCGH2”.
Over what range of problems do we see improvements? 10 to 10^3 arrays/samples; 10^4 to 10^6 spots/genes.
With what hardware can we see improvements? 2 cores to 120 cores.
Smaller clusters: more cost effective.
Single node/multi-core: less communication overhead.
Where is this running?
varSelRF (CRAN)
ADaCGH2 (BioConductor)
SignS (launchpad: http://launchpad.net/signs)
Web applications: how
Applications for wet lab researchers
Analyze data in a reasonably short time.
User-friendly access to methods that are statistically rigorous.
Web-based applications
User-friendly interface.
No hardware/software hassles for end users.
Parallelization is transparent.
Method selection can be partially transferred (to us).
Short user wall time: use (hardware/software) resources rarely available to individual biomedical researchers.
Just type in a URL: http://www.some-application
Image modified from http://faq.distributed.net/
Sometimes collaborations feel like . . .
(From http://www.bitacoradegalileo.com/2010/11/16/giordano-bruno-en-la-cara-oculta-de-la-luna/)
Parallelization in web applications
[Diagram: the statistical computing in bioinformatics overview shown on the “Standalone” slide.]
Main web-based applications
Dealing with raw data:
- Remove artifacts from microarrays: DNMAD
- Missing data, replicate spots: preP
Statistical analysis (sensu stricto):
- Differentially expressed genes: Pomelo_II
- Select genes for classification: Tnasas, GeneSrF
- Molecular signatures, survival data: SignS
- Segment aCGH: WaviCGH, ADaCGH
Annotation and interpretation:
- Interpret results: IDClight, PaLS
The applications
What do we gain?
[Figure: user wall time (seconds; axis from 250 to 4000) against number of simultaneous users (1 to 20), for 15,000 genes and 40 arrays; panels: CBS, CGHseg, GLAD, HMM.]
How it works: some key ideas
Each run:
- Parallelization (transparent for users)
- Fault tolerance (network problems, machine crashes, bugs)
- Check-pointing
Periodic tasks (keep the system running 24 h, 365 d):
- Automatic monitoring
- Automated testing suite
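Check-pointing can be sketched as follows: save what has been finished after each unit of work, and on restart skip whatever is already saved. This is a minimal Python illustration under assumptions of my own (a JSON file as the checkpoint, string identifiers for the units of work); the function names are hypothetical and the slides do not describe the actual checkpoint format.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the results saved so far, or an empty dict on a first run."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {}


def save_checkpoint(path, results):
    """Write to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(results, fh)
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def run_with_checkpoints(units, work, path):
    """Process each unit once, skipping those already in the checkpoint."""
    results = load_checkpoint(path)
    for unit in units:
        if unit in results:
            continue  # finished before the last crash or restart
        results[unit] = work(unit)
        save_checkpoint(path, results)
    return results


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "run.json")
    run_with_checkpoints(["chr1", "chr2"], len, ckpt)
    # a second run (for example after a crash) only computes the new unit
    print(run_with_checkpoints(["chr1", "chr2", "chr3"], len, ckpt))
```

Saving after every unit trades some I/O for the guarantee that a restart repeats at most one unit of work.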
What happens
[Flow diagram]
User → Head node (LVS): sends the request to one of the servers.
CGI: data checking, file upload.
Execution: Python program
- Setting up LAM/MPI
- Starting R
- Fault tolerance
- Checking termination of R
- Checking run errors
- Formatting output
R program: sequential code, parallelized code.
Autorefreshing HTML until final results.
What happens: details
[Flowchart]
User → Head node (LVS) → Apache on one of Server 1, Server 2, Server 3, ..., Server n → CGI.
CGI: read data, create the MPI universe, launch R and Rmpi (one master, several slaves), monitor R execution, maintain R process counters.
Rmpi started OK? If not after K attempts: stop execution, halt the MPI universe, return an error.
If yes: continue R execution till the end.
Is R done? No: return the autorefreshing page. Yes: halt the MPI universe, produce and return the results pages.
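The "retry, then give up after K attempts" branch of this flowchart can be sketched with a small supervisor loop. This Python sketch is an assumption-laden stand-in, not the application's controller: `run_with_retries` is a hypothetical name, the child process is a trivial Python command standing in for R, and the real program also creates and halts the MPI universe.

```python
import subprocess
import sys
import time


def run_with_retries(command, max_attempts=3, poll_seconds=0.1):
    """Launch `command`, poll until it exits, and relaunch on failure.

    Returns (status, attempts): status is "done" on exit code 0,
    or "error" once max_attempts launches have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        process = subprocess.Popen(command)
        while process.poll() is None:
            # the real application would serve an autorefreshing page meanwhile
            time.sleep(poll_seconds)
        if process.returncode == 0:
            return "done", attempt
    return "error", max_attempts


# Stand-in for "launch R": a child Python process that exits cleanly.
status, attempts = run_with_retries([sys.executable, "-c", "pass"])
print(status, attempts)
```

Polling with a sleep keeps the supervisor cheap and gives one place to add the other checks listed in the flowchart.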
MPI details
[Flowchart]
Can we run? (Count other LAM daemons.) No: sleep and check again. Yes: boot a (new) LAM/MPI universe.
Start R: continue from the last checkpoint. The MPI universe spans servers 1 ... n, with NFS shared storage and NFS shared temporary storage. Segmentation and figures are parallelized over subjects and chromosomes.
Sleep, then check: Run out of time? Are we done? R crashed (bugs)? Rmpi crashed? LAM/MPI/nodes crashed?
If done: halt the MPI universe, produce and return the results pages.
After a crash: verify the servers (modify the LAM definitions).
Where is this running?
http://signs.bioinfo.cnio.es
http://wavi.bioinfo.cnio.es
http://genesrf.bioinfo.cnio.es
Large data sets and parallelization
Large data sets
Millions of spots.
Hundreds or thousands of subjects.
No need to hold everything in RAM at once.
Package ff: “memory-efficient storage of large data on disk and fast access functions.”
Combined with:
- parallelization
- shared storage
ff and parallelization
ff stores the object on disk.
Read that object from various R processes.
Different R processes can write to different ff objects.
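The access pattern (one common on-disk object read by many workers, each worker writing its own output object) can be mimicked with plain binary files. The Python sketch below is only an analogy: it does not use or reproduce the ff package, threads stand in for separate R processes, and all function names, file names, and the squaring "computation" are hypothetical.

```python
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor


def write_doubles(path, values):
    with open(path, "wb") as fh:
        array("d", values).tofile(fh)


def read_doubles(path, start, count):
    """Read `count` doubles from element `start`, without loading the whole file."""
    block = array("d")
    with open(path, "rb") as fh:
        fh.seek(start * block.itemsize)
        block.fromfile(fh, count)
    return block


def worker(common_path, out_path, start, count):
    """Read one slice of the common file (read only); write a private output file."""
    chunk = read_doubles(common_path, start, count)
    write_doubles(out_path, [x * x for x in chunk])  # placeholder computation
    return out_path


def process_in_chunks(common_path, n_values, chunk_size, out_dir):
    jobs = []
    with ThreadPoolExecutor() as pool:
        for i, start in enumerate(range(0, n_values, chunk_size)):
            count = min(chunk_size, n_values - start)
            out_path = os.path.join(out_dir, f"part{i}.bin")
            jobs.append((pool.submit(worker, common_path, out_path, start, count), count))
    result = array("d")  # the master gathers the per-worker files, in order
    for future, count in jobs:
        result.extend(read_doubles(future.result(), 0, count))
    return result


with tempfile.TemporaryDirectory() as d:
    common = os.path.join(d, "common.bin")
    write_doubles(common, [1.0, 2.0, 3.0, 4.0, 5.0])
    print(list(process_in_chunks(common, 5, chunk_size=2, out_dir=d)))
    # prints [1.0, 4.0, 9.0, 16.0, 25.0]
```

Because no two workers write to the same file, no locking is needed, and only one slice per worker is ever held in memory.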
ff and parallelization (I)
[Diagram: R processes R1, R2, ..., Rn read (read only) from a common ff object, and each writes to its own ff object (ff1, ff2, ..., ffn); the diagram also shows Rmaster and ffall.]
ff and parallelization (II)
[Diagram, built up over several overlays: Rmaster writes the data to ffin; R processes R1, R2, ..., Rn (multicore) read it (read only) and write to ff1, ff2, ..., ffn; R1, R2, ..., Rn are then given indices i1, i2, ..., in; a further stage of R1, R2, ..., Rn uses ff1, ff2, ..., ffn to produce Fig.1, Fig.2, ..., Fig.n and ffout (results).]
Where is this running?
ADaCGH2 (BioConductor package).
Web-based application: http://wavi.bioinfo.cnio.es.
R, C, and compression on the fly
Store and access (large) pre-computed results
HMM for aCGH data with Reversible Jump: Viterbi.
Common regions: “count” on the Viterbi paths.
Fitting HMM/common regions: distinct operations.
C: number-crunching.
R: wrapper and figures/tables.
C: creates large amounts of data.
In package RJaCGH (CRAN).
Fit HMM: R calls C (HMM); C stores the Viterbi paths as a gzipped file and returns the filenames to R.
Find common regions: R passes the filenames to C (common regions); C reads the Viterbi data and returns the results to R.
R: figures, tables.
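The two-step pattern (store large intermediate results compressed, hand back only a filename, then stream the file in a later step) can be sketched with Python's standard `gzip` module. In RJaCGH this work is done in C; the sketch below only illustrates the idea, and the function names, the one-path-per-line file layout, and the toy state paths are all assumptions.

```python
import gzip
import os
import tempfile
from collections import Counter


def store_paths(paths, filename):
    """Write one state path per line, compressing while writing."""
    with gzip.open(filename, "wt") as fh:
        for path in paths:
            fh.write(" ".join(str(state) for state in path) + "\n")
    return filename  # only the filename travels back to the caller


def count_states(filename, position):
    """Stream the file back, decompressing on the fly; count states at one probe."""
    counts = Counter()
    with gzip.open(filename, "rt") as fh:
        for line in fh:
            counts[int(line.split()[position])] += 1
    return counts


with tempfile.TemporaryDirectory() as d:
    # three toy state paths over four probes (-1 loss, 0 no change, 1 gain)
    name = store_paths([[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 1, -1]],
                       os.path.join(d, "viterbi.gz"))
    print(count_states(name, 1))  # Counter({1: 2, 0: 1})
```

Reading line by line means the second step never holds all paths in memory, which is the point of keeping the fitting and counting operations separate.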
Conclusions et al.
Web-based: A few things we’ve learned
Configuration sucks (if you need to modify > 1 file).
Too many languages.
Adding test cases to the testing suites: web, R.
Documentation: in the code, web pages, LaTeX ...
Too much R code to catch errors.
User interfaces: who designs them?
Too many languages
Impedance mismatch problem: “Building Web-based applications requires the mastering of a number of languages/technologies (e.g. HTML, CSS, CGI, ASP, PHP, XML, etc..). Such languages and technologies were created to address different aspects on a by-need evolutionary manner. The result is a plethora of tools that are fitted together in an ad hoc fashion.” El-Ansary, Grolaux, Van Roy, Rafea (2005) “Overcoming the Multiplicity of Languages and Technologies for Web-Based Development Using a Multi-paradigm Approach”.
R and C.
HTML and Python: CGI, data entry, display.
Python (and others): control and monitor MPI.
Javascript: AJAX and figures.
Fault tolerance and communication
Manual check for errors (R ain’t Erlang).
Too much network traffic.
[Flowchart: the LAM/MPI boot, checkpoint, and crash-check loop from the “MPI details” slide, with “R crashed” annotated as coding errors and “LAM/MPI crashed” as including node crashes.]
Solutions?
Literate programming and org-mode.
Alternatives to MPI and/or use Erlang ...
Keep things as they are (only a few painful events a year).
Rethinking web-based applications
Users can get into trouble.
Sure, but we can do a good job ...
- Provide state-of-the-art statistical and computational approaches (R).
- Pedagogical examples and pipelines.
- Minimize the chance of users getting into trouble.
Web-based applications are here to stay.
... so ...
Forget about them: just write your R/C/whatever code.
Go for it:
- We can use R + HPC.
- But other tools and work are necessary.
Regardless of web-based applications . . .
Parallel computing can be used routinely:
- (library(parallel) in R ≥ 2.14.0)
Large data sets with ff + parallelization.
So far . . .
Most of what I mentioned refers to “traditional cluster setups”:
- Several nodes (e.g., > 10).
- A few CPUs/cores per node.
- Not too much RAM per node.
We’ve been using it for about 10 years.
But things change ...
New hardware
Only a few nodes (2 in our case).
Many cores.
Lots of RAM available for a single process.
More reliable?
Image from http://blogs.amd.com/work/files/2011/02/Dell61453.jpg
Changes
Little need for control and monitoring software?
Reconfiguration of MPI definition files.
Load balancing of web servers.
CHANGES
Changes in application software:
Rethink how we use MPI.
Start using OpenMP in C code.
- Need to be careful when called from R.
- Random number generation.
Use mclapply (forking) within R.
Rethink usage of ff: we can keep the whole object in RAM.
- Do not use the disk at all.
- Eliminate code.
Rethink I/O and storage.
Combine MPI/Rmpi with OpenMP and mclapply (forking).
Rethink usage of R (Julia? Python?)
Commercials (great deals)
I’d be glad to talk to anybody who wants to play with, and help configure, our machines and code.
Acknowledgements
O. M. Rueda, A. Alibés, A. Cañada, E. R. Morrissey, M. L. Neves, D. Rico.
Funding: Fundación de Investigación Médica Mutua Madrileña, Project TIC2003-09331-C02-02 of the Spanish MEC and BIO2009-12458 of the Spanish MICINN. Ramón y Cajal Programme of the Spanish Ministry of Education and Science.
CNIO (Spanish National Cancer Research Center).
The R users and developers for a vibrant statistical computing community and amazing platform.
Victoria López.