Sparkling Water on the Spark Notebook: Interactive Genomes clustering
Why you must care, by Data Fellas
Xavier [email protected]
@xtordoir
● Apache Spark
● Interactivity: Spark notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling water
● Show time
● Streamlining dev/deployment
Lineup
Can’t wait!
Data Fellas
Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Trainer Spark, Machine Learning
Distributed computing framework
Large Scale Data Processing engine
I play BIG!
What is Apache Spark?
● SQL & Dataframes
● Streaming
● Graph Processing
● Machine Learning
With all colors!
What is Apache Spark?
● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model
Checking in cache If I remember...
What is Apache Spark?
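A minimal sketch of the "lazy until asked, then cached" idea behind the memory and execution points above. This is a toy stand-in in plain Python, not the Spark API: `LazyDataset`, `map`, `cache`, `collect` and the `evaluations` counter are hypothetical names used only for illustration.

```python
# Toy stand-in for a lazily evaluated, cacheable dataset (NOT the Spark API).
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._cached = None
        self._use_cache = False
        self.evaluations = 0         # how many times the data was computed

    def map(self, fn):
        # Transformations only record what to do; nothing runs yet.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Ask for the result to be kept in memory once it has been computed.
        self._use_cache = True
        return self

    def collect(self):
        # Actions trigger the computation, reusing the cache when enabled.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


numbers = LazyDataset(lambda: list(range(5)))
squares = numbers.map(lambda x: x * x).cache()
first = squares.collect()    # computes the whole chain
second = squares.collect()   # served from the cache, no recomputation
```

The second `collect()` does not touch `numbers` again, which is the point of caching a dataset that is reused interactively.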
● Interactive
● @ any scale
http://spark-notebook.io
Laurel? Hardy? Anyone?
What is Apache Spark?
● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● easy install, no deps.
● multiple sparkContext out of the box.
What is Apache Spark?
http://bdgenomics.org/
ADAM Project (UC Berkeley):
● Data format (schema, compact, distributed): avro + parquet
● API (Reads, Variants, Genotypes, …)
I, ADAM
Genomics with Spark?
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics: The data
Please, don’t mind the colors...
Genomics: The data
So… that’s what separates us huh?
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics: The data
Woooow, really, you must be kidding me… ahahahahah
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
Genomics: The data
Oh… damned… hum huh
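To make "genotypes per sample (features)" concrete, here is a small sketch that turns genotype calls into a samples x variants matrix of alternate-allele counts. It assumes VCF-style call strings such as `0|1`; the sample names and calls are made-up toy data, and `encode_genotype` is a hypothetical helper, not part of ADAM.

```python
# Encode diploid genotype calls as alternate-allele counts (0, 1 or 2).
def encode_genotype(call):
    """'0|1' or '0/1' -> number of non-reference alleles; None if missing."""
    alleles = call.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call, e.g. './.'
    return sum(1 for a in alleles if a != "0")


# Hypothetical toy data: sample id -> one genotype call per variant.
calls = {
    "sampleA": ["0|0", "0|1", "1|1"],
    "sampleB": ["1|1", "0/0", "./."],
}

# One row per sample, one column per variant (the "features").
matrix = {s: [encode_genotype(c) for c in gts] for s, gts in calls.items()}
```

With ~1000 samples and ~30M genotypes each, the real matrix is very wide and short, which is why the rows are few for machine learning while the columns call for distributed storage.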
Population stratification
w/ Deep Learning? H2O
From the spark notebook? Sparkling water
Genomics: The problem
Here I need some water.
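As a rough illustration of what clustering samples by genotype means, the sketch below runs plain k-means on tiny genotype vectors. It is not the deep learning approach named on the slide and not H2O code; the data, the initial centroids and the function names are assumptions made for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    # Column-wise mean of a non-empty list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean_vector(members)
    return assignment, centroids


# Hypothetical toy data: 4 samples x 4 variants, alternate-allele counts.
samples = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
]
labels, centers = kmeans(samples, initial_centroids=[samples[0], samples[2]])
```

Samples with similar genotype profiles end up with the same label; on real data each label would be a candidate population group.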
In-memory implementation of “Map-Reduce”
Highly optimised structures for the JVM
blazing fast convergent models
H2O
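The map-reduce pattern over data held in memory can be sketched in a few lines. This is plain Python over hypothetical "partitions", only meant to show the shape of the computation; it is not H2O's API.

```python
from functools import reduce

# Hypothetical in-memory partitions of genotype counts (0, 1 or 2 per call).
partitions = [[0, 1, 2, 1], [2, 2, 0], [1, 0]]


def map_partition(chunk):
    # Map step: each partition computes a partial (sum, count) independently.
    return (sum(chunk), len(chunk))


def combine(left, right):
    # Reduce step: merge two partial results; associative, so grouping does not matter.
    return (left[0] + right[0], left[1] + right[1])


total, count = reduce(combine, map(map_partition, partitions))
mean_count = total / count
```

Because `combine` is associative, the partial results can be merged in any grouping, which is what lets the map step run on separate nodes.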
Higher API
H2O
Sparkling: in-memory data exchange
I remember things better with two copies in memory.
http://h2o.ai/product/sparkling-water/
Showtime!
press play...
There’s a notebook for that
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Shar3 (Data Fellas)
[Figure: role labels attached to the steps above, in extraction order: ops; data; ops data; sci; sci ops; sci; ops data; web ops data; web ops data sci]
Shar3 (Data Fellas)
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project Generator
Micro Service / Binary format
Schema for output
Metadata
Spark and the Notebook are interactive and leverage distributed computing infrastructure
ADAM is an optimized storage format for massive genomic data
Spark provides tools to manipulate data and works w/ other libraries like H2O
Data scientists and application developers can work together
Summary
Wake up, we’re back!
Acknowledgements
Frank Nothaft
Matt Massie
Neil Fergusson
Vinod & Michal
Thank you for your attention!
Questions?
And now let’s talk.