H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir

Sparkling Water on the Spark Notebook: Interactive Genomes Clustering

Why you must care, by Data Fellas

Xavier Tordoir

@xtordoir

● Apache Spark
● Interactivity: Spark notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling water
● Show time
● Streamlining dev/deployment

Lineup

Can’t wait!

Data Fellas

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), trainer Spark, Machine Learning

Distributed computing framework

Large Scale Data Processing engine

I play BIG!

What is Apache Spark?


Large Scale Data Processing engine

● SQL & Dataframes
● Streaming
● Graph Processing
● Machine Learning

With all colors!




● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model

Checking in cache If I remember...




● Interactive
● @ any scale

http://spark-notebook.io

Laurel? Hardy? Anyone?




● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● easy install, no deps.
● multiple sparkContext out of the box.


http://bdgenomics.org/

ADAM Project (UC Berkeley):

● Data format (schema, compact, distributed): avro + parquet

● API (Reads, Variants, Genotypes, …)

I, ADAM

Genomics with Spark?



1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Genomics: The data

Please, don’t mind the colors...

Genomics: The data

So… that’s what separates us huh?


~1000 samples

Few samples => Machine Learning

Genomics: The data

Woooow, really, you must be kidding me… ahahahahah


~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing

Genomics: The data

Oh… damned… hum huh
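To make the "lots of data" point concrete, here is a back-of-the-envelope sketch of how large a dense samples × genotypes matrix would be. The sample and genotype counts are the approximate figures from the slide; the bytes-per-value widths are assumptions chosen for illustration, not a description of how ADAM or Spark actually store genotypes.

```python
# Rough size of the genotype matrix described on the slide.
# Counts come from the slide; the bytes-per-value widths are assumptions.

N_SAMPLES = 1_000                      # ~1000 samples
N_GENOTYPES_PER_SAMPLE = 30_000_000    # ~30M genotypes (features) per sample


def matrix_size_gb(bytes_per_value: int) -> float:
    """Size of a dense samples x genotypes matrix, in gigabytes (10^9 bytes)."""
    total_values = N_SAMPLES * N_GENOTYPES_PER_SAMPLE
    return total_values * bytes_per_value / 1e9


if __name__ == "__main__":
    # Assumed encodings: a compact 1-byte code versus an 8-byte double.
    for label, width in [("1-byte code", 1), ("8-byte double", 8)]:
        print(f"{label}: {matrix_size_gb(width):.0f} GB")
```

Even with a compact one-byte encoding, the matrix is in the tens of gigabytes, which is the motivation for distributing it across a cluster.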

Population stratification

w/ Deep Learning? H2O

From the spark notebook? Sparkling water

Genomics: The problem

Here I need some water.
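As an illustration of what grouping samples by population can look like, here is a toy sketch. It substitutes plain k-means for the H2O Deep Learning model named on the slide, so it is not the talk's method. The encoding (count of alternate alleles per site: 0, 1, or 2), the six made-up samples, and the choice of two initial centroids are all assumptions for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean_vector(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points: List[Vector], initial_centroids: List[Vector],
           max_iterations: int = 20) -> List[int]:
    """Plain k-means; returns one cluster index per input point."""
    centroids = [list(c) for c in initial_centroids]
    assignments = [nearest(p, centroids) for p in points]
    for _ in range(max_iterations):
        # Move each centroid to the mean of its current members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = mean_vector(members)
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: no sample changed cluster
        assignments = new_assignments
    return assignments


# Made-up data: rows are samples, columns are variant sites,
# values are assumed alternate-allele counts (0, 1 or 2).
samples = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
]

labels = kmeans(samples, initial_centroids=[samples[0], samples[3]])
print(labels)
```

On real data the feature count is in the millions per sample, so both the distance computations and the centroid updates would need to run on a distributed engine rather than in a single process.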

In-memory implementation of “Map-Reduce”

Highly optimised structures for the JVM

blazing fast convergent models

H2O

Higher API

H2O

Sparkling: in-memory data exchange

I remember things better with two copies in memory.

http://h2o.ai/product/sparkling-water/
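The "in-memory Map-Reduce" idea can be sketched in a few lines: data is held in memory as chunks, a map step produces a small partial result per chunk, and a reduce step combines the partials. This is a conceptual sketch only. It does not use or reproduce H2O's actual API or data structures, and the function names are hypothetical.

```python
from functools import reduce
from typing import Iterable, List, Tuple

Chunk = List[float]
Partial = Tuple[float, int]   # (sum of values, number of values)


def map_chunk(chunk: Chunk) -> Partial:
    """Map step: summarise one in-memory chunk as (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results. Associative, so order is free."""
    return (a[0] + b[0], a[1] + b[1])


def chunked_mean(chunks: Iterable[Chunk]) -> float:
    """Mean over all chunks, computed from per-chunk partials only."""
    total, count = reduce(combine, map(map_chunk, chunks), (0.0, 0))
    if count == 0:
        raise ValueError("no values to average")
    return total / count


if __name__ == "__main__":
    # Three chunks standing in for pieces of a column spread over nodes.
    print(chunked_mean([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]))
```

Because the reduce step only sees small partial results, each chunk can stay where it lives in memory; only the summaries need to move.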



Showtime!

press play...

There’s a notebook for that

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Shar3 (Data Fellas)

[Figure labels: combinations of ops, data, sci, and web]

Shar3 (Data Fellas)

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Project

Generator

Micro Service / Binary format

Schema for output

Metadata

Spark and the Notebook are interactive and leverage distributed computing infrastructure

ADAM is an optimized storage format for massive genomic data

Spark provides tools to manipulate data and works w/ other libraries like H2O

Data scientists and application developers can work together

Summary

Wake up, we’re back!

Acknowledgements

Frank Nothaft

Matt Massie

Neil Fergusson

Vinod & Michal

Thank you for your attention!

Questions?

And now let’s talk.
