Sparkling Water on the Spark Notebook: Interactive Genomes Clustering
Why you must care, by Data Fellas
Xavier Tordoir, @xtordoir

H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir



Page 2: Lineup

● Apache Spark
● Interactivity: Spark Notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling Water
● Show time
● Streamlining dev/deployment

Can’t wait!

Page 3: Data Fellas

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), Trainer Spark, Machine Learning

Page 4: What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

I play BIG!

Page 5: What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● SQL & Dataframes
● Streaming
● Graph Processing
● Machine Learning

With all colors!

Page 6: What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model

Checking in cache If I remember...

Page 7: What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● Interactive
● @ any scale

http://spark-notebook.io

Laurel? Hardy? Anyone?

Page 8: What is Apache Spark?

● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● easy install, no deps.
● multiple sparkContext out of the box.

Page 9: Genomics with Spark?

http://bdgenomics.org/

ADAM Project (UC Berkeley):

● Data format (schema, compact, distributed): avro + parquet
● API (Reads, Variants, Genotypes, …)

I, ADAM

Page 10: Genomics - The data

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Please, don’t mind the colors...

Page 11: Genomics - The data

So… that’s what separates us huh?

Page 12: Genomics - The data

1000 genomes: http://www.1000genomes.org/

~1000 samples

Few samples => Machine Learning

Woooow, really, you must be kidding me… ahahahahah

Page 13: Genomics - The data

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing

Oh… damned… hum huh
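The "lots of data" claim can be checked with a back-of-the-envelope sketch. The sample and genotype counts are the slide's round numbers; the one-byte-per-genotype encoding is an assumption made only for this illustration and is not stated in the talk.

```python
# Rough size of the 1000 Genomes genotype matrix, using the slide's round numbers.
SAMPLES = 1_000                    # ~1000 samples (from the slide)
GENOTYPES_PER_SAMPLE = 30_000_000  # ~30M genotypes per sample (from the slide)
BYTES_PER_GENOTYPE = 1             # assumption: one byte per genotype call


def matrix_cells(samples: int, features: int) -> int:
    """Number of (sample, genotype) cells in a dense matrix."""
    return samples * features


def dense_size_gb(samples: int, features: int, bytes_per_cell: int) -> float:
    """Size in gigabytes (10**9 bytes) of a dense encoding."""
    return matrix_cells(samples, features) * bytes_per_cell / 1e9


cells = matrix_cells(SAMPLES, GENOTYPES_PER_SAMPLE)
size_gb = dense_size_gb(SAMPLES, GENOTYPES_PER_SAMPLE, BYTES_PER_GENOTYPE)
print(f"{cells:,} cells, about {size_gb:.0f} GB dense")
```

Under that assumption the matrix has far more features than samples, and it is large enough that holding it on a single machine becomes uncomfortable, which is the slide's argument for distributed computing.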

Page 14: Genomics - The problem

Population stratification

w/ Deeplearning? H2O

From the spark notebook? Sparkling water

Here I need some water.
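To make "population stratification" concrete, here is a toy sketch that groups samples by their genotype vectors. It deliberately swaps in plain k-means in pure Python; it is not the H2O deep learning model the talk runs through Sparkling Water. The six samples, the 0/1/2 alternate-allele encoding, and the choice of starting centroids are all made up for illustration.

```python
# Toy population stratification: cluster samples by genotype vectors.
# Assumption: each genotype is encoded as the alternate-allele count (0, 1 or 2).
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the point (first one wins ties)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points: List[Vector], centroids: List[Vector],
           iterations: int = 10) -> List[int]:
    """Plain k-means from the given starting centroids; returns cluster labels."""
    current = [list(c) for c in centroids]
    labels = [nearest(p, current) for p in points]
    for _ in range(iterations):
        for k in range(len(current)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                current[k] = mean(members)
        new_labels = [nearest(p, current) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels


# Hypothetical data: six samples, four genotypes each.
samples = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
]
labels = kmeans(samples, centroids=[samples[0], samples[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With ~30M features per sample, both the distance computations and the centroid updates would have to be distributed, which is where Spark and H2O come in.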

Page 15: H2O

Memory implementation of “Map-Reduce”

Highly optimised structures for the JVM

blazing fast convergent models

Higher API
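The map-reduce pattern the slide refers to can be sketched over in-memory partitions as follows. This is only an illustration of the pattern in plain Python, not H2O's API or its data structures; the partition contents and function names are hypothetical.

```python
# Map-reduce over in-memory partitions: count genotype values.
from collections import Counter
from functools import reduce


def map_partition(genotypes):
    """Map step: count genotype values within one in-memory partition."""
    return Counter(genotypes)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Hypothetical partitions of genotype values (0, 1 or 2).
partitions = [[0, 1, 2, 0], [0, 0, 1], [2, 2, 1, 0]]

partial = [map_partition(p) for p in partitions]    # one result per partition
totals = reduce(merge_counts, partial, Counter())   # fold partial results together
print(dict(totals))
```

Each partition is processed independently, so the map step can run in parallel, and only the small partial counts need to be exchanged in the reduce step.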

Page 16: H2O - Sparkling: in-memory data exchange

I remember things better with two copies in memory.

http://h2o.ai/product/sparkling-water/

Page 17: Showtime!

press play...

There’s a notebook for that

Page 18: Shar3 (Data Fellas)

“Create” Cluster: ops
Find sources (context, quality, semantic, …): data
Connect to sources (structure, schema/types, …): ops, data
Create distributed data pipeline/Model: sci
Tune accuracy: sci, ops
Tune performances: sci
Write results to Sinks: ops, data
Access Layer: web, ops, data
User Access: web, ops, data, sci

Page 19: Shar3 (Data Fellas)

Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project Generator
Micro Service / Binary format
Schema for output
Metadata

Page 20: Summary

Spark and the Spark Notebook are interactive and leverage distributed computing infrastructure

ADAM is an optimized storage format for massive genomic data

Spark provides tools to manipulate data and works w/ other libraries like H2O

Data scientists and application developers can work together

Wake up, we’re back!

Page 21: Acknowledgements

Frank Nothaft
Matt Massie

Neil Fergusson

Vinod & Michal

Thank you for your attention!

Questions?

And now let’s talk.