Spark Summit
Share and analyse genomic data at scale
with Spark, ADAM, Tachyon & the Spark Notebook
by @DataFellas, Oct 29th 2015
Outline
• Sharp intro to genomics data
• What are the challenges
• Distributed machine learning to the rescue
• Projects: distributed teams
• Research: long process
• Towards maximum share for efficiency
Andy Petrella: Maths, Geospatial, Distributed Computing, Spark Notebook; Spark/Scala trainer; Machine Learning
Xavier Tordoir: Physics, Bioinformatics, Distributed Computing, Scala (& Perl); Spark trainer; Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
Analyse Genomics at Scale: Spark, ADAM, Spark Notebook
• Sharp intro to genomics data
• What are the challenges
• Distributed machine learning to the rescue
What is genomics data? DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response, and disease mechanisms
On the production side
Fast biotech progress…
… can IT keep up?
Sequence {A, T, G, C}
3 billion characters (bases)
… x 30 (x 60)
Massively parallel
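The figures on this slide imply serious volume per sample. A rough back-of-envelope sketch in Scala, where all figures (bytes per base, depth) are order-of-magnitude assumptions rather than exact numbers:

```scala
// Back-of-envelope: raw sequence volume for one human genome.
// All constants are rough orders of magnitude, not exact figures.
object GenomeVolume {
  // ~3 billion bases in a human genome
  val genomeBases: Long = 3000000000L
  // typical sequencing depth: each position read ~30 times (sometimes 60)
  val coverage: Int = 30

  // total bases sequenced per sample at a given depth
  def basesSequenced(depth: Int = coverage): Long = genomeBases * depth

  // rough raw size: ~2 bytes per base once per-base quality scores are
  // included (1 byte base + 1 byte quality), before any compression
  def rawBytes(depth: Int = coverage): Long = basesSequenced(depth) * 2

  def main(args: Array[String]): Unit = {
    println(s"bases sequenced at 30x: ${basesSequenced()}")
    println(f"raw size at 30x: ${rawBytes() / 1e9}%.0f GB")
    println(f"raw size at 60x: ${rawBytes(60) / 1e9}%.0f GB")
  }
}
```

At 30x this lands in the ~100 GB-per-genome range before compression, which is why single-machine tooling quickly stops being an option.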
Lots of data?
10s of millions…
Lots of data!
1,000s… 1,000,000s…
ADAM: Spark genomics library
http://www.bdgenomics.org
Avro schema
Parquet storage
Genomics API
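The idea behind this stack is that reads are schema'd records (Avro) stored columnarly (Parquet) and queried through a genomics API. A minimal plain-Scala sketch of that data model: the case class mirrors a few fields of ADAM's `AlignmentRecord`, but this is local stand-in code, not the actual ADAM/Spark API, which would load such records from Parquet for you:

```scala
// Plain-Scala sketch of ADAM's storage model: reads as schema'd records.
// Field names mirror a subset of ADAM's AlignmentRecord; in a real
// pipeline these records would be loaded from Parquet via ADAM on Spark.
case class AlignmentRecord(contigName: String, start: Long, sequence: String)

object AdamSketch {
  // toy reads; a real dataset would hold hundreds of millions of these
  val reads = Seq(
    AlignmentRecord("chr1", 100L, "ATGCGT"),
    AlignmentRecord("chr1", 103L, "CGTTTA"),
    AlignmentRecord("chr2", 500L, "GGGCCC")
  )

  // Columnar storage shines for queries touching few columns, e.g. a
  // depth-of-coverage check that only needs contig, start and length:
  def coverageAt(contig: String, pos: Long): Int =
    reads.count(r =>
      r.contigName == contig && pos >= r.start && pos < r.start + r.sequence.length)

  def main(args: Array[String]): Unit =
    println(s"coverage at chr1:104 = ${coverageAt("chr1", 104L)}")
}
```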
So what do we do with this?
Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised learning)
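A local sketch of the descriptive-statistics step: per-population alternate-allele frequencies at one variant site, the kind of summary that feeds population stratification. Genotypes are coded as alternate-allele counts (0, 1, or 2); the data and population labels are illustrative. On a cluster the `Seq` would be a Spark RDD and the same grouping logic would run distributed:

```scala
// Per-population allele frequencies at one variant site (toy data).
// Genotype coding: number of alternate alleles carried (0, 1 or 2).
object AlleleFreq {
  // (sampleId, population, altAlleleCount) -- illustrative values
  val genotypes = Seq(
    ("s1", "EUR", 0), ("s2", "EUR", 1), ("s3", "EUR", 2),
    ("s4", "AFR", 2), ("s5", "AFR", 2), ("s6", "AFR", 1)
  )

  // alt-allele frequency = sum of alt alleles / (2 * number of samples),
  // since each (diploid) sample carries two alleles
  def frequencies: Map[String, Double] =
    genotypes.groupBy(_._2).map { case (pop, gs) =>
      pop -> gs.map(_._3).sum.toDouble / (2 * gs.size)
    }

  def main(args: Array[String]): Unit =
    frequencies.foreach { case (pop, f) => println(f"$pop%s: $f%.2f") }
}
```

Sites whose frequencies differ strongly between populations are exactly what a stratification method (or a supervised model) picks up on.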
… and share and replay!
The Spark Notebook… comes to the rescue.
Spark: easy APIs
Self-described and consistent
Easily shared (code)
http://www.spark-notebook.io
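To give the flavor of those APIs, here is a k-mer count sketched on plain Scala collections; the example data and the choice of 3-mers are illustrative. On Spark, the `Seq` would be an `RDD[String]` and the same functional pipeline would run distributed (typically with `reduceByKey` instead of `groupBy`):

```scala
// Count 3-mers across a toy set of reads, Spark-API style but on
// local collections: the same flatMap/group pipeline runs on an RDD.
object KmerCount {
  val reads = Seq("ATGCAT", "GCATGC")

  def kmers(k: Int): Map[String, Int] =
    reads
      .flatMap(_.sliding(k))                        // every k-length substring
      .groupBy(identity)                            // bucket identical k-mers
      .map { case (kmer, occ) => kmer -> occ.size } // occurrence counts

  def main(args: Array[String]): Unit =
    kmers(3).toSeq.sortBy(-_._2).foreach(println)
}
```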
The Spark Notebook
So what do we do with this?
… and share and replay!
Code can be shared easily, but we want better:
How do we share data produced by the notebook?
How do we publish the notebook as a service?
Share Genomics at Scale: Spark, Tachyon, Mesos, Shar3
• Projects: distributed teams
• Research: long process
• Towards maximum share for efficiency
Projects
Intrinsically involving many teams,
geographically distributed across countries and laboratories,
with different skills in Biology, Genetics, IT, Medicine (and legal…)
Projects
Require many types of data, ranging from bio samples to imagery, text, and historical archives
Projects
Of course, they generally gather many people from several populations.
Note: this is very expensive and burns through time!
Projects
1,000 Genomes (2008-2012): 200 TB
100,000 Genomes (2013-2017): 20 PB (probably more)
1,000,000 genomes (2016-2020): 0.2 EB (probably more)
eQTL (expression quantitative trait loci): mixing many sources
Projects
Need proper data management between entities, yet coping with:
• amount of data
• heterogeneity of people
• distance between actors
• constraints related to data location
Projects
Distributed-friendly formats: schemas + binary
e.g. Avro
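A schema'd binary format like Avro keeps data self-describing as it moves between teams. A minimal sketch of what such a record schema could look like for a genotype call; the field names and namespace are illustrative, not ADAM's actual schema:

```json
{
  "type": "record",
  "name": "Genotype",
  "namespace": "org.example.genomics",
  "fields": [
    {"name": "sampleId",   "type": "string"},
    {"name": "contigName", "type": "string"},
    {"name": "position",   "type": "long"},
    {"name": "alleles",    "type": {"type": "array", "items": "string"}}
  ]
}
```

Because the schema travels with the data, a lab receiving these files can decode them without out-of-band documentation, and the binary encoding keeps them compact on the wire.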
Research
Research in medicine, or health in general, is
LOOOOOOO…OOOOONG
Research
Most reasons are quite obvious, but must not be overlooked:
• Lots of measures and validation
• Lots of controls (including by governments)
• Lots of actors
Research
As a matter of fact, research needs to be conducted on data and to produce results.
But both are highly exposed to reuse, so what if we lose either of them?
Research
However, we can get into trouble instantly without even losing them.
What if we don’t track the processes to go from one to the other?
In any scientific process: confrontation, replay and enhancement are key to move forward
It is misleading to think that sharing the code is enough.
Remember: we look for data and results, not for code.
The process includes the code, the context, the sources and so on, and all should be part of the data discovery/validation task.
Research
Assess the risk factor associated with a disease given mutations of a certain gene.
More than 50 years of data collecting and modelling.
Hundreds of researchers, each generation with new ideas.
Replaying old processes on new data, and new processes on old data.
Research
Share share share
All these facts relate to our capacity to share our work and to collaborate.
We need to share efficiently and accurately:
• data
• process
• results
Share share share
The challenge resides in the workflow
Share share share
• “Create” Cluster (ops)
• Find sources: context, quality, semantic, … (data)
• Connect to sources: structure, schema/types, … (ops, data)
• Create distributed data pipeline/Model (sci)
• Tune accuracy (sci, ops)
• Tune performances (sci)
• Write results to Sinks (ops, data)
• Access Layer (web, ops, data)
• User Access (web, ops, data, sci)
Share share share
Streamlining the development lifecycle for better productivity, with Shar3
Share share share
[Shar3 architecture diagram: Analysis, Production, Distribution/Rendering, Discovery; Catalog, Project Generator; Micro Service / binary format; schema for output; metadata]
That’s all folks!
Thanks for listening/staying.
Poke us on Twitter or via http://data-fellas.guru
@DataFellas, @Shar3_Fellas, @SparkNotebook
@Xtordoir & @Noootsab
Check also @TypeSafe: http://t.co/o1Bt6dQtgH