Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook by @DataFellas, Oct • 29th • 2015

Share and analyze genomic data at scale by Andy Petrella and Xavier Tordoir

Page 1

Share and analyse genomic data at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct • 29th • 2015

Page 2

Outline
• Sharp intro to Genomics data
• What are the Challenges
• Distributed Machine Learning to the rescue
• Projects: Distributed teams
• Research: Long process
• Towards Maximum Share for efficiency

Page 3

Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), trainer Spark, Machine Learning

“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)

Page 4

Analyse Genomic At Scale
Spark, Adam, Spark Notebook

• Sharp intro to Genomics data
• What are the Challenges
• Distributed Machine Learning to the rescue

Page 5

What is genomics data?
DNA?

What makes us what we are…

… a complex biochemical soup.

With applications to medical diagnostics, drug response, disease mechanisms

Page 6

On the production side

Fast biotech progress…

… can IT keep up?

Page 8

On the production side

Sequence {A, T, G, C}

3 billion characters (bases)

… x 30 (x 60)

Massively parallel
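As a rough back-of-the-envelope sketch of the numbers above: it assumes the x 30 and x 60 factors denote sequencing coverage, and that each sequenced base costs about two bytes uncompressed (one for the base call, one for its quality score). Both are assumptions for illustration, not figures from the slides.

```python
# Back-of-the-envelope size of the raw sequencing data for one genome.
GENOME_BASES = 3_000_000_000   # 3 billion bases, from the slide
BYTES_PER_BASE = 2             # assumption: 1 byte base call + 1 byte quality


def raw_size_gb(coverage: int) -> float:
    """Approximate uncompressed size in gigabytes (10**9 bytes)."""
    return GENOME_BASES * coverage * BYTES_PER_BASE / 1e9


for coverage in (30, 60):
    print(f"{coverage}x coverage: ~{raw_size_gb(coverage):.0f} GB")
```

Under these assumptions a single genome already weighs in at hundreds of gigabytes before any analysis starts.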

Page 9

Lots of data?

Page 10

Lots of data?

Tens of millions

Page 11

Lots of data!

Tens of millions

1,000s, 1,000,000s...

Page 12

ADAM: Spark genomics library

http://www.bdgenomics.org

Page 16

ADAM: Spark genomics library

Avro schema

Parquet storage

Genomics API
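To make the schema and columnar storage idea concrete, here is a minimal sketch in plain Python. The `Genotype` schema below is hypothetical and heavily simplified (it is not ADAM's actual schema), and `conforms` and `to_columns` are illustrative helpers, not part of any library. It only shows the principle: records are checked against a declared schema, then laid out one column per field, which is the layout a columnar store keeps on disk.

```python
import json

# Hypothetical, simplified Avro-style record schema (NOT ADAM's real one).
GENOTYPE_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Genotype",
  "fields": [
    {"name": "sample_id", "type": "string"},
    {"name": "contig",    "type": "string"},
    {"name": "position",  "type": "long"},
    {"name": "alt_count", "type": "int"}
  ]
}
""")

# Mapping from the schema's primitive type names to Python types.
PY_TYPES = {"string": str, "long": int, "int": int}


def conforms(record: dict, schema: dict) -> bool:
    """True if the record has exactly the schema's fields, with matching types."""
    fields = schema["fields"]
    return (set(record) == {f["name"] for f in fields}
            and all(isinstance(record[f["name"]], PY_TYPES[f["type"]])
                    for f in fields))


def to_columns(records: list, schema: dict) -> dict:
    """Turn row records into one list per field (a columnar layout)."""
    return {f["name"]: [r[f["name"]] for r in records]
            for f in schema["fields"]}


rows = [
    {"sample_id": "S1", "contig": "1", "position": 12345, "alt_count": 1},
    {"sample_id": "S2", "contig": "1", "position": 12345, "alt_count": 0},
]
assert all(conforms(r, GENOTYPE_SCHEMA) for r in rows)
columns = to_columns(rows, GENOTYPE_SCHEMA)
print(columns["alt_count"])
```

A query that only needs `alt_count` can then read that single column and skip the rest, which is the main appeal of columnar storage for wide genomic records.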

Page 17

So what do we do with this?
Study variations between populations

Descriptive statistics

Machine Learning (Population stratification or Supervised learning)

… and share and replay!
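One of the simplest descriptive statistics for studying variation between populations is the alternate-allele frequency per population at a given site. The sketch below uses made-up genotype calls and hypothetical population labels (`POP_A`, `POP_B`), and runs locally in plain Python. On a cluster the same computation would be a group-by aggregation over the distributed data.

```python
from collections import defaultdict

# Toy genotype calls at one variant site: (population, alternate-allele count).
# Each diploid sample carries 0, 1 or 2 alternate alleles. Values are made up.
genotypes = [
    ("POP_A", 0), ("POP_A", 1), ("POP_A", 2), ("POP_A", 1),
    ("POP_B", 0), ("POP_B", 0), ("POP_B", 1), ("POP_B", 0),
]


def alt_allele_frequency(calls):
    """Alternate-allele frequency per population for diploid samples."""
    alt = defaultdict(int)
    total = defaultdict(int)
    for population, alt_count in calls:
        alt[population] += alt_count
        total[population] += 2  # two alleles per diploid sample
    return {population: alt[population] / total[population]
            for population in total}


frequencies = alt_allele_frequency(genotypes)
print(frequencies)
```

Computed across many sites, these per-population frequencies become the input features for population stratification or supervised learning.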

Page 18

The Spark Notebook
… comes to the rescue.

Spark: easy APIs
Self described and consistent
Easily shared (code)

http://www.spark-notebook.io

Page 19

The Spark Notebook

Page 22

So what do we do with this?

… and share and replay!

Code can be shared easily but we want better...

How do we share data produced by the notebook?

How do we publish the notebook as a service?

Page 23

Share Genomic At Scale
Spark, Tachyon, Mesos, Shar3

• Projects: Distributed teams
• Research: Long process
• Towards Maximum Share for efficiency

Page 24

Projects

Intrinsically involving many teams

geographically distributed in different countries or laboratories

with different skills in Biology, Genetics, I.T., Medicine (legal, ...)

Page 25

Projects

Require many types of data, ranging from
• bio samples
• imagery
• textual
• archives/historical

Page 26

Projects
Of course

Generally gather many people from several populations

Note: This is very expensive and burns time as hell!

Page 27

Projects
1,000 genomes (2008-2012): 200 TB

100,000 genomes (2013-2017): 20 PB (probably more)

1,000,000 genomes (2016-2020): 0.2 EB (probably more)

eQTL: mixing many sources
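A quick arithmetic check of the project sizes listed above, taking them as decimal tera-, peta- and exabytes (an assumption about the units). The dictionary below simply restates the slide's figures.

```python
# Project sizes from the slide, converted to bytes with decimal (SI) prefixes.
TB, PB, EB = 10**12, 10**15, 10**18

projects = {
    "1,000 genomes": (1_000, 200 * TB),
    "100,000 genomes": (100_000, 20 * PB),
    "1,000,000 genomes": (1_000_000, 2 * EB // 10),  # 0.2 EB
}

# Storage per genome, in gigabytes (10**9 bytes).
per_genome_gb = {name: size // genomes // 10**9
                 for name, (genomes, size) in projects.items()}

for name, gigabytes in per_genome_gb.items():
    print(f"{name}: ~{gigabytes} GB per genome")
```

All three projects come out at the same per-genome footprint, so the total grows linearly with the number of genomes sequenced.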

Page 28

Projects
Need proper data management between entities, yet coping with:
• amount of data
• heterogeneity of people
• distance between actors
• constraints related to data location

Page 29

Projects
Distributed friendly

SCHEMAS + BINARY

e.g. Avro
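The "schemas + binary" idea can be sketched with Python's standard `struct` module: a shared schema tells every party how to pack and unpack a compact binary payload, so no field names travel with the data. The schema and field names below are hypothetical, and this fixed-width layout is only an illustration of the principle. It is not Avro's actual wire encoding.

```python
import struct

# Hypothetical fixed-width schema: (field name, struct format code).
# q = 8-byte signed integer, B = 1-byte unsigned integer, f = 4-byte float.
SCHEMA = [("position", "q"), ("alt_count", "B"), ("quality", "f")]

# "<" selects little-endian byte order with no padding between fields.
FORMAT = "<" + "".join(code for _, code in SCHEMA)


def encode(record: dict) -> bytes:
    """Pack a record into bytes, in the field order the schema declares."""
    return struct.pack(FORMAT, *(record[name] for name, _ in SCHEMA))


def decode(payload: bytes) -> dict:
    """Unpack bytes back into a record using the same schema."""
    values = struct.unpack(FORMAT, payload)
    return dict(zip((name for name, _ in SCHEMA), values))


record = {"position": 12345, "alt_count": 1, "quality": 30.0}
payload = encode(record)
print(len(payload), decode(payload))
```

Because both sides only need the schema to interpret the bytes, teams in different laboratories can exchange compact payloads without agreeing on a programming language or a text format.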

Page 30

Research
Research in medicine or health in general is

LOOOOOOO…OOOOONG

Page 31

Research
Most reasons are quite obvious but must not be overlooked

Lots of measures and validation
Lots of control (including by Gov.)

Lots of actors

Page 32

Research
As a matter of fact, research needs to be conducted on data and to produce results.

But both are highly exposed to reuse, so what if we lose either of them?

Page 33

Research
However, we can get into trouble instantly without even losing them.

What if we don’t track the processes to go from one to the other?

In any scientific process, confrontation, replay and enhancement are key to moving forward.

Page 34

Research

It is misleading to think that sharing the code is enough.

Remember: we are looking for data and results, not for code.

The process includes the code, the context, the sources and so on, and all of it should be part of the data discovery/validation task.
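One way to picture tracking the whole process rather than just the code is a small provenance record stored next to every result. This is a minimal sketch under stated assumptions: the `provenance` helper, its field names, the source path and the context values are all hypothetical, chosen only to show code, context and sources travelling together with the result.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """SHA-256 hex digest, used as a stable identifier for a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def provenance(code: str, sources: list, context: dict, result: dict) -> dict:
    """Bundle what is needed to replay and check how a result was produced."""
    return {
        "code_sha256": fingerprint(code),
        "sources": sorted(sources),
        "context": context,
        "result_sha256": fingerprint(json.dumps(result, sort_keys=True)),
    }


entry = provenance(
    code="frequencies = alt / total",
    sources=["hdfs://example/genotypes.adam"],   # hypothetical location
    context={"engine": "spark", "seed": 42},     # illustrative values
    result={"POP_A": 0.5},
)
print(json.dumps(entry, indent=2))
```

Replaying an old process on new data then means changing `sources` while keeping the same `code_sha256`, and the record makes that difference explicit.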

Page 35

Research

Assess the risk factor associated with a disease given mutations of a certain gene.

More than 50 years of data collecting and modelling.

Hundreds of researchers, each generation with new ideas.

Replaying old processes on new data, new processes on old data

Page 36

Share share share

All these facts relate to our capacity to share our work and to collaborate.

We need to share efficiently and accurately
• data
• process
• results

Page 37

Share share share

The challenge resides in the workflow

Page 38

Share share share

• “Create” Cluster
• Find sources (context, quality, semantic, …)
• Connect to sources (structure, schema/types, …)
• Create distributed data pipeline/Model
• Tune accuracy
• Tune performance
• Write results to Sinks
• Access Layer
• User Access

Roles tagged on the steps, in order: ops; data; ops, data; sci; sci, ops; sci; ops, data; web, ops, data; web, ops, data, sci

Page 40

Share share share

Streamlining development lifecycle for better Productivity with Shar3

Page 41

Share share share

• Analysis
• Production
• Distribution
• Rendering
• Discovery
• Catalog
• Project Generator
• Micro Service / Binary format
• Schema for output
• Metadata

Page 42

That’s all folks
Thanks for listening/staying

Poke us on Twitter or via http://data-fellas.guru
@DataFellas
@Shar3_Fellas
@SparkNotebook

@Xtordoir & @Noootsab

Check also @TypeSafe: http://t.co/o1Bt6dQtgH