Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook

by Data Fellas, Spark London Meetup, July 1st ‘15


Outline

PART I
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model

PART II
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services

Andy Petrella, @noootsab

Maths, Scala, Apache Spark

Spark Notebook, Trainer, Data Banana

Xavier Tordoir, @xtordoir

Physics, Bioinformatics

Scala, Spark

PART I: Spark & Genomics

Adam: genomics on Spark

1K Genomes in Adam on S3

Explore: Compute Stats

Learn: train a model

So that’s the thing that separates us?

Adam: What is genomics data

Okay, sounds good. Give me two of them!

Genome is an important factor in health:

● Medical diagnostics
● Drug response
● Disease mechanisms
● …

Adam: What is genomics data

You mean devs are slacking off?

On the data production:

Fast biotech progress

Not so fast IT progress?

Adam: What is genomics data

No! They’re just sticky bubbles...

On the data production:

Sequence {A, T, G, C}

3 billion bases

Adam: What is genomics data

Okay, a lot of bubbles.

On the data production:

Sequence {A, T, G, C}

3 billion bases

… x 30 (x 60?)

Adam: What is genomics data

C’mon. a big mess of plenty of lil’ bubbles then.

On the data production: massively parallel

Sequence {A, T, G, C}

3 billion bases

… x 30 (x 60?)
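
To get a feel for the volume, here is a back-of-the-envelope sketch in Python. It reads the "x 30 (x 60?)" above as sequencing depth (coverage), which is an interpretation of the slide. The one-byte-per-base figure is purely an illustrative assumption: real formats also store quality scores and metadata, and are usually compressed.

```python
GENOME_LENGTH_BASES = 3_000_000_000  # "3 billion bases" from the slide


def sequenced_bases(coverage, genome_length=GENOME_LENGTH_BASES):
    """Total bases read when every position is covered `coverage` times on average."""
    return genome_length * coverage


def rough_size_gb(bases, bytes_per_base=1):
    """Very rough size in GB (10**9 bytes); bytes_per_base is an assumption."""
    return bases * bytes_per_base / 10**9


for coverage in (30, 60):
    bases = sequenced_bases(coverage)
    print(f"{coverage}x coverage: {bases:,} bases, "
          f"roughly {rough_size_gb(bases):.0f} GB per genome")
```

Even under this crude assumption, a single genome lands in the tens to hundreds of gigabytes, before multiplying by the number of individuals.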

Adam: What is genomics data

Ah, that explains why the black bars are different

Adam: What is genomics data

Dude... Tens of millions

Adam: What is genomics data

Staaaaaaph Tens of millions

1000’s, 1,000,000’s, …

Adam: What is genomics data

‘coz it makes sparkling bubbles, right?

Ok, looks like Apache Spark makes a lot of sense here …

Adam: An understandable model

Well done, a spec as text in a PDF…

Adam: An understandable model

Take that

Adam: An understandable model

Dunno what a Genotype is, but it contains a Variant. Apparently.

Adam: An understandable model

Yeaaah: generate client == more slack

Adam provides an Avro schema

Adam: An efficient storage

Machismo in I.T., what a flaw!

● Distribute data
● Schema based
● Read/query efficient
● Compact

Adam: An efficient storage

That’s a quick step

● Distribute data
● Schema based
● Read/query efficient
● Compact

PARQUET!

Adam: An efficient storage

Is Eve okay to use the parquet for that?

● Distribute data
● Schema based
● Read/query efficient
● Compact

PARQUET!

Adam provides Parquet as its storage format
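
Why a columnar format fits the "read/query efficient" and "compact" requirements can be illustrated with a tiny pure-Python sketch. It does not use Parquet itself: the record fields (`sample`, `contig`, `position`, `alt_count`) are hypothetical, and the run-length encoding here only mimics the kind of encoding a columnar format can apply to repetitive columns.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of same-shaped records into one list per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical genotype-like records, stored row by row.
rows = [
    {"sample": "S1", "contig": "chr1", "position": 100, "alt_count": 0},
    {"sample": "S2", "contig": "chr1", "position": 100, "alt_count": 1},
    {"sample": "S3", "contig": "chr1", "position": 100, "alt_count": 2},
]

columns = to_columns(rows)

# A query on one field only needs that column, not the whole records.
print(columns["alt_count"])

# Repetitive columns compress well once their values sit next to each other.
print(run_length_encode(columns["contig"]))
```

The point of the sketch is the layout: once values of the same field are contiguous, reading a single field skips the rest and repeated values collapse.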

Adam: A clean API

Object Wrappedy

adam Context

Adam: A clean API

I could have done this as a one liner

adam Context

IO methods

Adam: A clean API

At least, it’s going to be simpler than the chemistry

● Scala classes generated from Avro
● Data loaded as RDDs
● Functions on RDDs
○ write to HDFS
○ genomic object manipulations
○ primitives to query genomics datasets

Adam: Part of a pipeline

human | Seq | SNAP | Avocado | Adam | Ga4gh

ADAM is a JVM library leveraging
- Spark
- Avro
- Parquet

It still needs to be combined with sources (SNAP).

Adam data is part of processes (Avocado).

It can also be the source for external processing and learning (like MLlib).

Thousand Genomes: Open Data Set

Games without Frontiers

1000 genomes: http://www.1000genomes.org/


Produces BAMs, VCFs, ...

Thousand Genomes

Why do you complain, they are compressed …

Thousand Genomes: Where are the data

DNA Russian roulette: which is fastest?

● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/

● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/

● S3: http://aws.amazon.com/1000genomes/

● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp

Thousand Genomes: Adam that shit on S3

Hmmm like in the good old days of HPC

The bad part …

● get the vcf.gz file on local disk (& time for a coffee)

● uncompress (& go for lunch)

● put in HDFS (& take dessert)

Thousand Genomes: Adam that shit on S3

what? No grappa?

The good part …

the Notebook (this one)

Thousand Genomes: Adam that shit on S3

Okay, good enough to wait a bit…

What did we gain?

● Before: 152 GB (gzipped) in 23 files
● After: 71 GB in 9172 partitions

(43,372,735,220 genotypes)
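
The figures above can be turned into a few derived numbers. This is a minimal sketch that only does arithmetic on the slide's values; it assumes decimal gigabytes (10**9 bytes) and that the "after" size is the converted Adam data set.

```python
BEFORE_GB = 152             # gzipped VCF in 23 files (from the slide)
AFTER_GB = 71               # converted data in 9172 partitions (from the slide)
PARTITIONS = 9172
GENOTYPES = 43_372_735_220
BYTES_PER_GB = 10**9        # assumption: decimal gigabytes

size_ratio = BEFORE_GB / AFTER_GB
bytes_per_genotype = AFTER_GB * BYTES_PER_GB / GENOTYPES
avg_partition_mb = AFTER_GB * 1000 / PARTITIONS

print(f"size ratio before/after: {size_ratio:.2f}x")
print(f"bytes per genotype:      {bytes_per_genotype:.2f}")
print(f"average partition size:  {avg_partition_mb:.1f} MB")
```

So the converted data is a bit less than half the size of the gzipped input, while being split into thousands of partitions that can be read in parallel.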

Explore Genomics: Access the data

Just in case, you don’t believe us -_-’

Access data from this notebook

Learn Genomics: The problem

Insane, you’ll have a hard time with me |:-[

How to deal with heterogeneous data?

● Population stratification
● Identify natural clusters
● Assign genomes to these clusters

Learn Genomics: The dimensions

Wiiiiiiiiiiiiiiiiide rows

● 1000 samples (rows)
● 30,000,000 variants (columns or variables)

Hard to explore such a feature space…

Learn Genomics: The dimensions

*LDA for Latent Dirichlet Allocation…

Dimensionality reduction?

● Ideal would be a “Genetic” Mixture measure (LDA* would do that…)

● Or a genetic distance (edit distance)

KMeans & distances to centroids

Learn Genomics: The model

Reduce, train, validate, infer

● Split training/validation set
● Train KMeans with 25 clusters
● Compute distances to each centroid as new features
● Train Random Forest
● Validation
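
The "distances to each centroid as new features" step can be sketched in plain Python. This is not the notebook's code (the talk relies on Spark and MLlib): the tiny k-means below, the toy genotype vectors (alternate-allele counts 0/1/2) and `k=2` instead of 25 are all illustrative assumptions, and the Random Forest step is left out.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def centroid_features(point, centroids):
    """Reduce a wide vector to k features: its distance to each centroid."""
    return [distance(point, c) for c in centroids]


# Toy samples: one row per genome, one column per variant (hypothetical data).
samples = [[0, 0, 2, 2], [0, 1, 2, 2], [2, 2, 0, 0], [2, 1, 0, 0]]
centroids = kmeans(samples, k=2)
features = [centroid_features(s, centroids) for s in samples]
for sample, feats in zip(samples, features):
    print(sample, [round(f, 2) for f in feats])
```

Each genome goes from millions of columns to one distance per cluster, which is a feature space a classifier such as a Random Forest can handle.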

Adam: Our pipeline

I am a Llama

Convert VCFs to ADAM

Store ADAM to S3

Compute allele frequencies

Store allele frequencies to S3

Compute Minor Allele frequency distribution

Train a Model for stratification

Hmmm… quite some missing pieces, right?
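
The frequency steps of the pipeline can be sketched as follows. This is a simplified illustration, not the notebook's Spark code: it assumes diploid calls encoded as alternate-allele counts (0, 1 or 2) per sample, and the variant names and bin count are made up for the example.

```python
from collections import Counter


def alt_allele_frequency(alt_counts):
    """Frequency of the alternate allele over diploid samples (2 alleles each)."""
    return sum(alt_counts) / (2 * len(alt_counts))


def minor_allele_frequency(freq):
    """The minor allele is whichever of the two alleles is rarer."""
    return min(freq, 1.0 - freq)


def maf_distribution(mafs, n_bins=5):
    """Histogram of MAF values over [0, 0.5], as {bin index: count}."""
    bins = Counter()
    for maf in mafs:
        index = min(int(maf / 0.5 * n_bins), n_bins - 1)  # 0.5 goes in the last bin
        bins[index] += 1
    return dict(sorted(bins.items()))


# Hypothetical variants, each with alternate-allele counts for 4 samples.
variants = {
    "v1": [0, 0, 1, 0],
    "v2": [2, 2, 1, 2],
    "v3": [1, 1, 1, 1],
    "v4": [0, 1, 1, 0],
}

mafs = {name: minor_allele_frequency(alt_allele_frequency(counts))
        for name, counts in variants.items()}
print(mafs)
print(maf_distribution(mafs.values()))
```

On the real data set the same per-variant computation would run once for each of the tens of millions of variants, which is where a distributed engine earns its keep.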

PART II: Standards & Micro Services

Wake up!

GA4GH: Standard for Genomics

med-at-scale project

Explore: using Standards

Create custom micro services

GA4GH: Let’s fix the baseline

In I.T. it’s easy, everything is standardized…

Global Alliance for Genomics and Health

http://genomicsandhealth.org/
http://ga4gh.org/

Framework for responsible data sharing
● Define schemas
● Define services

Along with ethical, legal, security and clinical aspects

GA4GH: Models

… everybody has his own standard

GA4GH: Services

But a shared schema is a bit better!

GA4GH: Metadata

The data of my data is also my data

Work In Progress

● Individual
● Sample
● Experiment
● Dataset
● IndividualGroup
● Analysis

But still very young and too much centered on data

Beacon ⁽*⁾

Tells the world you have data. Clearly not enough

Med At Scale: By Data Fellas

Existing scalable implementation: Google Genomics

Uses
● BigQuery
● Google cloud computing
● Dremel
● …

That’s what happens when you think you have…

Med At Scale: By Data Fellas

Google Genomics is pushing Hard

Med At Scale: Scalability first

BIG

There is another scalable implementation: Med At Scale, by Data Fellas

Uses
● Apache Spark
● Adam
● S3
● HDFS
● …

Med At Scale: Scalability first

Data Fellas is pushing TOO

BIG

Med At Scale: Composability

very BIG

GA4GH defines quite a few methods, or services

They don’t all have the same requirements in terms of exposure and data processing

→ micro services for the Win

Allows granular deployment and composition/chaining of methods to answer a global question

Med At Scale: Customization

Data Fellas is a data science company

Thus our goal is to expose data analyses

A data analysis is
● elaborated in a notebook
● validated on a cluster
● deployed as a micro service itself

Still defining a Schema and Service

VERY VERY BIG

Med At Scale: Ready for the load

Balls!

We saw that one row has

30,000,000 columns

The queries are slicing and dicing those columns → views are huge

Hence, Tachyon via RDD.persist/save will optimize the collocated queries in space and time.

The hard part is (and will be) to size the Tachyon cluster
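
A first-order sizing estimate could look like the sketch below. It uses no Tachyon API at all; it only multiplies out the dimensions quoted in the talk. The 2 bytes per cached genotype, the 16 GB per worker and the 50% usable fraction are all made-up assumptions to be replaced with measured values.

```python
import math


def view_bytes(n_samples, n_variants, bytes_per_genotype):
    """Size of a samples-by-variants view held in memory."""
    return n_samples * n_variants * bytes_per_genotype


def workers_needed(total_bytes, memory_per_worker_bytes, usable_fraction=0.5):
    """Workers required if only a fraction of each worker's memory is usable for caching."""
    usable = memory_per_worker_bytes * usable_fraction
    return math.ceil(total_bytes / usable)


# 1000 samples x 30,000,000 variants come from the slides; the rest is assumed.
full_matrix = view_bytes(n_samples=1000, n_variants=30_000_000, bytes_per_genotype=2)
print(f"full view: {full_matrix / 10**9:.0f} GB")
print(f"workers:   {workers_needed(full_matrix, memory_per_worker_bytes=16 * 10**9)}")
```

In practice only the slices that queries actually touch need to be cached, which is why the access pattern matters as much as the raw size.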

Med At Scale: Ad Hoc Analytics

Who left the rats out?

Standards are very important

However, they cannot define everything, mostly OLAP.

Ad-Hoc analytics are thus allowed on the raw data using Apache Spark directly.

Of course, interactivity is a key to performance… hence the Spark-Notebook is involved.

Med At Scale: How it works

Finally…

Med At Scale: ADAM (and Spark)

Finally…

Med At Scale: MLlib (and Spark)

Finally…

Med At Scale: Efficient binary data

Finally…

Med At Scale: Micro Service

Finally…

Med At Scale: Cache and Collaboration

Finally…

Explore: Using GA4GH endpoints

notebook TIME!

Use the Scala/Java Avro client from the browser.

I give you Bananas, you give me Ananas

Customize: Create and use micro service (WIP)

Planning the next gear

Remember the frequencies use case? There is a custom endpoint manually created

We’re working on an Integrated Workflow

In a notebook:
● Create the process
● Create Cassandra schema
● Persist (using connector)
● Define service Avro IDL
● Generate project for DCOS
● Log usage (see next)

Optimization: Query mining (Roadmap)

Always look at the bright side

Back to the high dimensionality problem

Caching beforehand is a good solution but is not optimal.

Plan: analyse the Request/Response objects and the gathered runtime metrics to adapt the caching policies -- query mining processes
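
Since this is a roadmap item, the following is only one possible reading of "query mining", sketched in Python. The log format (a queried region plus its observed latency), the region names and the cost heuristic are all hypothetical; the idea is simply to rank what to cache from observed usage rather than deciding it beforehand.

```python
from collections import defaultdict


def rank_regions(request_log):
    """Order regions by total time spent serving them (most expensive first)."""
    cost = defaultdict(float)
    for region, latency_ms in request_log:
        cost[region] += latency_ms
    # Ties are broken by region name to keep the ranking deterministic.
    return sorted(cost, key=lambda region: (-cost[region], region))


def regions_to_cache(request_log, capacity):
    """Pick the `capacity` regions that would benefit most from caching."""
    return rank_regions(request_log)[:capacity]


# Hypothetical usage log: (queried region, observed latency in milliseconds).
log = [
    ("chr1:0-1M", 120.0),
    ("chr2:0-1M", 30.0),
    ("chr1:0-1M", 110.0),
    ("chr3:0-1M", 200.0),
]

print(regions_to_cache(log, capacity=2))
```

A real policy would also account for the size of each cached view and for how access patterns drift over time.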

References

Adam: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats

GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/

Spark-Notebook: https://github.com/andypetrella/spark-notebook/

Med-At-Scale: https://github.com/med-at-scale/high-health

Data Fellas: http://data-fellas.guru/

Q/A⁽*⁾

THANKS!

⁽*⁾ or head to the pub (at least beers…)