Next-Generation Genomics Using Spark and ADAM Timothy Danford Tamr Inc. AMPLab

Strata-Hadoop 2015 Presentation


Page 1: Strata-Hadoop 2015 Presentation

Next-Generation Genomics Using Spark and ADAM

Timothy Danford
Tamr Inc.
AMPLab

Page 2: Strata-Hadoop 2015 Presentation

Next Generation

?

We come in peace.

Page 3: Strata-Hadoop 2015 Presentation

What even is genomics?

Page 4: Strata-Hadoop 2015 Presentation

Organism Cell Genome

Page 5: Strata-Hadoop 2015 Presentation

One chromosome

Page 6: Strata-Hadoop 2015 Presentation

One chromosome

per person

Page 7: Strata-Hadoop 2015 Presentation

One chromosome

per person

defines a reference chromosome

Page 8: Strata-Hadoop 2015 Presentation

One chromosome

per person

defines a reference chromosome

and location

Page 9: Strata-Hadoop 2015 Presentation

“… decoding the Book of Life”

Page 10: Strata-Hadoop 2015 Presentation

Ortelius, 1570

Page 11: Strata-Hadoop 2015 Presentation

Google, 2005

Page 12: Strata-Hadoop 2015 Presentation
Page 13: Strata-Hadoop 2015 Presentation
Page 14: Strata-Hadoop 2015 Presentation
Page 15: Strata-Hadoop 2015 Presentation

Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)

Page 16: Strata-Hadoop 2015 Presentation

Down the Long Slide, To Happiness Endlessly

Page 17: Strata-Hadoop 2015 Presentation

We often treat ‘bioinformatics’ as a black box

Vials into Files

Page 18: Strata-Hadoop 2015 Presentation

What’s In The Box?

Page 19: Strata-Hadoop 2015 Presentation
Page 20: Strata-Hadoop 2015 Presentation

My God, It’s Full of Pipelines

Page 21: Strata-Hadoop 2015 Presentation

My God, It’s Full of Pipelines

Page 22: Strata-Hadoop 2015 Presentation

A Tale of Three File Formats

BAM Files: Do You Read Me?

• Compressed text files & custom index formats

• User-defined attributes

• Multi-record structure

Page 23: Strata-Hadoop 2015 Presentation

“Not wishing to be outdone by Amazon, Sanger Institute develops drone delivery system for BAM files.”

Page 24: Strata-Hadoop 2015 Presentation

Open the Pod Bay Doors, Pal

Page 25: Strata-Hadoop 2015 Presentation

I Had a Dream It Would End This Way

Page 26: Strata-Hadoop 2015 Presentation

What to do, what to do?

Page 27: Strata-Hadoop 2015 Presentation

Bioinformaticians ❤ Probabilistic Models

Our Data Scattered Back and Forth Across Space by this Gadget

Page 28: Strata-Hadoop 2015 Presentation

Why Are We Still Defining File Formats By Hand?

• Instead of defining custom file formats for each data type and access pattern…

• Parquet creates a compressed format for each Avro-defined data model.

• Compression improvement over existing formats¹

  • 20-22% for BAM

  • ~95% for VCF

¹ Compression % quoted from 1K Genomes samples
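Parquet is a columnar format, and the idea of deriving the storage layout from a declared data model can be illustrated with a small sketch. The code below is not Parquet's encoding and does not use Avro; it only contrasts a row-oriented and a column-oriented serialization of the same records using the Python standard library. The record fields (`contig`, `start`, `sequence`, `mapq`) and all function names are hypothetical, chosen for illustration, and are not ADAM's schema.

```python
import json
import random
import zlib


def make_records(n, seed=0):
    """Generate hypothetical read-like records (illustrative fields only)."""
    rng = random.Random(seed)
    records = []
    position = 0
    for _ in range(n):
        position += rng.randint(1, 50)
        records.append({
            "contig": "chr1",
            "start": position,
            "sequence": "".join(rng.choice("ACGT") for _ in range(50)),
            "mapq": rng.choice([0, 30, 60]),
        })
    return records


def to_row_bytes(records):
    # Row-oriented: the fields of each record are stored together.
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode()


def to_column_bytes(records):
    # Column-oriented: all values of one field are stored together.
    columns = {key: [r[key] for r in records] for key in sorted(records[0])}
    return json.dumps(columns, sort_keys=True).encode()


def from_column_bytes(data):
    # Rebuild the original records from the column-oriented form.
    columns = json.loads(data.decode())
    keys = sorted(columns)
    count = len(columns[keys[0]])
    return [{k: columns[k][i] for k in keys} for i in range(count)]


records = make_records(1000)
row_size = len(zlib.compress(to_row_bytes(records)))
column_size = len(zlib.compress(to_column_bytes(records)))
print("row-oriented, compressed bytes:   ", row_size)
print("column-oriented, compressed bytes:", column_size)
```

The sizes printed depend on the synthetic data and say nothing about the BAM or VCF figures quoted on the slide; the point is only that the layout follows mechanically from the record definition rather than from a hand-written file format.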

Page 29: Strata-Hadoop 2015 Presentation

Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab

• Apache 2 License

• Contributors from both research and commercial organizations

• Core spatial primitives, variant calling

• Avro and Parquet for data models and file formats
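As a rough illustration of what a "core spatial primitive" can look like, here is a minimal in-memory sketch of an overlap join between two sets of genomic regions. The `Region` class and `region_join` function are hypothetical names used for this example and are not ADAM's API; a distributed implementation on Spark would partition the data rather than compare every pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical half-open interval [start, end) on a named reference sequence."""
    contig: str
    start: int
    end: int

    def overlaps(self, other):
        # Two regions overlap only if they share a contig and their intervals intersect.
        return (self.contig == other.contig
                and self.start < other.end
                and other.start < self.end)


def region_join(left, right):
    """Naive quadratic overlap join: every (l, r) pair whose regions overlap."""
    return [(l, r) for l in left for r in right if l.overlaps(r)]


reads = [Region("chr1", 100, 150), Region("chr1", 400, 450), Region("chr2", 100, 150)]
genes = [Region("chr1", 120, 300), Region("chr2", 0, 50)]
pairs = region_join(reads, genes)
print(pairs)
```

Only the first read overlaps a gene here: the second read lies past the end of the chr1 gene, and the chr2 read starts after the chr2 gene ends.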

Page 30: Strata-Hadoop 2015 Presentation

Core Genomics Primitives: The Needs of the Many

Page 31: Strata-Hadoop 2015 Presentation

The Terrible Trouble with Existing Pipelines

Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)

Page 32: Strata-Hadoop 2015 Presentation

“I think you know what the problem is, just as well as I do.”

A single piece of a filtering stage for a somatic variant caller

“11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
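The 11-base-pair window itself is simple to state independently of any file format or sort order. The sketch below computes the window and counts events inside it without assuming the input is sorted. What the real filter counts inside the window is not given on the slide; counting mismatch positions here is an assumption for illustration, and `Mismatch`, `window_bounds`, and `mismatches_in_window` are hypothetical names.

```python
from dataclasses import dataclass

WINDOW_SIZE = 11  # from the slide: an 11-base-pair window centered on a candidate


@dataclass(frozen=True)
class Mismatch:
    """Hypothetical record: a position where a read disagrees with the reference."""
    contig: str
    position: int


def window_bounds(center, size=WINDOW_SIZE):
    """Inclusive (start, end) of an odd-sized window centered on `center`."""
    if size % 2 == 0:
        raise ValueError("window size must be odd to have a center")
    half = size // 2
    return center - half, center + half


def mismatches_in_window(mismatches, contig, center, size=WINDOW_SIZE):
    """Count mismatches inside the window, without assuming any input order."""
    start, end = window_bounds(center, size)
    return sum(1 for m in mismatches
               if m.contig == contig and start <= m.position <= end)


# Deliberately unsorted input.
observed = [Mismatch("chr1", 1006), Mismatch("chr2", 1000),
            Mismatch("chr1", 995), Mismatch("chr1", 1000), Mismatch("chr1", 994)]
bounds = window_bounds(1000)
count = mismatches_in_window(observed, "chr1", 1000)
print(bounds, count)
```

Because the predicate depends only on contig and position, the same function gives the same answer for any ordering of the input, which is the property a format-specific implementation tends to lose.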

Page 33: Strata-Hadoop 2015 Presentation

“Myths of Bioinformatics Software”

1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.

Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/

We Can Make Our Own Myths

Page 34: Strata-Hadoop 2015 Presentation

Thanks to...

And thank you! Questions?