Next-Generation Genomics Using Spark and ADAM
Timothy Danford, Tamr Inc., AMPLab
Next Generation?
We come in peace.
What even is genomics?
Organism Cell Genome
One chromosome per person defines a reference chromosome and location
“… decoding the Book of Life”
Ortellius, 1570
Google, 2005
Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)
Down the Long Slide, To Happiness Endlessly
We often treat ‘bioinformatics’ as a black box
Vials into Files
What’s In The Box?
My God, It’s Full of Pipelines
A Tale of Three File Formats
BAM Files: Do You Read Me?
• Compressed text files & custom index formats
• User-defined attributes
• Multi-record structure
“Not wishing to be outdone by Amazon, Sanger Institute develops drone delivery system for BAM files.”
Open the Pod Bay Doors, Pal
I Had a Dream It Would End This Way
What to do, what to do?
Bioinformaticians ❤ Probabilistic Models
Our Data Scattered Back and Forth Across Space by this Gadget
Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model.
• Improvement over existing formats¹
• 20-22% for BAM
• ~95% for VCF
¹ Compression % quoted from 1K Genomes samples
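A minimal sketch of the idea on this slide, using only the Python standard library. Avro schemas are written as JSON, and Parquet stores data column by column, so the sketch declares a record schema and pivots row-oriented records into per-field columns. The record name `AlignmentSketch`, its fields, and the helpers `field_names` and `columns` are hypothetical names chosen for illustration; this is not ADAM's actual schema, and no real Avro or Parquet library is called.

```python
import json

# Hypothetical, simplified record schema in Avro's JSON schema syntax.
# NOT ADAM's real schema: the record and field names are illustrative only.
ALIGNMENT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "AlignmentSketch",
  "fields": [
    {"name": "contig",   "type": "string"},
    {"name": "start",    "type": "long"},
    {"name": "end",      "type": "long"},
    {"name": "sequence", "type": "string"},
    {"name": "mapq",     "type": ["null", "int"], "default": null}
  ]
}
""")


def field_names(schema):
    """Return the field names declared by the schema, in declaration order."""
    return [field["name"] for field in schema["fields"]]


def columns(records, schema):
    """Pivot row-oriented records into one list per schema field.

    This mimics the columnar layout idea only; a real Parquet writer also
    encodes and compresses each column. Missing fields become None.
    """
    return {name: [record.get(name) for record in records]
            for name in field_names(schema)}


reads = [
    {"contig": "chr1", "start": 100, "end": 150, "sequence": "ACGT", "mapq": 60},
    {"contig": "chr1", "start": 120, "end": 170, "sequence": "TTGA"},
]
by_column = columns(reads, ALIGNMENT_SCHEMA)
```

Grouping values of one field together is what lets a columnar format compress each column with an encoding suited to its type, which is one plausible reading of where the improvements quoted above come from.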
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
Core Genomics Primitives: The Needs of the Many
The Terrible Trouble with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
“I think you know what the problem is, just as well as I do.”
A single piece of a filtering stage for a somatic variant caller
“11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
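As a sketch of what a format-independent version of that window could look like, the Python below expresses "11 base pairs centered on a candidate" as a plain region-overlap query that does not care how the reads are sorted or stored. The names `Region`, `window_around`, and `reads_in_window` are hypothetical and are not taken from ADAM or from the variant caller in the cited paper; coordinates are assumed to be 0-based and half-open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic interval; start is inclusive, end is exclusive (assumed)."""
    contig: str
    start: int
    end: int

    def overlaps(self, other):
        """True when both regions share at least one base on the same contig."""
        return (self.contig == other.contig
                and self.start < other.end
                and other.start < self.end)


def window_around(contig, pos, width=11):
    """Region of `width` bases centered on `pos`, clipped at position 0."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    return Region(contig, max(0, pos - half), pos + half + 1)


def reads_in_window(reads, contig, pos, width=11):
    """Select reads overlapping the window, whatever order `reads` is in."""
    window = window_around(contig, pos, width)
    return [read for read in reads if read.overlaps(window)]


reads = [
    Region("chr1", 106, 120),  # starts just past the window: excluded
    Region("chr2", 95, 106),   # different contig: excluded
    Region("chr1", 90, 96),    # covers base 95: included
]
hits = reads_in_window(reads, "chr1", 100)
```

The linear scan is deliberately naive. The point is only that the window is defined by coordinates, so the same query gives the same answer on sorted or unsorted input.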
“Myths of Bioinformatics Software”
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
We Can Make Our Own Myths
Thanks to...
And thank you! Questions?