Algorithms & Tools for Genomic Analysis on Spark Ryan Williams Hammer Lab @ Mt. Sinai School of Medicine http://bit.ly/sse2017

Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by Ryan Williams



Slides: http://bit.ly/sse2017

Agenda
• Intro
• Genomics crash course
• Genomics applications
• Fun with magic-rdds
• Scala/Spark project mgmt notes
• Questions

Intro

Hammer Lab
• Est. 2013 @ Mt. Sinai School of Medicine
• Initial focus: high-quality bioinformatics/genomics software
  – distributed systems
  – OSS
  – static-typing + functional idioms
• Present-day: cancer immunotherapy research
  – personal-cancer-vaccine clinical trials
  – post-hoc clinical-data analysis
• This talk: ≈3yrs of genomic-analysis tool-building w/ Spark

Spark/Scala Tooling Overview

Genomics Crash Course

blogs.plos.org/dnascience/…

http://www.wikiwand.com/en/Shotgun_sequencing

www.thunderbolts.info

Genome structure makes things difficult
• Excessive repetitiveness
  – 20% retrotransposons
  – L1: 7000bp, 100k copies
  – Pseudogenes
• Impossible to resolve with “short reads”

www.pnas.org

contig.wordpress.com

github.com/rrwick/Bandage

“Reference Genome”

Basically everything matches

“Coverage Depth”

Digitizing Human Genomes
• 1 genome ≈ 3B base-pairs
• Theory:
  – “2 bits per base-pair” (A, C, G, T)
  – ⇒ 1 genome ≈ 750MB
  – <1% unique, person to person
  – 7BN genomes ≈ 50PB?
• Reality:
  – 1BN 100bp “reads”
  – ⇒ 100BN sequenced bases
  – Cover the genome at average depth 30 (“30x coverage”)
  – 2-bit base, 1-byte quality score ⇒ 100GB / genome
  – 100-100k genomes ⇒ 10TB-10PB
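The storage figures on this slide can be reproduced with a quick back-of-envelope check. The sketch below is plain Python and uses only the numbers quoted on the slide; the variable names are illustrative, and decimal units (1GB = 10^9 bytes) are an assumption.

```python
# Back-of-envelope check of the slide's storage estimates.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 PB = 10**15 bytes).

GENOME_BASES = 3_000_000_000      # "1 genome ≈ 3B base-pairs"
BITS_PER_BASE = 2                 # A, C, G, T fit in 2 bits

# Theory: one genome stored at 2 bits per base.
theory_bytes = GENOME_BASES * BITS_PER_BASE // 8

# Theory: 7BN people, storing only the (at most) 1% that differs per person.
POPULATION = 7_000_000_000
population_bytes = POPULATION * theory_bytes // 100

# Reality: 1BN reads of 100bp each.
READS = 1_000_000_000
READ_LENGTH = 100
sequenced_bases = READS * READ_LENGTH
average_depth = sequenced_bases / GENOME_BASES

# Each sequenced base costs 2 bits plus a 1-byte quality score.
BITS_PER_SEQUENCED_BASE = 2 + 8
bytes_per_genome = sequenced_bases * BITS_PER_SEQUENCED_BASE // 8

# Cohorts of 100 and 100k genomes.
cohort_small = 100 * bytes_per_genome
cohort_large = 100_000 * bytes_per_genome

print(f"theory, one genome:   {theory_bytes / 1e6:.0f} MB")
print(f"theory, 7BN genomes:  {population_bytes / 1e15:.1f} PB")
print(f"average depth:        {average_depth:.1f}x")
print(f"reality, one genome:  {bytes_per_genome / 1e9:.0f} GB")
print(f"reality, 100 genomes: {cohort_small / 1e12:.1f} TB")
print(f"reality, 100k genomes: {cohort_large / 1e15:.1f} PB")
```

Unrounded, the per-genome figure comes out at 125GB and the depth at about 33x, which the slide rounds to 100GB and 30x; the cohort range scales the same way.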

Genomic Applications on Spark


www.idtdna.com

Joint histogram of coverage-depth distribution of two samples

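The slides do not show how this histogram is computed, so the following is only a toy, in-memory sketch of the idea: compute the coverage depth at every position for each sample, then count how often each (depth in sample A, depth in sample B) pair occurs. It is plain Python, not Spark, and the function names, the half-open `(start, end)` read representation, and the example reads are all assumptions made for illustration.

```python
from collections import Counter

def depths(reads, genome_length):
    """Per-position coverage depth for half-open (start, end) read intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1
    # A running sum of the differences gives the depth at each position.
    result, running = [], 0
    for change in delta[:genome_length]:
        running += change
        result.append(running)
    return result

def joint_histogram(reads_a, reads_b, genome_length):
    """Count positions by their (depth in A, depth in B) pair."""
    return Counter(zip(depths(reads_a, genome_length),
                       depths(reads_b, genome_length)))

# Hypothetical reads over an 8-base "genome".
sample_a = [(0, 4), (2, 6)]
sample_b = [(1, 5), (3, 8)]
hist = joint_histogram(sample_a, sample_b, genome_length=8)

for (depth_a, depth_b), count in sorted(hist.items()):
    print(f"depth A={depth_a}, depth B={depth_b}: {count} position(s)")
```

At genome scale the same two steps would have to be distributed, which is where Spark comes in; this sketch only pins down what is being counted.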

Fun with magic-rdds


CappedGroupByKeyRDD

SlidingRDD

ReverseRDD

ScanLeftRDD

RunLengthRDD
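The slides list these classes by name only. Going purely by the names, the operations they suggest can be sketched on ordinary in-memory lists as below. This is plain Python meant to illustrate the semantics; it is not the magic-rdds API, every function name and signature here is an assumption, and the distributed versions additionally have to deal with partition boundaries.

```python
from itertools import accumulate, groupby
from operator import add

def capped_group_by_key(pairs, cap):
    """Group values by key, keeping at most `cap` values per key."""
    groups = {}
    for key, value in pairs:
        bucket = groups.setdefault(key, [])
        if len(bucket) < cap:
            bucket.append(value)
    return groups

def sliding(items, n):
    """All windows of n consecutive elements."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def reverse(items):
    """Elements in reverse order."""
    return list(reversed(items))

def scan_left(items, zero, op):
    """Running fold from the left, starting with `zero`."""
    return list(accumulate(items, op, initial=zero))

def run_length(items):
    """Collapse runs of equal adjacent elements into (element, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(items)]

bases = list("AAACCGTT")
print(run_length(bases))
print(sliding([1, 2, 3, 4], 2))
print(reverse([1, 2, 3]))
print(scan_left([1, 2, 3], 0, add))
print(capped_group_by_key([("chr1", 1), ("chr1", 2), ("chr1", 3), ("chr2", 4)], cap=2))
```

Each of these is trivial on a single list; the interesting part on an RDD is that windows, runs, and running totals can straddle partitions, so neighbouring partitions need to exchange a small amount of state.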

Scala/Spark Project Management