Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Team
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Natalie Twine, PhD
Prabha Pillay
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Big Data in 2025…Petabytes?
[Bar chart, axis 0 to 2500 petabytes; labels: Astronomy, YouTube; values shown: 1000, 17, 2000]
GENOMIC Big Data in 2025 - Exabytes
[Bar chart, axis 0 to 25 exabytes; labels: Astronomy, YouTube, Genomic; values shown: 1, 0.17, 2, 20]
Genome holds Blueprint for Every Cell
Affects Looks, Disease Risk, and Behavior
VCF Data
Genomic Research Workflow
https://www.projectmine.com/about/
BigData Focus
Finding the Disease Gene(s)
Spot the letter that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
[Figure: cases vs. controls, compared across Gene1 and Gene2]
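The "spot the letter" rule above can be sketched in a few lines of Python. This is a toy illustration of the oversimplified rule only, assuming equal-length sequences; the function name `candidate_positions` and the example sequences are made up for the sketch and do not come from the talk.

```python
from typing import Dict, List


def candidate_positions(cases: List[str], controls: List[str]) -> Dict[int, str]:
    """Return {position: letter} where all cases share one letter
    and no control carries that letter at the same position."""
    if not cases:
        return {}
    hits: Dict[int, str] = {}
    for pos in range(len(cases[0])):
        case_letters = {seq[pos] for seq in cases}
        if len(case_letters) != 1:
            continue  # the affected individuals disagree at this position
        letter = next(iter(case_letters))
        if all(seq[pos] != letter for seq in controls):
            hits[pos] = letter  # common in all cases, absent in all controls
    return hits


# Hypothetical toy data: three affected and three unaffected sequences.
cases = ["ACGTTA", "ACGTCA", "ACGTGA"]
controls = ["ACCTTA", "ACCTCA", "ACATGA"]
result = candidate_positions(cases, controls)
print(result)
```

On real cohorts the signal is rarely this clean, which is why the later slides turn to machine learning rather than exact matching.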
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
Why Apache Spark?
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
[Plot: Speed (low to high) vs. Accuracy (low to high)]
Cloud Data Pipeline Pattern
Business Problem
Data Quality
Candidate Technologies
Build/Test MVPs
Assemble Pipeline
Building a Cloud Data Pipeline
Candidate Technologies
• Ingest/Clean
• Analyze/Predict
• Visualize
Build MVPs
• Test
• Iterate
• Learn
Assemble Pipeline
• Combine pieces
• Validate sections
• Test at scale
Building a Cloud Data Pipeline
Spark
• IaaS, PaaS, SaaS Vendors
• AWS, Azure, GCP…
Visualizing Machine Learning Results
Solving Important Questions…Cancer genomics?
DEMO: Who is a Bondi Hipster?
Supervised ML: Wide Random Forests
Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50 h (AWS: ~$215.50)
100K trees: 200 – 2000 h (AWS: ~$8620.00)
• Yarn cluster
• 12 workers
• 16 x Intel CPUs
• Xeon [email protected]
• 128 GB RAM
• Spark 1.6.1
• 128 executors
• 6 GB / executor (0.75 TB)
• Synthetic dataset
[Chart regions: GWAS Range, Whole Genome Range]
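The quoted AWS costs appear to scale linearly with runtime. A minimal sketch, assuming each quoted cost corresponds to the upper end of its runtime range (the slide does not say so), recovers the implied hourly cluster rate and checks the executor memory total:

```python
# Figures quoted on the slide: runtime range in hours and approximate AWS cost.
runs = [
    {"hours": (5, 50), "cost_usd": 215.50},
    {"hours": (200, 2000), "cost_usd": 8620.00},
]

# Assumption: each cost corresponds to the upper end of its runtime range.
rates = [round(run["cost_usd"] / run["hours"][1], 2) for run in runs]
print("implied USD per cluster-hour:", rates)

# The cost ratio between the two runs matches the runtime ratio.
cost_ratio = runs[1]["cost_usd"] / runs[0]["cost_usd"]
time_ratio = runs[1]["hours"][1] / runs[0]["hours"][1]
print("cost ratio:", cost_ratio, "runtime ratio:", time_ratio)

# Executor memory: 128 executors x 6 GB each, expressed in TB (1 TB = 1024 GB).
total_tb = 128 * 6 / 1024
print("executor memory (TB):", total_tb)
```

Under that assumption both runs imply the same rate of about $4.31 per cluster-hour, so the 40x cost gap is explained by the 40x longer runtime alone.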
Future Directions for VariantSpark RF
Mixed feature types
• Unordered Categorical
• Continuous
Build Community
• Python API
• Non-Genomic Demos
Try it out: VariantSpark Notebook
https://docs.databricks.com/spark/latest/training/variant-spark.html
Genome Editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
“Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.”
Aim: Develop a computational guidance framework so that edits work the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned
Make Process Parallel and Scalable
SPEED
• Each search can be broken down into parallel tasks - each takes seconds
SCALE
• Researchers might want to search the target for one gene or 100,000
Scalability + Agility =
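The SPEED and SCALE points describe a fan-out pattern: many short, independent searches run side by side. A minimal local sketch, using a thread pool as a stand-in for parallel cloud invocations, might look like this. The function `search_gene` and its placeholder work are hypothetical and are not GT-Scan2 code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Tuple


def search_gene(gene_id: str) -> Tuple[str, int]:
    """Hypothetical stand-in for one target search.

    A real search would scan the gene for editing targets; here the
    'result' is just the length of the identifier.
    """
    return gene_id, len(gene_id)


def search_all(gene_ids: Iterable[str], max_workers: int = 8) -> Dict[str, int]:
    """Run every search as an independent task and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(search_gene, gene_ids))


# The same call handles one gene or many; only the input list changes.
results = search_all([f"gene{i}" for i in range(1000)])
print(len(results))
```

Because no task depends on another, the number of workers can grow with the request size, which is the property a serverless platform provides without a standing cluster.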
One of the first Serverless Applications in Research
X-Ray Tracing Demo of GT-Scan2
• Find performance bottlenecks
• Fix and test
GTScan2 X-Ray Analysis
[Architecture diagram: Webapp, Lambda, Resources (S3, DynamoDB)]
[Bar chart: runtime (s) per function, y-axis ticks at 25, 50, 75, grouped by Type (base, old). Functions: getFastaSequence, createJob, targetScan, offtargetScanStarter, offtargetSearch, targetIntersects, targetTranscriptionIntersects, targetWuScorer, targetSgRNAScorer, OnTargetScorer, genomeCRISPR]
Results – 4x Faster (80% improvement)
2 min -> 30 sec
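The two rounded timings can be checked against the headline. A quick calculation, assuming 2 min is the runtime before tuning and 30 sec the runtime after (the slide does not label them), gives a 4x speedup, which corresponds to a 75% reduction in runtime. The 80% on the slide may come from unrounded measurements that are not shown here.

```python
before_s = 2 * 60  # "2 min", assumed to be the runtime before optimization
after_s = 30       # "30 sec", assumed to be the runtime after optimization

speedup = before_s / after_s        # how many times faster the new version is
reduction = 1 - after_s / before_s  # fraction of the original runtime removed

print(f"{speedup:.0f}x faster, {reduction:.0%} less runtime")
```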
Considering Services for GT-Scan2
• Use AWS Step Functions
  • Simplify workflow
  • Simplify task timeouts
  • Simplify task failures
• Must evaluate costs
  • SNS vs. Step Functions
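To make the Step Functions points concrete, here is a sketch of a state machine definition in Amazon States Language, built as a Python dictionary. The workflow, per-task timeouts, and failure handling each become declarative fields. The state names are illustrative, and the ARNs, ordering, timeout, and retry values are placeholders chosen for the sketch, not the GT-Scan2 configuration.

```python
import json
from typing import Optional


def task_state(resource_arn: str, next_state: Optional[str] = None,
               timeout_s: int = 60) -> dict:
    """Build one Task state with a timeout, retries, and a failure route."""
    state = {
        "Type": "Task",
        "Resource": resource_arn,
        "TimeoutSeconds": timeout_s,  # task timeout handled by the service
        "Retry": [{
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        # After retries are exhausted, route any error to a Fail state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "JobFailed"}],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


# Placeholder ARN prefix: REGION and ACCOUNT_ID are deliberately not filled in.
ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"

definition = {
    "Comment": "Illustrative three-step search workflow (assumed ordering)",
    "StartAt": "createJob",
    "States": {
        "createJob": task_state(ARN + "createJob", "targetScan"),
        "targetScan": task_state(ARN + "targetScan", "offtargetSearch"),
        "offtargetSearch": task_state(ARN + "offtargetSearch", timeout_s=300),
        "JobFailed": {"Type": "Fail", "Error": "JobFailed",
                      "Cause": "A task failed after retries"},
    },
}

print(json.dumps(definition, indent=2))
```

The cost question on the slide still applies: Step Functions are billed per state transition, so a design like this should be priced against an SNS-based fan-out before adoption.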
Cloud Data Pipeline Pattern
Problem: Search (GTScan2)
Data: fastq, bed -> S3, NoSQL
Technologies: Ingest, ETL, Analyze, Viz
MVPs: S3, Lambda, Lambda/API Gateway
Pipeline: Serverless
Serverless Pipeline Pattern
[Diagram: Users, API Gateway, Lambda functions 1, 2 and 3, buckets with objects, DynamoDB, Step Functions]
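The pattern in the diagram can be simulated locally to show how the pieces hand work to one another. In this sketch three handler functions use the `(event, context)` signature of Python Lambda handlers, while plain dictionaries stand in for the object store and the DynamoDB table. All handler names, keys, and the toy "GG" scan are assumptions for illustration, not the GT-Scan2 implementation.

```python
from typing import Any, Dict, List

# In-memory stand-ins (assumptions): a real deployment would call S3 and DynamoDB.
bucket: Dict[str, str] = {}
table: Dict[str, Dict[str, Any]] = {}


def create_job(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Function 1: store the submitted sequence and register the job."""
    job_id = f"job-{len(table) + 1}"
    bucket[f"{job_id}/input.fa"] = event["sequence"]
    table[job_id] = {"status": "created"}
    return {"job_id": job_id}


def scan_targets(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Function 2: toy scan that records every position where 'GG' starts."""
    job_id = event["job_id"]
    seq = bucket[f"{job_id}/input.fa"]
    targets: List[int] = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GG"]
    table[job_id].update(status="scanned", targets=targets)
    return {"job_id": job_id}


def summarize(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Function 3: mark the job done and return what the web app would show."""
    job_id = event["job_id"]
    item = table[job_id]
    item["status"] = "done"
    return {"job_id": job_id, "n_targets": len(item["targets"])}


def run_pipeline(event: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the orchestrator: pass each output to the next function."""
    return summarize(scan_targets(create_job(event)))


result = run_pipeline({"sequence": "ACGGTTGGA"})
print(result)
```

Each handler only reads and writes shared storage and returns a small message, which is what lets the platform run many jobs concurrently without the functions knowing about each other.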
Cloud Data Pipeline Pattern
Problem: Analyze (GWAS)
Data: vcf -> S3/Spark
Technologies: Ingest, ETL, Analyze, Viz
MVPs: S3 -> Databricks DBFS; Apache Spark; Variant-Spark ML; Notebook, SQL, R, Python
Pipeline: Spark Server Cluster
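The ETL stage of this pipeline starts from VCF text. As a rough illustration of what that step involves, the sketch below parses VCF lines in plain Python and turns each genotype into a count of non-reference alleles (0, 1 or 2 for a diploid sample), a common numeric encoding for machine learning. It is a simplified stand-in, not VariantSpark's loader, and the sample data is invented.

```python
from typing import Iterable, List, Optional, Tuple

Variant = Tuple[str, List[Optional[int]]]


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], List[Variant]]:
    """Return (sample names, [(variant label, alt-allele counts per sample)])."""
    samples: List[str] = []
    variants: List[Variant] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        counts: List[Optional[int]] = []
        for sample_field in fields[9:]:
            gt = sample_field.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            if "." in alleles:
                counts.append(None)  # missing genotype
            else:
                counts.append(sum(a != "0" for a in alleles))
        variants.append((f"{chrom}:{pos}:{ref}>{alt}", counts))
    return samples, variants


# Invented two-variant, three-sample example.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "S1", "S2", "S3"]
rows = [
    ["1", "1000", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1", "1|1"],
    ["2", "2000", "rs1", "C", "T", ".", "PASS", ".", "GT:DP", "0|1:12", "./.:0", "0/0:9"],
]
vcf_lines = ["##fileformat=VCFv4.2", "\t".join(header)] + ["\t".join(r) for r in rows]

samples, variants = parse_vcf(vcf_lines)
print(samples)
print(variants)
```

At genomic scale the same transformation would run per partition inside Spark rather than in a single Python loop; the sketch only shows the shape of the data that reaches the analysis stage.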
Spark Server Cluster Pipeline Pattern
Jupyter Notebook
Cloud Genomic-Scale Data Pipelines
• Problem #1 – ML on Large Data
• Solution: Spark-server cluster + custom machine learning
• Problem #2 – Burstable Search
• Solution: Serverless pipeline