HIGH-PERFORMANCE BIOLOGICAL COMPUTING
University of Illinois at Urbana-Champaign

Instrumenting Genomic Variant Calling Workflow, Part II

Liudmila Sergeevna Mainzer
Blue Waters Symposium, June 13-15, 2016


Computational Genomics at NCSA and UIUC

• Architecture: What kind of computer architecture is best suited for bioinformatics work?

• Performance bottlenecks: What are the performance bottlenecks for bioinformatics work, on different architectures?

• Future: How to structure the bioinformatics workflows for best performance on the architectures upcoming in the next 1, 3, 5 years?

Victor Jongeneel, Director of HPCBio

Ravi Iyer, Professor of ECE

Mainzer, HPCBio Blue Waters Symposium 2016 Slide 2 of 30


Project complete: how far can we scale genomic variant calling on BW?

Part 1: Introduction
• What is genomic variant calling?
• Why do we think it is important?
• Why does it need a petascale computer?

== with the emphasis on scale ==

Part 2: Data management issues
• Big data workflow
• Workflow management was discussed last year

Part 3: Outlook
• What kind of HPC do we need to satisfy the NSF priority to Understand the Rules of Life: linking genotype to phenotype? (Jim Kurose)


Part 1:

What is Genomic Variant Calling? Why do we think it is important?

Why does it need petascale computing?


Genomic Variant = a difference in the genetic code

Reference: goodnightgoodnightpartingissuchsweetsorrow

Reads: htg-odnigh, nightg-od, Oetsorro, swOetsorr, uchswOetsoodnightg, Goodnigh, ghtpartingi, nightparti, issuchswO, g-odnightp, ghtg-odnig, dnightg-od

Rahim et al. Genome Biology 2008 9:215
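
The analogy on this slide can be sketched in a few lines of Python: reads are compared to the reference text position by position, and every mismatch or gap is reported as a variant. This is only a toy illustration, not how a production variant caller works. The function name `call_variants` and the read offsets (7, 31, 0) are assumptions worked out by hand for this example; in a real workflow the aligner supplies the offsets.

```python
REFERENCE = "goodnightgoodnightpartingissuchsweetsorrow"


def call_variants(reference, aligned_reads):
    """Report differences between gapped reads and the reference.

    aligned_reads is a list of (offset, read) pairs; '-' in a read marks
    a deleted base. Comparison is case-insensitive, so a capitalised
    first letter is not treated as a variant.
    """
    variants = set()
    for offset, read in aligned_reads:
        for i, base in enumerate(read):
            position = offset + i
            ref_base = reference[position]
            if base == "-":
                variants.add((position, ref_base, "-"))   # deletion
            elif base.lower() != ref_base.lower():
                variants.add((position, ref_base, base))  # substitution
    return sorted(variants)


# Three of the read fragments from the slide, with hand-derived offsets.
reads = [(7, "htg-odnigh"), (31, "swOetsorr"), (0, "Goodnigh")]
print(call_variants(REFERENCE, reads))
```

Several reads covering the same position report the same variant, which is why sequencing depth matters for confidence in a call.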


Even a single variant in a single gene can lead to a drastic difference in phenotype


Image from http://www.nhlbi.nih.gov/

Sickle-cell anemia is a Mendelian disease.

NHGRI: Since 2011, Centers for Mendelian Genomics sequenced >20,000 human exomes.

Human exome ~ 2% of the human genome

1 sample ~ 10 GB sequencing data
20,000 samples ~ 200 TB sequencing data


Petascale storage requirements

Complex traits are influenced by many variants, frequently outside coding regions:

• BMI
• Human height
• Alzheimer’s disease
• Diabetes
• Stroke
• Autism
• Heart disease
• Intelligence
• Fertility


NHGRI: Centers for Common Disease Genomics plan to sequence ~200,000 whole human genomes.

1 sample ~ 200 GB sequencing data (depth-dependent)

200,000 samples ~ 40 PB sequencing data: input data to the variant calling process
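
The storage estimates for the two NHGRI programs follow from a single multiplication. The short sketch below reproduces them, assuming decimal units (1 TB = 1000 GB, 1 PB = 1000 TB); the helper name `cohort_storage_tb` is made up for this illustration.

```python
def cohort_storage_tb(samples, gb_per_sample):
    """Total raw sequencing data for a cohort, in decimal terabytes."""
    return samples * gb_per_sample / 1000


# Exomes: 20,000 samples at ~10 GB each.
exome_tb = cohort_storage_tb(20_000, 10)

# Whole genomes: 200,000 samples at ~200 GB each, converted to petabytes.
genome_pb = cohort_storage_tb(200_000, 200) / 1000

print(f"exomes: {exome_tb:.0f} TB, whole genomes: {genome_pb:.0f} PB")
```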


What if we had to genotype every baby being born? = 500 genomes/day in the state of Illinois

NERVE CONDITION - PROBABILITY 60%
MANIC DEPRESSION - 42%
OBESITY - 66%
ATTENTION DEFICIT DISORDER - 89%
HEART DISORDER - 99%
EARLY FATAL POTENTIAL
LIFE EXPECTANCY - 33 YEARS


Sustained Petascale and Exascale storage requirements

NIH http://www.nih.gov/precisionmedicine/


• Input: 300-600 GB/genome; 150-300 TB/day when analyzing 500 genomes/day

• Intermediary: 3 TB per sample with intermediaries; 0.3-1.5 PB/day when analyzing 500 genomes/day

• Output: tiny, < 500 MB per sample


Large scale plant and animal genotyping

Purebreddairycattle.com

Complex traits of note

• Plant biomass
• Nutritive content of grain: oil, protein, vitamins, minerals
• Parasite resistance
• Milk volume
• Muscle mass


Ongoing and future projects

• 1000+ Arabidopsis genomes

• 3,000 Rice varieties

• 1000 Fungal genomes project

• Genome10K: 16,000 Vertebrates

• 5,000 Insect genomes


It won’t stop here


Compute requirements: Node count, not flops


1. Alignment: 500 jobs for BWA. If chunking input data: 5,000 jobs for Novoalign

2. Split by chromosome: 25 chromosomes * 500 genomes = 12,500 jobs

3. Realign/Recalibrate: 25 chromosomes * 500 genomes = 12,500 jobs

4. Variant calling: 25 chromosomes * 500 genomes = 12,500 jobs
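
The job counts in the four steps above can be tallied with a short sketch. The function name `jobs_per_stage` is hypothetical, and the figure of 10 alignment chunks per genome is an inference from the slide's 5,000 Novoalign jobs for 500 genomes, not a number stated in the talk.

```python
def jobs_per_stage(genomes, chromosomes, alignment_chunks=1):
    """Number of independent jobs in each stage of the workflow."""
    per_chromosome = genomes * chromosomes
    return {
        "alignment": genomes * alignment_chunks,
        "split_by_chromosome": per_chromosome,
        "realign_recalibrate": per_chromosome,
        "variant_calling": per_chromosome,
    }


bwa = jobs_per_stage(genomes=500, chromosomes=25)
# Assumed: 10 input chunks per genome, giving 5,000 Novoalign jobs.
novoalign = jobs_per_stage(genomes=500, chromosomes=25, alignment_chunks=10)

print(bwa, sum(bwa.values()))
print(novoalign, sum(novoalign.values()))
```

Most of these jobs are single-node and independent of each other, which is why the requirement is stated in nodes rather than flops.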


Blue Waters

Node type: Cray XE6 | Cray XK7
CPU: 2 x AMD “Interlagos” Opteron 6276 | 1 x AMD “Interlagos” Opteron 6276
GPU: NA | 1 x Nvidia “Kepler” Tesla K20x
Total nodes: 22,640 | 4,224
Total x86 cores: 362,240 | 33,792
Cores/node: 16 FP x86_64 cores, 2.45 GHz | 8 FP x86 cores, 2.45 GHz; 2688 CUDA cores
Memory/node: 64 GB | 32 GB (CPU) + 6 GB (GPU)

Storage: 26.4 petabytes (disk), 380 petabytes (nearline)
Interconnect: Cray “Gemini” 3D Torus
OS: Cray Linux 6


Part 2:

So, how far can we scale?
Data management issues

Image from http://fcw.com


Best Practices Workflow instantiated, runs


A minimalist workflow can process one human WGS sample in ~24 hours, depth-dependent

The first step, short read mapping, is where i/o bottlenecks are likely to occur as we scale up to 500 genomes


Peak node injection bandwidth is ~9.6 GB/sec

• Deeply sequenced samples: > 30X
• Accurate aligner takes a few days
• Max walltime is 24 hours
• Nodes have 64 GB RAM; alignments are > 100 GB
• Breaks up aligned pieces
• Reads/writes while sorting in RAM


Sort/Merge step can be particularly i/o intensive
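
To see why this step touches the disk so heavily, consider what any sorter must do when the data (alignments > 100 GB) exceed node memory (64 GB): sort pieces that fit in RAM, write each sorted piece out, then read all pieces back and merge them. The sketch below is a generic external merge sort on integer positions, offered only as an illustration of that pattern; it is not Novosort's implementation, and the function names are made up.

```python
import heapq
import os
import tempfile


def write_run(sorted_positions, directory):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{pos}\n" for pos in sorted_positions)
    return path


def read_run(path):
    """Stream the positions of one sorted run back from disk."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(positions, max_in_memory):
    """Yield positions in sorted order, holding at most max_in_memory at once."""
    with tempfile.TemporaryDirectory() as directory:
        run_paths, chunk = [], []
        for pos in positions:
            chunk.append(pos)
            if len(chunk) >= max_in_memory:
                run_paths.append(write_run(sorted(chunk), directory))
                chunk = []
        if chunk:
            run_paths.append(write_run(sorted(chunk), directory))
        # k-way merge: every run is read back from the file system.
        yield from heapq.merge(*(read_run(path) for path in run_paths))


print(list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3)))
```

Every record is written once and read once in addition to the original input and final output, so hundreds of simultaneous sorts multiply the load on the shared file system.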


OVIS data


At scale might cause:
• Contention in the network
• Contention at the entry into OSS
• Contention at the entry into OST

Let’s measure!


[Figure: walltime (hours, 0 to 4) vs. number of Novosort instances (1 to 1024). Series: 8 OSTs (JYC); 144 OSTs (BW projects, striped across, -1); 144 OSTs (BW projects, not striped); 1440 OSTs (BW scratch, striped across, -1). Annotations: Soy NAM; 1000 fungal genomes.]

Striping does not matter?
Potentially due to re-reading from Lustre cache (4 TB/sec? Wow!)


[Figure: two panels of walltime (hours, 1 to 4.5) vs. number of Novosort instances, each showing min, q1, median, q3 and max. Left panel: Projects, striped -1 (1 to 512 instances). Right panel: Scratch, striped -1 (1 to 1250 instances).]

Large file system is essential
There is always a bit of variability due to other users


Let’s run those 500 genomes!


File growth pattern during alignment


[Figure, labeled "reality": file size (bytes) vs. time (hours) for each sample during alignment.]

Blue – sample does not share an OST with any other samples
Green – sample shares an OST with some other sample
Red – left and right reads of the same sample share an OST

• Need access to i/o data per OST
• Want to know what everyone else is doing to a given OST

Solution to this bottleneck:
• Stripe width 3
• Break up samples into batches
• Stagger the batches

Then it runs fine.
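
The batching and staggering part of the solution could be sketched as below. The slide does not give the batch size or the delay between batches, so the values 50 and 30 minutes are placeholders, and `staggered_batches` is a hypothetical helper rather than part of the actual workflow.

```python
def staggered_batches(samples, batch_size, stagger_minutes):
    """Split samples into batches and assign each batch a start delay.

    Returns a list of (start_offset_minutes, batch) pairs, so that the
    batches do not all begin writing to the file system at once.
    """
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    return [(index * stagger_minutes, batch)
            for index, batch in enumerate(batches)]


samples = [f"sample_{n:03d}" for n in range(500)]
# Placeholder values: 50 samples per batch, 30 minutes between batches.
schedule = staggered_batches(samples, batch_size=50, stagger_minutes=30)

for offset, batch in schedule[:3]:
    print(f"t+{offset:3d} min: {batch[0]} .. {batch[-1]}")

# The stripe width is set on the Lustre side, for example with
# `lfs setstripe -c 3 <output_dir>` before the run (directory name not shown).
```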


Part 3:

Outlook:
Alternative solutions

Desired infrastructure


We can do it. What now?


Thanks to this research and the support of the BW team, we have
• Proven that analyzing 500 genomes/day is feasible on Blue Waters
• Documented bottlenecks and issues in workflow and data management
• Developed solutions to those problems

This enabled us to
• Serve the Mayo Clinic as part of the Mayo-Illinois Alliance
• Help H3ABionet analyze 350 WGS for the genotyping chip project
• Analyze Soybean NAM data

What is the long-term outlook from here?


Ultrafast => no need for checkpointing, only 2 output files

Monolithic => only 1-2 jobs, no workflow management needs


BALSA

The call for scientific community (Bill Gropp)

• prepare the software for the next generation of HPC

• help NSF figure out what it should be

Indeed: ultrafast monolithic solutions


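Overall variant calling concordance

Synthetic NEAT data, WGS 50X, 0.1% error rate
• GATK workflow (BWA, UG): PM 99.48%, FN 0.52%, FP 0.01%
• Genalice: PM 99.21%, FN 0.79%, FP 0.25%
• Isaac v2: PM 97.60%, FN 2.40%, FP 1.59%
• Dragen: PM 99.09%, FN 0.91%, FP 0.01%

NA12878, 60X depth WGS, GATK dataset
• GATK workflow (BWA, HC): PM 99.33%, FN 0.67%, FP 0.5%
• Genalice: PM 96.80%, FN 3.20%, FP 4.54%
• Isaac v2: PM 96.56%, FN 3.44%, FP 3.89%
• Dragen: PM 97.81%, FN 2.19%, FP 5.70%

NA12878_rep4 WGS, Illumina dataset
• GATK workflow: ---
• Genalice: PM 95.89%, FN 4.11%, FP 4.81%
• Isaac v2: PM 95.11%, FN 4.89%, FP 3.43%
• Dragen: PM 96.80%, FN 3.20%, FP 4.46%

Synthetic NEAT data, WES 50X, Chr1, 0.1% error rate
• GATK workflow (Novoalign, UG): PM 99.62%, FN 0.38%, FP 0.93%
• Genalice: PM 99.53%, FN 0.47%, FP 0.71%
• Isaac v2: ---
• Dragen: PM 99.22%, FN 0.78%, FP 0.11%

ERR250440: WES, ~50X, from 1000 genomes
• GATK workflow: N/A
• Genalice: PM 95.71%, FN 4.29%, FP 4.20%
• Isaac v2: PM 97.80%, FN 2.20%, FP 28.10%
• Dragen: PM 98.20%, FN 1.80%, FP 3.34%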


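Overall performance on one WGS 50X

BWA/GATK workflow
• Walltime: 21 hrs 41 min
• Max #nodes used: 25
• Max #cores used: 32 per node
• Max RAM used: 10 GB per node
• Total core-hours: 6,400
• Number of jobs: 1 – alignment; 25 – split by chromosome; 25 – realign/recalibrate; 25 – variant calls; 1 – merge all vcfs. Total = 77
• Number of files: 4 – alignment; 500 – realign/recalibrate; 50 – variant calls. Total = 554
• Disk footprint: ~ TB

Genalice
• Walltime: 22 min 44 sec
• Max #nodes used: 1
• Max #cores used: 50
• Max RAM used: 60 GB
• Total core-hours: 37
• Number of jobs: 1 – create gar; 1 – convert gar to vcf. Total = 2
• Number of files: 1 – output gar; 1 – alignment report; 1 – vcf. Total = 3
• Disk footprint: ~ few GB

Isaac v2
• Walltime: 1 hr 49 min
• Max #nodes used: 1
• Max #cores used: 96
• Max RAM used: 600 GB
• Total core-hours: 154
• Number of jobs: 1 – align; 1 – variant calls. Total = 2
• Number of files: 3 – alignment; 1272 – reporting; 4 – variant calls; 937 – temp files. Total = 2216
• Disk footprint: ~ TB

Dragen
• Walltime: 30 min 43 sec
• Max #nodes used: 1
• Max #cores used: 24+card
• Max RAM used: 56 GB
• Total core-hours: 12
• Number of jobs: 1 – align+variant calls. Total = 1
• Number of files: 587 – created and deleted throughout the run. Total = 587
• Disk footprint: 289 GB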


But the big data problem is still PB

Need to make progress making big data be small data =>

• Changing encoding protocols: letters to bits
• Compressing the data
• Computing on compressed data
• Changing the contents of the output files to encode the same information with fewer bits
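
As one illustration of "letters to bits": a nucleotide drawn from A, C, G, T carries two bits of information, yet a text file spends eight bits on it. The sketch below packs four bases into each byte. It is a minimal example with made-up function names, it ignores ambiguity codes such as N and all quality scores, and it is not the encoding used by any particular tool named in this talk.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the original string; length is needed because of padding."""
    return "".join(
        DECODE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


sequence = "GATTACA"
packed = pack(sequence)
print(len(sequence), "letters ->", len(packed), "bytes")
print(unpack(packed, len(sequence)))
```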


The best computer for genomic variant calling

Next generation of Blue Waters:

• HPC expertise and a solid, dedicated support team like that on BW is absolutely essential

• Must have ~>256 GB RAM per node
• Nodes must have internal storage: 1-4 TB
• We want lots of cores: 32-64


Acknowledgements

HPCBio

Victor Jongeneel

Gloria Rendon

Chris Fields

Cray

Bob Fiedler

Carlos Sosa

Pierre Carrier

Richard Walsh

Bill Long

Jef Dawson

NCSA Industry Engagement

Blue Waters support team

Evan Burness

Jim Long

Wayne Hoyenga

Greg Bauer

Victor Anisimov

Ryan Mokos

Kalyana Chadalavada

Alex Parga

Jeremy Enos

Andriy Kot

Jason Alt

Craig Steffen

CompGen

Ravi Iyer

Subho Banerjee

Arjun Athreya

Zachary Stephens

Innovative Systems Lab

Volodymyr Kindratenko

H3A bionet: UCT

Gerrit Botha

Ayton Meintjes

Nicola Mulder


Network: @scale ready?

10G path everywhere

… and yet 2.5 M/sec transfer rate !


[Figure: number of files placed by the workflow onto an OST (0 to 14) vs. Object Storage Target ID (0 to 1400).]

One workflow run on 10 synthetic whole human genomes

Black circles denote files less than 100 GB in size; red circles are ~500 GB files, and yellow and green circles are ~150GB files created in two different steps.

Actually, Lustre does a pretty good job


But it loses opportunity to backfill and costs > $$$

Unbundle, they say!

… when we got up to 4,000 jobs in the queue, Torque stopped talking to Moab, so we killed the jobs.


Job management

Solution: wrap multiple SMP jobs with a launcher, turning them into a single MPI job

• A single multi-node reservation is made on the cluster
• Launcher is started within that reservation
• It launches each task within this reservation
• As tasks complete, it launches new ones, until the list of tasks is exhausted

[Diagram: Task 1, Task 2, Task 3, … Task N run inside A SINGLE MPI JOB; labels: INPUT DATA, Data 1, Data 2, Data 3, … Data N, OUTPUT DATA.]

Victor Anisimov, NCSA Blue Waters support group
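
The launcher idea (a fixed pool of slots that picks up the next task whenever one finishes) can be illustrated on a single machine with Python's standard library. This is only a sketch of the scheduling pattern: worker threads stand in for the MPI ranks of the real launcher, and it should not be read as the Blue Waters launcher's implementation. The names `run_task` and `launch` are made up.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_task(command):
    """Run one independent task as a subprocess and return its exit code."""
    return subprocess.run(command, check=False).returncode


def launch(commands, slots):
    """Run every command using at most `slots` concurrent workers.

    All tasks are queued up front; whenever a worker finishes a task it
    takes the next one, until the list of tasks is exhausted.
    """
    exit_codes = {}
    with ThreadPoolExecutor(max_workers=slots) as pool:
        futures = {pool.submit(run_task, command): index
                   for index, command in enumerate(commands)}
        for future in as_completed(futures):
            exit_codes[futures[future]] = future.result()
    return [exit_codes[index] for index in range(len(commands))]


if __name__ == "__main__":
    # Placeholder tasks: each one just prints its own number.
    tasks = [[sys.executable, "-c", f"print('task {n} done')"]
             for n in range(8)]
    print(launch(tasks, slots=3))
```

From the scheduler's point of view this is one job, which keeps the queue short however many tasks the workflow generates.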


Obama announces Precision Medicine Initiative

" to bring us closer to curing diseases like cancer and diabetes – and to give all of us access to the personalized information we need to keep ourselves and our families healthier."

"I want the country that eliminated polio and mapped the human genome to lead a new era of medicine – one that delivers the right treatment at the right time,"

U.S. President Barack Obama delivers his State of the Union address to a joint session of the U.S. Congress on Capitol Hill in Washington, January 20, 2015. Reuters/Jonathan Ernst

NIH http://www.nih.gov/precisionmedicine/

Precision medicine is an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person.


Kinds of challenges

Data management

1. Large total data footprint

2. Large number of files

Workflow management

3. Large number of simultaneous but independent non-MPI computations

4. Keeping track of what was done to the data: large amount of metadata

5. Workflow bottlenecks: fans and merges, followed by fans


Incoming data
• auto-md5
• auto-archive
• stream directly into the workflow

Output data
• auto-check for correctness at every step
• auto-archive during/after computation
• auto-stream to the recipient

Identifying potential i/o bottlenecks
• uneven file distribution
• simultaneous file access
• saturating i/o in certain steps of the workflow
• impact on metadata servers
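
The "auto-md5" check on incoming data might look like the sketch below: the file is hashed in fixed-size chunks, so even a file of several hundred GB never has to fit in memory, and the result is compared with the checksum supplied by the sender. The function names and the 1 MiB chunk size are assumptions made for this example.

```python
import hashlib


def md5_of_file(path, chunk_bytes=1 << 20):
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file matches the checksum supplied with it."""
    return md5_of_file(path) == expected_md5.lower()
```

A workflow would call `verify` on each incoming file before queueing it for alignment, and set aside any file that fails the check.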


Data management

Solved problems in some other areas of science; hope to learn, borrow and adapt solutions

Have done a lot of profiling, identified corner cases, worst case scenarios

Blue Waters:

Craig Steffen, Jeremy Enos, Ryan Mokos, Jason Alt, Galen Arnold, Greg Bauer

CSL: Subho Banerjee, Arjun Athreya, Zachary Stephens, Dr. Ravi Iyer


Variant calling: a production case

Human Heredity and Health in Africa
A massively collaborative project

To profile the genotypic diversity across the African continent

Help cure diseases
Help understand human evolution

> 2,000 genomes total
~350 genomes sequenced at 30X depth,
to arrive in batches of 50 genomes
at Baylor


Topology awareness is particularly important during merge-sort

Peak node injection bandwidth is ~9.6 GB/sec

Bottleneck because nodes have 64 GB RAM and due to merge sorting in RAM

Need 256 GB


Petascale computing requirements

Genotyping every baby being born? 500 genomes/day in the state of Illinois result in:

Input
• 300-600 GB/genome
• 150-300 TB/day
• 2 files/genome = 1000 files

Intermediary
• 1-3 TB per sample
• 0.3-1.5 PB/day total
• 525 files/sample = 262,500 files total

Output
• < 500 MB per sample
• 26 files/sample = 13,000 files total
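
The daily file counts and the input volume on this slide are per-sample figures multiplied by 500 genomes/day. The sketch below reproduces those products, assuming decimal units (1 TB = 1000 GB); the helper name `per_day` is made up for the example. Only the upper intermediary bound (3 TB per sample) is computed, since the slide does not state how its lower bound was obtained.

```python
GENOMES_PER_DAY = 500


def per_day(amount_per_sample):
    """Scale a per-sample quantity to one day of 500 genomes."""
    return GENOMES_PER_DAY * amount_per_sample


input_files = per_day(2)            # 2 files per genome
intermediary_files = per_day(525)   # 525 files per sample
output_files = per_day(26)          # 26 files per sample

# Input volume: 300-600 GB per genome, reported in TB per day.
input_tb_low = per_day(300) / 1000
input_tb_high = per_day(600) / 1000

# Upper bound on intermediaries: 3 TB per sample, reported in PB per day.
intermediary_pb_high = per_day(3) / 1000

print(input_files, intermediary_files, output_files)
print(input_tb_low, input_tb_high, intermediary_pb_high)
```

The file count matters as much as the byte count here: more than a quarter of a million intermediary files per day is a load on the metadata servers in its own right.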
