
Accelerating Next Generation Sequencing Secondary Analysis with Dell EMC Isilon and NVIDIA Parabricks

White Paper

This white paper describes a modular, scale-out solution architecture composed of NVIDIA Parabricks application software, the NVIDIA DGX-1 system, and Dell EMC Isilon network-attached storage (NAS) capable of analyzing the daily output of an Illumina NovaSeq 6000 sequencing system, or approximately 24 40X whole human genome sequences (WGS) per day. This solution architecture can scale out to process over 1,000 WGS per week. This paper also highlights variables to consider when building out a technical computing environment designed to accelerate the secondary analysis of NGS data.

© 2020 Dell Technologies or its subsidiaries.


Table of Contents

Revisions
Executive Summary
Introduction
Building Blocks For Modular, Scale-out GPU Solution Architecture to Accelerate NGS Secondary Analysis
Evaluating Wall Clock Time for Secondary Analysis
Discussion
References
Appendix


Revisions

DATE          DESCRIPTION                                           AUTHOR
March 2020    Revised to reflect NVIDIA acquisition of Parabricks   E. Sasha Paegle
August 2019   Initial release                                       E. Sasha Paegle

The information in this publication is provided “as is.” DELL EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication and specifically disclaims implied warranties of merchantability or fitness for a purpose. Use, copying, and distribution of any DELL EMC software described in this publication require an applicable software license.

DELL EMC2, DELL EMC, the DELL EMC logo are registered trademarks or trademarks of DELL EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners.

© Copyright 2019 DELL EMC Corporation. All rights reserved. Published in the USA. 08/19.

DELL EMC believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

DELL EMC is now part of the Dell group of companies.


Executive Summary

Next-Generation Sequencing (NGS) is a combination of laboratory instrumentation technologies and analysis methods to identify patterns in DNA, the code of life, at dramatically increased resolution and quality. The cost of acquiring NGS data continues to decline exponentially, while the volume of NGS data is doubling every year (Stephens, 2015). The latest NGS instrumentation produces five times more data than the previous generation of instrumentation. As the capacity to sequence DNA continues to increase, organizations like the Global Alliance for Genomics and Health (GA4GH) estimate that over 60 million patients will have their DNA sequenced in a healthcare context by 2025 (Birney, 2017).

However, Secondary Analysis, the conversion of raw NGS data into a usable DNA sequence that is then compared to a reference, may require a significant amount of time (CPU-hours) to complete. Depending on available computing and storage resources, software, and analysis methodology, secondary analysis time can range from minutes to days. Ideally, there are enough computing and storage resources that the output of secondary analysis keeps pace with the rate of raw NGS data generation. The goal is to avoid a secondary analysis backlog and to ensure processed data are sent on to downstream analysis and interpretation as fast as possible.

This white paper describes a modular, scale-out solution architecture composed of NVIDIA Parabricks application software, NVIDIA Tesla V100 GPUs, and Dell EMC Isilon network-attached storage (NAS) capable of analyzing the daily output of an Illumina NovaSeq 6000 system, or approximately 24 40X whole human genome sequences (WGS) per day. This solution architecture can scale out to process over 1,000 WGS per week. This paper also highlights variables to consider when building out a technical computing environment designed to accelerate the secondary analysis of NGS data.

Intended Audience

Scientists who are responsible for the analysis of NGS data and IT professionals who are responsible for providing a technical computing environment designed to support NGS applications are encouraged to read this paper.

Acknowledgments

We thank Parabricks Inc. for providing the Parabricks application suite. We thank “Shop” Mallick, The David Reich Laboratory, Harvard Medical School, and the Simons Foundation for providing access to source data generated by the Simons Genome Diversity Project. We also thank Glen Otero, VP of Scientific Computing, Translational Genomics Research Institute (TGen), and Kihoon Yoon, Principal Engineer, Dell HPC & AI Innovation Laboratory, for input and consultation. Groupware Technology (Santa Clara, California) provided the lab environment used to perform the testing described herein.


Introduction

DNA is the code of life. This molecule carries the genetic instructions for growth, development, and reproduction of all living organisms. The building block of DNA is a four-letter code: "A..T..G..C.” These four letters are referred to as “bases.” The order of the As, Ts, Gs, and Cs is responsible for traits like eye color or drug sensitivity. DNA sequencing is the process of writing out the order of the bases for an organism of interest. The entire complement of DNA for an organism is a genome. After approximately ten years and over $2.7B, the first draft of the human genome sequence was published in April 2003 (NHGRI, 2019).

Next-generation sequencing (NGS) automates the rapid sequencing of DNA and can produce a human genome in approximately 24 hours. Consequently, NGS now plays an increasingly important role in clinical practice and public health. The information encoded in a person's genome is instrumental in assessing the response to diagnosis, treatment, and disease prevention strategies due to person-to-person variability (Suwinski, 2019). Identifying variants, or differences, for a genome is done by comparing an individual’s genome to a DNA reference sequence. Also known as Secondary Analysis, this process for generating a list of variants can take minutes to days depending on the available software, computing, and storage resources.

Keeping Pace with NGS Data Generation While Reducing Secondary Analysis Time

Extending this approach to assess the genetic variability of patient populations requires operating the latest NGS instrumentation and computing resources at scale. For example, the latest Illumina NovaSeq 6000 system can output approximately five times more DNA bases than the previous generation of instrumentation (Illumina Inc., 2019). One Illumina NovaSeq system can produce between ~1.5 to 2.5 TB of raw data per day, representing approximately 20 to 48 whole genome sequences (WGS) per day1. Today it is not uncommon for life science organizations to operate more than one NGS instrument and routinely process from 200 to over 1000 samples per week. Ideally, an organization has enough computing and storage resources matched to the output capacity of its fleet of sequencing instruments such that the rate of secondary analysis keeps pace with the rate of raw NGS data generation. Otherwise, the organization risks experiencing an analysis backlog.

1 Sequencing output depends on NGS instrumentation, application and analysis methodology.

Figure 1. Keeping Pace with Data Generation


Working with NGS Data2

The desired product of whole-genome sequencing (WGS) is a list of variants, or differences, for a given sample when compared to a reference genome. Although motivations may differ, minimizing the time to generate this list of variants is a common goal shared by many healthcare and life science organizations. Research organizations competing for grant awards want to move into variant interpretation and analysis as soon as possible while avoiding costly false positives. To recognize revenue, a DNA sequencing provider must return a list of variants to its customer per agreed-on timelines. And in a clinical setting, a diagnostic variant report is needed at a speed that impacts the care of a patient.

To better understand how software, computing, and storage technology choices impact the time to generate a variant list, it is worthwhile to review the three analysis phases of NGS data (Figure 2).

Primary Analysis

A primary analysis is the NGS instrument-specific steps needed to call DNA bases and compute quality scores for each base. The most common output file format for these data as they arrive from the sequencer is FASTQ. The FASTQ format is ASCII text data containing the short sequences of DNA bases3 and an associated quality score for each base. These short sequence data are un-ordered and un-aligned and commonly referred to as reads. Depending on the type of sequencing instrument, instrument settings, and application, FASTQ files can range from a small number of large (> 120 GB) files to an extremely large number of smaller files.

Secondary Analysis

During a secondary analysis, the raw reads contained in one or more FASTQ files are mapped and aligned to a reference genome. A Binary Alignment Map (BAM) file is the output and represents the genome for the sample of interest. The genome of interest (i.e., the BAM file) is passed to a variant calling step, which identifies the significant differences, or variants, between the genome of interest and a reference genome. The identified differences are written to a variant call file (VCF).
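For orientation, the conventional CPU-based version of this FASTQ-to-BAM-to-VCF flow can be sketched with the open-source tools cited in this paper (BWA-Mem and GATK). This is an illustrative sketch with hypothetical file names, not the accelerated path described later:

# Map and align paired-end reads, then coordinate-sort the output into a BAM
bwa mem -t 32 Homo_sapiens_assembly38.fasta sample_1.fastq.gz sample_2.fastq.gz | \
  samtools sort -o sample.bam -
samtools index sample.bam          # index the BAM for random access
# Call variants against the reference and write them to a compressed VCF
gatk HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I sample.bam \
  -O sample.vcf.gz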

Tertiary Analysis4

A tertiary analysis focuses on interpreting the variants for a given sample or a population of samples to understand their significance in the context of additional biological and clinical information.

2 There are many NGS applications. The patterns for working with NGS data are common across applications. For simplicity this document focuses on whole genome sequencing (WGS).
3 Typically, short read segments range from 75 to 250+ bases depending on NGS application.
4 Tertiary analysis is beyond the scope of this paper.


    Figure 2: Three Phases of NGS Analysis

Reducing Secondary Analysis Time To Keep Pace With NGS Data Generation

Due to the size of individual sample data and the volume of samples, WGS secondary analysis is a compute- and storage-intensive process. The most commonly used and cited methods for secondary analysis include the Burrows-Wheeler Alignment (BWA-Mem) (Li, 2009) and the Genome Analysis Tool Kit (GATK) (McKenna, 2010). Using the Broad GATK Best Practices workflow (pipeline) requires over 30 hours to process5 a 30X WGS (Goyal, 2017). Analyzing a few genomes per day is far from ideal when a modern, high-throughput NGS instrument can generate unanalyzed, raw NGS data for 20 or more WGS per day.

It is important to consider critical variables that may impact the total secondary analysis (wall-clock) time when choosing technologies that enable secondary analysis of NGS data. These variables range from the type of NGS sequencing application, analysis software and strategies, output file types, and application file access patterns, to the number and type of available computing resources.

Sequence Depth Of Coverage

When planning time and resources to complete secondary analysis, it is essential to be aware of the sequencing depth of coverage (aka coverage) for sample data, as it will impact analysis time per sample. Coverage describes the average number of reads that align to, or "cover," a known reference sequence. The coverage often determines if a variant exists with a certain degree of confidence at a specific genomic location. Coverage requirements vary by sequencing application. For example, 30X to 50X coverage is common for human WGS applications (Illumina, 2019). However, the analysis of cancer genomes may require sequencing to a depth of coverage higher than 100X to achieve the necessary sensitivity and specificity to detect rare variants (Griffith, 2015).

    5 48 core Intel Xeon E5-2697v2 12C, 2.7 GHz processors with 128 GB RAM and 3.2 TB SSD, CentOS 6.6


Coverage is also a measure of the amount of data per sample. As coverage increases, so does the amount of data per sample. For example, a 30X (coverage) WGS sample contains approximately three times more data than a 10X WGS sample, which means secondary analysis time also increases.
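As a rough rule of thumb (assuming a human genome of about 3.1 billion bases and counting sequenced bases only, not quality strings or read metadata):

raw sequenced bases ≈ genome size × depth of coverage
10X WGS ≈ 3.1 Gbases × 10 ≈ 31 Gbases
30X WGS ≈ 3.1 Gbases × 30 ≈ 93 Gbases (about three times the data of the 10X sample)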

Interplay Between Analysis & Computing Resources

Given the many degrees of freedom between software and computing choices, navigating them can be one of the most challenging and time-consuming tasks in minimizing secondary analysis time. Organizations with access to deep computer science expertise may implement system-level optimizations achieving a 70% reduction in execution time (Kathiresan, 2017). Alternatively, it can be as simple as updating existing server technology, yielding a 12% increase in daily output (Yoon, 2018). To avoid limitations of hardware scalability, modern accelerator hardware architectures such as GPUs or FPGAs in combination with purpose-built software can lead to significant reductions in secondary analysis times. For example, using Tesla V100 GPUs, Dell Technologies demonstrated, in collaboration with Parabricks, over a 25x reduction in analysis time compared to a CPU-only solution6 (Dell Technologies, 2018).

Data Placement

Like the interplay between software and computing resources, data storage solutions and their related file systems also offer opportunities to accelerate secondary analysis. It is worthwhile to inspect and understand the general file access patterns of the methods used in the secondary analysis. Some applications used in secondary analysis, especially those like BWA used for sorting and alignment, can create many temporary files. As a best practice, these temporary files should be placed on direct-attached storage (DAS) when feasible, instead of on network file storage. However, mounting a temporary or scratch directory on a shared storage resource is an acceptable approach if it introduces opportunities for eliminating manual, time-consuming steps to stage large data sets next to computing resources. This approach also offers opportunities to minimize or prevent data loss.

Storage Media Types

Implementing secondary analysis workflows using shared storage resources often prompts groups to purchase more expensive flash storage with the anticipation that it will significantly reduce analysis time. However, the benefits from flash storage are highly dependent on the software application, available compute host memory, data set size, and application IOPS requirements. Only 50% of commonly used bioinformatics tools demonstrated a 2x or greater speed-up from flash or solid-state disks (Lee, 2016). Relative to other technical constraints, using lower-cost hard disk drives (HDD) is a perfectly acceptable approach.

Simplifying Choices

To simplify and streamline technology choices that lead to significantly reduced secondary analysis times while keeping pace with NGS data generation, Dell EMC and Parabricks set out to identify a modular, easy-to-scale reference architecture using a technical computing environment composed of Parabricks application software, NVIDIA DGX-1 systems, and Dell EMC Isilon network-attached storage capable of processing more than 1000 WGS per week.

    6 18 core Intel Xeon E5-2699 18C, 3.0 GHz processors with 384 GB RAM and 12 TB SAS, Red Hat 7.6


Building Blocks For Modular, Scale-out GPU Solution Architecture to Accelerate NGS Secondary Analysis

Figure 3 illustrates the technical computing environment hosted at Groupware Technology7 used to evaluate the acceleration of NGS secondary analysis. It is composed of eight DGX-1 systems, a Dell EMC Isilon H500 storage cluster, networking, and Parabricks application software.

Note that in a customer deployment, the number and type of GPU systems and Isilon storage nodes will vary and can be scaled independently to meet the requirements specific to an organization (Dell EMC, 2018). The last section of this document will discuss a starting configuration that can be scaled out as requirements change.

Storage: Dell EMC Isilon H500

The Dell EMC Isilon H500 storage cluster is used by many life science organizations today. It offers a reliable, easy-to-manage, and cost-effective balance between the performance and capacity needed to support NGS secondary analysis and other heterogeneous bioinformatics workflows.

The Dell EMC Isilon H500 is a hybrid (H) storage platform powered by the Isilon OneFS operating system. It uses a highly versatile yet straightforward scale-out storage architecture to speed access to massive amounts of data, while dramatically reducing cost and complexity. This hybrid platform uses a mix of HDD and flash drives that delivers up to 5 GB/s bandwidth and capacity ranging from 120 TB to 480 TB per chassis. Isilon hybrid storage systems integrate easily with Isilon All-Flash (e.g., F800) and Isilon Archive (e.g., A200) chassis as well as existing Isilon clusters.

Compute: NVIDIA DGX-1 System

The DGX-1 system is a fully integrated, turnkey hardware and software system that is purpose-built to accelerate deep learning (DL) and other technical computing workflows. Each DGX-1 system hosts eight Tesla V100 GPUs configured with NVLink technology in a hybrid cube mesh topology that provides an ultra-high bandwidth, low-latency fabric for inter-GPU communication. DGX-1 systems provide high bandwidth, low latency network interconnects for multi-node clustering over RDMA-capable fabrics.

The NVIDIA GPU Cloud (NGC) container registry provides researchers, investigators, and developers with simple access to a comprehensive catalog of GPU-accelerated software for AI, machine learning, and HPC workflows that take full advantage of the NVIDIA DGX-1 GPUs on-prem and in the cloud. The Appendix provides specific DGX-1 cluster configuration information.

Networking: Arista 7060CX2-32S

The Arista 7060CX2 is a 1RU, high-performance, high-density, fixed-configuration 40 GbE and 100 GbE data center switch with wire-speed Layer 2 and Layer 3 features for software-driven cloud networking. It delivers a rich choice of interface speeds and densities, allowing networks to seamlessly evolve from 10 GbE and 40 GbE to 25 and 100 GbE. The switch supports IEEE 25 GbE and provides a shared packet buffer pool of 22 MB with 450 ns latency.

    7 Visit Groupware Technology for more information about the NVIDIA DGX-1 POC program. https://www.groupwaretech.com/



Software: Parabricks Application Suite

Parabricks is a software suite for performing secondary analysis of next-generation sequencing (NGS) data. The suite provides access to many GPU-accelerated mapping, alignment, post-processing, and variant calling methods. Users can construct secondary analysis pipelines designed to deliver results at fast speeds and low cost. Parabricks analyzes a whole human genome in about 45 minutes, compared to about 30 hours using traditional CPU hardware for 30X WGS data.

The Parabricks software suite runs on a range of GPU platforms available on-prem or in the cloud. It scales linearly with the number of GPU resources. Results produced by Parabricks are consistent across different GPU platforms, and each execution generates the same results. The results are equivalent to the Broad GATK Best Practices pipeline. The current version of Parabricks supports all versions of GATK through v4.0.4. Furthermore, Parabricks analysis pipelines are readily customizable, and new steps can be added effortlessly. Additional information about the Parabricks application suite is summarized in the Appendix.

Figure 3. Dell EMC Isilon, NVIDIA Parabricks & DGX-1 System Test Environment


Evaluating Wall Clock Time for Secondary Analysis8

Methodology

To determine the recommended software and hardware configuration capable of keeping pace with the daily output of the latest NGS instrumentation, three test cases were evaluated. For each case, the observed wall clock time was recorded for the Parabricks secondary analysis pipeline(s) using different resource configurations, data layouts, and sample data.

Sample Data Sets

Whole Genome Sample NA12878

NA12878 is a human sample genome of Caucasian ancestry that is part of the CEPH Utah Reference Collection. It has been extensively studied, and WGS data generated from NA12878 is often used to benchmark secondary analysis pipelines. The sample input FASTQ data represent a 50X WGS for sample NA12878. The Appendix provides additional background information about sample data from NA12878.

Simons Genome Diversity Project

In 2016 the Simons Genome Diversity Project (SGDP) published one of the most extensive datasets of diverse, high-quality human genome sequences ever reported (Mallick, 2016). Of the 300 human samples collected, 279 samples are publicly available for secondary research. Rather than relying on a single sample data set like NA12878, NGS sequencing data from the SGDP is more representative of the variation and scale typically encountered in NGS labs today. The sequencing depth of coverage spans from 35X to over 80X, and the average depth of coverage for the entire cohort is 43X (Figure 4). The NGS data was provided as BAM files. The Appendix describes the BAM to FASTQ file conversion and provides additional information about the SGDP project data.

8 A Comment On Benchmarks. The benchmark results reported here were generated in May 2019. Benchmarking results are subject to variables such as hardware configuration, software versions, and source data. When comparing the results summarized in this paper to benchmark results generated elsewhere, be sure to understand the impact of these variables. Reach out to a Dell or Parabricks representative for the most up-to-date benchmark information.

    Figure 4. SGDP Sequence Coverage Distribution


Parabricks Secondary Analysis Pipeline

Analysis performed on NGS data is often described as a pipeline. A pipeline is simply a collection of methods or operations where the output of one operation becomes the input for the next operation. Four critical operations (mapping, alignment, pre-processing, and variant calling) make up most secondary analysis WGS pipelines.

Parabricks is a software suite of genomic analysis methods designed to take advantage of GPU acceleration. Many of the Parabricks methods are functionally equivalent to existing open-source methods. Parabricks operations are stitched together to create a secondary analysis pipeline best matched to the requirements of the sequencing application of interest, such as germline or somatic analysis. Parabricks is available as either a Docker or Singularity container and uses a variety of GPU resources. Figure 5 highlights the Parabricks v2.3.7 application suite.

Figure 6 illustrates the Parabricks “germline” pipeline used for benchmarking. It combines all the steps from fq2bam with GATK-HaplotypeCaller (HC)9 into a single command. For each test case below, sample paired-end FASTQ files are submitted to the germline pipeline. The BAM files generated by the germline pipeline are also submitted to the Parabricks DeepVariant (DV) operation. Each variant caller, HC and DV, outputs gVCF files, which adds 10 to 15 minutes of execution time to each variant calling step10. Each test case recorded the wall-clock time for each stage of the pipeline for each sample.
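A minimal germline invocation looks like the following (a sketch condensed from the full benchmarking script in the Appendix; the file paths are illustrative):

# Germline pipeline: paired-end FASTQ in, sorted BAM plus gVCF out, on four GPUs
pbrun germline \
  --ref Ref/Homo_sapiens_assembly38.fasta \
  --in-fq sample_1.fastq.gz sample_2.fastq.gz \
  --out-bam sample.bam \
  --out-variants sample.g.vcf.gz --gvcf \
  --num-gpus 4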

9 The Parabricks HC is equivalent to the accurate sequential GATK-HC Version 4.0.4.
10 When performing variant discovery, greater sensitivity is achieved when jointly calling variants across multiple samples. However, this strategy is computationally expensive and does not scale well. The key difference between a regular VCF and a gVCF file is that the gVCF file records variant information for every genomic location (i.e., site), whether a variant call exists or not. The goal is to have every site represented in the file to perform joint analysis of a cohort in subsequent steps designed to overcome computation and scaling bottlenecks. The gVCF output generated in this exercise will be used in future tertiary analysis pipeline benchmarking studies. See the Broad Best Practices Guide for more on this topic: https://software.broadinstitute.org/gatk/

    Figure 5. Parabricks Application Suite



    Figure 6. Parabricks Germline Pipeline

Why Two Variant Callers?

Calling the genetic variants present in an individual genome relies on billions of short, error-prone sequence reads. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome (Poplin, 2016). Many groups run consensus variant calling pipelines, which use more than one variant calling method to minimize the likelihood of missing a variant.

DeepVariant, a variant calling method developed by Google that applies a deep convolutional neural network, has been shown to outperform expert-driven statistical methods. However, calling variants for a 30X human genome and writing the variants out to a gVCF file takes approximately four hours and requires at least 1024 compute cores11. The Parabricks GPU-accelerated version of DeepVariant executes in less than 20 minutes for a 30X genome. The fast analysis time makes it possible to use DeepVariant alone or in combination with other expert-driven methods while minimizing the potential of creating a secondary analysis backlog.
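This paper does not prescribe a particular consensus scheme, but as one illustration, the intersection of the two call sets could be taken with bcftools, a tool outside the tested environment (the file names here are hypothetical):

# Keep only the sites called by both HaplotypeCaller and DeepVariant
# (both inputs must be bgzip-compressed and indexed VCFs)
bcftools isec -n=2 -p consensus_dir sample_hc.vcf.gz sample_dv.vcf.gz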

Software Tuning and Environment Validation

The Isilon storage cluster used the default OneFS settings. Isilon OneFS SmartConnect was disabled, and each DGX-1 system was directly mounted to an Isilon storage node for all test cases unless stated otherwise. The DGX-1 systems (i.e., the clients) were mounted over NFSv3 using the recommended NFS settings:

“async,nolock,rw,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288”
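On a client, these options translate into a mount command along the following lines (a sketch; the node address, export path, and mount point are illustrative):

# Mount an Isilon export over NFSv3 with the recommended options
sudo mount -t nfs -o vers=3,async,nolock,rw,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288 \
  isilon-node1:/ifs/data /mnt/isilon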

iPerf, a tool used to test the maximum achievable bandwidth on IP networks, was used to validate the throughput of the IP network path from an Isilon node to a DGX-1 compute node NIC.
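A typical iPerf check pairs a server on one endpoint with a client on the other, for example (the hostname is illustrative):

iperf -s                      # run iPerf in server mode on one endpoint
iperf -c isilon-node1 -P 4    # from the DGX-1, drive four parallel streams to it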

The Tesla V100 GPUs were set with maximum frequency (boost) enabled:

#!/bin/bash
# Read the maximum supported memory and SM clocks from GPU 0
max_mem_freq="$(nvidia-smi -q -i 0 | grep Memory | grep MHz | tail -n1 | cut -d':' -f2 | cut -d' ' -f 2)"
max_SM_freq="$(nvidia-smi -q -i 0 | grep SM | grep MHz | tail -n1 | cut -d':' -f2 | cut -d' ' -f 2)"
# Enable persistence mode, disable auto boost, and pin the application clocks
sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac ${max_mem_freq},${max_SM_freq}

    Reference files needed for Parabricks operations are stored locally on each DGX-1 system.

    11 For more on DV and best practices see https://cloud.google.com/genomics/docs/tutorials/deepvariant



Correct installation and operation of the Parabricks software were verified by executing the Parabricks germline pipeline using the NA12878 FASTQ sample data set. A simple bash script and the related output log for the germline pipeline are shared in the Appendix.

    Test Cases

Three test cases were evaluated to determine the recommended solution architecture. Test Case 1 evaluates the impact of data layout on secondary analysis time. Test Case 2 examines the impact of the compute to storage resource ratio on analysis time. Test Case 3 evaluates the effective daily sample throughput using 4 or 8 GPUs per sample.

Test Case 1: Impact of Data Layout on Secondary Analysis Time

The purpose of this test case is to determine the optimal data layout and recommendations for secondary analysis using the Parabricks application suite and NVIDIA DGX-1 systems. Earlier benchmarking studies demonstrated that it is best to use a mixed data layout where input and output directories are mounted on the Isilon storage cluster, and a temporary directory located on the direct-attached storage (DAS) of the compute node is used for intermediate files.

The wall clock time was recorded for each step of the Parabricks germline pipeline and the DeepVariant step using three different data layouts (see the sketch after this list):

• Mixed (Default): Input and output directories are mounted on the Isilon storage cluster; a temporary directory is mounted from the DAS on each compute node.

• All Isilon: Input, output, and temporary directories are mounted on the Isilon storage cluster.

• All Local: Input, output, and temporary directories are mounted on the DAS on the DGX-1 system.
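For example, the Mixed layout might be realized as follows (a sketch; the mount points and paths are illustrative, and --tmp-dir is the same Parabricks flag used in the Appendix script):

# Input and output on the Isilon NFS mount, temporary files on local DAS
sudo mount -t nfs isilon-node1:/ifs/ngs /mnt/isilon
mkdir -p /raid/scratch/tmp
pbrun germline \
  --ref /raid/refs/Homo_sapiens_assembly38.fasta \
  --in-fq /mnt/isilon/input/sample_1.fastq.gz /mnt/isilon/input/sample_2.fastq.gz \
  --out-bam /mnt/isilon/output/sample.bam \
  --out-variants /mnt/isilon/output/sample.g.vcf.gz --gvcf \
  --num-gpus 4 \
  --tmp-dir /raid/scratch/tmp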

A subset of 64 data samples (paired FASTQ files) representing a range of WGS with sequence coverage from 35X to 80X was submitted to the germline pipeline followed by a DeepVariant step. Each sample used 4 GPUs per pipeline job. Each DGX-1 system was pinned to a named Isilon node (a 1:1 compute to storage node ratio).

Figure 7 summarizes the wall clock times for each stage of the germline pipeline. The wall clock time increases as the sequencing depth of coverage increases. When submitting two high coverage samples (> 55X) to a single DGX-1 system, pipeline jobs would fail due to limited host memory. In these cases, high coverage samples were submitted individually to a single DGX-1 system.

The All Isilon data layout generates the fastest wall-clock time during the BWA stage (which includes pre-processing operations). The All Isilon data layout was 18% faster than the mixed, default data layout. The BWA stage generates over 200 temporary files and consumes approximately a terabyte of storage per input sample data set. Using an All Isilon layout for the BWA stage is more forgiving than the other two data layouts, as there is no contention for storage resources on the compute node.

The BWA stage of the pipeline is the most read/write intensive stage, and the Isilon storage cluster performance was not significantly taxed. Peak throughput of 1.6 GB/s was observed using the mixed data layout, consuming 20% of the storage CPU resources. The All Isilon data layout generated a peak throughput of 5 GB/s, consuming 30% of the Isilon storage CPU resources.

During the HaplotypeCaller and DeepVariant stages, the storage cluster was mostly idle. The DGX-1 system CPU and GPU utilization varied for each stage of the pipeline. HaplotypeCaller used nearly 100% of the GPU resources while DeepVariant consumed about 30% of available GPU resources. The mixed data layout produced the fastest wall clock time, although the difference in wall clock times between the three data layouts is less than 10%. A similar pattern was observed for the DeepVariant step.

Figure 7. Impact of Data Layout on Secondary Analysis Time. Matching data layout to pipeline stage introduces opportunities to minimize secondary analysis time. An All Isilon data layout (blue) generates a shorter wall-clock time per sample relative to the Mixed (orange) or All Local (grey) data layouts.

Test Case 2: Impact of Compute to Storage Node Ratio on Secondary Analysis Time

Shared storage in NGS environments often services requests from external client applications in addition to the requests generated by the secondary analysis. Shared storage systems also consume resources to perform file system maintenance functions. This test case evaluates secondary analysis time using different DGX-1 system (i.e., client) to Isilon H500 node ratios to simulate a storage system under load.

A Generation 6 Dell EMC Isilon H500 4U storage chassis consists of four storage nodes. Each node provides disk, RAM, and CPU resources. OneFS, the Isilon file system, aggregates the Isilon node hardware components so that the whole becomes greater than the sum of the parts. The RAM is grouped into a single coherent cache, allowing I/O on any part of the storage cluster (or chassis). For access to one or more files, disk spindles and CPU are combined, increasing throughput, capacity, and IOPS as the cluster grows with each additional chassis.

The same 64 sample data sets used in Test Case 1 were submitted to the germline pipeline and DeepVariant step using different DGX-1 system to Isilon node ratios for each data layout. Each sample used four GPUs per pipeline job. By limiting the front-end network interface on the Isilon cluster, secondary analysis wall clock time was observed for three ratios:


    • One DGX-1 system: One Isilon Node (or 1:1), the default mode.

    • Two DGX-1 systems: One Isilon Node (or 2:1). 16 V100 GPU clients connect to a single Isilon node.

    • Four DGX-1 systems: One Isilon Node (or 4:1). 32 V100 GPU clients connect to a single Isilon node.

Figure 8 summarizes the wall clock time using different DGX-1 system to Isilon node ratios. In most cases, the difference between the 1:1 and 4:1 resource ratios across the different data layouts was within 5%. The 1:1 resource ratio was 8.5% faster than 4:1 using the mixed, default data layout during the HaplotypeCaller stage.

At a 4:1 resource ratio, the Isilon storage CPU resources peaked at 50% and averaged 20% for pipeline runs using a mixed data layout. This setup also generated a peak throughput of 1.7 GB/s.

Figure 8. DGX-1 system to Isilon H500 Node Ratio & Wall Clock Time. Increasing the DGX-1 system to Isilon node ratio from 1:1 to 4:1 does not significantly impact secondary analysis time. A 1:1 ratio using a mixed data layout (blue) reduced secondary analysis time on average by 8.5%. Results using the 2:1 resource ratio are omitted for simplicity.


Test Case 3: Secondary Analysis Time Using 4 vs. 8 GPUs Per Sample

The number of GPU resources is configurable for Parabricks pipelines. This test case evaluates secondary analysis time using four or eight Tesla V100 GPUs per sample input. 270 SGDP samples were submitted to the germline pipeline and DV step. The wall clock time for pipeline runs using 4 GPUs per sample or 8 GPUs per sample was recorded. Each pipeline run used the mixed data layout and the 1 DGX-1 system to 1 H500 node ratio configuration.
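With four GPUs per sample, two samples can run concurrently on one DGX-1 system by partitioning its eight GPUs, for example (a sketch; NVIDIA_VISIBLE_DEVICES is used the same way in the Appendix script, and the sample names and paths are illustrative):

# Two 4-GPU germline jobs side by side on one DGX-1 system
REF=/raid/refs/Homo_sapiens_assembly38.fasta
NVIDIA_VISIBLE_DEVICES=0,1,2,3 pbrun germline --num-gpus 4 --ref "$REF" \
  --in-fq A_1.fastq.gz A_2.fastq.gz --out-bam A.bam --out-variants A.g.vcf.gz --gvcf &
NVIDIA_VISIBLE_DEVICES=4,5,6,7 pbrun germline --num-gpus 4 --ref "$REF" \
  --in-fq B_1.fastq.gz B_2.fastq.gz --out-bam B.bam --out-variants B.g.vcf.gz --gvcf &
wait    # block until both samples finish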

Figure 9 summarizes the wall clock time for each stage of the pipeline for the SGDP cohort. The wall-clock times using four or eight GPUs per sample differ by about 30% for each pipeline stage. Note that for high coverage genomes (> 55X) it is recommended to use eight GPUs per sample, with all the compute node resources available to the sample (four GPUs per sample may still be acceptable if expanding localhost memory). Storage and compute resource utilization were similar to those observed in Test Cases 1 and 2.

If eight GPU resources are available, four GPUs per sample is the recommended resource assignment to maximize the effective daily secondary analysis throughput (e.g., number of samples analyzed per day). Table 1 summarizes the effective secondary analysis time per sample when using four or eight GPUs for each pipeline stage. Figure 10 summarizes the effective secondary analysis throughput for different pipeline setups.

    Figure 9. Secondary Analysis Time Using 4 & 8 GPUs per Sample


Table 1. Wall Clock Time By Pipeline Stage

EFFECTIVE PARABRICKS SECONDARY ANALYSIS TIME PER SAMPLE USING A DGX-1 & DELL EMC ISILON H500 CLUSTER

COVERAGE   4 GPU - BWA   8 GPU - BWA   4 GPU - HC   8 GPU - HC   4 GPU - DV   8 GPU - DV
AVE. 35X   35.0          41.7          17.9         17.5         20.8         15.0
AVE. 40X   40.9          58.5          18.4         20.8         23.0         26.7
AVE. 50X   49.1          71.0          21.0         32.1         28.3         40.3

Time in minutes. Averaged for samples within the noted coverage.

Figure 10. Effective Daily Sample Throughput By Pipeline Using An NVIDIA DGX-1 System & Isilon H500 Cluster


Discussion

The test case results highlight the impact of available GPU resources, data layout, and compute to storage resource ratios on secondary analysis time.

Modifying the number of GPU resources assigned per sample provided the greatest opportunity to reduce secondary analysis time. If eight or more Tesla V100 GPUs are available, it is best to assign four GPUs per sample to maximize the effective daily secondary analysis sample throughput (i.e., WGS analyzed per day). However, in disciplines like oncology, where NGS applications generate high sequence coverage samples (> 55X), Parabricks secondary analysis pipelines will benefit from assigning all of the DGX-1 system GPU and CPU resources to a single sample. Organizations using GPU system configurations with expanded host memory (1.5 TB), such as the NVIDIA DGX-2 system, Dell PowerEdge C4140, or Dell EMC DSS 8440, should evaluate four-GPU-per-sample setups for high coverage samples.

Secondary analysis times can be further minimized by matching the data layout best suited to the analysis pipeline operation. For example, using an All Isilon data layout for the BWA stage and a mixed data layout for either of the variant calling stages would reduce the secondary analysis time by 18%. Matching the data layout to the analysis pipeline stage creates the opportunity to process 2 to 3 additional genomes per day, or to run a consensus variant calling pipeline without significantly impacting daily analysis throughput (Figure 10).

When possible, it is best to use a 1:1 DGX-1 system to Isilon H500 node ratio, although altering the DGX-1 system to Isilon H500 ratio did not materially impact Parabricks secondary analysis times. Additional test cases, such as modifying the priority of OneFS file system services like SyncIQ, dropping (i.e., failing out) storage nodes, and servicing other external client requests during a secondary analysis, are needed to identify the DGX-1 system to Isilon node ratio that significantly impacts secondary analysis times. This upper limit will suggest how best to take advantage of all available GPUs along with other computing resources, as well as how to optimize the storage configuration, especially for blended Isilon storage clusters consisting of F, H, and A node types.

Finally, the test cases generated gVCF files anticipating that this output will be used to benchmark joint genotyping workflows. If the downstream tertiary NGS analysis strategy does not require gVCF output, writing variants out to the VCF file format provides an additional opportunity to further minimize secondary analysis time and increase daily sample throughput.

A Modular, Scale-Out Solution Architecture That Keeps Pace With an Illumina NovaSeq 6000 System

Today many healthcare and life science organizations are transitioning their NGS capabilities to take advantage of the Illumina NovaSeq 6000 system12. A single NovaSeq can generate approximately 7,100 40X human WGS annually (Illumina Inc., 2019). At steady-state operation, this is equivalent to generating 20 unanalyzed 40X human WGS per day.

The software and hardware solution architecture tested here can support up to ten NovaSeq systems and is capable of processing over 1,300 40X WGS per week using the Parabricks germline pipeline (BWA+GATK) or, alternatively, the Parabricks DeepVariant pipeline (BWA+DeepVariant). The configured system also provides enough active storage capacity to collect and process ten days of NGS data.
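These figures are consistent with the per-stage averages in Table 1. As a rough check, assuming the 4-GPU columns are effective per-sample times (two samples running concurrently per DGX-1 system):

per-sample time (40X, BWA + HC) ≈ 40.9 + 18.4 ≈ 59.3 minutes
daily throughput per DGX-1 system ≈ 1,440 minutes / 59.3 minutes ≈ 24 WGS per day
weekly throughput for 8 DGX-1 systems ≈ 24 × 7 × 8 ≈ 1,344 WGS per week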

    12 Illumina, Inc. earnings conference calls 2018 – 2019. www.investor.illumina.com



The results from the test cases point to a solution architecture matched to the output of one Illumina NovaSeq 6000 system. This solution architecture consists of three key components:

• 1 DGX-1 system (or 8 Tesla V100 GPUs)

• 1 Dell EMC Isilon H500 storage cluster (480 TB)

• The Parabricks application suite

In combination, these components create a modular, scale-out solution architecture that can keep pace with a fleet of NovaSeq systems generating sequencing data across a broad range of coverage. This solution architecture is flexible, easy to manage, and reliable.

Using this solution architecture, organizations gain the flexibility needed to respond quickly to changing sequencing demands. If another NovaSeq joins the sequencing fleet, adding just one additional DGX-1 system handles the increased output. This solution is also ideally suited for organizations operating in hybrid cloud environments. For example, if additional analysis capacity is needed for a short-term project, an organization can sync its NGS data to Isilon storage available at a co-location facility like Faction and then burst to Parabricks instance(s) available in any of the three major cloud providers: Azure, GCP, and AWS. The same environment can also be used to support machine learning workflows when NGS utilization is low.

Easy management is a high priority for many technical computing organizations tasked with supporting NGS workflows. Typically operating with a “do more with less” mindset, these organizations always welcome simplified storage management. For example, if the primary NGS germline analysis workflow shifted significantly to a somatic workflow with higher coverage samples and longer-running secondary analysis times, a team can respond by adding Isilon hybrid or flash nodes to match the increased computing requirements, while Isilon A-nodes can be added to provide archive capacity within 60 seconds. Similarly, the multi-protocol access provided by OneFS eliminates the need to host additional gateways dedicated to primary data capture from sequencing instruments or to expose VCF files to downstream analytics workflows that require Spark or HDFS.

Reliability is a cornerstone of secondary analysis too. In previous studies using data from sample NA12878, the Parabricks application suite produced reproducible and accurate results across different computing configurations. This simplifies computing choices and analysis strategies. Groups can use the Tesla V100 GPUs hosted in a DGX-1 system, six NVIDIA Tesla T4 GPUs in a Dell EMC PowerEdge R740 server, cloud-hosted NVIDIA Tesla P100 GPUs, or a combination of these resources, and achieve the same results every time. Also, Isilon OneFS storage capabilities like non-disruptive upgrades and configurable erasure coding minimize the likelihood of data loss and environment downtime. These types of capabilities are especially crucial in NGS environments like a DNA sequencing provider, where primary and processed data must be guaranteed for customer delivery.

In summary, the Parabricks application suite used along with Tesla V100 GPUs and Isilon H500 storage is a solution architecture designed to keep pace with the latest NGS sequencing capabilities. To learn more about how this solution accelerates your NGS secondary analysis, contact your Parabricks or Dell EMC representative.


References

Birney, E. (2017). Genomics in healthcare: GA4GH looks to 2022. Retrieved from https://doi.org/10.1101/203554

Dell EMC. (2018). Dell EMC Isilon and NVIDIA DGX-1 Servers for Deep Learning. Retrieved from https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/Dell_EMC_Isilon_and_NVIDIA_DGX_1_servers_for_deep_learning.pdf

Dell Technologies. (2018, October). High Performance Secondary Analysis of Genomic Data. Retrieved from https://www.dell.com/support/article/us/en/04/sln314233/high-performance-secondary-analysis-of-genomic-data?lang=en

Goyal, A. (2017). Ultra-Fast Next Generation Human Genome Sequencing Data. Retrieved from http://www.scirp.org/journal/paperinformation.aspx?paperid=74603

Griffith, M. (2015). Optimizing Cancer Genome Sequencing and Analysis. Cell Systems. Retrieved from https://doi.org/10.1016/j.cels.2015.08.015

Illumina. (2019, July 22). What is NGS Coverage? Retrieved from https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/coverage.html

Illumina Inc. (2019, July 25). NovaSeq™ 6000 Sequencing System. Retrieved from https://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/novaseq-6000-system-specification-sheet-770-2016-025.pdf

Kathiresan, N. (2017). Accelerating Next Generation Sequencing Data Analysis With System Level Optimizations. Nature Scientific Reports. doi:10.1038/s41598-017-09089-1

Lee, S. (2016). Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond. Briefings in Bioinformatics. doi:10.1093/bib/bbv073

Li, H. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. doi:10.1093/bioinformatics/btp324

Mallick, S. (2016). The Simons Genome Diversity Project: 300 Genomes From 142 Diverse Populations. Nature. doi:10.1038/nature18964

McKenna, A. (2010). The Genome Analysis Toolkit: A MapReduce Framework For Analyzing Next Generation Sequence Data. Genome Research. doi:10.1101/gr.107524.110

NHGRI. (2019, July 25). Retrieved from https://www.genome.gov/human-genome-project/Completion-FAQ

Poplin, R. (2016). Creating a Universal SNP and Small Indel Variant Caller With Deep Neural Networks. Nature Biotechnology. doi:10.1038/nbt.4235

Stephens, Z. D. (2015). Big Data: Astronomical or Genomical? PLOS Biology. doi:10.1371/journal.pbio.1002195

Suwinski, P. (2019). Advancing Personalized Medicine Through the Application of Whole Exome Sequencing and Big Data Analytics. Frontiers in Genetics. Retrieved from https://doi.org/10.3389/fgene.2019.00049

Yoon, K. (2018, March). Reference Architectures of Dell EMC Ready Bundle for HPC Life Sciences Refresh with 14th Generation Servers. Retrieved from https://downloads.dell.com/manuals/all-products/esuprt_software/esuprt_it_ops_datcentr_mgmt/high-computing-solution-resources_white-papers27_en-us.pdf


Appendix

Sample Data & Reference Files

Simons Genome Diversity Project

In 2016 the Simons Genome Diversity Project (SGDP) published one of the largest datasets of diverse, high-quality human genome sequences ever reported. The sampling strategy differs from previous studies of human genome diversity, which aimed to maximize medical relevance by studying populations with large numbers of present-day people. The SGDP study samples populations in a way that represents as much anthropological, linguistic, and cultural diversity as possible. All genomes in the dataset were sequenced to at least 30X coverage on an Illumina HiSeq 2000 platform using paired-end sequencing with reads of 2x 100 base pairs.

A full description of this cohort and the results of the project are described in The Simons Genome Diversity Project: 300 Genomes From 142 Diverse Populations (Mallick, S. et al., Nature, doi:10.1038/nature18964) and at the SGDP website: https://www.simonsfoundation.org/simons-genome-diversity-project/

Information about SGDP data is posted here: http://reichdata.hms.harvard.edu/pub/datasets/sgdp/

The raw data for the 279 genomes for which the informed consent documentation is consistent with fully public data release are available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710. No attempt was made to connect the genetic data to personal identifiers for the samples.

NA12878 Sample

The National Institute of Standards and Technology (NIST) developed the Genome in a Bottle (GIAB) Consortium to develop a set of human genome reference materials (RM). The availability of whole-genome RMs allows a methods-based approach for NGS technical validations in a standardized, cost-effective, and practical manner. NGS data generated from sample NA12878 (NIST RM 8398) were extensively characterized as part of the Thousand Genomes Project and continue to be used for comparing different sequencing technologies and developing bioinformatic tools. The NA12878 sequence read data set was downloaded from the European Nucleotide Archive at http://www.ebi.ac.uk/ena/data/view/ERR194147. This sample was sequenced to 50X depth on an Illumina HiSeq 2000 platform using paired-end sequencing with reads of 2x 100 base pairs.

Reference Files

The following reference files were used with the Parabricks secondary analysis pipeline(s):

Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
Homo_sapiens_assembly38.dbsnp138.vcf
Homo_sapiens_assembly38.fasta

These can be downloaded at https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1



Simons Genome Diversity Project BAM to FASTQ Data Preparation

Simons Genome Diversity Project (SGDP) sample data were provided in BAM format. Measuring the wall clock time for a secondary analysis pipeline requires starting with FASTQ files as input. The BAM files were converted back to FASTQ using a combination of samtools (v1.9, www.htslib.org) and an in-house MPI application.

The conversion of the approximately 300 BAM files to paired, sorted FASTQ files is summarized as follows. A batch of samtools fastq input.bam > output.fastq jobs was submitted to a 14-node HPC cluster mounted to a Dell EMC Isilon H500 storage cluster. Each BAM file was converted to a single, unsorted, uncompressed FASTQ file. Each conversion completed within 15 minutes. Each of the single, unsorted FASTQ files was then split into paired, sorted FASTQ files using an in-house MPI program with N (54 in our case) ranks. Each rank reads its chunk of the FASTQ file, sorts the chunk, and splits it into paired, gzipped FASTQ files. Any remaining unmatched reads are gathered, sorted, and split into a pair of FASTQ files. On average, each sort-split operation completed within 15 minutes using 14 compute nodes.


    Parabricks

Parabricks is a software suite for performing secondary analysis of next-generation sequencing (NGS) DNA data. A major benefit of Parabricks is that it is designed to deliver results quickly and at low cost: Parabricks can analyze a whole human genome in about 45 minutes, compared to about 30 hours for conventional analysis of 30X WGS data. More information can be found at https://developer.nvidia.com/nvidia-parabricks.

Installation:

1. Download. The Parabricks application can be requested from Parabricks by contacting [email protected].

2. Follow the steps outlined for local installation (https://docs.parabricks.com/installation/local-installation):
   a. tar -xvzf parabricks.tar.gz
   b. ./parabricks/installer.py --container singularity --install-location

3. Verify the installation:
   a. Download the sample data to the local SSD:
      wget https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz
   b. Test with the sample data using the following commands:
      tar -xvzf parabricks_sample.tar.gz
      /parabricks/pbrun fq2bam --ref parabricks_sample/Ref/Homo_sapiens_assembly38.fasta --in-fq parabricks_sample/Data/sample_1.fq.gz parabricks_sample/Data/sample_2.fq.gz --out-bam output.bam --num-gpus 4
   c. The above test should finish in ~150 seconds.
   d. Share the output printed on-screen during the run with Parabricks to verify correct operation.

    Additional Parabricks software documentation is posted here: https://www.nvidia.com/en-us/docs/parabricks/



    Sample Bash Script for Parabricks Germline Pipeline

This script executes the Parabricks germline pipeline followed by DeepVariant on 4 GPUs, using paired-end FASTQ files for sample NA12878 as input. The full pipeline (BWA-MEM + sorting + duplicate marking + BQSR + HaplotypeCaller) takes a total of 3,869 seconds (1 h 5 min) to finish on 4 GPUs. DeepVariant adds another 20 to 30 minutes, depending on sample coverage.

#!/bin/bash
# Germline pipeline (BWA-MEM + sorting + duplicate marking + BQSR +
# HaplotypeCaller) followed by DeepVariant for sample NA12878,
# pinned to four of the DGX-1's eight GPUs.
GENOME=NA12878
cd /ifs/1KWGS/${GENOME}/
TMPDIR=/scratch/${GENOME}/tmp2
rm -rf ${TMPDIR}
mkdir -p ${TMPDIR}
_START_=`date`
echo "${GENOME} start = ${_START_}"
export NVIDIA_VISIBLE_DEVICES="4,5,6,7"
pbrun germline \
  --ref /scratch/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
  --in-fq NA12878_1.fastq.gz NA12878_2.fastq.gz "@RG\tID:foo0\tLB:lib1\tPL:bar\tSM:${GENOME}\tPU:unit0" \
  --out-bam out2/${GENOME}.bam \
  --num-gpus 4 \
  --out-recal-file out2/${GENOME}.txt \
  --knownSites /scratch/parabricks_sample/Ref/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --knownSites /scratch/parabricks_sample/Ref/Homo_sapiens_assembly38.dbsnp138.vcf \
  --out-variants ./${GENOME}.g.vcf.gz \
  --gvcf \
  --tmp-dir ${TMPDIR}
pbrun deepvariant \
  --ref /scratch/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
  --num-gpus 4 \
  --in-bam ./out2/${GENOME}.bam \
  --out-variants ./out2/${GENOME}_dv.g.vcf.gz \
  --gvcf
_END_=`date`
echo "${GENOME} end = ${_END_}"

NA12878 start = Sat May 11 01:37:30 PDT 2019
------------------------------------------------------------------------------
||  Parabricks accelerated Genomics Pipeline                                ||
||  Version v2.3.5                                                          ||
||  GPU-BWA mem, Sorting, Marking Duplicates, BQSR                          ||
||  Contact: [email protected]                                              ||
------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
GPU-BWA mem
ProgressMeter    Reads          Base Pairs Aligned
[08:38:04]       5061430        790000000
[08:38:15]       10122388       1530000000
[08:38:26]       15182640       2250000000
[08:38:37]       20242528       3110000000
[08:38:48]       25301636       3800000000
…..
[09:10:44]       900490638      135290000000
[09:10:55]       905550360      136030000000



[09:11:06]       910608992      136810000000
[09:11:18]       915668866      137550000000
[09:11:29]       920728896      138380000000
GPU-BWA Mem time: 2054.673712 seconds
GPU-BWA Mem is finished.

GPU Sorting, Marking Dups, BQSR
ProgressMeter    SAM Entries Completed
[09:12:18]       5000000
[09:12:21]       10000000
[09:12:25]       15000000
[09:12:30]       20000000
[09:12:33]       25000000
…..
[09:23:03]       900000000
[09:23:06]       905000000
[09:23:09]       910000000
[09:23:13]       915000000
[09:23:16]       920000000
[09:23:19]       925000000
Total GPU-BWA Mem + Sorting + MarkingDups + BQSR Generation + BAM writing
Processing time: 2810.635388 seconds
[Parabricks Options Mesg]: Checking argument compatibility
------------------------------------------------------------------------------
||  Parabricks accelerated Genomics Pipeline                                ||
||  Version v2.3.5                                                          ||
||  GPU-GATK4 HaplotypeCaller                                               ||
||  Contact: [email protected]                                              ||
------------------------------------------------------------------------------
ProgressMeter - Current-Locus    Elapsed-Minutes  Regions-Processed  Regions/Minute
[09:25:08] chr1:44188473              0.2             247703           1486218
[09:25:18] chr1:82626947              0.3             463409           1390227
[09:25:28] chr1:117654487             0.5             660958           1321916
[09:25:38] chr1:159004742             0.7             795460           1193190
[09:25:48] chr1:170318282             0.8             861460           1033752
…..
[09:41:28] HLA-DRB1*09:21:15906      16.5           16655240          1009408
[09:41:38] HLA-DRB1*09:21:15906      16.7           16655240           999314
[09:41:48] HLA-DRB1*09:21:15906      16.8           16655240           989420
[09:41:58] HLA-DRB1*09:21:15906      17.0           16655240           979720
[09:42:08] HLA-DRB1*09:21:15906      17.2           16655240           970208
[09:42:18] HLA-DRB1*09:21:15906      17.3           16655240           960879
Total time taken: 1058.99
total 0.000 to vc 0.000 real write 0.000 problem 0.000 tmp0 0.000 tmp1 0.000 tmp2 0.000
------------------------------------------------------------------------------
||  Parabricks accelerated Genomics Pipeline                                ||
||  Version v2.3.5                                                          ||
||  deepvariant                                                             ||
||  Contact: [email protected]                                              ||
------------------------------------------------------------------------------
Starting DeepVariant



Running with 4 gpu, each with 4 workers
ProgressMeter - Current-Locus    Elapsed-Minutes
ProgressMeter - chr1:10000            0.1
ProgressMeter - chr1:33000            0.2
2019-05-11 09:43:13.009006: W src/allelecounter.cpp:310] Found duplicate read: 183 at reference_name: "chr1" position: 7865117
ProgressMeter - chr1:1555000          0.3
ProgressMeter - chr1:10514000         0.4
ProgressMeter - chr1:18886000         0.5
ProgressMeter - chr1:27771000         0.6
…..
ProgressMeter - chrX:119822000       32.1
ProgressMeter - chrX:135975000       32.2
ProgressMeter - chrX:155640000       32.3

    DeepVariant is finished, total time is 1947.259 seconds
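Because the script pins Parabricks to GPUs 4-7 via NVIDIA_VISIBLE_DEVICES, the other four GPUs of the DGX-1 remain free. A hypothetical sketch of processing two samples concurrently on one 8-GPU system follows; run_germline.sh stands in for the script above parameterized by sample name, and is an assumption rather than part of the tested configuration:

    #!/bin/bash
    # Hypothetical: two 4-GPU germline runs in parallel on one 8-GPU DGX-1.
    # run_germline.sh is an illustrative wrapper around the sample script
    # above; SAMPLE_A and SAMPLE_B are placeholders.
    NVIDIA_VISIBLE_DEVICES="0,1,2,3" ./run_germline.sh SAMPLE_A &
    NVIDIA_VISIBLE_DEVICES="4,5,6,7" ./run_germline.sh SAMPLE_B &
    wait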


    Dell EMC Isilon Network Attached Storage

About Dell EMC Isilon

Dell EMC Isilon scale-out storage solutions are robust yet simple to scale and manage, no matter how large your unstructured data environment becomes.

An overview of Isilon Generation 6 Hybrid storage capabilities, features, and options is summarized here: https://www.dellemc.com/en-za/collaterals/unauth/data-sheets/products/storage/h16071-ss-isilon-hybrid.pdf

For a detailed description of OneFS, the Isilon storage file system, see the Dell EMC Isilon OneFS Technical Overview posted here: https://www.emc.com/collateral/data-sheet/h16071-ss-isilon-hybrid.pdf

    Dell EMC Isilon H500 Configuration

DELL EMC ISILON H500 STORAGE CLUSTER: 4 TB HDD, 2x 3.2 TB SSD, 40GbE/40GbE

CHASSIS TYPE & NODE COUNT:   H500 / 12 nodes
USABLE (RAW) CAPACITY:       538 TB (720 TB)
SSD CAPACITY / NODE:         6.4 TB (76.8 TB total)
PROCESSORS / NODE:           2.2 GHz, 10-core
MEMORY / NODE:               128 GB (1.5 TB total)
FRONT-END NETWORKING:        2 x 40 GbE
BACK-END NETWORKING:         2 x QSFP+ 40 GbE Ethernet

SOFTWARE
OPERATING SYSTEM:            OneFS v8.1.2
ISILON SMARTCONNECT:         Round robin mode
SMARTREAD / L3 CACHE:        True (enabled, default)
SNAPSHOTS:                   None
DATA PROTECTION:             N+2:1 (default)
INSIGHTIQ:                   v4.1
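To illustrate how a compute node attaches to the cluster, a minimal NFS mount sketch is shown below. The SmartConnect zone name (isilon.example.com) and the /ifs export path are assumptions; SmartConnect round-robin distributes the resulting client connections across the 12 nodes' front-end interfaces:

    # Illustrative NFSv3 mount of the Isilon cluster from a compute node.
    # Zone name and export path are hypothetical.
    sudo mkdir -p /ifs
    sudo mount -t nfs -o vers=3,rsize=131072,wsize=131072 isilon.example.com:/ifs /ifs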



    The NVIDIA DGX-1 System

About the NVIDIA DGX-1 System

The DGX-1 system is the world's first purpose-built system optimized for deep learning, with fully integrated hardware and software that can be deployed quickly and easily. Its revolutionary performance significantly accelerates training time, making it the world's first deep learning supercomputer in a box.

The NVIDIA DGX-1 System technical specification is posted here: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-1/NVIDIA-DGX-1-Volta-AI-Supercomputer-Datasheet.pdf

    NVIDIA DGX-1 System Configuration

NVIDIA DGX-1 Cluster

NO. DGX-1:                   8
NVIDIA GPU CLOUD IMAGE:      nvcr.io/nvidia/tensorflow:18.09-py3
TENSORFLOW:                  1.10.0
ARISTA – SYSTEM IMAGE:       4.19.8M
DGX-1 – UBUNTU:              16.04.4 LTS
DGX-1 – BASE OS:             3.1.6
DGX-1 – BIOS:                5.11
DGX-1 – NVIDIA DRIVER:       384.125
DGX-1 – HOST MEMORY:         512 GB DDR4 LRDIMM
DGX-1 – LOCAL STORAGE:       7.6 TB, RAID 0, 4 x 1.92 TB SSD
DGX-1 – GPU:                 8x Tesla V100, 16 GB/GPU; 40,960 NVIDIA CUDA cores
DGX-1 – CPU:                 2x 20-core Intel Xeon E5-2698 v4, 2.2 GHz

© 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. Reference Number: H18251
