
Accelerating Next Generation Sequencing Secondary Analysis with Dell EMC Isilon and NVIDIA Parabricks

White Paper

This white paper describes a modular, scale-out solution architecture composed of NVIDIA Parabricks application software, the NVIDIA DGX-1 system, and Dell EMC Isilon network-attached storage (NAS) capable of analyzing the daily output of an Illumina NovaSeq 6000 sequencing system, or approximately 24 40X whole human genome sequences (WGS) per day. This solution architecture can scale out to process over 1,000 WGS per week. This paper also highlights variables to consider when building out a technical computing environment designed to accelerate the secondary analysis of NGS data.

© 2020 Dell Technologies or its subsidiaries.


Table of Contents

Revisions
Executive Summary
Introduction
Building Blocks For Modular, Scale-out GPU Solution Architecture to Accelerate NGS Secondary Analysis
Evaluating Wall Clock Time for Secondary Analysis
Discussion
References
Appendix


Revisions

DATE          DESCRIPTION                                           AUTHOR
March 2020    Revised to reflect NVIDIA acquisition of Parabricks   E. Sasha Paegle
August 2019   Initial release                                       E. Sasha Paegle

The information in this publication is provided “as is.” DELL EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication and specifically disclaims implied warranties of merchantability or fitness for a purpose. Use, copying, and distribution of any DELL EMC software described in this publication require an applicable software license.

DELL EMC2, DELL EMC, the DELL EMC logo are registered trademarks or trademarks of DELL EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners.

© Copyright 2019 DELL EMC Corporation. All rights reserved. Published in the USA. 08/19.

DELL EMC believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

DELL EMC is now part of the Dell group of companies.


Executive Summary

Next-Generation Sequencing (NGS) is a combination of laboratory instrumentation technologies and analysis methods to identify patterns in DNA, the code of life, at dramatically increased resolution and quality. The cost of acquiring NGS data continues to decline exponentially, while the volume of NGS data is doubling every year (Stephens, 2015). The latest NGS instrumentation produces five times more data than the previous generation of instrumentation. As the capacity to sequence DNA continues to increase, organizations like the Global Alliance for Genomics and Health (GA4GH) estimate that over 60 million patients will have their DNA sequenced in a healthcare context by 2025 (Birney, 2017).

However, Secondary Analysis, the conversion of raw NGS data into a usable DNA sequence that is then compared to a reference, may require a significant amount of time (CPU-hours) to complete. Depending on available computing and storage resources, software, and analysis methodology, secondary analysis time can range from minutes to days. Ideally, there are enough computing and storage resources that the output of secondary analysis keeps pace with the rate of raw NGS data generation. The goal is to avoid a secondary analysis backlog and to ensure processed data are sent on to downstream analysis and interpretation as fast as possible.

This white paper describes a modular, scale-out solution architecture composed of NVIDIA Parabricks application software, NVIDIA Tesla V100 GPUs, and Dell EMC Isilon network-attached storage (NAS) capable of analyzing the daily output of an Illumina NovaSeq 6000 system, or approximately 24 40X whole human genome sequences (WGS) per day. This solution architecture can scale out to process over 1,000 WGS per week. This paper also highlights variables to consider when building out a technical computing environment designed to accelerate the secondary analysis of NGS data.

Intended Audience

Scientists who are responsible for the analysis of NGS data and IT professionals who are responsible for providing a technical computing environment designed to support NGS applications are encouraged to read this paper.

Acknowledgments

We thank Parabricks Inc. for providing the Parabricks application suite. We thank “Shop” Mallick, The David Reich Laboratory, Harvard Medical School, and the Simons Foundation for providing access to source data generated by the Simons Genome Diversity Project. We also thank Glen Otero, VP of Scientific Computing, Translational Genomics Research Institute (TGen), and Kihoon Yoon, Principal Engineer, Dell HPC & AI Innovation Laboratory, for input and consultation. Groupware Technology (Santa Clara, California) provided the lab environment used to perform the testing described herein.


Introduction

DNA is the code of life. This molecule carries the genetic instructions for growth, development, and reproduction of all living organisms. The building block of DNA is a four-letter code: "A..T..G..C.” These four letters are referred to as “bases.” The order of the As, Ts, Gs, and Cs is responsible for traits like eye color or drug sensitivity. DNA sequencing is the process of writing out the order of the bases for an organism of interest. The entire complement of DNA for an organism is a genome. After approximately ten years and over $2.7B, the first draft of the human genome sequence was published in April 2003 (NHGRI, 2019).

Next-generation sequencing (NGS) automates the rapid sequencing of DNA and can produce a human genome in approximately 24 hours. Consequently, NGS now plays an increasingly important role in clinical practice and public health. The information encoded in a person's genome is instrumental in assessing the response to diagnosis, treatment, and disease prevention strategies due to person-to-person variability (Suwinski, 2019). Identifying variants, or differences, for a genome is done by comparing an individual’s genome to a DNA reference sequence. Also known as Secondary Analysis, this process for generating a list of variants can take minutes to days depending on the available software, computing, and storage resources.

Keeping Pace with NGS Data Generation While Reducing Secondary Analysis Time

Extending this approach to assess the genetic variability of patient populations requires operating the latest NGS instrumentation and computing resources at scale. For example, the latest Illumina NovaSeq 6000 system can output approximately five times more DNA bases than the previous generation of instrumentation (Illumina Inc., 2019). One Illumina NovaSeq system can produce between ~1.5 to 2.5 TB of raw data per day, representing approximately 20 to 48 whole genome sequences (WGS) per day1. Today it is not uncommon for life science organizations to operate more than one NGS instrument and routinely process from 200 to over 1000 samples per week. Ideally, an organization has enough computing and storage resources matched to the output capacity of its fleet of sequencing instruments such that the rate of secondary analysis keeps pace with the rate of raw NGS data generation. Otherwise, the organization risks experiencing an analysis backlog.

1 Sequencing output depends on NGS instrumentation, application and analysis methodology.

Figure 1. Keeping Pace with Data Generation


Working with NGS Data2

The desired product of whole-genome sequencing (WGS) is a list of variants, or differences, for a given sample when compared to a reference genome. Although motivations may differ, minimizing the time to generate this list of variants is a common goal shared by many healthcare and life science organizations. Research organizations competing for grant awards want to move into variant interpretation and analysis as soon as possible while avoiding costly false positives. To recognize revenue, a DNA sequencing provider must return a list of variants to its customer per agreed-on timelines. And in a clinical setting, a diagnostic variant report is needed at a speed that impacts the care of a patient.

To better understand how software, computing, and storage technology choices impact the time to generate a variant list, it is worthwhile to review the three analysis phases of NGS data (Figure 2).

Primary Analysis

A primary analysis is the NGS instrument-specific steps needed to call DNA bases and compute quality scores for each base. The most common output file format for these data as they arrive from the sequencer is FASTQ. The FASTQ format is ASCII text data containing the short sequences of DNA bases3 and an associated quality score for each base. These short sequence data are un-ordered and un-aligned and commonly referred to as reads. Depending on the type of sequencing instrument, instrument settings, and application, FASTQ files can range from a small number of large (> 120 GB) files to an extremely large number of smaller files.

Secondary Analysis

During a secondary analysis, the raw reads contained in one or more FASTQ files are mapped and aligned to a reference genome. A Binary Alignment Map (BAM) file is the output and represents the genome for the sample of interest. The genome of interest (i.e., the BAM file) is passed to a variant calling step, which identifies the significant differences, or variants, between the genome of interest and a reference genome. The identified differences are written to a variant call file (VCF).
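For orientation, the conventional CPU-based version of this FASTQ-to-BAM-to-VCF flow can be sketched with the open-source tools cited in this paper (BWA-Mem and GATK). This is an illustrative sketch with hypothetical file names, not the accelerated path described later:

# Map and align paired-end reads, then coordinate-sort the output into a BAM
bwa mem -t 32 Homo_sapiens_assembly38.fasta sample_1.fastq.gz sample_2.fastq.gz | \
  samtools sort -o sample.bam -
samtools index sample.bam          # index the BAM for random access
# Call variants against the reference and write them to a compressed VCF
gatk HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I sample.bam \
  -O sample.vcf.gz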

Tertiary Analysis4

A tertiary analysis focuses on interpreting the variants for a given sample or a population of samples to understand their significance in the context of additional biological and clinical information.

2 There are many NGS applications. The patterns for working with NGS data are common across applications. For simplicity this document focuses on whole genome sequencing (WGS).
3 Typically, short read segments range from 75 to 250+ bases depending on NGS application.
4 Tertiary analysis is beyond the scope of this paper.


    Figure 2: Three Phases of NGS Analysis

Reducing Secondary Analysis Time To Keep Pace With NGS Data Generation

Due to the size of individual sample data and the volume of samples, WGS secondary analysis is a compute- and storage-intensive process. The most commonly used and cited methods for secondary analysis include the Burrows-Wheeler Alignment (BWA-Mem) (Li, 2009) and the Genome Analysis Tool Kit (GATK) (McKenna, 2010). Using the Broad GATK Best Practices workflow (pipeline) requires over 30 hours to process5 a 30X WGS (Goyal, 2017). Analyzing a few genomes per day is far from ideal when a modern, high-throughput NGS instrument can generate unanalyzed, raw NGS data for 20 or more WGS per day.

It is important to consider critical variables that may impact the total secondary analysis (wall-clock) time when choosing technologies that enable secondary analysis of NGS data. These variables range from the type of NGS sequencing application, analysis software and strategies, output file types, and application file access patterns, to the number and type of available computing resources.

Sequence Depth Of Coverage

When planning time and resources to complete secondary analysis, it is essential to be aware of the sequencing depth of coverage (aka coverage) for sample data, as it will impact analysis time per sample. Coverage describes the average number of reads that align to, or "cover," a known reference sequence. The coverage often determines if a variant exists with a certain degree of confidence at a specific genomic location. Coverage requirements vary by sequencing application. For example, 30X to 50X coverage is common for human WGS applications (Illumina, 2019). However, the analysis of cancer genomes may require sequencing to a depth of coverage higher than 100X to achieve the necessary sensitivity and specificity to detect rare variants (Griffith, 2015).

    5 48 core Intel Xeon E5-2697v2 12C, 2.7 GHz processors with 128 GB RAM and 3.2 TB SSD, CentOS 6.6


Coverage is also a measure of the amount of data per sample. As coverage increases, so does the amount of data per sample. For example, a 30X (coverage) WGS sample contains approximately three times more data than a 10X WGS sample, which means secondary analysis time also increases.
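As a rough rule of thumb (assuming a human genome of about 3.1 billion bases and counting sequenced bases only, not quality strings or read metadata):

raw sequenced bases ≈ genome size × depth of coverage
10X WGS ≈ 3.1 Gbases × 10 ≈ 31 Gbases
30X WGS ≈ 3.1 Gbases × 30 ≈ 93 Gbases (about three times the data of the 10X sample)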

Interplay Between Analysis & Computing Resources

Given the many degrees of freedom between software and computing choices, navigating them can be one of the most challenging and time-consuming tasks in minimizing secondary analysis time. Organizations with access to deep computer science expertise may implement system-level optimizations achieving a 70% reduction in execution time (Kathiresan, 2017). Alternatively, it can be as simple as updating existing server technology, yielding a 12% increase in daily output (Yoon, 2018). To avoid limitations of hardware scalability, modern accelerator hardware architectures such as GPUs or FPGAs in combination with purpose-built software can lead to significant reductions in secondary analysis times. For example, using Tesla V100 GPUs, Dell Technologies demonstrated, in collaboration with Parabricks, over a 25x reduction in analysis time compared to a CPU-only solution6 (Dell Technologies, 2018).

Data Placement

Like the interplay between software and computing resources, data storage solutions and their related file systems also offer opportunities to accelerate secondary analysis. It is worthwhile to inspect and understand the general file access patterns of the methods used in the secondary analysis. Some applications used in secondary analysis, especially those like BWA used for sorting and alignment, can create many temporary files. As a best practice, these temporary files should be placed on direct-attached storage (DAS) when feasible, instead of on network file storage. However, mounting a temporary or scratch directory on a shared storage resource is an acceptable approach if it introduces opportunities for eliminating manual, time-consuming steps to stage large data sets next to computing resources. This approach also offers opportunities to minimize or prevent data loss.

Storage Media Types

Implementing secondary analysis workflows using shared storage resources often prompts groups to purchase more expensive flash storage with the anticipation that it will significantly reduce analysis time. However, the benefits from flash storage are highly dependent on the software application, available compute host memory, data set size, and application IOPS requirements. Only 50% of commonly used bioinformatics tools demonstrated a 2x or greater speed-up from flash or solid-state disks (Lee, 2016). Relative to other technical constraints, using lower-cost hard disk drives (HDD) is a perfectly acceptable approach.

Simplifying Choices

To simplify and streamline technology choices that lead to significantly reduced secondary analysis times while keeping pace with NGS data generation, Dell EMC and Parabricks set out to identify a modular, easy-to-scale reference architecture using a technical computing environment composed of Parabricks application software, NVIDIA DGX-1 systems, and Dell EMC Isilon network-attached storage capable of processing more than 1000 WGS per week.

    6 18 core Intel Xeon E5-2699 18C, 3.0 GHz processors with 384 GB RAM and 12 TB SAS, Red Hat 7.6


Building Blocks For Modular, Scale-out GPU Solution Architecture to Accelerate NGS Secondary Analysis

Figure 3 illustrates the technical computing environment hosted at Groupware Technology7 used to evaluate the acceleration of NGS secondary analysis. It is composed of eight DGX-1 systems, a Dell EMC Isilon H500 storage cluster, networking, and Parabricks application software.

Note that in a customer deployment, the number and type of GPU systems and Isilon storage nodes will vary and can be scaled independently to meet the requirements specific to an organization (Dell EMC, 2018). The last section of this document will discuss a starting configuration that can be scaled out as requirements change.

Storage: Dell EMC Isilon H500

The Dell EMC Isilon H500 storage cluster is used by many life science organizations today. It offers a reliable, easy-to-manage, and cost-effective balance between the performance and capacity needed to support NGS secondary analysis and other heterogeneous bioinformatics workflows.

The Dell EMC Isilon H500 is a hybrid (H) storage platform powered by the Isilon OneFS operating system. It uses a highly versatile yet straightforward scale-out storage architecture to speed access to massive amounts of data, while dramatically reducing cost and complexity. This hybrid platform uses a mix of HDD and flash drives that delivers up to 5 GB/s bandwidth and capacity ranging from 120 TB to 480 TB per chassis. Isilon hybrid storage systems integrate easily with Isilon All-Flash (e.g., F800) and Isilon Archive (e.g., A200) chassis as well as existing Isilon clusters.

Compute: NVIDIA DGX-1 System

The DGX-1 system is a fully integrated, turnkey hardware and software system that is purpose-built to accelerate deep learning (DL) and other technical computing workflows. Each DGX-1 system hosts eight Tesla V100 GPUs configured with NVLink technology in a hybrid cube mesh topology that provides an ultra-high bandwidth, low-latency fabric for inter-GPU communication. DGX-1 systems provide high bandwidth, low latency network interconnects for multi-node clustering over RDMA-capable fabrics.

The NVIDIA GPU Cloud (NGC) container registry provides researchers, investigators, and developers with simple access to a comprehensive catalog of GPU-accelerated software for AI, machine learning, and HPC workflows that take full advantage of the NVIDIA DGX-1 GPUs on-prem and in the cloud. The Appendix provides specific DGX-1 cluster configuration information.

Networking: Arista 7060CX2-32S

The Arista 7060CX2 is a 1RU, high-performance, high-density, fixed-configuration 40 GbE and 100 GbE data center switch with wire-speed Layer 2 and Layer 3 features for software-driven cloud networking. It delivers a rich choice of interface speeds and densities, allowing networks to seamlessly evolve from 10 GbE and 40 GbE to 25 and 100 GbE. The switch supports IEEE 25 GbE and provides a shared packet buffer pool of 22 MB with 450 ns latency.

    7 Visit Groupware Technology for more information about the NVIDIA DGX-1 POC program. https://www.groupwaretech.com/



Software: Parabricks Application Suite

Parabricks is a software suite for performing secondary analysis of next-generation sequencing (NGS) data. The suite provides access to many GPU-accelerated mapping, alignment, post-processing, and variant calling methods. Users can construct secondary analysis pipelines designed to deliver results at fast speeds and low cost. Parabricks analyzes a whole human genome in about 45 minutes, compared to about 30 hours using traditional CPU hardware for 30X WGS data.

The Parabricks software suite runs on a range of GPU platforms available on-prem or in the cloud. It scales linearly with the number of GPU resources. Results produced by Parabricks are consistent across different GPU platforms, and each execution generates the same results. The results are equivalent to the Broad GATK Best Practices pipeline. The current version of Parabricks supports all versions of GATK through v4.0.4. Furthermore, Parabricks analysis pipelines are readily customizable, and new steps can be added effortlessly. Additional information about the Parabricks application suite is summarized in the Appendix.

Figure 3. Dell EMC Isilon, NVIDIA Parabricks & DGX-1 System Test Environment


Evaluating Wall Clock Time for Secondary Analysis8

Methodology

To determine the recommended software and hardware configuration capable of keeping pace with the daily output of the latest NGS instrumentation, three test cases were evaluated. For each case, the observed wall clock time was recorded for the Parabricks secondary analysis pipeline(s) using different resource configurations, data layouts, and sample data.

Sample Data Sets

Whole Genome Sample NA12878

NA12878 is a human sample genome of Caucasian ancestry that is part of the CEPH Utah Reference Collection. It has been extensively studied, and WGS data generated from NA12878 is often used to benchmark secondary analysis pipelines. The sample input FASTQ data represent a 50X WGS for sample NA12878. The Appendix provides additional background information about sample data from NA12878.

Simons Genome Diversity Project

In 2016 the Simons Genome Diversity Project (SGDP) published one of the most extensive datasets of diverse, high-quality human genome sequences ever reported (Mallick, 2016). Of the 300 human samples collected, 279 samples are publicly available for secondary research. Rather than relying on a single sample data set like NA12878, NGS sequencing data from the SGDP is more representative of the variation and scale typically encountered in NGS labs today. The sequencing depth of coverage spans from 35X to over 80X, and the average depth of coverage for the entire cohort is 43X (Figure 4). The NGS data was provided as BAM files. The Appendix describes the BAM to FASTQ file conversion and provides additional information about the SGDP project data.

8 A Comment On Benchmarks. The benchmark results reported here were generated in May 2019. Benchmarking results are subject to variables such as hardware configuration, software versions, and source data. When comparing the results summarized in this paper to benchmark results generated elsewhere, be sure to understand the impact of these variables. Reach out to a Dell or Parabricks representative for the most up-to-date benchmark information.

    Figure 4. SGDP Sequence Coverage Distribution


Parabricks Secondary Analysis Pipeline

Analysis performed on NGS data is often described as a pipeline. A pipeline is simply a collection of methods or operations where the output of one operation becomes the input for the next operation. Four critical operations (mapping, alignment, pre-processing, and variant calling) make up most secondary analysis WGS pipelines.

Parabricks is a software suite of genomic analysis methods designed to take advantage of GPU acceleration. Many of the Parabricks methods are functionally equivalent to existing open-source methods. Parabricks operations are stitched together to create a secondary analysis pipeline best matched to the requirements of the sequencing application of interest, such as germline or somatic analysis. Parabricks is available as either a Docker or Singularity container and uses a variety of GPU resources. Figure 5 highlights the Parabricks v2.3.7 application suite.

Figure 6 illustrates the Parabricks “germline” pipeline used for benchmarking. It combines all the steps from fq2bam with GATK-HaplotypeCaller (HC)9 into a single command. For each test case below, sample paired-end FASTQ files are submitted to the germline pipeline. The BAM files generated by the germline pipeline are also submitted to the Parabricks DeepVariant (DV) operation. Each variant caller, HC and DV, outputs gVCF files, which adds 10 to 15 minutes of execution time to each variant calling step10. Each test case recorded the wall-clock time for each stage of the pipeline for each sample.
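A minimal germline invocation looks like the following (a sketch condensed from the full benchmarking script in the Appendix; the file paths are illustrative):

# Germline pipeline: paired-end FASTQ in, sorted BAM plus gVCF out, on four GPUs
pbrun germline \
  --ref Ref/Homo_sapiens_assembly38.fasta \
  --in-fq sample_1.fastq.gz sample_2.fastq.gz \
  --out-bam sample.bam \
  --out-variants sample.g.vcf.gz --gvcf \
  --num-gpus 4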

9 The Parabricks HC is equivalent to the accurate sequential GATK-HC Version 4.0.4.
10 When performing variant discovery, greater sensitivity is achieved when jointly calling variants across multiple samples. However, this strategy is computationally expensive and does not scale well. The key difference between a regular VCF and a gVCF file is that the gVCF file records variant information for every genomic location (i.e., site), whether a variant call exists or not. The goal is to have every site represented in the file to perform joint analysis of a cohort in subsequent steps designed to overcome computation and scaling bottlenecks. The gVCF output generated in this exercise will be used in future tertiary analysis pipeline benchmarking studies. See the Broad Best Practices Guide for more on this topic: https://software.broadinstitute.org/gatk/

    Figure 5. Parabricks Application Suite



    Figure 6. Parabricks Germline Pipeline

Why Two Variant Callers?

Calling the genetic variants present in an individual genome relies on billions of short, error-prone sequence reads. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome (Poplin, 2016). Many groups run consensus variant calling pipelines, which use more than one variant calling method to minimize the likelihood of missing a variant.

DeepVariant, a variant calling method developed by Google that applies a deep convolutional neural network, has been shown to outperform expert-driven statistical methods. However, calling variants for a 30X human genome and writing the variants out to a gVCF file takes approximately four hours and requires at least 1024 compute cores11. The Parabricks GPU-accelerated version of DeepVariant executes in less than 20 minutes for a 30X genome. The fast analysis time makes it possible to use DeepVariant alone or in combination with other expert-driven methods while minimizing the potential of creating a secondary analysis backlog.
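This paper does not prescribe a particular consensus scheme, but as one illustration, the intersection of the two call sets could be taken with bcftools, a tool outside the tested environment (the file names here are hypothetical):

# Keep only the sites called by both HaplotypeCaller and DeepVariant
# (both inputs must be bgzip-compressed and indexed VCFs)
bcftools isec -n=2 -p consensus_dir sample_hc.vcf.gz sample_dv.vcf.gz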

Software Tuning and Environment Validation

The Isilon storage cluster used the default OneFS settings. Isilon OneFS SmartConnect was disabled, and each DGX-1 system was directly mounted to an Isilon storage node for all test cases unless stated otherwise. The DGX-1 systems (i.e., the clients) were mounted over NFSv3 using the recommended NFS settings:

“async,nolock,rw,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288”
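On a client, these options translate into a mount command along the following lines (a sketch; the node address, export path, and mount point are illustrative):

# Mount an Isilon export over NFSv3 with the recommended options
sudo mount -t nfs -o vers=3,async,nolock,rw,hard,intr,timeo=600,retrans=2,rsize=524288,wsize=524288 \
  isilon-node1:/ifs/data /mnt/isilon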

iPerf, a tool used to test the maximum achievable bandwidth on IP networks, was used to validate the throughput of the IP network path from an Isilon node to a DGX-1 compute node NIC.
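A typical iPerf check pairs a server on one endpoint with a client on the other, for example (the hostname is illustrative):

iperf -s                      # run iPerf in server mode on one endpoint
iperf -c isilon-node1 -P 4    # from the DGX-1, drive four parallel streams to it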

The Tesla V100 GPUs were set with maximum frequency (boost) enabled:

#!/bin/bash
# Read the maximum supported memory and SM clocks from GPU 0
max_mem_freq="$(nvidia-smi -q -i 0 | grep Memory | grep MHz | tail -n1 | cut -d':' -f2 | cut -d' ' -f 2)"
max_SM_freq="$(nvidia-smi -q -i 0 | grep SM | grep MHz | tail -n1 | cut -d':' -f2 | cut -d' ' -f 2)"
# Enable persistence mode, disable auto boost, and pin the application clocks
sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac ${max_mem_freq},${max_SM_freq}

    Reference files needed for Parabricks operations are stored locally on each DGX-1 system.

    11 For more on DV and best practices see https://cloud.google.com/genomics/docs/tutorials/deepvariant



Correct installation and operation of the Parabricks software were verified by executing the Parabricks germline pipeline using the NA12878 FASTQ sample data set. A simple bash script and the related output log for the germline pipeline are shared in the Appendix.

    Test Cases

Three test cases were evaluated to determine the recommended solution architecture. Test Case 1 evaluates the impact of data layout on secondary analysis time. Test Case 2 examines the impact of the compute to storage resource ratio on analysis time. Test Case 3 evaluates the effective daily sample throughput using 4 or 8 GPUs per sample.

Test Case 1: Impact of Data Layout on Secondary Analysis Time

The purpose of this test case is to determine the optimal data layout and recommendations for secondary analysis using the Parabricks application suite and NVIDIA DGX-1 systems. Earlier benchmarking studies demonstrated that it is best to use a mixed data layout where input and output directories are mounted on the Isilon storage cluster, and a temporary directory located on the direct-attached storage (DAS) of the compute node is used for intermediate files.

The wall clock time was recorded for each step of the Parabricks germline pipeline and the DeepVariant step using three different data layouts (see the sketch after this list):

• Mixed (Default): Input and output directories are mounted on the Isilon storage cluster; a temporary directory is mounted from the DAS on each compute node.

• All Isilon: Input, output, and temporary directories are mounted on the Isilon storage cluster.

• All Local: Input, output, and temporary directories are mounted on the DAS on the DGX-1 system.
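For example, the Mixed layout might be realized as follows (a sketch; the mount points and paths are illustrative, and --tmp-dir is the same Parabricks flag used in the Appendix script):

# Input and output on the Isilon NFS mount, temporary files on local DAS
sudo mount -t nfs isilon-node1:/ifs/ngs /mnt/isilon
mkdir -p /raid/scratch/tmp
pbrun germline \
  --ref /raid/refs/Homo_sapiens_assembly38.fasta \
  --in-fq /mnt/isilon/input/sample_1.fastq.gz /mnt/isilon/input/sample_2.fastq.gz \
  --out-bam /mnt/isilon/output/sample.bam \
  --out-variants /mnt/isilon/output/sample.g.vcf.gz --gvcf \
  --num-gpus 4 \
  --tmp-dir /raid/scratch/tmp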

A subset of 64 data samples (paired FASTQ files) representing a range of WGS with sequence coverage from 35X to 80X was submitted to the germline pipeline followed by a DeepVariant step. Each sample used 4 GPUs per pipeline job. Each DGX-1 system was pinned to a named Isilon node (a 1:1 compute to storage node ratio).

Figure 7 summarizes the wall clock times for each stage of the germline pipeline. The wall clock time increases as the sequencing depth of coverage increases. When submitting two high coverage samples (> 55X) to a single DGX-1 system, pipeline jobs would fail due to limited host memory. In these cases, high coverage samples were submitted individually to a single DGX-1 system.

The All Isilon data layout generates the fastest wall-clock time during the BWA stage (which includes pre-processing operations). The All Isilon data layout was 18% faster than the mixed, default data layout. The BWA stage generates over 200 temporary files and consumes approximately a terabyte of storage per input sample data set. Using an All Isilon layout for the BWA stage is more forgiving than the other two data layouts, as there is no contention for storage resources on the compute node.

The BWA stage of the pipeline is the most read/write intensive stage, and the Isilon storage cluster performance was not significantly taxed. Peak throughput of 1.6 GB/s was observed using the mixed data layout, consuming 20% of the storage CPU resources. The All Isilon data layout generated a peak throughput of 5 GB/s, consuming 30% of the Isilon storage CPU resources.

During the HaplotypeCaller and DeepVariant stages, the storage cluster was mostly idle. The DGX-1 system CPU and GPU utilization varied for each stage of the pipeline. HaplotypeCaller used nearly 100% of the GPU resources while DeepVariant consumed about 30% of available GPU resources. The mixed data layout produced the fastest wall clock time, although the difference in wall clock times between the three data layouts is less than 10%. A similar pattern was observed for the DeepVariant step.

Figure 7. Impact of Data Layout on Secondary Analysis Time. Matching data layout to pipeline stage introduces opportunities to minimize secondary analysis time. An All Isilon data layout (blue) generates a shorter wall-clock time per sample relative to the Mixed (orange) or All Local (grey) data layouts.

Test Case 2: Impact of Compute to Storage Node Ratio on Secondary Analysis Time

Shared storage in NGS environments often services requests from external client applications in addition to the requests generated by the secondary analysis. Shared storage systems also consume resources to perform file system maintenance functions. This test case evaluates secondary analysis time using different DGX-1 system (i.e., client) to Isilon H500 node ratios to simulate a storage system under load.

A Generation 6 Dell EMC Isilon H500 4U storage chassis consists of four storage nodes. Each node provides disk, RAM, and CPU resources. OneFS, the Isilon file system, aggregates the Isilon node hardware components so that the whole becomes greater than the sum of the parts. The RAM is grouped into a single coherent cache, allowing I/O on any part of the storage cluster (or chassis). For access to one or more files, disk spindles and CPU are combined, increasing throughput, capacity, and IOPS as the cluster grows with each additional chassis.

The same 64 sample data sets used in Test Case 1 were submitted to the germline pipeline and DeepVariant step using different DGX-1 system to Isilon node ratios for each data layout. Each sample used four GPUs per pipeline job. By limiting the front-end network interface on the Isilon cluster, secondary analysis wall clock time was observed for three ratios:


    • One DGX-1 system: One Isilon Node (or 1:1), the default mode.

    • Two DGX-1 systems: One Isilon Node (or 2:1). 16 V100 GPU clients connect to a single Isilon node.

    • Four DGX-1 systems: One Isilon Node (or 4:1). 32 V100 GPU clients connect to a single Isilon node.

Figure 8 summarizes the wall clock time using different DGX-1 system to Isilon node ratios. In most cases, the difference between the 1:1 and 4:1 resource ratios across the different data layouts was within 5%. The 1:1 resource ratio was 8.5% faster than 4:1 using the mixed, default data layout during the HaplotypeCaller stage.

At a 4:1 resource ratio, the Isilon storage CPU resources peaked at 50% and averaged 20% for pipeline runs using a mixed data layout. This setup also generated a peak throughput of 1.7 GB/s.

Figure 8. DGX-1 system to Isilon H500 Node Ratio & Wall Clock Time. Increasing the DGX-1 system to Isilon node ratio from 1:1 to 4:1 does not significantly impact secondary analysis time. A 1:1 ratio using a mixed data layout (blue) reduced secondary analysis time on average by 8.5%. Results using the 2:1 resource ratio are omitted for simplicity.


Test Case 3: Secondary Analysis Time Using 4 vs. 8 GPUs Per Sample

The number of GPU resources is configurable for Parabricks pipelines. This test case evaluates secondary analysis time using four or eight Tesla V100 GPUs per sample input. 270 SGDP samples were submitted to the germline pipeline and DV step. The wall clock time for pipeline runs using 4 GPUs per sample or 8 GPUs per sample was recorded. Each pipeline run used the mixed data layout and the 1 DGX-1 system to 1 H500 node ratio configuration.
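With four GPUs per sample, two samples can run concurrently on one DGX-1 system by partitioning its eight GPUs, for example (a sketch; NVIDIA_VISIBLE_DEVICES is used the same way in the Appendix script, and the sample names and paths are illustrative):

# Two 4-GPU germline jobs side by side on one DGX-1 system
REF=/raid/refs/Homo_sapiens_assembly38.fasta
NVIDIA_VISIBLE_DEVICES=0,1,2,3 pbrun germline --num-gpus 4 --ref "$REF" \
  --in-fq A_1.fastq.gz A_2.fastq.gz --out-bam A.bam --out-variants A.g.vcf.gz --gvcf &
NVIDIA_VISIBLE_DEVICES=4,5,6,7 pbrun germline --num-gpus 4 --ref "$REF" \
  --in-fq B_1.fastq.gz B_2.fastq.gz --out-bam B.bam --out-variants B.g.vcf.gz --gvcf &
wait    # block until both samples finish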

Figure 9 summarizes the wall clock time for each stage of the pipeline for the SGDP cohort. The wall-clock times using four or eight GPUs per sample differ by about 30% for each pipeline stage. Note that for high coverage genomes (> 55X) it is recommended to use eight GPUs per sample, with all the compute node resources available to the sample (four GPUs per sample may still be acceptable if expanding localhost memory). Storage and compute resource utilization were similar to those observed in Test Cases 1 and 2.

If eight GPU resources are available, four GPUs per sample is the recommended resource assignment to maximize the effective daily secondary analysis throughput (e.g., number of samples analyzed per day). Table 1 summarizes the effective secondary analysis time per sample when using four or eight GPUs for each pipeline stage. Figure 10 summarizes the effective secondary analysis throughput for different pipeline setups.

    Figure 9. Secondary Analysis Time Using 4 & 8 GPUs per Sample


Table 1. Wall Clock Time By Pipeline Stage

EFFECTIVE PARABRICKS SECONDARY ANALYSIS TIME PER SAMPLE USING A DGX-1 & DELL EMC ISILON H500 CLUSTER

COVERAGE   4 GPU - BWA   8 GPU - BWA   4 GPU - HC   8 GPU - HC   4 GPU - DV   8 GPU - DV
AVE. 35X   35.0          41.7          17.9         17.5         20.8         15.0
AVE. 40X   40.9          58.5          18.4         20.8         23.0         26.7
AVE. 50X   49.1          71.0          21.0         32.1         28.3         40.3

Time in minutes. Averaged for samples within the noted coverage.

Figure 10. Effective Daily Sample Throughput By Pipeline Using An NVIDIA DGX-1 System & Isilon H500 Cluster


Discussion

The test case results highlight the impact of available GPU resources, data layout, and compute to storage resource ratios on secondary analysis time.

Modifying the number of GPU resources assigned per sample provided the greatest opportunity to reduce secondary analysis time. If eight or more Tesla V100 GPUs are available, it is best to assign four GPUs per sample to maximize the effective daily secondary analysis sample throughput (i.e., WGS analyzed per day). However, in disciplines like oncology, where NGS applications generate high sequence coverage samples (> 55X), Parabricks secondary analysis pipelines will benefit from assigning all of the DGX-1 system GPU and CPU resources to a single sample. Organizations using GPU system configurations with expanded host memory (1.5 TB), such as the NVIDIA DGX-2 system, Dell PowerEdge C4140, or Dell EMC DSS 8440, should evaluate four-GPU-per-sample setups for high coverage samples.

Secondary analysis times can be further minimized by matching the data layout best suited to the analysis pipeline operation. For example, using an All Isilon data layout for the BWA stage and a mixed data layout for either of the variant calling stages would reduce the secondary analysis time by 18%. Matching the data layout to the analysis pipeline stage creates the opportunity to process 2 to 3 additional genomes per day, or to run a consensus variant calling pipeline without significantly impacting daily analysis throughput (Figure 10).

When possible, it is best to use a 1:1 DGX-1 system to Isilon H500 node ratio, although altering the DGX-1 system to Isilon H500 ratio did not materially impact Parabricks secondary analysis times. Additional test cases, such as modifying the priority of OneFS file system services like SyncIQ, dropping (i.e., failing out) storage nodes, and servicing other external client requests during a secondary analysis, are needed to identify the DGX-1 system to Isilon node ratio that significantly impacts secondary analysis times. This upper limit will suggest how best to take advantage of all available GPUs along with other computing resources, as well as how to optimize the storage configuration, especially for blended Isilon storage clusters consisting of F, H, and A node types.

Finally, the test cases generated gVCF files anticipating that this output will be used to benchmark joint genotyping workflows. If the downstream tertiary NGS analysis strategy does not require gVCF output, writing variants out to the VCF file format provides an additional opportunity to further minimize secondary analysis time and increase daily sample throughput.

A Modular, Scale-Out Solution Architecture That Keeps Pace With an Illumina NovaSeq 6000 System

Today many healthcare and life science organizations are transitioning their NGS capabilities to take advantage of the Illumina NovaSeq 6000 system12. A single NovaSeq can generate approximately 7,100 40X human WGS annually (Illumina Inc., 2019). At steady-state operation, this is equivalent to generating 20 unanalyzed 40X human WGS per day.

The software and hardware solution architecture tested here can support up to ten NovaSeq systems and is capable of processing over 1,300 40X WGS per week using the Parabricks germline pipeline (BWA+GATK) or, alternatively, the Parabricks DeepVariant pipeline (BWA+DeepVariant). The configured system also provides enough active storage capacity to collect and process ten days of NGS data.
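These figures are consistent with the per-stage averages in Table 1. As a rough check, assuming the 4-GPU columns are effective per-sample times (two samples running concurrently per DGX-1 system):

per-sample time (40X, BWA + HC) ≈ 40.9 + 18.4 ≈ 59.3 minutes
daily throughput per DGX-1 system ≈ 1,440 minutes / 59.3 minutes ≈ 24 WGS per day
weekly throughput for 8 DGX-1 systems ≈ 24 × 7 × 8 ≈ 1,344 WGS per week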

    12 Illumina, Inc. earnings conference calls 2018 – 2019. www.investor.illumina.com



The results from the test cases point to a solution architecture matched to the output of one Illumina NovaSeq 6000 system. This solution architecture consists of three key components:

• 1 DGX-1 system (or 8 Tesla V100 GPUs)

• 1 Dell EMC Isilon H500 storage cluster (480 TB)

• The Parabricks application suite

In combination, these components create a modular, scale-out solution architecture that can keep pace with a fleet of NovaSeq systems generating sequencing data across a broad range of coverage. This solution architecture is flexible, easy to manage, and reliable.

Using this solution architecture, organizations gain the flexibility needed to respond quickly to changing sequencing demands. If another NovaSeq joins the sequencing fleet, adding just one additional DGX-1 system handles the increased output. This solution is also ideally suited for organizations operating in hybrid cloud environments. For example, if additional analysis capacity is needed for a short-term project, an organization can sync its NGS data to Isilon storage available at a co-location facility like Faction and then burst to Parabricks instance(s) available in any of the three major cloud providers: Azure, GCP, and AWS. The same environment can also be used to support machine learning workflows when NGS utilization is low.

Easy management is a high priority for many technical computing organizations tasked with supporting NGS workflows. Typically operating with a “do more with less” mindset, these organizations always welcome simplified storage management. For example, if the primary NGS germline analysis workflow shifted significantly to a somatic workflow with higher coverage samples and longer-running secondary analysis times, a team can respond by adding Isilon hybrid or flash nodes to match the increased computing requirements, while Isilon A-nodes can be added to provide archive capacity within 60 seconds. Similarly, the multi-protocol access provided by OneFS eliminates the need to host additional gateways dedicated to primary data capture from sequencing instruments or to expose VCF files to downstream analytics workflows that require Spark or HDFS.

Reliability is a cornerstone of secondary analysis too. In previous studies using data from sample NA12878, the Parabricks application suite produced reproducible and accurate results across different computing configurations. This simplifies computing choices and analysis strategies. Groups can use the Tesla V100 GPUs hosted in a DGX-1 system, six NVIDIA Tesla T4 GPUs in a Dell EMC PowerEdge R740 server, cloud-hosted NVIDIA Tesla P100 GPUs, or a combination of these resources, and achieve the same results every time. Also, Isilon OneFS storage capabilities like non-disruptive upgrades and configurable erasure coding minimize the likelihood of data loss and environment downtime. These types of capabilities are especially crucial in NGS environments like a DNA sequencing provider, where primary and processed data must be guaranteed for customer delivery.

In summary, the Parabricks application suite used along with Tesla V100 GPUs and Isilon H500 storage is a solution architecture designed to keep pace with the latest NGS sequencing capabilities. To learn more about how this solution accelerates your NGS secondary analysis, contact your Parabricks or Dell EMC representative.


References

Birney, E. (2017). Genomics in healthcare: GA4GH looks to 2022. Retrieved from https://doi.org/10.1101/203554

Dell EMC. (2018). Dell EMC Isilon and NVIDIA DGX-1 Servers for Deep Learning. Retrieved from https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/Dell_EMC_Isilon_and_NVIDIA_DGX_1_servers_for_deep_learning.pdf

Dell Technologies. (2018, October). High Performance Secondary Analysis of Genomic Data. Retrieved from https://www.dell.com/support/article/us/en/04/sln314233/high-performance-secondary-analysis-of-genomic-data?lang=en

Goyal, A. (2017). Ultra-Fast Next Generation Human Genome Sequencing Data. Retrieved from http://www.scirp.org/journal/paperinformation.aspx?paperid=74603

Griffith, M. (2015). Optimizing Cancer Genome Sequencing and Analysis. Cell Systems. Retrieved from https://doi.org/10.1016/j.cels.2015.08.015

Illumina. (2019, July 22). What is NGS Coverage? Retrieved from https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/coverage.html

Illumina Inc. (2019, July 25). NovaSeq™ 6000 Sequencing System. Retrieved from https://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/novaseq-6000-system-specification-sheet-770-2016-025.pdf

Kathiresan, N. (2017). Accelerating Next Generation Sequencing Data Analysis With System Level Optimizations. Nature Scientific Reports. doi:10.1038/s41598-017-09089-1

Lee, S. (2016). Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond. Briefings in Bioinformatics. doi:10.1093/bib/bbv073

Li, H. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. doi:10.1093/bioinformatics/btp324

Mallick, S. (2016). The Simons Genome Diversity Project: 300 Genomes From 142 Diverse Populations. Nature. doi:10.1038/nature18964

McKenna, A. (2010). The Genome Analysis Toolkit: A MapReduce Framework For Analyzing Next Generation Sequence Data. Genome Research. doi:10.1101/gr.107524.110

NHGRI. (2019, July 25). Retrieved from https://www.genome.gov/human-genome-project/Completion-FAQ

Poplin, R. (2016). Creating a Universal SNP and Small Indel Variant Caller With Deep Neural Networks. Nature Biotechnology. doi:10.1038/nbt.4235

Stephens, Z. D. (2015). Big Data: Astronomical or Genomical? PLOS Biology. doi:10.1371/journal.pbio.1002195

Suwinski, P. (2019). Advancing Personalized Medicine Through the Application of Whole Exome Sequencing and Big Data Analytics. Frontiers in Genetics. Retrieved from https://doi.org/10.3389/fgene.2019.00049

Yoon, K. (2018, March). Reference Architectures of Dell EMC Ready Bundle for HPC Life Sciences Refresh with 14th Generation Servers. Retrieved from https://downloads.dell.com/manuals/all-products/esuprt_software/esuprt_it_ops_datcentr_mgmt/high-computing-solution-resources_white-papers27_en-us.pdf


Appendix

Sample Data & Reference Files

Simons Genome Diversity Project

In 2016 the Simons Genome Diversity Project (SGDP) published one of the largest datasets of diverse, high-quality human genome sequences ever reported. The sampling strategy differs from previous studies of human genome diversity, which aimed to maximize medical relevance by studying populations with large numbers of present-day people. The SGDP study samples populations in a way that represents as much anthropological, linguistic, and cultural diversity as possible. All genomes in the dataset were sequenced to at least 30X coverage on an Illumina HiSeq 2000 platform using paired-end sequencing with reads of 2x 100 base pairs.

A full description of this cohort and the results of the project are described in The Simons Genome Diversity Project: 300 Genomes From 142 Diverse Populations (Mallick, S. et al., Nature, doi:10.1038/nature18964) and at the SGDP website: https://www.simonsfoundation.org/simons-genome-diversity-project/

Information about SGDP data is posted here: http://reichdata.hms.harvard.edu/pub/datasets/sgdp/

The raw data for the 279 genomes for which the informed consent documentation is consistent with fully public data release are available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710. No attempt was made to connect the genetic data to personal identifiers for the samples.

NA12878 Sample

The National Institute of Standards and Technology (NIST) developed the Genome in a Bottle (GIAB) Consortium to develop a set of human genome reference materials (RM). The availability of whole-genome RMs allows a methods-based approach for NGS technical validations in a standardized, cost-effective, and practical manner. NGS data generated from sample NA12878 (NIST RM 8398) were extensively characterized as part of the Thousand Genomes Project and continue to be used for comparing different sequencing technologies and developing bioinformatic tools. The NA12878 sequence read data set was downloaded from the European Nucleotide Archive at http://www.ebi.ac.uk/ena/data/view/ERR194147. This sample was sequenced to 50X depth on an Illumina HiSeq 2000 platform using paired-end sequencing with reads of 2x 100 base pairs.

Reference Files

The following reference files were used with the Parabricks secondary analysis pipeline(s):

Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
Homo_sapiens_assembly38.dbsnp138.vcf
Homo_sapiens_assembly38.fasta

These can be downloaded at https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1



Simons Genome Diversity Project BAM to FASTQ Data Preparation

Simons Genome Diversity Project (SGDP) sample data were provided in BAM format. Measuring the wall clock time for a secondary analysis pipeline requires starting with FASTQ files as input. The BAM files were converted back to FASTQ using a combination of samtools (v1.9, www.htslib.org) and an in-house MPI application.

The conversion of the approximately 300 BAM files to paired, sorted FASTQ files is summarized as follows. A batch of samtools fastq input.bam > output.fastq jobs was submitted to a 14-node HPC cluster mounted to a Dell EMC Isilon H500 storage cluster. Each BAM file was converted to a single, unsorted, uncompressed FASTQ file. Each conversion completed within 15 minutes. Each of the single, unsorted FASTQ files was then split into paired, sorted FASTQ files using an in-house MPI program with N (54 in our case) ranks. Each rank reads its chunk of the FASTQ file, sorts the chunk, and splits it into paired, gzipped FASTQ files. Any remaining unmatched reads are gathered, sorted, and split into a pair of FASTQ files. On average, each sort-split operation completed within 15 minutes using 14 compute nodes.


    Parabricks

Parabricks is a software suite for performing secondary analysis of next-generation sequencing (NGS) DNA data. A major benefit of Parabricks is that it is designed to deliver results quickly and at low cost: Parabricks can analyze a whole human genome in about 45 minutes, compared to about 30 hours for conventional analysis of 30X WGS data. More information can be found at https://developer.nvidia.com/nvidia-parabricks.

Installation:

1. Download. The Parabricks application can be requested from Parabricks by contacting [email protected].

2. Follow the steps outlined for local installation (https://docs.parabricks.com/installation/local-installation):
   a. tar -xvzf parabricks.tar.gz
   b. ./parabricks/installer.py --container singularity --install-location

3. Verify the installation:
   a. Download the sample data to the local SSD:
      wget https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz
   b. Test with the sample data using the following commands:
      tar -xvzf parabricks_sample.tar.gz
      /parabricks/pbrun fq2bam --ref parabricks_sample/Ref/Homo_sapiens_assembly38.fasta --in-fq parabricks_sample/Data/sample_1.fq.gz parabricks_sample/Data/sample_2.fq.gz --out-bam output.bam --num-gpus 4
   c. The above test should finish in ~150 seconds.
   d. Share the output printed on-screen during the run with Parabricks to verify correct operation.

    Additional Parabricks software documentation is posted here: https://www.nvidia.com/en-us/docs/parabricks/



    Sample Bash Script for Parabricks Germline Pipeline

This script executes the Parabricks germline pipeline followed by DeepVariant on 4 GPUs, using paired-end FASTQ files for sample NA12878 as input. The full pipeline (BWA-MEM + sorting + duplicate marking + BQSR + HaplotypeCaller) takes a total of 3,869 seconds (1 h 5 min) to finish on 4 GPUs. DeepVariant adds another 20 to 30 minutes, depending on sample coverage.

#!/bin/bash
# Germline pipeline (BWA-MEM + sorting + duplicate marking + BQSR +
# HaplotypeCaller) followed by DeepVariant for sample NA12878,
# pinned to four of the DGX-1's eight GPUs.
GENOME=NA12878
cd /ifs/1KWGS/${GENOME}/
TMPDIR=/scratch/${GENOME}/tmp2
rm -rf ${TMPDIR}
mkdir -p ${TMPDIR}
_START_=`date`
echo "${GENOME} start = ${_START_}"
export NVIDIA_VISIBLE_DEVICES="4,5,6,7"
pbrun germline \
  --ref /scratch/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
  --in-fq NA12878_1.fastq.gz NA12878_2.fastq.gz "@RG\tID:foo0\tLB:lib1\tPL:bar\tSM:${GENOME}\tPU:unit0" \
  --out-bam out2/${GENOME}.bam \
  --num-gpus 4 \
  --out-recal-file out2/${GENOME}.txt \
  --knownSites /scratch/parabricks_sample/Ref/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --knownSites /scratch/parabricks_sample/Ref/Homo_sapiens_assembly38.dbsnp138.vcf \
  --out-variants ./${GENOME}.g.vcf.gz \
  --gvcf \
  --tmp-dir ${TMPDIR}
pbrun deepvariant \
  --ref /scratch/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
  --num-gpus 4 \
  --in-bam ./out2/${GENOME}.bam \
  --out-variants ./out2/${GENOME}_dv.g.vcf.gz \
  --gvcf
_END_=`date`
echo "${GENOME} end = ${_END_}"

NA12878 start = Sat May 11 01:37:30 PDT 2019
------------------------------------------------------------------------------
||  Parabricks accelerated Genomics Pipeline                                ||
||  Version v2.3.5                                                          ||
||  GPU-BWA mem, Sorting, Marking Duplicates, BQSR                          ||
||  Contact: [email protected]                                              ||
------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
GPU-BWA mem
ProgressMeter    Reads          Base Pairs Aligned
[08:38:04]       5061430        790000000
[08:38:15]       10122388       1530000000
[08:38:26]       15182640       2250000000
[08:38:37]       20242528       3110000000
[08:38:48]       25301636       3800000000
…..
[09:10:44]       900490638      135290000000
[09:10:55]       905550360      136030000000



[09:11:06]       910608992      136810000000
[09:11:18]       915668866      137550000000
[09:11:29]       920728896      138380000000
GPU-BWA Mem time: 2054.673712 seconds
GPU-BWA Mem is finished.

GPU Sorting, Marking Dups, BQSR
ProgressMeter    SAM Entries Completed
[09:12:18]       5000000
[09:12:21]       10000000
[09:12:25]       15000000
[09:12:30]       20000000
[09:12:33]       25000000
…..
[09:23:03]       900000000
[09:23:06]       905000000
[09:23:09]       910000000
[09:23:13]       915000000
[09:23:16]       920000000
[09:23:19]       925000000
Total GPU-BWA Mem + Sorting + MarkingDups + BQSR Generation + BAM writing
Processing time: 2810.635388 seconds
[Parabricks Options Mesg]: Checking argument compatibility
------------------------------------------------------------------------------
||  Parabricks accelerated Genomics Pipeline                                ||
||  Version v2.3.5                                                          ||
||  GPU-GATK4 HaplotypeCaller                                               ||
||  Contact: [email protected]                                              ||
------------------------------------------------------------------------------
ProgressMeter - Current-Locus    Elapsed-Minutes  Regions-Processed  Regions/Minute
[09:25:08] chr1:44188473              0.2             247703           1486218
[09:25:18] chr1:82626947              0.3             463409           1390227
[09:25:28] chr1:117654487             0.5             660958           1321916
[09:25:38] chr1:159004742             0.7             795460           1193190
[09:25:48] chr1:170318282             0.8             861460           1033752
…..
[09:41:28] HLA-DRB1*09:21:15906      16.5           16655240          1009408
[09:41:38] HLA-DRB1*09:21:15906      16.7           16655240           999314
[09:41:48] HLA-DRB1*09:21:15906      16.8           16655240           989420
[09:41:58] HLA-DRB1*09:21:15906      17.0           16655240           979720
[09:42:08] HLA-DRB1*09:21:15906      17.2           16655240           970208
[09:42:18] HLA-DRB1*09:21:15906      17.3           16655240           960879
Total time taken: 1058.99
total 0.000 to vc 0.000 real write 0.000 problem 0.000 tmp0 0.000 tmp1 0.000 tmp2 0.000
------------------------------------------------------------------------------
||  Parabricks accelerated Genomics Pipeline                                ||
||  Version v2.3.5                                                          ||
||  deepvariant                                                             ||
||  Contact: [email protected]                                              ||
------------------------------------------------------------------------------
Starting DeepVariant



Running with 4 gpu, each with 4 workers
ProgressMeter - Current-Locus    Elapsed-Minutes
ProgressMeter - chr1:10000            0.1
ProgressMeter - chr1:33000            0.2
2019-05-11 09:43:13.009006: W src/allelecounter.cpp:310] Found duplicate read: 183 at reference_name: "chr1" position: 7865117
ProgressMeter - chr1:1555000          0.3
ProgressMeter - chr1:10514000         0.4
ProgressMeter - chr1:18886000         0.5
ProgressMeter - chr1:27771000         0.6
…..
ProgressMeter - chrX:119822000       32.1
ProgressMeter - chrX:135975000       32.2
ProgressMeter - chrX:155640000       32.3

    DeepVariant is finished, total time is 1947.259 seconds
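Because the script pins Parabricks to GPUs 4-7 via NVIDIA_VISIBLE_DEVICES, the other four GPUs of the DGX-1 remain free. A hypothetical sketch of processing two samples concurrently on one 8-GPU system follows; run_germline.sh stands in for the script above parameterized by sample name, and is an assumption rather than part of the tested configuration:

    #!/bin/bash
    # Hypothetical: two 4-GPU germline runs in parallel on one 8-GPU DGX-1.
    # run_germline.sh is an illustrative wrapper around the sample script
    # above; SAMPLE_A and SAMPLE_B are placeholders.
    NVIDIA_VISIBLE_DEVICES="0,1,2,3" ./run_germline.sh SAMPLE_A &
    NVIDIA_VISIBLE_DEVICES="4,5,6,7" ./run_germline.sh SAMPLE_B &
    wait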


    Dell EMC Isilon Network Attached Storage

About Dell EMC Isilon

Dell EMC Isilon scale-out storage solutions are robust yet simple to scale and manage, no matter how large your unstructured data environment becomes.

An overview of Isilon Generation 6 Hybrid storage capabilities, features, and options is summarized here: https://www.dellemc.com/en-za/collaterals/unauth/data-sheets/products/storage/h16071-ss-isilon-hybrid.pdf

For a detailed description of OneFS, the Isilon storage file system, see the Dell EMC Isilon OneFS Technical Overview posted here: https://www.emc.com/collateral/data-sheet/h16071-ss-isilon-hybrid.pdf

    Dell EMC Isilon H500 Configuration

DELL EMC ISILON H500 STORAGE CLUSTER: 4 TB HDD, 2x 3.2 TB SSD, 40GbE/40GbE

CHASSIS TYPE & NODE COUNT:   H500 / 12 nodes
USABLE (RAW) CAPACITY:       538 TB (720 TB)
SSD CAPACITY / NODE:         6.4 TB (76.8 TB total)
PROCESSORS / NODE:           2.2 GHz, 10-core
MEMORY / NODE:               128 GB (1.5 TB total)
FRONT-END NETWORKING:        2 x 40 GbE
BACK-END NETWORKING:         2 x QSFP+ 40 GbE Ethernet

SOFTWARE
OPERATING SYSTEM:            OneFS v8.1.2
ISILON SMARTCONNECT:         Round robin mode
SMARTREAD / L3 CACHE:        True (enabled, default)
SNAPSHOTS:                   None
DATA PROTECTION:             N+2:1 (default)
INSIGHTIQ:                   v4.1
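To illustrate how a compute node attaches to the cluster, a minimal NFS mount sketch is shown below. The SmartConnect zone name (isilon.example.com) and the /ifs export path are assumptions; SmartConnect round-robin distributes the resulting client connections across the 12 nodes' front-end interfaces:

    # Illustrative NFSv3 mount of the Isilon cluster from a compute node.
    # Zone name and export path are hypothetical.
    sudo mkdir -p /ifs
    sudo mount -t nfs -o vers=3,rsize=131072,wsize=131072 isilon.example.com:/ifs /ifs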



    The NVIDIA DGX-1 System

About the NVIDIA DGX-1 System

The DGX-1 system is the world's first purpose-built system optimized for deep learning, with fully integrated hardware and software that can be deployed quickly and easily. Its revolutionary performance significantly accelerates training time, making it the world's first deep learning supercomputer in a box.

The NVIDIA DGX-1 System technical specification is posted here: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-1/NVIDIA-DGX-1-Volta-AI-Supercomputer-Datasheet.pdf

    NVIDIA DGX-1 System Configuration

NVIDIA DGX-1 Cluster

NO. DGX-1:                   8
NVIDIA GPU CLOUD IMAGE:      nvcr.io/nvidia/tensorflow:18.09-py3
TENSORFLOW:                  1.10.0
ARISTA – SYSTEM IMAGE:       4.19.8M
DGX-1 – UBUNTU:              16.04.4 LTS
DGX-1 – BASE OS:             3.1.6
DGX-1 – BIOS:                5.11
DGX-1 – NVIDIA DRIVER:       384.125
DGX-1 – HOST MEMORY:         512 GB DDR4 LRDIMM
DGX-1 – LOCAL STORAGE:       7.6 TB, RAID 0, 4 x 1.92 TB SSD
DGX-1 – GPU:                 8x Tesla V100, 16 GB/GPU; 40,960 NVIDIA CUDA cores
DGX-1 – CPU:                 2x 20-core Intel Xeon E5-2698 v4, 2.2 GHz

© 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. Reference Number: H18251
