Containerized Genomic Workflows with Singularity

Table of Contents

Current Challenges with Genomic Pipelines

Best Practices for Genomic Workflows: the NF-Core Success Story

Singularity Containers

Deploy Containers at Scale with Nextflow

Conclusion


Current Challenges with Genomic Pipelines

Authors: Phil Ewels, Sven Fillinger, Alex Peltzer, Paolo Di Tommaso

Genomic workflows are becoming increasingly important in biomedical research, and also in everyday medicine. The promise of precision medicine is coming to fruition, with examples including NHS England, which recently announced that from October 2018 new cancer patients will routinely have their tumour DNA sequenced for key mutations. [1]

This approach has been made possible by the massive increase in the throughput and resolution of molecular-sequencing technology (also known as next-generation sequencing, or NGS), which has made DNA sequencing an efficient, affordable, everyday process. A recent publication estimated that over 60 million patients will have their genome sequenced in a healthcare context by 2025. [2]

Other studies found that the storage requirements for sequencing data will greatly outstrip YouTube’s projected annual storage needs for video by 2025. [3]

Given these figures, it comes as no surprise that researchers and computational biologists face major challenges in processing this data efficiently. Genomic data analysis workflows (a.k.a. pipelines) require massively parallel and distributed execution using clusters of computers. Owing to strict privacy concerns, computation is expected to be portable, so that it can be easily deployed where the data is stored (e.g., location-specific cloud platforms, on-premise clinical facilities, etc.) whilst still being completely reproducible.


Portability and reproducibility are critical requirements for life-science data analysis applications. In this post we'll show how the use of Singularity containers, along with a workflow manager such as Nextflow, provides an effective solution to those requirements.

Best Practices for Genomic Workflows: the NF-Core Success Story

Authors: Phil Ewels, Sven Fillinger, Alex Peltzer, Paolo Di Tommaso

As the scale of genomics increases, large sequencing platforms are springing up all over the world. They share common challenges: large data volumes, complex data-processing pipelines, and difficulty reproducing old analyses. The SciLifeLab National Genomics Infrastructure (NGI) provides access to genomic technologies to researchers across Sweden, and the Quantitative Biology Centre (QBiC) offers comparable services, primarily to university research groups, in southern Germany. Both centres run computational bioinformatics analyses on the DNA sequencing data produced locally and deliver processed results to hundreds of research groups.

Such computational workflows have proven difficult to deploy and to maintain over time due to the large number of software components and packages on which they depend. In addition, centrally managed software in academic environments and research centres can be difficult to install and unstable over time. Data analysis workflows are usually built around the needs of a single user and therefore tend not to be portable. Containers have proven to provide a solution to these problems; however, Docker has largely been passed over on traditional shared HPC systems due to well-known security concerns and the lack of a clear separation between administration and user privileges. [4] All of this has changed with the advent of Singularity, which provides a security and usage model that better fits the requirements of multi-user and multi-tenant data centres. Moreover, the container image format implemented by Singularity allows workflow developers to package and distribute an application's dependencies as a single portable image file that can be deployed across the whole spectrum of compute environments and easily shared between research groups. When combined with a workflow manager such as Nextflow, the entire analysis pipeline can be run anywhere, by anyone. Production workflows are isolated and benefit from increased stability. Previous analyses can be rerun with near-perfect reproducibility.
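
To make this concrete, the sketch below shows what a minimal Singularity definition file for such a portable image might look like. The base image and the choice of samtools as the packaged tool are illustrative assumptions, not taken from any specific pipeline.

    Bootstrap: docker
    From: ubuntu:18.04

    %post
        # Install the tool and its dependencies inside the image
        # (samtools is an illustrative choice)
        apt-get update && apt-get install -y samtools

    %runscript
        # Arguments passed to the container are forwarded to the tool
        exec samtools "$@"

Building this with a command along the lines of 'sudo singularity build samtools.sif samtools.def' yields a single image file that can be copied to any cluster and run there without installing anything on the host.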

The combination of Nextflow and Singularity has been gaining appeal in the field of bioinformatics over recent years. Starting with the existing workflows developed at SciLifeLab NGI in Stockholm, a new open source community called nf-core was initiated to bring together different genomics centres to collaborate on a common strategy for the implementation and deployment of scalable genomic analysis workflows. Built around a core of high-quality Nextflow scripts and community-built Singularity containers (created using Bioconda and Docker), nf-core pipelines work on virtually any compute infrastructure and with a range of datasets. All pipelines in the nf-core collection use a dedicated container and are continuously tested against a test dataset on a continuous integration (CI) server, to check for conformance with a set of guidelines and minimal requirements established through community best practices.
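
For instance, a released nf-core pipeline can typically be launched, complete with its bundled test dataset and Singularity-based dependencies, with a single command; the profile names below follow standard nf-core conventions.

    # Run the nf-core RNA-seq pipeline on its small built-in test
    # dataset, resolving all software through Singularity containers
    nextflow run nf-core/rnaseq -profile test,singularity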

At the time of writing, there are thirteen nf-core pipelines, with several more in development. Nine different genomics centres are listed as official contributors, and over 30 contributing scientists share their work with the GitHub community. The pipelines range in functionality from calculating gene counts from RNA transcriptomics data, to assessing immune compatibility for transplantation, to analysing DNA from ancient archaeological samples.

The pipelines developed by the nf-core community are routinely deployed across many production facilities. For example, the RNA-seq data analysis workflow has been used to process 24,370 RNA samples across 296 projects at SciLifeLab since 2017, running on a dedicated Slurm cluster with 4,000 cores and 2 PB of storage. [5]

The same workflow has processed 6,532 RNA samples across 207 projects at QBiC since 2017, running on the BinAC and CFC clusters with a combined capacity of 8,924 cores and 1.95 PB of storage. [6]


The quick uptake of nf-core amongst the bioinformatics community is testament to the success of Singularity and Nextflow. Experienced computational scientists are all too familiar with the pain of attempting to install complicated software toolchains to reproduce a published analysis; the proven simplicity of bundled Singularity software containers makes this a thing of the past - now analyses can be repeated with just a single Nextflow command.

Singularity Containers

Authors: Eduardo Arango, Keith Cunningham, Gwendolyn Kurtzer

As mentioned in the first post, “The quick uptake of nf-core amongst the bioinformatics community is testament to the success of Singularity and Nextflow.” This post gives an introduction to Singularity containers and explains why Singularity has become the go-to container technology among bioinformatics researchers.

One of the biggest problems in scientific computing is creating an environment for reproducible results: an application stack and its data must be able to run identically on any computational resource. Until recently, the job of ensuring scientific reproducibility fell to system administrators, who had to manage a complex set of tools, applications, data, and related resource dependencies. With the introduction of Singularity, a container platform designed specifically for scientific computing, achieving reproducibility has never been easier.

Within just the past few years, the use of containers has revolutionized the way in which industries and enterprises develop and deploy computational software and distributed systems. The containerization model is gaining traction because it provides improved reliability, reproducibility, and levels of customization that were not previously possible. From the onset of containerization in high performance computing, Singularity has led the way in providing container services, ranging from small clusters to massive supercomputers.


Container computing has revolutionized the way groups develop, share, and run software. This shift has been led by growing acceptance among corporate DevOps teams, which have built an ecosystem of tools to enable container computing. The paradigm made inroads into the high performance computing community via new tools such as Singularity, allowing users to securely run containers in environments where other container platforms were not feasible.

Today Singularity is the most widely used container solution in High-Performance Computing (HPC) centers. Enterprise users interested in AI, deep learning, compute-driven analytics, and IoT are increasingly demanding HPC-like tools and resources, and Singularity has many features that make it the preferred container solution for this new type of enterprise workload.

The Singularity container system started as an open source project in 2015, created by scientists who wanted a new method of packaging analytics applications for mobility and repeatability. By combining its success in HPC environments with the rapid expansion of artificial intelligence, deep learning, and machine learning in the enterprise, Singularity is uniquely positioned to address the needs of a new market: Enterprise Performance Computing (EPC).

Instead of a layered file system, the Singularity Image Format (SIF) encapsulates applications, data, scripts and supporting libraries in a single file. This simplifies the container management lifecycle and facilitates features such as image signing and encryption to produce trusted containers, which also enhance reproducibility and portability.
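
As a rough sketch of this lifecycle (file names are illustrative, and the commands assume Singularity 3.x), building, signing, and verifying an image might look like:

    # Build a SIF image from a definition file (builds typically
    # require root privileges)
    sudo singularity build analysis.sif analysis.def

    # Create a PGP keypair, then sign and verify the single image file
    singularity key newpair
    singularity sign analysis.sif
    singularity verify analysis.sif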

At runtime, Singularity blurs the lines between the container and the host system, allowing users to read and write persistent data and leverage hardware such as GPUs and InfiniBand with ease.
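
For example (the image, tool, and paths below are hypothetical), a GPU-enabled task with host data bound into the container might be launched as:

    # --nv exposes the host's NVIDIA GPU stack inside the container;
    # --bind mounts a host directory at the same path in the container
    singularity exec --nv --bind /data analysis.sif \
        gpu_caller --input /data/sample.bam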

The Singularity security model is also unique among container solutions. Users can build containers on resources they control, or by using the Sylabs Container Library, and then move those containers to a production environment, where the Linux kernel enforces privileges as it does with any other application. These features make Singularity a simple, secure container solution, well suited to HPC and EPC workloads.

Singularity blocks privilege escalation within the container: if a user wants to be root inside the container, they must be root outside the container. This usage paradigm mitigates many of the security concerns that exist with containers on multi-tenant shared resources. Programs inside the container can be called directly from outside the container, fully incorporating pipes, standard I/O, file system access, X11, and MPI.
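
A minimal illustration of this composability (the image name is hypothetical):

    # stdin and stdout flow straight through the containerized tool,
    # so it drops into an ordinary shell pipeline
    cat sample.fastq | singularity exec tools.sif gzip -c > sample.fastq.gz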

One of Singularity’s defining architectural features is the ability to execute containers as if they were native programs or scripts on the host computer. All standard input, output, error, pipes, IPC, and other communication pathways used by locally running programs are passed directly to and from the applications running within the container.

The key functions of Singularity are:

• Designed specifically for compute based workflows

• Uses Singularity Image Format (SIF)

• Portable containers that natively leverage GPUs

• Works with Mellanox and Intel interconnects

• MPI and PMIx Workflow compatible

• Runs on ARM, Power, and x86 platforms

• Service and Batch Job compatibility

• Native integration with batch scheduling systems and resource managers (Slurm, PBS, LSF, etc.)

  • Can use other containers as a source to build Singularity images (Docker Hub, Quay, other registries, OCI, etc.), as shown in the example below
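
For example, assuming Singularity 3.x naming conventions, an existing Docker image can be converted into a local SIF file and run directly:

    # Pull a Docker image from Docker Hub and convert it to SIF
    singularity pull docker://ubuntu:18.04

    # Run a program from the resulting image as a native command
    singularity exec ubuntu_18.04.sif cat /etc/os-release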

Sylabs provides licensing, enterprise-level support, professional services, cloud functionality, and value-added plugins for the Singularity container platform.


Deploy Containers at Scale with Nextflow

Author: Paolo Di Tommaso

Containers are exceptionally useful in scientific workflows. They allow for the encapsulation of software dependencies (e.g., the tools and libraries required by a data-analysis application) in one or more self-contained, ready-to-run, immutable container images that can be easily deployed on any platform supporting the container runtime.

For these reasons, containers have been rapidly adopted by the bioinformatics community, as the popularity of projects such as BioContainers shows. Because it implements a solution that better fits the security and operational requirements of HPC data centres, Singularity has emerged as the technology of choice for deploying these kinds of data-analysis applications at scale.

However, while running a Singularity container instance is a relatively straightforward task, a typical genomic workflow may require dozens of different container images - and spawn the execution of thousands of tasks, each of which runs in its own container instance.

Orchestrating the execution of such containerized workloads at scale then, and proactively managing related problems such as resource optimization, per-task data I/O staging, error recovery, etc., is anything but simple!

A workflow developer might be tempted to address these challenges by delegating them to a workload manager such as Slurm. Unfortunately, this results in only a partial solution: when workflow applications are tightly coupled with workload managers, they cannot be easily executed on another system or infrastructure, nor automatically tested with a continuous integration (CI) service. In other words, tight coupling between workflow applications and workload managers results in a loss of runtime mobility.

Nextflow is a workflow system designed to manage the orchestration and deployment of containerized workloads at scale, across clouds and clusters, in a portable and reproducible manner. Its main design principle is the decoupling of the application workflow logic from the underlying execution platform. Each task is defined in a self-contained manner and executed in its own containerized environment. Because execution specifics are kept in a separate configuration file, the nuances of containerization solutions are made transparent to the developer; for example, they are not required to provide any container-engine-specific commands.
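
The sketch below illustrates this separation with a minimal, hypothetical Nextflow pipeline (DSL2 syntax; the container tag is a placeholder in the Biocontainers style): the workflow logic names only a container image, while the executor and container engine are chosen in the configuration file.

    // main.nf - workflow logic only; no container-engine commands
    nextflow.enable.dsl = 2

    process INDEX_REFERENCE {
        // Each task of this process runs in its own container instance
        container 'quay.io/biocontainers/samtools:1.9--h91753b0_8'

        input:
        path genome

        output:
        path "${genome}.fai"

        script:
        """
        samtools faidx ${genome}
        """
    }

    workflow {
        INDEX_REFERENCE(Channel.fromPath(params.genome))
    }

    // nextflow.config - execution specifics live here, not in main.nf
    process.executor       = 'slurm'
    singularity.enabled    = true
    singularity.autoMounts = true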

Nextflow provides out-of-the-box support for the most widely used containerization technologies including Singularity, as well as a large range of execution platforms such as Slurm, Grid Engine, LSF, Kubernetes, AWS Batch, etc.

This support enables the definition of platform-agnostic data analysis workflows that can easily be deployed in a portable and reproducible manner across heterogeneous execution platforms. For example, a researcher can rapidly prototype a workflow application on their own laptop, isolating the dependencies with a Docker container; then, they could deploy it at scale on their institution’s Slurm cluster using Singularity. Finally, they could share the workflow with a colleague in a different organisation using a different batch scheduler, or even deploy it in the AWS cloud via the AWS Batch compute service.
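
A hypothetical configuration for that scenario could define one profile per platform (the queue, bucket, and region below are placeholders), so that switching platforms is just a matter of running, say, 'nextflow run main.nf -profile cluster':

    // nextflow.config
    profiles {
        laptop {
            docker.enabled = true          // prototype locally with Docker
        }
        cluster {
            process.executor    = 'slurm'  // institutional HPC cluster
            singularity.enabled = true
        }
        cloud {
            process.executor = 'awsbatch'  // AWS Batch compute service
            process.queue    = 'my-batch-queue'
            aws.region       = 'eu-west-1'
            workDir          = 's3://my-bucket/work'
        }
    }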

Nextflow is routinely used by many pharmaceutical companies and renowned public health institutions such as SciLifeLab. For example, at the Centre for Genomic Regulation (CRG), Nextflow has been used to deploy data-intensive computational workflows since 2014; it has orchestrated the execution of over 12 million jobs, totaling 1.4 million CPU hours - with the majority of these jobs executing within Singularity containers.

Nextflow is a free and open source software solution for application workflows developed by the Centre for Genomic Regulation (CRG). Seqera Labs was recently incorporated as a spin-off from the CRG to provide enterprise-level support and professional services around the Nextflow platform, as well as to explore new, innovative products to power the next generation of big data analysis applications.


Conclusion

The complexity of genomic workflows challenges researchers within the scientific community to better manage the deployment of scientific applications in a scalable and portable manner. Containers have proven to provide a pragmatic and efficient solution to the problem of packaging complex software dependencies into self-contained, ready-to-run executable runtimes that can be deployed anywhere. Singularity, in particular, provides an HPC-friendly container runtime that streamlines adoption in the data centers and facilities in which these kinds of applications are generally executed. Nextflow provides a simple yet powerful workflow system that enables the deployment of complex parallel and distributed application workflows across heterogeneous computational platforms. Out of the box, Nextflow supports Singularity containers as well as a broad range of workload managers, making it possible to scale containerized workloads with ease. Combined, Singularity and Nextflow form an extremely powerful solution for researchers deploying computational data analyses at scale. This combination also helps address the so-called “reproducibility crisis” that is pervasive in data-intensive scientific fields such as machine learning and genomics; from a software-technology standpoint, it has never been easier to tackle these issues.

If you wish to learn more about us and our products, please visit us at:

Nextflow: https://www.nextflow.io

nf-core: https://nf-co.re/

Seqera Labs: https://seqera.io

Singularity: https://www.sylabs.io/singularity/

Sylabs: https://www.sylabs.io


This white paper is for informational purposes only. SYLABS MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS WHITE PAPER. Sylabs cannot be responsible for errors in typography or photography.

Singularity is a trademark of Sylabs Inc.

Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Sylabs disclaims proprietary interest in the marks and names of others.

©Copyright 2018 Sylabs Inc. All rights reserved. Information in this document is subject to change without notice.


Bibliography

1. Ian Sample. “Routine DNA tests will put NHS at the 'forefront of medicine’.” The Guardian (July 2018): https://www.theguardian.com/science/2018/jul/03/nhs-routine-dna-tests-precision-cancer-tumour-screening

2. Ewan Birney, Jessica Vamathevan, Peter Goodhand. “Genomics in healthcare: GA4GH looks to 2022.” bioRxiv (October 2017): https://www.biorxiv.org/content/biorxiv/early/2017/10/15/203554.full.pdf

3. Erika Check Hayden. “Genome researchers raise alarm over big data.” Nature (July 2015): https://www.nature.com/news/genome-researchers-raise-alarm-over-big-data-1.17912

4. Amir Jerbi. “Docker security rules to live by.” InfoWorld (January 2017): https://www.infoworld.com/article/3154711/security/8-docker-security-rules-to-live-by.html

5. Uppsala Universitet. “The Irma Cluster”: http://www.uppmax.uu.se/resources/systems/the-irma-cluster/

6. Eberhard Karls Universität Tübingen. “Zentrum für Datenverarbeitung (ZDV)” [Center for Data Processing]: https://uni-tuebingen.de/einrichtungen/zentrum-fuer-datenverarbeitung/dienstleistungen/serverdienste/computing/hardware/binac.html
