1
DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments O. Valladares 1,2 , C.-F. Lin 1,2 , D. M. Childress 1,2 , E. Klevak 3 , E. T. Geller 1 , Y.-C. Hwang 2,4 , E. A. Tsai 4,5 , A. B. Partch 1,2 , G. D. Schellenberg 1 , L.-S. Wang 1,2 1) Department of Pathology and Laboratory Medicine, University of Pennsylvania. Philadelphia, PA; 2) Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA; 3) Department of Physics, University of Washington, Seattle, WA; 4) Genomics and Computational Biology Graduate Group, University of Pennsylvania. Philadelphia, PA; 5) Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA. Next-generation sequencing (NGS) has redefined what big data means in biomedical research. Advances in quality and capacity have led to a declining cost of implementation, allowing NGS to be used in a wide range of experiments at a variety of scales; from a few samples in small laboratories to thousands of samples from multi-institute collaborations. Processing terabytes of data requires a certain level of information technology and bioinformatics expertise, which can be daunting to small laboratories with limited resources. The programs we developed will enable these groups to process DNA-seq data and identify single-nucleotide variants and small insertions and deletions (indels). Introduction Integrates open-source programs to analyze DNA-seq data in a Linux environment GATK (http://www.broadinstitute.org/gatk/ ) SAMtools (http://samtools.sourceforge.net/ ) BWA (http://bio-bwa.sourceforge.net/ ) PICARD (http://picard.sourceforge.net/ ) SnpEff (http://snpeff.sourceforge.net/ ) Operates on distributed resource management system (Oracle Grid Engine) Job dependency and error checking Available on Amazon Elastic Cloud Computing DRAW: DNA Resequencing Analysis Workflow Acknowledgements We thank the constructive input from members of the Schellenberg and Wang labs, collaborators from the ARRA autism sequencing consortium, Nancy B. Spinner, Samir Wadwahan, Maja Bucan, Chris Stoeckert, and members of the Penn HTS group. Funding: The authors gratefully acknowledge funding from NIMH (R01 MH089004, R01 MH094382, and R01 MH094382), NIA (U24 AG041689, U01 AG032984, P30 AG010124), NINDS (P50 NS053488), and CurePSP Foundation. SneakPeek: Quality Metrics Management System Provides an overview of all samples processed through a dynamic web interface Allows user to assess quality of sequencing data Identify samples with unusual QC metric(s) Identify batch problems DRAW+SneakPeek Availability Released under the MIT license Free for academic and non-profit use Available at the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) ( http://www.niagads.org/ ) Source code Amazon Machine Images (AMIs) Install guide, documentation, tutorial Running DRAW One command will run all three phases of DRAW: Phase 5: Import into MySQL tables using in-house scripts Phase 3: Variant and coverage using GATK/snpEff Phase 2: QC using GATK/Picard/Samtools Phase 1: Mapping using BWA Input Demultiplexed FastQ files Align reads, Paired ends Mark duplicates, Local realignment, Base quality recalibration Variant detection, filtration, annotation Quality metrics on SneakPeek Read, Base/Depth Coverage, QC metrics Annotated VCF file One flow cell: Illumina Hi-Seq 2000, 100-bp pair-end, ~350 Gb, 34 multiplexed samples using Nimblegen Human Exome v2 Library. 1.1TB data in two days; total cost $528 Running DRAW on Amazon Create AWS User Accoun t Create your EC2 Instanc e Cluste r Comput e Eight Extra Large Instan ce Create your EC2 Instance Amazon Machine Image (AMI) ID: ami- a4c934cd Attach EBS Volume At least 2TB for a flow cell Log in Just like how you login to your remote linux server! Guide available on NIAGADS.org High quality DNA-Seq workflow that is: easy to use can run on cloud computing environment easily Large institutions have their own IT teams and don’t need our help. What about the rest of us? Cloud computing is a viable solution for data analysis, but it is hard to master! Many labs can do DNA- Seq study but can’t analyze data without help and large computing power What Motivates Draw+SneakPeek Features of DRAW High quality workflow Developed through our collaboration with Broad Institute (autism sequencing consortium) Based on GATK best practices Easy to use Two config files and five unix commands to run No complicated GUI involved Comprehensive documentation Step-by-step guide on configuring an Amazon EC2 instance You get the same result using cloud or your local HPC cluster Data download, $146(upload is free) Storage; $95 Elastic Block Storage I/O; $17 Read mapping; $140 Base quality recalibration/local realignment, $103 Variant calling; $27 Running DRAW on Amazon EC2: A benchmark study Workflow Comparison A comparison of DRAW+SneakPeek with other workflows. Reference Lin CF, Valladares O, Childress DM, Klevak E, Geller ET, Hwang YC, Tsai EA, Schellenberg GD, and Wang LS. DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments. Bioinformatics, Oct 1;29(19):2498- 2500. Epub 2013 Aug 13.

DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments O. Valladares 1,2, C.-F. Lin 1,2, D. M. Childress 1,2, E. Klevak

Embed Size (px)

Citation preview

Page 1: DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments O. Valladares 1,2, C.-F. Lin 1,2, D. M. Childress 1,2, E. Klevak

DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq ExperimentsO. Valladares1,2, C.-F. Lin1,2, D. M. Childress1,2, E. Klevak3, E. T. Geller1, Y.-C. Hwang2,4, E. A. Tsai4,5, A. B. Partch1,2, G. D. Schellenberg1, L.-

S. Wang1,21) Department of Pathology and Laboratory Medicine, University of Pennsylvania. Philadelphia, PA; 2) Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA; 3) Department of Physics, University of Washington, Seattle, WA; 4) Genomics and Computational Biology Graduate Group, University of Pennsylvania. Philadelphia, PA; 5) Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA.

Next-generation sequencing (NGS) has redefined what big data means in biomedical research. Advances in quality and capacity have led to a declining cost of implementation, allowing NGS to be used in a wide range of experiments at a variety of scales; from a few samples in small laboratories to thousands of samples from multi-institute collaborations. Processing terabytes of data requires a certain level of information technology and bioinformatics expertise, which can be daunting to small laboratories with limited resources. The programs we developed will enable these groups to process DNA-seq data and identify single-nucleotide variants and small insertions and deletions (indels).

Introduction

• Integrates open-source programs to analyze DNA-seq data in a Linux environment• GATK (http://www.broadinstitute.org/gatk/)• SAMtools (http://samtools.sourceforge.net/)• BWA (http://bio-bwa.sourceforge.net/)• PICARD (http://picard.sourceforge.net/)• SnpEff (http://snpeff.sourceforge.net/)

• Operates on distributed resource management system (Oracle Grid Engine)

• Job dependency and error checking• Available on Amazon Elastic Cloud Computing

DRAW: DNA Resequencing Analysis Workflow

AcknowledgementsWe thank the constructive input from members of the Schellenberg and Wang labs, collaborators from the ARRA autism sequencing consortium, Nancy B. Spinner, Samir Wadwahan, Maja Bucan, Chris Stoeckert, and members of the Penn HTS group.

Funding: The authors gratefully acknowledge funding from NIMH (R01 MH089004, R01 MH094382, and R01 MH094382), NIA (U24 AG041689, U01 AG032984, P30 AG010124), NINDS (P50 NS053488), and CurePSP Foundation.

SneakPeek: Quality Metrics Management System

• Provides an overview of all samples processed through a dynamic web interface

• Allows user to assess quality of sequencing data• Identify samples

with unusual QC metric(s)

• Identify batch problems

DRAW+SneakPeek Availability• Released under the MIT license• Free for academic and non-profit use • Available at the National Institute on Aging

Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) (http://www.niagads.org/) • Source code• Amazon Machine Images (AMIs)• Install guide, documentation, tutorial

Running DRAW

One command will run all three phases of DRAW:

Phase 5: Import into MySQL tables using

in-house scripts

Phase 3: Variant and coverage using

GATK/snpEff

Phase 2: QC using GATK/Picard/Samtool

s

Phase 1: Mapping using BWA

Input Demultiplexed FastQ files

Align reads, Paired ends

Mark duplicates, Local realignment, Base quality recalibration

Variant detection, filtration, annotation

Quality metrics on SneakPeek

Read, Base/Depth Coverage, QC metrics

Annotated VCF file

One flow cell: Illumina Hi-Seq 2000, 100-bp pair-end, ~350 Gb, 34 multiplexed samples using Nimblegen Human Exome v2 Library.

1.1TB data in two days; total cost $528

Running DRAW on Amazon

Create AWS User Account

Create your EC2 Instance• Cluster

Compute Eight Extra Large Instance

Create your EC2 Instance• Amazon

Machine Image (AMI) ID: ami-a4c934cd

Attach EBS Volume• At least

2TB for a flow cell

Log in• Just like

how you login to your remote linux server!

Guide available on NIAGADS.org

High quality DNA-Seq workflow that is:• easy to use• can run on cloud computing environment easily

Large institutions have their own IT teams and don’t need our help.

What about the rest of us?

Cloud computing is a viable

solution for data analysis, but it is hard to master!

Many labs can do DNA-Seq study

but can’t analyze data without help

and large computing power

What Motivates Draw+SneakPeek

Features of DRAW

High quality workflow• Developed through

our collaboration with Broad Institute (autism sequencing consortium)

• Based on GATK best practices

Easy to use• Two config files and

five unix commands to run

• No complicated GUI involved

Comprehensive documentation• Step-by-step guide on

configuring an Amazon EC2 instance

You get the same result using cloud or

your local HPC cluster

Data download, $146(upload is free)

Storage; $95Elastic Block Storage I/O; $17

Read mapping; $140

Base quality recalibration/local re-

alignment, $103

Variant calling; $27

Running DRAW on Amazon EC2: A benchmark study

Workflow ComparisonA comparison of DRAW+SneakPeek with other workflows.

ReferenceLin CF, Valladares O, Childress DM, Klevak E, Geller ET, Hwang YC, Tsai EA, Schellenberg GD, and Wang LS. DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments. Bioinformatics, Oct 1;29(19):2498-2500. Epub 2013 Aug 13.