DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments

O. Valladares1,2, C.-F. Lin1,2, D. M. Childress1,2, E. Klevak3, E. T. Geller1, Y.-C. Hwang2,4, E. A. Tsai4,5, A. B. Partch1,2, G. D. Schellenberg1, L.-S. Wang1,2

1) Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA; 2) Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA; 3) Department of Physics, University of Washington, Seattle, WA; 4) Genomics and Computational Biology Graduate Group, University of Pennsylvania, Philadelphia, PA; 5) Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA.

Introduction

Next-generation sequencing (NGS) has redefined what big data means in biomedical research. Advances in quality and capacity have lowered the cost of implementation, allowing NGS to be used in a wide range of experiments at a variety of scales, from a few samples in small laboratories to thousands of samples in multi-institute collaborations. Processing terabytes of data requires information technology and bioinformatics expertise, which can be daunting for small laboratories with limited resources. The programs we developed enable these groups to process DNA-seq data and identify single-nucleotide variants and small insertions and deletions (indels).

DRAW: DNA Resequencing Analysis Workflow

• Integrates open-source programs to analyze DNA-seq data in a Linux environment
  • GATK (http://www.broadinstitute.org/gatk/)
  • SAMtools (http://samtools.sourceforge.net/)
  • BWA (http://bio-bwa.sourceforge.net/)
  • Picard (http://picard.sourceforge.net/)
  • SnpEff (http://snpeff.sourceforge.net/)
• Operates on a distributed resource management system (Oracle Grid Engine)
• Job dependency and error checking
• Available on Amazon Elastic Compute Cloud (EC2)

Acknowledgements

We thank members of the Schellenberg and Wang labs, collaborators from the ARRA autism sequencing consortium, Nancy B. Spinner, Samir Wadwahan, Maja Bucan, Chris Stoeckert, and members of the Penn HTS group for their constructive input.

Funding: The authors gratefully acknowledge funding from NIMH (R01 MH089004, R01 MH094382, and R01 MH094382), NIA (U24 AG041689, U01 AG032984, P30 AG010124), NINDS (P50 NS053488), and CurePSP Foundation.

SneakPeek: Quality Metrics Management System

• Provides an overview of all samples processed through a dynamic web interface
• Allows the user to assess the quality of sequencing data
  • Identify samples with unusual QC metric(s)
  • Identify batch problems
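One way to picture "samples with unusual QC metric(s)" is a simple robust outlier rule. The sketch below is only an illustration, not SneakPeek's method: it assumes a modified z-score based on the median absolute deviation, and the sample IDs, depth values, and the 3.5 cutoff are all hypothetical.

```python
from statistics import median


def flag_unusual_samples(metrics, threshold=3.5):
    """Return sample IDs whose metric has a modified z-score above threshold.

    metrics: dict mapping sample ID -> numeric QC metric value.
    The rule (median absolute deviation) is an assumption for illustration.
    """
    values = list(metrics.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so this rule cannot single out any sample
    flagged = []
    for sample, value in metrics.items():
        score = 0.6745 * abs(value - med) / mad
        if score > threshold:
            flagged.append(sample)
    return flagged


# Hypothetical mean depth of coverage per sample (made-up numbers).
mean_depth = {
    "S01": 61.0, "S02": 58.5, "S03": 60.2,
    "S04": 59.1, "S05": 12.3, "S06": 62.4,
}
unusual = flag_unusual_samples(mean_depth)
print(unusual)
```

Running the same check per sequencing batch, rather than per sample, would be one way to look for batch problems.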

DRAW+SneakPeek Availability

• Released under the MIT license
• Free for academic and non-profit use
• Available at the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) (http://www.niagads.org/)
  • Source code
  • Amazon Machine Images (AMIs)
  • Install guide, documentation, tutorial

Running DRAW

One command runs all three phases of DRAW (Phases 1 to 3 below); a further step imports the results into SneakPeek.

• Input: demultiplexed FastQ files
• Phase 1: Mapping using BWA
  • Align reads, paired ends
• Phase 2: QC using GATK/Picard/SAMtools
  • Mark duplicates, local realignment, base quality recalibration
• Phase 3: Variant and coverage using GATK/SnpEff
  • Variant detection, filtration, annotation
• Phase 5: Import into MySQL tables using in-house scripts
  • Quality metrics on SneakPeek
• Output: read, base/depth coverage, and QC metrics; annotated VCF file
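Because DRAW runs on Grid Engine with job dependencies, the three phases can be pictured as a chain of held jobs. The following is a minimal sketch, not DRAW's actual command: it assumes one job per phase and uses the standard `qsub` options `-N` (job name) and `-hold_jid` (wait for a named job). The script file names and the sample ID are hypothetical.

```python
import shlex

# Phase order follows the workflow above; the script names are hypothetical.
PHASES = [
    ("phase1_mapping", "phase1_mapping.sh"),
    ("phase2_qc", "phase2_qc.sh"),
    ("phase3_variants", "phase3_variants.sh"),
]


def build_qsub_commands(sample, phases=PHASES):
    """Build one qsub command per phase, each held until the previous job ends."""
    commands = []
    previous_job = None
    for phase_name, script in phases:
        job_name = f"{sample}_{phase_name}"
        cmd = ["qsub", "-N", job_name]
        if previous_job is not None:
            # Grid Engine releases this job only after the named job finishes.
            cmd += ["-hold_jid", previous_job]
        cmd += [script, sample]
        commands.append(shlex.join(cmd))
        previous_job = job_name
    return commands


commands = build_qsub_commands("S01")
for line in commands:
    print(line)
```

The sketch only builds the command strings; submitting them, and checking each phase for errors before the next one starts, is left out.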

One flow cell: Illumina HiSeq 2000, 100-bp paired-end, ~350 Gb, 34 multiplexed samples using the NimbleGen Human Exome v2 library.

1.1 TB of data in two days; total cost $528

Running DRAW on Amazon

• Create AWS user account
• Create your EC2 instance
  • Cluster Compute Eight Extra Large Instance
  • Amazon Machine Image (AMI) ID: ami-a4c934cd
• Attach EBS volume
  • At least 2 TB for a flow cell
• Log in
  • Just like how you log in to your remote Linux server!

Guide available on NIAGADS.org
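For readers who script their setup rather than use the web console, the steps above can be collected into one set of launch parameters. This is a sketch under several assumptions: `cc2.8xlarge` is assumed to be the API name for the Cluster Compute Eight Extra Large instance, the key pair name and device name are hypothetical, and the volume is requested at launch rather than attached afterwards. Whether the AMI and instance type are still offered is not something the poster can confirm.

```python
def draw_instance_request(key_name, volume_gib=2048):
    """Return launch parameters mirroring the EC2 setup steps listed above.

    Nothing is sent to AWS; the function only builds a dictionary.
    """
    return {
        "ImageId": "ami-a4c934cd",      # AMI ID given on the poster
        "InstanceType": "cc2.8xlarge",  # assumed API name for the instance type
        "MinCount": 1,
        "MaxCount": 1,
        "KeyName": key_name,            # hypothetical key pair name
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sdf",            # hypothetical device name
                "Ebs": {"VolumeSize": volume_gib},   # size in GiB (about 2 TB)
            },
        ],
    }


request = draw_instance_request("my-lab-key")
print(request["ImageId"], request["InstanceType"])
```

The keys follow the parameter names accepted by the `run_instances` call of the boto3 EC2 client, so the dictionary could be passed to it as keyword arguments once credentials are configured.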

What Motivates DRAW+SneakPeek

• Many labs can do a DNA-Seq study but can’t analyze the data without help and large computing power
• Large institutions have their own IT teams and don’t need our help. What about the rest of us?
• Cloud computing is a viable solution for data analysis, but it is hard to master!
• High-quality DNA-Seq workflow that is:
  • easy to use
  • easy to run in a cloud computing environment

Features of DRAW

• High-quality workflow
  • Developed through our collaboration with the Broad Institute (autism sequencing consortium)
  • Based on GATK best practices
• Easy to use
  • Two config files and five Unix commands to run
  • No complicated GUI involved
• Comprehensive documentation
  • Step-by-step guide on configuring an Amazon EC2 instance
• You get the same result using the cloud or your local HPC cluster

Running DRAW on Amazon EC2: A benchmark study

• Data download: $146 (upload is free)
• Storage: $95
• Elastic Block Storage I/O: $17
• Read mapping: $140
• Base quality recalibration/local realignment: $103
• Variant calling: $27
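The cost items in this benchmark can be checked against the stated total of $528 for the 34-sample flow cell. The short calculation below uses only figures from the poster; the per-sample cost and the compute subtotal are derived values, not numbers reported by the authors.

```python
# Cost items in US dollars, as listed in the benchmark.
costs_usd = {
    "Data download": 146,
    "Storage": 95,
    "Elastic Block Storage I/O": 17,
    "Read mapping": 140,
    "Base quality recalibration/local realignment": 103,
    "Variant calling": 27,
}

samples = 34  # multiplexed samples on the flow cell

total = sum(costs_usd.values())
per_sample = total / samples

# Analysis steps only (mapping, recalibration/realignment, variant calling).
compute = (
    costs_usd["Read mapping"]
    + costs_usd["Base quality recalibration/local realignment"]
    + costs_usd["Variant calling"]
)

print(f"Total: ${total}")
print(f"Per sample: ${per_sample:.2f}")
print(f"Compute share: {100 * compute / total:.1f}%")
```

The items sum to $528, matching the stated total, which works out to roughly $15.53 per sample; the three analysis steps account for about half of the bill, with data transfer and storage making up the rest.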

Workflow Comparison

A comparison of DRAW+SneakPeek with other workflows.

Reference

Lin CF, Valladares O, Childress DM, Klevak E, Geller ET, Hwang YC, Tsai EA, Schellenberg GD, and Wang LS. DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments. Bioinformatics. 2013 Oct 1;29(19):2498-2500. Epub 2013 Aug 13.