Upload
lesley-ward
View
217
Download
2
Tags:
Embed Size (px)
Citation preview
DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq ExperimentsO. Valladares1,2, C.-F. Lin1,2, D. M. Childress1,2, E. Klevak3, E. T. Geller1, Y.-C. Hwang2,4, E. A. Tsai4,5, A. B. Partch1,2, G. D. Schellenberg1, L.-
S. Wang1,21) Department of Pathology and Laboratory Medicine, University of Pennsylvania. Philadelphia, PA; 2) Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA; 3) Department of Physics, University of Washington, Seattle, WA; 4) Genomics and Computational Biology Graduate Group, University of Pennsylvania. Philadelphia, PA; 5) Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA.
Next-generation sequencing (NGS) has redefined what big data means in biomedical research. Advances in quality and capacity have led to a declining cost of implementation, allowing NGS to be used in a wide range of experiments at a variety of scales; from a few samples in small laboratories to thousands of samples from multi-institute collaborations. Processing terabytes of data requires a certain level of information technology and bioinformatics expertise, which can be daunting to small laboratories with limited resources. The programs we developed will enable these groups to process DNA-seq data and identify single-nucleotide variants and small insertions and deletions (indels).
Introduction
• Integrates open-source programs to analyze DNA-seq data in a Linux environment• GATK (http://www.broadinstitute.org/gatk/)• SAMtools (http://samtools.sourceforge.net/)• BWA (http://bio-bwa.sourceforge.net/)• PICARD (http://picard.sourceforge.net/)• SnpEff (http://snpeff.sourceforge.net/)
• Operates on distributed resource management system (Oracle Grid Engine)
• Job dependency and error checking• Available on Amazon Elastic Cloud Computing
DRAW: DNA Resequencing Analysis Workflow
AcknowledgementsWe thank the constructive input from members of the Schellenberg and Wang labs, collaborators from the ARRA autism sequencing consortium, Nancy B. Spinner, Samir Wadwahan, Maja Bucan, Chris Stoeckert, and members of the Penn HTS group.
Funding: The authors gratefully acknowledge funding from NIMH (R01 MH089004, R01 MH094382, and R01 MH094382), NIA (U24 AG041689, U01 AG032984, P30 AG010124), NINDS (P50 NS053488), and CurePSP Foundation.
SneakPeek: Quality Metrics Management System
• Provides an overview of all samples processed through a dynamic web interface
• Allows user to assess quality of sequencing data• Identify samples
with unusual QC metric(s)
• Identify batch problems
DRAW+SneakPeek Availability• Released under the MIT license• Free for academic and non-profit use • Available at the National Institute on Aging
Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) (http://www.niagads.org/) • Source code• Amazon Machine Images (AMIs)• Install guide, documentation, tutorial
Running DRAW
One command will run all three phases of DRAW:
Phase 5: Import into MySQL tables using
in-house scripts
Phase 3: Variant and coverage using
GATK/snpEff
Phase 2: QC using GATK/Picard/Samtool
s
Phase 1: Mapping using BWA
Input Demultiplexed FastQ files
Align reads, Paired ends
Mark duplicates, Local realignment, Base quality recalibration
Variant detection, filtration, annotation
Quality metrics on SneakPeek
Read, Base/Depth Coverage, QC metrics
Annotated VCF file
One flow cell: Illumina Hi-Seq 2000, 100-bp pair-end, ~350 Gb, 34 multiplexed samples using Nimblegen Human Exome v2 Library.
1.1TB data in two days; total cost $528
Running DRAW on Amazon
Create AWS User Account
Create your EC2 Instance• Cluster
Compute Eight Extra Large Instance
Create your EC2 Instance• Amazon
Machine Image (AMI) ID: ami-a4c934cd
Attach EBS Volume• At least
2TB for a flow cell
Log in• Just like
how you login to your remote linux server!
Guide available on NIAGADS.org
High quality DNA-Seq workflow that is:• easy to use• can run on cloud computing environment easily
Large institutions have their own IT teams and don’t need our help.
What about the rest of us?
Cloud computing is a viable
solution for data analysis, but it is hard to master!
Many labs can do DNA-Seq study
but can’t analyze data without help
and large computing power
What Motivates Draw+SneakPeek
Features of DRAW
High quality workflow• Developed through
our collaboration with Broad Institute (autism sequencing consortium)
• Based on GATK best practices
Easy to use• Two config files and
five unix commands to run
• No complicated GUI involved
Comprehensive documentation• Step-by-step guide on
configuring an Amazon EC2 instance
You get the same result using cloud or
your local HPC cluster
Data download, $146(upload is free)
Storage; $95Elastic Block Storage I/O; $17
Read mapping; $140
Base quality recalibration/local re-
alignment, $103
Variant calling; $27
Running DRAW on Amazon EC2: A benchmark study
Workflow ComparisonA comparison of DRAW+SneakPeek with other workflows.
ReferenceLin CF, Valladares O, Childress DM, Klevak E, Geller ET, Hwang YC, Tsai EA, Schellenberg GD, and Wang LS. DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments. Bioinformatics, Oct 1;29(19):2498-2500. Epub 2013 Aug 13.