Enabling Large Scale Sequencing Studies through Science as a Service (ScaaS) Justin H. Johnson Director of Bioinformatics EdgeBio Washington DC, USA

DESCRIPTION

“Now” generation sequencing has drastically changed the traditional costs and infrastructure within the sequencing community. Several technologies, platforms and algorithms show promise, but it is not always intuitive where to start. This uncertainty is compounded by the fact that commonly used analysis tools are difficult to build, maintain, and run effectively. Sample acquisition and preparation are quickly becoming a bottleneck as projects move from small sample sizes to hundreds or even thousands of samples. We will present case studies highlighting information, methods, challenges and opportunities in leveraging large scale high throughput sequencing and bioinformatics. Specifically, we will highlight a recent genome-wide study of methylation patterns in 1575 individuals with schizophrenia. We will also discuss several cancer transcriptome and exome sequencing projects, as well as a human pathogen transcriptome characterization project consisting of multiple organisms and almost a billion reads.

The Future

The Ion Torrent PGM is a very promising, rapid-throughput, ultra-scalable sequencer that could play an integral part in future human health studies. Applications such as microbial whole genome sequencing, metagenomic characterization of environmental and microbiome samples, and targeted resequencing projects stand to benefit from this technology over time. To date we have completed more than 25 runs on a single PGM and will comment on the setup as well as on sequence data and analysis.


Page 1: Enabling Large Scale Sequencing Studies through Science as a Service

Enabling Large Scale Sequencing Studies through Science as a Service (ScaaS)

Justin H. Johnson
Director of Bioinformatics
EdgeBio
Washington DC, USA

Page 2: Enabling Large Scale Sequencing Studies through Science as a Service

Agenda

• Who We Are
• NGS at 30K
• Challenges and Enabling Through ScaaS
  – Transcriptome Projects
  – Exome Projects
  – Ion Torrent Data

Page 3: Enabling Large Scale Sequencing Studies through Science as a Service

Who We Are

Page 4: Enabling Large Scale Sequencing Studies through Science as a Service

Life Tech Service Provider

Page 5: Enabling Large Scale Sequencing Studies through Science as a Service

Contract Research Division
• Five SOLiD4 sequencing platforms
• One Life Technologies 5500XL
• Two Ion Torrent PGMs
• Automation through Caliper Sciclone & Biomek FX
• Life Technologies Preferred Service Provider
• Agilent Certified Service Provider
• Commercial partnerships with companies such as CLCBio, DNANexus and Genologics
• MD/PhD & Masters Level Scientists and Bioinformaticians
• IT Infrastructure of >100 CPUs and >100TB storage

Page 6: Enabling Large Scale Sequencing Studies through Science as a Service

Edge BioServ Scientific Advisory Board

• Elaine Mardis, Ph.D.: Co-Director, Genome Sequencing Center, Washington University School of Medicine
• Sam Levy, Ph.D.: Director of Genome Sciences, Scripps Translational Science Institute, Scripps Genomic Medicine
• Michael Zody, M.S.: Chief Technologist, Broad Institute
• Ken Dewar, Ph.D.: Assistant Professor, McGill University and Genome Quebec
• Steven Salzberg, Ph.D.: Director, Center for Bioinformatics and Computational Biology, University of Maryland
• Gabor Marth, Ph.D.: Professor of Bioinformatics, Boston College
• Elliott Margulies, Ph.D.: Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health

Page 7: Enabling Large Scale Sequencing Studies through Science as a Service

NGS @ 30K Feet

Page 8: Enabling Large Scale Sequencing Studies through Science as a Service

Machines and Vendors

GnuBio

Page 9: Enabling Large Scale Sequencing Studies through Science as a Service

Obligatory NGS Exponential Growth Slide

Nature Biotechnology, Volume 26, Number 10, October 2008

Page 10: Enabling Large Scale Sequencing Studies through Science as a Service

Genome
- De Novo
- Resequencing / Mutation Discovery & Profiling
- Exome Sequencing
- Copy Number Variation
- Ancient DNA

RNA-Seq / Whole Transcriptome
- mRNA Expression & Discovery
- Alternative Splicing
- Allele-Specific Expression
- microRNA Expression & Discovery

Epigenome
- Transcriptionally Active Sites
- Protein-DNA Interactions
- Methylation Analysis

Metagenome
- Microbial Diversity
- Heterogeneous Samples

Ultra High Throughput + Lower Cost = Broader Applications

Page 11: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Page 12: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Technical Expertise

Page 13: Enabling Large Scale Sequencing Studies through Science as a Service

Experimental Design Considerations

• Sequencing Platform in Use
• Choice of Library Construction
• Depth of Coverage
• Re$ources
• Number of Replicates
• Number of Samples and Controls
• Etc…
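Depth of coverage, number of samples and resources trade off against each other, and a rough estimate before the run helps. The sketch below uses the usual back-of-envelope relation (mean depth = reads × read length × on-target fraction ÷ target size). The function names and the example numbers are illustrative assumptions, not figures from this talk, and real yields will be lower once unmapped and duplicate reads are discounted.

```python
import math


def expected_mean_coverage(n_reads, read_length_bp, target_size_bp,
                           on_target_fraction=1.0):
    """Rough mean depth over the target, ignoring mapping losses and duplicates."""
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    return n_reads * read_length_bp * on_target_fraction / target_size_bp


def reads_needed(target_coverage, read_length_bp, target_size_bp,
                 on_target_fraction=1.0):
    """Invert the relation above: reads required to reach a desired mean depth."""
    usable_bases_per_read = read_length_bp * on_target_fraction
    return math.ceil(target_coverage * target_size_bp / usable_bases_per_read)


# Hypothetical design: 100 M reads of 50 bp, 50 Mb capture target, 75% on target.
print(expected_mean_coverage(100_000_000, 50, 50_000_000, 0.75))
# Hypothetical question: how many 50 bp reads for 40X if only half land on target?
print(reads_needed(40, 50, 50_000_000, 0.5))
```

Dividing a run's read budget by `reads_needed(...)` gives a first guess at how many samples fit on one run at the chosen depth.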

Page 14: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Flexibility w/ Standards

Page 15: Enabling Large Scale Sequencing Studies through Science as a Service

Flexibility with Standards and Scale

• Then (CE) – The Norm
  – 10 Machines, 30 – 360 Days, 1 Project
• Now (Illumina/SOLiD/454) – Scale
  – 1 Machine, 14 Days, 30 Projects
• Now (Ion Torrent) – Flexibility
  – 1 Machine, 1 Day, 1 Project
• Future (CLCBio, Nexus, Open Source)
  – Standardization of analysis

Page 16: Enabling Large Scale Sequencing Studies through Science as a Service

Partial List of Mappers

* BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
* BWA - Heng Li's BWT Alignment program, a progression from Maq. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C++ source.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GenomeMapper - A short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. Created by the 1001 Genomes project. Source for POSIX.
* GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
* gnumap - The Genomic Next-generation Universal MAPper is designed to accurately map sequence data obtained from next-generation sequencing machines (specifically Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina, with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
* MOSAIK - Produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX.
* MrFAST and MrsFAST - Designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs, and MrsFAST has a bisulphite mode. Authors are from the University of Washington. C source.
* MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* PASS - Supports Illumina, SOLiD and Roche-FLX data formats and allows the user to modulate the sensitivity of the alignments very finely. Spaced-seed initial filter, then an NW dynamic algorithm to an SW-like local alignment. Authors are from CRIBI in Italy. Win/Linux.
* RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
* Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
* SOAP - Short Oligonucleotide Alignment Program. A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
* SSAHA - Sequence Search and Alignment by Hashing Algorithm. A tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SOCS - Aligns SOLiD data. Built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
* SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains SWIFT, a fast local alignment search guaranteed to find epsilon-matches between two sequences, and SWIFT BALSAM, a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
* SXOligoSearch - A commercial platform offered by the Malaysian-based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web portal. OS independent.
* Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
* Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads generated by next-generation sequencing technology back to the reference genomes, and to carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly, with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.

Courtesy of SeqAnswers.com

Page 17: Enabling Large Scale Sequencing Studies through Science as a Service

Enabling Through NGS

Evolving Sequencing & Analysis Methods to Enable Genomic Research

Page 18: Enabling Large Scale Sequencing Studies through Science as a Service

Real World Examples - Scale
1500+ Sample Epigenetic Study

Challenges
• Sample Prep (MethylMiner)
• Tracking (LIMS)
• QC (Automation and Standardization)
• Delivery (Automation and Standardization)

Solution
• Mix of Commercial and Open Tools
  • CLC Bio and Genologics
  • Custom Algorithms
• HPC and Storage
  • Onsite 100 TB NAS
  • S3 for Backup and Delivery

Page 19: Enabling Large Scale Sequencing Studies through Science as a Service

Real World Examples – Standards
Rapidly sequenced the genome of the Escherichia coli strain from the European outbreak

“…[University of Münster & Life Tech] received the samples on Monday, began sequencing that evening, and began analyzing the data on Wednesday…”

“…Justin Johnson, director of bioinformatics at EdgeBio, assembled and analyzed the raw reads made publicly available by BGI using CLC Bio's software…Johnson said his analysis took just a couple of hours…”

Page 20: Enabling Large Scale Sequencing Studies through Science as a Service

Transcriptome

Page 21: Enabling Large Scale Sequencing Studies through Science as a Service

Mammalian Transcriptome Complexity

[Figure: schematic of mammalian transcriptional complexity along genomic DNA, showing multiple transcription start sites, translation start sites and polyadenylation sites per locus, protein-coding and non-coding regions, spliced introns, microRNAs (miRNA), and the small RNA classes PASR, TASR and tiRNA.]

Legend: TSS = transcription start site; ATG = translation start site; pA = polyadenylation signal; AAA = polyadenylation.

Courtesy of Life Technologies

Page 22: Enabling Large Scale Sequencing Studies through Science as a Service

RNA-Seq
• New Approach to RNA Profiling enabled by Next-Gen Sequencing
  • Yet based on well-established methodologies
• Substantial Benefits over Hybridization-Based Methods
  • Better quantitative gene expression performance (DGE)
  • In addition, can allow a comprehensive view of transcription (Whole Transcriptome)
• Transcriptome projects overview
  • Identification of imprinted genes contributing to specific brain regions by whole transcriptome sequencing
  • 24 sample cohort for basic human expression and variant analysis in diseased patients
  • 32 sample cohort looking at novel splice junctions, gene fusions, and differential expression of colon cancer samples over a time series
  • Collaboration with Scripps Translational on Colon Cancer Transcriptomes

Page 23: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Sample Preparation

Page 24: Enabling Large Scale Sequencing Studies through Science as a Service

Sample Sourcing for Transcriptome Projects

– Blood: Large quantities of sample available, but with limited utility in transcriptome analysis

– Tissue: Needle biopsy most common, but sample quantity very low

– Surgical section: Larger quantities available, but limited utility; need laser capture microdissection to provide useful results, sample quantity very low

– FFPE Slides: Very useful in clinical research but amount of sample and quality low.

Page 25: Enabling Large Scale Sequencing Studies through Science as a Service

Unamplified vs Amplified

• Prostate Cancer Cell Line (VCaP) from CPDR
  – Well characterized
  – Differential expression upon the addition of androgens
  – Compared transcriptome from a single pool of RNA
    • Unamplified, ribosomally depleted (RiboMinus™)
    • Amplified, no ribosomal depletion required
    • Two pipelines for analysis

Page 26: Enabling Large Scale Sequencing Studies through Science as a Service

Amplification Gives Different Results

• Gene Expression in Unstimulated Cells

[Venn diagram of expressed genes]
Unamplified only: 1,071
Amplified only: 2,112
Both: 14,075

Page 27: Enabling Large Scale Sequencing Studies through Science as a Service

Spearman’s Correlation from 2 Pipelines

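Pipeline A (+ / - = androgen)
 | Unamplified + | Unamplified - | Amplified + | Amplified -
Unamplified + | … | 0.930 | 0.904 | 0.892
Unamplified - | … | … | 0.896 | 0.900
Amplified + | … | … | … | 0.928
Amplified - | … | … | … | …

Pipeline B (+ / - = androgen)
 | Unamplified + | Unamplified - | Amplified + | Amplified -
Unamplified + | … | 0.853 | 0.757 | 0.701
Unamplified - | … | … | 0.720 | 0.712
Amplified + | … | … | … | 0.848
Amplified - | … | … | … | …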
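The slide does not say how the coefficients were computed, so the following is only a minimal sketch of Spearman's rank correlation as it might be applied to two per-gene RPKM vectors: rank each vector (ties get the average rank), then take the Pearson correlation of the ranks. The function names and the toy input are assumptions for illustration.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Toy vectors standing in for per-gene RPKM values from two conditions.
print(round(spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7]), 4))
```

Because only ranks are used, the coefficient is insensitive to the scale differences between pipelines that are visible in the raw RPKM values.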

Page 28: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Sample Analysis

Page 29: Enabling Large Scale Sequencing Studies through Science as a Service

RNA-Seq Analysis Between Pipelines is Either Concordant

Amplified, Stimulated, Pipe A
Gene Name | RPKM
TPT1 | 4883
MALAT1 | 3632
ODC1 | 801.9
ACPP | 637.8
KLK2 | 515.5
EEF1A1 | 441.1
NDRG1 | 417.5
CALM2 | 410.9
TRMT112 | 381
PPIA | 357.4

Amplified, Stimulated, Pipe B
Gene Name | RPKM
TPT1 | 7137.08
ODC1 | 1122.86
KLK2 | 809.00
ACPP | 715.40
CALM2 | 590.02
CD9 | 584.96
TRMT112 | 557.08
NDRG1 | 553.61
EEF1A1 | 552.08
H3F3A | 521.03
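One simple way to put a number on "concordant" is to compare the top-N gene lists from the two pipelines. The sketch below does that for the two amplified, stimulated lists on this slide. The helper names are illustrative, and top-N overlap is just one possible measure, not necessarily the one used in the original analysis.

```python
# Top-10 RPKM values as listed on the slide (amplified, stimulated).
pipe_a = {"TPT1": 4883, "MALAT1": 3632, "ODC1": 801.9, "ACPP": 637.8,
          "KLK2": 515.5, "EEF1A1": 441.1, "NDRG1": 417.5, "CALM2": 410.9,
          "TRMT112": 381, "PPIA": 357.4}
pipe_b = {"TPT1": 7137.08, "ODC1": 1122.86, "KLK2": 809.00, "ACPP": 715.40,
          "CALM2": 590.02, "CD9": 584.96, "TRMT112": 557.08, "NDRG1": 553.61,
          "EEF1A1": 552.08, "H3F3A": 521.03}


def top_n(rpkm_by_gene, n):
    """Gene names sorted by descending RPKM, truncated to the first n."""
    ranked = sorted(rpkm_by_gene.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:n]]


def compare_top(a, b, n=10):
    """Split the two top-n lists into shared and pipeline-specific genes."""
    top_a, top_b = set(top_n(a, n)), set(top_n(b, n))
    return {
        "shared": sorted(top_a & top_b),
        "only_a": sorted(top_a - top_b),
        "only_b": sorted(top_b - top_a),
    }


result = compare_top(pipe_a, pipe_b)
print(len(result["shared"]), result["only_a"], result["only_b"])
```

On these lists, eight of the ten genes are shared, with MALAT1 and PPIA appearing only in Pipe A and CD9 and H3F3A only in Pipe B.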

Page 30: Enabling Large Scale Sequencing Studies through Science as a Service

Or not…

Unamplified, Stimulated, Pipe A
Gene Name | RPKM
ACPP | 1444.82
KLK2 | 1259.86
NDRG1 | 1047.52
TPT1 | 839.17
ODC1 | 779.34
NPY | 699.85
GAPDH | 459.39
ACSL3 | 430.22
AGR2 | 350.97
CALM2 | 334.11

Unamplified, Stimulated, Pipe B
Gene Name | RPKM
SNORD27 | 37540
SNORD47 | 25680
SNORD34 | 23070
SNORD76 | 21420
SNORD104 | 19990
SNORD26 | 16560
SNORD32A | 13740
SNORA32 | 10770
SNORD100 | 10510
SNORD44 | 10440

Page 31: Enabling Large Scale Sequencing Studies through Science as a Service

Even if you remove all SNORA and SNORD

Unamplified, Stimulated, Pipe A
Gene Name | RPKM
ACPP | 1444.82
KLK2 | 1259.86
NDRG1 | 1047.52
TPT1 | 839.17
ODC1 | 779.34
NPY | 699.85
GAPDH | 459.39
ACSL3 | 430.22
AGR2 | 350.97
CALM2 | 334.11

Unamplified, Stimulated, Pipe B
Gene Name | RPKM
RNU6ATAC | 1081
RPPH1 | 877.6
ACPP | 754.5
RMRP | 730.2
KLK2 | 550.6
NDRG1 | 510.9
MALAT1 | 425.7
TPT1 | 380.3
ODC1 | 345.1
NPY | 311.3
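Removing the SNORA and SNORD entries can be done with a name-prefix filter. The sketch below is one minimal way to do it and is not the talk's actual script. The `rows` list mixes a few entries from the two Pipe B lists purely for illustration, and the prefix rule assumes gene symbols follow the SNORA*/SNORD* naming seen on the slides.

```python
SNO_PREFIXES = ("SNORA", "SNORD")


def drop_snornas(rpkm_rows, prefixes=SNO_PREFIXES):
    """Keep (gene, rpkm) rows whose gene symbol does not start with a snoRNA prefix."""
    return [
        (gene, rpkm)
        for gene, rpkm in rpkm_rows
        if not gene.upper().startswith(prefixes)
    ]


# Illustrative mix of rows taken from the Pipe B tables above.
rows = [
    ("SNORD27", 37540),
    ("SNORA32", 10770),
    ("RNU6ATAC", 1081),
    ("RPPH1", 877.6),
    ("ACPP", 754.5),
]

filtered = drop_snornas(rows)
print(filtered)
```

As the Pipe B list shows, other small non-coding RNAs (RNU6ATAC, RPPH1, RMRP) still rank at the top after this filter, so a prefix rule alone does not make the two pipelines agree.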

Page 32: Enabling Large Scale Sequencing Studies through Science as a Service

PolyA Selection vs Ribosomal Depletion

[Figure: log-scale scatter plot of Mean.RPKM_HELA-RM (y axis) against Mean.RPKM_HELA-PA (x axis). Point classes: NM RefSeq, NR RefSeq, histones (circles), SNORD/SNORA, rRNA (dots).]

Courtesy of Life Technologies

Page 33: Enabling Large Scale Sequencing Studies through Science as a Service

Solution?

Page 34: Enabling Large Scale Sequencing Studies through Science as a Service

Not what you want to hear…
• Lots of manual work to run multiple pipelines
• Join discordance
  • Scripting
  • Visualization
  • Filtering techniques based on YOUR data

Page 35: Enabling Large Scale Sequencing Studies through Science as a Service

Exome & Targeted Seq

Page 36: Enabling Large Scale Sequencing Studies through Science as a Service

Exome and Targeted Resequencing

• Capturing and interrogating a portion of the genome in many samples post GWAS
  • Fine map a region
• Capturing and interrogating the exome
  • Catalogue variants for downstream filtering and identification of causative mutation(s)
• Exome and Targeted Resequencing projects overview
  • Identification of the genetic basis of colorectal cancer through exome sequencing
  • 600+ sample cohort to identify the genetic basis of a novel syndrome
  • Exome sequencing of Tumor/Normal Leukemia patients to identify novel mutations present in tumor samples
  • Exome sequencing of a large cohort (80+) to identify novel mutations linked to phenotypic changes

Page 37: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Sample Preparation

Page 38: Enabling Large Scale Sequencing Studies through Science as a Service

Targeted Capture Technologies

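[Figure: capture technologies arranged by genomic region captured, on a scale of 20 Kb, 1 MB, 2 MB, 3 MB, 4 MB, 5 MB and 30-50 MB (exome). Technologies shown: LR-PCR, Febit HybSelect, Raindance Technologies, Fluidigm, Nimblegen SeqCap EZ and Agilent SureSelect; the last two each appear twice on the chart.]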

Page 39: Enabling Large Scale Sequencing Studies through Science as a Service

Challenges

Sample Analysis

Page 40: Enabling Large Scale Sequencing Studies through Science as a Service

Ultimately Comes to Variation

• Coverage
• Project Design
  – Cohorts
  – Cancer
• Algorithms a Solved Problem?
  – Single open source pipelines
  – Single commercial pipelines
  – Proprietary internal algorithms
  – A mixture?


Page 42: Enabling Large Scale Sequencing Studies through Science as a Service

EdgeBio Exome Coverage Statistics

[Figure: coverage chart with y axis "Bases Covered 3 or More Times" (78.00% to 98.00%), x axis "Coverage", and series "0X Sites" and "3X+ Sites". The values 3149 and 106199 are printed on the chart.]

Page 43: Enabling Large Scale Sequencing Studies through Science as a Service

EdgeBio Exon Coverage Statistics
How well is the exome covered?*

Sample | Reads | Mean CVG (Exome) | Specificity (On Target) | Mean CVG >=1X | Mean CVG >=10X | Mean CVG >=20X | Mean CVG >=40X | SNP Calls | Unknown SNPs (dbSNP130) | AA Change | Known OMIM Assoc. | Coverage >= 20
00C03330A | 108,066,848 | 50.59 | 77.00% | 90.60% | 83.14% | 64.87% | 44.57% | 37,876 | 2,075 | 374 | 222 | 152
02C11836A | 98,475,789 | 41.31 | 76.70% | 88.60% | 81.32% | 61.50% | 38.50% | 34,221 | 1,897 | 291 | 173 | 119
02C12313A | 95,867,728 | 46.39 | 74.40% | 90.83% | 77.57% | 65.17% | 52.57% | 42,438 | 2,533 | 371 | 206 | 148
02C12834A | 103,089,460 | 47.36 | 77.80% | 90.21% | 77.35% | 65.78% | 44.09% | 37,514 | 2,178 | 364 | 218 | 159
03C14605A | 112,883,077 | 43.52 | 75.10% | 90.74% | 76.00% | 62.93% | 39.77% | 36,589 | 2,330 | 391 | 232 | 172
03C14951A | 105,376,198 | 48.07 | 77.30% | 91.82% | 78.73% | 66.62% | 43.92% | 38,186 | 2,442 | 445 | 229 | 177
03C15059A | 112,103,402 | 44.94 | 75.30% | 90.48% | 73.61% | 59.96% | 38.52% | 35,658 | 2,354 | 452 | 246 | 164
QPS0001C | 103,073,216 | 42.35 | 77.00% | 87.34% | 68.81% | 55.36% | 35.35% | 30,691 | 2,255 | 455 | 285 | 170
QPS0001P | 106,176,385 | 48.78 | 77.50% | 90.27% | 73.28% | 61.36% | 42.14% | 38,807 | 2,772 | 506 | 301 | 218
QPS0001R | 108,548,733 | 46.00 | 73.00% | 89.36% | 71.50% | 59.09% | 39.95% | 41,013 | 2,779 | 443 | 261 | 194
Totals | 1,053,660,836 | 45.93 | 76.11% | 90.03% | 76.13% | 62.26% | 41.94% | 37,299 | 2,362 | 409 | 237 | 167

* Data from Fragment Runs. Since moving to PE, seeing 15% improvement.

Page 44: Enabling Large Scale Sequencing Studies through Science as a Service

Venter Genome - Algorithms

• PLoS Genetics 2008, vol 4, issue 8, e1000160
• ~21K SNPs in exons (29 MB targeted)
• 36,206 expected SNPs for 50 MB kit

% Difference, Homozygous
Software | TP | TN | FP | FN | Sensitivity | Pos.pred.val
B | 1% | 0% | -39% | -1% | 1% | 4%
A | 31% | 0% | 88% | -41% | 31% | -6%
C | -32% | 0% | -49% | 42% | -32% | 2%

% Difference, Heterozygous
Software | TP | TN | FP | FN | Sensitivity | Pos.pred.val
B | 0% | 0% | 16% | 0% | 0% | -9%
A | -15% | 0% | -44% | 21% | -15% | 16%
C | 15% | 0% | 28% | -20% | 15% | -7%

Page 45: Enabling Large Scale Sequencing Studies through Science as a Service

3 Tools and Associated SNP Counts

• Software A
  – 45,511
• Software B
  – 29,814
• Software C
  – 40,964

Page 46: Enabling Large Scale Sequencing Studies through Science as a Service

Software B v. Software A

[Venn diagram, not to scale]
B: 29,814 total (8,564 unique to B)
A: 45,511 total (24,261 unique to A)
Shared: 21,250

Union: 54,075
Intersection: 21,250

Page 47: Enabling Large Scale Sequencing Studies through Science as a Service

Software B v. Software C

[Venn diagram]
B: 29,814 total (6,358 unique to B)
C: 40,964 total (17,508 unique to C)
Shared: 23,456

Union: 47,322
Intersection: 23,456

Page 48: Enabling Large Scale Sequencing Studies through Science as a Service

Software A v. Software C

[Venn diagram]
A: 45,511 total (14,738 unique to A)
C: 40,964 total (10,191 unique to C)
Shared: 30,773

Union: 55,702
Intersection: 30,773

Page 49: Enabling Large Scale Sequencing Studies through Science as a Service

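[Venn diagram of all three call sets]
A: 45,511 total (13,130 unique to A)
B: 29,814 total (4,750 unique to B)
C: 40,964 total (6,377 unique to C)
A and B only: 1,608
B and C only: 3,814
A and C only: 11,131
All three: 19,642

Union: 60,452
Intersection: 19,642
Voting Scheme (2/3): 36,195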
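The 2/3 voting scheme keeps a SNP when at least two of the three tools call it. The sketch below is a minimal version of that idea: variants are keyed by (chromosome, position, ref, alt), and the toy call sets are invented for illustration. How the tools' outputs were actually normalized and joined is not described on the slides.

```python
from collections import Counter


def vote(call_sets, min_support=2):
    """Return variants reported by at least `min_support` of the given call sets."""
    support = Counter()
    for calls in call_sets.values():
        support.update(set(calls))  # each tool votes at most once per variant
    return {variant for variant, n in support.items() if n >= min_support}


# Hypothetical calls keyed by (chrom, pos, ref, alt).
calls = {
    "A": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "B": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 7, "T", "C")},
    "C": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}

union = vote(calls, min_support=1)
two_of_three = vote(calls, min_support=2)
all_three = vote(calls, min_support=3)
print(len(union), len(two_of_three), len(all_three))

# Arithmetic check against the Venn regions on the slide:
# all three + (A and B only) + (B and C only) + (A and C only)
print(19_642 + 1_608 + 3_814 + 11_131)
```

The last line reproduces the slide's 36,195 from the four Venn regions that have support from two or more tools.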

Page 50: Enabling Large Scale Sequencing Studies through Science as a Service

Solution?

Page 51: Enabling Large Scale Sequencing Studies through Science as a Service

Again not what you want to hear…
• Lots of manual/semi-automated work to run multiple pipelines
• Join discordance
  • Scripting
  • Visualization
• Better algorithms for variant calling
  • Cancer specific
• Standardization of algorithms for variant calling
  • It all begins with mapping

Page 52: Enabling Large Scale Sequencing Studies through Science as a Service

Exome Analysis – Cancer Specific
Dana-Farber Cancer Institute

Multi-Pipeline Variant Calling and LOH

Loss of heterozygosity detection in tumor vs germline exome: candidate LOH genes selected with the following algorithm
• Non-synonymous heterozygous SNP in germline gene
• Non-synonymous homozygous SNP in tumor or additional non-synonymous heterozygous SNP on the other allele
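The two selection rules can be sketched as a gene-level filter over annotated SNP calls. This is an interpretation of the slide, not the project's code. The `Snp` record and gene names are hypothetical, and because short-read calls are usually unphased, "on the other allele" is approximated here as "a non-synonymous heterozygous tumor SNP at a position not seen as heterozygous in the germline".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    gene: str
    pos: int
    zygosity: str        # "het" or "hom"
    nonsynonymous: bool


def candidate_loh_genes(germline, tumor):
    """Genes with a germline non-synonymous het SNP plus a qualifying tumor SNP."""
    # Rule 1: positions of non-synonymous heterozygous germline SNPs, per gene.
    germline_het = {}
    for snp in germline:
        if snp.nonsynonymous and snp.zygosity == "het":
            germline_het.setdefault(snp.gene, set()).add(snp.pos)

    # Rule 2: tumor has a non-synonymous hom SNP, or an additional het SNP.
    candidates = set()
    for snp in tumor:
        if not snp.nonsynonymous or snp.gene not in germline_het:
            continue
        if snp.zygosity == "hom":
            candidates.add(snp.gene)
        elif snp.zygosity == "het" and snp.pos not in germline_het[snp.gene]:
            candidates.add(snp.gene)  # phase is NOT checked in this sketch
    return sorted(candidates)


germline = [Snp("GENE1", 10, "het", True), Snp("GENE2", 20, "het", True),
            Snp("GENE3", 30, "het", False), Snp("GENE4", 40, "het", True)]
tumor = [Snp("GENE1", 10, "hom", True),
         Snp("GENE2", 20, "het", True), Snp("GENE2", 25, "het", True),
         Snp("GENE3", 30, "hom", False),
         Snp("GENE4", 40, "het", True)]

print(candidate_loh_genes(germline, tumor))
```

A production version would also need matched tumor/normal coverage checks and, for the second-hit rule, read-backed or statistical phasing.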

Page 53: Enabling Large Scale Sequencing Studies through Science as a Service

Ion Torrent

Page 54: Enabling Large Scale Sequencing Studies through Science as a Service

Ion Torrent PGM

Longer, Accurate Reads in 2.5 Hours
• Microbial & Viral Resequencing
• Microbial & Viral De novo Applications
• Eukaryotic Amplicon Sequencing
• Metagenomics
  – WGS
  – 16S Surveys

Page 55: Enabling Large Scale Sequencing Studies through Science as a Service

Ion Torrent PGM

Name | Total # Reads | Total # Reads (AQ20) | % Reads (AQ20) | Total # (Mbp) | Mean Read Length (AQ20) | Percent of Genome Covered (AQ20) | Percent of Aligned Genome Q40+ | Inferred Read Error | Consensus Accuracy
RUN01 | 320,872 | 304,787 | 94.99% | 32.95 | 84.00 | 99.00% | 93.38% | 1.71% | 99.8490%
RUN02 | 198,755 | 192,031 | 96.62% | 20.20 | 83.00 | 96.00% | 82.36% | 1.62% | 99.6456%
RUN03 | 260,566 | 246,668 | 94.67% | 26.91 | 85.00 | 98.00% | 91.47% | 1.64% | 99.7737%
RUN04 | 163,059 | 156,669 | 96.08% | 16.76 | 84.00 | 94.00% | 78.82% | 1.62% | 99.5584%
0039009CA | 201,693 | 188,482 | 93.45% | 21.44 | 88.00 | 95.00% | 85.98% | 1.61% | 99.5802%
0039010CA | 241,493 | 227,393 | 94.16% | 25.62 | 89.00 | 98.00% | 90.38% | 1.51% | 99.7627%

Page 56: Enabling Large Scale Sequencing Studies through Science as a Service

Ion Torrent PGM

Name | Mode (M = mapping, D = de novo) | Total # Reads | # Aligned / Assembled Reads | % Aligned / Assembled Reads | # Contigs | N50 Contig | Largest Contig | Percent of Aligned Genome Covered (AQ40) | Consensus Accuracy (A) / Consensus Accuracy (D)
Combined (2 Runs) | M | 442,732 | 430,561 | 97.25% | 219 | 34,417 | 125,376 | 97.43% | 99.88%
Combined (4 Runs) | M | 941,543 | 902,644 | 95.87% | 96 | 85,690 | 326,384 | 99.38% | 99.97%
Combined (6 Runs) | M | 1,384,863 | 1,334,138 | 96.34% | 90 | 107,749 | 326,368 | 99.51% | 99.97%
Combined (2 Runs) | D | 442,732 | 401,345 | 97.43% | 1,575 | 3,876 | 25,472 | 97.77% | 1.67%
Combined (4 Runs) | D | 941,543 | 903,120 | 95.92% | 387 | 21,489 | 67,465 | 99.40% | 1.70%
Combined (6 Runs) | D | 1,384,863 | 1,335,604 | 96.44% | 216 | 42,499 | 146,899 | 99.53% | 1.73%
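For readers unfamiliar with the N50 Contig column: N50 is the contig length at which, walking from the longest contig down, the running total first reaches half of the assembly's total length. A minimal sketch with made-up contig lengths (the per-contig lengths behind the table are not given on the slide):

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe form of running >= total / 2
            return length


# Hypothetical assembly: total length 32, and the two 8s already cover half.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A rising N50 with fewer contigs, as in the 2-, 4- and 6-run rows above, indicates that added coverage is joining fragments into longer contigs.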

Page 57: Enabling Large Scale Sequencing Studies through Science as a Service

Ion Torrent PGM

[Figure: Increasing Coverage Effect on Alignment / Assembly. N50 contig and largest contig (axis 0 to 160,000) for the combined 2-, 4- and 6-run de novo assemblies.]

[Figure: Increasing Coverage Effect on Accuracy. Percent of aligned genome covered (AQ40) and consensus accuracy (axis 96.00% to 100.50%) for the combined 2-, 4- and 6-run mappings.]

Page 58: Enabling Large Scale Sequencing Studies through Science as a Service

Real World Examples – Speed
Rapidly sequenced the genome of the Escherichia coli strain from the European outbreak

“…[University of Münster & Life Tech] received the samples on Monday, began sequencing that evening, and began analyzing the data on Wednesday…”

“…Justin Johnson, director of bioinformatics at EdgeBio, assembled and analyzed the raw reads made publicly available by BGI using CLC Bio's software…Johnson said his analysis took just a couple of hours…”

Page 59: Enabling Large Scale Sequencing Studies through Science as a Service

Acknowledgements

• CPDR (Center for Prostate Disease Research) Collaboration
  – Shyh-Han Tan, Ph.D.
• Dana-Farber Cancer Institute Collaboration
  – Andrew Lane, M.D., Ph.D.; David Weinstock, M.D.; Oliver Weigert, M.D., Ph.D.
• Scripps Translational Health
  – Samuel Levy
• Sequencing Team led by Joy Adigun
• EdgeBio Research IFX led by John Seed, Ph.D. and Quang Nguyen, M.D., Ph.D.

Page 60: Enabling Large Scale Sequencing Studies through Science as a Service

Questions

Twitter: @Bioinfo

[email protected]