Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY // @SahaSurya
BTI PGRP Summer Internship Program 2014
http://www.acgt.me/blog/2014/3/7/next-generation-sequencing-must-die
Why Sequencing?
• Targeted interrogation of genome
• Economical
• Technological developments
• High-throughput assays
• But requires subsequent validation
7/8/2014 BTI PGRP Summer Internship Program 2014 2
Sequencing over the Ages
• 1953: DNA structure discovery
• 1977: Sanger DNA sequencing by chain-terminating inhibitors
• 1984: Epstein-Barr virus (170 Kb)
• 1987: ABI 370 sequencer
• 1995: Haemophilus influenzae (1.83 Mb)
• 2001: Homo sapiens (3.0 Gb)
• 2005 onward: 454, Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina HiSeq X
• Pinus taeda (24 Gb)
Slide credit: Aureliano Bombarely
First generation sequencing
Sanger method
Frederick Sanger (13 Aug 1918 – 19 Nov 2013)
• Won the Nobel Prize in Chemistry in 1958 and 1980
• Published the dideoxy chain-termination method, or “Sanger method”, in 1977
http://dailym.ai/1f1XeTB
Sanger method
http://bit.ly/1g6Cudq
http://bit.ly/1lcQO4J
First generation sequencing
• Very high quality sequences (99.999%)
• Very low throughput
Platform | Run Time | Read Length | Reads / Run | Total nucleotides sequenced | Cost / MB
Capillary Sequencing (ABI3730xl) | 20m-3h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400
http://bit.ly/1clLps3 http://1.usa.gov/1cLqIRd
Instead of "next-generation sequencing", name the specific technology used to generate the data
– Illumina Hiseq/Miseq/NextSeq
– Pacific Biosciences RS1/RSII
– Ion Torrent Proton/PGM
– SOLiD
http://www.acgt.me/blog/2014/3/10/next-generation-sequencing-must-diepart-2
454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://bit.ly/1ehwxWN
GS FLX Titanium
http://bit.ly/1ehAcEh
Illumina
Output | 15 GB | 120 GB | 1000 GB | 1800 GB
Number of Reads | 25 Million | 400 Million | 4 Billion | 6 Billion
Read Length | 2x300 bp | 2x150 bp | 2x125 bp (2x250 update mid-2014) | 2x150 bp
Cost | $99K | $250K | $740K | $10M
Source: Illumina
Illumina
http://1.usa.gov/1fP9ybl
Illumina: Moleculo
http://bit.ly/1aEPOBn
Pacific Biosciences SMRT sequencing
Single Molecule Real Time sequencing
http://bit.ly/1naxgTe
Pacific Biosciences SMRT sequencing: Error correction methods
• Hierarchical genome-assembly process (HGAP)
• PBJelly (English et al., PLoS ONE, 2012)
7/8/2014 Centre for Agricultural Bioinformatics, Pusa 15
Pacific Biosciences SMRT sequencing: Read Lengths
Oxford Nanopore
https://www.nanoporetech.com/
• No data yet??
• Error model
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
Others
• Ion Torrent Proton/PGM
• Helicos
• Nabsys
• SOLiD
• ……
Comparison
Next generation sequencing
Platform | Run Time | Read Length | Quality | Total nucleotides sequenced | Cost / MB
454 Pyrosequencing | 24h | 700 bp | Q20-Q30 | 0.7 GB | $10
Illumina MiSeq | 27h | 2x250 bp | >Q30 | 15 GB | $0.15
Illumina HiSeq 2500 | 11 days | 2x125 bp | >Q30 | 1000 GB | $0.05
Ion Torrent | 2h | 400 bp | >Q20 | 50 MB-1 GB | $1
Pacific Biosciences | 2h | 8.5-20 kb | >Q30 consensus, >Q10 single | 400-850 MB / SMRT cell | $0.33-$1
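As a rough illustration of what the Cost / MB column implies, the sketch below multiplies genome size, depth of coverage, and cost per megabase. It assumes "MB" means megabases, takes the lower bound where the slide gives a range, and uses a hypothetical 30x target coverage that does not come from the slides. It covers raw sequencing only, not library preparation or analysis.

```python
# Cost per megabase (USD) copied from the comparison table in these slides.
# Ranges are represented by their lower bound (Pacific Biosciences: $0.33-$1).
COST_PER_MB = {
    "454 Pyrosequencing": 10.0,
    "Illumina MiSeq": 0.15,
    "Illumina HiSeq 2500": 0.05,
    "Ion Torrent": 1.0,
    "Pacific Biosciences": 0.33,
}


def sequencing_cost(genome_size_mb, coverage, cost_per_mb):
    """Return the raw sequencing cost in USD for a given depth of coverage."""
    return genome_size_mb * coverage * cost_per_mb


if __name__ == "__main__":
    genome_mb = 1.83  # Haemophilus influenzae, size quoted in the timeline
    coverage = 30     # hypothetical target depth, not from the slides
    for platform, cost in COST_PER_MB.items():
        print(f"{platform}: ${sequencing_cost(genome_mb, coverage, cost):.2f}")
```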
http://bit.ly/1clLps3 http://1.usa.gov/1cLqIRd
http://omicsmaps.com/
Next Generation Genomics: World Map of High-throughput Sequencers
Real cost of Sequencing!!
Sboner et al., Genome Biology, 2011
Library Types
• Single end
• Paired end (PE, 150-800 bp; Fwd: /1, Rev: /2)
• Mate pair (MP, 2 Kb to 20 Kb)
[Figure: forward (F) and reverse (R) read layout for each library type, with 454/Roche and Illumina variants]
Slide credit: Aureliano Bombarely
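The /1 and /2 suffixes on paired-end read IDs can be used to match forward and reverse mates. The sketch below is one minimal way to do that, assuming every ID ends in exactly "/1" or "/2". The function names are hypothetical, and newer Illumina headers mark the mate differently.

```python
def split_read_id(read_id):
    """Split 'name/1' into ('name', '1'); return (read_id, None) if no suffix."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2], read_id[-1]
    return read_id, None


def pair_reads(read_ids):
    """Group read IDs into (forward, reverse) pairs and a list of orphans."""
    forward, reverse = {}, {}
    unpaired = []
    for rid in read_ids:
        name, mate = split_read_id(rid)
        if mate == "1":
            forward[name] = rid
        elif mate == "2":
            reverse[name] = rid
        else:
            unpaired.append(rid)
    # A pair exists only when both mates share the same base name.
    pairs = [(forward[n], reverse[n]) for n in forward if n in reverse]
    unpaired += [forward[n] for n in forward if n not in reverse]
    unpaired += [reverse[n] for n in reverse if n not in forward]
    return pairs, unpaired
```

Reads whose mate is missing end up in the orphan list, which is the usual situation after one mate has been discarded during trimming.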
Implications of Choice of Library
[Figure: assembly hierarchy]
• Reads → consensus sequence (contig)
• Contigs + paired-read information → scaffold (or supercontig), with gaps shown as NNNNN
• Scaffolds + genetic information (markers) → pseudomolecule (or ultracontig)
Slide credit: Aureliano Bombarely
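To make the NNNNN gaps concrete, here is a minimal sketch of joining contigs into a scaffold. It assumes the contigs are already ordered and oriented and that gap sizes have been estimated from paired-read information. The function name and the `min_gap` parameter are hypothetical, and this is not how any particular assembler does it.

```python
def build_scaffold(contigs, gap_sizes, min_gap=1):
    """Join contigs into one scaffold, filling each gap with a run of Ns.

    contigs   -- contig sequences, already ordered and oriented
    gap_sizes -- estimated gap length between consecutive contigs
    min_gap   -- smallest number of Ns to place, even for a zero estimate
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        parts.append("N" * max(gap, min_gap))
        parts.append(contig)
    return "".join(parts)
```

For example, `build_scaffold(["ACGT", "GGCC"], [5])` places five Ns between the two contigs.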
Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
[Figure: reads tagged AGTCGT and TGAGCA are pooled, sequenced together, and then sorted back into their samples by tag]
Slide credit: Aureliano Bombarely
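Sorting multiplexed reads back into samples can be sketched as below, using the two tags from the slide. The sample names are hypothetical, and the sketch assumes the tag sits at the start of the read and matches exactly. Real demultiplexers usually tolerate mismatches.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading tag and strip the tag.

    reads    -- iterable of read sequences that start with a tag
    barcodes -- dict mapping tag sequence to sample name
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["unassigned"] = []
    for read in reads:
        for tag, sample in barcodes.items():
            if read.startswith(tag):
                bins[sample].append(read[len(tag):])
                break
        else:
            # No tag matched, so keep the read intact for inspection.
            bins["unassigned"].append(read)
    return bins


if __name__ == "__main__":
    tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}  # names are made up
    print(demultiplex(["AGTCGTAAAA", "TGAGCACCCC", "GGGGGGTTTT"], tags))
```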
File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
Slide credit: Aureliano Bombarely
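A minimal FASTA reader might look like the sketch below. It relies on the standard convention, not spelled out on the slide, that each record starts with a ">" header line followed by one or more sequence lines. The function name is hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because it accepts any iterable of lines, it works on an open file handle as well as on a list of strings.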
File Formats
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• A single ID line with an at symbol (“@”) in the first column.
• The sequence can span multiple lines after the ID line.
• A single line with a plus symbol (“+”) in the first column marks the start of the quality values.
• The “+” line may repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length must equal the sequence length.
Slide credit: Aureliano Bombarely
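The FASTQ rules above can be turned into a small reader. Because a quality line may itself begin with "@", this sketch decides where a record ends by counting quality characters until they match the sequence length, not by looking for the next "@". The function name is hypothetical, and most modern files use exactly four lines per record.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    it = (line.rstrip("\r\n") for line in lines)
    for line in it:
        if not line:
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]
        # Sequence lines run until the '+' separator line.
        seq_chunks = []
        for line in it:
            if line.startswith("+"):
                break
            seq_chunks.append(line)
        sequence = "".join(seq_chunks)
        # Quality lines run until their total length matches the sequence.
        qual_chunks, qual_len = [], 0
        while qual_len < len(sequence):
            line = next(it, None)
            if line is None:
                raise ValueError(f"truncated quality for read {read_id!r}")
            qual_chunks.append(line)
            qual_len += len(line)
        quality = "".join(qual_chunks)
        if len(quality) != len(sequence):
            raise ValueError(f"length mismatch for read {read_id!r}")
        yield read_id, sequence, quality
```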
Quality control: Encoding
Fastq files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
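Decoding a quality string is a matter of subtracting the offset from each character's ASCII code. The sketch below also includes a rough offset guess that assumes only the two encodings shown here are possible. Both function names are hypothetical, and the guess cannot tell the encodings apart when every character is "@" or higher.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    scores = [ord(ch) - offset for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("character below the offset; wrong encoding?")
    return scores


def guess_offset(quality):
    """Heuristic: in Phred+64 the lowest possible character is '@' (ASCII 64),
    so any character below it implies Phred+33. Otherwise assume Phred+64."""
    return 33 if any(ord(ch) < 64 for ch in quality) else 64
```

In practice the guess is more reliable when it is made over many reads, since a single high-quality read may contain no low characters at all.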
http://bit.ly/N28yUd
The Phred score of a base is: Qphred = -10 × log10(e)
where e is the estimated probability that the base call is wrong
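The Phred formula and its inverse are short enough to write out directly. This is a minimal sketch with hypothetical function names.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(e), where e is the estimated probability of a wrong base."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Invert the formula: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

So Q20 corresponds to an expected error rate of 1 in 100 and Q30 to 1 in 1000.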
Pre-processing: Tools
Quality assessment
• FastQC
Trimming
• FASTX toolkit
• Trimmomatic
• Scythe
Joining paired-end reads
• fastq-join
• FLASH
• PANDAseq
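To show the idea behind quality trimming, here is a deliberately simple sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by any of the tools listed above. The function name, the Q20 threshold, and the Phred+33 default are all assumptions.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Cut low-quality bases from the 3' end of a read.

    Scans from the 3' end and stops at the first base whose Phred score
    is at least min_q; everything after that base is removed.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]
```

Dedicated trimmers typically add sliding-window averaging and adapter removal on top of this kind of scan.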
Pre-processing: Error correction
Thank you!!