Introduction to next-gen sequencing bioinformatics.ca
Genomic Sequence Questions
• How are sequence maps of genomes produced?
• How is the information in the genome deciphered?
• What can comparative genomics reveal about genome structure and evolution?
• How does the availability of genomic sequence affect what we can do/ask?
The human nuclear genome viewed as a set of labeled DNA
Chapter 13 Opener
3,072 characters/page: 1,953 reams of paper to print out
DNA = 1.8 meters!
(Background sequence, repeated to fill the slide:)
AGCTAGTACAGCAGCTAGGCCGCATCATTAATTCGTATATATATATTCTCTCTCTAGAGCATCACATGCTACTAGCTGATATTCCTTCCGCGCGGCCGGCGAATCATTTACGTAAAAAAATTTTTCGC
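The paper figure on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the 3,072 characters per page comes from the slide, while the genome size of 3 billion bases and the 500-sheet ream are assumptions that are not stated on the slide.

```python
import math

CHARS_PER_PAGE = 3072            # from the slide
GENOME_BP = 3_000_000_000        # assumption: ~3 billion bases (haploid human genome)
SHEETS_PER_REAM = 500            # assumption: standard ream size


def pages_and_reams(genome_bp=GENOME_BP,
                    chars_per_page=CHARS_PER_PAGE,
                    sheets_per_ream=SHEETS_PER_REAM):
    """Return (pages, reams) needed to print one base per character."""
    pages = math.ceil(genome_bp / chars_per_page)
    reams = pages / sheets_per_ream
    return pages, reams


pages, reams = pages_and_reams()
print(f"{pages:,} pages, about {round(reams):,} reams")
```

Under these assumptions the result agrees with the 1,953 reams quoted on the slide.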
“Normal” DNA synthesis without dideoxy terminators
Figure 7-15
The structure of 2’,3’-dideoxynucleotides
The dideoxy sequencing method
Figure 20-16a
The dideoxy sequencing method
Figure 20-16b
Principles of DNA Sequencing
5’ Primer
3’ Template G C A T G C
Four reactions, each with all four dNTPs plus one dideoxy terminator:
dATP dCTP dGTP dTTP + ddATP: GCddA
dATP dCTP dGTP dTTP + ddCTP: GddC, GCATGddC
dATP dCTP dGTP dTTP + ddTTP: GCAddT
dATP dCTP dGTP dTTP + ddGTP: ddG, GCATddG
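The fragment labels on this slide can be reproduced with a short sketch. It assumes that G C A T G C is read as the newly synthesized strand (as the fragment labels suggest) and that each reaction yields every prefix ending in its dideoxy base; the function names are illustrative only.

```python
def terminated_fragments(new_strand, dd_base):
    """Every prefix of the new strand that ends in the dideoxy base."""
    return [new_strand[:i + 1]
            for i, base in enumerate(new_strand)
            if base == dd_base]


def read_ladder(new_strand):
    """Pool the four reactions, sort fragments by length, read the terminators."""
    ladder = []
    for dd_base in "ACGT":
        for fragment in terminated_fragments(new_strand, dd_base):
            ladder.append((len(fragment), dd_base))
    return "".join(base for _, base in sorted(ladder))


strand = "GCATGC"
for dd_base in "ACGT":
    print(f"dd{dd_base}TP:", terminated_fragments(strand, dd_base))
print("read from short to long:", read_ladder(strand))
```

Sorting by fragment length stands in for electrophoresis: the shortest fragment identifies the first base, the next shortest the second, and so on.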
Principles of DNA Sequencing
[Figure: gel with lanes G, C, T, A and electrodes marked + and −; fragments ordered from short to long read G C A T G C]
Multiplexed CE with Fluorescent detection
ABI 3700: 96 capillaries x 700 bases
Small Fragments
Large Fragments
Types of data generated
• Mated read pairs (Forward and Reverse)
• Insert size
• Chromatograms
– Sequence
– Quality scores
• Assembly (often multiple versions)
– Depth (coverage)
– Gaps (sequence and physical)
– Scaffolds (100 N convention)
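The "100 N convention" for scaffolds can be illustrated as follows. This sketch assumes the convention means that a gap of unknown length between two ordered contigs is written as a run of 100 N characters; the helper names are hypothetical.

```python
import re

GAP = "N" * 100  # assumed placeholder for a gap of unknown size


def build_scaffold(ordered_contigs):
    """Join ordered, oriented contigs with a 100-N spacer between neighbours."""
    return GAP.join(ordered_contigs)


def split_scaffold(scaffold):
    """Recover the contigs by splitting on runs of at least 100 N."""
    return [piece for piece in re.split(r"N{100,}", scaffold) if piece]


scaffold = build_scaffold(["ACGTACGT", "GGCCTTAA", "TTGACA"])
print(len(scaffold), split_scaffold(scaffold))
```

Because the spacer length is fixed, the scaffold length says nothing about the true gap size; it only records the order of the contigs.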
Shotgun Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
Shotgun Sequencing
Sequence Chromatogram
Send to Computer
Assembled Sequence
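The shearing step can be mimicked in software to get a feel for the data that reaches the computer. This is a toy sketch, not a model of any real library protocol: fragment lengths and start positions are drawn uniformly at random, and all names and parameters are illustrative.

```python
import random


def shear(sequence, n_fragments, min_len, max_len, seed=0):
    """Draw random fragments from a sequence, as a stand-in for DNA shearing."""
    rng = random.Random(seed)          # seeded so the example is repeatable
    max_len = min(max_len, len(sequence))
    fragments = []
    for _ in range(n_fragments):
        length = rng.randint(min_len, max_len)
        start = rng.randint(0, len(sequence) - length)
        fragments.append(sequence[start:start + length])
    return fragments


chromosome = "ATGGCGTACCTGAAGTCCGA" * 10
fragments = shear(chromosome, n_fragments=20, min_len=10, max_len=30)
print(len(fragments), "fragments; first:", fragments[0])
```

The fragments overlap at random, which is what later allows an assembler to put them back in order.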
Shotgun Sequencing
• Very efficient process for small-scale (~10 kb) sequencing (preferred method)
• First applied to whole genome sequencing in 1995 (H. influenzae)
• Now standard for all prokaryotic genome sequencing projects
• Successfully applied to D. melanogaster
• Moderately successful for H. sapiens
Genome sequencing is now automated
Figure 13-3
There are two main approaches to genome sequencing
• Laboratory Intensive
– Physical Maps
– Chromosome isolation
– “Walking”
– Slow, but you always know where you are
• Computationally Intensive
– Fast to generate data; use any of many technologies to randomly generate sequence from a variety of sources
– Slow to put all the pieces back together
Chromosome walking
Figure 20-13
A physical map puts clones in order
Figure 13-7a
Strategy for ordered-clone sequencing
Figure 13-8
The logic of creating a sequence map of the genome
Figure 13-2
End reads from multiple inserts may be overlapped to produce a contig
Figure 13-4
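How overlapping reads produce a contig can be sketched with a toy greedy merger. This is an illustration under simplifying assumptions (error-free reads, a single strand, no repeats); it is not the algorithm used by any particular assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["TGAAGTCCGA", "ATGGCGTACC", "GTACCTGAAG"]
print(greedy_assemble(reads))
```

When no further overlaps are found, the function returns more than one sequence; each is a separate contig, and the space between them is a gap.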
To sequence a genome, plasmids with different, but known, insert sizes are required
• Small insert plasmid library ~2 kb +/- 100 bp
• Medium insert plasmid library ~10 kb +/- 500 bp
• Large insert library ~50 kb +/- 1 kb
Note: Regardless of the insert size in the library, all clones are sequenced using “mated” or paired-end sequencing
[Figure labels: 2 kb, 10 kb, 50 kb inserts]
Strategy for whole-genome shotgun sequencing assembly
What do you do if you encounter a “GAP”, a region that is missing from the contigs? (Note: Contig = Contiguous sequence)
Paired-end reads may be used to join two sequence contigs
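Because the insert size of each library is known, a read pair that lands on two different contigs also gives a rough estimate of the gap between them. The sketch below assumes a simplified geometry (the forward read starts on contig A, the insert ends on contig B, both contigs in the same orientation); the coordinate names are hypothetical.

```python
from statistics import mean


def estimate_gap(insert_size, contig_a_len, fwd_start_in_a, insert_end_in_b):
    """Gap = insert size minus the parts of the insert lying on each contig."""
    span_on_a = contig_a_len - fwd_start_in_a   # from read start to end of contig A
    span_on_b = insert_end_in_b                 # from start of contig B to insert end
    return insert_size - span_on_a - span_on_b


def mean_gap(pairs, insert_size, contig_a_len):
    """Average the estimate over several bridging pairs from the same library."""
    return mean(estimate_gap(insert_size, contig_a_len, a, b) for a, b in pairs)


# Three pairs from a ~2 kb library bridging a 5,000 bp contig A and contig B.
bridging_pairs = [(4200, 700), (4500, 1050), (4000, 480)]
print(mean_gap(bridging_pairs, insert_size=2000, contig_a_len=5000))
```

Each individual estimate is only as precise as the insert size tolerance of the library, so averaging over several bridging pairs is the usual way to tighten it.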
Anatomy of a WGS Assembly
AAGCTTCGCCAGGCTGTAAATCCCGTGAGTCGTCCTCACAAATCATCAAGCAGGTGTCCTCAGGGAGACTGCCTGACTGAGTTATGCTAATTCCTTTCTACTTTGGCGTGGTCACGTGTAACCATATCCGAATCATTTCTCTAGCCCTACGAACAGGTAAGAGCGCTAGGGATGTCCGTGGAGTAGTGTGCTTACTCGATAATATTCAGTTGGGACTACCAGCGAGGCGCTCGCTTTGCTCACGCAATGCCTGAGACAGTTGCAGAATGAATGGTAACCGACAAACGCGTTCATATGCGTTTTCAAACTTAGTAGACGCGTACTGTCTGAAACTGGCGGTCACAGGCACCAGATAACGCCCTTGGCATCGGCATGTCTCGTACAGAGGTCCGTATGTAGTGCCACGACTTCTAAATCCGGCGACAGGCTGGTCTTTTGTCTTACCACGTATTAGCCCGCGTGCGATTTCTCGGAGCGCACCTGTTCAACACTAGAAAACGGAGTTTCCTGATCGAGAAGCCACCACCTTTCCAGAAGTTGAACGCTAGCATGTCATTCGATTTTCACCCCCCGCGTAGTTCCTGTGTGTCATTCGTTGTCGAGACAACTCTGTCCCGCCCCGGTGCTGTTCCATATGCGTGACTTTCCCGCAATTTTTTCAGACTTTCAGGAAAGACAGGCTCCGGAACGATCTCGTCCATGACTGGTAAATCCACGACACCGCAATGGCCCCCAGCACCTCTATCTCTCGTGCCAGGGGACTAACGTTGTATGCGTCTGCGTCTTGTCTTTTTGCATTCGCTTTCCAAAAAAGAGAGCCATCCGTTCCCCCGCACATTCAACGCCGCGAGTGCGGTTTTTGTCTTTTTTGAGTGGTAGGACGCTTTTCATGCGCGAACTACGTGGACATTAAGTTCCATTCTCTTTTTCGACAGCACGAAACCTTGCATTCAAACCCGCCCGCGGAAGATCCGATCTTGCTGCTGTTCGCAGTCCCAGTAGCGTCCTGTCGGCCGCGCCGTCTCTGTTGGTGGGCAGCCGCTACACCTGTTATCTGACTGCCGTGCGCGAAAATGACGCCATTTTTGGGAAAATCGGGGAACTTCATTCTTTAAAAGTATGCGGAGGTTTCCTTTTTCTTCTGTTCGTTTCTTTTTCTCGGGTTTGATAACCGTGTTCGATGTAAGCACTTTCCGTCTCTCCTCCGTGCTTTGTTCGACATCGAGACCAGGTGTGCAGATCCTTCGCTTGTCGATCCGGAGACGCGTGTCTCGTAGAACCTTTTCATTTTACCACACGGCAGTGCGGAGCACTGCTCTGAGTGCAGCAGGGACGGGTGAAGTTTCGCTTTAGTAGTGCGTTTCTGCTCTACGGGGCGTTGTCGTGTCTGGGAAGATGCAGAAACCGGTGTGTCTGGTCGTCGCGATGACCCCCAAGAGGGGCATCGGCATCAACAACGGCCTCCCGTGGCCCCACTTGACCACAGATTTCAAACACTTTTCTCGTGTGACAAAAACGACGCCCGAAGAAGCCAGTCGCCTGAACGGGTGGCTTCCCAGGAAATTTGCAAAGACGGGCGACTCTGGACTTCCCTCTCCATCAGTCGGCAAGAGATTCAACGCCGTTGTCATGGGACGGAAAACCTGGGAAAGCATGCCTCGAAAGTTTAGACCCCTCGTGGACAGATTGAACATCGTCGTTTCCTCTTCCCTGTGAGCACACACAGTAGTCGCCACACGCTGTTTGAGACGTGTCAATCTCCAAGAGTGTGGACGCTGTTCCACGTCTTCAAATGTTTCCCAACATCCGTCGTCTAGTAGACACACCAACAAAAAGCACACGGCGAATCTGCTCATCGGAGGGAGGAGCCGGGGGGCACACAACTATCCTCAACTCTCGAACGAACATATCCGGGGCCGCGAAGACGTCCAGTCTCTCAAATCCAACCCGGAACGCAAACATTTCTGCATCAAGTCACGATTGCGCCGGTACCTCCATGT
GTAAGCAGTTCCATGAAACCTCCGATATTACACACGACTGTGGATATGAATTATATGCAGATGCATATATACTGAGACGCCGATGCAACTATAGGTTTCCTGGCCCTCCATGGATATTTCAGACCTTCCTCTCACATTTGGTTTGCCCGTACACCTCCGTTACGCTTTTTTTCTGGCTTTCTTCTTCGTCTCTGTTTATCAGCAAAGAAGAAGACATTGCGGCGGAGAAGCCTCAAGCTGAAGGCCAGCAGCGCGTCCGAGTCTGTGCTTCACTCCCAGCAGCTCTCAGCCTTCTGGAGGAAGAGTACAAGGATTCTGTCGACCAGATTTTTGTCGTGGGTATGTTGTCCTAAACTCCTTGGAACTCCATTCTTGGTCAGAAACGTACTGAAACTGTATACATGTATATACAGATGTATGGATAATATCTAGAGAAGATACAGGGAAGACTGGCAAGGATGAAAAGACATGCAGCTTTAACGAAGCAGAGGGCATTGGCGAGAGGGACGCCCGTTATGCTGTGTGATGTGGCTGTGAATCTTACCTCGCCGTTTGACTTGCTGCAGCGCTTTGTCCACTTGAACGTGACTTCTTGTTTCTACCTTCCCCAACGCCTTCTATTCCCTTCACTGCGAAAGCGCGCTCAGTGGGCCGTCACCGAACACCCTTGGTTCTTTCGTTCAGCTGTTGTCCTCTTTCTCGCGTTGCTTCCTGTGGCGTCGTGGCTCGGCTTCTCTCTCTTTCCTGTTGGTGCGTCCAGACTATGTCGCCTGTTTCCCCACCCTTCTCGGCTTGTGCTTTCAGGAGGAGCGGGACTGTACGAGGCAGCGCTGTCTCTGGGCGTTGCCTCTCACCTGTACATCACGCGTGTAGCCCGCGAGTTTCCGTGCGACGTTTTCTTCCCTGCGTTCCCCGGAGATGACATTCTTTCAAACAAATCAACTGCTGCGCAGGCTGCAGCTCCTGCCGAGTCTGTGTTCGTTCCCTTTTGTCCGGAGCTCGGAAGAGAGAAGGACAATGAAGCGACGTATCGACCCATCTTCATTTCCAAGACCTTCTCAGACAACGGGGTACCCTACGACTTTGTGGTTCTCGAGAAGAGAAGGAAGACTGACGACGCAGCCACTGCGGAACCGGTAAGAGGCAACCGAAGCGCGTAGATAAGAAAAACAACAAAGAGAAGGTGAAACACGAAGAGAAGGGAAAATGCGGAGAAACCGTGGATTTACAAAGATATCAAGAGCAATGCTTTGTGGAGATTTTTTTTAATTCAGTAGAGACACCCGCCGTGCGAGGTGTGTAGAAATAACTGCGACCCTGGAGACAGAGATGCCGCGAGTACACCACTTGTCGTTTTTCCTCCTATGTTCATGACGGGTGCTGAACGTCTATCGTACTTAATTGGAGGAGTCGTCTCCGAAGCAGCTTTGGCTGGCCATCCGTGTGTTTGCCTTGTTCCTGAAAAGCCAGAAGGCGCTCCACAGTGAGGCGATATACAGGGACGCCTACCGGAGCCCCGTTTTCTGCCTTTGTCGACTCTTGCAGAGCAACGCAATGAGCTCCTTGACGTCCACGAGGGAGACAACTCCCGTGCACGGGTTGCAGGCTCCTTCTTCGGCCGCAGCCATTGCCCCGGTGTTGGCGTGGATGGACGAAGAAGACCGGAAAAAACGCGAGCAAAAGGAACTGATTCGGGCCGTTCCGCATGTTCACTTTAGAGGCCATGAAGAATTCCAGTACCTTGATCTCATTGCCGACATTATTAACAATGGAAGGACAATGGATGACCGAACGGGTAACGGCGACTGCGAGAAAAAGCCACACCGTTTTCTCCTGTGATTCTGTCCGCAAGCCCTCTTTTGCTTCATCCACCCTTTGCTATTCTCCGCCGCCTTCCTTTTCTGCTCCATGTTCAATTCGTTCGCTTCTTCAGTCTTTCCATCTTCCCCTGTTACCTCTGTCATTCGTTTTCTTGCCTCTATTTAACTGTGTTCTACTCA
CAGTCTGCATTCCGCGATAGACGAGCTTCCACGTCTTGCGTCTCGACAAGCAACTGTCATTTGTACGCGCCTCCCTCCACCGTGAATCGGATTGTCGGTTCGCCGGTTCCTGGGTCAGAAAAGGCCTGCGCCAGTATTCTGAATAATACCCTTCGCCATTGTAAAGAGGCGAAGGAACAAAGAGATATTTCGGCGCATCTTTTGTGCGGCGCGTTTCCTCGTGCTTCACACCGATGCCCTTCTGTGCATGTCTTCTGCTCCTCGTCCTTCTCTCTTTTTCCCTGTTTAGGCGTTGGTGTCATCTCCAAATTCGGCTGCACTATGCGCTACTCGCTGGATCAGGCCTTTCCACTTCTCACCACAAAGCGTGTGTTCTGGAAAGGGTAAGGGCGTCTTCAGTGAATGCATATATTTGACTTCAGACATTCTTAACTGTTTGACAACCAACGTACAAATTTGTTTGTCCGTGTGCGTGTTCGACATGTCAAGTATGTGAAGAGTCGCTACTGTAGACTAACGCACGAACCAGATTTGTTTATCTGCATGCGCTGTGCACCCGTTTCTGAGTGTCTGGAGTTTCCGCAACCTTCCTTTGAATTTCTGGGTTCGTTTTTTTATGCGCGCACTGGTTTGCATGTGGCCTGAGAGAGCACAGATCGAAGGTGGGGTGATGTGGCGTCGCTGCAGAGAAACTCCGGCGAAGGCGACAGATAAAGGAGAGTGGAAATCATTGAACAGTGTCGGTCGTCTGTTGTTTCGCAGGGTCCTCGAAGAGTTGCTGTGGTTCATTCGCGGCGACACGAACGCAAACCATCTTTCTGAGAAGGGCGTGAAGGCAAGTCTACGTTGTACCTCTTGTCTCTGCCGAAGCTCAGATGTCTCCACGGCGTTGGTTTCTTTTCGTTTTTGCTTTCGTGGCATTACCATCGAGTCACCACTCATAGTTGCGTGTGTCTACATGTTTTCTAGAACGTCCGTTGTGTTGCCTCGTGGCGACC
The Bioinformatic Pipeline
• Many software packages exist; the most widely used free suite is Phred-Phrap-Consed
• Quality scores are obtained and files generated
• Vector sequences are removed
• A repeat library is constructed and sequences are masked
• Reads are assembled, viewed and assessed
• Primers are designed to close gaps
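The quality scores in the first pipeline step follow the Phred definition Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below shows the conversion plus a naive end-trimming rule; the trimming function is an illustrative assumption, not the method used by Phred, Phrap or Consed.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score for an error probability."""
    return -10 * math.log10(p)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Drop bases from both ends until a base with quality >= min_q is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


print(phred_to_error_prob(20))                        # 1 error in 100
print(round(error_prob_to_phred(0.001)))              # Q30
print(trim_low_quality_ends("ACGTAC", [5, 10, 30, 35, 25, 8]))
```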
Genomic Sequence
• What was the sequencing strategy?
• What is the genome size? Repeat content?
• What “fold” coverage exists? 1X? 10X?
• Has host and vector contamination been removed?
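"Fold" coverage is the total number of sequenced bases divided by the genome size. The sketch below also includes the common Poisson approximation for the fraction of the genome left uncovered, which assumes reads are placed uniformly at random; the genome size and read length in the example are illustrative values only.

```python
import math


def fold_coverage(n_reads, read_len, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_len / genome_size


def reads_for_coverage(target_fold, read_len, genome_size):
    """Number of reads needed to reach a target fold coverage."""
    return math.ceil(target_fold * genome_size / read_len)


def fraction_uncovered(fold):
    """Poisson approximation: chance that a given base is hit by no read."""
    return math.exp(-fold)


genome_size = 30_000_000   # illustrative 30 Mb genome
read_len = 600             # illustrative read length
for fold in (1, 10):
    n = reads_for_coverage(fold, read_len, genome_size)
    print(f"{fold}X: {n:,} reads, ~{fraction_uncovered(fold):.4%} of bases uncovered")
```

Under this approximation, 1X coverage still leaves roughly a third of the bases unsequenced, which is why the fold coverage of a data set is worth asking about.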
The Plasmodium falciparum Genome
• Approx. 30 million bp in size, distributed across 14 chromosomes
• Genome project is an internationally funded effort (NIH, Wellcome Trust, Burroughs Wellcome Foundation)
• Sequence is being generated at 3 different sites (Sanger Centre, Stanford, TIGR)
• Sequence is nearly complete in terms of total coverage but unfinished in terms of assembly
• Sequence is nearly 80% A/T in composition
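Base composition such as the A/T figure above is simple to measure from sequence. A minimal sketch, assuming ambiguous characters such as N should be left out of the denominator:

```python
def at_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are A or T."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "AT" for base in called) / len(called)


print(at_fraction("AATTTGCATA"))    # 8 of 10 bases are A or T
print(at_fraction("aattNNNNgc"))    # N is ignored: 4 of 6
```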
The Sequencing Strategy
• Separate chromosomes on a pulsed-field gel
• In some cases, make chromosome-specific BACs or YACs
• Shotgun sequence smaller plasmids
• Remove contaminants (vector, E. coli, yeast)
• Assemble “contigs”
P. falciparum Statistics (3D7)
[Table not recovered; surviving labels: 11, 13, 10]
Contig Assembly
Chromosome
BACs YACs
Shotgun Clones (Plasmids)
Contiguated Clones
Contig Assembly Problems
Chromosome
BACs YACs
X
Physical gap: no cloned DNA exists
PCR Library Walking
Contig Assembly Problems
Sequence gap: clone exists but no sequence read
X
Contig Assembly Problems
Repetitive DNA elements
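Repeats cause assembly problems because the same sequence occurs at more than one place, so reads from different copies look as if they overlap. One simple way to see this is to count k-mers and mask those that occur more than once. This is a toy sketch with illustrative names; real repeat libraries and masking tools work differently.

```python
from collections import Counter


def repeated_kmers(sequence, k, min_count=2):
    """k-mers that occur at least min_count times in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def mask_repeats(sequence, k, min_count=2):
    """Replace every position covered by a repeated k-mer with N."""
    repeats = repeated_kmers(sequence, k, min_count)
    masked = list(sequence)
    for i in range(len(sequence) - k + 1):
        if sequence[i:i + k] in repeats:
            masked[i:i + k] = "N" * k
    return "".join(masked)


sequence = "GATTACACCGGATTACA"
print(repeated_kmers(sequence, k=7))
print(mask_repeats(sequence, k=7))
```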
The Nature of Unfinished Unannotated Sequence
• Fragmented
• May contain vector or library host DNA
• May have sequence gaps
• May be mis-assembled
• Genes and features are not identified
• Probably will NEVER be “finished”
Module 1 Introduction to next-gen sequencing
FRANCIS OUELLETTE
History of DNA Sequencing
Timeline axis: 1870, 1940, 1953, 1965, 1970, 1977, 1980, 1986, 1990, 2002, 2009
Milestones, in chronological order:
• Miescher: Discovers DNA
• Avery: Proposes DNA as ‘Genetic Material’
• Watson & Crick: Double Helix Structure of DNA
• Holley: Sequences Yeast tRNA-Ala
• Wu: Sequences λ Cohesive End DNA
• Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
• Messing: M13 Cloning
• Hood et al.: Partial Automation
• Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
• Next Generation Sequencing; Improved enzymes and chemistry; New image processing
Efficiency (bp/person/year), in increasing order: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000 (2009)
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
Why are we sequencing?
• Before Next-generation:
– Reductionist perspective on life
– DNA, RNA, (proteins), (populations), sampling, averages, consensus
• Problems: sampling, averages, consensus.
• After Next-generation:
– We are still reductionist, but better
– Genome sequence and structure
– Less cloning/PCR
– Single molecules (for some)
Sanger (old-gen) Sequencing vs. Now-Gen Sequencing
Whole Genome
– Sanger: Human (early drafts), model organisms, bacteria, viruses and mitochondria (chloroplast), low coverage
– Now-Gen: New human (!), individual genome, 1,000 normal, 25,000 cancer matched control pairs, rare samples
RNA
– Sanger: cDNA clones, ESTs, Full Length Insert cDNAs, other RNAs
– Now-Gen: RNA-Seq: Digitization of transcriptome, alternative splicing events, miRNA
Communities
– Sanger: Environmental sampling, 16S rRNA populations, ocean sampling
– Now-Gen: Human microbiome, deep environmental sequencing, Bar-Seq
Other
– Epigenome, rearrangements, ChIP-Seq
Differences between the various platforms:
• Nanotechnology used.
• Resolution of the image analysis.
• Chemistry and enzymology.
• Signal to noise detection in the software
• Software/images/file size/pipeline
• Cost $$$
Next Generation DNA Sequencing Technologies
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Human genome: 6 Gb = 6,000 Mb

                            3730                  454         Illumina
Req’d coverage              6                     12          30
bp/read                     600                   400         2 x 75
reads/run                   96                    500,000     100,000,000
bp/run                      57,600                0.5 Gb      15 Gb
# runs req’d                625,000               144         12
runs/day                    2                     1           0.1
Machine days/human genome   312,500 (856 years)   144         120
Cost/run                    $48                   $6,800      $9,300
Total cost                  $15,000,000           $979,200    $111,600
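The derived rows of this comparison follow from three formulas: runs required = genome size x coverage / bp per run, machine days = runs / runs per day, and total cost = runs x cost per run. The sketch below recomputes them from the table's inputs, reading "GB" as gigabases. Note that for the 3730 the product of 625,000 runs and $48 per run is $30,000,000, whereas the table lists $15,000,000; the code simply reports the product.

```python
import math

GENOME_BP = 6_000_000_000  # 6 Gb, from the table

# Inputs copied from the table; the dictionary layout is illustrative.
PLATFORMS = {
    "3730": {"coverage": 6, "bp_per_run": 57_600,
             "runs_per_day": 2, "cost_per_run": 48},
    "454": {"coverage": 12, "bp_per_run": 500_000_000,
            "runs_per_day": 1, "cost_per_run": 6_800},
    "Illumina": {"coverage": 30, "bp_per_run": 15_000_000_000,
                 "runs_per_day": 0.1, "cost_per_run": 9_300},
}


def project(platform, genome_bp=GENOME_BP):
    """Return (runs required, machine days, total cost) for one platform."""
    runs = math.ceil(genome_bp * platform["coverage"] / platform["bp_per_run"])
    machine_days = runs / platform["runs_per_day"]
    total_cost = runs * platform["cost_per_run"]
    return runs, machine_days, total_cost


for name, platform in PLATFORMS.items():
    runs, days, cost = project(platform)
    print(f"{name}: {runs:,} runs, {days:,.0f} machine days, ${cost:,}")
```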
URLs
• http://454.com/
• http://illumina.com/
• http://appliedbiosystems.com/
• http://pacificbiosciences.com/
• http://helicosbio.com