Genome Assembly With Next Generation Sequencers
3 May, 2011Jongsun Park
Personal Genomics Institute
Table of Contents
1 Central Dogma and –Omics Studies
2 History of Sequencing Technologies
3 Genome Assembly Processes With NGS Sequences
4 Current Status of Plant Genomes
Central Dogma and -Omics Studies
Central Dogma in Molecular Biology and Bioinformatics
DNA RNA Proteins
Genomics Transcriptomics ProteomicsAd
van
ces
of
Bio
tech
no
logi
es
Bioinformatics
Further –Omics Studies With Bioinformatics
Genomics Population Genomics
Phylogenomics
Transcriptomics
Gene Expression
Regulatory Network
Non-coding RNAs
Epigenomics
Proteomics Metabolomics
Pathway Analysis
3D structure
Position of Sequences in Central Dogma!
AGCUACGUGAGAGACGUACUGUAC…
AGCTACGTGAGAGACGTACUGTAC…
MASTWTSWAMTCCAAMST…
Target for “Sequencing”
Predicted from sequences
History of Sequencing Technologies
Classical Method: Sanger Sequencing
- Using dideoxynucleotides, Dna synthesis can be stopped randomly with four
different reaction tubes.
DNA Template
DNA Template
DNA Template
DNA Template
Primer
Primer
Primer
Primer
New DNA strain A
New DNA strain T
New DNA strain G
New DNA strain C
+ddATP
+ddTTP
+ddGTP
+ddCTP
http://en.wikipedia.org/wiki/File:Sequencing.jpg
Automated Method: Sanger Sequencing
- Using four fluorescent dyes, scanner can read sequences directly.
- With 96 capillaries, machine can read 96 or 384 different samples at one time.
DNA Template
DNA Template
DNA Template
DNA Template
Primer
Primer
Primer
Primer
New DNA strain A
New DNA strain T
New DNA strain G
New DNA strain C
+ddATPwith dye
+ddTTPwith dye
+ddGTPwith dye
+ddCTPwith dye
http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg
Capacity of Automated Sanger Sequencing
- Per one capillary, 700 bases (Quality Value is
higher than 20) can be read in one hour.
- Per one machine, ABI3730 with 384
capillaries kit, 700 * 384 * 24 = 6,451,200 bp
can be obtained without considering sample
preparation process.
https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600533&tab=Overview
- If you want to get 1x human genome sequences using one this machine,
it will take 465 days (30,000,000,000 bp / 6,451,200 bp = 465.029..).
- For de novo assembly, usually 6x to 10x coverage are required: it will take
4,650 days for obtaining enough sequences for genome project.
Next Generation Sequencing (NGS) Technologies
- Next (or Current) generation sequencing technologies have accelerated the speed
of genome sequencing projects and have broaden application range of genome
sequences.
Solexa; Illumina
SOLiD; ABI
GS-Titanium; Roche 454
SMRT; Pacific Bioscience Helicos; Helicos Bioscience
NGS: 454 Technology
NGS: Solexa Technology
NGS: SOLiD Technology (1)
NGS: SOLiD Technology (2)
NGS: SOLiD Technology (3)
Capacities of Next Generation Sequencers
Solexa GA2; Illumina
SOLiD 4; ABI
GS-Titanium; Roche 454
ABI 3730; ABI
384 x 700 bp = 268,800 bp = 269Kb (per one reaction / 1 hr)
950,000 x 450 bp = 405,000,000 bp = 405Mb (per one reaction / 2-3 days)
35,000,000 x 7 x (151 x 2) bp = 73,990,000,000 bp = 74.0 Gb (per one reaction / 12 days)
1,400,000,000 x 75 bp (50+25) = 105,500,000,000 bp = 105.5Gb (per one reaction / 11 days)
HiSeq2000; Illumina
70,000,000 x 14 x (101 x 2) bp = 197,960,000,000 bp = 198.0 Gb (per one reaction / 8 days)
Single Read and Mate-Pair (Pair-end) Sequences
- Single read is the most simple method: just read one
direction for each sample (cluster or bead).
- Mate-pair (or pair-end) method can generate both
side of short-read sequences of each read, which is
similar to BAC-end, Fosmid-end, Cosmid-end
sequences.
- Mate-pair (pair-end) methods are useful for
generating larger sequences with gaps (We called it
scaffolding.)
Summary of Costs For NGS Technologies
Sequence explosion, Nature, 46(1), 670-671
Pros and Cons of NGS Technologies
- Large number of reads per one run
- Extremely low costs for generating
huge amount of sequences
- Diverse applications not only for
genomics but also for transcriptomics,
small RNAs, epigenomics, and etc.
with small modification of protocols.
Pros Cons
- Short read length per each reads
(36 bp to 151 bp or 450bp)
- Different types of sequencing
qualities (GS, Solexa, and SOLiD)
- Difficult to deal with large size of
sequences with normal programs
including de novo assembly
- Impossible to get small amount of
sequences with low costs
Genome Assembly ProcessesWith NGS Sequences
Whole Shotgun Sequence Strategy
- Assembly process is essential for genome project because read length of each
sequence is less than 1 kb.
- Assembly process was conducted by several popular programs, such as phrap
and PCAP3, for Sanger sequences.
Genome AssemblyScaffolding
Assembled genome
Genome Assembly Process
- We can perform genome assembly manually!
23 sequences should be compared with each other!
23C2 = 23*22 / 2 = 253 comparison!
How to Find Overlapped Sequences?
- Using dynamic algorithm, we can make the program for finding similar sequences.
- Complexity of this algorithm is O(n3).
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
Classification of Alignments (1)
Considering whole sequences for
alignment?
Global Alignment
Yes
No
Local alignment
Classification of Alignments (2)
How many sequences are considered for
alignment?
Pair-wise alignment
2 sequences!
More than 2 sequences!
Multiple alignment
Famous Bioinformatics Tools for Alignments
- Global alignment : ClustalW, T-coffee, and MUSCLE
- Local alignment : FASTA, and BLAST (Basic Local Alignment Searching Tools)
provided by NCBI.
- Pair-wise alignment : BLAST and FASTA
- Multiple sequence alignment : ClustalW, T-coffee, MUSCLE, and etc.
Example of Genome Assembly: Vitis Vinifera
Pair-wise comparison of 6,200,000 reads
6,200,000C2 = 19,219,996,900,000 comparisons
Genome Assemblers
http://www.phrap.org/
Short-read Sequence Assembly (1)
- Short-read sequences generated by NGS machines cause several problems
of already well-established genome assemblers.
- Too many reads require near to infinite computational power.
- Too short reads cannot find reliable overlaps to make long contigs.
- To reduce computational power, new algorithm was required.
- Short reads require another strategy to make reliable contig sequences.
- Dealing a lot of sequences also caused several technical problems, such as
physical memory problems, harddisk space, and computing power.
Short-read Sequence Assembly (2)
563,466,202C563,466,201 = 158,747,080,116,419,301 comparison?!
563,466,202
De brujin Graph: Alternative Method For Alignment
De brujin Graph Algorithm For Alignment (1)
- This algorithm has been utilized for finding overlapped short-read sequences
quickly.
- This algorithm consists of three parts:
i) Generating k-mer sequences
ii) Constructing de brujin graph
iii) Resolving the graph with generating sequences
K = 3GCAAAACACTTA…
GCA
CAA
AAA
AAA
AAC
De brujin Graph Algorithm For Alignment (2)
ACA
CAC
1-3
1-4
1-1 1-2
1-5
1-6
1-71-8
1-9
ACT
CTT
TTA
1-10
ACA
CACACT
1-3
1-4
1-1 1-2
1-5
1-6
1-71-8
1-9
1-10
GCAAAACACTTA…
De brujin Graph Algorithm For Alignment (3)
CTT
TTA
ACACTTATTCGT
TAT
ATT
TTC
TCG
CGT
TAT
ATT
TTC
TCGCGTK = 3
2-1
2-22-3
2-4
2-5
2-6
2-7
2-8
2-92-10
De brujin Graph Algorithm For Alignment (4)
1-3
1-4
1-1 1-2
1-5
1-6
1-71-8
1-9
1-10
TAT
ATT
TTC
TCGCGT
2-1
2-22-3
2-4
2-5
2-6
2-7
2-8
2-92-10
GCAAAACACTTATTCGT
Genome Assemblers For Short Read Sequences
Examples of Plant Genome de novo Assembly
5,937,915,739 bp
# of contigs 950 ea
Total length 976,089 bp
Maximum length 12,606 bp
Average length 1,027.46 bp
N50 length 3,061 bp
Lithocarpushancei
Ficusaltissima
Ficusaltissima
Ficusaltissima
Ficustinctoria
Ficustinctoria
# of contigs 462,868 355,052 132,590 247,376 337,777 476,937
Total length 112,614,098 87,502,701 33,293,636 61,369,608 87,427,716 116,554,688
Maximumlength
1,748 1,090 1,688 1,334 1,274 1,578
Average length
243.30 246.45 251.10 248.08 258.83 244.38
N50 length 237 239 245 241 248 3061
Giant Panda Genome Project
Sordaria macrospora Genome Project
Large Scale Genome Projects: 1000 Human Genomes
http://www.1000genomes.org/page.phphttp://en.wikipedia.org/wiki/File:Genetic_Variation.jpg
http://www.ldl.genomics.cn/page/index.jsp
Large Scale Genome Projects: BGI 1000 Genomes
http://genome10k.org/
Large Scale Genome Projects: Genome 10K Project
Current Status ofPlant Genome Projects
Species name Method Size (Mb) # of contigs # of transcripts
Arabidopsis lyrata WGS 206.67 695 32,670
Medicago truncatula BAC, WGS 278.69 9 38,334
Selaginella moellendorffii WGS 212.76 768 22,285
Lycopersicon esculentum WGS, BAC 794.60 7,409 49,389
Solanum phureja WGS 702.58 57,681 110,512
Ricinus communis WGS 362.47 28,518 38,613
Mimulus guttatus WGS 416.66 11,243 47,442
Manihot esculenta WGS 321.73 2,216 27,501
Phoenix dactylifera WGS 284.68 234,704 -
Prunus persica WGS 227.25 202 28,689
Oryza glaberrima WGS 316.08 5,309 -
Unpublished 11 Higher Plant Genomes
All pictures are from wikipedia.
Species name Journal Method Size (Mb) # of contigs # of transcripts
Arabidopsis thaliana Nature, 2000 BAC +α 119.19 5 32,615
Oryza sativa japonicaScience, 2002Nature, 2005
BAC +α 372.08 12 66,710
Oryza sativa indicaScience, 2002PLoS Biology, 2005
WGS 426.32 10,267 49,710
Oryza sativa japonica (syngenta)
PLoS Biology, 2005 WGS 391.14 7,777 45,824
Populus trichocarpa Science, 2006 WGS 485.51 22,012 45,555
Vitis vinifera Nature, 2007WGS, Complete
497.51 35 30,434
Carica papaya Nature, 2008 WGS 369.69 17,677 28,589
Lotus japonicus DNA Research, 2008 WGS 323.24 110,945 26,700
Sorghum bicolor Nature, 2009 WGS 738.54 3,304 36,338
Zea mays Science, 2009 BAC, WGS 2,061.02 11 53,764
Cucumis sativusNature genetics, 2009
WGS 243.57 47,488 26,682
Glycine max Nature, 2010 WGS 996.90 4,262 62,199
Brachypodiumdistachyon
Nature, 2010 WGS 273.27 197 32,255
Malus x domesticaNature Genetics,2010
WGS 871.56 144,621 57,386*
14 Published Higher Plant Genomes from 11 Species
All pictures are from wikipedia.
Species name Journal MethodSize (Mb)
# of contigs
# of transcripts
Chlamydomonas reinhardtii Science, 2007 WGS 112.31 88 16,709
Micromonas pusilla CCMP1545 Science, 2009 WGS 22.04 27 10,547
Micromonas sp. RCC299 Science, 2009 WGS 20.99 17 10,108
Ostreococcus lucimarinus CCE9901 PNAS, 2007 WGS 13.2 21 7,488
Ostreococcus sp. RCC809 Not published yet WGS 13.41 22 7,492
Ostreococcus tauri PNAS, 2007 WGS 12.58 118 7,725
Coccomyxa sp. C169 Not published yet WGS 48.95 45 9,629
7 Unicellular Plants Genomes
All pictures are from wikipedia.
Distribution of 32 Plant Genome Size
Mb
0.0
500.0
1000.0
1500.0
2000.0
2500.0
12.6 13.2 13.4 21.0 22.0 49.0 112.3 119.7
206.7 212.8 225.9 227.3 273.3 284.7 307.5 316.1 321.7 323.2 362.5 369.7 372.3 391.1 416.7 426.3
485.5 497.5
702.6 738.5 781.8
871.6
996.9
2061.0
Unicellular Plants
Distribution of Number of Transcripts of 29 Plants
# of transcripts
0
20,000
40,000
60,000
80,000
100,000
120,000
7,4887,4927,7259,62910,10810,547
16,709
22,2852668226,70027,50128,58928,68930,43432,25532,67033,410
36,33838,613
45,55545,82447,4424948349,71053,42353,764
62,19967,393
110,512
Unicellular Plants
Relationship between Genome Size and Transcripts# of transcript/Genome Size (Mb)
0.0
100.0
200.0
300.0
400.0
500.0
600.0
700.0
26.1 49.2
61.2 62.4 63.3 77.3 82.6 85.5 93.8
104.7 106.5 113.9 116.6 117.2 118.0 118.1 126.2 148.8 157.3 158.1
173.7 181.0 196.7
279.2
478.5 481.6
558.7 567.3
614.1 Unicellular Plants
Comparisons with Genomes in Other Kingdoms
Mb
5.0 102.9
23.8 24.1
1587.1
392.2
-500.0
0.0
500.0
1000.0
1500.0
2000.0
2500.0
3000.0
3500.0
Bacteria Oomycota Eukaryota Fungi Metazoa Viridaeplanta
5 species 6 species 23 species 256 species 90 species 32 species
Cucumber Genome Sequences
Thank you for your attention!