Plant Omics Resources
4 June, 2010
Jongsun Park
Current Status of Plant Genome Projects with Next Generation Sequencing Technologies
Fungal Bioinformatics Laboratory, Seoul National University
Next Generation Sequencing (NGS) Technologies and
de novo Assembly Problems
A Huge Number of Sequence Data in NCBI
- NCBI, the major sequence repository, shows rapid growth in the number of
deposited sequences.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
99,116,431,942 bp
Next Generation Sequencing (NGS) Technologies
- Next (or current) generation sequencing technologies have accelerated genome
sequencing projects and have broadened the range of applications of genome
sequences.
Solexa; Illumina
SOLiD; ABI
GS-Titanium; Roche 454
SMRT; Pacific Biosciences
Helicos; Helicos BioSciences
NGS: 454 Technology
NGS: Solexa Technology
NGS: SOLiD Technology (1)
NGS: SOLiD Technology (2)
NGS: SOLiD Technology (3)
Capacity of Next Generation Sequencers
ABI 3730; ABI: 96 x 1,000 bp = 96,000 bp ≈ 100 Kb
GS-Titanium; Roche 454: 950,000 x 450 bp = 427,500,000 bp = 427.5 Mb
Solexa GA2; Illumina: 30,000,000 x 7 x (101 x 2) bp = 42,420,000,000 bp = 42.4 Gb
SOLiD 4; ABI: 940,000,000 x 75 bp (50+25) = 70,500,000,000 bp = 70.5 Gb
HiSeq2000; Illumina: 30,000,000 x 7 x (101 x 2) x 4 bp = 169,680,000,000 bp = 169.7 Gb
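The per-run capacities above are simple products of read count, read length, and (where relevant) lanes and paired ends. A minimal sketch that recomputes each product from the read counts and read lengths listed on this slide; the dictionary layout and the `human` helper are illustrative assumptions, not part of the original talk.

```python
# Per-run throughput = number of reads x read length (x lanes / paired ends).
# The inputs are the figures listed on the slide; names are illustrative only.
runs = {
    "ABI 3730": 96 * 1_000,
    "GS-Titanium (Roche 454)": 950_000 * 450,
    "Solexa GA2 (Illumina)": 30_000_000 * 7 * (101 * 2),
    "SOLiD 4 (ABI)": 940_000_000 * (50 + 25),
    "HiSeq2000 (Illumina)": 30_000_000 * 7 * (101 * 2) * 4,
}


def human(bp: int) -> str:
    """Format a base-pair count with the largest fitting unit (Kb, Mb, Gb)."""
    for unit, scale in (("Gb", 10**9), ("Mb", 10**6), ("Kb", 10**3)):
        if bp >= scale:
            return f"{bp / scale:.1f} {unit}"
    return f"{bp} bp"


for name, bp in runs.items():
    print(f"{name:26s} {bp:>16,d} bp  ({human(bp)})")
```

Keeping the factors separate makes it easy to see which term (reads, length, lanes) drives the gap between Sanger and NGS instruments.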
Pros and Cons of NGS Technologies
Pros
- Large number of reads per run
- Low sequencing costs
- Diverse applications, not only for genomics but also for
transcriptomics and small RNAs
Cons
- Short read length
ex) 36 bp to 101 bp
- Different types of sequencing quality
- Difficulty dealing with large result files
- De novo assembly is almost impossible
Whole-Genome Shotgun Sequencing Strategy
- An assembly process is essential for a genome project because each read
is shorter than 1 kb.
- For Sanger sequences, assembly was conducted with several popular programs,
such as phrap and PCAP3.
[Figure: Genome Assembly -> Scaffolding -> Assembled genome]
Genome Assembly Process
- We can perform genome assembly manually!
Each of the 23 sequences must be compared with every other one!
23C2 = 23*22 / 2 = 253 comparisons!
Example of Genome Assembly: Vitis vinifera
Pair-wise comparison of 6,200,000 reads
6,200,000C2 = 19,219,996,900,000 comparisons
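The comparison counts on these slides are the number of unordered read pairs, nC2 = n(n-1)/2. A small sketch that reproduces both figures; the function name `pairwise_comparisons` is an illustrative assumption.

```python
from math import comb


def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered read pairs, i.e. nC2 = n * (n - 1) / 2."""
    return comb(n_reads, 2)


for n in (23, 6_200_000):
    print(f"{n:>12,d} reads -> {pairwise_comparisons(n):,d} comparisons")
```

Because the count grows with the square of the number of reads, a read set 10 times larger needs roughly 100 times as many comparisons.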
Genome Assemblers
http://www.phrap.org/
Short-read Sequence Assembly (1)
- Short-read sequences generated by NGS machines cause several problems
for well-established genome assemblers.
- Too many reads require nearly unlimited computational power.
- Reads that are too short do not yield reliable overlaps for building long contigs.
- To reduce the computational power required, a new algorithm was needed.
- Short reads require another strategy to build reliable contig sequences.
- Handling so many sequences also caused several technical problems.
Short-read Sequence Assembly (2)
563,466,202 reads: 563,466,202C2 = 158,747,080,116,419,301 comparisons?!
De Bruijn Graph Algorithm
- This algorithm has been used to find overlaps between short-read
sequences quickly by combining k-mer sequences.
K = 3
GCAAAACACTT…
GCA
CAA
AAA
AAA
AAC
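The k-mer decomposition above can be sketched in a few lines. This is a minimal illustration of one common de Bruijn graph formulation (nodes are (k-1)-mers, each k-mer is an edge), not the implementation of any particular assembler; it uses only the visible prefix of the example sequence, and the function names are assumptions.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(seq: str, k: int) -> dict[str, list[str]]:
    """Each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmers(seq, k):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Only the visible prefix of the slide's example; the rest is elided there.
prefix = "GCAAAACACTT"
print(kmers(prefix, 3)[:5])
for node, successors in sorted(de_bruijn(prefix, 3).items()):
    print(node, "->", successors)
```

Building the graph takes one pass over the k-mers, so the work grows with the amount of sequence rather than with the number of read pairs.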
Genome Assemblers For Short Read Sequences
Examples of Plant Genome de novo Assembly
5,937,915,739 bp
# of contigs 950 ea
Total length 976,089 bp
Maximum length 12,606 bp
Average length 1,027.46 bp
N50 length 3,061 bp
Species name        # of contigs  Total length (bp)  Maximum length (bp)  Average length (bp)  N50 length (bp)
Lithocarpus hancei  462,868       112,614,098        1,748                243.30               237
Ficus altissima     355,052       87,502,701         1,090                246.45               239
Ficus altissima     132,590       33,293,636         1,688                251.10               245
Ficus altissima     247,376       61,369,608         1,334                248.08               241
Ficus tinctoria     337,777       87,427,716         1,274                258.83               248
Ficus tinctoria     476,937       116,554,688        1,578                244.38               3061
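The N50 length reported in these assembly statistics is the contig length at which the contigs of that length or longer together cover at least half of the total assembly. A minimal sketch of the calculation; the toy contig lengths are hypothetical and are not taken from the assemblies above.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths, for illustration only.
toy_contigs = [100, 200, 300, 400, 500]
print("total  :", sum(toy_contigs))
print("average:", sum(toy_contigs) / len(toy_contigs))
print("N50    :", n50(toy_contigs))
```

By construction, N50 is always one of the contig lengths, so it can never exceed the maximum contig length; that makes it a quick sanity check on a statistics table.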
Giant Panda Genome Project
Current Status of Plant Genome Projects
Species name Method Size (Mb) # of contigs # of transcripts
Arabidopsis lyrata WGS 206.67 695 32,670
Medicago truncatula BAC, WGS 278.69 9 38,334
Selaginella moellendorffii WGS 212.76 768 22,285
Lycopersicon esculentum WGS, BAC 794.60 7,409 49,389
Solanum phureja WGS 702.58 57,681 110,512
Ricinus communis WGS 362.47 28,518 38,613
Mimulus guttatus WGS 416.66 11,243 47,442
Manihot esculenta WGS 321.73 2,216 27,501
8 Unpublished Higher Plant Genomes
Species name                      Journal                            Method         Size (Mb)  # of contigs  # of transcripts
Arabidopsis thaliana              Nature, 2000                       BAC +α         119.19     5             32,615
Oryza sativa japonica             Science, 2002; Nature, 2005        BAC +α         372.08     12            66,710
Oryza sativa indica               Science, 2002; PLoS Biology, 2005  WGS            426.32     10,267        49,710
Oryza sativa japonica (Syngenta)  PLoS Biology, 2005                 WGS            391.14     7,777         45,824
Populus trichocarpa               Science, 2006                      WGS            485.51     22,012        45,555
Vitis vinifera                    Nature, 2007                       WGS, Complete  497.51     35            30,434
Carica papaya                     Nature, 2008                       WGS            369.69     17,677        28,589
Lotus japonicus                   DNA Research, 2008                 WGS            323.24     110,945       26,700
Sorghum bicolor                   Nature, 2009                       WGS            738.54     3,304         36,338
Zea mays                          Science, 2009                      BAC, WGS       2,061.02   11            53,764
Cucumis sativus                   Nature Genetics, 2009              WGS            243.57     47,488        26,682
Glycine max                       Nature, 2010                       WGS            996.90     4,262         62,199
Brachypodium distachyon           Nature, 2010                       WGS            273.27     197           32,255
13 Published Higher Plant Genomes from 11 Species
All pictures are from Wikipedia.
Species name                      Journal            Method  Size (Mb)  # of contigs  # of transcripts
Chlamydomonas reinhardtii         Science, 2007      WGS     112.31     88            16,709
Micromonas pusilla CCMP1545       Science, 2009      WGS     22.04      27            10,547
Micromonas sp. RCC299             Science, 2009      WGS     20.99      17            10,108
Ostreococcus lucimarinus CCE9901  PNAS, 2007         WGS     13.2       21            7,488
Ostreococcus sp. RCC809           Not published yet  WGS     13.41      22            7,492
Ostreococcus tauri                PNAS, 2007         WGS     12.58      118           7,725
Coccomyxa sp. C169                Not published yet  WGS     48.95      45            9,629
7 Unicellular Plant Genomes
Distribution of 28 Plant Genome Sizes
[Figure: bar chart of genome size (Mb) for the 28 genomes in ascending order, from 13 Mb to 2,061 Mb; unicellular plants are marked]
Distribution of Number of Transcripts of 21 Plants
[Figure: bar chart of # of transcripts per genome in ascending order, from 7,488 to 110,512; unicellular plants are marked]
Relationship between Genome Size and Transcripts
[Figure: # of transcripts / genome size (Mb) for each genome; unicellular plants are marked]
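The ratio plotted here, # of transcripts divided by genome size (Mb), can be recomputed directly from the genome tables earlier in the talk. A minimal sketch using a handful of rows copied from those tables; the selection of species and the function name are illustrative assumptions.

```python
# (genome size in Mb, # of transcripts), copied from the genome tables in this talk.
genomes = {
    "Arabidopsis thaliana": (119.19, 32_615),
    "Oryza sativa japonica": (372.08, 66_710),
    "Zea mays": (2_061.02, 53_764),
    "Chlamydomonas reinhardtii": (112.31, 16_709),
    "Ostreococcus tauri": (12.58, 7_725),
}


def transcripts_per_mb(size_mb: float, n_transcripts: int) -> float:
    """Transcript density: number of transcripts per megabase of genome."""
    return n_transcripts / size_mb


ranked = sorted(genomes.items(), key=lambda item: transcripts_per_mb(*item[1]), reverse=True)
for name, (size_mb, n_transcripts) in ranked:
    print(f"{name:28s} {transcripts_per_mb(size_mb, n_transcripts):8.2f} transcripts/Mb")
```

In this small selection the compact unicellular genome has the highest transcript density and the largest genome has the lowest.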
Comparisons with Genomes in Other Kingdoms
Group         # of species  Genome size (Mb)
Streptophyta  21            505
Chlorophyta   8             35
Chordata      48            2556
Arthropoda    31            302
Oomycetes     6             103
Fungi         244           23
Cucumber Genome Sequences
Plant Omics Resources (POR)
Plant Genome Resources: Phytozome
http://www.phytozome.org/
Plant Genome Resources: PlantGDB
http://www.plantgdb.org/
National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
Pros and Cons of Current Repositories
- Each plant genome repository has its own strategy for managing and
presenting plant genome information.
- All three repositories contain additional information, such as ESTs, genetic
maps, and mutant libraries.
- All three databases provide bioinformatics tools for further analyses, such
as BLAST search.
- Visualization tools, such as genome browsers, are also provided.
- However, the range of plant genomes and their versions differ slightly
among the repositories.
- Additionally, few bioinformatics tools are offered beyond
BLAST search.
Comparative Fungal Genomics Platform
Park et al. (2008) CFGP: Comparative Fungal Genomics Platform. Nucleic Acids Research, 36, D562-D571.
CFGP: Three layered Structure
- Standardized Fungal Genome Data Warehouse
- Middleware
- Data-driven User Interface
[Figure: chart with the groups Bacteria, Eukaryotics, Fungi, Arthropoda, Metazoa, Viridiplantae]
CFGP: Middleware – Fungal Matrix
- High-performance computing power is required for annotating a large number of
sequences.
11 DB servers
47 PC nodes
6 Web servers
8 Other servers
72 computers
CFGP: Data-driven User Interface
[Figure: Favorite Frame and Application Frame]
3 BLAST search related tools
BLAST
BLAST2
BLASTMatrix
1 Functional domain searching tool
InterPro Scan
6 Phylogenetic analysis tools
ClustalW
DNAML
PROML
DNAPARS
PROTPARS
PHYML
5 Secretory prediction tools
SignalP 3.0
SigCleave
SigPred
RPSP
SecretomeP
3 Subcellular localization prediction tools
PSort2
ChloroP
TargetP
4 Post translational modification prediction tools
NetCGlyc
NetNGlyc
NetOGlyc
NetPhos
4 Other tools
MEME
tRNAScan-SE
mFold
TMHMM2
CFGP: Favorite: Bioinformatics Workbench
These sequences can be stored into the Favorite again for further analyses.
CFGP: Iterative Analyses with the Favorite
[Figure: Fungal Kingdom, Insect, Plants, Human]
CFGP: BLASTMatrix
CFGP: Integration of SNU Genome Browser
http://atmt.snu.ac.kr/
http://tdna.snu.ac.kr/
http://www.phytophthoradb.org/
CFGP standardized genome data warehouse
http://genomebrowser.snu.ac.kr/
…
CFGP: Platform for Diverse Bioinformatics System
Plant Omics Resources (POR)
- Standardized Plant Genomes Data Warehouse: 28 plant and 7 unicellular plant genomes
- Standardized Plant EST Database: 231 plant EST datasets
- Plant Genome Assembly Database: 9 plant genomes from WGS
- 26 Bioinformatics Tools
Summary and Take Home Messages
- Next generation sequencing technologies have produced many genome
sequences at low cost and with a huge amount of sequence data.
- De novo assembly of short-read sequences required a new algorithm (de
Bruijn graph) for generating contig sequences.
- Currently, at least 28 plant and 7 unicellular plant genomes are published or
available.
- A new plant genome repository that provides an integrated environment for
bioinformatics and comparative genomics is needed.
Acknowledgements
Yong-Hwan Lee, Professor in Fungal Plant Pathology Lab.
and Fungal Bioinformatics Lab.
Jaeyoung Choi, MS-PhD student
Donghan Kim, MS student
Kyeong-chae Cheong, MS student
Kyongyong Jung, MS-PhD student
Fungal Bioinformatics Lab.
Dr Kang’s Lab. in Pennsylvania State University
Seogchan Kang, Professor in plant pathology in Pennsylvania State University
Bongsoo Park, PhD student
Wonho Song, Undergraduate student
Kyohoon Ahn, Undergraduate student
Seungmin Lee, Undergraduate student
Jaejin Park, MS-PhD student
Seryun Kim, Master
Sunghyung Kong, MS-PhD student
Doil Choi, Professor in Plant Genomics Laboratory.
Tae-Jin Yang, Professor in Industrial and medical crop
genomics and biotechnology.
Seungill Kim, MS-PhD student
Junki Lee, MS-PhD student
Thank you for your attention!