Upload
lykhue
View
241
Download
1
Embed Size (px)
Citation preview
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Too Much Data - Cautionary Tales of Next-generation
and Next-next Generation Sequencing
Dave Ussery
Workshop on Comparative GenomicsKing Mongkut's University of Technology ThonburiBangkok, Thailand
1rst Talk for Tuesday8 March, 2010
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
www.cbs.dtu.dk
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Outlinehttp://www.cbs.dtu.dk/courses/thaiworkshop2010/
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Subject: What's new for '"complete genome"' in PubMedFrom: My NCBI <[email protected]>Date: 7 March, 2010 1:20:10 AM GMT+07:00To: Dave Ussery <[email protected]>
Sender's message:Sent on Saturday, 2010 Mar 06Search "complete genome" Click here to view complete results in PubMed. (Results may change over time.)To unsubscribe from these e-mail updates click here.
PubMed ResultsItems 1 - 12 of 12
2. Genomic Structure of an Economically Important Cyanobacterium, Arthrospira (Spirulina) platensis NIES-39.
Fujisawa T, Narikawa R, Okamoto S, Ehira S, Yoshimura H, Suzuki I, Masuda T, Mochimaru M, Takaichi S, Awai K, Sekine M, Horikawa H, Yashiro I, Omata S, Takarada H, Katano Y, Kosugi H, Tanikawa S, Ohmori K, Sato N, Ikeuchi M, Fujita N, Ohmori M.
DNA Res. 2010 Mar 4. [Epub ahead of print]
PMID: 20203057 [PubMed - as supplied by publisher]
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Genomic Structure of an Economically Important Cyanobacterium,Arthrospira (Spirulina) platensis NIES-39
TAKATOMO Fujisawa1, REI Narikawa2, SHINOBUOkamoto3, SHIGEKI Ehira4, HIDEHISAYoshimura2, IWANE Suzuki5,TATSURU Masuda2, MARI Mochimaru6, SHINICHI Takaichi7, KOICHIRO Awai8, MITSUO Sekine1,HIROSHI Horikawa1, ISAO Yashiro1, SEIHA Omata1, HIROMI Takarada1, YOKO Katano1, HIROKI Kosugi1,SATOSHI Tanikawa1, KAZUKO Ohmori9, NAOKI Sato2, MASAHIKO Ikeuchi2, NOBUYUKI Fujita1,*,and MASAYUKI Ohmori4
Bioresource Information Center, Department of Biotechnology, National Institute of Technology and Evaluation(NITE), 2-10-49 Nishihara, Shibuya-ku, Tokyo 151-0066, Japan1; Department of Life Sciences (Biology), TheUniversity of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan2; Database Center for Life Science,Research Organization of Information and Systems, 2-11-6 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan3;Department of Biological Sciences, Faculty of Science and Engineering, Chuo University, 1-13-27 Kasuga,Bunkyo-ku, Tokyo 112-8551, Japan4; Graduate School of Life and Environmental Sciences, University of Tsukuba,Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8572, Japan5; Natural Science Faculty, Komazawa University,1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan6; Biological Laboratory, Nippon Medical School,Kosugi-cho 2, Nakahara-ku, Kawasaki 211-0063, Japan7; Division of Global Research Leaders, Shizuoka University,836 Ohya, Suruga-ku, Shizuoka 422-8529, Japan8 and Department of Life Sciences, Showa Women’s University,1-7 Taishido, Setagaya-ku, Tokyo 154-8533, Japan9
*To whom correspondence should be addressed. Tel. !81 3-3481-1933. Fax. !81 3-3481-8424.E-mail: [email protected]
Edited by Katsumi Isono(Received 1 December 2009; accepted 11 January 2010)
AbstractA filamentous non-N2-fixing cyanobacterium, Arthrospira (Spirulina) platensis, is an important organism
for industrial applications and as a food supply. Almost the complete genome of A. platensis NIES-39 wasdetermined in this study. The genome structure of A. platensis is estimated to be a single, circular chromo-some of 6.8 Mb, based on optical mapping. Annotation of this 6.7 Mb sequence yielded 6630 protein-coding genes as well as two sets of rRNA genes and 40 tRNA genes. Of the protein-coding genes, 78%are similar to those of other organisms; the remaining 22% are currently unknown. A total 612 kb ofthe genome comprise group II introns, insertion sequences and some repetitive elements. Group Iintrons are located in a protein-coding region. Abundant restriction-modification systems were deter-mined. Unique features in the gene composition were noted, particularly in a large number of genes foradenylate cyclase and haemolysin-like Ca21-binding proteins and in chemotaxis proteins. Filament-specific genes were highlighted by comparative genomic analysis.Key words: cyanobacteria; Arthrospira; health supplement; genome; cAMP
1. Introduction
Cyanobacteria are prokaryotes that perform oxy-genic photosynthesis and constitute a large taxonomicgroup within the domain of eubacteria. Cyanobacteria
are divided morphologically (unicellular or filamen-tous) or functionally (N2-fixing and non-N2-fixing).Filamentous species are subdivided into those withand without a heterocyst which is a differentiationfrom vegetative cells for fixing nitrogen.1,2
# The Author 2010. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in anymedium, provided the original workis properly cited.
DNA RESEARCH pp. 1–19, (2010) doi:10.1093/dnares/dsq004
DNA Research Advance Access published March 4, 2010
Genomic Structure of an Economically Important Cyanobacterium,Arthrospira (Spirulina) platensis NIES-39
TAKATOMO Fujisawa1, REI Narikawa2, SHINOBUOkamoto3, SHIGEKI Ehira4, HIDEHISAYoshimura2, IWANE Suzuki5,TATSURU Masuda2, MARI Mochimaru6, SHINICHI Takaichi7, KOICHIRO Awai8, MITSUO Sekine1,HIROSHI Horikawa1, ISAO Yashiro1, SEIHA Omata1, HIROMI Takarada1, YOKO Katano1, HIROKI Kosugi1,SATOSHI Tanikawa1, KAZUKO Ohmori9, NAOKI Sato2, MASAHIKO Ikeuchi2, NOBUYUKI Fujita1,*,and MASAYUKI Ohmori4
Bioresource Information Center, Department of Biotechnology, National Institute of Technology and Evaluation(NITE), 2-10-49 Nishihara, Shibuya-ku, Tokyo 151-0066, Japan1; Department of Life Sciences (Biology), TheUniversity of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan2; Database Center for Life Science,Research Organization of Information and Systems, 2-11-6 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan3;Department of Biological Sciences, Faculty of Science and Engineering, Chuo University, 1-13-27 Kasuga,Bunkyo-ku, Tokyo 112-8551, Japan4; Graduate School of Life and Environmental Sciences, University of Tsukuba,Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8572, Japan5; Natural Science Faculty, Komazawa University,1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan6; Biological Laboratory, Nippon Medical School,Kosugi-cho 2, Nakahara-ku, Kawasaki 211-0063, Japan7; Division of Global Research Leaders, Shizuoka University,836 Ohya, Suruga-ku, Shizuoka 422-8529, Japan8 and Department of Life Sciences, Showa Women’s University,1-7 Taishido, Setagaya-ku, Tokyo 154-8533, Japan9
*To whom correspondence should be addressed. Tel. !81 3-3481-1933. Fax. !81 3-3481-8424.E-mail: [email protected]
Edited by Katsumi Isono(Received 1 December 2009; accepted 11 January 2010)
AbstractA filamentous non-N2-fixing cyanobacterium, Arthrospira (Spirulina) platensis, is an important organism
for industrial applications and as a food supply. Almost the complete genome of A. platensis NIES-39 wasdetermined in this study. The genome structure of A. platensis is estimated to be a single, circular chromo-some of 6.8 Mb, based on optical mapping. Annotation of this 6.7 Mb sequence yielded 6630 protein-coding genes as well as two sets of rRNA genes and 40 tRNA genes. Of the protein-coding genes, 78%are similar to those of other organisms; the remaining 22% are currently unknown. A total 612 kb ofthe genome comprise group II introns, insertion sequences and some repetitive elements. Group Iintrons are located in a protein-coding region. Abundant restriction-modification systems were deter-mined. Unique features in the gene composition were noted, particularly in a large number of genes foradenylate cyclase and haemolysin-like Ca21-binding proteins and in chemotaxis proteins. Filament-specific genes were highlighted by comparative genomic analysis.Key words: cyanobacteria; Arthrospira; health supplement; genome; cAMP
1. Introduction
Cyanobacteria are prokaryotes that perform oxy-genic photosynthesis and constitute a large taxonomicgroup within the domain of eubacteria. Cyanobacteria
are divided morphologically (unicellular or filamen-tous) or functionally (N2-fixing and non-N2-fixing).Filamentous species are subdivided into those withand without a heterocyst which is a differentiationfrom vegetative cells for fixing nitrogen.1,2
# The Author 2010. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in anymedium, provided the original workis properly cited.
DNA RESEARCH pp. 1–19, (2010) doi:10.1093/dnares/dsq004
DNA Research Advance Access published March 4, 2010
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
The genome sequence and annotation of A. platensis NIES-39 are available at GenBank/EMBL/DDBJ under accession no. AP011615
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
~40 students, with 7 genomes per student.
2. 16S rRNA tree
4. Core- and pan-genome plot
3. BLAST Matrix
5. BLAST Atlas
1. Table of related genomes finished and in GenBank
• What can be included in the presentations on Friday?
Term
inus
Origin
rrlH
rrlG
rrlDrrlBrrlA
rrlC
rrlE
0M0 .
5M
1M 1.5M
2M2.5M
3M
3.5M4M
E.co
li K-
12_W
3110
4,64
1,43
3 bp
! " # $ % & ' ( ) !* !! !" !# !$ !% !& !' !( !) "*
*!***
"***
#***
$***
%***
&***
'***
+,-./,0,1
+,-./,0,.234565,1
789,./,084,
:30./,084,!.;..734<=68>3?@,9.A,AB05."&)C)'
".;..734<=68>3?@,9.A,AB05.DE!""!
#.;..734<=68>3?@,9.A,AB05.7F($(&
$.;..734<=68>3?@,9.A,AB05.7G)#!&
%.;..734<=68>3?@,9.A,AB05.($!"%
&.;..734<=68>3?@,9.A,AB05.7F($"!
'.;..734<=68>3?@,9.A,AB05.(!!!'&H
(.;..734<=68>3?@,9.A,AB05.(!!!'&I
).;..734<=68>3?@,9.A,AB05."&*C)$
!*.;..734<=68>3?@,9.A,AB05.+7J7K!!!&(
!!.;..734<=68>3?@,9.A,AB05.LI)#!!#
!".;..734<=68>3?@,9.A,AB05.(!!!&
!#.;..734<=68>3?@,9.A,AB05.E!
!$.;..734<=68>3?@,9.?865.DE"""(
!%.;..734<=68>3?@,9.6395.DE"!**
!&.;..734<=68>3?@,9.?80?51B1.!#("&
!'.;..734<=68>3?@,9.2,@B1.("!$*
!(.;..734<=68>3?@,9.B<1365,0151.DE#!)%
!).;..734<=68>3?@,9.M845051.HJ77KIHH!#(!
"*.;..734<=68>3?@,9.?B9NB1.%"%C)"
3.7 %67 / 1,824
60.7 %1,383 / 2,280
59.3 %1,223 / 2,064
61.4 %1,367 / 2,226
62.0 %1,374 / 2,217
60.6 %1,271 / 2,099
59.2 %1,341 / 2,267
60.6 %1,335 / 2,203
60.0 %1,336 / 2,228
64.2 %1,357 / 2,113
59.8 %1,334 / 2,229
61.0 %1,321 / 2,167
60.5 %1,315 / 2,174
1.6 %29 / 1,799
68.5 %1,312 / 1,915
74.4 %1,512 / 2,031
74.9 %1,512 / 2,019
71.4 %1,382 / 1,935
68.8 %1,449 / 2,106
68.5 %1,416 / 2,068
73.8 %1,487 / 2,016
74.8 %1,460 / 1,953
70.0 %1,440 / 2,058
71.7 %1,428 / 1,991
69.9 %1,408 / 2,014
0.6 %8 / 1,414
74.4 %1,342 / 1,804
75.1 %1,346 / 1,793
76.4 %1,261 / 1,651
70.7 %1,306 / 1,847
71.5 %1,285 / 1,798
73.5 %1,316 / 1,790
80.6 %1,345 / 1,668
72.9 %1,306 / 1,792
74.8 %1,291 / 1,727
73.4 %1,278 / 1,742
1.3 %23 / 1,728
88.2 %1,616 / 1,832
74.6 %1,380 / 1,851
75.2 %1,487 / 1,977
75.0 %1,455 / 1,939
80.3 %1,524 / 1,897
88.0 %1,557 / 1,770
77.9 %1,493 / 1,916
76.6 %1,447 / 1,888
75.7 %1,437 / 1,898
1.4 %24 / 1,720
76.1 %1,393 / 1,830
76.4 %1,498 / 1,962
76.1 %1,465 / 1,924
76.6 %1,481 / 1,934
89.1 %1,564 / 1,755
77.8 %1,490 / 1,916
76.6 %1,444 / 1,884
76.0 %1,439 / 1,893
1.0 %15 / 1,495
73.7 %1,373 / 1,864
74.4 %1,351 / 1,816
73.9 %1,358 / 1,837
81.6 %1,395 / 1,709
74.7 %1,362 / 1,824
77.7 %1,360 / 1,750
76.5 %1,348 / 1,762
1.4 %25 / 1,728
84.6 %1,556 / 1,840
78.6 %1,506 / 1,917
79.5 %1,477 / 1,859
85.2 %1,570 / 1,842
79.0 %1,474 / 1,867
77.4 %1,458 / 1,884
1.4 %23 / 1,662
78.4 %1,473 / 1,880
80.2 %1,454 / 1,813
81.0 %1,496 / 1,848
80.2 %1,454 / 1,814
79.8 %1,450 / 1,818
1.3 %22 / 1,688
80.0 %1,463 / 1,828
81.1 %1,509 / 1,861
81.0 %1,474 / 1,819
79.9 %1,464 / 1,832
1.3 %21 / 1,596
81.0 %1,469 / 1,814
82.4 %1,447 / 1,757
80.4 %1,429 / 1,777
1.3 %21 / 1,676
82.2 %1,483 / 1,805
81.0 %1,471 / 1,815
1.3 %21 / 1,598
92.3 %1,536 / 1,664
1.2 %19 / 1,601
C. jejuni doylei 267.97
PID 17163, length 1,845,106 nt
1,911 proteins, 1,824 families
C. jejuni RM1221
PID 303, length nt
1,838 proteins, 1,799 families
C. jejuni CG8486
PID 17055, length nt
1,425 proteins, 1,414 families
C. jejuni CF93-6
PID 16265, length nt
1,756 proteins, 1,728 families
C. jejuni 84-25
PID 16367, length nt
1,748 proteins, 1,720 families
C. jejuni CG8421
PID 21037, length nt
1,512 proteins, 1,495 families
C. jejuni 81-176A
PID 16135, length nt
1,758 proteins, 1,728 families
C. jejuni 81-176B
PID 17341, length nt
1,690 proteins, 1,662 families
C. jejuni 260.94
PID 16229, length nt
1,716 proteins, 1,688 families
C. jejuni NCTC 11168
PID 8, length 1,641,481 nt
1,624 proteins, 1,596 families
C. jejuni HB93-13
PID 16267, length nt
1,708 proteins, 1,676 families
C. jejuni 81116
PID 17953, length nt
1,626 proteins, 1,598 families
C. jejuni M1
PID ???, length nt
1,627 proteins, 1,601 families
C. jeju
ni doy
lei 2
67.9
7
PID 1
7163
, len
gth 1
,845
,106
nt
1,91
1 pro
tein
s, 1,
824
fam
ilies
C. jeju
ni RM
1221
PID 3
03, l
ength
nt
1,83
8 pro
tein
s, 1,
799
fam
ilies
C. jeju
ni CG84
86
PID 1
7055
, len
gth n
t
1,42
5 pro
tein
s, 1,
414
fam
ilies
C. jeju
ni CF93
-6
PID 1
6265
, len
gth n
t
1,75
6 pro
tein
s, 1,
728
fam
ilies
C. jeju
ni 84-
25
PID 1
6367
, len
gth n
t
1,74
8 pro
tein
s, 1,
720
fam
ilies
C. jeju
ni CG84
21
PID 2
1037
, len
gth n
t
1,51
2 pro
tein
s, 1,
495
fam
ilies
C. jeju
ni 81-
176A
PID 1
6135
, len
gth n
t
1,75
8 pro
tein
s, 1,
728
fam
ilies
C. jeju
ni 81-
176B
PID 1
7341
, len
gth n
t
1,69
0 pro
tein
s, 1,
662
fam
ilies
C. jeju
ni 260
.94
PID 1
6229
, len
gth n
t
1,71
6 pro
tein
s, 1,
688
fam
ilies
C. jeju
ni NCTC 1
1168
PID 8
, len
gth 1
,641
,481
nt
1,62
4 pro
tein
s, 1,
596
fam
ilies
C. jeju
ni HB93
-13
PID 1
6267
, len
gth n
t
1,70
8 pro
tein
s, 1,
676
fam
ilies
C. jeju
ni 811
16
PID 1
7953
, len
gth n
t
1,62
6 pro
tein
s, 1,
598
fam
ilies
C. jeju
ni M1
PID ??
?, le
ngth n
t
1,62
7 pro
tein
s, 1,
601
fam
ilies
Homology within proteomes
3.7 %0.6 %
Proteome comparison of Campylobacter proteomesConserved protein families
Homology between proteomes
92.3 %59.2 %
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Outline
• The problem - too much data!• A brief history - The speed of sequencing• Cautionary tales• Some approaches to handle this....
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Technology
The data delugeBusinesses, governments and society are only starting to tap its vast potentialFeb 25th 2010 | From The Economist print edition
EIGHTEEN months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each
day. Now the amount has increased tenfold. During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years’
worth of video footage. New models being deployed this year will produce ten times as many data streams as their predecessors, and those in
2011 will produce 30 times as many.
Everywhere you look, the quantity of information in the world is soaring. According to one estimate, mankind created 150 exabytes (billion
gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is
difficult enough. Analysing it, to spot patterns and extract useful information, is harder still. Even so, the data deluge is already starting to
transform business, government, science and everyday life (see our special report in this issue). It has great potential for good—as long as
consumers, companies and governments make the right choices about when to restrict the flow of data, and when to encourage it.
1. The problem - too much data!
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
1. The problem - too much data!
27 February, 2010 | From The Economist print edition
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
How to visualize lots of data....
In Nature this week, features and opinion pieces on one of the most daunting challenges facing modern science: how to cope with the flood of data now being generated. A petabyte is a lot of memory, however you say it - a quadrillion, 1015, or tens of thousands of trillions of bytes. But that is the currency of 'big data'. We visited the Sanger Institute's supercomputing centre, and its petabyte of capacity. [News Feature p. 16]
Nature podcast
Volume 455 Number 7209 pp1-136
4 September, 2008
1. The problem - too much data!
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Three Current “next-generation” technologies:
1. The problem - too much data!
1. illumina (aka “Solexa”) - 500 million reads (100 bp )
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
1. The problem - too much data!
Three Current “next-generation” technologies:
2. Roche 4541. illumina (aka “Solexa”) - 500 million reads (100 bp )
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Applied Biosystems® SOLiD™ 4 System
SPECIFICATION SHEET
See the DifferenceThe SOLiD™ 4 System enables you to obtain more high-quality sequence at a lower cost per run. New optimized reagents and algorithms provide more uniform coverage across the genome and result in higher throughput and accuracy for all applications. Accelerate your time to results with automated workflows, intelligent barcoding designs, and the broadest portfolio of application-specific kits and analysis tools. With the SOLID™ 4 System, you have the throughput and accuracy to cost-effectively discover causative variation—you have the Quality Genome.
Key Benefits
Higher accuracy—detection of causative variation enabled at lower coverage and cost per sample
Scalable throughput on a single platform—80–100 GB of mappable sequence per run
Automated workflow—80% reduction in hands-on time and increased reproducibility in yield allow for significant time and labor savings
True paired-end sequencing—bidirectional sequencing facilitates detection of genetic alterations as well as splice variants and fusion transcripts with lower sample input
Robust multiplexing kits—intelligent barcode strategy enables accurate assignment without introduction of bias
Sample-to-results application support—additional application-specific kits and flexible analysis framework for optimized end-to-end application-specific workflows
Unrivaled support—over 800 dedicated service and support specialists as well as a catalog of in-depth chemistry and bioinformatics courses available
Experience Peace of Mind The SOLiD™ System’s open slide format and flexible bead densities continue to yield increases in throughput on the same platform with minor upgrades. The SOLiD™ 4 System can generate up to 100 Gb of mappable sequence or greater than 1.4 billion reads per run. Discover the peace of mind provided by the confidence that you will benefit from future technology advances without the purchase of a new system.
SOLiD™ 4S Y S T E M S E Q U E N C I N G
1. The problem - too much data!
Three Current “next-generation” technologies:
2. Roche 454 - > 1 million reads (1000 bp)1. illumina (aka “Solexa”) - 500 million reads (100 bp )
3. ABI SOLiD
~100 Gbp per run!35 bp reads
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Genome Research, Jan 2009The new paradigm of flow cell sequencingRobert A. Holt1 and Steven J.M. JonesBritish Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4E6, Canada
DNA sequencing is in a period of rapid change, in which capillary sequencing is no longer the technology of choicefor most ultra-high-throughput applications. A new generation of instruments that utilize primed synthesis in flowcells to obtain, simultaneously, the sequence of millions of different DNA templates has changed the field. Wecompare and contrast these new sequencing platforms in terms of stage of development, instrument configuration,template format, sequencing chemistry, throughput capability, operating cost, data handling issues, and errormodels. While these platforms outperform capillary instruments in terms of bases per day and cost per base, theshort length of sequence reads obtained from most instruments and the limited number of samples that can be runsimultaneously imposes some practical constraints on sequencing applications. However, recently developed methodsfor paired-end sequencing and for array-based direct selection of desired templates from complex mixtures extendthe utility of these platforms for genome analysis. Given the ever increasing demand for DNA sequence information,we can expect continuous improvement of this new generation of instruments and their eventual replacement byeven more powerful technology.
Since the establishment of DNA as hereditary material and theelucidation of its structure, there has been insatiable demand forsequence information and remarkable innovation in the meth-ods used to obtain it. Like many technologies, DNA sequencinghas advanced by punctuated equilibrium, where a new approachto sequencing is introduced, adopted, and improved upon incre-mentally for some period of time, then replaced by the nextwave. The very earliest sequencing techniques involved varia-tions on the theme of cleavage of short polynucleotides and sub-sequent identification by their migration characteristics usingtwo-dimensional paper chromatography. Using this approach itwas possible to infer short sequences, such as that of the Esche-richia coli lac operon (Gilbert and Maxam 1973), and it was fea-sible at the time to report the data from an entire sequencingproject in a paper’s abstract. A transition of major significancewas spearheaded by the Sanger group in the mid 1970s, whenthey introduced the notion of using primed template replicationby polymerase and separation of the extension products by gelelectrophoresis (Sanger and Coulson 1975) to obtain DNA se-quence information. Modifying this approach to allow base-specific chain termination by di-deoxy nucleotides (Sanger et al.1977) laid the foundation for sequencing for the next 30 yr.Further incremental improvements during this time included us-ing fluorescent rather than radiolabeled terminators, separationon acrylamide matrices in capillaries rather than slab gels, and,ultimately, the deployment of mechanized production lines fortemplate preparation and devices for automated generation andreading of sequence ladders. This industrial approach to sequenc-ing spawned the modern era of genomics and has provided anarchive of complete reference genome sequences. Yet demand forDNA sequence is undiminished and we find ourselves in a newperiod of rapid change. If the hallmark of the past paradigm waselectrophoretic separation of terminated DNA chains, then thehallmark of new paradigm is flow cell sequencing, with stepwisedetermination of DNA sequence by iterative cycles of nucleotideextensions done in parallel on massive numbers of clonally am-plified template molecules. If one takes the broad view of a flow
cell as a reaction chamber that contains template tethered to asolid support, to which nucleotides and ancillary reagents areiteratively applied and washed away, then the new instrumentson the market (the Roche GS-FLX, the Illumina 1G analyzer, andthe Applied Biosystems SOLiD) are all flow cell sequencers (as areinstruments anticipated in the near future such as the HelicosHeliScope and the Danaher Polonator). Massively parallel ap-proaches using flow cells allow DNA to be sequenced markedlyfaster and cheaper than ever before. This means that lines ofscientific inquiry that once were prohibitively expensive are nowfeasible, and this is good because there is much to explore. Forexample, human genome sequences have been compiled but rep-resent a miniscule proportion of the !100 million kilograms ofhuman DNA that is on the planet on any given day. It is certainthat novel template from the biosphere will continue to driveconsecutive waves of innovation in sequencing technology forsome time to come.
The technology
Templates and sequencing chemistries
While all of the latest commercial sequencing instruments useflow cells and massive parallelization to increase sequencing ca-pacity, the specifics of template preparation, sequencing chem-istry, and flow cell configuration differ among the platforms.There is often a misconception that the new generation of se-quencers perform sequencing on single molecules. In fact, allcurrently available platforms (the Roche GS-FLX, Illumina 1Ganalyzer, and the Applied Biosystems SOLiD) require PCR-basedamplification of fragmented template DNA to obtain sufficientsignal for base calling. However, these methods utilize a singleDNA molecule as the initial substrate for amplification allowingeach sequenced molecule to represent a single haplotype. Thishas proven to be useful for robust polymorphism detection par-ticularly in cancer-derived material, where associated normal tis-sue may obscure heterozygote calls using traditional Sanger se-quencing of PCR products. As discussed further below, the in-strument being developed by Helicos stays with the singlemolecule throughout analysis.
1Corresponding author.E-mail [email protected]; fax (604) 877-6085.Article is online at http://www.genome.org/cgi/doi/10.1101/gr.073262.107.
Next-Generation DNA Sequencing/Review
18:839–846 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org Genome Research 839www.genome.org
Cold Spring Harbor Laboratory Press on January 3, 2009 - Published by genome.cshlp.orgDownloaded from
"Indeed, any of these new machines running at full capacity for a year will generate more sequence than existed in the whole of NCBI at the beginning of 2008. Analysis of the sequence data has rapidly become the limiting step and will likely become the most expensive part. The sheer volume of data will provide challenges in processing, networking, storage, and analysis of the flow-cell images just to provide the initial base calling." after Holt & Jones, 2009
Sanger Center has 28 Solexa machines, 8 ABI Solids, 2 Roche 454 machines
>1000 teraBytes per month!
1. The problem - too much data!
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
1. The problem - too much data!
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
The Human Genome Project
Started more than 20 years ago (~1985)
The U.S. government agreed to invest $200,000,000 U.S. per year for 20 years.
One base per second = 216 years!
~3,400,000,000 bp per haploid genome ~6,800,000,000 bp per diploid genome
2. A brief history - The speed of sequencing
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
1. “First Human Genome”$3,000,000,000 + 15 years
2. Celera genome (a.k.a. J. Craig Venter)$100,000,000 + 0.75 years (9 months)
3. Jim Watson’s genome $900,000 + 0.17 years (2 months)
4. John Doe's genome $1,000 + 0.0002 years (0.1 day)
5. "next next-generation" machines•Helicos Biosystems machine can sequence human genome in 1 hour (2009).
•Pacific Biosciences machine can sequence human genome in 4 minutes (2010).
•Omni Molecular Recognizer Application - human genome less than $1, <1 minute.
2. A brief history - The speed of sequencing
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Num
ber
Gen
omes
in N
CBI
web
pag
es
Year
0
500
1,000
1,500
2,000
2,500
3,000
3,500
4,000
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Bacteria Archaea total published Unfinished total
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
as of 21 Jan, 2009as of 4 March, 20103630
2226
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
as of 4 March, 2010
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
3. Cautionary tales
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
4 1 Sequences as Biological Information
organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.
From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of
Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance
BACTERIA
ARCHAEA
EUCARYA
Unicellulareukaryotes
Animals Plants
Macro-organisms
Protozoans
Flav
obac
teriu
m
Crenarchaeota
EuryarchaeotaChlamydiae
Cyanobacteria
Pro
teob
acte
ria
Act
inob
acte
ria
Chlorobi
Clostridium
Bacillus
Chloroflexi
Acidobacteria
Giardia
Saccharomyces
Trypanosoma
Slime mold
Babesia
Aquifi
cae
Therm
otoga
Thermus
Deinoco
ccus
Firmicutes
Bacteroidetes
Spirochaetes
Pla
ncto
myc
etes
3. Cautionary tales
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Archaea
Firmicutes
Spirochetesγ-Proteo-bacteria
E. coli
Eukaryotes
!"#
Color ranges:
$%&'()*+'
,(-.'/'
0'-+/(1'
21'(31'45'6751'
8/19.6':1'46';*(
<.'5'991*
91('4=9/%3*:':'
>()=+*9=*(131%64.*61:19
?5'96*31%64@'5-1='(%6
A()B'49'+1C'
,('713*=9194+.'51':'
>)':131*9-.)B*:46/(*5'/
D1-+)*9+/51%64319-*13/%6
2'55%94E'55%9
F%946%9-%5%9
G'++%94:*(C/E1-%9
H*6*49'=1/:9
?':4+(*E5*3)+/9
<'&1@%E%4(%7(1=/9
D':1*4(/(1*
D(*9*=.15'46/5':*E'9+/( ,
:*=./5/94E'671'/
>'/:*(.'731+194/5/E':9
>'/:*(.'731+1947(1EE9'/
I-.1B*9'--.'(*6)-/94=*67/
$(/6*+./-1%64E*99)=11
I'--.'(*6)-/94-/(/C191'/
?)(*7'-%5%64'/(*=.15%6
,/(*=)(%64=/(:1J
I%5@*5*7%94+*&*3'11
I%5@*5*7%949*5@'+'(1-%9
K':*'(-.'/%64/L%1+':9
<./(6*=5'96
'4'-13*=.15%6
<./(6*=5'96
'4C*5-':1%6
?)(*-*--%94@%(1*9%9
?)(*-*--%94'7)991
?)(*-*--%94.*(1&*9.11
F/+.':*=)(%94&':35/(1
F/+.':*7'-+/(1%6
4+./(6'%+*+(*=.1-%6
F/+.':*-*--%94;'::'9-.11
F/+.':*-*--%946'(1='5%319
,(-.'/*E5*7%94@%5E13%9
F/+.':*9'(-1:'4'-/+1C*(':9
F/+.':*9'(-1:'46
'B/1
H'5*7'-+/(1%6
49="4KG>!#
<./(6*':'/(*7'-+/(4+/:E-*:E/:919
>5*9+(131%64'-/+*7%+)51-%6
>5*9+(131%64+/+':1
>5*9+(131%64=/(@(1:E/:9
I+'=.)5*-*--%94'%(/%94FMN
I+'=.)5*-*--%94'%(/%94KO#P
I+'=.)5*-*--%94'%(/%94F%P!
I+'=.)5*-*--%94/=13/(61319
819+/(1'41::*-%'
819+/(1'46*:*-)+*E/:/94QNORP
819+/(1'46*:*-)+*E/:/94$2D
0'-155%949%7+1519
0'-155%94':+.('-19
0'-155%94-/(/%94,<>>4#!STU
0'-155%94-/(/%94,<>>4#VPUS
0'-155%94.'5*3%(':9
A-/':*7'-155%941./)/:919
$:+/(*-*--%94@'/-'519
8'-+*-*--%945'-+19
I+(/=+*-*--%94=:/%6
*:1'/4GR
I+(/=+*-
*--%94=
:/%6*:
1'/4<W2
GV
I+(/=+*-
*--%94'E
'5'-+1'/4
WWW
I+(/=+*-*-
-%94'E'5'-
+1'/4X
I+(/=+*-*--
%94=)*E/:/9
4F#
I+(/=+*-*--%9
4=)*E/:/94F2
,ITNON
I+(/=+*-*--%94=
)*E/:/94F2,IO
#P
I+(/=+*-*--%94=)*E/
:/94IIW!#
I+(/=+*-
*--%946
%+':9
8'-+*7'-155%94=5':+'(%6
8'-+*7'-155%94;*.:9*:11
?.)+*=5'96'4A:1*:4)/55*Y9
F)-*=5'96'46)-*13/9
F)-*=5'96'46*715/
F)-*=5'96'4=%56*:19
Z(/'=5'96'4='(C%6
F)-*=5'96'4=/:/+(':9
F)-*=5'96'4E'5519/=+1-%6
F)-*=5'96'4=:/%6*:1'/
F)-*=5'96'4E/:1+'51%6
Q17(*7'-+/(49%--1:*E/:/9
>.5*(*71%64+/=13%6
?*(=.)(*6*:'94E1:E1C'519
0'-+/(*13/94+./+'1*+'*61-(*:
>.5'6)31'46%(13'(%6
>.5'6)31'4+('-.*6'+19
>.5'6)3*=.15'4-'C1'/
>.5'6)3*=.15'4=:/%6*:1'/4<M#TO
>.5'6)31'4=:/%6*:1'/4[#OT
>.5'6)31'4=:/%6*:1'/4>M8!NS>.5'6)31'4=:/%6*:1'/4,GOS
2/66'+'4*79-%(1E5*7%9G.*3*=1(/55%5'47'5+1-'
8/=+*9=1('41:+/((*E':948#!#O!8/=+*9=1('41:+/((*E':94PRR!#
0*((/51'47%(E3*(@/(1<(/=*:/6'43/:+1-*5'<(/=*:/6'4='5513%6
I+(/=+*6)-/94-*/51-*5*(
I+(/=+*6)-/94'C/(6
1+1519
F)-*7'-+/(1%6
4='('+%7/(-%5*919
F)-*7'-+/(1%6
4+%7/(-%5*9194>D>#PP#
F)-*7'-+/(1%6
4+%7/(-%5*9194HOUGC
F)-*7'-+/(1%6
47*C19
F)-*7'-+/(1%6
45/=('/
>*():/7'-+/(1%6431=.+./(1'/
>*():/7'-+/(1%64/@@1-1/:9
>*():/7'-+/(1%6
4E5%+'61-%6
>*():/7'-+/(1%6
4E5%+'61-%6
4#O!ON
01@13*7'-+/(1%645*:E%6
<(*=./()6'4Y.1==5/14<M!T\NU
<(*=./()6'4Y.1==5/14<Y19+
Q%9*7'-+/(1%6
4:%-5/'+%6
<./(6
*+*E'46'(1+16
'
,L%1@/J4'/*51-%9
D/.'5*-*--*13/94/+./:*E/:/9
<./(6%94+./(6*=.15%9
D/1:*-*--%94('31*3%(':9
25*/*7'-+/(4C1*
5'-/%9
I):/-.*-*--%94/5*:E'+%9
K*9+*-49=
"4?>>4U#N!
I):/-.*-)9+1949="4?>>RT!O
?(*-.5*(*-*--%946'(1:%94II#N!
?(*-.5*(*-*--%946'(1:%94FW<SO#O
I):/-.*-*--%949="4MHT#!N
?(*-.5*(*-*--%946'(1:%94>>F?#OUT
,-13*7'-+/(1%64-'=9%5'+%6
I*517'-+/(4%91+'+%9
D/9%5@*C17(1*4C%5E'(19
2/*7'-+/(49%5@%((/3%-/:9
03/55*C17(1*47'-+/(1*C*(%9
>'6=)5*7'-+/(4;/;%:1
M*51:/55'49%--1:*E/:/9
H/51-*7'-+/(4./='+1-%9
H/51-*7'-+/(4=)5*(14NRRSP
H/51-*7'-+/(4=)5*(14[SS
>'%5*7'-+/(4-(/9-/:+%9
G.1B*71%646/515*+1
,E(*7'-+/(1%64+%6/@'-1/:94>/(/*:
,E(*7'-+/(1%64+%6/@'-1/:94M'9.Z
0(%-/55'49%19
0(%-/55'46/51+/:919
G.1B*71%645*+1
G.*3*=9/%3*6*:'94='5%9+(19
0('3)(.1B*71%64;'=*:1-%6
G1-&/++91'4-*:*(11
G1-&/++91'4=(*Y'B/&11
M*57'-.1'49="4YF/5
K1+(*9*6*:'94/%(*='/'
>.(*6*7'-+/(1%64C1*5'-/%6
K/199/(1'46/:1:E1+131940
K/199/(1'46/:1:E1+13194,
G'59+*:1'49*5':'-/'(%6
0*(3/+/55'4=/(+%9919
0*(3/+/55'47(*:-.19/=+1-'
0*(3/+/55'4='('=/(+%9919
>*J1/55'47%(:/+11
]':+.*6*:'94-'6=/9+(19
]':+.*6*:'94'J*:*=*319
])5/55'4@'9+131*9'4S'P-
])5/55'4@'9+131*9'4U!!SRV
?9/%3*6*:'94'/(%E1:*9'
?9/%3*6*:'94=%+13'
?9/%3*6*:'949)(1:E'/
I./Y':/55'4*:/13/:919
?.*+*7'-+/(1%64=(*@%:3%6
X17(1*4-.*5/('/
X17(1*4C%5:1@1-%94^[!#R
X17(1*4C%5:
1@1-%94>F>?R
X17(1*4='('.'
/6*5)+1-%9
?'9+/%(
/55'46%5
+*-13'
H'/6*=
.15%941:@5%
/:B'/
H'/6*=
.15%943%
-(/)1
I'56*:/55'4+)=.16%(1%6
I'56*:/55'4/:+/(1-'
I'56*:/55'4+)=.1
$9-./(1-.1'4-*514$D8SOO
$9-./(1-.1'4-*514A#PU_HU
$9-./(1-.1'4-*514AR
$9-./(1-.1'4-*514`#N
I.1E/55'4@5/J:/(14N'4NVPU<
I.1E/55'4@5/J:/(14N'4O!#
^/(91:1'4=/9+194F/31/C'519
^/(91:1'4=/9+194`WF
^/(91:1'4=/9+194>ASN
?.*+*(.'73%945%61:/9-/:9
0%-.:/('4'=.13
1-*5'4,?I
0%-.:/('4'=.131-*5'
4IE
0%-.:/('4'=
.131-*5'40=
05*-.6'::
1'4@5*(13':
%9
M1EE5/9Y*
(+.1'47(/C1='
5=19
You are here
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
!"#
Color ranges:
$%&'()*+'
,(-.'/'
0'-+/(1'
21'(31'45'6751'
8/19.6':1'46';*(
<.'5'991*
91('4=9/%3*:':'
>()=+*9=*(131%64.*61:19
?5'96*31%64@'5-1='(%6
A()B'49'+1C'
,('713*=9194+.'51':'
>)':131*9-.)B*:46/(*5'/
D1-+)*9+/51%64319-*13/%6
2'55%94E'55%9
F%946%9-%5%9
G'++%94:*(C/E1-%9
H*6*49'=1/:9
?':4+(*E5*3)+/9
<'&1@%E%4(%7(1=/9
D':1*4(/(1*
D(*9*=.15'46/5':*E'9+/( ,
:*=./5/94E'671'/
>'/:*(.'731+194/5/E':9
>'/:*(.'731+1947(1EE9'/
I-.1B*9'--.'(*6)-/94=*67/
$(/6*+./-1%64E*99)=11
I'--.'(*6)-/94-/(/C191'/
?)(*7'-%5%64'/(*=.15%6
,/(*=)(%64=/(:1J
I%5@*5*7%94+*&*3'11
I%5@*5*7%949*5@'+'(1-%9
K':*'(-.'/%64/L%1+':9
<./(6*=5'96
'4'-13*=.15%6
<./(6*=5'96
'4C*5-':1%6
?)(*-*--%94@%(1*9%9
?)(*-*--%94'7)991
?)(*-*--%94.*(1&*9.11
F/+.':*=)(%94&':35/(1
F/+.':*7'-+/(1%6
4+./(6'%+*+(*=.1-%6
F/+.':*-*--%94;'::'9-.11
F/+.':*-*--%946'(1='5%319
,(-.'/*E5*7%94@%5E13%9
F/+.':*9'(-1:'4'-/+1C*(':9
F/+.':*9'(-1:'46
'B/1
H'5*7'-+/(1%6
49="4KG>!#
<./(6*':'/(*7'-+/(4+/:E-*:E/:919
>5*9+(131%64'-/+*7%+)51-%6
>5*9+(131%64+/+':1
>5*9+(131%64=/(@(1:E/:9
I+'=.)5*-*--%94'%(/%94FMN
I+'=.)5*-*--%94'%(/%94KO#P
I+'=.)5*-*--%94'%(/%94F%P!
I+'=.)5*-*--%94/=13/(61319
819+/(1'41::*-%'
819+/(1'46*:*-)+*E/:/94QNORP
819+/(1'46*:*-)+*E/:/94$2D
0'-155%949%7+1519
0'-155%94':+.('-19
0'-155%94-/(/%94,<>>4#!STU
0'-155%94-/(/%94,<>>4#VPUS
0'-155%94.'5*3%(':9
A-/':*7'-155%941./)/:919
$:+/(*-*--%94@'/-'519
8'-+*-*--%945'-+19
I+(/=+*-*--%94=:/%6
*:1'/4GR
I+(/=+*-
*--%94=
:/%6*:
1'/4<W2
GV
I+(/=+*-
*--%94'E
'5'-+1'/4
WWW
I+(/=+*-*-
-%94'E'5'-
+1'/4X
I+(/=+*-*--
%94=)*E/:/9
4F#
I+(/=+*-*--%9
4=)*E/:/94F2
,ITNON
I+(/=+*-*--%94=
)*E/:/94F2,IO
#P
I+(/=+*-*--%94=)*E/
:/94IIW!#
I+(/=+*-
*--%946
%+':9
8'-+*7'-155%94=5':+'(%6
8'-+*7'-155%94;*.:9*:11
?.)+*=5'96'4A:1*:4)/55*Y9
F)-*=5'96'46)-*13/9
F)-*=5'96'46*715/
F)-*=5'96'4=%56*:19
Z(/'=5'96'4='(C%6
F)-*=5'96'4=/:/+(':9
F)-*=5'96'4E'5519/=+1-%6
F)-*=5'96'4=:/%6*:1'/
F)-*=5'96'4E/:1+'51%6
Q17(*7'-+/(49%--1:*E/:/9
>.5*(*71%64+/=13%6
?*(=.)(*6*:'94E1:E1C'519
0'-+/(*13/94+./+'1*+'*61-(*:
>.5'6)31'46%(13'(%6
>.5'6)31'4+('-.*6'+19
>.5'6)3*=.15'4-'C1'/
>.5'6)3*=.15'4=:/%6*:1'/4<M#TO
>.5'6)31'4=:/%6*:1'/4[#OT
>.5'6)31'4=:/%6*:1'/4>M8!NS>.5'6)31'4=:/%6*:1'/4,GOS
2/66'+'4*79-%(1E5*7%9G.*3*=1(/55%5'47'5+1-'
8/=+*9=1('41:+/((*E':948#!#O!8/=+*9=1('41:+/((*E':94PRR!#
0*((/51'47%(E3*(@/(1<(/=*:/6'43/:+1-*5'<(/=*:/6'4='5513%6
I+(/=+*6)-/94-*/51-*5*(
I+(/=+*6)-/94'C/(6
1+1519
F)-*7'-+/(1%6
4='('+%7/(-%5*919
F)-*7'-+/(1%6
4+%7/(-%5*9194>D>#PP#
F)-*7'-+/(1%6
4+%7/(-%5*9194HOUGC
F)-*7'-+/(1%6
47*C19
F)-*7'-+/(1%6
45/=('/
>*():/7'-+/(1%6431=.+./(1'/
>*():/7'-+/(1%64/@@1-1/:9
>*():/7'-+/(1%6
4E5%+'61-%6
>*():/7'-+/(1%6
4E5%+'61-%6
4#O!ON
01@13*7'-+/(1%645*:E%6
<(*=./()6'4Y.1==5/14<M!T\NU
<(*=./()6'4Y.1==5/14<Y19+
Q%9*7'-+/(1%6
4:%-5/'+%6
<./(6
*+*E'46'(1+16
'
,L%1@/J4'/*51-%9
D/.'5*-*--*13/94/+./:*E/:/9
<./(6%94+./(6*=.15%9
D/1:*-*--%94('31*3%(':9
25*/*7'-+/(4C1*
5'-/%9
I):/-.*-*--%94/5*:E'+%9
K*9+*-49=
"4?>>4U#N!
I):/-.*-)9+1949="4?>>RT!O
?(*-.5*(*-*--%946'(1:%94II#N!
?(*-.5*(*-*--%946'(1:%94FW<SO#O
I):/-.*-*--%949="4MHT#!N
?(*-.5*(*-*--%946'(1:%94>>F?#OUT
,-13*7'-+/(1%64-'=9%5'+%6
I*517'-+/(4%91+'+%9
D/9%5@*C17(1*4C%5E'(19
2/*7'-+/(49%5@%((/3%-/:9
03/55*C17(1*47'-+/(1*C*(%9
>'6=)5*7'-+/(4;/;%:1
M*51:/55'49%--1:*E/:/9
H/51-*7'-+/(4./='+1-%9
H/51-*7'-+/(4=)5*(14NRRSP
H/51-*7'-+/(4=)5*(14[SS
>'%5*7'-+/(4-(/9-/:+%9
G.1B*71%646/515*+1
,E(*7'-+/(1%64+%6/@'-1/:94>/(/*:
,E(*7'-+/(1%64+%6/@'-1/:94M'9.Z
0(%-/55'49%19
0(%-/55'46/51+/:919
G.1B*71%645*+1
G.*3*=9/%3*6*:'94='5%9+(19
0('3)(.1B*71%64;'=*:1-%6
G1-&/++91'4-*:*(11
G1-&/++91'4=(*Y'B/&11
M*57'-.1'49="4YF/5
K1+(*9*6*:'94/%(*='/'
>.(*6*7'-+/(1%64C1*5'-/%6
K/199/(1'46/:1:E1+131940
K/199/(1'46/:1:E1+13194,
G'59+*:1'49*5':'-/'(%6
0*(3/+/55'4=/(+%9919
0*(3/+/55'47(*:-.19/=+1-'
0*(3/+/55'4='('=/(+%9919
>*J1/55'47%(:/+11
]':+.*6*:'94-'6=/9+(19
]':+.*6*:'94'J*:*=*319
])5/55'4@'9+131*9'4S'P-
])5/55'4@'9+131*9'4U!!SRV
?9/%3*6*:'94'/(%E1:*9'
?9/%3*6*:'94=%+13'
?9/%3*6*:'949)(1:E'/
I./Y':/55'4*:/13/:919
?.*+*7'-+/(1%64=(*@%:3%6
X17(1*4-.*5/('/
X17(1*4C%5:1@1-%94^[!#R
X17(1*4C%5:
1@1-%94>F>?R
X17(1*4='('.'
/6*5)+1-%9
?'9+/%(
/55'46%5
+*-13'
H'/6*=
.15%941:@5%
/:B'/
H'/6*=
.15%943%
-(/)1
I'56*:/55'4+)=.16%(1%6
I'56*:/55'4/:+/(1-'
I'56*:/55'4+)=.1
$9-./(1-.1'4-*514$D8SOO
$9-./(1-.1'4-*514A#PU_HU
$9-./(1-.1'4-*514AR
$9-./(1-.1'4-*514`#N
I.1E/55'4@5/J:/(14N'4NVPU<
I.1E/55'4@5/J:/(14N'4O!#
^/(91:1'4=/9+194F/31/C'519
^/(91:1'4=/9+194`WF
^/(91:1'4=/9+194>ASN
?.*+*(.'73%945%61:/9-/:9
0%-.:/('4'=.13
1-*5'4,?I
0%-.:/('4'=.131-*5'
4IE
0%-.:/('4'=
.131-*5'40=
05*-.6'::
1'4@5*(13':
%9
M1EE5/9Y*
(+.1'47(/C1='
5=19
!"#
Color ranges:
$%&'()*+'
,(-.'/'
0'-+/(1'
21'(31'45'6751'
8/19.6':1'46';*(
<.'5'991*
91('4=9/%3*:':'
>()=+*9=*(131%64.*61:19
?5'96*31%64@'5-1='(%6
A()B'49'+1C'
,('713*=9194+.'51':'
>)':131*9-.)B*:46/(*5'/
D1-+)*9+/51%64319-*13/%6
2'55%94E'55%9
F%946%9-%5%9
G'++%94:*(C/E1-%9
H*6*49'=1/:9
?':4+(*E5*3)+/9
<'&1@%E%4(%7(1=/9
D':1*4(/(1*
D(*9*=.15'46/5':*E'9+/( ,
:*=./5/94E'671'/
>'/:*(.'731+194/5/E':9
>'/:*(.'731+1947(1EE9'/
I-.1B*9'--.'(*6)-/94=*67/
$(/6*+./-1%64E*99)=11
I'--.'(*6)-/94-/(/C191'/
?)(*7'-%5%64'/(*=.15%6
,/(*=)(%64=/(:1J
I%5@*5*7%94+*&*3'11
I%5@*5*7%949*5@'+'(1-%9
K':*'(-.'/%64/L%1+':9
<./(6*=5'96
'4'-13*=.15%6
<./(6*=5'96
'4C*5-':1%6
?)(*-*--%94@%(1*9%9
?)(*-*--%94'7)991
?)(*-*--%94.*(1&*9.11
F/+.':*=)(%94&':35/(1
F/+.':*7'-+/(1%6
4+./(6'%+*+(*=.1-%6
F/+.':*-*--%94;'::'9-.11
F/+.':*-*--%946'(1='5%319
,(-.'/*E5*7%94@%5E13%9
F/+.':*9'(-1:'4'-/+1C*(':9
F/+.':*9'(-1:'46
'B/1
H'5*7'-+/(1%6
49="4KG>!#
<./(6*':'/(*7'-+/(4+/:E-*:E/:919
>5*9+(131%64'-/+*7%+)51-%6
>5*9+(131%64+/+':1
>5*9+(131%64=/(@(1:E/:9
I+'=.)5*-*--%94'%(/%94FMN
I+'=.)5*-*--%94'%(/%94KO#P
I+'=.)5*-*--%94'%(/%94F%P!
I+'=.)5*-*--%94/=13/(61319
819+/(1'41::*-%'
819+/(1'46*:*-)+*E/:/94QNORP
819+/(1'46*:*-)+*E/:/94$2D
0'-155%949%7+1519
0'-155%94':+.('-19
0'-155%94-/(/%94,<>>4#!STU
0'-155%94-/(/%94,<>>4#VPUS
0'-155%94.'5*3%(':9
A-/':*7'-155%941./)/:919
$:+/(*-*--%94@'/-'519
8'-+*-*--%945'-+19
I+(/=+*-*--%94=:/%6
*:1'/4GR
I+(/=+*-
*--%94=
:/%6*:
1'/4<W2
GV
I+(/=+*-
*--%94'E
'5'-+1'/4
WWW
I+(/=+*-*-
-%94'E'5'-
+1'/4X
I+(/=+*-*--
%94=)*E/:/9
4F#
I+(/=+*-*--%9
4=)*E/:/94F2
,ITNON
I+(/=+*-*--%94=
)*E/:/94F2,IO
#P
I+(/=+*-*--%94=)*E/
:/94IIW!#
I+(/=+*-
*--%946
%+':9
8'-+*7'-155%94=5':+'(%6
8'-+*7'-155%94;*.:9*:11
?.)+*=5'96'4A:1*:4)/55*Y9
F)-*=5'96'46)-*13/9
F)-*=5'96'46*715/
F)-*=5'96'4=%56*:19
Z(/'=5'96'4='(C%6
F)-*=5'96'4=/:/+(':9
F)-*=5'96'4E'5519/=+1-%6
F)-*=5'96'4=:/%6*:1'/
F)-*=5'96'4E/:1+'51%6
Q17(*7'-+/(49%--1:*E/:/9
>.5*(*71%64+/=13%6
?*(=.)(*6*:'94E1:E1C'519
0'-+/(*13/94+./+'1*+'*61-(*:
>.5'6)31'46%(13'(%6
>.5'6)31'4+('-.*6'+19
>.5'6)3*=.15'4-'C1'/
>.5'6)3*=.15'4=:/%6*:1'/4<M#TO
>.5'6)31'4=:/%6*:1'/4[#OT
>.5'6)31'4=:/%6*:1'/4>M8!NS>.5'6)31'4=:/%6*:1'/4,GOS
2/66'+'4*79-%(1E5*7%9G.*3*=1(/55%5'47'5+1-'
8/=+*9=1('41:+/((*E':948#!#O!8/=+*9=1('41:+/((*E':94PRR!#
0*((/51'47%(E3*(@/(1<(/=*:/6'43/:+1-*5'<(/=*:/6'4='5513%6
I+(/=+*6)-/94-*/51-*5*(
I+(/=+*6)-/94'C/(6
1+1519
F)-*7'-+/(1%6
4='('+%7/(-%5*919
F)-*7'-+/(1%6
4+%7/(-%5*9194>D>#PP#
F)-*7'-+/(1%6
4+%7/(-%5*9194HOUGC
F)-*7'-+/(1%6
47*C19
F)-*7'-+/(1%6
45/=('/
>*():/7'-+/(1%6431=.+./(1'/
>*():/7'-+/(1%64/@@1-1/:9
>*():/7'-+/(1%6
4E5%+'61-%6
>*():/7'-+/(1%6
4E5%+'61-%6
4#O!ON
01@13*7'-+/(1%645*:E%6
<(*=./()6'4Y.1==5/14<M!T\NU
<(*=./()6'4Y.1==5/14<Y19+
Q%9*7'-+/(1%6
4:%-5/'+%6
<./(6
*+*E'46'(1+16
'
,L%1@/J4'/*51-%9
D/.'5*-*--*13/94/+./:*E/:/9
<./(6%94+./(6*=.15%9
D/1:*-*--%94('31*3%(':9
25*/*7'-+/(4C1*
5'-/%9
I):/-.*-*--%94/5*:E'+%9
K*9+*-49=
"4?>>4U#N!
I):/-.*-)9+1949="4?>>RT!O
?(*-.5*(*-*--%946'(1:%94II#N!
?(*-.5*(*-*--%946'(1:%94FW<SO#O
I):/-.*-*--%949="4MHT#!N
?(*-.5*(*-*--%946'(1:%94>>F?#OUT
,-13*7'-+/(1%64-'=9%5'+%6
I*517'-+/(4%91+'+%9
D/9%5@*C17(1*4C%5E'(19
2/*7'-+/(49%5@%((/3%-/:9
03/55*C17(1*47'-+/(1*C*(%9
>'6=)5*7'-+/(4;/;%:1
M*51:/55'49%--1:*E/:/9
H/51-*7'-+/(4./='+1-%9
H/51-*7'-+/(4=)5*(14NRRSP
H/51-*7'-+/(4=)5*(14[SS
>'%5*7'-+/(4-(/9-/:+%9
G.1B*71%646/515*+1
,E(*7'-+/(1%64+%6/@'-1/:94>/(/*:
,E(*7'-+/(1%64+%6/@'-1/:94M'9.Z
0(%-/55'49%19
0(%-/55'46/51+/:919
G.1B*71%645*+1
G.*3*=9/%3*6*:'94='5%9+(19
0('3)(.1B*71%64;'=*:1-%6
G1-&/++91'4-*:*(11
G1-&/++91'4=(*Y'B/&11
M*57'-.1'49="4YF/5
K1+(*9*6*:'94/%(*='/'
>.(*6*7'-+/(1%64C1*5'-/%6
K/199/(1'46/:1:E1+131940
K/199/(1'46/:1:E1+13194,
G'59+*:1'49*5':'-/'(%6
0*(3/+/55'4=/(+%9919
0*(3/+/55'47(*:-.19/=+1-'
0*(3/+/55'4='('=/(+%9919
>*J1/55'47%(:/+11
]':+.*6*:'94-'6=/9+(19
]':+.*6*:'94'J*:*=*319
])5/55'4@'9+131*9'4S'P-
])5/55'4@'9+131*9'4U!!SRV
?9/%3*6*:'94'/(%E1:*9'
?9/%3*6*:'94=%+13'
?9/%3*6*:'949)(1:E'/
I./Y':/55'4*:/13/:919
?.*+*7'-+/(1%64=(*@%:3%6
X17(1*4-.*5/('/
X17(1*4C%5:1@1-%94^[!#R
X17(1*4C%5:
1@1-%94>F>?R
X17(1*4='('.'
/6*5)+1-%9
?'9+/%(
/55'46%5
+*-13'
H'/6*=
.15%941:@5%
/:B'/
H'/6*=
.15%943%
-(/)1
I'56*:/55'4+)=.16%(1%6
I'56*:/55'4/:+/(1-'
I'56*:/55'4+)=.1
$9-./(1-.1'4-*514$D8SOO
$9-./(1-.1'4-*514A#PU_HU
$9-./(1-.1'4-*514AR
$9-./(1-.1'4-*514`#N
I.1E/55'4@5/J:/(14N'4NVPU<
I.1E/55'4@5/J:/(14N'4O!#
^/(91:1'4=/9+194F/31/C'519
^/(91:1'4=/9+194`WF
^/(91:1'4=/9+194>ASN
?.*+*(.'73%945%61:/9-/:9
0%-.:/('4'=.13
1-*5'4,?I
0%-.:/('4'=.131-*5'
4IE
0%-.:/('4'=
.131-*5'40=
05*-.6'::
1'4@5*(13':
%9
M1EE5/9Y*
(+.1'47(/C1='
5=19
E. coli
humansworms
Y. pestis
3. Cautionary tales
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
3. Cautionary tales
23,621 genes 19,829 genes 18,529 genes 18,000 genes 6,000 genes19,568 orthologs
99%
14,000 orthologs
76%
human chimp chicken worm yeast
10,000 orthologs
55%
1,700 orthologs
28%
3606 genes 3553 genes 3874 genes 2801 genes 3760 genes3202 orthologs
90%
2974 orthologs
77%
C. botulanium
1126 orthologs
40%
1092 orthologs
29%
type A, strain ATCC 3502C. botulanium
type A, strain ATCC 19397C. botulanium C. botulanium C. botulanium
type A, strain Kyoto type C, strain Ecklund type E1, strain BoNT E Beluga
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
4 1 Sequences as Biological Information
organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.
From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of
Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance
BACTERIA
ARCHAEA
EUCARYA
Unicellulareukaryotes
Animals Plants
Macro-organisms
Protozoans
Flav
obac
teriu
m
Crenarchaeota
EuryarchaeotaChlamydiae
Cyanobacteria
Pro
teob
acte
ria
Act
inob
acte
ria
Chlorobi
Clostridium
Bacillus
Chloroflexi
Acidobacteria
Giardia
Saccharomyces
Trypanosoma
Slime mold
Babesia
Aquifi
cae
Therm
otoga
Thermus
Deinoco
ccus
Firmicutes
Bacteroidetes
Spirochaetes
Pla
ncto
myc
etes
3. Cautionary tales
gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca gacctataaa cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc aggtttccga gattatgcgc caaatgctta ctcaagctca aacggctggt cagtatttta ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca
4. Approaches to handle lots of data- Visualization
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
40 3 Microbial Genome Sequences
Base Atlases to Visualize Base Composition Features
Figure 3.3 is an ‘absolute’ Base Atlas, or a graphical representation of the entire !X174 DNA sequence plotted on a single figure (for the positive strand, the one represented in Fig. 3.11). Since we are interested in base composition analysis, the densities of the four bases are plotted by color intensity (the four outer circles). It is obvious that this DNA is quite T-rich, as there is far more red (T’s) than green (A’s), turquoise (G’s), or violet (C’s). It would be a challenge to see this at one glance from Fig. 3.1.
Continuing to read this plot from outside to inside, the coding sequences are plotted next, and since they are all on one strand only one color is needed here (in case there are coding sequences on the strand complementary to the strand that is published, we color them red). The next circle is called AT skew, and is a measure of the bias of A’s towards one strand (and T’s towards the other). As will be discussed in Chapter 7, for some bacteria, the A’s are biased towards the replication leading strand, but in other bacterial chromosomes, including E. coli, which this phage nor-mally infects, the T’s are biased towards the leading strand. The strong red color in this lane means that T’s are biased towards the strand represented by the sequence, implying that this is the leading replication strand. The next circle shows the GC skew, and since the scale is the same as that of the AT skew (+/! 0.20), the absence of dark colors indicates that the bias of G’s towards one strand or the other is not as
1 Phage !X174 is a virus that packs its DNA as single strand DNA (ssDNA) in viroid particles, so it only contains this positive strand in viroid form.
phiX1745386 bp
Hin
dII
TaqI
TaqI
TaqI
TaqI
Hin
dII
Hae
IIIM
boII
TaqI
HindII
HaeIII
TaqI
HapII
MboII
HindII
HaeIII
HapIITaqIHaeIIIHphIHindII
MboIIHaeIII
HphIHphI
HphI
MboII
HindII
MboII
HindII
MboIIM
boII
Hap
II
Hph
I
Hap
II
Hph
I
Hph
I
Hae
IIIHindIIHap
II
HindIIMboII
TaqI
TaqIHphI
HindII
HaeIII
MboII
HaeIII
HaeIII
MboII
HindIIHaeIII
HaeIII
HphI
HindII
TaqIP
stI
HindII
MboII phiX1745386 bp
p1
p2
p3 p4p5 p7
p8p9
p6
p10
p11
origin
Fig. 3.2 Two views of the nucleotide sequence of the !X174 genome. The left view shows a selection of the restriction enzyme recognition sites originally described in the paper (the unique PstI site is red), and the right view shows all 11 protein encoding genes, along with their predicted transcripts. The origin of replication is indicated by an arrow
40 3 Microbial Genome Sequences
Base Atlases to Visualize Base Composition Features
Figure 3.3 is an ‘absolute’ Base Atlas, or a graphical representation of the entire !X174 DNA sequence plotted on a single figure (for the positive strand, the one represented in Fig. 3.11). Since we are interested in base composition analysis, the densities of the four bases are plotted by color intensity (the four outer circles). It is obvious that this DNA is quite T-rich, as there is far more red (T’s) than green (A’s), turquoise (G’s), or violet (C’s). It would be a challenge to see this at one glance from Fig. 3.1.
Continuing to read this plot from outside to inside, the coding sequences are plotted next, and since they are all on one strand only one color is needed here (in case there are coding sequences on the strand complementary to the strand that is published, we color them red). The next circle is called AT skew, and is a measure of the bias of A’s towards one strand (and T’s towards the other). As will be discussed in Chapter 7, for some bacteria, the A’s are biased towards the replication leading strand, but in other bacterial chromosomes, including E. coli, which this phage nor-mally infects, the T’s are biased towards the leading strand. The strong red color in this lane means that T’s are biased towards the strand represented by the sequence, implying that this is the leading replication strand. The next circle shows the GC skew, and since the scale is the same as that of the AT skew (+/! 0.20), the absence of dark colors indicates that the bias of G’s towards one strand or the other is not as
1 Phage !X174 is a virus that packs its DNA as single strand DNA (ssDNA) in viroid particles, so it only contains this positive strand in viroid form.
phiX1745386 bp
Hin
dII
TaqI
TaqI
TaqI
TaqI
Hin
dII
Hae
IIIM
boII
TaqI
HindII
HaeIII
TaqI
HapII
MboII
HindII
HaeIII
HapIITaqIHaeIIIHphIHindII
MboIIHaeIII
HphIHphI
HphI
MboII
HindII
MboII
HindII
MboIIM
boII
Hap
II
Hph
I
Hap
II
Hph
I
Hph
I
Hae
IIIHindIIHap
II
HindIIMboII
TaqI
TaqIHphI
HindII
HaeIII
MboII
HaeIII
HaeIII
MboII
HindIIHaeIII
HaeIII
HphI
HindII
TaqIP
stI
HindII
MboII phiX1745386 bp
p1
p2
p3 p4p5 p7
p8p9
p6
p10
p11
origin
Fig. 3.2 Two views of the nucleotide sequence of the !X174 genome. The left view shows a selection of the restriction enzyme recognition sites originally described in the paper (the unique PstI site is red), and the right view shows all 11 protein encoding genes, along with their predicted transcripts. The origin of replication is indicated by an arrow
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
The Importance of Visualization 41
strong as for the A’s in the previous circle. AT and GC skew are further explained in Chapter 7. Finally, the deviation of AT content from the chromosomal average percentage AT is plotted, ranging from 40% to 60% AT, with 50% AT in the middle; thus bright red regions contain lots of A’s or T’s, and the blue regions are GC-rich. There are four dark red regions in the innermost circle that are much more AT-rich than the rest of the chromosome.
The plot of Fig. 3.4 shows the same data as in Fig. 3.3, but now as a ‘relative’ Base Atlas: the data are normalized to the genomic average for the values in each lane; only values greater than three standard deviations above the average are colored. At first sight, this is a rather bleached version of the previous figure, but it does reveal different information. For instance, there is a region where A’s are highly overrepresented compared to the global A content (around 3.5 k), and a relatively small stretch where G’s are overrepresented (around 1 k). This isn't obvious from the previous, absolute Base Atlas because that is too colourful. Atlas figures can display lanes as either absolute ranges, or show regions that deviate by more than three standard deviations from the chromosomal average, or a combination of fixed and average lanes. The way to tell the scale is to look at the legend, which is always oriented with the outermost circle on the top, going towards the innermost circle at the bottom. At the right of each scale in the legend, ‘fix’ indicates a fixed range,
coliphage phiX174
.
0k 0.5k
1k1.5k
2k
2.5k3k3
5k4k
4 .5k
5k
5,386 bp
BASE ATLAS
G Contentfixavg
0.00 0.40
A Contentfixavg
0.00 0.40
T Contentfixavg
0.00 0.40
C Contentfixavg
0.00 0.40
Annotations:
CDS +
AT Skewfixavg
–0.20 0.20
GC Skewfixavg
–0.20 0.20
Percent ATfixavg
0.40 0.60
Resolution: 3
Fig. 3.3 Absolute DNA Base Atlas of the nucleotide sequence of the !X174 genome. The legend to the right explains what is represented from the outer to the inner circle. Shown are the fraction of each nucleotide along the genome (first four circles counting inwards), the coding sequences on the positive (clockwise) strand, the AT and GC skew, and the percent AT. In an ‘absolute’ Atlas all lanes are plotted with a fixed range
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
42 3 Microbial Genome Sequences
while ‘dev’ means that the average is in the middle value (usually light gray) and the extreme ends represent plus or minus three standard deviations from the average.
Genome Atlases to Visualize Chromosomes
The analysis of DNA base composition is interesting in itself, but a Base Atlas dis-plays only a fraction of the type of information a genomic atlas can provide. The next step is to combine this with the presence of genes, and to also indicate regions containing repeats in DNA sequences. Structural features of the DNA can also be plotted. That way, we start to produce what we call a Genome Atlas, providing a quick overview of some of the most important and informative features in a microbial chromosome, plasmid, or phage. Figure 3.5 represents a Genome Atlas of the !X174 genome.
Of the circles of the Base Atlas of Fig. 3.4 we have chosen to represent only AT skew (as a fixed average) and percent AT (as deviation). Three outer circles have been added to the atlas, representing DNA structural properties: intrinsic DNA curvature in the outermost, followed by stacking energy and position preference.
CD
S >
CDS >
CDS >
>
CD
S >
CD
S >
k0
0.5k
1k.5k
2k
2.5k3k3.5
k4k
4 .5k
5k
BASE ATLAS
G Contentdevavg
0.07 0.39
A Contentdevavg
0.01 0.47
T Contentdevavg
0.10 0.53
C Contentdevavg
0.04 0.39
Annotations:
CDS +
AT Skewdevavg
–0.33 0.18
GC Skewdevavg
–0.10 0.14
Percent ATdevavg
0.47 0.63
Resolution: 3
coliphage phiX1745,386 bp
Fig. 3.4 Relative Base Atlas of the !X174 genome. In this Atlas the colors represent the regions where the base density varies more than three standard deviations from the genomic average. To the right of each scale is indicated whether fixed average or three standard deviations are plotted. The numbers below the scales indicate how color intensity was chosen. This relative Base Atlas (and not the absolute version of Fig. 3.3) is the default Base Atlas used in the remainder of the book
gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca gacctataaa cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc aggtttccga gattatgcgc caaatgctta ctcaagctca aacggctggt cagtatttta ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
0k
250k5
00
k
750k1000k
125
0k
15
00
k
H. influenzae Rd KW20
1,830,138 bp
GENOME ATLAS
Intrinsic Curvaturedevavg
0.19 0.26
Stacking Energydevavg
-7.91 -7.09
Position Preferencedevavg
0.14 0.17
Annotations: CDS +
CDS -
rRNA
tRNA
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewdevavg
-0.07 0.07
Percent ATfixavg
0.20 0.80
Resolution: 733
GENOME ATLAS
Intrinsic Curvaturedevavg
0.19 0.26
Stacking Energydevavg
-7.89 -7.11
Position Preferencedevavg
0.14 0.17
Annotations: CDS +
CDS -
rRNA
tRNA
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewdevavg
-0.07 0.07
Percent ATfixavg
0.20 0.80Resolution: 766
2
0k
250k
50
0k
750k1000k
1
50k
15
00
k
1750k
H. influenzae 86-028NP
1,913,428 bp
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
0k
250k5
00
k
750k1000k
125
0k
15
00
k
H. influenzae Rd KW20
1,830,138 bp
GENOME ATLAS
Intrinsic Curvaturedevavg
0.19 0.26
Stacking Energydevavg
-7.91 -7.09
Position Preferencedevavg
0.14 0.17
Annotations: CDS +
CDS -
rRNA
tRNA
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewdevavg
-0.07 0.07
Percent ATfixavg
0.20 0.80
Resolution: 733
GENOME ATLAS
Intrinsic Curvaturedevavg
0.19 0.26
Stacking Energydevavg
-7.89 -7.11
Position Preferencedevavg
0.14 0.17
Annotations: CDS +
CDS -
rRNA
tRNA
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewdevavg
-0.07 0.07
Percent ATfixavg
0.20 0.80Resolution: 766
20k
250k
50
0k
750k1000k
1
50k
15
00
k
1750k
H. influenzae 86-028NP
1,913,428 bp
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
etp
D >
etpE
>
etpF
>
hlyA >
hlyB >hlyD >
CD
S >
CD
S >
CD
S >
nik
B >
toxB
>
traI >
katP >
espP >
CD
S >
Origin
toxB
toxB
0k12.5k
25
k
37.5k50k
62.5
k7
5k
87.5k
pO157 of E. coli O157:H7
strain Sakai
92,721 bp
GENOME ATLAS
Intrinsic Curvaturedevavg
0.08 0.30
Stacking Energydevavg
-9.51 -6.41
Position Preferencedevavg
0.11 0.17
Annotations:
CDS +
CDS -
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewfixavg
-0.12 0.12
Percent ATfixavg
0.30 0.70Resolution: 38
katP
>
espP >
L7028 >
L7031 >
etpE >
etpL >
EH
EC
-hly
A
Ch
lyB
>
L 7072 >
L7081 >
L7095 >
t
Origin
hlyA
0k12.5k
25
k
37.5k50k
62.5
k7
5k
pO157 ofE. coli O157:H7
92,077 bp
GENOME ATLAS
Intrinsic Curvaturedevavg
0.08 0.30
Stacking Energydevavg
-9.52 -6.41
Position Preferencedevavg
0.11 0.17
Annotations:
CDS +
CDS -
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewfixavg
-0.12 0.12
Percent ATfixavg
0.30 0.70Resolution: 37
hlyA
etpD >etpE >
etpF >
hly
A >
hly
B >
hly
D >
CD
S >
CDS >
CD S >nikB >
toxB >
t raI >
katP
>
espP
>
CDS >
Origin
toxB
toxB
0k
12.5k
25k37.5
k
50
k
62 .5k75k
87.5kpO157 of
E. coli O
157:H7
strain S
akai
92,721 bp
GE
NO
ME
AT
LA
S
Intrin
sic Cu
rvatu
red
evav
g
0.08
0.30
Stackin
g E
nerg
yd
evav
g
-9.51
-6.41
Po
sition
Preferen
ced
evav
g
0.11
0.17
An
no
tation
s:
CD
S +
CD
S -
Glo
bal D
irect Rep
eatsfixav
g
5.00
7.50
Glo
bal In
verted
Rep
eatsfixav
g
5.00
7.50
GC
Skew
fixavg
-0.12
0.12
Percen
t AT
fixavg
0.30
0.70
Reso
lutio
n: 38
katP >
espP >
L7
028
>
L7
03
1 >
et p
E >
etp
L >
EH EC-hlyA
C hlyB >
L7072 >
L7
08
1 >
L7
095 >
t
OriginhlyA
0k
12.5k
25k37.5
k
50
k
62 .5k75k
pO157 of
E. coli O
157:H7
92,077 bp
GE
NO
ME
AT
LA
S
Intrin
sic Cu
rvatu
red
evav
g
0.08
0.30
Stackin
g E
nerg
yd
evav
g
-9.52
-6.41
Po
sition
Preferen
ced
evav
g
0.11
0.17
An
no
tation
s:
CD
S +
CD
S -
Glo
bal D
irect Rep
eatsfixav
g
5.00
7.50
Glo
bal In
verted
Rep
eatsfixav
g
5.00
7.50
GC
Skew
fixavg
-0.12
0.12
Percen
t AT
fixavg
0.30
0.70
Reso
lutio
n: 37
hlyA
etp
D >
etpE
>
etpF
>
hlyA >
hlyB >hlyD >
CD
S >
CD
S >
CD
S >
nik
B >
toxB
>
traI >
katP >
espP >
CD
S >
Origin
toxB
toxB
0k12.5k
25
k
37.5k50k
62.5
k7
5k
87.5k
pO157 of E. coli O157:H7
strain Sakai
92,721 bp
GENOME ATLAS
Intrinsic Curvaturedevavg
0.08 0.30
Stacking Energydevavg
-9.51 -6.41
Position Preferencedevavg
0.11 0.17
Annotations:
CDS +
CDS -
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewfixavg
-0.12 0.12
Percent ATfixavg
0.30 0.70Resolution: 38
katP
>
espP >
L7028 >
L7031 >
etpE >
etpL >
EH
EC
-hly
A
Ch
lyB
>
L 7072 >
L7081 >
L7095 >
t
Origin
hlyA
0k12.5k
25
k
37.5k50k
62.5
k7
5k
pO157 ofE. coli O157:H7
92,077 bp
GENOME ATLAS
Intrinsic Curvaturedevavg
0.08 0.30
Stacking Energydevavg
-9.52 -6.41
Position Preferencedevavg
0.11 0.17
Annotations:
CDS +
CDS -
Global Direct Repeatsfixavg
5.00 7.50
Global Inverted Repeatsfixavg
5.00 7.50
GC Skewfixavg
-0.12 0.12
Percent ATfixavg
0.30 0.70Resolution: 37
hlyA
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
Proteome comparison of Burkholderia malleiALR: 0.75, e-value:1e-10
Escherichia colistrain K
-12 isolate MG
16554289 genes, bp
Escherichia colistrain K
-12 isolate W3110
4387 genes, bp
Escherichia colistrain K
-12 substr. DH
10B4126 genes, bp
Escherichia colistrain C
FT0735379 genes, bp
Escherichia colistrain K-12 isolate MG16554289 genes, 4,639,221 bp
Escherichia colistrain K-12 isolate W31104387 genes, 4,641,433 bp
Escherichia colistrain K-12 substr. DH10B4126 genes, 4,686,137 bp
Escherichia colistrain CFT0735379 genes, 5,231,428 bp
391 / 4289
9.1%
4165 / 4289
97.1%
3920 / 4289
91.4%
3567 / 4289
83.2%
4217 / 4387
96.1%
464 / 4387
10.6%
3939 / 4387
89.8%
3616 / 4387
82.4%
4010 / 4126
97.2%
3994 / 4126
96.8%
564 / 4126
13.7%
3539 / 4126
85.8%
3625 / 5379
67.4%
3637 / 5379
67.6%
3502 / 5379
65.1%
680 / 5379
12.6%
Homology within genomes
9.12 13.67
Homology between genomes
65.11 97.19
Comparative Microbial Genomics group Center for B
iological Sequence analysis D
epartment of S
ystems B
iology, Technical University of D
enmark
3.0 %110 / 3,665
88.1 %3,495 / 3,966
80.4 %3,303 / 4,109
77.1 %3,186 / 4,134
92.9 %3,489 / 3,754
79.6 %3,139 / 3,944
85.8 %3,291 / 3,837
83.2 %3,325 / 3,995
82.4 %3,302 / 4,009
83.4 %3,315 / 3,973
76.9 %3,165 / 4,117
64.7 %2,888 / 4,463
70.4 %2,922 / 4,153
68.7 %2,874 / 4,181
75.9 %3,094 / 4,077
71.6 %3,010 / 4,205
73.6 %3,045 / 4,135
70.3 %2,933 / 4,174
40.0 %2,421 / 6,055
42.2 %2,215 / 5,254
44.3 %2,515 / 5,683
41.7 %2,552 / 6,116
38.4 %2,291 / 5,971
40.3 %2,326 / 5,771
35.0 %1,949 / 5,561
34.0 %1,963 / 5,771
32.1 %1,846 / 5,747
38.7 %2,143 / 5,536
35.8 %2,018 / 5,637
32.5 %2,385 / 7,336
31.2 %2,143 / 6,862
27.2 %1,946 / 7,165
4.2 %155 / 3,729
88.8 %3,489 / 3,927
74.9 %3,187 / 4,253
89.7 %3,485 / 3,884
78.1 %3,147 / 4,032
82.5 %3,278 / 3,971
80.7 %3,321 / 4,117
80.8 %3,319 / 4,106
81.3 %3,320 / 4,085
76.7 %3,195 / 4,167
64.9 %2,940 / 4,533
70.3 %2,965 / 4,217
67.2 %2,880 / 4,288
77.2 %3,155 / 4,088
72.6 %3,068 / 4,226
74.9 %3,101 / 4,142
69.2 %2,953 / 4,267
39.6 %2,438 / 6,154
41.6 %2,225 / 5,354
43.7 %2,527 / 5,781
41.2 %2,564 / 6,224
38.0 %2,307 / 6,067
39.8 %2,339 / 5,873
34.8 %1,967 / 5,647
33.7 %1,977 / 5,865
32.1 %1,873 / 5,828
38.3 %2,156 / 5,631
35.9 %2,049 / 5,713
32.6 %2,405 / 7,380
31.1 %2,163 / 6,948
27.1 %1,964 / 7,245
4.3 %157 / 3,665
80.2 %3,271 / 4,079
81.1 %3,280 / 4,046
80.2 %3,169 / 3,953
83.4 %3,267 / 3,919
81.4 %3,309 / 4,067
81.6 %3,311 / 4,057
81.9 %3,311 / 4,041
81.6 %3,264 / 4,000
67.6 %2,983 / 4,413
72.2 %2,986 / 4,136
69.7 %2,916 / 4,183
78.0 %3,149 / 4,038
73.5 %3,065 / 4,172
75.5 %3,089 / 4,092
69.7 %2,944 / 4,221
40.0 %2,440 / 6,094
42.3 %2,236 / 5,283
44.5 %2,539 / 5,707
41.9 %2,575 / 6,151
38.5 %2,311 / 6,004
40.4 %2,345 / 5,808
35.3 %1,972 / 5,581
34.2 %1,983 / 5,797
32.5 %1,873 / 5,769
38.8 %2,162 / 5,566
36.4 %2,055 / 5,647
33.1 %2,415 / 7,299
31.5 %2,169 / 6,884
27.5 %1,971 / 7,179
3.3 %120 / 3,599
77.3 %3,164 / 4,093
75.1 %3,024 / 4,028
79.4 %3,143 / 3,956
76.0 %3,147 / 4,141
75.3 %3,136 / 4,162
76.3 %3,142 / 4,120
77.5 %3,153 / 4,066
65.1 %2,880 / 4,423
68.5 %2,869 / 4,191
69.0 %2,860 / 4,145
71.5 %2,986 / 4,175
68.5 %2,914 / 4,256
69.8 %2,942 / 4,217
66.3 %2,833 / 4,271
38.4 %2,348 / 6,120
41.3 %2,181 / 5,277
42.8 %2,449 / 5,718
40.0 %2,473 / 6,185
37.0 %2,227 / 6,026
38.6 %2,251 / 5,839
33.6 %1,884 / 5,612
32.5 %1,896 / 5,827
30.6 %1,777 / 5,804
37.9 %2,110 / 5,560
34.7 %1,968 / 5,677
31.7 %2,323 / 7,337
30.4 %2,098 / 6,893
26.3 %1,893 / 7,208
2.8 %99 / 3,560
80.7 %3,108 / 3,853
87.0 %3,272 / 3,762
82.9 %3,277 / 3,954
82.6 %3,267 / 3,954
83.7 %3,275 / 3,915
78.4 %3,157 / 4,029
67.3 %2,909 / 4,320
73.7 %2,947 / 4,001
71.8 %2,896 / 4,036
79.5 %3,125 / 3,932
74.1 %3,024 / 4,083
76.0 %3,059 / 4,025
73.2 %2,952 / 4,034
41.4 %2,445 / 5,905
43.8 %2,234 / 5,101
45.9 %2,535 / 5,526
42.9 %2,564 / 5,977
39.7 %2,312 / 5,825
41.7 %2,346 / 5,626
36.3 %1,965 / 5,413
35.3 %1,981 / 5,619
33.3 %1,863 / 5,593
40.3 %2,167 / 5,378
37.3 %2,045 / 5,477
33.6 %2,410 / 7,181
32.3 %2,164 / 6,706
28.0 %1,962 / 7,016
1.8 %59 / 3,353
83.0 %3,126 / 3,768
83.2 %3,208 / 3,855
85.6 %3,244 / 3,790
86.6 %3,253 / 3,757
74.3 %2,987 / 4,018
67.8 %2,836 / 4,184
81.0 %2,989 / 3,688
71.5 %2,801 / 3,915
77.9 %3,002 / 3,856
73.0 %2,908 / 3,986
75.2 %2,954 / 3,928
73.5 %2,863 / 3,897
42.4 %2,408 / 5,683
45.9 %2,223 / 4,842
47.1 %2,503 / 5,310
44.1 %2,533 / 5,743
40.6 %2,270 / 5,592
42.9 %2,314 / 5,388
37.7 %1,947 / 5,165
36.6 %1,964 / 5,371
34.4 %1,846 / 5,360
41.6 %2,140 / 5,139
38.7 %2,021 / 5,225
34.3 %2,377 / 6,932
33.0 %2,137 / 6,467
28.7 %1,944 / 6,766
2.9 %100 / 3,429
91.4 %3,373 / 3,689
90.1 %3,346 / 3,715
91.2 %3,355 / 3,679
78.6 %3,113 / 3,962
68.1 %2,876 / 4,226
74.9 %2,944 / 3,932
72.2 %2,861 / 3,960
83.0 %3,135 / 3,777
78.5 %3,061 / 3,901
80.4 %3,080 / 3,831
76.4 %2,970 / 3,887
41.8 %2,434 / 5,818
44.3 %2,232 / 5,038
46.4 %2,534 / 5,464
43.3 %2,559 / 5,916
39.9 %2,299 / 5,768
41.9 %2,334 / 5,571
36.6 %1,961 / 5,354
35.5 %1,974 / 5,563
33.4 %1,852 / 5,547
40.6 %2,159 / 5,323
37.4 %2,032 / 5,428
33.8 %2,403 / 7,116
32.4 %2,155 / 6,649
28.2 %1,960 / 6,957
2.8 %102 / 3,619
90.4 %3,439 / 3,805
91.7 %3,455 / 3,766
75.5 %3,125 / 4,141
66.4 %2,917 / 4,393
73.3 %2,983 / 4,071
69.0 %2,868 / 4,158
79.4 %3,144 / 3,961
75.8 %3,073 / 4,056
77.3 %3,098 / 4,009
73.1 %2,977 / 4,072
40.8 %2,445 / 5,993
43.0 %2,240 / 5,212
45.2 %2,546 / 5,633
42.3 %2,572 / 6,075
38.9 %2,309 / 5,932
40.9 %2,343 / 5,733
35.7 %1,969 / 5,522
34.6 %1,982 / 5,732
32.7 %1,868 / 5,705
39.5 %2,169 / 5,493
36.7 %2,048 / 5,585
33.3 %2,418 / 7,252
31.8 %2,169 / 6,817
27.6 %1,965 / 7,122
3.0 %109 / 3,575
96.0 %3,531 / 3,678
74.6 %3,103 / 4,160
66.3 %2,908 / 4,386
73.6 %2,979 / 4,045
69.1 %2,864 / 4,145
79.3 %3,138 / 3,958
75.1 %3,061 / 4,077
77.1 %3,088 / 4,007
73.4 %2,971 / 4,050
41.1 %2,450 / 5,957
43.4 %2,242 / 5,171
45.5 %2,548 / 5,599
42.7 %2,578 / 6,032
39.3 %2,314 / 5,892
41.2 %2,346 / 5,697
35.9 %1,971 / 5,485
34.8 %1,984 / 5,694
33.0 %1,872 / 5,667
39.7 %2,168 / 5,459
37.0 %2,051 / 5,545
33.5 %2,420 / 7,225
32.1 %2,173 / 6,778
27.7 %1,965 / 7,093
2.6 %92 / 3,593
75.4 %3,111 / 4,124
67.0 %2,915 / 4,348
74.3 %2,982 / 4,014
70.0 %2,873 / 4,102
80.3 %3,147 / 3,918
76.3 %3,076 / 4,029
78.0 %3,097 / 3,971
73.8 %2,975 / 4,030
41.3 %2,449 / 5,936
43.6 %2,244 / 5,143
45.8 %2,552 / 5,568
42.9 %2,579 / 6,005
39.4 %2,314 / 5,868
41.4 %2,346 / 5,670
36.1 %1,971 / 5,458
35.0 %1,985 / 5,667
33.2 %1,872 / 5,641
40.0 %2,171 / 5,428
37.2 %2,052 / 5,516
33.6 %2,420 / 7,193
32.2 %2,173 / 6,752
27.8 %1,967 / 7,064
3.5 %125 / 3,567
64.7 %2,847 / 4,403
68.5 %2,844 / 4,150
68.3 %2,820 / 4,126
73.1 %3,000 / 4,103
69.2 %2,925 / 4,226
71.3 %2,953 / 4,142
65.6 %2,791 / 4,256
37.9 %2,313 / 6,099
41.1 %2,153 / 5,242
42.2 %2,409 / 5,711
39.4 %2,432 / 6,172
36.9 %2,208 / 5,989
38.0 %2,213 / 5,823
32.8 %1,843 / 5,611
31.9 %1,857 / 5,817
30.2 %1,747 / 5,786
38.2 %2,104 / 5,504
34.4 %1,944 / 5,645
31.0 %2,282 / 7,362
30.3 %2,079 / 6,856
25.7 %1,850 / 7,198
2.8 %99 / 3,586
70.6 %2,886 / 4,087
68.0 %2,806 / 4,126
69.5 %2,908 / 4,185
64.3 %2,805 / 4,365
70.2 %2,930 / 4,172
65.2 %2,768 / 4,246
38.2 %2,320 / 6,080
41.2 %2,152 / 5,228
41.4 %2,372 / 5,735
39.1 %2,413 / 6,176
36.4 %2,186 / 6,003
38.0 %2,209 / 5,811
33.1 %1,848 / 5,585
32.1 %1,861 / 5,795
30.0 %1,736 / 5,791
37.3 %2,064 / 5,537
34.8 %1,952 / 5,606
30.8 %2,270 / 7,379
29.7 %2,044 / 6,887
25.6 %1,841 / 7,194
2.5 %84 / 3,305
73.1 %2,818 / 3,854
76.7 %2,961 / 3,861
71.6 %2,868 / 4,006
77.2 %2,983 / 3,866
71.6 %2,802 / 3,915
42.3 %2,399 / 5,675
46.0 %2,220 / 4,821
46.3 %2,464 / 5,326
43.0 %2,483 / 5,781
39.9 %2,238 / 5,609
41.8 %2,264 / 5,418
36.7 %1,906 / 5,192
35.6 %1,922 / 5,398
33.6 %1,805 / 5,377
41.6 %2,134 / 5,135
38.1 %1,994 / 5,228
33.3 %2,327 / 6,984
32.4 %2,098 / 6,481
28.1 %1,904 / 6,782
2.2 %73 / 3,311
71.8 %2,849 / 3,968
67.9 %2,780 / 4,092
71.8 %2,855 / 3,974
68.6 %2,743 / 4,001
39.7 %2,303 / 5,796
42.7 %2,113 / 4,953
43.5 %2,367 / 5,437
40.1 %2,373 / 5,919
38.0 %2,171 / 5,707
39.9 %2,202 / 5,514
34.9 %1,843 / 5,282
34.2 %1,872 / 5,473
31.9 %1,743 / 5,469
39.1 %2,048 / 5,244
36.4 %1,935 / 5,317
32.1 %2,268 / 7,062
31.2 %2,045 / 6,565
26.9 %1,851 / 6,869
2.1 %72 / 3,454
78.6 %3,059 / 3,894
82.5 %3,117 / 3,780
77.1 %2,975 / 3,861
42.9 %2,463 / 5,745
44.3 %2,213 / 5,001
45.9 %2,501 / 5,451
43.5 %2,550 / 5,859
40.1 %2,293 / 5,719
42.0 %2,320 / 5,530
36.8 %1,951 / 5,307
35.7 %1,970 / 5,513
33.5 %1,845 / 5,501
40.4 %2,139 / 5,293
37.7 %2,026 / 5,367
34.2 %2,394 / 7,002
32.6 %2,153 / 6,600
28.2 %1,949 / 6,915
2.4 %83 / 3,442
78.0 %3,045 / 3,906
73.0 %2,911 / 3,989
40.8 %2,400 / 5,876
43.5 %2,200 / 5,058
46.1 %2,513 / 5,455
42.2 %2,506 / 5,941
39.2 %2,272 / 5,796
41.1 %2,303 / 5,603
36.1 %1,940 / 5,373
34.9 %1,952 / 5,586
32.9 %1,824 / 5,545
39.4 %2,118 / 5,370
37.3 %2,022 / 5,418
33.4 %2,359 / 7,060
31.8 %2,123 / 6,682
27.9 %1,942 / 6,969
2.9 %99 / 3,427
74.7 %2,922 / 3,914
42.4 %2,451 / 5,781
44.9 %2,242 / 4,998
45.7 %2,506 / 5,480
42.9 %2,536 / 5,906
39.8 %2,292 / 5,756
41.6 %2,314 / 5,569
36.2 %1,940 / 5,352
35.2 %1,954 / 5,558
33.0 %1,831 / 5,549
40.3 %2,145 / 5,320
37.3 %2,019 / 5,407
33.4 %2,375 / 7,115
32.0 %2,130 / 6,656
27.9 %1,941 / 6,954
1.9 %62 / 3,316
40.9 %2,360 / 5,769
43.5 %2,155 / 4,958
45.1 %2,444 / 5,415
42.3 %2,475 / 5,845
39.2 %2,230 / 5,684
41.5 %2,272 / 5,479
36.3 %1,906 / 5,250
35.3 %1,926 / 5,455
33.1 %1,804 / 5,448
39.8 %2,086 / 5,245
37.8 %2,001 / 5,288
33.0 %2,325 / 7,056
32.0 %2,097 / 6,549
27.9 %1,909 / 6,851
3.2 %147 / 4,662
43.5 %2,547 / 5,858
48.9 %2,994 / 6,128
46.4 %3,042 / 6,550
41.4 %2,698 / 6,523
43.9 %2,762 / 6,293
35.9 %2,233 / 6,219
35.5 %2,270 / 6,400
31.9 %2,074 / 6,494
64.9 %3,384 / 5,214
47.0 %2,741 / 5,827
46.4 %3,371 / 7,266
35.2 %2,581 / 7,333
29.6 %2,295 / 7,753
2.1 %79 / 3,683
47.2 %2,608 / 5,524
43.2 %2,597 / 6,013
43.7 %2,492 / 5,705
45.4 %2,507 / 5,523
36.2 %1,976 / 5,464
34.5 %1,965 / 5,696
32.3 %1,842 / 5,705
45.0 %2,357 / 5,232
37.8 %2,081 / 5,503
34.9 %2,496 / 7,160
34.3 %2,276 / 6,634
27.9 %1,972 / 7,061
2.8 %121 / 4,337
67.5 %3,741 / 5,540
41.9 %2,637 / 6,301
43.7 %2,670 / 6,112
36.3 %2,179 / 6,005
35.3 %2,191 / 6,211
32.6 %2,040 / 6,250
46.1 %2,626 / 5,697
39.9 %2,372 / 5,942
38.7 %2,880 / 7,439
34.4 %2,472 / 7,184
29.4 %2,212 / 7,534
3.1 %150 / 4,773
39.7 %2,672 / 6,728
40.8 %2,680 / 6,565
34.2 %2,205 / 6,448
33.1 %2,209 / 6,672
30.5 %2,050 / 6,715
43.2 %2,655 / 6,143
37.5 %2,396 / 6,387
37.0 %2,900 / 7,832
33.0 %2,516 / 7,615
27.8 %2,222 / 7,979
2.3 %103 / 4,463
72.3 %3,688 / 5,101
36.6 %2,201 / 6,016
35.7 %2,219 / 6,213
33.1 %2,074 / 6,269
40.2 %2,413 / 5,999
36.9 %2,259 / 6,124
34.6 %2,682 / 7,762
36.5 %2,593 / 7,105
28.1 %2,155 / 7,667
2.8 %118 / 4,277
38.7 %2,246 / 5,808
38.1 %2,277 / 5,979
34.9 %2,114 / 6,065
42.3 %2,451 / 5,795
38.6 %2,289 / 5,931
36.7 %2,759 / 7,516
36.7 %2,562 / 6,982
29.5 %2,198 / 7,456
2.6 %96 / 3,691
75.0 %3,261 / 4,346
55.5 %2,683 / 4,838
34.5 %1,963 / 5,692
33.6 %1,915 / 5,695
29.6 %2,214 / 7,478
30.4 %2,085 / 6,866
30.3 %2,110 / 6,968
2.9 %112 / 3,894
52.4 %2,666 / 5,085
33.9 %1,991 / 5,874
32.5 %1,919 / 5,903
29.3 %2,244 / 7,665
29.4 %2,083 / 7,082
29.7 %2,127 / 7,169
3.3 %111 / 3,378
30.5 %1,813 / 5,939
30.3 %1,795 / 5,923
28.3 %2,095 / 7,406
26.7 %1,916 / 7,168
28.3 %1,980 / 6,989
2.7 %103 / 3,886
46.2 %2,452 / 5,307
43.4 %2,981 / 6,875
34.5 %2,335 / 6,762
28.0 %2,022 / 7,222
2.3 %88 / 3,822
45.0 %3,018 / 6,702
30.9 %2,144 / 6,948
25.5 %1,872 / 7,339
3.9 %201 / 5,117
30.1 %2,581 / 8,574
26.1 %2,254 / 8,624
3.9 %200 / 5,078
25.9 %2,170 / 8,370
5.0 %243 / 4,897
V.cholerae
N16961
V.cholerae0395 TEDA
V.cholerae0395 TIGR
V.choleraeV52
V.choleraeM66-2MO10
V.choleraeBX330286
V.choleraeRC9
V.choleraeMJ1236
V.choleraeB33VCE
V.cholerae2740-80
V.cholerae
V.cholerae1587
V.choleraeAM-19226
V.choleraeMZO-2
V.cholerae12129
V.choleraeTM11079-80
V.choleraeTMA21
V.choleraeVL426
V.parahaemolyticus2210633
V.parahaemolyticus 16
V.vulnificusCMCP6
V.vulnificusYJ016
V.speciesMED222
V.splendidusLGP32
A.fischeri ES114
A.fischeri MJ11
A.salmonicidaLFI1238
V.speciesEx25
V.campbellii AND4
V.harveyi BAA1116
V.shilonii AK1
P.profundumSS9 V.c
holer
aeN16
961
V.cho
lerae
0395
TEDA
V.cho
lerae
0395
TIGR
V.cho
lerae
V52
V.cho
lerae
M66-2
V.cho
lerae
MO10
V.cho
lerae
BX3302
86
V.cho
lerae
RC9
V.cho
lerae
MJ123
6
V.cho
lerae
B33VCE
V.cho
lerae
2740
-80
V.cho
lerae
1587
V.cho
lerae
AM-192
26
V.cho
lerae
MZO-2
V.cho
lerae
1212
9
V.cho
lerae
TM1107
9-80
V.cho
lerae
TMA21
V.cho
lerae
VL426
V.par
ahae
molytic
us 22
1063
3
V.par
ahae
molytic
us16
V.vuln
ificus
CMCP6
V.vuln
ificus
YJ016
V.spe
cies M
ED222
V.sple
ndidu
sLG
P32
A.fisch
eri ES11
4
A.fisch
eri MJ1
1
A.salm
onici
daLF
I1238
V.spe
cies
Ex25
V.cam
pbell
ii AND4
V.har
veyi
BAA1116
V.shil
onii
AK1
P.pro
fundu
mSS9
Homology within proteomes
6.0 %0.0 %
Homology between proteomes
90.0 %30.0 %
Figure
4BLAST
matrix
ofthe
32Vibrionaceae
genomes.
The
colourshighlighting
thespecies
arethe
sameas
inFig.1.
Sincethe
reciprocalsim
ilarity(reported
aspercent)
isnot
readableatthis
resolution,everymatrix
celliscoloured
usingthe
scalesas
indicated.The
bottomrow
identifieshits
(otherthan
hits-to-self)found
within
age-
nome.
Fourmatrix
cellsreport-
inghigh
pairwise
similarities
areoutlined;
theirnum
bersare
specifiedin
thetext
Origins
ofV.
cholerae7
0M
0.5M1M
1.5M
2M2.
5M
V. cholerae 01 El Tor N16961chromosome 12,961,149 bp
Gap B
Gap A
Gap C
Gap D
Gap G
Gap E
Gap F
0k125k
250k375k
500k625k
750k
8 75k
1000k
V. cholerae 01 El Tor N16961chromosome 21,072,310 bp
Superintegron
P.profundum SS9
V.shilonii AK1
V.harveyi BAA-116
V.campebellii AND4
V.parahaemolyticus 16
V.parahaemolyticus 2210633
Vibrio spp. Ex25
A.salmonicida LF11238
A.fischeri MJ11
A.fischeri ES114
V.splendidus LGP32
V.species MED222
V.vulnificus YJ016
V.vulnificus CMCP6
V.cholerae VL426
V.cholerae 12129
V.cholerae TMA21
V.cholerae TM11079-80
V.cholerae 1587
V.cholerae AM-19226
V.cholerae MZO-2
V.cholerae 2740-80
V.cholerae BX330286
V.cholerae B33VCE
V.cholerae RC9
V.cholerae MJ1236V.cholerae M66-2
V.cholerae V52
V.cholerae MO10
V.cholerae O395 TEDA
V.cholerae 0395 TIGR
V.cholerae N16961
Stacking energy
Position preference
Global direct repeats
GC skew
genes positive strandgenes negatve strand
Outer circle
Inner circle
T. Vesth et al.
0M
0.5M1M
1.5M
2M 2.5M
V. cholerae 01
El T
or N16961
chromosom
e 12,961,149 bp
Gap B
Gap A
Gap C
Gap D
Gap G
Gap E
Gap F
0k
125k
250k375k
500k
6 25k
750k 875k
1000k
V. cholerae 01
El T
or N16961
chromosom
e 21,072,310 bp
Superintegron
P.profundum
SS
9
V.shilonii A
K1
V.harveyi B
AA
-116
V.cam
pebellii AN
D4
V.parahaem
olyticus 16
V.parahaem
olyticus 2210633
Vibrio spp. E
x25
A.salm
onicida LF11238
A.fischeri M
J11
A.fischeri E
S114
V.splendidus LG
P32
V.species M
ED
222
V.vulnificus Y
J016
V.vulnificus C
MC
P6
V.cholerae V
L426
V.cholerae 12129
V.cholerae T
MA
21
V.cholerae T
M11079-80
V.cholerae 1587
V.cholerae A
M-19226
V.cholerae M
ZO
-2
V.cholerae 2740-80
V.cholerae B
X330286
V.cholerae B
33VC
E
V.cholerae R
C9
V.cholerae M
J1236V
.cholerae M66-2
V.cholerae V
52
V.cholerae M
O10
V.cholerae O
395 TE
DA
V.cholerae 0395 T
IGR
V.cholerae N
16961
Stacking energy
Position preference
Global direct repeats
GC
skew
genes positive strandgenes negatve strand
Outer circle
Inner circle
T.Vesth
etal.