specific protein interactions provide a sufficient constraintfor a DNA sequence to be recognized, it will ultimately bepossible to design cell-specific response elements to blocktranscription factors in a specific subset of cells [25,26].
Our findings suggest that the understanding of howspecific functions are discriminated as different blueprintson the DNA might be within our reach.
AcknowledgementsWe apologize for the fact that, because of space constraints, full citation ofall relevant articles is not possible. P. Hogeweg, J. Deschamps and areferee are thanked for their critical reading of the manuscript. This studywas supported in part by HPRN and EU quality of life programs and G.M.is a recipient of a Marie Curie fellowship.
References
1 Claverie, J.M. (2000) From bioinformatics to computational biology.Genome Res. 10, 1277–1279
2 Benos, P.V. et al. (2002) Is there a code for protein–DNA recognition?Probab(ilistical)ly. BioEssays 24, 466–475
3 Hong, W.K. and Sporn, M.B. (1997) Recent advances in chemopreven-tion of cancer. Science 278, 1073–1077
4 Mangelsdorf, D.J. et al. (1994) In The Retinoids: Biology, Chemistryand Medicine (Sporn, M.B. et al., eds), pp. 319–349, Raven Press
5 Durston, A.J. et al. (1998) Retinoids and related signals in earlydevelopment of the vertebrate central nervous system. Curr. Top. Dev.Biol. 40, 111–175
6 Rastinejad, F. (2001) Retinoid X receptor and its partners in thenuclear receptor family. Curr. Opin. Struct. Biol. 11, 33–38
7 Gronemeyer, H. and Miturski, R. (2001) Molecular mechanisms ofretinoid action. Cell. Mol. Biol. Lett. 6, 3–52
8 Kurokawa, R. et al. (1995) Polarity-specific activities of retinoic acidreceptors determined by a co-repressor. Nature 377, 451–454
9 Gould, A. et al. (1998) Initiation of rhombomeric Hoxb4 expressionrequires induction by somites and a retinoid pathway. Neuron 21,39–51
10 Huang, D. et al. (2002) Analysis of two distinct retinoic acid responseelements in the homeobox gene Hoxb1 in transgenic mice. Dev. Dyn.223, 353–370
11 Nolte, C. et al. (2003) The role of a retinoic acid response element inestablishing the anterior neural expression border of Hoxd4 trans-genes. Mech. Dev. 120, 325–335
12 Amores, A. et al. (1998) Zebrafish hox clusters and vertebrate genomeevolution. Science 282, 1711–1714
13 Duboule, D. and Morata, G. (1994) Colinearity and functionalhierarchy among genes of the homeotic complexes. Trends Genet. 10,358–364
14 Krumlauf, R. (1994) Hox genes in vertebrate development. Cell 78,191–201
15 Moroni, M.C. et al. (1993) Regulation of the human HOXD4 gene byretinoids. Mech. Dev. 44, 139–154
16 Popperl, H. and Featherstone, M.S. (1993) Identification of a retinoicacid response element upstream of the murine Hox-4.2 gene. Mol. Cell.Biol. 13, 257–265
17 Morrison, A. et al. (1996) In vitro and transgenic analysis of a humanHOXD4 retinoid-responsive enhancer. Development 122, 1895–1907
18 Langston, A.W. et al. (1997) Retinoic acid-responsive enhancerslocated 30 of the Hox A and Hox B homeobox gene clusters. Functionalanalysis. J. Biol. Chem. 272, 2167–2175
19 Packer, A.I. et al. (1998) Expression of the murine Hoxa4 gene requiresboth autoregulation and a conserved retinoic acid response element.Development 125, 1991–1998
20 Kim, C.B. et al. (2000) Hox cluster genomics in the horn shark,Heterodontus francisci. Proc. Natl. Acad. Sci. U. S. A. 97, 1655–1660
21 Kurokawa, R. et al. (1993) Differential orientations of the DNA-binding domain and carboxy-terminal dimerization interface regulatebinding site selection by nuclear receptor heterodimers. Genes Dev. 7,1423–1435
22 Rastinejad, F. et al. (1995) Structural determinants of nuclear receptorassembly on DNA direct repeats. Nature 375, 203–211
23 Escriva, H. et al. (1997) Ligand binding was acquired during evolutionof nuclear receptors. Proc. Natl. Acad. Sci. U. S. A. 94, 6803–6808
24 Sluder, A.E. et al. (1999) The nuclear receptor superfamily hasundergone extensive proliferation and diversification in nematodes.Genome Res. 9, 103–120
25 Cho, Y.S. et al. (2002) A genomic-scale view of the cAMP responseelement-enhancer decoy: a tumor target-based genetic tool. Proc. Natl.Acad. Sci. U. S. A. 99, 15626–15631
26 Wang, L.H. et al. (2003) The cis decoy against the estrogen responseelement suppresses breast cancer cells via target disrupting c-fos notmitogen-activated protein kinase activity. Cancer Res. 63, 2046–2051
27 Chang, B.E. et al. (1997) Axial (HNF3beta) and retinoic acid receptorsare regulators of the zebrafish sonic hedgehog promoter. EMBO J. 16,3955–3964
28 Zhang, F. et al. (2000) Murine hoxd4 expression in the CNS requiresmultiple elements including a retinoic acid response element. Mech.Dev. 96, 79–89
29 Valcarcel, R. et al. (1994) Retinoid-dependent in vitro transcriptionmediated by the RXR/RAR heterodimer. Genes Dev. 8, 3068–3079
30 Astrom, A. et al. (1994) Retinoic acid induction of human cellularretinoic acid-binding protein-II gene transcription is mediated byretinoic acid receptor-retinoid X receptor heterodimers bound to onefar upstream retinoic acid-responsive element with 5-base pairspacing. J. Biol. Chem. 269, 22334–22339
31 Wingender, E. et al. (2001) The TRANSFAC system on gene expressionregulation. Nucleic Acids Res. 29, 281–283
0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00202-6
Scaling laws in the functional content of genomes
Erik van Nimwegen
Center for Studies in Physics and Biology, the Rockefeller University, 1230 York Avenue, New York, NY 12001, USA
With the number of sequenced genomes now totaling
more than 100, and the availability of rough functional
annotations for a substantial proportion of their genes,
it has become possible to study the statistics of gene
content across genomes. In this article I show that, for
many high-level functional categories, the number of
genes in each category scales as a power-law of the
total number of genes in the genome. The occurrence
of such scaling laws can be explained using a simple
theoretical model, and this model suggests that the
exponents of the observed scaling laws correspond
to universal constants of the evolutionary process.Corresponding author: Erik van Nimwegen ([email protected]).
Update TRENDS in Genetics Vol.19 No.9 September 2003 479
http://tigs.trends.com
I discuss some consequences of these scaling laws for
our understanding of organism design.
What fraction of the gene content of a genome is allotted todifferent functional tasks, and how does this depend on thecomplexity of the organism? Until recently, there weresimply no data to address such questions in a quantitativeway. At present, however, there are more than 100sequenced genomes in public databases (http://www.ncbi.nlm.nih.gov/Genomes/index.htm), and protein-familyclassification algorithms allow functional annotations fora considerable fraction of the genes in each genome. Thus,it has become possible to analyze the statistics offunctional gene-content across different genomes, and inthis article I present results on the dependency of thenumber of genes in different high-level categories on thetotal number of genes in the genome.
Evaluating the functional gene-content of genomes
To estimate the number of genes in different functionalcategories, each genome has to be functionally annotated.The main results presented in this article were obtainedusing the InterPro [1] annotations of sequenced genomesavailable from the European Bioinformatics Institute [2].To map the InterPro annotations to high-level functionalcategories I used the Gene Ontology (GO) ‘biologicalprocess’ hierarchy [3] and a mapping from InterPro entriesto GO categories, both of which can be obtained from theGene Ontology website (http://www.geneontology.org). Foreach GO category I collected all InterPro entries that mapto it or to one of its descendants in the ‘biological process’hierarchy. To minimize the effects of potential bias inthe mappings from InterPro to GO I only used high-level functional categories that are represented by atleast 50 different InterPro entries. This leaves 44 high-level GO categories.
A gene with multiple ‘hits’ to InterPro entries that areassociated with a GO category has a higher probability ofbelonging to that category than a gene with only a singlehit. To take this information into account, I assumed that ifa gene i has nc
i independent hits to InterPro entriesassociated with GO category c, then with probability 1 2
expð2bnci Þ the gene belongs to this particular GO category.
The results are generally insensitive to the value of b
(Box 1), and I used b ¼ 3 for the results shown below. Theestimated number of genes nc in a genome for a given GOcategory is then the sum nc ¼
Pi 1 2 expð2bnc
i Þ over allgenes i in the genome.
The results for the categories of transcription regulat-ory genes, metabolic genes and cell-cycle-related genes forbacteria and eukaryotes are shown in Fig. 1. Remarkably,for each functional category shown, we find an approxi-mately power-law relationship (solid line fits)*. That is, ifnc is the number of genes in the category, and g is the totalnumber of genes in the genome, we observe laws of theform: nc ¼ lga; where both l and the exponent a dependon the category under investigation. In fact, such
power-laws are observed for most of the 44 high-levelcategories, and the estimated values of the exponentsfor several functional categories are shown in Table 1.A potential source of bias in estimating these expo-nents is the occurrence of multiple genomes from thesame or closely related species in the data. Removingthis ‘redundancy’ from the data does not alter theobserved exponents (Box 1).
The power-law fits and 99% posterior probabilityestimates for their exponents were obtained using aBayesian procedure described in Box 1. To assess thequality of the fits, I measured, for each fit, the fraction ofthe variance in the data that is explained by the fit. Inbacteria, 26 out of 44 GO categories have more than 95% ofthe variance explained by the fit, and 38 categories havemore than 90% explained by the fit. In eukaryotes, 26categories have more than 95% explained by the fit and 32categories have more than 90% explained by the fit.However, with total gene number varying by less than afactor of 20 in bacteria, and the small number of datapoints in eukaryotes, one might wonder how one can claimthat power-laws have been ‘observed’. First, the fact that Ifit the data to power-laws should not be mistaken for aclaim that the data can only be described by power-lawfunctions. I only claim that the power-law is by far thesimplest functional form that fits almost all the observeddata. Second, when scatter plots such as those shown inFig. 1 are plotted on linear as opposed to logarithmicscales, it is clear even by eye that the fluctuations in thenumber of genes in a category scale with the total numberof genes in the genome. That is, the fluctuations in the datasuggest that logarithmic scales are the ‘natural’ scales forthese data. This is further supported by the simpleevolutionary model presented below.
One might also wonder to what extent the results aresensitive to the specific functional annotation procedure. Iperformed a variety of tests to assess the robustness of theresults (i.e. the observed power-law scaling and the valuesof the exponents) to changes in the annotation method-ology (Box 1). These involve using entirely independentannotations based on clusters of orthologous groups ofproteins (COGs) [7], and a simple (crude) annotationscheme based on keyword searches of protein tables forsequenced microbial genomes from the NCBI website(ftp.ncbi.nlm.nih.gov/genomes/bacteria). As shown inBox 1, the observed power-law scaling and the values ofthe exponents are generally insensitive to these and otherchanges in annotation methodology. However, because allcurrently available annotation schemes assume at somelevel that functional homology can be inferred fromsequence homology, our results implicitly depend on thisassumption as well.
Observed exponents
Some functional categories, such as the large category ofmetabolic genes, occupy a roughly constant fraction of thegene content of a genome, as evidenced by their exponentof 1. However, many categories show significant deviationsfrom this ‘trivial’ exponent. Genes related to cell-cycle orprotein biosynthesis have exponents significantly below 1,whereas for transcription factors (TFs) the exponent is
* These power-laws as a function of total gene number should be distinguished fromthe power-law distributions of gene-family sizes and other genomic attributes within asingle genome [4,5].
Update TRENDS in Genetics Vol.19 No.9 September 2003480
http://tigs.trends.com
Box 1. Appendix
Results for Archaea
Figure I shows the number of transcription regulatory genes, metabolic
genes and cell-cycle-related genes as a function of the total number of
genes in archaeal genomes. Note that the size of the largest archaeal
genome differs from that of the smallest by only a factor of ,3.
Consequently, there is large uncertainty regarding the values of the
exponents for archaea. The maximum likelihood values and 99%
posterior probability intervals are 2.1 and [1.3–5.7] for transcription
regulatory genes, 0.81 and [0.44–1.48] for metabolic genes, and 0.83
and [0.54–1.35] for cell-cycle-related genes.
Power-law fittingA power-law relation of the form y ¼ cxa translates into a linear
relation between the logarithms of the variables x and y
[i.e. logðyÞ ¼ logðcÞ þ alogðxÞ]. Thus, power-law fitting translates into
fitting a straight line to the log-transformed data. The power-law fits
shown in Fig. 1 of the main text were obtained using a Bayesian
straight-line fitting procedure. For each gene ontology (GO) category, I
log-transformed the data such that each data point ðxi ; yi Þ corresponds
to the logarithm xi of the number of genes in the genome and the
logarithm yi of the number of genes in the category. I assumed that
these transformed data were drawn from a linear model:
y ¼ aðx þ 1Þ þ lþ h; ½Eqn I�
whereh and 1 are ‘noise’ terms in the x- and y-coordinates, respectively,
and l is an unknown off-set. I assumed that the joint-distribution Pðh; 1Þ
is a two-dimensional Gaussian with means zero and unknown variances
and co-variance. That is, I used scale invariant priors for the variances,
and integrated these nuisance variables out of the likelihood. A uniform
prior was used for the location parameter l, and for the slope a I used a
rotation invariant prior:
PðaÞda ¼da
ð1 þ a2Þ3=2: ½Eqn II�
The use of these priors guarantees that the results are invariant under all
shifts and rotations of the plane.
Integrating over all variables except for a, we obtain for the posterior
Pða=DÞ given the data D :
Pða=DÞda ¼ Cða2 þ 1Þðn23Þ=2da
ða2Sxx 2 2asyx þ syy Þðn21Þ=2
; ½Eqn III�
where n is the number of genomes in the data, sxx is the variance in
x-values (logarithms of the total gene numbers), syy is the variance in
y-values (logarithms of the number of genes in the category), syx is the
co-variance, and C is a normalizing constant. The values of the
exponents reported in Table I are the values of a that maximize
Pða=DÞ; and the boundaries of the 99% posterior probability interval
around it.
For each fit, I also measured what fraction of the variance in the data is
explained by the fit. That is, I compared the average distance d of the
points in the plane to their center of mass with the average distance dl of
the points to the fitted line and defined the fraction q ¼ 1 2 dl =d as the
variance in the data explained by the fit.
Robustness of the resultsFirst, I checked that the total amount of available annotation information
is not itself dependent on genome size. If the total amount of available
annotation information were to vary with the number of genes in the
genome, this could lead to biases in the estimated exponents. To
exclude this possibility, I counted the total number of genes with any
InterPro hits in each genome and found that, for both bacteria and
eukaryotes, the fraction of genes in the genome with InterPro hits is
approximately two-thirds, independent of the total number of genes in
the genome. Consistent with this observation, when one fits power-
laws to the number of genes nc in a GO category c as a function of the
total number of annotated genes in the genome, one finds exponents
that are very close to those found for nc as a function of the total number
of genes in the genome.
Second, I tested that the results are insensitive to the value of the
parameter b: The default value b ¼ 3 gives a gene with a single InterPro
hit a probability of 1 2 e23 < 0:95 to belong to the category. This is
reasonable because InterPro is designed to only report statistically
significant hits. To assess the effect of changing b, results forb ¼ 1 were
generated. For bacteria the change in fitted exponent is less than 5% for
26 of 44 categories, and less than 10% for 39 categories. For eukaryotes
the exponent changes by less than 5% for all but three categories. In all
cases, the change in fitted exponent is significantly smaller than the 99%
posterior probability intervals associated with the exponents of the fits.
Third, I tested the robustness of the results against removal of
potential ‘redundancies’ in the data. For bacteria there are several
examples where multiple genomes of the same species or genomes of
very closely related species occur in the data, and one might suspect
Fig. I. The number of transcription regulatory genes (red), metabolic genes
(blue), and cell-cycle-related genes (green) as a function of the total number of
genes in archaea. Both axes are shown on logarithmic scales. Each dot corre-
sponds to a genome. The straight lines are power-law fits.
TRENDS in Genetics
1500 2000 3000 5000
Genes in genome
20
50
100
200
500
1000
2000
Gen
es in
cat
egor
y
Archaea
Table I. Estimated 99% posterior probability intervals for the
scaling exponents obtained with three different annotation
schemesa
Annotation Category Exponent
GO Protein biosynthesis 0.11–0.15
COG Translation, ribosomal structure and
biogenesis
0.21–0.37
NCBI Protein biosynthesis 0.09–0.15
GO Signal transduction 1.55–1.90
COG Signal transduction mechanisms 1.66–2.14
GO Protein metabolism and modification 0.68–0.80
GO Protein degradation 0.89–1.06
COG Posttranslational modification, protein
turnover, chaperones
0.88–1.15
NCBI Protein degradation 0.68–0.91
GO Cell cycle 0.39–0.54
GO DNA repair 0.52–1.14
COG Replication, recombination and repair 0.66–0.83
NCBI Cell cycle 0.45–0.64
GO Ion transport 1.15–1.70
COG Inorganic ion transport and metabolism 1.19–1.47
NCBI Ion transport 1.12–1.88
GO Regulation of transcription 1.74–2.00
COG Transcription regulation 1.69–2.36
NCBI Transcription 1.90–2.42
GO Kinase 0.96–1.16
NCBI Kinase 0.80–1.03
GO Transport 1.08–1.32
NCBI Transport 1.16–1.50
aAbbreviations: COG, clusters of orthologous groups of proteins; GO, gene
ontology.
Update TRENDS in Genetics Vol.19 No.9 September 2003 481
http://tigs.trends.com
significantly above 1. These trends are strongest inbacteria, where the exponent for TFs is almost 2, implyingthat as the number of genes in the genome doubles, thenumber of TFs quadruples. This has some interestingimplications for ‘regulatory design’ in bacteria. It impliesthat the number of TFs per gene grows in proportionto the size of the genome (see [6] for a similarobservation). This in turn implies that, in largergenomes, each gene must be regulated by a largernumber of TFs and/or each TF must be regulating asmaller set of genes. An exponent of 2 is also observedfor two-component systems, which are the primarymeans by which bacteria sense their environment†.This suggests that the relative increase in transcriptionregulators in more complex bacteria is accompanied by anequal relative increase in sensory systems.
The difficulties with gene prediction and annotation ineukaryotes, the small number of available genomes, andour lack of understanding of the role of alternative splicingacross eukaryotic genomes make it premature to drawmany conclusions from Fig. 1b. However, the main trendsfrom Fig. 1a are reproduced: the super-linear scaling ofTFs, the sub-linear scaling of cell-cycle genes, and thesmall exponents for DNA replication and protein biosyn-thesis genes.
The observed super-linear scaling of TFs also hasimplications for our understanding of ‘combinatorialcontrol’ in transcription regulation. It is well establishedthat in complex organisms, different TFs combine intocomplexes to affect transcription control. Therefore, arelatively small number of TFs can implement a combi-natorially large number of different transcriptionregulatory ‘states’, which might correspond to particularexternal environments, developmental stages, tissues or
combinations of external stimuli. Each such regulatorystate will be associated with a unique set of genes that isexpressed in that state. If the number of such regulatorystates were proportional to the total number of genes, thenthe number of TFs would increase more slowly than thetotal number of genes. However, the scaling results showthat, instead, the number of TFs increases more rapidlythan does the total number of genes. This implies that thenumber of regulatory states is also ‘combinatorial’ in thetotal number of genes: a relatively small number of genesis used in different combinations to implement combina-torially many regulatory states.
The picture that emerges is not one of TFs being used indifferent combinations to implement the regulatory needsof individual genes. But rather that, as one moves fromsimple to more complex organisms, the number ofregulatory states grows so much faster than the totalnumber of genes such that, even with combinatorialcontrol of transcription, the number of TFs grows muchfaster than the total number genes.
Evolutionary model
One of course wonders about the origins of these scalinglaws in genome organization, and I would like to presentsome speculations in this regard. Assume that mostchanges in the number of genes nc in a functional categoryc are caused by duplications and deletions. Then, ncðtÞgenerally evolves according to the equation:
dncðtÞ
dt¼ ðbðtÞ2 dðtÞÞncðtÞ ¼ rðtÞncðtÞ; ½Eqn 1�
with bðtÞ and dðtÞ; respectively, the duplication rate anddeletion rate of the genes in this category at time t in theevolutionary history of the genome. For simplicity ofnotation I have introduced the difference of duplicationand deletion rates rðtÞ; which can be thought of as an
that these could bias the results. To this end, I parsed the names of all
bacterial species into a general and a specific part, for example, for
Escherichia coli, ‘Escherichia’ is the general part and ‘coli’ is the specific
part, for Listeria innocua, ‘Listeria’ is the general part and ‘innocua’ is the
specific part, and so on. Groups of genomes with the same general part
were then collected together and for each group the gene numbers were
replaced with a single average of total gene number and average gene
counts in each of the functional categories. This reduces the size of the
dataset by approximately a third. Power-law fitting was then applied to
this reduced set and the fitted exponents were compared with those of
the full dataset. The exponent was changed by less than 5% in 31 out of
44 categories. The largest observed change was an 18% change in the
exponent. All changes to the exponents were well within their 99%
posterior probability intervals.
More importantly, the results could depend on the use of InterPro,
Gene Ontology, and the mapping of InterPro entries to GO categories.
To test the robustness of the results to bias inherent in InterPro
annotation and/or the mapping from InterPro to Gene Ontology, I
analyzed selected functional categories using two other annotation
schemes.
The first is based on clusters of orthologous groups of proteins
(COGs) [7] annotation of 63 bacterial genomes that can be obtained from
the NCBI database (ftp://ftp.ncbi.nih.gov/pub/COG/COG). In this data-
set, proteins of the 63 bacterial genomes are assigned to COGs, and the
COGs have been assigned to functional categories. I used these
assignments to count the number of genes in different functional
categories according to the COG annotation scheme. A comparison of
the exponents for COG functional categories and the exponents for the
closest GO categories obtained using the InterPro annotations is shown
in Table I.
The second alternative annotation scheme I used is based on simple
keyword searches of protein tables for fully sequenced bacterial
genomes available from the NCBI ftp site. (ftp.ncbi.nlm.nih.gov/
genomes/Bacteria). Removing genomes for which little or no annota-
tion exists, this leaves protein tables for 90 bacterial genomes. Each
protein in these tables is annotated with a short description line. The
number of genes in different functional categories was counted by
searching each description line for hits to a set of keywords that
characterize the category. For instance, I chose the keywords ‘ribosom’,
‘translation’ and ‘tRNA’ for the category ‘protein biosynthesis’, and a
gene is counted as belonging to this category if any of these keywords
occurs in its description line. For the other categories I used the
following keywords: ‘transcription’ for the category ‘transcription’;
‘transport’, ‘channel’, ‘efflux’, ‘pump’, ‘porin’, ‘export ‘, ‘permease’,
‘symport’, ‘transloca’ and ‘PTS’ for the category ‘transport’; all
combinations ‘X Y’ with ‘X’ being one of ‘ion’, ‘sodium’, ‘calcium’,
‘potassium’, ‘magnesium’ and ‘manganese’ and ‘Y’ being one of
‘channel’, ‘efflux’, ‘transport’ and ‘uptake’ for the category ‘ion
transport’; ‘protease’ and ‘peptidase’ for the category ‘protein degra-
dation’; ‘kinase’ for the category ‘kinase’; and finally the phrases ‘DNA
polymerase’, ‘topoisomerase’, ‘DNA gyrase’, ‘DNA ligase’, ‘replication’,
‘helicase’, ‘DNA primase’, ‘DNA repair’, ‘cell division’ and ‘septum’ for
the category ‘cell cycle’. The exponents resulting from this (crude)
annotation scheme are also shown in Table I. There is good quantitative
agreement between the exponents that are obtained with the different
annotation schemes.
† Note that two-component systems were not in the list of 44 high-level categories.
Update TRENDS in Genetics Vol.19 No.9 September 2003482
http://tigs.trends.com
‘effective’ duplication rate. This rate rðtÞ is presumablyproportional to the difference between the averageprobability that selection will favor fixation of a duplicatedgene from this category and the average probabilitythat selection will favor deletion of a gene from thiscategory. Similarly, the total number of genes gðtÞ obeysthe equation:
dgðtÞ
dt¼ gðtÞgðtÞ; ½Eqn 2�
with gðtÞ the overall effective rate of gene duplication in thegenome at time t in its evolutionary history. When we solvefor nc as a function of g we find:
nc ¼ lgkrl=kgl; ½Eqn 3�
where krl and kgl are the mean effective duplication rates ofgenes in category c and the entire genome, respectively,averaged over the evolutionary history of the genome, andl is a constant that depends on the boundary conditions. Inorder for all bacterial genomes to obey the same functionalrelation, the constant l and the ratios krl=kgl have to be thesame for all bacterial evolutionary lineages. Because alllife shares a common ancestor, the boundary conditions forEqns 1 and 2 are trivially the same for all bacteriallineages, implying that the constant l is indeed the same
for all bacterial lineages. In summary, simply assumingthat changes in gene-number occur mostly throughduplications and deletions implies our observed power-law scaling if the ratios krl=kgl are the same for allevolutionary lineages.
I thus propose that the explanation for the observedscaling laws is that the ratios krl=kgl are indeed the samefor all bacterial lineages. That is, these ratios of averageduplication rates are ‘universal constants’ of the evolu-tionary process. For instance, the exponent 2 for TFs inbacteria indicates that, in all bacterial lineages, evolutionselects duplicated TFs twice as frequently as duplicatedgenes in general. It seems likely that such universalconstants are intimately connected to fundamental designprinciples of the evolutionary process. It is tempting tobecome even more speculative in this regard, and suggestthat this factor of 2 in duplication rate is related to theswitch-like function of transcription factors: with eachaddition of a TF the number of transcription regulatorystates of the cell doubles. It is not entirely implausible toassume that with twice the number of internal statesavailable, the probability of such a duplication being fixed
Fig. 1. The number of transcription regulatory genes (red), metabolic genes (blue),
and cell-cycle-related genes (green) as a function of the total number of genes in
the genome for bacteria (a) and eukaryotes (b). Both axes are shown on logarith-
mic scales. Each dot corresponds to a genome. The straight lines are power-law
fits. Archaea are not shown because the range of genome sizes in archaea is too
small for meaningful fits. For completeness the archaeal results are shown in Fig. I
in Box 1.
500 1000 2000 5000 10000Genes in genome
Genes in genome
10
100
1000
Bacteria
500 1000 2000 5000 10000 20000
10
100
1000
10000
Gen
es in
cat
egor
yG
enes
in c
ateg
ory
Eukaryotes
(a)
(b)
TRENDS in Genetics
Table 1. Estimates for the exponents of a selection of functional
categoriesa
Category Bacteria Eukaryotes
Transcription regulation 1.87 ^ 0.13 1.26 ^ 0.10
Metabolism 1.01 ^ 0.06 1.01 ^ 0.08
Cell cycle 0.47 ^ 0.08 0.79 ^ 0.16
Signal transduction 1.72 ^ 0.18 1.48 ^ 0.39
DNA repair 0.64 ^ 0.08 0.83 ^ 0.31
DNA replication 0.43 ^ 0.08 0.72 ^ 0.23
Protein biosynthesis 0.13 ^ 0.02 0.41 ^ 0.15
Protein degradation 0.97 ^ 0.09 0.90 ^ 0.11
Ion transport 1.42 ^ 0.28 1.43 ^ 0.20
Catabolism 0.88 ^ 0.07 0.92 ^ 0.08
Carbohydrate metabolism 1.01 ^ 0.11 1.36 ^ 0.36
Two-component systems 2.07 ^ 0.21 NAb
Cell communication 1.81 ^ 0.19 1.58 ^ 0.34
Defense response NAb 3.35 ^ 1.41
aThe first number gives the maximum likelihood estimate of the exponent and the
second number indicates the boundaries of the 99% posterior probability interval.bNA, not analyzed.
Fig. 2. The number of fully sequenced genomes in the NCBI database (http://www.
ncbi.nlm.nih.gov/Genomes/index.htm) as a function of time. The vertical axis is
shown on a logarithmic scale. The straight line is a least squares fit to an exponen-
tial function: nðtÞ ¼ 2ðt21994Þ=1:3 :
TRENDS in Genetics
1996 1998 2000 2002 2004Year
1
2
5
10
20
50
100
200
Num
ber
of g
enom
es
Growth of the number of sequenced genomes
Update TRENDS in Genetics Vol.19 No.9 September 2003 483
http://tigs.trends.com
in evolution is twice as large as the probability of fixing aduplicated gene that does not double the number ofinternal states of the cell.
Finally, as Table 1 shows, there is still substantialuncertainty about the exact numerical values of theexponents given the current data, and many moregenomes are needed to estimate these values moreaccurately. A survey of the NCBI genome database(http://www.ncbi.nlm.nih.gov/Genomes/index.htm) showsthat the number of sequenced genomes is increasingexponentially, with a doubling time of,16 months (Fig. 2).
This suggests that within a few years thousands ofgenomes will become available. With such an increase inavailable data it will become possible to look at much morefine-grained gene-content statistics than those presentedhere. One can imagine, for instance, going beyond lookingat single functional categories at a time, and investi-gate if there are correlations in the variations of genenumber in more fine-grained functional categories. Ibelieve that such investigations have the potential to
teach us much about the functional design principles ofthe evolutionary process.
References
1 Apweiler, R. et al. (2001) The InterPro database, an integrateddocumentation resource for protein families, domains and functionalsites. Nucleic Acids Res. 29, 37–40
2 Apweiler, R. et al. (2001) Proteome Analysis Database: onlineapplication of InterPro and CluSTr for the functional classification ofproteins in whole genomes. Nucleic Acids Res. 29, 44–48
3 The Gene Ontology Consortium (2000) Gene ontology: tool for theunification of biology. Nat. Genet. 25, 25–29
4 Huynen, M.A. and van Nimwegen, E. (1998) The frequency distributionof gene family sizes in complete genomes. Mol. Biol. Evol. 15, 583–589
5 Luscombe, N.M. et al. (2002) The dominance of the population by aselected few: power-law behaviour applies to a wide variety of genomicproperties. Genome Biol. 3, RESEARCH0040
6 Stover, C.K. et al. (2000) Complete genome sequence of Pseudomonasaeruginosa PA01, an opportunistic pathogen. Nature 406, 959–964
7 Tatusov, R.L. et al. (1997) A genomic perspective on protein families.Science 278, 631–637
0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00203-8
|Errata
Erratum: FlyMove – a new way to look at developmentof DrosophilaTrends in Genetics 19 (2003), 310–311
In the article by Weigmann et al., which was published inthe June issue of TIG, there was an error regarding thecontribution of the authors. Thomas Strasser, KatrinWeigmann and Robert Klapper contributed equally tothe article. TIG apologises to the authors and readers
for this error. doi of original article: 10.1016/S0168-9525(03)00050-7.
0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00198-7
Erratum: Control of protein degradation by E3ubiquitin ligases in Drosophila eye developmentTrends in Genetics 19 (2003), 382–389
In the article by Ou et al., which was published in theJuly issue of TIG, there was an error regarding thename of the corresponding author. Cheng-Ting Chien([email protected]) is the corresponding authorfor this article. TIG apologises to the authors and
readers for this production error. doi of original article:10.1016/S0168-9525(03)00146-X.
0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00199-9
Update TRENDS in Genetics Vol.19 No.9 September 2003484
http://tigs.trends.com