Scaling laws in the functional content of genomes

specific protein interactions provide a sufficient constraintfor a DNA sequence to be recognized, it will ultimately bepossible to design cell-specific response elements to blocktranscription factors in a specific subset of cells [25,26].

Our findings suggest that the understanding of howspecific functions are discriminated as different blueprintson the DNA might be within our reach.

AcknowledgementsWe apologize for the fact that, because of space constraints, full citation ofall relevant articles is not possible. P. Hogeweg, J. Deschamps and areferee are thanked for their critical reading of the manuscript. This studywas supported in part by HPRN and EU quality of life programs and G.M.is a recipient of a Marie Curie fellowship.

References

1 Claverie, J.M. (2000) From bioinformatics to computational biology.Genome Res. 10, 1277–1279

2 Benos, P.V. et al. (2002) Is there a code for protein–DNA recognition?Probab(ilistical)ly. BioEssays 24, 466–475

3 Hong, W.K. and Sporn, M.B. (1997) Recent advances in chemopreven-tion of cancer. Science 278, 1073–1077

4 Mangelsdorf, D.J. et al. (1994) In The Retinoids: Biology, Chemistryand Medicine (Sporn, M.B. et al., eds), pp. 319–349, Raven Press

5 Durston, A.J. et al. (1998) Retinoids and related signals in earlydevelopment of the vertebrate central nervous system. Curr. Top. Dev.Biol. 40, 111–175

6 Rastinejad, F. (2001) Retinoid X receptor and its partners in thenuclear receptor family. Curr. Opin. Struct. Biol. 11, 33–38

7 Gronemeyer, H. and Miturski, R. (2001) Molecular mechanisms ofretinoid action. Cell. Mol. Biol. Lett. 6, 3–52

8 Kurokawa, R. et al. (1995) Polarity-specific activities of retinoic acidreceptors determined by a co-repressor. Nature 377, 451–454

9 Gould, A. et al. (1998) Initiation of rhombomeric Hoxb4 expressionrequires induction by somites and a retinoid pathway. Neuron 21,39–51

10 Huang, D. et al. (2002) Analysis of two distinct retinoic acid responseelements in the homeobox gene Hoxb1 in transgenic mice. Dev. Dyn.223, 353–370

11 Nolte, C. et al. (2003) The role of a retinoic acid response element inestablishing the anterior neural expression border of Hoxd4 trans-genes. Mech. Dev. 120, 325–335

12 Amores, A. et al. (1998) Zebrafish hox clusters and vertebrate genomeevolution. Science 282, 1711–1714

13 Duboule, D. and Morata, G. (1994) Colinearity and functionalhierarchy among genes of the homeotic complexes. Trends Genet. 10,358–364

14 Krumlauf, R. (1994) Hox genes in vertebrate development. Cell 78,191–201

15 Moroni, M.C. et al. (1993) Regulation of the human HOXD4 gene byretinoids. Mech. Dev. 44, 139–154

16 Popperl, H. and Featherstone, M.S. (1993) Identification of a retinoicacid response element upstream of the murine Hox-4.2 gene. Mol. Cell.Biol. 13, 257–265

17 Morrison, A. et al. (1996) In vitro and transgenic analysis of a humanHOXD4 retinoid-responsive enhancer. Development 122, 1895–1907

18 Langston, A.W. et al. (1997) Retinoic acid-responsive enhancerslocated 30 of the Hox A and Hox B homeobox gene clusters. Functionalanalysis. J. Biol. Chem. 272, 2167–2175

19 Packer, A.I. et al. (1998) Expression of the murine Hoxa4 gene requiresboth autoregulation and a conserved retinoic acid response element.Development 125, 1991–1998

20 Kim, C.B. et al. (2000) Hox cluster genomics in the horn shark,Heterodontus francisci. Proc. Natl. Acad. Sci. U. S. A. 97, 1655–1660

21 Kurokawa, R. et al. (1993) Differential orientations of the DNA-binding domain and carboxy-terminal dimerization interface regulatebinding site selection by nuclear receptor heterodimers. Genes Dev. 7,1423–1435

22 Rastinejad, F. et al. (1995) Structural determinants of nuclear receptorassembly on DNA direct repeats. Nature 375, 203–211

23 Escriva, H. et al. (1997) Ligand binding was acquired during evolutionof nuclear receptors. Proc. Natl. Acad. Sci. U. S. A. 94, 6803–6808

24 Sluder, A.E. et al. (1999) The nuclear receptor superfamily hasundergone extensive proliferation and diversification in nematodes.Genome Res. 9, 103–120

25 Cho, Y.S. et al. (2002) A genomic-scale view of the cAMP responseelement-enhancer decoy: a tumor target-based genetic tool. Proc. Natl.Acad. Sci. U. S. A. 99, 15626–15631

26 Wang, L.H. et al. (2003) The cis decoy against the estrogen responseelement suppresses breast cancer cells via target disrupting c-fos notmitogen-activated protein kinase activity. Cancer Res. 63, 2046–2051

27 Chang, B.E. et al. (1997) Axial (HNF3beta) and retinoic acid receptorsare regulators of the zebrafish sonic hedgehog promoter. EMBO J. 16,3955–3964

28 Zhang, F. et al. (2000) Murine hoxd4 expression in the CNS requiresmultiple elements including a retinoic acid response element. Mech.Dev. 96, 79–89

29 Valcarcel, R. et al. (1994) Retinoid-dependent in vitro transcriptionmediated by the RXR/RAR heterodimer. Genes Dev. 8, 3068–3079

30 Astrom, A. et al. (1994) Retinoic acid induction of human cellularretinoic acid-binding protein-II gene transcription is mediated byretinoic acid receptor-retinoid X receptor heterodimers bound to onefar upstream retinoic acid-responsive element with 5-base pairspacing. J. Biol. Chem. 269, 22334–22339

31 Wingender, E. et al. (2001) The TRANSFAC system on gene expressionregulation. Nucleic Acids Res. 29, 281–283

0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00202-6

Scaling laws in the functional content of genomes

Erik van Nimwegen

Center for Studies in Physics and Biology, the Rockefeller University, 1230 York Avenue, New York, NY 12001, USA

With the number of sequenced genomes now totaling

more than 100, and the availability of rough functional

annotations for a substantial proportion of their genes,

it has become possible to study the statistics of gene

content across genomes. In this article I show that, for

many high-level functional categories, the number of

genes in each category scales as a power-law of the

total number of genes in the genome. The occurrence

of such scaling laws can be explained using a simple

theoretical model, and this model suggests that the

exponents of the observed scaling laws correspond

to universal constants of the evolutionary process.Corresponding author: Erik van Nimwegen ([email protected]).

Update TRENDS in Genetics Vol.19 No.9 September 2003 479

http://tigs.trends.com

http://www.trends.com

I discuss some consequences of these scaling laws for

our understanding of organism design.

What fraction of the gene content of a genome is allotted todifferent functional tasks, and how does this depend on thecomplexity of the organism? Until recently, there weresimply no data to address such questions in a quantitativeway. At present, however, there are more than 100sequenced genomes in public databases (http://www.ncbi.nlm.nih.gov/Genomes/index.htm), and protein-familyclassification algorithms allow functional annotations fora considerable fraction of the genes in each genome. Thus,it has become possible to analyze the statistics offunctional gene-content across different genomes, and inthis article I present results on the dependency of thenumber of genes in different high-level categories on thetotal number of genes in the genome.

Evaluating the functional gene-content of genomes

To estimate the number of genes in different functionalcategories, each genome has to be functionally annotated.The main results presented in this article were obtainedusing the InterPro [1] annotations of sequenced genomesavailable from the European Bioinformatics Institute [2].To map the InterPro annotations to high-level functionalcategories I used the Gene Ontology (GO) ‘biologicalprocess’ hierarchy [3] and a mapping from InterPro entriesto GO categories, both of which can be obtained from theGene Ontology website (http://www.geneontology.org). Foreach GO category I collected all InterPro entries that mapto it or to one of its descendants in the ‘biological process’hierarchy. To minimize the effects of potential bias inthe mappings from InterPro to GO I only used high-level functional categories that are represented by atleast 50 different InterPro entries. This leaves 44 high-level GO categories.

A gene with multiple ‘hits’ to InterPro entries that areassociated with a GO category has a higher probability ofbelonging to that category than a gene with only a singlehit. To take this information into account, I assumed that ifa gene i has nc

i independent hits to InterPro entriesassociated with GO category c, then with probability 1 2

expð2bnci Þ the gene belongs to this particular GO category.

The results are generally insensitive to the value of b

(Box 1), and I used b ¼ 3 for the results shown below. Theestimated number of genes nc in a genome for a given GOcategory is then the sum nc ¼

Pi 1 2 expð2bnc

i Þ over allgenes i in the genome.

The results for the categories of transcription regulat-ory genes, metabolic genes and cell-cycle-related genes forbacteria and eukaryotes are shown in Fig. 1. Remarkably,for each functional category shown, we find an approxi-mately power-law relationship (solid line fits)*. That is, ifnc is the number of genes in the category, and g is the totalnumber of genes in the genome, we observe laws of theform: nc ¼ lga; where both l and the exponent a dependon the category under investigation. In fact, such

power-laws are observed for most of the 44 high-levelcategories, and the estimated values of the exponentsfor several functional categories are shown in Table 1.A potential source of bias in estimating these expo-nents is the occurrence of multiple genomes from thesame or closely related species in the data. Removingthis ‘redundancy’ from the data does not alter theobserved exponents (Box 1).

The power-law fits and 99% posterior probabilityestimates for their exponents were obtained using aBayesian procedure described in Box 1. To assess thequality of the fits, I measured, for each fit, the fraction ofthe variance in the data that is explained by the fit. Inbacteria, 26 out of 44 GO categories have more than 95% ofthe variance explained by the fit, and 38 categories havemore than 90% explained by the fit. In eukaryotes, 26categories have more than 95% explained by the fit and 32categories have more than 90% explained by the fit.However, with total gene number varying by less than afactor of 20 in bacteria, and the small number of datapoints in eukaryotes, one might wonder how one can claimthat power-laws have been ‘observed’. First, the fact that Ifit the data to power-laws should not be mistaken for aclaim that the data can only be described by power-lawfunctions. I only claim that the power-law is by far thesimplest functional form that fits almost all the observeddata. Second, when scatter plots such as those shown inFig. 1 are plotted on linear as opposed to logarithmicscales, it is clear even by eye that the fluctuations in thenumber of genes in a category scale with the total numberof genes in the genome. That is, the fluctuations in the datasuggest that logarithmic scales are the ‘natural’ scales forthese data. This is further supported by the simpleevolutionary model presented below.

One might also wonder to what extent the results aresensitive to the specific functional annotation procedure. Iperformed a variety of tests to assess the robustness of theresults (i.e. the observed power-law scaling and the valuesof the exponents) to changes in the annotation method-ology (Box 1). These involve using entirely independentannotations based on clusters of orthologous groups ofproteins (COGs) [7], and a simple (crude) annotationscheme based on keyword searches of protein tables forsequenced microbial genomes from the NCBI website(ftp.ncbi.nlm.nih.gov/genomes/bacteria). As shown inBox 1, the observed power-law scaling and the values ofthe exponents are generally insensitive to these and otherchanges in annotation methodology. However, because allcurrently available annotation schemes assume at somelevel that functional homology can be inferred fromsequence homology, our results implicitly depend on thisassumption as well.

Observed exponents

Some functional categories, such as the large category ofmetabolic genes, occupy a roughly constant fraction of thegene content of a genome, as evidenced by their exponentof 1. However, many categories show significant deviationsfrom this ‘trivial’ exponent. Genes related to cell-cycle orprotein biosynthesis have exponents significantly below 1,whereas for transcription factors (TFs) the exponent is

* These power-laws as a function of total gene number should be distinguished fromthe power-law distributions of gene-family sizes and other genomic attributes within asingle genome [4,5].

Update TRENDS in Genetics Vol.19 No.9 September 2003480


http://www.ncbi.nlm.nih.gov/Genomes/index.htm


http://www.geneontology.org

ftp://ftp.ncbi.nlm.nih.gov/genomes/bacteria


Box 1. Appendix

Results for Archaea

Figure I shows the number of transcription regulatory genes, metabolic

genes and cell-cycle-related genes as a function of the total number of

genes in archaeal genomes. Note that the size of the largest archaeal

genome differs from that of the smallest by only a factor of ,3.

Consequently, there is large uncertainty regarding the values of the

exponents for archaea. The maximum likelihood values and 99%

posterior probability intervals are 2.1 and [1.3–5.7] for transcription

regulatory genes, 0.81 and [0.44–1.48] for metabolic genes, and 0.83

and [0.54–1.35] for cell-cycle-related genes.

Power-law fittingA power-law relation of the form y ¼ cxa translates into a linear

relation between the logarithms of the variables x and y

[i.e. logðyÞ ¼ logðcÞ þ alogðxÞ]. Thus, power-law fitting translates into

fitting a straight line to the log-transformed data. The power-law fits

shown in Fig. 1 of the main text were obtained using a Bayesian

straight-line fitting procedure. For each gene ontology (GO) category, I

log-transformed the data such that each data point ðxi ; yi Þ corresponds

to the logarithm xi of the number of genes in the genome and the

logarithm yi of the number of genes in the category. I assumed that

these transformed data were drawn from a linear model:

y ¼ aðx þ 1Þ þ lþ h; ½Eqn I�

whereh and 1 are ‘noise’ terms in the x- and y-coordinates, respectively,

and l is an unknown off-set. I assumed that the joint-distribution Pðh; 1Þ

is a two-dimensional Gaussian with means zero and unknown variances

and co-variance. That is, I used scale invariant priors for the variances,

and integrated these nuisance variables out of the likelihood. A uniform

prior was used for the location parameter l, and for the slope a I used a

rotation invariant prior:

PðaÞda ¼da

ð1 þ a2Þ3=2: ½Eqn II�

The use of these priors guarantees that the results are invariant under all

shifts and rotations of the plane.

Integrating over all variables except for a, we obtain for the posterior

Pða=DÞ given the data D :

Pða=DÞda ¼ Cða2 þ 1Þðn23Þ=2da

ða2Sxx 2 2asyx þ syy Þðn21Þ=2

; ½Eqn III�

where n is the number of genomes in the data, sxx is the variance in

x-values (logarithms of the total gene numbers), syy is the variance in

y-values (logarithms of the number of genes in the category), syx is the

co-variance, and C is a normalizing constant. The values of the

exponents reported in Table I are the values of a that maximize

Pða=DÞ; and the boundaries of the 99% posterior probability interval

around it.

For each fit, I also measured what fraction of the variance in the data is

explained by the fit. That is, I compared the average distance d of the

points in the plane to their center of mass with the average distance dl of

the points to the fitted line and defined the fraction q ¼ 1 2 dl =d as the

variance in the data explained by the fit.

Robustness of the resultsFirst, I checked that the total amount of available annotation information

is not itself dependent on genome size. If the total amount of available

annotation information were to vary with the number of genes in the

genome, this could lead to biases in the estimated exponents. To

exclude this possibility, I counted the total number of genes with any

InterPro hits in each genome and found that, for both bacteria and

eukaryotes, the fraction of genes in the genome with InterPro hits is

approximately two-thirds, independent of the total number of genes in

the genome. Consistent with this observation, when one fits power-

laws to the number of genes nc in a GO category c as a function of the

total number of annotated genes in the genome, one finds exponents

that are very close to those found for nc as a function of the total number

of genes in the genome.

Second, I tested that the results are insensitive to the value of the

parameter b: The default value b ¼ 3 gives a gene with a single InterPro

hit a probability of 1 2 e23 < 0:95 to belong to the category. This is

reasonable because InterPro is designed to only report statistically

significant hits. To assess the effect of changing b, results forb ¼ 1 were

generated. For bacteria the change in fitted exponent is less than 5% for

26 of 44 categories, and less than 10% for 39 categories. For eukaryotes

the exponent changes by less than 5% for all but three categories. In all

cases, the change in fitted exponent is significantly smaller than the 99%

posterior probability intervals associated with the exponents of the fits.

Third, I tested the robustness of the results against removal of

potential ‘redundancies’ in the data. For bacteria there are several

examples where multiple genomes of the same species or genomes of

very closely related species occur in the data, and one might suspect

Fig. I. The number of transcription regulatory genes (red), metabolic genes

(blue), and cell-cycle-related genes (green) as a function of the total number of

genes in archaea. Both axes are shown on logarithmic scales. Each dot corre-

sponds to a genome. The straight lines are power-law fits.

TRENDS in Genetics

1500 2000 3000 5000

Genes in genome

20

50

100

200

500

1000

2000

Gen

es in

cat

egor

y

Archaea

Table I. Estimated 99% posterior probability intervals for the

scaling exponents obtained with three different annotation

schemesa

Annotation Category Exponent

GO Protein biosynthesis 0.11–0.15

COG Translation, ribosomal structure and

biogenesis

0.21–0.37

NCBI Protein biosynthesis 0.09–0.15

GO Signal transduction 1.55–1.90

COG Signal transduction mechanisms 1.66–2.14

GO Protein metabolism and modification 0.68–0.80

GO Protein degradation 0.89–1.06

COG Posttranslational modification, protein

turnover, chaperones

0.88–1.15

NCBI Protein degradation 0.68–0.91

GO Cell cycle 0.39–0.54

GO DNA repair 0.52–1.14

COG Replication, recombination and repair 0.66–0.83

NCBI Cell cycle 0.45–0.64

GO Ion transport 1.15–1.70

COG Inorganic ion transport and metabolism 1.19–1.47

NCBI Ion transport 1.12–1.88

GO Regulation of transcription 1.74–2.00

COG Transcription regulation 1.69–2.36

NCBI Transcription 1.90–2.42

GO Kinase 0.96–1.16

NCBI Kinase 0.80–1.03

GO Transport 1.08–1.32

NCBI Transport 1.16–1.50

aAbbreviations: COG, clusters of orthologous groups of proteins; GO, gene

ontology.




significantly above 1. These trends are strongest inbacteria, where the exponent for TFs is almost 2, implyingthat as the number of genes in the genome doubles, thenumber of TFs quadruples. This has some interestingimplications for ‘regulatory design’ in bacteria. It impliesthat the number of TFs per gene grows in proportionto the size of the genome (see [6] for a similarobservation). This in turn implies that, in largergenomes, each gene must be regulated by a largernumber of TFs and/or each TF must be regulating asmaller set of genes. An exponent of 2 is also observedfor two-component systems, which are the primarymeans by which bacteria sense their environment†.This suggests that the relative increase in transcriptionregulators in more complex bacteria is accompanied by anequal relative increase in sensory systems.

The difficulties with gene prediction and annotation ineukaryotes, the small number of available genomes, andour lack of understanding of the role of alternative splicingacross eukaryotic genomes make it premature to drawmany conclusions from Fig. 1b. However, the main trendsfrom Fig. 1a are reproduced: the super-linear scaling ofTFs, the sub-linear scaling of cell-cycle genes, and thesmall exponents for DNA replication and protein biosyn-thesis genes.

The observed super-linear scaling of TFs also hasimplications for our understanding of ‘combinatorialcontrol’ in transcription regulation. It is well establishedthat in complex organisms, different TFs combine intocomplexes to affect transcription control. Therefore, arelatively small number of TFs can implement a combi-natorially large number of different transcriptionregulatory ‘states’, which might correspond to particularexternal environments, developmental stages, tissues or

combinations of external stimuli. Each such regulatorystate will be associated with a unique set of genes that isexpressed in that state. If the number of such regulatorystates were proportional to the total number of genes, thenthe number of TFs would increase more slowly than thetotal number of genes. However, the scaling results showthat, instead, the number of TFs increases more rapidlythan does the total number of genes. This implies that thenumber of regulatory states is also ‘combinatorial’ in thetotal number of genes: a relatively small number of genesis used in different combinations to implement combina-torially many regulatory states.

The picture that emerges is not one of TFs being used indifferent combinations to implement the regulatory needsof individual genes. But rather that, as one moves fromsimple to more complex organisms, the number ofregulatory states grows so much faster than the totalnumber of genes such that, even with combinatorialcontrol of transcription, the number of TFs grows muchfaster than the total number genes.

Evolutionary model

One of course wonders about the origins of these scalinglaws in genome organization, and I would like to presentsome speculations in this regard. Assume that mostchanges in the number of genes nc in a functional categoryc are caused by duplications and deletions. Then, ncðtÞgenerally evolves according to the equation:

dncðtÞ

dt¼ ðbðtÞ2 dðtÞÞncðtÞ ¼ rðtÞncðtÞ; ½Eqn 1�

with bðtÞ and dðtÞ; respectively, the duplication rate anddeletion rate of the genes in this category at time t in theevolutionary history of the genome. For simplicity ofnotation I have introduced the difference of duplicationand deletion rates rðtÞ; which can be thought of as an

that these could bias the results. To this end, I parsed the names of all

bacterial species into a general and a specific part, for example, for

Escherichia coli, ‘Escherichia’ is the general part and ‘coli’ is the specific

part, for Listeria innocua, ‘Listeria’ is the general part and ‘innocua’ is the

specific part, and so on. Groups of genomes with the same general part

were then collected together and for each group the gene numbers were

replaced with a single average of total gene number and average gene

counts in each of the functional categories. This reduces the size of the

dataset by approximately a third. Power-law fitting was then applied to

this reduced set and the fitted exponents were compared with those of

the full dataset. The exponent was changed by less than 5% in 31 out of

44 categories. The largest observed change was an 18% change in the

exponent. All changes to the exponents were well within their 99%

posterior probability intervals.

More importantly, the results could depend on the use of InterPro,

Gene Ontology, and the mapping of InterPro entries to GO categories.

To test the robustness of the results to bias inherent in InterPro

annotation and/or the mapping from InterPro to Gene Ontology, I

analyzed selected functional categories using two other annotation

schemes.

The first is based on clusters of orthologous groups of proteins

(COGs) [7] annotation of 63 bacterial genomes that can be obtained from

the NCBI database (ftp://ftp.ncbi.nih.gov/pub/COG/COG). In this data-

set, proteins of the 63 bacterial genomes are assigned to COGs, and the

COGs have been assigned to functional categories. I used these

assignments to count the number of genes in different functional

categories according to the COG annotation scheme. A comparison of

the exponents for COG functional categories and the exponents for the

closest GO categories obtained using the InterPro annotations is shown

in Table I.

The second alternative annotation scheme I used is based on simple

keyword searches of protein tables for fully sequenced bacterial

genomes available from the NCBI ftp site. (ftp.ncbi.nlm.nih.gov/

genomes/Bacteria). Removing genomes for which little or no annota-

tion exists, this leaves protein tables for 90 bacterial genomes. Each

protein in these tables is annotated with a short description line. The

number of genes in different functional categories was counted by

searching each description line for hits to a set of keywords that

characterize the category. For instance, I chose the keywords ‘ribosom’,

‘translation’ and ‘tRNA’ for the category ‘protein biosynthesis’, and a

gene is counted as belonging to this category if any of these keywords

occurs in its description line. For the other categories I used the

following keywords: ‘transcription’ for the category ‘transcription’;

‘transport’, ‘channel’, ‘efflux’, ‘pump’, ‘porin’, ‘export ‘, ‘permease’,

‘symport’, ‘transloca’ and ‘PTS’ for the category ‘transport’; all

combinations ‘X Y’ with ‘X’ being one of ‘ion’, ‘sodium’, ‘calcium’,

‘potassium’, ‘magnesium’ and ‘manganese’ and ‘Y’ being one of

‘channel’, ‘efflux’, ‘transport’ and ‘uptake’ for the category ‘ion

transport’; ‘protease’ and ‘peptidase’ for the category ‘protein degra-

dation’; ‘kinase’ for the category ‘kinase’; and finally the phrases ‘DNA

polymerase’, ‘topoisomerase’, ‘DNA gyrase’, ‘DNA ligase’, ‘replication’,

‘helicase’, ‘DNA primase’, ‘DNA repair’, ‘cell division’ and ‘septum’ for

the category ‘cell cycle’. The exponents resulting from this (crude)

annotation scheme are also shown in Table I. There is good quantitative

agreement between the exponents that are obtained with the different

annotation schemes.

† Note that two-component systems were not in the list of 44 high-level categories.




‘effective’ duplication rate. This rate rðtÞ is presumablyproportional to the difference between the averageprobability that selection will favor fixation of a duplicatedgene from this category and the average probabilitythat selection will favor deletion of a gene from thiscategory. Similarly, the total number of genes gðtÞ obeysthe equation:

dgðtÞ

dt¼ gðtÞgðtÞ; ½Eqn 2�

with gðtÞ the overall effective rate of gene duplication in thegenome at time t in its evolutionary history. When we solvefor nc as a function of g we find:

nc ¼ lgkrl=kgl; ½Eqn 3�

where krl and kgl are the mean effective duplication rates ofgenes in category c and the entire genome, respectively,averaged over the evolutionary history of the genome, andl is a constant that depends on the boundary conditions. Inorder for all bacterial genomes to obey the same functionalrelation, the constant l and the ratios krl=kgl have to be thesame for all bacterial evolutionary lineages. Because alllife shares a common ancestor, the boundary conditions forEqns 1 and 2 are trivially the same for all bacteriallineages, implying that the constant l is indeed the same

for all bacterial lineages. In summary, simply assumingthat changes in gene-number occur mostly throughduplications and deletions implies our observed power-law scaling if the ratios krl=kgl are the same for allevolutionary lineages.

I thus propose that the explanation for the observedscaling laws is that the ratios krl=kgl are indeed the samefor all bacterial lineages. That is, these ratios of averageduplication rates are ‘universal constants’ of the evolu-tionary process. For instance, the exponent 2 for TFs inbacteria indicates that, in all bacterial lineages, evolutionselects duplicated TFs twice as frequently as duplicatedgenes in general. It seems likely that such universalconstants are intimately connected to fundamental designprinciples of the evolutionary process. It is tempting tobecome even more speculative in this regard, and suggestthat this factor of 2 in duplication rate is related to theswitch-like function of transcription factors: with eachaddition of a TF the number of transcription regulatorystates of the cell doubles. It is not entirely implausible toassume that with twice the number of internal statesavailable, the probability of such a duplication being fixed

Fig. 1. The number of transcription regulatory genes (red), metabolic genes (blue),

and cell-cycle-related genes (green) as a function of the total number of genes in

the genome for bacteria (a) and eukaryotes (b). Both axes are shown on logarith-

mic scales. Each dot corresponds to a genome. The straight lines are power-law

fits. Archaea are not shown because the range of genome sizes in archaea is too

small for meaningful fits. For completeness the archaeal results are shown in Fig. I

in Box 1.

500 1000 2000 5000 10000Genes in genome

Genes in genome

10

100

1000

Bacteria

500 1000 2000 5000 10000 20000

10

100

1000

10000

Gen

es in

cat

egor

yG

enes

in c

ateg

ory

Eukaryotes

(a)

(b)

TRENDS in Genetics

Table 1. Estimates for the exponents of a selection of functional

categoriesa

Category Bacteria Eukaryotes

Transcription regulation 1.87 ^ 0.13 1.26 ^ 0.10

Metabolism 1.01 ^ 0.06 1.01 ^ 0.08

Cell cycle 0.47 ^ 0.08 0.79 ^ 0.16

Signal transduction 1.72 ^ 0.18 1.48 ^ 0.39

DNA repair 0.64 ^ 0.08 0.83 ^ 0.31

DNA replication 0.43 ^ 0.08 0.72 ^ 0.23

Protein biosynthesis 0.13 ^ 0.02 0.41 ^ 0.15

Protein degradation 0.97 ^ 0.09 0.90 ^ 0.11

Ion transport 1.42 ^ 0.28 1.43 ^ 0.20

Catabolism 0.88 ^ 0.07 0.92 ^ 0.08

Carbohydrate metabolism 1.01 ^ 0.11 1.36 ^ 0.36

Two-component systems 2.07 ^ 0.21 NAb

Cell communication 1.81 ^ 0.19 1.58 ^ 0.34

Defense response NAb 3.35 ^ 1.41

aThe first number gives the maximum likelihood estimate of the exponent and the

second number indicates the boundaries of the 99% posterior probability interval.bNA, not analyzed.

Fig. 2. The number of fully sequenced genomes in the NCBI database (http://www.

ncbi.nlm.nih.gov/Genomes/index.htm) as a function of time. The vertical axis is

shown on a logarithmic scale. The straight line is a least squares fit to an exponen-

tial function: nðtÞ ¼ 2ðt21994Þ=1:3 :

TRENDS in Genetics

1996 1998 2000 2002 2004Year

1

2

5

10

20

50

100

200

Num

ber

of g

enom

es

Growth of the number of sequenced genomes






in evolution is twice as large as the probability of fixing aduplicated gene that does not double the number ofinternal states of the cell.

Finally, as Table 1 shows, there is still substantialuncertainty about the exact numerical values of theexponents given the current data, and many moregenomes are needed to estimate these values moreaccurately. A survey of the NCBI genome database(http://www.ncbi.nlm.nih.gov/Genomes/index.htm) showsthat the number of sequenced genomes is increasingexponentially, with a doubling time of,16 months (Fig. 2).

This suggests that within a few years thousands ofgenomes will become available. With such an increase inavailable data it will become possible to look at much morefine-grained gene-content statistics than those presentedhere. One can imagine, for instance, going beyond lookingat single functional categories at a time, and investi-gate if there are correlations in the variations of genenumber in more fine-grained functional categories. Ibelieve that such investigations have the potential to

teach us much about the functional design principles ofthe evolutionary process.

References

1 Apweiler, R. et al. (2001) The InterPro database, an integrateddocumentation resource for protein families, domains and functionalsites. Nucleic Acids Res. 29, 37–40

2 Apweiler, R. et al. (2001) Proteome Analysis Database: onlineapplication of InterPro and CluSTr for the functional classification ofproteins in whole genomes. Nucleic Acids Res. 29, 44–48

3 The Gene Ontology Consortium (2000) Gene ontology: tool for theunification of biology. Nat. Genet. 25, 25–29

4 Huynen, M.A. and van Nimwegen, E. (1998) The frequency distributionof gene family sizes in complete genomes. Mol. Biol. Evol. 15, 583–589

5 Luscombe, N.M. et al. (2002) The dominance of the population by aselected few: power-law behaviour applies to a wide variety of genomicproperties. Genome Biol. 3, RESEARCH0040

6 Stover, C.K. et al. (2000) Complete genome sequence of Pseudomonasaeruginosa PA01, an opportunistic pathogen. Nature 406, 959–964

7 Tatusov, R.L. et al. (1997) A genomic perspective on protein families.Science 278, 631–637


|Errata

Erratum: FlyMove – a new way to look at developmentof DrosophilaTrends in Genetics 19 (2003), 310–311

In the article by Weigmann et al., which was published inthe June issue of TIG, there was an error regarding thecontribution of the authors. Thomas Strasser, KatrinWeigmann and Robert Klapper contributed equally tothe article. TIG apologises to the authors and readers

for this error. doi of original article: 10.1016/S0168-9525(03)00050-7.


Erratum: Control of protein degradation by E3ubiquitin ligases in Drosophila eye developmentTrends in Genetics 19 (2003), 382–389

In the article by Ou et al., which was published in theJuly issue of TIG, there was an error regarding thename of the corresponding author. Cheng-Ting Chien([email protected]) is the corresponding authorfor this article. TIG apologises to the authors and

readers for this production error. doi of original article:10.1016/S0168-9525(03)00146-X.





mailto:[email protected]


Documents

Scaling laws in the functional content of genomes