specific protein interactions provide a sufficient constraintfor a DNA sequence to be recognized, it will ultimately bepossible to design cell-specific response elements to blocktranscription factors in a specific subset of cells [25,26].
Our findings suggest that the understanding of howspecific functions are discriminated as different blueprintson the DNA might be within our reach.
AcknowledgementsWe apologize for the fact that, because of space constraints, full citation ofall relevant articles is not possible. P. Hogeweg, J. Deschamps and areferee are thanked for their critical reading of the manuscript. This studywas supported in part by HPRN and EU quality of life programs and G.M.is a recipient of a Marie Curie fellowship.
References1 Claverie, J.M. (2000) From bioinformatics to computational biology.
Genome Res. 10, 127712792 Benos, P.V. et al. (2002) Is there a code for proteinDNA recognition?
Probab(ilistical)ly. BioEssays 24, 4664753 Hong, W.K. and Sporn, M.B. (1997) Recent advances in chemopreven-
tion of cancer. Science 278, 107310774 Mangelsdorf, D.J. et al. (1994) In The Retinoids: Biology, Chemistry
and Medicine (Sporn, M.B. et al., eds), pp. 319349, Raven Press5 Durston, A.J. et al. (1998) Retinoids and related signals in early
development of the vertebrate central nervous system. Curr. Top. Dev.Biol. 40, 111175
6 Rastinejad, F. (2001) Retinoid X receptor and its partners in thenuclear receptor family. Curr. Opin. Struct. Biol. 11, 3338
7 Gronemeyer, H. and Miturski, R. (2001) Molecular mechanisms ofretinoid action. Cell. Mol. Biol. Lett. 6, 352
8 Kurokawa, R. et al. (1995) Polarity-specific activities of retinoic acidreceptors determined by a co-repressor. Nature 377, 451454
9 Gould, A. et al. (1998) Initiation of rhombomeric Hoxb4 expressionrequires induction by somites and a retinoid pathway. Neuron 21,3951
10 Huang, D. et al. (2002) Analysis of two distinct retinoic acid responseelements in the homeobox gene Hoxb1 in transgenic mice. Dev. Dyn.223, 353370
11 Nolte, C. et al. (2003) The role of a retinoic acid response element inestablishing the anterior neural expression border of Hoxd4 trans-genes. Mech. Dev. 120, 325335
12 Amores, A. et al. (1998) Zebrafish hox clusters and vertebrate genomeevolution. Science 282, 17111714
13 Duboule, D. and Morata, G. (1994) Colinearity and functionalhierarchy among genes of the homeotic complexes. Trends Genet. 10,358364
14 Krumlauf, R. (1994) Hox genes in vertebrate development. Cell 78,191201
15 Moroni, M.C. et al. (1993) Regulation of the human HOXD4 gene byretinoids. Mech. Dev. 44, 139154
16 Popperl, H. and Featherstone, M.S. (1993) Identification of a retinoicacid response element upstream of the murine Hox-4.2 gene. Mol. Cell.Biol. 13, 257265
17 Morrison, A. et al. (1996) In vitro and transgenic analysis of a humanHOXD4 retinoid-responsive enhancer. Development 122, 18951907
18 Langston, A.W. et al. (1997) Retinoic acid-responsive enhancerslocated 30 of the Hox A and Hox B homeobox gene clusters. Functionalanalysis. J. Biol. Chem. 272, 21672175
19 Packer, A.I. et al. (1998) Expression of the murine Hoxa4 gene requiresboth autoregulation and a conserved retinoic acid response element.Development 125, 19911998
20 Kim, C.B. et al. (2000) Hox cluster genomics in the horn shark,Heterodontus francisci. Proc. Natl. Acad. Sci. U. S. A. 97, 16551660
21 Kurokawa, R. et al. (1993) Differential orientations of the DNA-binding domain and carboxy-terminal dimerization interface regulatebinding site selection by nuclear receptor heterodimers. Genes Dev. 7,14231435
22 Rastinejad, F. et al. (1995) Structural determinants of nuclear receptorassembly on DNA direct repeats. Nature 375, 203211
23 Escriva, H. et al. (1997) Ligand binding was acquired during evolutionof nuclear receptors. Proc. Natl. Acad. Sci. U. S. A. 94, 68036808
24 Sluder, A.E. et al. (1999) The nuclear receptor superfamily hasundergone extensive proliferation and diversification in nematodes.Genome Res. 9, 103120
25 Cho, Y.S. et al. (2002) A genomic-scale view of the cAMP responseelement-enhancer decoy: a tumor target-based genetic tool. Proc. Natl.Acad. Sci. U. S. A. 99, 1562615631
26 Wang, L.H. et al. (2003) The cis decoy against the estrogen responseelement suppresses breast cancer cells via target disrupting c-fos notmitogen-activated protein kinase activity. Cancer Res. 63, 20462051
27 Chang, B.E. et al. (1997) Axial (HNF3beta) and retinoic acid receptorsare regulators of the zebrafish sonic hedgehog promoter. EMBO J. 16,39553964
28 Zhang, F. et al. (2000) Murine hoxd4 expression in the CNS requiresmultiple elements including a retinoic acid response element. Mech.Dev. 96, 7989
29 Valcarcel, R. et al. (1994) Retinoid-dependent in vitro transcriptionmediated by the RXR/RAR heterodimer. Genes Dev. 8, 30683079
30 Astrom, A. et al. (1994) Retinoic acid induction of human cellularretinoic acid-binding protein-II gene transcription is mediated byretinoic acid receptor-retinoid X receptor heterodimers bound to onefar upstream retinoic acid-responsive element with 5-base pairspacing. J. Biol. Chem. 269, 2233422339
31 Wingender, E. et al. (2001) The TRANSFAC system on gene expressionregulation. Nucleic Acids Res. 29, 281283
0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00202-6
Scaling laws in the functional content of genomes
Erik van Nimwegen
Center for Studies in Physics and Biology, the Rockefeller University, 1230 York Avenue, New York, NY 12001, USA
With the number of sequenced genomes now totalingmore than 100, and the availability of rough functionalannotations for a substantial proportion of their genes,it has become possible to study the statistics of genecontent across genomes. In this article I show that, for
many high-level functional categories, the number ofgenes in each category scales as a power-law of thetotal number of genes in the genome. The occurrenceof such scaling laws can be explained using a simpletheoretical model, and this model suggests that theexponents of the observed scaling laws correspondto universal constants of the evolutionary process.Corresponding author: Erik van Nimwegen (email@example.com).
Update TRENDS in Genetics Vol.19 No.9 September 2003 479
I discuss some consequences of these scaling laws forour understanding of organism design.
What fraction of the gene content of a genome is allotted todifferent functional tasks, and how does this depend on thecomplexity of the organism? Until recently, there weresimply no data to address such questions in a quantitativeway. At present, however, there are more than 100sequenced genomes in public databases (http://www.ncbi.nlm.nih.gov/Genomes/index.htm), and protein-familyclassification algorithms allow functional annotations fora considerable fraction of the genes in each genome. Thus,it has become possible to analyze the statistics offunctional gene-content across different genomes, and inthis article I present results on the dependency of thenumber of genes in different high-level categories on thetotal number of genes in the genome.
Evaluating the functional gene-content of genomesTo estimate the number of genes in different functionalcategories, each genome has to be functionally annotated.The main results presented in this article were obtainedusing the InterPro  annotations of sequenced genomesavailable from the European Bioinformatics Institute .To map the InterPro annotations to high-level functionalcategories I used the Gene Ontology (GO) biologicalprocess hierarchy  and a mapping from InterPro entriesto GO categories, both of which can be obtained from theGene Ontology website (http://www.geneontology.org). Foreach GO category I collected all InterPro entries that mapto it or to one of its descendants in the biological processhierarchy. To minimize the effects of potential bias inthe mappings from InterPro to GO I only used high-level functional categories that are represented by atleast 50 different InterPro entries. This leaves 44 high-level GO categories.
A gene with multiple hits to InterPro entries that areassociated with a GO category has a higher probability ofbelonging to that category than a gene with only a singlehit. To take this information into account, I assumed that ifa gene i has nci independent hits to InterPro entriesassociated with GO category c, then with probability 12exp2bnci the gene belongs to this particular GO category.The results are generally insensitive to the value of b(Box 1), and I used b 3 for the results shown below. Theestimated number of genes nc in a genome for a given GOcategory is then the sum nc
Pi 12 exp2bnci over all
genes i in the genome.The results for the categories of transcription regulat-
ory genes, metabolic genes and cell-cycle-related genes forbacteria and eukaryotes are shown in Fig. 1. Remarkably,for each functional category shown, we find an approxi-mately power-law relationship (solid line fits)*. That is, ifnc is the number of genes in the category, and g is the totalnumber of genes in the genome, we observe laws of theform: nc lga; where both l and the exponent a dependon the category under investigation. In fact, such
power-laws are observed for most of the 44 high-levelcategories, and the estimated values of the exponentsfor several functional categories are shown in Table 1.A potential source of bias in estimating these expo-nents is the occurrence of multiple genomes from thesame or closely related species in the data. Removingthis redundancy from the data does not alter theobserved exponents (Box 1).
The power-law fits and 99% posterior probabilityestimates for their exponents were obtained using aBayesian procedure described in Box 1. To assess thequality of the fits, I measured, for each fit, the fraction ofthe variance in the data that is explained by the fit. Inbacteria, 26 out of 44 GO categories have more than 95% ofthe variance explained by the fit, and 38 categories havemore than 90% explained by the fit. In eukaryotes, 26categories have more than 95% explained by the fit and 32categories have more than 90% explained by the fit.However, with total gene number varying by less than afactor of 20 in bacteria, and the small number of datapoints in eukaryotes, one might wonder how one can claimthat power-laws have been observed. First, the fact that Ifit the data to power-laws should not be mistaken for aclaim that the data can only be described by power-lawfunctions. I only claim that the power-law is by far thesimplest functional form that fits almost all the observeddata. Second, when scatter plots such as those shown inFig. 1 are plotted on linear as opposed to logarithmicscales, it is clear even by eye that the fluctuations in thenumber of genes in a category scale with the total numberof genes in the genome. That is, the fluctuations in the datasuggest that logarithmic scales are the natural scales forthese data. This is further supported by the simpleevolutionary model presented below.
One might also wonder to what extent the results aresensitive to the specific functional annotation procedure. Iperformed a variety of tests to assess the robustness of theresults (i.e. the observed power-law scaling and the valuesof the exponents) to changes in the annotation method-ology (Box 1). These involve using entirely independentannotations based on clusters of orthologous groups ofproteins (COGs) , and a simple (crude) annotationscheme based on keyword searches of protein tables forsequenced microbial genomes from the NCBI website(ftp.ncbi.nlm.nih.gov/genomes/bacteria). As shown inBox 1, the observed power-law scaling and the values ofthe exponents are generally insensitive to these and otherchanges in annotation methodology. However, because allcurrently available annotation schemes assume at somelevel that functional homology can be inferred fromsequence homology, our results implicitly depend on thisassumption as well.
Observed exponentsSome functional categories, such as the large category ofmetabolic genes, occupy a roughly constant fraction of thegene content of a genome, as evidenced by their exponentof 1. However, many categories show significant deviationsfrom this trivial exponent. Genes related to cell-cycle orprotein biosynthesis have exponents significantly below 1,whereas for transcription factors (TFs) the exponent is
* These power-laws as a function of total gene number should be distinguished fromthe power-law distributions of gene-family sizes and other genomic attributes within asingle genome [4,5].
Update TRENDS in Genetics Vol.19 No.9 September 2003480