Scaling laws in the functional content of genomes

  • Published on

  • View

  • Download

Embed Size (px)


  • specific protein interactions provide a sufficient constraintfor a DNA sequence to be recognized, it will ultimately bepossible to design cell-specific response elements to blocktranscription factors in a specific subset of cells [25,26].

    Our findings suggest that the understanding of howspecific functions are discriminated as different blueprintson the DNA might be within our reach.

    AcknowledgementsWe apologize for the fact that, because of space constraints, full citation ofall relevant articles is not possible. P. Hogeweg, J. Deschamps and areferee are thanked for their critical reading of the manuscript. This studywas supported in part by HPRN and EU quality of life programs and a recipient of a Marie Curie fellowship.

    References1 Claverie, J.M. (2000) From bioinformatics to computational biology.

    Genome Res. 10, 127712792 Benos, P.V. et al. (2002) Is there a code for proteinDNA recognition?

    Probab(ilistical)ly. BioEssays 24, 4664753 Hong, W.K. and Sporn, M.B. (1997) Recent advances in chemopreven-

    tion of cancer. Science 278, 107310774 Mangelsdorf, D.J. et al. (1994) In The Retinoids: Biology, Chemistry

    and Medicine (Sporn, M.B. et al., eds), pp. 319349, Raven Press5 Durston, A.J. et al. (1998) Retinoids and related signals in early

    development of the vertebrate central nervous system. Curr. Top. Dev.Biol. 40, 111175

    6 Rastinejad, F. (2001) Retinoid X receptor and its partners in thenuclear receptor family. Curr. Opin. Struct. Biol. 11, 3338

    7 Gronemeyer, H. and Miturski, R. (2001) Molecular mechanisms ofretinoid action. Cell. Mol. Biol. Lett. 6, 352

    8 Kurokawa, R. et al. (1995) Polarity-specific activities of retinoic acidreceptors determined by a co-repressor. Nature 377, 451454

    9 Gould, A. et al. (1998) Initiation of rhombomeric Hoxb4 expressionrequires induction by somites and a retinoid pathway. Neuron 21,3951

    10 Huang, D. et al. (2002) Analysis of two distinct retinoic acid responseelements in the homeobox gene Hoxb1 in transgenic mice. Dev. Dyn.223, 353370

    11 Nolte, C. et al. (2003) The role of a retinoic acid response element inestablishing the anterior neural expression border of Hoxd4 trans-genes. Mech. Dev. 120, 325335

    12 Amores, A. et al. (1998) Zebrafish hox clusters and vertebrate genomeevolution. Science 282, 17111714

    13 Duboule, D. and Morata, G. (1994) Colinearity and functionalhierarchy among genes of the homeotic complexes. Trends Genet. 10,358364

    14 Krumlauf, R. (1994) Hox genes in vertebrate development. Cell 78,191201

    15 Moroni, M.C. et al. (1993) Regulation of the human HOXD4 gene byretinoids. Mech. Dev. 44, 139154

    16 Popperl, H. and Featherstone, M.S. (1993) Identification of a retinoicacid response element upstream of the murine Hox-4.2 gene. Mol. Cell.Biol. 13, 257265

    17 Morrison, A. et al. (1996) In vitro and transgenic analysis of a humanHOXD4 retinoid-responsive enhancer. Development 122, 18951907

    18 Langston, A.W. et al. (1997) Retinoic acid-responsive enhancerslocated 30 of the Hox A and Hox B homeobox gene clusters. Functionalanalysis. J. Biol. Chem. 272, 21672175

    19 Packer, A.I. et al. (1998) Expression of the murine Hoxa4 gene requiresboth autoregulation and a conserved retinoic acid response element.Development 125, 19911998

    20 Kim, C.B. et al. (2000) Hox cluster genomics in the horn shark,Heterodontus francisci. Proc. Natl. Acad. Sci. U. S. A. 97, 16551660

    21 Kurokawa, R. et al. (1993) Differential orientations of the DNA-binding domain and carboxy-terminal dimerization interface regulatebinding site selection by nuclear receptor heterodimers. Genes Dev. 7,14231435

    22 Rastinejad, F. et al. (1995) Structural determinants of nuclear receptorassembly on DNA direct repeats. Nature 375, 203211

    23 Escriva, H. et al. (1997) Ligand binding was acquired during evolutionof nuclear receptors. Proc. Natl. Acad. Sci. U. S. A. 94, 68036808

    24 Sluder, A.E. et al. (1999) The nuclear receptor superfamily hasundergone extensive proliferation and diversification in nematodes.Genome Res. 9, 103120

    25 Cho, Y.S. et al. (2002) A genomic-scale view of the cAMP responseelement-enhancer decoy: a tumor target-based genetic tool. Proc. Natl.Acad. Sci. U. S. A. 99, 1562615631

    26 Wang, L.H. et al. (2003) The cis decoy against the estrogen responseelement suppresses breast cancer cells via target disrupting c-fos notmitogen-activated protein kinase activity. Cancer Res. 63, 20462051

    27 Chang, B.E. et al. (1997) Axial (HNF3beta) and retinoic acid receptorsare regulators of the zebrafish sonic hedgehog promoter. EMBO J. 16,39553964

    28 Zhang, F. et al. (2000) Murine hoxd4 expression in the CNS requiresmultiple elements including a retinoic acid response element. Mech.Dev. 96, 7989

    29 Valcarcel, R. et al. (1994) Retinoid-dependent in vitro transcriptionmediated by the RXR/RAR heterodimer. Genes Dev. 8, 30683079

    30 Astrom, A. et al. (1994) Retinoic acid induction of human cellularretinoic acid-binding protein-II gene transcription is mediated byretinoic acid receptor-retinoid X receptor heterodimers bound to onefar upstream retinoic acid-responsive element with 5-base pairspacing. J. Biol. Chem. 269, 2233422339

    31 Wingender, E. et al. (2001) The TRANSFAC system on gene expressionregulation. Nucleic Acids Res. 29, 281283

    0168-9525/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.doi:10.1016/S0168-9525(03)00202-6

    Scaling laws in the functional content of genomes

    Erik van Nimwegen

    Center for Studies in Physics and Biology, the Rockefeller University, 1230 York Avenue, New York, NY 12001, USA

    With the number of sequenced genomes now totalingmore than 100, and the availability of rough functionalannotations for a substantial proportion of their genes,it has become possible to study the statistics of genecontent across genomes. In this article I show that, for

    many high-level functional categories, the number ofgenes in each category scales as a power-law of thetotal number of genes in the genome. The occurrenceof such scaling laws can be explained using a simpletheoretical model, and this model suggests that theexponents of the observed scaling laws correspondto universal constants of the evolutionary process.Corresponding author: Erik van Nimwegen (

    Update TRENDS in Genetics Vol.19 No.9 September 2003 479

  • I discuss some consequences of these scaling laws forour understanding of organism design.

    What fraction of the gene content of a genome is allotted todifferent functional tasks, and how does this depend on thecomplexity of the organism? Until recently, there weresimply no data to address such questions in a quantitativeway. At present, however, there are more than 100sequenced genomes in public databases (, and protein-familyclassification algorithms allow functional annotations fora considerable fraction of the genes in each genome. Thus,it has become possible to analyze the statistics offunctional gene-content across different genomes, and inthis article I present results on the dependency of thenumber of genes in different high-level categories on thetotal number of genes in the genome.

    Evaluating the functional gene-content of genomesTo estimate the number of genes in different functionalcategories, each genome has to be functionally annotated.The main results presented in this article were obtainedusing the InterPro [1] annotations of sequenced genomesavailable from the European Bioinformatics Institute [2].To map the InterPro annotations to high-level functionalcategories I used the Gene Ontology (GO) biologicalprocess hierarchy [3] and a mapping from InterPro entriesto GO categories, both of which can be obtained from theGene Ontology website ( Foreach GO category I collected all InterPro entries that mapto it or to one of its descendants in the biological processhierarchy. To minimize the effects of potential bias inthe mappings from InterPro to GO I only used high-level functional categories that are represented by atleast 50 different InterPro entries. This leaves 44 high-level GO categories.

    A gene with multiple hits to InterPro entries that areassociated with a GO category has a higher probability ofbelonging to that category than a gene with only a singlehit. To take this information into account, I assumed that ifa gene i has nci independent hits to InterPro entriesassociated with GO category c, then with probability 12exp2bnci the gene belongs to this particular GO category.The results are generally insensitive to the value of b(Box 1), and I used b 3 for the results shown below. Theestimated number of genes nc in a genome for a given GOcategory is then the sum nc

    Pi 12 exp2bnci over all

    genes i in the genome.The results for the categories of transcription regulat-

    ory genes, metabolic genes and cell-cycle-related genes forbacteria and eukaryotes are shown in Fig. 1. Remarkably,for each functional category shown, we find an approxi-mately power-law relationship (solid line fits)*. That is, ifnc is the number of genes in the category, and g is the totalnumber of genes in the genome, we observe laws of theform: nc lga; where both l and the exponent a dependon the category under investigation. In fact, such

    power-laws are observed for most of the 44 high-levelcategories, and the estimated values of the exponentsfor several functional categories are shown in Table 1.A potential source of bias in estimating these expo-nents is the occurrence of multiple genomes from thesame or closely related species in the data. Removingthis redundancy from the data does not alter theobserved exponents (Box 1).

    The power-law fits and 99% posterior probabilityestimates for their exponents were obtained using aBayesian procedure described in Box 1. To assess thequality of the fits, I measured, for each fit, the fraction ofthe variance in the data that is explained by the fit. Inbacteria, 26 out of 44 GO categories have more than 95% ofthe variance explained by the fit, and 38 categories havemore than 90% explained by the fit. In eukaryotes, 26categories have more than 95% explained by the fit and 32categories have more than 90% explained by the fit.However, with total gene number varying by less than afactor of 20 in bacteria, and the small number of datapoints in eukaryotes, one might wonder how one can claimthat power-laws have been observed. First, the fact that Ifit the data to power-laws should not be mistaken for aclaim that the data can only be described by power-lawfunctions. I only claim that the power-law is by far thesimplest functional form that fits almost all the observeddata. Second, when scatter plots such as those shown inFig. 1 are plotted on linear as opposed to logarithmicscales, it is clear even by eye that the fluctuations in thenumber of genes in a category scale with the total numberof genes in the genome. That is, the fluctuations in the datasuggest that logarithmic scales are the natural scales forthese data. This is further supported by the simpleevolutionary model presented below.

    One might also wonder to what extent the results aresensitive to the specific functional annotation procedure. Iperformed a variety of tests to assess the robustness of theresults (i.e. the observed power-law scaling and the valuesof the exponents) to changes in the annotation method-ology (Box 1). These involve using entirely independentannotations based on clusters of orthologous groups ofproteins (COGs) [7], and a simple (crude) annotationscheme based on keyword searches of protein tables forsequenced microbial genomes from the NCBI website( As shown inBox 1, the observed power-law scaling and the values ofthe exponents are generally insensitive to these and otherchanges in annotation methodology. However, because allcurrently available annotation schemes assume at somelevel that functional homology can be inferred fromsequence homology, our results implicitly depend on thisassumption as well.

    Observed exponentsSome functional categories, such as the large category ofmetabolic genes, occupy a roughly constant fraction of thegene content of a genome, as evidenced by their exponentof 1. However, many categories show significant deviationsfrom this trivial exponent. Genes related to cell-cycle orprotein biosynthesis have exponents significantly below 1,whereas for transcription factors (TFs) the exponent is

    * These power-laws as a function of total gene number should be distinguished fromthe power-law distributions of gene-family sizes and other genomic attributes within asingle genome [4,5].

    Update TRENDS in Genetics Vol.19 No.9 September 2003480

  • Box 1. Appendix

    Results for ArchaeaFigure I shows the number of transcription regulatory genes, metabolicgenes and cell-cycle-related genes as a function of the total number ofgenes in archaeal genomes. Note that the size of the largest archaealgenome differs from that of the smallest by only a factor of ,3.Consequently, there is large uncertainty regarding the values of theexponents for archaea. The maximum likelihood values and 99%posterior probability intervals are 2.1 and [1.35.7] for transcriptionregulatory genes, 0.81 and [0.441.48] for metabolic genes, and 0.83and [0.541.35] for cell-cycle-related genes.

    Power-law fittingA power-law relation of the form y cxa translates into a linearrelation between the logarithms of the variables x and y[i.e. logy logc alogx]. Thus, power-law fitting translates intofitting a straight line to the log-transformed data. The power-law fitsshown in Fig. 1 of the main text were obtained using a Bayesianstraight-line fitting procedure. For each gene ontology (GO) category, Ilog-transformed the data such that each data point xi ; yi correspondsto the logarithm xi of the number of genes in the genome and thelogarithm yi of the number of genes in the category. I assumed thatthese transformed data were drawn from a linear model:

    y ax 1 l h; Eqn Iwhereh and 1 are noise terms in the x- and y-coordinates, respectively,and l is an unknown off-set. I assumed that the joint-distribution Ph; 1is a two-dimensional Gaussian with means zero and unknown variancesand co-variance. That is, I used scale invariant priors for the variances,and integrated these nuisance variables out of the likelihood. A uniformprior was used for the location parameter l, and for the slope a I used arotation invariant prior:

    Pada da1 a23=2 : Eqn II

    The use of these priors guarantees that the results are invariant under allshifts and rotations of the plane.

    Integrating over all variables except for a, we obtain for the posteriorPa=D given the data D :

    Pa=Dda C a2 1n23=2da

    a2Sxx 2 2asyx syy n21=2; Eqn III

    where n is the number of genomes in the data, sxx is the variance inx-values (logarithms of the total gene numbers), syy is the variance iny-values (logarithms of the number of genes in the category), syx is theco-variance, and C is a normalizing const...