16
The 2R hypothesis and the human genome sequence Karsten Hokamp , Aoife McLysaght & Kenneth H. Wolfe * Department of Genetics, Smurfit Institute, University of Dublin, Trinity College, Dublin 2, Ireland; * Author for correspondence: E-mail: [email protected] These authors contributed equally to this work. Received 02.04.2002; accepted in final form 29.08.2002 Key words: human genome, paralogon, polyploidy, 2R hypothesis Abstract One theory formalised in 1970 proposes that the complexity of vertebrate genomes originated by means of genome duplication at the base of the vertebrate lineage. Since then, the theory has remained both popular and controversial. Here we review the theory, and present preliminary results from our analysis of duplications in the draft human genome sequence. We find evidence for extensive duplication of parts of the genome. We also ques- tion the validity of the ‘parsimony test’ that has been used in other analyses. The 2R hypothesis In his 1970 book Susumu Ohno proposed that there may have been one or more whole genome duplica- tions in the lineage leading to vertebrates. He postu- lated that genome duplication in the vertebrate lin- eage provided a platform for increasing the sophistication of the vertebrate genome and thus increasing morphological complexity. Genome dupli- cation may be particularly powerful because all genes in a biochemical pathway will be duplicated simulta- neously. Ohno was not specific about how many events occurred. The most popular form of this hypo- thesis is that there were 2Rounds of genome dupli- cation early in the vertebrate lineage, as proposed by Holland et al. (1994). This has recently become known as the 2R hypothesis, an abbreviation attribut- able to Hughes (1999). There is no absolute consen- sus on the timing of these events, but the majority of references in the literature put one of these events immediately before, and one immediately after the divergence of agnathans from the lineage leading to tetrapods (see Figure 1 in Skrabanek and Wolfe, 1998). These timings are speculative and were prob- ably chosen to coincide with major evolutionary tran- sitions that they were thought to have facilitated. The lower limit on the timing of genome duplication is set by the observation of only a single Hox cluster in the invertebrate chordate amphioxus compared to four clusters in vertebrates (Garcia-Fernandez and Hol- land, 1994). As an upper limit, it seems unlikely that genome duplications would be viable in the mamma- lian lineage. Theory predicts that a genome duplica- tion in an organism with a chromosomal basis of sex- determination (such as that of mammals) will result in sterility of the heterogametic sex, and thus invia- bility (Muller, 1925). Indeed the only known tetrap- loid mammal, a South American rodent, has dupli- cated copies of every chromosome except the sex chromosomes (Gallardo et al., 1999). At the time of writing his book there was little evidence to support Ohno’s claim. Very few protein sequences were known, and the hypothesis was based largely on genome size comparisons and matching patterns of cytogenetic bands. Much of the evidence which prompted Ohno to suggest a genome duplica- tion event has lost merit in the light of our current understanding of genetics and genomes (Skrabanek and Wolfe, 1998). For example, differences in genome sizes are largely due to increased amounts of non-coding DNA rather than an increased number of genes; and cytogenetic bands, whose patterns were 95 Journal of Structural and Functional Genomics 3: 95–110, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands. XPS 5103278 (JSFG) – product element JSFGSI-02-10 – Grafikon

The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

The 2R hypothesis and the human genome sequence

Karsten Hokamp†, Aoife McLysaght† & Kenneth H. Wolfe*

Department of Genetics, Smurfit Institute, University of Dublin, Trinity College, Dublin 2, Ireland;*Author for correspondence: E-mail: [email protected]†These authors contributed equally to this work.

Received 02.04.2002; accepted in final form 29.08.2002

Key words: human genome, paralogon, polyploidy, 2R hypothesis

Abstract

One theory formalised in 1970 proposes that the complexity of vertebrate genomes originated by means ofgenome duplication at the base of the vertebrate lineage. Since then, the theory has remained both popular andcontroversial. Here we review the theory, and present preliminary results from our analysis of duplications in thedraft human genome sequence. We find evidence for extensive duplication of parts of the genome. We also ques-tion the validity of the ‘parsimony test’ that has been used in other analyses.

The 2R hypothesis

In his 1970 book Susumu Ohno proposed that theremay have been one or more whole genome duplica-tions in the lineage leading to vertebrates. He postu-lated that genome duplication in the vertebrate lin-eage provided a platform for increasing thesophistication of the vertebrate genome and thusincreasing morphological complexity. Genome dupli-cation may be particularly powerful because all genesin a biochemical pathway will be duplicated simulta-neously. Ohno was not specific about how manyevents occurred. The most popular form of this hypo-thesis is that there were 2 Rounds of genome dupli-cation early in the vertebrate lineage, as proposed byHolland et al. (1994). This has recently becomeknown as the 2R hypothesis, an abbreviation attribut-able to Hughes (1999). There is no absolute consen-sus on the timing of these events, but the majority ofreferences in the literature put one of these eventsimmediately before, and one immediately after thedivergence of agnathans from the lineage leading totetrapods (see Figure 1 in Skrabanek and Wolfe,1998). These timings are speculative and were prob-ably chosen to coincide with major evolutionary tran-sitions that they were thought to have facilitated. The

lower limit on the timing of genome duplication is setby the observation of only a single Hox cluster in theinvertebrate chordate amphioxus compared to fourclusters in vertebrates (Garcia-Fernandez and Hol-land, 1994). As an upper limit, it seems unlikely thatgenome duplications would be viable in the mamma-lian lineage. Theory predicts that a genome duplica-tion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will resultin sterility of the heterogametic sex, and thus invia-bility (Muller, 1925). Indeed the only known tetrap-loid mammal, a South American rodent, has dupli-cated copies of every chromosome except the sexchromosomes (Gallardo et al., 1999).

At the time of writing his book there was littleevidence to support Ohno’s claim. Very few proteinsequences were known, and the hypothesis was basedlargely on genome size comparisons and matchingpatterns of cytogenetic bands. Much of the evidencewhich prompted Ohno to suggest a genome duplica-tion event has lost merit in the light of our currentunderstanding of genetics and genomes (Skrabanekand Wolfe, 1998). For example, differences ingenome sizes are largely due to increased amounts ofnon-coding DNA rather than an increased number ofgenes; and cytogenetic bands, whose patterns were

95Journal of Structural and Functional Genomics 3: 95–110, 2003.© 2003 Kluwer Academic Publishers. Printed in the Netherlands.

XPS 5103278 (JSFG) – product element JSFGSI-02-10 – Grafikon

Page 2: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

used to list human chromosomes in pairs (Comings,1972), are not indicative of the underlying gene con-tent.

The debate on the 2R hypothesis to date has beena war of words (and limited data) between the phy-logeneticists and the cartographers. As a general rule,analyses based on phylogenetic methods come outagainst the genome duplication hypothesis (e.g.,Hughes, 1998; Hughes, 1999; Martin, 1999; Hugheset al., 2001; Martin, 2001), whereas map-based stud-ies come out in favour (e.g., Lundin, 1993; Spring,1997). Two main arguments have been advanced tosupport the theory of genome duplication in an earlyvertebrate: that there should be four vertebrate ortho-logues of each invertebrate gene, the so-called ‘one-to-four rule’ (Spring, 1997; Meyer and Schartl, 1999;Ohno, 1999); and that paralogous genes are clusteredin a similar fashion in different regions of the genome(e.g., Martin et al., 1990; Lundin, 1993).

The one-to-four rule

The one-to-four rule was first proposed by JürgSpring (1997). He listed human paralogues present ondifferent chromosomes and their Drosophila ortho-logues, and surmised that the maximum ratio ofhuman to Drosophila genes was four. These ‘tetra-logues’ seemed to bear the hallmark of a genome-wide event because they were discovered on all 23female human chromosomes. The observation ofsome gene families with ratios of 2:6 or 2:5 Droso-phila:human genes contradicts this hypothesis andSpring suggested that more complete genomesequences would provide data that could split thesefamilies into ‘tetrapacks’.

The first extensive examination of the one-to-fourrule using almost complete proteomes from D. mela-nogaster, C. elegans, and human, showed no excessof four-membered vertebrate gene families (seeFig. 49 of International Human Genome SequencingConsortium [2001] and Fig. 12 of Venter et al.[2001]). Furthermore, the observation of gene fami-lies with five or more members directly contradictsthe expectations of Spring (1997) that membershipwould be ‘maximally four’. It appears that the one-to-four rule is an over-simplification of the history ofthe vertebrate genome. These data can of course beexplained by hypothesising two genome duplicationson a background of independent gene duplication andloss. However, as it is impossible to distinguish

genome duplication from gene duplication on thebasis of gene family size alone, this measure is sim-ply uninformative.

Paralogous chromosomal segments

The analysis of paralogous regions of the humangenome is based on the assumption that, although itis expected that many rearrangements will haveoccurred in the time since the two duplication eventsenvisaged by the 2R hypothesis, there should still bedetectable remnants of the 4-way paralogy betweensome chromosomes, i.e., some portions of some chro-mosomes should remain almost intact in four copies.This principle seems reasonable, though these studieshave suffered for want of extensive genomic data.Finding as few as two genes in several linked clus-ters in a genome of over 30,000 is hardly overwhelm-ing evidence for a genome duplication event (e.g.,Martin et al., 1990). Objections that these observa-tions can easily be explained by regional duplicationsof segments of chromosomes must be entertained.

HSA 1, 6, 9, and 19

The observation of paralogous regions (around theMHC locus) on human chromosomes 1, 6, 9, and 19,led to the suggestion that these were duplicated bywhole genome duplication events at the base of thevertebrate lineage (Kasahara et al., 1996; Katsanis etal., 1996; Kasahara, 1997). This was further sup-ported by the finding of only a single related clusterin amphioxus (Flajnik and Kasahara, 2001). Tenmembers of particular gene families are present onchromosomes 6 and 9, and four of these are also rep-resented on chromosome 1. The claim that thisarrangement resulted from several rounds of polyp-loidy was refuted by Hughes (1998) using phyloge-netic analysis of the nine families with sufficient data(Retinoid X receptor (RXR); � pro-collagen (COL);ATP-binding cassette (ABC) transporter; Proteasomecomponent � (PSMB); Notch; Pre-B-cell-leukemiatranscription factor (PBX); Tenascin (TEN); C3/C4/C5 complement components; Heat shock protein70 (HSP70)). However, Hughes’ analysis did indicatethat this arrangement could be partly due to blockduplication. Trees of these families showed that five(RXR, COL, PBX, TEN, C3/4/5) of the nine familieswith sufficient phylogenetic information could haveduplicated simultaneously, and that this timing was

96

Page 3: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

consistent with a duplication in early vertebrate his-tory 550−700 Mya. The phylogenetic analysis indi-cated that the four genes on chromosome 1 probablyduplicated as a block. Similarly, a phylogenetic anal-ysis by Endo et al. (1997) rejected the hypothesis thatthe 11 gene pairs on chromosomes 6 and 9 wereduplicated in a single event, but did support the si-multaneous duplication of six of the pairs. However,analysis of the remaining genes showed that the ABCtransporter genes diverged before the origin ofeukaryotes, the PSMB and the HSP70 gene familiesboth originated before the divergence of animals andfungi, and the Notch genes diverged before the originof deuterostomes (Hughes, 1998). Obviously thesegene families did not arise as part of a block dupli-cation event at the base of the vertebrate lineage.However, it can still be argued that these results areconsistent with block duplication of this region if oneassumes that there was an ancient tandem duplicationof some of these genes, and after block duplicationthere was differential loss of one of the tandems, sothat the divergence date of paralogues on two differ-ent chromosomes is that of the tandem duplicationevent rather than of the block duplication event(Kasahara et al., 1996; Smith et al., 1999).

HSA 4, 5, 8, and 10

Pébusque et al. (1998) reported the presence of paral-ogous genes on human chromosomes 4, 5, 8, and 10.In contrast to the analysis of the genes around theMHC discussed above, this study was based on acombination of phylogenetic and map-based methods.These genes are linked on the human chromosomes,with the exception that there is one family memberon each of chromosomes 2 and 20, which requiregenome rearrangements to be reconciled with a blockduplication event. The phylogenetic analyses consis-tently showed that these gene family membersdiverged in the vertebrate lineage and so are consis-tent with the 2R hypothesis of genome duplication.This conclusion was criticised by Martin (1999) whopointed out that the gene trees of the ankyrin familyand the EGR (early growth response) family indicateddifferent histories for their host chromosomes. Theankyrin gene tree groups chromosome 4 and 10 to theexclusion of chromosome 8, whereas the EGR genetree groups 8 and 10 to the exclusion of all others.This contradicts the expectation that the family mem-bers on each chromosome have had a shared historysince the block duplication event.

HSA 2, 7, 12, and 17

The quadruplication of the Hox cluster is the touch-stone of the 2R hypothesis. There are four colinearHox clusters in the vertebrate genome (Kappen et al.,1989), but only one in the invertebrate chordate am-phioxus (Garcia-Fernandez and Holland, 1994). Phy-logenetic analysis of the clusters showed that theyduplicated early in vertebrate history. It seems certainthat these clusters duplicated en bloc. The question iswhether they arose by genome duplication events, orby sub-genomic duplication events, or a mixture ofboth. In the analysis of Zhang and Nei (1996) Hoxclusters C and D were grouped with a high bootstrap,but there is not enough information in the alignmentsof the 61 amino acids of the homeodomain to resolvethe phylogeny further. Instead, Bailey et al. (1997)analysed the relationship of the linked fibrillar-typecollagen genes, which presumably shared the sameduplication history. Assuming the collagen geneshave a shared history with the Hox clusters, then theresults can be interpreted as a topology (out-group(HoxD(HoxA(HoxB,HoxC)))), which contra-dicts the grouping of HoxC and HoxD found byZhang and Nei (1996). Furthermore, this is contraryto the expectations of the 2R hypothesis, which pre-dicts a symmetric topology, but may be explained bythree rounds of genome duplication with loss of 4clusters, or by independent cluster duplications(Bailey et al., 1997).

In a phylogenetic analysis of the human Hox-bear-ing chromosomes (2, 7, 12, 17) Hughes et al. (2001)examined 35 gene families with members on at leasttwo of the Hox chromosomes. 15 of these familiescould be classified as either pre-vertebrate, or post-mammalian duplicates and so are inconsistent withthe 2R hypothesis. For the remaining 17 gene fami-lies the tree topologies did not exclude duplication atthe same time as the Hox clusters. There were 15 ofthese for which the molecular clock was not rejectedand estimates for the divergence dates of these genefamilies were calculated. Six of the gene familieswere dated to within the time of divergence of theHox clusters, 528−750 Mya (as defined by lineagedivergences), and two others had divergence esti-mates that were not significantly different from thetime of Hox duplication. Phylogenies of gene fami-lies with members on at least three of the four Hoxbearing chromosomes did not reveal a common topol-ogy for the relationship of these chromosomes.

97

Page 4: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

Other regions

Some of the supposed paralogous regions of the ver-tebrate genome that can be found listed in the litera-ture are based on rather sparse evidence. For exampleGibson and Spring (2000) list human chromosomesX, 4, 5, and 11 as a possible paralogous quartet basedonly on the presence of members of two gene fami-lies (alpha-amino−3-hydroxy−5-methyl−4-isoxazole-propionic acid (AMPA) and androgen / mineralocortic-oid / glucocorticoid / progesterone nuclear receptors)on all of these chromosomes.

Testing the (AB)(CD) topology prediction

In its simplest form, the hypothesis of two rounds ofgenome duplication predicts a symmetric (A,B)(C,D)phylogenetic tree topology (where A, B, C, D, repre-sent any four-membered gene family), with the ageof the AB split the same as the age of the CD split,thus displaying the history of successive genomeduplications. The alternative hypothesis, that ofsequential gene duplication, will not always predict asymmetric topology. Under a sequential duplicationmodel a four-membered family must arise from theduplication of one member of a three membered fam-ily. There is only one possible topology for threesequences, namely (A(C,D)) (Figure 1). Duplicationof gene A will result in a symmetric topology, andduplication of either C or D will result in an asym-metric topology. Assuming that all three genes areequally likely to be duplicated, sequential gene dupli-cation will give rise to a symmetric (A,B)(C,D)topology 1/3 of the time, and an asymmetric topol-ogy (A((B,C)D)) or (A(C(B,D))) the remaining 2/3 ofthe time. Thus, the null hypothesis is that the sym-metric topology should be found in 1/3 of trees, andnot 1/5 as proposed by Gibson and Spring (2000).

Hughes (1999) and Martin (2001) employed simi-lar methodologies to test the phylogenies of genefamilies listed as exemplars of the one-to-four rule(Sidow, 1996; Spring, 1997) for congruence with the2R hypothesis (i.e., whether or not they displayed asymmetric topology, and if they duplicated in the ver-tebrate lineage). The symmetric topology was onlyobserved in a small minority of the cases (one out ofnine trees in Hughes [1999]; and two out of ten treesin Martin [2001]), although in Martin’s analysis sevenof the eight minimum-length trees that were not sym-

metric were not significantly shorter than a symmet-ric tree.

Variations on the 2R hypothesis result in differentpredictions for the phylogenies of vertebrate genefamilies. For example, if vertebrate genome doublingoccurred by allopolyploidy (i.e., hybridisation of twospecies, as has been suggested; (Spring, 1997)) or bysegmental allopolyploidy (i.e., behaving as an auto-polyploid at some loci, and as an allopolyploid atothers) then a single genome doubling event will pro-duce paralogues with two different coalescence dates(Gaut and Doebley, 1997; Wolfe, 2001). Alternativemodels hypothesise that the two rounds of genomeduplication may have occurred in short successionand thus not allowing the diploidisation proceduretime to complete before the second genome duplica-tion event. This would result in some tetrasomic loci,and some octasomic loci, in the quadruplicated ge-nome (Gibson and Spring, 2000).

Diploidisation

Diploidisation is a natural consequence of polyploidy.With some rare exceptions (e.g., some loci of recentsalmonid tetraploids; Allendorf and Thorgaard, 1984)all hypothesised paleopolyploid genomes havereverted to disomic inheritance at all loci. There is anincreased incidence of non-disjunction of chromo-somes when they form multivalents rather thandivalents, so selection for increased fertility probablycauses the reinstatement of disomic inheritance(Allendorf and Thorgaard, 1984).

Immediately after autotetraploidy all loci in thegenome will be tetrasomic. These duplicated geneswill not separate into two independently divergingloci until disomic inheritance is established (Ohno,1970). This is important for our interpretation of whata paleopolyploid genome should look like becauseone of the properties we test in assessing genomeduplication is the synchronicity of divergence ofduplicated loci. Depending on the manner and speedof diploidisation this may or may not be an appropri-ate test for a paleopolyploid genome. In a diploidorganism, chromosomes are arranged in pairs at mei-osis (i.e., chromosomes are bivalent). These pairs canexchange segments of DNA by recombination, anddrift and gene conversion maintain a high degree ofsimilarity between most alleles. In a tetraploidgenome, chromosomes are arranged in tetravalents,

98

Page 5: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

rather than pairs, at meiosis. Diploidisation can bereduced to a problem of chromosome association. Bywhat mechanism does a genome convert from form-ing chromosome quartets to forming chromosomepairs, i.e., from tetraploid to diploid behaviour?

The answer to this question probably lies in adeeper understanding of the mechanisms of chromo-some association. Is chromosome sequence diver-gence a cause or a consequence of diploidisation? Ifchromosome association occurs by homologoussequence attraction, then sequence divergence (bychromosome rearrangements) will cause diploidisa-tion of chromosomes. On the other hand, if chromo-some association is controlled by some other mecha-

nism, such as attraction of homologous centromeresor telomeres, then chromosomal rearrangements mayallow the independent evolution of the relocated lociand their previous partners in a tetrasomic locus, asseparate loci without actually causing the diploidisa-tion of the chromosomes in question.

The mammalian Y chromosome may serve as amodel for this process. It is an unusual chromosomebecause it is partially diploid (at the pseudoautosomalregion), and the rest is haploid. Lahn and Page (1999)identified homologous genes on the human X and Ychromosomes, which would have been part of thesame locus when these chromosomes behaved auto-somally (the sex chromosomes are thought to have

Figure 1. Alternative phylogenetic tree topologies of four-membered families resulting from sequential gene duplication or genome dupli-cation. Grey arrows indicate the nodes that are critical to define the symmetry or asymmetry of the topology. (A) Phylogenetic tree topolo-gies resulting from duplication of one member of a three-membered gene family. Three different trees result. The tree from the duplicationof gene C and that from the duplication of gene D have asymmetric topologies. (B) Phylogenetic tree topologies resulting from two wholegenome duplication (WGD) events. All genes are duplicated at each step, resulting in a symmetric tree topology.

99

Page 6: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

evolved from autosomes; (Graves, 1996)). They mea-sured the amount of divergence at synonymous sites(Ks) between homologous gene pairs. From this theyfound that the homologues were in four age classesarranged sequentially along the X chromosome. Theyinterpreted this as the result of inversions of largesections of the Y chromosome, leaving the X intact,which had the effect of suppressing recombinationbetween these portions of the chromosomes. Thesechromosomes have diverged substantially, and mostof the Y chromosome loci are haploid. The X and Ychromosome still pair at meiosis (at the pseudoauto-somal region), and thus behave like diploid chromo-somes, yet most of the loci are haploid. It may be thecase that chromosomal tetravalency and locus tetra-somy can be separated in the same way.

The wheat genome (Triticum aestivum) is hex-aploid, with three contributory genomes (A, B, andD). There is evidence for genetic control of chromo-some association in wheat through the Ph1 locus onchromosome V of the B genome (Riley and Kem-panna, 1963). In the presence, but not the absence, ofa particular allele of this locus, non-homologousassociated centromeres separate at the beginning ofmeiosis (Martinez-Perez et al., 2001). The Ph1 locusprobably acts to amplify the differences between non-homologous chromosomes.

The most widely accepted hypothesis is that dip-loidisation proceeds by structural divergence of chro-mosomes. Allendorf and Thorgaard (1984) discuss amodel whereby some loci may appear disomic whileothers apparently segregate tetrasomically. In theirmodel they assumed that chromosome pairingoccurred at the telomeres, but it can easily be modi-fied to assume centromere association as indicatedfrom the wheat study (Martinez-Perez et al., 2001).The model of residual tetrasomic inheritance hypoth-esises that there are two stages of chromosome pair-ing. The first stage will allow pairing betweenhomoeologous chromosomes (partially similar chro-mosomes), thereby allowing recombination eventsbetween paralogous loci on different chromosomes.The second stage of pairing in this hypothesisresolves non-homologous chromosome pairing, andensures that each gamete receives one copy of eachchromosome in the normal manner. Evidence in sup-port of this model comes from the observation ofMartinez-Perez et al. (2001) that some non-homolo-gous centromeres are associated just before the begin-ning of meiosis. This model predicts that loci closerto the point of association of the chromosomes (i.e.,

closer to the centromere) will retain residual tetra-somic inheritance longer than others. For any locus,the likelihood that it behaves disomically rather thantetrasomically in a particular meiosis will be corre-lated with its distance from the centromere.

Paralogon searches in the human genomesequence

In our laboratory we have begun to analyse the draftsequence of the human genome for evidence ofancient large-scale duplications, such as might beexpected under the 2R hypothesis, and describe ourapproach here. The detection of paralogous chromo-somal blocks (termed ‘paralogons’ by Popovici et al.(2001)) essentially involves the search for closelygrouped sets of genes with homologues that occur inclose vicinity in one or more other locations withinthe genome. Thus, the basic requirements for paral-ogon detection consist of the position and thesequence information of all genes. For a thorough andprecise analysis the most complete and accurate setof human genes together with their map position isdesirable. Further annotational data such as genedescriptions can add valuable background informa-tion, particularly about gene functions. A final versionof the human genome sequence will probably not beavailable until 2003 (International Human GenomeSequencing Consortium, 2001), but several groupsare striving to annotate the genome sequence in itscurrent state. We have chosen to use data releasesfrom Ensembl (Hubbard et al., 2002), a joint projectbetween EMBL-EBI and the Sanger Institute whichaims to develop a software system for automaticannotation of eukaryotic genomes, because it standsout in several respects:– Ensembl employs an ‘open source’ philosophy,

which allows public insight into every detail of thedata assembly procedure. This can be very valuableto trace back steps that lead to decisions in the geneprediction process.

– Strategies and plans are discussed openly througha mailing list. This offers the chance to receiveearly information on data-related issues and alsoallows for interaction with the developers.

– The data are provided in the common SQL data-base format, which facilitates their integration intoa local computer system.

– Frequent version releases provide a constant updateof information.

100

Page 7: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

– Most importantly, all data and programs are freelyavailable and can be used without any restrictions.In the genome annotation process, Ensembl inte-

grates information of known proteins from SP-TREMBL (Bairoch and Apweiler, 2000). The Gene-Wise program (Birney and Durbin, 2000) convertsthese to exon structures on the human sequence.Additionally, new genes are detected ab initio throughthe GenScan program (Burge and Karlin, 1997).Functional annotation is derived from the InterPro,OMIM and SAGE databases. This results in a set ofpredicted and confirmed genes, the latter of whichnumbered 27,615 in Ensembl data release version 1.0.Since sequencing of the human genome is still inprogress, this also affects the annotation process,which leads to regular data updates. Ensembl pro-vides chromosomal locations for most of the con-firmed genes through integration of mapping datafrom the ‘Golden Path’, an arrangement of BAC-cloned human sequences assembled and maintainedby the University of California at Santa Cruz (Kentand Haussler, 2001). The underlying data compriseapproximately 830 Mb of finished sequences and2,300 Mb of draft sequences. A further 100 Mb hadnot been sequenced at the time of the data freeze.Data in the draft stage have only been sequenced onceor twice and contain basepair ambiguities, gaps andsegments with unknown order or orientation. Highquality sequence data are expected from tenfold cov-erage.

Sequence similarity search

Establishing homology relationships among genes inan automated fashion is based on sequence similaritysearches. The first step in the analysis thereforerequires the comparison of all human proteins witheach other. We included invertebrate proteomes (nem-atode and fly) in the search database to act as anapproximate natural orthology threshold (any humanproteins that are less similar than an invertebrate pro-tein to the human query protein probably duplicatedbefore the invertebrate-vertebrate lineage divergence,and so are not relevant to the 2R hypothesis). We car-ried out BLASTP searches, running on a 20-nodeBeowulf cluster, with the SEG filter to exclude lowcomplexity regions. For organising the large volumeof resulting query/hit pairings, we found the freelyavailable MySQL database system very useful.

ORF collapsing

Tandem gene duplications are usually relativelyrecent evolutionary events and occur quite frequently.They can artificially inflate gaps between pairs of par-alogues, and can inflate the number of hits reportedbetween different chromosomes (or chromosomalregions). We therefore attempted to detect and col-lapse tandem duplicates. Similar to the methodapplied by Vision et al. (2000) in an analysis of theA. thaliana genome, all genes within a close neigh-bourhood that show strong sequence similarity toeach other were removed from the map and replacedby a single representative, i.e., the longest peptide ofeach group was retained. Any BLASTP hits to a genethat is part of a tandem array were ‘redirected’ to thesingle remaining gene representing the array.

Parameter optimisation

The paralogon detection process has to take intoaccount evolutionary events like inversions, rear-rangements, deletions and mutations, which are likelyto obfuscate traces of genome duplication. A programwas developed that is controlled by parameters to dealwith gaps between pairs of duplicates, high-copy pro-tein families, and the distinction between spurioussimilarities and true homologies. The values for theseparameters greatly determine the outcome of the pro-gram. Too strict a set of values will mostly showhighly conserved or recent blocks, where only closelygrouped genes with very strong similarity aredetected. By contrast, a relaxed definition of paral-ogons can lead to inflated block sizes and numbers asa result of inclusion of insignificant pairings. In theend a trade-off between selectivity and sensitivitymust be chosen. We carried out extensive tests withdifferent combinations of parameters to determinesuitable parameter values.

Paralogon detection results

A paralogon was ‘built’ starting from an anchor: apair of homologous genes at different chromosomallocations. This was extended by including proteinpairs on these chromosomes that were positioned nofurther than 30 genes distance from another proteinincluded in the paralogon. Hits with a BLASTPexpectation threshold higher than 1e−7 whereexcluded, as well as proteins with more than 20 hits.

101

Page 8: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

The resulting paralogons range in sizes from only twopairs of duplicated genes to up to 29. Some of theseparalogons, particularly the smaller ones, might havearisen from chance constellation of similar genes. Todetermine statistical significance, the paralogondetection method was applied to an artificial genomein which chromosome number and size has beenretained but where genes have been assigned randomlocations. Using the same sets of parameters and re-peating the shuffling and detection 1000 times al-lowed an estimation of significant block sizes. Theresults indicate that paralogons occur much more fre-quently in the real genome than expected by chance.Paralogons defined by at least six duplicated geneswere in excess of 50 standard deviations more fre-quent in the real genome than expected from the sim-ulations. The only alternative hypothesis that could fitthese data is selection for clustering of these genes ona chromosome, as has been suggested for the mam-malian MHC gene complex and the Surfeit locus(Hughes, 1999).

A web-based, interactive user interface was devel-oped to allow navigation of duplicated regions andzooming between chromosomal and gene level. Anexample of the graphical presentation of a paralogonis shown in Figure 2. The web graphic contains inte-grated links that lead to more detailed information forgenes and their similarity search results, as well as toexternal databases like Ensembl or GenBank. Onlythe paired duplicates are shown but intermittent genescan be switched on for closer inspection. The numberof paralogues within a block lies typically between 9and 23 percent of the genes that are covered by theregion. This gives an indication of the large amountof evolutionary changes that happened since theoccurrence of the duplication event.

Blocks with sizes of six or greater cover more than44% of the human genome. Some of the largestregions are found on the four Hox chromosomes 2/7/12/17. This is also one of the few examples whereduplications among four locations were found. Thegraphical overview of these chromosomes shown inFigure 3 proves the effectiveness of the block detec-tion algorithm: the paralogons exactly cover the po-sition of the Hox clusters, which are often used as aprime example of duplications within the humangenome. Additionally, the covered areas expand pre-viously reported regions and indicate more extensiveduplications than formerly estimated.

Hughes et al. (2001) recently reported widely dif-fering duplication dates for 42 gene families having

members on the Hox-bearing chromosomes. Com-parison of their results to ours shows that 139 of the175 genes used in their study were present in ourgenome dataset, but only 31 of these genes (and afurther 9 associated tandem repeats) form links thatmake up our paralogons; the other possible pairs wereremoved by our chordate-specific filters. Our paral-ogons (containing three or more duplicated genes) onthe Hox chromosomes include a total of 426 dupli-cated genes (i.e., 395 genes not included in Hugheset al.). There is no disagreement between the two setsof observations; chromosomes 2/7/12/17 containsome large paralogons formed by chordate-specificduplications, as well as many members of gene fami-lies formed by older duplications.

Beyond known examples, our analysis also uncov-ered interesting new duplications such as a regionshared between chromosomes 8/14/16/20 in whichcopines and matrix metalloproteinases are found inclose vicinity. Figure 4 provides a detailed view of thecore areas of the duplicated regions. Copines form arecently discovered family of calcium-dependent,phospholipid-binding proteins that are suggested tobe involved in membrane trafficking (Tomsig andCreutz, 2000). The metalloproteinases found in theirvicinity are classified as the transmembrane subtypeof the membrane-type MMPs (MT-MMPs) (Sato etal., 1997; Kojima et al., 2000). A literature searchproduced no results that indicate a connectionbetween these two groups, which is not surprising,considering that research on copines is in its earlystages. Their colocation in the same blocks, togetherwith their membrane-association, suggests some kindof relationship, in particular because the copine andtransmembrane MT-MMPs gene families are bothsmall. These highly significant results provide aninteresting base for a separate research project.

Comparison with Celera data

The only equally comprehensive report on paralogousblocks in human so far can be found in a study whichis part of the private-sector human genome project ledby Celera (Venter et al., 2001). A version of the pro-gram MUMmer (Delcher et al., 1999), modified toalign protein sequences, was used to carry outintragenomic comparison based on the Celerasequence data. Results are presented as one largegraphic (Fig. 13 in Venter et al., 2001), which showsparalogous regions for each chromosome. Blockswere defined by at least three linked genes. Due to

102

Page 9: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

lack of details a comprehensive comparison with ourparalogons is not possible. The only feasible methodconsists of counting the presence or absence of linksbetween each pair of chromosomes for both data sets:Of the 276 possible chromosome pairs, our methoddetected 151. We detected 55 regions that were notfound in the analysis by Venter et al., and we did not

detect any relationship between 21 pairs of chromo-somes for which they illustrated pairings. Pairings be-tween chromosomes 18 and 20 were provided in moredetail and are shown in Figure 5 together with thecorresponding blocks detected by our method. Theoverall appearance of cross-links seems to be thesame except for a region near the centre of chromo-

Figure 2. Paralogous block between human chromosomes 1 and 9. This is the largest paralogous block detected in the human genome in-cluding 29 duplicated genes. Blocks can be viewed at wolfe.gen.tcd.ie/dup.

103

Page 10: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

some 20. The segment from gene ‘GATA rel’ to ‘Kruprel’ on chromosome 20 in the MUMmer graph mightcorrespond to the far end of chromosome 20 (� 64Mb) in our graph, because both seem to be connectedto roughly the same region on chromosome 18. Atranslocation such as this could occur from differ-ences in the assembly process. Large discrepanciesexist in the underlying data: Celera reports 217 pro-tein assignments on chromosome 18 and 322 on chro-

mosome 20. This corresponds to 388 and 748 proteinsin the Ensembl data. Venter et al. state that their anal-ysis found 64 protein pairs in the blocks betweenchromosome 18 and 20, and that these blocks have aduplicate gene density of 20–30%. In our case fourblocks of sizes 6, 7, 7 and 8 are detected which link atotal of 29 and 28 genes on chromosome 18 and 20,respectively. The density of involved genes rangesfrom 12–39% with a median at 19.7%. Unfortunately,

Figure 3. Blocks between Hox chromosomes. All detected paralogous blocks containing 6 or more duplicated genes between human chro-mosomes 2, 7, 12 and 17 are shown. Blocks can be viewed at wolfe.gen.ted.ie/dup.

104

Page 11: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

only 7 of the reported gene names (TGIF, VAPA,VAPB, NFATC1, NFATC2, KCNG2, KCNB1) matchbetween the two data sets so a more detailed com-parison was not possible. A possible explanation fordifferences between both graphs can be found in therecent finishing of chromosome 20 by Deloukas et al.(2001), where discrepancies between the private andthe public sequence affecting order of genes and chro-mosomal blocks were discussed.

A question of parsimony

One way in which the genome duplication hypothesisis more parsimonious than alternative hypotheses that

explain the distribution of paralogues in the genomeis in the number of words it takes to describe it, a factwhich may be related to its popularity as a hypothe-sis. Austin Hughes has challenged the assumption thatblock duplication is the most parsimonious way togenerate paralogous regions within a genome using aparsimony statistic. The statistic considers the relativeparsimony of the hypothesis that paralogous regionswere made by a block duplication event (perhaps aspart of a whole genome duplication event), or thealternative hypothesis that they are the result of tan-dem duplication of genes followed by translocation(Hughes, 1998; Hughes et al., 2001). Following Guand Huang (2002) we refer to these as the ‘BD’ (blockduplication) model and ‘TD’ (tandem duplication)

Figure 4. Blocks with copines and MMPs. The core areas of duplicated regions between human chromosomes 8, 14, 16 and 20 are shown.A copine (CPNE) and a matrix metalloproteinase (MMP) can be found in each of them in close vicinity (protein ENSP00000219193 onchromosome 16 is predicted to be a copine by the Ensembl automated annotation system). Linked genes are drawn and labelled in blue.Genes with links to two or three other blocks are highlighted in red and green, respectively. Intermediate genes are filled with grey.

105

Page 12: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

model, respectively. Hughes found for both the Hoxcluster regions and the chromosome 1/6/9/19 regionthat the TD model was more parsimonious than theBD model as an explanation for the observed geneorders and phylogenetic trees. However, his reason-

ing may have been flawed as shown below. Theapplicability of the parsimony statistic to the pale-opolyploid Arabidopsis genome has also been chal-lenged by Gu and Huang (2002).

Figure 5. Paralogous regions between human chromosomes 18 and 20 defined by two different methods. Lines drawn between the chromo-somes connect paralogues. (A) From Venter et al. (2001), showing labels for a selection of genes. (B) From our analyses with all duplicatedgenes labelled.

106

Page 13: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

Here we consider the simple case of a singlegenome duplication. The BD model has an inbuiltdisadvantage in the parsimony count method ofHughes (1998; 2001). Each gene that is no longerpresent in duplicate is counted as an individual dele-tion event (or equally, as a single translocation eventremoving it from the scope of detection as part of aparalogous region). By contrast, the method is verygenerous to the TD model, assuming that a singletranslocation event brings each gene to its current po-sition within a paralogous region. In fact, as shownbelow, Hughes’ TD model will always require fewerevents than a BD model, so long as fewer than 1/3 ofgenes are retained in duplicate.

Let G be the number of genes in the pre-duplica-tion genome. Let q be the proportion of the pre-dup-lication genome retained in duplicate in the moderngenome, and p be the proportion in single copy.

Then q � p � 1 (1)

and Gq � Gp � G (2)

Gq is the number of genes retained in duplicate.The TD model requires Gq tandem duplication

events and a further Gq translocation events, totalling2Gq events. The BD model requires one large dupli-cation event (in this example, a genome duplication)followed by Gp gene deletion events (the number ofgenes seen in single copy). For these two hypothesesto have an equal number of events (i.e., to be equallyparsimonious) then:

2Gq � 1 � Gp (3)

Replace p with 1�q from Equation 1:

→ 2Gq � 1 � G�1 � q� (4)

→ q � 1/3G � 1/3 (5)

For genomes with a large number of genes (e.g.,G � 6000 for yeast):

→ q � 1/3 (6)

The TD model will be more parsimonious than thegenome duplication (BD) model if q � 1/3, i.e.,whenever the retention of genes in duplicate is lessthan 1/3 of the pre-duplication genome. This result isalso apparent from the work of Gu and Huang (2002)who separately analysed 103 duplicated blocks in the

Arabidopsis genome. Their Figure 2 shows empiri-cally that the TD model is more parsimonious in the94 blocks having q � 1/3, whereas the BD model ismore parsimonious only in the remaining 9 blockswith q � 1/3. A retention frequency (q) of 1/3 corre-sponds to a duplication level of 50% in the post-duplication genome (i.e., 50% of genes in the moderngenome will have polyploidy-derived paralogues).

The above calculations were based on Hughes’assumption that genes are deleted individually, i.e.,that only one gene is deleted per deletion event. Itmay be more biologically realistic to allow for sev-eral neighbouring genes to be duplicated in a singleevent. If d is the average number of genes deleted ina gene deletion event, then Equation 4 can be re-phrased as:

2q � 1/G � �1 � q�/d (7)

→ 2q � �1 � q�/d (8)

→ d � 1/2q � 1/2 (9)

Solving Equation 9 for different values of q shows theaverage size of a deletion event that is required forthe two hypotheses to have equal probability for dif-ferent frequencies of duplicate gene retention(Table 1).

One of the observations of the well-documentedcase of paleopolyploidy in yeast was that only 8% ofthe pre-duplication genome was retained in duplicate(Seoighe and Wolfe, 1998). For q � 0.08 the averagesize of a deletion event (d) needs to be 6 genes orlarger (Table 1) to favour the genome duplicationhypothesis by the simple parsimony statistic. Intu-itively this seems like a biologically feasible size.Indeed it seems more acceptable than anotherassumption built-in to the alternative tandem-duplica-

Table 1. Crossover values for q (proportion of genes retained induplicate) and d (average number of genes deleted in a singleevent) at which the genome duplication and TD models are equallyparsimonious (calculated from equation 9).

q d

0.33 1.0

0.20 2.0

0.10 4.5

0.08 5.8

0.05 9.5

0.01 49.5

107

Page 14: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

tion and translocation model, i.e., that selection cancreate genomic regions with similar gene contents byfavouring particular translocations (Hughes, 1999).

We consider an example from the yeast genome.Within a block first described by Pohlmann and Phil-ippsen (1996), and later numbered block 39 by Wolfeand Shields (1997) there are six duplicated genes and20 unduplicated genes (eight on chromosome XIVand 12 on chromosome IX). Under Hughes’ TDmodel the formation of this block would require 12steps (six tandem duplications, and six transloca-tions). Under a whole genome duplication model,with each gene deleted individually, the formation ofthis block would require 21 events (one wholegenome duplication, and 20 gene deletions) andwould thus be less parsimonious by this statistic.However, if each deletion event included on averagethree genes then only seven deletion events would berequired to explain the current state of this paralogousregion, and the block duplication model would bemore parsimonious. Thus it appears that, even in thewell-documented case of yeast, which Hughes andcolleagues agree is a likely polyploid (Friedman andHughes, 2001), this simple parsimony statistic is notappropriate to determine the relative probability ofparalogous region formation by block duplication ver-sus tandem duplication and translocation.

Discussion

The approach described here acknowledges two im-portant aspects of the genome duplication hypothesis.Ones is that it proposes a whole genome event, andanecdotal evidence for individual gene families isunlikely to result in a firm conclusion – we havetherefore analysed the whole human genome. Theother important aspect is that genome duplication isnot proposed as the only mechanism of gene familyexpansion – we have therefore removed obvious tan-dem duplicates from the analysis, and deliberatelylimited this study to look at the mechanisms of gene-family expansion in the vertebrate lineage. Nobodyrealistically expects that all gene families that exist inthe vertebrate genome were singletons before verte-brate origins. What is detectable as a gene family, i.e.,a group of paralogous genes, will often include mem-bers representing a long evolutionary history, onlysome of which is vertebrate specific. Studies wheregene families are included indiscriminately willdoubtless include diverse evolutionary histories, and

closer inspection will reveal a straw man easy toknock down.

Our map-based method found paralogons coveringover 44% of the human genome. These are mostprobably vertebrate specific, and are distributedthroughout the genome. This can be explained by ei-ther extensive sub-genomic duplications, or by poly-ploidy. An additional phylogenetic analysis carriedout recently on the same data set indicates a signifi-cant accumulation of duplication activity during a rel-atively short period between 350 and 650 Mya(McLysaght et al., 2002). Both findings together lendsupport to the hypothesis of one polyploidy eventearly in the vertebrate lineage. There is no specificevidence for two as opposed to one tetraploidy event,or for auto- rather than allotetraploidy. Neither docurrent findings from investigations of the one-to-fourrule or tree topologies give clear signals for any ofthese scenarios. Further refinement of the humangenome sequence and complete gene identificationwill hopefully enable more precise analyses in the fu-ture. It is also likely that the genome sequencingproject for the tunicate Ciona will contribute usefuldata in the near future. However, in view of theimmense time span under consideration and the hugecomplexity of genomic changes that might haveoccurred, it remains unsure if the complete history ofancient duplication events that led to the currentshape of the human genome will ever be revealed.

References

Allendorf, F.W. and Thorgaard, G.H. (1984) Tetraploidy and theevolution of salmonid fishes. In Evolutionary Genetics ofFishes (Ed. Turner, B.), Plenum Press, New York, NY, pp. 1–46.

Bailey, W.J., Kim, J., Wagner, G.P. and Ruddle, F.H. (1997) Phy-logenetic reconstruction of vertebrate Hox cluster duplications.Mol. Biol. Evol., 14, 843–853.

Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein se-quence database and its supplement TrEMBL in 2000. NucleicAcids Res., 28, 45–48.

Birney, E. and Durbin, R. (2000) Using GeneWise in the Droso-phila annotation experiment. Genome Res., 10, 547–548.

Burge, C. and Karlin, S. (1997) Prediction of complete gene struc-tures in human genomic DNA. J. Mol. Biol., 268, 78–94.

Comings, D.E. (1972) Evidence for ancient tetraploidy and conser-vation of linkage groups in mammalian chromosomes. Nature,238, 455–457.

Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O.and Salzberg, S.L. (1999) Alignment of whole genomes. Nu-cleic Acids Res., 27, 2369–2376.

108

Page 15: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

Deloukas, P., et al. (2001) The DNA sequence and comparativeanalysis of human chromosome 20. Nature, 414, 865–871.

.Endo, T., Imanishi, T., Gojobori, T. and Inoko, H. (1997) Evolu-tionary significance of intra-genome duplications on humanchromosomes. Gene, 205, 19–27.

Flajnik, M.F. and Kasahara, M. (2001) Comparative genomics ofthe MHC: glimpses into the evolution of the adaptive immunesystem. Immunity, 15, 351–362.

Friedman, R. and Hughes, A.L. (2001) Gene duplication and thestructure of eukaryotic genomes. Genome Res., 11, 373–381.

Gallardo, M.H., Bickham, J.W., Honeycutt, R.L., Ojeda, R.A. andKohler, N. (1999) Discovery of tetraploidy in a mammal. Na-ture, 401, 341.

Garcia-Fernandez, J. and Holland, P.W. (1994) Archetypal organi-zation of the amphioxus Hox gene cluster. Nature, 370, 563–566.

Gaut, B.S. and Doebley, J.F. (1997) DNA sequence evidence forthe segmental allotetraploid origin of maize. Proc. Natl. Acad.Sci. USA, 94, 6809–6814.

Gibson, T.J. and Spring, J. (2000) Evidence in favour of ancientoctaploidy in the vertebrate genome. Biochem. Soc. Trans., 28,259–264.

Graves, J.A. (1996) Mammals that break the rules: genetics ofmarsupials and monotremes. Annu. Rev. Genet., 30, 233–260.

Gu, X. and Huang, W. (2002) Testing the parsimony test of genomeduplications: a counterexample. Genome Res., 12, 1–2.

Holland, P.W.H., Garcia-Fernandez, J., Williams, N.A. and Sidow,A. (1994) Gene duplications and the origins of vertebrate de-velopment. Development, Suppl. 1994, 125–133.

Hubbard, T., et al. (2002) The Ensembl genome database project.Nucleic Acids Res., 30, 38–41.

Hughes, A.L. (1998) Phylogenetic tests of the hypothesis of blockduplication of homologous genes on human chromosomes 6, 9,and 1. Mol. Biol. Evol., 15, 854–870.

Hughes, A.L. (1999) Phylogenies of developmentally importantproteins do not support the hypothesis of two rounds of genomeduplication early in vertebrate history. J. Mol. Evol., 48, 565–576.

Hughes, A.L., da Silva, J. and Friedman, R. (2001) Ancient genomeduplications did not structure the human Hox-bearing chromo-somes. Genome Res., 11, 771–780.

International Human Genome Sequencing Consortium (2001) Ini-tial sequencing and analysis of the human genome. Nature 409,860–921.

Kappen, C., Schughart, K. and Ruddle, F.H. (1989) Two steps inthe evolution of Antennapedia-class vertebrate homeoboxgenes. Proc. Natl. Acad. Sci. USA, 86, 5459–5463.

Kasahara, M. (1997) New insights into the genomic organizationand origin of the major histocompatibility complex: role ofchromosomal (genome) duplication in the emergence of theadaptive immune system. Hereditas, 127, 59–65.

Kasahara, M., Hayashi, M., Tanaka, K., Inoko, H., Sugaya, K.,Ikemura, T. and Ishibashi, T. (1996) Chromosomal localizationof the proteasome Z subunit gene reveals an ancient chromo-somal duplication involving the major histocompatibility com-plex. Proc. Natl. Acad. Sci. USA, 93, 9096–9101.

Katsanis, N., Fitzgibbon, J. and Fisher, E.M.C. (1996) Paralogymapping: identification of a region in the human MHC tripli-cated onto human chromosomes 1 and 9 allows the predictionand isolation of novel PBX and NOTCH loci. Genomics, 35,101–108.

Kent, W.J. and Haussler, D. (2001) Assembly of the working draftof the human genome with GigAssembler. Genome Res., 11,1541–1548.

Kojima, S., Itoh, Y., Matsumoto, S., Masuho, Y. and Seiki, M.(2000) Membrane-type 6 matrix metalloproteinase (MT6-MMP, MMP–25) is the second glycosyl-phosphatidyl inositol(GPI)-anchored MMP. FEBS Lett., 480, 142–146.

Lahn, B.T. and Page, D.C. (1999) Four evolutionary strata on thehuman X chromosome. Science, 286, 964–967.

Lundin, L.G. (1993) Evolution of the vertebrate genome as re-flected in paralogous chromosomal regions in man and thehouse mouse. Genomics, 16, 1–19.

Martin, A. (2001) Is tetralogy true? Lack of support for the ‘one-to-four’ rule. Mol. Biol. Evol., 18, 89–93.

Martin, A.P. (1999) Increasing genomic complexity by gene dupli-cation and the origin of vertebrates. Amer. Nat., 154, 111–128.

Martin, G.R., Richman, M., Reinsch, S., Nadeau, J.H. and Joyner,A. (1990) Mapping of the two mouse engrailed-like genes:close linkage of En–1 to dominant hemimelia (Dh) on chromo-some 1 and of En–2 to hemimelic extratoes (Hx) on chromo-some 5. Genomics, 6, 302–308.

Martinez-Perez, E., Shaw, P. and Moore, G. (2001) The Ph1 locusis needed to ensure specific somatic and meiotic centromereassociation. Nature, 411, 204–207.

McLysaght, A., Hokamp, K. and Wolfe, K.H. (2002) Extensive ge-nomic duplication during early chordate evolution. Nature Gen-et., 31, 200–204.

Meyer, A. and Schartl, M. (1999) Gene and genome duplicationsin vertebrates: the one-to-four (-to-eight in fish) rule and theevolution of novel gene functions. Curr. Opin. Cell Biol., 11,699–704.

Muller, H.J. (1925) Why polyploidy is rarer in animals than inplants. Amer. Nat., 9, 346–353.

Ohno, S. (1970) Evolution by Gene Duplication, George Allen andUnwin, London, UK.

Ohno, S. (1999) Gene duplication and the uniqueness of vertebrategenomes circa 1970–1999. Semin. Cell Devel. Biol., 10, 517–522.

Pébusque, M.-J., Coulier, F., Birnbaum, D. and Pontarotti, P. (1998)Ancient large scale genome duplications: phylogenetic andlinkage analyses shed light on chordate genome evolution. Mol.Biol. Evol., 15, 1145–1159.

Pohlmann, R. and Philippsen, P. (1996) Sequencing a cosmid cloneof Saccharomyces cerevisiae chromosome XIV reveals 12 newopen reading frames (ORFs) and an ancient duplication of sixORFs. Yeast, 12, 391–402.

Popovici, C., Leveugle, M., Birnbaum, D. and Coulier, F. (2001)Coparalogy: Physical and functional clusterings in the humangenome. Biochem. Biophys. Res. Commun., 288, 362–370.

Riley, R. and Kempanna, C. (1963) The homeologous nature of thenon-homologous meiotic pairing in Triticum aestivum deficientfor chromosome V (5B) Heredity, 18, 287–306.

Sato, H., Tanaka, M., Takino, T., Inoue, M. and Seiki, M. (1997)Assignment of the human genes for membrane-type–1, –2, and–3 matrix metalloproteinases (MMP14, MMP15, and MMP16)to 14q12.2, 16q12.2-q21, and 8q21, respectively, by in situ hy-bridization. Genomics, 39, 412–413.

Seoighe, C. and Wolfe, K.H. (1998) Extent of genomic rearrange-ment after genome duplication in yeast. Proc. Natl. Acad. Sci.USA, 95, 4447–4452.

109

Page 16: The 2R hypothesis and the human genome sequencetion in an organism with a chromosomal basis of sex-determination (such as that of mammals) will result in sterility of the heterogametic

Sidow, A. (1996) Gen(om)e duplications in the evolution of earlyvertebrates. Curr. Opin. Genet. Devel., 6, 715–722.

Skrabanek, L. and Wolfe, K.H. (1998) Eukaryote genome duplica-tion – where’s the evidence? Curr. Opin. Genet. Devel., 8, 694–700.

Smith, N.G.C., Knight, R. and Hurst, L.D. (1999) Vertebrate ge-nome evolution: a slow shuffle or a big bang? BioEssays, 21,697–703.

Spring, J. (1997) Vertebrate evolution by interspecific hybridisation– are we polyploid? FEBS Lett., 400, 2–8.

Tomsig, J.L. and Creutz, C.E. (2000) Biochemical characterizationof copine: a ubiquitous Ca2 � -dependent, phospholipid-bindingprotein. Biochemistry, 39, 16163–16175.

Venter, J.C. et al. (2001) The sequence of the human genome. Sci-ence, 291, 1304–1351.

Vision, T.J., Brown, D.G. and Tanksley, S.D. (2000) The origins ofgenomic duplications in Arabidopsis. Science, 290, 2114–2117.

Wolfe, K.H. (2001) Yesterday’s polyploids and the mystery of dip-loidization. Nature Reviews Genet., 2, 333–341.

Wolfe, K.H. and Shields, D.C. (1997) Molecular evidence for anancient duplication of the entire yeast genome. Nature, 387,708–713.

Zhang, J. and Nei, M. (1996) Evolution of Antennapedia-class ho-meobox genes. Genetics, 142, 295–303.

110