4
Dispatch R311 Mouse genomics: Making sense of the sequence Ian J. Jackson Interpretation of the human genome sequence relies on studies of model genetic organisms. Mouse genetics and genomics will help to identify all the genes, and to determine their function. Address: MRC Human Genetics Unit, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK. Current Biology 2001, 11:R311–R314 0960-9822/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. For once the hype surrounding the publication of the draft human genome sequence [1,2] is justified; practi- cally all of the important human DNA now resides in publicly accessible databases. Of course this fantastic resource generates innumerable questions, but these boil down to two fundamental ones. Where are the genes? And what does each gene do? Ongoing work on the mouse genome will provide impor- tant leads to answering these questions. One commonly used method for identifying genes in genomic sequence is to find the sequence represented in cDNA libraries, and a recently published collection of mouse cDNA sequences adds substantially to these [3]. Furthermore, the similarity between mouse and human genomes means that the respective genomic DNA sequences can be easily aligned, and the most highly conserved segments (mostly corre- sponding to coding exons) readily spotted. Several mouse genome sequencing projects are at various stages of pro- duction of qualitatively different datasets, all of which will be invaluable for annotating the human sequence. Finally, a recently announced international consortium aims to discover a function for all mouse genes, and by homology for all human genes [4]. The genome projects have brought to biology a new modus operandi; just as molecular technologies revolutionised cell and develop- mental biology 15 or 20 years ago, so genomic approaches will fundamentally change the way we carry out biological experimentation. Mouse as a genetic model Many organisms are valid genetic models of humans. If a human gene has a clearly identifiable equivalent in another sequenced genome, then useful information should be gained from studying the model organism. There are over 1,000 genes that are present in single copy in the human, nematode worm and fruitfly genomes [1]. On the whole, however, the gene content in invertebrate models appears to be quite different from human. For example, humans encode 616 G-protein-coupled receptors, whereas flies have 146 and worms, 284 [2]. In many cases it is just not possi- ble to find a gene that is equivalent in a comparison of human to invertebrates. By contrast, the mouse is evolu- tionarily much closer to humans and its gene content largely identical. The mouse and human genomes are derived from a common mammalian ancestor, and as few as 100 chromosomal rearrangements separate each genome from this ancestral genome [1,5]. Genes linked in one species are often linked in the other; and genomic sequence comparisons indicate that gene order is typically conserved over many megabases of DNA [1]. More mouse cDNAs Expressed sequence tags, or ESTs, are short sequence reads, typically of a few hundred bases, from the ends of cDNA clones. A few caveats aside, these represent transcripts and therefore genes, and so have been invaluable for finding genes in genomic sequence. Millions of ESTs have been produced from hundreds of cDNA libraries. Pro- jects such as Unigene (www.ncbi.nlm.nih.gov/Unigene) have used automated methods to assign ESTs into clus- ters on the basis of sequence matches, and these clusters provide one estimate of gene number. These are overesti- mates, partly because a substantial fraction of ESTs derive from genomic DNA rather than cDNA, and partly because multiple, non-overlapping clusters can derive from the same gene. Furthermore, ESTs by their nature are short ‘tags’, which often do not contain the protein coding segment of the transcript, and so do not help in catalogu- ing functional gene content, and as they do not have con- served features — coding potential — they are often not useful for cross-species comparisons. A better representation of mouse cDNA has been produced by human curation and annotation by the Genome Exploration Group at RIKEN, Japan [3]. In this project, almost one million mouse cDNA sequence tags from numerous libraries were clustered, from which about 21,000 clones were sequenced. These sequences contained redundancies identifiable by cross comparisons, and further redundancies in which non-overlapping sequences derived from the same known gene. By extrapolating the incidence of this latter redundancy across their collection of novel sequences, the authors could estimate that they have representatives of just under 13,000 unique genes. RIKEN hosted an annotation ‘jamboree’ at which curators examined the sequences and, where possible, assigned definitions to the cDNA on the basis of likely function or similarity to known genes. A key tool in the annotation is the vocabulary developed by the Gene Ontology (GO)

Mouse genomics: Making sense of the sequence

Embed Size (px)

Citation preview

Page 1: Mouse genomics: Making sense of the sequence

Dispatch R311

Mouse genomics: Making sense of the sequenceIan J. Jackson

Interpretation of the human genome sequence relies onstudies of model genetic organisms. Mouse geneticsand genomics will help to identify all the genes, and todetermine their function.

Address: MRC Human Genetics Unit, Western General Hospital,Crewe Road, Edinburgh, EH4 2XU, UK.

Current Biology 2001, 11:R311–R314

0960-9822/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved.

For once the hype surrounding the publication of thedraft human genome sequence [1,2] is justified; practi-cally all of the important human DNA now resides inpublicly accessible databases. Of course this fantasticresource generates innumerable questions, but these boildown to two fundamental ones. Where are the genes? Andwhat does each gene do?

Ongoing work on the mouse genome will provide impor-tant leads to answering these questions. One commonlyused method for identifying genes in genomic sequence isto find the sequence represented in cDNA libraries, and arecently published collection of mouse cDNA sequencesadds substantially to these [3]. Furthermore, the similaritybetween mouse and human genomes means that therespective genomic DNA sequences can be easily aligned,and the most highly conserved segments (mostly corre-sponding to coding exons) readily spotted. Several mousegenome sequencing projects are at various stages of pro-duction of qualitatively different datasets, all of which willbe invaluable for annotating the human sequence.

Finally, a recently announced international consortiumaims to discover a function for all mouse genes, and byhomology for all human genes [4]. The genome projectshave brought to biology a new modus operandi; just asmolecular technologies revolutionised cell and develop-mental biology 15 or 20 years ago, so genomic approacheswill fundamentally change the way we carry out biologicalexperimentation.

Mouse as a genetic modelMany organisms are valid genetic models of humans. If ahuman gene has a clearly identifiable equivalent in anothersequenced genome, then useful information should begained from studying the model organism. There are over1,000 genes that are present in single copy in the human,nematode worm and fruitfly genomes [1]. On the whole,however, the gene content in invertebrate models appearsto be quite different from human. For example, humans

encode 616 G-protein-coupled receptors, whereas flies have146 and worms, 284 [2]. In many cases it is just not possi-ble to find a gene that is equivalent in a comparison ofhuman to invertebrates. By contrast, the mouse is evolu-tionarily much closer to humans and its gene contentlargely identical. The mouse and human genomes arederived from a common mammalian ancestor, and as fewas 100 chromosomal rearrangements separate each genomefrom this ancestral genome [1,5]. Genes linked in onespecies are often linked in the other; and genomicsequence comparisons indicate that gene order is typicallyconserved over many megabases of DNA [1].

More mouse cDNAsExpressed sequence tags, or ESTs, are short sequencereads, typically of a few hundred bases, from the ends ofcDNA clones. A few caveats aside, these representtranscripts and therefore genes, and so have been invaluablefor finding genes in genomic sequence. Millions of ESTshave been produced from hundreds of cDNA libraries. Pro-jects such as Unigene (www.ncbi.nlm.nih.gov/Unigene)have used automated methods to assign ESTs into clus-ters on the basis of sequence matches, and these clustersprovide one estimate of gene number. These are overesti-mates, partly because a substantial fraction of ESTs derivefrom genomic DNA rather than cDNA, and partly becausemultiple, non-overlapping clusters can derive from thesame gene. Furthermore, ESTs by their nature are short‘tags’, which often do not contain the protein codingsegment of the transcript, and so do not help in catalogu-ing functional gene content, and as they do not have con-served features — coding potential — they are often notuseful for cross-species comparisons.

A better representation of mouse cDNA has beenproduced by human curation and annotation by theGenome Exploration Group at RIKEN, Japan [3]. In thisproject, almost one million mouse cDNA sequence tagsfrom numerous libraries were clustered, from which about21,000 clones were sequenced. These sequences containedredundancies identifiable by cross comparisons, andfurther redundancies in which non-overlapping sequencesderived from the same known gene. By extrapolating theincidence of this latter redundancy across their collectionof novel sequences, the authors could estimate that theyhave representatives of just under 13,000 unique genes.RIKEN hosted an annotation ‘jamboree’ at which curatorsexamined the sequences and, where possible, assigneddefinitions to the cDNA on the basis of likely function orsimilarity to known genes. A key tool in the annotation isthe vocabulary developed by the Gene Ontology (GO)

Page 2: Mouse genomics: Making sense of the sequence

Consortium [6]. GO annotations assign to gene productsstandard terms that describe the biological process, themolecular function and the cellular component or locationof the product. GO terms are intended to enable genefunction and content information to be readily inter-pretable across species.

This cDNA resource has already proved useful in measur-ing the gene content of the draft human sequence. TheInternational Human Genome Sequencing Consortiumattempted to compile an index of genes from the availablesequence [1]. They derived a list of over 31,000 predic-tions, almost 15,000 of which are known genes with about17,000 predicted by various methods (which probablyhave a fairly high false-positive rate). When the RIKENset was compared to the 31,000 predicted genes, 69% ofthem showed sequence similarity. If the same RIKENsequences were used to search the total human sequence,81% found a match. So 81% of the mouse cDNA setdetects a match against the whole human genomesequence, but only 69% pick up hits to the human geneindex, indicating that the human gene index underrepre-sents the gene set and contains 69/81 or 85% of the mousecDNA collection. The reverse comparison — of the humanset to the RIKEN collection — found 69% were repre-sented, and for known genes was 78%. So the comparisonsindicate that both collections of genes are incomplete, butthere are some problems in deciding how incomplete. Themouse cDNAs were selected to bias for novel genes, butwe do not know the effect of that bias on overall represen-tation in the collection (although we know that only 78%of known genes are present).

Mouse genome sequencingThese comparisons, as well as other analyses, are the basisfor the surprisingly low prediction for the human genenumber of 32,000 (another, higher estimate based on thesame data by F.A. Wright et al. can be found in an elec-tronic preprint available at genomebiology.com). A firmerestimate will come from doing a whole genome compari-son of human to mouse. All the current methods used topredict genes in genomic sequence are subject to error.Methods using matches to cDNA may underestimatebecause of incomplete representation in the libraries,whereas ab initio methods produce overestimates thatmust be tempered by additional evidence, such as similar-ity to already described genes, which in turn will overcom-pensate and miss truly novel genes. Mouse genomicsequence will give a new and powerful means of findinggenes. The human and rodent lineages diverged suffi-ciently long ago that only sequences subject to selection,such as exons, will retain extensive similarity.

Coincident with the publication of the draft humansequence, a consortium funded by public, charity andindustrial sources released the first batch of the mouse

genome sequence, with the intention of releasing three-fold coverage by April 2001. This sequence is a wholegenome shotgun, which essentially means it is randomreads, each of a few hundred base pairs, from throughoutthe genome, and these will not overlap into largercontiguous segments to any significant degree. Instead, theintention is that the mouse data will align along the humansequence, indicating conserved sequence. This is currentlyviewable at the Ensembl web site (www.ensembl.org). Atthe moment, these mouse matches should be treated withcaution, but indications are that they will be useful cross-species sequence tags, whose location in the humangenome is defined.

The mouse genome is also being sequenced at a higherlevel of coverage in a clone-by-clone approach. Much ofthe genome will be completed at ‘draft’ level in 2001, andcertain regions have been targeted for production offinished sequence by October 2002 [7]. These sequenceswill enable a large-scale overview of sequence similaritybetween mouse and human genomes, and identification oflikely conserved exons and other features. Figure 1 showsa ‘percent identity plot’ of a 100 kilobase region of theX chromosomes of mouse and human [8,9]. Exons ofknown genes are clearly distinguished by the higherpercent identity compared to surrounding DNA, and puta-tive novel genes may be identified.

More mouse mutationsMouse studies will also be of key importance in providinginformation about gene function. The phenotype of amouse with a mutation in a single gene provides clear evi-dence of at least one function of that gene. There areseveral ways that such mutant mice can be made, the mostwidely used over the past decade being targeting genesby homologous recombination in embryonic stem cells.Several thousand genes have so far been mutagenised inthis way, but this is a long way short of the total number inthe genome, and the technique is very labour intensive.Methods have been developed to accelerate the stem cellapproach, in particular the use of gene traps which causemutations by the random insertion of marker DNA, viawhich the insertion site can be sequenced to identify thedisrupted gene before a mutant mouse is generated. Thisis a genotype-driven technology, in that the identity of thegene is known before the mutation is made.

A complementary, phenotype-driven approach is to createpoint mutations at random and identify mice with infor-mative phenotypes from the progeny. The availabilityof the mouse genome sequence should permit the mutatedgenes to be readily identified. The last few years hasseen the development of numerous phenotype-drivenprogrammes, utilising the powerful mutagen ethylnitrosourea. Initial studies, in the UK and in Germany,have catalogued several hundred new dominant mutations

R312 Current Biology Vol 11 No 8

Page 3: Mouse genomics: Making sense of the sequence

conferring a range of behavioural, developmental andother phenotypes [10,11].

More ENU programmes are now underway throughout theworld. The NIH has funded several large collaborative pro-jects which are looking for dominant and recessive muta-tions that affect development, nervous system functionand complex behaviour [7]. Screens elsewhere in the worldare also expanding to find recessive mutations. The mousegenetics community has recognised that it is now possibleto set the goal of creating a collection of mouse lines thattogether have a mutation in each gene of the genome. Arecent publication in Science [4] marks the formation of theInternational Mouse Mutagenesis Consortium, whichbrings together geneticists from across the world with thecommon goal of assigning a function to every gene.

The genomic view of biologyAbout 20 years ago, the techniques of molecular biologybegan to be used to study cell biology and developmentalbiology. It was more than the methodology that was broughtto bear, but a particular philosophy, which was that complexprocesses could be reduced to simple, tractable, interactions.Now, another fundamental change is underway in biologywith the advent of genomic techniques. Mass collectionsof data, whether sequence, expression profiles, proteincontent, molecular interactions or mutations generateinformation on whole systems rather than on isolated parts.The philosophy is that we can gain meaningful informationfrom problems whose answers are too large and complex tobe written in a lab notebook, and can only reside in a com-puter. Biologists will have to change the way we think aboutexperiments to take advantage of these resources.

Dispatch R313

Figure 1

Percent identity plot (PIP) comparing a regionof mouse and human X chromosomes [8,9].The human sequence is represented on theabscissa and percentage sequence identity isplotted on the ordinate. The symbols abovethe plot represent features of the humansequence, including confirmed and putativeexons which are depicted as numbered blackboxes. ECRA1–ECRA3 are evolutionarilyconserved regions that may predict a gene.Calractin and Nsdh1 are known mouse andhuman genes. (Figure from [9].)

0k 20k2k 4k 6k 8k 10k 12k 14k 16k 18k

50%

100%

75%

20k 40k22k 24k 26k 28k 30k 32k 34k 36k 38k

50%

100%

75%

exon 6.4f2

1

ECRA1-A3

A1

40k 60k42k 44k 46k 48k 50k 52k 54k 56k 58k

50%

100%

75%

A3

Caltractin

12345

Nsdhl

1

60k 80k62k 64k 66k 68k 70k 72k 74k 76k 78k

50%

100%

75%

Nsdhl

2 3

80k 100k

Current Biology

82k 84k 86k 88k 90k 92k 94k 96k 98k

50%

100%

75%

Nsdhl

4 5 6 7 8

A2

ECRA1-A3

magea9

Page 4: Mouse genomics: Making sense of the sequence

References1. International Human Genome Sequencing Consortium:

Initial sequencing and analysis of the human genome. Nature2001, 409:860-921.

2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG,Smith HO, Yandell M, Evans CA, Holt RA, et al.: The sequence ofthe human genome. Science 2001, 291:1304-1351.

3. The RIKEN Genome Exploration Research Group Phase II Team andthe FANTOM Consortium: Functional annotation of a full-lengthmouse cDNA collection. Nature 2001, 409:685-689.

4. The International Mouse Mutagenesis Consortium: Annotatinggenome sequences with biological functions in mice. Science2001, 291:1251-1255.

5. Nadeau JH, Taylor, BA: Lengths of chromosomal segmentsconserved since divergence of mouse and man. Proc Natl AcadSci USA 1984, 81:814-818.

6. The Gene Ontology Consortium: Gene ontology: tool for theunification of biology. Nat Genetics 2000, 25:25-29.

7. Graham B, Battey E, Jordan E: Report of second follow-upworkshop on priority setting for mouse genomics. Mamm Genome2001, 12:1-2.

8. Hardison RC, Oeltjen J, Miller W: Long human-mouse sequencealignments reveal novel regulatory elements: a reason tosequence the mouse genome. Genome Res 1997, 7:959-966.

9. Mallon AM, Platzer M, Bate R, Gloeckner G, Botcherby MR,Nordsiek G, Strivens MA, Kioschis P, Dangel A, et al.: Comparativegenome sequence analysis of the Bpa/Str region in mouse andman. Genome Res 2000, 10:758-775.

10. Nolan PM, Peters J, Strivens M, Rogers D, Hagan J, Spurr N, Gray IC,Vizor L, Brooker D, Whitehill E, et al.: A systematic, genome-wide,phenotype-driven mutagenesis programme for gene functionstudies in the mouse. Nat Genetics 2000, 25:440-443.

11. Hrabé de Angelis M, Flaswinkel H, Fuchs H, Rathkolb B, Soewarto D,Marschall S, Heffner S, Pargent W, Wuensch K, Jung M, et al.:Genome-wide, large-scale production of mutant mice by ENUmutagenesis. Nat Genetics 2000, 25:444-447.

R314 Current Biology Vol 11 No 8