Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Comparative genomics
Lucy SkrabanekICB, WMC6 May 2008
What does it encompass?• Genome conservation
– transfer knowledge gained from model organismsto non-model organisms
• Genome evolution– understand how genomes change over time in
order to identify evolutionary processes andconstraints
• Genome variation– understand how genomes vary within a species to
identify genes central to particular processes
Main uses• Whole genome comparisons
– Genome evolution• Coding regions comparisons
– Gene prediction– Gene structure (exon-intron) prediction– Function prediction
• Non-coding region comparisons– Regulatory region discovery
• Protein-protein interaction prediction
Metazoan evolutionary tree
Ureta-Vidal et al, NRG 2003
Rearrangement rate• Can estimate the number of rearrangements
from cytological comparisons• Two very different rates:
– Very slow rate of rearrangement (1 or 2exchanges per 10 MYR)
• ~ 7 rearrangements between the human genome fromthe hypothetical primate ancestor
• 13 rearrangements between cat and human
Rearrangement rate• Punctuated by abrupt global genome
rearrangement in some lineages– Gibbons and siamangs rearranged 3 or 4 times
more extensively than human or other greatapes
– Rodent species exhibit very rapid patterns ofchromosome change
• ~180 conserved segments between mouse and human• ~100 conserved segments between rat and human
Human, catand mouseX chromosomecomparison
O’Brien et al, Science 1999
Genome comparison across speciesgross changes in chromosome number
O’Brien et al, Science 1999
Genome alignment• Very different from single gene or
protein alignments• Standard DPAs are too expensive• Made complicated by extensive
rearrangements of large homologoussegments
Problems• Looking for syntenic regions• Rearrangements disrupting syntenic
regions– Insertions– Deletions– Inversions– Translocations– Duplications
Assumptions• The two genomes to be aligned derived
from a common ancestor• There remains sufficient similarity
between the genomes to enable easyidentification of homologous regions
• For the alignment to be informative,there has to have been time for thegenomes to diverge and for selection tohave occurred
Algorithm requirements• Genome alignment algorithms must
– Scale linearly (computationally)– Be robust (not too many parameters)– Be memory-efficient– Be able to handle rearrangements, gene
duplications, repetitive elements• Smith-Waterman, Needleman Wunsch
– Time to do calculation of the order of (O)4
• Not feasible for sequence length > 10,000 bp
– Cannot handle rearrangements or inversions
Alignment methods• Seeding methods (e.g., BLASTZ, BLAT,
EXONERATE)– Produce “local” alignments– All matches found (including all paralogs)
• Very sensitive, not very specific– Very fast
• Anchor-based methods (e.g., MUMmer,GLASS, AVID)– Produce “global” alignments– Specific– Difficulty with rearrangements
Ureta-Vidal et al, Nat Rev Genet April 2003
Local aligner: BLASTZ• As in BLAST, find seeds
– Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions forstrict matches are specified
– 1110100110010101111 combination shown to bethe most sensitive (and more sensitive than the11-consecutive match seed strategy used byBLAST)
– Also allows one transition in 1/12 strict matchpositions
– Seeds with many matches masked out (assumedto be repetitive regions)
BLASTZ• Gap-free extension• Further extend seeds by DPA, allowing gaps
– Low complexity regions have their scoresspecifically down-weighted
• Repeat above steps (using a more sensitiveseed, e.g., 7-mer) for regions that lie betweenmatches, that share the same order andorientation, and are separated by <50 kb
• Post-processing to remove multiple hits to thesame region
• When aligning human and mouse genomes,can achieve 98% alignment of known codingregions
Local aligner: BLAT• Two modes: untranslated and translated
– Untranslated mode performs poorly when conservation < 90%so translated mode usually used when aligning genomesWill more efficiently identify regions that are conserved for their
coding ability rather than for the regulatory functions
• Seeds created by building an index of non-overlapping 5-mers from one genome– Frequent 5-mers and ambiguous sequences (repeats and
low-complexity regions) excluded• The other genome is chopped into <200kb chunks
– Comparisons are made between the indexed 5-mers and thechunks
• DPA applied both upstream and downstream ofmatches
Global aligner: AVID• Assumes that no gene duplications,
inversions or translocations have occurred– Need a pre-processing step to identify syntenic
regions - use BLAT hits, filtered for specificity• Find ‘maximally repeated matches’ in
syntenic regions– Matches are flanked on either side by mismatches
• Determine an ‘anchor set’– Exclude all maximally repeated matches that are
less than half the length of the longest match– Allow only non-overlapping non-crossing matches– Use ‘clean’ matches first, then ‘repeat’ matches
Blue + Red: all maximally repeated matchesRed: anchor set
Bray et al, Genome Res 2003
• n anchors ⇒ n+1 regions to be aligned• Add any matches that were discarded because theywere too short on the first round• Add any new anchors• When there are no more anchors to be added, use theNW algorithm to align the intervening sequence
– If the regions are longer than ~ 4kb, the alignmentis returned with a gap in both sequences at thatposition
Global aligner: LAGAN• Different matching algorithm to AVID
– Generates local alignments– Allows mismatches
• Find highest weightchain
• Compute NWalignment, limiting itto area aroundchains
• MLAGAN - multipleglobal alignment tool
Global aligner: WABA• Takes into account degeneracy of genetic
code• Seeding step similar to BLAT
– Uses the two weighted-spaced seed 6of811011011, where the third position is allowed tomismatch
• In the alignment of homologous regions, usesa seven-state pair HMM which also allowsmismatches in the third position
• Finally, join overlapping matches from thealignment phase
Visualization tools:VISTA vs. PipMaker
• VISTA (VISualization Tools for Alignments)– Uses AVID to generate alignments– Uses a sliding window approach
• Plots percent identity within a fixed window size, at regularintervals
• PipMaker (Percent Identity Plot)– Uses BLASTZ– X-axis is the reference sequence; horizontal lines represent
gap-free alignments
• Once we have a macro-alignment, we canstudy the evolution of genomes betweenspecies, and also can trace the evolutionaryhistory of the structure of each genome itself
• Genome structure rearrangement• Gene duplication, chromosomal duplication
and polyploidization (whole genomeduplication)– New genes
Unravelling the history of the genome• Plants
– Wheat• Allohexaploid (AABBCC)
– Maize• Diploidized allotetraploid• Has grown 12x in the past 5 MY due to increased
numbers of transposable elements– Arabidopsis
• Haploid, 5N
• Yeast– Saccharomyces cerevisiae vs. Kluyveromyces
lactis• Human
– Genome duplication?
Decipher history of lineages• Genome size changes
– Compaction• Large scale deletions - Fugu• Intron loss
– Expansion• Duplication• Transposable element insertions
• Transposons - Alus, LINEs– Ancient insertions prior to eutherian radiation– More recent insertions - maize
Baxendale et al, Nature Genet, 1995
Why gen(om)e duplication?• Duplicated genes provide a source for
genetic novelty during evolution– Either member of a duplicated gene pair can
diverge to either• Acquire a new function which may be positively selected
for• Subfunctionalize• Be differently regulated (e.g., tissue-specific)• Become a pseudogene• Be deleted
• Whole genome duplication allows for theduplication (and subsequent divergence) ofwhole pathways at a time
Duplication events• Resolution of polyploidy
– Non-disjunction of chromosomes (formmultivalents instead of divalents)
• Sterility– Duplicated genes do not start to diverge (or
get deleted) until disomic inheritanceresolved
Rearrangements• Are genome rearrangements the cause or
consequence of diploidization?– Most widely accepted hypothesis is that
diploidization proceeds by structural divergence ofchromosomes
– Some loci appear disomic, others tetrasomic• Stage 1: pairing between similar chromosomes allowed
– Loci near the centromeres can display residual tetrasomy• Stage 2: non-homologous chromosome pairing resolved
Just how hard can it be to tell what happened?
“Take four, or maybe eight, decks of 52 playing cards.Shuffle them all together and then throw some cardsaway. Pick 20 cards at random and drop the rest on thefloor. Give the 20 cards to some evolutionary biologistsand ask them to figure out what you’ve done.”
Skrabanek and Wolfe, Curr Opin Genet Dev, 1998
Now that the human genome has been sequenced, thingsaren’t quite so bleak.
Effects of polyploidization
Wolfe, Nature Rev Genet 2001
Effects of polyploidization
Wolfe, Nature Rev Genet 2001
Saccharomyces cerevisiae (baker's yeast)
Whole GenomeDuplication (WGD)
WGD in S. cerevisiae: Wolfe & Shields, Nature 387:708 (1997)
Yeast
• Saccharomyces cerevisiae– Degenerate tetraploid– Polyploidy followed by extensive deletion and
(70-100) reciprocal translocations– 8% of genes duplicated in 55 blocks (plus
many missed smaller blocks)– Relative orientation of genes in blocks
conserved with respect to the centromere
Seoighe and Wolfe, PNAS, 1998
Duplicatedblocks in
yeast
Wolfe and Shields, Nature 1997
Estimation of time of polyploidy event
• Diverged from Kluyveromyces lactis(unduplicated) ~150 MYA– Comparison of gene sequences and gene order
reveals conservation• 59% of adjacent gene pairs in K. lactis or K. marxianus
are also adjacent in S. cerevisiae• 16% of Kluveromyces neighbors can be explained in
terms of inferred ancestral gene order– Phylogenetic analyses of duplicated genes where both
the Kluveromyces orthologue and an outgrouporthologue were available, deduced that thepolyploidization event in S. cerevisiae occurred around100 MYA
Kellis et al Nature 428:617-624 (2004)
10%retention
5,000 genes
10,000 genes
5,500 genes
5,000 genes
Different subsets retained
Evidence from conservedorder of a very few genes
Evidence from interleavinggenes from sister segments
Each region of K.waltiimatches two regions ofS.cerevisiae
We don’t even needany remaining two-copy genes to inferthe ancestral order
Human• 1970 - Susumu Ohno proposed that vertebrate
genomes had originated via genome duplication• 2R (two Rounds [of genome duplication])
hypothesis– One before, and one after divergence of agnathans
(lamprey) from tetrapods (~430 MYA and~500 MYA)
– Popular and controversial• Split between the map-based people and the tree-
based people• Duplicated regions are evident (covering ≈ 44% of
the genome), but it is difficult to tell whether it isdue to (a) genome duplication(s) or chromosomalduplications
Susumu Ohno1928-2000
Book:Evolution by GeneDuplication(1970)
Whole-genome duplication (polyploidy)
Timing of tetraploidy events
Skrabanek, PhD thesis, 1999
Evidence?• There are regions in the genome which are
quadruplicated– HSA 1, 6, 9 and 19 (MHC region, 10 gene
families)– HSA 4, 5, 8 and 10– HSA 2, 7, 12 and 17 (Hox clusters)
• Expect to see (A,B)(C,D) tree, where the timeof divergence of A from B and C from D isapproximately the same– However, this is not consistently the case
Proposed evolutionof the Hox clusters
Skrabanek, PhD thesis, 1999
Possible routes to 4-gene families
Hokamp et al, J Struct Func Genomics 2003
Hughes, Mol Biol Evol 1998
Estimates ofdivergence timesfor genes in theMHC region onchromosomes1, 6, 9 and 19
Hughes et al, Genome Res, 2001
Divergence timesof genes with
members on atleast two of the
Hox clusterbearing
chromosomes(2, 7, 12, 17)
Hughes et al, Genome Res, 2001
Phylogenetic relationships of four- and three-membered gene families on Hox cluster bearingchromosomes
Conclusions• Duplicated regions seen in human genome are
most likely vertebrate specific• Significant amount of duplication occurring
~350-650 MYA– Possible to explain large margin by alloploidy?
• Probable support for at least one wholegenome duplication event
• Once more genomes are available, such asCiona intestinalis or amphioxus, it may beeasier to decipher the history of duplication– However, long time spans under consideration, and
diploidization requires extensive genomic changes
PLOS Biology 3:e314 (2005)
Panopoulou and Poustka, TIG 21:559, 2005
Fish-specific genome duplication event
Lander et al - Intl. Human Genome Sequencing Consortium paperNature 409:860 (2001)
Hypothetical genomic region
1R
2R
decay
decay
Find most similar vertebrate gene (here M1) to the Ciona gene. Other vertebrate genes areadded to the cluster if they are more similar to M1 than M1 is to the Ciona gene.
S1
> S1
< S1
Duplicates may have arisen byspeciation (lineage-splitting) or bygene duplication events specific toone or more vertebrates
Fugu-specific duplication
Human-specific duplication
Finding all homologs, and only homologs
Number of gene duplications
46.6% of ancestral chordate genes are duplicated in ≥1 lineage34.5% with at least one duplication before Fugu-tetrapod split23.5% with at least one duplication after Fugu-tetrapod split
No evidence of 2R hypothesis from gene family membership alone, nor from phylogenetics (sinceduplications abundant on every branch)
Paralogs generated by a gene duplication before the Fugu-tetrapod split count as matches. N-fold redundancy calculated by identifying all cases where ≥ 2 different duplicates are within a100-gene window, and then counting the number of times that their paralogs occur within a 100-genewindow elsewhere in the genome.
2R
4-fold redundancy most common - accounts for 25% of the genome
1,912 genes duplicated priorto fish-tetrapod split 2,953 paralogous genepairs 32.4% are found in 386detectable paralogoussegments, comprising 772individual genomic segments 454 are tetra-paralogons,where overlapping sets fall into4-fold groups
Of the genes that duplicatedafter the fish-tetrapod split,only 11% are found inparalogous regions (i.e.,duplications after the split didnot include large segments ofthe genome)
Could be of recent origin, orcould have undergone multiplerearrangement events thatdestroyed the tetra-paralogonssignal
Old duplications Recent duplications
Hox-bearing chromosomes:50% of genes duplicated after the fish-tetrapod split are tandem duplicates (separated by< 10 genes), whereas only 6% of genes duplicated before the split are tandem duplicates.
Conclusions• 2R hypothesis most likely scenario• Two rounds of closely spaced auto-
tetraploidization events– Some paralogous pairs within tetraparalogons
extend over longer regions than others• So octaploidy unlikely, since pairs of regions would have
had to lose the same sets of genes– Phylogenetic trees are not consistently nested
• So allotetraploidy or two distantly spaced autotetraploidyevents unlikely
– Tree topologies within paralogous blocks notalways congruent
• Gene loss and rediploidization processes probablyspanned the two duplication events
Identification of functional elements• Coding sequences
– Relatively easy to identify• Many gene prediction programs available• General gene structure known, e.g.,
– TATA boxes– Splice donor-acceptor sites
– ESTs and cDNAs available to aid gene prediction• Non-coding functional sequences
– Much harder to identify– No common structure of regulatory regions– TF binding sites are short and ubiquitous
• Comparative genomics– A genomic sequence that provides a function that is under
selection and tends to be conserved between species iscalled a “functional sequence” (e.g., a protein-coding regionor a transcription factor binding site)
Coding region comparison• Discover new genes
– Annotate gene structure• Exon-intron structure
• Compare gene content– Find genes common to sets of organisms– Find genes unique to an organism
• We can study how genomes vary within aspecies to identify genes central to particularprocesses– Can also compare subspecies e.g., E. coli K12 and
O157:H7 (pathogenic)• Discover gene function• Can find missing genes in metabolic pathways
Nature 420:520, 2002
Predicting structure of ‘new’ genes• Many gene finding programs
– Look for start sites, termination sites, splice sites– Analyze codon usage, exon length
• Comparative method A– Find syntenic regions– Predict genes using conventional gene finding
techniques in both species– Genes predicted in both species are probable
Gene prediction• Comparative method B
– Find syntenic regions– Perform pattern filtering
• Coding exons tend to be well conserved• Conservation higher in first and second positions of
codons– Advantage: can deal with sequencing errors
Identification of paralogues and orthologuesin other species
• Best Reciprocal Hit (BRH)
• Reciprocal Hit by Synteny (RHS)– Identification of adjacent orthologues
• Domain checking - internal qualitycontrol
Mouse Humanhttp://www.nbn.ac.za/
Identification of orthologues
OrphanOthers
Matches to someother chromosome
Human
BRH
Mouse
RHS
http://www.nbn.ac.za/
Genes in mouse shared with other organisms
Nature 420:520, 2002
Non-coding regioncomparison
• Regulatory region discovery– No accurate methods to identify regulatory
sequences based on scanning the DNA sequencealone
• Putative TF binding sites found everywhere• Low specificity
– Assume that regulatory regions are conservedbecause they serve a vital function
Rationale
• DNA sequences encoding and regulating theexpression of essential proteins and RNAswill be conserved
• Consequently, the regulatory profiles ofgenes involved in similar processes amongrelated species will be conserved
• Conversely, sequences that encode orcontrol the expression of proteins or RNAsresponsible for differences between specieswill be divergent
What species?• How distant should the compared species be?
Transgenic mice tend to express the transgene ina similar manner to the source mammalianorganism, including those for which the orthologdoes not exist in mice!
• Rapidly evolving regulatoryregions (e.g., β−globin) allow comparisonsbetween closely related species
• However, some regulatory regions evolve veryslowly (e.g., T-cell receptors), so comparisonswith more distant species are required
Finding regulatory regions• Find conserved
regions betweenspecies
• Can also findconservedregions betweensimilarlyexpressed genes– Search for TF
binding sites inthese conservedregions
Pennacchio and Rubin, Nature Rev Genet 2001
Human vs. mouse• Similar gene content and linear organization
– ~340 syntenic blocks (~150 more than discoveredusing cytological studies) spanning 90% of thegenome
• Difference in genome size– Mouse genome is 14% smaller, probably due to a
higher rate of deletion• Sequence Conservation
– ~40% in alignments– ~5% under selection
• ~1.5% protein coding• ~3.5% non-coding: untranslated regions, regulatory
elements, non-protein-coding genes, and chromosomalstructural elements
Nature 420:520, 2002
Human vs. mouse genomes
Nature 420:520, 2002