ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict

Comparative genomics

Lucy SkrabanekICB, WMC6 May 2008

What does it encompass?• Genome conservation

– transfer knowledge gained from model organismsto non-model organisms

• Genome evolution– understand how genomes change over time in

order to identify evolutionary processes andconstraints

• Genome variation– understand how genomes vary within a species to

identify genes central to particular processes

Main uses• Whole genome comparisons

– Genome evolution• Coding regions comparisons

– Gene prediction– Gene structure (exon-intron) prediction– Function prediction

• Non-coding region comparisons– Regulatory region discovery

• Protein-protein interaction prediction

Metazoan evolutionary tree

Ureta-Vidal et al, NRG 2003

Rearrangement rate• Can estimate the number of rearrangements

from cytological comparisons• Two very different rates:

– Very slow rate of rearrangement (1 or 2exchanges per 10 MYR)

• ~ 7 rearrangements between the human genome fromthe hypothetical primate ancestor

• 13 rearrangements between cat and human

Rearrangement rate• Punctuated by abrupt global genome

rearrangement in some lineages– Gibbons and siamangs rearranged 3 or 4 times

more extensively than human or other greatapes

– Rodent species exhibit very rapid patterns ofchromosome change

• ~180 conserved segments between mouse and human• ~100 conserved segments between rat and human

Human, catand mouseX chromosomecomparison

O’Brien et al, Science 1999

Genome comparison across speciesgross changes in chromosome number

O’Brien et al, Science 1999

Genome alignment• Very different from single gene or

protein alignments• Standard DPAs are too expensive• Made complicated by extensive

rearrangements of large homologoussegments

Problems• Looking for syntenic regions• Rearrangements disrupting syntenic

regions– Insertions– Deletions– Inversions– Translocations– Duplications

Assumptions• The two genomes to be aligned derived

from a common ancestor• There remains sufficient similarity

between the genomes to enable easyidentification of homologous regions

• For the alignment to be informative,there has to have been time for thegenomes to diverge and for selection tohave occurred

Algorithm requirements• Genome alignment algorithms must

– Scale linearly (computationally)– Be robust (not too many parameters)– Be memory-efficient– Be able to handle rearrangements, gene

duplications, repetitive elements• Smith-Waterman, Needleman Wunsch

– Time to do calculation of the order of (O)4

• Not feasible for sequence length > 10,000 bp

– Cannot handle rearrangements or inversions

Alignment methods• Seeding methods (e.g., BLASTZ, BLAT,

EXONERATE)– Produce “local” alignments– All matches found (including all paralogs)

• Very sensitive, not very specific– Very fast

• Anchor-based methods (e.g., MUMmer,GLASS, AVID)– Produce “global” alignments– Specific– Difficulty with rearrangements

Ureta-Vidal et al, Nat Rev Genet April 2003

Local aligner: BLASTZ• As in BLAST, find seeds

– Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions forstrict matches are specified

– 1110100110010101111 combination shown to bethe most sensitive (and more sensitive than the11-consecutive match seed strategy used byBLAST)

– Also allows one transition in 1/12 strict matchpositions

– Seeds with many matches masked out (assumedto be repetitive regions)

BLASTZ• Gap-free extension• Further extend seeds by DPA, allowing gaps

– Low complexity regions have their scoresspecifically down-weighted

• Repeat above steps (using a more sensitiveseed, e.g., 7-mer) for regions that lie betweenmatches, that share the same order andorientation, and are separated by <50 kb

• Post-processing to remove multiple hits to thesame region

• When aligning human and mouse genomes,can achieve 98% alignment of known codingregions

Local aligner: BLAT• Two modes: untranslated and translated

– Untranslated mode performs poorly when conservation < 90%so translated mode usually used when aligning genomesWill more efficiently identify regions that are conserved for their

coding ability rather than for the regulatory functions

• Seeds created by building an index of non-overlapping 5-mers from one genome– Frequent 5-mers and ambiguous sequences (repeats and

low-complexity regions) excluded• The other genome is chopped into <200kb chunks

– Comparisons are made between the indexed 5-mers and thechunks

• DPA applied both upstream and downstream ofmatches

Global aligner: AVID• Assumes that no gene duplications,

inversions or translocations have occurred– Need a pre-processing step to identify syntenic

regions - use BLAT hits, filtered for specificity• Find ‘maximally repeated matches’ in

syntenic regions– Matches are flanked on either side by mismatches

• Determine an ‘anchor set’– Exclude all maximally repeated matches that are

less than half the length of the longest match– Allow only non-overlapping non-crossing matches– Use ‘clean’ matches first, then ‘repeat’ matches

Blue + Red: all maximally repeated matchesRed: anchor set

Bray et al, Genome Res 2003

• n anchors ⇒ n+1 regions to be aligned• Add any matches that were discarded because theywere too short on the first round• Add any new anchors• When there are no more anchors to be added, use theNW algorithm to align the intervening sequence

– If the regions are longer than ~ 4kb, the alignmentis returned with a gap in both sequences at thatposition

Global aligner: LAGAN• Different matching algorithm to AVID

– Generates local alignments– Allows mismatches

• Find highest weightchain

• Compute NWalignment, limiting itto area aroundchains

• MLAGAN - multipleglobal alignment tool

Global aligner: WABA• Takes into account degeneracy of genetic

code• Seeding step similar to BLAT

– Uses the two weighted-spaced seed 6of811011011, where the third position is allowed tomismatch

• In the alignment of homologous regions, usesa seven-state pair HMM which also allowsmismatches in the third position

• Finally, join overlapping matches from thealignment phase

Visualization tools:VISTA vs. PipMaker

• VISTA (VISualization Tools for Alignments)– Uses AVID to generate alignments– Uses a sliding window approach

• Plots percent identity within a fixed window size, at regularintervals

• PipMaker (Percent Identity Plot)– Uses BLASTZ– X-axis is the reference sequence; horizontal lines represent

gap-free alignments

• Once we have a macro-alignment, we canstudy the evolution of genomes betweenspecies, and also can trace the evolutionaryhistory of the structure of each genome itself

• Genome structure rearrangement• Gene duplication, chromosomal duplication

and polyploidization (whole genomeduplication)– New genes

Unravelling the history of the genome• Plants

– Wheat• Allohexaploid (AABBCC)

– Maize• Diploidized allotetraploid• Has grown 12x in the past 5 MY due to increased

numbers of transposable elements– Arabidopsis

• Haploid, 5N

• Yeast– Saccharomyces cerevisiae vs. Kluyveromyces

lactis• Human

– Genome duplication?

Decipher history of lineages• Genome size changes

– Compaction• Large scale deletions - Fugu• Intron loss

– Expansion• Duplication• Transposable element insertions

• Transposons - Alus, LINEs– Ancient insertions prior to eutherian radiation– More recent insertions - maize

Baxendale et al, Nature Genet, 1995

Why gen(om)e duplication?• Duplicated genes provide a source for

genetic novelty during evolution– Either member of a duplicated gene pair can

diverge to either• Acquire a new function which may be positively selected

for• Subfunctionalize• Be differently regulated (e.g., tissue-specific)• Become a pseudogene• Be deleted

• Whole genome duplication allows for theduplication (and subsequent divergence) ofwhole pathways at a time

Duplication events• Resolution of polyploidy

– Non-disjunction of chromosomes (formmultivalents instead of divalents)

• Sterility– Duplicated genes do not start to diverge (or

get deleted) until disomic inheritanceresolved

Rearrangements• Are genome rearrangements the cause or

consequence of diploidization?– Most widely accepted hypothesis is that

diploidization proceeds by structural divergence ofchromosomes

– Some loci appear disomic, others tetrasomic• Stage 1: pairing between similar chromosomes allowed

– Loci near the centromeres can display residual tetrasomy• Stage 2: non-homologous chromosome pairing resolved

Just how hard can it be to tell what happened?

“Take four, or maybe eight, decks of 52 playing cards.Shuffle them all together and then throw some cardsaway. Pick 20 cards at random and drop the rest on thefloor. Give the 20 cards to some evolutionary biologistsand ask them to figure out what you’ve done.”

Skrabanek and Wolfe, Curr Opin Genet Dev, 1998

Now that the human genome has been sequenced, thingsaren’t quite so bleak.

Effects of polyploidization

Wolfe, Nature Rev Genet 2001

Effects of polyploidization

Wolfe, Nature Rev Genet 2001

Saccharomyces cerevisiae (baker's yeast)

Whole GenomeDuplication (WGD)

WGD in S. cerevisiae: Wolfe & Shields, Nature 387:708 (1997)

Yeast

• Saccharomyces cerevisiae– Degenerate tetraploid– Polyploidy followed by extensive deletion and

(70-100) reciprocal translocations– 8% of genes duplicated in 55 blocks (plus

many missed smaller blocks)– Relative orientation of genes in blocks

conserved with respect to the centromere

Seoighe and Wolfe, PNAS, 1998

Duplicatedblocks in

yeast

Wolfe and Shields, Nature 1997

Estimation of time of polyploidy event

• Diverged from Kluyveromyces lactis(unduplicated) ~150 MYA– Comparison of gene sequences and gene order

reveals conservation• 59% of adjacent gene pairs in K. lactis or K. marxianus

are also adjacent in S. cerevisiae• 16% of Kluveromyces neighbors can be explained in

terms of inferred ancestral gene order– Phylogenetic analyses of duplicated genes where both

the Kluveromyces orthologue and an outgrouporthologue were available, deduced that thepolyploidization event in S. cerevisiae occurred around100 MYA

Kellis et al Nature 428:617-624 (2004)

10%retention

5,000 genes

10,000 genes

5,500 genes

5,000 genes

Different subsets retained

Evidence from conservedorder of a very few genes

Evidence from interleavinggenes from sister segments

Each region of K.waltiimatches two regions ofS.cerevisiae

We don’t even needany remaining two-copy genes to inferthe ancestral order

Human• 1970 - Susumu Ohno proposed that vertebrate

genomes had originated via genome duplication• 2R (two Rounds [of genome duplication])

hypothesis– One before, and one after divergence of agnathans

(lamprey) from tetrapods (~430 MYA and~500 MYA)

– Popular and controversial• Split between the map-based people and the tree-

based people• Duplicated regions are evident (covering ≈ 44% of

the genome), but it is difficult to tell whether it isdue to (a) genome duplication(s) or chromosomalduplications

Susumu Ohno1928-2000

Book:Evolution by GeneDuplication(1970)

Whole-genome duplication (polyploidy)

Timing of tetraploidy events

Skrabanek, PhD thesis, 1999

Evidence?• There are regions in the genome which are

quadruplicated– HSA 1, 6, 9 and 19 (MHC region, 10 gene

families)– HSA 4, 5, 8 and 10– HSA 2, 7, 12 and 17 (Hox clusters)

• Expect to see (A,B)(C,D) tree, where the timeof divergence of A from B and C from D isapproximately the same– However, this is not consistently the case

Proposed evolutionof the Hox clusters

Skrabanek, PhD thesis, 1999

Possible routes to 4-gene families

Hokamp et al, J Struct Func Genomics 2003

Hughes, Mol Biol Evol 1998

Estimates ofdivergence timesfor genes in theMHC region onchromosomes1, 6, 9 and 19

Hughes et al, Genome Res, 2001

Divergence timesof genes with

members on atleast two of the

Hox clusterbearing

chromosomes(2, 7, 12, 17)

Hughes et al, Genome Res, 2001

Phylogenetic relationships of four- and three-membered gene families on Hox cluster bearingchromosomes

Conclusions• Duplicated regions seen in human genome are

most likely vertebrate specific• Significant amount of duplication occurring

~350-650 MYA– Possible to explain large margin by alloploidy?

• Probable support for at least one wholegenome duplication event

• Once more genomes are available, such asCiona intestinalis or amphioxus, it may beeasier to decipher the history of duplication– However, long time spans under consideration, and

diploidization requires extensive genomic changes

PLOS Biology 3:e314 (2005)

Panopoulou and Poustka, TIG 21:559, 2005

Fish-specific genome duplication event

Lander et al - Intl. Human Genome Sequencing Consortium paperNature 409:860 (2001)

Hypothetical genomic region

1R

2R

decay

decay

Find most similar vertebrate gene (here M1) to the Ciona gene. Other vertebrate genes areadded to the cluster if they are more similar to M1 than M1 is to the Ciona gene.

S1

> S1

< S1

Duplicates may have arisen byspeciation (lineage-splitting) or bygene duplication events specific toone or more vertebrates

Fugu-specific duplication

Human-specific duplication

Finding all homologs, and only homologs

Number of gene duplications

46.6% of ancestral chordate genes are duplicated in ≥1 lineage34.5% with at least one duplication before Fugu-tetrapod split23.5% with at least one duplication after Fugu-tetrapod split

No evidence of 2R hypothesis from gene family membership alone, nor from phylogenetics (sinceduplications abundant on every branch)

Paralogs generated by a gene duplication before the Fugu-tetrapod split count as matches. N-fold redundancy calculated by identifying all cases where ≥ 2 different duplicates are within a100-gene window, and then counting the number of times that their paralogs occur within a 100-genewindow elsewhere in the genome.

2R

4-fold redundancy most common - accounts for 25% of the genome

1,912 genes duplicated priorto fish-tetrapod split 2,953 paralogous genepairs 32.4% are found in 386detectable paralogoussegments, comprising 772individual genomic segments 454 are tetra-paralogons,where overlapping sets fall into4-fold groups

Of the genes that duplicatedafter the fish-tetrapod split,only 11% are found inparalogous regions (i.e.,duplications after the split didnot include large segments ofthe genome)

Could be of recent origin, orcould have undergone multiplerearrangement events thatdestroyed the tetra-paralogonssignal

Old duplications Recent duplications

Hox-bearing chromosomes:50% of genes duplicated after the fish-tetrapod split are tandem duplicates (separated by< 10 genes), whereas only 6% of genes duplicated before the split are tandem duplicates.

Conclusions• 2R hypothesis most likely scenario• Two rounds of closely spaced auto-

tetraploidization events– Some paralogous pairs within tetraparalogons

extend over longer regions than others• So octaploidy unlikely, since pairs of regions would have

had to lose the same sets of genes– Phylogenetic trees are not consistently nested

• So allotetraploidy or two distantly spaced autotetraploidyevents unlikely

– Tree topologies within paralogous blocks notalways congruent

• Gene loss and rediploidization processes probablyspanned the two duplication events

Identification of functional elements• Coding sequences

– Relatively easy to identify• Many gene prediction programs available• General gene structure known, e.g.,

– TATA boxes– Splice donor-acceptor sites

– ESTs and cDNAs available to aid gene prediction• Non-coding functional sequences

– Much harder to identify– No common structure of regulatory regions– TF binding sites are short and ubiquitous

• Comparative genomics– A genomic sequence that provides a function that is under

selection and tends to be conserved between species iscalled a “functional sequence” (e.g., a protein-coding regionor a transcription factor binding site)

Coding region comparison• Discover new genes

– Annotate gene structure• Exon-intron structure

• Compare gene content– Find genes common to sets of organisms– Find genes unique to an organism

• We can study how genomes vary within aspecies to identify genes central to particularprocesses– Can also compare subspecies e.g., E. coli K12 and

O157:H7 (pathogenic)• Discover gene function• Can find missing genes in metabolic pathways

Nature 420:520, 2002

Predicting structure of ‘new’ genes• Many gene finding programs

– Look for start sites, termination sites, splice sites– Analyze codon usage, exon length

• Comparative method A– Find syntenic regions– Predict genes using conventional gene finding

techniques in both species– Genes predicted in both species are probable

Gene prediction• Comparative method B

– Find syntenic regions– Perform pattern filtering

• Coding exons tend to be well conserved• Conservation higher in first and second positions of

codons– Advantage: can deal with sequencing errors

Identification of paralogues and orthologuesin other species

• Best Reciprocal Hit (BRH)

• Reciprocal Hit by Synteny (RHS)– Identification of adjacent orthologues

• Domain checking - internal qualitycontrol

Mouse Humanhttp://www.nbn.ac.za/

Identification of orthologues

OrphanOthers

Matches to someother chromosome

Human

BRH

Mouse

RHS

http://www.nbn.ac.za/

Genes in mouse shared with other organisms

Nature 420:520, 2002

Non-coding regioncomparison

• Regulatory region discovery– No accurate methods to identify regulatory

sequences based on scanning the DNA sequencealone

• Putative TF binding sites found everywhere• Low specificity

– Assume that regulatory regions are conservedbecause they serve a vital function

Rationale

• DNA sequences encoding and regulating theexpression of essential proteins and RNAswill be conserved

• Consequently, the regulatory profiles ofgenes involved in similar processes amongrelated species will be conserved

• Conversely, sequences that encode orcontrol the expression of proteins or RNAsresponsible for differences between specieswill be divergent

What species?• How distant should the compared species be?

Transgenic mice tend to express the transgene ina similar manner to the source mammalianorganism, including those for which the orthologdoes not exist in mice!

• Rapidly evolving regulatoryregions (e.g., β−globin) allow comparisonsbetween closely related species

• However, some regulatory regions evolve veryslowly (e.g., T-cell receptors), so comparisonswith more distant species are required

Finding regulatory regions• Find conserved

regions betweenspecies

• Can also findconservedregions betweensimilarlyexpressed genes– Search for TF

binding sites inthese conservedregions

Pennacchio and Rubin, Nature Rev Genet 2001

Human vs. mouse• Similar gene content and linear organization

– ~340 syntenic blocks (~150 more than discoveredusing cytological studies) spanning 90% of thegenome

• Difference in genome size– Mouse genome is 14% smaller, probably due to a

higher rate of deletion• Sequence Conservation

– ~40% in alignments– ~5% under selection

• ~1.5% protein coding• ~3.5% non-coding: untranslated regions, regulatory

elements, non-protein-coding genes, and chromosomalstructural elements

Nature 420:520, 2002

Human vs. mouse genomes

Nature 420:520, 2002

Documents

ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict