Comparative genomics - Cornell University...Comparative genomics Lucy Skrabanek ICB, WMC 15 April...

Preview:

Citation preview

Comparativegenomics

LucySkrabanekICB,WMC

15April2010

Whatdoesitencompass?

•  Genomeconservation–  transferknowledgegainedfrommodelorganismstonon‐modelorganisms

•  Genomeevolution–  understandhowgenomeschangeovertimeinordertoidentifyevolutionaryprocessesandconstraints

•  Genomevariation–  understandhowgenomesvarywithinaspeciestoidentifygenescentraltoparticularprocesses

Mainuses

• Wholegenomecomparisons– Genomeevolution

•  Codingregionscomparisons– Geneprediction– Genestructure(exon‐intron)prediction–  Functionprediction

•  Non‐codingregioncomparisons– Regulatoryregiondiscovery

•  Protein‐proteininteractionprediction

iTOL;LetunicandBork,Bioinformatics2007

Rearrangementrate•  Canestimatethenumberofrearrangementsfromcytologicalcomparisons

•  Twoverydifferentrates:–  Veryslowrateofrearrangement(1or2exchangesper10MYR)•  ~7rearrangementsbetweenthehumangenomefromthehypotheticalprimateancestor(60MYA)

•  13rearrangementsbetweencatandhuman

Rearrangementrate

•  Punctuatedbyabruptglobalgenomerearrangementinsomelineages–  Gibbonsandsiamangsrearranged3or4times

moreextensivelythanhumanorothergreatapes–  Dogshavehighlyrearrangedgenomescompared

totheancestralCarnivoragenome–  Rodentspeciesexhibitveryrapidpatternsof

chromosomechange•  ~180conservedsegmentsbetweenmouseandhuman•  ~100conservedsegmentsbetweenratandhuman

Human, cat and mouse X chromosome comparison

O’Brienetal,Science1999

Genomecomparisonacrossspeciesgrosschangesinchromosomenumber

O’Brienetal,Science1999

Genomealignment

•  Verydifferentfromsinglegeneorproteinalignments

•  StandardDPAsaretooexpensive•  Madecomplicatedbyextensiverearrangementsoflargehomologoussegments

Problems

•  Lookingforsyntenicregions•  Rearrangementsdisruptingsyntenicregions–  Insertions– Deletions–  Inversions– Translocations– Duplications

Assumptions

•  Thetwogenomestobealignedderivedfromacommonancestor

•  Thereremainssufficientsimilaritybetweenthegenomestoenableeasyidentificationofhomologousregions

•  Forthealignmenttobeinformative,therehastohavebeentimeforthegenomestodivergeandforselectiontohaveoccurred

Algorithmrequirements

•  Genomealignmentalgorithmsmust–  Scalelinearly(computationally)–  Berobust(nottoomanyparameters)–  Bememory‐efficient–  Beabletohandlerearrangements,geneduplications,repetitiveelements

•  Smith‐Waterman,NeedlemanWunsch–  TimetodocalculationoftheorderofO(n2)

•  Notfeasibleforsequencelength>10,000bp–  Cannothandlerearrangementsorinversions

Alignmentmethods•  Seedingmethods(e.g.,BLASTZ,BLAT,EXONERATE)

–  Produce“local”alignments–  Allmatchesfound(includingallparalogs)

• Verysensitive,notveryspecific–  Veryfast

•  Anchor‐basedmethods(e.g.,MUMmer,AVID,WABA)–  Produce“global”alignments–  Specific–  Difficultywithrearrangements

•  Multiplesequencealigners(e.g.,Mauve,GRIMM‐Synteny,Shuffle‐LAGAN,Mercator)–  Designedtodealwithrearrangments

Localaligner:BLASTZ•  AsinBLAST,findseeds

–  Seedsaredeterminedbya12of19weighted‐spacedseedsstrategywherethepositionsforstrictmatchesarespecified

–  1110100110010101111combinationshowntobethemostsensitive(andmoresensitivethanthe11‐consecutivematchseedstrategyusedbyBLAST)

–  Alsoallowsonetransitionin1/12strictmatchpositions

–  Seedswithmanymatchesmaskedout(assumedtoberepetitiveregions)

BLASTZ•  Gap‐freeextension•  FurtherextendseedsbyDPA,allowinggaps

–  Lowcomplexityregionshavetheirscoresspecificallydown‐weighted

•  Repeatabovesteps(usingamoresensitiveseed,e.g.,7‐mer)forregionsthatliebetweenmatches,thatsharethesameorderandorientation,andareseparatedby<50kb

•  Post‐processingtoremovemultiplehitstothesameregion

•  Whenaligninghumanandmousegenomes,canachieve98%alignmentofknowncodingregions

Localaligner:BLAT•  Twomodes:untranslatedandtranslated

–  Untranslatedmodeperformspoorlywhenconservation<90%sotranslatedmodeusuallyusedwhenaligninggenomes Willmoreefficientlyidentifyregionsthatareconservedfortheircodingabilityratherthanfortheregulatoryfunctions

•  Seedscreatedbybuildinganindexofnon‐overlapping5‐mersfromonegenome–  Frequent5‐mersandambiguoussequences(repeatsandlow‐complexityregions)excluded

•  Theothergenomeischoppedinto<200kbchunks–  Comparisonsaremadebetweentheindexed5‐mersandthechunks

•  DPAappliedbothupstreamanddownstreamofmatches

Globalaligner:Shuffle‐LAGAN•  CHAOS:localaligner

–  Findsmatchingseeds,andchainsthemtogether(withinacutoffdistance)

–  ExtendschainswithungappedBLAST•  Createamapwheretheanchorshavetobecollinearinone

sequenceonly(1‐monotonicconservationmap)•  Alignconsistentsubsegments

–  Sortalllocalalignmentsbytheirstartcoordinates,identifyalignablesubsegments,expandtobordersofadjacentsubsegments(buttrimifanycharactersareoverlapping)

Brudnoetal,Bioinformatics,2003

Globalaligner:Mauve

1.  Findlocalalignments(multi‐MUMs)–  seed‐and‐extendhashingmethod

2.  Createaphylogeneticguidetree(NJ)3.  Selectasubsettouseasanchorsa)  partitionintoLCBs(locallycollinear

blocks)4.  Recursivelyidentifyadditionalanchors

aroundandwithineachLCB5.  ProgressivelyaligneachLCBusingtheguide

tree.

Darlingetal,GenomeResearch,2004

Darlingetal,GenomeResearch,2004

Breakpointidentification:GR‐Aligner•  Identifyandmergeconservedsequencepairs

–  Bl2Sequsedtogeneratelocalnon‐overlappingmatches–  Matchesaresortedbystartposition–  Mergeadjacentmatchesiftheyareeitherbothdirectorinverseandifthecombinedscoreisoversomethreshold(intervalalignedwithNW)

Chuetal,Bioinformatics,2009

•  Identifybreakpointsininversions–  Identifyallinversematchesflankedbydirectmatches

–  Alignhomologoussequencebetweentheinverseanddirectmatchandfindtheoptimalscore

GR‐Aligner•  Identifybreakpointsin

transpositions–  Identifyalldirectmatchesthatcrossonlyoneotherdirectmatch,andcrossedmatchesareflankedbydirectmatchesthatdonotcrossanyothermatches

–  Alignsequencebetweenthetransposedmatchwiththesequencesoutsidethetransposedmatchandfindtheoptimalscore

Chuetal,Bioinformatics,2009

Genomealignmentparameters•  Repeat‐masking,E‐value

–  Estimatedbyaligningreversedgenome–  TRFonlyrepeat‐finderthateliminatedmanyspuriousalignments(non‐

standardparameters,hard‐masking)

•  Scoringmatrix•  Gapcosts

–  495combinationsofabove3parameters(usinghard‐maskingwithTRFandanE‐valueof1)

–  Forproteinfamilyalignments,2:1:2:16:1[match‐score:transition‐cost:transversioncost:gap‐open:gap‐extend]balancedsensitivityandspecificity

–  ForstructuralRNAalignments,3:3:4:24:1and4:4:5:24:1•  X‐dropparameter

–  Usedtoterminateextensionofgappedalignments(terminatesifthescoredropsbymorethanXbelowthepreviouslyseenmaximumscore)

–  Seemstofindfewerfalse‐positiveswhenX‐dropisnotgreaterthanalignmentscorecutoff

Frithetal,BMCBioinformatics,2010

Visualizationtools:VISTAvs.PipMaker

•  VISTA(VISualizationToolsforAlignments)–  UsesAVIDtogeneratealignments–  Usesaslidingwindowapproach

•  Plotspercentidentitywithinafixedwindowsize,atregularintervals

•  PipMaker(PercentIdentityPlot)–  UsesBLASTZ–  X‐axisisthereferencesequence;horizontallinesrepresentgap‐freealignments http://pipmaker.bx.psu.edu/pipmaker/

http://genome.lbl.gov/vista/index.shtml

2‐Dview:VISTA‐dot

http://genome.jgi‐psf.org/synteny/

Circos

http://mkweb.bcgsc.ca/circos/

•  Oncewehaveamacro‐alignment,wecanstudytheevolutionofgenomesbetweenspecies,andalsocantracetheevolutionaryhistoryofthestructureofeachgenomeitself

•  Genomestructurerearrangement•  Geneduplication,chromosomalduplicationandpolyploidization(wholegenomeduplication)–  Newgenes

Phylogenetictree–sharedgenecontent

http://www.bork.embl‐heidelberg.de/~korbel/SHOT_v2/

Phylogenetictree–geneorder

Booreetal,Nature,1998

Unravellingthehistoryofthegenome•  Plants

–  Wheat•  Allohexaploid(AABBCC)

–  Maize•  Diploidizedallotetraploid•  Hasgrown12xinthepast5MYduetoincreasednumbersoftransposableelements

–  Arabidopsis•  Haploid,5N

•  Yeast–  Saccharomycescerevisiaevs.Kluyveromyceslactis

•  Human–  Genomeduplication?

Decipherhistoryoflineages

•  Genomesizechanges–  Compaction

•  Largescaledeletions‐Fugu•  Intronloss

–  Expansion•  Duplication•  Transposableelementinsertions

•  Transposons‐Alus,LINEs–  Ancientinsertionspriortoeutherianradiation–  Morerecentinsertions‐maize

Baxendaleetal,NatureGenet,1995

Whygen(om)eduplication?•  Duplicatedgenesprovideasourceforgeneticnoveltyduringevolution–  Eithermemberofaduplicatedgenepaircandivergetoeither•  Acquireanewfunctionwhichmaybepositivelyselectedfor

•  Subfunctionalize•  Bedifferentlyregulated(e.g.,tissue‐specific)•  Becomeapseudogene•  Bedeleted

•  Wholegenomeduplicationallowsfortheduplication(andsubsequentdivergence)ofwholepathwaysatatime

Duplicationevents•  Resolutionofpolyploidy– Non‐disjunctionofchromosomes(formmultivalentsinsteadofdivalents)•  Sterility

– Duplicatedgenesdonotstarttodiverge(orgetdeleted)untildisomicinheritanceresolved

Rearrangements

•  Aregenomerearrangementsthecauseorconsequenceofdiploidization?–  Mostwidelyacceptedhypothesisisthatdiploidizationproceedsbystructuraldivergenceofchromosomes

–  Somelociappeardisomic,otherstetrasomic•  Stage1:pairingbetweensimilarchromosomesallowed

–  Locinearthecentromerescandisplayresidualtetrasomy•  Stage2:non‐homologouschromosomepairingresolved

Effectsofpolyploidization

Wolfe,NatureRevGenet2001

Effectsofpolyploidization

Wolfe,NatureRevGenet2001

Saccharomyces cerevisiae (baker's yeast)

Whole Genome Duplication (WGD)

WGDinS.cerevisiae:Wolfe&Shields,Nature1997

Yeast

•  Saccharomycescerevisiae–  Degeneratetetraploid–  Polyploidyfollowedbyextensivedeletionand(70‐100)reciprocaltranslocations

–  8%ofgenesduplicatedin55blocks(plusmanymissedsmallerblocks)

–  Relativeorientationofgenesinblocksconservedwithrespecttothecentromere

SeoigheandWolfe,PNAS,1998

Duplicated blocks in

yeast !

WolfeandShields,Nature1997

Estimationoftimeofpolyploidyevent

•  DivergedfromKluyveromyceslactis(unduplicated)~150MYA–  Comparisonofgenesequencesandgeneorderrevealsconservation•  59%ofadjacentgenepairsinK.lactisorK.marxianusarealsoadjacentinS.cerevisiae

•  16%ofKluveromycesneighborscanbeexplainedintermsofinferredancestralgeneorder

–  PhylogeneticanalysesofduplicatedgeneswhereboththeKluveromycesorthologueandanoutgrouporthologuewereavailable,deducedthatthepolyploidizationeventinS.cerevisiaeoccurredaround100MYA

KellisetalNature428:617‐624(2004)

10% retention

5,000 genes

10,000 genes

5,500 genes

5,000 genes

Different subsets retained

Evidence from conserved order of a very few genes

Evidence from interleaving genes from sister segments

Each region of K.waltii matches two regions of S.cerevisiae

We don’t even need any remaining two-copy genes to infer the ancestral order

Human•  1970‐SusumuOhnoproposedthatvertebrategenomeshadoriginatedviagenomeduplication

•  2R(twoRounds[ofgenomeduplication])hypothesis–  Onebefore,andoneafterdivergenceofagnathans(lamprey)fromtetrapods(~430MYAand~500MYA)

–  Popularandcontroversial•  Splitbetweenthemap‐basedpeopleandthetree‐basedpeople

•  Duplicatedregionsareevident(covering≈44%ofthegenome),butitisdifficulttotellwhetheritisdueto(a)genomeduplication(s)orchromosomalduplications

Evidence?•  Thereareregionsinthegenomewhicharequadruplicated–  HSA1,6,9and19(MHCregion,10genefamilies)–  HSA4,5,8and10–  HSA2,7,12and17(Hoxclusters)

•  Expecttosee(A,B)(C,D)tree,wherethetimeofdivergenceofAfromBandCfromDisapproximatelythesame–  However,thisisnotconsistentlythecase

Possible routes to 4-gene families !

Hokampetal,JStructFuncGenomics2003

Hughes,MolBiolEvol1998

Estimates of divergence times for genes in the MHC region on chromosomes 1, 6, 9 and 19

Hughesetal,GenomeRes,2001

Divergence times of genes with

members on at least two of the

Hox cluster bearing

chromosomes (2, 7, 12, 17)

Hughesetal,GenomeRes,2001

Phylogenetic relationships of four- and three-membered gene families on Hox cluster bearing chromosomes

Conclusions•  Duplicatedregionsseeninhumangenomearemostlikelyvertebratespecific

•  Significantamountofduplicationoccurring~350‐650MYA–  Possibletoexplainlargemarginbyalloploidy?

•  Probablesupportforatleastonewholegenomeduplicationevent

•  Withtheavailabilityofmoregenomes,suchasCionaintestinalisoramphioxus,itmaybeeasiertodecipherthehistoryofduplication–  However,longtimespansunderconsideration,anddiploidizationrequiresextensivegenomicchanges

PLOSBiology3:e314(2005)

PanopoulouandPoustka,TIG21:559,2005

Fish-specific genome duplication event

Proposed evolution of the Hox clusters

Skrabanek,PhDthesis,1999

Landeretal‐Intl.HumanGenomeSequencingConsortiumpaperNature409:860(2001)

Find most similar vertebrate gene (here M1) to the Ciona gene. Other vertebrate genes are added to the cluster if they are more similar to M1 than M1 is to the Ciona gene.

S1

< S1

> S1

Duplicates may have arisen by speciation (lineage-splitting) or by gene duplication events specific to one or more vertebrates

Fugu-specific duplication

Human-specific duplication

Finding all homologs, and only homologs

Number of gene duplications

46.6% of ancestral chordate genes are duplicated in ≥1 lineage 34.5% with at least one duplication before Fugu-tetrapod split 23.5% with at least one duplication after Fugu-tetrapod split

No evidence of 2R hypothesis from gene family membership alone (no peak at four genes/family), nor from phylogenetics (since duplications abundant on every branch)

Hypothetical genomic region

1R

2R

decay

decay

!  Paralogs generated by a gene duplication before the Fugu-tetrapod split count as matches. !  N-fold redundancy calculated by identifying all cases where ≥ 2 different duplicates are within a 100-gene window, and then counting the number of times that their paralogs occur within a 100-gene window elsewhere in the genome.

2R

4-fold redundancy most common - accounts for 25% of the genome

!  1,912 genes duplicated prior to fish-tetrapod split !  2,953 paralogous gene pairs !  32.4% are found in 386 detectable paralogous segments, comprising 772 individual genomic segments !  454 are tetra-paralogons, where overlapping sets fall into 4-fold groups

!  Of the genes that duplicated after the fish-tetrapod split, only 11% are found in paralogous regions (i.e., duplications after the split did not include large segments of the genome)

Could be of recent origin, or could have undergone multiple rearrangement events that destroyed the tetra-paralogons signal

Old duplications Recent duplications

Hox-bearing chromosomes: 50% of genes duplicated after the fish-tetrapod split are tandem duplicates (separated by < 10 genes), whereas only 6% of genes duplicated before the split are tandem duplicates.

Conclusions•  2Rhypothesismostlikelyscenario•  Mostlikelyscenario:Tworoundsofcloselyspacedauto‐tetraploidizationevents–  Someparalogouspairswithintetraparalogonsextendoverlongerregionsthanothers•  Sooctaploidyunlikely,sincepairsofregionswouldhavehadtolosethesamesetsofgenes

–  Phylogenetictreesarenotconsistentlynested•  Soallotetraploidyortwodistantlyspacedautotetraploidyeventsunlikely

–  Treetopologieswithinparalogousblocksnotalwayscongruent•  Genelossandrediploidizationprocessesprobablyspannedthetwoduplicationevents

Identificationoffunctionalelements•  Codingsequences

–  Relativelyeasytoidentify•  Manygenepredictionprogramsavailable•  Generalgenestructureknown,e.g.,

–  TATAboxes–  Splicedonor‐acceptorsites

–  ESTsandcDNAsavailabletoaidgeneprediction•  Non‐codingfunctionalsequences

–  Muchhardertoidentify–  Nocommonstructureofregulatoryregions–  TFbindingsitesareshortandubiquitous

•  Comparativegenomics–  Agenomicsequencethatprovidesafunctionthatisunderselectionandtendstobeconservedbetweenspeciesiscalleda“functionalsequence”(e.g.,aprotein‐codingregionoratranscriptionfactorbindingsite)

Codingregioncomparison•  Discovernewgenes

–  Annotategenestructure•  Exon‐intronstructure

•  Comparegenecontent–  Findgenescommontosetsoforganisms–  Findgenesuniquetoanorganism

•  Wecanstudyhowgenomesvarywithinaspeciestoidentifygenescentraltoparticularprocesses–  Canalsocomparesubspeciese.g.,E.coliK12andO157:H7(pathogenic)

•  Discovergenefunction•  Canfindmissinggenesinmetabolicpathways

Predictingstructureof‘new’genes•  Manygenefindingprograms

–  Lookforstartsites,terminationsites,splicesites–  Analyzecodonusage,exonlength

•  ComparativemethodA–  Findsyntenicregions–  Predictgenesusingconventionalgenefindingtechniquesinbothspecies

–  Genespredictedinbothspeciesareprobable

Geneprediction

•  ComparativemethodB–  Findsyntenicregions–  Performpatternfiltering

•  Codingexonstendtobewellconserved•  Conservationhigherinfirstandsecondpositionsofcodons

–  Advantage:candealwithsequencingerrors

Identificationofparaloguesandorthologuesinotherspecies

•  BestReciprocalHit(BRH)

•  ReciprocalHitbySynteny(RHS)–  Identificationofadjacentorthologues

•  Domainchecking‐internalqualitycontrol

Mouse Human

http://www.nbn.ac.za/

Identificationoforthologues

Orphan Others

Matches to some other chromosome

Human

BRH

Mouse

RHS

http://www.nbn.ac.za/

Genome sequences of 3 strains of E. coli

Welchetal,PNAS2002

4,288 genes

5,063 genes

5,016 genes

Rich in genes encoding potential fimbrial adhesins, autotransporters, iron-sequestration systems, phase-switch recombinases.

Includes genes encoding fimbrial adhesins, iron uptake systems, as well as toxins and toxin-transport systems and integrases.

Conservednon‐codingelements•  PhastCons–phylo‐HMM•  Alignmentoffivevertebrategenomes•  3‐8%ofthehumangenomeisconservedinothervertebrates–about1.18millionelements–  Distinguishfrom“ultra‐conserved”codingelements(exons)

–  ManyHCEsoverlapwith3’UTRs–  ManyHCEsshowpotentialtohavelocalRNAsecondarystructure•  Possiblefunction–post‐transcriptionalmodifications/regulationorcodeforfunctionalRNAs

–  AlsomanyHCEsareingenedeserts–possibledistalregulatoryelements?

Siepeletal,GenomeResearch2005