GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES

Shin-Han Shiu
Plant Biology / QBMI
Michigan State University

Genomes and gene contents

[Figure: gene counts of several genomes; surviving labels: 30,000, 25,000, 10,000, 6,000, 45,000, 17,000]

Duplicate genes in the genome

Arabidopsis gene families*

*: Clusters from Markov clustering, using all-against-all BLAST E values as distance measures
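
The footnote's clustering step can be illustrated with a minimal sketch of Markov clustering (alternating expansion and inflation of a column-stochastic matrix). This is not the pipeline used for the slide. The conversion of E values to edge weights (`evalue_to_weight`, -log10 with a cap), the self-loop weights, and the parameter defaults are all assumptions made for illustration.

```python
import math

def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]

def expand(m):
    """Expansion: square the matrix (flow along two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, power):
    """Inflation: raise each entry to a power, then renormalize the columns."""
    return normalize_columns([[x ** power for x in row] for row in m])

def evalue_to_weight(evalue, cap=200.0):
    """Hypothetical conversion: -log10(E), capped so that E = 0 stays finite."""
    if evalue <= 0:
        return cap
    return min(cap, max(0.0, -math.log10(evalue)))

def build_matrix(n_genes, hits):
    """hits maps (gene_i, gene_j) to a BLAST E value; the matrix is made symmetric."""
    w = [[0.0] * n_genes for _ in range(n_genes)]
    for (i, j), evalue in hits.items():
        w[i][j] = w[j][i] = evalue_to_weight(evalue)
    return w

def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Return clusters as sorted lists of gene indices."""
    n = len(weights)
    m = [row[:] for row in weights]
    for i in range(n):
        # Self-loop (an assumption): the strongest edge of the gene, or 1.0 if isolated
        m[i][i] = max(weights[i]) or 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    clusters = set()
    for i in range(n):
        if m[i][i] > eps:  # row i is an attractor; its nonzero columns form a cluster
            clusters.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return sorted(sorted(c) for c in clusters)
```

A higher `inflation` value generally yields smaller, tighter families, so family counts depend on that choice.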

Gene function and duplication

What’s the consequence?

Focus I: Duplication Mechanism and Loss Rate

[Diagram: Gene duplications: mechanisms, preferential retention, consequences]

Duplication mechanisms

Whole genome duplication
Tandem duplication
Segmental duplication
Replicative transposition

Lineage-specific gains in plants and animals

Organism     Lineage-specific gains  Normalized gain*  # of genes in families analyzed  % total
Rice         10115                   6743              28467                            35.5 (23.7)**
Arabidopsis  5984                    3990              21936                            27.3 (18.2)**
Human        811                     811               21954                            3.7
Mouse        1265                    1265              24041                            5.3

*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).

**: Numbers in parentheses refer to percentage of total based on normalized gains.
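
The normalization in the footnotes can be reproduced with a few lines of arithmetic. The sketch below assumes the normalized gain is simply the raw gain scaled by 100/150 for the plant lineages; the function names are illustrative and not taken from the talk.

```python
PLANT_DIVERGENCE_MYA = 150   # Arabidopsis-rice divergence, from the footnote
ANIMAL_DIVERGENCE_MYA = 100  # human-mouse divergence, from the footnote

def normalized_gain(gains, divergence_mya, reference_mya=ANIMAL_DIVERGENCE_MYA):
    """Scale a gain count to the reference divergence time."""
    return gains * reference_mya / divergence_mya

def percent_of_total(gains, genes_analyzed):
    """Gains as a percentage of the genes in the families analyzed."""
    return round(100.0 * gains / genes_analyzed, 1)

# (lineage-specific gains, genes in families analyzed, divergence time in Mya)
table = {
    "Rice": (10115, 28467, PLANT_DIVERGENCE_MYA),
    "Arabidopsis": (5984, 21936, PLANT_DIVERGENCE_MYA),
    "Human": (811, 21954, ANIMAL_DIVERGENCE_MYA),
    "Mouse": (1265, 24041, ANIMAL_DIVERGENCE_MYA),
}

for organism, (gains, analyzed, mya) in table.items():
    norm = normalized_gain(gains, mya)
    print(organism, percent_of_total(gains, analyzed), percent_of_total(norm, analyzed))
```

The percentages match the table; the normalized counts can differ from the slide in the last digit, presumably because of rounding.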

Substantially more recent duplicates in plants than in animals
Mostly due to frequent whole genome duplications in plants

Gain vs. Loss

3 rounds of whole-genome duplications in the Arabidopsis lineage
~82% of duplicates from the last round were lost in the past 40 million years

[Figure: Genome duplications + tandem duplications – gene losses = Arabidopsis gene content. Labels: 15,000* → 30,000 → 60,000 → 120,000; Arabidopsis gene content: 21,000**]

*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
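
The numbers on this slide can be checked with a short calculation. The sketch below assumes that each whole-genome duplication doubles the gene count and that tandem duplications are ignored. The net loss it returns covers all three rounds, so it is not the same quantity as the ~82% quoted for the last round, even though the two values are close.

```python
ANCESTRAL_GROUPS = 15_000   # orthologous groups in shared families (slide label)
WGD_ROUNDS = 3              # rounds of whole-genome duplication
OBSERVED_GENES = 21_000     # Arabidopsis genes in shared families (slide label)

def expected_without_loss(ancestral, rounds):
    """Gene count if every duplicate from every round had been retained."""
    return ancestral * 2 ** rounds

def net_fraction_lost(expected, observed):
    """Fraction of the expected genes that are no longer present."""
    return 1 - observed / expected

expected = expected_without_loss(ANCESTRAL_GROUPS, WGD_ROUNDS)
print(expected, net_fraction_lost(expected, OBSERVED_GENES))
```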

“Age” distribution of animal duplicates

Steady decay in the number of duplicates
Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT)

Ks: rate of nucleotide substitutions at codon sites that do not affect amino acid identity (synonymous substitutions)

Shiu et al., 2006

Plant duplicate “age” distribution

Apparent peak at Ks ~0.18 instead of zero
Frequent WGD, TD, SD (maybe), and RT (in some plants)

Shiu et al., 2004

Genome remodeling in polyploids

Natural and synthetic polyploids

[Figure: natural and synthetic polyploids; surviving labels: ~348 Mb, ~203 Mb, ~314 Mb, ~257 Mb, 20,000 yr]

Experimental approaches

Genome-wide polymorphism monitored by tiling array

[Figure: genome covered by tiled probes on an array; labels: gap, resolution, ~6 million features, 20,000 yr]

Genome-wide Single Feature Polymorphism

Mid-parent (MP) vs. Arabidopsis suecica (As)

Polyploid  SFP
Natural    58,517
Synthetic  503

Genome-wide Single Feature Polymorphism

Genome-wide polymorphism monitored by tiling array

[Figure: polymorphism by feature type: gene, pseudogene, transposon]

Genome-wide Single Feature Polymorphism

Duplication or deletion

[Figure label: MP duplication or As deletion]

Genome Survey Sequencing

Sequence ~40-60 Mb of the Arabidopsis suecica genome
  0.15-0.2X coverage, will be done next week!

Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
Ultra-high throughput
  20-30 Mb per run, each run 5 hours
  Will be 100 Mb per run early 2007
Cost efficient
  ~$0.3/kb
Read length rather limited
  ~100 bp per read now
  Will be ~200 bp early 2007
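
As a rough planning aid, the throughput and cost figures above translate into runs, machine time, and cost for the 40-60 Mb survey. The sketch takes the quoted numbers at face value and assumes 1 Mb = 1,000 kb; the function names are illustrative only.

```python
import math

def runs_needed(target_mb, mb_per_run):
    """Number of whole sequencing runs required to reach the target yield."""
    return math.ceil(target_mb / mb_per_run)

def machine_hours(runs, hours_per_run=5):
    """Instrument time, using the quoted 5 hours per run."""
    return runs * hours_per_run

def cost_usd(target_mb, usd_per_kb=0.3):
    """Sequencing cost at the quoted ~$0.3 per kb."""
    return target_mb * 1000 * usd_per_kb

# Best case: 40 Mb at 30 Mb per run; worst case: 60 Mb at 20 Mb per run
for target_mb, mb_per_run in ((40, 30), (60, 20)):
    runs = runs_needed(target_mb, mb_per_run)
    print(target_mb, "Mb:", runs, "runs,", machine_hours(runs), "h, $", round(cost_usd(target_mb)))
```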

For more information contact:
  Andreas Weber (aweber@msu.edu)
  David DeWitt (dewittd@msu.edu)
  Or Shin-Han Shiu (shius@msu.edu)

Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS

Summary: Gene duplication and polyploidy

Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.

In plants, whole genome duplication is common, but gene loss also occurred frequently.

After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.

After 20,000 generations, most coding genes do not have clustered sequence polymorphisms indicative of deletion.

Clustered polymorphisms are mostly located in pseudogenes and transposons.

Survey sequencing is necessary to determine whether some coding genes have become pseudogenes without being deleted.

Focus II: Differential Retention of Duplicates

[Diagram: Gene duplications: mechanisms, preferential retention, consequences]

Duplicate genes in the genome

Arabidopsis gene families*

*: Clusters from Markov clustering, using all-against-all BLAST E values as distance measures

Large gene families in plants

One of the largest gene families

Normalized gain: % expanded orthologous groups (OGs)

Large family sizes do not necessarily indicate higher expansion rates

Ancestral family sizes and gene gains

Large ancestral families tend to have more lineage-specific gains, but with many exceptions

Differential expansion of functional categories

GO: Gene Ontology

Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism

Differences in Duplicability

[Table: duplicability by category in Arabidopsis and human. Categories: defense response, proteolysis, transport, ion channel activity, metabolism, development, protein kinase activity, transcription factor activity]

Duplicability
  The propensity for the retention of a duplicate gene
  Computational analysis of genome-wide trend

Kinase superfamily sizes among eukaryotes

Organism                     Number of genes  Kinase superfamily  Percent of total genes
Arabidopsis thaliana         25,814           1,041               4.0
Oryza sativa subsp. indica   ~35,000          1,607               3.6
Chlamydomonas reinhardtii    ~12,200          414                 3.4
Plasmodium falciparum        5,334            94                  1.8
Plasmodium yoelii            7,681            70                  0.9
Caenorhabditis elegans       19,484           417                 2.1
Drosophila melanogaster      13,808           262                 1.9
Anopheles gambiae            15,088           216                 1.4
Ciona intestinalis           15,852           316                 2.0
Fugu rubripes                33,609           632                 1.9
Mus musculus                 22,444           495                 2.2
Homo sapiens                 22,980           472                 2.1
Saccharomyces cerevisiae     6,449            113                 1.8
Candida albicans             6,164            95                  1.5
Neurospora crassa            10,082           104                 1.9
Schizosaccharomyces pombe    4,945            109                 2.2

Shiu & Bleecker, 2003

Kinase families in rice and Arabidopsis

Gene count differences among families indicate differential expansion

Shiu et al., 2004

Estimation of ancestral RLK family size

[Figure, panels A and B; labels: 440 speciation points, rice, Arabidopsis; WAK, LRR VIII, X, XII]

Kinase phylogeny of Arabidopsis and rice RLKs

Shiu et al., 2004

Development vs. resistance/defense RLKs

Shiu et al., 2004

Contradiction

Plant genes involved in development tend to have high duplicability

Developmental RLKs: low duplicability
Resistance/defense RLKs: high duplicability
Animal tyrosine kinases: low duplicability
Transcription factors: high duplicability

Selection for expansion

Depends on the level of variation of the signals

Summary: differential retention

Longevity and duplicability of plant genes

Duplicability  Longevity  Examples
High           High       Transcription factors
High           Low        Resistance genes
Low            High       Enzymes in central metabolic pathways
Low            Low        ??

Focus III: Functional Consequences

[Diagram: Gene duplications: mechanisms, preferential retention, consequences]

Functional Consequences of Duplication

Functional divergence and conservation
  Is it because of changes in cis-regulatory elements or coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?

Divergence in gene expression

Develop pipelines for cis-element prediction and

[Diagram labels: expression data; clusters of genes with similar expression profiles; over-represented sequence motifs in 5’ regions; machine learning; motif functional prediction; cis-regulatory logic; experimental validations]

Divergence in post-translational modification

Conservation of phosphorylation sites across species
  SACE: budding yeast
  CAGL: Candida glabrata
  CAAL: Candida albicans
  CATR: Candida tropicalis
  NECR: Neurospora crassa
  DEHA: Debaryomyces hansenii

Detailed Functional Studies of Duplicate Genes

Functional analyses of DDF1 and DDF2 transcription factors
  Derived from recent whole genome duplication in Arabidopsis
  Related to the well-known CBF factors involved in cold and drought stress

[Diagram: DDFs studied through promoter-GFP, knockouts, over-expression studies, interacting proteins, and binding targets; shown for Arabidopsis thaliana and Arabidopsis lyrata]

Focus IV: Protein space

[Diagram: Gene duplications: mechanisms, preferential retention, consequences]

Tiling array analysis of transcriptome

Human Chr 21, 22

Kapranov et al., 2002

Posterior probability p(F|coding)

Performance of the CI measure

Known Arabidopsis exons and introns, 90-300 bp

Arabidopsis small proteins that are not annotated
  Correctly predict 19 out of 20 (95%)

Yeast sORFs with translation evidence
  Correctly predict 98 out of 114 (86%)

In “intergenic” sequences of the Arabidopsis genome
  3,274 sORFs identified

Coupling with tiling array expression

Hybridization intensities for feature types

Summary: Novel coding genes

Many unannotated regions in the genomes are expressed.

Using the CI measure, many yeast and Arabidopsis proteins that were not annotated but have evidence of expression are identified correctly.

Using the CI measure, we estimated that ~3,000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.

Using tiling array data, we found that many of these novel coding regions are expressed.

Acknowledgement

Lab members
  Kousuke Hanada
  Melissa Lehti-Shiu
  Cheng Zou
  Emily Eckenrode

University of Chicago
  Justin Borevitz
  Xu Zhang

University of Wisconsin
  Sara Patterson
  Rick Vierstra

University of Missouri
  Scott Peck

Michigan State University
  Many…
  Rong Jin, Comp Sci & Eng
  Yue-Hua Cui, Stat & Prob
  Startup fund

Recent completion …Recent completion …

Genome remodeling in polyploids

Genome duplications occur frequently in plants
  What is the fate of duplicates?
  How fast do gene losses occur?
  Is there any preference in genes retained?

[Figure: a genome with genes A, B, C, D, E is duplicated into copies A1 to E1 and A2 to E2 and followed over times t1 and t2; Ng = 5, 10, 8, 5]

Comparing degrees of expansion

[Diagram: Arabidopsis (~25,000 proteins) and rice predictions (~66,000 genes) form a combined set; gene/domain families are classified as shared or unique; pairwise distances define putative orthologous groups]

Example for GO:0001: u_i = 1, e_i = 4

All orthologous groups
  Total unexpanded = Σ u_i
  Total expanded = Σ e_i
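
The counting on this slide can be sketched as follows. The sketch reads u_i and e_i as the numbers of unexpanded and expanded orthologous groups annotated with GO category i, and calls a group expanded when it holds more than one gene in the lineage of interest. Both readings are assumptions for illustration, as are the function names.

```python
from collections import defaultdict

def count_by_category(groups):
    """groups: iterable of (go_term, n_genes_in_lineage), one entry per orthologous group.

    Returns {go_term: {"u": unexpanded groups, "e": expanded groups}}.
    """
    counts = defaultdict(lambda: {"u": 0, "e": 0})
    for go_term, n_genes in groups:
        key = "e" if n_genes > 1 else "u"  # assumed criterion for "expanded"
        counts[go_term][key] += 1
    return dict(counts)

def totals(counts):
    """Total unexpanded (sum of u_i) and total expanded (sum of e_i) over all categories."""
    total_u = sum(c["u"] for c in counts.values())
    total_e = sum(c["e"] for c in counts.values())
    return total_u, total_e

def percent_expanded(u, e):
    """Percentage of orthologous groups that are expanded."""
    return 100.0 * e / (u + e)
```

Comparing a category's `percent_expanded` with the value obtained from `totals` shows whether that category expanded more or less than the background of all orthologous groups.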

Major questions on gene duplication

When: timing of gene duplications, e.g. N = 10

Domain gains in rice and Arabidopsis

Gain in one lineage does not necessarily predict gain in the other

Identify novel small coding genes

Determine base composition probabilities

Coding sequences → CDS parameters
Non-coding sequences → NCDS parameters

$$P_c(\mathrm{AAA}) = \frac{\#\ \text{of AAA}}{\#\ \text{of all NNN}}$$

$$P_c(\mathrm{T} \mid \mathrm{AAA}) = \frac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})}$$

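Calculate posterior probability

Feature tables: c1, c2, c3, c4, c5, c6 (coding) and n (non-coding)

$$P(\mathrm{CDS} \mid S) = \frac{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS})}{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\,P(\mathrm{NCDS})}$$

Setting up the Bayes’ priors

S = ATG TTC TAC TTT GG…

$$P(\mathrm{CDS}) = P(\mathrm{NCDS}) = \frac{1}{2}$$

$$P(\mathrm{CDS}_1) = P(\mathrm{CDS}_2) = \dots = P(\mathrm{CDS}_6) = \frac{1}{2} \cdot \frac{1}{6}$$

$$P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) = \sum_{m=1}^{6} P(S \mid \mathrm{CDS}_m)\,P(\mathrm{CDS}_m)$$

$$P(S \mid \mathrm{CDS}_1) = P_{c_1}(\mathrm{ATG})\,P_{c_1}(\mathrm{T} \mid \mathrm{ATG})\,P_{c_2}(\mathrm{T} \mid \mathrm{TGT})\,P_{c_3}(\mathrm{C} \mid \mathrm{GTT})\,P_{c_1}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$

$$P(S \mid \mathrm{CDS}_2) = P_{c_2}(\mathrm{ATG})\,P_{c_2}(\mathrm{T} \mid \mathrm{ATG})\,P_{c_3}(\mathrm{T} \mid \mathrm{TGT})\,P_{c_1}(\mathrm{C} \mid \mathrm{GTT})\,P_{c_2}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$

$$P(S \mid \mathrm{CDS}_6) = P_{c_6}(\mathrm{ATG})\,P_{c_6}(\mathrm{T} \mid \mathrm{ATG})\,P_{c_4}(\mathrm{T} \mid \mathrm{TGT})\,P_{c_5}(\mathrm{C} \mid \mathrm{GTT})\,P_{c_6}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$

$$P(S \mid \mathrm{NCDS}) = P_{n}(\mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{TGT})\,P_{n}(\mathrm{C} \mid \mathrm{GTT})\,P_{n}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$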
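
The Bayes setup on this slide can be sketched in code. This is a minimal illustration, not the implementation behind the talk. It assumes that the six coding tables correspond to three codon phases on each strand, that the reverse-strand tables are trained on reverse complements of coding sequences, and that add-one pseudocounts are acceptable for unseen words. All function names and the toy training sequences are hypothetical.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def train(seqs, n_phases):
    """Feature tables: counts of trinucleotides and of (trinucleotide + next base), per phase."""
    tri = [dict() for _ in range(n_phases)]
    quad = [dict() for _ in range(n_phases)]
    for s in seqs:
        for i in range(len(s) - 2):
            phase = i % n_phases
            tri[phase][s[i:i + 3]] = tri[phase].get(s[i:i + 3], 0) + 1
            if i + 3 < len(s):
                quad[phase][s[i:i + 4]] = quad[phase].get(s[i:i + 4], 0) + 1
    return tri, quad

def p_first(table, trimer):
    """P(trimer), with an add-one pseudocount over the 64 trinucleotides (assumption)."""
    return (table.get(trimer, 0) + 1) / (sum(table.values()) + 64)

def p_next(table, trimer, base):
    """P(base | trimer), with an add-one pseudocount over the 4 bases (assumption)."""
    total = sum(table.get(trimer + b, 0) for b in BASES)
    return (table.get(trimer + base, 0) + 1) / (total + 4)

def log_likelihood(s, model, offset):
    """log P(S | model), with the first trinucleotide of S placed at phase `offset`."""
    tri, quad = model
    n_phases = len(tri)
    ll = math.log(p_first(tri[offset % n_phases], s[:3]))
    for i in range(len(s) - 3):
        ll += math.log(p_next(quad[(i + offset) % n_phases], s[i:i + 3], s[i + 3]))
    return ll

def posterior_coding(s, fwd, rev, ncds):
    """P(CDS | S) with priors P(CDS) = P(NCDS) = 1/2 and 1/12 for each of six frames."""
    coding = [log_likelihood(s, model, off) + math.log(1 / 12)
              for model in (fwd, rev) for off in range(3)]
    noncoding = log_likelihood(s, ncds, 0) + math.log(1 / 2)
    top = max(coding + [noncoding])  # shift before exponentiating to avoid underflow
    num = sum(math.exp(t - top) for t in coding)
    return num / (num + math.exp(noncoding - top))

# Toy training data (hypothetical, for illustration only)
toy_cds = ["ATG" + "GCT" * 10 + "TAA"] * 5
fwd_model = train(toy_cds, 3)
rev_model = train([revcomp(s) for s in toy_cds], 3)
ncds_model = train(["AT" * 20] * 5, 1)
```

With real data the tables would be trained on annotated coding sequences and introns. The posterior is then evaluated for each candidate sequence of at least three bases, for example `posterior_coding(candidate, fwd_model, rev_model, ncds_model)`.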

Coding Likelihood (CL)

Sliding windows of a sequence

Simulation based on NCDS (introns)

[Equation and figure: CL computed from P(CDS|S) over sliding windows 1, 2, 3, 4, …, n of the sequence]

Divergence in post-translational modification

Conservation of phosphorylation sites across species
