Gene Family

GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University

Genomes and gene contents
Figure labels (gene counts): 30,000; 25,000; 10,000; 6,000; 45,000; 17,000

Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters from Markov clustering using all-against-all BLAST E values as distance measures

Gene function and duplication
What's the consequence?

Focus I: Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, consequences, preferential retention

Duplication mechanisms
Whole genome duplication

Tandem duplication

Segmental duplication

Replicative transposition

Lineage-specific gains in plants and animals

Organism       Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice           10115                    6743               28467                             35.5 (23.7)**
Arabidopsis    5984                     3990               21936                             27.3 (18.2)**
Human          811                      811                21954                             3.7
Mouse          1265                     1265               24041                             5.3

*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to percentage total based on normalized gains.

Substantially more recent duplicates in plants than in animals
Mostly due to frequent whole genome duplications in plants
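The normalization behind this table can be sketched in a few lines. This is a minimal illustration, assuming the plant gain counts are simply scaled by the 100/150 ratio of divergence times; the names `normalized_gain` and `percent_total` are hypothetical, and the results agree with the reported values only to within rounding.

```python
# Divergence times (Mya) as given in the footnote.
PLANT_DIVERGENCE = 150   # Arabidopsis-rice
ANIMAL_DIVERGENCE = 100  # human-mouse

# (lineage-specific gains, genes in families analyzed), copied from the table.
counts = {
    "Rice": (10115, 28467),
    "Arabidopsis": (5984, 21936),
    "Human": (811, 21954),
    "Mouse": (1265, 24041),
}
PLANTS = {"Rice", "Arabidopsis"}


def normalized_gain(organism):
    """Scale plant gains down to the shorter animal divergence time (assumed rule)."""
    gains, _ = counts[organism]
    if organism in PLANTS:
        return gains * ANIMAL_DIVERGENCE / PLANT_DIVERGENCE
    return float(gains)


def percent_total(organism, normalized=False):
    """Gains as a percentage of the genes in the families analyzed."""
    gains, n_genes = counts[organism]
    value = normalized_gain(organism) if normalized else gains
    return 100 * value / n_genes


for name in counts:
    print(name, round(normalized_gain(name)), round(percent_total(name), 1),
          round(percent_total(name, normalized=True), 1))
```

Even after normalization, the plant percentages remain several times higher than the animal ones, which is the point the slide makes.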

Gain vs. Loss
3 rounds of whole-genome duplications in the Arabidopsis lineage
~82% of duplicates from the last round were lost in the past 40 million years
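To put the ~82% figure in terms of a rate, one can assume (and this is only an assumption, not something the slide states) that duplicates are lost at a constant per-gene rate, so that the retained fraction decays exponentially. The sketch below derives the implied rate and half-life; the function names are hypothetical.

```python
import math


def loss_rate(fraction_lost, elapsed_my):
    """Per-million-year loss rate, assuming exponential decay of duplicates."""
    retained = 1.0 - fraction_lost
    return -math.log(retained) / elapsed_my


def half_life(rate):
    """Time (in million years) for half of the duplicates to be lost."""
    return math.log(2) / rate


rate = loss_rate(0.82, 40)  # ~82% lost in ~40 million years
print(f"loss rate: {rate:.4f} per My, half-life: {half_life(rate):.1f} My")
```

Under this simplifying model, half of the duplicates from the last round would have been lost within roughly 16 million years.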

15,000* -> 30,000 -> 60,000 -> 120,000
Genome duplications + tandem duplications - gene losses = Arabidopsis gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
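The bookkeeping on this slide can be reproduced as follows. This is a rough sketch that assumes each whole-genome duplication exactly doubles the gene content and that nothing is lost in between; the constant and function names are hypothetical.

```python
ANCESTRAL_GROUPS = 15_000  # orthologous groups in shared families
WGD_ROUNDS = 3             # rounds of whole-genome duplication
OBSERVED_GENES = 21_000    # Arabidopsis genes in shared families


def content_after_wgd(ancestral, rounds):
    """Gene content after each round if every duplicate were retained."""
    return [ancestral * 2 ** r for r in range(rounds + 1)]


trajectory = content_after_wgd(ANCESTRAL_GROUPS, WGD_ROUNDS)
expected_without_loss = trajectory[-1]
net_retained_fraction = OBSERVED_GENES / expected_without_loss

print(trajectory)
print(f"net fraction retained: {net_retained_fraction:.3f}")
```

Because tandem duplications also add genes, the true fraction of whole-genome duplicates that survived is, if anything, lower than this net figure.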

Age distribution of animal duplicates
Steady decay in the number of duplicates
Frequent TD, SD, and RT

Ks: rate of nucleotide substitutions in codon sites that do not affect amino acid identity
Shiu et al., 2006
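To make the idea of substitutions "that do not affect amino acid identity" concrete, the sketch below classifies single-base codon differences between two aligned coding sequences as synonymous or nonsynonymous, using the standard genetic code. It is not a Ks estimator: a real estimate (for example with the Nei-Gojobori method) also normalizes by the number of synonymous sites and corrects for multiple substitutions. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_differences(seq1, seq2):
    """Count (synonymous, nonsynonymous) codon pairs that differ by one base."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        n_diff = sum(a != b for a, b in zip(c1, c2))
        if n_diff != 1:
            continue  # identical codons, or multiple changes (path is ambiguous)
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# TTC -> TTT keeps phenylalanine; TTT -> CTT changes it to leucine.
print(classify_differences("ATGTTCTACTTT", "ATGTTTTACCTT"))
```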

Plant duplicate age distribution
Apparent peak at Ks of ~0.18 instead of zero
Frequent WGD, TD, SD (maybe), and RT (in some plants)

Shiu et al., 2004

Genome remodeling in polyploids
Natural and synthetic polyploids
Figure labels: ~348 Mb, ~203 Mb, ~314 Mb, ~257 Mb; 20,000 yr

Experimental approaches
Genome-wide polymorphism monitored by tiling array
Figure labels: genome, tiled probes, gap, resolution, array; 20,000 yr; ~6 million features

Genome-wide Single Feature Polymorphism
Mid-parent (MP) vs. Arabidopsis suecica (As)

Polyploid    SFP
Natural      58,517
Synthetic    503

Genome-wide Single Feature Polymorphism
Genome-wide polymorphism monitored by tiling array
Feature classes: gene, pseudogene, transposon

Genome-wide Single Feature Polymorphism
Duplication or deletion
MP duplication or As deletion

Genome Survey Sequencing
Sequence ~40-60 Mb of the Arabidopsis suecica genome
0.15-0.2X coverage, will be done next week!
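A quick way to see what 0.15-0.2X coverage buys is the Lander-Waterman (Poisson) approximation, under which the expected fraction of the genome touched by at least one read is 1 - e^(-c). The sketch below assumes reads land uniformly at random; the 300 Mb genome size in the usage line is a placeholder for illustration, not a value taken from the slides.

```python
import math


def coverage(sequenced_mb, genome_mb):
    """Fold coverage: total bases sequenced divided by genome size."""
    return sequenced_mb / genome_mb


def fraction_covered(fold_coverage):
    """Expected fraction of the genome hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-fold_coverage)


for c in (0.15, 0.2):
    print(f"{c}X coverage -> ~{100 * fraction_covered(c):.1f}% of the genome sampled")

# Hypothetical example: 60 Mb of reads over an assumed 300 Mb genome.
print(coverage(60, 300))
```

At this depth only about 14-18% of the genome is expected to be sampled, which suits a survey of gene fate but not an assembly.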

Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
Ultra-high throughput: 20-30 Mb per run, each run 5 hours; will be 100 Mb per run in early 2007
Cost efficient: ~$0.3/kb
Read length rather limited: ~100 bp per read now; will be ~200 bp in early 2007
For more information contact: Andreas Weber, David DeWitt, or Shin-Han Shiu

Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS

Summary: Gene duplication and polyploidy
Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.

In plants, whole genome duplication is common, but gene loss also occurs frequently.

After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.

After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.

Clustered polymorphisms are mostly located in pseudogenes and transposons.

Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.
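The notion of "clustered" polymorphism can be illustrated with a simple run finder over per-probe SFP calls ordered along the genome. This is only a sketch of the general idea: the input format, the `min_run` threshold, and the function name are assumptions, not the procedure used in the study.

```python
def find_sfp_clusters(sfp_calls, min_run=3):
    """Return inclusive (start, end) index pairs for runs of at least
    `min_run` consecutive probes called as SFPs."""
    clusters = []
    run_start = None
    for i, is_sfp in enumerate(sfp_calls):
        if is_sfp and run_start is None:
            run_start = i                      # a new run begins here
        elif not is_sfp and run_start is not None:
            if i - run_start >= min_run:
                clusters.append((run_start, i - 1))
            run_start = None                   # the run has ended
    # Handle a run that extends to the last probe.
    if run_start is not None and len(sfp_calls) - run_start >= min_run:
        clusters.append((run_start, len(sfp_calls) - 1))
    return clusters


# Hypothetical calls for 12 adjacent tiled probes (1 = SFP).
calls = [0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1]
print(find_sfp_clusters(calls))
```

An isolated SFP is more consistent with a point polymorphism, whereas a long run of adjacent SFPs is what a deletion spanning several probes would produce.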

Focus II: Differential Retention of Duplicates
Gene duplications: mechanisms, consequences, preferential retention

Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters from Markov clustering using all-against-all BLAST E values as distance measures

Large gene families in plants
One of the largest gene families

Normalized gain: % expanded OGs
Large family sizes do not necessarily indicate higher expansion rates

Ancestral family sizes and gene gains
Large ancestral families tend to have more lineage-specific gains, but with many exceptions

Differential expansion of functional categories
GO: Gene Ontology
Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism

Differences in Duplicability
Duplicability: the propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend

Categories compared between Arabidopsis and human:
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity

Kinase superfamily sizes among eukaryotes
Shiu & Bleecker, 2003

Organism                        Number of genes   Kinase superfamily   Percent total genes
Arabidopsis thaliana            25,814            1,041                4.0
Oryza sativa subsp. indica      ~35,000           1,607                3.6
Chlamydomonas reinhardtii       ~12,200           414                  3.4
Plasmodium falciparum           5,334             94                   1.8
Plasmodium yoelii               7,681             70                   0.9
Caenorhabditis elegans          19,484            417                  2.1
Drosophila melanogaster         13,808            262                  1.9
Anopheles gambiae               15,088            216                  1.4
Ciona intestinalis              15,852            316                  2.0
Fugu rubripes                   33,609            632                  1.9
Mus musculus                    22,444            495                  2.2
Homo sapiens                    22,980            472                  2.1
Saccharomyces cerevisiae        6,449             113                  1.8
Candida albicans                6,164             95                   1.5
Neurospora crassa               10,082            104                  1.9
Schizosaccharomyces pombe       4,945             109                  2.2
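The percentage column is the kinase count divided by the total gene count. The short sketch below recomputes it for a handful of rows; the dictionary and function names are hypothetical, and only rows whose values were checked by hand are included.

```python
# (total genes, kinase superfamily members) for selected rows of the table.
genomes = {
    "Arabidopsis thaliana": (25814, 1041),
    "Caenorhabditis elegans": (19484, 417),
    "Drosophila melanogaster": (13808, 262),
    "Homo sapiens": (22980, 472),
    "Schizosaccharomyces pombe": (4945, 109),
}


def kinase_percent(total_genes, kinases):
    """Kinase superfamily members as a percentage of all genes."""
    return 100 * kinases / total_genes


percentages = {
    name: round(kinase_percent(*row), 1) for name, row in genomes.items()
}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

On these numbers the Arabidopsis kinase superfamily takes up roughly twice the share of the genome that it does in the animals and fungi listed.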

Kinase families in rice and Arabidopsis
Gene count differences among families indicate differential expansion
Shiu et al., 2004

Estimation of ancestral RLK family size
Kinase phylogeny of Arabidopsis and rice RLKs
440 speciation points between rice and Arabidopsis
Shiu et al., 2004
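One way to picture how speciation points are counted on a two-species gene tree is the species-overlap rule: an internal node is treated as a speciation if its two child clades share no species, and as a duplication otherwise. The sketch below applies that rule to a toy tree. The rule is a common heuristic swapped in here for illustration, not necessarily the procedure of Shiu et al., 2004, and the gene names and the two-letter species prefix are hypothetical.

```python
def species_of(leaf):
    """Hypothetical naming scheme: the first two letters give the species."""
    return leaf[:2]


def count_events(tree):
    """Return (species set, speciation nodes, duplication nodes) for a
    binary gene tree written as nested 2-tuples of leaf names."""
    if isinstance(tree, str):
        return {species_of(tree)}, 0, 0
    left, right = tree
    l_species, l_spec, l_dup = count_events(left)
    r_species, r_spec, r_dup = count_events(right)
    species = l_species | r_species
    if l_species & r_species:
        # Child clades share a species: the node is a duplication.
        return species, l_spec + r_spec, l_dup + r_dup + 1
    # Disjoint species sets: the node marks the rice-Arabidopsis split.
    return species, l_spec + r_spec + 1, l_dup + r_dup


# Toy tree: At = Arabidopsis thaliana, Os = Oryza sativa.
toy_tree = ((("At1", "At2"), "Os1"), ("At3", ("Os2", "Os3")))
_, n_speciation, n_duplication = count_events(toy_tree)
print(n_speciation, n_duplication)
```

In this toy tree the two speciation nodes correspond to two ancestral genes, while two of the three duplications are lineage-specific gains that happened after the split.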

Development vs. resistance/defense RLKs
Shiu et al., 2004

Contradiction
Plant genes involved in development tend to have high duplicability

Selection for expansion
Depends on the level of variation of the signals
OR

Summary: differential retention
Longevity and duplicability of plant genes

Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High        Enzymes in central metabolic pathways
Low             Low         ??

Focus III: Functional Consequences
Gene duplications: mechanisms, consequences, preferential retention

Functional Consequences of Duplication
Functional divergence and conservation
Is it because of changes in cis-regulatory elements or coding sequences?

How are duplicates retained: subfunctionalization or neofunctionalization?

Divergence in gene expression
Develop pipelines for cis-element prediction and

Divergence in post-translational modification
Conservation of phosphorylation sites across species
SACE: Saccharomyces cerevisiae (budding yeast)
CAGL: Candida glabrata
CAAL: Candida albicans
CATR: Candida tropicalis
NECR: Neurospora crassa
DEHA: Debaryomyces hansenii

Detailed Functional Studies of Duplicate Genes
Functional analyses of DDF1 and DDF2 transcription factors
Derived from a recent whole genome duplication in Arabidopsis
Related to the well-known CBF factors involved in cold and drought stress

DDFs: promoter GFP, knockouts, over-expression studies, interacting proteins, binding targets
Studied in Arabidopsis thaliana and Arabidopsis lyrata

Focus IV: Protein space

Tiling array analysis of transcriptome
Human Chr 21, 22
Kapranov et al., 2002

Posterior probability p(F|coding)

Performance of the CI measure
Known Arabidopsis exons and introns, 90-300 bp

Arabidopsis small proteins that are not annotated
Correctly predicted 19 out of 20 (95%).

Yeast sORFs with translation evidence
Correctly predicted 98 out of 114 (86%)

In intergenic sequences of the Arabidopsis genome
3,274 sORFs identified

Coupling with tiling array expression
Hybridization intensities for feature types

Summary: Novel coding genes
Many unannotated regions in the genomes are expressed.

Using the CI measure, many proteins that were not annotated but have evidence of expression in yeast and Arabidopsis are identified correctly.

Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.

Using tiling array data, we found that many of these novel coding regions are expressed.

Acknowledgement
Lab members

Kousuke Hanada

Melissa Lehti-Shiu

Cheng Zou

Emily Eckenrode

University of Chicago: Justin Borevitz, Xu Zhang

University of Wisconsin: Sara Patterson, Rick Vierstra

University of Missouri: Scott Peck

Michigan State University: Many
Rong Jin, Comp Sci & Eng
Yue-Hua Cui, Stat & Prob
Startup fund

Recent completion

Genome remodeling in polyploids
Genome duplication occurs frequently in plants
What is the fate of duplicates?
How fast do gene losses occur?
Is there any preference in genes retained?
Diagram: genes A-E are duplicated into A1-E1 and A2-E2, and copies are then lost by times t1 and t2; Ng = 5, 10, 8, 5

Comparing degrees of expansion
Combined set: Arabidopsis, ~25,000 proteins; rice prediction, ~66,000 genes
Gene/domain families: shared, unique
Pairwise distance
Putative orthologous groups
Per GO category (e.g. GO:0001): ui = 1 unexpanded, ei = 4 expanded
All orthologous groups: total unexpanded = sum of ui, total expanded = sum of ei
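Given the per-category counts of expanded and unexpanded orthologous groups, one plausible way to ask whether a GO category is over-represented among expanded groups is a one-sided hypergeometric (Fisher-type) test against the genome-wide totals. The slide does not say which test was used, so the choice of test is an assumption, and the totals in the example (10 expanded, 30 unexpanded) are made up for illustration.

```python
from math import comb


def percent_expanded(expanded, unexpanded):
    """Share of orthologous groups in a category that are expanded."""
    return 100 * expanded / (expanded + unexpanded)


def enrichment_p_value(e_i, u_i, total_e, total_u):
    """One-sided hypergeometric tail: probability of seeing at least e_i
    expanded groups when e_i + u_i groups are drawn at random from all groups."""
    n_total = total_e + total_u
    n_category = e_i + u_i
    upper = min(n_category, total_e)
    favorable = sum(
        comb(total_e, x) * comb(total_u, n_category - x)
        for x in range(e_i, upper + 1)
    )
    return favorable / comb(n_total, n_category)


# Category counts from the slide (ei = 4, ui = 1); totals are hypothetical.
p = enrichment_p_value(4, 1, total_e=10, total_u=30)
print(percent_expanded(4, 1), round(p, 4))
```

In practice the test would be repeated for every GO category, so some correction for multiple testing would be needed before calling a category differentially expanded.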

Major questions on gene duplication
When: timing of gene duplications, e.g. N = 10

Domain gains in rice and Arabidopsis
Gain in one lineage does not necessarily predict gain in the other

Identify novel small coding genes
Determine base composition probabilities from coding sequences (CDS parameters) and non-coding sequences (NCDS parameters)
Pc(AAA) =
Pc(T|AAA) =
Calculate posterior probability
Figure labels: c1, c2, c3, c4, c5, c6; feature tables; n

Setting up the Bayes
Priors

S = ATG TTC TAC TTT G

Coding Likelihood (CL)
Sliding windows of a sequence

Simulation based on NCDS (introns)
1 2 3 4 ... n
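The Bayesian setup hinted at by Pc(AAA) and Pc(T|AAA) can be sketched as a pair of third-order Markov models, one trained on coding and one on non-coding sequence, combined through Bayes' rule and applied to sliding windows. Everything below is an illustrative assumption: the add-one smoothing, the equal priors, the window and step sizes, the toy training sequences, and all function names. It is not the CI or CL measure itself.

```python
import math
from collections import defaultdict

K = 3        # context length, as in Pc(T|AAA)
N_BASES = 4


def train(seqs, k=K):
    """Count k-mers and (k-mer -> next base) transitions in training sequences."""
    kmers = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmers[s[i:i + k]] += 1
            if i + k < len(s):
                trans[s[i:i + k]][s[i + k]] += 1
    return kmers, trans


def log_likelihood(seq, model, k=K):
    """log P(seq | model) with add-one smoothing (an assumed choice)."""
    kmers, trans = model
    total = sum(kmers.values())
    ll = math.log((kmers.get(seq[:k], 0) + 1) / (total + N_BASES ** k))
    for i in range(len(seq) - k):
        ctx = trans.get(seq[i:i + k], {})
        ll += math.log((ctx.get(seq[i + k], 0) + 1) / (sum(ctx.values()) + N_BASES))
    return ll


def posterior_coding(seq, coding, noncoding, prior_coding=0.5):
    """P(coding | seq) by Bayes' rule, computed stably in log space."""
    lc = log_likelihood(seq, coding) + math.log(prior_coding)
    ln = log_likelihood(seq, noncoding) + math.log(1 - prior_coding)
    m = max(lc, ln)
    pc, pn = math.exp(lc - m), math.exp(ln - m)
    return pc / (pc + pn)


def sliding_posteriors(seq, coding, noncoding, window=9, step=3):
    """Posterior for each window of the sequence."""
    return [posterior_coding(seq[i:i + window], coding, noncoding)
            for i in range(0, len(seq) - window + 1, step)]


# Toy training data (hypothetical, for illustration only).
coding_model = train(["ATGGCCGCCGCCGCCGCCGCC", "ATGGCCGCCGCCGCC"])
noncoding_model = train(["ATATATATATATATAT", "TATATATATATA"])

print(posterior_coding("GCCGCCGCC", coding_model, noncoding_model))
print(sliding_posteriors("ATGTTCTACTTTG", coding_model, noncoding_model))
```

With real data the two models would be trained on annotated CDS and on introns or other non-coding sequence, and the prior would reflect how much of the scanned sequence is expected to be coding.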

Divergence in post-translational modification
Conservation of phosphorylation sites across species
