GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University
Genomes and gene contents
[Figure: gene counts for the genomes shown: 30,000; 25,000; 10,000; 6,000; 45,000; 17,000]
Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters from Markov clustering using all-against-all BLAST E values as distance measures
Gene function and duplication
What's the consequence?
Focus I: Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, consequences, preferential retention
Duplication mechanisms
Whole genome duplication
Tandem duplication
Segmental duplication
Replicative transposition
Lineage-specific gains in plants and animals
Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice          10,115                   6,743              28,467                            35.5 (23.7)**
Arabidopsis   5,984                    3,990              21,936                            27.3 (18.2)**
Human         811                      811                21,954                            3.7
Mouse         1,265                    1,265              24,041                            5.3
*: Gain counts are normalized by the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses are percentages of the total based on normalized gains.
Substantially more recent duplicates in plants than in animals
Mostly due to frequent whole-genome duplications in plants
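The normalization in this table can be reproduced with a few lines of arithmetic. The sketch below assumes the plant gains are simply divided by the divergence-time ratio (150/100 = 1.5), which matches the table values to within rounding; the function and variable names are illustrative and not taken from the original analysis.

```python
# Divergence times (Mya) quoted in the table footnote.
PLANT_DIVERGENCE_MYA = 150   # Arabidopsis-rice
ANIMAL_DIVERGENCE_MYA = 100  # human-mouse


def normalized_gain(raw_gain, divergence_mya, reference_mya=ANIMAL_DIVERGENCE_MYA):
    """Scale a lineage-specific gain count to the reference divergence time."""
    return raw_gain * reference_mya / divergence_mya


def percent_of_total(gain, genes_in_families):
    """Gain expressed as a percentage of the genes in the families analyzed."""
    return 100.0 * gain / genes_in_families


# (raw gain, genes in families analyzed, divergence time), copied from the table.
lineages = {
    "Rice": (10115, 28467, PLANT_DIVERGENCE_MYA),
    "Arabidopsis": (5984, 21936, PLANT_DIVERGENCE_MYA),
    "Human": (811, 21954, ANIMAL_DIVERGENCE_MYA),
    "Mouse": (1265, 24041, ANIMAL_DIVERGENCE_MYA),
}

for name, (gain, total, mya) in lineages.items():
    norm = normalized_gain(gain, mya)
    print(f"{name:12s} raw {percent_of_total(gain, total):4.1f}%  "
          f"normalized {percent_of_total(norm, total):4.1f}%")
```

Even after normalization, the plant percentages remain several times higher than the animal ones, which is the point the slide makes.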
Gain vs. Loss
3 rounds of whole-genome duplication in the Arabidopsis lineage
~82% of duplicates from the last round were lost in the past 40 million years
[Figure: gene content doubling through three rounds of genome duplication: 15,000* → 30,000 → 60,000 → 120,000; Arabidopsis gene content: 21,000**]
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
Genome duplications + tandem duplications − gene losses = current gene content
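As a back-of-the-envelope reading of these numbers: if an ancestral set of 15,000 orthologous groups doubled at each of three whole-genome duplications with no loss, it would reach 120,000 genes, against roughly 21,000 genes observed in shared families. The sketch below only performs that bookkeeping. It ignores tandem duplications and treats the comparison between orthologous groups and gene counts as a rough assumption, so the result is a net figure across all rounds, not the per-round loss rate quoted on the slide.

```python
def expected_without_loss(ancestral, rounds):
    """Gene count after `rounds` whole-genome doublings with no gene loss."""
    return ancestral * 2 ** rounds


ancestral_groups = 15_000   # orthologous groups in shared families (slide footnote *)
observed_genes = 21_000     # genes in shared families (slide footnote **)
rounds_of_wgd = 3

# Expected gene content after 0, 1, 2 and 3 rounds of duplication.
trajectory = [expected_without_loss(ancestral_groups, r)
              for r in range(rounds_of_wgd + 1)]
expected = trajectory[-1]

net_loss = expected - observed_genes
retained_fraction = observed_genes / expected

print(trajectory)
print(f"net loss: {net_loss} genes; retained: {retained_fraction:.1%}")
```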
Age distribution of animal duplicates
Steady decay in the number of duplicates
Frequent TD, SD, and RT
Ks: rate of nucleotide substitutions in codon sites that do not affect amino acid identity
Shiu et al., 2006
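To make the Ks definition concrete, the sketch below classifies codon differences between two aligned coding sequences as synonymous (same amino acid) or nonsynonymous. It is only a raw codon-level count under simplifying assumptions (standard genetic code, gap-free in-frame alignment, no correction for multiple hits or for the number of synonymous sites), not the Ks estimator used in the cited work. The example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # amino acid unchanged
        else:
            nonsynonymous += 1   # amino acid changed
    return synonymous, nonsynonymous


# Hypothetical duplicate pair: TTC->TTT and GGA->GGG are silent, TTT->CTT is not.
copy_1 = "ATGTTCTACTTTGGA"
copy_2 = "ATGTTTTACCTTGGG"
print(count_codon_differences(copy_1, copy_2))
```

Because synonymous changes accumulate roughly with time, a distribution of such distances across duplicate pairs serves as a proxy for duplicate age.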
Plant duplicate age distribution
Apparent peak at Ks ~0.18 instead of zero
Frequent WGD, TD, SD (maybe), and RT (in some plants)
Shiu et al., 2004
Genome remodeling in polyploids
Natural and synthetic polyploids
[Figure: genome sizes ~348 Mb, ~203 Mb, ~314 Mb, ~257 Mb; 20,000 yr]
Experimental approaches
Genome-wide polymorphism monitored by tiling array
[Figure: tiled probes along the genome, with gap and resolution marked; array with ~6 million features; 20,000 yr]
Genome-wide Single Feature Polymorphism (SFP)
Mid-parent (MP) vs. Arabidopsis suecica (As)
Polyploid    SFPs
Natural      58,517
Synthetic    503
Genome-wide Single Feature Polymorphism
Genome-wide polymorphism monitored by tiling array
[Figure: SFPs by feature type: gene, pseudogene, transposon]
Genome-wide Single Feature Polymorphism
Duplication or deletion
MP duplication or As deletion
Genome Survey Sequencing
Sequence ~40-60 Mb of the Arabidopsis suecica genome (0.15-0.2X coverage); will be done next week!
Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
Ultra-high throughput: 20-30 Mb per run, each run 5 hours; will be 100 Mb per run in early 2007
Cost efficient: ~$0.3/kb
Read length rather limited: ~100 bp per read now; will be ~200 bp in early 2007
For more information contact: Andreas Weber, David DeWitt, or Shin-Han Shiu
Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS
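For a rough sense of scale, the quoted throughput and cost figures can be combined with the planned 40-60 Mb survey. The sketch below is plain arithmetic and assumes that 1 Mb = 1,000 kb and that the quoted ~$0.3/kb applies uniformly; the function names are illustrative.

```python
import math

HOURS_PER_RUN = 5  # quoted run time


def runs_needed(target_mb, mb_per_run):
    """Whole instrument runs required to reach the target amount of sequence."""
    return math.ceil(target_mb / mb_per_run)


def cost_usd(target_mb, cents_per_kb=30):
    """Sequencing cost in dollars; cents are used to avoid float rounding."""
    kilobases = target_mb * 1000  # assumes 1 Mb = 1,000 kb
    return kilobases * cents_per_kb / 100


# Best case (small target, high yield) and worst case (large target, low yield).
for target_mb, mb_per_run in [(40, 30), (60, 20)]:
    runs = runs_needed(target_mb, mb_per_run)
    print(f"{target_mb} Mb at {mb_per_run} Mb/run: {runs} runs, "
          f"{runs * HOURS_PER_RUN} h, ~${cost_usd(target_mb):,.0f}")
```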
Summary: Gene duplication and polyploidy
Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.
In plants, whole-genome duplication is common, but gene loss also occurs frequently.
After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.
After 20,000 generations, most coding genes do not have clustered sequence polymorphisms indicative of deletion.
Clustered polymorphisms are mostly located in pseudogenes and transposons.
Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.
Focus II: Differential Retention of Duplicates
Gene duplications: mechanisms, consequences, preferential retention
Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters from Markov clustering using all-against-all BLAST E values as distance measures
Large gene families in plants
One of the largest gene families
Normalized gain: % expanded OGs
Large family sizes do not necessarily indicate higher expansion rates
Ancestral family sizes and gene gains
Large ancestral families tend to have more lineage-specific gains, but with many exceptions
Differential expansion of functional categories
GO: Gene Ontology
Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism
Differences in Duplicability
Duplicability: the propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend
Categories compared between Arabidopsis and human: defense response, proteolysis, transport, ion channel activity, metabolism, development, protein kinase activity, transcription factor activity
Kinase superfamily sizes among eukaryotes
Shiu & Bleecker, 2003
Organism                      Number of genes   Kinase superfamily   Percent total gene
Arabidopsis thaliana          25,814            1,041                4.0
Oryza sativa subsp. indica    ~35,000           1,607                3.6
Chlamydomonas reinhardtii     ~12,200           414                  3.4
Plasmodium falciparum         5,334             94                   1.8
Plasmodium yoelii             7,681             70                   0.9
Caenorhabditis elegans        19,484            417                  2.1
Drosophila melanogaster       13,808            262                  1.9
Anopheles gambiae             15,088            216                  1.4
Ciona intestinalis            15,852            316                  2.0
Fugu rubripes                 33,609            632                  1.9
Mus musculus                  22,444            495                  2.2
Homo sapiens                  22,980            472                  2.1
Saccharomyces cerevisiae      6,449             113                  1.8
Candida albicans              6,164             95                   1.5
Neurospora crassa             10,082            104                  1.9
Schizosaccharomyces pombe     4,945             109                  2.2
Kinase families in rice and Arabidopsis
Gene count differences among families indicate differential expansion
Shiu et al., 2004
Estimation of ancestral RLK family size
[Figure, panels A and B: kinase phylogeny of Arabidopsis and rice RLKs; 440 speciation points between rice and Arabidopsis]
Shiu et al., 2004
Development vs. resistance/defense RLKs
Shiu et al., 2004
Contradiction
Plant genes involved in development tend to have high duplicability
Selection for expansion
Depends on the level of variation in the signals
OR
Summary: differential retention
Longevity and duplicability of plant genes
Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High        Enzymes in central metabolic pathways
Low             Low         ??
Focus III: Functional Consequences
Gene duplications: mechanisms, consequences, preferential retention
Functional Consequences of Duplication
Functional divergence and conservation
Is it because of changes in cis-regulatory elements or coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?
Divergence in gene expression
Develop pipelines for cis-element prediction and
Divergence in post-translational modification
Conservation of phosphorylation sites across species
SACE: budding yeast
CAGL: Candida glabrata
CAAL: Candida albicans
CATR: Candida tropicalis
NECR: Neurospora crassa
DEHA: Debaryomyces hansenii
Detailed Functional Studies of Duplicate Genes
Functional analyses of DDF1 and DDF2 transcription factors
Derived from a recent whole-genome duplication in Arabidopsis
Related to the well-known CBF factors involved in cold and drought stress
[Diagram, shown for both Arabidopsis thaliana and Arabidopsis lyrata: DDFs studied through promoter-GFP, knockouts, over-expression studies, interacting proteins, and binding targets]
Focus IV: Protein space
Tiling array analysis of transcriptome
Human Chr 21, 22
Kapranov et al., 2002
Posterior probability p(F|coding)
Performance of the CI measure
Known Arabidopsis exons and introns, 90-300 bp
Arabidopsis small proteins that are not annotated: correctly predicted 19 out of 20 (95%)
Yeast sORFs with translation evidence: correctly predicted 98 out of 114 (86%)
In intergenic sequences of the Arabidopsis genome: 3,274 sORFs identified
Coupling with tiling array expression
Hybridization intensities for feature types
Summary: Novel coding genes
Many unannotated regions in the genomes are expressed.
Using the CI measure, many yeast and Arabidopsis proteins that were not annotated but have evidence of expression are identified correctly.
Using the CI measure, we estimated that ~3,000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
Using tiling array data, we found that many of these novel coding regions are expressed.
Acknowledgement
Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode
University of Chicago: Justin Borevitz, Xu Zhang
University of Wisconsin: Sara Patterson, Rick Vierstra
University of Missouri: Scott Peck
Michigan State University: Many; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob
Startup fund
Recent completion
Genome remodeling in polyploids
Genome duplication occurs frequently in plants
What is the fate of duplicates?
How fast do gene losses occur?
Is there any preference in genes retained?
[Diagram: ancestral genes A-E are duplicated into copies A1-E1 and A2-E2, followed by gene loss over time points t1 and t2; gene number Ng = 5, 10, 8, 5]
Comparing degrees of expansion
Combined set: Arabidopsis, ~25,000 proteins; rice prediction, ~66,000 genes
Gene/domain families: shared or unique
Pairwise distance
Putative orthologous groups
Per GO category (e.g. GO:0001): ui = 1 unexpanded and ei = 4 expanded orthologous groups
All orthologous groups: total unexpanded = Σ ui, total expanded = Σ ei
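The per-category tally on this slide can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes an orthologous group counts as "expanded" when it holds more than one gene from the lineage of interest, and the input records are made-up toy data chosen to reproduce the slide's example of ui = 1 and ei = 4 for GO:0001.

```python
from collections import Counter


def tally_expansion(orthologous_groups):
    """Count unexpanded (u_i) and expanded (e_i) orthologous groups per GO term.

    Each record is (go_terms, n_lineage_genes). A group is treated as expanded
    when it contains more than one gene from the lineage (an assumption).
    """
    unexpanded, expanded = Counter(), Counter()
    for go_terms, n_lineage_genes in orthologous_groups:
        target = expanded if n_lineage_genes > 1 else unexpanded
        for term in go_terms:
            target[term] += 1
    return unexpanded, expanded


# Toy data: (GO terms annotated to the group, genes from the lineage in the group).
toy_groups = [
    (["GO:0001"], 1),
    (["GO:0001"], 2),
    (["GO:0001"], 3),
    (["GO:0001"], 2),
    (["GO:0001"], 4),
    (["GO:0002"], 1),
    (["GO:0002"], 1),
]

u, e = tally_expansion(toy_groups)
for term in sorted(set(u) | set(e)):
    total = u[term] + e[term]
    print(f"{term}: u={u[term]} e={e[term]} expanded={100 * e[term] / total:.0f}%")
print("total unexpanded:", sum(u.values()), "total expanded:", sum(e.values()))
```

Comparing each category's expanded fraction against the totals is what allows categories such as those listed under differential expansion to be flagged as over- or under-expanded.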
Major questions on gene duplication
When: timing of gene duplications, e.g. N = 10
Domain gains in rice and Arabidopsis
Gain in one lineage does not necessarily predict gain in the other
Identify novel small coding genes
Determine base composition probabilities
Coding sequences → CDS parameters
Non-coding sequences → NCDS parameters
Pc(AAA) =
Pc(T|AAA) =
Calculate posterior probability
Feature tables: c1 c2 c3 c4 c5 c6 ... n
Setting up the Bayes
Priors
S = ATG TTC TAC TTT G
Coding Likelihood (CL)
Sliding windows of a sequence
Simulation based on NCDS (introns)
[Figure: simulations 1, 2, 3, 4, ..., n]
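The general idea behind these slides (composition parameters learned separately from coding and non-coding sequences, then combined through Bayes' rule over sliding windows) can be sketched as below. This is an illustrative stand-in, not the CI or CL measure itself: it assumes a third-order Markov chain suggested by the Pc(AAA) and Pc(T|AAA) terms, add-one pseudocounts, and a flat prior of 0.5. The class and function names and the tiny training sets are made up.

```python
import math
from collections import Counter

K = 3  # context length, matching the P(T | AAA) form on the slide


class MarkovModel:
    """K-th order Markov model of base composition with pseudocounts."""

    def __init__(self, sequences, pseudocount=1.0):
        self.pseudo = pseudocount
        self.kmers = Counter()      # counts of each K-mer
        self.context = Counter()    # counts of K-mers that have a following base
        self.next_base = Counter()  # counts of (K-mer, following base)
        for seq in sequences:
            for i in range(len(seq) - K + 1):
                kmer = seq[i:i + K]
                self.kmers[kmer] += 1
                if i + K < len(seq):
                    self.context[kmer] += 1
                    self.next_base[kmer, seq[i + K]] += 1
        self.total = sum(self.kmers.values())

    def log_likelihood(self, seq):
        """log P(seq | model) = log P(first K-mer) + sum of log P(base | context)."""
        p_first = (self.kmers[seq[:K]] + self.pseudo) / (self.total + self.pseudo * 4 ** K)
        ll = math.log(p_first)
        for i in range(K, len(seq)):
            ctx = seq[i - K:i]
            num = self.next_base[ctx, seq[i]] + self.pseudo
            den = self.context[ctx] + 4 * self.pseudo
            ll += math.log(num / den)
        return ll


def posterior_coding(seq, coding, noncoding, prior_coding=0.5):
    """Bayes' rule: P(coding | seq), computed stably in log space."""
    log_c = coding.log_likelihood(seq) + math.log(prior_coding)
    log_n = noncoding.log_likelihood(seq) + math.log(1.0 - prior_coding)
    top = max(log_c, log_n)
    weight_c = math.exp(log_c - top)
    weight_n = math.exp(log_n - top)
    return weight_c / (weight_c + weight_n)


def sliding_posteriors(seq, coding, noncoding, window=9, step=3):
    """Posterior for each window of `window` bases, moving `step` bases at a time."""
    return [(start, posterior_coding(seq[start:start + window], coding, noncoding))
            for start in range(0, len(seq) - window + 1, step)]


# Hypothetical training sets (real use would need many annotated CDS and introns).
cds_model = MarkovModel(["ATGGCTGCTGCTGAA", "ATGTTCTACTTTGGA"])
ncds_model = MarkovModel(["TTTTATATATTTTAA", "ATATATTTTTTATAT"])

example = "ATGTTCTACTTTG"  # the sequence S shown on the slide
for start, p in sliding_posteriors(example, cds_model, ncds_model):
    print(start, round(p, 3))
```

Scoring windows drawn from known introns in the same way would give an empirical null distribution against which a candidate window's posterior can be judged.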
Divergence in post-translational modification
Conservation of phosphorylation sites across species