Genomics of Microbial Eukaryotes

Igor Grigoriev, Fungal Genomics Program Head

US DOE Joint Genome Institute, Walnut Creek, CA

<ivgrigoriev@lbl.gov>

2

Large and Complex Eukaryotes

3

Outline

Eukaryotic Genome Annotation

Fungal Genomics Program

MycoCosm

4

Started with Human Genome Project

5

genome.jgi.doe.gov

IMG

MycoCosm

150+ annotated eukaryotic genomes

6

Annotation Pipeline

Genomic assembly and ESTs

Annotation
• Gene predictions
• Protein annotations
• Reference data mapping
• Repeat masking
• Manual curation (optional)
• Validations

Analysis
• Gene families
• Gene expression
• Phylogenomics
• Proteomics
• Protein targeting
• etc.

7

Eukaryotic Gene Prediction

Gene model: Promoter, 5’UTR, exons and introns (ATG ... GT ... AG ... TGA), 3’UTR, PolyA

Ab initio methods use knowledge of known genes’ structures to predict start, stop, and splice sites in CDS only (Fgenesh, GeneMark). Train on known genes, then predict model.

Protein-based methods build CDS exons around known protein alignments (Fgenesh+, GeneWise). Example evidence: GenBank protein.

Transcript-based methods map or assemble transcripts on the genome, including UTRs (EST_map, CombEST). Example evidence: EST contig.
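As a rough illustration of the signals in the gene model (ATG start, TGA stop, GT...AG intron boundaries), the Python sketch below scans a made-up sequence for candidate sites and splices out one intron. It is only a toy: real ab initio predictors such as those named above rely on statistical models trained on known genes. The function names, the sequence, and the intron coordinates are assumptions for illustration.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_signals(seq):
    """Return 0-based positions of candidate start codons, donor and acceptor sites."""
    seq = seq.upper()
    return {
        "starts": [m.start() for m in re.finditer("(?=ATG)", seq)],
        "donors": [m.start() for m in re.finditer("(?=GT)", seq)],     # intron begins with GT
        "acceptors": [m.start() for m in re.finditer("(?=AG)", seq)],  # intron ends with AG
    }

def splice(seq, introns):
    """Remove introns given as half-open (start, end) intervals; return joined exons."""
    parts, prev = [], 0
    for start, end in sorted(introns):
        parts.append(seq[prev:start])
        prev = end
    parts.append(seq[prev:])
    return "".join(parts)

def first_orf(cds):
    """Return the first ATG...stop open reading frame, or None if there is none."""
    cds = cds.upper()
    start = cds.find("ATG")
    if start < 0:
        return None
    for i in range(start, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return cds[start:i + 3]
    return None

# Hypothetical toy gene: exon 1, one GT...AG intron, exon 2.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GATTGA"
signals = find_signals(genomic)
cds = splice(genomic, [(6, 18)])
orf = first_orf(cds)
```

GT and AG also occur by chance (the scan reports a second GT at position 10 and a second AG at position 9), which is why predictors score candidate sites in context rather than matching them literally.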

8

More Gene Prediction

Use ESTs/cDNAs to extend, correct or predict gene models
• ESTEXT
[Figure: a predicted model (ATG ... TGA) and ESTs give an extended model with 5’UTR and 3’UTR]

Detect orthologs with poor alignments and refine with synteny-based methods
• FGENESH2
[Figure: Genome A, Genome B]

Non-redundant gene set is built from “the best” models from each locus according to homology and ESTs, followed by manual curation
[Figure: FGENESH, GENEWISE, EXTERNAL MODELS; Representative set]
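The idea of keeping “the best” model per locus can be sketched as follows. This is a minimal illustration, not the JGI implementation: the `GeneModel` fields, the equal weighting of homology and EST evidence, and the example values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GeneModel:
    locus: str
    predictor: str
    homology_score: float  # hypothetical score in 0..1
    est_coverage: float    # hypothetical fraction of the model covered by ESTs, 0..1

def model_score(model, w_homology=0.5, w_est=0.5):
    """Combine the two lines of evidence; the weights are an assumption."""
    return w_homology * model.homology_score + w_est * model.est_coverage

def representative_set(models):
    """Keep the highest-scoring model at each locus."""
    best = {}
    for model in models:
        current = best.get(model.locus)
        if current is None or model_score(model) > model_score(current):
            best[model.locus] = model
    return best

# Made-up competing models at two loci.
models = [
    GeneModel("locus1", "FGENESH", 0.9, 0.5),
    GeneModel("locus1", "GENEWISE", 0.6, 0.6),
    GeneModel("locus2", "GENEWISE", 0.8, 0.9),
    GeneModel("locus2", "EXTERNAL", 0.7, 0.4),
]
chosen = representative_set(models)
```

Manual curation would then review or override these automatic choices.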

9

Combine Gene Predictors for Better Quality

Heterobasidion annosum v1.0

Metric                                                       Eugene   GeneMark  Fgenesh  JGI Pipe
Number of gene models                                        11,547   9,609     8,409    12,270
Models with partial EST support                              5,544    3,829     4,567    5,248
with full length EST support                                 2,538    1,182     2,896    3,073
EST coverage per gene                                        77.7%    68.2%     80.8%    79.1%
supported splice sites                                       41,581   40,808    45,498   47,671
Models with homology support                                 6,758    6,043     5,750    7,214
with strong homology support (80+% identity, 80+% coverage)  112      109       174      187
model coverage                                               64%      60%       68%      69%
Models with homology and EST support                         2,894    2,172     2,720    2,953
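Because the predictors report different numbers of models, it can help to read the counts above as fractions too. The sketch below does that arithmetic for the "homology and EST support" row; the numbers are copied from the comparison above, while the variable and function names are made up for illustration.

```python
# Counts copied from the Heterobasidion annosum v1.0 comparison above.
gene_models = {"Eugene": 11547, "GeneMark": 9609, "Fgenesh": 8409, "JGI Pipe": 12270}
both_support = {"Eugene": 2894, "GeneMark": 2172, "Fgenesh": 2720, "JGI Pipe": 2953}

def percent_supported(supported, total):
    """Percentage of each predictor's models that carry the given support."""
    return {name: 100.0 * supported[name] / total[name] for name in total}

pct_both = percent_supported(both_support, gene_models)
most_models = max(both_support, key=both_support.get)  # largest absolute count
highest_share = max(pct_both, key=pct_both.get)        # largest fraction of its own models

for name, pct in pct_both.items():
    print(f"{name:9s} {both_support[name]:6,d} of {gene_models[name]:6,d} models ({pct:.1f}%)")
```

Absolute counts and fractions answer different questions: the first shows how many well-supported models a method delivers, the second how much of its output is well supported.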

10

Re-annotation Using Comparative Genomics

Metric                        MAKER    JGI pipeline  Re-annot
# of predicted gene models    9,940    12,290        12,802
with SwissProt hits           6,521    7,356         7,900
with non-repeat PFAM domains  5,365    6,010         6,353
with EST support              9,252    10,796        11,105
with >90% EST support         7,729    9,178         9,444
# of unique PFAM domains      2,207    2,245         2,322
EST coverage per gene         93.0%    93.3%         93.3%
# EST-supported splice sites  99,627   102,200       104,246

Asaf Salamov

11

Protein Annotation

Predicted protein:
Domain (InterPro, TMHMM)
Signal peptide (SignalP)
Possible orthologs (in nr, SwissProt, KEGG, KOG)
Possible paralogs (Blastp+MCL)

Higher order assignments:
Gene Ontology terms
EC numbers --> KEGG pathways
Gene families, with and without other species
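Grouping proteins into families from Blastp hits can be sketched with a much simpler stand-in for MCL. MCL simulates flow on the similarity graph; the code below instead takes connected components of hits that pass an E-value cutoff (single-linkage clustering with union-find), so it will merge families more readily than MCL would. Protein IDs, hits, and the cutoff are made up for illustration.

```python
def cluster_hits(hits, max_evalue=1e-5):
    """Group proteins linked by hits with E-value <= max_evalue (not MCL itself)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)  # registers both proteins
        if evalue <= max_evalue and root_q != root_s:
            parent[root_s] = root_q

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(members) for members in families.values()), key=lambda f: f[0])

# Hypothetical (query, subject, E-value) triples from an all-vs-all Blastp run.
hits = [
    ("p1", "p2", 1e-30),
    ("p2", "p3", 1e-12),
    ("p4", "p5", 1e-8),
    ("p3", "p4", 0.5),  # too weak to link the two groups
]
families = cluster_hits(hits)
```

Here the weak p3/p4 hit is ignored, leaving two candidate families.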

12

Validation with Transcriptomics

Transformation of EST sequencing
[Figure: bar chart, 0% to 100%, of genes supported by ESTs versus other genes for Sanger, 454, and Illumina; data labels 5531 and 34]

Processing RNA-Seq with CombEST
[Figure: EST profile; tracks: models, ESTs; label: Old Sanger Days]

13

Validation with Proteomics

Wright et al, BMC Genomics (2009)

14

Gene Cluster Analysis

Comparative analysis

15

Genome Portal Framework

16

Many Genes of Eco-responsive Daphnia pulex

First crustacean, aquatic animal sequenced, new model organism
30,940 predicted D. pulex genes in ~200 Mb genome
85% supported by 1+ lines of evidence
Colbourne et al, Science, 2011

17

Half of Daphnia Genes: No Homologs, Expressed Under Environmental Stress

With Evgeny Zdobnov’s group (Univ. Genève)

* Of 716 highly conserved single copy orthologs, Daphnia is missing only two

Colbourne et al, 2011

18

Outline

Eukaryotic Genome Annotation

Fungal Genomics Program

MycoCosm

19

Fungal Genomics for Energy & Environment

Grow: Plant symbionts and pathogens
Degrade: Lignocellulose degradation
Ferment: Sugar fermentation
Bio-refinery

GOAL: Scale up sequencing and analysis of fungal diversity for DOE science and applications

20

[Pie chart: 758 fungal projects by sequencing center (DOE Joint Genome Institute, Broad Institute, Sanger Institute, Washington Univ, other); segment labels 13%, 10%, 31%, 5%, 41%. Source: GOLD (October 2011)]

21

Genomic Encyclopedia of Fungi

• Chapter 1: Plant health
  • Symbiosis
  • Plant Pathogenicity
  • Biocontrol
• Chapter 2: Biorefinery
  • Lignocellulose degradation
  • Sugar fermentation
  • Industrial organisms
• Chapter 3: Diversity
  • Phylogenetics
  • Ecology

22

Genome-Centric View

Comparative View

http://jgi.doe.gov/fungi
100+ fungal genomes
5000+ visitors/month

23

Comparative Genome Analysis

24

Strategy: 1000 Fungal Genomes

Goal: Sequencing 1000 fungal genomes from across the Fungal Tree of Life will provide references for research on plant-microbe interactions and environmental metagenomics.

[Pie chart by phylum: Ascomycota, Basidiomycota, Blastocladiomycota, Chytridiomycota, Glomeromycota, Microsporidia, Neocallimastigomycota, Zygomycota, Unknown; segment labels 68% and 23%]

25

Strategy: Fungal Systems

Model fungi: S. commune, T. terrestris
Simple systems: Lichen (alga + fungus), ECM (plant + fungus)
Complex environments: Forest soil metagenomes

26

Model Mushroom Development

Ohm et al, 2010

SEQUENCE FUNCTION MODEL

WT

S.commune

<Transcriptomics>

Gene knock-outs

Modeling regulatory cascades

27

Summary

Eukaryotic Annotation Recipe:

Combine gene predictors, experimental data, and community expertise

Fungal Genomics: we aim to scale up sequencing & comparative analysis of fungi relevant for energy & environment (jgi.doe.gov/fungi)

28

Enjoy Algae as well!

http://genome.jgi.doe.gov/Algae

29

Acknowledgements

JGI Staff

Our Users

30

Outline

Eukaryotic Genome Annotation

Fungal Genomics Program

MycoCosm
