17
Advancing Science with DNA Sequence Finding the genes in Finding the genes in microbial genomes microbial genomes Natalia Ivanova Natalia Ivanova MGM Workshop MGM Workshop May 15, 2012 May 15, 2012

Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Embed Size (px)

Citation preview

Page 1: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Finding the genes in Finding the genes in microbial genomesmicrobial genomes

Natalia IvanovaNatalia IvanovaMGM WorkshopMGM Workshop

May 15, 2012May 15, 2012

Page 2: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

1.1. Introduction Introduction

2. Tools out there2. Tools out there

3. Basic principles behind tools 3. Basic principles behind tools and known problemsand known problems

4. Metagenomes4. Metagenomes

Outline

Page 3: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Sequence features in prokaryotic genomes: stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA) protein-coding genes (CDSs) transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends) translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins) pseudogenes (tRNA and protein-coding genes) …

Finding the genes in microbial genomesfeaturesfeatures

Well-annotated bacterial genome in Artemis genome viewer:

Page 4: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

1.1. Introduction Introduction

2. Tools out there2. Tools out there (don’t bother to write down the names and links, (don’t bother to write down the names and links,

all presentations will be available on the web all presentations will be available on the web site)site)

3. Known problems3. Known problems

4. Metagenomes4. Metagenomes

Outline

Page 5: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

IMG-ERhttp://img.jgi.doe.gov/

RASThttp://rast.nmpdr.org/

JCVI Annotation Servicehttp://www.jcvi.org/cms/research/projects/

annotation-service/

NCBI PGAAPhttp://www.ncbi.nlm.nih.gov/genomes/static/

Pipeline.html

Publicly available genome annotation services

Page 6: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

•“Non-coding” RNAs 23S, 16S and 5S rRNAstRNAs(tmRNA, RNaseP RNA component)

•Protein-coding genesTopological features (signal peptides, transmembrane regions)

• Repeats (CRISPRs, IS elements)• Regulatory RNAs (riboswitches)• Pseudogenes and frameshifts

Features predicted by most pipelines

Page 7: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

• Best way – covariance models (captures the secondary structure)

• Used for small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)Rfam database, INFERNAL search toolhttp://www.sanger.ac.uk/Software/Rfam/http://rfam.janelia.org/http://infernal.janelia.org/tRNAScan-SEhttp://lowelab.ucsc.edu/tRNAscan-SE

• Does not work for large RNAs (23S, 16S). Alternatives: BLASTnRNAmmer http://www.cbs.dtu.dk/services/RNAmmer/

RNA prediction

Page 8: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)Open reading frame (ORF): reading frame between a start and stop codon

CDS (not ORF!) prediction

Page 9: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Two major approaches to CDS prediction

Two major approaches to prediction of protein-coding genes:

ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)

Advantages: finds “unique” genes; high sensitivity; very fast!Limitations: often misses “unusual” genes; high rate of false

positives

“evidence-based” (ORFs with translations homologous to the known proteins are CDSs)

Advantages: finds “unusual” genes (e. g. horizontally transferred); relatively low rate of false positive predictions

Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!

Page 10: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

How ab initio tools work – very briefly

• Statistical model of coding and non-coding regions

• Statistical model(s) of other components (RBS)

• Additional algorithms for refinement of predictions (overlap resolution, etc.)

Prokaryotic gene model used by all ab initio gene findersRibosome-binding site within certain distance of the start codon;One of 3 start codons;One of 3 stop codons;No frame interruptions

Ribosome binding site

Start codon:ATG, GTG, TTG

Stop codon:TAG, TAA, TGA

open reading frame

Page 11: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Gene finders: ab initio tools; evidence-based refinement

Ab initio tools used by the pipelines:Glimmer family (Glimmer2, Glimmer3, RBS finder) ->NCBI,

RAST, JCVIhttp://glimmer.sourceforge.net/

GeneMark family (GeneMark-hmm, GeneMarkS) ->NCBIhttp://exon.gatech.edu/GeneMark/

PRODIGAL -> IMG-ER, NCBIhttp://compbio.ornl.gov/prodigal/

Evidence-based refinement: mostly undocumented in-house developed tools.Types of corrections:missed genes (RAST, JCVI, NCBI), frameshifts (JCVI, NCBI), start sites

(RAST)

Page 12: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Known problems of all annotation pipelines

• RNAs– Incomplete rRNAs– Trans-spliced tRNAin archaeal genomes– Small structural RNAs not predicted at all

• Protein-coding genes that don’t fit into prokaryotic gene model used by ab initio gene finders– no RBS (leaderless transcripts)– interrupted translation frame

sequencing errors ortranslational exceptions

– non-canonical start

Ribosome binding

site

Start codon:ATG, GTG, TTG

Stop codon:TAG, TAA, TGA

open reading frame

Genome Sequencing center

16S rRNA, nt

Synechococcus sp. CC9311 UCSD, TIGR 1477

Synechococcus sp. CC9605 JGI 1440

Synechococcus elongatus PCC 7942

JGI 1490

Synechococcus sp. JA-2-3BA(2-13)

TIGR 1323

Synechococcus sp. JA-3-3Ab TIGR 1324

Synechococcus sp. RCC307 Genoscope 1498

Synechococcus sp. WH7803 Genoscope 1497, 1464

Page 13: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Symptoms of gene finding problems

• Some type of mandatory features (rRNAs, tRNAs, CDSs) is missing

• “Truncated” genes (shorter than homologs) => unusual translation initiation features (non-canonical start codons, leaderless transcripts)

• Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts)

• Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families

Page 14: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Supplemental tools

TIS (translation initiation site) prediction/correction

TICO http://tico.gobics.de/TriTISA http://mech.ctb.pku.edu.cn/protisa/TriTISA

Two tools often disagree about the best TIS, especially in high GC genomes

Operon predictionJPOP http://csbl.bmb.uga.edu/downloads/#jpophttp://www.cse.wustl.edu/~jbuhler/research/operons/http://www.sph.umich.edu/~qin/hmm/

Proteins with unusual translational features – selenocysteine-containing genesbSECISearch http://genomics.unl.edu/bSECISearch/

Page 15: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Metagenomes sequenced with new technologies: low-coverage problems

• Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x)

• High sequence coverage cannot be achieved for metagenome data

How does this affect metagenome annotation?~70% of 454 Titanium reads have at least 1 sequencing

artifact (basecalls in homopolymeric runs), there is no clear pattern of error distribution

>100 bp Illumina reads have ~3% error rate, error rate is higher towards the end of the read, the majority of errors are substitutions

metagenomegenome

sequence sequence

coverage

coverage

Page 16: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Just one example…

predicted gene

3 frameshifts

4-read contig, 1476 nt, no misassembly

Contig has 27 homopolymers (3 nt and more), 3 of them have errorsNo correlation with homopolymer type or error type Reads were quality trimmed prior to assembly

Page 17: Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012

Advancing Science with DNA Sequence

Metagenome annotation tools (more details will be given)

GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs)http://exon.gatech.edu/GeneMark/

• MetaGenehttp://metagene.cb.k.u-tokyo.ac.jp/metagene/

• FragGeneScanhttp://omics.informatics.indiana.edu/FragGeneScan/

Full-service annotation pipelines• IMG/M-ER – “metagenome gene calling” + other

optionshttp://img.jgi.doe.gov/submit

• MG-RASThttp://metagenomics.nmpdr.org/

• CAMERA annotation pipelinehttp://camera.calit2.net/