Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Advancing Science with DNA Sequence

Data Curation in IMG-ER

Natalia Ivanova

MGM Workshop

September 28, 2011

Tricky question

• What do you need to do data curation in IMG?
a) iPhone
b) PhD in Computer Science
c) supernatural powers

• Correct answer: you need an IMG account: http://img.jgi.doe.gov/er

What can be curated in IMG-ER?

1. Gene models
a) Add a gene
b) Make a gene a pseudogene or “obsolete” (= delete it)
2. Functional annotations
c) Product names
d) EC numbers
e) Gene symbols

If you believe something else needs to be changed (genome name, taxonomy, etc.), please use the IMG Questions/Comments link.

What can’t be changed: automated assignments to protein families (Pfam, COGs, TIGRfam, InterPro, SEED assignments, KO assignments)

Center point for curation – Gene Cart

• Product Name is free text (but see GenBank requirements http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html)

• Prot Description is free text (goes to “note” in GenBank submission)

• EC number and PUBMED ID – see explanation

• Notes are free text (they go to “note” in GenBank submission)

• Gene symbol is “gene name” – 4 letter abbreviation; goes to “gene” in GenBank submission
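The field-to-qualifier mapping described above can be sketched in a few lines of Python. This is only an illustration: the record layout and the function name are hypothetical, not part of IMG. The "note" and "gene" targets come from the slide; "product" and "EC_number" are the standard GenBank qualifier names, assumed here for the remaining two fields. The example values are illustrative.

```python
# Sketch: map IMG-ER curation fields to GenBank feature qualifiers.
# The dict layout and function name are hypothetical. "note" and "gene"
# follow the slide; "product" and "EC_number" are assumed targets.

def to_genbank_qualifiers(curation):
    """Return a dict of GenBank qualifier name -> list of values."""
    qualifiers = {}

    def add(key, value):
        value = (value or "").strip()
        if value:  # skip empty or missing fields
            qualifiers.setdefault(key, []).append(value)

    add("product", curation.get("product_name"))
    add("gene", curation.get("gene_symbol"))
    add("EC_number", curation.get("ec_number"))
    # Both free-text fields end up in "note".
    add("note", curation.get("prot_description"))
    add("note", curation.get("notes"))
    return qualifiers


example = {
    "product_name": "DNA polymerase III subunit beta",
    "gene_symbol": "dnaN",
    "ec_number": "2.7.7.7",
    "prot_description": "sliding clamp",
    "notes": "start adjusted manually",
}
print(to_genbank_qualifiers(example))
```

Because two curation fields share one qualifier, the sketch stores a list per qualifier rather than a single string.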

How to find the genes that need curation?

Two possible scenarios:

• You have submitted a genome to IMG-ER and want to have the best annotations possible for it (e.g. for GenBank submission)

• You’re an expert and know everything about a certain protein family (families) = “community service”

Curation of genome annotations

• Find genome (Find Genomes: Genome Browser, Genome Search)

• Genome Statistics, Compare Gene Annotations

• Refine gene set:
– “Hypothetical protein”, but with some evidence
– Non-hypothetical protein, but no evidence
– w/o enzymes but with candidate KO based enzymes

• Review Gene Pages: Protein families, Homologs/orthologs, Gene Neighborhoods

• Add to Gene Cart
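Two of the refinement categories on this slide (“Hypothetical protein” with some evidence, and non-hypothetical protein with no evidence) amount to a simple filter. The sketch below assumes hypothetical gene records held as Python dicts, and treats any protein family hit (Pfam, COG, TIGRfam, KO) as "evidence". Both are assumptions made for illustration, not the way IMG stores or defines these.

```python
# Sketch: flag genes whose product name and evidence disagree.
# Gene records are hypothetical dicts; "evidence" means any protein
# family hit, which is an assumption made for this example.

EVIDENCE_KEYS = ("pfam", "cog", "tigrfam", "ko")


def has_evidence(gene):
    return any(gene.get(key) for key in EVIDENCE_KEYS)


def is_hypothetical(gene):
    return "hypothetical protein" in gene.get("product_name", "").lower()


def refine_gene_set(genes):
    """Split genes into the two review categories named on the slide."""
    hypothetical_with_evidence = []
    named_without_evidence = []
    for gene in genes:
        if is_hypothetical(gene) and has_evidence(gene):
            hypothetical_with_evidence.append(gene["gene_id"])
        elif not is_hypothetical(gene) and not has_evidence(gene):
            named_without_evidence.append(gene["gene_id"])
    return hypothetical_with_evidence, named_without_evidence


genes = [
    {"gene_id": "g1", "product_name": "hypothetical protein", "pfam": ["pfam00005"]},
    {"gene_id": "g2", "product_name": "ABC transporter ATP-binding protein"},
    {"gene_id": "g3", "product_name": "hypothetical protein"},
    {"gene_id": "g4", "product_name": "DNA gyrase subunit A", "cog": ["COG0188"]},
]
print(refine_gene_set(genes))  # (['g1'], ['g2'])
```

Genes g3 and g4 are left alone because their names and evidence agree; only the mismatches are worth adding to the Gene Cart for review.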

Why do you want to review annotations?

• Most IMG pipelines are optimized for specificity, so they are more likely to have false negatives, but generate few false positives

• Compare Annotations
– Product name is a consensus of multiple assignments: BLASTp, TIGRfam, COG, Pfam
– Sources of false negatives are cutoffs: TIGRfam trusted cutoffs are quite stringent; COG doesn’t have trusted cutoffs; BLASTp cutoff of 50% identity

• Candidate genes with KO annotations: sources of false negatives
– Cutoffs for % identity and alignment length
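To make the effect of cutoffs concrete, here is a minimal Python sketch of a hit filter. The 50% identity value is the BLASTp cutoff quoted above; the 70% alignment coverage value, the record layout, and the function names are hypothetical and chosen only for illustration.

```python
# Sketch: strict cutoffs keep false positives low but reject borderline
# true hits (false negatives). 50% identity is the value from the slide;
# the coverage cutoff and the hit records are assumptions.

MIN_IDENTITY = 50.0   # percent identity (from the slide)
MIN_COVERAGE = 0.70   # aligned fraction of the query (assumed)


def passes_cutoffs(hit):
    coverage = hit["align_len"] / hit["query_len"]
    return hit["pct_identity"] >= MIN_IDENTITY and coverage >= MIN_COVERAGE


def candidates_for_review(hits):
    """Hits rejected by the cutoffs: possible false negatives to check by hand."""
    return [hit["gene_id"] for hit in hits if not passes_cutoffs(hit)]


hits = [
    {"gene_id": "g1", "pct_identity": 62.0, "align_len": 290, "query_len": 300},
    {"gene_id": "g2", "pct_identity": 47.5, "align_len": 295, "query_len": 300},
    {"gene_id": "g3", "pct_identity": 80.0, "align_len": 120, "query_len": 300},
]
print(candidates_for_review(hits))  # ['g2', 'g3']
```

Here g2 misses the identity cutoff by a small margin and g3 aligns over only part of the query; both are the kind of gene a curator would want to look at manually.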

Curation of annotation in one genome (or a set of genomes)

a) Your favorite genes (experimental verification, etc.) -> use Find Genes, Gene Search or BLAST

b) “Compare Annotations” on Organism Details page

c) “Candidate genes with KO annotations” on Organism Details page

d) PhyloProfiler

A shortcut for product name/EC number assignments based on KO

Example of a missed gene

• Run PhyloProfiler with Deinococcus geothermalis as the query and Deinococcus hopiensis as the target (genes with no homologs in the target)

• Select Dgeo_0119 as a sequence to check whether a homolog of this gene was missed in Deinococcus hopiensis

Adding missed genes - contd

• Use graphical viewer to check the translation

• Adjust the start if other start codons with better RBS exist upstream
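The start-adjustment step can be sketched outside the graphical viewer as well. The Python below walks upstream from the current start in steps of one codon, stops at the first in-frame stop codon, and scores each alternative start by how much of a Shine-Dalgarno-like motif appears shortly upstream of it. Everything here is an assumption made for illustration: the AGGAGG motif, the 5 to 20 nt window, the scoring rule, the forward-strand-only handling, and the toy sequence. It is not how IMG evaluates RBS sites.

```python
# Sketch: list in-frame start codons upstream of the annotated start and
# score each for a Shine-Dalgarno-like motif. Forward strand only; motif,
# window and scoring are hypothetical choices.

START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
SD_MOTIF = "AGGAGG"  # assumed consensus


def rbs_score(seq, start):
    """Length of the longest SD_MOTIF substring (>= 3 nt) found in the
    window from 20 to 5 nt upstream of start; 0 if none."""
    window = seq[max(0, start - 20):max(0, start - 4)]
    for size in range(len(SD_MOTIF), 2, -1):
        for offset in range(len(SD_MOTIF) - size + 1):
            if SD_MOTIF[offset:offset + size] in window:
                return size
    return 0


def upstream_starts(seq, current_start):
    """In-frame start codons upstream of current_start (0-based index),
    scanning back until the first in-frame stop codon."""
    seq = seq.upper()
    candidates = []
    pos = current_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append((pos, codon, rbs_score(seq, pos)))
        pos -= 3
    return candidates


# Toy sequence: stop, SD motif, spacer, upstream ATG, then the annotated GTG.
seq = "TAA" + "AGGAGG" + "ACAGCT" + "ATG" + "AAACCCGGG" + "GTG" + "AAATTTTAA"
annotated_start = 27  # index of the GTG
print(upstream_starts(seq, annotated_start))  # [(15, 'ATG', 6)]
print(rbs_score(seq, annotated_start))        # 3
```

In this toy case the upstream ATG has a full motif match while the annotated GTG has only a partial one, which is the situation where the slide suggests moving the start.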

Reviewing your annotations

• Organism Details page -> Genome Statistics

• MyIMG

IMG curation exercises

Go to the link in the usual place: http://genomebiology.jgi-psf.org/Content/MGM-10.Sep2011/agenda.html

The first 2 pages are questions without answers; the rest is a cheat sheet