Advancing Science with DNA Sequence Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Page 1: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Advancing Science with DNA Sequence

Data Curation in IMG-ER

Natalia Ivanova

MGM Workshop

September 28, 2011

Page 2: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Tricky question

• What do you need to do data curation in IMG?
  a) I-phone
  b) PhD in Computer Science
  c) supernatural powers

• Correct answer: you need an IMG account: http://img.jgi.doe.gov/er

Page 3: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

What can be curated in IMG-ER?

1. Gene models
   a) Add a gene
   b) Make a gene a pseudogene or “obsolete” (= delete it)
2. Functional annotations
   c) Product names
   d) EC numbers
   e) Gene symbols

If you believe something else needs to be changed (genome name, taxonomy, etc.), please use the IMG Questions/Comments link.

What can’t be changed: automated assignments to protein families (Pfam, COGs, TIGRfam, InterPro, SEED assignments, KO assignments)

Page 4: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Center point for curation – Gene Cart

Page 5: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

• Product Name is free text (but see GenBank requirements http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html)

• Prot Description is free text (goes to “note” in GenBank submission)

• EC number and PUBMED ID – see explanation

• Notes are free text (they go to “note” in GenBank submission)

• Gene symbol is the “gene name”, a 4-letter abbreviation; it goes to “gene” in GenBank submission
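Before typing values into these fields, a quick format check can catch obvious slips. The sketch below is only an illustration: the function name `check_annotation` is hypothetical, and the patterns are assumptions (a four-field EC number with "-" allowed for unassigned fields, and a gene symbol of three lowercase letters plus one uppercase letter, as in recA). They are not rules taken from IMG or GenBank documentation.

```python
import re

# Assumption: EC numbers have four dot-separated fields; fields after the
# first may be "-" when unassigned (e.g. 1.1.1.-). This is a simplification.
EC_PATTERN = re.compile(r"^\d+(\.(\d+|-)){3}$")

# Assumption: a 4-letter gene symbol looks like recA (3 lowercase + 1 uppercase).
SYMBOL_PATTERN = re.compile(r"^[a-z]{3}[A-Z]$")


def check_annotation(product_name, ec_number=None, gene_symbol=None):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if not product_name.strip():
        problems.append("empty product name")
    if ec_number and not EC_PATTERN.match(ec_number):
        problems.append(f"malformed EC number: {ec_number}")
    if gene_symbol and not SYMBOL_PATTERN.match(gene_symbol):
        problems.append(f"unexpected gene symbol: {gene_symbol}")
    return problems


if __name__ == "__main__":
    print(check_annotation("alcohol dehydrogenase", "1.1.1.1", "adhA"))
    print(check_annotation("alcohol dehydrogenase", "EC 1.1", "ADHA"))
```

A check like this only flags formatting; whether the product name meets GenBank requirements still has to be judged against the page linked above.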

Page 6: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

How to find the genes that need curation?

Two possible scenarios:

• You have submitted a genome to IMG-ER and want to have the best annotations possible for it (e.g. for GenBank submission)

• You’re an expert and know everything about a certain protein family (families) = “community service”

Page 7: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Curation of genome annotations

[Workflow diagram; labels recovered from the slide]

• find genome (Find Genomes: Genome Browser, Genome Search)

• Compare Gene Annotations: “Hypothetical protein”, but with some evidence; non-hypothetical protein, but no evidence

• Genome Statistics: w/o enzymes but with candidate KO based enzymes

• refine gene set

• review Gene Pages: Protein families, Homologs/orthologs, Gene Neighborhoods

• add to Gene Cart
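The two review categories named on this slide ("hypothetical protein" with some evidence, and a non-hypothetical name with no evidence) can be expressed as a small rule. This is a minimal sketch on made-up records, assuming that "evidence" means a list of protein family hits for the gene; the function name `review_category` and the example values are hypothetical and do not reflect how IMG computes its Compare Gene Annotations table.

```python
def review_category(product_name, evidence):
    """Classify a gene into one of the two review categories, or None.

    product_name: current product name of the gene
    evidence: list of supporting hits (assumed here to be family identifiers)
    """
    is_hypothetical = "hypothetical protein" in product_name.lower()
    if is_hypothetical and evidence:
        return "hypothetical, but with some evidence"
    if not is_hypothetical and not evidence:
        return "non-hypothetical, but no evidence"
    return None


if __name__ == "__main__":
    # Made-up example genes: (locus tag, product name, evidence)
    genes = [
        ("gene_0001", "hypothetical protein", ["pfam00001"]),
        ("gene_0002", "ABC transporter permease", []),
        ("gene_0003", "hypothetical protein", []),
    ]
    for locus_tag, name, evidence in genes:
        category = review_category(name, evidence)
        if category:
            print(locus_tag, "->", category)
```

Genes that fall into either category are the ones worth opening on their Gene Pages before adding them to the Gene Cart.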

Page 8: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Why do you want to review annotations?

• Most IMG pipelines are optimized for specificity, so they are more likely to have false negatives, but generate few false positives

• Compare Annotations
  – Product name is a consensus of multiple assignments: BLASTp, TIGRfam, COG, Pfam
  – Sources of false negatives (cutoffs): TIGRfam trusted cutoffs are quite stringent; COG doesn’t have trusted cutoffs; BLASTp cutoff of 50% identity

• Candidate genes with KO annotations: sources of false negatives
  – Cutoffs for % identity and alignment length
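To see why stringent cutoffs produce false negatives, consider a toy filter on made-up hits. The 50% identity threshold comes from the slide; the alignment-fraction threshold of 0.7, the function name `passes_cutoffs`, and all hit values are assumptions chosen purely for illustration.

```python
def passes_cutoffs(hit, min_identity=50.0, min_aligned_fraction=0.7):
    """Return True if a hit clears both the identity and alignment-length cutoffs.

    min_identity: percent identity cutoff (50% as on the slide)
    min_aligned_fraction: assumed cutoff on the aligned fraction of the gene
    """
    return (hit["identity"] >= min_identity
            and hit["aligned_fraction"] >= min_aligned_fraction)


# Made-up hits; geneB and geneC could be real homologs that the cutoffs reject.
hits = [
    {"gene": "geneA", "identity": 62.0, "aligned_fraction": 0.95},
    {"gene": "geneB", "identity": 45.0, "aligned_fraction": 0.90},
    {"gene": "geneC", "identity": 71.0, "aligned_fraction": 0.40},
]

accepted = [h["gene"] for h in hits if passes_cutoffs(h)]
needs_review = [h["gene"] for h in hits if not passes_cutoffs(h)]

if __name__ == "__main__":
    print("accepted:", accepted)
    print("needs manual review:", needs_review)
```

Hits that fall just below a cutoff are not necessarily wrong; they are the candidates a curator should inspect by hand.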

Page 9: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Curation of annotation in one genome (or a set of genomes)

a) Your favorite genes (experimental verification, etc.) -> use Find Genes, Gene Search or BLAST

b) “Compare Annotations” on Organism Details page

c) “Candidate genes with KO annotations” on Organism Details page

d) PhyloProfiler

Page 10: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

A shortcut for product name/EC number assignments based on KO

Page 11: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Example of a missed gene

• Run PhyloProfiler with Deinococcus geothermalis as the query and Deinococcus hopiensis as the target, set to “with no homologs in”

• Select Dgeo_0119 as a sequence to check whether a homolog of this gene was missed in Deinococcus hopiensis

Page 12: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Adding missed genes (continued)

• Use graphical viewer to check the translation

• Adjust the start if other start codons with better RBS exist upstream
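The start-adjustment step can be sketched in code. The example below is a rough illustration only, not what the IMG graphical viewer does: it assumes a forward-strand gene with 0-based coordinates, the common bacterial start codons ATG/GTG/TTG, and a crude RBS test that looks for a Shine-Dalgarno-like motif in a fixed window upstream of the start. The function names, motif list, window size, and example sequence are all hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: a few Shine-Dalgarno-like motifs stand in for "a good RBS".
SD_MOTIFS = ("AGGAGG", "GGAGG", "AGGAG")


def upstream_starts(seq, current_start):
    """List in-frame start codon positions upstream of current_start.

    Walks back codon by codon and stops at the first in-frame stop codon.
    Positions are 0-based on the forward strand, nearest candidate first.
    """
    candidates = []
    pos = current_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return candidates


def has_rbs(seq, start):
    """Crude check: is an SD-like motif found between 20 and 4 nt upstream of start?"""
    region = seq[max(0, start - 20):max(0, start - 4)]
    return any(motif in region for motif in SD_MOTIFS)


# Made-up sequence: in-frame stop, SD-like motif, upstream ATG at 18, called start at 30.
example_seq = "CCCTAAAGGAGGACACCCATGAAACCCGGGATGGCTTAA"

if __name__ == "__main__":
    for pos in upstream_starts(example_seq, 30):
        print("candidate start at", pos, "RBS-like motif upstream:", has_rbs(example_seq, pos))
```

In this made-up sequence the upstream ATG has a motif in front of it while the called start does not, which is the situation in which extending the gene model would be worth considering. The translation should still be checked in the graphical viewer.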

Page 13: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

Reviewing your annotations

• Organism Details page -> Genome Statistics

• MyIMG

Page 14: Data Curation in IMG-ER Natalia Ivanova MGM Workshop September 28, 2011

IMG curation exercises

Go to the link in the usual place: http://genomebiology.jgi-psf.org/Content/MGM-10.Sep2011/agenda.html

The first 2 pages are questions without answers; the rest is a cheat sheet