MGM workshop. 19 Oct 2010

Functional annotation: Datasources

Konstantinos Mavrommatis

Kmavrommatis@lbl.gov

Let’s get started…

Tools and information from databases are used to predict the function of a protein (functional annotation):

Product name
Enzyme catalog number
Domain architecture

But what is function?

cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)

molecular/enzymatic (methyltransferase)

Reaction (methylation)

Substrate (cobalt-precorrin-4)

Ligand (S-adenosyl-L-methionine)

metabolic (cobalamin biosynthesis)

physiological (maintenance of healthy nerve and red blood cells, through B12).
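The levels of function listed for CbiF can be kept apart in a small record, so that an annotation states which level it is describing. This is only an illustrative sketch: the class name `FunctionDescription` and its field names are assumptions made for this example, not part of any annotation system.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FunctionDescription:
    """Hypothetical record holding one protein's function at several levels."""
    product_name: str
    molecular: str
    reaction: str
    substrate: str
    ligand: str
    metabolic: str
    physiological: str


# Values taken from the CbiF example on the slide.
cbif = FunctionDescription(
    product_name="cobalt-precorrin-4 methyltransferase (CbiF)",
    molecular="methyltransferase",
    reaction="methylation",
    substrate="cobalt-precorrin-4",
    ligand="S-adenosyl-L-methionine",
    metabolic="cobalamin biosynthesis",
    physiological="maintenance of healthy nerve and red blood cells, through B12",
)

# Print every level next to its value.
for level, value in asdict(cbif).items():
    print(f"{level}: {value}")
```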

Functional annotation

Predict the biochemistry and physiology of an organism based on its genome sequence

Explain known biochemical and physiological properties

Function prediction

Function transfer by homology

Homology implies a common evolutionary origin, not the retention of similarity in any of their properties.

Homology ≠ similarity of function.

Punta & Ofran. PLOS Comp Biol. 2008

Trust transfer of annotation?

Punta & Ofran. PLOS Comp Biol. 2008

Dos and Don’ts

Type                        Don’t            Do
Homology                    Same function    Probability for same function
Orthology                   Same function    Probability for same function
Paralogy                    Same function    Probability for same function
Sequence similarity         Same function    Probability for same function
High sequence similarity    Same function    Probability for same function
Same sequence               Same function    Probability for same function

What if nothing is similar?

Subcellular localization
Gene context
Special features
Prediction of binding residues (DISIS, bindN)

[Figure: cytoplasm and periplasm, with S~S labels]

Annotation should make sense

[Figure: a model pathway (Substrate A → Substrate B → Substrate C → Substrate D, catalyzed by Enzyme 1, Enzyme 2 and Enzyme 3) compared with the genome annotation, in which some of the enzymes are found (✓) and others are in question (?).]
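One way to check that an annotation "makes sense" is to compare the enzymes a model pathway needs with the enzymes actually annotated in the genome. The sketch below is a minimal illustration: the function name `pathway_gaps`, the order of the enzymes in the pathway, and the choice of Enzyme 2 as the missing one are all assumptions made for the example.

```python
def pathway_gaps(pathway_steps, annotated_enzymes):
    """Return the pathway steps whose enzyme is not annotated in the genome.

    pathway_steps: list of (substrate, enzyme, product) tuples.
    annotated_enzymes: set of enzyme names found by the genome annotation.
    """
    return [step for step in pathway_steps if step[1] not in annotated_enzymes]


# Model pathway with placeholder names (assumed order).
model_pathway = [
    ("Substrate A", "Enzyme 1", "Substrate B"),
    ("Substrate B", "Enzyme 2", "Substrate C"),
    ("Substrate C", "Enzyme 3", "Substrate D"),
]

# Assumed annotation result: Enzyme 2 was not found.
genome_annotation = {"Enzyme 1", "Enzyme 3"}

missing = pathway_gaps(model_pathway, genome_annotation)
for substrate, enzyme, product in missing:
    print(f"No gene annotated for {enzyme} ({substrate} -> {product})")
```

A gap flanked by annotated steps is a hint to look again for the missing enzyme, or to question the neighboring annotations.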

Annotation should make sense

Databases

Databases used for the analysis of biological molecules.

Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.

Why bother?
Store information.
Organize data.
Predict features (genes, functions ...).
Predict the functional role of a feature (annotation).
Understand relationships (metabolic reconstruction).

Primary nucleotide databases

EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)

Archive containing all sequences from:
genome projects
sequencing centers
individual scientists
patent offices

The sequences are exchanged between the three centers on a daily basis.

Database is doubling every 10 months.

Sequences from >140,000 different species.

1400 new species added every month.

Year    Base pairs        Sequences
2004    44,575,745,176    40,604,319
2005    56,037,734,462    52,016,762
2006    69,019,290,705    64,893,747
2007    83,874,179,730    80,388,382
2008    99,116,431,942    98,868,465
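A doubling time can be estimated from any two rows of the table, assuming exponential growth between them. The helper name `doubling_time_years` is introduced here for illustration. Applied to the 2004 and 2008 base-pair counts it gives roughly three and a half years, so the 10-month figure presumably refers to a different period or a broader set of sequences than this table covers.

```python
import math

# Base-pair counts from the 2004 and 2008 rows of the table.
base_pairs = {2004: 44_575_745_176, 2008: 99_116_431_942}


def doubling_time_years(n_start, n_end, years):
    """Doubling time under exponential growth: years * ln(2) / ln(n_end / n_start)."""
    return years * math.log(2) / math.log(n_end / n_start)


t = doubling_time_years(base_pairs[2004], base_pairs[2008], 2008 - 2004)
print(f"Estimated doubling time: {t:.1f} years ({t * 12:.0f} months)")
```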

Primary protein sequence databases

Contain coding sequences derived from the translation of nucleotide sequences.

GenBank: valid translations (CDS) from nt GenBank entries.

UniProtKB/TrEMBL (1996): automatic CDS translations from EMBL. TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.

Errors in databases

There are a lot of errors in the primary sequence databases.

In the sequences themselves:
Sequencing errors.
Cloning vector sequences.

In the annotations, the free submission of entries results in:
Inaccuracies, omissions, and even mistakes.
Inconsistencies between some fields.

Redundancy

Redundancy is a major problem.

Entries are partially or entirely duplicated:

e.g. 20% of vertebrate sequences in GenBank.

[Figure: partial and complete sequence duplications]
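The two kinds of duplication can be illustrated with a very simple filter that drops exact copies and any sequence fully contained in a longer one. This is a toy sketch with made-up sequences and a hypothetical function name `remove_redundant`; it uses exact substring matching only, whereas real redundancy reduction usually clusters sequences at a similarity threshold.

```python
def remove_redundant(sequences):
    """Drop exact duplicates and sequences contained in a longer kept sequence."""
    kept = []
    # Visit longest sequences first so a container is kept before its fragments.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Toy entries: one complete duplicate and one partial duplicate (fragment).
entries = ["ATTGACTA", "TTGACA", "ATTGACTA", "TGACTA", "TATAGCCG", "TATAGCCG"]
non_redundant = remove_redundant(entries)
print(non_redundant)
```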

NCBI Derivative Sequence Data

[Figure: redundant sequence fragments deposited in GenBank and the derivative data built from them. Labels: Labs, GenBank, Curators, Algorithms, UniGene, RefSeq, Genome Assembly.]

RefSeq

Curated transcripts and proteins, reviewed by NCBI staff.

Model transcripts and proteins, generated by computer algorithms.

Assembled genomic regions (contigs).

Chromosome records.

Secondary protein databases

Uniprot/SWISS-PROT (1986) (http://ca.expasy.org/spro): a curated protein sequence database with

a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.)

a minimal level of redundancy

a high level of integration with other databases

Classification databases

Groups (families/clusters) of proteins based on:
Overall sequence similarity.
Local sequence similarity.
Presence / absence of specific features (active site, signal peptides…).
Structural similarity.
...

These groups contain proteins with similar properties:
Specific function, enzymatic activity.
General function.
Evolutionary relationship.
…

Overall sequence similarity

Clusters of orthologous groups (COGs)

COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages. Each cluster has representatives of at least 3 lineages.

A function (specific or broad) has been assigned to each COG.

http://www.ncbi.nlm.nih.gov/COG/

How it works

Reciprocal best hit: bidirectional best hit

BLAST best hit: unidirectional best hit

[Figure: best-hit relationships between proteins, grouped into COG1 and COG2]
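The difference between a unidirectional and a bidirectional best hit can be sketched on a toy hit table. The function names, protein identifiers and scores below are made up for illustration, and the sketch covers only the pairwise step for two genomes; it does not build clusters spanning at least 3 lineages.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy search results between genome A and genome B (made-up scores).
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b1", 120.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a1", 88.0)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a2's best hit is b1, but b1's best hit is a1, so a2 to b1 is only a unidirectional best hit and is not reported.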

Profiles & Pfam

Proteins can also be classified into groups by exploiting similarities between regions (domains/profiles), which contain valuable information.

These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.
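To see why a profile can pick up relationships that rest on a few conserved residues, here is a minimal position-specific scoring matrix built from a toy alignment. It is deliberately simpler than a profile HMM (no insertions or deletions, uniform background), and the alignment, function names and pseudocount are assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the column scores of a segment with the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Made-up ungapped alignment of a short conserved region.
alignment = ["GDSGG", "GDSGA", "GDSGS", "GESGG"]
profile = build_profile(alignment)

print(score(profile, "GDSGG") > score(profile, "AKLVW"))
```

Conserved columns contribute large positive scores, so a sequence that matches only those positions can still score above an unrelated one.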

Region similarity

Pfam

HMMs of protein alignments: local (for domains) or global (covering the whole protein).

http://pfam.sanger.ac.uk

TIGRfam

Full-length alignments.

Domain alignments.

Equivalogs: families of proteins with a specific function.

Superfamilies: families of homologous genes.

HMMs

http://www.tigr.org/TIGRFAMs/

KEGG orthology

Composite pattern databases

To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro.

Release 30.0 (Dec10) contains 21178 entries.

Central annotation resource, with pointers to its satellite dbs.

http://www.ebi.ac.uk/interpro/


* It is up to the user to decide if the annotation is correct *

ENZYME

http://ca.expasy.org/enzyme/

KEGG

Contains information about biochemical pathways and protein interactions.

http://www.kegg.com

Functional annotation

[Figure: functional annotation flowchart in IMG. Each gene is compared against COGs, KO terms, Pfam/TIGRfam and IMG. Search methods shown: PSI BLAST (1e-2); BLASTp (<1e-10, >45% id, >70% length); Hmmsearch (BLAST preprocessing); BLASTp (evalue <10, 20 best hits). The product name (based on translation tables) is chosen through a series of YES/NO checks on IMG term, TIGRfam, COG, COG + pfam, Pfam and BLAST; when every check answers NO the gene is named "hypothetical".]

http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
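The BLASTp cutoffs on the slide (<1e-10, >45% id, >70% length) can be applied to a hit table as a simple filter. This is a sketch rather than the IMG implementation: the `Hit` record, the subject names, and the reading of ">70% length" as alignment length divided by query length are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLASTp hit."""
    subject: str
    evalue: float
    percent_identity: float
    alignment_length: int
    query_length: int


def passes_thresholds(hit, max_evalue=1e-10, min_identity=45.0, min_coverage=0.70):
    """Apply the cutoffs from the slide; coverage is measured on the query (assumed)."""
    coverage = hit.alignment_length / hit.query_length
    return (
        hit.evalue < max_evalue
        and hit.percent_identity > min_identity
        and coverage > min_coverage
    )


# Made-up hits for a 300-residue query.
hits = [
    Hit("subject_1", 1e-50, 62.0, 280, 300),  # passes all three cutoffs
    Hit("subject_2", 1e-12, 40.0, 290, 300),  # identity too low
    Hit("subject_3", 1e-30, 80.0, 150, 300),  # aligned region too short
]

accepted = [hit.subject for hit in hits if passes_thresholds(hit)]
print(accepted)
```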

Sequencing projects

GOLD

Information for ongoing and finished (meta)genomic projects.

Information about the metadata of genomes and metagenomic samples.

http://www.genomesonline.org

Literature search

PubMed

http://www.ncbi.nlm.nih.gov/Pubmed

Specialized databases

There is a large number of databases devoted to specific organisms.

For some model organisms there are often several concurrent systems.

These databases are associated with sequencing or mapping projects.

Other specialized databases

Signal transduction, regulation, protein-protein interactions:
    TRANSFAC (Transcription Factor database)
    BRITE (Biomolecular Relations in Information Transmission and Expression database)
    DIP (Database of Interacting Proteins)
    BIND (Biomolecular Interaction Network database)
    BioCarta

Biochemical pathways:
    KLOTHO (Biochemical Compounds Declarative database)
    BRENDA (enzyme information system)
    LIGAND (similar to Enzyme but with more information for substrates)

Gene order and co-occurrence:
    STRING

3D structures:
    PDB (Protein Data Bank)
    MMDB (Molecular Modelling Data Base)
    NRL_3D (Non-Redundant Library of 3D Structures)
    SCOP (Structural Classification of Proteins)

Polymorphism:
    ALFRED (Allele Frequency Database)

Molecular interactions:
    DIP (Database of Interacting Proteins)
    BIND (Biomolecular Interaction Network Database)

Gene expression:
    GXD (Mouse Gene Expression Database)
    The Stanford Microarray Database

Mapping:
    GDB (Genome Data Base)
    EMG (Encyclopedia of Mouse Genome)
    MGD (Mouse Genome Database)
    INE (Integrated Rice Genome Explorer)

Protein quantification:
    SWISS-2DPAGE
    PDD (Protein Disease Database)
    Sub2D (B. subtilis 2D Protein Index)

List of databases

http://www.oxfordjournals.org/nar/database/c

Databanks interconnection

[Figure: network of cross-references between databanks: SWISS-PROT, ENZYME, PDB, HSSP, SWISSNEW, YPDREF, YPD, PDBFINDER, ALI, DSSP, FSSP, NRL_3D, PMD, PIR, ProtFam, FlyGene, TFSITE, TFACTOR, EMBL, TrEMBL, ECDC, TrEMBLNEW, EMNEW, EPD, GenBank, MOLPROBE, OMIM, MIMMAP, REBASE, PROSITE, ProDom, PROSITEDOC, Blocks, SWISSDOM.]

Not all databases are updated regularly. Changes of annotation in one database are not reflected in others.

Summary

We have main archives (GenBank), curated databases (RefSeq, SwissProt), protein classification databases (COG, Pfam), and many, many more…

They help predict the function, or the network of functions.

Systems are required that integrate the information from several databases and allow the data to be visualized and handled in an intuitive way.


Thank you for your attention.