Upload
brosh
View
43
Download
0
Embed Size (px)
DESCRIPTION
Functional annotation Datasources Konstantinos Mavrommatis [email protected]. Let’s get started…. Tools & Information from databases are used to predict the function of a protein (functional annotation). Product name Enzyme catalog number Domain architecture. But what is function?. - PowerPoint PPT Presentation
Citation preview
MGM workshop. 19 Oct 2010
Functional annotationFunctional annotationDatasourcesDatasources
Konstantinos MavrommatisKonstantinos Mavrommatis
[email protected]@lbl.gov
MGM workshop. 19 Oct 2010
Let’s get started… Let’s get started…
Tools & Information from databases are used to predict the function of a protein (functional annotation).Product nameEnzyme catalog numberDomain architecture
MGM workshop. 19 Oct 2010
But what is function?But what is function?
cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
molecular/enzymatic (methyltransferase) Reaction (methylation)
Substrate (cobalt-precorrin-4)
Ligand (S-adenosyl-L-methionine)
metabolic (cobalamin biosynthesis)
physiological (maintenance of healthy nerve and red blood cells, through B12).
MGM workshop. 19 Oct 2010
Functional annotationFunctional annotation
Predict the biochemistry and physiology of an organism based on its genome sequence
Explain known biochemical and physiological properties
MGM workshop. 19 Oct 2010
Function predictionFunction prediction
Function transfer by homology
Homology implies a common
evolutionary origin.
not retention of similarity in any of their properties.
Homology ≠ similarity of function.
Punta & Ofran. PLOS Comp Biol. 2008
MGM workshop. 19 Oct 2010
Trust transfer of Trust transfer of annotation ?annotation ?
Punta & Ofran. PLOS Comp Biol. 2008
MGM workshop. 19 Oct 2010
Dos and Don’tsDos and Don’ts
Type Don’t Do
Homology Same function Probability for same function
Orthology Same function Probability for same function
Paralogy Same function Probability for same function
Sequence similarity Same function Probability for same function
High sequence similarity Same function Probability for same function
Same sequence Same function Probability for same function
MGM workshop. 19 Oct 2010
What if nothing is similar ? What if nothing is similar ?
Subcellular localization
Gene contextSpecial featuresPrediction of
binding residues (DISIS, bindN)
Cytoplasm
S ~ S S ~ S
Periplasm
MGM workshop. 19 Oct 2010
Genome annotation
Model pathway
Annotation should make Annotation should make sensesense
SubstrateA
SubstrateB
SubstrateC
SubstrateDEnzyme 2Enzyme 1 Enzyme 3
Enzyme 2? ?Enzyme 1 Enzyme 3 ✓
MGM workshop. 19 Oct 2010
Annotation should make senseAnnotation should make sense
MGM workshop. 19 Oct 2010
DatabasesDatabases
Databases used for the analysis of biological molecules.
Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
Why bother?Store information.Organize data.Predict features (genes, functions ...).Predict the functional role of a feature (annotation).Understand relationships (metabolic reconstruction).
MGM workshop. 19 Oct 2010
Primary nucleotide Primary nucleotide databasesdatabases
EMBL/GenBank/DDBJ EMBL/GenBank/DDBJ ((http://
www.ncbi.nlm.nih.gov/,http://www.ebi.ac.uk/embl,http://www.ebi.ac.uk/embl))
Archive containing all sequences from: genome projects sequencing centers individual scientists patent offices
The sequences are exchanged between the three centers on a daily basis.
Database is doubling every 10 months.
Sequences from >140,000 different species.
1400 new species added every month.
Year Base pairs Sequences2004 44,575,745,17640,604,3192005 56,037,734,46252,016,7622006 69,019,290,70564,893,7472007 83,874,179,73080,388,3822008 99,116,431,94298,868,465
MGM workshop. 19 Oct 2010
Primary protein sequence Primary protein sequence databasesdatabases
Contain coding sequences derived from the translation of nucleotide sequences GenBank
Valid translations (CDS) from nt GenBank entries.
UniProtKB/TrEMBL (1996) Automatic CDS translations from
EMBL. TrEMBL Release 40.3 (26-May-2009)
contains 7,916,844 entries.
MGM workshop. 19 Oct 2010
Errors in Errors in databasesdatabases
There are a lot of errors in the primary sequence databases: In the sequences themselves:
Sequencing errors.Cloning vectors sequences.
For the annotations, the free submission of entries results to: Inaccuracies, omissions, and even mistakes. Inconsistencies between some fields.
MGM workshop. 19 Oct 2010
RedundancyRedundancy
Redundancy is a major problem.
Entries are partially or entirely duplicated:
e.g. 20% of vertebrate sequences in GenBank.
{ {
{
Partial and completesequence duplications
MGM workshop. 19 Oct 2010
NCBI Derivative Sequence NCBI Derivative Sequence DataData
ATTGACTA
TTGACA
CGTGAAT
TGACTA
TATAGCCG
ACGTGC
ACGTGCACGTGC
TTGACA
TTGACA
TTGACA
CGTGA
CGTGA
CGTGA
ATTGACTA
ATTGACTAATTGACTA
ATTGACTA
TATAGCCG
TATAGCCG
TATAGCCG
TATAGCCG
GenBank
TATAGCCG TATAGCCGTATAGCCGTATAGCCG
ATGA
CATT
GAGA
ATT
ATTCC GAGA
ATTCCGAGA
ATTC GAGA
ATTC
GAGA
ATTCC GAGA
ATTCC
UniGene
RefSeq
GenomeAssembly
Labs
Curators
Algorithms
TATAGCCGAGCTCCGATACCGATGACAA
MGM workshop. 19 Oct 2010
RefSeqRefSeq
Curated transcripts and proteins. reviewed by NCBI staff.
Model transcripts and proteins. generated by computer algorithms.
Assembled Genomic Regions (contigs).Chromosome records.
MGM workshop. 19 Oct 2010
Secondary protein Secondary protein databasesdatabases
Uniprot/SWISS-PROT (1986) (http://ca.expasy.org/spro) a curated protein sequence database
high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.)
a minimal level of redundancy
high level of integration with other databases
MGM workshop. 19 Oct 2010
Classification databasesClassification databases
Groups (families/clusters) of proteins based on… Overall sequence similarity. Local sequence similarity. Presence / absence of specific features (active site, signal
peptides… ). Structural similarity. ...
These groups contain proteins with similar properties. Specific function, enzymatic activity. General function. Evolutionary relationship. …
MGM workshop. 19 Oct 2010
Overall sequence Overall sequence similaritysimilarity
MGM workshop. 19 Oct 2010
Clusters of orthologous groups Clusters of orthologous groups (COGs)(COGs)
COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages. Each Cluster has representatives of at least 3 lineages
A function (specific or broad) has been assigned to each COG.
http://www.ncbi.nlm.nih.gov/COG/
MGM workshop. 19 Oct 2010
How it worksHow it works
Reciprocal best hitBidirectional best hit
Blast best hitUnidirectional best hit
COG1COG2
MGM workshop. 19 Oct 2010
Profiles & PfamProfiles & Pfam
A method for classifying proteins into groups exploits region similarities, which contain valuable information (domains/profiles).
These domains/profiles can be used to detect distant relationships, where only few residues are conserved.
MGM workshop. 19 Oct 2010
Regions similarityRegions similarity
MGM workshop. 19 Oct 2010
http://pfam.sanger.ac.uk
HMMs of protein alignments(local) for domains, or global (cover whole protein)
PfamPfam
MGM workshop. 19 Oct 2010
TIGRfamTIGRfam
Full length alignments. Domain alignments. Equivalogs: families of
proteins with specific function.
Superfamilies: families of homologous genes.
HMMs
http://www.tigr.org/TIGRFAMs/
MGM workshop. 19 Oct 2010
KEGG orthologyKEGG orthology
MGM workshop. 19 Oct 2010
Composite pattern Composite pattern databasesdatabases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource – InterPro
Release 30.0 (Dec10) contains 21178 entries Central annotation resource, with pointers to its satellite dbs
http://www.ebi.ac.uk/interpro/
MGM workshop. 19 Oct 2010
* It is up to the user to decide if the annotation is correct *
MGM workshop. 19 Oct 2010
ENZYMEENZYME
MGM workshop. 19 Oct 2010
http://ca.expasy.org/enzyme/
ENZYMEENZYME
MGM workshop. 19 Oct 2010
KEGGKEGG
Contains information about biochemical pathways, and protein interactions.
http://www.kegg.com
MGM workshop. 19 Oct 2010
Functional annotationFunctional annotation
COGsCOGs
genegene
KO termsKO terms
PfamTIGRfam
PfamTIGRfam
IMGIMG
PSI BLAST1e-2
BLASTp<1e-10, >45% id,
>70% length
Hmmsearch(BLAST preprocessing)
BLASTpevalue<10, 20 best hits
IMG termIMG term
TIGRfamTIGRfam
COGCOG
COG + pfamCOG + pfam
PfamPfam
Product name (based on translation tables)
NO
NO
NO
NO
YES
YES
YES
YES
YES
hypothetical
NO
BLASTBLAST
NO
YES
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
MGM workshop. 19 Oct 2010
Sequencing projects Sequencing projects GOLD
Information for ongoing and finished (meta)genomic projects.
Information about the metadata of genomes and metagenomic samples.
http://www.genomesonline.org
MGM workshop. 19 Oct 2010
Literature searchLiterature search
PubMed
http://www.ncbi.nlm.nih.gov/Pubmed
MGM workshop. 19 Oct 2010
Specialized Specialized databasesdatabases
There is a large number of databases devoted to specific organisms.
For some model organisms there are often concurrent systems.
These databases are associated to sequencing or mapping projects.
MGM workshop. 19 Oct 2010
Other specialized databasesOther specialized databases
Signal transduction, regulation, protein-protein interactions TRANSFAC (Transcription Factor database) BRITE (Biomolecular Relations in Information Transmission and
Expression database) DIP (Database of Interacting Proteins) BIND (Biomolecular Interaction Network database) BioCarta
Biochemical pathways KLOTHO (Biochemical Compounds Declarative database) BRENDA (enzyme information system) LIGAND (similar to Enzyme but with more information for
substrates) Gene order and co-occurrence
STRING
3D structuresPDB (Protein Data Bank)MMDB (Molecular Modelling Data Base)NRL_3D (Non-Redundant Library of 3D Structures)SCOP (Structural Classification of Proteins)
PolymorphismALFRED (Allele Frequency Database)
Molecular interactionsDIP (Database of Interacting proteins)BIND (Biomolecular Interaction Network Database)
Gene expressionGXD (Mouse Gene Expression Database)The Stanford Microarray Database
MappingGDB (Genome Data Base)EMG (Encyclopedia of Mouse Genome)MGD (Mouse Genome Database)INE (Integrated Rice Genome Explorer)
Protein quantificationSWISS-2DPAGEPDD (Protein Disease Database)Sub2D (B. subtilis 2D Protein Index)
MGM workshop. 19 Oct 2010
List of databasesList of databases
http://www.oxfordjournals.org/nar/database/c
MGM workshop. 19 Oct 2010
Databanks Databanks interconnectioninterconnection
SWISS-PROT
ENZYME
PDB
HSSP
SWISSNEW
YPDREF
YPD
PDBFINDERALI
DSSP
FSSP
NRL_3D
PMD
PIR
ProtFam
FlyGene
TFSITE
TFACTOR
EMBL
TrEMBL
ECDC
TrEMBLNEW
EMNEW
EPD
GenBank MOLPROBE
OMIM
MIMMAP
REBASE
PROSITE ProDom
PROSITEDOCBlocks
SWISSDOM
Not all databases are updated regularly. Changes of annotation in one database are not reflected in others.
MGM workshop. 19 Oct 2010
Summary Summary
We have main archives (Genbank), and currated databases (Refseq, SwissProt), and protein classification database (COG, Pfam), and many, many more…
They help predict the function, or the network of functions.
Systems that integrate the information from several databases, visualize and allow handling of data in an intuitive way are required
MGM workshop. 19 Oct 2010
Thank you for your attention.