Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology

In-Silico Analysis of Proteins: Celebrating the 20th anniversary of Swiss-Prot
Fortaleza, Brazil, August 4, 2006

Cathy H. Wu, Ph.D.
Director, Protein Information Resource
Professor, Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center


Wu CH, Zhao S, Chen HL. (1996) A protein class database organized with PROSITE protein groups and PIR superfamilies. Journal of Computational Biology, 3(4), 547-562.


Protein Information Resource (PIR)

UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function

PIRSF Family Classification System: Protein Classification and Functional Annotation

iProClass Integrated Protein Database: Data Integration and Protein Mapping

iProLINK Literature Mining Resource: Annotation Extraction

Other Projects: NIAID Proteomics, caBIG Grid-Enablement

Integrated Protein Informatics Resource for Genomic/Proteomic Research

http://pir.georgetown.edu


PIR Protein Sequence Database

• The PIR-International Protein Sequence Database (PIR-PSD) grew out of the Atlas of Protein Sequence and Structure (1965-1978), Vol 1-5, Suppl 1-3.
• Margaret Dayhoff collected all the known protein sequences to study protein evolution.
• The first Atlas contained 65 proteins; the final volume had 1081 proteins.
• The PIR-PSD was produced from 1984 (Release 1, 2900 proteins) to 2004 (Release 80, 283,416 proteins).
• PIR-PSD has been integrated into UniProt since 2002.

[Figure: number of sequences (0 to 300,000) versus PIR-PSD release number; PIR joined UniProt in January 2002.]


UniProt Activities at PIR

Integration of PIR-PSD into UniProtKB
• Incorporation of unique PIR entries
• Incorporation of PIR annotations: references, experimental features with literature evidence tag

Functional annotation of UniProtKB proteins
• Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins
• Development of rule-based annotation system & PIRNR (name rule) / PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site feature)

Production of UniRef100/90/50 databases

Creation of UniProt web site and help system => Unified UniProt web site & user community interaction


PIRSF Classification System: Protein Classification and Functional Annotation

• PIRSF: Evolutionary relationships of proteins from super- to sub-families
• Curated families with name rules and site rules
• Curation platform with classification/visualization tools
• Dissemination: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform

[Figure: PIRSF classification levels with example entries]

Domain Superfamily
• One common Pfam domain
• Examples: PF02153: Prephenate dehydrogenase (PDH); PF01817: Chorismate mutase (CM); PF00219: Insulin-like growth factor binding protein (IGFBP); PF02735: Ku70/Ku80 beta-barrel domain

PIRSF Superfamily
• 0 or more levels
• One or more common domains
• Example: PIRSF800001: Ku70/80 autoantigen

PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
• PDH examples: PIRSF001499: Bifunctional CM/PDH (T-protein); PIRSF006786: PDH, feedback inhibition-insensitive; PIRSF005547: PDH, feedback inhibition-sensitive
• CM examples: PIRSF017318: CM of AroQ class, eukaryotic type; PIRSF001501: CM of AroQ class, prokaryotic type; PIRSF026640: Periplasmic CM; PIRSF001500: Bifunctional CM/PDT (P-protein); PIRSF001499: Bifunctional CM/PDH (T-protein)
• IGFBP examples: PIRSF001969: IGFBP; PIRSF018239: IGFBP-related protein, MAC25 type
• Ku examples: PIRSF006493: Ku, prokaryotic type; PIRSF003033: Ku70 autoantigen; PIRSF016570: Ku80 autoantigen

PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization
• Examples: PIRSF500001: IGFBP-1; PIRSF500006: IGFBP-6


iProClass Integrated Protein Database

Data Integration and Protein Mapping

• Data integration from >90 databases
• Underlying data warehouse for protein ID/name/bibliography mapping & pre-computed BLAST results
• Integration of protein family, function, structure for functional annotation
• Rich link (link + summary) for value-added reports of UniProt proteins

[Figure: iProClass Integrated Protein Knowledgebase and its source databases]
• Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR
• Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE
• Protein Expression: Swiss-2DPAGE, PMG
• Structure: PDB, SCOP, CATH, PDBSum, MMDB
• Family: PIRSF, InterPro, Pfam, Prosite, COG
• Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
• Interaction: DIP, BIND
• Modification: RESID, PhosphoBase
• Disease/Variation: OMIM, HapMap, …
• Ontology: GO
• Taxonomy: NCBI Taxon, NEWT
• Literature: PubMed

Report sections shown: NCBI X-Refs, Gene/Genome, Gene Ontology, KEGG Pathway, Structure Homolog, PTM, EC, Additional Refs


iProLINK Text Mining Resource

Annotation Extraction and Literature-Based Protein Annotation

• Curated datasets and literature corpus for development of literature mining and annotation extraction tools
• RLIMS-P text-mining tool for extracting protein phosphorylation data
• BioThesaurus of gene/protein names to resolve synonym and ambiguity

[Figure: iProLINK (integrated Protein Literature, INformation and Knowledge), http://pir.georgetown.edu/iprolink]
• Literature Mining & Protein Curation: NLP Text Mining Research; Literature-Based Curation
• Bibliography Mapping; Text Categorization; Annotation Extraction; Named Entity Recognition
• Databases: UniProt, PIRSF, iProClass, GO
• Bibliography: PubMed
• Literature Corpus: Mapping to Proteins/Features; Annotation-Tagged; Name-Tagged
• Guidelines: Protein/Family Naming Guidelines; Name Tagging Guidelines
• Dictionary and Ontology: Protein Names and Synonyms; PIRSF Family Names in DAG
• Bibliography Display: Mapping of PubMed IDs to Proteins; Papers Categorized by Annotations


NIAID Biodefense Proteomic Program Goals

• Characterize proteomes of pathogens and host cells
• Identify proteins associated with the biology of the microbes
• Elucidate mechanisms of microbial pathogenesis
• Understand immune responses and non-immune mediated host responses

[Figure labels: Adm Ctr, PRC, Data Type, Organism]


[Figure: NIAID proteomics data flow]
• Multiple Data Types from Proteomics Research Centers
• Data Integration at NIAID Admin Center
• Integrated Data at VBI
• Data Exchange Format, Controlled Vocabulary, Ontology
• Master Protein Directory & Complete Proteomes at GU-PIR
• iProClass, UniProt, PIRSF
• Protein ID, Peptide/Protein Sequence Mapping

Rich annotation - capture experimental data and scientific conclusion; integrate with major databases

http://pir.georgetown.edu/proteomics/


NCI caBIG Initiative

caBIG (cancer Biomedical Informatics Grid)
• Cancer research platform to enable sharing of research infrastructure, data, tools
• Designed and built by an open federation of organizations
• Based on common standards and open source/open access principles

PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research
• One of four caBIG grid reference projects

caBIG Workspaces
• Integrative Cancer Research
  • PIR Developer Project: Grid Enablement of PIR
  • PIR Adopter Project: SEED Genome Annotation
  • PIR Adopter Project: GeneConnect ID mapping
• Vocabularies and Common Data Elements
  • PIR Participant Project: Protein models, objects, vocabularies, ontologies

[Figure: caGrid Architecture]


UniProt Knowledgebase: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function

Family Classification-Driven and Rule-Based Curation
• Functional inference of uncharacterized hypothetical proteins
• Systematic detection and correction of genome annotation errors
• Improvement of under- or over-annotated proteins

Text Mining-Assisted and Literature-Based Curation
• Annotation extraction from scientific literature
• Attribution of experimental evidence

Ontology and Controlled Vocabulary-Based Curation
• Standardization of protein/gene/family names and annotation terms
• Annotation of specific protein entities


PIR Superfamily Classification

Tree of Life and Evolution of Protein Families (Dayhoff)

The protein superfamily concept (1976) was based on sequence similarity, where sequences were categorized into superfamilies, families, subfamilies, and entries using different % identity thresholds.


PIRSF Classification System

• A network classification system from superfamily to subfamily levels to reflect the evolutionary relationships of full-length proteins and domains
• Basic unit is homeomorphic family: full-length similarity, common domain architecture
• Provide annotation of generic biochemical and specific biological functions
• Basis for evolutionary and comparative genomics research
• Basis for accurate and consistent automated protein annotation (protein name, biochemical and biological functions, functional sites)
• Basis for standardization of protein names and development of ontology for protein evolution
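The "network" (rather than strict tree) structure can be sketched as a small directed acyclic graph in which a node may have more than one parent. The identifiers are PIRSF and Pfam examples from this deck; the parent links are assumptions read off the example names (for instance, that the bifunctional CM/PDH family sits under both Pfam domains), and the function names are illustrative only, not part of any PIR software.

```python
from collections import defaultdict

# child -> set of parents. A node may have several parents, which is what
# makes this a network rather than a tree.
# ASSUMPTION: these edges are inferred from the example names in the slides.
PARENTS = {
    "PIRSF001499": {"PF01817", "PF02153"},  # Bifunctional CM/PDH (T-protein)
    "PIRSF001500": {"PF01817"},             # Bifunctional CM/PDT (P-protein)
    "PIRSF001969": {"PF00219"},             # IGFBP homeomorphic family
    "PIRSF500001": {"PIRSF001969"},         # IGFBP-1 subfamily
    "PIRSF500006": {"PIRSF001969"},         # IGFBP-6 subfamily
}


def ancestors(node, parents=PARENTS):
    """Return every node reachable by following parent links upward."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children_index(parents=PARENTS):
    """Invert the child -> parents map into a parent -> children map."""
    index = defaultdict(set)
    for child, node_parents in parents.items():
        for parent in node_parents:
            index[parent].add(child)
    return dict(index)
```

Calling `ancestors("PIRSF500001")` walks from the subfamily through its family to the domain superfamily, while `ancestors("PIRSF001499")` returns two domain parents, which a tree could not represent.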

[Figure: PIRSF classification network, showing the Domain Superfamily, PIRSF Superfamily, PIRSF Homeomorphic Family and PIRSF Homeomorphic Subfamily levels for the CM/PDH, IGFBP and Ku examples.]


PIRSF Classification/Curation Workflow

[Figure: workflow diagram with numbered steps 1-8. Stages: Unclassified UniProtKB proteins; Uncurated Homeomorphic Clusters; Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies. Automatic procedures: Automatic Clustering; Automatic Placement (New Proteins, Unassigned Proteins). Computer-assisted manual curation: Map Domains on Clusters; Merge/Split Clusters; Add/Remove Members; Name, Refs, Abstract, Domain Arch.; Hierarchies (Superfamilies/Subfamilies); Protein Name Rules/Site Rules; Build and Test HMMs.]

1. Computational generation of homeomorphic clusters

2. Computational domain mapping and annotation of preliminary clusters

3. Automatic placement of new proteins into families

4. Computer-assisted expert analysis to define homeomorphic families

5. Family hierarchy created as needed

6. Expert annotation

7. Name rules and optional site rules created

8. Seed members to generate family HMMs
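Step 3 (automatic placement) can be illustrated with a deliberately crude sketch. The real pipeline uses family HMMs built from seed members (step 8); the code below swaps in a much simpler check, identical domain architecture plus a length-ratio cutoff, and does not compute sequence similarity at all. The `Protein` class, the `0.8` cutoff, and all identifiers other than the Pfam accessions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    accession: str   # hypothetical identifier
    length: int      # sequence length in residues (assumed > 0)
    domains: tuple   # ordered domain accessions, N- to C-terminus


def is_homeomorphic(a, b, min_length_ratio=0.8):
    """Crude stand-in for 'full-length similarity, common domain architecture'.

    ASSUMPTION: comparable length approximates full-length similarity; a real
    system would score the sequences against a family model instead.
    """
    same_architecture = a.domains == b.domains
    length_ratio = min(a.length, b.length) / max(a.length, b.length)
    return same_architecture and length_ratio >= min_length_ratio


def place(new_protein, seeds, min_length_ratio=0.8):
    """Return the id of the first family whose seed matches, else None.

    `seeds` maps a family id to one representative Protein. A None result
    corresponds to the 'Unassigned Proteins' box in the workflow.
    """
    for family_id, seed in seeds.items():
        if is_homeomorphic(new_protein, seed, min_length_ratio):
            return family_id
    return None
```

Proteins that come back as `None` would be left for clustering and manual curation rather than forced into a family.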


PIRSF Classification Tools

• Iterative BlastClust Tree with Annotation Table
• Multiple Alignment and Phylogenetic Tree
• PIRSF Classification in DAG Editor

[Figure: screenshots labeled Phylogenetic Tree, Classification/Annotation, Alignment; tree labels HPS and KGPDC]

ISMB: PIRSF Protein Classification System Demo


PIRSF Analysis/Visualization Tools

• Taxonomy Distribution and Phylogenetic Pattern
• Domain Display
• Family Hierarchy (DAG Browser)


PIRSF Family Report

Curated family name

Description of family

Sequence analysis tools


Classification and Functional Annotation

[Figure: classification tree of phosphofructokinase families: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274]

Residues at positions 105 and 125, numbered as in E. coli (P06998), which has Gly105 and Gly125:
• ATP-PFK: Gly105 + Gly125
• PPi-PFK: Gly/Asp105 + Lys125

Example - Phosphofructokinase (PFK) classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue.
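The residue pattern on this slide can be written down as a tiny decision function. This is only a sketch of the rule as stated here (positions 105 and 125 in E. coli P06998 numbering); the function name and the "undetermined" fallback are assumptions, and mapping an arbitrary sequence onto those alignment positions is left out.

```python
def classify_pfk(res105, res125):
    """Predict the phosphoryl donor class of a PFK from two residues.

    Arguments are one-letter amino-acid codes at the positions equivalent
    to 105 and 125 of E. coli PFK (P06998).
    """
    if res105 == "G" and res125 == "G":
        return "ATP-PFK"          # Gly105 + Gly125
    if res105 in ("G", "D") and res125 == "K":
        return "PPi-PFK"          # Gly/Asp105 + Lys125
    return "undetermined"         # pattern not covered by the slide
```

For example, `classify_pfk("G", "G")` gives `"ATP-PFK"`, while changing only the second residue to lysine, `classify_pfk("G", "K")`, gives `"PPi-PFK"`, which is the single-residue switch the example describes.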


Family-Based Rules for Annotation

• Functional Site Rule: tags active site, binding, other residue-specific information
• Functional Name Rule: gives name, EC, GO, other function-specific information
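One way to picture the two rule types is as small records applied to a protein once its family is known. The sketch below is not the PIRNR/PIRSR format: the class names, fields, the family id `FAM_X`, and the feature label are all hypothetical, and positions are taken in the protein's own numbering for simplicity. The only outside fact used is that EC 2.7.1.11 is 6-phosphofructokinase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRule:
    family_id: str
    protein_name: str
    ec_number: str = ""
    go_terms: tuple = ()


@dataclass(frozen=True)
class SiteRule:
    family_id: str
    position: int         # 1-based residue position (simplified numbering)
    allowed: frozenset    # one-letter residues that trigger the tag
    feature: str          # what to annotate at that position


def annotate(family_id, residues, name_rules, site_rules):
    """Apply family-based rules to one protein.

    `residues` maps position -> one-letter residue for the protein.
    Name rules fire on family membership alone; site rules also require
    the expected residue to be present, so a mismatch is never tagged.
    """
    annotation = {"name": None, "ec": None, "go": [], "sites": []}
    for rule in name_rules:
        if rule.family_id == family_id:
            annotation["name"] = rule.protein_name
            annotation["ec"] = rule.ec_number or None
            annotation["go"] = list(rule.go_terms)
    for rule in site_rules:
        if rule.family_id == family_id and residues.get(rule.position) in rule.allowed:
            annotation["sites"].append((rule.position, rule.feature))
    return annotation


# Hypothetical rules for an illustrative family "FAM_X".
NAME_RULES = [NameRule("FAM_X", "ATP-dependent 6-phosphofructokinase", "2.7.1.11")]
SITE_RULES = [
    SiteRule("FAM_X", 105, frozenset("G"), "diagnostic residue (hypothetical label)"),
    SiteRule("FAM_X", 125, frozenset("G"), "diagnostic residue (hypothetical label)"),
]
```

Requiring the residue check before tagging a site is the design point: a family member that has lost the expected residue still receives the name but not the site feature.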


iProLINK Literature Mining Resource

1. UniProtKB bibliography mapping in iProClass
2. RLIMS-P rule-based NLP method for extracting protein phosphorylation data
3. Substring-based machine learning method for PTM text categorization
4. BioThesaurus of protein/gene names with UniProtKB association
5. Named entity tagging guide
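To give a feel for what "rule-based extraction of phosphorylation data" means, here is a toy pattern that pulls a kinase, a substrate and an optional site out of one sentence shape. It is far simpler than RLIMS-P and is not derived from it; the regular expression, the function name, and the restriction to Ser/Thr/Tyr sites are assumptions made for illustration.

```python
import re

# One hand-written pattern: "<kinase> phosphorylates <substrate> [at|on <site>]".
# ASSUMPTION: names are single tokens of letters, digits and hyphens.
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:at|on)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return a dict with kinase, substrate and site (site may be None).

    Returns None when the sentence does not match the pattern.
    """
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "kinase": match.group("kinase"),
        "substrate": match.group("substrate"),
        "site": match.group("site"),
    }
```

A sentence such as "PKA phosphorylates CREB at Ser-133" yields all three fields; passive constructions, multi-word names and nominalizations are missed, which is why real systems need many more rules and linguistic preprocessing.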


Literature Corpus for Text Mining

• Literature survey and manual tagging for evidence attribution
• Training and benchmarking sets for information retrieval and extraction
• Protein phosphorylation data used to develop RLIMS-P for extracting phosphorylation information
• The five PTM datasets used to develop a machine learning algorithm for text categorization


Online RLIMS-P

1. Summary table: PMIDs & top-ranking annotation
2. Report: Full annotation with evidence tagging and PMID mapping to UniProtKB entry
3. Name mapping searches BioThesaurus


BioThesaurus

[Figure: BioThesaurus construction]
• Source databases (through iProClass):
  • NCBI: Entrez Gene, RefSeq, GenPept
  • UniProt: UniProtKB, UniRef90/50, PIR-PSD
  • Genome: FlyBase, WormBase, MGD, SGD, RGD
  • Other: HUGO, EC, OMIM
• Stages: Name Extraction; Raw Thesaurus; Name Filtering (Highly Ambiguous, Nonsensical Terms); Semantic Typing (UMLS); BioThesaurus
• Content: UniProtKB Entries with Protein/Gene Names & Synonyms

• Comprehensive collection of protein/gene names from 23 databases
• Associate names (~3.2 million) with UniProtKB entries (>2 million)
• Web-based searches to retrieve synonymous names, resolve ambiguous names, evaluate name coverage
• FTP download for automatic dictionary-based named entity tagging
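The two lookups described above, retrieving synonyms and detecting ambiguous names, follow from a pair of inverted indexes between names and entries. The sketch below is a minimal in-memory illustration, not the BioThesaurus implementation or file format; the class, the normalization rule, and the entry ids `ENTRY_A` to `ENTRY_C` are hypothetical, while the names come from the examples in this deck.

```python
from collections import defaultdict


def normalize(name):
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


class Thesaurus:
    def __init__(self):
        self._name_to_entries = defaultdict(set)   # normalized name -> entry ids
        self._entry_to_names = defaultdict(set)    # entry id -> names as given

    def add(self, entry_id, name):
        self._name_to_entries[normalize(name)].add(entry_id)
        self._entry_to_names[entry_id].add(name)

    def entries_for(self, name):
        """All entries that carry this name (after normalization)."""
        return set(self._name_to_entries.get(normalize(name), set()))

    def synonyms(self, name):
        """Other names attached to any entry that carries this name."""
        key = normalize(name)
        found = set()
        for entry_id in self.entries_for(name):
            for other in self._entry_to_names[entry_id]:
                if normalize(other) != key:
                    found.add(other)
        return found

    def is_ambiguous(self, name):
        """A name is ambiguous when it maps to more than one entry."""
        return len(self.entries_for(name)) > 1
```

With `ENTRY_A` carrying both "TIMP-3" and "Metalloproteinase inhibitor 3", a query for either returns the other as a synonym; registering "CLIM1" on two different entries makes `is_ambiguous("CLIM1")` true.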


Online BioThesaurus

1. Search protein entries sharing the same names
2. Retrieve BioThesaurus report

• Name ambiguity of CLIM1
• Annotation error detection


BioThesaurus Report: Gene/Protein Name Mapping

1. Search Synonyms (example: synonyms for Metalloproteinase inhibitor 3)
2. Resolve Name Ambiguity (example: name ambiguity of TIMP-3)
3. Underlying ID Mapping


Protein Ontology (PRO)

• PRotein Ontology (PRO) in OBO (Open Biomedical Ontologies) Framework
• Two sub-ontologies:
  • Ontology for Protein Evolution (ProEvo) for the classification of proteins on the basis of evolutionary relationships
  • Ontology for Protein Modified Forms (ProMod) to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification)
• Why PRO?
  • Allow the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology
  • Facilitate precise protein annotation of specific proteins/classes
• The PRO prototype is illustrated using human proteins from the TGF-beta signaling pathway (http://pir.georgetown.edu/pro).


PRO Conceptual Framework

[Figure: PRO (Protein Ontology) with its two sub-ontologies, ProEvo and ProMod, and links to external ontologies]

Terms shown: evolutionary unit (root level), protein, domain, structure domain, sequence domain, homeomorphic protein, gene product, reference protein, genetic variant, splice variant, cleaved product, modified product

Relations within PRO: is_a, derives_from, has_part, lacks

Levels:
• Unit Level: the two types of evolutionary units; not substituted by any other terms
• Domain Family Level (structure): related by structural similarity; source: SCOP Superfamily
• Domain Family Level (sequence): related by sequence similarity; source: Pfam domain
• Protein Family Level: evolutionarily-related full-length protein; may contain finer-grain sub-categories; sources: PIRSF family/subfamily, Panther subfamily
• Gene level: all protein products encoded by one gene; source: UniProtKB
• Transcript level: possible transcript forms; source: UniProtKB
• Post-translation level: protein as modified after translation; source: UniProtKB

Links to external ontologies:
• GO (Gene Ontology) molecular function: has_function, lacks_function, has_ancestral_property
• GO cellular component: part_of (for complexes), located_in (for compartments), has_ancestral_property
• GO biological process: participates_in, has_ancestral_property
• DO/UMLS (Disease Ontology/Term) disease: agent_of
• PSI-MOD (Modification) protein modification: has_modification
• HGNC/MGI (Gene Name) gene name: encoded_by
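The is_a backbone of such a framework can be traversed with a few lines of code. The sketch below uses term names from the figure, but the specific edges are an assumed reading of the diagram (protein and domain as the two evolutionary units, with more specific terms beneath them), not an excerpt from the PRO release; the dictionary and function names are illustrative.

```python
# term -> parent via is_a.
# ASSUMPTION: these edges are one plausible reading of the framework figure.
IS_A = {
    "structure domain": "domain",
    "sequence domain": "domain",
    "domain": "evolutionary unit",
    "homeomorphic protein": "protein",
    "protein": "evolutionary unit",
}


def lineage(term, is_a=IS_A):
    """Return the term followed by its is_a ancestors up to the root."""
    path = [term]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path


def is_kind_of(term, ancestor, is_a=IS_A):
    """True when `ancestor` is the term itself or one of its is_a ancestors."""
    return ancestor in lineage(term, is_a)
```

Other relation types in the figure (derives_from, has_modification, encoded_by and so on) would be stored as separate typed edges rather than folded into the is_a chain, so that class membership and provenance stay distinct.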


Protein Ontology (PRO)


Acknowledgements

PIR Team
• Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan
• Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata
• Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank

Collaborators
• UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams
• NIAID: Margaret Moore (SSS), Bruno Sobral (VBI)
• Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U)

Funding Support
• NHGRI/NIGMS (UniProt)
• NCI caBIG
• NIAID (Proteomic Admin Center)
• NSF: iProClass, text mining