COMPLEMENTING GENE ONTOLOGY WITH PIRSF CLASSIFICATION-BASED PROTEIN ONTOLOGY
Anastasia Nikolskaya
PIR (Protein Information Resource)
Georgetown University Medical Center
www.uniprot.org http://pir.georgetown.edu/
2
Why Protein Classification?
Automatic annotation of protein sequences based on protein families (propagation of annotation)
Systematic correction of annotation errors
Protein name standardization in UniProt
Functional predictions for uncharacterized protein families
3
PIRSF Classification System
PIRSF: A network structure with hierarchies from Superfamilies to Subfamilies reflects evolutionary relationships of full-length proteins
Definitions:
• Basic unit = Homeomorphic Family
• Homologous (Common Ancestry): Inferred by sequence similarity
• Homeomorphic: Full-length sequence similarity and common domain architecture
• Network Structure: Flexible number of levels with varying degrees of sequence conservation
Advantages:
• Annotation of both generic biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology
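The domain-architecture half of the "homeomorphic" definition can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `shares_domain_architecture` and the protein labels are hypothetical, the domain lists are made up for the example, and the full-length sequence similarity requirement is not modelled here.

```python
def shares_domain_architecture(members):
    """Return True if every member has the same ordered list of domains.

    `members` maps a protein label to its ordered domain identifiers.
    (Hypothetical helper; it ignores the sequence-similarity criterion.)
    """
    architectures = {tuple(domains) for domains in members.values()}
    return len(architectures) == 1


# Illustrative data only: protein labels are made up.
candidate_family = {
    "PROT_A": ["PF01817", "PF02153"],
    "PROT_B": ["PF01817", "PF02153"],
}
mixed_group = {
    "PROT_A": ["PF01817", "PF02153"],
    "PROT_C": ["PF01817"],  # lacks the second domain
}

same = shares_domain_architecture(candidate_family)
different = shares_domain_architecture(mixed_group)
```

A group that fails this check would not qualify as a single homeomorphic family, although its members could still share a domain superfamily.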
4
Levels of protein classification
Level | Example | Similarity | Evolution
Fold | TIM-Barrel | Topology of folded backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Origin traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Evolution by recent duplication and loss
5
PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
[Figure: PIRSF classification network, with example nodes at each level]
Domain Superfamily (one common Pfam domain)
• PF02153: Prephenate dehydrogenase (PDH)
• PF01817: Chorismate mutase (CM)
• PF00219: Insulin-like growth factor binding protein (IGFBP)
• PF02735: Ku70/Ku80 beta-barrel domain
PIRSF Superfamily (0 or more levels; one or more common domains)
• PIRSF800001: Ku70/80 autoantigen
PIRSF Homeomorphic Family (exactly one level; full-length sequence similarity and common domain architecture)
• PIRSF001499: Bifunctional CM/PDH (T-protein)
• PIRSF006786: PDH, feedback inhibition-insensitive
• PIRSF005547: PDH, feedback inhibition-sensitive
• PIRSF017318: CM of AroQ class, eukaryotic type
• PIRSF001501: CM of AroQ class, prokaryotic type
• PIRSF026640: Periplasmic CM
• PIRSF001500: Bifunctional CM/PDT (P-protein)
• PIRSF001969: IGFBP
• PIRSF018239: IGFBP-related protein, MAC25 type
• PIRSF006493: Ku, prokaryotic type
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
PIRSF Homeomorphic Subfamily (0 or more levels; functional specialization)
• PIRSF500001: IGFBP-1
• …
• PIRSF500006: IGFBP-6
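The membership rule on this slide can be sketched as a small data structure. This is a minimal illustration, not PIR's implementation: the class `PirsfNetwork`, its methods, and the protein label `P_EXAMPLE` are hypothetical, while the PIRSF and Pfam identifiers are taken from the figure.

```python
from collections import defaultdict


class PirsfNetwork:
    """Toy model of the PIRSF network (hypothetical names, not PIR's code)."""

    def __init__(self):
        self.parents = defaultdict(set)   # node id -> parent node ids
        self.children = defaultdict(set)  # node id -> child node ids
        self.family_of = {}               # protein id -> homeomorphic family id

    def add_edge(self, parent, child):
        # A node may have zero or more parents and zero or more children.
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def assign(self, protein, family):
        # A protein may be assigned to only one homeomorphic family.
        current = self.family_of.get(protein)
        if current is not None and current != family:
            raise ValueError(f"{protein} is already assigned to {current}")
        self.family_of[protein] = family


net = PirsfNetwork()
# Bifunctional CM/PDH has two domains, so it has two domain superfamily parents.
net.add_edge("PF01817", "PIRSF001499")
net.add_edge("PF02153", "PIRSF001499")
# The IGFBP family has subfamily children.
net.add_edge("PIRSF001969", "PIRSF500001")
net.assign("P_EXAMPLE", "PIRSF001499")  # hypothetical protein label
```

Storing both `parents` and `children` keeps the structure a network rather than a strict tree, which is what allows one family to sit under several domain superfamilies.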
6
Functional specialization of the IGFBP subfamilies (PIRSF500001 to PIRSF500006):
• SF500001: stimulates trophoblast migration
• SF500002: stimulates proliferation of prostate cancer cells
• SF500003: anti-proliferative and pro-apoptotic effects on cancer cells
• SF500004: inhibitor of IGF
• SF500005: stimulates bone formation
• SF500006: inhibitor of IGF-II
7
Creation and curation of PIRSFs
[Figure: PIRSF creation and curation workflow]
Procedure types: Automatic Procedure; Computer-assisted Manual Curation
Protein sets and family stages:
• UniProt proteins
• Preliminary Homeomorphic Families
• Orphans
• Curated Homeomorphic Families
• Final Homeomorphic Families
• New proteins
• Unassigned proteins
Steps:
• Automatic clustering
• Automatic placement
• Add/remove members
• Merge/split clusters
• Name, refs, abstract, domain arch.
• Create hierarchies (superfamilies/subfamilies)
• Map domains on Families
• Protein name rule/site rule
Computer-Generated (Uncurated) Clusters (36,000 PIRSFs)
Preliminary Curation (5,000 PIRSFs)
• Membership
• Signature Domains
Full Curation (1,300 PIRSFs)
• Family Name with evidence tag
• Description, Bibliography
• Build and test HMMs
8
PIRSF-Based Protein Annotation in UniProt
Rule-Based annotation system using curated PIRSFs
• Site Rules (PIRSR): Position-Specific Site Features (active sites, binding sites, modified sites, other functional sites)
• Name Rules (PIRNR): transfer name from PIRSF to individual proteins (define a subgroup if necessary)
  • Protein Name (may differ from family name), synonyms, acronyms
  • EC
  • Misnomers
  • GO Terms (homeomorphic family-based, propagatable GO annotation)
  • Function
UniProt is developing protein name standards and guidelines
Classification of proteins into families provides a convenient and accurate mechanism to propagate curated information to individual protein members
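The idea of propagating curated information from a family to its members can be sketched as follows. This is a minimal illustration under stated assumptions, not the PIR rule engine: `NameRule`, `propagate`, the record layout, and the protein labels `PROT_A` and `PROT_B` are all hypothetical, and the example reuses the family name as the protein name even though a real rule may assign a different one.

```python
from dataclasses import dataclass, field


@dataclass
class NameRule:
    """Hypothetical stand-in for a PIRSF name rule (PIRNR)."""
    pirsf_id: str
    protein_name: str
    go_terms: list = field(default_factory=list)


def propagate(rule, proteins):
    """Return annotated copies of the proteins that belong to the rule's family."""
    annotated = []
    for protein in proteins:
        if protein["pirsf"] != rule.pirsf_id:
            continue  # the rule applies only to members of its own family
        annotated.append({
            **protein,
            "name": rule.protein_name,
            "go_terms": list(rule.go_terms),
            "source": rule.pirsf_id,  # record where the annotation came from
        })
    return annotated


rule = NameRule("PIRSF003033", "Ku70 autoantigen")
proteins = [
    {"id": "PROT_A", "pirsf": "PIRSF003033"},  # made-up protein labels
    {"id": "PROT_B", "pirsf": "PIRSF016570"},
]
result = propagate(rule, proteins)
```

Keeping a `source` field on each propagated annotation makes it possible to trace, and later correct, every name that came from a given family.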
9
PIRSF-Based Protein Ontology
• PIRSF family hierarchy is based on evolutionary relationships
• Standardized PIRSF family names
• Network structure (in DAG) for PIRSF family classification system
10
PIRSF to GO Mapping
• PIRSF to GO mapping provides a link between GO concepts and protein objects
• Mapped 5500 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
• Superimpose GO and PIRSF hierarchies
• Bidirectional display (GO-centric or PIRSF-centric views)
DynGO viewer: Hongfang Liu, University of Maryland
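The two display directions can be sketched as a mapping and its inverse. This is only an illustration and says nothing about how DynGO is implemented: the function `invert` is hypothetical, and the GO labels are placeholders rather than real GO identifiers.

```python
from collections import defaultdict

# PIRSF-centric view. The GO labels are placeholders, not real GO IDs.
pirsf_to_go = {
    "PIRSF001969": ["GO:term_A"],
    "PIRSF018239": ["GO:term_A"],
    "PIRSF003033": ["GO:term_B", "GO:term_C"],
}


def invert(mapping):
    """Build the GO-centric view from the PIRSF-centric one."""
    inverted = defaultdict(list)
    for pirsf_id, go_terms in mapping.items():
        for term in go_terms:
            inverted[term].append(pirsf_id)
    # Sort for a stable display order.
    return {term: sorted(ids) for term, ids in inverted.items()}


go_to_pirsf = invert(pirsf_to_go)
```

Deriving one view from the other, rather than maintaining both by hand, keeps the GO-centric and PIRSF-centric displays consistent.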
11
Protein Ontology Can Complement GO
Expanding a Node
• Identification of GO subtrees that need expansion if GO concepts are too broad
• ~67% of curated PIRSF families and subfamilies map to GO leaf nodes
• Among these, 2209 PIRSFs have shared GO leaf nodes (many PIRSFs to 1 GO leaf)
• Example: PIRSF001969 vs PIRSF018239 and PIRSF036495: High- vs low-affinity IGF binding
Identification of missing GO nodes
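Finding GO leaf nodes that are shared by several PIRSFs can be sketched as a counting step. This is a toy example under stated assumptions: the GO fragment and its labels are placeholders, each family is mapped to a single term for simplicity, and the three IGFBP-related families are assumed to share one leaf, as the slide's example suggests.

```python
from collections import Counter

# Placeholder GO fragment as child -> parent (labels are not real GO IDs).
go_parent = {
    "GO:leaf_1": "GO:root",
    "GO:leaf_2": "GO:root",
    "GO:mid": "GO:root",
    "GO:leaf_3": "GO:mid",
}
non_leaves = set(go_parent.values())  # any term that is someone's parent

# Placeholder PIRSF -> GO assignments.
pirsf_to_go = {
    "PIRSF001969": "GO:leaf_1",
    "PIRSF018239": "GO:leaf_1",
    "PIRSF036495": "GO:leaf_1",
    "PIRSF003033": "GO:leaf_2",
    "PIRSF016570": "GO:mid",
}


def is_leaf(term):
    return term not in non_leaves


def expansion_candidates(mapping):
    """Leaf terms that more than one PIRSF maps to (many PIRSFs to 1 GO leaf)."""
    counts = Counter(term for term in mapping.values() if is_leaf(term))
    return {term: n for term, n in counts.items() if n > 1}


candidates = expansion_candidates(pirsf_to_go)
leaf_fraction = sum(is_leaf(t) for t in pirsf_to_go.values()) / len(pirsf_to_go)
```

A leaf shared by functionally distinct families, such as high- and low-affinity binders, is a candidate for splitting into more specific child terms.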
12
Protein Ontology Can Complement GO
Identification of Missing GO Nodes (higher levels)
13
Protein Ontology Can Complement GO
Mechanism to examine the relationships between the three GO ontologies based on the shared annotations at different protein family levels
Example: molecular function “estrogen receptor activity” and biological processes “signal transduction” and “estrogen receptor signaling pathway”
Linking Function, Biological Process, and Cellular Component through a Protein Object Based on Protein Annotations
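Linking terms from different GO ontologies through a shared protein family can be sketched as a co-occurrence count. This is a hedged illustration only: the family labels `FAM_1` and `FAM_2`, the annotation layout, and the function `cooccurrence` are hypothetical, and the term names are the ones quoted in the example above.

```python
from collections import Counter
from itertools import product

# Made-up family annotations; only the term names come from the slide.
family_annotations = {
    "FAM_1": {
        "function": ["estrogen receptor activity"],
        "process": ["signal transduction", "estrogen receptor signaling pathway"],
    },
    "FAM_2": {
        "function": ["estrogen receptor activity"],
        "process": ["signal transduction"],
    },
}


def cooccurrence(annotations):
    """Count (function, process) term pairs annotated to the same family."""
    pairs = Counter()
    for terms in annotations.values():
        for pair in product(terms["function"], terms["process"]):
            pairs[pair] += 1
    return pairs


links = cooccurrence(family_annotations)
```

Pairs that recur across many families hint at a relationship between a molecular function and a biological process; the same counting could be extended to cellular component terms.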
14
PIRSF Protein Classification: a link between GO and protein objects
Annotation Quality
• Annotation of biological function of whole proteins
• Annotation of uncharacterized “hypothetical” proteins
• Correction of annotation errors and underannotations
• Standardization of Protein Names
PIRSF to GO mapping provides a link between GO sub-ontologies and protein objects
15
PIRSF-based Protein Ontology Can Complement GO
Identification of GO subtrees that need expansion if GO concepts are too broad
Comprehensive classification of related protein families in PIRSF can help in identification of missing GO nodes when entire groups of PIRSF superfamilies or families cannot be mapped to existing GO terms
Mechanism to examine the relationships between the three GO ontologies (molecular function, biological process, and cellular component), as well as between GO concepts, based on the shared annotations at different protein family levels
16
Acknowledgements
Hongfang Liu, University of Maryland
Judith Blake, The Jackson Laboratory
Dr. Cathy Wu, Director
Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhangzhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Xianying Wei, Dr. Sona Vasudevan
Informatics team: Dr. Hongzhan Huang, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jing Zhang, M.S., Amar Kalelkar
Students: Christina Fang, Vincent Hormoso, Natalia Petrova, Jorge Castro-Alvear
PIR Team http://pir.georgetown.edu/
UniProt (SwissProt, TrEMBL, PIR) www.uniprot.org