
Anastasia Nikolskaya
PIR (Protein Information Resource)

Georgetown University Medical Center

www.uniprot.org http://pir.georgetown.edu/

COMPLEMENTING GENE ONTOLOGY WITH PIRSF CLASSIFICATION-BASED PROTEIN ONTOLOGY


Why Protein Classification?

Automatic annotation of protein sequences based on protein families (propagation of annotation)

Systematic correction of annotation errors

Protein name standardization in UniProt

Functional predictions for uncharacterized protein families


PIRSF Classification System

PIRSF: A network structure with hierarchies from Superfamilies to Subfamilies reflects evolutionary relationships of full-length proteins

Definitions:
• Basic unit = Homeomorphic Family
• Homologous (Common Ancestry): Inferred by sequence similarity
• Homeomorphic: Full-length sequence similarity and common domain architecture
• Network Structure: Flexible number of levels with varying degrees of sequence conservation

Advantages:
• Annotation of both generic biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology


Levels of protein classification

| Level | Example | Similarity | Evolution |
| Fold | TIM-Barrel | Topology of folded backbone | Possible monophyly |
| Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin |
|  | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications |
| Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Origin traceable to a single gene in LCA |
| Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Evolution by recent duplication and loss |


PIRSF Classification System

[Figure: example PIRSF hierarchies. Node labels recovered from the figure, grouped by level.]

Domain Superfamily (one common Pfam domain)
• PF02153: Prephenate dehydrogenase (PDH)
• PF01817: Chorismate mutase (CM)
• PF00219: Insulin-like growth factor binding protein (IGFBP)
• PF02735: Ku70/Ku80 beta-barrel domain

PIRSF Superfamily (0 or more levels; one or more common domains)
• PIRSF800001: Ku70/80 autoantigen

PIRSF Homeomorphic Family (exactly one level; full-length sequence similarity and common domain architecture)
• PIRSF001499: Bifunctional CM/PDH (T-protein)
• PIRSF006786: PDH, feedback inhibition-insensitive
• PIRSF005547: PDH, feedback inhibition-sensitive
• PIRSF017318: CM of AroQ class, eukaryotic type
• PIRSF001501: CM of AroQ class, prokaryotic type
• PIRSF026640: Periplasmic CM
• PIRSF001500: Bifunctional CM/PDT (P-protein)
• PIRSF006493: Ku, prokaryotic type
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF001969: IGFBP
• PIRSF018239: IGFBP-related protein, MAC25 type

PIRSF Homeomorphic Subfamily (0 or more levels; functional specialization)
• PIRSF500001: IGFBP-1
• PIRSF500006: IGFBP-6

A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
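The assignment rule stated above can be sketched as a small data structure. This is only an illustration, not PIR's implementation: the class names, the level labels, and the accession `P_EXAMPLE` are hypothetical, and the two parent links for PIRSF001499 are read off the figure labels (the bifunctional CM/PDH family appears under both the PDH and CM Pfam domains).

```python
from dataclasses import dataclass, field

# Level labels are illustrative names for the four levels shown on the slide.
LEVELS = ("domain_superfamily", "superfamily", "family", "subfamily")


@dataclass
class Node:
    node_id: str
    name: str
    level: str
    parents: set = field(default_factory=set)   # zero or more parent node IDs
    children: set = field(default_factory=set)  # zero or more child node IDs


class Classification:
    """Toy PIRSF-like network: nodes form a DAG, each protein has one family."""

    def __init__(self):
        self.nodes = {}
        self.family_of = {}  # protein accession -> homeomorphic family ID

    def add_node(self, node_id, name, level):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.nodes[node_id] = Node(node_id, name, level)

    def link(self, parent_id, child_id):
        # A node may gain several parents, which makes this a network, not a tree.
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, protein, family_id):
        if self.nodes[family_id].level != "family":
            raise ValueError("proteins attach to homeomorphic families only")
        current = self.family_of.get(protein)
        if current is not None and current != family_id:
            raise ValueError(f"{protein} already belongs to {current}")
        self.family_of[protein] = family_id


net = Classification()
net.add_node("PF02153", "Prephenate dehydrogenase (PDH)", "domain_superfamily")
net.add_node("PF01817", "Chorismate mutase (CM)", "domain_superfamily")
net.add_node("PIRSF001499", "Bifunctional CM/PDH (T-protein)", "family")
net.link("PF02153", "PIRSF001499")
net.link("PF01817", "PIRSF001499")
net.assign("P_EXAMPLE", "PIRSF001499")  # hypothetical accession

print(sorted(net.nodes["PIRSF001499"].parents))  # ['PF01817', 'PF02153']
```

Storing parents and children as sets on each node keeps the "zero or more" cardinality explicit, while the single-valued `family_of` dictionary enforces the one-family-per-protein rule.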


SF500001: stimulates trophoblast migration
SF500002: stimulates proliferation of prostate cancer cells
SF500003: anti-proliferative and pro-apoptotic effects on cancer cells
SF500004: inhibitor of IGF
SF500005: stimulates bone formation
SF500006: inhibitor of IGF-II


Creation and curation of PIRSFs

[Figure: PIRSF creation and curation workflow. Labels recovered from the figure are listed below.]

Procedure types
• Automatic Procedure
• Computer-assisted Manual Curation

Protein sets
• UniProt proteins
• New proteins
• Unassigned proteins

Family stages
• Preliminary Homeomorphic Families
• Orphans
• Curated Homeomorphic Families
• Final Homeomorphic Families

Steps
• Automatic clustering
• Automatic placement
• Add/remove members
• Merge/split clusters
• Map domains on Families
• Name, refs, abstract, domain arch.
• Create hierarchies (superfamilies/subfamilies)
• Protein name rule/site rule
• Build and test HMMs

Curation status
• Computer-Generated (Uncurated) Clusters (36,000 PIRSFs)
• Preliminary Curation (5,000 PIRSFs): Membership, Signature Domains
• Full Curation (1,300 PIRSFs): Family Name, Description, Bibliography (with evidence tag)


PIRSF-Based Protein Annotation in UniProt

Rule-Based annotation system using curated PIRSFs
• Site Rules (PIRSR): Position-Specific Site Features (active sites, binding sites, modified sites, other functional sites)
• Name Rules (PIRNR): transfer name from PIRSF to individual proteins (define a subgroup if necessary)
  • Protein Name (may differ from family name), synonyms, acronyms
  • EC
  • Misnomers
  • GO Terms (homeomorphic family-based, propagatable GO annotation)
  • Function

UniProt is developing protein name standards and guidelines

Classification of proteins into families provides a convenient and accurate mechanism to propagate curated information to individual protein members
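As a rough illustration of family-based propagation, the sketch below transfers a name and GO terms from a family-level rule to each member protein. It is a minimal sketch under stated assumptions, not the PIRNR implementation: the `NameRule` and `Protein` classes, the rule identifier, the accessions, the protein name attached to the rule, and the GO placeholder are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameRule:
    rule_id: str           # hypothetical identifier
    pirsf_id: str          # curated family the rule applies to
    protein_name: str      # standardized name to transfer
    go_terms: tuple = ()   # propagatable GO terms (placeholders here)


@dataclass
class Protein:
    accession: str
    pirsf_id: Optional[str]          # None for unassigned proteins
    name: str = "hypothetical protein"
    go_terms: tuple = ()
    evidence: str = ""               # records which rule supplied the annotation


def apply_name_rules(proteins, rules):
    """Copy name and GO terms from a family rule to each member protein."""
    by_family = {rule.pirsf_id: rule for rule in rules}
    for protein in proteins:
        rule = by_family.get(protein.pirsf_id)
        if rule is None:
            continue  # no curated rule for this family: leave the entry untouched
        protein.name = rule.protein_name
        protein.go_terms = tuple(sorted(set(protein.go_terms) | set(rule.go_terms)))
        protein.evidence = rule.rule_id
    return proteins


rules = [NameRule("PIRNR-EXAMPLE", "PIRSF006786",
                  "Prephenate dehydrogenase", ("GO:PLACEHOLDER",))]
proteins = [Protein("ACC_1", "PIRSF006786"), Protein("ACC_2", None)]

for p in apply_name_rules(proteins, rules):
    print(p.accession, "|", p.name, "|", p.evidence)
```

Recording the rule identifier alongside the transferred name keeps every propagated annotation traceable to the curated family it came from.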


PIRSF-Based Protein Ontology
• PIRSF family hierarchy is based on evolutionary relationships
• Standardized PIRSF family names
• Network structure (in DAG) for PIRSF family classification system


PIRSF to GO Mapping
• PIRSF to GO mapping provides a link between GO concepts and protein objects
• Mapped 5500 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
• Superimpose GO and PIRSF hierarchies
• Bidirectional display (GO-centric or PIRSF-centric views)

DynGO viewer: Hongfang Liu, University of Maryland
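The two views can be thought of as one mapping and its inverse. The sketch below is illustrative only and does not describe how DynGO works: the GO identifiers are placeholders, and the assignments of families to terms are invented for the example.

```python
from collections import defaultdict

# PIRSF-centric view: family -> GO terms (placeholder GO IDs, invented links).
pirsf_to_go = {
    "PIRSF001969": {"GO:PLACEHOLDER_1"},
    "PIRSF018239": {"GO:PLACEHOLDER_1"},
    "PIRSF003033": {"GO:PLACEHOLDER_2", "GO:PLACEHOLDER_3"},
}


def invert(mapping):
    """Build the reverse index so either side can serve as the entry point."""
    inverse = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverse[value].add(key)
    return dict(inverse)


# GO-centric view: GO term -> families annotated with it.
go_to_pirsf = invert(pirsf_to_go)

print(sorted(go_to_pirsf["GO:PLACEHOLDER_1"]))  # ['PIRSF001969', 'PIRSF018239']
```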


Protein Ontology Can Complement GO

Expanding a Node
• Identification of GO subtrees that need expansion if GO concepts are too broad
• ~67% of curated PIRSF families and subfamilies map to GO leaf nodes
• Among these, 2209 PIRSFs have shared GO leaf nodes (many PIRSFs to 1 GO leaf)
• Example: PIRSF001969 vs PIRSF018239 and PIRSF036495: High- vs low-affinity IGF binding
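One way to find such candidates is to collect GO leaf terms that several families map to. The sketch below assumes a simple parent-to-children dictionary for GO and uses placeholder term names; the mapping of the three IGF-binding families to a single leaf mirrors the example above, while the remaining entries are invented.

```python
from collections import defaultdict


def shared_leaves(pirsf_to_go, go_children):
    """Return GO leaf terms that more than one PIRSF maps to."""
    leaf_to_pirsfs = defaultdict(set)
    for pirsf, terms in pirsf_to_go.items():
        for term in terms:
            if not go_children.get(term):  # no recorded children: treat as a leaf
                leaf_to_pirsfs[term].add(pirsf)
    return {term: sorted(fams)
            for term, fams in leaf_to_pirsfs.items() if len(fams) > 1}


# Placeholder GO terms; "GO:IGF_BINDING" stands in for a leaf that is too broad.
go_children = {
    "GO:BINDING": ["GO:IGF_BINDING"],
    "GO:IGF_BINDING": [],
}
pirsf_to_go = {
    "PIRSF001969": {"GO:IGF_BINDING"},
    "PIRSF018239": {"GO:IGF_BINDING"},
    "PIRSF036495": {"GO:IGF_BINDING"},
    "PIRSF003033": {"GO:BINDING"},  # invented, maps to a non-leaf term
}

candidates = shared_leaves(pirsf_to_go, go_children)
print(candidates)
```

A leaf shared by families with distinct functions, such as high- versus low-affinity binding, is a candidate for new, more specific child terms.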

Identification of missing GO nodes


Protein Ontology Can Complement GO

Identification of Missing GO Nodes (higher levels)


Protein Ontology Can Complement GO

Mechanism to examine the relationships between the three GO ontologies based on the shared annotations at different protein family levels

Example: molecular function “estrogen receptor activity” and biological processes “signal transduction” and “estrogen receptor signaling pathway”

Linking Function, Biological Process, and Cellular Component through a Protein Object Based on Protein Annotations
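A simple way to surface such links is to count how often terms from two GO ontologies are annotated to the same protein family. The sketch below is a hedged illustration: the family identifiers are hypothetical and the annotations are invented around the estrogen receptor example.

```python
from collections import Counter
from itertools import product

# Hypothetical families with invented annotations, keyed by GO ontology.
family_annotations = {
    "FAMILY_A": {
        "molecular_function": {"estrogen receptor activity"},
        "biological_process": {"signal transduction",
                               "estrogen receptor signaling pathway"},
    },
    "FAMILY_B": {
        "molecular_function": {"estrogen receptor activity"},
        "biological_process": {"signal transduction"},
    },
}


def cooccurrence(annotations, aspect_a, aspect_b):
    """Count families in which a term from aspect_a occurs with one from aspect_b."""
    counts = Counter()
    for terms in annotations.values():
        for pair in product(terms.get(aspect_a, ()), terms.get(aspect_b, ())):
            counts[pair] += 1
    return counts


links = cooccurrence(family_annotations, "molecular_function", "biological_process")
for (function, process), n in sorted(links.items()):
    print(f"{function} <-> {process}: {n} families")
```

Running the same count at the subfamily, family, and superfamily levels would show which function-process pairs hold broadly and which are specific to one level.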


PIRSF Protein Classification: a link between GO and protein objects

Annotation Quality
• Annotation of biological function of whole proteins
• Annotation of uncharacterized “hypothetical” proteins
• Correction of annotation errors and underannotations
• Standardization of Protein Names

PIRSF to GO mapping provides a link between GO sub-ontologies and protein objects


PIRSF-based Protein Ontology Can Complement GO

Identification of GO subtrees that need expansion if GO concepts are too broad

Comprehensive classification of related protein families in PIRSF can help in identification of missing GO nodes when entire groups of PIRSF superfamilies or families cannot be mapped to existing GO terms

Mechanism to examine the relationships between the three GO ontologies (molecular function, biological process, and cellular component), as well as between GO concepts, based on the shared annotations at different protein family levels


Acknowledgements

Hongfang Liu, University of Maryland
Judith Blake, The Jackson Laboratory

Dr. Cathy Wu, Director

Protein Classification team: Dr. Winona Barker; Dr. Lai-Su Yeh; Dr. Anastasia Nikolskaya; Dr. Darren Natale; Dr. Zhangzhi Hu; Dr. Raja Mazumder; Dr. CR Vinayaka; Dr. Xianying Wei; Dr. Sona Vasudevan

Informatics team: Dr. Hongzhan Huang; Baris Suzek, M.S.; Sehee Chung, M.S.; Dr. Leslie Arminski; Dr. Hsing-Kuo Hua; Yongxing Chen, M.S.; Jing Zhang, M.S.; Amar Kalelkar

Students: Christina Fang; Vincent Hormoso; Natalia Petrova; Jorge Castro-Alvear

PIR Team http://pir.georgetown.edu/

UniProt (SwissProt, TrEMBL, PIR) www.uniprot.org