Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)

Hu ZZ1, Mani I2, Liu H3, Vijay-Shanker K4, Hermoso V1, Nikolskaya A1, Natale DA1, and Wu CH1

1Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057; 2Georgetown University, 37th and O Streets, NW, Washington, DC 20057; 3University of Maryland at Baltimore County, Baltimore, MD 21250; 4Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716

PIRSF in DAG View

• PIRSF family hierarchy based on evolutionary relationships
• Standardized PIRSF family names as hierarchical protein ontology
• DAG Network structure for PIRSF family classification system

PIRSF-Based Protein Ontology

ABSTRACT

An integrated protein literature mining resource, iProLINK, has been developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract protein phosphorylation information from MEDLINE abstracts to assist database annotation; an online BioThesaurus has been developed for protein/gene name mapping and to assist with protein named-entity recognition; and a protein ontology based on the PIRSF family classification has been developed to complement other ontologies.

As the volume of scientific literature grows rapidly, literature data mining becomes increasingly critical to facilitate genome/proteome annotation and to improve the quality of biological databases. Annotations derived from experimentally verified data in the literature are of special value to the UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is to have accurate, consistent, and rich annotation of protein sequence and function. Relevant to this goal are literature-based curation and the development and adoption of ontologies and controlled vocabularies.

• Literature-Based Curation – Extract Reliable Information from Literature

• Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…

• This will ensure high quality, accurate and up-to-date experimental data for each protein. But it is a major bottleneck!

• Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management

• UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies for extracting information from biological literature and to develop a protein ontology.

INTRODUCTION

PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research

(http://pir.georgetown.edu)

UniProt – Central international database of protein sequence and function

(http://www.uniprot.org)

Bioinformatics. 2005 Jun 1;21(11):2759-65

High recall for paper retrieval and high precision for information extraction

• UniProtKB site feature annotation
• Proteomics MS data analysis: protein identification

Benchmarking of RLIMS-P

RLIMS-P processing pipeline (figure), applied to abstracts and full-length texts:

1. Preprocessing: sentence extraction, part-of-speech tagging
2. Entity Recognition: acronym detection, term recognition
3. Phrase Detection: noun and verb group detection, other syntactic structure detection
4. Semantic Type Classification
5. Relation Identification: nominal level relation, verbal level relation
6. Post-Processing

Output: extracted annotations and tagged abstracts

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?

Example: ATR/FRP-1 also phosphorylated p53 in Ser 15
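Pattern 1 can be approximated with a regular expression. This is a minimal sketch and not the RLIMS-P implementation: RLIMS-P matches patterns over shallow-parsed noun and verb groups, whereas this toy version assumes that the agent and theme are single whitespace-free tokens. The function name `match_pattern1` and the restriction of sites to Ser/Thr/Tyr are assumptions made for illustration.

```python
import re

# Toy approximation of Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
# Assumption: agent and theme are single tokens; the real system works on
# parsed noun/verb groups rather than raw tokens.
PATTERN1 = re.compile(
    r"(?P<agent>\S+)"                    # <AGENT>: the kinase
    r"(?:\s+also)?\s+"                   # optional adverb inside the verb group
    r"phosphorylate[sd]?"                # active forms of "phosphorylate"
    r"\s+(?P<theme>\S+?)"                # <THEME>: the substrate
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"  # optional <SITE>
    r"(?=[\s.,;]|$)"                     # stop at whitespace, punctuation, or end
)


def match_pattern1(sentence):
    """Return (agent, theme, site) for the first Pattern 1 match, else None."""
    m = PATTERN1.search(sentence)
    if m is None:
        return None
    return m.group("agent"), m.group("theme"), m.group("site")


print(match_pattern1("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
# ('ATR/FRP-1', 'p53', 'Ser 15')
```

When the sentence names no site, the third element is `None`, which mirrors the optional `(in/at <SITE>)?` part of the pattern.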

http://pir.georgetown.edu/iprolink/

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation

Phosphorylation annotation objects (figure): an enzyme (e.g., MAP kinase) adds a P-group to a P-site (e.g., Ser505) of a substrate (e.g., cPLA2), giving phosphorylated cPLA2 (Ser-P).

<AGENT> Enzyme (kinase catalyzing the phosphorylation)

<THEME> Substrate (protein being phosphorylated)

<SITE> P-Site (amino acid residue being phosphorylated)

RLIMS-P

Protein Phosphorylation Annotation Extraction

• Manual tagging assisted with computational extraction
• Training sets of positive and negative samples

BioThesaurus report for UniProtKB entry P35625

• Tagging guideline versions 1.0 and 2.0

– Generation of domain expert-tagged corpora

– Inter-coder reliability – upper bound of machine tagging

• Dictionary pre-tagging

– F-measure: 0.412 (0.372 Precision, 0.462 Recall)

– Advantages: helps standardize tagging and its extent, reduces tagger fatigue, and improves inter-coder reliability.

• BioThesaurus for pre-tagging
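The F-measure reported for dictionary pre-tagging can be reproduced from the stated precision and recall. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall, beta = 1), which is consistent with the three numbers given; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported above for dictionary pre-tagging.
print(round(f_measure(0.372, 0.462), 3))  # 0.412
```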

BioThesaurus construction (figure):

• Source databases integrated through iProClass
– NCBI: Entrez Gene, RefSeq, GenPept
– UniProt: UniProtKB, UniRef90/50, PIR-PSD
– Genome: FlyBase, WormBase, MGD, SGD, RGD
– Other: HUGO, EC, OMIM
• Name Extraction: produces the Raw Thesaurus
• Name Filtering: Highly Ambiguous and Nonsensical Terms
• Semantic Typing: UMLS
• BioThesaurus: Protein/Gene Names & Synonyms for UniProtKB Entries

BioThesaurus applications:

• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources

BioThesaurus v1.0 (May 2005; m = million)

• UniProtKB entries: 1.86m
• Source DB records: 6.6m
• Gene/protein names/terms: 3.6m

Protein Name Tagging

Example 2. Name ambiguity of CLIM1

PIRSF to GO Mapping

• Superimpose GO and PIRSF hierarchies
• Bidirectional display (GO- or PIRSF-centric views)

• Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies

• Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
– 68% of the PIRSF families and subfamilies map to GO leaf nodes
– 2329 PIRSFs have shared GO leaf nodes
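PIRSFs that share a GO leaf node can be found by inverting a PIRSF-to-GO mapping. The following is a minimal sketch under stated assumptions: the function name `shared_leaf_nodes` is illustrative, GO terms are given by name only, and the toy assignments are hypothetical rather than taken from the curated mapping.

```python
from collections import defaultdict


def shared_leaf_nodes(pirsf_to_go_leaf):
    """Group PIRSF IDs by GO leaf term; keep leaves hit by more than one PIRSF."""
    by_leaf = defaultdict(list)
    for pirsf_id, go_term in pirsf_to_go_leaf.items():
        by_leaf[go_term].append(pirsf_id)
    return {term: sorted(ids) for term, ids in by_leaf.items() if len(ids) > 1}


# Toy mapping: the PIRSF IDs appear on the poster, but the GO assignments
# below are hypothetical and serve only to illustrate the grouping.
MAPPING = {
    "PIRSF001969": "insulin-like growth factor binding",
    "PIRSF018239": "insulin-like growth factor binding",
    "PIRSF003033": "DNA binding",
}

print(shared_leaf_nodes(MAPPING))
```

A leaf shared by functionally distinct families is a candidate for a GO concept that is too broad.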

DynGO viewer

Two cases: analyze GO branches and concepts and identify missing GO nodes

Case I. Nuclear receptor superfamily

Case II. IGF-binding protein superfamily

iProLINK: An integrated protein resource for literature mining

1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein entity recognition - dictionary, tagged literature
4. Protein ontology development - PIRSF-based ontology

http://pir.georgetown.edu/iprolink/

Testing and Benchmarking Dataset

• RLIMS-P text mining tool

• Protein dictionaries

• Name tagging guideline

• Protein ontology

Protein Ontology Can Complement GO

Expanding a Node: Identification of GO subtrees that need expansion if GO concepts are too broad

– IGFBP subfamilies
– High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

GO-centric view


Exploration of Gene and Protein Ontology

PIRSF-centric view


Molecular function

Biological process

Estrogen receptor alpha (PIRSF50001)

Systematic links between three GO sub-ontologies based on the shared annotations at different protein family levels, e.g., linking molecular function and biological process:

– estrogen receptor binding and

– estrogen receptor signaling pathway

Acknowledgements

Research Projects

NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)

NSF: SEIII (Entity Tagging)

NSF: ITR (Ontology)

Collaborators

I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.

H. Liu from University of Maryland Department of Information Systems on protein name recognition and text mining.

K. Vijay-Shanker from University of Delaware Department of Computer and Information Sciences on text mining of protein phosphorylation features.

Summary

• The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.

• The RLIMS-P text-mining tool extracts protein phosphorylation information from PubMed literature. By coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.

• BioThesaurus can be used to resolve name synonymy and ambiguity and to support name mapping.

• The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes and by providing systematic links between the three GO sub-ontologies.


PIRSF classification levels with examples (figure):

• Domain Superfamily: one common Pfam domain
– PF02153: Prephenate dehydrogenase (PDH)
– PF01817: Chorismate mutase (CM)
– PF00219: Insulin-like growth factor binding protein (IGFBP)
– PF02735: Ku70/Ku80 beta-barrel domain

• PIRSF Superfamily: one or more common domains
– PIRSF800001: Ku70/80 autoantigen

• PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
– PIRSF001499: Bifunctional CM/PDH (T-protein)
– PIRSF006786: PDH, feedback inhibition-insensitive
– PIRSF005547: PDH, feedback inhibition-sensitive
– PIRSF017318: CM of AroQ class, eukaryotic type
– PIRSF001501: CM of AroQ class, prokaryotic type
– PIRSF026640: Periplasmic CM
– PIRSF001500: Bifunctional CM/PDT (P-protein)
– PIRSF006493: Ku, prokaryotic type
– PIRSF003033: Ku70 autoantigen
– PIRSF001969: IGFBP
– PIRSF018239: IGFBP-related protein, MAC25 type

• PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization
– PIRSF500001: IGFBP-1
– …



PIRSF: A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins

Definitions

• Basic unit = Homeomorphic Family
• Homeomorphic: Full-length similarity, common domain architecture
• Network Structure: Flexible number of levels with varying degrees of sequence conservation
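The network (DAG) structure differs from a tree in that a family can have more than one parent, for example a bifunctional family that contains two Pfam domains. The sketch below stores child-to-parent edges and collects all ancestors of a node. The node IDs appear on the poster, but the edges are illustrative assumptions inferred from the node names, not read from the PIRSF DAG files; the names `PARENTS` and `ancestors` are hypothetical.

```python
# Child -> parents. Edges are illustrative assumptions inferred from node
# names, not taken from the PIRSF DAG files.
PARENTS = {
    "PIRSF500001": ["PIRSF001969"],         # IGFBP-1 subfamily -> IGFBP family
    "PIRSF001969": ["PF00219"],             # IGFBP family -> IGFBP domain
    "PIRSF018239": ["PF00219"],             # IGFBP-related protein, MAC25 type
    "PIRSF001499": ["PF01817", "PF02153"],  # bifunctional CM/PDH: two parents
    "PIRSF006786": ["PF02153"],             # PDH, feedback inhibition-insensitive
}


def ancestors(node, parents=PARENTS):
    """Return every node reachable from `node` by following parent edges."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("PIRSF500001")))  # ['PF00219', 'PIRSF001969']
print(sorted(ancestors("PIRSF001499")))  # ['PF01817', 'PF02153']
```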

PIRSF Protein Family Classification

Example 1. Name ambiguity of TIMP3

http://pir.georgetown.edu/iprolink/biothesaurus/

Web-based BioThesaurus

Gene/Protein Name Mapping

1. Search Synonyms

2. Resolve Name Ambiguity

3. Underlying ID Mapping
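The three functions above can be sketched with an inverted index from names to entries. This is a toy illustration, not the BioThesaurus implementation: `build_index`, `lookup`, and `synonyms` are hypothetical names, `ENTRY_A` and `ENTRY_B` are placeholder identifiers for two unrelated entries that share the name CLIM1, and pairing P35625 with TIMP3 is an assumption based on the examples shown on the poster.

```python
from collections import defaultdict

# Toy records: (entry_id, names). ENTRY_A and ENTRY_B are hypothetical
# placeholders; the P35625/TIMP3 pairing is an assumption for illustration.
RECORDS = [
    ("P35625", ["TIMP3", "Metalloproteinase inhibitor 3"]),
    ("ENTRY_A", ["CLIM1", "hypothetical protein A"]),
    ("ENTRY_B", ["CLIM1", "hypothetical protein B"]),
]


def build_index(records):
    """Map each case-folded name to the set of entries that carry it."""
    index = defaultdict(set)
    for entry_id, names in records:
        for name in names:
            index[name.casefold()].add(entry_id)
    return index


def lookup(index, name):
    """Return sorted entry IDs for a name; more than one ID signals ambiguity."""
    return sorted(index.get(name.casefold(), ()))


def synonyms(records, index, name):
    """Return the other names attached to entries that carry `name`."""
    hits = set(lookup(index, name))
    found = {n for entry_id, names in records if entry_id in hits for n in names}
    return sorted(n for n in found if n.casefold() != name.casefold())


index = build_index(RECORDS)
print(lookup(index, "clim1"))             # ['ENTRY_A', 'ENTRY_B'] (ambiguous)
print(synonyms(RECORDS, index, "TIMP3"))  # ['Metalloproteinase inhibitor 3']
```

Case-folding the keys is a design choice for this sketch; a production thesaurus would likely need richer normalization of punctuation and spelling variants.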

Online RLIMS-P text-mining tool (version 1.0)

http://pir.georgetown.edu/iprolink/rlimsp/


1. Search interface

2. Summary table with top hit of all sites

3. All sites and tagged text evidence


DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/

Liu et al, 2005, submitted


http://www.georgetown.edu/departments/linguistics

http://complingone.georgetown.edu/~prot/

http://www.umbc.edu/academics/Depart/infor.html

http://www.cis.udel.edu/

