Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)

Hu ZZ1, Mani I2, Liu H3, Vijay-Shanker K4, Hermoso V1, Nikolskaya A1, Natale DA1, and Wu CH1

1Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057; 2Georgetown University, 37th and O Streets, NW, Washington, DC 20057; 3University of Maryland at Baltimore County, Baltimore, MD 21250; 4Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716

PIRSF in DAG View

• PIRSF family hierarchy based on evolutionary relationships

• Standardized PIRSF family names as hierarchical protein ontology

• DAG network structure for PIRSF family classification system

PIRSF-Based Protein Ontology

ABSTRACT

An integrated protein literature mining resource, iProLINK, is developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract protein phosphorylation information from MEDLINE abstracts to assist database annotation; an online BioThesaurus is developed for protein/gene name mapping and to assist with protein named-entity recognition; and a protein ontology based on the PIRSF family classification is developed to complement other ontologies.

As the volume of scientific literature rapidly grows, literature data mining becomes increasingly critical to facilitate genome/proteome annotation and to improve the quality of biological databases. Annotations derived from experimentally verified data in the literature are of special value to the UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is to have accurate, consistent, and rich annotation of protein sequence and function. Relevant to this goal are literature-based curation and the development and adoption of ontologies and controlled vocabularies.

• Literature-Based Curation – Extract Reliable Information from Literature

• Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…

• This will ensure high-quality, accurate, and up-to-date experimental data for each protein, but it is a major bottleneck!

• Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management

• UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies for extracting information from biological literature and to develop a protein ontology.

INTRODUCTION

PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research (http://pir.georgetown.edu)

UniProt – Central international database of protein sequence and function (http://www.uniprot.org)

Bioinformatics. 2005 Jun 1;21(11):2759-65

High recall for paper retrieval and high precision for information extraction

• UniProtKB site feature annotation

• Proteomics MS data analysis: protein identification

Benchmarking of RLIMS-P

[Figure: RLIMS-P processing pipeline]

Input: Abstracts, Full-Length Texts

• Preprocessing: sentence extraction, part-of-speech tagging

• Entity Recognition: acronym detection, term recognition

• Phrase Detection: noun and verb group detection, other syntactic structure detection

• Semantic Type Classification

• Relation Identification: nominal level relation, verbal level relation

• Post-Processing

Output: Extracted Annotations, Tagged Abstracts

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?

Example: ATR/FRP-1 also phosphorylated p53 in Ser 15
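Pattern 1 above can be approximated with a regular expression over raw text. This is only a minimal sketch: RLIMS-P applies its patterns to tagged phrases after the preprocessing and phrase-detection stages, whereas the hypothetical `PATTERN_1` and `extract` below assume that agent and theme are single whitespace-free tokens and that the site is a Ser/Thr/Tyr residue followed by a number.

```python
import re

# Hypothetical, simplified rendering of Pattern 1 as a regex over raw text.
# Assumption: <AGENT> and <THEME> are single tokens; RLIMS-P itself matches
# against detected noun/verb groups, not raw strings.
PATTERN_1 = re.compile(
    r"(?P<agent>\S+)\s+"                     # <AGENT>: kinase name (one token)
    r"(?:also\s+)?"                          # optional adverb before the verb
    r"phosphorylate[sd]?\s+"                 # active verb group
    r"(?P<theme>\S+)"                        # <THEME>: substrate name (one token)
    r"(?:\s+(?:in|at)\s+"                    # optional "(in/at <SITE>)?"
    r"(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"  # <SITE>: residue type and position
)


def extract(sentence):
    """Return agent, theme, and site for the first Pattern 1 match, else None."""
    match = PATTERN_1.search(sentence)
    return match.groupdict() if match else None


result = extract("ATR/FRP-1 also phosphorylated p53 in Ser 15")
print(result)
```

Because the site group is optional, a sentence without a site still matches and yields `None` for `site`. Multi-word kinase names would need the phrase-detection step that this sketch omits.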

http://pir.georgetown.edu/iprolink/

RLIMS-P Rule-based LIterature Mining System for Protein Phosphorylation

[Figure: phosphorylation reaction. An enzyme (e.g., MAP kinase) adds a P-group to the P-site (e.g., Ser505, giving Ser-P) of a substrate (e.g., cPLA2), yielding phosphorylated-cPLA2.]

<AGENT> Enzyme (kinase catalyzing the phosphorylation)

<THEME> Substrate (protein being phosphorylated)

<SITE> P-Site (amino acid residue being phosphorylated)

Protein Phosphorylation Annotation Extraction

• Manual tagging assisted with computational extraction

• Training sets of positive and negative samples

BioThesaurus report

UniProtKB entry P35625

• Tagging guideline versions 1.0 and 2.0

– Generation of domain expert-tagged corpora

– Inter-coder reliability – upper bound of machine tagging

• Dictionary pre-tagging

– F-measure: 0.412 (0.372 Precision, 0.462 Recall)

– Advantages: helps standardize tagging and its extent, reduces the fatigue problem, and improves inter-coder reliability.

• BioThesaurus for pre-tagging
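The dictionary pre-tagging F-measure quoted above is consistent with the balanced harmonic mean of the reported precision and recall. A minimal sketch of that arithmetic follows; the function name `f_measure` and the `beta` parameter are illustrative choices, and the poster does not state which F-measure variant was used, so the balanced form (beta = 1) is an assumption.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported for dictionary pre-tagging on the poster.
score = f_measure(0.372, 0.462)
print(round(score, 3))
```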

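[Figure: BioThesaurus construction]

Source databases (integrated through iProClass):

• UniProt: UniProtKB, UniRef90/50, PIR-PSD

• NCBI: Entrez Gene, RefSeq, GenPept

• Genome: FlyBase, WormBase, MGD, SGD, RGD

• Other: HUGO, EC, OMIM

Processing steps:

• Name Extraction: protein/gene names and synonyms for UniProtKB entries, giving the Raw Thesaurus

• Name Filtering: highly ambiguous and nonsensical terms

• Semantic Typing: UMLS

Result: BioThesaurus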

BioThesaurus applications:

• Biological entity tagging

• Name mapping

• Database annotation

• Literature mining

• Gateway to other resources

BioThesaurus v1.0 (May 2005; m = million):

• UniProtKB entries: 1.86m

• Source DB records: 6.6m

• Gene/protein names/terms: 3.6m

Protein Name Tagging

Example 2. Name ambiguity of CLIM1

PIRSF to GO Mapping

• Superimpose GO and PIRSF hierarchies

• Bidirectional display (GO- or PIRSF-centric views)

• Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies

• Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy

– 68% of the PIRSF families and subfamilies map to GO leaf nodes

– 2329 PIRSFs have shared GO leaf nodes

DynGO viewer

Two cases: analyze GO branches and concepts and identify missing GO nodes

Case I. Nuclear receptor superfamily Case II. IGF-binding protein superfamily

iProLINK: An integrated protein resource for literature mining

1. Bibliography mapping - UniProt mapped citations

2. Annotation extraction - annotation tagged literature

3. Protein entity recognition - dictionary, tagged literature

4. Protein ontology development - PIRSF-based ontology

http://pir.georgetown.edu/iprolink/

Testing and Benchmarking Dataset

• RLIMS-P text mining tool

• Protein dictionaries

• Name tagging guideline

• Protein ontology

Protein Ontology Can Complement GO

Expanding a Node: Identification of GO subtrees that need expansion if GO concepts are too broad

– IGFBP subfamilies

– High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

GO-centric view

Exploration of Gene and Protein Ontology

PIRSF-centric view

Molecular function

Biological process

Estrogen receptor alpha (PIRSF50001)

Systematic links between three GO sub-ontologies based on the shared annotations at different protein family levels, e.g., linking molecular function and biological process:

– estrogen receptor binding and

– estrogen receptor signaling pathway
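The linking idea above, in which GO terms from different sub-ontologies become connected because the same protein family is annotated with both, can be sketched as follows. The data layout, the aspect labels, and the function name `cross_aspect_links` are assumptions made for illustration; only the family label and the two term names come from the poster.

```python
from itertools import product


def cross_aspect_links(annotations):
    """Pair molecular-function and biological-process terms sharing a family.

    `annotations` maps a family label to a list of (aspect, term) tuples.
    Returns sorted (function_term, process_term, family) triples.
    """
    links = set()
    for family, terms in annotations.items():
        functions = [t for aspect, t in terms if aspect == "molecular_function"]
        processes = [t for aspect, t in terms if aspect == "biological_process"]
        for function_term, process_term in product(functions, processes):
            links.add((function_term, process_term, family))
    return sorted(links)


# Illustrative input built from the example on the poster.
annotations = {
    "Estrogen receptor alpha": [
        ("molecular_function", "estrogen receptor binding"),
        ("biological_process", "estrogen receptor signaling pathway"),
    ],
}
links = cross_aspect_links(annotations)
print(links)
```

Running the same pairing at the subfamily, family, and superfamily levels would give links of different specificity, which is the sense of "different protein family levels" in the text.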

Acknowledgements

Research Projects

NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)

NSF: SEIII (Entity Tagging)

NSF: ITR (Ontology)

Collaborators

I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.

H. Liu from University of Maryland Department of Information Systems on protein name recognition and text mining.

Vijay K. Shanker from University of Delaware Department of Computer and Information Science on text mining of protein phosphorylation features.

Summary

• The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.

• The RLIMS-P text-mining tool extracts protein phosphorylation information from PubMed literature. By coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.

• BioThesaurus can be used to resolve name synonymy and ambiguity and to support name mapping.

• The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes and by providing systematic links between the three GO sub-ontologies.

[Figure: PIRSF classification levels with example entries]

Domain Superfamily

• One common Pfam domain

• PF02153: Prephenate dehydrogenase (PDH)

• PF01817: Chorismate mutase (CM)

• PF00219: Insulin-like growth factor binding protein (IGFBP)

• PF02735: Ku70/Ku80 beta-barrel domain

PIRSF Superfamily

• 0 or more levels

• One or more common domains

• PIRSF800001: Ku70/80 autoantigen

PIRSF Homeomorphic Family

• Exactly one level

• Full-length sequence similarity and common domain architecture

• PIRSF001499: Bifunctional CM/PDH (T-protein)

• PIRSF006786: PDH, feedback inhibition-insensitive

• PIRSF005547: PDH, feedback inhibition-sensitive

• PIRSF017318: CM of AroQ class, eukaryotic type

• PIRSF001501: CM of AroQ class, prokaryotic type

• PIRSF026640: Periplasmic CM

• PIRSF001500: Bifunctional CM/PDT (P-protein)

• PIRSF001969: IGFBP

• PIRSF018239: IGFBP-related protein, MAC25 type

• PIRSF003033: Ku70 autoantigen

• PIRSF016570: Ku80 autoantigen

• PIRSF006493: Ku, prokaryotic type

PIRSF Homeomorphic Subfamily

• 0 or more levels

• Functional specialization

• PIRSF500001: IGFBP-1

• PIRSF500006: IGFBP-6

PIRSF: A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins

Definitions

• Basic unit = Homeomorphic Family

• Homeomorphic: Full-length similarity, common domain architecture

• Network Structure: Flexible number of levels with varying degrees of sequence conservation
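A network (DAG) rather than a strict tree is needed because a family may sit under more than one parent. The sketch below shows one way to hold such a structure. The class `FamilyDag` is hypothetical, and the specific parent-child edges are assumptions inferred from the entry names on the poster (for example, a bifunctional CM/PDH family placed under both the CM and PDH domain superfamilies); they are not read from the PIRSF DAG files.

```python
from collections import defaultdict


class FamilyDag:
    """Minimal directed acyclic graph of classification nodes."""

    def __init__(self):
        self.parents = defaultdict(set)
        self.children = defaultdict(set)

    def add_edge(self, parent, child):
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def ancestors(self, node):
        """Collect every node reachable by following parent links."""
        seen = set()
        stack = list(self.parents.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen


dag = FamilyDag()
# Assumed edges, inferred from entry names only.
dag.add_edge("PF01817", "PIRSF001499")      # CM domain -> bifunctional CM/PDH
dag.add_edge("PF02153", "PIRSF001499")      # PDH domain -> bifunctional CM/PDH
dag.add_edge("PF00219", "PIRSF001969")      # IGFBP domain -> IGFBP family
dag.add_edge("PIRSF001969", "PIRSF500001")  # IGFBP family -> IGFBP-1 subfamily

print(sorted(dag.ancestors("PIRSF001499")))
print(sorted(dag.ancestors("PIRSF500001")))
```

Storing both `parents` and `children` keeps upward queries (which superfamilies contain this family) and downward queries (which subfamilies specialize it) equally cheap.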

PIRSF Protein Family Classification

Example 1. Name ambiguity of TIMP3

http://pir.georgetown.edu/iprolink/biothesaurus/

Web-based BioThesaurus

Gene/Protein Name Mapping

1. Search Synonyms

2. Resolve Name Ambiguity

3. Underlying ID Mapping
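The three functions listed above can be illustrated with a toy in-memory thesaurus. This is a sketch under stated assumptions, not the BioThesaurus implementation or its web interface: the class `ToyThesaurus`, the entry identifiers `ENTRY_A` and `ENTRY_B`, and all names in the sample data are invented placeholders.

```python
from collections import defaultdict


class ToyThesaurus:
    """Maps names to entry IDs and back; names are matched case-insensitively."""

    def __init__(self):
        self._ids_by_name = defaultdict(set)
        self._names_by_id = defaultdict(set)

    @staticmethod
    def _normalize(name):
        return " ".join(name.lower().split())

    def add(self, entry_id, name):
        self._ids_by_name[self._normalize(name)].add(entry_id)
        self._names_by_id[entry_id].add(name)

    def ids_for(self, name):
        """Underlying ID mapping: every entry that carries this name."""
        return sorted(self._ids_by_name.get(self._normalize(name), set()))

    def synonyms(self, name):
        """Other names attached to any entry that carries this name."""
        key = self._normalize(name)
        found = set()
        for entry_id in self.ids_for(name):
            for other in self._names_by_id[entry_id]:
                if self._normalize(other) != key:
                    found.add(other)
        return sorted(found)

    def is_ambiguous(self, name):
        """A name is ambiguous when it maps to more than one entry."""
        return len(self.ids_for(name)) > 1


thesaurus = ToyThesaurus()
# Placeholder records: (hypothetical entry ID, name).
for entry_id, name in [
    ("ENTRY_A", "ABC1"),
    ("ENTRY_A", "alpha binding cassette 1"),
    ("ENTRY_B", "ABC1"),
    ("ENTRY_B", "beta carrier 1"),
]:
    thesaurus.add(entry_id, name)

print(thesaurus.ids_for("abc1"), thesaurus.is_ambiguous("abc1"))
```

Here the short name maps to two entries, so it is flagged as ambiguous, while each long name maps to a single entry and can be used to disambiguate.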

Online RLIMS-P text-mining tool (version 1.0)

http://pir.georgetown.edu/iprolink/rlimsp/

1. Search interface

2. Summary table with top hit of all sites

3. All sites and tagged text evidence

DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/

Liu et al, 2005, submitted
