

iProLINK – A Literature Mining Resource at PIR (integrated Protein Literature INformation and Knowledge)

Hu ZZ¹, Liu H², Vijay-Shanker K³, Mani I⁴, and Wu CH¹

¹Protein Information Resource, ²Department of Biostatistics, Bioinformatics, and Biomathematics, ⁴Department of Computational Linguistics, Georgetown University, Washington, DC 20007; ³University of Delaware, DE 19716

Contact: [email protected]

http://pir.georgetown.edu/iprolink

Introduction: With the increasing volume of scientific literature available electronically, efficient text mining tools can greatly facilitate the extraction of information buried in free text and assist in database annotation and scientific inquiry. Many methods, including natural language processing, machine learning, and rule-based approaches, have been employed for biological literature mining, especially in the areas of entity recognition, information retrieval, and information extraction. The Protein Information Resource (PIR) group, in active collaboration with several other groups, conducts research and provides resources on literature mining in these three areas. iProLINK is a public resource provided by PIR that offers annotated literature data sets for the development of new literature mining algorithms (for tasks such as protein named entity recognition, text categorization, and protein annotation extraction) and of protein ontologies. iProLINK also provides literature mining tools for scientific users and curators. (Comp Biol Chem, 28:409-416, 2004)

Summary

- iProLINK is a public resource for literature mining and ontology development.

- RLIMS-P is a text-mining tool for protein phosphorylation.

- BioThesaurus is for gene and protein name mapping to solve name ambiguity.

- BioThesaurus and RLIMS-P can be used to assist UniProtKB protein annotations.

- PIRSF-based protein ontology can complement GO.

1. Bibliography mapping

- UniProtKB mapped citations

2. Annotation extraction

- annotation tagged literature

3. Protein entity recognition

- dictionary, tagged literature

4. Protein ontology development

- PIRSF-based ontology

iProLINK Resource Overview

Bibliography mapping:

Contains curated literature citations for UniProtKB protein entries from multiple sources including GeneRIF, SGD, and MGI, in addition to current UniProt literature citations. Also included are user-submitted and computationally mapped citations.

Annotation tagged literature sets: e.g. acetylation, glycosylation, hydroxylation, phosphorylation, methylation in abstract or full text.

Tagging guideline versions 1.0 and 2.0

Two sets of tagged corpora

[Figure: inter-coder reliability under guideline v1.0 and guideline v2.0]

Protein name tagging guidelines: lessons learned – Comp. Funct Genomics, 6(1-2): 72-76, 2005
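
The poster does not say how inter-coder reliability was scored. As an illustration only, one common choice for name tagging is the exact-match F-measure between the spans marked by two coders. The sketch below assumes that measure; the function name and the character offsets are hypothetical and are not taken from the iProLINK corpora.

```python
def span_f_measure(spans_a, spans_b):
    """Exact-match F-measure between two coders' tagged spans.

    Each span is a (start, end) character offset pair. With coder A as the
    reference, precision = matched / len(b) and recall = matched / len(a);
    their harmonic mean simplifies to 2 * matched / (len(a) + len(b)),
    which is symmetric in the two coders.
    """
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # both coders tagged nothing: treated as full agreement
    matched = len(a & b)
    return 2 * matched / (len(a) + len(b))


# Made-up offsets for one abstract, for illustration only.
coder_a = [(0, 6), (20, 34), (50, 55)]
coder_b = [(0, 6), (20, 30), (50, 55)]
agreement = span_f_measure(coder_a, coder_b)
print(f"agreement = {agreement:.3f}")
```

Exact matching is strict: the second span counts as a disagreement because the coders chose different right boundaries. Boundary decisions of this kind are the sort of ambiguity a tagging guideline is meant to settle.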

Protein entity recognition: name dictionaries, tagged abstracts and tagging guidelines

BioThesaurus

• Comprehensive collection of protein/gene names from multiple molecular databases

• Associates names with UniProtKB entries

• Primary usage:

  • Retrieve synonymous names

  • Resolve ambiguous names

  • Evaluate name coverage
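
The three usages listed above can be pictured with a small two-way index between names and entries. This is a minimal sketch and not BioThesaurus itself: the class, its methods, the normalization rule, and the entry identifiers `ENTRY_HUMAN` and `ENTRY_MOUSE` are all hypothetical placeholders.

```python
from collections import defaultdict


class NameThesaurus:
    """Toy two-way index between protein/gene names and entry identifiers."""

    def __init__(self):
        self.name_to_entries = defaultdict(set)
        self.entry_to_names = defaultdict(set)

    @staticmethod
    def normalize(name):
        # Assumed rule: ignore case, hyphens, spaces and other punctuation.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def add(self, entry_id, name):
        self.name_to_entries[self.normalize(name)].add(entry_id)
        self.entry_to_names[entry_id].add(name)

    def entries(self, name):
        return set(self.name_to_entries.get(self.normalize(name), ()))

    def synonyms(self, name):
        """Other names attached to any entry that carries this name."""
        key = self.normalize(name)
        found = set()
        for entry_id in self.entries(name):
            found |= self.entry_to_names[entry_id]
        return sorted(n for n in found if self.normalize(n) != key)

    def is_ambiguous(self, name):
        """A name is ambiguous here if it maps to more than one entry."""
        return len(self.entries(name)) > 1

    def coverage(self, names):
        """Fraction of the given names that the thesaurus knows."""
        names = list(names)
        if not names:
            return 0.0
        return sum(1 for n in names if self.entries(n)) / len(names)


# Placeholder entries; the identifiers are invented for this example.
thesaurus = NameThesaurus()
thesaurus.add("ENTRY_HUMAN", "Metalloproteinase inhibitor 3")
thesaurus.add("ENTRY_HUMAN", "TIMP-3")
thesaurus.add("ENTRY_MOUSE", "TIMP-3")
thesaurus.add("ENTRY_MOUSE", "Timp3")

print(thesaurus.synonyms("timp 3"))
print(thesaurus.is_ambiguous("TIMP-3"))
print(thesaurus.coverage(["TIMP-3", "unknown protein"]))
```

Normalizing before lookup lets spelling variants such as "TIMP-3" and "Timp3" hit the same key, while the original spellings are kept for display.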

PIRSF-Based Protein Ontology

- PIRSF family hierarchy based on evolutionary relationships

- Standardized PIRSF family names and relations as protein ontology

- DAG network structure for the PIRSF family classification system (left)

- PIRSF-based protein ontology can complement Gene Ontology (right)
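
To make the DAG idea concrete, the sketch below stores child-to-parent relations and walks them to collect all ancestors of a node. It is an assumption-laden illustration, not the PIRSF data model: the class and the node names (`superfamily_X`, `family_A`, `subfamily_A1`, `function_term_F`) are hypothetical.

```python
from collections import defaultdict


class FamilyDAG:
    """Directed acyclic graph of child -> parent relations."""

    def __init__(self):
        self.parents = defaultdict(set)

    def ancestors(self, node):
        """All nodes reachable by following parent links from node."""
        seen = set()
        stack = list(self.parents.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def add_relation(self, child, parent):
        # Reject edges that would close a loop and break the DAG property.
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"{child} -> {parent} would create a cycle")
        self.parents[child].add(parent)


dag = FamilyDAG()
dag.add_relation("family_A", "superfamily_X")
dag.add_relation("subfamily_A1", "family_A")
# A second parent is what distinguishes a DAG from a strict tree.
dag.add_relation("subfamily_A1", "function_term_F")

print(sorted(dag.ancestors("subfamily_A1")))
```

Allowing more than one parent per node is what lets a family-based hierarchy sit alongside terms from another ontology in the same graph.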

PIRSF in DAG View

Name ambiguity of TIMP-3

Synonyms for Metalloproteinase inhibitor 3

Search and browse tagged features

Bioinformatics, 21(11): 2759-2765, 2005

RLIMS-P

Details in a separate RLIMS-P poster

RLIMS-P and BioThesaurus combined can be used for UniProt protein feature annotations.

Acknowledgements: NIH (UniProt), NSF (Entity Tagging, Ontology). PIR team: Hermoso V, Fang C, Yuan X, Huang H, Zhang J, Natale D, Nikolskaya A. Temple University: Han B, Obradovic Z, Vucetic S.

Data sets for the five post-translational modifications (PTMs) are being used to develop machine learning algorithms for text categorization (classification). A substring-based approach has been developed that is highly effective in biomedical document classification (Bioinformatics, submitted, 2006).
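
The poster does not describe the substring-based method, so the following is only a generic stand-in: a Naive Bayes classifier over character n-grams, which is one simple way to classify documents from substrings rather than whole words. The class name, the n-gram length, the labels, and the four training sentences are all invented for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Overlapping character substrings of length n, case-folded."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SubstringNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over n-grams."""

    def __init__(self, n=4):
        self.n = n
        self.gram_counts = {}        # label -> Counter of n-grams
        self.doc_counts = Counter()  # label -> number of training docs
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc, self.n)
            self.gram_counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, doc):
        # Substrings never seen in training carry no evidence; skip them.
        grams = [g for g in char_ngrams(doc, self.n) if g in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny invented training set; real corpora would be far larger.
train_docs = [
    "The kinase phosphorylates the receptor at Ser-15.",
    "Phosphorylation of the protein at tyrosine residues was observed.",
    "The gene is expressed in liver tissue.",
    "Crystal structure of the enzyme was determined.",
]
train_labels = ["phosphorylation", "phosphorylation", "other", "other"]
model = SubstringNaiveBayes(n=4).fit(train_docs, train_labels)

print(model.predict("Threonine phosphorylation of the substrate by the kinase"))
```

Character substrings pick up shared stems such as "phosphorylat" across inflected forms without a stemmer or tokenizer, which is one plausible reason substring features suit biomedical text.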

Data sets for protein phosphorylation were used to test and benchmark RLIMS-P, a rule-based text mining program for phosphorylation (Bioinformatics 21:2759-65, 2005).
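
As a rough picture of what a rule-based extractor does, the sketch below matches one hand-written sentence pattern and pulls out a kinase, a substrate, and an optional site. It is a toy: the single regular expression, the function name, and the example sentences are assumptions made for illustration and are not RLIMS-P's rules.

```python
import re

# One assumed pattern: "<kinase> phosphorylates <substrate> [at|on <site>]".
PHOSPHORYLATION_PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylate[sd]?\s+"
    r"(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:at|on)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return one dict per match with kinase, substrate and site (or None)."""
    return [m.groupdict() for m in PHOSPHORYLATION_PATTERN.finditer(sentence)]


# Example sentences written for this sketch.
with_site = extract_phosphorylation("PKA phosphorylates CREB at Ser-133.")
without_site = extract_phosphorylation("CK2 phosphorylated p53 in vitro.")
print(with_site)
print(without_site)
```

A single pattern like this misses passives, nominalizations ("phosphorylation of X by Y"), and multi-word protein names; a production rule set would need many patterns plus entity recognition, which is where tagged data sets become useful for benchmarking.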

Bioinformatics. 2006 Apr 27