
Page 1: University of Texas at Austin

University of Texas at Austin

Machine Learning Group

Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions

Razvan C. Bunescu

Raymond J. Mooney

Machine Learning Group
Department of Computer Sciences

University of Texas at Austin

{razvan, mooney}@cs.utexas.edu

Arun K. Ramani

Edward M. Marcotte

Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics
University of Texas at Austin

{arun, marcotte}@icmb.utexas.edu

Page 2: University of Texas at Austin


Outline

Introduction & Motivation.

Two benchmark tests of accuracy.

Framework for the extraction of interactions.

Future Work.

Conclusions.

Page 3: University of Texas at Austin


Introduction

Large-scale protein networks facilitate a better understanding of the interactions between proteins. Such networks are most complete for yeast; progress for human has been minimal.

Most known interactions between human proteins are reported in Medline.

Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline.

In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

Page 4: University of Texas at Austin


Motivation

Many interactions from Medline are not covered by current databases. The databases are generally biased toward different classes of interactions. Manually extracting interactions is a very laborious process.

Aim: Automatically identify pairs of interacting proteins with high accuracy.

Page 5: University of Texas at Austin


Outline

Introduction & Motivation.

Two benchmark tests of accuracy:

Functional Annotation.

Physical Interaction.

Framework for the extraction of interactions.

Future Work.

Conclusions.

Page 6: University of Texas at Austin


Accuracy Benchmarks – Shared Functional Annotations

Accuracy of interaction datasets correlates well with % of interaction partners sharing functional annotations.

Functional annotation = a pathway shared between the two proteins in a particular ontology:

KEGG: 55 pathways at the lowest level.
GO: 1,356 pathways at level 8 of the biological process annotation.
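To make the benchmark concrete, here is a minimal sketch of how the fraction of annotation-sharing partners could be computed, assuming pathway annotations are given as a mapping from protein IDs to sets of KEGG/GO pathway IDs. The function name and the toy annotations are illustrative, not the paper's actual data or code.

```python
from typing import Dict, List, Set, Tuple

def shared_annotation_fraction(
    pairs: List[Tuple[str, str]],
    pathways: Dict[str, Set[str]],
) -> float:
    """Fraction of protein pairs whose partners share at least one pathway.

    Only pairs where both proteins are annotated are counted, since
    unannotated proteins cannot be evaluated on this benchmark.
    """
    evaluated = shared = 0
    for a, b in pairs:
        if a in pathways and b in pathways:
            evaluated += 1
            if pathways[a] & pathways[b]:  # non-empty intersection
                shared += 1
    return shared / evaluated if evaluated else 0.0

# Toy usage with made-up pathway assignments.
annotations = {"CCND1": {"hsa04110"}, "CDK2": {"hsa04110"}, "TP53": {"hsa04115"}}
print(shared_annotation_fraction([("CCND1", "CDK2"), ("CCND1", "TP53")], annotations))  # 0.5
```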

Page 7: University of Texas at Austin


Accuracy Benchmarks – Shared Known Physical Interactions

Assumption: Accurate datasets are more enriched in pairs of proteins known to participate in a physical interaction.

Reactome and BIND are more accurate than the other databases => use them as the source of known physical interactions.

Total: 11,425 interactions between 1,710 proteins.

Page 8: University of Texas at Austin


Accuracy Benchmarks – LLR Scoring Scheme

Use the log-likelihood ratio (LLR) of protein pairs with respect to:

Sharing functional annotations.
Physically interacting.

LLR = \ln \frac{P(D \mid I)}{P(D \mid \neg I)} = \ln \frac{P(I \mid D)\, P(\neg I)}{P(\neg I \mid D)\, P(I)}

P(D|I) and P(D|¬I) are the probabilities of observing the data D conditioned on the proteins sharing (I) or not sharing (¬I) benchmark associations.

Higher values for LLR indicate higher accuracy.
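As a rough sketch, the LLR of a candidate interaction dataset D can be estimated from counts against a benchmark, assuming the conditional probabilities are approximated by simple relative frequencies over benchmark pairs. The function name and toy counts below are illustrative, not the paper's implementation.

```python
import math

def log_likelihood_ratio(pairs_in_D_given_I: int, total_I: int,
                         pairs_in_D_given_notI: int, total_notI: int) -> float:
    """LLR = ln( P(D|I) / P(D|not I) ), with the conditional probabilities
    estimated as relative frequencies over the benchmark pairs."""
    p_d_given_i = pairs_in_D_given_I / total_I
    p_d_given_not_i = pairs_in_D_given_notI / total_notI
    return math.log(p_d_given_i / p_d_given_not_i)

# Toy counts: the dataset covers 300 of 1,000 benchmark-positive pairs
# and 50 of 10,000 benchmark-negative pairs.
print(log_likelihood_ratio(300, 1000, 50, 10000))  # ln(0.30 / 0.005) ~ 4.09
```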

Page 9: University of Texas at Austin


Outline

Introduction & Motivation.

Two benchmark tests of accuracy.

Framework for the extraction of interactions.

Future Work.

Conclusions.

Page 10: University of Texas at Austin


Framework for Interaction Extraction

[Pipeline diagram] Medline abstract => Protein Extraction => Medline abstract (proteins tagged) => Interaction Extraction => Interactions Database

Extensive comparative experiments in [Bunescu et al. 2005]

Protein Extraction: Maximum Entropy tagger.

Interaction Extraction: ELCS (Extraction using Longest Common Subsequences).

The current framework aims to improve on the previous approach and to apply it at a much larger scale (750K Medline abstracts).

Page 11: University of Texas at Austin


Framework for Interaction Extraction

[Protein Extraction]

1) Identify protein names using a Conditional Random Field (CRF) tagger trained on a dataset of 750 Medline abstracts manually tagged for proteins.

[Interaction Extraction]

2) Keeping the most confident extractions, detect which pairs of proteins interact. Two methods:

2.1) Co-citation analysis (document level).

2.2) Learning of interaction extractors (sentence level).

[Lafferty et al. 2001]

Page 12: University of Texas at Austin


1) A CRF tagger for protein names

Protein Extraction is a sequence tagging task, where each word is assigned a tag from: O (Outside), B (Begin), C (Continue), E (End), U (Unique).

Tags:   O O O O O O B E O O O O O
Tokens: In synchronized human osteosarcoma cells , cyclin D1 is induced in early G1

The input text is first preprocessed: tokenized, split into sentences (Ratnaparkhi's MXTERMINATOR), and tagged with part-of-speech (POS) tags (Brill's tagger).

Page 13: University of Texas at Austin


1) A CRF tagger for protein names

Each token position in a sentence is associated with a vector of binary features based on the (current tag, previous tag) combination, and observed values such as:

Words before, after or at the current position.

Their POS tags & capitalization patterns.

A binary flag set to true if the word is part of a protein dictionary.

POS:    IN VBN JJ NN NNS , NN NNP VBZ VBN IN JJ
Tokens: In synchronized human osteosarcoma cells , cyclin D1 is induced in early

[Feature-window diagram: words and POS tags before, at, and after the current position]
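As an illustration of these templates, the sketch below builds a per-token feature dictionary from the neighboring words, POS tags, a coarse capitalization pattern, and a dictionary-membership flag. The window size, pattern encoding, and names are assumptions for illustration and do not reproduce the tagger's exact feature set.

```python
import re
from typing import Dict, List, Set

def cap_pattern(word: str) -> str:
    """Collapse a token into a coarse capitalization/digit pattern, e.g. 'D1' -> 'A0'."""
    pattern = re.sub(r"[A-Z]", "A", word)
    pattern = re.sub(r"[a-z]", "a", pattern)
    pattern = re.sub(r"[0-9]", "0", pattern)
    return re.sub(r"(.)\1+", r"\1", pattern)  # squeeze repeated symbols

def token_features(tokens: List[str], pos: List[str], i: int,
                   protein_dict: Set[str]) -> Dict[str, bool]:
    """Binary features for position i: current/neighboring words and POS tags,
    capitalization pattern, and a protein-dictionary flag."""
    feats = {
        f"word={tokens[i].lower()}": True,
        f"pos={pos[i]}": True,
        f"cap={cap_pattern(tokens[i])}": True,
        "in_protein_dict": tokens[i] in protein_dict,
    }
    if i > 0:
        feats[f"word-1={tokens[i - 1].lower()}"] = True
        feats[f"pos-1={pos[i - 1]}"] = True
    if i + 1 < len(tokens):
        feats[f"word+1={tokens[i + 1].lower()}"] = True
        feats[f"pos+1={pos[i + 1]}"] = True
    return feats
```

Feature dictionaries of this kind, paired with the O/B/C/E/U tag sequence, are what a CRF toolkit would consume during training.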

Page 14: University of Texas at Austin


1) A CRF tagger for protein names

The CRF model is trained on 750 Medline abstracts manually annotated for proteins.

Experimentally, CRFs give better performance than Maximum Entropy models – they allow local tagging decisions to compete against each other in a global sentence model.

The model is used for tagging a large set (750K) of Medline abstracts citing the word ‘human’.

Each extracted protein is assigned a normalized confidence value.

For the Interaction Extraction step, we keep only proteins scoring 0.8 or better.

Page 15: University of Texas at Austin


2.1) Interaction Extraction using Co-citation Analysis

Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins.

Compute the probability of co-citation under a random model (hyper-geometric distribution).

P(k \mid N, n, m) = \frac{\binom{n}{k} \binom{N-n}{m-k}}{\binom{N}{m}}

N – total number of abstracts (750K)
n – abstracts citing the first protein
m – abstracts citing the second protein
k – abstracts citing both proteins
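The random model translates directly into code; below is a minimal sketch using exact binomial coefficients (the function name and toy numbers are illustrative). In practice one may also sum this term over k' >= k to obtain the tail probability of seeing at least k co-citations.

```python
from math import comb

def cocitation_prob(k: int, N: int, n: int, m: int) -> float:
    """Hypergeometric probability of exactly k co-citing abstracts when
    n of N abstracts cite the first protein and m cite the second."""
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)

# Toy numbers: 750,000 abstracts, 60 cite protein A, 40 cite protein B,
# and 5 cite both -- far more co-citations than expected by chance.
print(cocitation_prob(5, 750_000, 60, 40))
```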

Page 16: University of Texas at Austin


2.1) Interaction Extraction using Co-citation Analysis

Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model.

Empirically, protein pairs whose observed co-citation rate is given a low probability under the random model score high on the functional annotation benchmark.

RESULT: Close to 15K extracted interactions that score comparably to or better than HPRD on the functional annotation benchmark.

Page 17: University of Texas at Austin


2.1) Co-citation Analysis with Bayesian Reranking

1. Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions.

2. For a given pair of proteins, compute the average score of co-citing abstracts.

3. Use the average score to re-rank the 15k already extracted pairs.

[Pipeline diagram] Medline abstract => CRF tagger => Medline abstract (proteins tagged) => Co-citation Analysis => Ranked Interactions => re-ranking with Naïve Bayes scores => Re-ranked Interactions
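A minimal sketch of steps 2 and 3 above, assuming per-abstract scores from the Naïve Bayes classifier are already available; the data structures and identifiers are illustrative.

```python
from typing import Dict, List, Tuple

def rerank_by_abstract_score(
    cociting_abstracts: Dict[Tuple[str, str], List[str]],
    abstract_score: Dict[str, float],
) -> List[Tuple[Tuple[str, str], float]]:
    """Re-rank protein pairs by the average classifier score of the abstracts
    that co-cite them (higher = more likely to discuss physical interactions)."""
    averaged = {}
    for pair, abstracts in cociting_abstracts.items():
        scores = [abstract_score[a] for a in abstracts if a in abstract_score]
        averaged[pair] = sum(scores) / len(scores) if scores else 0.0
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Toy usage with hypothetical abstract IDs and classifier scores.
pairs = {("cyclin D1", "p9Ckshs1"): ["a1", "a2"], ("cyclin D1", "p53"): ["a3"]}
scores = {"a1": 0.9, "a2": 0.7, "a3": 0.2}
print(rerank_by_abstract_score(pairs, scores))
```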

Page 18: University of Texas at Austin


Integrating Extracted Data with Existing Databases

Extracted: 6,580 interactions between 3,737 human proteins.

Total: 31,609 interactions between 7,748 human proteins.

Page 19: University of Texas at Austin


2.1) Co-citation Analysis: Evaluation

Page 20: University of Texas at Austin


2.1) Co-citation Analysis: Evaluation

Page 21: University of Texas at Austin


2.2) Learning of Interaction Extractors

Proteins may be co-cited for reasons other than interactions.

Solution: sentence level extraction, with a binary classifier.

Given a sentence containing the two protein names, output:

Positive: if the sentence asserts an interaction between the two.

Negative: otherwise.

If the sentence contains n > 2 protein names, replicate it into (n choose 2) sentences, each with only two protein names.

Training data: AImed, a collection of Medline abstracts, manually tagged.
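To make the (n choose 2) replication step above concrete, here is a minimal sketch; representing a candidate instance as a (protein, protein, sentence) triple is an assumption for illustration.

```python
from itertools import combinations
from typing import List, Tuple

def replicate_instances(sentence: str,
                        protein_mentions: List[str]) -> List[Tuple[str, str, str]]:
    """Turn one sentence with n protein mentions into C(n, 2) candidate
    instances, each focused on a single pair of mentions."""
    return [(p1, p2, sentence) for p1, p2 in combinations(protein_mentions, 2)]

sent = "... cyclin D1 is associated with both p34cdc2 and p33cdk2 ..."
for instance in replicate_instances(sent, ["cyclin D1", "p34cdc2", "p33cdk2"]):
    print(instance)  # 3 candidate pairs from one sentence
```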

Page 22: University of Texas at Austin


AImed

Total of 225 documents (200 with interactions + 25 without interactions), annotated for proteins and interactions.

In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity …

cyclin D1 … becomes associated with p9Ckshs1 => Interaction

cyclin D1 is associated with both p34cdc2 => Interaction

cyclin D1 is associated with both p34cdc2 and p33cdk2 => Interaction

Page 23: University of Texas at Austin


ELCS (Extraction using Longest Common Subsequences)

A new method for inducing rules that extract interactions between previously tagged proteins.

Each rule consists of a sequence of words with allowable word gaps between them, similar to [Blaschke & Valencia, 2001, 2002].

- (7) interactions (0) between (5) PROT (9) PROT (17) .

Any pair of proteins in a sentence forms a positive example if the pair is tagged as interacting; otherwise it forms a negative example.

Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.

[Bunescu et al., 2005]

Page 24: University of Texas at Austin


ERK (Extraction using a Relation Kernel)

The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names.

The feature space can be further pruned down – in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns:

[FI] Fore-Inter: ‘interaction of P1 with P2’, ‘activation of P1 by P2’

[I] Inter: ‘P1 interacts with P2’, ‘P1 is activated by P2’

[IA] Inter-After: ‘P1 – P2 complex’, ‘P1 and P2 interact’

Restrict the three types of patterns to use at most 4 words (besides the two protein anchors).

Page 25: University of Texas at Austin


ERK (Extraction using a Relation Kernel)

The kernel K(S1,S2) counts the number of common patterns between S1 and S2, weighted by their span in the two sentences.

K(S1,S2) can be computed with the dynamic programming procedure from [Lodhi et al., 2002].

Train an SVM model to find a max-margin linear discriminator between positive and negative examples.

S1 In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

S2 Experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and

[FI] patterns: “human cells P1 associated with P2”, …

[I] patterns: “P1 associated with P2”, …

[IA] patterns: “P1 associated with P2 ,”, …
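For illustration only, the sketch below computes a simplified variant of such a kernel: it keeps just the tokens between the two protein anchors (the [I] pattern type), enumerates their sparse subsequences of up to 4 words explicitly, weights each by a gap penalty raised to its span, and takes the dot product of the resulting feature maps. The efficient dynamic program of [Lodhi et al., 2002] and the fore/after context are not reproduced; LAMBDA and all names are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

LAMBDA = 0.75  # gap penalty: subsequences spread over longer spans count less

def middle_patterns(middle: List[str], max_len: int = 4) -> Dict[Tuple[str, ...], float]:
    """Explicit feature map: every sparse subsequence of up to max_len tokens
    between the two protein anchors, weighted by LAMBDA ** span."""
    feats: Dict[Tuple[str, ...], float] = defaultdict(float)
    for length in range(1, max_len + 1):
        for idx in combinations(range(len(middle)), length):
            span = idx[-1] - idx[0] + 1
            feats[tuple(middle[i] for i in idx)] += LAMBDA ** span
    return feats

def kernel(middle1: List[str], middle2: List[str]) -> float:
    """K(S1, S2) as a dot product of the two pattern feature maps."""
    f1, f2 = middle_patterns(middle1), middle_patterns(middle2)
    return sum(weight * f2[p] for p, weight in f1.items() if p in f2)

# Tokens between the protein anchors in the two example sentences above.
s1_middle = "is induced in early G1 and becomes associated with".split()
s2_middle = "is associated with both".split()
print(kernel(s1_middle, s2_middle))
```

A Gram matrix built with such a function could then be handed to an SVM that accepts precomputed kernels (e.g. scikit-learn's SVC(kernel='precomputed')).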

Page 26: University of Texas at Austin


Evaluation: ERK vs ELCS vs Manual

Compare results using the standard measures of precision and recall:

\text{precision} = \frac{\#\ \text{correct interactions extracted}}{\#\ \text{interactions extracted}}

\text{recall} = \frac{\#\ \text{correct interactions extracted}}{\#\ \text{interactions in the corpus}}

All three systems were tested on AImed, using gold-standard proteins.
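A small helper for these two measures over unordered protein pairs; names are illustrative, and AImed-specific evaluation details are not reproduced.

```python
from typing import Iterable, Tuple

def precision_recall(extracted: Iterable[Tuple[str, str]],
                     gold: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision and recall over unordered protein pairs."""
    ext = {frozenset(p) for p in extracted}
    ref = {frozenset(p) for p in gold}
    correct = len(ext & ref)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

print(precision_recall([("cyclin D1", "p9Ckshs1"), ("cyclin D1", "p53")],
                       [("p9Ckshs1", "cyclin D1")]))  # (0.5, 1.0)
```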

Page 27: University of Texas at Austin


Evaluation: ERK vs ELCS vs Manual

Page 28: University of Texas at Austin


Future Work & Conclusions

Future Work:

Analyze the complete set of 750K abstracts using the relation kernel and integrate the results into an improved composite dataset.

Conclusions:

Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases.

Final database:

31,609 interactions between 7,748 human proteins.

Page 29: University of Texas at Austin


For Further Information

• Consolidated database available online: http://bioinformatics.icmb.utexas.edu/idserve/

• Papers available online: http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html

• “Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome,” Ramani, A.K., Bunescu, R.C., Mooney, R.J. and Marcotte, E.M., Genome Biology, 6(5), r40, 2005.

• “Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions,” Arun Ramani, Edward Marcotte, Razvan Bunescu, Raymond Mooney, to appear in the Proceedings of the ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.

• “Collective Information Extraction with Relational Markov Networks,” Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004.

• “Comparative Experiments on Learning Information Extractors for Proteins and their Interactions,” Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33(2), 2005, pp. 139-155.

Page 30: University of Texas at Austin


The End

Page 31: University of Texas at Austin


Protein Interaction Datasets – Normalization

Need a shared convention for referencing proteins and their interactions.

Map each interacting protein to a LocusLink ID => small loss of proteins.

Consider interactions symmetric => many duplicates eliminated.

Omit self-interactions – they cannot be evaluated on the functional annotation benchmark.

Example: HPRD reduced from 12,013 to 6,054 unique symmetric, non-self interactions.
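A minimal sketch of this normalization, assuming a mapping from database-specific protein identifiers to LocusLink IDs is available; the mapping and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def normalize_interactions(
    raw_pairs: List[Tuple[str, str]],
    to_locuslink: Dict[str, str],
) -> Set[Tuple[str, str]]:
    """Map both partners to LocusLink IDs, treat interactions as symmetric
    (store each pair in sorted order), and drop self-interactions and
    pairs with unmappable proteins."""
    unique: Set[Tuple[str, str]] = set()
    for a, b in raw_pairs:
        la, lb = to_locuslink.get(a), to_locuslink.get(b)
        if la is None or lb is None or la == lb:
            continue  # unmappable protein or self-interaction
        unique.add((la, lb) if la < lb else (lb, la))
    return unique
```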

Page 32: University of Texas at Austin


Protein Interaction Datasets – Normalization

Dataset statistics after normalization (Is = interactions, Ps = proteins):

Dataset            Version     Total Is (Ps)     Self Is (Ps)    Unique Is (Ps)
Reactome           08/03/04    12,497 (6,257)    160 (160)       12,336 (807)
BIND               08/03/04    6,212 (5,412)     549 (549)       5,663 (4,762)
HPRD               04/12/04    12,013 (4,122)    3,028 (3,028)   6,054 (2,747)
Orthology (all)    03/31/04    71,497 (6,257)    373 (373)       71,124 (6,228)
Orthology (core)   03/31/04    11,488 (3,918)    206 (206)       11,282 (3,863)

Page 33: University of Texas at Austin


Accuracy of manually curated interactions

Functional Annotation Benchmark (Database: LLR)
  Reactome: 3.8
  BIND: 2.9
  HPRD: 2.1
  Core orthology: 2.1
  Non-core orthology: 1.1

Physical Interaction Benchmark (Database: LLR)
  Core orthology: 5.0
  HPRD: 3.7
  Non-core orthology: 3.7
  Reactome, BIND: N/A (used to define the benchmark)