
Page 1: University of Texas at Austin

University of Texas at Austin

Machine Learning Group

Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions

Razvan C. Bunescu

Raymond J. Mooney

Machine Learning Group
Department of Computer Sciences

University of Texas at Austin

{razvan, mooney}@cs.utexas.edu

Arun K. Ramani

Edward M. Marcotte

Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics
University of Texas at Austin

{arun, marcotte}@icmb.utexas.edu

Page 2: University of Texas at Austin


Outline

Introduction & Motivation.

Two benchmark tests of accuracy.

Framework for the extraction of interactions.

Future Work.

Conclusions.

Page 3: University of Texas at Austin


Introduction

Large-scale protein networks facilitate a better understanding of the interactions between proteins. Such networks are most complete for yeast; progress for human has been minimal.

Most known interactions between human proteins are reported in Medline.

Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline.

In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

Page 4: University of Texas at Austin


Motivation

Many interactions from Medline are not covered by current databases. The databases are generally biased toward different classes of interactions. Manually extracting interactions is a very laborious process.

Aim: Automatically identify pairs of interacting proteins with high accuracy.

Page 5: University of Texas at Austin


Outline

Introduction & Motivation.

Two benchmark tests of accuracy:

Functional Annotation.

Physical Interaction.

Framework for the extraction of interactions.

Future Work.

Conclusions.

Page 6: University of Texas at Austin


Accuracy Benchmarks – Shared Functional Annotations

Accuracy of interaction datasets correlates well with % of interaction partners sharing functional annotations.

Functional annotation = a pathway shared between the two proteins in a particular ontology:

KEGG: 55 pathways at the lowest level.
GO: 1,356 pathways at level 8 of the biological process annotation.
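To make the benchmark concrete, here is a minimal sketch of how the fraction of annotation-sharing partners could be computed, assuming pathway annotations are given as a mapping from protein IDs to sets of KEGG/GO pathway IDs. The function name and the toy annotations are illustrative, not the paper's actual data or code.

```python
from typing import Dict, List, Set, Tuple

def shared_annotation_fraction(
    pairs: List[Tuple[str, str]],
    pathways: Dict[str, Set[str]],
) -> float:
    """Fraction of protein pairs whose partners share at least one pathway.

    Only pairs where both proteins are annotated are counted, since
    unannotated proteins cannot be evaluated on this benchmark.
    """
    evaluated = shared = 0
    for a, b in pairs:
        if a in pathways and b in pathways:
            evaluated += 1
            if pathways[a] & pathways[b]:  # non-empty intersection
                shared += 1
    return shared / evaluated if evaluated else 0.0

# Toy usage with made-up pathway assignments.
annotations = {"CCND1": {"hsa04110"}, "CDK2": {"hsa04110"}, "TP53": {"hsa04115"}}
print(shared_annotation_fraction([("CCND1", "CDK2"), ("CCND1", "TP53")], annotations))  # 0.5
```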

Page 7: University of Texas at Austin


Accuracy Benchmarks – Shared Known Physical Interactions

Assumption: Accurate datasets are more enriched in pairs of proteins known to participate in a physical interaction.

Reactome and BIND are more accurate than the other databases => use them as the source of known physical interactions.

Total: 11,425 interactions between 1,710 proteins.

Page 8: University of Texas at Austin


Accuracy Benchmarks – LLR Scoring Scheme

Use the log-likelihood ratio (LLR) of protein pairs with respect to:

Sharing functional annotations.
Physically interacting.

LLR = \ln \frac{P(D \mid I)}{P(D \mid \neg I)} = \ln \frac{P(I \mid D)\, P(\neg I)}{P(\neg I \mid D)\, P(I)}

P(D|I) and P(D|¬I) are the probabilities of observing the data D conditioned on the proteins sharing (I) or not sharing (¬I) benchmark associations.

Higher values for LLR indicate higher accuracy.
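As a rough sketch, the LLR of a candidate interaction dataset D can be estimated from counts against a benchmark, assuming the conditional probabilities are approximated by simple relative frequencies over benchmark pairs. The function name and toy counts below are illustrative, not the paper's implementation.

```python
import math

def log_likelihood_ratio(pairs_in_D_given_I: int, total_I: int,
                         pairs_in_D_given_notI: int, total_notI: int) -> float:
    """LLR = ln( P(D|I) / P(D|not I) ), with the conditional probabilities
    estimated as relative frequencies over the benchmark pairs."""
    p_d_given_i = pairs_in_D_given_I / total_I
    p_d_given_not_i = pairs_in_D_given_notI / total_notI
    return math.log(p_d_given_i / p_d_given_not_i)

# Toy counts: the dataset covers 300 of 1,000 benchmark-positive pairs
# and 50 of 10,000 benchmark-negative pairs.
print(log_likelihood_ratio(300, 1000, 50, 10000))  # ln(0.30 / 0.005) ~ 4.09
```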

Page 9: University of Texas at Austin


Outline

Introduction & Motivation.

Two benchmark tests of accuracy.

Framework for the extraction of interactions.

Future Work.

Conclusions.

Page 10: University of Texas at Austin


Framework for Interaction Extraction

[Pipeline diagram] Medline abstract => Protein Extraction => Medline abstract (proteins tagged) => Interaction Extraction => Interactions Database

Extensive comparative experiments in [Bunescu et al. 2005]

Protein Extraction: Maximum Entropy tagger.

Interaction Extraction: ELCS (Extraction using Longest Common Subsequences).

The current framework aims to improve on the previous approach and to apply it at a much larger scale (750K Medline abstracts).

Page 11: University of Texas at Austin


Framework for Interaction Extraction

[Protein Extraction]

1) Identify protein names using a Conditional Random Field (CRF) tagger trained on a dataset of 750 Medline abstracts manually tagged for proteins.

[Interaction Extraction]

2) Keeping the most confident extractions, detect which pairs of proteins interact. Two methods:

2.1) Co-citation analysis (document level).

2.2) Learning of interaction extractors (sentence level).

[Lafferty et al. 2001]

Page 12: University of Texas at Austin


1) A CRF tagger for protein names

Protein Extraction is a sequence tagging task, where each word is assigned a tag from: O (Outside), B (Begin), C (Continue), E (End), U (Unique).

Tags:   O O O O O O B E O O O O O
Tokens: In synchronized human osteosarcoma cells , cyclin D1 is induced in early G1

The input text is first preprocessed: tokenized, split into sentences (Ratnaparkhi's MXTERMINATOR), and tagged with part-of-speech (POS) tags (Brill's tagger).

Page 13: University of Texas at Austin


1) A CRF tagger for protein names

Each token position in a sentence is associated with a vector of binary features based on the (current tag, previous tag) combination, and observed values such as:

Words before, after or at the current position.

Their POS tags & capitalization patterns.

A binary flag set to true if the word is part of a protein dictionary.

POS:    IN VBN JJ NN NNS , NN NNP VBZ VBN IN JJ
Tokens: In synchronized human osteosarcoma cells , cyclin D1 is induced in early

[Feature-window diagram: words and POS tags before, at, and after the current position]
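As an illustration of these templates, the sketch below builds a per-token feature dictionary from the neighboring words, POS tags, a coarse capitalization pattern, and a dictionary-membership flag. The window size, pattern encoding, and names are assumptions for illustration and do not reproduce the tagger's exact feature set.

```python
import re
from typing import Dict, List, Set

def cap_pattern(word: str) -> str:
    """Collapse a token into a coarse capitalization/digit pattern, e.g. 'D1' -> 'A0'."""
    pattern = re.sub(r"[A-Z]", "A", word)
    pattern = re.sub(r"[a-z]", "a", pattern)
    pattern = re.sub(r"[0-9]", "0", pattern)
    return re.sub(r"(.)\1+", r"\1", pattern)  # squeeze repeated symbols

def token_features(tokens: List[str], pos: List[str], i: int,
                   protein_dict: Set[str]) -> Dict[str, bool]:
    """Binary features for position i: current/neighboring words and POS tags,
    capitalization pattern, and a protein-dictionary flag."""
    feats = {
        f"word={tokens[i].lower()}": True,
        f"pos={pos[i]}": True,
        f"cap={cap_pattern(tokens[i])}": True,
        "in_protein_dict": tokens[i] in protein_dict,
    }
    if i > 0:
        feats[f"word-1={tokens[i - 1].lower()}"] = True
        feats[f"pos-1={pos[i - 1]}"] = True
    if i + 1 < len(tokens):
        feats[f"word+1={tokens[i + 1].lower()}"] = True
        feats[f"pos+1={pos[i + 1]}"] = True
    return feats
```

Feature dictionaries of this kind, paired with the O/B/C/E/U tag sequence, are what a CRF toolkit would consume during training.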

Page 14: University of Texas at Austin


1) A CRF tagger for protein names

The CRF model is trained on 750 Medline abstracts manually annotated for proteins.

Experimentally, CRFs give better performance than Maximum Entropy models – they allow local tagging decisions to compete against each other in a global sentence model.

The model is used for tagging a large set (750K) of Medline abstracts citing the word ‘human’.

Each extracted protein is assigned a normalized confidence value.

For the Interaction Extraction step, we keep only proteins scoring 0.8 or better.

Page 15: University of Texas at Austin


2.1) Interaction Extraction using Co-citation Analysis

Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins.

Compute the probability of co-citation under a random model (hyper-geometric distribution).

P(k \mid N, n, m) = \frac{\binom{n}{k} \binom{N-n}{m-k}}{\binom{N}{m}}

N – total number of abstracts (750K)
n – abstracts citing the first protein
m – abstracts citing the second protein
k – abstracts citing both proteins
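The random model translates directly into code; below is a minimal sketch using exact binomial coefficients (the function name and toy numbers are illustrative). In practice one may also sum this term over k' >= k to obtain the tail probability of seeing at least k co-citations.

```python
from math import comb

def cocitation_prob(k: int, N: int, n: int, m: int) -> float:
    """Hypergeometric probability of exactly k co-citing abstracts when
    n of N abstracts cite the first protein and m cite the second."""
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)

# Toy numbers: 750,000 abstracts, 60 cite protein A, 40 cite protein B,
# and 5 cite both -- far more co-citations than expected by chance.
print(cocitation_prob(5, 750_000, 60, 40))
```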

Page 16: University of Texas at Austin


2.1) Interaction Extraction using Co-citation Analysis

Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model.

Empirically, protein pairs whose observed co-citation rate is given a low probability under the random model score high on the functional annotation benchmark.

RESULT: Close to 15K extracted interactions that score comparably to or better than HPRD on the functional annotation benchmark.

Page 17: University of Texas at Austin


2.1) Co-citation Analysis with Bayesian Reranking

1. Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions.

2. For a given pair of proteins, compute the average score of co-citing abstracts.

3. Use the average score to re-rank the 15k already extracted pairs.

[Pipeline diagram] Medline abstract => CRF tagger => Medline abstract (proteins tagged) => Co-citation Analysis => Ranked Interactions => re-ranking with Naïve Bayes scores => Re-ranked Interactions
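A minimal sketch of steps 2 and 3 above, assuming per-abstract scores from the Naïve Bayes classifier are already available; the data structures and identifiers are illustrative.

```python
from typing import Dict, List, Tuple

def rerank_by_abstract_score(
    cociting_abstracts: Dict[Tuple[str, str], List[str]],
    abstract_score: Dict[str, float],
) -> List[Tuple[Tuple[str, str], float]]:
    """Re-rank protein pairs by the average classifier score of the abstracts
    that co-cite them (higher = more likely to discuss physical interactions)."""
    averaged = {}
    for pair, abstracts in cociting_abstracts.items():
        scores = [abstract_score[a] for a in abstracts if a in abstract_score]
        averaged[pair] = sum(scores) / len(scores) if scores else 0.0
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Toy usage with hypothetical abstract IDs and classifier scores.
pairs = {("cyclin D1", "p9Ckshs1"): ["a1", "a2"], ("cyclin D1", "p53"): ["a3"]}
scores = {"a1": 0.9, "a2": 0.7, "a3": 0.2}
print(rerank_by_abstract_score(pairs, scores))
```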

Page 18: University of Texas at Austin


Integrating Extracted Data with Existing Databases

Extracted: 6,580 interactions between 3,737 human proteins.

Total: 31,609 interactions between 7,748 human proteins.

Page 19: University of Texas at Austin


2.1) Co-citation Analysis: Evaluation

Page 20: University of Texas at Austin


2.1) Co-citation Analysis: Evaluation

Page 21: University of Texas at Austin


2.2) Learning of Interaction Extractors

Proteins may be co-cited for reasons other than interactions.

Solution: sentence level extraction, with a binary classifier.

Given a sentence containing the two protein names, output:

Positive: if the sentence asserts an interaction between the two.

Negative: otherwise.

If the sentence contains n > 2 protein names, replicate it into (n choose 2) sentences, each with only two protein names.

Training data: AImed, a collection of Medline abstracts, manually tagged.
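To make the (n choose 2) replication step above concrete, here is a minimal sketch; representing a candidate instance as a (protein, protein, sentence) triple is an assumption for illustration.

```python
from itertools import combinations
from typing import List, Tuple

def replicate_instances(sentence: str,
                        protein_mentions: List[str]) -> List[Tuple[str, str, str]]:
    """Turn one sentence with n protein mentions into C(n, 2) candidate
    instances, each focused on a single pair of mentions."""
    return [(p1, p2, sentence) for p1, p2 in combinations(protein_mentions, 2)]

sent = "... cyclin D1 is associated with both p34cdc2 and p33cdk2 ..."
for instance in replicate_instances(sent, ["cyclin D1", "p34cdc2", "p33cdk2"]):
    print(instance)  # 3 candidate pairs from one sentence
```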

Page 22: University of Texas at Austin


AImed

Total of 225 documents (200 with interactions + 25 without interactions), annotated for proteins and interactions.

In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity …

cyclin D1 … becomes associated with p9Ckshs1 => Interaction

cyclin D1 is associated with both p34cdc2 => Interaction

cyclin D1 is associated with both p34cdc2 and p33cdk2 => Interaction

Page 23: University of Texas at Austin


ELCS (Extraction using Longest Common Subsequences)

A new method for inducing rules that extract interactions between previously tagged proteins.

Each rule consists of a sequence of words with allowable word gaps between them, similar to [Blaschke & Valencia, 2001, 2002].

- (7) interactions (0) between (5) PROT (9) PROT (17) .

Any pair of proteins in a sentence forms a positive example if the pair is tagged as interacting; otherwise it forms a negative example.

Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.

[Bunescu et al., 2005]

Page 24: University of Texas at Austin


ERK (Extraction using a Relation Kernel)

The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names.

The feature space can be further pruned down – in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns:

[FI] Fore-Inter: ‘interaction of P1 with P2’, ‘activation of P1 by P2’

[I] Inter: ‘P1 interacts with P2’, ‘P1 is activated by P2’

[IA] Inter-After: ‘P1 – P2 complex’, ‘P1 and P2 interact’

Restrict the three types of patterns to use at most 4 words (besides the two protein anchors).

Page 25: University of Texas at Austin


ERK (Extraction using a Relation Kernel)

The kernel K(S1,S2) counts the number of common patterns between S1 and S2, weighted by their span in the two sentences.

K(S1,S2) can be computed with the dynamic programming procedure from [Lodhi et al., 2002].

Train an SVM model to find a max-margin linear discriminator between positive and negative examples.

S1 In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

S2 Experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and

[FI] patterns: “human cells P1 associated with P2”, …

[I] patterns: “P1 associated with P2”, …

[IA] patterns: “P1 associated with P2 ,”, …
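For illustration only, the sketch below computes a simplified variant of such a kernel: it keeps just the tokens between the two protein anchors (the [I] pattern type), enumerates their sparse subsequences of up to 4 words explicitly, weights each by a gap penalty raised to its span, and takes the dot product of the resulting feature maps. The efficient dynamic program of [Lodhi et al., 2002] and the fore/after context are not reproduced; LAMBDA and all names are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

LAMBDA = 0.75  # gap penalty: subsequences spread over longer spans count less

def middle_patterns(middle: List[str], max_len: int = 4) -> Dict[Tuple[str, ...], float]:
    """Explicit feature map: every sparse subsequence of up to max_len tokens
    between the two protein anchors, weighted by LAMBDA ** span."""
    feats: Dict[Tuple[str, ...], float] = defaultdict(float)
    for length in range(1, max_len + 1):
        for idx in combinations(range(len(middle)), length):
            span = idx[-1] - idx[0] + 1
            feats[tuple(middle[i] for i in idx)] += LAMBDA ** span
    return feats

def kernel(middle1: List[str], middle2: List[str]) -> float:
    """K(S1, S2) as a dot product of the two pattern feature maps."""
    f1, f2 = middle_patterns(middle1), middle_patterns(middle2)
    return sum(weight * f2[p] for p, weight in f1.items() if p in f2)

# Tokens between the protein anchors in the two example sentences above.
s1_middle = "is induced in early G1 and becomes associated with".split()
s2_middle = "is associated with both".split()
print(kernel(s1_middle, s2_middle))
```

A Gram matrix built with such a function could then be handed to an SVM that accepts precomputed kernels (e.g. scikit-learn's SVC(kernel='precomputed')).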

Page 26: University of Texas at Austin


Evaluation: ERK vs ELCS vs Manual

Compare results using the standard measures of precision and recall:

\text{precision} = \frac{\#\ \text{correct interactions extracted}}{\#\ \text{interactions extracted}}

\text{recall} = \frac{\#\ \text{correct interactions extracted}}{\#\ \text{interactions in the corpus}}

All three systems were tested on AImed, using gold-standard proteins.
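A small helper for these two measures over unordered protein pairs; names are illustrative, and AImed-specific evaluation details are not reproduced.

```python
from typing import Iterable, Tuple

def precision_recall(extracted: Iterable[Tuple[str, str]],
                     gold: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision and recall over unordered protein pairs."""
    ext = {frozenset(p) for p in extracted}
    ref = {frozenset(p) for p in gold}
    correct = len(ext & ref)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

print(precision_recall([("cyclin D1", "p9Ckshs1"), ("cyclin D1", "p53")],
                       [("p9Ckshs1", "cyclin D1")]))  # (0.5, 1.0)
```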

Page 27: University of Texas at Austin


Evaluation: ERK vs ELCS vs Manual

Page 28: University of Texas at Austin


Future Work & Conclusions

Future Work:

Analyze the complete set of 750K abstracts using the relation kernel and integrate the results into an improved composite dataset.

Conclusions:

Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases.

Final database:

31,609 interactions between 7,748 human proteins.

Page 29: University of Texas at Austin


For Further Information

• Consolidated database available online: http://bioinformatics.icmb.utexas.edu/idserve/

• Papers available online: http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html

• “Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome,” Ramani, A.K., Bunescu, R.C., Mooney, R.J. and Marcotte, E.M., Genome Biology, 6(5), r40, 2005.

• “Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions,” Arun Ramani, Edward Marcotte, Razvan Bunescu, Raymond Mooney, to appear in the Proceedings of the ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.

• “Collective Information Extraction with Relational Markov Networks,” Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004.

• “Comparative Experiments on Learning Information Extractors for Proteins and their Interactions,” Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33(2), 2005, pp. 139-155.

Page 30: University of Texas at Austin


The End

Page 31: University of Texas at Austin


Protein Interaction Datasets – Normalization

Need a shared convention for referencing proteins and their interactions.

Map each interacting protein to a LocusLink ID => small loss of proteins.

Consider interactions symmetric => many duplicates eliminated.

Omit self-interactions – they cannot be evaluated on the functional annotation benchmark.

Example: HPRD reduced from 12,013 to 6,054 unique symmetric, non-self interactions.
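A minimal sketch of this normalization, assuming a mapping from database-specific protein identifiers to LocusLink IDs is available; the mapping and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def normalize_interactions(
    raw_pairs: List[Tuple[str, str]],
    to_locuslink: Dict[str, str],
) -> Set[Tuple[str, str]]:
    """Map both partners to LocusLink IDs, treat interactions as symmetric
    (store each pair in sorted order), and drop self-interactions and
    pairs with unmappable proteins."""
    unique: Set[Tuple[str, str]] = set()
    for a, b in raw_pairs:
        la, lb = to_locuslink.get(a), to_locuslink.get(b)
        if la is None or lb is None or la == lb:
            continue  # unmappable protein or self-interaction
        unique.add((la, lb) if la < lb else (lb, la))
    return unique
```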

Page 32: University of Texas at Austin


Protein Interaction Datasets – Normalization

Dataset statistics after normalization (Is = interactions, Ps = proteins):

Dataset            Version     Total Is (Ps)     Self Is (Ps)    Unique Is (Ps)
Reactome           08/03/04    12,497 (6,257)    160 (160)       12,336 (807)
BIND               08/03/04    6,212 (5,412)     549 (549)       5,663 (4,762)
HPRD               04/12/04    12,013 (4,122)    3,028 (3,028)   6,054 (2,747)
Orthology (all)    03/31/04    71,497 (6,257)    373 (373)       71,124 (6,228)
Orthology (core)   03/31/04    11,488 (3,918)    206 (206)       11,282 (3,863)

Page 33: University of Texas at Austin


Accuracy of manually curated interactions

Functional Annotation Benchmark (Database: LLR)
  Reactome: 3.8
  BIND: 2.9
  HPRD: 2.1
  Core orthology: 2.1
  Non-core orthology: 1.1

Physical Interaction Benchmark (Database: LLR)
  Core orthology: 5.0
  HPRD: 3.7
  Non-core orthology: 3.7
  Reactome, BIND: N/A (used to define the benchmark)