LESSONS FROM THE BIOCREATIVE PROTEIN-PROTEIN INTERACTION (PPI) TASK
RegCreative Jamboree, Friday, December 1st, 2006
MARTIN KRALLINGER, 2006



PROTEIN-PROTEIN INTERACTIONS (PPI)


M. Krallinger and A. Valencia. Applications of Text Mining in Molecular Biology, from Name Recognition to Protein Interaction Maps. In Data Analysis and Visualization in Genomics and Proteomics, chapter 4, Wiley.

- Crucial to understanding the functional role of proteins
- Relevant for the organization of biological processes
- Development of high-throughput experimental technologies
- Implications of PPI for gene regulation (TFs and co-regulators)
- Interaction networks and diseases (e.g. cancer)


PPI ANNOTATION AND DATABASES


Database  Reference                 URL
MINT      Zanoni et al., 2002       http://mint.bio.uniroma2.it/mint
IntAct    Hermjakob et al., 2004    http://www.ebi.ac.uk/intact
DIP       Xenarios et al., 2002     http://dip.doe-mbi.ucla.edu/
HPID      Han et al., 2004          http://www.hpid.org
HPRD      Peri et al., 2004         http://www.hprd.org/


- IMEx agreement to share curation efforts
- Proteomics Standards Initiative (PSI) recommendations
- Molecular Interaction (MI) ontology
- Large-scale experiments
- Literature curation


BIOCREATIVE PPI TASK


- Rapid literature growth and manual curation
- Automatic extraction of protein-protein interactions from text
- Variety of published strategies

Main goals:

(1) To determine the state of the art
(2) To produce useful resources for training and testing
(3) To learn which approaches are successful and practical
(4) To monitor interesting new approaches
(5) To provide useful tools to extract protein-protein interactions from texts

Task design resembles manual curation process steps

[Figure: structured record]


Second BioCreative challenge evaluation

http://biocreative.sourceforge.net/index.html


INTERACTION ARTICLE SUBTASK (IAS)


[Figure labels: RELEVANT, NOT RELEVANT]

- Identify those articles which are curation relevant
- Document categorization task
- Based on PubMed abstracts
- Training set consisted of:
  (1) P: abstracts of PPI-relevant articles from MINT/IntAct
  (2) N: abstracts not relevant for PPI (exhaustive curation)
  (3) P*: abstracts of interaction-relevant articles from other databases
- Return two collections of ranked documents: P, N
- Evaluation: precision, recall, F-score and AROC (area under the ROC curve)
- Participating systems: supervised learning
- Balanced test set, recent publications
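The evaluation measures listed for the IAS can be sketched in a few lines of Python. This is only an illustration under textbook definitions, not the official BioCreative scoring script; the function names and the toy PMIDs and scores are assumptions made for the example.

```python
def precision_recall_f1(gold, predicted):
    """gold, predicted: sets of article IDs labelled as PPI-relevant."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def area_under_roc(scores, labels):
    """AROC as the probability that a randomly chosen relevant article
    is scored above a randomly chosen non-relevant one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical IDs and classifier scores).
gold = {"pmid1", "pmid2", "pmid3"}
predicted = {"pmid2", "pmid3", "pmid4", "pmid5"}
print(precision_recall_f1(gold, predicted))
print(area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise AROC computation is quadratic in the number of documents, which is acceptable for a test set of abstract-collection size but would be replaced by a rank-based formula for larger collections.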


LESSON I: IAS TASK AND OREGANNO


- Determine relevance of abstract vs. full text for article selection
- Balanced training collection: positive and negative
- Avoid using journal and date as classifier features
- Define training and test set in terms of publication date, e.g.:
  - Training set: published before 2003
  - Test set: published after 2003
- Enriched training data: sentences with relevant evidence
- Define basic selection strategy:
  - Exhaustive curation of a set of journals: high recall
  - Whole PubMed mining: high precision
- Curation relevance and annotation types
- Integration of resulting applications into annotation pipeline
- Interactive evaluation: timing and annotation efficiency
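Two of the points above, splitting by publication date and keeping journal and date out of the feature set, can be illustrated with a small sketch. The `Abstract` record, its field names and the sample records are hypothetical; how articles published in the cutoff year itself should be handled is not stated on the slide, so this sketch simply leaves them out of both sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Abstract:
    # Hypothetical record layout, for illustration only.
    pmid: str
    year: int
    journal: str
    text: str
    relevant: bool


def split_by_year(abstracts, cutoff_year=2003):
    """Training set: published before the cutoff; test set: published after it."""
    train = [a for a in abstracts if a.year < cutoff_year]
    test = [a for a in abstracts if a.year > cutoff_year]
    return train, test


def text_features(abstract):
    """Bag-of-words features from the abstract text only.
    Journal and year are deliberately not used as features."""
    return set(abstract.text.lower().split())


corpus = [
    Abstract("a1", 2001, "Journal A", "protein X binds protein Y", True),
    Abstract("a2", 2002, "Journal B", "a survey of plant morphology", False),
    Abstract("a3", 2005, "Journal A", "protein Z interacts with protein X", True),
]
train, test = split_by_year(corpus)
print([a.pmid for a in train], [a.pmid for a in test])
```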


INTERACTION PAIR SUBTASK (IPS)


PMID: 11739376
Interactor 1: P73213_SYNY3 (Ssr2857 protein)
Interactor 2: ATCS_SYNY3 (pacS protein)

- Identify protein-protein interaction pairs from full text articles (HTML, PDF)
- Individual proteins identified using UniProt ID/accession number
- Restrict / define a baseline UniProt release
- Extraction of physical PPI (MI ontology)
- Training set: articles and associated PPI pairs
- System output: for each article, a ranked list of PPI pairs
- Evaluation: precision and recall of predicted pairs compared to manual annotation
- Main difficulties: gene normalization / inter-species ambiguity
- No limitation in organism source
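As a rough sketch of how predicted pairs for one article might be compared with the manual annotation, the code below treats an interaction as an unordered pair of UniProt identifiers. It reuses the example record from this slide as the gold standard; the second predicted pair uses a made-up identifier (`HYPOTHETICAL_SYNY3`), and the function names are assumptions rather than part of the official evaluation.

```python
def normalize_pair(interactor_1, interactor_2):
    """Order-independent key, so (A, B) and (B, A) count as the same pair."""
    return tuple(sorted((interactor_1, interactor_2)))


def score_article(gold_pairs, predicted_pairs):
    """Precision and recall of predicted pairs against the curated pairs."""
    gold = {normalize_pair(a, b) for a, b in gold_pairs}
    predicted = {normalize_pair(a, b) for a, b in predicted_pairs}
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Gold annotation for PMID 11739376, taken from the example on the slide.
gold = [("P73213_SYNY3", "ATCS_SYNY3")]
# System output: first pair is correct (reversed order), second is invented.
predicted = [
    ("ATCS_SYNY3", "P73213_SYNY3"),
    ("ATCS_SYNY3", "HYPOTHETICAL_SYNY3"),
]
print(score_article(gold, predicted))
```

Normalizing both sides to the same UniProt release before comparison matters in practice, which is presumably why the slide asks for a baseline release to be fixed.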


LESSON II: IPS TASK AND OREGANNO


GENERAL ASPECTS
- Difficulties due to inter-organism gene name ambiguity
- Difficulty in differentiating experimentally confirmed interactions
- Importance of additional lexical resources
- Indirect expressions for interactions
- Author names of the protein interactors for training
- Protein family ambiguity

ASPECTS FOR A GENE REGULATION EXTRACTION TASK
- Define database for gene normalization
- Consider experimentally confirmed regulation
- Bio-entity types: protein vs. gene (promoter) name finding
- Provide negative and positive training examples of co-occurrences (passages) compared to manual annotation
- Define actual evaluation metric depending on the needs


INTERACTION SENTENCE SUBTASK (ISS)


- Select the most relevant sentence expressing a protein-protein interaction from the full text article
- Useful for human interpretation and summary generation
- Provide for each interaction pair a ranked list of at most 5 evidence passages (max. 3 sentences)
- Pooling method of the predicted passages
- Evaluation: percentage of relevant sentences with respect to the total number submitted, and mean reciprocal rank of the passages compared to the manual ones
- Example: "Using a biochemical approach to search for such co-regulatory factors, we identified hGCN5, TRRAP, and hMSH2/6 as BRCA1-interacting proteins."
- Additional collections included: Prodisen collection, Veuthey collection, Brun collection, GeneRIF interaction sentences

M. Krallinger, R. Malik and A. Valencia. Text Mining and Protein Annotations: the Construction and Use of Protein Description Sentences. Genome Informatics, Vol. 17, No. 2.
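The two ISS evaluation measures named above, the fraction of submitted passages judged relevant and the mean reciprocal rank (MRR), can be sketched as follows. This uses the standard MRR definition (reciprocal of the rank of the first relevant item, zero if none is found); whether the official scoring handled ties or missing predictions in exactly this way is not stated on the slide. The passage IDs are placeholders.

```python
def reciprocal_rank(ranked_passages, relevant):
    """1 / rank of the first relevant passage, or 0.0 if none is relevant."""
    for rank, passage in enumerate(ranked_passages, start=1):
        if passage in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_passages, relevant_set), one per interaction pair."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def fraction_relevant(runs):
    """Relevant passages as a share of all submitted passages."""
    submitted = sum(len(ranked) for ranked, _ in runs)
    hits = sum(1 for ranked, rel in runs for passage in ranked if passage in rel)
    return hits / submitted if submitted else 0.0


# Placeholder passage IDs; each tuple is one interaction pair's ranked list.
runs = [
    (["s1", "s2", "s3"], {"s2"}),      # first relevant passage at rank 2
    (["s4", "s5"], {"s4", "s5"}),      # first relevant passage at rank 1
    (["s6"], set()),                   # nothing relevant
]
print(mean_reciprocal_rank(runs), fraction_relevant(runs))
```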


LESSON III: ISS TASK AND OREGANNO


GENERAL ASPECTS
- Difficulties due to lack of collections of ‘negative training sentences’
- Need for larger (additional) sets of training instances from full text
- Complex descriptions referring to interactions
- Protein normalization and protein family name ambiguity problems
- Multiple-sentence evidence cases (referring expressions, anaphora)
- Importance of figure legends and certain section titles
- Article format dependency (PDF vs. HTML)

ASPECTS FOR A GENE REGULATION EXTRACTION TASK
- Define semantic types of (or structure for) comment fields
- Length restriction of training passages
- Restriction to certain format types and journals
- Define type of passage which should be extracted: for gene regulation or for evidence type annotation


INTERACTION METHOD SUBTASK (IMS)


- Identify protein-protein interaction pairs from full text articles together with the interaction detection method
- Map to the MI ontology (CV)
- Maximum of 5 MI terms for a PPI pair
- Extraction of physical PPI (MI ontology)
- Mean reciprocal rank compared to the manual annotation

<ENTRY>
  <PPI_SUB_TASK_ID> BC2_PPI_IMS </PPI_SUB_TASK_ID>
  <TEAM_ID> T1_BC2_PPI </TEAM_ID>
  <RUN_NR> 1 </RUN_NR>
  <PMID> 10924507 </PMID>
  <INTERACTION_PAIR>
    <INTERACTOR_1> Q08211 </INTERACTOR_1>
    <INTERACTOR_2> Q9UBU9 </INTERACTOR_2>
  </INTERACTION_PAIR>
  <INT_DET_METHOD>
    <INT_DET_METHOD_ID> MI:0004 </INT_DET_METHOD_ID>
    <RANK> 1 </RANK>
  </INT_DET_METHOD>
</ENTRY>
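A submission entry of the form shown above could be read with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element names come from the slide; the `parse_entry` function, the returned dictionary layout, and the assumption that one entry may carry several `INT_DET_METHOD` elements (up to the stated maximum of 5) are illustrative choices, not a description of the official submission parser.

```python
import xml.etree.ElementTree as ET

ENTRY_XML = """
<ENTRY>
  <PPI_SUB_TASK_ID> BC2_PPI_IMS </PPI_SUB_TASK_ID>
  <TEAM_ID> T1_BC2_PPI </TEAM_ID>
  <RUN_NR> 1 </RUN_NR>
  <PMID> 10924507 </PMID>
  <INTERACTION_PAIR>
    <INTERACTOR_1> Q08211 </INTERACTOR_1>
    <INTERACTOR_2> Q9UBU9 </INTERACTOR_2>
  </INTERACTION_PAIR>
  <INT_DET_METHOD>
    <INT_DET_METHOD_ID> MI:0004 </INT_DET_METHOD_ID>
    <RANK> 1 </RANK>
  </INT_DET_METHOD>
</ENTRY>
"""

MAX_METHODS_PER_PAIR = 5  # "Maximum of 5 MI for a PPI pair" on the slide


def parse_entry(xml_text):
    """Parse one <ENTRY> into a plain dictionary (hypothetical layout)."""
    entry = ET.fromstring(xml_text.strip())

    def text_of(path):
        # The sample pads values with spaces, so strip them.
        return entry.findtext(path).strip()

    methods = sorted(
        (int(m.findtext("RANK").strip()), m.findtext("INT_DET_METHOD_ID").strip())
        for m in entry.findall("INT_DET_METHOD")
    )
    if len(methods) > MAX_METHODS_PER_PAIR:
        raise ValueError("more than 5 detection methods for one PPI pair")
    return {
        "team": text_of("TEAM_ID"),
        "run": int(text_of("RUN_NR")),
        "pmid": text_of("PMID"),
        "pair": (
            text_of("INTERACTION_PAIR/INTERACTOR_1"),
            text_of("INTERACTION_PAIR/INTERACTOR_2"),
        ),
        "methods": methods,  # list of (rank, MI identifier), best rank first
    }


print(parse_entry(ENTRY_XML))
```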


LESSON IV: IMS AND OREGANNO


GENERAL ASPECTS
- Difficulties due to lack of training sentences for methods
- Very complex task: both PPI pairs and terms for methods
- Community focused more on IPS than on IMS (too much task overlap)
- Difficulty in separating PPI pair and interaction detection method identification
- Different parts of documents refer to the method
- Information in non-textual data (e.g. figures)

ASPECTS FOR A GENE REGULATION EXTRACTION TASK
- Define controlled vocabulary relevant for annotation (e.g. evidence types)
- Provide lexical resources for evidence types (synonyms, …)
- Mapping of controlled vocabulary (ontology concepts) to full text


REGCREATIVE TEXT MINING TASKS


Different tasks which might result in an automatic, annotation-relevant summary, which could include:

0. Detection of relevant articles (document categorization & ranking)
1. Ranked (normalized) TF list extracted from the paper
2. Ranked list of regulated genes extracted from the paper
3. Ranked list of evidence types (and subtypes) extracted from the articles together with text passages
4. Ranked list of associations between TF and regulated genes together with evidence text
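One way to picture the summary described in items 0 to 4 is as a single record per article holding a relevance score and four ranked lists. The sketch below is purely illustrative: the class and field names are invented here, and the sample values (`TF_A`, `GENE_B`, the PMID and the evidence string) are placeholders rather than real annotations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RankedItem:
    rank: int                 # 1 = most confident
    value: str                # normalized identifier or label
    evidence_text: str = ""   # supporting passage, where applicable


@dataclass
class AnnotationSummary:
    pmid: str
    relevance_score: float                                                 # item 0
    transcription_factors: List[RankedItem] = field(default_factory=list)  # item 1
    regulated_genes: List[RankedItem] = field(default_factory=list)        # item 2
    evidence_types: List[RankedItem] = field(default_factory=list)         # item 3
    tf_gene_associations: List[RankedItem] = field(default_factory=list)   # item 4


def best(items: List[RankedItem]) -> Optional[RankedItem]:
    """Return the top-ranked item, or None for an empty list."""
    return min(items, key=lambda item: item.rank) if items else None


summary = AnnotationSummary(
    pmid="placeholder-pmid",
    relevance_score=0.87,
    transcription_factors=[RankedItem(2, "TF_C"), RankedItem(1, "TF_A")],
    regulated_genes=[RankedItem(1, "GENE_B")],
    tf_gene_associations=[RankedItem(1, "TF_A -> GENE_B", "placeholder evidence sentence")],
)
print(best(summary.transcription_factors).value)
```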


Acknowledgements


- MINT and IntAct for providing the training and test data collections
- Publishers for allowing use of the full text articles (NPG and Elsevier)
- MITRE, NCBI for collaboration in organizing the BioCreative Challenge
- CNIO for their assistance
- Thanks to Lynette Hirschman and Alfonso Valencia for their coordination.
- Thanks to the participating teams from all over the world for their effort in developing the participating systems.

Detailed results will be presented at the BioCreative II Evaluation Workshop, sponsored by the European Science Foundation (ESF), on 23-25 April 2007 at CNIO, Madrid, and in a special issue of Genome Biology.

http://biocreative.sourceforge.net/index.html