Current challenges and opportunities for the text mining of interactions Raul Rodriguez-Esteban Data Science, pRED Informatics Roche Innovation Center Basel

Keynote conference talk - June 17th 2015




pRED Informatics – your scientific informatics experts

We are:

Your scientific data experts within Roche pRED

Connecting research, knowledge and people across Roche pRED

Scientists and informatics professionals united in a single organization

Information technology scouts for Roche pRED


Data Science at Roche

Applies the concept of mixed informatics capability teams to retrieve and analyze data to support drug project decision-making.

Capabilities
• Cheminformatics
• Bioinformatics
• Text mining
• Information science
• Genomics
• Genetics
• ...


Text mining in Data Science

Main goal
Supporting the decision making of R&D projects with tools and expertise in text mining.

Some strategic themes
• Maintaining a text mining infrastructure for R&D.
• Testing new text mining products that could be beneficial to R&D projects.
• Proposing and implementing initiatives to improve our text mining capabilities.
• Increasing the value of our open-domain and licensed content.


The flow of the talk

1. Curation
2. Name recognition
3. Interactions


CURATION


Curation drives our text mining strategies

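$\text{overhead} = \dfrac{\text{false positives}}{\text{true positives}} = \dfrac{1 - \text{precision}}{\text{precision}}$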

With precision of 75%, you get 1 false positive for every 3 true positives. Overhead is 1/3=33%.

Acceptable overhead

With precision of 50%, overhead is 100%, you get one false positive for every true positive.

Unacceptable overhead

It all depends on how much curation time is available.

1. In our work, we almost always need curation: curation is a crucial constraint.
2. Overhead determines the need for curation.
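The two overhead figures on this slide (33% at 75% precision, 100% at 50% precision) are consistent with reading overhead as the number of false positives per true positive, that is (1 − precision) / precision. The sketch below assumes that reading; the function name is hypothetical.

```python
def curation_overhead(precision: float) -> float:
    """False positives a curator must reject for each true positive kept.

    Assumes overhead = (1 - precision) / precision, which reproduces the
    two worked figures on the slide.
    """
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return (1.0 - precision) / precision


# Print the overhead for a few precision levels.
for p in (0.90, 0.80, 0.75, 0.50, 0.25):
    print(f"precision {p:.0%} -> overhead {curation_overhead(p):.0%}")
```

Because precision sits in the denominator, the overhead grows faster and faster as precision falls.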


Crowdsourcing: no, we can’t

Burger et al. (2014)


Curation

Mining


Corollaries

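$\text{overhead} = \dfrac{\text{false positives}}{\text{true positives}} = \dfrac{1 - \text{precision}}{\text{precision}}$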

1. We probably don’t need text mining systems with precision above 75%-80%.

2. However, when precision goes down, overhead goes up very quickly.

3. But how does this all apply in practice?

We use multiple strategies so that we can choose one depending on the curation effort required or available. We adapt our text mining strategy to our curation resources.
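One way to picture adapting the strategy to the curation resources is a simple selection rule: among the strategies whose full output can be reviewed within the available budget, take the one expected to yield the most true positives. This is only an illustrative sketch. The `Strategy` class, the `pick_strategy` function and all numbers are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    precision: float  # fraction of returned results that are correct
    n_results: int    # number of results a curator would have to review

    @property
    def true_positives(self) -> float:
        # Expected number of correct results after full curation.
        return self.precision * self.n_results


def pick_strategy(strategies, curation_budget):
    """Return the affordable strategy with the most expected true positives.

    curation_budget is the number of results that can be reviewed.
    Returns None when no strategy fits within the budget.
    """
    affordable = [s for s in strategies if s.n_results <= curation_budget]
    if not affordable:
        return None
    return max(affordable, key=lambda s: s.true_positives)


# Hypothetical numbers, for illustration only.
strategies = [
    Strategy("high precision", precision=0.90, n_results=200),
    Strategy("compromise", precision=0.75, n_results=600),
    Strategy("high recall", precision=0.30, n_results=3000),
]

for budget in (500, 1000, 5000):
    chosen = pick_strategy(strategies, budget)
    print(budget, "->", chosen.name)
```

With little curation time the high-precision strategy is the only feasible one; as the budget grows, noisier but more complete strategies become worthwhile.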


The multiple strategy approach: many tools for the same problem


The multiple strategy approach

[Figure: recall plotted against precision, with three operating regions labeled High recall, Compromise and High precision]


Example for gene names

[Figure: recall plotted against precision, locating three approaches: Dictionary-based; Gene Name Identification – Machine Learning; Gene Name Normalization – Machine Learning]


Name recognition


Our current approach to name recognition

Multiple Ontologies
Machine Learning
Open Source and Proprietary

• BANNER for genes / proteins named entity recognition (CBioC: Collaborative Bio Curation, Arizona State University)
• ChemSpot for chemical named entity recognition (Institut für Informatik, Humboldt-Universität zu Berlin)
• ChemAxon for chemical name-to-structure (ChemAxon)
• DNorm for disease named entity recognition + normalization (MeSH, OMIM) (NCBI: National Center for Biotechnology Information, National Institutes of Health)

Open source packages come from a research environment and were not easy to use in a production environment.


I2E from Linguamatics

• General-purpose rule-based
– Ontologies
– Regular expressions
– Shallow parsing
– Boolean logic

• Interactive
– Pre-indexed
– Graphical interface
– Highlighted, compact output

I2E = Interactive Information Extraction


Integration through UIMA

Give the components the same interface for seamless usage:
• XML file for component description
• Parameter configuration
• Shared resources initialization
• Parallel computing (dedicated cluster or multiple servers)
• Data processing
• Access results through indexes


Interactions


Current state for protein-protein interactions

GOOD

BAD


First, the bad news

• Krallinger et al. (2008): BioCreative II, full text and gene name normalization, F-score of 35%.
• Pyysalo et al. (2008): change of corpus lowers F-score.
• Kabiljo et al. (2009): change of corpus lowers F-score; keywords + entity recognition is competitive with machine learning.
• Tikk et al. (2010): change of corpus lowers F-score; rule-based (RelEx) is competitive with machine learning.


Why is performance bad? The leaky and noisy pipeline for PPIs

1. Identify articles that contain interactions
2. Identify and normalize gene names within those articles
3. Identify interactions between those gene names

At each step: loss of true positives, addition of false positives.
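The "leaky" part of the pipeline can be illustrated with a small calculation: if each stage keeps only a fraction of the true positives that reach it, the end-to-end recall is the product of the per-stage recalls. The sketch below assumes the stages lose true positives independently; the per-stage values are hypothetical and are not measurements from the talk.

```python
from math import prod


def end_to_end_recall(stage_recalls):
    """Recall of a sequential pipeline.

    Assumes each stage independently keeps the given fraction of the
    true positives that reach it, so the losses multiply.
    """
    return prod(stage_recalls)


# Hypothetical per-stage recalls, for illustration only.
stages = {
    "identify articles": 0.8,
    "identify and normalize gene names": 0.8,
    "identify interactions": 0.8,
}

overall = end_to_end_recall(stages.values())
print(f"end-to-end recall: {overall:.3f}")
```

Even when every stage looks reasonable on its own, the combined recall drops to about half in this example, while the false positives added at each stage accumulate on top of that.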


Why is performance bad? The problem is ill-defined

Ill-defined problems are those that do not have clear goals, solution paths, or expected solution. (Wikipedia)

• Every gold standard corpus defines interactions somewhat differently.

• “Interactions” is a concept coined for bioNLP. Outside of bioNLP it means something else.

• Interactions might be too broad a concept.


And now for some good news

• Tikk et al. (2010): “[…] we think that there is also a need to complement the currently predominant approach, treating all interactions as equally important, with more specific extraction tasks. To this end, it is important to create specialized corpora, such as those for the extraction of regulation events or for protein complex formation.”

• Some specialized systems perform better than PPI systems:
– Phosphorylation (Tudor et al., 2015)

• Some other interactions besides PPI work better:
– Drug-drug: DDIExtraction 2013, F-score of 65.1%
– Expression (Neves et al., 2013)

• More generally, it makes sense to have multiple strategies for extracting interactions, both for PPIs and for other types of interactions.


Our current multiple strategies

1 - High precision, specialized for biomarkers (DiMA)

2 - Flexible, all-purpose (Linguamatics I2E)

3 - High recall for protein-protein interactions (University of Zürich)


Rule-based system for biomarkers: Disease Marker Associations Database (DiMA)


Generation of the Knowledge Base: Extracting Gene-Disease Relationships

• Genes / Proteins (Linguamatics)
• Relationship (query patterns)
• Diseases (Linguamatics)

Standardized Relationships
• Altered Expression
• Genetic Variation
• Role Marker
• Response Marker
• Regulatory Modification
• Negative Association

Query Development
• Multiple variations for each corpus
• 50+ Sub-queries
• Variations of linguistic patterns

Example

Sentence: The ERBB2 gene (HER2/neu) is overexpressed in many human breast cancers.
• Gene: ERBB2 (EntrezID: 2064, Score: 87)
• Standardized relationship: Altered Expression ("is overexpressed")
• Disease: Breast Neoplasms ("breast cancers")
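To make the idea of a query pattern concrete, here is a minimal sketch of how a rule that combines dictionaries with a trigger expression could turn the ERBB2 example sentence into a standardized relationship. It is not the DiMA or Linguamatics I2E implementation: the dictionaries, the trigger regex and the function name are hypothetical, and real queries rely on full ontologies and shallow parsing rather than a single regular expression.

```python
import re

# Tiny illustrative dictionaries; real ontologies are far larger.
GENES = {"ERBB2": "2064"}                          # gene symbol -> Entrez ID
DISEASES = {"breast cancers": "Breast Neoplasms"}  # surface form -> standardized term
# Standardized relationship -> trigger regex (hypothetical pattern).
RELATIONS = {"Altered Expression": r"is (?:over|under)expressed"}


def extract_relationships(sentence):
    """Return gene-relationship-disease records found in one sentence."""
    records = []
    for symbol, entrez_id in GENES.items():
        for surface, disease in DISEASES.items():
            for relation, trigger in RELATIONS.items():
                # Gene first, then the trigger phrase, then the disease mention.
                pattern = (
                    rf"\b{re.escape(symbol)}\b.*?"
                    rf"\b({trigger})\b.*?"
                    rf"\b{re.escape(surface)}\b"
                )
                match = re.search(pattern, sentence)
                if match:
                    records.append({
                        "gene": symbol,
                        "entrez_id": entrez_id,
                        "relation": relation,
                        "trigger": match.group(1),
                        "disease": disease,
                    })
    return records


sentence = "The ERBB2 gene (HER2/neu) is overexpressed in many human breast cancers."
for record in extract_relationships(sentence):
    print(record)
```

A single fixed word order like this one misses many phrasings, which is one reason a production rule set would need many sub-queries and variations of linguistic patterns.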


Flexible strategy: rule-based + named-entity recognition

BANNER, DNorm

ChemSpot, ChemAxon

Multiple ontologies

Machine learning named-entity recognition

Rules

Ontologies


High recall strategy: Ontogene

High recall systems are typically very noisy.

Strategies to reduce the need for curation:
1 - Fact-centric
2 - Collection-wide
3 - Ranked


Collection-wide and fact-centric view

• Focusing more on facts than on mentions
– The same fact can be redundantly mentioned many times in many documents.
– However, mentions of facts may also reinforce each other.

Rzhetsky et al. (2006)

• Aggregate view of the literature rather than document view.
– We are interested in facts across documents, the “literature-wide” view of a fact.
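The shift from mentions to facts can be sketched as a grouping step: every mention of an interaction is keyed by its unordered gene pair, and the documents that support the pair are collected across the whole collection. The sketch below is illustrative only; the function name, the tuple layout and the placeholder identifiers are assumptions, not the system's actual data model.

```python
from collections import defaultdict


def aggregate_mentions(mentions):
    """Group mentions into facts.

    mentions: iterable of (doc_id, gene_a, gene_b) tuples.
    Returns a dict mapping each unordered gene pair to the set of
    documents in which that pair is mentioned.
    """
    facts = defaultdict(set)
    for doc_id, gene_a, gene_b in mentions:
        # frozenset makes (A, B) and (B, A) the same fact.
        facts[frozenset((gene_a, gene_b))].add(doc_id)
    return facts


# Placeholder mentions, for illustration only.
mentions = [
    ("doc1", "GENE_A", "GENE_B"),
    ("doc1", "GENE_B", "GENE_A"),  # same fact repeated in the same document
    ("doc2", "GENE_A", "GENE_B"),
    ("doc3", "GENE_C", "GENE_D"),
]

facts = aggregate_mentions(mentions)
for pair, docs in facts.items():
    print(sorted(pair), "supported by", len(docs), "document(s)")
```

Four mentions collapse into two facts here, so a curator reviews each fact once while still seeing how many documents support it.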


Ranked, not filtered

Ranking has been somewhat disregarded in text mining.
But not all results are created equal if you have to curate them.
The goal is to present first the results with the highest quality and best biological evidence.
Users are warned that they will find noisy results.

Biological evidence

Mining quality

Our simple approach for biological evidence: based on number of document mentions.
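Ranking instead of filtering can be sketched as a sort that keeps every result but orders them so the best supported come first. In the sketch below, biological evidence is approximated by the number of documents mentioning the fact, as stated on the slide, and ties are broken by a mining-quality score. The field names, the tie-breaking rule and the numbers are hypothetical.

```python
def rank_facts(facts):
    """Order facts for curation without discarding any of them.

    facts: list of dicts with keys
      'pair'        - the interacting genes
      'n_documents' - number of documents mentioning the fact
      'score'       - mining-quality score (hypothetical, higher is better)
    """
    return sorted(
        facts,
        key=lambda f: (f["n_documents"], f["score"]),
        reverse=True,
    )


# Placeholder results, for illustration only.
results = [
    {"pair": ("GENE_A", "GENE_B"), "n_documents": 1, "score": 0.95},
    {"pair": ("GENE_A", "GENE_C"), "n_documents": 12, "score": 0.60},
    {"pair": ("GENE_A", "GENE_D"), "n_documents": 12, "score": 0.80},
]

ranked = rank_facts(results)
for fact in ranked:
    print(fact["pair"], fact["n_documents"], fact["score"])
```

A curator with limited time can stop partway down the list, which is not possible when results are only filtered by a fixed threshold.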


Bonus: trained using a database, not a corpus

Gold-standard corpora are typically small and expensive to develop.
Using a biological database, we can dramatically increase the amount of training data.
BioGRID is a leading biological database which includes information of the type “in a certain article, proteins X and Y are said to interact”.
A training set based on BioGRID covers interactions from 20,928 PubMed abstracts describing physical interactions in humans.
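Training from a database rather than a corpus can be pictured as a labeling step: candidate gene pairs found together in an abstract are marked positive when the database records that pair as interacting in that same article, and negative otherwise. The sketch below only illustrates this idea. It does not use BioGRID's actual file format or any real API, the record layout and identifiers are placeholders, and the authors' real training procedure may differ.

```python
def label_candidates(candidates, database_records):
    """Label candidate pairs using article-level database records.

    candidates: iterable of (pmid, gene_a, gene_b) pairs co-mentioned
        in an abstract.
    database_records: iterable of (pmid, gene_a, gene_b) curated
        interactions, in the spirit of "in a certain article, proteins
        X and Y are said to interact".
    Returns a list of ((pmid, gene_a, gene_b), label) with label True
    when the database supports the pair for that article.
    """
    known = {(pmid, frozenset((a, b))) for pmid, a, b in database_records}
    return [
        ((pmid, a, b), (pmid, frozenset((a, b))) in known)
        for pmid, a, b in candidates
    ]


# Placeholder data, for illustration only.
database_records = [("PMID_1", "GENE_A", "GENE_B")]
candidates = [
    ("PMID_1", "GENE_B", "GENE_A"),  # same pair, same article -> positive
    ("PMID_1", "GENE_A", "GENE_C"),  # co-mentioned but not in the database
    ("PMID_2", "GENE_A", "GENE_B"),  # known pair, but a different article
]

training_set = label_candidates(candidates, database_records)
for example, label in training_set:
    print(example, label)
```

Labels obtained this way are noisier than a hand-annotated corpus, since the database says that an article reports the interaction, not which sentence does.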


The interface of the system: gene-centric

Gene A
– Interacting gene B1
• Example mention B1.1
• Example mention B1.2
• …
– Interacting gene B2
• Example mention B2.1
• Example mention B2.2
• …
– Interacting gene B3
• Example mention B3.1
• Etc.


The interface of the system


Take home

• Current state of protein-protein interaction extraction is poor but strategies to cope are available.

• A multiple strategy approach makes it feasible to address different types of questions.

• We have recently developed a high-recall strategy together with University of Zürich. The approach is collection-wide, fact-centric, ranked and trained on BioGRID.

• Future approaches may increasingly hinge on specialization in certain types of interactions and certain tasks.


Acknowledgements

Roche pREDi: Hermann Biller, Barbara Endler-Jobst, Ralf Jaeger, Aurélien Ooms, Martin Romacker

Roche Diagnostics: Martin Baron

Uni Zürich: Simon Clematide, Lenz Furrer, Hernani Marques, Fabio Rinaldi

NCBI: Robert Leaman


Doing now what patients need next