Keynote conference talk - June 17th 2015

Current challenges and opportunities for the text mining of interactions

Raul Rodriguez-Esteban
Data Science, pRED Informatics
Roche Innovation Center Basel

pRED Informatics – your scientific informatics experts

We are:

Your scientific data experts within Roche pRED

Connecting research, knowledge and people across Roche pRED

Scientists and informatics professionals united in a single organization

Information technology scouts for Roche pRED

Data Science at Roche

Applies the concept of mixed informatics capability teams to retrieve and analyze data to support drug project decision-making.

Capabilities
• Cheminformatics
• Bioinformatics
• Text mining
• Information science
• Genomics
• Genetics
• ...

Text mining in Data Science

Main goal

Supporting the decision making of R&D projects with tools and expertise in text mining.

Some strategic themes

• Maintaining a text mining infrastructure for R&D.

• Testing new text mining products that could be beneficial to R&D projects.

• Proposing and implementing initiatives to improve our text mining capabilities.

• Increasing the value of our open-domain and licensed content.

The flow of the talk

1. Curation
2. Name recognition
3. Interactions

CURATION

Curation drives our text mining strategies

overhead = false positives / true positives = (1 - precision) / precision

With precision of 75%, you get 1 false positive for every 3 true positives. Overhead is 1/3=33%.

Acceptable overhead

With precision of 50%, overhead is 100%, you get one false positive for every true positive.

Unacceptable overhead

It all depends on how much curation time is available.
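The overhead arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes overhead is the number of false positives per true positive, which is what the 75% and 50% examples imply; the function name `curation_overhead` is introduced here for illustration only.

```python
def curation_overhead(precision: float) -> float:
    """False positives a curator must reject for each true positive kept.

    Assumes overhead = FP / TP = (1 - precision) / precision.
    """
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return (1.0 - precision) / precision


if __name__ == "__main__":
    # Print overhead for a range of precision values to see how fast it grows.
    for p in (0.9, 0.8, 0.75, 0.5, 0.25):
        print(f"precision={p:.0%}  overhead={curation_overhead(p):.0%}")
```

Because the relationship is nonlinear, small losses in precision below 50% translate into large increases in the curator's workload.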

1. In our work, we almost always need curation: curation is a crucial constraint.
2. Overhead determines the need for curation.

Crowdsourcing: no, we can’t

Burger et al. (2014)

[Figure labels: Curation, Mining]

Corollaries

overhead = false positives / true positives = (1 - precision) / precision

1. We probably don’t need text mining systems with precision above 75%-80%.

2. However, when precision goes down, overhead goes up very quickly.

3. But how does this all apply in practice?

We use multiple strategies so that we can choose one depending on the curation effort required or available. We adapt our text mining strategy to our curation resources.

The multiple strategy approach: many tools for the same problem

The multiple strategy approach

[Figure: recall versus precision, with three operating points: high recall, compromise, high precision]

Example for gene names

[Figure: recall versus precision for three gene name strategies: dictionary-based; gene name identification (machine learning); gene name normalization (machine learning)]

Name recognition

Our current approach to name recognition

Multiple Ontologies

Machine Learning

Open Source and Proprietary
• BANNER for genes / proteins named entity recognition (CBioC: Collaborative Bio Curation, Arizona State University)
• ChemSpot for chemical named entity recognition (Institut für Informatik, Humboldt-Universität zu Berlin)
• ChemAxon for chemical name-to-structure (ChemAxon)
• DNorm for disease named entity recognition + normalization (MeSH, OMIM) (NCBI: National Center for Biotechnology Information, National Institutes of Health)

Open source packages come from a research environment and were not easy to use in a production environment.

I2E from Linguamatics

• General-purpose rule-based
– Ontologies
– Regular expressions
– Shallow parsing
– Boolean logic

• Interactive
– Pre-indexed
– Graphical interface
– Highlighted, compact output

I2E = Interactive Information Extraction

Integration through UIMA

Give the components the same interface for seamless usage:
• XML file for component description
• Parameter configuration
• Shared resources initialization
• Parallel computing (dedicated cluster or multiple servers)
• Data processing
• Access results through indexes

Interactions

Current state for protein-protein interactions

GOOD

BAD

First, the bad news

Krallinger et al. (2008): BioCreative II, full text and gene name normalization, F-score of 35%

Pyysalo et al. (2008): Change of corpus lowers F-score

Kabiljo et al. (2009): Change of corpus lowers F-score; keywords + entity recognition is competitive with machine learning

Tikk et al. (2010): Change of corpus lowers F-score; rule-based (RelEx) is competitive with machine learning

Why is performance bad? The leaky and noisy pipeline for PPIs

1. Identify articles that contain interactions
2. Identify and normalize gene names within those articles
3. Identify interactions between those gene names

Along the pipeline: loss of true positives (leaky) and addition of false positives (noisy).
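To see why a three-stage pipeline degrades so quickly, the sketch below multiplies per-stage precision and recall. It rests on two loud assumptions that are not from the talk: stage errors are independent, and the per-stage values of 0.8 are hypothetical placeholders rather than measured results.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Stage:
    name: str
    precision: float  # hypothetical value, for illustration only
    recall: float     # hypothetical value, for illustration only


def end_to_end(stages):
    """Return (precision, recall, F1) assuming independent stage errors.

    A result is counted as correct only if every stage got it right,
    so both precision and recall are products over the stages.
    """
    precision = prod(s.precision for s in stages)
    recall = prod(s.recall for s in stages)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Three stages of the PPI pipeline, each with made-up 80% precision and recall.
PIPELINE = [
    Stage("identify articles that contain interactions", 0.8, 0.8),
    Stage("identify and normalize gene names", 0.8, 0.8),
    Stage("identify interactions between gene names", 0.8, 0.8),
]

if __name__ == "__main__":
    p, r, f = end_to_end(PIPELINE)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Under these assumptions, three stages that each look respectable in isolation combine into an end-to-end F-score of roughly 0.51.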

Why is performance bad? The problem is ill-defined

Ill-defined problems are those that do not have clear goals, solution paths, or expected solution. (Wikipedia)

• Every gold standard corpus defines interactions somewhat differently.

• “Interactions” is a concept coined for bioNLP. Outside of bioNLP it means something else.

• Interactions might be too broad a concept.

And now for some good news

• Tikk et al. (2010): “[…] we think that there is also a need to complement the currently predominant approach, treating all interactions as equally important, with more specific extraction tasks. To this end, it is important to create specialized corpora, such as those for the extraction of regulation events or for protein complex formation.”

• Some specialized systems perform better than PPI systems:
– Phosphorylation (Tudor et al., 2015)

• Some other interactions besides PPI work better:
– Drug-drug: DDIExtraction 2013, F-score of 65.1%
– Expression (Neves et al., 2013)

• More generally, it makes sense to have multiple strategies for extracting interactions, both for PPIs and for other types of interactions.

Our current multiple strategies

1. High precision, specialized for biomarkers (DiMA)

2. Flexible, all-purpose (Linguamatics I2E)

3. High recall for protein-protein interactions (University of Zürich)

Rule-based system for biomarkers: Disease Marker Associations Database (DiMA)

Generation of the Knowledge Base

Extracting Gene-Disease Relationships

Genes / Proteins (Linguamatics)

Relationship (query patterns)

Diseases (Linguamatics)

Standardized Relationships
• Altered Expression
• Genetic Variation
• Role Marker
• Response Marker
• Regulatory Modification
• Negative Association

Query Development
• Multiple variations for each corpus
• 50+ sub-queries
• Variations of linguistic patterns

Example

Sentence: The ERBB2 gene (HER2/neu) is overexpressed in many human breast cancers.

• Gene: ERBB2 (EntrezID: 2064, Score: 87)
• Standardized relationship: Altered Expression (“is overexpressed”)
• Disease: Breast Neoplasms (“breast cancers”)
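The ERBB2 example can be mimicked with a toy dictionary-plus-pattern extractor. This is only a sketch of the general rule-based idea. It is not DiMA's or I2E's actual query language, the dictionaries and the single pattern are made up for this one sentence, and the confidence score shown on the slide is not modeled.

```python
import re

# Toy dictionaries mapping surface forms to normalized identifiers.
# A production system would use full ontologies instead.
GENES = {"ERBB2": "EntrezID:2064", "HER2/neu": "EntrezID:2064"}
DISEASES = {"breast cancers": "Breast Neoplasms", "breast cancer": "Breast Neoplasms"}

# One hypothetical linguistic pattern per standardized relationship.
RELATION_PATTERNS = {
    "Altered Expression": re.compile(r"\bis (?:over|under)expressed\b", re.IGNORECASE),
}


def find_first(text, dictionary):
    """Return (surface form, normalized ID) for the earliest, longest dictionary hit."""
    hits = []
    for surface, norm in dictionary.items():
        pos = text.find(surface)
        if pos >= 0:
            hits.append((pos, -len(surface), surface, norm))
    if not hits:
        return None
    _, _, surface, norm = min(hits)
    return surface, norm


def extract(sentence):
    """Return a gene-relationship-disease record, or None if any part is missing."""
    gene = find_first(sentence, GENES)
    disease = find_first(sentence, DISEASES)
    if gene is None or disease is None:
        return None
    for relation, pattern in RELATION_PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            return {"gene": gene, "relation": (relation, match.group(0)), "disease": disease}
    return None


if __name__ == "__main__":
    print(extract("The ERBB2 gene (HER2/neu) is overexpressed in many human breast cancers."))
```

Requiring all three parts (gene, pattern, disease) before emitting a record is what pushes this kind of system toward high precision at the expense of recall.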

Flexible strategy: rule-based + named-entity recognition

BANNER, DNorm

ChemSpot, ChemAxon

Multiple ontologies

Machine learning named-entity recognition

Rules

Ontologies

High recall strategy: Ontogene

High recall systems are typically very noisy.

Strategies to reduce the need for curation:
1. Fact-centric
2. Collection-wide
3. Ranked

Collection-wide and fact-centric view

• Focusing more on facts than on mentions
– The same fact can be redundantly mentioned many times in many documents.
– However, mentions of facts may also reinforce each other.

Rzhetsky et al. (2006)

• Aggregate view of the literature rather than document view.
– We are interested in facts across documents, the “literature-wide” view of a fact.

Ranked, not filtered

Ranking has been somewhat disregarded in text mining.

But not all results are created equal if you have to curate them.

The goal is to present first the results with the highest quality and best biological evidence. Users are warned that they will find noisy results.

Biological evidence

Mining quality

Our simple approach for biological evidence: based on number of document mentions.
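A fact-centric, collection-wide ranking by number of document mentions might look like the sketch below. The input tuple layout, the placeholder identifiers, and the tie-break on the best per-mention mining score are assumptions made for illustration; the talk only states that biological evidence is based on the number of document mentions.

```python
from collections import defaultdict


def rank_facts(mentions):
    """Collapse mentions into facts and rank them.

    mentions: iterable of (doc_id, gene_a, gene_b, mining_score).
    Returns a list of (fact, n_documents, best_score), best first.
    """
    docs = defaultdict(set)
    best = defaultdict(float)
    for doc_id, gene_a, gene_b, score in mentions:
        fact = tuple(sorted((gene_a, gene_b)))  # interaction treated as unordered
        docs[fact].add(doc_id)                  # count distinct documents, not mentions
        best[fact] = max(best[fact], score)
    # Primary key: document support. Tie-break (assumption): best mining score.
    ordered = sorted(docs, key=lambda f: (-len(docs[f]), -best[f], f))
    return [(f, len(docs[f]), best[f]) for f in ordered]


if __name__ == "__main__":
    # Placeholder identifiers, not real genes or articles.
    demo = [
        ("doc1", "A", "B", 0.6),
        ("doc2", "B", "A", 0.7),
        ("doc2", "A", "B", 0.5),
        ("doc3", "A", "C", 0.9),
    ]
    for row in rank_facts(demo):
        print(row)
```

Nothing is filtered out: low-support facts stay in the list, only further down, so the curator decides where to stop.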

Bonus: trained using a database, not a corpus

Gold-standard corpora are typically small and expensive to develop.

Using a biological database, we can dramatically increase the amount of training data.

BioGRID is a leading biological database that includes information of the type “in a certain article, proteins X and Y are said to interact”.

A training set based on BioGRID covers interactions from 20,928 PubMed abstracts describing physical interactions in humans.
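One way to turn article-level database records into training labels is sketched below: a candidate gene pair found in an abstract is labeled positive if the database lists that pair for the same article. The tuple layout is a hypothetical simplification, not BioGRID's actual file format, and the talk does not describe the training procedure in this detail.

```python
def label_candidates(candidates, curated):
    """Label candidate pairs using article-level curated interactions.

    candidates: iterable of (pmid, gene_a, gene_b) pairs co-mentioned in an abstract.
    curated:    iterable of (pmid, gene_a, gene_b) records from a database.
    Returns a list of ((pmid, gene_a, gene_b), is_positive).
    """
    # frozenset makes the pair order-independent: (X, Y) matches (Y, X).
    known = {(pmid, frozenset((a, b))) for pmid, a, b in curated}
    return [
        ((pmid, a, b), (pmid, frozenset((a, b))) in known)
        for pmid, a, b in candidates
    ]


if __name__ == "__main__":
    # Placeholder identifiers, not real records.
    curated = [("pmid1", "X", "Y")]
    candidates = [("pmid1", "Y", "X"), ("pmid1", "X", "Z"), ("pmid2", "X", "Y")]
    for item, label in label_candidates(candidates, curated):
        print(item, label)
```

Labels obtained this way are noisier than a hand-annotated corpus, since the database says only that the article reports the interaction, not which sentence does.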

The interface of the system: gene-centric

Gene A
– Interacting gene B1
• Example mention B1.1
• Example mention B1.2
• …
– Interacting gene B2
• Example mention B2.1
• Example mention B2.2
• …
– Interacting gene B3
• Example mention B3.1
• Etc.
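The gene-centric layout above maps naturally onto a nested dictionary: gene, then interacting gene, then example mentions. The sketch below is a hypothetical illustration of that data structure, not the actual interface code.

```python
from collections import defaultdict


def gene_centric_view(mentions):
    """Group mentions as gene -> interacting gene -> list of example sentences.

    mentions: iterable of (gene_a, gene_b, sentence).
    Each interaction is indexed under both partners.
    """
    view = defaultdict(lambda: defaultdict(list))
    for gene_a, gene_b, sentence in mentions:
        view[gene_a][gene_b].append(sentence)
        if gene_a != gene_b:
            view[gene_b][gene_a].append(sentence)
    return {gene: dict(partners) for gene, partners in view.items()}


def render(view, gene):
    """Render one gene's entry as an indented text outline."""
    lines = [gene]
    for partner, examples in view.get(gene, {}).items():
        lines.append(f"– {partner}")
        lines.extend(f"  • {sentence}" for sentence in examples)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder names mirroring the outline above.
    demo = [("A", "B1", "mention B1.1"), ("A", "B1", "mention B1.2"), ("A", "B2", "mention B2.1")]
    print(render(gene_centric_view(demo), "A"))
```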

The interface of the system

Take home

• The current state of protein-protein interaction extraction is poor, but strategies to cope are available.

• A multiple strategy approach makes it feasible to address different types of questions.

• We have recently developed a high-recall strategy together with University of Zürich. The approach is collection-wide, fact-centric, ranked and trained on BioGRID.

• Future approaches may increasingly hinge on specialization in certain types of interactions and certain tasks.

Acknowledgements

Roche pREDi: Hermann Biller, Barbara Endler-Jobst, Ralf Jaeger, Aurélien Ooms, Martin Romacker

Roche Diagnostics: Martin Baron

Uni Zürich: Simon Clematide, Lenz Furrer, Hernani Marques, Fabio Rinaldi

NCBI: Robert Leaman

Doing now what patients need next
