ICIC 2013 Conference Proceedings David Milward Linguamatics

Unstructured Text in Big Data: The Elephant in the Room

David Milward

ICIC, October 2013

Unstructured Big Data

• Big Data

– Volume, Variety, Velocity

• Estimated 80% of all data is unstructured

– Need to be able to make decisions based on this data

• Ever increasing amount of data makes it harder and harder to filter by hand

– Need more automation

© 2013 Linguamatics Ltd.

[Chart: 80% unstructured, 20% structured]

From Unstructured Data to Structured


Unstructured or Semi-Structured Content Sources

Structured, relational data

Normalized concept identifiers

Traditional text mining works in batch mode

Requires each extraction to be programmed in, or learnt from a large amount of annotated material

Identify concepts and relationships


Results

Information Consumers

Querying

I2E Agile Text Mining



In the beginning...

Increasing Range Of Applications

• Protein-Protein Interactions

• Vocabulary Development

• Biomarker Discovery

• Target Identification & Prioritization

• Drug Repositioning

• Conference Abstract Mining

• Patent Analysis

• Key Opinion Leader Identification

• Competitive Intelligence

• Safety/Tox

• In-licensing Opportunities

• SAR Extraction

• Systems Biology

• Mining FDA Drug Labels

• Extracting Numerical and Experimental Data

• Chemical Search

• Sentiment Analysis in Social Media

• Plant Metabolic Pathways

• Mining Electronic Medical Records

• Patent FTO


The Database Approach

• Can we predetermine what is required?

• One approach is to try to extract everything … but in practice have to make assumptions, whether by hand or automatically

– Do we include negative statements?

– Do we include speculation?

– What context do we include?

• Need a good idea of how the data will be used to know what needs to be extracted

• There are cases where this works well e.g.

– have useful structured data and want to add to it from free text

– have new records structured, but want to bring in information from legacy free text
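The questions above about negation, speculation and context can be made concrete as fields on each extracted record. The following is a minimal sketch, assuming a hypothetical `ExtractedFact` schema; the field names and the toy records are illustrative and are not taken from I2E.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    """One extracted relationship plus the assumptions recorded about it (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    negated: bool = False      # e.g. "X does not inhibit Y"
    speculative: bool = False  # e.g. "X may inhibit Y"
    context: str = ""          # e.g. the sentence or section the fact came from


def assertable(facts):
    """Keep only facts stated positively and without hedging."""
    return [f for f in facts if not f.negated and not f.speculative]


# Invented toy records, for illustration only.
facts = [
    ExtractedFact("drug A", "inhibits", "protein B"),
    ExtractedFact("drug A", "inhibits", "protein C", negated=True),
    ExtractedFact("drug D", "inhibits", "protein B", speculative=True),
]
print([f.obj for f in assertable(facts)])
```

Storing the flags rather than discarding negated or speculative statements at extraction time leaves the decision to whoever uses the data later.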


Extracting for a Relational Database

• Extracting all relevant information from pathology reports


Extracting for a Semantic Store

• Output as a set of triples

• Because all the parts have URIs, the parts are web-addressable

[Diagram labels: concepts Psoriasis, Peptide, Infectious Disease; relations Treat, Express, Associate; Web]
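A set of triples with URIs for every part could be serialized roughly as follows. This is a sketch only: the `http://example.org/` namespace, the `uri` helper and the example triple are assumptions for illustration, and the output format is plain N-Triples rather than anything specific to I2E.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the talk


def uri(kind, name):
    """Build a URI for a concept or relation under the assumed namespace."""
    return f"{BASE}{kind}/{name.lower().replace(' ', '_')}"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) name triples as N-Triples lines."""
    lines = []
    for subj, pred, obj in triples:
        lines.append(
            f"<{uri('concept', subj)}> <{uri('relation', pred)}> <{uri('concept', obj)}> ."
        )
    return lines


# Illustrative triple only; "Compound X" is a placeholder subject.
triples = [("Compound X", "treat", "Psoriasis")]
for line in to_ntriples(triples):
    print(line)
```

Because subject, predicate and object are all URIs, each part of the statement can be linked to from elsewhere.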


The Ad Hoc Approach

• No need to build a database

• Regard text mining queries as ways to create a database on the fly

• Keep all the unstructured free text, and query for what you need when you need it

• Even index structured data with text mining to link information together

• Used very successfully by knowledge professionals:

– Just like a search engine, there are thousands of questions people want to ask of the data

– You can’t predict them all beforehand


Ad Hoc Querying

• Find relationships that you want for any particular information request

• Treat the text mining as an “agile” database to answer thousands of different questions e.g. metabolites from fruit
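The idea of creating a table on the fly from free text can be sketched with a toy example. Here a regular expression stands in for a real linguistic query, and the sentences, the pattern and the `ad_hoc_query` helper are all invented for illustration; this is not how I2E queries are written.

```python
import re

# Invented toy sentences standing in for an unstructured corpus.
DOCS = [
    "Quercetin was detected in apple peel.",
    "Lycopene is abundant in tomato.",
    "The assay was repeated twice.",
]


def ad_hoc_query(docs, pattern):
    """Run a regex with named groups over raw text and return one row (dict) per match."""
    rx = re.compile(pattern)
    return [m.groupdict() for doc in docs for m in rx.finditer(doc)]


# A query written at question time: "<metabolite> ... in <source>".
PATTERN = (
    r"(?P<metabolite>[A-Z][a-z]+) (?:was detected|is abundant) in "
    r"(?P<source>[a-z]+(?: [a-z]+)?)"
)
rows = ad_hoc_query(DOCS, PATTERN)
for row in rows:
    print(row["metabolite"], "|", row["source"])
```

The text itself is never restructured; a different question only needs a different pattern.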


Do We Need More?

• I2E Agile Text Mining has opened up text mining to the end user, but it is still mainly used by knowledge professionals

• Given the importance of unstructured big data, can we bring the benefits of text mining to an even wider audience?



Workflow Automation

• Successful queries converted into regular workflows using Pipeline Pilot or KNIME

• Real-time workflows for up-to-date dashboards, alerts

• Further analysis, visualization, integration with other data

Pipeline Pilot

Clinical Trials Analysis from I2E Text Mining Results

Dates

Colours indicate Phase

Using Pipeline Pilot visualization


Embedding I2E within Web Apps


Form-Based Querying

• Create web interfaces connecting to I2E server

• Information professionals develop sophisticated queries

• Queries are parameterized to allow end users to customize

• JavaScript allows portability to mobile devices such as iPads
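Parameterizing an expert-written query so that end users only fill in a form might look like the sketch below. The template syntax, the `build_query` helper and the list of allowed sources are hypothetical; they do not reflect the I2E query language or its server API.

```python
from string import Template

# Hypothetical query template; the syntax is invented and is not I2E's query language.
QUERY_TEMPLATE = Template("relation(drug=$drug, effect=adverse_event, source=$source)")

# Assumed set of sources the form offers to the end user.
ALLOWED_SOURCES = {"medline", "drug_labels", "clinical_trials"}


def build_query(drug, source):
    """Fill the expert-written template with validated end-user form values."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source}")
    if not drug.replace(" ", "").isalnum():
        raise ValueError("drug name must be alphanumeric")
    return QUERY_TEMPLATE.substitute(drug=repr(drug), source=source)


print(build_query("aspirin", "medline"))
```

The information professional owns the template; the end user controls only the validated parameters.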

Enhancing Enterprise Search

• Use text mining to annotate concepts to feed into a search engine to:

– provide concept search

– provide concepts for facets

– provide intelligent thumbnails for documents, pulling out key information
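How concept annotations can feed facets in a search engine is sketched below. The annotation table, document ids and helper names are invented for illustration, and a real search engine would maintain these counts in its own index.

```python
from collections import Counter

# Invented annotations: document id -> concepts found by text mining.
ANNOTATIONS = {
    "doc1": {"psoriasis", "peptide"},
    "doc2": {"psoriasis"},
    "doc3": {"infectious disease", "peptide"},
}


def facet_counts(annotations, hits):
    """Count how many of the matching documents mention each concept."""
    counts = Counter()
    for doc_id in hits:
        counts.update(annotations.get(doc_id, ()))
    return counts


def refine(annotations, hits, concept):
    """Narrow a result list to documents annotated with the chosen facet value."""
    return [d for d in hits if concept in annotations.get(d, ())]


hits = ["doc1", "doc2", "doc3"]
print(sorted(facet_counts(ANNOTATIONS, hits).items()))
print(refine(ANNOTATIONS, hits, "peptide"))
```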



Keywords

Semantic annotation of documents

Value


Concepts

Relationships

Search Engine

NLP-based Text Mining

Provide Concepts as Facets for Enterprise Search

Refine the search based on

terminologies

Document information and

teaser

Document preview and additional

information


Dictionary Matching - Companies

Pattern Matching - Institutions

• Not a fixed list: can find previously unknown institutions
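The difference between matching against a fixed list and matching a pattern can be sketched as follows. The company list, the deliberately simple institution pattern and the example sentence (including the made-up "Institute of Applied Widgets") are assumptions for illustration; real pattern matching would rely on richer linguistic rules.

```python
import re

# Dictionary matching: only names already on the list can be found.
KNOWN_COMPANIES = {"Linguamatics"}

# Pattern matching: a deliberately simple, assumed pattern for institution names.
INSTITUTION_RX = re.compile(
    r"\b(?:University|Institute) of (?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*"
)


def find_companies(text):
    """Return known company names that occur in the text."""
    return sorted(name for name in KNOWN_COMPANIES if name in text)


def find_institutions(text):
    """Return every span that looks like an institution, listed or not."""
    return INSTITUTION_RX.findall(text)


text = (
    "Researchers at the University of Cambridge worked with Linguamatics "
    "and the Institute of Applied Widgets."
)
print(find_companies(text))
print(find_institutions(text))
```

The pattern picks up the invented institute even though no list contains it, which is the point of not relying on a fixed list.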

Pattern Matching - Mutations

Pattern Matching - Chemicals

• Assign structure to novel chemicals

• Distinguish exemplified compounds

• Distinguish compounds given properties


Whatever the Content...

• ... I2E can mine and extract with precision

Scientific literature

Twitter


... and Wherever

• I2E 4.1 Linked Servers

• Users can query local information or information on the cloud

• Proprietary content can reside within Enterprise

• Standard sources e.g. patents can be hosted on the cloud, and shared by all users

• Data from content providers can reside on their sites

[Diagram labels: I2E Enterprise; I2E OnDemand; I2E Platform; Internal Content; External Content; Content Provider; Publisher; Clinical Trials; Drug Labels; MEDLINE; Patents]

Bringing the Elephant Down to Size

• Data warehouses for data that you know you need

• Embedding text mining in other applications e.g. Enterprise Search to provide benefits of semantic search

• Building specialized new interfaces to provide self-service applications for end-users

• Continuing role for ad hoc text mining

– text mining can ask almost any question of the data

– you can never predict every question


Technology
