25
Unstructured Text in Big Data The Elephant in the Room David Milward ICIC, October 2013

ICIC 2013 Conference Proceedings David Milward Linguamatics

Embed Size (px)

DESCRIPTION

Unstructured Text in Big Data: the Elephant in the Room David Milward (Linguamatics, UK) A traditional approach has been to extract structured data from the unstructured text. When it comes to big data, manual or semi-automated methods just don't scale. Even with fully automated methods, it is infeasible to extract all the facts and relationships buried in the text to produce a 'knowledge base of everything' that answers every possible end-user question. In recent years, there has therefore been a trend towards using text mining over unstructured data to directly answer questions, effectively creating novel databases on the fly. However, this has widened the gap between information professionals and end users. In this talk I will explain how multiple approaches are being adopted to exploit the skills of information professionals to improve decision support for end users. These include specialised real-time querying, alerting based on semantic queries, semantic enrichment of documents, and population of semantic stores or linked data.

Citation preview

Page 1: ICIC 2013 Conference Proceedings David Milward Linguamatics

Unstructured Text in Big Data The Elephant in the Room

David Milward

ICIC, October 2013

Page 2: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Unstructured Big Data

• Big Data

– Volume, Variety, Velocity

• Estimated 80% of all data is unstructured

– Need to be able to make decisions based on this data

• Ever increasing amount of data makes it harder and harder to filter by hand

– Need more automation

© 2013 Linguamatics Ltd. 2

80%

20%

Page 3: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style From Unstructured Data to Structured

Linguamatics Confidential

Unstructured or Semi-Structured Content Sources

Structured, relational data Normalized concept identifiers

Traditional text mining works in batch mode

Requires each extraction to be programmed in, or learnt from a large amount of annotated material

Identify concepts and relationships

Page 4: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style

Results

Information Consumers

Querying

I2E Agile Text Mining

Confidential

Page 5: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style

In the beginning...

Increasing Range Of Applications

Protein-Protein

Interactions

Vocabulary Development

Biomarker Discovery

Target Identification

& Prioritization

Drug Repositioning

Conference Abstract Mining

Patent Analysis

Key Opinion Leader

Identification

Competitive Intelligence

Safety/Tox

In-licensing Opportunities

SAR Extraction

Systems Biology

Mining FDA Drug

Labels Extracting Numerical

and Experimental

Data

Chemical Search

Sentiment Analysis in

Social Media

Plant Metabolic Pathways

Mining Electronic Medical Records

Patent FTO

Confidential

Page 6: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style The Database Approach

• Can we predetermine what is required?

• One approach is to try to extract everything … but in practice have to make assumptions, whether by hand or automatically

– Do we include negative statements?

– Do we include speculation?

– What context do we include?

• Need a good idea of how the data will be used to know what needs to be extracted

• There are cases where this works well e.g.

– have useful structured data and want to add to it from free text

– have new records structured, but want to bring in information from legacy free text

© 2013 Linguamatics Ltd. 6

Page 7: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Extracting for a Relational Database

• Extracting all relevant information from pathology reports

© 2013 Linguamatics Ltd. 7

Page 8: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Extracting for a Semantic Store

• Output as a set of triples

• Because all the parts have URIs, the parts are web-addressable

Psoriasis Treat

Peptide Express

Associate Infectious Disease

Web

© 2013 Linguamatics Ltd. 8

Page 9: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style The Ad Hoc Approach

• No need to build a database

• Regard text mining queries as ways to create a database on the fly

• Keep all the unstructured free text, and query for what you need when you need it

• Even index structured data with text mining to link information together

• Used very successfully by knowledge professionals:

– Just like a search engine, there are thousands of questions people want to ask of the data

– You can’t predict them all beforehand

© 2013 Linguamatics Ltd. 9

Page 10: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Ad Hoc Querying

• Find relationships that you want for any particular information request

• Treat the text mining as an “agile” database to answer thousands of different questions e.g. metabolites from fruit

© 2013 Linguamatics Ltd. 10

Page 11: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Do We Need More?

• I2E Agile Text Mining has opened up text mining to the end user, but it is still mainly used by knowledge professionals

• Given the importance of unstructured big data, can we bring the benefits of text mining to an even wider audience?

© 2013 Linguamatics Ltd. 11

Page 12: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style

• Successful queries converted into regular workflows using Pipeline Pilot or KNIME

• Real-time workflows for up-to-date dashboards, alerts

• Further analysis, visualization, integration with other data

12

Workflow Automation

Pipeline Pilot

Page 13: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Clinical Trials Analysis from I2E Text Mining Results

Dates

Colours indicate Phase

Using Pipeline Pilot visualization

13 © 2013 Linguamatics Ltd.

Page 14: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Embedding I2E within Web Apps

4734_8.pptx 14 10/21/2013

Page 15: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Form-Based Querying

• Create web interfaces connecting to I2E server

• Information professionals develop sophisticated queries

• Queries are parameterized to allow end users to customize

• Javascript allows portability to mobile devices such as iPads

Page 16: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Enhancing Enterprise Search

• Use text mining to annotate concepts to feed into a search engine to:

– provide concept search

– provide concepts for facets

– provide intelligent thumbnails for documents, pulling out key information

© 2013 Linguamatics Ltd. 16

Page 17: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style

Keywords

Semantic annotation of documents V

alu

e

© 2013 Confidential

Concepts

Relationships

Search Engine

NLP-based Text Mining

Page 18: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Provide Concepts as Facets for Enterprise Search

Refine the search based on

terminologies

Document information and

teaser

Document preview and additional

information

© 2013 Linguamatics Ltd. 18

Page 19: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Dictionary Matching - Companies

Page 20: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Pattern Matching - Institutions

• Not a fixed list: can find previously unknown institutions

Page 21: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Pattern Matching - Mutations

Page 22: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Pattern Matching - Chemicals

• Assign structure to novel chemicals

• Distinguish exemplified compounds

• Distinguish compounds given properties

Page 23: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style

• ... I2E can mine and extract with precision

Whatever the Content...

Scientific literature

Twitter

Page 24: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style

• I2E 4.1 Linked Servers

• Users can query local information or information on the cloud

• Proprietary content can reside within Enterprise

• Standard sources e.g. patents can be hosted on the cloud, and shared by all users

• Data from content providers can reside on their sites

External Content

Content Provider

Clinical Trials

Drug Labels MEDLINE Internal Content

Publisher

Patents

…. and Wherever

24

I2E Enterprise

I2E Platform I2E Platform

I2E OnDemand

Page 25: ICIC 2013 Conference Proceedings David Milward Linguamatics

Click to edit Master title style Click to edit Master title style Bringing the Elephant Down to Size

• Data warehouses for data that you know you need

• Embedding text mining in other applications e.g. Enterprise Search to provide benefits of semantic search

• Building specialized new interfaces to provide self-service applications for end-users

• Continuing role for ad hoc text mining

– text mining can ask almost any question of the data

– you can never predict every question

© 2013 Linguamatics Ltd. 25