TEXT MINING: THE NEXT DATA FRONTIER

An Infrastructural Approach

@openminted_eu

Dr. Petr Knoth, CORE (core.ac.uk)

Knowledge Media Institute, The Open University, United Kingdom

Source: eexcess.eu/wp-content/uploads/2016/05/Petr-Knoth... · 2016-05-11





OpenMinTeD

Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly related sources.


Beyond Open Access

MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT


The phases of text mining

STAGE 1: Information Retrieval

STAGE 2: NLP Analysis, Entity Recognition

STAGE 3: Information Extraction

STAGE 4: Data Mining, Knowledge Discovery
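The four stages named on this slide (information retrieval, NLP analysis and entity recognition, information extraction, data mining and knowledge discovery) can be illustrated with a toy chain of functions. This is only a minimal sketch, not how OpenMinTeD implements anything: the corpus, the lexicon and every function name are invented for illustration, and each stage is reduced to the simplest stand-in (keyword match, dictionary lookup, co-occurrence, counting).

```python
import re
from collections import Counter
from itertools import combinations

# Toy corpus; the sentences are invented for illustration only.
CORPUS = [
    "Aspirin inhibits COX1 in human platelets.",
    "COX1 is also inhibited by Ibuprofen.",
    "Open access repositories are growing.",
    "Aspirin and Ibuprofen both target COX1.",
]

# Hypothetical entity dictionary used by the recognition stage.
LEXICON = {"Aspirin", "Ibuprofen", "COX1"}


def retrieve(corpus, query):
    """Stage 1, information retrieval: keep documents that mention the query."""
    return [doc for doc in corpus if query.lower() in doc.lower()]


def recognise_entities(doc, lexicon):
    """Stage 2, NLP analysis / entity recognition: dictionary lookup over tokens."""
    tokens = re.findall(r"\w+", doc)
    return sorted({tok for tok in tokens if tok in lexicon})


def extract_pairs(entities):
    """Stage 3, information extraction: entities sharing a document form a pair."""
    return list(combinations(entities, 2))


def mine(pairs_per_doc):
    """Stage 4, data mining / knowledge discovery: count how often pairs recur."""
    return Counter(pair for pairs in pairs_per_doc for pair in pairs)


docs = retrieve(CORPUS, "cox1")
pairs_per_doc = [extract_pairs(recognise_entities(d, LEXICON)) for d in docs]
pair_counts = mine(pairs_per_doc)
```

Each stage consumes the output of the previous one, which is why the later slides stress agreed formats between services: swapping the dictionary lookup for a real entity recogniser only works if its output still fits what the extraction stage expects.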

OPENMINTED - The Open Mining Infrastructure for Text and Data


TDM challenges for researchers

1. Content challenges

- Barriers and obstacles due to non-availability, technical restrictions, copyright law or licensing issues

- No uniform way to search for, retrieve and access content for TDM


TDM challenges for researchers

2. Services challenges

How to identify the most fitting TDM service?

How to combine with other TDM services I have access to?

How to use them on my content?


TDM challenges for researchers

3. Processing challenges

Where to deploy? Are my machines powerful enough?

How can I get access to powerful machines?

Where to store intermediate and final results?

How to ensure persistence of storage?


OpenMinTeD – Provides solutions

An open and sustainable TDM infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific-related sources.


OpenMinTeD – working on many fronts

- ACCESSIBLE CONTENT: via standardised programmatic interfaces

- DISCOVERABLE SERVICES: well-documented, easily discoverable text mining services and workflows which process, analyse and annotate text

- EFFICIENT PROCESSING: operate on public e-Infrastructures via standardised APIs

- RESEARCH COMMUNITIES: different scientific communities have different challenges

- VALUE ADDED APPS: community-driven applications to illustrate the value of the infrastructure. Engage with industry.


The project

Started: June 2015

Duration: 3 years

Budget: €6 million

Grant: €5.3 million

16 Partners:

- 6 mining research groups

- 3 content providers

- 1 data center

- 1 library association

- 2 legal experts

- 6 community related partners

- 2 SMEs

PARTNERS

- Athena RIC

- Univ. of Manchester (NaCTeM)

- Univ. of Darmstadt

- INRA

- EMBL-EBI

- Agro-Know

- LIBER

- Univ. of Amsterdam

- Open University UK (CORE)

- EPFL

- CNIO

- Univ. of Sheffield (GATE)

- GESIS

- GRNET

- Frontiers

- Univ. of Stirling


The OpenMinTeD landscape


Infrastructural approach

OpenMinTeD does not build new services, but adopts and adapts existing services for new communities


Infrastructural approach

Focuses on interoperability across text mining services and content provision outlets


Infrastructural approach

Creates an open and collaborative space for researchers to use the best-fitting text mining services available, building on the cloud computing philosophy


Overview

[Architecture diagram]

Users: researchers, curators, text-miners and new services developers

Platform services: Registry, Workflow Management, Auth2 & Policy management, Annotator, Accounting

Layer 1: Interoperability of text mining services (platforms or components). Diagram elements: mining platforms, proprietary architectures

Layer 2: Interoperability of language resources & corpora. Diagram elements: language resources and corpora registry service, language resources, publisher text corpus, OpenAIRE/CORE text corpus, PMC text corpus, other text corpora, other types of text corpora

Layer 3: Interoperability to shared storage and computing resources. Diagram elements: data centres in public cloud


Interoperability framework

Bringing together mining tools, resources and content

1. Content metadata & transfer standards

To document scientific literature, language resources, taxonomies and provenance as well as transfer protocols for full text retrieval


Interoperability framework

Bringing together mining tools, resources and content

2. Service metadata & pipelining

To document and classify text mining services, how they receive input, in what form they output their results, how they combine for workflows, what granularity to consider.
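To make the idea of service metadata and pipelining concrete, here is a minimal sketch of how a description of each service's input form, output form and granularity could be used to check whether services combine into a workflow. It does not reproduce any OpenMinTeD schema: the `ServiceDescription` fields, the `find_mismatches` helper and all service names and format labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical metadata record for one text mining service."""
    name: str
    task: str            # classification of the service, e.g. "tokenisation"
    input_format: str    # form in which the service receives its input
    output_format: str   # form in which it outputs its results
    granularity: str     # unit it works on, e.g. "document" or "sentence"


def find_mismatches(workflow):
    """Return (producer, consumer) name pairs whose formats do not line up."""
    mismatches = []
    for producer, consumer in zip(workflow, workflow[1:]):
        if producer.output_format != consumer.input_format:
            mismatches.append((producer.name, consumer.name))
    return mismatches


# All service names and format labels below are invented placeholders.
pdf_to_text = ServiceDescription(
    "pdf-to-text", "conversion", "pdf", "plain-text", "document")
tokeniser = ServiceDescription(
    "tokeniser", "tokenisation", "plain-text", "token-annotations", "sentence")
entity_tagger = ServiceDescription(
    "entity-tagger", "entity recognition",
    "token-annotations", "entity-annotations", "sentence")

valid_mismatches = find_mismatches([pdf_to_text, tokeniser, entity_tagger])
broken_mismatches = find_mismatches([pdf_to_text, entity_tagger])
```

The check is deliberately naive (exact string equality on format labels); the point is only that a workflow can be validated from metadata before anything is executed.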


Interoperability framework

Bringing together mining tools, resources and content

3. IPR and licensing

To study IPR restrictions, describe license metadata for re-use, for content and TDM services & tools, and information on how to apply for academic and non-commercial mining research
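As an illustration of why machine-readable license metadata matters, the sketch below filters resources by the terms their providers have declared. Everything in it is an assumption made for the example: the `LicenseRecord` fields, the `usable_for` helper and the three records are invented, and the flags are placeholders rather than statements about any real license.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseRecord:
    """Hypothetical license metadata attached to a corpus, service or tool."""
    resource: str
    license_label: str            # free-text label supplied by the provider
    mining_allowed: bool          # provider states that TDM is permitted
    commercial_use_allowed: bool  # provider states commercial re-use is permitted


def usable_for(records, commercial):
    """Names of resources whose declared terms cover the intended mining use."""
    return [
        r.resource
        for r in records
        if r.mining_allowed and (r.commercial_use_allowed or not commercial)
    ]


# Invented example records; not statements about real licenses.
RECORDS = [
    LicenseRecord("corpus-a", "example-open-license", True, True),
    LicenseRecord("corpus-b", "example-research-only-license", True, False),
    LicenseRecord("corpus-c", "example-all-rights-reserved", False, False),
]

academic_resources = usable_for(RECORDS, commercial=False)
commercial_resources = usable_for(RECORDS, commercial=True)
```

With terms recorded per resource, a researcher doing academic, non-commercial mining sees a different set of usable corpora than a commercial user, without reading each license by hand.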


OpenMinTeD users

1. End users

- Researchers, database curators, …

- Novice: use services to advance their science

- Advanced: combine TDM services into complex workflows


OpenMinTeD users

2. Content and service providers

- Publishers, libraries, scientific database centres, …

- TDM researchers

- SMEs


Bottom-up approach

OpenMinTeD works with 4 use cases, which give their requirements and evaluate the results:

- RESEARCH ANALYTICS

- LIFE SCIENCES

- AGRICULTURE

- SOCIAL SCIENCES


OpenMinTeD use case 1

Scholarly communication analytics

• Semantic search and discovery of open scientific outcomes

• Map of academia – scholarly communication network

• Research monitoring and analytics

Partners: CORE/OU, OpenAIRE/ARC, Frontiers


OpenMinTeD use case 2

Life sciences

• Assisted curation of the EMBL-EBI chemical databases for metabolomics

• Curation of the neurosciences resources KnowledgeBase and Neurolex

Partners: EBI - Metabolomics, Human Brain Project


OpenMinTeD use case 3

Agriculture and biodiversity

• Enrich agricultural databases to assist food- and water-borne disease outbreak alerts and product recalls

• Image, figure and dataset discovery in the AGRIS

Partners: INRA, AGRO-KNOW


OpenMinTeD use case 4

Social sciences

Develop and evaluate methods for the automatic detection and linking of named entities, citation traces and intentions in social science scientific publications

Partners: GESIS


What can OpenMinTeD do for you?

Are you a content provider?

Make your content available for mining

Register your collections in the OpenMinTeD registry and let others discover them


What can OpenMinTeD do for you?

Are you a TDM service provider?

Share and collaborate with other TDM services

Register your TDM service in the OpenMinTeD registry and let others discover it.


What can OpenMinTeD do for you?

Are you a text miner or a researcher who can benefit from text mining?

Use OpenMinTeD (when launched)


Conclusions


- The ability to text-mine research literature at scale can redefine the way we do research

- OpenMinTeD is laying the groundwork (interoperability) and building the cloud infrastructure for text-mining research literature

- OpenMinTeD is building an open, transparent infrastructure that enables others to participate


Contact us

www.openminted.eu


twitter.com/openminted_eu

facebook.com/openminted

bit.do/openmintedlinkedin

vimeo.com/openminted

bit.do/openmintedplus