Mining and meaning in the chemical sciences
Richard Kidd [email protected]



Overview

• Why are we doing this?
• The conventional text-mining paradigm
• How we do it
• Where text-mining and annotation could happen in future
• Standards
• Challenges


Why are we doing this?

A solution looking for many problems

• Enhanced reader experience
• Current awareness
• Information retrieval (pre-indexing)


Enhanced HTML


Enhanced HTML


Conventional text-mining paradigm

There is a corpus of text (PubMed abstracts, internal reports, PDFs).

There is a resource (WordNet, FrameNet, the NTU Sentiment Dictionary).

Text-mining software is trained, using the resource, on a subset of the corpus and tested on the remainder.

This all happens after publication.
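
The paradigm above can be sketched in a few lines of Python. This is a minimal illustration, not any real system: the toy corpus, the flat term list standing in for the resource, and the function names `split_corpus`, `train` and `evaluate` are all hypothetical, and "training" here only means discarding resource terms that prove unreliable on the training subset.

```python
import random

# Hypothetical toy corpus: (sentence, terms a human annotator marked as correct).
CORPUS = [
    ("Affinity chromatography separated the proteins.", {"affinity chromatography"}),
    ("The distribution of particle sizes was narrow.", set()),
    ("We purified the enzyme by affinity chromatography.", {"affinity chromatography"}),
    ("Mass spectrometry confirmed the product.", {"mass spectrometry"}),
    ("Distribution between two phases was measured.", {"distribution"}),
    ("Samples were analysed by mass spectrometry.", {"mass spectrometry"}),
]

# Hypothetical resource: a flat list of terms (a real one would be a dictionary or ontology).
RESOURCE = ["affinity chromatography", "mass spectrometry", "distribution"]


def split_corpus(corpus, train_fraction=0.5, seed=0):
    """Shuffle reproducibly, then split into a training subset and the remainder."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


def train(train_docs, resource, min_precision=0.75):
    """Keep only the resource terms that are usually right on the training subset."""
    kept = []
    for term in resource:
        golds = [gold for text, gold in train_docs if term in text.lower()]
        if not golds:
            kept.append(term)  # never seen in training: no evidence against it
            continue
        precision = sum(term in gold for gold in golds) / len(golds)
        if precision >= min_precision:
            kept.append(term)
    return kept


def evaluate(test_docs, terms):
    """Count true positives, false positives and false negatives on held-out text."""
    tp = fp = fn = 0
    for text, gold in test_docs:
        found = {term for term in terms if term in text.lower()}
        tp += len(found & gold)
        fp += len(found - gold)
        fn += len(gold - found)
    return tp, fp, fn


train_docs, test_docs = split_corpus(CORPUS)
terms = train(train_docs, RESOURCE)
tp, fp, fn = evaluate(test_docs, terms)
print(f"kept terms: {terms}; tp={tp} fp={fp} fn={fn}")
```

Everything in this sketch runs against already-published text, which is the point the slide makes: the resource and the corpus are both fixed before the software ever sees them.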


Resources, conventionally

• Static
• Probably developed for a single use case
• Possibly inconveniently licensed
• Developed by a single institution


The kind of resources we want

• Dynamic
• Multiple use cases
• Open
• Developed by multiple institutions


Text mining (Oscar)

http://www.sciborg.org.uk/

http://oscar3-chem.sourceforge.net/

Manual QA

Enhanced HTML

Enhanced RSS

Database


Resources we use

Static
• IUPAC Gold Book

Dynamic
• OBO biomedical ontologies, especially ChEBI
• RSC ontologies (http://www.rsc.org/ontologies): CMO, RXNO, MOP (and more to come)

InChI Identifier


Live resource update (stage one)

Integr. Biol., 2009, doi:10.1039/b905580k

affinity chromatography (CMO:0001006)

A chromatography method where the separation is caused by differing analyte–ligand interactions.

(source: IUPAC Orange Book 9.2.1.5)


Live resource update (stage two)

immobilized metal affinity chromatography (CMO:0002255)

A chromatography method where the separation is caused by differing analyte–ligand interactions. Proteins containing amino acids with a specific affinity for metal ions (e.g. His which has an affinity for Co and Zn ions) are retained by the column.

metal oxide affinity chromatography (CMO:0002256)

A chromatography method where the separation is caused by differing analyte–ligand interactions. Phosphorylated proteins and peptides are retained by metal oxide particles because of their affinity for the phosphate group.
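
One way to picture the two stages is as a small term store that grows while the annotator keeps running. The sketch below is an illustration under assumptions: the `Term` and `Ontology` classes are hypothetical, the parent link from the two new terms to CMO:0001006 is inferred from their shared definition rather than stated on the slides, and "most specific match" is approximated by picking the longest matching name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Term:
    term_id: str
    name: str
    parent_id: Optional[str] = None  # assumed is_a parent, not taken from the slides


class Ontology:
    """Hypothetical in-memory term store that can be extended at any time."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        self.terms[term.term_id] = term

    def ancestors(self, term_id: str) -> List[str]:
        """Walk the parent links from a term up to the root."""
        chain: List[str] = []
        parent = self.terms[term_id].parent_id
        while parent is not None:
            chain.append(parent)
            parent = self.terms[parent].parent_id
        return chain

    def most_specific_match(self, text: str) -> Optional[Term]:
        """Return the matching term with the longest name, or None."""
        lowered = text.lower()
        matches = [t for t in self.terms.values() if t.name in lowered]
        return max(matches, key=lambda t: len(t.name)) if matches else None


sentence = "Peptides were enriched by immobilized metal affinity chromatography."

# Stage one: only the general term exists.
cmo = Ontology()
cmo.add(Term("CMO:0001006", "affinity chromatography"))
before = cmo.most_specific_match(sentence).term_id

# Stage two: more specific terms are added to the live resource.
cmo.add(Term("CMO:0002255", "immobilized metal affinity chromatography", "CMO:0001006"))
cmo.add(Term("CMO:0002256", "metal oxide affinity chromatography", "CMO:0001006"))
after = cmo.most_specific_match(sentence).term_id

print(before, "->", after)
```

The same sentence receives a different annotation once the resource has been extended, which is why annotations against a live ontology have to be treated as revisable.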


But beware of ambiguity

distribution (noun)

Does this mean:
(a) Spreading something out (a process)?
(b) The way something is spread out (a quality)?


RSC ontology development

Annotations to a particular ontology are a moving target.

And we can’t guarantee completeness for any given resource–corpus combination.

(Unless we build a corpus-specific resource, which is bad.)


Compounds?

All kinds of problems...

• Different names, systematic and common
• No names
• Images (specific and generic)

Best dictionary wins for names
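
A dictionary-based name lookup can be sketched as follows. The dictionary contents, the function name `find_compounds` and the longest-name-first matching strategy are assumptions made for illustration; the two InChIKeys are the standard keys for acetic acid and ethanol as far as I am aware, but treat them as illustrative values.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical dictionary: several names (systematic and common) map to one identifier.
NAME_TO_INCHIKEY: Dict[str, str] = {
    "acetic acid": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    "ethanoic acid": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "ethyl alcohol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
}


def find_compounds(text: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (matched name, identifier) pairs in order of appearance."""
    # Longest names first, so a longer name is preferred over a shorter one inside it.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), dictionary[m.group(0).lower()]) for m in pattern.finditer(text)]


hits = find_compounds(
    "Ethanoic acid (acetic acid) was dissolved in ethyl alcohol.", NAME_TO_INCHIKEY
)
for name, key in hits:
    print(f"{name!r} -> {key}")

# A compound referred to only by a label has no dictionary entry at all.
print(find_compounds("Compound 3a was isolated as a white solid.", NAME_TO_INCHIKEY))
```

Coverage here depends entirely on what the dictionary contains, and compounds with no name, or shown only as images, never reach this step.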


Stephen Arnold

Search: The Three Curves of Despair

March 2008


ChemMantis


Deposit structures…build dictionaries


A free to access online database for chemists

Website and web services

A link farm for over 22 million compounds integrated to 200 data sources

A curation platform for the public to improve the quality of data online

A deposition platform for the public to annotate and extend the data


Annotation: where and when?

Pre-publication? (by authors): ?

At publication? (by editors): Prospect

After publication? (by the crowd): ChemMantis


Authoring: Word ontology plugin

http://ucsdbiolit.codeplex.com/


<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
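
As an illustration of what "semantics stored in CML" makes possible, the sketch below reads a document of this shape with Python's standard library and derives a molecular formula. It is not part of Chem4Word: the function names are hypothetical, and the embedded copy of the molecule omits the 2D coordinates for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List, Tuple

CML_NS = "{http://www.xml-cml.org/schema}"

# Same atoms and bonds as the slide's example, with x2/y2 coordinates left out.
CML_TEXT = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" /><atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" /><atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" /><atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" /><atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" /><bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" /><bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" /><bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def read_molecule(cml_text: str) -> Tuple[Dict[str, str], List[Tuple[str, str, int]]]:
    """Return ({atom id: element}, [(atom id, atom id, bond order), ...])."""
    root = ET.fromstring(cml_text)
    molecule = root.find(f"{CML_NS}molecule")
    atoms = {a.get("id"): a.get("elementType") for a in molecule.iter(f"{CML_NS}atom")}
    bonds = []
    for bond in molecule.iter(f"{CML_NS}bond"):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, int(bond.get("order"))))
    return atoms, bonds


def hill_formula(atoms: Dict[str, str]) -> str:
    """Hill order: C then H then the rest alphabetically; all alphabetical if no carbon."""
    counts = Counter(atoms.values())
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


atoms, bonds = read_molecule(CML_TEXT)
formula = hill_formula(atoms)
double_bonds = [(a, b) for a, b, order in bonds if order == 2]
print(formula, double_bonds)
```

Because the atoms and bonds are explicit, a program can work out that this molecule is C2H4O2 with one C=O double bond, which a picture of the structure alone would not allow.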

Relationships: Navigate and link referenced chemistry

• Peter Murray-Rust

• Joe Townsend

• Jim Downing

Available soon: http://research.microsoft.com/chem4word/

Data: Semantics stored in Chemistry Markup Language

Intent: Recognizes chemical dictionary and ontology terms

Author and edit 1D and 2D chemistry.

Intelligence: Verifies validity of authored chemistry

Authoring: Chem4Word – Chemistry Drawing in Word


oreChem


TREC Chemistry

• Combination of 1m+ patents
• 36k RSC articles
• Test runs on defined tasks, prior art
• 8 runs in year one (none from UK)


InfoChem

• Chemisches Zentralblatt
• Digitised
• Structure searchable
• 98k unique names, 48k unique structures
• Unique PROBLEMS
• OCR interpretations
• Text searchable from FIZ Chemie


Role:

• To fund development and support of the IUPAC InChI standard
• Working groups set up by IUPAC Subcommittee: reactions, organometallics, polymers, Markush, business rules for structure input, Resolver protocol


Current members of the Trust:

ACD/Labs
ChemAxon
Elsevier
FIZ Chemie
Informa / Taylor & Francis
NPG
OpenEye
RSC
Symyx Technologies
Thomson-Reuters
Wiley-Blackwell


RInChI

• Reactions
• Jonathan Goodman
• http://www-rinchi.ch.cam.ac.uk/


inchis.chemspider.com


NCI Resolver

• So we need a Resolver Protocol
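
To make the idea of a resolver concrete, here is a small sketch of the kind of lookup one provides: give it an identifier, ask for a representation, get a value back. The `InMemoryResolver` class and its record layout are hypothetical. The URL helper follows the path pattern of the public NCI/CADD Chemical Identifier Resolver as I understand it (`/chemical/structure/<identifier>/<representation>`); that pattern is an assumption not taken from the slides, and the code only builds the URL without making a network request.

```python
from typing import Dict, List, Optional
from urllib.parse import quote

# Assumed base URL of the NCI/CADD resolver service (not given on the slide).
NCI_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def nci_style_url(identifier: str, representation: str) -> str:
    """Build a resolver URL for a chemical name; no request is sent."""
    return f"{NCI_BASE}/{quote(identifier, safe='')}/{quote(representation, safe='')}"


class InMemoryResolver:
    """Hypothetical local resolver: names in, requested representation out."""

    def __init__(self, records: List[Dict[str, object]]) -> None:
        self._by_name: Dict[str, Dict[str, object]] = {}
        for record in records:
            for name in record["names"]:
                self._by_name[name.lower()] = record

    def resolve(self, identifier: str, representation: str) -> Optional[str]:
        record = self._by_name.get(identifier.lower())
        if record is None:
            return None
        value = record.get(representation)
        return value if isinstance(value, str) else None


resolver = InMemoryResolver([
    {
        "names": ["acetic acid", "ethanoic acid"],
        "stdinchi": "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",
        "stdinchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    },
])

print(resolver.resolve("Ethanoic acid", "stdinchikey"))
print(nci_style_url("acetic acid", "stdinchikey"))
```

Without an agreed protocol, every service defines its own version of `resolve` with its own identifiers and representation names, which is the gap the slide points at.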


Semantic Enrichment of the Scientific Literature (SESL)

• Pistoia-funded
• EBI
• Elsevier, NPG, OUP, RSC
• Oct 2009 – Oct 2010


[Diagram: Biomedical Knowledge Service Framework. Labels: Content Suppliers (Corpus 1, Db 2, Db 3, Db 4, Corpus 5); Supplier Firewall; Common Service Broker (Service Layer, Integrator, Transform / Translate, Assertion & Meta Data Mgmt); 'Consumer' Firewall; Multiple Consumers; Knowledge Applications; Std Public Vocabularies; Business Rules; Open Stds; Effort required to fit DBs to service layer]


SESL deliverables

• Pilot to deliver target-disease assertions
• Publication of data, application and web service standards
• So: to deliver standards for semantic delivery


Standards - are we there yet?

• InChI Trust
  • Compound standards
  • Reaction InChIs
  • Resolver Protocol
• Pistoia/EBI
  • Semantic standards for web services
• Microsoft/Academia
  • oreChem
  • Chem4Word
• Semantic markup by publishers


Challenges for RSC

Open problems

• Chemical structures from images
• Productive identifiers for productively-named entities

Putting ChemMantis and Prospect together

• Backfile (to 1841)
• Microsoft Word as well as XML
• Name to structure conversion


...and for text miners and repositories

Inputs and outputs

• Who’s putting the data in?
• Who’s curating it?
• Who wants to use it?
• ...and what for?

Standards implementation

• Compelling use cases now here
