
Identifying, Indexing, and Ranking Chemical Formulae and Chemical

Names in Digital Text

slides by Suleyman Cetintas & Luo Si


Outline

Introduction

Chemical Entity Mentions

SMILES, InChI, IUPAC Nomenclature, Trivial Names

Chemical Entity Extraction from Text

Chemical Name Segmentation

Chemical Entity Indexing

Chemical Entity Search

Text Retrieval Conference – Chemical Track

References


Introduction

End users demand fast responses to searches for

chemical entities (e.g., chemical formulae and chemical

names)

A chemical search engine

must identify all occurrences of chemical entities

must index them in order to enable fast access

this can be done offline, but is still challenging because of the large volume of data

Tagging chemical formulae & chemical names

hard problem due to inherent ambiguity in natural language

text


Introduction

Partial formulae or partial chemical names

chemists and other users of chemical search engines want to enter a partial

chemical name or formula

expect the search engine to return documents having chemical

entities that contain the partial formula or chemical name

indexing on sub-formulae is required by the search engine for

efficiency

indexing all possible sub-formulae of every formula, or all sub-names

of every chemical name,

requires a large index that is prohibitively expensive in time and memory

index-pruning is required [Sun et al. 2011]


Introduction

Different forms of the same formula

users can search CH3COOH or C2H4O2 (same formula)

both appear in a significant number of documents

for larger chemical formulae, the diversity is even greater

search engine ChemIndustry.com returns the ‘synonyms’ of a chemical

formula and the docs containing those

need to identify the chemical formula, and disambiguate from

other non-chemical-entity-related abbreviations

E.g., OH can be “hydroxyl group” or the state “Ohio”

hard problem as it requires context analysis and natural language

processing (NLP)


Introduction

Partial chemical name searches

segmenting a chemical name into meaningful subterms

e.g. “ethyl” or “methyl” instead of “ethy” or “lmethy”

let users perform partial name searches

Tools: Name=Struct [Brecher, 1999], CHEMorph [Kremer et al., 2006],

OPSIN [Corbett and Murray-Rust, 2006]

segment a chemical name into its morphemes, map the morphemes into

their chemical structures, and use these structures to construct the

structure of the named chemical

2 main directions

using dictionaries and lexicons

identifying and using frequent substring patterns in text


Introduction

Architecture of a chemical entity search engine with

document search in ChemXSeer [Sun et al., 2010]


Chemical Entity Mentions

Finding mentions of chemical compounds in text is

important for several reasons:

annotation of the entities enables a search engine to return

documents containing elements of this entity class (semantic

search), e.g. together with a disease

mapping the entities found to their corresponding structures makes it

possible to search for relations between different chemicals

then, a chemist can search for similar structures, substructures,

and combine the information from the text with other tools


Chemical Entity Mentions: SMILES

Chemical names can be divided into several classes, each dealing

with complex structures in its own way

SMILES

mentions of the sum formula or names according to the Simplified

Molecular Input Line Entry Specification (SMILES) [Weininger, 1988]

more human readable than InChI (shown in next slide)

has a wide base of software support with extensive theoretical

(e.g., graph theory) backing

a number of equally valid SMILES can be written for a molecule

e.g., CCO, OCC and C(O)C all specify the structure of ’ethanol’

allow direct structure search

limited readability of such specifications for humans

therefore trivial names are used more frequently in scientific texts
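
The claim that CCO, OCC and C(O)C all denote ethanol can be checked with a small sketch. The parser below is an assumption made for illustration only: it handles a tiny subset of SMILES (single-letter uppercase atoms, implicit single bonds, parenthesized branches; no rings, charges, or aromaticity), and the hypothetical helper `same_molecule` brute-forces graph isomorphism, which is only practical for very small molecules. Real toolkits rely on canonicalization algorithms instead.

```python
from itertools import permutations


def parse_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds).

    Assumed subset: single-letter atoms (C, O, N, ...), implicit single
    bonds, and parentheses for branches. No rings, charges, or aromatics.
    """
    atoms, bonds = [], set()
    stack = []   # atoms to return to when a branch closes
    prev = None  # index of the atom the next atom bonds to
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, idx)))
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def same_molecule(smiles_a, smiles_b):
    """Return True if both strings describe isomorphic labeled graphs."""
    atoms_a, bonds_a = parse_smiles(smiles_a)
    atoms_b, bonds_b = parse_smiles(smiles_b)
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # perm maps atom i of molecule A onto atom perm[i] of molecule B
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
        if mapped == bonds_b:
            return True
    return False
```

With this sketch, `same_molecule("CCO", "C(O)C")` holds, while `same_molecule("CCO", "COC")` does not, since COC places the oxygen between the two carbons.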


Chemical Entity Mentions: InChI



InChI

successor of SMILES, the IUPAC International Chemical Identifier (InChI)

current version is 1.03 and was released in June 2010

InChI algorithm converts input structural information into a unique

InChI identifier in a three-step process:

normalization (to remove redundant info.), canonicalization (to generate a

unique number label for each atom), serialization (to give a string of chars)

unique representation

standard InChI for ‘ethanol’ is ‘InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3’

less human readable than SMILES, can get quite lengthy

allow direct structure search

limited readability of such specifications for humans

therefore trivial names are used more frequently in scientific texts
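
The serialized identifier is a slash-separated list of layers, which a few lines of code can take apart. This is only a sketch: `split_inchi` is a hypothetical helper, and it assumes the identifier has a formula layer directly after the version and that every later layer starts with a one-letter prefix (such as `c` for connectivity and `h` for hydrogens).

```python
def split_inchi(inchi):
    """Split an InChI string into its version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    # Assumption: parts[0] is the version, parts[1] the chemical formula.
    version, formula, layers = parts[0], parts[1], parts[2:]
    return {
        "version": version,
        "formula": formula,
        # Key each remaining layer by its one-letter prefix.
        "layers": {layer[0]: layer[1:] for layer in layers},
    }


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

For the ethanol identifier, the formula layer is `C2H6O`, so a formula index can be fed directly from InChI strings found in text.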


Chemical Entity Mentions: IUPAC Nomenclature



IUPAC Nomenclature

set of rules to generate systematic names for chemical compounds to

ensure that a chemical name leaves no ambiguity as to what it refers to,

i.e., each chemical name should refer to a single substance

the most widely used chemical nomenclature worldwide

does not hold the complete structure information (unlike InChI)

human readable (unlike InChI), and more human readable than SMILES

developed and kept up to date under the auspices of the International

Union of Pure and Applied Chemistry (IUPAC)

along with trivial names (shown in next slide), more common in

scientific text than SMILES or InChI, as they are more human readable


Chemical Entity Mentions: Trivial Names



Trivial Names

in biology and chemistry, a common name or vernacular name is a

non-systematic name or non-scientific name

most human readable of all four main representations

the name is not recognized according to the rules of any formal

(e.g. IUPAC) system of nomenclature

unlike SMILES & InChI

along with IUPAC names, more common in scientific text than SMILES or

InChI as more human readable

IUPAC names and trivial names do not allow for direct structure search



(Generic) Entity Extraction from Text

Hidden Markov Models (HMMs) [Baum et al., 1970]

commonly used to label or segment sequences

independence assumption: given the hidden state, observations are

independent of each other

hence, cannot capture the interactions between adjacent tokens

Maximum Entropy (ME) [Borthwick, 1999]

exponential probability model based on binary features extracted from sequences

parameters are estimated by maximum likelihood estimation (MLE)

Maximum Entropy Markov Models (MEMMs) [McCallum et al.,

2000]

exp. prob. models that take the observation features as input, and

output a prob. distribution over possible next states

suffer from the label-bias problem


(Generic) Entity Extraction from Text

Conditional Random Field (CRF) [Lafferty et al. 2001]

unlike HMM & MEMM (that use directed graphical models), uses an

undirected graphical model

relaxes conditional independence assumption of HMMs

avoids the label-bias problem of MEMMs

used for labeling sequences

named entity recognition [McCallum & Li, 2003]

detecting biological entities

e.g., proteins [Settles, 2005] and

genes [McDonald & Pereira, 2005]



Chemical Entity Extraction

challenging problem due to ambiguity, different representations, etc.

examples of chemical formulae, names, and ambiguous terms




several early approaches (machine learning & rule based)

automatic recognition of chemical names in natural text

first by [Hodge et al., 1989]

Bayesian classification using n-grams [Wilbur et al., 1999]

rule based algorithms by [Narayanaswamy et al., 2003]

unsupervised approaches by [Vasserman, 2004]

OSCAR3 [Corbett & Murray-Rust, 2006]

unsupervised method using n-grams, but further uses Kneser-Ney

smoothing [Kneser & Ney, 1995]

also used n-gram-based models & MEMM [Corbett & Copestake, 2008]

F1 of 0.807 on chemical journals, 0.832 on PubMed abstracts

MEMM has shorter training cycles, but suffers from the label-bias

problem




CRF based approaches by [Sun et al., 2007, 2008, 2010; Klinger

et al., 2008]

avoids label-bias problem

requires more training time than MEMMs

testing time is comparable with MEMMs

better precision & recall values reported [Sun et al., 2010]

Classifiers such as SVMs can be used to tag chemical formulae

asymmetric binary classification problem on imbalanced data

many more false samples than true samples

precision and recall of true samples are more important than overall accuracy

decision boundary dominated by false samples

cost sensitive classification & decision threshold tuning studied for

imbalanced data [Shanahan & Roma, 2003]
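
Decision threshold tuning on imbalanced data can be sketched as follows. The scores stand in for classifier outputs (for example SVM decision values), the toy data are invented for illustration, and selecting the threshold by F1 on the true class is an assumption of this sketch rather than the procedure of the cited work.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall and F1 of the true class at a given threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Pick the observed score that maximizes F1 when used as the threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Toy, invented data: many false samples, few true ones.
scores = [-1.2, -0.9, -0.7, -0.4, -0.3, -0.1, 0.2, 0.6]
labels = [False, False, False, False, True, False, True, True]
tuned = best_threshold(scores, labels)
```

On this toy set, the default threshold of 0 misses one true sample, while the tuned threshold recovers it at the cost of one false positive.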




Feature Sets

utilizing parts-of-speech tagging tools (e.g., OpenNLP), a lexicon of

chemical terms (e.g., WordNet)

example of the feature set used in [Sun et al., 2010]
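
As a rough illustration of what token-level features for a sequence tagger might look like, consider the sketch below. It is not the feature set of [Sun et al., 2010]: the feature names, the helper functions, and the deliberately incomplete `ELEMENTS` set are all assumptions made for this example.

```python
import re

# A small, deliberately incomplete set of element symbols (assumption).
ELEMENTS = {"H", "C", "N", "O", "S", "P", "Cl", "Na", "K", "Fe"}
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def looks_like_formula(token):
    """True if the token is a run of known element symbols with optional counts."""
    pos = 0
    while pos < len(token):
        m = FORMULA_TOKEN.match(token, pos)
        if not m or m.group(1) not in ELEMENTS:
            return False
        pos = m.end()
    return len(token) > 0


def token_features(tokens, i):
    """Surface and context features for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_upper": tok.isupper(),
        "looks_like_formula": looks_like_formula(tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }
```

The token "OH" in "Columbus, OH" still looks like a formula on the surface, which is why context features such as the previous and next token matter for disambiguation.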




enables partial chemical name search

e.g., acetaldoxime is segmented into acet & aldoxime, aldoxime is

further segmented into ald & oxime.

if the end user searches for aldoxime or oxime, the documents referring

to acetaldoxime will be returned by the system

early approaches

breaking down the chemical name into its morphemes [Garfield,

1962]

does not work well since it attempts to match the longest string from the

right to left with dictionary entries

context free grammars [Cooke-Fox et al., 1989]

shown to be ineffective by [Brecher, 1999], because people use

chemical names that do not conform to formalized rules
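
The dictionary-based idea can be sketched with a greedy longest match scanning from the right, applied recursively to reproduce the acetaldoxime example. The four-entry dictionary and both function names are hypothetical, and this is an illustration of the general approach rather than any of the cited implementations.

```python
def segment_right_to_left(name, dictionary):
    """Greedy longest-match segmentation scanning from the right end.

    Returns the list of sub-terms, or None if part of the name has no
    dictionary entry.
    """
    segments = []
    end = len(name)
    while end > 0:
        for start in range(end):  # smallest start gives the longest candidate
            if name[start:end] in dictionary:
                segments.append(name[start:end])
                end = start
                break
        else:
            return None  # no dictionary entry ends at this position
    return segments[::-1]


def segment_hierarchically(name, dictionary):
    """Segment a name, then segment each sub-term again, as nested lists."""
    # Exclude the name itself so that it must be split into smaller entries.
    parts = segment_right_to_left(name, dictionary - {name})
    if parts is None:
        return name
    return [segment_hierarchically(p, dictionary) for p in parts]


# Hypothetical dictionary of known sub-terms.
SUBTERMS = {"acet", "aldoxime", "ald", "oxime"}
```

Indexing every term that appears at any level of the nested result lets a query for aldoxime or oxime reach documents that mention acetaldoxime.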




OPSIN, a subsystem of the OSCAR3 system [Corbett & Murray-Rust, 2006]

an Open Parser for Systematic IUPAC Nomenclature (OPSIN)

released under an open-source license along with OSCAR3

uses a finite-state grammar, ‘less expressive but more tractable’ than

context-free grammars

their tokenization is based on “a list of multi-character tokens” and “a

set of regular expressions”; both created manually


Chemical Entity Indexing

Chemical Entity (Formula and Name) Indexing

the set of partial formulae of all chemical formulae is

quite large

many of them carry redundant information

index selected and discriminative partial formulae only

segmenting chemical names into “meaningful” substrings and

indexing them

e.g., for ‘methylethyl’, indexing ‘methyl’ & ‘ethyl’ is enough, while

‘hyleth’ is not necessary


Chemical Entity Indexing: Formulae

Chemical Formula Indexing

same molecule may have different formula representations

‘acetic acid’ can be represented as ‘CH3COOH’ and ‘C2H4O2’

same formula can represent different molecules

C2H4O2 can be ‘acetic acid’ (CH3COOH) or ‘methyl formate’

(CH3OCHO)

indexing all sub-formulae is prohibitively expensive

sub-formulae of CH3OH are C, H3, O, H, CH3, H3O, OH, CH3O,

H3OH, CH3OH

query logs would reveal which sub-formulae are cost-effective to

index

when query logs are not available, the assumption can be made that

infrequent sub-formulae will not be queried frequently [Yan et al.,

2004; Sun et al., 2010]
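
The sub-formula enumeration for CH3OH, and frequency-based pruning in the absence of query logs, can be sketched as below. Treating a sub-formula as a contiguous run of element tokens matches the CH3OH list above; pruning by document support alone is a simplifying assumption, and the function names and the `min_support` parameter are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Z][a-z]?\d*")  # one element symbol plus optional count


def sub_formulae(formula):
    """All contiguous runs of element tokens, shortest first."""
    tokens = TOKEN.findall(formula)
    subs = []
    for length in range(1, len(tokens) + 1):
        for start in range(len(tokens) - length + 1):
            subs.append("".join(tokens[start:start + length]))
    return subs


def frequent_sub_formulae(formulae, min_support):
    """Keep sub-formulae that occur in at least min_support distinct formulae."""
    support = Counter()
    for formula in formulae:
        support.update(set(sub_formulae(formula)))  # count each formula once
    return {s for s, n in support.items() if n >= min_support}


corpus = ["CH3OH", "CH3COOH", "C2H5OH"]
index_terms = frequent_sub_formulae(corpus, min_support=2)
```

A formula with n tokens has n(n+1)/2 contiguous sub-formulae, which is why pruning matters as formulae get longer.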


Chemical Entity Indexing: Name

Chemical Name Indexing

before a chemical name is indexed, it should be segmented into

its sub-terms (or morphemes)

e.g., ‘10-Hydroxy-trans-3-oxadecalin’ will first be segmented into ‘10’,

‘hydroxy’, ‘trans’, ‘3’, ‘oxadecalin’

then those terms will further be segmented into their

subterms

e.g., ‘oxadecalin’ will be segmented into ‘oxa’ and ‘decalin’

frequent sub-names can be mined, and can then be used for

segmenting chemical names into sub-terms

maximal frequent subsequence mining can be found in [Yang, 2004]

frequent subsequence mining and hierarchical segmentation details

can be found in [Sun et al., 2010]
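
A brute-force sketch of frequent sub-name mining is given below. It simply counts, for every substring, how many names contain it and then keeps the maximal ones; this is a stand-in for illustration, not the mining algorithm of [Yang, 2004] or [Sun et al., 2010], and the function names, `min_support`, and `min_length` are assumptions.

```python
from collections import Counter


def frequent_substrings(names, min_support, min_length=3):
    """Substrings of at least min_length that occur in >= min_support names."""
    support = Counter()
    for name in names:
        # Collect the distinct substrings of this name so it counts once.
        seen = {
            name[i:j]
            for i in range(len(name))
            for j in range(i + min_length, len(name) + 1)
        }
        support.update(seen)
    return {s for s, n in support.items() if n >= min_support}


def maximal(substrings):
    """Drop substrings that are contained in a longer frequent substring."""
    return {s for s in substrings if not any(s != t and s in t for t in substrings)}


names = ["methylbenzene", "methylamine", "ethylamine"]
sub_names = maximal(frequent_substrings(names, min_support=2))
```

On these three names the maximal frequent substrings are "methyl" and "ethylamine"; shorter fragments such as "hyl" are frequent too but are subsumed, so they need not be used as segmentation units.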


Chemical Entity Search: Formula

Chemical Formula Search

can be grouped into 4 categories

Exact formula search:

a user specifying a query formula gets back documents containing formulae

that match the query exactly

e.g., C1-2H4-6 will return CH4 or C2H6, but not H4C

Frequency search:

most current chemistry databases support frequency searches as the

only query models for formula searches

for a user query C2H4-6,

full-frequency search returns formulae with two C and four to six H, but no other atoms

partial-frequency search returns formulae with two C, four to six H, and any number of other atoms
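
Exact and frequency searches with range queries such as C1-2H4-6 can be sketched as follows. The query syntax is taken from the examples above; the function names are hypothetical, and the sketch assumes each element appears at most once in a query.

```python
import re

ATOM = re.compile(r"([A-Z][a-z]?)(\d*)")
QUERY_ATOM = re.compile(r"([A-Z][a-z]?)(\d*)(?:-(\d+))?")


def parse_formula(formula):
    """'C2H6' -> [('C', 2), ('H', 6)], keeping the written order."""
    return [(el, int(n) if n else 1) for el, n in ATOM.findall(formula)]


def parse_query(query):
    """'C2H4-6' -> [('C', 2, 2), ('H', 4, 6)] as (element, min, max)."""
    result = []
    for el, lo, hi in QUERY_ATOM.findall(query):
        lo = int(lo) if lo else 1
        result.append((el, lo, int(hi) if hi else lo))
    return result


def exact_match(query, formula):
    """Same elements in the same order, each count inside the query range."""
    q, f = parse_query(query), parse_formula(formula)
    return len(q) == len(f) and all(
        qe == fe and lo <= n <= hi for (qe, lo, hi), (fe, n) in zip(q, f)
    )


def frequency_match(query, formula, partial=False):
    """Compare total atom counts, ignoring the order in which atoms are written."""
    counts = {}
    for el, n in parse_formula(formula):
        counts[el] = counts.get(el, 0) + n
    q = parse_query(query)
    if not all(lo <= counts.get(el, 0) <= hi for el, lo, hi in q):
        return False
    # Full-frequency search rejects formulae that contain any extra element.
    return partial or set(counts) == {el for el, _, _ in q}
```

Note the difference in strictness: `exact_match("C1-2H4-6", "H4C")` fails because the order differs, whereas a frequency search for C2H4-6 accepts CH3CH3 because only the atom totals are compared.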


Chemical Entity Search: Formula

Chemical Formula Search

Subsequence search

the system returns documents with formulae that contain the

query formula as a subsequence

e.g., for query COOH,

COOH is exact match (high score), HOOC is reverse match (medium

score), CHO2 is parsed match (low score)

Similarity formula search

e.g., for query H2CO3

HC(O)OOH has higher ranking score than HNO3

computing similarity between the query formula and all formulae in

text is expensive

extract a feature vector of partial formulas out of the query formula

(where each dimension is an indexed partial formula), and calculate

the score accordingly [Haussler, 1999; Sun et al., 2010]
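
The three match types of the subsequence search (exact, reverse, and parsed, for the query COOH) can be told apart with a short sketch. The interpretation of a parsed match as "the formula contains at least the atoms of the query" is an assumption of this example, as are the function names; the actual scoring function is described in [Sun et al., 2010].

```python
import re
from collections import Counter

ATOM = re.compile(r"([A-Z][a-z]?)(\d*)")


def tokens(formula):
    """'CHO2' -> [('C', 1), ('H', 1), ('O', 2)]"""
    return [(el, int(n) if n else 1) for el, n in ATOM.findall(formula)]


def contains_run(haystack, needle):
    """True if needle occurs as a contiguous run inside haystack."""
    k = len(needle)
    return any(haystack[i:i + k] == needle for i in range(len(haystack) - k + 1))


def atom_counts(formula):
    counts = Counter()
    for el, n in tokens(formula):
        counts[el] += n
    return counts


def subsequence_match(query, formula):
    """Classify the match as 'exact', 'reverse', 'parsed', or None."""
    q, f = tokens(query), tokens(formula)
    if contains_run(f, q):
        return "exact"
    if contains_run(f, q[::-1]):
        return "reverse"
    need, have = atom_counts(query), atom_counts(formula)
    if all(have[el] >= n for el, n in need.items()):
        return "parsed"
    return None
```

A ranking function can then map the three labels to decreasing scores, in line with the high, medium, and low scores listed for the COOH query.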


Chemical Entity Search: Name

Chemical Name Search

Exact name search

the system returns the documents with the chemical names that

contain the exact query keyword

Substring name search

returns a ranked list of documents containing chemical names that

contain the user-provided keyword as a substring

if query string is indexed, results are retrieved directly

otherwise, the query string is segmented hierarchically into substrings, and

these are looked up

segmenting continues until a substring is retrieved in the index
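
The lookup logic for substring name search might be sketched as below. Splitting the query around its longest indexed substring is an assumed segmentation strategy chosen for brevity, `lookup` is a hypothetical function, and the toy index is invented; intersecting postings only yields candidates, which a real system would still verify against the full names.

```python
def lookup(query, index):
    """Return ids of documents whose chemical names may contain the query.

    `index` maps an indexed sub-term to a set of document ids. If the query
    itself is not indexed, it is split around its longest indexed substring
    and the remaining pieces are looked up recursively; postings are
    intersected.
    """
    if query in index:
        return set(index[query])  # indexed query: retrieve directly
    for length in range(len(query) - 1, 0, -1):  # longest substrings first
        for start in range(len(query) - length + 1):
            piece = query[start:start + length]
            if piece in index:
                docs = set(index[piece])
                for rest in (query[:start], query[start + length:]):
                    if rest:
                        docs &= lookup(rest, index)
                return docs
    return set()  # no part of the query is indexed


# Toy inverted index (invented data).
index = {"methyl": {1, 2}, "ethyl": {1, 2, 3}, "oxime": {4}}
```

Here the query "methylethyl" is not indexed, so it is split into "methyl" and "ethyl" and the two posting sets are intersected.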



Conjunctive search

Conjunctive chemical formulae search

conjunctive searches of the basic chemical entity searches are

supported

for formulae that have two to four C, four to ten H, and contain the

subsequence CH2,

the user can run a conjunctive search combining a full-frequency search for C2-4H4-10 and

a subsequence formula search for CH2.

Conjunctive chemical name search

user can define multiple substrings in a query so that a matching

chemical name must contain both of them

chemical names where both substrings appear in order are given

higher priority than those in which only one appears



Query rewriting

when a user inputs a query that contains chemical formulae,

chemical names, and other keywords, the search engine

proceeds as follows:

chemical entity searches are executed to find desired names and

formulae

returned entities as well as other keywords (non-chemical-formula

and non-chemical-name) are used to retrieve related documents

TF.IDF can be used as the ranking function in the second stage

the ranking scores of each returned chemical entity in the first stage

can be used as weights of the TF.IDF of each chemical entity when

computing the ranking score in the second stage [Sun et al., 2010]
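
The second-stage scoring can be sketched as a weighted TF.IDF sum. The smoothed IDF variant, the tokenized toy documents, and the first-stage weights below are all assumptions made for illustration; the exact ranking function is given in [Sun et al., 2010].

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def score(doc, entity_weights, keywords, docs):
    """TF.IDF score where each chemical entity is weighted by its stage-one score."""
    total = 0.0
    for entity, weight in entity_weights.items():
        total += weight * doc.count(entity) * idf(entity, docs)
    for kw in keywords:
        total += doc.count(kw) * idf(kw, docs)
    return total


# Toy tokenized documents (invented data).
docs = [
    ["acetic", "acid", "CH3COOH", "synthesis"],
    ["C2H4O2", "synthesis", "synthesis"],
    ["ethanol", "oxidation"],
]
# Hypothetical stage-one output: entities matching the query, with scores.
entity_weights = {"CH3COOH": 1.0, "C2H4O2": 0.6}
ranked = sorted(
    range(len(docs)),
    key=lambda i: score(docs[i], entity_weights, ["synthesis"], docs),
    reverse=True,
)
```

The lower weight on C2H4O2 reflects that it is a weaker match for the query entity, yet the second document can still rank first when the non-chemical keyword occurs more often in it.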


Text Retrieval Conference: Chemical Track

TREC 2009, 2010 Chemical Tracks

large scale domain specific (i.e., chemistry) IR evaluation tasks

following the Legal and Genomics Tracks at TREC

Data

1.3 million patent files from the chemical domain (classified under IPC

codes C and A61K)

All data in structured XML format

fields such as title, abstract, claims (for patents) can be identified easily and

should be utilized for the tasks

DTDs are available for both patents and the scientific articles

images are available when the publisher provides them

chemical structure information is available in the form of CDX and

MOL files




Data

data from 3 major patent offices:

USPTO (US patent office),

EPO (European Patent Office), and

WIPO (World Intellectual Property Organization)

181,076 scientific articles from

The Royal Society of Chemistry

all Open Access journals from PubMed Central (as of Jan 2010)

Oxford Publishing,

Hindawi Publishing

International Union of Crystallography

Molecular Diversity Preservation International




Technology Survey

similar to the search engine / information need scenario described in

previous slides

30 expert created queries

In 2009: the task is to find relevant documents for a natural language

expression of an information need

In 2010: two tasks

first, the same task as in 2009, i.e., natural language query

second, structure search, the query is a chemical structure rather than a

chemical name

utilization of the document structure is crucial

utilization of the chemical entity mentions is crucial, yet quite hard in

such large scale data




Prior-Art Search

given a patent file, the task is to find all relevant patent files for the

query patent

1000 query patents

automatic evaluation process compares the set of identified relevant

patents with the set of known references from the query patent

generation of the search query from the given query patent is crucial

query patent is too long

structured nature of the query patent should be utilized

different fields give different information

chemical entity identification is very important, but hard to deal with

in such large scale data

re-ranking with respect to International Patent Classification (IPC) codes is

beneficial

References

Main References:

Sun, B., Mitra, P., Giles, L., Mueller, K. Identifying, indexing,

ranking chemical formulae and chemical names in digital

text. ACM TOIS, 2010.

Sun, B., Mitra, P., Giles, L. Mining, indexing, and searching for

textual chemical molecule information on the web. WWW,

2008.

Sun, B., Tan, Q., Mitra, P., Giles, L. Extraction and search of

chemical formulae in text document on the web. WWW,

2007.

Klinger, R., Kolarik, C., Fluck, J., Hofmann-Apitius, M., Friedrich, C.

Detection of IUPAC and IUPAC-like chemical names.

Bioinformatics, 2008.


References

Main References:

Corbett, P., Murray-Rust, P. High-throughput identification of

chemistry in life-science texts. CompLife, 2006.

For original images & references to the mentioned tools, please

either conduct an online search with their names or refer to

the original articles above


Questions ?

Please let us know in case of any

questions/issues!

Further info: {scetinta, lsi}@cs.purdue.edu
