Extract Chemical Information from Patents Using

Preview:

Citation preview

Extract Chemical Information from Patents

Using Chemicalize and D2S

(Document to Structure)

Wei Deng (David), Daniel Bonniot, Andras Stracz

International Conference and Exhibition on

Computer Aided Drug Design & QSAR

Oct 30th, 2012

Chicago, IL, USA

ChemAxon’s Naming Technology

● Structure to Name

● Name to Structure

● Document to Structure

● Document to Database

● Chemicalize

2

ChemAxon’s Naming Technology

• Structure to Name

– IUPAC Name, traditional names

– Reaching maturity

– Still upcoming: peptides, some fused systems

• Name to structure

– IUPAC, CAS and systematic names

– Dictionary of common names and drug names

– Support CAS Registry number (webservice)

– Homology group: alkyl, aryl …

– Future: Biological names, polymers, ...

• Accuracy and coverage constantly improving

• Also available from command-line

3

Name to Structure Internals

• Dictionary of common and drug names

– Uses nine different source dictionaries

– Harmonized using weighted consensus method

– 150K names for 40K unique structures

– Custom dictionary and plug-in system

• Systematic names

– Proprietary algorithm

– ”Understands” systematic names

– Example:

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

4

Systematic Name Example

5

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

6

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

7

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

8

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

9

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

1

0

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Parsing

1

1

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Systematic Name: Structure Generation

1

2

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

OCR Error Correction

1

3

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

OCR Error Correction

1

4

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

OCR Error Correction

1

5

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-

yl)phenyl]propanamide

?-benzyl-?-[3-(?H-tetrazol-5-

yl)phenyl]propanamide

OCR Error Correction

1

6

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-

yl)phenyl]propanamide

N-benzyl-N-[3-(1H-tetrazol-5-

yl)phenyl]propanamide

ChemAxon’s “Document to Structure”

• Extract chemical information from

documents – Names: powered by the Naming Technology

– Also import smiles, InChI, CAS number …

– Images: OSRA …

– Works with scanned non-searchable PDF

– Returns structures and their location in the document

• Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …

– Embedded structure objects (ChemDraw, Symyx, Marvin,

…)

– PDF, text, XML, HTML

17

From Document to Structures

1

8

Non-searchable patent (50 pages) Structure (text + image) + location

Instant JChem Example

• In Instant JChem, search for

19

Search by Structure or Text

20

Non-searchable PDF is now Searchable

21

“Document to Database”

22

ChemAxon’s “Document to Database”

• Data in DB:

– Structures

– Source (name, smiles, embedded, …) and

location

– Documents, Authors, Metadata...

• Questions:

– What structures appear in a specific document?

– What documents contain a

structure/substructure/...?

– What documents written since 2010 in location X

contain substructure S?

... 23

ChemAxon’s “Document to Database”

• Customizable:

– Metadata extracted from documents

– Interface (IJC forms, webapp)

• Demo:

– One month of US patents

– 85K unique structures from systematic names

– 1M occurrences

24

Summary

● Extensive, improving naming technologies

(n2s, s2n)

● Increasing support for Document mining

(d2s, d2db, SharePoint)

● Putting it all together → going large scale,

extract and use valuable chemical

information

● Still component-based and responding to

our users requests

2

5

Free Online Service Chemicalize.org

• Extract

• Interactively display

• Calculate

• Search

26

Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–

615

27

Webpage - Chemicalized

• Customizable report layout for calculation results. Users can move, open, close, expand calculation boxes and this is remembered on the next visit

Data Page: Extensive Predicted Properties

29

Webpage - Chemicalized

PDF File - Chemicalized

Aspirin: query highlighted in results

Searching Chemicalize.org – Structure Search

• Aspirin; web page hits - “show” related structures

• Autosuggest while typing

Searching Chemicalize.org – Keyword Search

Availability and Customization

• Source code available

• Minor changes required on example codes

for customization, such as

– Import extracted structures to other databases

– Post-process filtering according to properties

– Batch process of multiple documents

34

Recommended