34
Extract Chemical Information from Patents Using Chemicalize and D2S (Document to Structure) Wei Deng (David), Daniel Bonniot, Andras Stracz International Conference and Exhibition on Computer Aided Drug Design & QSAR Oct 30 th , 2012 Chicago, IL, USA

Extract Chemical Information from Patents Using

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Extract Chemical Information from Patents Using

Extract Chemical Information from Patents

Using Chemicalize and D2S

(Document to Structure)

Wei Deng (David), Daniel Bonniot, Andras Stracz

International Conference and Exhibition on

Computer Aided Drug Design & QSAR

Oct 30th, 2012

Chicago, IL, USA

Page 2: Extract Chemical Information from Patents Using

ChemAxon’s Naming Technology

● Structure to Name

● Name to Structure

● Document to Structure

● Document to Database

● Chemicalize

2

Page 3: Extract Chemical Information from Patents Using

ChemAxon’s Naming Technology

• Structure to Name

– IUPAC Name, traditional names

– Reaching maturity

– Still upcoming: peptides, some fused systems

• Name to structure

– IUPAC, CAS and systematic names

– Dictionary of common names and drug names

– Support CAS Registry number (webservice)

– Homology group: alkyl, aryl …

– Future: Biological names, polymers, ...

• Accuracy and coverage constantly improving

• Also available from command-line

3

Page 4: Extract Chemical Information from Patents Using

Name to Structure Internals

• Dictionary of common and drug names

– Uses nine different source dictionaries

– Harmonized using weighted consensus method

– 150K names for 40K unique structures

– Custom dictionary and plug-in system

• Systematic names

– Proprietary algorithm

– ”Understands” systematic names

– Example:

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

4

Page 5: Extract Chemical Information from Patents Using

Systematic Name Example

5

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 6: Extract Chemical Information from Patents Using

Systematic Name: Parsing

6

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 7: Extract Chemical Information from Patents Using

Systematic Name: Parsing

7

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 8: Extract Chemical Information from Patents Using

Systematic Name: Parsing

8

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 9: Extract Chemical Information from Patents Using

Systematic Name: Parsing

9

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 10: Extract Chemical Information from Patents Using

Systematic Name: Parsing

1

0

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 11: Extract Chemical Information from Patents Using

Systematic Name: Parsing

1

1

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 12: Extract Chemical Information from Patents Using

Systematic Name: Structure Generation

1

2

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 13: Extract Chemical Information from Patents Using

OCR Error Correction

1

3

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

Page 14: Extract Chemical Information from Patents Using

OCR Error Correction

1

4

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 15: Extract Chemical Information from Patents Using

OCR Error Correction

1

5

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-

yl)phenyl]propanamide

?-benzyl-?-[3-(?H-tetrazol-5-

yl)phenyl]propanamide

Page 16: Extract Chemical Information from Patents Using

OCR Error Correction

1

6

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-

yl)phenyl]propanamide

N-benzyl-N-[3-(1H-tetrazol-5-

yl)phenyl]propanamide

Page 17: Extract Chemical Information from Patents Using

ChemAxon’s “Document to Structure”

• Extract chemical information from

documents – Names: powered by the Naming Technology

– Also import smiles, InChI, CAS number …

– Images: OSRA …

– Works with scanned non-searchable PDF

– Returns structures and their location in the document

• Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …

– Embedded structure objects (ChemDraw, Symyx, Marvin,

…)

– PDF, text, XML, HTML

17

Page 18: Extract Chemical Information from Patents Using

From Document to Structures

1

8

Non-searchable patent (50 pages) Structure (text + image) + location

Page 19: Extract Chemical Information from Patents Using

Instant JChem Example

• In Instant JChem, search for

19

Page 20: Extract Chemical Information from Patents Using

Search by Structure or Text

20

Page 21: Extract Chemical Information from Patents Using

Non-searchable PDF is now Searchable

21

Page 22: Extract Chemical Information from Patents Using

“Document to Database”

22

Page 23: Extract Chemical Information from Patents Using

ChemAxon’s “Document to Database”

• Data in DB:

– Structures

– Source (name, smiles, embedded, …) and

location

– Documents, Authors, Metadata...

• Questions:

– What structures appear in a specific document?

– What documents contain a

structure/substructure/...?

– What documents written since 2010 in location X

contain substructure S?

... 23

Page 24: Extract Chemical Information from Patents Using

ChemAxon’s “Document to Database”

• Customizable:

– Metadata extracted from documents

– Interface (IJC forms, webapp)

• Demo:

– One month of US patents

– 85K unique structures from systematic names

– 1M occurrences

24

Page 25: Extract Chemical Information from Patents Using

Summary

● Extensive, improving naming technologies

(n2s, s2n)

● Increasing support for Document mining

(d2s, d2db, SharePoint)

● Putting it all together → going large scale,

extract and use valuable chemical

information

● Still component-based and responding to

our users requests

2

5

Page 26: Extract Chemical Information from Patents Using

Free Online Service Chemicalize.org

• Extract

• Interactively display

• Calculate

• Search

26

Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–

615

Page 27: Extract Chemical Information from Patents Using

27

Webpage - Chemicalized

Page 28: Extract Chemical Information from Patents Using

• Customizable report layout for calculation results. Users can move, open, close, expand calculation boxes and this is remembered on the next visit

Data Page: Extensive Predicted Properties

Page 29: Extract Chemical Information from Patents Using

29

Webpage - Chemicalized

Page 30: Extract Chemical Information from Patents Using

PDF File - Chemicalized

Page 31: Extract Chemical Information from Patents Using

Aspirin: query highlighted in results

Searching Chemicalize.org – Structure Search

Page 32: Extract Chemical Information from Patents Using

• Aspirin; web page hits - “show” related structures

• Autosuggest while typing

Searching Chemicalize.org – Keyword Search

Page 34: Extract Chemical Information from Patents Using

Availability and Customization

• Source code available

• Minor changes required on example codes

for customization, such as

– Import extracted structures to other databases

– Post-process filtering according to properties

– Batch process of multiple documents

34