Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Extract Chemical Information from Patents
Using Chemicalize and D2S
(Document to Structure)
Wei Deng (David), Daniel Bonniot, Andras Stracz
International Conference and Exhibition on
Computer Aided Drug Design & QSAR
Oct 30th, 2012
Chicago, IL, USA
ChemAxon’s Naming Technology
● Structure to Name
● Name to Structure
● Document to Structure
● Document to Database
● Chemicalize
2
ChemAxon’s Naming Technology
• Structure to Name
– IUPAC Name, traditional names
– Reaching maturity
– Still upcoming: peptides, some fused systems
• Name to structure
– IUPAC, CAS and systematic names
– Dictionary of common names and drug names
– Support CAS Registry number (webservice)
– Homology group: alkyl, aryl …
– Future: Biological names, polymers, ...
• Accuracy and coverage constantly improving
• Also available from command-line
3
Name to Structure Internals
• Dictionary of common and drug names
– Uses nine different source dictionaries
– Harmonized using weighted consensus method
– 150K names for 40K unique structures
– Custom dictionary and plug-in system
• Systematic names
– Proprietary algorithm
– ”Understands” systematic names
– Example:
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
4
Systematic Name Example
5
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Parsing
6
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Parsing
7
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Parsing
8
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Parsing
9
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Parsing
1
0
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Parsing
1
1
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Systematic Name: Structure Generation
1
2
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
OCR Error Correction
1
3
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
OCR Error Correction
1
4
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
OCR Error Correction
1
5
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-
yl)phenyl]propanamide
?-benzyl-?-[3-(?H-tetrazol-5-
yl)phenyl]propanamide
OCR Error Correction
1
6
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-
yl)phenyl]propanamide
N-benzyl-N-[3-(1H-tetrazol-5-
yl)phenyl]propanamide
ChemAxon’s “Document to Structure”
• Extract chemical information from
documents – Names: powered by the Naming Technology
– Also import smiles, InChI, CAS number …
– Images: OSRA …
– Works with scanned non-searchable PDF
– Returns structures and their location in the document
• Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin,
…)
– PDF, text, XML, HTML
17
From Document to Structures
1
8
Non-searchable patent (50 pages) Structure (text + image) + location
Instant JChem Example
• In Instant JChem, search for
19
Search by Structure or Text
20
Non-searchable PDF is now Searchable
21
“Document to Database”
22
ChemAxon’s “Document to Database”
• Data in DB:
– Structures
– Source (name, smiles, embedded, …) and
location
– Documents, Authors, Metadata...
• Questions:
– What structures appear in a specific document?
– What documents contain a
structure/substructure/...?
– What documents written since 2010 in location X
contain substructure S?
... 23
ChemAxon’s “Document to Database”
• Customizable:
– Metadata extracted from documents
– Interface (IJC forms, webapp)
• Demo:
– One month of US patents
– 85K unique structures from systematic names
– 1M occurrences
24
Summary
● Extensive, improving naming technologies
(n2s, s2n)
● Increasing support for Document mining
(d2s, d2db, SharePoint)
● Putting it all together → going large scale,
extract and use valuable chemical
information
● Still component-based and responding to
our users requests
2
5
Free Online Service Chemicalize.org
• Extract
• Interactively display
• Calculate
• Search
26
Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–
615
27
Webpage - Chemicalized
• Customizable report layout for calculation results. Users can move, open, close, expand calculation boxes and this is remembered on the next visit
Data Page: Extensive Predicted Properties
29
Webpage - Chemicalized
PDF File - Chemicalized
Aspirin: query highlighted in results
Searching Chemicalize.org – Structure Search
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
Searching Chemicalize.org – Keyword Search
Everything is Published
• Recent viewed
– Webpages
– Structures
http://www.chemicalize.org/blog/2012/09/26/1-million-
molecules/
– Documents
– Searched queries (structure and keyword)
33
Availability and Customization
• Source code available
• Minor changes required on example codes
for customization, such as
– Import extracted structures to other databases
– Post-process filtering according to properties
– Batch process of multiple documents
34