Matthias Negri, PhD
Scientific Information Center
Boehringer Ingelheim Pharma GmbH & Co. KG
Chemistry-Enriched Patent Curation
semi-automatic analysis and elaboration of patents
ChemAxon UGM 2015, Budapest, 20 May 2015
Árpád Figyelmesi
ChemAxon
Content
1. Chemistry in patents
2. Why do we need a patent curation workflow?
3. Semi-automatic Patent Curation Workflow - Overview
4. Linked tools/technologies
5. ChemCurator (ChemCC)
6. Semi-automatic Patent Curation Workflow – Step by Step
7. Lessons learned, weak-points, limitations
8. Outlook
2Negri Matthias, ChemAxon UGM 2015
Chemistry in patents
Chemistry appears in diverse forms in patents:
1. TEXT - IUPAC names, common names, etc.
2. IMAGES - embedded within or attached to the document
3. ATTACHMENTS (MOL/CDX)
4. TABLES
– as ONE-image file (tables with chemistry and bioactivity data)
– as chemistry-only image files embedded within table tags
5. Markush Structures/Formulas with R-groups
---------------------------------------------------------------------------------------
Currently NO commercial solution covers all these cases
Most of the cases are considered in the patent curation workflow
(Markush/R-group Formulas recognized and stored separately)
Why do we need a patent curation workflow?
Motivations:
1. Linked chemistry-retrieval from patents (+ chemistry as images)
2. IUPAC-enriched XML patent files as NEW source for text-mining
3. extraction of bioactivity data/targets/diseases/… in relation to chemistry
4. Similarity/Substructure frequency in compound sets of patents
5. …
Semi-automatic Patent Curation Workflow
Overview – current state
2 parallel branches
INPUT
KNIME branch: SLOWER & memory intensive, BUT higher quality, more control & IUPAC-enriched XML
ChemCC branch: FASTER, BUT LESS informative/flexible - ChemCC as the (near) future perspective
I2E API in KNIME – batch indexing, text-mining and (relational) data retrieval
Linked tools/technologies
1. KNIME/XPATH
2. ChemAxon ChemCurator (ChemCC)
3. Other ChemAxon tools in KNIME nodes (document2structure/d2s,
Naming, Molconverter, Structure checker, Standardizer, …)
4. Text/data-mining – Linguamatics I2E (+I2E Chemistry)
5. Optical Structure Recognition – Keymodule CLiDE Batch
ChemCurator (ChemCC)
Computer-aided chemical data extraction
English, Chinese and Japanese N2S
Markush Editor
Structure Checker
Hit visualization
Third party OSR technologies
8 Árpád Figyelmesi, ChemAxon UGM 2015
ChemCurator (ChemCC)
Name to Structure
Support for many nomenclatures (common, drug names, …)
IUPAC names
Custom dictionaries
English (2008)
Chinese (2013)
Japanese (2014)
ChemCurator (ChemCC)
Compound Extraction View
Compound list
Project explorer
Annotated document
Selected structures
ChemCurator (ChemCC)
Markush Extraction View
Markush editor
Example structures
Annotated document
Project explorer
Selected structures
Structure checker
ChemCurator (ChemCC)
General Document Curation
Extract Markush Structures from patents
Extract specific structures
Journal articles
Company reports
Patent examples
Structure extraction wizards
Exclude fragments, chemical elements, etc.
ChemCurator (ChemCC)
Integration & Information Sharing
Other ChemAxon products:
Direct IJC schema connection
Project sharing function
Accessible from Plexus, IJC, etc.
Third party tools:
Standard file formats
Export functions
Easily processable projects
Semi-automatic Patent Curation Workflow a) input sources and b) bibliographic data
a) Input sources
files with patent-IDs list
XML collection
…
b) Retrieval of bibliographic information and attachment data
family ID, patent references, expiration date, etc.
Attachment files MOL/CDX (US-patents only), TIF files
….
Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
1. ChemCurator branch
data retrieval (XML, attachments) from IFI Claims Direct BI-server
ChemCurator project creation/sharing/annotation → HTML output
Chemistry extraction name2structure/document2structure → SDF output
Generation of pre-annotated patent set stored as ChemCC projects
Faster, but lower quality in the chemistry extraction process
Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
2. KNIME branch
- OCR-error CLEAN-UP in KNIME → improved chemistry recognition
- MOL/CDX/TIF - standardizer, structure checker → filter formulas, solvents, R-groups
Higher quality and more control in the chemistry extraction process
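A dictionary-based OCR clean-up of chemical names could look roughly like the Python sketch below. It is not the KNIME implementation used in the workflow: the dictionary entries, the function name `clean_ocr_name` and the rule that whitespace after a hyphen is an OCR/line-break artifact are all assumptions made for illustration.

```python
import re

# Hypothetical dictionary of OCR misreadings -> corrections (illustrative only).
OCR_FIXES = {
    "rnethyl": "methyl",   # "m" read as "rn"
    "arnino": "amino",     # "m" read as "rn"
    "pheny1": "phenyl",    # "l" read as "1"
}

def clean_ocr_name(name: str) -> str:
    """Normalize whitespace and apply dictionary-based corrections."""
    # Collapse any whitespace run (including line breaks) to one space.
    cleaned = re.sub(r"\s+", " ", name).strip()
    # Assumption: whitespace directly after a hyphen is an artifact.
    cleaned = re.sub(r"-\s+", "-", cleaned)
    # Replace every known misreading with its correction.
    for wrong, right in OCR_FIXES.items():
        cleaned = cleaned.replace(wrong, right)
    return cleaned

if __name__ == "__main__":
    print(clean_ocr_name("2-arnino-4- rnethylphenol"))
```

A cleaned name is more likely to be accepted by a name-to-structure converter; names that still fail conversion would need manual review.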
Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
2. KNIME branch
MOL → IUPAC
CDX → IUPAC
TIFF (via CLiDE) → IUPAC
Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
2. KNIME - Chemistry “Normalization”
Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks:
IUPAC
string length (different output order of chemicals in multiple-molecule image/multiMOL files)
OCR-correction (“dictionary” based)
(within KNIME) set up a relation between each TIFF/attachment file
1. to (one or more) IUPAC name(s)
2. to a position/section in the text/document
Merge IUPAC
Clean-Up IUPAC
If NO IUPAC, IMG-name is set
“Normalize” IUPAC names
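The merge-and-compare step can be pictured with a small Python sketch. It reflects one reading of the slide, not the actual KNIME nodes: names derived from MOL, CDX and TIF are compared directly, string length is used as a second check for multi-molecule files whose components come out in a different order, and the image name is used as a fallback when no IUPAC name exists. The function names and the result labels are hypothetical.

```python
from typing import Dict, Optional

def normalize(name: str) -> str:
    """Remove all whitespace and lowercase, so trivially different names match."""
    return "".join(name.split()).lower()

def merge_names(names: Dict[str, Optional[str]], image_name: str) -> dict:
    """Merge the names derived from MOL, CDX and TIF for one attachment.

    names:      source label -> IUPAC name, or None if conversion failed
    image_name: fallback identifier when no IUPAC name is available
    """
    available = {src: n for src, n in names.items() if n}
    if not available:
        # No IUPAC name from any source: keep the image name instead.
        return {"name": image_name, "check": "no_iupac"}
    normalized = {src: normalize(n) for src, n in available.items()}
    first = next(iter(available.values()))
    if len(set(normalized.values())) == 1:
        # Check 1: all sources yield the same name.
        return {"name": first, "check": "identical"}
    if len({len(n) for n in normalized.values()}) == 1:
        # Check 2: names differ but have equal length, e.g. the same
        # molecules listed in a different order; flag for review.
        return {"name": first, "check": "same_length"}
    return {"name": first, "check": "mismatch"}
```

For example, `merge_names({"MOL": "ethanol", "CDX": "Ethanol", "TIF": None}, "img-0001.tif")` is labelled `identical`, while an attachment with no converted name at all keeps `img-0001.tif` as its identifier.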
Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names
Chemistry present as text is recognized and extracted either via
- Text-mining (I2E Chemistry – d2s runs in the background) or
- within KNIME/ChemCC using annotate/molconvert
Replacement: <chemistry> vs IUPAC
IUPAC-enriched XML
Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names
OCR-errors in chemical names
TIF
CDX
MOL
This text chunk is replaced by the IUPAC name
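A minimal Python sketch of the replacement step, using only the standard library. It assumes that each attachment is referenced by a `<chemistry>` element carrying an `id` attribute and that a lookup table from that id to an IUPAC name already exists; the `id` attribute, the `<iupac>` replacement tag and the function name are assumptions for illustration, not the format produced by the actual workflow.

```python
import xml.etree.ElementTree as ET

def replace_chemistry(xml_text: str, names: dict) -> str:
    """Replace <chemistry> elements with <iupac> elements holding the name."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so visit every element as a
    # potential parent and swap matching children in place.
    for parent in list(root.iter()):
        for idx, child in enumerate(list(parent)):
            if child.tag != "chemistry":
                continue
            name = names.get(child.get("id"))
            if name is None:
                continue  # no name known: leave the reference untouched
            replacement = ET.Element("iupac")  # hypothetical tag name
            replacement.text = name
            replacement.tail = child.tail      # keep the text that follows
            parent[idx] = replacement
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    doc = ('<p>Compound <chemistry id="CHEM-1"><img file="a.tif"/>'
           '</chemistry> was prepared.</p>')
    print(replace_chemistry(doc, {"CHEM-1": "2-amino-4-methylphenol"}))
```

Because the name ends up as ordinary text inside the document, a text-mining index built on the enriched XML can see the chemistry at the position where the image used to be.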
Semi-automatic Patent Curation Workflow e) Bioactivity/tabular data extraction with KNIME/XPATH
XPATH/XML parsing and extraction of:
Tables
Rows - XML tags & strings
Entries - XML tags & strings
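Outside KNIME, the same table/row/entry extraction can be sketched in Python with the limited XPath subset of the standard library. The element names `table`, `row` and `entry` assume a CALS-style table model in the patent XML; other XML sources may use different tags, and nested tables are not handled here.

```python
import xml.etree.ElementTree as ET

def extract_tables(xml_text: str) -> list:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.findall(".//table"):
        rows = []
        for row in table.findall(".//row"):
            # itertext() also collects text inside nested tags (e.g. <b>, <sub>).
            cells = ["".join(entry.itertext()).strip()
                     for entry in row.findall("entry")]
            rows.append(cells)
        tables.append(rows)
    return tables

if __name__ == "__main__":
    sample = (
        "<doc><table><tgroup><tbody>"
        "<row><entry>Example 1</entry><entry>IC50 = <b>12</b> nM</entry></row>"
        "<row><entry>Example 2</entry><entry>IC50 = 340 nM</entry></row>"
        "</tbody></tgroup></table></doc>"
    )
    for row in extract_tables(sample)[0]:
        print(row)
```

The values in the sample document are invented placeholders; they only show how tags and strings of each entry are flattened into one cell.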
Semi-automatic Patent Curation Workflow f) Text-/datamining with Linguamatics I2E via KNIME
IUPAC-enriched XML as source for I2E API/text-mining
indexing
pre-defined queries
results retrieval
saved as SDF files (KNIME)
(Chemistry-related) information retrieved by text-mining:
Example Nr.
Bioactivity data from tables
Claims, regions where chemistry appears in patents
Genes, diseases
Semi-automatic Patent Curation Workflow f) Bioactivity data using I2E multi-queries – 2 steps
Source: (IUPAC-enriched) XML
1. Example Nr. – IUPAC
2. Example Nr. – Bioactivity data
[Figure: source table and image; for comparison, the chemistry in the PDF; result columns IUPAC, Bioactivity, Example Nr.]
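The two query steps share the example number as key, so their results can be combined with a simple join. The Python sketch below assumes each query result has already been reduced to (example number, value) pairs; the function name and the record layout are hypothetical, and the I2E API itself is not shown.

```python
def link_by_example(name_hits, activity_hits):
    """Join 'Example Nr. - IUPAC' and 'Example Nr. - Bioactivity' results.

    name_hits:     iterable of (example_nr, iupac_name) pairs
    activity_hits: iterable of (example_nr, bioactivity) pairs
    """
    names = {example: iupac for example, iupac in name_hits}
    linked = []
    for example, activity in activity_hits:
        linked.append({
            "example": example,
            # None marks an activity whose example has no resolved name.
            "iupac": names.get(example),
            "bioactivity": activity,
        })
    return linked
```

Records whose `iupac` field is `None` point to examples where the chemistry was not recognized and are candidates for manual curation.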
Semi-automatic Patent Curation Workflow g) Visualize data-/textmining results in ChemCC
SDF file imported into ChemCC project + automatic mapping to existing chemistry
Lessons learned, weak-points, limitations
1. Advantages KNIME Full-Mode (MOL/CDX/TIF) vs ChemCC branch
chemistry check/normalization – 3 input sources → improved quality
improved chemistry recall - ALL images (incl. tables and drawings)
more filtering options in the KNIME workflow than in ChemCurator alone
IUPAC-enriched XML as new source for I2E
Advantages ChemCC vs KNIME Full-Mode (MOL/CDX/TIF)
faster
image processing using CLiDE is already incorporated with naming
Lessons learned, weak-points, limitations
2. No full automation of the workflow due to lack of homogeneity in patent data (US vs WO, EP, etc.)
Missing attachment files
No tables present in XML
Error rate in chemistry recognition (OPSIN vs n2s/d2s)
…
NEEDS: different workflows/branches, patent-files clean-up (OCR)
3. Time- and computational-resource-consuming process
Outlook
1. KNIME Workflow
Add new data fields to chemicals: BI-internal codes, genes, targets, etc.
Usage of ChemCC HTML output as source for text-mining
Ontology mapping
Expand workflow by including other sources (internal PDF, literature full-text)
Use KNIME to interconnect with BI-internal workflows, DBs, etc.
chemistry-linked information in a patent DB → improved (semantic) search
Outlook
2. ChemCurator
Improved n2s
New command-line functions
Complex-phrase requests from IFI server
Improved SDF import
Preprocessing wizards