New Software Developments on Chemical Information Extraction Wei Deng (David)

Source: bulletin.acscinf.org/PDFs/Deng.pdf




ChemAxon’s Naming Technology

• Name to structure
  – IUPAC, traditional and common names
  – A common name library of existing drugs
  – Supports CAS Registry Numbers
  – Homology groups: alkyl, aryl …
  – Future: biological names (PDB code, EC # …)

• Structure to name
  – IUPAC names, traditional names, common names
  – Supports other structure features
    • Isotopes, pseudo-asymmetric stereocenters …

• Accuracy and coverage constantly improving
• Also available from the command line
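Recognizing CAS Registry Numbers in free text usually starts with the publicly documented check-digit rule: the digits before the check digit are weighted 1, 2, 3, … from right to left, and the sum modulo 10 must equal the final digit. The sketch below illustrates that rule only; it is not ChemAxon's implementation, and the names `CAS_PATTERN`, `is_valid_cas` and `find_cas_numbers` are hypothetical.

```python
import re
from typing import List

# Candidate shape of a CAS Registry Number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has CAS shape and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check


def find_cas_numbers(text: str) -> List[str]:
    """Return all substrings of text that pass the CAS check-digit test."""
    return [
        m.group(0)
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]
```

For aspirin, 50-78-2, the weighted sum is 8·1 + 7·2 + 0·3 + 5·4 = 42, and 42 mod 10 = 2 matches the check digit. A passing check digit only filters out malformed candidates; mapping the number to a structure still requires a lookup.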


ChemAxon’s “Document to Structure”

• Extract chemical information from documents
  – Names: powered by the Naming Technology
  – Also imports SMILES, InChI, CAS numbers …
  – Images: OSRA
  – Returns structures and their locations in the document

• Works with scanned PDFs since 5.8 (Feb 2012)
  – Great for patent mining

• OCR and syntax correction under constant development
  – OCR output: 3-rnethyl-l-me- thoxynaphthalene
  – Corrected: 3-methyl-1-methoxynaphthalene
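To make the OCR example on this slide concrete, here is a toy sketch of the kind of rules that turn the garbled name into the corrected one. It is an assumption-laden illustration, not the correction logic used in Document to Structure: the function name `correct_ocr_name` and the three rules (rejoin a line-break hyphen, read a lone "l" as the locant "1", read "rn" as "m" in two known fragments) are hypothetical and cover only this one example.

```python
import re


def correct_ocr_name(raw: str) -> str:
    """Apply three toy OCR-repair rules to a chemical name."""
    # Rule 1: rejoin a word split at a line break ("me- thoxy" -> "methoxy").
    # Simplification: a real hyphen followed by a space would also be removed.
    name = re.sub(r"-\s+", "", raw)
    # Rule 2: a lone letter "l" standing where a locant belongs is read as "1".
    # The lookbehind keeps the final "l" of words such as "methyl" untouched.
    name = re.sub(r"(?<![A-Za-z])l(?=-)", "1", name)
    # Rule 3: "rn" misread for "m" in a few known fragments.
    for wrong, right in (("rnethyl", "methyl"), ("rnethoxy", "methoxy")):
        name = name.replace(wrong, right)
    return name
```

Applied to "3-rnethyl-l-me- thoxynaphthalene", the three rules yield "3-methyl-1-methoxynaphthalene". A production corrector would more plausibly score candidate repairs by whether the result parses as a valid name, rather than rely on fixed substitutions.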


From Document to Structures

Non-searchable patent (50 pages) → Structure (text + image) + location


Search by Structure or Text


Non-searchable PDF is now Searchable


ChemAxon’s “Document to Structure”

• New features in 5.9 (Mar 2012)
  – MS Office documents: doc, docx, ppt, pptx, xls, xlsx, odt …
  – Embedded structure objects (ChemDraw, Symyx, Marvin …)
  – Progressive display of results
  – Speed improvement
  – Instant JChem integration; simplified API

• Currently in development for 5.10 (May 2012)
  – OSRA “Confidence”
  – Fragment groups integration with Markush generation
  – Collaboration with Linguamatics
  – IJC (OSRA, Location)


Free Online Service Chemicalize.org

•  Extract chemical information from web pages and PDF documents

•  Interactively display all structures and their predicted properties

•  Search all structures extracted

•  Gather links of interest to chemists for post processing (search, analysis, reporting, fun…)

•  Recently reviewed in the Journal of Chemical Information and Modeling


Webpage - chemicalized

• All chemical names are highlighted with a dotted line
• Mousing over a name pops up the structure image
• Clicking the image leads to the data page
• Links are “respected”


Data Page: Extensive Predicted Properties

• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit


Webpage - chemicalized

• All structures are summarized above the chemicalized page
• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence
• All structures can be downloaded as MRV or SDF


PDF File - chemicalized


Searching Chemicalize.org – Structure Search

• Aspirin: query highlighted in results


Searching Chemicalize.org – Keyword Search

• Aspirin; web page hits - “show” related structures
• Autosuggest while typing


Everything is Published

• Recently viewed
  – Webpages
  – Structures
  – Documents
  – Searched queries (structure and keyword)


Availability and Customization

• Source code available
• Minor changes to the example code are required for customization, such as
  – Importing extracted structures into other databases
  – Post-process filtering according to properties
  – Batch processing of multiple documents
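The last two customizations can be sketched in a few lines once extraction results are available as plain records. The example below is a minimal illustration under stated assumptions: the `ExtractedStructure` record, its fields, and the helpers `filter_hits` and `batch_filter` are hypothetical and do not reflect the actual example code or the Document to Structure API. The extraction step itself is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class ExtractedStructure:
    """Hypothetical record for one structure found in a document."""
    name: str          # text as it appeared in the document
    smiles: str        # structure the name was converted to
    page: int          # location of the hit in the document
    mol_weight: float  # a property used for filtering


Predicate = Callable[[ExtractedStructure], bool]


def filter_hits(hits: Iterable[ExtractedStructure],
                keep: Predicate) -> List[ExtractedStructure]:
    """Post-process filtering: keep only hits that satisfy the predicate."""
    return [hit for hit in hits if keep(hit)]


def batch_filter(documents: Dict[str, List[ExtractedStructure]],
                 keep: Predicate) -> Dict[str, List[ExtractedStructure]]:
    """Batch processing: apply the same filter to every document's hits."""
    return {doc: filter_hits(hits, keep) for doc, hits in documents.items()}
```

A caller would pass a predicate such as `lambda hit: hit.mol_weight < 190.0`. The filtered records could then be written to another database with whatever client that database provides.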


Hunting for Hidden Treasures

•  A CINF Symposium regarding “chemical information in patents and other documents”

•  ACS meeting in Philadelphia, August 19-23, 2012.

• Current speakers from
  – Content providers
  – Software providers
  – Pharmaceutical researchers
