The Integration of Chemistry with Everything Else

ACS National Meeting, August 2020

Ian Wetherbee, Lutz Weber, Evan Bolton, Vihang Mehta, Stephen Boyer, Jane Frommer

Contributions directed toward the identification, organization and FAIR availability of the world's molecular content


  • Previously at ACS San Diego...

    Google Patents

    Patent corpus

    Google Scholar

    Google Books

    computer curation processes

  • Google Patents

    Worldwide patent corpus

    ~ 50 Billion mentions of scientific entities

    New today: Donating annotated data to the public

    Google Translate

    computer curation processes

    Google BigQuery Dataset

    NIH

    Other

    SciWalker

    (Your application)

    Data: CC BY 4.0

    “Google Patents Research Data” by Google, based on data provided by IFI CLAIMS Patent Services and OntoChem, is licensed under a Creative Commons Attribution 4.0 International License.

    Natural language processing

    Name to structure

    Image to structure

    Map to ontology

  • Table

    # of rows: 47,676,364,297

    Schema: publication_number, ocid, preferred_name, domain, source, confidence, character_offset_start, character_offset_end, inchi, inchi_key, smiles

    Data

    count           distinct    domain        samples
    14,187,900,687  8,512       substances    Fluoxastrobin|ougon|Sulfosulfuron
    6,245,936,990   6,935       methods       electrochemical analysis|ultrasonication|slugging
    4,863,317,264   10,770      effects       pro-activator|glycemic|vasopressic
    3,891,972,531   13,629,554  chemCompound  sulfur|3-(3-methoxypropoxy)propan-1-ol|Iprindole
    1,921,140,121   20,124      inorgmat      Tc alloys|Ferromolybdenum|Li[Ni1/3Co1/3Mn1/3]O2 (NCM 111)
    1,711,449,095   2,652,600   chemGroup     stabilizer|1,1,1-trifluoroethane|3-pyrrolidinyl group
    1,636,858,964   5,996       polymers      4-(hydroxymethyl)-1,3-dioxolan-2-one|dodecyl group|NoRC associated RNA
    1,376,553,092   571,320     chemClass     α-D-xylose|Pyrylium salt|Isophorone
    1,364,950,542   141,500     proteins      GLP1R family|C1QT3 family|Amphiregulin
    1,167,892,259   10,704      nutrition     Pteridium aquilinum subsp pseudocaudatum|vanilla syrup|Basella
    1,074,276,820   108,928     humangenes    AP3D1|IARS|EHD4
    936,819,010     2,668       anatomy       Megakaryocytes|Neurofibrillary Tangles|Satellite Cells, Skeletal Muscle
    906,061,800     29,624      drugs         Guanine|atovaquone|aminoxytriphene
    835,712,433     199,070     species       Geotrichum candidum|Nezara|Abutilon
    696,195,364     22,159      diseases      Fibromyalgia|Social avoidant behaviour|Adenocarcinoma pancreas
    271,303,308     8,808       natprod       Vinblastine|Physagulin M|Indol-3-propionic acid
    73,346,722      936         toxicity      aquatic toxicology|Ames test|ribotoxic
    42,923,635      26,160,939  chem          N1(*)*N2C(C1=O)C(C(c1c2nc(Nc2**c*c2)nc1)(*)*)(*)*|C(N(I)CC)([C](#N)CCCCCCCC)CCC(=*)*|OC(=O)[C@]1(C[C@@H](C(*)C(C2(C(C2)C(C)CC)[C@@H](CO)O)O1)C)F

  • Google Patents extracts data from both text & images

    Table

    # of rows: 47,676,364,297

    Schema: publication_number, ocid, preferred_name, domain, source, confidence, character_offset_start, character_offset_end, inchi, inchi_key, smiles

    Data

    count          distinct    domain        source
    3,891,972,531  13,629,554  chemCompound  title,abstract,claims,description
    1,711,449,095  2,652,600   chemGroup     title,abstract,claims,description
    1,376,553,092  571,320     chemClass     title,abstract,claims,description
    21,291,227     14,912,387  chem          pdf *
    16,987,431     11,183,893  chem          image *
    4,644,977      2,711,658   chem          mol

    * still processing

  • Total number of chemistry entities extracted, 36 full text countries

    Data

    country_code description claims abstract pdf * image * title mol Grand Total

    US 1,409,208,212 145,488,707 10,651,425 6,504,233 6,640,303 1,425,365 4,644,977 1,584,563,222

    JP 1,059,614,185 93,795,657 14,312,034 3,136,333 3,051,411 1,319,277 1,175,228,897

    CN 880,335,706 130,838,122 25,630,032 1,869,214 1,180,255 2,573,546 1,042,426,875

    EP 680,356,821 87,966,443 5,377,765 1,595,481 1,972,898 840,088 778,109,496

    WO 522,361,876 71,891,448 4,733,108 2,332,337 2,367,645 537,893 604,224,307

    KR 445,605,239 49,543,964 5,263,342 1,299,386 1,118,158 516,001 503,346,090

    CA 292,648,762 45,117,787 2,855,808 1,196,171 431,672 342,250,200

    AU 258,294,420 46,518,099 441,824 1,224,705 193,271 327,160 306,999,479

    TW 173,408,881 19,402,002 1,259,332 947,090 51,152 144,676 195,213,133

    DE 78,908,305 11,580,392 1,213,910 162,209 460,756 92,325,572

    ES 65,087,386 7,723,456 2,081,943 269,099 38,847 141,399 75,342,130

    RU 41,438,558 11,637,908 2,211,258 151,414 169,750 146,552 55,755,440

    HU 29,630,553 4,311,374 375,805 104,216 68,077 34,490,025

    BR 26,326,881 3,065,815 658,663 103,517 26,257 144,139 30,325,272

    GB 17,906,597 3,136,854 6,741,940 40,357 532 299,648 28,125,928

    FR 21,874,630 3,797,415 672,698 53,555 13,107 154,800 26,566,205

    CZ 18,422,163 3,750,093 247,458 65,715 25,433 22,510,862

    DK 19,854,711 1,309,738 22,895 96,009 144 84,597 21,368,094

    EA 14,802,787 1,759,257 336,897 86,134 27,593 17,012,668

    PT 14,069,417 1,881,382 73,238 48,035 48,439 16,120,511

    SK 10,616,113 2,204,289 128,483 35,653 13,119 12,997,657

    FI 8,985,958 1,029,199 24,540 26,695 40 33,119 10,099,551

    SU 5,638,012 1,197,129 410,553 7,567 90,371 7,343,632

    CH 5,229,765 902,985 107,427 14,020 13 19,554 6,273,764

    DD 4,547,354 742,071 183,256 12,317 17,961 5,502,959

    NL 4,386,277 728,270 72,673 9,146 372 35,251 5,231,989

    BG 3,142,521 665,065 66,290 14,842 9,970 3,898,688

    CS 3,131,663 486,895 127,412 7,119 23,594 3,776,683

    OA 2,938,019 627,075 68,314 13,236 4,853 3,651,497

    AT 1,567,123 282,784 19,963 2,250 5 101,882 1,974,007

    RO 1,439,864 233,016 107,070 5,079 14,563 1,799,592

    BE 1,453,407 233,066 45,029 3,660 931 10,387 1,746,480

    SI 899,347 274,808 19,133 3,729 14,236 1,211,253

    LU 806,646 160,527 3,934 1,978 131 3,614 976,830

    LV 588,528 124,227 9,173 5,151 2,177 729,256

    AP 551,743 94,321 6,079 1,792 5,036 658,971

    MX 485,287 65,869 551,156

    LT 449,403 82,608 10,841 1,455 4,008 548,315

  • OCID: a unique identifier for every entity

    OCIDs provided by OntoChem

    Start OCID    End OCID      Domain    Sub-domain
    229910000000  229914999999  chem      inorganic materials
    229915000000  229915999999  chem      alloys
    229920000000  229929999999  chem      polymers
    229930000000  229939999999  chem      natural products
    229940000000  229969999999  chem      drugs
    239000000000  239999999999  chem      substances
    200000000000  209999999999  diseases  main
    200000000000  200999999999  diseases  diseases::OIS
    201000000000  201999999999  diseases  diseases::hdo
    202000000000  202999999999  diseases  diseases::icd9
    203000000000  203999999999  diseases  diseases::icd10
    204000000000  204999999999  diseases  diseases::snomed
    205000000000  205999999999  diseases  diseases::elsevier
    206000000000  206999999999  diseases  diseases::MedDRA
    208999000000  208999999999  diseases  diseases::MeSH
    209000000000  209999999999  diseases  diseases::syno_ocids

    Table

    # of rows: 47,571,142,272

    Schema: publication_number, ocid, preferred_name, domain, source, confidence, character_offset_start, character_offset_end, inchi, inchi_key, smiles

    Data
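    Because OCIDs are allocated in numeric blocks, an OCID can be mapped back to its domain with a simple range lookup. The following is a minimal sketch, assuming only the ranges listed on this slide (the full OntoChem allocation is not shown here); the function name `classify_ocid` is hypothetical. Where ranges nest, as with the "diseases main" block and its sub-ranges, the sketch returns the narrowest match.

```python
# Ranges copied from the slide: (start, end, domain, sub_domain).
# Assumption: this is only the subset shown in the deck, not the full list.
OCID_RANGES = [
    (229910000000, 229914999999, "chem", "inorganic materials"),
    (229915000000, 229915999999, "chem", "alloys"),
    (229920000000, 229929999999, "chem", "polymers"),
    (229930000000, 229939999999, "chem", "natural products"),
    (229940000000, 229969999999, "chem", "drugs"),
    (239000000000, 239999999999, "chem", "substances"),
    (200000000000, 209999999999, "diseases", "main"),
    (200000000000, 200999999999, "diseases", "diseases::OIS"),
    (201000000000, 201999999999, "diseases", "diseases::hdo"),
    (202000000000, 202999999999, "diseases", "diseases::icd9"),
    (203000000000, 203999999999, "diseases", "diseases::icd10"),
    (204000000000, 204999999999, "diseases", "diseases::snomed"),
    (205000000000, 205999999999, "diseases", "diseases::elsevier"),
    (206000000000, 206999999999, "diseases", "diseases::MedDRA"),
    (208999000000, 208999999999, "diseases", "diseases::MeSH"),
    (209000000000, 209999999999, "diseases", "diseases::syno_ocids"),
]


def classify_ocid(ocid):
    """Return (domain, sub_domain) for the narrowest listed range, or None."""
    matches = [r for r in OCID_RANGES if r[0] <= ocid <= r[1]]
    if not matches:
        return None
    start, end, domain, sub_domain = min(matches, key=lambda r: r[1] - r[0])
    return domain, sub_domain
```

    For instance, `classify_ocid(229915000001)` falls in the alloys block, while an OCID outside every listed block returns `None`.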

  • Website

    ● Easy to use

    ● Limited feature scope/breadth

    ● Limit to configurability/extensibility

    Publication

    Patent

    Compound

    Disease

    Clinical trial

    Assay

    Classification

    Target

    Side effect

    Entity graph

    Previously at ACS San Diego...

    Database

    ● More complex configuration (uploading, maintenance)

    ● Private data, other data outside of scope

    Dataset scope

    Google BigQuery

  • Google BigQuery

    Project 2

    Project 1

    SQL interface

    1 TB query free / month

    10 GB storage free / month

    No server setup

    Analyze, download results

    A single huge, relational database

    Compound -> MW, LogP, etc.

    Patent -> Compound, Target

    Patent -> Company, Class, Text, etc.

    ACLs

    etc.

    Compound -> Toxicity

  • Example of use in cloud: BigQuery

  • 1,074 patents in the D06M15/00 patent classification mention perfluorooctanoic acid
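    The count above combines two things per patent: an entity mention and a classification code. A minimal sketch of that logic on in-memory rows follows, assuming annotation rows shaped as (publication_number, preferred_name) as in the schema shown earlier, and a hypothetical mapping from publication number to CPC codes. The toy data and the function name are invented for illustration; whether the slide's figure includes subgroups of D06M15/00 is not stated, so the prefix match is also an assumption.

```python
def count_patents_mentioning(annotations, cpc_by_publication, compound, cpc_prefix):
    """Count distinct publications that mention `compound` and carry
    at least one CPC code starting with `cpc_prefix`."""
    wanted = compound.casefold()
    # A set removes repeated mentions within the same publication.
    mentioning = {pub for pub, name in annotations if name.casefold() == wanted}
    return sum(
        1
        for pub in mentioning
        if any(code.startswith(cpc_prefix) for code in cpc_by_publication.get(pub, ()))
    )


# Toy data, invented for illustration only (not real patents).
ANNOTATIONS = [
    ("US-1-A", "perfluorooctanoic acid"),
    ("US-1-A", "perfluorooctanoic acid"),  # repeated mention, same patent
    ("EP-2-A1", "Perfluorooctanoic acid"),
    ("JP-3-A", "sulfur"),
]
CPC_BY_PUBLICATION = {
    "US-1-A": ["D06M15/277"],
    "EP-2-A1": ["C08F14/18"],
    "JP-3-A": ["D06M15/00"],
}

n = count_patents_mentioning(
    ANNOTATIONS, CPC_BY_PUBLICATION, "perfluorooctanoic acid", "D06M15/"
)
```

    In BigQuery the same idea would be a join on publication_number followed by a COUNT(DISTINCT publication_number).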

  • Example of use in web: PubChem

  • PubChem is a chemical information resource

    • 100s of data fields about chemicals

    • Biological activities

    • Programmatic access interfaces

    • FTP site for bulk downloads

    • Extensive integration with chemistry-related websites

    • Millions of monthly users

    https://pubchem.ncbi.nlm.nih.gov


  • Google Patent contribution to PubChem

    • +45B CSV rows available on the PubChem FTP site ‘as-is’

    • +16K gzip compressed CSV files (subdirectories each with up to 1000 files)

    • +4TB content uncompressed

    • Structures via SMILES added to PubChem Substance (+9M)

    • Association of patent to structure (billions of links)

    Accessible via:

    https://ftp.ncbi.nlm.nih.gov/pubchem/Other/GooglePatents/

    ftp://ftp.ncbi.nlm.nih.gov/pubchem/Other/GooglePatents/

  • Processing of Google Patent contribution

    1. Extract SMILES

    2. Add as PubChem Substance records

    3. Associate patent identifiers to records
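    Steps 1 and 3 can be pictured as a grouping pass over the CSV rows. This is a minimal sketch, not PubChem's actual pipeline: it assumes the CSV files carry a header row with the `publication_number` and `smiles` column names from the BigQuery schema, and the sample rows and the function name are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def patents_by_smiles(csv_text):
    """Group publication numbers by SMILES string.

    Step 1: pull the SMILES out of each row.
    Step 3: attach the patent identifiers to each distinct structure.
    (Step 2, depositing Substance records, happens inside PubChem.)
    """
    links = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        smiles = (row.get("smiles") or "").strip()
        if smiles:  # annotations without a structure are skipped
            links[smiles].add(row["publication_number"])
    return {s: sorted(pubs) for s, pubs in links.items()}


# Toy rows, invented for illustration only.
SAMPLE = """publication_number,preferred_name,domain,smiles
US-0000001-A,ethanol,chemCompound,CCO
EP-0000002-A1,ethanol,chemCompound,CCO
US-0000001-A,fibromyalgia,diseases,
"""

links = patents_by_smiles(SAMPLE)
```

    For the gzip-compressed files, the text could be read with `gzip.open(path, "rt")` from the standard library and passed to `csv.DictReader` directly instead of going through a string.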

  • Google Patent contribution to PubChem

    • To be integrated with other patent links within PubChem

    • Searchable collection

    • Patent section of compounds

    • Accessible by programmatic interfaces

    • Downloadable per record

    • Associated to metadata about a given patent

  • Example: What types of assays are in PubChem?

    ‘starting material’ = PubChem tables

    ‘reagents’ = SQL query

    product = answers to queries

    Table: aid_sid_cid_acname_acvalue_aidname (243,326,686 rows, 30.2 GB)

    Columns: aid, sid, cid, acname, acvalue, aidname

    SQL in BQ

    SELECT acname, COUNT(acname) AS count_acn
    FROM `ncbi-research-pubchem.pubchem.aid_sid_cid_acname_acvalue_aidname`
    GROUP BY acname
    ORDER BY count_acn DESC

    Result (partial list)

    acname      count_acn
    Potency     51,654,071
    IC50        12,947,417
    EC50        2,664,101
    Ki          559,067
    IC90        334,633
    AC50        209,578
    GI50        132,714
    Kd          88,048
    CC50        75,027
    MIC         64,718
    LC50        38,073
    TGI         33,396
    Activity    27,139
    AbsAC40_uM  21,374
    ED50        16,745
    LD50        3,018

    554.3 MB processed, could run ~1800 times (= 1 TB) per month for free
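    The "~1800 times" figure is a simple division of the monthly free allowance by the bytes scanned per run. A short worked check, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); binary units would give a slightly larger count. The function name is hypothetical.

```python
def free_runs_per_month(bytes_per_query, free_bytes_per_month):
    """Whole number of times a query fits into a monthly free allowance."""
    return free_bytes_per_month // bytes_per_query


TB = 10**12                # decimal terabyte (assumption)
QUERY_BYTES = 554_300_000  # 554.3 MB processed by the query on this slide

runs = free_runs_per_month(QUERY_BYTES, 1 * TB)  # 1804
```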

  • Kaggle + BQ

    Example: What compounds have assays for targets MDM2, MEK, KRAS, & IL17?

    [Figure: Assay Counts by Target; axes: Assay # (log), Assay type]

    Free Python notebooks

    5 TB / month free BQ quota

  • Acknowledgements We would like to acknowledge our colleagues who help make this work possible.

    ● Vihang Mehta

    ● Robert Frommer

    ● Brodrick Arneson

    ● Stephen S. Walker

    ● Igor Filippov

    ● Members of the OntoChem team

    ● The Intramural Research Program of the National Library of Medicine, National Institutes of Health.

    ● The entire PubChem team and all PubChem contributors and collaborators

  • Thank you!

    BigQuery: "Google Patents Public Datasets"

    patents-public-data.google_patents_research.publications

    PubChem download: https://ftp.ncbi.nlm.nih.gov/pubchem/Other/GooglePatents/

    Google Patents: patents.google.com

    PubChem: pubchem.ncbi.nlm.nih.gov

    SciWalker: www.sciwalker.com

  • Backup

  • This Kaggle script queries PubChem to answer the following question: What bioassay data is available for a specific target or list of targets (e.g. MDM2, MEK, KRAS, IL17)?

    Example

  • PubChem, ChEMBL, Target Central, EPA, Clinical Trials, FDA, Medline, Drug Central

    ~ 250 tables from ~ 15 dbs

    cluster

    machine-learning

    AI

    visualization

    post processing

    How the data are used

    Kaggle + BQ

  • How many molecules have EC50 assays for SGLT2? What are they?

    Table: aid_sid_cid_acname_acvalue_aidname (243,326,686 rows)

    Columns: aid, sid, cid, acname, acvalue, aidname, inchikey, smiles

    SELECT cid, acname, acvalue, aidname
    FROM `google.com:skb1-228101.NIH_Data.PUBCHEM_AID_CID_SMILES_EC50_Data`
    WHERE acname = 'EC50' AND UPPER(aidname) LIKE '%SGLT2%'

    or Kaggle Table 1

    Output

    Answer: 143 unique molecules, 20 duplicate compounds

    Example
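    The split between unique molecules and repeats comes from counting distinct `cid` values among the matching rows. The sketch below replays the slide's WHERE clause against an in-memory SQLite table, using Python's standard `sqlite3` module rather than BigQuery. The rows are invented toy data, and how the slide defined "duplicate" is not stated, so "rows minus distinct cids" is only one possible reading.

```python
import sqlite3

# Toy rows (cid, acname, acvalue, aidname), invented for illustration only.
ROWS = [
    (101, "EC50", 0.012, "Inhibition of human SGLT2 expressed in CHO cells"),
    (101, "EC50", 0.015, "sglt2 uptake assay"),
    (202, "EC50", 1.3, "Inhibition of SGLT2"),
    (303, "IC50", 0.5, "Inhibition of SGLT2"),
    (404, "EC50", 2.0, "Inhibition of SGLT1"),
]


def sglt2_ec50_summary(rows):
    """Count matching rows and distinct compounds for EC50 assays on SGLT2."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE assays (cid INTEGER, acname TEXT, acvalue REAL, aidname TEXT)"
    )
    con.executemany("INSERT INTO assays VALUES (?, ?, ?, ?)", rows)
    total, unique = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT cid) FROM assays "
        "WHERE acname = 'EC50' AND UPPER(aidname) LIKE '%SGLT2%'"
    ).fetchone()
    con.close()
    return {"rows": total, "unique_cids": unique, "repeat_rows": total - unique}


summary = sglt2_ec50_summary(ROWS)
```

    On the toy rows this gives three matching rows for two distinct compounds, so one row is a repeat measurement.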

  • Kaggle in BQ+

    Results for 3,443 assays for MDM2

    Output file opened with DataWarrior

    Example

  • Biomarkers co-occurring in the same document or clinical trial

    Most frequently mentioned pairs of biomarkers within the same document

    Example
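    Co-occurrence ranking of this kind reduces to counting unordered pairs of entities per document. A minimal sketch, assuming the annotations have already been grouped into a mapping from document id to the entity names found in it; the documents below are toy data, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entities_by_doc):
    """Count in how many documents each unordered pair of entities co-occurs."""
    counts = Counter()
    for entities in entities_by_doc.values():
        # set() so repeated mentions inside one document count once;
        # sorted() so (A, B) and (B, A) land on the same key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Toy documents, invented for illustration only.
DOCS = {
    "doc1": ["MDM2", "KRAS", "IL17"],
    "doc2": ["KRAS", "MDM2"],
    "doc3": ["KRAS", "MDM2", "MDM2"],
}

pairs = cooccurrence_counts(DOCS)
top = pairs.most_common(1)
```

    Here `top` holds the pair that appears together in the most documents, which is the quantity the slide ranks by.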

  • Opportunities for partners

    Google

    • Patents

    • Scholar

    • Books

    BigQuery

    Middleware

    to be developed

    Front end

    Integrity

    GWAS

    PMC

    MedLine

    COSMIC

    Patents

    Text sources Databases

    Your Applications

    Data Studio

    Client Data

    Looker

    Knime

    Other

    Open Data

  • Example