
www.guidetopharmacology.org

Multiplexing analysis of 1000 approved drugs across 60 million PubChem entries: Will the correct structures please stand up?

Christopher Southan

IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative Physiology, University of Edinburgh

ACS CINF session: The Growing Impact of Big Data in the World of Chemical Information

http://www.slideshare.net/cdsouthan/multiplexing-analysis-of-approved

Abstract

Database molecular entries for approved drugs are the Crown Jewels of over 50 years of global R&D. However, a surprising degree of uncertainty surrounds exact numbers and explicit chemical structures. Choosing a representative approved drug or clinical candidate is becoming harder because of different molecular representations (i.e. structural multiplexing). In this work, results will be presented from the analysis of a 1000-drug set compiled inside PubChem. This showed that each structure had been submitted (on average) 81 times. In addition, the “same connectivity” operator indicated 21 canonically related CIDs, and each drug was represented in 44 mixtures. We can also detect the “split bioactivity” problem where, of 135 CIDs related to taxol, 12 have bioassay results. As the totality of public chemical structures pushes towards 100 million, we can track a constellation of problems related to the type of statistics above. In particular, the recently increased open availability of patent-extracted chemistry and broadening vendor choice are generally welcomed by database users. However, analysing entries related to the 1000 drugs across time indicates that both types of expansion come with a cost. For example, the 55 million vendor CIDs show increased unresolved chirality (i.e. flat versions) and/or E/Z positions (crossed bonds). In addition, noticeable “patent-picking”, including mixtures, suggests that vendor submissions are increasingly virtual rather than extant structures. The 21 million automated and manual patent extractions also bring in a variety of artefacts, such as shotgun exemplifications of mixtures, chiral permutations and virtual deuteration. Also, 85% are devoid of BioAssay data links. As a solution to at least some of these problems, PubChem facilitates particularly effective query selects and filters predicated on its advanced relationship rules. Notwithstanding, the inexorable increase in multiplexing can confound the less experienced and is arguably reaching problematic proportions across all “big data” chemical resources.

Introduction

• In the Guide to Pharmacology (GtoPdb) team we have been doing expert drug curation since 2009

• In recent years we have noticed an increase in alternative database representations of what we recognize as the “same” canonical drug

• Via inspection of the PubChem relationships we have been getting an idea of which causes and sources contribute to this “multiplexing” problem

• We do this to provide contextual annotation for users, via comments and cross-pointers to CIDs we recognize as alternative drug forms

• We recently began to investigate multiplexing on a more systematic level and this work summarises early results

Taxol example

Compiling a 1000 drug set for multiplex analysis

1. Query at the SID level: approved[All Fields] AND "DrugBank"[SourceName] = 1533

2. Find related data < compound < same CID = 1504

3. Select ChEMBL CIDs = 1458721

4. ChEMBL AND DrugBank = 1358

5. .. Covalent unit 1 = 1329

6. AND Therapeutic Target Database (13913) = 1040

7. Mw < 145 = 1001 (remove one manually) = 1000

8. Stored as a MyNCBI collection; the CID set was downloaded for queries on the PubChem Identifier Exchange Service
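The source-intersection part of the steps above can be sketched as plain set operations. This is only an illustration under stated assumptions: the function name `build_drug_set` and the toy CID values are hypothetical, the real selection was done with PubChem/Entrez queries rather than local code, and the covalent-unit and molecular-weight filters are left out.

```python
def build_drug_set(drugbank_approved, chembl, ttd):
    """Intersect CID sets from three sources and record the count after each step.

    All three arguments are iterables of PubChem CIDs (integers).
    Hypothetical helper: it mirrors the AND logic of the slide, not the
    actual Entrez queries that were run.
    """
    steps = []
    current = set(drugbank_approved)
    steps.append(("DrugBank approved", len(current)))
    current &= set(chembl)
    steps.append(("AND ChEMBL", len(current)))
    current &= set(ttd)
    steps.append(("AND Therapeutic Target Database", len(current)))
    return current, steps


# Toy CID sets, invented purely for illustration.
drugbank_approved = {1, 2, 3, 4, 5}
chembl = {2, 3, 4, 5, 6}
ttd = {3, 4, 5, 7}

selected, steps = build_drug_set(drugbank_approved, chembl, ttd)
for label, count in steps:
    print(f"{label}: {count}")
```

Recording the count after every intersection reproduces the funnel shape of the slide, which makes it easy to see which source removes the most entries.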

Breakdown of the 1000

Mostly older compounds

Substance contributions

• 1000-drug CIDs < 82194 SIDs

• i.e. an average of 82 SIDs for every CID

• Patent sources summed from IBM, SCRIPDB, SureChEMBL, NextMove Software and Thomson Pharma = 35.6 million SIDs

• Average contribution of patents and vendors to SIDs is 31%

• Yearly SID spikes can be interpreted

Same connectivity breakdown

• 1000-drug CIDs < 22633 “same connectivity”

• Each drug represented by 23 different CIDs (multiplexed)

• 59% of the multiplexing is caused by patent deuteration

• Only 8% have recorded bioactivity (cf. 95% for the 1000-drugs)
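For readers who want to retrieve “same connectivity” CIDs programmatically, one possible route is the PubChem PUG REST `fastidentity` search with `identity_type=same_connectivity`. The sketch below is an assumption about how such a lookup could be scripted, not the procedure used for these slides (which relied on the PubChem web interface). The helper names and the example CID are hypothetical choices, and the network call is defined but not executed here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def same_connectivity_url(cid):
    """Build a PUG REST URL listing CIDs with the same connectivity as `cid`."""
    query = urlencode({"identity_type": "same_connectivity"})
    return f"{PUG_REST}/compound/fastidentity/cid/{int(cid)}/cids/TXT?{query}"


def parse_cid_list(text):
    """Parse the plain-text response (one CID per line) into integers."""
    return [int(token) for token in text.split()]


def fetch_same_connectivity(cid, timeout=30):
    """Fetch the related CIDs over the network (requires internet access)."""
    with urlopen(same_connectivity_url(cid), timeout=timeout) as response:
        return parse_cid_list(response.read().decode("utf-8"))


# Only the URL is built here; 2244 is an arbitrary example CID.
example_url = same_connectivity_url(2244)
print(example_url)
```

Counting the length of the returned list for each CID in a drug set would give a per-drug multiplexing figure comparable to the average quoted above, although current PubChem counts will differ from those recorded at the time of the talk.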

Stereo splits

• 80% of different stereo multiplexing is also from patents

Mixtures: another form of drug structure multiplexing

• 1000-drugs > 39141 mixtures

• Each drug permuted on average into 39 mixture CIDs

• Of these, 80% have a patent CID match

• Proportionally dominated by SureChEMBL and Thomson

• Only 2% have bioactivity data

• A few “real drug” mixtures and USAN salts
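The per-drug averages quoted on the slides follow directly from dividing each related-record count by the size of the drug set. A minimal check, using only the counts given in this deck (the variable names are illustrative):

```python
DRUG_SET_SIZE = 1000

# Related-record counts reported on the slides for the 1000-drug set.
related_counts = {
    "SIDs": 82194,
    "same-connectivity CIDs": 22633,
    "mixture CIDs": 39141,
}

# Average number of related records per drug CID.
averages = {label: count / DRUG_SET_SIZE for label, count in related_counts.items()}

for label, average in averages.items():
    print(f"{label}: {average:.1f} per drug")
# SIDs: 82.2 per drug
# same-connectivity CIDs: 22.6 per drug
# mixture CIDs: 39.1 per drug
```

Rounded to whole numbers these give the 82, 23 and 39 figures on the slides.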

Conclusions

• The problem of structural multiplexing is a consequence of the successful expansion of sources that feed into PubChem

• But it is not restricted to drugs and affects all bioactivity databases

• The value of expanding vendor coverage and open patent chemistry extraction is very high but comes with various “noise” costs

• PubChem has impressive filtration options but they only go so far

• Confounding effects on big data are difficult to predict (e.g. the activity splitting problem)

• PubChem vendor CIDs have recently shrunk by nearly 20 million, which will improve things

Questions welcome

PAPER ID: 2258166 Deuterogate: Causes and consequences of automated extraction of patent-specified virtual deuterated drugs feeding into PubChem (final paper: CINF 56) SESSION: Enabling Machines to "Read" the Chemical Literature: Techniques, SESSION TIME: 9:30 AM - 11:55 AM slot from 1:45 PM - 2:05 PM

PAPER ID: 2267053 Resolving cryptic needles to molecular structures: The GtoPdb experience (CINF 129) SESSION: Find the Needle in a Haystack: Mining Data from Large Chemical Spaces, Wednesday, August 19, 2015 from 10:20 AM - 10:50 AM