Transcript
Page 1: Will the real drugs please stand up?

Christopher Southan and Joanna L. Sharman
Poster no 45 at BioITWorld 2014 Boston
IUPHAR/BPS Guide to PHARMACOLOGY Web portal Group, Centre for Integrative Physiology, School of Biomedical Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh, EH8 9XD, UK. ([email protected])
http://www.slideshare.net/cdsouthan/southan-sharman-bioitposter

www.guidetopharmacology.org


Results: Intra-PubChem consensus triage

Results: further analysis

As pointed out in PMID:24533037, ChEMBL, DrugBank and the Therapeutic Target Database (TTD) are valuable sources to compare. However, the only way to select “approved” drug structures directly from PubChem is to retrieve them via the DrugBank substance (SID) comment fields. This returns 1533 SIDs that collapse to 1504 CIDs, which serve as the de facto starting point. Four further CID sets were then selected: those from TTD, those from ChEMBL, those with a covalent unit count of 1 (i.e. stripping salts and mixtures) and those for which at least one of the sources had assigned a synonym as an INN (WHO International Nonproprietary Name). The final step was the Boolean intersect between all five sets. The sequence of CID query results is shown below (the extra ANDs are an interface artefact).
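The Boolean intersect step can be sketched with ordinary set operations. This is only a minimal illustration and not the PubChem interface used for the poster: the function name `consensus_cids` and the small integer CID sets are hypothetical placeholders, and only the five filter names are taken from the text above.

```python
from functools import reduce

def consensus_cids(cid_sets):
    """Return the CIDs present in every named set (Boolean AND across all sets)."""
    if not cid_sets:
        return set()
    return reduce(set.intersection, cid_sets.values())

# Hypothetical toy CID sets, one per filter named in the text (not real PubChem data).
filters = {
    "drugbank_approved": {1, 2, 3, 4, 5, 6},
    "ttd": {2, 3, 4, 5, 7},
    "chembl": {1, 2, 3, 5, 8},
    "covalent_unit_count_1": {2, 3, 5, 6, 9},
    "has_inn_synonym": {2, 5, 8, 9},
}

consensus = consensus_cids(filters)
print(sorted(consensus))
```

Because every filter can only remove members, adding a source never enlarges the consensus set and dropping one never shrinks it.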

Examples of GPCR database tables

Supported by:

Discussion

The PubChem toolbox allows numerous additional analyses of drug chemical space to be carried out. Just a selection is included:
• Each of the 923 was, on average, represented by 76 submissions (SIDs). The CID chemistry rules are thus effective for generating robust consensus
• Applying “same (bond) connectivity” gives 18749 but removing the virtual deuterated entries from patents reduces this to 6919 (i.e. the 923 have, on average, 7.5 alternative stereo CIDs, also mostly from patents)
• The 3 sources chose taxol as CID 36314 from 96 same-connectivity CIDs
• 209 have fully unspecified chiral centres (probably racemates not errors)
• The intersect with the FDA Maximum Daily Dose Database is only 489
• The 1504 DrugBank approvals include 7 unique CIDs. One was CID 73425384, sofosbuvir, specified as (2S)-2-[[[(3R,4R,5R) with one unresolved stereo centre. The majority SID “vote” was CID 45375808 as (2S)-2-[[[(2R,3R,4R,5R) picked up by ChEMBL but absent from TTD.
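The per-drug average quoted above for same-connectivity CIDs follows directly from the counts given in the list. A quick arithmetic check, using only those numbers (the variable names are illustrative):

```python
# Counts quoted in the list above.
consensus_drugs = 923              # CIDs in the consensus set
same_connectivity_all = 18749      # same-connectivity CIDs before filtering
same_connectivity_kept = 6919      # after removing deuterated patent entries

# Entries removed by the deuterated filter.
removed = same_connectivity_all - same_connectivity_kept

# Average number of same-connectivity CIDs per consensus drug.
avg_per_drug = same_connectivity_kept / consensus_drugs

print(removed)
print(f"{avg_per_drug:.1f}")
```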

With specified exceptions this 4-way set is now captured in GToP. We are completing the quantitative activity mappings by “walking-out” from this core set (e.g. to the 1117) but a variety of additional consensus intersects and filters are being explored including InChIKey downloads direct from sources. We have also constituted an 8-member international committee to support us in resolving challenging cases. In addition, we have introduced key ligand-to-ligand relationships such as drug<>prodrug (i.e. two INNs), fixed mixtures and active metabolites. The persistent structural discordance between nominally the same drugs (in documents and databases) presents an important issue in cheminformatics and pharmacology, extending beyond INNs to all of medicinal chemistry. Contributory factors include semantic naming inconsistencies, salt ambiguity and multiple stereo forms. Definitive lists will remain elusive until there is more inter-source collaboration on standardisation and cross-mapping (including wider use of InChI) and until regulatory bodies and pharmaceutical companies commit to directly verifying and provenancing public database structure records for clinically used drugs.

This set of 923 drugs can be accessed via the MyNCBI open URL: http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/

While this 4-way corroborated set has high utility there are caveats:
• The “approved” tag via DrugBank 4.0 is not FDA-only
• Since TTD last submitted in Feb 2012 the intersected drug content is thus capped to before that date (dropping TTD gives 1117 CIDs)
• Some metabolites (e.g. amino acids) come through the filters but an additional restrict of DrugBank nutraceuticals can be applied
• Some older drugs have no INN (e.g. aspirin)
• Some peptide drug CIDs are missing (suggesting low concordance)
• Approved fixed-mixtures are excluded (but their capture is problematic anyway since they do not get an INN)
• The computed CID identity is actually a hash-code match, rather than via InChIKey (but this should give similar numbers)
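The last caveat contrasts PubChem's computed CID identity with matching via InChIKey. As a rough sketch of what an InChIKey-based comparison between two sources could look like, the example below matches either on the full key (exact structure) or on the first 14-character block, which in a standard InChIKey encodes the molecular skeleton and excludes stereochemistry. The keys are made-up placeholders rather than real compounds and the helper names are hypothetical. This is not how PubChem computes its own identity or "same connectivity" groupings.

```python
def skeleton_block(inchikey):
    """Return the first block of an InChIKey (the part before the first hyphen)."""
    return inchikey.split("-")[0]

def match_keys(source_a, source_b):
    """Compare two collections of InChIKeys at exact and skeleton level."""
    a, b = set(source_a), set(source_b)
    exact = a & b
    skeleton = {skeleton_block(k) for k in a} & {skeleton_block(k) for k in b}
    return exact, skeleton

# Placeholder keys (not real compounds). The "B" entries share a first block
# but differ in the second block, mimicking two stereo forms of one skeleton.
source_a = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-ABCDEFGHSA-N"]
source_b = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-ZYXWVUTSSA-N"]

exact, skeleton = match_keys(source_a, source_b)
print(len(exact), len(skeleton))
```

The gap between the two counts is the kind of stereo-level discordance that an exact-structure consensus filters out.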

Introducing the Guide to PHARMACOLOGY

How many approved drugs are there?

A consensus approach to drug curation

Workflow

The IUPHAR/BPS Guide to PHARMACOLOGY Web portal (GToP) is an open resource providing expert-curated information on approved drugs, clinical candidates, research compounds and other selective ligands for a wide range of biological targets (PMID:24234439). Overview pages link to detailed information including biological variants, tissue distribution, signalling mechanisms and disease involvement. The key objective of our Wellcome Trust grant is to complete expert annotation of approved drugs against their primary targets, connected by quantitative data (typically IC50, Ki or Kd). This poster describes the comparison of approved drugs inside PubChem that have an assigned compound identifier (CID).

Given their crucial position for the global practice of medicine and formal approval by stringent national procedures, the uncertainty over how many approved drug structures there are seems paradoxical. Counts range from 1216 in the FDA Maximum Daily Dose Database (PubChem Assay ID 1195) up to 1817 quoted for the NCGC Pharmaceutical Collection (PMID:21525397), but the fact that these efforts stopped in 2008 and 2011, respectively, illustrates the challenge. This is underlined by a comparison of database subsets of approved drugs done in 2009 that recorded only 807 exact chemical structures in common (PMID:20298516).

As a pragmatic approach to maximising the precision and utility of our database we have strategically chosen to use consensus sets as curatorial starting points, rather than attempt total capture (i.e. quality vs quantity). In addition, we set a high stringency for our curated activity mappings and primary target assignments. We thus do not target-map nutraceuticals, endogenous hormones or simple topical compounds as they typically have many protein interactions already captured in metabolite, enzyme and pathway databases. The consensus concept (first exemplified in PMID:20298516) is that an exact chemical structure match between multiple sources is more likely to be right than wrong. Nonetheless, nominally independent databases may re-cycle annotations rather than resolve primary provenances (this is not a criticism, merely an observation). There are also challenging cases where consensus may be unrealistic to expect (e.g. large peptides, long oligos or complex natural products).

Through exploring the PubChem mining options we came up with the workflow below.