
Will the real drugs please stand up?


DESCRIPTION

For BioIT World, Boston, April 2014, Christopher Southan and Joanna L. Sharman. http://www.guidetopharmacology.org/

URL for the consensus set: http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/ (n.b. result numbers in the poster have been updated since the abstract submission due to DrugBank and ChEMBL updates in PubChem)

BACKGROUND: A comparison of database subsets of approved drugs in 2009 recorded only 807 exact structures in common (PMID:20298516). Factors contributing to low overlap included semantic naming inconsistencies, ambiguity in structure representation and the fact that neither regulatory bodies nor pharmaceutical companies directly verify public electronic chemical database records. This work is a current comparison of drug sources inside PubChem.

METHODS: We selected submitters that nominally included small-molecule drug collections and International Non-proprietary Names (INNs) and/or US Adopted Names (USANs). Unions, intersects and differences were derived by using the Entrez query history interface to perform Boolean operations on retrieved sets. Additional filters were explored, including salt-stripping by selecting a covalent unit count of 1.

RESULTS: DrugBank 3.0 declares 1,541 small-molecule drugs and the term “approved” returned 1,424 substances (SIDs) in PubChem. These collapse to 1,392 compounds (CIDs), and removal of mixtures reduces this to 1,325. The Therapeutic Target Database (TTD) declares 1,540 approved drugs on its website. The CID overlap with the DrugBank 1,325 was 1,108, and the equivalent figure for ChEMBL_17 was 1,141. The three-way consensus (from the DrugBank starting point) was 1,003. The term INN retrieves 7,916 CIDs, reducing to 7,180 single-components. USAN brings back 5,494, of which only 3,204 are single-component (i.e. more salt forms are designated as USANs). Of the 1,108 3-way set, 927 have an INN or USAN. The “same connectivity” query indicates that, on average, each of the 927 has nearly 20 canonically-related CIDs. Issues associated with these metrics will be outlined and, depending on new source releases, the numbers will be updated.

CONCLUSIONS: A surprising degree of non-overlap persists in drug structures. Our results are not a criticism of the valuable sources, but further analysis is needed of the multiplicity of structural representations and fuzzy naming of essentially the same canonical drugs inside PubChem. This important issue in cheminformatics extends beyond the INNs to all pharmacologically active structures. It also rationalises our IUPHAR/BPS Guide to PHARMACOLOGY strategic choice of focusing on consensus sets for curation. This work indicates that definitive drug lists will remain elusive until there is more collective engagement for provenance, standardisation and cross-mapping.
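The Boolean operations named in METHODS (unions, intersects and differences) can be mimicked offline with plain sets. The following is a minimal sketch, assuming each source's CIDs have already been exported to a Python set; the variable names and CID integers are invented placeholders, not PubChem query results.

```python
# Toy CID sets: the integers are placeholders, not real PubChem results.
drugbank_approved = {101, 102, 103, 104, 105}
ttd = {102, 103, 104, 106}
chembl = {101, 103, 104, 107}

# Union: every CID reported by at least one source.
union = drugbank_approved | ttd | chembl

# Intersect: the three-way consensus (exact CID match in all sources).
three_way = drugbank_approved & ttd & chembl

# Difference: CIDs that only DrugBank reports.
drugbank_only = drugbank_approved - (ttd | chembl)

print(len(union), sorted(three_way), sorted(drugbank_only))
```

Because the comparison is on exact identifiers, two salt forms or stereo variants of the same drug count as different members, which is one reason the overlaps reported above are lower than the declared source sizes.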


Christopher Southan and Joanna L. Sharman. Poster no 45 at BioITWorld 2014, Boston. IUPHAR/BPS Guide to PHARMACOLOGY Web portal Group, Centre for Integrative Physiology, School of Biomedical Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh, EH8 9XD, UK. ([email protected]) http://www.slideshare.net/cdsouthan/southan-sharman-bioitposter

www.guidetopharmacology.org


Results: Intra-PubChem consensus triage

As pointed out (PMID:24533037), ChEMBL, DrugBank and the Therapeutic Target Database (TTD) are valuable sources to compare. However, the only way to select “approved” drug structures directly from PubChem is to retrieve them via the DrugBank substance (SID) comment fields. This returns 1533 SIDs that collapse to 1504 CIDs as the de facto starting point. Four further CID sets were then selected: those from TTD, those from ChEMBL, those with a covalent unit count of 1 (i.e. stripping salts and mixtures) and those for which at least one of the sources had assigned an INN (WHO International Non-proprietary Name) as a synonym. The final step was the Boolean intersect between all five. The sequence of CID query results is shown below (the extra ANDs are an interface artefact).

Results: further analysis

The PubChem toolbox allows numerous additional analyses of drug chemical space to be carried out. Just a selection is included:

• Each of the 923 was, on average, represented by 76 submissions (SIDs). The CID chemistry rules are thus effective for generating robust consensus
• Applying “same (bond) connectivity” gives 18749 but removing the virtual deuterated entries from patents reduces this to 6919, i.e. the 923 have, on average, 7.5 alternative stereo CIDs (also mostly from patents)
• The 3 sources chose taxol as CID 36314 from 96 same-connectivity CIDs
• 209 have fully unspecified chiral centres (probably racemates, not errors)
• The intersect with the FDA Maximum Daily Dose Database is only 489
• The 1504 DrugBank approvals include 7 unique CIDs. One was CID 73425384, sofosbuvir, specified as (2S)-2-[[[(3R,4R,5R) with one unresolved stereo centre. The majority SID “vote” was CID 45375808 as (2S)-2-[[[(2R,3R,4R,5R), picked up by ChEMBL but absent from TTD

[Figure: Examples of GPCR database tables]

Discussion

With specified exceptions this 4-way set is now captured in GToP. We are completing the quantitative activity mappings by “walking out” from this core set (e.g. to the 1117), but a variety of additional consensus intersects and filters are being explored, including InChIKey downloads direct from sources. We have also constituted an 8-member international committee to support us in resolving challenging cases. In addition, we have introduced key ligand-to-ligand relationships such as drug<>prodrug (i.e. two INNs), fixed mixtures and active metabolites. The persistent structural discordance between nominally the same drugs (in documents and databases) presents an important issue in cheminformatics and pharmacology, extending beyond INNs to all of medicinal chemistry. Contributory factors include semantic naming inconsistencies, salt ambiguity and multiple stereo forms. Definitive lists will remain elusive until there is not only more inter-source collaboration for standardisation and cross-mapping (including wider use of InChI) but also a commitment from regulatory bodies and pharmaceutical companies to directly verify and provenance public database structure records for clinically used drugs.
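The InChIKey-based consensus mentioned in the Discussion could look roughly like the sketch below. It assumes each source's download has been reduced to a set of standard InChIKeys; the function names, source labels and keys are hypothetical placeholders laid out in InChIKey format (14-10-1 characters), not real structures. It relies only on the documented InChIKey layout, in which the first 14-character block encodes connectivity, so keys sharing that block differ only in stereo, isotope or protonation layers.

```python
from collections import Counter, defaultdict

def consensus(sources, min_sources):
    """Return InChIKeys reported by at least `min_sources` of the sources."""
    votes = Counter(key for keys in sources.values() for key in set(keys))
    return {key for key, n in votes.items() if n >= min_sources}

def by_connectivity(keys):
    """Group full InChIKeys by their first (14-character) connectivity block."""
    groups = defaultdict(set)
    for key in keys:
        groups[key.split("-")[0]].add(key)
    return dict(groups)

# Placeholder keys in InChIKey layout; they do not correspond to real compounds.
RACEMIC = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"
S_FORM = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
OTHER = "CCCCCCCCCCCCCC-UHFFFAOYSA-N"

sources = {
    "source_a": {RACEMIC, OTHER},
    "source_b": {S_FORM, OTHER},
    "source_c": {S_FORM, OTHER},
}

exact_all = consensus(sources, min_sources=3)   # strict intersect
majority = consensus(sources, min_sources=2)    # majority "vote"
groups = by_connectivity(set().union(*sources.values()))

print(sorted(exact_all), sorted(majority), {k: len(v) for k, v in groups.items()})
```

In this toy case the strict intersect keeps only one key, while the majority vote also recovers the stereo-specified form that two of three sources agree on, which is the kind of discordance the sofosbuvir example illustrates.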

This set of 923 drugs can be accessed via the MyNCBI open URL: http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/

While this 4-way corroborated set has high utility, there are caveats:

• The “approved” tag via DrugBank 4.0 is not FDA-only
• Since TTD last submitted in Feb 2012, the intersected drug content is capped to before that date (dropping TTD gives 1117 CIDs)
• Some metabolites (e.g. amino acids) come through the filters, but an additional filter on DrugBank nutraceuticals can be applied
• Some older drugs have no INN (e.g. aspirin)
• Some peptide drug CIDs are missing (suggesting low concordance)
• Approved fixed mixtures are excluded (but their capture is problematic anyway since they do not get an INN)
• The computed CID identity is actually a hash-code match, rather than via InChIKey (but this should give similar numbers)
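The TTD caveat (dropping one source raises the set from 923 to 1117 CIDs) suggests a simple leave-one-out check to see which source limits the consensus most. This is a minimal sketch under the assumption that each source is available as a set of CIDs; the helper names and the toy integers are illustrative only.

```python
def intersect_all(sets):
    """Intersect any number of sets; an empty input gives an empty set."""
    sets = list(sets)
    return set.intersection(*sets) if sets else set()

def leave_one_out(named_sets):
    """Return the full-consensus size and the gain from dropping each source."""
    full = intersect_all(named_sets.values())
    gains = {}
    for name in named_sets:
        rest = [s for other, s in named_sets.items() if other != name]
        gains[name] = len(intersect_all(rest)) - len(full)
    return len(full), gains

# Toy CID sets (placeholders, not PubChem data).
named_sets = {
    "drugbank": {1, 2, 3, 4, 5, 6},
    "chembl": {1, 2, 3, 4, 5, 7},
    "ttd": {1, 2, 3, 8},
    "inn": {1, 2, 3, 4, 5, 9},
}

full_size, gains = leave_one_out(named_sets)
print(full_size, gains)
```

A source with a large gain, like the toy "ttd" set here, is the one capping the consensus, for example because its last submission predates newer approvals.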

Introducing the Guide to PHARMACOLOGY

The IUPHAR/BPS Guide to PHARMACOLOGY Web portal (GToP) is an open resource providing expert-curated information on approved drugs, clinical candidates, research compounds and other selective ligands for a wide range of biological targets (PMID:24234439). Overview pages link to detailed information including biological variants, tissue distribution, signalling mechanisms and disease involvement. The key objective of our Wellcome Trust grant is to complete expert annotation of approved drugs vs their primary targets, connected by quantitative data (typically IC50, Ki or Kd). This poster describes the comparison of approved drugs inside PubChem that have an assigned compound identifier (CID).

How many approved drugs are there?

Given their crucial position for the global practice of medicine and formal approval by stringent national procedures, the uncertainty of approved drug structures seems paradoxical. Counts span a range from 1216 for the FDA Maximum Daily Dose Database (PubChem Assay ID 1195) up to 1817 quoted for the NCGC Pharmaceutical Collection (PMID:21525397), but the fact that these efforts stopped in 2008 and 2011, respectively, illustrates the challenge. This is underlined by a comparison of database subsets of approved drugs done in 2009 that recorded only 807 exact chemical structures in common (PMID:20298516).

A consensus approach to drug curation

As a pragmatic approach to maximising the precision and utility of our database, we have strategically chosen to use consensus sets as curatorial starting points, rather than attempt total capture (i.e. quality vs quantity). In addition, we set a high stringency for our curated activity mappings and primary target assignments. We thus do not target-map nutraceuticals, endogenous hormones or simple topical compounds, as they typically have many protein interactions already captured in metabolite, enzyme and pathway databases. The consensus concept (first exemplified in PMID:20298516) is that an exact chemical structure match between multiple sources is more likely to be right than wrong. Nonetheless, nominally independent databases may recycle annotations rather than resolve primary provenances (this is not a criticism, merely an observation). There are also challenging cases where consensus may be unrealistic to expect (e.g. large peptides, long oligos or complex natural products).

Workflow

Through exploring the PubChem mining options we came up with the workflow below.
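The triage described in the results (a DrugBank "approved" starting set intersected in turn with TTD, ChEMBL, single-component and INN sets) can be sketched as an ordered pipeline that records the count after each step, loosely analogous to a query history. This is an illustration only: the `triage` helper, the step labels and the toy CIDs are assumptions, and the counts are not the poster's numbers.

```python
def triage(start, filters):
    """Intersect `start` with each (label, set) filter in order.

    Returns the running (label, count) history and the final CID set.
    """
    current = set(start)
    history = [("start", len(current))]
    for label, members in filters:
        current = current & members
        history.append((label, len(current)))
    return history, current

# Toy CID sets (placeholders, not PubChem data).
drugbank_approved = {1, 2, 3, 4, 5, 6, 7, 8}
filters = [
    ("TTD", {1, 2, 3, 4, 5, 6, 20}),
    ("ChEMBL", {1, 2, 3, 4, 5, 21}),
    ("covalent unit count 1", {1, 2, 3, 4, 22}),
    ("INN synonym", {1, 2, 3, 23}),
]

history, consensus_cids = triage(drugbank_approved, filters)
for label, count in history:
    print(f"{label}: {count}")
```

Since intersection is commutative, the final set does not depend on filter order; only the intermediate counts do, which is useful for seeing where most CIDs are lost.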