
On the Accuracy of Chemical Structures Found on the Internet


Motivation

• It is axiomatic that data stored in chemical databases must be accurate; yet it has been reported that the error rate in freely accessible public databases may exceed 8%.1 A recent example comes from the NCGC (National Chemical Genomics Center) pharmaceutical collection (Figure 2).2

• When building computational models of chemical properties, one wrong structure in twenty is enough to reduce the reliability and prediction performance of the model.3

• Chemical data curation is labor-intensive and perhaps unexciting, but critical; it should be recognized and supported as an inseparable component of cheminformatics research.

Andrew D. Fant1, Eugene Muratov1, Denis Fourches1, David Sharpe2, Antony J. Williams2, and Alexander Tropsha1

1 Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill

2 Royal Society of Chemistry

Methods

• An initial master set of 151 (out of 200) names was generated by one author.

• Each team was required to return the following information about each compound: systematic name, MOL-formatted record, and JPEG/PNG/GIF image.

• The following search workflows were employed:

  • The UNC workflow (Figure 3) was based entirely on open Internet data repositories and included some manual re-entry of structures from PDF sources.

  • The RSC workflow (Figure 4) was more iterative in the early stages, and included redistribution-restricted sources in some cases.

  • The other two workflows (AZ and IMIM/CT) were more highly automated and are not described further in the current work.

• InChI keys were calculated from the returned molecular structures and compared. Discrepancies in structures were discussed by participants and a consensus was reached on which structure for the compound was supported by the evidence available, leading to the final master list.
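The comparison step above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow the groups actually used: the compound names and key strings are hypothetical placeholders, and `find_discrepancies` is a name introduced here. It assumes each group's structures have already been converted to InChI keys by an InChI-capable toolkit. The function only flags disagreements; as described above, resolving them was a matter of discussion among participants, not an automated vote.

```python
# Hypothetical placeholder data: one InChI key per compound per group.
# Real keys would be computed from each group's returned MOL records.
submissions = {
    "drug_a": {"RSC": "KEY-1", "UNC": "KEY-1", "AZ": "KEY-1", "IMIM/CT": "KEY-1"},
    "drug_b": {"RSC": "KEY-2", "UNC": "KEY-3", "AZ": "KEY-2", "IMIM/CT": "KEY-2"},
}


def find_discrepancies(submissions):
    """Return {compound: {key: [groups]}} for compounds where groups disagree."""
    flagged = {}
    for compound, key_by_group in submissions.items():
        groups_by_key = {}
        for group, key in key_by_group.items():
            groups_by_key.setdefault(key, []).append(group)
        # More than one distinct key means at least two groups disagree.
        if len(groups_by_key) > 1:
            flagged[compound] = groups_by_key
    return flagged


# Compounds listed here would go to the participants for manual review.
flagged = find_discrepancies(submissions)
for compound, groups_by_key in flagged.items():
    print(compound, groups_by_key)
```

Grouping by key rather than counting votes keeps the output useful for review: it shows which groups agree with each other on each candidate structure.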

Figure 3: UNC workflow – Name to structure resolution

Figure 4: RSC workflow – Name to structure resolution

Identifying correct chemical structures from compound names utilizing publicly available resources on the Internet is possible, but not trivial.

Success requires careful comparison of multiple resources. No single source is correct in all cases.

Automated Internet queries are still significantly less accurate than manually guided searches.

InChI strings and keys are an improvement in chemical data handling, but the current standard keys are not perfect for large-scale comparisons.
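One reason whole-key comparison is a blunt instrument is that a mismatch does not say what kind of difference was found. An InChIKey has three hyphen-separated blocks: a 14-character block derived from connectivity, a 10-character block covering the remaining layers (such as stereochemistry) plus flag characters, and a final single character. The sketch below, with hypothetical function names and placeholder keys that do not correspond to real compounds, uses that layout to separate connectivity mismatches from differences confined to the other layers. It is an illustration under those assumptions, not a method used in this study.

```python
def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks."""
    parts = key.split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)


def classify_pair(key_a, key_b):
    """Describe how two InChIKeys differ, based on which block mismatches."""
    if key_a == key_b:
        return "identical"
    block_a, block_b = split_inchikey(key_a), split_inchikey(key_b)
    if block_a[0] == block_b[0]:
        # Same first block: the difference lies in the other layers,
        # for example stereochemistry.
        return "same connectivity, other layers differ"
    return "different connectivity"


# Placeholder keys with the right block lengths (not real compounds).
k1 = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
k2 = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
k3 = "DDDDDDDDDDDDDD-BBBBBBBBSA-N"

print(classify_pair(k1, k2))
print(classify_pair(k1, k3))
```

A triage of this kind could help route a wrong-chirality case differently from a wholly wrong structure, although it cannot by itself say which structure is correct.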

We believe that the adoption of the MIABE (Minimum Information About Bioactive Entity) standard5 as part of the peer-reviewed literature publication process could improve the quality of public structural information by eliminating manual re-entry of structures from the primary literature as is currently required in most cases.

It is insufficient for a database to return the correct structure from a name query. It also should minimize (better, eliminate) the number of incorrect and/or auxiliary answers that are returned along with the correct one.

References

1 Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. QSAR Comb Sci 2008, 27, 1337–1345.
2 Williams, A. J.; Ekins, S.; Tkachenko, V. Drug Discov Today 2012, 1–17.
3 Fourches, D.; Muratov, E.; Tropsha, A. J Chem Inf Model 2010, 50, 1189–1204.
4 Muresan, S. et al. Drug Discov Today 2011, 16, 1019–1030.
5 Orchard, S. et al. Nat Rev Drug Discov 2011, 10, 661–669.

Acknowledgements

The authors would like to thank Ricard Garcia (Chemotargets S.L.), Jordi Mestres (Barcelona IMIM), Sorel Muresan and Christopher Southan (AstraZeneca), and Andrey Erin (ACD/Labs) for their participation in the search for Internet drug structures. Phyllis Pugh provided workflow graphics and statistical consulting. We acknowledge software licenses donated by OpenEye Scientific Software, ChemAxon, and ACD/Labs that were used for portions of the data collection and analysis. AT acknowledges financial support from NIH (grant GM66940) and EPA (grant RD 83499901).

Conclusions

• Structures from the consensus master list were compared (as InChI keys) to hits on name searches against several well-known chemical structure repositories. The number of correct structures and total number of hits are summarized in Figure 7.

• Incorrect structures in ChemSpider were corrected when found, but not counted as correct for this analysis.

Figure 7: Accuracy of results from public structure repositories
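The two quantities summarized in Figure 7 (correct structures and total hits per repository) can be tallied as in the sketch below. It is an assumption-laden illustration: `summarize_hits` is a hypothetical helper, the names and keys are placeholders, and a name counts as correct here when the master-list key appears anywhere among the hits returned for it.

```python
def summarize_hits(master, hits_by_name):
    """Return (names with the correct key among their hits, total hits)."""
    correct = 0
    total_hits = 0
    for name, master_key in master.items():
        hits = hits_by_name.get(name, [])
        total_hits += len(hits)
        if master_key in hits:
            correct += 1
    return correct, total_hits


# Placeholder data for one hypothetical repository.
master = {"drug_a": "KEY-1", "drug_b": "KEY-2", "drug_c": "KEY-3"}
hits_by_name = {
    "drug_a": ["KEY-1"],                    # one hit, correct
    "drug_b": ["KEY-9", "KEY-2", "KEY-8"],  # correct, but with two extra hits
    "drug_c": ["KEY-7"],                    # one hit, incorrect
}

correct, total_hits = summarize_hits(master, hits_by_name)
print(f"{correct} of {len(master)} names correct, {total_hits} hits returned")
```

Reporting total hits alongside the correct count makes visible the case where a repository returns the right structure buried among wrong ones.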

Results

• Out of 151 total compounds, all 4 groups reported a structure identical to that on the initial master list for 113 compounds (74.8%).

• No compound was incorrectly reported by all 4 groups; no group achieved 100% accuracy (Figure 5).

• Differing results between the curated and unsupervised structure determination methods are highly significant by Fisher’s Exact Test.
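For readers unfamiliar with the test, the sketch below shows how a two-sided Fisher's exact test on a 2×2 table of correct and incorrect counts can be computed from hypergeometric probabilities. The counts in the example call are made up for illustration and are not the poster's data, which is not reproduced here. `fisher_exact_two_sided` is a name introduced for this sketch; in practice a statistics library would normally be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Made-up counts (NOT from the poster): rows are methods,
# columns are correct and incorrect structures.
p_value = fisher_exact_two_sided(18, 2, 11, 9)
print(f"p = {p_value:.4f}")
```

The exact test suits this kind of comparison because it makes no large-sample approximation, which matters when the count of incorrect structures in one group is small.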

Figure 5: Relative accuracy of groups against final master list

Figure 6: Examples of problematic structures and sources of disagreement

Panels: Tautomeric Forms (Vardenafil); Chiral Sulphoxides (Esomeprazole); Pro-drug Forms (Olmesartan); Wrong Chirality (Pravastatin); Just Plain Wrong (Paclitaxel). Structures are marked ✔ or ✗ and annotated with the reporting groups: RSC, AZ & IMIM/CT; UNC; UNC & IMIM/CT.

Figure 2: “Neomycin” – First six structures retrieved from the NCGC browser

Study Design

• Select and curate a list of the 200 top-selling drugs (as of 2006, from Wikipedia).

• Distribute the list to four independent groups of cheminformaticians and ask each group to generate the structures of the drugs using their preferred methods.

  • Royal Society of Chemistry (RSC): Manual Web Search

  • University of North Carolina (UNC): Manual Web Search

  • AstraZeneca (AZ): Automated Search of Pre-curated Internal Source4

  • Institut de Recerca Hospital del Mar/Chemotargets S.L. (IMIM/CT): Automated Internet Search

• Compare the results and discuss any discrepancies until agreement on the correct structure is reached.

• Once a master list is established, compare those structures to individual public chemical structure sources.

Figure 1: Which structure of the top-selling anti-glaucoma drug dorzolamide is correct?

ChemSpider ID 4447604 ChemSpider ID 23499154