Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs

Antony Williams

ACS Denver

September 2011

Upfront Acknowledgment - All Authors…

Royal Society of Chemistry – Antony Williams, David Sharpe

University of North Carolina, Chapel Hill – Alex Tropsha, Denis Fourches, Eugene Muratov, Andrew Fant

Chemotargets SL – Ricard Garcia-Serna

IMIM-Hospital del Mar Research Institute and Universitat Pompeu Fabra – Jordi Mestres

AstraZeneca – Sorel Muresan, Christopher Southan

ACD/Labs – Andrey Erin

Internet-Based Chemistry

Internet-based chemistry resources are:

Diverse in quality

Confusing

Uncoordinated

Fixable – with a lot of effort

Open PHACTS: a partnership between the European Community and EFPIA

Freely accessible for knowledge discovery and verification

Data on small molecules

Pharmacological profiles

Pharmacokinetics

ADMET data

Biological targets and pathways

Proprietary and public data sources

Stop Whining – Fix it

What needs to happen?

Standards: standardization of structures; ChEBI/PubChem sharing; InChI adoption

Collaboration: stop reinventing the wheel; share data, share efforts and speed the process

Vision is not good enough – Execute!

Standards: Structure Standardization

Collaboration

Then this won’t happen…

Top 200 Drugs on Wikipedia

http://en.wikipedia.org/wiki/List_of_bestselling_drugs

The Project Challenge PART ONE

Agree on the set of chemical names to work with

Independently create an SDF file in each “lab”

Compare differences and agree on final structures

Issue “Gold Standard” SDF file to team
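The compare-and-agree step can be pictured as a small script. This is a minimal sketch, assuming each lab's SDF file has already been reduced to a mapping from generic name to standard InChIKey. The function name `compare_labs`, the lab labels, and the keys below are hypothetical placeholders, not data from the project.

```python
def compare_labs(assignments):
    """Split drug names into agreed and disputed structure assignments.

    `assignments` maps a lab label to a dict of {generic name: InChIKey}.
    A name counts as agreed only when every lab supplied the same key.
    """
    names = set()
    for lab_keys in assignments.values():
        names.update(lab_keys)

    agreed, disputed = {}, {}
    for name in sorted(names):
        # Collect each lab's key; None marks a lab with no structure for the name.
        votes = {lab: keys.get(name) for lab, keys in assignments.items()}
        distinct = set(votes.values())
        if len(distinct) == 1 and None not in distinct:
            agreed[name] = distinct.pop()
        else:
            disputed[name] = votes
    return agreed, disputed


# Placeholder keys (NOT real InChIKeys), used only to exercise the function.
labs = {
    "lab_a": {
        "drug_x": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "drug_y": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
    },
    "lab_b": {
        "drug_x": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "drug_y": "CCCCCCCCCCCCCC-EEEEEEEESA-N",
    },
}
agreed, disputed = compare_labs(labs)
print("agreed:", sorted(agreed))
print("needs manual review:", sorted(disputed))
```

Names that land in `disputed` are the ones that would go back to the team for discussion before the Gold Standard file is issued.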

The Project Challenge PART TWO

Use Gold Standard SDF File to investigate data quality on these compounds in Internet Databases

Two checks:

Search chemical name – does it return the correct compound? If not correct, how is it different?

Search “structure” – SMILES, Molfile, InChIString or InChIKey
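The structure check amounts to comparing a database record against the gold standard. This is a minimal sketch, assuming both structures are available as standard InChIKeys (27 characters in three hyphen-separated blocks, the first of which hashes the connectivity). The function name `classify_hit` and the category labels are hypothetical, and the keys in the usage lines are placeholders rather than real identifiers.

```python
def classify_hit(gold_key, found_key):
    """Describe how a database structure differs from the gold standard.

    Both arguments are standard InChIKeys of the form
    XXXXXXXXXXXXXX-YYYYYYYYYY-Z (14, 10 and 1 characters).
    """
    gold = gold_key.strip().upper().split("-")
    found = found_key.strip().upper().split("-")
    if len(gold) != 3 or len(found) != 3:
        raise ValueError("expected three hyphen-separated InChIKey blocks")

    if gold == found:
        return "exact match"
    if gold[0] == found[0]:
        # Same connectivity hash: the difference lies in the remaining
        # layers, such as stereochemistry, isotopes or protonation.
        return "same skeleton, differs in stereo/isotope/protonation"
    return "different compound"


# Placeholder keys (NOT real InChIKeys), used only to exercise the function.
gold = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print(classify_hit(gold, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"))
print(classify_hit(gold, "AAAAAAAAAAAAAA-FFFFFFFFSA-N"))
print(classify_hit(gold, "GGGGGGGGGGGGGG-BBBBBBBBSA-N"))
```

Because the key is a hash, a matching first block is a strong hint rather than proof of identical connectivity, so borderline cases still deserve a look at the full InChI string or the Molfile.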

200 Top-Selling Drugs (2006)

Biologicals removed immediately

Single compounds versus mixtures identified

Decision to NOT exclude racemates

List of 152 drugs to analyze

Generic names used

Different Approaches

ACD/Labs – Curated commercial dictionary

RSC|ChemSpider and UNC Chapel Hill – manual curation

ChemoTargets/IMIM – lookup against database

AstraZeneca – lookup against database

Different Approaches

Choose a Starting Point

Comparisons

Observations

Manual curation – a slow and imperfect process: a loop of assertions; software tool issues

Lookup – fast and imperfect; totally dependent on the initial investment in time

InChIs – very useful for comparison, but imperfect

Structure Representations

Representing Racemates

Representing Racemates - Formoterol

Racemic Mixtures

“The First 10”

Collaboration on Curation

If we could collaborate on curation…share through standards and open interfaces

Proof of Concept Data Curation Sharing

SciDBs.com (Coming soon)

Conclusions

It is DIFFICULT to aggregate high-quality structure datasets of even common drugs!

InChI is very enabling, but enhanced stereo is necessary

Is there a need to be “right”?

Publication will provide: recommendations for structure standardization; rank ordering of resources; suggestions for InChI enhancement; SDF file; curation feed of structures and synonyms

Thank you

Email: williamsa@rsc.org

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams
