Structure representations in public chemistry databases: The challenges of validating the chemical...

DESCRIPTION

Internet-based public domain databases containing chemical compounds have grown in number, capability and content in recent years. There are now many databases containing millions of chemical compounds associated with different types of data including chemical names, properties, analytical data, and with associated mapping to proteins, assay data, clinical information and so on. These disparate data sources suffer from one common issue – quality of data. This presentation will provide an overview of our efforts to source the appropriate structural representations for 200 top-selling drugs from public domain sources. This intra- and inter-laboratory comparison of approaches, processes and necessary agreements exposed the challenges associated with aggregating structure-based data. The project also provided data regarding the distribution of quality issues associated with many of the community’s popular databases.

Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs

Antony Williams

ACS Denver

September 2011

Upfront Acknowledgment - All Authors…

Royal Society of Chemistry – Antony Williams, David Sharpe

University of North Carolina, Chapel Hill – Alex Tropsha, Denis Fourches, Eugene Muratov, Andrew Fant

Chemotargets SL – Ricard Garcia-Serna

IMIM-Hospital del Mar Research Institute and Universitat Pompeu Fabra – Jordi Mestres

AstraZeneca – Sorel Muresan, Christopher Southan

ACD/Labs – Andrey Erin

Internet-Based Chemistry

Internet-based chemistry resources are:

Diverse in quality

Confusing

Uncoordinated

Fixable – with a lot of effort

Open PHACTS: partnership between the European Community and EFPIA

Freely accessible for knowledge discovery and verification.

Data on small molecules

Pharmacological profiles

Pharmacokinetics

ADMET data

Biological targets and pathways

Proprietary and public data sources.

Stop Whining – Fix it

What needs to happen?

Standards: standardization of structures; ChEBI/PubChem sharing; InChI adoption

Collaboration: stop reinventing the wheel; share data, share efforts and speed the process

Vision is not good enough – Execute!

Standards: Structure Standardization

Collaboration

Then this won’t happen…

Top 200 Drugs on Wikipedia

http://en.wikipedia.org/wiki/List_of_bestselling_drugs

The Project Challenge PART ONE

Agree on the set of chemical names to work with

Independently create an SDF file in each “lab”

Compare differences and agree on final structures

Issue “Gold Standard” SDF file to team
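
The compare-and-agree step above can be sketched in a few lines. This is only an illustration, not the procedure the team used: it assumes each lab's SDF file has already been reduced to one structure identifier per drug name (for example an InChIKey), and the function name `compare_labs`, the lab names and the keys are all hypothetical placeholders.

```python
def compare_labs(submissions):
    """Split drug names into agreed and disputed sets.

    submissions maps a lab name to {drug name: structure identifier}.
    A drug is "agreed" only if every lab supplied it and all identifiers match.
    """
    names = set()
    for records in submissions.values():
        names.update(records)

    agreed, disputed = {}, {}
    for name in sorted(names):
        # Collect what each lab reported for this drug (labs may omit a drug).
        keys = {lab: records[name]
                for lab, records in submissions.items() if name in records}
        distinct = set(keys.values())
        if len(distinct) == 1 and len(keys) == len(submissions):
            agreed[name] = distinct.pop()
        else:
            disputed[name] = keys  # needs manual review before the gold standard
    return agreed, disputed


# Placeholder data: lab names and keys are invented for illustration only.
submissions = {
    "lab_a": {"drug one": "KEY-1", "drug two": "KEY-2"},
    "lab_b": {"drug one": "KEY-1", "drug two": "KEY-2X"},
    "lab_c": {"drug one": "KEY-1", "drug two": "KEY-2"},
}
agreed, disputed = compare_labs(submissions)
```

Only the entries in `agreed` would go straight into a gold-standard file; everything in `disputed` still needs a human decision, which is where the time goes.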

The Project Challenge PART TWO

Use Gold Standard SDF File to investigate data quality on these compounds in Internet Databases

Two checks:

Search chemical name – does it return the correct compound? If not correct, how is it different?

Search “structure” – SMILES, Molfile, InChIString or InChIKey
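
One way to answer “how is it different?” for a database hit is to compare its InChIKey with the gold-standard key block by block. The sketch below assumes standard InChIKeys, where the first 14-letter block reflects the connectivity skeleton, the first eight letters of the second block reflect the remaining layers (such as stereochemistry and isotopes), and the final character indicates protonation. The function name `classify_key_difference` and the keys used are made-up placeholders in the InChIKey shape, not real compounds.

```python
def classify_key_difference(gold_key, found_key):
    """Describe how a retrieved InChIKey differs from the gold-standard key."""
    gold = gold_key.split("-")
    found = found_key.split("-")
    if len(gold) != 3 or len(found) != 3:
        raise ValueError("expected three hyphen-separated blocks")

    if gold == found:
        return "identical"
    if gold[0] != found[0]:
        # First block differs: the connectivity itself is not the same.
        return "different skeleton"
    if gold[1][:8] != found[1][:8]:
        # Same connectivity, but stereo/isotope (or other minor) layers differ.
        return "same skeleton, different stereo or isotope layers"
    # Only the trailing flags or the protonation character differ.
    return "same layers, different flags or protonation"


# Placeholder keys, invented for illustration only.
GOLD = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
results = [
    classify_key_difference(GOLD, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    classify_key_difference(GOLD, "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    classify_key_difference(GOLD, "DDDDDDDDDDDDDD-BBBBBBBBSA-N"),
]
```

Because the key is a hash, this only says which kind of layer differs, not what the difference is; the full InChI strings or the structures themselves are still needed to see the actual change.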

200 Top-Selling Drugs (2006)

Biologicals removed immediately

Single compounds versus mixtures identified

Decision to NOT exclude racemates

List of 152 drugs to analyze

Generic names used

Different Approaches

ACD/Labs – Curated commercial dictionary

RSC|ChemSpider and UNC Chapel Hill – manual curation

Chemotargets/IMIM – lookup against database

AstraZeneca – lookup against database

Choose a Starting Point

Comparisons

Observations

Manual curation – slow and imperfect process; a loop of assertions; software tool issues

Lookup – fast and imperfect; totally dependent on initial investment in time

InChIs – very useful for comparison; imperfect

Structure Representations

Representing Racemates

Representing Racemates - Formoterol

Racemic Mixtures

“The First 10”

Collaboration on Curation

If we could collaborate on curation…share through standards and open interfaces

Proof of Concept Data Curation Sharing

SciDBs.com (Coming soon)

Conclusions

It is DIFFICULT to aggregate high-quality structure datasets of even common drugs!

InChI is very enabling but enhanced stereo is necessary

Is there a need to be “right”?

Publication will provide:

Recommendations for structure standardization

Rank ordering of resources

Suggestions for InChI enhancement

SDF file

Curation feed of structures and synonyms

Thank you

Email: williamsa@rsc.org

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams
