Sourcing High-Quality Online Data Resources for Computational Toxicology Antony Williams Bio-IT World, Current Methods for Computational Toxicology and Chemogenomics

DESCRIPTION

The internet continues to offer increased access to chemistry data that may be of value to scientists interested in populating systems containing reference toxicology data as well as to provide data for the development of predictive models. This presentation will give an overview of some of the various sources of data available via the internet, provide an overview of some of the challenges associated with gathering high-quality data and discuss methods by which to mesh together disparate data sources.

Page 1: Sourcing high quality online data resources for computational toxicology

Sourcing High-Quality Online Data Resources for Computational Toxicology

Antony Williams

Bio-IT World, Current Methods for Computational Toxicology and Chemogenomics

Page 2: Sourcing high quality online data resources for computational toxicology

The Community Depends On Us

“We don’t want another Love Canal!” “What we know about PCBs should warn us all!”

The public is “suspicious” of pharma… “Chemicals are dangerous”

Page 3: Sourcing high quality online data resources for computational toxicology

Comp Tox Models Depend on DATA

Models for Computational Toxicology depend on the quality of the training set

There are multiple issues with data quality, including:

Experimental: the validity of the method, reproducibility, sample quality, data capture, transcription of values

Computational: accurately representing the data (correct units, annotations, quality flags, attribution); are the structures correct?
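The computational checks listed on this slide (units, attribution, quality flags) can be sketched as a small record validator. This is only an illustration: the `Measurement` fields, the `ALLOWED_UNITS` table and the example values are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass

# Hypothetical table of units accepted per property; extend as needed.
ALLOWED_UNITS = {"melting_point": {"°C", "K"}}


@dataclass
class Measurement:
    compound: str      # name or identifier as supplied by the source
    prop: str          # property name, e.g. "melting_point"
    value: float
    units: str
    source: str = ""   # attribution; empty means unknown


def check(m: Measurement) -> list:
    """Return quality flags for one record (an empty list means no issue found)."""
    flags = []
    if m.units not in ALLOWED_UNITS.get(m.prop, set()):
        flags.append("unrecognized units")
    if not m.source:
        flags.append("missing attribution")
    return flags


# Made-up record with a unit outside the accepted set and no source.
record = Measurement("example compound", "melting_point", 100.0, "F")
flags = check(record)
```

A record that passes these checks can still carry a wrong value or a wrong structure; flags like these only catch the representation problems.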

Page 4: Sourcing high quality online data resources for computational toxicology

Nothing but the Facts

Jean-Claude Bradley, Drexel University

“There are no facts, only measurements embedded within assumptions”

Page 5: Sourcing high quality online data resources for computational toxicology

Open Notebook Science

UsefulChem Blog: http://tinyurl.com/48dyujh

Page 6: Sourcing high quality online data resources for computational toxicology

Aqueous Solubility of EGCG

Epigallocatechin gallate solubility in water

Page 7: Sourcing high quality online data resources for computational toxicology

Melting Point of DMT

Page 8: Sourcing high quality online data resources for computational toxicology

Content is King and Quality Costs

Chemistry “content” is big money
Patent searching
Structures and properties
Drug databases
Literature databases

Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information
101 years of content
$260 million revenue (2006)
>50 million substances
>60 million sequences

Page 9: Sourcing high quality online data resources for computational toxicology

Where can we find data online?

Page 10: Sourcing high quality online data resources for computational toxicology

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 11: Sourcing high quality online data resources for computational toxicology

Lots of “Public Compound” Databases

PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors
ChemSpider

Page 12: Sourcing high quality online data resources for computational toxicology

Toxicology Data

Page 13: Sourcing high quality online data resources for computational toxicology
Page 14: Sourcing high quality online data resources for computational toxicology

Chemistry on the Internet

ChemSpider “links” chemistry on the internet

Almost 25 million compounds, 400 data sources
Allows community deposition, curation, annotation
Integrating properties, publications, patents, media
Text, structure and substructure searching

Page 15: Sourcing high quality online data resources for computational toxicology

www.chemspider.com

Page 16: Sourcing high quality online data resources for computational toxicology

Search for a Chemical

Page 17: Sourcing high quality online data resources for computational toxicology

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 18: Sourcing high quality online data resources for computational toxicology

We Have Delivered the Vision

“Build a Structure Centric Community to Serve Chemists”

Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data

Page 19: Sourcing high quality online data resources for computational toxicology

Dialects describing chemicals

Page 20: Sourcing high quality online data resources for computational toxicology

What is the Structure of Vitamin K?

Page 21: Sourcing high quality online data resources for computational toxicology

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

Page 22: Sourcing high quality online data resources for computational toxicology

What is the Structure of Vitamin K1?

Page 23: Sourcing high quality online data resources for computational toxicology

What is the Structure of Vitamin K1?

Page 24: Sourcing high quality online data resources for computational toxicology

CAS’s Common Chemistry

Page 25: Sourcing high quality online data resources for computational toxicology

Wikipedia

Page 26: Sourcing high quality online data resources for computational toxicology
Page 27: Sourcing high quality online data resources for computational toxicology
Page 28: Sourcing high quality online data resources for computational toxicology

ChEBI – Manual Curation

Page 29: Sourcing high quality online data resources for computational toxicology
Page 30: Sourcing high quality online data resources for computational toxicology
Page 31: Sourcing high quality online data resources for computational toxicology

PubChem

Page 32: Sourcing high quality online data resources for computational toxicology
Page 33: Sourcing high quality online data resources for computational toxicology

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
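One rough way to see that these name variants differ only in their stereodescriptors is to strip the descriptors and compare what is left. The sketch below works on the truncated name fragments exactly as they appear on the slide. The regular expression is a text-level heuristic assumed for illustration, not a chemistry-aware parser; a real comparison would work on the structures themselves (for example, on the stereo layers of an InChI).

```python
import re

# Name fragments as shown on the slide (truncated in the source).
names = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Matches descriptor groups such as "(E)-", "(E,11S)-" or "(E,7R,11R)-".
STEREO = re.compile(r"\((?:[EZ]|\d+[RS])(?:,(?:[EZ]|\d+[RS]))*\)-")


def skeleton(name: str) -> str:
    """Drop stereodescriptors and bracket characters from a name."""
    no_stereo = STEREO.sub("", name)
    return re.sub(r"[\[\]()]", "", no_stereo)


# Group the distinct name variants by their stereo-free skeleton.
groups = {}
for n in names:
    groups.setdefault(skeleton(n), set()).add(n)
```

Grouping this way only shows which names describe the same connectivity. It says nothing about which stereo variant is the correct one.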

Page 34: Sourcing high quality online data resources for computational toxicology

Public Domain Chemistry Databases

Our databases are a mess…

Non-curated databases are proliferating errors

We source and deposit data between databases

Original sources of errors hard to determine

Curation is time-consuming, challenging and exacting

Page 35: Sourcing high quality online data resources for computational toxicology

Vancomycin

Who will curate?

PubChem is not resourced to clean these errors

How would you clean such a large dataset?

Page 36: Sourcing high quality online data resources for computational toxicology

The FDA’s DailyMed

Page 37: Sourcing high quality online data resources for computational toxicology

Structures on DailyMed

Page 38: Sourcing high quality online data resources for computational toxicology

Lack of Stereochemistry

Page 39: Sourcing high quality online data resources for computational toxicology

Incorrect Structures

Page 40: Sourcing high quality online data resources for computational toxicology

Wow!

Page 41: Sourcing high quality online data resources for computational toxicology

We want to model DILI…

Drug metabolism in the liver can convert some drugs into highly reactive intermediates. This can affect the structure and functions of the liver.

Drug-induced liver injury (DILI) is the #1 reason drugs are not approved or are withdrawn from the market after approval

Estimated global annual incidence rate of DILI is 13.9-24.0 per 100,000 inhabitants

DILI accounts for an estimated 3-9% of all adverse drug reactions reported to health authorities

Herbal components can cause DILI too

Thanks to Sean Ekins https://dilin.dcri.duke.edu/for-researchers/info/
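To put the incidence figure in perspective, the per-100,000 rate can be scaled to a population of a given size. The population of one million below is a hypothetical example, not a figure from the slide.

```python
def expected_cases(population: int, rate_per_100k: float) -> float:
    """Expected annual cases for a population at a given rate per 100,000."""
    return population * rate_per_100k / 100_000


# Range quoted on the slide: 13.9 to 24.0 per 100,000 inhabitants per year.
low = expected_cases(1_000_000, 13.9)
high = expected_cases(1_000_000, 24.0)
```

For one million inhabitants this works out to roughly 139 to 240 cases per year.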

Page 42: Sourcing high quality online data resources for computational toxicology

Initial DILI data – Names and Data

Griseofulvin
Hycanthone
Hydrochlorothiazide
Hydrocortisone
Hydroxyurea
Idarubicin HCl
Idoxuridine
Imipramine HCl
indomethacin
isoniazid
Isoproterenol HCl
Isotretinoin
Isoxsuprine HCl
Kanamycin Sulfate
Ketorolac Tromethamine
Ketotifen
Labetalol

Page 43: Sourcing high quality online data resources for computational toxicology

So you want data on drugs???

Sourcing data based on drug names is difficult!

Where would you find the “correct chemical structures”?

What databases can you trust?
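One pragmatic way to approach these questions is to look up the same drug name in several databases and see where the returned structures agree. The sketch below assumes each source returns a single structure identifier per name (for example, a standardized structure string). The source names and identifiers are placeholders, and no real database API is called.

```python
from collections import Counter


def consensus(records: dict):
    """Given {source: structure identifier or None}, return
    (most common identifier, agreeing sources, dissenting sources)."""
    found = {src: ident for src, ident in records.items() if ident is not None}
    if not found:
        return None, [], []
    top, _ = Counter(found.values()).most_common(1)[0]
    agree = sorted(src for src, ident in found.items() if ident == top)
    dissent = sorted(src for src, ident in found.items() if ident != top)
    return top, agree, dissent


# Placeholder lookups for one drug name; None means the source had no hit.
lookups = {"SourceA": "ID-1", "SourceB": "ID-1", "SourceC": "ID-2", "SourceD": None}
result = consensus(lookups)
```

Agreement is a signal, not proof. Because databases source and deposit data between one another, the same error can show up in several places, so dissenting sources are best treated as prompts for manual review.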

Page 44: Sourcing high quality online data resources for computational toxicology

Vytorin: Ezetimibe/Simvastatin

Page 45: Sourcing high quality online data resources for computational toxicology

Vytorin: Ezetimibe/Simvastatin

Page 46: Sourcing high quality online data resources for computational toxicology

Vytorin: Ezetimibe/Simvastatin

Page 47: Sourcing high quality online data resources for computational toxicology

Vytorin: Ezetimibe/Simvastatin

Page 48: Sourcing high quality online data resources for computational toxicology

Vytorin: Ezetimibe/Simvastatin

Page 49: Sourcing high quality online data resources for computational toxicology

Symbicort: Budesonide + Formoterol

Page 50: Sourcing high quality online data resources for computational toxicology

Symbicort: Budesonide + Formoterol

ChemIDPlus

Wikipedia

Page 51: Sourcing high quality online data resources for computational toxicology

DrugBank: Search Symbicort…

Page 52: Sourcing high quality online data resources for computational toxicology

Symbicort: Budesonide + Formoterol PubChem

8 structures called Budesonide. 1 “correct”
6 structures called Formoterol. 1 “correct”
Search on “Symbicort” gives 1 structure.

Page 53: Sourcing high quality online data resources for computational toxicology

Taxol: Paclitaxel 44 structures

Page 54: Sourcing high quality online data resources for computational toxicology

Taxol: Paclitaxel Bioassay Data

Page 55: Sourcing high quality online data resources for computational toxicology

Public Domain Chemistry Databases

An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs

Page 56: Sourcing high quality online data resources for computational toxicology
Page 57: Sourcing high quality online data resources for computational toxicology

Columns: Drug Name, Generic Name, ChEBI, ChemSpider, CAS Common Chemistry, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia

Spiriva (Tiotropium Bromide): No Hits, No Hits, 4/0
Depakote (Valproate semisodium): No Structure
Basen (Voglibose): No Hits, No Hits, 2/1
Symbicort, 1) Budesonide: 8/1
Symbicort, 2) Formoterol: WRONG, No Hits, 6/1
Vytorin, 1) Ezetimibe: No Hits
Vytorin, 2) Simvastatin: 2/1
Taxol (Paclitaxel): 44/1
Thalidomid (Thalidomide): No Hits
Zocor (Simvastatin): 2/1
Crestor (Rosuvastatin): No Hits, 2/1
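The “n/m” cells can be turned into a simple hit-quality figure. The sketch below assumes “n/m” means n structures returned for the name, of which m were judged correct. That reading matches the PubChem slide (8 structures called Budesonide, 1 “correct”) but is otherwise an assumption, and `fraction_correct` is a hypothetical helper.

```python
def fraction_correct(cell: str) -> float:
    """Parse an 'n/m' cell (n structures returned, m correct) and return m / n."""
    hits, correct = (int(part) for part in cell.split("/"))
    return correct / hits


# The distinct n/m values that appear in the table.
cells = ["4/0", "2/1", "8/1", "6/1", "44/1"]
fractions = {cell: fraction_correct(cell) for cell in cells}
```

On this reading, a name search for Paclitaxel returns the correct structure in 1 of 44 hits, so a modeler who picks a hit without checking it is far more likely to pick a wrong structure than the right one.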

Page 58: Sourcing high quality online data resources for computational toxicology

Personal Experiences

Highest Quality Resources: DSSTox (EPA), ChEBI (EBI)

High Quality Resources: DrugBank, Human Metabolome Database, ChemIDPlus, ChemSpider, KEGG

Are there others you use???

Page 59: Sourcing high quality online data resources for computational toxicology

What can be done to help?

“Crowdsourcing” – gather the support of members of the community to add, annotate and curate data

Wikipedia is the domain success story for crowdsourcing.

PubChem is an example of “crowdsourced deposition” of chemistry data

ChemSpider is an example of “crowdsourced deposition and curation”

Page 60: Sourcing high quality online data resources for computational toxicology

The Future: Open Source and Data

Open source software: descriptors and algorithms
QSAR should be cheaper and better!
Selectively share your models with collaborators
Centralized hosting of models / predictions

Page 61: Sourcing high quality online data resources for computational toxicology

The Future: Open PHACTS

The Open PHACTS project will develop an open access innovation platform, called Open Pharmacological Space (OPS), via a semantic web approach. OPS will be comprised of data, vocabularies and infrastructure needed to accelerate drug-oriented research.

Page 62: Sourcing high quality online data resources for computational toxicology

Exposing Data for Semantic Web…
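As a rough idea of what exposing compound data for the semantic web can look like, the sketch below writes a few facts as N-Triples lines. The `example.org` URIs are placeholders and do not reflect how ChemSpider or Open PHACTS identify compounds; only the `rdfs:label` and `owl:sameAs` predicate URIs are standard vocabulary terms.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Placeholder namespaces, assumed for illustration only.
SITE_A = "http://example.org/site-a/compound/"
SITE_B = "http://example.org/site-b/compound/"


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Build one N-Triples line from an already-formatted object term."""
    return f"{uri(subject)} {uri(predicate)} {obj} ."


lines = [
    triple(SITE_A + "1", RDFS_LABEL, literal("Vitamin K1")),
    triple(SITE_A + "1", OWL_SAME_AS, uri(SITE_B + "42")),
]
document = "\n".join(lines)
```

Linking records across sources with `owl:sameAs` assumes that the two records really describe the same structure, which is exactly the curation problem raised in the earlier slides.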

Page 63: Sourcing high quality online data resources for computational toxicology

Coming soon…

Book chapter:

“Accessing, Using and Creating Chemical Property Databases For Computational Toxicology Modeling”

Antony J. Williams, Sean Ekins, Ola Spjuth and Egon L. Willighagen

Page 64: Sourcing high quality online data resources for computational toxicology

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams