ChemSpider and Traveling the Internet via Chemical Structures Cheminformatics Presentation

DESCRIPTION

This short presentation was given remotely to chemistry students in Jean-Claude Bradley's class at Drexel University.

ChemSpider and Traveling the Internet via Chemical Structures

Antony Williams

Drexel University, November 2012

Compounds and Identifiers

Chemistry on the Internet

Where do you source chemistry information?

What can you trust online?

How can you recognize potential issues?

Cross-referencing and curating data

Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)

Molfiles

 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END

Molfiles

Molfiles are the primary exchange format between structure drawing packages

Can be different between different drawing packages

Most commonly carry X,Y coordinates for layout

Can support polymers, organometallics, etc.

Can carry 3D coordinates
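As a rough illustration of what a Molfile carries, the sketch below reads the V2000 connection table shown on the earlier slide. It is a minimal sketch, not a full parser: it assumes whitespace-separated fields (real Molfiles are fixed-column, so fields can run together), and the names `parse_v2000` and `MOLBLOCK` are made up for this example.

```python
from collections import Counter

# Connection table transcribed from the slide; the usual three header
# lines (name, program/timestamp, comment) are absent there as well.
MOLBLOCK = """\
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
"""


def parse_v2000(block):
    """Return (atoms, bonds) from a V2000 block, splitting on whitespace."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    # The counts line is the one that ends with the version tag.
    start = next(i for i, ln in enumerate(lines) if ln.rstrip().endswith("V2000"))
    counts = lines[start].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[start + 1 : start + 1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    first_bond = start + 1 + n_atoms
    for ln in lines[first_bond : first_bond + n_bonds]:
        a1, a2, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append((a1, a2, order, stereo))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLBLOCK)
heavy_atoms = Counter(symbol for symbol, *_ in atoms)
is_flat = all(z == 0.0 for *_, z in atoms)  # all Z = 0 means a 2D layout
wedged = [(a1, a2) for a1, a2, _, stereo in bonds if stereo != 0]
print(dict(heavy_atoms), is_flat, wedged)
```

The point of the sketch is that the coordinates and the wedge flag on a bond travel with the file, which is what lets a drawing package redisplay the structure exactly as it was drawn.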

SMILES (http://en.wikipedia.org/wiki/SMILES)

SMILES is a common format

Can support polymers, organometallics, etc.

Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!

Generally different between drawing packages

Stereo

Tautomers

SMILES

ACD/Labs
CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O

OpenEye
CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C

ChEMBL
CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
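The three strings above differ as text, so a plain string comparison cannot tell whether they describe the same compound. The sketch below is a deliberately crude check written for this example: it only counts heavy atoms per element with a regular expression. Equal counts are necessary but nowhere near sufficient for identity; a real comparison would canonicalize the structures with a cheminformatics toolkit or generate InChIs. The names `ATOM_RE` and `heavy_atom_counts` are made up here, and the pattern covers only common SMILES atom syntax.

```python
import re
from collections import Counter

# The three SMILES strings from the slide, with the line breaks rejoined.
SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

# Bracket atoms such as [C@@H], or bare "organic subset" atoms such as C, c, Cl.
ATOM_RE = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(?P<organic>Cl|Br|[BCNOPSFI]|[bcnops])"
)


def heavy_atom_counts(smiles):
    """Count atoms per element; aromatic lower-case symbols are capitalized."""
    counts = Counter()
    for match in ATOM_RE.finditer(smiles):
        symbol = match.group("bracket") or match.group("organic")
        counts[symbol.capitalize()] += 1
    return counts


counts_by_source = {name: heavy_atom_counts(s) for name, s in SMILES.items()}
for name, counts in counts_by_source.items():
    print(name, dict(counts))
```

All three sources give the same element counts even though no two strings match, which is the variability the slide is pointing at.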

The InChI Identifier

InChI

SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES

InChI Strings can be reversed to structures – same problem as with SMILES – no layout

Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet

The InChI Standard

Tautomers – “Mobile H Perception”

Double Bond Orientation

Stereo
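To make the three items above concrete, the sketch below splits a standard InChI into its layers and pulls out the mobile-hydrogen groups and the stereo layers. It assumes a standard InChI in which each layer prefix occurs once, and it uses the InChI of L-alanine as sample input; the helper names are invented for this example.

```python
import re


def inchi_layers(inchi):
    """Split an InChI string into a dict keyed by layer prefix."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        # Each remaining layer starts with a one-letter prefix (c, h, b, t, ...).
        layers[part[0]] = part[1:]
    return layers


def mobile_h_groups(layers):
    """Return groups such as '(H,5,6)': hydrogens shared between listed atoms."""
    return re.findall(r"\(H\d*-?,[\d,]+\)", layers.get("h", ""))


def stereo_layers(layers):
    """Return double-bond (b) and tetrahedral (t, m, s) layers that are present."""
    return {key: layers[key] for key in "btms" if key in layers}


# Sample input: L-alanine.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

layers = inchi_layers(L_ALANINE)
mobile = mobile_h_groups(layers)
stereo = stereo_layers(layers)
print(layers["formula"], mobile, stereo)
```

Here the group `(H,5,6)` says one hydrogen is shared between the two carboxyl oxygens, and the `t`, `m` and `s` layers carry the tetrahedral stereo; a structure drawn without stereo would have none of them.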

Checking for Stereochemistry

Use your drawing package!

InChIKeys

Search the Web by Structure

InChIs

Databases and Standardization

InChI

No support for polymers, organometallics

Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic

“Slight” chance of collisions of InChIKeys

VERY USEFUL FOR INTEGRATING THE WEB
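One practical consequence of the option settings is that InChIs from two databases should only be compared when they were generated the same way. The sketch below is a minimal check, assuming that a standard InChI carries the version prefix `1S` and that a fixed-H layer shows up as a layer starting with `f`. The second sample string is a hand-written illustration of a non-standard glycine InChI, not the output of a generator, and the function names are invented for this example.

```python
def inchi_flavour(inchi):
    """Report whether an InChI is standard and whether it has a fixed-H layer."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    return {
        "standard": parts[0].endswith("S"),
        # parts[0] is the version and parts[1] the formula; layers follow.
        "fixed_h": any(part.startswith("f") for part in parts[2:]),
    }


def safely_comparable(inchi_a, inchi_b):
    """True only when both strings were generated with the same flavour."""
    return inchi_flavour(inchi_a) == inchi_flavour(inchi_b)


STANDARD = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"
# Hand-written illustration of a non-standard, fixed-H variant.
NON_STANDARD = "InChI=1/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)/f/h4H"

print(inchi_flavour(STANDARD))
print(inchi_flavour(NON_STANDARD))
print(safely_comparable(STANDARD, NON_STANDARD))
```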

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Full Skeleton Search: 104 Hits

Full Molecule Search: 4 Hits
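The difference between the two searches follows from how an InChIKey is built: the first 14 characters hash the connectivity (the skeleton), while the second block hashes the remaining layers such as stereochemistry. The sketch below splits a key into those parts and builds both query strings. It uses the keys of L- and D-alanine, quoted from memory, as sample input, and the function names are invented for this example.

```python
import re

# 14-letter skeleton hash, 8-letter hash of the other layers, a standard
# flag (S or N), a version letter, and a final protonation letter.
KEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not an InChIKey: {key!r}")
    skeleton, other, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,
        "other_layers": other,
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


def search_terms(key):
    """Return (skeleton query, full-molecule query) for a web search."""
    return split_inchikey(key)["skeleton"], key


L_ALANINE_KEY = "QNAYBMKLOCPYGJ-REOHGBXFSA-N"
D_ALANINE_KEY = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"

skeleton_l, full_l = search_terms(L_ALANINE_KEY)
skeleton_d, full_d = search_terms(D_ALANINE_KEY)
print(skeleton_l == skeleton_d, full_l == full_d)
```

Searching on the skeleton block alone therefore returns every stereoisomer and related form that shares the connectivity, which is why the skeleton search finds many more hits than the full key.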

Where is chemistry online?

Encyclopedic articles (Wikipedia)

Chemical vendor databases

Metabolic pathway databases

Property databases

Patents with chemical structures

Drug Discovery data

Scientific publications

Compound aggregators

Blogs/Wikis and Open Notebook Science

www.chemspider.com

How do we build it?

We deal in Molfiles or SDF files – with coordinates

Valence checking, charge imbalance

We have our own “business logic” to standardize

InChI to “aggregate tautomers” to one record

We link out to external sites using their IDs
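The last two steps can be pictured as a grouping operation: deposits that share an identifier collapse into one record, and each record keeps the external IDs it came with so it can link out. The sketch below is a toy version, not ChemSpider's actual pipeline. The source names and external IDs are hypothetical, the grouping key is assumed to be a standard InChIKey, and the merging of tautomers relies on the InChI generator having already treated them as one structure.

```python
from collections import defaultdict

# Hypothetical deposits; only the InChIKeys (water, ethanol) are real.
DEPOSITS = [
    {"source": "VendorA", "external_id": "A-001", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
    {"source": "WikiB", "external_id": "water", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
    {"source": "VendorA", "external_id": "A-002", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]


def aggregate(deposits):
    """Group deposits by InChIKey, keeping (source, external ID) link-outs."""
    records = defaultdict(list)
    for deposit in deposits:
        records[deposit["inchikey"]].append(
            (deposit["source"], deposit["external_id"])
        )
    return dict(records)


records = aggregate(DEPOSITS)
for key, links in records.items():
    print(key, links)
```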

Searches: The INTERNET

All ChemSpider and Internet searches are “simply algorithms” but synonym searching is based on an assertion

Validated Names for Searching…

Validating structures

Check for “full stereo”, and use the stereo descriptors in particular when checking!

Check for quality of associated data sources

Check against reference literature when available – but it can be wrong

Question EVERYTHING!

Contributing to the Quality of Data

What is the Structure of Vitamin K?

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione), derived from plants; VITAMIN K2 (menaquinone), from bacteria; and synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).

What is the Structure of Vitamin K1?

CAS’s Common Chemistry

Wikipedia

Wolfram Alpha

DailyMed

ALL Different, ALL “Domoic Acids”

Thank you

Email: williamsa@rsc.org

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams