1 Chemical Structure Representation and Search Systems
Lecture 1. Oct 28, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
2 Purpose of my 7 lectures
How do you store chemical structures on computer?
What can you do with them there?
How do the computer systems used in chemical informatics work?
Data Structures + Algorithms
3 Lecture topics
Oct 28 Introduction to structure representation; Introduction to Graph theory [video link]
Oct 30 Problems of structure representation [video link]
Nov 4 More graph theory; Structure analysis and processing [video link]
Nov 11 Structure searching I [video link]
Nov 13 Structure searching II [video link]
Nov 18 Chemical similarity [Indianapolis]
Nov 20 Cluster analysis etc. [Bloomington]
4 John Barnard
B.Sc. in Biochemistry (Birmingham, UK)
M.Sc. and Ph.D. in Information Studies (Sheffield, UK)
Has run chemical informatics software development and consultancy business since 1985
• Barnard Chemical Information (BCI) Ltd
• http://www.bci.gb.com
Adjunct Professor of Informatics at Indiana University
5 Lecture 1: Topics to be Covered
Structure representations and computers
• structure diagrams
• nomenclature
• line notations
• connection tables
Introduction to Graph Theory
6 Representing a chemical structure
How much information do you want to include?
• atoms present
• connections between atoms
o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms
C9H11NO3
7 Representing a chemical structure
[Figure: slides 7 to 12 repeat the list from slide 6 beside a structure diagram of tyrosine, each illustrating one further item: connections between atoms; bond types (aromatic ring identification); stereochemical configuration; charges (NH3+ form); isotopes (C14 label); 3D-coordinates for atoms]
13 2D structure diagram
chemists’ “natural language”
used by most computer systems for display
shows topology, optionally stereochemistry
several commonly-used computer programs allow input/editing of structure diagrams
• ISIS/Draw (MDL) http://www.mdl.com/downloads/downloadable/index.jsp
• ChemDraw (CambridgeSoft) http://www.cambridgesoft.com/products/
• GRINS/JavaGRINS (Daylight) http://www.daylight.com/products/javatools.html
• MarvinSketch http://www.chemaxon.com/marvin/
14 2D structure diagram
provides 2D pictorial representation of chemical structure
• display on screen
• cut/paste/embed in Word document etc.
inter-convert with other forms for further processing
• database searching
• structure analysis
• property prediction
• database analysis
15 Chemical Nomenclature
name that can be used to identify a substance
• potentially important for legislation
represents chemical structure as text string
• which can (sometimes) be pronounced
trivial names
• usually short and easy to pronounce
• do not usually give much information about structure
systematic names
• usually long and difficult to pronounce
• usually describe structure in considerable detail
16 Trivial and Systematic Names
Trivial name:
• tyrosine
Systematic names:
• β-(p-hydroxyphenyl)alanine
• α-amino-p-hydroxyhydrocinnamic acid
[Figure: structure diagram of tyrosine]
17 Systematic Names
several systems under continual revision and extension
• IUPAC
• Chemical Abstracts (lecture from Dr Davis on Sep 9)
• some special systems designed by individuals
not usually designed for computer processing
• programs exist both to read (translate) and to generate systematic names from computer formats
o http://www.beilstein.com/products/autonom/anm2000.shtml
o http://www.acdlabs.com/products/name_lab/
have arguably outlived their usefulness
• IUPAC “IChI” (IUPAC Chemical Identifier) project
18 Registry Numbers
unique identifiers for compounds or substances
• catalogue number
most chemical databases have them
• Chemical Abstracts
• Beilstein
• private compound registries in pharmaceutical companies
usually just “idiot numbers”
• no chemical information
may have hierarchical structure
• parent compound, stereoisomer, salt, batch
need to decide what is a separate compound
19 Line Notations
represent structures as compact linear string of alphanumeric symbols
easily handled by computer
• compact storage
• easily transmitted over a network
allow rapid manual coding/decoding by trained users
• much faster for input than using a structure drawing program
20 Line Notations: SMILES
Simplified Molecular Input Line Entry System
developed by Dave Weininger (Daylight)
OC(=O)C(N)CC1=CC=C(O)C=C1
[Figure: structure diagram of tyrosine with the ring closure labelled 1]
21 Simplified SMILES encoding rules
atoms are shown by atomic symbols: B, C, N, O, F, P, S, Cl, Br, I
hydrogen atoms are assumed to fill spare valencies
adjacent atoms are connected by single bonds
• double bonds are shown “=”, triple bonds are “#”
branching is indicated by parentheses
ring closures are shown by pairs of matching digits
Full rules: http://www.daylight.com/smiles/smiles-intro.html
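The simplified rules above can be sketched as a small parser. This is only an illustration under stated assumptions, not Daylight's implementation: the names `parse_smiles`, `hydrogen_counts` and `molecular_formula` are hypothetical, only the lowest normal valence of each atom is used, and charges, isotopes, aromatic lower-case symbols, brackets and stereochemistry are not handled.

```python
from collections import Counter

# Assumed default valences for the atoms listed on the slide (lowest normal
# valence only, so higher-valent sulfur or phosphorus is not supported).
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "P": 3, "S": 2,
           "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_smiles(smiles):
    """Return (atoms, bonds): element symbols and (i, j, order) tuples (0-based)."""
    atoms, bonds = [], []
    prev = None          # atom that the next atom will be bonded to
    order = 1            # order of the next bond (single unless stated)
    branch_stack = []    # atoms to return to when ")" closes a branch
    open_rings = {}      # ring-closure digit -> (atom index, bond order)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:                      # matching digit: close ring
                start, start_order = open_rings.pop(ch)
                bonds.append((start, prev, max(order, start_order)))
            else:                                     # first digit: open ring
                open_rings[ch] = (prev, order)
            order = 1
        else:
            symbol = smiles[i:i + 2] if smiles[i:i + 2] in VALENCE else ch
            if symbol not in VALENCE:
                raise ValueError(f"unsupported symbol {ch!r}")
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev, order = len(atoms) - 1, 1
            i += len(symbol) - 1                      # skip 2nd letter of Cl/Br
        i += 1
    return atoms, bonds


def hydrogen_counts(atoms, bonds):
    """Hydrogens are assumed to fill whatever valence the bonds leave spare."""
    used = [0] * len(atoms)
    for a, b, bond_order in bonds:
        used[a] += bond_order
        used[b] += bond_order
    return [VALENCE[sym] - u for sym, u in zip(atoms, used)]


def molecular_formula(smiles):
    atoms, bonds = parse_smiles(smiles)
    counts = Counter(atoms)
    counts["H"] = sum(hydrogen_counts(atoms, bonds))
    # Carbon first, then hydrogen, then the remaining elements alphabetically.
    keys = [k for k in ("C", "H") if counts[k]]
    keys += sorted(k for k in counts if k not in ("C", "H"))
    return "".join(k + (str(counts[k]) if counts[k] > 1 else "") for k in keys)


print(molecular_formula("OC(=O)C(N)CC1=CC=C(O)C=C1"))   # C9H11NO3
```

Applied to the tyrosine SMILES from slide 20, the sketch finds 13 heavy atoms and 13 bonds, with the ring closure digit contributing the thirteenth bond.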
22 Other line notations
ROSDAL (Beilstein): Representation Of Structure Diagram Arranged Linearly
1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
Sybyl Line Notation (Tripos)
OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
Wiswesser Line Notation (WLN) (obsolete)
QVYZ1R DQ
[Figure: structure diagram of tyrosine with atoms numbered 1 to 13]
23 Connection Tables (CTs)
main form of structure representation in computer systems
• list atoms and bonds (and other data) as a table
many different formats
• “internal” CTs (in memory)
o algorithmic processing
• “external” CTs (disk files)
o archival storage
o data exchange between programs
24 “Redundant” Connection Table
No.  Atom  H count  Neighbour / bond-order pairs
1.   O     1        2 1
2.   C     0        1 1   3 2   4 1
3.   O     0        2 2
4.   C     1        2 1   5 1   6 1
5.   N     2        4 1
6.   C     2        4 1   7 1
7.   C     0        6 1   8 2   12 1
8.   C     1        7 2   9 1
9.   C     1        8 1   10 2
10.  C     0        9 2   11 1   13 1
11.  C     1        10 1   12 2
12.  C     1        11 2   7 1
13.  O     1        10 1
[Figure: structure diagram of tyrosine with atoms numbered 1 to 13]
25 Internal Connection Table
usually “redundant”
• every bond shown twice, once for each atom
implemented as array of records
record for each atom might store
• atomic type
• hydrogen count
• formal charge
• 2D display co-ordinates
• bonds to neighbouring atoms
• etc.
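A minimal sketch of such an array of records, using the tyrosine table from slide 24. The class name `Atom`, its field names and the helper functions are hypothetical choices for illustration; real systems store more fields and use more compact layouts.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str                      # atomic type
    h_count: int                      # implicit hydrogen count
    charge: int = 0                   # formal charge
    xy: tuple = (0.0, 0.0)            # 2D display co-ordinates (unused here)
    bonds: dict = field(default_factory=dict)   # neighbour number -> bond order


# Rows of the "redundant" table: number -> (element, H count, {neighbour: order})
ROWS = {
    1: ("O", 1, {2: 1}),
    2: ("C", 0, {1: 1, 3: 2, 4: 1}),
    3: ("O", 0, {2: 2}),
    4: ("C", 1, {2: 1, 5: 1, 6: 1}),
    5: ("N", 2, {4: 1}),
    6: ("C", 2, {4: 1, 7: 1}),
    7: ("C", 0, {6: 1, 8: 2, 12: 1}),
    8: ("C", 1, {7: 2, 9: 1}),
    9: ("C", 1, {8: 1, 10: 2}),
    10: ("C", 0, {9: 2, 11: 1, 13: 1}),
    11: ("C", 1, {10: 1, 12: 2}),
    12: ("C", 1, {11: 2, 7: 1}),
    13: ("O", 1, {10: 1}),
}
table = {n: Atom(el, h, bonds=dict(nb)) for n, (el, h, nb) in ROWS.items()}


def is_consistent(ct):
    """True if every bond is recorded on both of its atoms with the same order."""
    return all(ct[b].bonds.get(a) == order
               for a, atom in ct.items()
               for b, order in atom.bonds.items())


def unique_bonds(ct):
    """Collapse the redundant table to one (low, high, order) entry per bond."""
    return sorted({(min(a, b), max(a, b), order)
                   for a, atom in ct.items()
                   for b, order in atom.bonds.items()})


print(is_consistent(table), len(unique_bonds(table)))   # True 13
```

The redundancy costs storage (26 bond entries for 13 bonds) but lets an algorithm find all neighbours of an atom directly from that atom's record.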
26 MDL Connection Table
proprietary file format developed by MDL
• http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
de facto standard for exchange of datasets
several different flavours and versions
• Molfile (single molecule)
• SDfile (set of molecules and data)
• RGfile (Markush structure)
• Rxnfile (single reaction)
• RDfile (set of reactions with data)
separates atoms and bonds into separate blocks
27 New MDL File Formats
Since this lecture was delivered on Oct 28, 2003, MDL have published details of a new file format called “XDfile”
• XML-based data format for transferring structure/reaction information with associated data
• built around existing MDL connection table formats
• can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
• can incorporate SMILES strings
Details available in MDL documentation at:
• http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
28 MDL Connection Table
Header Block
• data on molecule name and file origin
• counts of atoms and bonds etc.
Tyrosine
-ISIS- 08220120432D
13 13 0 0 0 0 0 0 0 0999 V2000
29 MDL Connection Table
Atoms block
• one line per atom
• specifies X,Y,Z-coords, atom symbol, isotope, charge, stereo code etc.
 0.2459 -1.4736 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5815 -1.4724 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9944 -2.1872 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5810 -2.9037 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 0.2495 -2.9008 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 0.6586 -2.1854 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 1.4836 -2.1830 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.9042 -2.1792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1027 -2.1870 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
-3.1359 -1.1516 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.9070 -2.1847 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.4070 -2.6845 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-4.4989 -1.5618 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30 MDL Connection Table
Bonds Block
• one line per bond (each bond shown once)
• specifies row numbers for atoms, and codes for bond type, bond stereochemistry etc.
1 2 2 0 0 0 0
6 7 1 0 0 0 0
3 4 2 0 0 0 0
3 8 1 0 0 0 0
4 5 1 0 0 0 0
9 10 1 0 0 0 0
2 3 1 0 0 0 0
9 11 1 0 0 0 0
5 6 2 0 0 0 0
11 12 1 0 0 0 0
6 1 1 0 0 0 0
11 13 2 0 0 0 0
8 9 1 0 0 0 0
M END
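The three blocks can be read back with a few lines of code. The sketch below is an assumption-laden illustration, not MDL's reader: `parse_molfile` is a hypothetical name, the column spacing in the embedded text is reconstructed rather than copied from an original file, the counts are taken from the first two three-character fields of the counts line, and atom and bond lines are split on whitespace (which would fail once atom numbers reach three digits and fields run together).

```python
from collections import Counter

MOLFILE = """Tyrosine
  -ISIS-  08220120432D

 13 13  0  0  0  0  0  0  0  0999 V2000
    0.2459   -1.4736    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5815   -1.4724    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9944   -2.1872    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5810   -2.9037    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2495   -2.9008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6586   -2.1854    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4836   -2.1830    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9042   -2.1792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1027   -2.1870    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -3.1359   -1.1516    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9070   -2.1847    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4070   -2.6845    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4989   -1.5618    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  6  7  1  0  0  0  0
  3  4  2  0  0  0  0
  3  8  1  0  0  0  0
  4  5  1  0  0  0  0
  9 10  1  0  0  0  0
  2  3  1  0  0  0  0
  9 11  1  0  0  0  0
  5  6  2  0  0  0  0
 11 12  1  0  0  0  0
  6  1  1  0  0  0  0
 11 13  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a V2000-style Molfile string."""
    lines = text.splitlines()
    name = lines[0].strip()                       # header line 1: molecule name
    counts = lines[3]                             # header line 4: counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:             # atoms block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bonds block
        a, b, order = (int(tok) for tok in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))               # Tyrosine 13 13
print(Counter(symbol for symbol, *_ in atoms))
```

Unlike the redundant internal table, each bond appears once here, so a reader that needs neighbour lists has to build them from the bonds block.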
31 Standard Connection Table Formats
different vendors have proprietary CT formats
many attempts to establish agreed “standard” formats
• no real general success
• different user communities have failed to coordinate efforts
• some standards exist in restricted areas
SMILES and MDL CT formats widely used
most popular programs read/write several different formats
32 Standard Connection Table Formats
Standard Molecular Data (SMD) format
• never gained wide acceptance
Protein Data Bank (PDB) format
Crystallographic Information File (CIF/mmCIF)
Molecular Information File (MIF)
• developed from SMD and compatible with CIF
Chemical Exchange Format (CXF)
• Chemical Abstracts Service
33 Standard Connection Table Formats
Chemical Markup Language (CML)
• uses principles of the eXtensible Markup Language (XML) protocol for data exchange using the Internet
• http://www.xml-cml.org
Chemical EXchange (CEX)
• exchange protocol for TCP/IP networks developed collaboratively by several organizations
• http://www.cgl.ucsf.edu/cex
Chemical MIME
• incorporates several popular formats into protocols for exchange of molecular structures as e-mail attachments
• http://www.ch.ic.ac.uk/chemime/
34 IUPAC Chemical Identifier (IChI)
Project being undertaken by International Union of Pure and Applied Chemistry
Intended to provide unique identifier for compounds, but with “chemical intelligence”
• based on connection table
• “canonicalised” (see lecture 3 on November 4)
• compacted to short alphanumerical string
http://www.iupac.org/projects/2000/2000-025-1-800.html
see also Dr Nicklaus’s lecture on Oct 16
35 Topological Graph Theory
branch of mathematics
• particularly useful in chemical informatics and in computer science generally
study of “graphs” which consist of
• a set of “nodes”
• a set of “edges” joining pairs of nodes
36 Properties of graphs
graphs are only about connectivity
• spatial position of nodes is irrelevant
• length of edges is irrelevant
• crossing edges are irrelevant
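A small sketch of this point: two drawings with different node positions, edge lengths and crossings give the same graph once only the node set and edge set are kept. The function name `make_graph` and the two example drawings are hypothetical; the comparison assumes both drawings use the same node labels, so it is not a general isomorphism test.

```python
def make_graph(edges):
    """A graph as (set of nodes, set of edges); each edge is an unordered pair."""
    edge_set = {frozenset(edge) for edge in edges}
    nodes = set().union(*edge_set) if edge_set else set()
    return nodes, edge_set


# Drawing 1: a unit square, edges listed going round the ring.
drawing_1 = {
    "coords": {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)},
    "edges": [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
}
# Drawing 2: same connections, but nodes moved so that edges a-b and c-d
# are long diagonals that cross each other.
drawing_2 = {
    "coords": {"a": (0, 0), "b": (5, 5), "c": (5, 0), "d": (0, 5)},
    "edges": [("d", "a"), ("c", "b"), ("b", "a"), ("d", "c")],
}

# Coordinates are never consulted: only connectivity defines the graph.
same_graph = make_graph(drawing_1["edges"]) == make_graph(drawing_2["edges"])
print(same_graph)   # True
```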
37 Properties of Graphs
nodes and edges can be “coloured” to distinguish them
[Figure: structure diagram of tyrosine]
38 Structure Diagrams as Graphs
2D structure diagrams very like topological graphs
• atoms correspond to nodes
• bonds correspond to edges
terminal hydrogen atoms are not normally shown as separate nodes (“implicit” hydrogens)
• reduces number of nodes by ~50%
• “hydrogen count” information used to colour neighbouring “heavy atom”
• separate nodes sometimes used for “special” hydrogens
o deuterium, tritium
o hydrogen bonded to more than one other atom
o hydrogens attached to stereocentres
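As a worked check of the ~50% figure, the hydrogen counts from the tyrosine connection table on slide 24 can be used to compare node counts with and without explicit hydrogens. The variable names are hypothetical; the data are the element and H-count columns of that table.

```python
# (element, hydrogen count) for the 13 heavy atoms of tyrosine, in the
# numbering of the "redundant" connection table on slide 24.
HEAVY_ATOMS = [("O", 1), ("C", 0), ("O", 0), ("C", 1), ("N", 2), ("C", 2),
               ("C", 0), ("C", 1), ("C", 1), ("C", 0), ("C", 1), ("C", 1),
               ("O", 1)]

implicit_nodes = len(HEAVY_ATOMS)                          # hydrogens as colours
explicit_nodes = implicit_nodes + sum(h for _, h in HEAVY_ATOMS)
reduction = 1 - implicit_nodes / explicit_nodes

print(f"{explicit_nodes} nodes with explicit hydrogens, "
      f"{implicit_nodes} without ({reduction:.0%} fewer)")
# 24 nodes with explicit hydrogens, 13 without (46% fewer)
```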
39 Advantages of using graphs
mathematical theory is well understood
graphs can be easily represented in computers
• many useful algorithms are known
identical graphs mean identical molecules
different graphs mean different molecules
40 Disadvantages of using graphs
analogy between chemical structures and graphs is not perfect
• identical graphs do not always mean identical molecules
• different graphs do not always mean different molecules
realities of chemical structures cause problems
• aromaticity
• stereochemistry
• tautomerism
• coordination compounds
• multi-centre bonds
• inorganic compounds
• macromolecules
• polymers
• incompletely-defined substances
many graph algorithms are inherently slow
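Aromaticity gives a simple case of different graphs for the same molecule. The sketch below compares the two Kekulé drawings of a benzene ring that carries substituents on atoms 1 and 2 (as in 1,2-dimethylbenzene, a hypothetical example that is not taken from the lecture). The function name `ring_bonds` and the atom numbering are assumptions made for illustration.

```python
def ring_bonds(double_bonds):
    """Bond orders for a six-membered ring numbered 1..6, keyed by atom pair."""
    pairs = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]
    return {pair: (2 if pair in double_bonds else 1) for pair in pairs}


# The two alternating single/double arrangements of the same ring.
kekule_a = ring_bonds({(1, 2), (3, 4), (5, 6)})
kekule_b = ring_bonds({(2, 3), (4, 5), (1, 6)})

same_edges = kekule_a.keys() == kekule_b.keys()    # same connectivity
same_colours = kekule_a == kekule_b                # same bond-order colouring

# The bond joining the two substituted atoms is double in one drawing and
# single in the other, so renumbering the ring cannot make the tables match.
order_between_substituted = (kekule_a[(1, 2)], kekule_b[(1, 2)])

print(same_edges, same_colours, order_between_substituted)   # True False (2, 1)
```

Both edge-coloured graphs describe one compound, which is why systems need some aromatic ring identification rather than relying on bond orders as drawn.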
41 Lecture 1: Conclusions
There are lots of ways of storing a chemical structure in a computer
• including different amounts of information
Most important ones are
• line notations (e.g. SMILES)
• connection tables (e.g. MDL Molfile)
• nomenclature
Structure diagrams used for input/output
Chemical structures can be regarded as topological graphs
42 Lecture 2: Topics to be Covered
Special problems of structure representation
• aromaticity and tautomerism
• multi-centre bonds
• stereochemistry and coordination compounds
• inorganic compounds
• macromolecules and polymers
• incompletely-defined substances
• Markush structures
43 Further reading
• A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003
• J. Gasteiger and T. Engel (eds.), Chemoinformatics: A Textbook, Weinheim: Wiley-VCH, 2003
• J. Gasteiger (ed.), Handbook of Chemoinformatics: From Data to Knowledge, Weinheim: Wiley-VCH, 2003
o Vol 1, Chapter II (Representation of chemical compounds)