Useful Information
• The web address for these lectures is http://www-jmg.ch.cam.ac.uk/cil/partii (on the front of the handout).
• Assessment is by two online exercises (Glen and Goodman) at this address. Each will be marked out of ten. Your (paper) answers should be submitted to Mykola.
• Glen exercises are due on Feb 10th 2016.
• Lectures and handout are available on Moodle.
Molecular Informatics
1. Molecules and computers
Sources (textbooks/online) you may wish to consider if you want to take the subject further:
Andrew R. Leach and Valerie J. Gillet, An Introduction to Chemoinformatics, Springer, 2007.
Johann Gasteiger and Thomas Engel (eds.), Chemoinformatics: A Textbook, Wiley-VCH, 2003.
Johann Gasteiger (ed.), Handbook of Chemoinformatics, Wiley-VCH, 2003.
Alexandre Varnek and Alex Tropsha (eds.), Chemoinformatics Approaches to Virtual Screening, RSC Publishing.
Barry A. Bunin, Chemoinformatics: Theory, Practice and Products, Dordrecht: Springer, 2007.
Johannes Kirchmair (ed.), Drug Metabolism Prediction, Methods and Principles in Medicinal Chemistry (series eds. R. Mannhold, H. Kubinyi, G. Folkers), Vol. 63, Wiley-VCH.
Journals of Molecular/Cheminformatics you may wish to follow up on:
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics & Modelling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIREs Computational Molecular Science
Molecular Informatics
Includes all aspects of the study of molecules on computers. Also includes chem(o)informatics.
This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.
Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications. The area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.
Cambridge HPC
Places to find Molecular Informatics apps
• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)
J. Chem. Educ. 2013, 90 (3), pp 320–325, DOI: 10.1021/ed300329e
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:
1. Trivial name, e.g. morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert them into "machine-readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in computational representation.
Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.
[Figure: a real-life mixture, not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol]
What is a molecule?
• Is it a series of connected points?
• A wave function?
• The sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional: line notations, a string representation of molecules. Here are three examples of different line notations; SMILES is the most useful and widely used. These are 'strings', all of the same molecule.
• Line notations
– WLN: L66J BMR& DSWQ IN1&1
– ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard typewriter keyboard.
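To make the idea of a machine-readable string concrete, here is a minimal Python sketch that splits a SMILES string into tokens and counts the atoms it mentions. The names `tokenize` and `element_counts` are illustrative, the regular expression covers only a common subset of SMILES, and implicit hydrogens are not counted. A real cheminformatics toolkit does far more than this.

```python
import re
from collections import Counter

# Assumed subset of SMILES: bracket atoms, Cl/Br, organic-subset atoms,
# aromatic atoms, ring-closure digits, branches and bond symbols.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()=#/\\.+:-]")

def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is not recognised."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES: " + smiles)
    return tokens

def element_counts(smiles):
    """Count explicitly written atoms per element (implicit hydrogens are ignored)."""
    counts = Counter()
    for token in tokenize(smiles):
        if token.startswith("["):
            # Bracket atom: optional isotope number, then the element symbol.
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})", token).group(1)
            counts[symbol.capitalize()] += 1
        elif token[0].isalpha():
            # Aromatic atoms are written in lower case; fold them onto the element.
            counts[token.capitalize()] += 1
    return counts

DYE = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(dict(element_counts(DYE)))
```

For the SMILES string above this counts 18 carbons, 2 nitrogens, 3 oxygens and 1 sulphur, which is what the IUPAC name implies. The point is only that a program can read the string unambiguously, character by character.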
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• It aims to address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
RSC ChemSpider
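Although an InChI is hard for a human to read, its layout is regular: layers are separated by "/" and most layers start with a one-letter prefix. The short Python sketch below splits the ethanol InChI used later in this lecture into its layers. The function name `inchi_layers` is an assumption made for illustration, and the sketch ignores complications such as repeated prefixes in isotopic or stereo layers.

```python
def inchi_layers(inchi):
    """Split an InChI string into a dict of layers (simplified sketch)."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]      # the formula layer has no prefix letter
    for part in parts[2:]:
        layers[part[0]] = part[1:]        # e.g. 'c' = connectivity, 'h' = hydrogens
    return layers

ethanol = inchi_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
# {'version': '1', 'formula': 'C2H6O', 'c': '1-2-3', 'h': '3H,2H2,1H3'}
```

Because the whole string is canonical, two databases can be matched on InChI with a plain string comparison, without any chemistry-aware code.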
Storing chemical diagrams on computers
• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
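The aromaticity problem can be shown with a few lines of Python. This is a minimal sketch, assuming a fixed atom numbering 0 to 5 around the benzene ring and a dictionary from atom pairs to bond labels; the helper names are illustrative, and no general graph-matching or aromaticity perception is attempted.

```python
def kekule_benzene(first_double_bond):
    """Labelled graph for one Kekule form: {frozenset({i, j}): bond order}."""
    bonds = {}
    for i in range(6):
        edge = frozenset((i, (i + 1) % 6))
        # Alternate double and single bonds around the ring.
        bonds[edge] = 2 if i % 2 == first_double_bond else 1
    return bonds

def with_aromatic_bonds(bonds):
    """Relabel every bond as aromatic (valid here only because all six
    bonds belong to the benzene ring)."""
    return {edge: "aromatic" for edge in bonds}

atoms = ["C"] * 6                      # node labels
form_a = kekule_benzene(0)             # double bonds at 0-1, 2-3, 4-5
form_b = kekule_benzene(1)             # double bonds at 1-2, 3-4, 5-0

same_as_simple_graphs = form_a == form_b
same_with_aromatic_bonds = with_aromatic_bonds(form_a) == with_aromatic_bonds(form_b)
print(same_as_simple_graphs, same_with_aromatic_bonds)   # False True
```

With plain single and double bonds the two drawings compare as different graphs; once the ring bonds carry an "aromatic" label they compare as equal.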
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about each molecule in the following format:
Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format for storing small molecules; SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
The Connection Table: describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file: benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
Molecule name: information on this molecule.
Comment: e.g. the date is used here.
"Counts line": has the number of atoms and bonds as a minimum.
"Atom block": has the xyz coordinates of the atom and the element as a minimum.
"Bond block": one line for each bond (from atom, to atom, bond type).
"M  END" and "$$$$": these identify the end of this molecule.
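The benzene record above can be read back with a few lines of Python. This is a sketch under simplifying assumptions: it splits lines on whitespace, whereas the real format is defined by fixed columns, and it reads only the counts line, atom block and bond block. The name `parse_record` is illustrative.

```python
import math

BENZENE_SD = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_record(text):
    """Return (name, atoms, bonds) from one V2000 record (whitespace-split sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header block, line 1
    counts = lines[3].split()                    # counts line follows 3 header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_record(BENZENE_SD)
c1_c2 = math.dist(atoms[0][1:], atoms[1][1:])    # distance between atoms 1 and 2
print(name, len(atoms), len(bonds), round(c1_c2, 2))
```

Note that the stored drawing is one Kekulé form (three single and three double bonds) and that the coordinates are 2D drawing coordinates, with z set to zero.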
Alanine SD file
Bond length is 1.53 Å.
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, …
Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral.
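The charge field is a code, not the charge itself, so a reader has to translate it. A minimal Python lookup, with illustrative names, might look like this:

```python
# Charge codes in the atom block, as listed above.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 4: 0, 5: -1, 6: -2, 7: -3}

def formal_charge(code):
    """Translate an atom-block charge code into a formal charge."""
    return CHARGE_CODES[code]

def is_doublet_radical(code):
    """Code 4 marks a doublet radical rather than a charge."""
    return code == 4

print(formal_charge(3), formal_charge(5), is_doublet_radical(4))   # 1 -1 True
```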
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
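One common way to impute bonds from positions is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The Python sketch below applies this to the sulfate coordinates from the PDB example in this lecture. The radii are approximate literature values and the 0.45 Å tolerance is an assumption chosen for illustration; production software uses more careful rules.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for this sketch).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45   # assumed slack in angstroms

def impute_bonds(atoms):
    """Return index pairs (i, j) of atoms judged to be bonded by distance."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
print(impute_bonds(sulfate))   # [(0, 1), (0, 2), (0, 3)]
```

The sulphur is bonded to each oxygen (about 1.45 to 1.48 Å apart), while the oxygens, about 2.4 Å from each other, are not bonded to one another. The test says nothing about bond order, which is one reason imputed bonds can be wrong.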
[Figure: Which way round should this ring go?]
Example Protein Data Bank file (pdb)
(bonds are stored as CONECT records, i.e. an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
CRYST1: space group and unit cell dimensions.
ATOM: protein atoms, including xyz, occupancy and temperature factor.
HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
CONECT: bonds.
New format: mmCIF.
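PDB records are a good illustration of a fixed-column format: each field lives at a set character position, not merely "after the previous field". The Python sketch below writes and reads an ATOM record using the standard column ranges (serial in columns 7 to 11, coordinates in 31 to 54, and so on). The function names are illustrative, and fields such as alternate location, insertion code and element are left blank.

```python
def format_atom_record(record, serial, name, res_name, chain, res_seq,
                       x, y, z, occupancy, b_factor):
    """Build an ATOM/HETATM line with each field in its fixed columns."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}")

def parse_atom_record(line):
    """Read the fields back by column position (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

line = format_atom_record("ATOM", 1, " N", "VAL", "A", 1,
                          3.036, -1.035, -3.538, 1.00, 31.28)
atom = parse_atom_record(line)
print(line)
print(atom["name"], atom["res_name"], atom["x"], atom["b_factor"])
```

Because the reader slices by position, shifting the line by even one character breaks the parse, which is exactly the fragility of fixed formats described earlier.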
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
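Programs ultimately turn a Z-matrix back into Cartesian coordinates. The Python sketch below shows one way to do this, assuming the first atom sits at the origin and the second on the z axis, and using the usual construction that places each new atom from a distance, an angle and a dihedral. It takes numeric values rather than variable names, and the function names are illustrative and not taken from any particular package.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def _unit(a):
    n = math.sqrt(sum(x * x for x in a))
    return (a[0] / n, a[1] / n, a[2] / n)

def place(a, b, c, r, angle_deg, dihedral_deg):
    """xyz of a new atom d with |c-d| = r, angle b-c-d and dihedral a-b-c-d."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))      # normal to the a-b-c plane
    m = _cross(n, bc)
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0]*bc[i] + local[1]*m[i] + local[2]*n[i] for i in range(3))

def zmatrix_to_xyz(rows):
    """rows: (sym,), (sym, NA, r), (sym, NA, r, NB, angle),
    (sym, NA, r, NB, angle, NC, dihedral), with 1-based atom numbers."""
    xyz = []
    for row in rows:
        if len(row) == 1:
            xyz.append((0.0, 0.0, 0.0))            # first atom at the origin
        elif len(row) == 3:
            xyz.append((0.0, 0.0, row[2]))         # second atom on the z axis
        else:
            c, b = xyz[row[1] - 1], xyz[row[3] - 1]
            if len(row) == 5:                      # third atom: no dihedral given,
                a, dihedral = (b[0] + 1.0, b[1], b[2]), 0.0   # so fix it in the xz plane
            else:
                a, dihedral = xyz[row[5] - 1], row[6]
            xyz.append(place(a, b, c, row[2], row[4], dihedral))
    return xyz

water = zmatrix_to_xyz([("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)])
h_h = math.dist(water[1], water[2])
print(round(h_h, 3))   # H...H distance implied by l1 and a1
```

Changing only `a1` or `l1` moves the atoms consistently, which is why internal coordinates are convenient for driving a reaction coordinate.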
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
Z-matrix: 'improper' torsion angles
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.
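As a rough illustration of what "structure plus metadata" looks like, the Python sketch below parses a small CML-style fragment for ethanol with the standard library. The fragment is simplified and hand-written for this example (no namespaces or dictionary references), and the property element with its `units` attribute is an assumption about how such metadata might be attached, not an excerpt from the CML schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written CML-style fragment (illustrative only).
CML_ETHANOL = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="celsius">78.4</scalar>
  </property>
</molecule>"""

root = ET.fromstring(CML_ETHANOL)

# Atoms: id -> element symbol
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: (from id, to id, order)
bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
         for bond in root.iter("bond")]

# A property carries its own metadata (here, the units) alongside the value.
scalar = root.find("property/scalar")
value, units = float(scalar.text), scalar.get("units")

print(atoms, bonds, value, units)
```

The same generic XML parser reads the structure and the property, and new kinds of data can be added as further elements without breaking older readers.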
Ethanol
<CML>
– Can be parsed
– Can contain reactions, properties, etc.
– Can contain relationships to other molecules and also concepts
InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES: C(C)O
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
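A systematic torsion scan is easy to write down, and doing so shows why it becomes expensive. The Python sketch below enumerates every combination of torsion angles on a regular grid; the function names are illustrative, and it assumes the step divides 360 exactly. No energies are computed, only the grid itself.

```python
from itertools import product

def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg):
    """Number of grid points, assuming step_deg divides 360 exactly."""
    return (360 // step_deg) ** n_torsions

two_torsions = list(torsion_grid(2, 30))   # like a Ramachandran-style 2D scan
print(len(two_torsions), two_torsions[:3])
print(grid_size(10, 30))                   # ten rotatable bonds at the same step
```

Two torsions at 30° steps give 144 conformations to evaluate, while ten torsions give 12^10, roughly 6 x 10^10, which is why practical methods sample or prune rather than enumerate.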
Molecules in 3D: uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
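The film-strip picture can be made literal: since each SD record ends with a `$$$$` line, a concatenated file can be cut back into frames one record at a time. The Python sketch below does this for a toy two-frame string. The function name and the frame contents are placeholders for illustration; molecular dynamics packages generally use their own, more compact trajectory formats.

```python
def iter_sd_records(text):
    """Yield each record of a concatenated SD file, without its $$$$ terminator."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(record)
            record = []
        else:
            record.append(line)

# Toy "trajectory": two placeholder frames, each closed by $$$$.
TRAJECTORY = "frame 1\n(connection table)\nM  END\n$$$$\nframe 2\n(connection table)\nM  END\n$$$$\n"

frames = list(iter_sd_records(TRAJECTORY))
print(len(frames), [frame.splitlines()[0] for frame in frames])
```

Reading lazily, one frame at a time, matters in practice because trajectories can contain many thousands of frames.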
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats, which could be a real pain, but they can be converted:
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats: the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
Molecular Informatics
1 molecules and computers
An Introduction to Chemoinformatics Andrew R
Leach Valerie J Gillet Springer 2007
Chemoinformatics - A Textbook Johann Gasteiger and Thomas Engel
Wiley-VCH 2003
Handbook of Chemoinformatics Johann Gasteiger
Wiley-VCH 2003
Chemoinformatics An Approach to Virtual Screening
By Alexandre Varnek Alex Tropsha RSC Publishing
Bunin Barry A Chemoinformatics Theory Practice and Products
Dordrecht Springer 2007
Chemoinformatics An Approach to Virtual Screening By Alexandre
Varnek Alex Tropsha RSC Publishing
Drug Metabolism Prediction Ed R Mannhold H Kubinye G Folkers
Ed Johannes Kirchmair Methods and Principles in Medicinal
Chemistry Vol 63 Pub Wiley-VCH
Sources- textbooksonline you may wish to consider if you want
to take the subject further
Journals of MolecularCheminformatics you may wish
to follow up on
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics amp Modeling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIRES computational Molecular Science
Molecular
Informatics
Includes all aspects of the study of molecules on computers
Also includes Chem(o)informatics
This includes the representation of molecules databases display
simulation prediction of their properties and the discovery and
design of new molecules and materials
Molecular informatics is closely related to bioinformatics
computational chemistry molecular modelling simulation machine
learning and statistics as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery hence the concentration on small organic molecules
Cambridge HPC
Places to find Molecular
Informatics apps
bull httpwwwmacinchemorgmobilescience
bull Molecules ndash eg RSC-Chemspider)
bull Publishers (eg ACSRSC mobile)
bull Calculations (eg Yield101 for Rxns)
bull Visualisation (eg Pymol for proteins)
J Chem Educ 2013 90 (3) pp 320ndash325
DOI 101021ed300329e
Cheminformatics 101
How do we store molecules on the computer
There are estimated to be 1060 possible small molecules that
could be made How do we find the best molecule for the
problem we are addressing Letrsquos take a look ldquounder the
bonnetrdquo of the way molecules are actually manipulated on the
computer You will be familiar with
1 Trivial name eg Morphine
2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-
methylmorphinan-36-diol
However these names do not convey the structure of molecules
in a way the computer can readily understand We need to
convert these into ldquomachine readable formatsrdquo which allows
ease of searching based on the complexities of molecular
structure But what is a molecule
Bear this in mind Molecules are complicated When we look at this scene we add a
huge amount of information from our senses and knowledge ndash but it nearly all gets
lost in computational representation
Representing chemistry needs to be engineered to represent materials and processes
As you will see we are moving in that direction with more complete representations
of molecules and materials
Not (5α6α)-78-
didehydro-45-
epoxy-17-
methylmorphinan-
36-diol
A real life mixture
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by as a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simpe graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms – including xyz, occupancy and temperature factor
HETATM: non-protein atoms – including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
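Unlike the whitespace-separated listing above suggests, PDB records are fixed-column: each field lives at a set character position. The sketch below writes one ATOM record and reads it back by slicing. The helper names `format_atom` and `parse_atom` are hypothetical, only the fields discussed here are handled, and left-justifying the atom name in its four columns is a simplification of how names are aligned in real files.

```python
def format_atom(record, serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Build a fixed-column ATOM/HETATM line (subset of fields only)."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


def parse_atom(line):
    """Slice the fields back out by column position (0-based Python slices)."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66 (temperature factor)
    }


line = format_atom("ATOM", 2, "CA", "VAL", "A", 1, 4.283, -0.343, -3.015, 1.00, 28.41)
print(line)
print(parse_atom(line)["xyz"])
```

Slicing by column rather than splitting on spaces matters because neighbouring fields can touch, for example a large negative coordinate directly after another number.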
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules – sometimes used graph paper: draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Note the ‘improper’ torsion angles (±da1) in this Z-matrix.
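Programs turn a Z-matrix into Cartesian coordinates before doing anything else with it. The sketch below shows one common way to do that placement, under stated assumptions: variables are already replaced by numbers, the first atom sits at the origin, the second lies on the z axis, and the sign convention of the dihedral has not been checked against Gaussian's. The row layout and the names `place` and `zmatrix_to_cartesian` are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)


def place(a, b, c, r, angle_deg, dihedral_deg):
    """Position of a new atom at distance r from a, making the given angle
    new-a-b and the given dihedral new-a-b-c (b and c already placed)."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    ab = _unit(_sub(a, b))               # axis pointing from b to a
    n = _unit(_cross(_sub(b, c), ab))    # normal to the plane through a, b, c
    m = _cross(n, ab)                    # completes the local frame
    d = (-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(a[i] + d[0] * ab[i] + d[1] * m[i] + d[2] * n[i] for i in range(3))


def zmatrix_to_cartesian(rows):
    """rows: (symbol,), (symbol, NA, r), (symbol, NA, r, NB, angle),
    then (symbol, NA, r, NB, angle, NC, dihedral); atom numbers are 1-based."""
    coords = []
    for row in rows:
        if len(coords) == 0:
            coords.append((0.0, 0.0, 0.0))
        elif len(coords) == 1:
            _, na, r = row
            coords.append((0.0, 0.0, r))
        elif len(coords) == 2:
            _, na, r, nb, angle = row
            a, b = coords[na - 1], coords[nb - 1]
            dummy = (b[0] + 1.0, b[1], b[2])   # arbitrary off-axis reference
            coords.append(place(a, b, dummy, r, angle, 0.0))
        else:
            _, na, r, nb, angle, nc, dihedral = row
            coords.append(place(coords[na - 1], coords[nb - 1], coords[nc - 1],
                                r, angle, dihedral))
    return coords


# Water with l1 = 0.96 and a1 = 104.0 substituted.
WATER = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
for symbol, xyz in zip("OHH", zmatrix_to_cartesian(WATER)):
    print(symbol, [round(v, 4) for v in xyz])
```

Changing a single number in `WATER`, such as the angle, moves the atoms consistently, which is why internal coordinates suit driving a reaction coordinate.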
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
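To make the idea of structure plus metadata concrete, here is a minimal sketch that builds a CML-style record for ethanol with Python's standard XML library and parses it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `atomRefs2`, `property`, `scalar`) follow CML conventions as far as I know them, but the output is not validated against the CML schema, and the boiling point entry is only an illustration of attaching units to a value.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Return a small CML-style XML string for ethanol (heavy atoms only)."""
    mol = ET.Element("molecule", id="ethanol")

    atoms = ET.SubElement(mol, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)

    bonds = ET.SubElement(mol, "bondArray")
    for refs in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=refs, order="1")

    # Metadata travels with the value: what it is and in which units.
    prop = ET.SubElement(mol, "property", title="boiling point")
    scalar = ET.SubElement(prop, "scalar", units="celsius")
    scalar.text = "78.4"

    return ET.tostring(mol, encoding="unicode")


text = ethanol_cml()
print(text)

# Because it is XML, any standard parser can read it back.
root = ET.fromstring(text)
print([a.get("elementType") for a in root.iter("atom")])
```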
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions).
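The cost of scanning torsion angles systematically grows quickly with the number of rotatable bonds. The sketch below enumerates a regular grid of torsion settings; it only generates the angle combinations and does not build or score any structure. The function name `torsion_grid` and the step sizes are assumptions for illustration.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg=60):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


# Two torsions at 120 degree steps: a 3 x 3 grid, like a coarse
# Ramachandran plot.
for combo in torsion_grid(2, step_deg=120):
    print(combo)

# The grid size is (360 / step) ** n_torsions.
print(len(list(torsion_grid(5, step_deg=30))))
```

This growth is why practical conformer searches usually sample or prune rather than enumerate every combination.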
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular Dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
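The film-strip picture maps directly onto the SD terminator line: each frame is one record, and `$$$$` marks where it ends. A minimal sketch of splitting such a file into frames follows. The function name `split_sd_records` is hypothetical and the two short frames are placeholders, not real molecule records.

```python
def split_sd_records(text):
    """Split concatenated SD records on the '$$$$' terminator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Two placeholder frames standing in for full records.
MOVIE = "frame 1\nM  END\n$$$$\nframe 2\nM  END\n$$$$\n"

frames = split_sd_records(MOVIE)
print(len(frames), "frames")
print(frames[1].splitlines()[0])
```

Dedicated trajectory formats used by molecular dynamics packages store frames far more compactly, but the idea of one snapshot after another is the same.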
So there are very many file formats – which could be a real pain, but they can be converted

Formats that can be read:
alc -- Alchemy file                    prep -- Amber PREP file
bs -- Ball & Stick file                caccrt -- Cacao Cartesian file
ccc -- CCC file                        c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file        cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file               dmol -- DMol3 Coordinates file
feat -- Feature file                   gam -- GAMESS Output file
gamout -- GAMESS Output file           gpr -- Ghemical Project file
mm1gp -- Ghemical MM file              qm1gp -- Ghemical QM file
hin -- HyperChem HIN file              jout -- Jaguar Output file
bin -- OpenEye Binary file             mmd -- MacroModel file
mmod -- MacroModel file                out -- MacroModel file
dat -- MacroModel file                 car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file               sd -- MDL Isis SDF file
mdl -- MDL Molfile file                mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file         mopout -- MOPAC Output file
mmads -- MMADS file                    mpqc -- MPQC file
bgf -- MSI BGF file                    nwo -- NWChem Output file
pdb -- PDB file                        ent -- PDB file
pqs -- PQS file                        qcout -- Q-Chem Output file
res -- ShelX file                      ins -- ShelX file
smi -- SMILES file                     mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file             vmol -- ViewMol file

Formats that can be written:
alc -- Alchemy file                    bs -- Ball & Stick file
caccrt -- Cacao Cartesian file         cacint -- Cacao Internal file
cache -- CAChe MolStruct file          c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file        ct -- ChemDraw Connection Table file
cht -- Chemtool file                   cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                  box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file         feat -- Feature file
fh -- Fenske-Hall Z-Matrix file        gamin -- GAMESS Input file
inp -- GAMESS Input file               gcart -- Gaussian Cartesian file
gau -- Gaussian Input file             gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file             gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file              jin -- Jaguar Input file
bin -- OpenEye Binary file             mmd -- MacroModel file
mmod -- MacroModel file                out -- MacroModel file
dat -- MacroModel file                 sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                mdl -- MDL Molfile file
mol -- MDL Molfile file                mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                    bgf -- MSI BGF file
csr -- MSI Quanta CSR file             nw -- NWChem Input file
pdb -- PDB file                        ent -- PDB file
pov -- POV-Ray Output file             pqs -- PQS file
report -- Report file                  qcin -- Q-Chem Input file
smi -- SMILES file                     fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                txyz -- Tinker XYZ file
txt -- Titles file                     unixyz -- UniChem XYZ file
vmol -- ViewMol file                   xed -- XED file
xyz -- XYZ file                        zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a ‘standard’ format; however,
in the near future there are likely to be many competing
requirements for file content.
Molecules on computers – things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect – check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen–aromatic bonds etc. may be inferred but are not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
Journals of MolecularCheminformatics you may wish
to follow up on
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics amp Modeling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIRES computational Molecular Science
Molecular
Informatics
Includes all aspects of the study of molecules on computers
Also includes Chem(o)informatics
This includes the representation of molecules databases display
simulation prediction of their properties and the discovery and
design of new molecules and materials
Molecular informatics is closely related to bioinformatics
computational chemistry molecular modelling simulation machine
learning and statistics as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery hence the concentration on small organic molecules
Cambridge HPC
Places to find Molecular
Informatics apps
bull httpwwwmacinchemorgmobilescience
bull Molecules ndash eg RSC-Chemspider)
bull Publishers (eg ACSRSC mobile)
bull Calculations (eg Yield101 for Rxns)
bull Visualisation (eg Pymol for proteins)
J Chem Educ 2013 90 (3) pp 320ndash325
DOI 101021ed300329e
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:
1. Trivial name, e.g. morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert them into "machine-readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in the computational representation.
Chemical representations need to be engineered to describe materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.
[Figure: a real-life mixture, not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol]
What is a molecule?
Is it a series of connected points, a wave function, or the sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
1-dimensional (simple, very compact and fast to access)
2-dimensional (contains the chemical diagram)
3-dimensional approaches (the shapes of molecules)
1D: line notations (a string of characters from a keyboard)
2D: molecular diagrams (graphs)
3D: graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-dimensional: line notations, a string representation of molecules. Here are three examples of different line notations, alongside the IUPAC name. SMILES is the most useful and widely used. These are 'strings' all describing the same molecule.
• Line notations
  - WLN
    L66J BMR& DSWQ IN1&1
  - ROSDAL
    1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
  - SMILES
    c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  - IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard typewriter keyboard.
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' as it is actually stored in the computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES string will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
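To make the idea of a SMILES string as "just characters" concrete, here is a minimal sketch that counts the atoms written in a SMILES string. It is not a full SMILES parser: the function name `count_atoms` and the regular expression are illustrative assumptions, it ignores implicit hydrogens, and it does not check ring closures or valences.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then one-letter atoms
# (upper case = aliphatic, lower case = aromatic).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def count_atoms(smiles: str) -> Counter:
    """Count the atoms written explicitly in a SMILES string, by element."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            # Inside brackets an optional isotope number precedes the symbol.
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})", token).group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1  # aromatic 'c' is still carbon
    return counts

# The two SMILES strings quoted in the lecture.
dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
artemisinin = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"

print(count_atoms(dye))
print(count_atoms(artemisinin))
```

Only hydrogens written as separate `[H]` atoms are counted, so the totals are the heavy-atom composition plus any explicit hydrogens.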
InChI
• A more recent line notation is called InChI.
• It is intended to address some of the problems of SMILES, e.g. polymers and materials, which SMILES does not cover.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites (e.g. RSC ChemSpider) are available that can generate InChI, or convert from InChI to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
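Although an InChI is hard for a human to read, its layered layout is easy to take apart by machine. The sketch below splits an InChI at its "/" separators and uses the whole string as a lookup key. The function name `split_inchi` and the small `registry` dictionary are illustrative assumptions. The example string is the standard InChI for ethanol.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and one-letter-tagged layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len(prefix):].split("/")
    result = {"version": version, "formula": formula}
    for layer in layers:
        # Later layers start with a tag letter: c = connections, h = hydrogens, ...
        result[layer[0]] = layer[1:]
    return result

ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi(ethanol)
print(layers)

# Used as an identifier, the whole string is simply a dictionary key.
registry = {ethanol: "ethanol"}
print(registry.get("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Exact string matching only works as an identity test if both strings were generated by the same InChI version and options.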
Storing chemical diagrams on computers
• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. Using a simple graph, the computer would deduce that these two canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
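The aromatic-bond point can be illustrated with a small sketch. It stores a six-membered ring as a labelled graph (a dictionary from atom pairs to bond orders) and shows that the two Kekulé forms compare as different until the bonds are relabelled as aromatic. The helper names `ring_bonds` and `aromatise` are assumptions, and the aromaticity test is deliberately naive: it only recognises one alternating six-ring, whereas real toolkits use far more general perception rules.

```python
def ring_bonds(orders):
    """Bond table for a six-membered ring: {frozenset({i, j}): bond order}."""
    return {frozenset({i, i % 6 + 1}): order
            for i, order in zip(range(1, 7), orders)}

# The two Kekule structures of benzene: same atoms, double bonds shifted by one.
kekule_a = ring_bonds([2, 1, 2, 1, 2, 1])  # double bonds 1-2, 3-4, 5-6
kekule_b = ring_bonds([1, 2, 1, 2, 1, 2])  # double bonds 2-3, 4-5, 6-1

def aromatise(bonds):
    """Relabel a strictly alternating single/double six-ring with 'ar' bonds."""
    orders = [bonds[frozenset({i, i % 6 + 1})] for i in range(1, 7)]
    alternating = all(a != b for a, b in zip(orders, orders[1:] + orders[:1]))
    if alternating and set(orders) == {1, 2}:
        return {pair: "ar" for pair in bonds}
    return dict(bonds)

print(kekule_a == kekule_b)                        # plain graphs differ
print(aromatise(kekule_a) == aromatise(kekule_b))  # aromatic graphs match
```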
An example of storing the molecular graph: the SD file format
MDL SD (structure-data) files store the information about each molecule in the following order:
Header block
  describes the molecule, e.g. its name
Connection table
  defines the molecular structure (atoms and bonds)
Data block (optional)
  properties, e.g. volume
Terminator line
  a line containing four dollar signs ($$$$), indicating the end of the information on this molecule
This is probably the most common format for storing small molecules; SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format (one record per atom, listing the atoms bonded to it):
  Connect 3 2 5 4
e.g. SD file format (reduced; one record per bond: from atom, to atom, bond type):
  Bond 3 5 1
  Bond 3 4 1
  Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene

benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. a date is used here)
Line 4, the "counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M  END" and "$$$$": these identify the end of this molecule
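Because the format fixes where every field sits, reading it is mostly a matter of slicing. The sketch below reads the benzene record by column position. The function name `parse_molfile` is an assumption, it handles only the V2000 fields described above (counts line, coordinates, element, bond triples), and the decimal points in the coordinates are taken to sit four places from the right, as is usual for this format.

```python
BENZENE_MOL = """\
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record, read by column."""
    lines = text.splitlines()
    name, counts = lines[0].strip(), lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        # Three 10-character coordinate fields, then the element symbol.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Three 3-character fields: from atom, to atom, bond type.
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(BENZENE_MOL)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Slicing by column rather than splitting on spaces matters for larger molecules, where neighbouring three-character fields can run together without a space.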
Alanine SD file
Bond length is 1.53 Å
Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
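The charge codes listed above are not the charges themselves, so a reader has to translate them. A minimal sketch of that translation follows; the function name `formal_charge` and its return convention (charge plus a radical flag) are assumptions made for illustration.

```python
def formal_charge(code: int) -> tuple[int, bool]:
    """Translate an atom-block charge code into (formal charge, is_radical)."""
    if code not in range(8):
        raise ValueError(f"unknown charge code: {code}")
    if code == 4:
        return 0, True           # doublet radical, not a charge
    if code == 0:
        return 0, False          # uncharged (or a value outside -3..+3)
    return 4 - code, False       # 1 -> +3, 2 -> +2, 3 -> +1, 5 -> -1, ...

for code in range(8):
    print(code, formal_charge(code))
```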
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
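One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they sit closer than the sum of their covalent radii plus a tolerance. The sketch below is one such heuristic, not the method of any particular program. The radii are approximate literature values, the 0.4 Å tolerance is an assumption, and the coordinates are those of a sulfate ion from PDB entry 3S7A with decimal points placed at the usual three decimal places.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def impute_bonds(atoms, tolerance=0.4):
    """Return index pairs closer than the summed covalent radii + tolerance."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[a[0]] + COVALENT_RADIUS[b[0]] + tolerance
        if math.dist(a[1:], b[1:]) <= cutoff:
            bonds.append((i, j))
    return bonds

# (element, x, y, z) for S, O1, O2 and O3 of the sulfate ion.
sulfate = [
    ("S", 25.993, -1.362, 5.893),
    ("O", 25.504, -1.159, 4.508),
    ("O", 27.282, -0.770, 6.155),
    ("O", 24.953, -1.035, 6.850),
]
print(impute_bonds(sulfate))
```

The test finds the three S-O contacts (about 1.45 to 1.48 Å) and rejects the O...O pairs (about 2.4 Å). It says nothing about bond order, which still has to come from chemical knowledge.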
Which way round should this ring go?
Example Protein Data Bank (PDB) file
(bonds are stored as an adjacency list in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
The newer format is mmCIF.
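The PDB format is also read by column position. The following sketch slices a few ATOM, HETATM and CONECT records using the column positions of the fixed-width PDB layout. The function names `parse_atom` and `parse_conect` and the dictionary keys are illustrative assumptions, and only the fields annotated above are extracted.

```python
PDB_LINES = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
""".splitlines()

def parse_atom(line):
    """Slice one ATOM/HETATM record at the fixed PDB column positions."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": tuple(float(line[i:i + 8]) for i in (30, 38, 46)),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),  # temperature factor
    }

def parse_conect(line):
    """Return (atom serial, [serials bonded to it]) from a CONECT record."""
    serials = [int(line[i:i + 5]) for i in range(6, len(line), 5)
               if line[i:i + 5].strip()]
    return serials[0], serials[1:]

atoms = [parse_atom(l) for l in PDB_LINES if l.startswith(("ATOM", "HETATM"))]
neighbours = dict(parse_conect(l) for l in PDB_LINES if l.startswith("CONECT"))
print(atoms[0])
print(neighbours)
```

Note that the CONECT records give each atom's neighbours but no bond orders, and for the standard protein residues no bonds are listed at all: they are implied by the residue names.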
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
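To see how a Z-matrix determines a geometry, here is a sketch that converts one into Cartesian coordinates. It assumes the layout described above (atom rows, a blank line, then "name value" variable lines), puts the first atom at the origin and the second on the z-axis, and places each later atom from its distance, angle and dihedral. The function names are assumptions, and there is no handling of dummy atoms, constants blocks or collinear reference atoms.

```python
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def place(a, b, c, r, theta, phi):
    """Point at distance r from a, angle theta to b, dihedral phi to c (radians)."""
    u = _unit(tuple(q - p for p, q in zip(a, b)))           # direction a -> b
    bc = tuple(q - p for p, q in zip(b, c))
    proj = sum(x * y for x, y in zip(bc, u))
    w = _unit(tuple(x - proj * y for x, y in zip(bc, u)))   # b -> c, made perpendicular to u
    n = _cross(w, u)
    return tuple(p + r * (math.cos(theta) * ui
                          + math.sin(theta) * (math.cos(phi) * wi + math.sin(phi) * ni))
                 for p, ui, wi, ni in zip(a, u, w, n))

def zmatrix_to_cartesian(text):
    """Convert Z-matrix text into a list of (symbol, (x, y, z)) tuples."""
    atom_block, _, var_block = text.strip().partition("\n\n")
    variables = {}
    for line in var_block.splitlines():
        key, number = line.split()
        variables[key] = float(number)

    def value(token):
        """A literal number, or a variable name with an optional leading minus."""
        key = token.lstrip("-")
        if key in variables:
            return -variables[key] if token.startswith("-") else variables[key]
        return float(token)

    coords = []
    for line in atom_block.splitlines():
        f = line.split()
        if not coords:
            xyz = (0.0, 0.0, 0.0)
        elif len(coords) == 1:
            xyz = (0.0, 0.0, value(f[2]))
        else:
            a, b = coords[int(f[1]) - 1][1], coords[int(f[3]) - 1][1]
            if len(coords) == 2:
                # No dihedral yet: an arbitrary reference keeps atom 3 in the xz-plane.
                c, phi = (b[0] + 1.0, b[1], b[2]), 0.0
            else:
                c, phi = coords[int(f[5]) - 1][1], math.radians(value(f[6]))
            xyz = place(a, b, c, value(f[2]), math.radians(value(f[4])), phi)
        coords.append((f[0], xyz))
    return coords

WATER = """\
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""
water = zmatrix_to_cartesian(WATER)
for symbol, xyz in water:
    print(symbol, " ".join(f"{c:8.4f}" for c in xyz))
```

Changing `l1` or `a1` in the text and re-running moves the atoms along exactly the internal coordinate that was edited, which is what makes this representation convenient for driving a reaction coordinate.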
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
Z-matrix 'improper' torsion angles
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
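As a rough illustration of what a CML-style record adds over a line notation, the sketch below builds a small XML description of ethanol with Python's standard library and reads it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `atomRefs2`, `property`, `scalar`) follow common CML conventions, but the document is not validated against the CML schema, and the atom ids and units label are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

molecule = ET.Element("molecule", id="ethanol")
ET.SubElement(molecule, "name").text = "ethanol"

# Heavy atoms only, as in a hydrogen-suppressed chemical graph.
atoms = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atoms, "atom", id=atom_id, elementType=element)

bonds = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")

# Metadata travels with the value: a property with explicit units.
prop = ET.SubElement(molecule, "property", title="boiling point")
ET.SubElement(prop, "scalar", units="celsius").text = "78.4"

cml_text = ET.tostring(molecule, encoding="unicode")
print(cml_text)

# Any XML parser can read the record back without chemistry-specific code.
parsed = ET.fromstring(cml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(elements)
```

The point of the extensible format is visible in the `property` element: a new kind of data can be attached to the molecule without breaking programs that only read the atoms and bonds.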
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
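The cost of scanning torsional space systematically is easy to estimate: a grid with a fixed angular step has (360 / step)^n points for n rotatable bonds. The sketch below enumerates such a grid. The function names and the 60 and 30 degree steps are illustrative assumptions, and a real conformer search would also build and score each geometry.

```python
from itertools import product

def torsion_grid(n_torsions: int, step_deg: int = 60):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions: int, step_deg: int = 60) -> int:
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_deg) ** n_torsions

# Two torsions, as in a Ramachandran plot (phi, psi).
print(list(torsion_grid(2))[:4], "...", grid_size(2), "points")

# Growth with the number of rotatable bonds at a 30 degree step.
for n in (2, 5, 10):
    print(n, "torsions:", grid_size(n, step_deg=30), "conformations")
```

The exponential growth is why practical methods sample the space (randomly or with knowledge-based rules) rather than enumerate it.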
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
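The film-strip picture can be sketched directly: if each frame is one SD record, the `$$$$` terminator line is all that is needed to cut the strip back into frames. The function name `split_sd_records` is an assumption, and the three-frame "trajectory" is a schematic placeholder with the connection tables elided, not real simulation output.

```python
def split_sd_records(text: str) -> list:
    """Split SD-format text into one string per record at the $$$$ lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

# Schematic trajectory: three records, each standing in for one snapshot.
trajectory = "".join(
    f"frame {i}\n(connection table for frame {i})\nM  END\n$$$$\n"
    for i in range(3)
)

frames = split_sd_records(trajectory)
print(len(frames), "frames")
print(frames[1].splitlines()[0])
```

Storing the bonds again in every frame is wasteful when only the coordinates change, which is one reason simulation packages tend to keep the topology and the coordinate frames in separate files.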
So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                              prep -- Amber PREP file
bs -- Ball & Stick file                          caccrt -- Cacao Cartesian file
ccc -- CCC file                                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                         dmol -- DMol3 Coordinates file
feat -- Feature file                             gam -- GAMESS Output file
gamout -- GAMESS Output file                     gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                        qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                        jout -- Jaguar Output file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                         sd -- MDL Isis SDF file
mdl -- MDL Molfile file                          mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                   mopout -- MOPAC Output file
mmads -- MMADS file                              mpqc -- MPQC file
bgf -- MSI BGF file                              nwo -- NWChem Output file
pdb -- PDB file                                  ent -- PDB file
pqs -- PQS file                                  qcout -- Q-Chem Output file
res -- ShelX file                                ins -- ShelX file
smi -- SMILES file                               mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                       vmol -- ViewMol file

alc -- Alchemy file                              bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                   cacint -- Cacao Internal file
cache -- CAChe MolStruct file                    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  ct -- ChemDraw Connection Table file
cht -- Chemtool file                             cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                            box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                   feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                  gamin -- GAMESS Input file
inp -- GAMESS Input file                         gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                       gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                       gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                        jin -- Jaguar Input file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                          mdl -- MDL Molfile file
mol -- MDL Molfile file                          mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                              bgf -- MSI BGF file
csr -- MSI Quanta CSR file                       nw -- NWChem Input file
pdb -- PDB file                                  ent -- PDB file
pov -- POV-Ray Output file                       pqs -- PQS file
report -- Report file                            qcin -- Q-Chem Input file
smi -- SMILES file                               fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                          txyz -- Tinker XYZ file
txt -- Titles file                               unixyz -- UniChem XYZ file
vmol -- ViewMol file                             xed -- XED file
xyz -- XYZ file                                  zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
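A format conversion is, at heart, reading one layout and writing another. As a minimal sketch (not how Babel itself is implemented), the function below writes a list of atoms in the simple XYZ layout: a count line, a comment line, then one "element x y z" line per atom. The function name `to_xyz` and the column widths are assumptions. The atoms are those of the benzene SD example.

```python
def to_xyz(name, atoms):
    """Format (symbol, x, y, z) tuples as XYZ text."""
    lines = [str(len(atoms)), name]
    lines += [f"{sym:<2} {x:10.4f} {y:10.4f} {z:10.4f}" for sym, x, y, z in atoms]
    return "\n".join(lines) + "\n"

benzene_atoms = [
    ("C", 1.9050, -0.7932, 0.0),
    ("C", 1.9050, -2.1232, 0.0),
    ("C", 0.7531, -0.1282, 0.0),
    ("C", 0.7531, -2.7882, 0.0),
    ("C", -0.3987, -0.7932, 0.0),
    ("C", -0.3987, -2.1232, 0.0),
]

xyz_text = to_xyz("benzene", benzene_atoms)
print(xyz_text)
```

Notice what is lost: the XYZ layout has no bond block, so the connection table of the SD file cannot survive the conversion and would have to be imputed again on the way back.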
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents
Molecular
Informatics
Includes all aspects of the study of molecules on computers
Also includes Chem(o)informatics
This includes the representation of molecules databases display
simulation prediction of their properties and the discovery and
design of new molecules and materials
Molecular informatics is closely related to bioinformatics
computational chemistry molecular modelling simulation machine
learning and statistics as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery hence the concentration on small organic molecules
Cambridge HPC
Places to find Molecular
Informatics apps
bull httpwwwmacinchemorgmobilescience
bull Molecules ndash eg RSC-Chemspider)
bull Publishers (eg ACSRSC mobile)
bull Calculations (eg Yield101 for Rxns)
bull Visualisation (eg Pymol for proteins)
J Chem Educ 2013 90 (3) pp 320ndash325
DOI 101021ed300329e
Cheminformatics 101
How do we store molecules on the computer
There are estimated to be 1060 possible small molecules that
could be made How do we find the best molecule for the
problem we are addressing Letrsquos take a look ldquounder the
bonnetrdquo of the way molecules are actually manipulated on the
computer You will be familiar with
1 Trivial name eg Morphine
2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-
methylmorphinan-36-diol
However these names do not convey the structure of molecules
in a way the computer can readily understand We need to
convert these into ldquomachine readable formatsrdquo which allows
ease of searching based on the complexities of molecular
structure But what is a molecule
Bear this in mind Molecules are complicated When we look at this scene we add a
huge amount of information from our senses and knowledge ndash but it nearly all gets
lost in computational representation
Representing chemistry needs to be engineered to represent materials and processes
As you will see we are moving in that direction with more complete representations
of molecules and materials
Not (5α6α)-78-
didehydro-45-
epoxy-17-
methylmorphinan-
36-diol
A real life mixture
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by as a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simpe graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
Places to find Molecular Informatics apps
• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)
J. Chem. Educ. 2013, 90 (3), pp 320–325
DOI: 10.1021/ed300329e
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that
could be made. How do we find the best molecule for the
problem we are addressing? Let’s take a look “under the
bonnet” at the way molecules are actually manipulated on the
computer. You will be familiar with:
1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules
in a way the computer can readily understand. We need to
convert them into “machine-readable formats”, which allow
searching based on the complexities of molecular
structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a
huge amount of information from our senses and knowledge, but nearly all of it gets
lost in the computational representation.
Representations of chemistry need to be engineered to capture materials and processes.
As you will see, we are moving in that direction, with more complete representations
of molecules and materials.
[Figure: a real-life mixture, not just (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol]
What is a molecule?
• Is it a series of connected points?
• A wave function?
• The sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data.
So more experimental data and an appropriate description of a molecule may
translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
• 1-Dimensional (simple, very compact and fast to access): line notations (a string of characters from a keyboard)
• 2-Dimensional (contains the chemical diagram): molecular diagrams (graphs)
• 3-Dimensional (the shapes of molecules): graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional: line notations
A line notation is a string representation of a molecule. Here are three examples of
different line notations; SMILES is the most useful and widely used. These
‘strings’ all describe the same molecule.
• Line notations
– WLN
  L66J BMR& DSWQ IN1&1
– ROSDAL
  1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
  c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice that all these notations use just the characters on a standard
typewriter keyboard.
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a ‘molecule’ and is what is actually stored in the
computer. Does it make sense to you?
It’s artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES string will be available
by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
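To make the idea of a line notation as "just a string" concrete, here is a minimal sketch that splits a SMILES string into tokens and counts the heavy (non-hydrogen) atoms. It is not a full SMILES parser: the token pattern covers only a simplified subset of the language, and the function names `tokenize` and `heavy_atom_count` are made up for this illustration.

```python
import re

# Token pattern for a simplified subset of SMILES (an assumption of this sketch).
TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [CH] or [H]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter organic-subset atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[-=#:/\\.]"     # bond symbols and the disconnection dot
    r"|[()]"           # branch open / close
    r"|%\d{2}|\d"      # ring-closure labels
)

def tokenize(smiles):
    """Split a SMILES string into tokens; reject characters we do not know."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES")
    return tokens

def heavy_atom_count(smiles):
    """Count atoms other than explicit hydrogens written as [H]."""
    count = 0
    for tok in tokenize(smiles):
        if tok.startswith("["):
            if tok != "[H]":
                count += 1
        elif tok[0].isalpha():
            count += 1
    return count

ARTEMISININ = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"
print(tokenize("C(C)O"))              # ethanol split into atoms and branch symbols
print(heavy_atom_count(ARTEMISININ))  # 15 carbons + 5 oxygens
```

A real toolkit would go further and build the molecular graph from the tokens (matching ring-closure digits and branch parentheses); the sketch stops at the lexical level.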
InChI
• A more recent line notation is called InChI.
• It is intended to address some of the problems of SMILES, e.g. polymers
  and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually
  uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each
  molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to
  structure, from different names and formats (e.g. RSC ChemSpider).
Again, a string like this is easily matched on a computer.
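The two points above (an InChI is built from layers, and it can act as a lookup key) can be sketched in a few lines. This is only an illustration: the helper name `inchi_layers` and the tiny `registry` dictionary are assumptions of this sketch, and it does not validate the string the way the official InChI software does.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # first character is the layer prefix (c, h, ...)
    return layers

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"   # standard InChI for ethanol

# Because the identifier is a plain string, exact matching is a dictionary lookup.
registry = {ETHANOL: "ethanol"}

print(inchi_layers(ETHANOL))
print(registry.get(ETHANOL))
```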
Storing chemical diagrams on computers
• The valence model of a molecule can be represented as a
  chemical graph. A simple graph contains nodes (atoms) and edges
  (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any
  crossings are irrelevant. Generally we ignore hydrogens unless
  tautomerism or pKa is an issue. Computers handle graphs very
  well, and molecules represented like this are examples of labelled
  graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and
  aromaticity, stereochemistry, tautomerism, non-stoichiometric
  compounds etc. are often problematic. Using a simple graph, the
  computer would deduce that these two canonical structures are
  different molecules.
• To solve this we could, for example, introduce the concept of an
  aromatic bond.
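The aromaticity problem can be shown with a small sketch. Two Kekulé drawings of a six-membered ring are stored as labelled graphs (atom pairs mapped to bond orders); compared naively they differ, but once every ring bond is relabelled as "aromatic" they become identical. The function names are made up for this illustration, and `mark_aromatic` is a crude stand-in that simply assumes its input is one alternating six-membered ring; real aromaticity perception is considerably more involved.

```python
def ring_bonds(orders):
    """Bonds of a six-membered ring 1-2-3-4-5-6-1 as {atom pair: bond order}."""
    return {frozenset((i + 1, (i + 1) % 6 + 1)): orders[i] for i in range(6)}

def mark_aromatic(bonds):
    """Relabel every bond as aromatic (assumes a single alternating ring)."""
    return {pair: "aromatic" for pair in bonds}

kekule_a = ring_bonds([1, 2, 1, 2, 1, 2])   # double bonds on 2-3, 4-5, 6-1
kekule_b = ring_bonds([2, 1, 2, 1, 2, 1])   # double bonds on 1-2, 3-4, 5-6

print(kekule_a == kekule_b)                                # False: simple graphs differ
print(mark_aromatic(kekule_a) == mark_aromatic(kekule_b))  # True: same aromatic ring
```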
An example of storing the molecular graph: the SD file format
MDL SD (structure data) files store the information about each molecule in the
following order:
Header block
   describes the molecule, e.g. its name
Connection table
   defines the molecular structure (atoms and bonds)
Data block (optional)
   properties, e.g. volume
Terminator line
   a line containing four dollar signs ($$$$),
   indicating the end of information on this molecule
This is probably the most common format for small molecules. SD files are widely
used in software to store many molecular structures in databases.
A “format” in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it fails).
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
  as a labelled graph.
e.g. PDB protein file format (one record lists the neighbours of an atom):
Connect 3 2 5 4
e.g. SD file format (reduced; one record per bond: from atom, to atom, bond type):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
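The two styles of connection table carry the same information. As a minimal sketch, the per-bond records can be turned into per-atom neighbour lists as follows; the function name `neighbours` is made up here, and the input is just the three reduced `Bond` records from the slide (so atom 2, which appears in the `Connect` example, is absent).

```python
from collections import defaultdict

# (from atom, to atom, bond type), as in the reduced SD-style records
bonds = [(3, 5, 1), (3, 4, 1), (5, 6, 2)]

def neighbours(bond_records):
    """Build {atom: sorted list of bonded atoms} from per-bond records."""
    adj = defaultdict(list)
    for a, b, _order in bond_records:
        adj[a].append(b)   # each bond is stored once but seen from both atoms
        adj[b].append(a)
    return {atom: sorted(nbrs) for atom, nbrs in adj.items()}

connect = neighbours(bonds)
for atom in sorted(connect):
    print("Connect", atom, *connect[atom])
```

Note that the neighbour-list form drops the bond type, which is one reason bond orders in PDB-style files often have to be inferred.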
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a
  molecule that matches a chemical structure search.
• Often we don’t know the conformation (shape) of a molecule, so
  storing it in 3D would be pointless. Look at the changes in
  conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. the date is used here)
• “Counts line”: has the number of atoms and bonds as a minimum
• “Atom block”: has the x, y, z coordinates of the atom and the element as a minimum
• “Bond block”: one line for each bond (from atom, to atom, bond type)
• “M  END” and “$$$$” identify the end of this molecule
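Reading such a record back is mostly a matter of knowing which line holds what. The sketch below parses the benzene example into a name, an atom list and a bond list. It is a simplification: it splits lines on whitespace, whereas the format is defined by fixed-width columns (fields can run together in large molecules), and the function name `parse_molfile` is made up for this illustration.

```python
MOLFILE = """\
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a single V2000-style record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header block, line 1
    n_atoms, n_bonds = (int(v) for v in lines[3].split()[:2])   # counts line
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:       # bond block
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```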
Alanine SD file (annotations)
• Bond length is 1.53 Å
• Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, …
• Charge codes: 0 = uncharged (or a value other than these), 1 = +3, 2 = +2, 3 = +1,
  4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
• Molecule is chiral
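The charge column is a code, not the charge itself, so software has to translate it. A minimal sketch of that lookup, using the codes listed above, might look like this; the name `decode_charge` is made up here, and treating code 4 (doublet radical) as zero formal charge is an assumption of this sketch.

```python
# Atom-block charge code -> formal charge, as listed in the annotations above.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Return (formal charge, is_doublet_radical) for an atom-block charge code."""
    return CHARGE_CODES[code], code == 4

print(decode_charge(3))   # code 3 means +1
print(decode_charge(5))   # code 5 means -1
```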
What if we only know the atom positions and not the bonds?
A key example is X-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However in, e.g., a protein structure, which is very
large, we can’t always determine the atom positions exactly, so these are
stored and the bonds imputed. The format for storing these is a bit different.
[Figure: electron density map. Which way round should this ring go?]
Example Protein Data Bank file (pdb)
(explicit bonds are stored as adjacency lists in CONECT records)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including x, y, z, occupancy and temperature factor
• HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
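Since protein ATOM records carry no bonds, software has to impute them from the coordinates. The sketch below does this for the five valine atoms listed above using a single distance cutoff. The 1.9 Å cutoff and the function names are assumptions chosen for this illustration; real programs use element-dependent covalent radii and residue templates, and read the fixed-width columns rather than splitting on whitespace.

```python
import math
from itertools import combinations

PDB_LINES = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
""".splitlines()

def parse_atom(line):
    """Return (serial number, atom name, (x, y, z)) from an ATOM record."""
    f = line.split()   # simplification: real PDB files need column slicing
    return int(f[1]), f[2], (float(f[6]), float(f[7]), float(f[8]))

def impute_bonds(atoms, cutoff=1.9):
    """Call two atoms bonded if they are closer than `cutoff` angstroms."""
    bonds = []
    for (i, _, p), (j, _, q) in combinations(atoms, 2):
        if math.dist(p, q) < cutoff:
            bonds.append((i, j))
    return bonds

atoms = [parse_atom(line) for line in PDB_LINES]
print(impute_bonds(atoms))   # backbone N-CA, CA-C, C-O plus the CA-CB side-chain bond
```

The same distance test cannot tell a single bond from a double bond reliably, which is one source of the errors discussed later under "things to look out for".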
What if we want to vary the atom positions, e.g. in driving a
reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead
we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: sometimes people used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending (e.g. a carbonyl)
  Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the
   name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA,
   the name of a variable to describe the distance between the current
   atom and NA, the atom number NB, and the name of a variable to
   describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA,
   the name of a variable to describe the distance between the current
   atom and NA, the atom number NB, the name of a variable to describe
   the angle between the current atom, NA and NB, the atom number of
   another previously defined atom NC, and finally the name of a
   variable to describe the dihedral angle between the current atom,
   NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a
   separate line for each variable.
7. In some cases, where some of the variables are to be fixed as
   constants in a geometry optimisation, they are listed here after a
   blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
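To see how the internal coordinates map back to Cartesian positions, the water Z-matrix can be converted by hand. The sketch below places O at the origin, the first H along the z axis, and the second H in the xz plane; this choice of axes and the function name `water_from_zmatrix` are assumptions of the illustration. It handles only this three-atom case, since a general converter also needs the dihedral step for the fourth and later atoms.

```python
import math

def water_from_zmatrix(l1, a1_degrees):
    """Cartesian coordinates (O, H, H) from one bond length and one bond angle."""
    a = math.radians(a1_degrees)
    o = (0.0, 0.0, 0.0)                               # atom 1: origin
    h1 = (0.0, 0.0, l1)                               # atom 2: distance l1 from atom 1
    h2 = (l1 * math.sin(a), 0.0, l1 * math.cos(a))    # atom 3: l1 from O, angle a1 to H1
    return o, h1, h2

o, h1, h2 = water_from_zmatrix(0.96, 104.0)
hh = math.dist(h1, h2)
print(f"H...H distance: {hh:.3f} angstroms")
```

Changing only `a1` moves the second hydrogen along an arc at fixed bond length, which is exactly the kind of controlled change that is awkward to express directly in xyz coordinates.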
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Only l1 to l4, a1 to a3 and da1 are referenced by the atom lines above.)
[Figure label: Z-matrix with ‘improper’ torsion angles]
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
– Can be parsed
– Can contain reactions, properties etc.
– Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
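As a rough illustration of why XML "can be parsed", the sketch below reads a small CML-style description of ethanol with Python's standard XML library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`) are modelled on CML, but the document is hand-written for this example and not validated against the CML schema; the simplified `property` element with its `units` attribute is purely illustrative.

```python
import xml.etree.ElementTree as ET

CML_TEXT = """
<molecule id="ethanol" title="Ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>
"""

root = ET.fromstring(CML_TEXT)

# Structure: atoms and bonds are ordinary elements that any XML parser can walk.
elements = [atom.get("elementType") for atom in root.iter("atom")]
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# Metadata: the value travels together with its units.
prop = root.find("property")
print(elements, bonds)
print(prop.get("title"), prop.text, prop.get("units"))
```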
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not
  really flat, because of thermal fluctuations. So we represent 3D
  molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
  electron or neutron diffraction; see e.g. the Cambridge
  Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as
  Corina or Concord (put in a SMILES and get a 3D molecule), which
  use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
  mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
  that deduce conformations, usually involving torsional angle rotation
  to scan the conformational space (like a Ramachandran plot, but often
  in many more torsional angle dimensions).
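A systematic torsion scan is simply a grid over the rotatable bonds, and the sketch below shows how quickly that grid grows. The function names and the choice of step sizes are assumptions for illustration; practical conformer generators prune this grid heavily rather than enumerating it in full.

```python
from itertools import product

def torsion_grid(n_torsions, step_degrees):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_degrees):
    """Number of grid points without enumerating them."""
    return (360 // step_degrees) ** n_torsions

# Two torsions at 120 degree steps: a small Ramachandran-like grid.
for conformer in torsion_grid(2, 120):
    print(conformer)

# Five torsions at 30 degree steps already give a large search space.
print(grid_size(5, 30))
```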
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
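The "film strip" idea can be sketched with the simple XYZ text format, where each frame is an atom count, a comment line, and one line per atom, and a trajectory is just frames written one after another. The function names and the two hand-made water frames are assumptions of this illustration; production molecular dynamics codes normally use compact binary trajectory formats instead.

```python
def xyz_frame(atoms, comment=""):
    """Format one snapshot: atom count, comment line, then 'symbol x y z' lines."""
    lines = [str(len(atoms)), comment]
    lines += [f"{sym} {x:.4f} {y:.4f} {z:.4f}" for sym, x, y, z in atoms]
    return "\n".join(lines) + "\n"

def trajectory(frames):
    """Concatenate snapshots into a single multi-frame text."""
    return "".join(xyz_frame(atoms, f"frame {i}") for i, atoms in enumerate(frames))

# Two illustrative snapshots of water with one O-H bond slightly stretched.
frame0 = [("O", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.96), ("H", 0.9315, 0.0, -0.2322)]
frame1 = [("O", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 1.00), ("H", 0.9315, 0.0, -0.2322)]

movie = trajectory([frame0, frame1])
print(movie)
```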
So there are very many file formats, which could be a real pain, but
they can be converted.
Formats that can be read:
alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
Formats that can be written:
alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a ‘standard’ format. However,
in the near future there are likely to be many competing
requirements for file content.
Molecules on computers: things to look out for,
since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
  so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
  and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
  correctly (problems with storing bonding).
• Tautomers can be incorrect; check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
  halogen-aromatic bonds etc. may be inferred but are not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
Cheminformatics 101
How do we store molecules on the computer
There are estimated to be 1060 possible small molecules that
could be made How do we find the best molecule for the
problem we are addressing Letrsquos take a look ldquounder the
bonnetrdquo of the way molecules are actually manipulated on the
computer You will be familiar with
1 Trivial name eg Morphine
2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-
methylmorphinan-36-diol
However these names do not convey the structure of molecules
in a way the computer can readily understand We need to
convert these into ldquomachine readable formatsrdquo which allows
ease of searching based on the complexities of molecular
structure But what is a molecule
Bear this in mind Molecules are complicated When we look at this scene we add a
huge amount of information from our senses and knowledge ndash but it nearly all gets
lost in computational representation
Representing chemistry needs to be engineered to represent materials and processes
As you will see we are moving in that direction with more complete representations
of molecules and materials
Not (5α6α)-78-
didehydro-45-
epoxy-17-
methylmorphinan-
36-diol
A real life mixture
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by as a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simpe graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can't always determine the atom positions exactly, so these are
stored and the bonds imputed. The format for storing these is a bit different.
[Figure caption: Which way round should this ring go?]
Example Protein Data Bank file (pdb)
(bonds are listed per atom in the CONECT records, i.e. as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
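A short reader makes the record types concrete. This is a minimal sketch, assuming records laid out like the excerpt above; `parse_pdb` is a name invented here. The PDB format is column-based, so splitting on whitespace (used below for brevity) is a simplification that breaks when neighbouring fields touch.

```python
PDB_LINES = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
""".splitlines()


def parse_pdb(lines):
    """Collect atoms (ATOM/HETATM) and explicit bonds (CONECT) from PDB records."""
    atoms, bonds = {}, {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("ATOM", "HETATM"):
            serial = int(fields[1])
            atoms[serial] = {
                "name": fields[2],
                "residue": fields[3],
                "chain": fields[4],
                "xyz": (float(fields[6]), float(fields[7]), float(fields[8])),
                "occupancy": float(fields[9]),
                "b_factor": float(fields[10]),  # temperature factor
            }
        elif fields[0] == "CONECT":
            # First number is the atom; the rest are the atoms bonded to it.
            serial, *partners = (int(f) for f in fields[1:])
            bonds[serial] = partners
    return atoms, bonds


atoms, bonds = parse_pdb(PDB_LINES)
print(atoms[1510]["xyz"], bonds[1510])
```

Note that the protein atoms have no CONECT records at all: their bonds are imputed from the residue names, as described above.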
What if we want to vary the atom positions, e.g. in driving a reaction
coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the
natural angles and distances. This uses internal coordinates.
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of a molecule, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of
a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of
a variable to describe the distance between the current atom and NA, the atom
number NB, and the name of a variable to describe the angle between the
current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name
of a variable to describe the distance between the current atom and NA, the
atom number NB, the name of a variable to describe the angle between the
current atom, NA and NB, the atom number of another previously defined atom
NC, and finally the name of a variable to describe the dihedral angle between
the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line
for each variable.
7. In some cases, where some of the variables are to be fixed as constants in
a geometry optimisation, they are listed here after a blank line, rather than
above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
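To see what the water Z-matrix encodes, it can be converted to Cartesian coordinates by hand. The sketch below covers only this three-atom case and assumes a conventional placement (first atom at the origin, second on the z axis, third in the xz plane); `water_from_zmatrix` is a name invented for illustration and is not part of any package.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Place O, H, H from one bond length (l1) and one H-O-H angle (a1)."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                  # atom 1: origin
    hydrogen_1 = (0.0, 0.0, l1)               # atom 2: distance l1 along z
    # atom 3: distance l1 from atom 1, making angle a1 with atom 2
    hydrogen_2 = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))
    return [("O", oxygen), ("H", hydrogen_1), ("H", hydrogen_2)]


geometry = water_from_zmatrix(0.96, 104.0)
for symbol, (x, y, z) in geometry:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")

# Recover the internal coordinates from the Cartesian ones as a check.
o, h1, h2 = (xyz for _, xyz in geometry)
oh_length = math.dist(o, h2)
cos_angle = sum(p * q for p, q in zip(h1, h2)) / (math.dist(o, h1) * oh_length)
hoh_angle = math.degrees(math.acos(cos_angle))
```

Changing `l1` or `a1_degrees` moves the atoms along a chemically meaningful coordinate, which is the point of using internal coordinates to drive a reaction.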
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: Z-matrix; 'improper' torsion angles]
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future, much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML, and the World-Wide Web. 1. Basic Principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol as <CML>:
– Can be parsed
– Can contain reactions, properties, etc.
– Can contain relationships to other molecules and also concepts
Ethanol as InChI:
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
Ethanol as SMILES:
C(C)O
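As a rough illustration of what "can be parsed" means, the sketch below writes and re-reads a small XML description of ethanol (heavy atoms only) with Python's standard library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `order`) follow CML conventions as far as I am aware, but the output has not been validated against the CML schema and should be read as an assumption.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a small CML-style description of ethanol (C-C-O, hydrogens implicit)."""
    molecule = ET.Element("molecule", {"id": "ethanol", "title": "Ethanol"})

    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": element})

    bond_array = ET.SubElement(molecule, "bondArray")
    for atom_refs in ["a1 a2", "a2 a3"]:
        ET.SubElement(bond_array, "bond", {"atomRefs2": atom_refs, "order": "1"})

    return ET.tostring(molecule, encoding="unicode")


cml_text = ethanol_cml()
print(cml_text)

# Any XML parser can now recover the structure without format-specific code.
parsed = ET.fromstring(cml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
bond_count = len(list(parsed.iter("bond")))
```

Because every value is labelled, extra metadata (units, dates, links to other molecules) can be added as further elements without breaking existing readers.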
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not really
flat because of thermal fluctuations. So we represent 3D molecules by
including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge Crystallographic Database
or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as Corina or
Concord (put in a SMILES and get a 3D molecule), which use rules derived from
experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics
or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce
conformations, usually involving torsional angle rotation to scan the
conformational space (like a Ramachandran plot, but often in many more
torsional angle dimensions).
Molecules in 3D – uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …
The dynamics of a molecule can be computed and stored as a series of frames
of the coordinates and bonds of the structure (much like a cartoon).
Molecular Dynamics is the most popular method for larger systems. Here is an
example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of
snapshots of SD files concatenated together to make a movie, just like a film
strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats, which could be a real pain, but they can
be converted:
alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file
File formats – the bane of our lives
• Interconversion program: Babel
• There have been recent IUPAC moves towards a 'standard' format. However, in
the near future there are likely to be many competing requirements for file
content.
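At its simplest, converting between formats means reading atoms from one layout and writing them in another. In practice a converter such as Babel does this for many formats; the sketch below is only a hand-written illustration that writes a list of atoms in the plain XYZ layout (atom count, title line, then one "symbol x y z" line per atom). The function name `atoms_to_xyz` and the two-atom input are assumptions made for the example.

```python
def atoms_to_xyz(title, atoms):
    """Write (symbol, x, y, z) tuples as XYZ-format text."""
    lines = [str(len(atoms)), title]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2} {x:10.4f} {y:10.4f} {z:10.4f}")
    return "\n".join(lines) + "\n"


# Two atoms taken from the benzene SD file example.
atoms = [("C", 1.9050, -0.7932, 0.0), ("C", 1.9050, -2.1232, 0.0)]
xyz_text = atoms_to_xyz("benzene fragment", atoms)
print(xyz_text)
```

Note what is lost: the XYZ layout has no bond block, so the bonds in the SD file would have to be imputed again by whatever reads the result.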
Molecules on computers – things to look out for, since what is stored is
actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low
resolution), so bonds and hydrogen atoms are added and are not always correct.
Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and
retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in
inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in
the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
Bear this in mind: molecules are complicated. When we look at this scene we
add a huge amount of information from our senses and knowledge, but nearly all
of it gets lost in the computational representation.
Representing chemistry needs to be engineered to represent materials and
processes. As you will see, we are moving in that direction, with more
complete representations of molecules and materials.
[Figure caption: Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol.
A real-life mixture.]
What is a molecule?
• Is it a series of connected points?
• A wave function?
• The sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of
data. So more experimental data and an appropriate description of a molecule
may translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional – Line notations: a string representation of molecules
Here are three examples of different line notations. SMILES is the most useful
and widely used. These are 'strings', all of the same molecule.
• Line notations
– WLN
• L66J BMR& DSWQ IN1&1
– ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard typewriter
keyboard.
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the
computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES is then available by picking
the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and
materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually
uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each
molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from an InChI
to a structure, from different names and formats (e.g. RSC ChemSpider).
Again, a string like this is easily matched on a computer.
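"Easily matched on a computer" can be taken literally: because the identifier is a plain string, it can serve directly as a lookup key. The sketch below uses the ethanol InChI quoted elsewhere in these slides; the `registry` dictionary and the `inchi_layers` helper are invented for illustration, and the helper only splits the string on "/" rather than interpreting the layers chemically.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # e.g. 'c' connectivity, 'h' hydrogens
    return layers


ETHANOL = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"

# One molecule, one InChI: the string itself is the dictionary key.
registry = {ETHANOL: "ethanol"}

layers = inchi_layers(ETHANOL)
print(registry.get(ETHANOL), layers)
```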
Storing chemical diagrams on computers
• The valence model of a molecule can be represented as a chemical graph. A
simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings
are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an
issue. Computers handle graphs very well, and molecules represented like this
are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity,
stereochemistry, tautomerism, non-stoichiometric compounds, etc. are often
problematic. The computer would (using a simple graph) deduce that these two
canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic
bond.
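The problem with the two canonical structures can be reproduced in a few lines. In this sketch both Kekulé forms of benzene use the same atom numbering, so comparing bond sets is enough; in general, matching two molecular graphs needs a proper graph-matching step, which is not shown. The `aromatise` helper simply assumes we already know the ring is aromatic, and all names here are invented for illustration.

```python
def bond_set(bonds):
    """Order-independent view of a bond list: {({a, b}, bond type), ...}."""
    return frozenset((frozenset((a, b)), order) for a, b, order in bonds)


# The two Kekule (canonical) structures of benzene, same atom numbering.
kekule_a = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
kekule_b = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 1, 2)]


def aromatise(bonds):
    """Relabel every ring bond as aromatic (assumes the ring is known to be aromatic)."""
    return [(a, b, "aromatic") for a, b, _ in bonds]


same_as_simple_graph = bond_set(kekule_a) == bond_set(kekule_b)
same_with_aromatic_bonds = bond_set(aromatise(kekule_a)) == bond_set(aromatise(kekule_b))
print(same_as_simple_graph, same_with_aromatic_bonds)
```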
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about each molecule in
the following order:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of information
on this molecule
This is probably the most common format for storing small molecules. SD files
are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The
program reading it expects the information in exactly the right place (and
fails otherwise).
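The block structure described above is what lets one file hold many molecules. A minimal sketch of reading it: split the text on the `$$$$` terminator, then pick the data items out of each record. It assumes data items are written as a `> <Name>` line followed by the value and a blank line; the two one-atom records and their "Volume" values are placeholders made up for the example, and the function names are invented.

```python
SD_TEXT = """methane
  example

  1  0  0  0  0  0  0  0  0  0  1 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <Volume>
1.0

$$$$
water
  example

  1  0  0  0  0  0  0  0  0  0  1 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <Volume>
2.0

$$$$
"""


def split_sd_records(text):
    """Return one list of lines per molecule, using $$$$ as the terminator."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def read_data_items(record_lines):
    """Collect the optional data block of one record as {name: value}."""
    items, name = {}, None
    for line in record_lines:
        if line.startswith(">"):
            name = line[line.index("<") + 1:line.index(">", 1)]
            items[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None            # a blank line ends the data item
            else:
                items[name].append(line)
    return {key: "\n".join(value) for key, value in items.items()}


records = split_sd_records(SD_TEXT)
for record in records:
    print(record[0], read_data_items(record))
```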
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a
labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format:
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by as a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simpe graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular graph: the SD file format
MDL SD (structure-data) files store the information about each molecule
in the following order
Header block
describes the molecule, eg its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, eg volume
Terminator line
a line containing four dollar signs ($$$$),
indicating the end of the information on this molecule
This is probably the most common format for storing small molecules. SD files
are widely used by software to store many molecular structures in databases
A "format" in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it goes wrong)
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph
Connection list, eg PDB protein file format:
Connect 3 2 5 4
Bond list (reduced), eg SD file format:
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
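The two styles carry the same graph. As a rough sketch (the six-atom fragment and the names `connect` and `to_bond_list` are made up for illustration), a connection list names every neighbour of every atom, so each bond appears twice; the reduced bond list keeps each bond once.

```python
# Connection-list form: each atom lists all of its neighbours.
# Hypothetical fragment, consistent with "Connect 3 2 5 4" above.
connect = {
    2: [3],
    3: [2, 5, 4],
    4: [3],
    5: [3, 6],
    6: [5],
}


def to_bond_list(connect):
    """Reduce a symmetric neighbour list to one (i, j) entry per bond, i < j."""
    return sorted({(min(a, b), max(a, b)) for a, nbrs in connect.items() for b in nbrs})


bonds = to_bond_list(connect)
print(bonds)
```

Note that this simple connection list carries no bond orders, whereas each line of a bond list has room for a bond type.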
Why store molecules in 2D?
• Quite often we only need the chemical diagram, eg to find a
molecule that matches a chemical structure search
• It is often the case that we don't know the conformation (shape)
of a molecule, so storing it in 3D would be pointless. Look at
the changes in conformation in this molecule at room temperature
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (eg the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M  END" and "$$$$" identify the end of this molecule
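Reading such a record back is mostly a matter of counting lines. The sketch below is an assumption-laden illustration, not a full reader: `parse_molfile` is a hypothetical helper, the second and third header lines are placeholders, the coordinates are taken to carry their decimal points as in a standard V2000 file, and fields are split on whitespace. The real format uses fixed-width columns, so a production parser should slice columns instead, because fields can run together in large molecules.

```python
MOLFILE = """benzene
  placeholder program line
placeholder comment line
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record (whitespace-split sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()  # counts line: atoms first, bonds second
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:  # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bond block
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), "atoms", len(bonds), "bonds")
```

The atom numbers in the bond block are 1-based positions in the atom block, which is why the order of the lines matters.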
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x y z symbol mass-diff charge stereo h-count ...
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
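The charge codes listed above are not the charges themselves, so a reader has to translate them. A small sketch, with `decode_charge` as a hypothetical helper name:

```python
def decode_charge(code):
    """Translate an atom-block charge code into (formal_charge, is_doublet_radical)."""
    if code == 4:
        return 0, True  # doublet radical, no formal charge
    if code in (1, 2, 3, 5, 6, 7):
        return 4 - code, False  # 1 -> +3, 2 -> +2, 3 -> +1, 5 -> -1, 6 -> -2, 7 -> -3
    return 0, False  # 0: uncharged, or a value outside this table


for code in range(8):
    print(code, decode_charge(code))
```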
What if we only know the atom positions and not the bonds?
A key example is X-ray crystallography. Here we determine the
positions of the atoms and infer the bonds from our chemical knowledge
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. For a small molecule this is very accurate and we
can almost see the bonds. However, in eg a protein structure, which is very
large, we can't always determine the atom positions exactly, so the positions are
stored and the bonds inferred. The format for storing these is a bit different
Which way round should this ring go?
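One common way to infer bonds from positions alone is a distance test: two atoms are called bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate radii, an arbitrary tolerance of 0.45 Å, and coordinates for the backbone atoms of a valine residue in Å; `infer_bonds` is a hypothetical helper, not the method of any particular program.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (assumed values).
COVALENT_RADIUS = {"C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.45  # assumed slack, in angstroms

# (atom name, element, (x, y, z)) for one valine residue.
ATOMS = [
    ("N", "N", (3.036, -1.035, -3.538)),
    ("CA", "C", (4.283, -0.343, -3.015)),
    ("C", "C", (4.565, 1.067, -3.593)),
    ("O", "O", (4.653, 1.287, -4.853)),
    ("CB", "C", (5.510, -1.245, -3.141)),
]


def infer_bonds(atoms, tolerance=TOLERANCE):
    """Return index pairs whose separation is within the covalent-radius cutoff."""
    bonds = []
    for i, j in combinations(range(len(atoms)), 2):
        _, elem_i, xyz_i = atoms[i]
        _, elem_j, xyz_j = atoms[j]
        cutoff = COVALENT_RADIUS[elem_i] + COVALENT_RADIUS[elem_j] + tolerance
        if math.dist(xyz_i, xyz_j) <= cutoff:
            bonds.append((i, j))
    return bonds


bonds = infer_bonds(ATOMS)
named = [(ATOMS[i][0], ATOMS[j][0]) for i, j in bonds]
print(named)
```

A distance test finds which atoms are connected but says nothing reliable about bond order, and it cannot tell nitrogen from oxygen, which is why chemical knowledge is still needed.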
Example Protein Data Bank file (pdb)
(bonds are stored as an adjacency list, in the CONECT records)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
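Because the PDB format is column-based, a reader slices each record at fixed character positions rather than splitting on spaces. The following is a sketch only: `parse_atom_record` is a hypothetical helper, and the two sample records are written out with the standard column layout and with decimal points in the coordinates.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM record using the fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),  # the temperature factor
    }


LINES = [
    "ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41",
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34",
]

atoms = [parse_atom_record(line) for line in LINES]
for atom in atoms:
    print(atom["record"], atom["name"], atom["xyz"])
```

This is what "the information in exactly the right place" means in practice: if a single space is lost, every later field is read from the wrong columns.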
What if we want to vary the atom positions, eg in driving a
reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead
we use the natural angles and distances
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (eg modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules
Bond length
Bond angle
Torsion angle
Out-of-plane bending, eg a carbonyl
Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only
2. For the second atom, give the atomic symbol, the number 1, and the
name of a variable to describe the distance between atoms 1 and 2
3. For the third atom, give the atomic symbol, the atom number NA, the
name of a variable to describe the distance between the current atom
and NA, the atom number NB, and the name of a variable to describe the
angle between the current atom, NA and NB
4. For all later atoms, give the atomic symbol, the atom number NA, the
name of a variable to describe the distance between the current atom
and NA, the atom number NB, the name of a variable to describe the angle
between the current atom, NA and NB, the atom number of another
previously defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA, NB and NC
5. After all the atoms have been listed, enter a blank line
6. Next, list each variable with its corresponding value. Use a
separate line for each variable
7. In some cases, where some of the variables are to be fixed as
constants in a geometry optimisation, they are listed here after a
blank line rather than above with the real variables
8. End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure labels: z-matrix; 'improper' torsion angles]
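To see how a program turns internal coordinates back into xyz positions, here is a minimal sketch. It assumes each row has already had its variables substituted by numbers and is written as (symbol, NA, distance, NB, angle, NC, dihedral) with 1-based atom references and `None` for unused entries. The function names are hypothetical, and the placement rule (first atom at the origin, second on the z axis, third in the xz plane) is one common convention rather than the only one.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    length = math.sqrt(sum(x * x for x in a))
    return tuple(x / length for x in a)


def place(a, b, c, r, angle_deg, dihedral_deg):
    """Place a new atom at distance r from c, with the given angle (new, c, b)
    and dihedral (new, c, b, a)."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))  # normal to the a-b-c plane
    m = _cross(n, bc)
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


def zmatrix_to_cartesian(rows):
    """Convert numeric Z-matrix rows to a list of (x, y, z) tuples."""
    xyz = []
    for k, (_symbol, na, r, nb, ang, nc, dih) in enumerate(rows):
        if k == 0:
            xyz.append((0.0, 0.0, 0.0))
        elif k == 1:
            xyz.append((0.0, 0.0, r))
        elif k == 2:
            c, b = xyz[na - 1], xyz[nb - 1]
            ref = (b[0] + 1.0, b[1], b[2])  # arbitrary off-axis point fixes the plane
            xyz.append(place(ref, b, c, r, ang, 0.0))
        else:
            xyz.append(place(xyz[nc - 1], xyz[nb - 1], xyz[na - 1], r, ang, dih))
    return xyz


# Water, with l1 = 0.96 and a1 = 104.0 substituted in.
WATER = [
    ("O", None, None, None, None, None, None),
    ("H", 1, 0.96, None, None, None, None),
    ("H", 1, 0.96, 2, 104.0, None, None),
]

for (symbol, *_), position in zip(WATER, zmatrix_to_cartesian(WATER)):
    print(symbol, tuple(round(v, 4) for v in position))
```

Changing a single number, such as the angle, moves the atom along a chemically meaningful path, which is what makes internal coordinates convenient for driving a reaction coordinate.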
What about comprehensive properties of molecules? They are more than xyz
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, eg the units of measurement, the date the measurement was made, the relationship to other data etc
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future, much more information will be stored with molecules, allowing greater re-use of data
• See Chemical Markup, XML, and the World-Wide Web. Part I: Basic principles, P Murray-Rust and H S Rzepa, J Chem Inf Comp Sci 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
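A rough sketch of what "can be parsed" means, using Python's standard XML library. The element and attribute names follow common CML conventions (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `atomRefs2`), but the `property` element with its `title` and `units` attributes is a simplified, hypothetical stand-in for how CML really records properties, so treat this as CML-like rather than schema-valid CML.

```python
import xml.etree.ElementTree as ET

# Build a CML-like description of ethanol (heavy atoms only).
molecule = ET.Element("molecule", id="ethanol")
ET.SubElement(molecule, "name").text = "ethanol"

atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

bond_array = ET.SubElement(molecule, "bondArray")
for first, second in [("a1", "a2"), ("a2", "a3")]:
    ET.SubElement(bond_array, "bond", atomRefs2=f"{first} {second}", order="1")

# Metadata travels with the value: the units are stored next to the number.
boiling_point = ET.SubElement(molecule, "property", title="boiling point", units="degC")
boiling_point.text = "78.4"

xml_text = ET.tostring(molecule, encoding="unicode")

# Any XML parser can now read the structure back without chemistry-specific code.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(xml_text)
print(elements)
```

Because the file is extensible, further elements (reactions, spectra, links to other molecules) can be added without breaking programs that only read the atoms and bonds.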
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat, because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, eg the Cambridge
Structural Database or the Protein Data Bank, PDB)
• From these can be obtained atom positions, bonds, coordinates etc
• There are a number of 3D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3D molecule), which
use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions)
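The cost of scanning torsional angles grows very quickly with the number of rotatable bonds. A small sketch of a systematic grid scan (the function names and the step sizes are illustrative assumptions; a real search would also build and score each conformer):

```python
from itertools import product


def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_deg):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_deg) ** n_torsions


# Two torsions at 120 degree steps: small enough to list in full.
print(list(torsion_grid(2, 120)))

# Five torsions at 30 degree steps: already a very large number of conformers.
print(grid_size(5, 30))
```

This growth is why practical conformer generators rely on rules, random sampling or pruning rather than an exhaustive scan.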
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats, which could be a real pain, but
they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However,
in the near future there are likely to be many competing
requirements for file content
Molecules on computers: things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect: check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc may be inferred but are not observed in the file)
Next lecture
• How can we use this type of information to
solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes)
1-Dimensional - Line notations: a string representation of
molecules. Here are three examples of different line notations;
SMILES is the most useful and widely used. These are
'strings', all of the same molecule
• Line notations
– WLN
• L66J BMR& DSWQ IN1&1
– ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard
typewriter keyboard
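Because a line notation is only a string, simple checks can be done with ordinary string handling. The sketch below is a partial syntax check, not a SMILES parser: `check_smiles_syntax` is a hypothetical helper that verifies branches (parentheses) are balanced and that every single-digit ring-closure label is opened and closed, while skipping anything inside square brackets. It assumes no two-digit `%nn` ring closures and does not check valence or chemistry.

```python
def check_smiles_syntax(smiles):
    """Return True if brackets and branches balance and ring-closure digits pair up."""
    depth = 0            # current nesting depth of branches "( ... )"
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # inside a bracket atom such as [CH] or [C]
    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue  # digits inside brackets are not ring closures
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first sighting opens the ring, second closes it
    return depth == 0 and not open_rings and not in_bracket


dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(check_smiles_syntax(dye))
print(check_smiles_syntax("c1ccccc"))
```

A string that passes this check can still be chemically meaningless, so real software always parses SMILES into a molecular graph.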
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the
computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by as a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simpe graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
Next lecture
bull How can we use this type of information to
solve our chemistry problems
ndash Finding the right compound
ndash Designing compounds
ndash Searching for compound data
ndash Patents
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChI
• A more recent line notation is called InChI.
• It is intended to address some of the problems of SMILES, e.g. polymers
and materials that SMILES does not cover.
• InChI is generated by computer algorithms and is virtually
uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier – each
molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from InChI
to structure, from different names and formats (e.g. RSC ChemSpider).
Again, a string like this is easily matched on a computer.
Storing chemical diagrams on computers
• The valence model of a molecule can be represented as a
chemical graph. A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes.
• The spatial positions of the nodes, the lengths of the edges and any
crossings are irrelevant. Generally we ignore hydrogens unless
tautomerism or pKa is an issue. Computers handle graphs very
well, and molecules represented like this are examples of labelled
graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and
aromaticity, stereochemistry, tautomerism, non-stoichiometric
compounds etc. are often problematic. Using a simple graph, the computer
would deduce that the two canonical (Kekulé) structures of an aromatic
ring are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic
bond.
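The aromaticity problem can be shown in a few lines. This is a minimal sketch, assuming a plain dictionary representation of a labelled graph; the helper names `kekule_benzene` and `aromatise` are hypothetical and chosen only for illustration.

```python
def kekule_benzene(first_double):
    """Benzene as a labelled graph: atom labels plus bond orders.

    first_double (0 or 1) selects which of the two Kekulé forms is built.
    """
    atoms = {i: "C" for i in range(1, 7)}
    bonds = {}
    for i in range(6):
        a, b = i + 1, (i + 1) % 6 + 1          # ring neighbours 1-2, 2-3, ... 6-1
        bonds[frozenset((a, b))] = 2 if i % 2 == first_double else 1
    return atoms, bonds

def aromatise(bonds):
    """Relabel every bond as aromatic (valid here only because all six
    bonds belong to the benzene ring)."""
    return {pair: "ar" for pair in bonds}

form_a = kekule_benzene(0)
form_b = kekule_benzene(1)

# Compared as simple labelled graphs, the two drawings look like different molecules.
print(form_a == form_b)                                     # False
# With an aromatic bond type they become the same graph.
print(aromatise(form_a[1]) == aromatise(form_b[1]))         # True
```

Real toolkits detect aromatic rings automatically rather than relabelling every bond, but the effect on comparison is the same.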
An example of storing the molecular graph: the SD file format
MDL SD (structure data) files store the information about each molecule
in the following order:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$),
indicating the end of information on this molecule
This is probably the most common format for storing small molecules – SD files
are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it goes wrong).
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a
molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape)
of a molecule, so storing it in 3D would be pointless. Look at
the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
The "counts line" has the number of atoms and bonds as a minimum.
The "atom block" has the x, y, z coordinates of each atom and its element as a minimum.
The "bond block" has one line for each bond: from atom, to atom, bond type.
"M  END" and "$$$$" identify the end of this molecule.
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
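The counts line, atom block, bond block and charge codes described above can be read with very little code. The following is a minimal sketch, not a full reader: it assumes a single V2000 record, it splits on whitespace rather than using the exact fixed-width columns (which only works for small molecules), and the function name `parse_molfile` is an illustrative choice.

```python
BENZENE_MOLFILE = """benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

# Charge codes from the atom block: code -> formal charge
# (code 4 is a doublet radical, which carries no formal charge).
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header line 1: molecule name
    counts = lines[3].split()                    # counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block
        fields = line.split()
        x, y, z = (float(v) for v in fields[:3])
        charge = CHARGE_CODES[int(fields[5])]    # fields[4] is the mass difference
        atoms.append((fields[3], x, y, z, charge))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

The bond list that comes back is exactly the labelled graph discussed earlier: three single and three double bonds around a six-membered ring.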
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the
positions of the atoms and infer the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can't always determine the atom positions exactly, so the positions are
stored and the bonds inferred. The format for storing these is a bit different.
[Figure: electron density map. Which way round should this ring go?]
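Inferring bonds from positions is usually done with a distance rule. The sketch below is one common heuristic, not necessarily what any particular program does: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The radii are approximate literature values, the 0.4 Å tolerance is an assumption, and the coordinates are those of the sulphate ion in the PDB example in these notes, read with three decimal places.

```python
import math
from itertools import combinations

# Approximate covalent radii in Å (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # Å of slack added to the summed radii (assumption)

def infer_bonds(atoms):
    """atoms: list of (element, (x, y, z)). Returns index pairs judged bonded."""
    bonds = []
    for (i, (el_i, p_i)), (j, (el_j, p_j)) in combinations(enumerate(atoms), 2):
        limit = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(p_i, p_j) <= limit:
            bonds.append((i, j))
    return bonds

sulphate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
print(infer_bonds(sulphate))  # [(0, 1), (0, 2), (0, 3)]
```

The three S-O distances come out near 1.45 Å and are accepted, while the O...O distances of about 2.4 Å are rejected. A distance rule gives connectivity only; bond orders, hydrogens and ring orientation still have to come from chemical knowledge.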
Example Protein Data Bank (PDB) file
(bonds are listed in CONECT records, i.e. as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including x, y, z, occupancy and temperature factor
HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
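PDB is a good illustration of a "format" in the strict sense: each field lives in fixed character columns, so a reader slices the line rather than splitting on spaces. The sketch below assumes the usual column positions for ATOM and HETATM records and reads only the fields discussed here; the function name `parse_atom_record` is illustrative.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM line into its fixed-width fields."""
    return {
        "record": line[0:6].strip(),      # ATOM or HETATM
        "serial": int(line[6:11]),        # atom serial number
        "name": line[12:16].strip(),      # atom name, e.g. CA
        "res_name": line[17:20].strip(),  # residue name, e.g. VAL
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue sequence number
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # temperature factor
    }

records = [
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28",
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34",
]
for line in records:
    atom = parse_atom_record(line)
    print(atom["record"], atom["serial"], atom["name"], atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace would often work for lines like these, but it fails as soon as two wide numbers run together or an optional field is blank, which is why the column positions matter.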
What if we want to vary the atom positions, e.g. in driving a
reaction coordinate?
Using Cartesian (x, y, z) coordinates is very cumbersome, so instead
we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes used graph paper, drew the molecule and read off a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the
name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the
name of a variable to describe the distance between the current atom
and NA, the atom number NB, and the name of a variable to describe the
angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the
name of a variable to describe the distance between the current atom
and NA, the atom number NB, the name of a variable to describe the
angle between the current atom, NA and NB, the atom number of another
previously defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a
separate line for each variable.
7. In some cases, where some of the variables are to be fixed as
constants in a geometry optimisation, they are listed here after a
blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
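The construction rules above can be turned round: given a Z-matrix, each atom is placed from the atoms already defined. The following is a minimal sketch of that conversion, assuming the first atom sits at the origin, the second on the z axis and the third in the xz plane; the function names are illustrative, and it handles only the simple variable syntax shown here (a literal number, a name, or a name with a leading minus sign).

```python
import math

def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _add(a, b): return tuple(x + y for x, y in zip(a, b))
def _scale(a, s): return tuple(x * s for x in a)
def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def _unit(a): return _scale(a, 1.0 / math.sqrt(sum(x * x for x in a)))

def _place(na, nb, nc, r, theta, phi):
    """New atom at distance r from na, angle theta to nb, dihedral phi to nc."""
    bc = _unit(_sub(na, nb))
    n = _unit(_cross(_sub(nb, nc), bc))
    m = _cross(n, bc)
    offset = _add(_add(_scale(bc, -r * math.cos(theta)),
                       _scale(m, r * math.sin(theta) * math.cos(phi))),
                  _scale(n, r * math.sin(theta) * math.sin(phi)))
    return _add(na, offset)

def zmatrix_to_cartesian(text):
    """Convert a Gaussian-style Z-matrix to symbols and Cartesian coordinates."""
    blocks = text.strip().split("\n\n")          # atoms, blank line, variables
    variables = {}
    for line in (blocks[1].splitlines() if len(blocks) > 1 else []):
        var_name, number = line.split()
        variables[var_name] = float(number)

    def value(token):
        try:
            return float(token)                   # literal number
        except ValueError:
            if token.startswith("-"):
                return -variables[token[1:]]      # e.g. -da1
            return variables[token]

    symbols, coords = [], []
    for line in blocks[0].splitlines():
        f = line.split()
        symbols.append(f[0])
        if not coords:                            # atom 1: origin
            coords.append((0.0, 0.0, 0.0))
        elif len(coords) == 1:                    # atom 2: on the z axis
            coords.append((0.0, 0.0, value(f[2])))
        else:
            na, nb = coords[int(f[1]) - 1], coords[int(f[3]) - 1]
            r, theta = value(f[2]), math.radians(value(f[4]))
            if len(coords) == 2:                  # atom 3: in the xz plane
                along = _scale(_unit(_sub(nb, na)), r * math.cos(theta))
                coords.append(_add(na, _add(along, (r * math.sin(theta), 0.0, 0.0))))
            else:                                 # later atoms need a dihedral
                nc = coords[int(f[5]) - 1]
                coords.append(_place(na, nb, nc, r, theta, math.radians(value(f[6]))))
    return symbols, coords

WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""
symbols, coords = zmatrix_to_cartesian(WATER)
for symbol, (x, y, z) in zip(symbols, coords):
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a single variable, such as `a1`, and converting again moves the atoms consistently, which is exactly why internal coordinates are convenient for driving a reaction coordinate.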
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
Z-matrix: 'improper' torsion angles
What about comprehensive properties of molecules? They are more than x, y, z.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci., 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
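As a rough illustration of "can be parsed", here is ethanol written in a simplified CML-style layout and read back with Python's standard XML library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `order`) follow CML conventions, but the fragment omits the namespace and is not claimed to be schema-valid; the `scalar` element with its `title` and `units` attributes is included only to show how metadata travels with a value.

```python
import xml.etree.ElementTree as ET

ETHANOL_CML = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <scalar title="molar mass" units="g/mol">46.07</scalar>
</molecule>"""

root = ET.fromstring(ETHANOL_CML)

# Atoms: id -> element symbol
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: (first atom id, second atom id, bond order)
bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
         for bond in root.iter("bond")]

# A property together with its metadata (what it is and its units)
mass = root.find("scalar")

print(atoms)
print(bonds)
print(mass.get("title"), mass.text, mass.get("units"))
```

Because every value is wrapped in a named element, a program that does not know about `scalar` can still read the atoms and bonds and ignore the rest, which is what makes the format extensible.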
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat because of thermal fluctuations. So we represent 3D
molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge Structural
Database or the Protein Data Bank – PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats, which could be a real pain, but
they can be converted:
alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file
alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a 'standard' format. However,
in the near future there are likely to be many competing
requirements for file content.
Molecules on computers – things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect – check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc. may be inferred but are not recorded in the file)
Next lecture
• How can we use this type of information to
solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. Using a simple graph, the computer would deduce that two canonical structures of the same molecule (e.g. the two Kekulé forms of benzene) are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
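A minimal sketch of the problem and the fix, assuming fixed atom numbering and a ring that we already know to be aromatic (real software has to perceive aromaticity and compare graphs up to renumbering). The helper names `bond_set` and `aromatise_ring` are made up for illustration.

```python
def bond_set(bonds):
    """Order-independent view of a labelled graph's edges: (atom, atom, bond order)."""
    return frozenset((min(a, b), max(a, b), order) for a, b, order in bonds)


def aromatise_ring(bonds, ring_atoms):
    """Relabel every bond inside the given ring with the order 'ar' (aromatic)."""
    ring = set(ring_atoms)
    return [(a, b, "ar" if a in ring and b in ring else order) for a, b, order in bonds]


# The two Kekule forms of benzene: same atoms, double bonds shifted by one.
kekule_a = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
kekule_b = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 1, 2)]

differ_as_simple_graphs = bond_set(kekule_a) != bond_set(kekule_b)
match_with_aromatic_bonds = (
    bond_set(aromatise_ring(kekule_a, range(1, 7)))
    == bond_set(aromatise_ring(kekule_b, range(1, 7)))
)
```

With single and double bonds stored literally the two drawings look like different graphs; once the ring bonds carry one shared "aromatic" label they compare equal.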
An example of storing the molecular graph: the SD file format
MDL SD (structure data) files store the information about each molecule in the following order:
Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format for storing small molecules – SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
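As a rough sketch of how a program walks through an SD file, the code below splits the text on the `$$$$` terminator and collects the optional data items, which are written as a `> <NAME>` line followed by the value. The two records, their names (`mol_A`, `mol_B`) and their `Volume` values are placeholders invented for the example, and the function names are hypothetical.

```python
SD_TEXT = """\
mol_A
placeholder-program
first placeholder record
  0  0  0  0  0  0  0  0  0  0  1 V2000
M  END
> <Volume>
10.0

$$$$
mol_B
placeholder-program
second placeholder record
  0  0  0  0  0  0  0  0  0  0  1 V2000
M  END
> <Volume>
20.0

$$$$
"""


def split_sd_records(text):
    """Split SD-file text into one list of lines per molecule, using '$$$$'."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def read_data_items(record):
    """Collect '> <NAME>' data items that follow the 'M  END' line."""
    start = next(i for i, line in enumerate(record) if line.split() == ["M", "END"]) + 1
    data, name = {}, None
    for line in record[start:]:
        if line.startswith(">"):
            name = line[line.index("<") + 1 : line.rindex(">")]
            data[name] = []
        elif name is not None and line.strip():
            data[name].append(line.strip())
        else:
            name = None  # a blank line closes the current data item
    return {key: "\n".join(value) for key, value in data.items()}


records = split_sd_records(SD_TEXT)
first_properties = read_data_items(records[0])
```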
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the x, y, z coordinates of each atom and its element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M  END" and "$$$$": these identify the end of this molecule
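The benzene record above can be read back with a few lines of Python. This is only a sketch: it splits on whitespace for brevity, whereas the real format is fixed-width (a production reader slices columns, because large counts can run together), and the function names are invented for the example. The coordinates are written with explicit decimal points.

```python
import math

BENZENE_MOLFILE = """\
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a V2000 connection table."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()          # counts line: atoms first, then bonds
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4 : 4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms : 4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


def bond_length(atoms, a, b):
    """Distance between atoms a and b (numbered from 1, as in the file)."""
    return math.dist(atoms[a - 1][1:], atoms[b - 1][1:])


name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
n_double = sum(1 for _, _, order in bonds if order == 2)
```

Note that the file stores one Kekulé form (alternating bond types 1 and 2); nothing in it says "aromatic" unless the writing program chose to use an aromatic bond type.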
Alanine SD file
Bond length is 1.53 Å
Atom block columns: x y z, symbol, mass difference, charge, stereo, H-count, …
Charge codes: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
Molecule is chiral
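The charge column is a code, not the charge itself, so a reader has to translate it. A small sketch of that lookup, using the table above (the function name `decode_charge` is made up):

```python
# Atom-block charge codes as listed above; code 4 marks a doublet radical
# and carries no formal charge.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 4: 0, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code not in CHARGE_CODES:
        raise ValueError(f"unknown charge code {code}")
    return CHARGE_CODES[code], code == 4
```

For example, an ammonium nitrogen would carry code 3 (charge +1) and a carboxylate oxygen code 5 (charge -1).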
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
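One common way to impute bonds is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes single-bond covalent radii of roughly 0.76 Å (C), 0.71 Å (N), 0.66 Å (O) and a tolerance of 0.4 Å; both the tolerance and the function name `infer_bonds` are choices made for this illustration. The coordinates are the first five atoms of a valine residue from a protein structure.

```python
import math
from itertools import combinations

COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}  # angstroms
TOLERANCE = 0.4  # assumed slack, in angstroms

# (atom name, element, (x, y, z))
VALINE_ATOMS = [
    ("N", "N", (3.036, -1.035, -3.538)),
    ("CA", "C", (4.283, -0.343, -3.015)),
    ("C", "C", (4.565, 1.067, -3.593)),
    ("O", "O", (4.653, 1.287, -4.853)),
    ("CB", "C", (5.510, -1.245, -3.141)),
]


def infer_bonds(atoms):
    """Return pairs of atom names whose separation is within the bonding cutoff."""
    bonds = []
    for (name_a, el_a, xyz_a), (name_b, el_b, xyz_b) in combinations(atoms, 2):
        cutoff = COVALENT_RADIUS[el_a] + COVALENT_RADIUS[el_b] + TOLERANCE
        if math.dist(xyz_a, xyz_b) <= cutoff:
            bonds.append((name_a, name_b))
    return bonds


valine_bonds = infer_bonds(VALINE_ATOMS)
```

A test like this recovers connectivity but not bond orders or hydrogens, which is one reason imputed bonds are not always correct.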
Which way round should this ring go?
Example Protein Data Bank file (pdb)
(bonds are stored as an adjacency list in the CONECT records)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION. 1.80 ANGSTROMS.
REMARK 200 TEMPERATURE (KELVIN) : 113
REMARK 200 PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including x, y, z, occupancy and temperature factor
HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
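PDB records are fixed-width: each field lives in a set range of columns, so a reader slices the line rather than splitting on spaces. A minimal sketch, covering only the fields discussed above (the function names are invented, and the two sample lines are the CA atom and the sulfate sulfur from the 3S7A listing, laid out in the standard columns):

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM record by its fixed columns (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


def parse_conect_line(line):
    """Return (atom serial, [bonded serials]); serials sit in 5-column fields."""
    fields = [line[i : i + 5] for i in range(6, len(line), 5)]
    serials = [int(f) for f in fields if f.strip()]
    return serials[0], serials[1:]


ca = parse_atom_line("ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41")
sulfur = parse_atom_line("HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34")
centre, neighbours = parse_conect_line("CONECT 1510 1511 1512 1513 1514")
```

The CONECT line reads as "atom 1510 is bonded to 1511, 1512, 1513 and 1514", which is one row of an adjacency list for the sulfate ion.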
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules – people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of a molecule, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
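To see that the water Z-matrix above really fixes the geometry, here is a sketch that turns a three-atom Z-matrix into Cartesian coordinates. It handles only this special case (no dihedral angles), places atom 1 at the origin and atom 2 on the x axis, which is an arbitrary choice of frame, and the function name is made up.

```python
import math


def three_atom_zmatrix_to_xyz(r12, r13, angle_213_deg):
    """Atom 1 at the origin, atom 2 on +x, atom 3 bonded to atom 1 in the xy-plane."""
    theta = math.radians(angle_213_deg)
    return [
        (0.0, 0.0, 0.0),
        (r12, 0.0, 0.0),
        (r13 * math.cos(theta), r13 * math.sin(theta), 0.0),
    ]


# Water: O, then H at distance l1 from O, then H at l1 from O with angle a1 to the first H.
oxygen, h1, h2 = three_atom_zmatrix_to_xyz(0.96, 0.96, 104.0)
h_h_distance = math.dist(h1, h2)  # the non-bonded H...H separation
```

Changing a single internal coordinate (say `a1`) moves the atoms in a chemically meaningful way, which is why Z-matrices are convenient for driving a reaction coordinate.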
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
Z-matrix: 'improper' torsion angles
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See "Chemical Markup, XML and the World-Wide Web. Part I: Basic Principles", P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
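To make the idea of metadata concrete, the sketch below builds a small CML-style description of ethanol with Python's standard XML library and reads it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`) follow CML conventions as far as I recall them, but the document is simplified and has not been validated against the CML schema; the `property`/`scalar` part in particular is illustrative only.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a simplified, CML-style element tree for ethanol (heavy atoms only)."""
    mol = ET.Element("molecule", id="ethanol")
    atoms = ET.SubElement(mol, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
    bonds = ET.SubElement(mol, "bondArray")
    for refs in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=refs, order="1")
    # Metadata travels with the value: the unit is stored next to the number.
    prop = ET.SubElement(mol, "property", title="molar mass")
    scalar = ET.SubElement(prop, "scalar", units="g/mol")
    scalar.text = "46.07"
    return mol


xml_text = ET.tostring(ethanol_cml(), encoding="unicode")
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.findall("atomArray/atom")]
mass = parsed.find("property/scalar")
```

Unlike the fixed-column formats, nothing here depends on position in the file: a program finds the data by tag name, so new kinds of information can be added without breaking older readers.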
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank – PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
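A systematic torsion scan is easy to write down but grows very quickly with the number of rotatable bonds. The sketch below only enumerates the grid of torsion-angle combinations (it does not build or score any geometry); the function names and the step sizes are choices made for the illustration.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid, in degrees."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_deg):
    """Number of conformations a full grid scan would have to evaluate."""
    return len(range(0, 360, step_deg)) ** n_torsions


two_bond_scan = list(torsion_grid(2, 120))  # 3 angles per bond, 2 bonds
five_bond_count = grid_size(5, 30)          # 12 angles per bond, 5 bonds
```

Two torsions at 120° steps give 9 combinations, comparable to a coarse Ramachandran plot; five torsions at 30° steps already give 12^5 = 248832, which is why practical conformer generators prune or sample the space rather than scan it exhaustively.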
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
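The "film strip" idea can be sketched with the simple XYZ format, where each frame is an atom count, a comment line and one line per atom, and frames are just written one after another. The two-frame trajectory below (an O–H bond stretching slightly) uses made-up coordinates, and the function names are hypothetical; real molecular dynamics packages use their own, usually binary, trajectory formats.

```python
def frames_to_xyz(symbols, frames):
    """Write frames as concatenated XYZ blocks: count, comment, then atom lines."""
    out = []
    for index, coords in enumerate(frames):
        out.append(str(len(symbols)))
        out.append(f"frame {index}")
        for symbol, (x, y, z) in zip(symbols, coords):
            out.append(f"{symbol} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(out) + "\n"


def xyz_to_frames(text):
    """Read concatenated XYZ blocks back into a list of coordinate lists."""
    lines = text.splitlines()
    frames, pos = [], 0
    while pos < len(lines):
        n_atoms = int(lines[pos])
        block = lines[pos + 2 : pos + 2 + n_atoms]  # skip count and comment lines
        frames.append([tuple(float(v) for v in line.split()[1:4]) for line in block])
        pos += 2 + n_atoms
    return frames


symbols = ["O", "H"]
trajectory = [
    [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0)],  # illustrative frame 0
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],   # illustrative frame 1
]
movie_text = frames_to_xyz(symbols, trajectory)
recovered = xyz_to_frames(movie_text)
```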
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats – which could be a real pain, but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins, only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect – check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by as a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simpe graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format (an atom followed by the atoms it is connected to):
Connect 3 2 5 4
e.g. SD file format (reduced form, one line per bond: from atom, to atom, bond order):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the
chemical diagram, e.g. to find a
molecule that matches a
chemical structure search.
• It is often the case that we don't
know the conformation (shape)
of a molecule, so storing it in
3D would be pointless. Look at
the changes in conformation in
this molecule at room
temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
• First lines: molecule name and information on this molecule
• Comment line (e.g. the date is used here)
• "Counts line": has the number of atoms and bonds as a minimum
• "Atom block": has the x, y, z coordinates of the atom and the element as a minimum
• "Bond block": one line for each bond (from atom, to atom, bond type)
• The final lines identify the end of this molecule
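To see how a program reads such a file, here is a minimal sketch of a reader for the benzene record. It is not a full SD file parser: the function name `parse_molfile` is made up, and for brevity it splits each line on whitespace, whereas the real V2000 format is column-based (fixed-width fields), so a whitespace split can fail on larger molecules.

```python
BENZENE_MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a single-molecule record."""
    lines = text.splitlines()
    name = lines[0].strip()                 # header line 1: molecule name
    counts = lines[3].split()               # counts line follows 3 header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []                              # (symbol, x, y, z)
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []                              # (from atom, to atom, bond type)
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(field) for field in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
print(name, len(atoms), len(bonds))
```

Note how everything depends on position: the reader finds the counts line only because it knows exactly three header lines come first.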
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
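The charge codes listed above are easy to get wrong by hand, so a reader program usually holds them in a lookup table. A small sketch (the names `CHARGE_CODES` and `decode_charge` are made up for this example):

```python
# Charge field code -> formal charge, as listed in the table above
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Translate an atom-line charge code into a formal charge.

    Code 4 marks a doublet radical rather than a charge, so it is
    returned as a label instead of a number.
    """
    if code == 4:
        return "doublet radical"
    return CHARGE_CODES[code]


print(decode_charge(3))  # 1
print(decode_charge(5))  # -1
```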
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the
positions of the atoms and infer the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can't always determine the atom positions exactly, so the positions are
stored and the bonds inferred. The format for storing these is a bit different.
Which way round should this ring go?
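One common way to infer bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus some slack. The sketch below assumes this rule; the function name `infer_bonds`, the 0.4 Å tolerance and the small radius table are choices made for this example, and real programs use more careful rules.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # assumed slack in angstroms; not a standard value


def infer_bonds(atoms):
    """Guess bonds from coordinates.

    `atoms` is a list of (symbol, (x, y, z)). Returns pairs of 0-based
    atom indices judged to be bonded.
    """
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        limit = COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= limit:
            bonds.append((i, j))
    return bonds


# Water: O-H = 0.96 angstroms, H-O-H about 104 degrees
water = [
    ("O", (0.0, 0.0, 0.0)),
    ("H", (0.0, 0.0, 0.96)),
    ("H", (0.9315, 0.0, -0.2322)),
]
print(infer_bonds(water))  # [(0, 1), (0, 2)]
```

The two hydrogens are about 1.5 Å apart, well beyond the H-H limit, so no bond is drawn between them. Note that a distance test says nothing about bond order.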
Example protein databank file (pdb)
(bonds are stored as an adjacency list in the CONECT records)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including x, y, z, occupancy and temperature factor
• HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
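PDB files are column-based: each field of an ATOM or HETATM record lives at a fixed character position. The sketch below reads the main fields by slicing. The helper names `parse_atom_record` and `format_atom_record` are made up for this example, only a subset of the fields is handled, and the writer is included just so the test line has its columns in the right place.

```python
def parse_atom_record(line):
    """Read the main fields of an ATOM/HETATM line by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


def format_atom_record(record, serial, name, res_name, chain, res_seq,
                       x, y, z, occupancy, b_factor):
    """Write the same fields back out in fixed columns."""
    # Short atom names conventionally start one column in
    name_field = name if len(name) == 4 else f" {name:<3}"
    return (f"{record:<6}{serial:>5} {name_field} {res_name:>3} "
            f"{chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}")


line = format_atom_record("ATOM", 1, "N", "VAL", "A", 1,
                          3.036, -1.035, -3.538, 1.00, 31.28)
atom = parse_atom_record(line)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Slicing by column, rather than splitting on spaces, matters here because neighbouring fields can run together when the numbers are large.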
What if we want to vary the atom
positions, e.g. in driving a
reaction coordinate?
Using Cartesian (xyz)
coordinates is very
cumbersome, so instead
we use the natural angles
and distances.
This uses internal coordinates
• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. when modelling a reaction)
- An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates of an example molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol
only.
2. For the second atom, give the atomic symbol, the number
1, and the name of a variable to describe the distance
between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, and
the name of a variable to describe the angle between the
current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, the
name of a variable to describe the angle between the current
atom, NA and NB, the atom number of another previously
defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA,
NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a
separate line for each variable.
7. In some cases, where some of the variables are to be fixed
as constants in a geometry optimisation, they are listed here
after a blank line, rather than above
with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
The Z-matrix uses 'improper' torsion angles
What about comprehensive
properties of molecules? They
are more than xyz
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See "Chemical Markup, XML and the World-Wide Web. Part I. Basic principles", P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
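Because XML "can be parsed", any standard XML library can read it. The sketch below parses a small CML-style description of ethanol with Python's built-in parser. The fragment is simplified and written for this example: the atom and bond elements follow the usual CML style, but the `<property>` element with its `title` and `units` attributes is a made-up stand-in to show metadata, not the exact CML schema.

```python
import xml.etree.ElementTree as ET

# Simplified, CML-style fragment (heavy atoms only); illustrative, not schema-exact
ETHANOL_CML = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="celsius">78.4</property>
</molecule>"""

root = ET.fromstring(ETHANOL_CML)

# Atoms: id -> element symbol
cml_atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}

# Bonds: (first atom id, second atom id, bond order)
cml_bonds = [tuple(b.get("atomRefs2").split()) + (int(b.get("order")),)
             for b in root.iter("bond")]

# Metadata travels with the value: here the units are part of the data
prop = root.find("property")
print(cml_atoms)
print(cml_bonds)
print(prop.get("title"), float(prop.text), prop.get("units"))
```

The point of the design is that new elements can be added later without breaking programs that only look for atoms and bonds.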
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat, because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge
Structural Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3-D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular
properties
• Comparison of the shapes (conformations)
of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats,
which could be a real pain, but
they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However,
in the near future there are likely to be many competing
requirements for file content.
Molecules on computers - things to look out for,
since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc. may be inferred but are not observed in the file).
Next lecture
• How can we use this type of information to
solve our chemistry problems?
- Finding the right compound
- Designing compounds
- Searching for compound data
- Patents
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including x, y, z, occupancy and temperature factor
HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
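Because PDB records are fixed-width, each field is read by column position rather than by splitting on spaces. The sketch below parses one ATOM or HETATM record using the standard column layout; the function name `parse_atom_record` and the dictionary keys are illustrative choices, and only the fields discussed above are extracted.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM record using the fixed column layout of the PDB format."""
    return {
        "record": line[0:6].strip(),         # columns 1-6
        "serial": int(line[6:11]),           # columns 7-11
        "name": line[12:16].strip(),         # columns 13-16
        "res_name": line[17:20].strip(),     # columns 18-20
        "chain": line[21],                   # column 22
        "res_seq": int(line[22:26]),         # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
        "occupancy": float(line[54:60]),     # columns 55-60
        "temp_factor": float(line[60:66]),   # columns 61-66
    }


EXAMPLE_ATOM = "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28"
EXAMPLE_HETATM = "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34"

if __name__ == "__main__":
    print(parse_atom_record(EXAMPLE_ATOM))
    print(parse_atom_record(EXAMPLE_HETATM))
```

Splitting on whitespace would work for these two lines but fails on real files where adjacent fields touch (for example large negative coordinates), which is why column slices are used.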
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules – sometimes used graph paper: draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix: 'improper' torsion angles
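To see how a Z-matrix maps back to Cartesian coordinates, the sketch below places atoms one at a time from their distance, angle and dihedral. It is a minimal illustration, not Gaussian's own code: the function name `zmatrix_to_cartesian` is made up, the first atom is put at the origin and the second along z (an arbitrary choice of frame), and the sign convention for dihedrals is an assumption that may differ from a given program's.

```python
import math


def _sub(p, q):
    return [p[i] - q[i] for i in range(3)]


def _cross(p, q):
    return [p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0]]


def _unit(p):
    length = math.sqrt(sum(c * c for c in p))
    return [c / length for c in p]


def zmatrix_to_cartesian(text):
    """Convert a Z-matrix (atom lines, blank line, variable lines) to xyz."""
    atom_part, _, variable_part = text.strip().partition("\n\n")
    variables = {}
    for line in variable_part.splitlines():
        if line.strip():
            name, number = line.split()
            variables[name] = float(number)

    def value(token):
        # A token is a variable name, a negated variable name, or a literal number.
        if token in variables:
            return variables[token]
        if token.startswith("-") and token[1:] in variables:
            return -variables[token[1:]]
        return float(token)

    symbols, coords = [], []
    for line in atom_part.splitlines():
        fields = line.split()
        symbols.append(fields[0])
        placed = len(coords)
        if placed == 0:
            coords.append([0.0, 0.0, 0.0])            # first atom at the origin
            continue
        a = coords[int(fields[1]) - 1]                 # atom NA (distance reference)
        r = value(fields[2])
        if placed == 1:
            coords.append([a[0], a[1], a[2] + r])      # second atom along z
            continue
        b = coords[int(fields[3]) - 1]                 # atom NB (angle reference)
        theta = math.radians(value(fields[4]))
        if placed == 2:
            c = [b[0] + 1.0, b[1], b[2]]               # arbitrary point that fixes the plane
            phi = 0.0
        else:
            c = coords[int(fields[5]) - 1]             # atom NC (dihedral reference)
            phi = math.radians(value(fields[6]))
        bc = _unit(_sub(a, b))
        n = _unit(_cross(_sub(b, c), bc))
        m = _cross(n, bc)
        d = [-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi)]
        coords.append([a[i] + d[0] * bc[i] + d[1] * m[i] + d[2] * n[i] for i in range(3)])
    return symbols, coords


WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""

if __name__ == "__main__":
    for symbol, (x, y, z) in zip(*zmatrix_to_cartesian(WATER)):
        print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a single variable (say `a1`) and converting again moves the atoms consistently, which is why internal coordinates are convenient for driving a reaction coordinate.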
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928
Ethanol <CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
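As a rough illustration of what "structure plus metadata" looks like, the sketch below builds a small CML-style document for ethanol (heavy atoms only) with Python's standard XML library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `property`, `scalar`) follow common CML usage, but this output has not been validated against the CML schema, and the units label is a plain-text assumption.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a CML-style XML string for ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")

    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

    bond_array = ET.SubElement(molecule, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")

    # Metadata travels with the value: here the units of the molar mass.
    prop = ET.SubElement(molecule, "property", title="molar mass")
    scalar = ET.SubElement(prop, "scalar", units="g/mol")
    scalar.text = "46.07"

    return ET.tostring(molecule, encoding="unicode")


if __name__ == "__main__":
    print(ethanol_cml())
```

Because the result is ordinary XML, any XML parser can read it back, and further properties or links to other molecules can be added as extra elements without breaking existing readers.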
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
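An InChI string is built from layers separated by "/", each introduced by a letter. The sketch below splits the ethanol InChI into its layers. It is a simplification that assumes each layer letter appears only once, which holds for this example but not for every InChI; the function name `split_inchi` is hypothetical.

```python
def split_inchi(inchi):
    """Split a simple InChI string into version, formula and lettered layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    version, formula, *layers = body.split("/")
    return {
        "version": version,
        "formula": formula,
        # First character names the layer: c = connections, h = hydrogens.
        "layers": {layer[0]: layer[1:] for layer in layers},
    }


if __name__ == "__main__":
    print(split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

For ethanol the connection layer `1-2-3` says atom 1 is bonded to atom 2 and atom 2 to atom 3, and the hydrogen layer says atom 3 carries one H, atom 2 two, and atom 1 three.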
Molecules and 3 Dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat, because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge
Structural Database or the Protein Data Bank – PDB).
• From these we can obtain atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot – but often
in many more torsional angle dimensions).
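The cost of scanning conformational space by torsional rotation grows very quickly with the number of rotatable bonds. The sketch below enumerates a systematic grid of torsion angles; the step size and the function names `torsion_grid` and `grid_size` are illustrative assumptions, and a real search would also build and score each conformer.

```python
import itertools


def torsion_grid(n_torsions, step_degrees=60):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return itertools.product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_degrees=60):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_degrees) ** n_torsions


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, "torsions at 30 degree steps:", grid_size(n, 30), "conformers")
```

With two torsions this is the familiar two-dimensional map; with ten torsions at 30 degree steps the grid already has 12 to the power 10 points, which is why smarter search methods are needed.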
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular Dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
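The "film strip" idea can be sketched by concatenating one block of coordinates per frame. For brevity this example uses the simple XYZ text layout (atom count, comment line, then one line per atom) rather than SD files, and the function name `write_xyz_trajectory` and the coordinates are made up for illustration. Simulation packages typically use their own, more compact trajectory formats.

```python
def write_xyz_trajectory(symbols, frames, comment="frame"):
    """Return a multi-frame XYZ string: one block of coordinates per snapshot."""
    lines = []
    for index, coords in enumerate(frames):
        lines.append(str(len(symbols)))          # number of atoms in this frame
        lines.append(f"{comment} {index}")       # free-text comment line
        for symbol, (x, y, z) in zip(symbols, coords):
            lines.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two illustrative snapshots of a water molecule with a stretched O-H bond.
    frames = [
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.96), (0.93, 0.0, -0.23)],
        [(0.0, 0.0, 0.0), (0.0, 0.0, 1.00), (0.93, 0.0, -0.23)],
    ]
    print(write_xyz_trajectory(["O", "H", "H"], frames), end="")
```

Only the coordinates change between frames, so the bonding needs to be stored once, not per frame.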
So there are very many file formats – which could be a real pain, but they can be converted

Formats Babel can read:
alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

Formats Babel can write:
alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a 'standard' format. However,
in the near future there are likely to be many competing
requirements for file content.
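In practice a conversion is a one-line call to a converter. The sketch below only builds the command line for Open Babel's `obabel` program (input file, then `-O` and the output file, with formats taken from the file extensions) and does not run it. It assumes `obabel` is installed if you do run the command; the helper name `babel_command` and the short `KNOWN_FORMATS` table (a handful of extensions from the lists above) are illustrative.

```python
import os

# A few of the extensions from the lists above (not the full set).
KNOWN_FORMATS = {
    "sdf": "MDL Isis SDF file",
    "mol": "MDL Molfile file",
    "pdb": "PDB file",
    "smi": "SMILES file",
    "mol2": "Sybyl Mol2 file",
    "xyz": "XYZ file",
    "cml": "Chemical Markup Language file",
}


def babel_command(source, target):
    """Return the argument list for converting source to target with obabel."""
    for path in (source, target):
        extension = os.path.splitext(path)[1].lstrip(".").lower()
        if extension not in KNOWN_FORMATS:
            raise ValueError(f"unrecognised file extension: {path}")
    return ["obabel", source, "-O", target]


if __name__ == "__main__":
    # To actually convert, pass this list to subprocess.run(...).
    print(" ".join(babel_command("benzene.sdf", "benzene.pdb")))
```

Conversion can only carry over what both formats are able to store, so information such as bond orders or properties may be lost or guessed on the way.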
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding).
Tautomers can be incorrect – check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc. may be inferred but not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
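Treating the molecule as a labelled graph means the atoms are nodes and the bonds are edges labelled with a bond type. The sketch below turns a list of "Bond i j type" entries like those above into a per-atom neighbour table, which is the same information a "Connect" line gives for one atom. The function name `connection_table` is hypothetical.

```python
from collections import defaultdict


def connection_table(bonds):
    """Build {atom: {neighbour: bond_type}} from (atom_i, atom_j, bond_type) triples."""
    table = defaultdict(dict)
    for atom_i, atom_j, bond_type in bonds:
        # Each bond is stored once in the input but seen from both atoms here.
        table[atom_i][atom_j] = bond_type
        table[atom_j][atom_i] = bond_type
    return {atom: dict(neighbours) for atom, neighbours in table.items()}


if __name__ == "__main__":
    bonds = [(3, 5, 1), (3, 4, 1), (5, 6, 2)]
    for atom, neighbours in sorted(connection_table(bonds).items()):
        print(atom, neighbours)
```

Storing each bond once (the reduced form) is more compact, while the per-atom form makes it quick to ask "what is atom 3 attached to?".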
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a
molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape)
of a molecule – so storing it in 3D would be pointless. Look at
the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has xyz coordinates of the atom and element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
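The layout described above (three header lines, counts line, atom block, bond block) is enough to read the minimum content of a molfile. The sketch below parses the benzene example. It splits on whitespace for brevity, which is a simplification: the real format is fixed-width, and whitespace splitting breaks once atom or bond counts reach three digits. The function name `parse_molfile` is hypothetical.

```python
BENZENE_MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a small V2000 molfile."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                      # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:              # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        from_atom, to_atom, bond_type = (int(v) for v in line.split()[:3])
        bonds.append((from_atom, to_atom, bond_type))
    return name, atoms, bonds


if __name__ == "__main__":
    name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
    print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Note that benzene is stored here as alternating single and double bonds, and that no hydrogen atoms appear in the file at all: both are left for the reading program to interpret.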
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
So there are very many file formats – which could be a real pain, but they can be converted.

Formats that can be read:
alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

Formats that can be written:
alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
File formats
– the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
  – Finding the right compound
  – Designing compounds
  – Searching for compound data
  – Patents
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule – so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. the date is used here)
• "Counts line": has the number of atoms and bonds as a minimum
• "Atom block": has the x, y, z coordinates of the atom and the element as a minimum
• "Bond block": one line for each bond (from atom – to atom – bond type)
• "M  END" and "$$$$": these identify the end of this molecule
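The layout of the benzene record can be made concrete with a minimal reader. This is a sketch, not a full implementation of the format: the function names are invented for illustration, and it splits fields on whitespace as a shortcut, whereas the real V2000 layout is fixed-width (so this shortcut would break once the counts run together, e.g. for 100 or more atoms).

```python
BENZENE_SD = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def split_sd_records(text):
    """Split an SD file into molecule records at the '$$$$' delimiter lines."""
    return [record for record in text.split("$$$$\n") if record.strip()]


def parse_molfile(record):
    """Return (name, atoms, bonds) from one V2000 record.

    atoms: list of (symbol, x, y, z); bonds: list of (from, to, bond type).
    Whitespace splitting is a simplification of the fixed-width format.
    """
    lines = record.splitlines()
    name = lines[0].strip()                       # line 1: molecule name
    n_atoms, n_bonds = (int(v) for v in lines[3].split()[:2])   # counts line
    atoms = []
    for line in lines[4 : 4 + n_atoms]:           # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms : 4 + n_atoms + n_bonds]:     # bond block
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds


def to_xyz(name, atoms):
    """Write the atoms in XYZ format: the bond information is lost."""
    lines = [str(len(atoms)), name]
    lines += [f"{sym} {x:.4f} {y:.4f} {z:.4f}" for sym, x, y, z in atoms]
    return "\n".join(lines) + "\n"


name, atoms, bonds = parse_molfile(split_sd_records(BENZENE_SD)[0])
xyz_text = to_xyz(name, atoms)
```

The `to_xyz` step also shows why format conversion is not always lossless: XYZ has nowhere to put the bond block, so a converter going the other way would have to guess the bonds.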
Alanine SD file
• Bond length is 1.53 Å
• Atom block columns: x y z symbol mass-diff charge stereo h-count …
• Charge: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
• Molecule is chiral
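The charge column is a code rather than the charge itself, which is an easy thing to get wrong when reading an atom block by hand. A small lookup, using exactly the codes listed above, might look like this (the function name and the returned dictionary shape are illustrative choices, not part of the format):

```python
# Charge codes from the atom block, as listed on the slide.
CHARGE_FROM_CODE = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}
DOUBLET_RADICAL_CODE = 4


def decode_charge(code):
    """Translate an atom-block charge code into a formal charge.

    Returns a dict with the formal charge and a flag for the special
    'doublet radical' code. Unknown codes raise ValueError.
    """
    if code == DOUBLET_RADICAL_CODE:
        return {"charge": 0, "doublet_radical": True}
    if code not in CHARGE_FROM_CODE:
        raise ValueError(f"unknown charge code: {code}")
    return {"charge": CHARGE_FROM_CODE[code], "doublet_radical": False}


ammonium_like = decode_charge(3)    # code 3 means +1
carboxylate_like = decode_charge(5)  # code 5 means -1
```

For every code except 0 and 4 the charge is simply 4 minus the code, which is a quick sanity check when debugging a file.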
What if we only know the atom positions and not the bonds?
A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
Which way round should this ring go?
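One common way to impute bonds from positions alone is a distance test: two atoms are called bonded if they are closer than the sum of their covalent radii plus some slack. The sketch below is an assumption-laden illustration of that idea, not the method used by any particular program: the radii are approximate single-bond values, the 0.4 Å tolerance is an arbitrary choice, and the coordinates are the six benzene carbons from the SD file example.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4   # assumed slack in angstroms; a tunable guess


def impute_bonds(atoms):
    """Return 1-based (i, j) pairs for atoms close enough to be bonded.

    atoms: list of (symbol, x, y, z). Only connectivity is inferred;
    bond order (single/double) cannot be recovered from this test.
    """
    bonds = []
    for i, j in combinations(range(len(atoms)), 2):
        sym_i, *pos_i = atoms[i]
        sym_j, *pos_j = atoms[j]
        cutoff = COVALENT_RADIUS[sym_i] + COVALENT_RADIUS[sym_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i + 1, j + 1))
    return bonds


benzene_carbons = [
    ("C", 1.9050, -0.7932, 0.0),
    ("C", 1.9050, -2.1232, 0.0),
    ("C", 0.7531, -0.1282, 0.0),
    ("C", 0.7531, -2.7882, 0.0),
    ("C", -0.3987, -0.7932, 0.0),
    ("C", -0.3987, -2.1232, 0.0),
]
ring_bonds = impute_bonds(benzene_carbons)
```

The test recovers the ring connectivity but says nothing about which bonds are double, and at low resolution the positions themselves are uncertain, which is why imputed bonds in protein structures are not always correct.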
Example protein databank file (pdb)
(bonds are stored as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms – including x, y, z, occupancy and temperature factor
• HETATM: non-protein atoms – including x, y, z, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
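Unlike the SD file, the PDB format is strictly column-based: each field lives at fixed character positions. The sketch below writes and reads ATOM/HETATM lines by column and collects CONECT records into a dictionary. It is a simplified illustration: the class and function names are invented here, only the fields shown on the slide are handled (no alternate locations, insertion codes or element column), and atom names are simply left-justified.

```python
from typing import Dict, List, NamedTuple, Tuple


class PdbAtom(NamedTuple):
    record: str        # "ATOM" (protein) or "HETATM" (non-protein)
    serial: int
    name: str
    residue: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float    # temperature factor


def format_atom(a: PdbAtom) -> str:
    """Write one fixed-column ATOM/HETATM line (66 characters)."""
    return (f"{a.record:<6}{a.serial:>5} {a.name:<4} {a.residue:>3} "
            f"{a.chain}{a.res_seq:>4}    "
            f"{a.x:8.3f}{a.y:8.3f}{a.z:8.3f}{a.occupancy:6.2f}{a.b_factor:6.2f}")


def parse_atom(line: str) -> PdbAtom:
    """Read the same fields back by character position, not by splitting."""
    return PdbAtom(
        record=line[0:6].strip(), serial=int(line[6:11]),
        name=line[12:16].strip(), residue=line[17:20].strip(),
        chain=line[21], res_seq=int(line[22:26]),
        x=float(line[30:38]), y=float(line[38:46]), z=float(line[46:54]),
        occupancy=float(line[54:60]), b_factor=float(line[60:66]),
    )


def parse_pdb(text: str) -> Tuple[List[PdbAtom], Dict[int, List[int]]]:
    """Return the atoms and a {serial: [bonded serials]} adjacency list."""
    atoms: List[PdbAtom] = []
    bonds: Dict[int, List[int]] = {}
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atoms.append(parse_atom(line))
        elif record == "CONECT":
            fields = [line[i:i + 5] for i in range(6, len(line), 5)]
            serials = [int(f) for f in fields if f.strip()]
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return atoms, bonds


# A few atoms taken from the 3S7A excerpt above.
SAMPLE_ATOMS = [
    PdbAtom("ATOM", 1, "N", "VAL", "A", 1, 3.036, -1.035, -3.538, 1.00, 31.28),
    PdbAtom("HETATM", 1510, "S", "SO4", "A", 187, 25.993, -1.362, 5.893, 1.00, 22.34),
    PdbAtom("HETATM", 1511, "O1", "SO4", "A", 187, 25.504, -1.159, 4.508, 1.00, 22.13),
]
SAMPLE_PDB = "\n".join(
    [format_atom(a) for a in SAMPLE_ATOMS]
    + ["CONECT 1510 1511 1512 1513 1514", "CONECT 1511 1510", "END"]
)
pdb_atoms, pdb_bonds = parse_pdb(SAMPLE_PDB)
```

Note that only the sulfate (a HETATM group) has CONECT records here; bonds within standard amino acid residues are not listed and have to be inferred from the residue names.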
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
  – Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  – An early way of specifying a starting geometry for molecules – people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
  – A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending, e.g. a carbonyl
    Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure labels: Z-matrix; 'improper' torsion angles]
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928
Ethanol
<CML>
– Can be parsed
– Can contain reactions, properties etc.
– Can contain relationships to other molecules and also concepts
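"Can be parsed" is the practical point: any standard XML library can read the file without a chemistry-specific parser. The fragment below is a hand-written, minimal CML-style description of ethanol, not the file shown on the slide; the element and attribute names (`atomArray`, `elementType`, `hydrogenCount`, `atomRefs2`) and the namespace follow my understanding of the CML schema and should be treated as assumptions to check against the schema itself.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"   # assumed CML namespace

# Hand-written illustrative fragment: heavy atoms with hydrogen counts.
ETHANOL_CML = f"""<molecule xmlns="{CML_NS}" id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="3"/>
    <atom id="a2" elementType="C" hydrogenCount="2"/>
    <atom id="a3" elementType="O" hydrogenCount="1"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>"""

root = ET.fromstring(ETHANOL_CML)
ns = {"cml": CML_NS}

# Atom id -> element symbol, read with a generic XML parser.
cml_atoms = {
    atom.get("id"): atom.get("elementType")
    for atom in root.findall("cml:atomArray/cml:atom", ns)
}
# Each bond refers to the two atom ids it joins.
cml_bonds = [
    tuple(bond.get("atomRefs2").split())
    for bond in root.findall("cml:bondArray/cml:bond", ns)
]
# Metadata travels with the data: here, the hydrogen count per atom.
hydrogens = sum(
    int(atom.get("hydrogenCount"))
    for atom in root.findall("cml:atomArray/cml:atom", ns)
)
```

Because the format is extensible, further elements (properties, units, links to other molecules) could be added to the same file without breaking a reader like this one, which simply ignores what it does not ask for.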
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
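An InChI string is built from layers separated by "/", so even plain string handling can pull it apart. The helper below is a simplified sketch (the function name is invented, and it assumes each layer prefix appears only once, which holds for a simple molecule like ethanol but not for every InChI):

```python
def split_inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # first letter names the layer
    return layers


ethanol_layers = split_inchi_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
```

For ethanol the `c` layer (`1-2-3`) gives the connections between the numbered heavy atoms (C, C, O), and the `h` layer (`3H,2H2,1H3`) says that atom 3 carries one hydrogen, atom 2 two and atom 1 three. The SMILES string `C(C)O` encodes the same connectivity more compactly, with the hydrogens implied.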
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
Next lecture
bull How can we use this type of information to
solve our chemistry problems
ndash Finding the right compound
ndash Designing compounds
ndash Searching for compound data
ndash Patents
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical 5 = -
1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: Z-matrix; 'improper' torsion angles]
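To see how a program turns a Z-matrix like those above back into xyz coordinates, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Gaussian's own code: the first atom is placed at the origin, the second along z, and the third in the xz-plane; every later atom is placed from its distance, angle and dihedral relative to three earlier atoms. The names `place` and `zmatrix_to_cartesian` are hypothetical, and only the simple "symbol, reference, variable" layout shown on the slides is handled.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    length = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    return (a[0] / length, a[1] / length, a[2] / length)

def place(a, b, c, r, theta_deg, phi_deg):
    """Point at distance r from a, with angle theta (new-a-b) and
    dihedral phi (new-a-b-c), both in degrees."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    ab = _unit(_sub(a, b))               # axis pointing from b to a
    n = _unit(_cross(_sub(b, c), ab))    # normal to the a-b-c plane
    m = _cross(n, ab)                    # completes the local frame
    d = (-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(a[i] + d[0] * ab[i] + d[1] * m[i] + d[2] * n[i]
                 for i in range(3))

def zmatrix_to_cartesian(text):
    """Return a list of (symbol, (x, y, z)) from Z-matrix text."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    split_at = lines.index("") if "" in lines else len(lines)
    variables = {}
    for ln in lines[split_at:]:          # variable (and constant) sections
        if ln:
            name, number = ln.split()
            variables[name] = float(number)

    def value(token):
        # A token is a variable name, a negated variable name, or a number.
        sign = -1.0 if token.startswith("-") else 1.0
        key = token.lstrip("+-")
        return sign * variables[key] if key in variables else float(token)

    atoms = []
    for ln in lines[:split_at]:
        f = ln.split()
        if len(atoms) == 0:
            xyz = (0.0, 0.0, 0.0)
        elif len(atoms) == 1:
            xyz = (0.0, 0.0, value(f[2]))
        else:
            a = atoms[int(f[1]) - 1][1]
            b = atoms[int(f[3]) - 1][1]
            if len(atoms) == 2:
                # No third reference yet: use a dummy point off the z axis.
                c, phi = (b[0] + 1.0, b[1], b[2]), 0.0
            else:
                c, phi = atoms[int(f[5]) - 1][1], value(f[6])
            xyz = place(a, b, c, value(f[2]), value(f[4]), phi)
        atoms.append((f[0], xyz))
    return atoms

WATER = """\
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""

for symbol, (x, y, z) in zmatrix_to_cartesian(WATER):
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a single variable (say `a1`) and rebuilding moves every dependent atom consistently, which is why internal coordinates are convenient for driving a reaction coordinate.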
What about the comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
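As a rough illustration of "can be parsed" and of metadata travelling with the molecule, the sketch below builds and reads a small CML-style record for ethanol with Python's standard `xml.etree.ElementTree`. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `property`, `scalar`) follow common CML usage but are simplified here and not validated against the CML schema; hydrogens are left implicit, and the boiling-point property is only an example of attaching a value with its units.

```python
import xml.etree.ElementTree as ET

def ethanol_cml():
    """Build a simplified, CML-style XML string for ethanol (heavy atoms only)."""
    mol = ET.Element("molecule", {"id": "ethanol", "title": "Ethanol"})
    atoms = ET.SubElement(mol, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", {"id": atom_id, "elementType": element})
    bonds = ET.SubElement(mol, "bondArray")
    for first, second in [("a1", "a2"), ("a2", "a3")]:
        ET.SubElement(bonds, "bond",
                      {"atomRefs2": f"{first} {second}", "order": "1"})
    # Metadata: the value is stored together with its meaning and units.
    prop = ET.SubElement(mol, "property", {"title": "boiling point"})
    scalar = ET.SubElement(prop, "scalar", {"units": "celsius"})
    scalar.text = "78.37"
    return ET.tostring(mol, encoding="unicode")

def read_molecule(xml_text):
    """Parse the record back into plain Python structures."""
    root = ET.fromstring(xml_text)
    elements = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    bonds = [tuple(b.get("atomRefs2").split()) for b in root.iter("bond")]
    props = {}
    for p in root.iter("property"):
        scalar = p.find("scalar")
        props[p.get("title")] = (float(scalar.text), scalar.get("units"))
    return elements, bonds, props

elements, bonds, props = read_molecule(ethanol_cml())
print(elements, bonds, props)
```

Because the file is extensible, a further `property` element (or a link to another molecule) can be added without breaking readers that do not know about it.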
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these we can obtain atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
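The last point can be made concrete with a small sketch of a systematic (grid) torsion scan. This is only one of many conformer-search strategies and the function names are illustrative; it simply enumerates every combination of torsion values and shows how quickly the number of candidate conformations grows with the number of rotatable bonds.

```python
import itertools

def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return itertools.product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg):
    """Number of conformations a full grid scan would have to evaluate."""
    return len(range(0, 360, step_deg)) ** n_torsions

# Two torsions at 30 degree steps is the size of a Ramachandran-style map;
# each extra rotatable bond multiplies the work by another factor of 12.
for n in (2, 3, 5):
    print(n, "torsions:", grid_size(n, 30), "conformations")
```

In practice each grid point would be built (for example from a Z-matrix) and scored with molecular mechanics, and most methods prune or sample the grid rather than enumerate it fully.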
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
So there are very many file formats, which could be a real pain, but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Finding the right compound
- Designing compounds
- Searching for compound data
- Patents
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure: electron density map. Which way round should this ring go?]
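One simple way to impute bonds from positions alone is a distance test: two atoms are treated as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate single-bond covalent radii (in Å) for a few elements and an arbitrary 0.4 Å tolerance; both are illustrative choices, not values from the lecture, and real software adds many chemical checks on top.

```python
import math

# Approximate covalent radii in angstroms (illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def infer_bonds(atoms, tolerance=0.4):
    """Return index pairs (i, j) of atoms close enough to be bonded.

    atoms is a list of (element, (x, y, z)) tuples; tolerance is in angstroms.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (elem_i, pos_i), (elem_j, pos_j) = atoms[i], atoms[j]
            cutoff = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + tolerance
            if math.dist(pos_i, pos_j) <= cutoff:
                bonds.append((i, j))
    return bonds

# Heavy atoms of a valine residue (N, CA, C, O, CB), coordinates in angstroms.
valine = [
    ("N", (3.036, -1.035, -3.538)),
    ("C", (4.283, -0.343, -3.015)),
    ("C", (4.565, 1.067, -3.593)),
    ("O", (4.653, 1.287, -4.853)),
    ("C", (5.510, -1.245, -3.141)),
]
print(infer_bonds(valine))
```

The test says nothing about bond order or about which of two similar-looking atoms is nitrogen and which is oxygen, which is exactly where imputed structures tend to go wrong.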
Example Protein Data Bank file (pdb)
(bonds are listed atom by atom in CONECT records, i.e. as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations on the file:
- CRYST1: space group and unit cell dimensions
- ATOM: protein atoms, including xyz, occupancy and temperature factor
- HETATM: non-protein atoms, including xyz, occupancy and temperature factor
- CONECT: bonds
New format: mmCIF
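The PDB format is column-based rather than whitespace-separated, so a reader should slice fixed columns. The sketch below assumes the standard column layout for ATOM, HETATM and CONECT records and uses three lines from the example above; the function names and dictionary keys are illustrative, and real files contain many more record types and edge cases than are handled here.

```python
def parse_pdb_atoms(text):
    """Extract ATOM and HETATM records using fixed column positions."""
    atoms = []
    for line in text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "record": line[0:6].strip(),
                "serial": int(line[6:11]),
                "name": line[12:16].strip(),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),   # temperature factor
            })
    return atoms

def parse_conect(text):
    """Return {atom serial: [bonded serials]} from CONECT records."""
    bonds = {}
    for line in text.splitlines():
        if line.startswith("CONECT"):
            line = line.rstrip()
            # Serial numbers sit in consecutive 5-character fields.
            serials = [int(line[i:i + 5]) for i in range(6, len(line), 5)
                       if line[i:i + 5].strip()]
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return bonds

SAMPLE = "\n".join([
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28",
    "ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41",
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34",
    "CONECT 1510 1511 1512 1513 1514",
    "CONECT 1511 1510",
])

for atom in parse_pdb_atoms(SAMPLE):
    print(atom["record"], atom["serial"], atom["name"], atom["xyz"])
print(parse_conect(SAMPLE))
```

Note that bonds within standard protein residues are not listed in the file at all; only CONECT records for non-standard groups such as the sulfate appear, and the rest must be inferred.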
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions).
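The last bullet can be illustrated with a systematic grid scan over torsion angles. The sketch below only enumerates the angle combinations; it does not build or score any structures. The function names and the default 60 degree step are choices made for this example, and the step is assumed to divide 360 exactly.

```python
from itertools import product

def torsion_grid(n_torsions, step_deg=60):
    """Yield every combination of torsion angles on a regular grid (step must divide 360)."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg=60):
    """Number of grid points: (360 / step) raised to the number of rotatable torsions."""
    return (360 // step_deg) ** n_torsions

# Two torsions (as in a Ramachandran plot) are easy to scan exhaustively...
print(grid_size(2, 30))
# ...but the count grows exponentially with each extra torsion.
print(grid_size(5, 30))
print(list(torsion_grid(2, 120)))
```

The exponential growth is why practical conformer searches rarely scan every torsion on a fine grid.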
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
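The film-strip picture can be sketched in a few lines. In an SD file each record ends with a line containing only `$$$$`, so a concatenated series of snapshots can be cut back into individual frames by splitting on that delimiter. The frame contents below are placeholders rather than real molfile blocks, and `split_sd_records` is a name invented for this example. Molecular dynamics packages generally use their own trajectory formats, so this follows the analogy in the text rather than any particular program.

```python
def split_sd_records(text):
    """Split concatenated SD-style text into one string per record (frame)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            # The delimiter closes the current record.
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record even if its final delimiter is missing.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records

# Placeholder frames: a real record would hold a title, atom block and bond block.
movie = "frame 1\n(coordinates at t = 0)\n$$$$\nframe 2\n(coordinates at t = 1)\n$$$$\n"
frames = split_sd_records(movie)
for number, frame in enumerate(frames, start=1):
    print(number, frame.splitlines()[0])
```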
So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program – Babel
Recent IUPAC moves towards a ‘standard’ format. However, in the near future there are likely to be many competing requirements for file content.
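Before any conversion can happen, a program has to decide which format it has been given. One simple approach is to look at the file extension, as sketched below. The table is a small subset of the extension codes listed in this lecture, and `guess_format` is a hypothetical helper written for this example; it is not Babel's own code or interface.

```python
import os

# A few of the extension codes from the format list, with their descriptions.
FORMATS = {
    "sdf": "MDL Isis SDF file",
    "sd": "MDL Isis SDF file",
    "mol": "MDL Molfile file",
    "pdb": "PDB file",
    "ent": "PDB file",
    "smi": "SMILES file",
    "mol2": "Sybyl Mol2 file",
    "cml": "Chemical Markup Language file",
    "xyz": "XYZ file",
}

def guess_format(filename):
    """Return the format description for a filename, or None if the extension is unknown."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return FORMATS.get(extension)

for name in ["ligand.SDF", "protein.ent", "notes.docx"]:
    print(name, "->", guess_format(name))
```

Note that several extensions map to the same format (sdf and sd, pdb and ent), and that an extension says nothing about whether the contents are actually valid.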
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).
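The nitro-group problem mentioned above can be shown with a crude valence check. A nitro group may be stored either with two N=O double bonds (a five-valent nitrogen) or in the charge-separated form with N+ and O-. The sketch below flags atoms whose summed bond orders exceed a typical maximum. The valence table, the charge adjustment for N and O, and the function name `valence_warnings` are simplifying assumptions made for this example, not a general chemistry model.

```python
# Typical maximum valences for neutral atoms (a deliberately small, simplified table).
TYPICAL_MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def valence_warnings(atoms, bonds):
    """atoms: [(symbol, formal_charge)]; bonds: [(i, j, order)] with zero-based indices.

    Returns [(index, symbol, total_bond_order)] for atoms that exceed the table.
    """
    totals = [0] * len(atoms)
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    flagged = []
    for index, ((symbol, charge), total) in enumerate(zip(atoms, totals)):
        if symbol not in TYPICAL_MAX_VALENCE:
            continue
        allowed = TYPICAL_MAX_VALENCE[symbol]
        if symbol in ("N", "O"):
            # Crude rule: N+ may form four bonds, O- only one.
            allowed += charge
        if total > allowed:
            flagged.append((index, symbol, total))
    return flagged

# C-N(=O)=O : both N=O bonds stored as double bonds, nitrogen ends up five-valent.
nitro_pentavalent = ([("C", 0), ("N", 0), ("O", 0), ("O", 0)],
                     [(0, 1, 1), (1, 2, 2), (1, 3, 2)])
# C-[N+](=O)[O-] : the charge-separated form of the same group.
nitro_charged = ([("C", 0), ("N", 1), ("O", 0), ("O", -1)],
                 [(0, 1, 1), (1, 2, 2), (1, 3, 1)])

print(valence_warnings(*nitro_pentavalent))
print(valence_warnings(*nitro_charged))
```

The two inputs describe the same functional group, yet only one passes the check, which is why the same molecule can fail to match itself when stored and retrieved by different programs.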
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules – sometimes used graph paper: draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
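Three of the geometric descriptions listed above (bond length, bond angle and torsion angle) can be computed directly from Cartesian coordinates. The sketch below does this with plain Python tuples. The function names are chosen for this example, angles are returned in degrees, and the torsion uses the common atan2 formulation, so its sign depends on that convention. A non-bonded distance is simply `bond_length` applied to two atoms that are not bonded.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _norm(u):
    return math.sqrt(_dot(u, u))

def bond_length(p1, p2):
    """Distance between two points."""
    return _norm(_sub(p2, p1))

def bond_angle(p1, p2, p3):
    """Angle p1-p2-p3 in degrees, measured at p2."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cosine = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))  # clamp rounding noise

def torsion_angle(p1, p2, p3, p4):
    """Dihedral angle p1-p2-p3-p4 in degrees, in the range -180 to 180."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n2 = _cross(b2, b3)
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))

print(bond_length((0, 0, 0), (0, 0, 0.96)))
print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```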
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
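The steps above can be followed in reverse by a program: read the atom rows, read the variables after the blank line, and place each atom from its distance, angle and dihedral. The sketch below is one way to do that, not Gaussian's own parser. It assumes angstroms and degrees, places atom 1 at the origin and atom 2 along +z, and uses the methanol example with its values read with decimal points (an assumption, since the slide text lost them). All function names are invented for this example.

```python
import math

def parse_zmatrix(text):
    """Return [(symbol, [reference atom indices], [values])] from Gaussian-style text."""
    atom_block, _, variable_block = text.strip().partition("\n\n")
    variables = {}
    for line in variable_block.splitlines():
        if line.strip():                      # also skips the blank line before constants
            name, value = line.split()
            variables[name] = float(value)

    def resolve(token):
        sign = -1.0 if token.startswith("-") else 1.0
        name = token.lstrip("+-")
        return sign * variables[name] if name in variables else float(token)

    rows = []
    for line in atom_block.splitlines():
        tokens = line.split()
        refs = [int(t) - 1 for t in tokens[1::2]]   # atom numbers are 1-based in the file
        values = [resolve(t) for t in tokens[2::2]]
        rows.append((tokens[0], refs, values))
    return rows

def _unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def _place(a, b, c, r, theta_deg, phi_deg):
    """Point at distance r from a, angle theta with b, dihedral phi with c."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    ba = _unit(tuple(x - y for x, y in zip(a, b)))            # direction b -> a
    n = _unit(_cross(tuple(x - y for x, y in zip(b, c)), ba)) # normal to the a, b, c plane
    m = _cross(n, ba)
    d = (-r * math.cos(theta), r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(a[i] + d[0] * ba[i] + d[1] * m[i] + d[2] * n[i] for i in range(3))

def zmatrix_to_cartesian(rows):
    coords = []
    for symbol, refs, values in rows:
        if not refs:
            position = (0.0, 0.0, 0.0)
        elif len(refs) == 1:
            position = (0.0, 0.0, values[0])
        else:
            a, b = coords[refs[0]], coords[refs[1]]
            if len(refs) == 3:
                c, phi = coords[refs[2]], values[2]
            else:
                # Third atom: no dihedral yet, so fix the plane with an arbitrary helper point.
                c, phi = (b[0] + 1.0, b[1], b[2]), 0.0
            position = _place(a, b, c, values[0], values[1], phi)
        coords.append(position)
    return coords

METHANOL = """C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
a1 109.0
a2 110.0
a3 108.0
da1 60.0
"""

rows = parse_zmatrix(METHANOL)
coords = zmatrix_to_cartesian(rows)
for (symbol, _, _), (x, y, z) in zip(rows, coords):
    print(f"{symbol} {x:9.4f} {y:9.4f} {z:9.4f}")
```

Using da1 and -da1 for atoms 4 and 5 places them as mirror images across the plane through atoms 1, 2 and 3, which is how one variable can describe two hydrogens.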
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not
really flat because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge
Crystallographic Database or the Protein Data Bank - PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3-D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3-D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions).
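The remark about "many more torsional angle dimensions" can be made concrete with a small sketch of a systematic (grid) torsion scan. It only enumerates the combinations of torsion angles that would have to be built and scored; generating and ranking the geometries is left out. The function names and the 30 degree step are assumptions chosen for illustration.

```python
import itertools

def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    if 360 % step_deg != 0:
        raise ValueError("step must divide 360 exactly")
    angles = range(0, 360, step_deg)
    return itertools.product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_deg) ** n_torsions

# Two torsions (as in a Ramachandran plot) are cheap to scan exhaustively ...
print(grid_size(2, 30))
# ... but the count grows exponentially with each extra rotatable bond.
print(grid_size(6, 30))
```

This exponential growth is one reason practical conformer searches often use random, rule-based or otherwise pruned sampling rather than a full grid.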
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
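The "film strip" picture of a trajectory can be sketched in a few lines. The example below uses the simple XYZ text format rather than SD files, because it is shorter to write out: each frame is an atom count, a comment line, then one "symbol x y z" line per atom, and the frames are concatenated. The coordinates are made-up numbers for a stretching O-H pair and are not simulation output.

```python
def write_xyz_trajectory(symbols, frames, comment="frame"):
    """Concatenate frames into one multi-frame XYZ string."""
    lines = []
    for index, coords in enumerate(frames):
        lines.append(str(len(symbols)))           # atom count
        lines.append("%s %d" % (comment, index))  # free-text comment line
        for symbol, (x, y, z) in zip(symbols, coords):
            lines.append("%-2s %10.5f %10.5f %10.5f" % (symbol, x, y, z))
    return "\n".join(lines) + "\n"

def read_xyz_trajectory(text):
    """Read the frames back as lists of (symbol, x, y, z) tuples."""
    lines = text.splitlines()
    frames, pos = [], 0
    while pos < len(lines) and lines[pos].strip():
        n_atoms = int(lines[pos])
        atoms = []
        for line in lines[pos + 2: pos + 2 + n_atoms]:
            symbol, x, y, z = line.split()
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append(atoms)
        pos += 2 + n_atoms                        # skip to the next frame
    return frames

# Illustrative frames only: an O-H distance stretching step by step.
symbols = ["O", "H"]
frames = [
    [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.00, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.04, 0.0, 0.0)],
]
trajectory_text = write_xyz_trajectory(symbols, frames)
trajectory = read_xyz_trajectory(trajectory_text)
print(trajectory_text)
```

Dedicated molecular dynamics packages generally store trajectories in more compact binary formats, but the idea of one coordinate set per frame is the same.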
So there are very many file formats - which could be a real pain, but
they can be converted
alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
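Conversion programs usually have to decide which reader to use, and the file extension is the obvious first clue. The sketch below shows that lookup for a handful of entries copied from the lists above. The function names are hypothetical and the table is only a small excerpt. It also shows that several extensions can point to the same format.

```python
import os

# A few extension-to-format entries taken from the lists above.
FORMATS = {
    "sdf": "MDL Isis SDF file",
    "sd": "MDL Isis SDF file",
    "mol": "MDL Molfile file",
    "mdl": "MDL Molfile file",
    "pdb": "PDB file",
    "ent": "PDB file",
    "smi": "SMILES file",
    "mol2": "Sybyl Mol2 file",
    "cml": "Chemical Markup Language file",
    "xyz": "XYZ file",
}

def guess_format(filename):
    """Return the format name for a filename, or None if the extension is unknown."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return FORMATS.get(extension)

def same_format(first, second):
    """True when two filenames map to the same known format."""
    guess = guess_format(first)
    return guess is not None and guess == guess_format(second)

print(guess_format("ligand.SDF"))
print(same_format("1abc.pdb", "1abc.ent"))
print(guess_format("notes.docx"))
```

An extension is only a hint: ambiguous extensions such as out or dat appear in the lists above, so a robust tool would also let the user state the format explicitly.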
File formats - the bane of our lives
Interconversion program - Babel
Recent IUPAC moves are towards a 'standard' format; however,
in the near future there are likely to be many competing
requirements for file content
Molecules on computers - things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding)
Tautomers can be incorrect - check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file)
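The nitro-group problem can be illustrated with a crude valence check. The same group may be stored with a five-valent nitrogen (two N=O double bonds) or in the charge-separated form (N+ with one N=O and one N-O-), and a program expecting one form may mishandle the other. The sketch below flags atoms whose bond orders add up to more than a simple valence rule allows. The valence table and the charge adjustment are simplifying assumptions for illustration, not a general valence model.

```python
NORMAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def allowed_valence(element, charge):
    """Very rough rule: N+ may form 4 bonds and O- only 1; other charges reduce it."""
    if element in ("N", "O"):
        return NORMAL_VALENCE[element] + charge
    return NORMAL_VALENCE[element] - abs(charge)

def valence_warnings(atoms, bonds):
    """atoms: list of (element, formal_charge); bonds: list of (i, j, order), 0-based.
    Returns (index, element, bond_order_sum, allowed) for each over-bonded atom."""
    totals = [0] * len(atoms)
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    warnings = []
    for index, ((element, charge), total) in enumerate(zip(atoms, totals)):
        allowed = allowed_valence(element, charge)
        if total > allowed:
            warnings.append((index, element, total, allowed))
    return warnings

# Heavy atoms of nitromethane, stored two different ways (hydrogens left implicit).
pentavalent = valence_warnings(
    [("C", 0), ("N", 0), ("O", 0), ("O", 0)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 2)],
)
charge_separated = valence_warnings(
    [("C", 0), ("N", 1), ("O", 0), ("O", -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
print(pentavalent)
print(charge_separated)
```

Both records describe the same compound, so a check like this is a prompt to normalise the representation before comparing or searching structures, not a sign that the molecule itself is wrong.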
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Finding the right compound
- Designing compounds
- Searching for compound data
- Patents
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
So there are very many file formats, which could be a real pain, but
they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a ‘standard’ format. However,
in the near future there are likely to be many competing
requirements for file content.
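To make the idea of interconversion concrete, here is a minimal sketch of one conversion, from an MDL Molfile (V2000) block to a plain XYZ file. It is not how Babel is implemented. It assumes a well-formed V2000 block with fixed-width columns, the names `molblock_to_xyz` and `_atom_line` are hypothetical, and the water coordinates are illustrative only.

```python
def molblock_to_xyz(molblock: str) -> str:
    """Convert a V2000 Molfile block to XYZ text (atoms only)."""
    lines = molblock.splitlines()
    title = lines[0].strip()
    n_atoms = int(lines[3][0:3])          # counts line: columns 1-3 hold the atom count
    out = [str(n_atoms), title]
    for line in lines[4:4 + n_atoms]:     # atom block follows the counts line
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()
        out.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    # The bond block is silently dropped: XYZ has nowhere to put it.
    return "\n".join(out) + "\n"


def _atom_line(symbol: str, x: float, y: float, z: float) -> str:
    """Build one fixed-width V2000 atom line (helper for the example only)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3s} 0  0  0  0  0  0  0  0  0  0  0  0"


WATER = "\n".join([
    "water",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    _atom_line("O", 0.0, 0.0, 0.0),
    _atom_line("H", 0.9572, 0.0, 0.0),
    _atom_line("H", -0.24, 0.9266, 0.0),
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])

if __name__ == "__main__":
    print(molblock_to_xyz(WATER), end="")
```

The conversion is lossy: the two O-H bonds in the Molfile do not survive the trip to XYZ. This is one reason different formats place competing requirements on file content.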
Molecules on computers: things to look out for,
since what is stored is actually quite crude
For example:
• Stereochemistry may be relative rather than absolute, or even incorrect
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen get confused
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding)
• Tautomers can be incorrect; check they look reasonable
• Mesomers can be incorrect (double bonds)
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file)
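The point about bonds being added after the fact can be illustrated with a distance-based guess, a common heuristic when only atom positions are available. The sketch below is hedged throughout: the covalent radii are approximate literature values, the 0.45 Å tolerance is an assumed slack, and `guess_bonds` is a hypothetical name rather than any package's API.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.45  # angstroms of slack; an assumption, not a standard


def guess_bonds(atoms):
    """Return index pairs (i, j) whose separation is short enough to be a bond.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[sym_i] + COVALENT_RADIUS[sym_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


if __name__ == "__main__":
    water = [
        ("O", (0.0, 0.0, 0.0)),
        ("H", (0.9572, 0.0, 0.0)),
        ("H", (-0.24, 0.9266, 0.0)),
    ]
    print(guess_bonds(water))
```

A test like this says nothing about bond order, tautomeric state or hydrogen bonds, and because the radii of nitrogen and oxygen are so similar it cannot tell the two elements apart. Inferred bonding should therefore be checked rather than trusted.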
Next lecture
• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents