Useful Information The web address for these lectures is http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of handout) Assessment is by two online exercises (Glen and Goodman) at this address. Each will be marked out of ten. Your (paper) answers should be submitted to Mykola. Glen exercises due: Feb 10 th 2016 Lectures and handout available on Moodle

Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,


Useful Information

• The web address for these lectures is http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of handout)

• Assessment is by two online exercises (Glen and Goodman) at this address. Each will be marked out of ten. Your (paper) answers should be submitted to Mykola.

• Glen exercises due: Feb 10th 2016

• Lectures and handout available on Moodle

Molecular Informatics

1: Molecules and computers

Sources: textbooks/online you may wish to consider if you want to take the subject further

• An Introduction to Chemoinformatics, Andrew R. Leach and Valerie J. Gillet, Springer, 2007
• Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH, 2003
• Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH, 2003
• Chemoinformatics: An Approach to Virtual Screening, Alexandre Varnek and Alex Tropsha, RSC Publishing
• Chemoinformatics: Theory, Practice and Products, Barry A. Bunin, Springer, Dordrecht, 2007
• Drug Metabolism Prediction, ed. Johannes Kirchmair (series eds. R. Mannhold, H. Kubinyi, G. Folkers), Methods and Principles in Medicinal Chemistry, Vol. 63, Wiley-VCH

Journals of Molecular/Cheminformatics you may wish to follow up on

• Journal of Chemical Information and Modeling
• Journal of Chemical Theory and Computation
• Journal of Cheminformatics
• Journal of Computer-Aided Molecular Design
• Journal of Molecular Graphics & Modelling
• Journal of Computational Chemistry
• Journal of Medicinal Chemistry
• Reviews in Computational Chemistry
• Drug Discovery Today
• BMC Bioinformatics
• Nature Reviews Drug Discovery
• Expert Opinion on Drug Discovery
• WIREs Computational Molecular Science

Molecular Informatics

Includes all aspects of the study of molecules on computers. Also includes Chem(o)informatics.

This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.

Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications, but the area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.

Cambridge HPC

Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ. 2013, 90 (3), pp 320-325. DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" of the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but it nearly all gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction with more complete representations of molecules and materials.

[Figure caption: Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol. A real-life mixture.]

What is a molecule?

• Is it a series of connected points?
• A wave function?
• The sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)

• 1D: Line notations (a string of characters from a keyboard)
• 2D: Molecular diagrams (graphs)
• 3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional: line notations, a string representation of molecules. Here are three examples of different line notations; SMILES is the most useful and widely used. These are 'strings', all of the same molecule:

• WLN: L66J BMR& DSWQ IN1&1
• ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
• SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
• IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.
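
Because a line notation is only a string, even very simple string processing can recover chemical information from it. The sketch below counts the atoms written in a SMILES string with a regular expression. It is a minimal illustration, not a real SMILES parser: the function name `count_heavy_atoms` is hypothetical, only the common organic-subset symbols and bracket atoms are recognised, and implicit hydrogens are ignored.

```python
import re
from collections import Counter

# Tokens: bracket atoms, two-letter halogens, then one-letter organic-subset
# atoms (upper case = aliphatic, lower case = aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def count_heavy_atoms(smiles):
    """Count the non-hydrogen atoms written explicitly in a SMILES string."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            # Bracket atom: skip an optional isotope number, then read the symbol.
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})", token).group(1)
        else:
            symbol = token
        if symbol == "H":
            continue  # explicit hydrogens such as [H] are not heavy atoms
        counts[symbol.capitalize()] += 1
    return counts

# The sulphonic acid written above in WLN, ROSDAL and SMILES.
sulphonic_acid = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(dict(count_heavy_atoms(sulphonic_acid)))  # {'C': 18, 'N': 2, 'S': 1, 'O': 3}
```

Ring-closure digits, branches and bond symbols are simply skipped here; a real toolkit has to interpret them to rebuild the molecular graph.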

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize). This language is called SMILES.

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.
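
As a rough illustration of "one molecule, one InChI", the sketch below uses an InChI string as a dictionary key and splits it into its layers. The ethanol InChI is the one from the XML slide of this lecture, with the `/` and `,` separators restored (an assumption about the original slide). The function `split_inchi` and the tiny `database` are hypothetical names for illustration only.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and its prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    result = {"version": version, "formula": formula}
    # Every further layer starts with a one-letter prefix,
    # e.g. c = connectivity, h = hydrogens.
    for layer in layers:
        result[layer[0]] = layer[1:]
    return result

ETHANOL = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"

# A hypothetical in-memory "database": exact string match finds the record.
database = {ETHANOL: {"name": "ethanol"}}

print(split_inchi(ETHANOL)["formula"])   # C2H6O
print(database[ETHANOL]["name"])         # ethanol
```

Exact string matching only works if every source generates the identifier with the same algorithm and options, which is why InChI is produced by software rather than written by hand.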

RSC ChemSpider

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.

• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. Oxygen).

• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.

• E.g. to solve this we could introduce the concept of an aromatic bond.
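
The aromatic-bond idea can be sketched with a labelled graph stored as a list of (atom, atom, bond order) edges. The example assumes the two canonical structures are the two Kekulé drawings of benzene, that both use the same atom numbering, and that we already know which atoms are aromatic; the helper names are hypothetical. Real software has to perceive aromaticity and handle renumbering (graph isomorphism), which this sketch does not attempt.

```python
def bond_set(bonds):
    """Normalise a bond list so the order of atoms within a bond is irrelevant."""
    return frozenset((frozenset((a, b)), order) for a, b, order in bonds)

def aromatise(bonds, aromatic_atoms):
    """Relabel every bond between two aromatic atoms with one 'ar' bond type."""
    return [
        (a, b, "ar" if a in aromatic_atoms and b in aromatic_atoms else order)
        for a, b, order in bonds
    ]

# Two Kekule drawings of benzene: same atoms 1-6, double bonds shifted by one.
kekule_a = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
kekule_b = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 1, 2)]
ring = {1, 2, 3, 4, 5, 6}

# As simple graphs with single/double labels the two drawings differ ...
print(bond_set(kekule_a) == bond_set(kekule_b))  # False
# ... but with an aromatic bond type they become the same labelled graph.
print(bond_set(aromatise(kekule_a, ring)) == bond_set(aromatise(kekule_b, ring)))  # True
```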

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format to store small molecules: SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:

Connect 3 2 5 4

e.g. SD file format (reduced):

Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Annotations:

• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. a date is used here)
• "Counts line": has the number of atoms and bonds as a minimum
• "Atom block": has the xyz coordinates of the atom and the element as a minimum
• "Bond block": one line for each bond (from atom, to atom, bond type)
• "M END" and "$$$$": these identify the end of this molecule
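
To show why "exactly the right place" matters, the sketch below writes the benzene record with fixed-width fields and reads it back by slicing character columns rather than splitting on spaces. It is a simplified reading of the V2000 layout (only coordinates, symbol, mass difference and charge code are written on atom lines); the helper names and the program line are assumptions made for this example.

```python
def atom_line(x, y, z, symbol, charge_code=0):
    # Three 10-character coordinates, a space, a 3-character symbol,
    # a 2-character mass difference and a 3-character charge code.
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}{0:2d}{charge_code:3d}"

def bond_line(first, second, order):
    return f"{first:3d}{second:3d}{order:3d}"

ATOMS = [(1.9050, -0.7932), (1.9050, -2.1232), (0.7531, -0.1282),
         (0.7531, -2.7882), (-0.3987, -0.7932), (-0.3987, -2.1232)]
BONDS = [(2, 1, 1), (3, 1, 2), (4, 2, 2), (5, 3, 1), (6, 4, 1), (6, 5, 2)]

molfile = "\n".join(
    ["benzene", "  written by this sketch", "",
     f"{len(ATOMS):3d}{len(BONDS):3d}  0  0  0  0  0  0  0  0  1 V2000"]
    + [atom_line(x, y, 0.0, "C") for x, y in ATOMS]
    + [bond_line(*bond) for bond in BONDS]
    + ["M  END", "$$$$"]
)

def parse_molfile(text):
    """Read name, atoms and bonds from one record using fixed columns."""
    lines = text.splitlines()
    name = lines[0].strip()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [
        (line[31:34].strip(), float(line[0:10]), float(line[10:20]), float(line[20:30]))
        for line in lines[4:4 + n_atoms]
    ]
    bonds = [
        (int(line[0:3]), int(line[3:6]), int(line[6:9]))
        for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]
    ]
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(molfile)
print(name, len(atoms), len(bonds))  # benzene 6 6
```

Shifting any field by a single character would make the slices read the wrong columns, which is the failure mode the slide warns about.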

Alanine SD file (annotated figure): the bond length is 1.53 Å and the molecule is chiral.

Atom block fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
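
The charge codes listed on this slide follow a simple pattern (charge = 4 - code, with code 4 reserved for a doublet radical). A minimal decoder might look like the sketch below; the function name is hypothetical, and returning a charge of 0 for the radical code is an assumption made here for convenience.

```python
def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 0:
        return 0, False      # uncharged, or a value not covered by these codes
    if code == 4:
        return 0, True       # doublet radical (charge of 0 assumed here)
    if 1 <= code <= 7:
        return 4 - code, False
    raise ValueError(f"unknown charge code: {code}")

for code in range(8):
    print(code, decode_charge(code))
```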

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density map. Which way round should this ring go?]

Example protein databank file (pdb)

(bonds are stored as an adjacency list in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations:

• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds

New format: mmCIF
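
The PDB format is also column-based. The sketch below reads a few of the records from the 3S7A excerpt, with decimal points and column positions restored as an assumption about the original file, and collects the CONECT records into a neighbour list per atom. The function name `parse_pdb` and the dictionary keys are hypothetical; only the fields shown on the slide are extracted.

```python
PDB_TEXT = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
"""

def parse_pdb(text):
    """Return ({serial: atom record}, {serial: [bonded serials]})."""
    atoms, bonds = {}, {}
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atoms[int(line[6:11])] = {
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "residue_number": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),
                "hetero": record == "HETATM",
            }
        elif record == "CONECT":
            # Serial numbers sit in consecutive 5-character fields after column 6.
            serials = [int(line[i:i + 5]) for i in range(6, len(line), 5)
                       if line[i:i + 5].strip()]
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return atoms, bonds

atoms, bonds = parse_pdb(PDB_TEXT)
print(atoms[1]["name"], atoms[1]["xyz"])
print(bonds[1510])
```

Note that the protein atoms carry no bond records at all: their bonds are imputed from the residue names, as described above.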

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance

[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
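
A Z-matrix has to be turned back into Cartesian coordinates before it can be displayed or used in most calculations. The sketch below does this for rows given as tuples with the variables already substituted by numbers. It assumes the water values are 0.96 Å and 104.0° (decimal points restored), places the first atom at the origin and the second on the z axis, and uses a common placement construction for later atoms; it is one possible implementation, not the one used by Gaussian. The hydrogen peroxide row values are illustrative only and do not come from the lecture.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    length = math.sqrt(_dot(a, a))
    return tuple(x / length for x in a)

def _combine(origin, *terms):
    """Return origin + sum(scale * vector) over the (scale, vector) terms."""
    return tuple(o + sum(s * v[i] for s, v in terms) for i, o in enumerate(origin))

def zmatrix_to_cartesian(rows):
    """rows: (symbol,), (symbol, NA, r), (symbol, NA, r, NB, angle),
    (symbol, NA, r, NB, angle, NC, dihedral); atoms 1-based, angles in degrees."""
    coords = []
    for i, row in enumerate(rows):
        if i == 0:
            coords.append((0.0, 0.0, 0.0))
            continue
        r, a = row[2], coords[row[1] - 1]
        if i == 1:
            coords.append(_combine(a, (r, (0.0, 0.0, 1.0))))
            continue
        theta, b = math.radians(row[4]), coords[row[3] - 1]
        if i == 2:
            # Atoms 1 and 2 lie on the z axis, so place atom 3 in the xz plane.
            coords.append(_combine(a, (r * math.cos(theta), _unit(_sub(b, a))),
                                   (r * math.sin(theta), (1.0, 0.0, 0.0))))
            continue
        phi, c = math.radians(row[6]), coords[row[5] - 1]
        bc = _unit(_sub(a, b))               # axis NB -> NA
        n = _unit(_cross(_sub(b, c), bc))    # normal to the NC, NB, NA plane
        m = _cross(n, bc)
        coords.append(_combine(a, (-r * math.cos(theta), bc),
                               (r * math.sin(theta) * math.cos(phi), m),
                               (r * math.sin(theta) * math.sin(phi), n)))
    return coords

def distance(p, q):
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))

def dihedral(p0, p1, p2, p3):
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, _cross(b2, b3))
    x = _dot(_cross(b1, b2), _cross(b2, b3))
    return math.degrees(math.atan2(y, x))

water = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
xyz = zmatrix_to_cartesian(water)
print(round(distance(xyz[1], xyz[2]), 3))  # H...H distance: 1.513

# Illustrative four-atom chain (values assumed) to exercise the dihedral term.
peroxide = [("H",), ("O", 1, 0.95), ("O", 2, 1.47, 1, 100.0),
            ("H", 3, 0.95, 2, 100.0, 1, 115.0)]
p = zmatrix_to_cartesian(peroxide)
print(round(dihedral(p[3], p[2], p[1], p[0]), 1))
```

The construction fails if NA, NB and NC are collinear, which is one reason dummy atoms are used in Z-matrices for linear fragments.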

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: z-matrix; 'improper' torsion angles]

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol <CML>:

- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES: C(C)O
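
To make the idea of structured data plus metadata concrete, the sketch below builds a small CML-style description of ethanol (heavy atoms only, matching the SMILES C(C)O) with Python's standard XML library and then parses it back. The element and attribute names follow common CML usage (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `atomRefs2`), but this is a simplified illustration and is not validated against the CML schema; the `property` entry and its `units` value in particular are assumptions.

```python
import xml.etree.ElementTree as ET

molecule = ET.Element("molecule", id="ethanol")
ET.SubElement(molecule, "name").text = "ethanol"

atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

bond_array = ET.SubElement(molecule, "bondArray")
for refs in ["a1 a2", "a1 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=refs, order="1")

# Metadata travels with the value: here the unit is stored next to the number.
prop = ET.SubElement(molecule, "property", title="boiling point")
ET.SubElement(prop, "scalar", units="celsius").text = "78.4"

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Any XML parser can read the file back without knowing the chemistry.
parsed = ET.fromstring(xml_text)
symbols = [atom.get("elementType") for atom in parsed.iter("atom")]
print(symbols)  # ['C', 'C', 'O']
```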

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd
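
The "film strip" picture can be sketched by splitting a concatenated SD file into one record per frame at each `$$$$` terminator line. The frame contents below are placeholders standing in for real connection tables, and the function name is hypothetical. Molecular dynamics packages generally use their own trajectory formats, so this only illustrates the idea.

```python
def split_sd_records(text):
    """Split concatenated SD records at each '$$$$' terminator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

# Two placeholder frames; a real file would hold full connection tables.
trajectory = (
    "frame 1\n(atoms and bonds at time step 1)\nM  END\n$$$$\n"
    "frame 2\n(atoms and bonds at time step 2)\nM  END\n$$$$\n"
)

frames = split_sd_records(trajectory)
print(len(frames))                          # 2
print([f.splitlines()[0] for f in frames])  # ['frame 1', 'frame 2']
```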

So there are very many file formats, which could be a real pain, but they can be converted:

alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
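
Conversion between formats is mostly a matter of re-arranging the same data, but information can be lost on the way. As a minimal sketch, the function below writes a list of atoms in the simple XYZ layout (atom count, a comment line, then one atom per line). The function name is hypothetical, the two carbon atoms are taken from the benzene example, and a real converter such as Babel handles far more cases.

```python
def to_xyz(comment, atoms):
    """Format atoms [(symbol, x, y, z), ...] as XYZ text: count, comment, atoms."""
    lines = [str(len(atoms)), comment]
    lines += [f"{symbol:<2} {x:10.4f} {y:10.4f} {z:10.4f}" for symbol, x, y, z in atoms]
    return "\n".join(lines)

atoms = [("C", 1.9050, -0.7932, 0.0), ("C", 1.9050, -2.1232, 0.0)]
xyz_text = to_xyz("benzene fragment", atoms)
print(xyz_text)
```

The XYZ layout has no place for bonds, so the bond block of an SD file is discarded by this conversion and would have to be imputed again when converting back.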

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
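
Some of these problems can be caught with very simple checks on the stored graph. The sketch below sums the bond orders at each atom and flags atoms that exceed a typical valence, using a nitro group drawn with two N=O double bonds as the example. The valence table is a deliberately crude assumption (neutral atoms, a handful of elements) and the function name is hypothetical.

```python
from collections import defaultdict

# Simplified maximum valences for neutral atoms (an assumption for this sketch).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def valence_warnings(symbols, bonds):
    """Return (atom, symbol, total bond order) for atoms above their usual valence."""
    totals = defaultdict(int)
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    return [
        (atom, symbols[atom], total)
        for atom, total in sorted(totals.items())
        if total > MAX_VALENCE.get(symbols[atom], total)
    ]

# Nitro group stored as C-N(=O)=O (heavy atoms only, hydrogens implicit).
symbols = {1: "C", 2: "N", 3: "O", 4: "O"}
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 2)]
print(valence_warnings(symbols, bonds))  # [(2, 'N', 5)]
```

A flagged atom is not necessarily an error: it may simply mean the structure uses a drawing convention (such as the pentavalent nitro group) that another program will store or interpret differently.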

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents

Page 2: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Molecular Informatics

1 molecules and computers

An Introduction to Chemoinformatics Andrew R

Leach Valerie J Gillet Springer 2007

Chemoinformatics - A Textbook Johann Gasteiger and Thomas Engel

Wiley-VCH 2003

Handbook of Chemoinformatics Johann Gasteiger

Wiley-VCH 2003

Chemoinformatics An Approach to Virtual Screening

By Alexandre Varnek Alex Tropsha RSC Publishing

Bunin Barry A Chemoinformatics Theory Practice and Products

Dordrecht Springer 2007

Chemoinformatics An Approach to Virtual Screening By Alexandre

Varnek Alex Tropsha RSC Publishing

Drug Metabolism Prediction Ed R Mannhold H Kubinye G Folkers

Ed Johannes Kirchmair Methods and Principles in Medicinal

Chemistry Vol 63 Pub Wiley-VCH

Sources- textbooksonline you may wish to consider if you want

to take the subject further

Journals of MolecularCheminformatics you may wish

to follow up on

Journal of Chemical Information and Modeling

Journal of Chemical Theory and Computation

Journal of Cheminformatics

Journal of Computer-Aided Molecular Design

Journal of Molecular Graphics amp Modeling

Journal of Computational Chemistry

Journal of Medicinal Chemistry

Reviews in Computational Chemistry

Drug Discovery Today

BMC Bioinformatics

Nature Reviews Drug Discovery

Expert Opinion on Drug Discovery

WIRES computational Molecular Science

Molecular

Informatics

Includes all aspects of the study of molecules on computers

Also includes Chem(o)informatics

This includes the representation of molecules databases display

simulation prediction of their properties and the discovery and

design of new molecules and materials

Molecular informatics is closely related to bioinformatics

computational chemistry molecular modelling simulation machine

learning and statistics as well as online publications - but the area

has principally been driven by investment in new methods for drug

discovery hence the concentration on small organic molecules

Cambridge HPC

Places to find Molecular

Informatics apps

bull httpwwwmacinchemorgmobilescience

bull Molecules ndash eg RSC-Chemspider)

bull Publishers (eg ACSRSC mobile)

bull Calculations (eg Yield101 for Rxns)

bull Visualisation (eg Pymol for proteins)

J Chem Educ 2013 90 (3) pp 320ndash325

DOI 101021ed300329e

Cheminformatics 101

How do we store molecules on the computer

There are estimated to be 1060 possible small molecules that

could be made How do we find the best molecule for the

problem we are addressing Letrsquos take a look ldquounder the

bonnetrdquo of the way molecules are actually manipulated on the

computer You will be familiar with

1 Trivial name eg Morphine

2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-

methylmorphinan-36-diol

However these names do not convey the structure of molecules

in a way the computer can readily understand We need to

convert these into ldquomachine readable formatsrdquo which allows

ease of searching based on the complexities of molecular

structure But what is a molecule

Bear this in mind Molecules are complicated When we look at this scene we add a

huge amount of information from our senses and knowledge ndash but it nearly all gets

lost in computational representation

Representing chemistry needs to be engineered to represent materials and processes

As you will see we are moving in that direction with more complete representations

of molecules and materials

Not (5α6α)-78-

didehydro-45-

epoxy-17-

methylmorphinan-

36-diol

A real life mixture

What is a molecule

is it a series of connected points

a wave function

the sum of its properties

In the computer molecules are therefore abstractions and interpretations of data

So more experimental data and an appropriate description of a molecule may

translate to a wider reality

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by as a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simpe graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

benzene
ACDLabs0812062058
August 2013
6 6 0 0 0 0 0 0 0 0 1 V2000
1.9050 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9050 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 -0.1282 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7531 -2.7882 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.3987 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.3987 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
3 1 2 0 0 0 0
4 2 2 0 0 0 0
5 3 1 0 0 0 0
6 4 1 0 0 0 0
6 5 2 0 0 0 0
M END
$$$$

Line 1: molecule name

Line 2: information on this molecule

Line 3: comment (e.g. the date is used here)

"Counts line": has the number of atoms and bonds as a minimum

"Atom block": has the xyz coordinates of the atom and the element as a minimum

"Bond block": one line for each bond (from atom, to atom, bond type)

"M END" and "$$$$": these identify the end of this molecule
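A record like the benzene example can be read with a short Python sketch. It assumes the usual fixed-width V2000 layout (three-character fields on the counts and bond lines) and restores the decimal points in the coordinates; `parse_molblock` is a name chosen for this illustration, not a library function, and atom lines are split on whitespace for brevity rather than read by column.

```python
MOLBLOCK = """benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molblock(text):
    """Return (name, atoms, bonds) from a single V2000 connection table."""
    lines = text.splitlines()
    name = lines[0].strip()
    # Counts line: atom and bond counts sit in the first two 3-character fields.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # from-atom, to-atom, bond type (1 = single, 2 = double)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds


name, atoms, bonds = parse_molblock(MOLBLOCK)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

The reader relies entirely on position: the counts line must be the fourth line, and the atom block must follow it immediately, which is what "a precisely described order of data" means in practice.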

Alanine SD file

[Figure: annotated SD file for alanine. Bond length is 1.53 Å. Molecule is chiral.]

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, ...

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
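The charge column is a code rather than the charge itself, so a reader has to translate it. A minimal sketch of that lookup, using only the code table given above (the function name `decode_charge` is hypothetical):

```python
# Atom-block charge codes as listed on the slide.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:
        # Code 4 marks a doublet radical rather than a charge.
        return 0, True
    return CHARGE_CODES[code], False


print(decode_charge(3))   # the code for +1
print(decode_charge(5))   # the code for -1
```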

What if we only know the atom positions and not the bonds?

A key example is X-ray crystallography. Here we determine the positions of the atoms and infer the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so the positions are stored and the bonds inferred. The format for storing these is a bit different.

[Figure: electron density map. Which way round should this ring go?]
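One common way to infer bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below illustrates that idea only; the radii are approximate values and the 0.45 Å tolerance is an assumption chosen for the example, and the water coordinates are invented test data.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45  # assumed slack, in angstroms


def infer_bonds(atoms):
    """atoms: list of (symbol, (x, y, z)). Returns index pairs judged to be bonded."""
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Invented coordinates for a water molecule, in angstroms.
water = [
    ("O", (0.000, 0.000, 0.000)),
    ("H", (0.960, 0.000, 0.000)),
    ("H", (-0.240, 0.930, 0.000)),
]
print(infer_bonds(water))
```

A distance test gives connectivity but not bond order, and it says nothing about which of two similar-looking atoms (such as N and O) sits at a given position, which is where chemical knowledge comes in.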

Example Protein Data Bank file (pdb)

(connectivity is stored as an adjacency list in CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION. 1.80 ANGSTROMS.
REMARK 200 TEMPERATURE (KELVIN) : 113
REMARK 200 PH : 6.9
SEQRES   1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588 55.106 64.827 90.00 90.00 90.00 P 21 21 21 4
ORIGX1   1.000000 0.000000 0.000000 0.00000
ORIGX2   0.000000 1.000000 0.000000 0.00000
ORIGX3   0.000000 0.000000 1.000000 0.00000
SCALE1   0.018319 0.000000 0.000000 0.00000
SCALE2   0.000000 0.018147 0.000000 0.00000
SCALE3   0.000000 0.000000 0.015426 0.00000
ATOM      1 N   VAL A   1    3.036 -1.035 -3.538 1.00 31.28
ATOM      2 CA  VAL A   1    4.283 -0.343 -3.015 1.00 28.41
ATOM      3 C   VAL A   1    4.565  1.067 -3.593 1.00 27.64
ATOM      4 O   VAL A   1    4.653  1.287 -4.853 1.00 27.17
ATOM      5 CB  VAL A   1    5.510 -1.245 -3.141 1.00 30.01
HETATM 1510 S   SO4 A 187   25.993 -1.362  5.893 1.00 22.34
HETATM 1511 O1  SO4 A 187   25.504 -1.159  4.508 1.00 22.13
HETATM 1512 O2  SO4 A 187   27.282 -0.770  6.155 1.00 24.10
HETATM 1513 O3  SO4 A 187   24.953 -1.035  6.850 1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
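The CONECT records can be gathered into a neighbour table with a short Python sketch. It splits each line on whitespace for simplicity, which is an assumption that only holds while atom serial numbers stay short enough not to run together; the real format is defined by fixed columns. The name `read_conect` is made up for this example.

```python
PDB_LINES = """\
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
"""


def read_conect(text):
    """Collect CONECT records into {atom_serial: [bonded atom serials]}."""
    neighbours = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "CONECT":
            serial, *partners = (int(f) for f in fields[1:])
            neighbours.setdefault(serial, []).extend(partners)
    return neighbours


print(read_conect(PDB_LINES))
```

Note that only the non-protein (HETATM) atoms carry explicit bonds here; the bonds within the protein are not listed and are inferred from the residue names.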

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix

– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

– An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out-of-plane bending (e.g. a carbonyl)

Non-bonded distance

[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
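To see how internal coordinates map back to Cartesian ones, here is a sketch of a Z-matrix to xyz conversion in Python. It is an illustration under stated assumptions, not the routine any particular program uses: variables are assumed to be already substituted by their values (0.96 Å and 104.0° for water), the first atom is placed at the origin, the second along z, the third in the xz plane, and the sign convention chosen for the dihedral is this sketch's own.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _scale(a, s):
    return tuple(x * s for x in a)


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)


def zmatrix_to_cartesian(rows):
    """rows: (symbol,), (symbol, NA, r), (symbol, NA, r, NB, angle) or
    (symbol, NA, r, NB, angle, NC, dihedral); atom numbers start at 1,
    distances in angstroms, angles in degrees."""
    coords = []
    for row in rows:
        if len(row) == 1:                      # first atom: origin
            coords.append((0.0, 0.0, 0.0))
            continue
        a = coords[row[1] - 1]                 # atom NA
        r = row[2]
        if len(row) == 3:                      # second atom: along z
            coords.append(_add(a, (0.0, 0.0, r)))
            continue
        b = coords[row[3] - 1]                 # atom NB
        theta = math.radians(row[4])
        bc = _unit(_sub(a, b))                 # unit vector from NB to NA
        along = _scale(bc, -r * math.cos(theta))
        if len(row) == 5:
            # Third atom: atoms 1 and 2 lie on z, so x is perpendicular.
            offset = _add(along, (r * math.sin(theta), 0.0, 0.0))
        else:
            c = coords[row[5] - 1]             # atom NC
            phi = math.radians(row[6])
            n = _unit(_cross(_sub(b, c), bc))  # normal to the NC-NB-NA plane
            m = _cross(n, bc)                  # in-plane, perpendicular to bc
            offset = _add(along, _add(
                _scale(m, r * math.sin(theta) * math.cos(phi)),
                _scale(n, r * math.sin(theta) * math.sin(phi))))
        coords.append(_add(a, offset))
    return coords


water = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
coords = zmatrix_to_cartesian(water)
for (symbol, *_), xyz in zip(water, coords):
    print(symbol, " ".join(f"{v:8.4f}" for v in xyz))
```

Changing a single number in the Z-matrix (say the angle) moves the atom along a chemically meaningful path, which is why internal coordinates are convenient for driving a reaction coordinate.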

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Figure labels: z-matrix; 'improper' torsion angles)

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties etc.

- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
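What "can be parsed" means in practice is that a general-purpose XML library can read the file without any chemistry-specific code. The sketch below uses Python's standard `xml.etree.ElementTree` on a simplified, CML-style description of ethanol (heavy atoms only). The fragment is an assumption written for this example: it omits the XML namespace, and the `property` element with its `title` and `units` attributes is a simplified stand-in for how CML attaches metadata.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style fragment for ethanol (illustrative, not a complete CML file).
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>
"""

root = ET.fromstring(CML_TEXT)

# Atoms: id -> element symbol
atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}

# Bonds: (atom id, atom id, bond order)
bonds = [tuple(b.get("atomRefs2").split()) + (int(b.get("order")),)
         for b in root.iter("bond")]

# Metadata travels with the value: the number and its units are stored together.
prop = root.find("property")
print(atoms)
print(bonds)
print(prop.get("title"), float(prop.text), prop.get("units"))
```

Unlike the SD file, nothing here depends on line or column position: each value is found by its tag and attribute names, so new kinds of data can be added without breaking existing readers.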

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction; see e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
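A systematic torsion scan can be sketched as a grid over all rotatable bonds. The Python below only enumerates the grid points (it builds no geometries), and the function name `torsion_grid` and the 120° step are choices made for this illustration.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg):
    """All combinations of torsion angles on a regular grid (degrees)."""
    angles = range(0, 360, step_deg)
    return list(product(angles, repeat=n_torsions))


grid = torsion_grid(3, 120)
print(len(grid), "candidate conformations")
print(grid[:4])
```

The number of grid points is (360 / step) raised to the number of torsions, so the search grows very quickly with molecular flexibility; this is why practical methods sample or prune the space rather than enumerate it.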

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd
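The film-strip picture can be made concrete: if each frame is one SD record, the "$$$$" terminator is what separates the frames. The sketch below follows that analogy only (simulation packages generally use their own trajectory formats); `split_sd_records` is a hypothetical helper and the two-frame text is placeholder data.

```python
def split_sd_records(text):
    """Split concatenated SD records on the '$$$$' terminator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Placeholder "trajectory": two frames, each ending with the terminator line.
trajectory = (
    "frame 1\n(connection table for frame 1)\nM  END\n$$$$\n"
    "frame 2\n(connection table for frame 2)\nM  END\n$$$$\n"
)

frames = split_sd_records(trajectory)
print(len(frames), "frames")
```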

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There are recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
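At its simplest, converting between formats means reading the fields of one layout and writing them out in another. The Python sketch below turns a V2000 connection table into the plain XYZ format (atom count, a title line, then one "symbol x y z" line per atom). It is a hand-written illustration of the idea rather than how a conversion program such as Babel is implemented, and the water record is invented test data.

```python
WATER_MOL = """water
  example

  3  2  0  0  0  0  0  0  0  0  1 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.9600    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2400    0.9300    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
M  END
"""


def molblock_to_xyz(molblock):
    """Convert one V2000 connection table to XYZ text (the bonds are dropped)."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])          # atom count from the counts line
    out = [str(n_atoms), lines[0].strip()]
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        out.append(f"{symbol} {float(x):.4f} {float(y):.4f} {float(z):.4f}")
    return "\n".join(out)


xyz = molblock_to_xyz(WATER_MOL)
print(xyz)
```

The conversion is lossy: XYZ has no place for bonds, charges or stereo flags, so converting back would require inferring them again. This is one reason a round trip through several formats can silently change a structure.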

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?

– Finding the right compound

– Designing compounds

– Searching for compound data

– Patents





Journals of Molecular/Cheminformatics you may wish to follow up on

Journal of Chemical Information and Modeling

Journal of Chemical Theory and Computation

Journal of Cheminformatics

Journal of Computer-Aided Molecular Design

Journal of Molecular Graphics & Modelling

Journal of Computational Chemistry

Journal of Medicinal Chemistry

Reviews in Computational Chemistry

Drug Discovery Today

BMC Bioinformatics

Nature Reviews Drug Discovery

Expert Opinion on Drug Discovery

WIREs Computational Molecular Science

Molecular Informatics

Includes all aspects of the study of molecules on computers. Also includes chem(o)informatics.

This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.

Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications, but the area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.

Cambridge HPC

Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ. 2013, 90 (3), pp 320-325, DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert them into "machine-readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.

[Figure: not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol, but a real-life mixture]

What is a molecule?

• Is it a series of connected points?
• A wave function?
• The sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional: line notations

A line notation is a string representation of a molecule. Here are three examples of different line notations; SMILES is the most useful and widely used. These are 'strings' all of the same molecule:

• WLN: L66J BMR& DSWQ IN1&1
• ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
• SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
• IUPAC name: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.
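Because a line notation is just a string, a computer can process it with ordinary text tools. The following is a minimal sketch, not a full SMILES parser: the token pattern covers only a simplified subset of the grammar, and the function names `tokenize` and `heavy_atom_counts` are made up for this illustration. It splits a SMILES string into tokens and counts the non-hydrogen atoms.

```python
import re
from collections import Counter

# Simplified token pattern (an assumption, not the complete SMILES grammar):
# bracket atoms, Br/Cl, one-letter organic-subset atoms (upper case aliphatic,
# lower case aromatic), bond symbols, branches and ring-closure labels.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#:/\\.]|[()]|%\d\d|\d")


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is skipped."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES string")
    return tokens


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element (implicit hydrogens are ignored)."""
    counts = Counter()
    for tok in tokenize(smiles):
        if tok.startswith("["):
            # Keep only the element symbol from a bracket atom such as [CH] or [nH].
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})", tok).group(1)
            if symbol != "H":
                counts[symbol.capitalize()] += 1
        elif tok[0].isalpha():
            counts[tok.capitalize()] += 1
    return dict(counts)


dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(heavy_atom_counts(dye))
```

For real work a cheminformatics toolkit should do the parsing; the point here is only that atoms, bonds, branches and ring closures can all be read off the characters of the string.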

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from an InChI to a structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
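The aromatic-bond idea can be illustrated with a small labelled graph. This is a sketch under simple assumptions: the graph is a dictionary of atom labels plus a dictionary of bond orders, the ring atoms are supplied by hand rather than perceived, and the helper names `make_graph` and `with_aromatic_bonds` are hypothetical.

```python
def make_graph(atoms, bonds):
    """atoms: dict of atom id -> element label; bonds: iterable of (i, j, order)."""
    edges = {frozenset((i, j)): order for i, j, order in bonds}
    return atoms, edges


ring = [1, 2, 3, 4, 5, 6]
atoms = {i: "C" for i in ring}
pairs = [(ring[k], ring[(k + 1) % 6]) for k in range(6)]  # 1-2, 2-3, ..., 6-1

# The two Kekule structures of benzene: double bonds on alternate edges.
kekule_a = make_graph(atoms, [(i, j, 2 if k % 2 == 0 else 1) for k, (i, j) in enumerate(pairs)])
kekule_b = make_graph(atoms, [(i, j, 1 if k % 2 == 0 else 2) for k, (i, j) in enumerate(pairs)])


def with_aromatic_bonds(graph, ring_atoms):
    """Relabel every bond between two ring atoms as aromatic ("ar")."""
    atom_labels, edges = graph
    ring_set = set(ring_atoms)
    new_edges = {pair: ("ar" if pair <= ring_set else order) for pair, order in edges.items()}
    return atom_labels, new_edges


print(kekule_a == kekule_b)                                                # False
print(with_aromatic_bonds(kekule_a, ring) == with_aromatic_bonds(kekule_b, ring))  # True
```

With plain single and double bonds the two drawings compare as different graphs; once the ring bonds carry one aromatic label they compare as equal.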

An example of storing the molecular graph: the SD file format

MDL SD (structure data) files store the information about each molecule in the following order:

Header block
    describes the molecule, e.g. its name
Connection table
    defines the molecular structure (atoms and bonds)
Data block (optional)
    properties, e.g. volume
Terminator line
    a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules: SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

Connection table (reduced):

Connect 3 2 5 4    (e.g. PDB protein file format)
Bond 3 5 1         (e.g. SD file format)
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Annotations:

• Line 1: molecule name. Line 2: information on this molecule. Line 3: comment (e.g. the date is used here).
• "Counts line": has the number of atoms and bonds as a minimum.
• "Atom block": has the xyz coordinates of the atom and the element as a minimum.
• "Bond block": one line for each bond (from atom, to atom, bond type).
• "M  END" and "$$$$": these identify the end of this molecule.
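The block structure described above can be read with a few lines of code. This is a minimal sketch that assumes a single V2000 record and splits on whitespace for brevity; the real format is fixed-width, so a production reader should slice by column instead. The function name `parse_molfile` is made up for this example.

```python
import math

BENZENE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header block, line 1
    counts = lines[3].split()                    # counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bond block
        first, second, order = (int(field) for field in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE)
# Distance between atoms 1 and 2 (atom numbers in the file start at 1).
print(name, len(atoms), len(bonds), round(math.dist(atoms[0][1:], atoms[1][1:]), 2))
```

Because the reader relies on the counts line to know where each block starts and ends, a wrong count or a missing line shifts everything after it, which is why the format has to be followed exactly.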

Alanine SD file

[Figure: annotated SD file for alanine. Bond length is 1.53 Å. Molecule is chiral.]

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, ...

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
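The charge column stores a code rather than the charge itself, so a reader has to translate it. A small sketch of that lookup follows; the names `CHARGE_CODES`, `formal_charge` and `is_doublet_radical` are illustrative, and treating code 4 (doublet radical) as formal charge 0 is an assumption made here.

```python
# Charge code in the atom block -> formal charge, following the table above.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 4: 0, 5: -1, 6: -2, 7: -3}


def formal_charge(code):
    """Translate an atom-block charge code into a formal charge."""
    if code not in CHARGE_CODES:
        raise ValueError(f"unknown charge code: {code}")
    return CHARGE_CODES[code]


def is_doublet_radical(code):
    """Code 4 marks a doublet radical rather than a charge."""
    return code == 4


print([formal_charge(code) for code in range(8)])
```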

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density map. Which way round should this ring go?]
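One common way to impute bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate radii in Å and a tolerance of 0.4 Å; both are illustrative choices, not values from the lecture, and the sulfate coordinates are used purely as example input.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def perceive_bonds(atoms, tolerance=0.4):
    """atoms: list of (element, (x, y, z)). Returns index pairs judged bonded."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
print(perceive_bonds(sulfate))
```

The test finds the three S-O contacts and rejects the O...O pairs, but it says nothing about bond order, which still has to come from chemical knowledge.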

Example Protein Data Bank file (pdb)
(bonds are listed per atom in CONECT records, i.e. an adjacency list)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations:

• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds

New format: mmCIF
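PDB records are fixed-width, so a reader slices each line by column rather than splitting on spaces. The following sketch reads only ATOM, HETATM and CONECT records and ignores everything else; the function name `parse_pdb` and the dictionary keys are choices made for this illustration, and the sample is a short excerpt written in the standard column layout.

```python
SAMPLE = """\
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511
CONECT 1511 1510
END
"""


def parse_pdb(text):
    """Return (atoms, bonds): atoms keyed by serial number, bonds as a set of pairs."""
    atoms, bonds = {}, set()
    for line in text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            serial = int(line[6:11])
            atoms[serial] = {
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),   # temperature factor
            }
        elif record == "CONECT":
            # Serial numbers sit in consecutive 5-character fields after the record name.
            fields = [line[k:k + 5] for k in range(6, len(line.rstrip()), 5)]
            serials = [int(field) for field in fields if field.strip()]
            for other in serials[1:]:
                bonds.add(frozenset((serials[0], other)))
    return atoms, bonds


atoms, bonds = parse_pdb(SAMPLE)
print(atoms[1510]["xyz"], sorted(sorted(pair) for pair in bonds))
```

Storing each bond as an unordered pair means the two CONECT lines, which list the same bond from either end, collapse into a single entry.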

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - Early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance

[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
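To see what a Z-matrix encodes, it helps to convert one back to Cartesian coordinates. The sketch below handles only the three-atom water case, with a bond length of 0.96 Å and an angle of 104.0 degrees; the placement convention (first atom at the origin, second on the z axis, third in the xz plane) is a common choice assumed here, and `water_from_zmatrix` is a made-up name.

```python
import math


def water_from_zmatrix(l1=0.96, a1=104.0):
    """Cartesian coordinates (angstroms) for O, H, H from one length and one angle."""
    theta = math.radians(a1)
    oxygen = (0.0, 0.0, 0.0)                 # atom 1: placed at the origin
    h_first = (0.0, 0.0, l1)                 # atom 2: distance l1 from atom 1, on the z axis
    # atom 3: distance l1 from atom 1, making angle a1 with atom 2, in the xz plane
    h_second = (l1 * math.sin(theta), 0.0, l1 * math.cos(theta))
    return [("O", oxygen), ("H", h_first), ("H", h_second)]


coords = water_from_zmatrix()
h_h_distance = math.dist(coords[1][1], coords[2][1])
print(round(h_h_distance, 3))
```

Changing `a1` or `l1` moves the atoms along chemically meaningful directions, which is why internal coordinates are convenient for driving a reaction coordinate.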

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: z-matrix; 'improper' torsion angles]

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
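As a rough illustration of structure plus metadata in one XML document, the sketch below builds a small CML-style record for ethanol with Python's standard library. The `molecule`, `atomArray`, `atom`, `bondArray` and `bond` element names follow CML conventions, but the `property` element with `title` and `units` attributes is a simplified, hypothetical stand-in for real CML property markup, and `ethanol_cml` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a small CML-style XML string for ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")
    ET.SubElement(molecule, "name").text = "ethanol"

    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

    bond_array = ET.SubElement(molecule, "bondArray")
    for atom_refs in ["a1 a2", "a2 a3"]:
        ET.SubElement(bond_array, "bond", atomRefs2=atom_refs, order="1")

    # Metadata travels with the value: here the units are stored alongside the number.
    # (Simplified, hypothetical markup; ethanol boils at roughly 78 degrees Celsius.)
    prop = ET.SubElement(molecule, "property", title="boiling point", units="celsius")
    prop.text = "78.4"
    return ET.tostring(molecule, encoding="unicode")


document = ethanol_cml()
parsed = ET.fromstring(document)
print(len(parsed.findall("./atomArray/atom")), parsed.find("property").get("units"))
```

Because the document is plain XML, any XML parser can read it back, and further elements can be added without breaking programs that only look for the atoms and bonds.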

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
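Although an InChI is not meant to be read by people, it is built from layers separated by slashes, and these are easy to pull apart in code. The sketch below assumes a simple identifier in which each layer prefix letter appears only once (more complex InChIs can repeat prefixes); `split_inchi` is a hypothetical helper name.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    version, formula, *layers = body.split("/")
    # Each remaining layer starts with a one-letter prefix, e.g. c = connections,
    # h = hydrogens.
    return {
        "version": version,
        "formula": formula,
        "layers": {layer[0]: layer[1:] for layer in layers},
    }


ethanol = split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"], ethanol["layers"])
```

For ethanol the connection layer `1-2-3` says atom 1 is joined to atom 2 and atom 2 to atom 3, and the hydrogen layer assigns one hydrogen to atom 3, two to atom 2 and three to atom 1.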

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
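A systematic torsion scan can be sketched as a grid over every rotatable bond, which also shows why the number of dimensions matters. The function names and the step sizes below are illustrative assumptions; a real conformer search would also build and score each geometry.

```python
from itertools import product


def torsion_grid(n_torsions, step_degrees):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_degrees):
    """Number of grid points: (angles per torsion) ** (number of torsions)."""
    return len(range(0, 360, step_degrees)) ** n_torsions


print(list(torsion_grid(2, 120)))   # 9 combinations for two torsions
print(grid_size(2, 10))             # a Ramachandran-style 36 x 36 grid
print(grid_size(6, 30))             # six torsions at 30 degree steps
```

Two torsions at 10 degree steps give 1296 points, but six torsions at a much coarser 30 degree step already give nearly three million, which is why exhaustive scans are limited to small numbers of rotatable bonds.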

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats: the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents

Page 5: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Molecular

Informatics

Includes all aspects of the study of molecules on computers

Also includes Chem(o)informatics

This includes the representation of molecules databases display

simulation prediction of their properties and the discovery and

design of new molecules and materials

Molecular informatics is closely related to bioinformatics

computational chemistry molecular modelling simulation machine

learning and statistics as well as online publications - but the area

has principally been driven by investment in new methods for drug

discovery hence the concentration on small organic molecules

Cambridge HPC

Places to find Molecular

Informatics apps

bull httpwwwmacinchemorgmobilescience

bull Molecules ndash eg RSC-Chemspider)

bull Publishers (eg ACSRSC mobile)

bull Calculations (eg Yield101 for Rxns)

bull Visualisation (eg Pymol for proteins)

J Chem Educ 2013 90 (3) pp 320ndash325

DOI 101021ed300329e

Cheminformatics 101

How do we store molecules on the computer

There are estimated to be 1060 possible small molecules that

could be made How do we find the best molecule for the

problem we are addressing Letrsquos take a look ldquounder the

bonnetrdquo of the way molecules are actually manipulated on the

computer You will be familiar with

1 Trivial name eg Morphine

2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-

methylmorphinan-36-diol

However these names do not convey the structure of molecules

in a way the computer can readily understand We need to

convert these into ldquomachine readable formatsrdquo which allows

ease of searching based on the complexities of molecular

structure But what is a molecule

Bear this in mind Molecules are complicated When we look at this scene we add a

huge amount of information from our senses and knowledge ndash but it nearly all gets

lost in computational representation

Representing chemistry needs to be engineered to represent materials and processes

As you will see we are moving in that direction with more complete representations

of molecules and materials

Not (5α6α)-78-

didehydro-45-

epoxy-17-

methylmorphinan-

36-diol

A real life mixture

What is a molecule

is it a series of connected points

a wave function

the sum of its properties

In the computer molecules are therefore abstractions and interpretations of data

So more experimental data and an appropriate description of a molecule may

translate to a wider reality

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by as a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simpe graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical 5 = -

1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
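PDB records are fixed-column text, so a reader slices each line by character position rather than splitting on spaces. The sketch below assumes the standard column layout for ATOM and HETATM records. The helper `format_atom` is hypothetical and only exists to build a correctly aligned test line from the sulfate atom in the example.

```python
def format_atom(record, serial, name, res, chain, res_seq, x, y, z, occ, b):
    """Build a fixed-column ATOM/HETATM line (hypothetical helper)."""
    return (f"{record:<6}{serial:>5}  {name:<3} {res:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


def parse_atom(line):
    """Slice one ATOM/HETATM record by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),  # the temperature factor
    }


line = format_atom("HETATM", 1510, "S", "SO4", "A", 187,
                   25.993, -1.362, 5.893, 1.00, 22.34)
atom = parse_atom(line)
print(atom["name"], atom["xyz"], atom["b_factor"])
```

Column slicing matters because neighbouring fields can touch when a number fills its whole width, in which case splitting on whitespace fails.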

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. when modelling a reaction)
  - An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
      Bond length
      Bond angle
      Torsion angle
      Out-of-plane bending (e.g. a carbonyl)
      Non-bonded distance

[Figure: internal coordinates of a molecule, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
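To see how internal coordinates turn back into Cartesian ones, the water Z-matrix can be expanded by hand. The sketch below assumes the values l1 = 0.96 Å and a1 = 104.0 degrees. Placing the first atom at the origin and the second along the z axis is an arbitrary but common choice, and the function name is made up for this example.

```python
import math


def water_from_zmatrix(l1: float, a1_deg: float):
    """Place O, H, H from one bond length and the H-O-H angle."""
    a1 = math.radians(a1_deg)
    oxygen = (0.0, 0.0, 0.0)   # atom 1: at the origin
    h_first = (0.0, 0.0, l1)   # atom 2: distance l1 from atom 1, along z
    # atom 3: distance l1 from atom 1, at angle a1 to atom 2, in the xz plane
    h_second = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))
    return [("O", oxygen), ("H", h_first), ("H", h_second)]


atoms = water_from_zmatrix(0.96, 104.0)
h_h_distance = math.dist(atoms[1][1], atoms[2][1])
print(round(h_h_distance, 3))  # the non-bonded H...H distance follows from l1 and a1
```

Changing a1 alone moves one hydrogen along an arc while both O-H bond lengths stay fixed, which is awkward to express directly in xyz.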

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Note the 'improper' torsion angles in this Z-matrix.

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci., 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
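Because CML is XML, any standard XML parser can read it. The fragment below is a hand-written, heavy-atom-only sketch of ethanol. It uses CML-style element and attribute names (`atomArray`, `elementType`, `bondArray`, `atomRefs2`, `order`) but is an assumption for illustration, not the file shown on the slide, and it omits namespaces and hydrogens.

```python
import xml.etree.ElementTree as ET

# Hand-written CML-style fragment for ethanol (heavy atoms only; illustrative).
CML_ETHANOL = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""

root = ET.fromstring(CML_ETHANOL.strip())
# Map atom id to element, and list bonds as (from, to, order).
atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
bonds = [tuple(b.get("atomRefs2").split()) + (b.get("order"),)
         for b in root.iter("bond")]
print(atoms)
print(bonds)
```

Extra elements such as properties or units can be added to the file without breaking a reader like this one, since it only looks for the tags it knows.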

Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these, atom positions, bonds, coordinates etc. can be obtained.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
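A systematic torsion scan can be pictured as a grid over every rotatable bond. The sketch below only enumerates the grid points, assuming a regular angular step; scoring each conformation would be done by a separate energy calculation. The function names are made up for this example.

```python
from itertools import product


def torsion_grid(n_torsions: int, step_deg: int = 60):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions: int, step_deg: int = 60) -> int:
    """Number of grid points without enumerating them."""
    return (360 // step_deg) ** n_torsions


# Two torsions at 120 degree steps: a small Ramachandran-like grid.
for combo in torsion_grid(2, 120):
    print(combo)

# Ten torsions at 30 degree steps is already far too many to enumerate.
print(grid_size(10, 30))
```

The count grows as a power of the number of torsions, which is why practical conformer generators sample or prune rather than scan exhaustively.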

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
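The film-strip idea can be sketched with the simple XYZ format instead of SD files (a substitution made here for brevity): each frame is an atom count, a comment line and one line per atom, and the frames are simply concatenated. The function names and the stretched O-H bond are assumptions for illustration.

```python
def write_xyz_frames(frames, comment="frame"):
    """Concatenate frames, each a list of (element, (x, y, z)), as XYZ text."""
    lines = []
    for k, frame in enumerate(frames):
        lines.append(str(len(frame)))
        lines.append(f"{comment} {k}")
        for element, (x, y, z) in frame:
            lines.append(f"{element} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"


def read_xyz_frames(text):
    """Split concatenated XYZ text back into a list of frames."""
    lines = text.splitlines()
    frames, pos = [], 0
    while pos < len(lines):
        n_atoms = int(lines[pos])
        block = lines[pos + 2 : pos + 2 + n_atoms]
        frames.append([(p[0], tuple(map(float, p[1:4])))
                       for p in (row.split() for row in block)])
        pos += 2 + n_atoms
    return frames


# Two snapshots of an O-H bond being stretched.
trajectory = [
    [("O", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.96))],
    [("O", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 1.05))],
]
text = write_xyz_frames(trajectory)
print(text)
```

Real molecular dynamics packages usually store trajectories in compact binary formats, but the frame-after-frame layout is the same.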

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                     prep -- Amber PREP file
bs -- Ball & Stick file                 caccrt -- Cacao Cartesian file
ccc -- CCC file                         c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file         cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                dmol -- DMol3 Coordinates file
feat -- Feature file                    gam -- GAMESS Output file
gamout -- GAMESS Output file            gpr -- Ghemical Project file
mm1gp -- Ghemical MM file               qm1gp -- Ghemical QM file
hin -- HyperChem HIN file               jout -- Jaguar Output file
bin -- OpenEye Binary file              mmd -- MacroModel file
mmod -- MacroModel file                 out -- MacroModel file
dat -- MacroModel file                  car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                sd -- MDL Isis SDF file
mdl -- MDL Molfile file                 mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file          mopout -- MOPAC Output file
mmads -- MMADS file                     mpqc -- MPQC file
bgf -- MSI BGF file                     nwo -- NWChem Output file
pdb -- PDB file                         ent -- PDB file
pqs -- PQS file                         qcout -- Q-Chem Output file
res -- ShelX file                       ins -- ShelX file
smi -- SMILES file                      mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file              vmol -- ViewMol file

alc -- Alchemy file                     bs -- Ball & Stick file
caccrt -- Cacao Cartesian file          cacint -- Cacao Internal file
cache -- CAChe MolStruct file           c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file         ct -- ChemDraw Connection Table file
cht -- Chemtool file                    cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file                   box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file          feat -- Feature file
fh -- Fenske-Hall Z-Matrix file         gamin -- GAMESS Input file
inp -- GAMESS Input file                gcart -- Gaussian Cartesian file
gau -- Gaussian Input file              gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file              gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file               jin -- Jaguar Input file
bin -- OpenEye Binary file              mmd -- MacroModel file
mmod -- MacroModel file                 out -- MacroModel file
dat -- MacroModel file                  sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                 mdl -- MDL Molfile file
mol -- MDL Molfile file                 mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                     bgf -- MSI BGF file
csr -- MSI Quanta CSR file              nw -- NWChem Input file
pdb -- PDB file                         ent -- PDB file
pov -- POV-Ray Output file              pqs -- PQS file
report -- Report file                   qcin -- Q-Chem Input file
smi -- SMILES file                      fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                 txyz -- Tinker XYZ file
txt -- Titles file                      unixyz -- UniChem XYZ file
vmol -- ViewMol file                    xed -- XED file
xyz -- XYZ file                         zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There are recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers - things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen are confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents


Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS, RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ., 2013, 90 (3), pp 320-325, DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in the computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction with more complete representations of molecules and materials.

[Figure caption: Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol. A real-life mixture.]

What is a molecule? Is it a series of connected points? A wave function? The sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional - Line notations: a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.

• Line notations
  - WLN
      L66J BMR& DSWQ IN1&1
  - ROSDAL
      1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
  - SMILES
      c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  - IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
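A SMILES string can be read token by token, which is roughly what a parser does first. The sketch below is a simplified tokenizer, not a full SMILES parser: it ignores implicit hydrogens, does not validate ring closures, and its regular expression is an assumption that covers only the common organic subset plus bracket atoms.

```python
import re
from collections import Counter

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# ring-closure labels, then bond and branch symbols.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#$:/\\.()+@]")
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def element_counts(smiles: str) -> Counter:
    """Count explicitly written atoms per element (implicit H is ignored)."""
    counts = Counter()
    for token in TOKEN.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET_SYMBOL.match(token).group(1)
            counts[symbol.capitalize()] += 1
        elif token[0].isalpha():
            counts[token.capitalize()] += 1  # aromatic 'c' counts as C
    return counts


artemisinin = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"
dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(element_counts(artemisinin))
print(element_counts(dye))
```

The heavy-atom counts agree with the molecular formulae C15H22O5 for artemisinin and C18H18N2O3S for the naphthalene sulphonic acid, a useful sanity check on a string that is hard to read by eye.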

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials, which are not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from an InChI to a structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider
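Although an InChI is hard to read, it is built from layers separated by "/", and matching one is a plain string comparison. The sketch below splits the ethanol InChI quoted in these notes (with its "/" and "," separators assumed restored) into layers. It handles only simple cases: richer InChI strings can repeat layer prefixes, which this dictionary would overwrite.

```python
def inchi_layers(inchi: str) -> dict:
    """Split a simple InChI into version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]  # e.g. 'c' connections, 'h' hydrogens
    return layers


ETHANOL = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = inchi_layers(ETHANOL)
print(layers)

# Exact-match lookup: the identifier is just a dictionary key.
registry = {ETHANOL: "ethanol"}
print(registry.get("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Because the same molecule should always give the same string, a database can index compounds by InChI without doing any graph matching at lookup time.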

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are, of course, more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• E.g. to solve this we could introduce the concept of an aromatic bond
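The problem with the two canonical (Kekulé) structures can be shown with a very small labelled graph. In this sketch each bond is an unordered atom pair plus a bond order. The atom numbering is assumed to be the same in both structures, so no graph matching is needed, and the "aromatic" label is an illustrative stand-in for an aromatic bond type.

```python
def bond_set(bonds):
    """Store bonds as (unordered atom pair, order) so direction is irrelevant."""
    return {(frozenset(pair), order) for pair, order in bonds}


def as_aromatic(bonds):
    """Replace every bond order in the ring with a single 'aromatic' label."""
    return {(pair, "aromatic") for pair, _ in bonds}


# Six carbons numbered 1..6 around the benzene ring.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
kekule_a = bond_set(zip(ring, [2, 1, 2, 1, 2, 1]))
kekule_b = bond_set(zip(ring, [1, 2, 1, 2, 1, 2]))

print(kekule_a == kekule_b)                            # simple graph: different
print(as_aromatic(kekule_a) == as_aromatic(kekule_b))  # aromatic bonds: same
```

In general the atoms will not be numbered identically in two records, so real software also has to canonicalise the numbering or test for graph isomorphism.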

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of the information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)

"counts line" has the number of atoms and bonds as a minimum

"atom block" has xyz coordinates of the atom and element as a minimum

"bond block": one line for each bond (from atom, to atom, bond type)

"M  END" and "$$$$": these identify the end of this molecule
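Reading such a record follows the annotations directly: take the counts line, then slice that many atom and bond lines by column. The sketch below assumes the fixed V2000 field widths (three characters per count and per bond field, ten per coordinate). The helper `build_molfile` is hypothetical and only writes a correctly aligned benzene record to parse.

```python
COORDS = [(1.9050, -0.7932), (1.9050, -2.1232), (0.7531, -0.1282),
          (0.7531, -2.7882), (-0.3987, -0.7932), (-0.3987, -2.1232)]
BONDS = [(2, 1, 1), (3, 1, 2), (4, 2, 2), (5, 3, 1), (6, 4, 1), (6, 5, 2)]


def build_molfile(name, coords, bonds):
    """Assemble a V2000 record for carbon atoms (hypothetical helper)."""
    lines = [name, "  example", ""]  # three header lines
    lines.append(f"{len(coords):3d}{len(bonds):3d}" + "  0" * 8 + "  1 V2000")
    for x, y in coords:
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} C  " + " 0" + "  0" * 11)
    for a, b, order in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}" + "  0" * 4)
    lines += ["M  END", "$$$$"]
    return "\n".join(lines) + "\n"


def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record."""
    lines = text.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])  # counts line
    atoms = [(row[31:34].strip(), float(row[0:10]), float(row[10:20]), float(row[20:30]))
             for row in lines[4:4 + n_atoms]]
    first_bond = 4 + n_atoms
    bonds = [(int(row[0:3]), int(row[3:6]), int(row[6:9]))
             for row in lines[first_bond:first_bond + n_bonds]]
    return lines[0], atoms, bonds


name, atoms, bonds = parse_molfile(build_molfile("benzene", COORDS, BONDS))
print(name, len(atoms), bonds)
```

The bond list that comes back is the alternating single and double pattern of one Kekulé structure, which is exactly what the file stores.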


Page 7: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Cheminformatics 101

How do we store molecules on the computer

There are estimated to be 1060 possible small molecules that

could be made How do we find the best molecule for the

problem we are addressing Letrsquos take a look ldquounder the

bonnetrdquo of the way molecules are actually manipulated on the

computer You will be familiar with

1 Trivial name eg Morphine

2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-

methylmorphinan-36-diol

However these names do not convey the structure of molecules

in a way the computer can readily understand We need to

convert these into ldquomachine readable formatsrdquo which allows

ease of searching based on the complexities of molecular

structure But what is a molecule

Bear this in mind Molecules are complicated When we look at this scene we add a

huge amount of information from our senses and knowledge ndash but it nearly all gets

lost in computational representation

Representing chemistry needs to be engineered to represent materials and processes

As you will see we are moving in that direction with more complete representations

of molecules and materials

Not (5α6α)-78-

didehydro-45-

epoxy-17-

methylmorphinan-

36-diol

A real life mixture

What is a molecule

is it a series of connected points

a wave function

the sum of its properties

In the computer molecules are therefore abstractions and interpretations of data

So more experimental data and an appropriate description of a molecule may

translate to a wider reality

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES string is then available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider
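The "one molecule, one InChI" idea means an identifier lookup reduces to exact string matching. The sketch below assumes a tiny in-memory table (the `database` dictionary and the helper names are hypothetical) and uses standard InChI strings for three small molecules.

```python
# Hypothetical in-memory "database" keyed by InChI strings.
database = {
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "ethanol",
    "InChI=1S/CH4O/c1-2/h2H,1H3": "methanol",
    "InChI=1S/H2O/h1H2": "water",
}

def lookup(inchi):
    """Exact string match: the same InChI should mean the same structure."""
    return database.get(inchi.strip())

def formula_layer(inchi):
    """Return the molecular-formula layer, the text after the version prefix."""
    return inchi.split("/")[1]

query = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(lookup(query), formula_layer(query))
```

No chemistry is needed at lookup time. All of the structural work is done once, by the algorithm that generated the identifier.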

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
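The aromatic-bond idea can be sketched with a labelled graph stored as a set of edges. This is an illustration under stated assumptions: the molecule is an ortho-disubstituted benzene (ring atoms 1 to 6, substituents 7 and 8), the atom numbering is fixed, the ring is already known to be aromatic, and the helper names `bond_set` and `aromatise` are hypothetical.

```python
def bond_set(bonds):
    """Edges of a labelled graph as a set of (atom pair, bond type) entries."""
    return {(frozenset(pair), order) for pair, order in bonds}

def aromatise(bonds, aromatic_ring):
    """Replace the bond type of every ring bond with a single 'aromatic' label."""
    ring_pairs = {frozenset(pair) for pair in aromatic_ring}
    return {(pair, "aromatic" if pair in ring_pairs else order)
            for pair, order in bonds}

ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
substituents = [((1, 7), 1), ((2, 8), 1)]      # single bonds to atoms 7 and 8

# Two Kekule drawings: alternating double (2) and single (1) bonds.
kekule_a = bond_set(list(zip(ring, [2, 1, 2, 1, 2, 1])) + substituents)
kekule_b = bond_set(list(zip(ring, [1, 2, 1, 2, 1, 2])) + substituents)

print(kekule_a == kekule_b)                                      # different graphs
print(aromatise(kekule_a, ring) == aromatise(kekule_b, ring))    # same graph
```

With explicit single and double bonds the two drawings compare as different; once the ring bonds carry one aromatic label they become identical. Deciding which rings are aromatic is the hard part and is not attempted here.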

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

Header block
    describes the molecule, e.g. its name
Connection table
    defines the molecular structure (atoms and bonds)
Data block (optional)
    properties, e.g. volume
Terminator line
    a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
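The block structure described above can be sketched as a small reader. This is a minimal illustration, not a complete SD parser: the two toy records, the `FORMULA` data item and the function names are assumptions made for the example, and only single-line data values are handled.

```python
SD_TEXT = """water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <FORMULA>
H2O

$$$$
methane
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <FORMULA>
CH4

$$$$
"""

def read_sd_records(text):
    """Split SD-file text into records; each record is a list of lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":          # terminator line closes a record
            records.append(current)
            current = []
        else:
            current.append(line)
    return records

def summarise(record):
    """Return (molecule name, data items) for one record."""
    name = record[0].strip()                # header block, line 1
    end = next(i for i, line in enumerate(record) if line.startswith("M  END"))
    data, key = {}, None
    for line in record[end + 1:]:           # optional data block
        if line.startswith(">"):            # data header such as "> <FORMULA>"
            key = line[line.index("<") + 1:line.rindex(">")]
        elif key and line.strip():
            data[key] = line.strip()
            key = None
    return name, data

for record in read_sd_records(SD_TEXT):
    print(summarise(record))
```

Because each record ends with its own terminator line, many molecules can simply be concatenated into one file.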

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format:
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M END" and "$$$$": these identify the end of this molecule
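Reading the counts line, atom block and bond block of the benzene example can be sketched as follows. This is a simplification under stated assumptions: fields are split on whitespace, whereas real V2000 files are fixed-width, so a robust reader would slice columns instead. The function name `parse_molfile` is hypothetical.

```python
import math

BENZENE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""

def parse_molfile(text):
    """Return (atoms, bonds) from a V2000 connection table."""
    lines = text.splitlines()
    # Counts line is the 4th line: number of atoms, then number of bonds.
    n_atoms, n_bonds = (int(v) for v in lines[3].split()[:2])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, (float(x), float(y), float(z))))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds

atoms, bonds = parse_molfile(BENZENE)
for first, second, order in bonds:
    # Atom numbers in the file start at 1, Python lists at 0.
    length = math.dist(atoms[first - 1][1], atoms[second - 1][1])
    print(first, second, order, round(length, 2))
```

Every drawn bond comes out at about 1.33 units, whatever its bond type, which is a reminder that 2D diagram coordinates describe a drawing and not a measured geometry.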

Alanine SD file

Bond length is 1.53 Å

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, …

Charge: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3

Molecule is chiral
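The charge codes listed above are a small lookup table, which can be sketched directly. The names `CHARGE_CODES` and `decode_charge` are hypothetical; the mapping itself is the one given in the list.

```python
# Atom-block charge codes, as listed above (code 4 is handled separately).
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Translate a charge code into (formal charge, is_doublet_radical)."""
    if code == 4:
        return 0, True          # 4 marks a doublet radical, not a charge
    return CHARGE_CODES[code], False

for code in range(8):
    print(code, decode_charge(code))
```

For the non-zero charged codes the rule is simply charge = 4 - code, which is why +3 is stored as 1 and -3 as 7.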

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so the positions are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
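One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded when they are closer than the sum of their covalent radii plus some slack. The sketch below is one such rule of thumb, not the method used by any particular program. The radii are approximate literature values, the tolerance is an assumed number, and the water coordinates are made up for illustration.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for this sketch).
RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4   # assumed slack in angstroms

def impute_bonds(atoms):
    """Return index pairs of atoms that are close enough to be bonded."""
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        if math.dist(pos_i, pos_j) <= RADII[sym_i] + RADII[sym_j] + TOLERANCE:
            bonds.append((i, j))
    return bonds

# Illustrative water geometry: O-H about 0.96 angstroms, H-O-H about 104 degrees.
water = [("O", (0.000, 0.000, 0.000)),
         ("H", (0.960, 0.000, 0.000)),
         ("H", (-0.232, 0.932, 0.000))]
print(impute_bonds(water))
```

The rule finds the two O-H bonds and rejects the H···H contact. It says nothing about bond order, and it becomes unreliable when positions are imprecise, which is exactly the situation in low-resolution protein structures.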

Example protein databank file (pdb)

(uses an adjacency list)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION. 1.80 ANGSTROMS.
REMARK 200 TEMPERATURE (KELVIN) : 113
REMARK 200 PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds

New format: mmCIF
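PDB atom records are fixed-width: each field lives in a set range of columns, so a reader should slice by position and not split on spaces. The sketch below writes and reads one record using the standard column layout for ATOM/HETATM lines (serial in columns 7-11, x, y, z in columns 31-54, occupancy and temperature factor in columns 55-66). The function names are hypothetical and only the fields shown are handled.

```python
def format_atom(record, serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Build the first 66 columns of an ATOM/HETATM record."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")

def parse_atom(line):
    """Read fields by column position (Python slices are 0-based)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

line = format_atom("HETATM", 1510, " S", "SO4", "A", 187,
                   25.993, -1.362, 5.893, 1.00, 22.34)
print(line)
print(parse_atom(line))

# Large negative coordinates fill their columns completely and run together,
# so splitting on whitespace would fail while column slicing still works.
crowded = format_atom("ATOM", 1, " N", "VAL", "A", 1,
                      -100.123, -200.456, -300.789, 1.00, 31.28)
print(parse_atom(crowded)["y"])
```

This is the practical meaning of "the program expects the information in exactly the right place".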

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
      Bond length
      Bond angle
      Torsion angle
      Out-of-plane bending, e.g. a carbonyl
      Non-bonded distance

[Figure: internal coordinates of example structures, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
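Converting a Z-matrix like the water example into Cartesian coordinates can be sketched as below. This is an illustration under stated assumptions: the first atom is placed at the origin, the second along the x axis and the third in the xy plane (an arbitrary but common choice), only the simple layout shown above is read (atoms, blank line, one "name value" pair per line), and all function names are hypothetical.

```python
import math

def _sub(p, q):
    return [p[i] - q[i] for i in range(3)]

def _cross(p, q):
    return [p[1]*q[2] - p[2]*q[1], p[2]*q[0] - p[0]*q[2], p[0]*q[1] - p[1]*q[0]]

def _unit(p):
    norm = math.sqrt(sum(v * v for v in p))
    return [v / norm for v in p]

def place(a, b, c, r, theta, phi):
    """Position at distance r from a, angle theta to b, dihedral phi to c (radians)."""
    ab = _unit(_sub(a, b))                 # unit vector from b towards a
    n = _unit(_cross(_sub(b, c), ab))      # normal to the plane through c, b, a
    m = _cross(n, ab)                      # in-plane axis perpendicular to ab
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return [a[i] + local[0]*ab[i] + local[1]*m[i] + local[2]*n[i] for i in range(3)]

def zmatrix_to_cartesian(text):
    atom_part, _, var_part = text.strip().partition("\n\n")
    variables = {}
    for line in var_part.splitlines():
        if line.strip():
            name, value = line.split()
            variables[name] = float(value)

    def val(token):
        """Accept a number, a variable name, or a negated name such as -da1."""
        try:
            return float(token)
        except ValueError:
            sign = -1.0 if token.startswith("-") else 1.0
            return sign * variables[token.lstrip("-")]

    symbols, xyz = [], []
    for line in atom_part.splitlines():
        f = line.split()
        symbols.append(f[0])
        if len(xyz) == 0:
            xyz.append([0.0, 0.0, 0.0])
        elif len(xyz) == 1:
            xyz.append([val(f[2]), 0.0, 0.0])
        else:
            a, b = xyz[int(f[1]) - 1], xyz[int(f[3]) - 1]
            if len(xyz) == 2:
                c, phi = [b[0], b[1] + 1.0, b[2]], 0.0   # dummy point fixes the plane
            else:
                c, phi = xyz[int(f[5]) - 1], math.radians(val(f[6]))
            xyz.append(place(a, b, c, val(f[2]), math.radians(val(f[4])), phi))
    return symbols, xyz

WATER = "O\nH 1 l1\nH 1 l1 2 a1\n\nl1 0.96\na1 104.0\n"
symbols, xyz = zmatrix_to_cartesian(WATER)
for symbol, (x, y, z) in zip(symbols, xyz):
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a single variable (say `a1`) and converting again moves the atoms consistently, which is why internal coordinates are convenient for driving a reaction coordinate.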

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix

'improper' torsion angles

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic Principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
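The idea of storing metadata next to the structure can be sketched with Python's standard XML library. The element and attribute names below (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `property`, `scalar`) follow CML conventions as far as I recall them and should be checked against the CML schema before real use. Hydrogens are left implicit and the boiling point is included only to show a value carrying its units.

```python
import xml.etree.ElementTree as ET

# Heavy-atom skeleton of ethanol: C-C-O.
molecule = ET.Element("molecule", id="ethanol")
atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

bond_array = ET.SubElement(molecule, "bondArray")
for refs in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=refs, order="1")

# Metadata: the number travels together with its meaning and units.
prop = ET.SubElement(molecule, "property", title="boiling point")
scalar = ET.SubElement(prop, "scalar", units="celsius")
scalar.text = "78.4"

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Any XML parser can read the file back without knowing chemistry.
parsed = ET.fromstring(xml_text)
print(len(parsed.findall("./atomArray/atom")), "atoms")
print(parsed.find("./property/scalar").get("units"))
```

Unlike a fixed-column format, new kinds of information can be added as extra elements without breaking programs that only read the atoms and bonds.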

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
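A systematic torsional scan can be sketched as a grid over all rotatable bonds. The example only enumerates the angle combinations (no energies are computed) to show how quickly the search space grows; the function name `torsion_grid` and the 30 degree step are assumptions for illustration.

```python
from itertools import product

def torsion_grid(n_torsions, step_degrees):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)

for n in (1, 2, 3, 4):
    n_conformers = sum(1 for _ in torsion_grid(n, 30))
    print(n, "torsions:", n_conformers, "grid points")
```

With a 30 degree step there are 12 values per torsion, so the count is 12 to the power of the number of torsions. Two torsions give the familiar Ramachandran-style map; a flexible drug-like molecule with many rotatable bonds needs smarter sampling than a full grid.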

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

So there are very many file formats, which could be a real pain, but they can be converted.

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers - things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents

Page 8: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Bear this in mind Molecules are complicated When we look at this scene we add a

huge amount of information from our senses and knowledge ndash but it nearly all gets

lost in computational representation

Representing chemistry needs to be engineered to represent materials and processes

As you will see we are moving in that direction with more complete representations

of molecules and materials

Not (5α6α)-78-

didehydro-45-

epoxy-17-

methylmorphinan-

36-diol

A real life mixture

What is a molecule

is it a series of connected points

a wave function

the sum of its properties

In the computer molecules are therefore abstractions and interpretations of data

So more experimental data and an appropriate description of a molecule may

translate to a wider reality

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by as a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simpe graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical 5 = -

1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

So there are very many file formats, which could be a real pain, but they can be converted.

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
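Format conversion is, at heart, reading one layout into memory and writing another. The sketch below is not Babel's code or interface; it is a hypothetical helper, `to_xyz`, that writes an in-memory atom list in the simple XYZ layout (atom count, a title line, then one "symbol x y z" line per atom). The benzene coordinates and bonds are those of the SD file example in these notes.

```python
def to_xyz(title, atoms):
    """Write atoms as XYZ text: atom count, title line, then symbol x y z."""
    lines = [str(len(atoms)), title]
    lines += [f"{symbol} {x:.4f} {y:.4f} {z:.4f}" for symbol, x, y, z in atoms]
    return "\n".join(lines) + "\n"


# Atoms and bonds as they might be held after reading a connection table.
benzene_atoms = [
    ("C", 1.9050, -0.7932, 0.0), ("C", 1.9050, -2.1232, 0.0),
    ("C", 0.7531, -0.1282, 0.0), ("C", 0.7531, -2.7882, 0.0),
    ("C", -0.3987, -0.7932, 0.0), ("C", -0.3987, -2.1232, 0.0),
]
benzene_bonds = [(2, 1, 1), (3, 1, 2), (4, 2, 2), (5, 3, 1), (6, 4, 1), (6, 5, 2)]

xyz_text = to_xyz("benzene", benzene_atoms)
print(xyz_text)
```

Note that `benzene_bonds` is never written: XYZ has nowhere to put bonds, so converting from a richer format to a poorer one silently loses information, and converting back means the bonds must be guessed.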

Molecules on computers - things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect - check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents


What is a molecule?

Is it a series of connected points? A wave function? The sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular Diagrams (Graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional - Line notations: a string representation of molecules

Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings' all of the same molecule.

• Line Notations
  - WLN
    L66J BMR& DSWQ IN1&1
  - ROSDAL
    1=-5-=10=510-11-11N-12-=17=123-18S-19O18=20O18=21O8-22N-2322-24
  - SMILES
    c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  - IUPAC
    6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
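To get a feel for how a program reads a SMILES string, here is a minimal sketch that only tokenises the string and counts the atoms written in it. It is not a SMILES parser: it ignores ring closures, branches and bond symbols, it does not add the implicit hydrogens, and the function name `count_atoms` and the two regular expressions are illustrative choices rather than part of any toolkit.

```python
import re
from collections import Counter

# Bracket atoms first, then the two-letter organic-subset symbols, then single letters.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def count_atoms(smiles):
    """Count the atoms explicitly written in a SMILES string (implicit H ignored)."""
    counts = Counter()
    for token in TOKEN.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET_SYMBOL.match(token).group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1   # aromatic 'c' is still carbon
    return counts


dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
artemisinin = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"
print(count_atoms(dye))
print(count_atoms(artemisinin))
```

The digits, brackets and `=` signs that the sketch skips are exactly the parts that encode rings, branches and bond orders, which is why a real reader has to build a graph rather than a simple count.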

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from an InChI to a structure, from different names and formats (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.
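Because an InChI is just text, matching and splitting it needs nothing more than string operations. The sketch below separates an InChI into its slash-delimited layers and uses the whole string as a dictionary key. The function name `inchi_layers` and the small lookup table are hypothetical; the string used is the standard InChI for ethanol.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # first letter names the layer (c, h, ...)
    return layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = inchi_layers(ethanol)
print(layers)

# One molecule, one InChI: the full string can serve directly as a lookup key.
names_by_inchi = {ethanol: "ethanol"}
print(names_by_inchi["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"])
```

The `c` layer holds the connections between the numbered heavy atoms and the `h` layer says where the hydrogens sit, so the string is machine-readable even though it is hard on the human eye.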

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. Oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
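The aromatic-bond point can be made concrete with a labelled graph held as a set of edges. This is a sketch under simplifying assumptions: both structures use the same atom numbering (so no graph matching is needed), the ring atoms are supplied by hand rather than perceived, and the names `edge_set`, `kekule_a` and `kekule_b` are illustrative.

```python
def edge_set(bonds, aromatic_atoms=frozenset()):
    """Return the labelled edges of a molecular graph as a set.

    bonds: (atom_a, atom_b, order) triples. Bonds whose two atoms both lie in
    aromatic_atoms are relabelled 'aromatic' instead of keeping their order.
    """
    edges = set()
    for a, b, order in bonds:
        pair = frozenset((a, b))          # an edge has no direction
        label = "aromatic" if pair <= aromatic_atoms else order
        edges.add((pair, label))
    return edges


# The two Kekule forms of benzene: same atoms, single and double bonds swapped.
kekule_a = [(1, 2, 1), (1, 3, 2), (2, 4, 2), (3, 5, 1), (4, 6, 1), (5, 6, 2)]
kekule_b = [(1, 2, 2), (1, 3, 1), (2, 4, 1), (3, 5, 2), (4, 6, 2), (5, 6, 1)]
ring = frozenset(range(1, 7))

print(edge_set(kekule_a) == edge_set(kekule_b))              # simple graph: different
print(edge_set(kekule_a, ring) == edge_set(kekule_b, ring))  # aromatic bonds: same
```

With plain bond orders the two edge sets differ, so a naive comparison reports two molecules; once the ring bonds carry a single "aromatic" label the sets are identical.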

An example of storing the molecular graph: SD file format

In an MDL SD (structure data) file, the information about each molecule is stored in the following order:

• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:

    Connect 3 2 5 4

e.g. SD file format (reduced):

    Bond 3 5 1
    Bond 3 4 1
    Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

    benzene
    ACDLabs0812062058
    August 2013
    6 6 0 0 0 0 0 0 0 0 1 V2000
    1.9050 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    1.9050 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    0.7531 -0.1282 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    0.7531 -2.7882 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    -0.3987 -0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    -0.3987 -2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    2 1 1 0 0 0 0
    3 1 2 0 0 0 0
    4 2 2 0 0 0 0
    5 3 1 0 0 0 0
    6 4 1 0 0 0 0
    6 5 2 0 0 0 0
    M END
    $$$$

Reading the file from top to bottom:

• Line 1: molecule name.
• Line 2: information on this molecule.
• Line 3: comment (e.g. a date is used here).
• "Counts line": has the number of atoms and bonds as a minimum.
• "Atom block": has the xyz coordinates of the atom and the element as a minimum.
• "Bond block": one line for each bond (from atom, to atom, bond type).
• "M END" and "$$$$": these identify the end of this molecule.
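The layout just described (three header lines, a counts line, an atom block, a bond block and the terminators) can be read with a short function. This is a sketch, not a complete reader: it splits lines on whitespace for brevity, whereas the real V2000 layout is fixed-width and a robust reader slices by column; the function name `parse_sd_record` and the dictionary it returns are illustrative choices.

```python
BENZENE_SD = """benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_sd_record(text):
    """Read one SD record: header lines, counts line, atom block, bond block."""
    lines = text.splitlines()
    name, program, comment = (line.strip() for line in lines[:3])
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(value) for value in line.split()[:3])
        bonds.append((first, second, order))

    return {"name": name, "program": program, "comment": comment,
            "atoms": atoms, "bonds": bonds}


record = parse_sd_record(BENZENE_SD)
print(record["name"], len(record["atoms"]), "atoms,", len(record["bonds"]), "bonds")
```

Because the counts line says how many atom and bond lines follow, the reader never has to guess where one block ends and the next begins, which is what "a precisely described order of data" buys you.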

Alanine SD file

Annotations on the figure:

• Bond length is 1.53 Å.
• Atom line fields: xyz, symbol, mass difference, charge, stereo, H-count, ...
• Charge codes: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3.
• Molecule is chiral.
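The charge codes listed above are a small lookup that is easy to get wrong by hand, so a reader usually wraps it in a function. The sketch below encodes exactly that table; the function names are illustrative.

```python
def formal_charge(code):
    """Translate an atom-block charge code (0-7) into a formal charge."""
    if code not in range(8):
        raise ValueError(f"unknown charge code: {code}")
    if code in (0, 4):        # 0 = uncharged, 4 = doublet radical (no charge)
        return 0
    return 4 - code           # 1..3 -> +3..+1, 5..7 -> -1..-3


def is_doublet_radical(code):
    """Code 4 marks a doublet radical rather than a charge."""
    return code == 4


print([formal_charge(code) for code in range(8)])
```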

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?

Example protein databank file (pdb)

(uses an adjacency list)

    HEADER OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
    TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
    EXPDTA X-RAY DIFFRACTION
    REMARK 2 RESOLUTION. 1.80 ANGSTROMS.
    REMARK 200 TEMPERATURE (KELVIN) : 113
    REMARK 200 PH : 6.9
    SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
    SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
    CRYST1 54.588 55.106 64.827 90.00 90.00 90.00 P 21 21 21 4
    ORIGX1 1.000000 0.000000 0.000000 0.00000
    ORIGX2 0.000000 1.000000 0.000000 0.00000
    ORIGX3 0.000000 0.000000 1.000000 0.00000
    SCALE1 0.018319 0.000000 0.000000 0.00000
    SCALE2 0.000000 0.018147 0.000000 0.00000
    SCALE3 0.000000 0.000000 0.015426 0.00000
    ATOM 1 N VAL A 1 3.036 -1.035 -3.538 1.00 31.28
    ATOM 2 CA VAL A 1 4.283 -0.343 -3.015 1.00 28.41
    ATOM 3 C VAL A 1 4.565 1.067 -3.593 1.00 27.64
    ATOM 4 O VAL A 1 4.653 1.287 -4.853 1.00 27.17
    ATOM 5 CB VAL A 1 5.510 -1.245 -3.141 1.00 30.01
    HETATM 1510 S SO4 A 187 25.993 -1.362 5.893 1.00 22.34
    HETATM 1511 O1 SO4 A 187 25.504 -1.159 4.508 1.00 22.13
    HETATM 1512 O2 SO4 A 187 27.282 -0.770 6.155 1.00 24.10
    HETATM 1513 O3 SO4 A 187 24.953 -1.035 6.850 1.00 15.34
    CONECT 1510 1511 1512 1513 1514
    CONECT 1511 1510
    CONECT 1512 1510
    MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
    END

Reading the records:

• CRYST1: space group and unit cell dimensions.
• ATOM: protein atoms, including xyz, occupancy and temperature factor.
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
• CONECT: bonds.

New format: mmCIF
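Two points from the PDB example can be sketched in code: ATOM records are read by column position, and bonds that are not stored have to be imputed from distances. The block below is an illustration under stated assumptions: the helper names are made up, the element is guessed from the first letter of the atom name (which fails for, say, a calcium ion), the covalent radii are approximate values, and the 0.4 Å tolerance is an arbitrary choice. The five atoms are the valine atoms from the example file.

```python
import math
from itertools import combinations

COVALENT_RADIUS = {"C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}  # approximate, in angstroms


def format_atom(serial, name, res, chain, resseq, x, y, z, occ, bfac):
    """Lay out one ATOM record in the fixed PDB columns."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}{'':4}"
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}")


def parse_atom(line):
    """Read an ATOM record by column position, not by splitting on spaces."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


def infer_bonds(atoms, tolerance=0.4):
    """Impute a bond wherever two atoms are closer than the sum of their radii."""
    bonds = []
    for a, b in combinations(atoms, 2):
        limit = COVALENT_RADIUS[a["name"][0]] + COVALENT_RADIUS[b["name"][0]] + tolerance
        if math.dist(a["xyz"], b["xyz"]) <= limit:
            bonds.append((a["serial"], b["serial"]))
    return bonds


records = [
    format_atom(1, "N", "VAL", "A", 1, 3.036, -1.035, -3.538, 1.00, 31.28),
    format_atom(2, "CA", "VAL", "A", 1, 4.283, -0.343, -3.015, 1.00, 28.41),
    format_atom(3, "C", "VAL", "A", 1, 4.565, 1.067, -3.593, 1.00, 27.64),
    format_atom(4, "O", "VAL", "A", 1, 4.653, 1.287, -4.853, 1.00, 27.17),
    format_atom(5, "CB", "VAL", "A", 1, 5.510, -1.245, -3.141, 1.00, 30.01),
]
atoms = [parse_atom(line) for line in records]
print(infer_bonds(atoms))
```

A distance test like this recovers the connectivity but says nothing about bond order, which is one reason the imputed bonds in protein files are "not always correct".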

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction).
  - Early form of specification of a starting geometry for molecules. People sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation.
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance

[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

    O
    H 1 l1
    H 1 l1 2 a1

    l1 0.96
    a1 104.0

Methanol Z-matrix

    C
    O 1 l1
    H 1 l2 2 a1
    H 1 l3 2 a2 3 da1
    H 1 l3 2 a2 3 -da1
    H 2 l4 1 a3 3 180.0

    l1 1.42
    l2 1.09
    l3 1.09
    l4 1.09
    l5 1.09
    l6 1.0
    a1 109.0
    a2 110.0
    a3 108.0
    a4 110.0
    a5 110.0
    da1 60.0
    da2 120.0
    da3 60.0

The da values are 'improper' torsion angles.
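Turning a Z-matrix back into Cartesian coordinates follows the construction rules directly: each new atom is placed from a distance, an angle and a dihedral relative to atoms already placed. The sketch below is one way of doing this, not Gaussian's routine. It assumes the variables have already been replaced by numbers, the sign convention for the dihedral is my own choice, and it will fail if the three reference atoms are collinear.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def add(u, v):
    return tuple(a + b for a, b in zip(u, v))


def scale(u, s):
    return tuple(a * s for a in u)


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def unit(u):
    return scale(u, 1.0 / math.sqrt(sum(a * a for a in u)))


def distance(u, v):
    return math.sqrt(sum(a * a for a in sub(u, v)))


def zmatrix_to_cartesian(rows):
    """rows: (symbol,), (symbol, NA, r), (symbol, NA, r, NB, angle) or
    (symbol, NA, r, NB, angle, NC, dihedral); atom numbers start at 1, angles in degrees."""
    coords = []
    for row in rows:
        if len(coords) == 0:                 # first atom at the origin
            coords.append((0.0, 0.0, 0.0))
            continue
        r = row[2]
        if len(coords) == 1:                 # second atom along the z axis
            coords.append((0.0, 0.0, r))
            continue
        a = coords[row[1] - 1]
        b = coords[row[3] - 1]
        theta = math.radians(row[4])
        bc = unit(sub(a, b))                 # direction from NB to NA
        if len(coords) == 2:                 # third atom: any plane will do, use xz
            m, n, phi = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.0
        else:
            c = coords[row[5] - 1]
            phi = math.radians(row[6])
            n = unit(cross(sub(b, c), bc))   # normal to the NC-NB-NA plane
            m = cross(n, bc)
        offset = add(scale(bc, -r * math.cos(theta)),
                     add(scale(m, r * math.sin(theta) * math.cos(phi)),
                         scale(n, r * math.sin(theta) * math.sin(phi))))
        coords.append(add(a, offset))
    return coords


water = zmatrix_to_cartesian([("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)])
print(water, distance(water[1], water[2]))
```

Changing a single number, such as the angle in the third row, moves the atom along a chemically meaningful path, which is the reason internal coordinates are convenient for driving a reaction coordinate.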

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
  - Can be parsed
  - Can contain reactions, properties etc.
  - Can contain relationships to other molecules and also concepts

InChI
  InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
  C(C)O
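"Can be parsed" is the practical advantage of XML: a general-purpose library can read the file without any chemistry-specific code. The fragment below is a CML-style description of ethanol written for this sketch, not copied from a real CML file. The element and attribute names (`atomArray`, `elementType`, `atomRefs2`) follow CML conventions as I recall them, and the `property` element is a simplification used only to show metadata such as units travelling with the value.

```python
import xml.etree.ElementTree as ET

CML_TEXT = """<molecule id="ethanol" title="Ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>"""

root = ET.fromstring(CML_TEXT)

# Heavy atoms only; hydrogens are left implicit in this fragment.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
bonds = [(*bond.get("atomRefs2").split(), int(bond.get("order")))
         for bond in root.iter("bond")]
properties = {prop.get("title"): (float(prop.text), prop.get("units"))
              for prop in root.iter("property")}

print(root.get("title"), atoms, bonds, properties)
```

Adding another `property` element, or a link to a related molecule, does not break any existing reader, which is what "extensible" means in practice.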

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
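Scanning conformational space by rotating torsion angles can be sketched as a grid search. Everything here is hypothetical: the energy is a bare threefold cosine term with no interaction between torsions, chosen only so that the minima fall at 60, 180 and 300 degrees, and the function names are my own. Real conformer generators use a force field or quantum mechanics for the energy.

```python
import itertools
import math


def torsion_energy(phi_deg):
    """Hypothetical threefold torsion term: 0 at 60/180/300 degrees, 1 when eclipsed."""
    return 0.5 * (1.0 + math.cos(math.radians(3 * phi_deg)))


def scan(n_torsions, step=30):
    """Evaluate every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step)
    return [(combo, sum(torsion_energy(phi) for phi in combo))
            for combo in itertools.product(angles, repeat=n_torsions)]


grid = scan(2)
minima = [combo for combo, energy in grid if energy < 1e-9]
print(len(grid), "grid points,", len(minima), "minima:", minima)
```

With a 30 degree step each extra rotatable bond multiplies the grid by twelve, so a molecule with ten such bonds already needs 12**10 (about 6 x 10^10) evaluations, which is why practical methods sample the space rather than enumerate it.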

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...

Page 10: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by as a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simpe graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical 5 = -

1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB



1-Dimensional: line notations

A line notation is a string representation of a molecule. Here are three examples of different line notations; SMILES is the most useful and widely used. These 'strings' all describe the same molecule.

• Line notations
  - WLN
    • L66J BMR& DSWQ IN1&1
  - ROSDAL
    • 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
  - SMILES
    • c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  - IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES string will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, eg polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from an InChI to a structure, from different names and formats (eg RSC ChemSpider).

Again, a string like this is easily matched on a computer.
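To illustrate why an identifier string is "easily matched", here is a minimal Python sketch that groups records by their InChI. The record names and the list layout are made up for illustration; the InChI strings are the standard InChIs of ethanol and methanol.

```python
from collections import defaultdict

# Hypothetical records: (name as entered by a user, InChI string).
records = [
    ("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("ethyl alcohol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("methanol", "InChI=1S/CH4O/c1-2/h2H,1H3"),
]

# One molecule, one InChI: plain string equality is enough to spot duplicates.
names_by_inchi = defaultdict(list)
for name, inchi in records:
    names_by_inchi[inchi].append(name)

for inchi, names in names_by_inchi.items():
    print(inchi, names)
```

Two differently named entries collapse onto one key because they share the same identifier string; no chemistry is needed at lookup time.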

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, eg oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc are often problematic. Using a simple graph, the computer would deduce that these two canonical structures are different molecules.
• Eg to solve this we could introduce the concept of an aromatic bond.
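The point about canonical structures can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both drawings use the same atom numbering, the ring is hard-coded as six carbons, and `perceive_aromatic` is a deliberately naive helper invented here, not a real aromaticity model.

```python
def ring(orders):
    """Build a six-membered carbon ring; bond i joins atoms i and i+1 (mod 6)."""
    atoms = {i: "C" for i in range(6)}
    bonds = {frozenset({i, (i + 1) % 6}): order for i, order in enumerate(orders)}
    return atoms, bonds

# The two Kekule drawings of benzene as labelled graphs.
kekule_a = ring([2, 1, 2, 1, 2, 1])
kekule_b = ring([1, 2, 1, 2, 1, 2])

def perceive_aromatic(mol):
    """Relabel the bonds of a strictly alternating single/double six-ring as 'aromatic'."""
    atoms, bonds = mol
    orders = [bonds[frozenset({i, (i + 1) % 6})] for i in range(6)]
    alternating = set(orders) == {1, 2} and all(
        orders[i] != orders[(i + 1) % 6] for i in range(6)
    )
    if alternating:
        bonds = {pair: "aromatic" for pair in bonds}
    return atoms, bonds

print(kekule_a == kekule_b)                                        # simple graphs differ
print(perceive_aromatic(kekule_a) == perceive_aromatic(kekule_b))  # aromatic bonds match
```

With a different atom numbering the comparison would additionally need a graph-isomorphism or canonicalisation step, which is left out of this sketch.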

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following order:

Header block
    describes the molecule, eg its name
Connection table
    defines the molecular structure (atoms and bonds)
Data block (optional)
    properties, eg volume
Terminator line
    a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

eg PDB protein file format
eg SD file format

Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)

Why store molecules in 2D?

• Quite often we only need the chemical diagram, eg to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

    benzene
      ACDLabs0812062058
    August 2013
      6  6  0  0  0  0  0  0  0  0  1 V2000
        1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
      2  1  1  0  0  0  0
      3  1  2  0  0  0  0
      4  2  2  0  0  0  0
      5  3  1  0  0  0  0
      6  4  1  0  0  0  0
      6  5  2  0  0  0  0
    M  END
    $$$$

Annotations:

• Header lines: molecule name; information on this molecule; comment (eg the date is used here).
• "Counts line": has the number of atoms and bonds as a minimum.
• "Atom block": has the xyz coordinates of the atom and the element as a minimum.
• "Bond block": one line for each bond (from atom, to atom, bond type).
• "M  END" and "$$$$": these identify the end of this molecule.
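As a rough sketch of how a program reads such a record, the Python below pulls the name, atoms and bonds out of the benzene example. It assumes the fields can be separated on whitespace, which only holds for small molecules; the real format is fixed-column, so a production reader would slice by column instead. The function name `parse_molblock` is made up for this example.

```python
BENZENE = """\
benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molblock(text):
    """Return (name, atoms, bonds) from one record, splitting fields on whitespace."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                  # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:          # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))   # bond block: from, to, type
    return name, atoms, bonds

name, atoms, bonds = parse_molblock(BENZENE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Because the reader relies on the counts line to know where each block starts and stops, a wrong count or a missing line shifts everything that follows, which is exactly the "right place" problem mentioned earlier.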

Alanine SD file

Annotations on the alanine SD file:

• Bond length is 1.53 Å.
• Atom block columns: xyz, symbol, mass difference, charge, stereo, H-count, …
• Charge codes: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3.
• Molecule is chiral.
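The charge codes listed above are not the charges themselves, so a reader has to translate them. A small Python lookup, with a helper name (`decode_charge`) chosen just for this sketch, might look like this:

```python
# Atom-block charge codes as listed above. Code 0 means uncharged (or a value
# stored elsewhere); code 4 marks a doublet radical rather than a charge.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code not in CHARGE_CODES:
        raise ValueError(f"unknown charge code: {code}")
    return CHARGE_CODES[code], code == 4

for code in range(8):
    print(code, decode_charge(code))
```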

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in eg a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
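One common way to impute bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a tolerance. The Python sketch below assumes approximate radii and a 0.4 Å tolerance, both chosen for illustration, and uses heavy-atom coordinates for a valine residue. It recovers connectivity only, not bond orders.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def impute_bonds(atoms, tolerance=0.4):
    """Guess bonds from positions alone.

    atoms is a list of (element, (x, y, z)); returns 0-based index pairs for
    atoms closer than the sum of their covalent radii plus the tolerance.
    """
    bonds = []
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + tolerance
        if math.dist(xyz_i, xyz_j) <= cutoff:
            bonds.append((i, j))
    return bonds

# Heavy atoms of a valine residue: N, CA, C, O, CB.
valine = [
    ("N", (3.036, -1.035, -3.538)),
    ("C", (4.283, -0.343, -3.015)),
    ("C", (4.565, 1.067, -3.593)),
    ("O", (4.653, 1.287, -4.853)),
    ("C", (5.510, -1.245, -3.141)),
]
print(impute_bonds(valine))
```

A distance test cannot tell a C=O from a C-O, or decide which way round a symmetric-looking ring sits in the density, which is why imputed structures need checking.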

Example Protein Data Bank file (pdb)
(bonds are stored as an adjacency list in CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations:

• CRYST1: space group and unit cell dimensions.
• ATOM: protein atoms, including xyz, occupancy and temperature factor.
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
• CONECT: bonds.
• New format: mmCIF.
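PDB coordinate records are fixed-column: each field lives at a set character position rather than being separated by spaces. The Python sketch below writes and then reads back one ATOM record to show the column layout. The helper names are invented for this example, and the atom name is simply left-justified in its four columns, which is a simplification of the real alignment rules.

```python
def format_atom(record, serial, name, resname, chain, resseq, x, y, z, occupancy, bfactor):
    """Lay out one coordinate record in fixed columns (simplified atom-name alignment)."""
    return (f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{bfactor:6.2f}")

def parse_atom(line):
    """Read the fields back by character position (0-based Python slices)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6: ATOM or HETATM
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "resname": line[17:20].strip(),   # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "resseq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "bfactor": float(line[60:66]),    # columns 61-66: temperature factor
    }

line = format_atom("ATOM", 1, "N", "VAL", "A", 1, 3.036, -1.035, -3.538, 1.00, 31.28)
print(line)
print(parse_atom(line))
```

Slicing by position is what lets a reader cope with fields that run together, for example large negative coordinates that leave no space between columns.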

What if we want to vary the atom positions, eg in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (eg modelling a reaction).
  - Early form of specification of a starting geometry for molecules: sometimes people used graph paper to draw the molecule and get a starting set of coordinates before optimisation.
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    • Bond length
    • Bond angle
    • Torsion angle
    • Out-of-plane bending (eg a carbonyl)
    • Non-bonded distance

[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
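To see what a program does with a Z-matrix, the Python sketch below converts Z-matrix rows into Cartesian coordinates. It is an illustration, not Gaussian's own routine: the orientation (first atom at the origin, second on the z axis, third in the xz plane) and the sign convention of the dihedral are choices made here, and the function names are invented for this example.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _unit(p):
    norm = math.sqrt(sum(a * a for a in p))
    return tuple(a / norm for a in p)

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def _value(token, variables):
    """Look up a variable name (optionally negated, as in -da1) or parse a number."""
    if token in variables:
        return variables[token]
    if token.startswith("-") and token[1:] in variables:
        return -variables[token[1:]]
    return float(token)

def zmatrix_to_cartesian(rows, variables):
    """Convert Z-matrix rows (strings) into a list of (symbol, (x, y, z))."""
    atoms = []
    for row in rows:
        f = row.split()
        if len(f) == 1:                          # first atom: at the origin
            xyz = (0.0, 0.0, 0.0)
        else:
            a = atoms[int(f[1]) - 1][1]          # NA, the atom bonded to
            r = _value(f[2], variables)
            if len(f) == 3:                      # second atom: along the z axis
                xyz = (a[0], a[1], a[2] + r)
            else:
                b = atoms[int(f[3]) - 1][1]      # NB, defines the angle
                theta = math.radians(_value(f[4], variables))
                if len(f) == 5:                  # third atom: in the xz plane
                    to_b = _unit(_sub(b, a))
                    x_axis = (1.0, 0.0, 0.0)
                    xyz = tuple(a[k] + r * (math.cos(theta) * to_b[k]
                                            + math.sin(theta) * x_axis[k])
                                for k in range(3))
                else:                            # later atoms: need the dihedral
                    c = atoms[int(f[5]) - 1][1]  # NC, defines the dihedral
                    phi = math.radians(_value(f[6], variables))
                    bc = _unit(_sub(a, b))
                    n = _unit(_cross(_sub(b, c), bc))
                    m = _cross(n, bc)
                    local = (-r * math.cos(theta),
                             r * math.sin(theta) * math.cos(phi),
                             r * math.sin(theta) * math.sin(phi))
                    xyz = tuple(a[k] + local[0] * bc[k] + local[1] * m[k]
                                + local[2] * n[k] for k in range(3))
        atoms.append((f[0], xyz))
    return atoms

water = zmatrix_to_cartesian(["O", "H 1 l1", "H 1 l1 2 a1"],
                             {"l1": 0.96, "a1": 104.0})
for symbol, (x, y, z) in water:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing one variable, say `a1`, moves the atoms consistently without touching any xyz value by hand, which is the reason internal coordinates are convenient for driving a reaction coordinate.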

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Z-matrix with 'improper' torsion angles)

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, eg the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See "Chemical Markup, XML and the World-Wide Web. Part I: Basic principles", P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
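A short Python sketch of the "can be parsed" point: it builds a small CML-style description of the heavy atoms of ethanol with the standard library and reads it back. The element and attribute names follow the general CML style (molecule, atomArray, atom, bondArray, bond), but the output has not been validated against the CML schema, so treat it as an illustration of structured, extensible data rather than as reference CML.

```python
import xml.etree.ElementTree as ET

# Heavy atoms of ethanol, following the SMILES C(C)O: a1 is the central carbon.
molecule = ET.Element("molecule", id="ethanol")
ET.SubElement(molecule, "name").text = "ethanol"

atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

bond_array = ET.SubElement(molecule, "bondArray")
for refs in ["a1 a2", "a1 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=refs, order="1")

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Any XML parser can read the file back without knowing any chemistry.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(elements)
```

Because every value is tagged, extra information (units, dates, links to other molecules) can be added as new elements without breaking programs that only look for atoms and bonds.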

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, eg the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
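The film-strip picture can be sketched directly: if the frames are SD records concatenated in one file, the `$$$$` terminator line is enough to cut the file back into frames. The Python below uses two placeholder records; real frames would contain full connection tables, and real molecular dynamics packages use their own trajectory formats.

```python
def split_sd_records(text):
    """Split concatenated SD-file text into one string per record, cutting at $$$$ lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

# Two placeholder 'frames'; the "..." lines stand in for the real record contents.
trajectory = "frame 1\n...\nM  END\n$$$$\nframe 2\n...\nM  END\n$$$$\n"
frames = split_sd_records(trajectory)
print(len(frames))
print([frame.splitlines()[0] for frame in frames])
```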

So there are very many file formats, which could be a real pain, but they can be converted.

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There are recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weight and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents

Page 12: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by as a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simpe graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical 5 = -

1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.
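The layout described in the steps above is simple enough to read with a few lines of code. This is a minimal sketch, not a full Gaussian input parser: the function name `parse_zmatrix` is hypothetical, values may be a number, a variable name, or a negated variable name, and variables and constants are merged into one table.

```python
import re


def parse_zmatrix(text):
    """Split Z-matrix text into (symbol, reference atoms, values) rows."""
    blocks = re.split(r"\n\s*\n", text.strip())
    atom_lines = blocks[0].splitlines()
    # Every later block (variables, then optional constants) is "name value".
    variables = {}
    for block in blocks[1:]:
        for line in block.splitlines():
            name, value = line.split()
            variables[name] = float(value)

    def resolve(token):
        try:
            return float(token)              # literal number, e.g. 180.0
        except ValueError:
            pass
        if token.startswith("-"):
            return -variables[token[1:]]     # negated variable, e.g. -da1
        return variables[token]

    rows = []
    for line in atom_lines:
        symbol, *rest = line.split()
        references = [int(r) for r in rest[0::2]]   # NA, NB, NC
        values = [resolve(v) for v in rest[1::2]]   # distance, angle, dihedral
        rows.append((symbol, references, values))
    return rows


WATER = "O\nH 1 l1\nH 1 l1 2 a1\n\nl1 0.96\na1 104.0\n"

if __name__ == "__main__":
    for row in parse_zmatrix(WATER):
        print(row)
```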

Water (C2v)

O

H 1 l1

H 1 l1 2 a1

l1 0.96

a1 104.0
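To see what the water Z-matrix means geometrically, the two internal coordinates can be turned back into Cartesian positions with basic trigonometry. This is a sketch for the three-atom case only; placing the oxygen at the origin and the first hydrogen on the z axis is a choice made here, not part of the Z-matrix itself.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Return [(symbol, (x, y, z)), ...] for an O, H, H Z-matrix."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)
    hydrogen_1 = (0.0, 0.0, l1)                               # along z
    hydrogen_2 = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))  # in the xz plane
    return [("O", oxygen), ("H", hydrogen_1), ("H", hydrogen_2)]


if __name__ == "__main__":
    for symbol, (x, y, z) in water_from_zmatrix(0.96, 104.0):
        print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Only two numbers are needed because the C2v symmetry makes both O–H distances share the variable l1.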

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 180.0

l1 1.42

l2 1.09

l3 1.09

l4 1.09

l5 1.09

l6 1.0

a1 109.0

a2 110.0

a3 108.0

a4 110.0

a5 110.0

da1 60.0

da2 120.0

da3 60.0

[Figure labels: z-matrix; ‘improper’ torsion angles]
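From the fourth atom onwards, each Z-matrix row fixes a new atom relative to three atoms that are already placed. One common way to do this placement is sketched below; the function name `place_atom` is illustrative, and the three reference atoms are assumed not to lie on one line.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(u):
    length = math.sqrt(sum(a * a for a in u))
    return tuple(a / length for a in u)


def place_atom(na, nb, nc, distance, angle_degrees, dihedral_degrees):
    """Position of a new atom given its distance to NA, the angle
    new-NA-NB and the dihedral new-NA-NB-NC (angles in degrees)."""
    theta = math.radians(angle_degrees)
    phi = math.radians(dihedral_degrees)
    bc = _unit(_sub(na, nb))              # unit vector from NB to NA
    n = _unit(_cross(_sub(nb, nc), bc))   # normal to the NC, NB, NA plane
    m = _cross(n, bc)                     # completes the local axes
    local = (-distance * math.cos(theta),
             distance * math.sin(theta) * math.cos(phi),
             distance * math.sin(theta) * math.sin(phi))
    return tuple(na[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))
```

A row such as `H 1 l3 2 a2 3 da1` would correspond to `place_atom(atom1, atom2, atom3, l3, a2, da1)` once atoms 1 to 3 have coordinates.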

What about comprehensive properties of molecules – they are more than xyz

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties etc.

- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
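To make the idea of storing metadata alongside a structure concrete, here is a small sketch that writes ethanol as CML-style XML with Python's standard library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `atomRefs2`, `property`, `scalar`) follow commonly seen CML conventions but should be treated as assumptions rather than a validated CML document, and the boiling point is an approximate illustrative value.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a CML-like description of ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")
    ET.SubElement(molecule, "name").text = "ethanol"

    atoms = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)

    bonds = ET.SubElement(molecule, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")

    # Metadata: the value travels with its title and its units.
    prop = ET.SubElement(molecule, "property", title="boiling point")
    ET.SubElement(prop, "scalar", units="celsius").text = "78.4"
    return ET.tostring(molecule, encoding="unicode")


if __name__ == "__main__":
    print(ethanol_cml())
```

Because the result is ordinary XML, any XML parser can read it back and new elements can be added without breaking existing readers.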

Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd
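The "film strip" idea can be sketched with a very simple container: one block of coordinates per frame, concatenated in one text file. The example below uses the plain XYZ layout (atom count, comment line, then one atom per line) rather than SD files, purely to keep the code short; the function names are illustrative.

```python
def frames_to_xyz(symbols, frames):
    """Write a list of coordinate frames as concatenated XYZ blocks."""
    lines = []
    for index, coordinates in enumerate(frames):
        lines.append(str(len(symbols)))      # number of atoms in this frame
        lines.append(f"frame {index}")       # free-text comment line
        for symbol, (x, y, z) in zip(symbols, coordinates):
            lines.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines) + "\n"


def xyz_to_frames(text):
    """Read concatenated XYZ blocks back into a list of frames."""
    lines = text.splitlines()
    frames, start = [], 0
    while start < len(lines):
        count = int(lines[start])
        block = lines[start + 2:start + 2 + count]
        frames.append([tuple(float(v) for v in line.split()[1:4])
                       for line in block])
        start += count + 2                   # skip to the next frame header
    return frames


if __name__ == "__main__":
    symbols = ["O", "H", "H"]
    trajectory = [
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.96), (0.93, 0.0, -0.23)],
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.98), (0.94, 0.0, -0.25)],
    ]
    print(frames_to_xyz(symbols, trajectory))
```

Real molecular dynamics packages use compact binary trajectory formats for the same purpose, since text files of this kind grow very large.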

So there are very many file formats – which could be a real pain… but they can be converted

alc -- Alchemy file                          prep -- Amber PREP file
bs -- Ball & Stick file                      caccrt -- Cacao Cartesian file
ccc -- CCC file                              c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file              cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                     dmol -- DMol3 Coordinates file
feat -- Feature file                         gam -- GAMESS Output file
gamout -- GAMESS Output file                 gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                    qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                    jout -- Jaguar Output file
bin -- OpenEye Binary file                   mmd -- MacroModel file
mmod -- MacroModel file                      out -- MacroModel file
dat -- MacroModel file                       car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                     sd -- MDL Isis SDF file
mdl -- MDL Molfile file                      mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file               mopout -- MOPAC Output file
mmads -- MMADS file                          mpqc -- MPQC file
bgf -- MSI BGF file                          nwo -- NWChem Output file
pdb -- PDB file                              ent -- PDB file
pqs -- PQS file                              qcout -- Q-Chem Output file
res -- ShelX file                            ins -- ShelX file
smi -- SMILES file                           mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                   vmol -- ViewMol file

alc -- Alchemy file                          bs -- Ball & Stick file
caccrt -- Cacao Cartesian file               cacint -- Cacao Internal file
cache -- CAChe MolStruct file                c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file              ct -- ChemDraw Connection Table file
cht -- Chemtool file                         cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                        box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file               feat -- Feature file
fh -- Fenske-Hall Z-Matrix file              gamin -- GAMESS Input file
inp -- GAMESS Input file                     gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                   gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                   gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                    jin -- Jaguar Input file
bin -- OpenEye Binary file                   mmd -- MacroModel file
mmod -- MacroModel file                      out -- MacroModel file
dat -- MacroModel file                       sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                      mdl -- MDL Molfile file
mol -- MDL Molfile file                      mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                          bgf -- MSI BGF file
csr -- MSI Quanta CSR file                   nw -- NWChem Input file
pdb -- PDB file                              ent -- PDB file
pov -- POV-Ray Output file                   pqs -- PQS file
report -- Report file                        qcin -- Q-Chem Input file
smi -- SMILES file                           fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                      txyz -- Tinker XYZ file
txt -- Titles file                           unixyz -- UniChem XYZ file
vmol -- ViewMol file                         xed -- XED file
xyz -- XYZ file                              zin -- ZINDO Input file

File formats – the bane of our lives

Interconversion program – Babel

There are recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?

– Finding the right compound

– Designing compounds

– Searching for compound data

– Patents


O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

It’s artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

This is an example of a ‘molecule’ and is what is actually stored in the computer – does it make sense to you?

A SMILES tutorial is available at

http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure, and a SMILES will be available by picking the smiley face:

http://www.molinspiration.com/cgi-bin/properties?textMode=1
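A SMILES string can be read character by character, which is part of why it is so convenient on a computer. The sketch below is a deliberately small tokenizer, not a full SMILES parser: it recognises bracket atoms, the organic-subset atoms, ring-closure labels and bond or branch symbols, then counts non-hydrogen atoms and rings. The function name `summarise_smiles` is illustrative.

```python
import re

# Bracket atom | organic-subset atom | ring-closure label | bond, branch or dot
TOKEN = re.compile(
    r"(\[[^\]]+\])|(Br|Cl|[BCNOPSFI]|[bcnops])|(%\d\d|\d)|([-=#$:/\\().])"
)
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Za-z][a-z]?)")


def summarise_smiles(smiles):
    """Return (number of non-hydrogen atoms, number of ring closures)."""
    heavy_atoms, ring_labels, position = 0, [], 0
    while position < len(smiles):
        match = TOKEN.match(smiles, position)
        if match is None:
            raise ValueError(f"unrecognised character at {position}")
        bracket, atom, ring, _other = match.groups()
        if bracket:
            if BRACKET_ELEMENT.match(bracket).group(1) != "H":
                heavy_atoms += 1
        elif atom:
            heavy_atoms += 1
        elif ring:
            ring_labels.append(ring)
        position = match.end()
    # Every ring-closure label must be opened and closed.
    if any(ring_labels.count(label) % 2 for label in set(ring_labels)):
        raise ValueError("unmatched ring-closure label")
    return heavy_atoms, len(ring_labels) // 2


ARTEMISININ = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"

if __name__ == "__main__":
    print(summarise_smiles("C(C)O"))      # ethanol
    print(summarise_smiles(ARTEMISININ))
```

The artemisinin string above yields 20 non-hydrogen atoms, consistent with the formula C15H22O5, and four ring closures.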

InChI

• A more recent line notation is called InChI.

• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.

• InChI is generated using computer algorithms and is virtually uninterpretable by humans.

• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.

• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider
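Although an InChI is hard for a person to read, its layered structure is easy to take apart in code, since layers are separated by slashes and most start with a one-letter prefix. A minimal sketch follows; the function name `inchi_layers` is illustrative, and identifiers with repeated layer prefixes would need a more careful parser.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # e.g. 'c' connections, 'h' hydrogens
    return layers


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

if __name__ == "__main__":
    print(inchi_layers(ETHANOL))
    # Matching two records is then a plain string comparison.
    print(ETHANOL == "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```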

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.

• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. Oxygen).

• Chemical structures are, of course, more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce these two canonical structures are different molecules.

• E.g. to solve this we could introduce the concept of an aromatic bond.
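The aromatic-bond fix can be shown with a labelled graph held as a set of edges. The sketch below assumes the two structures are the two Kekulé drawings of an ortho-disubstituted benzene (1,2-dichlorobenzene is used here as a hypothetical example): with explicit single and double bonds the edge sets differ, and they become identical once ring bonds are given a single aromatic type.

```python
# Atom labels: six ring carbons (1-6) and two chlorines (7, 8).
ATOM_LABELS = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "Cl", 8: "Cl"}
RING_BONDS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
SUBSTITUENT_BONDS = [(1, 7, 1), (2, 8, 1)]


def kekule_form(first_order):
    """Ring with alternating bond orders, starting at bond 1-2."""
    orders = [first_order, 3 - first_order] * 3
    ring = [(a, b, order) for (a, b), order in zip(RING_BONDS, orders)]
    return ring + SUBSTITUENT_BONDS


def edge_set(bonds):
    """Undirected, order-labelled edges, independent of listing order."""
    return {(frozenset((a, b)), order) for a, b, order in bonds}


def mark_aromatic(bonds, ring_atoms):
    """Replace the order of bonds inside the ring by an aromatic type."""
    return [(a, b, "aromatic" if a in ring_atoms and b in ring_atoms else order)
            for a, b, order in bonds]


if __name__ == "__main__":
    form_a, form_b = kekule_form(2), kekule_form(1)
    ring = set(range(1, 7))
    print(edge_set(form_a) == edge_set(form_b))                    # False
    print(edge_set(mark_aromatic(form_a, ring))
          == edge_set(mark_aromatic(form_b, ring)))                # True
```

Renumbering the atoms cannot reconcile the two Kekulé forms here, because the bond between the two substituted carbons is double in one and single in the other.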

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

Header block: describes the molecule, e.g. its name

Connection table: defines the molecular structure (atoms and bonds)

Data block (optional): properties, e.g. volume

Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format to store small molecules – SD files are widely used in software to store many molecule structures in databases.

A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
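The block structure just described makes an SD file easy to split into molecules. The sketch below separates records at the `$$$$` terminator and reads data-block items of the form `> <TAG>` followed by value lines and a blank line; the sample records and the `VOLUME` values are hypothetical placeholders, and the function names are illustrative.

```python
def split_sd_records(text):
    """Return one list of lines per molecule, split at '$$$$'."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def data_items(record):
    """Collect '> <TAG>' data items from one record into a dictionary."""
    items, index = {}, 0
    while index < len(record):
        line = record[index]
        if line.startswith(">") and "<" in line:
            start = line.index("<")
            tag = line[start + 1:line.index(">", start)]
            values = []
            index += 1
            while index < len(record) and record[index].strip():
                values.append(record[index])
                index += 1
            items[tag] = "\n".join(values)
        index += 1
    return items


# Two empty-structure records, used only to show the record layout.
SAMPLE = "\n".join([
    "molecule A", "", "",
    "  0  0  0  0  0  0  0  0  0  0  0 V2000", "M  END",
    "> <VOLUME>", "12.5", "", "$$$$",
    "molecule B", "", "",
    "  0  0  0  0  0  0  0  0  0  0  0 V2000", "M  END",
    "> <VOLUME>", "20.1", "", "$$$$",
])

if __name__ == "__main__":
    for record in split_sd_records(SAMPLE):
        print(record[0], data_items(record))
```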

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format

Connect 3 2 5 4

e.g. SD file format (reduced)

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

Why store molecules in 2D

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don’t know the conformation (shape) of a molecule – so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name

Line 2: information on this molecule

Line 3: comment (e.g. a date is used here)

“Counts line”: has the number of atoms and bonds as a minimum

“Atom block”: has xyz coordinates of the atom and the element as a minimum

“Bond block”: one line for each bond: from atom – to atom – bond type

M END and $$$$: these identify the end of this molecule
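Reading the connection table back into a program follows the same order as the annotations above: counts line, atom block, bond block. The sketch below is a simplification that splits each line on whitespace, which works for small molecules like this one; the real format uses fixed-width columns, so a production reader should slice by column instead. The function name `parse_molblock` is illustrative.

```python
BENZENE = "\n".join([
    "benzene",
    "  ACD/Labs0812062058",
    "August 2013",
    "  6  6  0  0  0  0  0  0  0  0  1 V2000",
    "    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  2  1  1  0  0  0  0",
    "  3  1  2  0  0  0  0",
    "  4  2  2  0  0  0  0",
    "  5  3  1  0  0  0  0",
    "  6  4  1  0  0  0  0",
    "  6  5  2  0  0  0  0",
    "M  END",
])


def parse_molblock(text):
    """Return (name, atoms, bonds) from a V2000-style connection table."""
    lines = text.splitlines()
    name = lines[0].strip()
    atom_count, bond_count = (int(v) for v in lines[3].split()[:2])
    atoms = []
    for line in lines[4:4 + atom_count]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + atom_count:4 + atom_count + bond_count]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds


def neighbours(bonds):
    """Adjacency list: atom number -> sorted list of bonded atom numbers."""
    table = {}
    for first, second, _order in bonds:
        table.setdefault(first, []).append(second)
        table.setdefault(second, []).append(first)
    return {atom: sorted(others) for atom, others in table.items()}


if __name__ == "__main__":
    name, atoms, bonds = parse_molblock(BENZENE)
    print(name, len(atoms), "atoms", len(bonds), "bonds")
    print(neighbours(bonds))
```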

Alanine SD file

Bond length is 1.53 Å

Atom line fields: xyz, symbol, mass diff, charge, stereo, h-count…

Charge: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3

Molecule is chiral
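The charge field in the atom line is a code, not the charge itself, so a reader has to translate it. A minimal sketch of that lookup follows, based only on the code table above; the function name `formal_charge` is illustrative, and treating the radical code as zero charge is a simplification made here.

```python
def formal_charge(code):
    """Translate the atom-line charge code into a formal charge."""
    if code == 0:
        return 0            # uncharged, or a value outside this table
    if code == 4:
        return 0            # doublet radical: flagged, but no formal charge
    if 1 <= code <= 7:
        return 4 - code     # 1 -> +3, 2 -> +2, 3 -> +1, 5 -> -1, 6 -> -2, 7 -> -3
    raise ValueError(f"unknown charge code: {code}")


def is_radical(code):
    """True when the charge code marks a doublet radical."""
    return code == 4


if __name__ == "__main__":
    for code in range(8):
        print(code, formal_charge(code), is_radical(code))
```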

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density map, annotated “Which way round should this ring go?”]
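One simple way to impute bonds from positions alone is a distance test: two atoms are taken as bonded when they are closer than the sum of their covalent radii plus a small tolerance. The sketch below illustrates that rule; it is a common heuristic rather than the method used by any particular program, the radii are approximate literature values, and the 0.4 Å tolerance is an assumption.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def infer_bonds(atoms, tolerance=0.4):
    """atoms: list of (symbol, (x, y, z)). Returns index pairs judged bonded."""
    bonds = []
    for (i, (symbol_i, p_i)), (j, (symbol_j, p_j)) in combinations(
            enumerate(atoms), 2):
        limit = COVALENT_RADIUS[symbol_i] + COVALENT_RADIUS[symbol_j] + tolerance
        if math.dist(p_i, p_j) <= limit:
            bonds.append((i, j))
    return bonds


# A sulfur and three oxygens of a sulfate ion, coordinates in angstroms.
SULFATE = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]

if __name__ == "__main__":
    print(infer_bonds(SULFATE))
```

The rule finds connectivity only; bond orders, hydrogens and ambiguous cases such as ring orientation still need chemical knowledge.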

Example protein databank file (pdb)

(uses an adjacency list)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END



Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

Next lecture

bull How can we use this type of information to

solve our chemistry problems

ndash Finding the right compound

ndash Designing compounds

ndash Searching for compound data

ndash Patents


InChI

• A more recent line notation is called InChI.
• It is intended to address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI from different names and formats, or convert from an InChI to a structure (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.
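Because an InChI is just a string, matching it is an exact string comparison. A minimal sketch in Python, assuming a small hypothetical in-memory registry: the `registry` dictionary and both helper functions are illustrative names, not part of any InChI software. The three keys are the standard InChI strings for ethanol, methanol and water.

```python
from typing import Optional

# Hypothetical registry keyed by standard InChI string (illustrative only).
registry = {
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "ethanol",
    "InChI=1S/CH4O/c1-2/h2H,1H3": "methanol",
    "InChI=1S/H2O/h1H2": "water",
}


def lookup(inchi: str) -> Optional[str]:
    """Return the registered name for an InChI, or None if it is unknown."""
    return registry.get(inchi.strip())


def formula_layer(inchi: str) -> str:
    """Return the chemical-formula layer (the second '/'-separated field)."""
    return inchi.split("/")[1]
```

A dictionary lookup like this is only as good as the identifier: it relies on every source generating the InChI with the same algorithm and options.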

Storing chemical diagrams on computers

• The valence model of a molecule can be represented as a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this: aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. Using a simple graph, the computer would deduce that these two canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
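The aromaticity problem can be sketched with a small labelled graph in Python. The slide's figure is not reproduced in this transcript, so the example molecule is an assumption: the two Kekulé forms of 1,2-dichlorobenzene, where the bond between the two substituted carbons is double in one drawing and single in the other. All function names are illustrative, and ring membership is hard-coded rather than perceived.

```python
RING = range(6)  # ring carbons are atoms 0..5


def dichlorobenzene(first_double):
    """Build one Kekule form of 1,2-dichlorobenzene as a labelled graph.

    first_double (0 or 1) selects which alternating set of ring bonds is
    drawn as double. Returns (atoms, bonds): atoms maps index -> element,
    bonds maps frozenset({i, j}) -> bond order.
    """
    atoms = {i: "C" for i in RING}
    atoms[6] = "Cl"
    atoms[7] = "Cl"
    bonds = {}
    for i in RING:
        order = 2 if i % 2 == first_double else 1
        bonds[frozenset((i, (i + 1) % 6))] = order
    bonds[frozenset((0, 6))] = 1  # Cl on ring atom 0
    bonds[frozenset((1, 7))] = 1  # Cl on ring atom 1
    return atoms, bonds


def with_aromatic_bonds(atoms, bonds):
    """Relabel every ring bond as 'aromatic'; other bonds keep their order."""
    ring = set(RING)
    relabelled = {
        pair: ("aromatic" if pair <= ring else order)
        for pair, order in bonds.items()
    }
    return atoms, relabelled


form_a = dichlorobenzene(0)  # C0=C1 is a double bond
form_b = dichlorobenzene(1)  # C0-C1 is a single bond

print(form_a == form_b)                                              # False
print(with_aromatic_bonds(*form_a) == with_aromatic_bonds(*form_b))  # True
```

The comparison here uses a fixed atom numbering. A general comparison would need graph isomorphism or canonical numbering, which this sketch does not attempt.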

An example of storing the molecular graph: the SD file format

MDL SD (structure-data) files store the information about each molecule in the following order:

Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of the information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
Line 4: "counts line", which has the number of atoms and bonds as a minimum
Next 6 lines: "atom block", which has the xyz coordinates of the atom and the element as a minimum
Next 6 lines: "bond block", one line for each bond: from atom, to atom, bond type
"M  END" and "$$$$": these identify the end of this molecule
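Reading such a file comes down to slicing fixed-width columns. The following Python sketch parses the benzene example. It assumes the V2000 column layout (three-character count fields, ten-character coordinate fields, element symbol in columns 32 to 34) and ignores the data block, so it is an illustration rather than a complete SD reader; `parse_molfile` is a hypothetical helper name.

```python
MOLFILE = """\
benzene
ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Parse one V2000 record into (name, atoms, bonds).

    atoms is a list of (symbol, x, y, z); bonds is a list of
    (from_atom, to_atom, bond_type) with 1-based atom numbers.
    """
    lines = text.splitlines()
    name = lines[0].strip()          # header block, line 1
    counts = lines[3]                # counts line follows the 3 header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()
        atoms.append((symbol, x, y, z))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in a fixed-width format are allowed to touch.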

Alanine SD file

• Bond length is 1.53 Å
• Atom block columns: x, y, z, symbol, mass diff, charge, stereo, H-count, …
• Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
• Molecule is chiral

What if we only know the atom positions and not the bonds?

A key example is X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Figure annotation: Which way round should this ring go?
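One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The Python sketch below applies this to the sulfate ion from the 3S7A example file used in this lecture (coordinates with decimal points restored). The radii and the 0.4 Å tolerance are assumed, approximate values, and real programs use more careful rules.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (assumed values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # assumed slack in angstroms


def infer_bonds(atoms):
    """Return index pairs (i, j) of atoms close enough to be bonded.

    atoms is a list of (element, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(
        enumerate(atoms), 2
    ):
        limit = COVALENT_RADIUS[elem_i] + COVALENT_RADIUS[elem_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= limit:
            bonds.append((i, j))
    return bonds


# S and three of the O atoms of the SO4 group (HETATM 1510 to 1513).
sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]

print(infer_bonds(sulfate))  # [(0, 1), (0, 2), (0, 3)]
```

The test finds connectivity only. Bond orders, hydrogens and ambiguous ring orientations still have to come from chemical knowledge.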

Example Protein Data Bank file (pdb)

(bonds are stored as an adjacency list, in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
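As with SD files, PDB records are read by column position. The Python sketch below pulls the atoms and the CONECT adjacency list out of a few records from the 3S7A example. It assumes the standard PDB column layout (serial in columns 7 to 11, x, y, z in columns 31 to 54, and so on), and `parse_pdb` is a hypothetical helper name, not a library function.

```python
PDB_TEXT = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
END
"""


def parse_pdb(text):
    """Return (atoms, bonds) from ATOM, HETATM and CONECT records.

    atoms maps serial number -> dict of fields;
    bonds maps serial number -> list of bonded serial numbers.
    """
    atoms = {}
    bonds = {}
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            serial = int(line[6:11])
            atoms[serial] = {
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "residue_number": int(line[22:26]),
                "xyz": (
                    float(line[30:38]),
                    float(line[38:46]),
                    float(line[46:54]),
                ),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),  # temperature factor
                "hetero": record == "HETATM",
            }
        elif record == "CONECT":
            # Serial numbers sit in consecutive five-character fields.
            fields = [line[i:i + 5] for i in range(6, len(line), 5)]
            serials = [int(f) for f in fields if f.strip()]
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return atoms, bonds


atoms, bonds = parse_pdb(PDB_TEXT)
```

Note that the protein atoms have no CONECT records at all: their bonds are implied by the residue name and must be imputed by the reading program.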

Internal Coordinates

What if we want to vary the atom positions, e.g. in driving a reaction coordinate? Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance

Figure: dummy atom (du) positions

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Slide annotation on the Z-matrix: 'improper' torsion angles

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.

Ethanol as <CML>:
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES: C(C)O
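The point that such a file "can be parsed" and can carry metadata can be sketched with Python's standard XML parser. The document below is a simplified, CML-like fragment written for this example (heavy atoms only, no namespaces). The element and attribute names follow CML conventions as far as I am aware, but the fragment has not been validated against the CML schema, so treat it as an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified CML-like description of ethanol (illustrative, not schema-checked).
CML_TEXT = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="molar mass">
    <scalar units="g/mol">46.07</scalar>
  </property>
</molecule>"""

root = ET.fromstring(CML_TEXT)

# Atoms: id -> element symbol.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: (first atom id, second atom id, order).
bonds = []
for bond in root.iter("bond"):
    first, second = bond.get("atomRefs2").split()
    bonds.append((first, second, int(bond.get("order"))))

# A property together with its metadata (the units travel with the value).
scalar = root.find("property/scalar")
molar_mass = float(scalar.text)
units = scalar.get("units")
```

Unlike the fixed-column formats, nothing here depends on character positions: new elements can be added without breaking an existing reader, which is the practical meaning of "extensible".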

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction; e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
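A systematic torsional scan is simply a grid over the rotatable bonds, and the grid grows very quickly with the number of torsions. A short Python sketch of the enumeration; the function names are illustrative, and building and scoring each conformer is left out.

```python
from itertools import product


def torsion_grid(n_torsions, step_degrees):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_degrees):
    """Number of grid points: (angles per torsion) ** (number of torsions)."""
    return len(range(0, 360, step_degrees)) ** n_torsions


# Two torsions at 30 degree steps: the same layout as a Ramachandran plot.
two_torsions = list(torsion_grid(2, 30))
```

With 30 degree steps there are 12 values per torsion, so two torsions give 144 points but five torsions already give 12**5 = 248832, which is why practical methods prune or sample the space rather than enumerate it.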

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
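The "film strip" idea can be sketched as frames written one after another into a single text file. The Python example below uses concatenated XYZ-style blocks (atom count, comment line, one line per atom) rather than SD records, purely to keep the code short. The function names and the two-frame H2 trajectory are made up for illustration and do not come from a real simulation.

```python
def write_xyz_trajectory(frames):
    """Serialise frames as concatenated XYZ blocks.

    frames is a list of (comment, atoms), with atoms a list of
    (symbol, x, y, z) tuples.
    """
    out = []
    for comment, atoms in frames:
        out.append(str(len(atoms)))
        out.append(comment)
        for symbol, x, y, z in atoms:
            out.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(out) + "\n"


def read_xyz_trajectory(text):
    """Parse concatenated XYZ blocks back into a list of frames."""
    lines = text.splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():      # skip blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i])
        comment = lines[i + 1]
        atoms = []
        for line in lines[i + 2:i + 2 + n_atoms]:
            symbol, x, y, z = line.split()
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append((comment, atoms))
        i += 2 + n_atoms
    return frames


# Made-up two-frame trajectory of a stretching H2 molecule.
trajectory = [
    ("t=0", [("H", 0.0, 0.0, 0.0), ("H", 0.74, 0.0, 0.0)]),
    ("t=1", [("H", 0.0, 0.0, 0.0), ("H", 0.76, 0.0, 0.0)]),
]
text = write_xyz_trajectory(trajectory)
```

Production molecular dynamics codes normally use compact binary trajectory formats instead, because a text frame per time step becomes very large for systems of many thousands of atoms.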

So there are very many file formats, which could be a real pain, but they can be converted.

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

Next lecture

bull How can we use this type of information to

solve our chemistry problems

ndash Finding the right compound

ndash Designing compounds

ndash Searching for compound data

ndash Patents


An example of storing the molecular graph: the SD file format

MDL SD (structure-data) files store the information about each molecule in the following order:

Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of the information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used by software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
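Treating the molecule as a labelled graph can be sketched in a few lines of Python. This is a minimal illustration, not any particular program's data structure: the function name `build_graph` is an assumption, and the three bond records are the ones quoted on the slide (atom, atom, bond order).

```python
from collections import defaultdict


def build_graph(bonds):
    """Turn (atom_a, atom_b, order) records into an adjacency map.

    graph[a][b] holds the bond order between atoms a and b; each bond is
    stored in both directions so either atom can be used as the lookup key.
    """
    graph = defaultdict(dict)
    for atom_a, atom_b, order in bonds:
        graph[atom_a][atom_b] = order
        graph[atom_b][atom_a] = order
    return graph


# Bond records from the slide: "Bond 3 5 1", "Bond 3 4 1", "Bond 5 6 2"
bonds = [(3, 5, 1), (3, 4, 1), (5, 6, 2)]
graph = build_graph(bonds)

print(sorted(graph[3]))  # neighbours of atom 3
print(graph[5][6])       # bond order between atoms 5 and 6
```

Storing each bond once (the reduced form) keeps the file small; expanding it into a two-way map like this is what makes neighbour look-ups fast when searching.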

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file: benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M END" and "$$$$": these identify the end of this molecule
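Because the format is positional, a reader can simply slice each line at fixed columns. The sketch below assumes the usual V2000 layout (three-character count fields, ten-character coordinate fields, a three-character element symbol); the function name `parse_molfile` is illustrative, and only the header name, atom block and bond block are read.

```python
BENZENE = """\
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) read from one V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3]                      # the "counts line"
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:      # the "atom block"
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()
        atoms.append((symbol, x, y, z))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:   # the "bond block"
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE)
print(name, len(atoms), len(bonds))
```

Slicing by column rather than splitting on whitespace is deliberate: in larger molecules adjacent fields can run together with no space between them.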

Alanine SD file

Bond length is 1.53 Å

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, ...

Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3

Molecule is chiral
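The charge column therefore holds a code, not the charge itself, and a reader has to translate it. A minimal sketch of that lookup, using only the codes listed above (the names `CHARGE_CODES` and `decode_charge` are assumptions made for this example):

```python
# Atom-block charge code -> formal charge, as listed on the slide.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for a charge code.

    Code 4 marks a doublet radical rather than a charge, so it is
    reported separately with a formal charge of zero.
    """
    if code == 4:
        return 0, True
    return CHARGE_CODES[code], False


for code in range(8):
    print(code, decode_charge(code))
```

Note that code 0 is ambiguous by design: it means either "uncharged" or "a charge outside the range -3 to +3", so a careful reader must look elsewhere in the file before assuming the atom is neutral.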

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density map. Which way round should this ring go?]
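One common way to impute bonds from positions alone is a distance test: two atoms are called bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below is an illustration of that idea only, not the method of any specific package; the radii are approximate values in Å and the 0.4 Å tolerance is an assumption. The coordinates are illustrative positions for five atoms (N, CA, C, O, CB) of a valine residue.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def perceive_bonds(atoms, tolerance=0.4):
    """Return index pairs (i, j) of atoms judged to be bonded.

    atoms is a list of (element, (x, y, z)); a pair counts as bonded when
    its separation is at most r_i + r_j + tolerance.
    """
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


valine = [
    ("N", (3.036, -1.035, -3.538)),   # 0: backbone N
    ("C", (4.283, -0.343, -3.015)),   # 1: CA
    ("C", (4.565, 1.067, -3.593)),    # 2: carbonyl C
    ("O", (4.653, 1.287, -4.853)),    # 3: carbonyl O
    ("C", (5.510, -1.245, -3.141)),   # 4: CB
]
print(perceive_bonds(valine))
```

A distance test recovers connectivity but not bond order, and it cannot tell nitrogen from oxygen if the element labels themselves are uncertain, which is exactly the ambiguity the ring question above is pointing at.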

Example Protein Data Bank file (PDB)
(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds

New format: mmCIF
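Like the SD file, the PDB format is column-based, so ATOM and HETATM records are read by slicing. The following is a sketch under the standard PDB column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, temperature factor 61-66); the function name `parse_atom_record` and the two sample lines are for illustration only.

```python
def parse_atom_record(line):
    """Split one ATOM or HETATM line into named fields (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # temperature factor
    }


ATOM_LINE = "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28"
HETATM_LINE = "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34"

protein_atom = parse_atom_record(ATOM_LINE)
hetero_atom = parse_atom_record(HETATM_LINE)
print(protein_atom["name"], protein_atom["res_name"], protein_atom["x"])
print(hetero_atom["record"], hetero_atom["serial"], hetero_atom["b_factor"])
```

Note that nothing in an ATOM line says which atoms are bonded: for standard residues the bonds are imputed from the residue and atom names, and only non-standard groups such as the sulfate get explicit CONECT records.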

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
- An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending (e.g. a carbonyl)
  Non-bonded distance

[Figure: internal coordinates, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
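To see that a Z-matrix really does fix the geometry, it can be converted back to Cartesian coordinates. The sketch below is one possible conversion, not Gaussian's own code: the row layout (tuples with 1-based atom numbers, angles in degrees), the function names, and the choice to put atom 1 at the origin, atom 2 on the z axis and atom 3 in the xz plane are all assumptions of this example. The sign convention for the dihedral may differ between programs.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return tuple(a / norm for a in u)


def _place(a, b, c, r, angle_deg, dihedral_deg):
    """Position at distance r from a, angle (new, a, b), dihedral (new, a, b, c)."""
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    ba = _unit(_sub(a, b))                    # direction from b to a
    n = _unit(_cross(_sub(b, c), ba))         # normal to the a-b-c plane
    m = _cross(n, ba)                         # in-plane axis perpendicular to ba
    return tuple(
        a[k] + r * (-math.cos(theta) * ba[k]
                    + math.sin(theta) * math.cos(phi) * m[k]
                    + math.sin(theta) * math.sin(phi) * n[k])
        for k in range(3))


def zmatrix_to_cartesian(rows):
    """rows: (sym,), (sym, NA, r), (sym, NA, r, NB, angle),
    then (sym, NA, r, NB, angle, NC, dihedral) for all later atoms."""
    coords = []
    for i, row in enumerate(rows):
        if i == 0:
            coords.append((0.0, 0.0, 0.0))
        elif i == 1:
            coords.append((0.0, 0.0, row[2]))
        elif i == 2:
            _, na, r, nb, angle = row
            a, b = coords[na - 1], coords[nb - 1]
            helper = (b[0] + 1.0, b[1], b[2])  # arbitrary point off the z axis
            coords.append(_place(a, b, helper, r, angle, 0.0))
        else:
            _, na, r, nb, angle, nc, dihedral = row
            coords.append(_place(coords[na - 1], coords[nb - 1],
                                 coords[nc - 1], r, angle, dihedral))
    return coords


# Water, with l1 = 0.96 and a1 = 104.0 substituted for the variable names.
water = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
for (symbol, *_), xyz in zip(water, zmatrix_to_cartesian(water)):
    print(symbol, "%8.4f %8.4f %8.4f" % xyz)
```

Changing a single number, such as the angle, moves the atom along a chemically meaningful path, which is why internal coordinates are convenient for driving a reaction coordinate.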

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix 'improper' torsion angles

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
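"Can be parsed" means any standard XML library can read the file without chemistry-specific code. The sketch below uses Python's built-in ElementTree on a hand-written, simplified CML-style fragment for ethanol (heavy atoms only). The `molecule`, `atomArray`, `atom`, `bondArray` and `bond` elements follow CML naming, but the namespace is omitted and the `property` element with `title` and `units` attributes is a simplification invented for this example to show metadata travelling with the value.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative CML-style record (not a schema-valid CML file).
CML_ETHANOL = """\
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="molar mass" units="g/mol">46.07</property>
</molecule>
"""

root = ET.fromstring(CML_ETHANOL)

# Atoms and bonds come out as labelled data, in any order the reader likes.
elements = [atom.get("elementType") for atom in root.iter("atom")]
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# The metadata (what the number is, and its units) is stored next to the value.
prop = root.find("property")
molar_mass = (prop.get("title"), float(prop.text), prop.get("units"))

print(elements)
print(bonds)
print(molar_mass)
```

Unlike the column-based SD and PDB formats, nothing here depends on position in the line: fields are found by name, so new kinds of data can be added without breaking existing readers.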

Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
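The torsional scan in the last bullet can be pictured as a grid: every rotatable bond is stepped through a set of angles and every combination is a candidate conformation. A minimal sketch of just the enumeration step (the function name `torsion_grid` and the step sizes are assumptions; building and scoring each geometry is left out):

```python
from itertools import product


def torsion_grid(n_torsions, step_deg=120):
    """All combinations of torsion angles on a regular grid, in degrees."""
    angles = range(0, 360, step_deg)
    return list(product(angles, repeat=n_torsions))


# Two rotatable bonds at 120-degree steps: a Ramachandran-like 3 x 3 grid.
print(torsion_grid(2))

# The grid grows as (360 / step) ** n_torsions.
for n in (2, 4, 6):
    print(n, "torsions at 60-degree steps:", len(torsion_grid(n, step_deg=60)))
```

The rapid growth in the number of grid points is why systematic scans are only practical for a handful of torsions, and why sampling methods are used for flexible molecules.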

Molecules in 3D: uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
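The "film strip" idea can be sketched by writing one coordinate block per frame into a single text file. For brevity this example uses the simple multi-frame XYZ layout (atom count, comment line, then one line per atom) rather than concatenated SD records, and the frames come from a toy bond stretch, not from a real molecular dynamics run; both function names are assumptions.

```python
def write_xyz_trajectory(symbols, frames):
    """Return a multi-frame XYZ string: one block per snapshot."""
    blocks = []
    for step, coords in enumerate(frames):
        lines = [str(len(symbols)), f"frame {step}"]
        for symbol, (x, y, z) in zip(symbols, coords):
            lines.append(f"{symbol} {x:10.4f} {y:10.4f} {z:10.4f}")
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


def stretch_frames(n_frames, start=1.0, step=0.1):
    """Toy 'trajectory': a diatomic whose bond grows by `step` each frame."""
    return [[(0.0, 0.0, 0.0), (0.0, 0.0, start + i * step)]
            for i in range(n_frames)]


trajectory = write_xyz_trajectory(["H", "H"], stretch_frames(3))
print(trajectory)
```

Only the coordinates change from frame to frame, so real trajectory formats usually store the atom names and bonds once and then pack the coordinates compactly.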

So there are very many file formats, which could be a real pain, but they can be converted:

alc -- Alchemy file                           prep -- Amber PREP file
bs -- Ball & Stick file                       caccrt -- Cacao Cartesian file
ccc -- CCC file                               c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file               cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file  crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                      dmol -- DMol3 Coordinates file
feat -- Feature file                          gam -- GAMESS Output file
gamout -- GAMESS Output file                  gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                     qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                     jout -- Jaguar Output file
bin -- OpenEye Binary file                    mmd -- MacroModel file
mmod -- MacroModel file                       out -- MacroModel file
dat -- MacroModel file                        car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                      sd -- MDL Isis SDF file
mdl -- MDL Molfile file                       mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                mopout -- MOPAC Output file
mmads -- MMADS file                           mpqc -- MPQC file
bgf -- MSI BGF file                           nwo -- NWChem Output file
pdb -- PDB file                               ent -- PDB file
pqs -- PQS file                               qcout -- Q-Chem Output file
res -- ShelX file                             ins -- ShelX file
smi -- SMILES file                            mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                    vmol -- ViewMol file

alc -- Alchemy file                           bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                cacint -- Cacao Internal file
cache -- CAChe MolStruct file                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file               ct -- ChemDraw Connection Table file
cht -- Chemtool file                          cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file  crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                         box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                feat -- Feature file
fh -- Fenske-Hall Z-Matrix file               gamin -- GAMESS Input file
inp -- GAMESS Input file                      gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                    gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                    gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                     jin -- Jaguar Input file
bin -- OpenEye Binary file                    mmd -- MacroModel file
mmod -- MacroModel file                       out -- MacroModel file
dat -- MacroModel file                        sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                       mdl -- MDL Molfile file
mol -- MDL Molfile file                       mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                           bgf -- MSI BGF file
csr -- MSI Quanta CSR file                    nw -- NWChem Input file
pdb -- PDB file                               ent -- PDB file
pov -- POV-Ray Output file                    pqs -- PQS file
report -- Report file                         qcin -- Q-Chem Input file
smi -- SMILES file                            fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                       txyz -- Tinker XYZ file
txt -- Titles file                            unixyz -- UniChem XYZ file
vmol -- ViewMol file                          xed -- XED file
xyz -- XYZ file                               zin -- ZINDO Input file

File formats: the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
- Finding the right compound
- Designing compounds
- Searching for compound data
- Patents

Page 18: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical 5 = -

1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

So there are very many file formats, which could be a real pain, but they can be converted.

alc -- Alchemy file                              prep -- Amber PREP file
bs -- Ball & Stick file                          caccrt -- Cacao Cartesian file
ccc -- CCC file                                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                         dmol -- DMol3 Coordinates file
feat -- Feature file                             gam -- GAMESS Output file
gamout -- GAMESS Output file                     gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                        qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                        jout -- Jaguar Output file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                         sd -- MDL Isis SDF file
mdl -- MDL Molfile file                          mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                   mopout -- MOPAC Output file
mmads -- MMADS file                              mpqc -- MPQC file
bgf -- MSI BGF file                              nwo -- NWChem Output file
pdb -- PDB file                                  ent -- PDB file
pqs -- PQS file                                  qcout -- Q-Chem Output file
res -- ShelX file                                ins -- ShelX file
smi -- SMILES file                               mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                       vmol -- ViewMol file

alc -- Alchemy file                              bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                   cacint -- Cacao Internal file
cache -- CAChe MolStruct file                    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  ct -- ChemDraw Connection Table file
cht -- Chemtool file                             cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                            box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                   feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                  gamin -- GAMESS Input file
inp -- GAMESS Input file                         gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                       gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                       gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                        jin -- Jaguar Input file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                          mdl -- MDL Molfile file
mol -- MDL Molfile file                          mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                              bgf -- MSI BGF file
csr -- MSI Quanta CSR file                       nw -- NWChem Input file
pdb -- PDB file                                  ent -- PDB file
pov -- POV-Ray Output file                       pqs -- PQS file
report -- Report file                            qcin -- Q-Chem Input file
smi -- SMILES file                               fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                          txyz -- Tinker XYZ file
txt -- Titles file                               unixyz -- UniChem XYZ file
vmol -- ViewMol file                             xed -- XED file
xyz -- XYZ file                                  zin -- ZINDO Input file

File formats: the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
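One of these checks can be roughly automated. The sketch below sums the bond orders on each atom and flags anything above a typical neutral valence, which is how a nitro group drawn as N(=O)=O shows up as a five-valent nitrogen. The valence table, the function name `flag_unusual_valences` and the nitromethane example (heavy atoms only) are assumptions for illustration; a real checker would also need to handle charges and aromatic bonds.

```python
# Typical neutral valences: a deliberately crude assumption for illustration.
TYPICAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def flag_unusual_valences(atoms, bonds):
    """atoms: {index: element}; bonds: [(i, j, order)].

    Returns [(index, element, total_bond_order)] for atoms whose summed
    bond order exceeds the typical valence of the element.
    """
    totals = {i: 0 for i in atoms}
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    return [(i, atoms[i], totals[i]) for i in sorted(atoms)
            if totals[i] > TYPICAL_VALENCE.get(atoms[i], 8)]

# Nitromethane heavy atoms, with the nitro group stored as N(=O)=O
atoms = {1: "C", 2: "N", 3: "O", 4: "O"}
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 2)]
flagged = flag_unusual_valences(atoms, bonds)
print(flagged)
```

A flagged atom is not necessarily wrong; it only marks a record worth inspecting by eye.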

Next lecture

• How can we use this type of information to solve our chemistry problems?
- Finding the right compound
- Designing compounds
- Searching for compound data
- Patents


The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

Connect 3 2 5 4 (e.g. PDB protein file format)

Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced) (e.g. SD file format)
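The two ways of writing a connection table can be related with a short Python sketch. It is a minimal illustration: the last three bonds are the Bond lines shown above, while the 2-3 bond (implied by the Connect line) and its order are assumed, and `adjacency_from_bonds` is a hypothetical helper name.

```python
from collections import defaultdict

# Bond list as (from atom, to atom, bond order).
# The 2-3 entry and its order are assumed; the rest are the Bond lines above.
bonds = [(2, 3, 1), (3, 5, 1), (3, 4, 1), (5, 6, 2)]

def adjacency_from_bonds(bonds):
    """Turn a bond list into a labelled graph: atom -> {neighbour: bond order}."""
    adjacency = defaultdict(dict)
    for a, b, order in bonds:
        adjacency[a][b] = order   # store each bond in both directions
        adjacency[b][a] = order
    return adjacency

adjacency = adjacency_from_bonds(bonds)
print("Connect 3", *sorted(adjacency[3]))
```

The bond list stores each bond once, whereas the per-atom neighbour list repeats it for both atoms, which may be what "reduced" refers to.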

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name.
Line 2: information on this molecule.
Line 3: comment (e.g. a date is used here).
The "counts line" has the number of atoms and bonds as a minimum.
The "atom block" has the xyz coordinates of the atom and the element as a minimum.
The "bond block" has one line for each bond: from atom, to atom, bond type.
"M  END" and "$$$$" identify the end of this molecule.
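The layout described above can be read back with a short Python sketch. It uses the benzene record with its decimal points written out, and it splits lines on whitespace for brevity. That shortcut is an assumption: real V2000 files are fixed-width, so fields can run together in larger molecules. The function name `parse_molfile` is hypothetical.

```python
BENZENE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a single V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # line 1: molecule name
    counts = lines[3].split()                    # line 4: counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(f) for f in line.split()[:3])   # from, to, bond type
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(BENZENE)
print(name, len(atoms), "atoms", len(bonds), "bonds")
```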

Alanine SD file

Bond length is 1.53 Å

x y z, symbol, mass diff, charge, stereo, h-count…

Charge: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3

Molecule is chiral
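The charge codes listed above are easy to get backwards, so a small lookup helps. This is a sketch only: the names `CHARGE_CODES` and `decode_charge` are hypothetical, and treating the doublet radical (code 4) as formally neutral is an assumption made here.

```python
# Atom-block charge codes as listed above (code -> formal charge)
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:
        return 0, True          # doublet radical, taken here as uncharged
    return CHARGE_CODES[code], False

for code in range(8):
    print(code, decode_charge(code))
```

For codes 1-3 and 5-7 the charge is simply 4 minus the code.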

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
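Imputing bonds from positions is often done with a simple distance rule. In this sketch two atoms are called bonded when they are closer than the sum of their covalent radii plus a tolerance. The radii, the 0.4 Å tolerance and the name `infer_bonds` are assumptions for illustration, and the test molecule is the benzene ring from the SD file example with its decimal points restored.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration)
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # assumed slack in angstroms

def infer_bonds(atoms):
    """atoms: list of (element, x, y, z). Returns 1-based index pairs judged bonded."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms, start=1), 2):
        distance = math.dist(a[1:], b[1:])
        if distance <= COVALENT_RADIUS[a[0]] + COVALENT_RADIUS[b[0]] + TOLERANCE:
            bonds.append((i, j))
    return bonds

benzene = [
    ("C", 1.9050, -0.7932, 0.0), ("C", 1.9050, -2.1232, 0.0),
    ("C", 0.7531, -0.1282, 0.0), ("C", 0.7531, -2.7882, 0.0),
    ("C", -0.3987, -0.7932, 0.0), ("C", -0.3987, -2.1232, 0.0),
]
inferred = infer_bonds(benzene)
print(inferred)
```

The rule recovers which atoms are connected but says nothing about bond order, which is one reason imputed bonds in protein files need checking.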

Example protein databank file (pdb)

(uses an adjacency list: the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions.
ATOM: protein atoms, including xyz, occupancy and temperature factor.
HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
CONECT: bonds.

New format: mmCIF
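Because the PDB format is column-based, the atom records are read by slicing fixed character positions rather than splitting on spaces. The sketch below parses one protein atom and one sulfate atom from the example, written out in the standard column layout with decimal points restored. The function name `parse_atom_record` and the dictionary keys are hypothetical.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM record using the fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66 (temperature factor)
    }

records = [
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28",
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34",
]
parsed = [parse_atom_record(line) for line in records]
for atom in parsed:
    print(atom["record"], atom["name"], atom["xyz"], atom["b_factor"])
```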

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
- Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance

[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

z-matrix: 'improper' torsion angles
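To see what a Z-matrix encodes, it can be converted back to Cartesian coordinates. The sketch below reads the atom lines and the variable list in the layout described for the water example (O-H 0.96 Å, H-O-H 104.0°), puts the first atom at the origin and the second on the z axis, and places each later atom from its distance, angle and dihedral. The function names and the choice of axes are assumptions, and this is an illustration rather than the routine any particular program uses.

```python
import math

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])

def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return tuple(a / norm for a in u)

def _place(a, b, c, r, theta, phi):
    """New atom at distance r from a, angle theta (new-a-b), dihedral phi (new-a-b-c)."""
    ab = _unit(_sub(a, b))
    n = _unit(_cross(_sub(b, c), ab))
    m = _cross(n, ab)
    d = (-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(a[k] + d[0]*ab[k] + d[1]*m[k] + d[2]*n[k] for k in range(3))

def zmatrix_to_cartesian(text):
    """Return [(symbol, (x, y, z))] for a Z-matrix with a variable list."""
    atom_part, _, var_part = text.strip().partition("\n\n")
    variables = {}
    for line in var_part.splitlines():
        if line.split():
            name, number = line.split()
            variables[name] = float(number)

    def value(token):                      # variable name, "-name", or a literal number
        sign = -1.0 if token.startswith("-") else 1.0
        key = token.lstrip("+-")
        return sign * variables[key] if key in variables else float(token)

    atoms = []
    for line in atom_part.splitlines():
        f = line.split()
        if len(atoms) == 0:
            xyz = (0.0, 0.0, 0.0)          # first atom at the origin
        elif len(atoms) == 1:
            xyz = (0.0, 0.0, value(f[2]))  # second atom along z
        else:
            a = atoms[int(f[1]) - 1][1]
            b = atoms[int(f[3]) - 1][1]
            if len(f) > 5:
                c, phi = atoms[int(f[5]) - 1][1], math.radians(value(f[6]))
            else:                          # third atom: any off-axis reference point
                c, phi = (b[0] + 1.0, b[1], b[2]), 0.0
            xyz = _place(a, b, c, value(f[2]), math.radians(value(f[4])), phi)
        atoms.append((f[0], xyz))
    return atoms

WATER = "O\nH 1 l1\nH 1 l1 2 a1\n\nl1 0.96\na1 104.0\n"
water = zmatrix_to_cartesian(WATER)
for symbol, xyz in water:
    print(symbol, [round(c, 4) for c in xyz])
```

Changing a single variable, such as `a1`, and converting again moves the atoms along that internal coordinate, which is what makes the representation convenient for driving a reaction coordinate.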

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
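To give a feel for how structure and metadata sit together in XML, here is a minimal sketch that builds and re-parses a CML-style record for the heavy atoms of ethanol using Python's standard library. The element and attribute names follow the general CML style (`molecule`, `atomArray`, `bondArray`), but the exact layout, and in particular the `property` element with its `title` and `units` attributes, should be read as assumptions for illustration rather than as the CML schema.

```python
import xml.etree.ElementTree as ET

# Build the record: heavy atoms only, hydrogens left implicit
molecule = ET.Element("molecule", id="ethanol")
atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
bond_array = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")

# Metadata travels with the value: what it is and in which units (hypothetical names)
prop = ET.SubElement(molecule, "property", title="molar mass", units="g/mol")
prop.text = "46.07"

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Any XML parser can read the record back without format-specific code
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
units = parsed.find("property").get("units")
print(elements, units)
```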

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
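The cost of scanning torsion angles grows quickly with the number of rotatable bonds, which a few lines of Python make concrete. This is a sketch of a plain grid scan only; the function name `torsion_grid`, the 30° step and the choice of three torsions are assumptions, and practical conformer generators prune or sample this space rather than enumerating it.

```python
from itertools import product

def torsion_grid(n_torsions, step_degrees):
    """Return an iterator over every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)

# Three rotatable bonds scanned in 30-degree steps: 12 values per torsion
conformers = list(torsion_grid(n_torsions=3, step_degrees=30))
print(len(conformers), "candidate conformations")
```

Each extra torsion multiplies the number of candidates by the number of grid points per bond.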

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…




Page 21: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical 5 = -

1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example Protein Data Bank (PDB) file

(bonds are stored as an adjacency list in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END

Annotations on the records:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds

New format: mmCIF
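PDB files are fixed-column text, so the atom records are usually read by slicing each line at known column positions rather than by splitting on spaces. The Python sketch below reads ATOM, HETATM and CONECT records. The function name `parse_pdb` and the dictionary keys are illustrative choices, and `SAMPLE` is a cut-down set of records in the style of the excerpt above, not a complete file.

```python
import math


def parse_pdb(text):
    """Return (atoms, bonds) from ATOM/HETATM and CONECT records."""
    atoms, bonds = {}, set()
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            serial = int(line[6:11])
            atoms[serial] = {
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),  # temperature factor
            }
        elif record == "CONECT":
            # Serial numbers sit in consecutive 5-character fields.
            serials = [int(line[i:i + 5]) for i in range(6, len(line.rstrip()), 5)]
            for partner in serials[1:]:
                bonds.add(tuple(sorted((serials[0], partner))))
    return atoms, bonds


SAMPLE = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511
CONECT 1511 1510
"""

atoms, bonds = parse_pdb(SAMPLE)
print(atoms[2]["name"], atoms[2]["xyz"], bonds)
print(round(math.dist(atoms[1]["xyz"], atoms[2]["xyz"]), 2))
```

Note that no CONECT record links the two protein atoms: their bond is not stored and has to be imputed, as described earlier.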

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
   Bond length
   Bond angle
   Torsion angle
   Out-of-plane bending, e.g. a carbonyl
   Non-bonded distance

[Figure: internal coordinates defined using dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix: ‘improper’ torsion angles
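To see how the distances, angles and dihedrals of a Z-matrix fix the atom positions, it helps to convert one back to Cartesian coordinates. The Python sketch below does this for input laid out as in the construction steps above (atom lines, a blank line, then one variable per line). The function names are illustrative, the sign convention chosen for the dihedral is an assumption and may differ from a given program, and the sketch does not handle dummy atoms or collinear reference atoms.

```python
import math


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]


def _unit(a):
    n = math.sqrt(_dot(a, a))
    return [x / n for x in a]


def place_atom(a, b, c, r, angle_deg, dihedral_deg):
    """New atom at distance r from a, angle (new, a, b), dihedral (new, a, b, c)."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    e1 = _unit(_sub(b, a))  # axis from a to b
    bc = _sub(c, b)
    # Part of b->c perpendicular to the axis (fails if a, b, c are collinear).
    u = _unit([bc[i] - _dot(bc, e1) * e1[i] for i in range(3)])
    w = _cross(u, e1)
    return [a[i] + r * (math.cos(theta) * e1[i]
                        + math.sin(theta) * (math.cos(phi) * u[i] + math.sin(phi) * w[i]))
            for i in range(3)]


def zmatrix_to_cartesian(text):
    """Convert a Z-matrix (atoms, blank line, variables) to [(symbol, [x, y, z])]."""
    atom_part, _, var_part = text.strip().partition("\n\n")
    variables = {}
    for line in var_part.splitlines():
        if line.strip():
            name, value = line.split()
            variables[name] = float(value)

    def value_of(token):
        try:
            return float(token)  # literal number such as 180.0
        except ValueError:
            sign = -1.0 if token.startswith("-") else 1.0
            return sign * variables[token.lstrip("-")]

    atoms = []
    for fields in (line.split() for line in atom_part.splitlines()):
        if len(atoms) == 0:
            xyz = [0.0, 0.0, 0.0]  # first atom at the origin
        elif len(atoms) == 1:
            xyz = [0.0, 0.0, value_of(fields[2])]  # second atom on the z axis
        else:
            a = atoms[int(fields[1]) - 1][1]
            b = atoms[int(fields[3]) - 1][1]
            if len(atoms) == 2:
                # Third atom: no dihedral given, so place it in the xz plane.
                c, dihedral = [b[0] + 1.0, b[1], b[2]], 0.0
            else:
                c, dihedral = atoms[int(fields[5]) - 1][1], value_of(fields[6])
            xyz = place_atom(a, b, c, value_of(fields[2]), value_of(fields[4]), dihedral)
        atoms.append((fields[0], xyz))
    return atoms


WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""

for symbol, xyz in zmatrix_to_cartesian(WATER):
    print(symbol, [round(v, 3) for v in xyz])
```

Changing a single variable (for example `l1`) and converting again moves every atom that depends on it, which is what makes internal coordinates convenient for driving a reaction coordinate.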

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic Principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
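As an illustration of what "can be parsed" means in practice, the Python sketch below reads a small CML-style description of ethanol with the standard library XML parser. The fragment is hand-written for this example: the element and attribute names follow CML conventions as far as I know them, the namespace declaration is omitted for brevity, and the atom ids are arbitrary.

```python
import xml.etree.ElementTree as ET

# Hand-written CML-style fragment for ethanol (heavy atoms only,
# with hydrogens given as counts). Illustrative, not a validated CML file.
ETHANOL_CML = """\
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="3"/>
    <atom id="a2" elementType="C" hydrogenCount="2"/>
    <atom id="a3" elementType="O" hydrogenCount="1"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""

root = ET.fromstring(ETHANOL_CML)
atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
bonds = [tuple(b.get("atomRefs2").split()) for b in root.iter("bond")]
hydrogens = sum(int(a.get("hydrogenCount")) for a in root.iter("atom"))
print(atoms, bonds, hydrogens)
```

Because every value is labelled by its element and attribute name, further metadata (units, dates, links to other molecules) can be added without breaking a reader that only looks for atoms and bonds.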

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
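The simplest way to scan conformational space is a systematic grid over the rotatable torsions. The Python sketch below only enumerates the grid points; a real conformer search would build and score a structure at each point. The function name `torsion_grid` and the 60 degree default step are assumptions made for illustration.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg=60):
    """Return every combination of torsion angles on a regular grid."""
    angles = range(-180, 180, step_deg)
    return product(angles, repeat=n_torsions)


# Two torsions at 60 degree steps: a 6 x 6 grid, like a coarse Ramachandran plot.
grid = list(torsion_grid(2))
print(len(grid), grid[:3])

# The number of grid points grows as (360 / step) ** n_torsions.
for n in (2, 4, 6):
    print(n, (360 // 60) ** n)
```

The rapid growth in the number of grid points with each extra torsion is why practical methods sample or prune the space rather than enumerate it in full.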

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
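The "film strip" idea can be sketched with any format that allows snapshots to be concatenated. The Python example below uses the simple XYZ format (atom count, comment line, then one atom per line) instead of SD files, purely to keep the example short. The function name `read_xyz_frames` and the two-frame trajectory are made up for illustration.

```python
def read_xyz_frames(text):
    """Split concatenated XYZ snapshots into a list of (comment, atoms) frames."""
    lines = text.splitlines()
    frames, i = [], 0
    while i < len(lines):
        if not lines[i].strip():  # skip blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i])
        comment = lines[i + 1]
        atoms = []
        for row in lines[i + 2:i + 2 + n_atoms]:
            symbol, x, y, z = row.split()
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append((comment, atoms))
        i += 2 + n_atoms
    return frames


# Two snapshots of a stretching H2 molecule (illustrative numbers).
TRAJECTORY = """\
2
frame 0
H 0.0 0.0 0.00
H 0.0 0.0 0.74
2
frame 1
H 0.0 0.0 0.00
H 0.0 0.0 0.80
"""

frames = read_xyz_frames(TRAJECTORY)
for comment, atoms in frames:
    print(comment, atoms[1][3] - atoms[0][3])
```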

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats – the bane of our lives

Interconversion program – Babel

There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.
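In practice a conversion is usually a single command. The Python sketch below builds and (optionally) runs a command line for Open Babel, the successor to the Babel program, assuming its `obabel` executable is installed. The file names are hypothetical, and the helper names `obabel_command` and `convert` are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def obabel_command(source, target):
    """Build an Open Babel command line; formats follow the file extensions."""
    return ["obabel", str(source), "-O", str(target)]


def convert(source, target):
    """Run the conversion if the obabel executable is installed."""
    if shutil.which("obabel") is None:
        raise RuntimeError("obabel not found on PATH")
    subprocess.run(obabel_command(source, target), check=True)
    return Path(target)


# Hypothetical file names; only the command is built here, nothing is run.
print(obabel_command("3s7a.pdb", "3s7a.mol2"))
```

A converter can only write what the input format stored, so information that is missing from the source file (bond orders in a PDB file, for example) has to be guessed during the conversion.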

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the heavy atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weight and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but are not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
– Finding the right compound
– Designing compounds
– Searching for compound data
– Patents




Example Protein Data Bank (PDB) file
(bonds are stored in CONECT records, which list each atom's bonded neighbours)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations on the file:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
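As a rough sketch of how the atom records might be read by a program, the snippet below splits whitespace-separated ATOM and HETATM lines written in the style of the excerpt, with the decimal points given explicitly. This is an assumption made for brevity: real PDB files are fixed-column, so a robust reader should slice each line by column position instead. The names `Atom` and `parse_atoms` are illustrative and do not come from any library.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    resnum: int
    x: float
    y: float
    z: float
    occupancy: float
    bfactor: float
    hetero: bool  # True for HETATM (non-protein) records


def parse_atoms(text: str) -> List[Atom]:
    """Collect ATOM/HETATM records; every other record type is skipped."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("ATOM", "HETATM"):
            continue
        serial, name, residue, chain, resnum = fields[1:6]
        x, y, z, occupancy, bfactor = (float(v) for v in fields[6:11])
        atoms.append(Atom(int(serial), name, residue, chain, int(resnum),
                          x, y, z, occupancy, bfactor,
                          fields[0] == "HETATM"))
    return atoms


# Sample lines in the style of the excerpt (whitespace-separated).
EXCERPT = """\
ATOM 1 N VAL A 1 3.036 -1.035 -3.538 1.00 31.28
ATOM 2 CA VAL A 1 4.283 -0.343 -3.015 1.00 28.41
HETATM 1510 S SO4 A 187 25.993 -1.362 5.893 1.00 22.34
CONECT 1510 1511 1512 1513 1514
"""

atoms = parse_atoms(EXCERPT)
for atom in atoms:
    kind = "hetero" if atom.hetero else "protein"
    print(atom.serial, atom.name, atom.residue, kind, (atom.x, atom.y, atom.z))
```

The CONECT line is ignored here on purpose; bonds would need a second pass that maps serial numbers to neighbours.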

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates
• Also called a Z-matrix
  – Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
  – An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
  – A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance
[Figure: internal coordinates of a molecule, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
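To make the construction rules concrete, here is a minimal sketch of turning a Z-matrix of this kind into Cartesian coordinates. It assumes the layout described in the steps (atom lines, a blank line, then one `name value` line per variable), distances in ångströms and angles in degrees. The function names `place` and `zmatrix_to_cartesian` are illustrative, and the placement of the first three atoms (origin, z-axis, xz-plane) is a convention chosen here, not something the format prescribes.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    length = math.sqrt(sum(x * x for x in a))
    return tuple(x / length for x in a)


def place(a, b, c, r, theta_deg, phi_deg):
    """Point d with |c-d| = r, angle b-c-d = theta and dihedral a-b-c-d = phi."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))
    m = _cross(n, bc)
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


def zmatrix_to_cartesian(text):
    """Return [(symbol, (x, y, z)), ...] for a Z-matrix given as text."""
    atom_part, _, var_part = text.strip().partition("\n\n")
    variables = {}
    for line in var_part.splitlines():
        if line.split():
            name, value = line.split()
            variables[name] = float(value)

    def value_of(token):
        # Accept a variable name, a negated variable (-da1) or a literal number.
        key = token.lstrip("-")
        if key in variables:
            return -variables[key] if token.startswith("-") else variables[key]
        return float(token)

    symbols, coords = [], []
    for line in atom_part.splitlines():
        f = line.split()
        symbols.append(f[0])
        if len(coords) == 0:
            coords.append((0.0, 0.0, 0.0))              # first atom at the origin
        elif len(coords) == 1:
            coords.append((0.0, 0.0, value_of(f[2])))   # second atom on the z-axis
        else:
            c = coords[int(f[1]) - 1]                   # NA
            b = coords[int(f[3]) - 1]                   # NB
            if len(f) >= 7:
                a, phi = coords[int(f[5]) - 1], value_of(f[6])   # NC, dihedral
            else:
                a, phi = (b[0] + 1.0, b[1], b[2]), 0.0  # dummy reference point
            coords.append(place(a, b, c, value_of(f[2]), value_of(f[4]), phi))
    return list(zip(symbols, coords))


WATER = "O\nH 1 l1\nH 1 l1 2 a1\n\nl1 0.96\na1 104.0\n"
water = zmatrix_to_cartesian(WATER)
for symbol, xyz in water:
    print(symbol, "%.4f %.4f %.4f" % xyz)
```

Changing `a1` or `l1` and re-running is the kind of single-variable edit that makes internal coordinates convenient for driving a reaction coordinate.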

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Note the ‘improper’ torsion angles in this Z-matrix.

What about comprehensive properties of molecules – they are more than xyz

XML and molecules
• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.
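A small sketch may help show what "data plus metadata" looks like in practice. The snippet below builds an XML description of ethanol with Python's standard library and reads it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `atomRefs2`) are modelled on CML conventions, but the snippet is not validated against the CML schema, and the `property`/`scalar` element with its `units` value is an illustrative assumption about how a measured value and its units could be attached.

```python
import xml.etree.ElementTree as ET

# Heavy-atom skeleton of ethanol: C-C-O
molecule = ET.Element("molecule", {"id": "ethanol", "title": "Ethanol"})

atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": element})

bond_array = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, "bond", {"atomRefs2": pair, "order": "1"})

# Metadata travels with the value: here the units of a boiling point.
prop = ET.SubElement(molecule, "property", {"title": "boiling point"})
scalar = ET.SubElement(prop, "scalar", {"units": "celsius"})
scalar.text = "78.4"

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Because the file is structured, any XML parser can recover the content.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
units = parsed.find("property/scalar").get("units")
print(elements, units)
```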

Ethanol
<CML>
– Can be parsed
– Can contain reactions, properties etc.
– Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
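The SMILES string and the formula layer of the InChI describe the same molecule, which can be checked with a few lines of code. The sketch below handles only a tiny SMILES subset (the atoms C, N and O, branches in parentheses, and single, double or triple bonds) and fills in hydrogens from assumed normal valences; it is not a general SMILES parser, and `smiles_formula` is a made-up name.

```python
from collections import Counter

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed normal valences for implicit H


def smiles_formula(smiles):
    """Molecular formula (Hill order) for a very small SMILES subset."""
    atoms = []      # element symbol of each heavy atom
    bond_sum = []   # sum of bond orders at each heavy atom
    stack = []      # attachment points remembered at '('
    prev = None     # atom the next atom will be bonded to
    order = 1       # order of the next bond
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch in "=#":
            order = 2 if ch == "=" else 3
        elif ch in VALENCE:
            atoms.append(ch)
            bond_sum.append(0)
            index = len(atoms) - 1
            if prev is not None:
                bond_sum[prev] += order
                bond_sum[index] += order
            prev, order = index, 1
        else:
            raise ValueError("unsupported SMILES character: %r" % ch)

    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[a] - used for a, used in zip(atoms, bond_sum))
    keys = [k for k in ("C", "H") if counts[k]]
    keys += sorted(k for k in counts if k not in ("C", "H"))
    return "".join(k + (str(counts[k]) if counts[k] > 1 else "") for k in keys)


print(smiles_formula("C(C)O"))  # the slide's SMILES for ethanol
print(smiles_formula("CCO"))    # an equivalent way of writing it
```

Both strings give C2H6O, the formula that appears as the first layer of the InChI; the fact that two different SMILES strings describe one molecule is why canonical forms and identifiers such as InChI are useful.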

Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions).
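A systematic torsional scan can be sketched very simply: every rotatable bond is stepped through a grid of angles and every combination is a candidate conformation. The snippet below only enumerates the grid (building and scoring each geometry is left out), and the function name `torsion_grid` and the 30 degree step are assumptions chosen for illustration.

```python
import itertools


def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return itertools.product(angles, repeat=n_torsions)


# Two torsions on a coarse grid: a Ramachandran-like 2D scan.
for combo in itertools.islice(torsion_grid(2, 120), 4):
    print(combo)

# The number of candidates grows as (360 / step) ** n_torsions.
count = sum(1 for _ in torsion_grid(3, 30))
print(count)
```

This exponential growth is why practical conformer generators usually combine coarse grids with rules, random sampling or energy-based pruning rather than scanning exhaustively.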

Molecules in 3D – uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
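The film-strip idea can be sketched with the simple XYZ format in place of SD files (a swap made here only because XYZ is shorter to write): each frame is an atom count, a comment line, and one line per atom, and frames are just concatenated. The function names and the stretching-H2 frames are made up for illustration and are not output from a real simulation.

```python
def write_xyz_frames(symbols, frames):
    """Concatenate frames as XYZ blocks: count line, comment line, atom lines."""
    blocks = []
    for index, coords in enumerate(frames):
        lines = [str(len(symbols)), "frame %d" % index]
        lines += ["%s %.3f %.3f %.3f" % (s, x, y, z)
                  for s, (x, y, z) in zip(symbols, coords)]
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


def read_xyz_frames(text):
    """Split a multi-frame XYZ string back into a list of coordinate lists."""
    lines = text.splitlines()
    frames, pos = [], 0
    while pos < len(lines):
        n_atoms = int(lines[pos])
        atom_lines = lines[pos + 2: pos + 2 + n_atoms]
        frames.append([tuple(float(v) for v in line.split()[1:4])
                       for line in atom_lines])
        pos += n_atoms + 2
    return frames


# Three hypothetical snapshots of an H-H bond being stretched along z.
symbols = ["H", "H"]
frames = [[(0.0, 0.0, 0.0), (0.0, 0.0, d)] for d in (0.70, 0.74, 0.78)]

movie = write_xyz_frames(symbols, frames)
print(movie)
recovered = read_xyz_frames(movie)
```

Note that only coordinates change from frame to frame here; the bonding is assumed constant, which is the usual simplification in classical molecular dynamics trajectories.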

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file


alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats – the bane of our lives
• Interconversion program – Babel
• Recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
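To illustrate why bonds that are "added" rather than observed can be wrong, here is a minimal sketch of one common heuristic: two atoms are called bonded when their distance is below the sum of their covalent radii plus a tolerance. The radii are approximate literature values, the 0.45 Å tolerance is an assumption, and the coordinates are the sulfate atoms from the PDB excerpt of entry 3S7A with decimal points written out; real programs use more elaborate rules.

```python
import itertools
import math

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def infer_bonds(atoms, tolerance=0.45):
    """Guess bonds from distances; atoms is a list of (element, (x, y, z))."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in itertools.combinations(
            enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Sulfate ion atoms (S, O1, O2, O3) as listed in the HETATM records.
SULFATE = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]

bonds = infer_bonds(SULFATE)
print(bonds)
```

The heuristic finds the three S–O connections but says nothing about bond order or charge, and at low resolution a misplaced atom can easily create or break an inferred bond, which is exactly the caveat raised in the list above.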

Next lecture
• How can we use this type of information to solve our chemistry problems?
  – Finding the right compound
  – Designing compounds
  – Searching for compound data
  – Patents

Page 27: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

'improper' torsion angles

What about comprehensive properties of molecules - they are more than xyz

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See "Chemical Markup, XML and the World-Wide Web. Part I. Basic principles", P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
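To make "can be parsed" concrete, here is a small sketch using Python's standard XML parser. The fragment is hand-written for this example and deliberately simplified: it omits the XML namespace and the hydrogens, and the element and attribute names (`atomArray`, `elementType`, `bondArray`, `atomRefs2`, `property`, `scalar`) follow CML conventions as far as I am aware, so treat them as assumptions rather than a validated CML document. The boiling point is an approximate value included only to show metadata (units) travelling with the data.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free CML-style fragment for ethanol (heavy atoms only).
CML_ETHANOL = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="celsius">78.4</scalar>
  </property>
</molecule>"""

root = ET.fromstring(CML_ETHANOL)

# Atom ids mapped to element symbols.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Each bond as a pair of atom ids.
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# A property value together with its metadata (title and units).
prop = root.find("property")
scalar = prop.find("scalar")
value, units = float(scalar.text), scalar.get("units")

print(atoms)
print(bonds)
print(prop.get("title"), value, units)
```

Because the units and the property title are stored next to the number, a program reading the file does not have to guess what the value means.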

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
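An InChI string is built from layers separated by "/" characters: a version, the molecular formula, then layers that each begin with a one-letter tag (for example "c" for connections and "h" for hydrogens). The sketch below splits the ethanol identifier, written with its separators, into those layers. The function name `split_inchi` and the returned dictionary layout are choices made for this illustration only; it does no validation and is not a substitute for the official InChI software.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and tagged layers."""
    _prefix, _, body = inchi.partition("=")
    version, formula, *layers = body.split("/")
    # Each remaining layer starts with a one-letter tag such as 'c' or 'h'.
    tagged = {layer[0]: layer[1:] for layer in layers}
    return {"version": version, "formula": formula, "layers": tagged}

ethanol = split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"])
print(ethanol["layers"])
```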

Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
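The simplest form of the torsional scan mentioned in the last point is a systematic grid: every rotatable bond is stepped through a fixed set of angles and every combination is considered. The sketch below only enumerates the angle combinations (it builds no structures and computes no energies) to show how quickly the search space grows with the number of torsions. The function names and the 30 degree step are assumptions for illustration; the step is assumed to divide 360 exactly.

```python
from itertools import product

def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_deg) ** n_torsions

# Two torsions, as in a Ramachandran plot (phi and psi).
two_torsion_points = list(torsion_grid(2, 30))
print(len(two_torsion_points))

# The same step for more torsions: the count grows exponentially.
for n in (2, 5, 10):
    print(n, grid_size(n, 30))
```

This exponential growth is why practical conformer generators usually prune the grid or sample it at random rather than enumerate it fully.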

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
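The "film strip" picture can be sketched directly. In an SD file each record ends with a line containing only `$$$$`, so a concatenated series of snapshots can be cut back into frames by splitting on that delimiter. The example below uses placeholder text instead of real molfile blocks, and the function name `split_sd_records` is invented for this illustration; the simulation packages linked above generally use their own trajectory formats.

```python
def split_sd_records(text):
    """Split concatenated SD-file text into one string per record (frame)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":      # record terminator in SD files
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

# Placeholder frames; a real file would hold a full molfile block per frame.
TRAJECTORY = (
    "frame 1 molfile text\n$$$$\n"
    "frame 2 molfile text\n$$$$\n"
    "frame 3 molfile text\n$$$$\n"
)

frames = split_sd_records(TRAJECTORY)
print(len(frames))
```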

So there are very many file formats - which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
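As a sketch of what interconversion looks like in practice, the snippet below builds a command line for Open Babel, the open-source successor to Babel, whose `obabel` tool infers the input and output formats from the file extensions. It only constructs the command; running it assumes Open Babel is installed, which this example does not check. The function name `obabel_command`, the file names, and the short extension set (a handful taken from the lists above) are hypothetical choices for illustration.

```python
from pathlib import Path

# A few extensions from the format lists; not the full set Open Babel supports.
KNOWN_EXTENSIONS = {"cml", "mol", "mol2", "pdb", "sdf", "smi", "xyz"}

def obabel_command(infile, outfile):
    """Build an `obabel` command that converts infile to outfile."""
    for name in (infile, outfile):
        ext = Path(name).suffix.lstrip(".").lower()
        if ext not in KNOWN_EXTENSIONS:
            raise ValueError(f"extension not in the illustrative list: {name}")
    return ["obabel", str(infile), "-O", str(outfile)]

command = obabel_command("ethanol.sdf", "ethanol.mol2")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where `obabel` is on the path. Remember that a conversion can only carry over what the source format stores, so information such as 3D coordinates or bond orders may be lost or have to be guessed.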

Molecules on computers - things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.
In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect - check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).

Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Finding the right compound
  - Designing compounds
  - Searching for compound data
  - Patents

Page 31: Useful Information - University of Cambridge 1 2016a.pdf · 2016-11-24 · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics,

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
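The film-strip picture can be sketched in a few lines of Python. In an SD file each record ends with a line containing only $$$$, so a concatenated file can be cut back into individual frames at those lines. The records below are placeholders (the molfile bodies are omitted) and the function name split_sd_frames is an assumption made for this example.

```python
def split_sd_frames(text):
    """Split concatenated SD-file text into one string per record (frame)."""
    frames, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":      # SD record terminator
            frames.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks a terminator, if it has any content
    if any(item.strip() for item in current):
        frames.append("\n".join(current))
    return frames


# Placeholder trajectory: three "snapshots" with the molfile bodies omitted.
SD_TEXT = (
    "frame 1\n  (molfile body omitted)\n$$$$\n"
    "frame 2\n  (molfile body omitted)\n$$$$\n"
    "frame 3\n  (molfile body omitted)\n$$$$\n"
)

if __name__ == "__main__":
    for number, frame in enumerate(split_sd_frames(SD_TEXT), start=1):
        print(number, frame.splitlines()[0])
```

Playing the frames back in order gives the movie; each frame on its own is just an ordinary structure record.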

So there are very many file formats, which could be a real pain, but they can be converted.

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
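One way to see why a single conversion program can cope with so many formats is the reader/writer pattern: each format is parsed into one common in-memory representation and written out from it, so N formats need N readers and N writers rather than a converter for every pair. The Python sketch below illustrates this with the simple XYZ layout and a made-up comma-separated format called "toy". The toy format, the function names and the coordinates are all assumptions for illustration; this is not how Babel itself is implemented.

```python
def read_xyz(text):
    """XYZ layout: atom count, comment line, then 'element x y z' rows."""
    lines = text.strip().splitlines()
    count = int(lines[0])
    atoms = []
    for line in lines[2:2 + count]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    return atoms


def write_xyz(atoms):
    rows = [str(len(atoms)), "converted"]
    rows += [f"{el} {x:.4f} {y:.4f} {z:.4f}" for el, x, y, z in atoms]
    return "\n".join(rows) + "\n"


def read_toy(text):
    """Hypothetical 'toy' format: one 'element,x,y,z' row per atom."""
    atoms = []
    for line in text.strip().splitlines():
        element, x, y, z = line.split(",")
        atoms.append((element, float(x), float(y), float(z)))
    return atoms


def write_toy(atoms):
    return "\n".join(f"{el},{x:.4f},{y:.4f},{z:.4f}" for el, x, y, z in atoms) + "\n"


# Registry keyed by file extension: adding a format means adding one entry each.
READERS = {"xyz": read_xyz, "toy": read_toy}
WRITERS = {"xyz": write_xyz, "toy": write_toy}


def convert(text, in_format, out_format):
    """Parse with the reader for in_format, then write with out_format's writer."""
    return WRITERS[out_format](READERS[in_format](text))


# Illustrative (not experimental) coordinates for a water-like molecule.
WATER_XYZ = """3
water
O 0.0000 0.0000 0.0000
H 0.9600 0.0000 0.0000
H -0.2400 0.9300 0.0000
"""

if __name__ == "__main__":
    print(convert(WATER_XYZ, "xyz", "toy"))
```

Anything the common representation does not hold (here: bonds, charges, the title line) is lost on conversion, which is one reason format interconversion needs care.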

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).
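Two of these pitfalls, added hydrogens and the nitro group, can be illustrated with a small valence check in Python. The sketch assumes a very crude model: normal valences of 4, 3 and 2 for C, N and O, adjusted by the formal charge (a rule that is only sensible for the N and O cases used here). The function name and data layout are assumptions for this example; real toolkits use far more elaborate valence models.

```python
# Crude "normal" valences; an assumption that covers only the atoms used below.
NORMAL_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(atoms, bonds):
    """atoms: list of (element, formal_charge); bonds: list of (i, j, order).

    Returns the number of hydrogens needed to fill each atom's valence.
    A negative number means the stored bonds exceed the normal valence.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    result = []
    for (element, charge), bonded in zip(atoms, used):
        # Crude charge rule: N+ may form 4 bonds, O- only 1
        allowed = NORMAL_VALENCE[element] + charge
        result.append(allowed - bonded)
    return result


# Ethanol heavy atoms C-C-O: hydrogens are not stored, so they must be inferred.
ethanol = implicit_hydrogens([("C", 0), ("C", 0), ("O", 0)],
                             [(0, 1, 1), (1, 2, 1)])

# Nitromethane stored with a "pentavalent" nitrogen, C-N(=O)=O.
nitro_pentavalent = implicit_hydrogens(
    [("C", 0), ("N", 0), ("O", 0), ("O", 0)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 2)])

# The same molecule stored charge-separated, C-[N+](=O)[O-].
nitro_charged = implicit_hydrogens(
    [("C", 0), ("N", 1), ("O", 0), ("O", -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)])

if __name__ == "__main__":
    print(ethanol)            # hydrogens on C, C, O
    print(nitro_pentavalent)  # negative entry flags the nitrogen
    print(nitro_charged)
```

The same nitro group passes or fails the check depending on which convention the file used, so software that reads one convention and writes the other can silently change the stored bonding.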

Next lecture

• How can we use this type of information to solve our chemistry problems?

– Finding the right compound

– Designing compounds

– Searching for compound data

– Patents
