Bridging cheminformatics and bioinformatics using protein structures Edith Chan Inpharmatica London 10 April 2001


Bioinformatics Cheminformatics

SELECTING THE BEST TARGETS

Disease-association doesn’t make a protein a target - requires validation as point of intervention in pathway

Having good biological rationale doesn’t make a protein tractable to chemistry (drugable)

Genomics, HTS and Combichem have increased numerical throughput many hundred fold - overload of poorly integrated data, shortfall in productivity

[Figure: Target Validation Process (Disease → Target) → Target Selection → Drug Discovery Process (Leads → Clinic)]

Inpharmatica’s protein structure focus - uniquely placed to assess both parameters

High Validity and Drugability Requires a Unifying Informatics Framework

BIOPENDIUM AND CHEMATICA

[Figure: Genome Data → Target Structure → Lead Hypotheses. Biopendium: protein target validation and selection. Chematica: drug discovery.]

STRUCTURE-BASED METHODS FIND MANY HOMOLOGUES (AND PUTATIVE TARGETS) NOT DETECTABLE FROM SEQUENCE SIMILARITY

[Figure: % sequence identity scale (100%, 30%, 0) for the test sequence AHHLDRPGHNMCEAGFWQPILL, with regions labelled Standard Approaches and Advanced Approaches]

Biochemical function and drugability defined by 3D structure, not sequence - structure is better conserved


BIOPENDIUM

Inputs - all public (or proprietary) protein data

Proprietary methods

• Genome-Threader

• QBI--Blast

• Reverse Search Maximisation

Massive computation

A 1 million CPU-hour set of calculations employing the most advanced algorithms (1,100-processor farm)

Applied to 600,000 sequences, 14,000 structures + bound ligands

Yields 670 million precalculated protein relationships

Query results in 15 minutes vs. two weeks with traditional bioinformatics

Protein information in an Oracle database: structures, sequences, bound ligands, families, functions
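The speed-up quoted above rests on looking up precalculated relationships instead of running searches at query time. The sketch below illustrates that idea only. It uses Python's built-in sqlite3 in place of Oracle, and the table name, column names, method labels and scores are all hypothetical.

```python
import sqlite3

def build_relationship_db(relationships):
    """Store precalculated (query, hit, method, score) rows in an indexed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relationship ("
        "query_id TEXT, hit_id TEXT, method TEXT, score REAL)"
    )
    # The index is what makes a per-protein lookup cheap at query time.
    conn.execute("CREATE INDEX idx_query ON relationship (query_id)")
    conn.executemany("INSERT INTO relationship VALUES (?, ?, ?, ?)", relationships)
    conn.commit()
    return conn

def lookup_hits(conn, query_id, min_score=0.0):
    """Return the stored hits for one protein, best score first."""
    cur = conn.execute(
        "SELECT hit_id, method, score FROM relationship "
        "WHERE query_id = ? AND score >= ? ORDER BY score DESC",
        (query_id, min_score),
    )
    return cur.fetchall()

# Hypothetical rows; real scores would come from the search methods themselves.
ROWS = [
    ("protA", "protB", "pairwise", 0.91),
    ("protA", "protC", "threading", 0.62),
    ("protB", "protC", "profile", 0.40),
]
conn = build_relationship_db(ROWS)
hits = lookup_hits(conn, "protA", min_score=0.5)
```

All expensive computation happens before `build_relationship_db` is called; a query is then a single indexed SELECT.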

THE INPHARMATICA BIOPENDIUM

[Figure: Biopendium architecture]

Data resources: Genbank, PDB, Prosite, Prints, Enzyme, Swissprot, Ligplot, plus proprietary sequences (ORF prediction) and proprietary structures

Taxonomy; processed PDB to XMAS data; mask sequences

Relational database: links complementary data in the 7 resources

Precalculated data for 600,000 protein sequences (scores and alignments for each hit): pairwise sequence searches, profile-based searches, threading-based approaches

Inpharmatica Workbench: Ligplot ligand interaction editor, Inpharmatica-enhanced RasMol 3D viewer, interactive sequence alignment editor

DRUGABLE TARGET DISCOVERY

Finding a novel brain metalloprotease

BIOPENDIUM: novel brain protein identified

CHEMATICA: drugable site identified

CHEMATICA IS…

Database of putative/known binding sites

• Site identification

• Site mapping

• Fragment mapping

• Pharmacophore generation

• Similarity searching/clustering of sites

• Large-scale virtual screening resource

Gene Family Data Views

• Gene family structures

• Consensus family analysis

Chemical annotation of PDB ‘real’ ligand structures

• Ligand 2-D structures

Site identification - how are sites in a protein structure delineated?

a. A sphere is placed between the VDW surfaces of each atom pair.

b. Any neighbouring atoms penetrating the sphere cause its size to be reduced.

c. Repeat for all possible atom pairs.

d. Generate a surface around the surviving spheres to define the site region.

SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions. Laskowski R A (1995), J. Mol. Graph., 13, 323-330.
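Steps a to c of the gap-sphere procedure can be sketched as follows. This is a minimal illustration, not the SURFNET code itself. The `min_radius` and `max_radius` cut-offs, and the choice to cap large spheres rather than discard them, are assumptions made here. Step d (surface generation) is not covered.

```python
import math
from itertools import combinations

def gap_spheres(atoms, min_radius=1.0, max_radius=4.0):
    """atoms: list of (x, y, z, vdw_radius). Returns a list of (centre, radius).

    min_radius and max_radius are illustrative cut-offs (assumptions).
    """
    spheres = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        d = math.dist(a[:3], b[:3])
        gap = d - a[3] - b[3]              # distance between the two VDW surfaces
        if gap <= 0:
            continue                        # surfaces touch or overlap: no room
        # Step a: centre the sphere midway between the two surfaces.
        t = (a[3] + gap / 2.0) / d          # fractional position along a -> b
        centre = tuple(a[axis] + t * (b[axis] - a[axis]) for axis in range(3))
        radius = min(gap / 2.0, max_radius)
        # Step b: shrink the sphere until no neighbouring atom penetrates it.
        for k, c in enumerate(atoms):
            if k in (i, j):
                continue
            radius = min(radius, math.dist(centre, c[:3]) - c[3])
            if radius < min_radius:
                break                       # too small to keep; stop early
        if radius >= min_radius:
            spheres.append((centre, radius))
    return spheres                          # step c: all atom pairs were visited
```

The double loop over pairs and neighbours is cubic in the number of atoms, so a real implementation would restrict pairs and neighbours to those within a distance cut-off.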

Physical parameters of the clefts

• Volume

• Hydrophobic content

• Polar content

• Surface accessibility

• …

In total - 20 parameters calculated.

8 largest sites are stored together with their physical parameters

Prediction of binding/active sites

Rule driven: use of neural nets (a) on a training set of 100 ligand/protein PDBs

Validation: success rate = 90% on an extended set of 500 PDBs

(a) a backpropagation net, 7-5-1 network
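A 7-5-1 backpropagation network has 7 inputs, 5 hidden units and 1 output. The sketch below shows one such network with sigmoid units and a squared-error update. The activation function, learning rate, weight initialisation and the meaning of the 7 inputs are not given in the slides and are assumptions here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Net751:
    """Feed-forward net: 7 inputs, 5 hidden units, 1 output, sigmoid units."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Each weight row carries a trailing bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(5)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(6)]

    def forward(self, x):
        """Return (hidden activations, output) for a 7-value input."""
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[7])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[5])
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update for squared error on a single example."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(5)]
        for j in range(5):
            self.w_out[j] -= lr * delta_out * hidden[j]
        self.w_out[5] -= lr * delta_out
        for j in range(5):
            for i in range(7):
                self.w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
            self.w_hidden[j][7] -= lr * delta_hidden[j]

def mean_squared_error(net, data):
    """data: list of (inputs, target) pairs."""
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)
```

In use, each candidate site would be described by 7 numeric parameters and labelled 1 if it is the known binding site, 0 otherwise. `train_step` is then called repeatedly over the training set.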

How is the XSITE potential derived?

• 3-D distributions of 20 different atom types about the 20 amino acids are calculated.

• No assumption of energy terms.

X-SITE: use of empirically derived atomic packing preferences to identify favourable interaction regions in the binding sites of proteins. Laskowski R A, Thornton J M, Humblet C & Singh J (1996), Journal of Molecular Biology, 259, 175-201.
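An empirical distribution of this kind can be built by counting where atoms of each type are observed around each residue type. The sketch below is a simplified illustration, not the X-SITE implementation. It assumes the observed coordinates have already been transformed into a residue-fixed reference frame, and the voxel size and atom-type names are hypothetical.

```python
from collections import defaultdict

def build_distributions(contacts, cell=1.0):
    """Accumulate normalised 3-D histograms of observed atom positions.

    contacts: iterable of (residue_type, atom_type, (x, y, z)), with the
    coordinates already expressed in the residue's local frame (assumption).
    Returns {(residue_type, atom_type): {voxel: fraction of observations}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for residue, atom_type, (x, y, z) in contacts:
        voxel = (int(x // cell), int(y // cell), int(z // cell))
        counts[(residue, atom_type)][voxel] += 1
    # Normalise each histogram so that its voxel values sum to 1.
    distributions = {}
    for key, voxels in counts.items():
        total = sum(voxels.values())
        distributions[key] = {v: n / total for v, n in voxels.items()}
    return distributions
```

No energy function appears anywhere: the preferences come purely from how often each atom type is seen at each position.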

Data set used

(1) 521 non-homologous protein chains* from the PDB that satisfy:

• no two sequences share > 20% identity

• resolution < 1.8 Å

• R factor < 0.2

AND

(2) 376 protein-ligand PDB structures for studying additional atom types other than those from peptides and proteins, such as Cl, F.

Note: The PDB has about 14K entries!

*cullpdb_pc20_res1.8_R0.2_d001130_chains521 (R. Dunbrack, Jr.)

U. Hobohm, M. Scharf, R. Schneider, "Selection of representative protein data sets." Protein Science, 1, 409-417 (1993).
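The three selection criteria can be illustrated with a simple greedy filter. This is only a sketch of the idea and not the procedure used to build the culled list cited above. The record layout and the `identity` callback are hypothetical.

```python
def select_chains(chains, identity, max_identity=20.0,
                  max_resolution=1.8, max_r_factor=0.2):
    """Greedy selection of a non-redundant, high-quality chain set.

    chains: list of dicts with "id", "resolution" and "r_factor" keys.
    identity(id_a, id_b) -> percent sequence identity (hypothetical callback).
    """
    # Quality filter: resolution < 1.8 A and R factor < 0.2.
    good = [c for c in chains
            if c["resolution"] < max_resolution and c["r_factor"] < max_r_factor]
    # Prefer the best-resolved structures when two chains are too similar.
    good.sort(key=lambda c: (c["resolution"], c["r_factor"]))
    selected = []
    for chain in good:
        # Keep a chain only if it is <= 20% identical to every chain kept so far.
        if all(identity(chain["id"], kept["id"]) <= max_identity for kept in selected):
            selected.append(chain)
    return selected
```

A greedy pass like this guarantees that no two selected chains exceed the identity threshold, but it does not guarantee the largest possible set.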

Application of XSITE distributions to side-chains making up the calculated protein binding site

Projecting XSITE distributions onto the predicted binding site


How is the pharmacophore generated?

a. Compare the XSITE predictions generated for the different probe atoms on a 3D grid of densities encompassing the region of the binding site.

b. The higher the value at a given grid-point, the higher the likelihood of finding that type of atom at that location.

c. For each probe atom, a “best” map is derived.

d. The net result is a new set of 3D grid maps, one per probe atom, holding only those regions where that atom scored higher than the others.
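The "best map" step amounts to a per-grid-point comparison across probes. A minimal sketch follows, assuming each probe map is a dictionary from grid point to density. The probe names are hypothetical, and dropping ties and zero-density points is a choice made here.

```python
def best_maps(probe_maps):
    """Keep, for each probe, only the grid points where it outscores all others.

    probe_maps: {probe_name: {grid_point: density}}.
    Returns a dict of the same shape holding one "best" map per probe.
    """
    points = set()
    for grid in probe_maps.values():
        points.update(grid)
    best = {probe: {} for probe in probe_maps}
    for point in points:
        scores = {probe: grid.get(point, 0.0) for probe, grid in probe_maps.items()}
        winner = max(scores, key=scores.get)
        top = scores[winner]
        # Strict wins only: ties and zero-density points are dropped (assumption).
        if top > 0.0 and sum(1 for s in scores.values() if s == top) == 1:
            best[winner][point] = top
    return best
```

Each grid point ends up in at most one map, so the maps partition the site into regions labelled by the most favoured atom type.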

What is fragment mapping?

a. In-built database of more than 100 small-molecule fragments: the most common functional groups, representing the common drug-like building blocks used in chemistry.

b. Privileged structures from companies.

Example fragments:

t-butyl, ethyl, tBoc

phenyl, naphthyl, di-phenyl, bi-phenyl

carbonyl, carboxyl, acetic acid, acetamide, methylamine

furan, thiophene, oxazole, thiazole, pyrrole, imidazole, triazine

cyclohexyl, thiazolidine, piperazine, thiadiazole

sulfonyl, sulfonamide, cyano, mercapto, methol

How is fragment mapping done?

• Each atom in a fragment is assigned one of the 20 atom types.

• Each fragment is placed at every grid-point within the binding site and subjected to 300 rotations.

• At each rotation a score is calculated using the appropriate X-SITE predictions for the atom types that the fragment contains.

[Figure: fragment atoms labelled with atom type C.ar]
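The placement loop described above can be sketched as an exhaustive search over grid points and rotations. The slides do not say how the 300 rotations are chosen, so this sketch draws them at random from unit quaternions, which is an assumption. The `density` callback is a hypothetical stand-in for the X-SITE maps.

```python
import math
import random

def random_rotations(n, seed=0):
    """Return n rotation matrices built from random unit quaternions."""
    rng = random.Random(seed)
    matrices = []
    for _ in range(n):
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        norm = math.sqrt(sum(c * c for c in q))
        w, x, y, z = (c / norm for c in q)
        matrices.append((
            (1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)),
            (2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)),
            (2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)),
        ))
    return matrices

def rotate(matrix, vector):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return tuple(sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3))

def map_fragment(fragment, grid_points, density, n_rotations=300):
    """Return (best_score, grid_point, rotation) over all placements.

    fragment: list of (atom_type, (x, y, z)) relative to the fragment centre.
    density(atom_type, position) -> float stands in for the X-SITE maps.
    """
    best = (float("-inf"), None, None)
    rotations = random_rotations(n_rotations)
    for point in grid_points:
        for rot in rotations:
            score = 0.0
            for atom_type, offset in fragment:
                r = rotate(rot, offset)
                position = tuple(point[k] + r[k] for k in range(3))
                # Each atom is scored against the map for its own atom type.
                score += density(atom_type, position)
            if score > best[0]:
                best = (score, point, rot)
    return best
```

The cost scales as grid points × rotations × atoms per fragment, which is why small fragments and a precomputed density grid keep the search tractable.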

CHEMATICA

Gene Family Data Views

• Curated, high-quality annotation and presentation of important ‘drugable’ gene families

• NHRs, kinases, caspases, GPCRs, …

• Contains ligand structure information

• Contains crystal environment classification

• Automatic alerts for newly released structures

• Multiple structure comparison options

CHEMATICA

Consensus Family Analysis

[Figure: binding sites of MMP-1, MMP-8, MMP-13 and MMP-3]

Size and topology of binding sites for MMP-1 & MMP-8 are similar, but detailed interactions differ

Spheres signify negative charge requirement in different areas of the binding pockets

Provides potential for specificity

Validation Study

Two sets of data were taken from the literature:

1) GOLD (Jones, Willett, Glen, Leach and Taylor)

Genetic Optimization for Ligand Docking

(71% success rate in ligand binding mode in 100 PDBs)

our method - 70%

2) SUPERSTAR (Verdonk, Cole and Taylor)

Empirical method for interactions in proteins

(success rate for original 4 probes ~67% in 122 PDBs)

our method - 84%

1. Jones et al. J. Mol. Biol. (1997) 267, 727-748

2. Verdonk et al. J. Mol. Biol. (1999) 289, 1093-1108

Acknowledgements

Inpharmatica

Alex Michie

John Overington

Simon Skidmore

UCL

Roman Laskowski

Adrian Shepherd

Janet Thornton
