CS6104: Computational Structural Biology & Bioinformatics

CS6104: Computational Structural Biology & Bioinformatics

Homology Modeling and Molecular Docking

David BevanDept of Biochemistry

[email protected]

April 6, 2004

Some Components of Molecular Modeling

• Visualization• Molecular mechanics• Quantum mechanics• Molecular dynamics• Homology modeling• Molecular docking

Aims of Structural Genomics

• To determine or predict the 3D structures of all the proteins encoded in the genome

• Up to 40% of the known protein sequenceshave at least one segment related to one or more structures

=> Determine all of the folds=> Use homology modeling to

predict 3D structures

What is Homology?

• Cannot be partial• Assertion of homology is an hypothesis• Hypothesis is usually based on extent of

sequence similarity between proteins, though ultimately similar functions need to be demonstrated

Homology: having a commonevolutionary origin

Some Definitions

• Homologue (Homolog): proteins that are evolutionarily related

• Orthologue (Ortholog): homologues from different organisms

• Paralogue (Paralog): homologues from the same organism

Homology Modeling(Comparative Structure Modeling)

• 3D structures conserved to greater extent than primary structures

• Develop models of protein structure based on structures of homologues

• Using known structure as a “template”, calculate 3D model of a protein for which only know the sequence (the “target”)

• Model includes core, loops, and side chains

Steps in Homology Modeling

Template Selection• Identify protein structures related to

target and select those to be used as templates

• Involves searching a database such as at NCBI (e.g., BLAST at NCBI)

• Involves a certain amount of sequence alignment

Aligning Sequences

• Critical step in homology modeling• Many options to consider• Factors to consider

– Which algorithm to use– Which scoring method to apply– Whether and how to assign gap penalties

Algorithms for Alignment

• Earliest method was that of Needleman and Wunsch (1970)

• Another widely adopted one was by Smith and Waterman (1981)

• More recent are the heuristic methods such as BLAST and FASTA, which are more approximate but much faster

Needleman-Wunsch Algorithm• Global alignment method (i.e., match

sequences along entire length)• Produces an optimal alignment• Steps

– Setting up a matrix– Scoring the matrix– Identifying the optimal alignment

• Can be time-consuming• Not effective for highly divergent proteins

Smith and Waterman Algorithm• Local alignment method• Useful in database searching• Similar to global alignment

– Proteins arranged in a matrix– Optimal path along diagonal is sought

• Differences compared to global alignment– Can start alignment at internal position– Alignment does not have to extend to ends

Heuristic Search Methods• Developed to do rapid searches of

large databases• Not guaranteed to find globally

optimal solution• Rarely miss a significant match• Identify regions of potential interest

and then expand regions to identify alignment

• Include FASTA and BLAST

Scoring Alignments• Need some method of scoring to find optimal

alignment• Four general types of scoring have been applied

– Identity: considers only identical residues– Genetic code: considers the number of base changes in

DNA or RNA to interconvert codons for the amino acids

– Chemical similarity: considers physico-chemical properties

– Observed substitutions: considers substitution frequencies observed in alignments of sequences (*used the most*)

PAM Matrices• Dayhoff mutation data matrix originally developed to

study evolution of proteins• Uses probability of one amino acid mutating to a second

amino acid within a particular evolutionary time• Denoted PAM (Percentage of Acceptable Point Mutations)• One PAM is a unit of evolutionary divergence in which 1%

of the amino acids have been changed• Uses substitution frequencies from alignments of very

similar sequences and extrapolates to more distant relationships

• PAM40 will recognize short alignments of highly similar sequences

• PAM250 will recognize longer, weaker local alignments

BLOSUM Matrices• Matrices based on alignments of more distantly related

sequences• Uses alignments of short regions of related sequences• Sequences clustered into groups (blocks) based on

similarity at some threshold value of percentage identity

• Blocks substitution matrices (BLOSUM) derived based on substitution frequencies

• BLOSUM90 matrix derived using threshold of 90% identity => very similar sequences

• BLOSUM30 matrix derived using threshold of 30% identity => highly divergent sequences

Summary of PAM and BLOSUM Matrices

BLOSUM90

PAM30

BLOSUM 80 BLOSUM62

PAM120 PAM180BLOSUM45

PAM240

Less divergent More divergent

Mouse vs Rat

Mouse vsBacteria

Building the 3D Model• Rigid body assembly

– Rigid bodies from aligned sequences– Core region, loops, and side chains

• Segment matching– Most hexapeptide segments can be

clustered into ~100 structural classes– Use segments from homologs (or non-

homologs) to build up structure

Building the 3D Model (cont.)

• Satisfaction of spatial restraints– Generate restraints from templates– Assume distances and angles between

aligned template and target are similar– Minimize violations of all restraints using

distance geometry or optimization techniques (i.e., force field) to satisfy spatial restraints

Evaluating the 3D ModelProcheck

• Ramachandran plot• Planar peptide bonds• Side chain conformations that

correspond to those in rotamer library

• Hydrogen bonding• No bad atom-atom contacts

Evaluating the 3D Model3D-Profiler

• Based on statistical preferences of each of the 20 amino acids for particular environments within a protein

• Each residue position can be characterized by its environment

• Preferred environments for amino acids defined by three parameters– Area of each residue that is buried– Fraction of side-chain area that is covered by polar

atoms (i.e., O and N)– Local secondary structure

Refining the 3D Model

• MD and energy minimization• Application of restraints based on

experimental data (e.g., NMR, fluorescence)

Threading• Based on concepts of finite number of folds

and conservation of 3D structure• Challenge is to find fold(s) adopted by a given

sequence• Model a given amino acid sequence as each of

several folds from a pre-specified library• Models are evaluated by some criteria to

identify the most likely fold (e.g., pairwise contact “energies”)

⇒ instead of using homologues astemplates, uses library of folds

Applications of the Model

Arabidopsis thaliana Project

• 46 genes for family 1 β-glucosidases– 40 β-O-glucosidases– 6 β-S-glucosidases

• ~ 20 genes for family 35 β-galactosidases

An NSF 2010 Project

Model of Beta-Glucosidase

Red: Crystal structure Blue: Homology model

Crystal vs. Homology Model of β-Glucosidase

RMSD (heavy atoms) 2.53 Å

RMSD (backbone atoms) 1.82 Å

Possible Template Structures• Maize Glu1

– 1E1E: native enzyme– 1E1F: native enzyme in complex with PSG– 1E4L: E191D– 1E4N, 1E55, 1E56: E191D complexes

• Myrosinase– 1E4M: Sinapis alba 1.20 Angstroms– Many others

• Cyanogenic β-glucosidase from white clover - 1CBG• Bacillus polymyxa β-glucosidase

– 1BGA: native enzyme– 1BGG, 1E4I, 1TR1: complexes or mutants

• Bacillus circulans β-glucosidase - 1QOX

Approach to Homology Modeling• Identify and align templates• Align target sequences with templates

– Biology Workbench (CLUSTALW)– MODELLER4 and MODELLER6

• Build models– MODELLER4 and MODELLER6

• Evaluate models– PROCHECK– PROSAII

Evaluation of Models

0.4-10.2Average0 to 1.3-8.3 to -11.4Range

0.2-9.8At5g259800-10.1At1g60090

0.5-10.9At2g444701.0-9.8At1g662800-10.6At1g26560

0.2-11.3At5g44640

Rama - % disallowed

ProsaII z-score

Molecular Docking

• Receptor-ligand• Enzyme-substrate• Protein-DNA• Protein-protein• Receptor-drug

Attempt to predict the structure(s) of the intermolecular complex between two or more molecules.

General Considerations• Molecular representations

– Abstract or atoms– Flexible or fixed

• Juxtaposition of molecules– Interactive or automated– Search algorithm should create an optimum

number of conformations that include experimental binding modes

• Evaluation of complementarity– Scoring function– Force field energy functions

Search Algorithms• Molecular dynamics• Monte Carlo methods• Genetic algorithms• Fragment-based methods• Point complementary methods• Distance geometry methods• Tabu searches• Systematic searches

Docking Algorithms

SPROUTICMGrowMolSurflexSMoGAutoDockMCSSSLIDEGRIDFlexX/FlexELUDIDOCKDe novo DesignVirtual Screening

AutoDock• A suite of automated docking tools• Designed to predict how small

molecules, such as substrates or drug candidates, bind to a receptor

• Consists of three programs– AutoTors: to define torsions in ligand– AutoGrid: to calculate grids– AutoDock: to perform the docking– Also includes GUI called AutoDockTools

AutoDock Grid Maps• Pre-calculated and used as

look-up tables• Place probe atom at each grid

point and calculate energy• Grid for each type of atom

(e.g., C, O, N, H)• Interpolate ligand atom

positions relative to grid points• Energy based on van der

Waals, electrostatics, and hydrogen bonding

Structure-based Drug Design

• Directs discovery of a drug lead, a compound with at least micromolaraffinity for a target

• Involves combination of docking (virtual ligand screening) and experimental assays (high throughput screening)

Docking and HTS Overview

Docking HTS

Virtual Ligand Screening

Retinoid Receptors

• Members of superfamily of nuclear receptors (transcription factors)

• Two families (RARs and RXRs)• Subtypes denoted α, β, γ within each

family

Functions of Receptors

• Normal processes– Morphogenesis– Differentiation– Metabolism– Homeostasis

• Pathologies– Teratogenicity– Endocrine disruption

Natural Ligands for Retinoid Receptors

O-CH3 CH3

CH3

CH3 CH3 O

CH3 CH3

CH3

CH3

CH3

O O-

trans-Retinoic acid cis-Retinoic acid

Docking of trans-Retinoic Acid to RARγ

Blue = crystal structure

Red = re-docked ligand

Docking of RetinoidsDesignation Structure RMSD (Å)t-RA

O-CH3 CH3

CH3

CH3 CH3 O

0.30

c-RA CH3 CH3

CH3

CH3

CH3

O O-

0.37

BMS181156

O

CO

O-0.97

BMS184394-R

O

CO

O-0.29

CD564

O

CO

O-0.23

Kinesins

Eg5

• A kinesin motor protein

• Slides microtubules of developing spindle apart to pushcentrosomes apart

Monastrol• Inhibitor of Eg5• An anti-mitotic agent

Docking to Human Eg5

Docking to Drosophila Eg5

Documents

CS6104: Computational Structural Biology & Bioinformatics