G53BIO – Bioinformatics
http://www.cs.nott.ac.uk/~jqb/G53BIO

Protein Structure Prediction

Dr. Jaume Bacardit – jqb@cs.nott.ac.uk

Some material taken from “Arthur Lesk, Introduction to Bioinformatics, 2nd edition, Oxford University Press, 2005” and “Introduction to Bioinformatics by Anna Tramontano”

Outline

• Introduction and motivation
• Basic concepts of protein structure
• PSP: A family of problems
• Prediction of structural aspects of protein residues
• Prediction of the 3D structure of proteins
• Assessment of PSP quality: CASP
• Summary

Protein Structure: Introduction

• Proteins are molecules of primary importance for the functioning of life
– Structural proteins (collagen, nails, hair, etc.)
– Enzymes
– Transmembrane proteins
• Proteins are polypeptide chains, built by joining amino acids in a linear chain through peptide bonds
• The chain of amino acids, however, folds to create very complex 3D structures
• There is a general consensus that the end state of the folding process depends on the amino acid sequence of the chain


Motivation for PSP

The function of a protein depends greatly on its structure

The structure that a protein adopts is vital to its chemistry

Its structure determines which of its amino acids are exposed to carry out the protein’s function

Its structure also determines what substrates it can react with

However, the structure of a protein is very difficult to determine experimentally, and in some cases almost impossible

Protein Structure Prediction

• That is why we have to predict it
• PSP aims to predict the 3D structure of a protein based on its primary sequence


Impact of PSP

PSP is an open problem. The 3D structure depends on many variables

It has been one of the main holy grails of computational biology for many decades

• The potential impact of having better protein structure models is vast
– Genetic therapy
– Synthesis of drugs for incurable diseases
– Improved crops
– Environmental remediation


Protein Structure


Backbone and side chain

• All amino acids have a common part: the backbone

• Each amino acid type has a different side chain

• The Cα atom connects the backbone and the side chain

• The first carbon atom in the side chain is called Cβ (except for Gly)


Amino Acids


Protein Structure: Introduction

• Different amino acids have different properties

• These properties will affect the protein structure and function

• Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process


Protein Structure: Hierarchical nature of protein structure

MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQREKIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKKHLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYLIKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE

Primary Structure = Sequence of amino acids

[Figure: secondary structure (local interactions) and tertiary structure (global interactions)]

Protein Structure: Hierarchical nature of protein structure

• The amino acid sequence of a protein is called its primary structure or primary sequence
• The folding process of a protein involves several steps
– The protein creates some patterns due to local interactions with the closest residues in the chain. These patterns are called the protein secondary structure
– Afterwards, the secondary structure motifs organise into stable patterns, called tertiary structure
– Finally, proteins can be composed of several subunits or monomers, forming the quaternary structure
• Other, less used, levels of this hierarchy are
– Supersecondary structure (recurrent patterns of interaction between secondary structure elements close in sequence)
– Domains (subunits within a protein with quasi-independent folding stability)

Backbone

• The polypeptide chain of a protein is joined together in a very specific way
• Two dihedral angles (phi and psi) define the torsion of each amino acid in the chain
• Phi is the torsion angle around the N–Cα bond and psi is the torsion angle around the Cα–C bond

http://wiki.cmbi.ru.nl/index.php/Phi-psi_angle
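
As a rough illustration of what a dihedral angle is, the sketch below computes the signed torsion defined by four points in space. For phi the four atoms would be C(i-1), N(i), Cα(i), C(i), and for psi they would be N(i), Cα(i), C(i), N(i+1). The function name and the test coordinates are made up for this example; it is a minimal sketch, not the method of any particular package.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p1, p0)           # first bond
    b1 = sub(p2, p1)           # central bond (the rotation axis)
    b2 = sub(p3, p2)           # last bond
    n1 = cross(b0, b1)         # normal of the plane p0, p1, p2
    n2 = cross(b1, b2)         # normal of the plane p1, p2, p3
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: the last point is rotated by a quarter turn around the z axis.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Applying such a function along the backbone of a protein gives one (phi, psi) pair per residue, which is the information displayed in a Ramachandran plot.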

Protein Structure: Hierarchical nature of protein structure

• There are two main kinds of secondary structure motifs:
– α helices
– β sheets
• Residues that do not fall into these two categories are said to be in coil state

α helix: residues form a helix with 3.6 residues per turn and a pitch of 5.4 Å per turn

β sheet: residues lie flat in adjacent strands. The sheet is called parallel if all strands have the same N-to-C orientation, and antiparallel if adjacent strands have opposite orientations


Protein Structure: Hierarchical nature of protein structure

• Supersecondary structure elements

[Figure: β hairpin and β-α-β unit]

Protein Data Bank

• Proteins whose structure scientists have been able to determine (using X-ray crystallography, NMR, etc.) are stored in the Protein Data Bank (PDB)
• Each protein has a four-character ID code (PDB ID)
• A fifth character (A, B, C, etc.) is used to identify the chain within the protein
• Proteins are stored in a format that is also called PDB
• File for the 1A68 protein
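
To give an idea of what the PDB format looks like, here is a minimal sketch that reads a few fields from one ATOM record. The PDB format is column-based, so fields are extracted by position rather than by splitting on spaces. The sample record below is made up for illustration (it is not taken from 1A68), and `parse_atom_line` is a hypothetical helper, not part of any library.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM record."""
    return {
        "atom": line[12:16].strip(),      # columns 13-16: atom name
        "residue": line[17:20].strip(),   # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "xyz": (float(line[30:38]),       # columns 31-38: x coordinate
                float(line[38:46]),       # columns 39-46: y coordinate
                float(line[46:54])),      # columns 47-54: z coordinate
    }

# Made-up record, built with a format string so that the columns line up.
sample = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " CA ", "MET", "A", 1, 27.340, 24.430, 2.614)

record = parse_atom_line(sample)
```

In practice an established parser (for example the one in Biopython) is a safer choice, since real files contain many other record types and special cases.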

Protein Structure: Ramachandran plots

• We saw that the backbone of a residue is characterised by two angles: psi and phi
• Can they take any value?
• Fortunately not
• This effect was studied long ago by G. N. Ramachandran
• He proposed a diagram to visualise these angles (phi on the X axis, psi on the Y axis) for amino acid residues
• Different types of secondary structure are clustered in different regions of the diagram


Protein Structure: Ramachandran plots

• In real proteins, these plots are not so clear

• You can create the Ramachandran plot for any protein in PDB at http://www.fos.su.se/~pdbdna/input_Raman.html

• At the right there is the plot for a set of 80 proteins

Protein Structure: Classifications of protein structure

• Several tertiary structure classification methods exist, for instance SCOP, CATH, and FSSP/DDD
• No method is perfect, hence www.procksi.org was proposed
• SCOP is the most widespread of them
• SCOP = Structural Classification Of Proteins http://scop.mrc-lmb.cam.ac.uk/scop/
• In its 1.73 release (November 2007) it catalogs 34494 proteins with known structure (that is, entries in the PDB archive)
• It uses a hierarchical system to catalog the proteins, according to evolutionary origin and structural similarity
• The levels of the hierarchy are: class, fold, superfamily, family, protein and species

Protein Structure: Classifications of protein structure

• Main classes of SCOP (first level of hierarchy)
1. All α proteins – proteins that have (almost) only α helices
2. All β proteins – proteins that have (almost) only β sheets
3. α+β proteins – proteins that have both α helices and (mostly) antiparallel strands, but segregated in different parts of the protein
4. α/β proteins – proteins that have both α helices and (mostly) parallel strands, typically forming β-α-β units
5. Multidomain proteins – proteins having two or more domains belonging to different classes
6. Membrane and cell surface proteins
7. Small proteins (metal ligands, heme and proteins with disulfide bridges)
8. Coiled coil proteins
9. Low resolution protein structures
10. Peptides
11. Designed proteins

Protein Structure: Classifications of protein structure

• SCOP classification of Flavodoxin from Clostridium beijerinckii
– Class: α/β
– Fold: Flavodoxin-like: 3 layers, α/β/α; parallel β-sheet of 5 strands
– Superfamily: Flavoproteins
– Family: Flavodoxin-related; binds FMN
– Protein: Flavodoxin
– Species: Clostridium beijerinckii

PDB ID: 5ULL

Prediction types of PSP

• There are several kinds of prediction problems within the scope of PSP
– The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence
– There are many structural properties of individual residues within a protein that can be predicted, for instance:
• The secondary structure state of the residue
• Whether a residue is buried in the core of the protein or exposed on the surface
– Accurate predictions of these sub-problems can simplify the general 3D PSP problem


Prediction types of PSP

• There is an important distinction between the two classes of prediction

• The 3D PSP is generally treated as an optimisation problem

• The prediction of structural aspects of protein residues is generally treated as a machine learning problem

Optimisation

• Given a problem for which you have a way of assessing how good each possible solution is
– An evaluation function
• Optimisation is the process of finding the best possible solution
• Dynamic programming (as seen for sequence alignment) is an optimisation method
• Genetic Algorithms are another example of optimisation
• The key difference between them is how they explore the space of candidate solutions


Machine Learning

• Machine learning: How to construct programs that automatically learn from experience [Mitchell 1997]

• ML is a Computer Science discipline, part of the Artificial Intelligence field

• Its goal is to automatically construct a description of some phenomenon from data extracted from previous observations of it, so that the phenomenon can be predicted in the future


Flow of data in machine learning

• Specifically, we are concerned with supervised learning, that is, when we know the solution for the training data

[Figure: Training Set → Learning Method → Theory; Unknown instance → Theory → Class]


Types of machine learning

• Rule learning

[Figure: unit square with axes X and Y (0 to 1) and the rule set below]

If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then
If (X>0.75 and Y>0.75) then
If (X<0.25 and Y<0.25) then
Everything else

Other machine learning techniques

• Other methods that have also been used in PSP are
– Artificial Neural Networks
– Support Vector Machines
– Hidden Markov Models
• If you are interested in the technology side of PSP, a good book is “Bioinformatics: The Machine Learning Approach” by Baldi and Brunak

Prediction of structural aspects of protein residues

• Many of these features are due to local interactions of an amino acid and its immediate neighbours
– Can they be predicted using information from the closest neighbours in the chain?
– In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1, that is, a window of ±1 residues around the target

[Figure: a chain of residues Ri-5 … Ri+5, each with its secondary structure state SSi-5 … SSi+5, and the windows generated from it]

Ri-1 Ri Ri+1 → SSi
Ri Ri+1 Ri+2 → SSi+1
Ri+1 Ri+2 Ri+3 → SSi+2
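
The window generation above can be sketched in a few lines. This is only an illustration: the padding symbol "X" for positions beyond the ends of the chain, the function name and the toy secondary structure labels are assumptions, not something specified in these slides.

```python
def windows(sequence, half_width=1, pad="X"):
    """Return one window of residues per position in the sequence."""
    # Pad both ends so that the first and last residues also get a full window.
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]

# Toy example: a five-residue fragment with made-up SS labels (C = coil, H = helix).
residues = "MKYNN"
ss_states = "CCHHH"
instances = list(zip(windows(residues), ss_states))
```

Each element of `instances` pairs the window around residue i with the state SSi, which is the shape of training example a supervised learning method expects.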

What information do we include for each residue?

– Early prediction methods used just the primary sequence: the AA types of the residues in the window
– However, the primary sequence contains a limited amount of information
• It does not contain any evolutionary information: it does not say which residues are conserved and which are not
– Where can we obtain this information?
• Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment

Position-Specific Scoring Matrices (PSSM)

– For each residue in the query sequence, compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query)
– These distributions will tell us which mutations are likely and which mutations are less likely for each residue in the query sequence
– In essence, it’s similar to a substitution matrix, but tailored for the sequence that we are aligning
– A PSSM profile will also tell us which residues are more conserved and which residues are more subject to insertions or deletions
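
A minimal sketch of the idea, assuming a toy alignment, a uniform background distribution and a simple pseudocount. Real tools such as PSI-BLAST use sequence weighting and empirical background frequencies, so the numbers produced here are not comparable to a real PSSM; the function name and parameters are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def column_log_odds(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        # Count only the 20 standard amino acids, so gap symbols are skipped.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

# Made-up alignment of three short sequences (rows are already aligned).
toy_alignment = ["MKV", "MRV", "MKI"]
profile = column_log_odds(toy_alignment)
```

In the first column every sequence has M, so M gets the highest score there; in the second column K scores above R, which in turn scores above amino acids that were never observed.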

PSSM for the first 10 residues of 1n7lA

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3


Secondary Structure Prediction

– The most usual approach is to predict whether a residue belongs to an α helix, a β sheet or is in coil state

– Several programs can determine the actual SS state of a protein from a PDB file. The most common of them is DSSP

– Typically, a window of ±7 amino acids (15 in total) is used

Secondary Structure Prediction

[Figure: prediction pipeline. Primary sequence (R1, R2, R3, …, Rn-1, Rn) → MSA → PSSM profile of the sequence (PSSM1, PSSM2, PSSM3, …, PSSMn-1, PSSMn) → windows generation → window of PSSM profiles (PSSMi-1, PSSMi, PSSMi+1) → prediction method → SSi?]

• The most popular public SS predictor is PSIPRED

Coordination Number Prediction

Two residues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8 Å)

CN of a residue: the number of contacts that the residue has

CN gives us a simplified profile of the packing density of the protein

[Figure: contacts between residues, shown on the primary sequence and in the native state]

Other predictions

• Other kinds of residue structural aspects that can be predicted
– Solvent accessibility: amount of surface of each residue that is exposed to solvent
– Recursive Convex Hull: a metric that models a protein as an onion and assigns each residue to a layer. Formally, each layer is a convex hull of points
• These features (and others) are predicted in a similar way as done for SS or CN

Contact Map prediction

• Prediction, given two residues from a chain, of whether these two residues are in contact or not
• This problem can be represented by a binary matrix: 1 = contact, 0 = non-contact
• Plotting this matrix reveals many characteristics of the protein structure

[Figure: contact map with the patterns produced by helices and sheets labelled]
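
When the structure is known, the contact map and the coordination numbers defined earlier can both be computed directly from the coordinates. The sketch below assumes one representative point per residue (for instance its Cα or Cβ atom, a choice these slides do not fix) and uses made-up coordinates. Some definitions also ignore residues that are close neighbours in the chain; that is not done here.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 if two different residues are closer than the threshold (in Å)."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]

def coordination_numbers(cmap):
    """CN of each residue: the number of contacts in its row of the contact map."""
    return [sum(row) for row in cmap]

# Made-up coordinates: four residues placed on a line, 5 Å apart.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
cmap = contact_map(toy_coords)
cn = coordination_numbers(cmap)
```

With these toy coordinates only consecutive residues are in contact, so the two residues in the middle have a CN of 2 and the two at the ends have a CN of 1.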


Contact Map Prediction

• Instead of a single window around the target, there are now two windows, one around each of the two residues for which a contact is being predicted

• Many methods also use a third window, placed in the middle point in the chain between the two target residues

Contact Map prediction at Nottingham

• For each position in these 3 windows we include:
– PSSM profile
– Predicted SS, SA, RCH and CN
• The whole connecting segment between the two targets is represented as
– Distribution of AA and predicted SS, SA, RCH and CN

Contact Map prediction at Nottingham

• Moreover, global protein information is also included
– Sequence length
– Separation between target residues
– Contact propensity of target residues
– Distribution of AA and predicted SS, SA, RCH and CN of the whole chain
• Each instance is represented by 631 variables


Contact Map prediction at Nottingham

• Training set of 2413 proteins selected to represent a broad set of sequences

• 32 million pairs of amino-acids (instances in the training set) with less than 2% of real contacts

• Each instance is characterized by up to 631 attributes

• 50 samples of ~660000 examples are generated from the training set. Each sample contains two no-contact instances for each contact instance

• The BioHEL GBML method (Bacardit et al., 2009) was run 25 times on each sample

• An ensemble of 1250 rule sets (50 samples x 25 seeds) performs the contact maps predictions using simple consensus voting

• Confidence is computed based on the votes distribution in the ensemble

[Figure: Training set → Samples (x50) → Rule sets (x25) → Consensus → Predictions]

(Bacardit et al., Bioinformatics (2012) 28 (19): 2441-2448)

3D Protein Structure Prediction

• Approaches for 3D PSP
• Template-Based Modelling
• Ab-Initio methods
• State-of-the-Art methods
– I-Tasser
– Rosetta


Approaches for 3D PSP

• Some PSP methods try to identify a template protein and then adapt the structure of the template to the target protein (Template-based Modelling)

• Other methods try to generate the structure of the protein from scratch (Ab Initio Modelling), optimising some energy function that models the stability of the protein. They are used when no template can be identified

Pipeline for Template-based Modelling

• Typical steps
1. Identify the template (next slide)
2. Produce the final alignment between the residues of target and template
3. Determine main chain segments to represent the regions containing insertions and deletions (gaps in the alignment) and stitch them into the main chain of the template to create an initial model for the target
4. Replace the side chains of residues that have been mutated (mismatches in the alignment), although it is possible that the conformation in the template is still conserved
5. Examine the model to detect any serious atom collisions and relieve them
6. Refine the model by energy minimization. This stage is meant to adapt the stitched segments to the conserved structure and to adjust the side chains to find the most stable conformation


Loop remodelling

Template identification

• Can we find a sequence with known structure and high sequence identity with the target?
– Homology Modelling
• Otherwise, there may still be a template (a structure similar to that of the target) that has poor sequence identity. We need to identify it by other means
– Fold recognition
• Profile-based methods
• Threading methods


Profile-based Methods

• Aim is to construct 1D representations (profiles) of the structures in our fold database

• Afterwards, when a target sequence comes, we construct its profile and check our database for the most similar profile

• That is, instead of aligning amino acid sequences, we align structural 1D profiles

How to construct the profile?

• We choose a series of structural properties of residues
– Most frequent secondary structure state
• Alpha helix, Beta sheet, other
– Solvent Accessibility
• < 40 Å², > 100 Å², intermediate
– Hydrophobic/polar
• For each amino acid, we decide to which category it belongs based on statistics computed on a large database of structures

How to construct the profile?

• Now the sequence for each protein in our database will have a new structural representation
• We need to predict SS and Acc for the template

                Alpha helix        Beta sheet         Other
< 40 Å²         Hydrophobic: a     Hydrophobic: b     Hydrophobic: c
                Polar: d           Polar: e           Polar: f
> 100 Å²        Hydrophobic: g     Hydrophobic: h     Hydrophobic: i
                Polar: j           Polar: k           Polar: l
intermediate    Hydrophobic: m     Hydrophobic: n     Hydrophobic: o
                Polar: p           Polar: q           Polar: r
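
The table above amounts to a lookup from three properties of a residue to one letter of an 18-letter structural alphabet. A minimal sketch of that lookup follows; the function names and the handling of values exactly equal to 40 or 100 Å² (treated as intermediate) are assumptions made for the example.

```python
# Letters per (accessibility class, polarity), ordered as helix, sheet, other.
LETTERS = {
    ("<40", "hydrophobic"): "abc",
    ("<40", "polar"): "def",
    (">100", "hydrophobic"): "ghi",
    (">100", "polar"): "jkl",
    ("intermediate", "hydrophobic"): "mno",
    ("intermediate", "polar"): "pqr",
}
SS_INDEX = {"helix": 0, "sheet": 1, "other": 2}

def accessibility_class(area):
    """Map a solvent accessible area in Å² to one of the three classes of the table."""
    if area < 40:
        return "<40"
    if area > 100:
        return ">100"
    return "intermediate"

def encode(ss, area, polarity):
    """Return the structural letter for one residue."""
    return LETTERS[(accessibility_class(area), polarity)][SS_INDEX[ss]]

# Made-up residues: (secondary structure, accessible area, polarity).
toy_residues = [("helix", 10.0, "hydrophobic"), ("sheet", 150.0, "polar"), ("other", 70.0, "polar")]
profile_string = "".join(encode(*r) for r in toy_residues)
```

Once every protein is written as a string over this alphabet, the profiles can be compared with the same alignment machinery used for amino acid sequences.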

Threading methods

• We start by compiling a catalogue of unique folds (filtering out repeats)
• Afterwards, we evaluate how likely it is that the target sequence adopts each of the folds, and how (alignment)
• The name is a metaphor taken from tailoring, as we are trying to fit the sequence (a thread) through a known structure
• We will choose the template (and alignment) that has the lowest (estimated) energy

Threading methods

• Energy estimation needs to be simple and fast
– As we need to evaluate all possible folds and alignments
• Energy results from all the pair-wise interactions occurring in a protein
• Thus, the energy estimation will be computed as the sum of the energy terms for every pair of residues in the protein
• How to compute the interaction energy for a given pair of amino acids?

Pair-wise Energy estimation

• Boltzmann’s equation states that the probability of observing a given event depends on its energy
– P(x) = e^(-E(x)/KT)
• If we invert this equation we get:
– E(x) = -KT ln[ P(x) ]
• We can compute P(x) for each pair of amino acids from a database of known structures, as the frequency with which these amino acids are observed to be in contact
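
A small numerical sketch of the inverted Boltzmann relation and of how the pairwise terms add up. The contact frequencies below are invented for illustration, energies are expressed in units of KT, and summing only over the pairs that are in contact in the template is a simplification assumed here.

```python
import math

def pair_energy(p_contact, kt=1.0):
    """E(x) = -KT ln[P(x)], with kt=1.0 meaning energies in units of KT."""
    return -kt * math.log(p_contact)

def threading_energy(sequence, contacts, contact_freq, kt=1.0):
    """Sum of pair energies over the residue pairs that are in contact."""
    total = 0.0
    for i, j in contacts:
        pair = frozenset((sequence[i], sequence[j]))  # unordered pair of residue types
        total += pair_energy(contact_freq[pair], kt)
    return total

# Invented contact frequencies for two pairs of amino acid types.
toy_freq = {frozenset("LV"): 0.30, frozenset("KV"): 0.05}
# Toy target fragment threaded onto a template where residues 0-1 and 1-2 are in contact.
energy = threading_energy("LVK", [(0, 1), (1, 2)], toy_freq)
```

Pairs that are often seen in contact receive a low energy, so an alignment that places such pairs at contacting positions of the template scores better.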

Alignment within threading

• We still need to solve the problem of the correspondence of the residues in our template with those of the target
• This is a very difficult problem, as a change in an alignment can have an impact on the interactions with many residues
• There is an exact (but costly) solution
• Instead, most methods adopt an approximate method called frozen approximation
• When evaluating the possibility of assigning one of the amino acids of the target to a certain position in the template, instead of computing the interactions with the rest of the target residues, we will use those of the template

Frozen Approximation

Aligning target and template

• Crucial step before generating the initial model
• It is possible, especially for homology modelling, that the best sequence alignment does not correspond to the best structural alignment
– That is, finding the best correspondence between the coordinates of each amino acid of target and template
• In this case, a better alignment process needs to be performed. To do so, we can use
– Information derived from the template’s structure
– Information predicted for the target

Aligning Target and Template

[Figure: two alignments. Wrong alignment: some atoms are too close (big circle) and some atoms are too far (small circle). Correct alignment after shifting]

The poor man’s approach to homology modelling

– To find templates
• PSI-BLAST
• 3D Jury. This program is a meta-server, that is, it asks many other servers what templates they would choose and then produces a consensus decision based on the answers of the servers
– To produce a model of a protein given a template
• MODELLER. Very popular homology modelling package. Free for academic use
– To refine the side-chain conformations
• SCWRL

Ab-Initio modelling

• In general, this kind of modelling is still quite primitive when compared to homology modelling
• However, without a template it is the only choice
• Pure ab-initio modelling is still very costly and ineffective, but hybrid homology/ab-initio methods such as fragment assembly have better performance

Ab-Initio modelling

• The most advanced ab-initio method is fragment assembly
– It consists of breaking up the sequence into small subsegments of 3 to 9 residues and generating structures for these segments based on a large library of known fragments
– Decoys are generated from all possible combinations of fragments
– An energy minimization process is applied to all decoys
– Decoys are clustered and the final models are selected from the centers of the largest clusters


Energy minimisation

Energy minimization is not easy. We may need to go uphill before we can reach the lowest energy conformation
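
One common way of allowing the search to go uphill is the Metropolis acceptance rule used in Monte Carlo and simulated annealing methods. The slides do not say which scheme any particular PSP method uses, so the sketch below is only an illustration of the principle, on an invented one-dimensional energy function with two wells; all names are hypothetical.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def minimise(energy, start, propose, steps=2000, temperature=0.5, seed=0):
    """Random search that sometimes accepts worse states; returns the best state seen."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        candidate = propose(current, rng)
        if metropolis_accept(energy(candidate) - energy(current), temperature, rng):
            current = candidate
        if energy(current) < energy(best):
            best = current
    return best

# Invented energy landscape: two wells separated by a barrier.
def toy_energy(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

def toy_move(x, rng):
    return x + rng.uniform(-0.2, 0.2)

best = minimise(toy_energy, start=1.0, propose=toy_move)
```

A purely greedy search started in the shallower well could never cross the barrier; accepting occasional uphill moves gives the search a chance to reach the deeper well, at the cost of more evaluations.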

Energy functions for ab-initio methods

• The energy function needs to take into account the interactions of all atoms of all amino acids
• Many different types of energy sources
– Covalent bonds
– Angles and torsions of bonds between atoms
– Van der Waals interactions (repulsion/attraction)
– Energy of charged atoms
– Interactions with solvent
– Hydrogen bonds
• Exact formulas are very costly, so generally PSP methods use knowledge-based potentials, computed from a large database of structures

I-Tasser

• Prediction method from Zhang’s group
• Fully automated server, without any human intervention
• Steps
– Template identification
– Structure assembly
– Atomic model construction
– Model selection

I-Tasser: Template Identification

• MUSTER fold recognition method, used both for whole proteins (TBM) and for fragments (ab initio)
• Profile-based fold recognition
– Secondary structure
– Structural fragment profile
– Solvent accessibility
– Backbone torsion angle
– Hydrophobicity
• For the most difficult targets, a meta-server that combines the outputs of various methods is used

I-Tasser: Structure assembly

• Generation of a preliminary model with only coordinates for Cα and sidechain positions
• Using the template as a starting point where possible and ab-initio methods for amino acids without alignment
• Two iterations of refinement
– 1st based on templates
– 2nd based on clustering the models of the previous iteration and using the centroids of each cluster as starting points

I-Tasser energy function

• Knowledge-based statistics of
– Cα – sidechain correlation
– H-bonds
– Hydrophobicity
• Spatial restraints of templates
• Contact Map prediction from SVMSEQ
– 9 predictions included, combinations of
– Contacts between Cα, Cβ or side chain centers
– Contact cut-offs of 6, 7 or 8 Å
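To make the role of the cut-off concrete, the sketch below derives a contact map from known coordinates. It shows how a contact is defined, not how SVMSEQ predicts one. The function name, the `min_separation` parameter and the use of "less than or equal to" for the threshold are assumptions.

```python
import math

def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return the set of residue pairs (i, j), i < j, whose
    representative atoms (e.g. Cα) lie within `cutoff` Å."""
    n = len(coords)
    contacts = set()
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts
```

Three choices of representative atom combined with three cut-offs give the nine contact definitions listed above.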

I-Tasser atomic model construction

• Full-atom models are constructed from the approximate models produced by the cluster centroids

• 1st the backbone is matched with a large library of template fragments with high resolution structure

• Then full-atom optimization is performed, focusing on H-bonds, removing clashes and using the CHARMM22 molecular dynamics force field

I-Tasser model selection

• Several full-atom models are generated from each cluster centroid

• Models need to be ranked to select the best one

• I-Tasser uses a weighted sum of
– Number of H-Bonds / target length
– TM-score (metric to compare structures) between the full-atom model and the cluster centroid
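The ranking step can be sketched as follows. The weights `w_hbond` and `w_tm` and the dictionary keys are hypothetical placeholders; the slides do not give the values that I-Tasser actually uses.

```python
def selection_score(n_hbonds, target_length, tm_score,
                    w_hbond=1.0, w_tm=1.0):
    """Weighted sum of H-bond density and TM-score to the centroid.
    Assumption: the weights are illustrative, not I-Tasser's."""
    return w_hbond * n_hbonds / target_length + w_tm * tm_score

def rank_models(models, target_length, w_hbond=1.0, w_tm=1.0):
    """Sort models (dicts with 'n_hbonds' and 'tm_score') best first."""
    return sorted(
        models,
        key=lambda m: selection_score(m["n_hbonds"], target_length,
                                      m["tm_score"], w_hbond, w_tm),
        reverse=True,
    )
```

Dividing the H-bond count by the target length keeps that term comparable across proteins of different sizes.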

Rosetta

• Predictor from David Baker’s group
• It uses a massive distributed computing infrastructure (Rosetta@home)
• For CASP7 in 2006 it claimed to dedicate up to 104 cpu years/target
• Template identification used a variety of methods depending on sequence identity between target and template
• Different protocols for Template-Based Modelling and Free Modelling (fragment assembly)
• 3 variants of TBM depending on degree of homology between target and template

Rosetta

• Full-atom refinement protocol
– Energy function based on
• Short-range interactions: Van der Waals energy, H-bonds and solvent accessibility
• Long-range interactions (dampening of electrostatic interactions)
– Minimization through Monte Carlo with the following steps:
• Perturbation of a randomly selected angle from the backbone
• Optimisation of side-chain rotamer conformations
• Optimisation of both backbone and sidechain torsion angles

PSP and CASP

• PSP has improved through the years. This improvement has been assessed mainly in CASP
• CASP = Critical Assessment of Techniques for Protein Structure Prediction
• It is a biennial community exercise to evaluate the state of the art in PSP
• Every day for about three months the organizers release some protein sequences for which nobody knows the structure (128 sequences were released in CASP8 in 2008)
• Each prediction group is given three weeks to return their predictions. 24 hours are given to automated servers
• Then at the end of the year experts meet in a place close to the sea to discuss the results of the experiment

CASP categories

• Several categories of experiments are assessed in CASP
– Template-Based Modeling (Homology and fold recognition)
– Free Modeling (no template, i.e. ab initio)
– Contact Map prediction
– Functional sites prediction
– Domain prediction
– Disordered regions
– Quality assessment
• Categories have changed through time
– SS prediction is not assessed anymore after CASP4
– Homology modeling and fold recognition merged into TBM

Progress through CASP

1. Computers help structure prediction: no more paper models

2. Knowledge-based potentials work better.

3. Local “threading” and fragment assembly (Baker)

4. Averaging and consensus methods work: meta-servers (Ginalski-Rychlewski)

5. Sequence profile methods are as powerful as (or more powerful than) threading (Söding)

6. Jamming poorly similar templates together helps (Skolnick-Zhang)

(From Nick Grishin’s Humans vs Servers presentation in CASP8)

Assessment of 3D PSP

• How can we quantify how good a model is?
• That is, how similar is a model structure to the actual (native) one?
• We will see this in depth when we cover the protein structure comparison topic, later in the module
• Now we are just going to describe the most popular metric, GDT-TS

GDT-TS

• Global Distance Test – Total Score
• This measure tries to produce a balance between good local and global similarity of structures (unlike RMSD)
• If a measure only takes a global point of view, good models that only fail badly in a few amino acids could be discarded

GDT-TS steps

1. All segments of 3, 5 and 7 consecutive amino acids from the model are superimposed onto the actual structure
2. Each of them is iteratively extended while it remains good enough
3. Good enough = the distance between all residue pairs (represented by their Cα atoms) is less than a certain threshold
4. A final superposition includes the set of segments covering as many residues as possible
5. Segments do not need to be contiguous

GDT-TS metric

• The process of superposition is performed four times, using thresholds of 1, 2, 4 and 8 Å

• The reason for including 4 different thresholds is to have a metric which is good both for high accuracy models and for approximate models
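The final score is the average, over the four thresholds, of the percentage of residues whose Cα atom lies within the threshold: GDT-TS = (P1 + P2 + P4 + P8) / 4. The sketch below is a simplification that assumes the model is already superimposed on the native structure with a single fixed superposition. The real procedure searches for the best superposition separately for each threshold, so this version can underestimate the true score.

```python
import math

def gdt_ts(model_ca, native_ca, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS on a 0-100 scale.
    Assumption: model_ca and native_ca are equal-length lists of
    Cα coordinates that are already superimposed."""
    distances = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    fractions = [
        sum(d <= t for d in distances) / len(distances)
        for t in thresholds
    ]
    return 100.0 * sum(fractions) / len(thresholds)
```

For example, if four residues deviate by 0.5, 1.5, 3.0 and 10.0 Å, the fractions are 1/4, 2/4, 3/4 and 3/4, giving a score of 56.25. A single badly placed residue lowers the score only slightly, whereas it would dominate an RMSD.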

GDT-HA

• HA = High Accuracy
• Set of thresholds in GDT-TS changed to 0.5, 1, 2 and 4 Å
• For high-accuracy models GDT provides only a crude approximation (backbone), so other measures are taken into account
– H-bonds
– Position and rotation of sidechains
– Clashes of atoms

Contact Map prediction in CASP

Contact Map is assessed using the targets in the Free Modelling category

Also, only long-range contacts (with a minimum chain separation of 24 residues) are evaluated

Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction

The assessors then rank the predictions for each protein and take a look at the top L/x ones, where L is the length of the protein and x={5,10}

Contact Map prediction in CASP

From these L/x top-ranked contacts two measures are computed
– Accuracy: TP/(TP+FP)
– Xd: difference between the distribution of predicted distances and a random distribution
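The accuracy measure can be sketched as below. The input format (tuples of residue indices plus a confidence value) and the function name are assumptions for illustration. The official CASP submission format and evaluation scripts differ in detail.

```python
def top_contact_accuracy(predictions, native_contacts, length,
                         x=5, min_separation=24):
    """Accuracy TP / (TP + FP) over the top L/x long-range predictions.
    predictions: iterable of (i, j, confidence) tuples.
    native_contacts: set of (i, j) pairs with i < j."""
    long_range = [p for p in predictions
                  if abs(p[0] - p[1]) >= min_separation]
    long_range.sort(key=lambda p: p[2], reverse=True)
    top = long_range[:max(1, length // x)]
    if not top:
        return 0.0  # no long-range predictions were submitted
    true_positives = sum(
        (min(i, j), max(i, j)) in native_contacts for i, j, _ in top
    )
    return true_positives / len(top)
```

Because only the top-ranked predictions are scored, a group is rewarded for assigning high confidence to its correct contacts, not for submitting many contacts.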

CASP9 results

These two groups derived contact predictions from 3D models

http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf

Other CASP prediction categories

• Functional sites prediction
– Predicting which residues of a given sequence are those that perform the chemistry of the protein
– Or those that bind to other proteins/compounds
– Methods can use whatever information they can infer to perform this prediction
– However, most predictions can be performed simply by homology
• Domain prediction
– Domains = quasi-independent subsets of a protein that fold on their own
– Their prediction follows a simple divide-and-conquer motivation
– It is much easier to create separate models for the different domains of a protein

Disordered regions prediction

• Regions of a protein that do not fold into a unique pattern (no coordinates in the PDB file)

• 75% of mammalian signaling proteins are estimated to contain long (>30 residues) disordered regions, and 25% of all proteins may be fully disordered

• Thus, it is useful to predict from the sequence if that is the case

Disordered protein 2K5K

Quality assessment prediction

• Given a model, can we predict how good it is (without comparing it to the native structure)?
• Overall and per-residue model quality
• Prediction was done based on the models from the server category
• Two families of methods
– Those that perform predictions for individual models
– Those that take a set of models and give predictions based on consensus agreements
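The consensus family can be illustrated with a minimal sketch: each model is scored by its average similarity to all the other models in the set. The `similarity` argument is a hypothetical placeholder for any pairwise structure comparison (for instance a GDT-TS-like score). Actual consensus methods are considerably more elaborate.

```python
def consensus_quality(models, similarity):
    """Score each model by its mean similarity to the other models.
    Assumption: `similarity(a, b)` returns higher values for more
    similar structures; at least two models are required."""
    scores = []
    for i, model in enumerate(models):
        others = [similarity(model, other)
                  for j, other in enumerate(models) if j != i]
        scores.append(sum(others) / len(others))
    return scores
```

The underlying assumption is that structural features reproduced by many independent predictors are more likely to be correct, so outlier models receive low scores.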

Summary of topic

• Importance of PSP
• Many different types of prediction included in the PSP family
– 3D PSP
– Prediction of amino acid structural features
– Others
• Families of 3D PSP
– Template-based Modelling
– Free modelling