Swiss Institute of Bioinformatics
Protein Structure Bioinformatics: Introduction
Secondary Structure & Protein Disorder Prediction
EMBnet course Lausanne, February 21, 2007
Lorenza Bordoli
Overview
Introduction
Secondary Structure Prediction
Solvent Accessibility Prediction
Disorder Prediction
Principles of protein structure
Primary Structure
Secondary Structure
Tertiary Structure (Fold)
Quaternary Structure
Principles of protein structure
Protein structures include:
Core region:
Secondary structure elements packed in close proximity
in a hydrophobic environment
Limited amino acid substitution
Outside the core:
loops and structural elements in contact with water,
membrane or other proteins
Amino acid substitution: not as restricted as above
Protein Structures:
Solvent Accessibility
• Buried
• Solvent exposed
Primary Structure
Secondary Structure
Tertiary Structure (Fold)
Quaternary Structure
Secondary Structures:
• α Helix
• β Sheet
Secondary structure assignment
DSSP
Dictionary of Protein Secondary Structure (Kabsch
& Sander, 1983)
Based on recognition of hydrogen-bonding patterns in
known structures
Automated assignment of secondary structure
Interprets backbone hydrogen bonds
Uses a Coulomb approximation for the hydrogen bond
energy (-0.5 kcal/mol cut-off)
Secondary structures are assigned to consecutive
segments of residues with hydrogen bonds
Secondary structure assignment
DSSP secondary structure elements: 8 secondary structure classes, reduced to 3 states
– H (α-helix) → H
– G (3₁₀-helix) → H
– I (π-helix) → H
– E (extended strand) → E
– B (residue in isolated β-bridge) → E
– T (turn) → L
– S (bend) → L
– " " (blank = other) → L
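In code, this 8-to-3 reduction is a simple lookup table. A minimal Python sketch (the table follows the mapping above; the function and constant names are our own):

```python
# Reduction of the 8 DSSP classes to the 3-state model (H/E/L),
# following the mapping listed above.
DSSP_TO_3STATE = {
    "H": "H",  # alpha-helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi-helix
    "E": "E",  # extended strand
    "B": "E",  # residue in isolated beta-bridge
    "T": "L",  # turn
    "S": "L",  # bend
    " ": "L",  # blank = other
}

def reduce_dssp(dssp_string):
    """Map a per-residue DSSP string onto the 3-state alphabet."""
    return "".join(DSSP_TO_3STATE.get(c, "L") for c in dssp_string)

print(reduce_dssp("HHHGGT EEB"))  # -> "HHHHHLLEEE"
```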
How many structures do we know?
http://www.wwpdb.org/
How many structures do we know?
PDB: http://www.pdb.org
X-ray and NMR atom coordinates of proteins are
deposited in the PDB: the worldwide repository for 3-D
biological macromolecular structure data.
EBI-MSD: http://www.ebi.ac.uk/msd/ (2003)
suite of web-based search and retrieval interfaces for
macromolecular structure research.
How many structures do we know?
[ PDB: http://www.pdb.org ]
Growth of the Protein Data Bank (PDB)
[Plot: total and yearly entries, 1986-2006, on a logarithmic scale from 100 to 10,000,000; TrEMBL and Swiss-Prot sequence counts grow far faster than the PDB. There is no experimental structure for most protein sequences.]
(Sources: PDB, EBI, SIB)
How many structures do we know?
[Figure: the gap between sequence (1D) and structure (3D) can be bridged for some sequences by homology modeling or fold recognition; for many others no model is possible. (B. Rost, Columbia, New York)]
Genome View:
1D-Structure prediction
Secondary Structure Prediction
As starting point for 3D modeling
Improve sequence alignments
Use in fold recognition
Definition of loops / core regions
Solvent Accessibility Prediction
Identify exposed residues, e.g. for mutation studies,
epitopes, etc.
Secondary Structure prediction
What is protein secondary structure
prediction?
Simplification of prediction problem
3D → 1D
Secondary Structure prediction
Reduction to secondary structure “3-state” model: β-Strand, α-Helix, Loop
Projection onto strings of structural assignments
β-Strand (S or E), α-Helix (H), Loop (L)

SEQ MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAK
SS  SSSSSSLLLLLLHHHHHHHHHHHLLLSSSLHHHHHHHHHHHLLLLLLHHH
Secondary Structure prediction
Assumption: there should be a correlation between amino
acid sequence and secondary structure
Conformational Preferences
[Figure: conformational preferences of the amino acids for α-helix, β-strand and reverse turn; Biochimica et Biophysica Acta 916: 200-204 (1987).]
1st Generation secondary structure prediction
1st Generation based on single amino acid propensities
Chou and Fasman, 1974
Robson, 1976
GOR-I: Garnier, Osguthorpe, and Robson, 1978
Preference of particular residues for certain secondary structure elements:
Single-residue statistics: analysis of the frequency of each 20 aa in α helices, β strands or coils
Structure databases were of very limited size
Name            P(H)  P(E)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine          142    83     66    0.06   0.076   0.035   0.058
Arginine          98    93     95    0.07   0.106   0.099   0.085
Aspartic Acid    101    54    146    0.147  0.11    0.179   0.081
Asparagine        67    89    156    0.161  0.083   0.191   0.091
Cysteine          70   119    119    0.149  0.05    0.117   0.128
Glutamic Acid    151    37     74    0.056  0.06    0.077   0.064
Glutamine        111   110     98    0.074  0.098   0.037   0.098
Glycine           57    75    156    0.102  0.085   0.19    0.152
Histidine        100    87     95    0.14   0.047   0.093   0.054
Isoleucine       108   160     47    0.043  0.034   0.013   0.056
Leucine          121   130     59    0.061  0.025   0.036   0.07
Lysine           114    74    101    0.055  0.115   0.072   0.095
Methionine       145   105     60    0.068  0.082   0.014   0.055
Phenylalanine    113   138     60    0.059  0.041   0.065   0.065
Proline           57    55    152    0.102  0.301   0.034   0.068
Serine            77    75    143    0.12   0.139   0.125   0.106
Threonine         83   119     96    0.086  0.108   0.065   0.079
Tryptophan       108   137     96    0.077  0.013   0.064   0.167
Tyrosine          69   147    114    0.082  0.065   0.114   0.125
Valine           106   170     50    0.062  0.048   0.028   0.053
Chou-Fasman tables
Chou-Fasman
How it works:
a. Assign all of the residues the appropriate set of parameters
b. Identify α-helix and β-sheet regions. Extend the regions in
both directions.
c. If structures overlap compare average values for P(H) and
P(E) and assign secondary structure based on best scores.
d. Turns are modeled as tetra-peptides using 2 different
probability values.
Assign Pij values
1. Assign all of the residues the appropriate set of parameters
Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
P(E)    147   75   55  147   83   37  130  105   93   75  147   75
P(turn) 114  143  152  114   66   74   59   60   95  143  114  156
Scan peptide for α-helix regions
2. Identify regions where 4 out of 6 residues have P(H) > 100: "α-helix nucleus"

Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
Extend α-helix nucleus
3. Extend the helix in both directions until a set of four residues has an average P(H) < 100.

Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
Repeat steps 1 – 3 for entire peptide
4. Identify regions where 3 out of 5 residues have a P(E) > 100: "β-sheet nucleus"
Extend the β-sheet until 4 continuous residues have an average P(E) < 100.
If the region average P(E) > 105 and the average P(E) > average P(H), then "β-sheet".

Residue   T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69  142  151  121  145   98   77   69   57
P(E)    147   75   55  147   83   37  130  105   93   75  147   75
Scan peptide for β-sheet regions
Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(α-helix) > 100. That region is declared an α-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(α-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(α-helix) > P(β-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify regions where 3 out of 5 residues have a value of P(β-sheet) > 100. That region is declared a β-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(β-sheet) < 100 is reached. That is declared the end of the β-sheet. Any segment located by this procedure is assigned as a β-sheet if the average P(β-sheet) > 105 and the average P(β-sheet) > P(α-helix) for that region.
5. Any region with overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region, and a β-sheet if the average P(β-sheet) > P(α-helix).
6. To identify a bend at residue number j, calculate the value
   p(t) = f(j) · f(j+1) · f(j+2) · f(j+3)
   where the f(i+1) column is used for residue j+1, the f(i+2) column for residue j+2 and the f(i+3) column for residue j+3. If (1) p(t) > 0.000075, (2) the average value of P(turn) > 1.00 (i.e. > 100 on the scale of the table above) in the tetrapeptide, and (3) the averages for the tetrapeptide obey the inequality P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
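Steps 1-3 of the helix scan can be sketched in Python. This is a simplified illustration, not the full CHOFAS implementation; only the residues of the example peptide TSPTAELMRSTG are tabulated, and the function names are our own:

```python
# Simplified Chou-Fasman helix scan: nucleate where 4 of 6 contiguous
# residues have P(H) > 100, then extend in both directions until a
# 4-residue window averages P(H) < 100.
# P(H) values from the parameter table above (example residues only).
P_HELIX = {"T": 69, "S": 77, "P": 57, "A": 142, "E": 151,
           "L": 121, "M": 145, "R": 98, "G": 57}

def find_helix_nuclei(seq, p=P_HELIX):
    """Start indices of 6-residue windows where >= 4 residues have P(H) > 100."""
    nuclei = []
    for i in range(len(seq) - 5):
        window = seq[i:i + 6]
        if sum(p[aa] > 100 for aa in window) >= 4:
            nuclei.append(i)
    return nuclei

def extend_helix(seq, start, p=P_HELIX):
    """Extend a 6-residue nucleus while no flanking 4-residue window averages < 100."""
    lo, hi = start, start + 6  # helix as the half-open interval [lo, hi)
    while lo > 0 and sum(p[aa] for aa in seq[lo - 1:lo + 3]) / 4 >= 100:
        lo -= 1
    while hi < len(seq) and sum(p[aa] for aa in seq[hi - 3:hi + 1]) / 4 >= 100:
        hi += 1
    return lo, hi

seq = "TSPTAELMRSTG"
print(find_helix_nuclei(seq))      # -> [2, 3, 4]
print(extend_helix(seq, 2))        # -> (2, 10)
```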
CHOFAS predicts protein secondary structure
version 2.0u61 September 1998
Please cite: Chou and Fasman (1974) Biochem., 13:222-245

Chou-Fasman plot of @, 12 aa; SEQ1 sequence.

TSPTAELMRSTG
helix  <>
sheet  EEEEEEE
turns  T

Residue totals: H: 2  E: 7  T: 1
Percent: H: 16.7  E: 58.3  T: 8.3
Chou-Fasman Results
Performance Evaluation

Assumption: there should be a correlation* between amino acid sequence and secondary structure
Systematic performance testing pre-requisite for reliability of method
Dataset: PDB

Training set: PDB subset used to derive the correlation*
Test set: PDB subset used to compute Q3
Accuracy of prediction
3-state-per-residue accuracy:
Gives the % of correctly predicted residues in α, β or other state:

Q3 = 100 · Σi ci / N

• N = total number of residues
• ci = number of correctly predicted residues in state i (H, E, L)
1st Generation secondary structure prediction
3-state per residue accuracy assessment
Typical 1st generation prediction result:

SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
SS  EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH
3-state per residue accuracy:

Q3 = 100 · Σi ci / N

ci = number of correctly predicted residues in state i (H, E, L); N = number of all residues

50-55% Q3 accuracy (and performance is overestimated!)
Random prediction: Q3 = 35.2%
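The Q3 formula amounts to a per-position comparison of the observed and predicted state strings. A minimal sketch (function name is our own):

```python
# Q3 = 100 * sum_i c_i / N: the fraction of residues predicted in the
# correct state (H, E or L), computed from two aligned state strings.

def q3(observed, predicted):
    """Three-state per-residue accuracy, in percent."""
    assert len(observed) == len(predicted)
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

print(q3("HHHEELLL", "HHHELLLL"))  # -> 87.5
```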
2nd Generation secondary structure prediction
Improvements
Larger database of protein structures
Segment-based statistics (11-21 residue window)
Basic idea:
"How likely is it that the central residue in a window adopts a
particular secondary structure state?"
Algorithm used:
Presumably all conceivable algorithms on this planet have
been applied to the Secondary Structure prediction problem.
E.g. statistical information, physicochemical properties,
sequence patterns, neural networks, graph theory, expert
rules
(H) α-Helix, local interactions
Neural Network
Artificial intelligence: computer programs are trained to recognize amino acid patterns that are located in known secondary structures, and to distinguish them from other patterns not located in these structures.
NN can detect interactions between amino acids in a sequence window.
Artificial Neural Networks (ANN)
Excursion:
Introduction to Artificial
Neural Networks
Thanks to C. Pellegrini & P. Palagi (SIB) for slides about ANNs.
Inspiration - The brain
• Capable of remembering, recognizing patterns and associating. Main characteristics:
• massively parallel
• non-linear
• huge number of slow units, highly connected
• self-organizing and self-adapting
• Some statistics about the brain:
• 10^11 neurons
• 10^15 connections
• and about neurons:
• 1 neuron is connected with 10^3 to 10^5 other neurons
• slow: 10^-3 sec (silicon logic gates: 10^-9 sec)
C. Pellegrini (SIB)
The nervous system
The brain continually receives information, perceives it and makes appropriate decisions.

The human nervous system is a three-stage system:
Stimulus → Receptors → Neural networks (brain) → Effectors → Responses
(with forward information flow and feedback)
C. Pellegrini (SIB)
An artificial neural network
• An artificial neural network (ANN) is a “machine”:
• an assembly of artificial neurons
• created to model the way the brain executes tasks, by simulating mathematically the neurons and their connections
• Requirements to achieve a good performance:
• a huge number of neurons• massive interconnection among them
C. Pellegrini (SIB)
Artificial neuron model
• Introduced by McCulloch & Pitts (1943):

v = Σi wi xi
if v > θ then output = +1, else output = −1

[Diagram: inputs x1, x2 with weights w1, w2 and threshold θ determine the output.]
Quite simple: all signals can be +1 or −1. The neuron calculates a weighted sum of its inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to +1, otherwise −1.
P.Palagi (SIB)
Artificial neuron model
• This simple neuron model consists of:
• A set of connections, called synapses, which make the link to other neurons to create a network. Each synapse has a synaptic weight which represents the strength of the connection.
• One unit which multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs to get a total input.
• An activation function that transforms the total input into an outgoing activity (to constrain the input amplitude).
v = Σi wi xi
if v > θ then output = +1, else output = −1
P.Palagi (SIB)
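The McCulloch & Pitts neuron above translates directly into code. A minimal sketch (the example with weights (1, 1) and θ = 1.5, which behaves like a logical AND over ±1 signals, is our own illustration):

```python
# McCulloch & Pitts neuron: weighted sum of +/-1 input signals
# compared against a threshold theta.

def mp_neuron(inputs, weights, theta):
    """Return +1 if the weighted input sum exceeds theta, else -1."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return 1 if v > theta else -1

# With weights (1, 1) and theta = 1.5 the neuron fires only when
# both inputs are +1 (an AND over +1/-1 signals).
print(mp_neuron([+1, +1], [1, 1], 1.5))  # -> 1
print(mp_neuron([+1, -1], [1, 1], 1.5))  # -> -1
```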
Artificial neuron model
Modern McCulloch & Pitts neuron:

[Diagram: input signals x1 … xp, multiplied by synaptic weights wk1 … wkp, enter a summation unit producing vk; an activation function φ(.) with threshold θk yields the output yk.]
P.Palagi (SIB)
Artificial neuron model
The model can be mathematically described:

vk = Σj=1..p wkj · xj    and    yk = φ(vk − θk)

Where:
• x1, x2, …, xp are the inputs,
• wk1, wk2, …, wkp are the synaptic weights of neuron k,
• vk is the linear combiner output,
• θk is the threshold,
• φ(.) is the activation function,
• yk is the output signal of the neuron.
P.Palagi (SIB)
Types of activation functions
The activation function defines the output of a neuron in terms of the activity level at its inputs. There are 3 basic types of activation functions:

• threshold function:
  yk = φ(vk) = 1 if vk ≥ 0; 0 if vk < 0

• piecewise-linear function:
  yk = φ(vk) = 1 if vk ≥ α; linear in vk if α > vk > β; 0 if vk ≤ β

• sigmoid function:
  φ(v) = 1 / (1 + e^(−av))   or   φ(v) = tanh(v/2) = (1 − e^(−v)) / (1 + e^(−v))
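The three basic activation functions can be sketched as follows; α, β and the slope a are free parameters of the sketch, and the piecewise-linear branch is rescaled to run linearly from 0 at β to 1 at α:

```python
import math

# The three basic activation function types: threshold,
# piecewise-linear and sigmoid.

def threshold(v):
    """Step function: 1 if v >= 0, else 0."""
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v, alpha=1.0, beta=-1.0):
    """1 above alpha, 0 below beta, linear in between."""
    if v >= alpha:
        return 1.0
    if v <= beta:
        return 0.0
    return (v - beta) / (alpha - beta)

def sigmoid(v, a=1.0):
    """Smooth, differentiable squashing function."""
    return 1.0 / (1.0 + math.exp(-a * v))

print(threshold(0.3), piecewise_linear(0.0), sigmoid(0.0))  # -> 1.0 0.5 0.5
```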
Activation functions - interpretation
An activation function is a decision function:
• it defines a threshold below which the activation value will not fire any output,
• it allows selecting, linearly or not, among different activation values,
• the highest output value comes from the highest activation value, i.e. the greatest similarity between the input values and the synaptic weights.
Network architectures
The power of neural networks comes from their collective
behavior in a network where all neurons are
interconnected. The network starts evolving: neurons
continuously evaluate their output by looking at their
inputs, calculating the weighted sum and comparing it to a
threshold to decide if they should fire. This is a highly
complex parallel process whose features cannot be
reduced to phenomena taking place in individual
neurons.
Network architectures
Neural networks are formed by an assembly of many artificial neurons. An artificial neural network may be seen as a massively parallel distributed processor.

The basic work of a neural network is determined by learning. The memorized information is retained through the synaptic weights.
⇒ Knowledge is represented by the free parameters of the neural network, i.e. synaptic weights and thresholds.
Single-layer Feed-forward Network
[Diagram: input signals x1, x2, x3 … xm feed a single layer of neurons.]

Single Layer Perceptron
Learning methods
An artificial neural network learning method is a procedure which adjusts the neural network's free parameters, i.e. synaptic weights and thresholds.
Supervised: We feed the neural network with k input (entries) and
their corresponding desired output. The learning algorithm modifies
(little by little) the synaptic weights to adapt the obtained output
according to the desired output. Only the synaptic weights which
produce an error are modified.
Non-supervised: We feed the neural network with the input
(entries) only. The neural network will organise itself in order to
represent the input data.
Multilayer Feed-forward Network
[Diagram: input signals x1, x2, x3 … xm feed a layer of hidden neurons, which feed the output layer.]

Multi Layer Perceptron
Training a neural network
Supervised Learning
We feed the neural network with the input (entries) and the corresponding desired output.
The learning algorithm modifies (step by step) the synaptic weights to adapt the obtained output according to the desired output. Only the synaptic weights which produce an error are modified.
The error back-propagation algorithm consists of two phases:
the forward phase where the activations are propagated from the input to the output layer, and
the backward phase, where the error between the observed actual and the requested nominal value in the output layer is propagated backwards in order to modify the weights and bias values.
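The two phases can be sketched for a tiny network. The network size, learning rate, 0/1 signal levels and the OR training task are illustrative choices, not part of the slides; the bias is written as an extra weight (+b) rather than a threshold (−θ):

```python
import math
import random

random.seed(0)  # deterministic illustration

def sig(v):
    return 1.0 / (1.0 + math.exp(-v))

# 2-input / 2-hidden / 1-output network; each weight list ends in a bias.
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w2 = [[random.uniform(-1, 1) for _ in range(3)]]

def forward(x):
    """Forward phase: propagate activations from input to output layer."""
    h = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
    y = sig(w2[0][0] * h[0] + w2[0][1] * h[1] + w2[0][2])
    return h, y

def backward(x, target, lr=0.5):
    """Backward phase: propagate the output error back and adjust weights."""
    h, y = forward(x)
    d_out = (y - target) * y * (1 - y)                       # output delta
    d_hid = [d_out * w2[0][i] * h[i] * (1 - h[i]) for i in range(2)]
    for j in range(2):                                       # hidden -> output
        w2[0][j] -= lr * d_out * h[j]
    w2[0][2] -= lr * d_out                                   # output bias
    for i in range(2):                                       # input -> hidden
        for j in range(2):
            w1[i][j] -= lr * d_hid[i] * x[j]
        w1[i][2] -= lr * d_hid[i]                            # hidden bias

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR
for _ in range(5000):
    for x, t in data:
        backward(x, t)

errors = [abs(forward(x)[1] - t) for x, t in data]
print(round(max(errors), 2))  # small after training
```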
(H) α-Helix, local interactions
Neural Networks for Secondary Structure Prediction
Artificial intelligence: computer programs are trained to recognize amino acid patterns that are located in known secondary structures, and to distinguish them from other patterns not located in these structures.
NN can detect interactions between amino acids in a sequence window.
[Diagram: a window of residues with known structure (… D R Q G F V P A A Y V K K …, states L/E/H) is encoded over the alphabet ACDEFGHIKLMNPQRSTVWY and fed to the network: input layer → hidden layer → output layer with three units, H, E and L; the weights are adjusted during training of the NN. (B. Rost, Columbia, New York)]
Neural Networks for Secondary Structure Prediction
[Diagram: for the same window, the trained network outputs H = 0.19, E = 0.61, L = 0.17; the winner is E, which becomes the prediction. (B. Rost, Columbia, New York)]
Neural Networks for Secondary Structure Prediction
Neural Networks
Benefits:
Generally applicable
Can capture higher order correlations
Inputs other than sequence information
Drawbacks:
Needs many data points (solved structures)
Risk of overtraining
2nd Generation secondary structure prediction
Methods:
GORIII
COMBINE
Q3 accuracy < 70%
Problems with first and second generation methods
Q3 accuracy < 70%
β-strands predicted with only 28-48% accuracy (slightly better than random)
Predicted helices and strands are too short
The Dinosaurs are still alive …
Bad example:
PeptideStructure makes predictions of the following features of an amino acid sequence:
- Secondary structure according to the Chou-Fasman method
- Secondary structure according to the Garnier-Osguthorpe-Robson method
-…
From the GCG Manual © 1982-2002 Accelrys
3rd Generation secondary structure prediction
[Diagram: a sliding window (IKEEHVI IQAE, from the sequence IKEEHVIIQAEFYLNPDQSGEF…) feeds the input layer; the signal passes through a hidden layer, via the trained weights, to the output layer with units H, E, C.]
Graphics: C. Lundegaard, CBS
3rd Generation secondary structure prediction
Graphics: C. Lundegaard, CBS
Input neuron   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Amino acid
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
R 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Sparse encoding
3rd Generation secondary structure prediction
Graphics: C. Lundegaard, CBS
IKEEHVI IQAE
00000010000000000000
3rd Generation secondary structure prediction
Graphics: C. Lundegaard, CBS
BLOSUM 62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
3rd Generation secondary structure prediction
Graphics: C. Lundegaard, CBS
[Diagram: each residue of the window IKEEHVI IQAE is encoded by its column of BLOSUM62 substitution scores instead of a sparse bit vector.]
3rd Generation secondary structure prediction
Breakthrough: Using evolutionary information

                       1                                              50
fyn_human   VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick   VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human   VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick   VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick   VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat   VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa   .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human   ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse   ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse   .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human   ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human   ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast  .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse   ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome  ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi  .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast  ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human   .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel  ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human   .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast  VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex  .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
3rd Generation secondary structure prediction
[Diagram: multiple aligned sequence windows (e.g. :GYIY DPAVGDPDNGVEP GTEF:, :GYIY DPEVGDPTQNIPP GTKF:, :GYEY DPAEGDPDNGVKP GTSF:, …) are summarized in a profile table giving, for each window position, the observed counts of the 20 amino acids (G S A P D N T E K Q C V H I R L M Y F W). Each residue of the query protein (e.g. :GYIY DPEDGDPDDGVNP GTDF:) is then coded by the 21×3 bits of its profile column and fed to the network: input layer → first or hidden layer → second or output layer with units H, E, L; the maximal output unit gives the current prediction. (B. Rost, Columbia, New York)]
3rd Generation secondary structure prediction
PHD method (Rost and Sander)
Combines neural networks with MAXHOM multiple
sequence profiles
6-8 percentage points increase in prediction accuracy
over standard neural networks
Use second layer “Structure to structure”
network to filter predictions
Jury of predictors
3rd generation secondary structure prediction
PHD (Rost et al.): Q3 = 72-76%
[B. Rost (2001) J. Struct. Biol. 134, 204]
[Plot: Q3 (59%, 65%, 72%) as a function of the prediction reliability index (0 = weak, 9 = strong).]
3rd generation secondary structure prediction
PSI-Pred (Jones, DT)
Use alignments from iterative sequence
searches (PSI-Blast) as input to a neural
network
Better predictions due to better sequence
profiles
Available as stand alone program and via the
web
How accurate are predictions today?
[Histogram: number of protein chains vs. per-residue accuracy Q3 (0-100%); <Q3> = 72.3%, sigma = 10.5%; outliers labeled 1spf, 1bct, 1stu, 3ifm, 1psm. (B. Rost, Columbia, New York)]
How accurate are predictions today?
Q3 = 72-77% ± 11% (on average)
• I.e. 30 % of predicted assignments are wrong
• I.e. for 2/3 of all proteins, between 60% - 80% of residues are predicted correctly
• I.e. for your protein, accuracy can be lower than 60% or higher than 80%
Secondary Structure Prediction
META-PredictProtein Server
• http://www.predictprotein.org
• Simultaneous submission tool to several other servers, e.g. JPRED, PHD, PROF, PSIPRED, SAM-T99, APSSP2, SSpro
• Includes also motif searches, domain assignments, TM predictions, etc.
1D-Structure prediction
Secondary Structure Prediction
Solvent Accessibility Prediction
Identify exposed residues, e.g. for
mutation studies, epitopes, etc.
1D-Structure prediction
Projection onto strings of structural assignments
E.g. “Solvent Accessibility” (buried or exposed?)
A B C D E F G …
¦ ¦ ¦ ¦ ¦ ¦ ¦
e e b b e e e …
Accuracy of two-state prediction: 75% ± 10 %
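A two-state projection can be sketched as a simple threshold on relative accessibility; the 16% cutoff here is an illustrative assumption, not a value from the slide:

```python
# Two-state solvent accessibility: call a residue exposed ('e') or
# buried ('b') by thresholding its relative accessibility (0..1).
# The 16% threshold is an illustrative choice.

def two_state_accessibility(rel_acc, threshold=0.16):
    """Map a list of relative accessibilities onto a b/e string."""
    return "".join("e" if r >= threshold else "b" for r in rel_acc)

print(two_state_accessibility([0.45, 0.50, 0.02, 0.08, 0.30, 0.61, 0.77]))
# -> "eebbeee"
```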
PHDacc: solvent accessibility prediction
[http://www.predictprotein.org]
1D-Structure Prediction
Introduction
Secondary Structure Prediction
Solvent Accessibility Prediction
Disorder Prediction
Native Disorder in Proteins
Structural biology tenet: “the function of a protein is determined by its 3D structure”.
However, disordered proteins, or disordered regions of proteins, have no fixed secondary or tertiary structure under physiological conditions and/or in the absence of a binding partner/ligand:
An ensemble of structural states leading to dynamic flexibility
Non-globular structures that are extended in the solvent
[Figure: 2hfv, 2hfq]
Experimental Detection of Disordered regions
A protein region is defined as disordered if it is devoid of stable secondary structure and if it has a large number of conformations:
X-Ray crystallography: lack of electron density
NMR: dynamics of sizeable disordered regions
CD (Circular dichroism)
SAXS (Small-angle X-ray scattering)
Hydrodynamic measurements
Traditional biochemical studies: proteolytic susceptibility
…
DisProt: Database of protein disorder (www.disprot.org)
Role of Protein Disorder
Participate in many biological processes:
Regulation of transcription and translation
Cellular signal transduction
Cell-cycle control
Regulation of the self-assembly of large multiprotein complexes (e.g. bacterial flagellum and the ribosome)

Role?
Form larger contact areas with other proteins
Flexibility allows binding of multiple ligands
Protein easily regulated by post-translational modifications (PTMs)
The relative instability of the intrinsically disordered proteins involved in transcription and signaling provides a further level of control through proteolytic degradation: concentration easily regulated by protease digestion.
The continuum of protein structure
ACTR: interaction domain of activator (p160) for retinoid receptor
NCBD: nuclear-receptor co-activator domain of CBP
TFIIIA: 3 zinc fingers of transcription factor
eIF4E: translation-initiation factor
Thermodynamic consequences of coupled folding and binding
There is an entropic cost associated with the disorder-to-order transition: the binding of an intrinsically unstructured protein to its target.
The key thermodynamic driving force for the binding reaction is a favorable enthalpic contribution: enthalpy-entropy compensation.
Coupled folding and binding gives rise to a complex with high specificity and relatively low affinity: appropriate for signal-transduction proteins.
Characteristics of Disorder regions
Clear patterns characterize disordered regions:
Low sequence complexity (biased composition, overrepresentation of a few residues)
Amino acid compositional bias:
• Low content of bulky hydrophobic amino acids (Val, Leu, Ile, Met, Phe, Trp and Tyr)
• High proportion of polar and charged amino acids (Gln, Ser, Pro, Glu, Lys and sometimes Gly and Ala)
High sequence variability (high flexibility)
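The compositional bias above suggests a rough heuristic. The residue sets follow the slide, but the comparison rule and the example sequence are our own illustration:

```python
# Rough sketch of the compositional bias described above: compare the
# fraction of bulky hydrophobic (order-promoting) residues with the
# fraction of polar/charged (disorder-promoting) residues.

ORDER_PROMOTING = set("VLIMFWY")     # bulky hydrophobic
DISORDER_PROMOTING = set("QSPEKGA")  # polar/charged (+ Gly, Ala)

def disorder_bias(seq):
    """Return (hydrophobic fraction, polar/charged fraction) for a sequence."""
    n = len(seq)
    hydro = sum(aa in ORDER_PROMOTING for aa in seq) / n
    polar = sum(aa in DISORDER_PROMOTING for aa in seq) / n
    return hydro, polar

# Hypothetical low-complexity segment: no bulky hydrophobics at all.
hydro, polar = disorder_bias("SPEKSQGSAPEKQ")
print(hydro < polar)  # -> True
```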
These characteristics can be used for the training of NN-based disorder predictors.
Role of Prediction of Disordered regions
The prediction of disordered regions would provide:
First step in the identification of functionally relevant disordered regions
• Design of laboratory experiments for the identification of binding sites within disordered regions. [1]
Identification of regions that hinder successful crystallization of the protein: a bottleneck in structural proteomics (high-throughput structure determination pipelines) [2]
[1] Longhi S. et al. (2003) J. Biol. Chem. 278, 18638
[2] Linding R. et al. (2003) Structure 11, 1453
Program & Servers
Obradovic & Dunker: PONDR, http://www.pondr.com/
Jones: Disopred2, http://bioinf.cs.ucl.ac.uk/disopred/
References
P.E. Bourne, H. Weissig: Structural Bioinformatics, Wiley-Liss and Sons.
Methods in Molecular Biology 143: Protein Structure Prediction, Humana Press.
Protein Structure Prediction: A Practical Approach, Oxford University Press.
R. Silipo: Neural Networks, in: M. Berthold, D.J. Hand (eds.), Intelligent Data Analysis, Springer Verlag.