Prediction of structural and functional features in proteins
starting from the residue sequence
INTRODUCTION TO NEURAL NETWORKS
Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN (Nt to Ct)
3D structure
Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
MAPPING PROBLEMS: Secondary structure
Topography: position of transmembrane segments along the sequence
[Figure: Porin (Rhodobacter capsulatus), a β-barrel in the outer membrane, and Bacteriorhodopsin (Halobacterium salinarum), α-helices in the inner membrane, both embedded in the bilayer]
ALALMLCMLTYRHKELKLKLKK
MAPPING PROBLEMS: Topology of transmembrane proteins
First generation methods: single residue statistics
Propensity scales
For each residue
•The association between each residue and the different features is statistically evaluated
•Physical and chemical features of residues
A propensity value for any structure can be associated to any residue
HOW?
Secondary structure: Chou-Fasman propensity scale
Given a set of known structures we can count how many times a residue is associated to a structure.
Example:
ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh
N(A,h) = 7, N(A,c) = 1, N = 40
P(A,h) = 7/40, P(A,c) = 1/40
Is that enough for estimating a propensity?
We need to estimate how far the residue-to-structure association is from independence.
P(h) = 27/40, P(c) = 13/40
If the structure is independent of the residue: P(A,h) = P(A) P(h)
The ratio P(A,h) / (P(A) P(h)) is the propensity
Given a LARGE set of examples, a propensity value can be computed for each residue and each structure type
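The propensity ratio can be checked on the toy example above. The following is a minimal sketch, not a real Chou-Fasman implementation: it uses the single 40-residue example (a real scale needs a large set of structures), and the function name `propensity` is chosen here for illustration.

```python
from collections import Counter

# Toy data from the example: one sequence and its per-residue structure
# (h = helix, c = coil). The structure string is built from run lengths.
SEQ = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
SS = "h" * 8 + "c" * 13 + "h" * 19

def propensity(seq, ss, residue, structure):
    """Return P(residue, structure) / (P(residue) * P(structure))."""
    n = len(seq)
    joint = Counter(zip(seq, ss))              # N(residue, structure)
    p_joint = joint[(residue, structure)] / n  # e.g. 7/40 for (A, h)
    p_res = seq.count(residue) / n             # e.g. 8/40 for A
    p_ss = ss.count(structure) / n             # e.g. 27/40 for h
    return p_joint / (p_res * p_ss)

prop_A_h = propensity(SEQ, SS, "A", "h")  # (7/40) / ((8/40) * (27/40))
prop_A_c = propensity(SEQ, SS, "A", "c")  # (1/40) / ((8/40) * (13/40))
```

A value above 1 means the residue is found in that structure more often than expected under independence; a value below 1 means it is found less often.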
Name            P(H)   P(E)
Alanine         1.42   0.83
Arginine        0.98   0.93
Aspartic Acid   1.01   0.54
Asparagine      0.67   0.89
Cysteine        0.70   1.19
Glutamic Acid   1.51   0.37
Glutamine       1.11   1.10
Glycine         0.57   0.75
Histidine       1.00   0.87
Isoleucine      1.08   1.60
Leucine         1.21   1.30
Lysine          1.14   0.74
Methionine      1.45   1.05
Phenylalanine   1.13   1.38
Proline         0.57   0.55
Serine          0.77   0.75
Threonine       0.83   1.19
Tryptophan      1.08   1.37
Tyrosine        0.69   1.47
Valine          1.06   1.70
Secondary structure: Chou-Fasman propensity scale
Given a new sequence, a secondary structure prediction can be obtained by plotting the propensity values for each structure, residue by residue
Considering three secondary structures (H,E,C), the overall accuracy, as evaluated on an uncorrelated set of sequences with known structure, is very low: Q3 = 50-60 %
         T    S    P    T    A    E    L    M    R    S    T    G
P(H)    83   77   57   83  142  151  121  145   98   77   83   57
P(E)   119   75   55  119   83   37  130  105   93   75  119   75
Secondary structure: Chou-Fasman propensity scale
http://www.expasy.ch/cgi-bin/protscale.pl
Transmembrane alpha-helices: Kyte-Doolittle scale
It is computed taking into consideration the octanol-water partition coefficient, combined with the propensity of the residues to be found in known transmembrane helices
Ala: 1.800 Arg: -4.500 Asn: -3.500 Asp: -3.500 Cys: 2.500 Gln: -3.500 Glu: -3.500 Gly: -0.400 His: -3.200 Ile: 4.500 Leu: 3.800 Lys: -3.900 Met: 1.900 Phe: 2.800 Pro: -1.600 Ser: -0.800 Thr: -0.700 Trp: -0.900 Tyr: -1.300 Val: 4.200
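A hydropathy scale is typically used by averaging it over a sliding window along the sequence. The sketch below uses the values listed above; the window length of 19 residues and the function name `hydropathy_profile` are assumptions made for illustration, not parameters given in these slides.

```python
# Kyte-Doolittle values from the list above, keyed by one-letter code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Average hydropathy over each full window; one value per window centre.

    Positions closer than window // 2 to either end are skipped.
    """
    half = window // 2
    profile = []
    for centre in range(half, len(seq) - half):
        segment = seq[centre - half:centre + half + 1]
        profile.append(sum(KD[r] for r in segment) / window)
    return profile

# Toy usage on the short example sequence shown earlier (window of 7).
toy_profile = hydropathy_profile("ALALMLCMLTYRHKELKLKLKK", window=7)
```

Stretches where the averaged value stays high are candidate transmembrane helices; the threshold is a separate choice and is not fixed here.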
Second generation methods: GOR
The structure of a residue in a protein strongly depends on the sequence context
It is possible to estimate the influence of a residue in determining the structure of another residue close to it along the sequence. Windows extending from 8 to 13 positions on either side of the central residue are usually considered.
Coefficients P(A,s,i) estimate the contribution of the residue A in determining the structure s for a residue that is i positions apart along the sequence
Secondary structure: GOR method
Q3 = 65 % (Considering three secondary structures (H,E,C), and evaluating the overall accuracy on an uncorrelated set of sequences with known structure)
The contribution of each position in the window is independent of the others: no correlation among the positions in the window is taken into account.
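The independence of the window positions means that the score of a structure is just a sum of one coefficient per position. The sketch below illustrates that sum only; the coefficient values in `TOY_COEF` are invented for the example (real coefficients are estimated from a database of known structures), and the offset `i` is taken as the position of the contributing residue relative to the predicted one.

```python
STRUCTURES = "HEC"

def gor_predict(seq, coef, half_window=8):
    """Assign to each residue the structure with the highest summed score.

    coef maps (residue, structure, offset) to a number; missing entries
    count as 0. Each window position contributes independently.
    """
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for s in STRUCTURES:
            total = 0.0
            for i in range(-half_window, half_window + 1):
                if 0 <= j + i < len(seq):
                    total += coef.get((seq[j + i], s, i), 0.0)
            scores[s] = total
        prediction.append(max(STRUCTURES, key=lambda s: scores[s]))
    return "".join(prediction)

# Hypothetical coefficients, for illustration only.
TOY_COEF = {
    ("A", "H", 0): 1.0, ("A", "H", -1): 0.5, ("A", "H", 1): 0.5,
    ("G", "C", 0): 1.0, ("V", "E", 0): 1.0,
}
toy_prediction = gor_predict("AAGVV", TOY_COEF)
```

Because every term depends on a single position, no coefficient can express that two residues in the window matter only in combination, which is the limitation stated above.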
A more efficient method: Neural Networks
Alternative computing algorithm: analogies with the computation in the nervous system.
1) The nervous system is made of elementary computing units: neurons
2) The electric signal flows in a determined direction (dendrites -> axon) (Principle of dynamic polarization)
3) There is no cytoplasmic continuity among the neurons. Each neuron communicates specifically with some neighbouring neurons by means of synapses (Principle of connective specificity)
Tools out of machine learning approaches
Neural Networks can learn the mapping from sequence to secondary structure
[Figure: a database subset with known mapping (e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN to EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE) is used for training; the network extracts general rules, which are applied to a new sequence to obtain a prediction]
Neural network for secondary structure prediction
Input: a window sliding along the sequence M P I L K Q K P I H Y H P N H G E A K G. For the window K P I H Y H P N H the input is:
     K  P  I  H  Y  H  P  N  H
A    0  0  0  0  0  0  0  0  0
C    0  0  0  0  0  0  0  0  0
D    0  0  0  0  0  0  0  0  0
E    0  0  0  0  0  0  0  0  0
F    0  0  0  0  0  0  0  0  0
G    0  0  0  0  0  0  0  0  0
H    0  0  0  1  0  1  0  0  1
I    0  0  1  0  0  0  0  0  0
K    1  0  0  0  0  0  0  0  0
L    0  0  0  0  0  0  0  0  0
M    0  0  0  0  0  0  0  0  0
N    0  0  0  0  0  0  0  1  0
P    0  1  0  0  0  0  1  0  0
Q    0  0  0  0  0  0  0  0  0
R    0  0  0  0  0  0  0  0  0
S    0  0  0  0  0  0  0  0  0
T    0  0  0  0  0  0  0  0  0
V    0  0  0  0  0  0  0  0  0
W    0  0  0  0  0  0  0  0  0
Y    0  0  0  0  1  0  0  0  0
Usually: input 17-23 residues
Hidden neurons: 4-15
Input alphabet: ACDEFGHIKLMNPQRSTVWY.
Output: H, E, L
Example of residue-by-residue output: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E)
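The input encoding shown for this network, one binary vector of 20 entries per window position, can be sketched as follows. The choice of leaving positions beyond the chain ends as all-zero vectors is an assumption made here, and `one_hot_window` is an illustrative name.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_window(seq, centre, window=9):
    """Encode the window centred on `centre` as `window` binary vectors.

    Each vector has one entry per residue type; positions that fall
    outside the sequence stay all-zero (an assumption of this sketch).
    """
    half = window // 2
    matrix = []
    for pos in range(centre - half, centre + half + 1):
        column = [0] * len(ALPHABET)
        if 0 <= pos < len(seq):
            column[ALPHABET.index(seq[pos])] = 1
        matrix.append(column)
    return matrix

SEQUENCE = "MPILKQKPIHYHPNHGEAKG"
# Window K P I H Y H P N H is centred on the Y at index 10 (0-based).
encoded = one_hot_window(SEQUENCE, centre=10, window=9)
input_size = sum(len(column) for column in encoded)  # 9 positions x 20 types
```

With a 17-residue window the same encoding gives 17 x 20 = 340 input values, which is why the number of weights grows quickly with the window size.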
Third generation methods: evolutionary information
Multiple sequence alignment:
 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K
Sequence profile (one column per alignment position):
Position
      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A     0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D     0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E     0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F     0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G    10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H     0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K     0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I     0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L     0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M     0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N     0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q     0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R     0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S     0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T    20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V     0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W     0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y    70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
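A sequence profile of this kind can be derived from the alignment by counting residues column by column. The sketch below assumes that gaps are excluded from the normalisation, which is consistent with the first columns of the profile shown (for instance Y 70, G 10, T 20 at position 1); the function name `profile` is illustrative.

```python
from collections import Counter

# The ten aligned sequences shown above ("-" marks a gap).
MSA = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]

def profile(msa):
    """Return, per alignment column, the percentage of each residue type.

    Gaps are ignored, so each column sums to 100 over the residues
    actually present (an assumption of this sketch).
    """
    columns = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        columns.append({r: 100.0 * n / total for r, n in counts.items()})
    return columns

msa_profile = profile(MSA)
```

Each column of the result is the 20-valued vector (here stored sparsely as a dictionary) that replaces the single binary vector of the one-sequence encoding.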
The Network Architecture for Secondary Structure Prediction
The First Network (Sequence to Structure)
Input: sequence profile
SeqNo  No   V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D
    1   1  80   0  20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    2   2   0   0   0   0   0   0   0   0   0  20   0   0   0   0   0   0   0   0   0  80
    3   3  50   0   0   0   0   0   0   0  33   0   0   0   0   0   0   0   0  17   0   0
    4   4   0   0   0   0   0   0   0   0  13  63  13   0   0   0   0   0   0  13   0   0
    5   5  13   0   0   0   0   0   0  13  75   0   0   0   0   0   0   0   0   0   0   0
    6   6   0   0   0  13   0   0   0   0   0  13   0  13   0   0   0   0   0   0   0  63
    7   7   0   0   0  38   0   0   0  38   0   0   0   0   0   0   0  25   0   0   0   0
    8   8  25  13   0   0   0   0   0   0  50   0  13   0   0   0   0   0   0   0   0   0
    9   9   0  13  13   0   0   0   0   0   0  25   0   0   0   0   0  50   0   0   0   0
   10  10   0   0  25  13   0   0   0   0  13  13   0   0   0   0   0  38   0   0   0   0
   11  11   0   0   0   0   0   0   0   0  25   0   0   0   0   0   0  13  13   0   0  50
   12  12   0   0   0   0  43   0   0  29   0  29   0   0   0   0   0   0   0   0   0   0
   13  13   0  14  29   0   0   0   0   0  29   0   0   0   0   0   0   0   0  14   0  14
   14  14   0   0   0   0   0   0   0  43  29   0   0   0   0   0   0  29   0   0   0   0
Output: H E C
CCHHEHHHHCHHCCEECCEEEEHHHCC
The Second Network (Structure to Structure)
Input: CCHHEHHHHCHHCCEECCEEEEHHHCC
Output: H E C
The cross validation procedure
[Figure: the protein set is split into Training set 1 and Testing set 1; the split is repeated with different subsets]
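A cross validation split of the protein set can be sketched as below. The number of folds and the round-robin assignment are assumptions of this example; in practice related proteins should end up in the same fold, which this sketch does not enforce.

```python
def k_fold_splits(items, k):
    """Yield (training, testing) lists; each item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        testing = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, testing

# Hypothetical protein identifiers, for illustration only.
proteins = ["prot%02d" % n for n in range(10)]
splits = list(k_fold_splits(proteins, k=5))
```

The reported accuracy is then the average over the testing sets, so that no protein contributes to both the training and the evaluation of the same network.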
The Performance on the Task of Secondary Structure Prediction
Efficiency of the Neural Network-Based Predictors on the 822 Proteins of the Testing Set
Input                          Q3 (%)  SOV   Q[H]  Q[E]  Q[C]  P[H]  P[E]  P[C]  C[H]  C[E]  C[C]
Single sequence                66.3    0.62  0.69  0.61  0.66  0.70  0.54  0.71  0.54  0.44  0.45
Multiple sequence (MaxHom)     72.4    0.69  0.75  0.65  0.75  0.77  0.64  0.73  0.64  0.54  0.53
Multiple sequence (PSI-BLAST)  73.4    0.70  0.75  0.70  0.73  0.80  0.63  0.75  0.67  0.56  0.53
Combining different networks: Q3 = 76-78 %
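The indices in this table can be computed from an observed and a predicted structure string. The sketch below assumes the usual definitions, which the slides do not spell out: Q3 is the percentage of residues predicted correctly, Q[x] the fraction of observed residues of class x that are predicted as x, and P[x] the fraction of predictions of class x that are correct. The two strings are invented toy data.

```python
def q3(observed, predicted):
    """Percentage of residues whose predicted class equals the observed one."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def per_class(observed, predicted, label):
    """Return (Q[label], P[label]) under the definitions assumed above."""
    hits = sum(o == label and p == label for o, p in zip(observed, predicted))
    n_observed = observed.count(label)
    n_predicted = predicted.count(label)
    q = hits / n_observed if n_observed else 0.0
    p = hits / n_predicted if n_predicted else 0.0
    return q, p

# Toy strings, for illustration only.
OBSERVED = "HHHHCCEE"
PREDICTED = "HHHCCCEC"
toy_q3 = q3(OBSERVED, PREDICTED)
q_h, p_h = per_class(OBSERVED, PREDICTED, "H")
```

Q[x] and P[x] are reported together because either one can be made high on its own, for example by predicting a class everywhere or almost nowhere.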
Secondary Structure Prediction
From sequence:
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
To secondary structure:
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
And to the reliability of the prediction:
7997688899999988776886778999887679956889999999
SERVERS
PredictProtein, Burkhard Rost (Columbia Univ.): http://cubic.bioc.columbia.edu/predictprotein/
PsiPRED, David Jones (UCL): http://bioinf.cs.ucl.ac.uk/psipred/
JPred, Geoff Barton (Dundee Univ.)
SecPRED: http://www.biocomp.unibo.it
Chameleon sequences
QEALEIA
1TIF: Translation Initiation Factor 3, Bacillus stearothermophilus
……GIKSKQEALEIAARRN……
1WTUA: Transcription Factor 1, Bacteriophage Spo1
……FNPQTQEALEIAPSVGV……
From a set of 822 non-homologous proteins (174,192 residues) we extract:
2,452 5-mer chameleons
107 6-mer chameleons
16 7-mer chameleons
1 8-mer chameleon
for a total of 2,576 couples
The total number of residues in chameleons is 26,044 (~15%), found in 755 protein chains
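A search for chameleons, identical segments that adopt different secondary structures, can be sketched as follows. The criterion used here (any difference between the two structure strings) and the structure strings of the two fragments are assumptions made for the example; the actual extraction criterion is not given in these slides.

```python
def find_chameleons(chains, k):
    """Return the k-mers observed with more than one secondary structure.

    chains is a list of (sequence, structure) pairs of equal length.
    """
    seen = {}
    for seq, ss in chains:
        for i in range(len(seq) - k + 1):
            seen.setdefault(seq[i:i + k], set()).add(ss[i:i + k])
    return {kmer: structures for kmer, structures in seen.items()
            if len(structures) > 1}

# The two fragments quoted above, with hypothetical structure strings.
CHAINS = [
    ("GIKSKQEALEIAARRN", "CCCCC" + "HHHHHHH" + "CCCC"),
    ("FNPQTQEALEIAPSVGV", "C" * 17),
]
chameleons_7 = find_chameleons(CHAINS, k=7)
```

On a full data set the same dictionary-based scan is run once per segment length, which is how counts per k-mer size such as those listed above can be obtained.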
Prediction of the Secondary Structure of Chameleon sequences with Neural Networks
NGDQLGIKSKQEALEIAARRNLDLVLVAP
ARKGFNPQTQEALEIAPSVGVSVKPG
QEALEIA  HHHHHHH
QEALEIA  CCCCCCC
The Prediction of Chameleons with Neural Networks
Other neural network-based predictors
•Secondary structure
•Topology of transmembrane proteins
•Cysteine bonding state
•Contact maps of proteins
•Interaction sites on protein surface
Prediction of the cysteine bonding state
Tryparedoxin-I from Crithidia fasciculata (1QK8)
[Figure: structure with Cys40, Cys43 and Cys68 marked as free cysteines or disulphide bonded cysteines]
MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQLIEFYDKFHES
KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQKLSKHFNVESIPTLIGVDADSG
DVVTTRARATLVKDPEGEQFPWKDAP
A neural network-based method for predicting the disulfide connectivity in proteins
The Protein Folding
TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA
Disulfide bonds
2 -SH -> -SS- + 2H+ + 2e-
S-S distance: 2.2 Å
Torsion angle C-S-S-C: 90°
Bond energy: 3 kcal/mol
Intra-chain disulfide bonds in proteins
Of 1259 proteins (a non-redundant PDB subset):
• 23% of the chains have disulfide bonds (SS)
• SS distribution between secondary structures (%):
       H    E    C
H      7    9   14
E          17   27
C               26
Distribution of disulfide bonds in the SCOP domains
Type            %
All-α          13
All-β          31
α/β            11
α+β            13
Small domains  29
Others          3
•99 % of the disulfide bonds are intra-domain
Prediction of the disulfide-bonding state of cysteines in proteins
Problem no 1: Starting from the protein sequence, can we discriminate whether a cysteine residue is disulfide-bonded?
NGDQLGIKSKQEALCIAARRNLDLVLVAP
Bonded or non-bonded?
Perceptron (input: sequence profile)
Plotting the trained weights
[Figure: Hinton's plot of the trained weights for the bonding state and for the non-bonding state; one axis: residue (V L I M F W Y G A P S T C H R K Q E N D 0 & #), other axis: position in the window (-5 to 5)]
Is it possible to add a syntax?
[Figure: a model with Begin and End states and four cysteine states; 1 and 3 are free states, 2 and 4 are bonded states]
A path
Residue   State   Bonding state
C40       1       F
C43       2       B
C68       4       B
P(seq) = P(1 | Begin) P(C40 | 1) ... P(2 | 1) P(C43 | 2) ... P(4 | 2) P(C68 | 4) ... P(End | 4)
4 possible paths
Residue   Path 1   Path 2   Path 3   Path 4
C40       1 F      1 F      2 B      2 B
C43       1 F      2 B      4 B      3 F
C68       1 F      4 B      1 F      4 B
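Prediction of bonding state of cysteines
Hybrid system
MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...
[Figure: the windows W1, W2, W3 centred on the cysteines are fed to the neural network, which gives P(B|W1), P(F|W1), P(B|W2), P(F|W2), P(B|W3), P(F|W3); these values are used by the model with Begin, End, free Cys and bonded Cys states, and the Viterbi path gives the prediction for the residues C40, C43, C68]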
Prediction for Tryparedoxin
Residue   NN output B   NN output F   NN pred   HMM Viterbi path   HMM pred
C40       99            1             B         2                  B
C43       82            18            B         4                  B
C68       61            39            B         1                  F
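The effect of the syntax on the Tryparedoxin example can be reproduced with a small Viterbi search. This is a sketch under stated assumptions, not the published hybrid system: the allowed transitions are inferred from the four paths listed earlier, every allowed transition is given the same weight, and the neural network outputs are used directly as emission scores.

```python
# Transitions inferred from the listed paths (an assumption):
# states 1 and 3 are free, states 2 and 4 are bonded.
TRANSITIONS = {
    "Begin": [1, 2],
    1: [1, 2, "End"],
    2: [3, 4],
    3: [3, 4],
    4: [1, 2, "End"],
}
BONDED = {2, 4}

def viterbi(p_bonded):
    """Return the best state path given P(bonded) for each cysteine."""
    best = {"Begin": (1.0, [])}  # state -> (score, path ending there)
    for p_b in p_bonded:
        new_best = {}
        for prev, (score, path) in best.items():
            for state in TRANSITIONS[prev]:
                if state == "End":
                    continue
                emission = p_b if state in BONDED else 1.0 - p_b
                candidate = score * emission
                if state not in new_best or candidate > new_best[state][0]:
                    new_best[state] = (candidate, path + [state])
        best = new_best
    # Only states that can reach End may terminate a path.
    finals = [entry for state, entry in best.items()
              if "End" in TRANSITIONS[state]]
    return max(finals, key=lambda entry: entry[0])[1]

# NN outputs for C40, C43, C68 from the table above.
path = viterbi([0.99, 0.82, 0.61])
labels = ["B" if state in BONDED else "F" for state in path]
```

The network alone labels all three cysteines as bonded, which is impossible for intra-chain bonds; the model only admits paths with an even number of bonded cysteines, so the least confident call (C68) is turned into a free cysteine.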
Performance
Table I. Performance of the NN predictor (Neural Network, 20-fold cross validation)
Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    80.4   0.56   67.2   87.5   74.3   83.2   56.9
RD    80.1   0.56   67.2   87.6   75.7   82.2   49.7
Table II. Performance of the Hidden NN predictor (Hybrid system, 20-fold cross validation)
Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    88.0   0.73   78.1   93.3   86.3   88.8   84.0
RD    87.4   0.73   78.1   92.8   86.3   88.0   80.2
B = cysteine bonding state, F = cysteine free state. WD = whole database (969 proteins, 4136 cysteines). RD = reduced database, in which the chains containing only one cysteine are removed (782 proteins, 3949 cysteines).
Martelli PL, Fariselli P, Malaguti L, Casadio R. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 15:951-953 (2002)
Prediction of the connectivity of disulfide bonds in proteins
Problem no 2: When the bonding state of cysteines is known, can we predict the connectivity pattern of disulfide bonds?
Prediction of disulfide connectivity in proteins
Bovine trypsin inhibitor (6PTI)
Bonded cysteines along the sequence (N to C): 5 14 30 38 51 55
[Figure: connectivity pattern of the six cysteines]
Prediction of disulfide connectivity in proteins as a problem of maximum-weight perfect matching
Representation: the protein sequence (N to C) is mapped to an undirected weighted graph with V = 2B vertices (the number of cysteines) and E = 2B(2B-1)/2 undirected edges (the strength of the interaction, W)
[Figure: complete graph on Cys1, Cys2, Cys3, Cys4 with edge weights W12, W13, W14, W23, W24, W34]
From graph theory:
•It is not necessary to compute all the possible connectivity patterns, whose number is $\prod_{i=1}^{B} (2i-1)$
•Given a complete graph G = (2B, E), the matching with the maximum weight can be computed in $O(B^3)$ time with the Edmonds-Gabow algorithm*
* Gabow, H.N. (1975). Technical Report CU-CS-075-75, Dept. of Comp. Sci., Colorado University
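For a handful of cysteines the matching problem can be illustrated by plain enumeration. The sketch below is deliberately not the Edmonds-Gabow algorithm: it lists every perfect matching, which is only feasible for small B, and the edge weights are invented for the example.

```python
def perfect_matchings(vertices):
    """Yield every way of pairing up all the vertices (list of pairs)."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching

def best_matching(weights, n_cys):
    """Return the perfect matching with the highest total edge weight.

    weights maps (i, j) with i < j to the weight of the edge Cys i - Cys j.
    """
    vertices = list(range(1, n_cys + 1))
    return max(perfect_matchings(vertices),
               key=lambda m: sum(weights[pair] for pair in m))

# Hypothetical edge weights for four cysteines (B = 2).
W = {(1, 2): 0.1, (1, 3): 0.7, (1, 4): 0.2,
     (2, 3): 0.3, (2, 4): 0.8, (3, 4): 0.1}
best = best_matching(W, 4)
n_patterns = [sum(1 for _ in perfect_matchings(list(range(2 * b))))
              for b in (2, 3, 4)]
```

The number of patterns grows as 3, 15, 105, ... with B = 2, 3, 4, which is why a polynomial-time matching algorithm is preferred to enumeration as soon as the number of bonds increases.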
How to assign the costs (W) of the edges in the graph
[Figure: the complete graph on Cys1, Cys2, Cys3, Cys4 with edge weights W12, W13, W14, W23, W24, W34, shown next to the cysteines along the sequence (N to C)]
Assumption: for each cysteine, all its nearest neighbours along the sequence make contacts
[Figure: Cys i and Cys j along the chain (N to C) with their neighbours (Ni) and (Nj); all possible interactions using 1 nearest neighbour]
Frequency distribution of disulfide bonds with respect to sequence separation (726 proteins)
[Figure: frequency (%, 0 to 16) against sequence separation (0 to 450)]
Neural Networks for predicting the edge values
Input (212 nodes): each pair in the neighbours of 4 residues + sequence separation + no. of SS bonds (210 + 2 input nodes)
Hidden nodes (6 nodes)
Output (1 node): disulfide pair propensity (output = wij)
Accuracy (Qp) of EG vs NN
Chains B Random EG NN
158 2 0.333 0.46 0.68
153 3 0.067 0.17 0.21
103 4 0.009 0.11 0.20
44 5 0.001 0.00 0.02
The state of the art:
•Prediction of bonding states is quite satisfactory
•Prediction of connectivity needs to be improved
Prediction of Foldons
Piero Fariselli
The Folding Problem as a Mapping Problem
Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN (Nt to Ct)
3D structure
Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
We can collect from the PDB data base some 1500 chains of known structures from which to derive non redundant information relating sequence to:
• secondary structure
• structural and functional motifs
• 3D structure
[Figure: the multiple sequence alignment of ten sequences and the sequence profile derived from it, as a function of sequence position]
Evolutionary information
•Multiple Sequence Alignment (MSA) of similar sequences
•Sequence profile: for each position a 20-valued vector contains the aminoacidic composition of the aligned sequences.
Prediction of Initiation Sites of Protein Folding
The Folding Process
[Figure: The Unfolded Chain; The Early Stages of Folding: Initiation Sites; Folded Protein]
Frustration in proteins
• The simultaneous minimisation of all the interaction energies is impossible
The network architecture
[Figure: Input Window (..ALS.......QGFLLIARQPPFTYFTV......HW..), Input, Hidden and Output layers; the outputs are α and non-α]
The prediction efficiency of the network
Q2 = 0.85    Q(H) = 0.67     Q(nonH) = 0.93     Sovpred = 0.85
C = 0.63     Pc(H) = 0.80    Pc(nonH) = 0.86    Sovobs = 0.76
The conformation of residue R depends both on local (window W) and non-local (context C) interactions.
The convergence theorem ensures that: $O_i = P(\mathrm{Structure}_R = i \mid W)$
If, for some i, $O_i \approx 1$, then the structure of residue R depends mainly on W and only slightly on C
[Figure: the neural network reads the window W centred on residue R, embedded in the context C; outputs $O_\alpha$ and $O_{\mathrm{non}\text{-}\alpha}$]
Theoretical background
• Anfinsen's hypothesis:
$P(\sigma_i \mid W, C) = \delta(\sigma_i, \sigma_{nat}(W, C))$
• Averaging over all the contexts (performed by the NN):
$P(\sigma_i \mid W) = \sum_C P(\sigma_i \mid W, C) \, P(C)$
• When the pattern is self-stabilising (W dependent):
$P(\sigma_i \mid W, C) = P(\sigma_i \mid W)$
• Then Anfinsen's hypothesis can be cast in a local form:
$P(\sigma_i \mid W) = \sum_C P(\sigma_i \mid W, C) \, P(C) = \delta(\sigma_i, \sigma_{nat}(W))$
Relationship between the reliability index and the Shannon entropy
$S = -\sum_i O_i \log O_i$
[Figure: reliability index (0 to 1) against entropy (S5, 0 to 0.7)]
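The entropy of the network outputs, and a smoothed profile along the chain, can be sketched as follows. The natural logarithm and the reading of S5 as a 5-residue moving average are assumptions of this example, and the output values are invented.

```python
import math

def shannon_entropy(outputs):
    """S = -sum_i O_i ln O_i for one residue; zero outputs contribute 0."""
    return -sum(o * math.log(o) for o in outputs if o > 0)

def smooth(values, width=5):
    """Moving average; windows are truncated at the ends of the chain."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical (O_alpha, O_non-alpha) outputs for five residues.
outputs = [(0.5, 0.5), (0.9, 0.1), (0.99, 0.01), (1.0, 0.0), (0.8, 0.2)]
entropy_profile = [shannon_entropy(o) for o in outputs]
s5_profile = smooth(entropy_profile, width=5)
```

An output close to (1, 0) gives an entropy close to 0 and an output close to (0.5, 0.5) gives the maximum ln 2, so low entropy marks the residues whose structure is determined mostly by the local window.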
[Figure: network with INPUT window (MAS..... QLMLKDFLNRTPL.........GHI) and outputs $O_\alpha$ and $O_{\mathrm{non}\text{-}\alpha}$]
Entropy = Shannon entropy in (ln 2)/10 units ($S = -\sum_i o_i \ln o_i$)
NC = number of protein segments correctly predicted in α-helix
NT = total number of protein segments predicted in α-helix
Protein segments correctly predicted in α-helical structure
[Figure: NC / NT (%, 0 to 100) as a function of entropy (1 to 9) and segment length (1 to 29)]
Profile of the smoothed entropy (S5) for the hen egg lysozyme (132L)
[Figure: S5 (0 to 0.7) along the protein chain (residues 1 to 121), with the predicted helices and the extracted fragments marked; structure of hen egg lysozyme (132L) with N-terminus and C-terminus labelled]
Frequency distribution of predicted helical segments as a function of their entropy value
[Figure: frequency of correct and wrong predictions, and their differences, against entropy (S5, 0.0 to 0.7), with the threshold value marked]
An example of the data base of minimally frustrated protein fragments
http://www.biocomp.unibo.it/DB/
Training set from PDB
Number of proteins   Number of amino acids   Number of α-helices            Average length
822                  174191                  4783                           11.6
Data base of minimally frustrated α-helical segments
Number of proteins   Number of amino acids   Number of α-helical segments   Average length
626                  21553                   3000                           7.2
Comparison of minimally frustrated segments with putative folding initiation sites experimentally determined
*Not yet experimentally detected
Comparison of minimally frustrated segments with peptides extracted from proteins
Code*    Peptides*                     % Helix in solution*   Entropy (S5)   Extracted segment
3FXC     TYKVTELINEAEGINETIDCDD        1                      #####          ####
3LZM     GFTNSLRMLQQKRWDEAVNLAKS       10                     0.262          WDEAVNL
"                                      10                     0.329          LRMLQQK
3LZM-2   GVAGFTNSLRMLQQKRWDEAAVNLAKS   12                     0.203          SLRMLQ
"                                      12                     0.210          DEAAVNL
CIII     ESLLERITRKLRDGWKRLIDIL        8                      0.171          LLERIT
"                                      8                      0.260          WKRLID
CIII-L   ESLLERITRKL                   15                     0.171          LLERIT
CIII-R   RDGWKRLIDIL                   4                      0.260          WKRLID
CIII-M   RITRKLRDGWK                   2                      ####           ####
Sigma    KVATTKAQRKLFFNLRKTKQRL        9                      0.218          TKAQRK
COMA1    DHPAVMEGTKTILETDSNLS          4                      ####           ####
COMA2    EPSEQFIKQHDFSSY               3                      ####           ####
COMA3    VNGMELSKQILQENPH              6                      0.189          LSKQILQ
COMA4    EVEDYFEEAIRAGLH               20                     0.020          YFEEAIR
COMA5    KEKITQYIYHVLNGEIL             3                      ####           ####
ARA1     AVGKSNLLSRYARNEFSA            2                      ####           ####
ARA2     RFRAVTSAYYRGAVG               3                      ####           ####
ARA3     TRRTTFESVGRWLDELKIHSD         7.5                    0.194          SVGRWL
ARA4     AVSVEEGKALAEEEGLF             4                      ####           ####
ARA5     STNVKTAFEMVILDIYNNV           3                      ####           ####
G1       DTYKLILNGKTLKGETTTEA          2                      ####           ####
G2       GDAATAEKVFKKIANDNGVD          4                      ####           ####
G3       GEWTYDDATKTFTVTE              2                      ####           ####
* Muñoz and Serrano, 1994.
Minimally frustrated α-helical segments are useful for determining:
• Folding initiation sites
• α-helix stability
• de-novo design of α-helices
Structure prediction of membrane proteins
Inner Membrane proteins (all α-Transmembrane proteins)
Outer Membrane proteins (all β-Transmembrane proteins)
[Figure: Porin (Rhodobacter capsulatus), a β-barrel in the outer membrane, and Bacteriorhodopsin (Halobacterium salinarum), α-helices in the inner membrane, both embedded in the bilayer]
Predictors of the Topology of Membrane Proteins
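Topography: position of transmembrane segments along the sequence
Topology: position of N and C termini with respect to the bilayer
[Figure: a membrane protein crossing the lipidic bilayer, with the In and Out sides, the N and C termini and the positive charges (+) marked]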
Prediction of transmembrane segments
Neural Network for the prediction of TMS in β-barrel membrane proteins (Jacoboni et al., 2001)
Input: sequence profile, window of 9 residues
5 hidden neurons
2 output neurons: TM, nonTM
A generic model for membrane proteins (TMHMM)
[Figure: model with the states Begin, Inner Side, Transmembrane, Outer Side, End]
Sequence-profile-based HMM
Instead of a sequence of characters $c_t$ (e.g. ..A C L P R P E T ...), the model reads a sequence of A-dimensional vectors $s_t$ (the columns of the sequence profile)
Constraints:
$0 \le s_t(n) \le S \quad \forall t, n$ (S = 100)
$\sum_{n=1}^{A} s_t(n) = S \quad \forall t$
For proteins A = 20
Martelli et al., Bioinformatics 18, S46-53, 2002
The new algorithms make possible:
•to feed HMMs with sequence profiles
•to couple NNs and HMMs, where useful (Hidden Neural Networks)
Advantages:
•Higher performance than standard HMMs
•Increased discrimination capability of a given class
Martelli et al., Bioinformatics, 2002; Martelli et al., Protein Eng., 2002
Prediction of the Topology of α-Transmembrane Proteins
Topography: position of transmembrane helices along the sequence
Topology: position of N and C termini with respect to the bilayer
[Figure: an α-helical membrane protein crossing the bilayer, with the In and Out sides, the N and C termini and the positive charges (+) marked]
The prediction accuracy of topography is 92 %
The prediction accuracy of topology is 81 %
Prediction of the Topology of β-Transmembrane Proteins
Topography: position of transmembrane strands along the sequence
Topology: position of N and C termini with respect to the bilayer
[Figure: a β-barrel membrane protein crossing the bilayer, with the LPS (Out) and Periplasmic (In) sides, the N and C termini and the positive charges (+) marked]
The prediction accuracy of topography is 73 %
The prediction accuracy of topology is 73 %
The discriminative capability of the HMM model
I(s | M) = -1/L log P(s | M)
[Figure: percentage (0 to 100) of outer membrane, globular and inner membrane proteins against I(s | M) (2.75 to 2.95)]
An application: modeling the 3D structure of eukaryotic β-barrel proteins
3D structure prediction of proteins
[Figure: methods as a function of homology (0 to 100 %): ab initio prediction for new folds; threading / fold recognition and building by homology for existing folds; the position of membrane proteins is marked]
2omf_.seq/ AEIYNKDGNK VDLYGKAVGL HYFSKGNGEN SYGGNGDMTY ARLGFKGETQ
2omf_.str/ CCCCCCCCEE EEEEEEEEEE EEECCCCCCC CCCCCCCCCE EEEEEEEEEE
protx.str/ *******CCC CCCCEEEEEE EEEC****** ********CE EEEEEEEECC
protx.seq/ *******KGY NFGLWKLDLK TKTS****** ********SG IEFNTAGHSN

2omf_.seq/ I*NSDLTGYG QWEYNFQGNN SEGADAQTGN KTRLAFAGLK YADVGSFDYG
2omf_.str/ C*CCCEEEEE EEEEEEECCC CCCCCCCCCC EEEEEEEEEE ECCCEEEEEE
protx.str/ CCCCCEEEEE EEEEEEC*** ********** EEEEEEEEEC CCCCCEEEEE
protx.seq/ QESGKVFGSL ETKYKVK*** ********** DYGLTLTEKW NTDNTLFTEV

2omf_.seq/ RNYGVVYDAL GYTDMLPEFG GDTAYSDDFF VGRVGGVATY RNSNFFGLVD
2omf_.str/ ECCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCEEEE EECCCCCCCC
protx.str/ EEEECC**** ********** ********** **CCEEEEEE EEECCCCCCC
protx.seq/ AVQDQL**** ********** ********** **LEGLKLSL EGNFAPQSGN

2omf_.seq/ GLNFAVQYLG KNER****** *********D TARRSNGDGV GGSISYEYE*
2omf_.str/ CEEEEEEEEC CCCC****** *********C CCCCCCCCEE EEEEEEEEC*
protx.str/ EEEEEEEEEE EEEECCCCCC CCCCCCCEEE EEEEEEEEEE EEEEEEECCC
protx.seq/ KNGKFKVAYG HENVKADSDV NIDLKGPLIN ASAVLGYQGW LAGYQTAFDT

2omf_.seq/ **GFGIVGAY GAADRTNLQE AQPLGNGKKA EQWATGLKYD ANNIYLAANY
2omf_.str/ **CEEEEEEE EEEECCCCCC CCCCCCCCEE EEEEEEEEEE ECCEEEEEEE
protx.str/ CCEEEEEEEE EEEEEEEEEE EEECCCCCCC EEEEEEEEEE CEEEEEEEEE
protx.seq/ QQSKLTTNNF ALGYTTKDFV LHTAVNDGQE FSGSIFQRTS DKLDVGVQLS

2omf_.seq/ GETRNATPIT NKFTNTSGFA NKTQDVLLVA QYQFDFGLRP SIAYTKSKAK
2omf_.str/ EEEECCCCCC CCCCCCCCCC CEEEEEEEEE EEECCCCEEE EEEEEEEEEE
protx.str/ EEECC***** ********** *CCCEEEEEE EEECCCCEEE EEEEEEC***
protx.seq/ WASGT***** ********** *SNTKFAIGA KYQLDDDARV RAKVNNA***

2omf_.seq/ DVEGIGDVDL VNYFEVGATY YFNKNMSTYV DYIINQIDSD NKLGVGSDDT
2omf_.str/ CCCCCCCEEE EEEEEEEEEE ECCCCEEEEE EEEEECCCCC CCCCCCCCCE
protx.str/ *********E EEEEEEEEEE EC***EEEEE EEEEECCC** *****CCCCE
protx.seq/ *********S QVGLGYQQKL RT***GVTLT LSTLVDGK** *****NFNAG

2omf_.seq/ VAVGIVYQF* ***
2omf_.str/ EEEEEEEEE* ***
protx.str/ EEEEEEEEEE EC*
protx.seq/ GHKIGVGLEL EA*
Structural alignment of VDAC with the template
A low resolution 3D model of VDAC (the sequence from Neurospora crassa)
A low resolution 3D model of VDAC: location of mutated residues
Casadio et al., FEBS Lett 520:1-7 (2002)
Predictors of membrane protein structures can be used to filter genomes and find new membrane proteins without sequence homologues
FISHING NEW OUTER MEMBRANE PROTEINS IN GRAM-NEGATIVE BACTERIA
MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVYLAEEGGRKVSQLDWKFNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGGNMVDQDWMDSSNPGTWTDESRHPDTQLNYANEFDLNIKGWLLNEPNYRLGLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGERAIGYKQRFKMPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKDQNYYSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYSKNGAGIENYNFITTAGLKYTF
Signal peptides in protein sequences:
Proteins have intrinsic signals that govern their transport and localization in the cell: a hydrophobic secretion marker (or signal peptide)
Sequences of outer membrane proteins have signal peptides: the secretion marker is also a marker of outer membrane proteins
MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF
[Figure: the sequence divided into Signal Peptide and Mature protein by the Cleavage site]
Signal Peptide prediction
MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF
2 Neural Networks
SignalNet: predicts if a given residue position belongs to the Signal Peptide
CleavageNet: predicts if a given residue position is the cleavage site
SignalNet Accuracy
Organism        Window    C      Q2
Eukaryotes      15-1-15   0.83   0.95
Gram positive   15-1-15   0.79   0.92
Gram negative   11-1-11   0.78   0.92
CleavageNet Accuracy
Organism        Window    C      Q2
Eukaryotes      15-1-2    0.61   0.97
Gram positive   20-1-3    0.56   0.96
Gram negative   11-1-2    0.62   0.96
Comparison with SignalP
Organism                 SignalP   SPEP
Eukaryotes (+)           0.99      0.97
Eukaryotes (-)           0.85      0.94
Prokaryotes (+)          0.99      0.97
Prokaryotes (-)          0.93      0.96
Escherichia coli (+/-)   0.95      0.96
Performance of SignalNN on 2160 annotated proteins
                      Prediction
Annotation            With signal   Without signal   Total
With signal           205           45               250
Without signal        55            1855             1910
Total                 260           1900             2160
Correct predictions: 205 and 1855; wrong predictions: 55 and 45
Q2 = 96 %
Qsignal = 82 %     Qnon-signal = 97 %
Psignal = 78 %     Pnon-signal = 98 %
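The indices above follow from the four cells of the confusion matrix. The sketch below recomputes them from the counts given (205 and 1855 correct, 45 and 55 wrong), assuming that Q is the fraction of annotated proteins of a class that are recovered and P the fraction of predictions of a class that are correct; the variable names are illustrative.

```python
# Counts from the table: annotated with/without signal vs predicted.
signal_as_signal = 205    # annotated with signal, predicted with signal
signal_as_none = 45       # annotated with signal, predicted without
none_as_signal = 55       # annotated without signal, predicted with
none_as_none = 1855       # annotated without signal, predicted without

total = signal_as_signal + signal_as_none + none_as_signal + none_as_none
annotated_signal = signal_as_signal + signal_as_none
annotated_none = none_as_signal + none_as_none
predicted_signal = signal_as_signal + none_as_signal
predicted_none = signal_as_none + none_as_none

q2 = (signal_as_signal + none_as_none) / total   # overall accuracy
q_signal = signal_as_signal / annotated_signal   # coverage, with signal
q_none = none_as_none / annotated_none           # coverage, without signal
p_signal = signal_as_signal / predicted_signal   # precision, with signal
p_none = none_as_none / predicted_none           # precision, without signal
```

Recomputing the indices from the raw counts is a quick consistency check on a reported table: the row and column totals (250, 1910, 260, 1900) must all be recovered from the four cells.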
Predictors of Membrane Topography: Rate of false positives
The predictors are tested on 809 globular proteins with sequence identity < 25 %:
0.5 % have at least 1 α-TM helix predicted
5.6 % have at least 2 β-TM strands predicted
HUNTER
Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter
[Figure: decision tree applied to each protein of the PROTEOME; the nodes test for a signal peptide and for all-β TM and all-α TM topology (Yes / No), and the leaves are all-β TM, all-α TM and Globular]
* the number of new proteins predicted in the class with Hunter, out of the non-annotated region