Prediction of structural and functional features in proteins starting from the residue sequence INTRODUCTION TO NEURAL NETWORKS


MAPPING PROBLEMS: Secondary structure

Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

3D structure (from Nt to Ct)

MAPPING PROBLEMS: Topology of transmembrane proteins

Topography: position of transmembrane segments along the sequence

Outer membrane: β-barrel, e.g. Porin (Rhodobacter capsulatus)

Inner membrane: α-helices, e.g. Bacteriorhodopsin (Halobacterium salinarum)

ALALMLCMLTYRHKELKLKLKK

First generation methods: single residue statistics

Propensity scales

For each residue

•The association between each residue and the different features is statistically evaluated

•Physical and chemical features of residues

A propensity value for any structure can be associated to any residue

HOW?

Secondary structure: Chou-Fasman propensity scale

Given a set of known structures we can count how many times a residue is associated to a structure.

Example:

ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL
hhhhhhhhccccccccccccchhhhhhhhhhhhhhhhhhh

N(A,h) = 7, N(A,c) = 1, N = 40

P(A,h) = 7/40, P(A,c) = 1/40

Is that enough for estimating a propensity?





We need to estimate how independent the residue-to-structure association is.

P(h) = 27/40, P(c) = 13/40





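If the structure is independent of the residue: P(A,h) = P(A) P(h)

The ratio P(A,h) / (P(A) P(h)) is the propensity

In the example, P(A) = 8/40, so the propensity of A for h is (7/40) / ((8/40) (27/40)) ≈ 1.30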
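The ratio above can be computed directly from the counts. The following is a minimal sketch in Python that uses the example chain of this section; the function name `propensity` is only illustrative, and a real scale would be estimated on a large set of chains rather than on one.

```python
from collections import Counter


def propensity(sequence, structure, residue, state):
    """Return P(residue, state) / (P(residue) * P(state)) for one labelled chain."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    n = len(sequence)
    joint = Counter(zip(sequence, structure))   # N(residue, state) for every pair
    p_joint = joint[(residue, state)] / n       # P(A,h)
    p_residue = sequence.count(residue) / n     # P(A)
    p_state = structure.count(state) / n        # P(h)
    return p_joint / (p_residue * p_state)


# Example chain from the slide: 8 helix, 13 coil and 19 helix positions.
seq = "ALAKSLAKPSDTLAKSDFREKWEWLKLLKALACCKLSAAL"
ss = "h" * 8 + "c" * 13 + "h" * 19

print(round(propensity(seq, ss, "A", "h"), 2))
print(round(propensity(seq, ss, "A", "c"), 2))
```

A value above 1 means that the residue is found in that structure more often than expected under independence, a value below 1 less often.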

Given a LARGE set of examples, a propensity value can be computed for each residue and each structure type

Name            P(H)   P(E)
Alanine         1.42   0.83
Arginine        0.98   0.93
Aspartic Acid   1.01   0.54
Asparagine      0.67   0.89
Cysteine        0.70   1.19
Glutamic Acid   1.51   0.37
Glutamine       1.11   1.10
Glycine         0.57   0.75
Histidine       1.00   0.87
Isoleucine      1.08   1.60
Leucine         1.21   1.30
Lysine          1.14   0.74
Methionine      1.45   1.05
Phenylalanine   1.13   1.38
Proline         0.57   0.55
Serine          0.77   0.75
Threonine       0.83   1.19
Tryptophan      1.08   1.37
Tyrosine        0.69   1.47
Valine          1.06   1.70


Given a new sequence a secondary structure prediction can be obtained by plotting the propensity values for each structure, residue by residue

Considering three secondary structures (H, E, C), the overall accuracy, as evaluated on an uncorrelated set of sequences with known structure, is very low: Q3 = 50-60 %

Residue   T     S    P    T     A     E     L     M     R    S    T     G
P(H)      83    77   57   83    142   151   121   145   98   77   83    57
P(E)      119   75   55   119   83    37    130   105   93   75   119   75
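The residue-by-residue lookup described above can be sketched as follows. The dictionary copies the P(H) and P(E) values of the propensity table with one-letter codes; the decision rule in `naive_prediction` (take the larger propensity if it exceeds 1, otherwise coil) is an assumption made only for this illustration and is not the full Chou-Fasman procedure.

```python
# (P(H), P(E)) values copied from the propensity table, keyed by one-letter code.
CHOU_FASMAN = {
    "A": (1.42, 0.83), "R": (0.98, 0.93), "D": (1.01, 0.54), "N": (0.67, 0.89),
    "C": (0.70, 1.19), "E": (1.51, 0.37), "Q": (1.11, 1.10), "G": (0.57, 0.75),
    "H": (1.00, 0.87), "I": (1.08, 1.60), "L": (1.21, 1.30), "K": (1.14, 0.74),
    "M": (1.45, 1.05), "F": (1.13, 1.38), "P": (0.57, 0.55), "S": (0.77, 0.75),
    "T": (0.83, 1.19), "W": (1.08, 1.37), "Y": (0.69, 1.47), "V": (1.06, 1.70),
}


def propensity_profile(sequence):
    """Return the list of (P(H), P(E)) pairs, one per residue."""
    return [CHOU_FASMAN[residue] for residue in sequence]


def naive_prediction(sequence):
    """Assumed toy rule: H or E when that propensity is the larger one and > 1, else C."""
    states = []
    for p_h, p_e in propensity_profile(sequence):
        if p_h > 1.0 and p_h >= p_e:
            states.append("H")
        elif p_e > 1.0:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


for residue, (p_h, p_e) in zip("TSPTAELMRSTG", propensity_profile("TSPTAELMRSTG")):
    print(residue, round(100 * p_h), round(100 * p_e))
```

Because every position is scored on its own, the sequence context is ignored, which is one reason for the low accuracy of single residue statistics.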


http://www.expasy.ch/cgi-bin/protscale.pl



Transmembrane alpha-helices: Kyte-Doolittle scale

It is computed taking into consideration the octanol-water partition coefficient, combined with the propensity of the residues to be found in known transmembrane helices

Ala: 1.800 Arg: -4.500 Asn: -3.500 Asp: -3.500 Cys: 2.500 Gln: -3.500 Glu: -3.500 Gly: -0.400 His: -3.200 Ile: 4.500 Leu: 3.800 Lys: -3.900 Met: 1.900 Phe: 2.800 Pro: -1.600 Ser: -0.800 Thr: -0.700 Trp: -0.900 Tyr: -1.300 Val: 4.200
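A scale of this kind is usually applied by averaging the values over a sliding window. The sketch below assumes a window of 19 residues, a common choice for transmembrane helices that is not stated on the slide; the function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle values from the list above, keyed by one-letter code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy for every full window; one value per window centre."""
    if window % 2 == 0 or window < 1:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    profile = []
    for centre in range(half, len(sequence) - half):
        segment = sequence[centre - half:centre + half + 1]
        profile.append(sum(KYTE_DOOLITTLE[r] for r in segment) / window)
    return profile


# Toy sequence (hypothetical): a hydrophobic stretch flanked by charged residues.
toy = "KKKKDDDD" + "LLLLIIIIVVVVLLLLIII" + "KKKKDDDD"
profile = hydropathy_profile(toy)
print(max(profile))
```

Windows whose average is high are candidate transmembrane helices; the threshold to apply is a further choice that this sketch leaves open.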

Second generation methods: GOR

The structure of a residue in a protein strongly depends on the sequence context

It is possible to estimate the influence of a residue in determining the structure of a residue close to it along the sequence. Usually windows from [-8, +8] to [-13, +13] are considered.

Coefficients P(A,s,i) estimate the contribution of the residue A in determining the structure s for a residue that is i positions apart along the sequence

Secondary structure: GOR method

Q3 = 65 % (Considering three secondary structures (H,E,C), and evaluating the overall accuracy on an uncorrelated set of sequences with known structure)

The contribution of each position in the window is independent of the other ones. No correlation among the positions in the window is taken into account.
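The additive scheme described above can be sketched as a sum of window coefficients, one independent term per position. The coefficient values below are invented for illustration only; real P(A,s,i) coefficients are estimated from a database of known structures.

```python
def gor_predict(sequence, coefficients, half_window=8, states="HEC"):
    """Predict one state per residue by summing independent window contributions.

    coefficients maps (residue, state, offset) to the contribution of `residue`
    to `state` for the position that lies `offset` places before it.
    """
    prediction = []
    for j in range(len(sequence)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                k = j + offset
                if 0 <= k < len(sequence):
                    # Each window position adds its own term; no pair terms.
                    total += coefficients.get((sequence[k], state, offset), 0.0)
            scores[state] = total
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)


# Hypothetical coefficients, chosen only to make the example readable.
toy_coefficients = {
    ("A", "H", 0): 1.0, ("A", "H", 1): 0.5, ("A", "H", -1): 0.5,
    ("V", "E", 0): 1.0,
    ("G", "C", 0): 1.0,
}
print(gor_predict("AAGVV", toy_coefficients, half_window=1))
```

Since the score is a plain sum, any correlation between two positions of the window cannot be represented, which is the limitation stated above.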

A more efficient method: Neural Networks

Alternative computing algorithm: analogies with the computation in the nervous system.

1) The nervous system consists of elementary computing units: neurons

2) The electric signal flows in a determined direction (dendrites -> axon) (Principle of dynamic polarization)

3) There is no cytoplasmic continuity among the neurons. Each neuron specifically communicates with some neighboring neurons by means of synapses (Principle of connective specificity)

Tools out of machine learning approaches

Neural Networks can learn the mapping from sequence to secondary structure

[Figure: a data base subset with known mapping (sequence TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN, structure EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE) is used for training; the trained network extracts general rules and gives the prediction for a new sequence.]

Neural network for secondary structure prediction

Sequence: M P I L K Q K P I H Y H P N H G E A K G

Input (window K P I H Y H P N H):

    K  P  I  H  Y  H  P  N  H
A   0  0  0  0  0  0  0  0  0
C   0  0  0  0  0  0  0  0  0
D   0  0  0  0  0  0  0  0  0
E   0  0  0  0  0  0  0  0  0
F   0  0  0  0  0  0  0  0  0
G   0  0  0  0  0  0  0  0  0
H   0  0  0  1  0  1  0  0  1
I   0  0  1  0  0  0  0  0  0
K   1  0  0  0  0  0  0  0  0
L   0  0  0  0  0  0  0  0  0
M   0  0  0  0  0  0  0  0  0
N   0  0  0  0  0  0  0  1  0
P   0  1  0  0  0  0  1  0  0
Q   0  0  0  0  0  0  0  0  0
R   0  0  0  0  0  0  0  0  0
S   0  0  0  0  0  0  0  0  0
T   0  0  0  0  0  0  0  0  0
V   0  0  0  0  0  0  0  0  0
W   0  0  0  0  0  0  0  0  0
Y   0  0  0  0  1  0  0  0  0

Input alphabet: ACDEFGHIKLMNPQRSTVWY.

Usually: input 17-23 residues, hidden neurons 4-15

Output: H, E, L

Example of residues with their structure: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E)
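The input coding shown above can be sketched in a few lines. The 21-symbol alphabet is the one listed on the slide; using its last symbol "." for window positions that fall outside the chain is an assumption of this sketch, as is the function name `encode_window`.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."  # "." is assumed to mark positions outside the chain


def encode_window(sequence, centre, window=17):
    """Return the one-hot input vector for the window centred on `centre` (0-based)."""
    half = window // 2
    vector = []
    for position in range(centre - half, centre + half + 1):
        symbol = sequence[position] if 0 <= position < len(sequence) else "."
        column = [0] * len(ALPHABET)
        column[ALPHABET.index(symbol)] = 1   # exactly one active input per position
        vector.extend(column)
    return vector


sequence = "MPILKQKPIHYHPNHGEAKG"
# Window of 9 residues centred on Y (index 10): K P I H Y H P N H, as on the slide.
inputs = encode_window(sequence, centre=10, window=9)
print(len(inputs), sum(inputs))
```

With a window of 17 to 23 residues, the input layer therefore has a few hundred binary units, of which one per position is switched on.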

Third generation methods: evolutionary information

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
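A profile of this kind is the column-by-column residue composition of the alignment, expressed as a percentage. The sketch below uses a small made-up alignment rather than the one on the slide, and it ignores gap symbols when normalising each column, which is an assumption of the sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(alignment):
    """Return one {residue: percentage} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in range(length):
        # Gaps and unknown symbols are skipped (assumption of this sketch).
        residues = [row[column] for row in alignment if row[column] in AMINO_ACIDS]
        total = len(residues)
        profile.append({
            aa: (100.0 * residues.count(aa) / total if total else 0.0)
            for aa in AMINO_ACIDS
        })
    return profile


# Hypothetical toy alignment with four sequences and three columns.
toy_alignment = ["YK-", "YR-", "GK-", "TKA"]
toy_profile = sequence_profile(toy_alignment)
print({aa: value for aa, value in toy_profile[0].items() if value > 0})
```

Each column of the profile replaces the single active input of the one-hot coding with 20 graded values, which is how evolutionary information enters the network.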

Position

SeqNo  No    V   L   I   M   F   W   Y   G   A   P   S   T   C   H   R   K   Q   E   N   D
    1   1   80   0  20   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    2   2    0   0   0   0   0   0   0   0   0  20   0   0   0   0   0   0   0   0   0  80
    3   3   50   0   0   0   0   0   0   0  33   0   0   0   0   0   0   0   0  17   0   0
    4   4    0   0   0   0   0   0   0   0  13  63  13   0   0   0   0   0   0  13   0   0
    5   5   13   0   0   0   0   0   0  13  75   0   0   0   0   0   0   0   0   0   0   0
    6   6    0   0   0  13   0   0   0   0   0  13   0  13   0   0   0   0   0   0   0  63
    7   7    0   0   0  38   0   0   0  38   0   0   0   0   0   0   0  25   0   0   0   0
    8   8   25  13   0   0   0   0   0   0  50   0  13   0   0   0   0   0   0   0   0   0
    9   9    0  13  13   0   0   0   0   0   0  25   0   0   0   0   0  50   0   0   0   0
   10  10    0   0  25  13   0   0   0   0  13  13   0   0   0   0   0  38   0   0   0   0
   11  11    0   0   0   0   0   0   0   0  25   0   0   0   0   0   0  13  13   0   0  50
   12  12    0   0   0   0  43   0   0  29   0  29   0   0   0   0   0   0   0   0   0   0
   13  13    0  14  29   0   0   0   0   0  29   0   0   0   0   0   0   0   0  14   0  14
   14  14    0   0   0   0   0   0   0  43  29   0   0   0   0   0   0  29   0   0   0   0

The Network Architecture for Secondary Structure Prediction

The First Network (Sequence to Structure)

Input: the sequence profile (SeqNo, No, V L I M F W Y G A P S T C H R K Q E N D). Output: H, E, C

CCHHEHHHHCHHCCEECCEEEEHHHCC

The Second Network (Structure to Structure)

Input: the prediction of the first network (CCHHEHHHHCHHCCEECCEEEEHHHCC). Output: H, E, C

The cross validation procedure

[Figure: the protein set is split into training set 1 and testing set 1; the split is then repeated with different training and testing sets.]
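A cross validation split can be sketched as below. The round-robin assignment of proteins to folds is an assumption of this sketch; in practice the folds should also be built so that similar proteins never end up on both sides of a split.

```python
def k_fold_splits(proteins, k):
    """Yield (training set, testing set) pairs; every protein is tested exactly once."""
    if not 2 <= k <= len(proteins):
        raise ValueError("k must be between 2 and the number of proteins")
    folds = [proteins[i::k] for i in range(k)]   # round-robin assignment to folds
    for i, testing in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, testing


# Hypothetical protein identifiers.
protein_set = [f"protein_{n}" for n in range(10)]
for number, (training, testing) in enumerate(k_fold_splits(protein_set, 5), start=1):
    print(f"set {number}: train on {len(training)}, test on {testing}")
```

The accuracy reported for a predictor is then the score collected over all the testing sets, none of which was seen during the corresponding training.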

The Performance on the Task of Secondary Structure Prediction

Efficiency of the Neural Network-Based Predictors on the 822 Proteins of the Testing Set

INPUT                           Q3 (%)  SOV   Q[H]  Q[E]  Q[C]  P[H]  P[E]  P[C]  C[H]  C[E]  C[C]
Single sequence                 66.3    0.62  0.69  0.61  0.66  0.70  0.54  0.71  0.54  0.44  0.45
Multiple sequence (MaxHom)      72.4    0.69  0.75  0.65  0.75  0.77  0.64  0.73  0.64  0.54  0.53
Multiple sequence (PSI-BLAST)   73.4    0.70  0.75  0.70  0.73  0.80  0.63  0.75  0.67  0.56  0.53

Combining different networks: Q3 = 76-78 %
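Some of the indexes above can be computed from a pair of observed and predicted strings. In this sketch Q3 is the fraction of correctly predicted residues, Q[s] is the fraction of residues observed in state s that are predicted in s, and P[s] is the fraction of residues predicted in s that are observed in s; these definitions are assumed here, since the slide does not spell them out. SOV and the correlation coefficients C[s] are not covered.

```python
def q3(observed, predicted):
    """Fraction of residues whose predicted state equals the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have the same length")
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def class_scores(observed, predicted, state):
    """Return (Q[state], P[state]) under the definitions assumed in the lead-in."""
    correct = sum(o == p == state for o, p in zip(observed, predicted))
    n_observed = observed.count(state)
    n_predicted = predicted.count(state)
    q = correct / n_observed if n_observed else 0.0
    p = correct / n_predicted if n_predicted else 0.0
    return q, p


# Hypothetical toy strings, not taken from the testing set.
observed = "HHHEECCC"
predicted = "HHEEECCH"
print(q3(observed, predicted))
for state in "HEC":
    print(state, class_scores(observed, predicted, state))
```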

Secondary Structure Prediction

From sequence

TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

To secondary structure

EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

And to the reliability of the prediction

7997688899999988776886778999887679956889999999

SERVERS

PredictProtein: Burkhard Rost (Columbia Univ.) http://cubic.bioc.columbia.edu/predictprotein/

PsiPRED: David Jones (UCL) http://bioinf.cs.ucl.ac.uk/psipred/

JPred: Geoff Barton (Dundee Univ.)

SecPRED: http://www.biocomp.unibo.it

Chameleon sequences

QEALEIA

1TIF: Translation Initiation Factor 3, Bacillus stearothermophilus: ……GIKSKQEALEIAARRN……

1WTUA: Transcription Factor 1, Bacteriophage Spo1: ……FNPQTQEALEIAPSVGV……

From a set of 822 non-homologous proteins (174,192 residues) we extract:

• 2,452 5-mer chameleons

• 107 6-mer chameleons

• 16 7-mer chameleons

• 1 8-mer chameleon

2,576 couples in total

The total number of residues in chameleons is 26,044 out of 755 protein chains (~15%)
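The extraction can be sketched as a search for identical k-mers that carry different secondary structures in different chains. That working definition of a chameleon is an assumption of this sketch, and the structure strings below are illustrative labels, not the real assignments of the two proteins.

```python
from collections import defaultdict


def find_chameleons(chains, k):
    """Return {k-mer: set of structures} for k-mers seen with more than one
    structure in more than one chain. `chains` maps a name to (sequence, structure)."""
    seen = defaultdict(set)
    for name, (sequence, structure) in chains.items():
        for start in range(len(sequence) - k + 1):
            seen[sequence[start:start + k]].add((name, structure[start:start + k]))
    chameleons = {}
    for kmer, hits in seen.items():
        names = {name for name, _ in hits}
        structures = {structure for _, structure in hits}
        if len(names) > 1 and len(structures) > 1:
            chameleons[kmer] = structures
    return chameleons


# Sequence fragments from the slide; the structure strings are hypothetical.
chains = {
    "1TIF": ("GIKSKQEALEIAARRN", "CCCCCHHHHHHHCCCC"),
    "1WTUA": ("FNPQTQEALEIAPSVGV", "CCCCCCCCCCCCCCCCC"),
}
print(find_chameleons(chains, 7))
```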

Prediction of the Secondary Structure of Chameleon sequences with Neural Networks

NGDQLGIKSKQEALEIAARRNLDLVLVAP

ARKGFNPQTQEALEIAPSVGVSVKPG

QEALEIA
HHHHHHH

QEALEIA
CCCCCCC

The Prediction of Chameleons with Neural Networks

Other neural network-based predictors

• Secondary structure

• Topology of transmembrane proteins

• Cysteine bonding state

• Contact maps of proteins

• Interaction sites on protein surface

Prediction of the cysteine bonding state

Tryparedoxin-I from Crithidia fasciculata (1QK8)

Cys40

Cys43

Cys68

Free cysteines

Disulphide bonded cysteines

MSGLDKYLPGIEKLRRGDGEVEVKSLAGKLVFFYFSASWCPPCRGFTPQLIEFYDKFHES KNFEVVFCTWDEEEDGFAGYFAKMPWLAVPFAQSEAVQKLSKHFNVESIPTLIGVDADSG DVVTTRARATLVKDPEGEQFPWKDAP

A neural network-based method for predicting the disulfide connectivity in proteins

The Protein Folding

TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA

Disulfide bonds

2-SH -> -SS- + 2H+ + 2e-

S-S distance 2.2 Å

Torsion angle C-S-S-C 90°

Bond Energy 3 Kcal/mol


Intra-chain disulfide bonds in proteins

Of 1259 proteins (a non redundant PDB subset):

• 23% of the chains have disulfide bonds (SS)

• SS distribution (between secondary structures):

%    H    E    C
H    7    9    14
E         17   27
C              26

Distribution of disulfide bonds in the SCOP domains

Type            %
All-α           13
All-β           31
α/β             11
α+β             13
Small domains   29
Others          3

• 99 % of the disulfide bonds are intra-domain

Prediction of the disulfide-bonding state of cysteines in proteins

Problem no 1: Starting from the protein sequence, can we discriminate whether a cysteine residue is disulfide-bonded?

NGDQLGIKSKQEALCIAARRNLDLVLVAP

Bonded or non bonded?

Perceptron (input: sequence profile)

Plotting the trained weights

[Figure: Hinton's plot of the trained weights for the bonding state and for the non bonding state. Axes: residue (V L I M F W Y G A P S T C H R K Q E N D 0 & #) versus position (-5 to 5).]

Is it possible to add a syntax?

[Figure: a model with Begin, End and four states (1, 2, 3, 4), divided into bonded states and free states, that reads the cysteine residues.]

A path

Residue   State   Bonding state
C40       1       F
C43       2       B
C68       4       B

P(seq) = P(1 | Begin) P(C40 | 1) ... P(2 | 1) P(C43 | 2) ... P(4 | 2) P(C68 | 4) ... P(End | 4)

4 possible paths

Path   C40     C43     C68
1      1 (F)   1 (F)   1 (F)
2      1 (F)   2 (B)   4 (B)
3      2 (B)   4 (B)   1 (F)
4      2 (B)   3 (F)   4 (B)
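The four paths can be reproduced by enumerating the walks through the model. The transition table below is an assumption: it was chosen so that the paths shown on the slides are allowed and the number of bonded cysteines is always even, and it is not copied from the original model.

```python
# Assumed transitions; states 1 and 3 are free (F), states 2 and 4 are bonded (B).
ALLOWED = {
    "Begin": (1, 2),
    1: (1, 2, "End"),
    2: (3, 4),
    3: (3, 4),
    4: (1, 2, "End"),
}
LABEL = {1: "F", 2: "B", 3: "F", 4: "B"}


def allowed_paths(n_cysteines):
    """Return every state path of length n_cysteines that can reach End."""
    paths = []

    def extend(path):
        last = path[-1]
        if len(path) - 1 == n_cysteines:          # path[0] is "Begin"
            if "End" in ALLOWED[last]:
                paths.append(path[1:])
            return
        for following in ALLOWED[last]:
            if following != "End":
                extend(path + [following])

    extend(["Begin"])
    return paths


for path in allowed_paths(3):
    print(path, "".join(LABEL[state] for state in path))
```

With three cysteines only four of the eight bonding-state assignments survive, because an odd number of bonded cysteines cannot end the path.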

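Hybrid system: prediction of bonding state of cysteines

MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...

[Figure: for each cysteine the neural network reads a window (W1, W2, W3) and gives P(B|W1), P(F|W1); P(B|W2), P(F|W2); P(B|W3), P(F|W3). These values feed the model with free Cys and bonded Cys states between Begin and End, and the prediction is the Viterbi path.]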

Prediction for Tryparedoxin

Residue   NN output B   NN output F   NN pred   HMM Viterbi path   HMM pred
C40       99            1             B         2                  B
C43       82            18            B         4                  B
C68       61            39            B         1                  F
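The correction made by the hybrid system can be sketched with a small Viterbi search in which the network outputs play the role of the emission scores. The transition table is assumed (it only encodes which moves are allowed, with no transition probabilities), so this illustrates the idea and is not the published model.

```python
import math

# Assumed transitions; states 1 and 3 are free (F), states 2 and 4 are bonded (B).
ALLOWED = {"Begin": (1, 2), 1: (1, 2, "End"), 2: (3, 4), 3: (3, 4), 4: (1, 2, "End")}
LABEL = {1: "F", 2: "B", 3: "F", 4: "B"}


def viterbi(nn_outputs):
    """nn_outputs: one (P(B|W), P(F|W)) pair per cysteine. Returns the best state path."""
    def emission(state, output):
        p = output[0] if LABEL[state] == "B" else output[1]
        return math.log(p) if p > 0 else float("-inf")

    score = {state: emission(state, nn_outputs[0]) for state in ALLOWED["Begin"]}
    back_pointers = []
    for output in nn_outputs[1:]:
        new_score, pointers = {}, {}
        for previous, value in score.items():
            for following in ALLOWED[previous]:
                if following == "End":
                    continue
                candidate = value + emission(following, output)
                if following not in new_score or candidate > new_score[following]:
                    new_score[following] = candidate
                    pointers[following] = previous
        back_pointers.append(pointers)
        score = new_score
    # Only states that can reach End may close the path.
    finals = {state: value for state, value in score.items() if "End" in ALLOWED[state]}
    path = [max(finals, key=finals.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Network outputs for C40, C43 and C68 from the table, as probabilities.
best = viterbi([(0.99, 0.01), (0.82, 0.18), (0.61, 0.39)])
print(best, "".join(LABEL[state] for state in best))
```

The network alone labels all three cysteines as bonded; the path constraint turns the least confident one into a free cysteine.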


Performance

Table I. Performance of the NN predictor (20-fold cross validation): Neural Network

Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    80.4   0.56   67.2   87.5   74.3   83.2   56.9
RD    80.1   0.56   67.2   87.6   75.7   82.2   49.7

Table II. Performance of the Hidden NN predictor (20-fold cross validation): Hybrid system

Set   Q2     C      Q(B)   Q(F)   P(B)   P(F)   Q2prot
WD    88.0   0.73   78.1   93.3   86.3   88.8   84.0
RD    87.4   0.73   78.1   92.8   86.3   88.0   80.2

B = cysteine bonding state, F = cysteine free state. WD = whole database (969 proteins, 4136 cysteines). RD = reduced database, in which the chains containing only one cysteine are removed (782 proteins, 3949 cysteines).

Martelli PL, Fariselli P, Malaguti L, Casadio R. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 15:951-953 (2002)

Prediction of the connectivity of disulfide bonds in proteins

Problem no 2: When the bonding state of cysteines is known, can we predict the connectivity pattern of disulfide bonds?

Prediction of disulfide connectivity in proteins: Bovine trypsin inhibitor (6PTI)

[Figure: the sequence from N to C with the bonded cysteines 5, 14, 30, 38, 51, 55 and the connectivity pattern 5-55, 14-38, 30-51.]

Prediction of disulfide connectivity in proteins as a problem of maximum-weight perfect matching

Representation:

[Figure: the protein sequence from N to C with Cys1, Cys2, Cys3, Cys4 as vertices and W12, W13, W14, W23, W24, W34 as edges.]

The undirected weighted graph has V = 2B vertices (number of cysteines) and E = 2B(2B-1)/2 undirected edges (W is the strength of the interaction)

From the Graph Theory:

• It is not necessary to compute all the possible connectivity patterns, whose number is $\prod_{i \le B} (2i-1)$

• Given a complete graph G = (2B, E), the matching with the maximum weight can be computed in $O(B^3)$ time with the Edmonds-Gabow algorithm*

* Gabow, H.N. (1975). Technical Report, CU-CS-075-75, Dept. of Comp. Sci. Colorado University
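For a handful of cysteines the matching problem can be illustrated by brute force. The sketch below enumerates every perfect matching and keeps the heaviest one; it is a plain exhaustive search, not the Edmonds-Gabow algorithm, and the edge weights are invented for the example.

```python
def perfect_matchings(vertices):
    """Yield every way of pairing up the vertices (the list length must be even)."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for index, partner in enumerate(rest):
        remaining = rest[:index] + rest[index + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def best_matching(vertices, weights):
    """Return the perfect matching with the maximum total edge weight."""
    return max(
        perfect_matchings(vertices),
        key=lambda matching: sum(weights[frozenset(pair)] for pair in matching),
    )


# Hypothetical weights W_ij for four bonded cysteines (B = 2).
weights = {
    frozenset((1, 2)): 0.1, frozenset((1, 3)): 0.8, frozenset((1, 4)): 0.2,
    frozenset((2, 3)): 0.3, frozenset((2, 4)): 0.9, frozenset((3, 4)): 0.1,
}
print(best_matching([1, 2, 3, 4], weights))
print(len(list(perfect_matchings([1, 2, 3, 4, 5, 6]))))
```

The number of patterns grows as 1, 3, 15, 105, 945 for B = 1 to 5, which is why a polynomial-time matching algorithm is preferable to enumeration for larger B.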

How to assign the costs (W) of the edges in the graph

Assumption: for each cysteine all its sequence nearest neighbours make contacts

[Figure: Cys i and Cys j with their neighbours (Ni and Nj); all possible interactions using 1 nearest neighbour.]

Frequency distribution of disulfide bonds with respect to sequence separation (726 proteins)

[Figure: frequency (%), from 0 to 16, as a function of the sequence separation, from 0 to 450.]

Neural Networks for predicting the edge values

Input (212 nodes): each pair in the neighbours of 4 residues + sequence separation + number of SS bonds (210 + 2 input nodes)

Hidden nodes (6 nodes)

Output (1 node): disulfide pair propensity (output = wij)

Accuracy (Qp) of EG vs NN

Chains B Random EG NN

158 2 0.333 0.46 0.68

153 3 0.067 0.17 0.21

103 4 0.009 0.11 0.20

44 5 0.001 0.00 0.02

The state of the art:

•Prediction of bonding states is quite satisfactory

•Prediction of connectivity needs to be improved

Prediction of Foldons

Piero Fariselli

The Folding Problem as a Mapping Problem

Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

3D structure (from Nt to Ct)

We can collect from the PDB data base some 1500 chains of known structures from which to derive non redundant information relating sequence to:

• secondary structure

• structural and functional motifs

• 3D structure

Evolutionary information

• Multiple Sequence Alignment (MSA) of similar sequences

• Sequence profile: for each position a 20-valued vector contains the amino acid composition of the aligned sequences.

MSA:

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence profile (one column per sequence position):

A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0

The Early Stages of Folding: Initiation Sites

Prediction of Initiation Sites of Protein Folding

The Folding Process

[Figure: from the unfolded chain to the folded protein.]

Frustration in proteins

• The simultaneous minimisation of all the interaction energies is impossible

The network architecture

[Figure: an input window slides along the sequence (..ALS.......QGFLLIARQPPFTYFTV......HW..); the network has an input layer, a hidden layer and two outputs (α and non-α).]

The prediction efficiency of the network

Q2 = 0.85         C = 0.63
Q(H) = 0.67       Pc(H) = 0.80
Q(nonH) = 0.93    Pc(nonH) = 0.86
Sovpred = 0.85    Sovobs = 0.76

The conformation of residue R depends both on local (window W) and non local (context C) interactions.

The convergence theorem ensures that: $O_i = P(\mathrm{Structure}_R = i \mid W)$

If, for any i, $O_i \approx 1$, then the structure of residue R depends mainly on W and only slightly on C

[Figure: residue R inside the window W, surrounded by the context C; the neural network reads W and gives the outputs O(α) and O(non-α).]

Theoretical background

• Anfinsen's hypothesis:

$P(i \mid W, C) = \delta(i, i_{nat}(W, C))$

• Averaging over all the contexts (performed by NN):

$P(i \mid W) = \sum_{C} P(i \mid W, C)\, P(C)$

• When the pattern is self-stabilising (W dependent):

$P(i \mid W, C) = P(i \mid W)$

• Then the Anfinsen's hypothesis can be cast in a local form:

$P(i \mid W) = \sum_{C} P(i \mid W, C)\, P(C) = \delta(i, i_{nat}(W))$

Relationship between the reliability index and the Shannon entropy

$S = -\sum_i O_i \log O_i$

[Figure: reliability index, from 0 to 1, as a function of the entropy (S5), from 0 to 0.7.]
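The entropy of the network outputs can be computed as in the following sketch. The outputs are normalised to sum to 1 before the entropy is taken, and S5 is read here as a moving average over 5 residues; both choices are assumptions of the sketch.

```python
import math


def shannon_entropy(outputs):
    """S = -sum_i O_i ln(O_i), after normalising the outputs to sum to 1."""
    total = sum(outputs)
    probabilities = [o / total for o in outputs]
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def smoothed(values, window=5):
    """Moving average; windows are truncated at the two ends of the chain."""
    half = window // 2
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Hypothetical (helix, non-helix) outputs for five consecutive residues.
outputs = [(0.5, 0.5), (0.8, 0.2), (0.95, 0.05), (0.99, 0.01), (0.9, 0.1)]
entropies = [shannon_entropy(o) for o in outputs]
print([round(s, 3) for s in smoothed(entropies)])
```

A low entropy means that one output dominates, which is the situation in which the structure of the residue is expected to depend mainly on the window.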

[Figure: the network reads an input window from the sequence (MAS..... QLMLKDFLNRTPL.........GHI) and gives two outputs, O(α) and O(non-α).]

Protein segments correctly predicted in α-helical structure

[Figure: NC / NT (%), from 0 to 100, as a function of entropy and segment length.]

Entropy = Shannon entropy in (ln 2)/10 units, $S = -\sum_i o_i \ln(o_i)$

NC = number of protein segments correctly predicted in α-helix

NT = total number of protein segments predicted in α-helix

Profile of the smoothed entropy (S5) for the hen egg lysozyme (132L)

[Figure: S5, from 0 to 0.7, along the protein chain, with the predicted helices and the extracted fragments marked.]

[Figure: 3D structure of hen egg lysozyme (132L), with N-terminus and C-terminus.]

Frequency distribution of predicted helical segments as a function of their entropy value

[Figure: frequency of correct predictions, wrong predictions and their differences as a function of the entropy (S5), from 0.0 to 0.7, with the threshold value marked.]

An example of the data base of minimally frustrated protein fragments

http://www.biocomp.unibo.it/DB/

Training set from PDB

Number of proteins   Number of amino acids   Number of α-helices   Average length
822                  174191                  4783                  11.6

Data base of minimally frustrated α-helical segments

Number of proteins   Number of amino acids   Number of α-helical segments   Average length
626                  21553                   3000                           7.2

Comparison of minimally frustrated segments with putative folding initiation sites experimentally determined

*Not yet experimentally detected

Comparison of minimally frustrated segments with peptides extracted from proteins

Code*    Peptides*                     % Helix in solution*   Entropy (S5)   Extracted segment
3FXC     TYKVTELINEAEGINETIDCDD        1                      #####          ####
3LZM     GFTNSLRMLQQKRWDEAVNLAKS       10                     0.262          WDEAVNL
"                                      10                     0.329          LRMLQQK
3LZM-2   GVAGFTNSLRMLQQKRWDEAAVNLAKS   12                     0.203          SLRMLQ
"                                      12                     0.210          DEAAVNL
CIII     ESLLERITRKLRDGWKRLIDIL        8                      0.171          LLERIT
"                                      8                      0.260          WKRLID
CIII-L   ESLLERITRKL                   15                     0.171          LLERIT
CIII-R   RDGWKRLIDIL                   4                      0.260          WKRLID
CIII-M   RITRKLRDGWK                   2                      ####           ####
Sigma    KVATTKAQRKLFFNLRKTKQRL        9                      0.218          TKAQRK
COMA1    DHPAVMEGTKTILETDSNLS          4                      ####           ####
COMA2    EPSEQFIKQHDFSSY               3                      ####           ####
COMA3    VNGMELSKQILQENPH              6                      0.189          LSKQILQ
COMA4    EVEDYFEEAIRAGLH               20                     0.020          YFEEAIR
COMA5    KEKITQYIYHVLNGEIL             3                      ####           ####
ARA1     AVGKSNLLSRYARNEFSA            2                      ####           ####
ARA2     RFRAVTSAYYRGAVG               3                      ####           ####
ARA3     TRRTTFESVGRWLDELKIHSD         7.5                    0.194          SVGRWL
ARA4     AVSVEEGKALAEEEGLF             4                      ####           ####
ARA5     STNVKTAFEMVILDIYNNV           3                      ####           ####
G1       DTYKLILNGKTLKGETTTEA          2                      ####           ####
G2       GDAATAEKVFKKIANDNGVD          4                      ####           ####
G3       GEWTYDDATKTFTVTE              2                      ####           ####

* Muñoz and Serrano, 1994.

Minimally frustrated α-helical segments are useful for determining:

• Folding initiation sites

• α-helix stability

• de-novo design of α-helices

Structure prediction of membrane proteins

Inner membrane proteins (all-α transmembrane proteins): α-helices, e.g. Bacteriorhodopsin (Halobacterium salinarum)

Outer membrane proteins (all-β transmembrane proteins): β-barrel, e.g. Porin (Rhodobacter capsulatus)

Predictors of the Topology of Membrane Proteins

Topography: position of transmembrane segments along the sequence

Topology: position of N and C termini with respect to the lipid bilayer (Out/In)

Prediction of transmembrane segments

Neural Network for the prediction of TMS in β-barrel membrane proteins (Jacoboni et al., 2001)

Input: sequence profile, window of 9 residues

5 hidden neurons

2 output neurons (TM, non-TM)

A generic model for membrane proteins (TMHMM)

[Figure: states for the inner side, the transmembrane region and the outer side, between Begin and End.]

Sequence-profile-based HMM

[Figure: a sequence of characters $c_t$ (..A C L P R P E T ...) compared with a sequence of A-dimensional vectors $s_t$, one column of the sequence profile for each position t.]

Constraints:

$0 \le s_t(n) \le S \quad \forall\, t, n$, with $S = 100$

$\sum_{n=1}^{A} s_t(n) = S \quad \forall\, t$

For proteins A = 20

Martelli et al., Bioinformatics 18, S46-53, 2002

The new algorithms make possible:

•to feed HMMs with sequence profiles

•to eventually couple NNs and HMMs (Hidden Neural Networks)

Advantages:

•Higher performance than standard HMMs

•Increased discrimination capability of a given class

Martelli et al., Bioinformatics, 2002; Martelli et al., Protein Eng., 2002

Prediction of the Topology of α-Transmembrane Proteins

Topography: position of transmembrane helices along the sequence

Topology: position of N and C termini with respect to the bilayer (Out/In)

The prediction accuracy of topography is 92 %

The prediction accuracy of topology is 81 %

Prediction of the Topology of β-Transmembrane Proteins

Topography: position of transmembrane strands along the sequence

Topology: position of N and C termini with respect to the bilayer: LPS (Out), Periplasmic (In)

The prediction accuracy of topography is 73 %

The prediction accuracy of topology is 73 %

The discriminative capability of the HMM

$I(s \mid M) = -\frac{1}{L} \log P(s \mid M)$

[Figure: percentage, from 0 to 100, of outer membrane, globular and inner membrane proteins as a function of I(s | M), from 2.75 to 2.95.]

An application: modeling the 3D structure of eukaryotic β-barrel proteins

3D structure prediction of proteins

[Figure: prediction methods as a function of homology (%), from 0 to 100. New folds: ab initio prediction. Existing folds: threading/fold recognition, building by homology.]

Membrane proteins

2omf_.seq/ AEIYNKDGNK VDLYGKAVGL HYFSKGNGEN SYGGNGDMTY ARLGFKGETQ
2omf_.str/ CCCCCCCCEE EEEEEEEEEE EEECCCCCCC CCCCCCCCCE EEEEEEEEEE
protx.str/ *******CCC CCCCEEEEEE EEEC****** ********CE EEEEEEEECC
protx.seq/ *******KGY NFGLWKLDLK TKTS****** ********SG IEFNTAGHSN

2omf_.seq/ I*NSDLTGYG QWEYNFQGNN SEGADAQTGN KTRLAFAGLK YADVGSFDYG
2omf_.str/ C*CCCEEEEE EEEEEEECCC CCCCCCCCCC EEEEEEEEEE ECCCEEEEEE
protx.str/ CCCCCEEEEE EEEEEEC*** ********** EEEEEEEEEC CCCCCEEEEE
protx.seq/ QESGKVFGSL ETKYKVK*** ********** DYGLTLTEKW NTDNTLFTEV

2omf_.seq/ RNYGVVYDAL GYTDMLPEFG GDTAYSDDFF VGRVGGVATY RNSNFFGLVD
2omf_.str/ ECCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCEEEE EECCCCCCCC
protx.str/ EEEECC**** ********** ********** **CCEEEEEE EEECCCCCCC
protx.seq/ AVQDQL**** ********** ********** **LEGLKLSL EGNFAPQSGN

2omf_.seq/ GLNFAVQYLG KNER****** *********D TARRSNGDGV GGSISYEYE*
2omf_.str/ CEEEEEEEEC CCCC****** *********C CCCCCCCCEE EEEEEEEEC*
protx.str/ EEEEEEEEEE EEEECCCCCC CCCCCCCEEE EEEEEEEEEE EEEEEEECCC
protx.seq/ KNGKFKVAYG HENVKADSDV NIDLKGPLIN ASAVLGYQGW LAGYQTAFDT

2omf_.seq/ **GFGIVGAY GAADRTNLQE AQPLGNGKKA EQWATGLKYD ANNIYLAANY
2omf_.str/ **CEEEEEEE EEEECCCCCC CCCCCCCCEE EEEEEEEEEE ECCEEEEEEE
protx.str/ CCEEEEEEEE EEEEEEEEEE EEECCCCCCC EEEEEEEEEE CEEEEEEEEE
protx.seq/ QQSKLTTNNF ALGYTTKDFV LHTAVNDGQE FSGSIFQRTS DKLDVGVQLS

2omf_.seq/ GETRNATPIT NKFTNTSGFA NKTQDVLLVA QYQFDFGLRP SIAYTKSKAK
2omf_.str/ EEEECCCCCC CCCCCCCCCC CEEEEEEEEE EEECCCCEEE EEEEEEEEEE
protx.str/ EEECC***** ********** *CCCEEEEEE EEECCCCEEE EEEEEEC***
protx.seq/ WASGT***** ********** *SNTKFAIGA KYQLDDDARV RAKVNNA***

2omf_.seq/ DVEGIGDVDL VNYFEVGATY YFNKNMSTYV DYIINQIDSD NKLGVGSDDT
2omf_.str/ CCCCCCCEEE EEEEEEEEEE ECCCCEEEEE EEEEECCCCC CCCCCCCCCE
protx.str/ *********E EEEEEEEEEE EC***EEEEE EEEEECCC** *****CCCCE
protx.seq/ *********S QVGLGYQQKL RT***GVTLT LSTLVDGK** *****NFNAG

2omf_.seq/ VAVGIVYQF* ***
2omf_.str/ EEEEEEEEE* ***
protx.str/ EEEEEEEEEE EC*
protx.seq/ GHKIGVGLEL EA*

Structural alignment of VDAC with the template

A low resolution 3D model of VDAC (the sequence from Neurospora crassa)

A low resolution 3D model of VDAC: location of mutated residues

Casadio et al., FEBS Lett 520:1-7 (2002)

Predictors of membrane protein structures can be used to filter genomes and find new membrane proteins without sequence homologues

FISHING NEW OUTER MEMBRANE PROTEINS IN GRAM-NEGATIVE BACTERIA

MRAKLLGIVLTTPIAISSFASTETLSFTPDNINADISLGTLSGKTKERVYLAEEGGRKVSQLDWKFNNAAIIKGAINWDLMPQISIGAAGWTTLGSRGGNMVDQDWMDSSNPGTWTDESRHPDTQLNYANEFDLNIKGWLLNEPNYRLGLMAGYQESRYSFTARGGSYIYSSEEGFRDDIGSFPNGERAIGYKQRFKMPYIGLTGSYRYEDFELGGTFKYSGWVESSDNDEHYDPGKRITYRSKVKDQNYYSVAVNAGYYVTPNAKVYVEGAWNRVTNKKGNTSLYDHNNNTSDYSKNGAGIENYNFITTAGLKYTF

Signal peptides in protein sequences

Proteins have intrinsic signals that govern their transport and localization in the cell: a hydrophobic secretion marker (or signal peptide)

Sequences of outer membrane proteins have signal peptides: the secretion marker is also a marker of outer membrane proteins

MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHF

[Figure: the signal peptide, the cleavage site and the mature protein along the sequence.]

Signal Peptide prediction

2 Neural Networks:

• SignalNet: predicts if a given residue position belongs to the signal peptide

• CleavageNet: predicts if a given residue position is the cleavage site

SignalNet Accuracy

Organism        Window    C      Q2
Eukaryotes      15-1-15   0.83   0.95
Gram positive   15-1-15   0.79   0.92
Gram negative   11-1-11   0.78   0.92

CleavageNet Accuracy

Organism        Window    C      Q2
Eukaryotes      15-1-2    0.61   0.97
Gram positive   20-1-3    0.56   0.96
Gram negative   11-1-2    0.62   0.96

Comparison with SignalP

Organism                 SignalP   SPEP
Eukaryotes (+)           0.99      0.97
Eukaryotes (-)           0.85      0.94
Prokaryotes (+)          0.99      0.97
Prokaryotes (-)          0.93      0.96
Escherichia coli (+/-)   0.95      0.96

Performance of SignalNN on 2160 annotated proteins

                                  Prediction
Annotation        With signal   Without signal   Total
With signal       205           45               250
Without signal    55            1855             1910
Total             260           1900             2160

Correct predictions: 205 and 1855. Wrong predictions: 55 and 45.

Q2 = 96 %

Qsignal = 82 %, Qnon-signal = 97 %

Psignal = 78 %, Pnon-signal = 98 %
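The indexes above follow from the four counts of the comparison between annotation and prediction. In this sketch Q is computed with respect to the annotation and P with respect to the prediction, which matches the ratios 205/250 and 205/260; the function and key names are hypothetical.

```python
def binary_scores(tp, fn, fp, tn):
    """Scores for a two-class predictor.

    tp: annotated with signal, predicted with signal
    fn: annotated with signal, predicted without signal
    fp: annotated without signal, predicted with signal
    tn: annotated without signal, predicted without signal
    """
    total = tp + fn + fp + tn
    return {
        "Q2": (tp + tn) / total,
        "Q_signal": tp / (tp + fn),        # fraction of annotated signals that are found
        "Q_non_signal": tn / (tn + fp),
        "P_signal": tp / (tp + fp),        # fraction of predicted signals that are right
        "P_non_signal": tn / (tn + fn),
    }


scores = binary_scores(tp=205, fn=45, fp=55, tn=1855)
for name, value in scores.items():
    print(f"{name} = {100 * value:.1f} %")
```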

Predictors of Membrane Topography: Rate of false positives

The predictors are tested on 809 globular proteins with sequence identity < 25 %:

0.5 % have at least 1 α-TM helix predicted

5.6 % have at least 2 β-TM strands predicted

HUNTER

[Figure: decision tree applied to a proteome. Nodes: signal peptide (yes/no), all-β TM (yes/no), all-α TM (yes/no). Leaves: all-β TM, all-α TM, globular.]

Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter

* the number of new proteins predicted in the class with Hunter, out of the non-annotated region
