43
Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Embed Size (px)

Citation preview

Page 1: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Protein secondary structure predictions

By: Refael Vivanti

& Tal Tabakman

Page 2: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Rising accuracy of protein secondary structure prediction

Burkhard Rost

Page 3: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Main dogma in biology

D.N.A.

R.N.A

strand of A.A.

protein

AGCTCTCTGAGGCTT

UCGAGAGACUCCGAA

AGHTY

?

Page 4: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Sequence – structure gap

• Today we have much more sequenced proteins than protein’s structures.

• The gap is rapidly increasing.

Problem:Finding protein structure isn’t that simple.

Solution: A good start : find secondary structure.

Page 5: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

ויהי כל הארץ שפה אחת ודברים אחדים""

בראשית יא א

Comparing methods requires same terms and tests. Secondary structure types:

H - helix E – β strand

L\C – other.

A A P P L L L L M M M G I M M R R I M E E E E E C C C C H H H H C C C E E E

seq

pred

Page 6: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

How to evaluate a prediction?

correctly predicted residues number of residues

The Q3 test:

3Q

Of course, all methods would be tested on the same proteins.

Page 7: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

• First generation – single residue statistics

Fasman & Chou (1974) : Some residues have particular secondary

structure preference.

Examples: Glu α-Helix

Val β-strand

Old methods

• Second generation – segment statistics Similar, but also considering adjacent residues.

Page 8: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Difficulties

Bad accuracy - below 66% (Q3 results).

Q3 of strands (E) : 28% - 48%.

Predicted structures were too short.

Page 10: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

3rd generation methods

• Third generation methods reached 77% accuracy.

• They consist of two new ideas:

1. A biological idea –

Using evolutionary information.

2. A technological idea –

Using neural networks.

Page 11: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

How can evolutionary information help us?

Homologues similar structure

But sequences change up to 85%

Sequence would vary differently - depends on structure

Page 12: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

How can evolutionary information help us?

In defined secondary structures.

In protein core’s segments (more hydrophobic).

In amphipatic helices (cycle of hydrophobic and hydrophilic residues).

Where can we find high sequence conservation?

Some examples:

Page 13: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

How can evolutionary information help us?

• Predictions based on multiple alignments were made manually.

Problem:• There isn’t any well defined algorithm!

Solution:• Use Neural Networks .

Page 14: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Artificial Neural Networks

An attempt to imitate the human brain construction, (assuming this is the way it works).

When do we use it ?

When we can’t solve the problems ourselves!!!

Page 15: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Artificial Neural Network

The neural network basic structure :

• Big amount of processors – “neurons”.

• Highly connected.

• Working together.

Page 16: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Artificial Neural NetworkWhat does a neuron do?

• Gets “signals” from its neighbours.

• When achieving certain threshold - sends signals.

• Each signal has different weight.

1s

2s

3s W3

W1

W2

Page 17: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Artificial Neural NetworkGeneral structure of ANN :

• One input layer.

• Some hidden layers.

• One output layer.

• Our ANN have one-direction flow !

Page 18: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

• A neuron may be:

Artificial Neural Network

0.5

1 1

OR gate

1.5

1 1

gate AND

0

-1

NOT gate

• Because this is a complete system, a neural network can compute anything.

Page 19: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Artificial Neural Network

Neural network

Test set

Training set

Correct

Incorrect

Network training and testing :

Back - propagation

• Training set - inputs for which we know the wanted output.

• Back propagation - algorithm for changing neurons pulses “power”.

• Test set - inputs used for final network performance test.

Page 20: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Artificial Neural NetworkThe Network is a ‘black box’:

• Even when it succeeds it’s hard to understand how.

• It’s difficult to conclude an algorithm from the network

• It’s hard to deduce new scientific principles.

Page 21: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Structure of 3rd generation methods

Find homologues using large data bases.

Create a profile representing the entire protein family.

Give sequence and profile to ANN.

Output of the ANN: 2nd structure prediction.

Page 22: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Structure of 3rd generation methods

The ANN learning process: Training & testing set: - Proteins with known sequence & structure.

Training: - Insert training set to ANN as input. - Compare output to known structure. - Back propagation.

Page 23: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

3rd generation methods - difficulties

Main problem - unwise selection of training & test sets for ANN.

• First problem – unbalanced training Overall protein composition:

• Helices - 32%• Strands - 21%• Coils – 47%

What will happen if we train the ANN with random segments ?

Page 24: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

3rd generation methods - difficulties

• Second problem – unwise separation between training & test proteins

What will happen if homology / correlation exists between test & training proteins?

Above 80% accuracy in testing. over optimism!

• Third problem – similarity between test proteins.

Page 25: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Protein Secondary Structure Prediction Based on Position – specific Scoring Matrices

David T. Jones

PSI - PRED : 3RD generation method based on the iterated PSI – BLAST algorithm.

Page 26: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - BLAST

Sequence

Distant homologues

PSSM - position specific scoring matrix

• PSI - BLAST outperforms other algorithms in finding distant homologues. • PSSM – input for PSI - PRED.

Page 27: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PRED ANN’s architecture:

1ST ANN

2ND ANN

• Two ANNs working together.

Finalprediction

Sequence + PSSM

Prediction

Page 28: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PRED

Step 1:• Create PSSM from sequence - 3 iterations of PSI – BLAST.

Step 2: 1ST ANN• Sequence + PSSM 1st ANN’s input.

A D C Q E I L H T S T T W Y V 15 RESIDUES

output: central amino acid secondary state prediction. A D C Q E I L H T S T T W Y V

E/H/C

Page 29: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PREDUsing PSI - BLAST brings up PSI – BLAST difficulties:

Iteration - extension of proteins family

Updating PSSM

Inclusion of non – homologues

“Misleading” PSSM

Page 30: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PREDStep 3: 2nd ANN • So why do we need a second ANN ?

possible output for 1st ANN:

A A P P L L L L M M M G I M M R R I M E E E E E C C C C C H C C C C C E E E

what’s wrong with that ?

seq pred one-amino-acid helix

doesn’t exist

Solution: ANN that “looks” at the whole context !

Input: output of 1st ANN.

Output: final prediction.

Page 31: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PREDTraining :

Testing :• 187 proteins, Highly resolved structure.

• Without structural similarities.

• PSI – BLAST was used for removing homologues.

• Balanced training.

• 10% of proteins were used as “inner” test.

Page 32: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PREDJones’s reported results :

• Q3 results : 76% - 77%.

Page 33: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

PSI - PREDReliability numbers:

• Used by many methods.

• Correlates with accuracy.

• The way the ANN tells us how much it is sure about the assignment.

Page 34: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Performance evaluation

• Many 3rd generation methods exist today.

Which method is the best one ? How to recognize “over-optimism” ?

• Through 3rd generation methods accuracy jumped ~10%.

Page 35: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Performance evaluation

EVA – Automatic Evaluation of Automatic Prediction Servers.

CASP - Critical Assessment of Techniques for Protein Structure Prediction.

Page 36: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman
Page 37: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Performance evaluationConclusion : PSI-PRED seams to be one of the most reliable method today.

Reasons :

• Strict training & testing criterions for ANN.

• The widest evolutionary information (PSI - BLAST profiles).

Page 38: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Improvements

3rd generation methods best results: ~77% in Q3 .

The first 3rd generation method PHD: ~72% in Q3.

Sources of improvement :

• Larger protein data bases.

• PSI – BLAST PSI – PRED broke through, many followed...

Page 39: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Improvements How can we do better than that ?

• Combination of methods.

• Through larger data bases (?).

Example:Combining 4 best methods Q3 of ~78% !

• Find why certain proteins predicted poorly.

Page 40: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Improvements

What is the limit of prediction improvement?

• Some regions of proteins are more mobile than others.• 12% of proteins structure is unknown even by “manual” methods.

• The limit of accuracy is 88% !

Page 41: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Secondary structure prediction in practice

SECONDARY STRUCTURE PREDICTION

finding structuralswitches

genome analysisprotein structure

Page 42: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Finding Structural Switches

Prediction of secondary structure with several methods

Different results + same preferences

Structural switch ???

young et al:

Page 43: Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Bibliography

• Rost B. Rising accuracy of protein secondary structure prediction 'Protein structure determination, analysis, and modeling for drug discovery‘ (ed. D Chasman), New York: Dekker, pp. 207-249

• Jones DT. Protein secondary structure prediction based on position specific scoring matrices. J Mol Biol. 1999 292:195-202