71
COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science & McGill Centre of Bioinformatics [email protected] Features slides from Jinbo Xu – TTI-Chicago

COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

COMP598: Introduction to Protein Structure Prediction

Jérôme Waldispühl School of Computer Science &

McGill Centre of Bioinformatics [email protected]

Features slides from Jinbo Xu – TTI-Chicago

Page 2: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Folding problemKLHGGPMLDSDQKFWRTPAALHQNEGFT

Nétats

 ~ 10n   n = 100­300

Levinthal paradox

Page 3: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: The simple ones

Page 4: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: Aliphatics

Page 5: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: Cyclic and Sulfhydryl

Page 6: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: Aromatics

Page 7: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: Aliphatic hydroxyl

Page 8: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: Carboxamides & Carboxylates

Page 9: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Amino acids: Basics

Page 10: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Histidine ionisation

Page 11: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Primary structureA peptide bond assemble two amino acids together:

A chain is obtained through the concatenation of several amino acids:

Page 12: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Peptide bond is pH dependent

Page 13: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Peptide bonds lies on a plane 

Bond lengths

Peptide bond features (1)

Page 14: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Peptide bond features (2)

The chain has 2 degrees of liberty given by the dihedral angles  and .The geometry of the chain can be characterized though  and 

Page 15: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Peptide bond features (3)

Cis/trans isomers of the peptide group

Trans configuration ispreferred versus Cis(ratio ~1000:1)

An exception is theProline with a preferenceratio of ~3:1

Page 16: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Ramachandran diagram gives the values which canbe adopted by  and  

Page 17: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

CαH

N C

H O

ψ

φ

CH2

NH3

CH2

CH2

CH2

+

Lysine

χ1

χ2

χ3

The side chains also have flexible torsion angles

Page 18: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

­2.5

­4.3

The preferred side­chains conformationsare called “rotamers”

Example: Asparagine­3.3

Typical conformations experimentally observedconformations observed by simulation 

Energy (chi1,chi2)

Cα NCC

βC

γO

δNδ

χ1

χ2

­4.5

chi2

chi1

1 kcal/mole betw

een levels

Page 19: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

α  helix β−sheet

In helices and sheets, polar groups are involved intohydrogen bonds

3.6 residuesper turn

Pseudo­periodicity of 2 

Page 20: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

­helix

3.6 residues per turn, H­bond between residue n and n+4Although other (rare) helices are observed: ­helices, 3.10­helices...

Page 21: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

­sheets

­strand (elementary blocks) :

­strands are assembled into(parallel, anti­parallel)sheets.

Page 22: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

­sheets

Anti­parallel ­sheets

Parallel ­sheets

Page 23: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

­sheetsVarious shapes of   structures

Twisted ­sheets barrel

Page 24: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

­sheets

Page 25: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Loops

 turn

~ 1/3 of amino acids

Loops 

Page 26: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Super­secondary & Tertiary structure

The tertiary structure is the set of3D coordinates of atoms of a singleamino acid chain 

Secondary structure elementscan be assembled intosuper­secondary motifs.

Page 27: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Quaternary structure

A protein can be composedof multiple chains withinteracting subunits. 

Page 28: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Protein can interact with moleculesExample: Hemoglobin

An Heme (iron + organic ring) binds to the protein, andallow the capture of oxygen atoms.

Page 29: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Disulfide bond

Two cysteines can interactand create a disulfide bond.

Page 30: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Cytochrom cHemoglobine

water

The tertiary structure is globular, with a  preferencefor polar residues on its surface but rather apolar in

its interior 

Page 31: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Membrane proteins are an exception 

~ 30% of human genome, ~ 50% of antibiotics

Cytochrom oxidase

lipidProtein

Lipid bilayer

Hydrophobiccore

Hydrophilicregion

Page 32: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Proteins folds into a native structure

Page 33: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Overview of the methods used to predictthe protein structure

● Which degree of definition?● What's the length of the sequence?● Which representation/modeling suits the best?● Should we simulate the folding or predict the structure?● Do we want a single prediction or a set of candidates?●  Machine learning approach or physical model?

Several issue must be addressed first:

Page 34: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Molecular Dynamics

Page 35: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

HP lattice model

Page 36: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Hidden Markov models(and other machine learning approaches)

Page 37: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Structural template methods

Page 38: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Protein Secondary Structure

Page 39: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Protein Secondary Structure Prediction Using Statistical Models

•  Sequences determine structures

•  Proteins fold into minimum energy state.

•  Structures are more conserved than sequences. Two proteins with 30% identity likely share the same fold.

Page 40: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

How to evaluate a prediction?

correctly predicted residues number of residues

In 2D: The Q3 test.

=3Q

In 3D: The Root Mean Square Deviation (RMSD)

Page 41: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

•  First generation – single residue statistics Fasman & Chou (1974) : Some residues have particular secondary structure preference. Examples: Glu α-Helix Val β-strand

Old methods

•  Second generation – segment statistics Similar, but also considering adjacent residues.

Page 42: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Difficulties

Bad accuracy - below 66% (Q3 results).

Q3 of strands (E) : 28% - 48%.

Predicted structures were too short.

Page 43: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Methods Accuracy Comparison

Page 44: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

3rd generation methods

•  Third generation methods reached 77% accuracy.

•  They consist of two new ideas: 1. A biological idea – Using evolutionary information. 2. A technological idea – Using neural networks.

Page 45: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

How can evolutionary information help us?

Homologues similar structure

But sequences change up to 85%

Sequence would vary differently - depends on structure

Page 46: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

How can evolutionary information help us?

In defined secondary structures.

In protein core’s segments (more hydrophobic). In amphipatic helices (cycle of hydrophobic and hydrophilic residues).

Where can we find high sequence conservation?

Some examples:

Page 47: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

How can evolutionary information help us?

•  Predictions based on multiple alignments were made manually.

Problem: •  There isn’t any well defined algorithm!

Solution: •  Use Neural Networks .

Page 48: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Artificial Neural Network

The neural network basic structure : •  Big amount of processors – “neurons”. •  Highly connected. •  Working together.

Page 49: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Artificial Neural Network

What does a neuron do? •  Gets “signals” from its neighbors.

•  When achieving certain threshold - sends signals. •  Each signal has different weight.

1s

2s

3s W3

W1

W2

Page 50: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Artificial Neural Network General structure of ANN :

•  One input layer.

•  Some hidden layers.

•  One output layer.

•  Our ANN have one-direction flow !

Page 51: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Artificial Neural Network

Neural network

Test set

Training set

Correct

Incorrect

Network training and testing :

Back - propagation

•  Training set - inputs for which we know the wanted output. •  Back propagation - algorithm for changing neurons pulses “power”. •  Test set - inputs used for final network performance test.

Page 52: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Artificial Neural Network The Network is a ‘black box’: •  Even when it succeeds it’s hard to understand how. •  It’s difficult to conclude an algorithm from the network. •  It’s hard to deduce new scientific principles.

Page 53: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Structure of 3rd generation methods

Find homologues using large data bases.

Create a profile representing the entire protein family.

Give sequence and profile to ANN.

Output of the ANN: 2nd structure prediction.

Page 54: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Structure of 3rd generation methods

The ANN learning process: Training & testing set: - Proteins with known sequence & structure. Training: - Insert training set to ANN as input. - Compare output to known structure. - Back propagation.

Page 55: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

3rd generation methods - difficulties

Main problem - unwise selection of training & test sets for ANN.

•  First problem – unbalanced training

Overall protein composition: •  Helices - 32% •  Strands - 21% •  Coils – 47%

What will happen if we train the ANN with random segments ?

Page 56: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

3rd generation methods - difficulties

•  Second problem – unwise separation between training & test proteins

What will happen if homology / correlation exists between test & training proteins?

Above 80% accuracy in testing. over optimism!

•  Third problem – similarity between test proteins.

Page 57: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Protein Secondary Structure Prediction Based on Position – specific Scoring Matrices

David T. Jones

PSI - PRED : 3RD generation method based on the iterated PSI – BLAST algorithm.

Page 58: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - BLAST

Sequence

Distant homologues

PSSM - position specific scoring matrix

•  PSI – BLAST finds distant homologues. (It exists now alternatives such as HMMER 3.0 or HHblits) •  PSSM – input for PSI - PRED.

Page 59: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED ANN’s architecture:

1ST ANN

2ND ANN

•  Two ANNs working together.

Final prediction

Sequence + PSSM

Prediction

Page 60: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED Step 1: •  Create PSSM from sequence - 3 iterations of PSI – BLAST.

Step 2: 1ST ANN •  Sequence + PSSM 1st ANN’s input.

A D C Q E I L H T S T T W Y V 15 RESIDUES

output: central amino acid secondary state prediction. A D C Q E I L H T S T T W Y V

E/H/C

Page 61: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED Using PSI - BLAST brings up PSI – BLAST difficulties:

Iteration - extension of proteins family

Updating PSSM

Inclusion of non – homologues

“Misleading” PSSM

Page 62: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED Step 3: 2nd ANN •  So why do we need a second ANN ? possible output for 1st ANN:

A A P P L L L L M M M G I M M R R I M E E E E E C C C C C H C C C C C E E E

what’s wrong with that ?

seq pred

one-amino-acid helix doesn’t exist

Solution: ANN that “looks” at the whole context !

Input: output of 1st ANN. Output: final prediction.

Page 63: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED Training :

Testing : •  187 proteins, Highly resolved structure.

•  Without structural similarities.

•  PSI – BLAST was used for removing homologues.

Balanced training.

Page 64: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED Jones’s reported results : Q3 results : 76% - 77%

Page 65: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

PSI - PRED Reliability numbers:

•  Used by many methods.

•  Correlates with accuracy.

•  The way the ANN tells us how much it is sure about the assignment.

Page 66: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Performance Evaluation

•  Many 3rd generation methods exist today.

Which method is the best one ? How to recognize “over-optimism” ?

•  Through 3rd generation methods accuracy jumped ~10%.

Page 67: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Performance Evaluation

Page 68: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Performance Evaluation

Conclusion : PSI-PRED seams to be one of the most reliable method today.

Reasons :

•  Strict training & testing criterions for ANN.

•  The widest evolutionary information (PSI - BLAST profiles).

Page 69: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Improvements

3rd generation methods best results: ~77% in Q3 .

The first 3rd generation method PHD: ~72% in Q3.

Sources of improvement :

•  Larger protein data bases. •  PSI – BLAST PSI – PRED broke through, many followed...

Page 70: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Improvements How can we do better than that ?

•  Combination of methods.

Through larger data bases (?).

Example: Combining 4 best methods Q3 of ~78% !

•  Find why certain proteins predicted poorly.

Page 71: COMP598: Introduction to Protein Structure Predictionjeromew/comp598_W2014/... · COMP598: Introduction to Protein Structure Prediction Jérôme Waldispühl School of Computer Science

Bibliography •  Jones DT. Protein secondary structure prediction

based on position specific scoring matrices. J Mol Biol. 1999 292:195-202

•  Rost B. Rising accuracy of protein secondary structure prediction 'Protein structure determination, analysis, and modeling for drug discovery‘ (ed. D Chasman), New York: Dekker, pp. 207-249