Protein Structure, Classification and Prediction BMI 730 Victor Jin Department of Biomedical Informatics Ohio State University

Page 1

Protein Structure, Classification and Prediction

BMI 730 Victor Jin

Department of Biomedical Informatics, Ohio State University

Page 2

Protein Structure

Protein Structure Determination

Protein Structure Classification
- SCOP
- CATH

Secondary Structure Prediction

Tertiary Prediction

Structure Prediction Evaluation
- CASP


Page 4

Chemistry

Proteins are linear heteropolymers of amino acids, built from twenty different amino acids (the building blocks).

3-letter code: VAL ARG LYS ILE GLU PRO ARG GLU

1-letter code: V R K I E P R E
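The mapping between the two notations can be sketched in a few lines of Python. This is only an illustration; the names `THREE_TO_ONE` and `three_to_one` are chosen here and are not part of the lecture.

```python
# Standard three-letter to one-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(residues: str) -> str:
    """Convert whitespace-separated 3-letter codes into a 1-letter string."""
    return "".join(THREE_TO_ONE[code.upper()] for code in residues.split())


# The example peptide from the slide.
print(three_to_one("VAL ARG LYS ILE GLU PRO ARG GLU"))  # VRKIEPRE
```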

Page 5

Peptide bond

http://www.imb-jena.de/~rake/Bioinformatics_WEB/basics_peptide_bond.html

The peptide bond is planar

Two backbone angles (φ and ψ) are freely rotatable; one (ω) is fixed

Peptide ~ 2-10 amino acids
Polypeptide ~ 10-50 amino acids
Protein ~ 50 or more amino acids

Double bond character of the peptide bond

Page 6

Amino acids

Side chain properties
- Size
- Charge
- Polarity

http://www.ch.cam.ac.uk/SGTL/Structures/amino/

Page 7

Hierarchical nature of protein structure

Primary structure (amino acid sequence)
↓
Secondary structure (local conformations: α-helix, β-sheet, reverse turn, and loop)
↓
Tertiary structure (global conformation: the three-dimensional structure that results from folding the secondary structures together)
↓
Quaternary structure (structure formed by more than one polypeptide chain)

Page 8

Basic structural units of proteins: Secondary structure

α-helix β-sheet

Secondary structures, α-helix and β-sheet, have regular hydrogen-bonding patterns.

Page 9

Tertiary structure

In globular proteins such as enzymes, the long chain of amino acids folds into a three-dimensional functional shape, the tertiary structure. This fold is stabilized in part by covalent disulfide (S-S) bonds, which form between cysteine residues (the amino acids carrying sulfhydryl, or SH, groups) in the same chain. Other interactions between the R groups of amino acids, such as hydrogen bonds, ionic bonds, and hydrophobic interactions, also contribute to the tertiary structure.

Page 10

A few examples of tertiary structure

Dihydrofolate reductase Myoglobin

Page 11

Quaternary structure

Non-covalent interactions bind multiple polypeptides into a single, larger protein. Hemoglobin has quaternary structure due to the association of two alpha-globin and two beta-globin polypeptides.

Page 12

Structure Stabilizing Interactions

Non-covalent
- Van der Waals forces (transient, weak electrical attraction of one atom for another)
- Hydrophobic (clustering of nonpolar groups)
- Hydrogen bonding

Covalent
- Disulfide bonds


Page 14

Protein structure determination

- X-ray crystallography
- NMR (nuclear magnetic resonance)
- Cryo-EM (cryo-electron microscopy)

Protein expression, membrane proteins, aggregation


Page 16

Protein Structure Classification - SCOP

• Structural Classification of Proteins (SCOP) database
• http://scop.mrc-lmb.cam.ac.uk/scop/

• Hierarchical clustering
• Family – clear evolutionary relationship
• Superfamily – probable common evolutionary origin
• Fold – major structural similarity

• Boundaries between levels are more or less subjective

• Conservative evolutionary classification leads to many new divisions at the family and superfamily levels; it is therefore recommended to focus first on the higher levels of the classification tree.


Page 22

Protein Structure Classification - SCOP

Class                                Number of folds  Number of superfamilies  Number of families
All alpha proteins                               218                      376                 608
All beta proteins                                144                      290                 560
Alpha and beta proteins (a/b)                    136                      222                 629
Alpha and beta proteins (a+b)                    279                      409                 717
Multi-domain proteins                             46                       46                  61
Membrane and cell surface proteins                47                       88                  99
Small proteins                                    75                      108                 171
Total                                            945                     1539                2845

SCOP Classification Statistics
SCOP: Structural Classification of Proteins, 1.69 release
25973 PDB entries (1 Oct 2004). 70859 domains. 1 literature reference (excluding nucleic acids and theoretical models)


Page 28

Protein Structure Classification - CATH

• CATH Protein Structure Classification
• http://www.cathdb.info/latest/index.html

• CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels, Class(C), Architecture(A), Topology(T) and Homologous superfamily (H).

• Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.

• Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.

• The topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures.

• The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons.

Page 29

Protein Structure Classification - CATH

http://www.cathdb.info/cgi-bin/cath/GotoCath.pl?link=cath_info.html

Only crystal structures solved to a resolution better than 4.0 angstroms are considered, together with NMR structures. All non-proteins, models, and structures with greater than 30% "C-alpha only" are excluded from CATH.

The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures. These include computational techniques, empirical and statistical evidence, literature review and expert analysis.

Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the following levels:

Level name  Sequence identity  Overlap
S                         35%      80%
O                         60%      80%
L                         95%      80%
I                        100%      80%
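The thresholds in the table can be read as a simple lookup. The Python sketch below checks which levels a single pair of domains would satisfy; it assumes the listed percentages are minimum values and it does not reproduce CATH's multi-linkage clustering, which operates on whole groups of domains. The names `CATH_LEVELS` and `shared_levels` are hypothetical.

```python
# (level name, minimum % sequence identity, minimum % overlap), from the table.
CATH_LEVELS = [
    ("S", 35.0, 80.0),
    ("O", 60.0, 80.0),
    ("L", 95.0, 80.0),
    ("I", 100.0, 80.0),
]


def shared_levels(identity: float, overlap: float) -> list:
    """Return the level names whose thresholds a pair of domains meets.

    Assumption: the thresholds are inclusive minimums. This is a pairwise
    check only, not the multi-linkage clustering used by CATH itself.
    """
    return [
        name
        for name, min_identity, min_overlap in CATH_LEVELS
        if identity >= min_identity and overlap >= min_overlap
    ]


print(shared_levels(72.0, 90.0))   # ['S', 'O']
print(shared_levels(100.0, 85.0))  # ['S', 'O', 'L', 'I']
print(shared_levels(99.0, 50.0))   # []
```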


Page 33

CATH vs. SCOP


Page 35

Secondary Structure Prediction

AGADIR - An algorithm to predict the helical content of peptides
APSSP - Advanced Protein Secondary Structure Prediction Server
GOR - Garnier et al, 1996
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction at University of Dundee
JUFO - Protein secondary structure prediction from sequence (neural network)
nnPredict - University of California at San Francisco (UCSF)
Porter - University College Dublin
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel University
SOPMA - Geourjon and Deléage, 1995
SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
DLP - Domain linker prediction at RIKEN

http://us.expasy.org/tools/#secondary

Page 36

Determining the Residue Environment

Six basic environment classes (E, P1, P2, B1, B2 and B3)

The environment of each residue in the three-dimensional structure is first classified according to the area of the side chain that is buried in the protein.
---- A residue is considered exposed to solvent (environment class E) if the area buried is less than 40 Å².
---- It is considered partially buried (class P) if the area buried is between 40 and 114 Å².
---- It is considered buried (class B) if the area buried is greater than 114 Å².

The buried and partially buried classes are further subdivided according to the fraction of the side chain area that is exposed to polar atoms ("fraction polar", denoted f).
---- For this purpose polar atoms are defined as those of the solvent and the oxygen and nitrogen atoms of the protein.
---- The buried class is subdivided into classes B1 (f < 0.45), B2 (0.45 <= f < 0.58) and B3 (f >= 0.58).
---- The partially buried class is subdivided into classes P1 (f < 0.67) and P2 (f >= 0.67).
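The classification rules on this slide amount to a small decision procedure. A minimal Python sketch follows; the function name `environment_class` is hypothetical, and treating the exact boundary values (40 and 114 Å²) as "partially buried" is an assumption drawn from the wording "between 40 and 114".

```python
def environment_class(buried_area: float, fraction_polar: float) -> str:
    """Assign one of the six environment classes to a residue.

    buried_area: buried side-chain area in square angstroms.
    fraction_polar: fraction f of the side-chain area exposed to polar atoms.
    Assumption: areas of exactly 40 or 114 count as partially buried.
    """
    if buried_area < 40.0:
        return "E"  # exposed to solvent; not subdivided by f
    if buried_area <= 114.0:
        return "P1" if fraction_polar < 0.67 else "P2"
    if fraction_polar < 0.45:
        return "B1"
    if fraction_polar < 0.58:
        return "B2"
    return "B3"


print(environment_class(10.0, 0.90))   # E
print(environment_class(80.0, 0.50))   # P1
print(environment_class(150.0, 0.50))  # B2
```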

Page 37

Rows: sequence residue and predicted secondary structure classes
Columns: structural environments

Class   rcC_E  rcC_B  rcH_E  rcH_B  rcS_E  rcS_B
rcC       3.3    3.7    1.7    2.5    0.4    0.7
rcH       2.4   −9.0    3.1    3.7   −9.0   −9.0
rcS       0.8   −9.0   −9.0   −9.0    3.9    4.0
rwC       0.5   −9.0    1.2   −9.0   −9.0   −9.0
rwH      −9.0   −9.0    1.3   −9.0   −9.0   −9.0
rwS      −9.0   −9.0   −9.0   −9.0    1.5   −9.0
rbC      −0.6   −0.7   −9.0   −9.0   −1.2   −0.2
rbH      −9.0   −9.0    1.4   −0.5   −9.0   −9.0
rbS      −1.2    0.1   −9.0   −9.0    1.5    0.9
raC      −0.1    0.2   −0.3   −9.0   −0.2   −0.7
raH      −1.5   −9.0    1.0    0.0   −0.7   −9.0
raS      −0.8   −9.0   −9.0   −9.0    1.6   −0.5
rhC      −0.1    0.7   −1.1   −1.1   −1.1   −9.0
rhH      −2.1   −0.9    1.0    1.3   −2.0   −1.8
rhS      −1.0    0.0   −9.0   −9.0    0.6    1.0
rsC       0.1    0.1   −1.5   −2.1   −0.5   −0.9
rsH      −2.3   −9.0    0.7    0.9   −9.0   −9.0
rsS      −0.5   −1.2   −9.0   −9.0    0.8    1.0
rpC       0.6    0.1   −9.0   −9.0   −0.8    0.0
rpH      −1.9   −9.0    0.8    0.0   −9.0   −9.0
rpS      −0.9   −9.0   −9.0   −9.0    1.5    1.3

Page 38

Secondary Structure Prediction - HNN

• Input sequence (FASTA):

>gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2 (Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase subunit II)
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
ILLYTAWCYWKMFGRITKEDIERNTHSLY

• http://npsa-pbil.ibcp.fr/cgi-bin/secpred_hnn.pl

Page 39

Secondary Structure Prediction - HNN

        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
ccchhhhhhhhhhhhhhheeeeehccchhcchhhhhheecccccceeeeeeccccccccceeeeeeccch
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccchhhhhhhhhhcceeehccchccheehhhhhc
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
hhcccccchhhhheeeeccchhhhhcchceccceeeeeeeeeccchhhhhhhchhhhhhchhhhhhhhhh
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
hhhhhhccceeeeeeccceeeeeccccccccccchhhhhhhhhhhheeccccceeeeccchhhhhhhhhh
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
hhhhhhhhhhhhhhhhhhhhhhhhhcchhhcccccccchhhccccchhcccchhhhhhhhhhhhhhhhhh
ILLYTAWCYWKMFGRITKEDIERNTHSLY
hhhhhhhhhhhhhhhcchhhhhhhccccc

Sequence length : 379

HNN :
Alpha helix (Hh) : 209 is 55.15%
3-10 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 14.51%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 115 is 30.34%
Ambiguous states (?) : 0 is 0.00%
Other states : 0 is 0.00%
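The summary statistics reported by HNN are simple state counts over the prediction string. The Python sketch below shows how such a summary can be computed; it is an illustration rather than the server's code, and the names `STATE_NAMES` and `composition` are hypothetical. Only the three states that occur in this prediction (h, e, c) are tallied.

```python
from collections import Counter

# The three states that appear in the prediction on this slide.
STATE_NAMES = {"h": "Alpha helix", "e": "Extended strand", "c": "Random coil"}


def composition(prediction: str) -> dict:
    """Return {state: (count, percent)} for a per-residue prediction string."""
    counts = Counter(prediction.lower())
    total = len(prediction)
    return {
        state: (counts.get(state, 0), round(100.0 * counts.get(state, 0) / total, 2))
        for state in STATE_NAMES
    }


# Toy prediction string (not the full output shown above).
for state, (count, percent) in composition("hhhheecc").items():
    print(f"{STATE_NAMES[state]} : {count} is {percent:.2f}%")

# The percentages on the slide follow from the reported counts and length 379.
print(round(100.0 * 209 / 379, 2))  # 55.15
```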


Page 41

Secondary Structure Prediction - PHD

• PHDsec predicts secondary structure from multiple sequence alignments. Secondary structure is predicted by a system of neural networks rated at an expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993, 90, 7558-7562; Rost & Sander, JMB, 1993, 232, 584-599; and Rost & Sander, Proteins, 1994, 19, 55-72).

• Evaluated on the same data set, PHDsec is rated at ten percentage points higher three-state accuracy than methods using only single sequence information, and at more than six percentage points higher than, e.g., a method using alignment information based on statistics (Levin, Pascarella, Argos & Garnier, Prot. Engng., 6, 849-54, 1993).

• PHDsec predictions have three main features:
  • improved accuracy through evolutionary information from multiple sequence alignments
  • improved beta-strand prediction through a balanced training procedure
  • more accurate prediction of secondary structure segments by using a multi-level system

Page 43

Motifs Readily Identified from Sequence

• Zinc finger – a characteristic order and spacing of cysteine and histidine residues.

• Leucine zippers – two alpha helices held together by interactions between hydrophobic leucine residues at every seventh position in each helix.

• Coiled coils – 2-3 helices coiled around each other in a left-handed supercoil (3.5 residues per turn instead of 3.6, i.e., 7 residues per two turns); the first and fourth residues of each heptad are hydrophobic, the others hydrophilic; 5-10 heptads.

• Transmembrane-spanning proteins – alpha helices comprising amino acids with hydrophobic side chains, typically 20-30 residues.
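As an illustration of how such motifs can be spotted directly in a sequence, the Python sketch below scans for leucines at every seventh position. Requiring four consecutive leucines is an assumption made for this example (the slide does not state a repeat count), the name `find_leucine_heptads` is hypothetical, and a hit is only a candidate: the pattern says nothing about whether the region is actually helical.

```python
import re

# Leucine at every seventh position: L-x(6)-L-x(6)-L-x(6)-L.
# The lookahead lets overlapping candidates be reported.
LEUCINE_HEPTAD = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_heptads(sequence: str) -> list:
    """Return (1-based start position, matched segment) for each candidate."""
    return [
        (match.start() + 1, match.group(1))
        for match in LEUCINE_HEPTAD.finditer(sequence.upper())
    ]


# Toy sequence built for this example, not a real protein.
toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_heptads(toy))
```

The same approach extends to the other motifs on this slide, for example a sliding window that flags stretches of 20-30 mostly hydrophobic residues as candidate transmembrane helices.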


Page 45

Tertiary Structure Prediction

Comparative modeling
SWISS-MODEL - An automated knowledge-based protein modelling server
3Djigsaw - Three-dimensional models for proteins based on homologues of known structure
CPHmodels - Automated neural-network based protein modelling server
ESyPred3D - Automated homology modeling program using neural networks
Geno3d - Automatic modeling of protein three-dimensional structure
SDSC1 - Protein Structure Homology Modeling Server

Threading
3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information (Foldfit)
Fugue - Sequence-structure homology recognition
HHpred - Protein homology detection and structure prediction by HMM-HMM comparison
Libellula - Neural network approach to evaluate fold recognition results
LOOPP - Sequence to sequence, sequence to structure, and structure to structure alignment
SAM-T02 - HMM-based Protein Structure Prediction
Threader - Protein fold recognition
ProSup - Protein structure superimposition
SWEET - Constructing 3D models of saccharides from their sequences

Ab initio
HMMSTR/Rosetta - Prediction of protein structure from sequence

http://us.expasy.org/tools

Page 46

Tertiary Structure Prediction – Comparative Modeling

Example: 3Djigsaw - Three-dimensional models for proteins based on homologues of known structure

Contreras-Moreira,B., Bates,P.A. (2002) Domain Fishing: a first step in protein comparative modelling. Bioinformatics 18: 1141-1142.

Page 47

3D Protein Sequence Profiles

- A 3D profile is based on a 3D structure-specific scoring matrix.
- A 3D scoring matrix is similar to the 1D scoring matrices we discussed in the multiple sequence alignment lectures, with the additional attribute of the structural environment of the amino acid side chain.
- There are 6 basic environment classes (E, P1, P2, B1, B2 and B3), differing in the area of the side chain that is buried and in the fraction of the side chain that is exposed to polar atoms.
- Since amino acids can assume 3 different secondary structures, there are 3 x 6 = 18 different environmental classes.
- The log odds of each amino acid in each environment type gives the values for the 3D-1D scoring matrix, calculated from a database of protein structures.
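A log-odds score compares how often an amino acid occurs in a given environment with how often it occurs overall. The Python sketch below is one way to compute such a matrix from observation counts. It is an illustration under stated assumptions: the natural logarithm is used (the slide does not fix the base), the secondary structure labels H, S and C are placeholders, the counts are invented toy numbers, and the name `log_odds_matrix` is hypothetical.

```python
import itertools
import math

# 6 burial/polarity classes x 3 secondary structures = 18 environment classes.
# The labels H, S, C are placeholders chosen for this sketch.
ENV_CLASSES = [
    f"{env}_{ss}"
    for env, ss in itertools.product(["E", "P1", "P2", "B1", "B2", "B3"], "HSC")
]


def log_odds_matrix(counts: dict) -> dict:
    """counts[env][aa] = times residue aa was observed in environment env.

    Returns scores[env][aa] = ln( P(aa | env) / P(aa) ).
    Unobserved (zero-count) pairs are left out of the result.
    """
    total = sum(sum(row.values()) for row in counts.values())
    background = Counter_like(counts)
    scores = {}
    for env, row in counts.items():
        env_total = sum(row.values())
        scores[env] = {
            aa: math.log((n / env_total) / (background[aa] / total))
            for aa, n in row.items()
            if n > 0
        }
    return scores


def Counter_like(counts: dict) -> dict:
    """Sum the per-environment counts into overall per-residue counts."""
    background = {}
    for row in counts.values():
        for aa, n in row.items():
            background[aa] = background.get(aa, 0) + n
    return background


# Invented toy counts: leucine mostly buried, lysine mostly exposed.
toy_counts = {"B1_H": {"L": 80, "K": 20}, "E_H": {"L": 20, "K": 80}}
scores = log_odds_matrix(toy_counts)
print(len(ENV_CLASSES))            # 18
print(round(scores["B1_H"]["L"], 3), round(scores["B1_H"]["K"], 3))
```

A positive score means the residue is found in that environment more often than expected from its overall frequency; a negative score means less often.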

Page 48

Using 3D Profiles in Structure Prediction

The alignment of an amino acid sequence with a 3D profile yields an overall 3D-1D score. The 3D-1D score is a measure of the compatibility of the sequence with the structure described by the profile.

Given an amino acid sequence, find compatible structures
---- Useful for finding homologous structures when doing homology modeling

Given a preliminary or model structure, test its validity
---- Useful for the final phase of homology modeling

Given a structure, find compatible sequences
---- Useful for analyzing evolutionary relationships among proteins
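The idea of an overall 3D-1D score can be sketched as a sum of matrix entries along an alignment. The Python example below makes a strong simplification: it assumes an ungapped, position-by-position alignment, whereas a real profile alignment allows gaps. The matrix values, the `missing` default and the name `compatibility_score` are all invented for illustration.

```python
def compatibility_score(sequence, environments, matrix, missing=0.0):
    """Sum 3D-1D scores for an ungapped sequence-to-structure alignment.

    sequence: one-letter amino acid codes.
    environments: environment class of each structure position.
    matrix[env][aa]: score of residue aa in environment env.
    missing: score used when a pair is absent from the matrix (arbitrary here).
    """
    if len(sequence) != len(environments):
        raise ValueError("sequence and structure must have the same length")
    return sum(
        matrix.get(env, {}).get(aa, missing)
        for aa, env in zip(sequence, environments)
    )


# Invented toy matrix: leucine prefers buried sites, lysine exposed ones.
toy_matrix = {"B1": {"L": 1.2, "K": -1.5}, "E": {"L": -0.8, "K": 0.9}}
structure = ["B1", "E", "B1"]

for candidate in ("LKL", "KLK"):
    print(candidate, round(compatibility_score(candidate, structure, toy_matrix), 2))
```

Ranking several sequences against one structure, or one sequence against several structures, then reduces to comparing these totals.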

Page 49

Homology Modeling

- Definition: predicting the tertiary structure of an unknown protein using a known 3D structure of protein(s) with homologous sequence.
- Based on the assumption that structure is more conserved than sequence.
- It is important to use homologous proteins whose structures were determined by X-ray crystallography or NMR.
- Homology modeling is an important method because the number of different protein folds (unique structures) is much smaller than the number of different proteins.
- It is likely that homologous protein sequences will share a common protein fold.

Some of the material from this section is from: http://www.cs.wright.edu/~mraymer/cs790/Homology_Modeling.ppt

Page 50

Homology Modeling Procedure

- Search databases for homologous protein sequences. The Protein Data Bank (PDB) is a good choice, since all of the sequences contained in PDB have solved 3D structures.
- Align the homologous protein sequence with the sequence of interest.
---- Pair-wise or multiple sequence alignment can be used
- Build a model of the structure of the protein of interest using the known structures of homologous proteins. Possible methods include:
  1. Modeling by rigid body assembly
  2. Modeling by segment matching or coordinate reconstruction
  3. Modeling by satisfaction of spatial constraints
- Evaluate and refine the model structure.
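After the alignment step, candidate templates are commonly compared by how similar they are to the sequence of interest. The Python sketch below computes percent identity from an existing pairwise alignment. The measure itself is an addition for illustration (the slide does not name a criterion), identity is counted over columns where neither sequence has a gap (one of several conventions in use), and the name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment.

    Both strings must come from the same alignment (equal length, '-' = gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Toy alignment: the query peptide against an invented template fragment.
print(round(percent_identity("VRKIEPRE", "VRKLEP-E"), 2))  # 85.71
```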


Page 52

Tertiary Structure Prediction - Threading

• The term was first coined by Jones, Taylor and Thornton in 1992, originally for fold recognition.

• Today, the terms threading and fold recognition are frequently (though somewhat incorrectly) used interchangeably.

• The basic idea is that the target sequence (whose structure is to be predicted) is threaded through the backbone structures of template proteins (known as the fold library), and a “goodness of fit” score is calculated for each (usually derived in terms of an empirical energy function).

• Threading methods share some of the characteristics of both comparative modelling methods (the sequence alignment aspect) and ab initio prediction methods (predicting structure based on identifying low-energy conformations of the target protein).

http://en.wikipedia.org/wiki/Threading_%28protein_sequence%29

Page 53

Protein Threading

Generalization of the homology modeling method
---- Homology modeling: align sequence to sequence
---- Threading: align sequence to structure (templates)

Rationale:
---- Limited number of basic folds found in nature
---- Amino acid preferences for different structural environments provide sufficient information to choose the best-fitting protein fold (structure)

Page 54

Tertiary Structure Prediction

Ab initio (de novo)

• From scratch – using physical properties instead of known structures

• Mimic the folding process – minimize an energy function, stochastic modeling (e.g., simulated annealing)

• Computationally expensive – requires large clusters, large machines (e.g., IBM BlueGene) or distributed computing; currently works only for small peptides

• Big potential in the future – understanding the dynamics, accuracy, and applications in drug development
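The "minimize an energy function by simulated annealing" idea can be sketched on a toy problem. In the Python example below the "conformation" is just a list of angles and the energy is an invented function with a known minimum; it stands in for a real physical energy function and is not a protein model. All names and parameter values are assumptions made for this illustration.

```python
import math
import random


def toy_energy(angles, targets):
    """Invented energy: zero when every angle equals its target value."""
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, targets))


def simulated_annealing(targets, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize toy_energy with random moves and the Metropolis criterion."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in targets]
    energy = toy_energy(angles, targets)
    best_angles, best_energy = list(angles), energy
    for step in range(steps):
        # Geometric cooling schedule from t_start towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        # Move: perturb one randomly chosen angle.
        trial = list(angles)
        trial[rng.randrange(len(trial))] += rng.gauss(0.0, 0.3)
        delta = toy_energy(trial, targets) - energy
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / T), which shrinks as T falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            angles, energy = trial, energy + delta
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
    return best_angles, best_energy


best_angles, best_energy = simulated_annealing([0.5, -1.0, 2.0])
print(best_energy)
```

Accepting occasional uphill moves at high temperature is what lets the search escape local minima; the cost of evaluating a realistic energy function millions of times is what makes the approach computationally expensive.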

Page 55

Tertiary Structure Prediction

Ab initio (de novo)

Prediction Scoring with Rosetta

Rosetta uses a scoring function to judge different conformations. The process consists of making 'moves' (changing the bond angles of a particular group of amino acids) and then scoring the new conformation.

The Rosetta score is a weighted sum of component scores, where each component score judges a different aspect of protein structure.

Environment score: Here, hydrophobic residues are represented as orange stars, so the left conformation is good (all the hydrophobics together) while the rightmost conformation is bad (with the hydrophobic amino acids not touching).

Pair-score: Two conformations of a polypeptide are shown, one (top) where the chain is folded back on itself, bringing two cysteines together (yellow+yellow = possible disulphide bond) and forming a salt-bridge (blue+red = opposites attract). The conformation at bottom does not make these pairings, and the pair-score would thus favor the top conformation.

http://www.grid.org/projects/hpf/howitworks_scoring.htm
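The "weighted sum of component scores" can be written out directly. The Python sketch below is not Rosetta's API: the component values and weights are invented, the two component names are taken from this slide, and treating lower totals as better is an assumption made for the example.

```python
def total_score(components: dict, weights: dict) -> float:
    """Weighted sum of component scores for one conformation."""
    return sum(weights[name] * value for name, value in components.items())


# Invented weights and component scores for two hypothetical conformations.
weights = {"environment": 1.0, "pair": 0.5}
folded = {"environment": -4.0, "pair": -2.5}   # hydrophobics together, pairs formed
extended = {"environment": 1.0, "pair": 0.0}   # hydrophobics apart, no pairings

print(total_score(folded, weights))    # -5.25
print(total_score(extended, weights))  # 1.0
```

In a search loop, each 'move' would produce a new set of component scores, and the totals would decide which conformation to keep.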


Page 57

Evaluation - CASP

CASP (Critical Assessment of Techniques for Protein Structure Prediction) is a community-wide experiment (though it is commonly referred to as a competition) for protein structure prediction that has taken place every two years since 1994. (http://predictioncenter.org/)

The main goal of CASP is to obtain an in-depth and objective assessment of our current abilities and inabilities in the area of protein structure prediction. To this end, participants will predict as much as possible about a set of soon to be known structures. These will be true predictions, not ‘post-dictions’ made on already known structures. CASP7 will particularly address the following questions:

1. Are the models produced similar to the corresponding experimental structure?
2. Is the mapping of the target sequence onto the proposed structure (i.e. the alignment) correct?
3. Have similar structures that a model can be based on been identified?
4. Are comparative models more accurate than can be obtained by simply copying the best template?
5. Has there been progress from the earlier CASPs?
6. What methods are most effective?
7. Where can future effort be most productively focused?

Page 58

Evaluation - CASP

Evaluation of the results is carried out in the following prediction categories:
• tertiary structure prediction (all CASPs)
• secondary structure prediction (dropped after CASP5)
• prediction of structure complexes (CASP2 only; a separate experiment, CAPRI, carries on this subject)
• residue-residue contact prediction (starting CASP4)
• disordered regions prediction (starting CASP5)
• domain boundary prediction (starting CASP6)
• function prediction (starting CASP6)
• model quality assessment (starting CASP7)
• model refinement (starting CASP7)

The tertiary structure prediction category was further subdivided into:
• homology modelling
• fold recognition (also called protein threading; note that this is incorrect, as threading is a method)
• de novo structure prediction, now referred to as 'New Fold' because many methods apply evaluation, or scoring, functions that are biased by knowledge of native protein structures; an artificial neural network is one such example

Page 59

Evaluation - CASP

Number of human expert groups registered   207
Number of prediction servers registered     98
Number of targets released                 104
Targets canceled                             4
Valid targets                              100
Refinement targets                           9

Prediction format               Number of groups contributing  Number of models designated as 1  Total number of models
3D coordinates                                            180                             12393                   48339
Alignments to PDB structures                               15                               966                    3896
Residue-residue contacts                                   17                              1473                    1561
Structural domains assignments                             27                              2258                    2515
Disordered regions                                         19                              1801                    1801
Function prediction                                        22                              1317                    1930
Quality assessment                                         29                              2326                    3228
Model refinement                                           26                               136                     447
All                                              255 (unique)                             22670                   63717

Page 60

Summary

• Proteins are key players in our living systems.
• Proteins are polymers consisting of 20 kinds of amino acids.
• Each protein folds into a unique three-dimensional structure defined by its amino acid sequence.
• Protein structure has a hierarchical nature.
• Protein structure prediction is a grand challenge of computational biology.