2d 3d Structure

8/3/2019 2d 3d Structure


Structure prediction methods

(2D and 3D)

Much of the text in the slides that follow is drawn either verbatim or paraphrased from the following texts:

Bioinformatics (Baxevanis and Ouellette)

Chapter 8: Predictive methods using protein sequences

(Ofran and Rost) 198-219

Chapter 9: Protein structure prediction and analysis

(Wishart) 224-247

Chapter 12: Creation and analysis of protein multiple sequence alignments

(Barton) 333-336

Proteins: Structures and molecular properties

(Thomas Creighton)

Topics Covered

Overview of protein structure: primary, secondary, tertiary, and quaternary

Overview of protein folding

Secondary structure prediction methods

Solvent accessibility prediction

3D fold prediction

Ab initio protein structure prediction

Threading methods

Community evaluation of protein structure prediction

Critical Assessment of protein Structure Prediction (CASP) http://predictioncenter.org/

EVA (real-time continuous evaluation of protein fold prediction methods) http://cubic.bioc.columbia.edu/eva/

Methods for solving protein structures experimentally



The importance of protein structure

Bioinformatics is much more than just sequence analysis; many of the most interesting and exciting applications in bioinformatics today are concerned with structure analysis.

The origins of bioinformatics actually lie in the field of structural

biology

Proteins are perhaps the most complex chemical entities in nature.

No other class of molecule exhibits the variety and irregularity in shape, size, texture and mobility that can be found in proteins.

Baxevanis & Ouellette (Ch. 9, p.224, Wishart)

Hierarchical descriptions of proteins

(follows the folding process)

Primary structure: the amino acid sequence

Secondary structure: regular local structure of linear segments of polypeptide chains (Creighton)

Helices (~35% of residues)

Beta sheet (~25% of residues)

Both types predicted by Linus Pauling (Corey and Pauling, 1953)

Other less common structures:

Beta turns

3-10 helices

loops

Remaining unclassifiable regions termed random coil or unstructured regions

http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm

Tertiary structure: Overall topology of the folded polypeptide chain (Creighton)

Mediated by hydrophobic interactions between distant parts of protein

Quaternary structure: Aggregation of the separate polypeptide chains of a protein (Creighton)

Baxevanis & Ouellette (Ch. 9, p.224, Wishart)



Protein folding

Folded conformations of globular proteins

Most proteins are globular: natural proteins in solution are much smaller in their dimensions than comparable polypeptides with random or repetitive conformations and have roughly spherical shapes

Denaturation: Most proteins are robust to changes in their environment, until they (somewhat literally) fall apart

Most proteins are robust to changes in temperature, pH and pressure, exhibiting little or no change until a point is reached at which there is a sudden change and loss of biological function

Denaturing proteins has been used to explore folding pathways

(e.g., Understanding how proteins fold: the lysozyme story so far. Dobson CM, Evans PA, Radford SE. Trends Biochem Sci. 1994)

Creighton, Proteins Ch. 6



Structural domains

Folded structures of most small proteins are roughly spherical and remarkably compact

Proteins with >200 aa tend to consist of two or more structural units, called domains

Domains interact to varying extents, but less extensively than do structural elements within domains

Some domain detection tools make use of this pattern, looking for covariation between positions as evidence of interaction

Nagarajan and Yona, Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics 2004

Domains may not always be well segregated; some proteins have multiple domains with two or three polypeptide connections between domains

See, for example, the SCOP interleaved domains


Structural domains (contd)

Definition of domain is a subjective process done in different ways by different people

Domains are most evident by their compactness

Expressed quantitatively as the ratio of the surface area of a domain to the surface area of a sphere with the same volume

Observed values are 1.65 +/- 0.08

Course of polypeptide backbone through domain is irregular, but generally follows a moderately straight course through the domain and then makes a U-turn to recross the domain

Overall impression: segments of somewhat stiff polypeptide chain interspersed with relatively tight turns or bends (almost always on the molecule's surface)

Compared to behavior of a fire hose dropped in one spot
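The compactness ratio described on this slide can be worked through numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes the surface area and volume of the domain have already been measured by some other tool. A perfect sphere gives a ratio of 1.0, so the observed value of about 1.65 means a domain surface is roughly two-thirds larger than the smallest possible surface for its volume.

```python
import math


def sphere_area_for_volume(volume):
    """Surface area of the sphere that has the given volume."""
    # V = (4/3) * pi * r^3  ->  r = (3V / (4 pi))^(1/3);  A = 4 * pi * r^2
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 4.0 * math.pi * radius ** 2


def compactness_ratio(domain_area, domain_volume):
    """Domain surface area divided by the area of an equal-volume sphere."""
    return domain_area / sphere_area_for_volume(domain_volume)


# A unit cube (area 6, volume 1) is less compact than a sphere.
cube_ratio = compactness_ratio(6.0, 1.0)
```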




Structural domains (contd)

Figure 6.13

Driving forces in protein folding

Complex combination of local and global forces

Local forces drive secondary structure formation

Repulsion between hydrophobic side chains of some amino acids and hydrophilic backbone of protein chain (intra-molecular)

Interaction between side chains and surrounding solvent

Subcellular environment (e.g., membrane, secreted, etc.)

Pauling et al 1951

Baxevanis & Ouellette (Ch. 9, Wishart)



More driving forces in protein folding

Hydrophobicity

Hydrophobic residues need to be shielded from solvent

Polar residues to the outside, hydrophobic to the inside

Stronger interactions

Hydrogen bonds, disulfide bridges

Weak interactions

Van der Waals, electrostatic, etc

Recommended reading: Proteins (Thomas Creighton).

Global effects on protein fold

Long-range interactions (repulsive or

attractive) between distant parts of structure

These can override local effects

E.g., chameleon protein:

11 amino acids adopt helical structure in one region,

and the same 11 amino acids adopt beta strand in

another.

Minor & Kim, 1996




Ligands and co-factors

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Enzymes.html#coenzymes



Information required for folding is (mostly) contained in the primary sequence

Early on, proteins were shown to fold into their native structures in isolation

This led to the belief that structure is determined by sequence alone (Anfinsen, 1973)

Over the last decade, a significant number of proteins have been shown to not fold properly in the test tube (e.g., requiring the assistance of chaperonins)

Nevertheless, the native 3D structure is assumed to be in some energetic minimum

This led to the development of ab initio folding methods


Folding pathways

Evidence that local structure segments form first, and then pack against each other to form 3D fold

Exploited in protein fold prediction, Rosetta method

Simons, Bonneau, Ruczinski & Baker (1999). Ab initio Protein Structure Prediction of CASP III Targets Using ROSETTA. Proteins

Semi-stable structural intermediates on folding pathway to lowest-energy conformation

Prof. Susan Marqusee, Berkeley




Secondary structure

Alpha helix structure

http://www.web-books.com/MoBio/Free/Ch2C4.htm



Amphipathic alpha helix


Beta strand




Beta sheet


Secondary Structure Prediction



Why is secondary structure

prediction important?

Secondary structure diverges less rapidly

than primary sequence

Knowledge or prediction of 2ary structure

improves detection and alignment of remote

homologs

3D-PSSM, SAM-T02 (fold prediction servers)


Focusing on single residues

Early structure prediction methods focused on the structural characteristics of individual residues

This enabled the larger problem to be decomposed into smaller easier-to-solve problems (enabling the combination of solutions to sub-problems to form a global solution)

This also enabled methods to focus on detecting transmembrane regions, solvent-accessible residues, and other important features of molecules




Secondary structure prediction

using MSA information?

Labeling residues in a sequence as α-helix, β-sheet or turn/coil (3-state prediction).

Accuracy of prediction enhanced by ~6% when multiple sequence alignments are used vs the use of a single sequence (Cuff & Barton, 1999)

State-of-the-art methods, PSIPRED (Jones 1999) and JNET (Cuff & Barton, unpublished), have >76% accuracy for 3-state prediction.

Baxevanis & Ouellette (Ch. 12, Barton)

Amino acid patterns indicative of β-strand structures

Short runs of conserved hydrophobic residues suggest a buried β-strand.

An i, i+2, i+4 pattern of conserved hydrophobic residues suggests a surface β-strand.

Conserved residues sharing the same physicochemical properties are likely to form one face of a strand.




Amino acid patterns indicative of α-helical structures

Conservation patterns of i, i+3, i+4, i+7 and variations (e.g., i, i+4, i+7) suggest an alpha helix

Amphiphilic/conservation patterns (alternating hydrophobic and polar residues) following an i, i+3, i+4, i+7 pattern (and variations, e.g., i, i+4, i+7) are likely to represent surface helices
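The i, i+2, i+4 strand pattern and the i, i+3, i+4, i+7 helix pattern can be searched for mechanically. The sketch below is a simplification under stated assumptions: it scans a single sequence (for an alignment one might pass a consensus string of conserved positions instead), the `HYDROPHOBIC` residue set is an illustrative choice rather than one taken from the source, and the function name is hypothetical.

```python
# Assumed hydrophobic residue set, chosen for illustration only.
HYDROPHOBIC = set("AVLIMFWC")

STRAND_OFFSETS = (0, 2, 4)      # i, i+2, i+4: one face of a surface strand
HELIX_OFFSETS = (0, 3, 4, 7)    # i, i+3, i+4, i+7: one face of a surface helix


def pattern_starts(sequence, offsets, residues=HYDROPHOBIC):
    """Return every start index i where all positions i+offset are in `residues`."""
    span = max(offsets)
    starts = []
    for i in range(len(sequence) - span):
        if all(sequence[i + k] in residues for k in offsets):
            starts.append(i)
    return starts


strand_hits = pattern_starts("LSVTIKA", STRAND_OFFSETS)
helix_hits = pattern_starts("LKELLKKLAEE", HELIX_OFFSETS)
```

A hit is only a hint: the slides stress that the pattern should be seen in conserved positions across homologs, not in one sequence alone.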


Identifying loop regions

Insertions and deletions are not well tolerated in the hydrophobic core.

Regions of an MSA that include many gap characters are likely to indicate surface loops.

Glycine and proline residues can be found in any secondary structure.

However, conserved glycine/proline residues are strongly suggestive of loops.
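The gap-based loop heuristic above can be sketched as a per-column gap count. This is a minimal illustration, not a published method: the function names are hypothetical, the alignment is assumed to be a list of equal-length strings, and the 0.5 threshold is an arbitrary assumption.

```python
def gap_fraction_per_column(alignment, gap_chars="-."):
    """Fraction of sequences that carry a gap character in each MSA column."""
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        sum(row[col] in gap_chars for row in alignment) / len(alignment)
        for col in range(ncols)
    ]


def candidate_loop_columns(alignment, threshold=0.5):
    """Columns whose gap fraction reaches the (assumed) threshold."""
    fractions = gap_fraction_per_column(alignment)
    return [col for col, frac in enumerate(fractions) if frac >= threshold]


toy_alignment = ["MK-LV", "MR--V", "MKG-V"]
loop_columns = candidate_loop_columns(toy_alignment)
```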




Early schemes used observed preferences

Various schemes give the amino acids numerical weights or rankings for their preferences, and several computer programs can predict the secondary structure from the given sequence.

The simplest such scheme of Chou and Fasman, Ann. Rev. Biochem. (1978), examined the statistical distribution of amino acids in alpha helix, beta sheet and turns or loops, using a set of known protein structures from the Protein Data Bank.

A novel sequence can then be scanned, and the tendency of each portion of the sequence to form secondary structure is assessed.

http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm

Improving secondary structure prediction

Peer pressure (pressure from the neighbors): A minimum of 4 amino acids out of 6 should show alpha preference, or 3 out of 5 beta preference, or clusters of 2-3 breakers in a sequence of 4 are needed to set the secondary structure in any region, and individual misfits adopt the secondary structure of their neighbours.

Learning secondary structure preferences from expanded data sets: More recent prediction schemes take advantage of larger data sets to examine amino acid preference for different regions in a helix or different positions in a tight turn.

Up-weighting conserved residues: In addition, sequences of homologous proteins may be compared. The rationale is that highly conserved amino acids contribute more to the three-dimensional structure than unconserved ones, and different weightings can be introduced to the statistical analysis.

Improved accuracy: The accuracy of prediction has risen from about 55% using the simple Chou-Fasman method, where the tendency is to overpredict, to about 80% using current methods.

http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
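The "4 out of 6" rule for helix nucleation can be sketched as a sliding window count. This is a simplified illustration of that one rule, not the full Chou-Fasman procedure: the `HELIX_FORMERS` set is an assumed stand-in for residues with alpha preference, and the function name is hypothetical.

```python
# Assumed set of residues with alpha-helix preference (illustrative only).
HELIX_FORMERS = set("EMALKFQWIV")


def helix_nuclei(sequence, window=6, min_formers=4, formers=HELIX_FORMERS):
    """Start indices of windows holding at least `min_formers` helix formers."""
    nuclei = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if sum(res in formers for res in segment) >= min_formers:
            nuclei.append(start)
    return nuclei


nuclei = helix_nuclei("GPEALMKGPG")
```

The same function covers the "3 out of 5" beta rule by passing `window=5`, `min_formers=3` and a set of beta-preferring residues.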



Basic types of secondary structure

Helices (α and others)

α is most common; 3.6 residues/turn

Side chains project outward

Structure is stabilized by hydrogen bonds between the carbonyl (CO) group of one amino acid and the amino (NH) group of the amino acid that is 4 positions C-terminal to it

β-Strands (two or more strands interact to form a β-sheet)

Other (sometimes called loop, coil, or non-regular)


The new generation of secondary structure prediction

PHDsec (Rost et al 1994, Rost et al 1996)

Based on machine learning concepts

Training set: learn implicit rules, principles and model

parameters from labelled data (sequences whose

secondary structures are known for each position)

Test set: sequences of unknown structure

Baxevanis & Ouellette Ch 8 (Ofran and Rost)



Key to success

The success of machine learning algorithms depends on the careful choice of the biologically based features used for training and a sufficiently large and accurate training set

To enhance prediction accuracy on novel data, training data diversity is also critical

Exploit knowledge that local environment is important: to predict 2ary structure of residue i, consider all residues in a window around i: i-n, ..., i, ..., i+n.
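The window idea above can be sketched as follows. This is a minimal illustration with a hypothetical function name; padding the sequence ends with "X" is an assumption made here so that every residue gets a full window of 2n+1 positions.

```python
def residue_windows(sequence, n, pad="X"):
    """Return one window of length 2n+1 centred on each residue of `sequence`."""
    padded = pad * n + sequence + pad * n
    # Residue i of the original sequence sits at index i + n in `padded`,
    # so its window spans padded[i : i + 2n + 1].
    return [padded[i:i + 2 * n + 1] for i in range(len(sequence))]


windows = residue_windows("MKV", 1)
```

Each window would then be encoded numerically and paired with the known state of its central residue to form one training example.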


PHDsec

Employs homology detection and a feed-forward artificial neural network

Step 1: homolog search and MSA construction

Step 2: label each position with conservation signal (across MSA) and observed substitutions

Step 3: submit representative annotated sequence to a system of neural networks.

Output is a prediction of the most likely secondary structure at each position, with the estimated confidence in that prediction
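Step 2 can be pictured as turning each MSA column into a residue frequency table plus a conservation score. The sketch below is an assumption-laden illustration, not PHDsec's actual encoding: the function names are hypothetical, and Shannon entropy is used here simply as one common conservation measure (0 means fully conserved).

```python
import math
from collections import Counter


def column_profile(alignment, gap_chars="-."):
    """Per-column residue frequencies, ignoring gap characters."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res not in gap_chars)
        total = sum(counts.values())
        profile.append(
            {res: n / total for res, n in counts.items()} if total else {}
        )
    return profile


def column_entropy(frequencies):
    """Shannon entropy (bits) of one column's residue frequencies."""
    return -sum(p * math.log2(p) for p in frequencies.values() if p > 0)


profile = column_profile(["MKL", "MRL", "MKV", "MK-"])
entropies = [column_entropy(col) for col in profile]
```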




Assessing performance evaluations

Overall, the correct evaluation of performance

for prediction methods is an art in itself; only a

handful of methods turned out over time to not

have been overestimated by their developers.

Evaluation must be performed on a standard dataset

Training and test data should be rigorously kept

separate

Standard deviations of estimates should be provided


Other problems with comparing

different methods

Performance reported in literature can take different forms

Accuracy and coverage

Positive (or negative) predictive power

Sensitivity and specificity

Machine learning terms (e.g., Matthews coefficients)

Wilcoxon paired score signed rank tests

Or be based on different criteria for success

per residue

per secondary structure element

per protein

Others measure performance only in cases where a prediction has high confidence (with a likelihood of a lower FP rate)
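Several of the per-residue measures listed above can be computed from the same pair of predicted and observed state strings. The sketch below uses hypothetical function names and assumes states are encoded as H (helix), E (strand) and C (coil); it shows overall three-state accuracy plus sensitivity, specificity and the Matthews coefficient for one state.

```python
import math


def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def per_state_scores(predicted, observed, state):
    """Sensitivity, specificity and Matthews coefficient for a single state."""
    tp = sum(p == state and o == state for p, o in zip(predicted, observed))
    fp = sum(p == state and o != state for p, o in zip(predicted, observed))
    fn = sum(p != state and o == state for p, o in zip(predicted, observed))
    tn = sum(p != state and o != state for p, o in zip(predicted, observed))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


observed = "HHHHEEEECC"
predicted = "HHHCEEECCC"
overall = q3_accuracy(predicted, observed)
helix = per_state_scores(predicted, observed, "H")
```

The same prediction can look quite different depending on which number is reported, which is the point the slide makes about comparing published results.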




The EVA server

Continuous assessment of the predictions of automatic servers, applying the same measurements, the same standards, and the same sequences to all methods

New structures (pre-release to PDB) given to EVA by participating structural biologists. EVA submits the amino acid sequences to online servers.

Predictions stored until release of 3D coordinates to PDB. Then the predicted (2D or 3D) structures can be compared against the solved structures, and given various scores.

Approach enables the community to compare methods, and gives developers concrete feedback that is critical for method improvement.


How do the methods compare?

Best methods now reach 76% accuracy at 3-state prediction (helix, strand, random coil)

Rost 2001

See EVA website for detailed comparisons

Metaservers:

Consensus approaches combining weighted predictions from different servers

These almost always outperform individual methods

Shown in both CASP and EVA
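A consensus of weighted predictions can be sketched as a per-position weighted vote. This is only an illustration of the general idea, not how any particular metaserver works; the function name and the weights are hypothetical, and ties are broken alphabetically purely to keep the output deterministic.

```python
from collections import defaultdict


def consensus_prediction(predictions, weights=None):
    """Weighted per-position vote over equal-length state strings."""
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight is needed per prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for pos in range(length):
        votes = defaultdict(float)
        for prediction, weight in zip(predictions, weights):
            votes[prediction[pos]] += weight
        # sorted() makes ties resolve to the alphabetically first state.
        consensus.append(max(sorted(votes), key=votes.get))
    return "".join(consensus)


server_outputs = ["HHEC", "HEEC", "CHEC"]
unweighted = consensus_prediction(server_outputs)
weighted = consensus_prediction(server_outputs, weights=[1.0, 3.0, 1.0])
```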




Caveats

Even when an experimental structure is available, it is sometimes unclear where one secondary structure element ends and another begins

Low-confidence predictions (and regions of disagreement across servers) can correspond to structurally ambiguous regions

Real-life example: Prion protein (involved in bovine spongiform encephalopathy, Creutzfeldt-Jakob disease, etc).

Region assumed to be responsible for aggregation believed to flip from experimentally determined helical structure to (predicted) strand in diseased individuals

All the best secondary structure prediction methods predict this region to be beta (incorrect)


Secondary structure prediction programs

PSIPRED

JNET (Cuff & Barton)

PHD (Rost & Sander)




PSIPRED



Solvent accessibility

Solvent accessibility is the area of a protein's surface that is exposed to surrounding solvent.

This information is critical for facilitating the detection of functionally (as opposed to structurally) critical residues

Solvent-exposed positions have the potential to interact with other molecules, metal atoms or ions

Entirely buried residues may help stabilize a protein's 3D fold, but cannot participate in

an enzyme active site,

a binding site in a DNA-binding protein, or

an interaction site in a signal transduction component

all of which require spatial accessibility of the residue to solvent




Measuring solvent accessibility

Measured in square Angstroms

Values range from 0 (entirely buried) to 300 (on surface)

Two entirely exposed residues can have very different accessible areas

Residues with long side chains expose a larger area to solvent than residues with short side chains

Values typically normalized by the maximum possible for an amino acid, to measure the percentage of the residue that is accessible to solvent.
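The normalization step can be shown with a small worked example. The sketch below is illustrative only: the function name is hypothetical, and the maximum areas in `EXAMPLE_MAX_AREA` are made-up placeholder numbers, not published reference values. It shows why the same absolute area means different things for a small and a large residue.

```python
# Placeholder maximum accessible areas in square Angstroms (NOT real reference data).
EXAMPLE_MAX_AREA = {"G": 100.0, "W": 250.0}


def relative_accessibility(residue, accessible_area, max_area=EXAMPLE_MAX_AREA):
    """Percentage of a residue's maximum possible area that is solvent exposed."""
    ratio = accessible_area / max_area[residue]
    # Cap at 100% in case a measured area slightly exceeds the tabulated maximum.
    return 100.0 * min(ratio, 1.0)


glycine_pct = relative_accessibility("G", 50.0)
tryptophan_pct = relative_accessibility("W", 50.0)
```

With these placeholder numbers, 50 square Angstroms is half of the small residue's surface but only a fifth of the large one's.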


Conservation of solvent accessibility

Homologous proteins with similar folds tend to conserve solvent accessibility values at buried positions (i.e., solvent accessibility between 0-10%);

Exposed positions (values between 60-100%) show less conservation of solvent accessibility between homologs.

Rost and Sander, 1994




Prediction methods: PHDacc and PROFacc

Part of the PredictProtein service at Columbia U. (Burkhard Rost lab)

Sequence alignment and profile construction using MaxHom

Per-residue 10-state scheme, corresponding to predicted percentage of residue that is accessible (1=0-1%; 2=2-4%; etc)


Prediction methods: Jpred

Cuff & Barton, 2000

Prediction server predicting 2ary structure and solvent accessibility

Sequence alignment and profile construction using PSI-BLAST and HMM methods

Per-residue 3-state scheme, corresponding to predicted percentage of residue that is accessible (0%, 5%, 25%)

Prediction outputs from two neural networks are combined to give an average relative solvent accessibility.




Solvent accessibility: Method performance

No large-scale continuous system for evaluation is available (unlike the case for 2D and 3D structure prediction)

Local sequence information is insufficient

Accessibility to solvent appears to be influenced by nonlocal effects

For two-state prediction (buried vs exposed) accuracy is between 75-85% for both PHDacc and PROFacc

For more detailed definitions (e.g., percentage of exposure), accuracy is more difficult to measure.

Correlation coefficient between predicted and measured solvent accessibility for PHDacc is 0.53

Random guess would yield a correlation coefficient of zero

Superior results require a homology model construction


3D-structure prediction



Basic premise: The function and structure of a protein are encoded in its primary sequence

The amino acid sequence determines a protein's 3D structure, subcellular localization, intermolecular interactions, biochemical and physiological tasks, and (eventually) how and when it will be broken down into its component building blocks.

Paraphrased from class text (Ofran and Rost), p 198

How many unique protein folds are there?

Many structural biologists believe that all protein domains will eventually be classified into only 1000 different fold classes

Koonin et al 2002

Structural Genomics Initiative is designed to populate that fold space

Even with attempts to solve novel structures, upon examination of new structures, many are clearly members of existing structural classes




3D structure classification schemes

All alpha (>50% helix; 30% beta sheet;



Threading

Limited to generating approximate models or suggesting approximate folds

>5 Angstroms for 3D threading

>3 Angstroms for 2D threading

Name based on threading a tube (called a snake) through a plumbing system.

Each unique threading of a sequence through the 3D model can be evaluated using an empirically derived energy function or measure of packing efficiency

Sequences can be scored based on how well they fit the model (i.e., the best score achievable)

Baxevanis & Ouellette Ch 9 (Wishart)

Three-dimensional threading

First described by Novotny et al (1984)

Rediscovered in early 1990s

Jones et al 1992; Sippl & Weitckus 1992; Bryant & Lawrence 1993

Based largely on heuristic contact potentials (interactions between pairs of residues)

3D coordinates of theoretical structure (based on threading of sequence through PDB structure model) used to evaluate predicted contacts and derive a fitness score based on a pseudoenergy function

Powerful for predicting 3D structure of unknown proteins, and for evaluating structure of known proteins

Limitations found in this method:

interactions are not always conserved between distant homologs

Computational complexity (very slow)

Modest accuracy (early methods ignored amino acid information; model accuracy >5 Angstroms)
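The scoring idea behind 3D threading can be sketched with a toy pseudoenergy. Everything here is an assumption made for illustration: the function names are hypothetical, the contact potential values are invented, the template is reduced to a list of contacting position pairs, and only ungapped placements are tried (real threading methods also handle gaps, which is part of why they are slow).

```python
def threading_score(segment, template_contacts, potential, default=0.0):
    """Sum pair potentials over every residue pair in contact in the template."""
    total = 0.0
    for i, j in template_contacts:
        pair = tuple(sorted((segment[i], segment[j])))
        total += potential.get(pair, default)
    return total


def best_ungapped_threading(sequence, template_length, template_contacts, potential):
    """Try every ungapped placement; return (offset, score) with the lowest score."""
    best = None
    for offset in range(len(sequence) - template_length + 1):
        segment = sequence[offset:offset + template_length]
        score = threading_score(segment, template_contacts, potential)
        if best is None or score < best[1]:
            best = (offset, score)
    return best


# Invented toy potential: lower (more negative) means a more favourable contact.
toy_potential = {("L", "L"): -1.0, ("K", "L"): 0.5}
toy_contacts = [(0, 3)]  # template positions 0 and 3 touch in 3D
best = best_ungapped_threading("KAALAAL", 4, toy_contacts, toy_potential)
```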




Contact maps

2D plots of distances between C-alpha atoms of all pairs of residues

Observed interactions between amino acids used to form contact potentials for 3D threading methods
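A contact map of this kind can be computed directly from C-alpha coordinates. The sketch below uses hypothetical function names and assumes coordinates are given as (x, y, z) tuples in Angstroms; the 8 Angstrom cutoff is a commonly used but arbitrary assumption, not a value from the source. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math


def distance_map(ca_coords):
    """Matrix of pairwise distances between C-alpha atoms."""
    n = len(ca_coords)
    return [
        [math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
        for i in range(n)
    ]


def contact_map(ca_coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within `cutoff` Angstroms."""
    return [[d <= cutoff for d in row] for row in distance_map(ca_coords)]


# Four residues placed 3.8 Angstroms apart along a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
distances = distance_map(coords)
contacts = contact_map(coords)
```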


Figure 6.14

Two-dimensional threading

Sequence-profile methods; combine predicted 2ary structure (and possibly solvent accessibility) with standard profile methods to score and align proteins

Improved accuracy through combined use of 2ary structure and amino acid similarity

Much faster than standard 3D threading

Model accuracy good but not excellent (RMSD >3 Angstroms)

However, for model construction for proteins with no close homologs with solved structure, these methods are among the best

Examples: UCSC SAM-T99 (two-track HMMs), 3D-PSSM, FUGUE

Judged best by EVA




Rosetta

Hybrid ab initio and homology-based

structure prediction

David Baker

The HMMSTR-Rosetta server

http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php



Assessing method performance

Astral benchmark datasets

Park et al

CASP experiments

EVA and Livebench

Continuous evaluation of webservers



Experimental methods for solving protein 3D structure

Experimental determination of protein structure

X-ray crystallography

NMR spectroscopy



X-ray crystallography

Most accurate; can be applied to larger proteins

Oldest method; first structure (myoglobin) determined in late 1950s (Kendrew et al 1958). More than 20K structures solved to date

Method: Small protein crystals (measuring



NMR spectroscopy

Much newer: first NMR structure in 1983

Allows biologists to study structure and dynamics of molecules in liquid state (or near-physiological environment)

Structures solved by measuring how radio waves are absorbed by atomic nuclei

Absorption measurement allows the determination of how much nuclear magnetism is transferred from one atom (or nucleus) to another

Magnetization transfer measured through chemical shifts, J-couplings and nuclear Overhauser effects

Measured parameters define a set of approximate structural constraints that are fed into a constraint minimization calculation (distance geometry or simulated annealing)

Result is an ensemble of 15-50 structures that satisfy the experimental constraints

These multiple structures are overlaid/superimposed on each other to produce blurrograms

NMR result is potentially more reflective of true solution behavior of proteins; most proteins seem to exist in an ensemble of slightly different configurations


Limitations of NMR spectroscopy

Size limitations: maximum of 30kD (~250aa)

Solubility of molecule

cannot be applied to membrane proteins

Expensive: requires special isotopically labeled molecules

Inherently less precise




Storing and retrieving protein structures

The Protein Data Bank (PDB)

First electronic database in bioinformatics

Set up at Brookhaven National Laboratory by Walter Hamilton in 1971

7 protein structures at database initiation

Coordinates stored and distributed on punch cards and computer tape

Currently

22K structures (as of October 23, 2005)

Coordinate distribution and deposition is electronic (via the World Wide Web)

Moved to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998

Primary archival center for experimentally determined 3D structures of proteins, nucleic acids, carbohydrates and complexes

Separate repository for theoretical models


http://www.usm.maine.edu/~rhodes/ModQual/index.html



Summary

Experimental determination of protein structure is expensive and not always straightforward

Predictive methods are relied upon to obtain clues to protein fold (and function)

Knowing what (which parts of a protein structure) you can believe and what you can't is critical for both experimental and predicted structures



Summary (contd)

Ab initio methods of protein fold prediction use physics-based energy minimization to simulate the process of protein folding

These methods are generally less successful than homology-based fold prediction (limited to short peptides/small proteins)

Exception: Rosetta/I-sites methods (Baker group) which employ both types of approach

Threading methods fall into the homology-based class of approaches.

2D profiles use 2ary structure (prediction/knowledge) as well as sequence information (and perhaps additional information).

3D profiles use 3D models and assign scores to proteins based on inter-residue contacts, using the observed contacts in the original structure template and contact potentials derived from other structures

Summary (contd)

Community assessment of 2D and 3D structure prediction uses various approaches

EVA and LiveBench (continuous real-time assessment of methods)

CASP (Critical Assessment of Protein Structure Prediction)

Benchmark datasets (e.g., Astral PDB40 for fold recognition)

Reported accuracy of 2D structure prediction between 75-77% (for best methods)

Reported accuracy of comparative models derived by 3D structure prediction servers is harder to assess.

Fold prediction (ignoring the comparative model construction) is fairly accurate for the best servers, provided

A homologous structure has already been deposited in the PDB

That structure can be detected with a significant E-value using sequence information alone (e.g., by PSI-BLAST)

The inclusion of 2ary structure prediction (e.g., in 2D profiles) can improve the alignment and give a modest boost to fold recognition accuracy when %ID is very low, but can also yield errors in prediction
