2. Introduction to Rosetta and structural modeling

(From Ora Schueler-Furman)

• Approaches for structural modeling of proteins

• The Rosetta framework and its prediction modes

• Cartesian and polar coordinates

• Sampling (finding the structure) and scoring (selecting the structure)

Structural Modeling of Proteins - Approaches

Prediction of Structure from Sequence

Flowchart:

1. Comparison of query sequence to nr database

2. Similar to a sequence of known structure?
   Yes: Homology Modeling (Comparative Modeling)
   No: go to step 3

3. Fits a known fold?
   Yes: Fold Recognition (Threading)
   No: Ab initio prediction
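The decision flow of this slide can be written as a tiny function. This is only a sketch: the function name and the two boolean inputs are hypothetical, and in practice each answer comes from a database search rather than a flag.

```python
def choose_modeling_approach(similar_to_known_structure: bool,
                             fits_known_fold: bool) -> str:
    """Pick a modeling approach following the flowchart (illustrative only)."""
    if similar_to_known_structure:
        # A homolog of known structure exists: build the model on it.
        return "Homology Modeling (Comparative Modeling)"
    if fits_known_fold:
        # No close homolog, but the sequence is compatible with a known fold.
        return "Fold Recognition (Threading)"
    # Nothing to borrow from: predict from sequence alone.
    return "Ab initio prediction"


print(choose_modeling_approach(False, True))
```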

The Rosetta framework and its prediction modes

A short history of Rosetta

In the beginning: ab initio modeling of protein structure starting from sequence

• Short fragments of known proteins are assembled by a Monte Carlo strategy to yield native-like protein conformations

• Reliable fold identification for short proteins. Recently improved to high-resolution models (within 2 Å RMSD)

ATCSFFGRKLL…..

A short history of Rosetta

Success of the ab initio protocol led to extensions to:

• Protein design

• Design of a new fold: TOP7

• Protein loop modeling; homology modeling

• Protein-protein docking; protein interface design

• Protein-ligand docking

• Protein-DNA interactions; RNA modeling

• Many more, e.g. solving the phase problem in X-ray crystallography

The Rosetta Strategy

• Observation: local sequence preferences bias, but do not uniquely define, the local structure of a protein

• Goal: mimic interplay of local and global interactions that determine protein structure

• Local interactions: fragments derived from known structures (sampled for similar sequences/secondary structure propensity)

• Global (non-local) interactions: buried hydrophobic residues, paired β strands, specific side chain interactions, etc.

The Rosetta Strategy

• Local interactions – fragments
  – Fragment library representing accessible local structures for all short sequences in a protein chain, derived from known structures

• Global (non-local) interactions – scoring function
  – Derived from conformational statistics of known structures
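How fragments and a scoring function work together can be sketched with a toy Metropolis Monte Carlo loop. Everything here is an assumption made for illustration: the chain is 2D with one turning angle per residue, `FRAGMENTS` is a made-up library shared by all positions (a real library is position-specific and derived from known structures), and `toy_score` (squared radius of gyration) merely stands in for the scoring function.

```python
import math
import random

# Hypothetical 3-residue fragments: turning angles in radians.
FRAGMENTS = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (-1.0, -1.0, -1.0), (1.0, -1.0, 1.0)]


def build_chain(turns):
    """Build 2D points from turning angles, using unit bond lengths."""
    points, heading = [(0.0, 0.0)], 0.0
    for turn in turns:
        heading += turn
        x, y = points[-1]
        points.append((x + math.cos(heading), y + math.sin(heading)))
    return points


def toy_score(turns):
    """Stand-in score: squared radius of gyration (lower = more compact)."""
    pts = build_chain(turns)
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts) / len(pts)


def fragment_assembly(n_res, fragments, n_steps=2000, kT=1.0, seed=0):
    """Insert random fragments; accept or reject with the Metropolis criterion."""
    rng = random.Random(seed)
    frag_len = len(fragments[0])
    turns = [0.0] * n_res                      # start from an extended chain
    score = toy_score(turns)
    best_turns, best_score = turns[:], score
    for _ in range(n_steps):
        start = rng.randrange(n_res - frag_len + 1)
        trial = turns[:]
        trial[start:start + frag_len] = rng.choice(fragments)
        delta = toy_score(trial) - score
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            turns, score = trial, score + delta
            if score < best_score:
                best_turns, best_score = turns[:], score
    return best_turns, best_score


best_turns, best_score = fragment_assembly(12, FRAGMENTS)
print(round(best_score, 3))
```

Because the moves replace angles rather than coordinates, each insertion changes the whole downstream chain while all bond lengths stay fixed.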

Scoring and Sampling

The basic assumption in structure prediction

Native structure located in global minimum (free) energy conformation (GMEC)

➜ A good energy function can select the correct model among decoys

➜ A good sampling technique can find the GMEC in the rugged landscape

[Figure: energy landscape, energy E plotted over conformation space, with the GMEC marked]

Two-Step Procedure

1. Low-resolution step locates potential minima (fast)

2. Cluster analysis identifies broadest basins in landscape

3. High-resolution step can identify lowest energy minimum in the basins (slow)
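The three steps above can be mimicked on a one-dimensional toy landscape. This is a minimal sketch under stated assumptions: `energy` is an invented rugged function, "low resolution" is a coarse grid scan, clustering is simple grouping of adjacent low-energy grid points, and "high resolution" is a fine grid search inside the broadest cluster. None of this is Rosetta code.

```python
import math


def energy(x):
    """Hypothetical rugged landscape: a broad parabola plus ripples."""
    return 0.5 * (x - 1.0) ** 2 + math.sin(8.0 * x)


def low_resolution_scan(lo, hi, step):
    """Step 1: evaluate the energy on a coarse grid (fast)."""
    n = int(round((hi - lo) / step))
    return [(lo + i * step, energy(lo + i * step)) for i in range(n + 1)]


def cluster(samples, cutoff, width):
    """Step 2: group low-energy samples that lie within `width` of each other."""
    low = sorted(x for x, e in samples if e <= cutoff)
    clusters = []
    for x in low:
        if clusters and x - clusters[-1][-1] <= width:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters


def refine(center, half_width, step):
    """Step 3: fine grid search around the cluster center (slow)."""
    n = int(round(2 * half_width / step))
    xs = [center - half_width + i * step for i in range(n + 1)]
    return min(xs, key=energy)


samples = low_resolution_scan(-4.0, 6.0, 0.25)
cutoff = sorted(e for _, e in samples)[len(samples) // 5]   # keep roughly the lowest 20%
clusters = cluster(samples, cutoff, width=0.5)
broadest = max(clusters, key=len)                           # broadest basin = largest cluster
center = sum(broadest) / len(broadest)
best_x = refine(center, half_width=0.5, step=0.001)
print(round(best_x, 3), round(energy(best_x), 3))
```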

[Figure: energy landscape, energy E plotted over conformation space, with the GMEC marked]

Low-Resolution Step

Structure Representation:

• Equilibrium bonds and angles (Engh & Huber 1991)

• Centroid: average location of center of mass of side-chain (Centroid | aa, φ, ψ)

• No modeling of side chains

• Fast

Low-Resolution Scoring Function

Bayes Theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)

• P(seq): constant

• P(seq | str): sequence-dependent features

• P(str): structure-dependent features

• Independent components prevent over-counting
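A short derivation shows why a product of probabilities turns into a sum of score terms. It assumes, as is usual for statistics-derived scoring functions but not stated on the slide, that the score is the negative logarithm of the probability.

```latex
\begin{align}
P(\mathrm{str}\mid\mathrm{seq}) &= \frac{P(\mathrm{str})\,P(\mathrm{seq}\mid\mathrm{str})}{P(\mathrm{seq})} \\
-\ln P(\mathrm{str}\mid\mathrm{seq}) &= \underbrace{-\ln P(\mathrm{str})}_{\text{structure-dependent}}
 \;\underbrace{-\,\ln P(\mathrm{seq}\mid\mathrm{str})}_{\text{sequence-dependent}}
 \;+\;\underbrace{\ln P(\mathrm{seq})}_{\text{constant}}
\end{align}
```

For a fixed sequence the last term does not change between candidate structures, so it can be dropped when ranking models.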

Bayes Theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)

Score = Senv + Spair + …

neighbors: Cβ-Cβ < 10 Å
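The neighbor definition can be made concrete with a small counting routine. This is a sketch only: the function name and the example coordinates are invented, and how the counts feed into Senv is not shown.

```python
import math


def count_neighbors(cb_coords, cutoff=10.0):
    """Count, for each residue, the other residues whose C-beta lies within `cutoff` (in Å)."""
    counts = [0] * len(cb_coords)
    for i in range(len(cb_coords)):
        for j in range(i + 1, len(cb_coords)):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


# Four made-up C-beta positions on a line (Å).
cb = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (12.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(count_neighbors(cb))  # [1, 2, 1, 0]
```

A residue with many neighbors is buried, one with few is exposed.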

Sequence-Dependent Components

Rohl et al. (2004) Methods in Enzymology 383:66

Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999

P(str | seq) = P(str) * P(seq | str) / P(seq)

Score = … + Srg + Scb + Svdw + …

Structure-Dependent Components


Score = … + Sss + …



Score = … + Ssheet + Shs + … + Srama


High-Resolution Step

• Slow, exact step

• Locates global energy minimum

Structure Representation:

• All-atom (including polar and non-polar hydrogens, but no water)

• Side chains as rotamers from backbone-dependent library (Dunbrack 1997)

• Side chain conformation adjusted frequently

High-Resolution Step: Rotamer Libraries

• Side chains have preferred conformations

• They are summarized in rotamer libraries

• Select one rotamer for each position

• Best conformation: lowest-energy combination of rotamers

Serine χ1 preferences: t = 180°, g- = -60°, g+ = +60°
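"Lowest-energy combination of rotamers" can be illustrated by brute-force enumeration on a toy problem. The energies and the `clash` function below are invented for the example. Enumeration grows exponentially with the number of positions, so it is a teaching device only, and real side-chain packing relies on a smarter search.

```python
import itertools


def best_rotamer_combination(self_energy, pair_energy):
    """Try every combination of one rotamer per position; return the lowest-energy one."""
    n_rotamers = [len(row) for row in self_energy]
    best_combo, best_e = None, float("inf")
    for combo in itertools.product(*(range(n) for n in n_rotamers)):
        e = sum(self_energy[i][r] for i, r in enumerate(combo))
        for i in range(len(combo)):
            for j in range(i + 1, len(combo)):
                e += pair_energy(i, combo[i], j, combo[j])
        if e < best_e:
            best_combo, best_e = combo, e
    return best_combo, best_e


# Three positions, three rotamers each (index 0 = t, 1 = g-, 2 = g+); made-up energies.
self_energy = [[0.0, 1.0, 2.0],
               [1.0, 0.0, 2.0],
               [0.5, 0.5, 0.0]]


def clash(i, ri, j, rj):
    """Hypothetical pair term: neighbors in sequence clash if they pick the same rotamer."""
    return 5.0 if j == i + 1 and ri == rj else 0.0


combo, e = best_rotamer_combination(self_energy, clash)
print(combo, e)  # (0, 1, 2) 0.0
```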

High-Resolution Scoring Function

• Major contributions:
  – Burial of hydrophobic groups away from water
  – Void-free packing of buried groups and atoms
  – Buried polar atoms form intra-molecular hydrogen bonds

Packing interactions

Score = SLJ(atr + rep) + …

• r_ij: distance between atoms i and j

• Linearized repulsive part

• ε: well depth from CHARMm19
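A linearized repulsive part can be sketched as a 12-6 Lennard-Jones curve that is replaced by its tangent line at short distance, so the energy stays finite when atoms overlap. The functional form, the parameter names, and the switch point (0.6 × r0) are assumptions chosen for illustration. The slide only states that the repulsion is linearized and that ε comes from CHARMm19.

```python
def lj_energy(r, r0, eps, switch=0.6):
    """12-6 Lennard-Jones with minimum -eps at r = r0; linear for r < switch * r0."""

    def lj(d):
        q = r0 / d
        return eps * (q ** 12 - 2.0 * q ** 6)

    def dlj(d):
        # Analytical derivative of lj with respect to distance.
        q = r0 / d
        return eps * (-12.0 * q ** 12 + 12.0 * q ** 6) / d

    r_switch = switch * r0
    if r >= r_switch:
        return lj(r)
    # Below the switch distance, follow the tangent line instead of the r^-12 wall.
    return lj(r_switch) + dlj(r_switch) * (r - r_switch)


for r in (0.0, 2.0, 2.4, 4.0, 6.0):
    print(r, round(lj_energy(r, r0=4.0, eps=0.2), 3))
```

The linear branch keeps clashes strongly penalized while avoiding the near-infinite energies that would dominate the score during sampling.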


Implicit solvation

Score = … + Ssolvation + …

Lazaridis & Karplus, Proteins 1999

• The solvation free energy density of atom i is a function of $x_{ij}^2$, where $x_{ij} = (r_{ij} - R_i)/\lambda_i$

• The symmetric contribution of atom j uses $x_{ji}^2$
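For orientation, the Gaussian solvent-exclusion form published by Lazaridis & Karplus can be written out as below. The symbols $\Delta G_i^{\mathrm{ref}}$, $\Delta G_i^{\mathrm{free}}$, $V_j$ (volume of atom j), $R_i$ (van der Waals radius) and $\lambda_i$ (correlation length) follow that paper rather than the slide, and the exact variant used in the score function may differ.

```latex
\begin{align}
\Delta G_i^{\mathrm{solv}} &= \Delta G_i^{\mathrm{ref}} - \sum_{j \neq i} f_i(r_{ij})\, V_j \\
f_i(r_{ij}) &= \frac{2\,\Delta G_i^{\mathrm{free}}}{4\pi\sqrt{\pi}\,\lambda_i\, r_{ij}^{2}}
 \exp\!\left(-x_{ij}^{2}\right), \qquad x_{ij} = \frac{r_{ij} - R_i}{\lambda_i}
\end{align}
```

Each neighboring atom j excludes solvent from around atom i and so reduces its solvation free energy. The same expression with i and j swapped gives the $x_{ji}^2$ term.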

Hydrogen Bonds (original function)

Score = … + Shb(srbb+lrbb+sc) + …

• srbb: short range, backbone HB

• lrbb: long range, backbone HB

• sc: HB with side chain atom

(Kortemme, 2003; Morozov 2004)


Hydrogen Bonding Energy

Based on statistics from high-resolution structures in the Protein Data Bank (rcsb.org)

$\Delta G = -kT \ln P$

$E_{HB} = W_{HB}\,[\,E(\delta_{HA}) + E(\Theta) + E(\Psi) + E(X)\,]$

(Kortemme, Morozov & Baker 2003 JMB)

Slide from Jeff Gray

Rotamer preference

Score = … + Sdunbrack + ….

Dunbrack, 1997


One long, generic function ….

Score = Senv + Spair + Srg + Scb + Svdw + Sss + Ssheet + Shs + Srama + Shb(srbb + lrbb) + docking_score + Sdisulf_cent + Srs + Sco + Scontact_prediction + Sdipolar + Sprojection + Spc + Stether + Sφψ + Sω + Ssymmetry + Ssplicemsd + …

docking_score = Sd env + Sd pair + Sd contact + Sd vdw + Sd site constr + Sd + Sfab score

Score = SLJ(atr + rep) + Ssolvation + Shb(srbb+lrbb+sc) + Sdunbrack + Spair – Sref + Sprob1b + Sintrares + Sgb_elec + Sgsolt + Sh2o(solv + hb) + S_plane

Scoring Function: Summary

Representations of protein structure: Cartesian and polar coordinates

Position  PHI    PSI     OMEGA    CHI1    CHI2  CHI3  CHI4
1         0.00   -60.00  -180.00  -60.00  0.00  0.00  0.00
2
3
…

PDB                     x       y        z
ATOM 490 N  GLN A 31  52.013  -87.359   -8.797  1.00  7.06  N
ATOM 491 CA GLN A 31  52.134  -87.762  -10.201  1.00  8.67  C
ATOM 492 C  GLN A 31  51.726  -89.222  -10.343  1.00 10.90  C
ATOM 493 O  GLN A 31  51.015  -89.601  -11.275  1.00  9.63  O
…

2 ways to represent the protein structure

Cartesian coordinates (x,y,z; pdb format)

+ Intuitive – look at molecules in space
+ Easy calculation of energy score (based on atom-atom distances)
– Difficult to change conformation of structure (while keeping bond length and bond angle unchanged)

Polar coordinates (Φ-Ψ-Ω; equilibrium angles and bond lengths)

+ Compact (3 values/residue)
+ Easy changes of protein structure (turn around one or more dihedral angles)
– Non-intuitive
– Difficult to evaluate energy score (calculation of neighboring matrix complicated)

A snake in the 2D world

• Cartesian representation:
  points: (0,0), (1,1), (1,2), (2,2), (3,3)
  connections (predefined): 1-2, 2-3, 3-4, 4-5

[Figure: the five points plotted in the x-y plane, numbered 1 to 5 and joined by the bonds 1-2, 2-3, 3-4, 4-5]

A snake in the 2D world

• Internal coordinates:
  bond lengths (predefined): √2, 1, 1, √2
  angles: 45°, 90°, 0°, 45°

[Figure: the same snake in the x-y plane, labeled with its bond lengths and angles; polar-coordinate illustration from Wikipedia]

A snake wiggling in the 2D world

• Constraint: keep bond length fixed

• Move in Cartesian representation:
  (0,0), (1,1), (1,2), (2,2), (3,3) → (0,0), (1,1), (1,2), (2,2), (3,0)

Bond length changed! The last bond, from (2,2) to (3,0), now has length √5 instead of √2.
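The effect of the Cartesian move is easy to verify numerically. The helper name `bond_lengths` is made up for this sketch.

```python
import math


def bond_lengths(points):
    """Distance between each pair of consecutive points."""
    return [math.dist(a, b) for a, b in zip(points, points[1:])]


before = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)]
after = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 0)]   # only the last point was moved

print([round(d, 3) for d in bond_lengths(before)])
print([round(d, 3) for d in bond_lengths(after)])
```

Only the last bond is affected, but its length no longer matches the predefined value, so the move violates the constraint.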

A snake wiggling in the 2D world

• Constraint: keep bond length fixed

• Move in polar coordinates:
  45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°

Bond length unchanged! Large impact on structure
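The same check for a move in internal coordinates: rebuild the points from bond lengths and angles with x = r cos θ, y = r sin θ for each bond. The function name `build_points` is hypothetical, and the angles are read as the direction of each bond relative to the x-axis, which is what the snake's values imply.

```python
import math


def build_points(lengths, angles_deg, origin=(0.0, 0.0)):
    """Chain the bonds head to tail, each with its own length and direction."""
    points = [origin]
    for r, angle in zip(lengths, angles_deg):
        x, y = points[-1]
        theta = math.radians(angle)
        points.append((x + r * math.cos(theta), y + r * math.sin(theta)))
    return points


lengths = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]
before = build_points(lengths, [45, 90, 0, 45])
after = build_points(lengths, [45, 90, 45, 45])    # third angle changed from 0 to 45 degrees

print([(round(x, 3), round(y, 3)) for x, y in before])
print([(round(x, 3), round(y, 3)) for x, y in after])
```

Every bond keeps its length by construction, while all points after the changed angle are displaced.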

Polar → Cartesian coordinates

Convert r and θ to x and y

bond lengths √2, 1, 1, √2 and angles 45°, 90°, 0°, 45° → points (0,0), (1,1), (1,2), (2,2), (3,3)

[Figure: polar-coordinate illustration from Wikipedia]

Cartesian → polar coordinates

Convert x and y to r and θ

points (0,0), (1,1), (1,2), (2,2), (3,3) → bond lengths √2, 1, 1, √2 and angles 45°, 90°, 0°, 45°
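Converting x and y to r and θ for each bond takes one `hypot` and one `atan2` call. This is a minimal sketch with a made-up function name, using the same angle convention as the snake (direction of each bond relative to the x-axis).

```python
import math


def to_internal(points):
    """Return (bond lengths, bond directions in degrees) for a 2D chain."""
    lengths, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        lengths.append(math.hypot(dx, dy))                # r
        angles.append(math.degrees(math.atan2(dy, dx)))   # theta
    return lengths, angles


snake = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)]
lengths, angles = to_internal(snake)
print([round(r, 3) for r in lengths])
print([round(a, 1) for a in angles])
```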

Moving the snake to the 3D world

• Cartesian representation:
  points: additional z-axis (0,0,0), (1,1,0), (1,2,0), (2,2,0), (3,3,0)
  connections (predefined): 1-2, 2-3, 3-4, 4-5

• Internal coordinates:
  bond lengths (predefined): √2, 1, 1, √2
  angles: 45°, 90°, 0°, 45°
  dihedral angles: 180°, 180°

Proteins: bond lengths and angles fixed. Only dihedral angles are varied

Dihedral angles

• Dihedral angle: defines geometry of 4 consecutive atoms (given bond lengths and angles)

• Dihedral angles χ1-χ4 define the side chain

From Wikipedia: http://en.wikipedia.org/wiki/Image:Bond_dihedral_angle.png

What we learned from our snake

• Cartesian representation: easy to look at, difficult to move
  – Moves do not preserve bond length (and angles in 3D)

• Internal coordinates: easy to move, difficult to see
  – Calculation of distances between points not trivial

Proteins: bond lengths and angles fixed. Only dihedral angles are varied

Solution: toggle

CALCULATE ENERGY - Cartesian coordinates:

Derive distance matrix (neighbor list) for energy score calculation

PDB                     x       y        z
ATOM 490 N  GLN A 31  52.013  -87.359   -8.797  1.00  7.06  N
ATOM 491 CA GLN A 31  52.134  -87.762  -10.201  1.00  8.67  C
ATOM 492 C  GLN A 31  51.726  -89.222  -10.343  1.00 10.90  C
ATOM 493 O  GLN A 31  51.015  -89.601  -11.275  1.00  9.63  O
…

Transform: build positions in space according to dihedral angles

MOVE STRUCTURE - Polar coordinates:

Introduce changes in structure by rotating around dihedral angle(s) (change Φ-Ψ values)

Position  PHI    PSI     OMEGA    CHI1    CHI2  CHI3  CHI4
1         0.00   -60.00  -180.00  -60.00  0.00  0.00  0.00
2
3
…

Transform: calculate dihedral angles from coordinates

(0,0), (1,1), (1,2), (2,2), (3,3) ↔ 45°, 90°, 0°, 45°

Cartesian → polar coordinates

PDB                     x       y        z
…
ATOM 490 C  GLN A 31  52.013  -87.359   -8.797  1.00  7.06  C
ATOM 491 N  GLY A 32  52.134  -87.762  -10.201  1.00  8.67  N
ATOM 492 CA GLY A 32  51.726  -89.222  -10.343  1.00 10.90  C
ATOM 493 O  GLY A 32  51.015  -89.601  -11.275  1.00  9.63  O
…

Position  PHI     PSI     OMEGA    CHI1  CHI2  CHI3  CHI4
…
32        -59.00  -60.00  -180.00  0.00  0.00  0.00  0.00
33
34
…

How to calculate polar from Cartesian coordinates: example Φ: C’-N-Cα-C

– define plane perpendicular to N-Cα (b2) vector
– calculate projection of Cα-C (b3) and C’-N (b1) onto plane
– calculate angle between projections

http://en.wikipedia.org/wiki/Image:Bond_dihedral_angle.png
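The projection recipe can be coded directly with plain tuples. This is a sketch with some assumptions: the helper names are invented, b1 is taken pointing from the second atom back to the first so that a cis arrangement gives 0° and trans gives 180°, and the sign is chosen to follow the usual convention (positive when the far bond is rotated clockwise, looking along the central bond).

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees for four consecutive atoms (e.g. C', N, CA, C for phi)."""
    b1 = sub(p0, p1)                      # N -> C'
    b2 = sub(p2, p1)                      # N -> CA, the rotation axis
    b3 = sub(p3, p2)                      # CA -> C
    norm = math.sqrt(dot(b2, b2))
    axis = (b2[0] / norm, b2[1] / norm, b2[2] / norm)
    # Project b1 and b3 onto the plane perpendicular to the axis.
    v = tuple(b1[k] - dot(b1, axis) * axis[k] for k in range(3))
    w = tuple(b3[k] - dot(b3, axis) * axis[k] for k in range(3))
    # Signed angle between the two projections.
    return math.degrees(math.atan2(dot(cross(axis, v), w), dot(v, w)))


# First four points of the 3D snake: planar and trans.
print(dihedral((0, 0, 0), (1, 1, 0), (1, 2, 0), (2, 2, 0)))
```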

Polar → Cartesian coordinates

Position  PHI     PSI     OMEGA    CHI1  CHI2  CHI3  CHI4
…
32        -59.00  -60.00  -180.00  0.00  0.00  0.00  0.00
33
34
…

PDB                     x       y        z
…
ATOM 490 C  GLN A 31  52.013  -87.359   -8.797  1.00  7.06  C
ATOM 491 N  GLY A 32  52.134  -87.762  -10.201  1.00  8.67  N
ATOM 492 CA GLY A 32  51.726  -89.222  -10.343  1.00 10.90  C
ATOM 493 O  GLY A 32  51.015  -89.601  -11.275  1.00  9.63  O
…

Find x,y,z coordinates of C, based on atom positions of C’, N and Cα, and a given Φ value (Φ: C’-N-Cα-C)

• create Cα-C vector:
  – size Cα-C = 1.51 Å (equilibrium bond length)
  – angle N-Cα-C = 111° (equilibrium value for N-Cα-C angle)

• rotate vector around N-Cα axis to obtain projections of Cα-C and N-C’ with wanted Φ
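Placing C from C’, N, Cα and a chosen Φ can be sketched by building a local frame on the N-Cα bond. The function names and the input coordinates are made up for illustration. Only the bond length 1.51 Å and the angle 111° come from the slide, and the Φ sign convention is the usual one (cis = 0°).

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)


def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Return d such that |c-d| = bond_length, angle(b,c,d) = bond_angle, dihedral(a,b,c,d) = dihedral."""
    theta, phi = math.radians(bond_angle_deg), math.radians(dihedral_deg)
    bc = unit(sub(c, b))                       # rotation axis (N -> CA)
    n = unit(cross(sub(b, a), bc))             # normal of the a-b-c plane
    m = cross(n, bc)                           # in-plane direction pointing toward a
    local = (-bond_length * math.cos(theta),
             bond_length * math.sin(theta) * math.cos(phi),
             bond_length * math.sin(theta) * math.sin(phi))
    return tuple(c[k] + local[0] * bc[k] + local[1] * m[k] + local[2] * n[k] for k in range(3))


def dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees, used here only to check the result."""
    axis = unit(sub(p2, p1))
    b1, b3 = sub(p0, p1), sub(p3, p2)
    v = tuple(b1[k] - dot(b1, axis) * axis[k] for k in range(3))
    w = tuple(b3[k] - dot(b3, axis) * axis[k] for k in range(3))
    return math.degrees(math.atan2(dot(cross(axis, v), w), dot(v, w)))


# Hypothetical positions of C', N and CA (Å); not taken from the PDB lines above.
c_prev, n_atom, ca = (-0.5, 1.2, 0.3), (0.0, 0.0, 0.0), (1.4, 0.2, -0.1)
c_atom = place_atom(c_prev, n_atom, ca, 1.51, 111.0, -60.0)
print([round(x, 3) for x in c_atom])
print(round(dihedral(c_prev, n_atom, ca, c_atom), 3))
```

Repeating this placement atom by atom, starting from the first atom, rebuilds the whole chain from its dihedral angles.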

Representation of protein structure

Rosetta folding

• 3 backbone dihedral angles per residue

• Sampling and minimization in TORSIONAL space: change angle and rebuild, starting from changed angle

• Build coordinates of structure starting from first atom, according to dihedral angles (and equilibrium bond length and angle)

Based on slides by Chu Wang
