2. Introduction to Rosetta and structural modeling Approaches for structural modeling of proteins The Rosetta framework and its prediction modes Cartesian

2. Introduction to Rosetta and structural modeling

• Approaches for structural modeling of proteins
• The Rosetta framework and its prediction modes
• Cartesian and polar coordinates
• Sampling (finding the structure) and scoring (selecting the structure)

Structural Modeling of Proteins - Approaches

Prediction of Structure from Sequence

Flowchart:
1. Comparison of query sequence to nr database
2. Similar to a sequence of known structure?
   Yes: Homology Modeling (Comparative Modeling)
   No: go to step 3
3. Fits a known fold?
   Yes: Fold Recognition (Threading)
   No: Ab initio prediction

Protocols: ab initio, loops, side chains, active sites….

The Rosetta framework and its prediction modes

The Rosetta Strategy

• Observation: local sequence preferences bias, but do not uniquely define the local structure of a protein

• Goal: mimic interplay of local and global interactions that determine protein structure


Local interactions: fragments
• Derived from known structures
• Sampled for similar sequences/secondary structure propensity
• Fragment library represents accessible local structures for short sequence

Global (non-local) interactions: scoring function
• Buried hydrophobic residues, paired strands, specific side chain interactions, etc.
• Derived from known structures (statistics on preferred conformations)
• Boltzmann’s principle relates frequency to energy
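The link between frequency and energy can be sketched in a few lines. This is a minimal illustration only: the counts, the uniform reference state, and the function name boltzmann_energy are assumptions made for the example, not values or names from Rosetta.

```python
import math

def boltzmann_energy(p_observed, p_reference, kT=1.0):
    """Pseudo-energy from an observed frequency: E = -kT * ln(p_obs / p_ref)."""
    if p_observed <= 0 or p_reference <= 0:
        raise ValueError("probabilities must be positive")
    return -kT * math.log(p_observed / p_reference)

# Hypothetical counts: how often some residue type is seen buried vs. exposed.
counts = {"buried": 80, "exposed": 20}
total = sum(counts.values())
p_ref = 1.0 / len(counts)  # assumed uniform reference state

# Frequent states get negative (favorable) energies, rare states positive ones.
energies = {state: boltzmann_energy(n / total, p_ref) for state, n in counts.items()}
```

States seen more often than the reference expects come out favorable, which is the sense in which statistics on known structures become scoring terms.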

A short history of Rosetta

In the beginning: ab initio modeling of protein structure starting from sequence
• Short fragments of known proteins are assembled by a Monte Carlo strategy to yield native-like protein conformations
• Reliable fold identification for short proteins. Recently improved to high-resolution models (within 2 Å RMSD)

ATCSFFGRKLL…..

A short history of Rosetta

Success of the ab initio protocol led to extensions:
• Protein design
• Design of a new fold: TOP7
• Protein loop modeling; homology modeling
• Protein-protein docking; protein interface design
• Protein-ligand docking
• Protein-DNA interactions; RNA modeling
• Many more, e.g. solving the phase problem in X-ray crystallography



More recent additions

• BOINC (Rosetta@home)
• Foldit
• RosettaScripts; RosettaDiagrams
• PyRosetta

Scoring and Sampling

The basic assumption in structure prediction

Native structure located in global minimum (free) energy conformation (GMEC)

➜A good Energy function can select the correct model among decoys

➜A good sampling technique can find the GMEC in the rugged landscape

[Figure: energy (E) plotted over conformation space, with the GMEC marked]
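A sampling technique of the Monte Carlo kind mentioned in the history slide can be sketched on a toy landscape. Everything here is an assumption for illustration: the 1D energy function, the temperature, and the step size are made up, and real protocols move in dihedral space rather than along a line.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng=random):
    """Accept downhill moves always; uphill moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / kT)

def toy_energy(x):
    """Hypothetical rugged landscape: a broad well with ripples on top."""
    return 0.1 * (x - 2.0) ** 2 + math.sin(5.0 * x)

def monte_carlo(steps=5000, kT=0.5, step_size=0.5, seed=0):
    """Random walk with Metropolis acceptance; returns the best point visited."""
    rng = random.Random(seed)
    x = -3.0
    e = toy_energy(x)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)
        trial_e = toy_energy(trial)
        if metropolis_accept(trial_e - e, kT, rng):
            x, e = trial, trial_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

best_x, best_e = monte_carlo()
```

Occasional uphill acceptance is what lets the walk leave a local minimum of the rugged landscape instead of getting stuck in the first basin it finds.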

Two-Step Procedure

1. Low-resolution step locates potential minima (fast)

2. Cluster analysis identifies broadest basins in landscape

3. High-resolution step can identify lowest energy minimum in the basins (slow)

[Figure: energy (E) plotted over conformation space, with the GMEC marked]

Nature uses one scoring function…

Aim: one generic function for different applications

Optimization of parameters:
• Originally from small molecules (experiments & quantum mechanical calculations)
• Today: use of protein structures solved at high accuracy

How are scoring terms optimized?

Benchmarks:

Discriminate ground state from alternative conformations

Identify correct side chain conformation

Predict the effect of point mutations on stability (ΔΔG)

Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

Low-Resolution Step (e.g. score4)

Structure representation:
• Equilibrium bonds and angles (Engh & Huber 1991)
• Centroid: average location of the center of mass of the side chain (Centroid | aa, φ, ψ)
• No modeling of side chains
• Fast

Low-Resolution Scoring Function

Bayes' theorem:
P(str | seq) = P(str) * P(seq | str) / P(seq)

• P(seq): constant
• P(seq | str): sequence-dependent features
• P(str): structure-dependent features
• Independent components prevent over-counting
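One way to read the decomposition, assuming the score is taken as a negative log-probability (an assumption stated here for illustration), is that the product in Bayes' theorem becomes a sum of terms:

```latex
\begin{align}
P(\mathrm{str}\mid\mathrm{seq}) &= \frac{P(\mathrm{str})\,P(\mathrm{seq}\mid\mathrm{str})}{P(\mathrm{seq})} \\
-\ln P(\mathrm{str}\mid\mathrm{seq}) &= -\ln P(\mathrm{seq}\mid\mathrm{str}) \;-\; \ln P(\mathrm{str}) \;+\; \ln P(\mathrm{seq})
\end{align}
```

For a fixed sequence the last term is the same for every candidate structure, so only the sequence-dependent and structure-dependent terms affect the ranking of models.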


Sequence-Dependent Components

Bayes' theorem: P(str | seq) = P(str) * P(seq | str) / P(seq) (term shown here: P(seq | str))

Score = Senv + Spair + …

neighbors: Cβ-Cβ < 10 Å

Rohl et al. (2004) Methods in Enzymology 383:66
Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999

Structure-Dependent Components

P(str | seq) = P(str) * P(seq | str) / P(seq) (term shown here: P(str))

Score = … + Srg + Sc + Svdw + …

Score = … + Srama

High-Resolution Step (e.g. score12; Talaris; …)

Slow, exact step
• Locates global energy minimum

Structure representation:
• All-atom (including polar and non-polar hydrogens, but no water)
• Side chains as rotamers from backbone-dependent library
• Side chain conformation adjusted frequently

High-Resolution Step: Rotamer Libraries

• Side chains have preferred conformations
• They are summarized in rotamer libraries
• Select one rotamer for each position
• Best conformation: lowest-energy combination of rotamers

Serine χ1 preferences: t = 180°, g- = -60°, g+ = +60°

Dunbrack 1997
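"Lowest-energy combination of rotamers" can be made concrete with a toy search. This is a sketch under stated assumptions: the two positions, the energy terms, and all function names are invented for the example, and the search is plain exhaustive enumeration rather than any particular packing algorithm.

```python
from itertools import product

# Hypothetical toy problem: two positions, each with a few chi1 rotamers (degrees).
rotamers = {
    "pos1": [180.0, -60.0, 60.0],
    "pos2": [180.0, -60.0],
}

def self_energy(position, chi1):
    """Made-up one-body term: slightly penalize the g+ rotamer."""
    return 0.5 if chi1 == 60.0 else 0.0

def pair_energy(chi_a, chi_b):
    """Made-up two-body term: clash if both are trans, small bonus if they differ."""
    if chi_a == 180.0 and chi_b == 180.0:
        return 2.0
    if chi_a != chi_b:
        return -0.1
    return 0.0

def total_energy(assignment):
    """Sum one-body terms plus all pairwise terms for one rotamer assignment."""
    positions = list(assignment)
    energy = sum(self_energy(p, assignment[p]) for p in positions)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += pair_energy(assignment[positions[i]], assignment[positions[j]])
    return energy

def best_combination(rotamers):
    """Try every combination of one rotamer per position and keep the lowest energy."""
    positions = list(rotamers)
    best = None
    for combo in product(*(rotamers[p] for p in positions)):
        assignment = dict(zip(positions, combo))
        energy = total_energy(assignment)
        if best is None or energy < best[1]:
            best = (assignment, energy)
    return best

best_assignment, best_energy = best_combination(rotamers)
```

The number of combinations grows exponentially with the number of positions, so enumeration like this only works for very small problems.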

High-Resolution Scoring Function

• Major contributions:
– Burial of hydrophobic groups away from water
– Void-free packing of buried groups and atoms
– Buried polar atoms form intra-molecular hydrogen bonds

Packing interactions

Score = SLJ(atr + rep) + ….

rij: distance between atoms i and j

Linearized repulsive part

e: well depth from CHARMm19

(new in score12’: starts from minimum)
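The idea of a linearized repulsive part can be sketched as follows. The 12-6 form written in terms of the minimum-energy distance r_min, the switch point at 0.6 * r_min, and all function names are assumptions chosen for illustration; the slide itself only states that the repulsive part is linearized.

```python
def lj_12_6(r, r_min, epsilon):
    """Standard 12-6 Lennard-Jones energy with its minimum (-epsilon) at r = r_min."""
    x = r_min / r
    return epsilon * (x ** 12 - 2.0 * x ** 6)

def lj_12_6_slope(r, r_min, epsilon):
    """Derivative dE/dr of lj_12_6."""
    x = r_min / r
    return epsilon * (-12.0 * x ** 12 + 12.0 * x ** 6) / r

def lj_linearized(r, r_min, epsilon, switch=0.6):
    """Below r = switch * r_min, continue the curve along its tangent line."""
    r_switch = switch * r_min  # assumed switch point
    if r >= r_switch:
        return lj_12_6(r, r_min, epsilon)
    e_switch = lj_12_6(r_switch, r_min, epsilon)
    slope = lj_12_6_slope(r_switch, r_min, epsilon)
    return e_switch + slope * (r - r_switch)

# At short distances the linear branch stays finite where the raw r^-12 term explodes.
raw_close = lj_12_6(1.0, 4.0, 0.1)
linear_close = lj_linearized(1.0, 4.0, 0.1)
```

Keeping the repulsion finite means small overlaps in an otherwise good model are penalized without dominating the total score.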

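Implicit solvation

Score = … + Ssolvation + ….

Lazaridis & Karplus, Proteins 1999

[Equation: pairwise solvation term built from the solvation free energy density of atom i, with Gaussian exponents xij² and xji², where xij = (rij - Ri)/λi; polar groups are labeled in the figure]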

Hydrogen Bonding Energy

Based on statistics from high-resolution structures in the PDB

(Kortemme, Morozov & Baker 2003 JMB)

Slide from Jeff Gray


Score = …. + Shb(srbb+lrbb+sc) + ….

srbb: short range, backbone HB

lrbb: long range, backbone HB

sc: HB with side chain atom

Rotamer preference

Score = … + Sdunbrack + ….

Dunbrack, 1997


One long, generic function ….

Score = Senv+ Spair + Srg + Sc+ Svdw + Sss+ Ssheet+ Shs + Srama + Shb (srbb + lrbb) + docking_score + Sdisulf_cent+ Sr+ Sco + Scontact_prediction + Sdipolar+ Sprojection + Spc+ Stether+ S+ S+ Ssymmetry + Ssplicemsd + …..

docking_score = Sd env+ Sd pair + Sd contact+ Sd vdw+ Sd site constr + Sd + Sfab score

Score = SLJ(atr + rep) + Ssolvation + Shb(srbb+lrbb+sc) + Sdunbrack + Spair – Sref + Sprob1b + Sintrares + Sgb_elec + Sgsolt + Sh2o(solv + hb) + S_plane

Scoring Function: Summary

One long, generic function …. A weighted sum of different terms

Score12 = w1*SLJatr + w2*SLJrep + w3*Ssolvation + w4*Shb(srbb+lrbb+sc) + w5*Sdunbrack + w6*Spair – Sref
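A weighted sum of terms is simple to express in code. In this sketch the term names echo the formula above, but the weights, the raw values, and the function name weighted_score are placeholders invented for the example.

```python
def weighted_score(terms, weights, reference=0.0):
    """Total score = sum of weight * term value, minus a reference energy."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight for terms: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items()) - reference

# Placeholder numbers, not real Score12 weights or energies.
weights = {"LJatr": 1.0, "LJrep": 0.5, "solvation": 0.5, "hb": 1.0, "dunbrack": 0.5, "pair": 0.5}
terms = {"LJatr": -10.0, "LJrep": 2.0, "solvation": 4.0, "hb": -3.0, "dunbrack": 1.0, "pair": -1.0}

total = weighted_score(terms, weights, reference=0.5)
```

Because the weights sit outside the terms, they can be tuned against benchmarks without touching how each term is computed.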

How can it be improved?
• Feature Analysis Tool: improve parameters
• OptE: optimize weights

Feature Analysis: improve scoring term

Aim: similar distributions in crystal structures and models

e.g. HB distance H-O in Ser & Thr

After correction: distributions in native and model structures overlap

OptE: optimize weights

Score12 = w1*SLJatr + w2*SLJrep + w3*Ssolvation + w4*Shb(srbb+lrbb+sc) + w5*Sdunbrack + w6*Spair – Sref

Maximum likelihood parameter estimation

Benchmarks:
• Discriminate ground state from alternative conformations
• Identify correct side chain conformation
• Sequence recovery in design: choose correct amino acid residue
• Predict the effect of point mutations on stability (ΔΔG)
• & more …

Aim: best score for correct prediction

Representations of protein structure: Cartesian and polar coordinates

Position PHI    PSI     OMEGA    CHI1    CHI2  CHI3  CHI4
1        0.00   -60.00  -180.00  -60.00  0.00  0.00  0.00
2
3
…

PDB                        x       y        z
ATOM  490  N   GLN A  31   52.013  -87.359  -8.797   1.00  7.06  N
ATOM  491  CA  GLN A  31   52.134  -87.762  -10.201  1.00  8.67  C
ATOM  492  C   GLN A  31   51.726  -89.222  -10.343  1.00  10.90 C
ATOM  493  O   GLN A  31   51.015  -89.601  -11.275  1.00  9.63  O
…

2 ways to represent the protein structure

Cartesian coordinates (x,y,z; pdb format)
+ Intuitive: look at molecules in space
+ Easy calculation of energy score (based on atom-atom distances)
– Difficult to change conformation of structure (while keeping bond length and bond angle unchanged)

Polar coordinates (dihedral angles; bond angles and bond lengths kept at equilibrium values)
+ Compact (3 values/residue)
+ Easy changes of protein structure (turn around one or more dihedral angles)
– Non-intuitive
– Difficult to evaluate energy score (calculation of neighboring matrix complicated)

A snake in the 2D world

• Cartesian representation:
points: (0,0),(1,1),(1,2),(2,2),(3,3)
connections (predefined): 1-2, 2-3, 3-4, 4-5

[Figure: points 1 to 5 plotted on x-y axes and joined by bonds 1-2, 2-3, 3-4, 4-5]

A snake in the 2D world

• Internal coordinates:
bond lengths (predefined): √2, 1, 1, √2
angles: 45°, 90°, 0°, 45°

[Figure: the snake on x-y axes with bond lengths and angles labeled]

From wikipedia

A snake wiggling in the 2D world

• Constraint: keep bond length fixed

• Move in Cartesian representation:
(0,0),(1,1),(1,2),(2,2),(3,3) → (0,0),(1,1),(1,2),(2,2),(3,0)

Bond length changed! Bond 4-5 goes from √2 to √5.
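The effect of the Cartesian move on bond lengths can be checked directly. This is a small sketch; the helper name bond_lengths is made up for the example.

```python
import math

def bond_lengths(points):
    """Distance between each pair of consecutive points along the chain."""
    return [math.dist(a, b) for a, b in zip(points, points[1:])]

before = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)]
after = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 0)]

lengths_before = bond_lengths(before)  # last bond: sqrt(2)
lengths_after = bond_lengths(after)    # last bond: sqrt(1 + 4) = sqrt(5)
```

Moving a single point in Cartesian space changes the length of every bond attached to it unless the move is chosen very carefully.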

A snake wiggling in the 2D world

• Constraint: keep bond length fixed

• Move in polar coordinates:
45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°

Bond length unchanged!
Large impact on structure
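The polar move can be sketched by rebuilding the points from bond lengths and angles. The example assumes each angle is the direction of its bond relative to the x-axis, which reproduces the snake's coordinates; the function name polar_to_cartesian is made up for the example.

```python
import math

def polar_to_cartesian(lengths, angles_deg, origin=(0.0, 0.0)):
    """Walk along the chain: each bond adds (r*cos(theta), r*sin(theta)) to the last point."""
    points = [origin]
    x, y = origin
    for r, theta in zip(lengths, angles_deg):
        x += r * math.cos(math.radians(theta))
        y += r * math.sin(math.radians(theta))
        points.append((x, y))
    return points

lengths = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]

snake = polar_to_cartesian(lengths, [45, 90, 0, 45])     # the original snake
wiggled = polar_to_cartesian(lengths, [45, 90, 45, 45])  # third angle changed
```

Only one angle was edited, yet every point after the third moves, while all bond lengths stay exactly as given.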

Polar → Cartesian coordinates
Convert r and θ to x and y

45°, 90°, 0°, 45° and √2, 1, 1, √2 → (0,0),(1,1),(1,2),(2,2),(3,3)

From wikipedia

Cartesian → polar coordinates
Convert x and y to r and θ

(0,0),(1,1),(1,2),(2,2),(3,3) → 45°, 90°, 0°, 45° and √2, 1, 1, √2
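The conversion from x and y to r and θ can be sketched per bond. As before, the angle is assumed to be the bond direction relative to the x-axis, and the function name cartesian_to_polar is made up for the example.

```python
import math

def cartesian_to_polar(points):
    """For each bond, return its length r and its direction theta in degrees."""
    lengths, angles_deg = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        lengths.append(math.hypot(dx, dy))
        angles_deg.append(math.degrees(math.atan2(dy, dx)))
    return lengths, angles_deg

lengths, angles = cartesian_to_polar([(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)])
```

Using atan2 rather than a plain arctangent of dy/dx keeps the vertical bond (dx = 0) and all four quadrants well defined.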

Moving the snake to the 3D world

• Cartesian representation:
points: additional z-axis
(0,0,0),(1,1,0),(1,2,0),(2,2,0),(3,3,0)
connections (predefined): 1-2, 2-3, 3-4, 4-5

• Internal coordinates:
bond lengths (predefined): √2, 1, 1, √2
angles: 45°, 90°, 0°, 45°
dihedral angles: 180°, 180°

Proteins: bond lengths and angles fixed. Only dihedral angles are varied

Dihedral angles

Dihedral angles χ1-χ4 define the side chain

From wikipedia

• Dihedral angle: defines geometry of 4 consecutive atoms (given bond lengths and angles)

http://en.wikipedia.org/wiki/Image:Bond_dihedral_angle.png

What we learned from our snake

• Cartesian representation: Easy to look at, difficult to move
– Moves do not preserve bond length (and angles in 3D)

• Internal coordinates: Easy to move, difficult to see
– Calculation of distances between points not trivial

Proteins: bond lengths and angles fixed. Only dihedral angles are varied

Solution: toggle

CALCULATE ENERGY - Cartesian coordinates:
Derive distance matrix (neighbor list) for energy score calculation

PDB                        x       y        z
ATOM  490  N   GLN A  31   52.013  -87.359  -8.797   1.00  7.06  N
ATOM  491  CA  GLN A  31   52.134  -87.762  -10.201  1.00  8.67  C
ATOM  492  C   GLN A  31   51.726  -89.222  -10.343  1.00  10.90 C
ATOM  493  O   GLN A  31   51.015  -89.601  -11.275  1.00  9.63  O
…

Transform: calculate dihedral angles from coordinates

MOVE STRUCTURE - Polar coordinates:
Introduce changes in structure by rotating around dihedral angle(s) (change values)

Position PHI    PSI     OMEGA    CHI1    CHI2  CHI3  CHI4
1        0.00   -60.00  -180.00  -60.00  0.00  0.00  0.00
2
3
…

Transform: build positions in space according to dihedral angles

Cartesian → polar coordinates

(0,0),(1,1),(1,2),(2,2),(3,3) → 45°, 90°, 0°, 45°

Position PHI     PSI     OMEGA    CHI1  CHI2  CHI3  CHI4
…
32       -59.00  -60.00  -180.00  0.00  0.00  0.00  0.00
33
34
…

PDB                        x       y        z
…
ATOM  490  C   GLN A  31   52.013  -87.359  -8.797   1.00  7.06  N
ATOM  491  N   GLY A  32   52.134  -87.762  -10.201  1.00  8.67  C
ATOM  492  CA  GLY A  32   51.726  -89.222  -10.343  1.00  10.90 C
ATOM  493  O   GLY A  32   51.015  -89.601  -11.275  1.00  9.63  O
…

How to calculate polar from Cartesian coordinates: example φ: C’-N-Cα-C
– define plane perpendicular to N-Cα (b2) vector
– calculate projection of Cα-C (b3) and C’-N (b1) onto plane
– calculate angle between projections
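The three steps above can be sketched in plain Python. The helper names and the test points are made up for the example; the two outer bonds are taken as pointing away from the central bond, so that a cis arrangement gives 0° and a trans arrangement gives 180°.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p1, p2, p3, p4):
    """Signed dihedral p1-p2-p3-p4 in degrees, via projections onto the plane normal to p2-p3."""
    axis = sub(p3, p2)                           # central bond (b2)
    n = scale(axis, 1.0 / math.sqrt(dot(axis, axis)))
    u1 = sub(p1, p2)                             # first bond, pointing away from the axis
    u3 = sub(p4, p3)                             # last bond, pointing away from the axis
    v = sub(u1, scale(n, dot(u1, n)))            # projection of u1 onto the plane
    w = sub(u3, scale(n, dot(u3, n)))            # projection of u3 onto the plane
    x = dot(v, w)
    y = dot(cross(n, v), w)
    return math.degrees(math.atan2(y, x))        # signed angle between the projections

# Hypothetical test points: a right-angle twist around the z-axis.
angle = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

The atan2 form returns the sign as well as the size of the angle, which a plain arccos of the dot product would lose.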

http://en.wikipedia.org/wiki/Image:Bond_dihedral_angle.png

Polar → Cartesian coordinates

45°, 90°, 0°, 45° → (0,0),(1,1),(1,2),(2,2),(3,3)

Find x,y,z coordinates of C, based on atom positions of C’, N and Cα, and a given φ value (φ: C’-N-Cα-C)

• create Cα-C vector:
– size Cα-C = 1.51 Å (equilibrium bond length)
– angle N-Cα-C = 111° (equilibrium value for N-Cα-C angle)

• rotate vector around N-Cα axis to obtain projections of Cα-C and N-C’ with the wanted φ
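Placing the next atom from three known atoms, a bond length, a bond angle, and a dihedral can be sketched as below. The input coordinates are made up, the helper names are hypothetical, and the local-frame construction is one common way to do this, not necessarily the one Rosetta uses. The dihedral helper is included only to check the result.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])
def unit(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Return d such that |c-d| = bond_length, angle b-c-d = bond_angle, dihedral a-b-c-d as given."""
    theta = math.radians(bond_angle_deg)
    phi = math.radians(dihedral_deg)
    bc = unit(sub(c, b))                      # rotation axis (N -> C-alpha in the slide's example)
    n = unit(cross(sub(b, a), bc))            # normal to the a-b-c plane
    m = cross(n, bc)                          # in-plane direction perpendicular to the axis
    d = add(c, scale(bc, -bond_length * math.cos(theta)))
    d = add(d, scale(m, bond_length * math.sin(theta) * math.cos(phi)))
    d = add(d, scale(n, bond_length * math.sin(theta) * math.sin(phi)))
    return d

def dihedral(p1, p2, p3, p4):
    """Signed dihedral in degrees, used here only to verify place_atom."""
    n = unit(sub(p3, p2))
    u1, u3 = sub(p1, p2), sub(p4, p3)
    v = sub(u1, scale(n, dot(u1, n)))
    w = sub(u3, scale(n, dot(u3, n)))
    return math.degrees(math.atan2(dot(cross(n, v), w), dot(v, w)))

# Made-up positions for C', N and C-alpha; values from the slide for length, angle and phi.
c_prev, n_atom, ca = (-1.2, 0.8, 0.0), (0.0, 0.0, 0.0), (1.46, 0.0, 0.0)
c_atom = place_atom(c_prev, n_atom, ca, 1.51, 111.0, -59.0)
```

Repeating this step atom by atom from the first residue onward is how a whole chain can be rebuilt from its dihedral angles.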

Representation of protein structure

Rosetta folding

[Figure: chain of residues 1 to 8]

3 backbone dihedral angles per residue

Sampling and minimization in TORSIONAL space: change angle and rebuild, starting from changed angle

Build coordinates of structure starting from first atom, according to dihedral angles (and equilibrium bond length and angle)

Based on slides by Chu Wang

Representation of protein structure

Rosetta folding
• 3 backbone dihedral angles per residue
• Sampling and minimization in TORSIONAL space

Rosetta docking
• Backbone dihedral angles fixed (rigid-body)
• 6 rigid-body DOFs: 3 translational vectors, 3 rotational angles
• Sampling and minimization in RIGID-BODY space

[Figure: one chain (residues 1 to 8) for folding; two chains (1 to 8 and 1’ to 8’) for docking]

How can those two types of degrees of freedom be combined?

Fold tree representation

“peptide” edge – 3 backbone dihedral angles
“long-range” edge – 6 rigid-body DOFs

Example: fold-tree based docking

[Figure: chains 1 to 8 and 1’ to 8’, each held together by peptide edges and joined to each other by a long-range edge]

Originally developed to improve sampling of strand registers in β-sheet proteins. Allows simultaneous optimization of rigid-body and backbone/sidechain torsional degrees of freedom.

Construct fold-trees to treat a variety of protein folding and docking problems.

Fold tree: Bradley and Baker, Proteins (2006)
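The two edge types can be sketched as a small data structure. This is a hypothetical illustration, not Rosetta's own fold-tree class: the Edge type, the residue numbering (1 to 8 for one chain, 9 to 16 for the partner), the choice of jump residues, and the nominal DOF count are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int
    stop: int
    kind: str  # "peptide" (backbone dihedrals) or "jump" (6 rigid-body DOFs)

def is_valid_tree(n_residues, edges):
    """Check that the edges connect all residues 1..n_residues without forming a cycle."""
    links = []
    for e in edges:
        if e.kind == "peptide":
            step = 1 if e.stop >= e.start else -1
            links += [(i, i + step) for i in range(e.start, e.stop, step)]
        else:
            links.append((e.start, e.stop))
    if len(links) != n_residues - 1:      # a tree on n nodes has exactly n - 1 links
        return False
    neighbors = {i: set() for i in range(1, n_residues + 1)}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = set(), [edges[0].start]
    while stack:                          # traverse from the root to test connectivity
        i = stack.pop()
        if i not in seen:
            seen.add(i)
            stack.extend(neighbors[i] - seen)
    return len(seen) == n_residues

def nominal_dofs(n_residues, edges):
    """Rough count: 3 backbone dihedrals per residue plus 6 per jump."""
    return 3 * n_residues + 6 * sum(1 for e in edges if e.kind == "jump")

# Two 8-residue chains joined by one jump between residue 4 and residue 12.
docking_tree = [
    Edge(1, 8, "peptide"),
    Edge(4, 12, "jump"),
    Edge(12, 9, "peptide"),
    Edge(12, 16, "peptide"),
]
ok = is_valid_tree(16, docking_tree)
dofs = nominal_dofs(16, docking_tree)
```

Describing both chains and their relative placement in one tree is what allows torsional and rigid-body degrees of freedom to be sampled together.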

Fold-trees for different modeling tasks

Legend:
• N: N-terminal; C: C-terminal; X: chain break; O: root of the tree
• Edges: flexible “peptide” edge; rigid “peptide” edge; rigid “jump” (1 to 1’); flexible “jump” (1 to 1’)
• Color – flexible bb; Gray – fixed bb; Pale – symmetry operation
• Filled colored circles – flexible sc; empty colored circles – flexible amino acid: design

Tasks shown:
• protein folding
• loop modeling
• fully flexible docking
• docking w/ hinge motion
• docking w/ loop modeling

Rosetta3: Object-oriented architecture


Description of object-oriented organization in Rosetta3: Leaver-Fay et al. Methods in Enzymology (2013)

The Rosetta sampling strategy: A general overview
