Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Tutorial on protein structureprediction
Kevin Karplus
Biomolecular Engineering Department
Undergraduate and Graduate Director, Bioinformatics
University of California, Santa Cruz
Karplus lab – p.1/32
Outline of Talk
What is Bioengineering? Biomolecular Engineering?Bioinformatics?
What is a protein?
The folding problem and variants on it:Local structure predictionFold recognitionComparative modeling“Ab initio” methodsContact prediction
Protein Design
Karplus lab – p.2/32
What is Bioengineering?
Three concentrations:
BiomolecularDrug designBiomolecular sensorsNanotechnologyBioinformatics
Rehabilitation
Bioelectronics
Karplus lab – p.3/32
What is Bioengineering?
Three concentrations:
Biomolecular
RehabilitationSystems to held individuals with special needsCell-phone-based systems to reach large numbersof people.Novel hardware to assist the blindRobotics for rehabilitation and surgery applications.
Bioelectronics
Karplus lab – p.4/32
What is Bioengineering?
Three concentrations:
Biomolecular
Rehabilitation
BioelectronicsImplantable devicesInterfacing between organisms and electronicsArtificial retina project
Karplus lab – p.5/32
What to take early
Mathematics
Chemistry and then biology
Introductory bioengineering courses:BME80G, Bioethics (F)BME5, Intro to Biotechnology (W, S)CMPE80A: Universal Access: Disability, Technology,and Society (W, S)
Declare your major immediately!!You can always change to another one later.Bioengineering is one of the most course-intensivemajors on campus. Many courses have prerequisites.It’s important to get advising office and faculty adviceearly.
Karplus lab – p.6/32
What is Biomolecular Engineering?
Engineering with , of , or for biomolecules. For example,
with: using proteins (or DNA, RNA, . . . ) as sensors or forself-assembly.
of: protein engineering—designing or artificially evolvingproteins to have particular functions
for: designing high-throughput experimental methods tofind out what molecules are present, how they arestructured, and how they interact.
Karplus lab – p.7/32
What is Bioinformatics?
Bioinformatics: using computers and statistics to makesense out of the mountains of data produced byhigh-throughput experiments.
Genomics: annotating important sequences ingenomes.
Phylogenetics: tree of life, ancestral genomereconstruction.
Systems biology: discovering and modeling biologicalnetworks.
Expression profiling: what genes are turned on underwhat conditions (DNA microarrays, RNAseq).
Protein structure and function prediction.
Proteomics: what proteins are present in a mixture.Karplus lab – p.8/32
What is a protein?
There are many abstractions of a protein: a band on agel, a string of letters, a mass spectrum, a set of 3Dcoordinates of atoms, a point in an interactiongraph, . . . .
For us, a protein is a long skinny molecule (like a stringof letter beads) that folds up consistently into aparticular intricate shape.
The individual “beads” are amino acids, which have 6atoms the same in each “bead” (the backbone atoms: N,H, CA, HA, C, O).
The final shape is different for different proteins and isessential to the function. The protein shapes areimportant, but are expensive to determineexperimentally.
Karplus lab – p.9/32
Visualizing Proteins
There are many ways to look at proteins:
Strings of letters.
Sequence logos: letters plus conservation information.
Plastic models of structure.
Computer visualization of structure (rasmol, pymol,vmd, jmol, molmol, . . . )
Karplus lab – p.10/32
Sequence logos (MSA)
Summarize multiple alignment for 1jbeA:
1
2
3
XD
S
101
XGD
G
E
XIVLFY
YXVIL
VXL
VXRKK
T
XANEGSDPPXILVF
FXD
TXL
A
110
H
XE
AXE
TXMFVILLXL
EXA
EXA
KXVIL
LXR
NXA
KXL
I
120
XL
FXR
EXR
KXL
LXR
GXL
M
1
2
3
XANEGSD
G
51
XIVL
F
E
XALIV
VXFVIL
IXML
SXGSNEDDXVIL
W
T
XR
NXAFVILM
MXNAKDESGPP
60
XG
NXLM
MXTGNSD
DXSAGG
H
XIL
LXDE
EXVL
LXL
LXR
KXR
T
70
XVIL
IXKR
RXE
AXL
A
T
XL
MXP
SXD
AXIL
LXP
PXLIV
V
80
E
XALVI
LXFMAIVL
MXIVL
VXASTTXA
A
T
XL
EXA
AXD
KXE
K
H
XE
E
90
XD
NXA
IXV
IXR
AXA
AXL
AXE
QXA
AXDG
GXA
A
100
1
2
3
XA
A
1
XS
D
T
XS
KXG
EXL
LXKR
KXTALIV
F
E
XIVL
LXLIV
VXLAIVV
10
XAEDDXAKGSNEDDXSNED
FXP
S
H
XL
TXV
MXLR
RXE
RXL
IXVIL
V
20
XR
RXL
NXL
LXMVFILLXE
KXS
EXL
LXNDG
GXLFY
FXL
N
30
XV
NXLIV
V
E
XL
EXT
EXVA
AXA
EXTSND
DXG
G
H
XE
VXAQDE
D
40
XA
AXVIL
LXE
NXL
KXL
LXR
QXE
AXL
GXR
GXFP
Y
50
nostruct-align/1jbeA.t06 w0.5
Karplus lab – p.11/32
DEMO visualization
Demonstrate protein backbone using Darling Models
Demonstrate different views using Rasmol (or otherviewer)
Karplus lab – p.12/32
Folding Problem
The Folding Problem:If we are given a sequence of amino acids (the letters on astring of beads), can we predict how it folds up in 3-space?
>1jbeA Chemotaxis protein CHEY from E. coli
ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGY
GFVISDWNMPNMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQA
GASGYVVKPFTAATLEEKLNKIFEKLGM
↓
Too hard!Karplus lab – p.13/32
Fold-recognition problem
The Fold-recognition Problem:Given a sequence of amino acids A (the target sequence)and a library of proteins with known 3-D structures (thetemplate library),figure out which templates A match best, and align thetarget to the templates.
The backbone for the target sequence is predicted to bevery similar to the backbone of the chosen template.
Karplus lab – p.14/32
New-fold prediction
What if there is no template we can use?
We can try to generate many conformations of theprotein backbone and try to recognize the mostprotein-like of them.
Search space is huge, so we need a good conformationgenerator and a cheap cost function to evaluateconformations.
Karplus lab – p.15/32
Secondary structure Prediction
Instead of predicting the entire structure, we can predictlocal properties of the structure.
What local properties do we choose?
We want properties that are well-conserved throughevolution, easily predicted, and useful for finding andaligning templates.
One popular choice is a 3-valued helix/strand/otheralphabet—we have investigated many others. Typically,predictors get about 80% accuracy on 3-stateprediction.
Many machine-learning methods have been applied tothis problem, but the most successful are neuralnetworks.
Karplus lab – p.16/32
Contact prediction
Try to predict which residues come close to each other.
Ones close along the chain are easy (secondarystructure prediction).
Ones far apart along chain, but close in space, are hardto predict, but most useful.
Correlated mutation is powerful indication of closeresidues.
Karplus lab – p.17/32
(Rational) Protein Design
New direction for Karplus lab.
Use neural nets to predict amino acids from localstructure properties.
Use Undertaker to build models.
Use RosettaDesign (from Baker lab) to modifysequences.
Use Undertaker, Rosetta, and Gromacs to validate thatdesigned structure is good.
Target applications: short proteins that mimicagouti-related protein (and other proteins that bindmelanocortin receptors) but which do not have disulfidebridges.
Karplus lab – p.18/32
Sequence logos (NN)
Summarize local structure prediction:
1
2
3
XETC
S
101
XCGET
G
E
XGCTEYXCTEVXTCE
VXETC
K
T
XETCPXEBTCFXTCTXHA
110
H
XHAXHTXHLXHEXHEXHKXHLXHNXHKXHI
120
XCTHFXCTHEXTCH
KXHTCLXHTCGXHTCM
1
2
3
XTEC
G
51
XEF
E
XEVXEIXESXTCEDXTEC
W
T
XECT
NXEGCTMXGCTP
60
XGCTNXCTMXCT
DXTHG
H
XHLXHEXHLXHLXHKXHT
70
XHIXHRXHAXTCH
A
T
XCTMXCTSXCTAXECTLX
TECPXEV
80
E
XELXEMXCEVXTCETXECT
A
T
XECT
EXCTAXTCKXTHK
H
XHE
90
XHNXHIXHIXHAXHAXHAXHQXCHAXTCGXTCA
100
1
2
3
XTCA
1
XTCD
T
XETC
KXETC
EX
TEC
LXTCEKXCEF
E
XELXEVXCEV
10
X
TEC
DXTCDXTCFXGHS
H
XHTXHMXHRXHRXHIX
HV
20
X
HRX
HNX
HLXHLXHKXTHEX
TCH
LXTCGXTCFXETCN
30
XTCENXCEV
E
XCEEXCEEXTCEAXETCEXTCDXHG
H
XHVXHD
40
X
HAX
HLX
HNXHKXHLXHQXTHAXHTC
GXTCGX
ETCY
50
nostruct-align/1jbeA.t06 EBGHTL
Karplus lab – p.19/32
CASPCompetition Experiment
Everything published in literature “works”
CASP set up as true blind test of prediction methods.
Sequences of proteins about to be solved released toprediction community.
Predictions registered with organizers.
Experimental structures compared with solution byassessors.
“Winners” get papers in Proteins: Structure, Function, andBioinformatics.
Karplus lab – p.20/32
Overview of Prediction Method
Look for homologs.Homologs = proteins with common ancestralsequence.Can’t really determine algorithmicly, so we look for“sufficiently similar” sequences.
Make multiple sequence alignment (MSA).
Karplus lab – p.21/32
Overview of Prediction Method 2
Use MSA to make local structure predictions.
Use MSA (and local structure predictions) to makeHidden Markov Models (HMMs).
Use HMMs to find and align proteins of known structure(PDB).
Use model-building program to change alignments into3D models.
Clean up models (close gaps, rebuild loops, adjustsidechains, ...)
Choose best model(s) (Model Quality Assessment).
Maybe use contact predictions to select among models.
Karplus lab – p.22/32
Contact Prediction Method
Use mutual information between columns.
Thin alignments aggressively (30%, 35%, 40%, 50%,62%).
Compute e-value for mutual info (correcting forsmall-sample effects).
Compute rank of log(e-value) within protein.
Feed log(e-values), log rank, contact potential, jointentropy, and separation along chain for pair, andamino-acid profile, predicted burial, and predictedsecondary structure for each residue of pair into aneural net.
Karplus lab – p.23/32
Fold recognition results
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.001 0.01 0.1 1 10
Tru
e po
sitiv
es /
poss
ible
fold
mat
ches
False positives per query
Fold Recognition for 1415 SAM-T05 HMMs with w(amino-acid)=1
average w(near-backbone-11)=0.6 w(n_sep)=0.1average w(near-backbone-11)=0.6 w(str2)=0.25
w(near-backbone-11)=0.4w(n_sep)=0.1
w(str2)=0.1aa-only
T2K w(str2)=0.2T2K aa-only
Karplus lab – p.24/32
Contact prediction results
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.01 0.1 1
true
pos
itive
s/pr
edic
ted
predictions/residue
Accuracy of contact prediction, by protein
Neural netthin62, e-valuethin62, raw mi
0
5
10
15
20
25
30
0.01 0.1 1
avg
is_c
onta
ct/p
rob(
cont
act|s
ep)
predictions/residue
Weighted-accuracy of contact prediction, by protein
Neural netthin62, e-valuethin62, raw mi
Karplus lab – p.25/32
T0298 domain 2 (130–315)
RMSD= 2.468Å all-atom, 1.7567Å Cα, GDT=82.5%best model 1 submitted to CASP7 (red=real)
Karplus lab – p.26/32
Comparative modeling: T0348
RMSD= 11.8 Å Cα, GDT=58.2% (cartoon=real)best model 1 by CASP7 GDT, Robetta1 slightly better.
Karplus lab – p.27/32
Target T0201 (NF, CASP6)
We tried forcing various sheet topologies and selected4 by hand.
Model 1 has right topology (5.912Å all-atom, 5.219ÅCα).
Unconstrained cost function not good at choosingtopology (two strands curled into helices).
Helices were too short.
Karplus lab – p.28/32
Target T0230 (FR/A, CASP6)
Good except for C-terminal loop and helix floppedwrong way.
We have secondary structure right, including phase ofbeta strands.
Contact prediction helped, but we put too much weighton it—decoys fit predictions better than real structuredoes.
Karplus lab – p.30/32
Web sites
These slides: http://www.soe.ucsc.edu/˜karplus/papers/
structure-prediction-tutorial-jul-2009.pdf
Old CASP results—all our results and working notes:
http://www.soe.ucsc.edu/˜karplus/casp6/
http://www.soe.ucsc.edu/˜karplus/casp7/
http://www.soe.ucsc.edu/˜karplus/casp8/
SAM-T08 prediction server:
http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
UCSC bioinformatics and bioengineering degree programs:
http://www.bme.ucsc.edu/bioinformatics/
http://beng.soe.ucsc.edu/
Karplus lab – p.33/32