Protein Classification
• Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?
Methods
• BLAST / PSI-BLAST
• Profile HMMs
• Supervised Machine Learning methods
[Figure: protein hierarchy (Fold → Superfamily → Family → Proteins), with a new protein to be placed at its correct position.]
PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct a position-specific scoring matrix M (sketched below). Each sequence y is given a weight so that many similar sequences cannot have too much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
[Figure: the resulting position-specific profile matrix M.]
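A minimal sketch of step 3 in Python, assuming the significant matches have already been stacked into equal-length, gap-free alignments; the function name and pseudocount scheme are illustrative, not PSI-BLAST's exact recipe:

```python
import numpy as np
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def position_specific_matrix(aligned_seqs, pseudocount=1.0):
    """Build the profile M from gap-free aligned matches to the query.
    Sequences receive Henikoff-style position-based weights so that many
    near-identical matches cannot dominate a column."""
    L = len(aligned_seqs[0])
    # Position-based weights (Henikoff & Henikoff 1994): at each column
    # with s distinct residue types, a residue seen in r sequences
    # contributes 1/(s*r) to each of those sequences' weights.
    w = np.zeros(len(aligned_seqs))
    for i in range(L):
        counts = Counter(seq[i] for seq in aligned_seqs)
        s = len(counts)
        for n, seq in enumerate(aligned_seqs):
            w[n] += 1.0 / (s * counts[seq[i]])
    w /= w.sum()
    # Weighted residue frequencies per column, with small pseudocounts.
    M = np.full((L, len(AA)), pseudocount / len(AA))
    for n, seq in enumerate(aligned_seqs):
        for i, c in enumerate(seq):
            M[i, AA.index(c)] += w[n]
    return M / M.sum(axis=1, keepdims=True)  # each row is a distribution
```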
Classification with Profile HMMs
[Figure: a profile HMM with match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm between BEGIN and END. To classify a new protein, it is scored against the profile HMM built for each family in the hierarchy (Fold → Superfamily → Family).]
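To make the scoring step concrete, here is a minimal forward-algorithm sketch for a profile HMM, under the simplifying assumptions that transition probabilities depend only on state type and that insert states emit from a uniform background (real profile HMMs estimate both per position):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def forward(seq, match_emit, a):
    """P(seq | profile HMM) by the forward algorithm.
    match_emit: (m+1, 20) array; row j >= 1 is the emission distribution
    of match state Mj (row 0 unused; BEGIN plays the role of M0).
    a: toy transition probabilities keyed by state types, e.g. a['MI']
    stands for P(Mj -> Ij) at every j."""
    m, n = match_emit.shape[0] - 1, len(seq)
    x = [AA.index(c) for c in seq]
    bg = 1.0 / 20                      # uniform insert-state emissions
    fM = np.zeros((m + 1, n + 1))      # fM[j, i] = P(x_1..x_i, in Mj)
    fI = np.zeros((m + 1, n + 1))
    fD = np.zeros((m + 1, n + 1))
    fM[0, 0] = 1.0                     # start in BEGIN
    for i in range(n + 1):
        for j in range(m + 1):
            if j > 0 and i > 0:        # match: consume one symbol, advance j
                fM[j, i] = match_emit[j, x[i-1]] * (
                    a['MM']*fM[j-1, i-1] + a['IM']*fI[j-1, i-1] + a['DM']*fD[j-1, i-1])
            if i > 0:                  # insert: consume one symbol, keep j
                fI[j, i] = bg * (
                    a['MI']*fM[j, i-1] + a['II']*fI[j, i-1] + a['DI']*fD[j, i-1])
            if j > 0:                  # delete: advance j, consume nothing
                fD[j, i] = a['MD']*fM[j-1, i] + a['ID']*fI[j-1, i] + a['DD']*fD[j-1, i]
    # END collects from Mm, Im, Dm
    return a['MM']*fM[m, n] + a['IM']*fI[m, n] + a['DM']*fD[m, n]

# To classify, score the new protein against each family's HMM and assign
# it to the family with the highest (length-normalized) score.
```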
The Fisher Kernel
• Fisher score: U_X = ∇_θ log P(X | H1, θ)
  - Quantifies how each parameter contributes to generating X
  - For two different sequences X and Y, we can compare U_X and U_Y
• Distance: D²_F(X, Y) = |U_X − U_Y|² / (2σ²)
• Given this distance function, K(X, Y) is defined as a similarity measure: K(X, Y) = exp(−D²_F(X, Y))
  - σ is set so that the average distance of training sequences X_i ∈ H1 to sequences X_j ∈ H0 is 1
[Figure: the profile HMM for family H1 whose parameters θ define the Fisher score.]
The Fisher Kernel
• To train a classifier for a given family H1:
  1. Build a profile HMM for H1
  2. U_X = ∇_θ log P(X | H1, θ) (Fisher score)
  3. D²_F(X, Y) = |U_X − U_Y|² / (2σ²) (distance)
  4. K(X, Y) = exp(−D²_F(X, Y)) (akin to a dot product)
  5. L(X) = Σ_{X_i ∈ H1} λ_i K(X, X_i) − Σ_{X_j ∈ H0} λ_j K(X, X_j)
  6. Iteratively adjust the λ's to optimize J(λ) = Σ_{X_i ∈ H1} λ_i (2 − L(X_i)) − Σ_{X_j ∈ H0} λ_j (2 + L(X_j))
• To classify a query X (sketched below):
  - Compute U_X
  - Compute K(X, X_i) for all training examples X_i with λ_i ≠ 0 (few)
  - Decide based on whether L(X) > 0
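A sketch of the classification step, assuming the Fisher score vectors U have already been computed from the profile HMM (the data layout and function names are illustrative):

```python
import numpy as np

def fisher_kernel(UX, UY, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2)) on Fisher scores."""
    d2 = np.sum((UX - UY) ** 2) / (2 * sigma ** 2)
    return np.exp(-d2)

def discriminant(UX, support, sigma):
    """L(X): signed-lambda-weighted kernel sum over the support vectors.
    support: list of (U_i, lam_i) pairs with lam_i > 0 for H1 examples
    and lam_i < 0 for H0 examples (only lambda != 0 examples are kept)."""
    return sum(lam * fisher_kernel(UX, Ui, sigma) for Ui, lam in support)

# Classify the query as a family member iff
# discriminant(U_X, support, sigma) > 0.
```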
QUESTION
• What is the running time of the Fisher kernel SVM on query X?
k-mer based SVMs
Leslie, Eskin, Weston, Noble; NIPS 2002
Highlights
• K(X, Y) = exp(−|U_X − U_Y|² / (2σ²)) requires an expensive profile alignment: computing U_X = ∇_θ log P(X | H1, θ) takes O(|X| |H1|) time
• Instead, the new kernel K(X, Y) just "counts up" k-mers, with mismatches, in common between X and Y: O(|X|) in practice
• Off-the-shelf SVM software used
k-mer based SVMs
• For a given word length k and mismatch tolerance l, define
  K(X, Y) = # of distinct k-long word occurrences shared by X and Y with ≤ l mismatches
• Define the normalized kernel K′(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
• An SVM can be learned by supplying this kernel function (see the sketch below)
• Example, with k = 3 and l = 1:
  X = ABACARDI
  Y = ABRADABI
  K(X, Y) = 4; K′(X, Y) = 4 / sqrt(7 · 7) = 4/7
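One way to read the slide's count is as pairs of k-mer occurrences, one from each sequence, matching with at most l mismatches (counting each unordered pair once when a sequence is compared with itself); this sketch reproduces the numbers above:

```python
from math import sqrt

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def kmer_kernel(X, Y, k=3, l=1):
    """Count pairs of k-mer occurrences (one from X, one from Y) that
    match with at most l mismatches. When X == Y, count each unordered
    pair once, which reproduces K(X, X) = 7 for the slide's example."""
    xs = [X[i:i+k] for i in range(len(X) - k + 1)]
    ys = [Y[j:j+k] for j in range(len(Y) - k + 1)]
    total = 0
    for i, u in enumerate(xs):
        for j, v in enumerate(ys):
            if X == Y and j < i:
                continue
            if hamming(u, v) <= l:
                total += 1
    return total

def normalized_kernel(X, Y, k=3, l=1):
    return kmer_kernel(X, Y, k, l) / sqrt(
        kmer_kernel(X, X, k, l) * kmer_kernel(Y, Y, k, l))

print(kmer_kernel("ABACARDI", "ABRADABI"))        # 4
print(normalized_kernel("ABACARDI", "ABRADABI"))  # 4/7 ~ 0.571
```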
SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X
Benchmarks
Semi-Supervised Methods
GENERATIVE SUPERVISED METHODS
DISCRIMINATIVE SUPERVISED METHODS
UNSUPERVISED METHODS
Mixture of Centers
• Data generated by a fixed set of centers (how many?)
Semi-Supervised Methods
• Some examples are labeled
• Assume labels vary smoothly among all examples
• SVMs and other discriminative methods may make significant mistakes due to lack of data
• Attempt to "contract" the distances within each cluster while keeping inter-cluster distances large
Semi-Supervised Methods
1. Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005: a PSI-BLAST profile-based method
2. Weston, Leslie, Elisseeff, Noble, NIPS 2003: cluster kernels

(semi) 1. Profile k-mer based SVMs
• For each sequence X, obtain its PSI-BLAST profile Q(X) = {p_i(β) : β an amino acid, 1 ≤ i ≤ |X|}
• For every k-mer x_j … x_{j+k−1} in X, define its σ-neighborhood
  M_{k,σ}(Q[x_j … x_{j+k−1}]) = {b_1 … b_k | −Σ_{i=0…k−1} log p_{j+i}(b_i) < σ}
• Define K(X, Y): for each word b_1 … b_k lying in the neighborhoods of m k-mers of X and n k-mers of Y, add m·n (see the sketch below)
• In practice, each k-mer is allowed ≤ 2 mismatches, and K(X, Y) can be computed quickly, in O(k² 20² (|X| + |Y|))
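A brute-force sketch of the definition, enumerating all 20^k words (feasible only for small k); the published algorithm avoids this enumeration with a trie and achieves the quoted bound:

```python
import numpy as np
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"

def in_neighborhood(word, P, j, sigma):
    """Is the word b_1..b_k in the sigma-neighborhood of the profile
    k-mer starting at position j? P is an (|X|, 20) matrix of profile
    probabilities p_i(beta)."""
    score = -sum(np.log(P[j + i, AA.index(b)]) for i, b in enumerate(word))
    return score < sigma

def profile_kmer_kernel(PX, PY, k, sigma):
    """K(X, Y): for each of the 20^k words b, multiply the number of
    k-mer positions of X and of Y whose neighborhood contains b, and
    sum those products."""
    nx, ny = PX.shape[0] - k + 1, PY.shape[0] - k + 1
    total = 0
    for word in product(AA, repeat=k):
        m = sum(in_neighborhood(word, PX, j, sigma) for j in range(nx))
        n = sum(in_neighborhood(word, PY, j, sigma) for j in range(ny))
        total += m * n
    return total
```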
(semi) 1. Discriminative motifs
• Under this kernel K(X, Y), each sequence X is mapped to Φ_{k,σ}(X), a vector in 20^k dimensions:
  Φ_{k,σ}(X)(b_1 … b_k) = # of k-mers in Q(X) whose neighborhood includes b_1 … b_k
• The SVM then learns a discriminating hyperplane with normal vector v:
  v = Σ_{i=1…N} (±) λ_i Φ_{k,σ}(X^(i))
• Consider a profile k-mer Q[x_j … x_{j+k−1}]; its contribution to v is ~ ⟨Φ_{k,σ}(Q[x_j … x_{j+k−1}]), v⟩
• Consider a position i in X, and count up the contributions of all words containing x_i:
  g(x_i) = Σ_{j=1…k} max{0, ⟨Φ_{k,σ}(Q[x_{i−k+j} … x_{i−1+j}]), v⟩}
• Sort these contributions across all positions of all sequences to pick out important positions, the discriminative motifs (a sketch of g follows)
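A sketch of the positional score g, assuming the feature vectors of the profile k-mer windows and the SVM normal v are already available:

```python
import numpy as np

def positional_scores(phi_windows, v, k):
    """g(x_i): total positive contribution to the SVM normal v of the
    k profile k-mer windows that cover position i. phi_windows[j] is
    the feature vector Phi of the window starting at position j."""
    n_windows = len(phi_windows)
    contrib = [max(0.0, float(np.dot(p, v))) for p in phi_windows]
    g = np.zeros(n_windows + k - 1)   # one score per sequence position
    for j, c in enumerate(contrib):
        g[j:j + k] += c               # window j covers positions j..j+k-1
    return g
```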
(semi) 2. Cluster Kernels
• Two (more!) methods:
1. Neighborhood
   1. For each X, run PSI-BLAST to get a set of similar sequences, Nbd(X)
   2. Define Φ_nbd(X) = 1/|Nbd(X)| Σ_{X′ ∈ Nbd(X)} Φ_original(X′): "counts of all k-mers matching, with at most 1 difference, all sequences that are similar to X"
   3. K_nbd(X, Y) = 1/(|Nbd(X)| |Nbd(Y)|) Σ_{X′ ∈ Nbd(X)} Σ_{Y′ ∈ Nbd(Y)} K(X′, Y′)
2. Bagged mismatch
   1. Run k-means clustering n times, giving assignments c_p(X) for p = 1, …, n
   2. For every X and Y, count up the fraction of runs in which they are bagged together:
      K_bag(X, Y) = 1/n Σ_p 1(c_p(X) = c_p(Y))
   3. Combine the "bag fraction" with the original comparison K(·, ·):
      K_new(X, Y) = K_bag(X, Y) K(X, Y)
Sketches of both cluster kernels follow.
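As referenced above, sketches of the two cluster kernels, assuming sequences are represented by precomputed feature vectors (e.g., k-mer counts); scikit-learn's KMeans stands in for the slide's repeated k-means runs:

```python
import numpy as np
from sklearn.cluster import KMeans

def neighborhood_kernel(base_K, nbd_X, nbd_Y):
    """K_nbd(X, Y): average the base kernel over all pairs drawn from
    the PSI-BLAST neighborhoods of X and of Y."""
    return (sum(base_K(xp, yp) for xp in nbd_X for yp in nbd_Y)
            / (len(nbd_X) * len(nbd_Y)))

def bagged_kernel(feats, n_runs=10, n_clusters=5):
    """K_bag[i, j] = fraction of k-means runs that put sequences i and
    j in the same cluster. feats: (num_seqs, dim) feature matrix.
    Multiply elementwise with the original kernel matrix for K_new."""
    num = feats.shape[0]
    K_bag = np.zeros((num, num))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=p).fit_predict(feats)
        K_bag += (labels[:, None] == labels[None, :])
    return K_bag / n_runs
```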
Some Benchmarks
Google-like homology search
• The internet and the network of protein homologies have something in common: both are scale-free networks
• Given query X, Google ranks webpages by a flow algorithm:
  - From each webpage W, linked neighbors receive flow
  - At time t+1, W sends to its neighbors the flow it received at time t
  - This defines a finite, ergodic, aperiodic Markov chain
  - Its stationary distribution can be found efficiently as the left eigenvector with eigenvalue 1: start with an arbitrary probability distribution and repeatedly multiply by the transition matrix (see the sketch below)
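A sketch of the power iteration just described, assuming a row-stochastic transition matrix T:

```python
import numpy as np

def stationary_distribution(T, n_iter=100):
    """T is row-stochastic: T[i, j] = P(next = j | current = i). The
    stationary distribution is the left eigenvector with eigenvalue 1,
    found by repeatedly multiplying a distribution by T."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(n_iter):
        pi = pi @ T
    return pi
```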
Google-like homology search
Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004: the RANKPROP algorithm for protein homology
• First, compute a matrix K_ij of PSI-BLAST homology between proteins i and j, normalized so that Σ_j K_ji = 1
1. Initialization: y_1(0) = 1; y_i(0) = 0 for i ≠ 1
2. For t = 0, 1, …:
3.   For i = 2 to m:
4.     y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)
In the end, let y_i be the ranking score for the similarity of sequence i to query sequence 1 (α = 0.95 works well); a sketch follows.
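A direct transcription of the iteration above, assuming K is the column-normalized PSI-BLAST similarity matrix with the query at index 0 (clamping the query's own score at 1 is an assumption about the omitted details):

```python
import numpy as np

def rankprop(K, alpha=0.95, n_iter=50):
    """RANKPROP sketch. K: (m, m) similarity matrix with the query at
    index 0, normalized so each column sums to 1. Returns y, where
    y[i] is the ranking score of sequence i relative to the query."""
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0
    for _ in range(n_iter):
        # y_i <- K[0, i] + alpha * sum_j K[j, i] * y_j
        y = K[0, :] + alpha * (K.T @ y)
        y[0] = 1.0   # keep the query's activation fixed (assumption)
    return y
```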
Google-like homology search
For a given protein family, what fraction of true members of the family are ranked higher than the first 50 non-members?
Protein Structure Prediction
Protein Structure Determination
• Experimental: X-ray crystallography, NMR spectroscopy
• Computational: structure prediction (the Holy Grail)
Sequence implies structure; therefore, in principle, we can predict the structure from the sequence alone.
Protein Structure Prediction
• Ab initio: use just first principles (energy, geometry, and kinematics)
• Homology: find the best match in a database of sequences with known 3-D structure
• Threading
• Meta-servers and other methods
Ab initio Prediction
• Sampling the global conformation space: lattice models / discrete-state models; molecular dynamics
• Picking native conformations with an energy function: a solvation model (how the protein interacts with water); pair interactions between amino acids
• Predicting secondary structure: local homology; fragment libraries
Lattice String Folding
• HP model: the main modeled force is hydrophobic attraction
  - Folding is NP-hard on both the 2-D square and 3-D cubic lattices
  - Constant-factor approximation algorithms exist
  - Not so relevant biologically
A sketch of the HP energy function follows.
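As referenced above, a small sketch of the HP model's energy on the 2-D square lattice (the conformation encoding is illustrative):

```python
def hp_energy(sequence, coords):
    """Energy of an HP-model conformation on the 2-D square lattice:
    -1 for every pair of H residues that are lattice neighbors but not
    adjacent along the chain. sequence: string over {'H', 'P'};
    coords: list of (x, y) lattice points, one per residue."""
    assert len(sequence) == len(coords) == len(set(coords)), "self-avoiding walk"
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):   # skip chain neighbors
            if sequence[i] == sequence[j] == 'H':
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    energy -= 1
    return energy

# Example: a U-shaped fold of HPPH on the unit square has one H-H contact.
print(hp_energy("HPPH", [(0, 0), (0, 1), (1, 1), (1, 0)]))  # -1
```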