Page 1: Protein Classification

Protein Classification

Page 2: Protein Classification

Protein Classification

• Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?

Methods

• BLAST / PsiBLAST

• Profile HMMs

• Supervised Machine Learning methods

[Figure: protein hierarchy (Fold → Superfamily → Family → Proteins), with a new protein to be placed at the “?” position]

Page 3: Protein Classification

PSI-BLAST

Given a query sequence x and a database D

1. Find all pairwise alignments of x to sequences in D

2. Collect all matches of x to y with some minimum significance

3. Construct a position-specific matrix M
   • Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)

4. Using the matrix M, search D for more matches

5. Iterate 1–4 until convergence

[Figure: the position-specific profile (matrix M)]
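To make the loop above concrete, here is a minimal Python sketch of a PSI-BLAST-style iteration. It is an illustration only: it uses ungapped alignment of the full query profile against every window of each database sequence, a fixed score threshold, and uniform pseudocounts instead of Henikoff sequence weighting, E-values, or gapped BLAST heuristics; all names are ours and sequences are assumed to use the 20-letter alphabet.

from collections import Counter
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile_from_segments(segments, pseudocount=1.0):
    # Column-wise amino-acid frequencies (with pseudocounts), turned into
    # log-odds scores against a uniform background: the profile M.
    length = len(segments[0])
    profile = []
    for col in range(length):
        counts = Counter(seg[col] for seg in segments)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: math.log((counts[a] + pseudocount) / total * len(ALPHABET))
                        for a in ALPHABET})
    return profile

def best_ungapped_hit(profile, seq):
    # Best score of the profile against any window of seq (no gaps allowed).
    L = len(profile)
    if len(seq) < L:
        return float("-inf"), 0
    best = max(range(len(seq) - L + 1),
               key=lambda s: sum(profile[i][seq[s + i]] for i in range(L)))
    return sum(profile[i][seq[best + i]] for i in range(L)), best

def psiblast_like(query, database, threshold=5.0, max_iters=5):
    matches = {query}                                     # steps 1-2, trivially seeded
    profile = profile_from_segments([query])
    for _ in range(max_iters):
        profile = profile_from_segments(sorted(matches))  # step 3 (simplified weighting)
        new_matches = {query}
        for seq in database:                              # step 4: search D with M
            score, start = best_ungapped_hit(profile, seq)
            if score >= threshold:
                new_matches.add(seq[start:start + len(profile)])
        if new_matches == matches:                        # step 5: iterate to convergence
            break
        matches = new_matches
    return profile, matches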

Page 4: Protein Classification

Classification with Profile HMMs

[Figure: profile HMM architecture (BEGIN, match states M1…Mm, insert states I0…Im, delete states D1…Dm, END); one HMM is built per node of the Fold / Superfamily / Family hierarchy, and the new protein is scored against each model]

Page 5: Protein Classification

The Fisher Kernel

• Fisher score: U_X = ∇_θ log P(X | H1, θ)
  Quantifies how each parameter contributes to generating X
  For two different sequences X and Y, we can compare U_X and U_Y

• Distance: D²_F(X, Y) = (1/(2σ²)) |U_X − U_Y|²

• Given this distance function, K(X, Y) is defined as a similarity measure: K(X, Y) = exp(−D²_F(X, Y))
  σ is set so that the average distance of training sequences X_i ∈ H1 to sequences X_j ∈ H0 is 1

[Figure: profile HMM for the family H1 (match, insert, and delete states)]
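As a concrete illustration, a small numpy sketch of the distance and kernel above, assuming the Fisher scores U_X (gradients of log P(X | H1, θ) with respect to the HMM parameters) have already been computed and stacked as rows of matrices; the function names and the calibration rule are ours.

import numpy as np

def calibrate_sigma(U_pos, U_neg):
    # Pick sigma so that the average D^2_F between positive (H1) training
    # scores and negative (H0) training scores equals 1.
    d2 = ((U_pos[:, None, :] - U_neg[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.mean() / 2.0)

def fisher_kernel(u_x, u_y, sigma):
    # K(X, Y) = exp(-D^2_F(X, Y)) with D^2_F = |U_X - U_Y|^2 / (2 sigma^2).
    d2 = np.sum((u_x - u_y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(-d2)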

Page 6: Protein Classification

The Fisher Kernel

• To train a classifier for a given family H1:

1. Build a profile HMM, H1
2. U_X = ∇_θ log P(X | H1, θ)  (Fisher score)
3. D²_F(X, Y) = (1/(2σ²)) |U_X − U_Y|²  (distance)
4. K(X, Y) = exp(−D²_F(X, Y))  (akin to a dot product)
5. L(X) = Σ_{Xi ∈ H1} λ_i K(X, X_i) − Σ_{Xj ∈ H0} λ_j K(X, X_j)
6. Iteratively adjust λ to optimize J(λ) = Σ_{Xi ∈ H1} λ_i (2 − L(X_i)) − Σ_{Xj ∈ H0} λ_j (2 + L(X_j))

• To classify a query X:
  Compute U_X
  Compute K(X, X_i) for all training examples X_i with λ_i ≠ 0 (few)
  Decide based on whether L(X) > 0
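A minimal sketch of this classification step, assuming the weights λ have already been trained (step 6) and the class signs are folded into labels y ∈ {+1, −1}; only the support vectors (λ_i ≠ 0) enter the sum. Names are ours.

import numpy as np

def decision_value(u_query, U_train, y_train, lam, sigma):
    # L(X) = sum_i y_i * lambda_i * K(X, X_i), restricted to support vectors.
    sv = np.nonzero(lam)[0]                                     # lambda_i != 0 (few)
    d2 = np.sum((U_train[sv] - u_query) ** 2, axis=1) / (2.0 * sigma ** 2)
    return float(np.sum(y_train[sv] * lam[sv] * np.exp(-d2)))

# Predict "member of H1" when decision_value(...) > 0.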

Page 7: Protein Classification

O. Jangmin

Page 8: Protein Classification
Page 9: Protein Classification

QUESTION

Running time of Fisher kernel SVM

on query X?

Page 10: Protein Classification

k-mer based SVMs

Leslie, Eskin, Weston, Noble; NIPS 2002

Highlights

• The Fisher kernel K(X, Y) = exp(−(1/(2σ²)) |U_X − U_Y|²) requires an expensive profile alignment to compute U_X = ∇_θ log P(X | H1, θ): O(|X| |H1|) per sequence

• Instead, the new kernel K(X, Y) just “counts up” k-mers with mismatches in common between X and Y: O(|X|) in practice

• Off-the-shelf SVM software used

Page 11: Protein Classification

k-mer based SVMs

• For a given word size k and mismatch tolerance l, define
  K(X, Y) = # of distinct pairs of k-long word occurrences, one in X and one in Y, with ≤ l mismatches

• Define the normalized kernel K’(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))

• An SVM can be learned by supplying this kernel function

Example (k = 3, l = 1):
  X = A B A C A R D I
  Y = A B R A D A B I
  K(X, Y) = 4
  K’(X, Y) = 4 / sqrt(7 · 7) = 4/7
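The worked example above can be reproduced with a brute-force Python sketch that follows the definition used on this slide (pairs of k-mer occurrences within Hamming distance l, each unordered pair counted once); the actual mismatch kernel of Leslie et al. is defined through a feature map over all 20^k words and computed far more efficiently with a trie traversal.

def kmer_kernel(x, y, k=3, l=1):
    # Count pairs of k-mer occurrences, one from x and one from y, differing
    # in at most l positions; when x == y, count each unordered pair once.
    count = 0
    for i in range(len(x) - k + 1):
        start = i if x == y else 0
        for j in range(start, len(y) - k + 1):
            if sum(a != b for a, b in zip(x[i:i + k], y[j:j + k])) <= l:
                count += 1
    return count

def normalized_kmer_kernel(x, y, k=3, l=1):
    # K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
    return kmer_kernel(x, y, k, l) / (kmer_kernel(x, x, k, l) * kmer_kernel(y, y, k, l)) ** 0.5

print(kmer_kernel("ABACARDI", "ABRADABI"))             # 4
print(normalized_kmer_kernel("ABACARDI", "ABRADABI"))  # 4/7 ≈ 0.571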

Page 12: Protein Classification

SVMs will find a few support vectors

[Figure: separating hyperplane with normal vector v and its support vectors]

After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X

Page 13: Protein Classification

Benchmarks

Page 14: Protein Classification

Semi-Supervised Methods

GENERATIVE SUPERVISED METHODS

Page 15: Protein Classification

Semi-Supervised Methods

DISCRIMINATIVE SUPERVISED METHODS

Page 16: Protein Classification

Semi-Supervised Methods

UNSUPERVISED METHODS

Mixture of Centers

Data generated by a fixed set of centers (how many?)


Page 26: Protein Classification

Semi-Supervised Methods

• Some examples are labeled

• Assume labels vary smoothly among all examples

Page 27: Protein Classification

Semi-Supervised Methods

• Some examples are labeled

• Assume labels vary smoothly among all examples

• SVMs and other discriminative methods may make significant mistakes due to lack of data


Page 31: Protein Classification

Semi-Supervised Methods

• Some examples are labeled

• Assume labels vary smoothly among all examples

Attempt to “contract” the distances within each cluster while keeping inter-cluster distances larger


Page 33: Protein Classification

Semi-Supervised Methods

1. Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005

A PSI-BLAST profile-based method

2. Weston, Leslie, Elisseeff, Noble, NIPS 2003

Cluster kernels

Page 34: Protein Classification

(semi) 1. Profile k-mer based SVMs

• For each sequence X, obtain the PSI-BLAST profile Q(X) = {p_i(β) : β an amino acid, 1 ≤ i ≤ |X|}

  For every k-mer in X, x_j … x_{j+k−1}, define its σ-neighborhood
  M_{k,σ}(Q[x_j…x_{j+k−1}]) = {b_1…b_k : −Σ_{i=0…k−1} log p_{j+i}(b_i) < σ}

  Define K(X, Y): for each word b_1…b_k matching m times in X and n times in Y, add m·n (as sketched below)

• In practice, each k-mer can have ≤ 2 mismatches, and K(X, Y) can be computed quickly in O(k² 20² (|X| + |Y|))

[Figure: PSI-BLAST builds the profile M that defines the k-mer neighborhoods]
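A small, brute-force Python sketch of this profile k-mer kernel. It assumes the PSI-BLAST profile is supplied as a list of per-position amino-acid probability dictionaries, restricts neighborhoods to ≤ 2 substitutions of the observed k-mer as the slide suggests, and enumerates candidate words directly rather than using the trie-based O(k² 20² (|X| + |Y|)) computation of the paper; all names are ours.

import math
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def neighborhood(profile, seq, j, k, sigma, max_mismatch=2):
    # Words b1..bk differing from seq[j:j+k] in at most max_mismatch positions
    # whose profile cost -sum_i log p_{j+i}(b_i) stays below sigma.
    kmer = seq[j:j + k]
    words = set()
    for n_mm in range(max_mismatch + 1):
        for positions in combinations(range(k), n_mm):
            for subs in product(ALPHABET, repeat=n_mm):
                word = list(kmer)
                for pos, aa in zip(positions, subs):
                    word[pos] = aa
                word = "".join(word)
                cost = -sum(math.log(profile[j + i].get(word[i], 1e-6)) for i in range(k))
                if cost < sigma:
                    words.add(word)
    return words

def profile_kmer_features(profile, seq, k, sigma):
    # Phi(X): word -> number of k-mers in Q(X) whose neighborhood contains it.
    phi = Counter()
    for j in range(len(seq) - k + 1):
        for word in neighborhood(profile, seq, j, k, sigma):
            phi[word] += 1
    return phi

def profile_kmer_kernel(phi_x, phi_y):
    # For each word matching m times in X and n times in Y, add m * n.
    return sum(m * phi_y[w] for w, m in phi_x.items())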

Page 35: Protein Classification

(semi) 1. Discriminative motifs

• Under this kernel K(X, Y), sequence X is mapped to Φ_{k,σ}(X), a vector in 20^k dimensions:
  Φ_{k,σ}(X)(b_1…b_k) = # of k-mers in Q(X) whose neighborhood includes b_1…b_k

• Then the SVM learns a discriminating “hyperplane” with normal vector
  v = Σ_{i=1…N} (±) λ_i Φ_{k,σ}(X^(i))

• Consider a profile k-mer Q[x_j…x_{j+k−1}]; its contribution to v is ~ ⟨Φ_{k,σ}(Q[x_j…x_{j+k−1}]), v⟩

• Consider a position i in X: count up the contributions of all words containing x_i
  g(x_i) = Σ_{j=1…k} max{0, ⟨Φ_{k,σ}(Q[x_{i−k+j}…x_{i−1+j}]), v⟩}

Sort these contributions across all positions of all sequences to pick important positions, or discriminative motifs (see the sketch below)
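A short sketch of the position scoring g(x_i), reusing the neighborhood() helper from the sketch after Page 34 and assuming the hyperplane normal v is available as a dictionary mapping words to weights (assembled from the training features and the learned λ's); names are ours.

def position_scores(profile, seq, v, k, sigma):
    # <Phi(Q[window]), v> reduces to summing v over the words in the window's
    # neighborhood; g(x_i) adds the positive contributions of the k windows
    # that cover position i.
    window_scores = []
    for j in range(len(seq) - k + 1):
        contrib = sum(v.get(w, 0.0) for w in neighborhood(profile, seq, j, k, sigma))
        window_scores.append(max(0.0, contrib))
    g = [0.0] * len(seq)
    for j, s in enumerate(window_scores):
        for i in range(j, j + k):
            g[i] += s
    return g

# Positions (across all positive training sequences) with the largest g values
# are the candidate discriminative motifs.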

Page 36: Protein Classification

(semi) 1. Discriminative motifs

• Consider a position i in X: count up the contributions to v of all words containing x_i

Sort these contributions across all positions of all sequences to pick discriminative motifs

Page 37: Protein Classification

(semi) 2. Cluster Kernels

• Two (more!) methods

1. Neighborhood
   1. For each X, run PSI-BLAST to get similar sequences Nbd(X)
   2. Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X′ ∈ Nbd(X)} Φ_original(X′)
      “Counts of all k-mers matching with at most 1 difference all sequences that are similar to X”
   3. K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X′ ∈ Nbd(X)} Σ_{Y′ ∈ Nbd(Y)} K(X′, Y′)
      (a sketch follows below)

2. Bagged mismatch
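A minimal numpy sketch of the neighborhood kernel (method 1 above), assuming the base feature vectors Φ_original are already computed as rows of a matrix and each PSI-BLAST neighborhood is given as a list of row indices. Because the base kernel is a dot product of features, averaging the features is equivalent to averaging the pairwise kernel values.

import numpy as np

def neighborhood_features(Phi, neighborhoods):
    # Phi_nbd(X) = (1/|Nbd(X)|) * sum of Phi_original over the sequences in Nbd(X).
    return np.vstack([Phi[idx].mean(axis=0) for idx in neighborhoods])

def neighborhood_kernel(Phi, neighborhoods):
    # K_nbd(X, Y) = <Phi_nbd(X), Phi_nbd(Y)>, computed for all pairs at once.
    Phi_nbd = neighborhood_features(Phi, neighborhoods)
    return Phi_nbd @ Phi_nbd.T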

Page 38: Protein Classification

(semi) 2. Cluster Kernels

• Two (more!) methods

1. Neighborhood
   1. For each X, run PSI-BLAST to get similar sequences Nbd(X)
   2. Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X′ ∈ Nbd(X)} Φ_original(X′)
      “Counts of all k-mers matching with at most 1 difference all sequences that are similar to X”
   3. K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X′ ∈ Nbd(X)} Σ_{Y′ ∈ Nbd(Y)} K(X′, Y′)

2. Bagged mismatch
   1. Run k-means clustering n times, giving assignments c_p(X) for p = 1,…,n
   2. For every X and Y, count up the fraction of times they are bagged together:
      K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
   3. Combine the “bag fraction” with the original comparison K(·,·):
      K_new(X, Y) = K_bag(X, Y) K(X, Y)
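A sketch of the bagged-mismatch kernel (method 2), assuming a precomputed base kernel matrix K and feature vectors Phi to cluster; it uses scikit-learn's KMeans for the repeated clusterings, and the numbers of runs and clusters are illustrative choices, not values from the paper.

import numpy as np
from sklearn.cluster import KMeans

def bagged_mismatch_kernel(K, Phi, n_runs=10, n_clusters=5, seed=0):
    m = K.shape[0]
    K_bag = np.zeros((m, m))
    for p in range(n_runs):                            # n independent clusterings
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + p).fit_predict(Phi)
        K_bag += (labels[:, None] == labels[None, :])  # 1(c_p(X) == c_p(Y))
    K_bag /= n_runs                                     # fraction of times bagged together
    return K_bag * K                                    # K_new = K_bag * K (elementwise)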

Page 39: Protein Classification

Some Benchmarks

Page 40: Protein Classification

Google-like homology search

• The internet and the network of protein homologies have some similarity: both are scale-free

• Given a query X, Google ranks webpages by a flow algorithm:
  From each webpage W, linked neighbors receive flow
  At time t+1, W sends to its neighbors the flow it received at time t
  This gives a finite, ergodic, aperiodic Markov chain
  Its stationary distribution can be found efficiently as the left eigenvector with eigenvalue 1

• Start with an arbitrary probability distribution and repeatedly multiply by the transition matrix
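The “multiply by the transition matrix” step is just power iteration; a small numpy sketch, assuming a column-stochastic transition matrix P (each column sums to 1):

import numpy as np

def stationary_distribution(P, tol=1e-10, max_iters=10000):
    # Start from an arbitrary distribution and repeatedly apply P until it
    # stops changing; the fixed point is the eigenvector with eigenvalue 1.
    n = P.shape[0]
    y = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        y_next = P @ y
        if np.abs(y_next - y).sum() < tol:
            return y_next
        y = y_next
    return y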

Page 41: Protein Classification

Google-like homology search

Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004

RANKPROP algorithm for protein homology

• First, compute a matrix K_ij of PSI-BLAST homology between proteins i and j, normalized so that Σ_j K_ji = 1

1. Initialization: y_1(0) = 1; y_i(0) = 0 for i ≥ 2
2. For t = 0, 1, …
3.   For i = 2 to m
4.     y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)

In the end, let y_i be the ranking score for the similarity of sequence i to sequence 1

(α = 0.95 works well)
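A direct numpy transcription of the update above (0-based indices, sequence 0 as the query), assuming K is the column-normalized PSI-BLAST similarity matrix; the query's own score is held fixed at 1, as in the pseudocode.

import numpy as np

def rankprop(K, alpha=0.95, n_iters=100):
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0                                 # y_1(0) = 1, y_i(0) = 0 otherwise
    for _ in range(n_iters):
        y = K[0, :] + alpha * (K.T @ y)        # y_i(t+1) = K_1i + alpha * sum_j K_ji y_j(t)
        y[0] = 1.0                             # the query keeps score 1
    return y                                   # y_i ranks sequence i against the query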

Page 42: Protein Classification

Google-like homology search

For a given protein family, what fraction of true members of the family are ranked higher than the first 50 non-members?
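One reasonable reading of this criterion in code (a ROC-50-style measure): walk down the ranked list and report the fraction of true family members seen before the 50th non-member. The helper below is hypothetical and assumes boolean labels with the best hit first.

def fraction_before_n_negatives(ranked_is_member, n_neg=50):
    # ranked_is_member: booleans (True = family member), best-ranked first.
    total_members = sum(ranked_is_member)
    found = negatives = 0
    for is_member in ranked_is_member:
        if is_member:
            found += 1
        else:
            negatives += 1
            if negatives >= n_neg:
                break
    return found / total_members if total_members else 0.0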

Page 43: Protein Classification

Protein Structure Prediction

Page 44: Protein Classification

Protein Structure Determination

• Experimental
  X-ray crystallography
  NMR spectroscopy

• Computational: structure prediction (the Holy Grail)

Sequence implies structure; therefore, in principle, we can predict the structure from the sequence alone

Page 45: Protein Classification

Protein Structure Prediction

• Ab initio: use just first principles (energy, geometry, and kinematics)

• Homology: find the best match to a database of sequences with known 3D structure

• Threading

• Meta-servers and other methods

Page 46: Protein Classification

Ab initio Prediction

• Sampling the global conformation space
  Lattice models / discrete-state models
  Molecular dynamics

• Picking native conformations with an energy function
  Solvation model: how the protein interacts with water
  Pair interactions between amino acids

• Predicting secondary structure
  Local homology
  Fragment libraries

Page 47: Protein Classification

Lattice String Folding

• HP model: the main modeled force is hydrophobic attraction
  Folding is NP-hard in both the 2-D square and the 3-D cubic lattice
  Constant-factor approximation algorithms exist
  Not so relevant biologically
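To make the HP model concrete, here is a tiny Python sketch for the 2-D square lattice: a conformation is a self-avoiding walk, and the energy is −1 for every pair of non-consecutive H residues on adjacent lattice sites; searching for the minimum-energy walk is the NP-hard part.

def hp_energy(sequence, moves):
    # sequence: string over {'H', 'P'}; moves: len(sequence)-1 steps from
    # {'U', 'D', 'L', 'R'} tracing the chain on the square lattice.
    step = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
    coords = [(0, 0)]
    for m in moves:
        dx, dy = step[m]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    if len(set(coords)) != len(coords):      # chain collides with itself
        return None                          # not a valid (self-avoiding) fold
    index = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in step.values():
            j = index.get((x + dx, y + dy))
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1                  # topological H-H contact, counted once
    return energy

print(hp_energy("HHPH", "RUL"))              # folds into a 2x2 square: one H-H contact -> -1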

Page 48: Protein Classification

Lattice String Folding