MACHINE LEARNING FOR PROTEIN CLASSIFICATION: KERNEL METHODS CS 374 Rajesh Ranganath 4/10/2008



OUTLINE

Biological motivation and background
Algorithmic concepts
Mismatch kernels
Semi-supervised methods


PROTEINS


THE PROTEIN PROBLEM

Primary structure can be determined easily
3D structure determines function
Grouping proteins into structural and evolutionary families is difficult
Use machine learning to group proteins


HOW TO LOOK AT AMINO ACID CHAINS

Smith-Waterman Idea Mismatch Idea


FAMILIES

Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)

Families are further subdivided into proteins

Proteins are divided into species; the same protein may be found in several species

[Figure: classification hierarchy (Fold > Superfamily > Family > Proteins). Credit: Morten Nielsen, CBS, BioCentrum, DTU]


SUPERFAMILIES

Proteins which are (remote) evolutionarily related

Sequence similarity is low

Share function

Share special structural features

Relationships between members of a superfamily may not be readily recognizable from the sequence alone



FOLDS

Proteins that have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold

No evolutionary relationship between the proteins is implied



PROTEIN CLASSIFICATION

Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?

Methods

BLAST / PsiBLAST

Profile HMMs

Supervised Machine Learning methods

[Figure: classification hierarchy (Fold > Superfamily > Family > Proteins) with a new protein of unknown placement marked “?”]


MACHINE LEARNING CONCEPTS

Supervised methods
  Discriminative vs. generative models
  Transductive learning
  Support vector machines
  Kernel methods
Semi-supervised methods


DISCRIMINATIVE AND GENERATIVE MODELS

[Figure: side-by-side illustration of a discriminative model and a generative model]


TRANSDUCTIVE LEARNING

Most learning is inductive: given (x1,y1), …, (xm,ym), for any test input x*, predict the label y*

Transductive learning: given (x1,y1), …, (xm,ym) and all the test inputs {x1*, …, xp*}, predict the labels {y1*, …, yp*}


SUPPORT VECTOR MACHINES

Popular discriminative learning algorithm
Optimal geometric margin classifier
Can be trained efficiently using the Sequential Minimal Optimization (SMO) algorithm

If x1, …, xn are the training examples, sign(Σi αi yi xiᵀx) “decides” where x falls
Train the αi to achieve the best margin


SUPPORT VECTOR MACHINES (2)

Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs.

sign(Σi αi yi K(xi, x)) determines the class of x
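As a minimal sketch of this decision rule (not the lecture's code; the toy weights and 1-D inputs are illustrative, and the bias term is omitted):

```python
def svm_predict(alphas, labels, support_vectors, kernel, x):
    """Kernelized SVM decision: sign(sum_i alpha_i * y_i * K(x_i, x)).

    Bias term omitted for simplicity.
    """
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alphas, labels, support_vectors))
    return 1 if score >= 0 else -1

# Toy example with a linear kernel K(x, z) = x*z on 1-D inputs:
# two support vectors at -1 (label -1) and +1 (label +1), equal weights
linear = lambda x, z: x * z
print(svm_predict([1.0, 1.0], [-1, 1], [-1.0, 1.0], linear, 0.7))   # -> 1
print(svm_predict([1.0, 1.0], [-1, 1], [-1.0, 1.0], linear, -0.3))  # -> -1
```

Only the kernel function touches the inputs, which is what lets the same training and prediction code work for strings once a string kernel is supplied.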


KERNEL METHODS

K(x, z) = φ(x)ᵀφ(z), where φ is the feature mapping and x and z are input vectors
High-dimensional features never need to be calculated explicitly
Think of the kernel function as a similarity measure between x and z

Example:
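The example figure is missing from this transcript; as an illustrative stand-in (the specific kernel choice here is an assumption, not taken from the slides), the degree-2 polynomial kernel K(x, z) = (xᵀz)² on 2-D inputs equals the dot product of explicit degree-2 feature maps:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    # K(x, z) = (x . z)^2, computed without ever forming phi explicitly
    return float(np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(poly_kernel(x, z))        # (1*3 + 2*4)^2 = 121.0
print(float(phi(x) @ phi(z)))   # same value via explicit features
```

For degree-d polynomials on n-dimensional inputs the explicit feature space grows combinatorially, while the kernel evaluation stays O(n); that gap is the whole point of the kernel trick.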


MISMATCH KERNEL

Regions of similar amino acid sequences yield a similar tertiary structure of proteins

Used as a kernel for an SVM to identify protein homologies


K-MER BASED SVMS

For a given word size k and mismatch tolerance l, define

K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches

Define the normalized mismatch kernel

K’(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))

An SVM can then be learned by supplying this kernel function

Example, with k = 3 and l = 1:

X = A B A C A R D I
Y = A B R A D A B I

K(X, Y) = 4
K’(X, Y) = 4 / sqrt(7 · 7) = 4/7
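A brute-force sketch of this count. Reading the slide's definition as "unordered pairs of k-mer occurrences, one from each sequence, differing in at most l positions" is an assumption on my part (it reproduces the slide's numbers); real mismatch-kernel implementations use efficient trie-based algorithms rather than all-pairs comparison:

```python
def kmers(s, k):
    # All k-long substrings of s, in order of occurrence
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def mismatch_count(x, y, k=3, l=1):
    """Count k-mer occurrence pairs between x and y with <= l mismatches.

    When x == y, pairs are counted unordered (each k-mer also pairs with
    itself), which reproduces K(X, X) = 7 from the slide's example.
    """
    xs, ys = kmers(x, k), kmers(y, k)
    hamming = lambda a, b: sum(c != d for c, d in zip(a, b))
    if x == y:
        return sum(hamming(xs[i], ys[j]) <= l
                   for i in range(len(xs)) for j in range(i, len(ys)))
    return sum(hamming(a, b) <= l for a in xs for b in ys)

def normalized_mismatch(x, y, k=3, l=1):
    # K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))
    return mismatch_count(x, y, k, l) / (
        mismatch_count(x, x, k, l) * mismatch_count(y, y, k, l)) ** 0.5

X, Y = "ABACARDI", "ABRADABI"
print(mismatch_count(X, Y))       # 4, as on the slide
print(normalized_mismatch(X, Y))  # 4/7 ~= 0.571
```

The four matching pairs are ABA~ABR, ABA~ABI, ABA~ADA, and ACA~ADA, each differing in exactly one position.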


DISADVANTAGES

Determining the 3D structure of most proteins experimentally is practically impossible
Primary sequences are cheap to determine
How do we use all this unlabeled data?
Use semi-supervised learning based on the cluster assumption


SEMI-SUPERVISED METHODS

• Some examples are labeled
• Assume labels vary smoothly among all examples


• SVMs and other discriminative methods may make significant mistakes due to lack of data


Attempt to “contract” the distances within each cluster while keeping inter-cluster distances large


CLUSTER KERNELS

Semi-supervised methods:

1. Neighborhood kernel
   a. For each X, run PSI-BLAST to get a set of similar sequences Nbd(X)
   b. Define Φnbd(X) = (1/|Nbd(X)|) Σ_{X′ ∈ Nbd(X)} Φoriginal(X′)
      “Counts of all k-mers matching with at most 1 mismatch, over all sequences similar to X”
   c. Knbd(X, Y) = (1/(|Nbd(X)| · |Nbd(Y)|)) Σ_{X′ ∈ Nbd(X)} Σ_{Y′ ∈ Nbd(Y)} K(X′, Y′)

2. Bagged mismatch kernel, next
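Given a precomputed base kernel matrix and neighborhood lists (both hypothetical inputs here; in the paper's setting the neighborhoods come from PSI-BLAST hits), the neighborhood kernel is just a double average:

```python
def neighborhood_kernel(K, nbd_i, nbd_j):
    """K_nbd = average of K(X', Y') over X' in Nbd(X), Y' in Nbd(Y).

    K is a base kernel matrix indexed by sequence id; nbd_i and nbd_j
    are lists of ids (each neighborhood includes the sequence itself).
    """
    total = sum(K[a][b] for a in nbd_i for b in nbd_j)
    return total / (len(nbd_i) * len(nbd_j))

# Toy 3-sequence base kernel matrix (symmetric, made up for illustration)
K = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.4],
     [0.2, 0.4, 1.0]]
# Sequence 0's neighborhood is {0, 1}; sequence 2's is just {2}
print(neighborhood_kernel(K, [0, 1], [2]))  # (0.2 + 0.4) / 2 = 0.3
```

Averaging over neighborhoods smooths the kernel: two sequences look similar if their PSI-BLAST neighborhoods overlap or resemble each other, even when the sequences themselves share little identity.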


BAGGED MISMATCH KERNEL

Final method: bagged mismatch

1. Run k-means clustering n times, giving assignments cp(X) for p = 1, …, n
2. For every X and Y, count the fraction of runs in which they are bagged together:
   Kbag(X, Y) = (1/n) Σp 1(cp(X) = cp(Y))
3. Combine the “bag fraction” with the original comparison K(·, ·):
   Knew(X, Y) = Kbag(X, Y) · K(X, Y)
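A sketch of the bag-fraction computation, taking the per-run cluster assignments as given (actually producing them would mean running k-means n times; the toy assignments and base kernel below are made up for illustration):

```python
def bagged_kernel(assignments, K_base):
    """K_new(X, Y) = K_bag(X, Y) * K_base(X, Y), where K_bag is the
    fraction of clustering runs that put X and Y in the same cluster.

    assignments: list of runs, each a list of cluster ids per sequence.
    K_base: base kernel matrix as nested lists.
    """
    n_runs, n = len(assignments), len(assignments[0])
    K_new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            together = sum(run[i] == run[j] for run in assignments)
            K_new[i][j] = (together / n_runs) * K_base[i][j]
    return K_new

# Two runs over three sequences; 0 and 1 cluster together in run 1 only
runs = [[0, 0, 1],
        [0, 1, 1]]
K = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
print(bagged_kernel(runs, K)[0][1])  # 0.5 * 0.8 = 0.4
```

Multiplying by the bag fraction down-weights pairs that rarely land in the same cluster, which is exactly the cluster assumption applied to the kernel.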


[Figure credit: O. Jangmin]


WHAT WORKS BEST?

Transductive Setting


REFERENCES

C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics, Advance Access, January 22, 2004.

J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003.

Images from Wikimedia Commons.