Support Vector Machine and String Kernels for Protein Classification
Christina Leslie
Department of Computer Science, Columbia University
Learning Sequence-based Protein Classification
Problem: classification of protein sequence data into families and superfamilies
Motivation: many proteins have been sequenced, but their structure/function often remains unknown
Motivation: infer structure/function from sequence-based classification
Sequence Data versus Structure and Function
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Sequences for four chains of human hemoglobin
Tertiary Structure
Function: oxygen transport
Structural Hierarchy
SCOP: Structural Classification of Proteins
Interested in superfamily-level homology – remote evolutionary relationship
Learning Problem
Reduce to a binary classification problem: positive (+) if the example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
Focus on remote homology detection
Use a supervised learning approach to train a classifier
Labeled Training Sequences → Learning Algorithm → Classification Rule
Two supervised learning approaches to classification
Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model
Only uses positive training sequences
Discriminative approach: learning algorithm tries to learn the decision boundary between positive and negative examples
Uses both positive and negative training sequences
Hidden Markov Models for Protein Families
Standard generative model: profile HMM
Training data: multiple alignment of examples from the family
Columns of the alignment determine model topology
7LES_DROME LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK.......
ABL1_CAEEL IIMHNKLGGGQYGDVYEGYWK........RHDCTIAVKALK........
BFR2_HUMAN LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD.....A
TRKA_HUMAN IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK........
Profile HMMs for Protein Families
Match, insert and delete states
Observed variables: symbol sequence, x1 … xL
Hidden variables: state sequence, π1 … πL
Parameters θ: transition and emission probabilities
Joint probability: P(x, π | θ)
Discriminative Learning
Discriminative approach: train on both positive and negative examples to learn a classifier
Modern computational learning theory
• Goal: learn a classifier that generalizes well to new examples
• Do not use training data to estimate parameters of a probability distribution – “curse of dimensionality”
Learning Theoretic Formalism for Classification Problem
• Training and test data drawn i.i.d. from a fixed but unknown probability distribution D on X × {-1, 1}
• Labeled training set S = {(x1, y1), … , (xm, ym)}
Support Vector Machines (SVMs)
We use SVM as discriminative learning algorithm
[Figure: positive (+) and negative (–) training examples in feature space]
• Training examples mapped to a (usually high-dimensional) feature space by a feature map F(x) = (F1(x), … , Fd(x))
• Learn a linear decision boundary: trade-off between maximizing the geometric margin of the training data and minimizing margin violations
SVM Classifiers
Linear classifier defined in feature space by f(x) = ⟨w, x⟩ + b
SVM solution gives w = Σi αi yi xi, a linear combination of support vectors, a subset of the training vectors
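As a minimal sketch (with made-up toy numbers, not from the slides), the decision function assembled from support vectors could look like:

```python
import numpy as np

# Hypothetical 2-D toy data: support vectors with dual weights alpha_i
# and labels y_i (all numbers here are invented for illustration).
support_vecs = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
alphas = np.array([0.5, 0.5, 1.0])   # dual coefficients, alpha_i >= 0
labels = np.array([+1, +1, -1])      # y_i in {-1, +1}
b = 0.0                              # bias term

# w = sum_i alpha_i * y_i * x_i: a linear combination of support vectors
w = (alphas * labels) @ support_vecs

def f(x):
    """Linear decision value f(x) = <w, x> + b."""
    return np.dot(w, x) + b

def h(x):
    """Classification rule h(x) = sign(f(x))."""
    return 1 if f(x) >= 0 else -1
```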
[Figure: separating hyperplane with normal vector w and bias b]
Advantages of SVMs
Large margin classifier: leads to good generalization (performance on test sets)
Sparse classifier: depends only on support vectors; leads to fast classification and good generalization
Kernel method: as we’ll see, we can introduce sequence-based kernel functions for use with SVMs
Hard Margin SVM
Assume training data are linearly separable in feature space
Space of linear classifiers fw,b(x) = ⟨w, x⟩ + b
giving decision rule hw,b(x) = sign(fw,b(x))
If |w| = 1, the geometric margin of the training data for hw,b is
γS = mini yi (⟨w, xi⟩ + b)
[Figure: maximal-margin hyperplane separating + and – examples, with weight vector w and bias b]
Hard Margin Optimization
Hard margin SVM optimization: given training data S, find the linear classifier hw,b with maximal geometric margin γS
Convex quadratic dual optimization problem
Sparse classifier in terms of support vectors
Hard Margin Generalization Error Bounds
Theorem [Cristianini, Shawe-Taylor]: Fix a real value M > 0. For any probability distribution D on X × {-1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any linear hypothesis h with geometric margin γS ≥ M on S has error no more than
ErrD(h) ≤ ε(m, δ, M, R),
provided that m is big enough.
SVMs for Protein Classification
Want to define a feature map from the space of protein sequences to a vector space
Goals:
Computational efficiency
Competitive performance with known methods
No reliance on a generative model – general method for sequence-based classification problems
Spectrum Feature Map for SVM Protein Classification
New feature map based on spectrum of a sequence
1. C. Leslie, E. Eskin, and W. Noble, The Spectrum Kernel: A String Kernel for SVM Protein Classification. Pacific Symposium on Biocomputing, 2002.
2. C. Leslie, E. Eskin, J. Weston and W. Noble, Mismatch String Kernels for SVM Protein Classification. NIPS 2002.
The k-Spectrum of a Sequence
• Feature map for SVM based on spectrum of a sequence
• The k-spectrum of a sequence is the set of all k-length contiguous subsequences that it contains
• Feature map is indexed by all possible k-length subsequences
(“k-mers”) from the alphabet of amino acids • Dimension of feature space = 20k
• Generalizes to any sequence data
AKQDYYYYEI
AKQ KQD QDY DYY YYY YYY YYE YEI
k-Spectrum Feature Map
Feature map for k-spectrum with no mismatches: for sequence x,
F(k)(x) = (Ft(x)){k-mers t}, where Ft(x) = # occurrences of t in x
Example: AKQDYYYYEI maps to
( 0 , 0 , … , 1 , … , 1 , … , 2 )
  AAA AAC … AKQ … DYY … YYY
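The k-spectrum feature map can be sketched in a few lines of Python, using a Counter as a sparse representation of the 20^k-dimensional vector:

```python
from collections import Counter

def spectrum_features(seq, k):
    """k-spectrum feature map: count of every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# The slide's example sequence: YYY occurs twice, AKQ once
feats = spectrum_features("AKQDYYYYEI", 3)
```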
(k,m)-Mismatch Feature Map
Feature map for k-spectrum, allowing m mismatches: if s is a k-mer,
F(k,m)(s) = (Ft(s)){k-mers t}, where Ft(s) = 1 if s is within m mismatches of t, 0 otherwise
Extend additively to longer sequences x by summing over all k-mers s in x
Example: the mismatch neighborhood of AKQ includes AKQ, DKQ, EKQ, AAQ, AKY, …
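A sketch of the m = 1 mismatch neighborhood of a k-mer (the set of k-mers t with Ft(s) = 1), using the standard 20-letter amino acid alphabet:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20-letter amino acid alphabet

def mismatch_neighborhood(kmer, m=1):
    """All k-mers within m substitutions of kmer (sketch for m <= 1)."""
    neighbors = {kmer}
    if m >= 1:
        for i in range(len(kmer)):
            for a in AMINO_ACIDS:
                # Substitute letter a at position i (the set dedupes kmer itself)
                neighbors.add(kmer[:i] + a + kmer[i + 1:])
    return neighbors

nbrs = mismatch_neighborhood("AKQ")
# Contains AKQ itself plus every single-substitution variant, e.g. DKQ, EKQ, AAQ, AKY
```

For k = 3 and m = 1 this gives 1 + 3 × 19 = 58 distinct k-mers, which is why traversal cost grows quickly with m.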
The Kernel Trick
To train an SVM, can use a kernel rather than an explicit feature map
For sequences x, y and feature map F, the kernel value is the inner product in feature space:
K(x, y) = ⟨F(x), F(y)⟩
Gives a sequence similarity score
Example of a string kernel
Can be efficiently computed via traversal of a trie data structure
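As an illustration (a direct sparse inner product, not the trie-based implementation from the papers), the spectrum kernel value for two sequences can be computed as:

```python
from collections import Counter

def spectrum_kernel(x, y, k):
    """K(x, y) = <F(x), F(y)> for the k-spectrum feature map (m = 0)."""
    fx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    fy = Counter(y[j:j + k] for j in range(len(y) - k + 1))
    # Only k-mers present in both sequences contribute to the inner product
    return sum(fx[t] * fy[t] for t in fx if t in fy)

# The pair of sequences used in the trie-traversal example below
K = spectrum_kernel("EADLALGKAVF", "ADLALGADQVFNG", 3)
```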
Computing the (k,m)-Spectrum Kernel
Use a trie (retrieval tree) to organize the lexical traversal of all instances of k-length patterns (with mismatches) in the training data
Each path down to a leaf in the trie corresponds to a coordinate in the feature map
Kernel values for all training sequences are updated at each leaf node
If m = 0, traversal time for the trie is linear in the size of the training data
Traversal time grows exponentially with m, but usually small values of m are useful
Depth-first traversal makes efficient use of memory
Example: Computing the Kernel for a Pair of Sequences
Traversal of trie for k = 3 (m = 0)
S1: EADLALGKAVF
S2: ADLALGADQVFNG
Descending the trie along the path A → D → L, update the kernel value K(S1, S2) by adding the contribution for feature ADL
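A minimal sketch of the depth-first trie traversal for the m = 0 case, pruning branches absent from either sequence (a simplified version of the traversal described above, not the authors' implementation):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def trie_spectrum_kernel(s1, s2, k):
    """k-spectrum kernel (m = 0) via depth-first traversal of the implicit
    k-mer trie, carrying the surviving k-mer start positions down each path."""
    kernel = 0

    def dfs(depth, inst1, inst2):
        nonlocal kernel
        if depth == k:
            # Leaf reached: this k-mer occurs len(inst1) times in s1 and
            # len(inst2) times in s2; add its contribution to the kernel
            kernel += len(inst1) * len(inst2)
            return
        for a in AMINO_ACIDS:
            # Keep only the k-mer instances whose next symbol matches a
            n1 = [i for i in inst1 if s1[i + depth] == a]
            n2 = [i for i in inst2 if s2[i + depth] == a]
            if n1 and n2:   # prune branches missing from either sequence
                dfs(depth + 1, n1, n2)

    dfs(0, list(range(len(s1) - k + 1)), list(range(len(s2) - k + 1)))
    return kernel
```

On the slide's pair (S1 = EADLALGKAVF, S2 = ADLALGADQVFNG, k = 3) the shared features are ADL, DLA, LAL and ALG, each contributing 1.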
Fast Prediction
SVM training determines the subset of training sequences corresponding to support vectors, and their weights: (xi, αi), i = 1 … r
Prediction with no mismatches:
Represent the SVM classifier by a hash table mapping support k-mers to weights
Test sequences can be classified in linear time via look-up of k-mers
Prediction with mismatches:
Represent the classifier as a sparse trie; traverse k-mer paths occurring with mismatches in the test sequence
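A sketch of hash-table prediction in the no-mismatch case; the per-k-mer weights here are made up for illustration (in practice they come from expanding the trained SVM's weight vector over support k-mers):

```python
# Hypothetical per-k-mer weights (invented values, not a trained model)
kmer_weights = {"ADL": 0.8, "LAL": 0.5, "QDY": -0.7, "YYY": -1.2}
b = 0.0
k = 3

def predict(seq):
    """Classify seq in time linear in its length: one hash look-up per k-mer."""
    score = sum(kmer_weights.get(seq[i:i + k], 0.0)
                for i in range(len(seq) - k + 1)) + b
    return 1 if score >= 0 else -1
```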
Experimental Design
Tested with a set of experiments on the SCOP dataset
Experiments designed to ask: could the method discover a new family of a known superfamily?
Diagram from Jaakkola et al.
Experiments
160 experiments for 33 target families from 16 superfamilies
Compared results against:
SVM-Fisher
SAM-T98 (HMM-based method)
PSI-BLAST (heuristic alignment-based method)
Conclusions for SCOP Experiments
Spectrum kernel with SVM performs as well as the best-known methods for the remote homology detection problem
Efficient computation of the string kernel
Fast prediction: can precompute per-k-mer scores and represent the classifier as a lookup table
Gives linear time prediction for both the spectrum kernel and the (unnormalized) mismatch kernel
General approach to classification problems for sequence data
Feature Selection Strategies
Explicit feature filtering: compute a score for each k-mer, based on training data statistics, during trie traversal, and filter as we compute the kernel
Feature elimination as a wrapper for SVM training: eliminate features corresponding to small components wi of the vector w defining the SVM classifier
Kernel principal component analysis: project onto principal components prior to training
Ongoing and Future Work
New families of string kernels, mismatching schemes
Applications to other sequence-based classification problems, e.g. splice site prediction
Feature selection: explicit and implicit dimension reduction
Other machine learning approaches to using sparse string-based models for classification, e.g. boosting with string-based classifiers