Data Mining: Concepts and Techniques
Chapter 9: Advanced Classification Methods
Support Vector Machines
© 2013 Han, Kamber & Pei. All rights reserved.
Classification
- Assign an input vector to one of two or more classes
- Any decision rule divides the input space into decision regions separated by decision boundaries
Classification as Mathematical Mapping
- Classification: predict a categorical class label y ∈ Y for input x ∈ X
- Learning: derive a function f: X → Y
- 2-class classification, e.g., job page classification:
  - y ∈ {+1, −1}
  - x ∈ R^n
  - xi = (xi1, xi2, xi3, …), where n = number of distinct word features
  - xij: tf-idf weight of word j in document i
SVM: History and Applications
- SVMs were introduced by Vapnik and colleagues in 1992
- Theoretically well-motivated algorithm: developed from statistical learning theory dating back to the 1960s
- Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, etc.)
- Used for: classification and numeric prediction
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
SVM: General Philosophy
[Figure: two candidate linear separators for the same data, one with a small margin and one with a large margin; the support vectors are the points on the margin boundaries]
SVM: Support Vector Machines
- SVM uses a nonlinear mapping to transform the original training data into a higher dimension
- In the new dimension, it searches for the linear optimal separating hyperplane (i.e., a decision boundary)
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- SVM finds this hyperplane using support vectors (the essential training tuples) and margins (defined by the support vectors)
SVM: When Data Is Linearly Separable
- A separating hyperplane can be written as
  W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias)
- In 2-D it can be written as
  w0 + w1 x1 + w2 x2 = 0
- The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 ≥ +1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1
- Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
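To make the H1/H2 conditions concrete, here is a minimal sketch (not from the slides; the toy data and the scikit-learn usage are my own assumptions) that fits a hard-margin linear SVM and checks that the support vectors lie on the margin hyperplanes:

```python
# Minimal sketch (toy data, not from the slides): fit a linear SVM and
# read off W, b, and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class +1
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("W =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)

# Each support vector satisfies yi * (W·xi + b) = 1, i.e., it lies on H1 or H2.
for i in clf.support_:
    print(y[i] * (X[i] @ w + b))    # prints values close to 1
```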
SVM: Linearly Separable
[Figure: the margin between the two classes; the support vectors lie on its boundary]
Distance between a point xi and the hyperplane: |w · xi + b| / ||w||
Therefore, the margin is 2/||w||
There are infinitely many hyperplanes separating the two classes, but we want to find the best one: the one that minimizes classification error on unseen data. SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
Finding the maximum margin hyperplane
- Maximize the margin 2/||w||
- Correctly classify all training data:
  w · xi + b ≥ +1 for positive examples (yi = +1)
  w · xi + b ≤ −1 for negative examples (yi = −1)
- Quadratic optimization problem:
  minimize (1/2) wᵀw
  subject to yi (w · xi + b) ≥ 1
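This constrained quadratic program can be handed to a general-purpose solver directly. The following sketch (my own illustration; the data and solver choice are assumptions, not from the slides) minimizes (1/2) wᵀw subject to the constraints above using SciPy:

```python
# Sketch of the primal hard-margin QP, solved with SciPy's SLSQP method.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, +1, +1])

def objective(p):
    w = p[:-1]                      # p packs (w1, ..., wn, b)
    return 0.5 * w @ w              # minimize (1/2) w·w

constraints = [                     # yi (w·xi + b) - 1 >= 0 for every point
    {"type": "ineq",
     "fun": lambda p, xi=xi, yi=yi: yi * (p[:-1] @ xi + p[-1]) - 1}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```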
Finding the maximum margin hyperplane
Solution: w = Σi αi yi xi
where the xi with αi > 0 are the support vectors and the αi are the learned weights
Classification function: f(x) = w · x + b = Σi αi yi (xi · x) + b
Notice that the inner product between the test point x and the support vectors xi is used as a measure of similarity.
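In scikit-learn this structure is directly observable: dual_coef_ stores the products αi yi for the support vectors, so the decision value can be recomputed by hand as a sum of inner products. A small sketch of mine (not part of the slides):

```python
# Sketch: recompute the decision function from the support vectors alone.
# dual_coef_ holds alpha_i * y_i; names follow scikit-learn's API.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, +1, +1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_test = np.array([3.0, 2.0])
manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]
print(manual, clf.decision_function([x_test])[0])   # the two values agree
```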
Why Is SVM Effective on High Dimensional Data?
- The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
- If all other training examples were removed and the training repeated, the same separating hyperplane would be found
- The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
- Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
Nonlinear SVMs
- Datasets that are linearly separable work out great
- But what if the dataset is just too hard?
- Can we map it to a higher-dimensional space?
[Figure: 1-D datasets on the x-axis: an easily separable case, a hard case, and the hard case made separable after mapping to a higher dimension]
(Slide credit: Andrew Moore)
Nonlinear SVMs
- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable
[Figure: the lifting map Φ: x → φ(x) takes the input space into a feature space where the classes become linearly separable]
Slide credit: Andrew Moore
The Kernel Trick
- Instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
  K(xi, xj) = φ(xi) · φ(xj)
- K must satisfy Mercer's condition
- This gives a nonlinear decision boundary in the original feature space:
  Σi αi yi φ(xi) · φ(x) + b = Σi αi yi K(xi, x) + b
Nonlinear Kernel Example
- Consider the mapping φ(x) = (x, x²)
[Figure: points on the x-axis are lifted onto the parabola (x, x²) in 2-D]
  φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y²
  K(x, y) = xy + x²y²
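A two-line numeric check (my own sketch, not from the slides) confirms that the kernel computed in the original 1-D space equals the inner product in the lifted 2-D space:

```python
# Sketch: K(x, y) computed in the original space matches phi(x)·phi(y).
import numpy as np

def phi(x):
    return np.array([x, x**2])          # lifting map phi(x) = (x, x^2)

def K(x, y):
    return x * y + x**2 * y**2          # kernel, no explicit lifting needed

x, y = 3.0, -2.0
print(phi(x) @ phi(y), K(x, y))         # both print 30.0
```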
Kernels for Bags of Features
- Histogram intersection kernel:
  I(h1, h2) = Σ(i=1..N) min(h1(i), h2(i))
- Generalized Gaussian kernel:
  K(h1, h2) = exp(−(1/A) · D(h1, h2)²)
- D can be the L1 distance, Euclidean distance, χ² distance, etc.
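As an illustration (my own sketch; the 4-bin histograms below are made up), the histogram intersection kernel can be passed to scikit-learn as a callable that returns the Gram matrix:

```python
# Sketch: histogram intersection kernel as a callable Gram-matrix function
# for scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

def hist_intersection(H1, H2):
    # Entry (a, b) = sum_i min(H1[a, i], H2[b, i])
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=-1)

H = np.array([[4, 1, 0, 1], [3, 2, 1, 0],      # class 0 histograms
              [0, 1, 4, 1], [1, 0, 3, 2]],     # class 1 histograms
             dtype=float)
labels = np.array([0, 0, 1, 1])

clf = SVC(kernel=hist_intersection).fit(H, labels)
print(clf.predict(np.array([[3.0, 1.0, 1.0, 1.0]])))   # expected: class 0
```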
More Kernels for Nonlinear Classification
- Polynomial kernel of degree h: K(xi, xj) = (xi · xj + 1)^h
- Gaussian radial basis function kernel: K(xi, xj) = exp(−||xi − xj||² / 2σ²)
- Sigmoid kernel: K(xi, xj) = tanh(κ xi · xj − δ)
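These three kernels correspond to built-in options in scikit-learn; mapping the symbols above onto the library's parameters is my own annotation, not part of the slides:

```python
# Sketch: the kernels above as scikit-learn options (parameter names are
# the library's; gamma and coef0 play the roles of the constants above).
from sklearn.svm import SVC

poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0)  # (xi·xj + 1)^3
rbf  = SVC(kernel="rbf", gamma=0.5)            # exp(-gamma ||xi - xj||^2)
sig  = SVC(kernel="sigmoid", gamma=0.1, coef0=-1.0)  # tanh(gamma xi·xj + coef0)
```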
Scaling SVM by Hierarchical Micro-Clustering
- SVM is not scalable in the number of data objects, in terms of both training time and memory usage
- H. Yu, J. Yang, and J. Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters", KDD'03
- CB-SVM (Clustering-Based SVM):
  - Given a limited amount of system resources (e.g., memory), maximize SVM performance in terms of accuracy and training speed
  - Use micro-clustering to effectively reduce the number of points to be considered
  - When deriving support vectors, de-cluster the micro-clusters near candidate support vectors to ensure high classification accuracy
CF-Tree: Hierarchical Micro-cluster
- Read the data set once and construct a statistical summary of the data (i.e., hierarchical clusters) within a limited amount of memory
- Micro-clustering: a hierarchical indexing structure that provides finer samples closer to the boundary and coarser samples farther from the boundary
Selective Declustering: Ensure High Accuracy
- The CF-tree is a suitable base structure for selective declustering
- De-cluster only the clusters Ei such that Di − Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
- That is, de-cluster only the clusters whose subclusters could be support clusters of the boundary
  - Support cluster: a cluster whose centroid is a support vector
CB-SVM Algorithm: Outline
- Construct two CF-trees from the positive and negative data sets independently (requires one scan of the data set)
- Train an SVM from the centroids of the root entries
- De-cluster the entries near the boundary into the next level; the children entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries
- Train an SVM again from the centroids of the entries in the training set
- Repeat until nothing new is accumulated
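The following is a heavily simplified sketch of the underlying idea only (train on per-class cluster centroids instead of all points); it is not the actual CB-SVM algorithm, which uses CF-trees and iterative selective declustering near the boundary:

```python
# Heavily simplified sketch of "cluster first, then train SVM on summaries".
# Not the real CB-SVM: no CF-tree and no iterative declustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
pos = rng.normal(+2.0, 1.0, size=(5000, 2))     # synthetic positive class
neg = rng.normal(-2.0, 1.0, size=(5000, 2))     # synthetic negative class

# Summarize each class by a handful of centroids ("micro-clusters").
c_pos = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pos).cluster_centers_
c_neg = KMeans(n_clusters=20, n_init=10, random_state=0).fit(neg).cluster_centers_

X = np.vstack([c_pos, c_neg])                   # 40 points instead of 10,000
y = np.array([+1] * len(c_pos) + [-1] * len(c_neg))
clf = SVC(kernel="linear").fit(X, y)

X_full = np.vstack([pos, neg])
y_full = np.array([+1] * len(pos) + [-1] * len(neg))
print("accuracy on all points:", clf.score(X_full, y_full))
```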
Accuracy and Scalability on Synthetic Dataset
- Experiments on large synthetic data sets show better accuracy than random sampling approaches and far better scalability than the original SVM algorithm
SVM vs. Neural Network
- SVM
  - Deterministic algorithm
  - Nice generalization properties
  - Hard to learn: trained in batch mode using quadratic programming techniques
  - With kernels, can learn very complex functions
- Neural Network
  - Nondeterministic algorithm
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in incremental fashion
  - To learn complex functions, use a multilayer perceptron (nontrivial)
SVM Related Links
- SVM website: http://www.kernel-machines.org/
- Representative implementations
  - LIBSVM: an efficient implementation of SVM; supports multi-class classification, nu-SVM, and one-class SVM, and includes various interfaces for Java, Python, etc.
  - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and is available only in C
  - SVM-torch: another recent implementation, also written in C