Support Vector Machines
Theory and Implementation in Python
by Nachi
Definition
In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
- Wikipedia
Properties of an SVM
Non-probabilistic binary linear classifier
Support for non-linear classification using the 'kernel trick'
Linear separability
Two sets of points in p-dimensional space are said to be linearly separable if they can be separated using a p-1 dimensional hyperplane.
Example - The two sets of 2D data in the image are separated by a single straight line (a 1D hyperplane), and hence are linearly separable.
Linear Discriminant
The hyperplane that separates the two sets of data is called the linear discriminant.
Equation: Wᵀ X = C
where W = [w1, w2, ..., wn] and X = [x1, x2, ..., xn] for n-dimensional data.
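As a minimal sketch of the equation above, we can check numerically that a point satisfying Wᵀ X = C lies on the hyperplane (the weights W and offset C below are assumed illustrative values, not from the slides):

```python
import numpy as np

# Hypothetical 2D hyperplane x1 + x2 = 3: W and C are assumed values.
W = np.array([1.0, 1.0])   # normal vector of the hyperplane
C = 3.0                    # offset

X = np.array([1.5, 1.5])   # a point lying exactly on the hyperplane
print(np.dot(W, X) == C)   # True: Wᵀ X = C holds for points on the hyperplane
```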
Selecting the hyperplane
For any linearly separable data set, there exist infinitely many separating hyperplanes. Hence, we must choose the most suitable one for classification.
Maximal Margin Hyperplane
We can compute the (perpendicular) distance from each observation in the data set to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
Example - maximal margin hyperplane
Finding the shortest distance (margin)
Find Xp such that ||Xp − X|| is minimized and
Wᵀ Xp = C (since Xp lies on the decision boundary).
[Wᵀ = W transpose]
Maximizing the margin
Maximize D such that
D = (Wᵀ X − C) / ||W||
where X is the support vector.
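The distance formula above can be evaluated directly with NumPy. In this sketch the hyperplane (W, C) and the candidate support vector X are assumed illustrative values:

```python
import numpy as np

# Assumed hyperplane x1 + x2 = 3 (W and C are illustrative values)
W = np.array([1.0, 1.0])
C = 3.0

X = np.array([3.0, 3.0])   # a candidate support vector
# Signed perpendicular distance from X to the hyperplane Wᵀ X = C
D = (np.dot(W, X) - C) / np.linalg.norm(W)
print(round(D, 4))  # (6 - 3) / sqrt(2) ≈ 2.1213
```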
Why the maximal margin hyperplane?
● Given a maximal margin hyperplane for a data set, we predict the class of a new observation by computing its distance from the hyperplane.
● The greater the distance from the hyperplane, the more confident we are that the sample belongs to that class.
● Thus the hyperplane whose smallest distance to the training observations is largest is the most suitable.
Classifying a new sample
Consider a new sample x′ = [x1, x2, ..., xn]. To predict the class to which the sample belongs, we simply compute Wᵀ x′ and compare it with C.
If Wᵀ x′ > C, the sample lies on one side (the positive half space) of the hyperplane; if Wᵀ x′ < C, it lies on the other side (the negative half space). The sample belongs to the class that represents the corresponding half space.
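The half-space rule above can be sketched as a small function (the hyperplane values and the two test points are assumed for illustration):

```python
import numpy as np

def classify(W, C, x_new):
    """Assign x_new to the positive or negative half space of Wᵀ x = C."""
    return "positive" if np.dot(W, x_new) > C else "negative"

# Assumed hyperplane x1 + x2 = 3 separating the two classes
W = np.array([1.0, 1.0])
C = 3.0
print(classify(W, C, np.array([4.0, 4.0])))  # positive half space
print(classify(W, C, np.array([0.0, 0.0])))  # negative half space
```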
SVM - A linear discriminant
An SVM is simply a linear discriminant that builds a hyperplane with the largest possible margin.
It classifies a new sample by simply computing the distance from the hyperplane.
Support Vectors
● Observations (represented as vectors) that lie exactly at the margin distance from the hyperplane are called support vectors.
● These are important as shifting them even slightly might change the position of the hyperplane to a great extent.
Example - Support vectors
The vectors lying on the green lines in the image are the support vectors.
Soft margin
To avoid ‘overfitting’ the data (i.e. excessive sensitivity to individual observations) by insisting on perfectly linearly separable sets, we may allow some amount of misclassification, in exchange for greater robustness to individual observations and better classification of most of the observations.
Achieving the soft margin
Each observation has a ‘slack variable’ that allows it to be on the wrong side of the margin or of the hyperplane.
Sum of slack variables ≤ C
where C is a nonnegative tuning parameter: our budget for the total amount by which the margin can be violated by all the observations.
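A small sketch of the soft margin in sklearn, on an assumed extension of the slide's sample data. Note that sklearn's `C` parameter is a *penalty* on the slack, roughly the inverse of the budget C described above: a small penalty tolerates more margin violations.

```python
import numpy as np
from sklearn import svm

# Illustrative data (extends the slides' sample set; labels are assumed)
X = np.array([[0, 0], [1, 1], [1, 0], [2, 2], [3, 3], [4, 4]], dtype=float)
Y = np.array([0, 0, 0, 1, 1, 1])

# sklearn's C penalizes slack: large C -> hard margin, small C -> soft margin
for penalty in (0.1, 100.0):
    clf = svm.SVC(kernel="linear", C=penalty).fit(X, Y)
    print(penalty, "support vectors:", len(clf.support_vectors_))
```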
Tuning parameter C & Support vectors relation
Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. These observations do affect the support vector classifier.
When the tuning parameter C is large, the margin is wide, many observations violate the margin, and so there are many support vectors.
Non-linearly separable data
In this case, an SVM is not able to linearly classify the data, so it uses what is known as the ‘kernel trick’: the feature space is enlarged using various kernel functions, and the enlarged feature space may have a linear boundary that is non-linear in the original feature space.
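The kernel trick can be sketched on the classic XOR pattern, which no straight line can separate (the data, kernel, and parameter values below are assumptions for illustration):

```python
import numpy as np
from sklearn import svm

# XOR-style data: no single straight line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0])

# The RBF kernel implicitly enlarges the feature space; the linear
# boundary found there is non-linear in the original 2D space.
clf = svm.SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, Y)
print(clf.predict(X))  # [0 1 1 0]
```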
Enlarged feature space
Multi-Category Classification
● One-Versus-One Classification
● One-Versus-All Classification
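sklearn's `SVC` implements the first strategy internally, training one binary classifier per pair of classes; this can be observed through `decision_function_shape`. The three-class data below is assumed for illustration:

```python
import numpy as np
from sklearn import svm

# Three illustrative classes (data assumed, not from the slides)
X = np.array([[0, 0], [1, 1], [4, 4], [5, 5], [8, 8], [9, 9]], dtype=float)
Y = np.array([0, 0, 1, 1, 2, 2])

# SVC trains one-versus-one internally: k(k-1)/2 binary classifiers
# for k classes, i.e. 3 pairwise classifiers for k = 3.
clf = svm.SVC(kernel="linear", decision_function_shape="ovo").fit(X, Y)
print(clf.decision_function([[4.5, 4.5]]).shape)  # (1, 3) pairwise scores
print(clf.predict([[0.5, 0.5]]))                  # [0]
```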
Sample Data
X = [ [0,0], [1,1], [2,2], [3,3], [4,4] ]
Y = [0, 0, 0, 1, 1]
SVM in sklearn
from sklearn import svm
clfy = svm.SVC()
Default: class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, random_state=None)
‘Fit’ the model
clfy.fit(X, Y)
Fits the SVM model, i.e., computes and builds the separating hyperplane.
Features of sklearn
clfy.support_vectors_ Retrieve all the support vectors of the model
clfy.predict([[3, 3]]) Predict the class of the given sample (passed as a 2D array).
Features of sklearn
clfy.score(X, Y) Returns the mean accuracy on the given test data and labels.
clfy.decision_function([[2.5, 2.5]]) Distance of the given samples to the separating hyperplane.
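Putting the calls above together on the slides' sample data gives a short end-to-end sketch (kernel="linear" is an assumption here; the slides use the default SVC):

```python
import numpy as np
from sklearn import svm

# The slides' sample data
X = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]])
Y = np.array([0, 0, 0, 1, 1])

clfy = svm.SVC(kernel="linear").fit(X, Y)

print(clfy.support_vectors_)                 # points defining the margin
print(clfy.predict([[3, 3]]))                # [1]
print(clfy.score(X, Y))                      # 1.0 on the training data
print(clfy.decision_function([[2.5, 2.5]]))  # close to 0: on the boundary
```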
Conclusion
Parameter and kernel selection is crucial in an SVM model.