Submitted by: Garisha Chowdhary, MCSE 1st year, Jadavpur University
A set of related supervised learning methods
Non-probabilistic binary linear classifier
Linear learners like perceptrons, but unlike them, use the concepts of maximum margin, linearization, and kernel functions
Used for classification and regression analysis
A good separation:
• Map non-linearly separable instances to higher dimensions to overcome linearity constraints
• Select between hyperplanes, using the maximum margin as a test
[Figure: three candidate hyperplanes separating Class 1 from Class 2]
Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class, since the larger the margin, the lower the generalization error (more confident predictions).
[Figure: the maximum-margin hyperplane between Class 1 and Class 2]
Given N samples {(x1, y1), (x2, y2), …, (xN, yN)},
• where yi = +1/−1 are the labels of the data and xi belongs to Rn,
find a hyperplane wᵀx + b = 0 such that
• wᵀxi + b > 0 for all i such that yi = +1
• wᵀxi + b < 0 for all i such that yi = −1
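As a minimal sketch of this decision rule (NumPy assumed; the w and b values below are illustrative, not learned), classification reduces to checking the sign of wᵀx + b:

```python
import numpy as np

# Illustrative (not learned) hyperplane parameters for points in R^2.
w = np.array([1.0, -1.0])   # normal vector of the hyperplane w^T x + b = 0
b = -0.5

def classify(x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([2.0, 0.0])))   # +1, since w^T x + b = 1.5 > 0
print(classify(np.array([0.0, 2.0])))   # -1, since w^T x + b = -2.5 < 0
```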
Functional margin
• W.r.t. a training example, defined by γ̂(i) = y(i)(wᵀx(i) + b).
• Want the functional margin to be large, i.e. y(i)(wᵀx(i) + b) >> 0.
• May rescale w and b without altering the decision function, but doing so multiplies the functional margin by the same scale factor.
• This allows us to impose a normalization condition ||w|| = 1, i.e. to consider the functional margin of (w/||w||, b/||w||).
• W.r.t. the training set, defined by γ̂ = min γ̂(i) over all i.

Geometric margin
• Defined by γ(i) = y(i)((w/||w||)ᵀx(i) + b/||w||).
• If ||w|| = 1, the functional margin equals the geometric margin.
• Invariant to scaling of the parameters w and b; w may be scaled such that ||w|| = 1.
• Also, γ = min γ(i) over all i.
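A small sketch of these definitions on hypothetical toy data (NumPy assumed): it computes the functional and geometric margins of a training set and confirms that rescaling (w, b) changes the former but not the latter:

```python
import numpy as np

X = np.array([[2.0, 2.0], [0.0, 0.0]])   # toy training points
y = np.array([1, -1])                    # their labels
w, b = np.array([1.0, 1.0]), -2.0

func = y * (X @ w + b)                   # functional margins y(i)(w^T x(i) + b)
geom = func / np.linalg.norm(w)          # geometric margins
print(func.min(), geom.min())            # margins w.r.t. the training set

# Rescaling (w, b) by 10 multiplies the functional margin by 10
# but leaves the geometric margin unchanged.
print((y * (X @ (10 * w) + 10 * b)).min())
print((y * (X @ (10 * w) + 10 * b)).min() / np.linalg.norm(10 * w))
```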
Maximize γ w.r.t. γ, w, b s.t.
• y(i)(wᵀx(i) + b) ≥ γ for all i
• ||w|| = 1

Equivalently, maximize γ̂/||w|| w.r.t. γ̂, w, b s.t.
• y(i)(wᵀx(i) + b) ≥ γ̂ for all i

Introducing the scaling constraint that the functional margin be 1, the objective may be simplified further to maximizing 1/||w||, or equivalently:

Minimize (1/2)||w||² s.t.
• y(i)(wᵀx(i) + b) ≥ 1 for all i
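A sketch of solving this primal problem directly on a tiny separable dataset (SciPy assumed; the data and variable names are mine). SLSQP handles the inequality constraints y(i)(wᵀx(i) + b) − 1 ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                  # p = [w1, w2, b]
    return 0.5 * np.dot(p[:2], p[:2])

constraints = [{'type': 'ineq',    # y_i (w^T x_i + b) - 1 >= 0
                'fun': lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)                        # parameters of the maximum-margin hyperplane
```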
Using the Lagrangian to solve this inequality-constrained optimization problem, we have
L = ½||w||² − Σ αi (yi(wᵀxi + b) − 1)
Setting the gradient of L w.r.t. w and b to 0, we get
w = Σ αi yi xi , Σ αi yi = 0
Substituting w into L, we get the corresponding dual of the primal problem:
maximize W(α) = Σ αi − ½ ΣΣ αi αj yi yj xiᵀxj , s.t. αi ≥ 0 , Σ αi yi = 0
The objective now is to solve for α and recover
w = Σ αi yi xi , b = −(max{i: yi=−1} wᵀxi + min{i: yi=+1} wᵀxi)/2
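A sketch of solving this dual numerically on the same toy data (SciPy assumed; SLSQP again, minimizing −W(α) under αi ≥ 0 and Σ αi yi = 0), then recovering w and b with the formulas above:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = (y[:, None] * X) @ (y[:, None] * X).T     # K_ij = y_i y_j x_i^T x_j

def neg_dual(a):                              # -W(alpha), to be minimized
    return 0.5 * a @ K @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)), method='SLSQP',
               bounds=[(0, None)] * len(y),                     # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
b = -(max(X[y == -1] @ w) + min(X[y == 1] @ w)) / 2
print(alpha, w, b)
```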
For conversion of the primal problem to the dual problem, the following Karush-Kuhn-Tucker (KKT) conditions must be satisfied, where gi(w, b) = 1 − yi(wᵀxi + b) are the margin constraints:
• (∂/∂wi) L(w, α) = 0, i = 1, …, n
• αi gi(w, b) = 0, i = 1, …, k
• gi(w, b) ≤ 0, i = 1, …, k
• αi ≥ 0

From the KKT complementarity condition (the 2nd above):
• αi > 0 ⇒ gi(w, b) = 0 (active constraint) ⇒ (x(i), y(i)) has functional margin 1 (support vector)
• gi(w, b) < 0 ⇒ αi = 0 (inactive constraint; non-support vector)
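Continuing the previous sketch (alpha, X, y, w, b carried over; the 1e-6 tolerance is an arbitrary numerical cutoff), the complementarity condition lets us read off the support vectors as the points with αi > 0:

```python
sv = alpha > 1e-6                  # alpha_i > 0, up to solver noise
print("support vectors:", X[sv])
print("their functional margins:", y[sv] * (X[sv] @ w + b))   # all approximately 1
```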
[Figure: separating hyperplane between Class 1 and Class 2, with the support vectors lying on the margin]
In the case of non-linearly separable data, mapping the data to a high-dimensional feature space via a non-linear mapping function φ increases the likelihood that the data becomes linearly separable.
A kernel function, which corresponds to the dot product of some non-linear mapping of the data, K(x, z) = φ(x)ᵀφ(z), is used to simplify computations over the high-dimensional mapped data.
Having found the αi, we only need to calculate a quantity that depends on the inner product between x (the test point) and the support vectors.
The kernel function is a measure of similarity between the two vectors.
A kernel function is valid if it satisfies Mercer's theorem, which states that the corresponding kernel matrix must be symmetric positive semi-definite (zᵀKz ≥ 0 for all z).
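A sketch of this check on a finite sample (NumPy assumed): build the kernel matrix of a known-valid kernel (linear, for simplicity) and verify that it is symmetric with non-negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

K = X @ X.T                                    # linear kernel matrix K(x, z) = x^T z

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # PSD, up to round-off
```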
Polynomial kernel with degree d
• K(x, y) = (xᵀy + 1)^d
Radial basis function kernel with width s
• K(x, y) = exp(−||x − y||²/(2s²))
• Its feature space is infinite-dimensional
Sigmoid kernel with parameters k and q
• K(x, y) = tanh(k xᵀy + q)
• Does not satisfy the Mercer condition for all k and q
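The three kernels written out as plain functions, as a sketch (NumPy assumed; the parameter names d, s, k, q mirror the slide, and the default values are arbitrary):

```python
import numpy as np

def poly_kernel(x, y, d=3):
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, s=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * s ** 2))

def sigmoid_kernel(x, y, k=1.0, q=0.0):   # Mercer-valid only for some k, q
    return np.tanh(k * np.dot(x, y) + q)
```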
High dimensionality doesn't guarantee linear separation; the hyperplane might also be susceptible to outliers.
Relax the constraints by introducing 'slack variables' ξi that allow the constraints to be violated by a small quantity.
Penalize the objective function for violations.
The parameter C controls the trade-off between the penalty and the margin.
The objective now becomes: min over w, b, ξ of (1/2)||w||² + C Σ ξi s.t. y(i)(wᵀx(i) + b) ≥ 1 − ξi , ξi ≥ 0
This tries to ensure that most examples have functional margin at least 1.
Forming the corresponding Lagrangian, the dual problem now is to: max over α of Σ αi − ½ ΣΣ αi αj yi yj xiᵀxj , s.t. 0 ≤ αi ≤ C , Σ αi yi = 0.
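A sketch of soft-margin training with an off-the-shelf solver (scikit-learn assumed; the toy data and its outlier are mine). C is the same trade-off parameter as in the objective above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one outlier that a hard margin could not accommodate.
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [0.0, -1.0], [1.8, 0.2]])
y = np.array([1, 1, -1, -1, -1])     # the last point sits deep in class +1 territory

clf = SVC(kernel='linear', C=1.0).fit(X, y)   # C trades margin width vs. slack penalty
print(clf.coef_, clf.intercept_)              # w and b
print(clf.n_support_)                         # number of support vectors per class
```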
[Figure: soft-margin separation of Class 1 and Class 2, with some points violating the margin]
Parameter Selection
• The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the parameter C.
• A common choice is the Gaussian kernel, which has a single parameter γ.
• The best combination of C and γ is often selected by grid search over exponentially increasing sequences of C and γ.
• Each combination is checked using cross-validation, and the one with the best accuracy is chosen.
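A sketch of this grid search with scikit-learn (assumed; the synthetic dataset and grid ranges are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

param_grid = {'C':     [2.0**k for k in range(-5, 16, 2)],   # exponentially
              'gamma': [2.0**k for k in range(-15, 4, 2)]}   # increasing sequences

search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)   # best (C, gamma) by CV accuracy
```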
Drawbacks
• Cannot be directly applied to multiclass problems; requires algorithms that convert the multiclass problem into multiple binary classification problems.
• Produces uncalibrated class membership probabilities.
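Both drawbacks have common workarounds in practice, sketched here with scikit-learn (assumed): one-vs-rest decomposition for the multiclass case, and Platt-style sigmoid calibration for probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)    # a 3-class problem

# Multiclass via multiple binary problems (one-vs-rest).
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)
print(ovr.predict(X[:3]))

# Calibrated class membership probabilities (Platt scaling).
cal = CalibratedClassifierCV(SVC(kernel='rbf'), method='sigmoid', cv=5).fit(X, y)
print(cal.predict_proba(X[:3]))
```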
THANK YOU