
Introduction to Support Vector Machines




Feature Classification Using Support Vector Machines

The support vector machine (SVM) is a classification system based on statistical learning theory (Vapnik, 1995). Support vector machines are binary classifiers, popular for their ability to handle high-dimensional data, and are widely used in feature classification. The technique is largely insensitive to the dimensionality of the feature space, because its central idea is to separate the classes with a surface that maximises the margin between them, using the points near the class boundary to create the decision surface. The data points that are closest to this surface are termed "support vectors". Applying an SVM to any classification problem requires the determination of several user-defined parameters: the choice of a suitable multiclass approach, the choice of an appropriate kernel and its related parameters, a suitable value of the regularisation parameter C, and a suitable optimisation technique.

In a two-class pattern recognition problem in which the classes are linearly separable, the SVM selects, from among the infinitely many linear decision boundaries, the one that minimises the generalisation error. The selected decision boundary is therefore the one that leaves the greatest margin between the two classes, where the margin is defined as the sum of the distances from the hyperplane to the closest points of the two classes (Vapnik, 1995). This problem of maximising the margin can be solved using standard Quadratic Programming (QP) optimisation techniques. The data points closest to the hyperplane are the ones that determine the margin; hence these points are termed 'support vectors'.

Consider a training data set {(x1, y1), (x2, y2), ..., (xn, yn)}, where the xi are the vectorized training images and yi ∈ {−1, +1} are the labels to which each image is assigned.
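For concreteness, the sketches in the remainder of this section use the following hypothetical data, added here purely for illustration (it is not part of the original formulation): a numpy matrix X whose rows stand in for vectorized training images and a label vector y with entries in {−1, +1}. Later sketches reuse these names.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 64                                       # e.g. 200 images of 8x8 pixels, flattened
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),
               rng.normal(+1.0, 1.0, size=(n // 2, d))])
y = np.hstack([-np.ones(n // 2), +np.ones(n // 2)])  # labels yi in {-1, +1}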


The SVM builds a hyperplane w^T x + b = 0 that best separates the data points (by the widest margin), where w is the normal to the hyperplane, b is the bias, and |b|/||w|| is the perpendicular distance from the hyperplane to the origin.

Figure: Hyperplane that best separates the data

For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin. It does so by minimising the following objective function:

minimise F(w) = (1/2) ||w||^2

subject to yi (w^T xi + b) ≥ 1, ∀i.
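As a minimal sketch of this formulation, assuming the scikit-learn library (not named in the text) and the toy data X, y above, a linear SVM with a very large C approximates the hard-margin classifier:

from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1e6).fit(X, y)    # a very large C approximates the hard margin
print(clf.support_vectors_.shape)              # the training points closest to the hyperplane
print(clf.coef_, clf.intercept_)               # the learned w and b of w^T x + b = 0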


The optimisation problem is simplified by using its dual representation:

maximise L(α) = Σi αi − (1/2) Σi Σj αi αj yi yj xi^T xj

subject to Σi αi yi = 0 and αi ≥ 0, ∀i.

Here αi corresponds to the Lagrange multiplier associated with the i-th training constraint.

The Karush-Kuhn-Tucker (KKT) conditions for the constrained optimum are necessary and sufficient to find the maximum of this equation. The corresponding KKT complementarity conditions are

αi [ yi (w^T xi + b) − 1 ] = 0, ∀i.

The optimal solution is thus given by

w = Σi αi yi xi.
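The dual can be solved with an off-the-shelf QP solver. The following sketch assumes the cvxopt package (not mentioned in the text) and the toy data X, y above; the variable names mirror the notation used here and are illustrative only.

import numpy as np
from cvxopt import matrix, solvers

C_reg = 10.0                                        # regularisation parameter C
n = X.shape[0]
# Q_ij = yi yj xi^T xj, with a tiny ridge added for numerical stability
Q = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(n))
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # encodes 0 <= alpha_i <= C
h = matrix(np.hstack([np.zeros(n), C_reg * np.ones(n)]))
A = matrix(y.reshape(1, -1))                        # encodes sum_i alpha_i yi = 0
b = matrix(0.0)

alpha = np.ravel(solvers.qp(Q, q, G, h, A, b)['x'])
sv = alpha > 1e-6                                   # support vectors have alpha_i > 0
w = (alpha * y) @ X                                 # w = sum_i alpha_i yi xi
on_margin = sv & (alpha < C_reg - 1e-6)             # 0 < alpha_i < C: points exactly on the margin
b_opt = np.mean(y[on_margin] - X[on_margin] @ w)    # from the KKT complementarity conditions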

For non-separable data, the above objective function and inequality constraint are modified by introducing slack variables ξi:

minimise F(w, ξ) = (1/2) ||w||^2 + C Σi ξi

subject to yi (w^T xi + b) ≥ 1 − ξi and ξi ≥ 0, ∀i.

Here the ξi are slack variables that allow misclassification of data that are not linearly separable, and C is the penalizing constant. In the corresponding dual problem the only change is that the multipliers become box-constrained: 0 ≤ αi ≤ C.
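A brief illustration of the role of C, again assuming scikit-learn (the parameter values are arbitrary):

from sklearn.svm import SVC

for C_value in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C_value).fit(X, y)
    # small C tolerates slack (wider margin, typically more support vectors);
    # large C penalizes margin violations heavily
    print(C_value, len(clf.support_))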

i. Nonlinear Support Vector Machines

If the two classes are not linearly separable, the SVM tries to find the hyperplane that maximises the margin while, at the same time, minimising a quantity proportional to the number of misclassification errors. The trade-off between margin and misclassification error is controlled by a user-defined constant (Cortes and Vapnik, 1995). Training an SVM finds the large-margin hyperplane, i.e. it sets the parameters αi and b. The SVM has another set of parameters, called hyperparameters: the soft-margin constant C and any parameters the kernel function may depend on (e.g. the width of a Gaussian kernel). The SVM can also be extended to handle non-linear decision surfaces. If the input data are not linearly separable in the input space x but may be linearly separable in some higher-dimensional space, the classification problem can be solved by simply mapping the input data to that higher-dimensional space, x → ϕ(x).
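The following numpy sketch illustrates such a mapping on hypothetical 2-D data (an added illustration): points that are not linearly separable in the input space become separable after an explicit degree-2 polynomial feature map ϕ.

import numpy as np

def phi(X2):
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) for 2-D inputs
    x1, x2 = X2[:, 0], X2[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(1)
X2 = rng.normal(size=(300, 2))
y2 = np.where((X2 ** 2).sum(axis=1) > 1.0, 1, -1)   # outer ring vs inner disc
Z = phi(X2)   # in this 3-D feature space the boundary x1^2 + x2^2 = 1 becomes a hyperplane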


Figure: Mapping of the input data to a higher-dimensional feature space

The SVM performs an implicit mapping of the data into a higher (possibly infinite) dimensional feature space, and then finds a linear separating hyperplane with maximal margin in this higher-dimensional space.

The dual representation is thus given by

maximise L(α) = Σi αi − (1/2) Σi Σj αi αj yi yj ϕ(xi)^T ϕ(xj)

subject to Σi αi yi = 0 and 0 ≤ αi ≤ C, ∀i.

The problem with this approach is the very high computational complexity of working in the higher-dimensional space. The use of kernel functions eliminates this problem. A kernel function can be represented as

K(xi, xj) = ϕ(xi)^T ϕ(xj).
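A quick numerical check (numpy only) that a kernel evaluates the inner product in feature space without forming ϕ explicitly: it uses the same degree-2 map as the sketch above together with the homogeneous polynomial kernel (x^T z)^2, a simplified variant of the polynomial kernel listed below.

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.5])
phi_x = np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])
phi_z = np.array([z[0] ** 2, np.sqrt(2.0) * z[0] * z[1], z[1] ** 2])
print(phi_x @ phi_z, (x @ z) ** 2)   # both evaluate to 6.25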

A number of kernels have been developed so far, but the most popular and promising kernels are:

K(xi, xj) = xi^T xj (linear kernel)

K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)) (radial basis function kernel)

K(xi, xj) = (1 + xi^T xj)^p (polynomial kernel)

K(xi, xj) = tanh(a xi^T xj + r) (sigmoid kernel)

A new test example x is classified by the following function:


F(x) = sgn( Σi αi yi K(xi, x) + b )
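The sketch below, assuming scikit-learn and the toy data X, y above, reproduces this decision function by hand for an RBF-kernel SVM; in scikit-learn, dual_coef_ stores the products αi yi for the support vectors.

import numpy as np
from sklearn.svm import SVC

gamma = 0.05
clf = SVC(kernel='rbf', gamma=gamma, C=10.0).fit(X, y)

def decision_by_hand(model, X_new):
    sv = model.support_vectors_                       # the xi with alpha_i > 0
    sq_dist = ((X_new[:, None, :] - sv[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dist)                      # RBF kernel K(xi, x)
    return K @ model.dual_coef_.ravel() + model.intercept_[0]

x_test = X[:5]
print(np.allclose(decision_by_hand(clf, x_test), clf.decision_function(x_test)))
print(np.sign(decision_by_hand(clf, x_test)))         # the predictions F(x)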

a. The Behaviour of the Sigmoid Kernel

We consider the sigmoid kernel K(xi, xj) = tanh(a xi^T xj + r), which takes two parameters: a and r. For a > 0, a can be viewed as a scaling parameter of the input data and r as a shifting parameter that controls the threshold of the mapping. For a < 0, the dot product of the input data is not only scaled but also reversed. The analysis summarised in Table 1 concludes that the first case, a > 0 and r < 0, is more suitable for the sigmoid kernel.

a    r    Results
+    −    K is conditionally positive definite (CPD) when r is small enough; behaves similarly to the RBF kernel for small a
+    +    in general not as good as the (+, −) case
−    +    the objective value of the dual can go to −∞ when r is large enough
−    −    the objective value of the dual easily goes to −∞

Table 1: Behaviour of the sigmoid kernel for different parameter combinations
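As a short, hedged illustration with scikit-learn (parameter values arbitrary): in SVC the sigmoid kernel is tanh(gamma · xi^T xj + coef0), so gamma plays the role of a and coef0 the role of r; the recommended (+, −) case therefore corresponds to gamma > 0 and coef0 < 0.

from sklearn.svm import SVC

# gamma > 0 and coef0 < 0 correspond to the (a > 0, r < 0) case of Table 1
clf = SVC(kernel='sigmoid', gamma=0.01, coef0=-1.0, C=10.0).fit(X, y)
print(clf.score(X, y))   # training accuracy in the (+, -) parameter case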

b. Behaviour of polynomial kernel

Polynomial Kernel (K(xi , xj ) = (1 + xiTxj )p) is non-stochastic kernel estimate with two

parameters i.e. C and polynomial degree p. Each data from the set xi has an influence on

the kernel point of the test value xj, irrespective of its the actual distance from xj [14], It

gives good classification accuracy with minimum number of support vectors and low

classification error.

Figure: The effect of the degree of a polynomial kernel. Higher-degree polynomial kernels allow a more flexible decision boundary.
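A brief scikit-learn sketch of this effect (illustrative values only); with coef0 = 1 and gamma = 1, SVC's polynomial kernel matches (1 + xi^T xj)^p.

from sklearn.svm import SVC

for p in (1, 2, 3):
    clf = SVC(kernel='poly', degree=p, coef0=1.0, gamma=1.0, C=1.0).fit(X, y)
    print(p, len(clf.support_), clf.score(X, y))   # support vector count and training accuracy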

c. Gaussian radial basis function

The Gaussian radial basis function kernel, K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)), deals with data whose conditional probability distribution approaches a Gaussian function. RBF kernels often perform better than the linear and polynomial kernels; however, it is difficult to find the optimum parameter σ and an equivalent C that give the best result for a given problem.

A radial basis function (RBF) is a function of two vectors that depends only on the distance between them, i.e. K(xi, xj) = f(||xi − xj||). The quantity ||xi − xj||^2 in the exponent may be recognised as the squared Euclidean distance between the two feature vectors. The parameter σ is called the bandwidth.

Figure: Circled points are support vectors. The two contour lines running through the support vectors are the nonlinear counterparts of the convex hulls. The thick black line is the classifier; the other lines in the image are contour lines of the decision surface. The classifier runs along the bottom of the "valley" between the two classes. The smoothness of the contours is controlled by σ.


Kernel parameters also have a significant effect on the decision boundary. The width parameter of the Gaussian kernel controls the flexibility of the resulting classifier.

Figure: The effect of the inverse-width parameter γ of the Gaussian kernel for a fixed value of the soft-margin constant (left: γ = 1, right: γ = 100). The flexibility of the decision boundary increases with γ; large values of γ lead to overfitting (right panel).

Intuitively, the gamma parameter defines how far the influence of a single training example

reaches, with low values meaning ‘far’ and high values meaning ‘close’. The C parameter

trades off misclassification of training examples against simplicity of the decision surface.
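A small grid over illustrative values, assuming scikit-learn, makes this trade-off visible on the toy data:

from sklearn.svm import SVC

for gamma in (0.001, 0.1, 10.0):
    for C_value in (0.1, 10.0):
        clf = SVC(kernel='rbf', gamma=gamma, C=C_value).fit(X, y)
        # very small gamma: influence reaches far (nearly linear behaviour);
        # very large gamma: influence is local, so the boundary can overfit
        print(gamma, C_value, clf.score(X, y))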

ii. Multi-Class Classification

SVMs are suitable only for binary classification; however, they can be easily extended to a multi-class problem, for example by utilizing Error Correcting Output Codes. When dealing with multiple classes, an appropriate multi-class method is needed. Vapnik (1995) suggested comparing one class with all of the others taken together. This strategy generates n classifiers, where n is the number of classes. The final output is the class that corresponds to the SVM with the largest margin, as defined above. For a multi-class problem one therefore has to determine n hyperplanes, so this method requires the solution of n QP optimisation problems, each of which separates one class from the remaining classes. A dichotomy is a two-class classifier that learns from data labelled with positive (+), negative (−), or 'don't care'. Given any number of classes, we can re-label them with these three symbols and thus form a dichotomy; different relabelings result in different two-class problems, each of which is learned independently.


A multi-class classifier progresses through every selected dichotomy and chooses the class that is correctly classified by the maximum number of selected dichotomies. Exhaustive dichotomies represent the set of all possible ways of dividing and relabeling the dataset with the three defined symbols. A one-against-all classification scheme on an n-class problem considers n dichotomies, each of which re-labels one class as (+) and all other classes as (−).
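A hedged sketch of the one-against-all scheme, assuming scikit-learn and a hypothetical three-class data set: the OneVsRestClassifier wrapper trains n binary SVMs (one class versus the rest) and predicts the class whose classifier returns the largest decision value.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X3 = np.vstack([rng.normal(c, 1.0, size=(50, 64)) for c in (-2.0, 0.0, 2.0)])
y3 = np.repeat([0, 1, 2], 50)

ovr = OneVsRestClassifier(SVC(kernel='linear', C=1.0)).fit(X3, y3)
print(len(ovr.estimators_))   # n = 3 binary one-vs-rest classifiers
print(ovr.predict(X3[:5]))    # the class with the largest decision value wins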

a. DAG – SVM

The problem of multiclass classification, especially for systems like SVMs, does not present an easy solution. The standard method for an N-class problem is to construct N SVMs. The i-th SVM is trained with all of the examples in the i-th class given positive labels and all other examples given negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with N.

Another method for constructing N-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of N classes, each classifier being trained on only two out of the N classes. There would thus be K = N(N−1)/2 classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one).

A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and which contains no cycles. A Rooted DAG has a unique node, the root, which is the only node with no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.

Definition 1: Decision DAGs (DDAGs).

Given a space X and a set of Boolean functions F = {f : X → {0, 1}}, the class DDAG(F) of Decision DAGs on N classes over F are functions which can be implemented using a rooted binary DAG with N leaves labelled by the classes, where each of the K = N(N−1)/2 internal nodes is labelled with an element of F. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of N leaves. The i-th node in layer j < N is connected to the i-th and (i+1)-st node in the (j+1)-st layer.

To evaluate a particular DDAG on an input x ∈ X, the binary function at the root node is evaluated first. The node is then exited via the left edge if the binary function is zero, or via the right edge if the binary function is one, and the next node's binary function is evaluated. The value of the decision function D(x) is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input x reaches a node of the graph if that node is on the evaluation path for x. We refer to the decision node distinguishing classes i and j as the ij-node. Assuming that the number of a leaf is its class, this node is the i-th node in the (N−j+1)-th layer, provided i < j. Similarly, the j-nodes are those nodes involving class j, that is, the internal nodes on the two diagonals containing the leaf labelled by j.

The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with all of the classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with N classes, N−1 decision nodes are evaluated in order to derive an answer.

The current state of the list is the total state of the system. Therefore, since a list state is reachable via more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.
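The list procedure is straightforward to sketch in plain Python on top of pairwise scikit-learn SVMs, reusing the hypothetical three-class data X3, y3 from the one-against-all sketch above (all names are illustrative):

import numpy as np
from sklearn.svm import SVC

classes = np.unique(y3)
pairwise = {}                                        # K = N(N-1)/2 1-v-1 classifiers
for a in range(len(classes)):
    for c in range(a + 1, len(classes)):
        ci, cj = classes[a], classes[c]
        mask = np.isin(y3, [ci, cj])
        pairwise[(ci, cj)] = SVC(kernel='linear').fit(X3[mask], y3[mask])

def ddag_predict(x):
    remaining = list(classes)                        # the list is initialized with all classes
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]           # test the first and last elements
        node = pairwise[(i, j)]
        winner = node.predict(x.reshape(1, -1))[0]
        remaining.remove(j if winner == i else i)    # discard the losing class
    return remaining[0]                              # N-1 nodes evaluated in total

print([ddag_predict(x) for x in X3[:5]])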

The DAGSVM [8] separates the individual classes with large margin. It is safe to discard the losing class at each 1-v-1 decision because, for the hard-margin case, all of the examples of the losing class are far away from the decision surface. The DAGSVM algorithm is superior to other multiclass SVM algorithms in both training and evaluation time. Empirically, SVM training is observed to scale super-linearly with the training set size m, according to a power law T = c m^γ, where γ ≈ 2 for algorithms based on the decomposition method, with some proportionality constant c. For the standard 1-v-r multiclass SVM training algorithm, the entire training set is used to create all N classifiers.


Figure: The Decision DAG for finding the best class out of four classes

Hence the training time for 1-v-r is

T1-v-r = c N m^γ.

Assuming that the classes have the same number of examples, training each 1-v-1 SVM only requires 2m/N training examples. Thus, training all K = N(N−1)/2 1-v-1 SVMs would require

T1-v-1 = c (N(N−1)/2) (2m/N)^γ ≈ 2^(γ−1) c N^(2−γ) m^γ.

For a typical case, where γ = 2, the amount of time required to train all of the 1-v-1 SVMs is independent of N, and is only twice that of training a single 1-v-r SVM. Using 1-v-1 SVMs with a combination algorithm is thus preferred for training time.
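A quick arithmetic check of this scaling argument, taking γ = 2 and c = 1 purely for illustration:

def one_vs_rest_cost(N, m, gamma=2):
    return N * m ** gamma                             # N classifiers, each trained on all m examples

def one_vs_one_cost(N, m, gamma=2):
    return (N * (N - 1) / 2) * (2 * m / N) ** gamma   # K classifiers on 2m/N examples each

for N in (4, 10, 100):
    print(N, one_vs_rest_cost(N, 1000), one_vs_one_cost(N, 1000))
# the 1-v-1 total stays near 2*m^2 regardless of N, while the 1-v-r total grows linearly in N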
