
1

Support Vector Machine (SVM)

MUMT611, Beinan Li, Music Tech @ McGill, 2005-3-17


2

Content

- Related problems in pattern classification
- VC theory and VC dimension
- Overview of SVM
- Application example


3

Related problems in pattern classification

- Small sample-size effect (peaking effect): an overly small or overly large sample size results in large error.
- In a typical Bayesian classifier, the probability densities of the global set are estimated inaccurately from finite sample sets.
- Training data vs. test data
- Empirical risk vs. structural risk
- Misclassifying yet-to-be-seen data

Picture taken from (Ridder 1997)


4

Related problems in pattern classification

- Avoid solving a more general problem as an intermediate step (Vapnik 1995): classify without estimating probability densities.
- ANN: depends on prior knowledge; uses the empirical-risk method (ERM)
  - Problem of generalization: over-fitting is hard to control
- Goal: a theoretical analysis of the validity of ERM


5

VC theory and VC dimension

- VC dimension (classifier complexity): the maximum size of a sample set that a decision function can separate under every possible labeling.
- A finite VC dimension implies the consistency of ERM.
- VC theory is the theoretical basis of both ANN and SVM.
- Linear decision function: VC dim = number of parameters
- Non-linear decision function: VC dim <= number of parameters (illustrated below)
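A concrete instance of the linear case (an illustration, not from the slides): in the plane, the decision function

```latex
D(\mathbf{x}) = \operatorname{sign}(w_1 x_1 + w_2 x_2 + b)
```

has three parameters and VC dim = 3: three non-collinear points can be labeled in all 2^3 = 8 ways by some line, but no set of four points can be. The XOR labeling, revisited on slide 10, is the classic counterexample.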


6

Overview of SVM

- Structural-risk method (SRM): minimize the empirical risk while controlling the VC dimension; the result is a tradeoff between empirical risk and over-fitting.
- Focus on the explicit problem of classification: find the optimal hyperplane dividing the two classes.
- Supervised learning


7

Margin and Support Vectors (SV)

- The two-category, linearly separable case
- Small vs. large margin

Picture taken from (Ferguson 2004)


8

Margin and Support Vectors (SV)

- For two-category, linearly separable data: find a hyperplane with the largest margin to the sample vectors of both classes.
- D(x) = w^T x + b; in augmented notation, D(x') = a^T x' with a = (b, w) and x' = (1, x).
- Multiple solutions exist in weight space: find the weight that yields the largest margin.
- The margin is determined by the SVs.

Picture taken from (Ferguson 2004)


9

Mathematical detail

- Constraints: y_i D(x_i) >= 1, with labels y_i in {+1, -1}
- Margin: y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'; maximizing the margin amounts to minimizing ||a||
- Quadratic programming: find the minimum ||a|| under the linear constraints
- The weights are expressed through Lagrange multipliers
- Via the Kuhn-Tucker conditions, the problem reduces to a dual problem written entirely in dot products
- The parameters of the decision function, and its complexity, are completely determined by the SVs (see the sketch below).
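A minimal sketch of this machinery (not from the slides), using scikit-learn's SVC on toy data; the very large C approximates the hard-margin case above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-category, linearly separable 2-D data.
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]           # D(x) = w.x + b
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2.0 / np.linalg.norm(w))  # max margin = 2 / ||w||
```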


10

Linearly non-separable case

- Example: the XOR problem
- Sample-set size is 4, but the VC dim of a line in the plane is 3, so no linear separation exists.

Pictures taken from (Ferguson 2004)


11

Linearly non-separable case

- Map the data to a higher-dimensional space in which it becomes linearly separable.
- Make a linear decision in the higher-dimensional space.
- Example: XOR, mapped to a 6-D space by

  φ(x) = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2)

- Back in the original space, the decision function is D(x) = x_1 x_2 (checked in the sketch below).

Picture taken from (Ferguson 2004)
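A quick check of this example (an illustration, not from the slides): a degree-2 polynomial kernel K(x, z) = (x·z + 1)^2 is exactly the dot product under the 6-D map above, and it separates XOR:

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# gamma=1, coef0=1, degree=2 gives K(x, z) = (x.z + 1)^2, i.e. the dot
# product under the 6-D feature map phi above. Large C ~ hard margin.
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1e6).fit(X, y)

print(clf.predict(X))   # recovers the XOR labels [-1, 1, 1, -1]
print(clf.n_support_)   # slide 12 notes all 4 samples are SVs
```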


12

Linearly non-separable case

- The separating hyperplane in both the original and the higher-dimensional space (projected onto the 2-D plane).
- All 4 samples are SVs.

Picture taken from (Ferguson 2004; Luo 2002)


13

Linearly non-separable case

- Modify the quadratic programming:
  - "Soft margin" slack variables: y_i D(x_i) >= 1 - ε_i
  - Penalty function; upper bound C for the Lagrange multipliers
- Kernel function: the dot product in the higher-dimensional space, expressed in terms of the original parameters
  - Yields a symmetric, positive semi-definite matrix, satisfying Mercer's theorem
  - Standard candidates: polynomial, Gaussian radial-basis function (written out below)
  - Selection of the kernel depends on prior knowledge
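In standard notation (a reference formulation; the slide does not spell it out), the soft-margin program and the two kernels named above read:

```latex
\min_{\mathbf{w},\,b,\,\varepsilon}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \varepsilon_i
\quad\text{s.t.}\quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1-\varepsilon_i,\ \varepsilon_i \ge 0;
\qquad
K_{\text{poly}}(\mathbf{x},\mathbf{z}) = (\mathbf{x}^{\top}\mathbf{z}+1)^d,\quad
K_{\text{RBF}}(\mathbf{x},\mathbf{z}) = \exp\!\big(-\|\mathbf{x}-\mathbf{z}\|^2/2\sigma^2\big)
```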


14

Implementation with a large sample set

- Large computation: one Lagrange multiplier per sample
- Reductionist approach (sketched below):
  - Divide the sample set into batches (subsets)
  - Accumulate the SV set from batch-by-batch operations
  - Assumption: samples that are not SVs locally are not global SVs either
- Several algorithms, varying in subset size:
  - Vapnik: chunking algorithm
  - Osuna: Osuna's algorithm
  - Platt: SMO algorithm (only 2 samples per operation; the most popular)
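A hypothetical sketch of the reductionist scheme (the batch size, parameters, and the helper name chunked_svm are my own, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, batch_size=200, **svm_params):
    """Train batch by batch, carrying forward only the SVs found so far.
    Relies on the assumption above: a sample that is not an SV in its
    batch is discarded, even though a full solve might have kept it.
    Assumes every batch contains samples of both classes."""
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty(0, dtype=y.dtype)
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([sv_X, X[start:start + batch_size]])
        batch_y = np.concatenate([sv_y, y[start:start + batch_size]])
        clf = SVC(**svm_params).fit(batch_X, batch_y)
        sv_X, sv_y = batch_X[clf.support_], batch_y[clf.support_]
    return clf
```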


15

From 2-category to multi-category SVM

- There is no uniform way to extend.
- Common ways (both shown in the sketch below):
  - One-against-all
  - One-against-one: binary tree
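Both schemes are available off the shelf; a minimal sketch with scikit-learn (an illustration, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-against-one: SVC trains C*(C-1)/2 pairwise binary classifiers.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

# One-against-all: one binary classifier per class.
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
```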


16

Advantages of SVM

- Strong mathematical basis
- The decision function and its complexity are completely determined by the SVs.
- Training time does not depend on the dimensionality of the feature space, only on the fixed input space.
- Good generalization; insensitive to the "curse of dimensionality"
- Versatile choices of kernel function, up to feature-less classification: the kernel acts as a data-similarity measure.


17

Drawbacks of SVM

- Still relies on prior knowledge: the choices of C, kernel, and penalty function
  - C: how far the decision function adapts itself to avoid any error
  - Kernel: how much freedom (dimension) the SVM has to adapt itself
- Overlapping classes
- Reductionism may discard promising SVs at any batch step; classification can be limited by the size of the problem.
- No uniform way to extend from 2-category to multi-category
- "Still not an ideal optimally-generalizing classifier."


18

Applications

- Vapnik et al. at AT&T: handwritten digit recognition; error rate lower than that of ANN
- Speech recognition
- Face recognition
- MIR
- SVM-light: open-source C library


19

Application example of SVM in MIR

- Li & Guo 2000 (Microsoft Research China)
- Problem: classify 16 classes of sounds in a database of 409 sounds
- Features: concatenated perceptual and cepstral feature vectors
- Similarity measure: distance from the (SV-based) boundary
- Evaluation: average retrieval accuracy, average retrieval efficiency


20

Application example of SVM in MIR

- Details in applying SVM (a sketch of the setup follows):
  - Both linear and kernel-based approaches are tested
  - Kernel: exponential radial-basis function; C = 200
  - The corpus is randomly partitioned into training and test sets
  - One-against-one (binary tree) for the multi-category task
- Compared with other approaches:
  - NFL (Nearest Feature Line): an unsupervised approach
  - Muscle Fish: normalized Euclidean metric with nearest-neighbor classification
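A hypothetical sketch of such a setup (the data here are random stand-ins, and the exact kernel form and sigma in Li & Guo 2000 may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def exp_rbf(sigma):
    """Exponential RBF (assumed form): K(x, z) = exp(-||x - z|| / (2*sigma^2)).
    Unlike the Gaussian RBF, the Euclidean norm is not squared."""
    def kernel(A, B):
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return np.exp(-d / (2 * sigma**2))
    return kernel

# Stand-ins for 409 concatenated perceptual + cepstral vectors, 16 classes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(409, 20)), rng.integers(0, 16, 409)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel=exp_rbf(sigma=1.0), C=200).fit(X_tr, y_tr)
# decision_function gives the distance-from-boundary similarity measure.
print("test accuracy:", clf.score(X_te, y_te))
```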


21

Application example of SVM in MIR

- Comparison of average error rates: different feature sets across the different approaches

Picture taken from (Li & Guo 2000)


22

Application example of SVM in MIR

Complexity comparison

- SVM:
  - Training: yes
  - Classification complexity: C * (C - 1) / 2 binary decisions (binary tree), for C classes
  - Inner-class complexity: the number of SVs
- NFL:
  - Training: no
  - Classification complexity: linear in the number of classes
  - Inner-class complexity: Nc * (Nc - 1) / 2, for Nc samples per class
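Concretely, for the 16-class task above, the one-against-one scheme works through C(C - 1)/2 = 16 * 15 / 2 = 120 binary decisions per query.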


23

Future work

- Speed up the quadratic programming
- Choice of kernel functions
- Find opportunities in problems that have so far been out of reach
- Generalize the non-linear kernel approach beyond SVM, e.g. kernel PCA (principal component analysis)


24

Bibliography

- Summary: http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf
- HTML bibliography: http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm