
1

Support Vector Machine (SVM)

MUMT611, Beinan Li, Music Tech @ McGill, 2005-3-17


2

Content

- Related problems in pattern classification
- VC theory and VC dimension
- Overview of SVM
- Application example


3

Related problems in pattern classification

- Small sample-size effect (peaking effect): an overly small or overly large sample size results in large error.
- In a typical Bayesian classifier, the probability densities of the global set are estimated inaccurately from finite sample sets.
- Training data vs. test data
- Empirical risk vs. structural risk
- Misclassifying yet-to-be-seen data

Picture taken from (Ridder 1997)


4

Related problems in pattern classification

- Avoid solving a more general problem as an intermediate step (Vapnik 1995): classify without estimating probability densities.
- ANN: depends on prior knowledge; uses the empirical-risk method (ERM)
  - Problem of generalization: over-fitting is hard to control
- Goal: a theoretical analysis of the validity of ERM


5

VC theory and VC dimension

- VC dimension (classifier complexity): the maximum size of a sample set that a decision function can separate under every possible labeling.
- A finite VC dimension implies the consistency of ERM.
- VC theory is the theoretical basis of both ANN and SVM.
- Linear decision function: VC dim = number of parameters
- Non-linear decision function: VC dim <= number of parameters (illustrated below)
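A concrete instance of the linear case (an illustration, not from the slides): in the plane, the decision function

```latex
D(\mathbf{x}) = \operatorname{sign}(w_1 x_1 + w_2 x_2 + b)
```

has three parameters and VC dim = 3: three non-collinear points can be labeled in all 2^3 = 8 ways by some line, but no set of four points can be. The XOR labeling, revisited on slide 10, is the classic counterexample.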


6

Overview of SVM

- Structural-risk method (SRM): minimize the empirical risk while controlling the VC dimension; the result is a tradeoff between empirical risk and over-fitting.
- Focus on the explicit problem of classification: find the optimal hyperplane dividing the two classes.
- Supervised learning


7

Margin and Support Vectors (SV)

- The two-category, linearly separable case
- Small vs. large margin

Picture taken from (Ferguson 2004)


8

Margin and Support Vectors (SV)

- For two-category, linearly separable data: find a hyperplane with the largest margin to the sample vectors of both classes.
- D(x) = w^T x + b; in augmented notation, D(x') = a^T x' with a = (b, w) and x' = (1, x).
- Multiple solutions exist in weight space: find the weight that yields the largest margin.
- The margin is determined by the SVs.

Picture taken from (Ferguson 2004)


9

Mathematical detail

- Constraints: y_i D(x_i) >= 1, with labels y_i in {+1, -1}
- Margin: y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'; maximizing the margin amounts to minimizing ||a||
- Quadratic programming: find the minimum ||a|| under the linear constraints
- The weights are expressed through Lagrange multipliers
- Via the Kuhn-Tucker conditions, the problem reduces to a dual problem written entirely in dot products
- The parameters of the decision function, and its complexity, are completely determined by the SVs (see the sketch below).
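A minimal sketch of this machinery (not from the slides), using scikit-learn's SVC on toy data; the very large C approximates the hard-margin case above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-category, linearly separable 2-D data.
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]           # D(x) = w.x + b
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2.0 / np.linalg.norm(w))  # max margin = 2 / ||w||
```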


10

Linearly non-separable case

- Example: the XOR problem
- Sample-set size is 4, but the VC dim of a line in the plane is 3, so no linear separation exists.

Pictures taken from (Ferguson 2004)


11

Linearly non-separable case

- Map the data to a higher-dimensional space in which it becomes linearly separable.
- Make a linear decision in the higher-dimensional space.
- Example: XOR, mapped to a 6-D space by

  φ(x) = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2)

- Back in the original space, the decision function is D(x) = x_1 x_2 (checked in the sketch below).

Picture taken from (Ferguson 2004)
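A quick check of this example (an illustration, not from the slides): a degree-2 polynomial kernel K(x, z) = (x·z + 1)^2 is exactly the dot product under the 6-D map above, and it separates XOR:

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# gamma=1, coef0=1, degree=2 gives K(x, z) = (x.z + 1)^2, i.e. the dot
# product under the 6-D feature map phi above. Large C ~ hard margin.
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1e6).fit(X, y)

print(clf.predict(X))   # recovers the XOR labels [-1, 1, 1, -1]
print(clf.n_support_)   # slide 12 notes all 4 samples are SVs
```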


12

Linearly non-separable case

- The separating hyperplane in both the original and the higher-dimensional space (projected onto the 2-D plane).
- All 4 samples are SVs.

Picture taken from (Ferguson 2004; Luo 2002)


13

Linearly non-separable case

- Modify the quadratic programming:
  - "Soft margin" slack variables: y_i D(x_i) >= 1 - ε_i
  - Penalty function; upper bound C for the Lagrange multipliers
- Kernel function: the dot product in the higher-dimensional space, expressed in terms of the original parameters
  - Yields a symmetric, positive semi-definite matrix, satisfying Mercer's theorem
  - Standard candidates: polynomial, Gaussian radial-basis function (written out below)
  - Selection of the kernel depends on prior knowledge
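In standard notation (a reference formulation; the slide does not spell it out), the soft-margin program and the two kernels named above read:

```latex
\min_{\mathbf{w},\,b,\,\varepsilon}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \varepsilon_i
\quad\text{s.t.}\quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1-\varepsilon_i,\ \varepsilon_i \ge 0;
\qquad
K_{\text{poly}}(\mathbf{x},\mathbf{z}) = (\mathbf{x}^{\top}\mathbf{z}+1)^d,\quad
K_{\text{RBF}}(\mathbf{x},\mathbf{z}) = \exp\!\big(-\|\mathbf{x}-\mathbf{z}\|^2/2\sigma^2\big)
```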


14

Implementation with a large sample set

- Large computation: one Lagrange multiplier per sample
- Reductionist approach (sketched below):
  - Divide the sample set into batches (subsets)
  - Accumulate the SV set from batch-by-batch operations
  - Assumption: samples that are not SVs locally are not global SVs either
- Several algorithms, varying in subset size:
  - Vapnik: chunking algorithm
  - Osuna: Osuna's algorithm
  - Platt: SMO algorithm (only 2 samples per operation; the most popular)
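A hypothetical sketch of the reductionist scheme (the batch size, parameters, and the helper name chunked_svm are my own, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, batch_size=200, **svm_params):
    """Train batch by batch, carrying forward only the SVs found so far.
    Relies on the assumption above: a sample that is not an SV in its
    batch is discarded, even though a full solve might have kept it.
    Assumes every batch contains samples of both classes."""
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty(0, dtype=y.dtype)
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([sv_X, X[start:start + batch_size]])
        batch_y = np.concatenate([sv_y, y[start:start + batch_size]])
        clf = SVC(**svm_params).fit(batch_X, batch_y)
        sv_X, sv_y = batch_X[clf.support_], batch_y[clf.support_]
    return clf
```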


15

From 2-category to multi-category SVM

- There is no uniform way to extend.
- Common ways (both shown in the sketch below):
  - One-against-all
  - One-against-one: binary tree
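Both schemes are available off the shelf; a minimal sketch with scikit-learn (an illustration, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-against-one: SVC trains C*(C-1)/2 pairwise binary classifiers.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

# One-against-all: one binary classifier per class.
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
```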


16

Advantages of SVM

- Strong mathematical basis
- The decision function and its complexity are completely determined by the SVs.
- Training time does not depend on the dimensionality of the feature space, only on the fixed input space.
- Good generalization; insensitive to the "curse of dimensionality"
- Versatile choices of kernel function, up to feature-less classification: the kernel acts as a data-similarity measure.


17

Drawbacks of SVM

- Still relies on prior knowledge: the choices of C, kernel, and penalty function
  - C: how far the decision function adapts itself to avoid any error
  - Kernel: how much freedom (dimension) the SVM has to adapt itself
- Overlapping classes
- Reductionism may discard promising SVs at any batch step; classification can be limited by the size of the problem.
- No uniform way to extend from 2-category to multi-category
- "Still not an ideal optimally-generalizing classifier."


18

Applications

- Vapnik et al. at AT&T: handwritten digit recognition; error rate lower than that of ANN
- Speech recognition
- Face recognition
- MIR
- SVM-light: open-source C library


19

Application example of SVM in MIR

- Li & Guo 2000 (Microsoft Research China)
- Problem: classify 16 classes of sounds in a database of 409 sounds
- Features: concatenated perceptual and cepstral feature vectors
- Similarity measure: distance from the (SV-based) boundary
- Evaluation: average retrieval accuracy, average retrieval efficiency


20

Application example of SVM in MIR

- Details in applying SVM (a sketch of the setup follows):
  - Both linear and kernel-based approaches are tested
  - Kernel: exponential radial-basis function; C = 200
  - The corpus is randomly partitioned into training and test sets
  - One-against-one (binary tree) for the multi-category task
- Compared with other approaches:
  - NFL (Nearest Feature Line): an unsupervised approach
  - Muscle Fish: normalized Euclidean metric with nearest-neighbor classification
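A hypothetical sketch of such a setup (the data here are random stand-ins, and the exact kernel form and sigma in Li & Guo 2000 may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def exp_rbf(sigma):
    """Exponential RBF (assumed form): K(x, z) = exp(-||x - z|| / (2*sigma^2)).
    Unlike the Gaussian RBF, the Euclidean norm is not squared."""
    def kernel(A, B):
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return np.exp(-d / (2 * sigma**2))
    return kernel

# Stand-ins for 409 concatenated perceptual + cepstral vectors, 16 classes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(409, 20)), rng.integers(0, 16, 409)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel=exp_rbf(sigma=1.0), C=200).fit(X_tr, y_tr)
# decision_function gives the distance-from-boundary similarity measure.
print("test accuracy:", clf.score(X_te, y_te))
```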


21

Application example of SVM in MIR

- Comparison of average error rates: different feature sets across the different approaches

Picture taken from (Li & Guo 2000)


22

Application example of SVM in MIR

Complexity comparison

- SVM:
  - Training: yes
  - Classification complexity: C * (C - 1) / 2 binary decisions (binary tree), for C classes
  - Inner-class complexity: the number of SVs
- NFL:
  - Training: no
  - Classification complexity: linear in the number of classes
  - Inner-class complexity: Nc * (Nc - 1) / 2, for Nc samples per class
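Concretely, for the 16-class task above, the one-against-one scheme works through C(C - 1)/2 = 16 * 15 / 2 = 120 binary decisions per query.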


23

Future work

- Speed up the quadratic programming
- Choice of kernel functions
- Find opportunities in problems that have so far been out of reach
- Generalize the non-linear kernel approach beyond SVM, e.g. kernel PCA (principal component analysis)


24

Bibliography

- Summary: http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf
- HTML bibliography: http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm