ICML 2004, Banff, Alberta, Canada
Learning Larger Margin Machine Locally and Globally
Kaizhu Huang ([email protected]), Haiqin Yang, Irwin King, Michael R. Lyu
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
July 5, 2004
Learning Larger Margin Machine Locally and Globally
Contributions
Background:
– Linear Binary Classification
– Motivation
Maxi-Min Margin Machine (M4):
– Model Definition
– Geometrical Interpretation
– Solving Methods
– Connections with Other Models
– Nonseparable Case
– Kernelization
Experimental Results
Future Work
Conclusion
Contributions
Theory: a unified model of the Support Vector Machine (SVM), the Minimax Probability Machine (MPM), and Linear Discriminant Analysis (LDA).
Practice: a sequential Conic Programming problem.
Background: Linear Binary Classification
Given two classes of data sampled from x and y, we try to find a linear decision plane wᵀz + b = 0 that correctly discriminates x from y:
– wᵀz + b < 0: z is classified as y;
– wᵀz + b > 0: z is classified as x;
– wᵀz + b = 0: the decision hyperplane.
Because only partial information is available, we need a criterion to select among the candidate hyperplanes.
(Figure: two classes of data, x and y, with candidate separating hyperplanes)
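The decision rule above can be sketched in a few lines (the hyperplane parameters and test points below are hypothetical toy values):

```python
import numpy as np

def classify(w, b, z):
    """Linear binary decision rule: the sign of w^T z + b."""
    score = np.dot(w, z) + b
    return "x" if score > 0 else "y"   # points exactly on the plane are ambiguous

# Hypothetical hyperplane and test points (not from the slides)
w, b = np.array([1.0, -1.0]), 0.5
print(classify(w, b, np.array([2.0, 0.0])))   # w^T z + b = 2.5 > 0 -> "x"
print(classify(w, b, np.array([0.0, 3.0])))   # w^T z + b = -2.5 < 0 -> "y"
```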
Background: Support Vector Machine
Support Vector Machine (SVM): the optimal hyperplane is the one that maximizes the margin between the two classes of data.
The SVM boundary is determined exclusively by a few critical points called support vectors; all other points are irrelevant to the decision plane.
Thus SVM discards global information.
(Figure: hyperplane wᵀz + b = 0 with its margin and support vectors between classes x and y)
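That only support vectors matter can be seen from the dual-form SVM decision function f(z) = Σᵢ αᵢ yᵢ K(xᵢ, z) + b, in which αᵢ = 0 for every non-support vector. A sketch with hypothetical toy values:

```python
import numpy as np

def svm_decision(alphas, labels, X, b, z, kernel=np.dot):
    """Dual-form SVM: only points with alpha_i > 0 (support vectors) contribute."""
    return sum(a * y * kernel(x, z) for a, y, x in zip(alphas, labels, X)) + b

# Hypothetical training points, labels, and dual coefficients
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-3.0, 2.0]])
y = np.array([1, 1, -1, -1])
alphas = np.array([0.5, 0.0, 0.5, 0.0])   # zero alphas: non-support vectors

z = np.array([0.5, 0.5])
f_all = svm_decision(alphas, y, X, 0.0, z)
# Dropping the zero-alpha points leaves the decision value unchanged
f_sv = svm_decision(alphas[[0, 2]], y[[0, 2]], X[[0, 2]], 0.0, z)
print(f_all, f_sv)
```

Removing the non-support vectors does not move the boundary at all, which is exactly the sense in which SVM discards global information.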
Learning Locally and Globally
Along the dashed axis, the y data have a larger data trend (spread) than the x data. A more reasonable hyperplane should therefore lie closer to the x data, rather than sitting in the middle of the two classes as in SVM.
(Figure: the SVM hyperplane vs. a more reasonable hyperplane wᵀz + b = 0 between classes x and y)
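The "data trend" along the normal direction w can be quantified as the projected covariance wᵀΣw; comparing each class's normalized margin (distance of the class mean from the plane divided by its projected spread) shows why the midpoint plane is not ideal. A synthetic sketch (distributions and the candidate plane are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Class y spreads much more along the first axis than class x
X = rng.normal([3.0, 0.0], [0.5, 0.5], size=(200, 2))
Y = rng.normal([-3.0, 0.0], [3.0, 0.5], size=(200, 2))

w, b = np.array([1.0, 0.0]), 0.0          # midpoint hyperplane w^T z + b = 0

def trend(S, w):
    """Projected standard deviation sqrt(w^T Sigma w) along direction w."""
    return np.sqrt(w @ np.cov(S.T) @ w)

margin_x = (w @ X.mean(axis=0) + b) / trend(X, w)
margin_y = -(w @ Y.mean(axis=0) + b) / trend(Y, w)
# y's larger trend shrinks its normalized margin: the midpoint plane is not
# equally safe for both classes, so shifting it toward x equalizes the risk.
print(margin_x > margin_y)
```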
M4: Learning Locally and Globally
M4: Geometric Interpretation
M4: Solving Method
Divide and conquer:
If we fix ρ to a specific value ρn, the problem reduces to checking whether this ρn satisfies the following constraints:
If yes, we increase ρn; otherwise, we decrease it.
This feasibility check is a Second Order Cone Programming problem!
M4: Solving Method (Cont')
Iterate the following two divide-and-conquer steps:
This yields a sequential Second Order Cone Programming problem!
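The line search over ρ can be sketched as a bisection around a feasibility oracle; the oracle below is a hypothetical stand-in (in the actual algorithm, each call solves a second order cone feasibility problem in w and b):

```python
def line_search_rho(feasible, lo=0.0, hi=10.0, tol=1e-6):
    """Bisection on rho, assuming feasible(rho) is True below some threshold."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            lo = mid        # constraints satisfiable: try a larger margin
        else:
            hi = mid        # infeasible: back off
    return lo

# Hypothetical oracle: pretend the largest feasible rho is 2.5
rho_star = line_search_rho(lambda rho: rho <= 2.5)
print(round(rho_star, 4))   # 2.5
```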
M4: Solving Method (Cont')
(Flowchart: check whether ρn can satisfy the constraints; if yes, increase ρn, otherwise decrease it)
M4: Links with MPM
Summing the constraints over all the data points yields exactly the MPM optimization problem!
M4: Links with MPM (Cont')
(Diagram: derivation from M4 to MPM)
Remarks: the procedure is not reversible; MPM is a special case of M4.
MPM builds the decision boundary GLOBALLY, i.e., it depends exclusively on the means and covariances. However, the means and covariances may not be estimated accurately.
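The caveat about estimation accuracy is easy to see empirically: with few samples, the empirical covariance deviates substantially from the true one (a synthetic sketch; the distribution and sample sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
true_cov = np.array([[2.0, 0.8], [0.8, 1.0]])

def cov_error(n):
    """Frobenius-norm error of the empirical covariance from n samples."""
    samples = rng.multivariate_normal([0.0, 0.0], true_cov, size=n)
    return np.linalg.norm(np.cov(samples.T) - true_cov)

few, many = cov_error(10), cov_error(10000)
print(few > many)   # small samples give a much larger estimation error
```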
M4: Links with SVM
If one assumes Σ = I, the magnitude of w can be scaled up without influencing the optimization. This yields exactly the Support Vector Machine!
SVM is a special case of M4.
M4: Links with SVM (Cont')
SVM corresponds to assuming Σ = I (Assumption 1 and Assumption 2); these two assumptions of SVM are inappropriate.
M4: Links with LDA
If one assumes Σx = Σy = (Σ*y + Σ*x)/2 and performs a procedure similar to that for MPM, M4 reduces to LDA.
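Under this assumption the resulting direction coincides with the Fisher LDA direction w ∝ Σ⁻¹(x̄ − ȳ), where Σ is the averaged within-class covariance. A minimal numpy sketch on synthetic data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([2.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=500)
Y = rng.multivariate_normal([-2.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=500)

# Fisher LDA direction: averaged within-class covariance, mean difference
S = 0.5 * (np.cov(X.T) + np.cov(Y.T))
w = np.linalg.solve(S, X.mean(axis=0) - Y.mean(axis=0))
w /= np.linalg.norm(w)
print(w)
```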
M4: Links with LDA (Cont')
Assumption: Σx = Σy = (Σ*y + Σ*x)/2. Is this assumption still inappropriate?
Nonseparable Case
Introduce slack variables.
How to solve? Line search + Second Order Cone Programming.
Nonlinear Classifier: Kernelization
• Map the data to a higher-dimensional feature space Rf: xi → φ(xi), yi → φ(yi).
• Construct the linear decision plane f(γ, b) = γᵀφ(z) + b in the feature space Rf, with γ ∈ Rf, b ∈ R. In Rf, we need to solve the corresponding optimization problem.
• However, we do not want to solve this in an explicit form of φ. Instead, we want to solve it in a kernelized form: K(z1, z2) = φ(z1)ᵀφ(z2).
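The kernelized form replaces every inner product φ(z1)ᵀφ(z2) by a kernel evaluation, so the Gram matrix can be computed without ever forming φ explicitly. A sketch with the Gaussian (RBF) kernel, a common choice (the points and bandwidth σ are hypothetical):

```python
import numpy as np

def rbf_gram(Z1, Z2, sigma=1.0):
    """K[i, j] = exp(-||z1_i - z2_j||^2 / (2 sigma^2)) = phi(z1_i)^T phi(z2_j)."""
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(Z, Z)
print(K.shape)   # (3, 3)
# K is symmetric with unit diagonal, as any valid RBF Gram matrix must be
```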
Nonlinear Classifier: Kernelization
Nonlinear Classifier: Kernelization
Notation
Experimental Results
Toy example: two Gaussian data sets with different data trends.
Experimental Results
Data sets: UCI Machine Learning Repository. Procedure: 10-fold cross validation. Solving packages: SVM: LIBSVM 2.4; M4: SeDuMi 1.05; MPM: MPM 1.0.
In the linear case, M4 outperforms SVM and MPM. With Gaussian kernels, M4 is slightly better than or comparable to SVM, because (1) sparsity in the feature space results in inaccurate estimation of the covariance matrices, and (2) kernelization may not preserve the topology of the original data: maximizing the margin in the feature space does not necessarily maximize the margin in the original space.
Experimental Results
An example illustrating that maximizing the margin in the feature space does not necessarily maximize the margin in the original space (from Simon Tong et al., Restricted Bayes Optimal Classifiers, AAAI, 2000).
Future Work
Speeding up M4:
– M4 contains support vectors; can we exploit their sparsity as has been done in SVM?
– Can we remove redundant points?
How to impose constraints on the kernelization so as to preserve the topology of the data?
Generalization error bound? SVM and MPM both have error bounds.
How to extend M4 to multi-category classification?
Conclusion
Proposed a new large-margin classifier, M4, which learns the decision boundary both locally and globally.
Built theoretical connections with other models: a unified model of SVM, MPM, and LDA.
Developed a sequential Second Order Cone Programming algorithm for M4.
Experimental results demonstrated the advantages of the new model.