Lecture Topic: Support Vector Machines

Jakub Marecek and Sean McGarraghy (UCD), Numerical Analysis and Software, November 17, 2015

Support Vector Machines

In Chapter 7, we have seen regression analysis, where the goal was to understand the relationship between independent variables and a continuous dependent variable.

When the dependent variable takes values out of a finite set of values, the problem is known as multi-class classification.

When the dependent variable takes one out of two values, the problem is often known just as classification.

The function from the independent variables to the discrete set is called a classifier.

In this chapter, we will be concerned with the training of such classifiers.

Spam Email

Consider, for example, spam.

There are large collections of emails available, classified as either “spam” or “genuine” email.

One should like to use such a collection to derive rules that one could apply to an incoming email to decide whether it is genuine.

Such a decision is based, among other things, on the presence of $n$ distinct words from a dictionary.

Each email can be represented by a row-vector, where each column corresponds to a distinct word.

A collection of $m$ emails can be represented by a matrix $A \in \mathbb{R}^{m \times n}$ and a column-vector $y \in \mathbb{R}^m$, which encodes spam as 1 and genuine as -1.
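
As a concrete illustration, such a bag-of-words representation can be assembled directly. The following is a minimal sketch with a hypothetical three-word dictionary and two toy emails (the dictionary, emails, and labels are assumptions made here for illustration, not data from the lecture):

```python
import numpy as np

# Hypothetical toy dictionary and emails.
dictionary = ["account", "inheritance", "meeting"]
emails = [
    "claim your inheritance account now",   # spam
    "meeting about the project account",    # genuine
]
labels = [1, -1]                             # 1 = spam, -1 = genuine

# A[j, k] counts how often dictionary word k occurs in email j.
A = np.array([[email.split().count(word) for word in dictionary]
              for email in emails], dtype=float)
y = np.array(labels, dtype=float)

print(A.shape, y.shape)   # (m, n) = (2, 3) and (m,) = (2,)
```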

Spam Email (Continued)

The obvious issue is that words common in spam (e.g., account, inheritance, enlargement) are also present in genuine emails.

In training, one uses loss functions to measure how good a particular classifier $x \in \mathbb{R}^n$, which assigns a weight to each word in the dictionary, is on a particular row $A_{j:}$, $1 \le j \le m$, of the matrix, considering the classification $y^{(j)}$ of the row.

Using $\frac{1}{2}\max\{0,\, 1 - y^{(j)} A_{j:} x\}^2$, where $A_{j:}$ denotes the $j$th row of $A$, is called the hinge loss squared.

There are many more variants.
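
To make the loss concrete, the following minimal numpy sketch evaluates the hinge loss squared of one candidate classifier $x$ on every row of $A$; the particular $A$, $y$, and $x$ below are hypothetical toy values:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],    # toy data matrix, one row per email
              [1.0, 0.0, 1.0]])
y = np.array([1.0, -1.0])         # 1 = spam, -1 = genuine
x = np.array([0.5, 0.5, -1.0])    # candidate weights, one per dictionary word

margins = y * (A @ x)                                 # y^(j) A_{j:} x
losses = 0.5 * np.maximum(0.0, 1.0 - margins) ** 2    # hinge loss squared per row
print(losses, losses.sum())
```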

Genome-Wide Association Studies

Now, let us consider genomics.

There are large collections of sequenced genomes available. Usually, you have genomes of thousands of people suffering from a (polygenic) disease and genomes of tens of thousands of healthy people, or “controls”.

One should like to use such a collection to understand which genes cause the (polygenic) disease.

A genome can be encoded by $n$ “features”, which correspond to combinations of a position (single-nucleotide polymorphism) and a base (e.g., G, A).

For $m$ genomes available, a column-vector $y \in \mathbb{R}^m$ suggests whether the person is healthy (1) or not (-1).

Just as above, one uses loss functions to measure the quality of a classifier.
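
A minimal sketch of the (position, base) feature encoding described above; the positions, observed bases, and labels are hypothetical:

```python
import numpy as np

positions = ["rs1", "rs2"]            # hypothetical SNP positions
bases = ["A", "C", "G", "T"]
genomes = [{"rs1": "G", "rs2": "A"},  # case    (y = -1)
           {"rs1": "G", "rs2": "T"}]  # control (y = 1)

# One binary feature per (position, base) combination.
features = [(p, b) for p in positions for b in bases]
A = np.array([[1.0 if g.get(p) == b else 0.0 for (p, b) in features]
              for g in genomes])
y = np.array([-1.0, 1.0])
print(A)   # shape (m, n) with n = len(positions) * len(bases)
```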

Key Concepts

SVM stands for support vector machine, which is one of the best-known methods for classification, and for producing a linear classifier in particular: a function $f : \mathbb{R}^n \to \{-1, 1\}$, $f(x) := \operatorname{sign}(w^T x + b)$, defined by $w \in \mathbb{R}^n$, $b \in \mathbb{R}$, where sign gives 1 for positive values, -1 for negative values, and 0 for 0.

The hinge loss squared is $\frac{1}{2}\max\{0,\, 1 - y^{(j)} A_{j:} x\}^2$.
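
A minimal sketch of such a linear classifier, with hypothetical $w$ and $b$:

```python
import numpy as np

def linear_classifier(x, w, b):
    """f(x) = sign(w^T x + b): 1 for positive, -1 for negative, 0 for 0."""
    return int(np.sign(w @ x + b))

w = np.array([0.5, -1.0])    # hypothetical weights
b = 0.25                     # hypothetical offset
print(linear_classifier(np.array([2.0, 0.5]), w, b))   # -> 1
```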

A Picture to Keep in Mind

[Figure: a scatter plot of examples from Class 1 and Class 2 in the (Feature 1, Feature 2) plane, separated by the max-margin hyperplane.]

Classification

Informally, classification considers a number of examples, where each example is represented as a point in an $n$-dimensional space together with a value of -1 or 1.

The goal in training of support vector machines is to find a hyperplane that separates examples marked 1 from examples marked -1 as well as possible, in some sense.

In order to determine the classification of a new example, one maps it into the same $n$-dimensional space and looks at which side of the hyperplane it lies.

Classification

There are a number of formalisations, going back to the linear discriminant analysis of Fisher.

Generally, the examples are represented by a matrix $A \in \mathbb{R}^{m \times n}$ and a compatible vector $y \in \{-1, 1\}^m$.

Rows of the matrix $A$ represent observations of $n$ features each, and $y$ contains the corresponding classifications to train the classifier on.

The input $(A, y)$ is often referred to as the training data.

A linear classifier of $x \in \mathbb{R}^n$ is a function $f(x) := \operatorname{sign}(w^T x + b)$, defined by $w \in \mathbb{R}^n$, $b \in \mathbb{R}$, where sign gives 1 for positive values, -1 for negative values, and 0 for 0.

Classification

The equation $w^T x + b = 0$ defines a hyperplane in $\mathbb{R}^n$.

If there exists a hyperplane $f$ such that all points $x \in \mathbb{R}^n$ representing examples with value -1 have $f(x) < 0$ and all points $x \in \mathbb{R}^n$ representing examples with value 1 have $f(x) > 0$, we call the instance linearly separable.

Then, one can scale the problem so as to obtain $w^T A_{i:} + b \ge +1$ for $i$ with $y_i = +1$ and $w^T A_{i:} + b \le -1$ for $i$ with $y_i = -1$, which should be seen as two parallel hyperplanes.
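
As a small check of these scaled margin constraints, the sketch below verifies $y_i(w^T A_{i:} + b) \ge 1$ on a toy data set; all values are hypothetical:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [4.0, 1.0],
              [5.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([-1.0, 1.0])    # hypothetical separating direction
b = 1.0

margins = y * (A @ w + b)    # y_i (w^T A_{i:} + b) must be >= 1 for every i
print(margins, bool(np.all(margins >= 1.0)))
```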

Classification

In the linearly separable case, one should like to maximise the distance between the hyperplanes.

The distance is $2/\sqrt{w^T w}$ (“twice the margin”) and one hence hopes to find:

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \ \frac{1}{2}\|w\|_2^2 \qquad (2.1)$$
$$\text{s.t. } y_i (w^T A_{i:} + b) \ge 1 \quad \forall\, 1 \le i \le m,$$

or, by duality,

$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j A_{i:} A_{j:}^T + b \sum_i \alpha_i y_i - \sum_i \alpha_i \qquad (2.2)$$
$$\text{s.t. } \alpha_i \ge 0 \quad \forall\, 1 \le i \le m.$$

From the KKT conditions, $w = \sum_i \alpha_i y_i A_{i:}^T$ and

$\alpha_i \ge 0$ for all examples $i$ on the boundary, i.e. $y_i (w^T A_{i:} + b) - 1 = 0$, and

$\alpha_i = 0$ in the interior.

The hyperplane is determined by the examples on the boundary, which are called support vectors.
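
For illustration, one can solve the max-margin problem numerically with an off-the-shelf solver. The sketch below uses scikit-learn's SVC with a linear kernel, assuming scikit-learn is installed; a large C makes the soft-margin problem it solves approximate the hard-margin problem above, and the fitted model exposes the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

A = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0], [5.0, 2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)    # large C ~ (almost) hard margin
clf.fit(A, y)

print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane
print(clf.support_vectors_)          # the examples on the boundary
```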

A Linearly Separable Example (Again)

[Figure: the same scatter plot of Class 1 and Class 2 in the (Feature 1, Feature 2) plane, with the max-margin hyperplane.]

Beyond Linear Separability

The issue is that the data are generally not linearly separable.

When the data are generally not linearly separable, there exists no $w \in \mathbb{R}^n$ that would make the constrained problem feasible.

One remedy is to consider the Lagrangian relaxation:

$$\min_{w \in \mathbb{R}^n,\, \lambda \in \mathbb{R}^m} \ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m} \lambda_i P(w, A_{i:}, y^{(i)}), \qquad (3.1)$$

where $P(w, A_{i:}, y^{(i)})$ measures the violation of the $i$th constraint.

It makes sense to focus on the loss function $P(w, A_{i:}, y^{(i)})$ in the Lagrangian relaxation, rather than the max-margin term $\|w\|_2^2$.
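
A minimal numpy sketch evaluating the relaxed objective (3.1) for one hypothetical $w$, with a hinge-type violation used as the penalty $P$ (the data, multipliers, and $w$ are all assumptions):

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0], [5.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = np.full(A.shape[0], 10.0)     # multipliers lambda_i
w = np.array([-0.4, 0.4])           # hypothetical candidate

violation = np.maximum(0.0, 1.0 - y * (A @ w))   # P(w, A_{i:}, y^(i))
objective = 0.5 * w @ w + lam @ violation        # (3.1)
print(objective)
```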

Classification

Given a matrix $A \in \mathbb{R}^{m \times n}$ and a compatible vector $y \in \mathbb{R}^m$, the goal is hence to find a vector $x \in \mathbb{R}^n$ which solves the following optimization problem:

$$\min_{x \in \mathbb{R}^n} \ \sum_{j=1}^{m} P(x, A_{j:}, y^{(j)}), \qquad (3.2)$$

where $P$ is a loss function.

Ultimately, the loss functions approximate the so-called 0/1-loss, which is 0 for arguments 0 and larger, and 1 elsewhere.

The Loss Functions

[Figure: the zero-one loss, hinge loss, logistic loss, and hinge square loss plotted as functions of the decision function $f(x)$.]

Classification

Common loss functions include:

$$P_{HL}(x, A_{j:}, y^{(j)}) := \frac{1}{2}\max\{0,\, 1 - y^{(j)} A_{j:} x\}, \quad \text{hinge loss,} \qquad \text{(HL)}$$

$$P_{HSL}(x, A_{j:}, y^{(j)}) := \frac{1}{2}\max\{0,\, 1 - y^{(j)} A_{j:} x\}^2, \quad \text{hinge loss squared,} \qquad \text{(HLS)}$$

$$P_{LL}(x, A_{j:}, y^{(j)}) := \log(1 + e^{-y^{(j)} A_{j:} x}), \quad \text{logistic loss.} \qquad \text{(LL)}$$
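
The three losses can be compared directly. The sketch below evaluates them on a few hypothetical values of the margin $y^{(j)} A_{j:} x$:

```python
import numpy as np

margins = np.array([-2.0, 0.0, 0.5, 1.0, 2.0])          # hypothetical values of y^(j) A_{j:} x

hinge    = 0.5 * np.maximum(0.0, 1.0 - margins)         # (HL)
hinge_sq = 0.5 * np.maximum(0.0, 1.0 - margins) ** 2    # (HLS)
logistic = np.log(1.0 + np.exp(-margins))               # (LL)

for m, h, hs, ll in zip(margins, hinge, hinge_sq, logistic):
    print(f"margin={m:+.1f}  HL={h:.3f}  HLS={hs:.3f}  LL={ll:.3f}")
```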

Non-Smooth Losses

Notice that:

0/1-loss is non-convex and non-smooth

square loss, $P_{SL}(x, A_{j:}, y^{(j)}) := \frac{1}{2}(y^{(j)} - A_{j:} x)^2$, is convex and smooth, but one obtains the linear regression of Chapter 7

hinge loss and hinge loss squared are both convex, but the use of max makes them both non-smooth.

Subgradient Methods

One hence needs to develop subgradient methods.

For hinge loss squared, the dual has the form:

$$\min_{x \in \mathbb{R}^m} \ F(x) := \underbrace{\frac{1}{2\lambda m^2} x^T Q x - \frac{1}{m} x^T \mathbf{1}}_{f(x)} + \underbrace{\sum_{i=1}^{m} \Phi_{[0,1]}(x^{(i)})}_{\Psi(x)}, \qquad \text{(SVM-DUAL)}$$

where $\Phi_{[0,1]}$ is the characteristic (or “indicator”) function of the interval $[0, 1]$ and $Q \in \mathbb{R}^{m \times m}$ is the Gram matrix of the data, i.e., $Q_{i,j} = y^{(i)} y^{(j)} A_{i:} A_{j:}^T$.
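
The Gram matrix $Q$ of (SVM-DUAL) can be formed in one line; a minimal sketch with hypothetical data:

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Q_{i,j} = y^(i) y^(j) A_{i:} A_{j:}^T
Q = np.outer(y, y) * (A @ A.T)
print(Q)
```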

Subgradient Methods

If $x^*$ is an optimal solution of (SVM-DUAL), then $w^* = w^*(x^*) = \frac{1}{\lambda m} \sum_{i=1}^{m} y^{(i)} (x^*)^{(i)} A_{i:}^T$ is an optimal solution of the primal problem

$$\min_{w \in \mathbb{R}^n} \ P(w) := \frac{1}{m} \sum_{i=1}^{m} P(w, A_{i:}, y^{(i)}) + \frac{\lambda}{2}\|w\|^2, \qquad (4.1)$$

where $P(w, A_{i:}, y^{(i)}) = \max\{0,\, 1 - y^{(i)} A_{i:} w\}$.

Subgradient Methods

It is not hard to see that one should like to apply a subgradient method. Let us define an auxiliary vector:

$$g_k := \frac{1}{\lambda m} \sum_{i=1}^{m} x_k^{(i)} y^{(i)} A_{i:}^T. \qquad (4.2)$$

Then

$$\nabla_i f(x) = \frac{y^{(i)} A_{i:} g_k - 1}{m}, \qquad L_i = \frac{\|A_{i:}\|^2}{\lambda m^2}. \qquad (4.3)$$

Subgradient Methods

As in the subgradient method for sparse least squares, one can use coordinate descent with a closed-form step. For some $\alpha, \beta \in \mathbb{R}$, the optimal step length is the solution of a one-dimensional problem:

$$h^{(i)}(x_k) = \arg\min_{t \in \mathbb{R}} \ \nabla_i f(\alpha)\, t + \frac{\beta}{2} L_i t^2 + \Phi_{[0,1]}(\alpha^{(i)} + t) \qquad (4.4)$$
$$= \operatorname{clip}_{[-\alpha^{(i)},\, 1-\alpha^{(i)}]}\left( \frac{\lambda m (1 - y^{(i)} A_{i:} g_k)}{\beta \|A_{i:}\|^2} \right), \qquad (4.5)$$

where for $a < b$

$$\operatorname{clip}_{[a,b]}(\zeta) = \begin{cases} a, & \text{if } \zeta < a,\\ b, & \text{if } \zeta > b,\\ \zeta, & \text{otherwise.} \end{cases}$$
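
A minimal sketch of one coordinate update following (4.2)-(4.5), reading $\alpha$ as the current dual iterate $x_k$ and taking $\beta = 1$; the data and parameters are hypothetical:

```python
import numpy as np

def clip(zeta, a, b):
    """clip_[a,b](zeta): a if zeta < a, b if zeta > b, zeta otherwise."""
    return min(max(zeta, a), b)

A = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
m = A.shape[0]
lam, beta = 0.1, 1.0
x = np.zeros(m)                       # dual variables, one per example, kept in [0, 1]
g = (A.T @ (x * y)) / (lam * m)       # auxiliary vector g_k of (4.2)

i = 0                                 # update a single coordinate i
t = clip(lam * m * (1.0 - y[i] * (A[i] @ g)) / (beta * (A[i] @ A[i])),
         -x[i], 1.0 - x[i])           # step length of (4.5)
x[i] += t
g += (t * y[i] / (lam * m)) * A[i]    # keep g consistent with the new x, cf. (4.6)
print(x, g)
```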

Subgradient Methods

The new value of the auxiliary vector $g_{k+1} = g(x_{k+1})$ is given by

$$g_{k+1} = g_k + \sum_{c=1}^{C} \underbrace{\sum_{i \in Z_k^{(c)}} \frac{1}{\lambda m} h_i(x_k)\, y^{(i)} A_{i:}^T}_{\delta g^{(c)}}, \qquad (4.6)$$

where the outer summation runs over the parts $\delta g^{(c)}$ computed on $C$ different computers, each considering coordinates $Z_k^{(c)}$.

This makes it possible to produce efficient distributed implementations.

Primal-Dual Methods

Alternatively, one can consider the amount $\zeta_i$ of violation in the $i$th example:

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R},\, \zeta \in \mathbb{R}^m} \ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m} \lambda_i \zeta_i, \qquad (5.1)$$
$$\text{s.t. } y_i (w^T A_{i:} + b) \ge 1 - \zeta_i \quad \forall\, 1 \le i \le m, \qquad (5.2)$$
$$\zeta_i \ge 0 \quad \forall\, 1 \le i \le m \qquad (5.3)$$

for a fixed $\lambda \in \mathbb{R}^m$.

There, one can formulate the dual problem and apply a primal-dual method.

For common instances with $n \gg m$, one can exploit block structure to form a linear system in $O(n(m+1)^2)$ operations and invert a matrix in $O((m+1)^3)$, at least on the BSS machine.

One also has quadratic convergence, under reasonable assumptions.

Primal-Dual Methods

Notice, however, that the problem would become non-convex if $\lambda \in \mathbb{R}^m$ were not fixed.

Unlike in general optimisation, where a Lagrangian provides a relaxation to the constrained problem, here the constrained problem with fixed $\lambda$ is a restriction of the unconstrained problem.

Multi-Class Classification

In all of the above, we have been considering classification into two classes, -1 and 1.

When the dependent variable takes values out of a finite set $V$ of values, e.g., $V = \{-1, 0, 1\}$, the problem is known as multi-class classification and the values are said to correspond to “classes”.

Although there are principled methods for the problem, in practice one most often reuses solvers for classification into two classes.

Multi-Class Classification: One vs. Rest

One option is to consider $|V|$ one-versus-rest classifiers. In the example with $V = \{-1, 0, 1\}$, one would train three classifiers: the first for the two classes $\{-1\}$ and $\{0, 1\}$, the second for $\{0\}$ and $\{-1, 1\}$, and finally for $\{+1\}$ and $\{-1, 0\}$.

Once given an example without a classification, one would run all $|V|$ classifiers and pick the class that seems best.
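
A minimal sketch of the one-versus-rest scheme, reusing a two-class trainer as a black box; `train_binary`, the data, and the labels are hypothetical stand-ins, not the lecture's method:

```python
import numpy as np

def train_binary(A, y):
    """Hypothetical stand-in for a two-class trainer; returns (w, b)."""
    # A trivial centroid-difference rule, only to make the sketch runnable.
    w = A[y == 1].mean(axis=0) - A[y == -1].mean(axis=0)
    b = -w @ A.mean(axis=0)
    return w, b

def one_vs_rest(A, labels, classes):
    models = {}
    for c in classes:
        y = np.where(labels == c, 1.0, -1.0)   # class c versus the rest
        models[c] = train_binary(A, y)
    return models

def predict(x, models):
    # Pick the class whose classifier gives the largest decision value.
    return max(models, key=lambda c: models[c][0] @ x + models[c][1])

A = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9], [4.0, 0.0], [3.9, 0.2]])
labels = np.array([-1, -1, 0, 0, 1, 1])
models = one_vs_rest(A, labels, classes=[-1, 0, 1])
print(predict(np.array([2.0, 1.8]), models))   # expected: 0
```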

Multi-Class Classification: One vs. One

The alternative is to consider $|V|(|V|-1)/2$ one-versus-one classifiers.

This may be expensive, and “reconciling” such classifiers is tricky.

Often, the reconciliation mimics voting, where each classifier “casts a vote” for a class, and the class with the most votes wins.

Regularisations

Just as in the previous three chapters, one can consider regularisations, e.g., ℓ1:

\min_{x \in \mathbb{R}^n} F(x) := \underbrace{\sum_{j=1}^{m} P(x, A_{j:}, y^{(j)})}_{f(x)} + \underbrace{\gamma\,\|x\|_1}_{\Psi(x)} \qquad (7.1)

whereby one obtains a problem with a smooth component f and a non-smooth component Ψ, which is in some sense simple.

One can apply the same subgradient machinery as in Chapter 7.
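As a minimal NumPy sketch of that machinery (an illustrative assumption, not the lecture's reference implementation), take P to be the hinge loss; one subgradient step for (7.1) can then be written as follows.

import numpy as np

def subgradient_step(x, A, y, gamma, step):
    # one subgradient step for F(x) = sum_j max(0, 1 - y_j <A_j:, x>) + gamma * ||x||_1
    margins = y * (A @ x)                              # y_j <A_j:, x> for every row j
    active = margins < 1.0                             # rows with a positive hinge loss
    g_f = -(A[active] * y[active, None]).sum(axis=0)   # subgradient of the loss sum f
    g_psi = gamma * np.sign(x)                         # a subgradient of gamma * ||x||_1
    return x - step * (g_f + g_psi)

A = np.array([[0., 0.], [1., 1.]])
y = np.array([-1., 1.])
x = np.zeros(2)
for k in range(100):
    x = subgradient_step(x, A, y, gamma=0.1, step=1.0 / (k + 1))
print(x)

The diminishing step size 1/(k + 1) is one common choice for subgradient schemes; it is an assumption here, not a prescription from the lecture.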

There are many more extensions, but their impact is disputed.

It may hence make sense to “focus on the basics”.


A Summary

Regression, classification, and multi-class classification are the best-known examples of supervised learning.

Instances that fit within 100 MB are often considered to be of “moderate” size and can be solved well using most methods.

This may correspond to a dense 3600 × 3600 matrix A, or to up to 100,000 rows in a sparse matrix with 40 entries per row.
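As a rough check (assuming 8-byte double-precision entries), a dense 3600 × 3600 matrix takes 3600² × 8 ≈ 1.04 × 10⁸ bytes, i.e., roughly 100 MB.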

In the machine learning literature, common sparse matrices of these dimensions include the CCAT variant of RCV1, astro-ph, and COV.


A SciKit Implementation

For such instances, you can use the scikit-learn module in Python to train an SVM:

from sklearn import svm

A = [[0, 0], [1, 1]]   # two training examples with two features each
y = [-1, 1]            # their classes
classifier = svm.SVC()
classifier.fit(A, y)

which uses LIBSVM.


A SciKit Implementation

Once you train the classifier, you can try:

print(classifier.predict([[-1, -1], [2., 2.]]))
print(classifier.support_vectors_)


A SciKit Implementation

The closest to the subgradient method above is SGDClassifier:

from sklearn.linear_model import SGDClassifier

A = [[0, 0], [1, 1]]
y = [-1, 1]
# hinge loss with an l2 penalty corresponds to a linear SVM
classifier = SGDClassifier(loss="hinge", penalty="l2")
classifier.fit(A, y)
print(classifier.predict([[2., 2.]]))
print(classifier.coef_)
print(classifier.intercept_)
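A hedged variant (not shown in the lecture): swapping penalty="l2" for penalty="l1" makes SGDClassifier minimise an ℓ1-regularised objective of the form (7.1), with the alpha parameter controlling the regularisation strength (playing the role of γ, up to scaling).

from sklearn.linear_model import SGDClassifier

A = [[0, 0], [1, 1]]
y = [-1, 1]
# hinge loss + l1 penalty: a sparsity-inducing variant of the linear SVM
classifier = SGDClassifier(loss="hinge", penalty="l1", alpha=0.01)
classifier.fit(A, y)
print(classifier.coef_)   # l1 regularisation tends to drive some coefficients to zero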


A Summary by Andreas Mueller


Performance of Methods for Training SVMs

Many real-world datasets are much larger, though.

Consider the well-known instance WebSpam, which consists of 350,000 emails (rows) and 16,609,143 distinct words (columns). The size of the instance is 25 GB.

It is not hard to imagine that Google's Gmail may have collected many orders of magnitude more emails to learn from.

At that scale, the subgradient methods suggested above scale much better than other approaches.


Performance of Subgradient Methods

The figure shows the execution time and duality gap on WebSpam, using C = 16 processes, with each process using 8 threads.

Especially when there is some additional structure, one can scale much further.

[Figure: duality gap (log scale, from 10⁰ down to 10⁻⁴) against elapsed time in minutes (0 to 100) on WebSpam, with one curve for each of τ = 48, 96, 224, 384, and 768.]
