
A Neural Network View of Kernel Methods

Shuiwang Ji
Texas A&M University
College Station, TX 77843
[email protected]

Youzhi Luo
Texas A&M University
College Station, TX 77843
[email protected]

Zhengyang Wang
Texas A&M University
College Station, TX 77843
[email protected]

Yaochen Xie
Texas A&M University
College Station, TX 77843
[email protected]

1 Introduction

This document aims to provide a complete, self-contained, and easy-to-understand introduction to kernel methods and their relationship with neural networks. The only required background is college-level linear algebra and basic machine learning. This document is based on lecture notes by Shuiwang Ji, compiled by Youzhi Luo, and edited by Zhengyang Wang and Yaochen Xie at Texas A&M University. It can be used for undergraduate and graduate level classes.

2 Feature Mapping

We start by discussing the motivation of feature mapping in binary classification problems. In a binary classification problem, we have training data $\{\tilde{x}_i, y_i\}_{i=1}^m$, where $\tilde{x}_i \in \mathbb{R}^{n-1}$ represents the input feature vector and $y_i \in \{-1, 1\}$ is the corresponding label. In logistic regression, for each sample $\tilde{x}_i$, a linear classifier, parameterized by $\tilde{w} \in \mathbb{R}^{n-1}$ and $b \in \mathbb{R}$, computes the classification score as
$$
h(\tilde{x}_i) = \sigma(\tilde{w}^T \tilde{x}_i + b) = \frac{1}{1 + \exp\left[-(\tilde{w}^T \tilde{x}_i + b)\right]}, \tag{1}
$$
where $\sigma(\cdot)$ is the sigmoid function. Note that the classification score $h(\tilde{x}_i)$ can be interpreted as the probability of $\tilde{x}_i$ having label 1.

For convenience, we let $x_i = \begin{bmatrix} 1 \\ \tilde{x}_i \end{bmatrix} \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, and $w = \begin{bmatrix} b \\ \tilde{w} \end{bmatrix} \in \mathbb{R}^n$. Then we can re-write Eqn. (1) as
$$
h(x_i) = \sigma(w^T x_i) = \frac{1}{1 + \exp\left[-w^T x_i\right]}. \tag{2}
$$

If the training data $\{x_i, y_i\}_{i=1}^m$ are linearly separable, as shown in Figure 1, there exists a $w^* \in \mathbb{R}^n$ such that $y_i w^{*T} x_i \ge 0$, $i = 1, 2, \ldots, m$. In this case, a linear model like logistic regression can perfectly fit the original training data $\{x_i, y_i\}_{i=1}^m$. However, this is not possible for linearly inseparable cases like the one in Figure 2.

A typical method to handle such linearly inseparable cases is feature mapping. That is, instead of using the original $\{x_i\}_{i=1}^m$, we use a feature mapping function $\phi : \mathbb{R}^n \to \mathbb{R}^N$ on $\{x_i\}_{i=1}^m$, so that the mapped feature vectors $\{\phi(x_i)\}_{i=1}^m$ are linearly separable.

Page 2: A Neural Network View of Kernel Methods - people.tamu.edu

Figure 1: An example of linearly separable data. Blue points denote data samples with label 1, while green points denote data samples with label -1.

Figure 2: An example of linearly inseparable data. Colors of points have the same meanings as in Figure 1.

For example, we can map the linearly inseparable data samples in Figure 2 with
$$
\phi(x) = \phi\left( \begin{bmatrix} 1 \\ \tilde{x} \end{bmatrix} \right) = \begin{bmatrix} 1 \\ \tilde{x} \\ \tilde{x}^2 \end{bmatrix}, \tag{3}
$$
and plot the mapped feature vectors in Figure 3. We can see that the mapped feature vectors become linearly separable. With such feature mapping, a linear model can fit the data perfectly. In the context of logistic regression, the whole process can be described as
$$
h(x_i) = \sigma(w^T \phi(x_i)) = \frac{1}{1 + \exp\left[-w^T \phi(x_i)\right]}, \tag{4}
$$

where the dimension of the parameter vector $w$ becomes $N$ accordingly. Figure 4 illustrates logistic regression with feature mapping.
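To make Eqns. (3) and (4) concrete, here is a minimal sketch for scalar inputs $\tilde{x}$ (assuming NumPy; the weights are hand-picked, illustrative values rather than learned ones): data that are inseparable along $\tilde{x}$ become separable once the coordinate $\tilde{x}^2$ is available.

```python
import numpy as np

def phi(x_tilde):
    """Feature mapping of Eqn. (3) for a scalar input: [1, x, x^2]."""
    return np.array([1.0, x_tilde, x_tilde ** 2])

def h(w, x_tilde):
    """Logistic regression with feature mapping, Eqn. (4)."""
    return 1.0 / (1.0 + np.exp(-w @ phi(x_tilde)))

# Toy data in the spirit of Figure 2: label 1 iff |x| > 0.5, which no threshold on x
# alone can express. The weights below threshold x^2 at 0.25 instead.
w = np.array([-2.5, 0.0, 10.0])
for x in [-0.9, -0.2, 0.1, 0.8]:
    print(x, h(w, x))   # close to 1 for |x| > 0.5, close to 0 otherwise
```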

Figure 3: An example of mapping the linearly inseparable data in Figure 2 into linearly separable data with feature mapping (horizontal axis: $\tilde{x}$; vertical axis: $\tilde{x}^2$).

Feature mapping is also useful in other supervised learning problems besides binary classification. Taking regression problems as an example, a linear regression model may not fit well if the regression targets $\{y_i\}_{i=1}^m$ and the original feature vectors $\{x_i\}_{i=1}^m$ are not linearly correlated. In this case, we can first find an appropriate feature mapping $\phi(x)$ that detects more hidden patterns and creates a stronger linear correlation between $\{\phi(x_i)\}_{i=1}^m$ and $\{y_i\}_{i=1}^m$.

In order to achieve strong enough representation power, it is common in practice that $\phi(x)$ has a much higher dimension than $x$, i.e., $N \gg n$. However, this dramatically increases the cost of computing either $\phi(x)$ or $w^T \phi(x)$. In the following, we introduce an efficient way to implicitly compute $w^T \phi(x)$. Specifically, in Section 3, we use the representer theorem to show that computing $w^T \phi(x)$ can be transformed into computing $\sum_{i=1}^m \alpha_i \phi(x_i)^T \phi(x)$, where $\{\alpha_i\}_{i=1}^m$ are learnable parameters. Then in Section 4, we introduce the kernel methods, which significantly reduce the cost of computing $\phi(x_i)^T \phi(x)$.

Figure 4: An illustration of logistic regression with feature mapping.

3 Representer Theorem

Given the training data $\{x_i, y_i\}_{i=1}^m$ ($x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$) and a feature mapping $\phi : \mathbb{R}^n \to \mathbb{R}^N$, to solve a supervised learning task (regression or classification), we need to do the following steps:

• Compute the feature vectors $\{\phi(x_i)\}_{i=1}^m$ of all training samples;
• Initialize a linear model with a parameter vector $w \in \mathbb{R}^N$;
• Minimize the task-specific loss function $L$ on $\{\phi(x_i), y_i\}_{i=1}^m$ with respect to $w$.

Since the loss $L$ is a function of $z = w^T \phi(x)$ and $y$, we can write it as $L(w^T \phi(x), y)$. Minimizing $L(w^T \phi(x), y)$ on $\{\phi(x_i), y_i\}_{i=1}^m$ is an optimization problem:
$$
\min_w \ \frac{1}{m} \sum_{i=1}^m L(w^T \phi(x_i), y_i). \tag{5}
$$

However, in many situations, minimizing $L$ alone may cause the problem of over-fitting. A common method to address over-fitting is to apply $\ell_2$-regularization, changing Equation (5) into Equation (6):
$$
\min_w \ \frac{1}{m} \sum_{i=1}^m L(w^T \phi(x_i), y_i) + \frac{\lambda}{2} \|w\|_2^2, \quad \lambda \ge 0, \tag{6}
$$
where $\lambda \ge 0$ is a hyper-parameter, known as the regularization parameter, controlling the extent to which we penalize large $\ell_2$-norms of $w$.
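As a minimal sketch of the objective in Equation (6) (assuming NumPy; the loss and the data below are hypothetical and only meant to show the shape of the computation), the loss $L$ can be passed in as an arbitrary function of $z = w^T \phi(x_i)$ and $y_i$.

```python
import numpy as np

def regularized_objective(w, Phi, y, loss, lam):
    """Eqn. (6): (1/m) * sum_i L(w^T phi(x_i), y_i) + (lam/2) * ||w||_2^2.

    Phi is the m x N matrix whose i-th row is phi(x_i); loss is any L(z, y).
    """
    z = Phi @ w                                        # all scores w^T phi(x_i) at once
    data_term = np.mean([loss(zi, yi) for zi, yi in zip(z, y)])
    return data_term + 0.5 * lam * np.dot(w, w)

# Example with a squared loss, one possible choice of L for regression.
squared = lambda z, y: (z - y) ** 2
rng = np.random.default_rng(0)
Phi, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
print(regularized_objective(w, Phi, y, squared, lam=0.1))
```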

In order to derive a solution to this optimization problem, we introduce the following theorem, which is a special case of the well-known Representer Theorem. The Representer Theorem is the theoretical foundation of kernel methods.

Theorem 1. If the optimization problem in Equation (6) has optimal solutions, there must exist an optimal solution of the form $w^* = \sum_{i=1}^m \alpha_i \phi(x_i)$.

Proof. Since the elements of $\{\phi(x_i)\}_{i=1}^m$ are all in $\mathbb{R}^N$, they span a subspace $V \subseteq \mathbb{R}^N$ such that $V = \{x : x = \sum_{i=1}^m c_i \phi(x_i)\}$. Assume $\{v_1, \ldots, v_{n'}\}$ is an orthonormal basis of $V$, where $n' \le N$. $V$ also has an orthogonal complement subspace $V^\perp$, which has an orthonormal basis $\{u_1, \ldots, u_{N-n'}\}$. Clearly, $v_k^T u_j = 0$ for any $1 \le k \le n'$ and $1 \le j \le N - n'$.


For an arbitrary vector $w \in \mathbb{R}^N$, we can decompose it into a linear combination of the orthonormal basis vectors of the subspaces $V$ and $V^\perp$. That is, we can write $w$ as
$$
w = w_V + w_{V^\perp} = \sum_{k=1}^{n'} s_k v_k + \sum_{j=1}^{N-n'} t_j u_j. \tag{7}
$$

First, we can show that
$$
\begin{aligned}
\|w\|_2^2 &= \left\| \sum_{k=1}^{n'} s_k v_k + \sum_{j=1}^{N-n'} t_j u_j \right\|_2^2 \\
&= \left( \sum_{k=1}^{n'} s_k v_k^T + \sum_{j=1}^{N-n'} t_j u_j^T \right) \left( \sum_{k=1}^{n'} s_k v_k + \sum_{j=1}^{N-n'} t_j u_j \right) \\
&= \sum_{k=1}^{n'} s_k^2 \|v_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|u_j\|_2^2 + 2 \sum_{k=1}^{n'} \sum_{j=1}^{N-n'} s_k t_j v_k^T u_j \\
&= \sum_{k=1}^{n'} s_k^2 \|v_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|u_j\|_2^2 \\
&\ge \sum_{k=1}^{n'} s_k^2 \|v_k\|_2^2 = \|w_V\|_2^2. \tag{8}
\end{aligned}
$$

Second, because each $\phi(x_i)$, $1 \le i \le m$, is a vector in $V$ and $\{v_1, \ldots, v_{n'}\}$ is an orthonormal basis of $V$, we have $\phi(x_i) = \sum_{k=1}^{n'} \beta_{ik} v_k$. This leads to the following equalities:
$$
w^T \phi(x_i) = \left( \sum_{k=1}^{n'} s_k v_k^T + \sum_{j=1}^{N-n'} t_j u_j^T \right) \left( \sum_{k=1}^{n'} \beta_{ik} v_k \right) = \sum_{k=1}^{n'} s_k \beta_{ik} \|v_k\|_2^2 = w_V^T \phi(x_i). \tag{9}
$$

Based on the results in Eqn. (8) and Eqn. (9), and the fact that $w_V$ is a vector in $V = \{x : x = \sum_{i=1}^m c_i \phi(x_i)\}$, we can derive that, for an arbitrary $w$, there always exists a $w_V = \sum_{i=1}^m \alpha_i \phi(x_i)$ satisfying
$$
\frac{1}{m} \sum_{i=1}^m L(w^T \phi(x_i), y_i) + \frac{\lambda}{2} \|w\|_2^2 \ \ge \ \frac{1}{m} \sum_{i=1}^m L(w_V^T \phi(x_i), y_i) + \frac{\lambda}{2} \|w_V\|_2^2. \tag{10}
$$
In other words, if a vector $w^*$ minimizes $\frac{1}{m} \sum_{i=1}^m L(w^T \phi(x_i), y_i) + \frac{\lambda}{2} \|w\|_2^2$, the corresponding $w_V^*$ must also minimize it, and there exist some $\{\alpha_i\}_{i=1}^m$ such that $w_V^* = \sum_{i=1}^m \alpha_i \phi(x_i)$.
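As a quick numerical sanity check of Theorem 1 (a sketch assuming NumPy and a squared loss as one concrete choice of $L$; the random features stand in for $\{\phi(x_i)\}_{i=1}^m$), the closed-form minimizer of Equation (6) indeed lies in the span of the training feature vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 20, 50                         # m training samples, N-dimensional features, N > m
Phi = rng.normal(size=(m, N))         # row i plays the role of phi(x_i)
y = rng.normal(size=m)
lam = 0.1

# Closed-form minimizer of (1/m) * ||Phi w - y||^2 + (lam/2) * ||w||^2:
# setting the gradient (2/m) Phi^T (Phi w - y) + lam * w to zero.
w_star = np.linalg.solve((2 / m) * Phi.T @ Phi + lam * np.eye(N),
                         (2 / m) * Phi.T @ y)

# Theorem 1 predicts w_star is in span{phi(x_i)}, i.e., w_star = Phi^T alpha for some alpha.
alpha, *_ = np.linalg.lstsq(Phi.T, w_star, rcond=None)
print(np.linalg.norm(Phi.T @ alpha - w_star))   # numerically zero
```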

Note that Theorem 1 is just a special case of the general representer theorem; see [1] for a comprehensive description of the general result. With the representer theorem, we now introduce the kernel methods.

4 Kernel Methods

According to Theorem 1, we only need to consider $w \in \{\sum_{i=1}^m \alpha_i \phi(x_i)\}$ when solving the optimization problem in Equation (6). Therefore, we have the following transformed optimization problem by replacing $w$ in Equation (6) with $\sum_{i=1}^m \alpha_i \phi(x_i)$:
$$
\min_{\alpha_1, \ldots, \alpha_m} \ \frac{1}{m} \sum_{j=1}^m L\left( \sum_{i=1}^m \alpha_i \phi(x_i)^T \phi(x_j), \ y_j \right) + \frac{\lambda}{2} \sum_{j=1}^m \sum_{i=1}^m \alpha_i \alpha_j \phi(x_i)^T \phi(x_j), \quad \lambda \ge 0. \tag{11}
$$

As a result, if we know $\phi(x_i)^T \phi(x_j)$ for all $1 \le i, j \le m$, we can compute the optimization objective in Equation (11) without explicitly knowing $\{\phi(x_i)\}_{i=1}^m$. In addition, the output of the linear model with parameter $w$ for any input $\phi(x)$ only depends on $w^T \phi(x)$, i.e., $\sum_{i=1}^m \alpha_i \phi(x_i)^T \phi(x)$. So for any unseen $x$ that is not in the training set, we can make predictions directly, without computing $\phi(x)$ first, if we know $\phi(x_i)^T \phi(x)$ for all $1 \le i \le m$. In summary, in both training and prediction, what we really need is the inner product of two feature vectors, not the feature vectors themselves.

In the following discussion, we call
$$
k(x_i, x_j) = \phi(x_i)^T \phi(x_j)
$$
the kernel function of $x_i, x_j$. Note that, although $k(x_i, x_j)$ is the inner product of $\phi(x_i)$ and $\phi(x_j)$, we compute it directly from $x_i$ and $x_j$. The advantage of using a kernel function is that, in many cases, computing $\phi(\cdot)$ may be very expensive or even infeasible, while computing $k(x_i, x_j)$ is much easier. For example, consider
$$
\phi(x) = \left[1, \sqrt{2}x_1, \ldots, \sqrt{2}x_{n-1}, x_1^2, \ldots, x_{n-1}^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_i x_j \ (i < j), \ldots, \sqrt{2}x_{n-2} x_{n-1}\right]^T,
$$
where $x = [1, x_1, \ldots, x_{n-1}]^T$. For any two $n$-dimensional feature vectors $a$ and $b$, $\phi(a)$ and $\phi(b)$ are both $O(n^2)$-dimensional vectors. So, the time complexity of computing $\phi(a)$ and $\phi(b)$ first and then their inner product is $O(n^2)$. However, we show that the result can be computed from $a^T b$ more efficiently as

$$
\begin{aligned}
\phi(a)^T \phi(b) &= 1 + \sum_{i=1}^{n-1} 2 a_i b_i + \sum_{i=1}^{n-1} a_i^2 b_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n-1} 2 a_i a_j b_i b_j \\
&= \sum_{i=0}^{n-1} a_i^2 b_i^2 + \sum_{i=0}^{n-1} \sum_{j=i+1}^{n-1} 2 a_i a_j b_i b_j \quad (a_0 = b_0 = 1) \\
&= \left( \sum_{i=0}^{n-1} a_i b_i \right)^2 = (a^T b)^2, \tag{12}
\end{aligned}
$$
where the computational complexity is decreased from $O(n^2)$ to $O(n)$. We will introduce some concrete kernel functions like the radial basis function in Section 5. Some of them even lead to an infinite dimension of $\phi(x)$. In this case, using the kernel function is necessary.
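The identity in Eqn. (12) is easy to verify numerically. Below is a small sketch (assuming NumPy; the random vectors are arbitrary test inputs) that builds the explicit $O(n^2)$-dimensional mapping and compares its inner product with the $O(n)$ kernel $(a^T b)^2$.

```python
import numpy as np

def phi(x):
    """Explicit mapping from the example above, for x = [1, x_1, ..., x_{n-1}]."""
    t = x[1:]
    cross = [np.sqrt(2) * t[i] * t[j] for i in range(len(t)) for j in range(i + 1, len(t))]
    return np.concatenate(([1.0], np.sqrt(2) * t, t ** 2, cross))

def k(a, b):
    """The same inner product computed directly in O(n) time, Eqn. (12)."""
    return (a @ b) ** 2

rng = np.random.default_rng(0)
a = np.concatenate(([1.0], rng.normal(size=4)))   # a = [1, a_1, ..., a_4]
b = np.concatenate(([1.0], rng.normal(size=4)))
print(phi(a) @ phi(b), k(a, b))                   # agree up to floating-point error
```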

With the kernel function, the optimization problem in Equation (11) can be re-written as
$$
\min_{\alpha_1, \ldots, \alpha_m} \ \frac{1}{m} \sum_{j=1}^m L\left( \sum_{i=1}^m \alpha_i k(x_i, x_j), \ y_j \right) + \frac{\lambda}{2} \sum_{j=1}^m \sum_{i=1}^m \alpha_i \alpha_j k(x_i, x_j), \quad \lambda \ge 0. \tag{13}
$$

We can initialize $\alpha = (\alpha_1, \ldots, \alpha_m)^T$, select the task-specific loss function $L$, and minimize the objective in Equation (13) on the training data via gradient descent. This method is known as the kernel method.
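In code, the objective of Equation (13) only needs the $m \times m$ Gram matrix of kernel values. The sketch below (assuming NumPy; `loss` is any task-specific $L(z, y)$) is one way to write it; it can then be minimized over $\alpha$ with any gradient-based optimizer.

```python
import numpy as np

def kernel_objective(alpha, K, y, loss, lam):
    """Eqn. (13), given the Gram matrix K with K[i, j] = k(x_i, x_j)."""
    z = K @ alpha                                  # z[j] = sum_i alpha_i k(x_i, x_j)
    data_term = np.mean([loss(zj, yj) for zj, yj in zip(z, y)])
    return data_term + 0.5 * lam * alpha @ K @ alpha
```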

Concretely, using the example of logistic regression for binary classification problems, we now have
$$
h(x) = \sigma(w^T \phi(x)) = \sigma\left( \sum_{i=1}^m \alpha_i k(x_i, x) \right) = \frac{1}{1 + \exp\left(-\sum_{i=1}^m \alpha_i k(x_i, x)\right)}.
$$
When applying the binary cross entropy loss
$$
L(z, y) = -y \log\left[ \frac{1}{1 + \exp(-z)} \right] - (1 - y) \log\left[ 1 - \frac{1}{1 + \exp(-z)} \right]
$$
(where the labels are recoded as $y \in \{0, 1\}$) in Equation (13), this model is known as kernel logistic regression, as illustrated in Figure 5. In addition, if we apply the hinge loss $L(z, y) = \max(1 - yz, 0)$ with $y \in \{-1, 1\}$, we obtain the support vector machine (SVM) [2].
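The following sketch puts the pieces together for kernel logistic regression (assuming NumPy; the kernel, learning rate, and number of steps are illustrative choices, and labels are assumed to be recoded to $\{0, 1\}$ as above). It minimizes Equation (13) with the binary cross entropy loss by plain gradient descent on $\alpha$ and then predicts with $h(x) = \sigma(\sum_i \alpha_i k(x_i, x))$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quad_kernel(a, b):
    """The kernel of Eqn. (12), used here as one concrete choice."""
    return (a @ b) ** 2

def fit_kernel_logreg(X, y01, kernel=quad_kernel, lam=0.1, lr=0.05, steps=2000):
    """Gradient descent on alpha for Eqn. (13) with the binary cross entropy loss."""
    m = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])    # Gram matrix
    alpha = np.zeros(m)
    for _ in range(steps):
        p = sigmoid(K @ alpha)                                  # p[j] = h(x_j)
        grad = K @ (p - y01) / m + lam * (K @ alpha)            # gradient w.r.t. alpha
        alpha -= lr * grad
    return alpha

def predict(alpha, X_train, x, kernel=quad_kernel):
    """h(x) = sigmoid(sum_i alpha_i k(x_i, x))."""
    return sigmoid(sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train)))
```

Swapping in the hinge loss and its (sub)gradient in the same loop would give the SVM objective of the previous paragraph instead.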

5 Kernel Methods, Radial Basis Function Networks, and Neural Networks

We now move one step further from the kernel methods by comparing them with a two-layer feed-forward neural network.


• In the kernel logistic regression model for binary classification with training data $\{x_i, y_i\}_{i=1}^m$, given input $x$, the output is
$$
\sigma\left( \sum_{i=1}^m \alpha_i k(x_i, x) \right) = \sigma(\alpha^T k),
$$
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$ is the learnable parameter vector and $k = [k(x_1, x), k(x_2, x), \ldots, k(x_m, x)]^T$ is an $m$-dimensional vector computed by a sequence of kernel functions. Basically, we compute the kernel functions first and obtain an $m$-dimensional vector $k$. We then perform the original logistic regression with $k$ as input.

• In a two-layer feed-forward neural network with a hidden size of $t$ for the same task, we need to compute a hidden $t$-dimensional vector first. The second layer can then be considered as the original logistic regression in Eqn. (2). Note that the hidden size $t$ here is fixed.

If we treat the computation of $k$ as the first “hidden layer”, the kernel logistic regression model can be thought of as a two-layer feed-forward neural network. But here the hidden size and the number of parameters are the same as the number of training samples, which leads to high computation and storage costs when there are millions of training samples. Borrowing the idea of a fixed hidden size from feed-forward neural networks, we can instead build the first hidden layer from $t$ representatives of all training samples, where $t$ is a fixed number. A simple way to obtain $t$ representatives is to use the $t$ centroids $\{c_i\}_{i=1}^t$ obtained by running the k-means clustering algorithm on all training samples $\{x_i\}_{i=1}^m$. When the hidden size is fixed to $t$, the dimension of the parameter vector $\alpha$ is also fixed to $t$.

Figure 5: An illustration of the kernel logistic regression model.

In practice, a very frequently used kernel function is the radial basis function
$$
k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right).
$$

If we build the hidden layer from the $t$ clustering centroids $\{c_i\}_{i=1}^t$ and use the radial basis function in the kernel logistic regression model, we obtain the radial basis function (RBF) network [3], as illustrated in Figure 6 and sketched below.
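A minimal sketch of this construction (assuming NumPy and scikit-learn; the data, the number of centroids $t$, and the bandwidth $\sigma$ are arbitrary illustrative choices) computes the $t$ k-means centroids and the resulting $t$-dimensional hidden representation.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf(x, c, sigma=1.0):
    """Radial basis function k(x, c) = exp(-||x - c||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

def rbf_features(X, centroids, sigma=1.0):
    """Hidden layer of the RBF network: a t-dimensional vector [k(c_1, x), ..., k(c_t, x)] per sample."""
    return np.array([[rbf(x, c, sigma) for c in centroids] for x in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # toy training inputs
t = 10
centroids = KMeans(n_clusters=t, n_init=10, random_state=0).fit(X).cluster_centers_
H = rbf_features(X, centroids)             # shape (200, t)
# A t-dimensional parameter vector alpha and a sigmoid on H @ alpha complete the
# network; it is trained exactly like the logistic regression of Eqn. (2).
```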

We now discuss the difference between the radial basis function network and the two-layer feed-forward neural network. Basically, in the radial basis function network, the first hidden layer is not trained in an end-to-end fashion. In other words, we fix the feature mapping $\phi(\cdot)$ when training the network. As a result, it only has representation power equivalent to that of a linear model with kernel methods. However, in a regular two-layer feed-forward neural network, the feature mapping $\phi(\cdot)$ is trained, and it is potentially more powerful as it is more data and task specific.


Figure 6: An illustration of the radial basis function network.

Acknowledgements

This work was supported in part by National Science Foundation grants IIS-1908220, IIS-1908198, IIS-1908166, DBI-1147134, DBI-1922969, DBI-1661289, and CHE-1738305, National Institutes of Health grant 1R21NS102828, and Defense Advanced Research Projects Agency grant N66001-17-2-4031.

References

[1] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.

[2] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[3] David S. Broomhead and David Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom), 1988.
