Nonlinear representations


Page 1:

Nonlinear representations

Page 2:

Reminders: Oct. 31, 2019

• Moved deadline for Thought Questions 3 to Nov 14

• Any questions?

2

Page 3:

Representations for learning nonlinear functions

• Generalized linear models enabled many p(y | x) distributions
• Still, however, learning a simple function for E[Y | x], i.e., f(<x, w>)

• i.e. a linear function to predict f^{-1}(E[Y | x])

• Approach we discussed earlier: augment current features x using polynomials

• There are many strategies for augmenting x
• fixed representations, like polynomials and wavelets

• learned representations, like neural networks and matrix factorization

3

Page 4:

What if classes are not linearly separable?

4

How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?

x_1^2 + x_2^2 = 1

f(x) = x_1^2 + x_2^2 - 1

x_1 = x_2 = 0  ⟹  f(x) = -1 < 0

x_1 = 2, x_2 = -1  ⟹  f(x) = 4 + 1 - 1 = 4 > 0

Page 5:

What if classes are not linearly separable? (cont…)

5

How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?

x_1^2 + x_2^2 = 1        f(x) = x_1^2 + x_2^2 - 1

φ(x) = [x_1^2, x_2^2, 1]^T        f(x) = φ(x)^T w

If we use logistic regression, what is p(y=1 | x)?

Page 6:

What if classes are not linearly separable? (cont…)

6

x_1^2 + x_2^2 = 1        f(x) = x_1^2 + x_2^2 - 1

φ(x) = [x_1^2, x_2^2, 1]^T        f(x) = φ(x)^T w

Imagine we have learned w. How do we predict on a new x?

Page 7:

Nonlinear transformation

7

x → φ(x) = [φ_1(x), …, φ_p(x)]^T

e.g., x = [x_1, x_2],   φ(x) = [x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_2^3]^T

f(x) = φ(x)^T w.   Predict class 1 if f(x) > 0.
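As a concrete illustration (not from the slides), here is a minimal numpy sketch of this kind of fixed polynomial expansion followed by a linear predictor; the function name and weight values are illustrative:

```python
import numpy as np

def poly_features(x):
    """Fixed polynomial expansion phi(x) for a 2-d input x = [x1, x2].

    Returns [x1, x2, x1^2, x1*x2, x2^2, x1^3, x2^3], matching the example
    on this slide (the transformation is fixed; nothing is learned here).
    """
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2, x1**3, x2**3])

# Linear prediction on the expanded features: f(x) = phi(x)^T w.
# For the circle example, weights on x1^2 and x2^2 plus a bias of -1
# recover f(x) = x1^2 + x2^2 - 1.
w = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
bias = -1.0

for x in ([0.0, 0.0], [2.0, -1.0]):
    f = poly_features(x) @ w + bias
    print(x, "-> class", "+" if f > 0 else "-")   # [0,0] -> -, [2,-1] -> +
```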

Page 8:

Another common transformation

• Use similarity to a set of (representative) points or prototypes

• Intuitively, similarity features can be quite powerful: if similar to a previously observed point, should have similar predictions

• But, this relies heavily on
• a meaningful similarity measure (keywords when googling this: radial basis functions, kernel functions)

• picking a set of prototypes

• Let’s go through some examples!

8

Page 9:

Gaussian kernel / Gaussian radial basis function

9

k(x, x') = exp(-‖x - x'‖_2^2 / σ^2)

x = [x_1, x_2]  →  [k(x, x_1), …, k(x, x_k)]^T

f(x) = Σ_{i=1}^k w_i k(x, x_i)

[Figure: a query point x and three centers x_1, x_2, x_3; x is represented by its similarities to the centers, x → [k(x, x_1), k(x, x_2), k(x, x_3)]', here [.1, .8, .5].]
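A minimal numpy sketch of this Gaussian RBF representation (the function and parameter names are mine, not from the slides):

```python
import numpy as np

def rbf_features(X, centers, sigma=1.0):
    """Map each row of X to its Gaussian similarities to the centers.

    k(x, c) = exp(-||x - c||_2^2 / sigma^2), as on this slide.
    X: (n, d) data, centers: (k, d) prototypes -> returns (n, k) features.
    """
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma**2)

# A prediction is then linear in the new features:
# f(x) = sum_i w_i k(x, x_i) = rbf_features(x, centers) @ w
X = np.random.randn(5, 2)
centers = X[:3]                      # e.g., use some training points as centers
Phi = rbf_features(X, centers, sigma=1.0)
w = np.array([0.5, -1.0, 2.0])       # illustrative weights
f = Phi @ w
```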

Page 10:

Gaussian kernel / Gaussian radial basis function

10

Kernel: k(x, x') = exp(-‖x - x'‖_2^2 / σ^2)

f(x) = Σ_{i=1}^k w_i k(x, x_i),   where x_i is a prototype or center

[Figure: a possible function f built from several centers]

Can learn a highly nonlinear function!

Page 11:

Other similarity transforms

• Linear kernel:

• Laplace kernel (Laplace distribution instead of Gaussian)

• Binning transformation

11

k(x, c) = x^T c

k(x, c) = exp(-b ‖x - c‖_1)

s(x, c) = 1 if x is in a box around c, 0 otherwise

Page 12:

Selecting centers

• Many different strategies to decide on centers
• many ML algorithms use kernels, e.g., SVMs, Gaussian process regression

• For kernel representations, typical strategy is to select training data as centers

• Clustering techniques to find centers

• A grid of values to best (exhaustively) cover the space

• Many other strategies, e.g., using information gain, coherence criterion, informative vector machine

12

Page 13:

Covering space uniformly with centres

• Imagine we have a 1-d space, with range [-10, 10]

• How would we pick p centers uniformly?

• What if we have a 5-d space, with each dimension in range [-1, 1]?
• To cover the entire 5-dimensional space, need to consider all possible combinations

• Split each dimension into m values; then the total number of centres is m^5 (see the sketch after this slide)

• i.e., for each of the m values of x1, we can pair it with each of the m values of x2, …, x5

13
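A small illustration of this exhaustive grid, showing the m^d blow-up (the helper name is mine):

```python
import itertools
import numpy as np

def grid_centers(m, d, low=-1.0, high=1.0):
    """All m^d centres obtained by splitting each of d dimensions into m values."""
    values = np.linspace(low, high, m)
    return np.array(list(itertools.product(values, repeat=d)))

centers = grid_centers(m=5, d=5)   # 5 values per dimension, 5 dimensions
print(centers.shape)               # (3125, 5): m^5 = 5^5 centres
```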

Page 14:

Why select training data as centers?

• Observed data indicates the part of the space that we actually need to model
• can be much more compact than exhaustively covering the space

• imagine we only see narrow trajectories in the world: even if the data is d-dimensional, the data you encounter may lie on a much lower-dimensional manifold (space)

• Any issues with using all training data for centres?
• How can we subselect centres from training data?

14

Page 15:

How would we use clustering to select centres?

• Clustering takes the data and finds p groups (see the sketch after this slide)

• What distance measure should we use?
• If k(x,c) is between 0 and 1 and k(x,x) = 1 for all x, then 1 - k(x,c) gives a distance

15
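A minimal k-means sketch for picking p centres from the training data (plain Euclidean distance here; as the slide notes, 1 - k(x,c) could be used as the distance instead; names are illustrative):

```python
import numpy as np

def kmeans_centers(X, p, iters=100, seed=0):
    """Pick p centres by plain k-means on the training data X (shape (n, d))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=p, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centre
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # move each centre to the mean of its assigned points
        for j in range(p):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

centers = kmeans_centers(np.random.randn(200, 2), p=10)
```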

Page 16:

What if we select centres randomly?

• Are there any issues with the linear regression with a kernel transformation, if we select centres randomly from the data?

• If so, any suggestions to remedy the problem?

16

Σ_{i=1}^n ( Σ_{j=1}^p k(x_i, z_j) w_j - y_i )^2 + λ‖w‖_1

Page 17:

Exercise

• What would it mean to use an l1 regularizer with a kernel representation?
• Recall that l1 prefers to zero elements in w

17

Σ_{i=1}^n ( Σ_{j=1}^p k(x_i, z_j) w_j - y_i )^2 + λ‖w‖_1
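To make the exercise concrete, here is a small numpy sketch that builds Gaussian-kernel features on randomly chosen training points as centres and minimizes the objective above with proximal gradient (ISTA). All names and hyperparameters are illustrative; the soft-thresholding step is what lets the l1 regularizer zero out weights, i.e., drop redundant centres:

```python
import numpy as np

def gaussian_k(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma**2)

def lasso_ista(Phi, y, lam=0.1, iters=500):
    """Minimize sum_i (Phi_i w - y_i)^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    w = np.zeros(Phi.shape[1])
    step = 1.0 / (2 * np.linalg.norm(Phi, 2) ** 2)   # 1/L for the squared loss
    for _ in range(iters):
        grad = 2 * Phi.T @ (Phi @ w - y)
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
Z = X[rng.choice(100, size=20, replace=False)]   # random training points as centres
w = lasso_ista(gaussian_k(X, Z), y, lam=0.5)
print((np.abs(w) > 1e-8).sum(), "of", len(w), "centres kept")
```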

Page 18:

Exercise: How do we decide on the nonlinear transformation?

• We can pick a 5-th order polynomial or 6-th order, or… Which should we pick?

• We can pick p centres. How many should we pick?

• How can we avoid overparameterizing or underparameterizing?

18

Page 19:

Dealing with non-numeric data

• What if we have categorical features?
• e.g., the feature is the blood type of a person

• Even worse, what if we have strings describing the object?
• e.g., the feature is occupation, like "retail"

19

Page 20:

Some options

• Convert the categorical feature into integers
• e.g., {A, B, AB, O} —> {1, 2, 3, 4}

• Any issues?

• Convert the categorical feature into an indicator vector
• e.g., A —> [1 0 0 0], B —> [0 1 0 0], …

• Any issues?

20
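A small sketch of the indicator-vector (one-hot) option for the blood-type feature; the helper name is mine:

```python
import numpy as np

# Indicator (one-hot) encoding of a categorical feature.
categories = ["A", "B", "AB", "O"]          # blood types
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    v = np.zeros(len(categories))
    v[index[value]] = 1.0
    return v

print(one_hot("A"))    # [1. 0. 0. 0.]
print(one_hot("AB"))   # [0. 0. 1. 0.]
```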

Page 21:

Using kernels for categorical or non-numeric data

• An alternative is to use kernels (or similarity transforms)

• If you know something about your data/domain, might have a good similarity measure for non-numeric data

• Some more generic kernel options as well
• Matching kernel: similarity between two items is the proportion of features that are equal

21

Page 22:

Example: matching kernel

Census dataset: Predict hours worked per week

x = (age, gender, income, education), where
  age ∈ {15-24, 25-34, …, 65+}
  gender ∈ {F, M}
  income ∈ {Low, Medium, High}
  education ∈ {Bachelors, Trade-Sch, High-Sch, …}

22

Page 23:

Example: Matching similarity for categorical data

23

x = (age, gender, income, education)

k(x_1, x_2) = k( (24-34, F, Medium, Trade-Sch), (35-44, F, Medium, Bachelors) ) = 0.5
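A direct implementation of this matching similarity (proportion of equal features), in plain Python with illustrative names:

```python
def matching_kernel(x1, x2):
    """Similarity = proportion of features that are equal."""
    assert len(x1) == len(x2)
    return sum(a == b for a, b in zip(x1, x2)) / len(x1)

x1 = ("24-34", "F", "Medium", "Trade-Sch")
x2 = ("35-44", "F", "Medium", "Bachelors")
print(matching_kernel(x1, x2))   # 0.5: gender and income match, age and education do not
```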

Page 24:

Representational properties of transformations

• Approximation properties: which transformations can approximate “any function”?

• Radial basis functions (a huge number of them)

• Polynomials and the Taylor series

• Fourier basis and wavelets

24

Page 25:

Distinction with the kernel trick

• When is the similarity actually called a "kernel"?

• Nice property of kernels: can always be written as a dot product in some feature space

• In some cases, they are used to compute inner products efficiently, assuming one is actually learning with the feature expansion
• This is called the kernel trick

• Implicitly learning with the feature expansion ψ(x)
• not learning with the expansion that is similarities to centres

25

k(x, c) = ψ(x)^T ψ(c),   for some feature expansion ψ(x)

Page 26:

Example: polynomial kernel

26

φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]^T

k(x, x') = ⟨φ(x), φ(x')⟩ = ⟨x, x'⟩^2

In general, for order-d polynomials, k(x, x') = ⟨x, x'⟩^d
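A quick numeric check of this identity (values chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-d input)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = phi(x) @ phi(xp)          # <phi(x), phi(x')>
rhs = (x @ xp) ** 2             # <x, x'>^2, computed without forming phi
print(lhs, rhs)                 # both 1.0
```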

Page 27:

Gaussian kernel

27

φ(x) = exp(-γx^2) [ 1,  √(2γ/1!) x,  √((2γ)^2/2!) x^2,  … ]^T

Infinite polynomial representation

k(x, x') = exp(-‖x - x'‖_2^2 / σ^2)

Page 28:

Regression with new features

28

min_w Σ_{i=1}^n (φ(x_i)^T w - y_i)^2 = min_w Σ_{i=1}^n ( Σ_{j=1}^p φ_j(x_i) w_j - y_i )^2

min_w Σ_{i=1}^n (φ(x_i)^T w - y_i)^2 = min_a Σ_{i=1}^n ( Σ_{j=1}^n ⟨φ(x_i), φ(x_j)⟩ a_j - y_i )^2

What if p is really big?

If we can compute the dot product efficiently, then we can solve this regression problem efficiently
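A minimal numpy sketch of working in the dual with the n×n Gram matrix instead of the (possibly huge) feature expansion. The small ridge term lam is my addition so the linear solve is well posed, not something from the slide:

```python
import numpy as np

# Dual form of regression with the kernel trick: work with the Gram matrix
# K_ij = <phi(x_i), phi(x_j)> = k(x_i, x_j) instead of forming phi explicitly.

def poly_kernel(X, Z, degree=2):
    return (X @ Z.T) ** degree          # k(x, x') = <x, x'>^d

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X[:, 0] * X[:, 1] + 0.01 * rng.standard_normal(50)

lam = 1e-3
K = poly_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual weights a_j

X_new = rng.standard_normal((5, 3))
f_new = poly_kernel(X_new, X) @ a                  # f(x) = sum_j a_j k(x, x_j)
```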

Page 29:

Summary from last time

• Talked about using a transformation (representation) that corresponds to similarities to prototypes or centers

• Advantages:

• Intuitive representation
• e.g., if a feature is high, we might be able to explain that the reason for a prediction is that the new point looked like that prototype point

• Allows any type of input (categorical features, strings, etc.)
• e.g., word embeddings can provide similarity metrics —> we'll talk about embeddings later

• Nice theoretical properties, including representational capacity

29

Page 30:

What about learning the representation?

• We have talked about fixed nonlinear transformations
• polynomials

• kernels

• How do we introduce learning?
• could learn centers, for example

• learning is quite constrained, since we can only pick centres and widths

• Neural networks learn this transformation more from scratch
• still some built-in structure

30

Page 31:

Fixed representation vs. NN

31

[Figure: GLM with a fixed representation (e.g., kernels) vs. a two-layer neural network. Left (Figure 7.2: generalized linear model, such as logistic regression): inputs x1, …, x4 pass through a fixed transform φ(x) with no learning there, and only the output weights are learned. Right (Figure 7.3: standard neural network): there is a hidden layer, and both the weights into the hidden layer and the weights from the hidden layer to the output are learned.]

Page 32:

Explicit steps in one hidden node for an NN

32

θ_j = x^T w    (the input to the jth hidden node)

h_j = σ(θ_j),   where σ is a transfer such as the sigmoid, tanh, or ReLU
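A tiny numpy sketch of these two steps for one hidden node (the values are arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -1.0, 2.0, 0.0])    # one input with d = 4 features
w_j = np.array([0.1, 0.4, -0.3, 0.2])  # weights into hidden node j

theta_j = x @ w_j          # theta_j = x^T w
h_j = sigmoid(theta_j)     # h_j = sigma(theta_j); could instead use tanh or ReLU
```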

Page 33:

Example: logistic regression and using a neural network

• The goal is still to predict p(y = 1 | x)
• But now we want this to be a more general nonlinear function of x

• Logistic regression learns W such that σ(xW) = p(y = 1 | x)

• Neural network learns W^(1) and W^(2) such that σ(σ(xW^(2))W^(1)) = p(y = 1 | x)

33

From the course notes: … and y ∈ {0, 1}. If y ∈ R, we use linear regression for this last layer and so learn weights w^(2) ∈ R^2 such that hw^(2) approximates the true output y. If y ∈ {0, 1}, we use logistic regression for this last layer and so learn weights w^(2) ∈ R^2 such that σ(hw^(2)) approximates the true output y. □

Now we consider the more general case with any d, k1, m. To provide some intuition for this more general setting, we will begin with one hidden layer, for the sigmoid transfer function and cross-entropy output loss. For logistic regression we estimated W ∈ R^{d×m}, with f(xW) = σ(xW) ≈ y. We will predict an output vector y ∈ R^m, because it will make later generalizations more clear-cut and make notation for the weights in each layer more uniform. When we add a hidden layer, we have two parameter matrices W^(2) ∈ R^{d×k1} and W^(1) ∈ R^{k1×m}, where k1 is the dimension of the hidden layer

h = σ(xW^(2)) = [ σ(xW^(2)_{:1}), σ(xW^(2)_{:2}), …, σ(xW^(2)_{:k1}) ]^T ∈ R^{k1}

where the sigmoid function is applied to each entry in xW^(2) and hW^(1). This hidden layer is the new set of features and again you will do the regular logistic regression optimization to learn weights on h:

p(y = 1 | x) = σ(hW^(1)) = σ(σ(xW^(2))W^(1)).

With the probabilistic model and parameters specified, we now need to derive an algorithm to obtain those parameters. As before, we take a maximum likelihood approach and derive gradient descent updates. This composition of transfers seems to complicate matters, but we can still take the gradient w.r.t. our parameters. We simply have more parameters now: W^(2) ∈ R^{k1×d}, W^(1) ∈ R^{1×k1}. Once we have the gradient w.r.t. each parameter matrix, we simply take a step in the direction of the negative of the gradient, as usual. The gradients for these parameters share information; for computational efficiency, the gradient is computed first for W^(1), and duplicate gradient information sent back to compute the gradient for W^(2). This algorithm is typically called back propagation, which we describe next.

In general, we can compute the gradient for any number of hidden layers. Denote each differentiable transfer function f1, …, fH, ordered with f1 as the output transfer, and k1, …, k_{H-1} as the hidden dimensions with H - 1 hidden layers. Then the output from the neural network is

f1( f2( … f_{H-1}( fH( xW^(H) ) W^(H-1) ) … ) W^(1) )

where W^(1) ∈ R^{k1×m}, W^(2) ∈ R^{k2×k1}, …, W^(H) ∈ R^{d×k_{H-1}}.

Backpropagation algorithm

We will start by deriving back propagation for two layers; the extension to multiple layers will be more clear given this derivation. Due to the size of the network, we will often learn …


= p(y = 1|x)

x1

x2

x3

x4

y

y

Inputlayer

Outputlayer

Figure 7.2: Generalized linear model, such as logistic regression.

x1

x2

x3

x4

y

y

Hiddenlayer

Inputlayer

Outputlayer

Figure 7.3: Standard neural network.

106

(Note: we will now start talking about x as a row vector)
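A small numpy sketch contrasting the two parameterizations; the dimensions and random weights are purely illustrative:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4))       # x as a row vector, d = 4

# Logistic regression: p(y = 1 | x) = sigma(x W)
W = rng.standard_normal((4, 1))
p_lr = sigmoid(x @ W)

# Two-layer neural network: p(y = 1 | x) = sigma(sigma(x W2) W1)
W2 = rng.standard_normal((4, 3))      # input -> hidden (k1 = 3)
W1 = rng.standard_normal((3, 1))      # hidden -> output
p_nn = sigmoid(sigmoid(x @ W2) @ W1)
```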

Page 34:

Nonlinear decision surface

34

* from http://cs231n.github.io/neural-networks-1/; see that page for a nice discussion on neural nets

Page 35:

What are the representational capabilities of neural nets?

• Single hidden-layer neural networks with a sigmoid transfer can represent any continuous function on a bounded space within epsilon accuracy, for a large enough number of hidden nodes
• see Cybenko, 1989: "Approximation by Superpositions of a Sigmoidal Function"

35

[Figure: standard neural network with input layer x1, …, x4, hidden layer h^(1), and output ŷ. The weights W^(2) map the input to the hidden layer and W^(1) map the hidden layer to the output: h^(1) = f2(xW^(2)), ŷ = f1(h^(1)W^(1)), e.g., f1 = σ, f2 = σ.]

Page 36:

Exercise: naive Bayes and nonlinearity

• Is naive Bayes, with Gaussian p(x_j | y), a nonlinear classifier?
• i.e., does it learn nonlinear decision surfaces, like the double circle at the beginning of these slides?

• How do we increase the modeling power of naive Bayes, i.e., how do we increase the size of the set of functions p(x | y) representable by our approximate model class?

• Can we use nonlinear transformations like kernels to do so? If so, how?

36

Page 37:

How do we learn the parameters to the neural network?

• In linear regression and logistic regression, learned parameters by specifying an objective and minimizing using gradient descent

• We do the exact same thing with neural networks; the only difference is that our function class is more complex

• Need to derive a gradient descent update for W1 and W2
• reasonably straightforward, the indexing is just a bit of a pain

37

Page 38:

Maximum likelihood problem

• The goal is still to find parameters (i.e., all the weights in the network) that maximize the likelihood of the data

• What is p(y | x), for our NN?

38

E[Y | x] = NN(x) = f1(f2(xW^(2))W^(1))

e.g., the mean of a Gaussian, with the variance σ^2 still a fixed value
e.g., the Bernoulli parameter p(y = 1 | x) = E[Y | x]

p = NN(x) = f1(f2(xW^(2))W^(1))

Gaussian:  Σ_{i=1}^n (p_i - y_i)^2
Bernoulli:  Σ_{i=1}^n Cross-Entropy(p_i, y_i)

Page 39:

Example for p(y|x) Bernoulli

39

Example 15: Let p(y = 1 | x) be a Bernoulli distribution, with f1 and f2 both sigmoid functions. The loss is the cross-entropy. We can derive the two-layer update rule with these settings, by plugging in above.

L(ŷ, y) = -y log(ŷ) - (1 - y) log(1 - ŷ)      (cross-entropy)

∂L(ŷ, y)/∂ŷ = -y/ŷ + (1 - y)/(1 - ŷ)

f2(xW^(2)_{:j}) = σ(xW^(2)_{:j}) = 1 / (1 + exp(-xW^(2)_{:j}))

f1(hW^(1)_{:k}) = σ(hW^(1)_{:k}) = 1 / (1 + exp(-hW^(1)_{:k}))

∂σ(θ) = σ(θ)(1 - σ(θ))

Now we can compute the backpropagation update by first propagating forward

h = σ(xW^(2)),    ŷ = σ(hW^(1))

and then propagating the gradient back

δ^(1)_k = ∂L(ŷ_k, y_k)/∂ŷ_k · ∂f1(θ^(1)_k)/∂θ^(1)_k = ( -y_k/ŷ_k + (1 - y_k)/(1 - ŷ_k) ) ŷ_k(1 - ŷ_k) = -y_k(1 - ŷ_k) + (1 - y_k)ŷ_k

∂L/∂W^(1)_{jk} = δ^(1)_k h_j

δ^(2)_j = ( W^(1)_{j:} δ^(1) ) h_j(1 - h_j)

∂L/∂W^(2)_{ij} = δ^(2)_j x_i

The update simply consists of stepping in the direction of these gradients, as is usual for gradient descent. We start with some initial W^(1) and W^(2) (say filled with random values), and then apply the gradient descent rules with these gradients. □

7.2.2 Unsupervised learning and matrix factorization

Another strategy for obtaining a new representation is through matrix factorization. The data matrix X is factorized into a dictionary D and a basis or new representation H (see Figure 7.4). In fact, many unsupervised learning algorithms (e.g., dimensionality reduction, sparse coding) and semi-supervised learning algorithms (e.g., supervised dictionary learning) can actually be formulated as matrix factorizations. We will look at k-means clustering and principal components analysis as an example. The remaining algorithms are simply summarized in the table below. This general approach to obtaining a new representation using factorization is called dictionary learning.

[Figure 7.3: standard neural network with inputs x1, …, x4, a hidden layer, and output ŷ; W^(2) maps the input to the hidden layer, W^(1) maps the hidden layer to the output, and a cross-entropy loss is applied at the output.]

Page 40:

Forward propagation

• First we have to compute all the required components to produce the prediction ŷ, so that we can measure the error

• Forward propagation simply means starting from the inputs, computing the hidden layers, and then finally outputting a prediction
• i.e., it simply means evaluating the function f(x) that is the NN

40

[Figure 7.3: standard neural network; forward propagation computes the hidden layer from the inputs using W^(2), and then the output from the hidden layer using W^(1).]

Page 41:

Backward propagation

• Once we have the output prediction ŷ (and all intermediate layers), we can now compute the gradient

• The gradient computed for the weights on the output layer contains some shared components with the weights for the hidden layer

• This shared component is computed for output weights W1

• Instead of recomputing it for W2, that work is passed to the computation of the gradient of W2 (propagated backwards)

41

Page 42:

Example for Bernoulli (cont)

42


W^(1)_{jk} ← W^(1)_{jk} - α ∂L/∂W^(1)_{jk}

W^(2)_{ij} ← W^(2)_{ij} - α ∂L/∂W^(2)_{ij}

δ^(1)_k = ŷ_k - y_k

[Figure: backpropagation in the two-layer network. The output errors δ^(1)_1, δ^(1)_2 are sent back to hidden node j through the weights W^(1)_{j1}, W^(1)_{j2} to form δ^(2)_j, which is then used for the gradient of the input weights such as W^(2)_{3j}.]
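To make the example concrete, here is a minimal numpy sketch of forward and backward propagation for this two-layer sigmoid network with cross-entropy loss, following the δ equations above (one input row, one output; all sizes and values are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
d, k1, m = 4, 3, 1                      # input dim, hidden dim, output dim
W2 = rng.standard_normal((d, k1))       # W^(2): input -> hidden
W1 = rng.standard_normal((k1, m))       # W^(1): hidden -> output
x = rng.standard_normal((1, d))         # one input (row vector)
y = np.array([[1.0]])                   # its binary label

alpha = 0.1
for _ in range(100):
    # forward propagation
    h = sigmoid(x @ W2)                 # h = sigma(x W^(2))
    yhat = sigmoid(h @ W1)              # yhat = sigma(h W^(1))
    # backward propagation of the cross-entropy gradient
    delta1 = yhat - y                            # delta^(1) = yhat - y
    grad_W1 = h.T @ delta1                       # dL/dW^(1)_{jk} = delta^(1)_k h_j
    delta2 = (delta1 @ W1.T) * h * (1 - h)       # delta^(2)_j = (W^(1)_{j:} delta^(1)) h_j (1 - h_j)
    grad_W2 = x.T @ delta2                       # dL/dW^(2)_{ij} = delta^(2)_j x_i
    # gradient descent updates
    W1 -= alpha * grad_W1
    W2 -= alpha * grad_W2
```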