
Nonlinear representations

Comments

• Assignment 2 deadline extended to the end of Friday

• I do apologize for the late additions to the assignment. Any issues?

2

Representations for learning nonlinear functions

• Generalized linear models enabled many p(y | x) distributions

• Still, however, we are learning a simple function for E[Y | x], i.e., f(⟨x, w⟩)

• Approach we discussed earlier: augment current features x using polynomials

• There are many strategies for augmenting x

• fixed representations, like polynomials, wavelets

• learned representations, like neural networks and matrix factorization

3

What if classes are not linearly separable?

4

How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?

x_1^2 + x_2^2 = 1

f(x) = x_1^2 + x_2^2 - 1

x_1 = x_2 = 0  ⟹  f(x) = -1 < 0

x_1 = 2, x_2 = -1  ⟹  f(x) = 4 + 1 - 1 = 4 > 0

What if classes are not linearly separable?

5

How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?

x_1^2 + x_2^2 = 1,   f(x) = x_1^2 + x_2^2 - 1

φ(x) = [x_1^2, x_2^2, 1]^T,   f(x) = φ(x)^T w

If we use logistic regression, what is p(y = 1 | x)?

What if classes are not linearly separable?

6

x_1^2 + x_2^2 = 1,   f(x) = x_1^2 + x_2^2 - 1

φ(x) = [x_1^2, x_2^2, 1]^T,   f(x) = φ(x)^T w

Imagine we have learned w. How do we predict on a new x?
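To make the prediction step concrete, here is a minimal Python sketch (the weight values are made up for illustration; the slides do not specify them):

```python
import numpy as np

def phi(x):
    # quadratic expansion from this slide: phi(x) = [x1^2, x2^2, 1]
    return np.array([x[0]**2, x[1]**2, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 1.0, -1.0])    # hypothetical learned weights, giving f(x) = x1^2 + x2^2 - 1

x_new = np.array([2.0, -1.0])
f = phi(x_new) @ w                # f(x) = phi(x)^T w
label = '+' if f > 0 else '-'     # sign of f(x) gives the predicted class
p = sigmoid(f)                    # with logistic regression, p(y=1|x) = sigma(phi(x)^T w)
print(f, label, p)                # 4.0, '+', ~0.98
```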

Nonlinear transformation

7

x → φ(x) = [φ_1(x), …, φ_p(x)]^T

e.g., x = [x_1, x_2],   φ(x) = [x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_2^3]^T
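A minimal sketch of this fixed polynomial expansion for a 2-d input (the helper name `poly_features` is just for illustration):

```python
import numpy as np

def poly_features(x):
    """Expand x = [x1, x2] into the features listed above."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2, x1**3, x2**3])

print(poly_features(np.array([2.0, -1.0])))   # [ 2. -1.  4. -2.  1.  8. -1.]
```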

Gaussian kernel / Gaussian radial basis function

8

k(x, x') = exp(-‖x - x'‖_2^2 / σ^2)

x = [x_1, x_2]  →  [k(x, x_1), …, k(x, x_k)]^T

[Figure (hand-drawn): a query point x and three centres x_1, x_2, x_3; x maps to the similarity vector [k(x, x_1), k(x, x_2), k(x, x_3)], with example similarities 0.1, 0.8, 0.5.]

f(x) = Σ_{i=1}^k w_i k(x, x_i)

Gaussian kernel / Gaussian radial basis function

9

[Figure: a possible function f built from several centres.]

Kernel: k(x, x') = exp(-‖x - x'‖_2^2 / σ^2)

f(x) = Σ_{i=1}^k w_i k(x, x_i)
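A minimal sketch of this representation in Python, assuming the centres are a random subset of the training inputs and picking an arbitrary bandwidth; the weights are then fit by ordinary least squares:

```python
import numpy as np

def rbf_features(X, centres, sigma=0.5):
    # feature matrix with entries k(x_i, c_j) = exp(-||x_i - c_j||^2 / sigma^2)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

centres = X[rng.choice(len(X), size=20, replace=False)]   # 20 centres taken from the data
K = rbf_features(X, centres)
w, *_ = np.linalg.lstsq(K, y, rcond=None)                 # fit f(x) = sum_i w_i k(x, c_i)

y_hat = K @ w
print(np.mean((y_hat - y) ** 2))                          # training error of the RBF fit
```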

Selecting centers

• Many different strategies to decide on centers

• many ML algorithms use kernels, e.g., SVMs, Gaussian process regression

• For kernel representations, typical strategy is to select training data as centers

• Clustering techniques to find centers

• A grid of values to best (exhaustively) cover the space

• Many other strategies, e.g., using information gain, coherence criterion, informative vector machine

10

Covering space uniformly with centres

• Imagine we have a 1-d space, with range [-10, 10]

• How would we pick p centers uniformly?

• What if we have a 5-d space, with each dimension in range [-1, 1]?

• To cover the entire 5-dimensional space, we need to consider all possible combinations

• Split each dimension into m values; the total number of centres is then m^5

• i.e., for the first value of x_1, we can try all m values for each of x_2, …, x_5, and so on (a small sketch follows this list)
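A minimal sketch of the exhaustive grid, with m = 4 values per dimension in d = 5 dimensions (so 4^5 = 1024 centres):

```python
import itertools
import numpy as np

m, d = 4, 5
values = np.linspace(-1, 1, m)                          # m values per dimension on [-1, 1]
centres = np.array(list(itertools.product(values, repeat=d)))
print(centres.shape)                                    # (1024, 5), i.e., m**d centres
```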

11

Why select training data as centers?

• Observed data indicates the part of the space that we actually need to model

• can be much more compact than exhaustively covering the space

• imagine we only see narrow trajectories in the world: even if the data is d-dimensional, the data we encounter may lie on a much lower-dimensional manifold (space)

• Any issues with using all training data for centres?

• How can we subselect centres from training data?

12

How would we use clustering to select centres?

• Clustering is taking data and finding p groups

• What distance measure should we use?

• If k(x, c) is between 0 and 1 and k(x, x) = 1 for all x, then 1 - k(x, c) gives a distance (a sketch using k-means follows)
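A minimal sketch, assuming scikit-learn is available: run k-means to find p groups and use the cluster means as centres. (Plain k-means uses Euclidean distance; clustering with 1 - k(x, c) as the distance would need a custom routine.)

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))            # training inputs

p = 10                                       # number of centres we want
km = KMeans(n_clusters=p, n_init=10, random_state=0).fit(X)
centres = km.cluster_centers_                # the p cluster means become kernel centres
print(centres.shape)                         # (10, 2)
```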

13

What if we select centres randomly?

• Are there any issues with the linear regression with a kernel transformation, if we select centres randomly from the data?

• If so, any suggestions to remedy the problem?

14

Σ_{i=1}^n ( Σ_{j=1}^p k(x_i, z_j) w_j - y_i )^2 + λ‖w‖_1

Exercise

• What would it mean to use an l1 regularizer with a kernel representation? (See the sketch after the objective below.)

• Recall that l1 prefers to zero out elements of w

15

Σ_{i=1}^n ( Σ_{j=1}^p k(x_i, z_j) w_j - y_i )^2 + λ‖w‖_1
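A minimal sketch of this objective, assuming scikit-learn's Lasso as the l1-regularized least-squares solver (its scaling of the squared-error term differs slightly from the formula above): many entries of w are driven to zero, so the l1 penalty effectively subselects centres.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_kernel(X, Z, sigma=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

Z = X[rng.choice(len(X), size=50, replace=False)]   # candidate centres z_1, ..., z_p
K = gaussian_kernel(X, Z)                           # K[i, j] = k(x_i, z_j)

model = Lasso(alpha=0.01).fit(K, y)                 # squared error + alpha * ||w||_1
print(np.sum(model.coef_ != 0), "of", len(Z), "centres kept")
```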

Exercise: How do we decide on the nonlinear transformation?

• We can pick a 5-th order polynomial or 6-th order, or… Which should we pick?

• We can pick p centres. How many should we pick?

• How can we avoid overparameterizing or underparameterizing? (A sketch using a validation set follows.)
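One standard answer is to treat the order (or the number of centres p) as a hyperparameter and choose it by validation error; a minimal sketch with a held-out split, reusing an RBF feature function as above (all names and values are illustrative):

```python
import numpy as np

def rbf_features(X, centres, sigma=0.5):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

for p in [2, 5, 10, 20, 50]:
    centres = Xtr[rng.choice(len(Xtr), size=p, replace=False)]
    w, *_ = np.linalg.lstsq(rbf_features(Xtr, centres), ytr, rcond=None)
    err = np.mean((rbf_features(Xva, centres) @ w - yva) ** 2)
    print(p, err)    # choose the p with the lowest validation error
```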

16

Other similarity transforms

• Linear kernel:

• Laplace kernel (Laplace distribution instead of Gaussian)

• Binning transformation

17

k(x, c) = x^T c

k(x, c) = exp(-b ‖x - c‖_1)

s(x, c) = 1 if x in box around c, 0 else
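A minimal sketch of these three similarity transforms (the box for the binning transform is taken to be axis-aligned with half-width `h`, which is an assumption here):

```python
import numpy as np

def linear_kernel(x, c):
    return x @ c                                        # k(x, c) = x^T c

def laplace_kernel(x, c, b=1.0):
    return np.exp(-b * np.abs(x - c).sum())             # k(x, c) = exp(-b ||x - c||_1)

def binning_similarity(x, c, h=0.5):
    return 1.0 if np.all(np.abs(x - c) <= h) else 0.0   # 1 if x lies in the box around c

x = np.array([0.2, -0.3])
c = np.array([0.0, 0.0])
print(linear_kernel(x, c), laplace_kernel(x, c), binning_similarity(x, c))
```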

Dealing with non-numeric data

• What if we have categorical features?

• e.g., feature is the blood type of a person

• Even worse, what if we have strings describing the object?

• e.g., feature is occupation, like “retail”

18

Some options

• Convert categorical feature into integers

• e.g., {A, B, AB, O} —> {1, 2, 3, 4}

• Any issues?

• Convert categorical feature into an indicator vector (see the sketch after this list)

• e.g., A —> [1 0 0 0], B —> [0 1 0 0], …

• Any issues?
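A minimal sketch of both options for the blood-type example; the issue with the integer coding is that it imposes an ordering and distances that are not really there (e.g., it makes A closer to B than to O), while the indicator coding avoids this at the cost of more dimensions:

```python
import numpy as np

categories = ['A', 'B', 'AB', 'O']

# Option 1: map categories to integers (implies a spurious ordering/distance)
as_int = {c: i + 1 for i, c in enumerate(categories)}        # {'A': 1, 'B': 2, 'AB': 3, 'O': 4}

# Option 2: map categories to indicator (one-hot) vectors
as_onehot = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}

print(as_int['AB'])      # 3
print(as_onehot['AB'])   # [0. 0. 1. 0.]
```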

19

Using kernels for categorical or non-numeric data

• An alternative is to use kernels (or similarity transforms)

• If you know something about your data/domain, might have a good similarity measure for non-numeric data

• Some more generic kernel options as well

• Matching kernel: similarity between two items is the proportion of features that are equal

20

Example: matching kernel

Census dataset: Predict hours worked per week

x = [age, gender, income, education], where

• age ∈ {15-24, 25-34, …, 65+}

• gender ∈ {F, M}

• income ∈ {Low, Medium, High}

• education ∈ {Bachelors, Trade-Sch, High-Sch, …}

21

Example: Matching similarity for categorical data

22

k(x_1, x_2), with features [age, gender, income, education]:

x_1 = [25-34, F, Medium, Trade-Sch]

x_2 = [35-44, F, Medium, Bachelors]

Gender and income match, age and education do not, so k(x_1, x_2) = 2/4 = 0.5
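A minimal sketch of the matching kernel that reproduces the example above (two of the four features agree):

```python
import numpy as np

def matching_kernel(x1, x2):
    """Similarity = proportion of features on which the two items agree."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    return np.mean(x1 == x2)

x1 = ['25-34', 'F', 'Medium', 'Trade-Sch']
x2 = ['35-44', 'F', 'Medium', 'Bachelors']
print(matching_kernel(x1, x2))   # 0.5: gender and income match, age and education do not
```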

Representational properties of transformations

• Approximation properties: which transformations can approximate “any function”?

• Radial basis functions (a huge number of them)

• Polynomials and the Taylor series

• Fourier basis and wavelets

23

Distinction with the kernel trick

• When is the similarity actually called a “kernel”?

• Nice property of kernels: can always be written as a dot product in some feature space

• In some cases, they are used to compute inner products efficiently, assuming one is actually learning with the feature expansion

• This is called the kernel trick

• Implicitly learning with the feature expansion ψ(x)

• not learning with the expansion given by similarities to centres

24

k(x, c) = ψ(x)^T ψ(c) for some feature map ψ(x)

Example: polynomial kernel

25

φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]^T

k(x, x') = ⟨φ(x), φ(x')⟩ = ⟨x, x'⟩^2

In general, for order-d polynomials, k(x, x') = ⟨x, x'⟩^d
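A quick numerical check of this identity for d = 2 (a sketch, not from the slides):

```python
import numpy as np

def phi(x):
    # explicit feature map for the order-2 polynomial kernel in 2-d
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = phi(x) @ phi(xp)    # <phi(x), phi(x')>
rhs = (x @ xp) ** 2       # <x, x'>^2
print(lhs, rhs)           # both equal 1.0
```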

Gaussian kernel

26

φ(x) = exp(-γ x^2) [1, √(2γ/1!) x, √((2γ)^2/2!) x^2, …]^T

Infinite polynomial representation

k(x, x') = exp(-‖x - x'‖_2^2 / σ^2)

Regression with new features

27

min_w Σ_{i=1}^n (φ(x_i)^T w - y_i)^2 = min_w Σ_{i=1}^n ( Σ_{j=1}^p φ_j(x_i) w_j - y_i )^2

min_w Σ_{i=1}^n (φ(x_i)^T w - y_i)^2 = min_a Σ_{i=1}^n ( Σ_{j=1}^n ⟨φ(x_i), φ(x_j)⟩ a_j - y_i )^2

What if p is really big?

If we can compute the dot product efficiently, then we can solve this regression problem efficiently
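A minimal sketch of the dual form with the Gaussian kernel: we never build φ(x) explicitly (it is infinite-dimensional here), only the n × n matrix of dot products. A small ridge term is added for numerical stability, which is an addition to the objective on the slide.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)

G = gaussian_kernel(X, X)                            # G[i, j] = <phi(x_i), phi(x_j)> = k(x_i, x_j)
a = np.linalg.solve(G + 1e-6 * np.eye(len(X)), y)    # coefficients a_j, one per training point

x_new = np.array([[0.5]])
f_new = gaussian_kernel(x_new, X) @ a                # f(x) = sum_j a_j k(x, x_j)
print(f_new, np.sin(0.5))
```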

What about learning the representation?

• We have talked about fixed nonlinear transformations

• polynomials

• kernels

• How do we introduce learning?

• could learn centers, for example

• learning is quite constrained, since we can only pick centres and widths

• Neural networks learn this transformation more from scratch

28

Fixed representation vs. NN

29

GLM with augmented fixed representation vs. two-layer neural network

[Figure 7.2: Generalized linear model, such as logistic regression: input layer x_1, …, x_4 connects directly to the output layer y.]

[Figure 7.3: Standard neural network: input layer x_1, …, x_4 connects to a hidden layer, which connects to the output layer y.]

For the GLM, φ(x) is a fixed transform: there is no learning in the first stage, and we learn weights only on top of φ(x). For the neural network, the weights producing the hidden layer φ(x) and the weights on top of it are both learned.

Explicit steps in visualizations

30

[Figure: zoom-in on the j-th hidden node, which applies a transfer function, e.g., sigmoid, tanh, or ReLU, to the d inputs weighted by θ_j.]

Example: logistic regression versus neural network

• Both try to predict p(y = 1 | x)

• Logistic regression learns W such that p(y = 1 | x) = σ(xW)

• Neural network learns W^(1) and W^(2) such that p(y = 1 | x) = σ(σ(xW^(2)) W^(1))

31

Textbook excerpt:

…and y ∈ {0, 1}. If y ∈ R, we use linear regression for this last layer and so learn weights w^(2) ∈ R^2 such that h w^(2) approximates the true output y. If y ∈ {0, 1}, we use logistic regression for this last layer and so learn weights w^(2) ∈ R^2 such that σ(h w^(2)) approximates the true output y.

Now we consider the more general case with any d, k_1, m. To provide some intuition for this more general setting, we will begin with one hidden layer, for the sigmoid transfer function and cross-entropy output loss. For logistic regression we estimated W ∈ R^{d×m}, with f(xW) = σ(xW) ≈ y. We will predict an output vector y ∈ R^m, because it will make later generalizations more clear-cut and make notation for the weights in each layer more uniform. When we add a hidden layer, we have two parameter matrices W^(2) ∈ R^{d×k_1} and W^(1) ∈ R^{k_1×m}, where k_1 is the dimension of the hidden layer:

h = σ(xW^(2)) = [σ(xW^(2)_{:1}), σ(xW^(2)_{:2}), …, σ(xW^(2)_{:k_1})]^T ∈ R^{k_1}

where the sigmoid function is applied to each entry in xW^(2) and hW^(1). This hidden layer is the new set of features and again you will do the regular logistic regression optimization to learn weights on h:

p(y = 1 | x) = σ(hW^(1)) = σ(σ(xW^(2)) W^(1)).

With the probabilistic model and parameters specified, we now need to derive an algorithm to obtain those parameters. As before, we take a maximum likelihood approach and derive gradient descent updates. This composition of transfers seems to complicate matters, but we can still take the gradient w.r.t. our parameters. We simply have more parameters now: W^(2) ∈ R^{d×k_1}, W^(1) ∈ R^{k_1×m}. Once we have the gradient w.r.t. each parameter matrix, we simply take a step in the direction of the negative of the gradient, as usual. The gradients for these parameters share information; for computational efficiency, the gradient is computed first for W^(1), and duplicate gradient information is sent back to compute the gradient for W^(2). This algorithm is typically called back propagation, which we describe next.

In general, we can compute the gradient for any number of hidden layers. Denote each differentiable transfer function f_1, …, f_H, ordered with f_1 as the output transfer, and k_1, …, k_{H-1} as the hidden dimensions with H - 1 hidden layers. Then the output from the neural network is

f_1( f_2( … f_{H-1}( f_H(xW^(H)) W^(H-1) ) … ) W^(1) )

where W^(1) ∈ R^{k_1×m}, W^(2) ∈ R^{k_2×k_1}, …, W^(H) ∈ R^{d×k_{H-1}}.

Backpropagation algorithm

We will start by deriving back propagation for two layers; the extension to multiple layers will be more clear given this derivation. Due to the size of the network, we will often learn …
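A minimal numpy sketch of this two-layer network and the shared gradient computation for the sigmoid/cross-entropy case with a single output (m = 1); the toy data and step size are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, k1 = 64, 4, 8
X = rng.standard_normal((n, d))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(n, 1)   # toy binary targets

W2 = 0.1 * rng.standard_normal((d, k1))    # input -> hidden weights W^(2)
W1 = 0.1 * rng.standard_normal((k1, 1))    # hidden -> output weights W^(1)

for _ in range(2000):
    # forward pass
    H = sigmoid(X @ W2)                    # h = sigma(x W^(2))
    P = sigmoid(H @ W1)                    # p(y = 1 | x) = sigma(h W^(1))

    # backward pass (cross-entropy loss); the W^(1) gradient is computed first
    delta1 = (P - y) / n                   # gradient at the output pre-activation
    grad_W1 = H.T @ delta1
    delta2 = (delta1 @ W1.T) * H * (1 - H) # propagated back through the sigmoid hidden layer
    grad_W2 = X.T @ delta2

    W1 -= 1.0 * grad_W1                    # gradient descent steps
    W2 -= 1.0 * grad_W2

print(np.mean((P > 0.5) == (y > 0.5)))     # training accuracy after these updates
```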



No representation learning vs. neural network

32

GLM (e.g., logistic regression) vs. two-layer neural network

[Figure 7.2: Generalized linear model, such as logistic regression: input layer x_1, …, x_4 connects directly to the output layer y.]

[Figure 7.3: Standard neural network: input layer x_1, …, x_4 connects to a hidden layer, which connects to the output layer y.]

What are the representational capabilities of neural nets?

• Single hidden-layer neural networks with sigmoid transfer can represent any continuous function on a bounded space within epsilon accuracy, for a large enough number of hidden nodes

• see Cybenko, 1989: “Approximation by Superpositions of a Sigmoidal Function”

33

[Figure 7.3 again: the standard neural network, with the input-to-hidden weights labelled W^(2) and the hidden-to-output weights labelled W^(1).]

h = f_2(xW^(2))

y = f_1(hW^(1))
