
Function Learning and Neural Nets


1

FUNCTION LEARNING AND NEURAL NETS

2

SETTING

Learn a function with:

- Continuous-valued examples, e.g., the pixels of an image
- Continuous-valued output, e.g., the likelihood that the image is a '7'

Known as regression [regression can be turned into classification via thresholds]

3

FUNCTION-LEARNING (REGRESSION) FORMULATION

Goal function f

Training set: (x(i), y(i)), i = 1,…,n, where y(i) = f(x(i))

Inductive inference: find a function h that fits the points well

Same Keep-It-Simple bias

[Plot: sample points (x, f(x))]

4

LEAST-SQUARES FITTING

Hypothesize a class of functions g(x,θ) parameterized by θ

Minimize the squared loss E(θ) = Σi (g(x(i),θ) - y(i))^2

[Plot: sample points (x, f(x))]

5

LINEAR LEAST-SQUARES

g(x,θ) = x ∙ θ

The value of θ that optimizes E(θ) is θ = [Σi x(i) y(i)] / [Σi x(i) x(i)]:

E(θ) = Σi (x(i) θ - y(i))^2 = Σi (x(i)^2 θ^2 - 2 x(i) y(i) θ + y(i)^2)

E′(θ) = 0 => d/dθ [Σi (x(i)^2 θ^2 - 2 x(i) y(i) θ + y(i)^2)] = Σi (2 x(i)^2 θ - 2 x(i) y(i)) = 0 => θ = [Σi x(i) y(i)] / [Σi x(i) x(i)]

[Plot: the data f(x) and the fitted line g(x,θ)]
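To make the closed form concrete, here is a minimal Python/NumPy sketch; the arrays x and y are made-up data, not from the slides:

```python
import numpy as np

# Made-up 1-D data: y roughly proportional to x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.1, 3.9, 6.2, 7.8])

# Closed-form minimizer of E(theta) = sum_i (x(i) theta - y(i))^2:
# theta = [sum_i x(i) y(i)] / [sum_i x(i) x(i)]
theta = np.dot(x, y) / np.dot(x, x)
print(theta)  # slope of the best-fit line through the origin (about 2.0)
```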

6

LINEAR LEAST-SQUARES WITH CONSTANT OFFSET

g(x,θ0,θ1) = θ0 + θ1 x

E(θ0,θ1) = Σi (θ0 + θ1 x(i) - y(i))^2
= Σi (θ0^2 + θ1^2 x(i)^2 + y(i)^2 + 2 θ0 θ1 x(i) - 2 θ0 y(i) - 2 θ1 x(i) y(i))

At the optimum, dE/dθ0(θ0*,θ1*) = 0 and dE/dθ1(θ0*,θ1*) = 0, so:

0 = 2 Σi (θ0* + θ1* x(i) - y(i))

0 = 2 Σi x(i) (θ0* + θ1* x(i) - y(i))

Verify the solution:

θ0* = (1/N) Σi (y(i) - θ1* x(i))

θ1* = [N (Σi x(i) y(i)) - (Σi x(i))(Σi y(i))] / [N (Σi x(i)^2) - (Σi x(i))^2]

[Plot: the data f(x) and the fitted line g(x,θ)]
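A sketch of the constant-offset solution, again on hypothetical data; np.polyfit provides an independent check:

```python
import numpy as np

# Made-up data with a nonzero intercept.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 7.0, 9.1])
N = len(x)

# theta1* and theta0* from the normal equations on this slide.
theta1 = (N * np.dot(x, y) - x.sum() * y.sum()) / (N * np.dot(x, x) - x.sum() ** 2)
theta0 = (y - theta1 * x).mean()

print(theta0, theta1)
print(np.polyfit(x, y, 1))  # degree-1 fit returns [theta1, theta0]
```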

7

MULTI-DIMENSIONAL LEAST-SQUARES

Let x include attributes (x1,…,xN)

Let θ include coefficients (θ1,…,θN)

Model g(x,θ) = x1 θ1 + … + xN θN

x

f(x)g(x,q)

8

MULTI-DIMENSIONAL LEAST-SQUARES

g(x,θ) = x1 θ1 + … + xN θN

Best θ is given by θ = (A^T A)^-1 A^T b

where A is the matrix with the x(i) as rows and b is the vector of the y(i)

[Plot: the data f(x) and the fitted model g(x,θ)]
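The normal equations translate directly into NumPy; the matrix A and vector b below are hypothetical. In practice a least-squares solver is preferred over forming the inverse explicitly:

```python
import numpy as np

# Hypothetical examples: rows of A are the x(i), entries of b are the y(i).
A = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
b = np.array([5.0, 3.5, 6.0, 10.0])

# theta = (A^T A)^-1 A^T b, exactly as on the slide.
theta = np.linalg.inv(A.T @ A) @ A.T @ b

# Same solution via a numerically stabler solver.
theta_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(theta, theta_lstsq)
```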

9

NONLINEAR LEAST-SQUARES

E.g., quadratic: g(x,θ) = θ0 + x θ1 + x^2 θ2

E.g., exponential: g(x,θ) = exp(θ0 + x θ1)

Any combination: g(x,θ) = exp(θ0 + x θ1) + θ2 + x θ3

Fitting can be done using gradient descent

[Plot: linear, quadratic, and other fits to the same data points (x, f(x))]
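As an illustration of fitting a nonlinear model by gradient descent, here is a sketch that fits the exponential model above to made-up data; the step size and iteration count are hypothetical tuning choices:

```python
import numpy as np

# Made-up data from an exponential curve with a little noise.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=40)
y = np.exp(0.5 + 1.2 * x) + 0.05 * rng.normal(size=40)

# Fit g(x, theta) = exp(theta0 + x theta1) by gradient descent on the squared loss.
theta = np.zeros(2)
for _ in range(5000):
    g = np.exp(theta[0] + theta[1] * x)
    r = g - y
    # Chain rule: dg/dtheta0 = g, dg/dtheta1 = x g.
    grad = 2 * np.array([np.sum(r * g), np.sum(r * g * x)])
    theta -= 5e-4 * grad
print(theta)  # should approach (0.5, 1.2) on this well-behaved instance
```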

GRADIENT DESCENT

g(x,θ) = x1 θ1 + … + xN θN

Error: E(θ) = Σi (g(x(i),θ) - y(i))^2

Take the derivative: dE(θ)/dθ = 2 Σi (dg(x(i),θ)/dθ) (g(x(i),θ) - y(i))

Since dg(x(i),θ)/dθ = x(i), dE(θ)/dθ = 2 Σi x(i) (g(x(i),θ) - y(i))

Update rule: θ ← θ - ε Σi x(i) (g(x(i),θ) - y(i))

Convergence to the global minimum is guaranteed (with ε chosen small enough) because E is a convex function
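A sketch of the batch update rule; the data is synthetic so convergence toward the generating parameters is visible:

```python
import numpy as np

def batch_gradient_descent(X, y, epsilon=0.01, iters=500):
    """Repeatedly apply theta <- theta - eps * sum_i x(i)(g(x(i),theta) - y(i))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = X @ theta - y             # g(x(i),theta) - y(i) for every i
        theta -= epsilon * (X.T @ residual)  # sum over all examples: a batch update
    return theta

# Synthetic data generated with theta = (2, -1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=50)
print(batch_gradient_descent(X, y))  # approaches [2, -1]
```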

11

STOCHASTIC GRADIENT DESCENT

The prior rule was a batch update: all examples are incorporated in each step, so all prior examples must be stored

Stochastic Gradient Descent: use a single example on each step

Update rule: pick an example i (either at random or in order) and a step size ε, then

θ ← θ + ε x(i) (y(i) - g(x(i),θ))

Reduces error on i’th example… but does it converge?
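The stochastic variant touches one example per update; a minimal sketch on the same kind of synthetic data:

```python
import numpy as np

def sgd(X, y, epsilon=0.05, epochs=20):
    """theta <- theta + eps * x(i)(y(i) - g(x(i),theta)), one example at a time."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # pick examples in random order
            theta += epsilon * X[i] * (y[i] - X[i] @ theta)
    return theta

# Synthetic data generated with theta = (2, -1).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0])
print(sgd(X, y))  # approaches [2, -1]
```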

12

PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)

y = g(Σi=1,…,n wi xi)

[Diagram: inputs x1,…,xn, each weighted by wi, feed a sum Σ and a threshold g that outputs y]

[Plot: '+' and '-' examples in the (x1, x2) plane, separated by the line w1 x1 + w2 x2 = 0]

13

PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)

y = g(Σi=1,…,n wi xi)

[Diagram: the same perceptron unit]

[Plot: '+' and '-' examples in the (x1, x2) plane; the separating line is unknown (?)]

14

PERCEPTRON LEARNING RULE

θ ← θ + ε x(i) (y(i) - g(θ^T x(i)))    (g outputs either 0 or 1; y is either 0 or 1)

- If the output is correct, the weights are unchanged
- If g is 0 but y is 1, the weight on attribute i is increased
- If g is 1 but y is 0, the weight on attribute i is decreased

Converges if the data is linearly separable, but oscillates otherwise
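A sketch of the rule on hypothetical linearly separable data; a constant 1 is appended to each example so the threshold can learn an offset:

```python
import numpy as np

def train_perceptron(X, y, epsilon=0.1, epochs=50):
    """theta <- theta + eps * x(i)(y(i) - g(theta^T x(i))), g a 0/1 threshold."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = 1 if xi @ theta > 0 else 0    # g outputs either 0 or 1
            theta += epsilon * xi * (yi - g)  # unchanged when the output is correct
    return theta

# Hypothetical separable data: class 1 iff x1 + x2 > 1; append a constant 1 input.
rng = np.random.default_rng(2)
pts = rng.uniform(0, 1, size=(100, 2))
X = np.hstack([pts, np.ones((100, 1))])
y = (pts.sum(axis=1) > 1).astype(int)
theta = train_perceptron(X, y)
print(np.mean((X @ theta > 0).astype(int) == y))  # training accuracy, near 1.0
```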

15

UNIT (NEURON)

[Diagram: unit with inputs x1,…,xn, weights wi, sum Σ, and activation g]

y = g(Σi=1,…,n wi xi)

g(u) = 1/[1 + exp(-au)]

16

A SINGLE NEURON CAN LEARN

[Diagram: single unit y = g(Σi=1,…,n wi xi)]

- A disjunction of boolean literals over x1, x2, x3
- The majority function
- XOR?

17

NEURAL NETWORK

Network of interconnected neurons

[Diagram: two interconnected units, each computing y = g(Σi wi xi)]

Acyclic (feed-forward) vs. recurrent networks

18

TWO-LAYER FEED-FORWARD NEURAL NETWORK

[Diagram: inputs feed a hidden layer through weights w1j, which feeds an output layer through weights w2k]

19

NETWORKS WITH HIDDEN LAYERS

Can learn XORs and other nonlinear functions

As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features

Difficult to characterize which class of functions!

How to train hidden layers?

20

BACKPROPAGATION (PRINCIPLE)

New example: y(k) = f(x(k))

φ(k) = output of the NN with weights w(k-1) on input x(k)

Error function: E(k)(w(k-1)) = (φ(k) - y(k))^2

Weight update: wij(k) = wij(k-1) - ε ∂E(k)/∂wij   (i.e., w(k) = w(k-1) - ε∇E(k))

Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
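A minimal sketch of the principle for a two-layer network with the sigmoid unit from earlier, trained on XOR (which a single unit cannot represent). The layer sizes, step size, and iteration count are hypothetical choices:

```python
import numpy as np

def sigmoid(u, a=1.0):
    """The unit's activation g(u) = 1 / (1 + exp(-a u))."""
    return 1.0 / (1.0 + np.exp(-a * u))

def backprop_step(x, y, W1, W2, epsilon):
    """One update w <- w - eps * dE/dw for E = (phi - y)^2 on a single example."""
    h = sigmoid(W1 @ x)                      # forward: hidden activations
    phi = sigmoid(W2 @ h)                    # forward: network output phi
    d_out = 2 * (phi - y) * phi * (1 - phi)  # backward: last layer first...
    d_hid = d_out * W2 * h * (1 - h)         # ...then the previous layer
    W2 -= epsilon * d_out * h                # in-place weight updates
    W1 -= epsilon * np.outer(d_hid, x)

# XOR, with a constant 1 input appended as an offset term.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])
rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 3))   # 4 hidden units
W2 = rng.normal(size=4)
for _ in range(20000):
    k = rng.integers(4)        # stochastic: one example per step
    backprop_step(X[k], Y[k], W1, W2, epsilon=0.5)
# Outputs usually approach [0, 1, 1, 0], though training can
# stall in a local minimum (see the Caveats slide).
print([round(float(sigmoid(W2 @ sigmoid(W1 @ x))), 2) for x in X])
```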

21-23

UNDERSTANDING BACKPROPAGATION

Minimize E(θ) via Gradient Descent…

[Plot sequence across three slides: the error surface E(θ) vs. θ; the gradient of E at the current θ; a step taken proportional to the gradient]

24

LEARNING ALGORITHM

Given: examples (x(1),y(1)),…, (x(N),y(N)) and a learning rate ε

Init: set k = 1 (or k = rand(1,N))

Repeat:
- Tweak the weights with a backpropagation update on example (x(k), y(k))
- Set k = k+1 (or k = rand(1,N))

25-30

UNDERSTANDING BACKPROPAGATION

This is an example of Stochastic Gradient Descent: decompose E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ek = (g(x(k),θ) - y(k))^2

On each iteration, take a step that reduces the current ek

[Plot sequence across six slides: E(θ) vs. θ, with successive steps following the gradients of e1, e2, e3, …]

31

STOCHASTIC GRADIENT DESCENT

The objective function values (measured over all examples) settle into a local minimum over time

The step size must be reduced over time, e.g., as O(1/t)
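One common way to implement the decreasing step size; the constants eps0 and t0 are hypothetical knobs:

```python
def step_size(t, eps0=0.5, t0=10.0):
    """Step size for iteration t, decaying like O(1/t)."""
    return eps0 * t0 / (t0 + t)
```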

32

CAVEATS

Choosing a convergent learning rate ε can be hard in practice

[Plot: the error surface E(θ) vs. θ]

33

COMMENTS AND ISSUES

How to choose the size and structure of the network?

- If the network is too large, there is a risk of over-fitting (data caching)
- If the network is too small, the representation may not be rich enough

Role of representation: e.g., learning the concept of an odd number

Incremental learning

Low interpretability

34

PERFORMANCE OF FUNCTION LEARNING

Overfitting: too many parameters

Regularization: penalize large parameter values (see the sketch below)

Efficient optimization:
- If E(θ) is nonconvex, we can only guarantee finding a local minimum
- Batch updates are expensive; stochastic updates converge slowly
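One standard instance of such a penalty is L2 (ridge) regularization, sketched here on the least-squares setup from earlier; lam is a hypothetical regularization weight:

```python
import numpy as np

def ridge_fit(A, b, lam=0.1):
    """Minimize |A theta - b|^2 + lam |theta|^2.

    Closed form: theta = (A^T A + lam I)^-1 A^T b; the penalty shrinks
    large parameter values and so reduces overfitting.
    """
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```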

35

READINGS

R&N 18.8-9

HW5 due on Thursday