
Game theory for Neural Networks

David Balduzzi (VUW)

C3 Research Camp

Goals

“a justification is objective if, in principle, it can be tested and understood by anybody”

— Karl Popper

Objective knowledge

• How is knowledge created?

• How is knowledge shared?

Goals

• Deductive reasoning (a priori knowledge)
  • foundation of mathematics
  • formal: mathematical logic
  • operational: Turing machines, lambda calculus

Background

Tiny Turing Machine

Alvy Ray Smith, based on [Yurii Rogozhin 1996]


• Deductive reasoning (a priori knowledge)
  • foundation of mathematics
  • formal: mathematical logic
  • operational: Turing machines, lambda calculus

• Inductive reasoning (empirical knowledge)
  • foundation of science
  • formal: learning theory
  • operational: SVMs, neural networks, etc.

Background

• What’s missing?
  • To be objective, knowledge must be shared
  • How can agents share knowledge?

• A theory of distributed reasoning
  • induction builds on deduction
  • distribution should build on both

Neural Networks

• Superhuman performance on object recognition (ImageNet)

• Outperformed humans at recognising street signs (Google Street View)

• Superhuman performance on Atari games (Google)

• Real-time translation: English voice to Chinese voice (Microsoft)

30 years of trial-and-error

• Architecture
  • Weight-tying (convolutions)
  • Max-pooling

• Nonlinearity
  • Rectilinear units (Jarrett 2009)

• Credit assignment
  • Backpropagation

• Optimization
  • Stochastic Gradient Descent
  • Nesterov momentum, BFGS, AdaGrad, etc.
  • RMSProp (Riedmiller 1993, Hinton & Tieleman 200x)

• Regularization
  • Dropout (Srivastava 2014) or
  • Batch Normalization (Ioffe & Szegedy 2015)

• Concrete problem:
  • Why does gradient descent work on neural nets?
  • Not just gradient descent; many methods designed for convex problems work surprisingly well on neural nets.

• Big picture:
  • Neural networks are populations of learners that distribute knowledge extremely effectively.
  • We should study them as such.

This tutorial: Outline

• Deduction
• Induction
• Gradients
• Games
• Neural networks

Deductive reasoning

“we are regarding the function of the mathematician as simply to determine the truth or falsity of propositions”

All men are mortal.
Socrates is a man.
Therefore, Socrates is mortal.

Idea: Intelligence is deductive reasoning!

[Images: page 1 of the original Dartmouth proposal, and a photo of Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, and Ray Solomonoff. Photo courtesy Dartmouth College; from AI Magazine, Winter 2006.]

50 years later

“To understand the real world, we must have a different set of primitives from the relatively simple line trackers suitable and sufficient for the blocks world”

— Patrick Winston (1975), Director of MIT’s AI lab from 1972-1997

A bump in the road

The AI winter: http://en.wikipedia.org/wiki/AI_winter

Reductio ad absurdum

“Intelligence is 10 million rules” — Doug Lenat

Deductive reasoning

• Structure:
  • axioms;
  • logical laws;
  • production system

• Derive complicated (empirical) truths from simple (self-evident) truths

• Deductive reasoning provides no guidance about which axiom(s) to eliminate when “derived truths” contradict experience

• In general, deductive reasoning has nothing to say about mistakes except “avoid them”

• In retrospect, it’s hard to understand why anyone thought deduction would yield artificial intelligence.

Inductive reasoning

“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”

Inductive reasoning

Socrates is mortal.
Plato is mortal.
[ … more examples … ]
Therefore, all men are mortal.

Inductive reasoning

All observed As are Bs.
a is an A.
Therefore, a is a B.

The problem of induction

All observed As are Bs.
a is an A.
Therefore, a is a B.

This inference is not deductively valid: the conclusion does not follow from the premises.

“If a machine is […] infallible, it cannot also be intelligent. There are […] theorems which say almost exactly that. But

these theorems say nothing about how much intelligence may be displayed if a machine makes no pretence at infallibility.”

[ it’s ok to make mistakes ]

Let’s look at some examples

Sequential prediction

Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct.

Goal: Predict as accurately as possible.

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t + 1

Question: How long to find the correct expert?

BAD!!! Nature may never force the surviving experts to disagree, so the correct expert need never be singled out. The right question is:

Question: How many errors?

When the algorithm makes a mistake, it removes ≥ half of the experts, so:

errors ≤ log N
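A minimal Python sketch of the halving algorithm described above (the expert predictions and the truth sequence are made-up inputs for illustration):

```python
import math

def halving_algorithm(expert_predictions, truth):
    """Run the halving algorithm.

    expert_predictions[t][i] is expert i's 0/1 prediction at round t;
    truth[t] is the label Nature reveals. Assumes at least one expert
    is always correct. Returns the number of mistakes made.
    """
    n = len(expert_predictions[0])
    alive = set(range(n))          # experts not yet eliminated
    mistakes = 0
    for preds, y in zip(expert_predictions, truth):
        # Step 1: predict by majority vote over surviving experts.
        votes = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes >= len(alive) else 0
        if guess != y:
            mistakes += 1
        # Step 2: remove experts that were wrong this round.
        alive = {i for i in alive if preds[i] == y}
    return mistakes

# Toy run: 8 experts, expert 3 is always correct.
truth = [1, 0, 1, 1, 0, 1]
preds = [[(t + i) % 2 if i != 3 else truth[t] for i in range(8)]
         for t in range(len(truth))]
print(halving_algorithm(preds, truth), "<=", math.log2(8))   # mistake bound: log2(N)
```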

What’s going on?

Didn’t we just use deductive reasoning!?!

Yes… but No!

Algorithm: makes educated guesses about Nature (inductive)

Analysis: proves a theorem about the number of errors (deductive)

The algorithm learns — but it does not deduce!

Adversarial prediction

Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct. Nature is adversarial.

Goal: Predict as accurately as possible.

Seriously?!?!

Regret

Let m* be the best expert in hindsight.

regret := #mistakes(Forecaster) − #mistakes(m*)

Goal: Predict as accurately as possible. Minimize regret.

Halving Algorithm (ADVERSARIAL SETTING)

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t + 1

Question: What is the regret?

BAD! Without a perfectly correct expert, an adversarial Nature can make every expert wrong, so the halving algorithm can eliminate all of them.

Weighted Majority Algorithm

Pick β in (0,1). Assign weight 1 to each expert.

Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t + 1

What is the regret? [ choose β carefully ]

$$\sqrt{\frac{T \cdot \log N}{2}}$$
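A minimal Python sketch of the weighted majority algorithm (the choice of β and the data are illustrative):

```python
def weighted_majority(expert_predictions, truth, beta=0.5):
    """Weighted majority vote over binary experts.

    expert_predictions[t][i] in {0,1}; truth[t] in {0,1}.
    Incorrect experts have their weight multiplied by beta.
    Returns the mistake counts of the forecaster and of each expert.
    """
    n = len(expert_predictions[0])
    w = [1.0] * n
    forecaster_mistakes = 0
    expert_mistakes = [0] * n
    for preds, y in zip(expert_predictions, truth):
        # Step 1: weighted majority vote.
        vote_1 = sum(w[i] for i in range(n) if preds[i] == 1)
        guess = 1 if 2 * vote_1 >= sum(w) else 0
        if guess != y:
            forecaster_mistakes += 1
        # Step 2: shrink the weights of experts that were wrong.
        for i in range(n):
            if preds[i] != y:
                w[i] *= beta
                expert_mistakes[i] += 1
    return forecaster_mistakes, expert_mistakes
```

The regret against the best expert in hindsight is then `forecaster_mistakes - min(expert_mistakes)`.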

Loss functions

• 0/1 loss (not convex):
$$\ell(y, y') = \begin{cases} 0 & \text{if } y = y' \\ 1 & \text{else} \end{cases}$$

• Mean square error (convex):
$$\ell(y, y') = \tfrac{1}{2}\left(y - y'\right)^{2}$$

• Logistic loss (convex):
$$\ell(p, i) = \begin{cases} -\log(1 - p) & \text{if } i = 0 \\ -\log(p) & \text{else} \end{cases}$$
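The same three losses as plain Python functions (a small illustrative sketch):

```python
import math

def zero_one_loss(y, y_pred):
    # 0/1 loss: not convex.
    return 0.0 if y == y_pred else 1.0

def squared_loss(y, y_pred):
    # Mean square error: convex in y_pred.
    return 0.5 * (y - y_pred) ** 2

def logistic_loss(p, i):
    # Logistic loss for a predicted probability p of the label being 1.
    return -math.log(1.0 - p) if i == 0 else -math.log(p)
```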

Exponential Weights Algorithm

Pick β > 0. Assign weight 1 to each expert.

Set t = 1.
While t ≤ T:
  Step 1. Predict the weighted sum of the experts’ predictions.
  Step 2. Multiply the weight of expert i by $e^{-\beta \cdot \ell(f_i^t, y^t)}$.
  Step 3. t ← t + 1

What is the regret? [ choose β carefully ]

$$\log N + \sqrt{2\,T \log N}$$
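A sketch of the exponential weights forecaster for real-valued predictions (the squared loss and the synthetic data are illustrative assumptions, not specified on the slide):

```python
import math
import random

def exp_weights(expert_preds, truth, beta=0.1):
    """Exponential weights forecaster.

    expert_preds[t][i]: expert i's prediction at round t (a real number).
    truth[t]: the revealed outcome. Loss is the squared error.
    Returns cumulative losses of the forecaster and of each expert.
    """
    n = len(expert_preds[0])
    w = [1.0] * n
    loss_forecaster = 0.0
    loss_experts = [0.0] * n
    for preds, y in zip(expert_preds, truth):
        total = sum(w)
        # Step 1: predict the weighted average of the experts' predictions.
        pred = sum(w[i] * preds[i] for i in range(n)) / total
        loss_forecaster += 0.5 * (pred - y) ** 2
        # Step 2: exponentially down-weight experts according to their loss.
        for i in range(n):
            li = 0.5 * (preds[i] - y) ** 2
            loss_experts[i] += li
            w[i] *= math.exp(-beta * li)
    return loss_forecaster, loss_experts

# Toy run: 5 noisy experts tracking a constant target.
random.seed(0)
truth = [1.0] * 200
preds = [[1.0 + random.gauss(0, 0.1 * (i + 1)) for i in range(5)] for _ in truth]
lf, le = exp_weights(preds, truth)
print("regret:", lf - min(le))
```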

Inductive reasoning

• Structure:
  • game played over a series of rounds
  • on each round, experts make predictions
  • Forecaster takes a weighted average of the predictions
  • Nature reveals the truth and a loss is incurred
  • Forecaster adjusts weights
    • correct experts —> increase weights;
    • incorrect —> decrease

• Goal:
  • average loss per round converges to best-in-hindsight

Inductive reasoning

• How it works
  • Forecaster adjusts strategy according to prior experience
  • Current strategy encodes previous mistakes
  • No guarantee on any particular example
  • Performance is optimal on average

• Still to come
  • How can Forecaster communicate its knowledge with others, in a way that is useful to them?

Gradient descent

“the great watershed in optimisation isn’t between linearity and nonlinearity, but convexity and nonconvexity”

— Tyrrell Rockafellar

Gradient descent

• Everyone knows (?) gradient descent

• Take a fresh look:
  • gradient descent is a no-regret algorithm
  • —> game theory
  • —> neural networks

Online Convex Optimization

Scenario: Convex set K; differentiable loss L(a, b) that is a convex function of a.

At time t, Forecaster picks a_t in K. Nature responds with b_t in K. [ Nature is adversarial ]

Forecaster’s loss is L(a_t, b_t).

Goal: Minimize regret.


Follow the Leader

Idea: Predict the $a_t$ that would have worked best on $\{ b_1, \dots, b_{t-1} \}$.

Pick $a_1$ at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t := \arg\min_{a \in K} \left[ \sum_{i=1}^{t-1} L(a, b_i) \right]$
  Step 2. t ← t + 1

BAD! Problem: Nature pulls Forecaster back and forth. No memory!

FTRL (Follow the Regularized Leader)

Pick $a_1$ at random. Set t = 1.
While t ≤ T:
  Step 1. regularize:
  $$a_t := \arg\min_{a \in K} \left[ \sum_{i=1}^{t-1} L(a, b_i) + \frac{\lambda}{2} \cdot \|a\|_2^2 \right]$$
  Step 2. t ← t + 1

FTRL

gradient descent:

Pick $a_1$ at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t \leftarrow a_{t-1} - \beta \cdot \frac{\partial}{\partial a} L(a_{t-1}, b_{t-1})$
  Step 2. t ← t + 1

Intuition: β controls memory.

What is the regret? [ choose β carefully ]

$$\operatorname{diam}(K) \cdot \operatorname{Lipschitz}(L) \cdot \sqrt{T}$$
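A minimal online (projected) gradient descent sketch against an arbitrary sequence b_t, using the squared loss L(a, b) = ½(a − b)² on an interval K = [lo, hi] (the loss and the set are illustrative assumptions, not specified on the slide):

```python
def online_gradient_descent(b_sequence, beta=0.1, lo=-1.0, hi=1.0):
    """Play a_t in K = [lo, hi] against the sequence b_t.

    Loss is L(a, b) = 0.5 * (a - b)**2, so dL/da = (a - b).
    Returns the total loss and the regret against the best
    fixed action in hindsight (the mean of the b_t, clipped to K).
    """
    a = 0.0
    total_loss = 0.0
    for b in b_sequence:
        total_loss += 0.5 * (a - b) ** 2
        grad = a - b
        a = a - beta * grad                  # gradient step
        a = max(lo, min(hi, a))              # project back onto K
    best = max(lo, min(hi, sum(b_sequence) / len(b_sequence)))
    best_loss = sum(0.5 * (best - b) ** 2 for b in b_sequence)
    return total_loss, total_loss - best_loss

loss, regret = online_gradient_descent([(-1) ** t * 0.5 for t in range(100)])
print(loss, regret)
```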

Logarithmic regret

[ choose β carefully ]

Convergence of the average regret at rate $\sqrt{T}/T = 1/\sqrt{T}$ is actually quite slow.

• For many common loss functions (e.g. mean-square error, logistic loss)
• it is possible to do much better: $\log T / T$

Information theory

Two connections

• Entropic regularization <—> Exponential Weights Algorithm

• Analogy between gradients and Fano information, “the information represented by y about x”:
$$\log \frac{P(x|y)}{P(x)}$$

Gradient descent

• Gradient descent is fast, cheap, and guaranteed to converge — even against adversaries

• “a hammer that smashes convex nails”


• GD is well-understood in convex settings

• Convergence rate depends on the structure of the loss, size of search space, etc.

Convex optimization

Convex methods

• Stochastic gradient descent

• Nesterov momentum

• BFGS (Broyden-Fletcher-Goldfarb-Shanno)

• Conjugate gradient methods

• AdaGrad

Online Convex Opt. (deep learning)

Apply Gradient Descent to nonconvex optimization (neural networks).

• Theorems don’t work (not convex)

• tons of engineering (decades of tweaking)

Amazing performance.

[Excerpt from a review article on deep learning in Nature 521 (28 May 2015), as shown on the slide:]

…self-driving cars [60,61]. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding [14] and speech recognition [7].

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches [1]. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout [62], and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks [4,58,59,63–65] and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays [66,67]. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations [21]. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure [40]. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features) [68,69]. Second, composing layers of representation in a deep net brings the potential for another exponential advantage [70] (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local…

Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found [86] that it exploits this to achieve better ‘translation’ of images into captions.

[Figure panels: a deep CNN (“Vision”) feeding a caption-generating RNN (“Language”), with example generated captions such as “A group of people shopping at an outdoor market.” and “A woman is throwing a frisbee in a park.”]


Deep Learning

• Superhuman performance on object recognition (ImageNet)

• Outperformed humans at recognising street-signs (Google streetview).

• Superhuman performance on Atari games (Google).

• Real-time translation: English voice to Chinese voice (Microsoft).

Game theory

“An equilibrium is not always an optimum; it might not even be good. This may be the most important discovery of game theory.” — Ivar Ekeland

Game theory

• So far
  • Seen how Forecaster can learn to use “expert” advice.
  • Forecaster modifies the weights of experts such that it converges to optimal-in-hindsight predictions.

• Who are these experts?
  • We will model them as Forecasters in their own right.
  • First, let’s develop tools for modeling populations.

Setup

• Players A, B with actions {a_1, a_2, … a_m} and {b_1, b_2, … b_n}

• Two payoff matrices, describing the losses/gains of the players for every combination of moves

• Goal of players is to maximise gain / minimise loss

• Neither player knows what the other will do

Rock-paper-scissors (zero-sum game). Payoffs to A / B:

           B: R      B: P      B: S
  A: R     0 / 0    −1 / 1     1 / −1
  A: P     1 / −1    0 / 0    −1 / 1
  A: S    −1 / 1     1 / −1    0 / 0

Minimax theorem

$$\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)$$

• Left-hand side: Forecaster picks a, Nature responds with b.
• Right-hand side: Nature picks b, Forecaster responds with a.

Going first hurts Forecaster, so

$$\inf_{a \in K} \sup_{b \in K} L(a, b) \;\geq\; \sup_{b \in K} \inf_{a \in K} L(a, b)$$

Minimax theorem

Proof idea:

Let m* be the best move in hindsight; regret := loss(Forecaster) − loss(m*).

• No-regret algorithm → Forecaster can asymptotically match hindsight.

• Order of players doesn’t matter asymptotically.

• Convert the series of moves into an average via online-to-batch:
$$\bar{a} = \frac{1}{T} \sum_{t=1}^{T} a_t$$

• This yields the reverse inequality
$$\inf_{a \in K} \sup_{b \in K} L(a, b) \;\leq\; \sup_{b \in K} \inf_{a \in K} L(a, b)$$
and hence equality.
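An illustrative simulation (not from the slides): two exponential-weights learners play rock-paper-scissors against each other with full-information feedback on expected payoffs. Their time-averaged strategies approach the uniform minimax strategy and the average payoff approaches the game value 0; the instantaneous strategies may cycle, but the averages converge.

```python
import math

# Payoff to the row player A; B's payoff is the negative (zero-sum).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]   # rows/cols: R, P, S

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def exp_weights_self_play(T=5000, eta=0.05):
    pa = normalize([3.0, 1.0, 1.0])   # A starts away from uniform
    pb = normalize([1.0, 2.0, 1.0])   # so does B
    avg_a = [0.0, 0.0, 0.0]
    avg_value = 0.0
    for _ in range(T):
        # Expected payoff of each action against the opponent's current mix.
        payoff_a = [sum(PAYOFF[i][j] * pb[j] for j in range(3)) for i in range(3)]
        payoff_b = [-sum(PAYOFF[i][j] * pa[i] for i in range(3)) for j in range(3)]
        avg_value += sum(pa[i] * payoff_a[i] for i in range(3)) / T
        avg_a = [avg_a[i] + pa[i] / T for i in range(3)]
        # Multiplicative-weights update; renormalizing keeps the numbers bounded.
        pa = normalize([pa[i] * math.exp(eta * payoff_a[i]) for i in range(3)])
        pb = normalize([pb[j] * math.exp(eta * payoff_b[j]) for j in range(3)])
    return avg_a, avg_value

print(exp_weights_self_play())   # roughly [1/3, 1/3, 1/3] and a value close to 0
```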

Nash equilibrium

Nash equilibrium <—> no player benefits by deviating

Prisoner’s dilemma (payoffs to A / B):

               B: Deny     B: Confess
  A: Deny      −2 / −2      −6 / 0
  A: Confess    0 / −6      −6 / −6

Nash equilibrium

Problems:

• Computationally hard (PPAD-complete, Daskalakis 2008)

• Some outcomes are very bad

Nash equilibrium

Blind-intersection dilemma (payoffs to A / B):

              B: Stop     B: Go
  A: Stop      0 / 0      0 / 10
  A: Go       10 / 0     −∞ / −∞

Equilibrium: both drivers STOP; no-one uses the intersection.

Correlated equilibrium

Blind-intersection dilemma (payoffs as above). Traffic light as a signal:

$$P(x, y) = \begin{cases} \tfrac{1}{2} & \text{if } x = \text{Stop and } y = \text{Go} \\ \tfrac{1}{2} & \text{if } x = \text{Go and } y = \text{Stop} \\ 0 & \text{else} \end{cases}$$

• Better outcomes for players: Correlated equilibria can be much better than Nash equilibria (depending on the signal)

Convex games

• Players $\{P_i\}_{i=1}^N$ with convex action sets $\{K_i\}_{i=1}^N$

• Loss functions $\ell_i(\mathbf{w}_1, \dots, \mathbf{w}_N)$, where $\mathbf{w}_i \in K_i$, and each loss is separately convex in each argument

• Players aim to minimize their losses

• Players do not know what the other players will do

No-regret —> Corr. eqm

Sketch of proof:

Since player j applies a no-regret algorithm, it follows that after T rounds its sequence of moves $\{w_j^t\}$ has a loss within ε of the optimal fixed move $w_j^*$:

$$\frac{1}{T} \sum_t \ell_j(w_j^t, \mathbf{w}_{-j}^t) - \ell_j(w_j^*, \mathbf{w}_{-j}^t) \;\leq\; \epsilon \qquad \forall\, w_j^*$$

Interpreting the historical sequence of plays as a signal obtains

$$\mathbb{E}\big[\ell_j(\mathbf{w})\big] \;\leq\; \mathbb{E}\big[\ell_j(w_j^*, \mathbf{w}_{-j})\big] + \epsilon$$

The result follows by letting ε —> 0.
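A small sketch of the final step for a two-player matrix game (illustrative; the game and the play history are assumed inputs): given a history of joint plays produced by no-regret learners, the empirical distribution of play is an ε-coarse-correlated equilibrium, where ε is the largest average regret of any player.

```python
def empirical_cce_epsilon(history, loss_A, loss_B):
    """history: list of (a, b) joint plays from no-regret learners.
    loss_A[a][b], loss_B[a][b]: each player's loss for the joint play (a, b).
    Returns the epsilon for which the empirical distribution of play is an
    epsilon-coarse-correlated equilibrium (the larger of the two players'
    average regrets against their best fixed deviation).
    """
    T = len(history)
    # Player A: realized average loss versus the best fixed action.
    realized_A = sum(loss_A[a][b] for a, b in history) / T
    best_A = min(sum(loss_A[a_dev][b] for _, b in history) / T
                 for a_dev in range(len(loss_A)))
    # Player B: same, deviating over its own actions.
    realized_B = sum(loss_B[a][b] for a, b in history) / T
    best_B = min(sum(loss_B[a][b_dev] for a, _ in history) / T
                 for b_dev in range(len(loss_B[0])))
    return max(realized_A - best_A, realized_B - best_B)
```

If both players use no-regret algorithms, this value tends to 0 as T grows, which is exactly the statement that the empirical play converges to the set of coarse correlated equilibria.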

Nash —> Aumann

From Nash equilibrium to correlated equilibrium:

• Computationally tractable: No-regret algorithms converge on coarse correlated equilibria

• Better outcomes: Signal is the past actions of players, chosen according to an optimal strategy

• Every game has a Nash equilibrium
• Economists typically assume that the economy is in Nash equilibrium
• But Nash equilibria are hard to find and are often unpleasant

• Correlated equilibria are easier to compute and can be better for the players

• A signal is required to coordinate players:
  • set the signal externally
    • e.g. traffic light
  • in repeated games, construct the signal as the average of historical behavior
    • e.g. no-regret learning

Game theory

• Correlated equilibrium
  • Players converge on mutually agreeable outcomes by adapting to each other’s behavior over repeated rounds
  • The signal provides a “strong hint” about where to find equilibria.

• Nash equilibrium
  • It’s hard to guess what others will do when playing de novo (intractability); and
  • You haven’t established a track record to build on (undesirability)

Gradients and Games

Gradient descent is a no-regret algorithm.

• Insert players using gradient descent into any convex game, and they will converge on a correlated equilibrium

• Global convergence to correlated equilibrium is controlled by convergence of players to their best-in-hindsight local optimum

Next:
• Gradient descent + chain rule = backprop

Neural networks

“Anything a human can do in a fraction of a second, a deep network can do too” — Ilya Sutskever

History

• Origins
  • Backpropagation (Werbos 1974, Rumelhart 1986)
  • Convolutional nets (Fukushima 1980, LeCun 1990)

• Mid-1990s to mid-2000s
  • back to convex methods; neural networks ignored

• Unsupervised pre-training
  • restricted Boltzmann machines (Hinton 2006)
  • autoencoders (Bengio 2007)

• Convolutional nets
  • sharp improvement on the ImageNet challenge (Krizhevsky 2012)
  • back to fully-supervised training

30 years of trial-and-error

• Architecture
  • Weight-tying (convolutions)
  • Max-pooling

• Nonlinearity
  • Rectilinear units (Jarrett 2009)

• Credit assignment
  • Backpropagation

• Optimization
  • Stochastic Gradient Descent
  • Nesterov momentum, BFGS, AdaGrad, etc.
  • RMSProp (Riedmiller 1993, Hinton & Tieleman 200x)

• Regularization
  • Dropout (Srivastava et al 2014) or
  • Batch Normalization (Ioffe & Szegedy 2015)

Why (these) NNs?

• Why neural networks?
  • GPUs
  • Good local minima (no-one knows why)
  • “Mirrors the world’s compositional structure”

• Why ReLUs, why max-pooling, why convex methods?
  • “locally” convex optimization (this talk)

Neural networks

[Diagram: Input → (W1 matrix mult, nonlinearity) → Layer 1 → (W2 matrix mult, nonlinearity) → Layer 2 → (W3 matrix mult) → Output, computing $f_W(x)$.]

[Diagram: the same network, with the output compared to the label y via the loss $\ell(f_W(x), y)$.]

Universal Function Approximation

Theorem (Leshno 1993): A two-layer neural network with infinite width and nonpolynomial nonlinearity is dense in the space of continuous functions.

[Plots: sigmoid, tanh and ReLU activations, together with their derivatives dσ/dx, dτ/dx, dρ/dx.]

Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

Tanh: $\tau(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

ReLU: $\rho(x) = \max(0, x)$
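The three activations and their derivatives as a small illustrative Python sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative saturates for large |x|

def tanh(x):
    return math.tanh(x)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0  # gradient is 0 or 1: no saturation when active
```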

The chain rule

$$f \circ g \circ h(x) = f(g(h(x)))$$

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$
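A quick numerical check of the chain rule for one composition (the choices of f, g, h are illustrative):

```python
import math

# f(u) = u**2, g(v) = sin(v), h(x) = 3*x
f, df = lambda u: u ** 2, lambda u: 2 * u
g, dg = math.sin, math.cos
h, dh = lambda x: 3 * x, lambda x: 3.0

x = 0.7
analytic = df(g(h(x))) * dg(h(x)) * dh(x)          # chain rule
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)
print(analytic, numeric)                            # should agree closely
```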

Linear networks

[Diagram: Input → W1 → W2 → W3.]

Feedforward sweep:

$$x^{\text{out}} = \left( \prod_{i=\#L}^{1} W_i \right) \cdot x^{\text{in}}$$

Weight updates

[Diagram: Input → W1 → W2 → W3, with a unit j in one of the layers.]

Gradient descent:
$$W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \cdot \frac{\partial \ell}{\partial W_{ij}}$$

by the chain rule:
$$\frac{\partial \ell}{\partial W_{ij}} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial W_{ij}}$$

since $\frac{\partial x_j}{\partial W_{ij}} = x_i$:
$$\frac{\partial \ell}{\partial W_{ij}} = \frac{\partial \ell}{\partial x_j} \cdot x_i$$

also, define
$$\delta_j := \frac{\partial \ell}{\partial x_j}, \qquad \frac{\partial \ell}{\partial W_{ij}} = \delta_j \cdot x_i$$

How to compute $\delta_j$?

Feedforward sweep:
$$x^{\text{out}} = \left( \prod_{i=\#L}^{1} W_i \right) \cdot x^{\text{in}}$$

Backpropagation:
$$\delta_i = \left( \prod_{k=i+1}^{\#L} W_k^{\top} \right) \cdot \left( \nabla_{x^{\text{out}}}\, \ell \right)$$

[Diagram: the forward sweep uses W1, W2, W3; backpropagation runs back through the transposes W1ᵀ, W2ᵀ, W3ᵀ.]
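A minimal numpy sketch of the forward sweep and backpropagation for a linear network with squared-error loss (the layer sizes and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                       # input, two hidden layers, output
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]

x_in = rng.normal(size=sizes[0])
y = rng.normal(size=sizes[-1])

# Feedforward sweep: x_out = W3 @ W2 @ W1 @ x_in, keeping the activations.
xs = [x_in]
for W in Ws:
    xs.append(W @ xs[-1])
x_out = xs[-1]

loss = 0.5 * np.sum((x_out - y) ** 2)

# Backpropagation: delta at the output is the gradient of the loss;
# deltas at earlier layers are obtained by multiplying by W transpose.
delta = x_out - y                            # nabla_{x_out} loss
grads = []
for W, x in zip(reversed(Ws), reversed(xs[:-1])):
    grads.append(np.outer(delta, x))         # dloss/dW_ij = delta_j * x_i
    delta = W.T @ delta                       # push delta back one layer
grads.reverse()

# Gradient descent step on every weight matrix.
eta = 0.01
Ws = [W - eta * g for W, g in zip(Ws, grads)]
```

For a ReLU network the only change is that the backward pass keeps only the contributions of active units, which is the "active path sums" picture discussed below.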

Max-pooling

Pick the element with the max output in each block; zero out the rest.
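A small sketch of max-pooling over non-overlapping 1-D blocks (illustrative; real implementations usually pool over 2-D windows):

```python
import numpy as np

def max_pool_1d(x, block=2):
    """Keep the max in each block of `block` consecutive entries, zero the rest."""
    out = np.zeros_like(x)
    for start in range(0, len(x), block):
        end = min(start + block, len(x))
        winner = start + int(np.argmax(x[start:end]))
        out[winner] = x[winner]
    return out

print(max_pool_1d(np.array([1.0, 3.0, -2.0, 0.5])))  # [0. 3. 0. 0.5]
```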

Dropout

[Diagram: Input → W1 → W2 → W3, with some units crossed out.]

During training, randomly zero out units with probability 0.5.
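A sketch of dropout applied to one layer's activations during training (illustrative; the 1/(1 − p) rescaling is the common "inverted dropout" convention and is not mentioned on the slide):

```python
import numpy as np

def dropout(activations, p=0.5, rng=None):
    """Zero each unit independently with probability p during training."""
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    # Inverted-dropout rescaling keeps the expected activation unchanged.
    return activations * mask / (1.0 - p)
```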

ReLU networks

[Diagram: Input → W1 → W2 → W3 with ReLU nonlinearities.]

ReLU: $\rho(x) = \max(0, x)$

Path sums

[Diagram: Input → W1 → W2 → W3, with a unit i in an early layer and a unit j in a later layer.]

Consider the internal structure:
$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial x_i}, \qquad \frac{\partial x_j}{\partial x_i} = \sum_{p \in \{i \to j\}} \ \prod_{\alpha \in p} w_{\alpha}$$

Active path sums

[Same diagram; in a ReLU network only the active units pass gradients.]

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial x_i}, \qquad \frac{\partial x_j}{\partial x_i} = \sum_{p \in \{i \to j\}_{\text{active}}} \ \prod_{\alpha \in p} w_{\alpha}$$

Knowledge distribution

• Units in a NN <—> Players in a game

• All units have the same loss

• The loss is a convex function of a unit’s weights when the unit is active

• Units in the output layer compute error-gradients

• Other units distribute signals of the form (error-gradients) × (path-sums)

Circadian games

• Units have convex losses when active

• Neural networks are circadian games
  • i.e. games that are convex for active (awake) players

Main Theorem. Convergence of the neural network to a correlated equilibrium is upper-bounded by the convergence rates of the individual units.

Corollary. Units perform convex optimization when active. Convex methods and guarantees (on convergence rates) apply to neural networks.

—> Explains why ReLUs work so well.

Philosophical Corollary

• Neural networks are games; units are no-regret learners

• Unlike many games (e.g. in micro-economic theory), the equilibria in NNs are extremely desirable

• Error-backpropagation distributes gradient-information across units

• Claim: There is a theory of distributed reasoning implicit in gradient descent and the chain rule.

The End. Questions?
