
Backpropagation and the Chain Rule

David S. Rosenberg

Bloomberg ML EDU

December 19, 2017


Learning with Back-Propagation

Back-propagation is an algorithm for computing the gradient. With lots of chain rule, you could also work out the gradient by hand. Back-propagation is a clean way to organize the computation of the gradient and an efficient way to compute the gradient.


Partial Derivatives

Consider a function g : R^p → R^n.

[Figure: a typical computation graph, shown both as a whole and broken out into components.]


Partial Derivatives

Consider a function g : R^p → R^n, and write b = g(a).

The partial derivative ∂b_i/∂a_j is the instantaneous rate of change of b_i as we change a_j. If we change a_j slightly to a_j + δ, then (for small δ) b_i changes to approximately b_i + (∂b_i/∂a_j) δ.
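
To see this numerically, here is a minimal sketch (the particular map g below is our own example, not from the slides): perturbing a_j by a small δ changes b_i by approximately (∂b_i/∂a_j) δ.

```python
import numpy as np

# Hypothetical example map g : R^2 -> R^2, chosen only to illustrate the
# first-order approximation b_i + (db_i/da_j) * delta.
def g(a):
    return np.array([a[0] ** 2 + a[1], np.sin(a[0]) * a[1]])

a = np.array([1.0, 2.0])
delta = 1e-6
j = 0                       # perturb input coordinate a_j

b = g(a)
a_pert = a.copy()
a_pert[j] += delta
b_pert = g(a_pert)

# Finite-difference estimates of db_i/da_j for all i.
partials_j = (b_pert - b) / delta
print(partials_j)               # approx [2*a_0, cos(a_0)*a_1]
print(b + partials_j * delta)   # approximately equals b_pert, as claimed above
```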


Partial Derivatives of an Affine Function

Define the affine function g(x) = Mx + c, for M ∈ R^{n×p} and c ∈ R^n.

If we let b = g(a), then what is b_i? b_i depends on the ith row of M:

b_i = ∑_{k=1}^{p} M_{ik} a_k + c_i,

and

∂b_i/∂a_j = M_{ij}.

So for an affine mapping, the entries of the matrix M directly tell us the rates of change.
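
As a quick check, the sketch below (dimensions and values are arbitrary, not from the slides) estimates every ∂b_i/∂a_j of an affine map by finite differences and compares against M:

```python
import numpy as np

# For the affine map g(a) = M a + c, finite differences recover db_i/da_j = M_ij.
rng = np.random.default_rng(0)
n, p = 3, 4
M = rng.normal(size=(n, p))
c = rng.normal(size=n)

def g(a):
    return M @ a + c

a = rng.normal(size=p)
delta = 1e-6
J = np.zeros((n, p))
for j in range(p):
    a_pert = a.copy()
    a_pert[j] += delta
    J[:, j] = (g(a_pert) - g(a)) / delta   # estimates db_i/da_j for column j

print(np.max(np.abs(J - M)))   # tiny: the partials of an affine map are the M_ij
```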


Chain Rule (in terms of partial derivatives)

g : R^p → R^n and f : R^n → R^m. Let b = g(a). Let c = f(b).

The chain rule says that

∂c_i/∂a_j = ∑_{k=1}^{n} (∂c_i/∂b_k) (∂b_k/∂a_j).

A change in a_j may change each of b_1, . . . , b_n. Changes in b_1, . . . , b_n may each affect c_i. The chain rule tells us that, to first order, the net change in c_i is the sum of the changes induced along each path from a_j to c_i.
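
The "sum over paths" reading can be checked numerically. In the sketch below (the functions f and g are our own example, not from the slides), c = f(g(a)) with g : R^2 → R^2 and f : R^2 → R, and the chain-rule sum over the two paths from the first input coordinate to c matches a finite difference:

```python
import numpy as np

def g(a):
    return np.array([a[0] * a[1], a[0] + a[1]])

def f(b):
    return b[0] ** 2 + 3.0 * b[1]

a = np.array([1.5, -0.7])
b = g(a)

dc_db = np.array([2.0 * b[0], 3.0])      # dc/db_k for k = 1, 2
db_da0 = np.array([a[1], 1.0])           # db_k/da for the first input coordinate
dc_da0 = np.sum(dc_db * db_da0)          # chain rule: sum over the two paths

# Finite-difference check.
delta = 1e-6
a_pert = a.copy()
a_pert[0] += delta
print(dc_da0, (f(g(a_pert)) - f(g(a))) / delta)   # the two values agree closely
```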


Example: Least Squares Regression


Review: Linear least squares

Hypothesis space: { f(x) = w^T x + b | w ∈ R^d, b ∈ R }.

Data set: (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × R. Define

ℓ_i(w, b) = [(w^T x_i + b) − y_i]^2.

In SGD, in each round we'd choose a random index i ∈ {1, . . . , n} and take a gradient step

w_j ← w_j − η ∂ℓ_i(w, b)/∂w_j, for j = 1, . . . , d
b ← b − η ∂ℓ_i(w, b)/∂b,

for some step size η > 0. Let's revisit how to calculate these partial derivatives...
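
A hedged sketch of this SGD loop (our own skeleton, not from the slides). The helper loss_grads is hypothetical here; it should return the per-example partials (∂ℓ_i/∂w, ∂ℓ_i/∂b), which are exactly what the following slides derive.

```python
import numpy as np

def sgd(X, y, loss_grads, eta=0.01, num_steps=1000, seed=0):
    # X: (n, d) array of inputs, y: (n,) array of targets.
    # loss_grads(w, b, x_i, y_i) is a hypothetical helper returning
    # (dl_i/dw, dl_i/db) for a single training example.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_steps):
        i = rng.integers(n)                        # random index i in {1, ..., n}
        grad_w, grad_b = loss_grads(w, b, X[i], y[i])
        w -= eta * grad_w                          # w_j <- w_j - eta * dl_i/dw_j
        b -= eta * grad_b                          # b   <- b   - eta * dl_i/db
    return w, b
```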


Computation Graph and Intermediate Variables

For a generic training point (x, y), denote the loss by

ℓ(w, b) = [(w^T x + b) − y]^2.

Let's break this down into some intermediate computations:

(prediction)  ŷ = ∑_{j=1}^{d} w_j x_j + b
(residual)    r = y − ŷ
(loss)        ℓ = r^2
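
A direct transcription of these intermediate computations into code (a minimal sketch; the function and variable names are ours):

```python
import numpy as np

def forward(w, b, x, y):
    # Forward pass for one training point (x, y).
    y_hat = np.dot(w, x) + b    # prediction
    r = y - y_hat               # residual
    loss = r ** 2               # loss
    return y_hat, r, loss
```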


Partial Derivatives on Computation Graph

We'll work our way from the graph output ℓ back to the parameters w and b:

∂ℓ/∂r = 2r
∂ℓ/∂ŷ = (∂ℓ/∂r)(∂r/∂ŷ) = (2r)(−1) = −2r
∂ℓ/∂b = (∂ℓ/∂ŷ)(∂ŷ/∂b) = (−2r)(1) = −2r
∂ℓ/∂w_j = (∂ℓ/∂ŷ)(∂ŷ/∂w_j) = (−2r)(x_j) = −2r x_j
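
These partials translate directly into a backward pass. The sketch below (example values are arbitrary) also supplies the loss_grads helper assumed in the earlier SGD sketch, and checks ∂ℓ/∂b against a finite difference:

```python
import numpy as np

def loss(w, b, x, y):
    r = y - (np.dot(w, x) + b)
    return r, r ** 2

def loss_grads(w, b, x, y):
    # Backward pass: the partials derived above.
    r, _ = loss(w, b, x, y)
    dl_dw = -2.0 * r * x       # dl/dw_j = -2 r x_j
    dl_db = -2.0 * r           # dl/db   = -2 r
    return dl_dw, dl_db

w, b = np.array([0.3, -1.2]), 0.5
x, y = np.array([2.0, 1.0]), 0.7
dl_dw, dl_db = loss_grads(w, b, x, y)

delta = 1e-6
fd = (loss(w, b + delta, x, y)[1] - loss(w, b, x, y)[1]) / delta
print(dl_db, fd)   # analytic and finite-difference values of dl/db agree closely
```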


Example: Ridge Regression


Ridge Regression: Computation Graph and Intermediate Variables

For a training point (x, y), the ℓ2-regularized objective function is

J(w, b) = [(w^T x + b) − y]^2 + λ w^T w.

Let's break this down into some intermediate computations:

(prediction)      ŷ = ∑_{j=1}^{d} w_j x_j + b
(residual)        r = y − ŷ
(loss)            ℓ = r^2
(regularization)  R = λ w^T w
(objective)       J = ℓ + R


Partial Derivatives on Computation Graph

We'll work our way from the graph output J back to the parameters w and b:

∂J/∂ℓ = ∂J/∂R = 1
∂J/∂ŷ = (∂J/∂ℓ)(∂ℓ/∂r)(∂r/∂ŷ) = (1)(2r)(−1) = −2r
∂J/∂b = (∂J/∂ŷ)(∂ŷ/∂b) = (−2r)(1) = −2r
∂J/∂w_j = ?


Handling Nodes with Multiple Children

Consider a ↦ J = h(f(a), g(a)).

It's helpful to think about having two independent copies of a, call them a^(1) and a^(2)...


Handling Nodes with Multiple Children

∂J/∂a = (∂J/∂a^(1))(∂a^(1)/∂a) + (∂J/∂a^(2))(∂a^(2)/∂a)
      = ∂J/∂a^(1) + ∂J/∂a^(2)

The derivative w.r.t. a is the sum of the derivatives w.r.t. each copy of a.
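
A small numerical sketch of this rule (the functions are our own example, not from the slides): J = h(f(a), g(a)) with f(a) = a^2, g(a) = sin(a), and h(u, v) = u·v. Summing the partials w.r.t. the two copies of a recovers the full derivative:

```python
import math

a = 1.3

# Treat the two uses of a as independent copies a^(1) (inside f) and a^(2) (inside g).
dJ_da1 = math.sin(a) * (2.0 * a)     # dJ/da^(1): hold the copy inside g fixed
dJ_da2 = (a ** 2) * math.cos(a)      # dJ/da^(2): hold the copy inside f fixed
dJ_da = dJ_da1 + dJ_da2              # sum over the copies

# Finite-difference check on J(a) = a^2 * sin(a).
J = lambda t: t ** 2 * math.sin(t)
delta = 1e-6
print(dJ_da, (J(a + delta) - J(a)) / delta)   # agree to about 1e-6
```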


Partial Derivatives on Computation Graph

We'll work our way from the graph output J back to the parameters w and b, where w^(1) denotes the copy of w feeding the regularization R and w^(2) the copy feeding the prediction ŷ:

∂J/∂ŷ = (∂J/∂ℓ)(∂ℓ/∂r)(∂r/∂ŷ) = (1)(2r)(−1) = −2r
∂J/∂w_j^(2) = (∂J/∂ŷ)(∂ŷ/∂w_j^(2)) = (∂J/∂ŷ) x_j = −2r x_j
∂J/∂w_j^(1) = (∂J/∂R)(∂R/∂w_j^(1)) = (1)(2λ w_j^(1)) = 2λ w_j
∂J/∂w_j = ∂J/∂w_j^(1) + ∂J/∂w_j^(2) = 2λ w_j − 2r x_j
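
Putting the ridge derivation into code (a minimal sketch; example values are arbitrary): the two copies of w are handled by computing a partial for each and summing.

```python
import numpy as np

def objective(w, b, x, y, lam):
    r = y - (np.dot(w, x) + b)
    return r, r ** 2 + lam * np.dot(w, w)      # (residual, J)

def objective_grads(w, b, x, y, lam):
    r, _ = objective(w, b, x, y, lam)
    dJ_dyhat = -2.0 * r
    dJ_db = dJ_dyhat                           # dJ/db = -2 r
    dJ_dw2 = dJ_dyhat * x                      # copy of w feeding the prediction
    dJ_dw1 = 2.0 * lam * w                     # copy of w feeding the regularizer
    return dJ_dw1 + dJ_dw2, dJ_db              # dJ/dw_j = 2*lam*w_j - 2 r x_j

w, b, lam = np.array([0.3, -1.2]), 0.5, 0.1
x, y = np.array([2.0, 1.0]), 0.7
dJ_dw, dJ_db = objective_grads(w, b, x, y, lam)

# Finite-difference check on the first coordinate of w.
delta = 1e-6
e0 = np.array([1.0, 0.0])
fd = (objective(w + delta * e0, b, x, y, lam)[1] - objective(w, b, x, y, lam)[1]) / delta
print(dJ_dw[0], fd)   # agree closely
```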


General Backpropagation


Backpropagation: Overview

Backpropagation is a specific way to evaluate the partial derivatives of a computation graph output J w.r.t. the inputs and outputs of all nodes.

Backpropagation works node by node. To run a "backward" step at a node f, we assume we've already run "backward" for all of f's children.

"Backward" at node f : a ↦ b returns
the partial of the objective value J w.r.t. f's output: ∂J/∂b
the partial of the objective value J w.r.t. f's input: ∂J/∂a


Backpropagation: Simple Case

Simple case: all nodes take a single scalar as input and have a single scalar output.

Backprop for node f:

Input: ∂J/∂b^(1), . . . , ∂J/∂b^(N) (partials w.r.t. the inputs to all N children)

Output:

∂J/∂b = ∑_{k=1}^{N} ∂J/∂b^(k)
∂J/∂a = (∂J/∂b)(∂b/∂a)
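
A sketch of this backward step for a single scalar node (our own illustration, not an implementation from the slides): the node sums the partials reported by its children and multiplies by its local derivative ∂b/∂a.

```python
class ScalarNode:
    # Node computing b = fn(a); local_derivative(a) returns db/da.
    def __init__(self, fn, local_derivative):
        self.fn = fn
        self.local_derivative = local_derivative

    def forward(self, a):
        self.a = a
        return self.fn(a)

    def backward(self, dJ_db_children):
        dJ_db = sum(dJ_db_children)                       # dJ/db = sum_k dJ/db^(k)
        dJ_da = dJ_db * self.local_derivative(self.a)     # dJ/da = (dJ/db)(db/da)
        return dJ_db, dJ_da

# Example: b = a^2 with two children reporting dJ/db^(1) = 0.5 and dJ/db^(2) = 1.5.
node = ScalarNode(fn=lambda a: a ** 2, local_derivative=lambda a: 2.0 * a)
node.forward(3.0)
print(node.backward([0.5, 1.5]))   # (2.0, 12.0)
```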


Backpropagation (General case)

More generally, consider f : R^d → R^n, mapping a ↦ b.

Input: ∂J/∂b_j^(i), for i = 1, . . . , N and j = 1, . . . , n

Output:

∂J/∂b_j = ∑_{k=1}^{N} ∂J/∂b_j^(k)
∂J/∂a_i = ∑_{j=1}^{n} (∂J/∂b_j)(∂b_j/∂a_i)
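
In vector form, the output step is a Jacobian-transpose product: ∂J/∂a = (∂b/∂a)^T ∂J/∂b. The sketch below (an affine node with arbitrary dimensions and values, our own example) makes that concrete.

```python
import numpy as np

# General-case backward step for an affine node b = M a + c, whose Jacobian
# entries db_j/da_i are simply the entries of M.
rng = np.random.default_rng(0)
n, d, N = 3, 4, 2
M = rng.normal(size=(n, d))

dJ_db_children = [rng.normal(size=n) for _ in range(N)]   # dJ/db^(i) from N children
dJ_db = np.sum(dJ_db_children, axis=0)                    # dJ/db_j = sum_k dJ/db_j^(k)
dJ_da = M.T @ dJ_db                                       # dJ/da_i = sum_j (dJ/db_j)(db_j/da_i)
print(dJ_da.shape)                                        # (4,): one partial per a_i
```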


Running Backpropagation

If we run "backward" on every node in our graph, we'll have the gradients of J w.r.t. all our parameters.

To run backward on a particular node, we assumed we had already run it on all of its children.

A topological sort of the nodes in a directed [acyclic] graph is an ordering in which every node appears before its children.

So we'll evaluate backward on the nodes in a reverse topological ordering.
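
A sketch of the ordering logic (the node attributes `children` and `backward` are hypothetical, standing in for the per-node backward step described on the previous slides): a post-order depth-first traversal yields a reverse topological order, so running backward front-to-back processes every node only after all of its children.

```python
def reverse_topological_order(nodes):
    # `nodes` is any iterable covering the whole graph; each node lists the
    # nodes that consume its output in `node.children`.
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for child in node.children:
            visit(child)
        order.append(node)        # appended only after all of its children

    for node in nodes:
        visit(node)
    return order

def run_backward(nodes):
    for node in reverse_topological_order(nodes):
        node.backward()           # hypothetical per-node backward step
```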

