Efficient Neural Architecture Search via Parameter Sharing
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, Jeff Dean
Presented by: Seung Wook Kim and Matthew MacKay


Page 1:

Efficient Neural Architecture Search via Parameter Sharing
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, Jeff Dean

Presented by: Seung Wook Kim and Matthew MacKay

Page 2:

Why should we care about model architecture?
● Impressive gains have followed from improvements to model architecture

Page 3:

Why should we care about model architecture?
● Impressive gains have followed from improvements to model architecture
● LSTMs: gated connections replace traditional RNN

Hochreiter & Schmidhuber, 1997

Page 4:

Why should we care about model architecture?
● Impressive gains have followed from improvements to model architecture
● LSTMs: gated connections replace traditional RNN
● Residual networks: ~25% relative improvement on ImageNet

Hochreiter & Schmidhuber, 1997; He et al., 2015

Page 5:

Architecture Search
● Desirable to efficiently search through architecture space

Page 6:

Architecture Search
● Desirable to efficiently search through architecture space
● Common method of searching: grad student descent

Page 7:

Neural Architecture Search (NAS)

Zoph & Le, 2016

Page 8:

Flaw of NAS
● Wasteful to optimize model parameters from scratch each time
● Applications of NAS "use 450 GPUs for 3-4 days"

Page 9:

Efficient NAS (ENAS)
● Main Ideas:
  ○ Share parameters among models in the search space
  ○ Alternate between optimizing model parameters on the training set and controller parameters on the validation set

Page 10:

Efficient NAS (ENAS)
● Main Ideas:
  ○ Share parameters among models in the search space
  ○ Alternate between optimizing model parameters on the training set and controller parameters on the validation set
● Set-up: Start with a fully connected DAG specifying a computational graph
● Controller decides active connections in DAG (and a few other things)
● Parameters of transformations between nodes are shared

Page 11:

Simple Example of ENAS
● Consider 1-D neural network regression

[Diagram: nodes x → y → z, with edge weights wxy, wyz and an optional skip connection wxz from x to z; node y applies φ]

y = φ(wxy x)
z = wyz y + wxz x

Page 12:

Simple Example of ENAS
● Consider 1-D neural network regression
● Goal: Find whether φ = tanh or ReLU and whether or not to connect x and z
  ○ Make decision based on validation performance

[Diagram: nodes x → y → z, with edge weights wxy, wyz and an optional skip connection wxz from x to z; node y applies φ]

y = φ(wxy x)
z = wyz y + wxz x

Page 13:

Simple Example of ENAS
● Define a distribution over model architectures m with "controller" parameters Ө
● Our example: Ө = {Ө1, Ө2, Ө3, Ө4}
  ○ Pr(φ = tanh & x, z connected) = Ө1
  ○ Pr(φ = tanh & x, z disconnected) = Ө2
  ○ Pr(φ = ReLU & x, z connected) = Ө3
  ○ Pr(φ = ReLU & x, z disconnected) = Ө4

Page 14:

Simple Example of ENAS
● Define a distribution over model architectures m with "controller" parameters Ө
● Our example: Ө = {Ө1, Ө2, Ө3, Ө4}
  ○ Pr(φ = tanh & x, z connected) = Ө1
  ○ Pr(φ = tanh & x, z disconnected) = Ө2
  ○ Pr(φ = ReLU & x, z connected) = Ө3
  ○ Pr(φ = ReLU & x, z disconnected) = Ө4
● Model parameters: w = {wxy, wyz, wxz}
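To make the set-up concrete, here is a minimal PyTorch sketch of this toy search space. The names (theta_logits, w_xy, forward, the 0-3 architecture indexing) are illustrative, not from the paper.

```python
import torch

# Controller parameters Ө: logits over the 4 candidate architectures
# (tanh & connected, tanh & disconnected, ReLU & connected, ReLU & disconnected).
theta_logits = torch.zeros(4, requires_grad=True)

# Shared model parameters w, reused by every sampled architecture.
w_xy = torch.randn(1, requires_grad=True)
w_yz = torch.randn(1, requires_grad=True)
w_xz = torch.randn(1, requires_grad=True)

def forward(x, arch):
    """Compute z for a sampled architecture index arch in {0, 1, 2, 3}."""
    phi = torch.tanh if arch in (0, 1) else torch.relu
    y = phi(w_xy * x)
    z = w_yz * y
    if arch in (0, 2):          # architectures where x and z are connected
        z = z + w_xz * x
    return z
```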

Page 15:

Simple Example of ENAS
● Training procedure:
  ○ Optimize w to minimize E_{m~p(m|θ)}[L(m; w)] on the training set
  ○ Optimize Ө to maximize E_{m~p(m|θ)}[R(m; w)] on the validation set
  ○ Repeat!

Page 16:

Simple Example of ENAS
● Optimize w to minimize E_{m~p(m|θ)}[L(m; w)] on the training set

∇_w E_{m~p(m|θ)}[L(m; w)] = E_{m~p(m|θ)}[∇_w L(m; w)]

Page 17:

Simple Example of ENAS
● Optimize w to minimize E_{m~p(m|θ)}[L(m; w)] on the training set

∇_w E_{m~p(m|θ)}[L(m; w)] = E_{m~p(m|θ)}[∇_w L(m; w)]

  ○ Sample a model m from p(m|θ)
  ○ Compute ∇_w L(m; w) on the training set using regular backprop and update w
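Continuing the toy sketch above, one such w-update could look like the following; train_batch() is a hypothetical data helper, and the loss L(m; w) is taken to be mean squared error for this regression example.

```python
import torch
import torch.nn.functional as F

w_opt = torch.optim.SGD([w_xy, w_yz, w_xz], lr=1e-2)

# Sample an architecture m ~ p(m|θ) from the controller distribution.
probs = torch.softmax(theta_logits, dim=0)
arch = torch.multinomial(probs, 1).item()

# Ordinary backprop through the sampled architecture on a training batch.
x, target = train_batch()                      # hypothetical data helper
loss = F.mse_loss(forward(x, arch), target)    # L(m; w)
w_opt.zero_grad()
loss.backward()                                # gives ∇_w L(m; w)
w_opt.step()
```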

Page 18:

Simple Example of ENAS
● Optimize Ө to maximize E_{m~p(m|θ)}[R(m; w)] on the validation set

∇_θ E_{m~p(m|θ)}[R(m; w)] = E_{m~p(m|θ)}[R(m; w) ∇_θ log p(m|θ)]

Page 19:

Simple Example of ENAS
● Optimize Ө to maximize E_{m~p(m|θ)}[R(m; w)] on the validation set

∇_θ E_{m~p(m|θ)}[R(m; w)] = E_{m~p(m|θ)}[R(m; w) ∇_θ log p(m|θ)]

  ○ Sample a model m from p(m|θ)
  ○ Compute validation accuracy R(m; w)
  ○ Use REINFORCE estimator R(m; w) ∇_θ log p(m|θ) to update θ
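Again continuing the toy sketch, a single REINFORCE update of the controller might look like this; valid_batch() is a hypothetical helper, and because the toy task is regression the reward is taken to be the negative validation loss rather than accuracy.

```python
import torch
import torch.nn.functional as F

theta_opt = torch.optim.SGD([theta_logits], lr=1e-2)

# Sample an architecture m ~ p(m|θ).
probs = torch.softmax(theta_logits, dim=0)
arch = torch.multinomial(probs, 1).item()

# Reward R(m; w): here the negative validation loss, computed without
# backpropagating into the shared weights w.
x, target = valid_batch()                      # hypothetical data helper
with torch.no_grad():
    reward = -F.mse_loss(forward(x, arch), target)

# REINFORCE estimator R(m; w) ∇_θ log p(m|θ): ascend E[R] by descending on -R·log p.
theta_loss = -reward * torch.log(probs[arch])
theta_opt.zero_grad()
theta_loss.backward()
theta_opt.step()
```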

Page 20:

Simple Example of ENAS
● Example: Sample that φ = tanh and x, z not connected
  ○ Update wxy and wyz by gradient descent on the training set

[Diagram: x → y → z with weights wxy, wyz; no skip connection; node y applies tanh]

y = tanh(wxy x)
z = wyz y

Page 21:

Simple Example of ENAS
● Example: Sample that φ = ReLU and x, z connected
  ○ Update Ө using gradient descent on the validation set

[Diagram: x → y → z with weights wxy, wyz and skip connection wxz; node y applies ReLU]

y = ReLU(wxy x)
z = wyz y + wxz x

Page 22:

Key assumption of ENAS
● Parameters that work well for one model architecture should work well for others

Page 23:

ENAS for Recurrent Cell
● Given {x(t), h(t-1)}, how should h(t) be computed?
● Start with N nodes in computation graph: h1, h2, …, hN
● Controller is an LSTM which operates for N steps

Page 24:

ENAS for Recurrent Cell
● At step i, controller decides:
  ○ Which previous node j ∊ {1, …, i-1} to connect to node i
  ○ An activation function φi ∊ {tanh, ReLU, Id, σ}
● Then:

  hi = φi(Wij hj)

● h(t) is set to average of nodes which are not used as input to another node
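A rough sketch of how a sampled cell could be assembled from the controller's decisions, with one shared weight matrix per possible edge. Dimensions, names (d, W, enas_cell), and the omission of biases and batching details are all simplifications, not the paper's exact parameterization.

```python
import torch

d, N = 32, 4                                 # hidden size, number of cell nodes
ACT = {"tanh": torch.tanh, "relu": torch.relu,
       "identity": lambda t: t, "sigmoid": torch.sigmoid}

# Shared weights, created once and reused by every sampled cell:
W_x = torch.randn(d, d)                      # x(t) -> node 1
W_h = torch.randn(d, d)                      # h(t-1) -> node 1
W = {(j, i): torch.randn(d, d)               # one matrix per possible edge j -> i
     for i in range(2, N + 1) for j in range(1, i)}

def enas_cell(x_t, h_prev, decisions):
    """decisions[i-1] = (previous node j, activation name); node 1 uses (None, act)."""
    h, used = {}, set()
    for i, (j, act) in enumerate(decisions, start=1):
        if i == 1:
            h[1] = ACT[act](x_t @ W_x + h_prev @ W_h)
        else:
            h[i] = ACT[act](h[j] @ W[(j, i)])
            used.add(j)
    # h(t): average of the "loose end" nodes that feed no other node.
    loose = [h[i] for i in range(1, N + 1) if i not in used]
    return sum(loose) / len(loose)

# The example built up on the following slides: tanh; (1, ReLU); (2, ReLU); (1, tanh).
decisions = [(None, "tanh"), (1, "relu"), (2, "relu"), (1, "tanh")]
h_t = enas_cell(torch.randn(1, d), torch.randn(1, d), decisions)  # = ½ (h3 + h4)
```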

Page 25:

ENAS for Recurrent Cell Example
● Controller selects:
● Function computed:

[Diagram: computation graph with nodes h1, h2, h3, h4, not yet connected]

Page 26:

ENAS for Recurrent Cell Example
● Controller selects:
  ○ Step 1: tanh
● Function computed:
  ○ h1 = tanh(Wx x(t) + Wh h(t-1))

[Diagram: node h1 with activation tanh; h2, h3, h4 not yet connected]

Page 27:

ENAS for Recurrent Cell Example
● Controller selects:
  ○ Step 1: tanh
  ○ Step 2: 1, ReLU
● Function computed:
  ○ h1 = tanh(Wx x(t) + Wh h(t-1))
  ○ h2 = ReLU(W12 h1)

[Diagram: h1 (tanh) → h2 (ReLU); h3, h4 not yet connected]

Page 28:

ENAS for Recurrent Cell Example
● Controller selects:
  ○ Step 1: tanh
  ○ Step 2: 1, ReLU
  ○ Step 3: 2, ReLU
● Function computed:
  ○ h1 = tanh(Wx x(t) + Wh h(t-1))
  ○ h2 = ReLU(W12 h1)
  ○ h3 = ReLU(W23 h2)

[Diagram: h1 (tanh) → h2 (ReLU) → h3 (ReLU); h4 not yet connected]

Page 29:

ENAS for Recurrent Cell Example
● Controller selects:
  ○ Step 1: tanh
  ○ Step 2: 1, ReLU
  ○ Step 3: 2, ReLU
  ○ Step 4: 1, tanh
● Function computed:
  ○ h1 = tanh(Wx x(t) + Wh h(t-1))
  ○ h2 = ReLU(W12 h1)
  ○ h3 = ReLU(W23 h2)
  ○ h4 = tanh(W14 h1)

[Diagram: h1 (tanh) → h2 (ReLU) → h3 (ReLU); h1 → h4 (tanh)]

Page 30:

ENAS for Recurrent Cell Example
● Controller selects:
  ○ Step 1: tanh
  ○ Step 2: 1, ReLU
  ○ Step 3: 2, ReLU
  ○ Step 4: 1, tanh
● Function computed:
  ○ h1 = tanh(Wx x(t) + Wh h(t-1))
  ○ h2 = ReLU(W12 h1)
  ○ h3 = ReLU(W23 h2)
  ○ h4 = tanh(W14 h1)
  ○ h(t) = ½ (h3 + h4)   (h3 and h4 are the nodes not used as input to any other node)

[Diagram: h1 (tanh) → h2 (ReLU) → h3 (ReLU); h1 → h4 (tanh); h3 and h4 averaged into h(t)]

Page 31:

ENAS for Convolutional Networks
● At step i, controller decides:
  ○ Which previous node j ∊ {1, …, i-1} to connect to node i
  ○ Which computation operation to use:
    gi ∊ {conv3x3, conv5x5, sepconv3x3, sepconv5x5, maxpool3x3, avgpool3x3}
● Then:

  hi = concat(gi(hi-1), hj) for all selected j
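A simplified sketch of one such node, assuming gi is applied to the previous node's output and the selected skip connections are concatenated channel-wise. The separable convolutions and the channel bookkeeping a real implementation needs (the concatenated output has more channels than its inputs) are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

C = 16                                          # channels (simplified: kept constant)
OPS = {
    "conv3x3":    nn.Conv2d(C, C, kernel_size=3, padding=1),
    "conv5x5":    nn.Conv2d(C, C, kernel_size=5, padding=2),
    "maxpool3x3": nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
    "avgpool3x3": nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
}

def macro_node(prev_outputs, op_name, skip_indices):
    """One macro-search node: apply g_i to the previous node's output and
    concatenate the selected skip connections along the channel dimension."""
    out = OPS[op_name](prev_outputs[-1])
    skips = [prev_outputs[j] for j in skip_indices]
    return torch.cat([out] + skips, dim=1)

# Example: a node applies conv5x5 to the previous output and takes skips from nodes 1 and 2.
h1 = torch.randn(1, C, 32, 32)
h2 = torch.randn(1, C, 32, 32)
h3 = macro_node([h1, h2], "conv5x5", skip_indices=[0, 1])   # 3·C channels after concat
```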

Page 32:

ENAS for Convolutional Networks

[Diagram: input followed by three empty nodes, labelled 1, 2, 3]

Page 33:

ENAS for Convolutional Networks

[Diagram: controller outputs "Conv3x3" for Node 1; node 1 applies a 3x3 convolution to the input]

Page 34:

ENAS for Convolutional Networks

[Diagram: Node 1 applies Conv3x3 to the input; Node 2 selects previous node 1 and MaxPool3x3, and concatenates (CAT) the result with node 1's output]

Page 35:

ENAS for Convolutional Networks

[Diagram: Node 1 applies Conv3x3 to the input; Node 2 selects node 1 and MaxPool3x3, concatenating (CAT) with node 1's output; Node 3 selects nodes 1, 2 and Conv5x5, concatenating with their outputs]

Page 36:

ENAS for Convolutional Networks
● Search space is huge: with 12 layers, 1.6 × 10^29 possible networks.

Page 37:

ENAS for Convolutional Cells
● Represents local computations inside a cell

Page 38:

ENAS for Convolutional Cells
● Represents local computations inside a cell
● At step i, controller decides:
  ○ Which two previous nodes j ∊ {1, …, i-1} to connect to node i
  ○ Which computation operations to use for the two selected nodes:
    gij ∊ {identity, sepconv3x3, sepconv5x5, maxpool3x3, avgpool3x3}
● Then:

  hi = Σ gij(hj) over the two selected j

● (+ reduction cell where strides are 2)
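A simplified sketch of a single cell node under these decisions; depthwise convolutions stand in for the separable convolutions, and names and shapes are illustrative rather than the paper's exact construction.

```python
import torch
import torch.nn as nn

C = 16
OPS = {
    "identity":   nn.Identity(),
    # Depthwise convs stand in for separable convs here; a true separable conv
    # would add a pointwise 1x1 convolution afterwards.
    "sepconv3x3": nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),
    "sepconv5x5": nn.Conv2d(C, C, kernel_size=5, padding=2, groups=C),
    "maxpool3x3": nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
    "avgpool3x3": nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
}

def cell_node(prev_nodes, j1, op1, j2, op2):
    """h_i = g_{i,j1}(h_{j1}) + g_{i,j2}(h_{j2}) for the two selected inputs."""
    return OPS[op1](prev_nodes[j1]) + OPS[op2](prev_nodes[j2])

# Example: a node that adds sepconv5x5 of node 1 to the identity of node 2.
h1 = torch.randn(1, C, 32, 32)
h2 = torch.randn(1, C, 32, 32)
h3 = cell_node([h1, h2], 0, "sepconv5x5", 1, "identity")
```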

Page 39:

ENAS for Convolutional Cells
● Search space: with 7 nodes, 1.3 × 10^11 configurations.

Page 40:

Deriving novel architectures from ENAS
● Sample several architectures from a trained ENAS model

Page 41:

Deriving novel architectures from ENAS
● Sample several architectures from a trained ENAS model

● Compute their reward on a single minibatch from the validation set

Page 42:

Deriving novel architectures from ENAS
● Sample several architectures from a trained ENAS model

● Compute their reward on a single minibatch from the validation set

● Take the one with the highest reward and re-train it from scratch.
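Put together, the derivation step might look like the sketch below; sample_architecture, reward_on_minibatch, and train_from_scratch are hypothetical helpers standing in for the machinery described above.

```python
# Sketch of the derivation step (all helper functions are hypothetical).
num_candidates = 10
candidates = [sample_architecture(theta_logits) for _ in range(num_candidates)]

# Score each candidate with the shared weights on a single validation minibatch.
rewards = [reward_on_minibatch(arch, shared_w) for arch in candidates]

# Keep the best-scoring architecture and re-train it from scratch.
best = max(zip(candidates, rewards), key=lambda pair: pair[1])[0]
final_model = train_from_scratch(best)
```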

Page 43:

Experiments - Penn Treebank


Image source: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Page 44:

Experiments - Penn Treebank

Page 45:

Experiments - Penn Treebank

● All non-linearities are either ReLU or tanh

● Similar to Mixture-of-Contexts architecture

● Local optimum
  ○ Changing one non-linearity leads to a significant drop in performance

Page 46:

Experiments - Penn Treebank

● All non-linearities are either ReLU or tanh

● Similar to Mixture-of-Contexts architecture

● Local optimum
  ○ Changing one non-linearity leads to a significant drop in performance
● 10 hours with a single GPU (1000x faster than NAS!)

Page 47:

Experiments - Penn Treebank

Page 48:

Experiments - CIFAR10

Image source: https://www.cs.toronto.edu/~kriz/cifar.html

Page 49:

Experiments - CIFAR10

Page 50:

Experiments - CIFAR10

Page 51:

Experiments - CIFAR10

Page 52:

Conclusion
● ENAS speeds up NAS by more than 1000x in terms of GPU hours.

● This is achieved by sharing parameters across child models during the search.

● Showed ENAS works well on both Penn Treebank and CIFAR10.