
Global Optimality in Neural Network Training

Benjamin D. Haeffele and René Vidal
Johns Hopkins University, Center for Imaging Science, Baltimore, USA

Questions in Deep Learning

• Architecture Design
• Optimization
• Generalization


Questions in Deep Learning

Are there principled ways to design networks?

• How many layers?

• Size of layers?

• Choice of layer types?

• How does architecture impact expressiveness? [1]

[1] Cohen, et al., “On the expressive power of deep learning: A tensor analysis.” COLT. (2016)


Questions in Deep Learning

How to train neural networks?

• Problem is non-convex.

• What does the loss surface look like? [1]

• Any guarantees for network training? [2]

• How to guarantee optimality?

• When will local descent succeed?

[1] Choromanska, et al., “The loss surfaces of multilayer networks.” Artificial Intelligence and Statistics. (2015)
[2] Janzamin, et al., “Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods.” arXiv. (2015)


Questions in Deep Learning

Performance Guarantees?

• How do networks generalize?

• How should networks be regularized?

• How to prevent overfitting?


Interrelated Problems

• Optimization can impact generalization. [1]

• Architecture has a strong effect on the generalization of networks. [2]

• Some architectures could be easier to optimize than others.

[1] Neyshabur, et al., “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning.” ICLR workshop. (2015)
[2] Zhang, et al., “Understanding deep learning requires rethinking generalization.” ICLR. (2017)


Today’s Talk: The Questions

• Are there properties of the network architecture that allow efficient optimization?
  • Positive homogeneity
  • Parallel subnetwork structure

• Are there properties of the regularization that allow efficient optimization?
  • Positive homogeneity
  • Adapting the network architecture to the data [1]

[1] Bengio, et al., “Convex neural networks.” NIPS. (2005)

Today’s Talk: The Results

• A local minimum such that one subnetwork is all zero is a global minimum.

• Once the size of the network becomes large enough, local descent can reach a global minimum from any initialization.

[Figure: a generic non-convex loss surface vs. the loss surface in today’s framework]

Outline

1. Network properties that allow efficient optimization
  • Positive homogeneity
  • Parallel subnetwork structure

2. Network size from regularization

3. Theoretical guarantees
  • Sufficient conditions for global optimality
  • Local descent can reach global minimizers

Key Property 1: Positive Homogeneity

• Start with a network that maps inputs to outputs through its weights.

• Scale all of the network weights by a non-negative constant.

• The network output then scales by that constant raised to some power; the exponent is the degree of positive homogeneity of the network mapping (see the reconstruction below).
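The slides state this property with a formula that did not survive extraction. A hedged reconstruction in the notation of the authors’ paper (the symbols Φ for the network mapping, X for the input, W^1, …, W^K for the weight layers, and p for the degree are assumptions here):

\[
\Phi(X, \alpha W^1, \ldots, \alpha W^K) = \alpha^p \, \Phi(X, W^1, \ldots, W^K) \qquad \text{for all } \alpha \ge 0.
\]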

Most Modern Networks Are Positively Homogeneous

• Example: Rectified Linear Units (ReLUs).

• Scaling the input to a ReLU by a non-negative constant doesn’t change the rectification (which units are active versus clamped to zero), so the constant passes straight through the nonlinearity.
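In symbols (a standard fact, added here for completeness): with ψ(z) = max(z, 0) applied elementwise,

\[
\psi(\alpha z) = \max(\alpha z, 0) = \alpha \max(z, 0) = \alpha \, \psi(z) \qquad \text{for all } \alpha \ge 0,
\]

so a ReLU is positively homogeneous of degree 1.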

Most Modern Networks Are Positively Homogeneous

• Simple network: Input → Conv + ReLU → Max Pool → Conv + ReLU → Linear → Out.

• Scaling each weight layer by α ≥ 0 multiplies the output by another factor of α: ReLU and max pooling pass the constant through, and each linear operation contributes one factor.

• Typically, each weight layer increases the degree of positive homogeneity by 1; with its three weight layers (two convolutions and the final linear layer), this network has degree 3 (a numerical check follows below).
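As a quick sanity check (not from the slides), the degree can be verified numerically. A minimal PyTorch sketch, using a small stand-in for the slide’s example network:

import torch
import torch.nn as nn

# Stand-in for the slide's example: Conv + ReLU -> Max Pool -> Conv + ReLU -> Linear.
# Three weight layers (biases disabled), so the expected degree is 3.
net = nn.Sequential(
    nn.Conv2d(1, 4, 3, bias=False), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(4, 8, 3, bias=False), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(2, bias=False),
)

x = torch.randn(1, 1, 16, 16)
y = net(x)  # forward pass also initializes the lazy linear layer

alpha = 2.0
with torch.no_grad():
    for p in net.parameters():
        p.mul_(alpha)  # scale every weight layer by alpha

# The output should scale by alpha**3, one factor per weight layer.
print(torch.allclose(net(x), alpha**3 * y, rtol=1e-4))  # expected: True

Note that biases are disabled: a bias term breaks the clean α^K scaling unless the bias itself is treated as a degree-matched weight.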

Most Modern Networks Are Positively Homogeneous

Some common positively homogeneous layers:

• Fully connected + ReLU
• Convolution + ReLU
• Max pooling
• Mean pooling
• Linear layers
• Max out

Many possibilities… but not sigmoids: the sigmoid saturates, so σ(αz) ≠ α^p σ(z) for any fixed degree p.


Key Property 2: Parallel Subnetworks

• Subnetworks with identical architecture connected in parallel.

• Simple example: a single-hidden-layer network, where each subnetwork is one ReLU hidden unit (see the reconstruction below).
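For this example, the parallel structure can be written as a sum over the r hidden units. A hedged reconstruction in the assumed notation (w_i^1 the input weights of hidden unit i, w_i^2 its output weights):

\[
\Phi(X, W^1, W^2) = \sum_{i=1}^{r} \psi(X w_i^1)\,(w_i^2)^\top .
\]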

• The subnetwork can also be deeper, e.g., multiple ReLU layers.

• Any positively homogeneous subnetwork can be used.

• Example: parallel AlexNets [1] — several copies of AlexNet connected in parallel between the input and the output.

[1] Krizhevsky, Sutskever, and Hinton. “ImageNet classification with deep convolutional neural networks.” NIPS. (2012)


Basic Regularization: Weight Decay

• Weight decay penalizes the sum of squared norms of the network weights, and is therefore positively homogeneous of degree 2, regardless of the network’s depth.

• Degrees of positive homogeneity don’t match = bad things happen (see the reconstruction below).

• Proposition: With weight decay, there will always exist non-optimal local minima.
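A hedged reconstruction of the missing formulas: for a K-layer network,

\[
\Theta(W) = \sum_{k=1}^{K} \|W^k\|_F^2 \;\Rightarrow\; \Theta(\alpha W) = \alpha^2 \, \Theta(W), \qquad \text{while} \qquad \Phi(X, \alpha W) = \alpha^K \, \Phi(X, W),
\]

so for K ≠ 2 the penalty and the network mapping scale at different rates.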

Adapting the Size of the Network via Regularization

• Start with a positively homogeneous network with parallel structure.

• Take the weights of one subnetwork.

• Define a regularization function on those weights that is:
  • Non-negative.
  • Positively homogeneous with the same degree as the network mapping.

• Example: the product of the norms of the subnetwork’s weight layers (see below).
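A hedged reconstruction of the example (notation assumed): for one subnetwork with weights w = (w^1, …, w^K),

\[
\theta(w^1, \ldots, w^K) = \prod_{k=1}^{K} \|w^k\|,
\]

which is non-negative and positively homogeneous of degree K, matching a K-layer subnetwork.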

• Sum the regularization function over all of the subnetworks.

• Allow the number of subnetworks to vary.

• Adding a subnetwork is penalized by an additional term in the sum, which acts to constrain the number of subnetworks (see below).
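Putting the pieces together, with r the number of subnetworks (a hedged reconstruction):

\[
\Theta(W) = \sum_{i=1}^{r} \theta(w_i^1, \ldots, w_i^K).
\]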


Our Problem

• The non-convex problem we’re interested in: jointly minimize, over the network size and the network weights, a loss on the network outputs plus the regularization above (see the reconstruction below).

• Loss function: assumed convex and once differentiable in the network outputs. Examples: cross-entropy and least squares, measured against the labels Y.
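The objective itself did not survive extraction; in the notation above it is plausibly (with ℓ the loss, Y the labels, and λ > 0 the regularization weight):

\[
\min_{r \in \mathbb{N}} \; \min_{W} \; \ell\big(Y, \Phi(X, W)\big) + \lambda \, \Theta(W).
\]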

Why Do All This?

• The formulation induces a convex function on the network outputs; the induced function comes from the regularization (a reconstruction follows below).

• The convex problem provides an achievable lower bound for the non-convex network training problem.

• The induced convex function serves as an analysis tool for studying the non-convex network training problem.
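A hedged reconstruction of the induced function (the slide formula is lost; symbols as above): for a candidate output Z,

\[
\Omega_\theta(Z) = \min_{r \in \mathbb{N}} \; \min_{W} \; \Theta(W) \quad \text{subject to} \quad \Phi(X, W) = Z,
\]

and the framework is built so that minimizing ℓ(Y, Z) + λ Ω_θ(Z) over Z gives a convex lower bound that is attained by some network.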

Sufficient Conditions for Global Optimality

• Theorem: A local minimum such that one subnetwork is all zero is a global minimum.

• Intuition: Such a local minimum satisfies the optimality conditions of the convex problem.

Global Minima from Local Descent

• Theorem: If the size of the network is large enough (it has enough subnetworks), then a global minimum can always be reached by local descent from any initialization.

[Figure: a generic non-convex loss surface vs. the loss surface in today’s framework]

• Meta-algorithm (a sketch follows below):
  • If not at a local minimum, perform local descent.
  • At a local minimum, test whether the first theorem is satisfied.
  • If not, add a subnetwork in parallel and continue.
  • The maximum number of subnetworks is guaranteed to be bounded by the dimensions of the network output.
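A minimal sketch of the meta-algorithm in Python, assuming user-supplied routines; the helper names (local_descent, satisfies_theorem_1, add_parallel_subnetwork) are hypothetical placeholders, not from the slides:

def train_to_global_min(net, local_descent, satisfies_theorem_1,
                        add_parallel_subnetwork, max_subnets):
    """Grow-and-descend meta-algorithm from the talk (sketch).

    local_descent(net)            -- run a descent method until a local minimum
    satisfies_theorem_1(net)      -- True if some subnetwork is all zero
    add_parallel_subnetwork(net)  -- append one more parallel subnetwork
    max_subnets                   -- bound from Theorem 2 (dims of the output)
    """
    while True:
        # 1. If not at a local minimum, perform local descent.
        net = local_descent(net)

        # 2. At a local minimum, test the sufficient condition for
        #    global optimality (Theorem 1).
        if satisfies_theorem_1(net):
            return net  # global minimum reached

        # 3. Otherwise, add a subnetwork in parallel and continue.
        if net.num_subnets >= max_subnets:
            raise RuntimeError("exceeded the theoretical bound on subnetworks")
        net = add_parallel_subnetwork(net)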

Conclusions

• Network size matters.
  • Optimize the network weights AND the network size.
  • Current: size = number of parallel subnetworks.
  • Future: size = number of layers, neurons per layer, etc.

• Regularization design matters.
  • Match the degrees of positive homogeneity between the network and the regularization.
  • Regularization can control the size of the network.

• Not done yet: several practical and theoretical limitations remain.

Thank You

Vision Lab @ Johns Hopkins University: http://www.vision.jhu.edu

Center for Imaging Science @ Johns Hopkins University: http://www.cis.jhu.edu

Work supported by NSF grants 1447822, 1618485 and 1618637